Cascading 2.5 Supports Hadoop 2
Boris Lublinsky, InfoQ
November 19, 2013
Despite the wide and growing adoption of Hadoop, enterprises still struggle to find the right approach for fast, cost-effective development of Hadoop-based applications. One way to achieve this goal is to use domain-specific languages (DSLs), which often simplify Hadoop implementations significantly.
One of the most popular Java DSLs on top of the low-level MapReduce API is Cascading. It was introduced in late 2007 as a DSL for applying functional programming to large-scale data workflows. It uses a “plumbing” metaphor to define data processing as a workflow built out of familiar elements: Pipes, Taps, Tuple Rows, Filters, Joins, Traps, etc.
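To make the plumbing metaphor concrete, here is a minimal sketch of a word count written as a chain of stages between a source and a sink. It uses only the JDK, not the Cascading API: the class `PlumbingSketch` and its methods are hypothetical names chosen for illustration, with an in-memory list standing in for a source Tap and the returned map standing in for a sink Tap.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-JDK illustration of the plumbing metaphor.
// None of these names are Cascading classes; they are assumptions for this sketch.
final class PlumbingSketch {

    // "Function" stage: split each input line into lowercase word tokens.
    static Stream<String> tokenize(Stream<String> lines) {
        return lines.flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")));
    }

    // "Filter" stage: drop the empty tokens that split() emits for leading punctuation.
    static Stream<String> dropEmpty(Stream<String> tokens) {
        return tokens.filter(token -> !token.isEmpty());
    }

    // "Group and aggregate" stage: count occurrences of each token, sorted by token.
    static Map<String, Long> countByToken(Stream<String> tokens) {
        return tokens.collect(
                Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Assemble the stages into one pipe: source -> tokenize -> filter -> count -> sink.
    static Map<String, Long> wordCount(List<String> source) {
        return countByToken(dropEmpty(tokenize(source.stream())));
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList(
                "Hadoop makes pipes",
                "pipes and taps, taps and pipes");
        System.out.println(wordCount(source));
    }
}
```

In this sketch each stage is an ordinary method and the workflow is their composition. A workflow DSL such as Cascading additionally plans a declared assembly of this kind into MapReduce jobs that run on a cluster, which this in-memory example does not attempt.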
Concurrent, the company behind Cascading, released a new version of the product – Cascading 2.5 – this week, delivering support for Hadoop 2, including YARN. According to the company’s press release, the new release features:
- Support for Hadoop 2 and its new features, including YARN. Cascading users looking to upgrade to Hadoop 2 will now be able to seamlessly migrate their applications and take advantage of new advanced features like YARN.
- Performance improvements for complex join operations, and optimizations to dynamically partition and store processed data more efficiently on HDFS.
- Broad compatibility with other Hadoop vendors and Hadoop-as-a-service providers, including Cloudera, Hortonworks, MapR, Intel, Altiscale, Qubole and Amazon EMR, among others, giving Cascading users a richer set of deployment options and services, whether on-premise or in the cloud.
Simultaneously, Concurrent announced general availability of Cascading Lingual – an open source project that provides a comprehensive ANSI SQL interface for accessing Hadoop-based data. The project covers more than 7,000 SQL-99 statements derived from sophisticated industry-standard OLAP tools, and according to Concurrent:
“delivers the broadest SQL coverage for any tool in the Hadoop ecosystem. It’s innovative by making Hadoop simple and accessible, and by providing easy systems integration for multiple data stores into Hadoop by using just one SQL statement.”
InfoQ had a chance to discuss the latest Cascading release with Chris K Wensel, CTO and Founder of Concurrent, Inc.
InfoQ: When you are talking about supporting YARN in Cascading 2.5, what do you mean exactly – the fact that MapReduce code uses the YARN resource manager or are you actually leveraging YARN for creating a new application manager, one specific to Cascading?
Wensel: Cascading 2.5 implicitly supports YARN, meaning that because Cascading 2.5 supports Hadoop 2, YARN functionality will also be supported. Cascading does not actually leverage YARN for application development.
InfoQ: Do you have any plans for leveraging Apache Tez to further improve the performance of Cascading applications?
Wensel: Yes, we do. We have plans for Tez in our roadmap and will communicate those updates at the appropriate time.
InfoQ: Can you elaborate on optimizations and performance improvements for complex joins?
Wensel: We have updated the API to allow for more complex and custom join types. Cascalog, for example, would leverage this feature under some circumstances.
InfoQ: In your opinion, is the current emphasis on SQL-based processing limiting the spectrum of applications developed in Hadoop? In spite of all its power, SQL is good for solving only a certain class of problems, representing a limited subset of applications for which enterprises can leverage Hadoop.
Wensel: SQL allows the other 99% of developers, analysts, and legacy systems to leverage Hadoop. Yes, you may hit a wall with SQL, but 90% of most problems are reasonably expressed as SQL. Cascading allows you to choose your battles. A good question to ask is: who wants to write a bunch of Java to do something that is one line of SQL? And also, who wants to write hundreds of lines of SQL to do something best written and tested in Java? Cascading gives developers the flexibility.