Category Archives: News

Bossie Awards 2014: The best open source big data tools

Steven Nunez, InfoWorld
Sep 29, 2014
http://www.infoworld.com/article/2688074/big-data/big-data-164727-bossie-awards-2014-the-best-open-source-big-data-tools

InfoWorld’s top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem


Cascading
The learning curve for writing Hadoop applications can be steep. Cascading is an SDK that brings a functional programming paradigm to Hadoop data workflows. With the 3.0 release, Cascading provides an easier way for application developers to take advantage of next-generation Hadoop features like YARN and Tez.

The SDK provides a rich set of commonly used ETL patterns that abstract away much of the complexity of Hadoop, increasing the robustness of applications and making it simpler for Java developers to utilize their skills in a Hadoop environment. Connectors for common third-party applications are available, enabling Cascading applications to tap into databases, ERP, and other enterprise data sources.

— Steven Nunez
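To give a flavor of the API, here is a minimal word-count sketch in the spirit of Cascading's introductory tutorials. The input and output paths, field names, and the choice of the MapReduce planner (HadoopFlowConnector) are illustrative assumptions rather than anything prescribed above.

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main(String[] args) {
        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, WordCount.class);

        // source: plain text lines in HDFS; sink: tab-delimited (token, count) pairs
        Tap docTap = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap wcTap = new Hfs(new TextDelimited(true, "\t"), args[1], SinkMode.REPLACE);

        // split each line into tokens, group by token, count occurrences
        Fields token = new Fields("token");
        Pipe docPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(token, "\\s+"), Fields.RESULTS);
        Pipe wcPipe = new GroupBy(docPipe, token);
        wcPipe = new Every(wcPipe, token, new Count(), Fields.ALL);

        FlowDef flowDef = FlowDef.flowDef()
          .setName("wordcount")
          .addSource(docPipe, docTap)
          .addTailSink(wcPipe, wcTap);

        Flow flow = new HadoopFlowConnector(properties).connect(flowDef);
        flow.complete();
      }
    }

Because the pipe assembly is plain Java, it can be packaged, reused, and unit tested like any other library code, which is what lets the same ETL logic be planned onto different runtimes.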

Cascading on Apache Tez — Delivering on the promise of next generation compute

Gary Nakamura, Concurrent, Inc.
Sep 23, 2014
http://hortonworks.com/blog/cascading-on-apache-tez

Concurrent Inc. is a Hortonworks Technology Partner and recently announced that Cascading 3.0 now supports Apache Tez as an application runtime. Cascading is a powerful development framework for building enterprise data applications on Hadoop and is one of the most widely deployed technologies for data applications, with more than 175,000 user downloads a month. Used by thousands of businesses including eBay, Etsy, The Climate Corp and Twitter, Cascading is the de facto standard in data application development on Hadoop.

In this guest blog, Gary Nakamura, CEO at Concurrent, talks about Concurrent’s recent milestone and the road ahead.

The “developer release” of Apache Tez is here, and we are happy to re-affirm our support for the community and the project.

Concurrent, the team behind Cascading, would like to add our congratulations to the Apache Tez community on achieving this milestone. This is an important project for the broader ecosystem, and we expect to see Tez continue to move forward quickly.

What Cascading on Tez Means for ISVs
It’s early days for Cascading on Apache Tez. Simultaneously delivering on performance, scale and reliability is no small feat, but we see Hortonworks and the Apache Tez community delivering on the promise of a next generation compute engine.

Cascading and Tez together represent another important milestone, providing users and independent software vendors (ISVs) the flexibility to quickly build their data apps and then choose the appropriate compute engine for the business problem at hand (in-memory, batch mode, streaming or otherwise).

This week we announced that the latest Cascading 3.0 WIP adds Apache Tez as a supported runtime platform. This was a significant milestone for Cascading in its own right as we delivered a pluggable query planner to make this support possible. With this release, Cascading users can start testing their existing applications on the Apache Tez compute engine.
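In practice, “supported runtime platform” means the same pipe assembly can be planned onto Tez simply by swapping the flow connector. Below is a minimal sketch of that swap; the Hadoop2TezFlowConnector class name reflects the Cascading 3.0 WIP’s Tez module and should be read as an assumption rather than a quotation from this post.

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.flow.FlowDef;
    import cascading.flow.tez.Hadoop2TezFlowConnector;
    import cascading.property.AppProps;

    public class RunOnTez {
      // flowDef is the same assembly you would hand to HadoopFlowConnector;
      // only the planner and runtime change.
      public static void run(FlowDef flowDef, Class<?> appClass) {
        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, appClass);

        FlowConnector connector = new Hadoop2TezFlowConnector(properties); // assumed Tez connector class
        Flow flow = connector.connect(flowDef);
        flow.complete();
      }
    }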

Thousands of enterprises around the world will welcome a more efficient, high-performance compute engine – one that delivers the reliability and scale that they are accustomed to and one that will allow them to easily and seamlessly migrate their business-critical data applications. Tez has promised this and that commitment stands to benefit the entire Hadoop ecosystem.

What’s Next?
From here, we will work closely with the Tez community to run performance and scalability tests, and capture feedback from new and existing users. We will also work with the broader Cascading community to migrate Scalding, Cascalog, Lingual and Pattern to Apache Tez.

This is a big win for the community, our contributors, partners, ISVs and for enterprises driving their data strategy and next-generation data applications on Hadoop.

We share an unwavering commitment to developer productivity, ease of deployment, ease of manageability, and above all, innovation for the future of data app development.

At the end of the day, we are all in the data business.


Eight tips for resilient big data apps

Howard Solomon, IT World Canada
August 25, 2014
http://www.itworldcanada.com/post/eight-tips-for-resilent-big-data-apps

One of the problems with big data applications is they have to handle big data — we’re talking huge data sets.

As Supreet Oberoi, vice-president of Concurrent Inc., a maker of the Cascading application development framework, points out in a column for GigaOM, if they aren’t tough enough they may fail in production.

The solution is to build resilient, well-tested applications before they go out the door. “This is a matter of philosophy and architecture as much as technology,” he says, in putting forward eight tips for building big data apps that can hold up to demanding environments:

–Define a blueprint for resilient applications, with a systemic enterprise architecture and methodology for how your company approaches big data applications.

This means answering a number of questions, including where your current architecture is failing;

–Size shouldn’t matter. Apps are often tested with small-scale datasets, then fail or take too long with larger ones. They have to handle all sizes of data;

–Have a transparent process for finding problems, so developers and operations staff can diagnose and respond to problems when they happen;

–Abstraction and simplicity work. “Resilient applications tend to be future-proof because they employ abstractions that simplify development, improve productivity and allow substitution of implementation technology,” he writes. Developers should be able to build apps without being mired in the implementation details. Then data scientists should be able to use the app and access any type of data source;

–Build in security, auditing and compliance;

–Test-driven development should provide the ability to step through the code, establish invariants, and utilize other defensive programming techniques;

–Be portable. Applications should be designed to run on a variety of platforms and products;

–No black arts. Code should be shared, reviewed and commonly owned by multiple developers, not dependent on one person.

“If companies follow these eight rules, they will create resilient, scalable applications that allow them to tap into the full power of big data,” Oberoi writes.

How many of these rules does your developer team follow — or break?

The eight must-have elements for resilient big data apps

Supreet Oberoi, Concurrent
August 23, 2014
http://gigaom.com/2014/08/23/the-eight-must-have-elements-for-resilient-big-data-apps

If your company is building big data applications, here are eight things you need to consider.

As big data applications move from proof-of-concept to production, resiliency becomes an urgent concern. When applications lack resiliency, they may fail when data sets are too large, they lack transparency into testing and operations, and they are insecure. As a result, defects must be fixed after applications are in production, which wastes time and money.

The solution is to start by building resilient applications: robust, tested, changeable, auditable, secure, performant, and monitorable. This is a matter of philosophy and architecture as much as technology. Here are the key dimensions of resiliency that I recommend for anyone building big data apps.

1. Define a blueprint for resilient applications

The first step is to create a systemic enterprise architecture and methodology for how your company approaches big data applications. What data are you after? What kinds of analytics are most important? How will metrics, auditing, security and operational features be built in? Can you prove that all data was processed? These capabilities must be built into the architecture.

Other questions to consider: What technology will be crucial? What technology is being used as a matter of convenience? Your blueprint must include honest, accurate assessments of where your current architecture is failing. Keep in mind that a resilient framework for building big data applications may take time to assemble, but is definitely worth it.

2. Size shouldn’t matter

If an application fails when it attempts to tackle larger datasets, it is not resilient. Often, applications are tested with small-scale datasets and then fail or take far too long with larger ones. To be resilient, applications must handle datasets of any size (and the size of the dataset in the case of your application may mean depth, width, rate of ingestion, or all of the above). Applications must also adapt to new technologies. Otherwise, companies are constantly reconfiguring, rebuilding and recoding. Obviously, this wastes time, resources and money.

3. Transparency and high fidelity execution analysis

With complicated applications, chasing down scaling and other resiliency problems is far from automated. Ideally, it should be easy to see how long each step in a complicated pipeline takes so that problems can be caught and fixed right away. It is critical to pinpoint where the problem is: in the code, the data, the infrastructure, or the network. But this type of transparency shouldn’t have to be constructed for each application; it should be a part of a larger platform so that developers and operations staff can diagnose and respond to problems as they arise.

Once you find a problem, it is vital to relate the application behavior to the code — ideally through the same monitoring application that reported the error. Too often, getting access to code involves consulting multiple developers and following a winding chain of custody.

4. Abstraction, productivity and simplicity

Resilient applications tend to be future-proof because they employ abstractions that simplify development, improve productivity and allow substitution of implementation technology. As part of the architecture, technology should allow developers to build applications without miring them in the implementation details. This type of simplicity allows any data scientist to use the application and access any type of data source. Without such abstractions, productivity suffers, changes are more expensive and users drown in complexity.

5. Security, auditing and compliance

A resilient application comes with its own audit trail, one that shows who used the application, who is allowed to use it, what data was accessed and how policies were enforced. Building these capabilities into applications is the key to meeting the ever-increasing array of privacy, security, governance and control challenges and regulations that businesses face with big data.

6. Completeness and test-driven development

To be resilient, applications have to prove they have not lost data. Failing to do so can lead to dramatic consequences. For instance, as I witnessed during my time in financial services, fines and charges of money laundering or fraud can result if a company fails to account for every transaction because the application code missed one or two lines of data. Execution analysis, audit trails and test-driven development are foundational to proving that all the data was processed properly. Test-driven development means having the technology and the architecture to test application code in a sandbox and getting the same behavior when it is deployed in production. Test-driven development should provide the ability to step through the code, establish invariants, and utilize other defensive programming techniques.
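One hedged illustration of getting the same behavior in a sandbox and in production: a Cascading pipe assembly can be exercised against small local fixture files using the in-memory local planner, while production binds the identical assembly to cluster taps. The class names below come from Cascading’s local mode; the fixture paths and field names are invented for the example.

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextDelimited;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;
    import cascading.tuple.Fields;

    public class SandboxFlowTest {
      public static void main(String[] args) {
        // tiny, version-controlled fixture files stand in for production data
        Tap source = new FileTap(new TextDelimited(new Fields("id", "amount"), true, "\t"), "src/test/data/txns.tsv");
        Tap sink = new FileTap(new TextDelimited(true, "\t"), "build/test-output/txns.tsv", SinkMode.REPLACE);

        Pipe head = new Pipe("txns");
        Pipe tail = buildAssembly(head); // the same assembly production uses, just bound to local taps

        Flow flow = new LocalFlowConnector().connect(FlowDef.flowDef()
            .setName("txns-sandbox")
            .addSource(head, source)
            .addTailSink(tail, sink));
        flow.complete();

        // assertions over the sink file (row counts, invariants, no dropped records) would go here
      }

      static Pipe buildAssembly(Pipe input) {
        return input; // placeholder: the real transformation logic lives in shared application code
      }
    }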

7. Application and data portability

Evolving business requirements frequently drive changes in technology. As a result, applications must run on and work with a variety of platforms and products. The goal is to make data, wherever it lives, accessible to the end user via SQL and standard APIs. For example, a state of the art platform should allow data that is in Hadoop and processed through MapReduce to be moved to Spark or Tez and processed there with minimal or no impact to the code.

8. No black arts

Applications should be written in code that is not dependent on an individual virtuoso. Code should be shared, reviewed and commonly owned by multiple developers. Such a strategy allows for building teams that can take collective responsibility for applications.

If companies follow these eight rules, they will create resilient, scalable applications that allow them to tap into the full power of big data.

How Hadoop Can Extend the Life of Your Legacy Assets

Supreet Oberoi, VP Field Engineering, Concurrent, Inc.
August 18, 2014
http://insights.wired.com/profiles/blogs/how-hadoop-can-extend-the-life-of-your-legacy-assets

The fragmentation of data across repositories is accelerating. Much of the new data growth is occurring inside Hadoop, but it is clear that enterprise data warehouses won’t be shut down anytime soon. This situation leads to a familiar but urgent question: If we cannot have a unified physical repository, is it possible to have a single logical repository that allows application developers to think only about the data, not about the details of where it lives?

In short, yes, we can have that. To get the most out of data, the job of finding data in far-flung repositories, collecting it in one place and knitting it together should be part of the application development platform. In other words, to really make big data application development productive, we need a query that can assemble, correlate and distill data from multiple sources at execution time. Anyone who is serious about exploiting big data is going to want to improve developer productivity by providing this capability. Here is the logic driving this transformation.

The Permanently Fragmented World of Data

Very few companies have actually managed to bring all of their enterprise data into one unified master repository. And even when they have, mergers quickly re-fragment the data landscape. In addition, once a data warehouse has been created, companies are unlikely to shut it down or move it elsewhere. This makes the merging of data warehouses a rare occurrence, which means that a unified master repository is a pipe dream.

Hadoop plays a powerful new role by serving as a unified data repository (sometimes called a data lake) for the vast amounts of new types of data companies are continually procuring. Most of the time Hadoop supplements an existing data warehouse. Sometimes, workloads for ETL move from the data warehouse to Hadoop. In companies where data is power, moving data from a data warehouse to Hadoop essentially means shutting down or dramatically reducing the importance of the data warehouse. Whether or not this is a good idea, in most companies, it isn’t going to happen. In addition, few data warehouses or applications will want to suffer the additional load (and complexity in governance and compliance) of allowing constant replication of data to another repository.

In the future, the data for most big data applications will reside in many locations: in data warehouses, in Hadoop and in various application-specific locales. Looking at the ways in which big data leaders currently run their data platforms – companies including Netflix, Facebook and Etsy – you’ll see that this is exactly the structure they have.

Creating One Logical View

One way to solve this (which I do not recommend) is by making your application complex. This means bringing all of your data into an application through separate queries and combining and analyzing it inside the application itself. Before we had databases and SQL, data lived in files and it was the program’s job to do joins and “group bys” and such. Writing applications this way means lots of code that could be standardized is instead developed from scratch and must be debugged and maintained. Such an approach is not a recipe for productivity or resilience. SQL moved much of this work to the database, dramatically simplifying applications.

Using abstraction is the best way to create a unified logical model. It allows you to express the data you want in a form that can later be used to retrieve it from a number of different repositories. That’s what Concurrent Inc.’s Cascading and Lingual projects do, moving even more work from the application to a standardized, reusable layer. While Cascading allows big data applications to be expressed through an API, usually to access data in Hadoop, Lingual allows Cascading to use SQL to reach out to other repositories and bring selected data back to Hadoop for analysis. Creating a unified logical model of data across Hadoop and traditional data stores is enough to support a huge range of applications. As stated, developers don’t think about where data is, but about what data they want and how they want it connected and distilled.
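Lingual exposes that SQL layer through a standard JDBC interface, so a developer can ask for joined, distilled data without saying where it lives. The sketch below shows the general shape of such a query; the driver class name, connection URL, and schema and table names are assumptions drawn from Lingual’s documentation of the time, not from this article.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LingualQuerySketch {
      public static void main(String[] args) throws Exception {
        // assumed driver class and URL; the "hadoop" platform plans the query as Cascading flows on the cluster
        Class.forName("cascading.lingual.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection("jdbc:lingual:hadoop");
             Statement stmt = conn.createStatement()) {

          // hypothetical schemas: one registered over HDFS data, one over an external warehouse table
          ResultSet rs = stmt.executeQuery(
              "SELECT e.name, SUM(s.amount) "
            + "FROM employees.records e JOIN sales.orders s ON e.id = s.emp_id "
            + "GROUP BY e.name");

          while (rs.next())
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
      }
    }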

When the application runs, all the data in various repositories is gathered together in Hadoop where the work of the application is completed. With Cascading, data from Hadoop, SQL and any other repository is moved to a staging area in Hadoop, then distilled into the answer the developer wants. The data assembly and query processing takes place automatically as specified by the API. Much smaller subsets of data are transferred for most applications. Notably, Teradata has also found the need for a unified query across many types of repositories, which they call the QueryGrid. But in their model, the data the program needs comes back to the data warehouse to be consolidated and processed. Concurrent’s technology makes Hadoop the location where the data is consolidated and analyzed.

Avoiding Hadoop Shelfware

Creating a robust and scalable unified query as we have through the combination of Cascading and Lingual is key to exploiting the full value of all of your data, whether it is stored in Hadoop, a data warehouse or anywhere else. Almost every important application will be using data from Hadoop and from other sources. This approach will help you realize the full value both of Hadoop and of your existing infrastructure.

To make the most of the data and legacy assets you have and your investment in Hadoop, it is vital to allow programmers to easily deal with this heterogeneous reality. That’s why a unified query mechanism is a must-have for companies that are serious about big data. Creating a single logical view of data for analysts and developers lowers the cost of creating hundreds of applications that will be useful to your business.

Supreet Oberoi is vice president of field engineering at Concurrent.

Concurrent and Elasticsearch Team Up to Accelerate Data Application Deployment

Daniel Gutierrez, Inside Big Data
August 4, 2014
http://inside-bigdata.com/2014/08/04/concurrent-elasticsearch-team-accelerate-data-application-deployment

Concurrent, Inc., a leader in data application infrastructure, and Elasticsearch, Inc., provider of an end-to-end real-time search and analytics stack, have announced a partnership to accelerate the time to market for Hadoop-based enterprise data applications.

Enterprises seeking to make the most out of their Big Data can now easily build applications with Cascading that read and write directly to Elasticsearch, a search and analytics engine that makes data available for search and analysis in a matter of seconds. Once data is in Elasticsearch, users can also take advantage of Kibana, Elasticsearch’s data visualization tool, to generate pie charts, bar graphs, scatter plots and histograms that allow them to easily explore and extract insights from their data.
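Concretely, the integration surfaces as a Cascading Tap in the elasticsearch-hadoop library, so an Elasticsearch index can sit wherever a flow expects a source or sink. The sketch below writes HDFS records into an index; the EsTap constructor shown, along with the index, path, and field names, is an assumption for illustration rather than documented usage from this announcement.

    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    import org.elasticsearch.hadoop.cascading.EsTap;

    public class IndexEventsSketch {
      public static void main(String[] args) {
        // read tab-delimited events from HDFS (path and fields are illustrative)
        Tap source = new Hfs(new TextDelimited(new Fields("ts", "user", "action"), "\t"), args[0]);

        // write each tuple as a document into the "events/clicks" index/type (assumed EsTap constructor)
        Tap sink = new EsTap("events/clicks");

        new HadoopFlowConnector().connect(source, sink, new Pipe("index-events")).complete();
      }
    }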

“We’re on a mission to make massive amounts of data usable for businesses everywhere, so it’s no surprise that we’re teaming up with Concurrent, a company that’s leading the way in Big Data application infrastructure for the enterprise,” said Shay Banon, co-founder and CTO, Elasticsearch, Inc. “Now, when developers use Cascading to build Hadoop-based applications, they can easily utilize Elasticsearch to instantly query and analyze their data, allowing them to provide a fast and robust search experience, as well as gain valuable insights.”

As enterprises continue to heavily invest in the building of data-centric applications to connect business strategy to data, they need a reliable and repeatable way to consistently deliver data products to customers. With more than 175,000 downloads a month, Cascading is the enterprise development framework of choice and has become the de-facto standard for building data-centric applications. With a proven framework and support for various computation fabrics, Cascading enables enterprises to leverage existing skill-sets, investments and systems to deliver on the promise of Big Data.

“We continue to set the industry standard for building data applications,” said Chris Wensel, founder and CTO, Concurrent, Inc. “Bringing the power of Elasticsearch for fast, distributed, data search and analytics together with Cascading reflects our common goal – faster time to value in the deployment of data-centric applications. This is a feature our customers have been asking for, and we expect tremendous response from the community.”

Download the open source extension HERE.

Emerging Vendors 2014: Big Data Vendors

Rick Whiting, CRN
July 30, 2014
http://www.crn.com/slide-shows/applications-os/300073545/emerging-vendors-2014-big-data-vendors-part-1.htm/pgno/0/13

Rising Big Data Stars

Technology for managing and analyzing big data is one of the fastest-growing segments of the IT industry right now. And it’s an area where innovative startups seem to have a competitive edge over the major, established IT vendors. What’s more, the vendors on this list know the value of a good partnership and either have a serious commitment to the channel or plan to leverage it as they “cross the chasm” and go mainstream with their bleeding-edge technologies.

Take a look at the hottest startups in the big data segment from the Emerging Vendors list for 2014. Altogether there are more than 80 startups in the big data/business intelligence space, so we’ve split them into two shows with vendors A-M included here and vendors N-Z in a slideshow to follow.

Concurrent offers application middleware technology that businesses use to develop, deploy, run and manage big data applications. The company’s products include the Cascading application development framework and Driven application performance management software.


Big Data 50 – the hottest Big Data startups of 2014

Jeff Vance, Startup50
July 28, 2014
http://startup50.com/Big-Data-Startups/Big-Data-50/7-27/

From “Fast Data” to visualization software to tools used to track “Social Whales,” these Big Data startups have it covered.

The 50 Big Data startups in the Big Data 50 are an impressive lot. In fact, the Big Data space in general is so hot that you might start worrying about it overheating – kind of like one of those mid-summer drives through the Mojave Desert. The signs warn you to turn off your AC for a reason.

Personally, I think we’re a long way away from any sort of Big Data bubble. Our economy is so used to trusting decision makers who “trust their gut” that we have much to learn before the typical business is even ready for data Kindergarten.

In fact, after a few decades following the “voodoo” of supply side economics, which fetishized the mysterious and elusive “rational consumer,” the strides we’re making towards being a more evidence-based economy still have us pretty much just playing catch up.

These 50 Big Data startups are working to change that.

Here are the Big Data startups in the Big Data 50 report.

Use a Factory Approach to Streamline Application Development

Cindy Waxer, DataInformed
July 14, 2014
http://data-informed.com/use-a-factory-approach-streamline-application-development

“Big data factory” may not be as glamorous a description for an organization’s analytics talent as “data rock star” or “virtuoso,” but it’s a term an increasing number of organizations are embracing as the race to build big data applications heats up.

For years, lone data scientists have dominated analytics departments, building applications one at a time while handing over dozens of files with multiple scripts to various departments. But that’s all changing as companies like BloomReach discover a new approach to app development that more closely resembles an auto manufacturer’s plant floor than a Silicon Valley cubicle.

BloomReach creates big data market applications by leveraging analytics to personalize website content and enable site searches for big-name brands like Crate & Barrel and Neiman Marcus. Although the startup relies on application development platform Cascading from Concurrent Inc. to build on Hadoop, Seth Rogers, a member of BloomReach’s technical staff, says that developing applications “is a time-consuming process. There’s a lot of trial and error and iterations that you have to go through.”

Rogers said that to minimize this heavy lifting and make its app development process more flexible, BloomReach has created its own big data factory. According to Rogers, application development traditionally has involved a hodgepodge of development tools, programming languages, and raw Hadoop in which “every product is in its own silo, which is very specific and very opaque. Nobody really understands what’s going on with programming languages or data storage.”

As a result, data scientists are left experimenting on their own, building one app at a time without a repeatable platform or reproducible results. Not only does this require starting from scratch with every new application, but if a data scientist suddenly jumps ship, a huge knowledge vacuum results, along with a corresponding negative impact on production.

To make application design simpler with repeatable development processes, BloomReach created a big data factory for the four terabytes of data it processes weekly and the 150 million consumer interactions it encounters daily.

Rogers said a successful big data factory comprises five key components:

  • A set of common building blocks and tools for app development. Rogers said the Cascading app development platform allows BloomReach to access “programming libraries we’ve already written so that we can incorporate them into new apps in a fairly straightforward way and without customization,” an idea illustrated in the sketch after this list.
  • A standardized infrastructure, in which data sets are stored in standard formats so that “when we are building a product, we are not starting from scratch,” Rogers said. “We already have our basic infrastructure in place.”
  • A selection of monitoring tools that regularly tests the performance of systems and measures computing power usage to ensure peak performance.
  • A standardized process for debugging along with databases for ad hoc queries. Rogers said it’s not uncommon for a customer’s Web site to experience an occasional glitch. The problem, however, is that it’s often difficult to sift through files for a nugget of bad data, especially if it hasn’t been indexed. Easy-to-query databases, however, “make it easy to find information without having to open up these gigantic, 10-terabyte files and run through them,” Rogers said. “We can go in there and easily do a search and see what exactly went wrong, whether it was user error or a legitimate bug.” In turn, apps can be modified quickly and effectively without having to go back to start.
  • Centralized storage for data. Storage typically comes in three forms: a shared file system that is excellent for transforming data; a key store that lets users quickly look up data using a specific key for fast analysis; and a relational database that cleans up data, converts it into a structured record, and loads it into a regular SQL database “so that users can conduct queries on any field” and play with the data, said Rogers.
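As noted in the first item above, reusable building blocks map naturally onto Cascading’s SubAssembly, a pipe fragment packaged as a class that any new application can drop into its flow. The sketch below is a generic illustration; the field names and cleanup logic are invented for the example.

    import cascading.operation.regex.RegexFilter;
    import cascading.operation.text.DateParser;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.pipe.SubAssembly;
    import cascading.tuple.Fields;

    // Hypothetical shared building block: drop malformed rows and parse the timestamp field.
    public class CleanEventsAssembly extends SubAssembly {
      public CleanEventsAssembly(Pipe events) {
        setPrevious(events); // register the incoming pipe

        // keep only rows whose "ts" field looks like a date (illustrative pattern)
        Pipe cleaned = new Each(events, new Fields("ts"), new RegexFilter("^\\d{4}-\\d{2}-\\d{2}.*"));

        // parse "ts" into an epoch-millis field alongside the existing fields
        cleaned = new Each(cleaned, new Fields("ts"), new DateParser(new Fields("ts_millis"), "yyyy-MM-dd"), Fields.ALL);

        setTails(cleaned); // register the assembly tail
      }
    }

Any flow in the “factory” can then reuse the block with new CleanEventsAssembly(somePipe) instead of re-implementing the parsing and filtering from scratch.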

Application development tools, debugging solutions, storage options – they have always been available to a big data team. But by combining data, coding, and monitoring processes into a standardized, centralized, and simplified strategy for app development, Rogers said, a big data factory can accelerate the app development process, resulting in significant savings in developer time, storage capacity, and computing power resources.

Questioning the Lambda Architecture

Jay Kreps, Radar
Jul 2, 2014
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

The Lambda Architecture has its merits, but alternatives are worth exploring.

Nathan Marz wrote a popular blog post describing an idea he called the Lambda Architecture (“How to beat the CAP theorem”). The Lambda Architecture is an approach to building stream processing applications on top of MapReduce and Storm or similar systems. This has proven to be a surprisingly popular idea, with a dedicated website and an upcoming book. Since I’ve been involved in building out the real-time data processing infrastructure at LinkedIn using Kafka and Samza, I often get asked about the Lambda Architecture. I thought I would describe my thoughts and experiences.

What is a Lambda Architecture and how do I become one?

The Lambda Architecture looks something like this:

[Figure: the Lambda Architecture]

The way this works is that an immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. You implement your transformation logic twice, once in the batch system and once in the stream processing system. You stitch together the results from both systems at query time to produce a complete answer.

There are a lot of variations on this, and I’m intentionally simplifying a bit. For example, you can swap in various similar systems for Kafka, Storm, and Hadoop, and people often use two different databases to store the output tables, one optimized for real time and the other optimized for batch updates.

The Lambda Architecture is aimed at applications built around complex asynchronous transformations that need to run with low latency (say, a few seconds to a few hours). A good example would be a news recommendation system that needs to crawl various news sources, process and normalize all the input, and then index, rank, and store it for serving.

I have been involved in building a number of real-time data systems and pipelines at LinkedIn. Some of these worked in this style, and upon reflection, it is not my favorite approach. I thought it would be worthwhile to describe what I see as the pros and cons of this architecture, and also give an alternative I prefer.

What’s good about this?

I like that the Lambda Architecture emphasizes retaining the input data unchanged. I think the discipline of modeling data transformation as a series of materialized stages from an original input has a lot of merit. This is one of the things that makes large MapReduce workflows tractable, as it enables you to debug each stage independently. I think this lesson translates well to the stream processing domain. I’ve written some of my thoughts about capturing and transforming immutable data streams here.

I also like that this architecture highlights the problem of reprocessing data. Reprocessing is one of the key challenges of stream processing but is very often ignored. By “reprocessing,” I mean processing input data over again to re-derive output. This is a completely obvious but often ignored requirement. Code will always change. So, if you have code that derives output data from an input stream, whenever the code changes, you will need to recompute your output to see the effect of the change.

Why does code change? It might change because your application evolves and you want to compute new output fields that you didn’t previously need. Or it might change because you found a bug and need to fix it. Regardless, when it does, you need to regenerate your output. I have found that many people who attempt to build real-time data processing systems don’t put much thought into this problem and end up with a system that simply cannot evolve quickly because it has no convenient way to handle reprocessing. The Lambda Architecture deserves a lot of credit for highlighting this problem.

There are a number of other motivations proposed for the Lambda Architecture, but I don’t think they make much sense. One is that real-time processing is inherently approximate, less powerful, and more lossy than batch processing. I actually do not think this is true. It is true that the existing set of stream processing frameworks are less mature than MapReduce, but there is no reason that a stream processing system can’t give as strong a semantic guarantee as a batch system.

Another explanation I have heard is that the Lambda Architecture somehow “beats the CAP theorem” by allowing a mixture of different data systems with different trade-offs. Long story short, although there are definitely latency/availability trade-offs in stream processing, this is an architecture for asynchronous processing, so the results being computed are not kept immediately consistent with the incoming data. The CAP theorem, sadly, remains intact.

And the bad…

The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be. I don’t think this problem is fixable.

Programming in distributed frameworks like Storm and Hadoop is complex. Inevitably, code ends up being specifically engineered toward the framework it runs on. The resulting operational complexity of systems implementing the Lambda Architecture is the one thing that seems to be universally agreed on by everyone doing it.

One proposed approach to fixing this is to have a language or framework that abstracts over both the real-time and batch framework. You write your code using this higher level framework and then it “compiles down” to stream processing or MapReduce under the covers. Summingbird is a framework that does this. This definitely makes things a little better, but I don’t think it solves the problem.

Ultimately, even if you can avoid coding your application twice, the operational burden of running and debugging two systems is going to be very high. And any new abstraction can only provide the features supported by the intersection of the two systems. Worse, committing to this new uber-framework walls off the rich ecosystem of tools and languages that makes Hadoop so powerful (Hive, Pig, Crunch, Cascading, Oozie, etc).

By way of analogy, consider the notorious difficulties in making cross-database ORM really transparent. And consider that this is just a matter of abstracting over very similar systems providing virtually identical capabilities with a (nearly) standardized interface language. The problem of abstracting over totally divergent programming paradigms built on top of barely stable distributed systems is much harder.

We have done this experiment

We have actually been through a number of rounds of this at LinkedIn. We have built various hybrid-Hadoop architectures and even a domain-specific API that would allow code to be “transparently” run either in real time or in Hadoop. These approaches worked, but none were very pleasant or productive. Keeping code written in two different systems perfectly in sync was really, really hard. The API meant to hide the underlying frameworks proved to be the leakiest of abstractions. It ended up requiring deep Hadoop knowledge as well as deep knowledge of the real-time layer — and adding the new requirement that you understand enough about how the API would translate to these underlying systems whenever you were debugging problems or trying to reason about performance.

These days, my advice is to use a batch processing framework like MapReduce if you aren’t latency sensitive, and use a stream processing framework if you are, but not to try to do both at the same time unless you absolutely must.

So, why the excitement about the Lambda Architecture? I think the reason is because people increasingly need to build complex, low-latency processing systems. What they have at their disposal are two things that don’t quite solve their problem: a scalable high-latency batch system that can process historical data and a low-latency stream processing system that can’t reprocess results. By duct taping these two things together, they can actually build a working solution.

In this sense, even though it can be painful, I think the Lambda Architecture solves an important problem that was otherwise generally ignored. But I don’t think this is a new paradigm or the future of big data. It is just a temporary state driven by the current limitation of off-the-shelf tools. I also think there are better alternatives.

An alternative

As someone who designs infrastructure, I think the glaring question is this: why can’t the stream processing system just be improved to handle the full problem set in its target domain? Why do you need to glue on another system? Why can’t you do both real-time processing and also handle the reprocessing when code changes? Stream processing systems already have a notion of parallelism; why not just handle reprocessing by increasing the parallelism and replaying history very, very fast? The answer is that you can do this, and I think it is actually a reasonable alternative architecture if you are building this type of system today.

When I’ve discussed this with people, they sometimes tell me that stream processing feels inappropriate for high-throughput processing of historical data. But I think this is an intuition based mostly on the limitations of systems they have used, which either scale poorly or can’t save historical data. This leaves them with a sense that a stream processing system is inherently something that computes results off some ephemeral streams and then throws all the underlying data away. But there is no reason this should be true. The fundamental abstraction in stream processing is data flow DAGs, which are exactly the same underlying abstraction in a traditional data warehouse (a la Volcano) as well as being the fundamental abstraction in the MapReduce successor Tez. Stream processing is just a generalization of this data-flow model that exposes checkpointing of intermediate results and continual output to the end user.

So, how can we do the reprocessing directly from our stream processing job? My preferred approach is actually stupidly simple:

1. Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days.
2. When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table.
3. When the second job has caught up, switch the application to read from the new table.
4. Stop the old version of the job, and delete the old output table.

This architecture looks something like this:

[Figure: the alternative (Kappa) architecture]

Unlike the Lambda Architecture, in this approach you only do reprocessing when your processing code changes, and you actually need to recompute your results. And, of course, the job doing the re-computation is just an improved version of the same code, running on the same framework, taking the same input data. Naturally, you will want to bump up the parallelism on your reprocessing job so it completes very quickly.

Maybe we could call this the Kappa Architecture, though it may be too simple of an idea to merit a Greek letter.

Of course, you can optimize this further. In many cases, you could combine the two output tables. However, I think there are some benefits to having both for a short period of time. This allows you to revert back instantaneously to the old logic by just having a button that redirects the application to the old table. And in cases that are particularly important (your ad targeting criteria, say), you can control the cut-over with an automatic A/B test or bandit algorithm to ensure whatever bug fix or code improvement you are rolling out hasn’t accidentally degraded things in comparison to the prior version.

Note that this doesn’t mean your data can’t go to HDFS; it just means that you don’t run your reprocessing there. Kafka has good integration with Hadoop, so mirroring any Kafka topic into HDFS is easy. It is often useful for the output or even intermediate streams from a stream processing job to be available in Hadoop for analysis in tools like Hive or for use as input for other, offline data processing flows.

We have documented implementing this approach as well as other variations on reprocessing architectures using Samza.

Some background

For those less familiar with Kafka, what I just described may not make sense. A quick refresher will hopefully straighten things out. Kafka maintains ordered logs like this:

[Figure: a single ordered Kafka log]

A Kafka “topic” is a collection of these logs:

[Figure: a partitioned Kafka topic, a collection of logs]

A stream processor consuming this data just maintains an “offset,” which is the log entry number for the last record it has processed on each of these partitions. So, changing the consumer’s position to go back and reprocess data is as simple as restarting the job with a different offset. Adding a second consumer for the same data is just another reader pointing to a different position in the log.
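As a hedged illustration of restarting a job from a different position (written against the later Kafka Java consumer API, which postdates this article), rewinding is just an explicit seek before polling; the topic, partition, and group names here are made up.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ReprocessFromStart {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "reprocess-v2"); // a fresh group, so the live job's offsets are untouched
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          TopicPartition partition = new TopicPartition("events", 0);
          consumer.assign(Collections.singletonList(partition));
          consumer.seekToBeginning(Collections.singletonList(partition)); // rewind to the start of retained data

          while (true) { // loop forever, like the long-running reprocessing job described above
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
              // re-derive output with the new code and write it to the new output table
              System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
          }
        }
      }
    }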

Kafka supports replication and fault-tolerance, runs on cheap, commodity hardware, and is glad to store many TBs of data per machine. So, retaining large amounts of data is a perfectly natural and economical thing to do and won’t hurt performance. LinkedIn keeps more than a petabyte of Kafka storage online, and a number of applications make good use of this long retention pattern for exactly this purpose.

Cheap consumers and the ability to retain large amounts of data make adding the second “reprocessing” job just a matter of firing up a second instance of your code but starting from a different position in the log.

This design is not an accident. We built Kafka with the intent of using it as a substrate for stream processing, and we had in mind exactly this model for handling reprocessing data. For the curious, you can find more information on Kafka here.

Fundamentally, though, there is nothing that ties this idea to Kafka. You could substitute any system that supports long retention of ordered data (for example HDFS, or some kind of database). Indeed, a lot of people are familiar with similar patterns that go by the name Event Sourcing or CQRS. And, of course, the distributed database people will tell you this is just a slight rebranding of materialized view maintenance, which, as they will gladly remind you, they figured out a long long time ago, sonny.

Comparison

I know this approach works well using Samza as the stream processing system because we do it at LinkedIn. But I am not aware of any reason it shouldn’t work equally well in Storm or other stream processing systems. I’m not familiar enough with Storm to work through the practicalities, so I’d be glad to hear if others are doing this already. In any case, I think the general ideas are fairly system independent.

The efficiency and resource trade-offs between the two approaches are somewhat of a wash. The Lambda Architecture requires running both reprocessing and live processing all the time, whereas what I have proposed only requires running the second copy of the job when you need reprocessing. However, my proposal requires temporarily having 2x the storage space in the output database and requires a database that supports high-volume writes for the re-load. In both cases, the extra load of the reprocessing would likely average out. If you had many such jobs, they wouldn’t all reprocess at once, so on a shared cluster with several dozen such jobs you might budget an extra few percent of capacity for the few jobs that would be actively reprocessing at any given time.

The real advantage isn’t about efficiency at all, but rather about allowing people to develop, test, debug, and operate their systems on top of a single processing framework. So, in cases where simplicity is important, consider this approach as an alternative to the Lambda Architecture.