All posts by Kim Loughead

The eight must-have elements for resilient big data apps

Supreet Oberoi, Concurrent
August 23, 2014
http://gigaom.com/2014/08/23/the-eight-must-have-elements-for-resilient-big-data-apps

If your company is building big data applications, here are eight things you need to consider.

As big data applications move from proof-of-concept to production, resiliency becomes an urgent concern. Applications that lack resiliency fail when data sets grow too large, offer little transparency into testing and operations, and leave security gaps. As a result, defects must be fixed after the applications are already in production, which wastes time and money.

The solution is to start by building resilient applications: robust, tested, changeable, auditable, secure, performant, and monitorable. This is a matter of philosophy and architecture as much as technology. Here are the key dimensions of resiliency that I recommend for anyone building big data apps.

1. Define a blueprint for resilient applications

The first step is to create a systemic enterprise architecture and methodology for how your company approaches big data applications. What data are you after? What kinds of analytics are most important? How will metrics, auditing, security and operational features be built in? Can you prove that all data was processed? These capabilities must be built into the architecture.

Other questions to consider: What technology will be crucial? What technology is being used as a matter of convenience? Your blueprint must include honest, accurate assessments of where your current architecture is failing. Keep in mind that a resilient framework for building big data applications may take time to assemble, but is definitely worth it.

2. Size shouldn’t matter

If an application fails when it attempts to tackle larger datasets, it is not resilient. Often, applications are tested with small-scale datasets and then fail or take far too long with larger ones. To be resilient, applications must handle datasets of any size (and the size of the dataset in the case of your application may mean depth, width, rate of ingestion, or all of the above). Applications must also adapt to new technologies. Otherwise, companies are constantly reconfiguring, rebuilding and recoding. Obviously, this wastes time, resources and money.

3. Transparency and high fidelity execution analysis

With complicated applications, chasing down scaling and other resiliency problems is far from automated. Ideally, it should be easy to see how long each step in a complicated pipeline takes so that problems can be caught and fixed right away. It is critical to pinpoint where the problem is: in the code, the data, the infrastructure, or the network. But this type of transparency shouldn’t have to be constructed for each application; it should be a part of a larger platform so that developers and operations staff can diagnose and respond to problems as they arise.

Once you find a problem, it is vital to relate the application behavior to the code — ideally through the same monitoring application that reported the error. Too often, getting access to code involves consulting multiple developers and following a winding chain of custody.

4. Abstraction, productivity and simplicity

Resilient applications tend to be future-proof because they employ abstractions that simplify development, improve productivity and allow substitution of implementation technology. As part of the architecture, technology should allow developers to build applications without miring them in the implementation details. This type of simplicity allows any data scientist to use the application and access any type of data source. Without such abstractions, productivity suffers, changes are more expensive and users drown in complexity.

5. Security, auditing and compliance

A resilient application comes with its own audit trail, one that shows who used the application, who is allowed to use it, what data was accessed and how policies were enforced. Building these capabilities into applications is the key to meeting the ever-increasing array of privacy, security, governance and control challenges and regulations that businesses face with big data.
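To make that concrete, here is a minimal sketch of the kind of append-only audit record such an application might emit for every data access. It is an illustration only; the field names and policy label are invented, not a prescribed schema.

```java
import java.time.Instant;

// Illustrative audit event; every field name here is an assumption, not a standard.
public final class AuditEvent {
    private final Instant timestamp = Instant.now();
    private final String user;          // who used the application
    private final String application;   // which application or job ran
    private final String dataset;       // what data was accessed
    private final String policyApplied; // how policy was enforced, e.g. "masked-PII"

    public AuditEvent(String user, String application, String dataset, String policyApplied) {
        this.user = user;
        this.application = application;
        this.dataset = dataset;
        this.policyApplied = policyApplied;
    }

    // One append-only line per event, so the trail itself can be collected and audited later.
    public String toLogLine() {
        return String.join("\t", timestamp.toString(), user, application, dataset, policyApplied);
    }
}
```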

6. Completeness and test-driven development

To be resilient, applications have to prove they have not lost data. Failing to do so can lead to dramatic consequences. For instance, as I witnessed during my time in financial services, fines and charges of money laundering or fraud can result if a company fails to account for every transaction because the application code missed one or two lines of data. Execution analysis, audit trails and test-driven development are foundational to proving that all the data was processed properly. Test-driven development means having the technology and the architecture to test application code in a sandbox and getting the same behavior when it is deployed in production. Test-driven development should provide the ability to step through the code, establish invariants, and utilize other defensive programming techniques.
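As one hedged illustration of that idea (the parser and its field layout are hypothetical), factoring each transformation into a pure function lets you test it in a sandbox and ship exactly the same logic to production, which is the behavior parity described above:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class TransactionParserTest {

    // Hypothetical pipeline step: parse one input line into an amount in cents.
    // Keeping it a pure function means the sandbox test exercises exactly the code
    // the production pipeline will run.
    static long parseAmountCents(String line) {
        String[] fields = line.split(",");
        return Math.round(Double.parseDouble(fields[2]) * 100);
    }

    @Test
    public void everyWellFormedLineYieldsAnAmount() {
        assertEquals(1050, parseAmountCents("2014-08-23,ACME,10.50"));
    }

    @Test(expected = ArrayIndexOutOfBoundsException.class)
    public void truncatedLinesFailLoudlyInsteadOfBeingSilentlyDropped() {
        parseAmountCents("2014-08-23,ACME"); // a missing field must never be skipped quietly
    }
}
```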

7. Application and data portability

Evolving business requirements frequently drive changes in technology. As a result, applications must run on and work with a variety of platforms and products. The goal is to make data, wherever it lives, accessible to the end user via SQL and standard APIs. For example, a state-of-the-art platform should allow data that is in Hadoop and processed through MapReduce to be moved to Spark or Tez and processed there with minimal or no impact to the code.
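Cascading is one example of this style of portability. In the sketch below (written against the Cascading 2.x API; the paths and names are illustrative), the pipe assembly describes what to do with the data, while the choice of execution fabric is made only when a FlowConnector is picked; later Cascading releases added connectors for other fabrics such as Tez that follow the same pattern.

```java
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class PortableFlow {
    public static void main(String[] args) {
        // The pipe assembly is the portable part: it says what to do with the data.
        Pipe events = new Pipe("events");

        Tap source = new Hfs(new TextLine(), "hdfs:///input/events");   // illustrative paths
        Tap sink   = new Hfs(new TextLine(), "hdfs:///output/events");

        FlowDef flowDef = FlowDef.flowDef()
            .setName("portable-events")
            .addSource(events, source)
            .addTailSink(events, sink);

        // Binding to a fabric happens only here; moving to another engine is largely a
        // matter of swapping the connector (and the platform-specific taps), not the assembly.
        new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
    }
}
```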

8. No black arts

Applications should be written in code that is not dependent on an individual virtuoso. Code should be shared, reviewed and commonly owned by multiple developers. Such a strategy allows for building teams that can take collective responsibility for applications.

If companies follow these eight rules, they will create resilient, scalable applications that allow them to tap into the full power of big data.

How Hadoop Can Extend the Life of Your Legacy Assets

Supreet Oberoi, VP Field Engineering, Concurrent, Inc.
August 18, 2014
http://insights.wired.com/profiles/blogs/how-hadoop-can-extend-the-life-of-your-legacy-assets

The fragmentation of data across repositories is accelerating. Much of the new data growth is occurring inside Hadoop, but it is clear that enterprise data warehouses won’t be shut down anytime soon. This situation leads to a familiar but urgent question: If we cannot have a unified physical repository, is it possible to have a single logical repository that allows application developers to think only about the data, not about the details of where it lives?

In short, yes, we can have that. To get the most out of data, the job of finding data in far-flung repositories, collecting it in one place and knitting it together should be part of the application development platform. In other words, to really make big data application development productive, we need a query that can assemble, correlate and distill data from multiple sources at execution time. Anyone who is serious about exploiting big data is going to want to improve developer productivity by providing this capability. Here is the logic driving this transformation.

The Permanently Fragmented World of Data

Very few companies have actually managed to bring all of their enterprise data into one unified master repository. And even when they have, mergers quickly re-fragment the data landscape. In addition, once a data warehouse has been created, companies are unlikely to shut it down or move it elsewhere. This makes the merging of data warehouses a rare occurrence, which means that a unified master repository is a pipe dream.

Hadoop plays a powerful new role by serving as a unified data repository (sometimes called a data lake) for the vast amounts of new types of data companies are continually procuring. Most of the time Hadoop supplements an existing data warehouse. Sometimes, workloads for ETL move from the data warehouse to Hadoop. In companies where data is power, moving data from a data warehouse to Hadoop essentially means shutting down or dramatically reducing the importance of the data warehouse. Whether or not this is a good idea, in most companies, it isn’t going to happen. In addition, few data warehouses or applications will want to suffer the additional load (and complexity in governance and compliance) of allowing constant replication of data to another repository.

In the future, the data for most big data applications will reside in many locations: in data warehouses, in Hadoop and in various application-specific locales. Looking at the ways in which big data leaders currently run their data platforms – companies including Netflix, Facebook and Etsy – you’ll see that this is exactly the structure they have.

Creating One Logical View

One way to solve this (which I do not recommend) is by making your application complex. This means bringing all of your data into an application through separate queries and combining and analyzing it inside the application itself. Before we had databases and SQL, data lived in files and it was the program’s job to do joins and “group bys” and such. Writing applications this way means lots of code that could be standardized is instead developed from scratch and must be debugged and maintained. Such an approach is not a recipe for productivity or resilience. SQL moved much of this work to the database, dramatically simplifying applications.
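To make the contrast concrete, here is a rough sketch of the kind of join logic that lived inside every application before SQL pushed that work into the database. The record layouts are invented for illustration; the point is that each application had to write, debug and maintain code like this itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A hand-rolled hash join over two in-memory datasets:
// customer rows are {id, name}; order rows are {customerId, amount}.
public class HandRolledJoin {

    static List<String> joinOrdersWithCustomers(List<String[]> customers, List<String[]> orders) {
        Map<String, String> nameById = new HashMap<>();
        for (String[] customer : customers)
            nameById.put(customer[0], customer[1]);        // build side

        List<String> joined = new ArrayList<>();
        for (String[] order : orders) {                     // probe side
            String name = nameById.get(order[0]);
            if (name != null)
                joined.add(name + "," + order[1]);
        }
        return joined;                                      // work a database would do for you
    }
}
```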

Abstraction is the best way to create a unified logical model: it lets you express what data you want in a form that can later be used to retrieve it from a number of different repositories. That’s what Concurrent Inc.’s Cascading and Lingual projects do, moving even more work from the application to a standardized, reusable layer. While Cascading allows big data applications to be expressed through an API, usually to access data in Hadoop, Lingual allows Cascading to use SQL to reach out to other repositories and bring selected data back to Hadoop for analysis. Creating a unified logical model of data across Hadoop and traditional data stores is enough to support a huge range of applications. As stated, developers don’t think about where data is, but about what data they want and how they want it connected and distilled.
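For a feel of what that abstraction looks like from the developer’s side, here is a minimal sketch of querying through Lingual’s JDBC driver. The driver class name, URL scheme and platform keyword follow the Lingual documentation of the time, and the schema and table names are invented, so treat the specifics as assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LingualQuery {
    public static void main(String[] args) throws Exception {
        // Driver class and URL scheme as documented for Lingual at the time (assumptions here);
        // the "hadoop" platform keyword plans the query onto Cascading flows over Hadoop.
        Class.forName("cascading.lingual.jdbc.Driver");
        try (Connection connection = DriverManager.getConnection("jdbc:lingual:hadoop");
             Statement statement = connection.createStatement();
             // Schema and table names are illustrative; they would be registered with the
             // Lingual catalog against files in HDFS or another repository.
             ResultSet results = statement.executeQuery("SELECT * FROM \"sales\".\"orders\"")) {
            while (results.next())
                System.out.println(results.getString(1));
        }
    }
}
```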

When the application runs, all the data in various repositories is gathered together in Hadoop where the work of the application is completed. With Cascading, data from Hadoop, SQL and any other repository is moved to a staging area in Hadoop, then distilled into the answer the developer wants. The data assembly and query processing takes place automatically as specified by the API. Much smaller subsets of data are transferred for most applications. Notably, Teradata has also found the need for a unified query across many types of repositories, which they call the QueryGrid. But in their model, the data the program needs comes back to the data warehouse to be consolidated and processed. Concurrent’s technology makes Hadoop the location where the data is consolidated and analyzed.

Avoiding Hadoop Shelfware

Creating a robust and scalable unified query as we have through the combination of Cascading and Lingual is key to exploiting the full value of all of your data, whether it is stored in Hadoop, a data warehouse or anywhere else. Almost every important application will be using data from Hadoop and from other sources. This approach will help you realize the full value both of Hadoop and of your existing infrastructure.

To make the most of the data and legacy assets you have and your investment in Hadoop, it is vital to allow programmers to easily deal with this heterogeneous reality. That’s why a unified query mechanism is a must-have for companies that are serious about big data. Creating a single logical view of data for analysts and developers lowers the cost of creating hundreds of applications that will be useful to your business.

Supreet Oberoi is vice president of field engineering at Concurrent.

Webinar | Real-time Analytics and Anomaly Detection using Elasticsearch and Apache Hadoop – Aug 20, 2014

Date: Wednesday, August 20, 2014
Time: 9am Pacific
Register at: http://www.elasticsearch.org/webinars/elasticsearch-and-apache-hadoop

Finding relevant information fast has always been a challenge, even more so in today’s growing “oceans” of data. Over the past few years, leading businesses have deployed Apache Hadoop extensively to store and process this ocean. Today’s challenge is to maximize analytical insights and return on investment from this existing Hadoop infrastructure.

Enter Elasticsearch for Apache Hadoop, affectionately known as es-hadoop. es-hadoop enables data-hungry businesses to enhance their Hadoop workflows with a full-blown search and analytics engine. Best of all, es-hadoop allows businesses to gain insights from their data in real-time.
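For readers who have not seen it, wiring es-hadoop into a classic MapReduce job is mostly configuration. The sketch below uses the mapred API; the property names follow the es-hadoop documentation, while the cluster address and target index/type are invented, so treat the specifics as assumptions.

```java
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.mapred.JobConf;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class IndexIntoElasticsearch {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.set("es.nodes", "es-host:9200");      // illustrative cluster address
        conf.set("es.resource", "logs/events");    // illustrative target index/type
        conf.setSpeculativeExecution(false);       // avoid duplicate writes to an external system

        conf.setOutputFormat(EsOutputFormat.class);
        conf.setMapOutputValueClass(MapWritable.class); // each map output becomes a document

        // The input format, paths and mapper for a real job would be configured here,
        // and the job submitted with JobClient.runJob(conf).
    }
}
```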

In this webinar, Costin Leau, lead developer for es-hadoop, will discuss:

  • An overview of how Elasticsearch plays in the overall Hadoop ecosystem
  • What is es-hadoop? A full feature overview and the benefits of using it
  • How es-hadoop augments existing Hadoop deployments, regardless of flavor of Hadoop distro
  • How Elasticsearch and es-hadoop help businesses extract insights and analytics from their Hadoop deployments, all in real-time

After an overview of es-hadoop’s functionality, Costin will treat us to a use case deep dive. He’ll demo using es-hadoop coupled with Elasticsearch as a platform to perform search and analytics, such as anomaly detection. During his demo, Costin will show you how to run the same analytics on your own Hadoop infrastructure. He’ll conclude with some tips for success for initial deployment of es-hadoop and Elasticsearch.

Concurrent and Elasticsearch Team Up to Accelerate Data Application Deployment

Daniel Gutierrez, Inside Big Data
August 4, 2014
http://inside-bigdata.com/2014/08/04/concurrent-elasticsearch-team-accelerate-data-application-deployment

Concurrent, Inc., a leader in data application infrastructure, and Elasticsearch, Inc., provider of an end-to-end real-time search and analytics stack, have announced a partnership to accelerate the time to market for Hadoop-based enterprise data applications.

Enterprises seeking to make the most out of their Big Data can now easily build applications with Cascading that read and write directly to Elasticsearch, a search and analytics engine that makes data available for search and analysis in a matter of seconds. Once data is in Elasticsearch, users can also take advantage of Kibana, Elasticsearch’s data visualization tool, to generate pie charts, bar graphs, scatter plots and histograms that allow them to easily explore and extract insights from their data.
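As a rough sketch of what “read and write directly to Elasticsearch” means for a Cascading developer: the es-hadoop extension exposes Elasticsearch as a Cascading Tap, so an index can stand in for a sink (or source) in an ordinary flow. The EsTap class name and constructor below follow the es-hadoop documentation of the time, and the fields, paths and index name are invented, so treat the specifics as assumptions.

```java
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import org.elasticsearch.hadoop.cascading.EsTap;

public class WriteToElasticsearch {
    public static void main(String[] args) {
        Fields fields = new Fields("name", "url", "picture");              // illustrative fields

        Tap in  = new Hfs(new TextDelimited(fields, ","), "hdfs:///input/artists.csv");
        Tap out = new EsTap("radio/artists", fields);                      // target index/type

        Properties props = new Properties();
        props.put("es.nodes", "es-host:9200");                             // illustrative address

        Pipe copy = new Pipe("copy-into-es");
        new HadoopFlowConnector(props)
            .connect(FlowDef.flowDef()
                .setName("index-artists")
                .addSource(copy, in)
                .addTailSink(copy, out))
            .complete();
    }
}
```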

“We’re on a mission to make massive amounts of data usable for businesses everywhere, so it’s no surprise that we’re teaming up with Concurrent, a company that’s leading the way in Big Data application infrastructure for the enterprise,” said Shay Banon, co-founder and CTO, Elasticsearch, Inc. “Now, when developers use Cascading to build Hadoop-based applications, they can easily utilize Elasticsearch to instantly query and analyze their data, allowing them to provide a fast and robust search experience, as well as gain valuable insights.”

As enterprises continue to heavily invest in the building of data-centric applications to connect business strategy to data, they need a reliable and repeatable way to consistently deliver data products to customers. With more than 175,000 downloads a month, Cascading is the enterprise development framework of choice and has become the de-facto standard for building data-centric applications. With a proven framework and support for various computation fabrics, Cascading enables enterprises to leverage existing skill-sets, investments and systems to deliver on the promise of Big Data.

“We continue to set the industry standard for building data applications,” said Chris Wensel, founder and CTO, Concurrent, Inc. “Bringing the power of Elasticsearch for fast, distributed, data search and analytics together with Cascading reflects our common goal – faster time to value in the deployment of data-centric applications. This is a feature our customers have been asking for, and we expect tremendous response from the community.”

Download the open source extension at http://www.elasticsearch.org/overview/hadoop/.

Concurrent, Inc. and Elasticsearch, Inc. Team Up to Accelerate Data Application Deployment

Enterprises Can Now Use Cascading and Elasticsearch to Accelerate Time to Market for Data Products that Require Reliable and Scalable Data Processing and Powerful Search and Analytics Capabilities

SAN FRANCISCO and LOS ALTOS, Calif. – July 31, 2014 – Concurrent, Inc., the leader in data application infrastructure, and Elasticsearch, Inc., provider of an end-to-end real-time search and analytics stack, today announced a partnership to accelerate the time to market for Hadoop-based enterprise data applications.

Enterprises seeking to make the most out of their Big Data can now easily build applications with Cascading that read and write directly to Elasticsearch, a search and analytics engine that makes data available for search and analysis in a matter of seconds. Once data is in Elasticsearch, users can also take advantage of Kibana, Elasticsearch’s data visualization tool, to generate pie charts, bar graphs, scatter plots and histograms that allow them to easily explore and extract insights from their data.

As enterprises continue to heavily invest in the building of data-centric applications to connect business strategy to data, they need a reliable and repeatable way to consistently deliver data products to customers. With more than 175,000 downloads a month, Cascading is the enterprise development framework of choice and has become the de-facto standard for building data-centric applications. With a proven framework and support for various computation fabrics, Cascading enables enterprises to leverage existing skillsets, investments and systems to deliver on the promise of Big Data.

Download the open source extension at http://www.elasticsearch.org/overview/hadoop/.

Supporting Quotes
“As a Big Data developer, author, and previous engineer and cloud architect, I’m committed to using, knowing and writing about the best technologies that benefit businesses and customers. As a user of both Elasticsearch and Cascading, I’m excited to see this partnership take place and see the great technologies that will integrate and emerge from this collaboration.”
– Antonios Chalkiopoulos, author of “Programming MapReduce with Scalding”

“We’re on a mission to make massive amounts of data usable for businesses everywhere, so it’s no surprise that we’re teaming up with Concurrent, a company that’s leading the way in Big Data application infrastructure for the enterprise. Now, when developers use Cascading to build Hadoop-based applications, they can easily utilize Elasticsearch to instantly query and analyze their data, allowing them to provide a fast and robust search experience, as well as gain valuable insights.”
– Shay Banon, co-founder and CTO, Elasticsearch, Inc.

“We continue to set the industry standard for building data applications. Bringing the power of Elasticsearch for fast, distributed, data search and analytics together with Cascading reflects our common goal – faster time to value in the deployment of data-centric applications. This is a feature our customers have been asking for, and we expect tremendous response from the community.”
– Chris Wensel, founder and CTO, Concurrent, Inc.

Supporting Resources

About Elasticsearch, Inc.
Elasticsearch is on a mission to make massive amounts of data usable for businesses everywhere by delivering the world’s most advanced search and analytics engine. With a laser focus on achieving the best user experience imaginable, the Elasticsearch ELK stack – comprised of Elasticsearch, Logstash and Kibana – has become one of the most popular and rapidly growing open source solutions in the market. Used by thousands of enterprises in virtually every industry today, Elasticsearch, Inc. provides production support, development support and training for the full ELK stack.

Elasticsearch, Inc. was founded in 2012 by the people behind the Elasticsearch and Apache Lucene open source projects. Since its initial release, Elasticsearch has seen more than 10 million cumulative downloads. Elasticsearch, Inc. is backed by Benchmark Capital, Index Ventures and NEA, with headquarters in Amsterdam and Los Altos, California, and offices around the world.

About Concurrent, Inc.
Concurrent, Inc. is the leader in data application infrastructure, delivering products that help enterprises create, deploy, run and manage data applications at scale. The company’s flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading, the most widely deployed technology for data applications with more than 175,000 user downloads a month. Used by thousands of businesses including eBay, Etsy, The Climate Corp and Twitter, Cascading is the de facto standard in open source application infrastructure technology. Concurrent is headquartered in San Francisco and online at http://concurrentinc.com.

 

###

 

Media Contacts
Danielle Salvato-Earl
Kulesa Faul for Concurrent, Inc.
(650) 922-7287
concurrent@kulesafaul.com

Manal Hammoudeh
Knock Twice for Elasticsearch, Inc.
(510) 612-1885
manal@knock2x.com

Emerging Vendors 2014: Big Data Vendors

Rick Whiting, CRN
July 30, 2014
http://www.crn.com/slide-shows/applications-os/300073545/emerging-vendors-2014-big-data-vendors-part-1.htm/pgno/0/13

Rising Big Data Stars

Technology for managing and analyzing big data is one of the fastest-growing segments of the IT industry right now. And it’s an area where innovative startups seem to have a competitive edge over the major, established IT vendors. What’s more, the vendors on this list know the value of a good partnership and either have a serious commitment to the channel or plan to leverage it as they “cross the chasm” and go mainstream with their bleeding-edge technologies.

Take a look at the hottest startups in the big data segment from the Emerging Vendors list for 2014. Altogether there are more than 80 startups in the big data/business intelligence space, so we’ve split them into two shows with vendors A-M included here and vendors N-Z in a slideshow to follow.

Concurrent offers application middleware technology that businesses use to develop, deploy, run and manage big data applications. The company’s products include the Cascading application development framework and Driven application performance management software.


Big Data 50 – the hottest Big Data startups of 2014

Jeff Vance, Startup50
July 28, 2014
http://startup50.com/Big-Data-Startups/Big-Data-50/7-27/

From “Fast Data” to visualization software to tools used to track “Social Whales,” these Big Data startups have it covered.

The 50 Big Data startups in the Big Data 50 are an impressive lot. In fact, the Big Data space in general is so hot that you might start worrying about it overheating – kind of like one of those mid-summer drives through the Mojave Desert. The signs warn you to turn off your AC for a reason.

Personally, I think we’re a long way away from any sort of Big Data bubble. Our economy is so used to trusting decision makers who “trust their gut” that we have much to learn before the typical business is even ready for data Kindergarten.

In fact, after a few decades following the “voodoo” of supply side economics, which fetishized the mysterious and elusive “rational consumer,” the strides we’re making towards being a more evidence-based economy still have us pretty much just playing catch up.

These 50 Big Data startups are working to change that.

Here are the Big Data startups in the Big Data 50 report.

Webinar | Accelerate Big Data Application Development with Cascading and Qubole – Jul 30, 2014

Big Data is moving to the next level of maturity and it’s all about the applications.

Date: Wednesday, July 30, 2014
Time: 10am Pacific / 1pm Eastern
Register at:
http://hub.am/Wia1v7

Join the minds behind Cascading, the most widely used and deployed development framework for building Big Data applications, as they discuss how Cascading and Qubole enable developers to accelerate the time to market for their data applications, from development to production.

In this webinar, we will introduce how to easily and reliably develop, test, and scale your data applications and then deploy them on Qubole’s auto-scaling Hadoop infrastructure.

Learn how to:

  • Easily and reliably develop, test, and scale your data applications
  • Seamlessly deploy your applications on auto-scaling Hadoop infrastructure
  • Tackle a wide variety of big data challenges

Use a Factory Approach to Streamline Application Development

Cindy Waxer, DataInformed
July 14, 2014
http://data-informed.com/use-a-factory-approach-streamline-application-development

“Big data factory” may not be as glamorous a description for an organization’s analytics talent as “data rock star” or “virtuoso,” but it’s a term an increasing number of organizations are embracing as the race to build big data applications heats up.

For years, lone data scientists have dominated analytics departments, building applications one at a time while handing over dozens of files with multiple scripts to various departments. But that’s all changing as companies like BloomReach discover a new approach to app development that more closely resembles an auto manufacturer’s plant floor than a Silicon Valley cubicle.

BloomReach creates big data market applications by leveraging analytics to personalize website content and enable site searches for big-name brands like Crate & Barrel and Neiman Marcus. Although the startup relies on the application development platform Cascading from Concurrent Inc. to build on Hadoop, Seth Rogers, a member of BloomReach’s technical staff, says that developing applications “is a time-consuming process. There’s a lot of trial and error and iterations that you have to go through.”

Rogers said that to minimize this heavy lifting and make its app development process more flexible, BloomReach has created its own big data factory. According to Rogers, application development traditionally has involved a hodgepodge of development tools, programming languages, and raw Hadoop in which “every product is in its own silo, which is very specific and very opaque. Nobody really understands what’s going on with programming languages or data storage.”

As a result, data scientists are left experimenting on their own, building one app at a time without a repeatable platform or reproducible results. Not only does this require starting from scratch with every new application, but if a data scientist suddenly jumps ship, a huge knowledge vacuum results, along with a corresponding hit to production.

To make application design simpler with repeatable development processes, BloomReach created a big data factory for the four terabytes of data it processes weekly and the 150 million consumer interactions it encounters daily.

Rogers said a successful big data factory comprises five key components:

  • A set of common building blocks and tools for app development. Rogers said the Cascading app development platform allows BloomReach to access “programming libraries we’ve already written so that we can incorporate them into new apps in a fairly straightforward way and without customization.” (A sketch of one such reusable building block follows this list.)
  • A standardized infrastructure, in which data sets are stored in standard formats so that “when we are building a product, we are not starting from scratch,” Rogers said. “We already have our basic infrastructure in place.”
  • A selection of monitoring tools that regularly test the performance of systems and measure computing power usage to ensure peak performance.
  • A standardized process for debugging along with databases for ad hoc queries. Rogers said it’s not uncommon for a customer’s Web site to experience an occasional glitch. The problem, however, is that it’s often difficult to sift through files for a nugget of bad data, especially if it hasn’t been indexed. Easy-to-query databases, however, “make it easy to find information without having to open up these gigantic, 10-terabyte files and run through them,” Rogers said. “We can go in there and easily do a search and see what exactly went wrong, whether it was user error or a legitimate bug.” In turn, apps can be modified quickly and effectively without having to go back to start.
  • Centralized storage for data. Storage typically comes in three forms: a shared file system that is excellent for transforming data; a key store that lets users quickly look up data using a specific key for fast analysis; and a relational database that cleans up data, converts it into a structured record, and loads it into a regular SQL database “so that users can conduct queries on any field” and play with the data, said Rogers.
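To illustrate the first component above, Cascading lets a shared building block be packaged once as a SubAssembly and reused across pipelines. This is only a sketch against the Cascading 2.x API; the field name and filter pattern are invented, not BloomReach’s actual code.

```java
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.pipe.SubAssembly;
import cascading.tuple.Fields;

// A reusable building block: drop malformed input lines before downstream processing.
// Packaged as a SubAssembly, it can be dropped into any new pipeline instead of being
// rewritten per application.
public class CleanLines extends SubAssembly {
    public CleanLines(Pipe incoming) {
        // keep only records that begin with an ISO date; the pattern is illustrative
        Pipe cleaned = new Each(incoming, new Fields("line"),
                                new RegexFilter("^\\d{4}-\\d{2}-\\d{2},.*"));
        setTails(cleaned);
    }
}
```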

Application development tools, debugging solutions, storage options – they have always been available to a big data team. But by combining data, coding, and monitoring processes into a standardized, centralized, and simplified strategy for app development, Rogers said, a big data factory can accelerate the app development process, resulting in significant savings in developer time, storage capacity, and computing power resources.

Questioning the Lambda Architecture

Jay Kreps, Radar
Jul 2, 2014
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

The Lambda Architecture has its merits, but alternatives are worth exploring.

Nathan Marz wrote a popular blog post describing an idea he called the Lambda Architecture (“How to beat the CAP theorem”). The Lambda Architecture is an approach to building stream processing applications on top of MapReduce and Storm or similar systems. This has proven to be a surprisingly popular idea, with a dedicated website and an upcoming book. Since I’ve been involved in building out the real-time data processing infrastructure at LinkedIn using Kafka and Samza, I often get asked about the Lambda Architecture. I thought I would describe my thoughts and experiences.

What is a Lambda Architecture and how do I become one?

The Lambda Architecture looks something like this:

[Figure: the Lambda Architecture]

The way this works is that an immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. You implement your transformation logic twice, once in the batch system and once in the stream processing system. You stitch together the results from both systems at query time to produce a complete answer.

There are a lot of variations on this, and I’m intentionally simplifying a bit. For example, you can swap in various similar systems for Kafka, Storm, and Hadoop, and people often use two different databases to store the output tables, one optimized for real time and the other optimized for batch updates.
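To make the query-time stitching concrete, here is a deliberately simplified sketch (a page-view counting use case invented for illustration): the batch layer periodically rebuilds a complete but stale view, the speed layer covers everything since the last batch run, and every read merges the two.

```java
import java.util.HashMap;
import java.util.Map;

// Query-time merge in a Lambda-style system. The views and the counting use case are
// illustrative; real systems typically keep these views in two different databases.
public class LambdaQuery {
    private final Map<String, Long> batchView = new HashMap<>();     // rebuilt by the batch job
    private final Map<String, Long> realtimeView = new HashMap<>();  // updated by the stream job

    public long pageViews(String page) {
        long batch = batchView.getOrDefault(page, 0L);     // complete but hours old
        long recent = realtimeView.getOrDefault(page, 0L); // fresh, covering the gap since the last batch
        return batch + recent;                             // the complete answer exists only at query time
    }
}
```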

The Lambda Architecture is aimed at applications built around complex asynchronous transformations that need to run with low latency (say, a few seconds to a few hours). A good example would be a news recommendation system that needs to crawl various news sources, process and normalize all the input, and then index, rank, and store it for serving.

I have been involved in building a number of real-time data systems and pipelines at LinkedIn. Some of these worked in this style, and upon reflection, it is not my favorite approach. I thought it would be worthwhile to describe what I see as the pros and cons of this architecture, and also give an alternative I prefer.

What’s good about this?

I like that the Lambda Architecture emphasizes retaining the input data unchanged. I think the discipline of modeling data transformation as a series of materialized stages from an original input has a lot of merit. This is one of the things that makes large MapReduce workflows tractable, as it enables you to debug each stage independently. I think this lesson translates well to the stream processing domain. I’ve written some of my thoughts about capturing and transforming immutable data streams here.

I also like that this architecture highlights the problem of reprocessing data. Reprocessing is one of the key challenges of stream processing but is very often ignored. By “reprocessing,” I mean processing input data over again to re-derive output. This is a completely obvious but often ignored requirement. Code will always change. So, if you have code that derives output data from an input stream, whenever the code changes, you will need to recompute your output to see the effect of the change.

Why does code change? It might change because your application evolves and you want to compute new output fields that you didn’t previously need. Or it might change because you found a bug and need to fix it. Regardless, when it does, you need to regenerate your output. I have found that many people who attempt to build real-time data processing systems don’t put much thought into this problem and end up with a system that simply cannot evolve quickly because it has no convenient way to handle reprocessing. The Lambda Architecture deserves a lot of credit for highlighting this problem.

There are a number of other motivations proposed for the Lambda Architecture, but I don’t think they make much sense. One is that real-time processing is inherently approximate, less powerful, and more lossy than batch processing. I actually do not think this is true. It is true that the existing set of stream processing frameworks are less mature than MapReduce, but there is no reason that a stream processing system can’t give as strong a semantic guarantee as a batch system.

Another explanation I have heard is that the Lambda Architecture somehow “beats the CAP theorem” by allowing a mixture of different data systems with different trade-offs. Long story short, although there are definitely latency/availability trade-offs in stream processing, this is an architecture for asynchronous processing, so the results being computed are not kept immediately consistent with the incoming data. The CAP theorem, sadly, remains intact.

And the bad…

The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be. I don’t think this problem is fixable.

Programming in distributed frameworks like Storm and Hadoop is complex. Inevitably, code ends up being specifically engineered toward the framework it runs on. The resulting operational complexity of systems implementing the Lambda Architecture is the one thing that seems to be universally agreed on by everyone doing it.

Why can’t the stream processing system be improved to handle the full problem set in its target domain?

One proposed approach to fixing this is to have a language or framework that abstracts over both the real-time and batch framework. You write your code using this higher level framework and then it “compiles down” to stream processing or MapReduce under the covers. Summingbird is a framework that does this. This definitely makes things a little better, but I don’t think it solves the problem.

Ultimately, even if you can avoid coding your application twice, the operational burden of running and debugging two systems is going to be very high. And any new abstraction can only provide the features supported by the intersection of the two systems. Worse, committing to this new uber-framework walls off the rich ecosystem of tools and languages that makes Hadoop so powerful (Hive, Pig, Crunch, Cascading, Oozie, etc).

By way of analogy, consider the notorious difficulties in making cross-database ORM really transparent. And consider that this is just a matter of abstracting over very similar systems providing virtually identical capabilities with a (nearly) standardized interface language. The problem of abstracting over totally divergent programming paradigms built on top of barely stable distributed systems is much harder.

We have done this experiment

We have actually been through a number of rounds of this at LinkedIn. We have built various hybrid-Hadoop architectures and even a domain-specific API that would allow code to be “transparently” run either in real time or in Hadoop. These approaches worked, but none were very pleasant or productive. Keeping code written in two different systems perfectly in sync was really, really hard. The API meant to hide the underlying frameworks proved to be the leakiest of abstractions. It ended up requiring deep Hadoop knowledge as well as deep knowledge of the real-time layer — and adding the new requirement that you understand enough about how the API would translate to these underlying systems whenever you were debugging problems or trying to reason about performance.

These days, my advice is to use a batch processing framework like MapReduce if you aren’t latency sensitive, and use a stream processing framework if you are, but not to try to do both at the same time unless you absolutely must.

So, why the excitement about the Lambda Architecture? I think the reason is because people increasingly need to build complex, low-latency processing systems. What they have at their disposal are two things that don’t quite solve their problem: a scalable high-latency batch system that can process historical data and a low-latency stream processing system that can’t reprocess results. By duct taping these two things together, they can actually build a working solution.

In this sense, even though it can be painful, I think the Lambda Architecture solves an important problem that was otherwise generally ignored. But I don’t think this is a new paradigm or the future of big data. It is just a temporary state driven by the current limitation of off-the-shelf tools. I also think there are better alternatives.

An alternative

As someone who designs infrastructure, I think the glaring question is this: why can’t the stream processing system just be improved to handle the full problem set in its target domain? Why do you need to glue on another system? Why can’t you do both real-time processing and also handle the reprocessing when code changes? Stream processing systems already have a notion of parallelism; why not just handle reprocessing by increasing the parallelism and replaying history very, very fast? The answer is that you can do this, and I think it is actually a reasonable alternative architecture if you are building this type of system today.

When I’ve discussed this with people, they sometimes tell me that stream processing feels inappropriate for high-throughput processing of historical data. But I think this is an intuition based mostly on the limitations of systems they have used, which either scale poorly or can’t save historical data. This leaves them with a sense that a stream processing system is inherently something that computes results off some ephemeral streams and then throws all the underlying data away. But there is no reason this should be true. The fundamental abstraction in stream processing is data flow DAGs, which are exactly the same underlying abstraction in a traditional data warehouse (a la Volcano) as well as being the fundamental abstraction in the MapReduce successor Tez. Stream processing is just a generalization of this data-flow model that exposes checkpointing of intermediate results and continual output to the end user.

So, how can we do the reprocessing directly from our stream processing job? My preferred approach is actually stupidly simple:

1. Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days.
2. When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table.
3. When the second job has caught up, switch the application to read from the new table.
4. Stop the old version of the job, and delete the old output table.
This architecture looks something like this:

[Figure: the Kappa-style reprocessing architecture]

Unlike the Lambda Architecture, in this approach you only do reprocessing when your processing code changes, and you actually need to recompute your results. And, of course, the job doing the re-computation is just an improved version of the same code, running on the same framework, taking the same input data. Naturally, you will want to bump up the parallelism on your reprocessing job so it completes very quickly.

Maybe we could call this the Kappa Architecture, though it may be too simple of an idea to merit a Greek letter.

Of course, you can optimize this further. In many cases, you could combine the two output tables. However, I think there are some benefits to having both for a short period of time. This allows you to revert back instantaneously to the old logic by just having a button that redirects the application to the old table. And in cases that are particularly important (your ad targeting criteria, say), you can control the cut-over with an automatic A/B test or bandit algorithm to ensure whatever bug fix or code improvement you are rolling out hasn’t accidentally degraded things in comparison to the prior version.

Note that this doesn’t mean your data can’t go to HDFS; it just means that you don’t run your reprocessing there. Kafka has good integration with Hadoop, so mirroring any Kafka topic into HDFS is easy. It is often useful for the output or even intermediate streams from a stream processing job to be available in Hadoop for analysis in tools like Hive or for use as input for other, offline data processing flows.

We have documented implementing this approach as well as other variations on reprocessing architectures using Samza.

Some background

For those less familiar with Kafka, what I just described may not make sense. A quick refresher will hopefully straighten things out. Kafka maintains ordered logs like this:

[Figure: an ordered, append-only Kafka log]

A Kafka “topic” is a collection of these logs:

[Figure: a Kafka topic as a collection of partitioned logs]

A stream processor consuming this data just maintains an “offset,” which is the log entry number for the last record it has processed on each of these partitions. So, changing the consumer’s position to go back and reprocess data is as simple as restarting the job with a different offset. Adding a second consumer for the same data is just another reader pointing to a different position in the log.
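For readers who want to see what that looks like in code, here is a minimal sketch using the current Kafka Java consumer (the topic, group id and broker address are invented). A reprocessing pass is just a new consumer group whose consumers seek back to the start of the retained log and run the new version of the logic into a new output table.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReprocessingJob {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // illustrative broker address
        props.put("group.id", "pageviews-v2");          // a new group id = an independent reader
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("pageviews"), new ConsumerRebalanceListener() {
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    consumer.seekToBeginning(partitions);   // replay the full retained history
                }
            });
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // apply the new version of the processing logic here and
                    // write the results to the new output table
                }
            }
        }
    }
}
```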

Kafka supports replication and fault-tolerance, runs on cheap, commodity hardware, and is glad to store many TBs of data per machine. So, retaining large amounts of data is a perfectly natural and economical thing to do and won’t hurt performance. LinkedIn keeps more than a petabyte of Kafka storage online, and a number of applications make good use of this long retention pattern for exactly this purpose.

Cheap consumers and the ability to retain large amounts of data make adding the second “reprocessing” job just a matter of firing up a second instance of your code but starting from a different position in the log.

This design is not an accident. We built Kafka with the intent of using it as a substrate for stream processing, and we had in mind exactly this model for handling reprocessing data. For the curious, you can find more information on Kafka here.

Fundamentally, though, there is nothing that ties this idea to Kafka. You could substitute any system that supports long retention of ordered data (for example HDFS, or some kind of database). Indeed, a lot of people are familiar with similar patterns that go by the name Event Sourcing or CQRS. And, of course, the distributed database people will tell you this is just a slight rebranding of materialized view maintenance, which, as they will gladly remind you, they figured out a long long time ago, sonny.

Comparison

I know this approach works well using Samza as the stream processing system because we do it at LinkedIn. But I am not aware of any reason it shouldn’t work equally well in Storm or other stream processing systems. I’m not familiar enough with Storm to work through the practicalities, so I’d be glad to hear if others are doing this already. In any case, I think the general ideas are fairly system independent.

The efficiency and resource trade-offs between the two approaches are somewhat of a wash. The Lambda Architecture requires running both reprocessing and live processing all the time, whereas what I have proposed only requires running the second copy of the job when you need reprocessing. However, my proposal requires temporarily having 2x the storage space in the output database and requires a database that supports high-volume writes for the re-load. In both cases, the extra load of the reprocessing would likely average out. If you had many such jobs, they wouldn’t all reprocess at once, so on a shared cluster with several dozen such jobs you might budget an extra few percent of capacity for the few jobs that would be actively reprocessing at any given time.

The real advantage isn’t about efficiency at all, but rather about allowing people to develop, test, debug, and operate their systems on top of a single processing framework. So, in cases where simplicity is important, consider this approach as an alternative to the Lambda Architecture.