Supreet Oberoi, VP Field Engineering, Concurrent, Inc.
August 18, 2014
The fragmentation of data across repositories is accelerating. Much of the new data growth is occurring inside Hadoop, but it is clear that enterprise data warehouses won’t be shut down anytime soon. This situation leads to a familiar but urgent question: If we cannot have a unified physical repository, is it possible to have a single logical repository that allows application developers to think only about the data, not about the details of where it lives?
In short, yes, we can have that. To get the most out of data, the job of finding data in far-flung repositories, collecting it in one place and knitting it together should be part of the application development platform. In other words, to really make big data application development productive, we need a query mechanism that can assemble, correlate and distill data from multiple sources at execution time. Anyone who is serious about exploiting big data is going to want to improve developer productivity by providing this capability. Here is the logic driving this transformation.
The Permanently Fragmented World of Data
Very few companies have actually managed to bring all of their enterprise data into one unified master repository. And even when they have, mergers quickly re-fragment the data landscape. In addition, once a data warehouse has been created, companies are unlikely to shut it down or move it elsewhere. This makes the merging of data warehouses a rare occurrence, which means that a unified master repository is a pipe dream.
Hadoop plays a powerful new role by serving as a unified data repository (sometimes called a data lake) for the vast amounts and new types of data that companies are continually collecting. Most of the time, Hadoop supplements an existing data warehouse. Sometimes, ETL workloads move from the data warehouse to Hadoop. But in companies where data is power, moving the data itself from a data warehouse to Hadoop essentially means shutting down the data warehouse or dramatically reducing its importance. Whether or not this is a good idea, in most companies it isn’t going to happen. In addition, few data warehouses or applications will want to suffer the additional load (and the added complexity in governance and compliance) of constantly replicating data to another repository.
In the future, the data for most big data applications will reside in many locations: in data warehouses, in Hadoop and in various application-specific locales. Look at how big data leaders such as Netflix, Facebook and Etsy currently run their data platforms, and you’ll see that this is exactly the structure they have.
Creating One Logical View
One way to solve this (which I do not recommend) is by making your application complex. This means bringing all of your data into an application through separate queries and combining and analyzing it inside the application itself. We have been here before: before we had databases and SQL, data lived in files, and it was the program’s job to do joins, “group bys” and the like. SQL moved much of this work to the database, dramatically simplifying applications. Writing applications the old way means that lots of code that could be standardized is instead developed from scratch and must be debugged and maintained. Such an approach is not a recipe for productivity or resilience.
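To make the contrast concrete, here is a minimal sketch in Python. The customer and order records, the function names and the use of an in-memory SQLite database are all illustrative assumptions, not anything taken from a real system. The first function does the join and the "group by" by hand inside the application; the second states the same request declaratively and lets the database do the work.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer_id, region) and (customer_id, amount).
CUSTOMERS = [(1, "east"), (2, "west"), (3, "east")]
ORDERS = [(1, 20.0), (2, 5.0), (1, 7.5), (3, 10.0)]


def totals_by_region_handwritten(customers, orders):
    """Join and group inside the application, as programs did before SQL."""
    # Hand-built join index: customer id -> region.
    region_of = {cid: region for cid, region in customers}
    # Hand-built "group by": accumulate order amounts per region.
    totals = defaultdict(float)
    for cid, amount in orders:
        totals[region_of[cid]] += amount
    return dict(totals)


def totals_by_region_sql(customers, orders):
    """Express the same request in SQL and let the database do the work."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    rows = conn.execute(
        "SELECT c.region, SUM(o.amount) "
        "FROM orders o JOIN customers c ON c.id = o.customer_id "
        "GROUP BY c.region"
    ).fetchall()
    conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(totals_by_region_handwritten(CUSTOMERS, ORDERS))
    print(totals_by_region_sql(CUSTOMERS, ORDERS))
```

Both functions return the same totals, but only the first forces the application to own, debug and maintain the join logic.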
Using abstraction is the best way to create a unified logical model. It allows you to describe the data you want in a form that can later be used to retrieve that data from a number of different repositories. That’s what Concurrent Inc.’s Cascading and Lingual projects do, moving even more work from the application to a standardized, reusable layer. While Cascading allows big data applications to be expressed through an API, usually to access data in Hadoop, Lingual allows Cascading to use SQL to reach out to other repositories and bring selected data back to Hadoop for analysis. Creating a unified logical model of data across Hadoop and traditional data stores is enough to support a huge range of applications. Developers don’t think about where the data is, but about what data they want and how they want it connected and distilled.
When the application runs, the data it needs from the various repositories is gathered together in Hadoop, where the work of the application is completed. With Cascading, data from Hadoop, SQL and any other repository is moved to a staging area in Hadoop, then distilled into the answer the developer wants. The data assembly and query processing take place automatically as specified through the API. For most applications, only selected subsets of the data are transferred, not entire repositories. Notably, Teradata has also found the need for a unified query across many types of repositories, which it calls QueryGrid. But in Teradata’s model, the data the program needs comes back to the data warehouse to be consolidated and processed. Concurrent’s technology makes Hadoop the location where the data is consolidated and analyzed.
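The flow described above can be sketched in a few lines of Python. This is not the Cascading or Lingual API. It is a toy illustration in which every name is hypothetical: one in-memory SQLite database stands in for the data warehouse, a list of records stands in for data already in Hadoop, and a second in-memory SQLite database stands in for the Hadoop staging area. The point is the shape of the work: select only the needed rows at the source, stage them next to the other data, then distill the answer in one place.

```python
import sqlite3

# Hypothetical click records standing in for data that already lives in Hadoop.
CLICK_EVENTS = [
    {"account_id": 1, "page": "/pricing"},
    {"account_id": 2, "page": "/home"},
    {"account_id": 1, "page": "/docs"},
    {"account_id": 4, "page": "/pricing"},
    {"account_id": 3, "page": "/home"},
]


def build_warehouse():
    """Create a toy 'data warehouse' (an assumption for illustration only)."""
    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE accounts (id INTEGER, tier TEXT, country TEXT)")
    wh.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?)",
        [(1, "gold", "US"), (2, "free", "US"), (3, "gold", "DE"), (4, "gold", "US")],
    )
    return wh


def clicks_per_gold_us_account(warehouse, events):
    """Stage selected data from both sources, then distill it in one place."""
    staging = sqlite3.connect(":memory:")  # stand-in for the staging area
    staging.execute("CREATE TABLE accounts (id INTEGER)")
    staging.execute("CREATE TABLE clicks (account_id INTEGER, page TEXT)")

    # Step 1: ask the warehouse only for the rows and columns needed,
    # so a small subset crosses between repositories.
    selected = warehouse.execute(
        "SELECT id FROM accounts WHERE tier = 'gold' AND country = 'US'"
    ).fetchall()
    staging.executemany("INSERT INTO accounts VALUES (?)", selected)

    # Step 2: place the other source's data in the same staging area.
    staging.executemany(
        "INSERT INTO clicks VALUES (?, ?)",
        [(e["account_id"], e["page"]) for e in events],
    )

    # Step 3: correlate and distill where the data has been consolidated.
    rows = staging.execute(
        "SELECT a.id, COUNT(*) FROM accounts a "
        "JOIN clicks c ON c.account_id = a.id "
        "GROUP BY a.id ORDER BY a.id"
    ).fetchall()
    staging.close()
    return rows


if __name__ == "__main__":
    print(clicks_per_gold_us_account(build_warehouse(), CLICK_EVENTS))
```

In this sketch the developer still writes the staging steps by hand. The argument of this article is that a unified query layer should perform them automatically from a single description of the data wanted.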
Avoiding Hadoop Shelfware
Creating a robust and scalable unified query as we have through the combination of Cascading and Lingual is key to exploiting the full value of all of your data, whether it is stored in Hadoop, a data warehouse or anywhere else. Almost every important application will be using data from Hadoop and from other sources. This approach will help you realize the full value both of Hadoop and of your existing infrastructure.
To make the most of your data, your legacy assets and your investment in Hadoop, it is vital to allow programmers to deal easily with this heterogeneous reality. That’s why a unified query mechanism is a must-have for companies that are serious about big data. Creating a single logical view of data for analysts and developers lowers the cost of creating the hundreds of applications that will be useful to your business.
Supreet Oberoi is vice president of field engineering at Concurrent.