Supreet Oberoi, Concurrent
August 23, 2014
If your company is building big data applications, here are eight things you need to consider.
As big data applications move from proof-of-concept to production, resiliency becomes an urgent concern. When applications lack resiliency, they may fail when data sets are too large, they lack transparency into testing and operations, and they are insecure. As a result, defects must be fixed after applications are in production, which wastes time and money.
The solution is to start by building resilient applications: robust, tested, changeable, auditable, secure, performant, and monitorable. This is a matter of philosophy and architecture as much as technology. Here are the key dimensions of resiliency that I recommend for anyone building big data apps.
1. Define a blueprint for resilient applications
The first step is to create a systemic enterprise architecture and methodology for how your company approaches big data applications. What data are you after? What kinds of analytics are most important? How will metrics, auditing, security and operational features be built in? Can you prove that all data was processed? These capabilities must be built into the architecture.
Other questions to consider: What technology will be crucial? What technology is being used as a matter of convenience? Your blueprint must include honest, accurate assessments of where your current architecture is failing. Keep in mind that a resilient framework for building big data applications may take time to assemble, but is definitely worth it.
2. Size shouldn’t matter
If an application fails when it attempts to tackle larger datasets, it is not resilient. Often, applications are tested with small-scale datasets and then fail or take far too long with larger ones. To be resilient, applications must handle datasets of any size (and the size of the dataset in the case of your application may mean depth, width, rate of ingestion, or all of the above). Applications must also adapt to new technologies. Otherwise, companies are constantly reconfiguring, rebuilding and recoding. Obviously, this wastes time, resources and money.
3. Transparency and high fidelity execution analysis
With complicated applications, chasing down scaling and other resiliency problems is far from automated. Ideally, it should be easy to see how long each step in a complicated pipeline takes so that problems can be caught and fixed right away. It is critical to pinpoint where the problem is: in the code, the data, the infrastructure, or the network. But this type of transparency shouldn’t have to be constructed for each application; it should be a part of a larger platform so that developers and operations staff can diagnose and respond to problems as they arise.
Once you find a problem, it is vital to relate the application behavior to the code — ideally through the same monitoring application that reported the error. Too often, getting access to code involves consulting multiple developers and following a winding chain of custody.
4. Abstraction, productivity and simplicity
Resilient applications tend to be future-proof because they employ abstractions that simplify development, improve productivity and allow substitution of implementation technology. As part of the architecture, technology should allow developers to build applications without miring them in the implementation details. This type of simplicity allows any data scientist to use the application and access any type of data source. Without such abstractions, productivity suffers, changes are more expensive and users drown in complexity.
5. Security, auditing and compliance
A resilient application comes with its own audit trail, one that shows who used the application, who is allowed to use it, what data was accessed and how policies were enforced. Building these capabilities into applications is the key to meeting the ever-increasing array of privacy, security, governance and control challenges and regulations that businesses face with big data
6. Completeness and test-driven development
To be resilient, applications have to prove they have not lost data. Failing to do so can lead to dramatic consequences. For instance, as I witnessed during my time in financial services, fines and charges of money laundering or fraud can result if a company fails to account for every transaction because the application code missed one or two lines of data. Execution analysis, audit trails and test-driven development are foundational to proving that all the data was processed properly. Test-driven development means having the technology and the architecture to test application code in a sandbox and getting the same behavior when it is deployed in production. Test-driven development should provide the ability to step through the code, establish invariants, and utilize other defensive programming techniques.
7. Application and data portability
Evolving business requirements frequently drive changes in technology. As a result, applications must run on and work with a variety of platforms and products. The goal is to make data, wherever it lives, accessible to the end user via SQL and standard APIs. For example, a state of the art platform should allow data that is in Hadoop and processed through MapReduce to be moved to Spark or Tez and processed there with minimal or no impact to the code.
8. No black arts
Applications should be written in code that is not dependent on an individual virtuoso. Code should be shared, reviewed and commonly owned by multiple developers. Such a strategy allows for building teams that can take collective responsibility for applications.
If companies follow these eight rules, they will create resilient, scalable applications that allow them to tap into the full power of big data.