May 8th 2015
I’m often asked which Hadoop tools are “best.” The answer, of course, is that it depends, and what it depends on is the stage of the Hadoop journey you’re trying to navigate. Would you show up with a diamond ring on a first date? Would you arrange dinner with your spouse by swiping right on Tinder? In the same way, we can say that some Hadoop tools are certainly better than others, but only once we know where you are in your data project’s lifecycle.
Here, then, are the major stages of a happy and fulfilling relationship with Hadoop, along with the tools and platforms best suited to each one.
Dating: Exploring your new Hadoop partner
When a business analyst is first matched with a Hadoop deployment, he or she is typically drawn by great promise. What kind of data does the deployment hold? What does that data actually look like? Can you combine it with other data sets to learn something interesting?
Answering these questions doesn’t require fancy, large-volume clusters. What’s ideal is a simple, Excel-like interface for reading a few rows, teasing out the fields and getting to know the distribution. At this data exploration stage, visualization platforms like Datameer and Trifacta deliver excellent productivity. If you’re comfortable with SQL, you could also try Hive, but that might be overkill given the learning curve.
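To give a rough feel for what this first look involves, here is a minimal sketch in plain Python. It reads a few rows, lists the fields and counts the values in one column. It is not how Datameer, Trifacta or Hive work internally; the sample data, the column names and the `profile_sample` helper are all invented for illustration.

```python
import csv
import io
from collections import Counter
from itertools import islice

# Hypothetical sample; in practice this would be a small extract from the cluster.
SAMPLE = """user_id,country,purchases
1,US,3
2,DE,1
3,US,7
4,FR,2
5,US,1
"""

def profile_sample(text, column, max_rows=100):
    """Read up to max_rows rows and return (field names, value counts for column)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(islice(reader, max_rows))          # only look at a few rows
    fields = reader.fieldnames                     # the header tells us the fields
    counts = Counter(row[column] for row in rows)  # rough view of the distribution
    return fields, counts

fields, counts = profile_sample(SAMPLE, "country")
print(fields)
print(counts.most_common())
```

The point is the shape of the work, not the code: a small sample, the field list and a quick distribution are usually enough to decide whether a data set deserves a second date.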
One reason that some folks fail to make it past a first date with Hadoop is that they equate visualization with reporting; they go right to a BI tool like Tableau. The visualization tools above are easier to use and therefore better suited to exploration and quick hypothesis building.
Moving In: Getting comfortable with further Hadoop developments
Before jumping into a production-grade data application, the analyst needs to see a more concrete vision of it if he or she hopes to win the organization’s blessings — and funding. In traditional software terms, it’s like building a prototype with Visual Basic to get everyone on board, without worrying about all the customer-support scaffolding that will eventually be necessary. Basically, you want a nice, slick hack.
Here again, the visualization tools can be helpful, but even better are GUI-based application development platforms like SnapLogic, Platfora and Pentaho. These tools make building basic applications fast and easy, and they’re relatively simple to learn.
Tying the knot: Committing to a production relationship with Hadoop
Once a prototype’s value is recognized, the enterprise typically turns the application over to an operations team for production. That team cares not only about the application’s functionality, but also about its capacity to be deployed and run reliably, its scalability and its portability across data fabrics.
The operations folks must also make sure the application’s execution honors service-level agreements (SLAs), and that it integrates seamlessly with inbound data sources like data warehouses, mainframes and other databases, as well as with outbound systems such as HBase and Elasticsearch.
At this production deployment stage, development and operations teams typically forgo the GUI-based platforms, opting instead for the control and flexibility of API-based frameworks such as Cascading, Scalding and Spark. (Full disclosure: I work for Concurrent, the company behind Cascading.)
With these you get not only the control necessary to engineer applications that address the complexity of your enterprise, but also test-driven development, code reusability, complex orchestration of data pipelines, and continuous performance tuning. These capabilities are vital for production-grade applications, and are not available in GUI-based platforms or ad hoc query tools.
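To make the test-driven point concrete, here is a sketch in plain Python rather than in Cascading, Scalding or Spark. It treats one pipeline step as an ordinary function and checks it with small tests. The record layout and the names `clean_record` and `run_pipeline` are hypothetical, chosen only for illustration.

```python
def clean_record(record):
    """Normalize one raw record: tidy the name and parse the amount as a float.

    Returns None for records that cannot be parsed, so the pipeline can drop them.
    """
    try:
        amount = float(record["amount"])
    except (KeyError, ValueError, TypeError):
        return None
    return {"name": record.get("name", "").strip().lower(), "amount": amount}

def run_pipeline(records):
    """Apply clean_record to every record and keep only the parseable ones."""
    cleaned = (clean_record(r) for r in records)
    return [r for r in cleaned if r is not None]

# Tests written as plain functions; a runner such as pytest would collect them.
def test_parses_valid_record():
    result = clean_record({"name": " Ada ", "amount": "12.5"})
    assert result == {"name": "ada", "amount": 12.5}

def test_drops_unparseable_amount():
    assert run_pipeline([{"name": "x", "amount": "n/a"}]) == []

def test_drops_missing_amount():
    assert clean_record({"name": "y"}) is None

# Called directly here so the sketch runs on its own.
test_parses_valid_record()
test_drops_unparseable_amount()
test_drops_missing_amount()
```

Because each step is just code, it can be tested, reused and tuned like any other software, which is the kind of control the API-based frameworks are meant to provide.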
Building a big data family: Nurturing a thriving nursery of Hadoop apps
Once a team is married to a big data platform, they soon find themselves with a growing family of applications that execute on a common Hadoop cluster (or other shared resource).
There are two big challenges at this stage. First, each application must continue to operate under defined controls. These include SLAs and governance, of course, but it’s also crucial to prevent unauthorized use of sensitive data fields across use cases and to make sure that proper privacy curtains are respected.
Second, production teams must carefully manage how a brood of apps uses the shared cluster resources. Specifically, utilization should reflect the relative importance of lines of business and use cases. In addition, teams must make sure that no application goes hungry for compute resources, either now or in the future (the discipline known as capacity planning).
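One way to picture utilization that reflects relative importance is a weighted split of cluster capacity. The sketch below is only an illustration under stated assumptions: the application names, the weights and the slot count are made up, and real cluster schedulers are configured rather than hand-coded like this.

```python
def allocate_capacity(total_slots, weights):
    """Split total_slots across apps in proportion to their weights.

    Uses largest-remainder rounding so the shares are whole numbers
    that add up to exactly total_slots.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    # Exact (possibly fractional) share for each app.
    exact = {app: total_slots * w / total_weight for app, w in weights.items()}
    # Start from the whole-number part of each share.
    shares = {app: int(x) for app, x in exact.items()}
    leftover = total_slots - sum(shares.values())

    # Hand the remaining slots to the apps with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda app: exact[app] - shares[app], reverse=True)
    for app in by_remainder[:leftover]:
        shares[app] += 1
    return shares

# Hypothetical weights expressing the relative importance of three use cases.
weights = {"fraud_detection": 5, "recommendations": 3, "ad_hoc_reports": 2}
shares = allocate_capacity(100, weights)
print(shares)
```

Rerunning the same calculation with projected weights or a larger slot count is a crude form of capacity planning: it shows which app would go hungry before it actually does.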
The ideal tools at this stage deliver performance monitoring for enterprises looking to achieve operational excellence on Hadoop. These include Driven, Cloudera Manager, and even Hadoop’s out-of-the-box JobTracker. (Full disclosure: my company, Concurrent, also makes Driven.)
Maturing with big data: Fostering a rich, lifelong partnership with Hadoop
Like any relationship, the one with Hadoop can become more complex as it matures. Typically, the enterprise will transform production data applications into rich, enterprise-class data products, both by piping them into downstream systems (for example, a recommendation engine) and by making their output available to businesspeople to foster better decisions.
At this most advanced stage in the data product lifecycle, the enterprise’s needs are the most diverse. On top of your production-ready development platform, you’ll want highly intelligent pipeline monitoring. For example, if one application encounters a problem and a downstream one depends on its output, you’ll want to raise an alert so your team can react quickly to resolve the problem.
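The dependency-aware alert described above can be sketched in a few lines. This is a simplified illustration, not how Driven or any other monitoring product works; the pipeline, the application names and the helpers `affected_apps` and `build_alert` are hypothetical.

```python
from collections import deque

# Hypothetical pipeline: each app maps to the apps that consume its output.
DOWNSTREAM = {
    "ingest_clicks": ["sessionize"],
    "sessionize": ["recommendations", "daily_report"],
    "recommendations": [],
    "daily_report": [],
}

def affected_apps(failed_app, downstream):
    """Return every app that directly or indirectly depends on failed_app."""
    seen = {failed_app}
    queue = deque([failed_app])
    impacted = []
    # Breadth-first walk over the consumers of each affected app.
    while queue:
        app = queue.popleft()
        for consumer in downstream.get(app, []):
            if consumer not in seen:
                seen.add(consumer)
                impacted.append(consumer)
                queue.append(consumer)
    return impacted

def build_alert(failed_app, downstream):
    """Compose a one-line alert naming the downstream apps at risk."""
    impacted = affected_apps(failed_app, downstream)
    if not impacted:
        return f"{failed_app} failed; no downstream apps affected"
    return f"{failed_app} failed; downstream apps at risk: {', '.join(impacted)}"

print(build_alert("sessionize", DOWNSTREAM))
```

In practice the alert would go to a pager or chat channel rather than to standard output, but the core idea is the same: know who consumes what, so that one failure is reported once with its full blast radius.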
You’ll also want tools that quickly pinpoint whether a performance bottleneck resulted from a problem in code, data, hardware or shared resources (your cluster configuration). Driven, Cloudera Manager, and Hadoop’s JobTracker are also helpful here. You may also want to give businesspeople an easy, flexible way of getting the results they want, while still offering a level of abstraction from the compute interface. A query framework like SQL-based Hive is a great choice here.
To choose the right tool, know where you are in the Hadoop relationship cycle
In conclusion, before you can answer which tools and approaches work best, you have to understand where you are in the lifecycle of your data operation. Like any true love, your relationship with Hadoop will mature over time, and you’ll need to reevaluate your needs as they mature along with it.