All posts by Pierre-Yves Poli

Banishing the Confusion of Eight Big Data Myths

December 9, 2014
Chris Preimesberger
http://www.eweek.com/database/slideshows/banishing-the-confusion-of-eight-big-data-myths.html

Enterprises of all types and sizes are realizing that data sets being stored or archived in silos or in clouds—information they might have considered too old or irrelevant, or kept only for regulatory purposes—may have great potential value. It’s all about looking at a business’ history, making cogent queries, discovering insights and projecting what is likely to happen in the future, in order to become more customer-centric and inventory-effective. These companies are going into the internal business of analyzing data. As a result, organizations are in search of the necessary tools and information to take full advantage of the potential this movement offers. However, big data brings big hype, and big hype only brings big confusion about what’s what in the data market. In this slide show, eWEEK and Gary Nakamura, CEO of data application infrastructure provider Concurrent, discuss—and dismiss—the biggest myths that are disrupting the big data industry. Some of what turns out to be a myth may surprise you.

Myth 1: We Must Hire a Hadoop Expert

Hadoop is built on intricate components such as MapReduce, YARN, Spark and the Hadoop Distributed File System (HDFS), and the constant change in, and announcements of, subsystem-level technology further complicate the picture. Plenty of products and tools reduce the complexity and shield users from it entirely. There are open-source application frameworks and commercial products that significantly improve productivity and accessibility when working with Hadoop, to the point where companies can use internal resources to execute on their big data strategy: enterprise Java developers, data warehouse developers and data analysts can quickly and easily leverage Hadoop.
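To make the point concrete, here is a minimal sketch of what such a framework looks like in practice, using Cascading, the open-source Java framework discussed throughout these posts. It follows the word-count pattern of Cascading's public 2.x tutorials; the input and output paths are placeholders and the exact class choices are assumptions rather than a drop-in program. Note that no MapReduce classes appear anywhere: an enterprise Java developer works only with pipes, taps and fields.

// Minimal word-count sketch against the Cascading 2.x Java API (placeholder paths).
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    // Source and sink taps over HDFS paths passed on the command line.
    Tap docTap = new Hfs(new TextLine(new Fields("line")), args[0]);
    Tap wcTap  = new Hfs(new TextDelimited(true, "\t"), args[1], SinkMode.REPLACE);

    // Split each line into words, group by word, count each group.
    Pipe pipe = new Pipe("wordcount");
    pipe = new Each(pipe, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    pipe = new GroupBy(pipe, new Fields("word"));
    pipe = new Every(pipe, new Count(new Fields("count")));

    FlowDef flowDef = FlowDef.flowDef()
        .setName("wordcount")
        .addSource(pipe, docTap)
        .addTailSink(pipe, wcTap);

    // The connector plans and runs the assembly on the underlying platform (here, Hadoop).
    new HadoopFlowConnector().connect(flowDef).complete();
  }
}

The same assembly of pipes is what later items in this collection describe as portable across execution fabrics; only the connector that plans it changes.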

Myth 2: Buying a Big Data Solution Means I’m Using Big Data

You’ve just convinced your organization to adopt a big data strategy, and you’ve purchased a solution. What’s next? Enterprises often get stuck at a point where they have the hardware and Hadoop software in place but don’t have the skill set to take advantage of them. Using big data means that you are using your data, executing a data strategy and helping your business with cost savings, revenue opportunities or additional insights. The key is lowering the barrier for your organization to execute and deliver data products as quickly as possible. Delivering and running these production applications reliably and on time is the next set of challenges. You will know when you have reached this level, because your users will want more.

Myth 3: Big Data Is a Fad That Will Go Away in a Few Years

Ninety percent of the world’s data was created in the last three years. Sticking your head in the sand and hoping that it will go away is a career-ending move. We may drop the “big” in big data in a few years, but whether you like it or not, your company will be in the business of data.

Myth 4: Businesses Need One Data Scientist for All Big Data Needs

For too long, businesses have been upholding the myth of the data science hero—the virtuoso who slays dragons and emerges with the treasure of an amazing app built on insights from big data. The truth is they can’t afford to rely on a single data scientist or developer, because employees can leave an organization at any time. By building a “big data app factory” of processes and teams, companies can ensure that great work gets done over and over again—regardless of personnel changes.

Myth 5: Traditional Enterprise Data Warehouses Will Go Away

It’s unlikely that the technology of the past will completely go away. Enterprises will continue to rely on traditional enterprise data warehouses (EDWs). However, with the rapid evolution of Hadoop and accompanying products and technologies, the role of the EDW in the enterprise will significantly diminish. The flow of data will change, and it’s likely that Hadoop will be its first stop.

Myth 6: Apache Spark Is the Future of Hadoop

As usual, the newest, shiniest object is the most alluring. Apache Spark is currently one of those: it is a fast and general engine for large-scale, clustered data processing. However, rest assured, another will come along and take its place as the hottest thing on the market. What people often forget is that old reliable is old and reliable for a reason: it usually has the breadth and depth needed to move your big data project forward. Resist the urge to move to the latest; if it ain’t broke, don’t fix it. Stick with what you know.

Myth 7: Big Data Is Only for the Largest of Enterprises

The “big” in big data is misleading. Everyone—including organizations large and small—is in the business of data. Sure, large enterprises collect massive amounts of data, but the abundance of data that small enterprises can collect and leverage for competitive advantage also can be immense. Just because your data may be small in volume does not mean you shouldn’t have a data strategy in place.

Myth 8: Big Data Is for Hadoop Experts

Enterprises today are rapidly adopting Hadoop to process, manage and make sense of growing volumes of data, and enterprises are now leveraging existing internal resources to drive their data strategies forward. There are now mature, reliable tools readily available for all software engineers to use to unlock the full potential of big data and Hadoop. As a result, no Hadoop expertise is required.

NoSQL Databases: Niche Tools Slowly Taking Their Place in the Enterprise

December 8, 2014
Dick Weisinger
http://formtek.com/blog/nosql-databases-niche-tools-slowly-taking-their-place-in-the-enterprise/

NoSQL databases store and retrieve data using techniques other than the structured, tabular format of traditional relational databases. Compared to relational databases, NoSQL databases are often simpler in design, can scale more flexibly, and enable very fine-grained access to data.

While NoSQL databases dominate database technology news, they actually have a very low presence across enterprises. A 2014 InformationWeek survey of which database technologies enterprises use found that only 13 percent of organizations surveyed have installed Hadoop, 5 percent use MongoDB and 3 percent have SAP HANA licenses. Compare that to 75 percent of organizations using Microsoft SQL Server and 47 percent using Oracle RDBMS. Joe Masters Emison, CTO of BuildFax, comments that even FileMaker still has a higher adoption percentage than highly hyped NoSQL contenders Cassandra, Riak, and MariaDB.

Forrester Research estimates that NoSQL adoption at enterprises is actually significantly higher, at close to 20 percent of enterprises. Forrester also expects the use of NoSQL to double by 2017.

Forrester identifies four use cases where NoSQL shines:

  • Operational databases for real-time and predictive analytics
  • Stream processing that scales across many nodes in a clustered configuration
  • Databases with low-latency ad hoc queries
  • Applications that require large volumes of rapidly growing structured and unstructured data

Gary Nakamura, CEO of Concurrent, said that “sometimes SQL is overkill for what you are trying to do. You don’t need a query language, just a key with a particular value, and that could be a lot simpler and faster.”

Alex Miller, Developer at Cognitect, said that NoSQL “models let the developer start working with their data without committing to anywhere near as much up-front structure as a relational database. On initial usage, that makes them feel much lighter-weight and agile from a developer point of view. The important thing is to consider your data use case and find the most appropriate technology to match it.”
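A purely conceptual sketch of the point both quotes make: when all you need is a value for a known key, a lookup requires neither an up-front schema nor a query language. The java.util.Map below is only a stand-in for a key-value store, and the key and payload are invented for illustration.

// Conceptual sketch: key-based access versus a SQL query for the same read.
// java.util.Map stands in for a key-value store; names and values are illustrative.
import java.util.HashMap;
import java.util.Map;

public class KeyValueSketch {
    public static void main(String[] args) {
        Map<String, String> sessions = new HashMap<>();

        // Write and read by key: no table definition, no query to parse or plan.
        sessions.put("session:42", "{\"user\":\"alice\",\"items\":3}");
        String payload = sessions.get("session:42");
        System.out.println(payload);

        // The relational equivalent of that read would be a SQL round trip, e.g.:
        //   SELECT payload FROM sessions WHERE session_id = 42;
        // which assumes a schema declared up front plus parsing and planning
        // that a plain key lookup never needs.
    }
}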

Even With 20 Years in Tech I Learned These Lessons as a First-Time CEO

December 8, 2014
Gary Nakamura
http://www.entrepreneur.com/article/240524

After working as an executive in the high-tech industry for more than 20 years, I’ve learned my fair share of lessons along the way. After completing my first year as a first-time CEO, at a startup no less, it’s become clear that growing, managing and running a company is no different.

However, as the old adage says, “you live and you learn.” That has certainly been a constant throughout the trials and tribulations of my first year at Concurrent, Inc. As I look back, here are the five most important lessons I’ve learned as a first-time CEO:

1. Check your ego at the door.

As CEO, I am focused on building a company around the vision of our founder and the talent of the people we put around him. This means I need to keep my ego in check every day when I walk into the office so that the focus stays on “we the team” rather than on “me the individual”.

Related: 8 Ways Rookie CEOs Can Succeed Faster

2. Planning ahead is your job.

In my spare time, I like to garden, and I often think of a company like a garden. Sometimes the seeds you plant won’t bloom until much later, but the effort put forth is still worth it. On the flip side, plants often die – you did something wrong, such as too little or too much watering, or poor placement – but you learn from your mistakes and move on.

The same can be said when an initiative or project fails at your company, and – trust me – this will happen. Great CEOs, like gardeners, seek knowledge, plan ahead, are patient, are disciplined and are persistent. As a CEO, you need to think long term and big picture.

3. I am human – therefore, I “shave the peak.”

The demands of the CEO job at times require superhero abilities, and with those demands come pressures. But it’s your responsibility to keep these situations and the stress in check. Remember that we’re still human.

There are always a million things that can keep you up at night, but to survive in the long run, you need to counter that stress and anxiety by taking care of yourself. I call that “shaving the peak.” I achieve this through daily exercise, but there are many other ways to clear your head and unwind. Find your way and stick with it.

Related: Mindfulness and the Startup CEO

4. Don’t be penny wise and pound foolish.

Startup life is very different from life at a big company (read: big budget). You have to keep your spend in check. However, being penny wise brings the danger of focusing on the wrong things and not spending where it really matters.

Some things are definitely worth the extra expense if they make your product better, your customers happier, and your employees feel more appreciated and more productive. Only a fool cuts back on something to save in the near term rather than keeping the big picture in sight.

5. Don’t babysit.

At any company, there’s a mixed bag of personalities, plenty of opinions and times when team members at every level don’t see eye to eye. As a CEO, it’s important to know when to step back and ensure the team works together to figure things out, rather than hold everyone’s hands as they sort through their differences.

Every CEO is different, and so are the failures and successes from which they learn. However, whether you’re just starting out or well seasoned in the position, there are new opportunities from which to grow and learn every day. Being a (first-time) CEO is hard work and challenging, but with a level head on your shoulders, a good understanding of what is important (and what is not), and a strong and supportive team on your side, it is an exciting and fulfilling ride.

MeetUp | Hadoop Meetup at Viadeo on Cascading/Tez with Concurrent Inc and Hortonworks – November 18, 2014

Sign-up here: http://www.meetup.com/Hadoop-User-Group-France/events/218753457/

When:
Tuesday, November 25, 2014, 6:30 PM

Where:
Viadeo
30 Rue de la Victoire
75009 Paris

What:
Hadoop Meetup at Viadeo on Cascading/Tez with Concurrent Inc and Hortonworks

• Talk #1: Introduction to Tez, by Olivier Renault of Hortonworks (session in French).

Speaker: Olivier Renault is a Principal Solution Engineer at Hortonworks, the company behind the Hortonworks Data Platform. Olivier is an expert in deploying Hadoop at scale in a secure and performant manner.

Abstract: During this presentation, Olivier will introduce Apache Tez: what it does, why many see it as MapReduce v2, and how it helps Hive, Pig, Cascading and others improve their performance.

• Talk #2: The Cascading (big) data application framework, by André Kelpe of Concurrent Inc (session in English).

Speaker: André Kelpe is a Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven. André has spoken about Cascading and Lingual at various tech meetups, Devoxx 2013 and the Technical University of Berlin. Prior to Concurrent, he worked in the world of digital maps and navigation.

Abstract: Cascading is a widely deployed, production-ready open source data application framework geared towards Java developers. Cascading enables developers to write complex data applications without the need to become distributed systems experts. Cascading apps are portable between different computation frameworks, so a given application can be moved from Hadoop onto new processing platforms like Apache Tez or Apache Spark without rewriting the application code (a sketch of this connector swap follows the programme below).

• 21h00-…: Apéritif and networking
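As a rough illustration of the portability claim in the Talk #2 abstract: the pipe assembly (the application) is handed to a planner, and moving between MapReduce and Tez is a matter of swapping that planner. The connector class names below follow the Cascading 3.x module layout and should be read as assumptions, not a verified build recipe.

// Hypothetical sketch: same FlowDef, different execution fabric.
// Connector class names follow the Cascading 3.x layout and are assumptions.
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop2.Hadoop2MR1FlowConnector; // plans onto MapReduce
import cascading.flow.tez.Hadoop2TezFlowConnector;     // plans onto Apache Tez

public class RunOnEitherPlatform {
    public static Flow planFlow(FlowDef flowDef, boolean useTez) {
        Properties props = new Properties();
        // The application (pipes, taps, fields) is untouched; only the planner changes.
        return useTez
                ? new Hadoop2TezFlowConnector(props).connect(flowDef)
                : new Hadoop2MR1FlowConnector(props).connect(flowDef);
    }
}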

Concurrent Releases New Version of Big Data Application Performance-Monitoring and Management System

http://www.dbta.com/Editorial/News-Flashes/Concurrent-Releases-New-Version-of-Big-Data-Application-Performance-Monitoring-and-Management-System-100611.aspx

November 18, 2014

Hadoop is still a young technology and working with it can be difficult for enterprise organizations.  To help alleviate the challenges, Concurrent has announced the latest version of Driven, a big data application performance-monitoring and management system.

Driven is purpose-built to address the challenges of enterprise application development and deployment for business-critical data applications, delivering control and performance management for enterprises seeking to achieve operational excellence.

As a company, Concurrent is mainly focused on app development and app management. Concurrent defines apps as the business’s intellectual property that comes together with data to create a competitive advantage in how a business runs. Concurrent was initially founded around an open source framework called Cascading, whose purpose is to make it easier for users to build data-oriented applications on top of Hadoop. Cascading has performed very well, to the tune of 285,000 downloads per month.

Driven grew out of the need to manage the many applications that users created with Cascading on top of Hadoop.

When business users create these applications, many of the platforms for Hadoop do not offer performance management of the applications as they run. Driven is the tool that business users can apply to their applications to view their performance as they run. According to Concurrent, these data pipelines can be thought of as supply chains; the output is the data product that an analyst or data scientist consumes at the end. “Driven is focused around the most minute detail. I can tune it at a very fine-grain level, but I can zoom back out and look at it from a specific application level,” explained Chris Wensel, founder and CTO of Concurrent.

The newest version of Driven includes deeper visualization into data apps, a metadata repository, and new segmentation techniques. The deeper visualization allows users to debug, manage, monitor and search applications more effectively and in real time. The metadata repository is scalable, searchable and fine-grained, and easily captures end-to-end visibility of data applications. By keeping a complete history of an application’s telemetry, users are able to view its performance since inception. New segmentation support provides greater insights across all applications. Users have the ability to segment applications by tags, names, teams or organization, and easily track general Hadoop utilization, SLA management or internal/external chargeback.

New products of the week

November 17, 2014
Ryan Francis

http://www.networkworld.com/article/2848015/data-center/new-products-of-the-week-11-17-2014.html

Visit the link above to see the full slideshow. Driven is featured on the fourth slide.

Key features: The latest release of Driven offers enterprises unprecedented visibility into their data applications, providing deep insights, search, segmentation and visualizations for SLA management while collecting rich operational metadata in a scalable repository.

Concurrent Delivers Performance Management for Apache Hive and MapReduce Applications

November 10, 2014
Richard Brueckner

Concurrent, Inc., a leader in data application infrastructure, has introduced a new version of Driven, a leading application performance management product for the data-centric enterprise. Driven is purpose-built to address the challenges of enterprise application development and deployment for business-critical data applications, delivering control and performance management for enterprises seeking to achieve operational excellence.

Driven offers enterprise users – developers, operations and line of business – unprecedented visibility into their data applications, providing deep insights, search, segmentation and visualizations for service-level agreement (SLA) management – all while collecting rich operational metadata in a scalable data repository. This allows users to isolate, control, report and manage a broad range of data applications, from the simplest to the most complex data processes. Driven is a proven performance management solution that enterprises can rely on to deliver against their data strategies.

The latest version of Driven introduces:

  • Deeper Visualization into Data Apps: Enhanced support allows users to debug, manage, monitor and search applications more effectively and in real time. Users can also track and store complete history of each application’s performance and operational metadata.
  • Powerful Search: Fast and rich search capabilities enable users to take the guesswork out of managing Hadoop applications. Driven provides greater control over managing user data processing. It quickly identifies problematic applications and the associated owners, and finds and compares specific applications with previous iterations to ensure that all applications are meeting SLAs.
  • Operational Insights for SLA Management: Users can now visualize all applications over customizable timelines to manage trending application utilization. Driven quickly segments applications by name, user-defined metadata, teams and organizations for deeper insights.
  • Segmentation for Greater Manageability: New segmentation support provides greater insights across all applications. Users have the ability to segment applications by tags, names, teams or organization, and easily track for general Hadoop utilization, SLA management or internal/external chargeback.
  • Metadata Repository: A scalable, searchable, fine-grained metadata repository easily captures end-to-end visibility of data applications, as well as related data sources, fields and more. By retaining a complete history of applications’ operational telemetry, enterprises can leverage Driven for operational excellence from development to production to compliance-related requirements.
  • Integration with Existing Systems: Users can leverage the vast capabilities of Driven and deliver runtime metrics and notifications to existing enterprise monitoring systems.
  • Additional Framework Support: In addition to Cascading, Scalding and Cascalog applications, Driven now supports Apache Hive and native MapReduce processes, allowing enterprises to leverage Driven’s capabilities across a wide variety of application frameworks.

Pricing and Availability
Driven is available as a free service on cascading.io and licensable for production use as an annual subscription. Also, Driven will soon be available as an enterprise deployment.

Cascading Backer Boosts Hadoop App Performance Management

Doug Henschen
November 6th, 2014
http://www.informationweek.com/big-data/software-platforms/cascading-backer-boosts-hadoop-app-performance-management/d/d-id/1317282

With Hadoop quickly emerging as an applications platform as well as a big data-processing environment, Concurrent is broadening its Driven application performance-management system to monitor and manage a variety of data-centric applications.

Concurrent is the commercial vendor behind open source Cascading, arguably the most popular big data application-development option going — after native coding on separate platforms. Driven is Concurrent’s commercial product, but it’s not a souped-up version of Cascading. Rather, Driven is a separate big data application performance-monitoring and management system.

Where Hadoop vendors and analytics platforms like Apache Spark have their own management consoles that look at the health and performance of their clusters, Driven monitors and helps troubleshoot the performance of data-driven applications across multiple platforms and environments. That could be various Hadoop distributions or emerging systems like Spark, Storm, Tez, or other analytic platforms.

“Those other management consoles focus on the data fabrics where Driven focuses on the applications,” said Chris Wensel, founder and CTO of Concurrent, in a phone interview with InformationWeek. “We bring visibility to the version, the developer, and the process owner, and we help you understand what the application does, what libraries it depends upon, and most importantly, how it interacts with upstream and downstream applications.”

Where other management systems might help with post-mortem analysis, Wensel said Driven lets developers, operations, and line-of-business staff visualize myriad apps running on clusters and measure growth in demand, by app and by business unit, over time. And when applications fail, Driven is designed to surface how that will impact other applications so specific jobs can be killed or rerun and users or customers can be notified if there will be disruptions.

The first release of Driven, which came out in June, supported monitoring and management of Cascading, Scalding, and Cascalog applications, but with this week’s 1.1 update, Concurrent is adding support for Hive and bespoke MapReduce applications. Despite the emergence of multiple SQL-on-Hadoop options and MapReduce alternatives, these two options are still doing the bulk of the heavy lifting in Hadoop environments.

“Everybody wants to get to the next thing that will be faster than MapReduce, but they probably won’t go there for another two years because MapReduce works, they understand it, and they know the operational risks,” said Gary Nakamura, Concurrent’s CEO.

The combination of Cascading and Driven will let big data practitioners keep applications running and well managed, yet requirements for change to those apps will be minimal if they end up switching from MapReduce to alternatives like Spark or Tez, Nakamura said.

Other upgrades in Driven 1.1 include deeper visualizations for monitoring, managing, and debugging applications; search capabilities designed to quickly spot problematic applications; timeline visualizations to track app utilization trends; and app-segmentation support by tags, names, teams, or organizations so teams can track service-level agreement compliance and Hadoop utilization for internal or external chargebacks.

Concurrent backs the open source Cascading big data application development platform. Driven, pictured above, is its commercial app performance-management system.

Driven has been generally available for only four months, so Wensel said it’s no surprise there are fewer than a dozen customers at this point. The only publicly identified Driven customer is the Dutch email advertising optimization vendor Mojn.

“With Driven, our developers have unmatched operational visibility and control across all Cascading applications — including real-time monitoring, history and performance tracking over time,” said Johannes Alkjær, lead architect at Mojn, in a statement from Concurrent. “Driven [lets us] drive differentiation through our data and manage our data applications more efficiently.”

Concurrent is counting on the popularity of Cascading to drive interest in Driven. There are more than 8,000 production deployments of Cascading (including uses at Twitter, United Healthcare, Etsy, and Nokia), and the software is getting more than 285,000 downloads per month, according to Concurrent.

Cascading owes its popularity to the fact that it abstracts developers from the complexities of Hadoop programming so they can write once and deploy across multiple distributions and generations of distributions. Concurrent does the work of making sure its platform stays up to date and compatible with multiple big data platforms as they evolve.

Cascading has been certified to work on multiple distributions and works with the YARN resource management framework. Concurrent also offers beta Cascading software and is preparing future production releases that will add support for Spark, Storm, and Tez as they become generally available.

Hortonworks Goes Broad and Deep with HDP 2.2

From full support for Apache Spark, Apache Kafka, and the Cascading framework to updated management consoles and SQL enhancements in Hive, there’s something for everybody in Hortonworks’ latest Hadoop distribution, which was revealed today at the Strata + Hadoop World conference in New York City.

Hadoop started out with two main pieces: MapReduce and HDFS. But today’s distributions are massive vehicles that wrap up and deliver a host of sub-components, engines, and consoles that are desired and needed for running modern Hadoop applications. Ensuring that each of the pieces in the Hadoop stack works with all the others is what Hadoop distributors like Hortonworks do.

Jim Walker, director of product marketing for the Palo Alto, California, company, breaks it down into vertical and horizontal approaches. “Vertically, we’re going to integrate each one of these engines…both into YARN and then deeper into HDFS,” he says. “But we’re also making sure that we’re integrating those horizontally across the key services of governance, security and operations.”

A shiny new engine in the guise of Apache Spark is undoubtedly the highlight of Hortonworks Data Platform (HDP) version 2.2. Hortonworks had offered Spark–the in-memory processing framework that’s taken the Hadoop community by storm in 2014–as a technical preview earlier this year, but HDP customers should get better value from Spark with version 2.2, thanks to the work Hortonworks has done to integrate Spark into Hadoop.

Likewise, support for the Apache Kafka messaging layer in HDP 2.2 should help Hortonworks customers add real-time data feeds to Hadoop applications. The vendor sees Kafka–which was originally developed by LinkedIn–front-ending real-time streaming applications built in Storm and Spark. It also sees Kafka as a key player in enabling “Internet of Things” types of applications.

Hortonworks has been working closely with Concurrent, the company behind the Cascading framework, to integrate Cascading with Tez and YARN, two key components of the modern Hadoop 2 architecture. That work is complete, and the Cascading development framework is now a first-class citizen in Hadoop 2, fully supported in HDP 2.2. For Hortonworks customers, it means they can now write scalable Hadoop applications using an easy-to-use Java or Scala API, which shields them from the complexity of low-level MapReduce coding.

HDP 2.2 also brings a host of Apache Hive enhancements for running interactive SQL queries against Hadoop data, as well as the completion of the first phase of the Stinger.next initiative. While Hive 0.14 hasn’t been finalized yet, that’s almost certainly the version HDP 2.2 will carry once the Hive community finalizes it.

Among the Hive enhancements are support for ACID transactions during data ingestion, which will make it easier to prepare data to be queried, Walker says. “If I’m streaming data in via Storm, I can write these things directly into Hive in real-time, so it’s instantly accessible via a SQL engine as well,” he says. “I couldn’t do that before because basically we were just kind of appending to rows. Now with support for updates and deletes and inserts, we can do these sort of workloads or accommodate these cross engine workloads in Hive.”

The version of Hive that will be in HDP 2.2 when it ships in November will also include a cost-based optimizer, which will give Hive some of the more advanced query capabilities–such as support for star schema joins and complex multi-joins–that are commonplace in established data warehousing products.

Hortonworks is also rolling out a new release of Argus, its centralized console for establishing and enforcing security policies within a Hadoop cluster. Argus is based on the technology Hortonworks obtained (and subsequently donated to open source) in its June acquisition of XA-Secure, and is currently in the incubation stage at the Apache Software Foundation. With HDP 2.2, Hortonworks extended Argus so that it can work with the security aspects of the Spark, Storm, Hive, and HBase engines.

There’s also work being done around Apache Ambari, the open source centralized management and control console for Hadoop. With HDP 2.2, Hortonworks is supporting the Ambari Views Framework, which allows third-parties to create custom views within the Ambari console. It also introduces Ambari Blueprints, which simplifies the process of defining a Hadoop cluster instance.

As Hadoop clusters go into production, downtime becomes a bigger issue, so Hortonworks introduced the concept of rolling upgrades with this release. The feature uses the HDFS High Availability configuration to allow users to upgrade their clusters without downtime.

HDP 2.2 marks the second major release of the product since the company unveiled its first Hadoop 2 offering a year ago. A lot of water has passed under the Hadoop bridge since then, says Walker, and the promise of Hadoop as a big data architecture keeps growing.

“It’s nothing short of astonishing how much work is going on in and around the Hadoop ecosystem,” Walker says. “We have digitized our society. We have changed the way we think. We changed the way we differentiate ourselves. And it all comes back to the data, to becoming a data-driven business and undertaking some sort of journey to get to that end.”

Hadoop is in the Mind of the Beholder

Mar 24, 2014
Nick Heudecker and Merv Adrian

http://blogs.gartner.com/nick-heudecker/hadoop-is-in-the-mind-of-the-beholder/

In the early days of Hadoop (versions up through 1.x), the project consisted of two primary components: HDFS and MapReduce. One thing to store the data in an append-only file model, distributed across an arbitrarily large number of inexpensive nodes with disk and processing power; another to process it, in batch, with a relatively small number of available function calls. And some other stuff called Commons to handle bits of the plumbing. But early adopters demanded more functionality, so the Hadoop footprint grew. The result was an identity crisis that grows progressively more challenging for decisionmakers with almost every new announcement.

This expanding footprint included a sizable group of “related projects”, mostly under the Apache Software Foundation. When Gartner published How to Choose the Right Apache Hadoop Distribution in early February 2012, the leading vendors we surveyed (Cloudera, MapR, IBM, Hortonworks, and EMC) all included Pig, Hive, HBase, and Zookeeper. Most were willing to support Flume, Mahout, Oozie, and Sqoop. Several other projects were supported by some, but not all. If you were asked at the time, “What is Hadoop?” this set of ten projects, the commercially supported ones, would have made a good response.

In 2013, Hadoop 2.0 arrived, and with it a radical redefinition. YARN muddied the clear definition of Hadoop by introducing a way for multiple applications to use the cluster resources. You have options. Instead of just MapReduce, you can run Storm (or S4 or Spark Streaming), Giraph, or HBase, among others. The list of projects with abstract names goes on. At least fewer of them are animals now.

During the intervening time, vendors have selected different projects and versions to package and support. To a greater or lesser degree, all of these vendors call their products Hadoop – some are clearly attempting to move “beyond” that message. Some vendors are trying to break free from the Hadoop baggage by introducing new, but just as awful, names. We have data lakes, hubs, and no doubt more to come.

But you get the point. The vague names indicate the vendors don’t know what to call these things either. If they don’t know what they’re selling, do you know what you’re buying? If the popular definition of Hadoop has shifted from a small conglomeration of components to a larger, increasingly vendor-specific conglomeration, does the name “Hadoop” really mean anything anymore?

Today the list of projects supported by the leading vendors (now Cloudera, Hortonworks, MapR, Pivotal and IBM) numbers 13: HDFS, YARN, MapReduce, Pig, Hive, HBase, Zookeeper, Flume, Mahout, Oozie and Sqoop – plus Cascading and HCatalog. Coming up fast are Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr… and maybe Lucene and Solr. Numerous others are supported only by their own distributor and are likely to remain so, though perhaps MapR’s support for Cloudera Impala will not be the last time we see an Apache-licensed, but not Apache, project break the pattern. All distributions have their own unique value-add. The answer to the question “What is Hadoop?” and the choice buyers must make will not get easier in the year ahead – it will only become more difficult.