Category Archives: Uncategorized

Meetup | Cascading: A Java Developer’s Companion to the Hadoop World – Nov 11, 2014

Tuesday, November 11, 2014
6:00 PM

Twitter Office
1355 Market St #900
San Francisco, CA 94103

Amid all the hype and investment around Big Data technologies, many Java software engineers are asking what it takes to become big data engineers. As a Java professional, toward which path should I steer my career?

Join Dhruv Kumar as he introduces Cascading, an open source application development framework that allows Java developers to build applications on top of Hadoop through its Java API. We’ll provide an overview of the landscape for developing applications on Hadoop and explain why Cascading has become so popular, comparing it to other abstractions such as Pig and Hive. Dhruv will also show how Java developers can easily get started building applications on Hadoop, with live examples of good ol’ Java code.
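Cascading’s core abstraction is the pipe assembly: tuples flow through a chain of operations (tokenize, group, aggregate) that the framework then plans and runs as MapReduce jobs on the cluster. As a rough, self-contained sketch of that mental model – plain JDK streams, not the Cascading API itself – a word count looks like this:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Tokenize each line, group by word, count occurrences --
    // conceptually the same tokenize -> group -> count chain a
    // Cascading pipe assembly expresses over HDFS-scale data.
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
            .filter(w -> !w.isEmpty())
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
            wordCount(List.of("the quick fox", "the lazy dog"));
        System.out.println(counts.get("the")); // prints 2
    }
}
```

In Cascading proper, the equivalent chain is built from `Each`, `GroupBy` and `Every` pipes bound to HDFS taps, so the same declarative style scales from a unit test to cluster-sized data.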

About Dhruv Kumar
Dhruv Kumar is a Solutions Architect at Concurrent Inc. and has over six years of diverse software development experience in Big Data, Web and High Performance Computing applications. Prior to joining Concurrent, he worked at Terracotta as a Software Engineer. He has an MS degree in Computer Engineering from the University of Massachusetts Amherst.

Concurrent Delivers Performance Management for Apache Hive and MapReduce Applications

November 10, 2014
Richard Brueckner

Concurrent, Inc., a leader in data application infrastructure, has introduced a new version of Driven, a leading application performance management product for the data-centric enterprise. Driven is purpose-built to address the challenges of enterprise application development and deployment for business-critical data applications, delivering control and performance management for enterprises seeking to achieve operational excellence.

Driven offers enterprise users – developers, operations and line of business – unprecedented visibility into their data applications, providing deep insights, search, segmentation and visualizations for service-level agreement (SLA) management – all while collecting rich operational metadata in a scalable data repository. This allows users to isolate, control, report and manage a broad range of data applications, from the simplest to the most complex data processes. Driven is a proven performance management solution that enterprises can rely on to deliver against their data strategies.

The latest version of Driven introduces:

  • Deeper Visualization into Data Apps: Enhanced support allows users to debug, manage, monitor and search applications more effectively and in real time. Users can also track and store the complete history of each application’s performance and operational metadata.
  • Powerful Search: Fast, rich search capabilities let users take the guesswork out of managing Hadoop applications. Driven provides greater control over managing user data processing. It quickly identifies problematic applications and their owners, and finds and compares specific applications with previous iterations to ensure that all applications are meeting SLAs.
  • Operational Insights for SLA Management: Users can now visualize all applications over customizable timelines to manage trending application utilization. Driven quickly segments applications by name, user-defined metadata, teams and organizations for deeper insights.
  • Segmentation for Greater Manageability: New segmentation support provides greater insights across all applications. Users have the ability to segment applications by tags, names, teams or organization, and easily track for general Hadoop utilization, SLA management or internal/external chargeback.
  • Metadata Repository: A scalable, searchable, fine-grained metadata repository easily captures end-to-end visibility of data applications, as well as related data sources, fields and more. By retaining a complete history of applications’ operational telemetry, enterprises can leverage Driven for operational excellence from development to production to compliance-related requirements.
  • Integration with Existing Systems: Users can leverage the vast capabilities of Driven and deliver runtime metrics and notifications to existing enterprise monitoring systems.
  • Additional Framework Support: In addition to Cascading, Scalding and Cascalog applications, Driven now supports Apache Hive and native MapReduce processes, allowing enterprises to leverage Driven’s capabilities across a wide variety of application frameworks.

Pricing and Availability
Driven is available as a free service, and is licensable for production use as an annual subscription. Driven will also soon be available as an enterprise deployment.

Hortonworks Goes Broad and Deep with HDP 2.2

From full support for Apache Spark, Apache Kafka, and the Cascading framework to updated management consoles and SQL enhancements in Hive, there’s something for everybody in Hortonworks’ latest Hadoop distribution, which was revealed today at the Strata + Hadoop World conference in New York City.

Hadoop started out with two main pieces: MapReduce and HDFS. But today’s distributions are massive vehicles that wrap up and deliver a host of sub-components, engines, and consoles that are desired and needed for running modern Hadoop applications. Ensuring that each of the pieces in the Hadoop stack works with all the others is what Hadoop distributors like Hortonworks do.

Jim Walker, director of product marketing for the Palo Alto, California company, breaks it down into vertical and horizontal approaches. “Vertically, we’re going to integrate each one of these engines…both into YARN and then deeper into HDFS,” he says. “But we’re also making sure we’re integrating those horizontally across the key services of governance, security and operations.”

A shiny new engine in the guise of Apache Spark is undoubtedly the highlight of Hortonworks Data Platform (HDP) version 2.2. Hortonworks had offered Spark–the in-memory processing framework that’s taken the Hadoop community by storm in 2014–as a technical preview earlier this year, but HDP customers should get better value from Spark with version 2.2, thanks to the work Hortonworks has done to integrate Spark into Hadoop.

Likewise, support for the Apache Kafka messaging layer in HDP 2.2 should help Hortonworks customers add real-time data feeds to Hadoop applications. The vendor sees Kafka–which was originally developed by LinkedIn–front-ending real-time streaming applications built in Storm and Spark. It also sees Kafka as a key player in enabling “Internet-of-Things” types of applications.

Hortonworks has been working closely with Concurrent, the company behind the Cascading framework, to integrate Cascading into Tez and YARN, two key components of the modern Hadoop 2 architecture. That work is complete, and the Cascading development framework is now a first-class citizen in Hadoop 2, fully supported in HDP 2.2. For Hortonworks customers, it means they can now write scalable Hadoop applications using an easy-to-use Java or Scala API, which shields them from the complexity of low-level MapReduce coding.

HDP 2.2 also brings a host of Apache Hive enhancements for running interactive SQL queries against Hadoop data, as well as the completion of the first phase of the initiative. While Hive 0.14 hasn’t been finalized yet, that’s almost certainly the version HDP 2.2 will carry once the Hive community finalizes it.

Among the Hive enhancements are support for ACID transactions during data ingestion, which will make it easier to prepare data to be queried, Walker says. “If I’m streaming data in via Storm, I can write these things directly into Hive in real-time, so it’s instantly accessible via a SQL engine as well,” he says. “I couldn’t do that before because basically we were just kind of appending to rows. Now with support for updates and deletes and inserts, we can do these sort of workloads or accommodate these cross engine workloads in Hive.”

The version of Hive that will be in HDP 2.2 when it ships in November will also include a cost-based optimizer, which will give Hive some of the more advanced query capabilities–such as support for star schema joins and complex multi-joins–that are commonplace in established data warehousing products.

Hortonworks is also rolling out a new release of Argus, its centralized console for establishing and enforcing security policies within a Hadoop cluster. Argus is based on the technology Hortonworks obtained (and subsequently donated to open source) in its June acquisition of XA-Secure, and is currently in the incubation stage at the Apache Software Foundation. With HDP 2.2, Hortonworks extended Argus so that it can work with the security aspects of the Spark, Storm, Hive, and HBase engines.

There’s also work being done around Apache Ambari, the open source centralized management and control console for Hadoop. With HDP 2.2, Hortonworks is supporting the Ambari Views Framework, which allows third-parties to create custom views within the Ambari console. It also introduces Ambari Blueprints, which simplifies the process of defining a Hadoop cluster instance.

As Hadoop clusters go into production, downtime becomes a bigger issue, so Hortonworks introduced the concept of rolling upgrades with this release. The feature uses the HDFS High Availability configuration to allow users to upgrade their clusters without downtime.

HDP 2.2 marks the second major release of the product since the company unveiled its first Hadoop 2 offering a year ago. A lot of water has passed under the Hadoop bridge since then, says Walker, and the promise of Hadoop as a big data architecture keeps growing.

“It’s nothing short of astonishing how much work is going on in and around the Hadoop ecosystem,” Walker says. “We have digitized our society. We have changed the way we think. We changed the way we differentiate ourselves. And it all comes back to the data, to becoming a data-driven business and undertaking some sort of journey to get to that end.”

Hadoop is in the Mind of the Beholder

Mar 24, 2014
Nick Heudecker and Merv Adrian

In the early days of Hadoop (versions up through 1.x), the project consisted of two primary components: HDFS and MapReduce. One thing to store the data in an append-only file model, distributed across an arbitrarily large number of inexpensive nodes with disk and processing power; another to process it, in batch, with a relatively small number of available function calls. And some other stuff called Commons to handle bits of the plumbing. But early adopters demanded more functionality, so the Hadoop footprint grew. The result was an identity crisis that grows progressively more challenging for decision-makers with almost every new announcement.

This expanding footprint included a sizable group of “related projects”, mostly under the Apache Software Foundation. When Gartner published How to Choose the Right Apache Hadoop Distribution in early February 2012, the leading vendors we surveyed (Cloudera, MapR, IBM, Hortonworks, and EMC) all included Pig, Hive, HBase, and Zookeeper. Most were willing to support Flume, Mahout, Oozie, and Sqoop. Several other projects were supported by some, but not all. If you were asked at the time, “What is Hadoop?” this set of ten projects, the commercially supported ones, would have made a good response.

In 2013, Hadoop 2.0 arrived, and with it a radical redefinition. YARN muddied the clear definition of Hadoop by introducing a way for multiple applications to use the cluster resources. You have options. Instead of just MapReduce, you can run Storm (or S4 or Spark Streaming), Giraph, or HBase, among others. The list of projects with abstract names goes on. At least fewer of them are animals now.

During the intervening time, vendors have selected different projects and versions to package and support. To a greater or lesser degree, all of these vendors call their products Hadoop – some are clearly attempting to move “beyond” that message. Some vendors are trying to break free from the Hadoop baggage by introducing new, but just as awful, names. We have data lakes, hubs, and no doubt more to come.

But you get the point. The vague names indicate the vendors don’t know what to call these things either. If they don’t know what they’re selling, do you know what you’re buying? If the popular definition of Hadoop has shifted from a small conglomeration of components to a larger, increasingly vendor-specific conglomeration, does the name “Hadoop” really mean anything anymore?

Today the list of projects supported by leading vendors (now Cloudera, Hortonworks, MapR, Pivotal and IBM) numbers 13. Today it’s HDFS, YARN, MapReduce, Pig, Hive, HBase, and Zookeeper, Flume, Mahout, Oozie, Sqoop – and Cascading and HCatalog. Coming up fast are Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr… and maybe Lucene and Solr. Numerous others are only supported by their distributor and are likely to remain so, though perhaps MapR’s support for Cloudera Impala will not be the last time we see an Apache-licensed, but not Apache, project break the pattern. All distributions have their own unique value-add. The answer to the question, “What is Hadoop?” and the choice buyers must make will not get easier in the year ahead – it will only become more difficult.

These big data companies are ones to watch

Katherine Noyes, Fortune
June 13, 2014

Which companies are breaking new ground with big data technology? We ask 10 industry experts.

“I think graphs have a great future since they show data in its connections rather than a traditional atomic view,” [Gartner’s Sicular] said. “Graph technologies are mostly unexplored by the enterprises but they are the solution that can deliver truly new insights from data.” (She also named Pivotal, The Hive and Concurrent.)

SD Times 100: The Elements of Success

May 30, 2014

Big Data & Business Intelligence 2014

Big Data has hit it big. Every company that has reams of data is looking for ways to effectively store, retrieve and interpret it all. Fortunately these vendors are working on handling this otherwise daunting task, making the mountain of Big Data look like a much more manageable molehill.

  • Apache Hadoop
  • Cloudera
  • Concurrent
  • DataStax
  • Hortonworks
  • MongoDB (10gen)
  • Pentaho
  • Splunk
  • Talend
  • Zettaset

Getting a handle on Hadoop

May 28, 2014
Alex Handy

Tune and monitor the cluster
A single bad stick of RAM in one machine can make an entire cluster sluggish. When you’re building your applications and your Hadoop cluster itself, you’ll want to be sure you’re able to monitor your jobs all the way through the process. Chris Wensel, CTO and founder of Concurrent, said that you and your team have some important decisions to make as you’re designing your processes and your cluster.

Wensel said that, overall, “reducing latency is your ultimate goal, but also reducing the likelihood of failure. The way these technologies were built, they weren’t intended for operational systems.” As such, it is only recently that Hadoop and its many sub-projects have even added high availability support for the underlying file system.

That means Hadoop can still be somewhat brittle. Wensel said teams must first “decide if your application is something with an SLA. Is it something that has to complete in two hours every day, or every 10 minutes? Is it something you don’t want to think about at 10 p.m. when the pager goes off? If it’s an application that’s driving revenue, you need to really think about that. If you decide it has an SLA, you need to adopt some structural integrity in the application itself.”

Be ready for change
This is a two-sided tip, as it pertains to your cluster, and to Hadoop as a whole. On the micro scale, be sure you keep in mind the fact that your application is going to change once it hits the live data. Said Concurrent’s Wensel: “The other side of the problem is that as you’re developing an application, as you get larger and larger data sets, your application changes. It is a challenge to build an application and have it grow to larger data sets. Be very conscious of the fact that things are changing.”

Big Data 100: The emerging Big Data Vendors

May 22, 2014
Rick Whiting

Top Executive: CEO Gary Nakamura

Concurrent, founded in 2008, offers application middleware technology that businesses use to develop, deploy, run and manage big data applications. The company’s products include the Cascading application development framework and Driven application performance management software.

Last month the San Francisco-based company debuted Cascading 3.0 with support for local in-memory execution, Apache MapReduce and Apache Tez. Concurrent closed on $4 million in Series A funding last year.