What Is Data Flow and Why Should You Care?
Eric Kavanagh, Inside Analysis
February 24, 2014
What goes around surely comes back around, which in the world of data is often called lifecycle management. To be blunt, very few organizations have ever formalized and implemented such a grandiose practice, but that’s not a pejorative statement, for only recently has the concept become seriously doable without great expense.
Lifecycle management means following data from cradle to grave, or more precisely, from acquisition through integration, transformation, aggregation, crunching, archiving, and ultimately (if ever) deletion. That last leg is often a real kicker, and has entered the spotlight largely in the context of eDiscovery, which tends to be discussed in the legal arena – too much old data lying around can become a definite liability if some lawyer can use it against you.
But there’s a new, much more granular version of lifecycle management circulating these days, and it’s described by Dr. Robin Bloor as Data Flow. In fact, he even talks of a data flow architecture which can be leveraged to get value from data long before it ever enters a data warehouse (if it ever even does). Data Flow in this context can be a really big deal, because it can deliver immediate value without ever beating at the door of a data warehouse.
Data streams embody one of the hotter trends in data flow. Streams are essentially live feeds of data spinning out of systems of all kinds – air traffic control data, stock ticker data, and a vast array of machine-generated data, whether in manufacturing, healthcare, Web traffic, you name it. Several innovative vendors are focused intently on data streams these days, such as IBM, SQLstream, Vitria, ExtraHop Networks and others. The use cases typically revolve around the growing field of Operational Intelligence.
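To make the idea of acting on a stream before it reaches any warehouse a bit more concrete, here is a minimal sketch in Python. It is purely illustrative and not drawn from any vendor's product: the function name rolling_alerts, the window size, the threshold and the sample feed are all assumptions made up for this example. It watches a feed of readings and flags any value that jumps well above the average of the few readings before it.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def rolling_alerts(
    readings: Iterable[float], window: int = 3, threshold: float = 1.5
) -> Iterator[Tuple[int, float, float]]:
    """Yield (index, value, mean) whenever a reading exceeds
    threshold * the mean of the previous `window` readings.

    Hypothetical helper for illustration only; window and threshold
    are arbitrary assumptions, not recommended settings.
    """
    recent: deque = deque(maxlen=window)  # holds only the last `window` readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            if value > threshold * mean:
                yield i, value, mean
        recent.append(value)  # oldest reading drops off automatically


# A made-up feed standing in for a live stream (e.g. sensor or ticker data).
feed = [10.0, 11.0, 9.0, 10.0, 25.0, 10.0, 11.0]
alerts = list(rolling_alerts(feed))
print(alerts)
```

Because the function is a generator that keeps only the last few readings in memory, it never needs the full history, which is the basic property that lets stream-oriented tools deliver value on data in motion rather than data at rest.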
Data Flow Oriented Vendors
Finding ways to effectively visualize data flows can be a real treat. Some of the most talked-about vendors these days have worked to provide windows into the world of data flow – or at least basic systems management – primarily using so-called big data. Both Cloudera and Hortonworks have built their enterprises on the shoulders of Apache Hadoop, the powerful if somewhat cryptic engine that has the entire world of enterprise software in a genuine tizzy.
But there are a few other vendors who have excelled in the domain of providing detailed visibility into how data flows in certain contexts. The first that comes to mind is Concurrent, which just unveiled its Driven product. This offering is almost like the enterprise data version of a glass-encased ant farm. Remember those from childhood days? You could actually watch the ants build their tunnels, take care of business, cruise all around, get things done. With Driven, this is systems management 3.0 – you can actually see how the data moves through applications and where things go awry, and thus fine-tune your architecture.
Another vendor that talks all about data flow is Actian. Formerly known as Ingres, the renamed vendor recently went on an acquisition spree, folding ParAccel, Pervasive Software and Versant into its data-oriented platform portfolio. Mike Hoskins, once the CTO of Pervasive, is now the CTO of Actian, and can be credited with having the vision years ago to build a parallel data flow platform, which originally went by the name of DataRush but is now simply referred to as Actian DataFlow. Actian’s view of the Big Data landscape involves Hadoop as a natural data collection vehicle, with its DataFlow product as a means either of processing Hadoop data in situ or of flowing it (also employing its data integration products) to an appropriate data engine. It has several such engines, including Matrix (a scale-out analytical engine once known as ParAccel) and Vector (a scale-up engine).
And then there’s Alpine Data Labs, a vendor that offers a cloud-based data science workbench. The collection of Alpine offerings, some of which derive from Greenplum Chorus, provides a wide range of functionality for doing all things data: modeling, visualizing, crunching, analyzing. And when you push the big red button to make the magic happen, you get a neat visual display of where the process is at any given point. This is both functional and didactic, helping aspiring data scientists better understand what’s happening where.
Like almost all data management vendors, Alpine touts a user-friendly, self-service environment. That said, the “self” who serves in such an atmosphere needs to be a very savvy information executive, someone who understands a fair amount about all the nuts and bolts of data lifecycle management. And though Alpine also talks of no data movement, what they really mean by that is no data movement in the old ETL-around-the-mulberry-bush sense. You still need to move data into the cloud and set your update schedule, which incidentally runs via REST.
Of course, in a certain sense, there’s not too much new under the sun. After all, data flow happened in the very earliest information systems, even in the punch card era. But these days, the visibility into that movement will provide game-changing awareness of what data is, does, and can be.