
Concurrent, Inc. Continues to Expand Supported Ecosystem to Deliver Deep Visibility and Insight for Hadoop Applications, Announces Driven 1.3

New Release Offers Advanced Team Collaboration and Adds Support for Apache Hive, MapReduce, Cascading 3.0 and Apache Tez 

SAN FRANCISCO – Aug. 25, 2015 – Concurrent, Inc., the leader in Big Data application infrastructure, today announced the latest release of Driven, the industry’s leading application performance management product for monitoring and managing Hadoop applications. Driven is built to address the challenges of business-critical Big Data application development and deployment, delivering control and performance management for enterprises seeking to achieve operational excellence on Hadoop.

Driven offers enterprise users – developers, operations and lines of business – unprecedented visibility into applications written in Cascading, Scalding, Cascalog, Apache Hive and MapReduce. It provides deep operational insights, search, segmentation and visualizations for rapid troubleshooting and performance management. To achieve this, Driven collects rich operational metadata in a scalable data repository, enabling users to isolate, control and report on almost any topic relevant to their business, such as application SLAs, KPIs and data lineage.

The latest version of Driven includes:

Advanced application performance analytics: Customizable application views include key statistics about application performance over time. This enhancement also includes anomaly detection – the ability to go back in time to determine when an anomaly occurred and examine the environment at that point to identify the cause of the problem.

Deeper collaboration and sharing: The ability to share a customized analytic, application or status view ensures that teams are referencing the same data when troubleshooting a problem. When sharing a view, users can choose whether to share it only with other team members or with any individual. 

Enhanced SLA management: Users now have the option to set SLA thresholds and alerting. For example, users can set duration thresholds to report on all applications that exceed their allotted run-time. 

Plug-in agent for Apache Hive and MapReduce: Driven now supports Apache Hive and MapReduce. With Driven’s agent technology, enterprises can seamlessly and transparently collect all the operational intelligence for Apache Hive and MapReduce jobs and tasks, delivering all the rich capabilities and operational analytics offered by Driven.

Cascading 3.0 and Apache Tez Support: This new support enables Cascading 3.0 users to leverage all the capabilities of Driven to manage and monitor applications running on multiple compute fabrics.

Driven is a proven performance management solution that enterprises can rely on to deliver against their data strategies. Benefits of Driven include accelerated application development cycles, immediate application failure diagnosis, improved application performance, easier audit reporting and reduced cluster utilization costs.

Driven can be accessed now for free at http://drivenio.staging.wpengine.com/choose-trial.

Supporting Quotes

“Enterprise needs have not changed, and as Hadoop pushes further into the mainstream, running business critical data processes has challenges. Enterprises are grappling with basic visibility, data governance, compliance and performance management. Driven arms enterprises with the right solution to deliver against their big data strategies and move plans forward.”

– Gary Nakamura, CEO, Concurrent, Inc.

Supporting Resources

About Concurrent, Inc.

Concurrent, Inc. is the leader in data application infrastructure, delivering products that help enterprises create, deploy, run and manage data applications at scale. The company’s flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading, the most widely deployed technology for data applications with more than 500,000 user downloads a month. Used by thousands of businesses including eBay, Etsy, The Climate Corp and Twitter, Cascading is the de facto standard in open source application infrastructure technology. Concurrent is headquartered in San Francisco and online at http://concurrentinc.com.

Media Contact
Danielle Salvato-Earl
Kulesa Faul for Concurrent, Inc.
(650) 922-7287
concurrent@kulesafaul.com

 

An Inside View of Mainstream Enterprise Hadoop Adoption

Nicole Hemsoth
June 1st 2015

Few organizations have holistic insight into how the overall Hadoop ecosystem is trending. From the analysts to the Hadoop vendors themselves, every view of the market tends to be somewhat incomplete, either because it reflects a single vendor’s slice of adoption or because the market data is already outdated by the time it has been analyzed.

But for a company like Concurrent, the vendor behind Cascading, an application development platform that lets users build data-oriented applications on top of Hadoop in a more streamlined way, there are no blind spots when it comes to seeing the writing on the Hadoop wall. And Concurrent has had plenty of time to watch the Hadoop story unfold in full. Beginning in 2008 and continuing through some big-name use cases at web-scale companies, the small company (now 25 people, 60 percent of whom are engineers) saw the first flush of the Hadoop boom and rode the tide, raising a combined total of close to $15 million in investments.

Beyond what Concurrent does for Hadoop end users who want to build and deploy applications on the framework, and to monitor and track their performance via the Driven tooling rolled out this month, part of what makes the company interesting at a high level is its unique insight into how companies are actually using Hadoop at scale.

As Gary Nakamura, CEO of Concurrent, tells The Platform, while they do watch what the analyst groups say about Hadoop’s rise through the mainstream enterprise ranks, their insight into what’s actually happening in the market through the use of their Cascading (and now Driven) product lines extends across all three of the major Hadoop distribution vendors as well as the open source version. While they do not have specific numbers to share beyond an approximate 290,000 downloads of Cascading per month, Nakamura says there are distinct adoption trends they have been tracking that show healthy (although not meteoric) growth of Hadoop adoption for production workloads in mainstream enterprise settings.

By mainstream, Nakamura means large to mid-sized companies in telco, finance, healthcare, and retail. These users take a “very pragmatic approach that tends to follow milestones with 0-24 months being the experimental phase” before shifting out to build larger clusters and add more applications to the ranks. “We have many mainstream enterprise users who are somewhere in that 24 months to seven years category with an average node count for Hadoop workloads being somewhere between 100 and 200, although we also have users in the mainstream [not the LinkedIn, Twitter, and similar companies] with around 1500 nodes.”

“Beyond the startups and bleeding edge, for the mainstream world, this pragmatism extends to wanting to show a particular ROI for specific projects, then they move out to look at how they can move other applications. For mainstream users though, this is a measured approach in part because this is not a trivial expense. And even though the chasm will close slower than people think with Hadoop adoption, it will happen. If you look at MapR, Cloudera, and Hortonworks, they are getting a lot of new logos each quarter, the growth is there, but mainstream companies are very measured in how they are looking at Hadoop, especially at that 0-24 month stage, where a lot of them are,” Nakamura explains.

The leading edge companies tend to jump immediately into the deep end, but for the high-value companies that are in the infant stages (where the majority of mainstream companies are now, according to Nakamura), it’s about proving an ROI on a Hadoop investment. The Driven product announced this month allows for complete monitoring and visualization of the health and performance of individual jobs as well as cluster-level metrics, in part to give these mainstream enterprises something that aids in their ability to show the value of the jobs they’ve pushed to Hadoop – oftentimes legacy applications that are moved in pieces, one by one to start, before more applications are moved with the help of Cascading to build out the broader Hadoop strategy.

“These mainstream users tend to come to us very well informed about Hadoop. They know what they are doing. Their questions by the time they get to us are more clarifying – how many deployments are there, are there similar cases of migrating business-level use cases to Hadoop that we’ve worked with before. They do not want to be the guinea pigs, in other words.” Nakamura notes that the driver for Hadoop adoption among these users is not coming from the top down (there are no directives from CIOs demanding a Hadoop strategy); rather, users are seeing clear opportunities for their workloads to run on Hadoop and want to be able to use Cascading to build and deploy new applications, then be able to confirm progress using the new Driven tool, especially since accountability with so many different stakeholders is critical.

Once larger mainstream enterprise users get beyond the initial growing pains (past 24 months), Nakamura says they start to see how they can take advantage of the reusable components and connectors of Cascading, which let them roll developments made on one application into another. “We see this as the path to growing Hadoop at these enterprises; they can create new applications much quicker to do the second, the third, then before they know it, forty applications. We have users now that started this way three years ago and now have 800 applications in production.”

Nakamura does not disagree that there has been a leveling of the steep adoption curve we saw over the last couple of years, but says that with so many companies in that early 0-24 month stage, he expects another rush around the bend.

11 tools for a healthy Hadoop relationship

Supreet Oberoi
May 8th 2015

http://thenextweb.com/dd/2015/05/08/11-tools-for-a-healthy-hadoop-relationship/

I’m often asked which Hadoop tools are “best.” The answer, of course, is that it depends, and what it depends on is the stage of the Hadoop journey you’re trying to navigate. Would you show up with a diamond ring on a first date? Would you arrange dinner with your spouse by swiping right on Tinder? In the same way, we can say that some Hadoop tools are certainly better than others, but only once we know where you are in your data project’s lifecycle.

Here then, are the major stages of a happy and fulfilling relationship with Hadoop, and the tools and platforms most appropriate for navigating each one.

Dating: Exploring your new Hadoop partner

When a business analyst is first matched with a Hadoop deployment, he or she is typically drawn by great promise. What kind of data does the deployment hold? What does that data actually look like? Can you combine it with other data sets to learn something interesting?


Answering these questions doesn’t require fancy, large-volume clusters. What’s ideal is a simple, Excel-like interface for reading a few rows, teasing out the fields and getting to know the distribution. At this data exploration stage, visualization platforms like Datameer and Trifacta deliver excellent productivity. If you’re comfortable with SQL, you could also try Hive, but that might be overkill given the learning curve.

One reason that some folks fail to make it past a first date with Hadoop is that they equate visualization with reporting; they go right to a BI tool like Tableau.  The visualization tools above are easier to use and therefore better suited to exploration and quick hypothesis building.

Moving In: Getting comfortable with further Hadoop developments

Before jumping into a production-grade data application, the analyst needs to see a more concrete vision of it if he or she hopes to win the organization’s blessings — and funding. In traditional software terms, it’s like building a prototype with Visual Basic to get everyone on board, without worrying about all the customer-support scaffolding that will eventually be necessary. Basically, you want a nice, slick hack.


Here again, the visualization tools can be helpful, but even better are GUI-based application development platforms like SnapLogic, Platfora and Pentaho. These tools make building basic applications fast and easy, and they’re relatively simple to learn.

Tying the knot: Committing to a production relationship with Hadoop

Once a prototype’s value is recognized, the enterprise typically turns the application over to an operations team for production. That team cares not only about the application’s functionality, but also about its capacity to be deployed and run reliably, its scalability and its portability across data fabrics.

The operations folks must also make sure the application’s execution honors service-level agreements (SLAs), and that it integrates seamlessly with inbound data sources like data warehouses, mainframes and other databases, as well as with outbound systems such as HBase and Elasticsearch.

At this production deployment stage, development and operations teams typically forgo the GUI-based platforms, opting instead for the control and flexibility of API-based frameworks such as Cascading, Scalding and Spark. (Full disclosure: I work for Concurrent, the company behind Cascading.)

With these you get not only the control necessary to engineer applications that address the complexity of your enterprise, but also test-driven development, code reusability, complex orchestration of data pipelines, and continuous performance tuning. These capabilities are vital for production-grade applications, and are not available in GUI-based platforms or ad hoc query tools.
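To make the contrast with the GUI platforms concrete, here is a minimal sketch of what a pipeline looks like in one of those API-based frameworks, Scalding, the Scala DSL built on Cascading. The job class, field names and input/output paths below are purely illustrative assumptions, not taken from any particular deployment.

```scala
import com.twitter.scalding._

// Minimal Scalding job: read tab-separated order records, keep completed
// orders, and total the revenue per customer. Field names and paths are
// hypothetical.
class RevenueByCustomerJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('customerId, 'status, 'amount))
    .filter('status) { status: String => status == "COMPLETED" }
    .groupBy('customerId) { _.sum[Double]('amount -> 'totalRevenue) }
    .write(Tsv(args("output")))
}
```

Because the pipeline is ordinary code, it can be unit-tested (Scalding ships a JobTest harness for exactly this), kept in version control and reused across applications – which is where the test-driven development, reusability and orchestration benefits mentioned above come from.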


Building a big data family: Nurturing a thriving nursery of Hadoop apps

Once a team is married to a big data platform, they soon find themselves with a growing family of applications that execute on a common Hadoop cluster (or other shared resource).

There are two big challenges at this stage. First, each application must continue to operate under defined controls. These include SLAs and governance, of course, but it’s also crucial to prevent unauthorized use of sensitive data fields across use cases and to ensure that proper privacy curtains are respected.

Second, production teams must carefully manage how a brood of apps uses the shared cluster resources. Specifically, utilization should reflect the relative importance of lines of business and use cases. In addition, teams must make sure that no application goes hungry for compute resources, either now or in the future (the discipline known as capacity planning).

The ideal tools at this stage deliver performance monitoring for enterprises looking to achieve operational excellence on Hadoop. These include Driven, Cloudera Manager, and even Hadoop’s out-of-the-box Job Tracker. (Full disclosure: my company, Concurrent, also makes Driven.)

Maturing with big data: Fostering a rich, lifelong partnership with Hadoop

Like any relationship, the one with Hadoop can become more complex as it matures. Typically, the enterprise will transform production data applications into rich, enterprise-class data products, both by piping them into downstream systems (for example, a recommendation engine) and by making their output available to businesspeople to foster better decisions.


At this most advanced stage in the data product lifecycle, the enterprise’s needs are the most diverse. On top of your production-ready development platform, you’ll want highly intelligent pipeline monitoring. For example, if one application encounters a problem and a downstream one depends on its output, you’ll want to raise an alert so your team can react quickly to resolve the problem.

You’ll also want tools that quickly pinpoint whether a performance bottleneck resulted from a problem in code, data, hardware or shared resources (your cluster configuration). Driven, Cloudera Manager, and Hadoop’s Job Tracker are also helpful here. You may also want to give businesspeople an easy, flexible way of getting the results they want, while still offering a level of abstraction from the compute interface – a query framework like SQL-based Hive is a great choice here.

To choose the right tool, know where you are in the Hadoop relationship cycle 

In conclusion, before you can answer which tools and approaches work best, you have to understand where you are in the lifecycle of your data operation. Like any true love, your relationship with Hadoop will mature over time, and you’ll need to reevaluate your needs as they mature along with it.

Hadoop and beyond: A primer on Big Data for the little guy

Alexandra Weber Morales
April 28th 2015

http://sdtimes.com/hadoop-and-beyond-a-primer-on-big-data-for-the-little-guy/

Have you heard the news? A “data lake” overflowing with information about Hadoop and other tools, data science and more threatens to drown IT shops. What’s worse, some Big Data efforts may fail to stay afloat if they don’t prove their worth early on.

“Here’s a credible angle on why Big Data could implode,” began Gary Nakamura, CEO of Concurrent, which makes Cascading, an open-source data application development platform that works with Hadoop, and Driven, a tool for visualizing data pipeline performance. “A CTO could walk into a data center, and when they say, ‘Here is your 2,000-node Hadoop cluster,’ the CTO says, ‘What the hell is that and why am I paying for it?’ That could happen quite easily. I predicted last year that this would be the ‘show me the money’ year for Hadoop.”

While plenty can go wrong, Nakamura is bullish on Hadoop. With companies like his betting robustly on the Hadoop file system (and its attendant components in the Big Data stack), now is a strategic moment to check your data pipelines for leaks. Here’s a primer on where the database market stands, what trends will rock the boat, and how to configure your data science team for success.

Follow the leader
Risks aside, no one – not even the federal government – is immune to the hope that Big Data will bring valuable breakthroughs. Data science has reached presidential heights, with the Obama administration’s appointment of former LinkedIn and RelateIQ quantitative engineer DJ Patil as the United States’ first Chief Data Scientist in February. If Patil’s slick talks and books are any indication, he is at home in a political setting. Though building on government data isn’t new for many companies offering services in real estate (Zillow), employment (LinkedIn), small business (Intuit), mapping (ESRI) or weather (The Climate Corporation), his role should prompt many more to innovate with newly opened data streams via the highly usable data.gov portal.

“I think it’s wonderful that the government sees what’s happening in the Big Data space and wants to grow it. I worked at LinkedIn for three years, and for a period of time [Patil] was my manager. It’s great to see him succeed,” said Jonathan Goldman, director of data science and analytics at Intuit. (Goldman cofounded Level Up Analytics, which Intuit acquired in 2013.)

Defining the kernel, unifying the stack
“In the last 10 years we’ve gone through a massive explosion of technology in the database industry,” said Seth Proctor, CTO of NuoDB. “Ten years ago, there were only a few dozen databases out there. Now you have a few hundred technologies to consider that are in the mainstream, because there are all these different applications and problem spaces.”

After a decade of growth, however, the Hadoop market is consolidating around a new “Hadoop kernel,” similar to the Linux kernel, and the industry standard Open Data Platform announced in February is designed to reduce fragmentation and rapidly accelerate Apache Hadoop’s maturation. Similarly, the Algorithms, Machines and People Laboratory (AMPLab) at the University of California, Berkeley is now halfway through its six-year DARPA-funded Big Data research initiative, and it’s beginning to move up the stack and focus on a “unification philosophy” around more sophisticated machine learning, according to Michael Franklin, director of AMPLab and associate chair of computer science at UC Berkeley.

“If you look at the current Big Data ecosystem, it started off with Google MapReduce, and things that were built at Amazon and Facebook and Yahoo,” he said at the annual AMPLab AMP Camp conference in late 2014. “The first place most of us saw that stuff was when the open-source version of Hadoop MapReduce came out. Everyone thought this is great, I can get scalable processing, but unfortunately the thing I want to do is special: I want to do graphs, I want to do streaming, I want to do database queries.

“What happened is, people started taking the concepts of MapReduce and started specializing. Of course, that specialization leads to a bunch of problems. You end up with stovepipe systems, and for any one problem you want to solve, you have to cobble together a bunch of systems.

“So what we’re doing in the AMPLab is the opposite: Don’t specialize MapReduce; generalize it. Two additions to Hadoop MapReduce can enable all the models.”

First, said Franklin, general directed acyclic graphs will enrich the language, adding more operators such as joins, filters and flattening. Second, data sharing across all the phases of the program will enable performance that is no longer disk-bound.
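Spark, which came out of the AMPLab, embodies both of those ideas. The sketch below is illustrative only – the input paths, record layouts and the country filter are assumptions – but it shows joins and filters expressed as first-class operators in a general DAG, with cache() sharing an intermediate result in memory across two downstream computations instead of re-reading it from disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-sketch"))

    // Two hypothetical tab-separated inputs: (userId, url) and (userId, country).
    val clicks = sc.textFile("hdfs:///data/clicks").map(_.split("\t")).map(f => (f(0), f(1)))
    val users  = sc.textFile("hdfs:///data/users").map(_.split("\t")).map(f => (f(0), f(1)))

    // A general DAG: join and filter are expressed directly, not contorted into map/reduce phases.
    val joined = clicks.join(users)                                  // (userId, (url, country))
      .filter { case (_, (_, country)) => country == "DE" }
      .cache()                                                       // keep the shared result in memory

    // Two downstream computations reuse the cached data without another pass over disk.
    val clicksPerUser = joined.mapValues(_ => 1L).reduceByKey(_ + _)
    val clicksPerUrl  = joined.map { case (_, (url, _)) => (url, 1L) }.reduceByKey(_ + _)

    clicksPerUser.saveAsTextFile("hdfs:///out/clicks-per-user")
    clicksPerUrl.saveAsTextFile("hdfs:///out/clicks-per-url")
    sc.stop()
  }
}
```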

Thus, the Berkeley Data Analytics Stack (BDAS, pronounced “Badass”) starts with the Hadoop Distributed File System storage layer and resource virtualization via Mesos and YARN. A new layer called Tachyon provides caching for reliable cross-cluster data sharing at memory speed. Next is the processing layer with AMPLab’s own Spark and Velox Model Serving. Finally, access and interfaces include BlinkDB, SampleClean, SparkR, Shark, GraphX, MLlib and Spark Streaming.

On the commercial side, Concurrent is one of many companies seeking to simplify Hadoop for enterprise users. “I’ve seen many articles about Hadoop being too complex, and that you need to hire specialized skill,” said Nakamura. “Our approach is the exact opposite. We want to let the mainstream world leverage all they have spent and execute on that strategy without having to go hire specialized skill. Deploying a 50-node cluster is not a trivial task. We solve the higher-order problem: We’re going to help you run and manage data applications.”

The case for enterprise data warehousing
One of the most compelling scenarios for Hadoop and its ilk isn’t as exciting as new applications, but some say it is an economic imperative. With the cost of traditional data warehouses running 20x to 40x higher per terabyte than Hadoop, offloading enterprise data hubs or warehouses to Hadoop running on commodity hardware does more than just save cents and offer scalability; it enables dynamic data schemas for future data science discoveries, rather than the prescriptive, static labels of traditional relational database management systems (RDBMS).

According to the website of Sonra.io, an Irish firm offering services for enterprise Hadoop adoption, “The data lake, a.k.a. Enterprise Data Hub, will not replace the data warehouse. It just addresses its shortcomings. Both are complementary and two sides of the same coin. Unlike the monolithic view of a single enterprise-wide data model, the data lake relaxes standardization and defers modeling… If implemented properly, this results in a nearly unlimited potential for agile insight.”

The fight for the hearts of traditional enterprise data warehouse (EDW) and RDBMS customers isn’t a quiet one, however. Some of the biggest posturing has come from the great minds behind seminal database technologies, such as Bill Inmon, “the father of data warehousing.”

Inmon has criticized Cloudera advertisements linking Big Data to the data warehouse. “While it is true that the data warehouse often contains large amounts of data, there the comparison to Big Data ends. Big Data is good at gobbling up large amounts of data. But analyzing the data, using the data for integrated reporting, and trusting the data as a basis for compliance is simply not in the cards,” he wrote on his blog.

“There simply is not the carefully constructed and carefully maintained infrastructure surrounding Big Data that there is for the data warehouse. Any executive that would use Big Data for Sarbanes-Oxley reporting or Basel II reporting isn’t long for his/her job.”

Ralph Kimball is the 1990s rival who countered Inmon’s top-down EDW vision with a proposal for small, star or snowflake schema-based data marts that form a composite, bottom-up EDW. Kimball has boarded the Big Data train, presenting a webinar with Cloudera about building a data warehouse with Hadoop.

“The situation that Hadoop is in now is similar to one the data community was in 20 years ago,” said Eli Collins, chief technologist for Cloudera. “The Inmon vs. Kimball debates were about methodologies. I don’t see it as a hotly debated item anymore, but I think it’s relevant to us. Personally, I don’t want to remake mistakes from the past. The Kimball methodology is about making data accessible so you can ask questions. We want to continue to make data more self-service.”

“We can look to history to see how the future is going,” said Jim Walker, director of product marketing at Hortonworks. “There is some value to looking at the Inmon vs. Kimball debate. Is one approach better than another? We’re now seeing a bit more of the Inmon approach. Technology has advanced to the point where we can do that. When everything was strictly typed data, we didn’t have access. Hadoop is starting to tear down some of those walls with schema-on-read over schema-on-write. Technology has opened up to give us a hybrid approach. So there are lessons learned from the two points of view.”

The idea that the data lake is going to completely replace the data warehouse is a myth, according to Eamon O’Neill, director of product management at HP Software in Cambridge, Mass. “I don’t see that happening very soon. It’s going to take time for the data lake to mature,” he said.

“I was just talking to some entertainment companies that run casinos here at the Gartner Business Intelligence and Analytics Summit, and a lot of the most valuable data, they don’t put in Hadoop. It may take years before that’s achieved.”

However, there are cases where it makes sense, continued O’Neill. “It depends on how sensitive the data is and how quickly you want an answer. There are kinds of data you shouldn’t pay a high price for, or that when you query you don’t care if it takes a while. You put it in SAP or Oracle when you want the answer back in milliseconds.”

Data at scale: SQL, NoSQL or NewSQL?
Another bone of contention is the role of NoSQL in solving the scalability limitations of SQL and RDBMS systems when faced with Web-scale transaction loads. NoSQL databases relax the ACID (atomicity, consistency, isolation and durability) guarantees to increase performance. MongoDB, Apache HBase, Cassandra, Accumulo, Couchbase, Riak and Redis are among the popular NoSQL choices, each offering different optimizations depending on the application (such as streaming, columnar data or scalability).

“One of the brilliant things about SQL is that it is a declarative language; you’re not telling the database what to do, you’re telling it what you want to solve,” said NuoDB’s Proctor, whose company is one of several providing NewSQL alternatives that combine NoSQL scalability with ACID promises. “You understand that there are many different ways to answer that question. One thing that comes out of the new focus on data analysis and science is a different view on the programming model: where the acceptable latencies are, how different sets of problems need different optimizations, and how your architecture can evolve.”

NuoDB was developed by database architect Jim Starkey, who is known for such contributions to database science as the blob column type, event alerts, arrays and triggers. Its distributed, shared-nothing architecture aggregates opt-in nodes into a SQL/ACID-compliant database that “behaves like a flock of birds that fly in an organized fashion but without a central point of control or a single point of failure,” according to Proctor.

Elsewhere, MIT professor and Turing Award winner Michael Stonebraker, the “father of the modern relational database,” cofounded VoltDB in 2009, another NewSQL offering that claims “insane” speed. The company recently touted a benchmark of 686,000 transactions per second for a Spring MVC-enabled application using VoltDB. Another company he founded, Vertica, was acquired by HP Software in 2011 and added to its Haven Big Data platform.

“Vertica is a very fast, scalable data engine with a columnar database architecture that’s very good for OLAP queries,” explained HP’s O’Neill. “It runs on commodity servers and scales out horizontally. Very much like Hadoop, it’s designed to be a cluster of nodes. You can use commodity two-slot servers, and just keep adding them.”

On the SQL side, HP took the Vertica SQL query agent and put it on Hadoop. “It’s comparable to other SQL engines on Hadoop,” said O’Neill. “It’s for when the customer is just in the stage of exploring the data, and they need a SQL query view on it so they can figure out if it’s valuable or not.”

On the NewSQL side, HP Vertica offers a JDBC key-value API to quickly query data on a single node for high-volume requests returning just a few results.

With the explosion of new technologies, it’s likely that the future will include more database specialization. Clustrix, for example, is a San Francisco-based NewSQL competitor that is focusing on e-commerce and also promoting the resurgence of SQL on top of distributed shared nothing architectures.

SQL stays strong
Meanwhile, movements at Google and Facebook are showing the limitations of NoSQL.

“Basically, the traditional database wisdom is just plain wrong,” said Stonebraker in a 2013 talk at MIT where he criticized standard RDBMS architectures as ultimately doomed. According to him, traditional row-based data storage cannot match column-based storage’s 100x performance increase, and he predicted that online analytical processing (OLAP) and data warehouses will migrate to column-based data stores (such as Vertica) within 10 years.

Meanwhile, there’s a race among vendors for the best online transaction processing (OLTP) data storage designs, but classic models spend the bulk of their time on buffer pools, locking, latching and recovery—not on useful data processing.

Despite all that, operational data still relies on SQL—er, NewSQL.

“Hadoop was the first MapReduce system that grabbed mindshare,” said Proctor. “People said, ‘Relational databases won’t scale.’ But then about a year and a half ago, Google turned around and said, ‘We can’t run AdWords without a relational database.’ ”

Proctor was referring to Google’s paper “F1: A Distributed SQL Database that Scales.” In it, Google engineers took aim at the claims of “eventual consistency,” which could not meet the hard requirements they faced with maintaining financial data integrity.

“…Developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date,” according to the F1 paper. “We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level. Full transactional consistency is one of the most important properties of F1.”

“Intuit is going through a massive transformation to leverage data—not just to understand customers, but also to feed it back into our products,” said Lucian Lita, cofounder of Level Up Analytics and now director of data engineering at Intuit. “We’re building a very advanced Big Data platform and putting Intuit on the map in terms of data science. We’re educating internally, contributing to open source and starting to have a good drumbeat.”

As an example, QuickBooks Financing uses data to solve a classic small business problem: they need financing to grow, but they’re so new that they can’t prove they are viable. “Intuit uses Big Data techniques to get attributes of your business, score it, and we should get something like the normal 70% rejection rate by banks turned into a 70% acceptance rate,” said Intuit’s Goldman.

“Small businesses don’t have access to data like Walmart and Starbucks do. We could enable that: Big Data for the little guy,” he said.

The experience of running a data science startup didn’t just translate into being acquired by an established enterprise. It also gave Goldman, Lita and cofounder Anuranjita Tewary, now director of product management at Intuit, insight into how to form effective data science teams.

“When we were working at Level Up Analytics, we spoke with over 100 companies building data products,” Tewary said. “This helped us understand how to structure teams to succeed. It’s important to hire the right mix of skills: product thinking, data thinking and engineering.”

When she looked at struggling data science projects at a multinational bank, a media company and an advertising company, Tewary saw common pitfalls. One of the most frequent? “Treating the data product as a technology project, ending up with not much to show for it and no business impact,” she said.

“It was more, ‘What technology should we have just for the sake of having this technology?’”

Because the tools have gotten easier to use, the idea of having Big Data for Big Data’s sake (and running it in a silo) may not be long for this world.

“It used to be that we were selling tools to IT. Now we think about analytics tools we can sell to business,” said HP’s O’Neill. “Increasingly, data scientists live in the marketing or financial departments, trying to predict what’s going to happen.”

In 1985, the physicist and statistician Edwin Thompson Jaynes wrote, “It would be very nice to have a formal apparatus that gives us some ‘optimal’ way of recognizing unusual phenomena and inventing new classes of hypotheses that are most likely to contain the true one; but this remains an art for the creative human mind.”

That quote is one of the inspirations behind Zoubin Ghahramani’s Automatic Statistician, a Cambridge Machine Learning Group project that won a $750,000 Google Focused Research Award. Using Bayesian inference, the Automatic Statistician examines unstructured data, explores possible statistical models that could explain it, and then reports back with 10 pages of graphics and natural language describing patterns in the data.

IBM’s Watson and HP’s IDOL are trying to solve the unstructured data problem. The fourth technology on HP’s Haven Big Data platform, IDOL is for parsing “human data”: prose, e-mails, PDFs, slides, videos, TV, voicemail and more. “It extracts from all these media, finds key concepts and indexes them, categorizes them, performs sentiment analysis—looking for tones of voice like forceful, weak, angry, happy, sad,” said O’Neill. “It groups similar documents, categorizes topics into taxonomies, detects language, and makes these things available for search.”

As the F1 paper explained, “conventional wisdom in the engineering community has been that if you need a highly scalable, high-throughput data store, the only viable option is to use a NoSQL key/value store, and to work around the lack of ACID transactional guarantees and the lack of conveniences like secondary indexes, SQL, and so on. When we sought a replacement for Google’s MySQL data store for the AdWords product, that option was simply not feasible: the complexity of dealing with a non-ACID data store in every part of our business logic would be too great, and there was simply no way our business could function without SQL queries.

“Instead of going NoSQL, we built F1, a distributed relational database system that combines high availability, the throughput and scalability of NoSQL systems, and the functionality, usability and consistency of traditional relational databases, including ACID transactions and SQL queries.”

While this results in higher commit latency at the database level, improvements in the client application have kept observable end-user speed as good as or better than before, according to Google engineers.

In a similar vein, Facebook created Presto, a SQL engine optimized for low-latency interactive analysis of petabytes of data. Netflix’s Big Data Platform team is an enthusiastic proponent of Presto for querying a multi-petabyte scale data warehouse for things like A/B tests, user streaming experiences or recommendation algorithms.

FoundationDB is another hybrid NewSQL database whose proprietary NoSQL-style core key value store can act as universal storage with the transactional integrity of SQL DBMS. Unfortunately, many FoundationDB users were surprised in March when Apple purchased the ISV, possibly for its own high-volume OLTP needs, and its open-source components were promptly removed from GitHub.

Building better data science teams
As with any technology initiative, the biggest success factors aren’t the tools, but the people using them.

Intuit is a case in point: With the 2013 acquisition of Level Up Analytics, the consumer tax and accounting software maker injected a team of data science professionals into the heart of its business.

Futuristic as the possibility of parsing both human data and the data “exhaust” from the Internet of Things sounds, the technology may be the easy part. As the textbook “Doing Data Science” by Cathy O’Neil and Rachel Schutt explained, “You all are not just nerds sitting in the corner. You have increasingly important ethical questions to consider while you work.”

Ultimately, user-generated data will form a feedback loop, reinforcing and influencing subsequent user behaviors. According to O’Neil and Schutt, for data science to thrive, it will be critical to “bring not just a set of machine learning tools, but also our humanity, to interpret and find meaning in data and make ethical, data-driven decisions.”

How to find a data scientist
Claudia Perlich knows quite a bit about competition: She’s a three-time winner of the KDD Cup, the ACM’s annual data mining and knowledge discovery competition. Her recommendation for finding a qualified data scientist? Use a platform like Kaggle and make it a contest.

First, however, you must answer the hardest question: What data science problem to solve. “This is typically one of the hardest questions to answer, but very much business-dependent,” she said. Once you have formulated a compelling question, a Kaggle competition “will get you almost surely the highest-performance solution emerging from the work of thousands of competing data scientists. You could even follow up with whoever is doing well and hire them.

“It turns out that many companies are using Kaggle as a screening platform for DS applicants. It is notably cheaper than paying a headhunter, and you can be confident that the person you hire can in fact solve the problem.”

9 Business Intelligence and Analytics Predictions for 2015

Drew Robb
January 16, 2015

http://www.enterpriseappstoday.com/business-intelligence/9-business-intelligence-and-analytics-predictions-for-2015.html

What’s ahead for business intelligence and data analytics in 2015?

When we asked analysts, forecasters and pundits for their predictions for the most relevant business intelligence and analytics trends for 2015, they came up with a host of topics, including mobile maturity, Hadoop growth, better text analysis and incorporation of the Internet of Things (IoT) into analytics.

Here is what they saw as they gazed into their crystal balls.

Better Mobile BI

Mobile business intelligence has been all the rage for a couple of years now. But reality hasn’t quite matched up to expectations. Adi Azaria, co-founder of Sisense, said it has gotten to the point where mobile requirements are forcing a fundamental change in the approach to BI.

“Rather than elaborate visualizations, you will see hard numbers, simple graphs and conclusions,” Azaria said. “For instance, with wearable devices you might look at an employee and quickly see the key performance indicator.”

Ellie Fields, vice president of Product Marketing at Tableau, also sees progress being made on mobile, which she thinks will translate into more capabilities in the hands of sales and field service personnel.

“The level of maturity being achieved will help mobile workers accomplish light analysis on the road,” she said.

Year of Hadoop

Hadoop has been a high visibility item for a couple of years now. But Gary Nakamura, CEO of Concurrent, believes Hadoop will break through and become a worldwide phenomenon in 2015. A sign that this is occurring, he said, is a growing wave of Hadoop-related acquisitions and IPOs. “Hadoop is rapidly spreading across Europe, Asia and other parts of the world so there will be strong Hadoop adoption this year,” he said.

Text Analysis Matures

While Hadoop is likely to be big this year, not everyone thinks it will realize its huge potential over the course of 2015. An intermediate stage that incorporates better text analysis may be required on the way to realizing the Big Data dream, Azaria said.

“Unstructured data has posed many obstacles in the past, but will come into its own in 2015,” said Azaria. “Text analysis will gain increasing traction, with Web data, documents and images, and companies finally able to tackle unstructured data in meaningful ways.”

IoT Analytics

First we used databases for analysis, then we added in unstructured data sources and data from a far greater number of mobile devices. Up that by two or three orders of magnitude as the Internet of Things (IoT) takes hold.

“When you have millions of devices, systems and machines connected on the Internet generating all kinds of machine log data, making sense of this data becomes a strategic opportunity for manufacturers, service providers and end users,” said Puneet Pandit, CEO of Glassbeam. “The use cases of this application include support automation, remote diagnostics, predictive maintenance, installed base analysis, product quality reporting and service revenue generation.”

Bob Muglia, CEO of Snowflake Computing, pointed out that in addition to IoT data, companies will make better use of log data and device data, which in the past has been largely collected and analyzed in isolated silos using special-purpose tools.

“There is an emerging recognition of the enormous value in analyzing machine data together with transactional data,” said Muglia. “This recognition will catalyze a rapid shift to data processing and analytics that enables business analysts to gain deeper business insight through the combination of structured and semi-structured data for analysis inside a single system.”

Democratic Data Rules

The strategy of bringing democracy to the Middle East hasn’t exactly panned out. But democracy is coming to analytics, according to Pandit. Business people inside enterprises are increasingly moving away from relying on internal IT to provide analytics. Software-as-a-Service (SaaS) was the first wave of this. The next step is further simplification of the application layer, with business users demanding real-time, ad hoc analysis so they can perform their own “what-if” analyses, find hidden nuggets in data, and build charts, graphs and dashboards and then publish them among their user community.

Sustained Analytics Growth

Gartner numbers show that advanced analytics has surpassed the $1 billion per year mark. That makes it the fastest-growing segment of the business intelligence and analytics software market. Gartner analyst Alexander Linden expects that level of growth to continue as more and more business units gain access to applications that can harness analytics.

Eric Berridge, CEO of Bluewolf, is equally bullish.

“Investments in predictive analytics and data intelligence will explode,” he said. “Seventy-one percent of companies will increase their investments in data intelligence and predictive analytics in the coming year,” he added, citing Bluewolf research.

IT/Business Partner on Analytics

IT used to call the shots on what business intelligence and analytics applications were deployed and who had access to them. Then came SaaS and cloud-based freedom. Brett Azuma, an analyst at 451 Research, predicts that we are about to enter an era of compromise where both sides will have a say in more of a peaceful coexistence.

“Self-service data preparation and harmonization will complement and coexist with IT’s traditional data management tools, which will continue to address critical issues around data security, compliance and governance,” Azuma said.

Invisible Analytics

Gartner analyst David Cearley predicts increasing invisibility in analytics as the volume of data involved grows and the trend toward embedded BI accelerates. To his mind, every application will be an analytic app to some degree, and the era of point analytic tools is gradually fading. “Analytics will become deeply, but invisibly, embedded everywhere,” he said.

Consumerization of Business Intelligence

Eldad Farkash, co-founder and CTO of Sisense, predicts that the term “business intelligence” will morph into “data intelligence” and that BI will become a more integral part of our lives – even outside the office.

“BI will finally evolve from being a reporting tool into data intelligence that every entity from governments to cities to individuals will use to prevent traffic, detect fraud, track diseases, manage personal health and even notify you when your favorite fruit has arrived at your local market,” Farkash said. “We will see the consumerization of BI, where it will extend beyond the business world and become intricately woven into our everyday lives directly impacting the decisions we make.”

Will this be the year of Hadoop? 6 predictions for 2015

January 8, 2015
Mike Wheatley
http://siliconangle.com/blog/2015/01/08/will-this-be-the-year-of-hadoop-6-predictions-for-2015/

With the New Year finally upon us it seems as good a time as any to ask where Hadoop, the open-source Big Data framework, will be heading in 2015.

SiliconANGLE pulled forecasts from an assortment of analysts and industry experts who’ve tried to second-guess the next big developments in Hadoop, and the overwhelming consensus is that adoption will accelerate within the enterprise, as more businesses build smart applications with real-time data analysis capabilities atop the platform.

1. More market consolidation

Only ten years have passed since Google published its MapReduce whitepapers, notes MapR CEO and Co-founder John Schroeder, which means Hadoop is still at a relatively youthful stage of the technology maturity life cycle. What we can expect to see throughout 2015 is Hadoop enter a period of consolidation, with the number of vendors fighting for a piece of the action narrowing down.

“Hadoop is early in the technology maturity life cycle,” said Schroeder. “In 2015, we will see the continued evolution of a new, more nuanced model of OSS to combine deep innovation with community development. The open-source community is paramount for establishing standards and consensus. Competition is the accelerant transforming Hadoop from what started as a batch analytics processor to a full-featured data platform.”

Saggi Neumann, CTO of Xplenty, agreed, telling SiliconANGLE that: “2015 will see more Big Data acquisitions and buyouts than ever before – In 2014, we witnessed the acquisitions of XA Secure, Hadapt, RainStor, DataPad and a few others. Both Cloudera and Hortonworks are now billion dollar companies and are eagerly looking to acquire more brains and technologies. Other giants such as HP, IBM, Oracle, Pivotal and Microsoft are knee-deep in Hadoop business and we’ve yet to see the end of M&As in the category.”

2. Enterprise adoption to gather pace

Forrester Research has already put its reputation on the line and said Hadoop will become an enterprise priority in 2015, and the experts tend to share that opinion. According to Gary Nakamura, CEO of Concurrent, Inc., Hadoop is all set to become a “worldwide phenomenon” in 2015.

“Hundreds of thousands of data points reported from the Cascading ecosystem support the notion that Hadoop is rapidly spreading across Europe and Asia and soon in other parts of the world,” said Nakamura. “Therefore, there will be a strong Hadoop adoption next year for enterprises ramping up their data strategy around Hadoop, creating new jobs, and further disrupting the data market worldwide.”

Laurent Bride, CTO of Talend, said a growing number of enterprises will begin deploying Hadoop in more than just proof-of-concept environments. “Hadoop will be used for day-to-day operations,” he said. “Organizations are still exploring how best to adopt Hadoop as the primary data warehouse technology. But as Hadoop is used more and the capabilities of YARN become fully realized, more useful opportunities leveraging technology like Apache Spark and Storm will emerge and quickly increase its potential. Even now, real-time/operational analytics are the fastest moving part of the Hadoop ecosystem, and it’s becoming evident that by 2020 Hadoop will be relied on for day-to-day enterprise operations.”

3. SQL to become a “must-have” with Hadoop

The SQL query language that’s so popular with developers will become one of the most popular ways of working with Hadoop, the experts reckon.

“Fast and ANSI-compliant SQL on Hadoop creates immediate opportunities for Hadoop to become a useful data platform for enterprises,” said Mike Gualtieri of Forrester Research. This will provide a sandbox for analysis of data that is not currently accessible.

Mike Hoskins, CTO at Actian, agreed, telling SiliconANGLE that: “SQL will be a ‘must-have’ to get the analytic value out of Hadoop data. We’ll see some vendor shake-out as bolt-on, legacy or immature SQL-on-Hadoop offerings cave to those that offer the performance, maturity and stability organizations need.”

4. No more Hadoop skills shortage

One of the more surprising developments Forrester expects is that the Hadoop skills shortage, which has been so well documented in the last couple of years, will evaporate in 2015. “CIOs won’t have to hire high-priced Hadoop consultants to get projects done,” noted Forrester’s report. “Hadoop projects will get done faster because the enterprise’s very own application developers and operations professionals know the data, the integration points, the applications and the business challenges.”

Key to this will be SQL on Hadoop, Gualtieri said, as it will open the door to familiar access with Hadoop data. Meanwhile, commercial vendors and the open-source community alike are both building better tools to make Hadoop easier for everyone to use.

5. Architecting around Hadoop

Cloudera chief technologist Eli Collins told SiliconANGLE he expects to see more users “architecting around Hadoop,” for example by using it as an Enterprise Data Hub, rather than just using it for bespoke operations. He also expects more users to consume Hadoop embedded within larger applications.

“Analytics is becoming an important part of a lot of applications, often a core part of the application itself so we’ll continue to see more of this,” said Collins. “Customers are applying Hadoop across a lot of industries as they’ve been doing for the last several years, they’re just adopting it more extensively as the platform becomes more capable, more accessible and better integrated with the other technologies they use.”

6. Rise of Hadoop + real-time analytics

As enterprises increase their contribution to the Hadoop ecosystem’s rising growth, and as Hadoop becomes a more attractive alternative to traditional database vendors, the demand for real-time and transactional analytics will rise significantly in 2015, said Ali Ghodsi, head of product management and engineering at Databricks.

“In 2015, enterprises will continue to evolve from the initial incarnation of making use of data through offline operations and significant manual intervention, to one in which organizations will make decisions on streaming data itself in real time, whether that be through anomaly detection, the Internet of Things, etc.,” said Ghodsi.

“Enterprises will need infrastructures that can scale and ingest any type and size of data from any source and perform a variety of advanced analytics techniques to identify meaningful insights in the necessary amount of time to make an impact on the business. The rise of compatible process engines such as Apache Spark will further enable Hadoop to help address these needs. This year, the approach to analytics will be more predictive to operational and relational.”

Year in Review: The Biggest Developments of 2014

December 29, 2014
http://www.information-management.com/dmradio/Big-Data-Year-Review-Podcast-10026303-1.html

What a difference a year makes! Over the past 12 months, we’ve witnessed the beginning of the end for traditional enterprise software. All the attention seems focused on Big Data, which belies the lingering array of data quality issues that continue to thwart efforts for handling good old-fashioned corporate “small” data. That said, the future of the data management biz has never been brighter, as evidenced by the monster-truck-sized investment that Cloudera received from Intel — a whopping $740 million for an 18% stake. Register for this episode of DM Radio to hear Host Eric Kavanagh interview Seth Proctor of NuoDB, Gary Nakamura of Concurrent, and two special guests.

A Decade On: The Evolution of Hadoop at Age 10

December 22, 2014
Scott Etkin
http://data-informed.com/decade-evolution-hadoop-age-10/

Apache Hadoop turns 10 in 2015. What started as an open-source project intended to enable Yahoo! Internet searches has become, in a relatively short time, the de facto architecture for today’s big data environments.

As big data exploded in 2014, Hadoop adoption and investment expanded along with it. Today, Hadoop is deployed across industries including advertising, retail, healthcare, social media, manufacturing, telecommunications, and government. But it won’t be long before companies begin demanding to see a return on their Hadoop investments.

“Hadoop has been rapidly adopted as the way to execute any go-forward data strategy,” said Gary Nakamura, CEO of Concurrent, Inc. “However, early adopters must now show return on investment, whether it’s migrating workloads from legacy systems or new data applications. Luckily, products and tools are evolving to keep pace with the trajectory of Hadoop.”

Indeed, Hadoop experts see the platform continuing to evolve and grow in 2015.

MapR Technologies CEO and co-founder John Schroeder predicts that, in 2015, new Hadoop business models will evolve and others will exit the market.

“We are now 20 years into open-source software adoption that has provided tremendous value to the market,” said Schroeder. “The technology lifecycle begins with innovation and the creation of highly differentiated products, and ends when products are eventually commoditized.

“Hadoop adoption globally and at scale is far beyond any other data platform just 10 years after initial concept,” he added. “In 2015, we’ll see the continued evolution of a new, more nuanced model of open-source software to combine deep innovation with community development. The open-source community is paramount for establishing standards and consensus. Competition is the accelerant transforming Hadoop from what started as a batch analytics processor to a full-featured data platform.”

Steve Wooledge, Vice President of Product Marketing at MapR, said he sees Hadoop-based data lakes and data hubs becoming the norm in enterprise data architectures in 2015, and self-service data exploration going mainstream.

“Hadoop as a data hub or data lake is a very standard and introductory use case for most organizations,” said Wooledge. “Companies are not sure what value there may be in untapped data sources, such as machine logs from the data center, social media, or mobile interactional data, but they want to harness the data and look for new insights, which they can inject into business processes and operationalize.”

Schroeder agreed.

“In 2015, data lakes will evolve as organizations move from batch to real-time processing and integrate file-based, Hadoop, and database engines into their large-scale processing platforms. In other words, it’s not about large-scale storage in a data lake to support bigger queries and reports. The big trend in 2015 will be around the continuous access and processing of events and data in real time to gain constant awareness and take immediate action.”

Ron Bodkin, founder and CEO of Think Big Analytics, said Hadoop will outgrow MapReduce in the coming year and Spark will grow in importance.

“One of the first things that we can expect from 2015 is that Hadoop clusters will start to benefit from other programming models besides MapReduce to deal with large data sets,” he said. “We already saw YARN begin to gain momentum in 2014 when it got across-the-board support from distribution providers like Cloudera as well as Hortonworks. Expect that this investment will begin to pay off in 2015 as more customers start leveraging YARN’s ability to support alternative execution engines, such as Apache Spark.”

Now that Hadoop has matured and gained widespread adoption, Bodkin said that the coming year could see late adopters finally feeling bold enough to embrace Hadoop.

“Hadoop has long since broken free of its web giant and ad tech heritage, penetrating most industries – notably music as streaming became ubiquitous,” said Bodkin. “In 2015, even late adopters will turn their attention to Hadoop, so expect an uptick in cost-driven implementations around better storage and faster load-times: SAN/NAS augmentation, ETL offload, and mainframe conversions.”

Monte Zweben, co-founder and CEO of Splice Machine, sees Hadoop evolving in the direction of concurrent applications in 2015.

“Concurrent Hadoop-based applications will become more prevalent in 2015 because of their ability to access real-time data and process transactions like a traditional RDBMS,” he added. “Emerging technologies that allow concurrent transactions on Hadoop enable data scientists and applications to work with more recent and accurate information instead of data that is hours or days old from batch processing. This is a major step in Hadoop’s ongoing evolution to meet the needs of businesses with mission-critical database applications that are having trouble cost-effectively scaling to meet higher data volumes.”

“Big data has bloomed in 2014 as enterprises have invested in platforms like Hadoop. As we enter 2015, getting more out of those initial big data investments will grow as a top priority for businesses,” Zweben said. “Increased competitive pressures and the current appetite for real-time information no longer allow for the old model of waiting for data scientists to take hours or days to generate insights based on out-of-date information. New developments in the Hadoop platform can power applications that can act on insights now, instead of later, and with more recent data.”

Spark 1.2 challenges MapReduce’s Hadoop dominance

December 22, 2014
Serdar Yegulalp
http://www.infoworld.com/article/2862225/hadoop/spark-12-challenges-mapreduces-hadoop-dominance.html

Apache Spark, the in-memory and real-time data processing framework for Hadoop, turned heads and opened eyes after version 1.0 debuted. The feature changes in 1.2 show Spark working not only to improve, but to become the go-to framework for large-scale data processing in Hadoop.

Among the changes in Spark 1.2, the biggest items broaden Spark’s usefulness in multiple ways. A new elastic scaling system allows Spark to better use cluster nodes during long-running jobs, which has apparently been requested often for multitenant environments. Spark’s streaming functionality, a major reason why it’s on the map in the first place, now has a Python API and a write-ahead log to support high-availability scenarios.

The new version also includes Spark SQL, which allows Spark jobs to perform Apache Hive-like queries against data, and it can now work with external data sources via a new API. Machine learning, all the rage outside of Hadoop as well, gets a boost in Spark thanks to a new package of APIs and algorithms, with better support for Python as a bonus. Finally, Spark’s graph-computing API GraphX is out of alpha and stable.
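To make the Spark SQL addition concrete, here is a minimal, hedged sketch in the style of the Spark 1.2-era Python API: it loads a JSON file, registers it as a temporary table, and runs a Hive-like aggregation on Spark rather than MapReduce. The file name, field names, and query are illustrative assumptions, not details from the article.

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Minimal Spark SQL sketch (Spark 1.2-era Python API).
# events.json and its fields are hypothetical sample data.
sc = SparkContext("local[*]", "spark-sql-sketch")
sqlContext = SQLContext(sc)

events = sqlContext.jsonFile("events.json")  # infer a schema from JSON
events.registerTempTable("events")           # expose it to SQL queries

# A Hive-like query executed by Spark itself rather than MapReduce.
top_users = sqlContext.sql(
    "SELECT user, COUNT(*) AS n FROM events "
    "GROUP BY user ORDER BY n DESC LIMIT 10")
for row in top_users.collect():
    print(row)

sc.stop()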

Spark’s push to ramp up and expand speaks to two ongoing efforts within the Hadoop world at large. The first is to shed the straitjacket created by legacy dependencies on the MapReduce framework and move processing to YARN, Tez, and Spark. Gary Nakamura, CEO of data-application infrastructure outfit Concurrent, believes the “proven and reliable” MapReduce will continue to dominate production over Spark (and Tez) in the coming year. However, MapReduce’s limitations are hard to ignore, and they constrain the work that can be done with it.

Another development worth noting is Python’s expanding support for Spark — and Hadoop. Python remains popular with number-crunchers and is well suited to Hadoop and Spark, but most of its support there has been confined to MapReduce jobs. Bolstering Spark’s support for Python broadens its appeal beyond the typical enterprise Java crowd and to Hadoop users in general.
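As a rough illustration of what that expanded Python support looks like in practice, the following sketch uses the PySpark Streaming API introduced in Spark 1.2 to count words arriving on a network socket. The host, port, and batch interval are arbitrary assumptions made for the example.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Minimal PySpark Streaming sketch (API added in Spark 1.2).
# The socket source on localhost:9999 and the 5-second batch
# interval are illustrative assumptions, not from the article.
sc = SparkContext("local[2]", "streaming-wordcount-sketch")
ssc = StreamingContext(sc, 5)

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts to the console

ssc.start()
ssc.awaitTermination()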

Much of Spark’s continued development has come through contributions from Hadoop shop Hortonworks. The company has deeply integrated Spark with YARN, is adding security and governance by way of the Apache Argus project, and is improving debugging.

This last issue has been the focus of criticism in the past, as programmer Alex Rubinsteyn has cited Spark for being difficult to debug: “Spark’s lazy evaluation,” he wrote, “makes it hard to know which parts of your program are the bottleneck and, even if you can identify a particularly slow expression, it’s not always obvious why it’s slow or how to make it faster.”

10 Predictions for Data and Analytics in 2015

December 19, 2014
Adam Shepherd

http://www.dbta.com/Editorial/News-Flashes/10-Predictions-for-Data-and-Analytics-in-2015-101221.aspx

As analytics continues to play a larger role in the enterprise, the need to leverage and protect data looms larger. According to IDC, the big data and analytics market will reach $125 billion worldwide in 2015. Here are 10 predictions from industry experts about data and analytics in 2015.

  1. Hadoop – Hadoop will become a worldwide phenomenon, believes Concurrent CEO Gary Nakamura, who notes that Hadoop has shown tremendous growth throughout Europe and Asia and that the expansion will only continue. A key to Hadoop becoming an enterprise backbone is the ROI businesses can expect from using it, and products and tools continue to evolve to keep pace with the technology’s trajectory. According to Actian, SQL will be a “must-have” for getting analytic value out of Hadoop data. We’ll see some vendor shake-out as bolt-on, legacy or immature SQL-on-Hadoop offerings cave to those that offer the performance, maturity and stability organizations need.
  2. Enterprise Security – With the seemingly never-ending stream of news reports of hacks and data leaks, one of the major data issues of 2014 that we can expect to continue in 2015 is big data breaches. “There is nothing you can do to stop a zero-day vulnerability, but the question is what do we do about it,” stated Walker White, president of data-as-a-service provider BDNA. At this point it isn’t about keeping the hackers out, but how companies react to protect their data once the hackers have penetrated their systems. “Security ultimately is an arms race; there are very few mechanisms that simply can’t be broken. It tends to just be how far ahead you can stay of the people that are trying to break in,” agreed Seth Proctor, CTO of NuoDB.
  3. Business Intelligence – The growth of BI tools that are more user-friendly for the average business employee will help take some of the burden off IT teams. To do this, more BI providers will incorporate search into their interfaces to make the tools more accessible to average business users, according to Thoughtspot CEO Ajeet Singh.
  4. Cloud – The cloud will increasingly become the deployment model for BI and predictive analytics, with private cloud deployments in particular driven by cost advantages, according to Actian.
  5. Hybrid Architecture – Hybrid architectures will become the norm for many organizations, according to Steven Riley of Riverbed Technology. Even though cloud computing and third-party hosting will continue their rapid expansion, on-premises IT will remain a reality for 2015 and beyond. “In the coming year, analytics will have the power to become the next killer app to legitimize the need for hybrid cloud solutions,” adds Revolution Analytics CEO Dave Rich. “Analytics has the ability to mine vast amounts of data from diverse sources, deliver value and build predictions without huge data landfills. In addition, the ability to apply predictions to the myriad decisions made daily – and do so within applications and systems running on-premises – is unprecedented.”
  6. Medical Data – When the average person thinks about personal data security, credit card information usually comes to mind first. But Bitglass, which provides security for cloud apps and mobile devices, believes that medical records are 50 times more valuable on the black market than credit cards. Bitglass predicts that medical records will become a bigger target for data attacks than traditional targets such as credit cards, which will bring increased scrutiny of HIPAA compliance. Regulations stipulate that health organizations must report data breaches that affect more than 500 people.
  7. Data Science – As organizations gain a greater appreciation of the role that data plays, data scientists are in greater demand, yet there are not enough qualified data scientists, according to EXASOL, an in-memory database company. Joe Caserta of Caserta Concepts believes that chief analytics officers (CAOs) will now play a role in the enterprise. As data-rich organizations continue to adopt a more strategic approach to big data, it makes sense that the responsibility for all that information sits with someone who can apply the analytics big picture to all parts of the organization – the CAO. The coming year will be the time for data-driven organizations to dedicate resources and executive commitment to the function.
  8. Internet of Things – OpenText predicts consumers will become more aware of the IoT all around them – from smart watches to cars with built-in sensors – and Vormetric, a provider of security solutions, believes that the IoT will trigger a greater enterprise emphasis on securing big data using encryption. More personalized private data will be stored and analyzed by data analysis tools in the future.
  9. Location Data – Technologies will emerge in 2015 – full stack virtualization, pervasive visibility, and hybrid deployments – that will create a form of infrastructure mobility that allows organizations to optimize for location of data, applications, and people, says Riley of Riverbed. He predicts that organizations that begin to disperse their data to multiple locations will begin to gain significant competitive advantages.
  10. NewSQL – NewSQL will start taking the place of some RDBMSs, according to Morris of NuoDB, who believes that NewSQL will begin to support enterprise-scale applications that were traditionally handled only by RDBMSs.

Data and analytics will only become more important and valuable to the enterprise. As the technologies for putting data to greater use continue to multiply, it is clear that those opportunities also carry risk, along with a growing need to better protect the data being amassed.