Cindy Waxer, DataInformed
July 14, 2014
“Big data factory” may not be as glamorous a description for an organization’s analytics talent as “data rock star” or “virtuoso,” but it’s a term an increasing number of organizations are embracing as the race to build big data applications heats up.
For years, lone data scientists have dominated analytics departments, building applications one at a time while handing over dozens of files with multiple scripts to various departments. But that’s all changing as companies like BloomReach discover a new approach to app development that more closely resembles an auto manufacturer’s plant floor than a Silicon Valley cubicle.
BloomReach creates big data marketing applications that use analytics to personalize website content and power site search for big-name brands like Crate & Barrel and Neiman Marcus. Although the startup relies on Cascading, an application development platform from Concurrent Inc., to build on Hadoop, Seth Rogers, a member of BloomReach’s technical staff, says that developing applications “is a time-consuming process. There’s a lot of trial and error and iterations that you have to go through.”
Rogers said that to minimize this heavy lifting and make its app development process more flexible, BloomReach has created its own big data factory. According to Rogers, application development traditionally has involved a hodgepodge of development tools, programming languages, and raw Hadoop in which “every product is in its own silo, which is very specific and very opaque. Nobody really understands what’s going on with programming languages or data storage.”
As a result, data scientists are left experimenting on their own, building one app at a time without a repeatable platform or reproducible results. Not only does this require starting from scratch with every new application, but if a data scientist suddenly jumps ship, the result is a huge knowledge vacuum and a corresponding negative impact on production.
To make application design simpler with repeatable development processes, BloomReach created a big data factory for the four terabytes of data it processes weekly and the 150 million consumer interactions it encounters daily.
Rogers said a successful big data factory comprises five key components:
- A set of common building blocks and tools for app development. Rogers said the Cascading app development platform allows BloomReach to access “programming libraries we’ve already written so that we can incorporate them into new apps in a fairly straightforward way and without customization.”
- A standardized infrastructure, in which data sets are stored in standard formats so that “when we are building a product, we are not starting from scratch,” Rogers said. “We already have our basic infrastructure in place.”
- A selection of monitoring tools that regularly test system performance and measure computing power usage to ensure peak performance.
- A standardized process for debugging along with databases for ad hoc queries. Rogers said it’s not uncommon for a customer’s website to experience an occasional glitch. The problem is that it’s often difficult to sift through files for a nugget of bad data, especially if the data hasn’t been indexed. Easy-to-query databases, however, “make it easy to find information without having to open up these gigantic, 10-terabyte files and run through them,” Rogers said. “We can go in there and easily do a search and see what exactly went wrong, whether it was user error or a legitimate bug.” In turn, apps can be modified quickly and effectively without having to start over.
- Centralized storage for data. Storage typically comes in three forms: a shared file system that is well suited to transforming data; a key-value store that lets users quickly look up data by a specific key for fast analysis; and a relational database, for which data is cleaned up, converted into structured records, and loaded into a regular SQL database “so that users can conduct queries on any field” and play with the data, said Rogers.
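To make the storage and debugging components more concrete, here is a minimal Python sketch of the idea, not BloomReach’s actual implementation. It assumes a handful of hypothetical JSON product records standing in for a raw file on a shared file system, uses a plain dictionary as a stand-in for a key store, and uses an in-memory SQLite database as the “regular SQL database.” The field names, the `products` table, and the “negative price” bug are all invented for illustration.

```python
import json
import sqlite3

# Hypothetical raw records, standing in for lines of a large file
# on a shared file system. Field names are assumptions.
RAW_LINES = [
    '{"product_id": "p1", "title": "Lamp", "price": "19.99"}',
    '{"product_id": "p2", "title": "Rug", "price": "-5"}',
    '{"product_id": "p3", "title": " Vase ", "price": "42"}',
]


def clean(line):
    """Parse one raw line and convert it into a structured record."""
    rec = json.loads(line)
    return (rec["product_id"], rec["title"].strip(), float(rec["price"]))


def build_stores(lines):
    """Load cleaned records into a key store (a dict) and a SQL database."""
    records = [clean(line) for line in lines]

    # Key store: fast lookup of a single record by its key.
    key_store = {record[0]: record for record in records}

    # Relational store: structured rows that can be queried on any field.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE products ("
        "product_id TEXT PRIMARY KEY, title TEXT, price REAL)"
    )
    db.executemany("INSERT INTO products VALUES (?, ?, ?)", records)
    db.commit()
    return key_store, db


def find_bad_prices(db):
    """Ad hoc debugging query: find rows with an impossible price."""
    query = (
        "SELECT product_id, price FROM products "
        "WHERE price < 0 ORDER BY product_id"
    )
    return db.execute(query).fetchall()


key_store, db = build_stores(RAW_LINES)
print(key_store["p3"])      # ('p3', 'Vase', 42.0)
print(find_bad_prices(db))  # [('p2', -5.0)]
```

The point of the sketch is the division of labor: the dictionary answers “what is the record for this key?” immediately, while the SQL table lets someone search for a suspect value without scanning the raw file by hand.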
Application development tools, debugging solutions, storage options – they have always been available to a big data team. But by combining data, coding, and monitoring processes into a standardized, centralized, and simplified strategy for app development, Rogers said, a big data factory can accelerate the app development process, resulting in significant savings in developer time, storage capacity, and computing power resources.