Bixo Labs Case Study

Background
Bixo Labs provides fast, scalable solutions for big data processing. Typical projects involve large-scale ETL, web crawling, data mining, machine learning and search.

For large-scale web crawling and data mining, Bixo Labs needs a system that can process billions of web pages, apply machine learning algorithms to content, and save the results in a variety of formats for customers (XML, Avro, TSV, Solr/Lucene, etc.). The resulting workflow needs to be efficient, reliable and scalable, and the solution must be supportable, both internally and by the client’s staff.

Solution
To provide big data solutions that meet these criteria, Bixo Labs chose Apache Hadoop and Cascading.  Cascading delivers the efficient, reliable and scalable workflow systems they need as well as an extensible, easy-to-use abstraction on top of Hadoop.  The ability to easily extend Cascading’s input and output formats made it possible to support the wide range of required data interchange formats. Because Cascading is part of the Hadoop ecosystem, Bixo Labs could incorporate other open source projects such as Mahout to provide additional required functionality such as machine learning.


Benefits
Cascading allows Bixo Labs to deliver custom big data processing solutions that are efficient and reliable. It shortens development time, makes workflows easier to understand and makes them easier to optimize. Developers complete data workflows for client projects in one quarter of the time it takes to write them in raw MapReduce. Custom code can be written in Java (or other JVM-based languages), eliminating the need for Bixo Labs or its clients to hire or train MapReduce experts. Because the resulting workflows are much easier to understand, they can be handed off or reused between developers as needed. Cascading also lets developers visualize workflows at a higher level, so they can spot optimization opportunities that would otherwise be hidden. Finally, Cascading manages workflows during processing, replacing ad-hoc scripts that cause reliability problems and allowing the system to scale beyond a single server.

“Cascading makes data processing with Hadoop practical, scalable and reliable,” noted Ken Krugler, Bixo Labs’ Founder. “It allows us to quickly deliver solutions that turn big data into useful information for our customers.”

In addition to being Cascading users, Bixo Labs has created and contributed back a number of open source Cascading-based extensions and systems, including:

Bixo Labs will continue to use Cascading for custom big data solutions, and plans to contribute further open source extensions for Cascading.

 
