Key Takeaways
Etsy runs over 50 Cascading applications daily to study customer behavior and product sales. Programming in JRuby, Etsy can quickly create and test new applications for its e-commerce site that help it acquire new customers and sell more products.

Solution
Etsy chose Cascading to abstract standard data processing operations away from the underlying map/reduce tasks. Cascading combines the scalability of Hadoop with an easy way to perform deep dives on data. Etsy also extended the Cascading API to create a Domain Specific Language (DSL), cascading.jruby [1]. DSLs generally provide a simpler code structure and a cleaner syntax for common, problem-specific data analysis tasks, and JRuby gave the team a base language in which they felt more productive than in Java.
Benefits
With Cascading and Hadoop, and using short-lived Hadoop clusters on Amazon EMR, Etsy is able to quickly build and launch data-driven products such as their gift recommender, suggested shops recommender, and “taste test”. Each night, over 50 Cascading jobs extract data from web logs and database snapshots, then aggregate metrics used to monitor and understand the behavior of the site’s visitors. These jobs also aggregate the results of all the A/B tests running on the site, helping teams make product decisions. Finally, engineers are able to answer one-off questions and explore data easily, gaining insights that help improve their site and community.
In the future, Etsy plans to rapidly increase the number of data-driven apps to help improve conversion rates, enhance the community areas of the site, and accelerate the company’s growth. Cascading and Hadoop on Amazon EMR will allow the engineering team to build and scale these applications much more easily than if they had used raw map/reduce on an in-house cluster.
[1] More on the original author’s version of cascading.jruby here: https://github.com/gmarabout/cascading.jruby
Background
Etsy provides a community and marketplace that reconnects people who make things with buyers. Twenty-five million visitors from more than 150 countries come to Etsy each month to shop among more than 9.5 million items. This traffic generates over 1 billion page views and multiple terabytes of data each month. Site traffic doubled last year and continues to grow rapidly.
As the community and site traffic have grown, so have Etsy’s data processing needs. Etsy needed a solution that could scale with the site, but also a way to apply sophisticated analytics and ask specific questions about the data. The company wanted to use behavioral data gathered from the site to improve the user experience and to add new features quickly. Because of the volume of information, Etsy decided to use Hadoop as the core, but raw map/reduce jobs were an awkward way to define data analysis tasks that are more naturally expressed as filters that discard records or as groupings of records for aggregation or counting.
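To make that contrast concrete, here is a minimal sketch in plain Python. It does not use the Cascading or cascading.jruby APIs, and the record fields (`visitor`, `page`, `status`) and function names are hypothetical, chosen only for illustration. The first function spells out the map, shuffle, and reduce phases by hand; the second states the same task as a filter followed by a group-and-count.

```python
from collections import Counter, defaultdict

# Hypothetical page-view records; the field names are illustrative only.
records = [
    {"visitor": "a", "page": "/listing/1", "status": 200},
    {"visitor": "b", "page": "/listing/1", "status": 404},
    {"visitor": "a", "page": "/shop/9", "status": 200},
    {"visitor": "c", "page": "/listing/1", "status": 200},
]


def count_with_map_reduce(records):
    """Count successful views per page with hand-written phases."""
    # Map: emit a (page, 1) pair for each successful request.
    mapped = [(r["page"], 1) for r in records if r["status"] == 200]
    # Shuffle: collect the emitted values under their key.
    shuffled = defaultdict(list)
    for key, value in mapped:
        shuffled[key].append(value)
    # Reduce: sum the values collected for each key.
    return {key: sum(values) for key, values in shuffled.items()}


def count_with_pipeline(records):
    """The same task stated as a filter, then a group-and-count."""
    successful = (r for r in records if r["status"] == 200)  # filter
    return dict(Counter(r["page"] for r in successful))      # group + count


views_by_page = count_with_pipeline(records)
```

Both functions produce the same counts; the second simply reads closer to the question being asked, which is the kind of gap a higher-level API over map/reduce is meant to close.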