Twitter has invested heavily in making Cascading a key component of its data analytics infrastructure. Cascading enables Twitter engineers to easily create complex data processing workflows in their favorite programming languages while providing the scalability to seamlessly handle terabytes and petabytes of data.
The Twitter team needed to write many complex jobs against their Hadoop clusters, not simply count entries in log files. Their requirements included the ability to perform computations, machine learning, and even linear algebra to get the information they needed to improve the user experience and help advertisers succeed. They chose Cascading from Concurrent, Inc. for use with Hadoop to provide a higher level of abstraction, allowing developers to complete these complex analyses easily and quickly. While many technologies built for Hadoop make counting easy, Cascading makes it possible to build higher-level, complicated functions.
Three teams inside Twitter are using Cascading, and each is leveraging Cascading's flexibility and extensibility to boost productivity and build on existing programming language skills.
The revenue team is focused on matching the best-suited ads with users and helping advertisers determine which ads are the most effective. They analyze the contents of ads, Twitter topics, time of day, and many other data points to help increase conversion rates. To make complex analysis of very large data sets simple, they wrote Scalding, an open source Scala API for Cascading. Scalding consists of a DSL (domain-specific language) that makes MapReduce computations look like Scala's collection API, plus a wrapper for Cascading that makes it easy to define jobs, tests, and data sources on HDFS. Like Apache Pig, Scalding uses simple and concise syntax to write big data jobs. But unlike Pig, which separates user-defined functionality from the query language, Scalding integrates them into one language. Using Cascading, developers can think and write in Scala but execute on tens or thousands of nodes. In the majority of cases, just a few lines of code can define an entire job. Scalding allows the revenue team to quickly provide advertisers with information that improves the performance of their ads and lets them take advantage of opportunities based on Twitter activity. This fulfills a key part of Twitter's advertising value proposition.
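To see why "MapReduce that looks like Scala's collection API" is appealing, consider a word count, the canonical MapReduce job. The sketch below is plain, in-memory Scala using only the standard collections, not actual Scalding API; in Scalding, a pipeline with this same flatMap/groupBy shape would execute as MapReduce across a Hadoop cluster.

```scala
// Plain-Scala sketch of the collection-style logic Scalding mirrors:
// a word count expressed with flatMap and groupBy, running in memory.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // tokenize each line into words
      .filter(_.nonEmpty)         // drop empty tokens
      .groupBy(identity)          // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be", "not a question"))
    println(counts("to"))   // 2
    println(counts("not"))  // 2
  }
}
```

The point of the DSL is that a developer who can write this five-line collections pipeline already knows how to express the distributed version; the framework, not the programmer, handles the translation to MapReduce jobs.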
The publisher analytics team helps webmasters understand how people are engaging with their sites via Twitter. The system provides companies with information on their brand, website, and online influencers. The team analyzes terabytes of data daily to create site comparisons, URL profiles, influencer profiles, Twitter account analysis, and topic analysis that helps companies understand what customers are saying in social media. They analyze large datasets in real time on a highly reliable and fault-tolerant system built on Hadoop, Cassandra, Lucene, Cascading, and other components. They also built and open-sourced Cascalog, a Clojure-based language that uses Cascading as its job execution engine.
The analytics team is focused on understanding Twitter user activity. While this might seem simple on the surface, Twitter is made up entirely of networks: users following other users, or users who follow the same people. The team needed an easier way to reason at a higher level about these different networks and to perform detailed, complex analysis on this information. They created PyCascading, a Python wrapper for Cascading, to control the full data processing workflow from Python.
In all these cases, Cascading allows Twitter to leverage the power of the programming languages its teams know best while hiding the underlying complexity of writing, optimizing, and executing MapReduce jobs. This allows the teams to deliver even highly complex analyses quickly.