Key Takeaways
Twitter has invested heavily in making Cascading a key component of their data analytics infrastructure. Cascading enables Twitter engineers to create complex data processing workflows in their favorite programming languages while providing the scalability to seamlessly handle terabyes and petabytes of data.
Solution
The Twitter team needed to be able to write a lot of complex jobs against their Hadoop clusters and not just simply count entries in log files. Their requirements included the ability to perform computations, machine learning and even linear algebra to get the information they needed to improve the user experience and help advertisers succeed. They chose Cascading from Concurrent, Inc. for use with Hadoop to provide a higher level of abstraction, allowing developers to complete these complex analyses easily and quickly. While many technologies built for Hadoop make counting easy, Cascading makes it possible for build higher level, complicated functions easily as well.
Three teams inside Twitter are using Cascading, and each team is using Cascading’s flexibility and ease of extensibility to boost productivity and leverage existing programming language skills.
The revenue team is focused on matching the most suited ads with users and helping advertisers determine which ads are most effective. They analyze the contents of ads, twitter topics, time of day and many other data points to help increase conversion rates. To make complex analysis of very large data sets simple to do, they wrote Scalding, an open source Scala API for Cascading. Scalding is comprised of a DSL (domain-specific language) that makes MapReduce computations look like Scala’s collection API and a wrapper for Cascading to make it easy to define jobs, test and data sources on an HDFS. Like Apache Pig, Scalding uses simple and concise syntax to write big data jobs. But, unlike Pig which separates user-defined functionality from the query language, Scalding integrates them into one language. Using Cascading developers can think and write in Scala, but execute on 10s or 1,000s of nodes. In the majority of cases just a few lines of code can define an entire job. Scalding allows the revenue team to provide information to advertisers quickly that improves the performance of their ads and allows them to take advantage of opportunities based on Twitter activity. This fulfills a key part of Twitter’s advertising value proposition.
The publisher analytics team helps webmasters understand how people are engaging with their site using Twitter. The system provides companies with information on their brand, website and online influencers. They analyze terabytes of data daily to create site comparisons, URL profiles, influencer profiles, twitter account analysis and topic analysis that helps companies understand what customers are saying in social media. They analyze large datasets in real time on a highly reliable and fault-tolerant system built on Hadoop, Cassandra, Lucene, Cascading and other components. They also built and open-sourced Cascalog, a Clojure-based language that uses Cascading as the job execution engine.
The analytics team is focused on understanding Twitter user activity. While this might seem simple on the surface, Twitter is made up entirely of networks of users following other users or users who follow the same people. The team needed a way to make it easier to think at a higher level about the different networks and perform detailed and complex analysis on this information. They created PyCascading, a Python wrapper for Cascading to control the full data processing workflow from Python.
In all these cases, Cascading allows Twitter to leverage the power of the programming languages their teams know best while hiding the underlying complexity of writing, optimizing and executing MapReduce jobs. This allows them to deliver even highly complex information and functionality needed by the business quickly and efficiently, a major advantage over other technologies they tried.
Benefits
The Twitter development team’s strategy was to create an analytics infrastructure that allows them to achieve complicated results on extremely large data sets easily. Cascading is a key component of this infrastructure, providing a great foundation for straightforward development of extensions. Without these extensions, Twitter developers had to stick to a small set of basic operations, a common limitation of Hadoop-related projects, which made it so difficult to create extensions that they usually didn’t get written. With Cascading programmers can add functions quickly that make development clean and convenient. Their development teams now rely on these extensions for everything from custom ad targeting algorithms to PageRank on the Twitter graph and their new, rapidly expanding Business Analytics platform.
Cascading provides Twitter with several other important benefits. It improves the efficiency and productivity of Twitter’s development team, allowing them to think and write in languages they know well and that fit the task at hand. Even highly complex tasks such as multiplying matrices can be expressed cleanly in just a few lines of code that looks like normal arithmetic to the programmer, speeding time to deployment. Cascading also makes new things the business side wants much easier to implement. Developers no longer need to rethink how to perform complex calculations each time; they can reuse functions already developed. The result is straightforward development and fast deployment.
Because of the analytics platforms built using Hadoop and Cascading, Twitter is able to deliver more value for web publishers, webmasters and advertisers. With these systems, web publishers and webmasters can understand better the amount of traffic they receive from Twitter and the effectiveness of Twitter integrations on their sites. Advertisers are able to improve the success rates of their campaigns and improve the value of their communications to Twitter users.
Cascading makes it easy for Twitter to perform extremely sophisticated analysis tasks on Hadoop clusters. Twitter is able to add new functionality to their publisher, advertising and analytics platforms quickly and easily, even though they require sophisticated mathematical computations underneath. Cascading is a great foundation for extensible Hadoop-based systems.
Twitter plans to develop additional innovative analytics functionality on their Cascading-based system to build value for users, web publishers and advertisers. The company has a strong commitment to the future of Cascading and has signed a contributor agreement so they can help make Cascading even better.
Background
Twitter provides a real-time information network that connects its users to the latest news and opinions they care about most. Users communicate via short posts called ‘Tweets’ of 140 characters or less. Twitter also connects businesses to their customers, enabling them to quickly share information with people who follow them as well as gather real-time market intelligence and feedback and build relationships. Twitter also constantly works to serve its users and businesses better by analyzing user activity to improve the service and through a rapidly growing advertising platform.
The company needed to create a platform to allow them to analyze the rapidly growing volumes of data from tweet contents, ad campaigns and user activity and among other functions. They chose Apache Hadoop as the underlying technology for its ability to evaluate petabytes of data and very quickly scale to handle the data analysis compared to traditional data management systems. However, Twitter’s analysis needs were unique and more complex than other firms. They needed to perform sophisticated statistical functions that would be very difficult to implement directly on Hadoop in MapReduce and could require developers to rethink and recode complex computations each time they were used.
Concurrent Tweets