SoundCloud

SoundCloud is the world’s leading audio platform that gives users unprecedented access to the world’s largest community of music and audio creators. With its continued ambition to ‘unmute the web’, SoundCloud allows everyone to discover original music & audio, connect with each other and share the sounds they hear. In addition, sound creators can use the platform to instantly record, upload and share sounds across the internet, as well as receive detailed stats and feedback from the SoundCloud community. SoundCloud today reaches over 350 million people each month and is based in Berlin, Germany with additional offices in San Francisco, New York City, and London.

SoundCloud’s Discovery team is one of several teams within SoundCloud responsible for producing data products. Composed of engineers and data scientists from diverse backgrounds, the team collaborates with other teams to collect and analyze complex data and customer research. Its mandate is to continuously improve the SoundCloud user experience.

Key Takeaways

  • The right development tool and framework can significantly boost developer productivity and reduce the costs of production
  • With Scalding, SoundCloud’s Discovery team accelerated their application development and simplified deployment for their critical data applications on Hadoop
  • Scalding was a great fit for SoundCloud because they needed a reliable abstraction that could incorporate custom algorithms to consistently produce data products for their business
  • The Discovery team used Scalding data workflows to handle data processing and incorporate a custom ranking algorithm used for their Trending Music and Trending Audio data products
  • The inherent testing capabilities within Scalding, coupled with Pragmasoft’s unit testing framework for Scalding, amplified the team’s testing capabilities

Background

SoundCloud is an online audio platform that enables its users to upload, record, promote and share their originally created sounds. The platform allows anybody to become a creator and upload their original sounds for the world to hear. Creators on the platform now upload 12 hours of original content every minute, with 90% of these tracks played at least once, more than half within an hour of being posted. Not limited to mainstream music, SoundCloud’s repository features original content across independent music, podcasts, comedy, speeches, ambient sounds, and much more.

SoundCloud’s Discovery team focuses on delivering data-driven applications to enhance the experience of their users. The team has a reputation for setting the standard for best practices and builds data products around search, recommendations, audio classification, and metadata extraction. Primarily consisting of engineers and data scientists, they are responsible for the end-to-end development and management of their data products.

The Discovery team is responsible for several core components of the SoundCloud user experience, including search, the popular Trending Music and Trending Audio products, and people and music recommendations. These products are based on user behavior and sound metadata.

Challenges

SoundCloud’s Discovery team was looking for a more productive way to reliably create data applications to meet business needs. Their previous development experience with Apache Pig was workable, but it did not scale when it came to building complex applications such as recommendation engines or ranking algorithms.

Developing with Apache Pig

When building workflows for data processing, it’s natural to think of the problem in a functional style, or in business logic terms. Programming in Pig, however, called for a more procedural style in which developers also had to write integration logic. This made development convoluted and challenging for the team’s developers, because every application had to mix business and integration concerns.

Scaling Application Development

Developing in Pig also did not scale. As an application took on more functionality, its code base and queries grew rapidly more complicated. If a developer wanted to incorporate something complex, they needed to write UDFs (user-defined functions), which added to the complexity. It was also difficult to reuse existing code in other applications. Adding to the frustration, their Pig applications were difficult to test and debug. Poor testing led to lower-quality code and longer development cycles.

Solution

The Discovery team decided to adopt Scalding as a core application development tool. Developed by Twitter, Scalding is a Scala library (an API rather than a language of its own) built on the Cascading application development framework. Cascading balances an optimal level of abstraction with the necessary degrees of freedom for building data applications on Hadoop through a computation engine, systems integration framework, data processing and scheduling capabilities. Scalding inherits Cascading’s core capabilities and enables developers and data scientists to use Scala for building robust applications on Hadoop. Scala is a statically typed programming language designed to express common programming patterns in a concise, elegant, and type-safe way.

Scalding was a natural fit for SoundCloud’s Discovery team because they needed a reliable abstraction that could incorporate custom algorithms to consistently produce data products for their business. Scalding is widely adopted because it offers a concise yet powerful API that makes it easy to cube matrices, build recommendation engines and incorporate custom algorithms, all while reducing boilerplate code. From massive data processing applications to complex recommendation or ranking applications, enterprises are using Scalding to build and deploy reliable mission-critical applications.

Adopting Scalding was relatively easy. Most members of the team were already familiar with functional languages like Scala. Scalding’s primitives closely resemble Scala’s own collection operations and are practical and easy to understand, which made Scalding easy to learn. The team also relied heavily on existing in-house examples of Scalding applications to help with the onboarding process.

SoundCloud’s Discovery team was able to use Scalding to successfully build their Trending Music and Trending Audio data applications with minimal frustration. The development team used Scalding for data aggregation, data cleansing and data validation, and then applied their ranking algorithm to process the data. Once data processing was completed, their application exported the output into a flat file for migration into their internal data store with Apache Sqoop. To help with testing, the team also took advantage of a third-party unit testing library built for Scalding from Pragmasoft.
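
The shape of such a workflow (validate, aggregate, rank, export to a flat file) can be sketched in a few lines. The sketch below uses plain Scala collections rather than Scalding itself so that it runs without Hadoop. The record fields, the names PlayEvent and TrendingSketch, and above all the time-decay score are assumptions made for illustration only; SoundCloud’s actual ranking algorithm is not described in this case study.

```scala
// Hypothetical event record; the field names are assumptions for illustration.
final case class PlayEvent(trackId: String, userId: String, hoursAgo: Double)

object TrendingSketch {
  // Cleansing and validation: drop events with missing ids or a negative age.
  def isValid(e: PlayEvent): Boolean =
    e.trackId.nonEmpty && e.userId.nonEmpty && e.hoursAgo >= 0.0

  // Placeholder score, NOT SoundCloud's algorithm: a play loses half its
  // weight every halfLifeHours.
  def decayedWeight(hoursAgo: Double, halfLifeHours: Double = 24.0): Double =
    math.pow(0.5, hoursAgo / halfLifeHours)

  // Aggregate the weights per track, then rank by descending score
  // (ties are broken by track id so the order is deterministic).
  def rank(events: Seq[PlayEvent], topN: Int): Seq[(String, Double)] =
    events
      .filter(isValid)
      .groupBy(_.trackId)
      .map { case (id, es) => id -> es.map(e => decayedWeight(e.hoursAgo)).sum }
      .toSeq
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(topN)

  // Export step: one tab-separated line per ranked track, as a flat file
  // for a bulk-load tool would expect.
  def toTsv(ranked: Seq[(String, Double)]): Seq[String] =
    ranked.map { case (id, score) => s"$id\t$score" }
}
```

In a real Scalding job the same filter, group and sort steps would run over distributed pipes instead of in-memory sequences, and the final step would write to HDFS rather than return strings.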

Benefits

By developing their data products with Scalding, SoundCloud’s Discovery team was able to prototype, develop and deploy their applications into production quickly. With Scalding, they experienced productivity benefits that accelerated their applications’ time to market.

Simplified Development

The Discovery team was able to speed up their development cycles because Scalding was a natural fit for developers who were seeking a functional style. Scalding also inherits from the Cascading framework the ability to separate an application’s business logic from its integration logic. This characteristic simplified the development process and ultimately accelerated their applications’ time to market.
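
To make the idea of separating business logic from integration logic concrete, here is a minimal sketch in plain Scala. It is not Cascading’s or Scalding’s API: the Source and Sink type aliases, the object name and the tab-separated input layout are all hypothetical. The point is only that the counting logic never mentions files, HDFS or databases.

```scala
object SeparationSketch {
  // Integration logic is passed in as plain functions (hypothetical names):
  // a source that yields raw lines and a sink that accepts result lines.
  type Source = () => Iterator[String]
  type Sink = Seq[String] => Unit

  // Business logic: count plays per track from "trackId<TAB>userId" lines.
  // Lines that do not have exactly two fields, or have an empty track id,
  // are skipped.
  def playCounts(lines: Iterator[String]): Seq[(String, Int)] =
    lines
      .map(_.split("\t"))
      .collect { case Array(trackId, _) if trackId.nonEmpty => trackId }
      .toSeq
      .groupBy(identity)
      .map { case (id, hits) => id -> hits.size }
      .toSeq
      .sortBy(_._1)

  // Wiring: the only place where the two concerns meet.
  def run(source: Source, sink: Sink): Unit =
    sink(playCounts(source()).map { case (id, n) => s"$id\t$n" })
}
```

Because the source and sink are parameters, the same playCounts logic can be driven by an in-memory iterator in a test and by a file-backed or cluster-backed source in production.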

Better Code Testing = Better Code Quality

Scalding naturally inherits Cascading’s core capabilities—including its testing capabilities. Scalding gave the team an easier way to perform end-to-end testing of their complicated workflows. Pragmasoft’s unit testing library for Scalding also helped to streamline their testing procedures. With their prototyping and testing processes in place, SoundCloud noticed better code quality, which contributed to a faster time to production.

Production-ready Applications

Another attribute that Scalding inherits from Cascading is the ability to reduce the operational complexity of deploying applications into production on a Hadoop cluster. Once a Scalding application is written, it is packaged as a single JAR file that is ready for deployment, and its workflow is translated into the appropriate mappers and reducers. If a Scalding application failed, it was easy to pinpoint which application had failed and where. This reduced some of the team’s burden, since they also needed to deploy and manage their own production applications.

Code Reusability

Because the team constantly produces numerous data applications for their business, it makes sense to find a way to leverage existing code when possible. With Scalding, they were easily able to reuse existing code across different data applications. Not only was it a significant time saver, but it also was helpful in the onboarding process of developers new to Scalding.
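
As an illustration of this kind of reuse, the sketch below shares one generic ranking step between two products. It is plain Scala, not code from SoundCloud: the Track fields, the isMusic flag used to split music from other audio, and the use of raw play counts as the score are all assumptions.

```scala
object ReuseSketch {
  // Generic, reusable step: keep the n highest-scoring items.
  def topN[A](items: Seq[A], n: Int)(score: A => Double): Seq[A] =
    items.sortWith((a, b) => score(a) > score(b)).take(n)

  // Hypothetical record; the fields are assumptions for illustration.
  final case class Track(id: String, plays: Int, isMusic: Boolean)

  // Two products built on the same step, differing only in their filter.
  def trendingMusic(tracks: Seq[Track], n: Int): Seq[String] =
    topN(tracks.filter(_.isMusic), n)(_.plays.toDouble).map(_.id)

  def trendingAudio(tracks: Seq[Track], n: Int): Seq[String] =
    topN(tracks.filter(!_.isMusic), n)(_.plays.toDouble).map(_.id)
}
```

A developer new to the code base only has to learn topN once; each additional product is then a short filter plus a scoring function.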