Airbnb uses Cascading because it gives developers more control over advanced data analysis workflows, data normalization and cleansing. With Cascading, applications are easier to test, and developers are more confident that they will work as intended.
Airbnb chose Cascading for their ETL processes. Cascading gives them more control over the underlying MapReduce jobs and allows them to write custom code easily and in a single step. The development team found the Cascading API very easy to use, and the company already has dozens of tasks that run using it.
The analytics system includes both automated and manual functions. First, complicated infrastructure tasks, including data normalization and cleansing, are handled by applications written using Cascading. Cascading is also used to reconstruct corrupted files and combine multiple data files into one. Alongside Cascading, analysts use Pig and Hive to run batch scripts for ad hoc analysis. With these tools, analysts can study data important to the business, such as click-through rates, page statistics, drop-off rates and number of bookings. The analysts also create queries for other indicators of interest, such as comparing regional performance and identifying potential problems with the site.
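The cleansing, normalization and file-combining steps described above can be illustrated with a minimal sketch. It is plain Python, not the Cascading API, and it does not reflect Airbnb's actual code: the comma-separated record layout (event type, listing id, region), the event names and the sample lines are all assumptions made up for illustration.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical record layout: (event_type, listing_id, region)
Record = Tuple[str, str, str]

# Assumed event names; a real log would define its own vocabulary.
VALID_EVENTS = {"impression", "click"}


def normalize(raw: str) -> Optional[Record]:
    """Return a cleaned record, or None if the line is malformed."""
    parts = [p.strip() for p in raw.split(",")]
    # Cleansing: reject lines with the wrong field count or empty fields.
    if len(parts) != 3 or not all(parts):
        return None
    event, listing_id, region = parts
    # Normalization: consistent casing for event names and region codes.
    event = event.lower()
    if event not in VALID_EVENTS:
        return None
    return (event, listing_id, region.upper())


def merge_files(files: Iterable[Iterable[str]]) -> List[Record]:
    """Combine several inputs into one cleaned list, skipping bad lines."""
    merged: List[Record] = []
    for lines in files:
        for raw in lines:
            record = normalize(raw)
            if record is not None:
                merged.append(record)
    return merged


def click_through_rate(records: Iterable[Record]) -> float:
    """Clicks divided by impressions; 0.0 when there are no impressions."""
    impressions = 0
    clicks = 0
    for event, _listing_id, _region in records:
        if event == "impression":
            impressions += 1
        elif event == "click":
            clicks += 1
    return clicks / impressions if impressions else 0.0


# Made-up sample inputs standing in for two separate data files.
file_a = ["Impression, 101, us", "CLICK,101,us", "garbage line"]
file_b = ["impression,202,fr", "impression,,fr", " impression , 303 , de "]

records = merge_files([file_a, file_b])
ctr = click_through_rate(records)
```

In this sketch, malformed lines are silently dropped rather than repaired. Reconstructing corrupted files, as mentioned above, would need format-specific logic that is not shown here.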
Airbnb uses Cascading for infrastructure work and for more complex functions. Because it’s an API instead of a language, custom code can be written in a single step, without compiling or the need for a language-specific wrapper. Cascading also provides much more control over the underlying MapReduce jobs, which helps when analyzing user flows on the website. In development, Cascading simplifies the testing process for custom code and gives Airbnb more confidence that the application will work as intended when deployed. Because of these advantages, the Airbnb team is much more productive writing custom applications in Cascading and is able to react more quickly to business requirements.
Airbnb has been growing rapidly, with over 5 million guest nights booked since the company’s founding in 2008 and over 4 million guest nights booked in the last 12 months alone. As Florian Leibert of Airbnb notes, “things happen fast here and we have to make changes on the fly. We’ll definitely be using Cascading for more projects. We’ve only just gotten started taking advantage of all that it can do for us.”
Airbnb is a trusted community marketplace for people to list, discover, and book unique accommodations around the world, online or from a mobile phone. Founded in August of 2008 and based in San Francisco, California, Airbnb connects hosts and guests in more than 19,000 cities and 192 countries. Airbnb is the easiest way for people to monetize their extra space and showcase it to an audience of millions.
Airbnb makes the process simple. The site handles the reservation, takes payment (including a security deposit if the host requires one), sends the rental agreement, disburses payment to the host and returns the security deposit if there are no issues after the renter checks out.
To continuously improve the effectiveness of their site and service, Airbnb logs events such as searches or click-throughs on properties and analyzes the data to discover what’s driving room bookings as well as drop-offs. To look for trends and changes, the company may need to evaluate months of data at a time. The large datasets involved led them to choose Hadoop. However, they still had to figure out how to extract, transform and load (ETL) data from large volumes of log files into their analytics tools for subsequent analysis.
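To make the ETL problem above concrete, here is a small sketch of extracting events from raw log lines and summarizing where sessions drop off between search, click-through and booking. It is plain Python rather than Hadoop or Cascading code. The tab-separated log format, the stage names and the sample lines are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# Assumed funnel stages, in order; not taken from Airbnb's real event schema.
FUNNEL = ("search", "click", "booking")


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Extract (session_id, event) from a tab-separated line.

    Assumed layout: timestamp <TAB> session_id <TAB> event.
    Returns None for lines that cannot be used.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None
    _timestamp, session_id, event = fields
    if not session_id or event not in FUNNEL:
        return None
    return session_id, event


def funnel_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Count the distinct sessions that logged each funnel stage."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            session_id, event = parsed
            seen[event].add(session_id)
    return {stage: len(seen[stage]) for stage in FUNNEL}


def drop_off(counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of sessions lost between each pair of consecutive stages."""
    rates: Dict[str, float] = {}
    for earlier, later in zip(FUNNEL, FUNNEL[1:]):
        if counts[earlier]:
            rates[f"{earlier}->{later}"] = 1 - counts[later] / counts[earlier]
    return rates


# Made-up sample log, including one unusable line.
log = [
    "2012-01-05T10:00\ts1\tsearch",
    "2012-01-05T10:01\ts1\tclick",
    "2012-01-05T10:02\ts1\tbooking",
    "2012-01-05T10:03\ts2\tsearch",
    "2012-01-05T10:04\ts2\tclick",
    "2012-01-05T10:05\ts3\tsearch",
    "2012-01-05T10:06\ts4\tsearch",
    "corrupted line",
]

counts = funnel_counts(log)
rates = drop_off(counts)
```

On months of logs, the same extract, filter and aggregate steps would have to run as distributed jobs, which is where Hadoop comes in. This single-process version only shows the shape of the computation.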