Trulia

Key Takeaways

Cascading is the core component of their data processing pipeline for data cleansing, entity resolution, and metadata extraction.  This helps Trulia provide higher quality content with the newest listings and the best deals in their customers’ markets.

Solution

Trulia selected Apache Hadoop and Cascading.  Cascading is used to manage execution of data processing that cleanses incoming data, identifies relationships between items and creates meta-data. Because the data processing logic is very complex, Cascading plays a critical role in Trulia’s main data processing pipeline, providing an easy and higher-level abstraction for running multiple MapReduce jobs.

Benefits

With Cascading, Trulia is able to deliver higher-quality content to be shown on the site and make development cycles much more efficient. The ability to manage complex data processing logic allows them to connect users with people who live nearby, slice and dice local data into useful bits, and show the newest listings and best deals in the user’s market.  Cascading also allows Trulia’s engineering team to move much faster developing data processing applications than they could with direct MapReduce development.

“With Cascading, we’re able to provide the high-quality local real estate information our users demand,” noted Daniele Farnedi, Vice President, Engineering, Trulia, Inc. “Cascading’s easy-to-use framework allows engineering to deliver new features much more quickly and manage the complex data processes required to support those features very efficiently.”

Background

Trulia is the fastest growing online real estate resource, empowering buyers, sellers and renters with smarter tools to help them find the right home. Trulia brings together local information, community insights, market data and national listings all in one place. The company’s website draws 16 million unique users every month generating 200 million page views for real estate listings. The company’s goal is to make finding a place to live pleasant instead of frustrating.

Trulia needed a solution to help them apply heuristic real estate market data, listing and property information to generate a snapshot of available properties for sale in each market.  The system had to be able to process tens of Terabytes of data efficiently so Trulia could display high-quality information on the site.

Concurrent Tweets Concurrent Tweets