Cascading provides data scientists at The Climate Corporation a solid foundation to develop advanced machine learning applications in Cascalog that get deployed directly onto Amazon EMR clusters consisting of 2000+ cores. This results in significantly improved productivity with lower operating costs.
Data scientists at The Climate Corporation chose to write their algorithms in Cascalog, a high-level Clojure-based query language built on Cascading. Cascading is a Java application framework that abstracts the MapReduce APIs in Apache Hadoop and gives developers a simpler way to build data processing workflows. In Cascalog, data scientists write compact expressions that represent complex batch-oriented AI and machine learning workflows. This improves productivity for the data scientists, many of whom are mathematicians rather than computer scientists, and lets them analyze complex data sets quickly without writing large, complicated MapReduce programs. Programmers at The Climate Corporation also use Cascading directly to create jobs inside Hadoop streaming that process additional batch-oriented data workflows.
All these workflows and data processing jobs are deployed directly onto Amazon Elastic MapReduce into their own dedicated clusters. Depending on the size of data sets and the complexity of the algorithms, clusters consisting of up to 200 processor cores are utilized for data normalization workflows, and clusters consisting of over 2000 processor cores are utilized for risk analysis and climate modeling workflows.
By utilizing Amazon Elastic MapReduce and Cascalog, data scientists at The Climate Corporation are able to focus on solving business challenges rather than worrying about setting up a complex infrastructure or trying to figure out how to use it to process the vast amounts of complex data.
The Climate Corporation manages its costs by running each workflow on its own dedicated Amazon Elastic MapReduce cluster. Resources are used only when they are needed, and the company does not have to invest in hardware and systems administrators to run a private shared cluster, where workflows would have to be optimized and scheduled to avoid resource contention.
Furthermore, Cascading gives data scientists at The Climate Corporation a common foundation for both their batch-oriented machine learning workflows in Cascalog and their Hadoop streaming workflows written directly in Cascading. These applications are developed locally on the developers’ desktops and then deployed directly onto dedicated Amazon Elastic MapReduce clusters for testing and production use. Because iterative development happens locally, little cluster time is spent on trial runs, which lets The Climate Corporation pay for infrastructure only when it is doing productive data processing.
The Climate Corporation protects the $3 trillion global agriculture industry from the financial impact of adverse weather—the cause of over 90% of crop loss—with automated, full-season weather insurance. Total Weather Insurance™ is a unique insurance program which dynamically determines the weather conditions that can make or break an individual grower’s yields based on crop, location and soil type, and then automatically customizes full-season weather protection for that grower’s farm.
To accurately model and price its insurance products, conduct risk analysis, and model climate conditions and their impact, The Climate Corporation amasses large volumes of data from various third-party sources such as the National Weather Service. Twenty-two weather data sets with three million new data points are ingested daily and put through lengthy ETL (Extract, Transform, and Load) workflows that normalize and filter the data for further analysis. Data scientists then apply advanced analytics and machine learning algorithms that allow The Climate Corporation to offer the best-priced products to its customers while minimizing its own risk and costs.
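The normalize-and-filter step of an ETL workflow like the one described above can be illustrated with a minimal sketch. This is plain Python, not Cascading or Cascalog code, and it is not The Climate Corporation's implementation. The record fields (`station`, `temp`, `unit`), the `Reading` type, and the plausibility range are all hypothetical, chosen only to show the shape of the step.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Reading:
    """Hypothetical common schema: station id and temperature in Celsius."""
    station: str
    temp_c: float


def normalize(raw: dict) -> Optional[Reading]:
    """Map one raw record onto the common schema, or return None if unusable."""
    station = (raw.get("station") or "").strip().upper()
    value = raw.get("temp")
    unit = (raw.get("unit") or "C").upper()
    if not station or value is None:
        return None
    try:
        temp = float(value)
    except (TypeError, ValueError):
        return None
    if unit == "F":
        # Convert Fahrenheit to Celsius so all sources share one unit.
        temp = (temp - 32.0) * 5.0 / 9.0
    elif unit != "C":
        return None
    return Reading(station, round(temp, 2))


def etl(records: Iterable[dict]) -> Iterator[Reading]:
    """Normalize each record, then keep only physically plausible readings."""
    for raw in records:
        reading = normalize(raw)
        # Assumed plausibility bounds, in Celsius; purely illustrative.
        if reading is not None and -90.0 <= reading.temp_c <= 60.0:
            yield reading


if __name__ == "__main__":
    sample = [
        {"station": " ks01 ", "temp": "50", "unit": "F"},
        {"station": "ks02", "temp": "n/a", "unit": "C"},
    ]
    for reading in etl(sample):
        print(reading)
```

In a Hadoop setting each of these functions would correspond to a per-record operation applied in parallel across the cluster; the sketch only shows the logic applied to a single record stream.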