Why are we using Hadoop? Stop me if you’ve heard this before…
On Orbitz alone we do millions of searches and transactions daily, which leads to hundreds of gigabytes of log data every day.
Hadoop provides us with efficient, economical, scalable, and reliable storage and processing of these large amounts of data. [Chart: $ per TB]
And… Hadoop places no constraints on how data is processed.
Before Hadoop
With Hadoop
Access to this non-transactional data enables a number of applications…
Optimizing Hotel Search
Page Performance Tracking
Cache Analysis: A small number of queries (3%) make up more than a third of search volume.
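A minimal sketch of how a concentration figure like the one above could be computed from a list of search queries. The function name `head_share` and the sample data are hypothetical; the slides do not describe how the analysis was actually implemented.

```python
from collections import Counter


def head_share(queries, top_fraction):
    """Share of total volume covered by the most frequent
    `top_fraction` of distinct queries (hypothetical helper)."""
    counts = Counter(queries)
    if not counts:
        return 0.0
    # Query frequencies, most popular first.
    ranked = sorted(counts.values(), reverse=True)
    # Number of distinct queries in the "head", at least one.
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Illustrative data only: 4 distinct queries, 10 searches in total.
sample = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
print(head_share(sample, 0.5))  # top 2 of 4 distinct queries cover 8 of 10 searches
```

On real logs, a high share for a small `top_fraction` suggests that caching results for the most frequent queries could serve a large part of search traffic.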
User Segmentation
All of this is great, but…
Most of these efforts are driven by development teams.
The challenge now is to unlock the value in this data by making it more available to the rest of the organization.
“Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being “magnetic”: attracting all the data sources that crop up within an organization regardless of data quality niceties.”*
*MAD Skills: New Analysis Practices for Big Data
In a better world…
Integrating Hadoop with the Enterprise Data Warehouse
Robert Lancaster and Jonathan Seidman
Chicago Data Summit, April 26, 2011
The goal is a unified view of the data, allowing us to use the power of our existing tools for reporting and analysis.
BI vendors are working on integration with Hadoop…
And one more reporting tool…
Example Processing Pipeline for Web Analytics Data
Aggregating data for import into Data Warehouse
Example Use Case: Beta Data Processing
Example Use Case: Beta Data Processing Output
Example Use Case: RCDC Processing
Example Use Case: Click Data Processing
Click Data Processing – Current DW Processing
[Diagram: Web Servers → Web Server Logs → ETL → DW → Data Cleansing (stored procedure) → DW. Annotations: 3 hours; 2 hours; ~20% original data size]
Click Data Processing – New Hadoop Processing
[Diagram: Web Servers → Web Server Logs → HDFS → Data Cleansing (MapReduce) → DW]
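A minimal sketch of what a map-only cleansing step over web server logs could look like, written as a Python function suitable for use with Hadoop Streaming. The log layout (tab-separated timestamp, session ID, URL, user agent), the bot filter, and the names `clean_line` and `run_mapper` are all assumptions for illustration; the slides do not show the actual cleansing logic or log format.

```python
import sys

# Assumed (hypothetical) log layout: timestamp, session_id, url, user_agent,
# separated by tabs. The real log format is not given in the slides.
EXPECTED_FIELDS = 4
BOT_MARKERS = ("bot", "crawler", "spider")  # illustrative filter only


def clean_line(line):
    """Return a cleaned tab-separated record, or None if the line is dropped."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        return None  # malformed record
    timestamp, session_id, url, user_agent = (f.strip() for f in fields)
    if not timestamp or not session_id or not url:
        return None  # a required field is missing
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        return None  # looks like automated traffic
    # Emit only the fields assumed to be needed downstream.
    return "\t".join((timestamp, session_id, url))


def run_mapper(lines, out):
    """Map-only cleansing: write surviving records to `out`, return the count."""
    kept = 0
    for line in lines:
        record = clean_line(line)
        if record is not None:
            out.write(record + "\n")
            kept += 1
    return kept
```

As a streaming mapper, the script would end with a call such as `run_mapper(sys.stdin, sys.stdout)`, so that only cleaned records are written out for loading into the data warehouse.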
The market is still immature, but Hadoop has already become a valuable business intelligence tool and will become an increasingly important part of BI infrastructure.
Hadoop won’t replace your EDW, but any organization with a large EDW should at least be exploring Hadoop as a complement to its BI infrastructure.
Use Hadoop to offload the time- and resource-intensive processing of large data sets so you can free up your data warehouse to serve user needs.
The challenge now is making Hadoop more accessible to non-developers. Vendors are addressing this, so expect rapid advancements in Hadoop accessibility.
Oh, and also…
Orbitz is looking for a Lead Engineer for the BI/Big Data team.
Go to http://careers.orbitz.com/ and search for IRC19035.
*Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, and Caleb Welton. MAD Skills: New Analysis Practices for Big Data. 2009.