Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the Enterprise Data Warehouse with Hadoop
Robert Lancaster and Jonathan Seidman
Chicago Data Summit, April 26, 2011
Who We Are
• Robert Lancaster – Solutions Architect, Hotel Supply Team – firstname.lastname@example.org – @rob1lancaster
• Jonathan Seidman – Lead Engineer, Business Intelligence/Big Data Team – Co-founder/organizer of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG) – email@example.com – @jseidman
Cache Analysis
[Chart: reverse running totals of queries and searches, plotted by query frequency]
• 72% of queries are singletons and make up nearly a third (31.87%) of total search volume.
• A small number of queries (3%) make up more than a third (34.30%) of search volume.
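The singleton numbers above can be reproduced from a frequency table of searches per distinct query. The sketch below shows the computation on made-up toy counts, not the Orbitz data; the function name is ours.

```python
# Toy illustration of the cache analysis: given a count of searches per
# distinct query, what fraction of queries are singletons, and what share
# of total search volume do those singletons contribute?

def singleton_share(query_counts):
    """query_counts: iterable of per-query search counts."""
    counts = list(query_counts)
    total_queries = len(counts)            # number of distinct queries
    total_searches = sum(counts)           # total search volume
    singletons = sum(1 for c in counts if c == 1)
    return (singletons / total_queries,    # fraction of distinct queries
            singletons / total_searches)   # fraction of search volume

# Example: 7 of 10 distinct queries were seen once; 3 were heavy hitters.
frac_queries, frac_volume = singleton_share([1, 1, 1, 1, 1, 1, 1, 5, 4, 3])
# frac_queries = 0.7, frac_volume = 7/19 ≈ 0.368
```

On real data the counts would come from a GROUP BY (or a MapReduce word-count-style job) over the search logs.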
All of this is great, but…
Most of these efforts are driven by development teams. The challenge now is to unlock the value in this data by making it more available to the rest of the organization.
“Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being “magnetic”: attracting all the data sources that crop up within an organization regardless of data quality niceties.”*
*MAD Skills: New Analysis Practices for Big Data
Example Use Case: Click Data Processing
Click Data Processing – Current DW Processing
Web Servers → Web Server Logs → ETL → Data Cleansing (stored procedure) → DW
• 3 hours + 2 hours of processing
• Cleansed data is ~20% of the original data size
Click Data Processing – New Hadoop Processing
Web Servers → Web Server Logs → HDFS → Data Cleansing (MapReduce) → DW
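A minimal sketch of what the "Data Cleansing (MapReduce)" step might look like as a Hadoop Streaming mapper. The field positions, filters, and log format (Apache combined format) are our assumptions for illustration, not the pipeline actually described in the talk.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper that cleanses web server log
# lines before they are loaded into the DW: drops malformed lines,
# error responses, and static-asset requests, and emits a compact
# tab-separated record. Field layout assumes Apache combined format.
import sys

def clean(line):
    """Return a tab-separated record for a good log line, else None."""
    parts = line.split()
    if len(parts) < 9:
        return None                      # malformed line
    ip = parts[0]
    timestamp = parts[3].lstrip('[')     # e.g. 26/Apr/2011:10:00:00
    method = parts[5].lstrip('"')        # e.g. GET
    url = parts[6]
    status = parts[8]
    if not status.isdigit() or status.startswith(('4', '5')):
        return None                      # drop client/server errors
    if url.endswith(('.gif', '.png', '.css', '.js')):
        return None                      # drop static-asset requests
    return '\t'.join([ip, timestamp, method, url, status])

if __name__ == '__main__':
    # Hadoop Streaming feeds log lines on stdin, one per mapper input record.
    for line in sys.stdin:
        record = clean(line)
        if record:
            print(record)
```

Because the mapper is a plain filter with no reduce step, this kind of cleansing parallelizes trivially across the cluster, which is what lets Hadoop absorb the hours of processing the slide attributes to the stored-procedure approach.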
Conclusions
• The market is still immature, but Hadoop has already become a valuable business intelligence tool and will become an increasingly important part of a BI infrastructure.
• Hadoop won’t replace your EDW, but any organization with a large EDW should at least be exploring Hadoop as a complement to its BI infrastructure.
• Use Hadoop to offload the time- and resource-intensive processing of large data sets, freeing your data warehouse to serve user needs.
• The challenge now is making Hadoop more accessible to non-developers. Vendors are addressing this, so expect rapid advancements in Hadoop accessibility.
Oh, and also…
• Orbitz is looking for a Lead Engineer for the BI/Big Data team.
• Go to http://careers.orbitz.com/ and search for IRC19035.
References
• MAD Skills: New Analysis Practices for Big Data. Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, and Caleb Welton, 2009.