Extending the EDW with Hadoop - Chicago Data Summit 2011
 

Slides from a talk given at the Chicago Data Summit on 4/26/11: "Extending the Enterprise Data Warehouse with Hadoop".



Presentation Transcript

    • Extending the Enterprise Data Warehouse with Hadoop. Robert Lancaster and Jonathan Seidman. Chicago Data Summit, April 26, 2011
    • Who We Are
      •  Robert Lancaster
         –  Solutions Architect, Hotel Supply Team
         –  rlancaster@orbitz.com
         –  @rob1lancaster
      •  Jonathan Seidman
         –  Lead Engineer, Business Intelligence/Big Data Team
         –  Co-founder/organizer of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG)
         –  jseidman@orbitz.com
         –  @jseidman
    • Launched: 2001, Chicago, IL
    • Why are we using Hadoop? Stop me if you’ve heard this before…
    • On Orbitz alone we do millions of searches and transactions daily, which leads to hundreds of gigabytes of log data every day.
    • Hadoop provides us with efficient, economical, scalable, and reliable storage and processing of these large amounts of data. [Chart: $ per TB]
    • And… Hadoop places no constraints on how data is processed.
    • Before Hadoop
    • With Hadoop
    • Access to this non-transactional data enables a number of applications…
    • Optimizing Hotel Search
    • Recommendations
    • Page Performance Tracking
    • Cache Analysis: 72% of queries are singletons and make up nearly a third of total search volume. A small number of queries (3%) make up more than a third of search volume. [Chart: reverse running totals of searches and queries by query frequency]
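
    Numbers like these can be derived with a frequency-of-frequencies analysis over the search logs: a word-count-style job first counts searches per unique query, then a second pass counts how many distinct queries occur at each search frequency. Below is a minimal sketch of that second pass in Hadoop’s Java MapReduce API; it assumes the first job emitted tab-delimited "query, count" lines, and the class name and field layout are illustrative, not the actual Orbitz jobs.

        import java.io.IOException;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Re-keys each "query<TAB>count" record by its search count, so a
        // summing reducer (e.g. the stock LongSumReducer) yields the number
        // of distinct queries at each frequency. Frequency-1 queries are the
        // singletons cited above. Input layout is an assumption.
        public class QueryFrequencyMapper
            extends Mapper<LongWritable, Text, LongWritable, LongWritable> {

          private static final LongWritable ONE = new LongWritable(1);
          private final LongWritable frequency = new LongWritable();

          @Override
          protected void map(LongWritable offset, Text line, Context context)
              throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length != 2) {
              return; // skip malformed records
            }
            frequency.set(Long.parseLong(fields[1]));
            context.write(frequency, ONE); // one more query at this frequency
          }
        }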
    • User Segmentation
    • All of this is great, but… Most of these efforts are driven by development teams. The challenge now is to unlock the value in this data by making it more available to the rest of the organization.
    • “Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being “magnetic”: attracting all the data sources that crop up within an organization regardless of data quality niceties.”*
      *MAD Skills: New Analysis Practices for Big Data
    • In a better world…
    • Integrating Hadoop with the Enterprise Data Warehouse. Robert Lancaster and Jonathan Seidman. Chicago Data Summit, April 26, 2011
    • The goal is a unified view of the data, allowing us to use the power of our existing tools for reporting and analysis.
    • BI vendors are working on integration with Hadoop…
    • And one more reporting tool…
    • Example Processing Pipeline for Web Analytics Data
    • Aggregating data for import into Data Warehouse
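
    As a sketch of what this aggregation step could look like (a hypothetical job, not the actual Orbitz pipeline), the rollup is a classic sum-by-key MapReduce: the mapper emits a composite day/page key per record, and the reducer, which doubles as a combiner to cut shuffle volume, sums the counts into a small delimited file suitable for a standard warehouse bulk load. The tab-delimited input layout is an assumption.

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class DailyPageViewRollup {

          // Assumed record layout: date<TAB>pageId<TAB>... (illustrative only).
          public static class ViewMapper
              extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text dayAndPage = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
              String[] fields = line.toString().split("\t");
              if (fields.length < 2) {
                return; // skip malformed records
              }
              dayAndPage.set(fields[0] + "\t" + fields[1]);
              context.write(dayAndPage, ONE);
            }
          }

          // Sums view counts per (day, page); safe to reuse as a combiner.
          public static class SumReducer
              extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {
              long sum = 0;
              for (LongWritable v : values) {
                sum += v.get();
              }
              context.write(key, new LongWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "daily-pageview-rollup");
            job.setJarByClass(DailyPageViewRollup.class);
            job.setMapperClass(ViewMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }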
    • Example Use Case: Beta Data Processing
    • Example Use Case – Beta Data Processing
    • Example Use Case – Beta Data Processing Output
    • Example Use Case: RCDC Processing
    • Example Use Case – RCDC Processing
    • Example Use Case: Click Data Processing
    • Click Data Processing – Current DW Processing: [Diagram: Web Servers → Web Server Logs → Data Cleansing (stored procedure) → ETL → DW; annotated “3 hours”, “2 hours”, and “~20% original data size”]
    • Click Data Processing – New Hadoop Processing: [Diagram: Web Servers → Web Server Logs → HDFS → Data Cleansing (MapReduce) → DW]
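
    A hypothetical sketch of the cleansing step in the new pipeline: because cleansing is record-at-a-time, it fits a map-only job (zero reducers), which scales out across the cluster instead of bottlenecking on a single stored-procedure server. The field positions and validation rules below are assumptions for illustration, not the actual Orbitz logic.

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class ClickLogCleanser {

          public static class CleanseMapper
              extends Mapper<LongWritable, Text, Text, NullWritable> {
            private final Text cleansed = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
              String raw = line.toString().trim();
              if (raw.isEmpty()) {
                return; // drop blank lines
              }
              String[] fields = raw.split("\t");
              if (fields.length < 5) {
                return; // drop records missing the assumed minimum fields
              }
              fields[2] = fields[2].toLowerCase(); // normalize assumed URL field
              StringBuilder out = new StringBuilder(fields[0]);
              for (int i = 1; i < fields.length; i++) {
                out.append('\t').append(fields[i]);
              }
              cleansed.set(out.toString());
              context.write(cleansed, NullWritable.get());
            }
          }

          public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "click-log-cleanse");
            job.setJarByClass(ClickLogCleanser.class);
            job.setMapperClass(CleanseMapper.class);
            job.setNumReduceTasks(0); // map-only: cleansing needs no aggregation
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }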
    • Conclusions
      •  Market is still immature, but Hadoop has already become a valuable business intelligence tool, and will become an increasingly important part of a BI infrastructure.
      •  Hadoop won’t replace your EDW, but any organization with a large EDW should at least be exploring Hadoop as a complement to their BI infrastructure.
      •  Use Hadoop to offload the time- and resource-intensive processing of large data sets so you can free up your data warehouse to serve user needs.
      •  The challenge now is making Hadoop more accessible to non-developers. Vendors are addressing this, so expect rapid advancements in Hadoop accessibility.
    • Oh, and also…
      •  Orbitz is looking for a Lead Engineer for the BI/Big Data team.
      •  Go to http://careers.orbitz.com/ and search for IRC19035.
    • References
      •  MAD Skills: New Analysis Practices for Big Data, Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, and Caleb Welton, 2009