Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan Seidman & Rob Lancaster, Orbitz Worldwide

Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.

Speaker Notes

  • Most people think of orbitz.com, but Orbitz Worldwide is really a global portfolio of leading online travel consumer brands including Orbitz, CheapTickets, The Away Network, ebookers, and HotelClub. Orbitz also provides business-to-business services: Orbitz Worldwide Distribution provides hotel booking capabilities to a number of leading carriers such as Amtrak, Delta, LAN, KLM, and Air France, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. Orbitz was founded in 1999, and the Orbitz site launched in 2001.
  • The initial motivation was to solve a particular business problem. Orbitz wanted to be able to use intelligent algorithms to optimize various site functions, for example optimizing hotel search by showing consumers hotels that more closely match their preferences, leading to more bookings.
  • Improving hotel search requires access to such data as which hotels users saw in search results, which hotels they clicked on, and which hotels were actually booked. Much of this data was available in web analytics logs.
  • Our data warehouse contains a full record of all transactions, but much of the required non-transactional data was either not stored, or stored in aggregated fields.
  • Hadoop is being used to analyze and optimize cache performance, in this case the hotel rate cache. This type of analysis will allow us to ensure that more requests can be served from the cache, optimizing the user experience and improving our “look-to-book” metrics. Hadoop is also used to crunch data for input to a system that recommends products to users. Although we use third-party sites to monitor site performance, Hadoop allows the front-end team to provide detailed reports on page download performance, providing valuable trending data not available from other sources.
  • The first visualization is just a plot of the latitude/longitude of hotel bookings for the month, illustrating the global nature of the business. The second visualization is a simple price prediction for air fares. The data is also used for analysis of user segments, which can drive personalization; this chart shows that Safari users click on hotels with higher mean and median prices than other users. These are just a handful of examples of how Hadoop is driving business value.
  • Recently received an email from a user seeking access to Hive. Sent him a detailed email with info on accessing Hive, etc. Received an email back basically saying “you lost me at ssh”.
  • Making the Big Data team part of the BI team probably makes Orbitz unique, but it’s a reflection of the importance of big data to driving BI for the company.
  • Both of these are probably common use cases at other companies employing Hadoop alongside an EDW.
  • Hadoop will be used to transform web analytics data into a dimensional model, allowing multiple business units to generate reports providing valuable intelligence to improve business results.
  • Processing of click data gathered by web servers. This click data contains marketing info. The data cleansing step is done inside the data warehouse using a stored procedure, and further downstream processing generates the final data sets for reporting. Although this processing produces the required user reports, it consumes considerable time and resources on the data warehouse, resources that could otherwise be used for reports, queries, etc.
  • The ETL step is eliminated; instead, raw logs will be uploaded to HDFS, which is a much faster process. Moving the data cleansing to MapReduce shifts the “heavy lifting” of processing these relatively large data sets to Hadoop, taking advantage of Hadoop’s efficiencies and greatly speeding up the processing. (A sketch of what such a cleansing job might look like follows below.)
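
Purely as an illustration (the talk shows no code), a map-only MapReduce job along the following lines could take over the cleansing role of the stored procedure. The tab-delimited field layout, validation rules, and class/counter names are all invented for the example:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /**
     * Map-only cleansing step: drops malformed click records and passes
     * surviving rows through unchanged. The assumed tab-delimited layout
     * (timestamp, session id, marketing code, URL) is illustrative only.
     */
    public class ClickLogCleanMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

      private static final int EXPECTED_FIELDS = 4;

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t", -1);
        // Reject rows with the wrong field count or an empty session id.
        if (fields.length != EXPECTED_FIELDS || fields[1].isEmpty()) {
          context.getCounter("cleansing", "malformed").increment(1);
          return;
        }
        context.write(NullWritable.get(), value);
      }
    }

Because the job only filters, no reduce phase is needed; the cleansed output lands in HDFS ready to be loaded into the warehouse.
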
  • Data was apparently available in the DW, but wasn’t modeled to enable efficient querying. Points up a strength of Hadoop, which is that it places no constraints on how data is processed.
  • This provides an example of a typical processing flow for the large volumes of non-transactional data we’re collecting. This processing allows us to convert large volumes of unstructured data into structured data that can be queried, extracted, etc. for further processing.
  • This type of processing also allows us to summarize large volumes of data into a data set that can be exported to the data warehouse, allowing us to query and report on that data using all of our standard BI tools. (A sketch of such a summarization job follows below.)
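
To make the summarize-then-export idea concrete, here is a hedged sketch of such a job: it rolls cleansed click records up into daily click counts per hotel, a data set small enough to load into the warehouse (the export itself could be done with a tool such as Sqoop). The field positions, metric, and class names are assumptions for illustration, not the actual Orbitz jobs:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /**
     * Summarizes cleansed click records into daily counts per hotel, a
     * small data set suitable for export to the data warehouse.
     * Assumed layout: fields[0] = yyyy-MM-dd date, fields[2] = hotel id.
     */
    public class DailyHotelClickSummary {

      public static class SummaryMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t");
          // Emit (date + hotel id, 1) for each click record.
          outKey.set(fields[0] + "\t" + fields[2]);
          context.write(outKey, ONE);
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int total = 0;
          for (IntWritable v : values) {
            total += v.get();
          }
          context.write(key, new IntWritable(total));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "daily hotel clicks");
        job.setJarByClass(DailyHotelClickSummary.class);
        job.setMapperClass(SummaryMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate map output
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
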

Transcript

  • 1. Extending the Enterprise Data Warehouse with Hadoop. Robert Lancaster and Jonathan Seidman, Hadoop World 2011, November 8, 2011.
  • 2. Who We Are
    • Robert Lancaster
      • Solutions Architect, Hotel Supply Team
      • [email_address]
      • @rob1lancaster
      • Co-organizer of Chicago Big Data and Chicago Machine Learning Study Group.
    • Jonathan Seidman
      • Lead Engineer, Business Intelligence/Big Data Team
      • Co-founder/organizer of Chicago Hadoop User Group and Chicago Big Data
      • [email_address]
      • @jseidman
  • 3. Launched in 2001. Over 160 million bookings.
  • 4. Some History…
  • 5. In 2009…
    • The Machine Learning team is formed to improve site performance. For example, improving hotel search results.
    • This required access to large volumes of behavioral data for analysis.
      • Fortunately, the required data was collected in session data stored in web analytics logs.
  • 6. The Problem…
    • The only archive of the required data went back about two weeks.
    (Diagram: the data warehouse holds transactional data (e.g. bookings) and aggregated non-transactional data; detailed non-transactional data (e.g. searches) is not retained.)
  • 7. Hadoop Provided a Solution… (Diagram: Hadoop stores the detailed non-transactional data, i.e. what every user sees, clicks, etc.; the data warehouse keeps transactional data (e.g. bookings) and aggregated non-transactional data.)
  • 8. Deploying Hadoop Enabled Multiple Applications…
  • 9. And Useful Analyses…
  • 10. But Brought New Challenges…
    • Most of these efforts are driven by development teams.
    • The challenge now is unlocking the value of this data for non-technical users.
  • 11. In Early 2011…
    • Big Data team is formed under Business Intelligence team at Orbitz Worldwide.
    • Allows the Big Data team to work more closely with the data warehouse and BI teams.
    • Reflects the importance of big data to the future of the company.
  • 12. A View Shared Beyond Orbitz… “We strongly believe that Hadoop is the nucleus of the next-generation cloud EDW… but that promise is still three to five years from fruition.” (James Kobielus, Forrester Research, “Hadoop, Is It Soup Yet?”)
  • 13. Two Primary Ways We Use Hadoop to Complement the EDW
    • Extraction and transformation of data for loading into the data warehouse – “ETL”.
    • Off-loading of analysis from the data warehouse.
  • 14. ETL Example: Proposed Dimensional Model. (Diagram: raw logs are processed by Hadoop into a dimensional model.)
  • 15. ETL Example: Click Data Processing. (Diagram, current processing in the data warehouse: web server logs are loaded via ETL into the DW, cleansed by a stored procedure, and written back to the DW; the process takes several hours and reduces the data to ~20% of its original size.)
  • 16. ETL Example: Click Data Processing
    • Moving the processing to Hadoop will:
      • Remove load from the data warehouse.
      • Add additional attributes for processing.
      • Allow processing to be run more frequently.
    (Diagram, proposed processing in Hadoop: web server logs are loaded directly into Hadoop, cleansed by a MapReduce job, and the results loaded into the DW.)
  • 17. Analysis Example: Geo-Targeting Ads
    • Facilitated analysis that allows for more personalized ad content.
    • Allowed the marketing team to analyze over a year’s worth of search data.
    • Provided analysis that was difficult to perform in the data warehouse.
  • 18. BI Vendors Are Working on Hadoop Integration. Both big (relatively)…
  • 19. And small…
  • 20. Example Processing Pipeline for Web Analytics Data
  • 21. Example Use Case: Selection Errors
  • 22. Use Case – Selection Errors: Introduction
    • Multiple points of entry.
    • Multiple paths through site.
    • Goal: tie events together to form a picture of customer behavior.
  • 23. Use Case – Selection Errors: Processing
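
The slides don’t include the processing code, but tying events together this way is commonly expressed as a sessionization job: group events by session id, then reassemble each session’s path in time order. The input layout (session id, ISO-8601 timestamp, event name) and class names below are assumptions for illustration:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /**
     * Sessionization sketch: groups raw events by session id and
     * reassembles each customer's path through the site in time order.
     */
    public class Sessionize {

      public static class EventMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t");
          if (fields.length < 3) {
            return; // skip malformed records
          }
          // Key by session id; value carries timestamp + event name.
          context.write(new Text(fields[0]),
              new Text(fields[1] + "\t" + fields[2]));
        }
      }

      public static class PathReducer
          extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
          // Buffer and sort this session's events by timestamp. A
          // production job would use secondary sort to avoid buffering.
          List<String> events = new ArrayList<String>();
          for (Text v : values) {
            events.add(v.toString());
          }
          Collections.sort(events);
          StringBuilder path = new StringBuilder();
          for (String e : events) {
            if (path.length() > 0) {
              path.append(" -> ");
            }
            path.append(e.split("\t")[1]); // keep the event name only
          }
          context.write(key, new Text(path.toString()));
        }
      }
    }
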
  • 24. Use Case – Selection Errors: Visualization
  • 25. Example Use Case: Beta Data
  • 26. Use Case – Beta Data: Introduction
    • Hotel Sort Optimization
    • Compare A vs. B
    • Web Analytics Data
      • What the user saw.
      • How the user behaved.
    • Server Log Data
      • Which sorting behavior was used.
  • 27. Use Case – Beta Data: Processing
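
As a sketch of how the two data sources might be tied together: a reduce-side join on session id can attribute each session’s behavior (from the web analytics logs) to the sort variant that served it (from the server logs). Record layouts and class names are invented for the example, and the driver, which would attach each input to its mapper with MultipleInputs.addInputPath, is omitted:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /**
     * Reduce-side join sketch: server logs record which sort variant
     * (A/B) served a session; web analytics logs record how the user
     * behaved. Joining on session id attributes behavior to a variant.
     */
    public class BetaDataJoin {

      /** Server log side: emits (session id, "V" + variant). */
      public static class VariantMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] f = value.toString().split("\t"); // session id, variant
          if (f.length < 2) {
            return;
          }
          context.write(new Text(f[0]), new Text("V" + f[1]));
        }
      }

      /** Web analytics side: emits (session id, "E" + event). */
      public static class EventMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] f = value.toString().split("\t"); // session id, event
          if (f.length < 2) {
            return;
          }
          context.write(new Text(f[0]), new Text("E" + f[1]));
        }
      }

      /** Joins the two sides, emitting (variant, event) per session. */
      public static class JoinReducer
          extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
          String variant = null;
          List<String> events = new ArrayList<String>();
          for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("V")) {
              variant = s.substring(1);
            } else {
              events.add(s.substring(1));
            }
          }
          if (variant == null) {
            return; // no variant recorded for this session
          }
          for (String event : events) {
            context.write(new Text(variant), new Text(event));
          }
        }
      }
    }
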
  • 28. Use Case – Beta Data: Visualization
  • 29. Example Use Case: RCDC
  • 30. Use Case – RCDC: Introduction
    • Understand and improve cache behavior.
    • Improve “coverage”:
      • Traditionally we search one page of hotels at a time.
      • Get “just enough” information to present to consumers.
      • Increase the amount of availability information we have when a consumer performs a search.
    • Data is needed to support needs beyond reporting.
  • 31. Use Case – RCDC: Processing
  • 32. Use Case – RCDC: Visualization
  • 33. Conclusions
    • Hadoop market is still immature, but growing quickly. Better tools are on the way.
      • Look beyond the usual (enterprise) suspects. Many of the most interesting companies in the big data space are small startups.
    • Hadoop will not replace your data warehouse, but any organization with a large data warehouse should at least be exploring Hadoop as a complement to their BI infrastructure.
  • 34. Conclusions
    • Work closely with your existing data management teams.
      • Your idea of what constitutes “big data” might quickly diverge from theirs.
    • The flip-side to this is that Hadoop can be an excellent tool to off-load resource-consuming jobs from your data warehouse.