Real World Machine Learning at Orbitz, Strata 2011


Slides from the "Real World Applications Panel: Machine Learning and Decision Support" at Strata 2011

  1. Machine Learning at Orbitz
     Robert Lancaster and Jonathan Seidman
     Strata 2011 | February 02, 2011
  2. Launched: 2001, Chicago, IL
  3. Why Start the Machine Learning Team at Orbitz?
     •  The team was created in 2009 with the goal of applying machine learning techniques to improve the customer experience.
     •  For example:
        –  Hotel sort optimization: How can we improve the ranking of hotel search results to show consumers hotels that more closely match their preferences?
        –  Cache optimization: Can we intelligently cache hotel rates to optimize the performance of hotel searches?
        –  Personalization/segmentation: Can we show targeted search results to specific consumer segments?
  4. Data Challenges
     •  The team immediately faced challenges getting access to data:
        –  The required analysis needs access to large amounts of data on user interaction with the site.
        –  This data is available in web analytics logs, but the required fields were not available in our data warehouse because of size considerations.
        –  Even worse, we had no archive of the data beyond several days.
        –  Size constraints aside, adding new data to the data warehouse takes considerable time and effort.
  5. New Data Infrastructure to Address These Challenges
     •  Hadoop provides a solution to these challenges by:
        –  Providing long-term storage of the entire raw dataset without placing constraints on how that data is processed.
        –  Allowing us to immediately take advantage of new web analytics data added to the site.
        –  Providing a platform for efficient analysis of data, as well as preparation of data for input to external processes for further analysis.
     •  Hive was added to the infrastructure to provide structure over the prepared data, facilitating ad-hoc queries and selection of specific data sets for analysis.
     •  Data stored in Hive not only supports machine learning efforts, but also provides metrics to analysts that are not available through other sources.
  6. New Data Infrastructure – Cont’d
     •  Hadoop and Hive are now being used by the machine learning team to:
        –  Extract data from logs for hotel sort and cache optimization analyses.
        –  Distribute complex cross-validation and performance evaluation operations.
        –  Extract data for clustering.
     •  Hadoop and Hive have also gained rapid adoption in the organization beyond the machine learning team: evaluating page download performance, searching production logs, keyword analysis, etc.
  7. Use Case – Hotel Cache Optimization
     Overview:
     Search methodology:
     •  Subset of total properties in a location (1 page at a time).
     •  Get “just enough” information to present to consumers.
     Caching:
     •  Reduces impact to suppliers (maintain “look-to-book” ratio).
     •  Reduces latency.
     •  Increases “coverage.”
     Optimization Goal: Improve the customer experience (reduce latency, increase coverage) when searching for hotel rates while controlling impact on suppliers (maintain look-to-book).
  8. Hotel Cache Optimization – Early Attempts
     Early approaches were well intended, but were not driven by analysis of the available data. For example:

     Theory: High amount of thrashing leads to eviction of more useful cache entries.
     Attempted Solution: Increase cache size.
     Result: No increase in measured coverage.
     Problem: No actual analysis on required cache size.

     Theory: Locally managed inventory represents “free” information and can be requested without limit to improve coverage.
     Attempted Solution: Don’t cache locally managed inventory. Increase the amount of local inventory requested with each user search.
     Result: No increase in measured coverage.
     Problem: Locally managed inventory doesn’t represent a large percentage of total inventory and is already highly preferenced.
  9. Hotel Cache Optimization – Data Driven Approaches
     •  Traffic Partitioning: Identify the subset of traffic that is most efficient and optimize that subset through prefetching and increased bursting.
     •  TTL Optimization: Use historic logs of availability and rate change information to predict volatility of hotel rates and optimize cache TTL.
  10. Hotel Cache Optimization – Traffic Distribution
      [Chart; vertical axis 0% to 100%, horizontal axis 1 to 20. Series: Queries, Searches, Reverse Running Total (Queries), Reverse Running Total (Searches). Data labels on the chart: 71.67%, 34.30%, 31.87%, 2.78%.]
      •  72% of queries are singletons and make up nearly a third of total search volume.
      •  A small number of queries (3%) make up more than a third of search volume.
  11. Optimize Hotel Cache – Traffic Partitioning
      •  Evaluate possible mechanisms for determining the most frequent queries.
      •  Favor mechanisms that give a high search/query ratio for the greatest percentage of search volume.
      •  Test for stability of the mechanism across multiple time periods.

      Partition Strategy  Description                                            Pct Queries  Pct Searches  Searches/Query
      Baseline            All traffic                                            100.00%      100.00%       2.19
      Top 50              Top 50 searched markets                                14.88%       26.76%        3.94
      Heuristic           Top 50 searched markets, weekend stay within 1 month.  0.87%        8.52%         21.4
      Enumeration         Queries repeated 5 or more times.                      3.45%        28.80%        18.29
      Prediction          TBD                                                    TBD          TBD           TBD
  12. Conclusions and Lessons Learned
      •  Start with a manageable problem (ease of measuring success, availability of data, etc.)
      •  Avoid thinking of the machine learning team as an R&D organization.
      •  Instead, foster machine learning approaches throughout the organization:
         –  Embed resources on actual feature teams.
         –  Machine learning study groups, etc.
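
The three columns in the traffic partitioning table on slide 11 (Pct Queries, Pct Searches, Searches/Query) can be computed from a search log in a few lines. The following is a minimal sketch, not the Orbitz implementation: it assumes the log has already been reduced to one query key per search, and the function name `partition_stats`, the key format, and the toy data are all hypothetical.

```python
from collections import Counter


def partition_stats(searches, in_partition):
    """Return (pct_queries, pct_searches, searches_per_query) for a partition.

    searches     -- iterable of query keys, one entry per search (assumed format)
    in_partition -- predicate taking (query_key, search_count)
    """
    counts = Counter(searches)               # distinct query -> number of searches
    total_queries = len(counts)
    total_searches = sum(counts.values())

    selected = {q: n for q, n in counts.items() if in_partition(q, n)}
    if not selected:
        return 0.0, 0.0, 0.0

    part_searches = sum(selected.values())
    return (
        100.0 * len(selected) / total_queries,   # share of distinct queries
        100.0 * part_searches / total_searches,  # share of search volume
        part_searches / len(selected),           # searches per query
    )


# Hypothetical toy log: 10 searches over 4 distinct queries.
log = (
    ["chicago|2011-02-04|2"] * 6
    + ["nyc|2011-02-11|1"] * 2
    + ["boise|2011-03-01|1", "reno|2011-03-02|3"]
)

baseline = partition_stats(log, lambda q, n: True)       # all traffic
enumeration = partition_stats(log, lambda q, n: n >= 5)  # repeated 5 or more times
singletons = partition_stats(log, lambda q, n: n == 1)   # searched exactly once

print(baseline, enumeration, singletons)
```

A partition is attractive for prefetching when it covers a large share of searches with a small share of queries, which is what a high searches-per-query ratio expresses. Running the same function over logs from several time periods is one way to check the stability the slide asks for.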