Machine Learning at Orbitz
Robert Lancaster and Jonathan Seidman
                          Strata 2011
                    February 02 | 2011
Launched: 2001, Chicago, IL




Why Start the Machine Learning Team at Orbitz?


•  Team was created in 2009 with the goal of applying machine
   learning techniques to improve the customer experience.
•  For example:
   –  Hotel sort optimization: How can we improve the ranking of
      hotel search results in order to show consumers hotels that
      more closely match their preferences?
   –  Cache optimization: Can we intelligently cache hotel rates in
      order to optimize the performance of hotel searches?
   –  Personalization/segmentation: Can we show targeted search
      results to specific consumer segments?




Data Challenges


•  The team immediately faced challenges getting access to data:
   –  The required analysis needs access to large
      amounts of data on user interaction with the site.
   –  This data is available in web analytics logs, but required
      fields were not available in our data warehouse because of
      size considerations.
   –  Even worse, we had no archive of the data beyond several
      days.
   –  Size constraints aside, getting new data added to the data
      warehouse takes considerable time and effort.




New Data Infrastructure to Address These Challenges

•  Hadoop provides a solution to these challenges by:
   –  Providing long-term storage of the entire raw dataset without
      placing constraints on how that data is processed.
   –  Allowing us to immediately take advantage of new web
      analytics data added to the site.
   –  Providing a platform for efficient analysis of data, as well as
      preparation of data for input to external processes for further
      analysis.
•  Hive was added to the infrastructure to provide structure over
   the prepared data, facilitating ad-hoc queries and selection of
   specific data sets for analysis.
•  Data stored in Hive not only supports machine learning efforts,
   but also provides metrics to analysts not available through
   other sources.

New Data Infrastructure – Cont’d

•  Hadoop and Hive are now being used by the machine learning
   team to:
  –  Extract data from logs for hotel sort and cache optimization
     analyses.
  –  Distribute complex cross-validation and performance
     evaluation operations.
  –  Extract data for clustering.
•  Hadoop and Hive have also gained rapid adoption in the
   organization beyond the machine learning team: evaluating
   page download performance, searching production logs,
   keyword analysis, etc.




Use Case – Hotel Cache Optimization

Overview:
  Search methodology:
     •  Subset of total properties in a location (1 page at a time).
     •  Get “just enough” information to present to consumers.
  Caching:
     •  Reduces impact to suppliers (maintain “look-to-book” ratio).
     •  Reduces latency.
     •  Increases “coverage.”
Optimization Goal:
  Improve the customer experience (reduce latency, increase
    coverage) when searching for hotel rates while controlling impact
    on suppliers (maintain look-to-book).




Hotel Cache Optimization – Early Attempts


Early approaches were well-intentioned but were not driven by analysis of
  the available data. For example:

Theory: High amount of thrashing leads to eviction of more useful cache entries.
Attempted Solution: Increase cache size.
Result: No increase in measured coverage.
Problem: No actual analysis of the required cache size.
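
The kind of analysis that was missing here can be sketched by replaying a log of past queries through a simulated cache at several sizes. This is a minimal illustration, not the method Orbitz used: it assumes a plain LRU eviction policy, ignores TTL expiry, and uses hit rate as a stand-in for coverage. The function name `lru_hit_rate` and the sample log are hypothetical.

```python
from collections import OrderedDict

def lru_hit_rate(queries, capacity):
    """Replay query keys through an LRU cache and return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    for q in queries:
        if q in cache:
            hits += 1
            cache.move_to_end(q)           # mark as most recently used
        else:
            cache[q] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(queries) if queries else 0.0

# Hypothetical replay log: each entry stands for one search's query key.
log = ["a", "b", "a", "c", "b", "d", "a", "e", "a", "b"]
rates = {size: lru_hit_rate(log, size) for size in (1, 2, 4, 8)}
```

On this toy log the hit rate stops improving between sizes 4 and 8, which is the sort of plateau that would show a larger cache is not the bottleneck.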


Theory: Locally managed inventory represents “free” information and can be
  requested without limit to improve coverage.
Attempted Solution: Don’t cache locally managed inventory. Increase the amount
   of local inventory requested with each user search.
Result: No increase in measured coverage.
Problem: Locally managed inventory doesn’t represent a large percentage of total
  inventory and is already highly preferenced.


Hotel Cache Optimization – Data Driven Approaches


Data Driven Approaches:


  Traffic Partitioning: Identify the subset of traffic that is most
    efficient and optimize that subset through prefetching and
    increased bursting.


  TTL Optimization: Use historic logs of availability and rate
   change information to predict volatility of hotel rates and
   optimize cache TTL.
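
One simple way to turn rate-change history into a TTL is sketched below. The slides do not describe the actual model, so everything here is an assumption for illustration: the function `suggest_ttl`, the use of the mean gap between observed rate changes as a volatility estimate, the `fraction` multiplier, and the clamping bounds are all hypothetical.

```python
from statistics import mean

def suggest_ttl(change_times, fraction=0.5, min_ttl=300, max_ttl=86400):
    """Suggest a cache TTL in seconds for one hotel.

    change_times: sorted epoch seconds at which the hotel's rate or
    availability was observed to change (hypothetical log-derived input).
    """
    if len(change_times) < 2:
        return min_ttl  # too little history: stay conservative
    gaps = [later - earlier for earlier, later in zip(change_times, change_times[1:])]
    ttl = fraction * mean(gaps)  # cache for a fraction of the typical gap
    return int(max(min_ttl, min(max_ttl, ttl)))

volatile_ttl = suggest_ttl([0, 600, 1200, 1800])  # rate changes every 10 minutes
stable_ttl = suggest_ttl([0, 7200, 14400])        # rate changes every 2 hours
```

Volatile hotels end up with short TTLs and stable hotels with long ones, so cached rates for stable hotels can be reused longer without extra supplier requests.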




Hotel Cache Optimization – Traffic Distribution

[Chart: y-axis 0.00% to 100.00%, x-axis 1 to 20.
 Series: Queries; Searches; Reverse Running Total (Searches);
 Reverse Running Total (Queries).
 Labeled data points: 71.67%, 31.87%, 34.30%, 2.78%.]

•  72% of queries are singletons and make up nearly a third of total
   search volume.
•  A small number of queries (3%) make up more than a third of search
   volume.
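
A distribution like the one charted above can be computed from a search log roughly as follows. This is a sketch under assumptions: each log entry is taken to be a query key (one entry per search executed), and the function name `traffic_distribution` and the row field names are hypothetical.

```python
from collections import Counter

def traffic_distribution(searches):
    """Group distinct queries by how often they repeat.

    searches: iterable of query keys, one entry per search executed.
    Returns one row per repeat count with the share of distinct queries,
    the share of searches, and reverse running totals of both.
    """
    per_query = Counter(searches)        # query key -> number of searches
    freq = Counter(per_query.values())   # repeat count -> number of distinct queries
    total_queries = len(per_query)
    total_searches = sum(per_query.values())
    rows = [
        {
            "repeats": n,
            "pct_queries": freq[n] / total_queries,
            "pct_searches": n * freq[n] / total_searches,
        }
        for n in sorted(freq)
    ]
    # Reverse running totals: accumulate from the most repeated queries down.
    run_q = run_s = 0.0
    for row in reversed(rows):
        run_q += row["pct_queries"]
        run_s += row["pct_searches"]
        row["rev_total_queries"] = run_q
        row["rev_total_searches"] = run_s
    return rows

# Hypothetical log: three singleton queries, one repeated twice, one repeated five times.
sample = ["q1", "q2", "q3", "q4", "q4", "q5", "q5", "q5", "q5", "q5"]
dist = traffic_distribution(sample)
```

In the sample, singletons are 60% of distinct queries but only 30% of searches, while the single most repeated query accounts for half of all searches.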




Optimize Hotel Cache – Traffic Partitioning


•  Evaluate possible mechanisms for determining most
   frequent queries.
•  Favor mechanisms that give a high search/query ratio for
   the greatest percentage of search volume.
•  Test for stability of mechanism across multiple time periods.

Partition Strategy  Description                                            Pct Queries  Pct Searches  Searches/Query
Baseline            All traffic                                                100.00%       100.00%            2.19
Top 50              Top 50 searched markets                                     14.88%        26.76%            3.94
Heuristic           Top 50 searched markets, weekend stay within 1 month.        0.87%         8.52%            21.4
Enumeration         Queries repeated 5 or more times.                            3.45%        28.80%           18.29
Prediction          TBD                                                            TBD           TBD             TBD
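
The three metrics in the table can be computed for any candidate partition along these lines. This is a sketch only: the function `partition_metrics`, the `in_partition` predicate signature, and the sample log are hypothetical, and the example predicate mirrors the Enumeration strategy (queries repeated 5 or more times).

```python
from collections import Counter

def partition_metrics(searches, in_partition):
    """Compute Pct Queries, Pct Searches and Searches/Query for a partition.

    searches: iterable of query keys, one entry per search executed.
    in_partition: predicate (query_key, search_count) -> bool.
    """
    per_query = Counter(searches)
    total_queries = len(per_query)
    total_searches = sum(per_query.values())
    selected = {q: n for q, n in per_query.items() if in_partition(q, n)}
    part_queries = len(selected)
    part_searches = sum(selected.values())
    return {
        "pct_queries": part_queries / total_queries,
        "pct_searches": part_searches / total_searches,
        "searches_per_query": part_searches / part_queries if part_queries else 0.0,
    }

# Hypothetical log: three singletons, one query seen twice, one seen five times.
sample = ["q1", "q2", "q3", "q4", "q4", "q5", "q5", "q5", "q5", "q5"]
baseline = partition_metrics(sample, lambda q, n: True)
enumeration = partition_metrics(sample, lambda q, n: n >= 5)
```

The columns are related: Searches/Query for a partition equals the baseline ratio times Pct Searches divided by Pct Queries. The table's Top 50 row is consistent with this, since 2.19 × 26.76 / 14.88 ≈ 3.94. Running the same function over logs from different time periods is one way to check the stability of a mechanism.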
  


Conclusions and Lessons Learned


•  Start with a manageable problem (ease of measuring success,
   availability of data, etc.)
•  Avoid thinking of machine learning team as an R&D
   organization.
•  Instead, foster machine learning approaches throughout the
   organization:
   –  Embed resources on actual feature teams.
   –  Machine learning study groups, etc.





Real World Machine Learning at Orbitz, Strata 2011
