Real World Machine Learning at Orbitz, Strata 2011

Machine Learning at Orbitz
Robert Lancaster and Jonathan Seidman
Strata 2011
February 02 | 2011

Launched: 2001, Chicago, IL


Why Start the Machine Learning Team at Orbitz?

•  The team was created in 2009 to apply machine learning
techniques to improve the customer experience.
•  For example:
–  Hotel sort optimization: How can we improve the ranking of
hotel search results in order to show consumers hotels that
more closely match their preferences?
–  Cache optimization: Can we intelligently cache hotel rates in
order to optimize the performance of hotel searches?
–  Personalization/segmentation: Can we show targeted search
results to specific consumer segments?


Data Challenges

•  The team immediately faced challenges getting access to data:
–  The required analysis needs access to large amounts of
data on user interaction with the site.
–  This data is available in web analytics logs, but required
fields were not available in our data warehouse because of
size considerations.
–  Even worse, we had no archive of the data beyond several
days.
–  Size constraints aside, it takes considerable time and effort
to get new data added to the data warehouse.


New Data Infrastructure to Address These Challenges

•  Hadoop provides a solution to these challenges by:
–  Providing long-term storage of entire raw dataset without
placing constraints on how that data is processed.
–  Allowing us to immediately take advantage of new web
analytics data added to the site.
–  Providing a platform for efficient analysis of data, as well as
preparation of data for input to external processes for further
analysis.
•  Hive was added to the infrastructure to provide structure over
the prepared data, facilitating ad-hoc queries and selection of
specific data sets for analysis.
•  Data stored in Hive not only supports machine learning efforts,
but also provides metrics to analysts not available through
other sources.


New Data Infrastructure – Cont’d

•  Hadoop and Hive are now being used by the machine learning
team to:
–  Extract data from logs for hotel sort and cache optimization
analyses.
–  Distribute complex cross-validation and performance
evaluation operations.
–  Extract data for clustering.
•  Hadoop and Hive have also gained rapid adoption in the
organization beyond the machine learning team: evaluating
page download performance, searching production logs,
keyword analysis, etc.
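
As a rough illustration of the log-extraction step, the sketch below counts searches per distinct hotel query in the style of a Hadoop Streaming mapper and reducer, run locally. The tab-separated log layout (timestamp, market, check-in date, nights) and all function names are assumptions made for illustration; the slides do not describe the actual log format or jobs.

```python
from itertools import groupby

# Hypothetical log layout (not from the slides):
# timestamp <TAB> market <TAB> check-in date <TAB> nights


def map_line(line):
    """Emit (query_key, 1) for one log line, or None if the line is malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return None
    _timestamp, market, check_in, nights = fields
    return (f"{market}|{check_in}|{nights}", 1)


def reduce_counts(pairs):
    """Sum counts per key; pairs must arrive sorted by key, as after a shuffle."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def count_queries(lines):
    """Run the map, sort, and reduce phases locally on an iterable of log lines."""
    mapped = [kv for kv in map(map_line, lines) if kv is not None]
    return dict(reduce_counts(sorted(mapped)))


if __name__ == "__main__":
    sample = [
        "2011-01-05T10:00\tCHI\t2011-02-11\t2",
        "2011-01-05T10:01\tCHI\t2011-02-11\t2",
        "2011-01-05T10:02\tNYC\t2011-03-01\t1",
        "malformed line",
    ]
    print(count_queries(sample))
```

The per-query counts produced this way are the kind of input the traffic analyses on the later slides would need.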


Use Case – Hotel Cache Optimization

Overview:
Search methodology:
•  Subset of total properties in a location (1 page at a time).
•  Get “just enough” information to present to consumers.
Caching:
•  Reduces impact to suppliers (maintain “look-to-book” ratio).
•  Reduces latency.
•  Increases “coverage.”
Optimization Goal:
Improve the customer experience (reduce latency, increase
coverage) when searching for hotel rates while controlling impact
on suppliers (maintain look-to-book).


Hotel Cache Optimization – Early Attempts

Early approaches were well-intentioned, but were not driven by analysis of
the available data. For example:

Theory: A high rate of thrashing leads to eviction of more useful cache entries.
Attempted Solution: Increase cache size.
Result: No increase in measured coverage.
Problem: No actual analysis of the required cache size.

Theory: Locally managed inventory represents “free” information and can be
requested without limit to improve coverage.
Attempted Solution: Don’t cache locally managed inventory. Increase the amount
of local inventory requested with each user search.
Result: No increase in measured coverage.
Problem: Locally managed inventory doesn’t represent a large percentage of total
inventory and is already highly preferenced.


Hotel Cache Optimization – Data Driven Approaches

Data Driven Approaches:

Traffic Partitioning: Identify the subset of traffic that is most
efficient and optimize that subset through prefetching and
increased bursting.

TTL Optimization: Use historic logs of availability and rate
change information to predict volatility of hotel rates and
optimize cache TTL.
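
One way the TTL optimization idea could be framed is sketched below. It assumes, purely for illustration, that rate changes for a hotel arrive as a Poisson process whose rate is estimated from historic change timestamps, and it picks the longest TTL that keeps the probability of a change before expiry under a target. The model, the function names, and the clamping bounds are assumptions, not the method described in the slides.

```python
import math


def estimate_change_rate(change_times_hours):
    """Estimate rate changes per hour from sorted timestamps (in hours)."""
    if len(change_times_hours) < 2:
        raise ValueError("need at least two observed changes")
    span = change_times_hours[-1] - change_times_hours[0]
    if span <= 0:
        raise ValueError("timestamps must be increasing")
    return (len(change_times_hours) - 1) / span


def ttl_for_staleness(rate_per_hour, max_stale_prob, min_ttl=0.25, max_ttl=24.0):
    """Return a TTL in hours with P(change before expiry) <= max_stale_prob.

    Under the assumed exponential gaps, P(change within t) = 1 - exp(-rate * t),
    so t = -ln(1 - p) / rate. The result is clamped to [min_ttl, max_ttl].
    """
    if not 0.0 < max_stale_prob < 1.0:
        raise ValueError("max_stale_prob must be between 0 and 1")
    if rate_per_hour <= 0:
        return max_ttl
    ttl = -math.log(1.0 - max_stale_prob) / rate_per_hour
    return max(min_ttl, min(max_ttl, ttl))


if __name__ == "__main__":
    stable = estimate_change_rate([0.0, 48.0, 96.0])            # hypothetical hotel
    volatile = estimate_change_rate([0.0, 2.0, 4.0, 6.0, 8.0])  # hypothetical hotel
    print(ttl_for_staleness(stable, 0.10))
    print(ttl_for_staleness(volatile, 0.10))
```

In this sketch a hotel whose rates change rarely gets a longer TTL than a volatile one, which is the trade-off between coverage and stale rates that the slide points at.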


Hotel Cache Optimization – Traffic Distribution

[Chart: y-axis 0.00% to 100.00%, x-axis values 1 to 20. Series: Queries;
Searches; Reverse Running Total (Searches); Reverse Running Total (Queries).
Data labels shown: 71.67%, 34.30%, 31.87%, 2.78%.]

•  72% of queries are singletons and make up nearly a third of total
search volume.
•  A small number of queries (3%) make up more than a third of search
volume.
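
A minimal sketch of how a distribution like this could be computed from per-query search counts follows. It assumes the chart's x-axis is the number of times a query was searched, which the slide does not state; the function name and toy data are hypothetical.

```python
from collections import Counter


def traffic_distribution(search_counts, max_freq=20):
    """Summarize traffic by query frequency.

    search_counts maps a query key to its number of searches. Each returned row
    is (freq, pct_queries, pct_searches, rev_pct_queries, rev_pct_searches),
    where the "rev" columns cover queries searched at least `freq` times.
    """
    total_queries = len(search_counts)
    total_searches = sum(search_counts.values())
    if total_queries == 0 or total_searches == 0:
        raise ValueError("no traffic to summarize")
    by_freq = Counter(search_counts.values())  # frequency -> number of queries
    rows = []
    for k in range(1, max_freq + 1):
        q_at = by_freq.get(k, 0)
        q_ge = sum(n for f, n in by_freq.items() if f >= k)
        s_ge = sum(f * n for f, n in by_freq.items() if f >= k)
        rows.append((
            k,
            q_at / total_queries,
            k * q_at / total_searches,
            q_ge / total_queries,
            s_ge / total_searches,
        ))
    return rows


if __name__ == "__main__":
    toy = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 5}  # hypothetical counts
    for row in traffic_distribution(toy, max_freq=5):
        print(row)
```

The first row gives the singleton share of queries and of searches; the reverse running totals show how much search volume a small set of frequently repeated queries accounts for.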


Optimize Hotel Cache – Traffic Partitioning

Evaluate possible mechanisms for determining the most
frequent queries.
Favor mechanisms that give a high search/query ratio for
the greatest percentage of search volume.
Test for stability of mechanism across multiple time periods.

Partition Strategy   Description                                             Pct Queries   Pct Searches   Searches/Query
Baseline             All traffic                                             100.00%       100.00%        2.19
Top 50               Top 50 searched markets                                 14.88%        26.76%         3.94
Heuristic            Top 50 searched markets, weekend stay within 1 month.   0.87%         8.52%          21.4
Enumeration          Queries repeated 5 or more times.                       3.45%         28.80%         18.29
Prediction           TBD                                                     TBD           TBD            TBD
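
The Searches/Query column can be related to the two percentage columns: for any partition, searches per query equals the baseline ratio multiplied by the partition's share of searches and divided by its share of queries. The sketch below checks that relationship against the reported figures; the function and constant names are made up for illustration, and the derived values agree with the slide only to within rounding.

```python
BASELINE_SEARCHES_PER_QUERY = 2.19  # Baseline row: all traffic

# (strategy, pct_queries, pct_searches, reported searches/query) from the slide
ROWS = [
    ("Top 50", 14.88, 26.76, 3.94),
    ("Heuristic", 0.87, 8.52, 21.4),
    ("Enumeration", 3.45, 28.80, 18.29),
]


def searches_per_query(pct_queries, pct_searches,
                       baseline=BASELINE_SEARCHES_PER_QUERY):
    """Derive a partition's searches/query ratio from its traffic shares."""
    return baseline * pct_searches / pct_queries


if __name__ == "__main__":
    for name, pct_q, pct_s, reported in ROWS:
        derived = searches_per_query(pct_q, pct_s)
        print(f"{name}: derived {derived:.2f}, reported {reported}")
```

Read this way, the Heuristic partition has the highest ratio but covers under a tenth of searches, while Enumeration keeps a similar ratio across more than a quarter of search volume.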


Conclusions and Lessons Learned

•  Start with a manageable problem (ease of measuring success,
availability of data, etc.)
•  Avoid thinking of the machine learning team as an R&D
organization.
•  Instead, foster machine learning approaches throughout the
organization:
–  Embed resources on actual feature teams.
–  Machine learning study groups, etc.
