This presentation describes the architecture, data pre-processing, modelling, and accuracy measures for a PoC of a machine-learning-based ranking system for travel meta search engines.
2. Agenda
1. Problem formulation
2. Effect of ranking on choice
3. System architecture
4. Data Sources
5. Data Cleaning
6. Feature Engineering
7. Utility Generation
8. Predictors Generation
9. Weight Generation
10. Prediction Problem
11. Regression and Partitioning Tree (RPART)
12. Pipeline
13. Relevance measure
14. References
15. Future work
3. 1. Problem formulation
● User searches for transportation from A to B with different parameters
● Which search results are most likely to be clicked and booked by the user?
● Users look at no more than the top 3 to 5 results (intuitively)
● Objective: predict each result's value to the user (utility score) and sort by it
4. 2. Effect of Ranking on Search and Choice (1)
● Left figure: CTR vs. position; right figure: CR conditional on clickout vs. position
[Figure: The effect of position on CTR and CR with random ranking [1]]
5. 2. Effect of Ranking on Search and Choice (2)
● Position DOES have a causal effect on CTR, but does NOT on CR
● User selection mechanism:
○ Clickout: search cost, expected utility, realized utility (utility = value for money)
○ Purchase: booking friction, realized utility
● Increase CR by providing more valuable results (reasonable price, duration, departure time, etc.) (value = utility)
● High-utility results need to be placed in higher positions, otherwise they will most likely not be clicked
● Top results should be of high utility, otherwise they will not be booked
● Lost opportunity cost: if we show a low-utility result at a high position, it will most likely not be booked, and we lose the chance to present a higher-value result in the same place that could have been clicked
● Low-utility results = losing users' trust
● Money = CTR × CR
6. 3. System Architecture (PoC)
[Architecture diagram: the Search Engine emits Search logs, Clickout logs, and Purchase logs; together with Journey details, User data, and Third-party data these feed Data Cleaning → Feature Engineering/Generation → Weight Generator and Score Generator → Model Building → Model Evaluation → Scoring Model]
7. 4. Data Sources
● Search log: search events with options such as departure city and date, arrival city, round-trip, passengers, timestamp
● Clickout log: events of clicking out on a search result, with information such as clickout timestamp, count, and a link to the purchase log (if a purchase occurred)
● Purchase log: payment method, number of tickets, price per passenger, discount if applicable, time of purchase
● Journey details: offer details such as provider name, departure and arrival time, station/airport, number of stops, intermediate stops, duration, price
● User data: age, gender, and city (if available)
● Third-party data: provider reviews from other websites such as Yelp, airlinequality, Trustpilot
8. 5. Data Cleaning
● Filter out bots, scrapers, and testing search/clickout events; only human-triggered searches need to be analyzed
● Filter out invalid/obsolete journeys
● Filter out out-of-service providers (the cache system's responsibility)
● Filter out journeys with invalid times or out-of-range prices (outliers, data collection errors); see the sketch below
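A minimal sketch of these filters in R, assuming hypothetical column names (is_bot, price, departure_time), since the actual log schema is not part of the deck:

```r
library(dplyr)

# Toy search/journey log; the column names here are assumptions.
searches <- data.frame(
  is_bot         = c(FALSE, TRUE, FALSE, FALSE),
  price          = c(25, 30, -1, 9000),            # -1 and 9000 are bad records
  departure_time = c("08:00", "09:15", NA, "10:30")
)

cleaned <- searches %>%
  filter(!is_bot) %>%                 # keep only human-triggered events
  filter(!is.na(departure_time)) %>%  # drop journeys with invalid times
  filter(price > 0, price < 5000)     # drop out-of-range prices (outliers)
```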
9. 6. Feature Engineering
● People tend to decide in relative terms
● 13 EUR feels close to 15, but 20 can feel too far
● Binning increases accuracy and reduces the effect of outliers
● Distribution-based binning (percentiles, p = 3, 4)
● The same applies to departure and arrival times and to duration in minutes
● The open question is the number of bins (trial and error, optimal binning); see the sketch below
● Intuitively, how humans "bin" data:
○ Price: super cheap (offer), cheap, average, expensive
○ Departure time: morning, noon, afternoon, evening
○ Duration: short, average, long
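A minimal sketch of distribution-based binning in R, using hypothetical prices and p = 4 percentile bins:

```r
# Quantile breaks give distribution-based bins; labels mirror how humans
# "bin" prices (super cheap / cheap / average / expensive).
prices <- c(9, 13, 15, 18, 20, 24, 35, 60, 75, 120)

breaks    <- quantile(prices, probs = seq(0, 1, length.out = 5))  # 4 bins
price_bin <- cut(prices, breaks = breaks, include.lowest = TRUE,
                 labels = c("super cheap", "cheap", "average", "expensive"))
table(price_bin)
```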
10. 7. Utility Generation
● The target is to predict the utility the user received; how can we measure it?
○ High utility: the user clicks and books (purchase)
○ Expected utility: the user clicks but does not book
○ No utility: neither click nor booking
● We can't rely only on clickouts (biased towards higher-ranked results)
● How can we quantify utility? (implicit feedback problem)
○ Use clickout, booking, and the results' relative positioning information
● For each clickout, in a given time window:
U = w · (c + K · b)
○ U is an estimate of the utility the user received
○ For each clickout, c and b are 0 or 1, indicating whether a click/booking happened
○ K is a constant giving bookings more weight (set to 10 after some experiments)
○ w is the weight of this clickout event, reflecting its importance (recency); see the check below
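A quick check of this scoring in R, reproducing the worked example on the next slide (K = 10, w = 1 for all clickouts):

```r
# U1 = w * (c + K * b) for the five results of the sample page.
K       <- 10
clicked <- c(1, 1, 0, 1, 0)  # c: 1 if the result was clicked out
booked  <- c(0, 1, 0, 0, 0)  # b: 1 if the clickout led to a booking
w       <- rep(1, 5)         # recency weight per clickout event

U1 <- w * (clicked + K * booked)
U1  # 1 11 0 1 0
```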
11. 7. Utility Generation
● Sample search results page
● K = 10, w = 1 for all clickouts

| Position (initial ranking) | Search result (journey offer) | Clicked | Booked | Utility score |
|---|---|---|---|---|
| 1 | … | Y | N | 1 |
| 2 | … | Y | Y | 11 |
| 3 | … | N | N | 0 |
| 4 | … | Y | N | 1 |
| 5 | … | N | N | 0 |
| … | no more clicks or bookings | | | all 0 |
12. 7. Utility Generation
● What is the problem with this utility scoring scheme?
○ Results 3 and 5 both have score 0
○ We have more evidence that result 3 is not of "convincing utility" to the user, since they skipped it and clicked result 4
○ We don't have such evidence for result 5 (maybe our initial ranking was the reason!)
○ The same problem applies to results 1 and 4
○ Results 1 and 3 should be PENALIZED
● Utility scoring refinement (second iteration, U2); see the worked example on the next slide and the sketch below:
○ If some lower-ranked result has a higher U1 (the user skipped this result for a better one below): U2 = U1 - max(U1 of all lower-ranked results)
○ Otherwise: U2 = U1
● Score meaning:
○ Positive: we have some evidence that it does have user-received utility
○ Negative: we have some evidence that it does NOT have user-received utility
○ 0: we have no evidence either way
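A minimal sketch of the refinement in R, reproducing the U2 column of the table on the next slide:

```r
# Penalize a result when something ranked below it earned a higher U1
# (the user saw it and skipped it for a better option).
U1 <- c(1, 11, 0, 1, 0)  # scores in initial-ranking order

U2 <- sapply(seq_along(U1), function(i) {
  below <- U1[-seq_len(i)]                      # results ranked below i
  if (length(below) > 0 && max(below) > U1[i])
    U1[i] - max(below)                          # evidence of skipping: penalize
  else
    U1[i]                                       # no evidence: keep the score
})
U2  # -10 11 -1 1 0
```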
13. 7. Utility Generation
Utility scores after refinement

| Position (initial ranking) | Search result (journey offer) | Clicked | Booked | U1 | U2 refinement | U2 |
|---|---|---|---|---|---|---|
| 1 | … | Y | N | 1 | 1 - 11 | -10 |
| 2 | … | Y | Y | 11 | 11 | 11 |
| 3 | … | N | N | 0 | 0 - 1 | -1 |
| 4 | … | Y | N | 1 | 1 | 1 |
| 5 | … | N | N | 0 | 0 | 0 |
| … | no more clicks or bookings | | | | | all 0 |
14. 7. Utility Generation
Desired ranking

| Position | Search result (journey offer) | Clicked | Booked | U1 | U2 refinement | U2 | Evidence |
|---|---|---|---|---|---|---|---|
| 2 | … | Y | Y | 11 | 11 | 11 | evidence of utility |
| 4 | … | Y | N | 1 | 1 | 1 | evidence of utility |
| 5 | … | N | N | 0 | 0 | 0 | no evidence |
| 3 | … | N | N | 0 | 0 - 1 | -1 | evidence of NO utility |
| 1 | … | Y | N | 1 | 1 - 11 | -10 | evidence of NO utility |
| … | no more clicks or bookings | | | | | all 0 | |
15. 8. Predictors Generation
● Combine search options and journey/offer details to predict the utility a user receives from a search result (journey)
● Predictor generation:
○ Existing data
- Search options: one-way/round-trip, price, travelers (number, age)
- Journey details: time of day, duration, travel mode (bus, train, flight), provider, number of connections
○ Derived
- Days to travel date, distance, duration-to-distance ratio, provider booking ratio
- Domestic/international
- Seasonality (day of week, quarter of year: summer, winter); travel is a seasonal business
○ Third-party: provider rating
16. 9. Weight Generation
● The most recent clickouts are more important than very old ones
● User behavior can change because of changes in provider results, prices, discounts/offers, etc.
● x = the time elapsed between the current date and the clickout date (in days or weeks)
● The weight represents each clickout's recency (see the sketch below)
● H is the time window for the historical logs
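The exact decay formula is not spelled out here; a minimal sketch assuming an exponential decay w = alpha^x, which is consistent with the alpha values (0.1, 0.5, 0.9) experimented with on the RPART parameters slide:

```r
# Assumed recency weight: exponential decay per week, zero outside the
# historical window H. The deck's actual formula may differ.
recency_weight <- function(x, alpha = 0.5, H = 90) {
  # x: days elapsed since the clickout; H: historical window in days
  ifelse(x > H, 0, alpha ^ (x / 7))
}

recency_weight(c(0, 7, 30, 120))  # newest clickouts get weight close to 1
```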
17. 10. Prediction Problem
● Problem formulation: given search options and journey details, predict utility
● Sort the search results based on predicted utility
● Each observation is a group of clickouts with the same search options and journey details (after preprocessing); see the sketch after the table below
● The grouping is meant to generalize the model
● A classification/regression tree fits the problem
● PoC: limit to clickouts for the top 10 searched routes over the last 3 months
| Model input | Role |
|---|---|
| Recency | Weight |
| Search options, journey details | Predictors |
| Utility | Target |
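A minimal sketch of the grouping step, with assumed column names; clickouts sharing the same preprocessed predictors collapse into one observation, carrying a weighted-mean utility and a summed weight:

```r
library(dplyr)

# Toy clickout-level rows after preprocessing/binning (columns assumed).
clickouts <- data.frame(
  route     = c("A-B", "A-B", "A-B", "A-C"),
  price_bin = c("cheap", "cheap", "average", "cheap"),
  utility   = c(1, 11, -1, 0),
  w         = c(1.0, 0.5, 1.0, 0.8)   # recency weights
)

observations <- clickouts %>%
  group_by(route, price_bin) %>%
  summarise(utility = weighted.mean(utility, w),  # target
            weight  = sum(w),                     # rpart observation weight
            .groups = "drop")
```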
18. 11. RPART
● Why a tree?
○ The problem is nonlinear
○ Presentable to stakeholders
○ Inputs are discrete, the target is continuous
○ Highly customizable, many parameters
○ Many implementations exist
● The rpart package:
○ Well documented
○ Mature
19. 11. RPART
Parameters (see the fitting sketch below)

| Parameter | Description | Values experimented | Mapping to problem |
|---|---|---|---|
| cp | Threshold for fit/loss improvement | 0.005, 0.01, 0.02 | How restrictive the tree is in splitting |
| weights | Observation weights | alpha = 0.1, 0.5, 0.9 | Small alpha = less weight to older results |
| method | Split criterion | anova | Regression problem |
| minsplit | Minimum observations in a node to attempt a split | 20, 30, 40 | |
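A minimal rpart call wiring these parameters together, on a toy feature matrix (the real predictors come from the pipeline's CSV output):

```r
library(rpart)

set.seed(42)
d <- data.frame(  # toy stand-in for the feature matrix
  price_bin   = factor(sample(c("cheap", "average", "expensive"), 200, TRUE)),
  time_of_day = factor(sample(c("morning", "noon", "evening"), 200, TRUE)),
  stops       = sample(0:2, 200, TRUE),
  utility     = runif(200, -10, 11),
  w           = runif(200, 0.1, 1)                  # recency weights
)

fit <- rpart(utility ~ price_bin + time_of_day + stops,
             data    = d,
             weights = w,                            # observation weights
             method  = "anova",                      # regression problem
             control = rpart.control(cp = 0.01,      # split restrictiveness
                                     minsplit = 30)) # min node size to split
printcp(fit)  # complexity table, useful for pruning decisions
```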
20. 11. RPART
● Divide the data into train (t = 1 to K) and test (t = K+1 to N)
● Example: build the model on data from the previous 3 months; train on the oldest 2 months and 3 weeks, test on the latest week
● Reason: test the ability to predict the future, since the model is to be updated every 1 or 2 weeks
● Internal accuracy measure: RMSE
● Grid search (expand.grid)
● Several runs with a rolling origin (cross-validation at different time points); see the sketch below
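A sketch of the rolling-origin evaluation combined with the grid search, reusing the toy data `d` from the previous sketch and adding an assumed week index:

```r
library(rpart)

d$week <- sample(1:13, nrow(d), replace = TRUE)  # ~3 months of weekly data

grid <- expand.grid(cp = c(0.005, 0.01, 0.02), minsplit = c(20, 30, 40))
rmse <- function(a, b) sqrt(mean((a - b)^2))

# Rolling origin: train on weeks 1..K, test on week K+1, for several K.
scores <- sapply(10:12, function(K) {
  train <- d[d$week <= K, ]
  test  <- d[d$week == K + 1, ]
  apply(grid, 1, function(p) {
    fit <- rpart(utility ~ price_bin + time_of_day + stops,
                 data = train, weights = train$w, method = "anova",
                 control = rpart.control(cp = p[["cp"]],
                                         minsplit = p[["minsplit"]]))
    rmse(predict(fit, test), test$utility)
  })
})
grid[which.min(rowMeans(scores)), ]  # best parameters averaged over origins
```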
22. 12. Pipeline
● Spark MapReduce makes it easy to transform large data sets
○ Using Spark only for ETL does not fully exploit Spark's power!
● CSV for the feature matrix:
○ Simplicity, readability
○ Easiest to consume from R (PoC)
● R for model building:
○ Rich in libraries
○ Quick prototyping and testing
○ Prior experience!
● Model exported as PMML (see the sketch below):
○ A widely used XML format for exchanging ML models
○ An export library exists (pmml.rpart)
○ A consumer library exists (jpmml)
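A minimal export sketch with the pmml package, fitted on a built-in rpart dataset just to have a tree to export (note that the save helper differs between pmml package versions):

```r
library(rpart)
library(pmml)   # provides the pmml() generic with an rpart method

fit <- rpart(Mileage ~ Weight + Type, data = car.test.frame, method = "anova")

model_pmml <- pmml(fit)  # convert the fitted tree to a PMML document
# Older pmml releases save via XML::saveXML(); pmml >= 2.0 ships save_pmml():
save_pmml(model_pmml, "scoring_model.pmml")
```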
23. 13. Relevance Measure (1)
● Treat it as a binary retrieval problem (a result is either relevant or not)
● Consider the results presented in the top K positions as possibly relevant, the others as not
● K usually ranges between 3 and 5 (a more objective measure is needed!)
● Precision/recall over bookings, not search events
● Using search events for relevance measurement is not completely reliable, due to position bias
● Basing the measures only on bookings is too restrictive (the booking cycle is not tied to the search)
● As a start, rely on bookings only and measure relative improvement; see the sketch below
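A minimal precision/recall-at-K sketch over bookings for a single search's ranked results (the vectors are hypothetical):

```r
# booked[i] = 1 if the result displayed at position i was booked.
booked <- c(0, 1, 0, 1, 0, 0, 0, 1)

prec_recall_at_k <- function(booked, K) {
  relevant <- sum(booked)
  if (relevant == 0)                 # recall undefined when no events occurred
    return(c(precision = 0, recall = NA))
  hits <- sum(booked[seq_len(K)])    # bookings that landed in the top K
  c(precision = hits / K, recall = hits / relevant)
}

prec_recall_at_k(booked, K = 3)  # precision 0.33, recall 0.33
```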
25. 13. Relevance Measure (2)
● Recall is defined only when the number of events > 0; otherwise it is undefined
● Events can be clicks or bookings
● The measures should ultimately move CTR and CR
● PoC: no presentable results yet
26. 14. References
1. The Power of Ranking: Quantifying the Effects of Rankings on Online Consumer Search and Choice
2. An Introduction to Recursive Partitioning Using the RPART Routines
3. Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation
27. 15. Possible Improvements
● More work on model optimization
● A loss function for missing relevant results
● Ranking SVM
● Result shuffling
● Obtain user profile data
● Speed up model building (R parallel / Spark)
● Investigate relevance measures further