Recommendation/Ranking Systems for Travel (PoC)
Mohamed Baddar - Data Engineer, GoEuro
Agenda
1. Problem formulation
2. Effect of ranking on choice
3. System architecture
4. Data Sources
5. Data Cleaning
6. Feature Engineering
7. Utility Generation
8. Predictors Generation
9. Weight Generation
10. Prediction Problem
11. Recursive Partitioning and Regression Trees (RPART)
12. Pipeline
13. Relevance measure
14. References
15. Future work
1. Problem formulation
● A user searches for transportation from A to B with different parameters
● Which search results are most likely to be clicked and booked by the user?
● Users look at no more than the top 3 to 5 results (intuitively)
● Objective: predict each result’s value to the user (utility score) and sort the results by it
2. Effect of Ranking on Search and Choice (1)
● Left figure: CTR vs. position. Right figure: CR (conditional on click-out) vs. position
● Figure caption: The effect of position on CTR and CR with random ranking [1]
2. Effect of Ranking on Search and Choice (2)
● Position DOES have a causal effect on CTR, but NOT on CR
● User selection mechanism
○ Click-out: search cost, expected utility, realized utility (utility = value for money)
○ Purchase: booking friction, realized utility
● Increase CR by providing more valuable results (reasonable price, duration, departure time, etc.) (value = utility)
● High-utility results need to be in higher positions, otherwise they will most likely not be clicked
● Top results should be of high utility, otherwise they will not be booked
● Lost opportunity cost: a low-utility result shown at a high position will most likely not be booked, and we lose the chance to present a higher-value result in that place that could have been clicked
● Low-utility results = losing users’ trust
● Money = CTR * CR
3. System Architecture (PoC)
[Architecture diagram; box labels only]
● Data sources: Search-logs, Clickout-logs, Purchase-logs, Journey details, User-data, Third party data
● Processing: Data Cleaning, Feature Engineering/Generation, Weight Generator, Score Generator, Model Building, Model Evaluation
● Serving: Scoring Model, Search Engine
4. Data Sources
● Search log: search events with options such as departure city and date, arrival city, round trip, passengers, timestamp
● Clickout log: the event of clicking out a search result, with information such as click-out timestamp, count, and a link to the purchase log (if a purchase occurred)
● Purchase log: payment method, number of tickets, price per passenger, discount if applicable, time of purchase
● Journey details: offer details such as provider name, departure and arrival time, station/airport, number of stops, intermediate stops, duration, price
● User data: age, gender and city (if applicable)
● Third-party data: provider reviews from other websites such as Yelp, airlinequality, Trustpilot
5. Data Cleaning
● Filter out bots, scrapers and test search/click-out events; only human-triggered searches need to be analyzed
● Filter out invalid/obsolete journeys
● Filter out out-of-service providers (cache system responsibility)
● Filter out journeys with an invalid time or an out-of-range price (outliers, data collection errors)
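The cleaning rules above could be expressed as one predicate per event. This is only a minimal Python sketch: the field names (`user_agent`, `is_test`, `price_eur`, `departure`, `arrival`), the user-agent markers and the price bounds are all hypothetical and do not come from the slides.

```python
from datetime import datetime

# Hypothetical user-agent hints; a real bot filter would be more thorough.
BOT_MARKERS = ("bot", "crawler", "spider", "scraper")

def is_valid_event(event, min_price=1.0, max_price=2000.0):
    """Return True for human-triggered events with a plausible time and price."""
    agent = event.get("user_agent", "").lower()
    if event.get("is_test", False) or any(m in agent for m in BOT_MARKERS):
        return False  # bots, scrapers and test traffic
    if not min_price <= event["price_eur"] <= max_price:
        return False  # out-of-range price (outlier or collection error)
    return event["departure"] < event["arrival"]  # reject invalid times

# Made-up events: one valid, one bot, one with a bad price, one with a bad time.
morning = datetime(2016, 7, 15, 8, 0)
afternoon = datetime(2016, 7, 15, 14, 30)
events = [
    {"user_agent": "Mozilla/5.0", "price_eur": 29.0, "departure": morning, "arrival": afternoon},
    {"user_agent": "ExampleBot/1.0", "price_eur": 29.0, "departure": morning, "arrival": afternoon},
    {"user_agent": "Mozilla/5.0", "price_eur": 0.0, "departure": morning, "arrival": afternoon},
    {"user_agent": "Mozilla/5.0", "price_eur": 29.0, "departure": afternoon, "arrival": morning},
]
clean_events = [e for e in events if is_valid_event(e)]
```

Keeping every rule in a single function makes it easy to log which rule rejected an event if the filter later needs auditing.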
6. Feature Engineering
● People tend to decide in a relative way
● 13 EUR is near 15, but 20 can be too far
● Binning to increase accuracy and reduce the effect of outliers
● Distribution-based binning (percentiles, p = 3, 4)
● The same for departure and arrival time, and duration in minutes
● The open question is how many bins to use (trial and error, optimal binning)
● Intuitively, how humans “bin” data:
○ Price: super cheap (offer), cheap, average, expensive
○ Departure time: morning, noon, afternoon, evening
○ Duration: short, average, long
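Distribution-based binning can be sketched with the Python standard library. The price list below is made up, and mapping four quantile bins to the four price labels on the slide is an assumption for illustration; the PoC itself was built in R.

```python
from bisect import bisect_right
from statistics import quantiles

def quantile_bins(values, labels):
    """Assign each value to a percentile-based bin, one label per bin."""
    # quantiles() returns len(labels) - 1 cut points that split the data evenly.
    cuts = quantiles(values, n=len(labels), method="inclusive")
    return [labels[bisect_right(cuts, v)] for v in values]

prices_eur = [9, 13, 15, 20, 24, 31, 45, 120]  # made-up prices; 120 is an outlier
price_labels = ["super cheap", "cheap", "average", "expensive"]
binned = quantile_bins(prices_eur, price_labels)
```

Because the cut points come from the distribution, the outlier of 120 EUR simply lands in the top bin instead of stretching the scale for every other price.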
7. Utility Generation
● The target is to predict the utility the user received. How can we measure it?
○ High utility: user clicks and books (purchases)
○ Expected utility: user clicks but does not book
○ No utility: user neither clicks nor books
● We can’t rely only on click-outs (they are biased toward higher-ranked results)
● How can we quantify utility? (implicit feedback problem)
○ Use click-out, booking, and relative-position information of the results
● For each click-out, in a given time window:
○ U is an estimate of the utility the user received
○ c and b are 0 or 1, indicating whether a click / a booking happened
○ K is a constant that gives bookings more weight (set to 10 after some experiments)
○ w is the weight of this click-out event, reflecting its importance (recency)
7. Utility Generation
● Sample search results page (each row is one journey offer)
● K = 10, w = 1 for all click-outs

Position (initial ranking) | Clicked | Booked | Utility score
1                          | Y       | N      | 1
2                          | Y       | Y      | 11
3                          | N       | N      | 0
4                          | Y       | N      | 1
5                          | N       | N      | 0
… (remaining positions): no more clicks or bookings, all 0’s
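The scoring formula itself is not present in the slide text. One form that reproduces the sample page above is U = sum of w * (c + K * b) over a result's events; that form, and the function and variable names below, are assumptions rather than the author's confirmed formula.

```python
K = 10  # booking bonus; the slides report K = 10

def utility(events, k=K):
    """Sum w * (c + k * b) over the (clicked, booked, weight) events of one result.

    The formula is an assumed reconstruction that matches the sample table.
    """
    return sum(w * (c + k * b) for c, b, w in events)

# Sample results page: position -> list of (clicked, booked, weight) events.
page = {
    1: [(1, 0, 1.0)],
    2: [(1, 1, 1.0)],
    3: [(0, 0, 1.0)],
    4: [(1, 0, 1.0)],
    5: [(0, 0, 1.0)],
}
u1 = {position: utility(events) for position, events in page.items()}
```

With w = 1 the scores are 1, 11, 0, 1 and 0 for positions 1 to 5, matching the utility score column.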
7. Utility Generation
● What is the problem with this utility scoring scheme?
○ Results 3 and 5 both have score 0
○ We have more evidence that result 3 is not of “convincing utility” to the user, as s/he skipped it and clicked result 4
○ We don’t have such evidence for result 5 (maybe our initial ranking was the reason!)
○ Same problem with results 1 and 4
○ Results 1 and 3 should be PENALIZED
● Utility scoring refinement (second iteration, U2)
○ In the worked example on the next slide, result 1 becomes 1 - 11 = -10 and result 3 becomes 0 - 1 = -1; results 2, 4 and 5 keep their U1 scores
● Score meaning
○ Positive: we have some evidence that the result has user-received utility
○ Negative: we have some evidence that the result does NOT have user-received utility
○ 0: we have no evidence either way
7. Utility Generation
Utility score after refinement

Position (initial ranking) | Clicked | Booked | U1 | U2 refinement | U2
1                          | Y       | N      | 1  | 1 - 11        | -10
2                          | Y       | Y      | 11 | 11            | 11
3                          | N       | N      | 0  | 0 - 1         | -1
4                          | Y       | N      | 1  | 1             | 1
5                          | N       | N      | 0  | 0             | 0
… (remaining positions): no more clicks or bookings, all 0’s
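The U2 formula is not preserved in the slide text. One rule that reproduces the refinement column above is: subtract the U1 of the next-ranked result whenever that result scored higher. This rule is an assumption that happens to fit the example; other rules (for instance, penalizing by the best score anywhere below) fit it equally well.

```python
def refine(u1_by_position):
    """Assumed U2 rule: penalize a result by the next result's U1 if that is higher."""
    positions = sorted(u1_by_position)
    u2 = {}
    for pos, nxt in zip(positions, positions[1:] + [None]):
        below = u1_by_position[nxt] if nxt is not None else 0
        penalty = below if below > u1_by_position[pos] else 0
        u2[pos] = u1_by_position[pos] - penalty
    return u2

u1 = {1: 1, 2: 11, 3: 0, 4: 1, 5: 0}  # U1 column of the table
u2 = refine(u1)
desired_ranking = sorted(u2, key=u2.get, reverse=True)  # best U2 first
```

Sorting by U2 puts the booked result first and pushes the two penalized results to the bottom, which is the ordering the refinement is meant to produce.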
7. Utility Generation
Desired ranking

Position (initial) | Clicked | Booked | U1 | U2 refinement | U2  | Interpretation
2                  | Y       | Y      | 11 | 11            | 11  | Evidence of utility
4                  | Y       | N      | 1  | 1             | 1   | Evidence of utility
5                  | N       | N      | 0  | 0             | 0   | No evidence
3                  | N       | N      | 0  | 0 - 1         | -1  | Evidence of NO utility
1                  | Y       | N      | 1  | 1 - 11        | -10 | Evidence of NO utility
… (remaining positions): no more clicks or bookings, all 0’s
8. Predictors Generation
● Combine search options and journey/offer details to predict the utility the user receives from a search result (journey)
● Predictor sources
○ Existing data
- Search options: one-way/round trip, price, travelers (number, age)
- Journey details: time of day, duration, travel mode (bus, train, flight), provider, number of connections
○ Derived
- Days to travel date, distance, duration-to-distance ratio, provider booking ratio
- Domestic/international
- Seasonality (day of week, quarter of year -> summer, winter); travelling is a seasonal business
○ Third party: provider rating
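Several of the derived predictors listed above can be computed directly from a search event and a journey. The sketch below is illustrative only: the function name, argument names and sample values are hypothetical, and the provider booking ratio is left out because it needs aggregated log data.

```python
from datetime import date

def derived_predictors(search_date, travel_date, distance_km, duration_min,
                       origin_country, destination_country):
    """Compute a few derived predictors for one search result (assumed field names)."""
    return {
        "days_to_travel": (travel_date - search_date).days,
        "duration_per_km": duration_min / distance_km,  # duration-to-distance ratio
        "is_domestic": origin_country == destination_country,
        "day_of_week": travel_date.weekday(),  # Monday = 0 ... Sunday = 6
        "quarter": (travel_date.month - 1) // 3 + 1,  # seasonality proxy
    }

# Made-up example: a domestic journey searched about six weeks ahead.
row = derived_predictors(date(2016, 5, 30), date(2016, 7, 15),
                         distance_km=585.0, duration_min=390,
                         origin_country="DE", destination_country="DE")
```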
9. Weight Generation
● Recent click-outs are more important than very old ones
● User behavior can change because of changes in provider results, prices, discounts/offers, etc.
● X = time elapsed between the current date and the click-out date (days, weeks)
● H = the time window covered by the historical logs
● The weight w is derived from X and represents the recency of each click-out
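The weight formula is not preserved in the slide text. A common choice that is consistent with the later note "small alpha = less weight to older results" is exponential decay, w = alpha^(X/H). The sketch below uses that form purely as an assumption; it is not necessarily the function used in the PoC.

```python
def recency_weight(elapsed, window, alpha):
    """Assumed exponential decay: 1.0 for a click-out now, alpha at the window edge."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if window <= 0:
        raise ValueError("window must be positive")
    x = min(max(elapsed, 0), window)  # clamp X into [0, H]
    return alpha ** (x / window)

# Weights for a click-out 45 days old in a 90-day window, for the alphas tried.
weights = {alpha: recency_weight(45, 90, alpha) for alpha in (0.1, 0.5, 0.9)}
```

With this form, alpha is the weight given to a click-out at the very edge of the historical window, so the three alphas tried in the deck range from strongly to mildly discounting old events.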
10. Prediction Problem
● Problem formulation: given a set of search options and journey details, predict utility
● Sort search results by predicted utility
● Each observation is a group of click-outs with the same search options and journey details (after preprocessing)
● Grouping helps the model generalize
● Classification/regression trees fit the problem
● PoC: limited to click-outs for the top 10 searched routes over the last 3 months
● Model inputs: weight = recency; predictors = search options and journey details; target = utility
11. RPART
● Why
○ The problem is nonlinear
○ Presentable to stakeholders
○ Inputs are discrete, target is continuous
○ Highly customizable, many parameters
○ Many implementations exist
● rpart package
○ Well documented
○ Mature
11. RPART
Parameters

Parameter | Description                                   | Values experimented     | Mapping to problem
cp        | Threshold for fit/loss improvement            | 0.005, 0.01, 0.02       | How restrictive the tree is in splitting
weights   | Observation weights                           | Alpha = 0.1, 0.5, 0.9   | Small alpha = less weight to older results
method    | Split criterion                               | anova                   | Regression problem
minsplit  | Minimum number of observations in a node to split | 20, 30, 40          |
11. RPART
● Divide the data into train (t = 1 to K) and test (t = K+1 to N)
● Example: build the model on data from the previous 3 months; train on the oldest 2 months and 3 weeks, test on the latest week
● Reason: to test the capability of predicting the future; the model is to be updated every 1 or 2 weeks
● Internal accuracy measure: RMSE
● Grid search (expand.grid)
● Several runs with rolling origin (cross-validation with different time points)
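The evaluation scheme above (time-ordered split, RMSE, parameter grid, rolling origin) can be sketched without any modelling library. The helper names are hypothetical and the tree-fitting step is omitted; the PoC used R's rpart and expand.grid, and the Python below only mirrors the bookkeeping around them.

```python
from itertools import product
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def param_grid(**options):
    """All combinations of the given parameter values (similar to R's expand.grid)."""
    names = list(options)
    return [dict(zip(names, combo)) for combo in product(*options.values())]

def rolling_origin_splits(n_periods, min_train, horizon=1):
    """Yield (train, test) period indices; the test block always follows the train block."""
    for cut in range(min_train, n_periods - horizon + 1):
        yield list(range(cut)), list(range(cut, cut + horizon))

# Parameter values taken from the parameters slide.
grid = param_grid(cp=[0.005, 0.01, 0.02], alpha=[0.1, 0.5, 0.9], minsplit=[20, 30, 40])
# Assumption: about 3 months of logs treated as 12 weekly periods.
splits = list(rolling_origin_splits(n_periods=12, min_train=9))
```

Each of the 27 parameter combinations would be fitted on every train block and scored with `rmse` on the week that follows it, so no model is ever evaluated on data older than its training data.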
12. Pipeline
[Pipeline diagram; box labels in reading order] Data transformation (MapReduce), Feature Matrix, Model Building, Optimization, Model deployment, Search backend
12. Pipeline
● Spark MR makes it easy to transform large data sets
○ Using Spark only for ETL does not fully utilize Spark’s power!
● CSV for the feature matrix
○ Simplicity, readability
○ Easiest for R (PoC)
● R for model building:
○ Rich in libraries
○ Quick prototyping and testing
○ Experience!
● Exported as PMML
○ XML-based, widely used format for exchanging ML models
○ Export library exists (pmml.rpart)
○ Consumer library exists (jpmml)
13. Relevance Measure
● Treat it as a binary retrieval problem (a result is relevant or not)
● Treat the results we present in the top K positions as (possibly) relevant; the others are not
● K usually ranges between 3 and 5 (needs a more objective measure!)
● Precision and recall are computed on bookings, not on search events
● Search events are not a completely reliable basis for relevance, due to position bias
● Basing the measures only on bookings is too restrictive (the booking cycle is not tied to search)
● As a start, rely on bookings only (measure relative improvement)
● Recall is defined only if the number of events > 0
● Events can be clicks or bookings
● In the end, the measures should affect CTR and CR
PoC: no presentable results yet
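Treating the top K presented results as "retrieved" and booked results as "relevant", precision and recall at K can be computed as below. The function name and the journey identifiers are hypothetical, and returning `None` is just one way to represent the undefined case; this is a sketch, not the deck's exact measure.

```python
def precision_recall_at_k(ranked_ids, booked_ids, k):
    """Precision and recall of the top-k results against the set of booked results.

    Recall is None (undefined) when there are no booking events.
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for result in top_k if result in booked_ids)
    precision = hits / len(top_k) if top_k else None
    recall = hits / len(booked_ids) if booked_ids else None
    return precision, recall

ranking = ["j2", "j4", "j5", "j3", "j1"]  # made-up journey ids, best first
precision, recall = precision_recall_at_k(ranking, booked_ids={"j2"}, k=3)
no_events = precision_recall_at_k(ranking, booked_ids=set(), k=3)
```

Comparing these values before and after a ranking change gives the relative improvement the slides suggest measuring first.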
14. References
1. The Power of Ranking: Quantifying the Effects of Rankings on Online Consumer Search and Choice
2. An Introduction to Recursive Partitioning Using the RPART Routines
3. Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation
15. Future work: possible improvements
● More model optimization
● Loss function for missing relevant results
● Ranking SVM
● Results shuffling
● Get user profile data
● Speed up model building (R parallel / Spark)
● Further investigation of the relevance measure
Questions?
Suggestions?

Ranking System for travel search (PoC)

  • 1. Recommendation/Ranking Systems For Travel (PoC) Mohamed Baddar - Data Engineer GoEuro
  • 2. Agenda 1. Problem formulation 2. Effect of ranking on choice 3. System architecture 4. Data Sources 5. Data Cleaning 6. Feature Engineering 7. Utility Generation 8. Predictors Generation 9. Weight Generation 10. Prediction Problem 11. Regression and Partitioning Tree (RPART) 12. Pipeline 13. Relevance measure 14. References 15. Future work 2
  • 3. 1. Problem formulation ● User searches for transportation from A to B with different parameters ● What are the search results most likely to be clicked and booked by the user. ● User looks at no more than top 3 to 5 results (intuitively) ● Objective: Predict results value to user (utility score), and sort based on it 3
  • 4. 2. Effect of Ranking on Search and Choice (1) ● Left figure, CTR - position relation, Right figure: CR conditional on click outs vs position The effect of position on CTR and CR with Random ranking [1] 4
  • 5. 2. Effect of Ranking on Search and Choice (2) ● Position DOES have causal effect on CTR, but does NOT on CR ● User selection mechanism ○ Clickout : (search cost, expected utility , realize utility ). Utility = value for money ○ Purchase : Booking friction , realized utility ● Increase CR by providing more valuable results (reasonable price, duration, departure time,etc....) (Value = Utility) ● High utility results need to be in higher positions, otherwise most likely they will not be clicked ● Top results should be of high utility , otherwise they will not booked. ● Lost opportunity cost: If we show low utility results at higher position: most likely it will not be booked and we lost the chance of presenting higher value result in the same place that could be clicked. ● Low utility results = losing users’ trust ● Money = CTR * CR 5
  • 6. 3. System Architecture (PoC) Clickout-logs Purchase-logs User-data Data Cleaning Feature Engineering/ Generation Weight Generator Score Generator Model Building Model Evaluation Scoring Model Search Engine Journey details Search-logs Third party data 6
  • 7. 3 Data sources ● Search log search events with options like departure city and date, arrival city,round-trip, passengers, timestamp ● Clickout log event of clicking out a search result with information like clickout timestamp, count,link to purchase log (if purchase occurred) ● Purchase log payment method, number of tickets, price per passenger, discount if applicable, time of purchase ● Journey details offer details like provider name, departure and arrival time , station/airport , number of stops, intermediate stops, duration, price ● User data Age, gender and city (If applicable) ● Third-party data Provider review from other websites like yelp, airlinequality , trustpilot 7
  • 8. 4 Data Cleaning ● Filtering bots, scrapers and testing search/clickout events. Only human triggered search need to be analyzed. ● Filtering invalid/obsolete Journeys ● Filtering out of service providers (Cache system responsibility) ● Filtering journeys with invalid time or out of range price (outliers, data collection error) 8
  • 9. 5 Feature Engineering ● People tend to decide in a relative way ● 13 EUR is near 15, but 20 can be too far ● Binning to increase accuracy and reduce outlier effect ● Distribution based binning (percentiles, p = 3, 4) ● The same for departure and arrival time, duration in mins. ● The questions is what is the number of bins (try and error, optimal binning) ● Intuitively how humans “bin” data : ○ Price => Super cheap (offer), cheap, Average, expensive ○ Departure time : morning, noon, afternoon, evening ○ Duration: short , average, long 9
  • 10. 6 Utility Generation
    ● The target is to predict the utility the user receives. How can we measure it?
      ○ High utility: user clicks and books (purchases)
      ○ Expected utility: user clicks but does not book
      ○ No utility: user neither clicks nor books
    ● We can't rely only on click-outs (they are biased toward higher-ranked results)
    ● How can we quantify utility? (implicit feedback problem)
      ○ Use clickout, booking, and relative result position information
    ● For each clickout, in a given time window:
      ○ U is an estimate of the utility received by the user
      ○ c and b are 0 or 1, indicating whether a click/booking happened
      ○ K is a constant that gives bookings more weight (set to 10 after some experiments)
      ○ w is the weight of this clickout event, reflecting its importance (recency)
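  • 11. 6 Utility Generation
    ● Sample search results page
    ● K = 10, w = 1 for all click-outs
    ● Columns: position (initial ranking), search result (journey offer), clicked, booked, utility score
      ○ Position 1: clicked Y, booked N, utility score 1
      ○ Position 2: clicked Y, booked Y, utility score 11
      ○ Position 3: clicked N, booked N, utility score 0
      ○ Position 4: clicked Y, booked N, utility score 1
      ○ Position 5: clicked N, booked N, utility score 0
      ○ Remaining positions: no more clicks or bookings, all 0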
  • 12. 6 Utility Generation
    ● What is the problem with this utility scoring scheme?
      ○ Results 3 and 5 both have score 0
      ○ We have more evidence that result 3 is not of "convincing utility" to the user, as s/he skipped it and clicked result 4
      ○ We don't have such evidence for result 5 (maybe our initial ranking was the reason!)
      ○ Same problem with results 1 and 4
      ○ Results 1 and 3 should be PENALIZED
    ● Utility scoring refinement (second iteration, U2)
    ● Score meaning
      ○ Positive: we have some evidence that the result has user-received utility
      ○ Negative: we have some evidence that the result does NOT have user-received utility
      ○ 0: we have no evidence of whether the result has user-received utility
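  • 13. 6 Utility Generation
    ● Utility score after refinement
    ● Columns: position (initial ranking), search result (journey offer), clicked, booked, U1, U2 refinement, U2
      ○ Position 1: clicked Y, booked N, U1 = 1, U2 refinement 1 - 11, U2 = -10
      ○ Position 2: clicked Y, booked Y, U1 = 11, U2 refinement 11, U2 = 11
      ○ Position 3: clicked N, booked N, U1 = 0, U2 refinement 0 - 1, U2 = -1
      ○ Position 4: clicked Y, booked N, U1 = 1, U2 refinement 1, U2 = 1
      ○ Position 5: clicked N, booked N, U1 = 0, U2 refinement 0, U2 = 0
      ○ Remaining positions: no more clicks or bookings, all 0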
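  • 14. 6 Utility Generation
    ● Desired ranking
    ● Columns: position, search result (journey offer), clicked, booked, U1, U2 refinement, U2
      ○ Position 2: clicked Y, booked Y, U1 = 11, U2 refinement 11, U2 = 11 (evidence of utility)
      ○ Position 4: clicked Y, booked N, U1 = 1, U2 refinement 1, U2 = 1 (evidence of utility)
      ○ Position 5: clicked N, booked N, U1 = 0, U2 refinement 0, U2 = 0 (no evidence)
      ○ Position 3: clicked N, booked N, U1 = 0, U2 refinement 0 - 1, U2 = -1 (evidence of NO utility)
      ○ Position 1: clicked Y, booked N, U1 = 1, U2 refinement 1 - 11, U2 = -10 (evidence of NO utility)
      ○ Remaining positions: no more clicks or bookings, all 0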
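The scoring walked through on the utility slides can be reproduced with a short sketch. The slide formulas did not survive in this transcript, so both rules below are assumptions inferred from the example tables: the first-iteration score is taken to be U1 = w * (c + K * b), and the refinement is taken to subtract the highest U1 found at a lower position whenever that score exceeds the result's own U1. The names `Result`, `u1` and `u2_scores` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Result:
    position: int  # position in the initial ranking (1 = top)
    clicked: bool
    booked: bool


def u1(result, k=10.0, w=1.0):
    """First-iteration score, ASSUMED to be w * (c + K * b)."""
    return w * (int(result.clicked) + k * int(result.booked))


def u2_scores(results, k=10.0, w=1.0):
    """Refined score per position.

    ASSUMED rule (inferred from the example tables): a result is penalized
    by the best U1 among results below it, but only if that score is
    higher than its own U1.
    """
    ordered = sorted(results, key=lambda r: r.position)
    first = [u1(r, k, w) for r in ordered]
    refined = {}
    for i, r in enumerate(ordered):
        best_below = max(first[i + 1:], default=0.0)
        penalty = best_below if best_below > first[i] else 0.0
        refined[r.position] = first[i] - penalty
    return refined


# The sample results page from the slides (K = 10, w = 1).
page = [
    Result(1, clicked=True, booked=False),
    Result(2, clicked=True, booked=True),
    Result(3, clicked=False, booked=False),
    Result(4, clicked=True, booked=False),
    Result(5, clicked=False, booked=False),
]

scores = u2_scores(page)
desired_order = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(desired_order)
```

Under these assumptions the sketch yields the same U2 values and the same desired ordering as the slide tables. Other rules could fit the five-row example equally well, so it should not be read as the author's exact definition.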
  • 15. 7 Predictors generation
    ● Combine search options and journey/offer details to predict the utility the user receives from a search result (journey)
    ● Predictors generation
      ○ Existing data
        - Search options: one-way/round-trip, price, travelers (number, age)
        - Journey details: time of day, duration, travel mode (bus, train, flight), provider, number of connections
      ○ Derived
        - Days to travel date, distance, duration-to-distance ratio, provider booking ratio
        - Domestic/international
        - Seasonality (day of week, quarter of year -> summer, winter); travelling is a seasonal business
      ○ Third party: provider rating
  • 16. 8 Weight Generation
    ● The most recent click-outs are more important than very old ones
    ● User behavior can change because of changes in provider results, price changes, discounts/offers, etc.
    ● X = time elapsed between the current date and the click-out date (days, weeks)
    ● The weight represents the recency of each clickout
    ● H is the time window for historical logs
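The weight formula itself is missing from this transcript. One form that would be consistent with the slide text (X as elapsed time, H as the history window, and a parameter alpha where a small alpha gives less weight to older results) is an exponential decay w = alpha^(X / H). The sketch below uses that form purely as an assumption; the function name and the zero weight outside the window are also illustrative choices.

```python
def recency_weight(days_elapsed, window_days, alpha):
    """ASSUMED decay: weight = alpha ** (X / H), with 0 < alpha <= 1.

    X = 0 (today) gives weight 1; X = H gives weight alpha.
    Click-outs older than the history window get weight 0 (an assumption).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if days_elapsed < 0 or window_days <= 0:
        raise ValueError("days_elapsed must be >= 0 and window_days > 0")
    if days_elapsed > window_days:
        return 0.0
    return alpha ** (days_elapsed / window_days)


# Weights for a 90-day window at the three alpha values from the slides.
for alpha in (0.1, 0.5, 0.9):
    print(alpha, [round(recency_weight(x, 90, alpha), 3) for x in (0, 30, 60, 90)])
```

With this form, alpha = 0.1 discounts a 90-day-old click-out to a tenth of a fresh one, while alpha = 0.9 treats old and new click-outs almost equally.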
  • 17. 9 Prediction problem
    ● Problem formulation: given a set of search options and journey details, predict utility
    ● Sort search results by predicted utility
    ● Each observation is a group of click-outs with the same search options and journey details (after preprocessing)
    ● Grouping helps the model generalize
    ● Classification/regression trees fit the problem
    ● PoC: limited to click-outs for the top 10 searched routes over the last 3 months
    ● Model inputs (from the slide diagram): recency -> weight; search options and journey details -> predictors; utility -> target
  • 18. 10 RPART
    ● Why
      ○ The problem is nonlinear
      ○ Presentable to stakeholders
      ○ Inputs are discrete, target is continuous
      ○ Highly customizable, many parameters
      ○ Many implementations exist
    ● rpart package
      ○ Well documented
      ○ Mature
  • 19. 10 RPART Parameters
    ● Columns: parameter, description, values experimented, mapping to problem
      ○ cp: threshold for fit/loss improvement; values 0.005, 0.01, 0.02; how restrictive the tree is in splitting
      ○ weights: observation weights; alpha = 0.1, 0.5, 0.9; small alpha = less weight to older results
      ○ method: split criterion; anova; regression problem
      ○ minsplit: minimum number of observations in a node to attempt a split; values 20, 30, 40
  • 20. 10 RPART
    ● Divide the data into train (from t = 1 to K) and test (t = K+1 to N)
    ● Example: build the model using data from the previous 3 months; train is the oldest 2 months and 3 weeks, test is the latest week
    ● Reason: to test the capability of predicting the future
      - The model is to be updated every 1 or 2 weeks
    ● Internal accuracy measure = RMSE
    ● Grid search (expand.grid)
    ● Several runs with rolling origin (cross-validation with different time points)
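The evaluation setup on this slide (a time-ordered train/test split, repeated from several origins, combined with a parameter grid) can be sketched in Python as below. The PoC used R's `expand.grid`; `itertools.product` is used here as a stand-in. The dates, the 77-day training window, the weekly step, and the function name `rolling_origin_splits` are assumptions for illustration, and the sketch uses a fixed-length sliding training window, which is only one way to implement a rolling origin.

```python
from datetime import date, timedelta
from itertools import product


def rolling_origin_splits(first_day, last_day, train_days, test_days, step_days):
    """Yield (train_start, train_end, test_start, test_end), dates inclusive.

    Each test window starts the day after its training window ends, so the
    model is always evaluated on data newer than anything it was trained on.
    """
    train_start = first_day
    while True:
        train_end = train_start + timedelta(days=train_days - 1)
        test_start = train_end + timedelta(days=1)
        test_end = test_start + timedelta(days=test_days - 1)
        if test_end > last_day:
            break
        yield train_start, train_end, test_start, test_end
        train_start += timedelta(days=step_days)


# Parameter grid over the values listed on the parameters slide.
grid = [
    {"cp": cp, "alpha": alpha, "minsplit": minsplit}
    for cp, alpha, minsplit in product(
        [0.005, 0.01, 0.02], [0.1, 0.5, 0.9], [20, 30, 40]
    )
]

# Made-up log period: about 11 weeks of training, 1 week of test, weekly step.
first_day, last_day = date(2016, 1, 1), date(2016, 4, 30)
splits = list(
    rolling_origin_splits(first_day, last_day, train_days=77, test_days=7, step_days=7)
)

for split in splits:
    print(*split)
print(len(grid), "parameter combinations per split")
```

Each parameter combination would then be fitted on every training window and scored by RMSE on the matching test window, with the average across origins used to pick the final setting.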
  • 21. 11 Pipeline
    ● Stages (from the slide diagram): Data transformation (MapReduce) -> Feature Matrix -> Model Building -> Optimization -> Model deployment -> Search backend
  • 22. 11 Pipeline
    ● Spark MR makes it easy to transform large sets of data
      ○ Using Spark only for ETL does not fully utilize Spark's power!
    ● CSV for the feature matrix
      ○ Simplicity, readability
      ○ Easiest for R (PoC)
    ● R for model building:
      ○ Rich in libraries
      ○ Quick prototyping and testing
      ○ Experience!
    ● Exported as PMML
      ○ Widely used XML format for exchanging ML models
      ○ Export library exists (pmml.rpart)
      ○ Consumer library exists (jpmml)
  • 23. 12 Relevance Measure
    ● Treat it as a binary retrieval problem (a result is relevant or not)
    ● Treat the results we present in the top K positions as (possibly relevant); the others are not
    ● K usually ranges from 3 to 5 (a more objective measure is needed!)
    ● Precision and recall for bookings, not search events
    ● It is not completely reliable to use search events as a relevance measure, due to position bias
    ● It is too restrictive to base measures only on bookings (the booking cycle is not related to search)
    ● As a start, rely on bookings only (measure relative improvement)
  • 25. 14 Relevance Measure
    ● Recall is defined only if the number of events > 0; otherwise it is undefined
    ● Events can be clicks or bookings
    ● In the end, the measures should affect CTR and CR
    ● PoC: no presentable results yet
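The top-K relevance measure described on these slides can be sketched as follows. The slide with the exact definitions is not part of this transcript, so the standard precision-at-K and recall-at-K definitions are assumed here, with booked results treated as the relevant ones. The function name and the sample identifiers are hypothetical.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall of the top-k ranked results.

    relevant_ids holds the results with an event (here: bookings).
    Recall is None when there are no events, since it is undefined then.
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for result_id in top_k if result_id in relevant_ids)
    precision = hits / len(top_k) if top_k else None
    recall = hits / len(relevant_ids) if relevant_ids else None
    return precision, recall


# Made-up ranking of five journey offers; two of them were booked.
ranking = ["a", "b", "c", "d", "e"]
booked = {"b", "e"}

precision, recall = precision_recall_at_k(ranking, booked, k=3)
print(precision, recall)
```

Comparing these values for the initial ranking and for the model's ranking over the same set of bookings gives the relative improvement the slides propose as a starting point.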
  • 26. 15 References
    1. The Power of Ranking: Quantifying the Effects of Rankings on Online Consumer Search and Choice
    2. An Introduction to Recursive Partitioning Using the RPART Routines
    3. Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation
  • 27. 16 Possible improvements
    ● More work on model optimization
    ● Loss function for missing relevant results
    ● Ranking SVM
    ● Results shuffling
    ● Get user profile data
    ● Speed up model building (R parallel / Spark)
    ● Investigate the relevance measure further