How we learned to rank search results
Mouloud Lounaci & Andres Pipicello
Argentina Big Data Meetup
Meetup #6: Long time, no see OLX
October, 2018
mouloud.lounaci@olx.com
@mlounaci
https://www.linkedin.com/in/mlounaci
andres.pipicello@olx.com
https://www.linkedin.com/in/andrespipicello
Plan
OLX Group
Personalization and Relevance (PnR)
Learning To Rank (LTR)
The ranking journey
Building Dataset
Modeling
Serving the model
Results
Preface: OLX Group
What do you need to know about OLX Group?
Scale of Data at OLX Group
● 35B monthly page views
● 350M monthly users
● 60M monthly listings
● 4B daily events
Every minute...
● 2.5M events captured
● 500 houses listed
● 500 cars listed
● 1000 phones listed
Classifieds?
Two-sided Marketplace
● Buyers looking for goods or services
● Sellers offering goods or services
OLX’s Mission
● Match buyers with sellers using
○ large scale data
○ state-of-the-art technology
Today it’s ALL about “search”
▪ Retrieval of relevant listings
– Query understanding
– Query-listing matching
▪ Ranking of relevant listings
– Learning to rank (LTR) using query, user and listing features
Query → Ranked relevant items
Actually it’s about “Ranking”
Query → Ranked items
Chapter 1: Personalization and Relevance
Who are “we”?
What do we do?
How do we do it?
PnR - The Team
PnR - Architecture
Data Sources
Indexing
Retrieval + Ranking
Ad Retrieval: Phoenix
● Constructs the feed to send to the user
● Manages all the different spells (algorithms) used in the feed
● Splitter for A|B testing
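A splitter of this kind is often built on deterministic hash bucketing. The sketch below is only an illustration under that assumption, not Phoenix's actual code: the function `assign_variant`, the experiment name and the variant names are all hypothetical.

```python
import hashlib
from typing import List


def assign_variant(user_id: str, experiment: str, variants: List[str]) -> str:
    """Deterministically map a user to one variant of an experiment.

    Hypothetical helper: hashing (experiment, user_id) keeps a user in the
    same variant across requests without storing any assignment state.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# Hypothetical experiment: a baseline spell against the learned reranker.
variant = assign_variant("user-42", "ltr-rerank", ["control", "ragnarok"])
```

Including the experiment name in the hash means the same user can land in different buckets for different experiments, which avoids correlated assignments.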
Ad Retrieval: Loki
● Executes the spell (algorithm) from Phoenix
● Interacts with all the different data sources
● Caches items for fast page 2 retrieval
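The page 2 caching idea can be sketched as follows. This is an assumption about how such a cache might look, not Loki's implementation: the class `RankedResultCache`, its parameters and the search key format are hypothetical.

```python
from collections import OrderedDict
from typing import List, Optional


class RankedResultCache:
    """Small LRU cache of ranked item ids, keyed by a search key.

    Hypothetical sketch: the ranked list is computed once for page 1 and
    later pages are sliced from the cached list instead of re-running the spell.
    """

    def __init__(self, max_entries: int = 1000, page_size: int = 20):
        self.max_entries = max_entries
        self.page_size = page_size
        self._entries = OrderedDict()  # search_key -> list of ranked item ids

    def put(self, search_key: str, ranked_item_ids: List[str]) -> None:
        self._entries[search_key] = list(ranked_item_ids)
        self._entries.move_to_end(search_key)
        # Evict the least recently used entries when over capacity.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def get_page(self, search_key: str, page: int) -> Optional[List[str]]:
        """Return the requested 1-based page, or None on a cache miss."""
        ranked = self._entries.get(search_key)
        if ranked is None:
            return None
        self._entries.move_to_end(search_key)
        start = (page - 1) * self.page_size
        return ranked[start:start + self.page_size]
```

Serving later pages from the cached ranking also keeps pagination consistent: the user does not see items shift between pages because the ranking was recomputed.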
Chapter 2: Learning To Rank
Ragnarok?
Why LTR?
What is LTR for us?
What do we want to do?
“Learn from the data how to rank a result set for a search query”
a.k.a. RAGNAROK AD RERANKER
Why Learning To Rank?
1. Manual models become hard to tune with a very large number of features.
2. LTR leverages large volumes of user behaviour data (clicks/replies) in an automated way.
3. LTR creates a personalized ranking by including user features (social search).
Overview: RAGNAROK AD RERANKER
[Diagram: User query → Spell → Returned documents (top retrieved documents) → RAGNAROK AD RERANKER → (Re)ranked results (top reranked documents). User behaviour is the relevance signal: “If I click/reply, then it’s relevant for me.”]
Chapter 3: The ranking journey
Before we start, we need tools
The search for gold
Mining the gold (Spark), funnel?
Modeling, or transforming the gold
Serving the model
The ranking journey
Step 1: Building Infra
● Access large history data (Reservoir)
● Build infra to process it (EMR)
Step 2: Building Dataset
● Process label (judgement score proxy with clicks/replies)
● Process features
Step 3: Analysing Dataset
● Analyse click and reply behaviour
● Build “gold standard” dataset for ranking
Step 4: Building Model
● Iterating on models
● Evaluating models
● Selecting a model
Step 5: Serving Model
● Design service architecture
● Define service requirements
● Create ranking endpoint
Step 6: Integration with PnR Architecture
● Integrate the ranking in the ad retrieval flow
● Define the interaction with the PnR components
Step 1: Building Infra
[Diagram: RELEVANCE RESERVOIR. Inputs: user browsing (Parquet, 1h delay) and ads (JSON, 5 min delay). Outputs: labeled dataset and features. Layers: storage and processing.]
Big Data?
1 year of Android history for South Africa data...
● 5B user events
● 800M search impressions
● 40M individual searches
Scalability is key
Step 2: Building Dataset
The search for the “gold” standard dataset
Gold looks like this for us...
query_id  query features  item position  item_id  item features  label (relevance judgement)
1         ...             1              item1    ...            0
1         ...             2              item2    ...            3
1         ...             3              item3    ...            1
1         ...             4              item4    ...            2
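One way to hold rows of this shape in memory is a small record type grouped by search. The sketch below is an illustration only: the names `GoldRow` and `group_by_query` are hypothetical, and the feature dictionaries are left empty because the table elides them.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GoldRow:
    query_id: int
    position: int   # rank at which the item was shown
    item_id: str
    label: int      # relevance judgement
    query_features: Dict[str, float] = field(default_factory=dict)
    item_features: Dict[str, float] = field(default_factory=dict)


def group_by_query(rows: List[GoldRow]) -> Dict[int, List[GoldRow]]:
    """Group rows per search and order each group by displayed position."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.query_id].append(row)
    for query_rows in grouped.values():
        query_rows.sort(key=lambda r: r.position)
    return dict(grouped)


# The four example rows from the table (feature values are elided there).
rows = [
    GoldRow(1, 1, "item1", 0),
    GoldRow(1, 2, "item2", 3),
    GoldRow(1, 3, "item3", 1),
    GoldRow(1, 4, "item4", 2),
]
searches = group_by_query(rows)
```

Grouping by `query_id` matters for ranking because both training pairs and ranking metrics are computed within a single search, never across searches.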
Let’s “spark” it off
We used Spark (EMR) to build the dataset from user browsing data.
[Diagram: Hydra (trackings) → Labeling (apply funnel) → Labeled searches (funnel)]
Proxy label, the “funnel”
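A funnel-based proxy label can be sketched as "the deepest stage a user reached for an item within a search". The grades below are an assumption: the slides only say the label is a judgement-score proxy built from clicks and replies, so the stage names, the grade values and the helper `funnel_label` are hypothetical.

```python
from typing import Iterable

# Assumed funnel stages and grades; the exact mapping is not given in the slides.
FUNNEL_GRADES = {"impression": 0, "adview": 1, "reply": 2}


def funnel_label(events: Iterable[str]) -> int:
    """Return the grade of the deepest funnel stage seen for one (search, item) pair.

    Unknown event types are ignored; an item with no known events gets grade 0.
    """
    grades = [FUNNEL_GRADES[event] for event in events if event in FUNNEL_GRADES]
    return max(grades, default=0)


# An item that was shown, opened and replied to gets the highest grade.
label = funnel_label(["impression", "adview", "reply"])
```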
Step 3: Analysing Dataset
● Consider only searches with at least one reply for training (to improve quality)
● Include only searches with more than 3 (4) impressions (user behaviour is affected by smaller result sets)
● Inside each search, consider impressions up to the 30th, 50th or 60th position
● Use a metric that gives more importance to top positions (NDCG with customized decay)
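The selection rules and the metric above can be sketched as below. This is a minimal illustration, not the team's code: the slides do not specify the customized decay, so the discount is a pluggable function (the standard log2 discount by default), and `reply_grade`, `min_impressions` and `max_position` are hypothetical parameter names with assumed defaults.

```python
import math
from typing import Callable, Sequence


def log2_discount(position: int) -> float:
    """Standard DCG discount for a 1-based position."""
    return 1.0 / math.log2(position + 1)


def dcg(labels: Sequence[int], k: int,
        discount: Callable[[int], float] = log2_discount) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(
        (2 ** label - 1) * discount(position)
        for position, label in enumerate(labels[:k], start=1)
    )


def ndcg(labels: Sequence[int], k: int,
         discount: Callable[[int], float] = log2_discount) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(labels, reverse=True), k, discount)
    return dcg(labels, k, discount) / ideal if ideal > 0 else 0.0


def keep_search(labels: Sequence[int], reply_grade: int = 2,
                min_impressions: int = 4, max_position: int = 30) -> bool:
    """Keep a search with enough impressions and at least one reply in the window."""
    window = labels[:max_position]
    return len(window) >= min_impressions and any(l >= reply_grade for l in window)


labels = [0, 3, 1, 2]  # labels in displayed order, as in the gold table
score_default = ndcg(labels, k=4)
# A steeper, hypothetical decay that weights the top positions more heavily.
score_steep = ndcg(labels, k=4, discount=lambda position: 1.0 / position)
```

Passing the discount as a function makes it cheap to compare several decay shapes on the same dataset before settling on one.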
Step 4: Building Model
Start with a simple approach
Training data: queries Q1 ... Qn, each query Qi with documents Di,1 ... Di,m.
● Pointwise (baseline): $f(Q_i, D_{i,j}) = \text{score}$
Example: (Q1, D1) → 0.85, (Q1, D3) → 0.65, (Q1, D2) → 0.30
● Pairwise: $f(Q_i, D_{i,j} > D_{i,k}) = \text{score} \in \{0, 1\}$
Example: (Q1, D1 > D2) → 1, (Q1, D2 > D3) → 0, (Q1, D3 > D4) → 1
● Listwise: $f(Q_i, \{D_{i,1}, \dots, D_{i,m}\}) = \text{ranked } \{D_{i,1}, \dots, D_{i,m}\}$
Example: (Q1; D1, D2, D3) → D1, D3, D2
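The three formulations differ mainly in how training examples are built from the same labeled search. The sketch below illustrates this with hypothetical helper names and made-up labels; it builds examples only and does not train anything.

```python
from itertools import combinations
from typing import List, Tuple

Doc = Tuple[str, int]  # (doc_id, relevance label)


def pointwise_examples(query_id: str, docs: List[Doc]):
    """One example per (query, document) with the label as target."""
    return [((query_id, doc_id), label) for doc_id, label in docs]


def pairwise_examples(query_id: str, docs: List[Doc]):
    """One example per pair with different labels; target 1 if the first doc is more relevant."""
    examples = []
    for (doc_a, label_a), (doc_b, label_b) in combinations(docs, 2):
        if label_a != label_b:
            examples.append(((query_id, doc_a, doc_b), 1 if label_a > label_b else 0))
    return examples


def listwise_example(query_id: str, docs: List[Doc]):
    """One example per query: the whole list as input, the ideal ordering as target."""
    ideal = sorted(docs, key=lambda doc: doc[1], reverse=True)
    return (query_id, [doc_id for doc_id, _ in docs]), [doc_id for doc_id, _ in ideal]


docs = [("D1", 2), ("D2", 0), ("D3", 1)]  # made-up labels for illustration
pointwise = pointwise_examples("Q1", docs)
pairwise = pairwise_examples("Q1", docs)
listwise = listwise_example("Q1", docs)
```

With these labels the listwise target is D1, D3, D2, the same ordering as in the slide's listwise example.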
McRank: from classification to ranking
● Pointwise approach
● Train a classifier to predict the relevance judgment $k \in \{0, 1, 2\}$
● Use the class probabilities $P(Y=k)$
Ranking score $= \sum_k P(Y=k) \cdot T(k)$, where in our case $T(k) = k$
Inspired by:
https://papers.nips.cc/paper/3270-mcrank-learning-to-rank-using-multiple-classification-and-gradient-boosting.pdf
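With T(k) = k, the ranking score is simply the expected relevance grade under the classifier's predicted distribution. The sketch below shows that scoring step on made-up probabilities; the function names and the item ids are hypothetical, and the classifier itself is out of scope.

```python
from typing import Dict, List, Sequence


def expected_relevance(class_probs: Sequence[float]) -> float:
    """Compute sum_k P(Y=k) * T(k) with T(k) = k, where class_probs[k] is P(Y=k)."""
    return sum(k * p for k, p in enumerate(class_probs))


def rank_items(probs_by_item: Dict[str, Sequence[float]]) -> List[str]:
    """Order item ids by descending expected relevance."""
    return sorted(
        probs_by_item,
        key=lambda item: expected_relevance(probs_by_item[item]),
        reverse=True,
    )


# Made-up class probabilities for k in {0, 1, 2}.
probs = {
    "item_a": [0.7, 0.2, 0.1],
    "item_b": [0.1, 0.3, 0.6],
    "item_c": [0.3, 0.5, 0.2],
}
ranking = rank_items(probs)
```

Using the full distribution rather than only the most likely class lets two items with the same predicted class still be ordered by how confident the classifier is.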
Three Classes of Features
Combined model:
● Item features
● Buyer features
● Seller features
Feature types: static features and interaction features (browsing)
Static item/query features
Raw attributes:
● Search string
● Search location
● Search time
● Ad title
● Ad description
● Ad location
● Ad creation time
● Ad price
● Ad private or business
● Ad image count
● Ad category
Features:
● Textual similarity (BM25)
● Length of the title
● Length of the description
● Freshness
● Proximity
● Price
● Is the seller a private seller or a business
● Image count
● Category
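Several of these features are direct transformations of the raw attributes. The sketch below derives a few of them under stated assumptions: the slides do not define the units, so freshness in hours, proximity as great-circle distance in kilometres and lengths in characters are guesses, and the dictionary keys are hypothetical. BM25 is left out because it needs corpus statistics.

```python
import math
from datetime import datetime, timezone


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


def static_features(search: dict, ad: dict) -> dict:
    """Derive static features from raw search and ad attributes (assumed schema)."""
    return {
        "title_length": len(ad["title"]),
        "description_length": len(ad["description"]),
        "freshness_hours": (search["time"] - ad["created_at"]).total_seconds() / 3600.0,
        "proximity_km": haversine_km(*search["location"], *ad["location"]),
        "price": ad["price"],
        "image_count": ad["image_count"],
    }


# Made-up search and ad records for illustration.
search = {"time": datetime(2018, 10, 2, tzinfo=timezone.utc), "location": (-33.92, 18.42)}
ad = {
    "title": "iPhone 7",
    "description": "Good condition",
    "created_at": datetime(2018, 10, 1, tzinfo=timezone.utc),
    "location": (-33.92, 18.42),
    "price": 3500.0,
    "image_count": 3,
}
features = static_features(search, ad)
```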
Item Interaction Features - Example
Data source: ods.fact_listing_activity
Interactions: impressions, ad views (clicks), replies
Time intervals: 30 days, 7 days, last day
Item interaction features:
num_impressions_30days
num_adviews_30days
num_replies_30days
num_impressions_7days
num_adviews_7days
num_replies_7days
num_impressions_lastday
num_adviews_lastday
num_replies_lastday
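The nine features above are the cross product of three interaction types and three time windows. The sketch below computes them for one item from a list of timestamped events. The feature names come from the slide; the event type strings, the event tuple shape and the function name are assumptions, since the schema of `ods.fact_listing_activity` is not shown.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

WINDOWS = {
    "30days": timedelta(days=30),
    "7days": timedelta(days=7),
    "lastday": timedelta(days=1),
}
# Assumed event type strings mapped to the name used in the feature.
EVENT_NAMES = {"impression": "impressions", "adview": "adviews", "reply": "replies"}


def item_interaction_features(events: Iterable[Tuple[str, datetime]],
                              now: datetime) -> Dict[str, int]:
    """Count one item's interactions per type and per trailing time window."""
    counts = Counter()
    for event_type, timestamp in events:
        name = EVENT_NAMES.get(event_type)
        age = now - timestamp
        if name is None or age < timedelta(0):
            continue  # skip unknown event types and events after `now`
        for suffix, window in WINDOWS.items():
            if age <= window:
                counts[f"num_{name}_{suffix}"] += 1
    # Emit every feature, including zero counts, so the schema is stable.
    return {
        f"num_{name}_{suffix}": counts[f"num_{name}_{suffix}"]
        for suffix in WINDOWS
        for name in EVENT_NAMES.values()
    }


now = datetime(2018, 10, 31)
events = [
    ("impression", now - timedelta(hours=2)),
    ("adview", now - timedelta(days=3)),
    ("reply", now - timedelta(days=10)),
    ("impression", now - timedelta(days=40)),
]
features = item_interaction_features(events, now)
```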
Step 5: Serving Model
We met “MLeap” on the way
The Service
● Training: AWS Data Pipeline, run every 7 days
● Serving: Scala Akka HTTP service with MLeap on OpenShift for prediction
[Diagram: Ranked items → RAGNAROK AD RERANKER → Reranked items]
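The core of a reranking endpoint is small: score each candidate and sort. The sketch below shows only that step, in Python for consistency with the other examples; the real service is described as Scala Akka HTTP with MLeap, and the function name, the item shape and the scoring function here are hypothetical.

```python
from typing import Callable, Dict, List


def rerank(items: List[Dict], score_fn: Callable[[Dict], float]) -> List[Dict]:
    """Return the candidates sorted by descending model score.

    sorted() is stable, so items with equal scores keep their retrieval order.
    """
    return sorted(items, key=score_fn, reverse=True)


# Made-up candidates; "score" stands in for a model prediction.
ranked_items = [
    {"item_id": "item1", "score": 0.30},
    {"item_id": "item2", "score": 0.85},
    {"item_id": "item3", "score": 0.30},
]
reranked_items = rerank(ranked_items, score_fn=lambda item: item["score"])
```

Keeping ties in retrieval order means the reranker never reshuffles items it cannot distinguish, which makes its effect easier to attribute in an A|B test.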
Chapter 4: Results
Does this work?
Offline?
Online?
Preliminary results (Offline)
Feature             Weight
Proximity           13
BM25                8.7
Freshness           4.6
Price               0
Title Length        -4.3
Description Length  -7.3
+14% nDCG
Preliminary results (Online)
Same feature weights as offline (+14% nDCG)
+8% Replies/DAU
Final results (Offline)
Feature                             Weight
Item Replies Received - 30 days     21.6
Preference for Cars - 30 days       15
Proximity                           9.1
BM25                                8.2
Preference for Car Parts - 30 days  8.2
Freshness                           5.7
+71% nDCG
Feature groups: item performance, buyer preference, basic features
Final results (Online)
Same feature weights as offline (+71% nDCG)
Online results: coming soon...
The end
Thank you. Any questions?
