Query expansion using
semantic query embeddings
Me, OLX, The team
My name is Mariano Semelman!
I come from Argentina.
@msemelman
mariano.semelman@olx.com
I’m a Data Scientist with 6 years of experience
working in:
● Behavioural targeting
● Natural language processing
● Recommendation systems
● Search engines
Me, OLX, The team
● OLX: Online classifieds
platform
● Berlin Shared Service:
support and center of
expertise for the rest of the
platform.
● PnR Services Team:
Search, Recommender
systems, Big Data.
Me, OLX, The team
Vladan
Radosavljevic
Head of Data
Science
Mariano
Semelman
Senior Data
Scientist
Manish
Saraswat
Data Scientist
Vaibhav
Sharma
Data Scientist
So frustrating...
Reasons: typos, wrong brand/model
combinations, localisms, specificity, etc.
What if we could search not just for what the
user typed, but also for highly similar
queries that mean the same thing?
Sessions
Search Sessions from OLX South Africa
S1: “honda nc 700”, “suzuki sv650”, “honda cbx 250 twister”, “honda xr 125”
S2: “13inch rims”, “rims”, “205 60 13”, “205”, “205_13inch”
S3: “mountain bicycle”, “fiets”, “bike”, “bicycle”
S4: “fencing”, “devils fork”
S5: “ferraro”, “ferrari”, “lamboghini”, “porsche”, “ewings”
S6: “catering table”, “funeral tent”, “wedding tent”, “bar stool”, “tiffany chairs”
Word2Vec
or How I Learned to Stop
Worrying and Love Embeddings
Embedding
Definition (very easy!):
F: X ↪ Y
X: your domain (e.g. words, categories, etc.)
Y: a domain with interesting properties for your problem
F: an injective function that translates from X to Y
The tricky part: creating F.
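A toy illustration (my own example, not from the talk), assuming X is a handful of category names and Y is R^2:

# Toy embedding F: X -> Y, with X = {category names} and Y = R^2.
# Injective: every category gets its own distinct vector.
# Illustrative values only; a real F is learned (e.g. by word2vec below).
toy_embedding = {
    "cars":       (0.9, 0.1),
    "motorbikes": (0.8, 0.2),  # close to "cars": related categories
    "furniture":  (0.1, 0.9),  # far from the vehicles: unrelated
}

def F(category):
    # translate an element of X into Y (a 2-d vector)
    return toy_embedding[category]

print(F("cars"))  # (0.9, 0.1)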
Word2Vec (skip-gram flavour)
The fake task!
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Under the hood
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Chapeau!
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Interesting property
If word A and word B always appear in similar
contexts, then cosine_similarity(F(A), F(B))
tends to 1.
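A minimal sketch of that property with made-up vectors (not the model's): cosine similarity is the dot product divided by the product of the norms, and near-parallel vectors score close to 1.

import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = a.b / (||a|| * ||b||); 1 means same direction
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Words that always share context end up with near-parallel vectors.
F_A = [0.9, 0.1, 0.4]
F_B = [0.8, 0.1, 0.5]
print(cosine_similarity(F_A, F_B))  # close to 1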
Gensim code
# import modules
import gensim

# toy corpus: a list of tokenized "sentences"
sentences = [['first', 'sentence'], ['second', 'sentence']]

# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)

# learned vocabulary (index -> word) and embedding matrix (one row per word)
indexes = model.wv.index2word   # gensim >= 4.0: model.wv.index_to_key
embedding = model.wv.vectors
Search2Vec
or What does all this have to do
with searches...
Based on the “Scalable Semantic Matching of Queries to
Ads in Sponsored Search Advertising” paper.
Remember the queries...
Search Sessions from OLX Data
S1: honda_nc_700 suzuki_sv650 honda_cbx_250_twister honda_xr_125
S2: 13inch_rims rims 205_60_13 205 205_13inch
S3: mountain_bicycle fiets bike bicycle
S4: fencing devils_fork
S5: ferraro ferrari lamboghini porsche ewings
S6: catering_table funeral_tent wedding_tent bar_stool tiffany_chairs
Remember the queries...
Search Sessions from OLX Data
S1: honda_nc_700 suzuki_sv650 honda_cbx_250_twister honda_xr_125
Train samples:
(honda_nc_700, suzuki_sv650)
(suzuki_sv650, honda_nc_700)
(suzuki_sv650, honda_cbx_250_twister)
(honda_cbx_250_twister, suzuki_sv650)
(honda_cbx_250_twister, honda_xr_125)
(honda_xr_125, honda_cbx_250_twister)
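Those pairs are exactly what a skip-gram pass with a window of 1 emits for session S1; a small sketch of the pair generation (the helper name is mine, not OLX code):

def skipgram_pairs(session, window=1):
    # yield (center, context) training pairs from one search session
    pairs = []
    for i, center in enumerate(session):
        lo, hi = max(0, i - window), min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, session[j]))
    return pairs

s1 = ["honda_nc_700", "suzuki_sv650", "honda_cbx_250_twister", "honda_xr_125"]
for pair in skipgram_pairs(s1):
    print(pair)
# (honda_nc_700, suzuki_sv650), (suzuki_sv650, honda_nc_700), ...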
Training Data
~110M searches across a year
~12M sessions (aka sentences)
~4M unique searches
Preprocessing (pyspark; a rough sketch follows below):
● lowercase; remove trailing spaces, stopwords, punctuation marks, double spaces, etc.
● remove outliers: long “sentences”, long-tail queries (<10 occurrences)
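A rough pyspark sketch of that preprocessing, assuming sessions arrive as a DataFrame with a queries array column (the schema, column names, stopword list and 50-query session cap are my assumptions; only the <10 occurrences cutoff is from the slide):

import re
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
STOPWORDS = {"for", "the", "and"}  # illustrative list

def normalize(query):
    # lowercase, strip punctuation and extra spaces, drop stopwords, join with "_"
    query = re.sub(r"[^\w\s]", " ", query.lower())
    tokens = [t for t in query.split() if t not in STOPWORDS]
    return "_".join(tokens)  # e.g. "Honda NC 700 " -> "honda_nc_700"

normalize_udf = F.udf(lambda qs: [normalize(q) for q in qs], "array<string>")

# sessions: DataFrame(session_id, queries: array<string>) -- assumed schema
sessions = spark.read.parquet("sessions.parquet")
clean = (sessions
         .withColumn("queries", normalize_udf("queries"))
         .where(F.size("queries") <= 50))  # drop outlier "sentences"

# keep only queries with at least 10 occurrences across the corpus
frequent_queries = (clean.select(F.explode("queries").alias("q"))
                         .groupBy("q").count()
                         .where("count >= 10"))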
We have our model...
Offline evaluation
If you are searching for "${search_string}", do you expect similar results for "${related_query}"?
● 1) very similar results
● 2) related results
● 3) very different results
Tail queries
Limitations
Head queries: 162k embeddings =)
Tail queries: 3.8M =(
[Plot: frequency per query (long tail), with the head/tail cutoff at 10 occurrences]
Step 1: find the top K queries for each head query in the vocabulary (examples below; a code sketch follows the tables)
query: scuba diving gear
expansion                 score
scuba diving equipment    0.792
diving gear               0.766
scuba equipment           0.765
scuba gear                0.764
scuba shop                0.763

query: bread machines
expansion                 score
bread maker               0.728
bread machines            0.722
cusinart bread maker      0.644
bread machine reviews     0.621
bread machine recipes     0.605

query: 4x4
expansion                 score
jeep                      0.824
4x4 jeep                  0.819
isuzu 4x4                 0.805
toyota 4x4                0.793
hilux 4x4                 0.790
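Step 1 is essentially a nearest-neighbour lookup in the embedding space; a minimal sketch using gensim's most_similar, assuming model is the trained Word2Vec from before (K and the example query are illustrative):

K = 5

def expand(head_query, model, k=K):
    # top-k vocabulary queries by cosine similarity to the head query
    return model.wv.most_similar(head_query, topn=k)

# e.g. expand("scuba_diving_gear", model) ->
# [("scuba_diving_equipment", 0.79), ("diving_gear", 0.77), ...]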
Tail queries
Tail query
Step 2: form query documents (i.e. flatten); a sketch follows the table

id                 document
scuba_diving_gear  scuba diving equipment diving gear scuba equipment scuba gear scuba shop
bread_machines     bread maker bread machines cusinart bread maker bread machine reviews bread machine recipes
4x4                jeep 4x4 jeep isuzu 4x4 toyota 4x4 hilux 4x4 bakkie nissan 4x4 4x4 off road
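Continuing the sketch, Step 2 flattens each head query's expansions back into plain words to form one retrievable document per head query (helper names are mine):

def to_document(head_query, model, k=5):
    # concatenate the top-k expansions into one whitespace-separated document
    expansions = [q for q, _score in model.wv.most_similar(head_query, topn=k)]
    return " ".join(q.replace("_", " ") for q in expansions)

# one document per head query in the vocabulary
query_documents = {q: to_document(q, model) for q in model.wv.index2word}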
Tail query vectors
Step 3: build an inverted index for fast matching (BM25); a sketch follows the table

input query       top result         top result’s document
diving equipment  scuba_diving_gear  scuba diving equipment diving gear scuba equipment scuba gear scuba shop
cusinart machine  bread_machines     bread maker bread machines cusinart bread maker bread machine reviews bread machine recipes
off road bakkie   4x4                jeep 4x4 jeep isuzu 4x4 toyota 4x4 hilux 4x4 bakkie nissan 4x4 4x4 off road
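In production such an inverted index would typically live in a search engine; as an offline sketch, the rank_bm25 package (my choice, not necessarily what OLX used) gives the same BM25 matching over the query documents from Step 2:

from rank_bm25 import BM25Okapi  # pip install rank-bm25

ids = list(query_documents)                          # head-query ids
corpus = [query_documents[i].split() for i in ids]   # tokenized documents
bm25 = BM25Okapi(corpus)

def match_tail_query(tail_query):
    # head-query id whose document best matches the tail query's tokens
    scores = bm25.get_scores(tail_query.split())
    return ids[max(range(len(ids)), key=scores.__getitem__)]

# match_tail_query("diving equipment")  -> "scuba_diving_gear"
# match_tail_query("off road bakkie")   -> "4x4"   (expected, per the table)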
Offline Analysis with holdout data
[Diagram: queries ordered by frequency are split 90% / 10%; the 90% head is used to build the matching index and the query vectors, the 10% holdout is used for testing the matching. For each holdout query, cosine similarity is computed between its own vector Vq-context (e.g. 0.3 1.3 6.2 0.5 3.1) and the vector Vq-index of the index’s top result (e.g. 0.2 1.4 7.2 0.6 6.1); a sketch of this check follows.]
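The holdout check then boils down to one cosine similarity per held-out query, between its own vector and the vector of the query the index returned; a sketch with the illustrative vectors from the slide:

import numpy as np

v_q_context = np.array([0.3, 1.3, 6.2, 0.5, 3.1])  # held-out query's vector
v_q_index   = np.array([0.2, 1.4, 7.2, 0.6, 6.1])  # vector of the index's top result

cos = float(v_q_context @ v_q_index) / (
    np.linalg.norm(v_q_context) * np.linalg.norm(v_q_index))
print(cos)  # ~0.97: the BM25 match preserved the query's meaning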
Tail Queries Solution
Numbers, please!
We launched to production!!!
Example searches: “playstation 4”, “peugot”, “ktm 900”
8.7% -> 0.6%
Searches with no results
+13.5%**
Increase in new contacts per day
Thank you very much!
Questions?
Possible extensions
Include more entities in the sessions:
● Listings the user interacted with
● Categories the user browsed
● Locations the user searched for / interacted with
Meta-prod2vec:
Add side information while generating pairs.
Meta-Prod2Vec - Product Embeddings Using Side-Information for Recommendation
