Query expansion using
semantic query embeddings
Me, OLX, The team
My name is Mariano Semelman!
I come from Argentina.
@msemelman
mariano.semelman@olx.com
I’m a Data Scientist with 6 years of experience
working in:
● Behavioural targeting
● Natural language processing
● Recommendation systems
● Search engines
Me, OLX, The team
● OLX: Online classifieds
platform
● Berlin Shared Service:
support and center of
expertise for the rest of the
platform.
● PnR Services Team:
Search, Recommender
systems, Big Data.
Me, OLX, The team
Vladan
Radosavljevic
Head of Data
Science
Mariano
Semelman
Senior Data
Scientist
Manish
Saraswat
Data Scientist
Vaibhav
Sharma
Data Scientist
So frustrating...
Reasons: typos, wrong brand/model
combinations, localisms, specificity, etc.
What if we could search not just for what the
user typed, but also for highly similar
queries that mean the same thing?
Sessions
Search Sessions from OLX South Africa
S1: “honda nc 700”, “suzuki sv650”, “honda cbx 250 twister”, “honda xr 125”
S2: “13inch rims”, “rims”, “205 60 13”, “205”, “205_13inch”
S3: “mountain bicycle”, “fiets”, “bike”, “bicycle”
S4: “fencing”, “devils fork”
S5: “ferraro”, “ferrari”, “lamboghini”, “porsche”, “ewings”
S6: “catering table”, “funeral tent”, “wedding tent”, “bar stool”, “tiffany chairs”
Word2Vec
or How I Learned to Stop
Worrying and Love Embeddings
Embedding
Definition (very easy!):
F: X ↪ Y
X: your domain (e.g. words, categories, etc.)
Y: a domain with interesting properties for your problem
F: an injective function that translates from X to Y
The tricky part: creating F.
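A toy illustration (my own example, not from the talk), assuming X is a handful of category names and Y is R^2:

# Toy embedding F: X -> Y, with X = {category names} and Y = R^2.
# Injective: every category gets its own distinct vector.
# Illustrative values only; a real F is learned (e.g. by word2vec below).
toy_embedding = {
    "cars":       (0.9, 0.1),
    "motorbikes": (0.8, 0.2),  # close to "cars": related categories
    "furniture":  (0.1, 0.9),  # far from the vehicles: unrelated
}

def F(category):
    # translate an element of X into Y (a 2-d vector)
    return toy_embedding[category]

print(F("cars"))  # (0.9, 0.1)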
Word2Vec (skip-gram flavour)
The fake task!
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Under the hood
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Chapeau!
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Interesting property
If word A and word B always appear in similar
contexts, then cosine_similarity(F(A), F(B))
tends to 1.
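A minimal sketch of that property with made-up vectors (not the model's): cosine similarity is the dot product divided by the product of the norms, and near-parallel vectors score close to 1.

import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = a.b / (||a|| * ||b||); 1 means same direction
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Words that always share context end up with near-parallel vectors.
F_A = [0.9, 0.1, 0.4]
F_B = [0.8, 0.1, 0.5]
print(cosine_similarity(F_A, F_B))  # close to 1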
Gensim code
# import modules
import gensim

# toy corpus: a list of tokenized "sentences"
sentences = [['first', 'sentence'], ['second', 'sentence']]

# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)

# learned vocabulary (index -> word) and embedding matrix (one row per word)
indexes = model.wv.index2word   # gensim >= 4.0: model.wv.index_to_key
embedding = model.wv.vectors
Search2Vec
or What does all this have to do
with searches...
Based on the “Scalable Semantic Matching of Queries to
Ads in Sponsored Search Advertising” paper.
Remember the queries...
Search Sessions from OLX Data
S1: honda_nc_700 suzuki_sv650 honda_cbx_250_twister honda_xr_125
S2: 13inch_rims rims 205_60_13 205 205_13inch
S3: mountain_bicycle fiets bike bicycle
S4: fencing devils_fork
S5: ferraro ferrari lamboghini porsche ewings
S6: catering_table funeral_tent wedding_tent bar_stool tiffany_chairs
Remember the queries...
Search Sessions from OLX Data
S1: honda_nc_700 suzuki_sv650 honda_cbx_250_twister honda_xr_125
Train samples:
(honda_nc_700, suzuki_sv650)
(suzuki_sv650, honda_nc_700)
(suzuki_sv650, honda_cbx_250_twister)
(honda_cbx_250_twister, suzuki_sv650)
(honda_cbx_250_twister, honda_xr_125)
(honda_xr_125, honda_cbx_250_twister)
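Those pairs are exactly what a skip-gram pass with a window of 1 emits for session S1; a small sketch of the pair generation (the helper name is mine, not OLX code):

def skipgram_pairs(session, window=1):
    # yield (center, context) training pairs from one search session
    pairs = []
    for i, center in enumerate(session):
        lo, hi = max(0, i - window), min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, session[j]))
    return pairs

s1 = ["honda_nc_700", "suzuki_sv650", "honda_cbx_250_twister", "honda_xr_125"]
for pair in skipgram_pairs(s1):
    print(pair)
# (honda_nc_700, suzuki_sv650), (suzuki_sv650, honda_nc_700), ...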
Training Data
~110M searches across a year
~12M sessions (aka sentences)
~4M unique searches
Preprocessing (pyspark; a rough sketch follows below):
● lowercase; remove trailing spaces, stopwords, punctuation marks, double spaces, etc.
● remove outliers: long “sentences”, long-tail queries (<10 occurrences)
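A rough pyspark sketch of that preprocessing, assuming sessions arrive as a DataFrame with a queries array column (the schema, column names, stopword list and 50-query session cap are my assumptions; only the <10 occurrences cutoff is from the slide):

import re
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
STOPWORDS = {"for", "the", "and"}  # illustrative list

def normalize(query):
    # lowercase, strip punctuation and extra spaces, drop stopwords, join with "_"
    query = re.sub(r"[^\w\s]", " ", query.lower())
    tokens = [t for t in query.split() if t not in STOPWORDS]
    return "_".join(tokens)  # e.g. "Honda NC 700 " -> "honda_nc_700"

normalize_udf = F.udf(lambda qs: [normalize(q) for q in qs], "array<string>")

# sessions: DataFrame(session_id, queries: array<string>) -- assumed schema
sessions = spark.read.parquet("sessions.parquet")
clean = (sessions
         .withColumn("queries", normalize_udf("queries"))
         .where(F.size("queries") <= 50))  # drop outlier "sentences"

# keep only queries with at least 10 occurrences across the corpus
frequent_queries = (clean.select(F.explode("queries").alias("q"))
                         .groupBy("q").count()
                         .where("count >= 10"))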
We have our model...
Offline evaluation
If you are searching for "${search_string}", do you expect similar results for "${related_query}"?
● 1) very similar results
● 2) related results
● 3) very different results
Tail queries
Limitations
Head queries: 162k embeddings =)
Tail queries: 3.8M =(
[Plot: frequency per query (long tail), with the head/tail cutoff at 10 occurrences]
Step 1: find the top K queries for each head query in the vocabulary (examples below; a code sketch follows the tables)
query: scuba diving gear
expansion                 score
scuba diving equipment    0.792
diving gear               0.766
scuba equipment           0.765
scuba gear                0.764
scuba shop                0.763

query: bread machines
expansion                 score
bread maker               0.728
bread machines            0.722
cusinart bread maker      0.644
bread machine reviews     0.621
bread machine recipes     0.605

query: 4x4
expansion                 score
jeep                      0.824
4x4 jeep                  0.819
isuzu 4x4                 0.805
toyota 4x4                0.793
hilux 4x4                 0.790
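Step 1 is essentially a nearest-neighbour lookup in the embedding space; a minimal sketch using gensim's most_similar, assuming model is the trained Word2Vec from before (K and the example query are illustrative):

K = 5

def expand(head_query, model, k=K):
    # top-k vocabulary queries by cosine similarity to the head query
    return model.wv.most_similar(head_query, topn=k)

# e.g. expand("scuba_diving_gear", model) ->
# [("scuba_diving_equipment", 0.79), ("diving_gear", 0.77), ...]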
Tail queries
Tail query
Step 2: form query documents (i.e. flatten); a sketch follows the table

id                 document
scuba_diving_gear  scuba diving equipment diving gear scuba equipment scuba gear scuba shop
bread_machines     bread maker bread machines cusinart bread maker bread machine reviews bread machine recipes
4x4                jeep 4x4 jeep isuzu 4x4 toyota 4x4 hilux 4x4 bakkie nissan 4x4 4x4 off road
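Continuing the sketch, Step 2 flattens each head query's expansions back into plain words to form one retrievable document per head query (helper names are mine):

def to_document(head_query, model, k=5):
    # concatenate the top-k expansions into one whitespace-separated document
    expansions = [q for q, _score in model.wv.most_similar(head_query, topn=k)]
    return " ".join(q.replace("_", " ") for q in expansions)

# one document per head query in the vocabulary
query_documents = {q: to_document(q, model) for q in model.wv.index2word}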
Tail query vectors
Step 3: build an inverted index for fast matching (BM25); a sketch follows the table

input query       top result         top result’s document
diving equipment  scuba_diving_gear  scuba diving equipment diving gear scuba equipment scuba gear scuba shop
cusinart machine  bread_machines     bread maker bread machines cusinart bread maker bread machine reviews bread machine recipes
off road bakkie   4x4                jeep 4x4 jeep isuzu 4x4 toyota 4x4 hilux 4x4 bakkie nissan 4x4 4x4 off road
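In production such an inverted index would typically live in a search engine; as an offline sketch, the rank_bm25 package (my choice, not necessarily what OLX used) gives the same BM25 matching over the query documents from Step 2:

from rank_bm25 import BM25Okapi  # pip install rank-bm25

ids = list(query_documents)                          # head-query ids
corpus = [query_documents[i].split() for i in ids]   # tokenized documents
bm25 = BM25Okapi(corpus)

def match_tail_query(tail_query):
    # head-query id whose document best matches the tail query's tokens
    scores = bm25.get_scores(tail_query.split())
    return ids[max(range(len(ids)), key=scores.__getitem__)]

# match_tail_query("diving equipment")  -> "scuba_diving_gear"
# match_tail_query("off road bakkie")   -> "4x4"   (expected, per the table)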
Offline Analysis with holdout data
[Diagram: queries ordered by frequency are split 90% / 10%; the 90% head is used to build the matching index and the query vectors, the 10% holdout is used for testing the matching. For each holdout query, cosine similarity is computed between its own vector Vq-context (e.g. 0.3 1.3 6.2 0.5 3.1) and the vector Vq-index of the index’s top result (e.g. 0.2 1.4 7.2 0.6 6.1); a sketch of this check follows.]
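The holdout check then boils down to one cosine similarity per held-out query, between its own vector and the vector of the query the index returned; a sketch with the illustrative vectors from the slide:

import numpy as np

v_q_context = np.array([0.3, 1.3, 6.2, 0.5, 3.1])  # held-out query's vector
v_q_index   = np.array([0.2, 1.4, 7.2, 0.6, 6.1])  # vector of the index's top result

cos = float(v_q_context @ v_q_index) / (
    np.linalg.norm(v_q_context) * np.linalg.norm(v_q_index))
print(cos)  # ~0.97: the BM25 match preserved the query's meaning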
Tail Queries Solution
Numbers, please!
We launched to production!!!
Example searches: “playstation 4”, “peugot”, “ktm 900”
8.7% -> 0.6%
Searches with no results
+13.5%**
Increase in new contacts per day
Thank you very much!
Questions?
Possible extensions
Include more entities in the sessions:
● Listings the user interacted with
● Categories the user browsed
● Locations the user searched for / interacted with
Meta-prod2vec:
Add side information while generating pairs.
Meta-Prod2Vec - Product Embeddings Using Side-Information for Recommendation
