1. The document discusses embedding-based retrieval techniques for search ranking systems. It outlines approaches for training embeddings using triplet loss and hard negative mining, and serving embeddings using approximate nearest neighbor techniques like product quantization.
2. When serving, techniques like product quantization are used to efficiently find the top-K nearest neighbors from large embedding spaces. This improves efficiency over brute force approaches.
3. Major companies like Facebook, Alibaba, and LinkedIn are developing next-generation systems using cross-attention and cost-aware models to further improve ranking performance while maintaining query efficiency.
4. ● Full pipeline composed of multiple stages
○ Matching / Pre-ranking: focuses on reducing the search space for later stages by dropping irrelevant
candidates while guaranteeing high recall (ensure all positive samples are included).
○ Ranking: focuses on high precision; guarantees the top-K aligns with user interest.
○ Reranking: overrides model results for customized business purposes, such as promoting
new/high-quality content or compensating for known model weaknesses.
Main concept
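The funnel described above can be sketched as follows; the scoring functions, candidate-set sizes, and item ids are made up for illustration, not taken from the slides.

```python
# Illustrative three-stage funnel; all scores and sizes are stand-ins.
def cheap_score(q, d):
    # Stand-in for a cheap matching score (e.g. term overlap).
    return -abs(q - d)

def precise_score(q, d):
    # Stand-in for an expensive learned ranking score.
    return -(q - d) ** 2

def matching(query, corpus, k=100):
    # Pre-ranking: cheap score over the whole corpus, prioritizing recall.
    return sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k]

def ranking(query, candidates, k=10):
    # Ranking: precise score over the much smaller candidate set.
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:k]

def reranking(ranked, boosted_ids):
    # Reranking: business override, e.g. promote selected items to the top.
    return sorted(ranked, key=lambda d: d in boosted_ids, reverse=True)
```

Each stage only ever sees the survivors of the previous one, which is what keeps the expensive scorers affordable.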
5. ● Matching: Recall@K, QPS (queries per second), RT (response time)
● Ranking: NDCG@K, MAP@K
● Reranking: your business objectives.
(AUC is only for quick evaluation; it does not directly align with business value.)
Evaluation according to purpose
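For concreteness, Recall@K and NDCG@K can be computed as below; this is a minimal sketch with my own helper names, not code from the slides.

```python
import math

def recall_at_k(retrieved, relevant, k):
    # Fraction of the relevant items that appear in the top-k results.
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(retrieved, relevance, k):
    # relevance maps item -> graded gain; positions get a log2 discount.
    dcg = sum(relevance.get(item, 0) / math.log2(i + 2)
              for i, item in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Recall@K ignores ordering (matching only needs the positives to survive), while NDCG@K rewards putting high-gain items early, which is the ranking stage's job.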
6. ● Over-simplified here; missing the CF (collaborative filtering) family, factorization family, GCN family, etc.
● Two-tower is an architecture that prioritizes engineering concerns (speed, cost, complexity).
○ Avoids cross-attention.
○ Uses ANN (approximate nearest neighbor) search at inference, without the model.
● COLD: wants to include cross-attention; the question is how to balance performance and computing cost.
Retrieval stage evolution
(diagram: a precise-vs-cheap trade-off, with two-tower as the current mainstream)
7. ● Key idea of two-tower: train embeddings, but avoid cross-attention.
○ Query and candidate are encoded independently all the way to the last layer (which calculates similarity).
○ Running the whole model for online inference is very expensive; we want to use only the final embeddings
online (forget about the model after training).
○ Cost: dropping cross-attention means sacrificing some performance.
● Note: the JEM team deployed BERT + cross-attention since their data volume is small.
● Alibaba COLD tried to reduce cost while keeping cross-attention.
○ Benchmarked two-tower, COLD, and their Deep Interest Network.
Why avoid cross-attention?
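A minimal sketch of the two-tower serving idea; the random tower weights below stand in for trained encoders. Documents are embedded offline once, and online inference is just a dot product against the query embedding, with no model applied to the documents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tower weights; in practice each tower is a deep encoder,
# trained jointly and then frozen (only the embeddings are served).
W_query = rng.normal(size=(8, 4))   # query features -> 4-d embedding
W_doc = rng.normal(size=(8, 4))     # doc features  -> 4-d embedding

def encode(x, W):
    e = np.tanh(x @ W)              # independent encoding, no cross-attention
    return e / np.linalg.norm(e)    # normalize so dot product = cosine

# Offline: pre-compute document embeddings once.
docs = rng.normal(size=(100, 8))
doc_emb = np.stack([encode(d, W_doc) for d in docs])

# Online: encode the query, take dot products, keep the top-K.
query_emb = encode(rng.normal(size=8), W_query)
scores = doc_emb @ query_emb
top_k = np.argsort(-scores)[:5]
```

Because the two sides never interact before the final similarity, the document side collapses to a lookup table, which is exactly what makes ANN indexing possible.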
8. Why Embedding?
From “syntactic matching” to “semantic matching”.
1. Factorized features
○ Features shouldn’t be naively one-hot encoded as black-or-white; they have implicit
relationships in a high-dimensional space (the Factorization Machine concept).
○ For a linear model, even with quadratic features like <xi*yi>: if the pair rarely or
never occurs in the training set, the linear model cannot learn a weight for it.
An FM or embedding can still learn the relationship through indirect paths like x-z-y.
2. Fuzzy text match
○ Matches the query "kacis creations" to “Kasie’s creations”, which term-based
matching cannot.
3. Personalization
○ User embeddings naturally enable personalized matching results.
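The factorization idea can be made concrete with the second-order Factorization Machine score; the parameters below are random placeholders. Because every pair interaction is scored via the dot product of shared factor vectors, a pair that never co-occurred in training still gets a sensible score.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical FM parameters: 5 features, 3-d factor vectors.
n_features, k = 5, 3
w0 = 0.1
w = rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))   # one factor vector per feature

def fm_score(x):
    # Second-order FM: pairwise interactions scored via <v_i, v_j>.
    linear = w0 + w @ x
    # O(nk) identity for sum_{i<j} <v_i, v_j> x_i x_j
    s = V.T @ x
    pairwise = 0.5 * (s @ s - ((V ** 2).T @ (x ** 2)).sum())
    return linear + pairwise
```

The O(nk) rewrite of the pairwise sum is what makes FMs practical at scale, compared with materializing all quadratic cross features.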
12. ● A good review of the main concept (nice animation by Google):
○ In the two-tower setup, queries and database items are mapped into a shared embedding space.
○ The model responds to natural-language queries.
Review: main concept of using embeddings
13. 1. “Unified” Embedding (by Facebook EBR)
○ Two-sided model: one side encodes the search request (character n-grams), the other side encodes the document.
○ Other features (social/location) are included in the encoder input (thus called “unified”).
2. Triplet loss: keeps enlarging the distance
between positive query-doc pairs and negative ones.
○ Has a margin term m to tune, which is both good and bad.
○ (?) Why not just use “clicked” as positive
and “seen but not clicked” as negative?
○ (?) Slower to converge.
Best practice as of today (2020)
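A minimal triplet-loss sketch over embeddings, using squared Euclidean distance; the margin value is arbitrary. The hinge is zero once the positive document is at least m closer to the query than the negative one.

```python
import numpy as np

def triplet_loss(q, pos, neg, m=0.2):
    # Hinge on squared distances: push the positive doc at least
    # margin m closer to the query than the negative doc.
    d_pos = np.sum((q - pos) ** 2)
    d_neg = np.sum((q - neg) ** 2)
    return max(0.0, d_pos - d_neg + m)
```

In training, this is averaged over sampled (query, positive, negative) triplets, which is why the negative-sampling strategy on the next slides matters so much.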
14. Data sampling is crucial
1. Using “clicked” vs. “seen” as the positive sample performs equally well in online tests.
○ Since “seen” items were already chosen by the ranking stage, it is fine to choose the same in the retrieval stage.
2. Hard negative mining
○ Online: choose K docs from other positive query-doc pairs as hard negatives (K=2 works best).
○ Offline: choose results ranked 101~500 in historical SERPs as hard negatives.
3. Random negatives work better than hard negatives only (seen but not clicked)!
○ Hypothesis: a model that focuses too much on hard negatives loses the ability to handle obvious ones.
(Example: if all hard negatives share the anchor job’s location, the model concludes location is
unimportant, which is obviously wrong.)
○ Also, the random-sample distribution aligns with the serving distribution.
○ Best practice (two styles):
■ random negatives : hard negatives = 100:1
■ Transfer learning: train on hard negatives first, then on random negatives.
15. Hard Negative Mining
1. Facebook, in “Embedding-based Retrieval in Facebook Search”
○ Online hard negatives
■ Choose K docs from other positive query-doc pairs as hard negatives (K=2 works best).
○ Offline hard negatives
■ Choose results ranked 101~500 in historical SERPs as hard negatives.
2. Airbnb, in “Real-time Personalization using Embeddings for Search Ranking at Airbnb”
○ Randomly sample items in the same location as the positive samples as hard negatives.
○ Add “rejected by room owner” as hard negative samples.
16. Embeddings everywhere
1. The query is converted to an embedding.
2. Documents are indexed with embeddings.
3. The retrieval stage uses embeddings,
and also passes them to the ranking model
to keep ranking aligned with retrieval
(avoiding the Matthew effect).
Engineering questions
1. How often are embeddings re-trained/updated?
2. What are the details of embedding-based indexing?
Facebook search ranking system
17. Other topics
● Matthew Effect
○ Current ranking stages are designed for existing retrieval scenarios
=> the ranker won’t agree with a new retrieval algorithm; it rejects its candidates (no
impressions) or gives them poor positions (hard to be seen).
○ Solution: the ranking model uses retrieval-stage embeddings as features, so it
can learn from the new signal. (Facebook: empirically, just adding the query-item
cosine similarity as a ranking-model feature works.)
● Embedding ensemble: weighted concatenation
○ Cascade multiple embeddings trained for different purposes
(each embedding focuses on one specific purpose, just like multi-channel retrieval).
○ Alibaba COLD spends effort on choosing the best embeddings.
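One reason weighted concatenation plays well with dot-product retrieval: if the ensemble weights are applied to the query-side chunks only, a single dot product over the concatenated vector equals the weighted sum of the per-embedding scores. A toy sketch (weights and dimensions are illustrative):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_concat(chunks, weights=None):
    # Concatenate several purpose-specific embeddings into one vector,
    # optionally scaling each chunk by its ensemble weight.
    if weights is None:
        weights = [1.0] * len(chunks)
    out = []
    for chunk, w in zip(chunks, weights):
        out.extend(w * x for x in chunk)
    return out

q = weighted_concat([[1.0, 0.0], [0.0, 2.0]], weights=[0.7, 0.3])  # query side carries the weights
d = weighted_concat([[0.5, 0.5], [1.0, 1.0]])                      # doc side stays unweighted
print(dot(q, d))  # 0.7 * 0.5 + 0.3 * 2.0 ≈ 0.95
```

Keeping the doc side unweighted means the index never has to be rebuilt when the ensemble weights are re-tuned.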
19. ● The two towers encode embeddings independently, so the inference (serving) stage
no longer needs the model.
● The only task at serving time: find the top-K nearest neighbors.
● The brute-force way scans every item (O(N·d) per query); how do we reduce this?
Serving: main challenge
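For reference, the brute-force baseline is just a full scan over all N items; a minimal sketch with toy 2-dim embeddings:

```python
import heapq

def brute_force_topk(query, items, k):
    # Score every item against the query and keep the k best: O(N * d) per query.
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return heapq.nlargest(k, range(len(items)), key=lambda i: dot(query, items[i]))

top = brute_force_topk([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]], k=2)
print(top)  # [1, 2]: indices of the two items most similar to the query
```

Every ANN technique on the next slide trades a little recall against this exact scan for a large speed-up.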
20. Serving: ANN (Approximate Nearest Neighbor search)
1. Tree based
○ KD-tree: good for low-dimensional embeddings, but in high dimensions it is no better than brute force.
○ For high dimensions, use hash-based or vector-quantization methods (the following two categories).
2. Hash based
○ LSH (Locality-Sensitive Hashing)
○ This category works well below ~10 million vectors.
○ Open source: FALCONN, Annoy (by Spotify), NMSLIB (used by AWS Elasticsearch; best as of 2019).
3. Vector quantization
○ Mainstream for hundred-million-scale data. Product quantization is the best practice. (Deep dive
in the following slides.)
○ Open source: FAISS (by Facebook), ScaNN (by Google).
4. Others
○ Milvus: open-source vector similarity search engine; use it like a database.
○ NGT (by Yahoo): best in some benchmarks.
○ NSG (by Alibaba-Taobao)
All the giants develop their own embedding+ANN stacks; serving faster without losing precision is the key!
21. ● Benchmark references: (1 / 2 / 3)
○ NMSLIB is the best among hash-based algorithms.
○ FAISS speeds up with GPU, and ScaNN further improves performance (recall@10).
(All new algorithms claim to beat NMSLIB, so at this moment (2020) NMSLIB may still be the most stable choice
if you don’t trust new things.)
Serving: ANN (Approximate Nearest Neighbor search)
FAISS (vector quantization) is
much faster than NMSLIB with GPU
ScaNN (vector quantization) is now
best in both performance & speed
22. Serving: Product Quantization
(Note that before product quantization there is “coarse quantization”: cluster with k-means and pick clusters.)
1. Say you have 50K jobs, each represented as a 1024-dim embedding.
2. Break the 1024-dim embedding vector into 8 chunks of 128 dims each.
3. Encode each chunk into one of 256 (8-bit) groups, each group represented by its centroid.
23. 1. When calculating all 50K distance(query, item) pairs:
○ Prepare a lookup table of the 256 distance(query, centroid) values per chunk.
○ Then each of the 50K distance(query, item) values = SUM of 8 distance(query_chunk, centroid_i) lookups.
2. Computation reduced ~a thousand times:
○ From: a 1024-dim root-mean-square distance
○ To: a SUM of 8 lookup-table values
3. Memory reduced ~500 times:
○ 4096 bytes of floats => 8 bytes of ids
Serving: Product Quantization
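The lookup-table trick can be illustrated with a tiny pure-Python sketch. The sizes here are toy values (2 chunks with 2 centroids each instead of 8 chunks × 256 centroids), and the codebooks are hand-written rather than learned with k-means:

```python
# Toy codebooks: one list of centroids per 2-dim subspace.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],  # centroids for chunk 0 (dims 0-1)
    [[0.0, 1.0], [1.0, 0.0]],  # centroids for chunk 1 (dims 2-3)
]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def encode(vec):
    # Replace each chunk of the vector by the id of its nearest centroid.
    return [min(range(len(book)), key=lambda c: sq_dist(vec[2*m:2*m+2], book[c]))
            for m, book in enumerate(CODEBOOKS)]

def adc_table(query):
    # Per query, precompute distance(query_chunk, centroid) for every centroid.
    return [[sq_dist(query[2*m:2*m+2], cent) for cent in book]
            for m, book in enumerate(CODEBOOKS)]

def approx_dist(table, code):
    # Distance to any encoded item is just one table lookup per chunk.
    return sum(table[m][c] for m, c in enumerate(code))

code = encode([0.9, 1.1, 0.1, 0.9])   # -> [1, 0]
table = adc_table([1.0, 1.0, 0.0, 1.0])
print(approx_dist(table, code))       # 0.0: the query sits exactly on those centroids
```

The table costs one pass over the centroids per query; after that, scoring each of the 50K items is just 8 additions.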
24. Serving: Other techniques
● Coarse quantization (inverted file index)
○ Cluster all items into groups; only search the top-K groups whose centroids are closest to the query.
○ After coarse quantization, do product quantization to choose the final candidates.
● Residual encoding
○ After vectors are grouped, quantize the residual instead of the original embedding vector to improve
resolution after quantization. (It removes the offset and centers the vectors at the origin, as in the left figure.)
○ Note that the query vector has a different residual for each group, since each group has a different centroid.
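Residual encoding itself is a one-liner; a sketch, with toy values chosen so the arithmetic is exact:

```python
def residual(vec, centroid):
    # After coarse quantization assigns `vec` to a group, PQ encodes
    # vec - centroid instead of vec: residuals cluster around the
    # origin, so the same codebook resolution covers a smaller range.
    return [v - c for v, c in zip(vec, centroid)]

r = residual([1.5, 0.75], [1.0, 0.5])
print(r)  # [0.5, 0.25]
```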
26. TL;DR
(If you can only remember two things today, here they are):
1. In the training stage,
find the best way to compress all knowledge into
user/item embeddings.
2. In the serving stage,
find the fastest/cheapest way to
find the nearest neighbors.
28. Alibaba’s latest best practices (both expensive and high engineering effort, just FYI):
● Retrieval: COLD (Cost-aware, Online, Lightweight Deep pre-ranking)
○ Eager to capture the performance gains from cross-attention.
○ Feature selection to reduce computing cost (avoid assembling too many
embeddings).
○ In short, choose the ensembled embeddings with the best AUC while maintaining
acceptable QPS (queries per second) and RT (response time).
○ Also takes a lot of engineering effort on speed-up and cost reduction.
● Ranking: DIEN (Deep Interest Evolution Network)
○ Rather than synthesizing the user embedding from the “latest K clicked items,” use attention to
select the “latest K relevant clicked items.”
○ Drawback: the user embedding has to be synthesized online.
“Maybe” next generation: Ali COLD+DIEN
29. Deep Interest Network (DIN)
● Base model (left): trained on the user_vec and item_vec of the latest K clicked items.
● DIN (right): the currently viewed item decides the attention weights over the latest K clicked items.
● Note that everything (Goods/Shop/Category/Other) is an embedding rather than a one-hot coding.
○ Everything is factorized, not just black-or-white (0/1).
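The DIN idea above, a user vector built as an attention-weighted sum of the latest K clicked-item embeddings, can be sketched as follows. Real DIN learns the attention weight with a small MLP; the dot-product-plus-softmax here is a simplified stand-in:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def din_user_vector(candidate, clicked_items):
    # Weight each clicked item by its similarity to the candidate item,
    # then sum: history relevant to the current candidate dominates.
    weights = softmax([sum(a * b for a, b in zip(candidate, item))
                       for item in clicked_items])
    dim = len(candidate)
    return [sum(w * item[i] for w, item in zip(weights, clicked_items))
            for i in range(dim)]

u = din_user_vector([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# The clicked item aligned with the candidate gets the larger weight.
```

Because the weights depend on the candidate item, this user vector cannot be precomputed once per user, which is exactly the "synthesized online" drawback noted on the previous slide.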
30. An alternative future: the user-interest capsule
● The latest (since 2019) user-vector technique, MIND (Multi-Interest Network with Dynamic routing) at Alibaba:
○ DIN attends to items; why not also attend to multiple user interests? It is naive to model a user as a
single user-interest vector.
○ We could indeed skip this, since job seekers rarely have multiple career interests.
32. ● DeepFM vs. unified embedding
● Two-tower or Siamese
● Character-level n-grams
● Triplet loss
● Random negatives + hard negative mining (100:1)
● Residual encoding
● Weighted concatenation of embeddings
● Multitask
Review: tricks worth trying
33. Impact size: Facebook, Tencent
1. Facebook EBR
a. Location features and social embeddings help a lot! (Don’t forget domain-specific data!)
2. Tencent ranking model (CTR)
a. Naive DNN: AUC = 0.7618
b. Multi-task (CTR + Favorite + Like…) DNN: AUC = 0.7678 (+0.6%)
c. DeepFM: AUC +0.2%
d. Last View + DIN: AUC +0.2%
e. Last Display + GRU (?): AUC +0.4%
34. Trade-off between performance (recall) and computing cost: strike a balance between
vector-product-based models and full DIN.
Impact size: Alibaba