Duplicates everywhere (Berlin)

Duplicates everywhere
Alexey Grigorev
Berlin Machine Learning
2019.08.05

https://www.slideshare.net/AlexeyGrigorev/avito-duplicate-ads-detection-kaggle

Cross-Device linking competition
Using clickstream data, find logs that belong to the same user
https://www.slideshare.net/AlexeyGrigorev/cikm-cup-2016-crossdevice-linking

Disclaimer
Not a presentation of the duplicate detection system at OLX

Record Linkage vs Duplicates
● Record linkage: Schema 1, Schema 2, Schema 3 → Unified schema
● Duplicates: restoring the duplicates graph!

Duplicates
For each rec i find duplicates {rec i1, …, rec ik} from the set of n records
ID     F1   F2   ...  Fm
Rec 1  f11  f12  ...  f1m
Rec 2  f21  f22  ...  f2m
...    ...  ...  ...  ...
Rec n  fn1  fn2  ...  fnm

ML for Duplicates
● Compare each pair with each?
● 1000 items => 1000 x 999 / 2 = 499 500 pairs
● Real datasets: millions! (avito: 51mln, olx.ua: 11mln)
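The pair count grows quadratically with the number of records. A minimal sketch of the arithmetic, applied to the dataset sizes quoted above (`count_pairs` is an illustrative name, not from the talk):

```python
def count_pairs(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for name, n in [("toy example", 1_000), ("olx.ua", 11_000_000), ("avito", 51_000_000)]:
    print(f"{name}: {n:,} records => {count_pairs(n):,} pairs")
```

For 1000 records this gives the 499 500 pairs from the slide; for tens of millions of records the count is in the order of 10^13 to 10^15, which is why comparing everything with everything is not an option.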

ML for Duplicates
● Graph is very sparse!
● Don’t need to compare everything with everything
Reality

ML for Duplicates
First step: candidate selection
Idea:
● First, find candidate duplicates (10-200)
● Then, get real duplicates (0-50)
For each rec i find duplicates {rec i1, …, rec ik} from the set of n records, k = 0..50 items

Duplicate detection framework
● Step 1: Candidate selection step: find candidate duplicates (10-200)
● Step 2: Candidate scoring step: get real duplicates (0-50)

Step 1: Candidate selection step
How:
● Domain knowledge (heuristics)
● Information retrieval techniques
● Approximate kNN

Domain knowledge
Candidates share the same
● Category (iphone)
● City (Birobidzhan) / district
● Seller id
● IP address of the seller
● Device signature

Domain knowledge
● Seller id
Easy to implement in any RDB!
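One way this could look in a relational database is a self-join on the shared key. The sketch below uses an in-memory SQLite database; the `ads` table, its columns and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (id INTEGER PRIMARY KEY, seller_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO ads (id, seller_id, title) VALUES (?, ?, ?)",
    [
        (1, 10, "iphone xs"),
        (2, 10, "iphone xs 64gb"),
        (3, 11, "bmw x5"),
        (4, 10, "samsung s9"),
    ],
)

# Self-join on the blocking key; a.id < b.id keeps each unordered pair once
# and excludes pairing an ad with itself.
candidate_pairs = conn.execute(
    """
    SELECT a.id, b.id
    FROM ads AS a
    JOIN ads AS b
      ON a.seller_id = b.seller_id AND a.id < b.id
    ORDER BY a.id, b.id
    """
).fetchall()

print(candidate_pairs)  # [(1, 2), (1, 4), (2, 4)]
```

The same join works for any other exact-match key from the list (category, city, IP address, device signature). An index on the key column keeps it cheap.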

Step 2: Candidate scoring step
Machine Learning!

Training data
ID1 ID2 Label
1 2 1
1 4 0
2 7 0
k 5 1

Machine Learning
Labeled pairs:
ID1  ID2  Label
1    2    1
1    4    0
2    7    0
k    5    1
After feature engineering:
ID1  ID2  Features         Label
1    2    [0, 1, ..., 5]   1
1    4    [2, 0, ..., 3]   0
2    7    [3, 1, ..., 3]   0
k    5    [5, 3, ..., 8]   1
Model: tune F1/Recall/Precision
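Tuning for F1, recall or precision usually comes down to choosing the score threshold above which a pair is called a duplicate. A minimal sketch, assuming the model outputs a score per pair; the labels, scores and candidate thresholds below are hypothetical.

```python
def precision_recall_f1(labels, scores, threshold):
    """Binary metrics when pairs with score >= threshold are called duplicates."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels and model scores for six candidate pairs.
labels = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.6, 0.7, 0.3, 0.1]

for t in (0.2, 0.5, 0.8):
    p, r, f1 = precision_recall_f1(labels, scores, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Pick the threshold with the best F1; swap the index to optimise
# precision (0) or recall (1) instead.
best_threshold = max((0.2, 0.5, 0.8), key=lambda t: precision_recall_f1(labels, scores, t)[2])
```

A low threshold favours recall, a high one favours precision; which to prefer depends on how costly a missed duplicate is compared to a false alarm.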

Features
Mostly pairwise differences/ratios and distances/similarities
Most basic ones:
● |price1 - price2|
● min(price1, price2) / max(price1, price2)
● dist(loc1, loc2)
● same(ip1, ip2)
● same(loc_id1, loc_id2)
● same(category1, category2)
● |len(title1) - len(title2)|
● |len(images1) - len(images2)|
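The basic features above could be computed for one pair roughly as follows. This is a sketch: the record layout (dict keys such as `price`, `lat`, `images`) is assumed for illustration, and `dist` is taken to be the great-circle distance, which the slide does not specify.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def pair_features(a: dict, b: dict) -> dict:
    """Pairwise differences, ratios and equality flags for two records."""
    low, high = sorted((a["price"], b["price"]))
    return {
        "price_abs_diff": high - low,
        # Assumption: two zero prices count as a perfect match.
        "price_ratio": low / high if high else 1.0,
        "geo_dist_km": haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]),
        "same_ip": int(a["ip"] == b["ip"]),
        "same_loc_id": int(a["loc_id"] == b["loc_id"]),
        "same_category": int(a["category"] == b["category"]),
        "title_len_diff": abs(len(a["title"]) - len(b["title"])),
        "n_images_diff": abs(len(a["images"]) - len(b["images"])),
    }


# Hypothetical records.
ad1 = {"price": 500, "lat": 52.52, "lon": 13.40, "ip": "10.0.0.1", "loc_id": 7,
       "category": 17, "title": "iphone xs", "images": ["a.jpg", "b.jpg"]}
ad2 = {"price": 450, "lat": 52.52, "lon": 13.40, "ip": "10.0.0.1", "loc_id": 7,
       "category": 17, "title": "iphone xs 64gb", "images": ["a.jpg"]}

features = pair_features(ad1, ad2)
print(features)
```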

Text features
Create a vector representation of text
● Bag of words
       sell   pixel  iphone  samsung  xs   s9
doc 1  1      1      0       0        0    0
doc 2  0      0      1       0        1    0
● TF-IDF
       sell   pixel  iphone  samsung  xs   s9
doc 1  0.001  0.1    0       0        0    0
doc 2  0      0      0.2     0        0.8  0
● Character N-Grams
       sel  ell  iph  pho  hon  one
doc 1  1    1    1    1    1    1
doc 2  0    0    0    1    1    1

Text features: cosine similarity
https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/eb9cd609-e44a-40a2-9c3a-f16fc4f5289a.xhtml
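A small sketch of character trigram vectors compared with cosine similarity. The two texts, "sell iphone" and "phone", are a guess at what produced the trigrams shown in the n-gram table; n-grams are taken inside each word, which is one possible convention.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams inside each whitespace-separated word."""
    grams = Counter()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return grams


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


v1 = char_ngrams("sell iphone")  # sel, ell, iph, pho, hon, one
v2 = char_ngrams("phone")        # pho, hon, one
similarity = cosine_similarity(v1, v2)
print(round(similarity, 4))  # 0.7071
```

The same `cosine_similarity` applies unchanged to bag-of-words or TF-IDF vectors; only the way the vectors are built differs.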

Word2Vec
Word vectors (not document vectors!), e.g. for iphone, samsung, bmw, toyota

Word2Vec features
How to compare documents?
● Title1: “used bmw”
● Title2: “selling almost new bmw”
Pairwise word similarities (rows: Title1, columns: Title2):
      selling  almost  new   bmw
used  0.3      0.1     0.6   0.5
bmw   0.2      0.1     0.55  1
Aggregated features:
min  mean  max  std
0.1  0.41  1.0  0.28

Word2Vec features part 2
How to compare documents?
● Title1: “used bmw”
● Title2: “selling almost new bmw”
Pairwise word similarities (rows: Title1, columns: Title2):
      selling  almost  new   bmw
used  0.3      0.1     0.6   0.5
bmw   0.2      0.1     0.55  1
Aggregated features:
min  mean  max  std
0.1  0.33  0.6  0.20
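Both sets of statistics can be reproduced from the similarity table. In the sketch below the first variant aggregates all word pairs; the second drops the exact bmw/bmw match. That second rule is an assumption: it is not stated on the slide, but it is consistent with the numbers shown (the slide values appear to be truncated rather than rounded).

```python
from statistics import mean, pstdev

# Pairwise word similarities from the slides: keys are the words of Title1,
# values are similarities to "selling", "almost", "new", "bmw" of Title2.
sims = {
    "used": [0.3, 0.1, 0.6, 0.5],
    "bmw": [0.2, 0.1, 0.55, 1.0],
}


def summarize(values):
    """min / mean / max / population standard deviation of the similarities."""
    return {
        "min": min(values),
        "mean": mean(values),
        "max": max(values),
        "std": pstdev(values),
    }


all_pairs = [s for row in sims.values() for s in row]
# Assumption: the second variant ignores exact word matches (similarity 1.0).
without_exact = [s for s in all_pairs if s < 1.0]

part1 = summarize(all_pairs)
part2 = summarize(without_exact)
print({k: round(v, 3) for k, v in part1.items()})
print({k: round(v, 3) for k, v in part2.items()})
```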

Hashes
● md5: cryptographic hash
● dhash, phash, whash: perceptual hashes
Image 1: perceptual hash 94088af86c038327, md5 14ee7fe587860078a1109033318bd986
Image 2: perceptual hash 94088af86c038327, md5 07aaedb9b75e88a6051184f01be5cc50

Dhash: difference hash
https://www.kaggle.com/iezepov/get-hash-from-images-slightly-daster/code
Read as b/w image, resize
Get numpy array
Difference between adjacent cells

Image 1, grayscale values (8 rows x 9 columns):
149  168  145  131  134  111  115  114  108
198  192  162  135  104  137  128  108   97
158  165  151  117  111  133  130  139  115
 79   95  132  151  180  212  189  158  124
 91   47   57   90   67   81  165  142  110
104   80   63   53   43   34   20   42  101
110  113  109   92   79   53   27   59  102
114  111  110  112  108   67   73   90  103
Image 2: identical except the first value of the last row:
141  111  110  112  108   67   73   90  103

Differences between horizontally adjacent cells (right minus left), image 1:
 19  -23  -14    3  -23    4   -1   -6
 -6  -30  -27  -31   33   -9  -20  -11
  7  -14  -34   -6   22   -3    9  -24
 16   37   19   29   32  -23  -31  -34
-44   10   33  -23   14   84  -23  -32
-24  -17  -10  -10   -9  -14   22   59
  3   -4  -17  -13  -26  -26   32   43
 -3   -1    2   -4  -41    6   17   13
Image 2: identical except the first value of the last row:
-30   -1    2   -4  -41    6   17   13

Difference > 0 (8 rows x 8 columns):
TRUE   FALSE  FALSE  TRUE   FALSE  TRUE   FALSE  FALSE
FALSE  FALSE  FALSE  FALSE  TRUE   FALSE  FALSE  FALSE
TRUE   FALSE  FALSE  FALSE  TRUE   FALSE  TRUE   FALSE
TRUE   TRUE   TRUE   TRUE   TRUE   FALSE  FALSE  FALSE
FALSE  TRUE   TRUE   FALSE  TRUE   TRUE   FALSE  FALSE
FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  TRUE   TRUE
TRUE   FALSE  FALSE  FALSE  FALSE  FALSE  TRUE   TRUE
FALSE  FALSE  TRUE   FALSE  FALSE  TRUE   TRUE   TRUE

Each row of 8 bits read as one byte:
row  decimal  hex
1    148      94
2    8        08
3    138      8a
4    248      f8
5    108      6c
6    3        03
7    131      83
8    39       27
Resulting hash: 94088af86c038327
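The dhash steps can be checked against the numbers on the slides. This sketch starts from the already resized 8 x 9 grayscale grid (reading and resizing the image is left out) and reproduces the hash; the function name is illustrative.

```python
# Grayscale values from the slide: 8 rows x 9 columns.
IMAGE_1 = [
    [149, 168, 145, 131, 134, 111, 115, 114, 108],
    [198, 192, 162, 135, 104, 137, 128, 108, 97],
    [158, 165, 151, 117, 111, 133, 130, 139, 115],
    [79, 95, 132, 151, 180, 212, 189, 158, 124],
    [91, 47, 57, 90, 67, 81, 165, 142, 110],
    [104, 80, 63, 53, 43, 34, 20, 42, 101],
    [110, 113, 109, 92, 79, 53, 27, 59, 102],
    [114, 111, 110, 112, 108, 67, 73, 90, 103],
]
# Second image: only the first value of the last row differs (141 instead of 114).
IMAGE_2 = [row[:] for row in IMAGE_1]
IMAGE_2[-1][0] = 141


def dhash_hex(pixels):
    """Difference hash: one bit per horizontally adjacent pair, one byte per row."""
    hex_bytes = []
    for row in pixels:
        byte = 0
        for left, right in zip(row, row[1:]):
            # The bit is 1 when brightness increases from left to right.
            byte = (byte << 1) | int(right > left)
        hex_bytes.append(f"{byte:02x}")
    return "".join(hex_bytes)


print(dhash_hex(IMAGE_1))  # 94088af86c038327
print(dhash_hex(IMAGE_2))  # 94088af86c038327
```

Both images get the same hash because the changed pixel does not flip the sign of any difference, whereas an md5 of the two files would differ completely.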

Features: hashes
● Number of images with same md5, phash, dhash, etc
● Distances between hashes
94088af86c038327
94088af86c038328
Last 16 bits of each hash:
1 0 0 0 0 0 1 1 0 0 1 0 0 1 1 1
1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0
Distance is 4 bits
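The distance here is the Hamming distance: the number of bit positions where the two hashes differ. A minimal sketch (the function name is illustrative):

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 exactly where the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


print(hamming_distance("94088af86c038327", "94088af86c038328"))  # 4
```

Note that the two hex strings differ in a single character while the bit distance is 4, so character-level and bit-level distances are not interchangeable.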

Matrix of hash distances:
9   4   14
40  45  35
Reshape: 9 4 14 40 45 35
Stats:
min  mean  max  std
4    24.5  45   16.02
Remove pictures with same hash!

Step 1: Candidate selection step
Hashes are useful here too!

Candidate selection step
● Seller id
● Image hash

Information retrieval
● Previously: exact matches
● Instead, do inexact matches with IR
IR:
● Documents D = {d1, …, dn}
● Query qi
● Find {di1, …, dik} relevant to qi

Step 1: Candidate selection step with IR
Documents D = {rec 1, …, rec n}:
● Titles
● Descriptions
Query qi:
● Title/description of rec i

Implementation Details
In Lucene/Solr/Elasticsearch we can use “more like this” queries
Indexed document:
{
  "_id": "cafebabe",
  "_source": {
    "title": "selling iphone",
    "description": "new phone, barely used",
    "phone": "+48 012 131 1212",
    "ip": "127.0.0.1",
    "category": 17,
    "lat": 10,
    "lon": 15,
    "hashes": ["0f4c", "1df0", "5f04"]
  }
}
Query:
{
  "query": {
    "more_like_this": {
      "like": {
        "_index": "listings",
        "_type": "_doc",
        "_id": "cafebabe"
      },
      "max_query_terms": 100,
      "fields": ["title", "description", "ip^2", "hashes"]
    }
  }
}
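When the query is issued per record, the body can be generated. The helper below is a sketch that only builds the JSON body; sending it to the cluster (with a client library or plain HTTP) is not shown. The helper name is made up, and `_type` is left out on the assumption of a recent Elasticsearch version without mapping types.

```python
import json


def more_like_this_body(index, doc_id, fields, max_query_terms=100):
    """Body of a more_like_this query that uses an indexed document as the example."""
    return {
        "query": {
            "more_like_this": {
                "like": {"_index": index, "_id": doc_id},
                "max_query_terms": max_query_terms,
                "fields": list(fields),
            }
        }
    }


# Index name, document id and field list are copied from the slide.
body = more_like_this_body(
    "listings", "cafebabe", ["title", "description", "ip^2", "hashes"]
)
print(json.dumps(body, indent=2))
```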

Candidate selection: images
● Previously: exact match
● What if we wanted to do inexact?
94088af86c038327
94088af86c038328

Fuzzy query (Elasticsearch):
{
  "query": {
    "fuzzy": {
      "hash": {
        "value": "94088af86c038327"
      }
    }
  }
}

Chunk the hash:
"94088af86c038327" => "1:9408 2:8af8 3:6c03 4:8327"
{
  "_id": "cafebabe",
  "_source": {
    "title": "selling iphone",
    "description": "new phone, barely used",
    "hashes": ["1:9408 2:8af8 3:6c03 4:8327", ...]
  }
}
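The chunking itself is a few lines. In this sketch each chunk is prefixed with its position so that a chunk only matches the same position in another hash; the function name and the chunk size parameter are illustrative.

```python
def chunk_hash(hex_hash: str, chunk_size: int = 4) -> str:
    """Split a hex hash into position-tagged chunks separated by spaces."""
    chunks = [hex_hash[i:i + chunk_size] for i in range(0, len(hex_hash), chunk_size)]
    return " ".join(f"{pos}:{chunk}" for pos, chunk in enumerate(chunks, start=1))


a = chunk_hash("94088af86c038327")
b = chunk_hash("94088af86c038328")
print(a)  # 1:9408 2:8af8 3:6c03 4:8327
print(b)  # 1:9408 2:8af8 3:6c03 4:8328

# Near-identical hashes still share most tokens, so an ordinary
# full-text match on the chunked field can retrieve them.
shared = set(a.split()) & set(b.split())
print(len(shared))  # 3
```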

Image embeddings
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
“Image embeddings”

Image embeddings
Image → CNN → embedding (dim 2k+) → SVD → embedding (dim 100) → ???

Embedding index
● Numpy arrays
● Do X.dot(query) (becomes slow as it grows)
● Client aggregates
● Approximate KNN! LSH techniques
● Almost the same as X.dot(query) but faster
● Many implementations: FAISS, Annoy, etc
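The exact variant (score every stored vector, let the client merge the results from several shards) can be sketched as below. It is a pure-Python stand-in for `X.dot(query)` on Numpy arrays; the shard layout, ids and vectors are hypothetical.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(shard, query, k=2):
    """Exact search in one shard: score every stored vector, keep the best k."""
    scores = ((dot(vec, query), item_id) for item_id, vec in shard.items())
    return heapq.nlargest(k, scores)


def search_shards(shards, query, k=2):
    """Client-side aggregation: query every shard, then merge the partial top-k lists."""
    partial = [hit for shard in shards for hit in top_k(shard, query, k)]
    return heapq.nlargest(k, partial)


# Hypothetical unit-length embeddings, split across two shards.
shards = [
    {"img_a": [1.0, 0.0], "img_b": [0.6, 0.8]},
    {"img_c": [0.0, 1.0]},
]
hits = search_shards(shards, [0.8, 0.6], k=2)
print([item_id for _, item_id in hits])  # ['img_b', 'img_a']
```

Every query touches every vector, so the cost grows linearly with the index size. Approximate KNN libraries trade a little accuracy to avoid that full scan.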

Embedding index
[Diagram: the index split into several FAISS shards]

How about inserts?
Too slow :-(

How about inserts?
● Historical index (FAISS, updated daily)
● Delta index (Numpy, updated in real time)
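The two-tier idea can be sketched independently of FAISS. In the toy class below both tiers are plain dictionaries searched exhaustively; in the setup described above the historical tier would be the FAISS shards and the delta tier a small Numpy array. Class and method names are made up.

```python
import heapq


class TwoTierIndex:
    """Historical tier rebuilt in batch, plus a small delta tier for fresh inserts."""

    def __init__(self):
        self.historical = {}  # stand-in for the FAISS shards (rebuilt daily)
        self.delta = {}       # stand-in for the Numpy delta index (real-time inserts)

    def insert(self, item_id, vector):
        # Cheap write: the expensive historical tier is left untouched.
        self.delta[item_id] = vector

    def rebuild(self):
        # Daily job: fold the delta into the historical tier, start a fresh delta.
        self.historical.update(self.delta)
        self.delta = {}

    def search(self, query, k=2):
        def scored(tier):
            return [(sum(x * y for x, y in zip(vec, query)), item_id)
                    for item_id, vec in tier.items()]
        # Query both tiers and merge, so fresh items are visible immediately.
        return heapq.nlargest(k, scored(self.historical) + scored(self.delta))


index = TwoTierIndex()
index.historical = {"old_ad": [1.0, 0.0]}
index.insert("new_ad", [0.0, 1.0])
print(index.search([0.0, 1.0], k=1))  # [(1.0, 'new_ad')]
```

Because the delta tier stays small, searching it exhaustively is cheap, and the slow index build only has to run in the daily batch.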

Approximate KNN: LSH
LSH: Locality sensitive hash
● Minhash: approximates jaccard similarity
● Random projections: approximates cosine similarity
X.dot(query)
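For the Minhash side, a minimal sketch: each of several seeded hash functions keeps its minimum value over the token set, and the fraction of positions where two signatures agree estimates the Jaccard similarity. Using md5 with a seed prefix as the hash family is an arbitrary choice for illustration, and the token sets are hypothetical.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """One minimum per seeded hash function; tokens must be non-empty."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{tok}".encode()).digest()[:8], "big")
            for tok in tokens
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two sets share the same minimum."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


title_a = {"selling", "almost", "new", "bmw"}
title_b = {"selling", "new", "bmw"}
estimate = estimate_jaccard(minhash_signature(title_a), minhash_signature(title_b))
exact = len(title_a & title_b) / len(title_a | title_b)
print(f"estimated={estimate:.2f} exact={exact:.2f}")
```

The estimate gets closer to the exact value as `num_hashes` grows. The signatures are fixed-length, so they can be stored and indexed like any other hash.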

Elasticsearch
● Generate once and keep at hand
● Apply to all images

Triplet Loss
https://omoindrot.github.io/triplet-loss

Make hashes, index them

TLDR:
Throw everything into Elasticsearch and let it find duplicates for you

Contact information
● http://alexeygrigorev.com & contact@alexeygrigorev.com
● https://github.com/alexeygrigorev
● https://www.linkedin.com/in/agrigorev

Vectors p and v:
v . p = projection of v onto p
v . p positive

Vectors p and u:
u . p = projection of u onto p
u . p negative

Vectors p, u, v and the vector normal to p:
sign (u . p) != sign (v . p)

Vectors p, u and v:
sign (u . p) == sign (v . p)

Random projections
● Generate m random vectors pi
● For each compute (u . pi >= 0)
● Create hash = [(u . p0 >= 0), (u . p1 >= 0), ...]
For two vectors v and u with angle theta between them
● Number of different bits ~ the angle theta
● Approximation becomes better as m grows

Vectors:
    x1     x2
u  -0.92   0.38
v  -0.61   0.78
Projection vectors and resulting bits:
 x1     x2     u bit  v bit
 0.57  -0.81   0      0
-0.97  -0.23   1      1
 0.99  -0.01   0      0
-0.86   0.50   1      1
 0.34  -0.93   0      0
 0.55   0.83   0      1
 0.85  -0.52   0      0
-0.94   0.31   1      1
 0.99   0.04   0      0
-0.66  -0.74   1      0
Hash of u: 0 1 0 1 0 0 0 1 0 1
Hash of v: 0 1 0 1 0 1 0 1 0 0
Fraction of different bits: 0.2

All hashes at once: Matrix V (rows: vectors), Matrix P (rows: projection vectors), V . P^T >= 0
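The worked example can be replayed in a few lines. The sketch uses the vectors and projection vectors from the slides; converting the fraction of differing bits into an angle uses the standard property that a random hyperplane separates two vectors with probability theta / pi.

```python
import math

# Projection vectors (rows of matrix P) and the two input vectors, from the slides.
P = [
    (0.57, -0.81), (-0.97, -0.23), (0.99, -0.01), (-0.86, 0.50), (0.34, -0.93),
    (0.55, 0.83), (0.85, -0.52), (-0.94, 0.31), (0.99, 0.04), (-0.66, -0.74),
]
u = (-0.92, 0.38)
v = (-0.61, 0.78)


def lsh_bits(vec, projections):
    """One bit per projection vector: 1 if the dot product is >= 0, else 0."""
    return [int(sum(x * p for x, p in zip(vec, proj)) >= 0) for proj in projections]


bits_u = lsh_bits(u, P)
bits_v = lsh_bits(v, P)
frac_diff = sum(a != b for a, b in zip(bits_u, bits_v)) / len(P)

# Fraction of differing bits ~ theta / pi, so the angle estimate is:
estimated_angle = frac_diff * math.pi
true_angle = math.acos(
    sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
)

print(bits_u)     # [0, 1, 0, 1, 0, 0, 0, 1, 0, 1]
print(bits_v)     # [0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
print(frac_diff)  # 0.2
print(f"estimated {math.degrees(estimated_angle):.0f} deg, "
      f"true {math.degrees(true_angle):.0f} deg")
```

With only 10 projection vectors the estimate is coarse; as noted above, it improves as m grows.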
