Duplicates everywhere
Alexey Grigorev
Berlin Machine Learning
2019.08.05
About me
https://www.slideshare.net/AlexeyGrigorev/avito-duplicate-ads-detection-kaggle
Cross-Device linking competition
Using clickstream data, find logs that belong to the same user
https://www.slideshare.net/AlexeyGrigorev/cikm-cup-2016-crossdevice-linking
Me now (Oct 2018)
Disclaimer
Not a presentation of the duplicate detection system at OLX
Back to duplicates
Record Linkage vs Duplicates
Record linkage: Schema 1, Schema 2, Schema 3 -> Unified schema
Duplicates: restoring the duplicates graph!
Duplicates
For each rec i, find duplicates {rec i1, …, rec ik} from the set of n records
ID     F1   F2   ...  Fm
Rec 1  f11  f12  ...  f1m
Rec 2  f21  f22  ...  f2m
...
Rec n  fn1  fn2  ...  fnm
ML for Duplicates
● Compare each pair with each?
● 1000 items => 1000 x 999 / 2 = 499 500 pairs
● Real datasets: millions! (avito: 51mln, olx.ua: 11mln)
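The pair count above is n * (n - 1) / 2. A minimal sketch of that arithmetic (the function name `n_pairs` is only illustrative):

```python
from math import comb

def n_pairs(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return comb(n, 2)

print(n_pairs(1000))        # 499500
print(n_pairs(11_000_000))  # pair count for an 11mln-record dataset
```

The count grows quadratically, which is why comparing every pair is not practical at these sizes.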
ML for Duplicates
● Graph is very sparse!
● Don’t need to compare everything with everything
Reality
ML for Duplicates
● First step: candidate selection
Idea:
● First, find candidate duplicates (10-200)
● Then, get real duplicates (0-50)
For each rec i, find duplicates {rec i1, …, rec ik} from the set of n records (k=0..50 items)
Duplicate detection framework
Step 1: Candidate selection step: find candidate duplicates (10-200)
Step 2: Candidate scoring step: get real duplicates (0-50)
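The two-step framework can be sketched as a small function. This is only an illustration: `find_duplicates`, the toy candidate rule, and the toy score are hypothetical names and choices, not part of the talk.

```python
from typing import Callable, Iterable, List, TypeVar

R = TypeVar("R")

def find_duplicates(
    record: R,
    select_candidates: Callable[[R], Iterable[R]],
    score: Callable[[R, R], float],
    threshold: float = 0.5,
) -> List[R]:
    """Step 1: cheap candidate selection. Step 2: score every candidate pair."""
    candidates = select_candidates(record)
    return [c for c in candidates if score(record, c) >= threshold]

# Toy data: candidates share the first word, the score is word-level Jaccard.
ads = ["iphone xs", "iphone xs new", "bmw x5"]

def same_first_word(r):
    return [a for a in ads if a != r and a.split()[0] == r.split()[0]]

def jaccard(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

dups = find_duplicates("iphone xs", same_first_word, jaccard)
print(dups)  # ['iphone xs new']
```

Only the few candidates returned by step 1 reach the more expensive scoring in step 2.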
How:
● Domain knowledge (heuristics)
● Information retrieval techniques
● Approximate knn
Domain knowledge
Candidates share the same
● Category (iphone)
● City (Birobidzhan) / district
● Seller id
● IP address of the seller
● Device signature
Easy to implement in any RDB!
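A minimal sketch of such a candidate query, using an in-memory SQLite table from Python. The table layout, the sample rows, and the exact rule (same category and city, or same seller) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ads (id INTEGER PRIMARY KEY, category TEXT, city TEXT, seller_id INTEGER)"
)
conn.executemany(
    "INSERT INTO ads VALUES (?, ?, ?, ?)",
    [
        (1, "iphone", "Birobidzhan", 10),
        (2, "iphone", "Birobidzhan", 11),
        (3, "iphone", "Moscow", 12),
        (4, "bmw", "Birobidzhan", 10),
    ],
)

# Self-join: pair up ads that share category and city, or share a seller.
candidate_pairs = conn.execute(
    """
    SELECT a.id, b.id
    FROM ads a JOIN ads b ON a.id < b.id
    WHERE (a.category = b.category AND a.city = b.city)
       OR a.seller_id = b.seller_id
    ORDER BY a.id, b.id
    """
).fetchall()
print(candidate_pairs)  # [(1, 2), (1, 4)]
```

The condition `a.id < b.id` keeps each pair once and drops self-pairs.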
Machine Learning!
Duplicate Not duplicate
Training data
ID1 ID2 Label
1 2 1
1 4 0
2 7 0
k 5 1
Machine Learning
ID1 ID2 Features Label
1 2 [0, 1, ..., 5] 1
1 4 [2, 0, ..., 3] 0
2 7 [3, 1, ..., 3] 0
k 5 [5, 3, ..., 8] 1
Training data (ID1, ID2, Label) -> Feature engineering -> Model
Tune F1/Recall/Precision
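Tuning for F1, recall, or precision usually comes down to choosing a score threshold. A minimal sketch, with made-up labels and scores:

```python
def precision_recall_f1(labels, scores, threshold):
    """Precision, recall and F1 when pairs with score >= threshold count as duplicates."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

labels = [1, 0, 0, 1]          # made-up pair labels
scores = [0.9, 0.6, 0.2, 0.4]  # made-up model scores
for t in (0.3, 0.5, 0.7):
    print(t, precision_recall_f1(labels, scores, t))
```

A lower threshold favours recall, a higher one favours precision.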
Features
Mostly pairwise differences/ratios and distances/similarities
Most basic ones:
● |price1 - price2|
● min(price1, price2) / max(price1, price2)
● dist(loc1, loc2)
● same(ip1, ip2)
● same(loc_id1, loc_id2)
● same(category1, category2)
● |len(title1) - len(title2)|
● |len(images1) - len(images2)|
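The basic features above can be sketched as one function over a pair of ads. The dictionary keys and the use of the haversine formula for dist(loc1, loc2) are assumptions for illustration.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (one possible choice for dist)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def pair_features(a: dict, b: dict) -> dict:
    hi = max(a["price"], b["price"])
    return {
        "price_abs_diff": abs(a["price"] - b["price"]),
        "price_ratio": min(a["price"], b["price"]) / hi if hi else 1.0,
        "dist_km": haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]),
        "same_ip": a["ip"] == b["ip"],
        "same_loc_id": a["loc_id"] == b["loc_id"],
        "same_category": a["category"] == b["category"],
        "title_len_diff": abs(len(a["title"]) - len(b["title"])),
        "n_images_diff": abs(len(a["images"]) - len(b["images"])),
    }

# Two made-up ads.
ad1 = {"price": 450, "lat": 48.79, "lon": 132.92, "ip": "127.0.0.1", "loc_id": 7,
       "category": 17, "title": "iphone xs", "images": ["a.jpg"]}
ad2 = {"price": 500, "lat": 48.79, "lon": 132.92, "ip": "127.0.0.1", "loc_id": 7,
       "category": 17, "title": "iphone xs new", "images": ["a.jpg", "b.jpg"]}
features = pair_features(ad1, ad2)
print(features)
```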
Text features
Create a vector representation of text
● Bag of words
● TF-IDF
● Character N-Grams
Bag of words:
sell   pixel  iphone  samsung  xs   s9
1      1      0       0        0    0
0      0      1       0        1    0
TF-IDF:
sell   pixel  iphone  samsung  xs   s9
0.001  0.1    0       0        0    0
0      0      0.2     0        0.8  0
Character n-grams:
sel  ell  iph  pho  hon  one
1    1    1    1    1    1
0    0    0    1    1    1
Text features: cosine similarity
https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/eb9cd609-e44a-40a2-9c3a-f16fc4f5289a.xhtml
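A minimal sketch of character n-gram vectors and their cosine similarity, in plain Python. The two example texts ("sell iphone" and "phone") are assumptions chosen to reproduce the n-gram columns sel, ell, iph, pho, hon, one.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams inside each whitespace-separated word."""
    grams = Counter()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return grams

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = char_ngrams("sell iphone")  # sel, ell, iph, pho, hon, one
doc2 = char_ngrams("phone")        # pho, hon, one
print(round(cosine(doc1, doc2), 3))  # 0.707
```

The same `cosine` function works for bag-of-words or TF-IDF weights stored in a dictionary.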
Word2Vec
iphone
samsung
bmw
toyota
Word vectors (not document vectors!)
Word2Vec features
How to compare documents?
● Title1: “used bmw”
● Title2: “selling almost new bmw”
selling almost new bmw
used 0.3 0.1 0.6 0.5
bmw 0.2 0.1 0.55 1
min mean max std
0.1 0.41 1.0 0.28
Word2Vec features part 2
How to compare documents?
● Title1: “used bmw”
● Title2: “selling almost new bmw”
selling almost new bmw
used 0.3 0.1 0.6 0.5
bmw 0.2 0.1 0.55 1
Stats over the pairs that remain after dropping the word shared by both titles (bmw):
min mean max std
0.1 0.33 0.6 0.20
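Both sets of statistics can be reproduced from the similarity matrix on the slide. The second set matches the values left after dropping the word shared by both titles; reading "part 2" that way is an assumption based on the numbers.

```python
from statistics import mean, pstdev

title1 = ["used", "bmw"]
title2 = ["selling", "almost", "new", "bmw"]
# Word-to-word similarities copied from the slide (rows: title1, columns: title2).
sims = [
    [0.3, 0.1, 0.6, 0.5],
    [0.2, 0.1, 0.55, 1.0],
]

def stats(values):
    return {"min": min(values), "mean": mean(values),
            "max": max(values), "std": pstdev(values)}

# Part 1: every word pair.
all_pairs = [s for row in sims for s in row]

# Part 2 (assumed): skip any pair that involves a word present in both titles.
shared = set(title1) & set(title2)
no_shared = [
    sims[i][j]
    for i, w1 in enumerate(title1)
    for j, w2 in enumerate(title2)
    if w1 not in shared and w2 not in shared
]

print(stats(all_pairs))
print(stats(no_shared))
```

The slide values appear to be truncated rather than rounded, so the printed numbers can differ in the last digit.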
Images
Hashes
● md5: cryptographic hash
● dhash, phash, whash: perceptual hashes
Image 1: dhash 94088af86c038327, md5 14ee7fe587860078a1109033318bd986
Image 2: dhash 94088af86c038327, md5 07aaedb9b75e88a6051184f01be5cc50
Dhash: difference hash
https://www.kaggle.com/iezepov/get-hash-from-images-slightly-daster/code
Read as b/w image, resize
Get numpy array
Difference between adjacent cells
Pixel values (8 rows x 9 columns):
149 168 145 131 134 111 115 114 108
198 192 162 135 104 137 128 108 97
158 165 151 117 111 133 130 139 115
79 95 132 151 180 212 189 158 124
91 47 57 90 67 81 165 142 110
104 80 63 53 43 34 20 42 101
110 113 109 92 79 53 27 59 102
114 111 110 112 108 67 73 90 103
Differences between adjacent cells (8 rows x 8 columns):
19 -23 -15 4 -23 4 -1 -6
-6 -30 -27 -31 33 -9 -20 -11
7 -14 -34 -6 22 -3 9 -24
16 37 19 29 32 -23 -31 -34
-44 10 33 -23 14 84 -23 -32
-24 -17 -10 -10 -9 -14 22 59
3 -3 -18 -13 -26 -26 32 43
-3 -4 2 -4 -41 6 17 13
Difference > 0 (8 rows x 8 columns):
TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
(A second copy on the slide has 141 instead of 114 in the last row, giving -30 instead of -3; the TRUE/FALSE pattern is unchanged.)
Each row of 8 bits becomes one byte (decimal, hex):
148 94
8 08
138 8a
248 f8
108 6c
3 03
131 83
39 27
94088af86c038327
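The steps above can be sketched in plain Python on the pixel matrix from the slide. Reading and resizing the image is left out; `dhash_hex` is an illustrative name, and the bit convention (1 when the right-hand pixel is brighter) is the one that reproduces the slide's hash.

```python
# 8 rows x 9 columns of grayscale values, copied from the slide.
pixels = [
    [149, 168, 145, 131, 134, 111, 115, 114, 108],
    [198, 192, 162, 135, 104, 137, 128, 108, 97],
    [158, 165, 151, 117, 111, 133, 130, 139, 115],
    [79, 95, 132, 151, 180, 212, 189, 158, 124],
    [91, 47, 57, 90, 67, 81, 165, 142, 110],
    [104, 80, 63, 53, 43, 34, 20, 42, 101],
    [110, 113, 109, 92, 79, 53, 27, 59, 102],
    [114, 111, 110, 112, 108, 67, 73, 90, 103],
]

def dhash_hex(rows):
    """One bit per pair of adjacent pixels, one byte (two hex digits) per row."""
    out = []
    for row in rows:
        byte = 0
        for left, right in zip(row, row[1:]):
            byte = (byte << 1) | int(right > left)
        out.append(f"{byte:02x}")
    return "".join(out)

print(dhash_hex(pixels))  # 94088af86c038327
```

Because only the sign of each difference is kept, small brightness changes usually leave the hash unchanged.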
Features: hashes
● Number of images with same md5, phash, dhash, etc
● Distances between hashes
94088af86c038327
94088af86c038328
Last 16 bits of each hash:
1 0 0 0 0 0 1 1 0 0 1 0 0 1 1 1
1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0
Hamming distance is 4 bits
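The distance between two hex-encoded hashes is the number of differing bits. A minimal sketch (the function name `hamming` is illustrative):

```python
def hamming(hash1: str, hash2: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(hash1, 16) ^ int(hash2, 16)).count("1")

print(hamming("94088af86c038327", "94088af86c038328"))  # 4
```

The two hashes differ only in the last hex digit (7 = 0111, 8 = 1000), so all four of its bits differ.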
Matrix of pairwise hash distances:
9 4 14
40 45 35
reshape: 9 4 14 40 45 35
stats:
min mean max std
4 24.5 45 16.02
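A minimal sketch of the reshape-and-stats step, applied to the distance matrix from the slide (`distance_stats` is an illustrative name; the population standard deviation is assumed):

```python
from statistics import mean, pstdev

def distance_stats(matrix):
    """Flatten a matrix of pairwise hash distances and summarise it."""
    flat = [d for row in matrix for d in row]  # the "reshape" step
    return {"min": min(flat), "mean": mean(flat),
            "max": max(flat), "std": pstdev(flat)}

print(distance_stats([[9, 4, 14], [40, 45, 35]]))
```

The four numbers give a fixed-length feature vector no matter how many images each ad has.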
Remove pictures with same hash!
Hashes are useful here too!
Candidate selection step
Candidates share the same
● Category (iphone)
● City (Birobidzhan) / district
● Seller id
● IP address of the seller
● Device signature
● Image hash
Information retrieval
● Previously: exact matches
● Instead, do inexact matches with IR
IR:
● Documents D = {d1, …, dn}
● Query qi
● Find {di1, …, dik} relevant to qi
Documents D = {rec 1, …, rec n}:
● Titles
● Descriptions
Query qi:
● Title/description of rec i
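The retrieval idea can be sketched with a toy inverted index over titles. This is a plain-Python stand-in for a search engine, not the engine itself; the function names and sample titles are made up.

```python
from collections import Counter, defaultdict

def build_index(titles: dict) -> dict:
    """Map each token to the set of record ids whose title contains it."""
    index = defaultdict(set)
    for rec_id, title in titles.items():
        for token in title.lower().split():
            index[token].add(rec_id)
    return index

def candidates(rec_id, titles, index, top_k=10):
    """Rank other records by the number of title tokens shared with rec_id."""
    hits = Counter()
    for token in set(titles[rec_id].lower().split()):
        for other in index[token]:
            if other != rec_id:
                hits[other] += 1
    return [other for other, _ in hits.most_common(top_k)]

titles = {1: "selling iphone xs", 2: "iphone xs almost new", 3: "used bmw"}
index = build_index(titles)
print(candidates(1, titles, index))  # [2]
```

A real engine would also weight rare terms more heavily (TF-IDF or BM25) instead of counting shared tokens.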
Implementation Details
In Lucene/Solr/Elasticsearch we can use “more like this” queries
{
  "_id": "cafebabe",
  "_source": {
    "title": "selling iphone",
    "description": "new phone, barely used",
    "phone": "+48 012 131 1212",
    "ip": "127.0.0.1",
    "category": 17,
    "lat": 10,
    "lon": 15,
    "hashes": ["0f4c", "1df0", "5f04"]
  }
}
{
  "query": {
    "more_like_this": {
      "like": {
        "_index": "listings",
        "_type": "_doc",
        "_id": "cafebabe"
      },
      "max_query_terms": 100,
      "fields": ["title", "description", "ip^2", "hashes"]
    }
  }
}
Candidate selection: images
● Previously: exact match
● What if we wanted to do inexact?
94088af86c038327
94088af86c038328
Candidate selection: images
{
  "query": {
    "fuzzy": {
      "hash": {
        "value": "94088af86c038327"
      }
    }
  }
}
Fuzzy query (Elasticsearch)
Candidate selection: images
Chunk the hash:
"94088af86c038327" => "1:9408 2:8af8 3:6c03 4:8327"
{
  "_id": "cafebabe",
  "_source": {
    "title": "selling iphone",
    "description": "new phone, barely used",
    "hashes": ["1:9408 2:8af8 3:6c03 4:8327", ...]
  }
}
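A minimal sketch of the chunking step (`chunk_hash` is an illustrative name; the chunk size of 4 hex digits follows the example above):

```python
def chunk_hash(hex_hash: str, size: int = 4) -> str:
    """Split a hex hash into position-prefixed chunks, e.g. '1:9408 2:8af8 ...'."""
    chunks = [hex_hash[i:i + size] for i in range(0, len(hex_hash), size)]
    return " ".join(f"{pos}:{chunk}" for pos, chunk in enumerate(chunks, start=1))

print(chunk_hash("94088af86c038327"))  # 1:9408 2:8af8 3:6c03 4:8327

# Two near-identical hashes still share most of their chunks.
a = set(chunk_hash("94088af86c038327").split())
b = set(chunk_hash("94088af86c038328").split())
print(len(a & b))  # 3
```

The position prefix keeps a chunk from matching the same hex digits at a different offset, so an ordinary term match on chunks finds hashes that agree on most positions.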
Image embeddings
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
“Image embeddings”
Image embeddings
CNN -> Dim 2k+ -> SVD -> Dim 100
100mln images x Dim 100 -> ???
Embedding index
● Numpy arrays
● Do X.dot(query)
● Client aggregates
● Approximate KNN! LSH techniques
● Almost the same as X.dot(query) but faster
● Many implementations: FAISS, Annoy, etc
Brute-force X.dot(query) becomes slow as the index grows
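A minimal sketch of the brute-force variant with NumPy. The random matrix stands in for real image embeddings, and `top_k` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))               # stand-in for 100-dim embeddings
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit rows: dot product = cosine

def top_k(X: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k rows with the highest dot product, best first."""
    scores = X.dot(query)
    idx = np.argpartition(-scores, k)[:k]  # unordered top k
    return idx[np.argsort(-scores[idx])]   # order those k by score

query = X[42]
print(top_k(X, query))  # row 42 comes first: it matches itself exactly
```

Every query scans all rows, so the cost grows linearly with the index. When the index is split across machines, each shard can return its own top k and the client merges them.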
Embedding index
FAISS FAISS FAISS FAISS FAISS FAISS FAISS
How about inserts?
Too slow :-(
Historical index (updated daily): FAISS FAISS FAISS FAISS FAISS
Delta index (updated in real time): Numpy Numpy
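The historical-plus-delta layout can be sketched as follows. Both tiers are plain NumPy arrays here; in the slides the historical tier is FAISS, so this is a simplified stand-in, and the class name `TwoTierIndex` is hypothetical.

```python
import numpy as np

class TwoTierIndex:
    """Large read-only tier plus a small tier that accepts inserts."""

    def __init__(self, historical: np.ndarray):
        self.historical = historical                      # rebuilt offline, e.g. daily
        self.delta = np.empty((0, historical.shape[1]))   # grows in real time

    def insert(self, vector: np.ndarray) -> int:
        """Append to the delta tier and return the new global id."""
        self.delta = np.vstack([self.delta, vector])
        return len(self.historical) + len(self.delta) - 1

    def search(self, query: np.ndarray, k: int = 5) -> np.ndarray:
        """Score both tiers and merge; ids >= len(historical) come from the delta."""
        scores = np.concatenate([self.historical.dot(query), self.delta.dot(query)])
        return np.argsort(-scores)[:k]

index = TwoTierIndex(np.eye(3))
new_id = index.insert(np.array([0.0, 0.6, 0.8]))
print(new_id, index.search(np.array([0.0, 0.6, 0.8]), k=2))
```

Inserts only touch the small delta array; the daily rebuild would fold the delta into the historical tier and start a new, empty delta.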
Approximate KNN: LSH
LSH: locality-sensitive hashing
● Minhash: approximates Jaccard similarity
● Random projections: approximates cosine similarity
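A minimal Minhash sketch in plain Python. The hash family (random affine functions modulo a prime over CRC32 token ids) and all function names are choices made for this illustration; inputs are assumed to be non-empty token sets.

```python
import random
import zlib

PRIME = (1 << 61) - 1

def make_hash_funcs(m: int, seed: int = 0):
    """m random affine hash functions x -> (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(m)]

def minhash_signature(tokens, funcs):
    """For each hash function keep the minimum hash over the token set."""
    ids = [zlib.crc32(t.encode("utf-8")) for t in set(tokens)]
    return [min((a * x + b) % PRIME for x in ids) for a, b in funcs]

def estimate_jaccard(sig1, sig2) -> float:
    """Share of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

funcs = make_hash_funcs(512)
set_a = {f"w{i}" for i in range(80)}
set_b = {f"w{i}" for i in range(40, 120)}
true_jaccard = len(set_a & set_b) / len(set_a | set_b)
estimate = estimate_jaccard(minhash_signature(set_a, funcs),
                            minhash_signature(set_b, funcs))
print(true_jaccard, estimate)
```

The estimate gets closer to the true Jaccard similarity as the number of hash functions grows.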
X.dot(query)
Elasticsearch
Generate once and keep at hand
Apply to all images
Triplet Loss
https://omoindrot.github.io/triplet-loss
Make hashes, index them
TLDR:
Throw everything to ElasticSearch
And let it find duplicates for you
Contact information
● http://alexeygrigorev.com & contact@alexeygrigorev.com
● https://github.com/alexeygrigorev
● https://www.linkedin.com/in/agrigorev
Thanks!
Questions
Backup: Random projection
v . p = projection of v onto p (for a unit vector p)
v . p positive
u . p = projection of u onto p
u . p negative
u and v on opposite sides of the vector normal to p: sign (u . p) != sign (v . p)
u and v on the same side: sign (u . p) == sign (v . p)
Random projections
● Generate m random vectors pi
● For each pi compute the bit (u . pi >= 0)
● Create hash = [(u . p1 >= 0), (u . p2 >= 0), ..., (u . pm >= 0)]
For two vectors v and u
● Fraction of different bits ~ angle between u and v divided by pi
● Approximation becomes better as m grows
Example (m = 10 projection vectors, bit = 1 when the dot product is >= 0):
    x1     x2
u  -0.92   0.38
v  -0.61   0.78
Projection vectors   bit for u   bit for v
 0.57 -0.81          0           0
-0.97 -0.23          1           1
 0.99 -0.01          0           0
-0.86  0.50          1           1
 0.34 -0.93          0           0
 0.55  0.83          0           1
 0.85 -0.52          0           0
-0.94  0.31          1           1
 0.99  0.04          0           0
-0.66 -0.74          1           0
Hash of u: 0 1 0 1 0 0 0 1 0 1
Hash of v: 0 1 0 1 0 1 0 1 0 0
Fraction of different bits: 0.2
Matrix P
Matrix V
V . P^T >=0
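The matrix form can be checked with NumPy on the numbers from the example slides. Only the variable names are added here; u, v and the ten projection vectors are the values shown in the slides.

```python
import numpy as np

V = np.array([[-0.92, 0.38],    # u
              [-0.61, 0.78]])   # v
P = np.array([[0.57, -0.81], [-0.97, -0.23], [0.99, -0.01], [-0.86, 0.50],
              [0.34, -0.93], [0.55, 0.83], [0.85, -0.52], [-0.94, 0.31],
              [0.99, 0.04], [-0.66, -0.74]])

# One row of bits per input vector, one column per projection vector.
bits = (V @ P.T >= 0).astype(int)
frac_diff = float(np.mean(bits[0] != bits[1]))

# Exact angle between u and v, for comparison with the bit-based estimate.
cosine = V[0] @ V[1] / (np.linalg.norm(V[0]) * np.linalg.norm(V[1]))
angle = float(np.arccos(cosine))

print(bits)
print(frac_diff, angle / np.pi)
```

With only 10 projections the fraction of differing bits (0.2) is a rough estimate of angle / pi (about 0.16); more projection vectors tighten it.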
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 

Duplicates everywhere (Berlin)

  • 5. Cross-Device linking competition: using clickstream data, find logs that belong to the same user https://www.slideshare.net/AlexeyGrigorev/cikm-cup-2016-crossdevice-linking
  • 7. Me now (Oct 2018)
  • 11. Disclaimer Not a presentation of the duplicate detection system at OLX
  • 13. Record Linkage vs Duplicates Schema 1 Schema 2 Unified schema Schema 3 Restoring the duplicates graph!
  • 14. Duplicates For each rec i find duplicates {rec i1, …, rec ik} from the set of n records ID F1 F2 ... Fm Rec 1 f11 f12 f1m Rec 2 f21 f22 f2m ... ... Rec n fn1 fn2 fnm
  • 18. ML for Duplicates ● Compare every record with every other? ● 1000 items => 1000 x 999 / 2 = 499 500 pairs ● Real datasets: millions of records! (avito: 51 million, olx.ua: 11 million)
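The pair count on this slide is n(n - 1) / 2. A minimal sketch of the arithmetic, using the record counts quoted on the slide (the function name is just for illustration):

```python
def n_pairs(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


small = n_pairs(1000)        # the slide's example: 499 500 pairs
large = n_pairs(51_000_000)  # a 51-million-record dataset
print(small, large)
```

At 51 million records the count is on the order of 10^15 pairs, which is why comparing everything with everything is not an option.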
  • 19. ML for Duplicates ● Graph is very sparse! ● Don’t need to compare everything with everything Reality
  • 20. ML for Duplicates ● First step: ● Candidate selection Idea: ● First, find candidate duplicates (10-200) ● Then, get real duplicates (0-50) For each rec i find duplicates {rec i1, …, rec ik} from the set of n records k=0..50 items
  • 21. Duplicate detection framework Candidate Selection step Candidate Scoring step find candidate duplicates (10-200) get real duplicates (0-50) Step 1 Step 2
  • 22. How: ● Domain knowledge (heuristics) ● Information retrieval techniques ● Approximate knn Candidate Selection step Candidate Scoring step find candidate duplicates (10-200) get real duplicates (0-50) Step 1 Step 2
  • 23. Domain knowledge Candidates share the same ● Category (iphone) ● City (Birobidzhan) / district ● Seller id ● IP address of the seller ● Device signature
  • 25. [Figure: grid of record-pair indices (i, j)]
  • 26. Domain knowledge Candidates share the same ● Category (iphone) ● City (Birobidzhan) / district ● Seller id ● IP address of the seller ● Device signature Easy to implement in any RDB!
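As a sketch of how such rules might look in a relational database, the example below uses SQLite and a self-join. The table name, columns, sample rows and the exact rule (same category and city, or same seller) are assumptions made for illustration, not the schema or rules from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    "id INTEGER PRIMARY KEY, category TEXT, city TEXT, seller_id INTEGER)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?)",
    [
        (1, "iphone", "Birobidzhan", 10),
        (2, "iphone", "Birobidzhan", 11),
        (3, "iphone", "Berlin", 10),
        (4, "samsung", "Birobidzhan", 12),
    ],
)

# Self-join: candidate pairs share category and city, or share the seller id.
# a.id < b.id keeps each unordered pair once and drops self-pairs.
candidates = conn.execute(
    """
    SELECT a.id, b.id
    FROM listings a
    JOIN listings b ON a.id < b.id
    WHERE (a.category = b.category AND a.city = b.city)
       OR a.seller_id = b.seller_id
    ORDER BY a.id, b.id
    """
).fetchall()
print(candidates)
```

Listings 1 and 2 are paired through category and city, listings 1 and 3 through the seller id; every other pair is never scored.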
  • 27. Candidate Selection step Candidate Scoring step find candidate duplicates (10-200) get real duplicates (0-50) Step 1 Step 2 Machine Learning!
  • 29. Training data (ID1, ID2, Label): (1, 2, 1), (1, 4, 0), (2, 7, 0), (k, 5, 1)
  • 30. Machine Learning. Labeled pairs (ID1, ID2, Label): (1, 2, 1), (1, 4, 0), (2, 7, 0), (k, 5, 1) → Feature engineering → pairs with features (ID1, ID2, Features, Label): (1, 2, [0, 1, ..., 5], 1), (1, 4, [2, 0, ..., 3], 0), (2, 7, [3, 1, ..., 3], 0), (k, 5, [5, 3, ..., 8], 1) → Model → Tune F1/Recall/Precision
  • 31. Features Mostly pairwise differences/ratios and distances/similarities Most basic ones: ● |price1 - price2| ● min(price1, price2) / max(price1, price2) ● dist(loc1, loc2) ● same(ip1, ip2) ● same(loc_id1, loc_id2) ● same(category1, category2) ● |len(title1) - len(title2)| ● |len(images1) - len(images2)|
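A minimal sketch of these basic pairwise features. The record layout (dictionary keys such as `price`, `lat`, `images`) and the choice of haversine distance for dist(loc1, loc2) are assumptions for illustration; the slide does not say which distance was used.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (one possible choice for dist(loc1, loc2))."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def pair_features(a: dict, b: dict) -> dict:
    """Differences, ratios and equality flags for one candidate pair."""
    lo, hi = sorted((a["price"], b["price"]))
    return {
        "price_abs_diff": abs(a["price"] - b["price"]),
        "price_ratio": lo / hi if hi > 0 else 1.0,
        "dist_km": haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]),
        "same_ip": int(a["ip"] == b["ip"]),
        "same_category": int(a["category"] == b["category"]),
        "title_len_diff": abs(len(a["title"]) - len(b["title"])),
        "n_images_diff": abs(len(a["images"]) - len(b["images"])),
    }


ad1 = {"price": 100, "lat": 10.0, "lon": 15.0, "ip": "127.0.0.1",
       "category": 17, "title": "used bmw", "images": ["a", "b", "c"]}
ad2 = {"price": 80, "lat": 10.0, "lon": 15.0, "ip": "127.0.0.1",
       "category": 17, "title": "selling almost new bmw", "images": ["a"]}
features = pair_features(ad1, ad2)
print(features)
```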
  • 32. Text features Create a vector representation of text ● Bag of words ● TF-IDF ● Character N-Grams. Bag of words over the vocabulary (sell, pixel, iphone, samsung, xs, s9): [1, 1, 0, 0, 0, 0] and [0, 0, 1, 0, 1, 0]. TF-IDF over the same vocabulary: [0.001, 0.1, 0, 0, 0, 0] and [0, 0, 0.2, 0, 0.8, 0]. Character N-grams (sel, ell, iph, pho, hon, one): [1, 1, 1, 1, 1, 1] and [0, 0, 0, 1, 1, 1]
  • 33. Text features: cosine similarity https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/eb9cd609-e44a-40a2-9c3a-f16fc4f5289a.xhtml
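A small sketch of cosine similarity over character 3-gram counts, using only the standard library. The two example strings are assumptions chosen so that the 3-grams match the vocabulary on the previous slide (sel, ell, iph, pho, hon, one); TF-IDF weights could be used in place of raw counts.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams inside each whitespace-separated word."""
    grams = Counter()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return grams


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


v1 = char_ngrams("sell iphone")  # sel, ell, iph, pho, hon, one
v2 = char_ngrams("phone")        # pho, hon, one
similarity = cosine(v1, v2)
print(round(similarity, 3))
```

The two vectors share three of six 3-grams, so the similarity is 3 / (sqrt(6) * sqrt(3)).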
  • 35. Word2Vec features How to compare documents? ● Title1: “used bmw” ● Title2: “selling almost new bmw”. Pairwise word similarities (columns: selling, almost, new, bmw): used = [0.3, 0.1, 0.6, 0.5]; bmw = [0.2, 0.1, 0.55, 1]. Stats over the matrix: min 0.1, mean 0.41, max 1.0, std 0.28
  • 36. Word2Vec features part 2 How to compare documents? ● Title1: “used bmw” ● Title2: “selling almost new bmw”. Pairwise word similarities (columns: selling, almost, new, bmw): used = [0.3, 0.1, 0.6, 0.5]; bmw = [0.2, 0.1, 0.55, 1]. Stats: min 0.1, mean 0.33, max 0.6, std 0.20
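The summary statistics can be sketched as below. The similarity values are copied from the slide rather than produced by a trained Word2Vec model. The second set of statistics drops the exact-match pair (bmw, bmw); that reading is an assumption, though it is consistent with the "part 2" numbers (mean 0.33, max 0.6).

```python
import statistics

# Pairwise word similarities copied from the slide (title1 word, title2 word).
sims = {
    ("used", "selling"): 0.3, ("used", "almost"): 0.1,
    ("used", "new"): 0.6, ("used", "bmw"): 0.5,
    ("bmw", "selling"): 0.2, ("bmw", "almost"): 0.1,
    ("bmw", "new"): 0.55, ("bmw", "bmw"): 1.0,
}


def summary(values):
    """min, mean, max and population standard deviation of the similarities."""
    values = list(values)
    return min(values), statistics.fmean(values), max(values), statistics.pstdev(values)


all_pairs = summary(sims.values())
# Assumed reading of "part 2": ignore pairs where both words are identical.
no_exact = summary(s for (w1, w2), s in sims.items() if w1 != w2)
print(all_pairs)
print(no_exact)
```

The four numbers per variant become fixed-length features for the pair, regardless of how many words each title has.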
  • 38. Hashes ● md5: cryptographic hash ● dhash, phash, whash: perceptual hashes. Example: two images share the perceptual hash 94088af86c038327 but have different md5 hashes, 14ee7fe587860078a1109033318bd986 and 07aaedb9b75e88a6051184f01be5cc50
  • 39. Dhash: difference hash https://www.kaggle.com/iezepov/get-hash-from-images-slightly-daster/code Steps: read as b/w image and resize; get a numpy array; take the difference between adjacent cells
  • 40. Pixel values of two resized images (8 rows x 9 columns). The second image differs from the first in a single cell (row 8, column 1: 141 instead of 114):
      149 168 145 131 134 111 115 114 108
      198 192 162 135 104 137 128 108  97
      158 165 151 117 111 133 130 139 115
       79  95 132 151 180 212 189 158 124
       91  47  57  90  67  81 165 142 110
      104  80  63  53  43  34  20  42 101
      110 113 109  92  79  53  27  59 102
      114 111 110 112 108  67  73  90 103
  • 41. Differences between adjacent cells (8 rows x 8 columns). For the second image only one cell changes (row 8, column 1: -30 instead of -3):
       19 -23 -15   4 -23   4  -1  -6
       -6 -30 -27 -31  33  -9 -20 -11
        7 -14 -34  -6  22  -3   9 -24
       16  37  19  29  32 -23 -31 -34
      -44  10  33 -23  14  84 -23 -32
      -24 -17 -10 -10  -9 -14  22  59
        3  -3 -18 -13 -26 -26  32  43
       -3  -4   2  -4 -41   6  17  13
  • 42. Difference > 0 (8 rows x 8 columns), identical for both images:
      TRUE  FALSE FALSE TRUE  FALSE TRUE  FALSE FALSE
      FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE
      TRUE  FALSE FALSE FALSE TRUE  FALSE TRUE  FALSE
      TRUE  TRUE  TRUE  TRUE  TRUE  FALSE FALSE FALSE
      FALSE TRUE  TRUE  FALSE TRUE  TRUE  FALSE FALSE
      FALSE FALSE FALSE FALSE FALSE FALSE TRUE  TRUE
      TRUE  FALSE FALSE FALSE FALSE FALSE TRUE  TRUE
      FALSE FALSE TRUE  FALSE FALSE TRUE  TRUE  TRUE
  • 43. Each row of the boolean matrix is read as an 8-bit number and written in hex (row: decimal → hex): 1: 148 → 94, 2: 8 → 08, 3: 138 → 8a, 4: 248 → f8, 5: 108 → 6c, 6: 3 → 03, 7: 131 → 83, 8: 39 → 27. Hash: 94088af86c038327
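The dhash steps can be sketched in plain Python, starting from the already resized 8 x 9 grayscale matrix shown on slide 40 (reading and resizing the image is left out here). The function name is illustrative and is not taken from the linked Kaggle kernel.

```python
PIXELS = [
    [149, 168, 145, 131, 134, 111, 115, 114, 108],
    [198, 192, 162, 135, 104, 137, 128, 108, 97],
    [158, 165, 151, 117, 111, 133, 130, 139, 115],
    [79, 95, 132, 151, 180, 212, 189, 158, 124],
    [91, 47, 57, 90, 67, 81, 165, 142, 110],
    [104, 80, 63, 53, 43, 34, 20, 42, 101],
    [110, 113, 109, 92, 79, 53, 27, 59, 102],
    [114, 111, 110, 112, 108, 67, 73, 90, 103],
]


def dhash_hex(pixels) -> str:
    """Difference hash: one bit per adjacent pair, set when right > left."""
    bits = [
        right > left
        for row in pixels
        for left, right in zip(row, row[1:])
    ]
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return f"{value:0{len(bits) // 4}x}"  # 64 bits -> 16 hex digits


hash_1 = dhash_hex(PIXELS)

# Second image from the slide: one pixel changed (row 8, column 1: 114 -> 141).
PIXELS_2 = [row[:] for row in PIXELS]
PIXELS_2[7][0] = 141
hash_2 = dhash_hex(PIXELS_2)
print(hash_1, hash_2)
```

Because only the sign of each difference is kept, the changed pixel does not alter the hash, which is the property that makes it useful for near-identical images.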
  • 44. Features: hashes ● Number of images with same md5, phash, dhash, etc ● Distances between hashes. Example: 94088af86c038327 vs 94088af86c038328; last 16 bits 1000001100100111 vs 1000001100101000; distance is 4 bits
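The distance between two hex hashes is the number of differing bits (Hamming distance). A short sketch, with the two hashes from the slide:

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


distance = hamming_distance("94088af86c038327", "94088af86c038328")
print(distance)
```

The hashes differ only in the last hex digit (7 = 0111, 8 = 1000), so all four of its bits flip.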
  • 45. 9 4 14 40 45 35 → reshape → 9 4 14 40 45 35 → stats: min 4, mean 24.5, max 45, std 16.02. Remove pictures with the same hash!
  • 46. Candidate Selection step Candidate Scoring step find candidate duplicates (10-200) get real duplicates (0-50) Step 1 Step 2 Hashes are useful here too!
  • 47. Candidate selection step Candidates share the same ● Category (iphone) ● City (Birobidzhan) / district ● Seller id ● IP address of the seller ● Device signature ● Image hash
  • 48. How: ● Domain knowledge (heuristics) ● Information retrieval techniques ● Approximate knn Candidate Selection step Candidate Scoring step find candidate duplicates (10-200) get real duplicates (0-50) Step 1 Step 2
  • 49. Information retrieval ● Previously: exact matches ● Instead, do inexact matches with IR IR: ● Documents D = {d1, …, dn} ● Query qi ● Find {di1, …, dik} relevant to qi
  • 50. Documents D = {rec 1, …, rec n}: ● Titles ● Descriptions Query qi: ● Title/description of rec i Candidate Selection step Candidate Scoring step
  • 51. Implementation Details In Lucene/Solr/Elasticsearch we can use “more like this” queries. Indexed document: { "_id": "cafebabe", "_source": { "title": "продам iphone", "description": "новый телефон почти не пользовался", "phone": "+48 012 131 1212", "ip": "127.0.0.1", "category": 17, "lat": 10, "lon": 15, "hashes": ["0f4c", "1df0", "5f04"] } } (title: “selling iphone”; description: “new phone, barely used”). Query: { "query": { "more_like_this": { "like": { "_index": "listings", "_type": "_doc", "_id": "cafebabe" }, "max_query_terms": 100, "fields": ["title", "description", "ip^2", "hashes"] } } }
  • 52. How: ● Domain knowledge (heuristics) ● Information retrieval techniques ● Approximate knn Candidate Selection step Candidate Scoring step find candidate duplicates (10-200) get real duplicates (0-50) Step 1 Step 2
  • 53. Candidate selection: images ● Previously: exact match ● What if we wanted to do inexact? 94088af86c038327 94088af86c038328
  • 54. Candidate selection: images. Fuzzy query (Elasticsearch): { "query": { "fuzzy": { "hash": { "value": "94088af86c038327" } } } }
  • 55. Candidate selection: images Chunk the hash: "94088af86c038327" => "1:9408 2:8af8 3:6c03 4:8327". Indexed document: { "_id": "cafebabe", "_source": { "title": "продам iphone", "description": "новый телефон почти не пользовался", "hashes": ["1:9408 2:8af8 3:6c03 4:8327", ...] } } (title: “selling iphone”; description: “new phone, barely used”)
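A sketch of the chunking step, with each chunk prefixed by its position so that a chunk only matches the same position in another hash. The helper name and the default of four chunks (taken from the slide's example) are illustrative.

```python
def chunk_hash(hash_hex: str, n_chunks: int = 4) -> str:
    """Split a hex hash into position-prefixed chunks, e.g. '1:9408 2:8af8 ...'."""
    size = len(hash_hex) // n_chunks
    return " ".join(
        f"{i + 1}:{hash_hex[i * size:(i + 1) * size]}" for i in range(n_chunks)
    )


chunks_a = chunk_hash("94088af86c038327")
chunks_b = chunk_hash("94088af86c038328")
shared = set(chunks_a.split()) & set(chunks_b.split())
print(chunks_a)
print(len(shared))
```

Two hashes that differ in a few bits still share most of their chunks, so a plain term match on the chunked field can retrieve near-identical images without a fuzzy query.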
  • 60. Embedding index ● Numpy arrays ● Do X.dot(query) ● Client aggregates
  • 61. Embedding index ● Numpy arrays ● Do X.dot(query) ● Client aggregates (becomes slow as the index grows) ● Approximate KNN! LSH techniques ● Almost the same as X.dot(query) but faster ● Many implementations: FAISS, Annoy, etc
  • 62. Embedding index [diagram: several FAISS instances]
  • 63. How about inserts? [diagram: several FAISS instances]
  • 64. How about inserts? Too slow :-( [diagram: several FAISS instances]
  • 65. How about inserts? Historical index (FAISS, updated daily) plus Delta index (Numpy, updated in real time)
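A minimal sketch of the historical-plus-delta idea. To stay self-contained, both tiers here are scanned by brute-force dot product in plain Python; in the setup on the slide the historical tier would be an approximate index such as FAISS and the delta tier a Numpy array. The class and method names are hypothetical.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class TwoTierIndex:
    """Historical tier (rebuilt in bulk) plus a small delta tier for fresh inserts."""

    def __init__(self):
        self.historical = []  # (item_id, vector); stand-in for an ANN index
        self.delta = []       # (item_id, vector); scanned by brute force

    def add(self, item_id, vector):
        # New items only touch the delta tier, so inserts stay cheap.
        self.delta.append((item_id, vector))

    def rebuild(self):
        # Periodic (for example daily) job: fold the delta into the historical tier.
        self.historical.extend(self.delta)
        self.delta = []

    def search(self, query, k=10):
        # Query both tiers and merge the results by score.
        scored = [
            (dot(vector, query), item_id)
            for tier in (self.historical, self.delta)
            for item_id, vector in tier
        ]
        return [item_id for _, item_id in heapq.nlargest(k, scored)]


index = TwoTierIndex()
index.add("a", (1.0, 0.0))
index.add("b", (0.0, 1.0))
index.rebuild()             # "a" and "b" move to the historical tier
index.add("c", (0.9, 0.1))  # "c" is searchable immediately via the delta tier
top2 = index.search((1.0, 0.0), k=2)
print(top2)
```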
  • 66. Approximate KNN: LSH LSH: locality-sensitive hashing ● MinHash: approximates Jaccard similarity ● Random projections: approximate cosine similarity (X.dot(query))
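A sketch of MinHash under simple assumptions: each of the `num_hashes` hash functions is md5 salted with a seed (chosen here only because it is in the standard library), and the estimate is the fraction of signature positions where the two sets agree. The token sets reuse the example titles from the Word2Vec slides.

```python
import hashlib


def _hash(token: str, seed: int) -> int:
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def minhash_signature(tokens, num_hashes: int = 256):
    """For each hash function keep the minimum hash over the (non-empty) token set."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = {"selling", "almost", "new", "bmw"}
b = {"used", "bmw"}
true_jaccard = len(a & b) / len(a | b)  # 1 shared token out of 5
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(true_jaccard, estimate)
```

The estimate gets closer to the true Jaccard similarity as the number of hash functions grows, at the cost of a longer signature.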
  • 67. Elasticsearch Generate once and keep at hand Apply to all images
  • 72. TLDR: Throw everything into Elasticsearch and let it find duplicates for you
  • 73. Contact information ● http://alexeygrigorev.com & contact@alexeygrigorev.com ● https://github.com/alexeygrigorev ● https://www.linkedin.com/in/agrigorev
  • 77-78. [Diagram: a vector p, then p together with a vector v]
  • 79. v . p = projection of v onto p; v . p is positive
  • 80. u . p = projection of u onto p; u . p is negative
  • 81. sign(u . p) != sign(v . p): u and v lie on opposite sides of the line normal to p
  • 82-83. sign(u . p) == sign(v . p): u and v lie on the same side
  • 84. Random projections ● Generate m random vectors pi ● For each compute (u . pi >= 0) ● Create hash = [(u . p0 >= 0), (u . p1 >= 0), ...] For two vectors v and u ● Number of different bits ~ the angle theta between u and v ● Approximation becomes better as m grows
  • 85-97. Random projections, worked example (the slides add one projection vector at a time). Vectors (x1, x2): u = (-0.92, 0.38), v = (-0.61, 0.78). Projection vectors: (0.57, -0.81), (-0.97, -0.23), (0.99, -0.01), (-0.86, 0.50), (0.34, -0.93), (0.55, 0.83), (0.85, -0.52), (-0.94, 0.31), (0.99, 0.04), (-0.66, -0.74). Hash bits: u = 0 1 0 1 0 0 0 1 0 1, v = 0 1 0 1 0 1 0 1 0 0. Fraction of differing bits: 0.2
  • 98. The same in matrix form: Matrix V (rows u, v), Matrix P (rows = projection vectors), hash = (V . P^T >= 0)
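The worked example can be reproduced in a few lines of plain Python, using the vectors u, v and the ten projection vectors from the slides. Turning the fraction of differing bits into an angle by multiplying by pi is the standard random-projection estimate; the variable names are illustrative.

```python
import math

U = (-0.92, 0.38)
V = (-0.61, 0.78)
P = [
    (0.57, -0.81), (-0.97, -0.23), (0.99, -0.01), (-0.86, 0.50), (0.34, -0.93),
    (0.55, 0.83), (0.85, -0.52), (-0.94, 0.31), (0.99, 0.04), (-0.66, -0.74),
]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lsh_bits(vec, projections):
    """One bit per projection vector: 1 if vec . p >= 0, else 0."""
    return [int(dot(vec, p) >= 0) for p in projections]


u_bits = lsh_bits(U, P)
v_bits = lsh_bits(V, P)

# Fraction of differing bits estimates angle / pi.
fraction = sum(a != b for a, b in zip(u_bits, v_bits)) / len(P)
estimated_angle = fraction * math.pi
true_angle = math.acos(dot(U, V) / (math.hypot(*U) * math.hypot(*V)))
print(u_bits, v_bits, fraction)
print(math.degrees(estimated_angle), math.degrees(true_angle))
```

With only ten projections the estimate (36 degrees) is a coarse approximation of the true angle (about 29.5 degrees); it tightens as more projection vectors are added.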