Fighting fraud: finding duplicates at scale (Highload+ 2019)

Fighting fraud
Finding duplicates among hundreds of millions of listings
Alexey Grigorev

Fighting Fraud
Finding Duplicates at Scale
Alexey Grigorev

https://www.slideshare.net/AlexeyGrigorev/avito-duplicate-ads-detection-kaggle

Disclaimer
It’s a simplification and doesn’t show all the details from the actual system we use at OLX

Plan
● User generated content
○ Fraud and duplicates
○ Content moderation systems
● Duplicate detection framework
○ Step 1: Selecting candidates
○ Step 2: Scoring candidates with Machine Learning
○ Image hashes
● Implementation
○ Elasticsearch
○ Image index system

User generated content
Such description. So much text

Problems:
● Illegal content
● NSFW content
● Duplicates
● Spam
● Fraud

Fraud / Duplicates
Two sides of the same coin
* I don’t necessarily think that dogecoin is fraud

Such good description,
so better text

Such goud description,
so better text

Content moderation
[Diagram: a listing ("Such description. So much text") enters the automatic moderation system (ML), which accepts it, rejects it, or puts it into the moderation queue. Moderators work through the queue in the moderation panel (MP) and accept or reject each listing.]

[Diagram: the automatic moderation system combines duplicate detection, forbidden items detection, and other ML models.]

Duplicate detection framework
● Step 1: Candidate selection step: find candidate duplicates (10-200)
● Step 2: Candidate scoring step: get real duplicates (0-50)
https://www.slideshare.net/AlexeyGrigorev/duplicates-everywhere-berlin

Why?
● Cannot compare each item with each
● 1000 items ⇒ 1000 x 999 / 2 = 499 500 pairs for comparison
● Real datasets: millions! (avito: 58mln, olx.ua: 13mln)
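
The arithmetic behind the 1000-item bullet can be checked with a small sketch. The function name `pair_count` is made up for this illustration; the dataset sizes are the ones quoted on the slide.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# 1000 items is the slide's example; 13 mln and 58 mln are the quoted dataset sizes.
for n in (1_000, 13_000_000, 58_000_000):
    print(f"{n:>11,} items -> {pair_count(n):,} pairs")
```

The pair count grows quadratically, which is why the framework first narrows each listing down to a short list of candidates.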

How?
● Domain knowledge (heuristics)
● Information retrieval techniques
● Approximate knn

Candidate selection
● Category
● City / district
● Seller id
● IP address of the seller
● Device signature
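
One way to turn these attributes into candidates is to bucket listings by each attribute value and pair up only listings that share a bucket. The sketch below is an illustration under assumptions, not the system described in the talk: the listing records, the field names and the choice of `seller_id` and `ip` as keys are all hypothetical.

```python
from collections import defaultdict

# Hypothetical listings; field names are assumptions for this sketch.
listings = [
    {"id": 1, "seller_id": "a", "ip": "10.0.0.1"},
    {"id": 2, "seller_id": "a", "ip": "10.0.0.2"},
    {"id": 3, "seller_id": "b", "ip": "10.0.0.2"},
    {"id": 4, "seller_id": "c", "ip": "10.0.0.9"},
]

BLOCKING_KEYS = ("seller_id", "ip")


def select_candidates(items, keys=BLOCKING_KEYS):
    """Return pairs of listing ids that share a value for at least one key."""
    buckets = defaultdict(list)
    for item in items:
        for key in keys:
            buckets[(key, item[key])].append(item["id"])

    pairs = set()
    for ids in buckets.values():
        for i, first in enumerate(ids):
            for second in ids[i + 1:]:
                pairs.add((min(first, second), max(first, second)))
    return pairs


print(sorted(select_candidates(listings)))
```

Only pairs inside the same bucket are ever compared, so the cost depends on bucket sizes rather than on the total number of listings.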

Step 1: Candidate selection step
Step 2: Candidate scoring step: Machine Learning!

[Diagram: moderators review listings from the moderation queue in the moderation panel (MP); their accept/reject decisions provide the "Duplicate" / "Not duplicate" labels.]

Machine Learning
Labeled pairs:
ID1  ID2  Label
1    2    1
1    4    0
2    7    0
k    5    1
After feature engineering:
ID1  ID2  Features        Label
1    2    [0, 1, ..., 5]  1
1    4    [2, 0, ..., 3]  0
2    7    [3, 1, ..., 3]  0
k    5    [5, 3, ..., 8]  1
Model: tune F1/Precision/Recall
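
Tuning for F1, precision or recall usually comes down to choosing a score threshold above which a pair counts as a duplicate. A minimal sketch, assuming the model outputs a score per pair: the labels are the four from the table, while the scores and thresholds are invented for illustration.

```python
def precision_recall_f1(labels, scores, threshold):
    """Metrics when every pair with score >= threshold is predicted 'duplicate'."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


labels = [1, 0, 0, 1]          # labels of the four pairs in the table
scores = [0.9, 0.6, 0.2, 0.4]  # hypothetical model scores, not real output

for threshold in (0.3, 0.5, 0.7):
    print(threshold, precision_recall_f1(labels, scores, threshold))
```

A lower threshold favours recall (more duplicates caught, more false alarms); a higher one favours precision.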

Features
Pairwise distances/similarities
● Cosine between titles (Bag of words, TF-IDF)
● Cosine between description (Bag of words, TF-IDF)
● Word2Vec

Bag of Words
Titles:
● “Selling Pixel”
● “iPhone XS”
● “Samsung S9”
Vocabulary: sell, pixel, iphone, xs, samsung, s9

Bag of Words
                 sell  pixel  iphone  xs  samsung  s9
“Selling Pixel”  1     1      0       0   0        0
“iPhone XS”      0     0      1       1   0        0
“Samsung S9”     0     0      0       0   1        1

TF-IDF
                 sell   pixel  iphone  xs   samsung  s9
“Selling Pixel”  0.001  0.3    0       0    0        0
“iPhone XS”      0      0      0.2     0.8  0        0
“Samsung S9”     0      0      0       0    0.1      0.9
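
The bag-of-words matrix above can be reproduced in a few lines of plain Python. This is a sketch under one assumption: the `NORMALIZE` map is a hypothetical stand-in for real stemming, which is what turns "Selling" into "sell" on the slide.

```python
NORMALIZE = {"selling": "sell"}  # hypothetical stand-in for a stemmer


def tokenize(title):
    """Lowercase, split on whitespace, and normalize known word forms."""
    return [NORMALIZE.get(token, token) for token in title.lower().split()]


titles = ["Selling Pixel", "iPhone XS", "Samsung S9"]

# Vocabulary in order of first appearance.
vocabulary = []
for title in titles:
    for token in tokenize(title):
        if token not in vocabulary:
            vocabulary.append(token)


def bag_of_words(title):
    """Count how often each vocabulary term occurs in the title."""
    tokens = tokenize(title)
    return [tokens.count(term) for term in vocabulary]


matrix = [bag_of_words(title) for title in titles]
print(vocabulary)
for title, row in zip(titles, matrix):
    print(row, title)
```

TF-IDF keeps the same columns but replaces the raw counts with weights that shrink for terms appearing in many listings.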

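Text features: cosine similarity
[Figure: doc1 and doc2 drawn as vectors over the "buy" and "sell" axes; θ is the angle between the vectors]
In Scikit-Learn:
from sklearn.feature_extraction.text import TfidfVectorizer

# left[i] and right[i] are the two texts of candidate pair i (lists of strings)
vectorizer = TfidfVectorizer()
vectorizer.fit(left + right)
X_left = vectorizer.transform(left)
X_right = vectorizer.transform(right)
# rows are L2-normalized by default, so the row-wise dot product is the cosine
sim = X_left.multiply(X_right).sum(axis=1)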
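
To make the "angle between vectors" concrete without any library, here is a sketch of cosine similarity over plain word counts. It uses raw counts rather than TF-IDF weights, and whitespace tokenization is an assumption of the sketch.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine("new iphone", "new iphone almost not used"))
print(cosine("iPhone XS", "Samsung S9"))
```

Texts with no words in common score 0; texts with identical word counts score 1.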

Image hashes
● md5: cryptographic hash
● dhash, phash, whash: perceptual hashes
Image 1: perceptual hash 94088af86c038327, md5 14ee7fe587860078a1109033318bd986
Image 2: perceptual hash 94088af86c038327, md5 07aaedb9b75e88a6051184f01be5cc50

Dhash: difference hash
Read as b/w image, resize
Get numpy array
Difference between adjacent cells
https://www.kaggle.com/iezepov/get-hash-from-images-slightly-daster/code

Grayscale pixels after resizing (8 rows x 9 columns):
149 168 145 131 134 111 115 114 108
198 192 162 135 104 137 128 108 97
158 165 151 117 111 133 130 139 115
79 95 132 151 180 212 189 158 124
91 47 57 90 67 81 165 142 110
104 80 63 53 43 34 20 42 101
110 113 109 92 79 53 27 59 102
114 111 110 112 108 67 73 90 103

Differences between adjacent cells, right minus left (8 x 8):
19 -23 -14 3 -23 4 -1 -6
-6 -30 -27 -31 33 -9 -20 -11
7 -14 -34 -6 22 -3 9 -24
16 37 19 29 32 -23 -31 -34
-44 10 33 -23 14 84 -23 -32
-24 -17 -10 -10 -9 -14 22 59
3 -4 -17 -13 -26 -26 32 43
-3 -1 2 -4 -41 6 17 13

Is the difference positive? (8 x 8):
TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE

Each row read as 8 bits (TRUE = 1):
Row  Decimal  Hex
1    148      94
2    8        08
3    138      8a
4    248      f8
5    108      6c
6    3        03
7    131      83
8    39       27
Hash: 94088af86c038327
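
The dhash steps (compare horizontally adjacent cells, read each row of comparisons as one byte, concatenate the bytes as hex) can be sketched in plain Python on the 8 x 9 pixel matrix from the slides. This follows the walk-through here; the bit order used by a library such as ImageHash may differ, and reading and resizing the image is left out.

```python
# Grayscale pixel values from the slide: 8 rows of 9 columns.
PIXELS = [
    [149, 168, 145, 131, 134, 111, 115, 114, 108],
    [198, 192, 162, 135, 104, 137, 128, 108, 97],
    [158, 165, 151, 117, 111, 133, 130, 139, 115],
    [79, 95, 132, 151, 180, 212, 189, 158, 124],
    [91, 47, 57, 90, 67, 81, 165, 142, 110],
    [104, 80, 63, 53, 43, 34, 20, 42, 101],
    [110, 113, 109, 92, 79, 53, 27, 59, 102],
    [114, 111, 110, 112, 108, 67, 73, 90, 103],
]


def dhash(pixels):
    """Difference hash: one bit per adjacent pair, one hex byte per row."""
    hex_bytes = []
    for row in pixels:
        value = 0
        for left, right in zip(row, row[1:]):
            # Bit is 1 when the pixel gets brighter from left to right.
            value = (value << 1) | int(right > left)
        hex_bytes.append(f"{value:02x}")
    return "".join(hex_bytes)


print(dhash(PIXELS))  # 94088af86c038327
```

Only the sign of each difference enters the hash, so small changes in brightness or compression noise usually leave most bits unchanged.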

Features: hashes
● Number of images with same md5, phash, dhash, etc.
● Distances between hashes: min, avg, max
94088af86c038327
94088af86c038328
Last 16 bits of each hash (8327 vs 8328):
1 0 0 0 0 0 1 1 0 0 1 0 0 1 1 1
1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0
Distance is 4 bits
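
The bit distance above is a Hamming distance, which is short to compute from the hex strings. The sketch below also derives the min/avg/max features for two listings; the helper names and the second example hash list are made up for illustration.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hex-encoded hashes of equal length."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def distance_features(hashes_a, hashes_b):
    """min/avg/max distance over all image pairs of two listings."""
    distances = [hamming_distance(a, b) for a in hashes_a for b in hashes_b]
    return {
        "min": min(distances),
        "avg": sum(distances) / len(distances),
        "max": max(distances),
    }


print(hamming_distance("94088af86c038327", "94088af86c038328"))  # 4
print(distance_features(
    ["94088af86c038327"],
    ["94088af86c038327", "94088af86c038328"],
))
```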

Candidate selection
● Category
● City / district
● Seller id
● IP address of the seller
● Device signature
● Image hashes

Image embeddings
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
These embeddings capture semantic similarity between images
https://keras.io/applications/

Image embeddings
Image → CNN (dim 1k+) → SVD* (dim 100) → LSH → 36a93c34a3abff
* TruncatedSVD works best
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

LSH: Random Projections
● Close in the original space ⇒ close in the projection
● Far in the original space ⇒ far in the projection
● Generate the projection vectors once and store them somewhere
● Use the vectors to reduce the dimensionality and compute the hash
● Store the hash in the database
https://www.slideshare.net/AlexeyGrigorev/duplicates-everywhere
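
A minimal sketch of random-projection LSH, assuming the common sign-of-dot-product variant: each projection vector contributes one bit, set when the embedding falls on its positive side. The dimension (100), the number of bits (56) and the seeds are assumptions chosen for illustration; a fixed seed stands in for "generate once and store".

```python
import random


def make_projections(dim, n_bits, seed=0):
    """Generate n_bits random Gaussian vectors; the fixed seed makes them reproducible."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_hash(vector, projections):
    """One bit per projection: 1 if the dot product is positive, else 0."""
    bits = 0
    for projection in projections:
        dot = sum(v * w for v, w in zip(vector, projection))
        bits = (bits << 1) | (1 if dot > 0 else 0)
    n_hex = (len(projections) + 3) // 4
    return f"{bits:0{n_hex}x}"


projections = make_projections(dim=100, n_bits=56)

# Hypothetical 100-dimensional embedding (e.g. CNN output after SVD).
rng = random.Random(1)
embedding = [rng.gauss(0.0, 1.0) for _ in range(100)]

code = lsh_hash(embedding, projections)
print(code)
```

Vectors separated by a small angle tend to land on the same side of most projections, so their hashes differ in few bits, and the hex string can be stored and compared like the other image hashes.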

Why Elasticsearch?
● Well-known, convenient, stable and scalable inverted index (thanks, Lucene!)

Direct index (ImageID → Hash):
1  00fc
2  12ec
3  00fc
4  ebe4
5  7a1f
6  00fc
7  8ef4
8  12ec

Inverted index (Hash → ImageIDs):
00fc  1 3 6
12ec  2 8
ebe4  4
7a1f  5
8ef4  7
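
The relationship between the two indexes can be shown with a dictionary. This is only an in-memory illustration of the idea, using the image ids and hashes from the slide; Lucene's actual index structures are far more elaborate.

```python
from collections import defaultdict

# Direct index from the slide: image id -> hash.
direct_index = {
    1: "00fc", 2: "12ec", 3: "00fc", 4: "ebe4",
    5: "7a1f", 6: "00fc", 7: "8ef4", 8: "12ec",
}


def invert(direct):
    """Build hash -> sorted list of image ids."""
    inverted = defaultdict(list)
    for image_id, image_hash in sorted(direct.items()):
        inverted[image_hash].append(image_id)
    return dict(inverted)


inverted_index = invert(direct_index)
print(inverted_index["00fc"])  # [1, 3, 6]
```

Finding all images with a given hash becomes a single lookup instead of a scan over every image.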

Elasticsearch for hashes
Document:
{
  "_id": "cafebabe",
  "_source": {
    "title": "new iphone",
    "description": "new iphone almost not used",
    "hashes": ["94088af86c038327", ... ]
  }
}
Query for "94088af86c038327":
"query": {
  "bool": {
    "must": [{
      "term": {
        "hashes": "94088af86c038327"
      }
    }]
  }
}

Fuzzy? Chunk the hash!
"94088af86c038327" → "1:9408 2:8af8 3:6c03 4:8327"
{
  "_id": "cafebabe",
  "_source": {
    "hashes": ["1:9408 2:8af8 3:6c03 4:8327", ... ]
  }
}
Let Elasticsearch treat the chunks as ordinary tokens, e.g. with the whitespace tokenizer
"query": {
  "match": {
    "hashes": {
      "query": "1:9408 2:8af8 3:6c03 4:8327"
    }
  }
}

● 1:9408 2:8af8 3:6c03 4:8327
● 1:9408 2:8af8 3:6c03 4:8238
● 1:9408 2:8af8 3:6323 4:8327
● 1:9408 2:34f4 3:6c03 4:8327
● 1:9408 2:b3af 3:6c03 4:31eb
First, exact matches
Then 3 out of 4
Then 2 out of 4
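
The chunking and the resulting order can be imitated in plain Python. Counting shared chunks is a simplification assumed for this sketch: Elasticsearch ranks `match` results with its own relevance scoring, which here happens to produce the same grouping. The function names are made up.

```python
def chunk_hash(image_hash, size=4):
    """Split a hex hash into position-prefixed chunks: '1:9408 2:8af8 ...'."""
    chunks = [image_hash[i:i + size] for i in range(0, len(image_hash), size)]
    return " ".join(f"{n}:{chunk}" for n, chunk in enumerate(chunks, start=1))


def shared_chunks(chunked_a, chunked_b):
    """Number of position-prefixed chunks two chunked hashes have in common."""
    return len(set(chunked_a.split()) & set(chunked_b.split()))


query = chunk_hash("94088af86c038327")

# Indexed values from the slide.
indexed = [
    "1:9408 2:b3af 3:6c03 4:31eb",
    "1:9408 2:8af8 3:6c03 4:8238",
    "1:9408 2:8af8 3:6c03 4:8327",
    "1:9408 2:34f4 3:6c03 4:8327",
    "1:9408 2:8af8 3:6323 4:8327",
]

ranked = sorted(indexed, key=lambda doc: shared_chunks(query, doc), reverse=True)
for doc in ranked:
    print(shared_chunks(query, doc), doc)
```

The position prefix keeps a chunk from matching the same four characters at a different offset in another hash.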

“More like this” queries
{
  "query": {
    "more_like_this": {
      "like": {
        "_index": "listings",
        "_type": "_doc",
        "_id": "cafebabe"
      },
      "max_query_terms": 100,
      "fields": ["title^2", "description", "hashes"]
    }
  }
}

Image index (simplified)
Listing ("Such description. So much text") → s3 → ObjectCreated
{
  "eventName": "ObjectCreated:Put",
  "s3": {
    "bucket": { "name": "pictures" },
    "object": { "key": "doge.jpg" }
  }
}
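
A hash-calculation worker first has to pull the bucket and key out of the event. The sketch below parses the simplified event shape shown on the slide; real S3 notifications wrap such entries in a "Records" list, and the function name `image_location` is hypothetical.

```python
def image_location(event):
    """Return (bucket, key) for ObjectCreated events, None for anything else."""
    if not event.get("eventName", "").startswith("ObjectCreated:"):
        return None
    s3 = event["s3"]
    return s3["bucket"]["name"], s3["object"]["key"]


# The simplified event from the slide.
event = {
    "eventName": "ObjectCreated:Put",
    "s3": {
        "bucket": {"name": "pictures"},
        "object": {"key": "doge.jpg"},
    },
}

print(image_location(event))  # ('pictures', 'doge.jpg')
```

With the bucket and key in hand, the worker can download the image and compute its hashes.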

s3 → ObjectCreated → Hash calculation → hashes → Ingestor → ES
Hash calculation: https://pypi.org/project/ImageHash/
Message in hashes:
{
  "dhash": "9687678c367b7b3a",
  "phash": "ad60ad89b54b0d3d",
  "whash": "fbf3804003199f9f"
}

It scalez!
231 rps
~10 mln, ~8 mln, ~8 mln

Image index (still simplified)
s3 → ObjectCreated → Hash calculation → hashes → Ingestor → ES
s3 → ObjectDeleted → Ingestor → ES

Image index
s3 → ObjectCreated → CNN+LSH → Ingestor → ES
s3 → ObjectDeleted → Ingestor → ES

CNN+LSH
Options for deploying image models:
● Lambda
○ Tricky to do: HUGE TF binaries
○ 66 USD per 1 mln
○ Worth it when load is low (< 1 mln)
● Kubernetes
○ Easier to do: docker + existing cluster
○ More expensive for low load
○ Better for 1+ mln images per day

[Diagram: the complete system. Listing images are stored in s3 and their hashes are indexed in ES, which the duplicate detection system queries. The automatic moderation system (ML) uses the duplicate detection system to accept a listing, reject it, or send it to the moderation queue, where moderators accept or reject it in the moderation panel (MP).]

Summary
● Fraud and duplicates often come together
● Use heuristics to find duplicate candidates and ML to find duplicates
● Image hashes are a good and easy way to find duplicate images
● Neural networks can be used for hashing as well
● Elasticsearch is good for finding duplicates (inverted index!)
● AWS Lambda can scale up and down with no human involvement
● Simple things (e.g. hashes) - better in AWS Lambda
● Complex heavy things (e.g. neural nets) - Kubernetes

Contact information
● http://alexeygrigorev.com & contact@alexeygrigorev.com
● https://github.com/alexeygrigorev
● https://www.linkedin.com/in/agrigorev
● https://www.slideshare.net/AlexeyGrigorev

Feedback (and link to the slides!)
https://forms.gle/XnVJk8QCeW9gTebP9
