Fighting fraud: finding duplicates at scale

Alexey Grigorev
2019/10/09

https://www.slideshare.net/AlexeyGrigorev/avito-duplicate-ads-detection-kaggle

Disclaimer
Not a presentation of the duplicate detection system at OLX

User generated content
Such description. So much text

Problems:
● Illegal content
● NSFW content
● Duplicates
● Spam
● Fraud

Fraud / Duplicates
Two sides of the same coin
* I don’t necessarily think that Dogecoin is a fraud

Fraud and Duplicates
Such description
So much text

Such good description,
so better text

Such goud description,
so better text

100$
deposit

Content moderation
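[Diagram: a listing ("Such description. So much text") goes to the automatic moderation system (ML), which can accept it, reject it, or put it in the moderation queue. Moderators handle the queue in the moderation panel (MP) and accept or reject.]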

[Diagram: inside the automatic moderation system (ML): duplicate detection, forbidden items, other ML models. Outcomes: accept, reject, moderation queue.]

https://www.slideshare.net/AlexeyGrigorev/duplicates-everywhere-berlin

Duplicate detection framework
Step 1: Candidate selection step: find candidate duplicates (10-200)
Step 2: Candidate scoring step: get real duplicates (0-50)

Step 1: Candidate selection step
How:
● Domain knowledge (heuristics)
● Information retrieval techniques
● Approximate kNN

Candidate selection
● Category
● City / district
● Seller id
● IP address of the seller
● Device signature
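
The selection keys above can be sketched as simple "blocking": two listings become a candidate pair if they share a value for at least one key. This is only an illustration; the function name `candidate_pairs`, the field names and the sample listings are hypothetical, and taking the union over keys is an assumption, not something the slides specify.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(listings, keys):
    """Return pairs of listing ids that share a value for at least one key."""
    pairs = set()
    for key in keys:
        # Group listing ids by the value of this key (e.g. all ads from one IP).
        buckets = defaultdict(list)
        for listing in listings:
            value = listing.get(key)
            if value is not None:
                buckets[value].append(listing["id"])
        # Every pair inside a bucket is a candidate duplicate.
        for ids in buckets.values():
            for a, b in combinations(sorted(ids), 2):
                pairs.add((a, b))
    return pairs


# Hypothetical listings, for illustration only.
listings = [
    {"id": 1, "seller_id": "s1", "ip": "10.0.0.1", "device": "d1"},
    {"id": 2, "seller_id": "s2", "ip": "10.0.0.1", "device": "d2"},
    {"id": 3, "seller_id": "s1", "ip": "10.0.0.7", "device": "d3"},
    {"id": 4, "seller_id": "s4", "ip": "10.0.0.9", "device": "d4"},
]
pairs = candidate_pairs(listings, keys=["seller_id", "ip", "device"])
print(sorted(pairs))
```

Listings 1 and 3 share a seller and listings 1 and 2 share an IP address, so only those two pairs go on to the scoring step; listing 4 has no candidates.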

Step 2: Candidate scoring step
Machine Learning!

Duplicate / Not duplicate
[Diagram: moderation queue → moderation panel (MP) → moderators: accept / reject]

Machine Learning
Labeled pairs:
ID1  ID2  Label
1    2    1
1    4    0
2    7    0
k    5    1
⇒ Feature engineering ⇒
ID1  ID2  Features         Label
1    2    [0, 1, ..., 5]   1
1    4    [2, 0, ..., 3]   0
2    7    [3, 1, ..., 3]   0
k    5    [5, 3, ..., 8]   1
⇒ Model
Tune F1/Precision/Recall
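
To make the "Tune F1/Precision/Recall" step concrete, here is a minimal sketch of how the three metrics change with the decision threshold. The labels match the table above; the scores are hypothetical model outputs, and `precision_recall_f1` is an illustrative helper, not part of any library.

```python
def precision_recall_f1(labels, scores, threshold):
    """Compute precision, recall and F1 when pairs with score >= threshold are duplicates."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


labels = [1, 0, 0, 1]          # labels of the four pairs in the table
scores = [0.9, 0.6, 0.2, 0.4]  # hypothetical model scores for the same pairs

for threshold in (0.5, 0.3):
    p, r, f1 = precision_recall_f1(labels, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Lowering the threshold from 0.5 to 0.3 catches the second duplicate (recall goes from 0.5 to 1.0) while keeping one false positive.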

Features
Pairwise distances/similarities
● Cosine between titles (TF-IDF)
● Cosine between descriptions (TF-IDF)
● Word2Vec
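
A minimal sketch of the TF-IDF cosine feature, written with the standard library only so it stays self-contained. The tokenization (lowercase, split on whitespace), the smoothed IDF formula and the example titles are assumptions for illustration; a real system would more likely use a library vectorizer.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Turn texts into sparse TF-IDF vectors (dicts of term -> weight)."""
    docs = [Counter(text.lower().split()) for text in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency: rare terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical listing titles.
titles = ["new iphone almost not used", "new iphone barely used", "old wooden table"]
vectors = tfidf_vectors(titles)
similar = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[0], vectors[2])
print(similar, unrelated)
```

The first two titles share three terms and get a positive similarity; the third shares none, so its similarity is zero. Each such number becomes one column of the feature vector for a pair.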

Hashes
● md5: cryptographic hash
● dhash, phash, whash: perceptual hashes
Image 1: dhash 94088af86c038327, md5 14ee7fe587860078a1109033318bd986
Image 2: dhash 94088af86c038327, md5 07aaedb9b75e88a6051184f01be5cc50
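
Why md5 alone is not enough can be shown with a small sketch: changing a single byte of a file gives a completely different md5, so only exact copies match. The byte strings below are placeholders standing in for real image files.

```python
import hashlib

# Placeholder bytes; in practice these would be the contents of two image files.
original = b"pretend these bytes are a JPEG file"
altered = original + b"\x00"  # the same content with one extra byte appended

md5_original = hashlib.md5(original).hexdigest()
md5_altered = hashlib.md5(altered).hexdigest()

print(md5_original)
print(md5_altered)
print("same md5:", md5_original == md5_altered)
```

A perceptual hash is designed for the opposite behavior: visually similar images should get the same or nearly the same hash.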

Dhash: difference hash
https://www.kaggle.com/iezepov/get-hash-from-images-slightly-daster/code
Read as b/w image, resize
Get numpy array
Difference between adjacent cells

Grayscale pixel values after resizing (8 rows, 9 columns):
     1   2   3   4   5   6   7   8   9
1  149 168 145 131 134 111 115 114 108
2  198 192 162 135 104 137 128 108  97
3  158 165 151 117 111 133 130 139 115
4   79  95 132 151 180 212 189 158 124
5   91  47  57  90  67  81 165 142 110
6  104  80  63  53  43  34  20  42 101
7  110 113 109  92  79  53  27  59 102
8  114 111 110 112 108  67  73  90 103

Differences between horizontally adjacent cells (right minus left):
     1   2   3   4   5   6   7   8
1   19 -23 -14   3 -23   4  -1  -6
2   -6 -30 -27 -31  33  -9 -20 -11
3    7 -14 -34  -6  22  -3   9 -24
4   16  37  19  29  32 -23 -31 -34
5  -44  10  33 -23  14  84 -23 -32
6  -24 -17 -10 -10  -9 -14  22  59
7    3  -4 -17 -13 -26 -26  32  43
8   -3  -1   2  -4 -41   6  17  13

Difference > 0:
   1     2     3     4     5     6     7     8
1  TRUE  FALSE FALSE TRUE  FALSE TRUE  FALSE FALSE
2  FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE
3  TRUE  FALSE FALSE FALSE TRUE  FALSE TRUE  FALSE
4  TRUE  TRUE  TRUE  TRUE  TRUE  FALSE FALSE FALSE
5  FALSE TRUE  TRUE  FALSE TRUE  TRUE  FALSE FALSE
6  FALSE FALSE FALSE FALSE FALSE FALSE TRUE  TRUE
7  TRUE  FALSE FALSE FALSE FALSE FALSE TRUE  TRUE
8  FALSE FALSE TRUE  FALSE FALSE TRUE  TRUE  TRUE

Each row of 8 bits read as one byte (TRUE = 1):
Row  Decimal  Hex
1    148      94
2    8        08
3    138      8a
4    248      f8
5    108      6c
6    3        03
7    131      83
8    39       27
dhash: 94088af86c038327
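
The dhash steps can be reproduced in a few lines. This sketch starts from the already resized 8x9 grayscale matrix shown on the slides (reading and resizing the image is left out) and is an illustration of the idea rather than the code of any particular library; `dhash_hex` is a hypothetical name.

```python
# The 8x9 grayscale matrix from the slides (after reading as b/w and resizing).
PIXELS = [
    [149, 168, 145, 131, 134, 111, 115, 114, 108],
    [198, 192, 162, 135, 104, 137, 128, 108, 97],
    [158, 165, 151, 117, 111, 133, 130, 139, 115],
    [79, 95, 132, 151, 180, 212, 189, 158, 124],
    [91, 47, 57, 90, 67, 81, 165, 142, 110],
    [104, 80, 63, 53, 43, 34, 20, 42, 101],
    [110, 113, 109, 92, 79, 53, 27, 59, 102],
    [114, 111, 110, 112, 108, 67, 73, 90, 103],
]


def dhash_hex(pixels):
    """Difference hash: one bit per adjacent pair of cells, one byte per row."""
    hex_bytes = []
    for row in pixels:
        # Bit is 1 when the right neighbour is brighter than the left one.
        bits = [right > left for left, right in zip(row, row[1:])]
        value = 0
        for bit in bits:
            value = (value << 1) | int(bit)
        hex_bytes.append(f"{value:02x}")
    return "".join(hex_bytes)


image_dhash = dhash_hex(PIXELS)
print(image_dhash)  # 94088af86c038327
```

Because only the sign of each difference is kept, small changes in brightness or compression usually leave most bits unchanged.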

Features: hashes
● Number of images with the same md5, phash, dhash, etc.
● Distances between hashes
94088af86c038327
94088af86c038328
Last 16 bits (8327 vs 8328):
1 0 0 0 0 0 1 1 0 0 1 0 0 1 1 1
1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0
Hamming distance is 4 bits
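
The distance between two hashes is the number of bits in which they differ (the Hamming distance). A minimal sketch, with `hamming_distance` as an illustrative helper name:

```python
def hamming_distance(hex_a, hex_b):
    """Number of differing bits between two hex-encoded hashes of equal length."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 in every position where the two hashes disagree.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


distance = hamming_distance("94088af86c038327", "94088af86c038328")
print(distance)  # 4
```

Only the last hex digit differs (7 = 0111, 8 = 1000), and all four of its bits flip, which gives the distance of 4.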

Candidate selection
● Category
● City / district
● Seller id
● IP address of the seller
● Device signature
● Image hashes

Image embeddings
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

https://keras.io/applications/

Image embeddings
CNN ⇒ embedding (dim 1k+) ⇒ SVD ⇒ dim 100 ⇒ LSH ⇒ hash (36a93c34a3abff)

LSH: Random Projection
● Close in the original space ⇒ likely close in the projection
● Far in the original space ⇒ likely far in the projection

LSH: Random projections
● Generate the projection vectors once and store them somewhere
● Use the vectors to reduce the dimensionality and compute the hash
● Store the hash in the database
https://www.slideshare.net/AlexeyGrigorev/duplicates-everywhere
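
The three steps can be sketched as follows. This is a minimal illustration of sign-based random projections, assuming a 100-dimensional embedding and a 16-bit hash; the function names, the seed values and the random stand-in embedding are all hypothetical, and "storing" the vectors is reduced to regenerating them from a fixed seed.

```python
import random


def make_projections(n_bits, dim, seed=0):
    """Generate n_bits random Gaussian vectors; a fixed seed stands in for storing them."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_hash(vector, projections):
    """One bit per projection: 1 if the vector lies on the positive side of it."""
    bits = 0
    for plane in projections:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits = (bits << 1) | (1 if dot > 0 else 0)
    n_hex = (len(projections) + 3) // 4
    return f"{bits:0{n_hex}x}"


# Step 1: generate the projection vectors once.
projections = make_projections(n_bits=16, dim=100)

# Step 2: hash an embedding (random numbers stand in for a real CNN+SVD output).
rng = random.Random(42)
embedding = [rng.uniform(-1.0, 1.0) for _ in range(100)]
image_hash = lsh_hash(embedding, projections)

# Step 3: image_hash is the short string that would be stored in the database.
print(image_hash)
```

Vectors that point in similar directions fall on the same side of most projections, so their hashes differ in few bits.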

Why Elasticsearch?
● Well-known, convenient, stable and scalable inverted index (thanks, Lucene!)
Direct index:
ImageID  Hash
1        00fc
2        12ec
3        00fc
4        ebe4
5        7a1f
6        00fc
7        8ef4
8        12ec
Inverted index:
Hash  ImageID
00fc  1 3 6
12ec  2 8
ebe4  4
7a1f  5
8ef4  7
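
The relation between the two indexes can be sketched in plain Python: inverting the direct index groups image ids by hash, so that finding all images with a given hash becomes a single lookup. The data is taken from the slide; `invert` is an illustrative helper, not how Lucene implements it.

```python
from collections import defaultdict

# Direct index from the slide: ImageID -> Hash.
direct_index = {
    1: "00fc", 2: "12ec", 3: "00fc", 4: "ebe4",
    5: "7a1f", 6: "00fc", 7: "8ef4", 8: "12ec",
}


def invert(direct):
    """Build the inverted index: Hash -> sorted list of ImageIDs."""
    inverted = defaultdict(list)
    for image_id in sorted(direct):
        inverted[direct[image_id]].append(image_id)
    return dict(inverted)


inverted_index = invert(direct_index)
print(inverted_index["00fc"])  # [1, 3, 6]
```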

Elasticsearch for hashes
For “fuzzy lookups”, chunk the hash:
"94088af86c038327" => "1:9408 2:8af8 3:6c03 4:8327"
{
  "_id": "cafebabe",
  "_source": {
    "title": "new iphone",
    "description": "new iphone almost not used",
    "hashes": ["1:9408 2:8af8 3:6c03 4:8327", ... ]
  }
}
Let Elasticsearch treat the chunks as ordinary tokens, e.g. by using the whitespace tokenizer
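
A minimal sketch of the chunking, assuming 4-character chunks prefixed with their position as in the example above. `chunk_hash` is a hypothetical helper name.

```python
def chunk_hash(hex_hash, chunk_size=4):
    """Split a hex hash into position-prefixed chunks separated by spaces."""
    chunks = [hex_hash[i:i + chunk_size] for i in range(0, len(hex_hash), chunk_size)]
    return " ".join(f"{pos}:{chunk}" for pos, chunk in enumerate(chunks, start=1))


tokens_a = chunk_hash("94088af86c038327")
tokens_b = chunk_hash("94088af86c038328")
shared = set(tokens_a.split()) & set(tokens_b.split())

print(tokens_a)     # 1:9408 2:8af8 3:6c03 4:8327
print(len(shared))  # 3
```

The position prefix keeps a chunk from matching the same characters at a different offset. Two hashes that differ only in the last chunk still share three of four tokens, so a token-based search can find near matches without comparing every pair of hashes.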

Implementation Details
“More like this” queries
{
  "query": {
    "more_like_this": {
      "like": {
        "_index": "listings",
        "_type": "_doc",
        "_id": "cafebabe"
      },
      "max_query_terms": 100,
      "fields": ["title^2", "description", "hashes"]
    }
  }
}
Document used in "like":
{
  "_id": "cafebabe",
  "_source": {
    "title": "new iphone",
    "description": "new iphone almost not used",
    "hashes": ["1:9408 2:8af8 3:6c03 4:8327", ... ]
  }
}

Image index (simplified)
[Diagram: listing ("Such description. So much text") ⇒ s3]

s3 ⇒ ObjectCreated
{
  "eventName": "ObjectCreated:Put",
  "s3": {
    "bucket": { "name": "pictures" },
    "object": { "key": "doge.jpg" }
  }
}
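
A minimal sketch of how a consumer could pull the bucket and key out of such an event before calculating hashes. It works on the simplified record shown above; in real S3 notifications the records arrive wrapped in a "Records" list, and `handle_record` is a hypothetical function name.

```python
def handle_record(record):
    """Return (bucket, key) for ObjectCreated events, or None for anything else."""
    if not record.get("eventName", "").startswith("ObjectCreated"):
        return None
    s3 = record["s3"]
    return s3["bucket"]["name"], s3["object"]["key"]


# The simplified event from the slide.
record = {
    "eventName": "ObjectCreated:Put",
    "s3": {
        "bucket": {"name": "pictures"},
        "object": {"key": "doge.jpg"},
    },
}

target = handle_record(record)
print(target)  # ('pictures', 'doge.jpg')
```

The returned bucket and key are what the hash calculation step needs to download the image.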


s3 ⇒ ObjectCreated ⇒ Hash calculation
https://pypi.org/project/ImageHash/

Hash calculation ⇒ hashes
{
  "dhash": "9687678c367b7b3a",
  "phash": "ad60ad89b54b0d3d",
  "whash": "fbf3804003199f9f"
}


s3 ⇒ ObjectCreated ⇒ Hash calculation ⇒ hashes ⇒ Ingestor ⇒ ES

It scalez!
[Charts: invocations per hour and invocations per day; 190 rps]

Image index (still simplified)
s3 ⇒ ObjectCreated ⇒ Hash calculation ⇒ hashes ⇒ Ingestor ⇒ ES
s3 ⇒ ObjectDeleted ⇒ Ingestor ⇒ ES
Image index
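[Diagram: s3 emits ObjectCreated and ObjectDeleted events; the hash pipeline, CNN+LSH and the delete events each have their own Ingestor, and all Ingestors write to ES]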

CNN+LSH
Options for deploying image models:
● Lambda
○ Tricky to do: HUGE TF binaries
○ 66 USD per 1 mln
○ Worth it when load is low (< 1 mln)
● Kubernetes
○ Easier to do: Docker + existing cluster
○ More expensive for low load
○ Better for 1+ mln images per day

The big picture
[Diagram: the content moderation flow (listing ⇒ automatic moderation system (ML) ⇒ accept / reject / moderation queue ⇒ moderation panel (MP) ⇒ moderators) together with the duplicate detection system and the image index (s3 ⇒ hashes ⇒ ES)]
