63. LSH: Random Projection
● Close in the original space ⇒ close in the projection
● Far in the original space ⇒ far in the projection
64. LSH: Random projections
https://www.slideshare.net/AlexeyGrigorev/duplicates-everywhere
● Generate the projection vectors once and store them somewhere
● Use the vectors to reduce the dimensionality and compute the hash
● Store the hash in the database
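The three steps above can be sketched in a few lines. This is a minimal illustration of sign-based random projection, not the implementation from the talk: the function names, the 64-bit hash length, the 128-dimensional input and the fixed seeds are all assumptions made for the example.

```python
import random

def make_projections(n_bits, dim, seed=0):
    """Generate n_bits random Gaussian vectors once (hypothetical helper).

    In a real system these would be persisted and reused for every image,
    otherwise hashes computed at different times are not comparable.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_hash(vector, projections):
    """Project the vector on each random direction and keep only the sign.

    Each projection contributes one bit; the bits are packed into a hex string.
    """
    bits = 0
    for proj in projections:
        dot = sum(p * x for p, x in zip(proj, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    n_hex = (len(projections) + 3) // 4
    return format(bits, "0{}x".format(n_hex))

def hamming(h1, h2):
    """Number of differing bits between two hex hashes."""
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

# Assumed sizes: 64-bit hash, 128-dimensional feature vectors.
projections = make_projections(n_bits=64, dim=128, seed=42)

rng = random.Random(1)
original = [rng.gauss(0, 1) for _ in range(128)]
near_duplicate = [x + rng.gauss(0, 0.01) for x in original]  # tiny perturbation
unrelated = [rng.gauss(0, 1) for _ in range(128)]

h_original = lsh_hash(original, projections)
h_near = lsh_hash(near_duplicate, projections)
h_unrelated = lsh_hash(unrelated, projections)

print(h_original, hamming(h_original, h_near), hamming(h_original, h_unrelated))
```

Vectors that are close in the original space should differ in few bits, while unrelated vectors should differ in roughly half of them.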
66. Why Elasticsearch?
● Well-known, convenient, stable and scalable inverted index (thanks, Lucene!)
Direct index (ImageID → Hash)
ImageID  Hash
1        00fc
2        12ec
3        00fc
4        ebe4
5        7a1f
6        00fc
7        8ef4
8        12ec
Inverted index (Hash → ImageID)
Hash  ImageID
00fc  1, 3, 6
12ec  2, 8
ebe4  4
7a1f  5
8ef4  7
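To make the difference between the two indexes concrete, here is a small sketch that builds the inverted index from the direct one, using the values from the slide. The variable and function names are made up for the example; Lucene's actual data structures are far more elaborate.

```python
from collections import defaultdict

# Direct index from the slide: ImageID -> Hash
direct_index = {
    1: "00fc", 2: "12ec", 3: "00fc", 4: "ebe4",
    5: "7a1f", 6: "00fc", 7: "8ef4", 8: "12ec",
}

def invert(direct):
    """Build Hash -> [ImageID, ...] from ImageID -> Hash."""
    inverted = defaultdict(list)
    for image_id, image_hash in sorted(direct.items()):
        inverted[image_hash].append(image_id)
    return dict(inverted)

inverted_index = invert(direct_index)

# Looking up a hash now returns all images that share it.
print(inverted_index["00fc"])
```

With the inverted index, finding every image with a given hash is a single lookup instead of a scan over all images.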
67. Elasticsearch for hashes
For “fuzzy lookups” chunk the hash:
"94088af86c038327" => "1:9408 2:8af8 3:6c03 4:8327"
{
  "_id": "cafebabe",
  "_source": {
    "title": "new iphone",
    "description": "new iphone almost not used",
    "hashes": ["1:9408 2:8af8 3:6c03 4:8327", ... ]
  }
}
Let Elasticsearch treat the chunks as ordinary tokens,
e.g. by using the whitespace tokenizer
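The chunking can be sketched as follows. The helper names, the chunk size of 4 and the threshold of 3 matching chunks are assumptions for illustration; the query body uses the standard `match` query with `minimum_should_match`, but it is only one possible way to express a fuzzy lookup and is not taken from the talk.

```python
def chunk_hash(hex_hash, chunk_size=4):
    """Split a hex hash into position-prefixed chunks separated by spaces."""
    chunks = [hex_hash[i:i + chunk_size] for i in range(0, len(hex_hash), chunk_size)]
    return " ".join("{}:{}".format(pos, chunk) for pos, chunk in enumerate(chunks, start=1))

def shared_chunks(hash_a, hash_b):
    """Count chunks that match at the same position (what the index would match on)."""
    return len(set(chunk_hash(hash_a).split()) & set(chunk_hash(hash_b).split()))

def fuzzy_query(hex_hash, min_matching_chunks=3):
    """Build a query body that requires at least min_matching_chunks tokens to match.

    Assumes the "hashes" field is analyzed with a whitespace tokenizer.
    """
    return {
        "query": {
            "match": {
                "hashes": {
                    "query": chunk_hash(hex_hash),
                    "minimum_should_match": min_matching_chunks,
                }
            }
        }
    }

print(chunk_hash("94088af86c038327"))
print(shared_chunks("94088af86c038327", "94088af86c03ffff"))
print(fuzzy_query("94088af86c038327"))
```

The position prefix keeps a chunk such as `9408` in position 1 from matching the same characters in another position, so two hashes count as similar only when they agree chunk by chunk.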
86. CNN+LSH
Options for deploying image models:
● Lambda
○ Tricky to do: huge TensorFlow binaries
○ 66 USD per 1 mln images
○ Worth it when load is low (< 1 mln images per day)
● Kubernetes
○ Easier to do: Docker + existing cluster
○ More expensive for low load
○ Better for 1+ mln images per day
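The trade-off can be expressed as simple arithmetic. Only the 66 USD per 1 mln figure comes from the slide; the slide gives no cluster price, so the cluster cost below is a purely hypothetical parameter chosen to show how a break-even point could be computed.

```python
LAMBDA_USD_PER_MILLION = 66.0  # figure from the slide

def lambda_cost(images):
    """Pay-per-use cost in USD: grows linearly with the number of images."""
    return LAMBDA_USD_PER_MILLION * images / 1_000_000

def break_even_images(cluster_cost_usd):
    """Volume at which a fixed cluster cost equals the pay-per-use cost.

    cluster_cost_usd is a hypothetical fixed cost for the same period.
    """
    return cluster_cost_usd / LAMBDA_USD_PER_MILLION * 1_000_000

# Hypothetical example: a cluster share costing 66 USD per day
# would break even at 1 mln images per day.
print(lambda_cost(500_000))
print(break_even_images(66.0))
```

Below the break-even volume the pay-per-use option is cheaper; above it the fixed-cost cluster wins.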
88. The big picture
[Architecture diagram. Labels, in order of appearance: ML; a sample listing ("Such description", "So much text"); Accept / Reject; moderation queue; MP; automatic moderation system; moderation panel; Accept / Reject; moderators; s3; ES; duplicate detection system; image index; hashes.]