A Gentle Introduction to Locality Sensitive Hashing with Apache Spark

An implementation war story of locality-sensitive hashing with Apache Spark, with performance lessons.

Published in: Software


  1. A GENTLE INTRODUCTION TO APACHE SPARK AND LOCALITY-SENSITIVE HASHING
  2. FRANCOIS GARILLOT
     (FORMERLY) TYPESAFE
     francois@garillot.net
     @huitseeker
  3. LOCALITY-SENSITIVE HASHING
     ▸ A story: why LSH
     ▸ How it works & hash families
     ▸ LSH distribution
     ▸ Beware: WIP
  4. SPARK TENETS
     ▸ broadcast variables
     ▸ per-partition commands
     ▸ shuffle sparsely
  5. (no text in transcript)
  6. (no text in transcript)
  7. (no text in transcript)
  8. SEGMENTATION
     ▸ small sample: 289421 users
     ▸ larger sample: 5684403 users
     46K websites, ultimately users
     4 personal laptops, 4 provided laptops
  9. K-MEANS COMPLEXITY
     Find $k$ with the 'elbow method' on the within-cluster sum of squares.
     Then […]
  10. EM - GAUSSIAN MIXTURE
      With […] dimensions, […] mixtures, […]
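  11. LOCALITY-SENSITIVE HASHING FUNCTIONS
      A family H of hashing functions is $(d_1, d_2, p_1, p_2)$-sensitive if, for any points $x, y$ and $h$ drawn uniformly from H:
      ▸ if $d(x, y) \le d_1$ then $\Pr[h(x) = h(y)] \ge p_1$
      ▸ if $d(x, y) \ge d_2$ then $\Pr[h(x) = h(y)] \le p_2$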
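The sensitivity condition can be checked by hand for the simplest family, bit sampling under Hamming distance: a uniformly chosen index agrees on two vectors with probability exactly 1 minus their normalized Hamming distance, so closer points collide more often. The sketch below is plain Scala under that assumption; the object and method names are hypothetical and not taken from the talk.

```scala
object BitSamplingSensitivity {
  // Number of positions where the two bit vectors differ.
  def hamming(x: Vector[Boolean], y: Vector[Boolean]): Int = {
    require(x.length == y.length, "vectors must have the same length")
    x.zip(y).count { case (a, b) => a != b }
  }

  // Probability, over a uniformly chosen index i, that x(i) == y(i).
  def collisionProbability(x: Vector[Boolean], y: Vector[Boolean]): Double =
    1.0 - hamming(x, y).toDouble / x.length
}
```

For two 4-bit vectors that differ in one position, `collisionProbability` returns 0.75.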
  12. DISTANCES! (THOSE AND MANY OTHERS)
      ▸ Hamming distance: $h_i(x) = x_i$, where $i$ is a randomly chosen index
      ▸ Jaccard: […]
      ▸ Cosine distance: […]
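The formulas for these families did not survive in the transcript. As a hedged illustration, here is one commonly used hash function per distance: bit sampling for Hamming, MinHash for Jaccard, and a random hyperplane for cosine. All names are hypothetical, and the MinHash uses a random linear map modulo a prime as a stand-in for a true random permutation.

```scala
import scala.util.Random

object HashFamilies {
  // Hamming: project a bit vector onto one randomly chosen index i.
  def bitSampling(dim: Int, rng: Random): Vector[Boolean] => Boolean = {
    val i = rng.nextInt(dim)
    v => v(i)
  }

  // Jaccard: MinHash over a non-empty set of item ids.
  // Assumption: a random linear map modulo a prime approximates a permutation.
  def minHash(rng: Random): Set[Int] => Long = {
    val prime = 2147483647L // 2^31 - 1
    val a = 1L + rng.nextInt(Int.MaxValue - 1)
    val b = rng.nextInt(Int.MaxValue).toLong
    s => s.map(x => Math.floorMod(a * x + b, prime)).min
  }

  // Cosine: which side of a random Gaussian hyperplane the vector falls on.
  def randomHyperplane(dim: Int, rng: Random): Vector[Double] => Boolean = {
    val r = Vector.fill(dim)(rng.nextGaussian())
    v => v.zip(r).map { case (x, y) => x * y }.sum >= 0.0
  }
}
```

Each factory draws its randomness once and returns a deterministic function, so the same hash can be applied to every record.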
  13. EARTH MOVER'S DISTANCE
  14. EARTH MOVER'S DISTANCE
      Find optimal F minimizing: […]
      Then: […]
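The two formulas on this slide are missing from the transcript. For reference, the usual textbook formulation of the Earth Mover's Distance is given below; the notation ($P$, $Q$, weights $w$, ground distance $d_{ij}$, flow $F = [f_{ij}]$) is an assumption and may differ from the slide's.

```latex
% Signatures P = {(p_i, w_{p_i})}_{i=1..m} and Q = {(q_j, w_{q_j})}_{j=1..n},
% ground distance d_{ij} = d(p_i, q_j), flow F = [f_{ij}].
\min_{F}\ \mathrm{WORK}(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}\, d_{ij}
\quad \text{subject to} \quad
f_{ij} \ge 0,\quad
\sum_{j} f_{ij} \le w_{p_i},\quad
\sum_{i} f_{ij} \le w_{q_j},\quad
\sum_{i,j} f_{ij} = \min\Big(\sum_i w_{p_i},\ \sum_j w_{q_j}\Big)

\mathrm{EMD}(P, Q) = \frac{\sum_{i,j} f_{ij}\, d_{ij}}{\sum_{i,j} f_{ij}}
```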
  15. A WORD ON MODULARITY
      LSH for EMD introduced by Charikar in the Simhash paper (2002).
      Yet no place to plug your LSH family into implementations (e.g. scikit, mrsqueeze)!
  16. LSH AMPLIFICATION: CONCATENATIONS AND PARALLEL
      ▸ basic LSH: […]
      ▸ AND (series) construction: […]
      ▸ OR (parallel) construction: […]
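The amplification formulas are missing from the transcript, but the standard argument is short: if one hash collides with probability p, then r concatenated hashes all collide with probability p^r (AND), and at least one of b parallel hashes collides with probability 1 - (1 - p)^b (OR). A minimal sketch assuming independent hash draws; the names are hypothetical.

```scala
object Amplification {
  // AND (series): all r concatenated hashes must agree.
  def and(p: Double, r: Int): Double = math.pow(p, r)

  // OR (parallel): at least one of b independent hashes agrees.
  def or(p: Double, b: Int): Double = 1.0 - math.pow(1.0 - p, b)

  // b bands of r hashes each: AND within a band, OR across bands.
  def banded(p: Double, r: Int, b: Int): Double = or(and(p, r), b)
}
```

With the layout quoted later in the talk (10 parallel bands of 100 concatenated hashes), the candidate probability for a pair with single-hash collision probability `p` would be `Amplification.banded(p, 100, 10)`.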
  17. (no text in transcript)
  18. BASIC LSH
      // Hash every record, keyed by its id.
      val hashCollection = records.map(s => (getId(s), s))
        .mapValues(s => getHash(s, hashers))
      // Cut each hash into bands, keyed by band index.
      val subArray = hashCollection.flatMap { case (recordId, hash) =>
        hash.grouped(hashLength / numberBands).zipWithIndex.map {
          case (band, bandIndex) => (bandIndex, (band, recordId))
        }
      }
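The banding step above can be tried without a cluster. The sketch below mirrors it on plain Scala collections, assuming integer signatures and string record ids; `Banding`, `bands` and `buckets` are hypothetical names, and `groupBy` stands in for the grouping a Spark job would do on the RDD.

```scala
object Banding {
  // Split one signature into numberBands bands, keyed by band index.
  def bands(recordId: String, hash: Array[Int], numberBands: Int): Seq[(Int, (Seq[Int], String))] = {
    require(hash.length % numberBands == 0, "hash length must be a multiple of numberBands")
    hash.grouped(hash.length / numberBands).zipWithIndex.map {
      case (band, bandIndex) => (bandIndex, (band.toList, recordId))
    }.toSeq
  }

  // Records sharing the same band content at the same band index land in one bucket.
  def buckets(records: Seq[(String, Array[Int])], numberBands: Int): Map[(Int, Seq[Int]), Set[String]] =
    records
      .flatMap { case (id, h) => bands(id, h, numberBands) }
      .groupBy { case (bandIndex, (band, _)) => (bandIndex, band) }
      .map { case (key, entries) => key -> entries.map(_._2._2).toSet }
}
```

Two records become candidates for comparison as soon as they share at least one bucket.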
  19. LOOKUP
      def findCandidates(record: Iterable[String], hashers: Array[Int => Int],
                         mBands: BandType) = {
        val hash = getHash(record, hashers)
        val subArrays = partitionArray(hash).zipWithIndex
        // Look up each band of the query in the bucket map for its band index.
        subArrays.flatMap { case (band, bandIndex) =>
          val hashedBucket = mBands.lookup(bandIndex)
            .headOption
            .flatMap { _.get(band) }
          hashedBucket
        }.flatten.toSet
      }
  20. DISTRIBUTE RANDOM SEEDS, NOT PERMUTATION FUNCTIONS
      getHash(record, hashers)
      records.mapPartitions { iter =>
        // One generator per partition; hash functions are rebuilt locally.
        val rng = new scala.util.Random()
        iter.map(x => hashers.flatMap { h => getHashFunction(rng, h)(x) })
      }
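One way to read this slide: ship small integer seeds to the workers and rebuild each hash function there, so that no large permutation table has to be serialized. The sketch below shows that idea in plain Scala, assuming integer tokens and a linear hash modulo a prime; `hashFunctionFromSeed` and this `getHash` signature are hypothetical and differ from the talk's `getHashFunction(rng, h)`.

```scala
import scala.util.Random

object SeededHashers {
  private val prime = 2147483647L // 2^31 - 1

  // Rebuild a hash function deterministically from its seed.
  def hashFunctionFromSeed(seed: Int): Int => Int = {
    val rng = new Random(seed)
    val a = 1L + rng.nextInt(Int.MaxValue - 1)
    val b = rng.nextInt(Int.MaxValue).toLong
    x => Math.floorMod(a * x + b, prime).toInt
  }

  // MinHash-style signature of a non-empty record: one minimum per seed.
  def getHash(record: Iterable[Int], seeds: Array[Int]): Array[Int] =
    seeds.map { seed =>
      val h = hashFunctionFromSeed(seed)
      record.map(h).min
    }
}
```

Because every function is a pure function of its seed, any worker that receives the same `seeds` array computes the same signature for the same record.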
  21. AND YET, OOM
  22. BASIC LSH WITH A 2-STABLE GAUSSIAN DISTRIBUTION
      With […] data points, choose […] and […], to solve the […] problem
  23. WEB LOGS ARE SPARSE
      Input: hits per user, over 6 months, 2x50-ish integers/user (4GB)
      Output of length 1000 integers per user: 10 (parallel) bands, 100 (concatenated) hashes
      64-bit integers: 40 GB
      Yet!
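The sizes quoted here are consistent with the larger sample from the segmentation slide. A quick check, assuming 8 bytes per integer for both input and output (the slide only states 64-bit for the output) and 5684403 users; the object name is hypothetical.

```scala
object SignatureSize {
  val users = 5684403L    // larger sample from the segmentation slide
  val bytesPerLong = 8L   // assumption: every integer is stored on 64 bits

  def bytes(intsPerUser: Long): Long = users * intsPerUser * bytesPerLong

  val inputBytes: Long = bytes(2 * 50)     // 2x50-ish integers per user
  val outputBytes: Long = bytes(10 * 100)  // 10 bands of 100 hashes per user
}
```

This gives about 4.5 GB of input and about 45 GB of signatures, the same order as the slide's 4GB and 40 GB: hashing inflates the data roughly tenfold.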
  24. ENTROPY LSH (PANIGRAHY 2006)
      REPLACE TABLES BY OFFSETS
      […], […], chosen randomly from the surface of $B(q, r)$, the sphere of radius $r$ centered at the query point $q$
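Drawing offsets uniformly from the surface of a sphere around a query point can be done by normalizing a Gaussian vector. The sketch below assumes dense `Vector[Double]` points; the names `EntropyOffsets`, `offset` and `offsets`, and the parameters `q`, `r` and `count`, are hypothetical.

```scala
import scala.util.Random

object EntropyOffsets {
  // One point drawn uniformly from the sphere of radius r centered at q.
  def offset(q: Vector[Double], r: Double, rng: Random): Vector[Double] = {
    val g = Vector.fill(q.length)(rng.nextGaussian())
    val norm = math.sqrt(g.map(x => x * x).sum)
    q.zip(g).map { case (qi, gi) => qi + r * gi / norm }
  }

  // Several independent offsets around the same query point.
  def offsets(q: Vector[Double], r: Double, count: Int, rng: Random): Seq[Vector[Double]] =
    Seq.fill(count)(offset(q, r, rng))
}
```

At query time each offset is hashed with the same hash function as the query and its bucket is searched, which is how offsets take the place of additional hash tables.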
  25. ENTROPY LSH WITH A 2-STABLE GAUSSIAN DISTRIBUTION
      With […] data points, choose […] and […], to solve the […] problem with as few as […] hash tables
  26. BUT ... NETWORK COSTS
      ▸ Basic LSH: look up […] buckets,
      ▸ Entropy LSH: search for […] offsets
  27. LAYERED LSH (BAHMANI ET AL. 2012)
      Output of your LSH family is in […], with e.g. a cosine norm.
      For closer points, the chance of hashes hashing to the same bucket is high!
  28. LAYERED LSH
      Have an LSH family […] for your norm on […]
      Likely that […] for all offsets […]
  29. LAYERED LSH
      Output of hash generation is (GH(p), (H(p), p)) for all p.
      In Spark, group, or use a custom partitioner for the (H(p), p) RDD.
      Network cost: […]
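The `(GH(p), (H(p), p))` layout can be sketched in plain Scala: an inner family H maps a point to a bucket vector, and an outer family G hashes that bucket vector to a machine id, so that nearby buckets tend to land on the same machine. Everything below is an assumption made for illustration: Gaussian projections for both layers, the widths `w` and `d`, and the names `Inner`, `Outer` and `keyed`.

```scala
import scala.util.Random

object LayeredLsh {
  type Point = Vector[Double]

  def dot(x: Point, y: Point): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  private def gaussian(dim: Int, rng: Random): Point =
    Vector.fill(dim)(rng.nextGaussian())

  // Inner family H: k Gaussian projections with bucket width w.
  final class Inner(dim: Int, k: Int, w: Double, rng: Random) {
    private val as = Vector.fill(k)(gaussian(dim, rng))
    private val bs = Vector.fill(k)(rng.nextDouble() * w)
    def apply(p: Point): Vector[Int] =
      as.zip(bs).map { case (a, b) => math.floor((dot(a, p) + b) / w).toInt }
  }

  // Outer family G: one coarser projection of H(p), folded onto machine ids.
  final class Outer(k: Int, d: Double, machines: Int, rng: Random) {
    private val alpha = gaussian(k, rng)
    private val beta = rng.nextDouble() * d
    def apply(bucket: Vector[Int]): Int = {
      val cell = math.floor((dot(alpha, bucket.map(_.toDouble)) + beta) / d).toLong
      Math.floorMod(cell, machines.toLong).toInt
    }
  }

  // Emit (GH(p), (H(p), p)) for every point p.
  def keyed(points: Seq[Point], h: Inner, g: Outer): Seq[(Int, (Vector[Int], Point))] =
    points.map { p =>
      val hp = h(p)
      (g(hp), (hp, p))
    }
}
```

In a Spark job the first component of each pair would serve as the grouping key or as the input to a custom partitioner, as the slide suggests.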
  30. PERFORMANCE
  31. FUTURE WORK
      HAVE A (BIG) WEBLOG?
      ▸ Weve
      ▸ Yandex
  32. FUTURE WORK
      LOCALITY-SENSITIVE HASHING FORESTS!
  33. RELEASE
      github.com/huitseeker/spark-lsh
      1 SEPT 2015
