
Locality Sensitive Hashing By Spark


Spark Summit 2016 talk by Kelvin Chu (Uber) and Alain Rodriguez (Uber)

Published in: Data & Analytics


  1. 1. Locality Sensitive Hashing by Spark. Alain Rodriguez (Fraud Platform, Uber) and Kelvin Chu (Hadoop Platform, Uber). June 08, 2016
  2. 2. Overlapping Routes: Finding similar trips in a city
  3. 3. The problem: Detect trips with a high degree of overlap. We are interested in detecting trips that have various degrees of overlap. • Large number of trips • Noisy, inconsistent GPS data • Not looking for exact matches • Directionality is important
  4. 4. Input Data: Millions of trips scattered over time and space. GPS traces are represented as an ordered list of (latitude, longitude, time) tuples. • Coordinates are reals and have noise • Traces can be dense or sparse, yet overlapping • Large time and geographic search space. Example: [ { "latitude": 25.7613453844, "epoch": 1446577692, "longitude": -80.197244976 }, { "latitude": 25.7613489535, "epoch": 1446577693, "longitude": -80.1972450862 }, … ]
  5. 5. Google S2 Cells: Efficient geo hashing. S2 divides the world into consistently sized regions. Area segments are available in different sizes.
  6. 6. Jaccard index: Set similarity coefficient. The Jaccard index can be used as a measure of set similarity: J(A, B) = |A ∩ B| / |A ∪ B|. Example: A = {a, b, c}, B = {b, c, d}, C = {c, d, e}; J(A, A) = 1.0, J(A, B) = 0.5, J(A, C) = 0.2.
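The Jaccard values on this slide can be reproduced with a few lines of plain Python. This is only a minimal sketch: the function name `jaccard` and the choice to return 1.0 for two empty sets are assumptions made for illustration, not something the talk specifies.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|.

    Returning 1.0 for two empty sets is an assumption of this sketch.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# The three example sets from the slide.
A, B, C = {"a", "b", "c"}, {"b", "c", "d"}, {"c", "d", "e"}

print(jaccard(A, A), jaccard(A, B), jaccard(A, C))  # prints: 1.0 0.5 0.2
```

The Jaccard distance used later in the deck is simply `1 - jaccard(a, b)`.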
  7. 7. Heuristic: Densify sparse traces. Sparse and dense traces should be matched. Different devices generate varying data densities. Two segments that start and end at the same location should be detected as overlapping. Ensure points are at most X distance apart: densification ensures that continuous segments are independently overlapping. (Figure: traces between points A and B.)
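One way to picture densification is linear interpolation between consecutive points until no gap exceeds the chosen maximum. The sketch below is an assumption-laden illustration, not the speakers' code: it treats points as planar `(x, y)` pairs with Euclidean distance rather than real latitude/longitude, and the names `densify` and `max_gap` are hypothetical.

```python
import math


def densify(points, max_gap):
    """Insert interpolated points so consecutive points are at most max_gap apart.

    Assumes planar (x, y) coordinates; real GPS data would need a geodesic
    distance instead of math.hypot.
    """
    if not points:
        return []
    dense = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dist = math.hypot(x1 - x0, y1 - y0)
        # Number of equal sub-steps needed so that each one is <= max_gap.
        steps = max(1, math.ceil(dist / max_gap))
        for k in range(1, steps + 1):
            t = k / steps
            dense.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return dense


sparse_trace = [(0.0, 0.0), (10.0, 0.0)]
dense_trace = densify(sparse_trace, max_gap=2.5)
print(len(dense_trace))  # prints: 5
```

After this step a sparse trace and a dense trace along the same road touch the same sequence of area segments, which is what the later set comparison relies on.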
  8. 8. Heuristic: Remove contiguous duplicates. This removes noise resulting from a vehicle stopped at a light or a very chatty device. Heuristic: Discretize route segments. Break down routes into equal-size area segments; this eliminates route noise. Segment size determines matching sensitivity.
  9. 9. Heuristic: Shingle contiguous area segments. Directionality matters, and shingling captures directionality. Two overlapping trips with opposite directions should not be matched. Combining contiguous segments captures the sequence of moves from one segment to another. Example: a trip through area segments 1 to 8 yields the shingles 1->2, 2->3, 3->4, 4->5, 5->6, 6->7, 7->8; the same route in the opposite direction yields 2->1, 3->2, 4->3, 5->4, 6->5, 7->6, 8->7.
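The duplicate-removal and shingling heuristics can be sketched on a trip that has already been discretized into area segments. In this illustration the small integers stand in for real cell identifiers (an assumption), and the function names are hypothetical.

```python
from itertools import groupby


def remove_contiguous_duplicates(cells):
    """Collapse runs of the same cell, e.g. a vehicle waiting at a light."""
    return [cell for cell, _ in groupby(cells)]


def shingle(cells):
    """Return the set of directed moves (from_cell, to_cell) along the trip."""
    deduped = remove_contiguous_duplicates(cells)
    return set(zip(deduped, deduped[1:]))


# Integer cell IDs are placeholders for real area-segment identifiers.
forward_trip = [1, 2, 2, 2, 3, 4, 5, 6, 7, 8]
reverse_trip = list(reversed(forward_trip))

forward_shingles = shingle(forward_trip)
reverse_shingles = shingle(reverse_trip)

print(sorted(forward_shingles))
print(forward_shingles & reverse_shingles)  # prints: set()
```

Because each shingle is an ordered pair, the two trips cover the same cells but share no shingles, so their Jaccard index is 0 and they are not matched.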
  10. 10. Set overlap problem: Find traces that have the desired level of common shingles, e.g. 1->2, 2->3, 3->4, 4->5, 5->6, 6->7, 7->8, 8->9, 9->10.
  11. 11. N^2 takes forever: LSH to the rescue ● Sifting through a month’s worth of trips for a city takes forever with the N^2 approach of comparing every pair of trips ● Locality-Sensitive Hashing allows us to find most matches quickly. Spark provides the perfect engine.
  12. 12. Locality-Sensitive Hashing (LSH): Quick Introduction
  13. 13. Problem - Near Neighbors Search: Set of Points P, Distance Function D, Query Point Q
  14. 14. Problem - Clustering: Set of Points P, Distance Function D
  15. 15. Curse of Dimensionality. 1-Dimension, e.g. single integer: Q: 7, Distance: 3; a solution: binary tree, e.g. return 9, 4, 8, ... 2-Dimension, e.g. GPS point: Q: (12.73, 61.45), Distance: 10; a solution: quadtree, R-tree, etc.
  16. 16. Curse of Dimensionality. How about very high dimension? A trip often has thousands of shingles (1->2, 2->3, 3->4, 4->5, 5->6, 6->7, 7->8, ...) ->3k. Very hard problem.
  17. 17. Approximate Solution. Trip T1 and Trip T2 are similar: D(T1, T2) is small. With high probability, h(T1) and h(T2) place T1 and T2 in the same bucket (Bucket1).
  18. 18. Approximate Solution. Trip T1 and Trip T2 are not similar: D(T1, T2) is large. With high probability, h(T1) and h(T2) place T1 and T2 in different buckets (Bucket1 and Bucket2).
  19. 19. Some distance functions have well-matched companion hash functions. For the Jaccard distance, it is the MinHash function.
  20. 20. MinHash(S) = min { h(x) for all x in the set S }, where h(x) is a hash function such as (ax + b) % m, a and b are some good constants, and m is the number of hash bins. Example: S = {26, 88, 109}, h(x) = (2x + 7) % 8, MinHash(S) = min {3, 7, 1} = 1.
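The worked MinHash example on this slide can be checked directly. The helper names `make_hash` and `minhash` are hypothetical; the constants are the ones from the slide.

```python
def make_hash(a, b, m):
    """Build h(x) = (a*x + b) % m, the linear hash form used on the slide."""
    return lambda x: (a * x + b) % m


def minhash(items, h):
    """MinHash of a non-empty set: the smallest hash value over its elements."""
    return min(h(x) for x in items)


h = make_hash(2, 7, 8)
S = {26, 88, 109}

print([h(x) for x in (26, 88, 109)])  # prints: [3, 7, 1]
print(minhash(S, h))                  # prints: 1
```

With only 8 bins this hash is far too coarse for real data; it is kept here only because it matches the slide's arithmetic.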
  21. 21. Some Other Examples (distance: hash function). Jaccard: MinHash. Hamming: i-th value of vector x. Cosine: sign of the dot product of x and a random vector.
  22. 22. How to increase and control the probability? It turns out the solution is very intuitive.
  23. 23. Use Multiple Hashes. Each trip is hashed by both h1 and h2, giving h1(T1), h1(T2), h2(T1) and h2(T2), which place the trips into buckets (Bucket1, Bucket2, Bucket3). Both h1 and h2 are MinHash, but with different parameters (e.g. a & b).
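To see why more hash functions raise the chance of a match, consider an idealized model: a single MinHash collides for two sets with probability equal to their Jaccard similarity, and the hash functions are independent. Under those assumptions (which simple linear hashes only approximate), the chance that at least one of k hashes puts two trips in the same bucket is 1 - (1 - s)^k. The function name below is hypothetical.

```python
def collision_probability(similarity, num_hashes):
    """P(at least one of num_hashes MinHash functions collides).

    Idealized model: each hash collides independently with probability
    equal to the Jaccard similarity of the two sets.
    """
    return 1.0 - (1.0 - similarity) ** num_hashes


for k in (1, 2, 5, 10):
    print(k, collision_probability(0.5, k))
```

The same formula shows that dissimilar pairs also collide more often as k grows, which is why bucket matches are treated as candidates and checked afterwards.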
  24. 24. Our Approach to LSH on Spark
  25. 25. Shuffle Keys ● RDD[Trip] ● The hash values h1(T1), h1(T2), h2(T1), h2(T2) are shuffle keys ● h1 and h2 have non-overlapping key ranges (h1 range, h2 range, other hash key ranges) ● groupByKey()
  26. 26. Post Processing ● If T1 and T2 are hashed into the same bucket (e.g. Bucket1), it’s likely that they are similar ● Compute the Jaccard distance.
  27. 27. Approach 2 ● The same pair of trips can be matched in both the h1 and h2 buckets ● Use one more shuffle to dedup ● Trade-off: network vs distance computation
  28. 28. Approach 3 ● Don’t send the actual trip vector in the LSH and Dedup shuffles ● Send only the trip ID ● After dedup, join back with the trip objects with one more shuffle ○ Then compute the Jaccard distance of each pair of matched trips. ● When the trip object is large, Approach 3 saves a lot of network usage.
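The flow of Approach 3 can be imitated in plain Python, with dictionaries and sets standing in for Spark shuffles. This is a rough single-machine sketch, not the speakers' implementation: the function names, the `(hash index, value)` bucket key (used here in place of non-overlapping key ranges), the hash constants and the sample trips are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)


def candidate_pairs(trips, hash_funcs):
    """Bucket trip IDs by (hash index, MinHash value) and emit deduplicated ID pairs.

    trips maps trip_id -> non-empty set of integer shingle IDs. Only IDs are
    placed in buckets, mirroring "send only the trip ID".
    """
    buckets = defaultdict(list)  # stands in for the LSH groupByKey() shuffle
    for trip_id, shingles in trips.items():
        for i, h in enumerate(hash_funcs):
            buckets[(i, min(h(x) for x in shingles))].append(trip_id)
    pairs = set()  # stands in for the dedup shuffle
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def similar_trips(trips, hash_funcs, max_distance):
    """Join candidate ID pairs back to trip data and keep the close ones."""
    result = {}
    for a, b in candidate_pairs(trips, hash_funcs):
        d = jaccard_distance(trips[a], trips[b])
        if d <= max_distance:
            result[(a, b)] = d
    return result


# Hypothetical hash parameters and trips, for illustration only.
hash_funcs = [lambda x: (3 * x + 1) % 1009, lambda x: (7 * x + 5) % 1013]
trips = {"t1": {1, 2, 3, 4}, "t2": {1, 2, 3, 5}, "t3": {100, 101, 102}}

matches = similar_trips(trips, hash_funcs, max_distance=0.5)
print(matches)
```

Here `t1` and `t2` collide under both hash functions, yet the pair is emitted once and its distance is computed once, which is the saving that the dedup step in Approaches 2 and 3 is after.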
  29. 29. How to Generate Thousands of Hash Functions ● Naive approach ○ Generate thousands of tuples of (a, b, m) ● Cache friendly approach - CPU register/L1/L2 ○ Generate only two hash functions ○ h1(x) = (a1 x + b1) % m1 ○ h2(x) = (a2 x + b2) % m2 ○ hi(x) = h1(x) + i * h2(x), for i from 1 to the number of hash functions
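The two-base-hash construction can be sketched as follows. The formula hi(x) = h1(x) + i * h2(x) is taken from the slide as written (no final modulo); the helper names and the second set of constants (3, 1, 5) are hypothetical.

```python
def make_hash_family(a1, b1, m1, a2, b2, m2, count):
    """Return a function mapping x to [h_1(x), ..., h_count(x)].

    Only two base hashes are evaluated per element; the rest are derived
    as h_i(x) = h1(x) + i * h2(x), following the slide.
    """
    def family(x):
        h1 = (a1 * x + b1) % m1
        h2 = (a2 * x + b2) % m2
        return [h1 + i * h2 for i in range(1, count + 1)]
    return family


def minhash_signature(shingles, family):
    """One MinHash value per derived hash function, for a non-empty set."""
    rows = [family(x) for x in shingles]
    return [min(column) for column in zip(*rows)]


# h1 uses the slide's example constants; the h2 constants are made up.
family = make_hash_family(2, 7, 8, 3, 1, 5, count=3)

print(family(26))                                # prints: [7, 11, 15]
print(minhash_signature({26, 88, 109}, family))  # prints: [4, 7, 7]
```

Each element costs two multiplications plus a handful of additions no matter how many hash functions are derived, which fits the cache-friendly motivation on the slide.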
  30. 30. Other Features ● Amplification ○ Improve the probabilities ○ Reduce computation, memory and network used in final post-processing ○ More hashing (usually insignificant compared to the cost in final post-processing) ● Near Neighbors Search ○ Used in information retrieval, instance-based machine learning
  31. 31. Other Applications of LSH ● Search for top K similar items ○ Documents, images, time-series, etc ● Cluster similar documents ○ Similar news articles, mirror web pages, etc ● Product recommendation ○ Collaborative filtering
  32. 32. Future Work ● Migrate to Spark ML API ○ DataFrame as first class citizen ○ Integrate it into Spark ● Low latency inserts with Spark Streaming ○ Avoid re-hashing when new objects are streaming in
  33. 33. Thank you. Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved.
