
Research Summary: Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data


Research summary for my STAT645 course, fall 2016. Paper: Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data by Fang, Cheng, Tang, Maniu, Yang. http://ieeexplore.ieee.org/document/7498408/


  1. Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data – Fang, Cheng, Tang, Maniu, Yang (2016). Presented by Alex Klibisz, University of Tennessee, aklibisz@gmail.com. November 17, 2016.
  2. Contents
     1 Introduction: Trajectory Joins Introduction, Motivation, MapReduce Introduction, Problem Statement, Trajectory Operations
     2 Sub-optimal Solutions
     3 Solution: kNN Join: Pre-processing Phase, Querying Phase, Extension: kNN Load Balancing, Extension: hkNN Join
     4 Results: Evaluation Setup, kNN Results, hkNN Results, Summary
     5 Conclusion
  3. Trajectory Joins Vocabulary
     • Trajectory: a series of locations depicting the movement of an entity over time.
     • Trajectory Object: a snapshot of time and location; a single trajectory contains many trajectory objects.
     • Trajectory Join: given two sets of trajectories M and R, join(M, R) returns trajectory objects from M and R within some proximity in space and time.
     • Joining Criterion: the criteria by which objects in M and R are joined. This paper uses the k-nearest-neighbors algorithm to join objects.
  4. Example Use Case
     • The Hubble space telescope generates 140GB/week of data about the movements of stars and asteroids. Analyzing proximity among trajectory objects helps uncover the behavior of outer-space objects, discover meteors, etc. Trajectory joins can find objects within some proximity of one another.
     • Example: given two groups A and B of asteroids, return the identities of asteroids from B that have been close to those in A.
  5. MapReduce Basics
     • Divide-and-conquer "big data" processing on share-nothing clusters.
     • The master node partitions data and assigns it to map nodes.
     • Map performs analysis on local data.
     • The shuffle step redistributes data after the map step.
     • Reduce performs a summary operation over data from the map step.
     • The MapReduce software handles data partitioning, execution over distributed nodes, and error recovery. [1]
     [1] https://goo.gl/0nbYhp
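The map, shuffle, and reduce steps above can be simulated in a single process. This is an illustrative sketch only, not the paper's implementation; the record format (timestamp, object id) and the per-hour counting task are assumptions for the example.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs; here, (hour bucket, 1) per observation.
    for t, obj_id in records:
        yield (t // 3600, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would between phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Reduce: summarize each group; here, a simple count.
    return {k: sum(vs) for k, vs in groups.items()}

records = [(10, "a"), (3700, "b"), (20, "c")]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {0: 2, 1: 1}
```

In a real cluster, the shuffle moves data across the network between nodes; its cost is exactly the "shuffling cost" measured in the paper's evaluation.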
  6. Problem Statement
     • kNN Join: find the k nearest neighbors from set R for objects in M over a time interval [ts, te] ⊆ [Ts, Te].
     • (h,k)NN Join: find a list of h objects from M over a time interval [ts, te] ⊆ [Ts, Te] that minimize a function f, then return the k nearest neighbors for each of the h objects.
  7. kNN Example
     The figure illustrates a kNN join. An (h,k)NN join with h = 1, k = 2 might use f(m1) = max{d1, d2} = d2 to return the k nearest neighbors {r1, r2}.
  8. Some Fundamental Operations
     • Min/max distance from a point to a line segment.
     • Min/max distance from a point to a trajectory.
     • Min/max distance from a trajectory to a trajectory.
     • kNN from a trajectory object to a set of trajectory objects. [2]
     [2] Formulas omitted for brevity; available in section 3 of the paper.
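The first of these operations, minimum distance from a point to a line segment, is small enough to sketch. The project-and-clamp formulation below is the standard one, not necessarily the paper's exact formula; 2D points as (x, y) tuples are assumed.

```python
import math

def point_segment_dist(p, a, b):
    # Minimum distance from point p to the segment from a to b.
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return math.hypot(px - ax, py - ay)  # degenerate segment: a == b
    # Project p onto the line through a and b, clamped to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    cx, cy = ax + t * dx, ay + t * dy
    return math.hypot(px - cx, py - cy)

print(point_segment_dist((0, 1), (-1, 0), (1, 0)))  # 1.0
```

A point-to-trajectory distance then reduces to the minimum of this quantity over the trajectory's consecutive segments.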
  9. Sub-optimal Solutions
     • Single-Machine Brute Force (BF): nested loop computing the Euclidean distance between every pair of points in M and R. Worst-case O(|M||R|l) for l points in the trajectory of interest.
     • Single-Machine Sweep Line (SL): pre-sort the data by time and compute distances only for overlapping trajectories. Also worst-case O(|M||R|l).
     • Naive MapReduce: Map divides objects in M and R randomly into disjoint subsets; Reduce joins all pairs of subsets to compute distances. A second MapReduce job selects the k nearest neighbors.
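The brute-force baseline can be sketched in a few lines, ignoring the time dimension for clarity: objects here are plain 2D points, whereas the paper's objects also carry timestamps.

```python
import heapq
import math

def bf_knn_join(M, R, k):
    # Brute force: for every object in M, scan all of R and keep the
    # k nearest by Euclidean distance. Returns {index in M: [indices in R]}.
    result = {}
    for i, m in enumerate(M):
        dists = [(math.dist(m, r), j) for j, r in enumerate(R)]
        result[i] = [j for _, j in heapq.nsmallest(k, dists)]
    return result

M = [(0, 0)]
R = [(1, 0), (5, 5), (0, 2)]
print(bf_knn_join(M, R, k=2))  # {0: [0, 2]}
```

The nested scan is exactly why BF is quadratic in the input sizes; the paper's partitioning and TDB pruning exist to avoid most of these pairwise comparisons.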
  10. Overview of kNN Join
      Each step is composed of its own MapReduce algorithm, for a total of 6 algorithms.
  11. Overview of kNN Join
  12. Pre-processing Phase: Algorithm 1
      1 Input: non-partitioned trajectories.
      2 Map splits the trajectories in sets M and R into T temporal partitions. O(l + T), where l is the size of a trajectory.
      3 Reduce splits each temporal partition into N spatial partitions. O((|M| + |R|)(l + N)).
      4 Output: trajectories partitioned by time and space.
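The two splitting steps can be illustrated with a toy bucketing function. The uniform buckets and the 1-D spatial split below are assumptions for brevity, not the paper's exact partitioning scheme.

```python
def partition(objects, t_span, T, x_span, N):
    # Bucket trajectory objects (t, x, y) first by temporal partition,
    # then by spatial partition, mirroring Algorithm 1's map/reduce split.
    buckets = {}
    for (t, x, y) in objects:
        tp = min(int(t / t_span * T), T - 1)  # temporal partition index
        sp = min(int(x / x_span * N), N - 1)  # 1-D spatial partition (simplifying assumption)
        buckets.setdefault((tp, sp), []).append((t, x, y))
    return buckets

objs = [(0.1, 0.2, 0.0), (0.9, 0.8, 0.0)]
print(sorted(partition(objs, t_span=1.0, T=2, x_span=1.0, N=2)))  # [(0, 0), (1, 1)]
```

Only objects that land in the same (or adjacent, after TDB pruning) buckets ever need to be compared, which is what makes the later join phase tractable.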
  13. Sub-Trajectory Extraction
      • An anchor trajectory must span an entire time partition.
      • Tr_i^L is object i in trajectory r from set L in time partition T.
      Algorithm 2
      1 Input: trajectories partitioned by time and space.
      2 Map retrieves all sub-trajectories in [ts, te] (the queried time window). O_t(log(l)), O_s(l).
      3 Reduce finds the anchor trajectories used in the next step. O_t(|Tr_i^L|^2 l), O_s(|Tr_i^L| l).
      4 Output: anchor trajectories.
  14. Anchor Trajectories
      • An anchor trajectory must span an entire time partition, ts to te.
  15. Computing the Time-dependent Bound (TDB)
      • The TDB is a circle c(t) that bounds the k nearest neighbors of a set S of objects at time t.
      • The TDB for a set S of objects can change over time.
      Algorithm 4 (containing Algorithm 3)
      1 Input: anchor trajectories.
      2 Map computes the maximum distance from each anchor trajectory to each central point p_i in each temporal partition T. O_t(N · l), O_s(l).
      3 Reduce computes the TDB of Tr_i^M based on the maximum distances. O_t(|R| log |R|), O_s(|R|) for the set of objects R.
      4 Output: time-dependent bounds.
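The core intuition behind the TDB, a radius around a central point guaranteed to contain at least k objects of R, can be sketched as the k-th smallest distance from that point. This is a deliberate simplification of Algorithms 3 and 4, which work with max distances from whole anchor trajectories rather than a single point.

```python
import heapq
import math

def tdb_radius(p, R, k):
    # Radius of a circle centered at p that contains at least k objects of R:
    # the k-th smallest distance from p to the objects in R.
    return heapq.nsmallest(k, (math.dist(p, r) for r in R))[-1]

R = [(1, 0), (0, 3), (4, 0)]
print(tdb_radius((0, 0), R, k=2))  # 3.0
```

Any object of R outside this circle cannot be among the k nearest neighbors, so whole spatial partitions beyond the bound can be skipped during candidate search.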
  16. Time-dependent Bounds
      • The TDB is a circle c(t) that bounds the k nearest neighbors of a set S of objects at time t.
      • The TDB for a set S of objects can change over time.
      In the figure, white dots are objects from M and black dots are objects from R. c(t) needs only a small circle to encompass k = 2 points; c(t′) needs a bigger circle to encompass k = 2 points.
  17. Finding Candidate Trajectories: Algorithm 5
      1 Input: partitions of trajectories Tr_j^R.
      2 Map classifies each partition of trajectories Tr_j^R as having no candidates, all candidates, or some candidates. O_t(|Tr| N l), O_s(|Tr| l).
      3 Reduce gathers the candidates for a join into C_i^R. O_t(1), O_s(|C_i^R| l).
      4 Output: a set of candidate trajectories C_i^R.
  18. Candidate Trajectories
      Finding candidates for Tr_j^R (red). Case 1: no overlap. Case 2: complete overlap. Case 3: partial overlap.
  19. Trajectory Join: Algorithm 6
      1 Input: candidate trajectories.
      2 Map joins each partition Tr_i^M with its corresponding candidates C_i^R on a single machine. O(|Tr||C_i^R|l).
      3 Reduce sorts each object's neighbors and keeps only the k nearest. O(kN).
      4 Output: each queried object with its k nearest neighbors.
  20. Extension: kNN Load Balancing
      1 Hash the trajectory objects by an ID to distribute them more uniformly among compute nodes.
      2 Requires modifications to sub-trajectory extraction, candidate finding, and the trajectory join.
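The hashing idea can be sketched as follows; the specific hash function (MD5 of the trajectory ID) and the ID format are illustrative assumptions, not details from the paper.

```python
import hashlib

def assign_node(traj_id, num_nodes):
    # Hash the trajectory ID to a node index so that objects spread roughly
    # uniformly, instead of clustering by space/time partition.
    h = int(hashlib.md5(traj_id.encode()).hexdigest(), 16)
    return h % num_nodes

ids = [f"traj-{i}" for i in range(1000)]
loads = [0] * 4
for tid in ids:
    loads[assign_node(tid, 4)] += 1
print(loads)  # four roughly equal counts summing to 1000
```

The trade-off is that hashing destroys spatial locality, which is why the extraction, candidate-finding, and join steps all need modification to work with the new placement.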
  21. Extension: hkNN Join
      1 Review: finds the h objects from M that minimize some function f and returns each of their k nearest neighbors.
      2 A smaller TDB can be computed.
      3 Smaller query result: size h×k, versus |M|×k for the kNN query.
      4 Time and space complexities remain the same.
  22. Evaluation Setup
      • 2 synthetic and 2 real datasets.
      • Non-trivial size: up to 1.2B observations and 17.2GB.
      • Hadoop cluster with 60 slave nodes; multi-core 3.40GHz CPU and 16GB memory per node.
      • Sweep Line (SL) used for the single-node parts.
      • Measured query execution time and MapReduce shuffling cost (the amount of data sent from mappers to reducers).
      • k = 10 and N = 400 held constant for all datasets; T and tq varied.
  23. Effect of T (number of temporal partitions)
      As T grows, running time decreases until it hits an inflection point, which happens to be similar for both datasets. Most of the time is still spent on single-node SL.
  24. kNN Results Summary
      • Increasing N (the number of spatial partitions) improves performance up to an inflection point, which differs between the two datasets. Fig. 15.
      • Balanced Sweep Line (BL-SL) is the more efficient single-node algorithm. Fig. 16. (I think they mixed up the figure labels.)
      • Adding slave nodes improves performance, though the rate of improvement is slow, likely due to I/O overhead. Fig. 17.
      • As k increases, running time and shuffle cost increase; the TDB makes a difference. Fig. 18.
      • Increasing tq shows a near-linear increase in running time and shuffling cost; the TDB and load balancing make a difference. Fig. 19.
      • Time increases linearly with dataset size, with a sharper increase in shuffling cost than in time. Fig. 20.
  25. hkNN Results Summary
      • Time is constant as h grows (probably because k is constant).
      • (h,k)NN is 2x faster than the kNN methods.
      • The load-balanced version is faster than the non-load-balanced one.
  26. Conclusion
      Contributions
      1 Leverages the share-nothing MapReduce structure for kNN joins, which typically rely on shared indices.
      2 Introduces the TDB and load-balancing methods, which yield tangible improvements.
      Questions
      1 Most of the time is still spent on the single-node computation. What is the theoretical bound for improvement via parallelization?
      2 How much time does the partitioning step take?
      3 The partitioning step probably has to be re-run when new data arrives. Does this prevent a real-time implementation?
      4 Is there any benefit to localizing data instead of using HDFS?
