
ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join


This is the presentation of the paper "ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join", presented by Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou and Philippas Tsigas at the IEEE Big Data conference held in Santa Clara, 2015.

Published in: Science

ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join

  1. ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join. Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. Chalmers University of Technology, 2015-10-31.
  2. Agenda • What is a stream join? • Which are the challenges of a parallel stream join? • Why ScaleJoin? • How well does ScaleJoin address stream joins’ challenges? • Conclusions
  4. Motivation. Applications in sensor networks and cyber-physical systems generate large and fluctuating volumes of data continuously. This demands: • continuous processing of data streams • in a real-time fashion. Store-then-process is not feasible!
  5. What is a stream join? Data stream: unbounded sequence of tuples. [Figure: streams R and S, each carrying tuples t1, t2, t3, t4, with sliding windows WR and WS of window size WS, joined under predicate P.]
  6. Why parallel stream joins? • Window size WS = 600 seconds • R receives 500 tuples/second • S receives 500 tuples/second • WR will contain 300,000 tuples • WS will contain 300,000 tuples • Each new tuple from R gets compared with all the tuples in WS • Each new tuple from S gets compared with all the tuples in WR • That is (500 + 500) new tuples/second × 300,000 comparisons each = 300,000,000 comparisons/second!
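     The arithmetic on this slide can be checked with a few lines of Java. This is only a sketch: the class and method names are hypothetical, and it assumes steady input rates and full windows.

     ```java
     // Hypothetical helper, not part of ScaleJoin: reproduces the slide's estimate.
     class JoinLoad {
         // Tuples held by a time-based window at a steady input rate.
         static long tuplesInWindow(long windowSeconds, long ratePerSecond) {
             return windowSeconds * ratePerSecond;
         }

         // Each new R tuple is compared with all of WS, each new S tuple with all of WR.
         static long comparisonsPerSecond(long windowSeconds, long rateR, long rateS) {
             long wr = tuplesInWindow(windowSeconds, rateR);
             long ws = tuplesInWindow(windowSeconds, rateS);
             return rateR * ws + rateS * wr;
         }
     }
     ```

     With a 600-second window and 500 tuples/second on each stream, `JoinLoad.comparisonsPerSecond(600, 500, 500)` evaluates to 300,000,000.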
  8. Which are the challenges of a parallel stream join? Scalability, high throughput, low latency, disjoint parallelism, skew resilience, determinism.
  10. The 3-step procedure (sequential stream join). For each incoming tuple t: 1. compare t with all tuples in the opposite window given predicate P; 2. add t to its window; 3. remove stale tuples from t’s window. [Diagram: Prod R and Prod S add tuples to R and S, a single PU processes them, and Cons consumes the results.] We assume each producer delivers tuples in timestamp order.
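     The three steps can be sketched in Java as follows. This is a minimal illustration, not the authors' code: the class, the `Tuple` type, and the equality-style predicate used in the usage note are hypothetical. One further assumption is labeled in the code: because step 3 only purges t's own window, the comparison loop skips opposite-window tuples that are already older than the window size.

     ```java
     import java.util.ArrayDeque;
     import java.util.ArrayList;
     import java.util.Deque;
     import java.util.List;
     import java.util.function.BiPredicate;

     // Hypothetical sequential join following the slide's three steps.
     class SequentialJoin {
         static final class Tuple {
             final long ts;    // timestamp
             final int value;  // placeholder payload
             Tuple(long ts, int value) { this.ts = ts; this.value = value; }
         }

         private final Deque<Tuple> windowR = new ArrayDeque<>();
         private final Deque<Tuple> windowS = new ArrayDeque<>();
         private final long windowSize;
         private final BiPredicate<Tuple, Tuple> predicate; // tested as (r, s)

         SequentialJoin(long windowSize, BiPredicate<Tuple, Tuple> predicate) {
             this.windowSize = windowSize;
             this.predicate = predicate;
         }

         // Processes one tuple and returns the matching (r, s) pairs.
         List<Tuple[]> process(Tuple t, boolean fromR) {
             Deque<Tuple> own = fromR ? windowR : windowS;
             Deque<Tuple> opposite = fromR ? windowS : windowR;
             List<Tuple[]> out = new ArrayList<>();
             // Step 1: compare t with all tuples in the opposite window.
             for (Tuple o : opposite) {
                 if (t.ts - o.ts > windowSize) continue; // assumption: ignore stale tuples
                 Tuple r = fromR ? t : o;
                 Tuple s = fromR ? o : t;
                 if (predicate.test(r, s)) out.add(new Tuple[] {r, s});
             }
             // Step 2: add t to its own window.
             own.addLast(t);
             // Step 3: remove stale tuples from t's own window.
             while (!own.isEmpty() && t.ts - own.peekFirst().ts > windowSize) {
                 own.removeFirst();
             }
             return out;
         }
     }
     ```

     For example, `new SequentialJoin(10, (r, s) -> r.value == s.value)` joins tuples with equal payloads that are at most 10 time units apart.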
  11. The 3-step procedure, is it enough? Scalability, high throughput, low latency, disjoint parallelism, skew resilience, determinism. [Figure: two snapshots of streams R and S with windows WR and WS, holding tuples t1, t2 and incoming tuples t3, t4.]
  12. Enforcing determinism in sequential stream joins • Next tuple to process = earliest(tS,tR) • The earliest(tS,tR) tuple is referred to as the next ready tuple • Processing ready tuples in timestamp order gives determinism. [Diagram: a PU with pending input tuples tS and tR.]
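     A minimal sketch of picking the next ready tuple is shown below. It assumes, as the slides do, that each producer delivers tuples in timestamp order. The class name, the use of bare timestamps instead of full tuples, and the tie-breaking rule (R before S) are all assumptions made for illustration.

     ```java
     import java.util.ArrayDeque;
     import java.util.Deque;

     // Hypothetical merger: a tuple is "ready" once both inputs have a pending head,
     // so that earliest(tS, tR) can be decided.
     class ReadyMerger {
         private final Deque<Long> pendingR = new ArrayDeque<>();
         private final Deque<Long> pendingS = new ArrayDeque<>();

         void addR(long ts) { pendingR.addLast(ts); }
         void addS(long ts) { pendingS.addLast(ts); }

         // Returns the timestamp of the next ready tuple, or null if it cannot be decided yet.
         Long nextReady() {
             if (pendingR.isEmpty() || pendingS.isEmpty()) {
                 return null; // an earlier tuple could still arrive on the empty input
             }
             // Assumption: ties are resolved in favour of R.
             return pendingR.peekFirst() <= pendingS.peekFirst()
                     ? pendingR.pollFirst()
                     : pendingS.pollFirst();
         }
     }
     ```

     Because the output order depends only on timestamps (plus the fixed tie-break), two runs over the same inputs process tuples in the same order regardless of arrival interleaving.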
  13. Deterministic 3-step procedure. Pick the next ready tuple t: 1. compare t with all tuples in the opposite window given predicate P; 2. add t to its window; 3. remove stale tuples from t’s window. [Diagram: Prod R and Prod S add tuples to R and S, a single PU processes them, and Cons consumes the results.]
  14. Shared-nothing parallel stream join (state-of-the-art). Prod R and Prod S each choose a PU and add the tuple to that PUi’s R or S input. Each PU (PU1, PU2, …, PUN) picks the next ready tuple t and: 1. compares t with all tuples in the opposite window given P; 2. adds t to its window; 3. removes stale tuples from t’s window. A Merge step takes the next ready output tuple, and Cons consumes the results. Checklist: scalability, high throughput, low latency, disjoint parallelism, skew resilience, determinism.
  15. Shared-nothing parallel stream join (state-of-the-art). [Diagram: Prod R and Prod S feed PU1, PU2, …, PUN, whose outputs go to Merge and then Cons; the connections are labeled enqueue() and dequeue().]
  16. From coarse-grained to fine-grained synchronization. [Diagram: Prod R, Prod S, PU1, PU2, …, PUN, Cons.]
  17. ScaleGate. addTuple(tuple,sourceID): allows a tuple from sourceID to be merged by ScaleGate into the resulting timestamp-sorted stream of ready tuples. getNextReadyTuple(readerID): provides readerID with the earliest ready tuple that readerID has not yet consumed. https://github.com/dcs-chalmers/ScaleGate_Java
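     To make the two operations concrete, here is a toy, lock-based stand-in that exposes the same two method names. It is explicitly not the ScaleGate implementation from the linked repository (which the slides present as fine-grained synchronization); the class name `ToyGate`, the `Tuple` type, the constructor, and the "ready when no source can still deliver an earlier tuple" rule are assumptions for illustration.

     ```java
     import java.util.ArrayList;
     import java.util.Arrays;
     import java.util.Comparator;
     import java.util.List;
     import java.util.PriorityQueue;

     // Toy stand-in for the ScaleGate interface; coarse-grained locking, never reclaims memory.
     class ToyGate {
         static final class Tuple {
             final long ts;
             final String payload;
             Tuple(long ts, String payload) { this.ts = ts; this.payload = payload; }
         }

         private final long[] latestTs;   // latest timestamp seen per source
         private final int[] readerPos;   // next index in 'ready' per reader
         private final PriorityQueue<Tuple> pending =
                 new PriorityQueue<Tuple>(Comparator.comparingLong((Tuple t) -> t.ts));
         private final List<Tuple> ready = new ArrayList<>(); // timestamp-sorted ready tuples

         ToyGate(int sources, int readers) {
             latestTs = new long[sources];
             Arrays.fill(latestTs, Long.MIN_VALUE);
             readerPos = new int[readers];
         }

         // Assumes each source adds tuples in timestamp order.
         synchronized void addTuple(Tuple tuple, int sourceID) {
             latestTs[sourceID] = tuple.ts;
             pending.add(tuple);
             // A tuple is ready once every source has reached its timestamp.
             // Note: equal timestamps are not ordered deterministically in this toy.
             long bound = Arrays.stream(latestTs).min().getAsLong();
             while (!pending.isEmpty() && pending.peek().ts <= bound) {
                 ready.add(pending.poll());
             }
         }

         // Every reader sees every ready tuple, in the same order; null if none is available.
         synchronized Tuple getNextReadyTuple(int readerID) {
             int pos = readerPos[readerID];
             if (pos >= ready.size()) return null;
             readerPos[readerID] = pos + 1;
             return ready.get(pos);
         }
     }
     ```

     The point of the sketch is the contract: all readers observe one shared, timestamp-sorted sequence of ready tuples, each at its own pace.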
  18. ScaleJoin. Prod R and Prod S add tuples to SGin. Steps for each PU (PU1, PU2, …, PUN): get the next ready input tuple t from SGin, then 1. compare t with all tuples in the opposite window given P; 2. add t to its window in a round-robin fashion; 3. remove stale tuples from t’s window. Cons gets the next ready output tuple from SGout.
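     The per-PU steps can be sketched as below. The sketch runs the PUs one after another on a single thread and only counts matches; it is meant to show why round-robin storage still finds every matching pair exactly once. All names, the equality predicate, the per-stream counters used for round-robin ownership, and the staleness check during comparison are assumptions, not details taken from the slides.

     ```java
     import java.util.ArrayDeque;
     import java.util.Deque;

     // Hypothetical PU: sees every tuple, but stores only every numPUs-th tuple of each stream.
     class RoundRobinPU {
         private final int id;
         private final int numPUs;
         private final long windowSize;
         private long seenR = 0, seenS = 0; // tuples seen so far per stream
         private final Deque<long[]> windowR = new ArrayDeque<>(); // entries are {ts, value}
         private final Deque<long[]> windowS = new ArrayDeque<>();

         RoundRobinPU(int id, int numPUs, long windowSize) {
             this.id = id;
             this.numPUs = numPUs;
             this.windowSize = windowSize;
         }

         // Returns the number of matches this PU finds for the tuple.
         int process(long ts, long value, boolean fromR) {
             Deque<long[]> own = fromR ? windowR : windowS;
             Deque<long[]> opposite = fromR ? windowS : windowR;
             int matches = 0;
             // Step 1: compare with the locally stored part of the opposite window.
             for (long[] o : opposite) {
                 if (ts - o[0] <= windowSize && o[1] == value) matches++; // placeholder predicate
             }
             // Step 2: store the tuple only when it is this PU's turn.
             long seen = fromR ? seenR++ : seenS++;
             if (seen % numPUs == id) own.addLast(new long[] {ts, value});
             // Step 3: remove stale tuples from the tuple's own window.
             while (!own.isEmpty() && ts - own.peekFirst()[0] > windowSize) own.removeFirst();
             return matches;
         }

         // Feeds the same timestamp-ordered input {ts, value, fromR ? 1 : 0} to every PU.
         static int runAll(int numPUs, long windowSize, long[][] input) {
             RoundRobinPU[] pus = new RoundRobinPU[numPUs];
             for (int i = 0; i < numPUs; i++) pus[i] = new RoundRobinPU(i, numPUs, windowSize);
             int total = 0;
             for (long[] t : input) {
                 for (RoundRobinPU pu : pus) total += pu.process(t[0], t[1], t[2] == 1);
             }
             return total;
         }
     }
     ```

     Since each stored tuple lives in exactly one PU and every PU sees every incoming tuple, the total number of matches is the same for 1 PU and for N PUs, while each PU performs roughly 1/N of the comparisons.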
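  19. ScaleJoin (example). [Figure: a sequential stream join, where window WR holds t1, t2, t3 and an incoming S tuple t4 is compared with all of them, next to ScaleJoin with 3 PUs, where each PU’s WR holds one of t1, t2, t3 and each PU receives t4.]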
  20. ScaleJoin. Multiple producers (Prod R, Prod S, Prod S, …) add tuples to SGin. Steps for PUi: get the next ready input tuple t from SGin, then 1. compare t with all tuples in the opposite window given P; 2. add t to its window in a round-robin fashion; 3. remove stale tuples from t’s window. Cons gets the next ready output tuple from SGout. Checklist: scalability, high throughput, low latency, disjoint parallelism, skew resilience, determinism.
  22. Evaluation setup • Common benchmark • Implemented in Java • Evaluation platform – NUMA architecture: 2.6 GHz AMD Opteron 6230 (48 cores over 4 sockets), 64 GB of memory – Architecture with Hyper-Threading: 2.0 GHz Intel Xeon E5-2650 (16 cores over 2 sockets), 64 GB of memory. Benchmark streams: R: <timestamp,x,y,z>, S: <timestamp,a,b,c,d>. Predicate P: a−10≤x≤a+10 AND b−10≤y≤b+10.
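     The benchmark predicate P is a band join on two attribute pairs. A direct Java rendering might look like the sketch below; the class name is hypothetical, and the attribute types are assumed to be `double` because the slides do not state them.

     ```java
     // Hypothetical rendering of P: a−10 ≤ x ≤ a+10 AND b−10 ≤ y ≤ b+10.
     class BenchmarkPredicate {
         // x, y come from an R tuple; a, b come from an S tuple.
         static boolean matches(double x, double y, double a, double b) {
             return Math.abs(x - a) <= 10 && Math.abs(y - b) <= 10;
         }
     }
     ```

     Unlike an equality predicate, a band predicate cannot be served by hash partitioning on a key, since a tuple may match partners with many different attribute values.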
  23. ScaleJoin scalability. [Plot: comparisons/second against number of PUs.]
  24. ScaleJoin latency. [Plot: latency in milliseconds against number of PUs.]
  25. ScaleJoin skew-resilience. [Plot: constant distinct rates with peaks.]
  27. Conclusions • ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join • Challenges of parallel stream joins • Fine-grained synchronization (ScaleGate) • 4 billion comparisons/second, with latency lower than 60 milliseconds. Checklist: scalability, high throughput, low latency, disjoint parallelism, skew resilience, determinism.
  28. Thank you! Questions?
