Lic may17

Hannaneh Najdataei
Parallel Data Streaming Analytics in the
Context of Internet of Things
Licentiate seminar
.: May 2019 :.

Introduction | Continuous clustering | Elasticity in stream processing | Conclusions
Internet of Things (IoT)

Cloud Computing
IoT Analytics
20 GB per
car per hour
Edge Devices

Edge Computing
Fog Computing
Cloud Computing
IoT Analytics

3-tier IoT Architecture
Cloud Tier
Data Centers
Fog Tier
Nodes
Edge Tier
Devices

Scope of the Thesis
The challenges
• Unbounded data
• Unpredictable data rate
• Various platforms
• Time requirements
Computational power: High, Medium, Low
• Design and implement analytics

Scope of the Thesis
The objectives
• Continuous analysis
• Adaptive reconfiguration
• Hardware independent
• Efficient processing
• Design and implement analytics
The challenges
• Unbounded data
• Unpredictable data rate
• Various platforms
• Time requirements

Conventional Data Analytics (Batch processing)
Data
Analysis
Results
Database

Continuous Processing
Data
Analysis
Results

Stream Processing
Results
Data Analysis

Stream Processing Operators
• Stateless
• Stateful
State is the memory of the operator

• Stateless
o E.g. filter
• Stateful
o E.g. aggregate
[Figure: tuples <ts,x> flowing through an operator: <1,3> <2,4> <3,1> <4,3>. For the aggregate, the window holds <1,3> <2,4> <3,1> and the output tuple is <3,8>.]
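The difference between the two operator kinds can be sketched in a few lines of Python. This is only an illustration: the names `StreamTuple`, `filter_op`, and `tumbling_sum` are hypothetical, the filter predicate is arbitrary, and the aggregate is assumed to be a sum over fixed windows of three timestamps, which is consistent with the output <3,8> on the slide but is not stated there.

```python
from collections import namedtuple

# A stream tuple <ts, x>: a timestamp and a payload value.
StreamTuple = namedtuple("StreamTuple", ["ts", "x"])


def filter_op(stream, predicate):
    """Stateless: every tuple is handled on its own, nothing is remembered."""
    for t in stream:
        if predicate(t):
            yield t


def tumbling_sum(stream, window_size):
    """Stateful: the state is the open window's end timestamp and running sum.

    Assumes tuples arrive in timestamp order and that windows cover the
    timestamps [1, w], [w + 1, 2w], ... (an illustrative choice).
    A window's result is emitted when the first tuple of a later window arrives.
    """
    window_end = None
    total = 0
    for t in stream:
        end = ((t.ts - 1) // window_size + 1) * window_size
        if window_end is not None and end != window_end:
            yield StreamTuple(window_end, total)
            total = 0
        window_end = end
        total += t.x


stream = [StreamTuple(1, 3), StreamTuple(2, 4), StreamTuple(3, 1), StreamTuple(4, 3)]
print(list(filter_op(stream, lambda t: t.x >= 3)))
print(list(tumbling_sum(stream, 3)))  # [StreamTuple(ts=3, x=8)]
```

The filter needs no memory between tuples, while the aggregate must keep the running sum until the window closes; that kept sum is the operator's state.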

Outline
1. Introduction
• Motivation
• Thesis objectives
o Continuous analysis
o Adaptive reconfiguration
o Hardware independent
o Efficient processing
2. Continuous clustering
3. Elasticity in stream processing
4. Conclusions

LiDAR Point Cloud Clustering
Side view
Top view
𝑑
Raw LiDAR data points

LiDAR Point Cloud Clustering
Raw LiDAR data points
Clustered data points

Batch Clustering
1. Collect data points for one rotation
2. Store the points in search optimized data structure
3. Apply the clustering
Parameters: 𝑚𝑖𝑛𝑃𝑡𝑠, 𝜖
Euclidean clustering
* [Ester et al., Density-based 1996] [Rusu et al., Semantic3D 2010] [Rusu et al., pcl 2011] [Patwary et al., DBSCAN 2012]
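The three batch steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited work: a uniform grid with cell side 𝜖 stands in for the search-optimized data structure, clusters are taken to be groups of points connected through neighbors at distance at most 𝜖, and 𝑚𝑖𝑛𝑃𝑡𝑠 is assumed to be the minimum cluster size. The function name `euclidean_clustering` is hypothetical.

```python
import math
from collections import defaultdict, deque


def euclidean_clustering(points, eps, min_pts):
    """Cluster one full rotation of 3D points (step 1: `points` is already collected)."""
    # Step 2: store the points in a grid so neighbor searches touch few cells.
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(math.floor(c / eps) for c in p)].append(i)

    def neighbors(i):
        p = points[i]
        cx, cy, cz = (math.floor(c / eps) for c in p)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                        if j != i and math.dist(p, points[j]) <= eps:
                            yield j

    # Step 3: grow clusters by breadth-first search over eps-neighbors.
    seen = [False] * len(points)
    clusters = []
    for start in range(len(points)):
        if seen[start]:
            continue
        seen[start] = True
        pending = deque([start])
        members = []
        while pending:
            i = pending.popleft()
            members.append(i)
            for j in neighbors(i):
                if not seen[j]:
                    seen[j] = True
                    pending.append(j)
        if len(members) >= min_pts:  # assumption: min_pts is a minimum cluster size
            clusters.append(sorted(members))
    return clusters
```

Note that nothing can be clustered until the whole rotation has been collected and indexed, which is the property the following slides question.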

Batch Clustering
Velodyne HDL-64E
• ~8 rotations per second
• Up to ~2.2 million points per second
Challenge?
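A quick back-of-the-envelope calculation with the two figures above shows what a batch approach has to wait for. The variable names are illustrative, and the rates are the approximate ones quoted on the slide.

```python
rotations_per_second = 8          # approximate, from the slide
points_per_second = 2_200_000     # upper bound, from the slide

# Points that pile up before a batch algorithm can start on a rotation.
points_per_rotation = points_per_second / rotations_per_second

# Time spent collecting one rotation, in milliseconds.
rotation_period_ms = 1000 / rotations_per_second

print(points_per_rotation)  # 275000.0
print(rotation_period_ms)   # 125.0
```

Under these numbers, a batch algorithm starts only after roughly 125 ms of data, up to about 275,000 points, have accumulated for each rotation.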

Continuous Clustering
➢ H. Najdataei, Y. Nikolakopoulos, V. Gulisano, M. Papatriantafilou. “Continuous and Parallel LiDAR Point-cloud Clustering”. The 38th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2018.
Lisco: continuous clustering while the data is being collected

Lisco
[Figure: side view and top view of the LiDAR, with steps 𝑆1 to 𝑆7 and lasers 𝑙1 to 𝑙5. The 2D view arranges the data points 𝑑1 to 𝑑12 in a grid with one column per step and one row per laser.]

Lisco
[Figure: the same side, top, and 2D views, with L lasers and S steps. The 2D view marks the 𝜀 neighbor mask of point 𝑝.]

Continuous Clustering Challenges
Partial view of neighbor mask
[Figure: points 𝑝 and 𝑝′ near the LiDAR’s last read. Legend: full neighbor mask of 𝑝, actual neighbor mask of 𝑝, neighbor mask of 𝑝′.]
Continuous cluster management
[Figure: clusters 𝐶1, 𝐶2, and 𝐶, with 𝐻1 and 𝐻2.]

Lisco
1. Find the neighbor mask and compute distances
2. Link the clusters
[Figure: point 𝑝 with the labels 𝐶1, 𝐶2, and 𝐻1.]
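The two steps can be illustrated with a small sketch that processes points one at a time as they arrive. It is not the Lisco implementation: the fixed mask half-widths `mask_lasers` and `mask_steps`, the class names, and the union-find bookkeeping are all assumptions made for illustration. Only cells that have already been received are inspected, which mirrors the partial view of the neighbor mask; points that arrive later look back and complete the links.

```python
import math


class DisjointSet:
    """Minimal union-find used to merge clusters as links are discovered."""

    def __init__(self):
        self.parent = {}

    def add(self, item):
        self.parent.setdefault(item, item)

    def find(self, item):
        while self.parent[item] != item:
            item = self.parent[item]
        return item

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


class ContinuousClusterer:
    def __init__(self, eps, mask_lasers, mask_steps):
        self.eps = eps
        self.mask_lasers = mask_lasers  # hypothetical fixed mask half-height
        self.mask_steps = mask_steps    # hypothetical mask depth in past steps
        self.points = {}                # (laser, step) -> (x, y, z)
        self.clusters = DisjointSet()

    def add_point(self, laser, step, xyz):
        key = (laser, step)
        self.points[key] = xyz
        self.clusters.add(key)
        # Step 1: scan the part of the neighbor mask that has been received.
        for s in range(step - self.mask_steps, step + 1):
            for l in range(laser - self.mask_lasers, laser + self.mask_lasers + 1):
                other = self.points.get((l, s))
                if other is None or (l, s) == key:
                    continue
                if math.dist(xyz, other) <= self.eps:
                    # Step 2: link the clusters of the two points.
                    self.clusters.union((l, s), key)

    def labels(self):
        groups = {}
        for key in self.points:
            groups.setdefault(self.clusters.find(key), []).append(key)
        return sorted(sorted(g) for g in groups.values())
```

Clusters are available after every `add_point` call, so there is no need to wait for a full rotation before results exist.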

P-Lisco
[Figure: the two Lisco steps split into two roles. Scouting: threads 1 to 3 (the scouts) read the data points, compute distances, and flag each point’s 𝜀-neighbors. Linking: the linker modifies the clusters. The grid shows steps S1 to S7 and lasers 𝑙1 to 𝑙6.]
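The split between scouts that only read points and a linker that alone modifies the clusters can be sketched with standard Python threads. This is an assumption-laden illustration, not P-Lisco itself: the queue between the roles, the round-robin division of points among scouts, the brute-force distance search, and the name `parallel_cluster` are all choices made here to keep the example short.

```python
import math
import queue
import threading


def parallel_cluster(points, eps, n_scouts=3):
    links = queue.Queue()              # flagged eps-neighbor pairs, scouts -> linker
    parent = list(range(len(points)))  # cluster state, touched only by the linker

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    def scout(k):
        # Scouts only read the data points and flag eps-neighbors.
        for i in range(k, len(points), n_scouts):
            for j in range(i):
                if math.dist(points[i], points[j]) <= eps:
                    links.put((i, j))
        links.put(None)  # tells the linker that this scout is done

    def linker():
        # The linker is the only thread that modifies the clusters.
        finished = 0
        while finished < n_scouts:
            item = links.get()
            if item is None:
                finished += 1
                continue
            a, b = find(item[0]), find(item[1])
            if a != b:
                parent[max(a, b)] = min(a, b)

    threads = [threading.Thread(target=scout, args=(k,)) for k in range(n_scouts)]
    threads.append(threading.Thread(target=linker))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Because only one thread writes the cluster state, the scouts need no locks, and the final partition does not depend on the order in which links reach the linker.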

Performance Evaluation (Real dataset)
[Figure: execution time of PCL, Lisco, and P-Lisco1/2/4/8/16 at 0.3, 0.4, and 0.7 m, on Intel Xeon E5-2695 and ODROID-XU3.]

Use case (1-Vehicle 1-Day)
[Figure: GPS data ⟨𝑥1, 𝑦1⟩ ⟨𝑥2, 𝑦2⟩ ⟨𝑥3, 𝑦3⟩ ⟨𝑥4, 𝑦4⟩ ⟨𝑥5, 𝑦5⟩ ⟨𝑥6, 𝑦6⟩ ⟨𝑥7, 𝑦7⟩ along a time axis t, with segments labeled “Heavy traffic” and “Exceeding speed limit”.]

System Model
➢ B. Havers, R. Duvignau, H. Najdataei, V. Gulisano, A. Chaitanya Koppisetty, M. Papatriantafilou. “DRIVEN: a framework for efficient Data Retrieval and clustering in Vehicular Networks”. The 35th International Conference on Data Engineering (ICDE). IEEE, 2019.
• Continuous bounded error approximation
• Compress volumes of data
• Utilize communication bandwidth
• Generalized form of Lisco
• Leverage the inherent ordering of spatial
and temporal data

Outline
1. Introduction
2. Continuous clustering
• Lisco
• P-Lisco
3. Elasticity in stream processing
4. Conclusions

Stream Processing

Stream Processing Performance
• Throughput
Number of tuples processed per time unit

Stream Processing Performance
• Throughput
• Latency
Time difference between receiving a tuple and
producing the corresponding results
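Both metrics follow directly from their definitions, as the short sketch below shows. The function names and the sample numbers are illustrative only and do not come from the evaluation in the talk.

```python
def throughput(num_tuples, elapsed_seconds):
    """Number of tuples processed per time unit (here: per second)."""
    return num_tuples / elapsed_seconds


def latencies(arrival_times, result_times):
    """Per-tuple time between receiving a tuple and producing its result."""
    return [out - inp for inp, out in zip(arrival_times, result_times)]


print(throughput(6000, 2.0))                # 3000.0
print(latencies([0.0, 1.0], [0.5, 1.25]))   # [0.5, 0.25]
```

The two metrics are independent: a system can sustain a high throughput while individual tuples still wait a long time for their results.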

Stream Processing Parallelism
• Task parallelism

Stream Processing Parallelism
• Task parallelism
Determinism: Consistent results independent of tuples’ inter-arrival times
* [Walulya et al., FGCS18] [Gulisano et al., ScaleJoin 2016]
• Data parallelism

Stream Processing Elasticity
Decommissioning
Provisioning

Stream Processing Elasticity
Scale out
* [Cardellini et al., HPCS16] [Carbone et al., VLDB17]

Stream Processing Efficiency
[Figure: shared-nothing (parallelism) and shared memory (reconfiguration), combined as virtual shared-nothing.]

STRETCH Framework
Components:
• State manager
• Virtual shared-nothing parallelism
➢ H. Najdataei, Y. Nikolakopoulos, M. Papatriantafilou, P. Tsigas, V. Gulisano. “STRETCH: Scalable and Elastic Deterministic Streaming Analysis with Virtual Shared-Nothing Parallelism”. To appear in the 13th International Conference on Distributed and Event-Based Systems (DEBS). ACM, 2019.

Virtual Shared-nothing Parallelism

STRETCH Framework
Components:
• State manager
• Virtual shared-nothing parallelism
• Elastic ScaleGate (ESG)

ScaleGate
[Figure: sources add tuples (t) and readers retrieve them; the marked tuples are those that are ready to be retrieved by readers.]
• Methods
• addTuple(tuple, sourceID)
• getNextReadyTuple(readerID)
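The behavior of the two methods can be illustrated with a single-threaded sketch. It makes several assumptions that the slide does not spell out: each source adds tuples in timestamp order, a tuple counts as ready once every source has delivered a tuple with an equal or later timestamp, every reader receives every ready tuple in timestamp order, and ties between equal timestamps are broken by insertion order. The class name `SimpleScaleGate` and the snake_case method names are hypothetical stand-ins for `addTuple` and `getNextReadyTuple`, and the sketch ignores concurrency entirely.

```python
import bisect


class SimpleScaleGate:
    def __init__(self, source_ids, reader_ids):
        self.latest_ts = {s: None for s in source_ids}  # last timestamp per source
        self.next_index = {r: 0 for r in reader_ids}    # read position per reader
        self.tuples = []                                # kept sorted by (ts, seq)
        self.seq = 0

    def add_tuple(self, tup, source_id):
        """tup is (ts, payload); each source is assumed to add in ts order."""
        ts = tup[0]
        self.seq += 1
        bisect.insort(self.tuples, (ts, self.seq, tup))
        self.latest_ts[source_id] = ts

    def get_next_ready_tuple(self, reader_id):
        """Return the reader's next ready tuple, or None if none is ready."""
        if any(ts is None for ts in self.latest_ts.values()):
            return None  # some source has not delivered anything yet
        ready_up_to = min(self.latest_ts.values())
        i = self.next_index[reader_id]
        if i < len(self.tuples) and self.tuples[i][0] <= ready_up_to:
            self.next_index[reader_id] = i + 1
            return self.tuples[i][2]
        return None


gate = SimpleScaleGate(["s1", "s2"], ["r1", "r2"])
gate.add_tuple((1, "a"), "s1")
print(gate.get_next_ready_tuple("r1"))  # None: s2 has not delivered yet
gate.add_tuple((3, "b"), "s2")
print(gate.get_next_ready_tuple("r1"))  # (1, 'a')
```

Holding a tuple back until all sources have caught up is what lets every reader see the same timestamp-ordered sequence regardless of how the sources interleave.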

Elastic ScaleGate
• Methods
• addTuple(tuple, sourceID)
• getNextReadyTuple(readerID)
• Additional methods
• announceReaders(List reader_IDs, rID)
• removeReaders(List reader_IDs)
• announceSources(List source_IDs, min_ts)
• removeSources(List source_IDs)
[Figure: sources, readers, and the tuples (t) that are ready to be retrieved by readers.]

STRETCH Framework
[Figure: animation frames of tuples with timestamps between ts=1 and ts=9 moving through the framework.]

Performance Evaluation
• Use case: ScaleJoin
• Setup: Intel Xeon E5-2695
[Figure: throughput (c/s) and latency (ms) for single thread, STRETCH, and ScaleJoin. Panels: scalability versus # threads, with the hyper-threading region marked; provisioning (18 -> 31 PTs) and decommissioning (18 -> 7 PTs) versus time (sec), each with the input rate (t/s). Additional label: Intra-epoch.]
Outline
1. Introduction
2. Continuous clustering
3. Elasticity in stream processing
• Virtual shared-nothing parallelism
• Elastic ScaleGate
• STRETCH framework
4. Conclusions

Conclusions
• Continuous clustering
• Efficient data structure to leverage parallelism
• High throughput and low latency
• Architecture independent
• Elasticity in stream processing
• Virtual shared-nothing parallelism
• Adaptive reconfiguration of processing units
• Intra-node resource utilization
• Deterministic execution
➢ Scale up/scale out
➢ Automatic control unit
➢ IoT applications
➢ Data quality improvement
