These are the slides for the lecture I gave at the EBSIS Summer School about data streaming and its challenges and trade-offs for data analysis in Fog architectures.
Vincenzo GulisanoAssociate Professor at Chalmers University of Technology
5. AMIs VNs
• demand-response
• scheduling [7]
• micro-grids
• detection of medium size blackouts [8]
• detection of non technical losses
• ...
5
• autonomous driving
• platooning
• accident detection [9]
• variable tolls [9]
• congestion monitoring [10]
• ...
Access to their data à Increased awareness, security, power-efficiency, ...
The data streaming paradigm and its use in Fog architectures
11. Fog architectures, applications trade-offs
and challenges
11
Sensor/Edge Devices Fog Devices Cloud Devices
Cost + ++ +++1
Computational power + ++ +++
Inter-node heterogeneity +++ ++ +
Intra-node heterogeneity + + +++
Size +++ ++ +
Communication wireless wireless/wired wired
Volume of data millions of small streams2 thousands of medium aggregated streams many large streams
Evolution over time +++ ++ +3
1. depends on who owns the physical hardware
2. with exceptions like e.g. Lidar sensors [16,17]
3. Virtual environments can also evolve ”behind the scenes”
The data streaming paradigm and its use in Fog architectures
18. data streaming operators
Two main types:
• Stateless operators
• do not maintain any state
• one-by-one processing
• if they maintain some state, such state does not evolve depending
on the tuples being processed
• Stateful operators
• maintain a state that evolves depending on the tuples being
processed
• produce output tuples that depend on multiple input tuples
18
OP
OP
The data streaming paradigm and its use in Fog architectures
27. time-based sliding window aggregation (count)
27
Counter: 4
time
[8:00,9:00)
8:05 8:15 8:22 8:45 9:05
Output: 4
Counter: 1
Counter: 2
Counter: 3
Counter: 3
time
8:05 8:15 8:22 8:45 9:05
[8:20,9:20)
we assumed each source
produces and delivers a
timestamp sorted stream!
What happens if this is not
the case?
The data streaming paradigm and its use in Fog architectures
33. 33
M A J F
sample query
Notice:
• the same semantics can be defined in several ways (using different
operators and composing them in different ways)
• Using many basic building blocks can ease the task of distributing
and parallelizing the analysis (more in the following...)
The data streaming paradigm and its use in Fog architectures
35. 1. Distributed deployment
2. Parallel deployment
3. Ordering and determinism
4. Shared-nothing vs shared-memory parallelism
5. Load balancing
6. Elasticity
7. Fault tolerance
8. Data sharing (differential privacy)
35
8 aspects / use cases to discuss challenges (and opportunities)
The data streaming paradigm and its use in Fog architectures
37. 1 - Distributed deployment – where to place a given operator? [19]
37
M
1. What hardware is available at each possible
deployment location?
2. What data is available at each possible
deployment location, depending on
1. the infrastructure?
2. its security regulations?
3. its privacy regulations?
?
The data streaming paradigm and its use in Fog architectures
38. 1 - Distributed deployment – where to place a given operator?
38
M
1. Low latency (if all the required data is locally
accessible)
2. Access to all local data (parts of which might
not be reported to RSUs / Servers) [4]
3. Embedded devices / limited computational
power [20]
The data streaming paradigm and its use in Fog architectures
39. 1 - Distributed deployment – where to place a given operator?
39
M
1. Access to multiple vehicles / high mobility
2. Access to smaller sets of data / partial view of
the data [22]
3. Embedded devices / higher computational
power
4. Wireless / wired communication
The data streaming paradigm and its use in Fog architectures
42. 2 - Parallel deployment – how do we parallelize the analysis?
42
M
M A
A
A 8:00 55.5 X1 Y1
A 8:07 34.3 X3 Y3
A 8:03 70.3 X2 Y2
A 8:00 55.5 X1 Y1
A 8:07 34.3 X3 Y3
A 8:03 70.3 X2 Y2
When computing e.g.
per-vehicle statistics,
we need to make sure
that all the tuples
referring to the same
vehicles are sent to
the same aggregate
instance
The data streaming paradigm and its use in Fog architectures
48. 48
M
M A
A
A 8:00 55.5 X1 Y1
A 8:07 34.3 X3 Y3
A 8:03 70.3 X2 Y2
A 8:00 55.5 X1 Y1
A 8:07 34.3 X3 Y3
A 8:03 70.3 X2 Y2
What if tuple with
timestamp 8:00
arrives after tuple
with timestamp 8:07?
3 – Ordering and determinism
The data streaming paradigm and its use in Fog architectures
50. 50
3 – Ordering and determinism – merge-sorting [25] & buffering [26,27]
addTuple(tuple,sourceID)
allows a tuple from sourceID to be merged by ScaleGate in the
resulting timestamp-sorted stream of ready tuples.
getNextReadyTuple(readerID)
provides to readerID the next earliest ready tuple that has not been
yet consumed by the former.
https://github.com/dcs-chalmers/ScaleGate_Java
The data streaming paradigm and its use in Fog architectures
55. 55
t1
t2
R S
WR
t3
t4
R S
t4
t1
WR
R S
t4
t2
WR
R S
t4
WR
t3
Sequential stream join:
ScaleJoin with 3 PUs:
Shared-memory parallel stream join
4 – shared-nothing vs. shared-memory parallelism
The data streaming paradigm and its use in Fog architectures
63. 63
6 – elasticity
• How many resources should be provisioned /
decommissioned? [23]
• How to mix load balancing and provisioning /
decommissioning? [23]
• How to chose which resources to provision /
decommission? [30]
• How to predict when more / less resources will
be needed? [30]
The data streaming paradigm and its use in Fog architectures
64. Load-aware provisioning
1 2 4 7 11 2617
64
6 – elasticity [23]
The data streaming paradigm and its use in Fog architectures
72. 8 – data sharing (differential privacy)
72
Wait a moment!
what if a single vehicle is
connected to a certain RSU?
Whether a certain mechanism preserves or not the privacy
of the underlying data depends on the knowledge of the
adversary
Differential privacy assumes the worst case scenario!
The data streaming paradigm and its use in Fog architectures
73. 73
8 – data sharing (differential privacy)
time
[8:00,9:00)
22.3 Km/h 10 Km/h ? 15Km/h
avg speed: 30 Km/h
without differential privacy, if knows the speed of , and
and the avg speed, then it can find out the speed of
The data streaming paradigm and its use in Fog architectures
74. 74
8 – data sharing (differential privacy)
time
[8:00,9:00)
22.3 Km/h 10 Km/h ? 15Km/h
avg speed: 30 Km/h
with differential privacy, if knows the speed of , and , it does
gain extra knowledge from the avg speed in order to find the speed of
+ noise (calibrated from probabilistic distribution) so
that whether participates or not, the results
will be nearly the same
The data streaming paradigm and its use in Fog architectures
78. Fog architectures, applications trade-offs
and challenges
78
Sensor/Edge Devices Fog Devices Cloud Devices
Cost + ++ +++
Computational power + ++ +++
Inter-node heterogeneity +++ ++ +
Intra-node heterogeneity + + +++
Size +++ ++ +
Communication wireless wireless/wired wired
Volume of data millions of small streams thousands of medium aggregated streams many large streams
Evolution over time +++ ++ +
The data streaming paradigm and its use in Fog architectures
80. Bibliography
1. Zhou, Jiazhen, Rose Qingyang Hu, and Yi Qian. "Scalable distributed communication architectures to support advanced
metering infrastructure in smart grid." IEEE Transactions on Parallel and Distributed Systems 23.9 (2012): 1632-1642.
2. Gulisano, Vincenzo, et al. "BES: Differentially Private and Distributed Event Aggregation in Advanced Metering Infrastructures."
Proceedings of the 2nd ACM International Workshop on Cyber-Physical System Security. ACM, 2016.
3. Gulisano, Vincenzo, Magnus Almgren, and Marina Papatriantafilou. "Online and scalable data validation in advanced metering
infrastructures." IEEE PES Innovative Smart Grid Technologies, Europe. IEEE, 2014.
4. Gulisano, Vincenzo, Magnus Almgren, and Marina Papatriantafilou. "METIS: a two-tier intrusion detection system for advanced
metering infrastructures." International Conference on Security and Privacy in Communication Systems. Springer International
Publishing, 2014.
5. Yousefi, Saleh, Mahmoud Siadat Mousavi, and Mahmood Fathy. "Vehicular ad hoc networks (VANETs): challenges and
perspectives." 2006 6th International Conference on ITS Telecommunications. IEEE, 2006.
6. El Zarki, Magda, et al. "Security issues in a future vehicular network." European Wireless. Vol. 2. 2002.
7. Georgiadis, Giorgos, and Marina Papatriantafilou. "Dealing with storage without forecasts in smart grids: Problem
transformation and online scheduling algorithm." Proceedings of the 29th Annual ACM Symposium on Applied Computing.
ACM, 2014.
8. Fu, Zhang, et al. "Online temporal-spatial analysis for detection of critical events in Cyber-Physical Systems." Big Data (Big
Data), 2014 IEEE International Conference on. IEEE, 2014.
The data streaming paradigm and its use in Fog architectures 80
81. Bibliography
9. Arasu, Arvind, et al. "Linear road: a stream data management benchmark." Proceedings of the Thirtieth
international conference on Very large data bases-Volume 30. VLDB Endowment, 2004.
10. Lv, Yisheng, et al. "Traffic flow prediction with big data: a deep learning approach." IEEE Transactions on
Intelligent Transportation Systems 16.2 (2015): 865-873.
11. Grochocki, David, et al. "AMI threats, intrusion detection requirements and deployment
recommendations." Smart Grid Communications (SmartGridComm), 2012 IEEE Third International
Conference on. IEEE, 2012.
12. Molina-Markham, Andrés, et al. "Private memoirs of a smart meter." Proceedings of the 2nd ACM
workshop on embedded sensing systems for energy-efficiency in building. ACM, 2010.
13. Gulisano, Vincenzo, et al. "Streamcloud: A large scale data streaming system." Distributed Computing
Systems (ICDCS), 2010 IEEE 30th International Conference on. IEEE, 2010.
14. Stonebraker, Michael, Uǧur Çetintemel, and Stan Zdonik. "The 8 requirements of real-time stream
processing." ACM SIGMOD Record 34.4 (2005): 42-47.
15. Bonomi, Flavio, et al. "Fog computing and its role in the internet of things." Proceedings of the first
edition of the MCC workshop on Mobile cloud computing. ACM, 2012.
16. Himmelsbach, Michael, et al. "LIDAR-based 3D object perception." Proceedings of 1st international
workshop on cognition for technical systems. Vol. 1. 2008.
The data streaming paradigm and its use in Fog architectures 81
82. Bibliography
17. Geiger, Andreas, et al. "Vision meets robotics: The KITTI dataset." The International Journal of Robotics Research (2013):
0278364913491297.
18. Gulisano, Vincenzo Massimiliano. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. Diss. Informatica,
2012.
19. Cardellini, Valeria, et al. "Optimal operator placement for distributed stream processing applications." Proceedings of the 10th
ACM International Conference on Distributed and Event-based Systems. ACM, 2016.
20. Costache, Stefania, et al. "Understanding the Data-Processing Challenges in Intelligent Vehicular Systems." Proceedings of the
2016 IEEE Intelligent Vehicles Symposium (IV16).
21. Cormode, Graham. "The continuous distributed monitoring model." ACM SIGMOD Record 42.1 (2013): 5-14.
22. Giatrakos, Nikos, AntoniosDeligiannakis, and Minos Garofalakis. "Scalable Approximate Query Tracking over Highly Distributed
Data Streams." Proceedings of the 2016 International Conference on Management of Data. ACM, 2016.
23. Gulisano, Vincenzo, et al. "Streamcloud: An elastic and scalable data streaming system." IEEE Transactions on Parallel and
Distributed Systems 23.12 (2012): 2351-2365.
24. Shah, Mehul A., et al. "Flux: An adaptive partitioning operator for continuous query systems." Data Engineering, 2003.
Proceedings. 19th International Conference on. IEEE, 2003.
The data streaming paradigm and its use in Fog architectures 82
83. Bibliography
25. Cederman, Daniel, et al. "Brief announcement: concurrent data structures for efficient streaming aggregation." Proceedings of
the 26th ACM symposium on Parallelism in algorithms and architectures. ACM, 2014.
26. Ji, Yuanzhen, et al. "Quality-driven processing of sliding window aggregates over out-of-order data streams." Proceedings of the
9th ACM International Conference on Distributed Event-Based Systems. ACM, 2015.
27. Ji, Yuanzhen, et al. "Quality-driven disorder handling for concurrent windowed stream queries with shared operators."
Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems. ACM, 2016.
28. Gulisano, Vincenzo, et al. "Scalejoin: A deterministic, disjoint-parallel and skew-resilient stream join." Big Data (Big Data), 2015
IEEE International Conference on. IEEE, 2015.
29. Ottenwälder, Beate, et al. "MigCEP: operator migration for mobility driven distributed complex event processing." Proceedings
of the 7th ACM international conference on Distributed event-based systems. ACM, 2013.
30. De Matteis, Tiziano, and Gabriele Mencagli. "Keep calm and react with foresight: strategies for low-latency and energy-efficient
elastic data stream processing." Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming. ACM, 2016.
31. Balazinska, Magdalena, et al. "Fault-tolerance in the Borealis distributed stream processing system." ACM Transactions on
Database Systems (TODS) 33.1 (2008): 3.
32. Castro Fernandez, Raul, et al. "Integrating scale out and fault tolerance in stream processing using operator state
management." Proceedings of the 2013 ACM SIGMOD international conference on Management of data. ACM, 2013.
The data streaming paradigm and its use in Fog architectures 83