Query-Driven Descriptive Analytics for IoT
and Edge Computing
Moysis Symeonides*, Demetris Trihinas✝, Zacharias Georgiou*,
George Pallis*, Marios D. Dikaiakos*
IEEE International Conference on Cloud Engineering (IC2E 2019)
*Department of Computer Science
University of Cyprus
✝Department of Computer Science
University of Nicosia
D. Trihinas
trihinas@cs.ucy.ac.cy
Laboratory for
Internet Computing
StreamSight - IC2E 2019
Distributed Data Processing Engines
● Frameworks like Hadoop and Spark are contributing to the democratization
of big data analytics by hiding the complexity related to:
○ Machine communication and resource management -> dealing with the
underlying infrastructure.
○ Task scheduling and supervision for analytic jobs.
○ Fault tolerance for both the infrastructure and execution state.
○ Monitoring and logging.
○ ...
The Internet of Things
● Transforming the physical world into an information system.
● 3.6 billion IoT devices are in daily use [1], and these devices are projected
to generate 500 ZB of data [2] by the end of the year (2019).
● It only seems “natural” that IoT services offload analytic jobs to the cloud
for data processing.
● But… IoT services usually come with near real-time requirements and
moving data “centrally” for processing penalizes analytics timeliness.
[1] Next big things in IoT predictions for 2020, ITPro, 2018
[2] Global Cloud Index, Cisco, 2018
Edge Computing… Saving IoT Analytics
[Figure: IoT services, the “Edge”, the Cloud, and analytic insights]
● Data processing is now possible in place, or within the local network.
○ Shorter response times for latency-critical IoT services.
○ More efficient processing by offloading “centralized” components.
● Possible because hardware for mobile/fog/edge is scaling up [1].
● But… bandwidth and battery capacity are NOT scaling at the same rate [2].
[1] EdgeIoT: Mobile Edge Computing for the Internet of Things, X. Sun et al., IEEE Communications, 2016.
[2] Low-Cost Approximate and Adaptive Monitoring Techniques for the Internet of Things, D. Trihinas et al., IEEE Trans. on Services Computing, 2018.
IoT Analytics Over the Edge
How to process enormous volumes of streaming data at
the edge to provide query-driven analytic insights while
also minimizing response times?
Query-Driven Analytics
Abstractions required for modelling knowledge extraction from data streams
Challenge 1: Expressing (ad-hoc) analytic queries
● One must have specific knowledge of the programming model of the
underlying processing engine.
[Figure: engine-specific code. Caption: Compute the average of a metric using a 60s sliding window]
● Queries are bound to the underlying processing engine (a query portability problem).
Query-Driven Analytics
Challenge 2: Geo-distributed deployments are the norm
for IoT services, not the exception
● A naive “edge” deployment can impose compute and communication
penalties for intermediate recomputations and data exchange.
[Figure: naive deployment vs. re-using intermediate results; legend: data exchange and computation, result exchange]
● Network bandwidth between geo-distributed entities is far from uniform.
Pixida: Optimizing data parallel jobs in wide-area data analytics, K. Kloudas, VLDB, 2015.
Mechanisms to avoid data movement and recomputations are needed
Outline of Today’s Talk
● IoT analytics over geo-distributed topologies.
● Abstract query model for query-driven IoT analytics.
● The StreamSight Framework
○ Query plan compilation.
○ Edge computing improvements.
○ Experimentation.
● Future research directions and open research questions.
Abstract Query Model
● Queries are applied on metric streams with the
intent to derive insights.
● Insights can be reused, transformed, and composed with
other metric streams to create new insights.
Metric record (one element of a metric stream):
<bus_id, bus99>,
<bus_delay, 5>,
<bus_region, NW>,
...
Abstract Query Model
Insight = COMPUTE <Expression> EVERY <Interval> [WITH <Optimizations>]
COMPUTE
➢The composition, transformation and aggregation of multiple metric
streams (e.g., expression, composite, aggregate).
EVERY
➢Denotes the interval at which the expression is evaluated; it can be a time
interval (e.g., every 1min) or tuple-based (e.g., every 1000 records).
WITH
➢Optional statement for capturing user-defined optimizations and
constraints for data streams and edge topologies.
[Figure: metric streams feed an expression, evaluated EVERY interval, that produces an insight stream]
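The statement template above can be illustrated with a minimal Python sketch that splits an insight description into its COMPUTE, EVERY and WITH parts. This is not StreamSight's parser: the regular expression, the `Insight` class and the `split_insight` function are hypothetical names introduced here, and any BY clause is simply kept as part of the expression text.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern for: COMPUTE <Expression> EVERY <Interval> [WITH <Optimizations>]
_INSIGHT = re.compile(
    r"^\s*COMPUTE\s+(?P<expr>.+?)\s+EVERY\s+(?P<n>\d+)\s+(?P<unit>\w+)"
    r"(?:\s+WITH\s+(?P<opts>.+?))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


@dataclass
class Insight:
    expression: str               # everything between COMPUTE and EVERY
    every: int                    # interval length
    unit: str                     # e.g. SECONDS, MINUTES, TUPLE
    optimizations: Optional[str]  # raw text after WITH, if present


def split_insight(text: str) -> Insight:
    """Split an insight description into its three clauses (no deep parsing)."""
    m = _INSIGHT.match(text)
    if m is None:
        raise ValueError("not a COMPUTE ... EVERY ... statement")
    expr = " ".join(m["expr"].split())  # collapse line breaks and extra spaces
    return Insight(expr, int(m["n"]), m["unit"].upper(), m["opts"])
```

For example, `split_insight("COMPUTE ARITHMETIC_MEAN(bus_delay, 10 MINUTES) BY city_segment EVERY 5 SECONDS")` yields an interval of 5 SECONDS and no optimizations.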
Smart City Bus Network
● Buses are equipped with GPS tracking devices that emit updates to the respective
local edge server of the region each bus is currently navigating through.
● Bus updates include: bus id, location coordinates, operating city region, an
estimation of the current bus route delay, etc.
● Inspired by Dublin smart city bus network.
Insight Operations
1. Window Operations: Aggregation of values within a time period
COMPUTE
ARITHMETIC_MEAN(bus_delay, 10 MINUTES)
EVERY 5 SECONDS
(Annotations: ARITHMETIC_MEAN = aggregate, bus_delay = raw metric stream, 10 MINUTES = time period, EVERY 5 SECONDS = time interval)
Apache Spark: 14 ops
COMPUTE
ARITHMETIC_MEAN(bus_delay, 10 MINUTES)
BY city_segment EVERY 5 SECONDS
(Annotation: BY city_segment = group by a metric key)
Apache Spark: 15 ops
Examples of aggregates: sum, count, sdev, median, percentile, etc.
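As an illustration of what a window operation computes, the following Python sketch keeps a time-based sliding window and returns the arithmetic mean of the records inside it. It is a single-process sketch under assumed semantics (records older than the window are evicted on read), not the way StreamSight or Spark execute the query; `SlidingMean` is a hypothetical name.

```python
from collections import deque


class SlidingMean:
    """Arithmetic mean of the values observed in the last `window_s` seconds."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.items = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0      # running sum of the values currently in the window

    def add(self, ts: float, value: float) -> None:
        self.items.append((ts, value))
        self.total += value

    def mean(self, now: float):
        # Evict records that fell out of the window (now - window_s, now].
        while self.items and self.items[0][0] <= now - self.window_s:
            _, old = self.items.popleft()
            self.total -= old
        return self.total / len(self.items) if self.items else None
```

A 10-minute window is `SlidingMean(600)`; calling `mean(now)` every 5 seconds would mirror the EVERY 5 SECONDS interval, and one instance per `city_segment` would mirror the BY clause.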
Insight Operations
2. Temporal Compositions: Compositions with different time windows
COMPUTE (
ARITHMETIC_MEAN(bus_delay, 10 MINUTES)
/
ARITHMETIC_MEAN(bus_delay, 60 MINUTES)
) EVERY 5 SECONDS
3. Accumulated Compositions: Updates on previously computed data
COMPUTE EWMA[0.85](passengers) BY bus_stop EVERY 1 TUPLE
Examples: running_mean, running_max, running_sdev, etc.
Apache Spark
32 ops
Apache Spark
24 ops
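An accumulated composition such as EWMA updates a previously computed value with each new tuple instead of rescanning a window. The sketch below shows the standard exponentially weighted moving average update in Python. Whether StreamSight applies the bracketed parameter to the new value or to the previous estimate is not stated on the slide, so weighting the new value by `alpha` is an assumption, and `EWMA` is a hypothetical class name.

```python
class EWMA:
    """Exponentially weighted moving average, updated one tuple at a time."""

    def __init__(self, alpha: float):
        self.alpha = alpha  # assumed here to be the weight of the newest value
        self.value = None   # no estimate until the first tuple arrives

    def update(self, x: float) -> float:
        if self.value is None:
            self.value = float(x)  # the first tuple initializes the estimate
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value
```

Grouping BY bus_stop would amount to keeping one `EWMA` instance per bus stop, for instance in a dictionary keyed by stop.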
Insight Operations
4. Hybrid Compositions: Combining window and accumulated operations
COMPUTE (
ARITHMETIC_MEAN(bus_delay, 10 MINUTES)
-
EWMA[0.65](bus_delay)
) BY city_segment EVERY 5 SECONDS
(Annotations: ARITHMETIC_MEAN = window operation, EWMA = accumulated operation)
Apache Spark: 34 ops
5. Filtered Compositions: Filter input and output streams
COMPUTE bus_delay
WHEN > ( RUNNING_MEAN(bus_delay) + 3 * RUNNING_SDEV(bus_delay) )
BY city_segment EVERY 5 SECONDS;
(Annotation: WHEN clause = filter predicate)
Apache Spark: 41 ops
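The filter predicate of the filtered composition (keep a value only when it exceeds the running mean plus three running standard deviations) can be sketched in Python with Welford's online algorithm. This is an illustrative sketch: the slide does not say whether RUNNING_SDEV is a population or sample deviation (population is assumed here), and `RunningStats` and `outliers` are hypothetical names.

```python
import math


class RunningStats:
    """Running mean and standard deviation (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def sdev(self) -> float:
        # Population standard deviation (an assumption, see lead-in).
        return math.sqrt(self.m2 / self.n) if self.n else 0.0


def outliers(values, k: float = 3.0):
    """Return values exceeding running_mean + k * running_sdev of earlier values."""
    stats = RunningStats()
    flagged = []
    for x in values:
        # Require at least two earlier values before testing the predicate.
        if stats.n >= 2 and x > stats.mean + k * stats.sdev():
            flagged.append(x)
        stats.update(x)
    return flagged
```

With delays `[5, 7, 5, 7, 5, 7, 40, 6]`, only the value 40 passes the predicate.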
Collaborative Edge Services
● Infrastructures of multiple stakeholders that are
geographically distributed
● Inspired by publicly available data from:
○ the New York transportation authority,
○ the Dublin smart city bus network and
○ Uber
● Enriched with real-time weather data from open-access
meteorological stations
● Companies, Employees and Clients can easily
submit their queries
Collaborative Edge Services
Geo-analytic Queries: a travel app user interested in finding the closest taxis or
car-sharing vehicles.
COMPUTE vehicleID
FROM (taxis, car_sharing)
WHEN GEOHASH[10](cusLoc) == GEOHASH[10](vehLoc)
EVERY 1 MINUTES
Multiple Sources: the city segment with the least number of vehicles in a
15min sliding window when the temperature drops below 10°C.
COMPUTE MIN(
COUNT(buses, 15 MINUTES) BY city_segment +
COUNT(taxis, 15 MINUTES) BY city_segment +
COUNT(sharing, 15 MINUTES) BY city_segment
) WHEN temperature <= 10
EVERY 10 MINUTES
Data-driven suggestions: the top-5 city areas based on the current and previous
month average amount (the third argument of MEAN is a 1 MONTH offset).
COMPUTE TOP_K[5] (
MEAN(total_amount, 1 MONTH)-
MEAN(total_amount, 1 MONTH, 1 MONTH )
) BY city_segment EVERY 1 HOURS
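The geo-analytic query compares geohashes of the customer and vehicle locations. As background, the sketch below implements the standard public geohash encoding in Python (interleaved longitude and latitude bits, base-32 alphabet). It assumes the bracketed value in GEOHASH[10] is the hash length in characters, which the slide does not state, and `geohash` and `same_cell` are hypothetical helper names.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash(lat: float, lon: float, precision: int = 10) -> str:
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_rng = [-90.0, 90.0]
    lon_rng = [-180.0, 180.0]
    chars = []
    bits = 0
    nbits = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        rng, val = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            nbits = 0
    return "".join(chars)


def same_cell(a, b, precision: int = 10) -> bool:
    """True when two (lat, lon) points fall in the same geohash cell."""
    return geohash(a[0], a[1], precision) == geohash(b[0], b[1], precision)
```

Equality at a given precision only means both points share a grid cell; two close points on opposite sides of a cell border would not match, which a production matcher would need to handle.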
StreamSight Framework
Specification, compilation, and execution of streaming IoT analytic
queries on distributed processing engines optimized for edge computing
environments.
StreamSight: A Query-Driven Framework for Streaming Analytics in Edge Computing. Z. Georgiou et al., IEEE/ACM UCC, 2018.
[Figure: processing engines currently supported and future adapters]
Query Model Translation
Insight Description:
COMPUTE
ARITHMETIC_MEAN(bus_delay, 10 MINUTES)
BY city_segment EVERY 5 SECONDS
Abstract Syntax Tree:
● Nodes correspond to a
grammar rule of the language
● Leaves are the tokens and
symbols of the language
● Parser performs early validation to verify syntactic correctness of query.
Compilation Phase
● Constructs the Query Execution Plan, assembling
the pipeline of stream operations from the
AST representation.
● A recursive algorithm traverses the AST.
● Each node is mapped to a stream operation
of the underlying processing engine.
● Naive AST mapping is extremely inefficient because it
ignores the geo-distributed nature of edge realms:
○ Unnecessary intermediate re-computations
○ Increased data movement
● The AST mapping must account for these.
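The recursive traversal described for the compilation phase can be sketched as a post-order walk that emits one operation per AST node. This Python sketch only produces operation labels; the `Node` structure, the node kinds and `compile_plan` are hypothetical and stand in for the engine-specific stream operators that the real compiler targets.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                       # hypothetical node kind, e.g. "window_mean"
    value: str = ""                 # token attached to the node
    children: List["Node"] = field(default_factory=list)


def compile_plan(node: Node) -> List[str]:
    """Post-order traversal: compile the children first, then the node itself."""
    plan: List[str] = []
    for child in node.children:
        plan.extend(compile_plan(child))
    plan.append(f"{node.kind}({node.value})")
    return plan


# AST for: COMPUTE ARITHMETIC_MEAN(bus_delay, 10 MINUTES) BY city_segment EVERY 5 SECONDS
example_ast = Node("every", "5 SECONDS", [
    Node("group_by", "city_segment", [
        Node("window_mean", "10 MINUTES", [Node("metric", "bus_delay")]),
    ]),
])
```

Calling `compile_plan(example_ast)` lists the metric source first and the EVERY trigger last, which is the naive one-operation-per-node mapping that the slide warns about.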
System Optimizations
Reusing intermediate results
● StreamSight caches and broadcasts expressions, composites and results
across worker nodes to reduce unnecessary re-computations.
Insight 1: Calculate the current average bus_delay
Insight 2: Calculate the ratio between the current and last-hour bus_delay
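The two insights share one sub-expression, the current average bus_delay. The Python sketch below shows the caching idea on a single process: the shared average is computed once and reused by the second insight. It does not model broadcasting across worker nodes, and `ResultCache`, `evaluate_insights` and the cache keys are hypothetical.

```python
class ResultCache:
    """Caches intermediate results by key and counts actual computations."""

    def __init__(self):
        self.store = {}
        self.misses = 0  # how many times a value really had to be computed

    def get(self, key, compute):
        if key not in self.store:
            self.misses += 1
            self.store[key] = compute()
        return self.store[key]


def mean(values):
    return sum(values) / len(values)


def evaluate_insights(last_10min, last_60min):
    """Evaluate both insights while sharing the 10-minute average."""
    cache = ResultCache()

    def current():
        return cache.get(("mean", "bus_delay", "10min"), lambda: mean(last_10min))

    insight1 = current()  # Insight 1: current average
    hourly = cache.get(("mean", "bus_delay", "60min"), lambda: mean(last_60min))
    insight2 = current() / hourly  # Insight 2 reuses the cached current average
    return insight1, insight2, cache.misses
```

Here three aggregate lookups cost only two computations; with many insights over the same streams the saving grows accordingly.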
User Optimizations
Sampling enables the execution of an insight description on a portion of the
streamed measurements for approximate but timely answers (k << N)
● Uniform Sampling
● Weighted Hierarchical Reservoir Sampling (WHRS) [1]
○ Applies reservoir + stratified sampling on the fly
StreamSight allows the user to prioritize insights
● Under high-load influx or network uncertainties, critical queries are not
delayed while less important ones are queued.
[1] ApproxIoT: Approximate Analytics for Edge Computing, Z. Wen et al., ICDCS, 2018
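To make the k << N idea concrete, here is plain reservoir sampling (Algorithm R) in Python, which keeps a uniform sample of k items from a stream of unknown length. Note that this is the basic building block only, not WHRS: it has no weights, strata or hierarchy, and `reservoir_sample` is a hypothetical helper name.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace with probability k / (i + 1)
    return reservoir
```

An aggregate such as ARITHMETIC_MEAN can then be evaluated on the k sampled records rather than on all N, trading accuracy for processing time.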
User Optimizations
Prioritization: on high-load influx, critical queries are not delayed.
COMPUTE MAX(taxis_fare_amount, 60 MINUTES)
BY city_segment EVERY 1 MINUTES
WITH SALIENCE 1
(Annotation: SALIENCE = priority, higher is better)
Uniform Sampling: query execution on a portion of the data stream.
COMPUTE ARITHMETIC_MEAN(bus_delay, 60 MINUTES)
BY stop_id EVERY 5 MINUTES
WITH SALIENCE 1 AND SAMPLE 0.2
Sampling with Error Margin & Confidence: query execution with bounded error
guarantees for sampling.
COMPUTE
ARITHMETIC_MEAN(taxi_passengers, 10 MINUTES)
EVERY 30 SECONDS
WITH MAX_ERROR 0.05 AND CONFIDENCE 0.95
(Annotations: MAX_ERROR = error upper bound, CONFIDENCE = confidence level)
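The effect of SALIENCE can be pictured as a priority queue of pending insights: under load, the insight with the highest salience is served first and the rest wait. The Python sketch below is an assumed scheduling model for illustration only; StreamSight's actual scheduler is not described on the slide, and `InsightQueue` is a hypothetical class.

```python
import heapq
from itertools import count


class InsightQueue:
    """Serves pending insights by descending salience, FIFO among equals."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker that preserves submission order

    def submit(self, name: str, salience: int) -> None:
        # heapq is a min-heap, so the salience is negated (higher is better).
        heapq.heappush(self._heap, (-salience, next(self._order), name))

    def next_insight(self):
        """Pop the most salient pending insight, or None when the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

Submitting `mean_delay` with salience 1 and then `max_fare` with salience 5 serves `max_fare` first even though it arrived later.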
User Optimizations
Dedicated Execution: execution of crucial queries on dedicated nodes.
COMPUTE COUNT(taxis)
BY city_segment
EVERY 1 SECONDS
WITH ALLOW ON DEDICATED[5]
(Annotation: DEDICATED[5] = number of dedicated nodes)
Awareness on Computations: try to minimize the computations. Minimize the
computation footprint of execution for less significant queries but at the same
time keep the error less than 5%.
COMPUTE
PEWMA[0.5](bus_delay) BY bus_id
EVERY 30 SECONDS
WITH MAX_ERROR 0.05 AND CONFIDENCE 0.95
AND AWARENESS ON COMPUTATIONS
Accuracy Aware Execution: try to maximize the accuracy. Only in high-influx
periods sacrifice a portion of the accuracy, but keep the error less than 5%.
COMPUTE
PEWMA[0.5](bus_delay) BY bus_id
EVERY 30 SECONDS
WITH MAX_ERROR 0.05 AND CONFIDENCE 0.95
AND AWARENESS ON ACCURACY
Evaluation
Dublin Bus Workload
Real-World Datasets
● Dublin Smart City Buses Network[1]
○ 968 Buses (Jan 2014)
○ 16 metrics/record, including: bus_id, bus_delay, city_segment
○ Used 7 insights of actual interest for Bus operators
[1] Dublin, “Smart City ITS,” https://data.smartdublin.ie/, 2018
16 Edge servers
● 1 vCPU, 1GB MEM, 2↑ 16↓ Mbps
Evaluation Metric
● Batch Processing Time
[Figure: batch processing time, with unstable-system and stable-system regions marked]
➢StreamSight achieved 1.4x speedup over the baseline
➢StreamSight+WHRS achieved 4.3x speedup over the baseline
Re-usage of Intermediate Results
● Dublin Bus Workload
● Average Processing Time ( Fixed Input rate 700 req/s )
StreamSight DOES NOT
incur a performance
overhead
Baseline configuration failed
Reuse of Intermediate Results
● Same composition across different insights: different queries but with common
operators.
● Same operators across different compositions: e.g., MEAN is composed of a
SUM divided by a COUNT. If either SUM or COUNT is available, reuse it.
● Same composition across different offsets [1]: in the query below, we can cache
and reuse the composition for 10 minutes.
COMPUTE
ARITHMETIC_MEAN(consumption, 10 MINUTES)/
ARITHMETIC_MEAN(consumption, 10 MINUTES, 10 MINUTES)
EVERY 15 MINUTES
● Re-use insights across users: involves tracking shared results across deployments
and users, privacy protection, etc. (possibly use of blockchain?)
[1] SlickDeque: High Throughput and Low Latency Incremental Sliding-Window Aggregation. A. Shein et al., EDBT, 2018.
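The operator-level reuse mentioned above (MEAN as SUM divided by COUNT) can be sketched in Python as follows. The sketch caches each primitive aggregate of a window so that a later MEAN costs no further pass over the data when SUM or COUNT was already requested. `WindowAggregates` and its method names are hypothetical, and the `scans` counter exists only to make the saving visible.

```python
class WindowAggregates:
    """Primitive aggregates of one window, cached so composites can reuse them."""

    def __init__(self, values):
        self.values = list(values)
        self.cache = {}
        self.scans = 0  # number of passes made over the raw values

    def _cached(self, name, fn):
        if name not in self.cache:
            self.scans += 1
            self.cache[name] = fn(self.values)
        return self.cache[name]

    def total(self):
        return self._cached("SUM", sum)

    def count(self):
        return self._cached("COUNT", len)

    def mean(self):
        # MEAN is composed from SUM and COUNT, reusing whichever is already cached.
        return self.total() / self.count()
```

If another insight has already asked for `total()`, a subsequent `mean()` triggers only the COUNT pass.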
Query Execution Placement
● Query model operators: DEDICATED, SALIENCE, ALLOW ON, AWARENESS, etc.
● Still… fog, device, and user mobility and network uncertainties affect the QoS,
cost, and energy consumption of IoT services.
● Analytics job scheduling requires “intelligent” consideration of data placement
when orchestrating dynamic IoT services.
● Ignoring this can result in IoT services placed for optimal responsiveness but
failing to guarantee timely insight refreshment.
ADMin: Adaptive Monitoring Dissemination for the Internet of Things, D. Trihinas et al., IEEE INFOCOM, 2017.
Multiple and Heterogeneous Data Processing Engines
● Moving to the “edge” means that not only the data sources but possibly
even the data processing engines are diverse.
● These engines must “speak” the same language.
● Open specification vs. federation layer?
OpenFog Consortium and OpenEdge Initiative
Data-less Query Execution
● Do we always need to actually compute the answer on the entire data?
○ Sampling…
○ Yes, but we need bounded approximations… and these approximations must
be computed efficiently across geo-distributed environments.
■ Beware… substituting one computation with another must be beneficial in
terms of performance (e.g., multivariate and dependent metrics) [1].
● Do we always need to actually compute the answer?
○ Or... can a bounded approximation based on recent history be satisfactory [2]?
[1] ATMoN: Adapting the ”Temporality” in Large-Scale Dynamic Networks, D Trihinas et al, IEEE ICDCS, 2018.
[2] Towards intelligent distributed data systems for scalable efficient and accurate analytics, P. Triantafyllou et al, IEEE ICDCS, 2018.
Security & Privacy
● Query model provides provisions for data confidentiality, restricted access
control and data movement constraints across geo-locations.
● Offloading sensitive data to the cloud hinders man-in-the-middle attacks… on
the other hand… processing “in place” hinders attacks (e.g., DDoS) on “easier”
attacking planes (e.g., low-power IoT devices).
● Query model NOT enough… geo-distributed analytics requires task scheduling
algorithms to acknowledge privacy-aware compute… How to do this efficiently?
COMPUTE patient_stream
EVERY 5 MINUTES
WITH ALLOW
WHEN MEAN( heart_beat, 1 MINUTES ) >= 190
AND doctor_id IN (doctor_ids)
AND region == clinic_region
(Annotation: the WHEN clause is the evaluation rule)
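The evaluation rule of the query above can be read as a simple conjunction of three conditions. The Python sketch below restates it as a predicate; it is an illustration of the rule's logic only, with hypothetical parameter names, and says nothing about how StreamSight enforces access control.

```python
from statistics import mean


def allow_access(heart_beats_last_min, doctor_id, region, doctor_ids, clinic_region):
    """True only when all three conditions of the evaluation rule hold."""
    return (
        bool(heart_beats_last_min)                # no readings means no release
        and mean(heart_beats_last_min) >= 190     # MEAN(heart_beat, 1 MINUTES) >= 190
        and doctor_id in doctor_ids               # doctor_id IN (doctor_ids)
        and region == clinic_region               # region == clinic_region
    )
```

For instance, readings of 195 and 200 requested by an authorized doctor inside the clinic region are released, while the same request from another region is denied.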
Conclusion
● Abstract query model for query-driven IoT analytics
○ Use cases (smart city, energy, health, microservices) illustrating value of the query model.
● A prototype framework called StreamSight
○ A framework for the specification, compilation, and execution of streaming analytic
queries on the “Edge” .
○ Optimizations (reduce compute and network load on the Edge):
■ Intermediate results
■ User optimizations
○ StreamSight can achieve up to 4.3x speedup compared to a naive deployment.
● Many open research challenges for geo-distributed and query-driven
analytics in edge/fog topologies.
THANK YOU
This work is partially supported by the European Commission in terms of
Unicorn 731846 H2020 project (H2020-ICT-2016-1)
Download StreamSight at: https://github.com/UCY-LINC-LAB/StreamSight.git
Energy Consumption in Micro-DCs
● Micro-DCs, also denoted as Green-DCs, powered by:
○ National electricity providers and
○ Photovoltaic power harvesting stations placed near to the DCs
● A wide range of sensors are placed in all datacenter racks and the
photovoltaic stations, which generate measurements like:
○ Temperature and energy consumption per data center, per rack or per
node
○ Energy generation per photovoltaic panel
○ Weather data from the station, such as humidity, wind, temperature, etc.
● Inspired by ENEDI project http://enedi.eu
ENEDI: Energy Saving in Datacenters, Tryfonos et al, IEEE Global IoT, 2018.
Insight Prioritization
● Dublin Bus Workload
● Average Processing Time (fixed workload)
● 1 Insight with high priority and 3 insights with low priority
● Introduced artificial latency (x2) between worker nodes
● Non-prioritized queries are queued
● Prioritized insight experiences no delay

StreamSight - Query-Driven Descriptive Analytics for IoT and Edge Computing

  • 1.
    Query-Driven Descriptive Analyticsfor IoT and Edge Computing Moysis Symeonides*, Demetris Trihinas✝, Zacharias Georgiou*, George Pallis*, Marios D. Dikaiakos* IEEE International Conference on Cloud Engineering (IC2E 2019) *Department of Computer Science University of Cyprus ✝Department of Computer Science University of Nicosia
  • 2.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 Distributed Data Processing Engines 2 2 ● Frameworks like Hadoop and Spark are contributing to the democratization of big data analytics by hiding the complexity related to: ○ Machine communication and resource management -> dealing with the underlying infrastructure. ○ Task scheduling and supervision for analytic jobs. ○ Fault tolerance for both the infrastructure and execution state. ○ Monitoring and logging. ○ ...
  • 3.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 ● Transforming the physical world into an information system. ● 3.6 Billion IoT devices are being used daily1 with these devices projected to generate 500 ZB of data2 by the end of the year (2019). The Internet of Things 3 ● It only seems “natural” that IoT services offload analytic jobs to the cloud for data processing. ● But… IoT services usually come with near real-time requirements and moving data “centrally” for processing penalizes analytics timeliness. [1] Next big things in IoT predictions for 2020, ITPro, 2018 [2] Global Cloud Index, Cisco, 2018 Analytic Insights IoT services
  • 4.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 Edge Computing… Saving IoT Analytics 4 Cloud Analytic Insights IoT services The “Edge” ● Data processing now possible in place -or within- local network. ○ Shorter response times for latency critical IoT services. ○ More efficient processing by offloading “centralized” components. ● Possible because hardware for mobile/fog/edge is scaling-up1. ● But… bandwidth and battery capacity NOT scaling at same rate2. [1] EdgeIoT: Mobile Edge Computing for the Internet of Things, X. Sun et al, IEEE Communications, 2016. [2] Low-Cost Approximate and Adaptive Monitoring Techniques for the Internet of Things, D. Trihinas et al., IEEE, Trans. on Services Computing, 2018.
  • 5.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 IoT Analytics Over the Edge 5 Cloud Analytic Insights IoT services The “Edge” How to process enormous volumes of streaming data at the edge to provide query-driven analytic insights while also minimizing response times?
  • 6.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 6 Query-Driven Analytics Abstractions required for modelling knowledge extraction from data streams Challenge 1: Expressing (ad-hoc) analytic queries ● One must have specific knowledge of the programming model of the underlying processing engine. ... ... Compute the average of a metric using a 60s sliding window ● Queries are bounded to the underlying processing engine (query portability).
  • 7.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 7 Query-Driven Analytics ● A naive “edge” deployment can impose compute and communication penalties for intermediate recomputations and data exchange. Challenge 2: Geo-distributed deployments are the norm for IoT services not the exception dnR1 = data exchange and computation R1R2 = result exchange ...d1 dnR2 = d1 dnR1 = ...d1 Naive Deployment ... Re-using intermediate results ...+ ...+ ● Network bandwidth between geo-distributed entities is far from uniform. Pixida: Optimizing data parallel jobs in wide-area data analytics , K. Kloudas, VLDB, 2015. Mechanisms to avoid data movement and recomputations are needed
  • 8.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 Outline of Today’s Talk 8 8 ● IoT analytics over geo-distributed topologies. ● Abstract query model for query-driven IoT analytics. ● The StreamSight Framework ○ Query plan compilation. ○ Edge computing improvements. ○ Experimentation. ● Future research directions and open research questions.
  • 9.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 9 Abstract Query Model ● Queries are applied on metric streams with the intent to derive insights. ● Insights can be reused-transformed-composed with other metric streams to create new insights. <bus_id, bus99>, <bus_delay, 5> <bus_region, NW> ... Metric Record Metric Stream
  • 10.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 10 Abstract Query Model Insight = COMPUTE <Expression> EVERY <Interval> [WITH Optimizations>] COMPUTE ➢The composition, transformation and aggregation of multiple metric streams (e.g., expression, composite, aggregate). EVERY ➢Denotes the interval the expression is evaluated and can be a time interval (e.g., every 1min) or tuple-based (e.g., every 1000 records). WITH ➢Optional statement for capturing user-defined optimizations and constraints for data streams and edge topologies. Metric Stream Expression Insight Stream Metric Stream ... EVERY
  • 11.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 11 Smart City Bus Network edge server ● Buses equipped with GPS tracking devices emitting updates to respected local edge server of the current region it is navigating through. ● Bus updates include: bus id, location coordinates, operating city region, an estimation of the current bus route delay, etc. ● Inspired by Dublin smart city bus network.
  • 12.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 12 Insight Operations 1. Window Operations: Aggregation of values within a time period COMPUTE ARITHMETIC_MEAN(bus_delay, 10 MINUTES) EVERY 5 SECONDS Raw metric stream Time periodAggregate Time Interval COMPUTE ARITHMETIC_MEAN(bus_delay, 10 MINUTES) BY city_segment EVERY 5 SECONDS Group by a metric key Examples of Aggregates: sum, count, sdev, median, percentile,etc. Apache Spark 14 ops Apache Spark 15 Ops
  • 13.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 13 Insight Operations 2. Temporal Compositions: Compositions with different time windows COMPUTE ( ARITHMETIC_MEAN(bus_delay, 10 MINUTES) / ARITHMEIC_MEAN (bus_delay, 60 MINUTES) ) EVERY 5 SECONDS 3. Accumulated Compositions: Updates on previously computed data COMPUTE EWMA[0.85](passengers) BY bus_stop EVERY 1 TUPLE Examples: running_mean, running_max, running_sdev, etc. Apache Spark 32 ops Apache Spark 24 ops
  • 14.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 COMPUTE bus_delay WHEN > ( RUNNING_MEAN(bus_delay) + 3 * RUNNING_SDEV(bus_delay) ) BY city_segment EVERY 5 SECONDS; 14 Insight Operations 4. Hybrid Compositions: Combing window and accumulated operations COMPUTE ( ARITHMETIC_MEAN( bus_delay, 10 MINUTES) - EWMA[0.65]( bus_delay) ) BY city_segment EVERY 5 SECONDS 5. Filtered Compositions: Filter input and output streams Window Operation Accumulated Operation Filter Predicate Apache Spark 34 ops Apache Spark 41 ops
  • 15.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 15 Collaborative Edge Services ● Infrastructures of multiple stakeholders that are geographically distributed ● Inspired by publically available data from: ○ the New York transportation authority, ○ the Dublin smart city bus network and ○ Uber ● Endorsed with real-time weather data from open- access meteorological stations ● Companies, Employees and Clients can easily submit their queries
  • 16.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 16 Collaborative Edge Services COMPUTE vehicleID FROM (taxis, car_sharing) WHEN GEOHASH[10](cusLoc) == GEOHASH[10](vehLoc) EVERY 1 MINUTES Geo-analytic Queries Travel app user interested in finding closest taxis or car-sharing vehicles. Multiple Sources The city segment with least number of vehicles in a 15min sliding window when the temperature drops below 10◦C COMPUTE MIN( COUNT(buses, 15 MINUTES) BY city_segment + COUNT(taxis, 15 MINUTES) BY city_segment + COUNT(sharing, 15 MINUTES) BY city_segment ) WHEN temperature <= 10 EVERY 10 MINUTES COMPUTE TOP_K[5] ( MEAN(total_amount, 1 MONTH)- MEAN(total_amount, 1 MONTH, 1 MONTH ) ) BY city_segment EVERY 1 HOURS The top-5 city areas based on current and previous month average amount. 1 MONTH offset Data-driven suggestions
  • 17.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 Outline of Today’s Talk 17 17 ● IoT analytics over geo-distributed topologies. ● Abstract query model for query-driven IoT analytics. ● The StreamSight Framework ○ Query plan compilation. ○ Edge computing improvements. ○ Experimentation. ● Future research directions and open research questions.
  • 18.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 18 Specification, compilation, and execution of streaming IoT analytic queries on distributed processing engines optimized for edge computing environments. StreamSight Framework StreamSight: A Query-Driven Framework for Streaming Analytics in Edge Computing. Z. Georgiou et al, IEEE/ACM UCC, 2018. Currently Supporting Future Adapters ...
  • 19.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 COMPUTE ARITHMETIC_MEAN( bus_delay, 10 MINUTES) BY city_segment EVERY 5 SECONDS 19 Query Model Translation ● Nodes correspond to a grammar rule of the language ● Leaves are the tokens and symbols of the language Insight Description Abstract Syntax Tree ● Parser performs early validation to verify syntactic correctness of query.
  • 20.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 ● Constructs Query Execution Plan, assembling the pipeline of stream operations from the AST representation. 20 Compilation Phase ● A recursive algorithm traverses the AST ● Each node is mapped to a stream operation of the underlying processing engine Abstract Syntax Tree ● Naive AST Mapping... extremely inefficient by ignoring geo-distributed nature of edge realms ○ Unnecessary intermediate re-computations ○ Increased data movement ● AST must acknowledge these. ... ...
  • 21.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 - System Optimizations 21 Reusing intermediate results ● StreamSight caches and broadcasts across worker nodes expressions, composites and results to reduce unnecessary re-computations. Insight 1: Calculate current average bus_delay Insight 2: Calculate the ratio between current and last hour bus_delay
  • 22.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 User Optimizations 22 [1] ApproxIoT: Approximate Analytics for Edge Computing, Z. Wen et al, ICDCS, 2018 Sampling enables the execution of an insight description on a portion of the streamed measurements for approximate but in time answers (k <<N) ● Uniform Sampling ● Weighted Hierarchical Reservoir Sampling (WHRS)1 ● Applies on the fly reservoir + stratified sampling StreamSight allows the user to prioritize insights ● On high-load influx or network uncertainties critical queries are not delayed while less important are queued.
  • 23.
    D. Trihinas trihinas@cs.ucy.ac.cy Laboratory for InternetComputing StreamSight - IC2E 2019 23 User Optimizations COMPUTE MAX(taxis_fare_amount, 60 MINUTES) BY city_segment EVERY 1 MINUTES WITH SALIENCE 1 Priority Higher is better Sampling with Error Margin & Confidence: COMPUTE ARITHMETIC_MEAN(taxi_passengers, 10 MINUTES) EVERY 30 SECONDS WITH MAX_ERROR 0.05 AND CONFIDENCE 0.95 Error upper bound Confidence Interval COMPUTE ARITHMETIC_MEAN(bus_delay, 60 MINUTES) BY stop_id EVERY 5 MINUTES WITH SALIENCE 1 AND SAMPLE 0.2 Prioritization On high-load influx critical queries are not delayed Uniform Sampling Query execution on a portion of the data stream Query execution with bounded error guarantees for sampling
  • 24.
    User Optimizations
    ● Dedicated execution: crucial queries execute on dedicated nodes. The bracketed value is the number of dedicated nodes.
        COMPUTE COUNT(taxis) BY city_segment EVERY 1 SECONDS WITH ALLOW ON DEDICATED[5]
    ● Awareness on computations: try to minimize the computations. This minimizes the computation footprint of less significant queries while keeping the error below 5%.
        COMPUTE PEWMA[0.5](bus_delay) BY bus_id EVERY 30 SECONDS WITH MAX_ERROR 0.05 AND CONFIDENCE 0.95 AND AWARENESS ON COMPUTATIONS
    ● Accuracy-aware execution: try to maximize the accuracy. Only in high-influx periods is a portion of the accuracy sacrificed, and the error still stays below 5%.
        COMPUTE PEWMA[0.5](bus_delay) BY bus_id EVERY 30 SECONDS WITH MAX_ERROR 0.05 AND CONFIDENCE 0.95 AND AWARENESS ON ACCURACY
  • 25.
    Evaluation
  • 26.
    Dublin Bus Workload
    Real-World Datasets
    ● Dublin Smart City Buses Network [1]
    ○ 968 buses (Jan 2014)
    ○ 16 metrics per record, including bus_id, bus_delay, city_segment
    ○ Used 7 insights of actual interest to bus operators
    16 Edge Servers
    ● 1 vCPU, 1 GB memory, 2 Mbps up / 16 Mbps down
    Evaluation Metric
    ● Batch processing time
    [Figure: batch processing time, with the unstable-system and stable-system regions marked]
    ➢ StreamSight achieved a 1.4x speedup over the baseline.
    ➢ StreamSight+WHRS achieved a 4.3x speedup over the baseline.
    [1] Dublin, “Smart City ITS,” https://data.smartdublin.ie/, 2018
  • 27.
    Re-usage of Intermediate Results
    ● Dublin Bus Workload
    ● Average processing time (fixed input rate of 700 req/s)
    ● StreamSight DOES NOT incur a performance overhead.
    ● The baseline configuration failed.
  • 28.
    Outline of Today’s Talk
    ● IoT analytics over geo-distributed topologies.
    ● Abstract query model for query-driven IoT analytics.
    ● The StreamSight Framework
    ○ Query plan compilation.
    ○ Edge computing improvements.
    ○ Experimentation.
    ● Future research directions and open research questions.
  • 29.
    Reusage of Intermediate Results
    ● Same composition across different insights: different queries but with common operators.
    ● Same operators across different compositions: e.g., MEAN is composed of a SUM divided by a COUNT. If either SUM or COUNT is available, reuse it.
    ● Same composition across different offsets [1]: in the query below, the composition can be cached and reused for 10 minutes.
        COMPUTE ARITHMETIC_MEAN(consumption, 10 MINUTES) / ARITHMETIC_MEAN(consumption, 10 MINUTES, 10 MINUTES) EVERY 15 MINUTES
    ● Re-use insights across users: involves tracking shared results across deployments and users, privacy protection, etc. (possibly with the use of blockchain?)
    [1] SlickDeque: High Throughput and Low Latency Incremental Sliding-Window Aggregation, A. Shein et al., EDBT, 2018.
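The "MEAN is a SUM divided by a COUNT" reuse can be illustrated with a small cache keyed by operator and metric. This is a sketch under simplifying assumptions: `ResultCache` and `mean` are hypothetical names, the cache lives in one process rather than being broadcast across worker nodes, and a real key would also need to identify the window.

```python
from typing import Callable, Dict, List, Tuple


class ResultCache:
    """Hypothetical cache of intermediate results, keyed by (operator, metric)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], float] = {}
        self.computations = 0  # how many results were actually computed

    def get_or_compute(self, op: str, metric: str, fn: Callable[[], float]) -> float:
        key = (op, metric)
        if key not in self._store:
            self._store[key] = fn()
            self.computations += 1
        return self._store[key]


def mean(cache: ResultCache, metric: str, window: List[float]) -> float:
    """Compose MEAN from SUM and COUNT, reusing whichever is already cached."""
    total = cache.get_or_compute("SUM", metric, lambda: float(sum(window)))
    count = cache.get_or_compute("COUNT", metric, lambda: float(len(window)))
    return total / count


window = [4.0, 6.0, 8.0]
cache = ResultCache()

# An earlier insight already needed SUM(bus_delay) over this window.
cache.get_or_compute("SUM", "bus_delay", lambda: float(sum(window)))

# MEAN reuses the cached SUM and only computes COUNT.
avg = mean(cache, "bus_delay", window)
```

After both calls only two results were computed in total, one SUM and one COUNT, even though SUM was requested twice.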
  • 30.
    Query Execution Placement
    ● Query model operators: DEDICATED, SALIENCE, ALLOW ON, AWARENESS, etc.
    ● Still, fog-device-user mobility and network uncertainties affect the QoS, cost, and energy consumption of IoT services.
    ● Analytics job scheduling requires “intelligent” consideration of data placement when orchestrating dynamic IoT services.
    ● Ignoring this can result in IoT services that are placed for optimal responsiveness but fail to guarantee timely insight refreshment.
    ADMin: Adaptive Monitoring Dissemination for the Internet of Things, D. Trihinas et al., IEEE INFOCOM, 2017.
  • 31.
    Multiple and Heterogeneous Data Processing Engines
    ● Moving to the “edge” means that not only the data sources are diverse but possibly even the data processing engines.
    ● These engines must “speak” the same language.
    ● Open specification vs. federation layer?
    OpenFog Consortium and OpenEdge Initiative
  • 32.
    Data-less Query Execution
    ● Do we always need to actually compute the answer on the entire data?
    ○ Sampling…
    ○ Yes, but we need bounded approximations, and these approximations must be computed efficiently across geo-distributed environments.
    ■ Beware: substituting one computation with another must be beneficial in terms of performance (e.g., multivariate and dependent metrics) [1].
    ● Do we always need to actually compute the answer?
    ○ Or can a bounded approximation based on recent history be satisfactory? [2]
    [1] ATMoN: Adapting the “Temporality” in Large-Scale Dynamic Networks, D. Trihinas et al., IEEE ICDCS, 2018.
    [2] Towards intelligent distributed data systems for scalable efficient and accurate analytics, P. Triantafyllou et al., IEEE ICDCS, 2018.
  • 33.
    Security & Privacy
    ● The query model provides provisions for data confidentiality, restricted access control and data movement constraints across geo-locations.
    ● Offloading sensitive data to the cloud exposes it to man-in-the-middle attacks. On the other hand, processing “in place” invites attacks (e.g., DDoS) on “easier” attack surfaces (e.g., low-power IoT devices).
    ● The query model is NOT enough: geo-distributed analytics requires task scheduling algorithms that acknowledge privacy-aware compute. How to do this efficiently?
    Evaluation rule:
        COMPUTE patient_stream EVERY 5 MINUTES WITH ALLOW WHEN MEAN(heart_beat, 1 MINUTES) >= 190 AND doctor_id IN (doctor_ids) AND region == clinic_region
  • 34.
    Conclusion
    ● Abstract query model for query-driven IoT analytics
    ○ Use cases (smart city, energy, health, microservices) illustrating the value of the query model.
    ● A prototype framework called StreamSight
    ○ A framework for the specification, compilation, and execution of streaming analytic queries on the “Edge”.
    ○ Optimizations that reduce compute and network load on the Edge:
    ■ Intermediate results
    ■ User optimizations
    ○ StreamSight can achieve up to a 4.3x speedup compared to a naive deployment.
    ● Many open research challenges for geo-distributed and query-driven analytics in edge/fog topologies.
  • 35.
    THANK YOU
    This work is partially supported by the European Commission under the Unicorn 731846 H2020 project (H2020-ICT-2016-1).
    Download StreamSight at: https://github.com/UCY-LINC-LAB/StreamSight.git
  • 36.
    Energy Consumption in Micro-DCs
    ● Micro-DCs, also denoted as Green-DCs, are powered by:
    ○ National electricity providers, and
    ○ Photovoltaic power harvesting stations placed near the DCs
    ● A wide range of sensors are placed in all datacenter racks and in the photovoltaic stations, generating measurements such as:
    ○ Temperature and energy consumption per data center, per rack or per node
    ○ Energy generation per photovoltaic panel
    ○ Weather data from the station, such as humidity, wind and temperature
    ● Inspired by the ENEDI project, http://enedi.eu
    ENEDI: Energy Saving in Datacenters, Tryfonos et al., IEEE Global IoT, 2018.
  • 37.
    Insight Prioritization
    ● Dublin Bus Workload
    ● Average processing time in seconds (fixed workload)
    ● 1 insight with high priority and 3 insights with low priority
    ● Artificial latency (x2) was introduced between worker nodes.
    ● Non-prioritized queries are queued.
    ● The prioritized insight experiences no delay.
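The queueing behaviour in this experiment (one high-priority insight, three low-priority ones) can be mimicked with a priority queue ordered by salience. This is only a sketch of the scheduling idea: `InsightQueue` and the insight names are hypothetical, and the slides do not describe how StreamSight implements its queue.

```python
import heapq
import itertools
from typing import List, Tuple


class InsightQueue:
    """Hypothetical scheduler queue: higher salience runs first,
    ties are served in submission order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()

    def submit(self, name: str, salience: int) -> None:
        # heapq is a min-heap, so negate salience to pop the highest first.
        heapq.heappush(self._heap, (-salience, next(self._counter), name))

    def next_insight(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = InsightQueue()
for name, salience in [("low_a", 0), ("critical", 1), ("low_b", 0), ("low_c", 0)]:
    queue.submit(name, salience)

# Drain the queue: the critical insight is served before the queued ones.
order = [queue.next_insight() for _ in range(len(queue))]
```

The submission counter keeps equal-salience insights in arrival order, so low-priority queries are delayed but not reordered among themselves.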