Work presented in partial fulfillment
of the requirements for the degree of
Bachelor in Computer Science - Federal University of Rio Grande do Sul - Brazil
1. Distributed Near Real-Time Processing
of Sensor Network Data Flows
for Smart Grids
Advisor: Prof. Dr. Philippe O. A. Navaux
Co-advisor: Prof. M.Sc. Eduardo Roloff
Otávio Moraes de Carvalho
January 16, 2016
Institute of Informatics | Federal University of Rio Grande do Sul
2. Table of contents
1. Introduction
2. Background
3. Design
4. Implementation
5. Evaluation
6. Conclusion and Future work
3. Introduction
• Motivation
• Internet Ubiquity
• Ubiquity of Sensors
• Data velocity
• Smart Grids
• Objective
Provide a scalable platform for distributed near real-time processing
of sensor network data flows, focused on the data profiles of Smart
Grids
1. How to scale a distributed platform for IoT?
2. How to provide insights in near real-time?
3. How to test a platform like this?
4. Internet of Things
• Pervasiveness of sensors that have the ability to interact with each
other through unique addressing schemes and to cooperate with their
neighbours to reach common goals [Atzori et al. 2010].
Figure 1: Total units of connected devices - Gartner Inc. 2013 Forecast [?]
6. Distributed Stream Processing Systems
• Online applications that require real-time or near-real-time
processing are the main motivation.
• Low-latency alternatives to the Hadoop processing approach
(MapReduce) are needed [Pavlo et al. 2009].
• Common requirements:
1. Input streams with high to very high data rates (> 10,000
events/s).
2. Relaxed latency constraints (up to a few seconds).
3. Use cases that require correlating historical and live data.
4. Systems that scale elastically to support diverse workloads.
5. Low-overhead fault tolerance supporting out-of-order events and
exactly-once semantics.
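The requirements above can be made concrete with a toy sliding-window aggregator. This is an illustrative single-node sketch (class and method names are ours); real distributed stream processors such as Storm, Spark Streaming, and Flink partition this state across nodes and layer fault tolerance on top:

```python
from collections import deque

class SlidingWindowAverage:
    """Toy sliding-window aggregator: keeps events from the last
    `window_s` seconds and answers average queries in O(1)."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.events = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, ts, value):
        """Ingest one event and evict events that fell out of the window."""
        self.events.append((ts, value))
        self.total += value
        while self.events and self.events[0][0] <= ts - self.window_s:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def average(self):
        """Average of the values currently inside the window."""
        return self.total / len(self.events) if self.events else 0.0
```

Feeding a 10-second window with events at t=0, 5, and 12 shows the eviction at work: after the third event the first one has left the window and only the last two contribute to the average.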
7. Distributed Stream Processing Systems
• The most prominent frameworks found in the state of the art:
1. Apache Storm
2. Apache Spark Streaming
3. Apache Flink
8. Cloud Computing
• According to the NIST definition [Mell and Grance 2011], Cloud
Computing is a model that provides convenient, on-demand network
access to a shared pool of configurable computing resources that can
be rapidly provisioned and released with minimal management effort
or service provider interaction.
Figure 3: Cloud Computing service models stack and their relationships
9. Big Data
• NIST defines big data as follows: "Big data shall mean the data of
which the data volume, acquisition speed, or data representation limits
the capacity of using traditional relational methods to conduct
effective analysis or the data which may be effectively processed with
important horizontal zoom technologies" [PWG 2014].
• "3Vs" model [Laney 2001]:
1. Volume: with the increasing generation and collection of masses of
data, data scale becomes ever larger.
2. Variety: the various types of data, including semi-structured and
unstructured data such as audio, video, webpages, and text, as well
as traditional structured data.
3. Velocity: the timeliness of big data; data collection and analysis
must be conducted rapidly and in a timely manner in order to
maximize the commercial value of big data.
10. Smart Grids
• For 100 years, there has been no change in the basic structure of
the electrical power grid. Experience has shown that the
hierarchical, centrally controlled grid of the 20th century is ill-suited
to the needs of the 21st century.
• Advanced Metering Infrastructure (AMI): infrastructure for
information gathering through smart meters. Drives the need for
high throughput when using large numbers of IoT meters.
• Demand Side Management (DSM): energy generation peak
management and reduction of the need for investments in power
generation sources.
• Energy Consumption Forecasts: provide a prediction of the
amount of electricity consumed at a certain point in time. The
purpose of electricity load forecasting is efficient, economical, and
high-quality planning of energy generation. Drives the need for low
processing latency.
11. Architecture
• A few architectural patterns were found in the state of the art:
1. Lambda Architecture
2. Kappa Architecture
3. Liquid Architecture
12. Cyclic Architecture
• We propose the Cyclic architecture, a hybrid solution that combines
architectural elements from the Kappa architecture and the Liquid
architecture.
Figure 4: An overview of the proposed Cyclic Architecture
13. Dataset
1. The dataset used to evaluate the platform originates from the 8th
ACM International Conference on Distributed Event-Based Systems
(DEBS 2014).
2. The synthesized data file contains more than 4,055 million
measurements for 2125 plugs distributed across 40 houses, for a
total of 136 GB.
3. Generated measurements cover a period of one month, from Sept.
1st, 2013, 00:00:00, to Sept. 30th, 2013, 23:59:59. For our tests, we
used a subset of this file containing 100 million measurements,
with the same number of plugs and houses, for a total of
3.6 GB.
15. Forecasting Method
• The forecasting method was selected due to the need for a good fit
between the algorithm and the processing capabilities of a
distributed stream processing framework. It represents a mixed
approach between a Multilayer Perceptron (MLP) and an
Autoregressive Integrated Moving Average (ARIMA) model [?].
• More specifically, the set of queries provides a forecast of the load for:
(1) each house, i.e., house-based, and (2) each individual plug,
i.e., plug-based. The forecast for each house and plug is made
based on the current load of the connected plugs and a plug-specific
prediction model.
• The aim of these queries is not to provide the best prediction model,
but to stress the interplay between modules for model learning
that operate on long-term (historical) data and components that
apply the model on top of live, high-velocity data.
16. Forecasting Method
L(s_{i+2}) = (avgL(s_i) + median(avgL(s_j))) / 2    (1)
In formula (1), avgL(s_i) represents the current average load for the
slice s_i. In the case of plug-based prediction, the value of avgL(s_i) is
calculated as the average of all load values reported by the given plug
with timestamps ∈ s_i. In the case of house-based prediction, avgL(s_i) is
calculated as the sum of the average values for each plug within the house.
avgL(s_j) is the set of average load values for all slices s_j such that:

s_j = s_{i+2-n*k}    (2)

where k is the number of slices in a 24-hour period and n is a natural
number with values between 1 and floor((i+2)/k). The value of avgL(s_j) is
calculated analogously to avgL(s_i) in the plug-based and house-based (sum
of averages) variants.
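Equation (1) amounts to averaging the current slice with the median of the same time-of-day slices on previous days. An illustrative implementation (the function name, argument layout, and empty-history fallback are ours, not from the original platform):

```python
import statistics

def forecast(avg_current, past_same_time_avgs):
    """Forecast L(s_{i+2}) from equation (1):
    (avgL(s_i) + median({avgL(s_j)})) / 2, where past_same_time_avgs
    holds avgL(s_j) for the slices selected by equation (2)."""
    if not past_same_time_avgs:
        # No history yet (assumed fallback): use the current load as-is.
        return avg_current
    return (avg_current + statistics.median(past_same_time_avgs)) / 2.0
```

For example, with a current slice average of 10.0 and historical same-time averages [6.0, 8.0, 12.0], the median is 8.0 and the forecast is (10.0 + 8.0) / 2 = 9.0.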
19. Platform
• In order to evaluate the system, we needed a platform on which to
execute our tests. The platform was built on Microsoft
Azure to host our application, and it was configured using the
following settings:
24. Conclusion
• A system for processing distributed near real-time data flows, with
focus on Smart Grids data profiles, was successfully designed and
implemented.
• The built system is able to scale linearly up to 8 processing
nodes, which is important for processing large numbers of smart
meters.
• The system is able to provide desirable latencies, which is
important for delivering load forecasts in time to be useful. However, it
was found that very small batch sizes could make processing unstable.
• It was found that larger batch sizes improve throughput, at the
expense of latencies, which start to increase proportionally.
25. Future work
• Improvements on throughput by increasing the number of parallel
data input feeds into Apache Kafka.
• Deeper research on load forecasting and results on forecast
accuracy.
• Studies on fault-tolerance and system availability.
• Abstraction layer for machine deployment and management,
using Apache YARN or Apache Mesos with Docker containers.
27. References I
L. Atzori et al.
The Internet of Things: A Survey.
Computer networks, 54(15):2787–2805, 2010.
T. Bylander and B. Rosen.
A Perceptron-like Online Algorithm for Tracking the Median.
In International Conference on Neural Networks, volume 4,
pages 2219–2224. IEEE, 1997.
D. Laney.
3-D Data Management: Controlling Data Volume, Velocity and Variety.
META Group Research Note, 2001.
28. References II
I. Lee et al.
The Internet of Things (IoT): Applications, Investments and
Challenges for Enterprises.
Business Horizons, 2015.
P. Mell and T. Grance.
The NIST definition of Cloud Computing.
2011.
A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt,
S. Madden, and M. Stonebraker.
A Comparison of Approaches to Large-Scale Data Analysis.
In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages
165–178. ACM, 2009.
29. References III
NIST Big Data Public Working Group.
NIST Big Data Interoperability Framework: Reference Architecture.
2014.
30. D-Streams
• Treat streaming computation as a series of deterministic batch
computations on small time intervals.
• D-Streams provide traditional functional transformation operators and
introduce new stateful operators that work over multiple intervals.
These include:
• Windowing
• Incremental aggregation over sliding windows
• Time-skewed joins
Figure 11: Comparison between a simple and a windowed DStream
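The window operator on D-Streams can be modeled in a few lines of plain Python. This is a toy, single-node model under stated assumptions (a stream is represented as a list of micro-batches, one per interval; real D-Streams are distributed RDD computations in Spark Streaming, and the function name is ours):

```python
def windowed(batches, window_len, slide):
    """Model of a D-Stream window operator: given one micro-batch per
    interval, emit, every `slide` intervals, the union of the last
    `window_len` micro-batches."""
    out = []
    for i in range(0, len(batches), slide):
        window = batches[max(0, i - window_len + 1): i + 1]
        out.append([event for batch in window for event in batch])
    return out
```

With batches [[1], [2], [3], [4]], a window of 2 intervals sliding by 1 yields [[1], [1, 2], [2, 3], [3, 4]]: each output batch aggregates the current interval together with the previous one, mirroring the windowed DStream in Figure 11.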