This document provides an overview of data analytics tools and frameworks for IoT, including Apache Hadoop, Apache Spark, Apache Storm, and NETCONF-YANG. It discusses using Hadoop MapReduce for batch data analysis, Apache Oozie for workflow scheduling, Apache Spark for fast in-memory processing, and Apache Storm for real-time streaming data analysis. Tools for deploying IoT systems, such as Chef and Puppet, are also mentioned, and case studies, including structural health monitoring, are provided as examples of applying these technologies.
• Introduction to Data Analytics for IoE, focusing on smart city design and the analysis of sensor data.
• Technologies such as Hadoop, Spark, and Oozie for batch and real-time data analysis.
• The Map-Reduce framework, its process steps (Map, Shuffle, Reduce), and a practical example.
• Batch processing and how Map-Reduce manages large data sets across multiple nodes.
• Map-Reduce execution, partitioning, and the roles of mapper and reducer tasks.
• How output from mappers is processed by reducers for distributed data analysis.
• Apache Oozie as a scheduler for managing Hadoop jobs, and its workflow capabilities.
• Apache Spark: its high-speed processing, adoption in industry, and advantages.
• Apache Storm's architecture and capabilities for processing real-time data streams.
• NETCONF and YANG protocols for managing network configurations, including transactions and operations.
• YANG as a data modeling language for NETCONF, including extensibility and the organization of data.
MODULE 6: Data Analytics for IoE
CO5 - Design and develop smart city applications in IoT.
CO6 - Analyze and evaluate the data received through sensors in IoT.
CONTENTS
• Introduction
• Apache Hadoop, Using Hadoop MapReduce for Batch Data Analysis
• Apache Oozie
• Apache Spark
• Apache Storm, Using Apache Storm for Real-time Data
Analysis
• Structural Health Monitoring Case Study
• Tools for IoT:
– Chef, Chef Case Studies,
– Puppet, Puppet Case Study
– Multi-tier Deployment
• NETCONF-YANG Case Studies
• IoT Code Generator.
Map Reduce
• Map-Reduce is a scalable programming model that simplifies distributed processing of data. Map-Reduce consists of three main steps: Mapping, Shuffling, and Reducing.
• An easy way to think about a Map-Reduce job is to compare it with the act of ‘delegating’ a large task to a group of people, and then combining the result of each person’s effort to produce the final outcome.
Map Reduce
• Let’s take an example to bring the point across. You just heard about this
great news at your office, and are throwing a party for all your colleagues!
• You decide to cook Pasta for the dinner. Four of your friends, who like
cooking, also volunteer to join you in preparation.
• The task of preparing Pasta broadly involves chopping the vegetables,
cooking, and garnishing.
• Let’s take the job of chopping the vegetables and see how it is analogous
to a map-reduce task.
• Here the raw vegetables are symbolic of the input data, your friends are
equivalent to compute nodes, and final chopped vegetables are analogous
to desired outcome.
• Each friend is allotted onions, tomatoes and peppers to chop and weigh.
• You would also like to know how much of each vegetable types you have
in the kitchen. You would also like to chop these vegetables while this
calculation is occurring. In the end, the onions should be in one large bowl
with a label that displays its weight in pounds, tomatoes in a separate one,
and so on.
Map Reduce
MAP: To start with, you assign each of your four friends a random mix
of different types of vegetables. They are required to use their
‘compute’ powers to chop them and measure the weight of each type
of veggie. They need to ensure not to mix different types of veggies. So
each friend will generate a mapping of <key, value> pairs that looks
like:
Friend X:
• <tomatoes, 5 lbs>
<onions, 10 lbs>
<garlic, 2 lbs>
Friend Y:
• <onions, 22 lbs>
<green peppers, 5 lbs>
…
• Now that your friends have chopped the vegetables, and labeled
each bowl with the weight and type of vegetable, we move to the
next stage: Shuffling.
Map Reduce
SHUFFLE: This stage is also called Grouping. Here you want to group the veggies by their types. You
assign different parts of your kitchen to each type of veggie, and your friends are supposed to group the
bowls, so that like items are placed together:
North End of Kitchen:
• <tomatoes, 5 lbs>
<tomatoes, 11 lbs>
West End of Kitchen:
• <onions, 10 lbs>
<onions, 22 lbs>
<onions, 1.4 lbs>
East End of Kitchen:
• <green peppers, 3 lbs>
<green peppers, 10 lbs>
• The party starts in a couple of hours, but you are impressed by what your friends have
accomplished by Mapping and Grouping so far! The kitchen looks much more organized now and
the raw material is chopped. The final stage of this task is to measure how much of each veggie you
actually have. This brings us to the Reduce stage.
Map Reduce
• REDUCE: In this stage, you ask each of your
friends to collect items of the same type, put them
in a large bowl, and label this large bowl with the
sum of the individual bowl weights. Your friends
cannot wait for the party to start, and
immediately start ‘reducing’ small bowls. In
the end, you have nice large bowls, with total
weight of each vegetable labeled on it.
Map Reduce
• The number represents the total weight of that vegetable after reducing from the smaller bowls.
• Your friends (‘compute nodes’) just performed a Map-
Reduce task to help you get started with cooking the Pasta.
Since you were coordinating the entire exercise, you are
“The Master” node of this Map-Reduce task. Each of your
friends took roles of Mappers, Groupers and Reducers at
different times. This example demonstrates the power of
this technique.
• This simple and powerful technique can be scaled very easily if more of your friends decide to join you, and many open-source tools make it easy to implement Map-Reduce to solve real computational problems.
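The three stages of the pasta analogy map directly onto code. Below is a minimal sketch in plain Python; the function names, vegetables, and weights are purely illustrative:

```python
from collections import defaultdict

# MAP: each "friend" (compute node) turns raw input into <vegetable, weight> pairs.
def map_chop(assignment):
    return [(veg, weight) for veg, weight in assignment]

# SHUFFLE: group the pairs by key, like assigning each veggie a corner of the kitchen.
def shuffle(pairs):
    groups = defaultdict(list)
    for veg, weight in pairs:
        groups[veg].append(weight)
    return groups

# REDUCE: sum the small bowls of each type into one labeled large bowl.
def reduce_weigh(groups):
    return {veg: sum(weights) for veg, weights in groups.items()}

friend_x = [("tomatoes", 5), ("onions", 10), ("garlic", 2)]
friend_y = [("onions", 22), ("green peppers", 5)]

mapped = map_chop(friend_x) + map_chop(friend_y)
totals = reduce_weigh(shuffle(mapped))
print(totals)  # {'tomatoes': 5, 'onions': 32, 'garlic': 2, 'green peppers': 5}
```

A real framework would run the map and reduce calls on different machines; the structure of the computation, however, is exactly this.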
Introduction to batch processing –
MapReduce
• Today, the volume of data is often too big for a
single server – node – to process.
• Therefore, there was a need to develop code that
runs on multiple nodes.
• Writing distributed systems is an endless array of
problems, so people developed multiple
frameworks to make our lives easier.
• MapReduce is a framework that allows the user
to write code that is executed on multiple nodes
without having to worry about fault tolerance,
reliability, synchronization or availability.
Batch processing
• There are a lot of use cases for a system described in the introduction, but the focus here is on data processing – more specifically, batch processing.
• Batch processing is an automated job that does some
computation, usually done as a periodical job.
• It runs the processing code on a set of inputs, called a
batch. Usually, the job will read the batch data from a
database and store the result in the same or different
database.
• An example of a batch processing job could be reading all
the sale logs from an online shop for a single day and
aggregating it into statistics for that day (number of users
per country, the average spent amount, etc.). Doing this as
a daily job could give insights into customer trends.
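The daily-sales job described above fits in a few lines of Python; the log records and field names here are made up for illustration:

```python
from collections import defaultdict

# One day's sale logs, as they might be read from a database.
sales = [
    {"user": "a", "country": "DE", "amount": 20.0},
    {"user": "b", "country": "DE", "amount": 10.0},
    {"user": "c", "country": "IN", "amount": 30.0},
]

# Aggregate the batch into daily statistics.
users_per_country = defaultdict(set)
for sale in sales:
    users_per_country[sale["country"]].add(sale["user"])
user_counts = {c: len(u) for c, u in users_per_country.items()}

average_spent = sum(s["amount"] for s in sales) / len(sales)

print(user_counts)    # {'DE': 2, 'IN': 1}
print(average_spent)  # 20.0
```

Run periodically (say, once per day), this is exactly the kind of computation MapReduce distributes when the batch no longer fits on one node.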
MapReduce
• MapReduce is a programming model that was introduced
in a white paper by Google in 2004.
• Today, it is implemented in various data processing and
storing systems (Hadoop, Spark, MongoDB, …) and it is a
foundational building block of most big data batch
processing systems.
• For MapReduce to be able to do computation on large
amounts of data, it has to be a distributed model that
executes its code on multiple nodes. This allows the
computation to handle larger amounts of data by adding
more machines – horizontal scaling.
• This is different from vertical scaling, which implies
increasing the performance of a single machine.
MapReduce
Execution
• In order to decrease the duration of our distributed
computation, MapReduce tries to reduce
shuffling (moving) the data from one node to another by
distributing the computation so that it is done on the same
node where the data is stored.
• This way, the data stays on the same node, but the code is
moved via the network. This is ideal because the code is
much smaller than the data.
• To run a MapReduce job, the user has to implement two
functions, map and reduce, and those implemented
functions are distributed to nodes that contain the data by
the MapReduce framework.
• Each node runs (executes) the given functions on the data it has in order to minimize network traffic (shuffling data).
MapReduce
• The computation performance of MapReduce
comes at the cost of its expressivity.
• When writing a MapReduce job we have to follow
the strict interface (return and input data
structure) of the map and the reduce functions.
• The map phase generates key-value data pairs
from the input data (partitions), which are then
grouped by key and used in the reduce phase by
the reduce task.
• Everything except the interface of the functions is
programmable by the user.
MapReduce
Map
• Hadoop, along with its many other features, had the
first open-source implementation of MapReduce. It
also has its own distributed file storage called HDFS.
• In Hadoop, the typical input into a MapReduce job is a
directory in HDFS.
• In order to increase parallelization, each directory is
made up of smaller units called partitions and each
partition can be processed separately by a map task
(the process that executes the map function).
• This is hidden from the user, but it is important to be
aware of it because the number of partitions can affect
the speed of execution.
MapReduce
• The map task (mapper) is called once for every
input partition and its job is to extract key-value
pairs from the input partition. The mapper can
generate any number of key-value pairs from a
single input (including zero).
• The user only needs to define the code inside the mapper. For example, a simple mapper might take the input partition and output each word as a key with value 1.
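A sketch of such a mapper in Python, in the style of Hadoop Streaming, where the mapper emits one key-value pair per word (the function name is ours):

```python
def mapper(partition):
    """Emit (word, 1) for every word in the input partition."""
    for line in partition:
        for word in line.split():
            yield (word, 1)

pairs = list(mapper(["the quick fox", "the lazy dog"]))
print(pairs)
# [('the', 1), ('quick', 1), ('fox', 1), ('the', 1), ('lazy', 1), ('dog', 1)]
```

Note that the mapper is free to emit zero, one, or many pairs per input line; the framework imposes only the key-value interface.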
MapReduce
Reduce
• The MapReduce framework collects all the key-
value pairs produced by the mappers, arranges
them into groups with the same key and applies
the reduce function.
• All the grouped values entering the reducers are
sorted by the framework.
• The reducer can produce output files which can
serve as input into another MapReduce job, thus
enabling multiple MapReduce jobs to chain into a
more complex data processing pipeline.
MapReduce
• The mapper yielded key-value pairs with the word as
the key and the number 1 as the value.
• The reducer can be called on all the values with the
same key (word), to create a distributed word counting
pipeline.
• Note that not every sorted group necessarily gets its own reduce task. This happens because the user defines the number of reducers (say, 3 in our case).
• After a reducer is done with its task, it takes another group if there is one that was not yet processed.
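A matching word-count reducer could look like the following in Python; in a real framework, the reducer would be called once per sorted group rather than in a local loop (names are illustrative):

```python
def reducer(word, counts):
    """Sum all the 1s emitted by the mappers for a given word."""
    return (word, sum(counts))

# The framework groups and sorts mapper output by key before calling the reducer.
grouped = {"fox": [1], "quick": [1], "the": [1, 1]}
result = dict(reducer(word, counts) for word, counts in grouped.items())
print(result)  # {'fox': 1, 'quick': 1, 'the': 2}
```

Chaining this output into another MapReduce job is how longer pipelines are built.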
MapReduce
• MapReduce is a programming model that allows the user to
write batch processing jobs with a small amount of code.
• It is flexible in the sense that you, the user, can write code
to modify the behavior, but making complex data
processing pipelines becomes cumbersome because every
MapReduce job has to be managed and scheduled on its
own.
• The intermediate output of map tasks is written to a file
which allows the framework to recover easily if a node has
a failure.
• This stability comes at a cost of performance, as the data
could have been forwarded to reduce tasks with a small
buffer instead, creating a stream.
Apache Oozie
• Oozie is a workflow scheduler system to manage
Apache Hadoop jobs.
• Oozie Workflow jobs are Directed Acyclical Graphs
(DAGs) of actions.
• Oozie Coordinator jobs are recurrent Oozie Workflow
jobs triggered by time (frequency) and data availability.
• Oozie is integrated with the rest of the Hadoop stack
supporting several types of Hadoop jobs out of the box
(such as Java map-reduce, Streaming map-reduce, Pig,
Hive, Sqoop and Distcp) as well as system specific jobs
(such as Java programs and shell scripts).
• Oozie is a scalable, reliable and extensible system.
Apache Spark
• Apache Spark is a lightning-fast unified analytics engine
for big data and machine learning. It was originally
developed at UC Berkeley in 2009.
• Since its release, Apache Spark, the unified analytics
engine, has seen rapid adoption by enterprises across a
wide range of industries.
• Internet powerhouses such as Netflix, Yahoo, and eBay
have deployed Spark at massive scale, collectively
processing multiple petabytes of data on clusters of
over 8,000 nodes.
• It has quickly become the largest open source
community in big data, with over 1000 contributors
from 250+ organizations.
Apache Spark
• Spark can be 100x faster than Hadoop for large-scale data processing by exploiting in-memory computing and other optimizations.
• Spark is also fast when data is stored on disk, and it set the world record for large-scale on-disk sorting.
• Spark has easy-to-use APIs for operating on large datasets.
• Spark comes packaged with higher-level libraries, including
support for SQL queries, streaming data, machine learning
and graph processing.
• These standard libraries increase developer productivity
and can be seamlessly combined to create complex
workflows.
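Spark's RDD API chains small transformations such as flatMap, map, and reduceByKey into a pipeline. Since running actual PySpark requires a Spark runtime, the sketch below reproduces the same word-count pipeline in plain Python to illustrate the style:

```python
from itertools import chain, groupby
from operator import itemgetter

lines = ["spark is fast", "spark is easy"]

# flatMap: split each line into words.
words = list(chain.from_iterable(line.split() for line in lines))
# map: pair each word with a count of 1.
pairs = [(word, 1) for word in words]
# reduceByKey: group by key and sum the counts.
pairs.sort(key=itemgetter(0))
word_counts = {key: sum(v for _, v in grp)
               for key, grp in groupby(pairs, key=itemgetter(0))}
print(word_counts)  # {'easy': 1, 'fast': 1, 'is': 2, 'spark': 2}
```

In PySpark itself this would read roughly as `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`, with Spark distributing each stage across the cluster.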
Apache Storm
• Apache Storm is a distributed real-time big data-processing system.
• Storm is designed to process vast amounts of data in a fault-tolerant and horizontally scalable manner.
• It is a streaming data framework that supports very high ingestion rates.
• Though Storm is stateless, it manages distributed
environment and cluster state via Apache ZooKeeper.
• It is simple and you can execute all kinds of manipulations
on real-time data in parallel.
• Apache Storm continues to be a leader in real-time data analytics. Storm is easy to set up and operate, and it guarantees that every message will be processed through the topology at least once.
• An Apache Storm cluster consists of a few specific components that work together to form the Storm architecture. There are two types of nodes present in the architecture:
• Master node (Nimbus)
• Worker node (Supervisor)
• The Master node runs the Nimbus daemon. Nimbus examines and administers tasks across the cluster, allots tasks to worker machines, and supervises on failure. Nimbus accepts topology code in any programming language, so anyone can utilize Apache Storm without knowing any particular language.
• The Worker node runs the Supervisor daemon. The supervisor concentrates on the tasks given to its machine and starts or stops worker processes as required, based on what Nimbus has assigned to it. Each worker process operates a part of the topology in the form of spouts and bolts. The Nimbus daemon communicates with the supervisor daemons via ZooKeeper.
Components of Apache Storm
• Topology is the real-time computational logic, represented as a graph data structure. The topology consists of spouts and bolts, where the spout determines how its output is fed to the inputs of bolts, and the output of a single bolt can be linked to the inputs of other bolts. A Storm cluster receives a topology as its input; the Nimbus daemon on the master node exchanges information with the supervisor daemons on the worker nodes and accepts the topology.
• Spout acts as the initial step in a topology; data from various sources is acquired by the spout. It ingests the data as a stream of tuples and sends it to bolts for processing. A single spout can generate multiple output streams of tuples, and these streams are further consumed by one or more bolts. A spout continuously reads data from sources such as databases, distributed file systems, or messaging systems like Kafka, converts it into streams of tuples, and sends them to bolts for processing.
• Bolts are responsible for the processing of data; their work includes filtering, applying functions, aggregations, and handling databases, etc. Bolts consume multiple streams as input, process them, and generate new streams for further processing.
• Consider the case of Twitter, an online social platform for communicating with tweets. Here, user tweets can be sent and received. Subscribed users read and post tweets, while unsubscribed users can only read tweets.
• A hashtag is used to classify a tweet by a keyword, by putting # before the appropriate keyword. Apache Storm can act here as a real-time framework for detecting the most used hashtags across tweets.
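Conceptually, a spout emits tweet tuples and a bolt extracts and counts the hashtags. The sketch below imitates that flow in plain Python; it is not the real Storm API (which runs as distributed Java/ZooKeeper processes), and the tweets are invented:

```python
import re
from collections import Counter

def tweet_spout(tweets):
    """Spout: ingest raw tweets and emit them as a stream of tuples."""
    for tweet in tweets:
        yield (tweet,)

def hashtag_bolt(stream, counts):
    """Bolt: filter hashtags out of each tuple and aggregate counts."""
    for (tweet,) in stream:
        counts.update(re.findall(r"#\w+", tweet))

counts = Counter()
hashtag_bolt(tweet_spout(["#iot is big", "love #iot and #storm"]), counts)
print(counts.most_common(1))  # [('#iot', 2)]
```

In a real topology, the spout and bolt would run as separate parallel tasks on worker nodes, with Storm routing the tuple stream between them.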
Apache Storm vs Hadoop
• Basically, both the Hadoop and Storm frameworks are used for analyzing big data. They complement each other and differ in some aspects. Apache Storm does all the operations except persistence, while Hadoop is good at everything else but lags in real-time computation. The following table compares the attributes of Storm and Hadoop.
NETCONF/YANG
• Network Configuration Protocol, known as
NETCONF, gives access to the native capabilities
of a device within a network, defining methods to
manipulate its configuration database,
retrieve operational data, and invoke specific
operations.
• YANG provides the means to define
the content carried via NETCONF, for both data
and operations. Together, they help users build
network management applications that meet the
needs of network operators.
NETCONF/YANG
The motivation behind NETCONF and YANG was to have a network
management system that manages the network at the service level
that includes:
– Standardized data model (YANG)
– Network-wide configuration transactions
– Validation and roll-back of configuration
– Centralized backup and restore configuration
Businesses have used the Simple Network Management Protocol (SNMP) for a long time, but it was used more for reading device states than for configuring devices. NETCONF and YANG address the shortcomings of SNMP and add more functionality to network management, such as:
1. Configuration transactions
2. Network-wide orchestrated activation
3. Network-level validation and roll-back
4. Save and restore configurations
NETCONF/YANG
Configuration transactions:
• NETCONF configurations work based on atomic
transactions consisting of multiple configuration commands
required to move a network from state A to state B.
• The order of the configuration snippets within a transaction
does not matter and the success of a transaction is
based on the success of all the command snippets.
• If any single command fails, the entire transaction becomes
a failure.
• So, there is no intermediate erroneous state, either it’s at
state A (if any one command of the transaction fails) or at
state B (if the transaction is successful as a whole).
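This all-or-nothing behavior can be sketched as follows; it is a simplified model of transactional semantics, not the actual NETCONF implementation, and the command names are invented:

```python
def apply_transaction(config, commands):
    """Apply every command or none: work on a copy, commit only on full success."""
    candidate = dict(config)  # state A stays untouched while we work
    for key, value in commands:
        if value is None:  # stand-in for any failing command
            raise ValueError(f"invalid command for {key}")
        candidate[key] = value
    return candidate  # state B, reached only if every command succeeded

config = {"hostname": "r1"}
config = apply_transaction(config, [("hostname", "r2"), ("mtu", 1500)])
print(config)  # {'hostname': 'r2', 'mtu': 1500}

try:
    apply_transaction(config, [("mtu", 9000), ("speed", None)])
except ValueError:
    pass  # the failed transaction is discarded as a whole
print(config)  # unchanged: {'hostname': 'r2', 'mtu': 1500}
```

The key point is that the partially applied candidate is simply discarded on failure, so the device is never left in an intermediate erroneous state.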
NETCONF/YANG
Network-wide orchestrated activation:
• There is a distinction between the distribution of a
configuration to all the networking devices
and the activation of it.
• For example, if the operator wants to configure a VPN
over a network of devices all at one time, NETCONF
provides the flexibility to distribute the configuration,
validate it, lock all device configurations, commit the
configuration, and unlock.
• This set of actions will result in enabling a VPN over the
entire network at the same time, in an orchestrated,
synchronized way.
NETCONF/YANG
Network-level validation and roll-back:
• Each NETCONF server keeps a “Candidate database” (in parallel to the “Running config database”).
• Using this candidate data store, a NETCONF manager
can implement a network-wide transaction by sending
a configuration to the candidate of each device,
validating the candidate, and if all participants are fine,
telling them to commit the changes.
• If the results are not satisfactory, the manager can ask
to roll-back all devices.
NETCONF/YANG
Save and restore configurations:
• NETCONF Manager can take a backup of the
networking device configuration whenever
needed and restore it by sending the saved
configuration to any networking device.
Protocol Stack
• The NETCONF protocol can be broken down into 4 layers. These are as shown in figure.
1. Content: NETCONF data models and protocol operations use the YANG modelling language. A data model outlines the structure, semantics, and syntax of the data.
2. Operations: A set of base protocol operations, invoked via RPC methods using XML encoding, to perform operations upon the device, such as <get-config>, <edit-config>, and <get>.
3. Messages: A set of RPC messages and notifications defined for use, including <rpc>, <rpc-reply>, and <rpc-error>.
4. Transport: The transport layer provides a communication path between the client and server (manager and agent). NETCONF is agnostic to the transport protocol used, but SSH is typically used.
Communication
NETCONF is based upon a client/server model. Within the
communication flow of a NETCONF session there are 3
main parts. These are:
1. Session Establishment - Each side sends a <hello> ,
along with its <capabilities> . Announcing what
operations (capabilities) it supports.
2. Operation Request - The client then sends its request
(operation) to the server via the <rpc> message. The
response is then sent back to the client within <rpc-
reply> .
3. Session Close - The session is then closed by the client
via <close-session>
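On the wire, that three-part flow looks roughly like the XML below. The message IDs and omitted configuration data are illustrative; the namespace and base capability URN are as defined in RFC 6241:

```xml
<!-- 1. Session establishment: both sides send <hello> with their capabilities -->
<hello xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <capabilities>
    <capability>urn:ietf:params:netconf:base:1.1</capability>
  </capabilities>
</hello>

<!-- 2. Operation request: the client asks for the running configuration -->
<rpc message-id="101" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <get-config>
    <source><running/></source>
  </get-config>
</rpc>

<!-- The server's response -->
<rpc-reply message-id="101" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <data><!-- configuration data elided --></data>
</rpc-reply>

<!-- 3. Session close -->
<rpc message-id="102" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <close-session/>
</rpc>
```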
NETCONF Manager Test Scenario
The important aspects of NETCONF Server validation can be classified into the following categories:
• Validate the YANG model encoding of NETCONF operations (e.g., <get>, <get-config>, <edit-config>) received in Request XML messages (e.g., IETF, OpenConfig, or proprietary models)
• Stress the management plane with many concurrent NETCONF sessions and assess the impact on regular control plane and data plane operation of a network device
• Measure the device response time of NETCONF transactions
NETCONF Server test scenario
• Test Objective: Measure the efficiency of
NETCONF Server in terms of the time it takes
to respond to NETCONF Request XMLs when
multiple concurrent NETCONF Client
sessions are active.
• Test Topology: There are multiple NETCONF
Clients, all of them are connected to a single
NETCONF Server (DUT), as shown in figure:
NETCONF Server test scenario
Steps for testing are as follows:
1. The NETCONF Clients are preconfigured with a set of NETCONF Request XMLs as
per the YANG model supported by the DUT. The XMLs have different types of
command snippets like edit-config, get, get-config.
2. Once sessions are established, NETCONF clients will start sending NETCONF
Request messages in the form of XML files and the Server is supposed to
respond with NETCONF Reply messages with the same XML format.
3. Let’s assume the Clients have sent some Request messages and stopped sending
thereafter.
4. For each session, measure how much time (min/max/average) the Server takes
to send a Reply message after receiving a Request message.
5. Now have the Clients resume sending Request messages again at a higher
transmission rate for a certain duration of time and measure how that affects the
DUT’s response time under stress condition.
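The per-session measurement in step 4 reduces to pairing request and reply timestamps. A minimal sketch, with hypothetical timing values in seconds:

```python
# Timestamps for requests sent and the matching replies received in one session.
request_times = [0.00, 1.00, 2.00]
reply_times   = [0.12, 1.08, 2.30]

# Per-request latency is simply reply time minus request time.
latencies = [reply - request for request, reply in zip(request_times, reply_times)]

print(f"min: {min(latencies):.2f}s")                   # min: 0.08s
print(f"max: {max(latencies):.2f}s")                   # max: 0.30s
print(f"avg: {sum(latencies) / len(latencies):.2f}s")  # avg: 0.17s
```

Repeating the measurement at a higher request rate (step 5) and comparing these statistics shows how the DUT degrades under stress.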
YANG (Yet Another Next Generation)
• YANG is a language used to model data for the NETCONF protocol. A YANG module defines a hierarchy of nodes which can be used for NETCONF-based operations, including configuration, state data, remote procedure calls (RPCs), and notifications.
• This allows a complete description of all data sent
between a NETCONF client and server.
• YANG models the hierarchical organization of data as a
tree in which YANG provides clear and concise
descriptions of the nodes, as well as the interaction
between those nodes.
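A small hypothetical module illustrates this hierarchy of nodes; the module, namespace, and leaf names below are invented for illustration:

```yang
module sensor-config {
  namespace "urn:example:sensor-config";  // illustrative namespace
  prefix sc;

  // A container groups related configuration nodes in the tree.
  container sensor {
    leaf name {
      type string;
    }
    leaf sample-interval {
      type uint32;
      units "seconds";
    }
    leaf enabled {
      type boolean;
      default "true";
    }
  }
}
```

A NETCONF <edit-config> payload for this device would then carry XML whose structure mirrors this tree exactly.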
YANG (Yet Another Next Generation)
• YANG structures data models into modules and submodules.
• A module can import data from other external modules, and include data from submodules.
• The hierarchy can be extended, allowing one module to add
data nodes to the hierarchy defined in another module.
• This augmentation can be conditional, with new nodes appearing only if certain conditions are met.
• YANG models can describe constraints to be enforced on the data, restricting the appearance or value of nodes based on the presence or value of other nodes in the hierarchy.
• These constraints are enforceable by either the client or the
server, and valid content must abide by them.
YANG (Yet Another Next Generation)
• YANG defines a set of built-in types, and has a type mechanism
through which additional types may be defined.
• Derived types can restrict their base type's set of valid values using
mechanisms like range or pattern restrictions that can be enforced
by clients or servers.
• They can also define usage conventions for the derived type, such as a string-based type that contains a host name.
• YANG permits the definition of complex types using reusable
grouping of nodes.
• The instantiation of these groupings can refine or augment the
nodes, allowing it to tailor the nodes to its particular needs.
• Derived types and groupings can be defined in one module or submodule and used either in that location or in another module or submodule that imports or includes it.
YANG (Yet Another Next Generation)
• YANG organizational constructs include defining lists of
nodes with the same names and identifying the keys which
distinguish list members from each other.
• Such lists may be defined as either sorted by user or
automatically sorted by the system.
• For user-sorted lists, operations are defined for
manipulating the order of the nodes.
• YANG modules can be translated into an XML format called
YIN, allowing applications using XML parsers and XSLT
scripts to operate on the models.
• XML Schema [XSD] files can be generated from YANG
modules, giving a precise description of the XML
representation of the data modeled in YANG modules
YANG (Yet Another Next Generation)
• YANG strikes a balance between high-level object-oriented modelling and low-level bits-on-the-wire encoding.
• The reader of a YANG module can easily see the high-level view of the
data model while seeing how the object will be encoded in NETCONF
operations.
• YANG is an extensible language, allowing extension statements to be
defined by standards bodies, vendors, and individuals.
• The statement syntax allows these extensions to coexist with standard
YANG statements in a natural way, while making extensions stand out
sufficiently for the reader to notice them.
• YANG resists the tendency to solve all possible problems, limiting the
problem space to allow expression of NETCONF data models, not arbitrary
XML documents or arbitrary data models.
• The data models described by YANG are designed to be easily operated
upon by NETCONF operations.