© 2015 IBM Corporation
How Spark Enables the Internet of Things:
Efficient Integration of Multiple Spark
Components for Smart City Use Cases
Paula Ta-Shma
IBM Research
paula@il.ibm.com
Joint work with:
Adnan Akbar, University of Surrey
Michael Factor, IBM Research
Guy Hadash, IBM Research
Juan Sancho, ATOS
The Evolution of Data Collection
Internet of Things
IoT: The Biggest Big Data
The IoT market will grow to $1.7 trillion in 2020 (IDC)
By 2020 the number of networked devices will be 30 billion (IDC), more than 4 times the entire global population
[Chart: Global Data Volume in Exabytes; x-axis: 2005, 2012, 2017]
EMT Madrid Bus Company Needs to Make Decisions
According to Current and Predicted Future Traffic State
• The Problem
– EMT needs to staff control rooms where employees manually analyze Madrid traffic sensor output. This can be slow and costly.
• Objective
– Improve customer satisfaction and reduce costs by responding more efficiently and quickly to real-time traffic problems
• Approach
– Monitor data from up to 3000 sensors. React by rerouting buses, modifying traffic lights, etc., based upon knowledge derived from historical data
[Figure: Today vs. Tomorrow]
Generic IoT Architecture – Data Flow
1. Collect historical time series data
– Collect data from devices
– Aggregate into objects
– Index and/or partition
[Diagram labels: IoT, Secor, Swift]
Generic IoT Architecture – Data Flow
2. Learn patterns in data
– May be time/location dependent
– Generate thresholds, classifiers, etc.
[Diagram labels: Secor, Swift]
Generic IoT Architecture – Data Flow
3. Apply what was learned on real time data stream
– Take action
[Diagram labels: IoT, Secor, CEP, Swift]
Generic IoT Architecture – Data Flow
[Diagram labels: IoT, CEP, Secor, Swift]
Green Flows: Real time
Purple Flows: Batch
IoT Architecture – Madrid Traffic – Ingestion Flow
Aim: Collect historical time series data for analysis
– Continuously collect data from up to 3000 Madrid council traffic sensors via web service
- Data includes traffic speeds and intensities, updated every 5 mins
– Push the messages to Kafka
– Use Secor to aggregate multiple messages into a single Swift object
- According to policy, e.g., every 60 mins
- Possibly partition the data, e.g., according to date
- Convert to Parquet format
- Annotate with metadata, e.g., min/max speed, start/end time
– Index Swift objects according to their metadata using Elasticsearch
[Diagram labels: IoT, Secor, Swift]
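The aggregation and annotation steps of the ingestion flow can be sketched roughly as follows. This is a minimal illustration in plain Python, not Secor's implementation: the `Reading` type, its field names and units, and the `aggregate_window` helper are all assumptions made for the example. In the pipeline described here, Secor performs this work and writes Parquet objects to Swift.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Reading:
    """One sensor message (field names and units are illustrative assumptions)."""
    sensor_id: str
    timestamp: int    # seconds since epoch
    speed: float      # assumed km/h
    intensity: float  # assumed vehicles per hour


def aggregate_window(readings: List[Reading], window_secs: int = 3600) -> Dict[int, dict]:
    """Group readings into fixed time windows and annotate each group.

    Returns {window_start: {"records": [...], "metadata": {...}}}, mimicking
    one stored object per window plus the min/max metadata used for indexing.
    """
    objects: Dict[int, dict] = {}
    for r in readings:
        # Align each reading to the start of its window (e.g. the hour).
        start = r.timestamp - (r.timestamp % window_secs)
        objects.setdefault(start, {"records": []})["records"].append(r)
    for obj in objects.values():
        recs = obj["records"]
        obj["metadata"] = {
            "min_speed": min(r.speed for r in recs),
            "max_speed": max(r.speed for r in recs),
            "start_time": min(r.timestamp for r in recs),
            "end_time": max(r.timestamp for r in recs),
        }
    return objects
```

Keeping the min/max values next to each object is what later allows a query to skip objects that cannot contain matching rows.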
IoT Architecture – Madrid Traffic – Data Access
Aim: Access data efficiently and cost effectively
– Store IoT data in OpenStack Swift object storage
- Open source, low cost deployment, and highly scalable
– Parquet data is accessible via Spark SQL
– Optimized predicate pushdown
- Custom Spark SQL external data source driver
- Uses object metadata indexes
- Searches for Swift objects whose min/max values overlap requested ranges
Get all data for morning traffic:
SELECT codigo, intensidad, velocidad FROM madridtraffic
WHERE tf >= '08:00:00' AND tf <= '12:00:00'
Brute force method: 13245 Swift requests
Optimized predicate pushdown: 616 Swift requests
21.5 times improvement
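The min/max overlap test behind the optimized pushdown can be sketched as below. This is a simplified stand-in, not the custom Spark SQL data source driver itself: the dictionary replaces the Elasticsearch metadata index, and the object names and the `select_objects` helper are hypothetical. Only the `tf` column and the query bounds come from the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical metadata index: object name -> (min_tf, max_tf).
# Zero-padded "HH:MM:SS" strings compare correctly as plain strings.
MetadataIndex = Dict[str, Tuple[str, str]]


def select_objects(index: MetadataIndex, lo: str, hi: str) -> List[str]:
    """Return the objects whose [min, max] range overlaps the query range [lo, hi]."""
    return sorted(
        name
        for name, (obj_min, obj_max) in index.items()
        if obj_min <= hi and obj_max >= lo
    )


index: MetadataIndex = {
    "obj-night": ("00:00:00", "05:59:59"),
    "obj-early": ("06:00:00", "08:59:59"),
    "obj-morning": ("09:00:00", "11:59:59"),
    "obj-afternoon": ("12:00:00", "17:59:59"),
}

# Same bounds as the morning-traffic query: only overlapping objects are fetched.
selected = select_objects(index, "08:00:00", "12:00:00")
```

Only the selected objects need to be requested from Swift. The reported improvement is the ratio of request counts, 13245 / 616, which is about 21.5.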
IoT Architecture – Madrid Traffic – Machine Learning
Aim: Learn to differentiate between ‘good’ and ‘bad’ traffic
– Depends on context
- Time (morning/evening), Day (weekday/weekend)
- Location
– Use Spark MLlib k-means clustering
– Produce threshold values for real-time decision making
– Re-run algorithm when quality of clusters decreases
- Can use silhouette index to measure quality
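One way to measure cluster quality with the silhouette index is sketched below for one-dimensional data. This is a plain-Python illustration under stated assumptions, not the code used in the project: the `needs_retraining` helper and its `min_quality` cutoff of 0.5 are hypothetical, and the real data may have more than one dimension.

```python
from typing import Dict, List


def silhouette_score(points: List[float], labels: List[int]) -> float:
    """Mean silhouette coefficient for 1-D points with the given cluster labels."""
    clusters: Dict[int, List[float]] = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def needs_retraining(score: float, min_quality: float = 0.5) -> bool:
    """Hypothetical trigger: re-run clustering when quality drops below a cutoff."""
    return score < min_quality
```

Scores close to 1 indicate well-separated clusters, while scores near 0 or below suggest that the learned clusters no longer match the data.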
IoT Architecture – Madrid Traffic – Machine Learning
Event Detection:
• Use Spark MLlib k-means clustering to separate data into 2 clusters
• Find the midpoint between the 2 cluster centres
• Use this midpoint to generate the thresholds
• Repeat for each context, e.g., time period (morning, afternoon, evening, night)
Anomaly Detection:
• Use a single cluster and define an anomaly to be further than a certain distance from the cluster centre
[Figure: Morning Traffic on Weekdays]
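The midpoint-threshold and single-cluster anomaly ideas can be sketched as follows. This is a plain-Python, one-dimensional stand-in for Spark MLlib k-means, written for illustration only: the function names, the min/max initialisation, and the sample values are assumptions, and the project presumably clusters richer data (e.g. speed and intensity together).

```python
from statistics import mean
from typing import List, Tuple


def two_means_1d(values: List[float], iterations: int = 100) -> Tuple[float, float]:
    """Lloyd's algorithm with k=2 on 1-D data; returns (low_centre, high_centre)."""
    c_low, c_high = min(values), max(values)  # simple deterministic start
    for _ in range(iterations):
        low = [v for v in values if abs(v - c_low) <= abs(v - c_high)]
        high = [v for v in values if abs(v - c_low) > abs(v - c_high)]
        if not high:  # all values identical: nothing to split
            break
        new_low, new_high = mean(low), mean(high)
        if (new_low, new_high) == (c_low, c_high):
            break  # converged
        c_low, c_high = new_low, new_high
    return c_low, c_high


def midpoint_threshold(values: List[float]) -> float:
    """Threshold halfway between the two cluster centres."""
    c_low, c_high = two_means_1d(values)
    return (c_low + c_high) / 2


def is_anomaly(value: float, values: List[float], max_distance: float) -> bool:
    """Single-cluster view: anomalous if too far from the centre of all values."""
    return abs(value - mean(values)) > max_distance
```

In practice one threshold would be computed per context, for example one per time period and location, by calling `midpoint_threshold` on the historical values for that context.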
IoT Architecture – Madrid Traffic – Real Time Decision Making
Aim: Respond in real time to traffic conditions
– Use Complex Event Processing (CEP) approach
- Rule based
- Process events record by record
- CEP rules are typically defined manually, but in many cases it is difficult to get them right
- We automate this process by generating rule thresholds from patterns learned in historical data
- uCEP has a small footprint and can be run at the edge
[Diagram labels: CEP, IoT]
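A record-by-record rule evaluation of this kind can be sketched as below. This is a toy illustration, not uCEP or its rule language: the `Event` and `Rule` types, the `congestion_rule` helper, and the definition of congestion as low speed combined with high intensity are all assumptions made for the example. The thresholds would come from the learning step.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass
class Event:
    """One incoming sensor record (field names are illustrative)."""
    sensor_id: str
    speed: float
    intensity: float


@dataclass
class Rule:
    """A named condition evaluated against each event."""
    name: str
    condition: Callable[[Event], bool]


def congestion_rule(speed_threshold: float, intensity_threshold: float) -> Rule:
    """Build a rule from learned thresholds (assumed: low speed and high intensity)."""
    return Rule(
        "congestion",
        lambda e: e.speed < speed_threshold and e.intensity > intensity_threshold,
    )


def process(events: Iterable[Event], rules: List[Rule]) -> Iterator[Tuple[str, str]]:
    """Evaluate every rule on each event, one record at a time."""
    for event in events:
        for rule in rules:
            if rule.condition(event):
                yield rule.name, event.sensor_id
```

Because the rule is built from parameters rather than hard-coded values, the thresholds can be regenerated whenever the clustering is re-run.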
Work in Progress
Proactive approach:
• Use Spark streaming linear regression to predict traffic behavior (e.g., speed, intensity) for the near future
• Apply CEP on predicted data
• Respond proactively to predicted events such as traffic congestion
– e.g., EMT can proactively reroute buses
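The idea of updating a regression model as records arrive and then acting on its predictions can be sketched like this. It is a minimal plain-Python stand-in for Spark's streaming linear regression, not the project's code: the `OnlineLinearRegression` class, the learning rate, and the synthetic training data are assumptions chosen for illustration.

```python
from typing import List


class OnlineLinearRegression:
    """Linear model updated one record at a time with stochastic gradient descent."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features: List[float]) -> float:
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def update(self, features: List[float], target: float) -> None:
        """One SGD step on the squared error of a single observation."""
        error = self.predict(features) - target
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


# Synthetic stream following y = 2x + 1, replayed to let the model converge.
model = OnlineLinearRegression(n_features=1)
for _ in range(2000):
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        model.update([x], 2.0 * x + 1.0)

predicted = model.predict([2.0])
```

The predicted value, rather than the latest observed one, would then be passed to the CEP rules so that a response can start before congestion actually occurs.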
Demo
Our Architecture Applies to Many IoT Use Cases
• Energy/utilities
– Anomaly detection
- Pipe leakage
- Appliance malfunction
– Occupancy detection
• Healthcare
– Healthcare patient monitoring/alert/response
• Insurance
– Driver behavior and location monitoring
• Transportation
– Connected vehicles, engine diagnostics, automated service scheduling
• Logistics
– Goods tracking, sensitive goods management
The Madrid Traffic Use Case on IBM Bluemix
[Diagram labels: Data Sources, Madrid Traffic Sensors, Node-RED, MQTT, Message Bus, Secor, Data Storage, Object Storage, Data Analytics, Apache Spark, Data Visualization, Freeboard Dashboard]
Joint work with Naeem Altaf and team
Thank You!
Backup
COSMOS
• Funding: EU FP7 at level of 2PY x 3 years
• Started: Sept 2013
• Coordinator: ATOS
• Technical partners: IBM, NTUA, Univ Surrey, Siemens, ATOS
• Use Case Partners: Hildebrand/Camden, EMT Madrid Bus Transport/Madrid Council, III Taiwan – Smart Cities use cases
• Project Vision: Enable ‘things’ to interact with each other based on shared experience, trust, reputation, etc.
IBM Bluemix Data Analytics for IoT Architecture
Kafka + Secor
• What is it?
– Apache Kafka is a high-throughput distributed publish/subscribe messaging system.
– Secor is an open source tool developed by Pinterest, which aggregates Kafka messages and saves them as S3 objects.
• What extensions were needed?
– Support for OpenStack Swift as a Secor target. We also added support for Parquet format and for annotating objects with metadata to support indexing and search.
• What is the value of integration with Swift?
– Enables bringing new data and applications to Swift, which is an open source solution. Parquet and metadata search enable improved performance for batch analytics.
• Status
– We contributed OpenStack Swift support to the Secor community and it is now part of Secor.
Parquet
• What is it?
– A column-based, schema-based storage format for semi-structured data, supported by Hadoop and Spark. Enables column-wise compression and projection pushdown.
• What integration is needed?
– Since Swift is now part of the Hadoop ecosystem, no additional integration is needed. Data in Swift can be stored in Apache Parquet format, inheriting the associated advantages.
• Status
– Spark SQL supports storing tabular data in Parquet format in Hadoop-compatible storage systems such as Swift.
Elasticsearch
• What is it?
– A distributed, scalable, real-time search and analytics engine, built on Apache Lucene.
• What integration is needed?
– Index object metadata, allowing search for objects by attributes.
• What is the value of integration with Swift?
– Use search to select objects for further processing, e.g., relevant objects for analytics.
- Note that S3 does not yet have native search according to metadata.
• Status
– The IBM SoftLayer object service includes a basic implementation of metadata search; at IBM Research, we added extensions such as data type support and range searches.


Editor's Notes

  • #3 So what really is the Internet of Things? It is made up of physical objects (“things”) that have chips and sensors embedded in them that allow the sensing, capturing and communication of all types of data. These devices are then linked through both wired and wireless networks to the Internet. Advanced “things” have actuators embedded into them as well, giving them the capability to interact with other devices, computing systems and the external environment, including people. IoT takes this one step further: actuation. Quantity of data and quality of solution (actuation). Sensors have existed for a long time; think how many sensors you need to send a rocket into space. But today this is not rocket science. What is happening is that sensors are becoming commodities, leading to adoption on a massive scale and making new applications possible, e.g., placing large numbers of sensors in agricultural fields to measure soil humidity and nutrient levels.
  • #4 Big data versus huge data. IoT data: typically sensor readings and associated data together with timestamps. Why so big? 1) Many more networked devices than networked humans, and growing fast. 2) Associated data can be video, audio and social networking data; yes, things will join social networks like humans do. => Going to be the biggest big data. Video, audio and images can also be IoT data. According to new research from International Data Corporation (IDC), the worldwide Internet of Things market will grow from $655.8 billion in 2014 to $1.7 trillion in 2020 with a compound annual growth rate (CAGR) of 16.9%. http://www.idc.com/getdoc.jsp?containerId=prUS25658015 http://www.emc.com/leadership/digital-universe/2014iview/internet-of-things.htm
  • #5 Sensors record speed and intensity of traffic. Open data.
  • #6 Swift is highly scalable and low cost in comparison to using a database. Essential for keeping IoT data long term. What kind of data? Together with traffic speed and intensity data we also have camera images, which are less suitable for a DB. Need some kind of indexing to make it efficient.
  • #7 Learn from historical data to respond in real time to live data
  • #8 Actuate: CEP, respond according to thresholds. Possible actions: reroute buses, alter traffic light behavior, alert citizens, etc. The action to be taken is application specific.
  • #9 Need for real time and historical; real time is the green arrows. Adding the historical dimension: the left cycle is real time only, and we added the larger cycle. Stopwatch graphic is from Wikimedia Commons.
  • #10 Object storage (OpenStack Swift) as a long term repository for IoT data. Scalable and relatively low cost. By adding metadata to describe what is contained in each object, and metadata search, we can access it efficiently. Databases are often overkill for what is needed by analytics: partitioning needs to be done statically and is difficult to change later on; it is difficult to change the schema; partitioning must be hierarchical, with advantages to selecting on columns high up in the hierarchy. HDFS versus Swift for analytics: Swift is more scalable and reliable (Namenode SPOF issue?); supports metadata; supports a stronger security model; is more natural for storing some types of data, e.g., traffic images.
  • #12 Can depend on other elements of context like weather etc. Note: table is for one location only
  • #14 Importance of responding in time
  • #16 Value of actuation in real time, e.g., pipe leakage. Potential to learn new insights by tapping into the historical data, e.g., as you heard in a previous talk by Intel, in healthcare, to improve quality of care for patients with Parkinson's disease. Insurance: pay as you drive and pay how you drive models; pay where you drive? New business models. I’ve been thinking a lot about the last one since I arrived in Amsterdam. I think it's ironic that I lost my things on the way to a conference where I’m talking about the Internet of Things. Airlines, please work on this ;-)
  • #17 IBM Bluemix is IBM’s Platform as a Service offering. Based on Cloud Foundry. Bluemix has services for most of the components we used, for example there are Spark and object storage services as well as a MessageHub service based on Kafka. Together with a team led by Naeem Altaf, we ported this use case to run on Bluemix to give customers an example of an IoT use case that can be built on the platform. This work was demoed yesterday in Las Vegas at IBM’s Insight conference.