IoT meets Big Data
รัฐศิลป์ รานอกภานุวัชร์, D.ENG
Keywords
• Big Data
• Internet of Things
• Streaming data processing
• IoT Big Data analytics
• Advanced machine learning
Big Data technology
Credit: https://www.xenonstack.com/blog/big-data-engineering/ingestion-processing-big-data-iot-stream/
Internet of Things (IoT)
Credit: https://orzota.com/industrial-iot/
[Figure labels: Things; Sensors & Actuators; Software and platform; Visualization]
IoT data characteristics
IoT data
• Large-scale streaming data
• Heterogeneity
• Time and space correlation
• High-noise data
Analytics requirement
• IoT applications must support high-speed data streams that require real-time or near real-time actions.
• Fast computing and advanced machine learning techniques are required for IoT streaming data processing and IoT Big Data analytics.
Reference: M. Chen, S. Mao, Y. Zhang, and V. C. Leung, Big Data: Related Technologies, Challenges and Future Prospects. Springer, 2014.
Things are Producing Streaming Data
“6V” for IoT Big Data
• Volume: size of the data
• Velocity: speed at which data is generated
• Variety: different types of data
• Veracity: data accuracy
• Variability: dynamic behavior in the data sources caused by changing data flow rates
• Value: useful data
New class of analytics: “Fast and streaming data analytics”
IoT data (‘6V’) combined with:
• Streaming processing
• Advanced machine learning
• Fast distributed computing
IoT Big Data Architecture
Filtering
Analytics
Ingestion Data
Source: https://mapr.com/blog/ml-iot-connected-medical-devices/
Use Case – Truck Sensors
How to design a Streaming Analytics Solution?
How to design a Streaming Analytics System?
It usually starts very simple … just one data pipeline
New Event Stream sources are added…
New Processors are interested in the events …
… and the solution becomes the problem
Decouple event streams from consumers through a shared data pipeline
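The reason the direct wiring "becomes the problem" can be seen by counting connections. The sketch below is only an illustration: the function names are hypothetical, and it assumes every source must reach every consumer.

```python
def point_to_point_links(num_sources: int, num_consumers: int) -> int:
    """Every source is wired directly to every consumer."""
    return num_sources * num_consumers


def decoupled_links(num_sources: int, num_consumers: int) -> int:
    """Every source and every consumer connects only to the shared pipeline."""
    return num_sources + num_consumers


if __name__ == "__main__":
    # Compare the two designs as the system grows.
    for sources, consumers in [(1, 1), (3, 4), (10, 10)]:
        print(sources, consumers,
              point_to_point_links(sources, consumers),
              decoupled_links(sources, consumers))
```

With a single pipeline the direct design is simpler, but its link count grows as a product while the decoupled design grows as a sum.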
Apache Kafka
A distributed streaming platform
Messaging Systems: Publish/Subscribe
[Figure: producers call publish(topic, msg) on a publish/subscribe system; consumers subscribe to Topic 1, Topic 2, and Topic 3 and receive each msg.]
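The publish/subscribe pattern in the figure can be sketched as a toy in-memory hub. This is not a Kafka client; the class name `InMemoryBroker`, the topic names, and the message text are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class InMemoryBroker:
    """Toy publish/subscribe hub (illustrative only, not a Kafka client)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a callback that receives every message on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, msg: str) -> int:
        """Deliver msg to every subscriber of topic; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(msg)
        return len(handlers)


broker = InMemoryBroker()
dashboard_inbox: List[str] = []
storage_inbox: List[str] = []

# Two independent consumers subscribe to the same topic.
broker.subscribe("topic-1", dashboard_inbox.append)
broker.subscribe("topic-1", storage_inbox.append)

# The producer only names the topic; it does not know who is listening.
delivered = broker.publish("topic-1", "temperature=31.5")
undelivered = broker.publish("topic-3", "nobody is listening")
```

The producer never references a consumer directly, which is the decoupling the pattern provides.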
Before: How to integrate this variety of data and make it available to all products?
▪ LinkedIn grew to have dozens of data systems and data repositories.
▪ LinkedIn described its point-to-point data pipelines as follows:
Source: the first presentation for Kafka Meetup @ LinkedIn (Bangalore), held on 2015/12/5
After
▪ Kafka was created to serve as a centralized online data pipeline system:
▪ Elastically scalable
▪ Durable
▪ High-throughput
▪ Fast
Why we should care
▪ Over 1,300,000,000,000 messages are transported via Kafka every day at LinkedIn
▪ 300 terabytes of inbound and 900 terabytes of outbound traffic
▪ 4.5 million messages per second on a single cluster
▪ Kafka runs on around 1,300 servers at LinkedIn
Use cases: Newsfeed, Recommendation, Metrics and Monitoring
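To put these figures in perspective, the arithmetic below derives a few averages. It assumes the traffic figures are daily totals and that a terabyte means 10^12 bytes; neither is stated on the slide.

```python
MESSAGES_PER_DAY = 1_300_000_000_000
INBOUND_BYTES_PER_DAY = 300 * 10**12    # assumption: decimal terabytes, per day
OUTBOUND_BYTES_PER_DAY = 900 * 10**12   # assumption: decimal terabytes, per day
SECONDS_PER_DAY = 24 * 60 * 60

# Average message rate across all clusters (roughly 15 million per second).
avg_messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY

# Average inbound message size (roughly 231 bytes).
avg_inbound_bytes_per_message = INBOUND_BYTES_PER_DAY / MESSAGES_PER_DAY

# Each inbound byte is read about three times on average.
fan_out = OUTBOUND_BYTES_PER_DAY / INBOUND_BYTES_PER_DAY
```

Under these assumptions, outbound traffic being three times inbound suggests that each message is consumed by about three readers on average, which is what a shared pipeline with several subscribers would produce.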
A few important characteristics
Fast
◦ Kafka can handle hundreds of megabytes of reads and writes per second from a large number of clients.
◦ It is designed for real-time activity streaming.
Distributed and highly scalable
◦ Kafka has a cluster-centric design that offers strong durability and fault-tolerance guarantees.
◦ Messages are partitioned and spread over a cluster of machines.
Durable
◦ Messages are persisted to disk and replicated within the cluster to prevent data loss.
◦ Each broker can handle terabytes of messages without performance impact.
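How partitioning and replication spread messages over a cluster can be sketched as follows. This is a simplified model, not Kafka's actual placement logic: `zlib.crc32` stands in for Kafka's own key hash, the round-robin replica placement is an assumption, and all function names are hypothetical.

```python
import zlib
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash (crc32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def replicas_for(partition: int, num_brokers: int, replication_factor: int) -> List[int]:
    """Place copies of a partition on consecutive brokers (simplified)."""
    if replication_factor > num_brokers:
        raise ValueError("replication factor cannot exceed the number of brokers")
    return [(partition + i) % num_brokers for i in range(replication_factor)]


# The same key always lands in the same partition, so its messages stay together.
p = partition_for("truck-42", 13)

# Partition 3 on a 4-broker cluster with 3 copies: the first broker in the list
# acts as leader, the others hold replicas.
copies = replicas_for(3, num_brokers=4, replication_factor=3)
```

Because every copy sits on a different broker, losing one machine leaves the other replicas available.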
Kafka architecture: Brokers, Topics, Producers, and Consumers
Kafka Cluster is made up of multiple Kafka Brokers
Kafka ZooKeeper Coordination
[Figure: producers and consumers connect to a cluster of four brokers that is coordinated by ZooKeeper (ZK).]
Apache Kafka - Architecture
Producer
Consumer
Use Case – Truck Sensors
Kafka Single Node Example
Download the latest version from https://kafka.apache.org/downloads
Run ZooKeeper
Wait about 30 seconds for ZooKeeper to start up.
Run Kafka Server (Broker)
Wait about 30 seconds for Kafka to start up.
Create Kafka Topic
• We create a topic called my-topic with a replication factor of 1, since we only have one server.
• We use 13 partitions for my-topic, which means up to 13 Kafka consumers in the same consumer group can read it in parallel.
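Why the partition count caps the number of parallel consumers can be sketched with a simple round-robin assignment. This is an illustration, not Kafka's actual group-assignment protocol; the function and consumer names are hypothetical.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Deal partitions out round-robin; each partition gets exactly one owner."""
    assignment: Dict[str, List[int]] = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# One consumer owns all 13 partitions.
one = assign_partitions(13, ["c0"])

# Four consumers share the 13 partitions.
four = assign_partitions(13, [f"c{i}" for i in range(4)])

# With 15 consumers, two of them receive no partition and sit idle.
fifteen = assign_partitions(13, [f"c{i}" for i in range(15)])
idle = [name for name, parts in fifteen.items() if not parts]
```

Since a partition is read by only one consumer of a group at a time, any consumers beyond the thirteenth have nothing to read.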
Run Kafka Producer
• Notice that we specify the Kafka node, which is running at localhost:9092.
• Next, run start-producer-console.sh and send at least four messages.
Run Kafka Consumer
Notice that we specify the Kafka node, which is running at localhost:9092, as we did before, but we also ask to read all of the messages in my-topic from the beginning with --from-beginning.
Running Kafka Producer and Consumer
• Notice that the messages do not arrive in the order they were sent.
• This is because the messages were spread over 13 partitions and the single consumer reads from all of them.
• Order is guaranteed only within a partition.
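The ordering behavior can be reproduced with a small simulation. It is a sketch under stated assumptions: keyless messages are spread round-robin over the partitions, and the single consumer drains one partition after another. A real consumer interleaves its fetches differently, but the guarantee is the same.

```python
from typing import List, Tuple

NUM_PARTITIONS = 13  # matches my-topic on the slide

Record = Tuple[int, str]  # (send sequence number, payload)


def produce(messages: List[str], num_partitions: int) -> List[List[Record]]:
    """Append messages to partitions round-robin, recording the send order."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for seq, msg in enumerate(messages):
        partitions[seq % num_partitions].append((seq, msg))
    return partitions


def consume_all(partitions: List[List[Record]]) -> List[Record]:
    """A single consumer reads every partition, one after another."""
    return [record for partition in partitions for record in partition]


messages = [f"message {i}" for i in range(30)]
partitions = produce(messages, NUM_PARTITIONS)
received = consume_all(partitions)
received_seqs = [seq for seq, _ in received]

# Every message arrives, but not in global send order.
globally_ordered = received_seqs == sorted(received_seqs)

# Within each partition the send order is preserved.
ordered_within_partitions = all(
    [seq for seq, _ in p] == sorted(seq for seq, _ in p) for p in partitions
)
```

If global order matters, a common approach is to use a single partition or to give related messages the same key so they share a partition.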
IoT Big Data streaming processing patterns
[Figure: event streams feed real-time applications, long-term storage, and real-time dashboards.]
Source: Streaming Big Data on Azure with HDInsight Kafka, Storm and Spark, by Raghav Mohan, Program Manager, Azure HDInsight
Example
Source: https://www.scnsoft.com/blog/salesforce-iot-cloud-benefits-and-limitations
IoT Big Data Analytics
IoT Big Data Architecture
Filtering
Analytics
Ingestion Data
Source: https://mapr.com/blog/ml-iot-connected-medical-devices/
What is Machine Learning?
Source: https://cybrml.com/2017/01/23/ml-in-cs-4-machine-learning-technical-review/
Machine Learning in IoT Applications
Source: https://medium.com/iotforall/using-deep-learning-processors-for-intelligent-iot-devices-1a7ed9d2226d
Dataset
Reference: Deep Learning for IoT Big Data and Streaming Analytics: A Survey
Disadvantages of Pure Cloud Service Model
o Unpredictable response time from cloud server to endpoints
o Unreliable cloud connections can bring down the service
o Excessive data can overburden infrastructure
o Privacy issues when sensitive customer data are stored in the cloud
o Difficulties in scaling to ever increasing number of sensors and actuators
Fog computing for IoT
• Fog computing brings computing and analytics closer to the end users and devices, which removes unnecessary and prohibitive communication delays and saves on transmission costs.
• A fog node can receive, process, and react to incoming data in real time.
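One way a fog node can save on transmission is to summarize raw readings locally and forward only the summary. The sketch below assumes a hypothetical temperature sensor; the function name, the readings, and the alert threshold are all made up for illustration.

```python
from statistics import mean
from typing import Dict, List


def summarize_window(readings: List[float], alert_threshold: float) -> Dict[str, object]:
    """Reduce a window of raw readings to one summary record at the fog node."""
    peak = max(readings)
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": peak,
        # The fog node can react locally and immediately when this is True.
        "alert": peak > alert_threshold,
    }


# Hypothetical readings from one sensor during one time window.
readings = [30.1, 30.4, 29.9, 41.2, 30.0]
summary = summarize_window(readings, alert_threshold=40.0)

# Five raw values are replaced by a single record sent to the cloud.
records_sent_to_cloud = 1
```

The cloud still receives enough information for long-term analytics, while the time-critical reaction happens near the device.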
Ex. Fog computing + Kafka
https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/UCS_CVDs/Cisco_UCS_Integrated_Infrastructure_for_Big_Data_with_Cloudera_and_Apache_Spark.html
Case study #1
Reference: https://mapr.com/blog/ml-iot-connected-medical-devices/
Streaming machine-learning application to detect anomalies in data from a heart monitor
◦ Cheaper sensors that monitor vital signs, combined with machine learning, are making it possible for doctors to rapidly apply smart medicine to their patients’ cases.
[Figure: electrocardiogram (ECG)]
Building the Model with Clustering
Heartbeat activity: normal EKG pattern
We use this repeating pattern to train a model on previous heartbeat activity and then compare subsequent observations to this model in order to detect anomalous behavior.
To build a model of typical heartbeat activity, we process an EKG (from a specific patient or a group of many patients), break it into overlapping pieces that are about 1/3 sec long, and then apply a clustering algorithm to group similar shapes.
The clustering algorithm used is k-means.
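The windowing and clustering steps can be sketched in plain Python. This is not the case study's implementation, which uses Apache Spark: a small hand-written k-means replaces it here, a sine wave stands in for the EKG, and the sampling rate, window step, and k are assumptions.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def sliding_windows(signal: Sequence[float], width: int, step: int) -> List[Vector]:
    """Cut the signal into windows; they overlap whenever step < width."""
    return [list(signal[i:i + width])
            for i in range(0, len(signal) - width + 1, step)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(window: Sequence[float], centroids: List[Vector]) -> int:
    """Index of the centroid closest to the window."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(window, centroids[j]))


def kmeans(windows: List[Vector], k: int, iterations: int = 20, seed: int = 0) -> List[Vector]:
    """Plain Lloyd's algorithm; the k centroids form the catalog of shapes."""
    rng = random.Random(seed)
    centroids = [list(w) for w in rng.sample(windows, k)]
    for _ in range(iterations):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for w in windows:
            groups[nearest(w, centroids)].append(w)
        for j, group in enumerate(groups):
            if group:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


# Synthetic periodic signal standing in for an EKG (assumed 30 samples/sec).
SAMPLES_PER_SECOND = 30
signal = [math.sin(2 * math.pi * t / SAMPLES_PER_SECOND) for t in range(300)]

# Windows of about 1/3 sec, advanced 2 samples at a time so they overlap.
windows = sliding_windows(signal, width=SAMPLES_PER_SECOND // 3, step=2)
catalog = kmeans(windows, k=4)
```

Each centroid is the average of the windows assigned to it, so the catalog holds the typical shapes seen in normal data.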
Apache Spark processing with k-means
Clustering results in a catalog of shapes
The catalog can be used to reconstruct what an EKG should look like.
Using the Model of Normal with Streaming Data
Detecting Anomalies
The difference between the observed and expected EKG (the green minus the red) is
the reconstruction error, or residual (shown in yellow). If the residual is high, then
there could be an anomaly.
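The residual check can be sketched as follows. It is a simplified illustration: the catalog shapes, the two sample windows, and the threshold are hypothetical values, and the expected signal is taken to be the single closest catalog shape.

```python
from typing import List, Sequence

Vector = List[float]


def reconstruct(window: Sequence[float], catalog: List[Vector]) -> Vector:
    """Expected signal: the catalog shape closest to the observed window."""
    return min(catalog,
               key=lambda shape: sum((o - s) ** 2 for o, s in zip(window, shape)))


def residual(window: Sequence[float], catalog: List[Vector]) -> Vector:
    """Observed minus expected, sample by sample."""
    expected = reconstruct(window, catalog)
    return [o - e for o, e in zip(window, expected)]


def is_anomalous(window: Sequence[float], catalog: List[Vector], threshold: float) -> bool:
    """Flag the window when the largest absolute residual exceeds the threshold."""
    return max(abs(r) for r in residual(window, catalog)) > threshold


# Hypothetical catalog of two normal shapes and a hypothetical threshold.
catalog = [[0.0, 1.0, 0.0, -1.0], [1.0, 0.0, -1.0, 0.0]]
THRESHOLD = 0.5

normal_window = [0.1, 0.9, 0.0, -1.1]   # close to the first shape
odd_window = [0.1, 0.9, 2.5, -1.1]      # third sample deviates strongly

normal_flag = is_anomalous(normal_window, catalog, THRESHOLD)
odd_flag = is_anomalous(odd_window, catalog, THRESHOLD)
```

In a streaming setting the same check would run on each incoming window, with the threshold tuned on data known to be normal.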
Case study #2
Reference: The 10th National Conference on Information Technology (NCIT), 24-25 October 2561 BE (2018)
Automated hydroponic vegetable greenhouse using IoT technology and machine learning
[Figure: a camera sends images over the Internet to Amazon S3; the model assigns each image to the Small, Medium, or Large class.]
Vegetable growth analysis, divided into 3 classes
Small, Medium, Large
✓ To build the model, we train on a dataset of 300 images per class.
✓ The model is developed with the Caffe framework, a CNN model, and the SDK of the Intel deep learning training tool, installed on AWS Cloud.
Workflow
[Figure: Camera Module; dataset of 300 images per class; CNNs; Predict Class]
CNNs = Convolutional Neural Networks
Model test results
Vegetable profiles for automatic control, 3 classes
Variable | Value | Meaning
Temp | degrees C | Temperature inside the greenhouse
Hum | % | Air humidity inside the greenhouse
Lux | Lux | Light intensity inside the greenhouse
Fan | On/Off | Fan on/off
Silent | On/Off | Light-shading curtain on/off
Water | On/Off | Water pump on/off
Cool | On/Off | Pump that runs water over the honeycomb cooling pad on/off
Foggy | On/Off | Mist sprayer on/off
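How a per-class profile could drive the On/Off actuators from the Temp, Hum, and Lux readings is sketched below. The slides do not give the control rules or any limits, so the rules, the `Profile` fields, and every number here are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Profile:
    """Limits for one growth class (all values are made up for illustration)."""
    max_temp_c: float
    min_humidity_pct: float
    max_lux: float


# Hypothetical profiles for the three predicted classes.
PROFILES: Dict[str, Profile] = {
    "small": Profile(max_temp_c=28.0, min_humidity_pct=70.0, max_lux=20000.0),
    "medium": Profile(max_temp_c=30.0, min_humidity_pct=65.0, max_lux=30000.0),
    "large": Profile(max_temp_c=32.0, min_humidity_pct=60.0, max_lux=40000.0),
}


def decide(growth_class: str, temp_c: float, humidity_pct: float, lux: float) -> Dict[str, bool]:
    """Return True (On) or False (Off) for each actuator; assumed rules."""
    profile = PROFILES[growth_class]
    too_hot = temp_c > profile.max_temp_c
    return {
        "Fan": too_hot,
        "Cool": too_hot,
        "Foggy": humidity_pct < profile.min_humidity_pct,
        "Silent": lux > profile.max_lux,  # close the shading curtain
    }


# The class predicted by the CNN selects which profile applies.
actions = decide("small", temp_c=31.0, humidity_pct=75.0, lux=15000.0)
```

The same readings can lead to different actions for another class, which is why the predicted growth class is an input to the controller.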
Challenges and Future Directions
o Lack of large IoT datasets
   o More data is needed to achieve higher accuracy.
o Preprocessing
   o Preprocessing is more complex because the system deals with data from different sources that may have various formats.
o Secure and privacy-preserving machine learning
   o Further techniques to defend models against attacks and to limit their effects are necessary for reliable IoT applications.
o Machine learning for IoT devices
   o The requirements of running machine learning on resource-constrained devices must be considered.
Thank you
