IoT Data Streaming
รัฐศิลป์ รานอกภานุวัชร์, D.ENG
Who am I?
• Lecturer, undergraduate Computer Engineering program, Dhurakij Pundit University
• Lecturer, master's program in Big Data Engineering, Dhurakij Pundit University
• Adjunct lecturer for Data Streaming and Real Time Analytics, National Institute of Development Administration (NIDA)
• Instructor for Amazon cloud courses at 9expert
• Consultant to private companies on Big Data and Blockchain
• Research: Blockchain, IoT, and Big Data
Outline
• Internet of Things (IoT)
• IoT Data Streaming
• Collect Data
• MQTT
• Kafka
• Stream processing platforms
• Flink
• Storm
• Spark
• Use-Case Examples
Internet of Things (IoT)
Credit: https://orzota.com/industrial-iot/
[Figure: Things (sensors & actuators; generate data stream), Software and platform (data stream processing), Visualization]
IoT data characteristics
• Large-scale streaming data
• Heterogeneity
• Time and space correlation
• High-noise data
IoT applications must support
• High-speed data streams
• Real-time or near-real-time actions
• Sometimes, joins
○ with static data
○ with historical data
Reference: M. Chen, S. Mao, Y. Zhang, and V. C. Leung, Big data: related technologies, challenges and future prospects. Springer, 2014
What is Data Streaming?
Ref: https://www.cisco.com/c/dam/en/us/products/collateral/analytics-automation-software/data-virtualization/r20-consultancy-combining-datastreaming-wp.pdf
• In data streaming, data is transmitted continuously from one system (the producer) to another (the consumer), which reacts to the incoming data immediately, with as little delay as possible.
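The producer/consumer relationship can be illustrated with a minimal Python sketch. It assumes an in-process queue stands in for the network link, and the "reaction" (doubling the value) is purely hypothetical; it is not taken from the referenced white paper.

```python
import queue
import threading

def producer(channel, readings):
    """Push each reading onto the channel as soon as it is available."""
    for reading in readings:
        channel.put(reading)
    channel.put(None)  # sentinel that ends this demo stream

def consumer(channel, handled):
    """React to each record as it arrives instead of waiting for a batch."""
    while True:
        record = channel.get()
        if record is None:
            break
        handled.append(record * 2)  # hypothetical reaction to the record

channel = queue.Queue()
handled = []
worker = threading.Thread(target=consumer, args=(channel, handled))
worker.start()
producer(channel, [1, 2, 3])
worker.join()
print(handled)  # [2, 4, 6]
```

A real stream has no sentinel because it never ends; the sentinel is only there so the example terminates.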
Distributed Streaming
• Streaming: computations on never-ending "streams" of data records ("events")
• Distributed: computation spread across many machines
Ref: Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
Stateless streaming
• Every incoming record is independent of other records.
• There is no relation between records, so each one can be processed and persisted independently.
• E.g., map, filter, join with static data
Ref: Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
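As a rough illustration of the three stateless operations listed above, the following Python sketch maps, filters, and joins each record against a static table. The sensor names, the 30 °C threshold, and the lookup table are all hypothetical.

```python
# Static reference data (hypothetical): sensor id -> location.
SENSOR_LOCATIONS = {"s1": "warehouse", "s2": "loading dock"}

def to_celsius(record):
    """Map: transform one record using nothing but the record itself."""
    return {**record, "temp_c": round((record["temp_f"] - 32) * 5 / 9, 1)}

def is_hot(record):
    """Filter: keep or drop one record, again without any history."""
    return record["temp_c"] > 30.0

def enrich(record):
    """Join with static data: look the sensor up in a fixed table."""
    location = SENSOR_LOCATIONS.get(record["sensor"], "unknown")
    return {**record, "location": location}

def process(stream):
    for record in stream:
        converted = to_celsius(record)
        if is_hot(converted):
            yield enrich(converted)

events = [
    {"sensor": "s1", "temp_f": 95.0},
    {"sensor": "s2", "temp_f": 68.0},
    {"sensor": "s3", "temp_f": 104.0},
]
alerts = list(process(events))
```

Because no function remembers anything between records, the records could be processed in any order, or on different machines, with the same result.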
Stateful Streaming
• Computation and state
○ E.g., counters, windows of past events, state machines, trained ML models
○ Example: counts of each distinct word seen in the records
• Result depends on the history of the stream
○ Processing an incoming record depends on the results of previously processed records
Ref: Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
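The word-count example mentioned above can be sketched in a few lines of Python. This is only an in-memory illustration; a real stream processor would also checkpoint the state so it survives failures, which is not shown here.

```python
from collections import Counter

def running_word_counts(records):
    """Yield the word counts seen so far after each incoming record."""
    counts = Counter()  # state carried across records
    for record in records:
        counts.update(record.split())
        yield dict(counts)  # snapshot of the state after this record

snapshots = list(running_word_counts(["pump on", "pump off", "valve on"]))
print(snapshots[-1])  # {'pump': 2, 'on': 2, 'off': 1, 'valve': 1}
```

The output for the third record depends on the first two, which is exactly what makes the computation stateful.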
Event-Time Streaming
• Data records are associated with timestamps (time series data)
• Processing depends on timestamps
• An event-time stream processor should give you the tools to reason about time
○ Handle streams that are out of order
Ref: Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
Event-Time Streaming
• Because time matters
• Two notions of time
○ Event time: the time at which events actually occurred
○ Processing time: the time at which events are observed in the system
Ref: Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
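The difference between the two notions of time shows up as soon as one record arrives late. The Python sketch below counts the same four events per 10-second tumbling window, once by event time and once by processing (arrival) time. The window size, timestamps, and arrival times are made up for illustration, and the sketch deliberately leaves out the question of when a window can be considered complete.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # hypothetical tumbling-window size

def window_start(timestamp):
    """Return the start of the tumbling window that contains timestamp."""
    return timestamp - (timestamp % WINDOW_SECONDS)

def count_per_window(timestamps):
    """Count how many timestamps fall into each window."""
    counts = defaultdict(int)
    for timestamp in timestamps:
        counts[window_start(timestamp)] += 1
    return dict(counts)

# Records listed in arrival order as (event_time, arrival_time).
# The record with event time 8 is delayed and arrives at time 14.
records = [(1, 2), (12, 13), (8, 14), (15, 16)]

by_event_time = count_per_window(event for event, _ in records)
by_processing_time = count_per_window(arrival for _, arrival in records)

print(by_event_time)       # {0: 2, 10: 2}
print(by_processing_time)  # {0: 1, 10: 3}
```

Grouping by event time puts the late record back into the window where it actually happened; grouping by processing time silently shifts it into the next window.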
Things are Producing Streaming Data
• Smart city
• Healthcare/medical devices
• Connected cars
• Logistics
• Home automation
• Airlines
• Farmers
• Smart machinery
• Security systems
IoT Big Data Architecture
[Figure: IoT big data architecture; labelled stages include Filtering and Analytics]
Source: https://mapr.com/blog/ml-iot-connected-medical-devices/
Collect Data (high level architecture)
How to integrate? MQTT or Kafka
Credit: https://thenewstack.io/mqtt-protocol-iot/
Messaging Systems: Publish/Subscribe
[Figure: producers call publish(topic, msg) on a publish/subscribe system holding Topic 1, Topic 2, and Topic 3; consumers subscribe to topics and receive each msg]
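The publish(topic, msg) and subscribe calls in the figure can be sketched with a tiny in-memory broker in Python. The class and method names are illustrative only and do not correspond to any particular MQTT or Kafka client API.

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory publish/subscribe system (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Register interest in a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, msg):
        """Deliver msg to every consumer subscribed to topic."""
        for callback in self._subscribers.get(topic, []):
            callback(msg)

broker = Broker()
received = []
broker.subscribe("topic1", lambda msg: received.append(("consumer A", msg)))
broker.subscribe("topic1", lambda msg: received.append(("consumer B", msg)))
broker.subscribe("topic2", lambda msg: received.append(("consumer C", msg)))

broker.publish("topic1", "21.5")
print(received)  # [('consumer A', '21.5'), ('consumer B', '21.5')]
```

The producer never refers to a consumer directly: it only names a topic, and the broker decides who receives the message.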
Example
MQTT uses the pub/sub pattern to connect interested parties with each other
[Figure: example devices include Arduino and Raspberry Pi]
MQTT - Publish/subscribe messaging protocol
• MQTT is a machine-to-machine (M2M) protocol widely used in the Internet of Things.
• It uses the publish/subscribe paradigm, in contrast to HTTP, which is based on the request/response paradigm.
• Built on top of TCP/IP for constrained devices and unreliable networks
• Many (open source) broker implementations
• Many client libraries
MQTT Architecture (no scale)
MQTT Architecture (clustering depends on broker implementation)
MQTT Trade-Offs
Pros
• Lightweight
• Simple API
• Built for poor-connectivity / high-latency scenarios
• Many client connections (tens of thousands per MQTT server)
Cons
• Queuing, not stream processing
• No buffering
• Limited scalability
• Poor integration with the rest of the enterprise
• No reprocessing of events
Apache Kafka
A distributed streaming platform
Kafka Data Streams
Kafka is used to stream data into data lakes, applications, and real-time stream analytics systems.
Kafka architecture: Brokers, Topics, Producers, and Consumers
A Kafka cluster is made up of multiple Kafka brokers
Apache Kafka - Architecture
[Figures: Kafka architecture diagrams with Producer and Consumer labels]
Kafka ZooKeeper Coordination
[Figure: producers and consumers connected to four brokers, coordinated by ZooKeeper (ZK)]
A few important characteristics
• Fast
○ Kafka can handle hundreds of megabytes of reads and writes per second from a large number of clients.
○ Designed for real-time activity streaming.
• Distributed and highly scalable
○ Kafka's cluster-centric design offers strong durability and fault-tolerance guarantees.
○ Message partitions are spread over a cluster of machines.
• Durable
○ Messages are persisted to disk and replicated within the cluster to prevent data loss.
○ Each broker can handle terabytes of messages without performance impact.
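Two of the ideas above, partitioning by key and keeping messages so they can be read again, can be sketched with an in-memory log in Python. This is a simplification under stated assumptions: `zlib.crc32` stands in for the partitioner (Kafka's own default differs), lists stand in for on-disk segments, and replication is not modelled. The class name and the truck keys are hypothetical.

```python
import zlib

class PartitionedLog:
    """Append-only log split into partitions (in-memory illustration)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a record and return its (partition, offset)."""
        # A stable hash sends every record with the same key to one partition,
        # which preserves per-key ordering.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1

    def read(self, partition, offset):
        """Return all records from offset onward; nothing is deleted on read."""
        return self.partitions[partition][offset:]

log = PartitionedLog(num_partitions=3)
p1, o1 = log.append("truck-1", "speed=80")
p2, o2 = log.append("truck-1", "speed=85")

first_read = log.read(p1, 0)
replay = log.read(p1, 0)  # reading again from offset 0 replays the events
```

Because reading does not remove records, a consumer can rewind its offset and reprocess earlier events, which a plain queue does not allow.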
Streaming Platform
Use Case – Truck Sensors
Kafka Trade-Offs (from an IoT perspective)
Pros
• Stream processing, not just queuing
• High throughput
• Large scale
• High availability
• Long-term storage and buffering
• Reprocessing of events
• Good integration with the rest of the enterprise
Cons
• Not built for tens of thousands of connections
• Requires a stable network and good infrastructure
Collect Data (high level architecture)
How to integrate? MQTT + Kafka
End-to-End Integration from MQTT to Apache Kafka
MQTT Source and Sink Connectors for Kafka Connect
https://www.confluent.io/hub/
https://www.confluent.io/connector/kafka-connect-mqtt/
IoT Data Ingestion through MQTT into Kafka
Ref: https://github.com/gschmutz/stream-processing-workshop/tree/master/06-iot-data-ingestion-over-mqtt
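Conceptually, the bridge between the two systems turns each MQTT message into a keyed Kafka record. The Python sketch below shows only that mapping step, with plain tuples instead of real clients. The topic layout `truck/<id>/position`, the Kafka topic name `truck_position`, and the function names are assumptions made for illustration; this is not the connector's configuration or API.

```python
def mqtt_to_kafka_record(mqtt_topic, payload, kafka_topic="truck_position"):
    """Turn one MQTT message into a (topic, key, value) triple for Kafka."""
    levels = mqtt_topic.split("/")
    # Assumed layout: truck/<id>/position, so the second level is the device id.
    device_id = levels[1] if len(levels) > 1 else mqtt_topic
    return kafka_topic, device_id, payload

def bridge(mqtt_messages):
    """Map a batch of (mqtt_topic, payload) pairs to Kafka records."""
    return [mqtt_to_kafka_record(topic, payload)
            for topic, payload in mqtt_messages]

records = bridge([
    ("truck/11/position", '{"lat": 13.75, "lon": 100.5}'),
    ("truck/12/position", '{"lat": 13.76, "lon": 100.51}'),
])
```

Using the device id as the record key keeps all messages from one truck in the same Kafka partition, so their order is preserved downstream.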
IoT Big Data Architecture
[Figure: IoT big data architecture; labelled stages include Filtering and Analytics]
Ref: https://mapr.com/blog/ml-iot-connected-medical-devices/
What is stream processing?
• Technology that lets users query a continuous data stream and detect conditions quickly, within a short time of receiving the data.
• The detection time ranges from a few milliseconds to minutes.
Stream processing tools
Two Types of Stream Processing
Native Streaming
• Every incoming record is processed as soon as it arrives, without waiting for others.
• Long-running operator processes run continuously, and every record passes through them to be processed.
• Lets the framework achieve the minimum possible latency.
• Fault tolerance is harder to achieve.
Micro-batching
• Records arriving within every few seconds are batched together and then processed as a single mini-batch, with a delay of a few seconds.
• This comes at the cost of latency and does not feel like natural streaming.
https://medium.com/@chandanbaranwal/spark-streaming-vs-flink-vs-storm-vs-kafka-streams-vs-samza-choose-your-stream-processing-91ea3f04675b
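The two processing styles can be contrasted in a short Python sketch. One simplification is assumed: batches are cut by record count rather than by a time interval, purely so that the example is deterministic. The function names are illustrative.

```python
def native(stream, handle):
    """Native streaming: handle each record the moment it arrives."""
    for record in stream:
        handle([record])

def micro_batch(stream, handle, batch_size):
    """Micro-batching: collect records, then handle each small batch at once."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle(batch)
            batch = []  # start a fresh batch; the handled one is left intact
    if batch:
        handle(batch)  # flush the final partial batch

native_calls, batch_calls = [], []
native(range(5), native_calls.append)
micro_batch(range(5), batch_calls.append, batch_size=2)

print(native_calls)  # [[0], [1], [2], [3], [4]]
print(batch_calls)   # [[0, 1], [2, 3], [4]]
```

In the micro-batch case, record 0 is not handled until record 1 has arrived, which is where the extra latency comes from.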
Apache Storm
• Distributed dataflow abstraction (spouts and bolts) for large-scale stream processing
• True streaming; good for simple event-based use cases
• Very low latency and high throughput
• No state management
• A good fit for a simple IoT event-based alerting system
[Figure: spouts are the sources of streams; bolts do the processing (filtering, functions, aggregations, joins, etc.)]
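The spout-and-bolt dataflow can be imitated with Python generators to show the shape of a simple alerting topology. This is an analogy only, not Storm's API: the function names, readings, and threshold are hypothetical, and everything runs in one process rather than being distributed.

```python
def sensor_spout():
    """Spout: the source of the stream (hypothetical sensor readings)."""
    yield from [("t1", 72), ("t2", 101), ("t1", 110)]

def filter_bolt(stream, threshold):
    """Bolt: pass on only the readings above the threshold."""
    for sensor, value in stream:
        if value > threshold:
            yield sensor, value

def alert_bolt(stream):
    """Bolt: turn each remaining reading into an alert message."""
    for sensor, value in stream:
        yield f"ALERT {sensor}: {value}"

# Wire the topology: spout -> filter bolt -> alert bolt.
alerts = list(alert_bolt(filter_bolt(sensor_spout(), threshold=100)))
print(alerts)  # ['ALERT t2: 101', 'ALERT t1: 110']
```

Each stage handles one record at a time and keeps no state, which matches the simple event-based alerting case the slide describes.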
Apache Flink
• Stateful computations over streams
• The first true streaming framework with advanced features such as event-time processing and watermarks
• Low latency with high throughput
• Good for complex event-time processing, aggregation, stream joins, etc.
[Figure labels: Queries, Applications, Devices, etc.; Streams; Historic Data; Application; Database, Stream, File / Object Storage]
Architecture and Process Model
Ref: https://ci.apache.org/projects/flink/flink-docs-release-1.1/internals/general_arch.html
Ref: https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/hadoop.html
Apache Spark
• Spark has emerged as the true successor to Hadoop
• Unified batch and stream processing over a batch runtime
• High throughput; fault tolerant by default because of its micro-batch nature
• Not true streaming, so not suitable for low-latency requirements
• Good for machine learning on streams
Use Case
Real-Time Monitoring System in Automotive Manufacturing
• Detect abnormal events and perform diagnosis in a process
Ref: Muhammad Syafrudin, "Performance Analysis of IoT-Based Sensor, Big Data Processing, and Machine Learning Model for Real-Time Monitoring System in Automotive Manufacturing"
System design
Ref: Muhammad Syafrudin, "Performance Analysis of IoT-Based Sensor, Big Data Processing, and Machine Learning Model for Real-Time Monitoring System in Automotive Manufacturing"
Sensor Data
[Figure: performance evaluation in terms of latency with different numbers of clients (a) and servers (b); throughput with different numbers of clients (c) and servers (d)]
Thank you
