- IoT devices generate large streams of data that need to be collected and processed in real time. MQTT and Kafka are two common options for collecting IoT data streams: MQTT is a lightweight messaging protocol with limited scalability, while Kafka is a highly scalable streaming platform.
- Stream processing platforms like Flink, Storm and Spark can be used to analyze the IoT data streams. Flink supports both batch and stream processing while Storm is best for low-latency streaming. Spark is better for machine learning on streams.
- An example use case is real-time equipment monitoring in a factory where IoT sensors stream data to Kafka which is then processed by Flink to detect abnormalities and enable predictive maintenance. Performance is evaluated based on latency and
2. Who am I?
Lecturer, bachelor's program in Computer Engineering, Dhurakij Pundit University
Lecturer, master's program in Big Data Engineering, Dhurakij Pundit University
Adjunct lecturer for the course Data Streaming and Real Time Analytics, National Institute of Development Administration (NIDA)
Amazon cloud instructor at the 9expert institute
Consultant to private companies on Big Data and Blockchain
Research: Blockchain, IoT, and Big Data
3. Outline
• Internet of Things (IoT)
• IoT Data Streaming
• Collect Data
• MQTT
• Kafka
• Streaming processing platform
• Flink
• Storm
• Spark
• Use-Case Examples
4. Internet of Things (IoT)
[Figure: IoT stack. Things (sensors and actuators) generate data streams; software and platform (data stream processing); visualization.]
Credit: https://orzota.com/industrial-iot/
5. IoT data characteristics
IoT data:
• Large-scale streaming data
• Heterogeneity
• Time and space correlation
• High-noise data
IoT applications support:
• High-speed data streams
• Real-time or near real-time actions
• Sometimes the need to join
○ with static data
○ with historical data
Reference: M. Chen, S. Mao, Y. Zhang, and V. C. Leung, Big data: related technologies, challenges and future prospects. Springer, 2014
6. What is Data Streaming?
Ref: https://www.cisco.com/c/dam/en/us/products/collateral/analytics-automation-software/data-virtualization/r20-consultancy-combining-datastreaming-wp.pdf
In data streaming, data is continuously transmitted from one system (the producer) to another (the consumer), which reacts to the incoming data immediately, with no noticeable delay.
7. Distributed Streaming
Streaming:
Computations on never ending “streams” of data records (“events”)
Distributed:
Computation spread across many machines
Ref: Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
8. Stateless streaming
Every incoming record is independent of other records.
There is no relation between different records, so each can be processed and persisted independently.
E.g., map, filter, join with static data
Ref: Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
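The three stateless operations named above can be sketched in plain Python. This is only an illustration, not code from any streaming framework: the record layout, the `SENSOR_LOCATIONS` table, and the 50 °C threshold are all assumptions made for the example.

```python
from typing import Iterable, Iterator

# Hypothetical static reference data: sensor id -> location.
SENSOR_LOCATIONS = {"s1": "line-A", "s2": "line-B"}


def to_celsius(record: dict) -> dict:
    """Map: convert one reading from Fahrenheit to Celsius."""
    return {**record, "temp_c": round((record["temp_f"] - 32) * 5 / 9, 1)}


def is_hot(record: dict) -> bool:
    """Filter: keep only readings above an assumed 50 degree threshold."""
    return record["temp_c"] > 50.0


def enrich(record: dict) -> dict:
    """Join with static data: attach the sensor's location if known."""
    location = SENSOR_LOCATIONS.get(record["sensor"], "unknown")
    return {**record, "location": location}


def stateless_pipeline(stream: Iterable[dict]) -> Iterator[dict]:
    # Each record is handled on its own; nothing is remembered between records.
    for record in stream:
        converted = to_celsius(record)
        if is_hot(converted):
            yield enrich(converted)


readings = [
    {"sensor": "s1", "temp_f": 212.0},
    {"sensor": "s2", "temp_f": 68.0},
    {"sensor": "s3", "temp_f": 140.0},
]
alerts = list(stateless_pipeline(readings))
```

Because no record depends on another, the records could be processed in any order, or on different machines, and give the same per-record results.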
9. Stateful Streaming
Computation combined with state
E.g., counters (such as counts of each distinct word seen in records), windows of past events, state machines, trained ML models
The result depends on the history of the stream
Processing of an incoming record depends on the result of previously processed records
Ref: Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
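The word-count example mentioned on this slide can be sketched as a generator that keeps a counter as its state. This is a minimal in-memory illustration; a real stream processor would also checkpoint the state, which is not shown here. The sample records are made up.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def running_word_count(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (word, count so far) for every word in every incoming record."""
    counts: Counter = Counter()  # the state: survives from record to record
    for record in stream:
        for word in record.lower().split():
            counts[word] += 1
            # The emitted value depends on all earlier records, not only this one.
            yield word, counts[word]


updates = list(running_word_count(["pump on", "pump off", "valve on"]))
```

The second occurrence of "pump" yields a count of 2 only because the first record was processed earlier, which is exactly what makes the computation stateful.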
10. Event-Time Streaming
Data records associated with timestamps (time series data)
Processing depends on timestamps
An event-time stream processor should give you the tools to reason about time
Handle streams that are out of order
Ref: Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
11. Event-Time Streaming
Because time matters, two notions of time are distinguished:
• Event time: the time at which events actually occurred
• Processing time: the time at which events are observed in the system
Ref: Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
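To make the difference concrete, the sketch below counts events per tumbling window by event time, even though the events arrive out of order. It is a simplified model, not Flink's API: the function name, the 10-unit window, and the rule "watermark = largest event time seen minus an allowed out-of-orderness of 5" are assumptions chosen for the example. Events whose window has already been closed by the watermark are reported as dropped.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tumbling_event_time_counts(
    events: Iterable[Tuple[int, str]],
    window_size: int = 10,
    max_out_of_orderness: int = 5,
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Return ([(window_start, count), ...], [event times dropped as too late])."""
    open_windows: Dict[int, int] = defaultdict(int)
    results: List[Tuple[int, int]] = []
    dropped: List[int] = []
    watermark = float("-inf")

    for event_time, _payload in events:  # arrival order = processing-time order
        window_start = (event_time // window_size) * window_size
        if window_start + window_size <= watermark:
            dropped.append(event_time)  # its window was already emitted
        else:
            open_windows[window_start] += 1

        # Assumed watermark rule: trail the largest event time seen so far.
        watermark = max(watermark, event_time - max_out_of_orderness)

        # Emit every window that ends at or before the watermark.
        for start in sorted(open_windows):
            if start + window_size <= watermark:
                results.append((start, open_windows.pop(start)))

    # End of input: flush whatever is still open.
    for start in sorted(open_windows):
        results.append((start, open_windows[start]))
    return results, dropped


# Event times 8 and 3 arrive after later events (out of order).
arrivals = [(1, "a"), (4, "b"), (12, "c"), (8, "d"), (16, "e"), (3, "f"), (27, "g")]
window_counts, late_events = tumbling_event_time_counts(arrivals)
```

In this run the event with time 8 is still counted in window 0 because the watermark had not yet passed 10, while the event with time 3 arrives after that window was emitted and is dropped.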
12. Things are Producing Streaming Data
Smart city
Healthcare/Medical device
Connected cars
Logistics
Home automation
Airlines
Farmers
Smart Machinery
Security system
13. IoT Big Data Architecture
[Figure: IoT big data architecture, with filtering and analytics stages.]
Source: https://mapr.com/blog/ml-iot-connected-medical-devices/
14. Collect Data (high level architecture)
How to integrate? MQTT or Kafka
15. Copyright: https://thenewstack.io/mqtt-protocol-iot/
17. Example
MQTT uses the pub/sub pattern to connect interested parties with each other
Arduino, Raspberry Pi
18. MQTT - Publish/subscribe messaging protocol
MQTT is a machine-to-machine (M2M) protocol widely used in the Internet of Things.
It uses the publish/subscribe paradigm, in contrast to HTTP, which is based on the request/response paradigm.
Built on top of TCP/IP for constrained devices and unreliable networks
Many (open source) broker implementations
Many client libraries
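The publish/subscribe idea can be illustrated with a toy in-memory broker. This is not an MQTT client or broker implementation: `ToyBroker` and its methods are hypothetical names, there is no network, QoS, or retained messages, and special cases such as topics starting with `$` are ignored. Only the topic-filter matching follows MQTT's wildcard rules (`+` for exactly one level, `#` for all remaining levels).

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def topic_matches(topic_filter: str, topic: str) -> bool:
    """MQTT-style matching: '+' matches one level, '#' matches the rest."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


class ToyBroker:
    """Hypothetical in-memory broker: publishers and subscribers never meet."""

    def __init__(self) -> None:
        self._subscriptions: Dict[str, List[Callable[[str, str], None]]] = defaultdict(list)

    def subscribe(self, topic_filter: str, callback: Callable[[str, str], None]) -> None:
        self._subscriptions[topic_filter].append(callback)

    def publish(self, topic: str, payload: str) -> int:
        """Deliver to every matching subscriber; return the number of deliveries."""
        delivered = 0
        for topic_filter, callbacks in self._subscriptions.items():
            if topic_matches(topic_filter, topic):
                for callback in callbacks:
                    callback(topic, payload)
                    delivered += 1
        return delivered


broker = ToyBroker()
temperatures: List[Tuple[str, str]] = []
everything: List[Tuple[str, str]] = []
broker.subscribe("factory/+/temperature", lambda t, p: temperatures.append((t, p)))
broker.subscribe("factory/#", lambda t, p: everything.append((t, p)))

sent_temperature = broker.publish("factory/line1/temperature", "21.5")
sent_humidity = broker.publish("factory/line1/humidity", "40")
```

The publisher only names a topic; it does not know which subscribers exist, which is the contrast with HTTP's request/response model drawn above.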
22. MQTT Trade-Offs
Pros
Lightweight
Simple API
Built for poor connectivity / high latency scenarios
Many client connections (tens of thousands per MQTT server)
Cons
Queuing, not stream processing
No buffering
No high scalability
No good integration with the rest of the enterprise
No reprocessing of events
33. A few important characteristics
Fast
Kafka can handle hundreds of megabytes of reads and writes per second from a large number of clients.
Designed for real-time activity streaming.
Distributed and highly scalable
Kafka has a cluster-centric design that offers strong durability and fault-tolerance guarantees.
Message partitions are spread over a cluster of machines.
Durable
Messages are persisted to disk and replicated within the cluster to prevent data loss.
Each broker can handle terabytes of messages without performance impact.
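How messages end up spread over partitions can be sketched with a toy topic made of append-only lists. This is an illustration under stated assumptions, not Kafka's implementation: `ToyTopic` is a hypothetical class, everything lives in memory on one machine, there is no replication, and CRC32 stands in for the hash function (Kafka's default partitioner for keyed records uses murmur2).

```python
import zlib
from typing import List, Tuple


class ToyTopic:
    """Hypothetical topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Tuple[str, str]]] = [[] for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        # Stand-in hash: the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key: str, value: str) -> Tuple[int, int]:
        """Append a record; return (partition, offset within that partition)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int) -> List[Tuple[str, str]]:
        """Read all records of one partition starting at the given offset."""
        return self.partitions[partition][offset:]


topic = ToyTopic(num_partitions=3)
first = topic.append("sensor-1", "20.1")
second = topic.append("sensor-1", "20.4")
other = topic.append("sensor-2", "19.8")
```

Because all records with the same key land in one partition, their relative order is preserved there, while different keys can be handled by different machines in parallel.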
36. Kafka Trade-Offs (from IoT perspective)
Pros
Stream processing, not just queuing
High throughput
Large scale
High availability
Long term storage and buffering
Reprocessing of events
Good integration with the rest of the enterprise
Cons
Not built for tens of thousands of connections
Requires stable network and good infrastructure
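"Long term storage and buffering" and "reprocessing of events" follow from the same idea: records stay in the log after they are read, and each consumer only tracks an offset. The sketch below models that with hypothetical `RetainedLog` and `Consumer` classes; it is not the Kafka client API, and the temperature values and thresholds are made up.

```python
from typing import List


class RetainedLog:
    """Hypothetical log: reading does not remove records."""

    def __init__(self) -> None:
        self._records: List[float] = []

    def append(self, value: float) -> int:
        self._records.append(value)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset: int) -> List[float]:
        return self._records[offset:]


class Consumer:
    """Hypothetical consumer that remembers how far it has read."""

    def __init__(self, log: RetainedLog) -> None:
        self.log = log
        self.offset = 0

    def poll(self) -> List[float]:
        batch = self.log.read_from(self.offset)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        self.offset = offset  # rewind (or skip ahead) without touching the log


log = RetainedLog()
for temperature in [65.0, 72.5, 58.0]:
    log.append(temperature)

consumer = Consumer(log)
first_pass = [t for t in consumer.poll() if t > 70.0]   # original alert rule
nothing_new = consumer.poll()                            # already caught up

consumer.seek(0)                                         # replay from the start
second_pass = [t for t in consumer.poll() if t > 60.0]  # revised alert rule
```

With a pure queue, the first pass would have consumed the messages and the revised rule could only be applied to future data.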
39. MQTT Source and Sink Connectors for Kafka Connect
https://www.confluent.io/hub/
https://www.confluent.io/connector/kafka-connect-mqtt/
40. IoT Data Ingestion through MQTT into Kafka
Ref: https://github.com/gschmutz/stream-processing-workshop/tree/master/06-iot-data-ingestion-over-mqtt
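Conceptually, a bridge between the two systems has to turn each MQTT message (topic plus payload) into a Kafka record (topic, key, value). The function below is a hedged sketch of one possible mapping, not the behaviour of any particular connector: the `truck/<id>/<measurement>` topic layout, the `truck_<measurement>` target topic name, and the JSON envelope are all assumptions made for illustration.

```python
import json
from typing import NamedTuple, Optional


class KafkaRecord(NamedTuple):
    topic: str
    key: str
    value: bytes


def mqtt_to_kafka(mqtt_topic: str, payload: bytes) -> Optional[KafkaRecord]:
    """Map one MQTT message to a Kafka record, or None if the topic is unknown."""
    levels = mqtt_topic.split("/")
    if len(levels) != 3 or levels[0] != "truck":
        return None  # not part of the assumed topic layout
    _, truck_id, measurement = levels
    # Keep the original MQTT topic so downstream consumers can still see it.
    envelope = {"mqtt_topic": mqtt_topic, "payload": payload.decode("utf-8")}
    return KafkaRecord(
        topic=f"truck_{measurement}",
        key=truck_id,  # keyed by device so one device's data stays ordered
        value=json.dumps(envelope).encode("utf-8"),
    )


record = mqtt_to_kafka("truck/42/position", b"47.37,8.54")
ignored = mqtt_to_kafka("building/7/door", b"open")
```

Using the device id as the record key is a design choice: it keeps all messages of one device in the same partition, so their order is preserved.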
41. IoT Big Data Architecture
[Figure: IoT big data architecture, with filtering and analytics stages.]
Ref: https://mapr.com/blog/ml-iot-connected-medical-devices/
42. What is stream processing?
A technology that lets users query a continuous data stream and detect conditions quickly, within a short time of receiving the data.
The detection time varies from a few milliseconds to minutes.
45. Native Streaming
Every incoming record is processed as soon as it arrives, without waiting for others.
Continuously running processes run forever, and every record passes through them to be processed.
This lets the framework achieve the minimum latency possible.
But fault tolerance is hard to achieve.
46. Micro-batching
Incoming records are batched together every few seconds and then processed as a single mini-batch, with a delay of a few seconds.
This comes at the cost of latency, so it does not feel like natural streaming.
https://medium.com/@chandanbaranwal/spark-streaming-vs-flink-vs-storm-vs-kafka-streams-vs-samza-choose-your-stream-processing-91ea3f04675b
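The latency cost of micro-batching can be estimated with a small idealized model. The assumptions are loud ones: processing itself takes no time, a record is handled at the end of the batch interval in which it arrives, and native streaming would therefore have zero added delay. The arrival times and the 2-second interval are made up.

```python
import math
from typing import List


def micro_batch_latencies(arrivals: List[float], batch_interval: float) -> List[float]:
    """Added delay per record when each record waits for its batch to close."""
    latencies = []
    for arrival in arrivals:
        # The batch containing this record closes at the next interval boundary.
        batch_end = (math.floor(arrival / batch_interval) + 1) * batch_interval
        latencies.append(batch_end - arrival)
    return latencies


arrival_times = [0.5, 1.0, 2.5, 3.75]  # seconds
delays = micro_batch_latencies(arrival_times, batch_interval=2.0)
average_delay = sum(delays) / len(delays)
```

In this model every record waits between zero and one full batch interval, so shrinking the interval lowers latency but means more, smaller batches to schedule.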
47. Apache Storm
Distributed dataflow abstraction (spouts and bolts) for large-scale stream processing
• Spouts: sources of streams
• Bolts: processing (filtering, functions, aggregations, joins, etc.)
True streaming; good for simple event-based use cases, such as a simple IoT event-based alerting system
Very low latency and high throughput
No state management
48. Apache Flink
[Figure: Flink overview. Labels: queries, applications, devices, etc.; streams and historic data; application; database, stream, file / object storage.]
Stateful computations over streams
First true streaming framework with all the advanced features, such as event-time processing and watermarks
Low latency with high throughput
Good for complex event-time processing, aggregations, stream joins, etc.
49. Architecture and Process Model
Ref: https://ci.apache.org/projects/flink/flink-docs-release-1.1/internals/general_arch.html
51. Apache Spark
Spark has emerged as the true successor of Hadoop
Unified batch and stream processing over a batch runtime
High throughput; fault tolerant by default due to its micro-batch nature
Not true streaming; not suitable for low-latency requirements
Good for machine learning on streams
53. Real-Time Monitoring System in Automotive Manufacturing
Detect and diagnose abnormal events in a process
Ref: Muhammad Syafrudin, “Performance Analysis of IoT-Based Sensor, Big Data Processing, and Machine Learning Model for Real-Time Monitoring System in Automotive Manufacturing”
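As a rough idea of what "detecting abnormal events" on a sensor stream can look like, the sketch below flags readings that deviate strongly from a sliding window of recent values. This rolling z-score rule is a simple stand-in chosen for illustration; it is not the method used in the cited paper, and the window size, threshold, and readings are assumptions.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, Iterable, List, Tuple


def detect_anomalies(
    values: Iterable[float], window: int = 5, threshold: float = 3.0
) -> List[Tuple[int, float]]:
    """Return (index, value) for readings far from the recent window's mean."""
    history: Deque[float] = deque(maxlen=window)  # state: the last `window` readings
    anomalies: List[Tuple[int, float]] = []
    for index, value in enumerate(values):
        if len(history) == window:  # wait until the window is full
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append((index, value))
        history.append(value)  # the oldest reading falls out automatically
    return anomalies


sensor_readings = [20.0, 20.5, 19.5, 20.0, 20.5, 35.0, 20.0, 19.5]
abnormal = detect_anomalies(sensor_readings)
```

Note that the spike itself enters the window afterwards and inflates the standard deviation for a while, which is one reason production systems tend to use more robust models.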
54. System design
Ref: Muhammad Syafrudin, “Performance Analysis of IoT-Based Sensor, Big Data Processing, and Machine Learning Model for Real-Time Monitoring System in Automotive Manufacturing”
59. Performance evaluation
[Figure: performance evaluation in terms of latency with different numbers of clients (a) and servers (b), and throughput with different numbers of clients (c) and servers (d).]
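The two metrics in this evaluation can be computed from logged timestamps. The sketch below assumes each message is recorded as a (sent_at, received_at) pair in seconds; the function name and the sample numbers are made up and do not reproduce any result from the cited study.

```python
from typing import Dict, List, Tuple


def summarize(events: List[Tuple[float, float]]) -> Dict[str, float]:
    """Compute latency and throughput from (sent_at, received_at) pairs."""
    latencies = [received - sent for sent, received in events]
    # Measurement span: from the first send to the last receive.
    duration = max(received for _, received in events) - min(sent for sent, _ in events)
    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "max_latency_s": max(latencies),
        "throughput_msg_per_s": len(events) / duration,
    }


log = [(0.0, 0.25), (0.5, 0.75), (1.0, 1.5), (1.5, 2.0)]
metrics = summarize(log)
```

Repeating such a measurement while varying the number of clients or servers gives curves of the kind described in the figure caption.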