Presented By: Anuj & Jashan
Let’s get to know
Streaming
- A developer’s point of view
Our Agenda
01 Streaming: What, Why,
Benefits
02 Different Architectures
03 Challenges in Stream Processing
04 Types of Stream Processing: Stateless &
Stateful
05 Stateful Stream Processing:
Elaborated
What is Data Streaming?
● A continuous flow of data is called data streaming.
● Example: the surge in IoT devices has caused far more data to be gathered.
● Data gathered in real time can be processed to get real-time results.
● Stream processing is the practice of taking action on a series of data at the time the data is created.
Why Data Streaming?
● Provides insights faster.
● Handles a never-ending stream of events.
● Makes it easy to inspect data from multiple streams simultaneously.
● Stream processing can work with far less hardware than batch processing.
● Processing engines can be designed with infinite data sets in mind.
Streaming: Benefits
● Far less hardware required.
● Real-time fraud and anomaly detection.
● Internet of Things (IoT) real-time analytics.
● Real-time personalization, marketing, and
advertising.
Evolution of Stream Processing
● Beginning: Fortran / C. Simple processing started.
● 1970: SQL and RDBMS. Databases invented.
● Early 2000s: Batch processing. Bulk processing and Big Data, like MapReduce.
● 2015: Streaming or micro-batching. Stream processing started showing promise.
● Current: Streaming SQL. Unified processing.
Streaming or Micro-Batching?
Different Architectures
Different Architectures: At a Glance
Lambda vs. Kappa
Lambda Architecture
Batch Processing + Streaming Power
Kappa Architecture
Pure Streaming
Challenges
Stream Processing Challenges
1. Stream Joins: joining data from two streams
2. Aggregations: aggregation operations for SQL
3. Late Data: data received at a later time than the actual event time
4. Deduplication: removing duplicate data in a stream
5. Fast Incoming Data: stream processor not up to speed
6. Fault tolerance
Solutions in Streaming
1. Stream Joins: managing state in streaming, watermarking
2. Aggregations: apply grouping (window) and watermarking
3. Late Data: managing state in streaming, window, watermarking
4. Deduplication: managing state in streaming with watermarking
5. Fast Incoming Data: backpressure
6. Fault tolerance: checkpointing
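As a rough illustration of backpressure, the sketch below uses a bounded in-memory buffer: once the buffer is full, further events are refused, which is the signal for the source to slow down or retry. The buffer size and the `accepted`/`rejected` lists are assumptions made for this example; real engines implement backpressure in their own ways.

```python
import queue

# Bounded buffer between a fast source and a slower processor (size is an assumption).
buffer = queue.Queue(maxsize=3)

accepted, rejected = [], []
for event in range(5):
    try:
        buffer.put_nowait(event)   # succeeds only while there is free capacity
        accepted.append(event)
    except queue.Full:
        rejected.append(event)     # backpressure signal: the source must slow down or retry
```

Here the first three events fill the buffer and the remaining two are pushed back to the source instead of overwhelming the processor.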
Late Data in Streaming
Types of Stream
Processing
Types of Stream Processing
Stateless & Stateful
Stateless Stream Processing
What
The straightforward kind of streaming, where no state needs to be maintained.
Where
Where an operation is performed on each individual message/event, such as filter or select.
When
When the result does not depend on previous events.
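A minimal sketch of stateless processing in plain Python, not tied to any streaming engine: each event is filtered and projected on its own, so nothing is remembered between events. The event fields (`user`, `action`, `amount`) are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def stateless_pipeline(events: Iterable[Dict]) -> Iterator[Dict]:
    """Filter and project each event independently; no state survives between events."""
    for event in events:
        if event["action"] == "purchase":                             # filter
            yield {"user": event["user"], "amount": event["amount"]}  # select


events = [
    {"user": "a", "action": "view", "amount": 0},
    {"user": "b", "action": "purchase", "amount": 30},
    {"user": "a", "action": "purchase", "amount": 12},
]
purchases = list(stateless_pipeline(events))
```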
Stateful Stream Processing
What
Streaming that maintains state in order to perform aggregations, deduplication, and joins.
Where
Where operations such as groupBy or count are performed.
When
When the result depends on previous events.
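By contrast, a stateful operation has to remember something about earlier events. The sketch below keeps a running count per key in a dictionary; the function name and the sample keys are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def running_count(keys: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit the updated count for a key every time that key is seen."""
    counts: Dict[str, int] = defaultdict(int)  # the state carried across events
    for key in keys:
        counts[key] += 1
        yield key, counts[key]


updates = list(running_count(["a", "b", "a", "a"]))
```

The output for the last `"a"` depends on the two earlier `"a"` events, which is what makes the operation stateful.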
Stateful Stream
Processing: Elaborated
Stateful Streaming: Aggregation
● Aggregation by key only
● Aggregation by event time windows
● Aggregation by both
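The three grouping options can be sketched with one small helper, assuming events are `(event_time_seconds, key, value)` tuples and windows are fixed-length. Both the tuple layout and the `aggregate` helper are assumptions for illustration.

```python
from collections import defaultdict


def aggregate(events, by_key=True, window_seconds=None):
    """Sum values grouped by key, by event-time window, or by both."""
    totals = defaultdict(int)
    for event_time, key, value in events:
        group = []
        if by_key:
            group.append(key)
        if window_seconds is not None:
            # Start of the fixed-length window this event time falls into.
            group.append(event_time - event_time % window_seconds)
        totals[tuple(group)] += value
    return dict(totals)


events = [(5, "a", 1), (15, "a", 2), (17, "b", 4), (25, "a", 8)]
by_key = aggregate(events)
by_window = aggregate(events, by_key=False, window_seconds=10)
by_both = aggregate(events, window_seconds=10)
```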
Windowing
01 Windowing by processing time vs event time: both kinds of window below can be defined with respect to either processing time or event time.
02 Fixed window (tumbling window): the simplest and most straightforward window.
03 Sliding window (hopping window): slightly more complex than the fixed window; it involves two settings, the window length and the slide (step).
Event Time & Processing Time
Windowing by Processing Time vs Event Time
Processing Time Window
● A processing time window is based on the wall-clock time at which events are processed.
● Late events are placed in the current window.
● Out-of-order events are not reordered.
Event Time Window
● An event time window is based on the time at which each event was produced.
● Each event is kept in the window it belongs to.
● Out-of-order events are reordered.
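The difference shows up as soon as one event arrives late. In this sketch, events are hypothetical `(event_time, arrival_time, payload)` tuples and the window size of 10 is arbitrary; the late event `e3-late` lands in a different window depending on which timestamp is used.

```python
from collections import defaultdict

WINDOW = 10  # window length in seconds (arbitrary for the example)

# (event_time, arrival_time, payload); "e3-late" was produced at t=8 but arrived at t=14.
events = [(1, 2, "e1"), (12, 13, "e2"), (8, 14, "e3-late"), (21, 22, "e4")]


def window_start(timestamp, size=WINDOW):
    """Start of the fixed-length window containing the timestamp."""
    return timestamp - timestamp % size


by_processing_time = defaultdict(list)
by_event_time = defaultdict(list)
for event_time, arrival_time, payload in events:
    by_processing_time[window_start(arrival_time)].append(payload)
    by_event_time[window_start(event_time)].append(payload)
```

With processing time, `e3-late` is counted in the window that was current when it arrived; with event time, it goes back into the window it belongs to.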
Fixed Window (Tumbling Window)
Fixed/tumbling: time is partitioned into same-length, non-overlapping chunks. Each event belongs to exactly one window.
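As a small sketch, a tumbling window can be computed directly from a timestamp, assuming windows are half-open intervals `[start, end)` aligned to multiples of the window size. The alignment is an assumption; engines may allow an offset.

```python
def tumbling_window(timestamp, size):
    """Return the single [start, end) window of length `size` that contains the timestamp."""
    start = timestamp - timestamp % size
    return (start, start + size)


first = tumbling_window(9, 10)      # still inside the first window
boundary = tumbling_window(10, 10)  # a boundary timestamp opens the next window
```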
Sliding Window (Hopping Window)
Sliding: windows have a fixed length, and consecutive windows start a fixed time interval (the step) apart, which can be smaller than the window length. Typically the window length is a multiple of the step. Each event then belongs to several windows ([window length] / [window step]).
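The window count can be checked with a short sketch, assuming half-open windows `[start, end)` whose start times are multiples of the step. The helper name is hypothetical.

```python
def sliding_windows(timestamp, length, step):
    """Return every [start, end) window of `length`, starting at a multiple of `step`,
    that contains the timestamp."""
    windows = []
    start = timestamp - timestamp % step   # latest window start at or before the timestamp
    while start > timestamp - length:      # earlier windows still cover the timestamp
        windows.append((start, start + length))
        start -= step
    return sorted(windows)


windows = sliding_windows(12, length=10, step=5)
```

With a length of 10 and a step of 5, the event at `t=12` falls into 10 / 5 = 2 windows.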
Late Data in Streaming
Watermarking
● Data newer than the watermark may be late, but is still allowed into the aggregation.
● Windows older than the watermark are automatically deleted to limit the amount of intermediate state.
Handle more late data -> keep more state
Reduce the state -> handle less lateness
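A simplified sketch of the idea, assuming the watermark is the maximum event time seen so far minus an allowed lateness, and that it is updated after every event. Real engines typically update it per batch or per trigger, and the class and attribute names here are hypothetical.

```python
from collections import defaultdict


class WatermarkedWindowCounter:
    """Counts events per tumbling event-time window and evicts windows behind the watermark."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")
        self.open_windows = defaultdict(int)  # window start -> count (intermediate state)
        self.finalized = {}                   # windows evicted from state, with final counts
        self.dropped = []                     # events that arrived too late

    @property
    def watermark(self):
        return self.max_event_time - self.allowed_lateness

    def process(self, event_time):
        start = event_time - event_time % self.window_size
        if start + self.window_size <= self.watermark:
            self.dropped.append(event_time)   # its window is already behind the watermark
            return
        self.open_windows[start] += 1         # late but newer than the watermark: still counted
        self.max_event_time = max(self.max_event_time, event_time)
        for window_start in sorted(self.open_windows):
            if window_start + self.window_size <= self.watermark:
                self.finalized[window_start] = self.open_windows.pop(window_start)


counter = WatermarkedWindowCounter(window_size=10, allowed_lateness=5)
for event_time in [3, 12, 8, 27, 9]:
    counter.process(event_time)
```

The event at `t=8` arrives after `t=12` but is still counted, because the watermark is only 7 at that point. After `t=27` the watermark moves to 22, the first two windows are finalized and removed from state, and the event at `t=9` is dropped. A larger `allowed_lateness` would have kept it, at the cost of holding windows in state for longer.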
Internal working of Watermarking
Stateful Streaming: Deduplication
● Drop duplicate records in a stream.
● Specify the columns that uniquely identify records.
● The state stores the unique keys seen in the stream, and any record matching the state is dropped.
Stateful Streaming: Deduplication
● Too large a key set in state for deduplication makes the streaming job unstable.
● Solution: drop the state after a specified period.
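A minimal sketch of deduplication with expiring state, assuming each record carries a unique key and an event time, and that keys older than a retention period (measured against the newest event time seen) are forgotten. The `Deduplicator` class and its `retention` parameter are hypothetical.

```python
class Deduplicator:
    """Drops records whose key is already in state; forgets keys older than `retention`."""

    def __init__(self, retention):
        self.retention = retention
        self.seen = {}                        # key -> event time at which it was first seen
        self.max_event_time = float("-inf")

    def process(self, key, event_time):
        """Return True if the record should be emitted, False if it is dropped."""
        self.max_event_time = max(self.max_event_time, event_time)
        cutoff = self.max_event_time - self.retention
        # Expire old keys so the state cannot grow without bound.
        self.seen = {k: t for k, t in self.seen.items() if t >= cutoff}
        if event_time < cutoff:
            return False                      # too late to be checked against state
        if key in self.seen:
            return False                      # duplicate
        self.seen[key] = event_time
        return True


dedup = Deduplicator(retention=10)
records = [("id1", 1), ("id1", 3), ("id2", 5), ("id3", 20), ("id1", 21)]
emitted = [dedup.process(key, event_time) for key, event_time in records]
```

The last record shows the trade-off: `id1` is emitted a second time because its key had already been expired from state.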
Stateful Streaming: Joins
● Each stream should buffer its events in state so they can be matched against future events from the other stream.
Stateful Streaming: Joins
● Impressions can be 2 hours late
● Clicks can be 3 hours late
● Clicks can occur within 1 hour after the
corresponding impression
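These three constraints bound how long each side has to be buffered. The sketch below encodes the join condition and derives the retention periods under the assumption that an event can be discarded once no event still allowed to arrive on the other stream could match it. The variable and function names are illustrative.

```python
from datetime import datetime, timedelta

impression_lateness = timedelta(hours=2)   # impressions can be 2 hours late
click_lateness = timedelta(hours=3)        # clicks can be 3 hours late
max_click_delay = timedelta(hours=1)       # a click follows its impression within 1 hour


def click_matches(impression_time, click_time):
    """Event-time join condition between an impression and a click."""
    return impression_time <= click_time <= impression_time + max_click_delay


# An impression can match clicks up to 1 hour newer, and those clicks may be 3 hours late.
impression_retention = click_lateness + max_click_delay
# A click only matches impressions that are not newer than it, and those may be 2 hours late.
click_retention = impression_lateness

shown_at = datetime(2024, 1, 1, 12, 0)
matched = click_matches(shown_at, shown_at + timedelta(minutes=30))
too_slow = click_matches(shown_at, shown_at + timedelta(hours=2))
```

Under these assumptions impressions are kept for 4 hours of event time and clicks for 2 hours; anything older can be dropped from state.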
Some Use Cases of Streaming
● Algorithmic trading, stock market surveillance
● Smart patient care
● Monitoring a production line
● Supply chain optimizations
● Intrusion, surveillance, and fraud detection (e.g. Uber)
● Most smart device applications: smart car, smart home, etc.
● Smart grid (e.g. load prediction and outlier plug detection: smart grids, 4 billion events, throughput in the range of 100Ks)
● Traffic monitoring, geofencing, vehicle and wildlife tracking (e.g. TfL London transport management system)
● Sports analytics: augmenting sports with real-time analytics (e.g. overlaying real-time analytics on football broadcasts)
● Context-aware promotions and advertising
● Computer system and network monitoring
● Predictive maintenance (e.g. machine learning techniques for predictive maintenance)
● Geospatial data processing
Thank You!
Get in touch with us:
anuj.saxena@knoldus.com || anuj1207 || anuj-saxena
jashan.goyal@knoldus.com || jashangoyal09 || jashan-goyal

Let's get to know the Data Streaming
