© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Streaming analytics better than batch - when and why?
_A. Kawa - D. Wysakowicz - K. Zarzycki_
Have you ever built cool Big Data pipelines?
Example Use-Case
■ Can be done in batch and real-time
■ User session analytics at Spotify
● Simple stats
■ Duration, number of songs, skips, searches etc.
● Advanced analytics
■ Mood, physical activity, real-time content, ads
Example Output
1. Dashboards: How long do users listen to a new edition of Discover Weekly?
2. Alerts: Australian users are listening to Discover Weekly for too short a time!
3. Content: Recommend songs and ads based on current activity.
1st - Batch Architecture
[Diagram: User Events → User Sessions; intervals labeled 1h, with one interval labeled 1h - 1d (1d in the second build of the slide)]
The More Moving Parts …
⬇ The higher learning curve
⬇ The more glue code
⬇ The larger administrative effort
⬇ The more error-prone solution
Long Waiting Time
Image source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009 and
http://www.slideshare.net/JoshBaer/shortening-the-feedback-loop-big-data-spain-external
2nd - Micro-Batch Architecture
1m - 1h
No Built-In Session Windows
[Diagram: user events (♪) over time, cut into fixed windows [10:00 - 11:00) and [11:00 - 12:00)]
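The effect of fixed hourly windows can be illustrated with a minimal sketch in plain Scala, with no streaming engine involved. `Play`, `hourBucket` and `countPerHour` are hypothetical names introduced here for illustration; they do not come from the talk.

```scala
object HourlyBuckets {
  // A listening event: who played something, and when (minutes since midnight).
  final case class Play(userId: String, minuteOfDay: Int)

  // Index of the fixed one-hour window [h:00, h+1:00) that an event falls into.
  def hourBucket(p: Play): Int = p.minuteOfDay / 60

  // Count events per (user, hour), the way a job with fixed hourly windows would.
  def countPerHour(plays: Seq[Play]): Map[(String, Int), Int] =
    plays.groupBy(p => (p.userId, hourBucket(p))).map { case (k, v) => k -> v.size }
}
```

For one user playing songs at 10:50, 10:55, 11:02 and 11:05, `countPerHour` yields two entries of 2 events each (hours 10 and 11), although the user listened without a break. The session is cut at the window boundary.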
Late Data …
[Diagram: events (♪) with event time 14:55 - 16:35; axes: Event Time, Processing Time]
... Included in Current Batch
[Diagram: events (♪) in event-time ranges 14:55 - 16:35 and 16:50 - …; axes: Event Time, Processing Time]
Out-Of-Order Data …
[Diagram: events (♪ ♫) arriving out of order; axes: Event Time, Processing Time]
... Breaks Correctness
[Diagram: events (♪ ♫ ♬) arriving out of order; axes: Event Time, Processing Time]
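Why out-of-order arrival breaks correctness can be shown with a small sketch in plain Scala. The `Event` type and the `countsBy` helper are assumptions made for illustration: the same events are counted per hour once by arrival (processing) time and once by event time.

```scala
object EventVsProcessingTime {
  // eventMinute: when the user acted; arrivalMinute: when the pipeline saw the event.
  // Both are minutes since midnight.
  final case class Event(eventMinute: Int, arrivalMinute: Int)

  // Count events per hour, using the given notion of time to pick the hour.
  def countsBy(events: Seq[Event])(time: Event => Int): Map[Int, Int] =
    events.groupBy(e => time(e) / 60).map { case (hour, es) => hour -> es.size }
}
```

Take three events with (event, arrival) times (10:55, 10:56), (10:58, 11:03) and (11:01, 11:02). Grouping by arrival time gives 1 event for hour 10 and 2 for hour 11. Grouping by event time gives 2 for hour 10 and 1 for hour 11, which is what actually happened.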
Problems
FILES,
BATCHES,
DATA LAKES
Solving Streaming Problem With Batch?
3rd - Streaming-First Architecture
User Session Windows
[Diagram: events (♪) of User 1 and User 3 grouped into sessions separated by a session gap, e.g. 15 minutes; labels: 5, [3,2]]
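The session-gap idea can be sketched in plain Scala without any streaming engine. This is a minimal illustration, not the engine's implementation: `Event`, `Session` and `sessionize` are hypothetical names, all events are assumed to be available at once, and events exactly `gapMs` apart are assumed to stay in the same session.

```scala
object SessionWindows {
  final case class Event(userId: String, timestampMs: Long)
  final case class Session(userId: String, startMs: Long, endMs: Long, events: Int)

  // Split each user's events into sessions: a new session starts whenever the
  // pause since the previous event of that user is longer than gapMs.
  def sessionize(events: Seq[Event], gapMs: Long): Seq[Session] =
    events
      .groupBy(_.userId)
      .toSeq
      .flatMap { case (user, es) =>
        es.map(_.timestampMs).sorted.foldLeft(List.empty[Session]) {
          // Event is close enough to the latest session: extend that session.
          case (current :: done, ts) if ts - current.endMs <= gapMs =>
            current.copy(endMs = ts, events = current.events + 1) :: done
          // Otherwise open a new session that starts and ends at this event.
          case (sessions, ts) =>
            Session(user, ts, ts, 1) :: sessions
        }.reverse
      }
      .sortBy(s => (s.userId, s.startMs))
}
```

With a 15-minute gap, a user with events at minutes 0, 5, 10, 40 and 45 gets two sessions of 3 and 2 events, because the 30-minute pause between minute 10 and minute 40 exceeds the gap.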
Reading From Kafka
val sessionStream: DataStream[SessionStats] = sEnv
  // read user events from Kafka
  .addSource(new KafkaConsumer(...))
[Diagram: stream of events (♪) read from Kafka]
Session Windows With Gap
val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))
  // partition the stream by user
  .keyBy(_.userId)
[Diagram: events (♪) split into per-user streams: User 1, User 2]
Session Windows With Gap
val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))
  .keyBy(_.userId)
  // a session ends after 15 minutes (in event time) without events
  .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
[Diagram: User 1 events (♪) grouped into sessions separated by a 15-minute session gap]
Analyzing User Session
val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))
  .keyBy(_.userId)
  .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
  // compute the statistics for the events of each session window
  .apply(new CountSessionStats())
Handling Late Events
val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))
  .keyBy(_.userId)
  .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
  // keep window state for 60 more minutes so late events still update the result
  .allowedLateness(Time.minutes(60))
  .apply(new CountSessionStats())
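The effect of `allowedLateness` above can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not Flink's implementation: a window fires once the watermark reaches its end, its state is kept for the allowed lateness, and events arriving after that are dropped. Millisecond boundary details are ignored, and `Decision` and `classify` are hypothetical names.

```scala
object Lateness {
  sealed trait Decision
  case object OnTime extends Decision     // window has not fired yet
  case object LateUpdate extends Decision // window fired, state still kept: result is updated
  case object Dropped extends Decision    // state already discarded, event is ignored

  // Decide what happens to an event that belongs to a window ending at windowEndMs
  // when it arrives while the watermark is at watermarkMs.
  def classify(windowEndMs: Long, watermarkMs: Long, allowedLatenessMs: Long): Decision =
    if (watermarkMs < windowEndMs) OnTime
    else if (watermarkMs < windowEndMs + allowedLatenessMs) LateUpdate
    else Dropped
}
```

With 60 minutes of allowed lateness, an event for a window that ended at minute 100 is on time while the watermark is at minute 90. It triggers an updated result at minute 130 and is dropped at minute 160.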
Triggering Early Results
val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))
  .keyBy(_.userId)
  .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
  // emit preliminary results every 10 minutes while the session is still open
  .trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))
  .allowedLateness(Time.minutes(60))
  .apply(new CountSessionStats())
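The slide shows `EarlyTriggeringTrigger` without its implementation, so the sketch below only models the general idea of early firing in plain Scala. It assumes a window emits a partial result at every interval while it is open and a final result when it closes; `firingTimes` is a hypothetical helper.

```scala
object EarlyFirings {
  // Times at which a window opened at startMs and closing at endMs would emit:
  // a partial result every intervalMs, then the final result at endMs.
  def firingTimes(startMs: Long, endMs: Long, intervalMs: Long): Seq[Long] = {
    require(intervalMs > 0, "interval must be positive")
    val early = Iterator
      .iterate(startMs + intervalMs)(_ + intervalMs)
      .takeWhile(_ < endMs)
      .toList
    early :+ endMs
  }
}
```

A window spanning 0 to 35 with an interval of 10 would emit at 10, 20 and 30 (early results) and at 35 (final result), so a dashboard need not wait for the session to end.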
Sessionization Example
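val sessionStream: DataStream[SessionStats] = sEnv
  // read user events from Kafka
  .addSource(new KafkaConsumer(...))
  // partition the stream by user
  .keyBy(_.userId)
  // a session ends after 15 minutes (in event time) without events
  .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
  // emit preliminary results every 10 minutes while the session is still open
  .trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))
  // keep window state for 60 more minutes so late events still update the result
  .allowedLateness(Time.minutes(60))
  // compute the statistics for the events of each session window
  .apply(new CountSessionStats())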
Working example:
https://github.com/getindata/flink-use-case
Modern Stream Processing Engines
■ Rich stream processing semantics
● Built-in support for event-time windows
● Accurate results for late / out-of-order events and replays
● Early triggers
■ Low latency and high throughput
■ Exactly-once stateful processing
User survey:
http://data-artisans.com/flink-user-survey-2016-part-1
http://data-artisans.com/flink-user-survey-2016-part-2
How can I reprocess data?
Reprocessing Events In Flink
1. Take periodic snapshots (savepoints) of a job
● A snapshot stores Kafka offsets, in-flight sessions and application state
2. Restart the job from a savepoint rather than from the beginning
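The two steps above can be sketched with a toy model in plain Scala. It assumes a snapshot holds just an input offset and a running per-user count; real Flink savepoints contain much more. `Snapshot` and `run` are hypothetical names.

```scala
object Savepoints {
  // What a snapshot captures in this toy model: the input position and the running state.
  final case class Snapshot(offset: Int, playsPerUser: Map[String, Int])

  val empty: Snapshot = Snapshot(0, Map.empty)

  // Resume from `from`: consume the log from the stored offset up to (excluding) upTo,
  // updating the state and the offset together.
  def run(log: IndexedSeq[String], from: Snapshot, upTo: Int): Snapshot =
    log.slice(from.offset, upTo).foldLeft(from) { (s, user) =>
      Snapshot(s.offset + 1, s.playsPerUser.updated(user, s.playsPerUser.getOrElse(user, 0) + 1))
    }
}
```

Because the offset and the state are stored together, resuming from a snapshot taken after three records and processing the rest gives the same result as processing the whole log from the start. Only the records after the snapshot are read again.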
What if data is no longer in Kafka?
Consuming Data From HDFS
■ Run your streaming code on HDFS (bounded data)
● You need to read data in event-time based order
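A minimal sketch of the ordering requirement in plain Scala. It assumes the bounded data fits in memory, whereas a real job would rely on the engine or a pre-sorting step at scale; `Record` and `inEventTimeOrder` are hypothetical names.

```scala
object BoundedReplay {
  final case class Record(userId: String, eventTimeMs: Long)

  // Merge records read from several files (each in arbitrary order) into one
  // event-time-ordered sequence that the streaming logic can consume.
  def inEventTimeOrder(files: Seq[Seq[Record]]): Seq[Record] =
    files.flatten.sortBy(_.eventTimeMs)
}
```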
How to join with other data sets/streams?
Join With Other Datasets / Streams
■ Flink can join windowed streams easily
■ Join of data stream with data set is WIP
● Even with slowly changing data set!
● Even keyed data
[Diagram: Stream 1 + Stream 2 → Joined Stream; Input Stream + Dataset → Joined Stream]
Dataset:
Id  Name
1   John Doe
2   Jane Doe
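Joining a stream with a small data set can be sketched in plain Scala as a keyed lookup. This is an illustration under assumptions, not Flink's join API: the data set is held as an in-memory `Map`, and `Play`, `NamedPlay` and `enrich` are hypothetical names.

```scala
object Enrichment {
  final case class Play(userId: Int, song: String)
  final case class NamedPlay(userName: String, song: String)

  // Join each stream element with the data set by key.
  // Elements without a matching key are dropped, as in an inner join.
  def enrich(plays: Iterator[Play], users: Map[Int, String]): Iterator[NamedPlay] =
    plays.flatMap(p => users.get(p.userId).map(name => NamedPlay(name, p.song)))
}
```

With the data set from the slide (1 → John Doe, 2 → Jane Doe), plays by users 1 and 2 are emitted with their names attached, while a play by an unknown user 3 produces no output.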
When is batch processing good?
Batch Processing Use-Cases
■ Ad-hoc analytics and data exploration
● Notebooks, Spark/Flink/Hive, Parquet, complete data sets
■ Technical advantages
● Large swaths of historical data in HDFS
● High-level libraries in mature batch technologies
Batch Processing Use-Cases
■ Ad-hoc analytics and data exploration
● Notebooks, Spark/Flink/Hive, Parquet, complete data sets
■ Implementation advantages
● Offline experiments over large historical data
■ Historical events are usually stored in HDFS, not Kafka
● High-level libraries in batch processing technologies
■ Spark MLlib, H2O
Don’t solve a streaming problem with batch jobs (when data arrives continuously)
I like this streaming API.
Can I use it for batch?
Unified batch and streaming API
■ Not with raw Flink API
■ But with Flink Table API
■ Apache Beam
Who Are You, actually?
■ At GetInData, we build custom Big Data solutions
● Hadoop, Flink, Spark, Kafka and more
■ Our team is today represented by Krzysztof Zarzycki, Dawid Wysakowicz and Adam Kawa
Summary
■ A stream is often the natural representation of your data
■ Stream processing is not only about low latency
Q&A
Thanks! Big Data Tech Warsaw!
Log Abstraction
[Diagram: time ranges 10:00 - 11:00, 11:00 - 12:00, 12:00 - 13:00, … alongside open-ended ranges 10:00 - …]
Spark Structured Streaming
⬇ It’s still ALPHA and the APIs are still experimental
⬇ Operates on top of micro-batches (Spark SQL engine)
⬆ Easy-to-learn API (Dataset/DataFrame)
⬆ Rich ecosystem of tools and libraries e.g. MLlib
⬆ Supports event-time
⬇ Sessionization not yet supported - SPARK-10816
⬇ Queryable state not yet supported - SPARK-16738
Kafka Streams
⬇ No exactly-once (just at-least-once)
⬇ Kafka as the only data source
⬇ No bounded streams (batch) optimizations
⬆ Simplicity
⬆ Embedded into application
⬆ Supports event-time
⬇ Lack of session windows
Apache Beam
⬆ Unified API for batch and streaming
⬆ Rich streaming processing semantics
⬆ Complex TriggerDSL
⬆ Multiple runtime environments
⬆ Spark, Flink, Apex, Dataflow
⬆ Side inputs and outputs
⬇ Verbose Java API
⬇ New project - Top level since 01/2017
Google Dataflow
■ Runtime environment for Apache Beam in Google Cloud
⬇ No support for Iterative Computations
⬆ Supports Side Outputs
⬆ Works with every Google Cloud Service (Pub/Sub, BigTable etc.)

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 

Streaming analytics better than batch when and why - (Big Data Tech 2017)

  • 1. Streaming analytics better than batch - when and why? _A. Kawa - D. Wysakowicz - K. Zarzycki_
  • 2. Have you ever built cool Big Data pipelines?
  • 3. (no text)
  • 4. Example Use-Case
       ■ Can be done in batch and real-time
       ■ User session analytics at Spotify
         ● Simple stats
           ■ Duration, number of songs, skips, searches etc.
         ● Advanced analytics
           ■ Mood, physical activity, real-time content, ads
  • 5. Example Output
       _1. Dashboards_ How long do users listen to a new edition of Discover Weekly?
  • 6. Example Output
       _1. Dashboards_ How long do users listen to a new edition of Discover Weekly?
       _2. Alerts_ Australian users listen to Discover Weekly for too short a time!
  • 7. Example Output
       _1. Dashboards_ How long do users listen to a new edition of Discover Weekly?
       _2. Alerts_ Australian users listen to Discover Weekly for too short a time!
       _3. Content_ Recommend songs and ads based on current activity.
  • 8. 1st - Batch Architecture (diagram: pipeline from User Events to User Sessions; stage delays labeled 1h, 1h, 1h, 1h - 1d, 1h)
  • 9. 1st - Batch Architecture (diagram: pipeline from User Events to User Sessions; stage delays labeled 1h, 1h, 1h, 1d, 1h)
  • 10. The More Moving Parts …
       ⬇ The higher learning curve
       ⬇ The more gluing code
       ⬇ The larger administrative effort
       ⬇ The more error-prone solution
  • 11. Long Waiting Time. Image source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009 and http://www.slideshare.net/JoshBaer/shortening-the-feedback-loop-big-data-spain-external
  • 12. 2nd - Micro-Batch Architecture (delay label: 1m - 1h)
  • 13. No Built-In Session Windows (diagram: a row of ♪ events spanning the batches [10:00 - 11:00) and [11:00 - 12:00))
  • 14. No Built-In Session Windows (same diagram as slide 13)
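
The point of slides 13 and 14 can be sketched in plain Scala without any streaming engine. This is a minimal illustration, not code from the talk: the timestamps, the 15-minute gap, and the names `BatchSplitDemo` and `countSessions` are all assumptions. It shows one continuous listening session being counted twice when each hourly batch is sessionized in isolation.

```scala
object BatchSplitDemo {
  // Hypothetical session gap, in minutes.
  val gapMinutes = 15

  // Count sessions in a sorted list of timestamps: a new session starts
  // whenever the pause since the previous event is longer than the gap.
  def countSessions(sortedTs: Seq[Int]): Int =
    if (sortedTs.isEmpty) 0
    else 1 + sortedTs.zip(sortedTs.tail).count { case (a, b) => b - a > gapMinutes }

  // Hypothetical data: one user listening from 10:40 to 11:20,
  // one event every 10 minutes (minutes since midnight).
  val events: Seq[Int] = Seq(640, 650, 660, 670, 680)

  // Sessionizing the whole stream at once.
  val streamingSessions: Int = countSessions(events)

  // Sessionizing each hourly batch on its own, then adding up the results.
  val batchSessions: Int =
    events.groupBy(_ / 60).values.map(ts => countSessions(ts.sorted)).sum
}
```

With this data the session crosses the 11:00 boundary, so the per-batch count is higher than the count over the whole stream. A batch job can repair this by stitching sessions across batches, which is the kind of extra glue code slide 10 warns about.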
  • 15. Late Data … (diagram: ♪ events plotted against Event Time and Processing Time; interval labeled 14:55 - 16:35)
  • 16. ... Included in Current Batch (diagram: ♪ events plotted against Event Time and Processing Time; intervals labeled 14:55 - 16:35 and 16:50 - …)
  • 17. Out-Of-Order Data … (diagram: ♪ ♫ events plotted against Event Time and Processing Time)
  • 18. Out-Of-Order Data … (diagram: more ♪ ♫ events plotted against Event Time and Processing Time)
  • 19. Out-Of-Order Data … (same diagram as slide 18)
  • 20. ... Breaks Correctness (diagram: ♪ ♫ ♬ events plotted against Event Time and Processing Time)
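
Slides 15 to 20 contrast event time with processing time. The following is a small sketch of why the choice matters, using made-up data: the `Play` type, the timestamps, and `countPerHour` are assumptions for illustration only. The same three events are counted per hourly window twice, once by when they happened and once by when they arrived.

```scala
object EventVsProcessingTime {
  // Hypothetical event: both timestamps are minutes since midnight.
  final case class Play(user: String, eventTime: Int, arrivalTime: Int)

  // Count plays per hourly window, keyed by the hour of the chosen timestamp.
  def countPerHour(plays: Seq[Play], ts: Play => Int): Map[Int, Int] =
    plays.groupBy(p => ts(p) / 60).map { case (hour, ps) => hour -> ps.size }

  val plays: Seq[Play] = Seq(
    Play("u1", 14 * 60 + 55, 14 * 60 + 56),
    Play("u1", 14 * 60 + 58, 16 * 60 + 50), // played at 14:58, delivered at 16:50
    Play("u1", 15 * 60 + 5, 15 * 60 + 6)
  )

  // Windows assigned by when the play happened.
  val byEventTime: Map[Int, Int] = countPerHour(plays, _.eventTime)

  // Windows assigned by when the play reached the pipeline.
  val byProcessingTime: Map[Int, Int] = countPerHour(plays, _.arrivalTime)
}
```

Grouping by arrival time moves the delayed play from the 14:00 window into the 16:00 window, so both windows report a count that does not match what the user actually did.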
  • 21. Problems FILES, BATCHES, DATA LAKES
  • 22. Solving Streaming Problem With Batch?
  • 23. 3rd - Streaming-First Architecture
  • 24. User Session Windows (diagram: ♪ events for User 1 and User 3 grouped into sessions; session gap e.g. 15 minutes; label: 5)
  • 25. User Session Windows (diagram: ♪ events for User 1 and User 3 grouped into sessions; session gap e.g. 15 minutes; labels: 5 and [3,2])
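
A gap-based session window, as drawn on slides 24 and 25, can be sketched in a few lines of plain Scala. This is an illustrative model rather than Flink's implementation: the `Event` and `Session` types, the sample timestamps, and the reading of the slide labels as "events per session" (5 for User 1, [3,2] for User 3) are assumptions.

```scala
object SessionWindows {
  // Hypothetical types: minute is the event time in minutes.
  final case class Event(userId: String, minute: Int)
  final case class Session(userId: String, start: Int, end: Int, events: Int)

  // Group events per user, sort them by event time, and start a new session
  // whenever the pause since the previous event is longer than the gap.
  def sessionize(events: Seq[Event], gap: Int): Seq[Session] =
    events.groupBy(_.userId).toSeq.sortBy(_._1).flatMap { case (user, evs) =>
      val sorted = evs.map(_.minute).sorted
      sorted.foldLeft(List.empty[Session]) {
        case (current :: done, t) if t - current.end <= gap =>
          // Still inside the gap: extend the current session.
          current.copy(end = t, events = current.events + 1) :: done
        case (acc, t) =>
          // Gap exceeded (or first event): open a new session.
          Session(user, t, t, 1) :: acc
      }.reverse
    }

  // Sample data: user1 listens without a long pause, user3 pauses for 30 minutes.
  val sample: Seq[Event] =
    Seq(0, 5, 10, 15, 20).map(Event("user1", _)) ++
      Seq(0, 10, 20, 50, 55).map(Event("user3", _))
}
```

Calling `SessionWindows.sessionize(SessionWindows.sample, 15)` yields one session of five events for `user1` and two sessions (three events, then two) for `user3`. Because the events are sorted by event time before grouping, arrival order does not affect the result.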
  • 26. Reading From Kafka
       val sessionStream: DataStream[SessionStats] = sEnv
         .addSource(new KafkaConsumer(...))
  • 27. Session Windows With Gap (diagram: ♪ events split into User 1 and User 2)
       val sessionStream: DataStream[SessionStats] = sEnv
         .addSource(new KafkaConsumer(...))
         .keyBy(_.userId)
  • 28. Session Windows With Gap (diagram: User 1 events; session gap - 15 minutes)
       val sessionStream: DataStream[SessionStats] = sEnv
         .addSource(new KafkaConsumer(...))
         .keyBy(_.userId)
         .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
  • 29. Analyzing User Session
       val sessionStream: DataStream[SessionStats] = sEnv
         .addSource(new KafkaConsumer(...))
         .keyBy(_.userId)
         .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
         .apply(new CountSessionStats())
  • 30. Handling Late Events
       val sessionStream: DataStream[SessionStats] = sEnv
         .addSource(new KafkaConsumer(...))
         .keyBy(_.userId)
         .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
         .allowedLateness(Time.minutes(60))
         .apply(new CountSessionStats())
  • 31. Triggering Early Results
       val sessionStream: DataStream[SessionStats] = sEnv
         .addSource(new KafkaConsumer(...))
         .keyBy(_.userId)
         .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
         .trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))
         .allowedLateness(Time.minutes(60))
         .apply(new CountSessionStats())
  • 32. Sessionization Example
       val sessionStream: DataStream[SessionStats] = sEnv
         .addSource(new KafkaConsumer(...))
         .keyBy(_.userId)
         .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
         .trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))
         .allowedLateness(Time.minutes(60))
         .apply(new CountSessionStats())
       Working example: https://github.com/getindata/flink-use-case
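
To make the `allowedLateness` setting on slides 30 to 32 more concrete, here is a simplified model of how an event-time engine might treat an element that belongs to a given window, depending on how far the watermark has advanced. It is a sketch under assumptions, not Flink code: `classify`, the `Outcome` names, and the exact boundary handling are hypothetical, and real engines differ in the details.

```scala
object LatenessDemo {
  sealed trait Outcome
  case object OnTime extends Outcome     // window has not fired yet
  case object LateFiring extends Outcome // window already fired; its result is updated
  case object Dropped extends Outcome    // allowed lateness has expired

  // All arguments share one time unit (minutes in the usage below).
  // The watermark is the engine's estimate of how far event time has progressed.
  def classify(windowEnd: Long, watermark: Long, allowedLateness: Long): Outcome =
    if (watermark < windowEnd) OnTime
    else if (watermark < windowEnd + allowedLateness) LateFiring
    else Dropped
}
```

For a session window ending at minute 1010 with 60 minutes of allowed lateness, an element arriving while the watermark is at 1000 is on time, at 1040 it causes an updated result, and at 1070 it is discarded. An early trigger is the mirror image: it emits partial results while the watermark is still below `windowEnd`.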
  • 33. Modern Stream Processing Engines
       ■ Rich stream processing semantics
         ● Built-in support for event-time windows
         ● Accurate results for late / out-of-order events and replays
         ● Early triggers
       ■ Low latency and high throughput
       ■ Exactly-once stateful processing
  • 34. Modern Stream Processing Engines
       ■ Rich stream processing semantics
         ● Built-in support for event-time windows
         ● Accurate results for late / out-of-order events and replays
         ● Early triggers
       ■ Low latency and high throughput
       ■ Exactly-once stateful processing
       User survey: http://data-artisans.com/flink-user-survey-2016-part-1 http://data-artisans.com/flink-user-survey-2016-part-2
  • 35. (no text)
  • 36. How can I reprocess data?
  • 37. Reprocessing Events In Flink
       1. Take periodic snapshots of a job
         ● A snapshot stores Kafka offsets, in-flight sessions, application state
       2. Restart a job from a savepoint rather than from the beginning
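
The two steps on slide 37 rely on one idea: a snapshot pairs an input position with the state computed up to that position, so a restarted job can continue from there. Below is a minimal sketch of that idea in plain Scala. The `Snapshot` type, the in-memory `log` standing in for a Kafka partition, and the play-count state are all assumptions for illustration, not Flink's savepoint format.

```scala
object SnapshotDemo {
  // Hypothetical snapshot: input offset plus the application state at that offset.
  final case class Snapshot(offset: Int, playsPerUser: Map[String, Int])

  val empty: Snapshot = Snapshot(0, Map.empty)

  // Process the log from the snapshot's offset to the end of the log
  // and return the resulting snapshot.
  def run(log: Vector[String], from: Snapshot): Snapshot =
    log.drop(from.offset).foldLeft(from) { (snap, user) =>
      Snapshot(
        snap.offset + 1,
        snap.playsPerUser.updated(user, snap.playsPerUser.getOrElse(user, 0) + 1)
      )
    }
}
```

If `log` is `Vector("u1", "u2", "u1", "u3", "u1")`, taking a snapshot after the first three records and then calling `run(log, snapshot)` gives the same final state as processing all five records from `SnapshotDemo.empty`, without reading the first three again.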
  • 38. What if data is no longer in Kafka?
  • 39. Consuming Data From HDFS
       ■ Run your streaming code on HDFS (bounded data)
         ● You need to read data in event-time order
  • 40. How to join with other data sets/streams?
  • 41. Join With Other Datasets / Streams
       ■ Flink can join windowed streams easily
       ■ Join of data stream with data set is WIP
         ● Even with slowly changing data set!
         ● Even keyed data
       (diagram: Stream 1 + Stream 2 → Joined Stream; Input Stream + Dataset → Joined Stream; Dataset rows: Id 1, Name John Doe; Id 2, Name Jane Doe)
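
The stream-with-dataset join on slide 41 amounts to enriching each stream element with a lookup in a keyed table. A minimal sketch follows, using the two rows from the slide as the dataset. The `Play` and `NamedPlay` types and the `enrich` function are hypothetical, and a plain `Iterator` stands in for the stream; this is not Flink API.

```scala
object EnrichmentJoin {
  // Hypothetical stream element and join result.
  final case class Play(userId: Int, song: String)
  final case class NamedPlay(name: String, song: String)

  // The dataset side of the join, with the rows shown on the slide.
  val users: Map[Int, String] = Map(1 -> "John Doe", 2 -> "Jane Doe")

  // Inner join: plays whose userId is missing from the dataset are dropped.
  def enrich(stream: Iterator[Play], dataset: Map[Int, String]): Iterator[NamedPlay] =
    stream.flatMap(p => dataset.get(p.userId).map(name => NamedPlay(name, p.song)))
}
```

A slowly changing dataset complicates this picture, because the lookup table has to be refreshed while the stream keeps flowing and the join must decide which version of a row applies to each event.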
  • 42. When is batch processing good?
  • 43. Batch Processing Use-Cases
       ■ Ad-hoc analytics and data exploration
         ● Notebooks, Spark/Flink/Hive, Parquet, complete data sets
       ■ Technical advantages
         ● Large swaths of historical data in HDFS
         ● High-level libraries in mature batch technologies
  • 44. Batch Processing Use-Cases
       ■ Ad-hoc analytics and data exploration
         ● Notebooks, Spark/Flink/Hive, Parquet, complete data sets
       ■ Implementation advantages
         ● Offline experiments over large historical data
           ■ Historical events are usually stored in HDFS, not Kafka
         ● High-level libraries in batch processing technologies
           ■ Spark MLlib, H2O
       (When data arrives continuously) don’t solve streaming problems with batch jobs
  • 45. I like this streaming API. Can I use it for batch?
  • 46. Unified batch and streaming API
       ■ Not with raw Flink API
       ■ But with Flink Table API
       ■ Apache Beam
  • 47. Who Are You, actually?
       ■ At GetInData, we build custom Big Data solutions
         ● Hadoop, Flink, Spark, Kafka and more
       ■ Our team is today represented by Krzysztof Zarzycki, Dawid Wysakowicz, Adam Kawa
  • 48. Summary
       ■ A stream is often the natural representation of your data
       ■ Stream processing is not only about low latency
  • 49. Q&A
  • 50. Thanks, Big Data Tech Warsaw!
  • 51. (no text)
  • 52. Log Abstraction (diagram labels: 11:00 - 12:00, 12:00 - 13:00, …, …, 10:00 - …, 10:00 - …, 10:00 - 11:00)
  • 53. Spark Structured Streaming
       ⬇ It’s still ALPHA and the APIs are still experimental
       ⬇ Operates on top of micro-batches (Spark SQL engine)
       ⬆ Easy-to-learn API (Dataset/DataFrame)
       ⬆ Rich ecosystem of tools and libraries e.g. MLlib
       ⬆ Supports event-time
       ⬇ Sessionization not yet supported - SPARK-10816
       ⬇ Queryable state not yet supported - SPARK-16738
  • 54. Kafka Streams
       ⬇ No exactly-once (just at-least-once)
       ⬇ Kafka as the only data source
       ⬇ No bounded streams (batch) optimizations
       ⬆ Simplicity
       ⬆ Embedded into application
       ⬆ Supports event-time
       ⬇ Lack of session windows
  • 55. Apache Beam
       ⬆ Unified API for batch and streaming
       ⬆ Rich stream processing semantics
       ⬆ Complex TriggerDSL
       ⬆ Multiple runtime environments: Spark, Flink, Apex, Dataflow
       ⬆ Side inputs and outputs
       ⬇ Verbose Java API
       ⬇ New project - Top level since 01/2017
  • 56. Google Dataflow
       ■ Runtime environment for Apache Beam in Google Cloud
       ⬇ No support for Iterative Computations
       ⬆ Supports Side Outputs
       ⬆ Works with every Google Cloud Service (Pub/Sub, BigTable etc.)