WINDOWING DATA IN BIG DATA STREAMS
ADAM WARSKI, WOLVESSUMMIT
ADAM WARSKI, SOFTWAREMILL, @ADAMWARSKI
BIG DATA? FAST DATA?
▸ What is big data?
▸ Shift of focus
▸ Processing speed
▸ Fast data -> streaming
WHAT IS STREAMING?
“A TYPE OF DATA PROCESSING ENGINE THAT IS DESIGNED WITH INFINITE DATA SETS IN MIND”
Tyler Akidau, Google
WINDOWING
▸ Time becomes the focus point
▸ How many invalid password errors were there in the last 5 minutes?
▸ During which 30-minute window did we get the most traffic?
▸ What’s the average 5-minute speed on a section of a highway throughout the day?
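The first question above can be answered with a small amount of state. The following is a minimal sketch in plain Python, not tied to any of the engines discussed later; the class name `LastNSecondsCounter` and the assumption that timestamps arrive in non-decreasing order are illustrative only.

```python
from collections import deque


class LastNSecondsCounter:
    """Counts events whose timestamp lies in the interval (now - span, now]."""

    def __init__(self, span_seconds: int) -> None:
        self.span = span_seconds
        self.times: deque = deque()

    def add(self, timestamp: int) -> None:
        # Assumption: timestamps are added in non-decreasing order.
        self.times.append(timestamp)

    def count(self, now: int) -> int:
        # Evict everything that has fallen out of the window.
        while self.times and self.times[0] <= now - self.span:
            self.times.popleft()
        return len(self.times)


errors = LastNSecondsCounter(span_seconds=300)  # 5 minutes
for ts in [0, 100, 250, 400]:
    errors.add(ts)
print(errors.count(now=400))  # 2 (only the errors at 250 and 400 remain)
```

Out-of-order events break the ordering assumption, which is where watermarks and lateness come in.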
HOW TO DO STREAMING? WITH WINDOWS?
▸ Many possibilities:
▸ Spark Streaming
▸ Spark Structured Streaming
▸ Kafka Streams
▸ Flink
▸ Akka Streams
▸ …
WHICH ONE TO CHOOSE?
LET’S FIND OUT
/ME
▸ coder @ SoftwareMill
▸ Lightbend, Confluent, Datastax consulting partner
▸ mainly Scala
▸ open-source: MacWire, ElasticMQ, Quicklens, …
▸ http://www.warski.org / @adamwarski
WHAT’S THE TIME?
▸ How to associate time with an event:
▸ event time: “logical”, data-dependent
▸ ingestion time: when the event entered the system
▸ processing time: when the event is being processed
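To make the three notions of time concrete, here is a small sketch in plain Python. The `Event` class, the `ingest` and `process` helpers and the injected `clock` function are hypothetical names used only for illustration; no streaming library is involved.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Event:
    payload: str
    event_time: float                       # "logical" time, carried in the data
    ingestion_time: Optional[float] = None  # stamped once, on entering the system


def ingest(event: Event, clock: Callable[[], float]) -> Event:
    """Stamp the event with the time at which it entered the system."""
    return replace(event, ingestion_time=clock())


def process(event: Event, clock: Callable[[], float]) -> Dict[str, Optional[float]]:
    """Report all three timestamps; processing time is read when the operator runs."""
    return {
        "event": event.event_time,
        "ingestion": event.ingestion_time,
        "processing": clock(),
    }
```

Passing the clock in as a parameter keeps the sketch deterministic: in a test, `clock` can be any function returning fixed values, while a real system would read the wall clock.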
TYPES OF WINDOWS
▸ Time-based
▸ fixed/tumbling
▸ sliding
▸ Session-based
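The window types can be sketched as functions that assign timestamps to windows. This is an illustrative plain-Python version under stated assumptions: integer timestamps, half-open windows `[start, end)`, and a session that stays open until `gap` time units pass without an event. The function names are hypothetical.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end)


def tumbling_window(ts: int, size: int) -> Window:
    """The single fixed-size window containing ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> List[Window]:
    """All windows of length `size`, starting at multiples of `slide`, containing ts."""
    last_start = ts - ts % slide
    # Walk backwards while the window starting at s still covers ts (s + size > ts).
    starts = range(last_start, ts - size, -slide)
    return sorted((s, s + size) for s in starts)


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Group timestamps into sessions separated by at least `gap` of inactivity."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # The event falls inside the open session: extend it.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            sessions.append((ts, ts + gap))
    return sessions


print(tumbling_window(12, size=10))              # (10, 20)
print(sliding_windows(12, size=10, slide=5))     # [(5, 15), (10, 20)]
print(session_windows([1, 3, 20, 22, 40], 5))    # [(1, 8), (20, 27), (40, 45)]
```

Note that an element belongs to exactly one tumbling window, to several sliding windows, and to a session window whose bounds are only known once later elements have been seen.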
OUT-OF-ORDER: WATERMARKS, LATENESS
▸ Windows GC
▸ At some point, enough is enough
▸ Watermark:
▸ all events before X have been observed
▸ heuristics
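One common watermark heuristic is to assume events are at most some fixed delay out of order. The sketch below, in plain Python, counts events per tumbling window and garbage-collects a window once the watermark passes its end. The class name `WatermarkedCounter` and the `max_delay` parameter are assumptions made for illustration, not the API of any particular engine.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class WatermarkedCounter:
    """Counts events per tumbling window; watermark = max event time seen - max_delay."""

    def __init__(self, window_size: int, max_delay: int) -> None:
        self.window_size = window_size
        self.max_delay = max_delay
        self.max_event_time: Optional[int] = None
        self.open_windows: Dict[int, int] = defaultdict(int)  # window start -> count
        self.closed: List[Tuple[int, int]] = []               # (window start, final count)
        self.dropped = 0                                      # events that arrived too late

    def watermark(self) -> Optional[int]:
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_delay

    def on_event(self, event_time: int) -> None:
        start = event_time - event_time % self.window_size
        wm = self.watermark()
        if wm is not None and start + self.window_size <= wm:
            # The window was already closed: "enough is enough".
            self.dropped += 1
            return
        self.open_windows[start] += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        wm = self.watermark()
        # Close (emit and forget) every window that lies entirely below the watermark.
        for s in sorted(self.open_windows):
            if s + self.window_size <= wm:
                self.closed.append((s, self.open_windows.pop(s)))


counter = WatermarkedCounter(window_size=10, max_delay=5)
for t in [1, 4, 12, 8, 17, 3, 25]:
    counter.on_event(t)
print(counter.closed)   # [(0, 3), (10, 2)]
print(counter.dropped)  # 1 (the event at time 3 arrived after its window closed)
```

The event at time 8 is out of order but still counted, because the watermark had not yet reached the end of its window; the event at time 3 is too late and is dropped.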
TRIGGERS
▸ When to emit window results
▸ Watermark progress
▸ Event time progress
▸ Processing time progress
▸ Punctuations
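Two of these trigger styles can be contrasted in a few lines of plain Python. This is a simplified sketch for a single window: the function names are hypothetical, processing times are assumed to be non-decreasing, and the processing-time trigger only fires when an element arrives (a real engine would typically use timers).

```python
from itertools import accumulate
from typing import List, Tuple


def trigger_on_every_element(values: List[int]) -> List[int]:
    """Emit the running sum after each element."""
    return list(accumulate(values))


def trigger_on_processing_time(
    arrivals: List[Tuple[int, int]], interval: int
) -> List[Tuple[int, int]]:
    """Emit (fire_time, running_sum) each time `interval` of processing time passes.

    `arrivals` holds (processing_time, value) pairs.
    """
    emitted: List[Tuple[int, int]] = []
    total = 0
    next_fire = None
    for t, value in arrivals:
        if next_fire is None:
            next_fire = t + interval  # first timer starts at the first arrival
        while t >= next_fire:
            # The timer expired before this element arrived: emit the sum so far.
            emitted.append((next_fire, total))
            next_fire += interval
        total += value
    return emitted


arrivals = [(0, 1), (3, 2), (11, 4), (12, 1), (25, 3)]
print(trigger_on_every_element([v for _, v in arrivals]))  # [1, 3, 7, 8, 11]
print(trigger_on_processing_time(arrivals, interval=10))   # [(10, 3), (20, 8)]
```

Triggering on every element gives the lowest latency but the most output; a processing-time trigger trades latency for fewer updates.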
ACCUMULATION OF RESULTS
▸ If we trigger many times …
▸ discard
▸ accumulate
▸ retract & accumulate
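The three modes differ only in what is emitted when a window fires more than once. A minimal plain-Python sketch, assuming a sum aggregation and taking each "pane" to be the values that arrived between two consecutive trigger firings; the function name `emit` and the mode strings are illustrative.

```python
from typing import List


def emit(panes: List[List[int]], mode: str) -> list:
    """Return what a summing window outputs across successive trigger firings."""
    outputs: list = []
    total = 0
    previous = None
    for pane in panes:
        pane_sum = sum(pane)
        total += pane_sum
        if mode == "discard":
            # Each firing reports only the values seen since the last firing.
            outputs.append(pane_sum)
        elif mode == "accumulate":
            # Each firing reports the full result so far.
            outputs.append(total)
        elif mode == "retract_and_accumulate":
            # Withdraw the previous result, then report the new full result.
            if previous is not None:
                outputs.append(("retract", previous))
            outputs.append(("emit", total))
            previous = total
        else:
            raise ValueError(f"unknown mode: {mode}")
    return outputs


panes = [[3, 4], [5], [1, 2]]
print(emit(panes, "discard"))     # [7, 5, 3]
print(emit(panes, "accumulate"))  # [7, 12, 15]
print(emit(panes, "retract_and_accumulate"))
# [('emit', 7), ('retract', 7), ('emit', 12), ('retract', 12), ('emit', 15)]
```

Which mode fits depends on the consumer: a downstream sum wants discarded panes, a key-value store that overwrites wants accumulated results, and a downstream regrouping needs retractions.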
FINALLY … HOW TO MANIPULATE THE DATA
▸ map, flatMap, filter …
▸ stateful computation
▸ fold, reduce
▸ past-dependent operations
▸ where to store the state
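The difference between stateless and stateful operations can be shown with plain Python generators. In this sketch the state is a local in-memory dictionary, which is only one possible answer to "where to store the state"; the function names and the `(user, success)` event shape are assumptions for illustration.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def failed_logins(events: Iterable[Tuple[str, bool]]) -> Iterator[str]:
    """Stateless step (filter + map): keep the user name of each failed attempt."""
    return (user for user, success in events if not success)


def running_counts(
    users: Iterable[str], state: Optional[Dict[str, int]] = None
) -> Iterator[Tuple[str, int]]:
    """Stateful step (a fold per key): each output depends on all earlier elements."""
    state = {} if state is None else state
    for user in users:
        state[user] = state.get(user, 0) + 1
        yield user, state[user]


events = [("ann", False), ("bob", True), ("ann", False), ("bob", False)]
print(list(running_counts(failed_logins(events))))
# [('ann', 1), ('ann', 2), ('bob', 1)]
```

If the process holding `state` crashes, the counts are lost, which is why distributed engines persist or replicate operator state.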
SUMMING UP
▸ Event/ingestion/processing time
▸ Tumbling/sliding/session windows
▸ Watermarks
▸ Triggers
▸ Accumulation of results
▸ State management
SPARK STREAMING
▸ Micro-batches (DStream)
▸ .window() API:
▸ tumbling/sliding windows
▸ only processing time
▸ no watermarks
▸ triggers at the end of the window
▸ state persisted in cluster (e.g. updateStateByKey())
SPARK STREAMING - WHY BOTHER?
▸ Popular
▸ Not only streaming
▸ ML
▸ SQL
▸ GraphX
▸ but …
SPARK STRUCTURED STREAMING
▸ Alpha in Spark 2.0
▸ Micro-batches not exposed
▸ groupBy(window(…))
▸ Event-time support
▸ No watermarks, session windows (2.1?)
▸ Trigger: processing time; outputs changed windows
▸ Exactly-once processing*
FLINK
▸ Mostly with keyed streams (parallelism)
▸ TimeCharacteristic: event/ingestion/processing
▸ TimestampAssigner: also generates watermarks
▸ WindowAssigner: arbitrary, built-in tumbling, sliding, session
▸ Trigger: event/processing time, count, single/continuous
▸ Window function: fold/reduce/with-kv-state
▸ Exactly-once* / at-least-once
KAFKA STREAMS
▸ State: Kafka topics/local key-value backed by a topic for resiliency
▸ Watermarks: no, but windows are retained for 1 day
▸ Time: event/ingestion/processing; TimestampExtractor
▸ Tumbling/sliding windows
▸ Trigger: after every element
▸ aggregate by key&window into an ever-updating KTable
▸ At-least-once
AKKA STREAMS
▸ Single-node, no clustering
▸ No OOTB support, but quite easy to implement:
▸ Windows: arbitrary, assign windows to each element
▸ Trigger: only window-close
▸ State: local
▸ Watermarks: can be implemented
SUMMING UP
▸ Spark: widely used, some features missing
▸ Flink: versatile
▸ Kafka: simple model
▸ Akka: single-node
SUMMING UP
▸ Windowing is just one of the aspects
▸ Other:
▸ State management
▸ Work distribution
▸ Processing guarantees
SUMMING UP
▸ Other stream processing systems out there!
▸ Apache Storm
▸ Google Cloud Dataflow
▸ Amazon Kinesis
▸ Apache Beam
▸ …
LINKS
▸ Streaming 101 & 102:
▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
▸ https://softwaremill.com/windowing-data-in-akka-streams/
THANKS!
ADAM WARSKI
@ADAMWARSKI / ADAM.WARSKI@SOFTWAREMILL.COM

Windowing data in big data streams

  • 1. WINDOWING DATA IN BIG DATA STREAMS ADAM WARSKI, WOLVESSUMMIT
  • 2. ADAM WARSKI, SOFTWAREMILL, @ADAMWARSKI BIG DATA? FAST DATA? ▸ What is big data? ▸ Shift of focus ▸ Processing speed ▸ Fast data -> streaming
  • 3. A TYPE OF DATA PROCESSING ENGINE THAT IS DESIGNED WITH INFINITE DATA SETS IN MIND Tyler Akidau, Google ADAM WARSKI, SOFTWAREMILL, @ADAMWARSKI WHAT IS STREAMING?
  • 4. WINDOWING ▸ Time becomes the focus point ▸ How many invalid password errors were there in the last 5 minutes? ▸ During which 30-minute window did we get most traffic? ▸ What’s the average 5-minute speed on a section of a highway throughout the day?
  • 5. HOW TO DO STREAMING? WITH WINDOWS? ▸ Many possibilities: ▸ Spark Streaming ▸ Spark Structured Streaming ▸ Kafka Streams ▸ Flink ▸ Akka Streams ▸ …
  • 6. WHICH ONE TO CHOOSE? LET’S FIND OUT
  • 7. /ME ▸ coder @ ▸ Lightbend, Confluent, Datastax consulting partner ▸ mainly Scala ▸ open-source: MacWire, ElasticMQ, Quicklens, … ▸ http://www.warski.org / @adamwarski
  • 8. WHAT’S THE TIME? ▸ How to associate time with an event: ▸ event time: “logical”, data-dependent ▸ ingestion time: when the event entered the system ▸ processing time: when the event is being processed
  • 9. TYPES OF WINDOWS ▸ Time-based ▸ fixed/tumbling ▸ sliding ▸ Session-based time
  • 10. OUT-OF-ORDER: WATERMARKS, LATENESS ▸ Windows GC ▸ At some point, enough is enough ▸ Watermark: ▸ all events before X have been observed ▸ heuristics
  • 11. TRIGGERS ▸ When to emit window results ▸ Watermark progress ▸ Event time progress ▸ Processing time progress ▸ Punctuations
  • 12. ACCUMULATION OF RESULTS ▸ If we trigger many times … ▸ discard ▸ accumulate ▸ retract & accumulate
  • 13. FINALLY … HOW TO MANIPULATE THE DATA ▸ map, flatMap, filter … ▸ stateful computation ▸ fold, reduce ▸ past-dependent operations ▸ where to store the state
  • 14. SUMMING UP ▸ Event/ingestion/processing time ▸ Tumbling/sliding/session windows ▸ Watermarks ▸ Triggers ▸ Accumulation of results ▸ State management
  • 15. SPARK STREAMING ▸ Micro-batches (DStream) ▸ .window() API: ▸ tumbling/sliding windows ▸ only processing time ▸ no watermarks ▸ triggers at the end of the window ▸ state persisted in cluster (e.g. updateStateByKey())
  • 16. SPARK STREAMING - WHY BOTHER? ▸ Popular ▸ Not only streaming ▸ ML ▸ SQL ▸ GraphX ▸ but …
  • 17. SPARK STRUCTURED STREAMING ▸ Alpha in Spark 2.0 ▸ Micro-batches not exposed ▸ groupBy(window(…)) ▸ Event-time support ▸ No watermarks, session windows (2.1?) ▸ Trigger: processing time; outputs changed windows ▸ Exactly-once processing*
  • 18. FLINK ▸ Mostly with keyed streams (parallelism) ▸ TimeCharacteristic: event/ingestion/processing ▸ TimestampAssigner: also generates watermarks ▸ WindowAssigner: arbitrary, built-in tumbling, sliding, session ▸ Trigger: event/processing time, count, single/continuous ▸ Window function: fold/reduce/with-kv-state ▸ Exactly-once* / at-least-once
  • 19. KAFKA STREAMS ▸ State: Kafka topics/local key-value backed by a topic for resiliency ▸ Watermarks: no, but windows are retained for 1 day ▸ Time: event/ingestion/processing; TimestampExtractor ▸ Tumbling/sliding windows ▸ Trigger: after every element ▸ aggregate by key&window into an ever-updating KTable ▸ At-least-once
  • 20. AKKA STREAMS ▸ Single-node, no clustering ▸ No OOTB support, but quite easy to implement: ▸ Windows: arbitrary, assign windows to each element ▸ Trigger: only window-close ▸ State: local ▸ Watermarks: can be implemented
  • 21. SUMMING UP ▸ Spark: widely used, some features missing ▸ Flink: versatile ▸ Kafka: simple model ▸ Akka: single-node
  • 22. SUMMING UP ▸ Windowing is just one of the aspects ▸ Other: ▸ State management ▸ Work distribution ▸ Processing guarantees
  • 23. SUMMING UP ▸ Other stream processing systems out there! ▸ Apache Storm ▸ Google Cloud Dataflow ▸ Amazon Kinesis ▸ Apache Beam ▸ …
  • 24. LINKS ▸ Streaming 101 & 102: ▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 ▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 ▸ https://softwaremill.com/windowing-data-in-akka-streams/
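
The window types from slide 9 (fixed/tumbling, sliding, session-based) can be illustrated with a minimal sketch in plain Python. It is not taken from the talk or from any of the engines listed; the function names, integer timestamps, and half-open `[start, start + size)` windows are assumptions made for illustration.

```python
def tumbling_window(ts, size):
    """Start of the single fixed window [start, start + size) containing ts."""
    return ts - ts % size


def sliding_windows(ts, size, step):
    """Starts of all sliding windows [start, start + size) that contain ts.

    Windows start at multiples of `step`; with step < size they overlap,
    so one event usually belongs to several windows.
    """
    starts = []
    start = ts - ts % step          # latest window start not after ts
    while start > ts - size:        # window still covers ts
        starts.append(start)
        start -= step
    return sorted(starts)


def session_windows(timestamps, gap):
    """Group event times into sessions as (first, last) pairs.

    A new session begins whenever the distance to the previous event
    exceeds `gap`; there is no fixed window length.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], ts)   # extend current session
        else:
            sessions.append((ts, ts))              # start a new session
    return sessions
```

For example, `sliding_windows(7, size=10, step=5)` places the event in the windows starting at 0 and 5, while `tumbling_window(7, 5)` places it in exactly one window.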
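
Slides 10 and 11 describe watermarks ("all events before X have been observed", derived heuristically) and triggering on watermark progress. The sketch below is one possible reading of that idea, not how any specific engine implements it: the class `WatermarkedCounter` and the heuristic "watermark = largest event time seen minus a fixed `max_delay`" are assumptions chosen for illustration.

```python
class WatermarkedCounter:
    """Counts events per tumbling window and closes windows by watermark."""

    def __init__(self, size, max_delay):
        self.size = size                  # window length in event-time units
        self.max_delay = max_delay        # assumed bound on out-of-orderness
        self.watermark = float("-inf")    # no events observed yet
        self.open = {}                    # window start -> running count
        self.dropped = 0                  # events that arrived too late

    def on_event(self, ts):
        """Process one event; return the (window_start, count) pairs it closes."""
        start = ts - ts % self.size
        if start + self.size <= self.watermark:
            # The window was already emitted and garbage-collected.
            self.dropped += 1
        else:
            self.open[start] = self.open.get(start, 0) + 1
        # Heuristic watermark: it only ever moves forward.
        self.watermark = max(self.watermark, ts - self.max_delay)
        closed = sorted(s for s in self.open if s + self.size <= self.watermark)
        return [(s, self.open.pop(s)) for s in closed]


counter = WatermarkedCounter(size=10, max_delay=5)
results = [counter.on_event(ts) for ts in [1, 4, 12, 8, 16, 3, 27]]
```

In this run the out-of-order event at time 8 is still counted, because the watermark has only reached 7. The event at time 3 arrives after the watermark has passed the end of window `[0, 10)` and is dropped.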
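
The three accumulation modes from slide 12 (discard, accumulate, retract & accumulate) differ in what is emitted when a trigger fires more than once for the same window. A minimal sketch, assuming a sum aggregation and a hypothetical `emit` helper that receives the values collected between consecutive trigger firings ("panes"):

```python
def emit(panes, mode):
    """Return what a window emits over several trigger firings.

    panes: list of lists, the new values seen before each firing.
    mode:  "discard", "accumulate" or "retract_and_accumulate".
    """
    total = 0
    previous = None
    out = []
    for pane in panes:
        pane_sum = sum(pane)
        total += pane_sum
        if mode == "discard":
            # Only the values since the last firing; state is thrown away.
            out.append(("emit", pane_sum))
        elif mode == "accumulate":
            # The refined result for the whole window so far.
            out.append(("emit", total))
        elif mode == "retract_and_accumulate":
            # Withdraw the previous result first, then send the refined one.
            if previous is not None:
                out.append(("retract", previous))
            out.append(("emit", total))
        else:
            raise ValueError(f"unknown mode: {mode}")
        previous = total
    return out


panes = [[5, 1], [7], [3, 4]]
```

With discarding, a downstream consumer has to add the emitted values itself. With accumulating, it has to overwrite the earlier result. With retractions, it can do either, at the cost of more output.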