Streaming, Database &
Distributed Systems:
Bridging the Divide
Ben Stopford (@benstopford)
Codemesh 2016
Event Driven
Systems
Most stateful systems have to pull
from these three worlds
Today we have 2 goals
1.  Understand Stateful Stream Processing (SSP), now & in the near future
2.  Make the case for SSP as a general framework
for building data-centric systems.
Data systems come in
different forms
•  Database (OLTP)
•  Analytics Database (OLAP/Hadoop)
•  Messaging
•  Distributed log
•  Stream Processing
•  Stateful Stream Processing
Database (OLTP)
Focuses on providing a consistent view that
supports updates and queries on individual tuples.
Analytics Database (OLAP/Hadoop)
1.  Focuses on aggregations via table scans.
2.  Executes as a distributed system
Messaging
Focuses on asynchronous information transfer with limited
state
Distributed Log
1.  Similar to messaging, but data can be retained
2.  Executes as a distributed system (scale + fault tolerance)
Stream Processing
Manipulates concurrent streams of events
Comes from a CEP (complex event processing) background (ephemeral)
Stateful Stream Processing
Moves stream processing to be a more general
framework for building data-centric systems.
What is stream processing?
Database: Data, Index, Query Engine (finite source)
vs
Stream Processor: Query Engine (infinite source)
Infinite streams need
windows
How many items will we bring into the machine at
one time?
Windows bound a computation
Buffering allows us to handle
late events
Some query
Over some time window
Emitting at some frequency
Continually executing query
Stream(s)
Stream Processing Engine
Derived Stream
Avg(p.time – o.time)
From orders, payment
Group by payment.region
over 1 day window
emitting every second
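The query above can be sketched in plain Python. This is only an illustration of the idea, not how any particular engine implements it: the record shapes, the function name `avg_completion_by_region` and the use of integer timestamps are all assumptions made for the example.

```python
from collections import defaultdict

DAY_SECONDS = 24 * 60 * 60

def avg_completion_by_region(orders, payments, now, window=DAY_SECONDS):
    """Average (payment time - order time) per region, for payments inside the window.

    orders:   iterable of (order_id, order_time)
    payments: iterable of (order_id, payment_time, region)
    """
    order_time = {order_id: t for order_id, t in orders}
    totals = defaultdict(lambda: [0.0, 0])  # region -> [sum of durations, count]
    for order_id, paid_at, region in payments:
        in_window = now - window < paid_at <= now
        if in_window and order_id in order_time:
            bucket = totals[region]
            bucket[0] += paid_at - order_time[order_id]
            bucket[1] += 1
    return {region: total / count for region, (total, count) in totals.items()}

orders = [("o1", 100), ("o2", 200), ("o3", 300)]
payments = [("o1", 160, "EU"), ("o2", 230, "EU"), ("o3", 390, "US")]
result = avg_completion_by_region(orders, payments, now=400)
narrow = avg_completion_by_region(orders, payments, now=400, window=50)
```

A stream processor would re-evaluate something like this continually (here, "emitting every second") rather than once on demand.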
Stream Processing
orders
payments
Completion time,
by region
Avg(p.time – o.time)
From orders, payment
Group by payment.region
over 1 day window
emitting every second
Materialised View (DB)
Query
orders
payments
Completion time,
by region
Avg(p.time – o.time)
From orders, payment, user
Group by user.region
over 1 day window
emitting every second
Stateful Stream Processing
Streams
Stream Processing Engine
Derived Stream
Query
Derived “Table”
Table
“View” is output as
table or stream
Table == Stream + Window (0 -> N)
A table is a stream with an infinite window (i.e. buffered from 0 -> now)
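A minimal sketch of the stream/table relationship, assuming a stream is just a Python list of (key, value) events and offsets are list indices. The helper names `window` and `latest_by_key` are hypothetical.

```python
def window(stream, start, end):
    """Events whose offset lies in [start, end)."""
    return stream[start:end]

def latest_by_key(events):
    """Fold events into a key -> latest value mapping."""
    table = {}
    for key, value in events:
        table[key] = value
    return table

stream = [("alice", "London"), ("bob", "Paris"), ("alice", "Berlin"), ("carol", "Rome")]

recent = latest_by_key(window(stream, 2, len(stream)))  # bounded window: recent events only
table = latest_by_key(window(stream, 0, len(stream)))   # window from 0 -> now: the full table
```

The bounded window forgets "bob"; the window that starts at offset 0 holds a row for every key ever seen, which is what makes it behave like a table.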
SSP is about creating
materialised views.
Materialised as a table, or
materialised as a stream
Features: similar to database query
engine
Join, Filter, Aggregate, View
Windowed
Streams
Can distribute over many machines
in two dimensions
[Diagram: several Join / Filter / Aggregate / View engines arranged along two axes: Scale Out and Scale Forward]
Stateful Stream Processing engines typically
use Kafka (a distributed commit log)
Join, Filter, Aggregate, View
Kafka (a distributed log)
A log is a very simple idea
Messages are added at the end of the log
Just think of the log as a file
Old New
Readers have a position & scan
Sally
is here
George
is here
Fred
is here
Old New
Scan Scan
Scan
Can “Rewind & Replay” the log
Rewind & Replay
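The log-as-a-file idea can be sketched as an append-only list with independent reader positions. The `Log` and `Reader` classes below are hypothetical and in-memory only; they are not a Kafka client API.

```python
class Log:
    """Append-only sequence of messages; offsets are list indices."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Reader:
    """Each reader owns its position, so readers never interfere with one another."""

    def __init__(self, log, position=0):
        self.log = log
        self.position = position

    def poll(self, max_messages=10):
        batch = self.log.read(self.position, max_messages)
        self.position += len(batch)
        return batch

    def rewind(self, position=0):
        self.position = position


log = Log()
for message in ["a", "b", "c"]:
    log.append(message)

sally = Reader(log)
first = sally.poll()     # scan forward from the start
sally.rewind()           # rewind...
replayed = sally.poll()  # ...and replay the same messages
```

Because reading never removes anything, replay costs nothing more than moving a position.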
Compacted Log
(Tabular View)
STREAM (all versions): three keys, holding Versions 1 to 3, 1 to 2 and 1 to 5
COMPACTED STREAM (latest version of each key only): Version 3, Version 2, Version 5
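Compaction can be sketched as "keep the last record for each key". The `compact` function below is a hypothetical, in-memory illustration; a real log compacts in the background and on its own schedule.

```python
def compact(log):
    """Keep only the latest record per key, preserving the order of the survivors."""
    last_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [record for offset, record in enumerate(log)
            if last_offset[record[0]] == offset]

stream = [("k1", "v1"), ("k2", "v1"), ("k1", "v2"), ("k1", "v3"), ("k2", "v2")]
compacted = compact(stream)
```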
The log is a
Distributed System
For scalability and fault tolerance
Shard on the way in
Producers
Kafka
Consumers
Each shard is a queue
Producers
Kafka
Consumers
Producers
Kafka
Many consumers
share partitions
in one topic
Consumers share consumption of a
single topic
The Log reassigns data on failure
Producers
Kafka
Many consumers
share partitions in
one topic
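A rough sketch of sharding on the way in and of partition reassignment when a consumer fails. The CRC32 hash and the round-robin assignment are assumptions chosen for simplicity; Kafka's own partitioner and assignment strategies differ.

```python
import zlib

def partition_for(key, num_partitions):
    # A stable hash, so the same key always lands on the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign(partitions, consumers):
    """Spread partitions round-robin over the sorted list of live consumers."""
    live = sorted(consumers)
    assignment = {consumer: [] for consumer in live}
    for i, partition in enumerate(sorted(partitions)):
        assignment[live[i % len(live)]].append(partition)
    return assignment

partitions = range(4)
before = assign(partitions, {"c1", "c2"})  # two consumers share the topic
after = assign(partitions, {"c1"})         # c2 has failed: c1 takes over its partitions
```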
Kafka supplies two levels of
leader election
Replicas in Kafka have
an elected leader
Consumers in Kafka
have an elected leader
The log is important for SSP
Maintains History: Acts like a “push based” distributed file system
The log is important: Two Primitives
Stream
Compacted Stream (‘table’)
The Log is, to a streaming
engine, what HDFS is to Hadoop
But it’s a bit more than an HDFS
replacement: Processors inherit the
idea of “membership” from the log
So stateful Stream Processors use
the Log
Join, Filter, Aggregate, View
Kafka (Distributed Log)
They also use local storage
Join, Filter, Aggregate, View
(1) Kafka
(2) Local KV Store
Local KV store has a few uses
(1)  It caches streams on disk
(2) It caches “tables” on disk
Join, Filter, Aggregate, View
This makes join operations fast as they’re entirely local
Streams just cache recent
messages to help with joins
Tables are fully
“realised” locally
Stateful Stream Processing
stream
Compacted
stream
Join
Stream data
Stream-Tabular
Data
Infinite
Stream
Locally Cached
Table
(disk resident)
Kafka
Kafka Streams
e.g. Useful for Enrichment
stream
Compacted
stream
Join
Orders
Customers
Kafka
Kafka Streams
Local DB
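Enrichment can be sketched as a stream-table join against a local dictionary. The field names (`customer_id`, `region`) and the `enrich` function are hypothetical, and the table is built up front here, whereas an engine would keep updating it as the compacted stream changes.

```python
def enrich(orders, customers_changelog):
    """Join each order against a locally cached customers table (inner join)."""
    customers = {}
    for customer_id, details in customers_changelog:
        customers[customer_id] = details  # a later record overwrites an earlier one
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is not None:
            yield {**order, "region": customer["region"]}

orders = [{"order_id": 1, "customer_id": "c1"},
          {"order_id": 2, "customer_id": "c9"}]
customers_changelog = [("c1", {"region": "EU"}), ("c1", {"region": "US"})]

enriched = list(enrich(orders, customers_changelog))
```

The lookup is a local dictionary access, so no network call is needed per order.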
Aggregates need intermediary state
stream
Compacted
stream
Join
Orders
Customers
Kafka
Sum(orders)
group by region
Persist current value,
in case we fail
State store inherits durability from
the log
State store flushes
back to the log
Join, Filter, Aggregate, View
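One way to picture a state store inheriting durability from the log: every write is mirrored to a changelog, and a replacement instance rebuilds its state by replaying that changelog. The `StateStore` class is a hypothetical sketch, and a plain Python list stands in for the log.

```python
class StateStore:
    """In-memory key-value store that mirrors every write to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stands in for a compacted topic in the log
        self.data = {}

    def put(self, key, value):
        self.data[key] = value
        self.changelog.append((key, value))

    def get(self, key, default=None):
        return self.data.get(key, default)

    @classmethod
    def restore(cls, changelog):
        """Rebuild the store on a new node by replaying the changelog."""
        store = cls(changelog)
        for key, value in list(changelog):
            store.data[key] = value
        return store


changelog = []
store = StateStore(changelog)
for region, amount in [("EU", 10), ("US", 5), ("EU", 7)]:
    store.put(region, store.get(region, 0) + amount)  # Sum(orders) group by region

recovered = StateStore.restore(changelog)  # what a surviving node would do
```

This sketch writes to the changelog on every update; batching those writes is a separate design choice it does not attempt to show.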
Separate Data, Processing & View
View
Orders Payments View
View
Storage Layer
(a Kafka)
Processing & View
Query
You can query the views from
anywhere
View
Orders Payments View
View
Storage Layer
(a Kafka)
Processing & View
Query
So what happens on failure?
View
Orders Payments View
View
Storage Layer
(a Kafka)
Processing & View
Clustering Reroutes Data to
surviving node
View
Orders Payments View
View
Storage Layer
(Kafka)
Ownership of partitions is re-routed from dead node
Processing & View
But what about state?
View
Orders Payments View
View
Storage Layer
(Kafka)
“Cold” replica of state
takes over
Processing & View
Primitives for sharding &
replication
Stock
Orders Payments Stock
Stock
Redundant copies are
cached on other nodes
Sharding spreads data
over processors
So processors inherit much
from the log
Clustering comes
from the log
You just write the
functional bit
General framework for distributed, realtime data
computation
Protection from
broker failure
Protection from
engine failure
Join tables & streams
(in process)
Event Driven
Create views which
can be queried
Query
But stream
processing has a
problem
Correctness Guarantees in multi-layer
topologies
[Diagram: a multi-layer topology of Join / Filter / Aggregate / View engines]
Duplicates are a side effect of all at-least-once delivery mechanisms
Data is rerouted, on failure, which
can cause duplicates
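A small sketch of why redelivery matters for aggregates. The (offset, amount) record shape and both function names are assumptions; the offset-tracking fix shown here only covers a single stage and says nothing about multi-layer topologies.

```python
def total_naive(deliveries):
    """Sum every delivery, counting redelivered messages twice."""
    return sum(amount for _, amount in deliveries)

def total_deduplicated(deliveries):
    """Sum each offset at most once by remembering which offsets were seen."""
    seen = set()
    total = 0
    for offset, amount in deliveries:
        if offset in seen:
            continue  # a redelivery after rerouting: skip it
        seen.add(offset)
        total += amount
    return total

# Offset 1 is delivered twice, as can happen when data is rerouted on failure.
deliveries = [(0, 10), (1, 20), (1, 20), (2, 5)]
naive = total_naive(deliveries)
deduplicated = total_deduplicated(deliveries)
```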
Idempotence isn’t enough
[Diagram: a multi-layer topology of Join / Filter / Aggregate / View engines]
Distributed Snapshots*
(transactions)
[Diagram: a chain of Join / Filter / Aggregate / View engines]
Transaction markers:
[Begin], [Prepare], [Commit], [Abort]
Buffer
Chandy, Lamport - Distributed Snapshots: Determining Global States of Distributed Systems
*In development in Kafka
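The buffering idea behind transaction markers can be sketched as a reader that holds messages back until it sees a commit. This is a toy model under strong assumptions (one transaction at a time, markers encoded as plain strings, [Prepare] treated as a no-op) and is not Kafka's protocol.

```python
def read_committed(log):
    """Yield only messages whose transaction reached a [Commit] marker."""
    buffer = []
    in_transaction = False
    for entry in log:
        if entry == "[Begin]":
            buffer, in_transaction = [], True
        elif entry == "[Prepare]":
            continue  # no-op in this sketch
        elif entry == "[Commit]":
            yield from buffer  # release everything buffered since [Begin]
            buffer, in_transaction = [], False
        elif entry == "[Abort]":
            buffer, in_transaction = [], False  # discard the buffered messages
        elif in_transaction:
            buffer.append(entry)
        else:
            yield entry  # assumption: non-transactional messages pass straight through

log = ["[Begin]", "a", "b", "[Prepare]", "[Commit]",
       "[Begin]", "c", "[Abort]",
       "[Begin]", "d"]
visible = list(read_committed(log))
```

Message "c" is dropped because its transaction aborted, and "d" stays invisible because its transaction has not yet committed.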
So why use these
tools?
(1) Streaming is a
superset of batch
Databases look backwards
Batch == Streaming from offset 0
MPP Batch System: queries over a Distributed File System (HDFS)
MPP Streaming System: queries over a Distributed Log (Kafka)
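"Batch == streaming from offset 0" can be illustrated with one job that does both. The `RunningSum` class is hypothetical and uses a list as the log.

```python
class RunningSum:
    """Consumes a log incrementally, remembering the last processed offset."""

    def __init__(self):
        self.offset = 0
        self.total = 0

    def process(self, log):
        """Consume everything in the log after the last processed offset."""
        for value in log[self.offset:]:
            self.total += value
        self.offset = len(log)
        return self.total


log = [3, 4, 5]
job = RunningSum()
batch_total = job.process(log)      # "batch": start at offset 0, read to the end
log.extend([6, 7])
streaming_total = job.process(log)  # "streaming": same job, keeps going as events arrive
```

The batch answer is simply the streaming job's first result; nothing else about the code changes.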
Streaming is the superset of batch
Streaming
Batch
Database
Global, Linearisable
consistency model
(2) Separates store & view
“Engine” part is lightweight
but stateful
Storage
Just a Java process
which uses a library
Log handles fault
tolerance of both layers
Separates Concerns of
Model & View – Think MVC
Storage
View & Controller
Model
Physically Separates Read &
Write – Think CQRS
Storage
View & Controller
Model
Database vs SSP
Data
Index
Query
Engine
Query
Engine
vs
Database Stateful Stream Processor
Query
Query
View
Index Data
(3) Decentralised approaches
are more general
Rather than pushing processing
into an “appliance”
(code -> data)
Centralised Processing
App
Data Decentric Architecture
Distributed
Log
Decentralised Processing over many
user-specific views
This is more general
than just
analytics use cases
It’s more than taking a
database and adding push
notifications
Whether you’re building a hulking,
multistage, analytic platform
Query
Final View
Intermediary View (2)
Intermediary View (1)
Or a simple microservice that
needs to run hot-hot & scale
Business Logic
Manage local
state
Join various
streams
Hot secondary
instance
Composable Primitives
Declarative
Function
Traditional DB
Work
Distribution
Replication
Sharding
Query
Engine
Distributed DB Distributed Systems
Membership
Global
Consistency
General framework for distributed, event-
driven data computation
Protection from
broker failure
Protection from
engine failure
Join tables & streams
(in process)
Event Driven
Create views which
can be queried
Query
Stateful Stream Processing
A framework for building streaming data
systems, just for you
Find out more:
•  http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
•  https://martin.kleppmann.com/2015/02/11/database-inside-out-at-salesforce.html
•  http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
•  https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cidr07p42.pdf
•  http://highscalability.com/blog/2015/5/4/elements-of-scale-composing-and-scaling-data-platforms.html
•  https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams
•  https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html
•  https://www.madewithtea.com/processing-tweets-with-kafka-streams.html
•  http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/
•  http://www.slideshare.net/zacharycox/updating-materialized-views-and-caches-using-kafka
The end
@benstopford
http://benstopford.com

  • 9. Distributed Log 1.  Similar to messaging, but data can be retained 2.  Executes as distributed system (scale + fault tolerance)
  • 10. Stream Processing Manipulate concurrent streams of events Comes from CEP background (ephemeral)
  • 11. Stateful Stream Processing Moves stream processing to be a more general framework for building data-centric systems.
  • 12. What is stream processing? Data Index Query Engine Query Engine vs Database Finite source Stream Processor Infinite source
  • 13. Infinite streams need windows How many items will we bring into the machine at one time?
  • 14. Windows bound a computation How many items will we bring into the machine at one time?
  • 15. Buffering allows us to handle late events How many items will we bring into the machine at one time?
  • 16. Some query Over some time window Emitting at some frequency Continually executing query Stream(s) Stream Processing Engine Derived Stream
  • 17. Avg(p.time – o.time) From orders, payment Group by payment.region over 1 day window emitting every second Stream Processing orders! payments! Completion time, by region!
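As a rough illustration of the windowed query on slide 17, the sketch below computes the average completion time (payment time minus order time) per region over a one-day window. It is plain Python rather than any stream engine's API; the function name `completion_time_by_region`, the record shapes and the sample data are all assumptions made for the example.

```python
from collections import defaultdict

DAY_SECONDS = 24 * 60 * 60


def completion_time_by_region(orders, payments, now, window=DAY_SECONDS):
    """Average (payment time - order time) per region, for payments in the window.

    orders:   dict of order_id -> order timestamp in seconds (hypothetical shape)
    payments: list of (order_id, region, payment timestamp) tuples (hypothetical shape)
    """
    totals = defaultdict(lambda: [0.0, 0])  # region -> [sum of durations, count]
    for order_id, region, paid_at in payments:
        # The window bounds the computation: only payments in (now - window, now].
        if not (now - window < paid_at <= now):
            continue
        ordered_at = orders.get(order_id)
        if ordered_at is None:
            continue  # no matching order has arrived yet
        totals[region][0] += paid_at - ordered_at
        totals[region][1] += 1
    return {region: total / count for region, (total, count) in totals.items()}


orders = {"o1": 100, "o2": 200, "o3": 300}
payments = [("o1", "EU", 160), ("o2", "EU", 300), ("o3", "US", 330)]

# A continually executing query would re-evaluate this on every emit tick.
result = completion_time_by_region(orders, payments, now=400)
print(result)
```

A real engine would maintain the sums incrementally instead of rescanning the window on each tick; the rescan here only keeps the sketch short.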
  • 18. Avg(p.time – o.time) From orders, payment Group by payment.region over 1 day window emitting every second Materialised View (DB) Query orders! payments! Completion time, by region!
  • 19. Avg(p.time – o.time) From orders, payment, user Group by user.region over 1 day window emitting every second Stateful Stream Processing Streams Stream Processing Engine Derived Stream Query Derived “Table” Table “View” is output as table or stream
  • 20. Table == Stream + Window. A table is a stream with an infinite window (i.e. buffer from 0 -> now).
  • 21. SSP is about creating materialised views. Materialised as a table, or materialised as a stream
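One way to read "a table is a stream with an infinite window" is as a fold over the stream that keeps the latest value per key. The sketch below assumes a stream is simply a Python list of (key, value) pairs and uses a count-based window for brevity; `materialise_table` and `last_n` are hypothetical helper names, not part of any library.

```python
def materialise_table(stream):
    """Fold a stream of (key, value) events into a table of latest values."""
    table = {}
    for key, value in stream:
        table[key] = value  # later events overwrite earlier ones
    return table


def last_n(stream, size):
    """A count-based window: the most recent `size` events."""
    return stream[max(0, len(stream) - size):]


events = [("alice", 1), ("bob", 5), ("alice", 2), ("carol", 7), ("bob", 6)]

# Infinite window (offset 0 -> now): every key ever seen is in the table.
table = materialise_table(events)

# Finite window: only keys touched by the last two events are visible.
recent = materialise_table(last_n(events, 2))

print(table)
print(recent)
```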
  • 22. Features: similar to database query engine Join Filter Aggregate View Windowed Streams
  • 23. Can distribute over many machines in two dimensions Join Filter Aggregate View Scale Out Scale Forward
  • 24. Stateful Stream Processing engines typically use Kafka (a distributed commit log) Join Filter Aggregate View Kafka (a distributed log)
  • 25. A log is a very simple idea Messages are added at the end of the log Just think of the log as a file Old New
  • 26. Readers have a position & scan Sally is here George is here Fred is here Old New Scan Scan Scan
  • 27. Can “Rewind & Replay” the log Rewind & Replay
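The log model on slides 25 to 27 can be sketched in a few lines: messages are appended at the end, each reader tracks its own position, and rewinding is just resetting that position. The `Log` and `Reader` classes below are illustrative stand-ins written for this example, not Kafka client classes.

```python
class Log:
    """An append-only sequence of messages; think of it as a file."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read_from(self, offset):
        return self._messages[offset:]


class Reader:
    """Each reader owns its position and scans forward from it."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self):
        batch = self.log.read_from(self.position)
        self.position += len(batch)
        return batch

    def rewind(self, offset=0):
        self.position = offset


log = Log()
for message in ["a", "b", "c"]:
    log.append(message)

sally, fred = Reader(log), Reader(log)
first = sally.poll()      # Sally scans everything written so far
log.append("d")
second = sally.poll()     # only the new message
sally.rewind()            # "Rewind & Replay"
replayed = sally.poll()
fred_batch = fred.poll()  # Fred's position is independent of Sally's
```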
  • 28. Compacted Log (Tabular View) STREAM (all versions): Version 3, Version 2, Version 1; Version 2, Version 1; Version 5, Version 4, Version 3, Version 2, Version 1. COMPACTED STREAM (latest version per key only): Version 2, Version 3, Version 5
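Compaction, as described on slide 28, keeps only the latest version of each key. A minimal sketch, assuming the stream is a list of (key, value) pairs in log order; the function name `compact` and the sample records are made up for the example, and real log compaction runs in the background over segments rather than in a single pass.

```python
def compact(stream):
    """Return only the latest record per key, preserving log order."""
    latest = {}
    for offset, (key, value) in enumerate(stream):
        latest[key] = (offset, value)  # remember where the newest version lives
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


stream = [
    ("k1", "v1"), ("k2", "v1"), ("k1", "v2"),
    ("k3", "v1"), ("k1", "v3"), ("k2", "v2"),
]
compacted = compact(stream)
print(compacted)
```

Replaying the compacted stream from the start rebuilds the same table as replaying the full stream, which is why the compacted stream can stand in for a table.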
  • 29. The log is a Distributed System For scalability and fault tolerance
  • 30. Shard on the way in Producers Kafka Consumers
  • 31. Each shard is a queue Producers Kafka Consumers
  • 32. Producers Kafka Many consumers share partitions in one topic Consumers share consumption of a single topic
  • 33. The Log reassigns data on failure Producers Kafka Many consumers share partitions in one topic
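Slides 30 to 33 describe sharding on the way in and reassignment on failure. The sketch below shows both ideas under simplifying assumptions: `zlib.crc32` stands in for a key hash (it is not the hash Kafka's partitioner uses), and the round-robin `assign` function is a hypothetical stand-in for a group rebalance.

```python
import zlib


def partition_for(key, num_partitions):
    """Shard on the way in: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, consumers):
    """Spread partitions round-robin over the consumers that are alive."""
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        assignment[ordered[index % len(ordered)]].append(partition)
    return assignment


partitions = list(range(6))
before = assign(partitions, ["c1", "c2", "c3"])

# c2 fails: its partitions are shared out among the survivors.
after = assign(partitions, ["c1", "c3"])

print(before)
print(after)
```

Because each partition is an ordered queue owned by one consumer at a time, ordering per key survives the reassignment.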
  • 34. Kafka supplies two levels of leader election Replicas in Kafka have an elected leader Consumers in Kafka have an elected leader
  • 35. The log is important for SSP Maintains History: Acts like a “push based” distributed file system
  • 36. The log is important: Two Primitives Stream Compacted Stream (‘table’)
  • 37. The Log is, to a streaming engine, what HDFS is to Hadoop
  • 38. But it’s a bit more than a HDFS replacement: Processors inherit the idea of “membership” from the log
  • 39. So stateful Stream Processors use the Log Join Filter Aggregate View Kafka (Distributed Log)
  • 40. They also use local storage Join Filter Aggregate View (1) a Kafka (2) Local KV Store
  • 41. Local KV store has a few uses: (1) It caches streams on disk. (2) It caches “tables” on disk. Join Filter Aggregate View. This makes join operations fast as they’re entirely local. Streams just cache recent messages to help with joins. Tables are fully “realised” locally.
  • 42. Stateful Stream Processing stream Compacted stream Join Stream data Stream-Tabular Data Infinite Stream Locally Cached Table (disk resident) Kafka Kafka Streams
  • 43. e.g. Useful for Enrichment stream Compacted stream Join Orders Customers Kafka Kafka Streams Local DB
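The enrichment case on slide 43 amounts to joining each order in the stream against a locally cached customer table. A minimal sketch in plain Python, assuming dict-shaped records; the field names (`customer_id`, `name`, `region`) and the helpers `build_table` and `enrich` are invented for the example.

```python
def build_table(compacted_stream):
    """Realise a compacted stream as a local key-value table."""
    table = {}
    for key, value in compacted_stream:
        table[key] = value
    return table


def enrich(orders, customers):
    """Stream-table join: look up each order's customer locally."""
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is None:
            continue  # inner join: skip orders with no known customer
        yield {**order, "customer_name": customer["name"], "region": customer["region"]}


customer_stream = [
    ("c1", {"name": "Ada", "region": "EU"}),
    ("c2", {"name": "Bob", "region": "US"}),
    ("c1", {"name": "Ada", "region": "UK"}),  # newer version of c1 wins
]
orders = [
    {"order_id": "o1", "customer_id": "c1", "amount": 10},
    {"order_id": "o2", "customer_id": "c9", "amount": 20},
    {"order_id": "o3", "customer_id": "c2", "amount": 30},
]

customers = build_table(customer_stream)
enriched = list(enrich(orders, customers))
print(enriched)
```

The lookup never leaves the process, which is the point of caching the table locally.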
  • 44. Aggregates need intermediary state stream Compacted stream Join Orders Customers Kafka Sum(orders) group by region Persist current value, in case we fail
  • 45. State store inherits durability from the log State store flushes back to the log Join Filter Aggregate View
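Slides 44 and 45 say that an aggregate keeps intermediary state and that the state store is flushed back to the log. The sketch below models that with a Python list acting as the changelog; the `RegionTotals` class is hypothetical, and it writes every update through to the changelog, whereas a real engine would typically batch its flushes.

```python
class RegionTotals:
    """Sum(orders) group by region, with state backed by a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # a list standing in for a log topic
        self.store = {}             # the local key-value state store
        # Recovery: rebuild local state by replaying the changelog.
        for region, total in changelog:
            self.store[region] = total

    def process(self, region, amount):
        total = self.store.get(region, 0) + amount
        self.store[region] = total
        self.changelog.append((region, total))  # persist the current value
        return total


changelog = []
processor = RegionTotals(changelog)
processor.process("EU", 10)
processor.process("US", 5)
processor.process("EU", 7)

# The node dies and its local store is lost; a replacement restores from the log.
replacement = RegionTotals(changelog)
restored = dict(replacement.store)
next_total = replacement.process("EU", 3)
print(restored, next_total)
```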
  • 46. Separate Data, Processing & View View Orders Payments View View Storage Layer (a Kafka) Processing & View Query
  • 47. You can query the views from anywhere View Orders Payments View View Storage Layer (a Kafka) Processing & View Query
  • 48. So what happens on failure? View Orders Payments View View Storage Layer (a Kafka) Processing & View
  • 49. Clustering Reroutes Data to surviving node View Orders Payments View View Storage Layer (Kafka) Ownership of partitions is re-routed from dead node Processing & View
  • 50. But what about state? View Orders Payments View View Storage Layer (Kafka) “Cold” replica of state takes over Processing & View
  • 51. Primitives for sharding & replication Stock Orders Payments Stock Stock Redundant copies are cached on other nodes Sharding spreads data over processors
  • 52. So processors inherit much from the log Clustering comes from the log You just write the functional bit
  • 53. General framework for distributed, realtime data computation Protection from broker failure Protection from engine failure Join tables & streams (in process) Event Driven Create views which can be queried Query
  • 55. Correctness Guarantees in multi layer topologies Join Filter Aggregate View
  • 56. Duplicates are a side effect of all at-least-once delivery mechanisms Data is rerouted, on failure, which can cause duplicates Join Filter Aggregate View
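To see why at-least-once delivery produces duplicates, consider a processor that commits its offset only every few messages and crashes between commits. The simulation below is a sketch under that assumption; the `run` function and its `crash_after` and `commit_every` parameters are invented for the example and do not correspond to any client API.

```python
def run(log, committed_offset, crash_after=None, commit_every=2):
    """Process the log from the last committed offset.

    Returns (messages emitted downstream, offset that was durably committed).
    """
    emitted = []
    offset = committed_offset
    processed = 0
    while offset < len(log):
        if crash_after is not None and processed == crash_after:
            # Crash: progress since the last commit is forgotten.
            return emitted, committed_offset
        emitted.append(log[offset])
        offset += 1
        processed += 1
        if processed % commit_every == 0:
            committed_offset = offset
    return emitted, offset  # clean finish: everything is committed


log = ["a", "b", "c", "d", "e"]

# First attempt crashes after three messages, but only two were committed.
first_output, committed = run(log, committed_offset=0, crash_after=3)

# The work is rerouted and resumes from the committed offset.
second_output, final_offset = run(log, committed_offset=committed)

downstream = first_output + second_output
print(downstream)
```

Nothing is lost, but "c" reaches the downstream stage twice, which matters as soon as that stage counts or sums what it receives.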
  • 57. Idempotence isn’t enough Join Filter Aggregate View
  • 58. Distributed Snapshots* (transactions) Join Filter Aggregate View Transaction markers: [Begin], [Prepare], [Commit], [Abort] Buffer Chandy, Lamport - Distributed Snapshots: Determining Global States of Distributed Systems *In development in Kafka
  • 59.
  • 60.
  • 61. So why use these tools?
  • 62. (1) Streaming is a superset of batch
  • 64. Batch == Streaming from offset 0 Query Query Query Distributed File System (HDFS) Query Query Query Distributed Log (Kafka) MPP Batch System MPP Streaming System
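The claim "Batch == Streaming from offset 0" can be checked on a toy example: an incremental aggregator that starts at offset 0 and consumes a finite log ends up with the same answer as a batch job over the same records. Both `batch_sum` and `StreamingSum` below are hypothetical names written for this sketch.

```python
def batch_sum(records):
    """Batch: scan the complete, finite data set once."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals


class StreamingSum:
    """Streaming: keep running totals and a position in the log."""

    def __init__(self):
        self.totals = {}
        self.offset = 0  # start from offset 0

    def poll(self, log):
        for key, value in log[self.offset:]:
            self.totals[key] = self.totals.get(key, 0) + value
        self.offset = len(log)
        return dict(self.totals)


log = [("EU", 10), ("US", 5), ("EU", 7)]
streaming = StreamingSum()
caught_up = streaming.poll(log)
batch_result = batch_sum(log)

# The stream keeps going where the batch job would have to be re-run.
log.append(("US", 1))
after_new_event = streaming.poll(log)
print(caught_up, after_new_event)
```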
  • 65. Streaming is the superset of batch Streaming Batch Database Global, Linearisable consistency model
  • 67. “Engine” part is lightweight but stateful Storage Just a Java process which uses a library Log handles fault tolerance of both layers
  • 68. Separates Concerns of Model & View – Think MVC Storage View & Controller Model
  • 69. Physically Separates Read & Write – Think CQRS Storage View & Controller Model
  • 70. Database vs SSP Data Index Query Engine Query Engine vs Database Stateful Stream Processor Query Query View Index Data
  • 72. Rather than pushing processing into an “appliance” (code -> data) Centralised Processing App
  • 73. Data Decentric Architecture Distributed Log Decentralised Processing over many user-specific views
  • 74. This is more general than just analytics use cases
  • 75. It’s more than taking a database and adding push notifications
  • 76. Whether you’re building a hulking, multistage, analytic platform Query Final View Intermediary View (2) Intermediary View (1)
  • 77. Or a simple microservice that needs to run hot-hot & scale Business Logic Manage local state Join various streams Hot secondary instance
  • 79. General framework for distributed, event-driven data computation Protection from broker failure Protection from engine failure Join tables & streams (in process) Event Driven Create views which can be queried Query
  • 80. Stateful Stream Processing: a framework for building streaming data systems, just for you
  • 81. Find out more:
      •  http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
      •  https://martin.kleppmann.com/2015/02/11/database-inside-out-at-salesforce.html
      •  http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
      •  https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cidr07p42.pdf
      •  http://highscalability.com/blog/2015/5/4/elements-of-scale-composing-and-scaling-data-platforms.html
      •  https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams
      •  https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html
      •  https://www.madewithtea.com/processing-tweets-with-kafka-streams.html
      •  http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/
      •  http://www.slideshare.net/zacharycox/updating-materialized-views-and-caches-using-kafka