SlideShare a Scribd company logo
Streaming
Why should I care?
Christian Trebing
Blue Yonder GmbH
@ctrebing
1
Agenda
Motivation
Streaming Intro
Implementation
Challenges
2
Data Processing
- The Monolith
3
THE Database
Data Input
Data Validation
Machine Learning
Data Output
Pain
Several teams are developing this application
• Customer desperately wants new feature in machine
learning
• But the data validation team is in the midst of
refactoring their database structure (‚will be finished
in two weeks‘)
• So you wait…
4
Could Microservices Help?
Great:
• No Dependency on single state
• Independent Development
• Independent Upgrades
Difficult:
• Too much data to transfer
• Too much data to store in each
service
5
Data Input Data
Machine Learning Data
Data Validation Data
Data Output Data
Are There Other Possibilities?
6
Streaming Intro
7
Databases and Streams
Same information in table and stream:
8
A=1 B=5 C=3 A=8 A=4 C=2
A: 1 A: 1
B: 5
A: 1
B: 5
C: 3
A: 8
B: 5
C: 3
A: 4
B: 5
C: 3
A: 4
B: 5
C: 2
Why does it matter?
Different services can be in different states
• Each service can consume the stream in its own
speed
• One service can be updated while the other runs
9
A=1 B=5 C=3 A=8 A=4 C=2
A: 1 A: 1
B: 5
A: 1
B: 5
C: 3
A: 8
B: 5
C: 3
A: 4
B: 5
C: 3
A: 4
B: 5
C: 2
Service 1
on index 3
Service 2
on index 5
Partitioned Streams
Sales stream, partitioned by location
Each partition could be handled by diff. processors
10
location: Rimini
product: Spaghetti
sales_date: 2017-07-10
quantity: 5
location: Rimini
product: Ravioli
sales_date: 2017-07-10
quantity: 8
location: Rimini
product: Pizza
sales_date: 2017-07-11
quantity: 1
location: Rimini
product: Spaghetti
sales_date: 2017-07-11
quantity: 7
location: Bilbao
product: Pizza
sales_date: 2017-07-10
quantity: 3
location: Bilbao
product: Ravioli
sales_date: 2017-07-11
quantity: 9
location: Bilbao
product: Spaghetti
sales_date: 2017-07-11
quantity: 5
location: Karlsruhe
product: Pizza
sales_date: 2017-07-10
quantity: 8
location: Karlsruhe
product: Ravioli
sales_date: 2017-07-10
quantity: 3
location: Karlsruhe
product: Ravioli
sales_date: 2017-07-11
quantity: 7
location: Karlsruhe
product: Spaghetti
sales_date: 2017-07-11
quantity: 7
Data Processing
- With Streaming
11
Data Input Data
Machine Learning Data
Data Validation Data
Data Output Data
Streaming Platform
What did we gain?
Independent Development
Independent Upgrade
Scalability
Did we throw out databases completely?
• Let’s see…
12
Is it magic?
No, it’s a tradeoff:
• A database is so powerful: ACID guarantess, SQL
language. You can do nearly everything
• This comes at a price
• Dependency on single state
• Scaling is hard
So let’s more think what we really need
13
What do we loose?
14
Database Stream
ACID Ordering on stream partition
SQL Queries
Service is responsible of keeping
its state
You have to decide whether you can live with that.
Implementation
15
Apache Kafka
16
Kafka Clients in Python
• pykafka, python-kafka, confluent-kafka-client
• Nice comparison has been done here:
• http://activisiongamescience.github.io/2016/06/15/Kafka-
Client-Benchmarking/
• Most performant currently is confluent-kafka-client
• Uses the c library librdkafka
17
Producer
from confluent_kafka import Producer
p = Producer({'bootstrap.servers': 'mybroker,mybroker2'})
for data in some_data_source:
p.produce('mytopic', data.encode('utf-8'))
p.flush()
18
Consumer
19
Apache Avro
Data Serialization, Enabling Schema Evolution
Clearly defined schema with:
• Schema evolution
• Schema registry
20
Data Type Field Name
string location
string product
string sales_date
int quantity
Writer’s schema Reader’s schema
Data Type Field Name
string location
string product
string sales_date
int quantity
int, default=0 delivery_id
Avro Schema
21
AvroProducer
22
AvroConsumer
23
Example: Data Validation
Separate valid and invalid sales records
24
Additional Processors
Need to evolve your application:
• Add processors for evaluation topics
• Try new variant of validation logic
database
• remember processing state, same for each processor
streaming
• offset in each processor, can work independently
25
Challenge
- Machine Learning Still Batch
26
How to get the input data?
Remember: There is no possibility to query a stream.
Somewhere all this data needs to be. Options:
• Memory of the service
• Serving database
• Blob store
Yes, that’s duplication. We’ll have to live with it.
27
Write Path / Read Path
28
write path read path
THE databasedata validation
machine learning
query
machine learning
write path read path
Blob storedata validation
machine learning
query
machine learning
Machine Learning Input Data
29
sales_validated products_validatedlocations_validated
join
tabletable
append to file
Challenge
- State in Processors
30
State - Nightmare of every
distributed systems engineer
Streaming: Data just rushes through
Why do we need state?
• Time window processing
• Data you want to join with
Formerly, the database did it for you
31
State - Some Challenges?
Failure of a processor
Scaling
32
State in Stream Processors
- Possible Solutions
• Just keep in memory. Reprocess stream to warm up
• Each processor to keep its own db
• Save condensed in stream
• Get it from other service
Frameworks exist in other languages:
• for example: Kafka Streams, Apache Samza
Up to yesterday: none in python. Then heard about
https://github.com/wintoncode/winton-kafka-streams
33
Summary
• You have more options for your data processing
applications than you might have thought
• As always, there are some tradeoffs
• You know the challenges
34
Questions?
35
Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe
Germany
+49 721 383117 0
Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360
Blue Yonder
Best decisions,
delivered daily
Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA
36

More Related Content

What's hot

An Engineering Approach to Database Evaluations
An Engineering Approach to Database EvaluationsAn Engineering Approach to Database Evaluations
An Engineering Approach to Database Evaluations
SingleStore
 
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
DataStax Academy
 
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Databricks
 
Modernization sql server 2016
Modernization   sql server 2016Modernization   sql server 2016
Modernization sql server 2016
Kiki Noviandi
 
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward
 
Whats new in Columnstore Indexes for SQL Server 2017
Whats new in Columnstore Indexes for SQL Server 2017Whats new in Columnstore Indexes for SQL Server 2017
Whats new in Columnstore Indexes for SQL Server 2017
Niko Neugebauer
 
Scaling a SaaS backend with PostgreSQL - A case study
Scaling a SaaS backend with PostgreSQL - A case studyScaling a SaaS backend with PostgreSQL - A case study
Scaling a SaaS backend with PostgreSQL - A case study
Oliver Seemann
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous Data
Databricks
 
Ebay: DB Capacity planning at eBay
Ebay: DB Capacity planning at eBayEbay: DB Capacity planning at eBay
Ebay: DB Capacity planning at eBay
DataStax Academy
 
Columnstore improvements in SQL Server 2016
Columnstore improvements in SQL Server 2016Columnstore improvements in SQL Server 2016
Columnstore improvements in SQL Server 2016
Niko Neugebauer
 
Clustered Columnstore Introduction
Clustered Columnstore IntroductionClustered Columnstore Introduction
Clustered Columnstore Introduction
Niko Neugebauer
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
confluent
 
VoltDB : A Technical Overview
VoltDB : A Technical OverviewVoltDB : A Technical Overview
VoltDB : A Technical Overview
Tim Callaghan
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
Rajesh Muppalla
 
MemSQL 201: Advanced Tips and Tricks Webcast
MemSQL 201: Advanced Tips and Tricks WebcastMemSQL 201: Advanced Tips and Tricks Webcast
MemSQL 201: Advanced Tips and Tricks Webcast
SingleStore
 
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Lviv Startup Club
 
Why you really want SQL in a Real-Time Enterprise Environment
Why you really want SQL in a Real-Time Enterprise EnvironmentWhy you really want SQL in a Real-Time Enterprise Environment
Why you really want SQL in a Real-Time Enterprise Environment
VoltDB
 
The Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsThe Holy Grail of Data Analytics
The Holy Grail of Data Analytics
Dan Lynn
 
ClustrixDB: how distributed databases scale out
ClustrixDB: how distributed databases scale outClustrixDB: how distributed databases scale out
ClustrixDB: how distributed databases scale out
MariaDB plc
 

What's hot (19)

An Engineering Approach to Database Evaluations
An Engineering Approach to Database EvaluationsAn Engineering Approach to Database Evaluations
An Engineering Approach to Database Evaluations
 
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
 
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
 
Modernization sql server 2016
Modernization   sql server 2016Modernization   sql server 2016
Modernization sql server 2016
 
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
 
Whats new in Columnstore Indexes for SQL Server 2017
Whats new in Columnstore Indexes for SQL Server 2017Whats new in Columnstore Indexes for SQL Server 2017
Whats new in Columnstore Indexes for SQL Server 2017
 
Scaling a SaaS backend with PostgreSQL - A case study
Scaling a SaaS backend with PostgreSQL - A case studyScaling a SaaS backend with PostgreSQL - A case study
Scaling a SaaS backend with PostgreSQL - A case study
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous Data
 
Ebay: DB Capacity planning at eBay
Ebay: DB Capacity planning at eBayEbay: DB Capacity planning at eBay
Ebay: DB Capacity planning at eBay
 
Columnstore improvements in SQL Server 2016
Columnstore improvements in SQL Server 2016Columnstore improvements in SQL Server 2016
Columnstore improvements in SQL Server 2016
 
Clustered Columnstore Introduction
Clustered Columnstore IntroductionClustered Columnstore Introduction
Clustered Columnstore Introduction
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
 
VoltDB : A Technical Overview
VoltDB : A Technical OverviewVoltDB : A Technical Overview
VoltDB : A Technical Overview
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
MemSQL 201: Advanced Tips and Tricks Webcast
MemSQL 201: Advanced Tips and Tricks WebcastMemSQL 201: Advanced Tips and Tricks Webcast
MemSQL 201: Advanced Tips and Tricks Webcast
 
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
 
Why you really want SQL in a Real-Time Enterprise Environment
Why you really want SQL in a Real-Time Enterprise EnvironmentWhy you really want SQL in a Real-Time Enterprise Environment
Why you really want SQL in a Real-Time Enterprise Environment
 
The Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsThe Holy Grail of Data Analytics
The Holy Grail of Data Analytics
 
ClustrixDB: how distributed databases scale out
ClustrixDB: how distributed databases scale outClustrixDB: how distributed databases scale out
ClustrixDB: how distributed databases scale out
 

Similar to Streaming - Why Should I Care?

Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - Plumbee
GIAF
 
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Sid Anand
 
Sql azure cluster dashboard public.ppt
Sql azure cluster dashboard public.pptSql azure cluster dashboard public.ppt
Sql azure cluster dashboard public.ppt
Qingsong Yao
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
Precisely
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Alluxio, Inc.
 
MicroStrategy at Badoo
MicroStrategy at BadooMicroStrategy at Badoo
MicroStrategy at Badoo
Francesco Mucio
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
Eric Kavanagh
 
PPWT2019 - EmPower your BI architecture
PPWT2019 - EmPower your BI architecturePPWT2019 - EmPower your BI architecture
PPWT2019 - EmPower your BI architecture
Riccardo Perico
 
Pinterest hadoop summit_talk
Pinterest hadoop summit_talkPinterest hadoop summit_talk
Pinterest hadoop summit_talk
Krishna Gade
 
50 Billion pins and counting: Using Hadoop to build data driven Products
50 Billion pins and counting: Using Hadoop to build data driven Products50 Billion pins and counting: Using Hadoop to build data driven Products
50 Billion pins and counting: Using Hadoop to build data driven Products
DataWorks Summit
 
DC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysDC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion Days
Rahul Agarwal
 
GPPB2020 - Milan - Power BI dataflows deep dive
GPPB2020 - Milan - Power BI dataflows deep diveGPPB2020 - Milan - Power BI dataflows deep dive
GPPB2020 - Milan - Power BI dataflows deep dive
Riccardo Perico
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Databricks
 
Acting on Real-time Behavior: How Peak Games Won Transactions
Acting on Real-time Behavior: How Peak Games Won TransactionsActing on Real-time Behavior: How Peak Games Won Transactions
Acting on Real-time Behavior: How Peak Games Won Transactions
VoltDB
 
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & RedshiftIntroduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
DataKitchen
 
Warehousing Your Hits - The Why and How of Owning Your Data
Warehousing Your Hits - The Why and How of Owning Your DataWarehousing Your Hits - The Why and How of Owning Your Data
Warehousing Your Hits - The Why and How of Owning Your Data
Scott Arbeitman
 
Die Another Day: Scaling from 0 to 4 million daily requests as a lone develop...
Die Another Day: Scaling from 0 to 4 million daily requests as a lone develop...Die Another Day: Scaling from 0 to 4 million daily requests as a lone develop...
Die Another Day: Scaling from 0 to 4 million daily requests as a lone develop...
itnig
 
Data Pipelines -Big Data Meets Salesforce
Data Pipelines -Big Data Meets SalesforceData Pipelines -Big Data Meets Salesforce
Data Pipelines -Big Data Meets Salesforce
CarolEnLaNube
 
Geek Sync I In Database Automation We Trust
Geek Sync I In Database Automation We TrustGeek Sync I In Database Automation We Trust
Geek Sync I In Database Automation We Trust
IDERA Software
 
Accelerate Develoment with VIrtual Data
Accelerate Develoment with VIrtual DataAccelerate Develoment with VIrtual Data
Accelerate Develoment with VIrtual Data
Kyle Hailey
 

Similar to Streaming - Why Should I Care? (20)

Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - Plumbee
 
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
 
Sql azure cluster dashboard public.ppt
Sql azure cluster dashboard public.pptSql azure cluster dashboard public.ppt
Sql azure cluster dashboard public.ppt
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
MicroStrategy at Badoo
MicroStrategy at BadooMicroStrategy at Badoo
MicroStrategy at Badoo
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
 
PPWT2019 - EmPower your BI architecture
PPWT2019 - EmPower your BI architecturePPWT2019 - EmPower your BI architecture
PPWT2019 - EmPower your BI architecture
 
Pinterest hadoop summit_talk
Pinterest hadoop summit_talkPinterest hadoop summit_talk
Pinterest hadoop summit_talk
 
50 Billion pins and counting: Using Hadoop to build data driven Products
50 Billion pins and counting: Using Hadoop to build data driven Products50 Billion pins and counting: Using Hadoop to build data driven Products
50 Billion pins and counting: Using Hadoop to build data driven Products
 
DC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysDC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion Days
 
GPPB2020 - Milan - Power BI dataflows deep dive
GPPB2020 - Milan - Power BI dataflows deep diveGPPB2020 - Milan - Power BI dataflows deep dive
GPPB2020 - Milan - Power BI dataflows deep dive
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Acting on Real-time Behavior: How Peak Games Won Transactions
Acting on Real-time Behavior: How Peak Games Won TransactionsActing on Real-time Behavior: How Peak Games Won Transactions
Acting on Real-time Behavior: How Peak Games Won Transactions
 
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & RedshiftIntroduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
 
Warehousing Your Hits - The Why and How of Owning Your Data
Warehousing Your Hits - The Why and How of Owning Your DataWarehousing Your Hits - The Why and How of Owning Your Data
Warehousing Your Hits - The Why and How of Owning Your Data
 
Die Another Day: Scaling from 0 to 4 million daily requests as a lone develop...
Die Another Day: Scaling from 0 to 4 million daily requests as a lone develop...Die Another Day: Scaling from 0 to 4 million daily requests as a lone develop...
Die Another Day: Scaling from 0 to 4 million daily requests as a lone develop...
 
Data Pipelines -Big Data Meets Salesforce
Data Pipelines -Big Data Meets SalesforceData Pipelines -Big Data Meets Salesforce
Data Pipelines -Big Data Meets Salesforce
 
Geek Sync I In Database Automation We Trust
Geek Sync I In Database Automation We TrustGeek Sync I In Database Automation We Trust
Geek Sync I In Database Automation We Trust
 
Accelerate Develoment with VIrtual Data
Accelerate Develoment with VIrtual DataAccelerate Develoment with VIrtual Data
Accelerate Develoment with VIrtual Data
 

Recently uploaded

Strengthening Web Development with CommandBox 6: Seamless Transition and Scal...
Strengthening Web Development with CommandBox 6: Seamless Transition and Scal...Strengthening Web Development with CommandBox 6: Seamless Transition and Scal...
Strengthening Web Development with CommandBox 6: Seamless Transition and Scal...
Ortus Solutions, Corp
 
Hands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion StepsHands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion Steps
servicesNitor
 
Computer Science & Engineering VI Sem- New Syllabus.pdf
Computer Science & Engineering VI Sem- New Syllabus.pdfComputer Science & Engineering VI Sem- New Syllabus.pdf
Computer Science & Engineering VI Sem- New Syllabus.pdf
chandangoswami40933
 
Stork Product Overview: An AI-Powered Autonomous Delivery Fleet
Stork Product Overview: An AI-Powered Autonomous Delivery FleetStork Product Overview: An AI-Powered Autonomous Delivery Fleet
Stork Product Overview: An AI-Powered Autonomous Delivery Fleet
Vince Scalabrino
 
Beginner's Guide to Observability@Devoxx PL 2024
Beginner's  Guide to Observability@Devoxx PL 2024Beginner's  Guide to Observability@Devoxx PL 2024
Beginner's Guide to Observability@Devoxx PL 2024
michniczscribd
 
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom KittEnhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Peter Caitens
 
Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
Paul Brebner
 
The Role of DevOps in Digital Transformation.pdf
The Role of DevOps in Digital Transformation.pdfThe Role of DevOps in Digital Transformation.pdf
The Role of DevOps in Digital Transformation.pdf
mohitd6
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid
 
Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data PlatformAlluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio, Inc.
 
Secure-by-Design Using Hardware and Software Protection for FDA Compliance
Secure-by-Design Using Hardware and Software Protection for FDA ComplianceSecure-by-Design Using Hardware and Software Protection for FDA Compliance
Secure-by-Design Using Hardware and Software Protection for FDA Compliance
ICS
 
Streamlining End-to-End Testing Automation
Streamlining End-to-End Testing AutomationStreamlining End-to-End Testing Automation
Streamlining End-to-End Testing Automation
Anand Bagmar
 
Software Test Automation - A Comprehensive Guide on Automated Testing.pdf
Software Test Automation - A Comprehensive Guide on Automated Testing.pdfSoftware Test Automation - A Comprehensive Guide on Automated Testing.pdf
Software Test Automation - A Comprehensive Guide on Automated Testing.pdf
kalichargn70th171
 
Going AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applicationsGoing AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applications
Alina Yurenko
 
The Comprehensive Guide to Validating Audio-Visual Performances.pdf
The Comprehensive Guide to Validating Audio-Visual Performances.pdfThe Comprehensive Guide to Validating Audio-Visual Performances.pdf
The Comprehensive Guide to Validating Audio-Visual Performances.pdf
kalichargn70th171
 
42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert
vaishalijagtap12
 
Orca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container OrchestrationOrca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container Orchestration
Pedro J. Molina
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Paul Brebner
 
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
kalichargn70th171
 
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
The Third Creative Media
 

Recently uploaded (20)

Strengthening Web Development with CommandBox 6: Seamless Transition and Scal...
Strengthening Web Development with CommandBox 6: Seamless Transition and Scal...Strengthening Web Development with CommandBox 6: Seamless Transition and Scal...
Strengthening Web Development with CommandBox 6: Seamless Transition and Scal...
 
Hands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion StepsHands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion Steps
 
Computer Science & Engineering VI Sem- New Syllabus.pdf
Computer Science & Engineering VI Sem- New Syllabus.pdfComputer Science & Engineering VI Sem- New Syllabus.pdf
Computer Science & Engineering VI Sem- New Syllabus.pdf
 
Stork Product Overview: An AI-Powered Autonomous Delivery Fleet
Stork Product Overview: An AI-Powered Autonomous Delivery FleetStork Product Overview: An AI-Powered Autonomous Delivery Fleet
Stork Product Overview: An AI-Powered Autonomous Delivery Fleet
 
Beginner's Guide to Observability@Devoxx PL 2024
Beginner's  Guide to Observability@Devoxx PL 2024Beginner's  Guide to Observability@Devoxx PL 2024
Beginner's Guide to Observability@Devoxx PL 2024
 
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom KittEnhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
 
Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
 
The Role of DevOps in Digital Transformation.pdf
The Role of DevOps in Digital Transformation.pdfThe Role of DevOps in Digital Transformation.pdf
The Role of DevOps in Digital Transformation.pdf
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
 
Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data PlatformAlluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
 
Secure-by-Design Using Hardware and Software Protection for FDA Compliance
Secure-by-Design Using Hardware and Software Protection for FDA ComplianceSecure-by-Design Using Hardware and Software Protection for FDA Compliance
Secure-by-Design Using Hardware and Software Protection for FDA Compliance
 
Streamlining End-to-End Testing Automation
Streamlining End-to-End Testing AutomationStreamlining End-to-End Testing Automation
Streamlining End-to-End Testing Automation
 
Software Test Automation - A Comprehensive Guide on Automated Testing.pdf
Software Test Automation - A Comprehensive Guide on Automated Testing.pdfSoftware Test Automation - A Comprehensive Guide on Automated Testing.pdf
Software Test Automation - A Comprehensive Guide on Automated Testing.pdf
 
Going AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applicationsGoing AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applications
 
The Comprehensive Guide to Validating Audio-Visual Performances.pdf
The Comprehensive Guide to Validating Audio-Visual Performances.pdfThe Comprehensive Guide to Validating Audio-Visual Performances.pdf
The Comprehensive Guide to Validating Audio-Visual Performances.pdf
 
42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert
 
Orca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container OrchestrationOrca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container Orchestration
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
 
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
 
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
 

Streaming - Why Should I Care?

  • 1. Streaming Why should I care? Christian Trebing Blue Yonder GmbH @ctrebing 1
  • 3. Data Processing - The Monolith 3 THE Database Data Input Data Validation Machine Learning Data Output
  • 4. Pain Several teams are developing this application • Customer desperately wants new feature in machine learning • But the data validation team is in the midst of refactoring their database structure (‚will be finished in two weeks‘) • So you wait… 4
  • 5. Could Microservices Help? Great: • No Dependency on single state • Independent Development • Independent Upgrades Difficult: • Too much data to transfer • Too much data to store in each service 5 Data Input Data Machine Learning Data Data Validation Data Data Output Data
  • 6. Are There Other Possibilities? 6
  • 8. Databases and Streams Same information in table and stream: 8 A=1 B=5 C=3 A=8 A=4 C=2 A: 1 A: 1 B: 5 A: 1 B: 5 C: 3 A: 8 B: 5 C: 3 A: 4 B: 5 C: 3 A: 4 B: 5 C: 2
  • 9. Why does it matter? Different services can be in different states • Each service can consume the stream in its own speed • One service can be updated while the other runs 9 A=1 B=5 C=3 A=8 A=4 C=2 A: 1 A: 1 B: 5 A: 1 B: 5 C: 3 A: 8 B: 5 C: 3 A: 4 B: 5 C: 3 A: 4 B: 5 C: 2 Service 1 on index 3 Service 2 on index 5
  • 10. Partitioned Streams Sales stream, partitioned by location Each partition could be handled by diff. processors 10 location: Rimini product: Spaghetti sales_date: 2017-07-10 quantity: 5 location: Rimini product: Ravioli sales_date: 2017-07-10 quantity: 8 location: Rimini product: Pizza sales_date: 2017-07-11 quantity: 1 location: Rimini product: Spaghetti sales_date: 2017-07-11 quantity: 7 location: Bilbao product: Pizza sales_date: 2017-07-10 quantity: 3 location: Bilbao product: Ravioli sales_date: 2017-07-11 quantity: 9 location: Bilbao product: Spaghetti sales_date: 2017-07-11 quantity: 5 location: Karlsruhe product: Pizza sales_date: 2017-07-10 quantity: 8 location: Karlsruhe product: Ravioli sales_date: 2017-07-10 quantity: 3 location: Karlsruhe product: Ravioli sales_date: 2017-07-11 quantity: 7 location: Karlsruhe product: Spaghetti sales_date: 2017-07-11 quantity: 7
  • 11. Data Processing - With Streaming 11 Data Input Data Machine Learning Data Data Validation Data Data Output Data Streaming Platform
  • 12. What did we gain? Independent Development Independent Upgrade Scalability Did we throw out databases completely? • Let’s see… 12
  • 13. Is it magic? No, it’s a tradeoff: • A database is so powerful: ACID guarantess, SQL language. You can do nearly everything • This comes at a price • Dependency on single state • Scaling is hard So let’s more think what we really need 13
  • 14. What do we loose? 14 Database Stream ACID Ordering on stream partition SQL Queries Service is responsible of keeping its state You have to decide whether you can live with that.
  • 17. Kafka Clients in Python • pykafka, python-kafka, confluent-kafka-client • Nice comparison has been done here: • http://activisiongamescience.github.io/2016/06/15/Kafka- Client-Benchmarking/ • Most performant currently is confluent-kafka-client • Uses the c library librdkafka 17
  • 18. Producer from confluent_kafka import Producer p = Producer({'bootstrap.servers': 'mybroker,mybroker2'}) for data in some_data_source: p.produce('mytopic', data.encode('utf-8')) p.flush() 18
  • 20. Apache Avro Data Serialization, Enabling Schema Evolution Clearly defined schema with: • Schema evolution • Schema registry 20 Data Type Field Name string location string product string sales_date int quantity Writer’s schema Reader’s schema Data Type Field Name string location string product string sales_date int quantity int, default=0 delivery_id
  • 24. Example: Data Validation Separate valid and invalid sales records 24
  • 25. Additional Processors Need to evolve your application: • Add processors for evaluation topics • Try new variant of validation logic database • remember processing state, same for each processor streaming • offset in each processor, can work independently 25
  • 27. How to get the input data? Remember: There is no possibility to query a stream. Somewhere all this data needs to be. Options: • Memory of the service • Serving database • Blob store Yes, that’s duplication. We’ll have to live with it. 27
  • 28. Write Path / Read Path 28 write path read path THE databasedata validation machine learning query machine learning write path read path Blob storedata validation machine learning query machine learning
  • 29. Machine Learning Input Data 29 sales_validated products_validatedlocations_validated join tabletable append to file
  • 30. Challenge - State in Processors 30
  • 31. State - Nightmare of every distributed systems engineer Streaming: Data just rushes through Why do we need state? • Time window processing • Data you want to join with Formerly, the database did it for you 31
  • 32. State - Some Challenges? Failure of a processor Scaling 32
  • 33. State in Stream Processors - Possible Solutions • Just keep in memory. Reprocess stream to warm up • Each processor to keep its own db • Save condensed in stream • Get it from other service Frameworks exist in other languages: • for example: Kafka Streams, Apache Samza Up to yesterday: none in python. Then heard about https://github.com/wintoncode/winton-kafka-streams 33
  • 34. Summary • You have more options for your data processing applications than you might have thought • As always, there are some tradeoffs • You know the challenges 34
  • 36. Blue Yonder GmbH Ohiostraße 8 76149 Karlsruhe Germany +49 721 383117 0 Blue Yonder Software Limited 19 Eastbourne Terrace London, W2 6LG United Kingdom +44 20 3626 0360 Blue Yonder Best decisions, delivered daily Blue Yonder Analytics, Inc. 5048 Tennyson Parkway Suite 250 Plano, Texas 75024 USA 36