Kafka Streams: Revisiting the decisions of the past (How I could have made it better)
Hello Cleveland!
You can heckle on Twitter… @jasonbelldata
The Plan
1. A quick overview of the project and my original talk.
2. A look at how to make it better with Kafka’s components.
Think of this talk as a conceptual how-to, with thoughts about awkward data.
THIS IS A TRUE STORY.
The events depicted took place in London in 2016.
At the request of the client, the names have been changed.
Out of respect for the data, the rest has been told exactly as it occurred.
The Data
CSV payloads are GZipped and sent to an endpoint via HTTP.
The data contains live flight searches, batched together in chunks of a certain duration.

LHR,IST,2020-08-01,169.99,Y

There may be 3 rows or 20,000 rows.
It’s variable and we can’t control that.
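For illustration, the inbound request is roughly this shape (a sketch; the endpoint URL and the Content-Encoding header are assumptions, not from the talk):

# gzip a CSV batch and POST it to the ingest endpoint
gzip -c searches.csv | curl -X POST \
  -H "Content-Encoding: gzip" \
  --data-binary @- \
  https://ingest.example.com/flightsearches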
I call this “The Fun Bit”
So this is what we did as a proof of concept.
And boy did I learn.
Thoughts, Experiences and Considerations with Throughput and Latency on High Volume Stream Processing Systems Using Cloud Based Infrastructures.
How I Bled All Over Onyx
One Day I Got a Question
• “We want to stream in 12TB of data a day… Can you do that?”
A Streaming Application… kind of.
What is Onyx?
• Distributed Computation Platform
• Written 100% in Clojure
• Great plugin architecture (inputs and outputs)
• Uses a Directed Acyclic Graph (DAG) workflow
• I spoke about it at ClojureX 2016
• It’s very good…
Phase 1 – Testing at 1% Volume
• It’s working!
• Stuff’s going to S3!
• Tell the client the good news!
Phase 2 – Testing at 2% Volume
• It’s working!
• Stuff’s going to S3!
• Tell the client the good news!
Phase 3 – Testing at 5% Volume
Know Thy Framework
• Aeron buffer size = 3 × (batch-size × segment size × connections)
• (1 × 512 KB × 8) = 4 MB; × 3 = 12 MB
• Aeron’s default buffer is 16 MB, but then the twist:
• Onyx segment size = aeron.term.buffer.size / 8
• Max message size of 2 MB
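A quick worked check of those sizing rules (a sketch; the binding names are mine, the numbers are from the slide above):

;; buffer arithmetic for one peer
(let [batch-size        1
      segment-size      (* 512 1024)           ;; 512 KB
      connections       8
      aeron-term-buffer (* 16 1024 1024)]      ;; Aeron's default term buffer
  {:required-buffer-bytes (* 3 batch-size segment-size connections) ;; 12 MB
   :max-segment-bytes     (quot aeron-term-buffer 8)})              ;; 2 MB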
Onyx in Docker containers.
• --shm-size ended up being 10GB
• OOM killers were frequent
• Java logging is helpful but still hard to deal with.

# Use the container memory limit to set the max heap size so that the GC
# knows to collect before it's hard-stopped by the container environment,
# causing an OOM exception.
CGROUPS_MEM=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
MEMINFO_MEM=$(($(awk '/MemTotal/ {print $2}' /proc/meminfo)*1024))
MEM=$(($MEMINFO_MEM>$CGROUPS_MEM?$CGROUPS_MEM:$MEMINFO_MEM))
JVM_PEER_HEAP_RATIO=${JVM_PEER_HEAP_RATIO:-0.6}
XMX=$(awk '{printf("%d",$1*$2/1024^2)}' <<< "${MEM} ${JVM_PEER_HEAP_RATIO}")
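The container is then launched with the shared-memory size from the slide above (a hypothetical invocation; the image name and explicit ratio are assumptions):

# 10GB of /dev/shm for Aeron, 60% of the limit for the JVM heap
docker run --shm-size=10g -e JVM_PEER_HEAP_RATIO=0.6 onyx-peer:latest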
One Friday evening…
I did a rewrite in Kafka Streams.
• Moved the pure Clojure functions into the streams architecture.
• Used Amazonica to write to AWS S3.
Kafka Streams rewrite…
• Took 2 hours to write and deploy to DC/OS.
• And, by this stage, I was slightly tipsy.
Kafka Streams rewrite…
• In deployment, Kafka Streams didn’t touch more than 2% of the CPU/memory load on the instance.
• Max memory was 790 MB, compared to 8 GB for the Onyx jobs.
Monday Morning…
• Now might be a good time to tell the CTO. ;)
Now let’s make it better.
I need to break down the process into separate components.
Doing everything in one streaming app is bad!
The Event Log Producer
I’m going to make this into a streaming application.

Streaming Event Log flow:
1. Convert GZIP to a String CSV bundle.
2. Transform, adding a uuid, batchuuid and timestamp to the rows.
3. Push the payload to the topics: compress.payload and cflight.incoming.
LHR,IST,2020-08-01,169.99,Y,b6e87fb96f73c119caed08d6e9509c66,39b6aba33172e379c763fcbd1b23aad7,2020-08-01T09:00:00
Deserialise / Decompress Step

;; Assumes (:require [clojure.java.io :as io]) and
;; (:import (java.io ByteArrayInputStream) (java.util.zip GZIPInputStream))
;; in the ns form.
(defn deserialize-gzip-message [bytes]
  (try (-> bytes
           ByteArrayInputStream.
           GZIPInputStream.
           io/reader
           slurp)
       (catch Exception e {:error e})))
LHR,IST,2020-08-01,169.99,Y,b6e87fb96f73c119caed08d6e9509c66,39b6aba33172e379c763fcbd1b23aad7,2020-08-01T09:00:00
Transform Step
1. Define a batch UUID called batchuuid.
2. Define a batch timestamp.
3. Define a StringBuilder, outputbundle.
4. Split the bundle by newline; then for each CSV line:
   • Define a uuid.
   • Append the uuid, the batchuuid and the timestamp.
   • Append the complete line to outputbundle, plus a newline.
5. Push the outputbundle string to compress.payload.
6. Push the outputbundle string to cflight.incoming.
LHR,IST,2020-08-01,169.99,Y,b6e87fb96f73c119caed08d6e9509c66,39b6aba33172e379c763fcbd1b23aad7,2020-08-01T09:00:00
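A minimal Clojure sketch of that transform (the function name is mine, and the dashless-UUID formatting is an assumption made to match the 32-character hex ids in the sample row):

(defn transform-bundle [bundle]
  (let [dashless     #(clojure.string/replace (str (java.util.UUID/randomUUID)) "-" "")
        batchuuid    (dashless)
        timestamp    (str (java.time.Instant/now))
        outputbundle (StringBuilder.)]
    (doseq [line (clojure.string/split-lines bundle)]
      ;; enrich every CSV line with a uuid, the batchuuid and the timestamp
      (.append outputbundle
               (str line "," (dashless) "," batchuuid "," timestamp "\n")))
    (.toString outputbundle)))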
The Cheapest Price Streaming App
Option 1 - Compare each line of the CSV bundle.

;; Extracts the price, the fourth CSV field, from a row.
(defn get-price [row]
  (Double/parseDouble (nth (cstr/split row #",") 3)))

;; Finds the cheapest flight on a row by row comparison.
;; Uses the get-price function.
(defn find-cheapest-flight [data]
  (println "Finding cheapest flight")
  (let [rows (cstr/split data #"\n")
        [_price cheapest-flight]
        (reduce (fn [[cheapest-price _ :as cheapest]
                     [candidate-price _ :as candidate]]
                  (if (> cheapest-price candidate-price)
                    candidate
                    cheapest))
                (map (fn [row-o] [(get-price row-o) row-o]) rows))]
    (str cheapest-flight "\n"))) ;; put the EOL in before we serialise it again
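For example (hypothetical rows; the trailing uuid, batchuuid and timestamp fields are stand-ins):

(find-cheapest-flight
 (str "LHR,IST,2020-08-01,199.99,Y,uuid1,batch1,2020-08-01T09:00:00\n"
      "LHR,IST,2020-08-01,169.99,Y,uuid2,batch1,2020-08-01T09:00:00"))
;; => "LHR,IST,2020-08-01,169.99,Y,uuid2,batch1,2020-08-01T09:00:00\n"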
OR????????
KSQL?
Option 2 - Step 1: Convert the CSV bundle to an AVRO payload

{"depiata":"LHR",
 "arriata":"IST",
 "searchdate":"2020-08-01",
 "gbpprice":169.99,
 "paxclass":"Y",
 "uuid":"b6e87fb96f73c119caed08d6e9509c66",
 "batchuuid":"39b6aba33172e379c763fcbd1b23aad7",
 "timestamp":"2020-08-01T09:00:00"}
The Event Log flow, again for context:
1. Convert GZIP to a String CSV bundle.
2. Transform, adding a uuid, batchuuid and timestamp to the rows.
3. Push the payload to the topics: compress.payload and cflight.incoming.
LHR,IST,2020-08-01,169.99,Y,b6e87fb96f73c119caed08d6e9509c66,39b6aba33172e379c763fcbd1b23aad7,2020-08-01T09:00:00
Event Log Transform Step, Amended
1. Define a batch UUID called batchuuid.
2. Define a batch timestamp.
3. Define a StringBuilder, outputbundle.
4. Split the bundle by newline; then for each CSV line:
   • Define a uuid.
   • Append the uuid, the batchuuid and the timestamp.
   • Append the complete line to outputbundle, plus a newline.
   • Convert the line to AVRO.
   • Push the AVRO payload to the cflight.incoming topic.
5. Push the outputbundle string to compress.payload.
LHR,IST,2020-08-01,169.99,Y,b6e87fb96f73c119caed08d6e9509c66,39b6aba33172e379c763fcbd1b23aad7,2020-08-01T09:00:00
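A sketch of the per-line AVRO conversion (the schema file name and function name are assumptions; the field names follow the payload example earlier):

;; parse the schema once, reuse it per line
(def flight-schema
  (.parse (org.apache.avro.Schema$Parser.) (slurp "cflight.avsc")))

(defn line->avro [line]
  (let [[dep arr date price pax uuid batchuuid ts]
        (clojure.string/split line #",")]
    (doto (org.apache.avro.generic.GenericData$Record. flight-schema)
      (.put "depiata" dep)
      (.put "arriata" arr)
      (.put "searchdate" date)
      (.put "gbpprice" (Double/parseDouble price))
      (.put "paxclass" pax)
      (.put "uuid" uuid)
      (.put "batchuuid" batchuuid)
      (.put "timestamp" ts))))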
Option 2 - Step 2: Craft a KSQL Job

CREATE STREAM CFLIGHT (………………….)
  WITH (KAFKA_TOPIC='cflight.incoming',
        PARTITIONS=1,
        VALUE_FORMAT='avro');

CREATE STREAM CFLIGHT_CHEAPEST AS
  SELECT UUID, BATCHUUID,
         MIN(GBPPRICE) AS CHEAPEST_QUOTE
  FROM CFLIGHT
  GROUP BY BATCHUUID
  EMIT CHANGES;
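Once the job is running, you can watch the aggregate update with an ad-hoc push query (a hypothetical follow-up, not on the slide):

SELECT * FROM CFLIGHT_CHEAPEST EMIT CHANGES;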
The Batch Compressor Streaming App
The Event Log app pushes each transformed bundle to the compress.payload topic; this app consumes it:

compress.payload (topic) → GZip the output stream, get the byte array → batch.compressed (topic)
Compress and Serialise Step

;; GZips a string and returns the compressed byte array. Closing the
;; BufferedOutputStream flushes it and finishes the underlying GZIP stream.
(def gzip-serializer-fn
  (fn [output-str]
    (let [out (java.io.ByteArrayOutputStream.)]
      (doto (java.io.BufferedOutputStream.
             (java.util.zip.GZIPOutputStream. out))
        (.write (.getBytes output-str))
        (.close))
      (.toByteArray out))))
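A round-trip sanity check against the earlier deserializer (a sketch):

(deserialize-gzip-message (gzip-serializer-fn "LHR,IST,2020-08-01,169.99,Y\n"))
;; => "LHR,IST,2020-08-01,169.99,Y\n"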
Persist to S3
Kafka Connect
Use prebundled connectors where you can. It also means you have access to further filter and transform steps if required.
{"name": "cflight.sink",
"config":
{"connector.class":"io.confluent.connect.s3.S3SinkConnector",
"tasks.max":"1",
“topics”:"CHEAPEST_FLIGHT",
"s3.region":"us-east-1",
"s3.bucket.name": "cflight.bucket",
"s3.part.size":"5242880",
"flush.size":"1",
"storage.class":"io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"partitioner.class":"io.confluent.connect.storage.partitioner.DefaultPartitioner",
"schema.compatibility":"NONE"}}
{"name": "batchcompress.sink",
"config":
{"connector.class":"io.confluent.connect.s3.S3SinkConnector",
"tasks.max":"1",
“topics":"batch.compressed",
"s3.region":"us-east-1",
"s3.bucket.name": "flightbatch.bucket",
"s3.part.size":"5242880",
"flush.size":"1",
"storage.class":"io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.bytearray.ByteArrayFormat",
"partitioner.class":"io.confluent.connect.storage.partitioner.DefaultPartitioner",
"schema.compatibility":"NONE"}}
The full pipeline:
• nginx
• EventLog Streaming App
• Cheapest Flight App
• Compress Bundle Streaming App
• Kafka Connect: Batch compress sink
• Kafka Connect: CFlight sink
• Brokers
• S3 Storage
Thank you.
Many thanks to Shay and David for organising, everyone who attended and sent
kind wishes. Lastly, a huge thank you to MeetupCat.
Photo supplied by @jbfletch_