SlideShare a Scribd company logo
1 of 49
Download to read offline
101* ways to configure
Kafka - badly
Audun Fauchald Strand
Lead Developer Infrastructure
@audunstrand
bio: gof, mq, ejb,
mda, wli, bpel eda,
soa, ws*,esb, ddd
Henning Spjelkavik
Architect
@spjelkavik
bio: Skiinfo (Vail Resorts),
FINN.no
enjoys reading jstacks
agenda
introduction to kafka
kafka @ finn.no
101* mistakes
questions
“From a certain point onward
there is no longer any turning
back. That is the point that
must be reached.”
― Franz Kafka, The Trial
Top 5
1. no consideration of data on the
inside vs outside
2. schema not externally defined
3. same config for every
client/topic
4. 128 partitions as default config
5. running on 8 overloaded nodes
FINN.no
2nd largest website in norway
classified ads ( Ebay, Zillow in one)
60 millions pageviews a day
80 microservices
130 developers
1000 deploys to production a week
6 minutes from commit to deploy
(median)
#kafkasummit @spjelkavik @audunstrand
Schibsted Media Group
6800 people in 30 countries
FINN.no is a part of
kafka @ finn.no
kafka @finn.
no
architecture
use cases
tools
#kafkasummit @spjelkavik @audunstrand
in the beginning ...
Architecture governance board decided to use RabbitMQ as message queue.
Kafka was installed for a proof of concept, after developers spotted it januar 2013.
#kafkasummit @spjelkavik @audunstrand
2013 - POC
“High” volume
Stream of classified ads
Ad matching
Ad indexed
mod05
zk
kafka
mod07
zk
kafka
mod01
zk
kafka
mod03
zk
kafka
mod06
zk
kafka
mod08
zk
kafka
mod02
zk
kafka
mod04
zk
kafka
dc 1
dc 2
Version 0.8.1
4 partitions
common client
java library
thrift
#kafkasummit @spjelkavik @audunstrand
2014 - Adoption and
complaining
low volume/ high
reliability
Ad Insert
Product Orchestration
Payment
Build Pipeline
click streams
mod05
zk
kafka
mod07
zk
kafka
mod01
zk
kafka
mod03
zk
kafka
mod06
zk
kafka
mod08
zk
kafka
mod02
zk
kafka
mod04
zk
kafka
dc 1
dc 2
Version 0.8.1
4 partitions
experimenting
with
configuration
common java
library
#kafkasummit @spjelkavik @audunstrand
tooling
alerting
#kafkasummit @spjelkavik @audunstrand
2015 - Migration and
consolidation
“reliable messaging”
asynchronous
communication
between services
store and forward
zipkin
slack notifications
dc 1
dc 2
Version 0.8.2
5-20 partitions
multiple
configurations
broker05
zk
kafka
broker01
zk
kafka
broker03
zk
kafka
broker04
zk
kafka
broker02
zk
kafka
#kafkasummit @spjelkavik @audunstrand
tooling
Grafana dashboard visualizing jmx stats
kafka-manager
kafka-cat
#kafkasummit @spjelkavik @audunstrand
2016 - Confluent
zk04 zk
broker01
broker05
kafka
kafka
broker03
kafka
broker04
kafka
broker02
kafka
zk05 zk
zk02 zk zk03 zk
zk01 zk
platform
schema registry
data replication
kafka connect
kafka streams
101* mistakes
“God gives the
nuts, but he
does not crack
them.”
― Franz Kafka
Pattern
Language
why is it a mistake
what is the consequence
what is the correct solution
what has finn.no done
Top 5
1. no consideration of data on the
inside vs outside
2. schema not externally defined
3. same config for every
client/topic
4. 128 partitions as default config
5. running on 8 overloaded nodes
#kafkasummit @spjelkavik @audunstrand
mistake:
no consideration of data on
the inside vs outside
https://flic.kr/p/6MjhUR
#kafkasummit @spjelkavik @audunstrand
why is it a mistake
everything published on Kafka (0.8.2) is visible to any client that can access
#kafkasummit @spjelkavik @audunstrand
what is the consequence
direct reads across services/domains is quite normal in legacy and/or enterprise
systems
coupling makes it hard to make changes
unknown and unwanted coupling has a cost
Kafka had no security per topic - you must add that yourself
#kafkasummit @spjelkavik @audunstrand
what is the correct solution
Consider what is data on the inside, versus data on the outside
Convention for what is private data and what is public data
If you want to change your internal representation often, map it before publishing it
publicly (Anti corruption layer)
#kafkasummit @spjelkavik @audunstrand
what has finn.no done
Decided on a naming convention (i.e Public.xyzzy) for public topics
Communicates the intention (contract)
#kafkasummit @spjelkavik @audunstrand
mistake:
schema not externally
defined
#kafkasummit @spjelkavik @audunstrand
why is it a mistake
data and code needs separate versioning strategies
version should be part of the data
defining schema in a java library makes it more difficult to access data from non-
jvm languages
very little discoverability of data, people chose other means to get their data
difficult to create tools
#kafkasummit @spjelkavik @audunstrand
what is the consequence
development speed outside jvm has been slow
change of data needs coordinated deployment
no process for data versioning, like backwards compatibility checks
difficult to create tooling that needs to know data format, like data
lake and database sinks
#kafkasummit @spjelkavik @audunstrand
what is the correct solution
confluent.io platform has a separate schema registry
apache avro
multiple compatibility settings and evolutions strategies
connect
Take complexity out of the applications
#kafkasummit @spjelkavik @audunstrand
what has finn.no done
still using java library, with schemas in builders
confluent platform 2.0 is planned for the next step, not (just) kafka 0.9
#kafkasummit @spjelkavik @audunstrand
mistake:
running mixed load with a
single, default configuration
https://flic.kr/p/qbarDR
#kafkasummit @spjelkavik @audunstrand
why is it a mistake
Historically - One Big Database with Expensive License
Database world - OLTP and OLAP
Changed with Open Source software and Cloud
Tried to simplify the developer's day with a single config
Kafka supports very high throughput and highly reliable
#kafkasummit @spjelkavik @audunstrand
what is the consequence
Trade off between throughput and degree of reliability
With a single configuration - the last commit wins
Either high throughput, and risk of loss - or potentially too slow
#kafkasummit @spjelkavik @audunstrand
what is the correct solution
Understand your use cases and their needs!
Use proper pr topic configuration
Consider splitting / isolation
#kafkasummit @spjelkavik @audunstrand
Defaults that are quite reliable
Exposing configuration variables in the client
Ask the questions;
● at least once delivery
● ordering - if you partition, what must have strict ordering
● 99% delivery - is that good enough?
● what level of throughput is needed
what has finn.no done
#kafkasummit @spjelkavik @audunstrand
Configuration
Configuration for production
● Partitions
● Replicas (default.replication.factor)
● Minimum ISR (min.insync.replicas)
● Wait for acknowledge when producing messages (request.required.acks, block.on.buffer.full)
● Retries
● Leader election
Configuration for consumer
● Number of threads
● When to commit (autocommit.enable vs consumer.commitOffsets)
#kafkasummit @spjelkavik @audunstrand
Gwen Shapira recommends...
● akcs = all
● block.on.buffer.full = true
● retries = MAX_INT
● max.inflight.requests.per.connect = 1
● Producer.close()
● replication-factor >= 3
● min.insync.replicas = 2
● unclean.leader.election = false
● auto.offset.commit = false
● commit after processing
● monitor!
#kafkasummit @spjelkavik @audunstrand
mistake:
default configuration of 128 partitions
for each topic
https://flic.kr/p/6KxPgZ
#kafkasummit @spjelkavik @audunstrand
why is it a mistake
partitions are kafkas way of scaling consumers, 128 partitions can handle 128
consumer processes
in 0.8; clusters could not reduce the number of partitions without deleting data
highest number of consumers today is 20
#kafkasummit @spjelkavik @audunstrand
what is the consequence
our 0.8 cluster was configured with 128 partitions as default, for all topics.
many partitions and many topics creates many datapoints that must be coordinated
zookeeper must coordinate all this
rebalance must balance all clients on all partitions
zookeeper and kafka went down (may 2015)
Users could note create ads for two days
#kafkasummit @spjelkavik @audunstrand
what is the correct solution
small number of partitions as default
increase number of partitions for selected topics
understand your use case (throughput target)
reduce length of transactions on consumer side
Max partitions on a broker => 1500 advised in our case - we had 38k
http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
#kafkasummit @spjelkavik @audunstrand
what has finn.no done
5 partitions as default
2 heavy-traffic topics have more than 5 partitions
#kafkasummit @spjelkavik @audunstrand
mistake:
deploy a proof of concept
hack - in production ; i.e
why we had 8 zk nodes
https://flic.kr/p/6eoSgT
#kafkasummit @spjelkavik @audunstrand
why is it a mistake
Kafka was set up by Ops for a proof of concept - not for hardened production use
By coincidence we had 8 nodes for kafka, the same 8 nodes for zookeeper
Zookeeper is dependent on a majority quorum, low latency between nodes
The 8 nodes were NOT dedicated - in fact - they were overloaded already
#kafkasummit @spjelkavik @audunstrand
what is the consequence
Zookeeper recommends 3 nodes for normal usage, 5 for high, and any more is
questionable
More nodes leads to longer time for finding consensus, more communication
If we get a split between data centers, there will be 4 in each
You should not run Zk between data centers, due to latency and outage
possibilities
#kafkasummit @spjelkavik @audunstrand
what is the correct solution
Have an odd number of Zookeeper nodes - preferrably 3, at most 5
Don’t cross data centers
Check the documentation before deploying serious production load
Don’t run a sensitive service (Zookeeper) on a server with 50 jvm-based services,
300% over committed on RAM
Watch GC times
#kafkasummit @spjelkavik @audunstrand
what has finn.no done
dc 1
dc 2
broker05
zk
kafka
broker01
zk
kafka
broker03
zk
kafka
broker04
zk
kafka
broker02
zk
kafka
Version 0.8.2
5-20 partitions
multiple
configurations
#kafkasummit @spjelkavik @audunstrand
“They say ignorance is
bliss.... they're wrong ”
― Franz Kafka
#kafkasummit @spjelkavik @audunstrand
References / Further reading
Designing data intensive systems, Martin Kleppmann
Data on the inside - data on the outside, Pat Helland
I Heart Logs, Jay Kreps
The Confluent Blog, http://confluent.io/
Kafka - The definitive guide
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations
http://www.finn.no/apply-here
http://www.schibsted.com/en/Career/
“It's only because of
their stupidity that
they're able to be so
sure of themselves.”
― Franz Kafka, The
Trial
Audun Fauchald Strand
@audunstrand
Henning Spjelkavik
@spjelkavik
http://www.finn.no/apply-here
http://www.schibsted.com/en/Career/
Q?
#kafkasummit @spjelkavik @audunstrand
Runner up
Using pre-1.0 software
Have control of topic creation
Kafka is storage - treat it like one also ops-wise
Client side rebalancing, misunderstood
Commiting on all consumer threads, believing that you only commited on one

More Related Content

What's hot

Kafkaesque days at linked in in 2015
Kafkaesque days at linked in in 2015Kafkaesque days at linked in in 2015
Kafkaesque days at linked in in 2015
Joel Koshy
 

What's hot (20)

Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into Overdrive
 
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
 
Kafkaesque days at linked in in 2015
Kafkaesque days at linked in in 2015Kafkaesque days at linked in in 2015
Kafkaesque days at linked in in 2015
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafka
 
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
 
101 ways to configure kafka - badly
101 ways to configure kafka - badly101 ways to configure kafka - badly
101 ways to configure kafka - badly
 
Exactly-once Semantics in Apache Kafka
Exactly-once Semantics in Apache KafkaExactly-once Semantics in Apache Kafka
Exactly-once Semantics in Apache Kafka
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebook
 
Kafka internals
Kafka internalsKafka internals
Kafka internals
 
How Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and StormHow Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and Storm
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Production Ready Kafka on Kubernetes (Devandra Tagare, Lyft) Kafka Summit SF ...
Production Ready Kafka on Kubernetes (Devandra Tagare, Lyft) Kafka Summit SF ...Production Ready Kafka on Kubernetes (Devandra Tagare, Lyft) Kafka Summit SF ...
Production Ready Kafka on Kubernetes (Devandra Tagare, Lyft) Kafka Summit SF ...
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Kafka aws
Kafka awsKafka aws
Kafka aws
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
 

Viewers also liked

Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
Jun Rao
 
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
Erik Onnen
 
WSO2Con US 2013 - Creating the API Centric Enterprise Towards a Connected Bus...
WSO2Con US 2013 - Creating the API Centric Enterprise Towards a Connected Bus...WSO2Con US 2013 - Creating the API Centric Enterprise Towards a Connected Bus...
WSO2Con US 2013 - Creating the API Centric Enterprise Towards a Connected Bus...
WSO2
 
IPAAS_information on your terms
IPAAS_information on your termsIPAAS_information on your terms
IPAAS_information on your terms
Market Engel SAS
 

Viewers also liked (20)

Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
 
More Datacenters, More Problems
More Datacenters, More ProblemsMore Datacenters, More Problems
More Datacenters, More Problems
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Deploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean FellowsDeploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean Fellows
 
101 mistakes FINN.no has made with Kafka (Baksida meetup)
101 mistakes FINN.no has made with Kafka (Baksida meetup)101 mistakes FINN.no has made with Kafka (Baksida meetup)
101 mistakes FINN.no has made with Kafka (Baksida meetup)
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...
 
Reducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive StreamsReducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive Streams
 
What's new in Confluent 3.2 and Apache Kafka 0.10.2
What's new in Confluent 3.2 and Apache Kafka 0.10.2 What's new in Confluent 3.2 and Apache Kafka 0.10.2
What's new in Confluent 3.2 and Apache Kafka 0.10.2
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
WSO2Con US 2013 - Creating the API Centric Enterprise Towards a Connected Bus...
WSO2Con US 2013 - Creating the API Centric Enterprise Towards a Connected Bus...WSO2Con US 2013 - Creating the API Centric Enterprise Towards a Connected Bus...
WSO2Con US 2013 - Creating the API Centric Enterprise Towards a Connected Bus...
 
SnapLogic Adds Support for Kafka and HDInsight to Elastic Integration Platform
SnapLogic Adds Support for Kafka and HDInsight to Elastic Integration PlatformSnapLogic Adds Support for Kafka and HDInsight to Elastic Integration Platform
SnapLogic Adds Support for Kafka and HDInsight to Elastic Integration Platform
 
Cloud-Con: Integration & Web APIs
Cloud-Con: Integration & Web APIsCloud-Con: Integration & Web APIs
Cloud-Con: Integration & Web APIs
 
IPAAS_information on your terms
IPAAS_information on your termsIPAAS_information on your terms
IPAAS_information on your terms
 
Anypoint mq (mulesoft) introduction
Anypoint mq (mulesoft)  introductionAnypoint mq (mulesoft)  introduction
Anypoint mq (mulesoft) introduction
 
IPaaS
IPaaSIPaaS
IPaaS
 
SnapLogic's Latest Elastic iPaaS Release Adds Hybrid Links for Spark, Cortana...
SnapLogic's Latest Elastic iPaaS Release Adds Hybrid Links for Spark, Cortana...SnapLogic's Latest Elastic iPaaS Release Adds Hybrid Links for Spark, Cortana...
SnapLogic's Latest Elastic iPaaS Release Adds Hybrid Links for Spark, Cortana...
 
Java Messaging Service
Java Messaging ServiceJava Messaging Service
Java Messaging Service
 
Cloud fuse-apachecon eu-2012
Cloud fuse-apachecon eu-2012Cloud fuse-apachecon eu-2012
Cloud fuse-apachecon eu-2012
 

Similar to 101 ways to configure kafka - badly (Kafka Summit)

Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
Joe Stein
 
jLove - A Change-Data-Capture use-case: designing an evergreen cache
jLove - A Change-Data-Capture use-case: designing an evergreen cachejLove - A Change-Data-Capture use-case: designing an evergreen cache
jLove - A Change-Data-Capture use-case: designing an evergreen cache
Nicolas Fränkel
 

Similar to 101 ways to configure kafka - badly (Kafka Summit) (20)

Dask and Machine Learning Models in Production - PyColorado 2019
Dask and Machine Learning Models in Production - PyColorado 2019Dask and Machine Learning Models in Production - PyColorado 2019
Dask and Machine Learning Models in Production - PyColorado 2019
 
What is Apache Kafka®?
What is Apache Kafka®?What is Apache Kafka®?
What is Apache Kafka®?
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
 
London In-Memory Computing Meetup - A Change-Data-Capture use-case: designing...
London In-Memory Computing Meetup - A Change-Data-Capture use-case: designing...London In-Memory Computing Meetup - A Change-Data-Capture use-case: designing...
London In-Memory Computing Meetup - A Change-Data-Capture use-case: designing...
 
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messagesMulti-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
 
Polyglot, fault-tolerant event-driven programming with kafka, kubernetes and ...
Polyglot, fault-tolerant event-driven programming with kafka, kubernetes and ...Polyglot, fault-tolerant event-driven programming with kafka, kubernetes and ...
Polyglot, fault-tolerant event-driven programming with kafka, kubernetes and ...
 
Arun
ArunArun
Arun
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
 
Apache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-PatternApache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-Pattern
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
 
jLove - A Change-Data-Capture use-case: designing an evergreen cache
jLove - A Change-Data-Capture use-case: designing an evergreen cachejLove - A Change-Data-Capture use-case: designing an evergreen cache
jLove - A Change-Data-Capture use-case: designing an evergreen cache
 
VM Forking and Hypervisor-based fuzzing
VM Forking and Hypervisor-based fuzzingVM Forking and Hypervisor-based fuzzing
VM Forking and Hypervisor-based fuzzing
 
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
 
Anomaly Detection at Scale
Anomaly Detection at ScaleAnomaly Detection at Scale
Anomaly Detection at Scale
 

More from Henning Spjelkavik

More from Henning Spjelkavik (20)

Hles 2021 Digital transformation - How to use digital tools to improve our ev...
Hles 2021 Digital transformation - How to use digital tools to improve our ev...Hles 2021 Digital transformation - How to use digital tools to improve our ev...
Hles 2021 Digital transformation - How to use digital tools to improve our ev...
 
Digital techlunsj hos FINN.no 2020-06-10
Digital techlunsj hos FINN.no 2020-06-10Digital techlunsj hos FINN.no 2020-06-10
Digital techlunsj hos FINN.no 2020-06-10
 
10 years of microservices at finn.no - why is that dragon still here (ndc o...
10 years of microservices at finn.no  - why is that dragon still here  (ndc o...10 years of microservices at finn.no  - why is that dragon still here  (ndc o...
10 years of microservices at finn.no - why is that dragon still here (ndc o...
 
How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018
How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018
How FINN became somewhat search engine friendly @ Oslo SEO meetup 2018
 
An approach to it in a high level event - IOF HLES 2017
An  approach to it in a high level event - IOF HLES 2017An  approach to it in a high level event - IOF HLES 2017
An approach to it in a high level event - IOF HLES 2017
 
Smidig 2016 - Er ledelse verdifullt likevel?
Smidig 2016 - Er ledelse verdifullt likevel?Smidig 2016 - Er ledelse verdifullt likevel?
Smidig 2016 - Er ledelse verdifullt likevel?
 
Geomatikkdagene 2016 - Kart på FINN.no
Geomatikkdagene 2016 - Kart på FINN.noGeomatikkdagene 2016 - Kart på FINN.no
Geomatikkdagene 2016 - Kart på FINN.no
 
IT for Event Directors
IT for Event DirectorsIT for Event Directors
IT for Event Directors
 
Hvorfor vi bør brenne gammel management litteratur
Hvorfor vi bør brenne gammel management litteraturHvorfor vi bør brenne gammel management litteratur
Hvorfor vi bør brenne gammel management litteratur
 
How we sleep well at night using Hystrix at Finn.no
How we sleep well at night using Hystrix at Finn.noHow we sleep well at night using Hystrix at Finn.no
How we sleep well at night using Hystrix at Finn.no
 
HLES 2015 It in a high level event
HLES 2015 It in a high level eventHLES 2015 It in a high level event
HLES 2015 It in a high level event
 
Strategisk design med "Impact Mapping"
Strategisk design med "Impact Mapping"Strategisk design med "Impact Mapping"
Strategisk design med "Impact Mapping"
 
Smidig 2014 - Impact Mapping - Levér det som teller
Smidig 2014 - Impact Mapping - Levér det som tellerSmidig 2014 - Impact Mapping - Levér det som teller
Smidig 2014 - Impact Mapping - Levér det som teller
 
Kart på FINN.no - Fra CGI til slippy map
Kart på FINN.no - Fra CGI til slippy mapKart på FINN.no - Fra CGI til slippy map
Kart på FINN.no - Fra CGI til slippy map
 
Arena and TV-production - at IOF Open Technical Meeting in Lavarone 2014
Arena and TV-production - at IOF Open Technical Meeting in Lavarone 2014Arena and TV-production - at IOF Open Technical Meeting in Lavarone 2014
Arena and TV-production - at IOF Open Technical Meeting in Lavarone 2014
 
Misbruk av målstyring
Misbruk av målstyringMisbruk av målstyring
Misbruk av målstyring
 
Jz2010 Hvordan enkel analyse kan øke stabiliteten og hastigheten
Jz2010 Hvordan enkel analyse kan øke stabiliteten og hastighetenJz2010 Hvordan enkel analyse kan øke stabiliteten og hastigheten
Jz2010 Hvordan enkel analyse kan øke stabiliteten og hastigheten
 
Fornebuløpet - Brosjyre
Fornebuløpet - BrosjyreFornebuløpet - Brosjyre
Fornebuløpet - Brosjyre
 
Fornebuløpet - Treningsprogram
Fornebuløpet - TreningsprogramFornebuløpet - Treningsprogram
Fornebuløpet - Treningsprogram
 
Verdistrømanalyse Smidig 2009
Verdistrømanalyse   Smidig 2009Verdistrømanalyse   Smidig 2009
Verdistrømanalyse Smidig 2009
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

101 ways to configure kafka - badly (Kafka Summit)

  • 1. 101* ways to configure Kafka - badly Audun Fauchald Strand Lead Developer Infrastructure @audunstrand bio: gof, mq, ejb, mda, wli, bpel eda, soa, ws*,esb, ddd Henning Spjelkavik Architect @spjelkavik bio: Skiinfo (Vail Resorts), FINN.no enjoys reading jstacks
  • 2. agenda introduction to kafka kafka @ finn.no 101* mistakes questions “From a certain point onward there is no longer any turning back. That is the point that must be reached.” ― Franz Kafka, The Trial
  • 3. Top 5 1. no consideration of data on the inside vs outside 2. schema not externally defined 3. same config for every client/topic 4. 128 partitions as default config 5. running on 8 overloaded nodes
  • 4. FINN.no 2nd largest website in norway classified ads ( Ebay, Zillow in one) 60 millions pageviews a day 80 microservices 130 developers 1000 deploys to production a week 6 minutes from commit to deploy (median)
  • 5. #kafkasummit @spjelkavik @audunstrand Schibsted Media Group 6800 people in 30 countries FINN.no is a part of
  • 8. #kafkasummit @spjelkavik @audunstrand in the beginning ... Architecture governance board decided to use RabbitMQ as message queue. Kafka was installed for a proof of concept, after developers spotted it januar 2013.
  • 9. #kafkasummit @spjelkavik @audunstrand 2013 - POC “High” volume Stream of classified ads Ad matching Ad indexed mod05 zk kafka mod07 zk kafka mod01 zk kafka mod03 zk kafka mod06 zk kafka mod08 zk kafka mod02 zk kafka mod04 zk kafka dc 1 dc 2 Version 0.8.1 4 partitions common client java library thrift
  • 10. #kafkasummit @spjelkavik @audunstrand 2014 - Adoption and complaining low volume/ high reliability Ad Insert Product Orchestration Payment Build Pipeline click streams mod05 zk kafka mod07 zk kafka mod01 zk kafka mod03 zk kafka mod06 zk kafka mod08 zk kafka mod02 zk kafka mod04 zk kafka dc 1 dc 2 Version 0.8.1 4 partitions experimenting with configuration common java library
  • 12. #kafkasummit @spjelkavik @audunstrand 2015 - Migration and consolidation “reliable messaging” asynchronous communication between services store and forward zipkin slack notifications dc 1 dc 2 Version 0.8.2 5-20 partitions multiple configurations broker05 zk kafka broker01 zk kafka broker03 zk kafka broker04 zk kafka broker02 zk kafka
  • 13. #kafkasummit @spjelkavik @audunstrand tooling Grafana dashboard visualizing jmx stats kafka-manager kafka-cat
  • 14. #kafkasummit @spjelkavik @audunstrand 2016 - Confluent zk04 zk broker01 broker05 kafka kafka broker03 kafka broker04 kafka broker02 kafka zk05 zk zk02 zk zk03 zk zk01 zk platform schema registry data replication kafka connect kafka streams
  • 15. 101* mistakes “God gives the nuts, but he does not crack them.” ― Franz Kafka
  • 16. Pattern Language why is it a mistake what is the consequence what is the correct solution what has finn.no done
  • 17. Top 5 1. no consideration of data on the inside vs outside 2. schema not externally defined 3. same config for every client/topic 4. 128 partitions as default config 5. running on 8 overloaded nodes
  • 18. #kafkasummit @spjelkavik @audunstrand mistake: no consideration of data on the inside vs outside https://flic.kr/p/6MjhUR
  • 19. #kafkasummit @spjelkavik @audunstrand why is it a mistake everything published on Kafka (0.8.2) is visible to any client that can access
  • 20. #kafkasummit @spjelkavik @audunstrand what is the consequence direct reads across services/domains is quite normal in legacy and/or enterprise systems coupling makes it hard to make changes unknown and unwanted coupling has a cost Kafka had no security per topic - you must add that yourself
  • 21. #kafkasummit @spjelkavik @audunstrand what is the correct solution Consider what is data on the inside, versus data on the outside Convention for what is private data and what is public data If you want to change your internal representation often, map it before publishing it publicly (Anti corruption layer)
  • 22. #kafkasummit @spjelkavik @audunstrand what has finn.no done Decided on a naming convention (i.e Public.xyzzy) for public topics Communicates the intention (contract)
  • 24. #kafkasummit @spjelkavik @audunstrand why is it a mistake data and code needs separate versioning strategies version should be part of the data defining schema in a java library makes it more difficult to access data from non- jvm languages very little discoverability of data, people chose other means to get their data difficult to create tools
  • 25. #kafkasummit @spjelkavik @audunstrand what is the consequence development speed outside jvm has been slow change of data needs coordinated deployment no process for data versioning, like backwards compatibility checks difficult to create tooling that needs to know data format, like data lake and database sinks
  • 26. #kafkasummit @spjelkavik @audunstrand what is the correct solution confluent.io platform has a separate schema registry apache avro multiple compatibility settings and evolutions strategies connect Take complexity out of the applications
  • 27. #kafkasummit @spjelkavik @audunstrand what has finn.no done still using java library, with schemas in builders confluent platform 2.0 is planned for the next step, not (just) kafka 0.9
  • 28. #kafkasummit @spjelkavik @audunstrand mistake: running mixed load with a single, default configuration https://flic.kr/p/qbarDR
  • 29. #kafkasummit @spjelkavik @audunstrand why is it a mistake Historically - One Big Database with Expensive License Database world - OLTP and OLAP Changed with Open Source software and Cloud Tried to simplify the developer's day with a single config Kafka supports very high throughput and highly reliable
  • 30. #kafkasummit @spjelkavik @audunstrand what is the consequence Trade off between throughput and degree of reliability With a single configuration - the last commit wins Either high throughput, and risk of loss - or potentially too slow
  • 31. #kafkasummit @spjelkavik @audunstrand what is the correct solution Understand your use cases and their needs! Use proper pr topic configuration Consider splitting / isolation
  • 32. #kafkasummit @spjelkavik @audunstrand Defaults that are quite reliable Exposing configuration variables in the client Ask the questions; ● at least once delivery ● ordering - if you partition, what must have strict ordering ● 99% delivery - is that good enough? ● what level of throughput is needed what has finn.no done
  • 33. #kafkasummit @spjelkavik @audunstrand Configuration Configuration for production ● Partitions ● Replicas (default.replication.factor) ● Minimum ISR (min.insync.replicas) ● Wait for acknowledge when producing messages (request.required.acks, block.on.buffer.full) ● Retries ● Leader election Configuration for consumer ● Number of threads ● When to commit (autocommit.enable vs consumer.commitOffsets)
  • 34. #kafkasummit @spjelkavik @audunstrand Gwen Shapira recommends... ● akcs = all ● block.on.buffer.full = true ● retries = MAX_INT ● max.inflight.requests.per.connect = 1 ● Producer.close() ● replication-factor >= 3 ● min.insync.replicas = 2 ● unclean.leader.election = false ● auto.offset.commit = false ● commit after processing ● monitor!
  • 35. #kafkasummit @spjelkavik @audunstrand mistake: default configuration of 128 partitions for each topic https://flic.kr/p/6KxPgZ
  • 36. #kafkasummit @spjelkavik @audunstrand why is it a mistake partitions are kafkas way of scaling consumers, 128 partitions can handle 128 consumer processes in 0.8; clusters could not reduce the number of partitions without deleting data highest number of consumers today is 20
  • 37. #kafkasummit @spjelkavik @audunstrand what is the consequence our 0.8 cluster was configured with 128 partitions as default, for all topics. many partitions and many topics creates many datapoints that must be coordinated zookeeper must coordinate all this rebalance must balance all clients on all partitions zookeeper and kafka went down (may 2015) Users could note create ads for two days
  • 38. #kafkasummit @spjelkavik @audunstrand what is the correct solution small number of partitions as default increase number of partitions for selected topics understand your use case (throughput target) reduce length of transactions on consumer side Max partitions on a broker => 1500 advised in our case - we had 38k http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
  • 39. #kafkasummit @spjelkavik @audunstrand what has finn.no done 5 partitions as default 2 heavy-traffic topics have more than 5 partitions
  • 40. #kafkasummit @spjelkavik @audunstrand mistake: deploy a proof of concept hack - in production ; i.e why we had 8 zk nodes https://flic.kr/p/6eoSgT
  • 41. #kafkasummit @spjelkavik @audunstrand why is it a mistake Kafka was set up by Ops for a proof of concept - not for hardened production use By coincidence we had 8 nodes for kafka, the same 8 nodes for zookeeper Zookeeper is dependent on a majority quorum, low latency between nodes The 8 nodes were NOT dedicated - in fact - they were overloaded already
  • 42. #kafkasummit @spjelkavik @audunstrand what is the consequence Zookeeper recommends 3 nodes for normal usage, 5 for high, and any more is questionable More nodes leads to longer time for finding consensus, more communication If we get a split between data centers, there will be 4 in each You should not run Zk between data centers, due to latency and outage possibilities
  • 43. #kafkasummit @spjelkavik @audunstrand what is the correct solution Have an odd number of Zookeeper nodes - preferrably 3, at most 5 Don’t cross data centers Check the documentation before deploying serious production load Don’t run a sensitive service (Zookeeper) on a server with 50 jvm-based services, 300% over committed on RAM Watch GC times
  • 44. #kafkasummit @spjelkavik @audunstrand what has finn.no done dc 1 dc 2 broker05 zk kafka broker01 zk kafka broker03 zk kafka broker04 zk kafka broker02 zk kafka Version 0.8.2 5-20 partitions multiple configurations
  • 45.
  • 46. #kafkasummit @spjelkavik @audunstrand “They say ignorance is bliss.... they're wrong ” ― Franz Kafka
  • 47. #kafkasummit @spjelkavik @audunstrand References / Further reading Designing data intensive systems, Martin Kleppmann Data on the inside - data on the outside, Pat Helland I Heart Logs, Jay Kreps The Confluent Blog, http://confluent.io/ Kafka - The definitive guide https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations http://www.finn.no/apply-here http://www.schibsted.com/en/Career/
  • 48. “It's only because of their stupidity that they're able to be so sure of themselves.” ― Franz Kafka, The Trial Audun Fauchald Strand @audunstrand Henning Spjelkavik @spjelkavik http://www.finn.no/apply-here http://www.schibsted.com/en/Career/ Q?
  • 49. #kafkasummit @spjelkavik @audunstrand Runner up Using pre-1.0 software Have control of topic creation Kafka is storage - treat it like one also ops-wise Client side rebalancing, misunderstood Commiting on all consumer threads, believing that you only commited on one