How LinkedIn used Kafka to scale Database Infrastructure
Basavaiah Thambara (Basu)
Staff SRE
https://www.linkedin.com/in/basavaiaht
Today’s agenda
Introduction to Espresso - DataStore
Espresso - Replication needs
Limitations of using MySQL Replication
Espresso Replication using Kafka
Advantages of using Kafka
How Kafka based replication works
Conclusion & References
Espresso
● Document store
● Built on top of MySQL
● Bridges gap between RDBMS & k-v stores
● Features
■ Multi-colo
■ Secondary Indexing
■ Schema evolution
■ Change data capture
■ ETL to and from Hadoop
● Use cases
■ Profiles, Invitations, InMails, etc.
Scale of usage
● 80% of site facing databases
● No. of clusters - 145, servers - ~19k
● No. of databases - ~300
● Data size - ~12 PB
● Peak QPS - ~3.4M on a single data store
Database Sharding
● Database
● Shard or partition
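As a rough illustration of how a database is split into shards (partitions), the sketch below maps a document key to a partition with a stable hash. The partition count, the key format, and the choice of MD5 are assumptions made for the example; the slides do not say which partitioning function Espresso uses.

```python
import hashlib

NUM_PARTITIONS = 12  # hypothetical partition count for one database


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a document key to a partition using a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the partition count.
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    for member_id in ("member:1001", "member:1002", "member:1003"):
        print(member_id, "->", partition_for(member_id))
```

The same key always lands on the same partition, which is what lets a router send a request to the node that owns it.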
Espresso Basic Architecture
● Storage node
● Apache Helix
● Zookeeper
● Router
● Client/application
The need for replication
● Read scaling
● High availability
● Disaster Recovery
● Multi-colo support
● Backups
Espresso - local replication
● MySQL replication
● Per node replication
● Master
● Slave
● Master serves
● Node failure
Old design
Espresso - cross colo replication
● Multi-colo writes
● Last writer wins
● Databus
● Data Replicator
● Colo failure
Old design
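The "last writer wins" rule for multi-colo writes can be sketched as follows. The `Version` fields and the tie-break on colo name are assumptions for illustration only; the slides do not describe how Espresso orders conflicting writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # hypothetical write timestamp
    colo: str          # hypothetical origin data center, used only to break ties


def resolve(local: Version, remote: Version) -> Version:
    """Keep the write with the later timestamp; break ties by colo name."""
    return max(local, remote, key=lambda v: (v.timestamp_ms, v.colo))


if __name__ == "__main__":
    a = Version("headline v1", 1000, "colo-a")
    b = Version("headline v2", 1005, "colo-b")
    print(resolve(a, b).value)
```

Because both colos apply the same deterministic rule, they converge on the same value regardless of the order in which the two writes arrive.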
Limitations of using MySQL replication
● Poor resource utilization
● Cluster expansion is complex
● Upon master failure, single node gets traffic
● Human intervention to bring up slaves
● A slaveless situation might lead to an outage
When the master goes down; when the slaves go down
● Databus operational complexity
● Databus maintenance cost
Espresso: Replication using Kafka
● Per partition replication
● Flexible partition placement
● Every node serves traffic
● Data replicator uses Kafka
New design
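To make "per partition replication" and "every node serves traffic" concrete, here is a minimal placement sketch. It uses a plain round-robin, which is an assumption for illustration and not the algorithm Helix actually applies; the node names and replica count are also hypothetical.

```python
from collections import Counter


def place_partitions(nodes, num_partitions, replicas=3):
    """Return {partition: [master, slave, ...]} using simple round-robin placement."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for p in range(num_partitions):
        # The first node in the list is the master; the rest are slaves.
        placement[p] = [nodes[(p + r) % len(nodes)] for r in range(replicas)]
    return placement


def masters_per_node(placement):
    """Count how many partitions each node masters."""
    return Counter(replica_list[0] for replica_list in placement.values())


if __name__ == "__main__":
    placement = place_partitions(["n1", "n2", "n3"], num_partitions=12)
    print(masters_per_node(placement))
```

With mastership decided per partition rather than per node, each node is master for some partitions and slave for others, so no node sits idle as a pure standby.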
Advantages of using Kafka
● Better h/w utilization
● Cluster expansion is easy:
■ add node(s) to cluster
■ rebalance
● No human intervention
● Node failure
■ parallel mastership handoff
■ parallel restore of slaves
● Databus complexity eliminated
● Huge cost savings
● Single platform for
■ internal replication
■ cross colo replication
Kafka-based replication
● Delivery must be
■ guaranteed
■ In-Order
■ Exactly Once
GTIDs and SCNs
● Global transaction identifier
● Unique
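One way to get in-order, effectively exactly-once application on top of a guaranteed (possibly repeated) delivery is to tag every event with a unique, increasing identifier and skip anything already applied. The sketch below assumes an SCN is a `(generation, sequence)` pair; that structure and the class names are assumptions for illustration, not something the slides specify.

```python
from typing import NamedTuple


class SCN(NamedTuple):
    generation: int  # assumed: bumped on each mastership change
    sequence: int    # assumed: increases with each committed transaction


class PartitionApplier:
    """Applies replicated events for one partition in order, skipping repeats."""

    def __init__(self):
        self.last_applied = SCN(0, 0)
        self.applied = []

    def offer(self, scn: SCN, payload: str) -> bool:
        # A redelivered or stale event has an SCN at or below the last applied one.
        if scn <= self.last_applied:
            return False
        self.applied.append(payload)
        self.last_applied = scn
        return True


if __name__ == "__main__":
    applier = PartitionApplier()
    for scn, payload in [(SCN(1, 1), "a"), (SCN(1, 2), "b"), (SCN(1, 2), "b"), (SCN(2, 1), "c")]:
        print(scn, payload, applier.offer(scn, payload))
```

Tuples compare element by element, so any event from a newer generation sorts after every event from an older one.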
Espresso Kafka Producer
● part of storage node
● Uses Open Replicator
● Single Threaded
Message protocol
Message protocol - Mastership Handoff
Producer Checkpointing
Producer configuration
● acks = “all”
● Infinite retries
● block.on.buffer.full = true
● max.in.flight.requests.per.connection = 1
● linger.ms = 0
● On non-retryable exception
■ destroy producer
■ create new producer
■ resume from last checkpoint
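The settings and the recovery steps above can be sketched as follows. The dictionary uses the Kafka producer property names; mapping "infinite retries" to the largest 32-bit integer is an assumption, and `block.on.buffer.full` is kept as written on the slide even though it is a legacy property. `FakeProducer`, `NonRetryableError`, and `replicate` are hypothetical stand-ins so the example runs without a broker, and checkpointing after every event is a simplification.

```python
PRODUCER_CONFIG = {
    "acks": "all",
    "retries": 2**31 - 1,  # assumed meaning of "infinite retries"
    "block.on.buffer.full": True,  # legacy property name, as on the slide
    "max.in.flight.requests.per.connection": 1,  # preserves ordering across retries
    "linger.ms": 0,
}


class NonRetryableError(Exception):
    """Stand-in for a producer error that retries cannot fix."""


class FakeProducer:
    """Hypothetical stand-in for a Kafka producer; fails on one chosen event."""

    def __init__(self, config, log, fail_on=None):
        self.config = config
        self.log = log
        self.fail_on = fail_on

    def send(self, event):
        if event == self.fail_on:
            raise NonRetryableError(event)
        self.log.append(event)


def replicate(events, log, fail_on=None):
    """Send events in order; on a non-retryable error, rebuild the producer and resume."""
    checkpoint = 0  # index of the next event to send
    producer = FakeProducer(PRODUCER_CONFIG, log, fail_on)
    while checkpoint < len(events):
        try:
            producer.send(events[checkpoint])
            checkpoint += 1
        except NonRetryableError:
            # Destroy the old producer, create a new one, resume from the checkpoint.
            producer = FakeProducer(PRODUCER_CONFIG, log)
    return checkpoint


if __name__ == "__main__":
    log = []
    replicate(["e1", "e2", "e3"], log, fail_on="e2")
    print(log)
```

Limiting in-flight requests to one trades throughput for ordering: a retried batch can never overtake a later one.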
Espresso Kafka event Consumer
Kafka Consumer
Zombie Write Filtering
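The slides for this topic are diagrams, so the following is only a guess at the general idea: a "zombie" is taken here to mean an old master that keeps producing after mastership has moved on, and a consumer drops its writes by comparing a per-message generation number. The generation field and the `ZombieFilter` class are assumptions for illustration.

```python
class ZombieFilter:
    """Drops writes stamped with a generation older than the newest one seen."""

    def __init__(self):
        self.current_generation = 0

    def accept(self, generation: int) -> bool:
        if generation < self.current_generation:
            return False  # write from a superseded master
        self.current_generation = generation
        return True


if __name__ == "__main__":
    zombie_filter = ZombieFilter()
    # Generation 1 is the old master; generation 2 takes over mid-stream.
    for generation in [1, 1, 2, 1, 2]:
        print(generation, zombie_filter.accept(generation))
```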
Kafka broker config and spec
● Kafka broker config
■ replication factor = 3
■ min.insync.replicas = 2
■ Unclean leader election disabled
● Kafka broker node spec
■ 256GB RAM
■ 8 cores, Intel @ 2.00 GHz
■ 19 TB HDD with RAID
■ OS - RHEL 6
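A small sketch of what the broker settings above imply when combined with a producer that uses acks=all: a write is accepted only while at least min.insync.replicas replicas are in sync, so a replication factor of 3 with a minimum of 2 tolerates one replica being down. The function names are hypothetical.

```python
REPLICATION_FACTOR = 3
MIN_INSYNC_REPLICAS = 2


def write_accepted(in_sync_replicas: int, min_isr: int = MIN_INSYNC_REPLICAS) -> bool:
    """With acks=all, a produce request is rejected if the ISR is smaller than min_isr."""
    return in_sync_replicas >= min_isr


def tolerated_failures(rf: int = REPLICATION_FACTOR, min_isr: int = MIN_INSYNC_REPLICAS) -> int:
    """Replicas that can be lost while the partition still accepts writes."""
    return rf - min_isr


if __name__ == "__main__":
    for isr in (3, 2, 1):
        print("ISR size", isr, "-> write accepted:", write_accepted(isr))
    print("tolerated failures:", tolerated_failures())
```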
Kafka replication stats
● Kafka cluster in each colo
● No. of Kafka brokers - 336
● Peak 500 MB per sec, 36 TB per day
● Peak 1.5M messages per sec, 34 billion per day
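As a sanity check on the figures above, the daily totals can be converted to average rates and compared with the quoted peaks. The calculation assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the slides do not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = 36e12     # 36 TB per day, decimal units assumed
messages_per_day = 34e9   # 34 billion messages per day

avg_mb_per_sec = bytes_per_day / SECONDS_PER_DAY / 1e6
avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
avg_message_bytes = bytes_per_day / messages_per_day

if __name__ == "__main__":
    print(f"average throughput: {avg_mb_per_sec:.0f} MB/s (peak quoted: 500 MB/s)")
    print(f"average rate: {avg_msgs_per_sec:,.0f} messages/s (peak quoted: 1.5M/s)")
    print(f"average message size: {avg_message_bytes:.0f} bytes")
```

Under these assumptions the average works out to roughly 417 MB/s and about 394k messages per second, with an average message size of a little over 1 KB.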
Conclusion
● Kafka is used for database replication at scale
● LinkedIn leveraged Kafka to scale Espresso
● Kafka helped to unify data pipelines
● Saved $$$
References
1. https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store
2. https://engineering.linkedin.com/blog/2016/04/kafka-ecosystem-at-linkedin
3. https://www.slideshare.net/ConfluentInc/espresso-database-replication-with-kafka-tom-quiggle
4. https://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
Q&A?