Apache Kafka at LinkedIn

©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure

About Me

Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A

We Have a lot of Data
• User activity tracking
– Page views, ad impressions, etc.
• Server logs and metrics
– Syslogs, request rates, etc.
• Messaging
– Emails, news feeds, etc.
• Computation derived
– Results of Hadoop / data warehousing, etc.

.. and We Build Products on Data

Newsfeed

Recommendation
HADOOP SUMMIT 2013
People you may know

Recommendation

Search

Metrics and Monitoring
System and application metrics/logging

.. and a LOT of Monitoring

The Problem:
How to integrate this variety of data
and make it available to all products?

Life back in 2010:
Point-to-Point Pipelines

Example: User Activity Data Flow

What We Want
• A centralized data pipeline

Apache Kafka
We tried some off-the-shelf systems, but…

What We REALLY Want
• A centralized data pipeline that is
– Elastically scalable
– Durable
– High-throughput
– Easy to use

• A distributed pub-sub messaging system
– Scale-out from the ground up
– Persistent to disks
– High throughput (tens of MB/sec per server)
Apache Kafka

Life Since Kafka in Production
Apache Kafka
• Developed and maintained by 5 Devs + 2 SRE

Agenda
• Kafka Design
• Roadmap
• Q & A

Key Idea #1:
Data-parallelism leads to scale-out

• Produce/consume requests are randomly balanced among brokers
Distribute Clients across Partitions
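
To make the data-parallelism idea concrete, here is a minimal sketch of how messages could be spread over partitions that live on different brokers. The class name, the hash-based choice for keyed messages, and the round-robin placement of partitions on brokers are assumptions for illustration; the slide only says that requests are randomly balanced among brokers.

```java
import java.util.Random;

/** Hypothetical sketch: spreads messages over partitions hosted on different brokers. */
final class PartitionSketch {
    private final int numPartitions;
    private final int numBrokers;
    private final Random random;

    PartitionSketch(int numPartitions, int numBrokers, long seed) {
        this.numPartitions = numPartitions;
        this.numBrokers = numBrokers;
        this.random = new Random(seed);
    }

    /** Unkeyed messages go to a random partition; keyed ones (an assumption) hash to a fixed one. */
    int partitionFor(String key) {
        if (key == null) {
            return random.nextInt(numPartitions);
        }
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Assumed placement: partitions are laid out round-robin across brokers. */
    int brokerFor(int partition) {
        return partition % numBrokers;
    }

    public static void main(String[] args) {
        PartitionSketch sketch = new PartitionSketch(8, 4, 42L);
        int[] perBroker = new int[4];
        for (int i = 0; i < 10_000; i++) {
            perBroker[sketch.brokerFor(sketch.partitionFor(null))]++;
        }
        for (int b = 0; b < perBroker.length; b++) {
            System.out.println("broker " + b + ": " + perBroker[b] + " messages");
        }
    }
}
```

Because each partition is served independently, adding brokers and partitions adds capacity without any coordination on the produce or consume path.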

Key Idea #2:
Disks are fast when used sequentially

• Appends are effectively O(1)
• Reads from a known offset are still fast when cached
Store Messages as a Log
[Diagram: Partition i of Topic A drawn as an append-only log of offsets 3 through 12. The producer writes at the tail; Consumer1 and Consumer2 each read from their own position (offset 7 in the figure).]
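
The log idea can be sketched in a few lines. This is an illustration only: an in-memory list stands in for the on-disk files a broker would use, and the class and method names are hypothetical. What it shows is that a write is an append at the tail, and that each consumer reads from an offset it tracks itself, so consumers do not interfere with one another.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for one partition's append-only log. */
final class LogSketch {
    private final List<String> messages = new ArrayList<>();

    /** Appends at the tail and returns the offset assigned to the message. */
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns up to maxMessages starting at offset; reading never modifies the log. */
    List<String> read(long offset, int maxMessages) {
        int from = (int) Math.max(0, Math.min(offset, messages.size()));
        int to = Math.min(from + maxMessages, messages.size());
        return new ArrayList<>(messages.subList(from, to));
    }

    public static void main(String[] args) {
        LogSketch log = new LogSketch();
        for (int i = 0; i < 12; i++) {
            log.append("message-" + i);
        }
        // Two consumers keep their own positions and read independently.
        long consumer1Offset = 7;
        long consumer2Offset = 7;
        System.out.println("consumer1: " + log.read(consumer1Offset, 2));
        System.out.println("consumer2: " + log.read(consumer2Offset, 5));
    }
}
```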

Key Idea #3:
Batching makes best use of network/IO

• Batched send and receive
• Batched compression
• No message caching in JVM
• Zero-copy from file to socket (Java NIO)
Batch Transfer
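
As a rough illustration of why batched compression helps, the sketch below compresses a set of similar messages once as a batch and once message by message, using the JDK's GZIP stream. The class name, the newline framing, and the sample messages are assumptions for this example and are not Kafka's wire format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch comparing batched and per-message compression. */
final class BatchSketch {
    /** Compresses the whole batch in one GZIP stream, one message per line. */
    static byte[] compressBatch(List<String> messages) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            for (String message : messages) {
                gzip.write(message.getBytes(StandardCharsets.UTF_8));
                gzip.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Total size when every message is compressed on its own. */
    static int compressedSizeOneByOne(List<String> messages) {
        int total = 0;
        for (String message : messages) {
            total += compressBatch(Collections.singletonList(message)).length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            batch.add("{\"event\":\"page_view\",\"member\":" + i + "}");
        }
        System.out.println("batched bytes:     " + compressBatch(batch).length);
        System.out.println("one-by-one bytes:  " + compressedSizeOneByOne(batch));
    }
}
```

Messages of the same type share most of their structure, so a compressor that sees the whole batch can exploit the redundancy across messages, and the fixed per-stream overhead is paid once instead of once per message.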

The API (0.8)
Producer:
    send(topic, message)
Consumer:
    Iterable stream = createMessageStreams(…).get(topic)
    for (message : stream) {
        // process the message
    }

Agenda
• Kafka Design
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A

Kafka Usage at LinkedIn
• Mainly used for tracking user-activity and metrics data
• 16 - 32 brokers in each cluster (615+ total brokers)
• 527 billion messages/day
• 7500+ topics, 270k+ partitions
• Byte rates:
– Writes: 97 TB/day
– Reads: 430 TB/day

Agenda
• Kafka Design
• O(1) ETL
• Roadmap
• Q & A

Problems
• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?

Standardized Schema on Avro
• Schema
– Message structure contract
– Performance gain
• Workflow
– Check in schema
– Auto compatibility check
– Code review
– “Ship it!”
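
The automatic compatibility check in the workflow above could look roughly like the sketch below. It is a deliberate simplification and not the Avro library's API: a schema is reduced to a map from field name to "has a default value", and the only rule checked is that a field added in the new schema must carry a default so that data written with the old schema can still be read. Type changes, aliases, and nested records are ignored, and all names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified backward-compatibility check for record schemas. */
final class SchemaCompatSketch {
    /**
     * Each map goes from field name to whether the field declares a default.
     * Returns true if a reader using newFields can read data written with oldFields.
     */
    static boolean isBackwardCompatible(Map<String, Boolean> oldFields,
                                        Map<String, Boolean> newFields) {
        for (Map.Entry<String, Boolean> field : newFields.entrySet()) {
            boolean addedField = !oldFields.containsKey(field.getKey());
            boolean hasDefault = field.getValue();
            if (addedField && !hasDefault) {
                return false; // old data has no value for this field and nothing to fall back on
            }
        }
        return true; // removed fields are simply ignored by the new reader
    }

    public static void main(String[] args) {
        Map<String, Boolean> v1 = new LinkedHashMap<>();
        v1.put("memberId", false);
        v1.put("pageKey", false);

        Map<String, Boolean> v2 = new LinkedHashMap<>(v1);
        v2.put("referrer", true); // added with a default

        System.out.println("v1 -> v2 compatible: " + isBackwardCompatible(v1, v2));
    }
}
```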

Agenda
• Kafka Design
• O(1) ETL
• Roadmap
• Q & A

Kafka to Hadoop

Hadoop ETL (Camus)
• Map/Reduce job does data load
• One job loads all events
• ~10 minute ETA on average from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently
• Open sourced:
– https://github.com/linkedin/camus

Agenda
• Kafka Design
• O(1) ETL
• Roadmap
• Q & A

Does it really work?
“All published messages must be delivered to all consumers (quickly)”
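
One way to check this requirement is to count messages per time window on the producing side and on the consuming side, then compare the counts. The slides do not describe the audit mechanism, so the sketch below is an assumption about how such a check might be structured; the class name, the window size, and the in-memory counters are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical audit sketch: compares produced and consumed counts per time window. */
final class AuditSketch {
    private final long windowMs;
    private final Map<Long, Long> produced = new HashMap<>();
    private final Map<Long, Long> consumed = new HashMap<>();

    AuditSketch(long windowMs) {
        this.windowMs = windowMs;
    }

    void recordProduced(long eventTimeMs) {
        produced.merge(eventTimeMs / windowMs, 1L, Long::sum);
    }

    void recordConsumed(long eventTimeMs) {
        consumed.merge(eventTimeMs / windowMs, 1L, Long::sum);
    }

    /** Window index to the number of messages produced but not yet seen by the consumer. */
    Map<Long, Long> missingByWindow() {
        Map<Long, Long> missing = new TreeMap<>();
        for (Map.Entry<Long, Long> window : produced.entrySet()) {
            long gap = window.getValue() - consumed.getOrDefault(window.getKey(), 0L);
            if (gap != 0) {
                missing.put(window.getKey(), gap);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        AuditSketch audit = new AuditSketch(600_000L); // 10-minute windows
        audit.recordProduced(1_000L);
        audit.recordProduced(2_000L);
        audit.recordConsumed(1_000L);
        System.out.println("missing per window: " + audit.missingByWindow());
    }
}
```

Keying the counts by the event's own timestamp rather than by arrival time lets the two sides agree on which window a message belongs to, even when delivery is delayed.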

More Features in Kafka 0.8
• Intra-cluster replication (0.8.0)
– High availability
– Reduced latency
• Log compaction (0.8.1)
– State storage
• Operational tools (0.8.2)
– Topic management
– Automated leader rebalance
– etc.
Check out our page for more: http://kafka.apache.org/

Kafka 0.9
• Clients Rewrite
– Remove ZK dependency
– Even better throughput
• Security
– More operability, multi-tenancy ready
• Transactional Messaging
– From at-least-once to exactly-once
Check out our page for more: http://kafka.apache.org/

Kafka Users: Next Maybe You?

Acknowledgements

Questions? Guozhang Wang
guwang@linkedin.com
www.linkedin.com/in/guozhangwang

Real-time Analysis with Kafka
• Analytics from Hadoop can be slow
– Production -> Kafka: tens of milliseconds
– Kafka -> Hadoop: < 1 minute
– ETL in Hadoop: ~45 minutes
– MapReduce in Hadoop: maybe hours

Real-time Analysis with Kafka
• Solution No. 1: consume directly from Kafka
• Solution No. 2: storage other than HDFS
– Spark, Shark
– Pinot, Druid, FastBit
• Solution No. 3: stream processing
– Apache Samza
– Storm

How Fast can Kafka Go?
• Bottleneck #1: network bandwidth
– Producer: 100 Mb/s for 1 Gig-Ethernet
– Consumer can be slower due to multiple subscribers
• Bottleneck #2: disk space
– Data may be deleted before it is consumed at peak time
– Configurable time/size-based retention policy
• Bottleneck #3: Zookeeper
– Mainly due to offset commit, will be lifted in 0.9
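
The disk-space bottleneck follows from how a time/size-based retention policy works: old data is dropped by age or total size, regardless of whether every consumer has read it. The sketch below illustrates that behavior under assumed names and thresholds; it models the log as a queue of segments and is not the broker's actual cleanup code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a combined time- and size-based retention policy. */
final class RetentionSketch {
    private static final class Segment {
        final long lastWriteMs;
        final long sizeBytes;

        Segment(long lastWriteMs, long sizeBytes) {
            this.lastWriteMs = lastWriteMs;
            this.sizeBytes = sizeBytes;
        }
    }

    private final Deque<Segment> segments = new ArrayDeque<>(); // oldest first
    private final long retentionMs;
    private final long retentionBytes;
    private long totalBytes;

    RetentionSketch(long retentionMs, long retentionBytes) {
        this.retentionMs = retentionMs;
        this.retentionBytes = retentionBytes;
    }

    void addSegment(long lastWriteMs, long sizeBytes) {
        segments.addLast(new Segment(lastWriteMs, sizeBytes));
        totalBytes += sizeBytes;
    }

    int segmentCount() {
        return segments.size();
    }

    /** Deletes oldest segments while they are too old or the log is too large; returns the count. */
    int enforce(long nowMs) {
        int deleted = 0;
        while (!segments.isEmpty()) {
            Segment oldest = segments.peekFirst();
            boolean tooOld = nowMs - oldest.lastWriteMs > retentionMs;
            boolean tooBig = totalBytes > retentionBytes;
            if (!tooOld && !tooBig) {
                break;
            }
            segments.pollFirst(); // consumer progress is not consulted here
            totalBytes -= oldest.sizeBytes;
            deleted++;
        }
        return deleted;
    }

    public static void main(String[] args) {
        RetentionSketch log = new RetentionSketch(1_000L, 250L);
        log.addSegment(0L, 100L);
        log.addSegment(500L, 100L);
        log.addSegment(900L, 100L);
        System.out.println("deleted: " + log.enforce(1_200L) + ", remaining: " + log.segmentCount());
    }
}
```

If a traffic peak fills the size budget faster than a slow consumer reads, the oldest segments disappear first, which is the risk the slide points out.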

Intra-cluster Replication
• Pick CA (consistency and availability) within a datacenter (failover < 10ms)
– Network partition is rare
– Latency is less of an issue
• Separate data replication and consensus
– Consensus => Zookeeper
– Replication => primary-backup (f replicas tolerate f-1 failures)
• Configurable ACK (durability vs. latency)
• More details:
– http://www.slideshare.net/junrao/kafka-replication-apachecon2013
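
The durability versus latency trade-off of the configurable ACK can be sketched as follows. The three modes are modeled on the 0, 1, and -1 values of the 0.8 producer's request.required.acks setting; the class and method names are hypothetical, and the sketch only computes what a producer would wait for rather than performing any replication.

```java
/** Hypothetical sketch of the acknowledgement trade-off under primary-backup replication. */
final class AckSketch {
    /** Number of replica writes a producer waits for before treating a send as complete. */
    static int writesToWaitFor(int requiredAcks, int inSyncReplicas) {
        switch (requiredAcks) {
            case 0:
                return 0;              // no acknowledgement: lowest latency, no durability guarantee
            case 1:
                return 1;              // leader only: lost if the leader fails before followers copy it
            case -1:
                return inSyncReplicas; // all in-sync replicas: highest durability, highest latency
            default:
                throw new IllegalArgumentException("unsupported acks value in this sketch: " + requiredAcks);
        }
    }

    /** With f replicas holding a message, it survives as long as one of them is alive. */
    static int failuresTolerated(int replicasWithMessage) {
        return Math.max(0, replicasWithMessage - 1);
    }

    public static void main(String[] args) {
        int inSyncReplicas = 3;
        for (int acks : new int[] {0, 1, -1}) {
            int writes = writesToWaitFor(acks, inSyncReplicas);
            System.out.println("acks=" + acks + " waits for " + writes
                    + " write(s), tolerates " + failuresTolerated(writes) + " failure(s)");
        }
    }
}
```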

Replication Architecture
[Diagram: producers and consumers connect to a cluster of four brokers, which coordinate through ZooKeeper (ZK).]

Editor's Notes