Apache Kafka at LinkedIn
About Me
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Why Did We Build Kafka?
We Have a lot of Data
• User activity tracking
• Page views, ad impressions, etc
• Server logs and metrics
• Syslogs, request-rates, etc
• Messaging
• Emails, news feeds, etc
• Computation derived
• Results of Hadoop / data warehousing, etc
.. and We Build Products on Data
Newsfeed
Recommendation
People you may know
Recommendation
Search
Metrics and Monitoring
System and application metrics/logging
.. and a LOT of Monitoring
The Problem:
How to integrate this variety of data and make it available to all products?
Life back in 2010:
Point-to-Point Pipelines
Example: User Activity Data Flow
What We Want
• A centralized data pipeline
Apache Kafka
We tried some systems off-the-shelf, but…
What We REALLY Want
• A centralized data pipeline that is
• Elastically scalable
• Durable
• High-throughput
• Easy to use
Apache Kafka
• A distributed pub-sub messaging system
• Scale-out from the ground up
• Persistent to disks
• High throughput (10s of MB/sec per server)
Life Since Kafka in Production
[Diagram: Apache Kafka as the centralized pipeline in production]
• Developed and maintained by 5 Devs + 2 SREs
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Key Idea #1:
Data-parallelism leads to scale-out
Distribute Clients across Partitions
• Produce/consume requests are randomly balanced among brokers
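To make the data-parallelism concrete, here is a minimal sketch (illustration only, not LinkedIn's or Kafka's actual partitioner) of how a keyed message could be routed to one of a topic's partitions, and how partitions spread across brokers so that adding brokers adds capacity:

// Hypothetical partitioning sketch: a topic with P partitions spread over B brokers.
public final class PartitioningSketch {
    // Route a message key to a partition (keyless messages could be assigned round-robin instead).
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
    // Simplified placement: partition i lives on broker i mod B.
    static int brokerFor(int partition, int numBrokers) {
        return partition % numBrokers;
    }
    public static void main(String[] args) {
        int p = partitionFor("member-42", 8);
        System.out.println("partition " + p + " on broker " + brokerFor(p, 4));
    }
}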
Key Idea #2:
Disks are fast when used sequentially
Store Messages as a Log
• Appends are effectively O(1)
• Reads from a known offset are still fast, especially when served from cache
[Diagram: Partition i of Topic A as an append-only log of offsets 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...; the producer writes at the tail, while Consumer1 and Consumer2 each read from offset 7.]
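A minimal sketch of the append-only idea, assuming nothing about Kafka's actual on-disk format: writes always go to the end of the file (sequential I/O), and a reader that knows an offset can seek straight to it.

// Append-only log sketch (illustration only): sequential appends, reads from a known offset.
import java.io.RandomAccessFile;

public final class LogSketch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile log = new RandomAccessFile("partition-0.log", "rw")) {
            log.seek(log.length());
            log.writeUTF("message-0");                 // sequential append
            long offsetOfSecond = log.getFilePointer();
            log.writeUTF("message-1");

            log.seek(offsetOfSecond);                  // jump to a known offset
            System.out.println(log.readUTF());         // prints "message-1"
        }
    }
}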
Key Idea #3:
Batching makes best use of network/IO
Batch Transfer
• Batched send and receive
• Batched compression
• No message caching in JVM
• Zero-copy from file to socket (Java NIO)
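The zero-copy bullet refers to Java NIO's FileChannel.transferTo, which lets the kernel move bytes from a file to a socket without copying them through user space. A rough sketch (the file name, host, and port here are made up):

// Zero-copy transfer sketch using java.nio; not Kafka's actual network code.
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public final class ZeroCopySketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = new FileInputStream("partition-0.log").getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9092))) {
            long position = 0, remaining = file.size();
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}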
The API (0.8)
Producer:
send(topic, message)
Consumer:
Iterable stream = createMessageStreams(…).get(topic)
for (message: stream) {
// process the message
}
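For readers who want slightly more context than the simplified pseudocode above, this is roughly how the 0.8-era Java clients were wired up. The package names, classes, and config keys are recalled from the 0.8 documentation and may differ slightly across releases, so treat this as a sketch rather than a reference:

// Producer (0.8 producer API)
Properties producerProps = new Properties();
producerProps.put("metadata.broker.list", "broker1:9092,broker2:9092");
producerProps.put("serializer.class", "kafka.serializer.StringEncoder");
Producer<String, String> producer = new Producer<>(new ProducerConfig(producerProps));
producer.send(new KeyedMessage<>("page-views", "one page-view event"));

// Consumer (0.8 high-level consumer; offsets tracked in ZooKeeper)
Properties consumerProps = new Properties();
consumerProps.put("zookeeper.connect", "zk1:2181");
consumerProps.put("group.id", "my-consumer-group");
ConsumerConnector connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(consumerProps));
KafkaStream<byte[], byte[]> stream =
    connector.createMessageStreams(Collections.singletonMap("page-views", 1))
             .get("page-views").get(0);
for (MessageAndMetadata<byte[], byte[]> record : stream) {
    // process record.message()
}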
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Kafka Usage at LinkedIn
• Mainly used for tracking user-activity and metrics data
• 16 - 32 brokers in each cluster (615+ total brokers)
• 527 billion messages/day
• 7500+ topics, 270k+ partitions
• Byte rates:
• Writes: 97 TB/day
• Reads: 430 TB/day
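For a rough sense of scale (back-of-the-envelope arithmetic, not figures from the deck): 527 billion messages over 86,400 seconds is about 6 million messages per second on average, 97 TB/day of writes is roughly 1.1 GB/s, and 430 TB/day of reads is roughly 5 GB/s, spread across the 615+ brokers.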
Kafka Usage at LinkedIn
[Diagrams: the Kafka pipeline deployment at LinkedIn, including the multi-colo setup]
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Problems
• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?
Standardized Schema on Avro
• Schema
• Message structure contract
• Performance gain
• Workflow
• Check in schema
• Auto compatibility check
• Code review
• “Ship it!”
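As an illustration of the schema-as-contract idea (a made-up example, not one of LinkedIn's schemas), an Avro record for a page-view event might look like the following; adding the optional field with a default is the kind of change an automated compatibility check would accept:

{
  "type": "record",
  "name": "PageViewEvent",
  "namespace": "com.example.events",
  "fields": [
    {"name": "memberId",  "type": "long"},
    {"name": "pageKey",   "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "referrer",  "type": ["null", "string"], "default": null}
  ]
}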
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Kafka to Hadoop
Hadoop ETL (Camus)
• Map/Reduce job does data load
• One job loads all events
• ~10 minute ETA on average from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently
• Open sourced:
– https://github.com/linkedin/camus
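Camus runs as an ordinary MapReduce job. Roughly, it is submitted like the sketch below; the main class name and the -P properties flag are recalled from the Camus README, so check the repository above for the exact invocation and property names:

# Submit the Camus ETL job with a properties file describing the Kafka cluster and HDFS paths.
hadoop jar camus-example-<version>-SNAPSHOT-shaded.jar \
    com.linkedin.camus.etl.kafka.CamusJob -P camus.properties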
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Does it really work?
“All published messages must be delivered to all consumers (quickly)”
Audit Trail
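The audit trail boils down to counting: each tier periodically reports how many messages it saw per topic and time bucket, and an auditor compares the counts across tiers to detect loss or lag. A toy sketch of that reconciliation (illustration only, not LinkedIn's audit system):

// Toy audit reconciliation: compare per-tier message counts for a topic/time bucket.
import java.util.HashMap;
import java.util.Map;

public final class AuditSketch {
    static final Map<String, Map<String, Long>> countsByTier = new HashMap<>();

    static void report(String tier, String topic, long bucket, long count) {
        countsByTier.computeIfAbsent(tier, t -> new HashMap<>())
                    .merge(topic + "@" + bucket, count, Long::sum);
    }

    static boolean complete(String topic, long bucket) {
        long produced = countsByTier.getOrDefault("producer", new HashMap<>())
                                    .getOrDefault(topic + "@" + bucket, 0L);
        long landed = countsByTier.getOrDefault("hadoop", new HashMap<>())
                                  .getOrDefault(topic + "@" + bucket, 0L);
        return produced == landed;   // in practice, alert when the ratio drops below a threshold
    }

    public static void main(String[] args) {
        report("producer", "page-views", 1001, 100_000);
        report("hadoop",   "page-views", 1001, 100_000);
        System.out.println("complete? " + complete("page-views", 1001));
    }
}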
More Features in Kafka 0.8
• Intra-cluster replication (0.8.0)
• High availability
• Reduced latency
• Log compaction (0.8.1)
• State storage
• Operational tools (0.8.2)
• Topic management
• Automated leader rebalance
• etc ..
Check out our page for more: http://kafka.apache.org/
Kafka 0.9
• Clients Rewrite
• Remove ZK dependency
• Even better throughput
• Security
• More operability, multi-tenancy ready
• Transactional Messaging
• From at-least-once to exactly-once
Check out our page for more: http://kafka.apache.org/
Kafka Users: Next Maybe You?
Acknowledgements
Questions?
Guozhang Wang
guwang@linkedin.com
www.linkedin.com/in/guozhangwang
Backup Slides
Real-time Analysis with Kafka
• Analytics from Hadoop can be slow
• Production -> Kafka: tens of milliseconds
• Kafka -> Hadoop: < 1 minute
• ETL in Hadoop: ~ 45 minutes
• MapReduce in Hadoop: maybe hours
Real-time Analysis with Kafka
• Solution No. 1: directly consuming from Kafka
• Solution No. 2: other storage than HDFS
• Spark, Shark
• Pinot, Druid, FastBit
• Solution No. 3: stream processing
• Apache Samza
• Storm
How Fast can Kafka Go?
• Bottleneck #1: network bandwidth
• Producer: ~100 MB/s on 1 Gigabit Ethernet
• Consumers can be slower due to multiple subscriptions
• Bottleneck #2: disk space
• Data may be deleted before it is consumed at peak times
• Configurable time/size-based retention policy (see the sketch below)
• Bottleneck #3: ZooKeeper
• Mainly due to offset commits; will be lifted in 0.9
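As referenced above, here is a sketch of how time/size-based retention is typically expressed in the broker configuration; these property names are recalled from the 0.8-era server.properties and should be checked against the docs for your version:

# Keep a partition's data for 7 days, or until its log exceeds ~100 GB, whichever comes first.
log.retention.hours=168
log.retention.bytes=107374182400
# Retention is enforced by deleting whole segments, so segment size bounds the granularity.
log.segment.bytes=1073741824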
Intra-cluster Replication
• Pick CA (from the CAP trade-off) within a datacenter (failover < 10ms)
• Network partitions are rare
• Latency is less of an issue
• Separate data replication and consensus
• Consensus => ZooKeeper
• Replication => primary-backup (f replicas tolerate f-1 failures)
• Configurable ACKs (durability vs. latency); see the sketch below
• More details:
• http://www.slideshare.net/junrao/kafka-replication-apachecon2013
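The configurable ACK mentioned above maps to a producer setting in 0.8 (property name recalled from the 0.8 docs; verify for your client version):

# request.required.acks controls the durability/latency trade-off:
#   0  -> fire and forget (lowest latency, weakest durability)
#   1  -> wait for the partition leader to acknowledge
#  -1  -> wait for all in-sync replicas to acknowledge (strongest durability)
request.required.acks=-1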
Replication Architecture
[Diagram: producers and consumers connect to a cluster of brokers, which coordinate through ZooKeeper (ZK).]


Editor's Notes

  • #6 Data-serving websites, LinkedIn has a lot of data
  • #9 Based on relevance
  • #12 We have this variety of data and we need to build all these products around such data.
  • #13 We have this variety of data and we need to build all these products around such data.
  • #15 Messaging: ActiveMQ; user activity: in-house log aggregation; logging: Splunk; metrics: JMX => Zenoss; database data: Databus, custom ETL
  • #18 ActiveMQ: they do not fly
  • #21 Now you may be wondering why it works so well. For example, why can it be both highly durable, persisting data to disks, while still maintaining high throughput?
  • #24 Topic = message stream; a topic has partitions, and partitions are distributed across brokers
  • #25 Do not be afraid of disks
  • #26 File system caching
  • #28 And finally, after all these tricks, the client interface we expose to users is very simple.
  • #30 Now I will switch gears and talk a little bit about Kafka usage at LinkedIn
  • #31 21st, October.
  • #33 Multi-colo
  • #43 99.99%
  • #44 0.8.2: delete topic, automated leader rebalancing, controlled shutdown, offset management, parallel recovery, min.isr and clean leader election
  • #46 Non-Java/Scala clients: C / C++ / .NET, Go, Clojure, Ruby, Node.js, PHP, Python, Erlang, HTTP REST, command line, etc. (https://cwiki.apache.org/confluence/display/KAFKA/Clients). Python: pure Python implementation with full protocol support; consumer and producer implementations included; GZIP and Snappy compression supported. C: high-performance C library with full protocol support. C++: native C++ library with protocol support for Metadata, Produce, Fetch, and Offset. Go (aka golang): pure Go implementation with full protocol support; consumer and producer implementations included; GZIP and Snappy compression supported. Ruby: pure Ruby; consumer and producer implementations included; GZIP and Snappy compression supported; Ruby 1.9.3 and up (CI runs MRI 2. Clojure: Clojure DSL for the Kafka API. JavaScript (NodeJS): NodeJS client in a pure JavaScript implementation. stdin & stdout.
  • #47 Non-Java/Scala clients: C / C++ / .NET, Go, Clojure, Ruby, Node.js, PHP, Python, Erlang, HTTP REST, command line, etc.