©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Apache Kafka at LinkedIn
About Me
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Why Did We Build Kafka?
We Have a Lot of Data
• User activity tracking
  – Page views, ad impressions, etc.
• Server logs and metrics
  – Syslogs, request rates, etc.
• Messaging
  – Emails, news feeds, etc.
• Computation derived
  – Results of Hadoop / data warehousing, etc.
... and We Build Products on Data
Newsfeed
Recommendation
HADOOP SUMMIT 2013
People you may know
Search
Metrics and Monitoring
System and application metrics/logging
... and a LOT of Monitoring
The Problem:
How to integrate this variety of data
and make it available to all products?
Life back in 2010:
Point-to-Point Pipelines
Example: User Activity Data Flow
What We Want
• A centralized data pipeline
Apache Kafka
We tried some off-the-shelf systems, but…
What We REALLY Want
• A centralized data pipeline that is
  – Elastically scalable
  – Durable
  – High-throughput
  – Easy to use
Apache Kafka
• A distributed pub-sub messaging system
• Scale-out from the ground up
• Persistent to disk
• High throughput (tens of MB/sec per server)
Life Since Kafka in Production
Apache Kafka
• Developed and maintained by 5 developers and 2 SREs
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Key Idea #1:
Data-parallelism leads to scale-out
Distribute Clients across Partitions
• Produce/consume requests are randomly balanced among brokers
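The idea on this slide can be sketched in a few lines. This is an illustration only, not Kafka's partitioner: the class name `PartitionChooser`, the key-hashing branch, and the "partition p lives on broker p mod N" placement are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch only; PartitionChooser is a hypothetical name, not a Kafka class.
final class PartitionChooser {
    private final int numPartitions;
    private final int numBrokers;

    PartitionChooser(int numPartitions, int numBrokers) {
        if (numPartitions <= 0 || numBrokers <= 0) {
            throw new IllegalArgumentException("partition and broker counts must be positive");
        }
        this.numPartitions = numPartitions;
        this.numBrokers = numBrokers;
    }

    // No key: pick a partition at random so requests spread evenly.
    // With a key (an assumption beyond the slide): hash it, so one key always
    // lands on the same partition. floorMod keeps the result non-negative.
    int partitionFor(String key) {
        if (key == null) {
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Assumed placement for the sketch: partition p is hosted on broker p mod numBrokers.
    int brokerFor(int partition) {
        return partition % numBrokers;
    }

    public static void main(String[] args) {
        PartitionChooser chooser = new PartitionChooser(8, 4);
        int[] requestsPerBroker = new int[4];
        for (int i = 0; i < 100_000; i++) {
            requestsPerBroker[chooser.brokerFor(chooser.partitionFor(null))]++;
        }
        // Each broker should receive roughly a quarter of the requests.
        System.out.println(Arrays.toString(requestsPerBroker));
    }
}
```

Because partitions are independent, adding brokers and partitions adds capacity without coordination between them, which is the scale-out property the slide refers to.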
Key Idea #2:
Disks are fast when used sequentially
Store Messages as a Log
• Appends are effectively O(1)
• Reads from a known offset are still fast when the data is cached
[Figure: partition i of topic A drawn as a log of offsets 3 through 12; the producer writes at the tail, while Consumer1 and Consumer2 each read from offset 7]
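A minimal sketch of the log abstraction, assuming an in-memory list stands in for the on-disk file. `PartitionLog` and its methods are hypothetical names for illustration, not Kafka classes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; an in-memory list stands in for the on-disk log file.
final class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Appends at the tail and returns the offset assigned to the message (amortized O(1)).
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Reads up to maxMessages starting at the given offset. Reads never modify
    // the log, so each consumer tracks its own offset independently.
    List<String> read(long offset, int maxMessages) {
        if (offset < 0 || offset > messages.size() || maxMessages < 0) {
            throw new IllegalArgumentException("bad offset or batch size");
        }
        int from = (int) offset;
        int to = (int) Math.min((long) messages.size(), offset + maxMessages);
        return new ArrayList<>(messages.subList(from, to));
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        for (int i = 0; i < 10; i++) {
            log.append("m" + i);
        }
        // Two consumers read from the same offset without affecting each other.
        System.out.println(log.read(7, 2));   // consumer 1
        System.out.println(log.read(7, 10));  // consumer 2
    }
}
```

The broker keeps no per-message delivery state in this model; a consumer's position is just a number, which is what makes multiple independent subscribers cheap.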
Key Idea #3:
Batching makes best use of network/IO
Batch Transfer
• Batched send and receive
• Batched compression
• No message caching in the JVM
• Zero-copy from file to socket (Java NIO)
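The zero-copy bullet refers to Java NIO's `FileChannel.transferTo`. The sketch below shows the call under stated assumptions: `ZeroCopySend` and its methods are hypothetical names, and an in-memory sink replaces the socket so the example runs on its own. The kernel-level zero-copy path generally applies only when the target is a real socket channel; the API call is the same.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch only; not Kafka's code.
final class ZeroCopySend {
    // Sends bytes [position, position + count) of a log file to the target channel
    // without copying them through a user-space buffer in this code.
    static long sendSegment(Path logFile, long position, long count,
                            WritableByteChannel target) throws IOException {
        try (FileChannel file = FileChannel.open(logFile, StandardOpenOption.READ)) {
            long sent = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (sent < count) {
                long n = file.transferTo(position + sent, count - sent, target);
                if (n <= 0) {
                    break; // end of file, or the target accepted nothing
                }
                sent += n;
            }
            return sent;
        }
    }

    // Writes a tiny "segment" to a temp file and sends a slice of it to an in-memory sink.
    static String demo() {
        try {
            Path log = Files.createTempFile("segment", ".log");
            try {
                Files.write(log, "msg-0msg-1msg-2".getBytes(StandardCharsets.UTF_8));
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                sendSegment(log, 5, 10, Channels.newChannel(sink));
                return sink.toString("UTF-8");
            } finally {
                Files.delete(log);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```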
The API (0.8)
Producer:
    send(topic, message)
Consumer:
    Iterable stream = createMessageStreams(…).get(topic)
    for (message : stream) {
        // process the message
    }
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
  – Pipeline deployment
  – Schema for data cleanliness
  – O(1) ETL
  – Auditing for correctness
• Roadmap
• Q & A
Kafka Usage at LinkedIn
• Mainly used for tracking user-activity and metrics data
• 16 to 32 brokers in each cluster (615+ total brokers)
• 527 billion messages/day
• 7500+ topics, 270k+ partitions
• Byte rates:
  – Writes: 97 TB/day
  – Reads: 430 TB/day
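A quick back-of-the-envelope calculation puts these daily totals in per-second terms. It assumes decimal terabytes (10^12 bytes) and traffic spread evenly over the day, neither of which the slide states; `UsageMath` is a hypothetical helper written for this note.

```java
// Illustrative arithmetic only; assumes 1 TB = 10^12 bytes and an even daily load.
final class UsageMath {
    static final double SECONDS_PER_DAY = 86_400.0;
    static final double BYTES_PER_TB = 1e12;

    static double messagesPerSecond(double messagesPerDay) {
        return messagesPerDay / SECONDS_PER_DAY;
    }

    static double averageMessageBytes(double writeTbPerDay, double messagesPerDay) {
        return writeTbPerDay * BYTES_PER_TB / messagesPerDay;
    }

    // How many times, on average, each written byte is read back.
    static double readFanOut(double readTbPerDay, double writeTbPerDay) {
        return readTbPerDay / writeTbPerDay;
    }

    public static void main(String[] args) {
        System.out.printf("messages/sec: %.0f%n", messagesPerSecond(527e9));
        System.out.printf("avg message size (bytes): %.1f%n", averageMessageBytes(97, 527e9));
        System.out.printf("read fan-out: %.2f%n", readFanOut(430, 97));
    }
}
```

Under those assumptions the figures work out to roughly 6.1 million messages per second on average, about 184 bytes per message, and each written byte being read about 4.4 times.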
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
  – Pipeline deployment
  – Schema for data cleanliness
  – O(1) ETL
  – Auditing for correctness
• Roadmap
• Q & A
Problems
• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?
Standardized Schema on Avro
• Schema
  – Message structure contract
  – Performance gain
• Workflow
  – Check in schema
  – Auto compatibility check
  – Code review
  – “Ship it!”
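The "auto compatibility check" step could look something like the sketch below. It is a deliberately simplified stand-in, not LinkedIn's tool and not the Avro library: schemas are modeled as plain maps, `SchemaCheck` and `Field` are hypothetical names, and only two rules are applied (an added field needs a default; a shared field must keep its type). Real Avro schema resolution also allows certain type promotions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; a simplified model of a backward-compatibility check.
final class SchemaCheck {
    // A field's type name plus whether the schema declares a default value for it.
    static final class Field {
        final String type;
        final boolean hasDefault;

        Field(String type, boolean hasDefault) {
            this.type = type;
            this.hasDefault = hasDefault;
        }
    }

    // Returns true if data written with oldSchema can still be read with newSchema.
    static boolean canRead(Map<String, Field> oldSchema, Map<String, Field> newSchema) {
        for (Map.Entry<String, Field> entry : newSchema.entrySet()) {
            Field old = oldSchema.get(entry.getKey());
            if (old == null) {
                // Field added in the new schema: old data lacks it, so a default is required.
                if (!entry.getValue().hasDefault) {
                    return false;
                }
            } else if (!old.type.equals(entry.getValue().type)) {
                return false; // type changed
            }
        }
        // Fields dropped from the new schema are simply ignored when reading old data.
        return true;
    }

    public static void main(String[] args) {
        Map<String, Field> v1 = new LinkedHashMap<>();
        v1.put("memberId", new Field("long", false));
        v1.put("pageKey", new Field("string", false));

        Map<String, Field> v2 = new LinkedHashMap<>(v1);
        v2.put("referrer", new Field("string", true)); // added with a default

        Map<String, Field> v3 = new LinkedHashMap<>(v1);
        v3.put("sessionId", new Field("string", false)); // added without a default

        System.out.println("v1 -> v2 compatible: " + canRead(v1, v2));
        System.out.println("v1 -> v3 compatible: " + canRead(v1, v3));
    }
}
```

Running a check like this at check-in time, before code review, is what lets a schema change be rejected before any producer starts emitting data that downstream readers cannot parse.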
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
  – Pipeline deployment
  – Schema for data cleanliness
  – O(1) ETL
  – Auditing for correctness
• Roadmap
• Q & A
Kafka to Hadoop
Hadoop ETL (Camus)
• A MapReduce job performs the data load
• One job loads all events
• ~10 minutes average latency from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently
• Open sourced:
– https://github.com/linkedin/camus
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
  – Pipeline deployment
  – Schema for data cleanliness
  – O(1) ETL
  – Auditing for correctness
• Roadmap
• Q & A
Does it really work?
“All published messages must be delivered to all consumers (quickly)”
Audit Trail
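One way to check the delivery requirement above is to count messages per topic and time window at each stage and compare the counts. The sketch below assumes that approach. `AuditTrail`, the 10-minute window, and the `topic@bucket` key format are illustrative choices, not details taken from the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; one instance holds the counts seen at one pipeline stage.
final class AuditTrail {
    static final long BUCKET_MILLIS = 10 * 60 * 1000L; // assumed 10-minute windows

    private final Map<String, Long> counts = new HashMap<>();

    // Counts one message against its topic and the time window of its timestamp.
    void record(String topic, long timestampMillis) {
        long bucket = timestampMillis / BUCKET_MILLIS;
        counts.merge(topic + "@" + bucket, 1L, Long::sum);
    }

    // Returns consumed-minus-produced per window, keeping only the mismatches.
    // Negative values indicate loss; positive values indicate duplicates.
    static Map<String, Long> discrepancies(AuditTrail produced, AuditTrail consumed) {
        Map<String, Long> diff = new HashMap<>(consumed.counts);
        produced.counts.forEach((key, count) -> diff.merge(key, -count, Long::sum));
        diff.values().removeIf(delta -> delta == 0L);
        return diff;
    }

    public static void main(String[] args) {
        AuditTrail produced = new AuditTrail();
        AuditTrail consumed = new AuditTrail();
        for (int i = 0; i < 5; i++) {
            produced.record("PageViewEvent", 1_000L * i);
        }
        for (int i = 0; i < 4; i++) {
            consumed.record("PageViewEvent", 1_000L * i); // one message went missing
        }
        System.out.println(discrepancies(produced, consumed));
    }
}
```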
More Features in Kafka 0.8
• Intra-cluster replication (0.8.0)
  – High availability
  – Reduced latency
• Log compaction (0.8.1)
  – State storage
• Operational tools (0.8.2)
  – Topic management
  – Automated leader rebalance
  – etc.
Check out our page for more: http://kafka.apache.org/
Kafka 0.9
• Clients rewrite
  – Remove ZK dependency
  – Even better throughput
• Security
  – More operability, multi-tenancy ready
• Transactional messaging
  – From at-least-once to exactly-once
Check out our page for more: http://kafka.apache.org/
Kafka Users: Next Maybe You?
Acknowledgements
Questions? Guozhang Wang
guwang@linkedin.com
www.linkedin.com/in/guozhangwang
Backup Slides
Real-time Analysis with Kafka
• Analytics from Hadoop can be slow
  – Production -> Kafka: tens of milliseconds
  – Kafka -> Hadoop: < 1 minute
  – ETL in Hadoop: ~45 minutes
  – MapReduce in Hadoop: maybe hours
Real-time Analysis with Kafka
• Solution No. 1: directly consuming from Kafka
• Solution No. 2: storage other than HDFS
  – Spark, Shark
  – Pinot, Druid, FastBit
• Solution No. 3: stream processing
  – Apache Samza
  – Storm
How Fast Can Kafka Go?
• Bottleneck #1: network bandwidth
  – Producer: ~100 MB/s on 1 Gigabit Ethernet
  – Consumers can be slower because of multiple subscribers
• Bottleneck #2: disk space
  – Data may be deleted before it is consumed at peak times
  – Configurable time/size-based retention policy
• Bottleneck #3: ZooKeeper
  – Mainly due to offset commits; will be lifted in 0.9
Intra-cluster Replication
• Pick CA within a datacenter (failover < 10 ms)
  – Network partitions are rare
  – Latency is less of an issue
• Separate data replication and consensus
  – Consensus => ZooKeeper
  – Replication => primary-backup (f replicas tolerate f-1 failures)
• Configurable ACK (durability vs. latency)
• More details:
  – http://www.slideshare.net/junrao/kafka-replication-apachecon2013
Replication Architecture
[Figure: producers and consumers connected to four brokers, with ZooKeeper (ZK) alongside]

More Related Content

What's hot

Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
confluent
 
Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around Kafka
Guido Schmutz
 

What's hot (20)

Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
When NOT to use Apache Kafka?
When NOT to use Apache Kafka?When NOT to use Apache Kafka?
When NOT to use Apache Kafka?
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
 
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around Kafka
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep Learning
 
Understanding Apache Kafka® Latency at Scale
Understanding Apache Kafka® Latency at ScaleUnderstanding Apache Kafka® Latency at Scale
Understanding Apache Kafka® Latency at Scale
 
Getting Started with Confluent Schema Registry
Getting Started with Confluent Schema RegistryGetting Started with Confluent Schema Registry
Getting Started with Confluent Schema Registry
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
 
Streaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache KafkaStreaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache Kafka
 

Similar to Apache Kafka at LinkedIn

GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023
Timothy Spann
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
Michael Kehoe
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Timothy Spann
 

Similar to Apache Kafka at LinkedIn (20)

GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
 
CA Technologies Customer Presentation
CA Technologies Customer PresentationCA Technologies Customer Presentation
CA Technologies Customer Presentation
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
 
Introduction to Kafka
Introduction to KafkaIntroduction to Kafka
Introduction to Kafka
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data PipelinesETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
 
Oracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and ArchitectureOracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and Architecture
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 

More from Guozhang Wang

More from Guozhang Wang (13)

Consensus in Apache Kafka: From Theory to Production.pdf
Consensus in Apache Kafka: From Theory to Production.pdfConsensus in Apache Kafka: From Theory to Production.pdf
Consensus in Apache Kafka: From Theory to Production.pdf
 
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
 
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...
 
Introduction to the Incremental Cooperative Protocol of Kafka
Introduction to the Incremental Cooperative Protocol of KafkaIntroduction to the Incremental Cooperative Protocol of Kafka
Introduction to the Incremental Cooperative Protocol of Kafka
 
Performance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams ApplicationsPerformance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams Applications
 
Apache Kafka from 0.7 to 1.0, History and Lesson Learned
Apache Kafka from 0.7 to 1.0, History and Lesson LearnedApache Kafka from 0.7 to 1.0, History and Lesson Learned
Apache Kafka from 0.7 to 1.0, History and Lesson Learned
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka Streams
 
Apache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream ProcessingApache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream Processing
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
 
Building a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache KafkaBuilding a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache Kafka
 
Behavioral Simulations in MapReduce
Behavioral Simulations in MapReduceBehavioral Simulations in MapReduce
Behavioral Simulations in MapReduce
 
Automatic Scaling Iterative Computations
Automatic Scaling Iterative ComputationsAutomatic Scaling Iterative Computations
Automatic Scaling Iterative Computations
 

Recently uploaded

Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdf
Kamal Acharya
 
Digital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdfDigital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdf
AbrahamGadissa
 
RS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
RS Khurmi Machine Design Clutch and Brake Exercise Numerical SolutionsRS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
RS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
Atif Razi
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
Kamal Acharya
 

Recently uploaded (20)

Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdf
 
ENERGY STORAGE DEVICES INTRODUCTION UNIT-I
ENERGY STORAGE DEVICES  INTRODUCTION UNIT-IENERGY STORAGE DEVICES  INTRODUCTION UNIT-I
ENERGY STORAGE DEVICES INTRODUCTION UNIT-I
 
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
NO1 Pandit Black Magic Removal in Uk kala jadu Specialist kala jadu for Love ...
 
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamKIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
 
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWINGBRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
 
Online resume builder management system project report.pdf
Online resume builder management system project report.pdfOnline resume builder management system project report.pdf
Online resume builder management system project report.pdf
 
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptxCloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
 
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
 
Pharmacy management system project report..pdf
Pharmacy management system project report..pdfPharmacy management system project report..pdf
Pharmacy management system project report..pdf
 
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
 
Digital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdfDigital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdf
 
RS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
RS Khurmi Machine Design Clutch and Brake Exercise Numerical SolutionsRS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
RS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
 
Peek implant persentation - Copy (1).pdf
Peek implant persentation - Copy (1).pdfPeek implant persentation - Copy (1).pdf
Peek implant persentation - Copy (1).pdf
 
Introduction to Casting Processes in Manufacturing
Introduction to Casting Processes in ManufacturingIntroduction to Casting Processes in Manufacturing
Introduction to Casting Processes in Manufacturing
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 

Apache Kafka at LinkedIn

  • 1. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Apache Kafka at LinkedIn
  • 2. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure About Me 2
  • 3. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Agenda 3 • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Roadmap • Q & A
  • 4. Why We Build Kafka?
  • 5. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure We Have a lot of Data 5 • User activity tracking • Page views, ad impressions, etc • Server logs and metrics • Syslogs, request-rates, etc • Messaging • Emails, news feeds, etc • Computation derived • Results of Hadoop / data warehousing, etc
  • 6. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure .. and We Build Products on Data 6
  • 7. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Newsfeed 7
  • 8. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Recommendation 8HADOOP SUMMIT 2013 People you may know
  • 9. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Recommendation 9
  • 10. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Search 10
  • 11. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Metrics and Monitoring 11 HADOOP SUMMIT 2013 System and application metrics/logging LinkedIn Corporation ©2013 All Rights Reserved 5
  • 12. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure .. and a LOT of Monitoring 12
  • 13. The Problem: How to integrate this variety of data and make it available to all products?
  • 14. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 14 Life back in 2010: Point-to-Point Pipeplines
  • 15. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 15 Example: User Activity Data Flow
  • 16. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 16 What We Want • A centralized data pipeline
  • 17. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 17 Apache Kafka We tried some systems off- the-shelf, but…
  • 18. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 18 What We REALLY Want • A centralized data pipeline that is • Elastically scalable • Durable • High-throughput • Easy to use
  • 19. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure • A distributed pub-sub messaging system • Scale-out from groundup • Persistent to disks • High-Throughput (10s MB/sec per server) 19 Apache Kafka
  • 20. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 20 Life Since Kafka in Production Apache Kafka • Developed and maintained by 5 Devs + 2 SRE
  • 21. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Agenda 21 • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Roadmap • Q & A
  • 22. Key Idea #1: Data-parallelism leads to scale-out
  • 23. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure • Produce/consume requests are randomly balanced among brokers 23 Distribute Clients across Partitions
  • 24. Key Idea #2: Disks are fast when used sequentially
  • 25. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure • Appends are effectively O(1) • Reads from known offset are fast still, when cached 25 Store Messages as a Log 3 4 5 5 7 8 9 10 11 12... Producer Write Consumer1 Reads (offset 7) Consumer2 Reads (offset 7) Partition i of Topic A
  • 26. Key Idea #3: Batching makes best use of network/IO
  • 27. Batch Transfer • Batched send and receive • Batched compression • No message caching in JVM • Zero-copy from file to socket (Java NIO)
  • 28. The API (0.8) Producer: send(topic, message) Consumer: Iterable stream = createMessageStreams(…).get(topic) for (message: stream) { // process the message }
  • 29. Agenda • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Pipeline deployment • Schema for data cleanliness • O(1) ETL • Auditing for correctness • Roadmap • Q & A
  • 30. Kafka Usage at LinkedIn • Mainly used for tracking user-activity and metrics data • 16-32 brokers in each cluster (615+ total brokers) • 527 billion messages/day • 7500+ topics, 270k+ partitions • Byte rates: • Writes: 97 TB/day • Reads: 430 TB/day
  • 31. Kafka Usage at LinkedIn
  • 32. Kafka Usage at LinkedIn
  • 33. Kafka Usage at LinkedIn
  • 34. Agenda • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Pipeline deployment • Schema for data cleanliness • O(1) ETL • Auditing for correctness • Roadmap • Q & A
  • 35. Problems • Hundreds of message types • Thousands of fields • What do they all mean? • What happens when they change?
  • 36. Standardized Schema on Avro • Schema • Message structure contract • Performance gain • Workflow • Check in schema • Auto compatibility check • Code review • “Ship it!”
  • 37. Agenda • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Pipeline deployment • Schema for data cleanliness • O(1) ETL • Auditing for correctness • Roadmap • Q & A
  • 38. Kafka to Hadoop
  • 39. Hadoop ETL (Camus) • Map/Reduce job does data load • One job loads all events • ~10 minute ETA on average from producer to HDFS • Hive registration done automatically • Schema evolution handled transparently • Open sourced: https://github.com/linkedin/camus
  • 40. Agenda • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Pipeline deployment • Schema for data cleanliness • O(1) ETL • Auditing for correctness • Roadmap • Q & A
  • 41. Does it really work? “All published messages must be delivered to all consumers (quickly)”
  • 43. More Features in Kafka 0.8 • Intra-cluster replication (0.8.0) • High availability • Reduced latency • Log compaction (0.8.1) • State storage • Operational tools (0.8.2) • Topic management • Automated leader rebalance • etc. Check out our page for more: http://kafka.apache.org/
  • 44. Kafka 0.9 • Clients rewrite • Remove ZK dependency • Even better throughput • Security • More operability, multi-tenancy ready • Transactional messaging • From at-least-once to exactly-once Check out our page for more: http://kafka.apache.org/
  • 45. Kafka Users: Next Maybe You?
  • 46. Acknowledgements
  • 49. Real-time Analysis with Kafka • Analytics from Hadoop can be slow • Production -> Kafka: tens of milliseconds • Kafka -> Hadoop: < 1 minute • ETL in Hadoop: ~45 minutes • MapReduce in Hadoop: maybe hours
  • 50. Real-time Analysis with Kafka • Solution No. 1: consume directly from Kafka • Solution No. 2: storage other than HDFS • Spark, Shark • Pinot, Druid, FastBit • Solution No. 3: stream processing • Apache Samza • Storm
  • 51. How Fast can Kafka Go? • Bottleneck #1: network bandwidth • Producer: 100 MB/s on 1 Gigabit Ethernet • Consumer can be slower due to multi-sub • Bottleneck #2: disk space • Data may be deleted before it is consumed at peak time • Configurable time/size-based retention policy • Bottleneck #3: ZooKeeper • Mainly due to offset commits; will be lifted in 0.9
  • 52. Intra-cluster Replication • Pick CA within a datacenter (failover < 10 ms) • Network partitions are rare • Latency is less of an issue • Separate data replication and consensus • Consensus => ZooKeeper • Replication => primary-backup (f replicas to tolerate f-1 failures) • Configurable ACK (durability vs. latency) • More details: • http://www.slideshare.net/junrao/kafka-replication-apachecon2013
  • 53. Replication Architecture [Figure: two producers and two consumers connected to four brokers, with ZK alongside]
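
Slide 25's points (appends are effectively O(1), and each consumer reads from an offset it already knows) can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not Kafka's on-disk implementation: the class name PartitionLog and its methods are hypothetical names chosen for this example, and a Java list stands in for the log file.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative in-memory stand-in for one partition's log (hypothetical, not Kafka code). */
class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    /** Appends at the tail (amortized O(1)) and returns the offset assigned to the message. */
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns up to maxMessages starting at offset; reading never modifies the log. */
    List<String> read(long offset, int maxMessages) {
        if (offset < 0 || offset > messages.size() || maxMessages < 0) {
            throw new IllegalArgumentException("offset or maxMessages out of range");
        }
        int from = (int) offset;
        int to = Math.min(messages.size(), from + maxMessages);
        return new ArrayList<>(messages.subList(from, to));
    }

    /** Offset that the next appended message will receive. */
    long endOffset() {
        return messages.size();
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        for (int i = 0; i < 10; i++) {
            log.append("m" + i);
        }
        // Two consumers track their own offsets, so they read independently.
        System.out.println("consumer 1: " + log.read(7, 2));
        System.out.println("consumer 2: " + log.read(7, 5));
    }
}
```

Because the log keeps no per-consumer state, adding a consumer in this sketch costs nothing on the write path; each reader only needs to remember its own offset.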

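The daily totals on slide 30 are easier to compare with the per-server figure on slide 19 once they are converted to per-second rates. The sketch below does that arithmetic, assuming decimal units (1 TB = 10^12 bytes) and a uniform rate over the day, neither of which the slides state; the class name UsageMath is hypothetical.

```java
/** Back-of-the-envelope conversion of slide 30's daily totals (assumptions noted inline). */
class UsageMath {
    static final double SECONDS_PER_DAY = 24 * 60 * 60;

    /** Converts a per-day total into an average per-second rate. */
    static double perSecond(double perDay) {
        return perDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        double messagesPerDay = 527e9;    // 527 billion messages/day
        double writeBytesPerDay = 97e12;  // 97 TB/day, assuming 1 TB = 10^12 bytes
        double readBytesPerDay = 430e12;  // 430 TB/day, same assumption
        int brokers = 615;                // "615+ total brokers", used as a lower bound

        System.out.printf("messages/sec:         %.2f million%n", perSecond(messagesPerDay) / 1e6);
        System.out.printf("write rate:           %.2f GB/s%n", perSecond(writeBytesPerDay) / 1e9);
        System.out.printf("read/write fan-out:   %.1fx%n", readBytesPerDay / writeBytesPerDay);
        System.out.printf("avg message size:     %.0f bytes%n", writeBytesPerDay / messagesPerDay);
        System.out.printf("avg write per broker: %.2f MB/s%n",
                perSecond(writeBytesPerDay) / brokers / 1e6);
    }
}
```

Under those assumptions the averages come to roughly 6.1 million messages per second, about 1.1 GB/s of writes, and each written byte being read about 4.4 times. These are day-long averages, so peak rates would be higher.
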
Editor's Notes

  1. As a data-serving website, LinkedIn has a lot of data
  2. Based on relevance
  3. We have this variety of data and we need to build all these products around such data.
  4. We have this variety of data and we need to build all these products around such data.
  5. Messaging: ActiveMQ; User activity: in-house log aggregation; Logging: Splunk; Metrics: JMX => Zenoss; Database data: Databus, custom ETL
  6. ActiveMQ: they do not fly
  7. Now you may be wondering why it works so well. For example, how can it be highly durable by persisting data to disk while still maintaining high throughput?
  8. Topic = message stream Topic has partitions, partitions are distributed to brokers
  9. Do not be afraid of disks
  10. File system caching
  11. And finally, after all these tricks, the client interface we expose to users is very simple.
  12. Now I will switch gears and talk a little bit about Kafka usage at LinkedIn.
  13. 21st, October.
  14. Multi-colo
  15. 99.99%
  16. 0.8.2: delete topic, automated leader rebalancing, controlled shutdown, offset management, parallel recovery, min.isr and clean leader election
  17. Non-Java / Scala: C / C++ / .NET, Go, Clojure, Ruby, Node.js, PHP, Python, Erlang, HTTP REST, command line, etc. https://cwiki.apache.org/confluence/display/KAFKA/Clients Python: pure Python implementation with full protocol support; Consumer and Producer implementations included; GZIP and Snappy compression supported. C: high-performance C library with full protocol support. C++: native C++ library with protocol support for Metadata, Produce, Fetch, and Offset. Go (aka golang): pure Go implementation with full protocol support; Consumer and Producer implementations included; GZIP and Snappy compression supported. Ruby: pure Ruby; Consumer and Producer implementations included; GZIP and Snappy compression supported; Ruby 1.9.3 and up (CI runs MRI 2. Clojure: Clojure DSL for the Kafka API. JavaScript (NodeJS): NodeJS client in a pure JavaScript implementation. stdin & stdout
  18. Non-Java / Scala: C / C++ / .NET, Go, Clojure, Ruby, Node.js, PHP, Python, Erlang, HTTP REST, command line, etc.
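
Note 8 (a topic has partitions, and partitions are distributed to brokers), together with slide 23's remark that requests are randomly balanced among brokers, can be illustrated with a small simulation. This is a sketch under assumptions: the round-robin placement rule, the class name PartitionAssignment, and both method names are hypothetical and are not taken from Kafka's code or from the slides.

```java
import java.util.Random;

/** Illustrative spread of a topic's partitions over brokers (hypothetical, not Kafka code). */
class PartitionAssignment {
    /** Places partition p on broker p % brokerCount; round-robin is an assumption. */
    static int brokerFor(int partition, int brokerCount) {
        return partition % brokerCount;
    }

    /** Picks a partition uniformly at random, mirroring "randomly balanced" on slide 23. */
    static int pickPartition(int partitionCount, Random random) {
        return random.nextInt(partitionCount);
    }

    public static void main(String[] args) {
        int brokers = 4;
        int partitions = 8;
        int[] requestsPerBroker = new int[brokers];
        Random random = new Random(42);

        // Each simulated request goes to a random partition, hence to that partition's broker.
        for (int i = 0; i < 100_000; i++) {
            int partition = pickPartition(partitions, random);
            requestsPerBroker[brokerFor(partition, brokers)]++;
        }
        for (int b = 0; b < brokers; b++) {
            System.out.println("broker " + b + ": " + requestsPerBroker[b] + " requests");
        }
    }
}
```

With partitions spread evenly, random partition choice gives each broker a near-equal share of requests, which is the data-parallelism that Key Idea #1 relies on for scale-out.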