Apache Kafka at LinkedIn

Jay Kreps is a Principal Staff Engineer at LinkedIn, where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects, including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It covers both how Kafka works and how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.

Published in: Engineering

  1. Jay Kreps: Introduction to Apache Kafka
  2. The Plan
     1. What is Apache Kafka?
     2. Kafka and Data Integration
     3. Kafka and Stream Processing
  3. Apache Kafka
  4. A brief history of Apache Kafka
  5. Characteristics
     • Scalability of a filesystem
       – Hundreds of MB/sec/server throughput
       – Many TB per server
     • Guarantees of a database
       – Messages strictly ordered
       – All data persistent
     • Distributed by default
       – Replication
       – Partitioning model
  6. Kafka is about logs
  7. What is a log?
  8. Logs: pub/sub done right
  9. Partitioning
  10. Nodes Host Many Partitions
  11. Producers Balance Load
  12. Consumers Divide Up Partitions
  13. End-to-End
  14. Kafka At LinkedIn
      • 175 TB of in-flight log data per colo
      • Replicated to each datacenter
      • Tens of thousands of data producers
      • Thousands of consumers
      • 7 million messages written/sec
      • 35 million messages read/sec
      • Hadoop integration
  15. Performance
      • Producer (3x replication):
        – Async: 786,980 records/sec (75.1 MB/sec)
        – Sync: 421,823 records/sec (40.2 MB/sec)
      • Consumer:
        – 940,521 records/sec (89.7 MB/sec)
      • End-to-end latency:
        – 2 ms (median)
        – 14 ms (99.9th percentile)
  16. The Plan
      1. What is Apache Kafka?
      2. Kafka and Data Integration
      3. Kafka and Stream Processing
  17. Data Integration
  18. Maslow’s Hierarchy
  19. For Data
  20. New Types of Data
      • Database data
        – Users, products, orders, etc.
      • Events
        – Clicks, Impressions, Pageviews, etc.
      • Application metrics
        – CPU usage, requests/sec
      • Application logs
        – Service calls, errors
  21. New Types of Systems
      • Live Stores
        – Voldemort
        – Espresso
        – Graph
        – OLAP
        – Search
        – InGraphs
      • Offline
        – Hadoop
        – Teradata
  22. Bad
  23. Good
  24. Example: User views job
  25. Comparing Data Transfer Mechanisms
  26. The Plan
      1. What is Apache Kafka?
      2. Kafka and Data Integration
      3. Kafka and Stream Processing
  27. Stream Processing
  28. Stream processing is a generalization of batch processing
  29. Stream Processing = Logs + Jobs
  30. Examples
      • Monitoring
      • Security
      • Content processing
      • Recommendations
      • Newsfeed
      • ETL
  31. Frameworks Can Help
  32. Samza Architecture
  33. Log-centric Architecture
  34. Kafka: http://kafka.apache.org
      Samza: http://samza.incubator.apache.org
      Log Blog: http://linkd.in/199iMwY
      Benchmark: http://t.co/40fkKJvanx
      Me: http://www.linkedin.com/in/jaykreps, @jaykreps
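
Slides 8 through 12 describe the log as a partitioned pub/sub structure: producers balance load across partitions, and consumers divide the partitions among themselves. The slide images are not part of this transcript, so the following is only a minimal Python sketch of that idea under stated assumptions. It is a toy in-memory model, not Kafka's client API. The names `PartitionedLog` and `assign_partitions` are hypothetical, and the CRC32 key hash and the round-robin assignment are illustrative choices; Kafka itself uses its own partitioner and group coordination.

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """Toy in-memory topic: a fixed number of append-only partitions.

    Hypothetical illustration only; this is not Kafka's API.
    """

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Assumption: choose the partition by hashing the key, so records
        # with the same key land in the same partition and keep their order.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset):
        # Reading does not remove records; each consumer tracks its own offset.
        return self.partitions[partition][offset:]


def assign_partitions(num_partitions, consumers):
    """Give each partition to exactly one consumer, round-robin (an assumption)."""
    assignment = defaultdict(list)
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return dict(assignment)


# Producer side: six events spread over three keys.
log = PartitionedLog(num_partitions=4)
for i in range(6):
    log.append("user-%d" % (i % 3), "event-%d" % i)

# Consumer side: two consumers split the four partitions between them.
assignment = assign_partitions(4, ["consumer-a", "consumer-b"])
for consumer, owned in sorted(assignment.items()):
    for partition in owned:
        print(consumer, partition, log.read(partition, 0))
```

Ordering in this sketch holds only within a partition, which is why the key, rather than the arrival time, decides where a record goes. Adding a consumer changes only the assignment map; the stored records are untouched.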
