Real-time Data and Apache Kafka
Jay Kreps
I ♥ Log
I Heart Log: Real-time Data and Apache Kafka
9,368 views

This presentation discusses how logs and stream processing can form a backbone for data flow, ETL, and real-time data processing. It describes the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure, and it discusses the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.

Published in: Engineering

I Heart Log: Real-time Data and Apache Kafka

  1. Real-time Data and Apache Kafka
     Jay Kreps
     I ♥ Log
  2. The Plan
     1. Apache Kafka
     2. Logs and Distributed Systems
     3. Logs and Data Integration
     4. Logs and Stream Processing
  3. Apache Kafka
  4. A brief history of Kafka
  5. Three principles
     1. One pipeline to rule them all
     2. Stream processing >> messaging
     3. Clusters not servers
  6. Characteristics
     • Scalability of a filesystem
       – Hundreds of MB/sec/server throughput
       – Many TB per server
     • Guarantees of a database
       – Messages strictly ordered
       – All data persistent
     • Distributed by default
       – Replication
       – Partitioning model
  7. Kafka At LinkedIn
     • 175 TB of in-flight log data per colo
     • Low-latency: ~1.5 ms
     • Replicated to each datacenter
     • Tens of thousands of data producers
     • Thousands of consumers
     • 7 million messages written/sec
     • 35 million messages read/sec
     • Hadoop integration
  8. Open source
     • Apache Software Foundation
     • Very healthy usage outside LinkedIn
     • Broad base of committers
     • 30 clients in 15 languages
     • Great ecosystem of supporting tools
  9. The Plan
     1. Apache Kafka
     2. Logs and Distributed Systems
     3. Logs and Data Integration
     4. Logs and Stream Processing
  10. Kafka is about logs
  11. What is a log?
  12. Partitioning
  13. Logs: pub/sub done right
  14. Logs And Distributed Systems
  15. Example: A Fault-tolerant CEO Hash Table
  16. Operations Final State
  17. State-machine Replication
  18. Primary-backup
  19. What use is a log?
  20. The Plan
     1. Apache Kafka
     2. Logs and Distributed Systems
     3. Logs and Data Integration
     4. Logs and Stream Processing
  21. Data Integration
  22. Types of Data
     • Database data
       – Users, products, orders, etc
     • Events
       – Clicks, Impressions, Pageviews, etc
     • Application metrics
       – CPU usage, requests/sec
     • Application logs
       – Service calls, errors
  23. Systems at LinkedIn
     • Live Stores
       – Voldemort
       – Espresso
       – Graph
       – OLAP
       – Search
       – InGraphs
     • Offline
       – Hadoop
       – Teradata
  24. Bad
  25. Good
  26. Example: User views job
  27. The Plan
     1. Apache Kafka
     2. Logs and Distributed Systems
     3. Logs and Data Integration
     4. Logs and Stream Processing
  28. Stream Processing
  29. Stream processing is a generalization of batch processing
  30. Examples
     • Monitoring
     • Security
     • Content processing
     • Recommendations
     • Newsfeed
     • ETL
  31. Stream Processing = Logs + Jobs
  32. Systems Can Help
  33. Samza Architecture
  34. Example: Top Articles By Company
  35. Log-centric Architecture
  36. Kafka: http://kafka.apache.org
     Samza: http://samza.incubator.apache.org
     Log Blog: http://linkd.in/199iMwY
     Me: http://www.linkedin.com/in/jaykreps @jaykreps
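
The slides on "What is a log?" and "State-machine Replication" survive here only as titles, so the sketch below is an illustration of the general idea and not a reconstruction of the talk's own example. It assumes a log is an append-only sequence of records addressed by offset, and that replicas of a hash table apply the same records in the same order. The names `Log`, `HashTableReplica`, `demo`, and the `("put", key, value)` operation format are all hypothetical, and nothing here uses Kafka's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical operation format: ("put", key, value) or ("delete", key, None).
Op = Tuple[str, str, Optional[Any]]


@dataclass
class Log:
    """Append-only, totally ordered sequence of records addressed by offset."""
    records: List[Op] = field(default_factory=list)

    def append(self, op: Op) -> int:
        """Add a record at the end and return its offset."""
        self.records.append(op)
        return len(self.records) - 1

    def read(self, offset: int) -> List[Op]:
        """Return every record from `offset` to the end of the log."""
        return self.records[offset:]


@dataclass
class HashTableReplica:
    """Replica that builds its state by applying log records in order."""
    state: Dict[str, Any] = field(default_factory=dict)
    next_offset: int = 0  # position of the first record not yet applied

    def catch_up(self, log: Log) -> None:
        """Apply all records this replica has not seen yet."""
        for kind, key, value in log.read(self.next_offset):
            if kind == "put":
                self.state[key] = value
            elif kind == "delete":
                self.state.pop(key, None)
            else:
                raise ValueError(f"unknown operation: {kind}")
            self.next_offset += 1


def demo() -> Tuple[HashTableReplica, HashTableReplica]:
    log = Log()
    log.append(("put", "ceo", "alice"))
    log.append(("put", "cto", "bob"))

    early = HashTableReplica()
    early.catch_up(log)  # applies the first two records

    log.append(("put", "ceo", "carol"))
    log.append(("delete", "cto", None))

    late = HashTableReplica()  # starts from offset 0 after all writes
    early.catch_up(log)        # resumes from its saved offset
    late.catch_up(log)         # replays the whole log
    return early, late


if __name__ == "__main__":
    early, late = demo()
    print(early.state, late.state)
```

Because each replica applies the same deterministic operations in log order, a replica that resumes from a saved offset and one that replays the log from the beginning should end in the same state. The offset is the only bookkeeping a reader needs, which is also what lets many consumers read one log independently.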
