Apache Kafka at LinkedIn

Jay Kreps is a Principal Staff Engineer at LinkedIn, where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects, including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It covers both how Kafka works and how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.

Published in: Engineering

  • Who are you?
    What is this talk about?
    Exciting topic
    More
  • Messaging system, like JMS (but different!)
    Producers, consumers distributed
  • Start with state at LinkedIn, describe each pipeline
    1 Pipeline for database data
    1 Pipeline for metrics
    1 Pipeline for events
    1 JMS-based pipeline
    No pipeline for application logs
    300 ActiveMQ brokers
  • 10,000 messages/sec * 100 byte messages = ~1MB/sec
  • The log is the fundamental abstraction Kafka provides
    You can use a log as a drop-in replacement for a messaging system, but it can also do a lot more
  • What is a log?
    Traditional uses?
    Non-traditional uses…
  • Time ordered
    Semi-structured
  • Data structure not a text file
    List of changes
    Contents of a record don’t matter
    Indexed by “time”
    Not application log (i.e. text file)
  • Remotely accessible
    State machine replication
  • Data model of Kafka: A topic
    Partitions can be spread over machines, replicated
  • Path of a write
    Leadership failover
    Guarantees
  • AKA ETL
    Many systems
    Event data
    Most important problem for data-centric companies
    Integration >> ML
  • Maslow’s Hierarchy
    Abraham Maslow, Psychologist, 1943

    Physiological – eat, drink, sleep
    Safety – Not being attacked
    Love/Belonging – friends and family
    Esteem – respect of others
    Self-Actualization – morality, creativity, spontaneity
  • Want to do Deep Learning
    Instead find that their CSV data ALSO has commas in it
    Copying files around
    Ugh The Caveman
    Data Warehousing has a bad reputation
  • Two exacerbating factors
    15 years ago, just the first one (transactional data)
    New categories are very high volume, maybe 100x the transactional data
    Look like events
    Internet of things
  • One-size fits all
  • Tell story:
    Started with Hadoop, added arrows to get data there
    Want to build fancy algorithms, need data (expectation 90% of time for fancy, 10% for data)
    Holy shit this is hard!
    Data is missing, data is late, computation runs on wrong data
    Hadoop without good data is just a very expensive space heater
    Never get to full connectivity
  • Metcalfe’s law
    Each new system connects to get/give data

    All data in multi-subscriber, real-time logs
    The company is a big distributed system
    The data center is the distributed system
  • Three dimensions:
    Throughput
    Guarantees
    Latency

    Advantages over messaging:
    Huge data backlog
    Order

    Advantages over files:
    Real-time

    Advantage over both: principled notion of time
  • Whole organization is big distributed system
    Commit log = data transfer
    Stream processing = triggers

    Batch is dominant paradigm for data processing, why?
  • Service: One input = one output
    Batch job: All inputs = all outputs
    Stream computing: any window = output for that window
  • No different from batch processing flow (instead of files/tables, logs)
  • Storm and Samza
    About process management – both integrate with Kafka
    MapReduce and HDFS
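
The notes above describe a log as a list of records indexed by position rather than a text file, say that a topic is split into partitions, and frame stream computing as producing one output per window of input. A minimal sketch of those three ideas together follows. It is an in-memory toy, not Kafka's or Samza's API: the names `PartitionedLog` and `windowed_counts`, the CRC32 key hashing, and the sample user names are all assumptions made for illustration.

```python
from collections import Counter
from zlib import crc32


class PartitionedLog:
    """Toy in-memory stand-in for a topic: a fixed set of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Hash the key so every record for one key lands in the same partition,
        # which keeps those records ordered relative to each other.
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        records = self.partitions[partition]
        records.append((key, value))
        offset = len(records) - 1  # the record's position acts as its "time"
        return partition, offset

    def read(self, partition, offset, max_records=100):
        # Reading never removes data, so any number of subscribers can read
        # the same partition, each one tracking its own offset.
        return self.partitions[partition][offset:offset + max_records]


def windowed_counts(log, partition, window_size):
    """Stream-style job: emit one output (per-key counts) per window of records."""
    offset = 0
    outputs = []
    while True:
        window = log.read(partition, offset, window_size)
        if not window:
            break
        outputs.append(Counter(key for key, _ in window))
        offset += len(window)
    return outputs


log = PartitionedLog(num_partitions=1)
for user in ["alice", "bob", "alice", "carol", "alice"]:
    log.append(user, "viewed job")

# Small windows give several outputs; one window spanning the whole log
# reduces to the batch case (all inputs, one output).
print(windowed_counts(log, partition=0, window_size=2))
print(windowed_counts(log, partition=0, window_size=5))
```

With a window of two records the job emits three outputs; with a window that covers the whole log it emits one, which is the sense in which a batch job is a special case of a stream job.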
  • Transcript of "Apache Kafka at LinkedIn"

    1. Jay Kreps, Introduction to Apache Kafka
    2. The Plan
       1. What is Apache Kafka?
       2. Kafka and Data Integration
       3. Kafka and Stream Processing
    3. Apache Kafka
    4. A brief history of Apache Kafka
    5. Characteristics
       • Scalability of a filesystem
         – Hundreds of MB/sec/server throughput
         – Many TB per server
       • Guarantees of a database
         – Messages strictly ordered
         – All data persistent
       • Distributed by default
         – Replication
         – Partitioning model
    6. Kafka is about logs
    7. What is a log?
    8. Logs: pub/sub done right
    9. Partitioning
    10. Nodes Host Many Partitions
    11. Producers Balance Load
    12. Consumers Divide Up Partitions
    13. End-to-End
    14. Kafka At LinkedIn
        • 175 TB of in-flight log data per colo
        • Replicated to each datacenter
        • Tens of thousands of data producers
        • Thousands of consumers
        • 7 million messages written/sec
        • 35 million messages read/sec
        • Hadoop integration
    15. Performance
        • Producer (3x replication):
          – Async: 786,980 records/sec (75.1 MB/sec)
          – Sync: 421,823 records/sec (40.2 MB/sec)
        • Consumer:
          – 940,521 records/sec (89.7 MB/sec)
        • End-to-end latency:
          – 2 ms (median)
          – 14 ms (99.9th percentile)
    16. The Plan
        1. What is Apache Kafka?
        2. Kafka and Data Integration
        3. Kafka and Stream Processing
    17. Data Integration
    18. Maslow’s Hierarchy
    19. For Data
    20. New Types of Data
        • Database data
          – Users, products, orders, etc
        • Events
          – Clicks, Impressions, Pageviews, etc
        • Application metrics
          – CPU usage, requests/sec
        • Application logs
          – Service calls, errors
    21. New Types of Systems
        • Live Stores
          – Voldemort
          – Espresso
          – Graph
          – OLAP
          – Search
          – InGraphs
        • Offline
          – Hadoop
          – Teradata
    22. Bad
    23. Good
    24. Example: User views job
    25. Comparing Data Transfer Mechanisms
    26. The Plan
        1. What is Apache Kafka?
        2. Kafka and Data Integration
        3. Kafka and Stream Processing
    27. Stream Processing
    28. Stream processing is a generalization of batch processing
    29. Stream Processing = Logs + Jobs
    30. Examples
        • Monitoring
        • Security
        • Content processing
        • Recommendations
        • Newsfeed
        • ETL
    31. Frameworks Can Help
    32. Samza Architecture
    33. Log-centric Architecture
    34. Kafka: http://kafka.apache.org
        Samza: http://samza.incubator.apache.org
        Log Blog: http://linkd.in/199iMwY
        Benchmark: http://t.co/40fkKJvanx
        Me: http://www.linkedin.com/in/jaykreps, @jaykreps
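
The Performance slide gives each throughput figure in both records/sec and MB/sec without stating the record size. As a rough consistency check, the sketch below converts one to the other under two assumptions that are not on the slide: records are 100 bytes (the message size used in the speaker notes' back-of-envelope figure) and "MB" means 2**20 bytes. The helper name `mb_per_sec` is hypothetical.

```python
RECORD_BYTES = 100  # assumption: 100-byte records, as in the speaker notes' estimate
MIB = 2 ** 20       # assumption: "MB" on the slide means 2**20 bytes


def mb_per_sec(records_per_sec, record_bytes=RECORD_BYTES):
    """Convert a record rate into a byte rate expressed in MB/sec."""
    return records_per_sec * record_bytes / MIB


# Record rates copied from the Performance slide.
rates = [
    ("async producer", 786_980),
    ("sync producer", 421_823),
    ("consumer", 940_521),
]
for label, rate in rates:
    print(f"{label}: {mb_per_sec(rate):.1f} MB/sec")
# async producer: 75.1 MB/sec
# sync producer: 40.2 MB/sec
# consumer: 89.7 MB/sec
```

Under those assumptions the computed values match the slide's 75.1, 40.2, and 89.7 MB/sec, which suggests, but does not prove, that the benchmark used records of about 100 bytes.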