Apache Kafka at LinkedIn

Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.

Published in: Engineering

  • Who are you?
    What is this talk about?
    Exciting topic
    More
  • Messaging system, like JMS (but different!)
    Producers, consumers distributed
  • Start with state at LinkedIn, describe each pipeline
    1 Pipeline for database data
    1 Pipeline for metrics
    1 Pipeline for events
    1 JMS-based pipeline
    No pipeline for application logs
    300 ActiveMQ brokers
  • 10,000 messages/sec * 100 byte messages = ~1MB/sec
  • The log is the fundamental abstraction Kafka provides
    You can use a log as a drop-in replacement for a messaging system, but it can also do a lot more
  • What is a log?
    Traditional uses?
    Non-traditional uses…
  • Time ordered
    Semi-structured
  • Data structure not a text file
    List of changes
    Contents of a record don’t matter
    Indexed by “time”
    Not application log (i.e. text file)
  • Remotely accessible
    State machine replication
  • Data model of Kafka: A topic
    Partitions can be spread over machines, replicated
  • Path of a write
    Leadership failover
    Guarantees
  • AKA ETL
    Many systems
    Event data
    Most important problem for data-centric companies
    Integration >> ML
  • Maslow’s Hierarchy
    Abraham Maslow, psychologist, 1943

    Physiological – eat, drink, sleep
    Safety – Not being attacked
    Love/Belonging – friends and family
    Esteem – respect of others
    Self-Actualization – morality, creativity, spontaneity
  • Want to do Deep Learning
    Instead finding that their CSV data ALSO has commas in it
    Copying files around
    Ugh The Caveman
    Data Warehousing has a bad reputation
  • Two exacerbating factors
    15 years ago, just the first one (transactional data)
    New categories are very high volume, maybe 100x the transactional data
    Look like events
    Internet of things
  • One-size fits all
  • Tell story:
    Started with Hadoop, added arrows to get data there
    Want to build fancy algorithms, need data (expectation 90% of time for fancy, 10% for data)
    Holy shit this is hard!
    Data is missing, data is late, computation runs on wrong data
    Hadoop without good data is just a very expensive space heater
    Never get to full connectivity
  • Metcalfe’s law
    Each new system connects to get/give data

    All data in multi-subscriber, real-time logs
    The company is a big distributed system
    The data center is the distributed system
  • Three dims:
    Throughput
    Guarantees
    Latency

    Advantages over messaging:
    Huge data backlog
    Order

    Advantages over files
    Real-time

    Advantage over both: principled notion of time
  • Whole organization is big distributed system
    Commit log = data transfer
    Stream processing = triggers

    Batch is dominant paradigm for data processing, why?
  • Service: One input = one output
    Batch job: All inputs = all outputs
    Stream computing: any window = output for that window
  • No different from batch processing flow (instead of files/tables, logs)
  • Storm and Samza
    About process management – both integrate with Kafka
    MapReduce and HDFS
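
The distinction drawn in the notes above (a service maps one input to one output, a batch job maps all inputs to all outputs, a stream job emits one output per window) can be illustrated with a minimal Python sketch. It assumes a log is simply a time-ordered list of `(timestamp, key)` records and uses fixed, non-overlapping windows. The helper `count_by_window`, its `window_secs` parameter, and the sample events are made up for illustration; none of them are part of Kafka or Samza.

```python
from collections import Counter, defaultdict


def count_by_window(records, window_secs):
    """Group (timestamp, key) records into fixed windows and count keys per window.

    Returns {window_start: Counter}. Hypothetical helper, not a Kafka/Samza API.
    """
    windows = defaultdict(Counter)
    for timestamp, key in records:
        # Every record belongs to exactly one window, identified by its start time.
        window_start = (timestamp // window_secs) * window_secs
        windows[window_start][key] += 1
    return dict(windows)


# A tiny, invented log of page-view events: (seconds since start, page).
log = [(1, "home"), (4, "jobs"), (9, "home"), (12, "jobs"), (19, "jobs"), (21, "home")]

# Stream-style: one output per 10-second window.
per_window = count_by_window(log, window_secs=10)

# Batch-style: the same function with a single window that covers all input.
whole_log = count_by_window(log, window_secs=10**9)
```

Growing the window until it covers the whole log reproduces the batch result, which is one way to read the claim on slide 28 that stream processing is a generalization of batch processing.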
  • Transcript

    • 1. Jay Kreps Introduction to Apache Kafka
    • 2. The Plan 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
    • 3. Apache Kafka
    • 4. A brief history of Apache Kafka
    • 5. Characteristics
        • Scalability of a filesystem
            – Hundreds of MB/sec/server throughput
            – Many TB per server
        • Guarantees of a database
            – Messages strictly ordered
            – All data persistent
        • Distributed by default
            – Replication
            – Partitioning model
    • 6. Kafka is about logs
    • 7. What is a log?
    • 8. Logs: pub/sub done right
    • 9. Partitioning
    • 10. Nodes Host Many Partitions
    • 11. Producers Balance Load
    • 12. Consumers Divide Up Partitions
    • 13. End-to-End
    • 14. Kafka At LinkedIn
        • 175 TB of in-flight log data per colo
        • Replicated to each datacenter
        • Tens of thousands of data producers
        • Thousands of consumers
        • 7 million messages written/sec
        • 35 million messages read/sec
        • Hadoop integration
    • 15. Performance
        • Producer (3x replication):
            – Async: 786,980 records/sec (75.1 MB/sec)
            – Sync: 421,823 records/sec (40.2 MB/sec)
        • Consumer:
            – 940,521 records/sec (89.7 MB/sec)
        • End-to-end latency:
            – 2 ms (median)
            – 14 ms (99.9th percentile)
    • 16. The Plan 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
    • 17. Data Integration
    • 18. Maslow’s Hierarchy
    • 19. For Data
    • 20. New Types of Data
        • Database data
            – Users, products, orders, etc
        • Events
            – Clicks, Impressions, Pageviews, etc
        • Application metrics
            – CPU usage, requests/sec
        • Application logs
            – Service calls, errors
    • 21. New Types of Systems
        • Live Stores
            – Voldemort
            – Espresso
            – Graph
            – OLAP
            – Search
            – InGraphs
        • Offline
            – Hadoop
            – Teradata
    • 22. Bad
    • 23. Good
    • 24. Example: User views job
    • 25. Comparing Data Transfer Mechanisms
    • 26. The Plan 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
    • 27. Stream Processing
    • 28. Stream processing is a generalization of batch processing
    • 29. Stream Processing = Logs + Jobs
    • 30. Examples • Monitoring • Security • Content processing • Recommendations • Newsfeed • ETL
    • 31. Frameworks Can Help
    • 32. Samza Architecture
    • 33. Log-centric Architecture
    • 34. Kafka http://kafka.apache.org
          Samza http://samza.incubator.apache.org
          Log Blog http://linkd.in/199iMwY
          Benchmark: http://t.co/40fkKJvanx
          Me http://www.linkedin.com/in/jaykreps @jaykreps
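
Slides 7 through 12 describe the log as an append-only structure that is split into partitions, written by producers that spread load, and read by consumers that divide the partitions among themselves. The following toy, in-memory Python sketch shows those ideas under stated assumptions. The `Topic` class, the `assign` function, the consumer names, and the CRC32 key hash are all illustrative stand-ins and do not reflect Kafka's actual client API or its partitioner. In this sketch, ordering holds within each partition.

```python
import zlib


class Topic:
    """Toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append to the partition chosen by hashing the key; return (partition, offset)."""
        # CRC32 is an assumed stand-in for a real partitioner's hash function.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset, max_records=100):
        """Return records from `offset` onward; reading never removes data."""
        return self.partitions[partition][offset:offset + max_records]


def assign(num_partitions, consumers):
    """Divide partitions among the consumers of one group, round-robin."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


topic = Topic(num_partitions=4)
for i in range(8):
    # Records with the same key always land in the same partition.
    topic.append(key=f"user-{i % 3}", value=f"pageview-{i}")

# Two independent subscribers track their own position in partition 0,
# so one can lag far behind the other without affecting it.
offsets = {"hadoop-loader": 0, "monitoring": 0}
first = topic.read(0, offsets["monitoring"], max_records=1)
offsets["monitoring"] += len(first)

# Within one consumer group, each partition is owned by exactly one consumer.
group = assign(4, ["c1", "c2"])
```

Because each subscriber only stores an offset, adding another reader costs the log nothing extra to retain, which is the property the notes point to when they list a huge data backlog and ordering as advantages over traditional messaging.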