Apache Kafka at LinkedIn

Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.

Speaker Notes

  • Who are you?
    What is this talk about?
    Exciting topic
    More
  • Messaging system, like JMS (but different!)
    Producers, consumers distributed
  • Start with state at LinkedIn, describe each pipeline
    1 Pipeline for database data
    1 Pipeline for metrics
    1 Pipeline for events
    1 JMS-based pipeline
    No pipeline for application logs
    300 ActiveMQ brokers
  • 10,000 messages/sec * 100 byte messages = ~1MB/sec
  • The log is the fundamental abstraction Kafka provides
    You can use a log as a drop-in replacement for a messaging system, but it can also do a lot more (see the sketch below)
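
A minimal sketch of that drop-in pub/sub usage, using Kafka’s Java clients (the modern client API postdates this talk’s 0.8-era API; the broker address, topic name, and group id below are assumptions for the example):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PubSubSketch {
      public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish: append a record to a (hypothetical) "page-views" topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
          producer.send(new ProducerRecord<>("page-views", "user42", "viewed /jobs/123"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "example-group");  // consumers in a group divide up partitions
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Subscribe: read records back, ordered within each partition.
        // (A real consumer would poll in a loop.)
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
          consumer.subscribe(List.of("page-views"));
          ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
          for (ConsumerRecord<String, String> r : records)
            System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
        }
      }
    }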
  • What is a log?
    Traditional uses?
    Non-traditional uses…
  • Time ordered
    Semi-structured
  • Data structure not a text file
    List of changes
    Contents of record doesn’t matter
    Indexed by “time”
    Not application log (i.e. text file)
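
To make the “data structure, not a text file” point concrete, here is a toy in-memory version of such a log (names are illustrative only): records are opaque, appended at the end, and addressed by an offset that acts as a logical time.

    import java.util.ArrayList;
    import java.util.List;

    // Toy append-only log: records are opaque bytes, addressed by a
    // monotonically increasing offset that serves as a logical "time".
    public class ToyLog {
      private final List<byte[]> entries = new ArrayList<>();

      // Append at the end; the returned offset totally orders all writes.
      public synchronized long append(byte[] record) {
        entries.add(record);
        return entries.size() - 1;
      }

      // Readers address records by offset; the log never interprets contents.
      public synchronized byte[] read(long offset) {
        return entries.get((int) offset);
      }

      public synchronized long endOffset() {
        return entries.size();
      }
    }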
  • Remotely accessible
    State machine replication
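
State machine replication in one sentence: if every replica applies the same log entries, in the same order, to the same deterministic state machine, all replicas converge to the same state. A toy sketch (hypothetical names):

    import java.util.HashMap;
    import java.util.Map;

    // Deterministic state machine: replicas that apply the same ordered log
    // of put(key, value) entries end up with identical maps.
    public class ReplicaStateMachine {
      private final Map<String, String> state = new HashMap<>();
      private long appliedOffset = -1;

      public void apply(long offset, String key, String value) {
        if (offset != appliedOffset + 1)  // enforce in-order application
          throw new IllegalStateException("log entries must be applied in order");
        state.put(key, value);
        appliedOffset = offset;
      }
    }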
  • Data model of Kafka: A topic
    Partitions can be spread over machines, replicated
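
A simplified sketch of how a keyed record maps to one of a topic’s partitions (Kafka’s default partitioner actually hashes the serialized key with murmur2; the plain hashCode-modulo below is a stand-in):

    // All records with the same key land in the same partition, so per-key
    // order is preserved even though partitions live on different machines.
    static int partitionFor(String key, int numPartitions) {
      return (key.hashCode() & 0x7fffffff) % numPartitions;  // mask keeps it non-negative
    }

With, say, 8 partitions, partitionFor("user42", 8) routes every event for user42 to the same partition.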
  • Path of a write
    Leadership failover
    Guarantees
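
The write-path guarantee is configurable on the producer side; a sketch of the relevant knobs (the values shown are choices for the example, not defaults):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DurableProducer {
      public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        // acks=all: the partition leader acknowledges only after the record
        // reaches the in-sync replicas, so an acked write survives failover.
        p.put("acks", "all");
        // Retry transient errors such as a leadership change during failover.
        p.put("retries", "3");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
          producer.send(new ProducerRecord<>("page-views", "user42", "payload"));
        }
      }
    }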
  • AKA ETL
    Many systems
    Event data
    Most important problem for data-centric companies
    Integration >> ML
  • Maslow’s Hierarchy
    Abraham Maslow, Psychologist, 1943

    Physiological – eat, drink, sleep
    Safety – Not being attacked
    Love/Belonging – friends and family
    Esteem – respect of others
    Self-Actualization – morality, creativity, spontaneity
  • Want to do Deep Learning
    Instead finding that their CSV data ALSO has commas in it
    Copying files around
    Ugh The Caveman
    Data Warehousing has a bad reputation
  • Two exacerbating factors
    15 years ago, just the first one (transactional data)
    New categories are very high volume, maybe 100x the transactional data
    Look like events
    Internet of things
  • One-size fits all
  • Tell story:
    Started with Hadoop, added arrows to get data there
    Want to build fancy algorithms, need data (expectation 90% of time for fancy, 10% for data)
    Holy shit this is hard!
    Data is missing, data is late, computation runs on wrong data
    Hadoop without good data is just a very expensive space heater
    Never get to full connectivity
  • Metcalfe’s law
    Each new system connects to get/give data

    All data in multi-subscriber, real-time logs
    The company is a big distributed system
    The data center is the distributed system
  • Three dimensions:
    Throughput
    Guarantees
    Latency

    Advantages over messaging:
    Huge data backlog
    Order

    Advantages over files
    Real-time

    Advantage over both: principled notion of time
  • Whole organization is big distributed system
    Commit log = data transfer
    Stream processing = triggers

    Batch is dominant paradigm for data processing, why?
  • Service: One input = one output
    Batch job: All inputs = all outputs
    Stream computing: any window = output for that window
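
A framework-independent sketch of “any window = output for that window”, counting events in fixed one-minute windows:

    import java.util.HashMap;
    import java.util.Map;

    // Tumbling one-minute windows: each event falls into exactly one window,
    // and the job can emit a count per window once that window closes.
    public class WindowedCount {
      private static final long WINDOW_MS = 60_000;
      private final Map<Long, Long> countsByWindow = new HashMap<>();

      public void onEvent(long timestampMs) {
        long window = timestampMs / WINDOW_MS;  // index of this event's window
        countsByWindow.merge(window, 1L, Long::sum);
      }

      public long countFor(long window) {
        return countsByWindow.getOrDefault(window, 0L);
      }
    }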
  • No different from batch processing flow (instead of files/tables, logs)
  • Storm and Samza
    About process management – both integrate with Kafka
    MapReduce and HDFS
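
For flavor, the shape of a task in Samza’s low-level API of that era was roughly the following (the system and topic names are assumptions); the framework handles process management and feeds the task one Kafka partition’s messages in order:

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    // A Samza job is a log-in, log-out transformation: messages come from a
    // Kafka partition and results are emitted to another topic.
    public class UppercaseTask implements StreamTask {
      private static final SystemStream OUTPUT = new SystemStream("kafka", "uppercased");

      @Override
      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        String msg = (String) envelope.getMessage();
        collector.send(new OutgoingMessageEnvelope(OUTPUT, msg.toUpperCase()));
      }
    }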

Transcript

  • 1. Jay Kreps Introduction to Apache Kafka
  • 2. The Plan 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
  • 3. Apache Kafka
  • 4. A brief history of Apache Kafka
  • 5. Characteristics • Scalability of a filesystem – Hundreds of MB/sec/server throughput – Many TB per server • Guarantees of a database – Messages strictly ordered – All data persistent • Distributed by default – Replication – Partitioning model
  • 6. Kafka is about logs
  • 7. What is a log?
  • 8. Logs: pub/sub done right
  • 9. Partitioning
  • 10. Nodes Host Many Partitions
  • 11. Producers Balance Load
  • 12. Consumers Divide Up Partitions
  • 13. End-to-End
  • 14. Kafka At LinkedIn • 175 TB of in-flight log data per colo • Replicated to each datacenter • Tens of thousands of data producers • Thousands of consumers • 7 million messages written/sec • 35 million messages read/sec • Hadoop integration
  • 15. Performance • Producer (3x replication): – Async: 786,980 records/sec (75.1 MB/sec) – Sync: 421,823 records/sec (40.2 MB/sec) • Consumer: – 940,521 records/sec (89.7 MB/sec) • End-to-end latency: – 2 ms (median) – 14 ms (99.9th percentile)
  • 16. The Plan 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
  • 17. Data Integration
  • 18. Maslow’s Hierarchy
  • 19. For Data
  • 20. New Types of Data • Database data – Users, products, orders, etc • Events – Clicks, Impressions, Pageviews, etc • Application metrics – CPU usage, requests/sec • Application logs – Service calls, errors
  • 21. New Types of Systems • Live Stores – Voldemort – Espresso – Graph – OLAP – Search – InGraphs • Offline – Hadoop – Teradata
  • 22. Bad
  • 23. Good
  • 24. Example: User views job
  • 25. Comparing Data Transfer Mechanisms
  • 26. The Plan 1. What is Apache Kafka? 2. Kafka and Data Integration 3. Kafka and Stream Processing
  • 27. Stream Processing
  • 28. Stream processing is a generalization of batch processing
  • 29. Stream Processing = Logs + Jobs
  • 30. Examples • Monitoring • Security • Content processing • Recommendations • Newsfeed • ETL
  • 31. Frameworks Can Help
  • 32. Samza Architecture
  • 33. Log-centric Architecture
  • 34. Kafka http://kafka.apache.org Samza http://samza.incubator.apache.org Log Blog http://linkd.in/199iMwY Benchmark: http://t.co/40fkKJvanx Me http://www.linkedin.com/in/jaykreps @jaykreps