Apache Kafka at LinkedIn

Jay Kreps is a Principal Staff Engineer at LinkedIn, where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects, including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It covers how Kafka works and how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.

Published in: Engineering
  • Who are you?
    What is this talk about?
    Exciting topic
    More
  • Messaging system, like JMS (but different!)
    Producers, consumers distributed
  • Start with state at LinkedIn, describe each pipeline
    1 Pipeline for database data
    1 Pipeline for metrics
    1 Pipeline for events
    1 JMS-based pipeline
    No pipeline for application logs
    300 ActiveMQ brokers
  • 10,000 messages/sec * 100-byte messages = ~1 MB/sec
  • The log is the fundamental abstraction Kafka provides
    You can use a log as a drop-in replacement for a messaging system, but it can also do a lot more
  • What is a log?
    Traditional uses?
    Non-traditional uses…
  • Time ordered
    Semi-structured
  • Data structure not a text file
    List of changes
    Contents of a record don’t matter
    Indexed by “time”
    Not application log (i.e. text file)
  • Remotely accessible
    State machine replication
  • Data model of Kafka: A topic
    Partitions can be spread over machines, replicated
  • Path of a write
    Leadership failover
    Guarantees
  • AKA ETL
    Many systems
    Event data
    Most important problem for data-centric companies
    Integration >> ML
  • Maslow’s Hierarchy
    Abraham Maslow, psychologist, 1943

    Physiological – eat, drink, sleep
    Safety – Not being attacked
    Love/Belonging – friends and family
    Esteem – respect of others
    Self-Actualization – morality, creativity, spontaneity
  • Want to do Deep Learning
    Instead finding that their CSV data ALSO has commas in it
    Copying files around
    Ugh The Caveman
    Data Warehousing has a bad reputation
  • Two exacerbating factors
    15 years ago, just the first one (transactional data)
    New categories are very high volume, maybe 100x the transactional data
    Look like events
    Internet of things
  • One size fits all
  • Tell story:
    Started with Hadoop, added arrows to get data there
    Want to build fancy algorithms, need data (expectation 90% of time for fancy, 10% for data)
    Holy shit this is hard!
    Data is missing, data is late, computation runs on wrong data
    Hadoop without good data is just a very expensive space heater
    Never get to full connectivity
  • Metcalfe’s law
    Each new system connects to get/give data

    All data in multi-subscriber, real-time logs
    The company is a big distributed system
    The data center is the distributed system
  • Three dims:
    Throughput
    Guarantees
    Latency

    Advantages over messaging:
    Huge data backlog
    Order

    Advantages over files
    Real-time

    Advantage over both: principled notion of time
  • Whole organization is big distributed system
    Commit log = data transfer
    Stream processing = triggers

    Batch is dominant paradigm for data processing, why?
  • Service: One input = one output
    Batch job: All inputs = all outputs
    Stream computing: any window = output for that window
  • No different from batch processing flow (instead of files/tables, logs)
  • Storm and Samza
    About process management – both integrate with Kafka
    MapReduce and HDFS
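
The service / batch / stream distinction in the notes above (one input = one output, all inputs = all outputs, any window = output for that window) can be made concrete with a toy. This is a minimal sketch, assuming events are `(timestamp_secs, key)` tuples and fixed, non-overlapping windows; `batch_count`, `windowed_count`, and the sample events are hypothetical names and data chosen for illustration, not part of Samza, Storm, or Kafka.

```python
from collections import Counter, defaultdict


def batch_count(events):
    """Batch job: consume all inputs, emit one output for all of them."""
    return Counter(key for _, key in events)


def windowed_count(events, window_secs):
    """Stream-style job: emit one output per window of the input.

    Assumption: events are (timestamp_secs, key) tuples, and windows are the
    fixed intervals [n * window_secs, (n + 1) * window_secs).
    """
    windows = defaultdict(Counter)
    for ts, key in events:
        window_start = (ts // window_secs) * window_secs
        windows[window_start][key] += 1
    return dict(windows)


# Hypothetical sample data: (seconds since start, event type).
events = [
    (0, "pageview"),
    (12, "click"),
    (59, "pageview"),
    (61, "pageview"),
    (130, "click"),
]

per_minute = windowed_count(events, 60)        # one result per 60-second window
everything = windowed_count(events, 10**9)     # one huge window covers all input
```

With a window large enough to cover the whole input, the windowed job produces exactly the batch result, which is one way to read the claim that stream processing generalizes batch processing.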
  • Transcript

    • 1. Jay Kreps Introduction to Apache Kafka
    • 2. The Plan
        1. What is Apache Kafka?
        2. Kafka and Data Integration
        3. Kafka and Stream Processing
    • 3. Apache Kafka
    • 4. A brief history of Apache Kafka
    • 5. Characteristics
        • Scalability of a filesystem
            – Hundreds of MB/sec/server throughput
            – Many TB per server
        • Guarantees of a database
            – Messages strictly ordered
            – All data persistent
        • Distributed by default
            – Replication
            – Partitioning model
    • 6. Kafka is about logs
    • 7. What is a log?
    • 8. Logs: pub/sub done right
    • 9. Partitioning
    • 10. Nodes Host Many Partitions
    • 11. Producers Balance Load
    • 12. Consumers Divide Up Partitions
    • 13. End-to-End
    • 14. Kafka At LinkedIn
        • 175 TB of in-flight log data per colo
        • Replicated to each datacenter
        • Tens of thousands of data producers
        • Thousands of consumers
        • 7 million messages written/sec
        • 35 million messages read/sec
        • Hadoop integration
    • 15. Performance
        • Producer (3x replication):
            – Async: 786,980 records/sec (75.1 MB/sec)
            – Sync: 421,823 records/sec (40.2 MB/sec)
        • Consumer:
            – 940,521 records/sec (89.7 MB/sec)
        • End-to-end latency:
            – 2 ms (median)
            – 14 ms (99.9th percentile)
    • 16. The Plan
        1. What is Apache Kafka?
        2. Kafka and Data Integration
        3. Kafka and Stream Processing
    • 17. Data Integration
    • 18. Maslow’s Hierarchy
    • 19. For Data
    • 20. New Types of Data
        • Database data
            – Users, products, orders, etc
        • Events
            – Clicks, Impressions, Pageviews, etc
        • Application metrics
            – CPU usage, requests/sec
        • Application logs
            – Service calls, errors
    • 21. New Types of Systems
        • Live Stores
            – Voldemort
            – Espresso
            – Graph
            – OLAP
            – Search
            – InGraphs
        • Offline
            – Hadoop
            – Teradata
    • 22. Bad
    • 23. Good
    • 24. Example: User views job
    • 25. Comparing Data Transfer Mechanisms
    • 26. The Plan
        1. What is Apache Kafka?
        2. Kafka and Data Integration
        3. Kafka and Stream Processing
    • 27. Stream Processing
    • 28. Stream processing is a generalization of batch processing
    • 29. Stream Processing = Logs + Jobs
    • 30. Examples
        • Monitoring
        • Security
        • Content processing
        • Recommendations
        • Newsfeed
        • ETL
    • 31. Frameworks Can Help
    • 32. Samza Architecture
    • 33. Log-centric Architecture
    • 34. Kafka http://kafka.apache.org
        Samza http://samza.incubator.apache.org
        Log Blog http://linkd.in/199iMwY
        Benchmark: http://t.co/40fkKJvanx
        Me http://www.linkedin.com/in/jaykreps @jaykreps
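
Slides 7 to 13 describe the log, partitioning, producers balancing load, and consumers dividing up partitions. The sketch below is a toy in-memory model of those ideas, not the Kafka client API: `Partition`, `Topic`, `Consumer`, and `assign` are hypothetical names, the CRC32 key hash and round-robin assignment are assumptions standing in for whatever a real client does, and replication and persistence are left out entirely.

```python
import zlib


class Partition:
    """An append-only list of records; a record's offset is its index."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the record just written

    def read(self, offset, max_records=100):
        return self.records[offset:offset + max_records]


class Topic:
    """A named log split into partitions (all in one process here)."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def send(self, key, value):
        # Assumption: hash the key so records with the same key always land in
        # the same partition and therefore keep their relative order.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return p, self.partitions[p].append((key, value))


class Consumer:
    """Reads its assigned partitions and tracks its own offsets."""

    def __init__(self, topic, assigned):
        self.topic = topic
        self.offsets = {p: 0 for p in assigned}

    def poll(self):
        out = []
        for p, offset in list(self.offsets.items()):
            batch = self.topic.partitions[p].read(offset)
            self.offsets[p] = offset + len(batch)
            out.extend(batch)
        return out


def assign(num_partitions, num_consumers):
    """Round-robin split of partitions across the consumers of one group."""
    return [list(range(i, num_partitions, num_consumers))
            for i in range(num_consumers)]


topic = Topic(num_partitions=4)
for i in range(8):
    topic.send(f"user-{i % 3}", f"event-{i}")

# One group of two consumers shares the partitions between its members.
group_a = [Consumer(topic, parts) for parts in assign(4, 2)]
group_reads = [c.poll() for c in group_a]

# An independent subscriber reads every partition; reading is not destructive.
audit = Consumer(topic, range(4))
audit_reads = audit.poll()
```

Because each consumer only stores an offset per partition, a second subscriber can read the same data without affecting the first, which is the multi-subscriber property the notes attribute to the log.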
