I Heart Log: Real-time Data and Apache Kafka

This presentation discusses how logs and stream processing can form a backbone for data flow, ETL, and real-time data processing. It describes the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure. It also discusses the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.

Published in: Engineering

  • Who are you? What is this talk about? What is a log and what is it good for? Exciting topic.
  • Producers and consumers are distributed.
  • First project was an open source clone of Amazon Dynamo (Project Voldemort). Makes explaining things easier.
  • 1 pipeline for database data; 1 pipeline for metrics; 1 pipeline for events; 1 pipeline for real-time processing; no pipeline for application logs; 300 ActiveMQ brokers.
  • 10,000 messages/sec * 100 byte messages = ~1MB/sec
  • 200 Kafka-related projects on GitHub. 1,000+ emails/month.
  • The log is the fundamental abstraction Kafka provides. You can use a log as a drop-in replacement for a messaging system, but it can also do a lot more.
  • What is a log? Traditional uses? Non-traditional uses…
  • Time-ordered. Semi-structured.
  • List of changes. Contents of a record don't matter. Indexed by “time”. Not an application log (i.e. a text file).
  • Data model of Kafka: a topic. Partitions can be spread over machines and replicated.
  • The whole system is one big distributed system
  • Paxos, ZooKeeper (Zab), Raft, etc. Traditional databases, HBase/Bigtable, Spanner, HDFS NameNode, etc. The log has two purposes: replication and consistency.
  • Very important problem
  • What if a replica is down? Ordering is important. Time is important.
  • The log is a list of changes. Key point: can re-create any point in time. In banking: credits and debits. In software: the version control changelog.
  • State-machine replication: log the incoming requests (logical logging)
  • Log the changed rows (physical logging)
  • Outside of distributed systems internals…
  • AKA ETL. Many systems. Event data. Most important problem for data-centric companies. Integration >> ML.
  • Two exacerbating factors
  • One-size-fits-all
  • Database cache coherency. Data deployment from Hadoop. Never get to full connectivity.
  • Metcalfe’s law. All data in multi-subscriber, real-time logs. The company is a big distributed system.
  • Batch is the dominant paradigm for data processing. Why? The first thing you want when you have real-time data streams is real-time transformations.
  • 1790. Collected data by. Networks => stream processing. 3,929,214 people. $44k. Horses and wagons are a high-latency, batch channel.
  • Service: one input = one output. Batch job: all inputs = all outputs. Stream computing: any window = output for that window.
  • Importance of the log: buffering, multi-subscriber. Output goes to a live serving system or another batch processing system (Hadoop, DWH). Examples: recommendations, email, monitoring, security.
  • Storm and Samza. About process management; both integrate with Kafka. MapReduce and HDFS.
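
The notes above describe a log as a time-ordered list of records, a topic as a set of partitions, and the log as multi-subscriber. A minimal in-memory sketch of that data model follows. It is not the Kafka API: the names `Partition`, `Topic`, and `Consumer`, and the CRC32 key hashing, are assumptions made for illustration only.

```python
import zlib
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Partition:
    """An append-only, ordered sequence of records indexed by offset."""
    records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to the new record

    def read(self, offset: int, max_records: int = 100) -> List[Tuple[int, Any]]:
        chunk = self.records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


class Topic:
    """A named log split into partitions (hypothetical, in-memory only)."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [Partition() for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        # Assumption: a stable hash of the key picks the partition, so all
        # records for one key stay in order within a single partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key: str, value: Any) -> Tuple[int, int]:
        p = self.partition_for(key)
        return p, self.partitions[p].append((key, value))


class Consumer:
    """Each subscriber tracks its own offsets, so readers are independent."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.positions = [0] * len(topic.partitions)

    def poll(self) -> List[Tuple[int, int, Any]]:
        out = []
        for p, partition in enumerate(self.topic.partitions):
            for offset, record in partition.read(self.positions[p]):
                out.append((p, offset, record))
                self.positions[p] = offset + 1
        return out
```

Because reading does not remove anything from the log, a second `Consumer` created later still sees every record from offset 0, which is what lets one log feed many subscribers.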
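
The notes on replication distinguish logging the incoming requests (state-machine replication, logical logging) from logging the changed rows (primary-backup, physical logging), and say the log can re-create any point in time. The sketch below illustrates both with a small key/value table, using the banking credits-and-debits framing from the notes. The request format and all function names are assumptions for illustration, not taken from the talk.

```python
from typing import Dict, List, Optional, Tuple

Request = Tuple[str, str, int]  # ("set" | "add", key, amount)
RowChange = Tuple[str, int]     # (key, new_value)


def apply_request(state: Dict[str, int], req: Request) -> None:
    """Deterministically apply one request to a replica's state."""
    kind, key, amount = req
    if kind == "set":
        state[key] = amount
    elif kind == "add":
        state[key] = state.get(key, 0) + amount
    else:
        raise ValueError(f"unknown request kind: {kind}")


def replay_requests(log: List[Request], upto: Optional[int] = None) -> Dict[str, int]:
    """Logical logging: every replica replays the same requests in order.

    Passing `upto` rebuilds the state as of that many log entries.
    """
    state: Dict[str, int] = {}
    for req in log[:upto]:
        apply_request(state, req)
    return state


def primary_execute(state: Dict[str, int], req: Request) -> RowChange:
    """Physical logging: the primary executes the request and logs the row."""
    apply_request(state, req)
    key = req[1]
    return key, state[key]


def replay_rows(log: List[RowChange], upto: Optional[int] = None) -> Dict[str, int]:
    """Backups copy the logged row values without re-executing requests."""
    state: Dict[str, int] = {}
    for key, value in log[:upto]:
        state[key] = value
    return state


requests = [("set", "balance", 100), ("add", "balance", -30), ("add", "balance", 5)]
primary: Dict[str, int] = {}
row_log = [primary_execute(primary, r) for r in requests]
```

Both replay functions end in the same final state when given the full log, and both can stop early to recover an earlier point in time. Ordering matters in each case: replaying the same entries in a different order can give a different result.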

    1. Real-time Data and Apache Kafka. Jay Kreps. I ♥ Log
    2. The Plan 1. Apache Kafka 2. Logs and Distributed Systems 3. Logs and Data Integration 4. Logs and Stream Processing
    3. Apache Kafka
    4. A brief history of Kafka
    5. Three principles 1. One pipeline to rule them all 2. Stream processing >> messaging 3. Clusters not servers
    6. Characteristics • Scalability of a filesystem – Hundreds of MB/sec/server throughput – Many TB per server • Guarantees of a database – Messages strictly ordered – All data persistent • Distributed by default – Replication – Partitioning model
    7. Kafka At LinkedIn • 175 TB of in-flight log data per colo • Low-latency: ~1.5 ms • Replicated to each datacenter • Tens of thousands of data producers • Thousands of consumers • 7 million messages written/sec • 35 million messages read/sec • Hadoop integration
    8. Open source • Apache Software Foundation • Very healthy usage outside LinkedIn • Broad base of committers • 30 clients in 15 languages • Great ecosystem of supporting tools
    9. The Plan 1. Apache Kafka 2. Logs and Distributed Systems 3. Logs and Data Integration 4. Logs and Stream Processing
    10. Kafka is about logs
    11. What is a log?
    12. Partitioning
    13. Logs: pub/sub done right
    14. Logs And Distributed Systems
    15. Example: A Fault-tolerant CEO Hash Table
    16. Operations Final State
    17. State-machine Replication
    18. Primary-backup
    19. What use is a log?
    20. The Plan 1. Apache Kafka 2. Logs and Distributed Systems 3. Logs and Data Integration 4. Logs and Stream Processing
    21. Data Integration
    22. Types of Data • Database data – Users, products, orders, etc • Events – Clicks, Impressions, Pageviews, etc • Application metrics – CPU usage, requests/sec • Application logs – Service calls, errors
    23. Systems at LinkedIn • Live Stores – Voldemort – Espresso – Graph – OLAP – Search – InGraphs • Offline – Hadoop – Teradata
    24. Bad
    25. Good
    26. Example: User views job
    27. The Plan 1. Apache Kafka 2. Logs and Distributed Systems 3. Logs and Data Integration 4. Logs and Stream Processing
    28. Stream Processing
    29. Stream processing is a generalization of batch processing
    30. Examples • Monitoring • Security • Content processing • Recommendations • Newsfeed • ETL
    31. Stream Processing = Logs + Jobs
    32. Systems Can Help
    33. Samza Architecture
    34. Example: Top Articles By Company
    35. Log-centric Architecture
    36. Kafka: http://kafka.apache.org · Samza: http://samza.incubator.apache.org · Log Blog: http://linkd.in/199iMwY · Me: http://www.linkedin.com/in/jaykreps, @jaykreps
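
The stream-processing slides describe a job that consumes a log and emits one output per window, with "Top Articles By Company" as the example. The sketch below is one way such a job could look as a plain Python generator over fixed-size (tumbling) windows. It is not Samza code: the event shape, the function names, and the assumption that events arrive in timestamp order are all made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[int, str, str]  # (timestamp_seconds, company, article)


def _top(counts: Dict[str, Counter], n: int) -> Dict[str, List[str]]:
    """Pick the n most-viewed articles for each company."""
    return {
        company: [article for article, _ in counter.most_common(n)]
        for company, counter in counts.items()
    }


def windowed_top_articles(
    events: Iterable[Event], window_seconds: int, n: int = 1
) -> Iterator[Tuple[int, Dict[str, List[str]]]]:
    """Yield (window_start, top articles per company) as each window closes.

    Assumes events arrive ordered by timestamp, as they would when read
    sequentially from a single log partition.
    """
    current = None
    counts: Dict[str, Counter] = defaultdict(Counter)
    for ts, company, article in events:
        window = ts - ts % window_seconds
        if current is not None and window != current:
            yield current, _top(counts, n)  # previous window is complete
            counts = defaultdict(Counter)
        current = window
        counts[company][article] += 1
    if current is not None:
        yield current, _top(counts, n)  # flush the final, possibly partial, window


views = [
    (0, "acme", "a1"),
    (10, "acme", "a1"),
    (20, "acme", "a2"),
    (30, "initech", "b1"),
    (65, "acme", "a2"),
]
results = list(windowed_top_articles(views, window_seconds=60))
```

Choosing a window longer than the whole input produces a single output covering all events, which is the sense in which a batch job is one special case of windowed stream processing.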
