Apache Kafka and the Rise of the Stream Data Platform
Jay Kreps
Experience at LinkedIn
2009: We want all our data in Hadoop!
Initial approach: “gut it out”
• Data coverage
• Many source systems
◦ Relational DBs
◦ Log files
◦ Metrics
◦ Messaging systems
• Many data formats
• Constant change
◦ New schemas
◦ New data sources
Problems
How does everything else work?
Database changes
User events
Application Logs & Operational Metrics
Messaging
This is a giant mess
An infrastructure solution?
Idea: Stream Data Platform
First Attempt: Messaging systems!
Problems
• Throughput
• Integration with batch systems
• Persistence
• Stream Processing
• Ordering guarantees
• Partitioning
Second Attempt: Build Kafka!
What does it do?
Commit Log Abstraction
Logs & Publish-Subscribe Messaging
A Kafka Topic
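A minimal sketch of what this abstraction looks like from the producer side, using the Apache Kafka Java client (not from the original deck; the broker address, topic name, and page-view payload are made up): each record is appended to the topic's log and acknowledged with the partition and offset at which it landed.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class PageViewProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Append a record to the "page-views" topic; records with the same key
                // go to the same partition and so stay in order relative to each other.
                ProducerRecord<String, String> record =
                    new ProducerRecord<>("page-views", "user-42", "/jobs/search");
                RecordMetadata meta = producer.send(record).get();  // wait for the append to be acknowledged
                System.out.printf("appended to partition %d at offset %d%n",
                                  meta.partition(), meta.offset());
            }
        }
    }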
Kafka: A Modern Distributed System for Streams
Scalability of a filesystem
◦ Hundreds of MB/sec/server throughput
◦ Many TB per server
Guarantees of a database
◦ Messages strictly ordered (within a partition)
◦ All data persistent
Distributed by default
◦ Replication
◦ Partitioning model
Producers, Consumers, and Brokers all fault tolerant and horizontally scalable
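As a rough companion sketch of the consumer side (again an illustration, not part of the slides; the topic name and group id are hypothetical, and it uses the 0.9-style Java consumer mentioned later in the deck): each partition is an ordered, persisted log, and a consumer simply tracks its position in each partition by offset, which is what makes consumers cheap to scale out and restart.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PageViewConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
            props.put("group.id", "page-view-readers");        // consumers in a group divide the partitions among themselves
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-views"));
                while (true) {
                    // Records come back in offset order within each partition.
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                          record.partition(), record.offset(),
                                          record.key(), record.value());
                    }
                }
            }
        }
    }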
Stream Data Platform
Batch Data => Batch Processing
Stream processing is a generalization of batch processing and request/response processing
Request/Response processing:
One input => One output
Batch processing:
All inputs => All outputs
Stream Processing:
Some inputs => some outputs
(you choose how much “some” is)
Stream Processing a la carte
cat input | grep "foo" | wc -l
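The streaming analogue of that pipeline, sketched with the plain Kafka clients (assumed wiring, not from the deck; the "input" and "foo-counts" topic names are hypothetical): read the input topic continuously, keep only lines containing "foo", and publish a running count. Unlike wc -l, the count never finishes, because the stream never ends.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class GrepFooCount {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
            consumerProps.put("group.id", "grep-foo");
            consumerProps.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            long count = 0;  // running count, the streaming stand-in for `wc -l`
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(Collections.singletonList("input"));  // `cat input`
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(1000)) {
                        if (record.value() != null && record.value().contains("foo")) {  // `grep "foo"`
                            count++;
                            producer.send(new ProducerRecord<>("foo-counts", Long.toString(count)));
                        }
                    }
                }
            }
        }
    }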
Stream Processing with Frameworks
[Diagram: Streams + Processing = Stream Processing]
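For a feel of how the same logic looks inside a framework, here is a hedged sketch using Apache Samza's low-level StreamTask API (Samza is the stream processor built alongside Kafka at LinkedIn). The class name and output stream are illustrative, and the job configuration that binds the task to its input topic and provides partitioning, distribution, and fault tolerance is omitted.

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    // Counts messages containing "foo" and emits the running count to a Kafka topic.
    // The framework calls process() once per message; the task only expresses the logic.
    public class FooCountTask implements StreamTask {
        private static final SystemStream OUTPUT = new SystemStream("kafka", "foo-counts");
        private long count = 0;

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            String line = (String) envelope.getMessage();
            if (line != null && line.contains("foo")) {
                count++;
                collector.send(new OutgoingMessageEnvelope(OUTPUT, Long.toString(count)));
            }
        }
    }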
LinkedIn’s Data Architecture
Usage at LinkedIn
• Everything in the company is a real-time stream
• > 1.2 trillion messages written per day
• > 3.4 trillion messages read per day
• ~ 1 PB of stream data
• Tens of thousands of producer processes
• Backbone for data stores
◦ Search
◦ Social Graph
◦ Newsfeed
◦ Primary storage (in progress)
• Basis for stream processing
Elsewhere
• Mission: Make this a practical reality everywhere
• Product: Confluent Platform
◦ Apache Kafka
◦ Schemas and metadata management
◦ Connectors for common systems
◦ Monitor data flow end-to-end
◦ Stream processing integration
Sneak Preview
• 0.9 Release
◦ Security
◦ Quotas
◦ New consumer client
◦ Connector framework
• Confluent Platform 2.0
◦ C Client
◦ Connectors
◦ etc.
Resources
• Confluent
◦ @confluentinc
◦ http://confluent.io
◦ http://blog.confluent.io/2015/02/25/stream-data-platform-1
• Apache Kafka
◦ @apachekafka
◦ http://kafka.apache.org
◦ http://linkd.in/199iMwY
• Me
◦ @jaykreps

An evening with Jay Kreps, author of Apache Kafka, Samza, Voldemort & Azkaban.

Editor's Notes

  • #3 Year = 2009: This story takes place at LinkedIn. I was on the infra team. Previously had built a distributed key-value store. Was leading the Hadoop adoption effort.
  • #4 Had done initial prototypes, had some valuable things. Want a Data Lake/Hub.
  • #5 Various ad hoc load jobs. Manual parsing of each new dataset we got into the target schema. 14% of data. Not the kind of team I wanted.
  • #6 Data grandfathering. 14% of data in Hadoop. N new non-Hadoop engineers imply O(N) Hadoop engineers.
  • #7 Got interested in things like schemas, metadata, dataflow, etc.
  • #8 Real-time, mostly lossless (couldn't go back in time). Batch, mostly lossless, high latency. Why CSV dumps?
  • #9 Batch aggregation. Lossy, high-latency, only went to the data warehouse/Hadoop.
  • #10 Splunk
  • #11 Low-throughput, lossy, no real scalability story. No central system. No integration with the batch world.
  • #12 Reality is about 100x more complex: 300 services, ~100 databases, multi-datacenter. Trolling: load into Oracle, search, etc. Publish data from Hadoop to a search index. Run a SQL query to find the biggest latency bottleneck. Run a SQL query to find common error patterns. Low-latency monitoring of database changes or user activity. Incorporate popularity in real-time display and relevance algorithms. Products that incorporate user activity.
  • #13 Not tenable: new systems and data being created faster than we can integrate them. Not all problems are solvable with infrastructure. Extract the common pattern…
  • #14 Had Hadoop: great, that can be our warehouse/archive. But what about real-time data and real-time processing? All the messed-up stuff? Files are really just a stream of updates stored together.
  • #15 This problem is solved…don’t reinvent the wheel!
  • #16 Not suitable for ETL. Not suitable for high throughput.
  • #17 Reinvent the wheel! Initial time estimate: 3 months. 30-second intro to Kafka.
  • #18 Producers, consumers, topics. Like a messaging system.
  • #19 Different: a commit log. Like a file that goes on forever. Stolen from distributed database internals. Key abstraction for systems, real-time processing, and data integration. Formalization of a stream.
  • #20 Very fast publish-subscribe mechanism. Unifies batch and real-time processing.
  • #22 Built like a modern distributed system.
  • #23 Let's dive into the elements of this platform. One of the big things this enables is stream processing…let's dive into that. Truviso.
  • #24 The first thing you want when you have real-time data streams is real-time transformations. The 1790 census: 3,929,214 people, $44k, data collected by horses and wagons, which are a high-latency, batch channel. Networks => stream processing.
  • #25 Service: one input = one output. Batch job: all inputs = all outputs. Stream computing: any window = output for that window.
  • #26 REST, RPC, etc
  • #27 REST, RPC, etc
  • #28 Batch processing => stream processing with window of 1 day
  • #29 Like a Unix pipe, streams data between programs. Distributed, fault-tolerant, elastically scalable. A modern version of Unix pipes.
  • #30 Streams + Processing = Stream Processing. HDFS analogy. Most stream processing systems use Kafka; some require it (rollback recovery).
  • #31 Stream Data Platform. Trace data flow.
  • #32 Everything that happens in the company is a real-time stream
  • #33 Describe adoption cycle
  • #34 There's only one thing you can do if you think the world needs to change and you live in Silicon Valley: quit your job and do it.