Data Pipelines with Apache Kafka


Introduction to Kafka and some of the systems you can build with it.


  Data Pipelines with Apache Kafka. Ben Stopford, @confluentinc
  Today • What is Kafka? (High level fluffy stuff) • What makes it tick? (Low level geeky stuff) • How can you use it? (Architect oriented stuff)
  What is Kafka?
  Kafka: a Streaming Platform. Components: the Log, Connectors, Producer, Consumer, Streaming Engine
  The Log: scalable, fault tolerant, concurrent, strongly ordered, stateful
  Clients: JVM and C native implementations; Go, Python and many more open source clients
  Connectors: plug into your database of choice
  Streaming Engine: the declarative power of a database, wrapped into a Kafka client
  Today we'll focus on Kafka: the distributed log
  The log is a type of messaging system
  What is messaging in essence? • Take a message, keep it safe, make it available to consumers. • Track which messages have been consumed. Kafka attacks these two problems separately.
  What is a message broker in essence? Sender -> Broker (the log) -> Receiver
  The log is a simple idea: messages are added at the end of the log. Just think of the log as a file, ordered from old to new.
  Consumers have a position: Sally, George and Fred each sit at their own place in the log and scan forward from there, old to new.
  Only sequential access: read to an offset, then scan.
  No random access: Kafka avoids indexes by keeping the approach simple (indexes impede scalability in this context).
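The log-and-position idea from the last few slides can be sketched in a few lines. This is a toy in-memory model, not Kafka's implementation; the `Log` and `Consumer` classes and their methods are hypothetical names chosen for illustration.

```python
class Log:
    """Append-only log: messages are only ever added at the end."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Add a message at the end and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Scan forward from an offset; there is no lookup by key."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer tracks its own position; the log itself does not."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)  # advance past what was just read
        return batch


log = Log()
for message in ["a", "b", "c", "d"]:
    log.append(message)

sally = Consumer(log)             # starts at the oldest message
george = Consumer(log, offset=2)  # starts further along the log

sally_batch = sally.poll(3)
george_batch = george.poll(3)
```

Because storing messages and tracking positions are separate concerns, adding another reader costs the log nothing more than one extra integer held by that reader.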
  Topics are broadcast: the broker delivers each message to every consumer
  Can also behave as a queue: Sender -> Receiver
  The problem: If you built a messaging system for internet scale, what would it look like?
  Shard data to get scalability: producers (1), (2) and (3) send messages to different partitions, and the partitions live on different machines in the cluster.
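A minimal sketch of sharding by key, assuming messages carry a key and that the partition is chosen by hashing it. The CRC32 hash here is a stand-in picked because it ships with Python (Kafka's Java client hashes keys differently), and `partition_for` and `send` are hypothetical helpers.

```python
import zlib

NUM_PARTITIONS = 3  # assumed partition count for this example

# One list per partition; in a real cluster these live on different machines.
partitions = {p: [] for p in range(NUM_PARTITIONS)}


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition deterministically."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def send(key: str, value: str) -> int:
    """Append the message to the partition chosen by its key."""
    p = partition_for(key)
    partitions[p].append((key, value))
    return p


placements = [
    send(key, value)
    for key, value in [
        ("user-1", "login"),
        ("user-2", "login"),
        ("user-1", "click"),
        ("user-1", "logout"),
    ]
]
```

Every message for `user-1` lands in the same partition, so ordering is preserved per key even though the topic as a whole is spread across machines.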
  Replicate to get fault tolerance: (1) a message written to the leader on Machine A is replicated to Machine B; (2) if Machine A is lost, mastership moves to Machine B.
  Kafka goes a step further: a single topic can be spread over multiple consumers (4 consuming machines process a single topic).
  Linearly scalable architecture. A single topic can have many producer machines, many consumer machines and many broker machines. No bottleneck!
  Distributed commit log: different from a traditional messaging system
  Data is replicated
  Strong consistency: 3 replicas on different machines. • Only 1 elected leader • Only the leader can be written to or read from
  Replication provides resiliency: another replica takes over on machine failure
  Replication protocol: send message
  Optimistic write (single machine delivery): send the message, get the ack (optimistic)
  Pessimistic write (wait for replication to complete): send the message, get the ack (pessimistic)
  Replication protocol: messages can be read only after replication completes
  Replication protocol: the number of replicas is a soft quorum (set min/max tolerable values)
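The replication slides can be summarised with a toy model, under simplifying assumptions: replication is synchronous and triggered by hand, and the "soft quorum" is reduced to a single minimum count. The `Partition` class, `NotEnoughReplicas` exception and `min_insync` parameter are hypothetical names for this sketch, not Kafka's API.

```python
class NotEnoughReplicas(Exception):
    """Raised when a pessimistic write cannot reach the required replicas."""


class Partition:
    def __init__(self, followers=2, min_insync=2):
        self.leader = []                           # the leader's log
        self.followers = [[] for _ in range(followers)]
        self.in_sync = [True] * followers          # which followers are keeping up
        self.min_insync = min_insync               # lower bound of the soft quorum

    def _insync_logs(self):
        live = [f for f, ok in zip(self.followers, self.in_sync) if ok]
        return [self.leader] + live

    def replicate(self):
        """Copy the leader's log to every in-sync follower."""
        for follower, ok in zip(self.followers, self.in_sync):
            if ok:
                follower[:] = self.leader

    def write(self, message, wait_for_replication):
        """Optimistic: ack after the leader write. Pessimistic: ack after replication."""
        if wait_for_replication and len(self._insync_logs()) < self.min_insync:
            raise NotEnoughReplicas()
        self.leader.append(message)
        if wait_for_replication:
            self.replicate()
        return len(self.leader) - 1                # the "ack": the message's offset

    def read(self):
        """Readers only see messages present on all in-sync replicas."""
        high_watermark = min(len(log) for log in self._insync_logs())
        return self.leader[:high_watermark]


p = Partition()
p.write("m1", wait_for_replication=False)    # optimistic: acked, not yet replicated
visible_before = p.read()
p.write("m2", wait_for_replication=True)     # pessimistic: replicated before the ack
visible_after = p.read()
```

The optimistic write returns sooner but the message stays invisible to readers until the followers catch up; the pessimistic write pays that latency up front.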
  Replication is used for resiliency. No need to flush to disk synchronously. You can flush if you wish, but no one does.
  Advanced Features
  Consumers cluster too! (diagram: consumers grouped into Consumer Group 1)
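One way to picture a consumer group is as a mapping from partitions to group members. The sketch below uses plain round-robin assignment as a stand-in; Kafka ships several assignment strategies, and the `assign` function here is a hypothetical helper, not one of them.

```python
def assign(partitions, members):
    """Spread partitions over the members of one consumer group, round-robin."""
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        owner = members[i % len(members)]
        assignment[owner].append(partition)
    return assignment


# Two consumers in one group split a four-partition topic between them.
group_of_two = assign(range(4), ["c1", "c2"])

# If c2 fails, the group is rebalanced and c1 takes over everything.
after_failure = assign(range(4), ["c1"])
```

Within a group each partition has exactly one owner, which gives queue-like behaviour; a second group would receive its own full copy of the topic, which gives broadcast.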
  Compacted topics (tabular view): the full log holds all versions of each key (Version 1, Version 2, Version 3, ...); after compaction only the latest version of each key remains.
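Compaction can be illustrated as "keep the last record for each key". This is a simplified sketch on an in-memory list; the real process runs in the background on the broker, and the `compact` function below is a hypothetical name.

```python
def compact(log):
    """Keep only the most recent record for each key, preserving log order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [log[offset] for offset in sorted(latest_offset.values())]


full_log = [
    ("k1", "v1"),
    ("k2", "v1"),
    ("k1", "v2"),
    ("k3", "v1"),
    ("k1", "v3"),
    ("k2", "v2"),
]

compacted = compact(full_log)
```

The compacted log still replays into the same final table as the full log, which is why a compacted topic can serve as the source of truth for keyed data.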
  Multi-tenancy: users are isolated using security features; bandwidth is segregated per user
  Use Cases
  Microservice Backbone
  Always-on, event-driven services around the Log (streams & tables): ingestion services, services with polyglot persistence, simple services, streaming services
  Event Buffer
  Many producers, small messages -> Kafka -> Hadoop etc.
  Stream Processing for enrichment & transformation
  Kafka Streams example: an Orders stream is joined with a Customer (compacted) stream inside Kafka Streams, and a dashboard queries the result. Join, aggregate; intermediary state is stored in Kafka.
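The shape of that join can be sketched without any Kafka dependency. The code below is plain Python, not the Kafka Streams API (which is a Java library); the record fields and the `build_table`, `join` and `total_by_customer` helpers are assumptions made for illustration.

```python
def build_table(updates):
    """Replay a compacted-style stream of (key, value) updates into a table."""
    table = {}
    for key, value in updates:
        table[key] = value  # the latest update for a key wins
    return table


def join(orders, customers):
    """Enrich each order with its customer; orders with no match are dropped."""
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is not None:
            yield {**order, "customer_name": customer["name"]}


def total_by_customer(enriched):
    """Aggregate order amounts per customer name, as a dashboard might."""
    totals = {}
    for row in enriched:
        name = row["customer_name"]
        totals[name] = totals.get(name, 0) + row["amount"]
    return totals


customer_updates = [
    ("c1", {"name": "Ann"}),
    ("c2", {"name": "Bob"}),
    ("c1", {"name": "Ann B."}),  # a newer version of customer c1
]
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 20},
    {"order_id": 2, "customer_id": "c2", "amount": 5},
    {"order_id": 3, "customer_id": "c1", "amount": 10},
    {"order_id": 4, "customer_id": "c9", "amount": 7},  # unknown customer
]

customers = build_table(customer_updates)
dashboard = total_by_customer(join(orders, customers))
```

In this sketch the table lives in a local dict; on the slide, the equivalent intermediary state is stored back in Kafka so it survives a restart.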
  Stream Data Platform (Kappa Architecture)
  All your data -> Stream data platform (Connectors, Kafka, Stream processor) -> Views -> Clients
  Kafka: a Streaming Platform. Components: the Log, Connectors, Producer, Consumer, Streaming Engine
  The end @benstopford