Data Pipelines with Apache Kafka


Introduction to Kafka and some of the systems you can build with it.

Published in: Technology

  1. Data Pipelines with Apache Kafka. Ben Stopford, @confluentinc
  2. Today: • What is Kafka? (High-level fluffy stuff) • What makes it tick? (Low-level geeky stuff) • How can you use it? (Architect-oriented stuff)
  3. What is Kafka?
  4. Kafka: a Streaming Platform. [Diagram: The Log, Connectors, Producer, Consumer, Streaming Engine]
  5. The Log: scalable, fault tolerant, concurrent, strongly ordered, stateful.
  6. Clients: JVM and C native implementations; Go, Python and many more open-source clients.
  7. Connectors: plug into your database of choice.
  8. Streaming Engine: the declarative power of a database, wrapped into a Kafka client.
  9. Today we'll focus on Kafka: the distributed log.
  10. The log is a type of messaging system.
  11. What is messaging in essence? • Take a message, keep it safe, make it available to consumers. • Track which messages have been consumed. Kafka attacks these problems separately.
  12. What is a message broker in essence? [Diagram: Sender, Broker (the log), Receiver]
  13. The log is a simple idea: messages are added at the end of the log. Just think of the log as a file. [Diagram: log running from Old to New]
  14. Consumers have a position. [Diagram: Sally, George and Fred each have their own position in the log, between Old and New, and scan from it]
  15. Only sequential access: read to an offset, then scan. [Diagram: log running from Old to New]
  16. No random access: Kafka avoids indexes by keeping the approach simple (indexes impede scalability in this context). [Diagram: Index, Disk]
  17. Topics are broadcast. [Diagram: Broker broadcasts to each Consumer]
  18. Can also behave as a queue. [Diagram: Sender, Receiver]
  19. The problem: if you built a messaging system for internet scale, what would it look like?
  20. Shard data to get scalability: messages are sent to different partitions, and partitions live on different machines. [Diagram: Producers (1), (2) and (3) writing to a cluster of machines]
  21. Replicate to get fault tolerance. [Diagram: (1) replicate msg between the leader and another machine, Machine A and Machine B; (2) mastership moves machines]
  22. Kafka goes a step further: a single topic can be spread over multiple consumers (4 consuming machines process a single topic).
  23. Linearly scalable architecture. Single topic: many producer machines, many consumer machines, many broker machines. No bottleneck!
  24. Distributed commit log: different to a traditional messaging system.
  25. Data is replicated.
  26. Strong consistency: 3 replicas on different machines. • Only 1 elected leader • Only the leader can be written to or read from. [Diagram: Send Message]
  27. Replication provides resiliency: another replica takes over on machine failure.
  28. Replication protocol. [Diagram: Send Message]
  29. Optimistic write (single-machine delivery): send message, get ack (optimistic).
  30. Pessimistic write (wait for replication to complete): send message, get ack (pessimistic).
  31. Replication protocol: messages can be read only after replication completes. [Diagram: Writer, Reader]
  32. Replication protocol: the number of replicas is a soft quorum (set min/max tolerable values). [Diagram: Writer, Reader]
  33. Replication is used for resiliency. No need to flush to disk synchronously. You can flush if you wish, but no one does.
  34. Advanced Features
  35. Consumers cluster too! [Diagram: Consumer Group 1, Consumer Group 1]
  36. Consumers cluster too!
  37. Compacted topics (tabular view). [Diagram: "All versions" shows three keys holding versions 1 to 3, 1 to 2 and 1 to 5; "Latest key only" keeps just versions 3, 2 and 5]
  38. Multi-tenancy: users isolated using security features; bandwidth segregated per user.
  39. Use Cases
  40. Microservice backbone
  41. Always-on, event-driven services. [Diagram: The Log (streams & tables) connecting ingestion services, services with polyglot persistence, simple services and streaming services]
  42. Event buffer
  43. Many producers, small messages. [Diagram: Kafka feeding Hadoop etc.]
  44. Stream processing for enrichment & transformation
  45. Kafka Streams example: join, aggregate; intermediary state stored in Kafka. [Diagram: Orders stream and Customer stream (compacted) held in Kafka, joined in Kafka Streams, serving a dashboard query]
  46. Stream Data Platform (Kappa Architecture)
  47. Stream data platform. [Diagram: all your data, connectors, Kafka, stream processor, views, clients]
  48. Kafka: a Streaming Platform. [Diagram: The Log, Connectors, Producer, Consumer, Streaming Engine]
  49. The end. @benstopford
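
Slides 13 to 17 describe the log as an append-only file in which every consumer keeps its own position and scans forward from it. A minimal in-memory sketch of that idea follows. It is not Kafka's API: the `Log` and `Consumer` classes, their method names and the sample messages are all hypothetical, chosen only to illustrate the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Hypothetical append-only log; the list index plays the role of the offset."""
    messages: list = field(default_factory=list)

    def append(self, message):
        # New messages only ever go on the end; return the offset they were given.
        self.messages.append(message)
        return len(self.messages) - 1

    def read_from(self, offset):
        # Sequential access only: jump to an offset, then scan forward.
        return self.messages[offset:]


@dataclass
class Consumer:
    """Hypothetical consumer that tracks its own position in the log."""
    name: str
    position: int = 0

    def poll(self, log):
        # Read everything from the current position and advance past it.
        batch = log.read_from(self.position)
        self.position += len(batch)
        return batch


log = Log()
for message in ["m0", "m1", "m2"]:
    log.append(message)

sally = Consumer("Sally")            # starts at the old end of the log
fred = Consumer("Fred", position=2)  # starts further along

sally_batch = sally.poll(log)
fred_batch = fred.poll(log)

log.append("m3")
sally_next = sally.poll(log)
```

Because each consumer owns its position, the same log is broadcast to every reader: Sally and Fred both see `m2` and neither removes it from the log.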
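
Slides 20, 22 and 35 describe sharding a topic into partitions and spreading those partitions over a group of consumers. The sketch below shows one way this could look, under stated assumptions: `partition_for` and `assign` are hypothetical helpers, CRC32 stands in for whatever hash a real client uses, and the round-robin assignment is only one possible strategy.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition (CRC32 is an illustrative stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, consumers):
    """Spread partitions over the consumers of one group, round robin."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


NUM_PARTITIONS = 4

# The same key always lands on the same partition, so per-key order is kept.
first = partition_for("customer-42", NUM_PARTITIONS)
second = partition_for("customer-42", NUM_PARTITIONS)

# Two consumers in one group share the four partitions between them.
group = assign(list(range(NUM_PARTITIONS)), ["consumer-1", "consumer-2"])
```

Adding a consumer to the group only changes the output of `assign`; producers keep hashing keys to the same partitions.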
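
Slide 37 contrasts a topic holding all versions of each key with a compacted topic holding only the latest. A small sketch of that idea, assuming records are plain `(key, value)` tuples; the `compact` function and the sample data are hypothetical, and real compaction runs in the background on log segments rather than in one pass.

```python
def compact(records):
    """Keep only the most recent record per key, in original log order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(records):
        latest_offset[key] = offset  # later offsets overwrite earlier ones
    return [
        record
        for offset, record in enumerate(records)
        if latest_offset[record[0]] == offset
    ]


all_versions = [
    ("a", "v1"), ("b", "v1"), ("a", "v2"),
    ("c", "v1"), ("b", "v2"), ("a", "v3"),
]

compacted = compact(all_versions)

# Replaying the compacted log rebuilds the "tabular view": one row per key.
table = dict(compacted)
```

Replaying either the full log or the compacted one into a dictionary gives the same table, which is why a compacted topic can stand in for a table of current state.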