
Meet Apache Kafka : data streaming in your hands


Apache Kafka has emerged as a leading platform for building real-time data pipelines. It provides support for high-throughput/low-latency messaging, as well as sophisticated development options that cover all the stages of a distributed data streaming pipeline, from ingestion to processing. During this session we'll introduce what Apache Kafka is, its architecture and how it works, and finally the entire related ecosystem, including Kafka Connect and Kafka Streams, which make this platform useful in many different scenarios.


  1. Meet Apache Kafka: data streaming in your hands. Paolo Patierno, Principal Software Engineer @ Red Hat. 21/03/2018
  2. Who am I? @ppatierno
     ● Principal Software Engineer @ Red Hat
       ○ Messaging & IoT team
     ● Lead/Committer @ Eclipse Foundation
       ○ Hono, Paho and Vert.x projects
     ● Microsoft MVP Azure/IoT
     ● Member of the DevDay community
     ● Hacking low-constrained devices in spare time (from a previous life)
     ● Valentino Rossi fan!
  3. (image-only slide)
  4. Containers orchestration with Kubernetes/OpenShift
     ● DevDay meetup
     ● March 28th, Naples
     ● 18:45 - 20:30
     ● Re.work - Isola E2 (Centro Direzionale)
  5. What is Kafka? At the beginning: “… a publish/subscribe messaging system …”
  6. What is Kafka? Today: “… a streaming data platform …”
  7. What is Kafka? But at the core: “… a distributed, horizontally-scalable, fault-tolerant, commit log …”
  8. What is Kafka?
     ● Developed at LinkedIn back in 2010, open sourced in 2011
     ● Designed to be fast, scalable, durable and available
     ● Distributed by nature
     ● Data partitioning (sharding)
     ● High throughput / low latency
     ● Ability to handle a huge number of consumers
  9. Applications data flow: without using Kafka (diagram: applications, web, microservices, logging, monitoring and analytics systems wired together point-to-point)
  10. Applications data flow: Kafka as a central hub (diagram: the same applications, web, microservices, monitoring and analytics all connected through Apache Kafka)
  11. Common use cases
     ● Messaging: as with “traditional” messaging systems
     ● Website activity tracking: events like page views, searches, …
     ● Operational metrics: alerting and reporting on operational metrics
     ● Log aggregation: collect logs from multiple services
     ● Stream processing: read, process and write streams for real-time analysis
  12. Messaging: “traditional” brokers
     ● Have to track consumer status and dispatch messages
     ● Delete messages after delivery, no long-term persistence
       ○ Support for TTL
     ● No feature for re-reading messages
     ● Consumers are fairly simple
     ● Support for open and standard protocols (AMQP, MQTT, JMS, STOMP, …)
     ● Selectors
     ● Best fit for the request/reply pattern
  13. Apache Kafka ecosystem
     ● Apache Kafka has an ecosystem consisting of many components / tools
       ○ Kafka Core
         ■ Broker
         ■ Clients library (producer, consumer, admin)
       ○ Kafka Connect
       ○ Kafka Streams
       ○ Mirror Maker
  14. Apache Kafka ecosystem: Apache Kafka components
     ● Kafka Broker
       ○ Central component responsible for hosting topics and delivering messages
       ○ One or more brokers run in a cluster alongside a Zookeeper ensemble
     ● Kafka Producers and Consumers
       ○ Java-based clients for sending and receiving messages
     ● Kafka Admin tools
       ○ Java- and Scala-based tools for managing Kafka brokers
       ○ Managing topics, ACLs, monitoring etc. (see the sketch below)
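For instance, topics can be created programmatically through the Java AdminClient from the Kafka clients library. A minimal sketch; the broker address, topic name, partition count and replication factor are illustrative placeholders:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // placeholder broker address
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // illustrative topic: 3 partitions, replication factor 2
            NewTopic topic = new NewTopic("my-topic", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```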
  15. Kafka & Zookeeper (diagram: applications and admin tools talk to the Kafka cluster, which coordinates through a Zookeeper ensemble)
  16. Kafka ecosystem: Apache Kafka components
     ● Kafka Connect
       ○ Framework for transferring data between Kafka and other data systems
       ○ Facilitates data conversion, scaling, load balancing, fault tolerance, …
       ○ Connector plugins are deployed into a Kafka Connect cluster
       ○ Well-defined API for creating new connectors (with Sink/Source)
       ○ Apache Kafka itself includes only FileSink and FileSource plugins (reading records from a file and posting them as Kafka messages / writing Kafka messages to files)
       ○ Many additional plugins are available outside of Apache Kafka
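For illustration, the bundled file source plugin can be driven by a small properties file when running Connect in standalone mode; a sketch along the lines of the standard quickstart (file path, topic and connector name are placeholders):

```properties
# connect-file-source.properties (illustrative)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/test.txt
topic=connect-test
```

A file like this is passed to bin/connect-standalone.sh together with the worker configuration.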
  17. Kafka ecosystem: Apache Kafka components
     ● Kafka Streams
       ○ Stream processing framework
       ○ Streams are Kafka topics (as input and output)
       ○ It’s really just a Java library to include in your application
       ○ Scales the stream application horizontally
       ○ Creates a topology of processing nodes (filter, map, join etc.) acting on a stream
       ○ Low-level Processor API
       ○ High-level DSL (see the sketch below)
       ○ Uses “internal” topics (when re-partitioning is needed or for “stateful” transformations)
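A minimal sketch of the high-level DSL, assuming string keys and values and illustrative topic names: read a stream, apply a stateless transformation, and write the result to another topic.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        // application id doubles as consumer group and internal topic prefix
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase()) // stateless processing node
              .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```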
  18. Kafka Streams API (diagram: a Streams API application consuming from input topics, processing, and producing to output topics)
  19. Kafka ecosystem: Extract … Transform … Load (diagram: Source → Connect API → Streams API → Connect API → Sink)
  20. Kafka ecosystem: Apache Kafka components
     ● Mirror Maker
       ○ Kafka clusters do not work well when split across multiple datacenters
       ○ Low bandwidth, high latency
       ○ With multiple datacenters it is recommended to set up an independent cluster in each datacenter and mirror the data
       ○ Tool for replicating topics between different clusters
  21. Topic & Partitions: topic anatomy (diagram, from the official Kafka documentation: a topic with Partition 0, 1 and 2 as append-only logs; writes go to the new end, reads start from the old end)
  22. Replication: Leaders & Followers
     ● Replicas are “backups” for a partition
       ○ They provide redundancy
     ● It’s the way Kafka guarantees availability and durability in case of node failures
       ○ Kafka tolerates “replicas - 1” dead brokers before losing data
     ● Two roles:
       ○ Leader: the replica used by producers/consumers for exchanging messages
       ○ Followers: all the other replicas
         ■ They don’t serve client requests
         ■ They replicate messages from the leader to stay “in-sync” (ISR)
       ○ A replica changes its role as brokers come and go
     ● Message ordering is guaranteed only “per partition”
  23. Partitions Distribution: Leaders & Followers (diagram: partitions T1-P1, T1-P2, T2-P1 and T2-P2 with their replicas spread across Broker 1, Broker 2 and Broker 3)
     ● Leaders and followers are spread across the cluster
       ○ Producers/consumers connect to leaders
       ○ Multiple connections are needed for reading different partitions
  24. Partitions Distribution: leader election (diagram: the same cluster after a node failure)
     ● A broker hosting a leader partition goes down
     ● A new leader partition is elected on a different node
  25. Messages: RecordBatch & Record
     ● Messages are “batched”
     ● Messages are simply byte arrays
       ○ Timestamp: create time or log-append time
       ○ Key: used for specifying the destination partition
       ○ Value: the custom payload
       ○ Headers: key-value pairs
     (diagram: a RecordBatch carries a first offset, length and attributes, plus a list of Records, each with length, timestamp, offset, key, value and headers)
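In the Java producer API these fields map onto ProducerRecord; a small sketch with placeholder topic, key, value and header names:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordExample {
    public static void main(String[] args) {
        // A record addressed by key; the partition is derived from the key,
        // the timestamp is assigned by the client or broker
        ProducerRecord<String, String> record =
                new ProducerRecord<>("my-topic", "sensor-1", "21.5");

        // Headers are free-form key-value pairs attached to the record
        record.headers().add("content-type",
                "text/plain".getBytes(StandardCharsets.UTF_8));
        System.out.println(record);
    }
}
```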
  26. Clients
     ● They are really “smart” (unlike in “traditional” messaging)
     ● Configured with a bootstrap servers list for fetching the initial metadata
       ○ Where are the topics of interest? Connect to the brokers holding the partition leaders
       ○ The producer specifies the destination partition
       ○ The consumer handles the message offsets to read
       ○ If an error happens, refresh the metadata (something has changed in the cluster)
     ● Batching on producing/consuming
  27. Producers
     ● Destination partition computed on the client
       ○ Round robin
       ○ By hashing the “key” of the message
       ○ Custom partitioning
     ● Writes messages to the “leader” of a partition
     ● Acknowledgement:
       ○ No ack
       ○ Ack when the message is written to the “leader”
       ○ Ack when the message is also replicated to the “in-sync” replicas
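A minimal producer sketch using the Java client (broker address, topic and key are placeholders); acks=all requests an acknowledgement only after the message is replicated to the in-sync replicas:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "0" = no ack, "1" = leader only, "all" = leader + in-sync replicas
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key hash to the same partition
            producer.send(new ProducerRecord<>("my-topic", "key-1", "hello"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("written to %s-%d@%d%n",
                                    metadata.topic(), metadata.partition(), metadata.offset());
                        }
                    });
        }
    }
}
```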
  28. Consumers
     ● Read from one (or more) partition(s)
     ● Track (commit) the offset for a given partition
       ○ The partitioned topic “__consumer_offsets” is used for that
       ○ Key → [group, topic, partition], Value → [offset]
       ○ The offset is shared inside the consumer group
     ● QoS
       ○ At most once: read message, commit offset, process message
       ○ At least once: read message, process message, commit offset
       ○ Exactly once: read message, commit message output and offset to a transactional system
     ● Gets only “committed” messages (depends on the producer “ack” level)
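An at-least-once consumer sketch with the Java client: auto-commit is disabled, records are processed first, and the offset is committed afterwards (broker address, group id and topic are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group"); // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // process first ...
                }
                consumer.commitSync(); // ... then commit: at-least-once
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s-%d@%d: %s%n", record.topic(), record.partition(),
                record.offset(), record.value());
    }
}
```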
  29. Producers & Consumers: writing/reading to/from leaders (diagram: producers and consumers connected to the partition leaders across Broker 1, Broker 2 and Broker 3)
  30. Consumer: partitions assignment. Available approaches
     ● The consumer asks for a specific partition (assign)
       ○ An application using one or more consumers has to handle such assignment on its own, and the scaling as well
     ● The consumer is part of a “consumer group” (subscribe)
       ○ Consumer groups are an easier way to scale up consumption
       ○ One of the consumers, as “group lead”, applies a strategy to assign partitions to the consumers in the group
       ○ When new consumers join or leave, a rebalancing happens to reassign partitions
       ○ This allows pluggable strategies for partition assignment (e.g. stickiness)
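Both approaches as they appear in the Java consumer API; a sketch with a placeholder topic name:

```java
import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AssignmentApproaches {
    static void manualAssignment(KafkaConsumer<String, String> consumer) {
        // Manual: the application picks partition 0 itself; no group, no rebalancing
        consumer.assign(Collections.singleton(new TopicPartition("my-topic", 0)));
    }

    static void groupSubscription(KafkaConsumer<String, String> consumer) {
        // Group-managed: partitions are assigned by the group protocol and
        // reassigned on rebalance (requires group.id to be set)
        consumer.subscribe(Collections.singleton("my-topic"));
    }
}
```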
  31. Consumer Groups
     ● Consumer Group
       ○ Groups multiple consumers
       ○ Each consumer reads from a “unique” subset of partitions → max consumers = num partitions
       ○ They are “competing” consumers on the topic; each message is delivered to one consumer
       ○ Messages with the same “key” are delivered to the same consumer
     ● More consumer groups
       ○ Allow publish/subscribe
       ○ The same messages are delivered to consumers in different consumer groups
  32. Consumer Groups: partitions assignment (diagram: a topic with Partition 0-3; Group 1 with two consumers, Group 2 with three consumers)
  33. Consumer Groups: rebalancing (diagram: partitions reassigned within the groups after membership changes)
  34. Consumer Groups: max parallelism & idle consumer (diagram: four partitions and five consumers in Group 1; one consumer stays idle)
  35. Kafka Protocol
     ● It’s not standard
     ● It’s a binary protocol over TCP
     ● Request/response message pairs
     ● All requests are initiated by the client
       ○ A client could be a broker as well, for inter-broker communication (i.e. followers)
     ● APIs
       ○ Core client requests (metadata, send, fetch, offsets management)
       ○ Group membership (join, sync, leave) and heartbeat
       ○ Administrative (describe, list, …)
  36. Security
     ● Encryption between clients and brokers, and between brokers
       ○ Using SSL
     ● Authentication of clients (and brokers) connecting to brokers
       ○ Using SSL (mutual authentication)
       ○ Using SASL (with PLAIN, Kerberos or SCRAM-SHA as mechanisms)
     ● Authorization of read/write operations by clients
       ○ ACLs on resources such as topics
       ○ The authenticated “principal” is used for configuring ACLs
       ○ Pluggable
     ● It’s possible to mix encryption/no-encryption and authentication/no-authentication
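A sketch of client-side security settings mixing TLS encryption with SASL/SCRAM authentication; all file locations and credentials are placeholders:

```java
import java.util.Properties;

public class SecureClientConfig {
    static Properties secureProps() {
        Properties props = new Properties();
        // TLS encryption + SASL authentication (placeholder values)
        props.put("security.protocol", "SASL_SSL");
        props.put("ssl.truststore.location", "/path/to/truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        // SCRAM-SHA-512 as the SASL mechanism
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                + "username=\"alice\" password=\"alice-secret\";");
        return props;
    }
}
```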
  37. Demo (diagram: a Device App produces to the iot-temperature topic; a Kafka Streams App reads iot-temperature and writes the maximum values to iot-temperature-max)
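The heart of such a Streams application might look like the sketch below: keep a running per-key maximum and emit it to the output topic. This is a simplified reconstruction, not the actual demo code (which is linked in the resources slide); serdes and value types are assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class TemperatureMax {
    static StreamsBuilder topology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Integer> temperatures =
                builder.stream("iot-temperature",
                        Consumed.with(Serdes.String(), Serdes.Integer()));

        // Keep the running maximum per device key and emit changes downstream
        temperatures.groupByKey()
                .reduce((max, next) -> Math.max(max, next))
                .toStream()
                .to("iot-temperature-max",
                        Produced.with(Serdes.String(), Serdes.Integer()));
        return builder;
    }
}
```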
  38. Resources
     ● Apache Kafka: https://kafka.apache.org/
     ● Strimzi: http://strimzi.io/
     ● Kubernetes: https://kubernetes.io/
     ● OpenShift: https://www.openshift.org/
     ● Demo: https://github.com/strimzi/strimzi-lab/tree/master/iot-demo
     ● Paolo’s blog: https://paolopatierno.wordpress.com/
  39. Thank you! Questions? @ppatierno
  40. Bonus slides ...
  41. Across Data Centers: problems (diagram: producers in Geo location 1 and Geo location 2 writing to a single Kafka cluster stretched across Data Center 1 and Data Center 2)
  42. Across Data Centers: Mirror Maker (diagram: independent clusters in Data Center 1 and Data Center 2, with MirrorMaker instances replicating topics between them)
  43. Kafka Protocol at 20,000 feet (diagram: a request message carries an API key, API version, correlation id and client id; the response carries the matching correlation id; request/response pairs include Metadata, Produce, Fetch, Offset, Offset Commit, Offset Fetch, …)
  44. Controller
     ● The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions
     ● When a node shuts down, it is the controller that tells other replicas to become partition leaders, replacing the partition leaders on the node that is going away
     ● Elected through Zookeeper
  45. Fast
     ● Zero-copy
       ○ Calls into the OS kernel directly, rather than going through the application layer, to move data fast
       ○ Works only if clients speak the same protocol version; otherwise an in-memory copy is needed
     ● Batching data
       ○ Batches data into chunks, minimising cross-machine latency along with the buffering/copying that accompanies it
     ● Avoids random disk access
       ○ As an immutable commit log, it does not need to rewind the disk and do many random I/O operations; it can access the disk sequentially
     ● Horizontal scaling
       ○ The ability to have thousands of partitions for a single topic, spread among thousands of machines, means Kafka can handle huge loads
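On the producer side, batching is directly tunable; an illustrative sketch (the values are examples, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchingConfig {
    static Properties batchingProps() {
        Properties props = new Properties();
        // Collect up to 64 KB of records per partition before sending ...
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
        // ... and wait up to 10 ms for a batch to fill up
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        // Compress whole batches to reduce bytes on the wire
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        return props;
    }
}
```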
  46. Zookeeper
     ● Electing the controller
       ○ Zookeeper is used to elect the controller, to make sure there is only one, and to elect a new one if it crashes
     ● Cluster membership
       ○ Which brokers are alive and part of the cluster? This is also managed through Zookeeper
     ● Topic configuration
       ○ Which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and what configuration overrides are set for each topic
     ● Quotas and ACLs
       ○ How much data each client is allowed to read and write, and who is allowed to do that to which topic
