
Build real-time stream processing applications using Apache Kafka


This talk was presented at the Hotstar Scale Meetup in Bangalore by Jayesh Sidhwani.

In this talk, the presenter introduces Apache Kafka and the Apache Kafka Streams library. Starting from the need for stream processing and moving on to modelling use-cases as streaming jobs, the talk covers the technical details end to end.

It ends with a short description of how Kafka is deployed and used at Hotstar.



  1. Real-time stream processing using Apache Kafka
  2. Agenda ● What is Apache Kafka? ● Why do we need stream processing? ● Stream processing using Apache Kafka ● Kafka @ Hotstar (Feel free to stop me for questions)
  3. $ whoami ● Personalisation lead at Hotstar ● Led Data Infrastructure team at Grofers and TinyOwl ● Kafka fanboy ● Usually rant on Twitter @jayeshsidhwani
  4. What is Kafka? ● Kafka is a scalable, fault-tolerant, distributed queue ● Producers and Consumers ● Uses ○ Asynchronous communication in event-driven architectures ○ Message broadcast for database replication Diagram credits: http://kafka.apache.org
  5. Inside Kafka ● Brokers ○ Heart of Kafka ○ Store data ○ Data is stored in topics ● Zookeeper ○ Manages cluster state information ○ Leader election (Diagram: producers and consumers connected to topics on brokers, coordinated by Zookeeper)
  6. Inside a topic ● Topics are partitioned ○ A partition is an append-only commit-log file ○ Achieves horizontal scalability ● Messages written to a partition are ordered ● Each message gets an auto-incrementing offset # ○ {“user_id”: 1, “term”: “GoT”} is a message in the topic searched Diagram credits: http://kafka.apache.org
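The append-only log with auto-incrementing offsets can be sketched in plain Java. This is a toy model for illustration only, not how Kafka brokers are implemented:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a single topic partition: an append-only log where
// each appended message is assigned the next sequential offset.
class Partition {
    private final List<String> log = new ArrayList<>();

    // Append a message and return the offset assigned to it.
    long append(String message) {
        log.add(message);
        return log.size() - 1;
    }

    // Read the message stored at a given offset.
    String read(long offset) {
        return log.get((int) offset);
    }
}

public class PartitionDemo {
    public static void main(String[] args) {
        Partition searched = new Partition();
        long o1 = searched.append("{\"user_id\": 1, \"term\": \"GoT\"}");
        long o2 = searched.append("{\"user_id\": 2, \"term\": \"Kafka\"}");
        System.out.println(o1 + " " + o2); // offsets 0 and 1
        System.out.println(searched.read(0));
    }
}
```

Because offsets are just positions in the log, ordering within a partition falls out for free, and a consumer's progress is fully described by a single offset number.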
  7. How do consumers read? ● A consumer subscribes to a topic ● Consumers read from the head of the queue ● Multiple consumers can read from a single topic Diagram credits: http://kafka.apache.org
  8. Kafka consumers scale horizontally ● Consumers can be grouped ● Consumer Groups ○ Horizontally scalable ○ Fault tolerant ○ Guaranteed delivery Diagram credits: http://kafka.apache.org
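Why consumer groups scale horizontally can be illustrated with a simple round-robin partition assignment. This is a simplification for illustration; Kafka's real assignors (range, round-robin, sticky) are more involved:

```java
import java.util.ArrayList;
import java.util.List;

// Round-robin assignment of topic partitions to the members of a
// consumer group: each partition is owned by exactly one consumer in
// the group, so adding consumers (up to the partition count) spreads
// the read load.
public class GroupAssignment {
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < consumers; c++) owned.add(new ArrayList<>());
        for (int p = 0; p < partitions; p++) {
            owned.get(p % consumers).add(p); // partition p goes to consumer p mod consumers
        }
        return owned;
    }

    public static void main(String[] args) {
        // 6 partitions spread over 2 consumers -> 3 partitions each.
        System.out.println(assign(6, 2)); // [[0, 2, 4], [1, 3, 5]]
    }
}
```

Fault tolerance follows the same logic: when a consumer dies, its partitions are reassigned to the surviving members and they resume from the last committed offsets.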
  9. Stream processing and its use-cases
  10. Discrete data processing models ● Request/response processing mode ○ Processing time: <1 second ○ Clients can use this data (Diagram: applications communicating over request/response)
  11. Discrete data processing models ● Request/response processing mode ○ Processing time: <1 second ○ Clients can use this data ● Batch processing mode ○ Processing time: a few hours to a day ○ Analysts can use this data (Diagram: applications feeding a DWH/Hadoop batch pipeline)
  12. Discrete data processing models ● As the system grows, such a synchronous processing model leads to a spaghetti of unmaintainable point-to-point integrations (Diagram: apps wired directly to search, monitoring, and cache systems)
  13. Promise of stream processing ● Untangle movement of data ○ Single source of truth ○ No duplicate writes ○ Anyone can consume anything ○ Decouples data generation from data computation (Diagram: apps publishing into a stream processing framework that feeds search, monitoring, and cache)
  14. Promise of stream processing ● Untangle movement of data ○ Single source of truth ○ No duplicate writes ○ Anyone can consume anything ● Process, transform and react to the data as it happens ○ Sub-second latencies ○ Anomaly detection on bad stream quality ○ Timely notification to users who dropped off in a live match (Diagram: a stream processing framework applying Filter, Window, Join, Anomaly and Action steps to add intelligence)
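Windowing, one of the steps named above, can be sketched with a toy tumbling-window count in plain Java. A real stream processor runs this aggregation continuously; here the event timestamps are hypothetical:

```java
import java.util.Map;
import java.util.TreeMap;

// Tumbling-window count: bucket timestamped events into fixed,
// non-overlapping windows and count events per window. Anomaly
// detection like "too many bad-quality reports in the last 10s"
// is a threshold on exactly this kind of windowed count.
public class TumblingWindow {
    static Map<Long, Integer> countPerWindow(long[] timestamps, long windowMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            long windowStart = (ts / windowMs) * windowMs; // align to window boundary
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Events at 1s, 2s and 11s, counted in 10-second windows.
        long[] ts = {1_000, 2_000, 11_000};
        System.out.println(countPerWindow(ts, 10_000)); // {0=2, 10000=1}
    }
}
```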
  15. Stream processing using Kafka
  16. Stream processing frameworks ● Write your own? ○ Windowing ○ State management ○ Fault tolerance ○ Scalability ● Use frameworks such as Apache Spark, Samza, Storm ○ Batteries included ○ Cluster manager to coordinate resources ○ High memory/CPU footprint
  17. Kafka Streams ● Kafka Streams is a simple, low-latency stream processing library with no dependency on an external cluster framework ● Simple DSL ● Same principles as a Kafka consumer (minus the operations overhead) ● No cluster manager! yay!
  18. Writing Kafka Streams ● Define a processing topology ○ Source nodes ○ Processor nodes ■ One or more ■ Filtering, windowing, joins, etc. ○ Sink nodes ● Compile it and run it like any other Java application
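The source → processor → sink shape can be simulated with stdlib Java alone. This is not the Kafka Streams DSL (which wires the same shape to Kafka topics via `StreamsBuilder`); the filter and map steps below are hypothetical examples of processor nodes:

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Stdlib simulation of a processing topology: a source emits records,
// processor nodes filter and transform them, and a sink collects the
// results.
public class TopologyDemo {
    static List<String> run(List<String> source) {
        Predicate<String> filterNode = term -> !term.equals("kafka"); // drop some records
        Function<String, String> mapNode = String::toUpperCase;       // transform the rest
        return source.stream()
                .filter(filterNode)
                .map(mapNode)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sink = run(List.of("GoT", "kafka", "cricket", "GoT"));
        System.out.println(sink); // [GOT, CRICKET, GOT]
    }
}
```

The key point carried over from the slide: the topology is ordinary application code, compiled and run like any other Java program, with no cluster manager in the picture.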
  19. Demo: a simple Kafka Streams application
  20. Kafka Streams architecture and operations ● Kafka manages ○ Parallelism ○ Fault tolerance ○ Ordering ○ State management Diagram credits: http://confluent.io
  21. Streaming joins and state-stores ● Beyond filtering and windowing ● Streaming joins are hard to scale ○ Kafka scales to 800k writes/sec* ○ How about your database? ● Solution: cache the static stream in-memory ○ Join it with the running stream ○ Stream<>table duality ● Kafka supports in-memory caches out of the box ○ RocksDB ○ In-memory hash ○ Persistent / transient Diagram credits: http://confluent.io (* achieved using the librdkafka C++ library)
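The "cache a static stream, join the running stream against it" idea can be sketched with a plain map standing in for a state store such as RocksDB. The user/city data below is made up for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Stream-table join sketch: the "static" stream is materialised as a
// key -> value map (the state store), and each record of the running
// stream is enriched by a local lookup instead of a database call.
public class StreamTableJoin {
    static List<String> join(List<String[]> stream, Map<String, String> table) {
        return stream.stream()
                .map(rec -> rec[0] + " -> " + table.getOrDefault(rec[0], "unknown") + ": " + rec[1])
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Table: user_id -> city, built by materialising a compacted stream.
        Map<String, String> users = new HashMap<>();
        users.put("u1", "Bangalore");
        users.put("u2", "Mumbai");

        // Running stream of (user_id, search term) records.
        List<String[]> searches = List.of(
                new String[]{"u1", "GoT"},
                new String[]{"u3", "cricket"});

        System.out.println(join(searches, users));
        // [u1 -> Bangalore: GoT, u3 -> unknown: cricket]
    }
}
```

This is why the lookup side keeps up with Kafka's write rate: the join is a local in-memory read, so the external database never sits on the hot path.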
  22. Demo ● Inputs: ○ Incoming stream of benchmark stream quality from the CDN provider ○ Incoming stream quality reported by Hotstar clients ● Output: ○ The locations reporting bad QoS, calculated in real time
  23. Demo (continued) (Diagram: CDN benchmark stream joined with the client-report stream to produce alerts) Diagram credits: http://confluent.io
  24. KSQL: Kafka Streams++
  25. Kafka @ Hotstar
  27. Stream <> Table duality ● The heart of Kafka Streams ● A stream is a changelog of events Diagram credits: http://confluent.io
  28. Stream <> Table duality ● The heart of Kafka Streams ● A stream is a changelog of events ● A table is a compacted stream Diagram credits: http://confluent.io
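The duality can be made concrete with a toy changelog replay in plain Java (the event values are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stream <> Table duality: replaying a changelog stream of (key, value)
// updates into a map yields a table. Compaction keeps only the latest
// value per key, which is exactly the state the table holds, so the
// compacted stream and the table carry the same information.
public class StreamTableDuality {
    static Map<String, String> toTable(List<String[]> changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] update : changelog) {
            table.put(update[0], update[1]); // later updates win
        }
        return table;
    }

    public static void main(String[] args) {
        List<String[]> changelog = List.of(
                new String[]{"u1", "watching"},
                new String[]{"u2", "paused"},
                new String[]{"u1", "dropped"}); // supersedes the first u1 event

        System.out.println(toTable(changelog)); // {u1=dropped, u2=paused}
    }
}
```

Going the other way, emitting each `put` as an event turns the table back into a stream, which is what makes state stores replicable through Kafka itself.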
