
Nielsen Presents: Fun with Kafka, Spark and Offset Management

We ingest billions of events per day into our big data stores, and we need to do it in a scalable, cost-efficient, and consistent way. When working with Spark and Kafka, how and where you manage your consumer offsets has major implications for all three. We will go in depth on the solution we ended up implementing and discuss the working process and the dos and don'ts that led to its final design.

Published in: Engineering

  1. Nielsen Presents: Fun with Kafka, Spark and Offset Management, or how we stopped worrying and started to enjoy offset management
  2. Agenda
  3. Agenda • How we used to work • What we wanted to do and why • What we ended up doing • Extra Benefits
  4. whoami • Simona • Big Data Engineer at Nielsen Marketing Cloud • Data lover • Concert goer • Japan enthusiast
  5. How we used to work • Kafka Spark Consumer • Offsets Strategies • Kafka Offset Manager • TEAM A vs. TEAM B Flows
  6. Our Different Flows • Micro Batch Length: TEAM RON 4-6 minutes, TEAM ITAI 1 hour • Output to Commit: TEAM RON Kafka offsets, TEAM ITAI Kafka offsets + file names
  7. Defining Starting Offsets • Kafka Offset Manager • earliest • latest • XML
  13. Kafka Offset Manager • Self-made • Offsets stored in Kafka • Everything works against the brokers
  14. What we wanted to do and why • JUST A SIMPLE UPGRADE • Two main goals to achieve with our infrastructure • Some of the problems we encountered
  15. Just a Simple Upgrade ● New Kafka Consumer API ● New consumer config ● Subscribing to Kafka
  16. Some of the Problems We Encountered • Committing offsets to Kafka fails: longer timeouts then! • Spark graceful shutdown • Spark Kafka async commit …?
  17. What we ended up doing • RDS Offset Store • Consistent Committing • Subscribing to topics (Offsets Strategies) • Unified infrastructure • Timeouts
  18. RDS Offset Store • Table Structure • Constraints and upserts • Triggers
  19. RDS Offset Store
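The deck does not show the actual table, but slide 18's bullets (table structure, a uniqueness constraint, upserts) suggest a shape like the following. This is a hypothetical PostgreSQL-style sketch: every table name, column name, and the trigger-free upsert form are assumptions, not the deck's real schema.

```sql
-- Hypothetical sketch of an RDS offsets table; names and types are assumptions.
CREATE TABLE consumer_offsets (
  consumer_group TEXT      NOT NULL,
  topic          TEXT      NOT NULL,
  partition      INT       NOT NULL,
  "offset"       BIGINT    NOT NULL,
  updated_at     TIMESTAMP DEFAULT now(),
  -- the uniqueness constraint is what makes the upsert below possible
  PRIMARY KEY (consumer_group, topic, partition)
);

-- Upsert: insert the row, or update the offset if the partition already exists.
INSERT INTO consumer_offsets (consumer_group, topic, partition, "offset")
VALUES ('my-group', 'events', 0, 42)
ON CONFLICT (consumer_group, topic, partition)
DO UPDATE SET "offset" = EXCLUDED."offset", updated_at = now();
```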
  22. Subscribing to Topics
  23. Building the Offsets Map ● What is the offsets Map[TopicPartition,Long] ● Getting the number of partitions ● Constructing the offsets map
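The bullets on slide 23 can be sketched as follows. This is an illustrative sketch, not the deck's actual code: `TopicPartition` below is a minimal stand-in for Kafka's `org.apache.kafka.common.TopicPartition`, and the topic name, partition count, and default starting offset are assumptions.

```scala
// Minimal stand-in for org.apache.kafka.common.TopicPartition.
case class TopicPartition(topic: String, partition: Int)

// Construct the offsets map: one entry per partition of the topic,
// starting each partition from a default offset (hypothetical 0L here).
def buildOffsetsMap(topic: String,
                    numPartitions: Int,
                    defaultOffset: Long = 0L): Map[TopicPartition, Long] =
  (0 until numPartitions)
    .map(p => TopicPartition(topic, p) -> defaultOffset)
    .toMap

// offsets now maps each of the 3 partitions of "events" to offset 0
val offsets = buildOffsetsMap("events", 3)
```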
  24. Getting Number of Partitions
  25. Getting Current Offsets
  32. Getting Current Offsets ● Each map represents a row ● Reduce rows into the offsets map
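The "each map represents a row, reduce rows into the offsets map" step might look like this. The column names, the row representation as a `Map[String, Any]`, and the keep-the-highest-offset rule are assumptions for illustration; the actual row shape from their RDS store is not shown in the deck.

```scala
// Minimal stand-in for org.apache.kafka.common.TopicPartition.
case class TopicPartition(topic: String, partition: Int)

// Each fetched row is a Map of column name -> value (hypothetical columns).
// Fold the rows into the offsets map, keeping the highest offset per partition.
def rowsToOffsetsMap(rows: Seq[Map[String, Any]]): Map[TopicPartition, Long] =
  rows.foldLeft(Map.empty[TopicPartition, Long]) { (acc, row) =>
    val tp = TopicPartition(row("topic").asInstanceOf[String],
                            row("partition").asInstanceOf[Int])
    val offset = row("offset").asInstanceOf[Long]
    acc.updated(tp, math.max(offset, acc.getOrElse(tp, 0L)))
  }
```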
  33. “Two Phase” Commit ● What is a “two phase” commit? ● Why do we need it? ● How did we end up implementing it?
  34. “Two Phase” Commit
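The idea behind a "two phase" commit against an offsets store can be sketched as: phase 1 records the batch's end offsets as in-flight before the output is written, phase 2 marks them committed once the write succeeds, and on restart only fully committed offsets are trusted. The in-memory store below is a stand-in for the RDS table, and all statuses and method names are illustrative assumptions, not the deck's implementation.

```scala
// Minimal stand-in for org.apache.kafka.common.TopicPartition.
case class TopicPartition(topic: String, partition: Int)

sealed trait Status
case object InFlight extends Status
case object Committed extends Status

// In-memory stand-in for the RDS offsets table.
class OffsetStore {
  private var rows = Map.empty[TopicPartition, (Long, Status)]

  // Phase 1: before writing the batch's output, record the offsets as in-flight.
  def beginCommit(offsets: Map[TopicPartition, Long]): Unit =
    offsets.foreach { case (tp, off) => rows = rows.updated(tp, (off, InFlight)) }

  // Phase 2: after the output is safely written, mark the offsets committed.
  def finishCommit(offsets: Map[TopicPartition, Long]): Unit =
    offsets.foreach { case (tp, off) => rows = rows.updated(tp, (off, Committed)) }

  // On restart, resume only from offsets whose commit fully completed.
  def committedOffsets: Map[TopicPartition, Long] =
    rows.collect { case (tp, (off, Committed)) => tp -> off }
}
```

If the job dies between the two phases, the in-flight rows reveal an incomplete batch, so the consumer restarts from the last committed offsets and reprocesses it.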
  40. Extra Benefits • Restart Time • Monitoring • Disaster Recovery • Unified Configurations
  41. Committing to Kafka ● Why? ● Two options of committing to Kafka
  42. New Offsets Strategies • Database • Earliest • Latest
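Resolving one of the three strategies above into a starting-offsets map might look like the sketch below. The strategy names, the function parameters, and the fall-back-to-earliest behaviour when the database has no row are assumptions for illustration.

```scala
// Minimal stand-in for org.apache.kafka.common.TopicPartition.
case class TopicPartition(topic: String, partition: Int)

// Resolve a starting-offsets strategy into an offsets map. The lookups for
// earliest/latest/database offsets are passed in as functions (assumed shape).
def startingOffsets(strategy: String,
                    partitions: Seq[TopicPartition],
                    earliest: TopicPartition => Long,
                    latest: TopicPartition => Long,
                    fromDb: TopicPartition => Option[Long]): Map[TopicPartition, Long] =
  strategy match {
    case "earliest" => partitions.map(tp => tp -> earliest(tp)).toMap
    case "latest"   => partitions.map(tp => tp -> latest(tp)).toMap
    // "database": use the stored offset, fall back to earliest if none exists
    case "database" => partitions.map(tp => tp -> fromDb(tp).getOrElse(earliest(tp))).toMap
    case other      => throw new IllegalArgumentException(s"unknown strategy: $other")
  }
```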
  43. Working against bd-commons
