
Heterogeneous job processing with Apache Kafka



Rehearsal record available:

Published in: Internet


  1. Heterogeneous Job Processing with Apache Kafka. Chu, Hua-Rong | PyCon HK 2018
  2. TL;DL: A journey toward job processing at large scale. What you might be interested in: ● Why and how to use Kafka for job queuing and scheduling ● A sketch of a reliable and scalable job processing system built with Python ● Exploring another use case for Kafka beyond those in the data science area
  3. Speaker: a Pythonista from Taiwan, plus... ● Research engineer @ Chunghwa Telecom Laboratories ○ Focuses on infrastructure and platform architecture design ● Enthusiast of open source and software technologies ○ Involved in open source projects, meetups, conferences and so on
  4. So what is job processing?
  5. Consumer-Producer pattern ● Some business logic is too time-consuming to run in front of the user ● Create background jobs, place them on multiple queues, and process them later ● Well-known implementations in the Python world: Celery, RQ ● Disambiguation: workers in this system are heterogeneous, in contrast to those in Apache Spark
  6. Job processing does the labor underneath most fancy services. Example: background tasks at Shopify ● Sending massive newsletters / image resizing / HTTP downloads / updating smart collections / updating Solr / batch imports / spam checks. Fancy services also built their own fancy job processing systems ● GitHub: Resque ● LiveJournal: Gearman ● Shopify: delayed_job
  7. Yet another job queue is released almost every month...
  8. Why do we reinvent the wheel?
  9. We did adopt an existing artifact... until we broke up ● Most existing artifacts are made by cool guys who are building fancy SaaS products ○ Low latency, moderate reliability, relatively short jobs (seconds to a minute) ○ Example: sending massive newsletters ● We are building heavy, boring IaaS infra. ○ Cloud resource provisioning, a tiered storage system of several PB... ○ Requires serious durability and handles long-running jobs (several minutes to hours) ○ Moderate latency is acceptable
  10. Azure Blob Storage Lifecycle, presented in Sep. 2018
  11. Evolution of our job processing infrastructure
  12. The Dark Ages - DB + cronjobs ● To-do lists in an RDBMS ● A variety of cronjobs check the lists periodically and run the corresponding jobs ● Verdict ○ For - a good choice in the MVP (minimum viable product) building stage ○ Against - every aspect other than simplicity of development
  13. Renaissance - standing on the shoulders of GitHub. GitHub has been on a journey to find a robust job processing infra. ● They have tried many different background job systems ○ SQS, Starling, ActiveMessaging, BackgroundJob, DelayedJob, and beanstalkd ● Resque is the answer learned from that journey ○ GitHub’s own in-house job system ○ Redis-backed
  14. Renaissance - standing on the shoulders of GitHub ● We adopted Resque as our job processing system ○ Harder, better, faster, stronger ● Problems ○ We lived happily ever after until the Redis daemon was killed by the OOM killer due to an outage ○ Redis is not that reliable as storage ○ Redoing a long-running job after a failure takes a lot of time ● Resque is good for SaaS, but... ○ Consider that there are “only” 7.7 billion people in the world clicking buttons in your app ○ There are trillions of files in AWS S3
  15. Revolution - our in-house system. The solution provides: ● Both the durability and the scalability we desire ○ Jobs are stored in and passed via Apache Kafka ○ Let Kafka handle the hard queuing/scheduling problems ● Job execution is recoverable and diagnosable ○ We decouple a long-running job into shorter tasks ○ Inspired by stream processing and event sourcing. [Diagram] AS-IS: Input → huge job → Result. TO-BE: Input → small tasks → Result
  16. Revolution - our in-house system. A job handling procedure is defined by a series of tasks in a spec. ● Tasks are chained together via adjacent inputs/outputs ● Inputs/outputs are Kafka topics
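The slides do not show the spec format itself, so the following is only a minimal sketch of how such a task chain could be described and checked in Python. `TaskSpec`, `validate_chain`, and every topic name are hypothetical; the only property taken from the talk is that one task's output topic is the next task's input topic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskSpec:
    """One task in a job handling procedure (hypothetical structure)."""
    name: str
    input_topic: str
    output_topic: str


def validate_chain(tasks: List[TaskSpec]) -> List[str]:
    """Check that adjacent tasks share a topic and return the task order."""
    for upstream, downstream in zip(tasks, tasks[1:]):
        if upstream.output_topic != downstream.input_topic:
            raise ValueError(
                f"{upstream.name!r} writes to {upstream.output_topic!r} but "
                f"{downstream.name!r} reads from {downstream.input_topic!r}"
            )
    return [task.name for task in tasks]


# Illustrative three-step job; every name here is made up.
JOB_SPEC = [
    TaskSpec("scan", "job.input", "job.scanned"),
    TaskSpec("move", "job.scanned", "job.moved"),
    TaskSpec("verify", "job.moved", "job.result"),
]
```

Calling `validate_chain(JOB_SPEC)` returns the task names in execution order, and raises `ValueError` if two neighbouring tasks do not share a topic.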
  17. Revolution - our in-house system. The system initializes the job handling procedure according to the spec. ● Spawns workers, each handling its respective task ● Manages the number of each kind of worker
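How the worker counts are managed is not shown in the slides. One possible approach, sketched here under that assumption, is to compare the desired number of workers per task with the number currently running and derive what to start or stop. `plan_scaling` and the task names are hypothetical, and actually starting or stopping worker processes is left out.

```python
from typing import Dict


def plan_scaling(desired: Dict[str, int], running: Dict[str, int]) -> Dict[str, int]:
    """Return, per task, how many workers to start (>0) or stop (<0)."""
    plan = {}
    for task in sorted(set(desired) | set(running)):
        delta = desired.get(task, 0) - running.get(task, 0)
        if delta != 0:
            plan[task] = delta  # tasks already at the desired count are omitted
    return plan
```

For example, with `desired={"scan": 2, "move": 4}` and `running={"scan": 2, "move": 1}`, the plan asks for three more `move` workers and leaves `scan` untouched.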
  18. Details on how we achieve this 1. Master the power of Kafka in Python 2. Let it be durable 3. Let it be scalable 4. Decouple long-running jobs into small pieces
  19. Master the power of Kafka in Python. Brief introduction to Apache Kafka ● Brought to you by LinkedIn ● Focuses on performance, scalability and durability ○ Better throughput, built-in partitioning, replication, and fault tolerance for large-scale message processing applications ● Widely adopted in the data science / big data areas ○ The "K" of the SMACK stack ○ Website activity tracking, metrics, log aggregation. [Image credit: Ch.ko123, CC BY 4.0]
  20. Master the power of Kafka in Python. Quick facts for Pythonistas ● We can use only two of the four major APIs provided by Kafka in Python ○ (O) Producer API, Consumer API ○ (X) Connector API, Streams API ● Client binding: confluent-kafka-python ○ Supported by the creators of Kafka ○ Wrapper around librdkafka => better performance, reliability, future-proof
  21. Producer / Consumer
  22. Let it be durable. Kafka is one of the most durable stores for messages ● Messages are retained as replicas spread among brokers ● Replicas can be written to storage synchronously, just as in a real distributed file system. Other MQs are generally not as durable because of their use cases. For example, the RabbitMQ documentation states: ● “Marking messages as persistent doesn't fully guarantee that a message won't be lost”
  23. Let it be durable. Durability does not come out of the box with the default config... ● Adjust the following parameters in the Kafka cluster config ○ default.replication.factor=3 ○ unclean.leader.election.enable=false (default in v0.11.0.0+) ○ min.insync.replicas=2 ● Add the following parameter to the producer config ○ request.required.acks=all
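As a rough illustration of the producer-side setting above, the sketch below submits a job and waits for the broker's acknowledgement. It assumes a producer object with the `produce(...)` and `flush()` methods of confluent-kafka-python's `Producer`; `submit_job`, the broker address, and the JSON payload format are assumptions made for this example, not part of the talk.

```python
import json

# Producer settings; the broker address is a placeholder.
PRODUCER_CONFIG = {
    "bootstrap.servers": "localhost:9092",
    "request.required.acks": "all",  # wait for all in-sync replicas
}


def submit_job(producer, topic, job_id, payload):
    """Produce one job message and block until the broker has answered."""
    failures = []

    def on_delivery(err, msg):
        # Invoked while flush() runs, once the delivery result is known.
        if err is not None:
            failures.append(err)

    producer.produce(
        topic,
        key=job_id,
        value=json.dumps(payload),
        on_delivery=on_delivery,
    )
    producer.flush()
    if failures:
        raise RuntimeError(f"job {job_id} was not stored: {failures[0]}")
```

With the real client this would be used roughly as `submit_job(Producer(PRODUCER_CONFIG), "job.input", "job-1", {"size": 3})` after `from confluent_kafka import Producer`. Flushing after every message keeps the sketch simple but costs throughput.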
  24. Let it be durable. To achieve at-least-once semantics, be careful about commit timing on the consumer side
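The commit timing can be sketched as "process first, commit afterwards" with auto-commit turned off. The loop below assumes a consumer object with the `poll(...)` and `commit(...)` methods of confluent-kafka-python's `Consumer`; `consume_at_least_once`, the group name, and the broker address are hypothetical. It is a sketch of the idea, not the speaker's implementation.

```python
CONSUMER_CONFIG = {
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "group.id": "example-task",             # hypothetical group name
    "enable.auto.commit": False,            # offsets are committed by hand
    "auto.offset.reset": "earliest",
}


def consume_at_least_once(consumer, handle, poll_timeout=1.0):
    """Process messages until poll() times out; return how many were handled."""
    processed = 0
    while True:
        msg = consumer.poll(poll_timeout)
        if msg is None:
            return processed  # nothing arrived within the timeout
        if msg.error():
            continue  # a real worker would inspect and log the error
        handle(msg.value())  # if this raises, the offset is never committed
        consumer.commit(message=msg, asynchronous=False)
        processed += 1
```

Because the offset is committed only after `handle` returns, a crash in between causes the message to be delivered again after a restart, so handlers need to tolerate duplicates.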
  25. Let it be scalable. Scalability is achieved through the system design ● Share nothing ○ Each worker is a single process ○ GIL-free ● Job decoupling ○ Each task in a job can be scaled individually, just like fancy microservices ○ Admins can adjust the number of workers at runtime
  26. Let it be scalable. Scalability is also achieved through Kafka's properties ● Naturally distributed ● Offsets are managed by the consumer instead of the broker. Configuration ● Producers should write messages to several partitions in one topic ● Consumers/workers responsible for the same task should belong to the same group
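The two configuration points above can be sketched as follows. Within a consumer group, Kafka assigns each partition to at most one consumer, so the partition count caps how many workers of one task can be busy at once. `worker_consumer_config`, `busy_workers`, and the `workers-<task>` naming scheme are hypothetical helpers for illustration.

```python
def worker_consumer_config(task_name, bootstrap_servers="localhost:9092"):
    """Consumer settings shared by every worker of the same task."""
    return {
        "bootstrap.servers": bootstrap_servers,  # placeholder address
        "group.id": f"workers-{task_name}",      # same task => same group
        "enable.auto.commit": False,
    }


def busy_workers(partitions, workers):
    """Workers that can receive messages; the rest would sit idle."""
    return min(partitions, workers)
```

With 4 partitions, starting 6 workers in one group leaves 2 of them without work, which is one reason to give a task's input topic several partitions.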
  27. Decouple long-running jobs into small pieces. When facing a business logic problem... ● The procedure can be recovered from the failed task stage ○ Auto retry ○ “Abort” topic (queue) for human intervention ● Debug info is available ○ Exception info in the message header on retry. [Diagram] Input → Result ➔ Recoverable ➔ Diagnosable
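A minimal sketch of this retry/abort routing is shown below. It assumes a producer whose `produce(...)` accepts `headers` as a list of `(key, bytes)` pairs, as confluent-kafka-python's `Producer` does. The header names, the `topics` mapping, and the `max_attempts` limit are assumptions for illustration; the talk does not specify them.

```python
def run_task(producer, task, value, attempts, topics, max_attempts=3):
    """Run one task and route the message to the output, retry, or abort topic."""
    try:
        result = task(value)
    except Exception as exc:
        attempts += 1
        headers = [
            ("attempts", str(attempts).encode()),  # hypothetical header name
            ("error", repr(exc).encode()),         # exception info for debugging
        ]
        # Retry until the limit is reached, then hand over to a human.
        target = topics["retry"] if attempts < max_attempts else topics["abort"]
        producer.produce(target, value=value, headers=headers)
        return target
    producer.produce(topics["output"], value=result)
    return topics["output"]
```

The retry topic may simply be the task's own input topic. A worker reading the message again would take `attempts` from the headers (`msg.headers()` in the real client) and pass it back in.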
  28. Decouple long-running jobs into small pieces. Tasks can be ● Stateful ● Grouped into subsystems by domain ● Able to flush results to an external system
  29. Decouple long-running jobs into small pieces. Tasks can be grouped into subsystems by domain and can flush results to an external system
  30. A task chain can perform operations in a stream processing / functional programming style: consume, flatmap, partition
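To make the consume / flatmap / partition idea concrete, here is a toy, Kafka-free sketch: one task turns each consumed message into several smaller ones, and each output gets a key that decides its partition. `flat_map`, `split_batch`, and `partition_for` are hypothetical names, and the CRC32-based partition choice is only a stand-in for whatever partitioner the Kafka client is configured with.

```python
import zlib
from typing import Callable, Iterable, Iterator, List, Tuple

Keyed = Tuple[bytes, str]  # (message key, message value)


def flat_map(task: Callable[[dict], List[Keyed]], messages: Iterable[dict]) -> Iterator[Keyed]:
    """Apply a task that may emit zero or more outputs per input message."""
    for message in messages:
        yield from task(message)


def split_batch(batch: dict) -> List[Keyed]:
    """Example task: split one batch job into one message per file."""
    return [(name.encode(), name) for name in batch["files"]]


def partition_for(key: bytes, num_partitions: int) -> int:
    """Toy partitioner: the same key always maps to the same partition."""
    return zlib.crc32(key) % num_partitions
```

In the real system the outputs of `flat_map` would be produced to the next task's input topic, with the key letting Kafka keep related messages on the same partition.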
  31. Summary
  32. What we have covered so far ● Why and how to use Kafka for job queuing and scheduling ● A sketch of a reliable and scalable job processing system built with Python ● Exploring another use case for Kafka beyond those in the data science area
  33. Final word: I highly recommend existing wonderful artifacts such as RQ and Celery to anyone, unless: ● You want to share an existing Kafka cluster to make your infrastructure more cost-effective ● You have critical applications that require scalability and durability
  34. Thank you! Example code and contact
  35. Job processing at large scale. An IaaS-grade job processing system powered by Kafka and Python ● Stands on the shoulders of GitHub ● Focuses on durability, scalability, and long-running jobs ● Designed out of necessity. Against ● Availability - a CP system ● Monitoring ● Complexity