
Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)


Presented at the Seattle Spark Meetup on March 23rd, 2017, hosted at Expedia. (https://www.meetup.com/Seattle-Spark-Meetup/events/230310598/)

This presentation is a case study of taking Spark Streaming to production with Kafka as the data source, and highlights best practices for the main concerns of stream processing:

1. Spark Streaming & Standalone Cluster Overview
2. Design Patterns for Performance
3. Guaranteed Message Processing & Direct Kafka Integration
4. Operational Monitoring & Alerting
5. Spark Cluster & App Resilience



  1. Spark Streaming + Kafka Best Practices. Brandon O’Brien, @hakczar, Expedia, Inc
  2. Or “A Case Study in Operationalizing Spark Streaming”
  3. Context/Disclaimer
     • Our use case: build a resilient, scalable data pipeline with streaming reference-data lookups, a 24hr stream self-join and some aggregation. Values accuracy over speed.
     • Spark Streaming 1.5-1.6, Kafka 0.9
     • Standalone Cluster (not YARN or Mesos)
     • No Hadoop
     • Message velocity: k/s. Batch window: 10s
     • Data sources: Kafka (primary), Redis (joins + ref data) & S3 (ref data)
  4. Demo: Spark in Action
  5. Game & Scoreboard Architecture
  6. Outline
     • Spark Streaming & Standalone Cluster Overview
     • Design Patterns for Performance
     • Guaranteed Message Processing & Direct Kafka Integration
     • Operational Monitoring & Alerting
     • Spark Cluster & App Resilience
  7. Outline
     • Spark Streaming & Standalone Cluster Overview
     • Design Patterns for Performance
     • Guaranteed Message Processing & Direct Kafka Integration
     • Operational Monitoring & Alerting
     • Spark Cluster & App Resilience
  8. Spark Streaming & Standalone Cluster Overview
     • RDD: partitioned, replicated collection of data objects
     • Driver: the JVM that creates the Spark program and negotiates for resources. Handles scheduling of tasks but does not do the heavy lifting; a common bottleneck.
     • Executor: slave to the driver; executes tasks on RDD partitions. Functions are serialized and shipped to executors.
     • Lazy execution: transformations & actions
     • Cluster types: Standalone, YARN, Mesos
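The lazy-execution point above can be illustrated without a cluster. A minimal plain-Python analogue (not the actual Spark API): transformations only build a recipe, and nothing runs until an action forces evaluation.

```python
# Plain-Python sketch of Spark's lazy execution model (illustration only).
# Generators play the role of transformations: they defer all work.

def lazy_map(fn, data):
    return (fn(x) for x in data)          # deferred, like rdd.map(fn)

def lazy_filter(pred, data):
    return (x for x in data if pred(x))   # deferred, like rdd.filter(pred)

events = range(10)
doubled = lazy_map(lambda x: x * 2, events)        # no work done yet
divisible = lazy_filter(lambda x: x % 4 == 0, doubled)

total = sum(divisible)   # the "action": the whole chain executes here
# total == 40  (0 + 4 + 8 + 12 + 16)
```

In real Spark the same structure applies: chained `map`/`filter` calls are bookkeeping, and only an action such as `count` or `saveAsTextFile` triggers the job.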
  9. Spark Streaming & Standalone Cluster Overview
     • Standalone Cluster
       ◦ Each node runs: Master, Worker, Executor, Driver
       ◦ ZooKeeper cluster
  10. Outline
     • Spark Streaming & Standalone Cluster Overview
     • Design Patterns for Performance
     • Guaranteed Message Processing & Direct Kafka Integration
     • Operational Monitoring & Alerting
     • Spark Cluster & App Resilience
  11. Design Patterns for Performance
     • Delegate all IO/CPU to the Executors
     • Avoid unnecessary shuffles (join, groupBy, repartition)
     • Externalize streaming joins & reference-data lookups for large/volatile ref data sets:
       ◦ JVM static hashmap
       ◦ External cache (e.g. Redis)
       ◦ Static LRU cache (amortizes lookups)
       ◦ RocksDB
     • Hygienic function closures
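The "static LRU cache" option above can be sketched in a few lines. This is a hypothetical illustration, not the talk's actual code: `fetch_from_redis` is a stand-in for a real external-cache client, and the cache amortizes repeated lookups of hot keys so most of them never leave the process.

```python
from functools import lru_cache

# Counter so we can see how many lookups actually reach the external store.
CALLS = {"n": 0}

def fetch_from_redis(key):
    # Hypothetical stand-in for a Redis GET; not a real client call.
    CALLS["n"] += 1
    return {"hotel:1": "Seattle"}.get(key)

@lru_cache(maxsize=10_000)   # static cache, lives for the process lifetime
def lookup(key):
    return fetch_from_redis(key)

for _ in range(1000):
    value = lookup("hotel:1")   # 999 of these hit the in-process cache
```

On an executor, a cache like this is typically held in a static/object-level field so it survives across batches; the trade-off is staleness, which is why it suits slowly changing reference data rather than the volatile sets called out above.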
  12. We’re done, right?
  13. We’re done, right? Just need to QA the data…
  14. 70% missing data
  15. Outline
     • Spark Streaming & Standalone Cluster Overview
     • Design Patterns for Performance
     • Guaranteed Message Processing & Direct Kafka Integration
     • Operational Monitoring & Alerting
     • Spark Cluster & App Resilience
  16. Guaranteed Message Processing & Direct Kafka Integration
     • Guaranteed message processing = at-least-once processing + idempotence
     • Kafka Receiver:
       ◦ Consumes messages faster than Spark can process them
       ◦ Checkpoints before processing is finished
       ◦ Inefficient CPU utilization
     • Direct Kafka Integration:
       ◦ Control over checkpointing & transactionality
       ◦ Better distribution of resource consumption
       ◦ 1:1 Kafka topic-partition to Spark RDD-partition mapping
       ◦ Use Kafka as the WAL
       ◦ Statelessness, fail-fast
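The equation "at-least-once + idempotence = guaranteed processing" is worth making concrete. A toy sketch (names and data invented for illustration): if the sink keys writes by message id, replaying a batch after a failure cannot double-count anything.

```python
# Sketch: at-least-once delivery + idempotent writes = effectively-once results.
# The sink is a dict keyed by message id, so redelivered messages are harmless.

store = {}

def process(msg):
    # Idempotent upsert: applying the same message twice leaves store unchanged.
    store[msg["id"]] = msg["amount"]

batch = [{"id": "a", "amount": 10}, {"id": "b", "amount": 5}]
for m in batch:
    process(m)
for m in batch:        # simulate redelivery after a crash/restart
    process(m)
# store == {"a": 10, "b": 5} -- duplicates did not double-count
```

Contrast with a non-idempotent sink (e.g. `store[id] += amount` without deduplication), where the same replay would corrupt the totals; that is why at-least-once alone is not enough.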
  17. Outline
     • Spark Streaming & Standalone Cluster Overview
     • Design Patterns for Performance
     • Guaranteed Message Processing & Direct Kafka Integration
     • Operational Monitoring & Alerting
     • Spark Cluster & App Resilience
  18. Operational Monitoring & Alerting
     • Driver “Heartbeat”
     • Batch processing time
     • Message count
     • Kafka lag (latest offsets vs last processed)
     • Driver start events
     • StatsD + Graphite + Seyren
     • http://localhost:4040/metrics/json/
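The Kafka-lag metric above is just arithmetic over offsets: the latest broker offset minus the last offset the app processed, summed per partition. A small sketch with made-up offsets:

```python
# Sketch of the "Kafka lag" metric: latest broker offsets minus the offsets
# the app has processed, summed over partitions. All numbers are invented.

def total_lag(latest, processed):
    """Sum of per-partition lag; unseen partitions count from offset 0."""
    return sum(latest[p] - processed.get(p, 0) for p in latest)

latest_offsets    = {0: 1500, 1: 1200, 2: 900}   # per-partition head of log
processed_offsets = {0: 1480, 1: 1200, 2: 850}   # last offsets the app committed

lag = total_lag(latest_offsets, processed_offsets)   # 20 + 0 + 50 = 70
```

Emitting this number to StatsD/Graphite on every batch and alerting (e.g. via Seyren) when it grows monotonically is what catches a consumer that has silently stalled or fallen behind.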
  19. Data loss fixed
  20. Data loss fixed. So we’re done, right?
  21. Cluster & app continuously crashing
  22. Outline
     • Spark Streaming & Standalone Cluster Overview
     • Design Patterns for Performance
     • Guaranteed Message Processing & Direct Kafka Integration
     • Operational Monitoring & Alerting
     • Spark Cluster & App Resilience
  23. Spark Cluster & App Stability: Spark slave memory utilization
  24. Spark Cluster & App Stability
     • Slave memory overhead & the OOM killer
     • Crashes + Kafka Receiver = missing data
     • Supervised driver: “--supervise” flag for spark-submit; log driver restarts
     • Cluster resource overprovisioning
     • Standby Masters for failover
     • Auto-cleanup of work directories: spark.worker.cleanup.enabled=true
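The two resilience settings named above fit together roughly like this. A sketch only: the class name and jar path are placeholders, and `spark.worker.cleanup.enabled` is a worker-side property, so it belongs in the workers' configuration rather than on the submit command.

```shell
# Supervised driver (Standalone cluster deploy mode): the Master restarts
# the driver JVM if it exits abnormally. Class/jar are placeholders.
spark-submit --deploy-mode cluster --supervise \
  --class com.example.StreamingApp app.jar

# Worker-side auto-cleanup of old application work directories,
# set on each worker (e.g. in conf/spark-defaults.conf):
#   spark.worker.cleanup.enabled  true
```

Without the cleanup setting, long-running standalone workers slowly fill their disks with per-application work directories, which is one of the crash modes this section is about.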
  25. We’re done, right?
  26. We’re done, right? Finally, yes
  27. Party Time
  28. TL;DR
     1. Use Direct Kafka Integration + transactionality
     2. Cache reference data for speed
     3. Avoid shuffles & driver bottlenecks
     4. Supervised driver
     5. Clean up the worker temp directory
     6. Beware of function closures
     7. Cluster resource over-provisioning
     8. Spark slave memory headroom
     9. Monitoring on Driver heartbeat & Kafka lag
     10. Standby masters
  29. Spark Streaming + Kafka Best Practices. Brandon O’Brien, @hakczar, Expedia, Inc. Thanks!
  30. Links
     • Operationalizing Spark Streaming: https://techblog.expedia.com/2016/12/29/operationalizing-spark-streaming-part-1/
     • Direct Kafka Integration: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
     • App metrics: http://localhost:4040/metrics/json/
     • MetricsSystem: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
     • sparkConf.set("spark.worker.cleanup.enabled", "true")
