Spark Streaming
+ Kafka
Best Practices
Brandon O’Brien
@hakczar
Expedia, Inc
Or
“A Case Study in Operationalizing
Spark Streaming”
Context/Disclaimer
 Our use case: Build resilient, scalable data pipeline with
streaming ref data lookups, 24hr stream self-join and some
aggregation. Values accuracy over speed.
 Spark Streaming 1.5-1.6, Kafka 0.9
 Standalone Cluster (not YARN or Mesos)
 No Hadoop
 Message velocity: thousands/sec. Batch window: 10s
 Data sources: Kafka (primary), Redis (joins + ref data) & S3
(ref data)
Demo: Spark in Action
Game & Scoreboard Architecture
Outline
 Spark Streaming & Standalone Cluster Overview
 Design Patterns for Performance
 Guaranteed Message Processing & Direct Kafka
Integration
 Operational Monitoring & Alerting
 Spark Cluster & App Resilience
Outline
 Spark Streaming & Standalone Cluster Overview
 Design Patterns for Performance
 Guaranteed Message Processing & Direct Kafka
Integration
 Operational Monitoring & Alerting
 Spark Cluster & App Resilience
Spark Streaming & Standalone
Cluster Overview
 RDD: Partitioned, replicated collection of data
objects
 Driver: JVM that runs the Spark program and
negotiates for resources. Handles scheduling of
tasks but does not do the heavy lifting, so it can
become a bottleneck.
 Executor: Slave to the driver; executes tasks on
RDD partitions. Functions are serialized from the
driver to the executors.
 Lazy Execution: Transformations & Actions
 Cluster Types: Standalone, YARN, Mesos
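Lazy execution, as named above, means transformations only describe work; an action triggers it. The same idea can be illustrated outside Spark with plain Python generators (illustrative only, not Spark APIs):

```python
# Mimic Spark's lazy transformations with a Python generator.
calls = []

def record(x):
    calls.append(x)   # stands in for real per-record work
    return x * 2

data = range(5)
# "Transformation": building the generator runs nothing yet (lazy).
transformed = (record(x) for x in data)
assert calls == []            # no work executed so far

# "Action": forcing the generator triggers the actual computation.
result = list(transformed)
assert calls == [0, 1, 2, 3, 4]
assert result == [0, 2, 4, 6, 8]
```

In Spark, the analogous split is `map`/`filter`/`join` (lazy) versus `count`/`collect`/`foreachRDD` (actions).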
Spark Streaming & Standalone
Cluster Overview
 Standalone Cluster
 Each node
 Master
 Worker
 Executor
 Driver
 Zookeeper cluster
Outline
 Spark Streaming & Standalone Cluster Overview
 Design Patterns for Performance
 Guaranteed Message Processing & Direct Kafka
Integration
 Operational Monitoring & Alerting
 Spark Cluster & App Resilience
Design Patterns for Performance
 Delegate all IO/CPU to the Executors
 Avoid unnecessary shuffles (join, groupBy,
repartition)
 Externalize streaming joins & reference data
lookups when the ref data set is large or volatile:
 JVM static hashmap
 External cache (e.g. Redis)
 Static LRU cache (amortize lookups)
 RocksDB
 Hygienic function closures
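The "static LRU cache" bullet above amortizes external lookups: each executor process keeps a bounded in-memory cache in front of the external store. A minimal sketch in plain Python, with a hypothetical in-memory stand-in for the Redis lookup:

```python
from functools import lru_cache

# Hypothetical ref-data store; in the talk's setup this would be Redis.
FAKE_REF_STORE = {"hotel:1": "Seattle", "hotel:2": "Bellevue"}
lookup_calls = 0

@lru_cache(maxsize=10_000)       # static, per-process bounded cache
def ref_lookup(key):
    global lookup_calls
    lookup_calls += 1            # stands in for a network round-trip
    return FAKE_REF_STORE.get(key)

ref_lookup("hotel:1")
ref_lookup("hotel:1")            # served from cache; no second round-trip
assert lookup_calls == 1
```

For volatile ref data, a plain LRU can serve stale entries, so a real deployment would bound staleness too (e.g. a TTL alongside the LRU eviction).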
We’re done, right?
Just need to QA the data…
70% missing data
Outline
 Spark Streaming & Standalone Cluster Overview
 Design Patterns for Performance
 Guaranteed Message Processing & Direct Kafka
Integration
 Operational Monitoring & Alerting
 Spark Cluster & App Resilience
Guaranteed Message Processing &
Direct Kafka Integration
 Guaranteed Message Processing = At-least-once
processing + idempotence
 Kafka Receiver
 Consumes messages faster than Spark can process
 Checkpoints before processing finished
 Inefficient CPU utilization
 Direct Kafka Integration
 Control over checkpointing & transactionality
 Better distribution of resource consumption
 1:1 Kafka Topic-partition to Spark RDD-partition
 Use Kafka as WAL
 Statelessness, Fail-fast
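The equation on this slide (at-least-once + idempotence = guaranteed processing) can be shown with a toy model: messages may be redelivered after a failure, but keying the side effect by (partition, offset) makes replay a no-op. Names here are illustrative, not Spark or Kafka APIs:

```python
# Toy model of at-least-once delivery with an idempotent sink.
store = {}

def process(partition, offset, value):
    key = (partition, offset)
    if key in store:             # already applied: replay is a no-op
        return
    store[key] = value.upper()   # stand-in for the real side effect

process(0, 42, "booking")
process(0, 42, "booking")        # redelivery after a simulated crash
assert list(store.values()) == ["BOOKING"]
```

The same effect is often achieved with an upsert keyed by offset, or a conditional write, rather than an in-process set.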
Outline
 Spark Streaming & Standalone Cluster Overview
 Design Patterns for Performance
 Guaranteed Message Processing & Direct Kafka
Integration
 Operational Monitoring & Alerting
 Spark Cluster & App Resilience
Operational Monitoring
& Alerting
 Driver “Heartbeat”
 Batch processing time
 Message count
 Kafka lag (latest offsets vs last processed)
 Driver start events
 StatsD + Graphite + Seyren
 http://localhost:4040/metrics/json/
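The "Kafka lag" metric above is just the per-partition gap between the broker's latest offsets and the last offsets the job processed. A sketch with illustrative numbers:

```python
# Kafka lag = latest broker offsets minus last processed offsets.
latest = {0: 1050, 1: 980, 2: 1100}      # per-partition head offsets
committed = {0: 1000, 1: 980, 2: 1090}   # last processed by the app

lag = {p: latest[p] - committed.get(p, 0) for p in latest}
total_lag = sum(lag.values())
assert lag == {0: 50, 1: 0, 2: 10}
assert total_lag == 60   # emit via StatsD; alert (Seyren) if it grows
```

A steadily growing total lag means batch processing time exceeds the batch interval, which is exactly what the receiver-based setup hid.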
Data loss fixed
So we’re done, right?
Cluster & app
continuously crashing
Outline
 Spark Streaming & Standalone Cluster Overview
 Design Patterns for Performance
 Guaranteed Message Processing & Direct Kafka
Integration
 Operational Monitoring & Alerting
 Spark Cluster & App Resilience
Spark Cluster & App Stability
Spark slave memory utilization
Spark Cluster & App Stability
 Slave memory overhead
 OOM killer
 Crashes + Kafka Receiver = missing data
 Supervised driver: pass “--supervise” to spark-submit;
log driver restart events
 Cluster resource overprovisioning
 Standby Masters for failover
 Auto-cleanup of work directories
spark.worker.cleanup.enabled=true
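The last two bullets translate into a worker-side setting and a submit flag. A config sketch (master URL, class and jar names are placeholders):

```shell
# On each worker, enable periodic cleanup of old work directories
# (set before starting the worker, e.g. in conf/spark-env.sh):
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"

# Submit with a supervised driver so the standalone master
# restarts it automatically if it exits with a non-zero code:
spark-submit \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.StreamingApp \
  streaming-app.jar
```

Note that `spark.worker.cleanup.enabled` is a worker setting, so it takes effect when the worker process starts, not per-application.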
We’re done, right?
Finally, yes
Party Time
TL;DR
1. Use Direct Kafka Integration + transactionality
2. Cache reference data for speed
3. Avoid shuffles & driver bottlenecks
4. Supervised driver
5. Cleanup worker temp directory
6. Beware of function closures
7. Cluster resource over-provisioning
8. Spark slave memory headroom
9. Monitoring on Driver heartbeat & Kafka lag
10. Standby masters
Spark Streaming
+ Kafka
Best Practices
Brandon O’Brien
@hakczar
Expedia, Inc
Thanks!
Links
 Operationalizing Spark Streaming:
https://techblog.expedia.com/2016/12/29/operationalizing-
spark-streaming-part-1/
 Direct Kafka Integration:
https://databricks.com/blog/2015/03/30/improvements-to-
kafka-integration-of-spark-streaming.html
 App metrics: http://localhost:4040/metrics/json/
 MetricsSystem:
http://www.hammerlab.org/2015/02/27/monitoring-spark-
with-graphite-and-grafana/
 sparkConf.set("spark.worker.cleanup.enabled", "true")


Editor's Notes

  1. Tell our story, to share learnings
  2. This was our use case, yours may be different
  3. This is our use case, yours may be different
  4. Live system to reason about
  5. Not necessarily the only way to set it up. Save IP space
  6. Ok, we built the app in the spark framework for scalability, made it fast,
  7. Pause, check on game player
  8. Spark is hiding the fact that it can’t keep up with the stream. Crash + restart + bad checkpoint = missing messages. Config can ameliorate this, but it is an artifact of the absence of a WAL/HDFS. Multiple data loss scenarios. Direct Kafka Integration = statelessness
  9. Simple, at a glance: batch process time < batch interval. Strong checkpointing strategy (direct) + fail-fast / idempotent code, then driver heartbeat + Kafka lag = confidence
  10. After a few days, we notice…
  11. After a few days, we notice…
  12. I thought resiliency was the promise of Spark. Resilient distributed datasets
  13. The app was crashing, but why
  14. Crashes while using the Kafka Receiver = missing data (no WAL). Is Spark so flaky? Spark was being attacked by the operating system…and doing surprisingly well given the circumstances, especially with the direct Kafka integration and checkpointing. Goal: have enough resiliency, redundancy, idempotence and checkpointing to survive multiple failures without causing problems.