
Deconstructing Recommendations on Spark (Ilya Ganelin, Capital One)



Presentation at Spark Summit 2015



  1. Lessons from Spark (or: How I learned to stop worrying and love the shuffle)
     By: Ilya Ganelin
  2. What:
     •  Recommendations (MLlib)
        –  Collaborative Filtering (CF)
        –  65 million × 10 million
        –  2.5 TB, ~12 PFLOPs
        –  All recs (the long tail)
     •  You're mad!
        –  Don't usually generate all possible recs
        –  6 data nodes (~40 GB RAM)
        –  Shared cluster
        –  Spark stability
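For concreteness, collaborative filtering in MLlib looks roughly like the sketch below. This is a minimal, hedged example: the input path, rank, iteration count, and lambda are hypothetical placeholders, not the settings from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Minimal MLlib CF sketch; hyperparameters are illustrative only.
val sc = new SparkContext(new SparkConf().setAppName("cf-sketch"))

val ratings = sc.textFile("hdfs:///ratings.csv").map { line =>
  val Array(user, item, score) = line.split(",")
  Rating(user.toInt, item.toInt, score.toDouble)
}

val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)

// Scoring a single (user, item) pair: the baseline that the talk later
// speeds up 75x with caching, broadcasts, and driver-side threading.
val score = model.predict(42, 7)
```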
  3. Overview
     •  Goal:
        –  Understand how Spark internals drive design and configuration
     •  Contents:
        –  Background
           •  Partitions
           •  Caching
           •  Serialization
           •  Shuffle
        –  Lessons 1-4
  4. Background
  5. Partitions, Caching, and Serialization
     •  Partitions
        –  How data is split on disk
        –  Affects memory usage and shuffle size
        –  Count ~ speed, count ~ 1/memory: more partitions mean smaller, faster tasks that each hold less data in memory
     •  Caching
        –  Persist RDDs in distributed memory
        –  Major speedup for repeated operations
     •  Serialization
        –  Efficient movement of data
        –  Java vs. Kryo
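A minimal sketch pulling the three levers together, assuming Spark 1.x-era APIs; the input path, partition count, and storage level are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Kryo instead of default Java serialization: more compact, faster.
val conf = new SparkConf()
  .setAppName("partitions-caching-serialization")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// Partition count is a speed/memory trade-off: more partitions means
// smaller, cheaper tasks, at the cost of more scheduling overhead.
val data = sc.textFile("hdfs:///input").repartition(400)

// Cache an RDD that is reused by more than one action.
data.persist(StorageLevel.MEMORY_AND_DISK_SER)
data.count()  // first action materializes the cache
data.count()  // second action reads from the cache
```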
  6. Shuffle?
  7. Shuffle!
     •  All-to-all operations
        –  reduceByKey, groupByKey
     •  Data movement
        –  Serialization
        –  Akka
     •  Memory overhead
        –  Dumps to disk when OOM
        –  Garbage collection
     •  EXPENSIVE!
     (Slide diagram: Map → Reduce)
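The canonical illustration of shuffle cost, as a small sketch with toy data: both operations below compute per-key sums, but they move very different amounts of data across the network.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey ships every (key, value) pair across the network before
// aggregating: maximum shuffle traffic.
val sums1 = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so only one partial sum
// per key per partition crosses the wire.
val sums2 = pairs.reduceByKey(_ + _)
```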
  8. Lessons
  9. Lesson 1: Spark is a problem child!
     •  Memory
        –  You're using more than you think
           •  JVM overhead
           •  Spark metadata (shuffles, long-running jobs)
           •  Scala vs. Java
        –  Shuffle & heap
     •  Debugging is hard
        –  Distributed logs
        –  Hundreds of tasks
  10. Lesson 1: Discipline
      •  Tame the beast (memory)
         –  Partition wisely
         –  Know your data! (size, types, distribution)
         –  Kryo serialization
      •  Cleanup
         –  Long-running jobs consume memory indefinitely
         –  Spark context cleanup fails in production environments
         –  Solution: YARN!
            •  Separate spark-submit per batch
            •  A stable Spark-based job that runs for weeks
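A sketch of the serialization discipline, assuming a Spark 1.2+ SparkConf; the `Rec` case class is a hypothetical stand-in for whatever record type dominates your shuffles and caches:

```scala
import org.apache.spark.SparkConf

// Kryo plus explicit class registration keeps serialized shuffle and
// cache data compact; unregistered classes pay for full class names.
case class Rec(user: Int, item: Int, score: Double)

val conf = new SparkConf()
  .setAppName("disciplined-batch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Rec]))
```

The cleanup fix is operational rather than code-level: running each batch as its own spark-submit under YARN gives every batch a fresh JVM, so leaked memory dies with the batch instead of accumulating for weeks.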
  11. Lesson 2: Avoid shuffles!
      •  Why?
         –  Speed up execution
         –  Increase stability
         –  ????
         –  Profit!
      •  How?
         –  Use the driver!
            •  Collect
            •  Broadcast
            •  Accumulators
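A sketch of the driver-side pattern with toy data, assuming the small side of the join fits comfortably in driver and executor memory:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("driver-tricks"))
val big = sc.parallelize(1 to 1000000).map(i => (i % 100, i))

// Collect a *small* RDD to the driver, then broadcast it: every
// executor receives the table once, and the join becomes a local
// lookup instead of a shuffle-heavy join().
val lookup = sc.broadcast(
  sc.parallelize(Seq((1, "a"), (2, "b"))).collectAsMap())

val joined = big.flatMap { case (k, v) =>
  lookup.value.get(k).map(label => (k, v, label))
}

// Accumulators send small per-task results back to the driver.
val matched = sc.accumulator(0L)
joined.foreach { _ => matched += 1L }
println(matched.value)
```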
  12. Lesson 3: Using the driver is hard!
      •  Limited memory
         –  Collected RDDs
         –  Metadata
         –  Results (accumulators)
      •  Akka messaging
         –  10e6 records × 120 bytes ≈ 1.2 GB across 20 partitions
         –  ~60 MB read per partition (default frame size is 10 MB)
         –  Solution: partition appropriately and set akka.frameSize: know your data!
      •  Big data
         –  Solution: batch processing
         –  Problem: cleanup and long-term stability
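A hedged sketch of both fixes, using the Spark 1.x config key spark.akka.frameSize (value in MB); the frame size, path, and partition counts are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Results that flow through the driver travel over Akka, so the frame
// size must exceed the largest per-partition payload. Mirroring the
// slide's numbers: 10e6 records x 120 bytes ~ 1.2 GB over 20
// partitions is ~60 MB each, far above the 10 MB default.
val conf = new SparkConf()
  .setAppName("sized-for-the-driver")
  .set("spark.akka.frameSize", "128") // in MB

val sc = new SparkContext(conf)

// Or shrink each partition instead: 200 partitions => ~6 MB apiece.
val recs = sc.textFile("hdfs:///recs").repartition(200)
```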
  13. Lesson 4: Speed!
      •  Cache, but cache wisely
         –  If you use it twice, cache it
      •  Broadcast variables
         –  Visible to all executors
         –  Serialized only once
         –  Blazing-fast lookups!
      •  Threading
         –  Thread pool on the driver
         –  Fast operations, many tasks
      •  75x speedup over MLlib ALS predict()
         –  Start: 1 rec / 1.5 seconds
         –  End: 50 recs / second
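A minimal sketch of driver-side threading, assuming independent per-user jobs; the pool size and the per-user work are placeholders for the talk's rec-scoring calls:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("threaded-driver"))

// SparkContext is thread-safe: a driver-side pool can submit many
// small, independent jobs concurrently and keep the cluster saturated.
implicit val ec = ExecutionContext.fromExecutorService(
  Executors.newFixedThreadPool(8))

// Placeholder per-user work standing in for per-user rec scoring.
val jobs = (1 to 100).map { user =>
  Future { sc.parallelize(1 to 1000).map(_ * user).sum() }
}
val results = Await.result(Future.sequence(jobs), 10.minutes)
ec.shutdown()
```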
  14. Questions?
