SlideShare a Scribd company logo
Lessons from Running Large
Scale Spark Workloads
Reynold Xin, Matei Zaharia
Feb 19, 2015 @ Strata
About Databricks
Founded by the creators of Spark in 2013
Largest organization contributing to Spark
End-to-end hosted service, Databricks Cloud
2
A slide from 2013 …
3
4
Does Spark scale?
5
Does Spark scale? Yes!
6
On-Disk Sort Record:
Time to sort 100TB
2100 machines2013 Record:
Hadoop
2014 Record:
Spark
Source: Daytona GraySort benchmark, sortbenchmark.org
72 minutes
207 machines
23 minutes
Also sorted 1PB in 4 hours
Agenda
Spark “hall of fame”
Architectural improvements for scalability
Q&A
7
8
Spark “Hall of Fame”
LARGEST SINGLE-DAY INTAKE LONGEST-RUNNING JOB
LARGEST SHUFFLE MOST INTERESTING APP
Tencent
(1PB+ /day)
Alibaba
(1 week on 1PB+ data)
Databricks PB Sort
(1PB)
Jeremy Freeman
Mapping the Brain at Scale
(with lasers!)
LARGEST CLUSTER
Tencent
(8000+ nodes)
Based on Reynold’s personal knowledge
Largest Cluster & Daily Intake
9
800 million+
active users
8000+
nodes
150 PB+
1 PB+/day
Spark at Tencent
10Images courtesy of Lianhui Wang
Ads CTR prediction
Similarity metrics on billions of nodes
Spark SQL for BI and ETL
…
Spark
SQL streaming
machine
learning
graphETL
HDFS Hive HBase MySQL PostgreSQL others
See Lianhui’s session this afternoon
Longest Running Spark Job
Alibaba Taobao uses Spark to perform image feature
extraction for product recommendation.
1 week runtime on petabytes of images!
Largest e-commerce site (800m products, 48k sold/min)
ETL, machine learning, graph computation, streaming
11
Alibaba Taobao: Extensive Use of GraphX
12
clustering
(community detection)
belief propagation
(influence & credibility)
collaborative filtering
(recommendation)
Images courtesy of Andy Huang, Joseph Gonzales
See upcoming talk at Spark Summit on streaming graphs
13
Mapping the
brain at scale
Images courtesy of Jeremy Freeman
14Images courtesy of Jeremy Freeman
15Images courtesy of Jeremy Freeman
Architectural Improvements
Elastic Scaling: improve resource utilization (1.2)
Revamped Shuffle: shuffle at scale (1.2)
DataFrames: easier high performance at scale (1.3)
16
Static Resource Allocation
17
Resource
(CPU / Mem)
Time
Allocated
Used
New job New job
Stragglers Stragglers
Static Resource Allocation
18
Resource
(CPU / Mem)
Time
Allocated
Used
More resources allocated than are used!
Static Resource Allocation
19
Resource
(CPU / Mem)
Time
Allocated
Used
Resources wasted!
Elastic Scaling
20
Resource
(CPU / Mem)
Time
Allocated
Used
Elastic Scaling
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 3
spark.dynamicAllocation.maxExecutors 20
- Optional -
spark.dynamicAllocation.executorIdleTimeout
spark.dynamicAllocation.schedulerBacklogTimeout
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout
21
Shuffle at Scale
Sort-based shuffle
Netty-based transport
22
Sort-based Shuffle
Old hash-based shuffle: requires R
(# reduce tasks) concurrent streams with
buffers; limits R.
New sort-based shuffle: sort records by
partition first, and then write them; one
active stream at a time
Went up to 250,000 reduce tasks in PB sort!
Netty-based Network Transport
Zero-copy network send
Explicitly managed memory buffers for shuffle (lowering
GC pressure)
Auto retries on transient network failures
24
Network Transport
Sustaining 1.1GB/s/node in shuffle
DataFrames in Spark
Current Spark API is based on Java / Python objects
–  Hard for engine to store compactly
–  Cannot understand semantics of user functions
DataFrames make it easy to leverage structured data
–  Compact, column-oriented storage
–  Rich optimization via Spark SQL’s Catalyst optimizer
26
DataFrames in Spark
Distributed collection of data with known schema
Domain-specific functions for common tasks
–  Metadata
–  Sampling
–  Project, filter, aggregation, join, …
–  UDFs
27
28
DataFrames
Similar API to data frames
in R and Pandas
Optimized and compiled to
bytecode by Spark SQL
Read/write to Hive, JSON,
Parquet, ORC, Pandas
df = jsonFile(“tweets.json”)
df[df.user == “matei”]
.groupBy(“date”)
.sum(“retweets”)
0
5
10
Python Scala DataFrame
RunningTime
29
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= “2015-01-01”)
logical plan
filter
join
scan
(users)
scan
(events)
physical plan
join
scan
(users)
filter
scan
(events)
this join is expensive à
30
logical plan
filter
join
scan
(users)
scan
(events)
optimized plan
join
scan
(users)
filter
scan
(events)
optimized plan
with intelligent data sources
join
scan
(users)
filter scan
(events)
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= “2015-01-01”)
DataFrames arrive in Spark 1.3!
From MapReduce to Spark
public static class WordCountMapClass extends MapReduceBase!
implements Mapper<LongWritable, Text, Text, IntWritable> {!
!
private final static IntWritable one = new IntWritable(1);!
private Text word = new Text();!
!
public void map(LongWritable key, Text value,!
OutputCollector<Text, IntWritable> output,!
Reporter reporter) throws IOException {!
String line = value.toString();!
StringTokenizer itr = new StringTokenizer(line);!
while (itr.hasMoreTokens()) {!
word.set(itr.nextToken());!
output.collect(word, one);!
}!
}!
}!
!
public static class WorkdCountReduce extends MapReduceBase!
implements Reducer<Text, IntWritable, Text, IntWritable> {!
!
public void reduce(Text key, Iterator<IntWritable> values,!
OutputCollector<Text, IntWritable> output,!
Reporter reporter) throws IOException {!
int sum = 0;!
while (values.hasNext()) {!
sum += values.next().get();!
}!
output.collect(key, new IntWritable(sum));!
}!
}!
val file = spark.textFile("hdfs://...")!
val counts = file.flatMap(line => line.split(" "))!
.map(word => (word, 1))!
.reduceByKey(_ + _)!
counts.saveAsTextFile("hdfs://...")!
32
Thank you! Questions?

More Related Content

What's hot

Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
Databricks
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Spark Summit
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Databricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
Databricks
 

What's hot (20)

Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Community
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
 
Visualizing big data in the browser using spark
Visualizing big data in the browser using sparkVisualizing big data in the browser using spark
Visualizing big data in the browser using spark
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 

Viewers also liked

Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in Spark
Databricks
 
New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2
DataWorks Summit
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark Application
DataWorks Summit
 
Weld Strata talk
Weld Strata talkWeld Strata talk
Weld Strata talk
Deepak Narayanan
 

Viewers also liked (16)

Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
Keynote at Spark Summit
Keynote at Spark SummitKeynote at Spark Summit
Keynote at Spark Summit
 
Autoscaling Spark on AWS EC2 - 11th Spark London meetup
Autoscaling Spark on AWS EC2 - 11th Spark London meetupAutoscaling Spark on AWS EC2 - 11th Spark London meetup
Autoscaling Spark on AWS EC2 - 11th Spark London meetup
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in Spark
 
New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2
 
Spark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your applicationSpark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your application
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Lessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerLessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On Docker
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark Application
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Webinar - DataStax Enterprise 5.1: 3X the operational analytics speed, help f...
Webinar - DataStax Enterprise 5.1: 3X the operational analytics speed, help f...Webinar - DataStax Enterprise 5.1: 3X the operational analytics speed, help f...
Webinar - DataStax Enterprise 5.1: 3X the operational analytics speed, help f...
 
Operating Systems: Linux in Detail
Operating Systems: Linux in DetailOperating Systems: Linux in Detail
Operating Systems: Linux in Detail
 
Operating Systems: Versions of Linux
Operating Systems: Versions of LinuxOperating Systems: Versions of Linux
Operating Systems: Versions of Linux
 
Weld Strata talk
Weld Strata talkWeld Strata talk
Weld Strata talk
 

Similar to Lessons from Running Large Scale Spark Workloads

Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
DataWorks Summit
 
SF Big Analytics meetup : Hoodie From Uber
SF Big Analytics meetup : Hoodie  From UberSF Big Analytics meetup : Hoodie  From Uber
SF Big Analytics meetup : Hoodie From Uber
Chester Chen
 

Similar to Lessons from Running Large Scale Spark Workloads (20)

A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
An Overview of Apache Spark
An Overview of Apache SparkAn Overview of Apache Spark
An Overview of Apache Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Beginner Apache Spark Presentation
Beginner Apache Spark PresentationBeginner Apache Spark Presentation
Beginner Apache Spark Presentation
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
SF Big Analytics meetup : Hoodie From Uber
SF Big Analytics meetup : Hoodie  From UberSF Big Analytics meetup : Hoodie  From Uber
SF Big Analytics meetup : Hoodie From Uber
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
Max Lee
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
mbmh111980
 

Recently uploaded (20)

Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024
 
JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
 
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesGraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
 
Benefits of Employee Monitoring Software
Benefits of  Employee Monitoring SoftwareBenefits of  Employee Monitoring Software
Benefits of Employee Monitoring Software
 
A Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationA Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data Migration
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabber
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
Studiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareStudiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting software
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfBreaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdf
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 

Lessons from Running Large Scale Spark Workloads

  • 1. Lessons from Running Large Scale Spark Workloads Reynold Xin, Matei Zaharia Feb 19, 2015 @ Strata
  • 2. About Databricks Founded by the creators of Spark in 2013 Largest organization contributing to Spark End-to-end hosted service, Databricks Cloud 2
  • 3. A slide from 2013 … 3
  • 6. 6 On-Disk Sort Record: Time to sort 100TB 2100 machines2013 Record: Hadoop 2014 Record: Spark Source: Daytona GraySort benchmark, sortbenchmark.org 72 minutes 207 machines 23 minutes Also sorted 1PB in 4 hours
  • 7. Agenda Spark “hall of fame” Architectural improvements for scalability Q&A 7
  • 8. 8 Spark “Hall of Fame” LARGEST SINGLE-DAY INTAKE LONGEST-RUNNING JOB LARGEST SHUFFLE MOST INTERESTING APP Tencent (1PB+ /day) Alibaba (1 week on 1PB+ data) Databricks PB Sort (1PB) Jeremy Freeman Mapping the Brain at Scale (with lasers!) LARGEST CLUSTER Tencent (8000+ nodes) Based on Reynold’s personal knowledge
  • 9. Largest Cluster & Daily Intake 9 800 million+ active users 8000+ nodes 150 PB+ 1 PB+/day
  • 10. Spark at Tencent 10Images courtesy of Lianhui Wang Ads CTR prediction Similarity metrics on billions of nodes Spark SQL for BI and ETL … Spark SQL streaming machine learning graphETL HDFS Hive HBase MySQL PostgreSQL others See Lianhui’s session this afternoon
  • 11. Longest Running Spark Job Alibaba Taobao uses Spark to perform image feature extraction for product recommendation. 1 week runtime on petabytes of images! Largest e-commerce site (800m products, 48k sold/min) ETL, machine learning, graph computation, streaming 11
  • 12. Alibaba Taobao: Extensive Use of GraphX 12 clustering (community detection) belief propagation (influence & credibility) collaborative filtering (recommendation) Images courtesy of Andy Huang, Joseph Gonzales See upcoming talk at Spark Summit on streaming graphs
  • 13. 13 Mapping the brain at scale Images courtesy of Jeremy Freeman
  • 14. 14Images courtesy of Jeremy Freeman
  • 15. 15Images courtesy of Jeremy Freeman
  • 16. Architectural Improvements Elastic Scaling: improve resource utilization (1.2) Revamped Shuffle: shuffle at scale (1.2) DataFrames: easier high performance at scale (1.3) 16
  • 17. Static Resource Allocation 17 Resource (CPU / Mem) Time Allocated Used New job New job Stragglers Stragglers
  • 18. Static Resource Allocation 18 Resource (CPU / Mem) Time Allocated Used More resources allocated than are used!
  • 19. Static Resource Allocation 19 Resource (CPU / Mem) Time Allocated Used Resources wasted!
  • 20. Elastic Scaling 20 Resource (CPU / Mem) Time Allocated Used
  • 21. Elastic Scaling spark.dynamicAllocation.enabled true spark.dynamicAllocation.minExecutors 3 spark.dynamicAllocation.maxExecutors 20 - Optional - spark.dynamicAllocation.executorIdleTimeout spark.dynamicAllocation.schedulerBacklogTimeout spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 21
  • 22. Shuffle at Scale Sort-based shuffle Netty-based transport 22
  • 23. Sort-based Shuffle Old hash-based shuffle: requires R (# reduce tasks) concurrent streams with buffers; limits R. New sort-based shuffle: sort records by partition first, and then write them; one active stream at a time Went up to 250,000 reduce tasks in PB sort!
  • 24. Netty-based Network Transport Zero-copy network send Explicitly managed memory buffers for shuffle (lowering GC pressure) Auto retries on transient network failures 24
  • 26. DataFrames in Spark Current Spark API is based on Java / Python objects –  Hard for engine to store compactly –  Cannot understand semantics of user functions DataFrames make it easy to leverage structured data –  Compact, column-oriented storage –  Rich optimization via Spark SQL’s Catalyst optimizer 26
  • 27. DataFrames in Spark Distributed collection of data with known schema Domain-specific functions for common tasks –  Metadata –  Sampling –  Project, filter, aggregation, join, … –  UDFs 27
  • 28. 28 DataFrames Similar API to data frames in R and Pandas Optimized and compiled to bytecode by Spark SQL Read/write to Hive, JSON, Parquet, ORC, Pandas df = jsonFile(“tweets.json”) df[df.user == “matei”] .groupBy(“date”) .sum(“retweets”) 0 5 10 Python Scala DataFrame RunningTime
  • 29. 29 joined = users.join(events, users.id == events.uid) filtered = joined.filter(events.date >= “2015-01-01”) logical plan filter join scan (users) scan (events) physical plan join scan (users) filter scan (events) this join is expensive à
  • 30. 30 logical plan filter join scan (users) scan (events) optimized plan join scan (users) filter scan (events) optimized plan with intelligent data sources join scan (users) filter scan (events) joined = users.join(events, users.id == events.uid) filtered = joined.filter(events.date >= “2015-01-01”) DataFrames arrive in Spark 1.3!
  • 31. From MapReduce to Spark public static class WordCountMapClass extends MapReduceBase! implements Mapper<LongWritable, Text, Text, IntWritable> {! ! private final static IntWritable one = new IntWritable(1);! private Text word = new Text();! ! public void map(LongWritable key, Text value,! OutputCollector<Text, IntWritable> output,! Reporter reporter) throws IOException {! String line = value.toString();! StringTokenizer itr = new StringTokenizer(line);! while (itr.hasMoreTokens()) {! word.set(itr.nextToken());! output.collect(word, one);! }! }! }! ! public static class WorkdCountReduce extends MapReduceBase! implements Reducer<Text, IntWritable, Text, IntWritable> {! ! public void reduce(Text key, Iterator<IntWritable> values,! OutputCollector<Text, IntWritable> output,! Reporter reporter) throws IOException {! int sum = 0;! while (values.hasNext()) {! sum += values.next().get();! }! output.collect(key, new IntWritable(sum));! }! }! val file = spark.textFile("hdfs://...")! val counts = file.flatMap(line => line.split(" "))! .map(word => (word, 1))! .reduceByKey(_ + _)! counts.saveAsTextFile("hdfs://...")!
  • 32. 32