Lessons from Running Large Scale Spark Workloads

Databricks
DatabricksDeveloper Marketing and Relations at MuleSoft
Lessons from Running Large
Scale Spark Workloads
Reynold Xin, Matei Zaharia
Feb 19, 2015 @ Strata
About Databricks
Founded by the creators of Spark in 2013
Largest organization contributing to Spark
End-to-end hosted service, Databricks Cloud
2
A slide from 2013 …
3
4
Does Spark scale?
5
Does Spark scale? Yes!
6
On-Disk Sort Record:
Time to sort 100TB
2100 machines2013 Record:
Hadoop
2014 Record:
Spark
Source: Daytona GraySort benchmark, sortbenchmark.org
72 minutes
207 machines
23 minutes
Also sorted 1PB in 4 hours
Agenda
Spark “hall of fame”
Architectural improvements for scalability
Q&A
7
8
Spark “Hall of Fame”
LARGEST SINGLE-DAY INTAKE LONGEST-RUNNING JOB
LARGEST SHUFFLE MOST INTERESTING APP
Tencent
(1PB+ /day)
Alibaba
(1 week on 1PB+ data)
Databricks PB Sort
(1PB)
Jeremy Freeman
Mapping the Brain at Scale
(with lasers!)
LARGEST CLUSTER
Tencent
(8000+ nodes)
Based on Reynold’s personal knowledge
Largest Cluster & Daily Intake
9
800 million+
active users
8000+
nodes
150 PB+
1 PB+/day
Spark at Tencent
10Images courtesy of Lianhui Wang
Ads CTR prediction
Similarity metrics on billions of nodes
Spark SQL for BI and ETL
…
Spark
SQL streaming
machine
learning
graphETL
HDFS Hive HBase MySQL PostgreSQL others
See Lianhui’s session this afternoon
Longest Running Spark Job
Alibaba Taobao uses Spark to perform image feature
extraction for product recommendation.
1 week runtime on petabytes of images!
Largest e-commerce site (800m products, 48k sold/min)
ETL, machine learning, graph computation, streaming
11
Alibaba Taobao: Extensive Use of GraphX
12
clustering
(community detection)
belief propagation
(influence & credibility)
collaborative filtering
(recommendation)
Images courtesy of Andy Huang, Joseph Gonzales
See upcoming talk at Spark Summit on streaming graphs
13
Mapping the
brain at scale
Images courtesy of Jeremy Freeman
14Images courtesy of Jeremy Freeman
15Images courtesy of Jeremy Freeman
Architectural Improvements
Elastic Scaling: improve resource utilization (1.2)
Revamped Shuffle: shuffle at scale (1.2)
DataFrames: easier high performance at scale (1.3)
16
Static Resource Allocation
17
Resource
(CPU / Mem)
Time
Allocated
Used
New job New job
Stragglers Stragglers
Static Resource Allocation
18
Resource
(CPU / Mem)
Time
Allocated
Used
More resources allocated than are used!
Static Resource Allocation
19
Resource
(CPU / Mem)
Time
Allocated
Used
Resources wasted!
Elastic Scaling
20
Resource
(CPU / Mem)
Time
Allocated
Used
Elastic Scaling
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 3
spark.dynamicAllocation.maxExecutors 20
- Optional -
spark.dynamicAllocation.executorIdleTimeout
spark.dynamicAllocation.schedulerBacklogTimeout
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout
21
Shuffle at Scale
Sort-based shuffle
Netty-based transport
22
Sort-based Shuffle
Old hash-based shuffle: requires R
(# reduce tasks) concurrent streams with
buffers; limits R.
New sort-based shuffle: sort records by
partition first, and then write them; one
active stream at a time
Went up to 250,000 reduce tasks in PB sort!
Netty-based Network Transport
Zero-copy network send
Explicitly managed memory buffers for shuffle (lowering
GC pressure)
Auto retries on transient network failures
24
Network Transport
Sustaining 1.1GB/s/node in shuffle
DataFrames in Spark
Current Spark API is based on Java / Python objects
–  Hard for engine to store compactly
–  Cannot understand semantics of user functions
DataFrames make it easy to leverage structured data
–  Compact, column-oriented storage
–  Rich optimization via Spark SQL’s Catalyst optimizer
26
DataFrames in Spark
Distributed collection of data with known schema
Domain-specific functions for common tasks
–  Metadata
–  Sampling
–  Project, filter, aggregation, join, …
–  UDFs
27
28
DataFrames
Similar API to data frames
in R and Pandas
Optimized and compiled to
bytecode by Spark SQL
Read/write to Hive, JSON,
Parquet, ORC, Pandas
df = jsonFile(“tweets.json”)
df[df.user == “matei”]
.groupBy(“date”)
.sum(“retweets”)
0
5
10
Python Scala DataFrame
RunningTime
29
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= “2015-01-01”)
logical plan
filter
join
scan
(users)
scan
(events)
physical plan
join
scan
(users)
filter
scan
(events)
this join is expensive à
30
logical plan
filter
join
scan
(users)
scan
(events)
optimized plan
join
scan
(users)
filter
scan
(events)
optimized plan
with intelligent data sources
join
scan
(users)
filter scan
(events)
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= “2015-01-01”)
DataFrames arrive in Spark 1.3!
From MapReduce to Spark
public static class WordCountMapClass extends MapReduceBase!
implements Mapper<LongWritable, Text, Text, IntWritable> {!
!
private final static IntWritable one = new IntWritable(1);!
private Text word = new Text();!
!
public void map(LongWritable key, Text value,!
OutputCollector<Text, IntWritable> output,!
Reporter reporter) throws IOException {!
String line = value.toString();!
StringTokenizer itr = new StringTokenizer(line);!
while (itr.hasMoreTokens()) {!
word.set(itr.nextToken());!
output.collect(word, one);!
}!
}!
}!
!
public static class WorkdCountReduce extends MapReduceBase!
implements Reducer<Text, IntWritable, Text, IntWritable> {!
!
public void reduce(Text key, Iterator<IntWritable> values,!
OutputCollector<Text, IntWritable> output,!
Reporter reporter) throws IOException {!
int sum = 0;!
while (values.hasNext()) {!
sum += values.next().get();!
}!
output.collect(key, new IntWritable(sum));!
}!
}!
val file = spark.textFile("hdfs://...")!
val counts = file.flatMap(line => line.split(" "))!
.map(word => (word, 1))!
.reduceByKey(_ + _)!
counts.saveAsTextFile("hdfs://...")!
32
Thank you! Questions?
1 of 33

Recommended

Spark Application Carousel: Highlights of Several Applications Built with Spark by
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
2.1K views29 slides
New Developments in Spark by
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
9.7K views43 slides
Spark Under the Hood - Meetup @ Data Science London by
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
2.5K views33 slides
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa... by
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks
4.3K views42 slides
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell by
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
8.9K views36 slides
Spark's Role in the Big Data Ecosystem (Spark Summit 2014) by
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
3.6K views20 slides

More Related Content

What's hot

Jump Start on Apache® Spark™ 2.x with Databricks by
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
1.3K views135 slides
Real-Time Spark: From Interactive Queries to Streaming by
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
5.2K views39 slides
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ... by
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
1.9K views29 slides
Jump Start into Apache® Spark™ and Databricks by
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
3.9K views39 slides
Spark streaming state of the union by
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
3.9K views56 slides
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics by
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
1.2K views40 slides

What's hot(20)

Jump Start on Apache® Spark™ 2.x with Databricks by Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
Databricks1.3K views
Real-Time Spark: From Interactive Queries to Streaming by Databricks
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks5.2K views
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ... by Spark Summit
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Spark Summit1.9K views
Jump Start into Apache® Spark™ and Databricks by Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks3.9K views
Spark streaming state of the union by Databricks
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
Databricks3.9K views
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics by Miklos Christine
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
Miklos Christine1.2K views
Using SparkR to Scale Data Science Applications in Production. Lessons from t... by Spark Summit
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Spark Summit1.5K views
The BDAS Open Source Community by jeykottalam
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Community
jeykottalam6.9K views
Introduction to Spark (Intern Event Presentation) by Databricks
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
Databricks2.9K views
Advanced Natural Language Processing with Apache Spark NLP by Databricks
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
Databricks742 views
Visualizing big data in the browser using spark by Databricks
Visualizing big data in the browser using sparkVisualizing big data in the browser using spark
Visualizing big data in the browser using spark
Databricks5.8K views
Enabling Exploratory Analysis of Large Data with Apache Spark and R by Databricks
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Databricks2K views
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell by Databricks
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Databricks4.6K views
Announcing Databricks Cloud (Spark Summit 2014) by Databricks
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
Databricks1.9K views
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu by Databricks
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Databricks7.3K views
Enabling exploratory data science with Spark and R by Databricks
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
Databricks8.1K views
End-to-end Data Pipeline with Apache Spark by Databricks
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks6.9K views
Structuring Spark: DataFrames, Datasets, and Streaming by Databricks
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
Databricks7.8K views
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp... by Databricks
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Databricks1.9K views
Strata NYC 2015 - Supercharging R with Apache Spark by Databricks
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
Databricks3.7K views

Viewers also liked

Sqoop on Spark for Data Ingestion by
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
16.3K views42 slides
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber) by
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Spark Summit
11.5K views29 slides
Keynote at Spark Summit by
Keynote at Spark SummitKeynote at Spark Summit
Keynote at Spark SummitGloria Lau
4K views13 slides
Autoscaling Spark on AWS EC2 - 11th Spark London meetup by
Autoscaling Spark on AWS EC2 - 11th Spark London meetupAutoscaling Spark on AWS EC2 - 11th Spark London meetup
Autoscaling Spark on AWS EC2 - 11th Spark London meetupRafal Kwasny
5.7K views46 slides
Dynamic Allocation in Spark by
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in SparkDatabricks
10.1K views39 slides
New Data Transfer Tools for Hadoop: Sqoop 2 by
New Data Transfer Tools for Hadoop: Sqoop 2New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2DataWorks Summit
18.9K views45 slides

Viewers also liked(16)

Sqoop on Spark for Data Ingestion by DataWorks Summit
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
DataWorks Summit16.3K views
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber) by Spark Summit
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Spark Summit11.5K views
Keynote at Spark Summit by Gloria Lau
Keynote at Spark SummitKeynote at Spark Summit
Keynote at Spark Summit
Gloria Lau4K views
Autoscaling Spark on AWS EC2 - 11th Spark London meetup by Rafal Kwasny
Autoscaling Spark on AWS EC2 - 11th Spark London meetupAutoscaling Spark on AWS EC2 - 11th Spark London meetup
Autoscaling Spark on AWS EC2 - 11th Spark London meetup
Rafal Kwasny5.7K views
Dynamic Allocation in Spark by Databricks
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in Spark
Databricks10.1K views
New Data Transfer Tools for Hadoop: Sqoop 2 by DataWorks Summit
New Data Transfer Tools for Hadoop: Sqoop 2New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2
DataWorks Summit18.9K views
Spark Summit EU 2015: SparkUI visualization: a lens into your application by Databricks
Spark Summit EU 2015: SparkUI visualization: a lens into your applicationSpark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your application
Databricks6.5K views
Memory Management in Apache Spark by Databricks
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks9.7K views
Lessons Learned From Running Spark On Docker by Spark Summit
Lessons Learned From Running Spark On DockerLessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On Docker
Spark Summit15.5K views
Dynamically Allocate Cluster Resources to your Spark Application by DataWorks Summit
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark Application
DataWorks Summit5.7K views
Beyond SQL: Speeding up Spark with DataFrames by Databricks
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks43.3K views
Best Practices for Using Apache Spark on AWS by Amazon Web Services
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services33.8K views
Webinar - DataStax Enterprise 5.1: 3X the operational analytics speed, help f... by DataStax
Webinar - DataStax Enterprise 5.1: 3X the operational analytics speed, help f...Webinar - DataStax Enterprise 5.1: 3X the operational analytics speed, help f...
Webinar - DataStax Enterprise 5.1: 3X the operational analytics speed, help f...
DataStax1K views
Operating Systems: Linux in Detail by Damian T. Gordon
Operating Systems: Linux in DetailOperating Systems: Linux in Detail
Operating Systems: Linux in Detail
Damian T. Gordon3.7K views
Operating Systems: Versions of Linux by Damian T. Gordon
Operating Systems: Versions of LinuxOperating Systems: Versions of Linux
Operating Systems: Versions of Linux
Damian T. Gordon2.7K views

Similar to Lessons from Running Large Scale Spark Workloads

A look under the hood at Apache Spark's API and engine evolutions by
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
3.2K views56 slides
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark" by
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"IT Event
333 views45 slides
Distributed Deep Learning + others for Spark Meetup by
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupVijay Srinivas Agneeswaran, Ph.D
2.1K views43 slides
An Overview of Apache Spark by
An Overview of Apache SparkAn Overview of Apache Spark
An Overview of Apache SparkYasoda Jayaweera
201 views93 slides
Jump Start with Apache Spark 2.0 on Databricks by
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
1.1K views114 slides
Unified Big Data Processing with Apache Spark (QCON 2014) by
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
3.7K views52 slides

Similar to Lessons from Running Large Scale Spark Workloads(20)

A look under the hood at Apache Spark's API and engine evolutions by Databricks
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks3.2K views
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark" by IT Event
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
IT Event333 views
Jump Start with Apache Spark 2.0 on Databricks by Anyscale
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Anyscale1.1K views
Unified Big Data Processing with Apache Spark (QCON 2014) by Databricks
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks3.7K views
In Memory Analytics with Apache Spark by Venkata Naga Ravi
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
Venkata Naga Ravi3.5K views
An introduction To Apache Spark by Amir Sedighi
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi10.4K views
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ... by Chetan Khatri
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
Chetan Khatri283 views
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S... by BigDataEverywhere
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
BigDataEverywhere896 views
Hadoop & Hive Change the Data Warehousing Game Forever by DataWorks Summit
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
DataWorks Summit3.3K views
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming by Paco Nathan
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan7.6K views
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy by Rohit Kulkarni
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Rohit Kulkarni4.2K views
Unified Big Data Processing with Apache Spark by C4Media
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media2K views
Apache Spark Performance is too hard. Let's make it easier by Databricks
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
Databricks2.9K views
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3 by Databricks
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks1.5K views
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014 by Dataiku
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku2.2K views
SF Big Analytics meetup : Hoodie From Uber by Chester Chen
SF Big Analytics meetup : Hoodie  From UberSF Big Analytics meetup : Hoodie  From Uber
SF Big Analytics meetup : Hoodie From Uber
Chester Chen367 views

More from Databricks

DW Migration Webinar-March 2022.pptx by
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
4.3K views25 slides
Data Lakehouse Symposium | Day 1 | Part 1 by
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
1.5K views43 slides
Data Lakehouse Symposium | Day 1 | Part 2 by
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
743 views16 slides
Data Lakehouse Symposium | Day 4 by
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
1.8K views74 slides
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop by
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
6.3K views64 slides
Democratizing Data Quality Through a Centralized Platform by
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
1.4K views36 slides

More from Databricks(20)

DW Migration Webinar-March 2022.pptx by Databricks
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks4.3K views
Data Lakehouse Symposium | Day 1 | Part 1 by Databricks
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks1.5K views
Data Lakehouse Symposium | Day 1 | Part 2 by Databricks
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks743 views
Data Lakehouse Symposium | Day 4 by Databricks
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks1.8K views
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop by Databricks
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks6.3K views
Democratizing Data Quality Through a Centralized Platform by Databricks
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks1.4K views
Learn to Use Databricks for Data Science by Databricks
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks1.6K views
Why APM Is Not the Same As ML Monitoring by Databricks
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks743 views
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix by Databricks
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks689 views
Stage Level Scheduling Improving Big Data and AI Integration by Databricks
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks850 views
Simplify Data Conversion from Spark to TensorFlow and PyTorch by Databricks
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks1.8K views
Scaling your Data Pipelines with Apache Spark on Kubernetes by Databricks
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks2.1K views
Scaling and Unifying SciKit Learn and Apache Spark Pipelines by Databricks
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks667 views
Sawtooth Windows for Feature Aggregations by Databricks
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks606 views
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink by Databricks
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks677 views
Re-imagine Data Monitoring with whylogs and Spark by Databricks
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks551 views
Raven: End-to-end Optimization of ML Prediction Queries by Databricks
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks450 views
Processing Large Datasets for ADAS Applications using Apache Spark by Databricks
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks513 views
Massive Data Processing in Adobe Using Delta Lake by Databricks
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks719 views
Machine Learning CI/CD for Email Attack Detection by Databricks
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks389 views

Recently uploaded

Dapr Unleashed: Accelerating Microservice Development by
Dapr Unleashed: Accelerating Microservice DevelopmentDapr Unleashed: Accelerating Microservice Development
Dapr Unleashed: Accelerating Microservice DevelopmentMiroslav Janeski
13 views29 slides
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P... by
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...NimaTorabi2
16 views17 slides
MS PowerPoint.pptx by
MS PowerPoint.pptxMS PowerPoint.pptx
MS PowerPoint.pptxLitty Sylus
7 views14 slides
Quality Assurance by
Quality Assurance Quality Assurance
Quality Assurance interworksoftware2
5 views6 slides
DRYiCE™ iAutomate: AI-enhanced Intelligent Runbook Automation by
DRYiCE™ iAutomate: AI-enhanced Intelligent Runbook AutomationDRYiCE™ iAutomate: AI-enhanced Intelligent Runbook Automation
DRYiCE™ iAutomate: AI-enhanced Intelligent Runbook AutomationHCLSoftware
6 views8 slides
FIMA 2023 Neo4j & FS - Entity Resolution.pptx by
FIMA 2023 Neo4j & FS - Entity Resolution.pptxFIMA 2023 Neo4j & FS - Entity Resolution.pptx
FIMA 2023 Neo4j & FS - Entity Resolution.pptxNeo4j
17 views26 slides

Recently uploaded(20)

Dapr Unleashed: Accelerating Microservice Development by Miroslav Janeski
Dapr Unleashed: Accelerating Microservice DevelopmentDapr Unleashed: Accelerating Microservice Development
Dapr Unleashed: Accelerating Microservice Development
Miroslav Janeski13 views
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P... by NimaTorabi2
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...
NimaTorabi216 views
DRYiCE™ iAutomate: AI-enhanced Intelligent Runbook Automation by HCLSoftware
DRYiCE™ iAutomate: AI-enhanced Intelligent Runbook AutomationDRYiCE™ iAutomate: AI-enhanced Intelligent Runbook Automation
DRYiCE™ iAutomate: AI-enhanced Intelligent Runbook Automation
HCLSoftware6 views
FIMA 2023 Neo4j & FS - Entity Resolution.pptx by Neo4j
FIMA 2023 Neo4j & FS - Entity Resolution.pptxFIMA 2023 Neo4j & FS - Entity Resolution.pptx
FIMA 2023 Neo4j & FS - Entity Resolution.pptx
Neo4j17 views
360 graden fabriek by info33492
360 graden fabriek360 graden fabriek
360 graden fabriek
info33492162 views
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium... by Lisi Hocke
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Lisi Hocke35 views
Understanding HTML terminology by artembondar5
Understanding HTML terminologyUnderstanding HTML terminology
Understanding HTML terminology
artembondar57 views
How Workforce Management Software Empowers SMEs | TraQSuite by TraQSuite
How Workforce Management Software Empowers SMEs | TraQSuiteHow Workforce Management Software Empowers SMEs | TraQSuite
How Workforce Management Software Empowers SMEs | TraQSuite
TraQSuite6 views
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated... by TomHalpin9
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...
TomHalpin96 views
tecnologia18.docx by nosi6702
tecnologia18.docxtecnologia18.docx
tecnologia18.docx
nosi67025 views
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx by animuscrm
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
animuscrm15 views

Lessons from Running Large Scale Spark Workloads

  • 1. Lessons from Running Large Scale Spark Workloads Reynold Xin, Matei Zaharia Feb 19, 2015 @ Strata
  • 2. About Databricks Founded by the creators of Spark in 2013 Largest organization contributing to Spark End-to-end hosted service, Databricks Cloud 2
  • 3. A slide from 2013 … 3
  • 6. 6 On-Disk Sort Record: Time to sort 100TB 2100 machines2013 Record: Hadoop 2014 Record: Spark Source: Daytona GraySort benchmark, sortbenchmark.org 72 minutes 207 machines 23 minutes Also sorted 1PB in 4 hours
  • 7. Agenda Spark “hall of fame” Architectural improvements for scalability Q&A 7
  • 8. 8 Spark “Hall of Fame” LARGEST SINGLE-DAY INTAKE LONGEST-RUNNING JOB LARGEST SHUFFLE MOST INTERESTING APP Tencent (1PB+ /day) Alibaba (1 week on 1PB+ data) Databricks PB Sort (1PB) Jeremy Freeman Mapping the Brain at Scale (with lasers!) LARGEST CLUSTER Tencent (8000+ nodes) Based on Reynold’s personal knowledge
  • 9. Largest Cluster & Daily Intake 9 800 million+ active users 8000+ nodes 150 PB+ 1 PB+/day
  • 10. Spark at Tencent 10Images courtesy of Lianhui Wang Ads CTR prediction Similarity metrics on billions of nodes Spark SQL for BI and ETL … Spark SQL streaming machine learning graphETL HDFS Hive HBase MySQL PostgreSQL others See Lianhui’s session this afternoon
  • 11. Longest Running Spark Job Alibaba Taobao uses Spark to perform image feature extraction for product recommendation. 1 week runtime on petabytes of images! Largest e-commerce site (800m products, 48k sold/min) ETL, machine learning, graph computation, streaming 11
  • 12. Alibaba Taobao: Extensive Use of GraphX 12 clustering (community detection) belief propagation (influence & credibility) collaborative filtering (recommendation) Images courtesy of Andy Huang, Joseph Gonzales See upcoming talk at Spark Summit on streaming graphs
  • 13. 13 Mapping the brain at scale Images courtesy of Jeremy Freeman
  • 14. 14Images courtesy of Jeremy Freeman
  • 15. 15Images courtesy of Jeremy Freeman
  • 16. Architectural Improvements Elastic Scaling: improve resource utilization (1.2) Revamped Shuffle: shuffle at scale (1.2) DataFrames: easier high performance at scale (1.3) 16
  • 17. Static Resource Allocation 17 Resource (CPU / Mem) Time Allocated Used New job New job Stragglers Stragglers
  • 18. Static Resource Allocation 18 Resource (CPU / Mem) Time Allocated Used More resources allocated than are used!
  • 19. Static Resource Allocation 19 Resource (CPU / Mem) Time Allocated Used Resources wasted!
  • 20. Elastic Scaling 20 Resource (CPU / Mem) Time Allocated Used
  • 21. Elastic Scaling spark.dynamicAllocation.enabled true spark.dynamicAllocation.minExecutors 3 spark.dynamicAllocation.maxExecutors 20 - Optional - spark.dynamicAllocation.executorIdleTimeout spark.dynamicAllocation.schedulerBacklogTimeout spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 21
  • 22. Shuffle at Scale Sort-based shuffle Netty-based transport 22
  • 23. Sort-based Shuffle Old hash-based shuffle: requires R (# reduce tasks) concurrent streams with buffers; limits R. New sort-based shuffle: sort records by partition first, and then write them; one active stream at a time Went up to 250,000 reduce tasks in PB sort!
  • 24. Netty-based Network Transport Zero-copy network send Explicitly managed memory buffers for shuffle (lowering GC pressure) Auto retries on transient network failures 24
  • 26. DataFrames in Spark Current Spark API is based on Java / Python objects –  Hard for engine to store compactly –  Cannot understand semantics of user functions DataFrames make it easy to leverage structured data –  Compact, column-oriented storage –  Rich optimization via Spark SQL’s Catalyst optimizer 26
  • 27. DataFrames in Spark Distributed collection of data with known schema Domain-specific functions for common tasks –  Metadata –  Sampling –  Project, filter, aggregation, join, … –  UDFs 27
  • 28. 28 DataFrames Similar API to data frames in R and Pandas Optimized and compiled to bytecode by Spark SQL Read/write to Hive, JSON, Parquet, ORC, Pandas df = jsonFile(“tweets.json”) df[df.user == “matei”] .groupBy(“date”) .sum(“retweets”) 0 5 10 Python Scala DataFrame RunningTime
  • 29. 29 joined = users.join(events, users.id == events.uid) filtered = joined.filter(events.date >= “2015-01-01”) logical plan filter join scan (users) scan (events) physical plan join scan (users) filter scan (events) this join is expensive à
  • 30. 30 logical plan filter join scan (users) scan (events) optimized plan join scan (users) filter scan (events) optimized plan with intelligent data sources join scan (users) filter scan (events) joined = users.join(events, users.id == events.uid) filtered = joined.filter(events.date >= “2015-01-01”) DataFrames arrive in Spark 1.3!
  • 31. From MapReduce to Spark public static class WordCountMapClass extends MapReduceBase! implements Mapper<LongWritable, Text, Text, IntWritable> {! ! private final static IntWritable one = new IntWritable(1);! private Text word = new Text();! ! public void map(LongWritable key, Text value,! OutputCollector<Text, IntWritable> output,! Reporter reporter) throws IOException {! String line = value.toString();! StringTokenizer itr = new StringTokenizer(line);! while (itr.hasMoreTokens()) {! word.set(itr.nextToken());! output.collect(word, one);! }! }! }! ! public static class WorkdCountReduce extends MapReduceBase! implements Reducer<Text, IntWritable, Text, IntWritable> {! ! public void reduce(Text key, Iterator<IntWritable> values,! OutputCollector<Text, IntWritable> output,! Reporter reporter) throws IOException {! int sum = 0;! while (values.hasNext()) {! sum += values.next().get();! }! output.collect(key, new IntWritable(sum));! }! }! val file = spark.textFile("hdfs://...")! val counts = file.flatMap(line => line.split(" "))! .map(word => (word, 1))! .reduceByKey(_ + _)! counts.saveAsTextFile("hdfs://...")!
  • 32. 32