Performance tuning your Hadoop/Spark clusters to use cloud storage

Apache Spark™ and Apache Hadoop® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

stewu@microsoft.com
Spark™ and Hive™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

• Cloud elasticity
• Scale compute and storage independently
• No expensive data centers
• Less management
• Global presence
• Availability SLAs

Azure
• Compute: Azure Databricks / Azure HDInsight / Azure VM
• Storage: Azure Data Lake / Azure Storage Blob
AWS
• Compute: Amazon Elastic MapReduce / Amazon Elastic Compute Cloud
• Storage: Amazon S3
Google Cloud™
• Compute: Google Dataproc / Google Compute Engine™
• Storage: Google Cloud Storage™
Workloads: Spark™, Hive™, MapReduce, Storm™, Kafka®, R Server
Amazon Web Services, the “Powered by AWS” logo, AWS, Amazon Elastic Compute Cloud, Amazon S3 are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries. ©2017 Google LLC. All rights reserved. Google and the Google Logo are registered trademarks of Google LLC. Spark™ and Hadoop® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other trademarks are the property of their respective owners.

CPU intensive
• Machine Learning
• Natural Language Processing
Memory intensive
• PageRank
• Real-time analytics
I/O intensive
• Copy
• Data preparation

• Choose the right infrastructure
• Use the right YARN settings
• Right-size your application

[Figure: utilized throughput versus available throughput between compute and storage]

Physical layer: VM type (compute, memory, GPU), VM size, # of VMs; resources: memory, cores, network
RM (YARN) layer: # of containers, container size; resources: memory, cores
Application layer: tasks

Use VMs with more network bandwidth
[Figure: network bandwidth scale, low to high]

Spark™
Hive™

A cluster runs applications; each application has executors, and each executor runs tasks.

Default: four apps
Total resources: 64 cores, 192 GB
Your app: 16 cores, 48 GB

Two apps
Total resources: 64 cores, 192 GB
App #1: 32 cores, 96 GB
App #2: 32 cores, 96 GB
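The per-app figures on these two slides are consistent with an even split of the cluster across the apps running at the same time. A minimal sketch of that arithmetic, assuming an even split; the function name `per_app_share` is made up for illustration, and real YARN queues and schedulers may divide resources differently:

```python
def per_app_share(total_cores, total_memory_gb, num_apps):
    """Evenly divide cluster resources across apps (an illustrative assumption)."""
    if num_apps < 1:
        raise ValueError("num_apps must be at least 1")
    return total_cores // num_apps, total_memory_gb // num_apps


# Totals taken from the slides: 64 cores and 192 GB.
four_apps = per_app_share(64, 192, 4)
two_apps = per_app_share(64, 192, 2)
print("four apps:", four_apps)
print("two apps:", two_apps)
```

With fewer apps sharing the cluster, each app can be given a larger slice before any executor tuning starts.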

Your app: size and count

Size: executor-cores, executor-memory
Count: num-executors
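These three settings correspond to the `spark-submit` options `--executor-cores`, `--executor-memory`, and `--num-executors` (the last one applies when running on YARN). The sketch below only assembles a command line; the helper name `spark_submit_args`, the script name `your_app.py`, and the sample values are placeholders, not values prescribed by the slides:

```python
def spark_submit_args(app_path, num_executors, executor_cores, executor_memory_gb):
    """Build a spark-submit argument list for the size and count settings."""
    return [
        "spark-submit",
        "--num-executors", str(num_executors),          # count
        "--executor-cores", str(executor_cores),        # size: cores per executor
        "--executor-memory", f"{executor_memory_gb}g",  # size: heap per executor
        app_path,
    ]


# Placeholder values for illustration only.
args = spark_submit_args("your_app.py", 16, 2, 6)
print(" ".join(args))
```

Building the list in one place makes it easier to try different size and count combinations without editing the command by hand.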

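Executor memory (app resources: 32 cores, 96 GB)
• 6 GB per executor: 16 executors
• 12 GB per executor: 8 executors (fewer executors)
• 3 GB per executor: 32 executors (more executors, at the risk of running out of memory)
Best practice: set each executor to no more than 64 GB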

Executor cores (app resources: 32 cores, 96 GB)
• 2 cores per executor: 16 executors (more executors)
• 4 cores per executor: 8 executors
• 5 cores per executor: 6 executors (fewer executors)
Best practice: set cores between 2 and 5

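App resources: 32 cores, 96 GB
• 4 cores, 6 GB per executor: 8 executors by cores, 16 executors by memory, so 8 executors. Memory underutilized
• 2 cores, 6 GB per executor: 16 executors by cores, 16 executors by memory, so 16 executors. Both memory and CPU fully utilized
• 2 cores, 12 GB per executor: 16 executors by cores, 8 executors by memory, so 8 executors. CPU underutilized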
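The three cases above can be checked with a few lines of arithmetic: the executor count is limited by whichever resource runs out first. This is a simplified sketch, not Spark's actual allocation logic; the name `plan_executors` is hypothetical, and the calculation ignores per-executor memory overhead and the resources taken by the driver or application master:

```python
def plan_executors(app_cores, app_memory_gb, executor_cores, executor_memory_gb):
    """Return how many executors fit and how much of each resource they use."""
    by_cores = app_cores // executor_cores
    by_memory = app_memory_gb // executor_memory_gb
    count = min(by_cores, by_memory)  # the scarcer resource sets the limit
    return {
        "executors": count,
        "cores_used": count * executor_cores,
        "memory_used_gb": count * executor_memory_gb,
    }


# App resources from the slide: 32 cores and 96 GB.
for cores, memory_gb in [(4, 6), (2, 6), (2, 12)]:
    print(f"{cores} cores, {memory_gb} GB:", plan_executors(32, 96, cores, memory_gb))
```

Comparing `cores_used` and `memory_used_gb` against the app totals shows which resource is left idle for a given executor size.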

Input data → map stage (map tasks) → reducer stage (reduce tasks) → output

Container size (YARN resources: 96 GB memory, 32 cores)
• 6 GB: 16 containers
• 12 GB: 8 containers (fewer containers)
• 3 GB: 32 containers (more containers, at the risk of running out of memory)
Best practice: set the container size to the minimum YARN container size
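The container counts on this slide follow from dividing YARN memory by the container size. A small sketch of that division, assuming memory is the only limit; the helper name `containers_for` is made up here. In Hive on Tez the container size is usually set through `hive.tez.container.size` (in MB), a property name the slide itself does not show:

```python
def containers_for(yarn_memory_gb, container_size_gb):
    """Number of containers that fit in YARN memory at a given container size."""
    return yarn_memory_gb // container_size_gb


# YARN memory from the slide: 96 GB.
for size_gb in (3, 6, 12):
    print(f"{size_gb} GB containers: {containers_for(96, size_gb)}")
```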

32 containers
• Mappers (read): # of map tasks at least as large as # of containers
• Reducers (write): # of reduce tasks at least as large as # of containers

Input data: 80 MB, with a target of 1.6 waves of map tasks across the containers (best effort)
• 1.6 waves would mean 10 MB per map task
• tez.grouping.min-size = 20MB, so each map task reads 20 MB and there are fewer map tasks
Apache Tez® is either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries.

Set tez.grouping.min-size = 5MB
• Each map task now reads 10 MB of the 80 MB input, which matches the target of 1.6 waves across the containers (best effort)
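The effect of `tez.grouping.min-size` on the map task count can be sketched as follows. This is a simplified model of split grouping, not Tez's actual implementation: it aims for `waves × containers` tasks and then clamps the per-task size between the minimum and maximum group size. The function name `plan_map_tasks`, the 5-container figure, and the 1024 MB maximum are assumptions for illustration; only the 80 MB input, the 1.6 waves, and the 20 MB and 5 MB minimum sizes come from the slides:

```python
import math


def plan_map_tasks(input_mb, containers, waves, min_size_mb, max_size_mb=1024):
    """Return (number of map tasks, MB per task) under a simplified grouping model."""
    desired_tasks = max(1, round(containers * waves))
    per_task_mb = input_mb / desired_tasks
    # The grouping size cannot go below the minimum or above the maximum.
    per_task_mb = min(max(per_task_mb, min_size_mb), max_size_mb)
    return math.ceil(input_mb / per_task_mb), per_task_mb


# Assumed: 5 containers, so 1.6 waves corresponds to 8 map tasks.
print(plan_map_tasks(80, 5, 1.6, min_size_mb=20))
print(plan_map_tasks(80, 5, 1.6, min_size_mb=5))
```

Under these assumptions the 20 MB minimum halves the number of map tasks relative to the target, while the 5 MB minimum lets the target of 10 MB per task through.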

Reduce tasks = (ReducerStage estimate / hive.exec.reducers.bytes.per.reducer) × hive.tez.max.partition.factor
5 containers; ReducerStage estimate = 512MB; hive.tez.max.partition.factor = 2
• Before: hive.exec.reducers.bytes.per.reducer = 256MB
• After: set hive.exec.reducers.bytes.per.reducer = 128MB
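A rough sketch of how the reduce task count responds to `hive.exec.reducers.bytes.per.reducer`. It assumes the commonly cited estimate for Hive on Tez, in which the stage estimate is divided by bytes per reducer, capped by a maximum reducer count, and multiplied by `hive.tez.max.partition.factor`. Reading the slide's numbers as a 512 MB stage estimate and a partition factor of 2 is an interpretation of the extracted text, and the cap of 1009 (`hive.exec.reducers.max`) and the function name are assumptions as well:

```python
import math


def estimate_reduce_tasks(stage_estimate_mb, bytes_per_reducer_mb,
                          max_partition_factor=2.0, max_reducers=1009):
    """Estimate the initial number of reduce tasks (simplified model)."""
    base = math.ceil(stage_estimate_mb / bytes_per_reducer_mb)
    base = max(1, min(max_reducers, base))
    return int(base * max_partition_factor)


containers = 5  # from the slide
for per_reducer_mb in (256, 128):
    tasks = estimate_reduce_tasks(512, per_reducer_mb)
    print(f"{per_reducer_mb} MB per reducer: {tasks} reduce tasks, "
          f"at least as many as containers: {tasks >= containers}")
```

Under this reading, lowering bytes per reducer from 256 MB to 128 MB raises the estimate above the number of containers, in line with the earlier guidance that reduce tasks should be at least as numerous as containers.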

