Spark performance tuning eng

Spark Performance Tuning
2018. 11. 23(Fri)

Contents
1. Overview
2. Service Configuration
3. Package & Test

1. Overview
Object
• Check the processing performance of RDB and HDFS data transfer using Spark
Premise
• Performance may differ depending on network conditions
Schedule
• 2018.11.22(Thur) ~ 11.23(Fri)
Tasks
1) Data (15 million) transfer (completed)
2) Data (100 million) transfer (completed)

Service configuration
1. System Configuration
2. Data Configuration

Data transfer is performed according to the configuration below. Each stage consists of S/W and H/W.
Load | Processing | Unload
Source Data | Spark | Output Data
Sellout Data | Transfer | Sellout Data
Oracle DB | Spark, Python | Oracle DB, HDFS

Service configuration: Hadoop Ecosystem
No | Software | Version | Contents
1 | Oracle Linux | 7.3 | OS
2 | Hadoop | 2.7.6 | Distributed Storage
3 | Spark | 2.0.2 | Distributed Processing
4 | Hive | 2.3.3 | Support SQL (Master, only master)
5 | MariaDB | 10.2.11 | RDB (Master, only master)
6 | Oracle Client | 18.3.0.0.0 | Oracle DB client

Node | Components
p-master | Hadoop Name Node, Secondary Name Node, Spark Master, Hive Master, Resource Manager, MariaDB
hadoop1 | Hadoop DataNode, Spark Worker, NodeManager
hadoop2 | Hadoop DataNode, Spark Worker, NodeManager
hadoop3 | Hadoop DataNode, Spark Worker, NodeManager

1. System Configuration (Process)
Node | IP address | Components
Input DB | 192.168.110.112 | Oracle DB
Output DB | 192.168.110.111 | Oracle DB
p-master | 192.168.110.117 | Hadoop Name Node, Secondary Name Node, Spark Master, Hive Master, Resource Manager, MariaDB
hadoop1 | 192.168.110.118 | Hadoop DataNode, Spark Worker, NodeManager
hadoop2 | 192.168.110.119 | Hadoop DataNode, Spark Worker, NodeManager
hadoop3 | 192.168.110.120 | Hadoop DataNode, Spark Worker, NodeManager

2. Data Configuration
Define the inbound and outbound data
Inbound
No | InterfaceID | Content | System | Type | Count | Periods | Column cnt | Comments
1 | IB-001 | Sellout | Dev System | RDB | 100 million | - | 17 |
2 | IB-002 | Sellout | Dev System | RDB | 15 million | | 17 |
3 | IB-003 | Parameter | Dev System | RDB | 2 | | 5 |
Outbound
No | InterfaceID | Content | System | Type | Count | Periods | Column cnt | Comments
1 | OB-001 | Sellout | Op System | RDB | 100 million | - | 17 |
2 | OB-002 | Sellout | Op System | RDB | 15 million | | 17 |

Package & Test
1. Parameter Mgmt
2. Source Implementation
3. Package
4. Test

1. Parameter Mgmt
Implement the parameter map by selecting only the necessary data information (for code flexibility).
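As a rough illustration of the parameter map idea, the sketch below keeps only the entries a job needs from a larger set of settings and fails early if one is missing. It is not the original implementation: `REQUIRED_KEYS`, `build_param_map`, and every key and value shown are hypothetical placeholders.

```python
# Hypothetical sketch: build a parameter map that keeps only the keys a job needs.
# None of these names or values come from the original project.

REQUIRED_KEYS = ("url", "user", "dbtable")


def build_param_map(raw_params, required_keys=REQUIRED_KEYS):
    """Return a dict with only the required keys; raise if one is missing."""
    missing = [key for key in required_keys if key not in raw_params]
    if missing:
        raise KeyError("missing parameters: " + ", ".join(missing))
    return {key: raw_params[key] for key in required_keys}


raw = {
    "url": "jdbc:oracle:thin:@//db-host:1521/service",  # placeholder URL
    "user": "etl_user",                                  # placeholder user
    "dbtable": "SELLOUT",                                # placeholder table
    "comment": "not needed by the job",                  # dropped by the filter
}
params = build_param_map(raw)
print(sorted(params))
```

Keeping the list of required keys in one place means a new source or target can be added by changing parameters rather than code, which is one way to read the "code flexibility" goal.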

3. Package
Maven: compiles the production code, packages it as a jar file, and manages compatible external modules
• Compile & Package
• Manage plug-ins
• Manage dependencies

4. Test 15 million (ORACLE → SPARK → ORACLE)
Fixed configuration files: spark-env.sh, spark-defaults.conf
Run command:
spark-submit --class com.spark.c1_dataLoadWrite.s9_Meddata sparkProgramming-spark-1.0.jar > test.log &
If the number of cores needs to be limited:
spark.cores.max=10
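The run command and the core limit can also be combined on the command line, since `spark-submit` accepts `--conf key=value` pairs. The sketch below only assembles the argument list and prints it; the helper name `spark_submit_command` is hypothetical, and whether the original tests passed the limit this way or through spark-defaults.conf is not stated in the slides.

```python
import shlex


def spark_submit_command(main_class, jar, conf=None):
    """Assemble a spark-submit argument list; conf entries become --conf key=value."""
    args = ["spark-submit", "--class", main_class]
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", "{}={}".format(key, value)]
    args.append(jar)
    return args


cmd = spark_submit_command(
    "com.spark.c1_dataLoadWrite.s9_Meddata",
    "sparkProgramming-spark-1.0.jar",
    conf={"spark.cores.max": 10},
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list makes it easy to vary one setting per test run while keeping the class and jar fixed.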

Settings (spark-defaults.conf)
Div | Value | Contents
Cluster | 3 | slave 3 → 118,119,120
Worker | 1 |
Executor-count | 3 |
Executor-core | 2 |
Executor-memory | 20 |
Total-core | 18 (3 * 3 * 2) | Cluster * Ex-count * Ex-core
Total-memory | 180 (3 * 3 * 20) | Cluster * Ex-count * Ex-memory
Result: 6.6 minutes

Settings (spark-defaults.conf)
Div | Value | Contents
Cluster | 3 | slave 3 → 118,119,120
Worker | 1 |
Executor-count | 1 |
Executor-core | 7 |
Executor-memory | 60 |
Total-core | 21 (3 * 7) | Cluster * Ex-count * Ex-core
Total-memory | 180 (3 * 60) | Cluster * Ex-count * Ex-memory
Result: 6.7 minutes
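The totals in the two configurations above follow from multiplying the node count by the executors per node and the per-executor value. A minimal sketch of that arithmetic, with the hypothetical helper `cluster_totals` and memory left in the slide's own (unstated) unit:

```python
def cluster_totals(nodes, executors_per_node, cores_per_executor, memory_per_executor):
    """Return (total cores, total memory) as nodes * executors * per-executor value."""
    total_cores = nodes * executors_per_node * cores_per_executor
    total_memory = nodes * executors_per_node * memory_per_executor
    return total_cores, total_memory


# Values taken from the two configurations above.
three_small = cluster_totals(3, 3, 2, 20)  # 3 executors per node, 2 cores each
one_large = cluster_totals(3, 1, 7, 60)    # 1 executor per node, 7 cores
print(three_small)  # (18, 180)
print(one_large)    # (21, 180)
```

Both layouts use the same total memory, while the single-executor layout has three more cores in total.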

4. Test 15 million (ORACLE → SPARK → HDFS)
Settings (spark-defaults.conf)
Div | Value | Contents
Cluster | 3 | slave 3 → 118,119,120
Worker | 1 |
Executor-count | 1 |
Executor-core | 7 |
Executor-memory | 60 |
Total-core | | Cluster * Ex-count * Ex-core
Total-memory | | Cluster * Ex-count * Ex-memory
Result: 5.7 minutes

4. Test 100 million (ORACLE → SPARK → ORACLE)
Settings (spark-defaults.conf)
Div | Value | Contents
Cluster | 3 | slave 3 → 118,119,120
Worker | 1 |
Executor-count | 3 |
Executor-core | 2 |
Executor-memory | 20 |
Total-core | 18 (3 * 3 * 2) | Cluster * Ex-count * Ex-core
Total-memory | 180 (3 * 3 * 20) | Cluster * Ex-count * Ex-memory
Result: 44 minutes

Settings (spark-defaults.conf)
Div | Value | Contents
Cluster | 3 | slave 3 → 118,119,120
Worker | 1 |
Executor-count | 1 |
Executor-core | 7 |
Executor-memory | 60 |
Total-core | | Cluster * Ex-count * Ex-core
Total-memory | | Cluster * Ex-count * Ex-memory
Result: 47 minutes

4. Test 100 million (ORACLE → SPARK → HDFS)
Settings (spark-defaults.conf)
Div | Value | Contents
Cluster | 3 | slave 3 → 118,119,120
Worker | 1 |
Executor-count | 1 |
Executor-core | 7 |
Executor-memory | 60 |
Total-core | | Cluster * Ex-count * Ex-core
Total-memory | | Cluster * Ex-count * Ex-memory
Result: 51 minutes

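Conclusion
• With the same total memory, running more executors (3 executors with 2 cores each per node instead of 1 executor with 7 cores) can help improve performance: Oracle → Oracle took 6.6 vs 6.7 minutes for 15 million and 44 vs 47 minutes for 100 million
• Setting numPartitions is important when manipulating data
• Next step: measure performance according to the number of partitions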
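For the planned partition test, Spark's JDBC data source takes `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` as read options. The sketch below only builds such an option map in plain Python so different partition counts can be tried. The helper `jdbc_read_options`, the URL, the table, the column, the bounds, and the partition count are all placeholders, not values from the original tests.

```python
def jdbc_read_options(url, table, partition_column, lower, upper, num_partitions):
    """Build an option map for a partitioned Spark JDBC read (values as strings)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


opts = jdbc_read_options(
    "jdbc:oracle:thin:@//db-host:1521/service",  # placeholder URL
    "SELLOUT",                                   # placeholder table name
    "ID",                                        # placeholder numeric column
    1,
    15000000,                                    # placeholder upper bound
    21,                                          # arbitrary partition count for illustration
)
print(opts["numPartitions"])
```

With PySpark available, a map like this could be passed as `spark.read.format("jdbc").options(**opts).load()`, together with credentials and driver settings. The bounds only decide how the column range is split across partitions; they do not filter rows.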
Div | 15 million data | 100 million data
Oracle <> Oracle (1 Executor) | 6.7 min | 47 min
Oracle <> Oracle (3 Executor) | 6.6 min | 44 min
Oracle <> HDFS (1 Executor) | 5.7 min | 51 min
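To compare runs of different sizes, the elapsed times above can be converted to approximate throughput. This is a small sketch that assumes each "data" count is a row count; the helper `rows_per_second` and the dictionary layout are illustrative only.

```python
def rows_per_second(rows, minutes):
    """Average throughput for a run that moved `rows` rows in `minutes` minutes."""
    return rows / (minutes * 60.0)


# (rows, minutes) pairs copied from the results table above.
results = {
    "Oracle <> Oracle (1 Executor), 15 million": (15 * 10**6, 6.7),
    "Oracle <> Oracle (3 Executor), 15 million": (15 * 10**6, 6.6),
    "Oracle <> HDFS (1 Executor), 15 million": (15 * 10**6, 5.7),
    "Oracle <> Oracle (1 Executor), 100 million": (100 * 10**6, 47),
    "Oracle <> Oracle (3 Executor), 100 million": (100 * 10**6, 44),
    "Oracle <> HDFS (1 Executor), 100 million": (100 * 10**6, 51),
}

for name, (rows, minutes) in results.items():
    print("{}: {:.0f} rows/s".format(name, rows_per_second(rows, minutes)))
```

Expressed this way, the two data sizes can be read on the same scale, which may help when the partition-count tests are added.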
