SlideShare a Scribd company logo
Spark Performance
Tuning
2018. 11. 23(Fri)
Contents
1. Overview
2. Service Configuration
3. Package & Test
2
1. Overview
3
Object
• To check the processing performance of RDB and HDFS data transfer using Spark
Premise
Schedule
Tasks
1) data(15 million) transfer
2) data(100 million) transfer
• Performance differences are possible depening on network conditions
• 2018.11.22(Thur) ~ 11.23(Fri)
(completed)
(completed)
Service configuration
4
1. System Configuration
2. Data Configuration
According to the configuration below, Data tranferation is performed
5
Load Processing Unload
Source Data Spark Output Data
Sellout Data Transfer Sellout Data
Oracle
DB
Spark,
Python
Oracle
DB
1. System Configuration
S/W
H/W
S/W
H/W
S/W
H/W
HDFS
6
1. System Configuration
6
Hadoop
Name Node
Spark
Master
Hive
Master
Resource Manger
No Lv1 Lv2 Version Contents
1
Oracle
Linux
7.3 OS
2 Hadoop 2.7.6 Distributed Storage
3 Spark 2.0.2
Distributed
Processing
4 Hive 2.3.3
Supprt SQL
(Master, Only master)
5 MariaDB 10.2.11
RDB
(Master, Only master)
6
Oracle
Client
18.3.0.0.
0
Oracle DB
client
Maria DB
Hadoop
DataNode
Spark
Worker
NodeManager
Hadoop
DataNode
Spark
Worker
NodeManager
Hadoop
DataNode
Spark
Worker
NodeManager
Service configurationHadoop Ecosystem
Secondary
Name Node
p-master
hadoop1 hadoop2 hadoop3
1. System Configuration (Process)
7
Input DB Output DB
192.168.110.112 192.168.110.111
Hadoop
Name Node
Spark
Master
Hive
Master
Resource Manger
Maria DB
Hadoop
DataNode
Spark
Worker
NodeManager
Hadoop
DataNode
Spark
Worker
NodeManager
Hadoop
DataNode
Spark
Worker
NodeManager
Hadoop Ecosystem
Secondary
Name Node
p-master
hadoop1 hadoop2 hadoop3
192.168.110.117
192.168.110.118 192.168.110.119 192.168.110.120
2. Data Configuration
8
Define the outbound and inbound data
No InterfaceID Content System Type Count Periods Column cnt Comments
1 IB-001 Sellout Dev System RDB 100 million - 17
2 IB-002 Sellout Dev System RDB 15 million 17
3 IB-003 Parameter Dev System RDB 2 5
inbound
No InterfaceID Content System Type Count Periods Column cnt Comments
1 OB-001 Sellout Op System RDB
100
million
- 17
2 OB-002 Sellout Op System RDB 15 million 17
outbound
Package & Test
9
1. Parameter Mgmt
2. Source Implementation
3. Package
4. Test
1. Parameter Mgmt
10
Implement the parameter map by selecting only necessary data information.(code flexibility)
2. Source Implementation
11
3. Package
12
Maven: Compile production code and package with jar file & manage compatible
external modules
Compile & Package
Manage plug-ins
Manage dependencies
4. Test 15 million (ORACLE → SPARK → ORACLE)
spark-env.sh
spark-defaults.conf
Fix Configure
spark-submit --class com.spark.c1_dataLoadWrite.s9_Meddata sparkProgramming-
spark-1.0.jar > test.log &
If the count of core shall be limited
spark.cores.max=10
4. Test 15 million (ORACLE → SPARK → ORACLE)
Div Value Contents
Cluster 3 slave 3 → 118,119,120
Worker 1
Executor-count 3
Executor-core 2
Executor-memory 20
Total-core 18 (3 * 3 * 2)
Cluster *
Ex-count * Ex-core
Total-memory 180 (3 * 3 * 20)
Cluster *
Ex-count * Ex-core
spark-defaults.conf
6.6 minutes
4. Test 15 million (ORACLE → SPARK → ORACLE)
Div Value Contents
Cluster 3 slave 3 → 118,119,120
Worker 1
Executor-count 1
Executor-core 7
Executor-memory 60
Total-core 21(3 * 7)
Cluster *
Ex-count * Ex-core
Total-memory 180 (3 * 60)
Cluster *
Ex-count * Ex-core
spark-defaults.conf
6.7 minutes
4. Test 15 million (ORACLE → SPARK → HDFS)
Div Value Contents
Cluster 3 slave 3 → 118,119,120
Worker 1
Executor-count 1
Executor-core 7
Executor-memory 60
Total-core 21(3 * 7)
Cluster *
Ex-count * Ex-core
Total-memory 180 (3 * 60)
Cluster *
Ex-count * Ex-core
spark-defaults.conf
5.7 minutes
17
4. Test 100 million (ORACLE → SPARK → ORACLE)
Div Value Contents
Cluster 3 slave 3 → 118,119,120
Worker 1
Executor-count 3
Executor-core 2
Executor-memory 20
Total-core 18 (3 * 3 * 2)
Cluster *
Ex-count * Ex-core
Total-memory 180 (3 * 3 * 20)
Cluster *
Ex-count * Ex-core
spark-defaults.conf
44 minutes
4. Test 100 million (ORACLE → SPARK → ORACLE)
Div Value Contents
Cluster 3 slave 3 → 118,119,120
Worker 1
Executor-count 1
Executor-core 7
Executor-memory 60
Total-core 21(3 * 7)
Cluster *
Ex-count * Ex-core
Total-memory 180 (3 * 60)
Cluster *
Ex-count * Ex-core
spark-defaults.conf
47 minutes
4. Test 100 million (ORACLE → SPARK → HDFS)
51 minutes
Div Value Contents
Cluster 3 slave 3 → 118,119,120
Worker 1
Executor-count 1
Executor-core 7
Executor-memory 60
Total-core 21(3 * 7)
Cluster *
Ex-count * Ex-core
Total-memory 180 (3 * 60)
Cluster *
Ex-count * Ex-core
spark-defaults.conf
Conclustion
20
Conclusion
• Generating a large number of executors from the same
resource can help improve performance
• Setting the numpartition is important when manipulating data
Next time, performance check is required according to the
number of partitions
Div 15 millions data 100 millions data
Oracle <> Oracle (1Executor) 6.7 min 47 min
Oracle <> Oracle (3Executor) 6.6 min 44 min
Oracle <> HDFS (1Executor) 5.7 min 51 min
Conclusion
• Generating a large number of executors from the same
resource can help improve performance
• Setting the numpartition is important when manipulating data
Next time, performance check is required according to the
number of partitions
Thank you
23
End of Document

More Related Content

What's hot

Building Spark as Service in Cloud
Building Spark as Service in CloudBuilding Spark as Service in Cloud
Building Spark as Service in Cloud
InMobi Technology
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
Sigmoid
 
CaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use CasesCaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use Cases
DataWorks Summit
 
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Spark Summit
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
PGConf APAC
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
Sigmoid
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
DataWorks Summit
 
Learning postgresql
Learning postgresqlLearning postgresql
Learning postgresql
DAVID RAUDALES
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They Work
Ilya Ganelin
 
How to teach an elephant to rock'n'roll
How to teach an elephant to rock'n'rollHow to teach an elephant to rock'n'roll
How to teach an elephant to rock'n'roll
PGConf APAC
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
Dongwon Kim
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Spark Summit
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxData
 
Introduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizerIntroduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizer
Mydbops
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
Databricks
 
Evolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best PracticesEvolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best Practices
Mydbops
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
MySQL shell and It's utilities - Praveen GR (Mydbops Team)
MySQL shell and It's utilities - Praveen GR (Mydbops Team)MySQL shell and It's utilities - Praveen GR (Mydbops Team)
MySQL shell and It's utilities - Praveen GR (Mydbops Team)
Mydbops
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
DataWorks Summit
 

What's hot (20)

Building Spark as Service in Cloud
Building Spark as Service in CloudBuilding Spark as Service in Cloud
Building Spark as Service in Cloud
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
 
CaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use CasesCaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use Cases
 
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
Learning postgresql
Learning postgresqlLearning postgresql
Learning postgresql
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They Work
 
How to teach an elephant to rock'n'roll
How to teach an elephant to rock'n'rollHow to teach an elephant to rock'n'roll
How to teach an elephant to rock'n'roll
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
 
Introduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizerIntroduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizer
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
Evolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best PracticesEvolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best Practices
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
MySQL shell and It's utilities - Praveen GR (Mydbops Team)
MySQL shell and It's utilities - Praveen GR (Mydbops Team)MySQL shell and It's utilities - Praveen GR (Mydbops Team)
MySQL shell and It's utilities - Praveen GR (Mydbops Team)
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
 

Similar to Spark performance tuning eng

Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
Vadim Semenov
 
Speed it up and Spark it up at Intel
Speed it up and Spark it up at IntelSpeed it up and Spark it up at Intel
Speed it up and Spark it up at Intel
DataWorks Summit
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
DataStax Academy
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Databricks
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
Dat Tran
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
wangzhonnew
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
StampedeCon
 
Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...
Adrian Huang
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
DataWorks Summit/Hadoop Summit
 
Fatkulin presentation
Fatkulin presentationFatkulin presentation
Fatkulin presentationEnkitec
 
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
DataStax Academy
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0
Santosh Kangane
 
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
Philippe Fierens
 
Installation of application server 10g in red hat 4
Installation of application server 10g in red hat 4Installation of application server 10g in red hat 4
Installation of application server 10g in red hat 4
uzzzle
 
De Java 8 a Java 17
De Java 8 a Java 17De Java 8 a Java 17
De Java 8 a Java 17
Víctor Leonel Orozco López
 
Managing ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkManaging ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache Spark
Databricks
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
Patrick McFadin
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark Jobs
Samir Bessalah
 
A Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoA Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoEnkitec
 

Similar to Spark performance tuning eng (20)

Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
 
Speed it up and Spark it up at Intel
Speed it up and Spark it up at IntelSpeed it up and Spark it up at Intel
Speed it up and Spark it up at Intel
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
 
Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Fatkulin presentation
Fatkulin presentationFatkulin presentation
Fatkulin presentation
 
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0
 
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
 
Installation of application server 10g in red hat 4
Installation of application server 10g in red hat 4Installation of application server 10g in red hat 4
Installation of application server 10g in red hat 4
 
De Java 8 a Java 17
De Java 8 a Java 17De Java 8 a Java 17
De Java 8 a Java 17
 
Managing ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkManaging ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache Spark
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
 
Install oracle11gr2 rhel5
Install oracle11gr2 rhel5Install oracle11gr2 rhel5
Install oracle11gr2 rhel5
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark Jobs
 
A Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoA Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl Arao
 

Recently uploaded

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
ViralQR
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 

Recently uploaded (20)

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 

Spark performance tuning eng

  • 2. Contents 1. Overview 2. Service Configuration 3. Package & Test 2
  • 3. 1. Overview 3 Object • To check the processing performance of RDB and HDFS data transfer using Spark Premise Schedule Tasks 1) data(15 million) transfer 2) data(100 million) transfer • Performance differences are possible depening on network conditions • 2018.11.22(Thur) ~ 11.23(Fri) (completed) (completed)
  • 4. Service configuration 4 1. System Configuration 2. Data Configuration
  • 5. According to the configuration below, Data tranferation is performed 5 Load Processing Unload Source Data Spark Output Data Sellout Data Transfer Sellout Data Oracle DB Spark, Python Oracle DB 1. System Configuration S/W H/W S/W H/W S/W H/W HDFS
  • 6. 6 1. System Configuration 6 Hadoop Name Node Spark Master Hive Master Resource Manger No Lv1 Lv2 Version Contents 1 Oracle Linux 7.3 OS 2 Hadoop 2.7.6 Distributed Storage 3 Spark 2.0.2 Distributed Processing 4 Hive 2.3.3 Supprt SQL (Master, Only master) 5 MariaDB 10.2.11 RDB (Master, Only master) 6 Oracle Client 18.3.0.0. 0 Oracle DB client Maria DB Hadoop DataNode Spark Worker NodeManager Hadoop DataNode Spark Worker NodeManager Hadoop DataNode Spark Worker NodeManager Service configurationHadoop Ecosystem Secondary Name Node p-master hadoop1 hadoop2 hadoop3
  • 7. 1. System Configuration (Process) 7 Input DB Output DB 192.168.110.112 192.168.110.111 Hadoop Name Node Spark Master Hive Master Resource Manger Maria DB Hadoop DataNode Spark Worker NodeManager Hadoop DataNode Spark Worker NodeManager Hadoop DataNode Spark Worker NodeManager Hadoop Ecosystem Secondary Name Node p-master hadoop1 hadoop2 hadoop3 192.168.110.117 192.168.110.118 192.168.110.119 192.168.110.120
  • 8. 2. Data Configuration 8 Define the outbound and inbound data No InterfaceID Content System Type Count Periods Column cnt Comments 1 IB-001 Sellout Dev System RDB 100 million - 17 2 IB-002 Sellout Dev System RDB 15 million 17 3 IB-003 Parameter Dev System RDB 2 5 inbound No InterfaceID Content System Type Count Periods Column cnt Comments 1 OB-001 Sellout Op System RDB 100 million - 17 2 OB-002 Sellout Op System RDB 15 million 17 outbound
  • 9. Package & Test 9 1. Parameter Mgmt 2. Source Implementation 3. Package 4. Test
  • 10. 1. Parameter Mgmt 10 Implement the parameter map by selecting only necessary data information.(code flexibility)
  • 12. 3. Package 12 Maven: Compile production code and package with jar file & manage compatible external modules Compile & Package Manage plug-ins Manage dependencies
  • 13. 4. Test 15 million (ORACLE → SPARK → ORACLE) spark-env.sh spark-defaults.conf Fix Configure spark-submit --class com.spark.c1_dataLoadWrite.s9_Meddata sparkProgramming- spark-1.0.jar > test.log & If the count of core shall be limited spark.cores.max=10
  • 14. 4. Test 15 million (ORACLE → SPARK → ORACLE) Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 3 Executor-core 2 Executor-memory 20 Total-core 18 (3 * 3 * 2) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 3 * 20) Cluster * Ex-count * Ex-core spark-defaults.conf 6.6 minutes
  • 15. 4. Test 15 million (ORACLE → SPARK → ORACLE) Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 1 Executor-core 7 Executor-memory 60 Total-core 21(3 * 7) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 60) Cluster * Ex-count * Ex-core spark-defaults.conf 6.7 minutes
  • 16. 4. Test 15 million (ORACLE → SPARK → HDFS) Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 1 Executor-core 7 Executor-memory 60 Total-core 21(3 * 7) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 60) Cluster * Ex-count * Ex-core spark-defaults.conf 5.7 minutes
  • 17. 17 4. Test 100 million (ORACLE → SPARK → ORACLE) Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 3 Executor-core 2 Executor-memory 20 Total-core 18 (3 * 3 * 2) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 3 * 20) Cluster * Ex-count * Ex-core spark-defaults.conf 44 minutes
  • 18. 4. Test 100 million (ORACLE → SPARK → ORACLE) Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 1 Executor-core 7 Executor-memory 60 Total-core 21(3 * 7) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 60) Cluster * Ex-count * Ex-core spark-defaults.conf 47 minutes
  • 19. 4. Test 100 million (ORACLE → SPARK → HDFS) 51 minutes Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 1 Executor-core 7 Executor-memory 60 Total-core 21(3 * 7) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 60) Cluster * Ex-count * Ex-core spark-defaults.conf
  • 21. Conclusion • Generating a large number of executors from the same resource can help improve performance • Setting the numpartition is important when manipulating data Next time, performance check is required according to the number of partitions Div 15 millions data 100 millions data Oracle <> Oracle (1Executor) 6.7 min 47 min Oracle <> Oracle (3Executor) 6.6 min 44 min Oracle <> HDFS (1Executor) 5.7 min 51 min
  • 22. Conclusion • Generating a large number of executors from the same resource can help improve performance • Setting the numpartition is important when manipulating data Next time, performance check is required according to the number of partitions