Spark performance tuning
for Apache Kylin
Shaofeng Shi
Background
• Kylin 2.0 starts to use Spark as the Cube build engine
• Proven to improve build performance by 2x to 3x
• Requires some Spark tuning experience.
• Kylin 2.5 will move more jobs onto Spark
• Convert to HFile (KYLIN-3427)
• Merge segments (KYLIN-3441)
• Merge dictionaries on YARN (KYLIN-3471)
• Fact distinct columns in Spark (KYLIN-3442)
• In the future, the Spark engine will replace MapReduce (MR)
Cubing in Spark
Agenda
• Why Spark
• Spark on YARN Model
• Spark Executor Memory Model
• Executor/Driver memory/core configuration
• Dynamic Resource Allocation
• RDD Partitioning
• Shuffle
• Compression
• DFS Replication
• Deploy Modes
• Other Tips
Why Apache Spark
• Fast, memory centric distributed computing framework
• Flexible API
• Spark Core
• DataFrames, Datasets and SparkSQL
• Spark Streaming
• MLlib/SparkR
• Language support
• Java, Scala, Python, R
• Deployment options:
• Standalone/YARN/Mesos/Kubernetes (Spark 2.3+)
Spark on YARN memory model
• Overhead memory
• The JVM needs extra memory to run, beyond the executor memory
• By default: executor memory * 0.1, minimum 384 MB;
• Executor memory
Spark on YARN memory model (cont.)
• If you allocate 4 GB to an executor, Spark will request:
• 4 * 0.1 + 4 = 4.4 GB as the container memory from YARN
• From our observation, the default factor (0.1) is a little small for Kylin; executors are very likely to be killed.
• Give 1 GB or more as overhead memory
• spark.yarn.executor.memoryOverhead=1024
• From Kylin 2.5, 1 GB of overhead is requested by default.
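The container request described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the slide; the function name `yarn_container_mb` and its parameters are hypothetical, not part of Spark or Kylin.

```python
def yarn_container_mb(executor_memory_mb, overhead_mb=None,
                      overhead_factor=0.10, min_overhead_mb=384):
    """Memory (MB) requested from YARN for one executor container."""
    if overhead_mb is None:
        # Default rule from the slide: 10% of executor memory, at least 384 MB.
        overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead_mb


default_request = yarn_container_mb(4096)      # default overhead: 409 MB
tuned_request = yarn_container_mb(4096, 1024)  # explicit 1 GB overhead
small_executor = yarn_container_mb(1024)       # the 384 MB minimum applies

print(default_request, tuned_request, small_executor)
```

With a 4 GB executor the default leaves only about 0.4 GB of headroom, which is why the slide suggests setting the overhead explicitly.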
Spark executor memory model
• Reserved memory
• 300 MB, set aside to avoid OOM
• Spark memory
• spark.memory.fraction=0.6
• Shared by storage (cache) and execution (shuffle, sort)
• spark.memory.storageFraction=0.5: cache and execution each get half
• User memory
• The remainder is for user code execution
Spark executor memory model (cont.)
• An example:
• Given an executor with 4 GB of memory, its maximum storage/execution memory is:
• (4096 - 300) * 0.6 ≈ 2278 MB (about 2.2 GB)
• If the executor needs to run computation (sorting/shuffling), the space for RDD cache can be shrunk to:
• 2278 MB * 0.5 ≈ 1139 MB (about 1.1 GB)
• User memory:
• (4096 - 300) * 0.4 ≈ 1518 MB (about 1.5 GB)
• When you have big dictionaries, consider allocating more to user memory
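The example can be reproduced with a small Python helper. It is a sketch of the arithmetic only, assuming the heap size equals the configured executor memory; the function and key names are made up for illustration.

```python
RESERVED_MB = 300  # reserved memory, as stated on the slide


def executor_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Split an executor heap (MB) into the regions described above."""
    usable = heap_mb - RESERVED_MB
    spark_mb = usable * memory_fraction
    return {
        "spark_mb": spark_mb,                                 # storage + execution
        "storage_floor_mb": spark_mb * storage_fraction,      # cache can shrink to this
        "execution_share_mb": spark_mb * (1 - storage_fraction),
        "user_mb": usable * (1 - memory_fraction),            # user code, e.g. dictionaries
    }


regions = executor_memory_regions(4096)
for name, mb in regions.items():
    print(f"{name}: {mb:.1f} MB")
```

Calling `executor_memory_regions(4096, memory_fraction=0.5)` shows the trade-off in the last bullet: lowering `spark.memory.fraction` moves memory from the Spark region to user memory.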
Executor memory/core configuration
• Check how much memory and how many cores are available in your Hadoop cluster
• To maximize resource utilization, use a similar memory-to-core ratio for Spark.
• For example, if a cluster has 100 cores and 500 GB of memory, you can allocate 1 core and 5 GB (1 GB for overhead, 4 GB for the executor) to each executor instance.
• If you use multiple cores in one executor, increase the memory accordingly
• e.g., 2 cores + 10 GB per instance.
• No more than 40 GB of memory per instance
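A rough way to derive these numbers is to keep each executor at the cluster's memory-to-core ratio. The sketch below assumes a fixed 1 GB overhead per executor regardless of core count, which is an assumption for illustration; `executor_plan` is a hypothetical helper.

```python
def executor_plan(cluster_cores, cluster_memory_gb, cores_per_executor=1, overhead_gb=1):
    """Size one executor so it matches the cluster's memory-to-core ratio."""
    gb_per_core = cluster_memory_gb / cluster_cores
    container_gb = gb_per_core * cores_per_executor
    executor_memory_gb = container_gb - overhead_gb
    if executor_memory_gb <= 0:
        raise ValueError("overhead leaves no memory for the executor heap")
    return {
        "container_gb": container_gb,              # total requested per instance
        "executor_memory_gb": executor_memory_gb,  # value for spark.executor.memory
        "max_executors": cluster_cores // cores_per_executor,
    }


one_core = executor_plan(100, 500)                        # 1 core per executor
two_cores = executor_plan(100, 500, cores_per_executor=2)  # 2 cores per executor
print(one_core)
print(two_cores)
```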
Driver memory configuration
• Kylin does not collect data to the driver, so you can configure fewer resources for the driver
• spark.driver.memory=2g
• spark.driver.cores=1
More instances with fewer cores, or fewer instances with more cores?
• Number of concurrently running Spark tasks = number of instances * cores per instance
• Both can achieve similar parallelism
• With more cores in one executor, tasks can share references in the same JVM
• Share big objects like dictionaries
• With Spark dynamic resource allocation, use 1 core per instance.
Dynamic resource allocation
• Dynamic allocation can improve resource utilization
• Not enabled by default
Dynamic resource allocation (cont.)
• Static allocation does not fit Kylin well.
• Cubing is done layer by layer; each layer's size is different
• Workload is unbalanced: small -> medium -> big -> extremely big -> small -> tiny
• Dynamic resource allocation (DRA) is highly recommended.
• With DRA enabled, give each executor 1 core.
RDD partitioning
• An RDD partition is similar to a file split in MapReduce;
• Spark prefers many small partitions over a few big ones
• Kylin splits partitions by estimated file size (after aggregation), by default 1 partition per 10 MB:
• kylin.engine.spark.rdd-partition-cut-mb=10
• The real size may vary, as the estimation might be inaccurate
• This may affect performance greatly!
• Min/max partition cap:
• kylin.engine.spark.min-partition=1
• kylin.engine.spark.max-partition=5000
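How the three settings interact can be sketched as a simple clamp. This is an approximation for reasoning about the parameters, not Kylin's actual code; its exact rounding may differ, and `estimate_partitions` is a hypothetical name.

```python
import math


def estimate_partitions(estimated_size_mb, cut_mb=10, min_partition=1, max_partition=5000):
    """Approximate partition count: one partition per cut_mb, clamped to [min, max]."""
    raw = math.ceil(estimated_size_mb / cut_mb)
    return max(min_partition, min(max_partition, raw))


print(estimate_partitions(2048))                 # 2 GB estimate with the defaults
print(estimate_partitions(100_000))              # huge estimate hits the max cap
print(estimate_partitions(100_000, cut_mb=100))  # larger cut size, fewer partitions
print(estimate_partitions(3))                    # tiny estimate falls back to the min
```

Because the count is driven by the estimated size, an overestimate inflates the partition number directly, which is why the cut size and the caps are worth tuning.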
Partition number is important
• When the partition number is lower than it should be
• Less parallelism, low resource utilization
• Executor OOM (especially when using "mapPartitions")
• When the partition number is much higher than it should be
• Shuffle is slow
• Many small files (fragments) are generated
• Pay attention if you observe that a job has > 1000 partitions
Partition number can be far off in certain cases
• If your cube has Count Distinct or TopN measures, the estimated size may be far bigger than the actual size, causing too many partitions.
• Tune the parameter manually, at the Cube level, according to the actual Cuboid file size:
• kylin.engine.spark.rdd-partition-cut-mb=100
• Or, reduce the max. partition number:
• kylin.engine.spark.max-partition=500
• KYLIN-3453 Make the size estimation more accurate
• KYLIN-3472 TopN in Spark is slow
Shuffle
• Spark shuffle is similar to MapReduce shuffle
• The mapper's output is partitioned, and each partition is sent only to its reducer;
• The reducer buffers data in memory, sorts, aggregates and then reduces.
• But there are differences
• Spark sorts the data on the map side, but doesn't merge it on the reduce side;
• If the user needs the data sorted, they call "sortByKey" or similar, and Spark re-sorts the data. The re-sort is not aware that the map output is already sorted.
• The sorting is done in memory, spilling to disk if memory is full
Shuffle (cont.)
• Shuffle spill
• Spill memory = (executor memory - 300 MB) * spark.memory.fraction * (1 - spark.memory.storageFraction)
• Spilled files are not merged until the data is requested; they are then merged on the fly
• If you need the data sorted, Spark is slower than MR.
• SPARK-2926 tries to introduce MR-style merge sort.
• Kylin's "Convert to HFile" step needs the values sorted. Spark may spend 2x the time on this step compared with MR.
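The difference between the two approaches can be illustrated with a toy example in plain Python. It is not Spark or MapReduce code: `heapq.merge` stands in for an MR-style merge of already-sorted map outputs, and `sorted` stands in for a full re-sort that ignores the existing order. The sample data is invented.

```python
import heapq

# Three map-side outputs, each already sorted by key.
map_outputs = [
    [("a", 1), ("d", 4), ("g", 7)],
    [("b", 2), ("e", 5), ("h", 8)],
    [("c", 3), ("f", 6), ("i", 9)],
]

# MR-style: k-way merge that exploits the existing order of each run.
merged = list(heapq.merge(*map_outputs))

# Re-sort style: concatenate everything, then sort from scratch.
resorted = sorted(record for run in map_outputs for record in run)

print(merged == resorted)
```

Both produce the same result; the merge does less work because it only compares the heads of the sorted runs, which is the saving an MR-style merge sort would bring.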
Compression
• Compression can significantly reduce IO
• By default, Kylin enables compression for MR in `conf/kylin_job_conf.xml`, but not for Spark
• If your Hadoop cluster does not enable compression, you may see files 2x the size when switching from the MR to the Spark engine
• Manually enable compression by adding:
• kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
• kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
• Kylin 2.5 will enable compression by default.
Compression (cont.)
• 40% performance improvement + 50% disk saving
No compression vs compression (Merge segments on Spark)
DFS replication
• Kylin keeps 2 replicas of intermediate files, configured in `kylin_job_conf.xml` and `kylin_hive_conf.xml`
• But this does not take effect for Spark
• Manually add:
• kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
• Saves 1/3 of the disk space (compared with the HDFS default of 3 replicas)
• Kylin 2.5 will enable this by default.
Deployment modes
• Spark on YARN has two deploy modes
• Cluster: driver runs inside app master
• Client: driver runs in client process
• For development/debugging, use `client` mode;
• Starts fast, with detailed log messages printed on the console
• Occupies memory on the client node
• In production deployments, use `cluster` mode.
• Kylin 2.5 will use `cluster` mode by default
Other tips
• Pre-upload YARN archive
• Avoid uploading big files repeatedly
• Accelerate job startup
• Run the Spark history server for troubleshooting
• Makes it much easier to identify bottlenecks
• https://kylin.apache.org/docs/tutorial/cube_spark.html
Recommended configurations (Kylin 2.2-2.4, Spark 2.1)
• kylin.engine.spark-conf.spark.submit.deployMode=cluster
• kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
• kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
• kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
• kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
• kylin.engine.spark-conf.spark.driver.memory=2G
• kylin.engine.spark-conf.spark.executor.memory=4G
• kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
• kylin.engine.spark-conf.spark.executor.cores=1
• kylin.engine.spark-conf.spark.network.timeout=600
• kylin.engine.spark-conf.spark.yarn.archive=hdfs://nameservice/kylin/spark/spark-libs.jar
• kylin.engine.spark-conf.spark.shuffle.service.enabled=true
• kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
• kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
• kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
• kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
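Every entry above is an ordinary Spark property with the `kylin.engine.spark-conf.` prefix in front. A small Python sketch of that naming convention follows; `to_kylin_properties` is a hypothetical helper for generating such lines, not a Kylin tool, and only a few of the settings are repeated here.

```python
KYLIN_SPARK_PREFIX = "kylin.engine.spark-conf."


def to_kylin_properties(spark_conf):
    """Render Spark settings as Kylin property lines with the spark-conf prefix."""
    return [f"{KYLIN_SPARK_PREFIX}{key}={value}" for key, value in spark_conf.items()]


spark_conf = {
    "spark.executor.memory": "4G",
    "spark.yarn.executor.memoryOverhead": "1024",
    "spark.executor.cores": "1",
    "spark.dynamicAllocation.enabled": "true",
}

lines = to_kylin_properties(spark_conf)
print("\n".join(lines))
```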
Key takeaway
• Kylin will move more jobs to Spark
• Mastering Spark tuning will help you run Kylin better
• Kylin aims to provide an out-of-the-box user experience with Spark, as it does with MR.
We are hiring
Apache Kylin
dev@kylin.apache.org
Kyligence Inc
info@kyligence.io

Automate your Kamailio Test Calls - Kamailio World 2024
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 

Spark tuning in Apache Kylin

  • 1. Spark performance tuning for Apache Kylin Shaofeng Shi
  • 2. Background • Kylin 2.0 started using Spark as the Cube build engine • It has been shown to improve build performance by 2x to 3x • Spark tuning experience is needed • Kylin 2.5 will move more jobs onto Spark • Convert to HFile (KYLIN-3427) • Merge segments (KYLIN-3441) • Merge dictionaries on YARN (KYLIN-3471) • Fact distinct columns in Spark (KYLIN-3442) • In the future, the Spark engine will replace MR
  • 4. Agenda • Why Spark • Spark on YARN Model • Spark Executor Memory Model • Executor/Driver memory/core configuration • Dynamic Resource Allocation • RDD Partitioning • Shuffle • Compression • DFS Replication • Deploy Modes • Other Tips
  • 5. Why Apache Spark • Fast, memory-centric distributed computing framework • Flexible API • Spark Core • DataFrames, Datasets and SparkSQL • Spark Streaming • MLlib/SparkR • Language support • Java, Scala, Python, R • Deployment options: • Standalone/YARN/Mesos/Kubernetes (Spark 2.3+)
  • 6. Spark on YARN memory model • Overhead memory • The JVM needs memory to run • By default: executor memory * 0.1, minimum 384 MB • Executor memory
  • 7. Spark on YARN memory model (cont.) • If you allocate 4 GB to an executor, Spark will request: • 4 * 0.1 + 4 = 4.4 GB as the container memory from YARN • From our observation, the default factor (0.1) is a little small for Kylin; the executor is very likely to be killed. • Give 1 GB or more as overhead memory • spark.yarn.executor.memoryOverhead=1024 • From Kylin 2.5, 1 GB is requested for overhead by default.
  • 8. Spark executor memory model • Reserved memory • 300 MB, just for avoiding OOM • Spark memory • spark.memory.fraction=0.6 • For both storage/cache and execution (shuffle, sort) • spark.memory.storageFraction=0.5: cache and execution get half each. • User memory • The remainder is for user code execution
  • 9. Spark executor memory model (cont.) • An example: • Given an executor with 4 GB memory, its max. storage/execution memory is: • (4096 – 300) * 0.6 = 2.27 GB • If the executor needs to run computation (sorting/shuffling), the space for RDD cache can shrink to: • 2.27 GB * 0.5 = 1.13 GB • User memory: • (4096 – 300) * 0.4 = 1.52 GB • When you have big dictionaries, consider allocating more to user memory
  • 10. Executor memory/core configuration • Check how much memory and how many cores are available in your Hadoop cluster • To maximize resource utilization, use a similar ratio for Spark. • For example, a cluster has 100 cores and 500 GB memory. You can allocate 1 core and 5 GB (1 GB for overhead, 4 GB for executor) to each executor instance. • If you use multiple cores in one executor, increase the memory accordingly • e.g., 2 cores + 10 GB per instance. • No more than 40 GB memory per instance
  • 11. Driver memory configuration • Kylin does not collect data to the driver, so you can configure fewer resources for the driver • spark.driver.memory=2g • spark.driver.cores=1
  • 12. More instances with fewer cores, or fewer instances with more cores? • Spark active task number = instances * (cores / instance) • Both can achieve similar parallelism • With more cores in one executor, tasks can share references in the same JVM • Share big objects like dictionaries • With Spark dynamic resource allocation, use 1 core per instance.
  • 13. Dynamic resource allocation • Dynamic allocation can improve resource utilization • Not enabled by default
  • 14. Dynamic resource allocation • Static allocation does not fit Kylin. • Cubing is by layer; each layer's size is different • Workload is unbalanced: small -> medium -> big -> extremely big -> small -> tiny • DRA is highly recommended. • With DRA enabled, each executor has 1 core.
  • 15. RDD partitioning • An RDD partition is similar to a file split in MapReduce • Spark prefers many small partitions over a few big ones • Kylin splits partitions by estimated file size (after aggregation), by default 1 partition per 10 MB: • kylin.engine.spark.rdd-partition-cut-mb=10 • The real size may vary as the estimation might be inaccurate • This may affect the performance greatly! • Min/max partition cap: • kylin.engine.spark.min-partition=1 • kylin.engine.spark.max-partition=5000
  • 16. Partition number is important • When the partition number is lower than normal • Less parallelism, low resource utilization • Executor OOM (especially when using "mapPartitions") • When the partition number is much higher than normal • Shuffle is slow • Many small fragments are generated • Pay attention if you observe a job with > 1000 partitions
  • 17. Partition number can go wild in certain cases • If your cube has Count Distinct or TopN measures, the estimated size may be far bigger than the actual size, causing too many partitions. • Tune the parameter manually, at Cube level, according to the actual Cuboid file size: • kylin.engine.spark.rdd-partition-cut-mb=100 • Or, reduce the max. partition number: • kylin.engine.spark.max-partition=500 • KYLIN-3453 Make the size estimation more accurate • KYLIN-3472 TopN in Spark is slow
  • 18. Shuffle • Spark shuffle is similar to MapReduce shuffle • Partition the mapper's output and send each partition only to its reducer • The reducer buffers data in memory, sorts, aggregates and then reduces. • But with differences • Spark sorts the data on the map side, but doesn't merge it on the reduce side • If the user needs the data sorted, calling "sortByKey" or similar makes Spark re-sort the data. The re-sort is not aware that the map output is already sorted. • Sorting is in memory, spilling to disk if memory is full
  • 19. Shuffle (cont.) • Shuffle spill • Spill memory = (executorMemory – 300M) * spark.memory.fraction * (1 – spark.memory.storageFraction) • Spilled files won't be merged until the data is requested; merging then happens on the fly • If you need the data sorted, Spark is slower than MR. • SPARK-2926 tries to introduce MR-style merge sort. • Kylin's "Convert to HFile" step needs the values sorted. Spark may spend 2x the time of MR on this step.
  • 20. Compression • Compression can significantly reduce IO • By default Kylin enables compression for MR in `conf/kylin_job_conf.xml`, but not for Spark • If your Hadoop cluster does not enable compression, you may see files twice as large when switching from the MR engine to the Spark engine • Manually enable compression by adding: • kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true • kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec • Kylin 2.5 will enable compression by default.
  • 21. Compression (cont.) • 40% performance improvement + 50% disk saving • Chart: no compression vs. compression (Merge segments on Spark)
  • 22. DFS replication • Kylin keeps 2 replicas of intermediate files, configured in `kylin_job_conf.xml` and `kylin_hive_conf.xml` • But this does not apply to Spark • Manually add: • kylin.engine.spark-conf.spark.hadoop.dfs.replication=2 • Saves 1/3 disk space • Kylin 2.5 will enable this by default.
  • 23. Deployment modes • Spark on YARN has two deploy modes • Cluster: the driver runs inside the application master • Client: the driver runs in the client process • For development/debugging, use `client` mode • Starts fast, with detailed log messages printed on the console • Occupies client node memory • In production, use `cluster` mode. • Kylin 2.5 will use `cluster` mode by default
  • 24. Other tips • Pre-upload the YARN archive • Avoids uploading big files repeatedly • Accelerates job startup • Run the Spark history server for troubleshooting • Makes it much easier to identify bottlenecks • https://kylin.apache.org/docs/tutorial/cube_spark.html
  • 25. Recommended configurations (Kylin 2.2-2.4, Spark 2.1) • kylin.engine.spark-conf.spark.submit.deployMode=cluster • kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true • kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1 • kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000 • kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300 • kylin.engine.spark-conf.spark.driver.memory=2G • kylin.engine.spark-conf.spark.executor.memory=4G • kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024 • kylin.engine.spark-conf.spark.executor.cores=1 • kylin.engine.spark-conf.spark.network.timeout=600 • kylin.engine.spark-conf.spark.yarn.archive=hdfs://nameservice/kylin/spark/spark-libs.jar • kylin.engine.spark-conf.spark.shuffle.service.enabled=true • kylin.engine.spark-conf.spark.hadoop.dfs.replication=2 • kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true • kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec • kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
  • 26. Key takeaway • Kylin will move more jobs to Spark • Mastering Spark tuning will help you run Kylin better • Kylin aims to provide an out-of-the-box Spark experience, as it does for MR.
  • 27. We are hiring • Apache Kylin: dev@kylin.apache.org • Kyligence Inc: info@kyligence.io
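
The partition sizing rule from slides 15 to 17 can be sketched in Python. This is only an illustration, not Kylin's code: the function name `estimate_partitions` and the round-up step are assumptions, and Kylin's actual rounding may differ. Only the three defaults (10 MB per partition, minimum 1, maximum 5000) come from the slides.

```python
import math


def estimate_partitions(estimated_size_mb, cut_mb=10,
                        min_partition=1, max_partition=5000):
    """Hypothetical helper: one partition per `cut_mb` of estimated size,
    clamped to [min_partition, max_partition].

    The defaults mirror kylin.engine.spark.rdd-partition-cut-mb,
    kylin.engine.spark.min-partition and kylin.engine.spark.max-partition.
    Rounding up is an assumption made for this sketch.
    """
    if cut_mb <= 0:
        raise ValueError("cut_mb must be positive")
    raw = math.ceil(estimated_size_mb / cut_mb)
    return max(min_partition, min(max_partition, raw))


if __name__ == "__main__":
    # A layer estimated at 20,000 MB, with the default and a larger cut size.
    print(estimate_partitions(20_000))
    print(estimate_partitions(20_000, cut_mb=100))
```

Under these assumptions, a layer estimated at 20,000 MB gets 2000 partitions with the default cut size, which is above the 1000-partition warning level in slide 16; raising the cut size to 100 MB brings it down to 200.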

Editor's Notes

  1. spark.shuffle.memoryFraction and spark.storage.memoryFraction are deprecated. They are replaced by spark.memory.fraction. See https://spark.apache.org/docs/2.1.2/configuration.html and https://spark.apache.org/docs/2.1.2/running-on-yarn.html
  2. https://0x0fff.com/spark-memory-management/
  3. https://0x0fff.com/spark-architecture-shuffle/ https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md
  4. https://0x0fff.com/spark-architecture-shuffle/ https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md https://issues.apache.org/jira/browse/SPARK-2926 In one case, MR (Convert to HFile) took 6 min while Spark took 11 min; even with more Spark memory, performance did not improve.
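
The memory arithmetic from slides 6 to 9 and the spill formula from slide 19 can be checked with a short Python sketch. The names `ExecutorMemory` and `executor_memory` are hypothetical and exist only for this illustration; the constants (300 MB reserved, 0.1 overhead factor with a 384 MB floor, `spark.memory.fraction=0.6`, `spark.memory.storageFraction=0.5`) are the values quoted in the slides. It follows the slides' simplified formulas and does not model how Spark shifts memory between storage and execution at runtime.

```python
from dataclasses import dataclass

RESERVED_MB = 300          # reserved memory, per slide 8
OVERHEAD_FACTOR = 0.10     # default overhead factor, per slide 6
OVERHEAD_MIN_MB = 384      # default overhead floor, per slide 6


@dataclass
class ExecutorMemory:
    container_mb: float   # executor memory + overhead, requested from YARN
    spark_mb: float       # region shared by storage and execution
    storage_mb: float     # share of that region set aside for cache
    execution_mb: float   # remainder of that region (shuffle, sort)
    user_mb: float        # left for user code, e.g. dictionaries


def executor_memory(executor_mb, overhead_mb=None,
                    fraction=0.6, storage_fraction=0.5):
    """Break an executor's memory down using the formulas from the slides."""
    if overhead_mb is None:
        # Default overhead: 10% of executor memory, at least 384 MB.
        overhead_mb = max(executor_mb * OVERHEAD_FACTOR, OVERHEAD_MIN_MB)
    usable_mb = executor_mb - RESERVED_MB
    spark_mb = usable_mb * fraction
    storage_mb = spark_mb * storage_fraction
    return ExecutorMemory(
        container_mb=executor_mb + overhead_mb,
        spark_mb=spark_mb,
        storage_mb=storage_mb,
        execution_mb=spark_mb - storage_mb,
        user_mb=usable_mb * (1 - fraction),
    )


if __name__ == "__main__":
    # 4 GB executor with the 1 GB overhead recommended for Kylin.
    print(executor_memory(4096, overhead_mb=1024))
```

For a 4 GB executor with 1024 MB of overhead, this gives a 5120 MB container, about 2278 MB of Spark memory split evenly between storage and execution, and about 1518 MB of user memory, in line with the example in slide 9.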