Fine Tuning and Enhancing Performance of Apache Spark Jobs

  1. Fine Tuning and Enhancing Performance of Apache Spark Jobs Blake Becerra, Kira Lindke, Kaushik Tadikonda
  2. Our Setup ▪ Data Validation Tool for ETL ▪ Millions of comparisons and aggregations ▪ One of the larger datasets initially took 4+ hours and was unstable ▪ Challenge: improve reliability and performance ▪ After months of research and tuning, the same application takes 35 minutes
  3. Configuring Cluster ▪ Test changes to ▪ spark.driver.cores ▪ spark.driver.memory ▪ executor-memory ▪ executor-cores ▪ spark.cores.max ▪ Reserve cores (~1 core per node) for daemons ▪ num_executors = total_cores / num_cores ▪ num_partitions ▪ Too much memory per executor can result in excessive GC delays ▪ Too little memory can lose the benefits of running multiple tasks in a single JVM ▪ Look at stats (network, CPU, memory, etc.) and tweak to improve performance
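A minimal sketch of how such settings might be expressed when building a session; the memory, core, and partition values below are illustrative placeholders, not recommendations from the talk:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("data-validation")                     // hypothetical app name
      .config("spark.driver.cores", "2")              // illustrative values only
      .config("spark.driver.memory", "8g")
      .config("spark.executor.memory", "16g")         // too much per executor -> long GC pauses
      .config("spark.executor.cores", "5")            // too little memory per JVM loses multi-task benefits
      .config("spark.cores.max", "100")
      .config("spark.sql.shuffle.partitions", "400")  // controls num_partitions for shuffles
      .getOrCreate()

Note that the spark.driver.* settings usually have to be supplied at launch time (for example through spark-submit) rather than in code, because the driver JVM is already running by the time the session is built.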
  4. Skew ▪ Can severely degrade performance ▪ Extreme imbalance of work in the cluster ▪ Tasks within a stage (like a join) take uneven amounts of time to finish ▪ How to check ▪ Spark UI shows the job waiting on only some of the tasks ▪ Look for large variances in memory usage within a job (primarily works at the beginning of the job during data ingestion – otherwise it can be misleading) ▪ Executors missing heartbeats ▪ Check partition sizes of the RDD while debugging to confirm
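One way to confirm skew while debugging is to count the rows in each partition; a sketch, assuming a DataFrame df has already been loaded:

    // Rows per partition: a large variance between partitions indicates skew
    val partitionSizes = df.rdd
      .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
      .collect()

    // Print the ten largest partitions
    partitionSizes.sortBy(-_._2).take(10).foreach { case (idx, n) =>
      println(s"partition $idx: $n rows")
    }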
  5. Example
  6. Example – skew in ingestion of data (uneven partitions vs. even partitions)
  7. Handling skew cont’d ▪ A blind repartition is the most naïve approach, but it is effective ▪ Great for "narrow transformations" ▪ Good for increasing the number of partitions ▪ Use coalesce, not repartition, to decrease partitions ▪ A better approach specifically for joins is covered later
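A brief sketch of the two operations; the DataFrame names and partition counts are arbitrary placeholders:

    // Increase partitions (full shuffle) to spread skewed data evenly
    val evened = skewedDf.repartition(400)

    // Decrease partitions without a full shuffle once data is already balanced
    val compacted = evened.coalesce(50)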
  8. Handling skew - ingestion ▪ Use Spark options to do partitioned reads with JDBC ▪ partitionColumn ▪ lowerBound/upperBound – used to determine the stride ▪ numPartitions – maximum number of partitions ▪ partitionColumn ▪ Ideally is not skewed (such as a primary key) ▪ The stride can still have skew ▪ Possible trick – a mod function ▪ Example of working with a slow JDBC database: ▪ The initial query took ~40 minutes ▪ Partitioned reads took it down to 10 minutes
  9. Example lowerBound: 0 upperBound: 1000 numPartitions: 10 => Stride is equal to 100 and partitions correspond to following queries: SELECT * FROM tableName WHERE partitionModColumn < 100 SELECT * FROM tableName WHERE partitionModColumn >= 100 AND partitionModColumn < 200 ... SELECT * FROM tableName WHERE partitionModColumn >= 900
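A sketch of the same read expressed with the DataFrame JDBC options, mirroring the stride example above; the connection string and table name are placeholders:

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")   // placeholder connection string
      .option("dbtable", "tableName")
      .option("partitionColumn", "partitionModColumn")   // ideally not skewed, e.g. a mod of the primary key
      .option("lowerBound", "0")
      .option("upperBound", "1000")
      .option("numPartitions", "10")                     // => stride of 100, as in the queries above
      .load()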
  10. Also works with partitioned data ingestion ▪ Example – reading from object storage, files, etc. ▪ Reads in with the same skew ▪ Read in, then repartition to get an even distribution
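For file or object-storage sources the same idea is a read followed by an explicit repartition; the path and partition count are illustrative:

    val raw = spark.read.parquet("s3://bucket/path/")   // hypothetical path; inherits the writer's skew
    val even = raw.repartition(spark.sparkContext.defaultParallelism)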
  11. Cache/Persist ▪ Reuse a DataFrame with transformations ▪ Unpersist when done ▪ Without cache, DataFrame is built from scratch each time ▪ Don't over persist ▪ Worse memory performance ▪ Possible slowdown ▪ GC pressure
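A minimal sketch of reusing a cached DataFrame and releasing it afterwards; the source path and transformations are placeholders:

    import org.apache.spark.sql.functions.col

    val base = spark.read.parquet("s3://bucket/path/")    // hypothetical source
      .filter(col("status") === "ACTIVE")                 // placeholder transformation
      .cache()

    val byRegion  = base.groupBy("region").count()        // first reuse
    val byProduct = base.groupBy("product").count()       // second reuse, no recompute of base

    byRegion.write.parquet("s3://bucket/out/by_region/")
    byProduct.write.parquet("s3://bucket/out/by_product/")

    base.unpersist()                                      // release when done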
  12. Other Performance Improvements ▪ Try seq.par.foreach instead of just seq.foreach ▪ Increases parallelization ▪ Watch for race conditions and non-deterministic results ▪ Use accumulators or synchronization for protection ▪ Avoid UDFs if possible ▪ They deserialize every row to an object ▪ Apply the lambda ▪ Then reserialize it ▪ More garbage generated
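A sketch of submitting independent jobs from parallel driver threads, using an accumulator so the shared counter is not subject to a race; the table names are hypothetical (.par is built into Scala 2.12, the version bundled with these Spark releases; Scala 2.13 needs the scala-parallel-collections module):

    val validated = spark.sparkContext.longAccumulator("rowsValidated")
    val tables = Seq("orders", "customers", "payments")   // hypothetical table names

    // .par runs the closure on multiple driver threads, so several Spark jobs
    // can be in flight at once; the accumulator is safe to update concurrently
    tables.par.foreach { name =>
      val rows = spark.table(name).count()
      validated.add(rows)
    }
    println(s"total rows validated: ${validated.value}")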
  13. Join Optimization ▪ Different join strategies ▪ Data preprocessing - clever filtering ▪ Avoiding unnecessary scans ▪ Locality - use of same partitioner ▪ Salting - skewed join keys
  14. Filter Trick Included in Spark 3.0
  15. Salting – Reduce Skew
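A sketch of salting a skewed join key: the skewed side gets a random salt, the other side is replicated across every salt value, and the join key becomes (key, salt). DataFrame and column names are placeholders:

    import org.apache.spark.sql.functions._

    val numSalts = 16                                      // tune to the degree of skew

    // Spread hot keys on the skewed side across numSalts buckets
    val saltedLarge = largeSkewedDf
      .withColumn("salt", (rand() * numSalts).cast("int"))

    // Replicate each row of the other side once per salt value
    val saltedSmall = smallDf
      .withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))

    val joined = saltedLarge
      .join(saltedSmall, Seq("join_key", "salt"))
      .drop("salt")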
  16. Things to remember ▪ Follow good partitioning strategies ▪ Too few partitions – less parallelism ▪ Too many partitions – scheduling issues ▪ Improve scan efficiency ▪ Try to use same partitioner between DataFrames for join ▪ Skipped stages are good ▪ Caching prevents repeated exchanges where data is re-used
  17. Fair Scheduling ▪ Jobs are scheduled in round-robin fashion ▪ Multiple jobs can run simultaneously if submitted by threads inside an application ▪ Better resource utilization ▪ Harder to debug and makes performance tuning more difficult
  18. Fair Scheduling and Pools ▪ Enabled by setting spark.scheduler.mode to FAIR ▪ Allows jobs to be grouped into pools with prioritization options ▪ Jobs created from threads without a pool defined always go to the “default” pool ▪ Pools, weight and minShare are specified in fairscheduler.xml
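A sketch of enabling fair scheduling and routing jobs from one thread into a named pool; the pool name is hypothetical and its weight/minShare are assumed to be defined in fairscheduler.xml:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.scheduler.allocation.file", "fairscheduler.xml")  // defines pools, weight, minShare
      .getOrCreate()

    // Jobs submitted from this thread go to the "highPriority" pool (hypothetical name);
    // threads that never set a pool fall back to the "default" pool
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "highPriority")
    // ... run queries from this thread ...
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)  // reset to default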
  19. FAIR vs. FIFO scheduling comparison
  20. Serialization
  Java Serializer: ▪ Default for most types ▪ Can work with any class ▪ More flexible ▪ Slower
  Kryo Serializer: ▪ Default for shuffling RDDs with simple types ▪ Significantly faster and more compact ▪ Set your SparkConf to use "org.apache.spark.serializer.KryoSerializer" ▪ Register classes in order to take full advantage
  https://spark.apache.org/docs/latest/tuning.html#data-serialization
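A sketch of switching to Kryo and registering application classes; the case classes are hypothetical placeholders:

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, value: String)   // hypothetical application classes
    case class MyKey(id: Long)

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))

    // Optionally fail fast if a class gets serialized without being registered
    conf.set("spark.kryo.registrationRequired", "true")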
  21. CompressedOops ▪ The JVM flag -XX:+UseCompressedOops allows usage of 4-byte pointers instead of 8-byte ▪ Does not work as the heap size grows beyond 32 GB ▪ A 48 GB heap without compressed pointers is needed to hold the same number of objects as a 32 GB heap
  22. Garbage Collection Tuning ▪ Contention when allocating more objects ▪ Full GC occurring before tasks finish ▪ Too many RDDs cached ▪ Frequent garbage collection ▪ Long time spent in GC ▪ Missed heartbeats
  23. Garbage Collection Tuning
  24. Enable GC Logging
  ParallelGC: -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
  G1GC: -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UnlockDiagnosticVMOptions -XX:+PrintAdaptiveSizePolicy -XX:+G1SummarizeConcMark
  25. ParallelGC (default) ▪ Heap space is divided into Young and Old generations ▪ If minor GC occurs frequently: ▪ Increase Eden and Survivor space ▪ If major GC occurs frequently: ▪ Increase young space if too many objects are being promoted ▪ Increase old space ▪ Decrease spark.memory.fraction ▪ If full GC occurs before tasks finish: ▪ Try to trigger GC earlier and more often so the young space is cleared ▪ Increase memory
  26. G1GC ▪ Breaks the heap into thousands of regions ▪ Recommended if heap sizes are large (> 8 GB) and GC times are long ▪ Lower latency, traded for higher CPU usage for GC bookkeeping ▪ For large-scale applications, these properties help prevent full collections: ▪ -XX:ParallelGCThreads=n ▪ -XX:ConcGCThreads=[n, 2n] ▪ -XX:InitiatingHeapOccupancyPercent=35
  27. -XX:+UseG1GC -XX:ParallelGCThreads=8 -XX:ConcGCThreads=16 -XX:InitiatingHeapOccupancyPercent=35
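These JVM flags are typically passed to the executors through spark.executor.extraJavaOptions; a sketch, using the flag values shown on the slide above (they are examples from the talk, not universal recommendations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -XX:ParallelGCThreads=8 -XX:ConcGCThreads=16 " +
        "-XX:InitiatingHeapOccupancyPercent=35 " +
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")  // GC logging flags from slide 24
      .getOrCreate()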
  28. Takeaways ▪ Performance tuning is iterative ▪ Tuning is case by case ▪ Take advantage of the Spark UI, logs, and available monitoring ▪ Focus on the major slowdowns, not on one particular trick ▪ You can't be perfect
  29. Thank you for your time! Questions?