SORT & JOIN IN SPARK 2.0
Harsha Tenneti
CONTENTS
● Benchmarking
● Sort and Join
● Shuffle Manager
● GC optimisations
Benchmarking
● Joins
● Sort
Joins:

Spark Version | Time for two jobs | Cores | Memory | Data Size
1.6           | 12 min            | 133   | 288 GB | 1 x 12 GB with 12 x 10 MB
2.0           | 11 min            | 70    | 60 GB  | same as above

Sort:

Spark Version | Time for two jobs | Cores | Memory | Data Size
1.6           | did not work      | NA    | NA     | 30 GB Parquet (approx. 500 GB raw data)
2.0           | 50-60 min         | 37    | 37 GB  | 30 GB Parquet (approx. 500 GB raw data)
Contd...
● Join with GC configs

Spark Version | Time for two jobs | Cores | Memory | Data Size
2.0           | 11 min            | 36    | 48 GB  | 1 x 12 GB with 12 x 10 MB
Sort and Join
Both sort and join need matching keys to be in the same partition.
If they are not, the data must be shuffled so that equal keys land in the same
partition, which is a costly operation.
The shuffle is carried out by the shuffle manager, a service in Spark.
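To make the shuffle concrete, here is a minimal spark-shell-style sketch (Spark 2.0 Scala API; the column names and sizes are made up for illustration). The plain join shows an Exchange (shuffle) on both sides of the physical plan; repartitioning both sides by the join key up front lets the sort-merge join reuse that partitioning.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JoinShuffleDemo").getOrCreate()
import spark.implicits._

// Two toy DataFrames keyed on "key".
val left  = (1 to 100000).toDF("key").withColumn("l", $"key" * 2)
val right = (1 to 100000).toDF("key").withColumn("r", $"key" * 3)

// A plain join: explain() shows an Exchange (shuffle) on both sides,
// moving rows so that equal keys meet in the same partition.
left.join(right, "key").explain()

// Shuffling both sides by the join key up front lets the subsequent
// SortMergeJoin reuse that partitioning instead of adding its own Exchange.
val leftP  = left.repartition($"key")
val rightP = right.repartition($"key")
leftP.join(rightP, "key").explain()

Comparing the two explain() outputs is a quick way to see where a job pays its shuffle cost.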
Shuffle Manager
● Both the driver and the executors have their own shuffle service.
● The driver registers shuffles with the shuffle manager, and executors ask it
to read and write shuffle data.
● The setting "spark.shuffle.manager" selects the shuffle manager
implementation (see the sketch after this list).
● Two shuffle implementations in Spark are hash and sort.
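A minimal spark-shell-style sketch of the setting mentioned above. Note that sort-based shuffle has been the default since Spark 1.2, and Spark 2.0 removed the hash implementation, so "sort" is effectively the only built-in choice here.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Select the shuffle manager explicitly; "sort" is the default,
// and Spark 2.0 removed the "hash" implementation.
val conf = new SparkConf()
  .setAppName("ShuffleManagerDemo")
  .set("spark.shuffle.manager", "sort")

val spark = SparkSession.builder().config(conf).getOrCreate()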
Contd...
● In 2.0, the LZ4 compression of shuffle data supports appending, which helps
reduce the number of small files in shuffle spills.
● Added the "spark.reducer.maxReqsInFlight" property, which limits the number
of remote requests to fetch blocks at any given point (see the sketch after
this list).
● Shuffle data can be reused thanks to whole-stage code generation.
● We found that changing our machines' disks from magnetic to sd1 increased
shuffle read and write I/O.
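A minimal spark-shell-style sketch of the fetch-limiting property above; the values are illustrative, not tuned recommendations from the benchmarks.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ShuffleFetchTuning")
  // Cap concurrent remote block-fetch requests (added in Spark 2.0;
  // the default is effectively unlimited).
  .config("spark.reducer.maxReqsInFlight", "64")
  // Companion knob: cap the total bytes fetched in flight (default 48m).
  .config("spark.reducer.maxSizeInFlight", "48m")
  .getOrCreate()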
GC optimisations
● -XX:G1HeapRegionSize
● -XX:+AlwaysPreTouch
● -XX:ParallelGCThreads
● -XX:InitiatingHeapOccupancyPercent=0
● -Xms
Contd...
● -XX:InitialTenuringThreshold
● -XX:MaxMetaspaceSize
● -XX:G1MaxNewSizePercent
● Pass the options to executors with --conf "spark.executor.extraJavaOptions=..."
● Example: spark.executor.extraJavaOptions=-XX:SurvivorRatio=16 -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy
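Putting the flags together, a minimal spark-shell-style sketch of a G1-tuned executor configuration. Every value is illustrative rather than the tuned settings behind the benchmarks above, and maximum heap size (-Xmx) cannot be passed through extraJavaOptions; it comes from spark.executor.memory.

import org.apache.spark.sql.SparkSession

// Illustrative G1 settings assembled from the flags above; tune per workload.
val gcOptions = Seq(
  "-XX:+UseG1GC",
  "-XX:G1HeapRegionSize=16m",               // region size: power of 2, 1m-32m
  "-XX:+AlwaysPreTouch",                    // touch heap pages at JVM start
  "-XX:ParallelGCThreads=8",                // roughly match executor cores
  "-XX:InitiatingHeapOccupancyPercent=35",  // start concurrent marking early
  "-XX:InitialTenuringThreshold=4",
  "-XX:MaxMetaspaceSize=256m",
  "-XX:+UnlockExperimentalVMOptions",       // required for the next flag
  "-XX:G1MaxNewSizePercent=30",
  "-XX:SurvivorRatio=16",
  "-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps",
  "-XX:+PrintReferenceGC", "-XX:+PrintAdaptiveSizePolicy"
).mkString(" ")

val spark = SparkSession.builder()
  .appName("GcTunedJob")
  // Heap size (-Xmx) is NOT allowed here; set spark.executor.memory instead.
  .config("spark.executor.extraJavaOptions", gcOptions)
  .getOrCreate()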
Thank You
