A Deeper Understanding of Spark’s Internals
Aaron Davidson"
07/01/2014
This Talk
•  Goal: Understand how Spark runs, with a focus on performance
•  Major core components:
– Execution Model
– The Shuffle
– Caching
Why understand internals?
Goal: Find number of distinct names per "first letter"

sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.toSet.size)
  .collect()

[Diagram, built up over several slides: how the example data flows through each step]
Input:         Andy, Pat, Ahir
map():         (A, Andy), (P, Pat), (A, Ahir)
groupByKey():  (A, [Ahir, Andy]), (P, [Pat])
mapValues():   (A, Set(Ahir, Andy)), (P, Set(Pat)) -> (A, 2), (P, 1)
collect():     res0 = [(A, 2), (P, 1)]
Spark Execution Model
1.  Create DAG of RDDs to represent
computation
2.  Create logical execution plan for DAG
3.  Schedule and execute individual tasks
Step 1: Create RDDs
sc.textFile("hdfs:/names")
map(name => (name.charAt(0), name))
groupByKey()
mapValues(names => names.toSet.size)
collect()
Step 1: Create RDDs
HadoopRDD
map()
groupBy()
mapValues()
collect()
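
A quick way to see the chain of RDDs Spark actually built is toDebugString; a minimal spark-shell sketch (same hdfs:/names path as the running example; exact RDD class names vary by Spark version):

// Sketch: print the lineage behind the example job (run in spark-shell, where `sc` exists).
val counts = sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.toSet.size)
println(counts.toDebugString)  // e.g. MappedValuesRDD <- ShuffledRDD <- MappedRDD <- HadoopRDD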
Step 2: Create execution plan
•  Pipeline as much as possible
•  Split into "stages" based on need to reorganize data

[Diagram: the pipeline HadoopRDD -> map() -> groupBy() -> mapValues() -> collect() is split
at the groupBy() shuffle into Stage 1 and Stage 2. The example data flows through as before:
Andy, Pat, Ahir -> (A, Andy), (P, Pat), (A, Ahir) -> (A, [Ahir, Andy]), (P, [Pat]) ->
(A, 2), (P, 1) -> res0 = [(A, 2), (P, 1)]]
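
The boundary falls wherever an RDD depends on its parent through a shuffle; a small sketch of checking that from the shell:

// Sketch: groupByKey introduces a shuffle dependency, which ends one stage and starts the next.
val pairs   = sc.textFile("hdfs:/names").map(name => (name.charAt(0), name))
val grouped = pairs.groupByKey()
println(pairs.dependencies.head)    // a narrow dependency: pipelined within the same stage
println(grouped.dependencies.head)  // a ShuffleDependency: forces a new stage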
Step 3: Schedule tasks
•  Split each stage into tasks
•  A task is data + computation
•  Execute all tasks within a stage before moving on
Step 3: Schedule tasks
[Diagram: Stage 1 (HadoopRDD -> map()) is split into one task per piece of input data]
Task 0: hdfs:/names/0.gz + (HadoopRDD -> map())
Task 1: hdfs:/names/1.gz + (HadoopRDD -> map())
Task 2: hdfs:/names/2.gz + (HadoopRDD -> map())
Task 3: hdfs:/names/3.gz + (HadoopRDD -> map())
…
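
The number of Stage 1 tasks equals the number of partitions of the input RDD; a quick check, assuming the same hdfs:/names directory of .gz files:

// Sketch: count the input partitions (gzip files are not splittable, so one per file here).
val names = sc.textFile("hdfs:/names")
println(names.partitions.length)  // one partition, hence one task, per .gz file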
Step 3: Schedule tasks
[Diagram, animated over several slides: three HDFS nodes hold {/names/0.gz, /names/3.gz},
{/names/1.gz, /names/2.gz}, and {/names/2.gz, /names/3.gz}. As time advances, the
HadoopRDD -> map() tasks for /names/0.gz, /names/1.gz, /names/2.gz, and /names/3.gz
are launched on nodes that already hold those files.]
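
That placement is driven by each partition's preferred locations; a small sketch of inspecting them (hostnames depend on your cluster):

// Sketch: HadoopRDD reports, per partition, which hosts already hold that block of data.
val names = sc.textFile("hdfs:/names")
names.partitions.foreach { p =>
  println(s"partition ${p.index}: " + names.preferredLocations(p).mkString(", "))
}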
The Shuffle
[Diagram: the stage boundary -- Stage 1 and Stage 2 of the HadoopRDD -> map() ->
groupBy() -> mapValues() -> collect() pipeline, with the shuffle between them]
The Shuffle
•  Redistributes data among partitions
•  Hash keys into buckets
•  Optimizations (sketched below):
– Avoided when possible, if data is already properly partitioned
– Partial aggregation reduces data movement
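
Hedged sketches of both optimizations on the running example (the partition count 6 is arbitrary):

// Partial aggregation: reduceByKey combines values inside each map-side partition
// before shuffling, so only small per-letter sums cross the network.
val counts = sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), 1))
  .reduceByKey(_ + _)

// Already-partitioned data: an RDD with a known partitioner lets later key-based
// operations that use the same partitioner skip re-shuffling it.
import org.apache.spark.HashPartitioner
val byLetter = sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .partitionBy(new HashPartitioner(6))
  .cache()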
The Shuffle
•  Pull-based, not push-based
•  Write intermediate files to disk
[Diagram: Stage 1 writes its shuffle output to disk; Stage 2 pulls it from there]
Execution of a groupBy()
•  Build hash map within each partition
•  Note: Can spill across keys, but a single
key-value pair must fit in memory
A => [Arsalan, Aaron, Andrew, Andrew, Andy, Ahir, Ali, …],
E => [Erin, Earl, Ed, …]
…
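
A conceptual sketch of that per-partition hash map in plain Scala (not Spark's actual implementation, which also handles spilling):

import scala.collection.mutable
// Each reduce task builds roughly this structure for the keys hashed to its partition.
val groups = mutable.HashMap.empty[Char, mutable.ArrayBuffer[String]]
for ((letter, name) <- Seq(('A', "Arsalan"), ('A', "Aaron"), ('E', "Erin"), ('E', "Earl")))
  groups.getOrElseUpdate(letter, mutable.ArrayBuffer.empty[String]) += name
// groups: Map(A -> ArrayBuffer(Arsalan, Aaron), E -> ArrayBuffer(Erin, Earl))
// All values for one key live in one in-memory buffer, hence the caveat above.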
Done!
[Diagram: the completed pipeline -- Stage 1 and Stage 2 of HadoopRDD -> map() ->
groupBy() -> mapValues() -> collect()]
What went wrong?
•  Too few partitions to get good concurrency
•  Large per-key groupBy()
•  Shipped all data across the cluster
Common issue checklist
1.  Ensure enough partitions for concurrency
2.  Minimize memory consumption (esp. of
sorting and large keys in groupBys)
3.  Minimize amount of data shuffled
4.  Know the standard library
1 & 2 are about tuning number of partitions!
Importance of Partition Tuning
•  Main issue: too few partitions
–  Less concurrency
–  More susceptible to data skew
–  Increased memory pressure for groupBy,
reduceByKey, sortByKey, etc.
•  Secondary issue: too many partitions
•  Need “reasonable number” of partitions
–  Commonly between 100 and 10,000 partitions
–  Lower bound: At least ~2x number of cores in
cluster
–  Upper bound: Ensure tasks take at least 100ms
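
A hedged sketch of the usual knobs for partition count (all numbers here are illustrative, not recommendations):

val names  = sc.textFile("hdfs:/names", minPartitions = 120)  // hint at read time
val wider  = names.repartition(400)                           // full shuffle, more parallelism
val fewer  = wider.coalesce(100)                              // shrink without a full shuffle
val counts = names
  .map(name => (name.charAt(0), 1))
  .reduceByKey(_ + _, 200)                                    // shuffle ops accept a partition count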
Memory Problems
•  Symptoms:
–  Inexplicably bad performance
–  Inexplicable executor/machine failures"
(can indicate too many shuffle files too)
•  Diagnosis:
–  Set spark.executor.extraJavaOptions to include 
•  -XX:+PrintGCDetails
•  -XX:+HeapDumpOnOutOfMemoryError
–  Check dmesg for oom-killer logs
•  Resolution:
–  Increase spark.executor.memory
–  Increase number of partitions
–  Re-evaluate program structure (!)
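
A minimal sketch of applying those settings when building the context (values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("distinct-names")
  .set("spark.executor.memory", "4g")  // resolution: more executor memory
  .set("spark.executor.extraJavaOptions",
       "-XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError")  // diagnosis: GC logs + heap dumps
val sc = new SparkContext(conf)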
Fixing our mistakes
sc.textFile("hdfs:/names")
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues { names => names.toSet.size }
.collect()
1.  Ensure enough partitions for
concurrency
2.  Minimize memory consumption (esp. of
large groupBys and sorting)
3.  Minimize data shuffle
4.  Know the standard library
Fixing our mistakes
sc.textFile("hdfs:/names")
.repartition(6)
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues { names => names.toSet.size }
.collect()
1.  Ensure enough partitions for
concurrency
2.  Minimize memory consumption (esp. of
large groupBys and sorting)
3.  Minimize data shuffle
4.  Know the standard library
Fixing our mistakes
sc.textFile("hdfs:/names")
.repartition(6)
.distinct()
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues { names => names.toSet.size }
.collect()
1.  Ensure enough partitions for
concurrency
2.  Minimize memory consumption (esp. of
large groupBys and sorting)
3.  Minimize data shuffle
4.  Know the standard library
Fixing our mistakes
sc.textFile("hdfs:/names")
.repartition(6)
.distinct()
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues { names => names.size }
.collect()
1.  Ensure enough partitions for
concurrency
2.  Minimize memory consumption (esp. of
large groupBys and sorting)
3.  Minimize data shuffle
4.  Know the standard library
Fixing our mistakes
sc.textFile("hdfs:/names")
.distinct(numPartitions = 6)
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues { names => names.size }
.collect()
1.  Ensure enough partitions for
concurrency
2.  Minimize memory consumption (esp. of
large groupBys and sorting)
3.  Minimize data shuffle
4.  Know the standard library
Fixing our mistakes
sc.textFile("hdfs:/names")
.distinct(numPartitions = 6)
.map(name => (name.charAt(0), 1))
.reduceByKey(_ + _)
.collect()
1.  Ensure enough partitions for
concurrency
2.  Minimize memory consumption (esp. of
large groupBys and sorting)
3.  Minimize data shuffle
4.  Know the standard library
Fixing our mistakes
sc.textFile("hdfs:/names")
.distinct(numPartitions = 6)
.map(name => (name.charAt(0), 1))
.reduceByKey(_ + _)
.collect()
Original:
sc.textFile("hdfs:/names")
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues { names => names.toSet.size }
.collect()
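
One more "know the standard library" option, if an approximate answer is acceptable (a hedged sketch; check that your Spark version provides countApproxDistinctByKey):

// Approximate distinct names per letter without materializing any per-letter collection.
val approx = sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .countApproxDistinctByKey(0.01)  // relative accuracy; yields RDD[(Char, Long)]
  .collect()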
Questions?
