Beyond shuffling global big data tech conference 2015 sj

Beyond Shuffling
tips & tricks for scaling Apache Spark
Global Big
Data SJ 2015
early version

Who am I?
My name is Holden Karau
Prefered pronouns are she/her
I’m a Software Engineer at IBM
previously Alpine, Databricks, Google, Foursquare & Amazon
co-author of Learning Spark & Fast Data processing with Spark
co-author of a new book focused on Spark performance coming out next year*
@holdenkarau
Slide share http://www.slideshare.net/hkarau

What is going to be covered:
What I think I might know about you
RDD re-use (caching, persistence levels, and checkpointing)
Working with key/value data
Why group key is evil and what we can do about it
Best practices for Spark accumulators*
When Spark SQL can be amazing and wonderful
A quick detour into some future performance work in Spark MLLib

Who I think you wonderful humans are?
Nice* people
Know some Apache Spark
Want to scale your Apache Spark jobs
Comfortable reading Scala
Lori Erickson

If you want to follow along with the exercise
Make sure you have recent-ish JDK
Install Spark (any precompiled Hadoop version)
http://spark.apache.org/downloads.html

Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream

RDD re-use - sadly not magic
If we know we are going to re-use the RDD what should we do?
If it fits nicely in memory caching in memory
persisting at another level
MEMORY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
checkpointing
Noisey clusters
_2 & checkpointing can help
Richard Gillin

Considerations for Key/Value Data
What does the distribution of keys look like?
What type of aggregations do we need to do?
Do we want our data in any particular order?
Are we joining with another RDD?
Whats our partitioner?
eleda 1

What is key skew and why do we care?
Keys aren’t evenly distributed
Sales by zip code, or records by city, etc.
groupByKey will explode (but it's pretty easy to break)
We can have really unbalanced partitions
If we have enough key skew sortByKey could even fail
Stragglers (uneven sharding can make some tasks take much longer)
Mitchell
Joyce

groupByKey - just how evil is it?
Pretty evil
Groups all of the records with the same key into a single record
Even if we immediately reduce it (e.g. sum it or similar)
This can be too big to fit in memory, then our job fails
Unless we are in SQL then happy pandas
PROgeckoam

Let’s revisit wordcount with groupByKey
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)

And now back to the “normal” version
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts

Let’s launch spark and compare the two
You can get Spark from http://spark.apache.org/downloads.html
You need a recent version of Java
If installing is difficult don’t worry - the results will be in the slides
Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp

Code to compare the two:
Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions
// Evil group by key version
val grouped = wordPairs.groupByKey()
val evilWordCounts = grouped.mapValues(_.sum)
evilWordCounts.take(5)
// Less evil version
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.take(5)

So why did we read in python/*.py
If we just read in the standard README.md file there aren’t enough duplicated
keys for the reduceByKey & groupByKey difference to be really apparent
Which is why groupByKey can be safe sometimes

So what did we do instead?
reduceByKey
Works when the types are the same (e.g. in our summing version)
aggregateByKey
Doesn’t require the types to be the same (e.g. computing stats model or similar)
Allows Spark to pipeline the reduction & skip making the list
We also got a map-side reduction (note the difference in shuffled read)

Can just the shuffle cause problems?
Sorting by key can put all of the records in the same partition
We can run into partition size limits (around 2GB)
Or just get bad performance
So we can handle data like the above we can add some “junk” to our key
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
PROTodd
Klassy

Spark accumulators
Really “great” way for keeping track of failed records
Double counting makes things really tricky
Jobs which worked “fine” don’t continue to work “fine” when minor changes happen
Relative rules can save us* under certain conditions
Found Animals Foundation Follow

Using an accumulator for validation:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map{ x => if (isValid(x)) ok +=1 else bad += 1
// Actual parse logic here
}
// An action (e.g. count, save, etc.)
if (bad.value > 0.1* ok.value) {
throw Exception("bad data - do not use results")
// Optional cleanup
}
// Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
Found Animals Foundation Follow

Where can Spark SQL benefit perf?
Structured or semi-structured data
OK with having less* complex operations available to us
We may only need to operate on a subset of the data
The fastest data to process isn’t even read
Remember that non-magic cat? Its got some magic** now
In part from peeking inside of boxes
**Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting
magic
Matti Mattila

Why is Spark SQL good for those things?
Space efficient columnar cached representation
Able to push down operations to the data store
Optimizer is able to look inside of our operations
Regular spark can’t see inside our operations to spot the difference between (min(_, _)) and
(append(_, _))
Matti Mattila

Preview: bringing codegen to Spark ML
Based on Spark SQL’s code generation
First draft using quasiquotes
Switch to janino for Java compilation
Initial draft for Gradient Boosted Trees
Based on DB’s work
First draft with QuasiQuotes
Moved to Java for speed
See SPARK-10387 for the details
Jon

@Override
public double call(Vector input) throws
Exception {
if (input.apply(1) <= 1.0) {
return 0.1;
} else {
if (input.apply(0) <= 0.5) {
return 0.0;
} else {
return 2.0;
}
}
}
(1, 1.0)
0.1 (0, 0.5)
0.0 2.0
What the generated code looks like: Glenn Simmons

Everyone* needs reduce, let’s make it faster!
reduce & aggregate have “tree” versions
we already had free map-side reduction
but now we can get even better!**
**And we might be able to make even cooler versions

Additional Resources
Programming guide (along with JavaDoc, PyDoc,
ScalaDoc, etc.)
http://spark.apache.org/docs/latest/
Books
Videos
Our next meetup!
Spark Office Hours
follow me on twitter for future ones - https://twitter.com/holdenkarau
fill out this survey to choose the next date - http://bit.ly/spOffice1
raider of gin

Q&A OR A quick detour into spark testing?
It's like a choose your own adventure novel, but with
voting
But more like the voting in High School since if we are
running out of time we might just skip it

Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Coming soon:
Spark in Action

Spark Videos
Apache Spark Youtube Channel
My Spark videos on YouTube -
http://bit.ly/holdenSparkVideos
Spark Summit 2014 training
Paco’s Introduction to Apache Spark

Cat wave photo by Quinn Dombrowski
k thnx bye!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau

Beyond shuffling global big data tech conference 2015 sj

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (7)

Similar to Beyond shuffling global big data tech conference 2015 sj

Similar to Beyond shuffling global big data tech conference 2015 sj (20)

Recently uploaded

Recently uploaded (20)

Beyond shuffling global big data tech conference 2015 sj

Editor's Notes