Beyond Shuffling
Holden Karau
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
○ IBM is, by the way, a founding member of the Scala Center
● Previously Alpine, Databricks, Google, Foursquare & Amazon
● Co-author of Learning Spark & Fast Data Processing with Spark
○ co-author of a new book focused on Spark performance coming out this year*
● @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● LinkedIn https://www.linkedin.com/in/holdenkarau
● GitHub https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
What is going to be covered:
● What I think I might know about you
● RDD re-use (caching, persistence levels, and checkpointing)
● Working with key/value data
○ Why groupByKey is evil and what we can do about it
● When Spark SQL can be amazing and wonderful
● A brief introduction to Datasets (new in Spark 1.6)
● Iterator-to-Iterator transformations (or yet another way to go OOM in the night)
● How to test your Spark code :)
Torsten Reuschling
Or….
Huang Yun Chung
Who I think you wonderful humans are?
● Nice* people
● Don’t mind pictures of cats
○ May or may not like pictures of David Hasselhoff
● Know some Apache Spark
● Want to scale your Apache Spark jobs
● Don’t overly mind a grab-bag of topics
Lori Erickson
The different pieces of Spark
Apache Spark
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph tools: Bagel & GraphX
● Spark ML & MLlib
● Community packages
Jon Ross
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
Let’s look at some old standbys:
val rdd = sc.textFile("python/pyspark/*.py", 20)
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
val warnings = rdd.filter(_.toLower.contains("error")).count()
Tomomi
RDD re-use - sadly not magic
● If we know we are going to re-use the RDD, what should we do?
○ If it fits nicely in memory, cache it in memory
○ Otherwise, persist at another level
■ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
○ Or checkpoint it
● Noisy clusters
○ The replicated (_2) levels & checkpointing can help
● Persist first when checkpointing, so the checkpoint write doesn’t recompute the RDD (see the sketch below)
Richard Gillin
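A minimal sketch of those options (the path, checkpoint directory, and the chosen storage level are illustrative, not the only reasonable picks):

import org.apache.spark.storage.StorageLevel

val pythonLines = sc.textFile("python/pyspark/*.py")

// Pick one storage level per RDD (it cannot be changed once set):
// MEMORY_ONLY (same as .cache()), MEMORY_ONLY_SER, MEMORY_AND_DISK,
// MEMORY_AND_DISK_SER, or a _2 variant for replication on noisy clusters.
pythonLines.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

// Checkpointing truncates the lineage; persist first so writing the
// checkpoint does not recompute the RDD from scratch.
sc.setCheckpointDir("/tmp/spark-checkpoints")
pythonLines.checkpoint()
pythonLines.count() // an action materializes both the persist & the checkpoint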
Considerations for Key/Value Data
● What does the distribution of keys look like?
● What type of aggregations do we need to do?
● Do we want our data in any particular order?
● Are we joining with another RDD?
● What’s our partitioner? (see the sketch after this slide)
○ If we don’t have an explicit one: what is the partition structure?
eleda 1
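A quick way to answer the last two questions, reusing wordPairs from the earlier example (the exact output depends on your data & parallelism):

val wordCounts = wordPairs.reduceByKey(_ + _)
println(wordCounts.partitioner)        // Some(HashPartitioner(...)) after reduceByKey
println(wordCounts.partitions.length)  // how many partitions we ended up with
// Records per partition - a cheap way to eyeball skew
println(wordCounts.mapPartitions(it => Iterator(it.size)).collect().toList)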
What is key skew and why do we care?
● Keys aren’t evenly distributed
○ Sales by zip code, or records by city, etc.
● groupByKey will explode (it’s pretty easy to break)
● We can have really unbalanced partitions
○ If we have enough key skew sortByKey could even fail
○ Stragglers (uneven sharding can make some tasks take much longer)
Mitchell Joyce
groupByKey - just how evil is it?
● Pretty evil
● Groups all of the records with the same key into a single record
○ Even if we immediately reduce it (e.g. sum it or similar)
○ This can be too big to fit in memory, then our job fails
● Unless we are in Spark SQL, then happy pandas
PROgeckoam
So what does that look like?
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(67843, T, R) (10003, A, R)
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
Tomomi
Let’s revisit wordcount with groupByKey
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
Tomomi
And now back to the “normal” version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts
[Spark UI screenshots comparing the groupByKey & reduceByKey runs - note the difference in shuffle read]
So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn’t require the types to be the same (e.g. computing a stats model or similar)
Allows Spark to pipeline the reduction & skip materializing the list.
We also got a map-side reduction (note the difference in shuffled read) - see the sketch below.
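A minimal sketch of aggregateByKey, where the result type differs from the value type: per-word (count, sum) from the (word, 1) pairs above.

// zero value, then: fold a value into the per-partition accumulator,
// and merge accumulators across partitions (this is what gives the map-side combine)
val stats = wordPairs.aggregateByKey((0L, 0L))(
  (acc, v) => (acc._1 + 1, acc._2 + v),
  (a, b) => (a._1 + b._1, a._2 + b._2))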
So why did we read in python/*.py
If we just read in the standard README.md file there aren’t enough duplicated
keys for the reduceByKey & groupByKey difference to be really apparent
Which is why groupByKey can be safe sometimes
Can just the shuffle cause problems?
● Sorting by key can put all of the records in the same partition
● We can run into partition size limits (around 2GB)
● Or just get bad performance
● So that we can handle data like the above, we can add some “junk” to our key (see the salting sketch below)
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
Todd Klassy
Shuffle explosions :(
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110, A, B)
(94110, A, C)
(94110, E, F)
(94110, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(94110, T, R)
(94110, T, R)
(67843, T, R) (10003, A, R)
(10003, D, E)
javier_artiles
100% less explosions
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110_A, A, B)
(94110_A, A, C)
(94110_A, A, R)
(94110_D, D, R)
(94110_T, T, R)
(10003_A, A, R)
(10003_D, D, E)
(67843_T, T, R)
(94110_E, E, R)
(94110_E, E, R)
(94110_E, E, F)
(94110_T, T, R)
Jennifer Williams
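A sketch of key salting in code. The slide’s example reuses part of the value as the “junk”; another common variant (shown here, details illustrative) appends a random salt, aggregates, then strips the salt and aggregates again on the much smaller partial result:

import scala.util.Random

// Spread each hot key over up to 16 buckets
val salted = wordPairs.map { case (k, v) => ((k, Random.nextInt(16)), v) }
val partialCounts = salted.reduceByKey(_ + _)
// Drop the salt & finish the aggregation
val counts = partialCounts.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)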
Well there is a bit of magic in the shuffle….
● We can reuse shuffle files
● But it can (and does) explode*
Sculpture by Flaming Lotus Girls
Photo by Zaskoda
Photo by Christian Heilmann
Iterator to Iterator transformations
● Iterator to Iterator transformations are super useful
○ They allow Spark to spill to disk if reading an entire partition is too much
○ Not to mention better pipelining when we put multiple transformations together
● Most of the default transformations are already set up for this
○ map, filter, flatMap, etc.
● But when we start working directly with the iterators
○ Sometimes to save setup time on expensive objects
○ e.g. mapPartitions, mapPartitionsWithIndex etc.
● Reading into a list or other structure (implicitly or otherwise) can even cause OOMs :( - see the sketch below
Christian Heilmann
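A minimal sketch of the difference, reusing the rdd of lines from earlier (the “expensive” setup object is just a stand-in):

// Bad: iter.toList materializes the whole partition in memory before mapping,
// defeating Spark's pipelining & ability to spill
val eager = rdd.mapPartitions { iter =>
  val formatter = java.text.NumberFormat.getIntegerInstance() // pretend this is expensive to create
  iter.toList.map(line => formatter.format(line.length)).iterator
}

// Good: stays iterator-to-iterator, records stream through one at a time
val streaming = rdd.mapPartitions { iter =>
  val formatter = java.text.NumberFormat.getIntegerInstance()
  iter.map(line => formatter.format(line.length))
}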
tl;dr: be careful with mapPartitions
Christian Heilmann
Where can Spark SQL benefit perf?
● Structured or semi-structured data
● OK with having less* complex operations available to us
● We may only need to operate on a subset of the data
○ The fastest data to process isn’t even read (see the sketch below)
● Remember that non-magic cat? It’s got some magic** now
○ In part from peeking inside of boxes
● non-JVM (aka Python & R) users: saved from double serialization cost! :)
**Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting
magic
Matti Mattila
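A small sketch of “the fastest data to process isn’t even read”, assuming Parquet input (the path and column names are hypothetical, and sqlContext is your SQLContext): column pruning & filter pushdown can mean the unneeded rows and columns are never loaded.

val pandasDF = sqlContext.read.parquet("/path/to/pandas.parquet")
// Only the zip column of rows with happy == true needs to be read
val happyZips = pandasDF.filter(pandasDF("happy") === true).select("zip")
happyZips.show()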
Why is Spark SQL good for those things?
● Space efficient columnar cached representation
● Able to push down operations to the data store
● Optimizer is able to look inside of our operations
○ Regular Spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _))
How much faster can it be?
Andrew Skudder
But where will it explode?
● Iterative algorithms - large plans
● Some push downs are sad pandas :(
● Default shuffle partition count is sometimes too small for big data (spark.sql.shuffle.partitions defaults to 200) - see the sketch below
● Default partitioning when reading in is also sad
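A sketch of working around the shuffle & read-partitioning defaults (the numbers and path are illustrative - tune for your data, sqlContext is your SQLContext):

// Bump the number of partitions Spark SQL uses after a shuffle
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

// Or repartition explicitly after reading if the input partitioning is too coarse
val df = sqlContext.read.json("/path/to/input.json").repartition(500)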
How to avoid lineage explosions:
/**
 * Cut the lineage of a DataFrame which has too long a query plan.
 */
def cutLineage(df: DataFrame): DataFrame = {
  val sqlCtx = df.sqlContext
  //tag::cutLineage[]
  val rdd = df.rdd
  rdd.cache()
  sqlCtx.createDataFrame(rdd, df.schema)
  //end::cutLineage[]
}
karmablue
Introducing Datasets
● New in Spark 1.6 - coming to more languages in 2.0
● Provide templated compile time strongly typed version of DataFrames
● Make it easier to intermix functional & relational code
○ Do you hate writing UDFs? So do I!
● Still an experimental component (API will change in future versions)
○ Although the next major version seems likely to be 2.0 anyways so lots of things may change
regardless
Houser Wolf
Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
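RawPanda isn’t defined on these slides; a minimal sketch of what such a case class might look like (the field names beyond happy & attributes are assumptions):

// Hypothetical shape, based on the happy & attributes columns used above
case class RawPanda(id: Long, zip: String, happy: Boolean, attributes: Array[Double])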
So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
● toDF(): convert a Dataset to a DataFrame to access more DataFrame functions (not needed in 2.0+ & almost free anyways)
● as[RawPanda]: convert the DataFrame back to a Dataset
● select($"attributes"(0).as[Double]): a typed query (specifies the return type)
● reduce((x, y) => x + y): traditional functional reduction - arbitrary Scala code :)
And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas.
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Switching gears: Validating Spark jobs
● Spark has its own counters
○ Per-stage bytes r/w, shuffle r/w, records r/w, execution time, etc.
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting
option
● We can write rules for whether the values are as expected
○ Simple rules (X > J)
■ The number of records should be greater than 0
○ Historic rules (X > Avg(last(10, J)))
■ We need to keep track of our previous values - but this can be great for debugging &
performance investigation too.
Photo by:
Paul Schadler
Spark accumulators
● Really “great” way for keeping track of failed records
● Double counting makes things really tricky
○ Jobs which worked “fine” don’t continue to work “fine” when minor changes happen
● Relative rules can save us* under certain conditions
Found Animals Foundation
Using an accumulator for validation:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map{ x =>
  if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
  x
}
// An action (e.g. count, save, etc.)
records.count()
if (bad.value > 0.1 * ok.value) {
  throw new Exception("bad data - do not use results")
  // Optional cleanup
}
// Mark as safe
Found Animals Foundation
P.S: If you are interested in this check out spark-validator (still early stages) & help
review Spark PR #11105 / SPARK-12469
Validation
● For now, checking file sizes & execution time seems to be the most common best practice (from the survey)
● spark-validator is still in early stages and not ready for production use but
interesting proof of concept
● Doesn’t need to be done in your Spark job (can be done in your scripting
language of choice with whatever job control system you are using)
● Sometimes your rules will misfire and you’ll need to manually approve a job - that is ok!
Photo by:
Paul Schadler
Validating records read matches our expectations:
val vc = new ValidationConf(tempPath, "1", true,
  List[ValidationRule](
    new AbsoluteSparkCounterValidationRule("recordsRead", Some(30), Some(1000))))
val sqlCtx = new SQLContext(sc)
val v = Validation(sc, sqlCtx, vc)
// Do work here....
assert(v.validate(5) === true)
Photo by Dvortygirl
Making our code testable
● Before we can refactor our code we need to have good tests
● Testing individual functions is easy if we factor them out
○ fewer lambdas! :(
● Testing our Spark transformations is a bit trickier, but there are a variety of
tools
● Property checking can save us from having to come up with lots of test cases
A simple unit test with spark-testing-base
class SampleRDDTest extends FunSuite with SharedSparkContext {
test("really simple transformation") {
val input = List("hi", "hi holden", "bye")
val expected = List(List("hi"), List("hi", "holden"), List("bye"))
assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
}
}
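SampleRDD.tokenize isn’t shown on the slide; a minimal sketch of what the function under test might look like (the signature is assumed from the test):

import org.apache.spark.rdd.RDD

object SampleRDD {
  // Split each line on spaces, producing one List[String] per input line
  def tokenize(lines: RDD[String]): RDD[List[String]] = {
    lines.map(_.split(" ").toList)
  }
}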
Ok but what about testing @ “scale”
● Maybe our program works fine on our local sized input
● If we are using Spark our actual workload is probably huge
● How do we test workloads too large for a single machine?
○ we can’t just use parallelize and collect
Qfamily
Distributed “set” operations to the rescue*
● Pretty close - already built into Spark (see the sketch below)
● Doesn’t do so well with floating points :(
○ damn floating points keep showing up everywhere :p
● Doesn’t really handle duplicates very well
○ {“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations...
Matti Mattila
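A sketch of the built-in set-style comparison, with the caveats above baked in: subtract ignores ordering and collapses duplicate counts.

import org.apache.spark.rdd.RDD

def compareAsSets[T](expected: RDD[T], result: RDD[T]): Boolean = {
  // Empty differences in both directions => same set of distinct elements
  expected.subtract(result).isEmpty() && result.subtract(expected).isEmpty()
}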
Or use RDDComparisons:
def compareWithOrderSamePartitioner[T: ClassTag](
    expected: RDD[T], result: RDD[T]): Option[(T, T)] = {
  expected.zip(result).filter{case (x, y) => x != y}.take(1).headOption
}
Matti Mattila
Or use RDDComparisons:
def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = {
  val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _)
  val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _)
  expectedKeyed.cogroup(resultKeyed)
    .filter{case (_, (i1, i2)) => i1.isEmpty || i2.isEmpty || i1.head != i2.head}
    .take(1).headOption
    .map{case (v, (i1, i2)) => (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))}
}
Matti Mattila
But where do we get the data for those tests?
● If you have production data you can sample, you are lucky!
○ If possible you can try and save in the same format
● If our data is a bunch of Vectors or Doubles Spark’s got tools :)
● Coming up with good test data can take a long time
Lori Rielly
QuickCheck / ScalaCheck
● QuickCheck generates test data under a set of constraints
● Scala version is ScalaCheck - supported by the two unit testing libraries for
Spark
● sscheck
○ Awesome people*, supports generating DStreams too!
● spark-testing-base
○ Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs
*I assume
PROtara hunt
With spark-testing-base
test("map should not change number of elements") {
val property =
forAll(RDDGenerator.genRDD[String](sc)
(Arbitrary.arbitrary[String])) {
rdd => rdd.map(_.length).count() == rdd.count()
}
check(property)
}
With spark-testing-base & a million entries
test("map should not change number of elements") {
implicit val generatorDrivenConfig =
PropertyCheckConfig(minSize = 0, maxSize = 1000000)
val property =
forAll(RDDGenerator.genRDD[String](sc)
(Arbitrary.arbitrary[String])) {
rdd => rdd.map(_.length).count() == rdd.count()
}
check(property)
}
Additional Spark Testing Resources
● Libraries
○ Scala: spark-testing-base (ScalaCheck & unit), sscheck (ScalaCheck), example-spark (unit)
○ Java: spark-testing-base (unit)
○ Python: spark-testing-base (unittest2), pyspark.test (pytest)
● Strata San Jose Talk (up on YouTube)
● Blog posts
○ Unit Testing Spark with Java by Jesse Anderson
○ Making Apache Spark Testing Easy with Spark Testing Base
○ Unit testing Apache Spark with py.test
raider of gin
Additional Spark Resources
● Programming guide (along with JavaDoc, PyDoc,
ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/
● Kay Ousterhout’s work
○ http://www.eecs.berkeley.edu/~keo/
● Books
● Videos
● Spark Office Hours
○ Normally in the bay area - will do Google Hangouts ones soon
○ Follow me on Twitter for future ones - https://twitter.com/holdenkarau
raider of gin
Learning Spark
Fast Data Processing with Spark (out of date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
Coming soon: High Performance Spark
And the next book…..
First four chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
Spark Videos
● Apache Spark YouTube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
Paul Anderson
k thnx bye!
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Any PySpark users: have some simple UDFs you wish ran faster that you are willing to share? http://bit.ly/pySparkUDF