Beyond Shuffling
tips & tricks for scaling Apache Spark
Revised May 2016
Who am I?
My name is Holden Karau
Preferred pronouns are she/her
I’m a Principal Software Engineer at IBM’s Spark Technology Center
previously Alpine, Databricks, Google, Foursquare & Amazon
co-author of Learning Spark & Fast Data Processing with Spark
co-author of a new book focused on Spark performance coming out next year*
@holdenkarau
Slide share http://www.slideshare.net/hkarau
What is going to be covered:
What I think I might know about you
RDD re-use (caching, persistence levels, and checkpointing)
Working with key/value data
Why groupByKey is evil and what we can do about it
When Spark SQL can be amazing and wonderful
A brief introduction to Datasets (new in Spark 1.6)
Iterator-to-Iterator transformations (or yet another way to go OOM in the night)
How to test your Spark code :)
Torsten Reuschling
Or….
Huang
Yun
Chung
The different pieces of Spark
Apache Spark
SQL & DataFrames
Streaming
Language APIs: Scala, Java, Python, & R
Graph tools: Bagel & GraphX
Spark ML & MLlib
Community packages
Jon
Ross
Who I think you wonderful humans are?
Nice* people
Don’t mind pictures of cats
Know some Apache Spark
Want to scale your Apache Spark jobs
Don’t overly mind a grab-bag of topics
Lori Erickson
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
Let’s look at some old standbys:
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
val warnings = rdd.filter(_.toLowerCase.contains("error")).count()
Tomomi
RDD re-use - sadly not magic
If we know we are going to re-use the RDD what should we do?
If it fits nicely in memory: cache it in memory
Otherwise persist at another level
MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
Or checkpoint it
Noisy clusters
Replicated storage levels (the _2 variants, e.g. MEMORY_AND_DISK_2) & checkpointing can help
persist first for checkpointing
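A minimal sketch of "persist first for checkpointing", assuming a checkpoint directory has already been set with sc.setCheckpointDir. The object and method names (ReuseSketch, persistAndCheckpoint) are made up for illustration, and MEMORY_AND_DISK is just one reasonable choice of level:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object ReuseSketch {
  /** Persist first, then checkpoint, so the checkpoint write can read the
    * persisted data instead of recomputing the RDD from its lineage.
    * Assumes sc.setCheckpointDir(...) was called beforehand. */
  def persistAndCheckpoint[T](rdd: RDD[T]): RDD[T] = {
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.checkpoint() // only marks the RDD; nothing is written yet
    rdd.count()      // first action: fills the cache and writes the checkpoint
    rdd
  }
}
```

Without the persist, the first action computes the RDD once for the job and then again to write the checkpoint files.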
Richard Gillin
Considerations for Key/Value Data
What does the distribution of keys look like?
What type of aggregations do we need to do?
Do we want our data in any particular order?
Are we joining with another RDD?
What’s our partitioner?
If we don’t have an explicit one: what is the partition structure?
eleda 1
What is key skew and why do we care?
Keys aren’t evenly distributed
Sales by zip code, or records by city, etc.
groupByKey will explode (but it's pretty easy to break)
We can have really unbalanced partitions
If we have enough key skew sortByKey could even fail
Stragglers (uneven sharding can make some tasks take much longer)
Mitchell
Joyce
groupByKey - just how evil is it?
Pretty evil
Groups all of the records with the same key into a single record
Even if we immediately reduce it (e.g. sum it or similar)
This can be too big to fit in memory, then our job fails
Unless we are in SQL then happy pandas
geckoam
So what does that look like?
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)
(10003, A, R)
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
Tomomi
Let’s revisit wordcount with groupByKey
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
Tomomi
And now back to the “normal” version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts
Let’s see what it looks like when we run the two
Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions
// Evil group by key version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
val evilWordCounts = grouped.mapValues(_.sum)
evilWordCounts.take(5)
// Less evil version
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.take(5)
(Screenshots of the two runs: groupByKey vs. reduceByKey)
So what did we do instead?
reduceByKey
Works when the types are the same (e.g. in our summing version)
aggregateByKey
Doesn’t require the types to be the same (e.g. computing stats model or similar)
Allows Spark to pipeline the reduction & skip making the list
We also got a map-side reduction (note the difference in shuffled read)
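As a rough illustration of aggregateByKey with differing types, here is a sketch that computes a per-key mean: the values are Doubles while the accumulator is a (sum, count) pair. The names AggregateSketch and meanByKey are invented for this example and are not from the talk:

```scala
import org.apache.spark.rdd.RDD

object AggregateSketch {
  /** Per-key mean. The accumulator type (Double, Long) differs from the
    * value type Double, which is the case reduceByKey cannot express directly. */
  def meanByKey(pairs: RDD[(String, Double)]): RDD[(String, Double)] = {
    pairs.aggregateByKey((0.0, 0L))(
      // seqOp: fold one value into the partition-local accumulator
      { case ((sum, count), v) => (sum + v, count + 1) },
      // combOp: merge accumulators coming from different partitions
      { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    ).mapValues { case (sum, count) => sum / count }
  }
}
```

Only one small (sum, count) pair per key and partition crosses the shuffle, rather than every value.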
So why did we read in python/pyspark/*.py?
If we just read in the standard README.md file there aren’t enough duplicated keys for the reduceByKey & groupByKey difference to be really apparent
Which is why groupByKey can be safe sometimes
Can just the shuffle cause problems?
Sorting by key can put all of the records in the same partition
We can run into partition size limits (around 2GB)
Or just get bad performance
To handle data like the above we can add some “junk” to our key
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
Todd Klassy
Shuffle explosions :(
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110, A, B)
(94110, A, C)
(94110, E, F)
(94110, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)
(10003, A, R)
(10003, D, E)
javier_artiles
100% less explosions
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110_A, A, B)
(94110_A, A, C)
(94110_A, A, R)
(94110_D, D, R)
(94110_T, T, R)
(10003_A, A, R)
(10003_D, D, E)
(67843_T, T, R)
(94110_E, E, R)
(94110_E, E, R)
(94110_E, E, F)
(94110_T, T, R)
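One way the "junk in the key" idea could look in code, as a sketch only: count records per zip code in two steps, first on a (zip, firstField) key and then on the zip alone. SaltSketch and countByZip are hypothetical names, and the record shape (Int, String, String) is assumed from the example rows:

```scala
import org.apache.spark.rdd.RDD

object SaltSketch {
  /** Count records per zip code in two steps, so a hot zip code is first
    * spread over several (zip, firstField) keys. */
  def countByZip(records: RDD[(Int, String, String)]): RDD[(Int, Long)] = {
    val partial = records
      .map { case (zip, first, _) => ((zip, first), 1L) } // add the "junk" to the key
      .reduceByKey(_ + _)                                  // many smaller groups
    partial
      .map { case ((zip, _), n) => (zip, n) }              // drop the junk again
      .reduceByKey(_ + _)                                  // only a few records per zip left
  }
}
```

For a plain count reduceByKey already combines map-side, so the extra step matters more for operations that sort or group on the skewed key; the count just keeps the sketch short.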
Well there is a bit of magic in the shuffle….
We can reuse shuffle files
But it can (and does) explode*
Sculpture by Flaming Lotus Girls
Photo by Zaskoda
Where can Spark SQL benefit perf?
Structured or semi-structured data
OK with having less* complex operations available to us
We may only need to operate on a subset of the data
The fastest data to process isn’t even read
Remember that non-magic cat? It’s got some magic** now
In part from peeking inside of boxes
non-JVM (aka Python & R) users: saved from double serialization cost! :)
**Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic
Matti Mattila
Why is Spark SQL good for those things?
Space efficient columnar cached representation
Able to push down operations to the data store
Optimizer is able to look inside of our operations
Regular Spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _))
Matti Mattila
How much faster can it be?
Andrew Skudder
But where will it explode?
Iterative algorithms - large plans
Some push downs are sad pandas :(
Default shuffle size is too small for big data (200 partitions WTF?)
Default partition size when reading in is also WTF
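The 200 comes from the spark.sql.shuffle.partitions setting. A small sketch of raising it on a SQLContext (the Spark 1.x entry point); the helper name raiseShufflePartitions is made up here and the right number depends entirely on your data size:

```scala
import org.apache.spark.sql.SQLContext

object ShufflePartitionsSketch {
  /** Override the default number of partitions used by Spark SQL shuffles. */
  def raiseShufflePartitions(sqlCtx: SQLContext, numPartitions: Int): Unit = {
    sqlCtx.setConf("spark.sql.shuffle.partitions", numPartitions.toString)
  }
}
```

The setting applies to shuffles planned after the call, so it should be set before running the joins or aggregations it is meant to help.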
How to avoid lineage explosions:
import org.apache.spark.sql.DataFrame

/**
 * Cut the lineage of a DataFrame which has too long a query plan.
 */
def cutLineage(df: DataFrame): DataFrame = {
  val sqlCtx = df.sqlContext
  //tag::cutLineage[]
  val rdd = df.rdd
  rdd.cache()
  sqlCtx.createDataFrame(rdd, df.schema)
  //end::cutLineage[]
}
karmablue
Introducing Datasets
New in Spark 1.6 - coming to more languages in 2.0
Provide a templated, compile-time strongly typed version of DataFrames
Make it easier to intermix functional & relational code
Do you hate writing UDFs? So do I!
Still an experimental component (API will change in future versions)
Although the next major version seems likely to be 2.0 anyways so lots of things may change regardless
Houser Wolf
Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
  select($"attributes"(0).as[Double]).
  reduce((x, y) => x + y)
So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
  select($"attributes"(0).as[Double]).
  reduce((x, y) => x + y)
toDF(): convert a Dataset to a DataFrame to access more DataFrame functions. Shouldn’t be needed in 2.0+ & almost free anyways
as[RawPanda]: convert the DataFrame back to a Dataset
select(... .as[Double]): a typed query (specifies the return type)
reduce: traditional functional reduction: arbitrary Scala code :)
And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Photo by Christian Heilmann
Iterator to Iterator transformations
Iterator to Iterator transformations are super useful
They allow Spark to spill to disk if reading an entire partition is too much
Not to mention better pipelining when we put multiple transformations together
Most of the default transformations are already set up for this
map, filter, flatMap, etc.
But when we start working directly with the iterators it is up to us to keep things lazy
e.g. mapPartitions, mapPartitionsWithIndex etc.
Sometimes we do this to save setup time on expensive objects
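A sketch of the difference, using plain Scala iterators so it runs without a cluster. ExpensiveParser is a made-up stand-in for whatever costly object you want to build once per partition; either function could be handed to rdd.mapPartitions:

```scala
object IteratorSketch {
  // Hypothetical stand-in for something expensive to construct.
  final class ExpensiveParser {
    def parse(line: String): Int = line.trim.toInt
  }

  /** Iterator-to-iterator: builds the parser once and stays lazy, so records
    * are pulled through one at a time. */
  def parseLazily(lines: Iterator[String]): Iterator[Int] = {
    val parser = new ExpensiveParser
    lines.map(parser.parse)
  }

  /** The OOM-in-the-night version: toList forces the whole partition into
    * memory before any output is produced. */
  def parseEagerly(lines: Iterator[String]): Iterator[Int] = {
    val parser = new ExpensiveParser
    lines.toList.map(parser.parse).iterator
  }
}
```

Both return the same values; only the lazy one lets Spark pipeline the work and spill when a partition is too big.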
Christian Heilmann
Making our code testable
Before we can refactor our code we need to have good tests
Testing individual functions is easy if we factor them out - less lambdas :(
Testing our Spark transformations is a bit trickier, but there are a variety of tools
Property checking can save us from having to come up with lots of test cases
A simple unit test with spark-testing-base
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}
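The slide does not show SampleRDD itself. A guess at a tokenize that would satisfy the test above, assuming it simply splits each line on spaces:

```scala
import org.apache.spark.rdd.RDD

object SampleRDD {
  /** Split each input line on single spaces into a list of tokens.
    * Assumed implementation; the real one is not shown in the deck. */
  def tokenize(input: RDD[String]): RDD[List[String]] =
    input.map(_.split(" ").toList)
}
```

Keeping the transformation in a named function like this, instead of an inline lambda, is what makes it easy to call from a test.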
OK, but what about problems @ scale?
Maybe our program works fine on our local sized input
If we are using Spark our actual workload is probably huge
How do we test workloads too large for a single machine?
we can’t just use parallelize and collect
Qfamily
Distributed “set” operations to the rescue*
Pretty close - already built into Spark
Doesn’t do so well with floating points :(
damn floating points keep showing up everywhere :p
Doesn’t really handle duplicates very well
{“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations...
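A sketch of the set-operation approach using RDD.subtract in both directions, mostly to show the duplicate blind spot. SetCompareSketch and looksEqualAsSets are names invented for this example:

```scala
import org.apache.spark.rdd.RDD

object SetCompareSketch {
  /** True when neither RDD contains an element the other lacks.
    * Counts are ignored, so duplicates go unnoticed. */
  def looksEqualAsSets[T](expected: RDD[T], result: RDD[T]): Boolean =
    expected.subtract(result).isEmpty() && result.subtract(expected).isEmpty()
}
```

With expected = coffee, coffee, panda and result = panda, coffee this returns true even though the two RDDs differ, which is why a count-aware comparison is worth having.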
Matti Mattila
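Or use RDDComparisons:
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Returns the first pair of elements that differ, if any
def compareWithOrderSamePartitioner[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, T)] = {
  expected.zip(result).filter{case (x, y) => x != y}.take(1).headOption
}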
Matti Mattila
Or use RDDComparisons:
// Returns the first element whose count differs, with its count in each RDD
def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = {
  val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _)
  val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _)
  expectedKeyed.cogroup(resultKeyed).filter{case (_, (i1, i2)) =>
    i1.isEmpty || i2.isEmpty || i1.head != i2.head}.take(1).headOption.
    map{case (v, (i1, i2)) => (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))}
}
Matti Mattila
But where do we get the data for those tests?
If you have production data you can sample you are lucky!
If possible you can try and save in the same format
If our data is a bunch of Vectors or Doubles Spark’s got tools :)
Coming up with good test data can take a long time
Lori Rielly
QuickCheck / ScalaCheck
QuickCheck generates test data under a set of constraints
Scala version is ScalaCheck - supported by the two unit testing libraries for Spark
sscheck
Awesome people*, supports generating DStreams too!
spark-testing-base
Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs
tara hunt
With spark-testing-base
test("map should not change number of elements") {
  val property =
    forAll(RDDGenerator.genRDD[String](sc)(Arbitrary.arbitrary[String])) {
      rdd => rdd.map(_.length).count() == rdd.count()
    }
  check(property)
}
With spark-testing-base & a million entries
test("map should not change number of elements") {
  implicit val generatorDrivenConfig =
    PropertyCheckConfig(minSize = 0, maxSize = 1000000)
  val property =
    forAll(RDDGenerator.genRDD[String](sc)(Arbitrary.arbitrary[String])) {
      rdd => rdd.map(_.length).count() == rdd.count()
    }
  check(property)
}
Additional Spark Testing Resources
Libraries
Scala: spark-testing-base (scalacheck & unit), sscheck (scalacheck), example-spark (unit)
Java: spark-testing-base (unit)
Python: spark-testing-base (unittest2), pyspark.test (pytest)
Strata San Jose Talk (up on YouTube)
Blog posts
Unit Testing Spark with Java by Jesse Anderson
Making Apache Spark Testing Easy with Spark Testing Base
Unit testing Apache Spark with py.test
raider of gin
Additional Spark Resources
Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
http://spark.apache.org/docs/latest/
Kay Ousterhout’s work
http://www.eecs.berkeley.edu/~keo/
Books
Videos
Spark Office Hours
Normally in the Bay Area - will do Google Hangouts ones soon
follow me on twitter for future ones - https://twitter.com/holdenkarau
raider of gin
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
Coming soon: High Performance Spark
And the next book…..
First four chapters are available in “Early Release”*:
Buy from O’Reilly - http://bit.ly/highPerfSpark
Book signing Friday June 3rd 10:45** @ O’Reilly booth
Get notified when updated & finished:
http://www.highperformancespark.com
https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome book.
**Normally there are about 25 printed copies, first come first served, and when we run out it’s just me, a stuffed animal, and a bud light lime (or local equivalent)
Spark Videos
Apache Spark Youtube Channel
My Spark videos on YouTube - http://bit.ly/holdenSparkVideos
Spark Summit 2014 training
Paco’s Introduction to Apache Spark
k thnx bye!
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
PySpark users: have some simple UDFs you wish ran faster that you are willing to share? http://bit.ly/pySparkUDF

More Related Content

What's hot

Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark ML
Holden Karau
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017
Holden Karau
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Holden Karau
 
Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016
Holden Karau
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
Spark Summit
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Holden Karau
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Holden Karau
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Holden Karau
 
Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling   global big data tech conference 2015 sjBeyond shuffling   global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sj
Holden Karau
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016
Holden Karau
 
Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!
Holden Karau
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Holden Karau
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Holden Karau
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
Holden Karau
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Holden Karau
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
sparktc
 
Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016
Holden Karau
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache Spark
Holden Karau
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018
Holden Karau
 

What's hot (20)

Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark ML
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
 
Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016Beyond shuffling - Scala Days Berlin 2016
Beyond shuffling - Scala Days Berlin 2016
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling   global big data tech conference 2015 sjBeyond shuffling   global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sj
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016
 
Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
 
Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache Spark
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018
 

Similar to Beyond shuffling - Strata London 2016

Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
Spark Summit
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...
Holden Karau
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
dhiguero
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Survey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsSurvey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and Analytics
Yannick Pouliot
 
Hadoop
HadoopHadoop
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
Paco Nathan
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Writing your own RDD for fun and profit
Writing your own RDD for fun and profitWriting your own RDD for fun and profit
Writing your own RDD for fun and profit
Pawel Szulc
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Vincent Poncet
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Summit
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
Wisely chen
 
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
Andy Petrella
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Serban Tanasa
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
Berlin buzzwords 2018
Berlin buzzwords 2018Berlin buzzwords 2018
Berlin buzzwords 2018
Matija Gobec
 
Spark_tutorial (1).pptx
Spark_tutorial (1).pptxSpark_tutorial (1).pptx
Spark_tutorial (1).pptx
0111002
 

Similar to Beyond shuffling - Strata London 2016 (20)

Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Survey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsSurvey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and Analytics
 
Hadoop
HadoopHadoop
Hadoop
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Writing your own RDD for fun and profit
Writing your own RDD for fun and profitWriting your own RDD for fun and profit
Writing your own RDD for fun and profit
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Berlin buzzwords 2018
Berlin buzzwords 2018Berlin buzzwords 2018
Berlin buzzwords 2018
 
Spark_tutorial (1).pptx
Spark_tutorial (1).pptxSpark_tutorial (1).pptx
Spark_tutorial (1).pptx
 

Beyond shuffling - Strata London 2016

  • 1. Beyond Shuffling tips & tricks for scaling Apache Spark Revised May 2016
  • 2. Who am I? My name is Holden Karau Preferred pronouns are she/her I’m a Principal Software Engineer at IBM’s Spark Technology Center previously Alpine, Databricks, Google, Foursquare & Amazon co-author of Learning Spark & Fast Data Processing with Spark co-author of a new book focused on Spark performance coming out next year* @holdenkarau Slide share http://www.slideshare.net/hkarau
  • 3. What is going to be covered: What I think I might know about you RDD re-use (caching, persistence levels, and checkpointing) Working with key/value data Why groupByKey is evil and what we can do about it When Spark SQL can be amazing and wonderful A brief introduction to Datasets (new in Spark 1.6) Iterator-to-Iterator transformations (or yet another way to go OOM in the night) How to test your Spark code :) Torsten Reuschling
  • 5. The different pieces of Spark Apache Spark SQL & DataFrames Streaming Language APIs Scala, Java, Python, & R Graph Tools Spark ML Bagel & GraphX MLlib Community Packages Jon Ross
  • 6. Who I think you wonderful humans are? Nice* people Don’t mind pictures of cats Know some Apache Spark Want to scale your Apache Spark jobs Don’t overly mind a grab-bag of topics Lori Erickson
  • 7. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455 Photo from Cocoa Dream
  • 8. Let's look at some old standbys: val words = rdd.flatMap(_.split(" ")) val wordPairs = words.map((_, 1)) val grouped = wordPairs.groupByKey() grouped.mapValues(_.sum) val warnings = rdd.filter(_.toLowerCase.contains("error")).count() Tomomi
  • 9. RDD re-use - sadly not magic If we know we are going to re-use the RDD what should we do? If it fits nicely in memory caching in memory persisting at another level MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER checkpointing Noisy clusters _2 (replicated) storage levels & checkpointing can help persist first for checkpointing Richard Gillin
  • 10. Considerations for Key/Value Data What does the distribution of keys look like? What type of aggregations do we need to do? Do we want our data in any particular order? Are we joining with another RDD? What's our partitioner? If we don’t have an explicit one: what is the partition structure? eleda 1
  • 11. What is key skew and why do we care? Keys aren’t evenly distributed Sales by zip code, or records by city, etc. groupByKey will explode (but it's pretty easy to break) We can have really unbalanced partitions If we have enough key skew sortByKey could even fail Stragglers (uneven sharding can make some tasks take much longer) Mitchell Joyce
  • 12. groupByKey - just how evil is it? Pretty evil Groups all of the records with the same key into a single record Even if we immediately reduce it (e.g. sum it or similar) This can be too big to fit in memory, then our job fails Unless we are in SQL then happy pandas PROgeckoam
  • 13. So what does that look like? (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) (67843, T, R) (10003, A, R) (94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)]) Tomomi
  • 14. Let’s revisit wordcount with groupByKey val words = rdd.flatMap(_.split(" ")) val wordPairs = words.map((_, 1)) val grouped = wordPairs.groupByKey() grouped.mapValues(_.sum) Tomomi
  • 15. And now back to the “normal” version val words = rdd.flatMap(_.split(" ")) val wordPairs = words.map((_, 1)) val wordCounts = wordPairs.reduceByKey(_ + _) wordCounts
  • 16. Let’s see what it looks like when we run the two Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions // Evil group by key version val words = rdd.flatMap(_.split(" ")) val wordPairs = words.map((_, 1)) val grouped = wordPairs.groupByKey() val evilWordCounts = grouped.mapValues(_.sum) evilWordCounts.take(5) // Less evil version val wordCounts = wordPairs.reduceByKey(_ + _) wordCounts.take(5)
  • 19. So what did we do instead? reduceByKey Works when the types are the same (e.g. in our summing version) aggregateByKey Doesn’t require the types to be the same (e.g. computing stats model or similar) Allows Spark to pipeline the reduction & skip making the list We also got a map-side reduction (note the difference in shuffled read)
  • 20. So why did we read in python/*.py If we just read in the standard README.md file there aren’t enough duplicated keys for the reduceByKey & groupByKey difference to be really apparent Which is why groupByKey can be safe sometimes
  • 21. Can just the shuffle cause problems? Sorting by key can put all of the records in the same partition We can run into partition size limits (around 2GB) Or just get bad performance To handle data like the above we can add some “junk” to our key (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) PROTodd Klassy
  • 22. Shuffle explosions :( (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) (94110, A, B) (94110, A, C) (94110, E, F) (94110, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (94110, T, R) (94110, T, R) (67843, T, R)(10003, A, R) (10003, D, E) javier_artiles
  • 23. 100% less explosions (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) (94110_A, A, B) (94110_A, A, C) (94110_A, A, R) (94110_D, D, R) (94110_T, T, R) (10003_A, A, R) (10003_D, D, E) (67843_T, T, R) (94110_E, E, R) (94110_E, E, R) (94110_E, E, F) (94110_T, T, R)
  • 24. Well there is a bit of magic in the shuffle…. We can reuse shuffle files But it can (and does) explode* Sculpture by Flaming Lotus Girls Photo by Zaskoda
  • 25. Where can Spark SQL benefit perf? Structured or semi-structured data OK with having less* complex operations available to us We may only need to operate on a subset of the data The fastest data to process isn’t even read Remember that non-magic cat? It's got some magic** now In part from peeking inside of boxes non-JVM (aka Python & R) users: saved from double serialization cost! :) **Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic Matti Mattila
  • 26. Why is Spark SQL good for those things? Space efficient columnar cached representation Able to push down operations to the data store Optimizer is able to look inside of our operations Regular spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _)) Matti Mattila
  • 27. How much faster can it be? Andrew Skudder
  • 28. But where will it explode? Iterative algorithms - large plans Some push downs are sad pandas :( Default shuffle size is too small for big data (200 partitions WTF?) Default partition size when reading in is also WTF
  • 29. How to avoid lineage explosions: /** * Cut the lineage of a DataFrame which has too long a query plan. */ def cutLineage(df: DataFrame): DataFrame = { val sqlCtx = df.sqlContext //tag::cutLineage[] val rdd = df.rdd rdd.cache() sqlCtx.createDataFrame(rdd, df.schema) //end::cutLineage[] } karmablue
  • 30. Introducing Datasets New in Spark 1.6 - coming to more languages in 2.0 Provide templated compile time strongly typed version of DataFrames Make it easier to intermix functional & relational code Do you hate writing UDFS? So do I! Still an experimental component (API will change in future versions) Although the next major version seems likely to be 2.0 anyways so lots of things may change regardless Houser Wolf
  • 31. Using Datasets to mix functional & relational style: val ds: Dataset[RawPanda] = ... val happiness = ds.toDF().filter($"happy" === true).as[RawPanda]. select($"attributes"(0).as[Double]). reduce((x, y) => x + y)
  • 32. So what was that? ds.toDF().filter($"happy" === true).as[RawPanda]. select($"attributes"(0).as[Double]). reduce((x, y) => x + y) Convert a Dataset to a DataFrame to access more DataFrame functions. Shouldn’t be needed in 2.0+ & almost free anyways Convert DataFrame back to a Dataset A typed query (specifies the return type). Traditional functional reduction: arbitrary Scala code :)
  • 33. And functional style maps: /** * Functional map + Dataset, sums the positive attributes for the pandas */ def funMap(ds: Dataset[RawPanda]): Dataset[Double] = { ds.map{rp => rp.attributes.filter(_ > 0).sum} }
  • 34. Photo by Christian Heilmann
  • 35. Iterator to Iterator transformations Iterator to Iterator transformations are super useful They allow Spark to spill to disk if reading an entire partition is too much Not to mention better pipelining when we put multiple transformations together Most of the default transformations are already set up for this map, filter, flatMap, etc. But when we start working directly with the iterators Sometimes to save setup time on expensive objects e.g. mapPartitions, mapPartitionsWithIndex etc. Christian Heilmann
  • 36. Making our code testable Before we can refactor our code we need to have good tests Testing individual functions is easy if we factor them out - less lambdas :( Testing our Spark transformations is a bit trickier, but there are a variety of tools Property checking can save us from having to come up with lots of test cases
  • 37. A simple unit test with spark-testing-base class SampleRDDTest extends FunSuite with SharedSparkContext { test("really simple transformation") { val input = List("hi", "hi holden", "bye") val expected = List(List("hi"), List("hi", "holden"), List("bye")) assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected) } }
  • 38. Ok but what about problems @ scale Maybe our program works fine on our local sized input If we are using Spark our actual workload is probably huge How do we test workloads too large for a single machine? we can’t just use parallelize and collect Qfamily
  • 39. Distributed “set” operations to the rescue* Pretty close - already built into Spark Doesn’t do so well with floating points :( damn floating points keep showing up everywhere :p Doesn’t really handle duplicates very well {“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations... Matti Mattila
  • 40. Or use RDDComparisions: def compareWithOrderSamePartitioner[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, T)] = { expected.zip(result).filter{case (x, y) => x != y}.take(1).headOption } Matti Mattila
  • 41. Or use RDDComparisions: def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = { val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _) val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _) expectedKeyed.cogroup(resultKeyed).filter{case (_, (i1, i2)) => i1.isEmpty || i2.isEmpty || i1.head != i2.head}.take(1).headOption. map{case (v, (i1, i2)) => (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))} } Matti Mattila
  • 42. But where do we get the data for those tests? If you have production data you can sample you are lucky! If possible you can try and save in the same format If our data is a bunch of Vectors or Doubles Spark’s got tools :) Coming up with good test data can take a long time Lori Rielly
  • 43. QuickCheck / ScalaCheck QuickCheck generates test data under a set of constraints Scala version is ScalaCheck - supported by the two unit testing libraries for Spark sscheck Awesome people*, supports generating DStreams too! spark-testing-base Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs PROtara hunt
  • 44. With spark-testing-base test("map should not change number of elements") { val property = forAll(RDDGenerator.genRDD[String](sc) (Arbitrary.arbitrary[String])) { rdd => rdd.map(_.length).count() == rdd.count() } check(property) }
  • 45. With spark-testing-base & a million entries test("map should not change number of elements") { implicit val generatorDrivenConfig = PropertyCheckConfig(minSize = 0, maxSize = 1000000) val property = forAll(RDDGenerator.genRDD[String](sc) (Arbitrary.arbitrary[String])) { rdd => rdd.map(_.length).count() == rdd.count() } check(property) }
  • 46. Additional Spark Testing Resources Libraries Scala: spark-testing-base (scalacheck & unit) sscheck (scalacheck) example-spark (unit) Java: spark-testing-base (unit) Python: spark-testing-base (unittest2), pyspark.test (pytest) Strata San Jose Talk (up on YouTube) Blog posts Unit Testing Spark with Java by Jesse Anderson Making Apache Spark Testing Easy with Spark Testing Base Unit testing Apache Spark with py.test raider of gin
  • 47. Additional Spark Resources Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.) http://spark.apache.org/docs/latest/ Kay Ousterhout’s work http://www.eecs.berkeley.edu/~keo/ Books Videos Spark Office Hours Normally in the bay area - will do Google Hangouts ones soon follow me on twitter for future ones - https://twitter.com/holdenkarau raider of gin
  • 48. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark
  • 49. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Coming soon: Spark in Action Coming soon: High Performance Spark
  • 50. And the next book….. First four chapters are available in “Early Release”*: Buy from O’Reilly - http://bit.ly/highPerfSpark Book signing Friday June 3rd 10:45** @ O’Reilly booth Get notified when updated & finished: http://www.highperformancespark.com https://twitter.com/highperfspark * Early Release means extra mistakes, but also a chance to help us make a more awesome book. **Normally there are about 25 printed copies, first come first served, and when we run out it's just me, a stuffed animal, and a bud light lime (or local equivalent)
  • 51. Spark Videos Apache Spark Youtube Channel My Spark videos on YouTube - http://bit.ly/holdenSparkVideos Spark Summit 2014 training Paco’s Introduction to Apache Spark
  • 52. k thnx bye! If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark Will tweet results “eventually” @holdenkarau PySpark Users: Have some simple UDFs you wish ran faster you are willing to share?: http://bit.ly/pySparkUDF
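
Slides 21 to 23 suggest adding some “junk” to a skewed key so that one hot zip code no longer lands in a single group. The sketch below is one possible reading of that idea, written against plain Scala collections rather than Spark so it runs without a cluster. The object name SaltedKeys and the helpers salt and largestGroup are made up for illustration; the records and the 94110_A style of salted key are copied from the slides.

```scala
object SaltedKeys {
  // (zip, from, to) records, copied from slide 21.
  val records: Seq[(Int, String, String)] = Seq(
    (94110, "A", "B"), (94110, "A", "C"), (10003, "D", "E"),
    (94110, "E", "F"), (94110, "A", "R"), (10003, "A", "R"),
    (94110, "D", "R"), (94110, "E", "R"), (94110, "E", "R"),
    (67843, "T", "R"), (94110, "T", "R"), (94110, "T", "R")
  )

  // Hypothetical helper: append the second field to the zip code,
  // producing keys such as "94110_A" as on slide 23.
  def salt(record: (Int, String, String)): (String, (String, String)) = {
    val (zip, from, to) = record
    (s"${zip}_$from", (from, to))
  }

  // Size of the biggest group a given keying would produce.
  def largestGroup[K](keys: Seq[K]): Int =
    keys.groupBy(identity).values.map(_.size).max

  def main(args: Array[String]): Unit = {
    val before = largestGroup(records.map(_._1))
    val after = largestGroup(records.map(salt).map(_._1))
    println(s"largest group before salting: $before, after salting: $after")
  }
}
```

In a real job the same salt function could be passed to map on a pair RDD before the sort or shuffle. If the goal is an aggregate per original zip code, a second, much smaller reduction over the unsalted key would still be needed afterwards.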

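Slide 35 describes iterator-to-iterator transformations without showing one. Below is a minimal sketch of the difference, again in plain Scala so it runs on its own. The Tokenizer class stands in for an “expensive object” and is purely hypothetical, as are the names lazyLengths and eagerLengths. Either function has the Iterator[T] => Iterator[U] shape that RDD.mapPartitions accepts.

```scala
object IteratorToIterator {
  // Hypothetical stand-in for an object that is expensive to set up,
  // which is why we would want to build it once per partition.
  final class Tokenizer {
    def tokens(line: String): Iterator[String] =
      line.split(" ").iterator.filter(_.nonEmpty)
  }

  // Iterator-to-iterator: records are pulled through one at a time,
  // so the whole partition never has to sit in memory at once.
  def lazyLengths(lines: Iterator[String]): Iterator[Int] = {
    val tokenizer = new Tokenizer
    lines.flatMap(tokenizer.tokens).map(_.length)
  }

  // Same result, but toList materialises the entire partition first.
  // This is the pattern that can go OOM on a large partition.
  def eagerLengths(lines: Iterator[String]): Iterator[Int] = {
    val tokenizer = new Tokenizer
    lines.toList.flatMap(line => tokenizer.tokens(line).toList).map(_.length).iterator
  }

  def main(args: Array[String]): Unit = {
    println(lazyLengths(Iterator("hi", "hi holden", "bye")).toList)
    println(eagerLengths(Iterator("hi", "hi holden", "bye")).toList)
  }
}
```

With Spark on the classpath, the lazy version would be used as rdd.mapPartitions(IteratorToIterator.lazyLengths). The two functions return the same values; only the memory behaviour differs.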
Editor's Notes

  1. https://www.flickr.com/photos/torsten-reuschling/4489795036/in/photolist-7QKnDf-7Gkbt8-5LEi6M-9KSLFE-6Vs1bE-8NdViU-6izzpM-9xpNrw-axS4LK-9Dw1Kk-94mkbN-9T4Aad-9zhcZW-99t2R8-e5RwZ-9bprvG-5sse2X-34dFWm-9fLKRk-9dCx8r-96rzDf-7q8b4a-8qMe6K-6VCQac-ajbXRy-2ShesV-5LJw2W-bEYk3P-aiKz7J-9xUEkf-6Xu4CG-aV6jHZ-6u5wQG-91Y9ac-booFDT-9ddxwE-9n8uVD-bK41MK-dTQkdt-5JVdqi-bCBLSK-7SEWVH-9qaou1-6xfxE5-7yrNCC-7kZgKV-satSwL-5LE1Wz-imuSHr-8QPQut
  2. From https://bighugelabs.com/poster.php https://www.flickr.com/photos/hyc/122643306/in/photolist-bQzzL-7Jd4VQ-6GHHbB-xfBFy-bQwKT-pyEuFR-bQCzf-fRWyLU-bQCaV-5KPina-bQyEN-aoTEiU-9fRxRC-63j3uo-r7ZoEA-bQwmg-93NCZM-phsK23-5E3jw-kVHuU-ht6Gp-4RDvR1-xfBkB-qM5M7-bQCqL-ox8Py-bQvVn-ByUasK-b6CjC-4TCPph-wNxqK-66Yosr-PoxA-49Gaji-ofQfGw-4svq18-fzCrBP-bNj7Cn-bhomEe-8KTkAm-cmLC17-4Lexyy-adYKN-4wFtZT-8v5CKo-qdgSwb-7zViMM-xfBzu-bQx4y-xfBoe https://www.flickr.com/photos/99539788@N08/9653874479/in/photolist-fH5CxH-7Kj3RD-7nTf1j-6k1hTt-5LQQo-dMMnQ-qofbJR-tNhQU-5HXkzh-mZpXh-2KG7LR-GybCoW-bF6zi-pEN3Sp-d3wMuh-hhRsP-4zXtja-5iyHio-jt7X2k-oNuSF9-r33dme-scosN6-7aZiRM-h1LHU-6eHyf9-9m7M7g-5JTkqx-rYec11-cNcb8U-dzt1Gi-nf9QaC-iYyew7-n7kiRo-9maRcs-qu1r6D-5eMjHD-4p45xY-68EiH-9maQYq-7FDeXS-DxGDU-PMLcF-cih8kj-3Jxqj1-5Jd1MC-4hFmyA-qJngBv-6mrjvt-6ZYKQf-8NKxNT
  3. https://www.flickr.com/photos/jon_a_ross/2679856182/in/photolist-55NXSW-4UZZHe-e1Ubar-8oA19X-4V2hrU-4UX6dT-4HpqVm-58CV9k-ardHmQ-72uLB3-6p6gqL-58gez2-hjhDoA-4MqZrU-8ZMidf-4NFd8N-4NFcMQ-9R6Dr6-55JQDr-rxeWPU-oDVKTS-arbcbX-arbbTp-aVNBqi-47TCvC-4NFctq-b4BE3p-7WcAGh-9w8FFR-6HYNpP-662zun-5LX51n-5BWeR2-oZc3Xk-ewax6c-7Z3vKE-e5W5AJ-bi3HtM-bEBTUZ-s1c3gw-qMbK5K-6heJzF-g6YbwT-aoRa8z-kNDkqL-YRwm-4BESNo-iRhKvk-ib7bUU-nmuxdF
  4. Photo from https://www.flickr.com/photos/lorika/4148361363/in/photolist-7jzriM-9h3my2-9Qn7iD-bp55TS-7YCJ4G-4pVTXa-7AFKbm-bkBfKJ-9Qn6FH-aniTRF-9LmYvZ-HD6w6-4mBo3t-8sekvz-mgpFzD-5z6BRK-de513-8dVhBu-bBZ22n-4Vi2vS-3g13dh-e7aPKj-b6iHHi-4ThGzv-7NcFNK-aniTU6-Kzqxd-7LPmYs-4ok2qy-dLY9La-Nvhey-Kte6U-74B7Ma-6VfnBK-6VjrY7-58kAY9-7qUeDK-4eoSxM-6Vjs5A-9v5Pvb-26mja-4scwq3-GHzAL-672eVr-nFUomD-4s8u8F-5eiQmQ-bxXXCc-5P9cCT-5GX8no
  5. https://www.flickr.com/photos/haoli/6349372032/in/photolist-aF5c6A-beRSyF-cnUjBm-dxujoM-cukarf-5osZv-7LrwZb-8hzdGg-dWAXVw-7j8eCn-mU1GDC-du6Njj-9fNeUF-9fNf2c-jeQw2Z-pCQxin-pCPx1S-oYtpxt-pCSwKY-oYtpz2-5nAgBd-4kR3Xg-2CLt3B-mU1HuL-pCPx4h-54W9r-mTYJGa-pVkTdo-2CLrVX-9qkxeT-9s2gwi-9qkx1X-oYqiWL-pCSwD5-2xFigB-72vWUH-dWoBAi-opf1Pw-7jc8Bu-6QfmGS-pVcDuv-4FDmvY-dWufM9-9rFwy5-RAsAG-csnYJu-7QF7sx-83wqki-6faJ2B-7NJT8E
  6. From https://www.flickr.com/photos/photoverulam/22626301622/in/photolist-AtpHbE-2biJ8i-cbDxLj-5SbTJs-bvJ6pR-4cKd6r-c5io3W-x7fuW-8GEnYV-7ngpwq-7ncv4F-7ncv36-6UPdLM-cS2j3s-6zXf6D-pps5P-6UPdZc-qbhws-egQRmW-61si6q-N864-65o5nN-4D4R6z-wavuvy-zzzrqc-6RG2Wn-zhbLnM-zhbLPP-coidfb-6d9XaA-cfPRY7-coidn7-coidLC-6hDKxj-se5vfT-t8y1tQ-5pRoHx-N854-8UuUYz-msyfx-9DqPba-49vTz-4c4F5-5QL2qk-v7G7z9-w4GYdP-irqiN-6Dc9WZ-2h4pkp-uKaPa
  7. https://www.flickr.com/photos/eleda/531867386/in/photolist-NZXDm-4H2JU2-chHH61-aDPTFx-5SYV6V-cgjJVm-bmsnCt-bWgJiD-eiwHzX-dQgyhR-3bN33R-eXWaq2-7Cr1HJ-5TxxkF-9prgZh-2Fehf-9xVUGJ-guZfLW-bWgJk2-93HkH6-9prh4Q-9poftp-eL99JM-9prerC-93LqUf-eLkz5L-6gsr2T-4ofma3-4obj4M-bV2a3u-7ygQQr-gS4GzY-GTrX9-7cLyNh-6yFvoe-fv6smP-4GRE5r-5kLaJv-5BE2Eg-4GVR4f-5Qnzri-6N33MP-4XfVC8-56HJVB-s5HTfd-4GVPwW-27SD6T-dGk3Vj-4ofqNC-9e2NoY
  8. https://www.flickr.com/photos/hckyso/2055866250/in/photolist-48ERpQ-c3rR-7DyiPo-4wAZRn-hzYJD4-9KvP1D-81rV7R-7F1fnm-dTQkdt-AdUJ-95BJsR-4hy1LG-891ckh-orpiij-7sDjhG-qdro34-s1x8Sm-7X3N8R-9JXXZC-aSJdR-ampKtE-6aTcDC-4P6QUv-9Zry8g-4d54Qi-ZMHEJ-g16RaZ-j95eU-9pp82n-7Efa4Y-apJqbb-6kYmJ8-t4N6G5-DCbLQ-7Smuw8-eir9us-ek6wdx-eiGMj1-5iMBeE-9bh3qr-8MpZPp-9kRy1L-ekLggu-du4gyZ-7bmbow-eir9vo-9kunTs-a2Wru-cQGXy5-DCcaR
  9. photo from https://www.flickr.com/photos/geckoam/2956778600/
  10. https://www.flickr.com/photos/girliemac/6508023679/in/photolist-aV6jGP-rYW3Uk-e5mpBX-5vSsW4-ba1Gog-aVvDhR-6reyLQ-6rez9q-6reyVL-6rezd7-dEoKUS-5C1SPj-6bGQYy-zMjVzp-9EWN4B-5BWzXV-5dnc6i-77fZX5-5BWtHR-fa1iUP-fAuBd-axg5zb-5L1UuF-5L5j1Q-7LZ41T-5BWuVT-5C1PWJ-5BWvtH-5BWvEx-5C1Pbd-5BWvSa-iRZwVZ-5BWv8t-6nQfdh-5L1bkZ-e4kEi1-r2Bis3-Ghv4Q-99TJ5w-aVEWZ6-rBBzT3-bWLL9X-p5jQKP-jtA7Su-nqxTYN-6em9ox-a68TS6-6rapN2-6rez4G-6reyHC
  11. https://www.flickr.com/photos/picsoflife/6684444319/in/photolist-bbFwpZ-kDRdPD-4dft9y-bbFXa6-kLKmq-6FFpRJ-bbFv8B-bbFtXi-52xMWp-yMamtU-aTsVq4-6YkdRU-qKuwaK-5DMr2w-96aGQK-dc1Uud-9mZo1U-5Z1ZtF-41x2VZ-CJwzqw-6N9pTY-72vqix-ynnrR-3KGm3j-5oEY4g-6NAXg-8eDsRR-3KBZL2-43FdQB-7SUcN8-4Pvr3R-uJLoS-62Tf6S-CyEcHD-5N4dwS-7jg7Td-5CWswE-76SasX-rMKRV-767XJ4-63kK4d-7oA2u6-4PJ3UT-pYNj8P-76QYMM-7vuFro-4A5nNt-quby5D-qu3MaW-qu3M7Q
  12. http://wafflescat.com/
  13. https://www.flickr.com/photos/latitudes/66424863/in/photolist-6SrNg-4FS7h3-n3aG-675Ggf-2mvpnV-4EPRi-agTjTx-3fuHL-7xHxwK-2RnrK-9hNfoi-2RnV1-2RnV3-6y5i2D-4EPSo-rgtUq-6amUo9-2RnV4-dxZEgS-HS6QM-dzGYC-cWsXC5-2RnV6-aDHNC-2RnV2-bqQu1Q-5kwTda-n35c-tvq1-rgu6G-NdcJr-6ahMeZ-oUnQSw-4kPxbs-xGmP-63cN61-6ahKok-rgtZY-zE7Wf-dghvFQ-sQaV1s-aLr6Tn-aWCMd4-whPuJ-jhaCqH-wM72t-Z5TfQ-a8Tqys-Nopr3-gz9b7W/
  14. https://twitter.com/javier_artiles/status/724457437189672961
  15. https://www.flickr.com/photos/zaskoda/3918492260/in/photolist-6Ygieo-4KYBWE-84RsUz-5Eimot-shJEhu-9L4ui2-9L4utF-5iuVcg-mJ5aK-o6eW1y-mJkZR-fEz23g-9L7hxm-5iA983-72dDuP-ajZz59-ossMFX-mJm8J-fERApN-fEz11R-fEz2u6-MhysA-4Lcv8q-6XaNSS-akRiJK-PwTMw-fEz1za-fHqbUu-ajWEC2-56CjAx-ajWND4-ajWQaK-ajZzgG-ajWDux-ajZyXj-ajWNmp-ajZABb-ajZrFQ-ajZAVG-mTGXo-9Lmv2J-ogRDSB-6X1Q5z-ajWFEH-ajWN22-ajWDkV-ajWPWn-a9hwc7-ajZB25-ajWMG4
  16. https://www.flickr.com/photos/mattimattila/8190148857/in/photolist-dtJDV4-9tFyUo-9tBqdY-9tymzv-9tBFDf-9tBf1Y-9tyhGp-9tBerj-9tBe4u-9tygGt-9tBc1L-9tB7aJ-9tBeC5-9tzzx6-9tzzq2-9tCw4u-9tzyAv-9tCsHo-9tzvf8-9tyS8X-9tCx5Y-c1JFsu-9tBD8s-9tyt9Z-9tymqa-9tykmD-9tyi3D-9tBPo1-9tyvJt-9tBofj-9tBB9E-9tyBVx-9tBanw-9ty7KM-9ty662-9tCwwY-9tCrVq-4YqUM-9ty2Fv-9tCry1-9tzu72-9tCrbo-9tyReT-c1PhzY-9tyR5r-9tywiK-9tyw9B-9tBt3b-9tBsFs-9tBswf
  17. https://www.flickr.com/photos/lin/438130701/in/photolist-EHwZB-62s6ob-ehu1Vy-uGtao-6xefC-9v9Jrv-caMgTo-yomai-dTfG4Z-68pVbh-8ARiuT-a6DCLa-6MRAV8-bwWKa3-rig6Sq-6cyQm2-jiit9y-9aeC6x-q4crGv-pakTMr-9rj175-9h2JSE-aLR3o2-pRBcCU-bxai1h-f3mkGH-amuapV-khAMhT-bnwBu1-e2eBYx-q6wdiU-6HYV19-af9A17-8hzUcT-6Az1KM-fgSdhH-7kcSUS-aqaRyw-e4bykV-66AYDg-ebLQP2-bjFyTR-eaU17n-9ZKtdZ-wimwgK-8z31fd-of8Brs-99t2R8-6AanG8-fkxYMS
  18. https://www.flickr.com/photos/houserwolf/15121666653/in/photolist-p3fuEz-qhPRYw-qtGC9K-ggSUp4-dPwwo3-bbWZ8g-jg2siS-8YWZH8-ouKa9x-dwTL8F-gTANnQ-gfn7i-aSWr2e-e2bh2N-dea2zL-bmL8wG-bZK7L-4s8Twu-4RMewQ-9CkRuK-broVri-aZZcwK-8W6LoP-dbAgus-cHRNyL-bBrjQR-89J5Eu-9Xmwrn-9Mt6RD-dLyEzf-azJezy-cURUQL-eZao4b-eYY28g-b6D4bR-8zF9T5-b4xRRT-dDv8Pg-eX4cxc-6kPcyC-eYY2oT-aRzQNr-7wZFRx-6oiZEg-eYY2UM-bBaVqp-gSqP4K-aQcNHn-9pxgoX-6pYd5k
  19. https://www.flickr.com/photos/codepo8/4543536014/in/photolist-7VuNXQ-7kBVQ6-n9AJc5-8sW2Yq-bnQJi-6xLVCQ-sGeC9-7qeLWC-6xGGGT-6xGDjP-6xM9Rb-6xGCpM-6xGK8M-aCKCq-6xGQdc-EBdFu-usggf-8rMG3r-cEA6Py-Reqe-bK8bf2-9AJVrH-abECqF-sGeBM-fzPZ1-sGdRf-sGdQu-54e2d2-54iggQ-54e44D-54igU7-c5pF9-54if9N-rd6N-pi7pmz-zUpqxM-6zGsin-77Keuy-46uZse-ny2tj-bx7num-4jKWps-7zmDvJ-AVXUuK-yZBEuA-aEWrJi-yZGK1F-8Ytibh-f9Wm2i-ykmg4R
  20. https://www.flickr.com/photos/dasqfamily/2689550144/in/photolist-56EDyd-efQnXR-7pgVnD-4UMS9Q-7Txe2X-5rKoLh-5rF3Jv-AiJDB-8bdX7z-4WUcin-q64D3v-44zDJw-bVocHs-2JEjyQ-qj83RW-45VZ8N-5hZehC-eFNna-pNRDyB-qisyve-457EPa-e6uRmn-5mjX9f-7iFMZ5-7iFMCY-4Sukyq-76RvKL-8yoq5h-yKcCs-7HsQGb-7HoV2Z-7HoUQt-71XbYG-hFGHfP-6FPisU-bvhhXE-8E3pv3-fQ61m2-6JwT35-ffYJqu-ffJDhg-p9umgB-eiRgoc-ejA2z3-pNQN19-4XsWJL-bRtmHt-q63y18-81VoSv-p9r6vm
  21. https://www.flickr.com/photos/loririelly/310423284/in/photolist-tr12A-ce3umw-ce3usA-sTuju-4KzpXM-bWF7N4-cmgsuj-5VAsVe-5ez5rL-dU3xmR-63Yk5R-aiAzej-axU8gH-itrDy4-yDZ4h-9Ndmmx-7H4JUh-7MxHEw-67LCuV-bN9iWp-dk2fqW-8WyXvu-efkbSV-81V6UF-Kfh5K-649haa-649hTT-6dbcH1-63faGr-63TBmt-63TzMZ-s1SZi-LbcWy-5UkVv5-67uA3-cAb4b-afspPn-Awb3N-8WvNvH-6tjfnB-7teVts-61F8tE-62pHDa-d3rnjE-4AwRMn-5UgxY6-dgv268-62LqUt-64dtZq-645hnk
  22. https://www.flickr.com/photos/missrogue/2976197742/in/photolist-5wZMUb-byJi8-e6WsRf-7fTLLH-5wVqsT-8Qpce3-8QAhV6-94zTWo-9PE5CH-5Pdi4T-5wVpw2-8PCDEp-8N31iD-8Dgo7D-aPHB8k-ecrFYb-3LbyXN-bmMaha-4WBsuY-cXnPdC-8VFVah-9nBTao-oQETkH-8XEpf6-guycFY-5zohWb-axDVJw-9nBB6S-9nBAsS-9nBVAJ-9nyRUi-guyeNd-dWQLit-9nBFkG-9nySHg-9nBKhQ-9nBKYd-9nBJxC-9nyPCZ-9nBCsN-9nyB1i-755Ci6-9nBBHU-9nBPQ1-9nyCFi-f1cuCZ-7cqyfu-9nCnbG-9nzjuB-dhSg5C
  23. https://www.flickr.com/photos/fairerdingo/2320356657/in/photolist-4x3rcg-AvWp-9apz9a-jJhpoQ-2ov2gu-7Rr2qr-3P2KH-5YFdAJ-gACrN-HTUWP-6j9ooG-dXpN3Q-9kccaV-aFuUfB-8ZN65i-6pQSAv-btZvjV-9ddxwE-4Lq8UH-dXaZ7j-73Xojt-mUZSq-fTy1P-e4n9B-hYwP4-89QrWo-67bSSJ-aThabK-bTctDK-94iUu2-asHJSr-bBnVA8-5MbBJM-g2Vrky-efhYzw-8NxAKw-e3baUF-grvK9-48GJ6n-bAV4eh-btJDEK-4zJtyV-8naFTb-dgJfT-5H88ML-vRFsiA-bHt6pc-7eVJa6-bm2YzR-63sSC5
  24. https://www.flickr.com/photos/fairerdingo/2320356657/in/photolist-4x3rcg-AvWp-9apz9a-jJhpoQ-2ov2gu-7Rr2qr-3P2KH-5YFdAJ-gACrN-HTUWP-6j9ooG-dXpN3Q-9kccaV-aFuUfB-8ZN65i-6pQSAv-btZvjV-9ddxwE-4Lq8UH-dXaZ7j-73Xojt-mUZSq-fTy1P-e4n9B-hYwP4-89QrWo-67bSSJ-aThabK-bTctDK-94iUu2-asHJSr-bBnVA8-5MbBJM-g2Vrky-efhYzw-8NxAKw-e3baUF-grvK9-48GJ6n-bAV4eh-btJDEK-4zJtyV-8naFTb-dgJfT-5H88ML-vRFsiA-bHt6pc-7eVJa6-bm2YzR-63sSC5