Beyond Shuffling
tips & tricks for scaling Apache Spark
Global Big Data SJ 2015
early version
Who am I?
My name is Holden Karau
Preferred pronouns are she/her
I’m a Software Engineer at IBM
previously Alpine, Databricks, Google, Foursquare & Amazon
co-author of Learning Spark & Fast Data Processing with Spark
co-author of a new book focused on Spark performance coming out next year*
@holdenkarau
Slide share http://www.slideshare.net/hkarau
What is going to be covered:
What I think I might know about you
RDD re-use (caching, persistence levels, and checkpointing)
Working with key/value data
Why groupByKey is evil and what we can do about it
Best practices for Spark accumulators*
When Spark SQL can be amazing and wonderful
A quick detour into some future performance work in Spark MLlib
Who I think you wonderful humans are?
Nice* people
Know some Apache Spark
Want to scale your Apache Spark jobs
Comfortable reading Scala
Lori Erickson
If you want to follow along with the exercise
Make sure you have recent-ish JDK
Install Spark (any precompiled Hadoop version)
http://spark.apache.org/downloads.html
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
RDD re-use - sadly not magic
If we know we are going to re-use the RDD what should we do?
If it fits nicely in memory: cache it in memory
Otherwise: persist at another level
MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
Or use checkpointing
Noisy clusters
Replicated storage levels (the _2 suffix, e.g. MEMORY_ONLY_2) & checkpointing can help
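The re-use options above can be sketched roughly as follows. This assumes a spark-shell session where `sc` already exists; the sample data and the checkpoint directory `/tmp/spark-checkpoints` are placeholders, not something from the talk.

```scala
import org.apache.spark.storage.StorageLevel

// Assumption: running inside spark-shell, so `sc` is already defined.
// Tiny made-up input so the snippet does not depend on any file.
val lines = sc.parallelize(Seq("a b c", "b c d", "c d e"))
val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty)

// Keep the RDD around for re-use; serialized, spilling to disk if memory is tight.
// On a noisy cluster a replicated level such as MEMORY_AND_DISK_SER_2 is an option.
words.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Checkpointing writes the RDD to the checkpoint directory and truncates its lineage.
// It has to be requested before the first job runs on this RDD.
sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path
words.checkpoint()

// The first action materializes the cache and the checkpoint; later actions re-use them.
val total = words.count()
val distinctWords = words.distinct().count()
```

Which level to pick depends on how expensive the RDD is to recompute versus how much memory it takes; the serialized levels trade CPU for space.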
Richard Gillin
Considerations for Key/Value Data
What does the distribution of keys look like?
What type of aggregations do we need to do?
Do we want our data in any particular order?
Are we joining with another RDD?
What's our partitioner?
eleda 1
What is key skew and why do we care?
Keys aren’t evenly distributed
Sales by zip code, or records by city, etc.
groupByKey will explode (but it's pretty easy to break)
We can have really unbalanced partitions
If we have enough key skew sortByKey could even fail
Stragglers (uneven sharding can make some tasks take much longer)
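One way to check for key skew before running an expensive job is to count records per key and look at the heaviest keys. The sketch below assumes a spark-shell session (`sc` defined) and uses made-up zip code records; on real data a sample of the RDD may be enough.

```scala
// Assumption: running inside spark-shell, so `sc` is already defined.
// The records are invented for illustration.
val byZip = sc.parallelize(Seq(
  ("94110", "a"), ("94110", "b"), ("94110", "c"), ("10003", "d"), ("67843", "e")))

// Count records per key (reduceByKey combines map-side, so this stays cheap),
// then bring only the heaviest keys back to the driver.
val keyCounts = byZip.mapValues(_ => 1L).reduceByKey(_ + _)
val heaviest = keyCounts.top(2)(Ordering.by[(String, Long), Long](_._2))

heaviest.foreach { case (key, n) => println(s"$key -> $n") }
```

If one key holds a large share of the records, operations that bring a key's records together on one task (groupByKey, sortByKey, joins) are the ones to watch.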
Mitchell
Joyce
groupByKey - just how evil is it?
Pretty evil
Groups all of the records with the same key into a single record
Even if we immediately reduce it (e.g. sum it or similar)
This can be too big to fit in memory, then our job fails
Unless we are in SQL then happy pandas
PROgeckoam
Let’s revisit wordcount with groupByKey
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
And now back to the “normal” version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts
Let’s launch spark and compare the two
You can get Spark from http://spark.apache.org/downloads.html
You need a recent version of Java
If installing is difficult don’t worry - the results will be in the slides
Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
Code to compare the two:
val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions
// Evil group by key version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
val evilWordCounts = grouped.mapValues(_.sum)
evilWordCounts.take(5)
// Less evil version
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.take(5)
GroupByKey
reduceByKey
So why did we read in python/pyspark/*.py?
If we just read in the standard README.md file there aren’t enough duplicated keys for the reduceByKey & groupByKey difference to be really apparent
Which is why groupByKey can be safe sometimes
So what did we do instead?
reduceByKey
Works when the types are the same (e.g. in our summing version)
aggregateByKey
Doesn’t require the types to be the same (e.g. computing stats model or similar)
Allows Spark to pipeline the reduction & skip making the list
We also got a map-side reduction (note the difference in shuffled read)
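As a small illustration of the aggregateByKey case, here is a per-key mean where the accumulator type (sum, count) differs from the value type. It assumes a spark-shell session (`sc` defined); the sales data is made up.

```scala
// Assumption: running inside spark-shell, so `sc` is already defined.
// Made-up (zip code, sale amount) pairs.
val sales = sc.parallelize(Seq(("94110", 10.0), ("94110", 30.0), ("10003", 5.0)))

// The accumulator is (sum, count), not a Double, which reduceByKey cannot express directly.
val sumAndCount = sales.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1L), // fold one value into a partition-local accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)   // merge accumulators coming from different partitions
)

// No per-key list of values is ever built; only the small accumulators are shuffled.
val meanByKey = sumAndCount.mapValues { case (sum, n) => sum / n }
meanByKey.collect().foreach(println)
```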
Can just the shuffle cause problems?
Sorting by key can put all of the records in the same partition
We can run into partition size limits (around 2GB)
Or just get bad performance
To handle data like the following we can add some “junk” to our key
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
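One possible reading of adding "junk" to the key is to fold the second field into the key, so the hot zip code is spread over several distinct keys while the sort order by zip code is preserved. The sketch assumes a spark-shell session (`sc` defined) and uses a few of the records above; whether this matches the exact approach used in the talk is an assumption.

```scala
// Assumption: running inside spark-shell, so `sc` is already defined.
val records = sc.parallelize(Seq(
  ("94110", "A", "B"), ("94110", "A", "C"), ("10003", "D", "E"), ("94110", "E", "F")))

// Widen the key from zip to (zip, second field): 94110 now maps to several keys
// that the range partitioner used by sortByKey is free to split across partitions.
val widened = records.map { case (zip, a, b) => ((zip, a), b) }

// Tuple keys compare field by field, so the output is still ordered by zip code first.
val sorted = widened.sortByKey()
sorted.collect().foreach(println)
```

This only helps when the extra field actually varies within the hot key; otherwise a synthetic suffix would be needed instead.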
PROTodd
Klassy
Spark accumulators
Really “great” way for keeping track of failed records
Double counting makes things really tricky
Jobs which worked “fine” don’t continue to work “fine” when minor changes happen
Relative rules can save us* under certain conditions
Found Animals Foundation
Using an accumulator for validation:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map { x =>
  if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
}
// An action (e.g. count, save, etc.)
if (bad.value > 0.1 * ok.value) {
  // Optional cleanup
  throw new Exception("bad data - do not use results")
}
// Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
Where can Spark SQL benefit perf?
Structured or semi-structured data
OK with having less* complex operations available to us
We may only need to operate on a subset of the data
The fastest data to process isn’t even read
Remember that non-magic cat? It's got some magic** now
In part from peeking inside of boxes
**Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic
Matti Mattila
Why is Spark SQL good for those things?
Space efficient columnar cached representation
Able to push down operations to the data store
Optimizer is able to look inside of our operations
Regular Spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _))
Matti Mattila
Preview: bringing codegen to Spark ML
Based on Spark SQL’s code generation
First draft using quasiquotes
Switch to Janino for Java compilation
Initial draft for Gradient Boosted Trees
Based on DB’s work
First draft with QuasiQuotes
Moved to Java for speed
See SPARK-10387 for the details
Jon
What the generated code looks like:
@Override
public double call(Vector input) throws Exception {
  if (input.apply(1) <= 1.0) {
    return 0.1;
  } else {
    if (input.apply(0) <= 0.5) {
      return 0.0;
    } else {
      return 2.0;
    }
  }
}
[Figure: decision tree. The root node (feature 1, threshold 1.0) has leaf 0.1 and a child node (feature 0, threshold 0.5) with leaves 0.0 and 2.0.]
Glenn Simmons
Everyone* needs reduce, let’s make it faster!
reduce & aggregate have “tree” versions
we already had free map-side reduction
but now we can get even better!**
**And we might be able to make even cooler versions
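The "tree" versions mentioned above are available on RDDs as `treeReduce` and `treeAggregate`. A minimal sketch, assuming a spark-shell session (`sc` defined); the partition count and `depth` value are arbitrary choices for illustration.

```scala
// Assumption: running inside spark-shell, so `sc` is already defined.
val nums = sc.parallelize(1 to 1000, 50) // 50 partitions, chosen arbitrarily

// Plain reduce: every partition's partial result is merged on the driver.
val plainSum = nums.reduce(_ + _)

// treeReduce: partial results are merged on executors in up to `depth` levels
// before the final merge, which takes load off the driver with many partitions.
val treeSum = nums.treeReduce(_ + _, depth = 3)

// treeAggregate: same idea, but the result type can differ from the element type.
val (sum, count) = nums.treeAggregate((0L, 0L))(
  (acc, x) => (acc._1 + x, acc._2 + 1L),
  (a, b) => (a._1 + b._1, a._2 + b._2),
  depth = 3)
```

For a handful of partitions the extra stages usually cost more than they save; the tree versions tend to pay off when there are many partitions or large partial results.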
Additional Resources
Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
http://spark.apache.org/docs/latest/
Books
Videos
Our next meetup!
Spark Office Hours
follow me on twitter for future ones - https://twitter.com/holdenkarau
fill out this survey to choose the next date - http://bit.ly/spOffice1
raider of gin
Q&A OR A quick detour into spark testing?
It's like a choose your own adventure novel, but with voting
But more like the voting in High School since if we are running out of time we might just skip it
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon:
Spark in Action
Spark Videos
Apache Spark Youtube Channel
My Spark videos on YouTube - http://bit.ly/holdenSparkVideos
Spark Summit 2014 training
Paco’s Introduction to Apache Spark
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau

State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 

Beyond shuffling global big data tech conference 2015 sj

  • 1. Beyond Shuffling tips & tricks for scaling Apache Spark Global Big Data SJ 2015 early version
  • 2. Who am I? My name is Holden Karau Preferred pronouns are she/her I’m a Software Engineer at IBM previously Alpine, Databricks, Google, Foursquare & Amazon co-author of Learning Spark & Fast Data Processing with Spark co-author of a new book focused on Spark performance coming out next year* @holdenkarau Slide share http://www.slideshare.net/hkarau
  • 3. What is going to be covered: What I think I might know about you RDD re-use (caching, persistence levels, and checkpointing) Working with key/value data Why groupByKey is evil and what we can do about it Best practices for Spark accumulators* When Spark SQL can be amazing and wonderful A quick detour into some future performance work in Spark MLlib
  • 4. Who I think you wonderful humans are? Nice* people Know some Apache Spark Want to scale your Apache Spark jobs Comfortable reading Scala Lori Erickson
  • 5. If you want to follow along with the exercise Make sure you have recent-ish JDK Install Spark (any precompiled Hadoop version) http://spark.apache.org/downloads.html
  • 6. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455 Photo from Cocoa Dream
  • 7. RDD re-use - sadly not magic If we know we are going to re-use the RDD what should we do? If it fits nicely in memory: caching in memory. Otherwise: persisting at another level (MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER) or checkpointing. Noisy clusters: _2 & checkpointing can help Richard Gillin
  • 8. Considerations for Key/Value Data What does the distribution of keys look like? What type of aggregations do we need to do? Do we want our data in any particular order? Are we joining with another RDD? What's our partitioner? eleda 1
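The re-use options on slide 7 can be sketched roughly as follows. This is a minimal illustration, not code from the talk: the object name `ReuseExample`, the sample data, and the `checkpointDir` parameter are all assumptions, and it uses `SparkSession`, which arrived in Spark 2.0, after this talk was given.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical example object; names and data are illustrative only.
object ReuseExample {
  def run(spark: SparkSession, checkpointDir: String): (Long, Long) = {
    val sc = spark.sparkContext
    // Checkpoints are written here; on a real cluster this should be reliable storage.
    sc.setCheckpointDir(checkpointDir)

    val evens = sc.parallelize(1 to 1000, 8).filter(_ % 2 == 0)

    // Serialized in memory, spilling to disk; the "_2" suffix asks for a second replica.
    evens.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
    // checkpoint() has to be requested before the first action on this RDD.
    evens.checkpoint()

    val n = evens.count()                            // computes, stores, then checkpoints
    val total = evens.map(_.toLong).reduce(_ + _)    // re-uses the stored partitions
    (n, total)
  }
}
```

Persisting before checkpointing is a deliberate choice here: without it, writing the checkpoint would recompute the RDD from scratch.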
  • 9. What is key skew and why do we care? Keys aren’t evenly distributed Sales by zip code, or records by city, etc. groupByKey will explode (but it's pretty easy to break) We can have really unbalanced partitions If we have enough key skew sortByKey could even fail Stragglers (uneven sharding can make some tasks take much longer) Mitchell Joyce
  • 10. groupByKey - just how evil is it? Pretty evil Groups all of the records with the same key into a single record Even if we immediately reduce it (e.g. sum it or similar) This can be too big to fit in memory, then our job fails Unless we are in SQL then happy pandas geckoam
  • 11. Let’s revisit wordcount with groupByKey
        val words = rdd.flatMap(_.split(" "))
        val wordPairs = words.map((_, 1))
        val grouped = wordPairs.groupByKey()
        grouped.mapValues(_.sum)
  • 12. And now back to the “normal” version
        val words = rdd.flatMap(_.split(" "))
        val wordPairs = words.map((_, 1))
        val wordCounts = wordPairs.reduceByKey(_ + _)
        wordCounts
  • 13. Let’s launch spark and compare the two You can get Spark from http://spark.apache.org/downloads.html You need a recent version of Java If installing is difficult don’t worry - the results will be in the slides Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
  • 14. Code to compare the two: Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
        val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions
        // Evil group by key version
        val words = rdd.flatMap(_.split(" "))
        val wordPairs = words.map((_, 1))
        val grouped = wordPairs.groupByKey()
        val evilWordCounts = grouped.mapValues(_.sum)
        evilWordCounts.take(5)
        // Less evil version
        val wordCounts = wordPairs.reduceByKey(_ + _)
        wordCounts.take(5)
  • 17. So why did we read in python/*.py? If we just read in the standard README.md file there aren’t enough duplicated keys for the reduceByKey & groupByKey difference to be really apparent Which is why groupByKey can be safe sometimes
  • 18. So what did we do instead? reduceByKey Works when the types are the same (e.g. in our summing version) aggregateByKey Doesn’t require the types to be the same (e.g. computing stats model or similar) Allows Spark to pipeline the reduction & skip making the list We also got a map-side reduction (note the difference in shuffled read)
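Slide 18 notes that aggregateByKey does not require the input and result types to match. A minimal sketch of that idea, computing a per-key mean from Int values using a (sum, count) pair as the intermediate type; `MeanByKey` and `meanByKey` are hypothetical names, not from the talk.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper; illustrates aggregateByKey with a result type
// ((Long, Long)) that differs from the value type (Int).
object MeanByKey {
  def meanByKey(pairs: RDD[(String, Int)]): RDD[(String, Double)] =
    pairs
      .aggregateByKey((0L, 0L))(
        // seqOp: fold one value into the running (sum, count) within a partition
        (acc: (Long, Long), v: Int) => (acc._1 + v, acc._2 + 1L),
        // combOp: merge partial (sum, count) pairs from different partitions
        (a: (Long, Long), b: (Long, Long)) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum.toDouble / count }
}
```

Because the per-partition fold runs before the shuffle, only one small (sum, count) pair per key and partition is shuffled, rather than every record.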
  • 19. Can just the shuffle cause problems? Sorting by key can put all of the records in the same partition We can run into partition size limits (around 2GB) Or just get bad performance To handle data like the following, we can add some “junk” to our key: (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) Todd Klassy
  • 20. Spark accumulators Really “great” way for keeping track of failed records Double counting makes things really tricky Jobs which worked “fine” don’t continue to work “fine” when minor changes happen Relative rules can save us* under certain conditions Found Animals Foundation
  • 21. Using an accumulator for validation:
        val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
        val records = input.map{ x =>
          if (isValid(x)) ok += 1 else bad += 1
          // Actual parse logic here
        }
        // An action (e.g. count, save, etc.)
        if (bad.value > 0.1 * ok.value) {
          throw new Exception("bad data - do not use results")
          // Optional cleanup
        }
        // Mark as safe
        P.S: If you are interested in this check out spark-validator (still early stages). Found Animals Foundation
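One way to read the “junk” idea from slide 19 is a two-stage aggregation: extend the key with a small salt, aggregate, then drop the salt and aggregate again. The sketch below assumes we only want a count per zip code and derives the salt from the other two fields; `SaltedCount`, `countByZip`, and `numSalts` are hypothetical names, and the talk does not say which kind of junk it has in mind.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper; a sketch of key salting under the assumptions above.
object SaltedCount {
  def countByZip(records: RDD[(String, String, String)], numSalts: Int): RDD[(String, Long)] =
    records
      .map { case (zip, a, b) =>
        // Deterministic salt in [0, numSalts) derived from the non-key fields.
        val salt = java.lang.Math.floorMod((a, b).hashCode(), numSalts)
        ((zip, salt), 1L)
      }
      .reduceByKey(_ + _)                                 // stage 1: a hot zip is spread over up to numSalts keys
      .map { case ((zip, _), partial) => (zip, partial) } // drop the salt
      .reduceByKey(_ + _)                                 // stage 2: at most numSalts partials per zip
}
```

A salt derived from record fields only helps when those fields vary within the hot key; a random salt is a common alternative.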
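The slide uses `sc.accumulator`, which was current when the talk was given; it was deprecated in Spark 2.0 in favour of `sc.longAccumulator` and friends. A rough equivalent of the validation pattern with the newer API might look like the sketch below, where `ValidatedParse`, `parseInts`, and the choice of "valid means parses as an Int" are all illustrative assumptions.

```scala
import scala.util.Try
import org.apache.spark.SparkContext

// Hypothetical helper; same 10% relative rule as the slide, newer accumulator API.
object ValidatedParse {
  def parseInts(sc: SparkContext, input: Seq[String]): Seq[Int] = {
    val ok = sc.longAccumulator("ok")
    val bad = sc.longAccumulator("bad")

    val parsed = sc.parallelize(input).map { s =>
      val result = Try(s.trim.toInt).toOption
      if (result.isDefined) ok.add(1) else bad.add(1)
      result
    }

    // Exactly one action: a second action on `parsed` without caching
    // would run the map again and count every record twice.
    val values = parsed.collect().toSeq.flatten

    if (bad.sum > 0.1 * ok.sum) {
      throw new IllegalStateException("bad data - do not use results")
    }
    values
  }
}
```

Accumulator updates made inside transformations can also be applied more than once when tasks are retried, which is one reason a relative threshold is safer than an exact count.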
  • 22. Where can Spark SQL benefit perf? Structured or semi-structured data OK with having less* complex operations available to us We may only need to operate on a subset of the data The fastest data to process isn’t even read Remember that non-magic cat? It's got some magic** now In part from peeking inside of boxes **Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic Matti Mattila
  • 23. Why is Spark SQL good for those things? Space efficient columnar cached representation Able to push down operations to the data store Optimizer is able to look inside of our operations Regular Spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _)) Matti Mattila
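To make the "optimizer can look inside our operations" point concrete, here is a small sketch using the DataFrame API. It assumes `SparkSession` (Spark 2.0+, newer than this talk), and the `SalesByZip` object, column names, and sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, min, sum}

// Hypothetical example; the filter and aggregates are expressions the
// optimizer can inspect, unlike opaque Scala closures passed to RDD methods.
object SalesByZip {
  def summarize(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val sales = Seq(("94110", 10.0), ("94110", 2.5), ("10003", 7.0), ("10003", 0.5))
      .toDF("zip", "amount")

    sales
      .filter(col("amount") > 1.0)   // a predicate Spark SQL can reason about
      .groupBy("zip")                // grouping is fine here: aggregation is planned, not materialized as lists
      .agg(sum("amount").as("total"), min("amount").as("smallest"))
  }
}
```

Calling `.explain()` on the result prints the physical plan, which is a convenient way to see what the optimizer did with the query.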
  • 24. Preview: bringing codegen to Spark ML Based on Spark SQL’s code generation First draft using quasiquotes Switch to janino for Java compilation Initial draft for Gradient Boosted Trees Based on DB’s work First draft with QuasiQuotes Moved to Java for speed See SPARK-10387 for the details Jon
  • 25. What the generated code looks like:
        @Override
        public double call(Vector input) throws Exception {
          if (input.apply(1) <= 1.0) {
            return 0.1;
          } else {
            if (input.apply(0) <= 0.5) {
              return 0.0;
            } else {
              return 2.0;
            }
          }
        }
        Decision tree diagram: split (1, 1.0) with leaf 0.1, then split (0, 0.5) with leaves 0.0 and 2.0. Glenn Simmons
  • 26. Everyone* needs reduce, let’s make it faster! reduce & aggregate have “tree” versions we already had free map-side reduction but now we can get even better!** **And we might be able to make even cooler versions
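The “tree” versions mentioned on slide 26 are `treeReduce` and `treeAggregate` on RDDs. A minimal sketch of `treeAggregate`, assuming we want a sum and a count in one pass; `TreeSum` and `sumAndCount` are hypothetical names.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper; same shape as aggregate, but partial results are
// merged on executors in a multi-level tree before reaching the driver.
object TreeSum {
  def sumAndCount(nums: RDD[Int]): (Long, Long) =
    nums.treeAggregate((0L, 0L))(
      // seqOp: fold one element into the per-partition (sum, count)
      (acc: (Long, Long), x: Int) => (acc._1 + x, acc._2 + 1L),
      // combOp: merge two partial (sum, count) pairs
      (a: (Long, Long), b: (Long, Long)) => (a._1 + b._1, a._2 + b._2),
      2) // depth of the aggregation tree (2 is also the default)
}
```

With many partitions, this reduces how many partial results the driver has to merge itself.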
  • 27. Additional Resources Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.) http://spark.apache.org/docs/latest/ Books Videos Our next meetup! Spark Office Hours follow me on twitter for future ones - https://twitter.com/holdenkarau fill out this survey to choose the next date - http://bit.ly/spOffice1 raider of gin
  • 28. Q&A OR A quick detour into spark testing? It's like a choose your own adventure novel, but with voting But more like the voting in High School since if we are running out of time we might just skip it
  • 29. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Coming soon: Spark in Action
  • 30. Spark Videos Apache Spark Youtube Channel My Spark videos on YouTube - http://bit.ly/holdenSparkVideos Spark Summit 2014 training Paco’s Introduction to Apache Spark
  • 31. Cat wave photo by Quinn Dombrowski k thnx bye! If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark Will tweet results “eventually” @holdenkarau

Editor's Notes

  1. Photo from https://www.flickr.com/photos/lorika/4148361363/in/photolist-7jzriM-9h3my2-9Qn7iD-bp55TS-7YCJ4G-4pVTXa-7AFKbm-bkBfKJ-9Qn6FH-aniTRF-9LmYvZ-HD6w6-4mBo3t-8sekvz-mgpFzD-5z6BRK-de513-8dVhBu-bBZ22n-4Vi2vS-3g13dh-e7aPKj-b6iHHi-4ThGzv-7NcFNK-aniTU6-Kzqxd-7LPmYs-4ok2qy-dLY9La-Nvhey-Kte6U-74B7Ma-6VfnBK-6VjrY7-58kAY9-7qUeDK-4eoSxM-6Vjs5A-9v5Pvb-26mja-4scwq3-GHzAL-672eVr-nFUomD-4s8u8F-5eiQmQ-bxXXCc-5P9cCT-5GX8no
  2. https://www.flickr.com/photos/haoli/6349372032/in/photolist-aF5c6A-beRSyF-cnUjBm-dxujoM-cukarf-5osZv-7LrwZb-8hzdGg-dWAXVw-7j8eCn-mU1GDC-du6Njj-9fNeUF-9fNf2c-jeQw2Z-pCQxin-pCPx1S-oYtpxt-pCSwKY-oYtpz2-5nAgBd-4kR3Xg-2CLt3B-mU1HuL-pCPx4h-54W9r-mTYJGa-pVkTdo-2CLrVX-9qkxeT-9s2gwi-9qkx1X-oYqiWL-pCSwD5-2xFigB-72vWUH-dWoBAi-opf1Pw-7jc8Bu-6QfmGS-pVcDuv-4FDmvY-dWufM9-9rFwy5-RAsAG-csnYJu-7QF7sx-83wqki-6faJ2B-7NJT8E
  3. From https://www.flickr.com/photos/photoverulam/22626301622/in/photolist-AtpHbE-2biJ8i-cbDxLj-5SbTJs-bvJ6pR-4cKd6r-c5io3W-x7fuW-8GEnYV-7ngpwq-7ncv4F-7ncv36-6UPdLM-cS2j3s-6zXf6D-pps5P-6UPdZc-qbhws-egQRmW-61si6q-N864-65o5nN-4D4R6z-wavuvy-zzzrqc-6RG2Wn-zhbLnM-zhbLPP-coidfb-6d9XaA-cfPRY7-coidn7-coidLC-6hDKxj-se5vfT-t8y1tQ-5pRoHx-N854-8UuUYz-msyfx-9DqPba-49vTz-4c4F5-5QL2qk-v7G7z9-w4GYdP-irqiN-6Dc9WZ-2h4pkp-uKaPa
  4. https://www.flickr.com/photos/eleda/531867386/in/photolist-NZXDm-4H2JU2-chHH61-aDPTFx-5SYV6V-cgjJVm-bmsnCt-bWgJiD-eiwHzX-dQgyhR-3bN33R-eXWaq2-7Cr1HJ-5TxxkF-9prgZh-2Fehf-9xVUGJ-guZfLW-bWgJk2-93HkH6-9prh4Q-9poftp-eL99JM-9prerC-93LqUf-eLkz5L-6gsr2T-4ofma3-4obj4M-bV2a3u-7ygQQr-gS4GzY-GTrX9-7cLyNh-6yFvoe-fv6smP-4GRE5r-5kLaJv-5BE2Eg-4GVR4f-5Qnzri-6N33MP-4XfVC8-56HJVB-s5HTfd-4GVPwW-27SD6T-dGk3Vj-4ofqNC-9e2NoY
  5. https://www.flickr.com/photos/hckyso/2055866250/in/photolist-48ERpQ-c3rR-7DyiPo-4wAZRn-hzYJD4-9KvP1D-81rV7R-7F1fnm-dTQkdt-AdUJ-95BJsR-4hy1LG-891ckh-orpiij-7sDjhG-qdro34-s1x8Sm-7X3N8R-9JXXZC-aSJdR-ampKtE-6aTcDC-4P6QUv-9Zry8g-4d54Qi-ZMHEJ-g16RaZ-j95eU-9pp82n-7Efa4Y-apJqbb-6kYmJ8-t4N6G5-DCbLQ-7Smuw8-eir9us-ek6wdx-eiGMj1-5iMBeE-9bh3qr-8MpZPp-9kRy1L-ekLggu-du4gyZ-7bmbow-eir9vo-9kunTs-a2Wru-cQGXy5-DCcaR
  6. photo from https://www.flickr.com/photos/geckoam/2956778600/
  7. https://www.flickr.com/photos/latitudes/66424863/in/photolist-6SrNg-4FS7h3-n3aG-675Ggf-2mvpnV-4EPRi-agTjTx-3fuHL-7xHxwK-2RnrK-9hNfoi-2RnV1-2RnV3-6y5i2D-4EPSo-rgtUq-6amUo9-2RnV4-dxZEgS-HS6QM-dzGYC-cWsXC5-2RnV6-aDHNC-2RnV2-bqQu1Q-5kwTda-n35c-tvq1-rgu6G-NdcJr-6ahMeZ-oUnQSw-4kPxbs-xGmP-63cN61-6ahKok-rgtZY-zE7Wf-dghvFQ-sQaV1s-aLr6Tn-aWCMd4-whPuJ-jhaCqH-wM72t-Z5TfQ-a8Tqys-Nopr3-gz9b7W/
  8. Photo https://www.flickr.com/photos/foundanimalsfoundation/8055190879/in/photolist-dgNXBn-4L53ub-ajWE6R-ovhrAn-buEU2i-6TM1kv-6F62SX-dv1zwm-6JiU12-e3GnSr-877jwm-nkEHyT-5q27Jq-6Yngd4-4xcRaU-4x8Mgn-6g3oAX-8Hcwvh-6bdxVW-4xcUnq-idRQ5-4x93fz-9ix9t5-4x8QSt-4x9dhT-ovW6RV-ou7PoH-aukUjT-dbHTpJ-aPCdta-4xdaNG-4x8ViZ-4xd8kh-4x97ge-4xd1WS-4xduUs-4x8LaV-4x8Nig-4x8JEM-4x8Dxe-4x8U7n-4xdhs1-4xdfi9-4x8Gsg-4x9fL2-4xcSfW-4xcPmq-4x9akx-4x95e2-4x99n8
  9. https://www.flickr.com/photos/mattimattila/8190148857/in/photolist-dtJDV4-9tFyUo-9tBqdY-9tymzv-9tBFDf-9tBf1Y-9tyhGp-9tBerj-9tBe4u-9tygGt-9tBc1L-9tB7aJ-9tBeC5-9tzzx6-9tzzq2-9tCw4u-9tzyAv-9tCsHo-9tzvf8-9tyS8X-9tCx5Y-c1JFsu-9tBD8s-9tyt9Z-9tymqa-9tykmD-9tyi3D-9tBPo1-9tyvJt-9tBofj-9tBB9E-9tyBVx-9tBanw-9ty7KM-9ty662-9tCwwY-9tCrVq-4YqUM-9ty2Fv-9tCry1-9tzu72-9tCrbo-9tyReT-c1PhzY-9tyR5r-9tywiK-9tyw9B-9tBt3b-9tBsFs-9tBswf
  10. https://www.flickr.com/photos/jb-london/6659711647/in/photolist-b9uLfB-oMFLKY-psumAe-PvmTe-9vatNK-qektu-8g3jSA-349iv-6GtGmj-oK9cEY-991iGG-cPJ8QU-8dxxkB-mF2Hpc-jKLC8r-o6k2UB-eqbByC-6RGY2L-56P4E3-75QJPn-meLnko-athMJ5-dshXvy-9Ddf4h-dWcYXQ-8cxGxH-4EaXuw-nSfe14-eeXM3G-6w6p2X-dz1VFC-cirujw-nRjjjG-nRon7D-BBRxV-b8Y4UZ-4ang32-8N4tS6-aqNUJJ-3daDSd-bdnv4Z-9jJxG8-otHbqV-CsKnA-4rLoBN-pczP4-niPcP4-f9xNuq-fpDcRL-7khdoc
  11. https://www.flickr.com/photos/simmogl/4055700308/in/photolist-7bowpo-zvyWaQ-3EhtfM-zM3uwo-zM3Ba3-pG7vSq-oMmzPG-oMpCrt-5uSRnw-4HXxs-bwiGtb-9u29aC-oMmLtY-kop9e-4HXwH-oMoqa7-5zd1U6-9pB2jn-hCMrd5-bGyH6i-4Kj7q-dDaF1-prNtLn-zM3yR5-yRhpTn-yR8sRL-yR8oMd-yRhsqB-gTH7qx-zvz4sW-92waWk-yR8wph-yRhrJB-3EmRpS-7eqcqM-4Kj3p-njURVR-2aHanh-iFykZQ-9x97CL-9NfNL-k9N6fm-5RSaZ-4BxAv5-a51APZ-dqhjnr-dqhqZ9-eb9V7X-3EmR3d-6sCnb7
  12. https://www.flickr.com/photos/fairerdingo/2320356657/in/photolist-4x3rcg-AvWp-9apz9a-jJhpoQ-2ov2gu-7Rr2qr-3P2KH-5YFdAJ-gACrN-HTUWP-6j9ooG-dXpN3Q-9kccaV-aFuUfB-8ZN65i-6pQSAv-btZvjV-9ddxwE-4Lq8UH-dXaZ7j-73Xojt-mUZSq-fTy1P-e4n9B-hYwP4-89QrWo-67bSSJ-aThabK-bTctDK-94iUu2-asHJSr-bBnVA8-5MbBJM-g2Vrky-efhYzw-8NxAKw-e3baUF-grvK9-48GJ6n-bAV4eh-btJDEK-4zJtyV-8naFTb-dgJfT-5H88ML-vRFsiA-bHt6pc-7eVJa6-bm2YzR-63sSC5