Apache Spark 3: The (possible) future

Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos

What will be covered?
● Constraints on predicting the future in open source
● The current state of Spark
● Some exciting new things likely/possibly in Spark 3
● A guide on how to build your own crystal ball to double check my crystal ball
● A look through JIRA
● A plea for you to help us with code reviews
● Q & A

Predicting the future in OSS is hard
● This represents my views as someone who works on Spark
● What people end up deciding to work on / review may not match
● Consensus decision making can sometimes be unpredictable
● While we don't have a crystal ball, we do have JIRA
Hisashi

Some key themes for Spark 3
● Deep Learning
○ VC dollars
● Kubernetes
○ If all the cool kids replaced their scheduler, would you?
○ Also see above
● Removing deprecated APIs (yay?)
● … Scala upgrade?
Andy
Blackledge

Deep Learning: Does this slide have cat?
● New scheduler to support deep learning
● New data types to support deep learning
● Better interchange to support deep learning
● Actual deep learning algorithms…. Where are they?
Quinn Dombrowski

New "Gang" Scheduler
● Announced at Spark Summit back in 2.3 -
https://www.datanami.com/2018/06/05/project-hydrogen-unites-apache-spark-with-dl-frameworks/
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri gang ./core/src
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri gang ./*/src
hkarau@hkarau-glaptop:~/repos/spark$
Lisa Larsson

New "Gang" Barrier Scheduler
● https://issues.apache.org/jira/browse/SPARK-24374
● https://docs.google.com/document/d/1JR6lWcgAI53lCUxy4qvQSv8w1jXZrbS7DC9AjYlwqJE/edit#
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri Barrier ./core/src
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala:class RDDBarrierSuite extends SparkFunSuite with SharedSparkContext {
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala: test("create an RDDBarrier") {
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala:
Lisa Larsson

What does this fix?
● Allows scheduling all tasks of a DL job together
● Allows regular scheduling otherwise
● Handles failures (e.g. single executor failure == retry all)
Lisa Larsson
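
The all-or-nothing behaviour above can be sketched in plain Python. This is a toy simulation, not Spark's API: `run_barrier_stage`, `make_task`, and the slot-counting rule are hypothetical names and simplifications made up for illustration. It shows only the two ideas from the slide: launch every task of the stage together or not at all, and retry the whole stage when any single task fails.

```python
from typing import Callable, List, Sequence, Tuple

Task = Callable[[int], str]  # a task receives the stage attempt number


def make_task(task_id: int, fail_on_attempts: Sequence[int] = ()) -> Task:
    """Build a hypothetical task that fails on the given stage attempts."""
    def task(attempt: int) -> str:
        if attempt in fail_on_attempts:
            raise RuntimeError(f"task {task_id} failed on attempt {attempt}")
        return f"task-{task_id}-attempt-{attempt}"
    return task


def run_barrier_stage(tasks: List[Task], free_slots: int,
                      max_attempts: int = 3) -> Tuple[int, List[str]]:
    """Simulate gang scheduling: all tasks run together, any failure retries all."""
    # Assumption: refuse to start at all unless every task can get a slot.
    if len(tasks) > free_slots:
        raise RuntimeError("not enough free slots to launch all tasks together")
    for attempt in range(1, max_attempts + 1):
        try:
            # One failing task discards the partial results of the others.
            results = [task(attempt) for task in tasks]
            return attempt, results
        except RuntimeError:
            continue  # retry the whole stage, not just the failed task
    raise RuntimeError(f"stage failed after {max_attempts} attempts")


# Task 1 fails once, so every task is re-run on the second attempt.
tasks = [make_task(0), make_task(1, fail_on_attempts=(1,)), make_task(2)]
attempt, results = run_barrier_stage(tasks, free_slots=3)
print(attempt, results)
```

Note that the healthy tasks are re-run as well, which is the cost of "single executor failure == retry all".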

Spark Deep Learning Pipelines
● ML pipelines being extended to better support image data
● Some work external e.g.
https://github.com/databricks/spark-deep-learning
Alternatives:
● TensorflowOnSpark
● MMLSpark, etc.
Smokey Combs

Growth of use of Arrow (maybe)?
Logos trademarks of their respective projects
Juha Kettunen

Kubernetes
● A new cluster manager, used for more than "just" big data
● Spark "supports" it, but it is still difficult to use
● Active work by people at many companies (yay!)
Lisa Zins

What do we need to do next?
● Better* dynamic scaling
● Easier uploading for user code & dependencies
● Better auth integration
● Better documentation (ugh client mode)
● Better job resource requirement tagging
● Better shell scripts for packaging dependencies
○ It's easier than YARN but that's not saying a lot
○ Asking a junior Data Scientist to build a Docker container doesn't always go so well
Hisashi

Building your own crystal ball:
● Join us on the dev@ list -
http://spark.apache.org/community.html
● Triage Issues:
○ https://issues.apache.org/jira/projects/SPARK/issues/?filter=allopenissues
● Review code!
○ http://spark-prs.appspot.com
○ https://www.youtube.com/user/holdenkarau
Lee Jordan
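
For triage beyond the web UI, a small script can build a search URL for open Spark issues. This is a minimal sketch that assumes the ASF JIRA exposes the standard Jira REST search endpoint (`/rest/api/2/search`) with a JQL query; the helper name `open_issues_url`, the `Kubernetes` component filter, and the chosen fields are illustrative assumptions. It only constructs the URL and makes no network request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint: the standard Jira REST v2 search path on the ASF JIRA.
JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"


def open_issues_url(project="SPARK", component=None, max_results=20):
    """Build a search URL for unresolved issues, most recently updated first."""
    clauses = [f"project = {project}", "resolution = Unresolved"]
    if component:
        # Hypothetical filter; check the component names used in the project.
        clauses.append(f'component = "{component}"')
    jql = " AND ".join(clauses) + " ORDER BY updated DESC"
    params = {"jql": jql, "maxResults": max_results, "fields": "summary,updated"}
    return f"{JIRA_SEARCH}?{urlencode(params)}"


url = open_issues_url(component="Kubernetes")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return JSON that can be skimmed for issues worth triaging.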

Spark JIRA
● Let's go look at issues on the Spark JIRA together!
● Don't all rush at once….
Jean Georges Perrin
#SparkInAction

Who likes doing code reviews?
● We have over 400* of them!
● Don't all rush at once….

Want to get involved?
● Join us on the dev@ list -
http://spark.apache.org/community.html
● Triage Issues:
○ https://issues.apache.org/jira/projects/SPARK/issues/?filter=allopenissues
● Please help review code!
○ http://spark-prs.appspot.com
○ https://www.youtube.com/user/holdenkarau
Hisashi

Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark

High Performance Spark!
You can buy it today! On the internet!
Nothing on Spark 3 because it doesn't exist yet
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print rather than e-book.

Sign up for the mailing list @
http://www.distributedcomputing4kids.com

And some upcoming talks:
● March
○ Dataworks Barcelona
○ Strata San Francisco
● May
○ KiwiCoda Mania
● June
○ "Secret" (for another week or so)
● July
○ OSCON Portland
○ Skills Matter in London

k thnx bye :)
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF
It’s performance review season, so help a friend out and fill out this survey with your talk feedback: http://bit.ly/holdenTalkFeedback
