This document is an overview of a presentation on Apache Spark 3. It introduces the presenter and her background working on Spark. The presentation covers the limits of predicting the future of open source projects, the current state of Spark, features likely or possibly coming in Spark 3 (deep learning support, a new barrier scheduler, better Kubernetes support, and a possible Scala upgrade), and how attendees can get involved in the Spark community through code reviews and issue triage.
3. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● SlideShare: http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
5. What will be covered?
● Constraints on predicting the future in open source
● The current state of Spark
● Some exciting new things likely/possibly in Spark 3
● A guide on how to build your own crystal ball to double check my crystal ball
● A look through JIRA
● A plea for you to help us with code reviews
● Q & A
6. Predicting the future in OSS is hard
● This represents my views as someone who works on Spark
● What people end up deciding to work on / review may not match
● Consensus decision making can sometimes be unpredictable
● While we don't have a crystal ball, we do have JIRA
Hisashi
7. Some key themes for Spark 3
● Deep Learning
○ VC dollars
● Kubernetes
○ If all the cool kids replaced their scheduler, would you?
○ Also see above
● Removing deprecated APIs (yay?)
● … Scala upgrade?
Andy Blackledge
8. Deep Learning: Does this slide have cat?
● New scheduler to support deep learning
● New data types to support deep learning
● Better interchange to support deep learning
● Actual deep learning algorithms…. Where are they?
Quinn Dombrowski
9. New "Gang" Scheduler
● Announced at Spark Summit back in 2.3 -
https://www.datanami.com/2018/06/05/project-hydrogen-unites-apache-spark-with-dl-frameworks/
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri gang ./core/src
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri gang ./*/src
hkarau@hkarau-glaptop:~/repos/spark$
Lisa Larsson
10. New "Gang" Barrier Scheduler
● https://issues.apache.org/jira/browse/SPARK-24374
● https://docs.google.com/document/d/1JR6lWcgAI53lCUxy4qvQSv8w1jXZrbS7DC9AjYlwqJE/edit#
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri Barrier ./core/src
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala:class RDDBarrierSuite extends SparkFunSuite with SharedSparkContext {
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala: test("create an RDDBarrier") {
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala:
Lisa Larsson
11. What does this fix?
● Allows scheduling all of the DL job together
● Allows regular scheduling otherwise
● Handles failures (e.g. single executor failure == retry all)
Lisa Larsson
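The all-or-nothing behaviour listed above can be sketched in a few lines. This is a toy simulation using Python's standard `threading.Barrier`, not Spark's scheduler or its API; `run_gang`, `steady`, and `flaky` are hypothetical names made up for illustration. It only shows the two ideas from the slide: every task of the job starts together, and one failure means the whole group is retried.

```python
import threading


def run_gang(tasks, max_attempts=3):
    """Run every task together; if any one fails, retry all of them.

    Each task is a callable that receives the 1-based attempt number.
    Returns (attempt_that_succeeded, results_in_task_order).
    """
    for attempt in range(1, max_attempts + 1):
        # The barrier releases only once every task in the gang has started,
        # standing in for "schedule all of the DL job together".
        start_line = threading.Barrier(len(tasks))
        results = [None] * len(tasks)
        failures = []

        def worker(index, task):
            try:
                start_line.wait(timeout=10)
                results[index] = task(attempt)
            except Exception as exc:  # any single failure fails the gang
                failures.append((index, exc))

        threads = [
            threading.Thread(target=worker, args=(i, t))
            for i, t in enumerate(tasks)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

        if not failures:
            return attempt, results
        # Otherwise fall through and rerun every task, not just the failed one.

    raise RuntimeError(f"gang failed after {max_attempts} attempts")


calls = []  # records which task ran, to show that healthy tasks are rerun too


def steady(attempt):
    calls.append("steady")
    return "ok"


def flaky(attempt):
    calls.append("flaky")
    if attempt == 1:
        raise RuntimeError("simulated executor failure")
    return "recovered"


attempt, results = run_gang([steady, flaky, steady])
```

In this sketch the two `steady` tasks run twice each even though only `flaky` failed, which is the "single executor failure == retry all" trade-off.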
12. Spark Deep Learning Pipelines
● ML pipelines being extended to better support image data
● Some work external e.g.
https://github.com/databricks/spark-deep-learning
Alternatives:
● TensorflowOnSpark
● MMLSpark, etc.
Smokey Combs
13. Growth of use of Arrow (maybe)?
Logos trademarks of their respective projects
Juha Kettunen
14. Kubernetes
● A new cluster manager, used for more than "just" big data
● Spark "supports" it, but it is difficult to use
● Active work by people at many companies (yay!)
Lisa Zins
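For a sense of what "supports" looks like in practice, here is a minimal sketch that assembles a `spark-submit` command line for a Kubernetes cluster. The `k8s://` master URL, `--deploy-mode cluster`, `spark.executor.instances`, and `spark.kubernetes.container.image` come from Spark's Kubernetes support; everything else is an assumption for illustration: the helper name `k8s_submit_args`, the API server address, the image name, the class name, and the jar path are all placeholders.

```python
import shlex


def k8s_submit_args(api_server, image, main_class, app_jar, executors=2):
    """Build the argument list for submitting a JVM Spark job to Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # Kubernetes API server URL
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,  # local:// means the jar is already inside the image
    ]


# All values below are hypothetical placeholders.
args = k8s_submit_args(
    api_server="https://example.invalid:6443",
    image="registry.example.invalid/spark:demo",
    main_class="org.example.Job",
    app_jar="local:///opt/spark/jars/job.jar",
)
command_line = " ".join(shlex.quote(a) for a in args)
```

Note that the application jar has to be baked into a container image first, which is the step that tends to trip people up.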
15. What do we need to do next?
● Better* dynamic scaling
● Easier uploading for user code & dependencies
● Better auth integration
● Better documentation (ugh client mode)
● Better job resource requirement tagging
● Better shell scripts for packaging dependencies
○ It's easier than YARN but that's not saying a lot
○ Asking a junior Data Scientist to build a Docker container doesn't always go so well
Hisashi
17. Building your own crystal ball:
● Join us on the dev@ list -
http://spark.apache.org/community.html
● Triage Issues:
○ https://issues.apache.org/jira/projects/SPARK/issues/?filter=allopenissues
● Review code!
○ http://spark-prs.appspot.com
○ https://www.youtube.com/user/holdenkarau
Lee Jordan
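If you would rather triage from a script than from the web UI, a rough sketch of building a JIRA search URL follows. It assumes the ASF JIRA instance exposes the standard JIRA REST search endpoint at `/rest/api/2/search` with `jql` and `maxResults` parameters; the helper name `open_spark_issues_url` and the example component name are illustrative assumptions, and the code only builds the URL without making any network call.

```python
from urllib.parse import urlencode

# Assumed endpoint: the standard JIRA REST v2 search API on the ASF instance.
JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"


def open_spark_issues_url(component=None, max_results=20):
    """Return a search URL for unresolved SPARK issues, newest activity first."""
    clauses = ["project = SPARK", "resolution = Unresolved"]
    if component:
        # Optional filter; the component name is whatever JIRA uses for it.
        clauses.append(f'component = "{component}"')
    jql = " AND ".join(clauses) + " ORDER BY updated DESC"
    return JIRA_SEARCH + "?" + urlencode({"jql": jql, "maxResults": max_results})


all_open = open_spark_issues_url()
k8s_open = open_spark_issues_url(component="Kubernetes", max_results=5)
```

Fetching either URL with any HTTP client should return JSON you can skim for issues that need a reproduction, a label, or a reviewer.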
18. Spark JIRA
● Let's go look at issues on the Spark JIRA together!
● Don't all rush at once….
Jean Georges Perrin
#SparkInAction
19. Who likes doing code reviews?
● We have over 400* of them!
● Don't all rush at once….
20. Want to get involved?
● Join us on the dev@ list -
http://spark.apache.org/community.html
● Triage Issues:
○ https://issues.apache.org/jira/projects/SPARK/issues/?filter=allopenissues
● Please help review code!
○ http://spark-prs.appspot.com
○ https://www.youtube.com/user/holdenkarau
Hisashi
21. Learning Spark
● Fast Data Processing with Spark (Out of Date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Spark in Action
● High Performance Spark
● Learning PySpark
22. High Performance Spark!
You can buy it today! On the internet!
Nothing on Spark 3 because it doesn't exist yet
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.
23. Sign up for the mailing list @
http://www.distributedcomputing4kids.com
24. And some upcoming talks:
● March
○ Dataworks Barcelona
○ Strata San Francisco
● May
○ KiwiCoda Mania
● June
○ "Secret" (for another week or so)
● July
○ OSCON Portland
○ Skills Matter in London
25. k thnx bye :)
● If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
● Will tweet results “eventually” @holdenkarau
● Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF
● It’s performance review season, so help a friend out and fill out this survey with your talk feedback: http://bit.ly/holdenTalkFeedback