Apache Spark 3:
The (possible) future!
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
What will be covered?
● Constraints on predicting the future in open source
● The current state of Spark
● Some exciting new things likely/possibly in Spark 3
● A guide on how to build your own crystal ball to double check my crystal ball
● A look through JIRA
● A plea for you to help us with code reviews
● Q & A
Predicting the future in OSS is hard
● This represents my views as someone who works on Spark
● What people end up deciding to work on / review may not match
● Consensus decision making can sometimes be unpredictable
● While we don't have a crystal ball, we do have JIRA
Hisashi
Some key themes for Spark 3
● Deep Learning
○ VC dollars
● Kubernetes
○ If all the cool kids replaced their scheduler, would you?
○ Also see above
● Removing deprecated APIs (yay?)
● … Scala upgrade?
Andy
Blackledge
Deep Learning: Does this slide have cat?
● New scheduler to support deep learning
● New data types to support deep learning
● Better interchange to support deep learning
● Actual deep learning algorithms…. Where are they?
Quinn Dombrowski
New "Gang" Scheduler
● Announced at Spark Summit back in 2.3 -
https://www.datanami.com/2018/06/05/project-hydrogen-unites-apache-spark-with-dl-frameworks/
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri gang ./core/src
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri gang ./*/src
hkarau@hkarau-glaptop:~/repos/spark$
Lisa Larsson
New "Gang" Barrier Scheduler
● https://issues.apache.org/jira/browse/SPARK-24374
● https://docs.google.com/document/d/1JR6lWcgAI53lCUxy4qvQSv8w1jXZrbS7DC9AjYlwqJE/edit#
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri Barrier ./core/src
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala:class
RDDBarrierSuite extends SparkFunSuite with SharedSparkContext {
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala: test("create an
RDDBarrier") {
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala:
Lisa Larsson
What does this fix?
● Allows scheduling all of the DL job together
● Allows regular scheduling otherwise
● Handles failures (e.g. single executor failure == retry all)
Lisa Larsson
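The all-or-nothing behaviour described above can be illustrated with a toy model in plain Python. This is only a sketch of the semantics (launch every task together, and retry the whole gang if any single task fails), not Spark's implementation. The names run_gang, GangFailure and flaky_task are hypothetical and exist only for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class GangFailure(Exception):
    """Raised when the gang still fails after every allowed attempt."""


def run_gang(task, num_tasks, max_attempts=3):
    """Run num_tasks copies of task together; retry ALL of them if any fails.

    task(task_id, attempt, barrier) is a made-up signature for this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        # A fresh barrier per attempt: every task must reach it to proceed.
        barrier = threading.Barrier(num_tasks)

        def guarded(task_id):
            try:
                return task(task_id, attempt, barrier)
            except Exception:
                # Break the barrier so peers stop waiting on a failed task.
                barrier.abort()
                raise

        # One worker per task, so the whole gang can run at the same time.
        with ThreadPoolExecutor(max_workers=num_tasks) as pool:
            futures = [pool.submit(guarded, i) for i in range(num_tasks)]
            try:
                return [f.result() for f in futures]
            except Exception:
                continue  # a single failure discards the attempt for everyone
    raise GangFailure(f"gang failed after {max_attempts} attempts")


def flaky_task(task_id, attempt, barrier):
    """Simulate one executor failing on the first attempt only."""
    if attempt == 1 and task_id == 2:
        raise RuntimeError("simulated executor failure")
    barrier.wait(timeout=5)  # all tasks synchronise here, as in a DL job
    return (task_id, attempt)


results = run_gang(flaky_task, num_tasks=4)
```

In this run every task reports attempt 2, because the simulated failure of one task on the first attempt forces all four to be rerun rather than just the failed one.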
Spark Deep Learning Pipelines
● ML pipelines being extended to better support image data
● Some work external e.g.
https://github.com/databricks/spark-deep-learning
Alternatives:
● TensorflowOnSpark
● MMLSpark, etc.
Smokey Combs
Growth of use of Arrow (maybe)?
Logos trademarks of their respective projects
Juha Kettunen
Kubernetes
● A new cluster manager, used for more than "just" big data
● Spark "supports" it, but it is still difficult to use
● Active work by people at many companies (yay!)
Lisa Zins
What do we need to do next?
● Better* dynamic scaling
● Easier uploading for user code & dependencies
● Better auth integration
● Better documentation (ugh client mode)
● Better job resource requirement tagging
● Better shell scripts for packaging dependencies
○ It's easier than YARN but that's not saying a lot
○ Asking a junior Data Scientist to build a docker
container doesn't always go so well
Hisashi
Photo by: squidish
Building your own crystal ball:
● Join us on the dev@ list -
http://spark.apache.org/community.html
● Triage Issues:
○ https://issues.apache.org/jira/projects/SPARK/issues/?filter=allopenissues
● Review code!
○ http://spark-prs.appspot.com
○ https://www.youtube.com/user/holdenkarau
Lee Jordan
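For triaging from a script rather than the browser, one option is to build a search URL for open SPARK issues. The sketch below assumes the Apache JIRA instance exposes the standard JIRA REST search endpoint with jql and maxResults parameters. The helper name build_search_url and the "Kubernetes" component filter are illustrative assumptions, and the code only constructs the URL without making any network call.

```python
from urllib.parse import urlencode

# Assumed endpoint: the standard JIRA REST v2 search path on the Apache JIRA host.
JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"


def build_search_url(component=None, max_results=20):
    """Return a URL that searches for unresolved SPARK issues, newest first."""
    clauses = ["project = SPARK", "resolution = Unresolved"]
    if component:
        # Optional filter; the component name is whatever the caller supplies.
        clauses.append(f'component = "{component}"')
    jql = " AND ".join(clauses) + " ORDER BY updated DESC"
    return JIRA_SEARCH + "?" + urlencode({"jql": jql, "maxResults": max_results})


url = build_search_url(component="Kubernetes", max_results=10)
print(url)
```

Fetching that URL with any HTTP client should return JSON if the endpoint assumption holds. Narrowing by component is one way to check a prediction about a single theme, such as Kubernetes, against what is actually being filed.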
Spark JIRA
● Let's go look at issues on the Spark JIRA together!
● Don't all rush at once….
Jean Georges Perrin
#SparkInAction
Who likes doing code reviews?
● We have over 400* of them!
● Don't all rush at once….
Want to get involved?
● Join us on the dev@ list -
http://spark.apache.org/community.html
● Triage Issues:
○ https://issues.apache.org/jira/projects/SPARK/issues/?filter=allopenissues
● Please help review code!
○ http://spark-prs.appspot.com
○ https://www.youtube.com/user/holdenkarau
Hisashi
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
You can buy it today! On the internet!
Nothing on Spark 3 because it doesn't exist yet
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.
Sign up for the mailing list @
http://www.distributedcomputing4kids.com
And some upcoming talks:
● March
○ Dataworks Barcelona
○ Strata San Francisco
● May
○ KiwiCoda Mania
● June
○ "Secret" (for another week or so)
● July
○ OSCON Portland
○ Skills Matter in London
k thnx bye :)
If you care about Spark testing and don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Do you want more realistic benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
It’s performance review season, so help a friend out and fill out this survey with your talk feedback:
http://bit.ly/holdenTalkFeedback

A Glimpse At The Future Of Apache Spark 3.0 With Deep Learning And Kubernetes
