Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow

This talk will take two existing Spark ML pipelines (Frank The Unicorn, for predicting PR comments (Scala) – https://github.com/franktheunicorn/predict-pr-comments & Spark ML on Spark Errors (Python)) and explore the steps involved in migrating them into a combination of Spark and Tensorflow. Using the open source Kubeflow project (now with Spark support as of 0.5), we will create two integrated end-to-end pipelines to explore the challenges involved & look at areas of improvement (e.g. Apache Arrow, etc.).

  1. 1. Thanks for coming early! Want to make clothes from code? https://haute.codes Want to hear about a KF book? http://www.introtomlwithkubeflow.com Teach kids Apache Spark? http://distributedcomputing4kids.com
  2. 2. Spark ML to Spark + TF Alternate: Things that almost work An Adventure powered by Kubeflow Presented by @holdenkarau
  3. 3. Holden: ● My name is Holden Karau ● Preferred pronouns are she/her ● Open Source Dev @ Apple ● Apache Spark PMC ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● SlideShare http://www.slideshare.net/hkarau ● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Spark Talk Videos http://bit.ly/holdenSparkVideos ● Talk feedback: http://bit.ly/holdenTalkFeedback
  4. 4. Today's adventure: ● Who our players are (Spark, Kubeflow, Tensorflow) ● Why you would want to do this ● How to make this "work" ● Some alternatives to all this effort ● Illustrated with existing projects of ML on Spark mailing lists & ML on code ● No demos because 0.7RC1 broke "everything"* Sucram Yef
  5. 5. What is Spark? ● Apache Spark “Core” ● SQL & DataFrames ● Streaming ● Language APIs: Scala, Java, Python, & R ● Graph Tools: Bagel & GraphX ● Spark ML ● MLlib ● Community Packages ● Structured Streaming ● Spark on Yarn ● Spark on Mesos ● Spark on Kubernetes ● Standalone Spark umbrellahead56
  6. 6. What is Tensorflow? ● Fancy machine learning tool ● Big enough it has its own conference now too ● More deep learning options than inside of Spark ML Tak H.
  7. 7. What is Tensorflow Extended (TFX)? ● Tools to create and manage Tensorflow pipelines ● Includes things like serving, data validation, data transformation, etc. ● Many parts of its data ecosystem depend on Apache Beam's Python interface which has challenges in open source Miguel Discart
  8. 8. What is Kubeflow? ● "The Machine Learning Toolkit for Kubernetes" (kubeflow.org) ● Provides a buffet-like collection of ML related tools ● Very very unstable ● But aiming for a 1.0 release "soon" Mr Thinktank
  9. 9. Why would we want to augment Spark ML? ● Spark ML doesn't have a huge variety of algorithms ● Serving Spark models is painful ● Raising money from venture capitalists is easier with tensorflow ● A chance to revisit our base assumptions Tamsin Cooper
  10. 10. How could we do this? ● Install Kubeflow ● Take our Spark job and split it into feature prep & model training ● Have our feature prep job save the results in a TF-compatible format ● Create a TF-job ● Create a Kubeflow (or Argo, or…) pipeline to train our new model ● Optional: Use katib to do hyper-parameter tuning ● Validate if our classic ML or new fancy ML works "better" ivva
  11. 11. What is the catch? ● Kubeflow is alpha software ● Kubeflow had Spark integrated in 0.5, broke in 0.6, and there's a PR to fix it in 0.7 ● So… using this is a bit tricky (for now) ● To be clear: DO NOT DO THINGS I SHOW YOU IN PRODUCTION ○ It's like that "professional drivers" warning on TV except maybe a different word than professional ○ Unless you're trying to get fired, in which case I have some PRs for you to try Dale Cruse
  12. 12. Setting up KF ● Download kfctl from https://api.github.com/repos/kubeflow/kubeflow/release ○ Get the latest rc, leave an offering to Cthulhu, and add it to your path ● Download a config from https://github.com/kubeflow/manifests/tree/master/kfdef (not https://github.com/kubeflow/kubeflow/tree/master/bootstrap/config ) ○ Edit as you need (for us we need to add Spark) & configure our cluster ○ You can point to manifest PR#441 for now (e.g. https://github.com/kubeflow/manifests/blob/50516461ce327624ad4e107a9286c69e5332e150/kfdef/kfctl_gcp_iap.yaml ) ○ Edit the yaml URI to point to the PR, put in the app name ● Run the magic incantation Marco Verch
  13. 13. Or in diff view...
  14. 14. And on GCP...
  15. 15. The magic incantation*
      CONFIG="my_app.yaml"
      KFAPP="cheeseburger"
      mkdir ${KFAPP} && pushd ${KFAPP} && cp ${CONFIG} ./ && kfctl apply all -f `pwd`/${CONFIG} -V
      # As of 0.7RC1 appdir is ignored and everything is under /tmp
      # Now take a 5~30 minute nap depending on your deployment. Long story.
      Official documentation: https://www.kubeflow.org/docs/started/ (not yet updated to 0.7rcs)
      I am R.
  16. 16. Using Spark on Kubeflow options: ● Use the spark-operator ○ Use PR#441 ○ Or helm install the operator into the KF namespace ● Add Spark to your notebook image ○ This one kind of broke with new KF & Istio ○ If you understand istio reasonably well please come talk to me after ● Wait for PR#441 to be merged ● Wait for 0.7 ● Or wait for 1.0 Eden, Janine and Jim
  17. 17. Using the spark-operator:
      apiVersion: "sparkoperator.k8s.io/v1beta2"
      kind: SparkApplication
      metadata:
        name: spark-pi
        namespace: kubeflow
      spec:
        type: Scala
        mode: cluster
        image: "gcr.io/spark-operator/spark:v2.4.4"
        imagePullPolicy: Always
      masatsu
  18. 18. Using spark in a notebook on KF*: *Does not work in 0.7. Needs more istio magic slgckgc
  19. 19. Using spark in a notebook on KF*: *Does not work in 0.7, I think I need to do more istio magic agirlnamednee
  20. 20. No demo, but cat picture: ● We're going to skip the demo because Kubeflow 0.7 RCs aren't quite there ○ RC1 broke the webui on GCP :( jeri leandera
  21. 21. A "traditional" Spark ML pipeline (1 of 2):
      val extensionIndexer = new StringIndexer()
        .setHandleInvalid("keep") // Some files have no extensions
        .setInputCol("extension")
        .setOutputCol("extension_index")
      val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
      val word2vec = new Word2Vec().setInputCol("tokens").setOutputCol("wordvecs")
      alljengi
  22. 22. A "traditional" Spark ML pipeline (2 of 2): prepPipeline.setStages(Array( extensionIndexer, tokenizer, word2vec, featureVec, classifier)) alljengi
  23. 23. Splitting our Spark pipeline prepPipeline.setStages(Array( extensionIndexer, tokenizer, word2vec, featureVec)) Roy Wolfe
  24. 24. Saving the results in a TF friendly way ● Use Tensorflow's specific format ○ Build https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector and add as a jar with --jars (or publish to local maven) ○ That seems… stable ● Use CSVs ○ What could go wrong? Oh you have strings… well… oh and a byte array… ummm... ● Wait for Tensorflow/TFX to adopt Arrow as an input format ○ https://github.com/tensorflow/community/pull/162 Maxime Goossens
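The CSV worry on this slide is easy to reproduce with nothing but the Python standard library. The sketch below is only an illustration: the row contents are made up, and base64 is one possible workaround assumed here, not something the talk prescribes.

```python
import base64
import csv
import io

# Hypothetical feature row: a string with a comma and a newline, raw bytes, a float.
row = ["README.md", "fix typo, add docs\nsecond line", b"\x00\x01", 3.0]


def round_trip(values):
    """Write one row as CSV and read it straight back."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerow(values)
    buf.seek(0)
    return next(csv.reader(buf))


naive = round_trip(row)
# Everything comes back as text: the bytes column is now the repr of a bytes
# object and the float is the string "3.0", so a reader must re-parse each column.

# One possible workaround (an assumption, not from the talk): base64-encode
# binary columns before writing and decode them after reading.
encoded = round_trip(
    [row[0], row[1], base64.b64encode(row[2]).decode("ascii"), row[3]]
)
restored = base64.b64decode(encoded[2])

print(naive)
print(restored == row[2])
```

The quoted text column survives because the csv module handles the comma and newline, but the type information does not, which is the part a TF input pipeline would have to reconstruct.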
  25. 25. Making a TF job to use the Spark job output ● There are many great introductory Tensorflow resources ○ And a talk @ 4pm introducing Tensorflow 2 you should totally check out ● That our output came from Spark doesn't matter to TF ○ Although the format does ● Although when we go to do inference things are complicated ○ This is where tensorflow-transform could be really awesome Oregon State University
  26. 26. Putting our training together in a KF pipeline ● KF pipeline documentation: https://www.kubeflow.org/docs/pipelines/overview/concepts/pipeline/ ● Python! ● Can involve a surprising amount of YAML templating
  27. 27. Our options for Spark in a pipeline: ● We can use the Kubeflow pipeline DSL elements + Spark operator ○ "ResourceOp" - create a Spark job ● We can also use the Kubeflow pipeline DSL elements + notebook ○ Each "step" will set up and tear down the Spark cluster, so do your Spark work in one step
  28. 28. Spark in KF Pipeline using the operator from PR
      import kfp.dsl as dsl
      import kfp.gcp as gcp
      import kfp.onprem as onprem
      from string import Template
      import json
      @dsl.pipeline(
          name='Simple spark pipeline demo',
          description='Shows how to use Spark operator inside KF'
      )
  29. 29. Spark in KF Pipeline using the operator from PR
      def spark_hello_world_pipeline(
              jar_location="gcs://....",
              tf_job_image="..."):
          spark_json_template = Template("""
          { "apiVersion": "sparkoperator.k8s.io/v1beta2",
            "kind": "SparkApplication",
            "metadata": {
              "name": "spark-frank",
              "namespace": "kubeflow"},
  30. 30. Spark in KF Pipeline using the operator from PR
            "spec": {
              "type": "Scala",
              "mode": "cluster",
              "mainApplicationFile": "$jar_location" }
          }""")
          # Fill in the jar location, then parse the JSON into a dict
          spark_json = spark_json_template.substitute({
              'jar_location': jar_location})
          spark_job = json.loads(spark_json)
  31. 31. Spark in KF Pipeline using the operator from PR
          spark_resource = dsl.ResourceOp(
              name='spark-job',
              k8s_resource=spark_job,
              success_condition='status.state == Succeeded')
          # Run the training container only after the Spark job
          train = dsl.ContainerOp(
              name='train',
              image=tf_job_image,
          ).after(spark_resource)
      *Getting better (https://github.com/kubeflow/pipelines/issues/677)
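The manifest-building step from these slides can be tried on its own, without kfp or a cluster. This is a minimal sketch using only the standard library; the jar path is a made-up placeholder and the helper name `build_spark_job` is hypothetical.

```python
import json
from string import Template

# Same fields as the SparkApplication in the slides. The outer object has to
# be closed with its own brace for json.loads to accept it.
SPARK_JSON_TEMPLATE = Template("""
{ "apiVersion": "sparkoperator.k8s.io/v1beta2",
  "kind": "SparkApplication",
  "metadata": {
    "name": "spark-frank",
    "namespace": "kubeflow"},
  "spec": {
    "type": "Scala",
    "mode": "cluster",
    "mainApplicationFile": "$jar_location" }
}""")


def build_spark_job(jar_location):
    """Fill in the template and parse it into a plain dict."""
    spark_json = SPARK_JSON_TEMPLATE.substitute({"jar_location": jar_location})
    return json.loads(spark_json)


# Hypothetical jar path, for illustration only.
spark_job = build_spark_job("gs://example-bucket/example.jar")
print(json.dumps(spark_job, indent=2))
```

Parsing the JSON locally means a malformed template fails right away, rather than when the cluster rejects the resource.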
  32. 32. Spark in KF Pipeline using the notebook ● Take the previous notebook code and add annotations
      @dsl.python_component(
          name='spark_job',
          description='does_fancy_dataprep',
          base_image=BASE_IMAGE  # Pick the same image as your notebook is in
      )
      abbeyprivate
  33. 33. Spark in KF Pipeline using the notebook Then add some boilerplate to create the pipeline in KF:
      import kfp
      import kfp.compiler as compiler
      pipeline_filename = pipeline_func.__name__ + '.pipeline.zip'
      compiler.Compiler().compile(pipeline_func, pipeline_filename)
      client = kfp.Client()
      experiment = client.create_experiment(EXPERIMENT_NAME)
      run_name = pipeline_func.__name__ + ' run'
      run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)
  34. 34. Run that KF pipeline (demo) We can either role play the demo: ● Someone be the cluster I deployed last night ● Now pretend to be dead ● I'll get frustrated ● Someone else be the cluster I deployed right now. Wait until the end of the talk before saying anything else when we come back to the demo. ● Now say the deployment failed because of SSL. Or I can show you how KF 0.7RC1 breaks right now! Wolfgang Lonien
  35. 35. So how do we do serving? ● Manually hand translate Apache Spark output to constant TF operations ○ Please don't do this ● Write code to do the above after doing it twice ● Use Seldon + wrap your pipeline to do feature transformation, roughly: Kubeflow Ambassador Seldon Feature Prep (Exported from Spark) Fancy deep learning model mliu92
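To make the "hand translate" option concrete, here is a rough sketch of re-implementing one exported Spark stage at serving time: a StringIndexer fitted with handleInvalid set to "keep", which sends unseen values to an extra bucket at index numLabels. The label list and function names are hypothetical, and this is plain Python rather than TF operations.

```python
# Hypothetical labels, in the order a fitted StringIndexerModel might report them.
EXTENSION_LABELS = ["scala", "py", "md"]


def make_indexer(labels):
    """Return a lookup that mimics StringIndexer with handleInvalid="keep"."""
    table = {label: float(i) for i, label in enumerate(labels)}
    unseen_index = float(len(labels))  # extra bucket for values not seen in training

    def index(value):
        return table.get(value, unseen_index)

    return index


extension_index = make_indexer(EXTENSION_LABELS)

print(extension_index("py"))  # a label seen during training
print(extension_index("rs"))  # an unseen label falls into the extra bucket
```

Repeating this by hand for every stage (tokenizer, word2vec, and so on) is the work the slide asks you not to do, which is why it goes on to suggest wrapping the exported feature prep instead.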
  36. 36. There are alternatives: ● You can integrate TF into Spark instead of separate steps ○ TensorflowOnSpark (mostly) from Yahoo ○ Manually using UDFs ● Spark's deep learning pipelines ○ You can use tools other than Tensorflow ● Tensorflow Transform on Beam On Spark + Tensorflow ○ Doesn't work yet, but hopefully in the future ● Not bothering EmsiProduction
  37. 37. Some ending notes ● Kubeflow is very early (pre-1.0) ● Don't just replace systems for the sake of it ● This is only one of many ways to do things ● Please be careful James Petts
  38. 38. Related Links: ● Kubeflow: https://www.kubeflow.org/ ● Introduction to ML With Kubeflow Book*: http://www.introtomlwithkubeflow.com ● Book example repo (it's… a work in progress): https://github.com/intro-to-ml-with-kubeflow/intro-to-ml-with-kubeflow-examples ● KF + Spark 0.7 PR: https://github.com/kubeflow/manifests/pull/441 ● KF Serving: https://github.com/kubeflow/kfserving ● Seldon: https://www.seldon.io/ Becky Lai
  39. 39. Interested in OSS (especially Spark)? ● Join me at this afternoon's talk on how to contribute to Spark @ 15:20 ○ In the AUDITORIUM ● Check out my Twitch & Youtube for live streams (including one tomorrow) - ○ http://twitch.tv/holdenkarau ○ https://www.youtube.com/user/holdenkarau alljengi
  40. 40. ● Learning Spark ● Fast Data Processing with Spark (Out of Date) ● Fast Data Processing with Spark (2nd edition) ● Advanced Analytics with Spark ● Spark in Action ● High Performance Spark ● Learning PySpark
  41. 41. High Performance Spark! Cats love it! Amazon sells it: http://bit.ly/hkHighPerfSpark :D
  42. 42. Sign up for the mailing list @ http://www.distributedcomputing4kids.com
  43. 43. Sparkling Pink Panda Scooter group photo by Kenzi k thnx bye! (or questions…) Give feedback on this presentation http://bit.ly/holdenTalkFeedback I'll be in the hallway or you can email me: holden@pigscanfly.ca
  • YinanLi

    Jan. 30, 2020
