Intro - End to end ML with Kubeflow @ SignalConf 2018


There are many great tools for training machine learning models, ranging from scikit-learn to Apache Spark and TensorFlow. However, many of these systems largely leave open the question of how to use our models outside of the batch world (for example, in a reactive application). Different options exist for persisting the results and using them for live training, and we will explore the trade-offs of the different formats and their corresponding serving/prediction layers.

Published in: Internet

  1. 1. End to End ML With Kubeflow & friends @holdenkarau Signal 2018 Legit-enough
  2. 2. Some links (slides & recordings will be at): ^ Slides & code-lab links (after) CatLoversShow
  3. 3. Holden: ● Preferred pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC/Committer, contribute to many other projects ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share ● Code review livestreams: / ● Spark Talk Videos
  4. 4. Who do I think you all are? ● Nice people* ● Interested in Machine Learning ● Possibly Familiar with one of Java, Scala, or Python Amanda
  5. 5. What is in store for our adventure? ● We have 30 minutes :) ● Brief intros to what Kubernetes, Spark, and Kubeflow are ● How to train a model (ish) ● How to serve a model (ish) ● Scaling (ish) ● Updating models and other scary thoughts Ada Doglace
  6. 6. What is Kubernetes?
  7. 7. What is Spark? ● General purpose distributed system ○ With a really nice API including Python :) ● Apache project ● Faster than Hadoop Map/Reduce ● Good when too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets ● Has ML Libraries ● WIP Kubeflow integration PR 1467
  8. 8. The different pieces of Spark [diagram] ● Apache Spark ● SQL, DataFrames & Datasets ● Structured Streaming ● Spark ML ● MLLib ● Streaming ● bagel & Graph X ● Graph Frames ● Language APIs: Scala, Java, Python, & R Paul Hudson
  9. 9. So what does ML look like?
  10. 10. Code on Laptop
  11. 11. Train Model on ML-Rig Photo by Tomomi
  12. 12. Deploy to Production
  13. 13. Problem: Models are Cool, Feature prep is Hard Training is Tedious, Everyone Forgot Deployment
  14. 14. What is Kubeflow?
  15. 15. What is Kubeflow?
  16. 16. What is Kubeflow? “Data Scientists” Model Serving On Kube Model Training *
  17. 17. What is Kubeflow? “Kubeflow is a Cloud Native platform for machine learning based on Google’s internal machine learning pipelines.” or: ● The recognition that just a bunch of model weights isn’t enough ● Designed to support the ecosystem of tools needed (from data prep to serving) ● Open source project :) Ada Doglace
  18. 18. Really just want to replace this: Photo by: Milestoned
  19. 19. So you want to use this?
  20. 20. What’s Next?! ● Step away from keyboard ● Think about type(s) of model ● Look at components directory and see what’s a fit tool wise ● Don’t know? Choose jupyter, deal with the details live ● Can’t find it?
  21. 21. Containers Buffet argo automation chainer-job core credentials-pod-preset katib mpi-job mxnet-job openmpi pachyderm pytorch-job seldon tf-serving weaveflux
  22. 22. What about just the basics?* ./scripts/ init ${KFAPP} --platform gcp --project ${PROJECT} cd ${KFAPP} ../scripts/ generate platform ../scripts/ apply platform ../scripts/ generate k8s ../scripts/ apply k8s
  23. 23. What about just tensorflow?* ks registry add kubeflow${VERSION}/kubeflow ks pkg install kubeflow/core@${VERSION} ks pkg install kubeflow/tf-serving@${VERSION} ks pkg install kubeflow/tf-job@${VERSION}
  24. 24. Ok well I need to be able to access Jupyter too... kubectl port-forward -n ${NAMESPACE} `kubectl get pods -n ${NAMESPACE} --selector=service=ambassador -o jsonpath='{.items[0].metadata.name}'` 8080:80
  25. 25. Your Special ML Training Goes here Don’t have any pressing projects but still want to have fun? Check out Michelle’s notebook for Github Issue summarization. Or want to see mnist again? here :)
  26. 26. Your Special ML Training Goes here ... from keras.callbacks import CSVLogger, ModelCheckpoint script_name_base = 'tutorial_seq2seq' csv_logger = CSVLogger('{:}.log'.format(script_name_base)) model_checkpoint = ModelCheckpoint('{:}.epoch{{epoch:02d}}-val{{val_loss: .5f}}.hdf5'.format(script_name_base), save_best_only=True)
  27. 27. Your Special ML Training Goes here history =[encoder_input_data, decoder_input_data], np.expand_dims(decoder_target_data, -1), batch_size=batch_size, epochs=epochs, validation_split=0.12, callbacks=[csv_logger, model_checkpoint]) Really just check out Michelle’s notebook for Github Issue summarization.
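The checkpoint filename on the callbacks slide is built with two rounds of string formatting, which is easy to misread. The sketch below uses only plain Python and assumes the callback later fills in `epoch` and `val_loss` with `str.format`; `checkpoint_name` is a hypothetical helper for illustration, not part of Keras.

```python
# Built the same way as on the slide: the outer .format() fills in the script
# name and collapses the doubled braces into single ones.
script_name_base = 'tutorial_seq2seq'
template = '{:}.epoch{{epoch:02d}}-val{{val_loss: .5f}}.hdf5'.format(script_name_base)


def checkpoint_name(template, epoch, val_loss):
    """Hypothetical helper: expand the template once per epoch (assumed behaviour)."""
    return template.format(epoch=epoch, val_loss=val_loss)


if __name__ == '__main__':
    print(template)
    print(checkpoint_name(template, 3, 0.12345))
```

Note the space in `{val_loss: .5f}`: it is the "space" sign option, so non-negative losses are written with a leading space in the filename. Dropping it (`{val_loss:.5f}`) avoids filenames containing spaces.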
  28. 28. But what about [special foo-baz-inator] or [special-yak-shaving-tool]? Write a Dockerfile and build an image, use FROM so you’re not starting from scratch. FROM RUN pip install py-special-yak-shaving-tool Then set it as a param for your training/serving job as needed: ks param set tfjob-v1alpha2 image "my-special-image-goes-here"
  29. 29. What about that magical feature prep? For now it’s mostly a write-by-hand situation. However, TFX has some cool tools we can use today (like TF.Transform) if we’re ok with DirectRunner or Dataflow (with Flink support in the works indirectly).
  30. 30. Enter: TF.Transform ● For pre-processing of your data ● e.g. where you spend 90% of your dev time anyways ● Integrates into serving time :D ● OSS ● Runs on top of Apache Beam, but current release not yet scalable outside of GCP ● On Apache Beam master this can run-ish on Flink, but rough ● Please don’t use this in production today unless you’re on GCP/Dataflow Kathryn Yengel
  31. 31. Defining a Transform processing function def preprocessing_fn(inputs): x = inputs['x'] y = inputs['y'] s = inputs['s'] x_centered = x - tft.mean(x) y_normalized = tft.scale_to_0_1(y) s_int = tft.string_to_int(s) return { 'x_centered': x_centered, 'y_normalized': y_normalized, 's_int': s_int}
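To make the three calls in `preprocessing_fn` concrete, here is a rough plain-Python sketch of what each one computes on a small in-memory column. It is not how TF.Transform is implemented; the function names, the constant-column behaviour, and the tie-breaking rule for the vocabulary are assumptions made for illustration.

```python
from collections import Counter


def center(values):
    """Subtract the column mean from every value (like x - tft.mean(x))."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def scale_to_0_1(values):
    """Min-max scale a column into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption of this sketch: a constant column maps to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def string_to_int(values):
    """Map strings to integer ids, most frequent first.

    Ties are broken alphabetically here; that is a choice of this sketch.
    """
    counts = Counter(values)
    vocab = sorted(counts, key=lambda s: (-counts[s], s))
    index = {s: i for i, s in enumerate(vocab)}
    return [index[v] for v in values]


def preprocess(inputs):
    """Plain-Python stand-in for the slide's preprocessing_fn."""
    return {
        'x_centered': center(inputs['x']),
        'y_normalized': scale_to_0_1(inputs['y']),
        's_int': string_to_int(inputs['s']),
    }


if __name__ == '__main__':
    print(preprocess({'x': [1.0, 2.0, 3.0],
                      'y': [10.0, 20.0, 30.0],
                      's': ['a', 'b', 'a']}))
```

Each helper needs a statistic of the whole column (mean, min/max, vocabulary) before it can touch a single row, which is why TF.Transform needs a full pass over the data first.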
  32. 32. Analyzers (mean, stddev, quantiles): ● Reduce (full pass) ● Implemented as a distributed data pipeline Transforms (normalize, multiply, bucketize): ● Instance-to-instance (don’t change batch dimension) ● Pure TensorFlow
  33. 33. Analyze [diagram: the Analyze step runs over the data and replaces mean, stddev, and quantiles with constant tensors, leaving a graph of normalize, multiply, and bucketize]
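The analyzer/transform split can be sketched in plain Python: one full pass over the training data produces a handful of constants, and after that each instance is transformed on its own using only those constants. The names `Constants`, `analyze`, and `transform`, and the simple quantile rule, are hypothetical and chosen for this sketch only.

```python
import bisect
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Constants:
    """What the analysis pass leaves behind (the 'constant tensors')."""
    mean: float
    stddev: float
    boundaries: List[float]


def analyze(column: List[float], num_buckets: int) -> Constants:
    """Full pass over the data: compute mean, stddev, and bucket boundaries."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((v - mean) ** 2 for v in column) / n
    ordered = sorted(column)
    # Simple quantile rule for illustration; real implementations approximate.
    boundaries = [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]
    return Constants(mean, variance ** 0.5, boundaries)


def transform(value: float, constants: Constants) -> Tuple[float, int]:
    """Instance-to-instance: needs only one value plus the frozen constants."""
    if constants.stddev:
        normalized = (value - constants.mean) / constants.stddev
    else:
        normalized = 0.0
    bucket = bisect.bisect_right(constants.boundaries, value)
    return normalized, bucket


if __name__ == '__main__':
    constants = analyze([1.0, 2.0, 3.0, 4.0], num_buckets=2)
    # The same constants are reused for every training row and at serving time.
    print(transform(2.5, constants))
    print(transform(10.0, constants))
```

Because `transform` never looks at the rest of the data, the same function can run at serving time on a single request, which is the point of freezing the analyzer outputs.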
  34. 34. Some common use-cases... ● Scale to ...: tft.scale_to_z_score ... ● Bag of Words / N-Grams: tf.string_split, tft.ngrams, tft.string_to_int ● Bucketization: tft.quantiles, tft.apply_buckets ● Feature Crosses: tf.string_join, tft.string_to_int
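Two of these use-cases, n-grams and feature crosses, are simple enough to sketch in plain Python. This only illustrates the idea; the function names, separators, and output ordering are assumptions of the sketch and need not match what the TF.Transform functions produce.

```python
def ngrams(tokens, n_min, n_max, sep=' '):
    """Return all n-grams of length n_min..n_max, shortest first."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(sep.join(tokens[i:i + n]))
    return out


def feature_cross(*values, sep='_'):
    """Join several categorical values into one combined categorical value."""
    return sep.join(str(v) for v in values)


if __name__ == '__main__':
    # Split, then build unigrams and bigrams (a bag of words plus word pairs).
    print(ngrams('cats are great'.split(), 1, 2))
    # Cross two categorical features into a single one.
    print(feature_cross('US', 'mobile'))
```

In both cases the resulting strings would then go through a string-to-integer vocabulary step before being fed to a model.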
  35. 35. BEAM Beyond the JVM: Current release ● Non JVM BEAM doesn’t work outside of Google’s environment yet ● tl;dr : uses grpc / protobuf ○ Similar to the common design but with more efficient representations (often) ● But exciting new plans to unify the runners and ease the support of different languages (called SDKs) ○ See ● If this is exciting, you can come join me on making BEAM work in Python3 ○ Yes we still don’t have that :( ○ But we're getting closer & you can come join us on BEAM-2874 :D Emma
  36. 36. Serving: TF is probably easiest for now... MODEL_COMPONENT=my-model-server MODEL_NAME=cat-finder-3k ks generate tf-serving ${MODEL_COMPONENT} --name=${MODEL_NAME} ks param set ${MODEL_COMPONENT} deployHttpProxy true ks param set ${MODEL_COMPONENT} modelPath ${MODEL_PATH} ks apply ${KF_ENV} -c ${MODEL_COMPONENT}
  37. 37. Or use Seldon Core & friends* Seldon Core is an OSS platform for deploying ML models on Kubernetes, supported by Kubeflow. Supports many model types/formats: ● TensorFlow ● Sklearn ● Spark ML** ● R ● H2O
  38. 38. Set up seldon core for serving # Gives cluster-admin role to the default service account kubectl create clusterrolebinding seldon-admin --clusterrole=cluster-admin --serviceaccount=${NAMESPACE}:default # Install the kubeflow/seldon package ks pkg install kubeflow/seldon # Generate the seldon component and deploy it ks generate seldon seldon --name=seldon
  39. 39. Build an image with your model* docker run -v $(pwd):/my_model seldonio/core-python-wrapper:0.7 /my_model IssueSummarization 0.1 --base-image=python:3.6 --image-name=gcr-repository-name/my-image-name
  40. 40. And kick off the new model: ks generate seldon-serve-simple new-serving-magic --name=model-name --namespace=${NAMESPACE} --replicas=2 ks apply ${KF_ENV} -c new-serving-magic
  41. 41. Wait so how do I use this? Your favourite rest library goes here* Timeouts matter! Doing recommendations? Have fall-backs Have multiple models? fall-backs *Need to use in batch? Maybe skip seldon, tf-serving & friends and integrate the library into your code. Or not. Trish Hamme
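A minimal sketch of the "timeouts matter, have fall-backs" advice, using only the Python standard library. The function names are hypothetical, and the prediction URL and JSON payload shape are placeholders rather than the actual tf-serving or Seldon request format, so check your serving layer's docs for the real one.

```python
import json
import urllib.request
from typing import Any, Callable, Dict

Predictor = Callable[[Dict[str, Any]], Dict[str, Any]]


def http_predict(url: str, payload: Dict[str, Any], timeout_s: float) -> Dict[str, Any]:
    """POST a JSON payload to a model server, giving up after timeout_s seconds."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode('utf-8'))


def predict_with_fallback(primary: Predictor, fallback: Predictor,
                          payload: Dict[str, Any]) -> Dict[str, Any]:
    """Try the primary model; on network errors, timeouts, or bad JSON use the fallback."""
    try:
        return primary(payload)
    except (OSError, ValueError):
        # URLError and timeouts are OSError subclasses; malformed JSON is a ValueError.
        return fallback(payload)


if __name__ == '__main__':
    def popular_items(_payload):
        # Cheap fallback: a static "most popular" list instead of a model call.
        return {'items': ['most-popular-1', 'most-popular-2']}

    def model(payload):
        # Placeholder URL; nothing is expected to be listening here.
        return http_predict('http://localhost:8080/hypothetical/predict', payload, 0.5)

    print(predict_with_fallback(model, popular_items, {'user': 42}))
```

The fallback can itself be a second, cheaper model, which covers the "have multiple models?" case on the slide.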
  42. 42. Scaling - or ruh roh people are using this! replicas: 1 Becomes replicas: 10 Factor of 10 =~ “science”
  43. 43. Wait really? ● Early: switch from minikube to ${cloud provider} with GPUs ○ “Vertical” scaling ● Next: increase # of workers for training ○ “Horizontal” scaling ○ Auto-scaling also WIP per-backend for the most part ● Serving, # of replicas ○ Auto-scaling is a WIP Jennifer C.
  44. 44. What about validation? TensorFlow Data Validation (TFDV) Or roll your own? ● Counters & execution time most common ● Please also check % of data change Spark-validator (proof of concept) Please validate your pipelines, and not just for data changes: code changes too.
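For the roll-your-own route, a minimal sketch of comparing pipeline counters between the previous run and the current one and flagging large percentage changes. The counter names, the threshold, and the function names are made up for illustration; this is neither TFDV nor Spark-validator.

```python
from typing import Dict, List


def percent_change(previous: float, current: float) -> float:
    """Absolute change from previous to current, as a percentage of previous."""
    if previous == 0:
        return float('inf') if current else 0.0
    return abs(current - previous) / previous * 100.0


def validate_run(previous_counters: Dict[str, float],
                 current_counters: Dict[str, float],
                 max_percent_change: float) -> List[str]:
    """Return one message per counter that is missing or moved too much."""
    problems = []
    for name, previous in previous_counters.items():
        if name not in current_counters:
            problems.append(f'{name}: missing from current run')
            continue
        change = percent_change(previous, current_counters[name])
        if change > max_percent_change:
            problems.append(f'{name}: changed by {change:.1f}%')
    return problems


if __name__ == '__main__':
    last_run = {'records_read': 1000, 'records_dropped': 10, 'runtime_s': 120}
    this_run = {'records_read': 1050, 'records_dropped': 40, 'runtime_s': 125}
    # An empty list means the run looks similar enough to publish the new model.
    print(validate_run(last_run, this_run, max_percent_change=25.0))
```

Running the same check after a code change, with the input data held fixed, is one way to validate code changes as well as data changes.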
  45. 45. Demo!
  46. 46. Recorded Demos
  47. 47. Previously live demos recorded ● Kubeflow intro oduction/index.html & streamed ● Kubeflow E2E with GitHub issue summarization s/cloud-kubeflow-e2e-gis/ & streamed ● You can tell they were live streamed by how poorly they went; I promise no video editing has occurred. ● You can do these yourself too (including one of them at our booth)!
  48. 48. Join me & Boo @ Google’s booth @ 5PM And join my coworker Casey West @ 6 talking about: Building Captain Obvious: Understand Faster with Machine Learning APIs
  49. 49. Want to watch work on a Kubeflow PR? ● Join Holden Friday @ 2pm pacific for live coding, continuing work on her Apache Spark to Kubeflow PR (using the existing Spark operator as a base) ● Or just & like + subscribe + click the bell :p
  50. 50. k thnx bye :) Give feedback on this presentation