
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura

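This talk explains the set of Operators developed in the Kubeflow project, which aims to make machine learning applications manageable with Kubernetes. These Operators make it easy to deploy machine learning jobs to Kubernetes.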

[Correction]
Slide 16 lists PyTorch and Caffe2 as Parameter Server Style; the correct classification is AllReduce Style.

Published in: Technology

  1. 1. Shingo Omura (@everpeace), Preferred Networks, Inc. Kubeflow Meetup #1 2018-09-26 (Cloud Native Meetup Tokyo #5) Kubeflow Operators 1
  2. 2. Shingo Omura • Engineer, Preferred Networks, Inc. • Dev/Ops in-house GPU clusters • chainer usability improvement on clouds • kubeflow/chainer-operator developer – spin up distributed chainer jobs with one yaml !! • @everpeace (twitter) • shingo.omura (facebook) 2 We’re Hiring!!
  3. 3. Shingo Omura Key Note at July Tech Festa 2018 SlideShare Kubernetes Meetup Tokyo #13 28th(Fri) at Yahoo Japan!!! 3 Please Join!!
  4. 4. Today’s Topic 4 c.f. Kubeflow Deep Dive – David Aronchick & Jeremy Lewi, Google, KubeCon + CloudNativeCon Europe 2018 Training!!
  5. 5. Kubeflow supports multiple ML frameworks New!! 0.3.0 HOROVOD 5
  6. 6. How? ➔ Operators and CRDs!! What is a CRD!? kind: CustomResourceDefinition … spec: kind: MyKind What is an Operator!? Operator Icons made by Gregor Cresnar from www.flaticon.com is licensed by CC 3.0 BY 6
  7. 7. What is a CRD!? Custom Resource Definition: kind: CustomResourceDefinition … spec: kind: MyKind. Custom Resource: kind: MyKind metadata: name: my-name. Icons made by Gregor Cresnar, Kiranshastry, Icon Pond, Icon Monk from www.flaticon.com is licensed by CC 3.0 BY 7
  8. 8. What is Operator !? Icons made by Gregor Cresnar, Kiranshastry, Icon Pond, Icon Monk from www.flaticon.com is licensed by CC 3.0 BY kind: MyKind metadata: name: my-name Custom Resource & Cluster State Cluster State Operator 8
  9. 9. Kubeflow’s multi ML framework support. apiVersion: kubeflow.org/v1alpha* kind: **Job ... CRDs: TFJob, PyTorchJob, MPIJob, MXJob, Caffe2Job, ChainerJob. Operators: tf-operator, pytorch-operator, mpi-operator, mxnet-operator, caffe2-operator, chainer-operator. ksonnet packages: examples, pytorch-job, mpi-job, mxnet-job, (no pkg for caffe2), chainer-job. * mpi-operator supports Horovod jobs. * The examples package contains TFJob. Icons made by Gregor Cresnar, Kiranshastry, Icon Pond, Icon Monk, Freepik from www.flaticon.com is licensed by CC 3.0 BY 9
  10. 10. Kubeflow’s multi ML framework support. apiVersion: kubeflow.org/v1alpha* kind: **Job ... CRDs: TFJob, PyTorchJob, MPIJob, MXJob, Caffe2Job, ChainerJob. Operators: tf-operator, pytorch-operator, mpi-operator, mxnet-operator, caffe2-operator, chainer-operator. ksonnet packages: examples, pytorch-job, mpi-job, mxnet-job, (no pkg for caffe2), chainer-job. * mpi-operator supports Horovod jobs. * The examples package contains TFJob. All the CRDs support both single-node and multi-node machine learning jobs. Icons made by Gregor Cresnar, Kiranshastry, Icon Pond, Icon Monk, Freepik from www.flaticon.com is licensed by CC 3.0 BY 10
  11. 11. A CLI-supported framework for extensible Kubernetes configurations ksonnet 11
  12. 12. ksonnet saves us from editing lengthy YAML files! 12
  13. 13. ksonnet saves us from editing lengthy YAML files! apiVersion: kubeflow.org/v1alpha2 kind: TFJob metadata: name: sample namespace: user-omura spec: tfReplicaSpecs: Ps: template: spec: containers: - args: - python - tf_cnn_benchmarks.py - --batch_size=32 - --model=resnet50 - --variable_update=parameter_server - --flush_stdout=true - --num_gpus=1 - --local_parameter_device=cpu - --device=cpu - --data_format=NHWC image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3 name: tensorflow workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks restartPolicy: OnFailure tfReplicaType: PS Worker: replicas: 1 template: spec: containers: - args: - python - tf_cnn_benchmarks.py - --batch_size=32 - --model=resnet50 - --variable_update=parameter_server - --flush_stdout=true - --num_gpus=1 - --local_parameter_device=cpu - --device=cpu ……. 13
  14. 14. How Do Kubeflow Operators Work?? 14
  15. 15. Two Different Distributed Training Job Styles. Parameter Servers Style: Parameter servers ● calc gradient avgs ● send them back to Workers; Workers ● train (calc gradients) in parallel ● send them to parameter servers. All-Reduce Style: Workers ● train (calc gradients) in parallel ● exchange them with each other. Icons made by Eucalyp, Smashicons from www.flaticon.com is licensed by CC 3.0 BY 15
  16. 16. Two Different Distributed Training Job Styles. Parameter Servers Style: Parameter servers ● calc gradient avgs ● send them back to Workers; Workers ● train (calc gradients) in parallel ● send them to parameter servers. All-Reduce Style: Workers ● train (calc gradients) in parallel ● exchange them with each other. HOROVOD. Icons made by Eucalyp, Smashicons from www.flaticon.com is licensed by CC 3.0 BY 16
  17. 17. TFJob structure (Parameter Server style) apiVersion: kubeflow.org/v1alpha2 kind: TFJob spec: tfReplicaSpecs: cleanPodPolicy: ... # controls deletion of pods when a job terminates (Running, All, None) Chief: … # orchestrating training and performing tasks like checkpointing the model Evaluator: … # compute evaluation metrics as the model is trained Ps: … # parameter servers Worker: # the actual work of training the model. worker 0 might also act as the chief replicas: ... # number of replicas restartPolicy: # behaviour when they exit. (Always, OnFailure, ExitCode, Never) template: … # PodTemplate c.f. https://www.kubeflow.org/docs/guides/components/tftraining/ 17
  18. 18. Anatomy of TFJobs. tf-operator: ● expands a TFJob into Pods and a Service ● retries when pods exit, according to restartPolicy ● cleans up pods when the job finishes, according to cleanPodPolicy. (Diagram labels: TFJob, tf-operator, k8s, Pods, Service.) Icons made by Gregor Cresnar, Kiranshastry, Icon Pond, Icon Monk from www.flaticon.com is licensed by CC 3.0 BY 18
  19. 19. ChainerJob structure (All-Reduce style) apiVersion: kubeflow.org/v1alpha2 kind: ChainerJob spec: backend: mpi # defines the protocol used to initiate process groups (only ‘mpi’ is supported now) master: # initiates and orchestrates the distributed job activeDeadlineSeconds: # same as in JobSpec backoffLimit: # same as in JobSpec ... workerSets: # a set of workerSets (for defining heterogeneous workers) workerSetName: # your own workerSet name replicas: # number of replicas of the workerSet mpiConfig: # you can define the number of slots for each worker template: # PodTemplate c.f. https://www.kubeflow.org/docs/guides/components/chainer/ 19
  20. 20. Anatomy of ChainerJob. chainer-operator: ● expands a ChainerJob into a ConfigMap, Job, Service and StatefulSets ● fault tolerance is borrowed from Job and StatefulSets ● scales down when the job finishes, for cleanup. (Diagram labels: ChainerJob, chainer-operator, k8s, Job, Pods, Service, StatefulSets, ConfigMap.) Icons made by Gregor Cresnar, Kiranshastry, Icon Pond, Icon Monk from www.flaticon.com is licensed by CC 3.0 BY 20
  21. 21. Demo Time!! demo script Icons made by Eucalyp from www.flaticon.com is licensed by CC 3.0 BY 21
  22. 22. PFN is looking for people who want to take on the challenge of building efficient and flexible machine learning clusters together with us. https://www.preferred-networks.jp/jobs We’re Hiring!! 22
  23. 23. Icons made by Vincent Le Moign from https://icon-icons.com/ licensed by CC 3.0 BY Thank you for Listening!! Any Questions? 23
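
One way to read slide 8 is that an operator takes a custom resource plus the current cluster state and produces a new cluster state. The toy sketch below illustrates that idea on in-memory data only. The names `reconcile` and `apply_changes`, the dict-shaped `job`, and the `<name>-worker-<i>` pod naming scheme are all hypothetical; nothing here calls the Kubernetes API or reproduces the code of any Kubeflow operator.

```python
def reconcile(job, pods):
    """Compute the changes needed so that `pods` matches `job`.

    `job` is a hypothetical stand-in for a custom resource:
    {"name": str, "replicas": int}.
    `pods` is the current cluster state, modelled as a set of pod names.
    Returns (names to create, names to delete), both sorted.
    """
    # Hypothetical naming scheme, chosen only for this illustration.
    desired = {f"{job['name']}-worker-{i}" for i in range(job["replicas"])}
    to_create = sorted(desired - pods)
    to_delete = sorted(pods - desired)
    return to_create, to_delete


def apply_changes(pods, to_create, to_delete):
    """Apply the computed changes to the in-memory cluster state."""
    return (pods | set(to_create)) - set(to_delete)
```

Calling `reconcile` again on the state returned by `apply_changes` yields two empty lists, which is the property that lets a loop of this kind run repeatedly without side effects once the state has converged.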

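Slides 15 and 16 contrast the parameter-server style with the all-reduce style. As a rough sketch of the difference, the code below averages gradients both ways on plain Python lists. The function names are hypothetical, there is no networking, and the all-reduce variant is a naive all-to-all exchange rather than the ring algorithm that real libraries typically use.

```python
def parameter_server_step(worker_grads):
    """Parameter-server style.

    Workers send their gradients to a server, the server averages them,
    and the same average is sent back to every worker.
    """
    n = len(worker_grads)
    avg = [sum(vals) / n for vals in zip(*worker_grads)]  # computed on the server
    return [list(avg) for _ in worker_grads]              # sent back to each worker


def all_reduce_step(worker_grads):
    """All-reduce style.

    There is no server. Every worker receives its peers' gradients and
    computes the average locally (naive exchange, for illustration only).
    """
    n = len(worker_grads)
    results = []
    for _ in worker_grads:  # each worker performs the same reduction
        results.append([sum(vals) / n for vals in zip(*worker_grads)])
    return results
```

Both functions return one averaged gradient per worker, and the values agree; what differs is where the averaging happens, which is the distinction the slides draw.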