
Distributed TensorFlow on Kubernetes

Speaker: 蔣是文 (Mac Chiang)
10/23 Kubernetes「容企新航向」(A New Course for Containerized Enterprises) roadshow forum, Hsinchu session


  1. 1. Distributed TensorFlow on Kubernetes, Information and Communications Research Laboratories, ITRI. 蔣是文 Mac Chiang
  2. 2. Agenda • Kubernetes Introduction • Scheduling GPUs on Kubernetes • Distributed TensorFlow Introduction • Running Distributed TensorFlow on Kubernetes • Experience Sharing • Summary
  3. 3. What’s Kubernetes • “Kubernetes” is Greek for helmsman or pilot • Built on Google’s production experience and designed by Google • Kubernetes is a production-grade, open-source platform that orchestrates the placement (scheduling) and execution of application containers within and across computer clusters. • Masters manage the cluster, and the nodes host the running applications.
  4. 4. Why Kubernetes • Automatic bin packing • Horizontal scaling • Automated rollouts and rollbacks • Service monitoring • Self-healing • Service discovery and load balancing • 100% open source, written in Go
  5. 5. Scheduling GPUs on Kubernetes • NVIDIA drivers must be installed • The alpha feature gate Accelerators must be turned on across the system: --feature-gates="Accelerators=true" • Nodes must use Docker Engine as the container runtime
  6. 6. Scheduling GPUs on Kubernetes (cont.)
     apiVersion: v1
     kind: Pod
     metadata:
       name: gpu-pod
     spec:
       containers:
       - name: gpu-container-1
         image: gcr.io/google_containers/pause:2.0
         resources:
           limits:
             alpha.kubernetes.io/nvidia-gpu: 2  # requesting 2 GPUs
       - name: gpu-container-2
         image: gcr.io/google_containers/pause:2.0
         resources:
           limits:
             alpha.kubernetes.io/nvidia-gpu: 3  # requesting 3 GPUs
  7. 7. Scheduling on Different GPU Versions • Label nodes with the GPU hardware type • Specify the GPU type via Node Affinity rules (a sketch follows below) (Diagram: Node1 has 2 × Tesla K80 and is labeled gpu:k80; Node2 has 1 × Tesla P100 and is labeled gpu:p100)
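     The slide names the technique but shows no code. A minimal sketch under the diagram's assumptions: the node names node1/node2 and the label key gpu are illustrative, and the nodeAffinity stanza is the standard Kubernetes form rather than code from the talk.

     # Label each node with its GPU hardware type (node names are assumptions)
     kubectl label nodes node1 gpu=k80
     kubectl label nodes node2 gpu=p100

     # Pod that must be scheduled onto a P100 node via a Node Affinity rule
     apiVersion: v1
     kind: Pod
     metadata:
       name: gpu-pod-p100
     spec:
       affinity:
         nodeAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
             nodeSelectorTerms:
             - matchExpressions:
               - key: gpu
                 operator: In
                 values: ["p100"]
       containers:
       - name: gpu-container
         image: gcr.io/google_containers/pause:2.0
         resources:
           limits:
             alpha.kubernetes.io/nvidia-gpu: 1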
  8. 8. Access to CUDA libraries from Docker (Diagram: containers link against libcuda.so provided by the host's nvidia-driver installation; a sketch of one common approach follows below)
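     The slide is a diagram only. With the alpha GPU support of this Kubernetes era, a common way to give containers access to libcuda.so was to mount the host's NVIDIA driver directory with a hostPath volume. This is a hedged sketch; the host path and mount point are assumptions that depend on the driver version and base image.

     apiVersion: v1
     kind: Pod
     metadata:
       name: cuda-pod
     spec:
       containers:
       - name: gpu-container
         image: tensorflow/tensorflow:latest-gpu
         resources:
           limits:
             alpha.kubernetes.io/nvidia-gpu: 1
         volumeMounts:
         - name: nvidia-driver
           mountPath: /usr/local/nvidia   # where the CUDA app expects libcuda.so
       volumes:
       - name: nvidia-driver
         hostPath:
           path: /usr/lib/nvidia-375      # assumption: host driver directory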
  9. 9. TensorFlow • Originally developed by the Google Brain team within Google's Machine Intelligence research organization • An open-source software library for numerical computation using data flow graphs • Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them (see the sketch below) • Supports one or more CPUs or GPUs in a desktop, server, or mobile device with a single API
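     As a concrete illustration of the data-flow-graph model (not from the slides), a minimal TensorFlow 1.x program: the constants and the matmul are the graph's nodes, and the tensors flowing between them are its edges.

     import tensorflow as tf

     # Nodes: two constant ops and a matrix multiplication op.
     # Edges: the tensors flowing between them.
     a = tf.constant([[1.0, 2.0]])    # 1x2 tensor
     b = tf.constant([[3.0], [4.0]])  # 2x1 tensor
     product = tf.matmul(a, b)        # op node producing a 1x1 tensor

     with tf.Session() as sess:
         print(sess.run(product))     # [[11.]]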
  10. 10. Distributed TensorFlow http://www.inwinstack.com/index.php/zh/2017/04/17/tensorflow-on-kubernetes/
  11. 11. Distributed TensorFlow (cont.) • Replication ▪ In-graph ▪ Between-graph • Training ▪ Asynchronous ▪ Synchronous (a synchronous-training sketch follows below)
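     The deck shows no code for these modes. As a hedged illustration, synchronous between-graph training is commonly expressed with tf.train.SyncReplicasOptimizer, which aggregates gradients from the workers before applying a single update; asynchronous training simply uses the base optimizer directly. Here loss, global_step, and num_workers are assumed to come from the surrounding training script.

     # Synchronous updates: aggregate gradients from num_workers replicas
     # before applying them once to the shared variables.
     base_opt = tf.train.GradientDescentOptimizer(0.01)
     sync_opt = tf.train.SyncReplicasOptimizer(
         base_opt,
         replicas_to_aggregate=num_workers,
         total_num_replicas=num_workers)
     train_op = sync_opt.minimize(loss, global_step=global_step)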
  12. 12. Distributed TensorFlow on K8S • TensorFlow ecosystem ▪ https://github.com/tensorflow/ecosystem
  13. 13. Distributed TensorFlow on K8S (cont.) • Prepare the code for distributed training ▪ Flags for configuring the task ▪ Construct the cluster and start the server ▪ Set the device before graph construction
     import tensorflow as tf

     # Standard TF 1.x flag setup
     flags = tf.app.flags
     FLAGS = flags.FLAGS

     # Flags for configuring the task
     flags.DEFINE_integer("task_index", None,
                          "Worker task index, should be >= 0. task_index=0 is "
                          "the master worker task that performs the variable "
                          "initialization.")
     flags.DEFINE_string("ps_hosts", None,
                         "Comma-separated list of hostname:port pairs")
     flags.DEFINE_string("worker_hosts", None,
                         "Comma-separated list of hostname:port pairs")
     flags.DEFINE_string("job_name", None, "job name: worker or ps")

     # Construct the cluster and start the server
     ps_spec = FLAGS.ps_hosts.split(",")
     worker_spec = FLAGS.worker_hosts.split(",")
     cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": worker_spec})
     server = tf.train.Server(cluster,
                              job_name=FLAGS.job_name,
                              task_index=FLAGS.task_index)
     if FLAGS.job_name == "ps":
         server.join()

     # Set the device before graph construction
     with tf.device(tf.train.replica_device_setter(
             worker_device="/job:worker/task:%d" % FLAGS.task_index,
             cluster=cluster)):
         # Construct the TensorFlow graph.
         # Run the TensorFlow graph.
  14. 14. Distributed TensorFlow on K8S (cont.) • Build the Docker image ▪ Prepare a Dockerfile ▪ Build the Docker image ▪ Push the image to Docker Hub
     # Dockerfile
     FROM tensorflow/tensorflow:latest-gpu
     COPY mnist_replica.py /
     ENTRYPOINT ["python", "/mnist_replica.py"]

     # Build the Docker image
     docker build -t <image_name>:v1 -f Dockerfile .
     docker build -t macchiang/mnist:v7 -f Dockerfile .

     # Push the image to Docker Hub
     docker push <image_name>:v1
     docker push macchiang/mnist:v7
  15. 15. Distributed TensorFlow on K8S (cont.) • Image revision history on Docker Hub ▪ https://hub.docker.com/r/macchiang/mnist/tags/
  16. 16. Distributed TensorFlow on K8S (cont.) • Specify parameters in the Jinja template file ▪ name, image, worker_replicas, ps_replicas, script, data_dir, and train_dir ▪ You may optionally specify credential_secret_name and credential_secret_key if you need to read from and write to Google Cloud Storage • Generate the K8S YAML and create the services and pods ▪ python render_template.py mnist.yaml.jinja | kubectl create -f -
     command:
     - "/root/inception/bazel-bin/inception/imagenet_distributed_train"
     args:
     - "--data_dir=/data/raw-data"
     - "--task_index=0"
     - "--job_name=worker"
     - "--worker_hosts=inception-worker-0:5000,inception-worker-1:5000"
     - "--ps_hosts=inception-ps-0:5000"
  17. 17. Distributed TensorFlow on K8S (cont.) (Diagram: Worker0 and Worker1 pods, each behind its own Service on port 5000, plus a PS0 pod behind its Service on port 5000; the workers read training data from NFS and write training results to NFS. A sample worker Service sketch follows below.)
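     A hedged sketch of one of the diagram's Services, matching the inception-worker-0:5000 endpoint from the previous slide; the selector labels are assumptions about how the worker pods are labeled.

     apiVersion: v1
     kind: Service
     metadata:
       name: inception-worker-0
     spec:
       selector:
         job: worker    # assumed pod labels
         task: "0"
       ports:
       - port: 5000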
  18. 18. Distributed TensorFlow with CPUs • Setup: 35 containers on 35 nodes of the container orchestration platform; each node has 2 × Intel(R) Xeon(R) CPU E5620 @ 2.40GHz and 48GB memory • ImageNet data (145GB) read from NFS; training results written to NFS • Inception model ("Rethinking the Inception Architecture for Computer Vision") ▪ Training took 9.23 days
  19. 19. Summary • Kubernetes ▪ Production-grade container orchestration platform ▪ GPU resource management a. NVIDIA GPUs only for now b. In Kubernetes 1.8, you can use the NVIDIA device plugin (see the sketch below) » https://github.com/NVIDIA/k8s-device-plugin • Kubernetes + Distributed TensorFlow ▪ Easy to build a distributed training cluster ▪ Leverages Kubernetes advantages a. Restarts failed containers b. Monitoring c. Scheduling
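     With the device plugin mentioned above, pods request GPUs through the nvidia.com/gpu resource instead of the alpha.kubernetes.io/nvidia-gpu limit used earlier in this deck; a minimal sketch (the CUDA image tag is illustrative):

     apiVersion: v1
     kind: Pod
     metadata:
       name: gpu-pod
     spec:
       containers:
       - name: cuda-container
         image: nvidia/cuda:9.0-base   # illustrative image
         resources:
           limits:
             nvidia.com/gpu: 1         # requesting 1 GPU via the device plugin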
  20. 20. Thank you! macchiang@itri.org.tw Kubernetes Taiwan User Group
