
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent 2018

One of the main challenges customers face is running efficient deep learning training over multiple nodes. In this chalk talk, Uber discusses how to use Horovod, a distributed training framework, to speed up deep learning training on TensorFlow and PyTorch.


Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent 2018

  1. Uber on Using Horovod for Distributed Deep Learning (AIM411). Alex Sergeev, Horovod Project Lead, Uber Technologies, Inc. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Why deep learning?
     • Continues to improve accuracy long after older algorithms reach data saturation
     • State of the art for vision, machine translation, and many other domains
     • Capitalizes on the massive amount of research happening in the global community
     - Andrew Ng
  3. Massive amounts of data . . . make things slow (weeks!). Solution: distributed training. A p3.16xlarge has 128 GB of total GPU memory, so most models fit within a single server → use data-parallel training.
  4. Goals. There are many ways to do data-parallel training; some are more confusing than others, and the UX varies greatly. Our goals:
     1. Infrastructure people (like me) deal with choosing servers, network gear, the container environment, default containers, and tuning distributed training performance.
     2. ML engineers focus on making great models that improve the business, using the deep learning frameworks that they love.
  5. Meet Horovod (horovod.ai)
     • Library for distributed deep learning
     • Works with stock TensorFlow, Keras, and PyTorch
     • Installs on top with `pip install horovod`
     • Uses advanced algorithms and can leverage features of high-performance networks (RDMA, GPUDirect)
     • Separates infrastructure concerns from ML engineering: the infra team provides the container and MPI environment, ML engineers use the DL frameworks that they love, and both have consistent expectations for distributed training across frameworks
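     A quick way to confirm that the `pip install` picked up the framework plugin you need is simply to import and initialize Horovod. This check is not on the slides; it is only a minimal sanity test:

        import horovod
        import horovod.tensorflow as hvd  # raises ImportError if Horovod was built without TensorFlow support

        print(horovod.__version__)  # version string of the installed Horovod package
        hvd.init()                  # also works single-process; size() will simply be 1
        print('Horovod sees', hvd.size(), 'process(es)')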
  6. (no text content on this slide)
  7. Initialize the library:
        import horovod.tensorflow as hvd

        hvd.init()
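     Once hvd.init() has run, each process can query its place in the job. A small illustrative snippet (not on the slide):

        import horovod.tensorflow as hvd

        hvd.init()
        # size()       - total number of worker processes in the job
        # rank()       - this worker's global index, 0 .. size() - 1
        # local_rank() - this worker's index on its own server, used to pin a GPU
        print('workers:', hvd.size(), 'rank:', hvd.rank(), 'local rank:', hvd.local_rank())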
  8. Pin the GPU to be used:
        config = tf.ConfigProto()
        config.gpu_options.visible_device_list = str(hvd.local_rank())
  9. Adjust the learning rate and add the Horovod distributed optimizer:
        # learning_rate is scaled by the number of workers; the momentum value is illustrative
        opt = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(), momentum=0.9)
        opt = hvd.DistributedOptimizer(opt)
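     Why scale the learning rate? With synchronous data-parallel training the effective batch grows with the number of workers, so the step size is scaled to match. A worked example with assumed numbers (per-worker batch of 64 on 16 workers):

        per_worker_batch = 64
        workers = 16                                  # what hvd.size() would return on 2 x 8-GPU servers
        effective_batch = per_worker_batch * workers  # 1024 samples contribute to each update
        base_lr = 0.01
        scaled_lr = base_lr * workers                 # 0.16, i.e. lr = 0.01 * hvd.size() above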
  10. Synchronize initial state between workers:
        hooks = [hvd.BroadcastGlobalVariablesHook(0)]
        with tf.train.MonitoredTrainingSession(hooks=hooks, …) as mon_sess:
            …

        # Or
        bcast_op = hvd.broadcast_global_variables(0)
        sess.run(bcast_op)
  11. Use checkpoints only on the first worker:
        ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None
        with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir, …) as mon_sess:
            …
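     The same rank-0 guard is commonly applied to anything that writes shared output. A hedged companion example, not on the slide, using TF 1.x logging:

        import tensorflow as tf
        import horovod.tensorflow as hvd

        hvd.init()
        # Quiet every worker except rank 0, so 16 processes do not print identical logs.
        if hvd.rank() != 0:
            tf.logging.set_verbosity(tf.logging.ERROR)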
  12. Full example:
        import tensorflow as tf
        import horovod.tensorflow as hvd

        # Initialize Horovod
        hvd.init()

        # Pin GPU to be used
        config = tf.ConfigProto()
        config.gpu_options.visible_device_list = str(hvd.local_rank())

        # Build model...
        loss = ...
        # learning_rate is scaled by the number of workers; the momentum value is illustrative
        opt = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(), momentum=0.9)

        # Add Horovod Distributed Optimizer
        opt = hvd.DistributedOptimizer(opt)

        # Add hook to synchronize initial state
        hooks = [hvd.BroadcastGlobalVariablesHook(0)]

        # Only checkpoint on rank 0
        ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

        # Make training operation
        train_op = opt.minimize(loss)

        # The MonitoredTrainingSession takes care of session initialization,
        # restoring from a checkpoint, saving to a checkpoint, and closing
        # when done or an error occurs.
        with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                               config=config,
                                               hooks=hooks) as mon_sess:
            while not mon_sess.should_stop():
                # Perform synchronous training.
                mon_sess.run(train_op)
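     One thing the full example leaves implicit is the input pipeline: each worker should read a different shard of the data. A minimal tf.data sketch under assumed names (`filenames` and `parse_fn` are placeholders for a real input pipeline):

        import tensorflow as tf
        import horovod.tensorflow as hvd

        hvd.init()

        dataset = tf.data.TFRecordDataset(filenames)  # `filenames` assumed defined elsewhere
        # Give each worker a disjoint 1/size() slice of the records.
        dataset = dataset.shard(num_shards=hvd.size(), index=hvd.rank())
        dataset = dataset.map(parse_fn).shuffle(10000).batch(64).repeat()
        features, labels = dataset.make_one_shot_iterator().get_next()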
  13. Run!
        # Use the AWS Deep Learning AMI
        laptop$ ssh ubuntu@<aws-ip-1>
        aws-ip-1$ source activate tensorflow_p27
        aws-ip-1$ ssh-keygen
        aws-ip-1$ cat /home/ubuntu/.ssh/id_rsa.pub
        [copy contents of the pubkey]
        aws-ip-1$ exit

        laptop$ ssh ubuntu@<aws-ip-2>
        aws-ip-2$ source activate tensorflow_p27
        aws-ip-2$ cat >> /home/ubuntu/.ssh/authorized_keys
        [paste contents of the pubkey]
        aws-ip-2$ exit

        laptop$ ssh ubuntu@<aws-ip-1>
        aws-ip-1$ ssh aws-ip-2
        [will prompt to confirm the host key; say yes]
        aws-ip-2$ exit

        aws-ip-1$ mpirun -np 2 -H aws-ip-1,aws-ip-2 wget https://raw.githubusercontent.com/uber/horovod/master/examples/tensorflow_mnist.py
        aws-ip-1$ mpirun -bind-to none -map-by slot -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
            -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_exclude lo,docker0 \
            -np 16 -H aws-ip-1:8,aws-ip-2:8 python tensorflow_mnist.py

        # Pro tip: hide the mpirun args in an mpirun.sh wrapper
        aws-ip-1$ mpirun.sh -np 16 -H aws-ip-1:8,aws-ip-2:8 python tensorflow_mnist.py
  14. (no text content on this slide)
  15. Horovod for TensorFlow:
        import horovod.tensorflow as hvd
  16. Horovod for TensorFlow, Keras, and PyTorch:
        import horovod.tensorflow as hvd
        import horovod.keras as hvd
        import horovod.tensorflow.keras as hvd
        import horovod.torch as hvd
        # more frameworks coming
  17. Full example – Keras:
        import keras
        from keras import backend as K
        import tensorflow as tf
        import horovod.keras as hvd

        # Initialize Horovod.
        hvd.init()

        # Pin GPU to be used
        config = tf.ConfigProto()
        config.gpu_options.visible_device_list = str(hvd.local_rank())
        K.set_session(tf.Session(config=config))

        # Build model…
        model = ...
        opt = keras.optimizers.Adadelta(lr=1.0 * hvd.size())

        # Add Horovod Distributed Optimizer.
        opt = hvd.DistributedOptimizer(opt)

        model.compile(loss='categorical_crossentropy',
                      optimizer=opt,
                      metrics=['accuracy'])

        # Broadcast initial variable state.
        callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

        model.fit(x_train, y_train,
                  callbacks=callbacks,
                  epochs=10,
                  validation_data=(x_test, y_test))
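     Horovod's Keras integration also ships other callbacks that help at scale. A hedged sketch (callback names assume a 2018-era or later Horovod release; check your version's docs), combined with the rank-0 checkpoint pattern from the earlier slides:

        import keras
        import horovod.keras as hvd

        callbacks = [
            # Broadcast initial variables from rank 0 so all workers start identically.
            hvd.callbacks.BroadcastGlobalVariablesCallback(0),
            # Average metric values across workers at the end of each epoch,
            # so every rank reports the same numbers.
            hvd.callbacks.MetricAverageCallback(),
        ]

        # Write checkpoints from rank 0 only, as in the TensorFlow examples above.
        if hvd.rank() == 0:
            callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))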
  18. Full example – Estimator API:
        import tensorflow as tf
        import horovod.tensorflow as hvd

        # Initialize Horovod
        hvd.init()

        # Pin GPU to be used
        config = tf.ConfigProto()
        config.gpu_options.visible_device_list = str(hvd.local_rank())

        # Build model...
        def cnn_model_fn(features, labels, mode):
            loss = …
            # learning_rate is scaled by the number of workers; the momentum value is illustrative
            opt = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(), momentum=0.9)

            # Add Horovod Distributed Optimizer
            opt = hvd.DistributedOptimizer(opt)
            return tf.estimator.EstimatorSpec(…)

        # Broadcast initial variable state.
        hooks = [hvd.BroadcastGlobalVariablesHook(0)]

        # Only checkpoint on rank 0
        ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

        # Create the Estimator
        mnist_classifier = tf.estimator.Estimator(
            model_fn=cnn_model_fn,
            model_dir=ckpt_dir,
            config=tf.estimator.RunConfig(session_config=config))

        mnist_classifier.train(
            input_fn=train_input_fn,
            steps=100,
            hooks=hooks)
  19. Full example – PyTorch:
        import torch
        import torch.optim as optim
        import torch.nn.functional as F
        import horovod.torch as hvd

        # Initialize Horovod
        hvd.init()

        # Horovod: pin GPU to local rank.
        torch.cuda.set_device(hvd.local_rank())

        # Build model.
        model = Net()
        model.cuda()

        # lr is required by SGD; scaled by the number of workers as in the earlier examples
        optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

        # Wrap optimizer with DistributedOptimizer.
        optimizer = hvd.DistributedOptimizer(
            optimizer, named_parameters=model.named_parameters())

        # Horovod: broadcast parameters.
        hvd.broadcast_parameters(model.state_dict(), root_rank=0)

        for epoch in range(100):
            for batch_idx, (data, target) in …:
                optimizer.zero_grad()
                output = model(data)
                loss = F.nll_loss(output, target)
                loss.backward()
                optimizer.step()
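     As with TensorFlow, each PyTorch worker should see its own shard of the data. A hedged sketch using the standard DistributedSampler; the synthetic dataset is purely for illustration, and `model` and `optimizer` refer to the example above. Depending on your Horovod version, the optimizer state can be broadcast as well:

        import torch
        import torch.utils.data
        from torch.utils.data.distributed import DistributedSampler
        import horovod.torch as hvd

        hvd.init()

        # Synthetic stand-in for a real dataset, only for illustration.
        train_dataset = torch.utils.data.TensorDataset(
            torch.randn(1024, 784), torch.randint(0, 10, (1024,)))

        # Give each worker a different slice of the data.
        train_sampler = DistributedSampler(
            train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
        train_loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=64, sampler=train_sampler)

        # In addition to broadcasting parameters (slide 19), broadcast the optimizer
        # state so stateful optimizers start out consistent on every worker.
        hvd.broadcast_parameters(model.state_dict(), root_rank=0)
        hvd.broadcast_optimizer_state(optimizer, root_rank=0)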
  20. (no text content on this slide)
  21. Scaling Horovod
  22. Horovod knobs (see the sketch after this list)
     • Single-ring NCCL versus hierarchical all-reduce: HOROVOD_HIERARCHICAL_ALLREDUCE=1
     • Tensor Fusion: HOROVOD_FUSION_THRESHOLD=67108864, HOROVOD_CYCLE_TIME=5
     • FP16 all-reduce: hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
     Stay tuned for more . . .
     • Auto-tuning
     • Further gradient compression
     • Local gradient aggregation
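     A hedged sketch of how these knobs might be combined. The environment variables are normally passed through mpirun -x (as on slide 13) and must be set before hvd.init(); setting them in Python here is only for illustration:

        import os
        os.environ.setdefault('HOROVOD_HIERARCHICAL_ALLREDUCE', '1')   # hierarchical all-reduce
        os.environ.setdefault('HOROVOD_FUSION_THRESHOLD', '67108864')  # 64 MB tensor-fusion buffer
        os.environ.setdefault('HOROVOD_CYCLE_TIME', '5')               # fusion cycle time in ms

        import tensorflow as tf
        import horovod.tensorflow as hvd

        hvd.init()

        # Illustrative optimizer values, matching the earlier slides.
        opt = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(), momentum=0.9)
        # FP16 all-reduce: gradients are cast to float16 on the wire and cast back
        # before being applied, halving all-reduce traffic.
        opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)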
  23. Thank you! Alex Sergeev, @alsrgv
  24. Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.
  25. (no text content on this slide)
