
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner Enabled Apache Spark Clusters

The freedom to iterate quickly on distributed deep learning tasks is crucial for smaller companies to gain competitive advantages and market share from the big tech giants. HorovodRunner brings this capability to relatively accessible Spark clusters.


  1. 1. Benchmark Tests and How-Tos of Distributed Deep Learning on HorovodRunner. Jing Pan and Wendao Liu
  2. 2. 2020 Copyright eHealth Insurance ABOUT US Wendao Liu, Sr. Data Scientist at eHealth, Inc § Wears many hats: data science/machine learning, data pipelines, end-to-end data products § Currently pursuing a Doctor of Business Administration Jing Pan, PhD, Sr. Staff User Experience Researcher at eHealth, Inc § Architect of customer-facing machine learning models § Expert in applying deep learning models on Spark § Author of multiple patents and speaker at top AI conferences (KDD, AAAI)
  3. 3. 2020 Copyright eHealth Insurance AGENDA § Horovod § HorovodRunner § HorovodRunner Benchmark § How to Use HorovodRunner
  4. 4. 2020 Copyright eHealth Insurance WHY DISTRIBUTED DEEP LEARNING? Rapidly Growing Data § ImageNet has 1.3M images (150 GB) § Amazon has 143 million product reviews (20 GB) Increasing Model Complexity § AlexNet (5 convolutional layers and 3 fully connected layers) with batch size 128 requires 1.1 GB of memory § VGG-16 with batch size 128 requires 14 GB of memory; batch size 256 requires 28 GB
  5. 5. 2020 Copyright eHealth Insurance MEET HOROVOD § Uber's open-source distributed deep learning library § Easy to use § Slightly modify single-node DL code to make it distributed using Horovod § Great scaling efficiency § Supports four popular frameworks § TensorFlow, Keras, PyTorch, MXNet § Supports both data and model parallelism (Horovod GitHub) Courtesy of Uber
  6. 6. 2020 Copyright eHealth Insurance HOROVOD – DATA PARALLELISM Courtesy of Uber
  7. 7. 2020 Copyright eHealth Insurance HOROVOD – RING-ALLREDUCE Courtesy of Uber
  8. 8. 2020 Copyright eHealth Insurance HOROVOD – RING-ALLREDUCE [Ring diagram of 16 processing units, ranks 0-15] Horovod size: the number of processing units, e.g. 16. Horovod rank: the ordinal rank of a processing unit, e.g. 0-15. Courtesy of Uber
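A minimal sketch (not from the deck) of how Horovod's Keras API exposes these two quantities; each of the 16 worker processes in the example above would print its own rank:

    import horovod.tensorflow.keras as hvd

    hvd.init()                                # start Horovod in this process
    print('size =', hvd.size())               # total number of processing units, e.g. 16
    print('rank =', hvd.rank())               # global ordinal of this unit, 0-15
    print('local rank =', hvd.local_rank())   # ordinal within this machine, used later to pin a GPU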
  9. 9. 2020 Copyright eHealth Insurance HOROVOD BENCHMARK § Great scaling efficiency, but requires dedicated engineering resources to set it up • Containers, MPI, and NCCL § Fine-tuning the infrastructure is not trivial § A previous in-house Horovod implementation gained no overall scaling benefit (Wu et al. '18) Courtesy of Uber
  10. 10. 2020 Copyright eHealth Insurance HOROVODRUNNER – DATABRICKS HorovodRunner is a general API for running distributed deep learning workloads on Databricks using Uber's Horovod framework § Built on top of Horovod § No need to set up the underlying infrastructure • Supports AWS and Azure § Runs on Databricks Spark clusters § Data prep and training in one place § Takes care of random shuffling, fault tolerance, etc. § Barrier execution mode Non-Endorsement Disclaimer
  11. 11. 2020 Copyright eHealth Insurance HOROVODRUNNER DIAGRAM Courtesy of Databricks § A Spark driver and a number of executors that run Horovod § Barrier execution mode § Enables synchronized training § Starts all tasks together § Restarts all tasks in case of failure
  12. 12. 2020 Copyright eHealth Insurance HOROVODRUNNER BENCHMARK – MNIST Dataset: MNIST Instance: C4.2xlarge Instance Type: CPU Model: Simple CNN (2 convolutional layers) Epochs: 50 Network: 10 Gbps Demonstrated scaling efficiency on simple CNN runs on CPU clusters.
  13. 13. 2020 Copyright eHealth Insurance HOROVODRUNNER BENCHMARK Achieved good scaling efficiency using HorovodRunner for both models: Inception V3 (79.7%~48.9%) and VGG-16 (49.0%~18.5%)
  14. 14. 2020 Copyright eHealth Insurance HOROVODRUNNER BENCHMARK – OTHERS § GCN § Currently no scaling efficiency § The adjacency matrix is the input and cannot be divided § Stochastic GCN might be able to help § Multi-GPU instances § Horovod usually outperforms multithreading
  15. 15. 2020 Copyright eHealth Insurance HOW TO USE HOROVODRUNNER
  16. 16. 2020 Copyright eHealth Insurance CLUSTER SETUP § TensorFlow 1 (DB ML GPU 6.x): VGG and Inception are OK; ResNet requires TF2, and there is no DB ML GPU 7.x (TF2) runtime yet § No SSL encryption: set DATABRICKS.HOROVOD.IGNORESSL true and CONF_DISABLE_HIPAA true § To fix the timeout error in optimizers, run this in an init script:
    dbutils.fs.put("tmp/tf_horovod/tf_dbfs_timeout_fix.sh","""
    #!/bin/bash
    fusermount -u /dbfs
    nohup /databricks/spark/scripts/fuse/goofys-dbr -f -o allow_other --file-mode=0777 --dir-mode=0777 --type-cache-ttl 0 --stat-cache-ttl 1s --http-timeout 5m /: /dbfs >& /databricks/data/logs/dbfs_fuse_stderr &""", True)
  17. 17. 2020 Copyright eHealth Insurance BASIC CODE STRUCTURE 1. INIT LIBRARY 2. PIN GPU 3. WRAP OPTIMIZER 4. SYNC PARAMETERS 5. CHECKPOINT MODEL Courtesy of Uber
  18. 18. 2020 Copyright eHealth Insurance INITIALIZE THE LIBRARY
    Single-node code:
    def train(learning_rate=0.1):
      from tensorflow import keras
      get_dataset()
      model = get_model()
      optimizer = keras.optimizers.Adadelta(lr=learning_rate)
      model.compile()
      model.fit()
    train(learning_rate=0.1)
    HorovodRunner code:
    def train_hvd(learning_rate=0.1):
      import horovod.tensorflow.keras as hvd
      hvd.init()
      ...
    hr = HorovodRunner(np=2)  # np: number of GPUs on the slaves, aka hvd_size
    hr.run(train_hvd, learning_rate=0.1)
    Courtesy of Databricks
  19. 19. 2020 Copyright eHealth Insurance HOROVODRUNNER CODE - BAREBONE
    def train_hvd(learning_rate=0.1):
      import horovod.tensorflow.keras as hvd
      hvd.init()
      get_data()
      model = get_model()
      opt = keras.optimizers.Adadelta()
      model.compile()
      model.fit()
    hr = HorovodRunner(np=2)
    hr.run(train_hvd, learning_rate=0.1)
    Graphics by Van Oktop. Code courtesy of Databricks.
  20. 20. 2020 Copyright eHealth Insurance HOROVODRUNNER CODE - BAREBONE HorovodRunner barebone:
    def train_hvd(learning_rate=0.1):
      import horovod.tensorflow.keras as hvd
      hvd.init()
      get_data()
      model = get_model()
      opt = keras.optimizers.Adadelta()
      model.compile()
      model.fit()
    1. Init the library 2. Pin GPUs 3. Wrap the Optimizer 4. Sync Parameters 5. Checkpoint the Model
  21. 21. 2020 Copyright eHealth Insurance HOROVODRUNNER CODE - BAREBONE
    def train_hvd(learning_rate=0.1):
      import horovod.tensorflow.keras as hvd
      hvd.init()
      get_data()
      model = get_model()
      opt = keras.optimizers.Adadelta()
      model.compile()
      model.fit()
    hr = HorovodRunner(np=2)
    hr.run(train_hvd, learning_rate=0.1)
  22. 22. 2020 Copyright eHealth Insurance PIN GPUs
    def train_hvd(learning_rate=0.1):
      from tensorflow.keras import backend as K
      import tensorflow as tf
      import horovod.tensorflow.keras as hvd
      hvd.init()
      config = tf.ConfigProto()
      config.gpu_options.allow_growth = True
      config.gpu_options.visible_device_list = str(hvd.local_rank())
      K.set_session(tf.Session(config=config))
    [GPU 0, GPU 1, GPU 2, ..., GPU 15] § Needed for ring-allreduce to function properly § Find all GPU device ids on the slaves § Assign an invariant ordinal rank to each GPU device id
  23. 23. 2020 Copyright eHealth Insurance DATA PARALLELISM
    def get_dataset(num_classes, rank=0, size=1):
      from tensorflow import keras
      (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data('MNIST-data-%d' % rank)
      x_train = x_train[rank::size]
      y_train = y_train[rank::size]
    def train_hvd(batch_size=512, epochs=12, learning_rate=0.1):
      (x_train, y_train), (x_test, y_test) = get_dataset(num_classes, hvd.rank(), hvd.size())
    Conceptually, the data in the train_hvd function = the data in one GPU. [The entire data set is split into chunks 0..k, one per slave GPU of rank 0..k.] Graphics for conceptual illustration purposes only, not for backend implementation
  24. 24. 2020 Copyright eHealth Insurance GET DATA – INDEXED SOLUTION
    def get_dataset(num_classes, rank=0, size=1):
      from tensorflow import keras
      (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data('MNIST-data-%d' % rank)
      x_train = x_train[rank::size]
      y_train = y_train[rank::size]
    def train_hvd(batch_size=512,  # on 1 GPU
                  epochs=12, learning_rate=0.1):
      (x_train, y_train), (x_test, y_test) = get_dataset(num_classes, hvd.rank(), hvd.size())
    The slice x_train[rank::size] gives the GPU with a given rank the rows rank, rank + size, rank + 2*size, ..., so each GPU holds N rows, where N = total number of rows // hvd.size(). Graphics for conceptual illustration purposes only, not for backend implementation
  25. 25. 2020 Copyright eHealth Insurance GET DATA – INDEXED SOLUTION PROBLEM? § At each step, in each GPU, are the rows the same? § Yes § No shuffle, no representativeness (see the sketch below) § Solution for parquet files on S3 § Petastorm, which shuffles by default: https://github.com/uber/petastorm § What about image files? Graphics for conceptual illustration purposes only, not for backend implementation
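A small illustration (toy data, not from the deck) of the rank::size slicing above: each rank gets a fixed, deterministic subset, so without shuffling every epoch feeds the same rows to the same GPU.

    import numpy as np

    size = 4                            # stand-in for hvd.size()
    x_train = np.arange(16)             # toy "dataset" of 16 rows
    for rank in range(size):            # stand-in for hvd.rank()
        print(rank, x_train[rank::size])
    # rank 0 -> [ 0  4  8 12], rank 1 -> [ 1  5  9 13], ... the same split every epoch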
  26. 26. 2020 Copyright eHealth Insurance GET DATA – GENERATOR SOLUTION A generator-based solution will shuffle by default at each epoch.
    train_generator, validation_generator = get_dataset()  # shuffle set to true
    step_size_train = train_generator.n // train_generator.batch_size
    step_size_validation = validation_generator.n // validation_generator.batch_size
    history = model.fit_generator(
      generator = train_generator,
      steps_per_epoch = step_size_train // hvd.size(),
      validation_data = validation_generator,
      validation_steps = step_size_validation // hvd.size(),
      epochs = epochs,
      callbacks = callbacks,
      verbose = 2)
  27. 27. 2020 Copyright eHealth Insurance GET DATA – GENERATOR SOLUTION § Entire data set: step_size_train = train_generator.n // train_generator.batch_size § Inside each GPU: steps_per_epoch = step_size_train // hvd.size() [Table mapping GPU rank and per-GPU step to the overall step, the batch size, and the shuffled image indices drawn in each batch] § Ensures no repetition of images within an epoch § How? Because the images are shuffled by the generator (a sketch of how the generators might be built follows below)
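The deck does not show how get_dataset() builds the generators; a minimal sketch, assuming Keras ImageDataGenerator reading image files from hypothetical directories, with shuffle=True so the training ordering is reshuffled every epoch:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    def get_dataset(train_dir='/dbfs/path/to/train',        # hypothetical paths
                    valid_dir='/dbfs/path/to/validation',
                    batch_size=64):
        datagen = ImageDataGenerator(rescale=1. / 255)
        train_generator = datagen.flow_from_directory(
            train_dir, target_size=(224, 224), batch_size=batch_size,
            class_mode='categorical', shuffle=True)          # reshuffled each epoch
        validation_generator = datagen.flow_from_directory(
            valid_dir, target_size=(224, 224), batch_size=batch_size,
            class_mode='categorical', shuffle=False)
        return train_generator, validation_generator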
  28. 28. 2020 Copyright eHealth Insurance DISTRIBUTED MODEL RETRIEVAL § Why: every GPU loads the model structure at the beginning of training, which causes a "too many requests" error if the model is loaded from GitHub § How: save the model to S3 or DBFS first
    import shutil
    example_model = get_model()
    example_model.save("path_on_master/vgg_model.h5")
    shutil.copy("path_on_master/vgg_model.h5", "dbfs_or_s3_path/vgg_model.h5")
    § Then, in train_hvd, replace model = get_model() with model = keras.models.load_model("dbfs_or_s3_path_to/vgg_model.h5")
  29. 29. 2020 Copyright eHealth Insurance WRAP THE OPTIMIZER
    # single-machine optimizer, with the learning rate scaled by hvd.size()
    optimizer = keras.optimizers.Adadelta(lr=learning_rate * hvd.size())
    # Wrap with the Horovod distributed optimizer.
    optimizer = hvd.DistributedOptimizer(optimizer)
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    Paper by Facebook: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour § Preserve the same number of epochs in HorovodRunner as on a single machine so the model converges and accuracy is preserved § Do this by linearly scaling the learning rate with the batch size § Synchronous HorovodRunner effective batch size = batch_size * hvd_size, so LR_N = LR_1 * N § HorovodRunner's steps_per_epoch is inversely proportional to the number of GPUs § Same epochs * fewer steps per epoch = faster training time § Same epochs ≈ comparable accuracy
  30. 30. 2020 Copyright eHealth Insurance RECTIFIED ADAM OPTIMIZER § Why § Fast convergence § Accurate initial direction finding to avoid bad local optima § Setting § On the cluster, install keras-rectified-adam § In the notebook, set %env TF_KERAS=1 § RAdam optimizer setting:
    optimizer = RAdam(total_steps=5000, warmup_proportion=0.1, learning_rate=learning_rate*hvd.size(), min_lr=1e-5)
    callbacks = [
      hvd.callbacks.BroadcastGlobalVariablesCallback(0),
      hvd.callbacks.MetricAverageCallback(),
      hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1),
      keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)]
    "On the Variance of the Adaptive Learning Rate and Beyond", Liu et al. 2020
  31. 31. 2020 Copyright eHealth Insurance SYNCHRONIZE & CHECKPOINT
    Synchronize parameters from GPU 0:
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    Checkpoint model parameters from GPU 0:
    if hvd.rank() == 0:
      callbacks.append(keras.callbacks.ModelCheckpoint(checkpoint_dir + '/checkpoint-{epoch}.ckpt', save_weights_only=True))
    [GPU 0 broadcasts to GPU 1, GPU 2, ..., GPU 15] Graphics for conceptual illustration purposes only, not for backend implementation. At the end of each synchronous step § GPU 0 gets the averaged gradient from ring-allreduce § and sends the updated parameters to the rest of the GPUs (broadcast) § The weights from each step are saved from GPU 0
  32. 32. 2020 Copyright eHealth Insurance AVOID HVD.TIMELINE § Why: hvd.timeline = no scaling efficiency § How: add a timestamp to standard output instead. Redirect the HorovodRunner output to a log:
    reset_stdout()
    redirect_stdout(output_dir + filename)
    hr = HorovodRunner(np=np_setup)
    hr.run(train_hvd, learning_rate=learning_rate)
    # The checkpointed model is on the master.
    # If you want to keep your model after the cluster goes down:
    save_model_to_s3()
    move_log_to_s3()

    import logging
    def redirect_stdout(log_filename):
      class StreamToLogger …
      stdout_logger = logging.getLogger('STDOUT')
      sl = StreamToLogger(stdout_logger, logging.INFO)
      sys.stdout = sl
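The slide elides the StreamToLogger class; a minimal sketch of one common pattern for it (standard library only; the class body is an assumption, not the deck's exact code) that prefixes every line of training output with a timestamp:

    import logging
    import sys

    class StreamToLogger(object):
        # File-like object that forwards writes to a logger.
        def __init__(self, logger, log_level=logging.INFO):
            self.logger = logger
            self.log_level = log_level

        def write(self, buf):
            for line in buf.rstrip().splitlines():
                self.logger.log(self.log_level, line.rstrip())

        def flush(self):
            pass

    def redirect_stdout(log_filename):
        # %(asctime)s adds the timestamp to every logged line
        logging.basicConfig(level=logging.INFO,
                            format='%(asctime)s %(message)s',
                            filename=log_filename,
                            filemode='a')
        stdout_logger = logging.getLogger('STDOUT')
        sys.stdout = StreamToLogger(stdout_logger, logging.INFO)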
  33. 33. 2020 Copyright eHealth Insurance EXAMPLE OUTPUT WITH TIMESTAMPS ADDED [Screenshot: each output line shows the added timestamp, hvd.rank, the current step out of the total steps per epoch, and the current epoch out of the total epochs]
  34. 34. 2020 Copyright eHealth Insurance SUMMARY HorovodRunner is great for distributed deep learning § Unlike Horovod, does not require engineering resources to set up infrastructure § Simplicity of coding inherited from Horovod § Scaling efficiency is good; has room for improvement § Choose better network bandwidth instances § Change AWS S3 to EC2 instance store § Works best if the data can be divided § Horovod Timeline adversely impacts performance § Security § Since Open MPI does not use encrypted communication and can launch new processes, it's recommended to use network-level security to isolate Horovod jobs from potential attackers
  35. 35. 2020 Copyright eHealth Insurance LINK TO CODE AND PAPER § Code: https://github.com/psychologyphd/horovodRunnerBenchMark_IPython § Paper (AAAI 2020 Workshop 8 accepted poster): http://arxiv.org/abs/2005.05510 or https://deep-learning-graphs.bitbucket.io/dlg-aaai20/accepted_papers/DLGMA_2020_paper_23.pdf
  36. 36. 2020 Copyright eHealth Insurance FEEDBACK Thank you! Your feedback is important to us. Don’t forget to rate and review the sessions.
  37. 37. 2020 Copyright eHealth Insurance APPENDIX
  38. 38. 2020 Copyright eHealth Insurance APPENDIX Some things we found that may be useful to share § When training NLP models where we need to determine certain constraints such as the vocabulary, create the vocabulary first and read it on each worker. During training, each worker still only processes its subset of the data independently. § Shuffle § Petastorm shuffles randomly by default, but only works on parquet data (https://github.com/uber/petastorm): save the dataframe to parquet and use Petastorm for data ingestion (see the sketch below). § Petastorm also supports N-gram readouts, so it may be able to shuffle data while preserving local order. § The Keras data generator shuffles randomly by default too. § Rectified Adam: https://www.zhihu.com/question/340834465 § Real-time serving § The model is the same as a single-machine-trained model: serve it with Kubernetes + Docker or SageMaker; check other sessions today. § Ring-allreduce is bandwidth-optimal § https://databricks.com/blog/2019/08/15/how-not-to-scale-deep-learning-in-6-easy-steps.html
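A minimal sketch (assumed S3 path; not from the deck) of reading a parquet dataset with Petastorm inside train_hvd, sharded across Horovod workers:

    import horovod.tensorflow.keras as hvd
    from petastorm import make_batch_reader
    from petastorm.tf_utils import make_petastorm_dataset

    hvd.init()
    # cur_shard / shard_count give each Horovod worker its own shard of the
    # parquet row groups; Petastorm shuffles row groups by default.
    with make_batch_reader('s3://my-bucket/train.parquet',   # hypothetical path
                           cur_shard=hvd.rank(),
                           shard_count=hvd.size()) as reader:
        dataset = make_petastorm_dataset(reader)
        # ... build the model and call model.fit() on `dataset` here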
  39. 39. Some tips/takeaways: 1. You can use HorovodRunner out of the box and it works great. 2. Do not use Horovod timeline. 3. Use the init script and disable HIPAA to run HorovodRunner. 4. Not all optimizers are well supported; some learning rates require special settings. 5. Make sure everything, including the import statements, is wrapped inside the training function so it can be serialized (see the sketch below). 6. Don't use many GPU instances blindly; there is a network cost. Instead, run a few smaller samples and check GPU memory usage first. 7. You will still gain performance from a single machine with multiple GPUs. (The rest of the tips are in the appendix.)
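Tip 5 as a minimal sketch (the hyperparameters and elided body are assumptions): all imports live inside the training function so HorovodRunner can serialize it and ship it to the workers.

    def train_hvd(learning_rate=0.1):
        # every import lives inside the function so the closure can be pickled
        import tensorflow as tf
        from tensorflow import keras
        import horovod.tensorflow.keras as hvd

        hvd.init()
        # ... build the data pipeline and model, wrap the optimizer, fit ...

    from sparkdl import HorovodRunner   # provided by the Databricks ML runtime
    hr = HorovodRunner(np=2)
    hr.run(train_hvd, learning_rate=0.1)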
  40. 40. Q&A on shuffling time-series data (condensed email thread). Jing Pan: See https://stackoverflow.com/questions/44788946/shuffling-training-data-with-lstm-rnn. A stateful LSTM is a special case (Brandon, correct me if I am wrong); I don't think HorovodRunner can handle the shuffle for a stateful LSTM. Wendao Liu: Adding more context to my question. Say we have five years of historical Amazon stock data, ordered chronologically with one row per day, and the goal is to predict future prices with an LSTM. We want to preserve the time order, so direct random shuffling won't work because it breaks the sequence of stock prices. Do you have any suggestions for training such a model on Horovod, especially for how to shuffle the data in a meaningful way?
  41. 41. Brandon Williams (Databricks): Hi Wendao, +1 to Jing Pan as well. One approach to shuffling in a meaningful way for your case is to pre-transform the series into (overlapping) arrays of contiguous time steps. Each row is then a chunk of time that can be read independently, so shuffling is fine. That may require a fair amount of extra storage, but it is worth a try. Also, Petastorm appears to shuffle by row group; since the data is ordered chronologically, each row group should be contiguous in time. From each batch you can then generate the overlapping windows of data on the fly, as normal. Our ML team believes this is also a good approach to test, albeit not a trivial task.
  42. 42. Logging: retrieve the log from the master. If you want to retrieve logs from the slaves, use Databricks MLflow.
