Resource-Efficient Deep Learning Model Selection on Apache Spark

Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection.

  1. 1. Resource-efficient Deep Learning Model Selection on Apache Spark. Yuhao Zhang and Supun Nakandala, ADALab, University of California, San Diego
  2. 2. About us ▪ PhD students at ADALab, UCSD, advised by Prof. Arun Kumar ▪ Our research mission: democratize data science ▪ More: Supun Nakandala https://scnakandala.github.io/ Yuhao Zhang https://yhzhang.info/ ADALab https://adalabucsd.github.io/
  3. 3. Introduction Artificial Neural Networks (ANNs) are revolutionizing many domains - “Deep Learning”
  4. 4. Problem: training deep nets is painful! Batch size? 8, 16, 64, 256 ... Model architecture? 3-layer CNN, 5-layer CNN, LSTM ... Learning rate? 0.1, 0.01, 0.001, 0.0001 ... Regularization? L2, L1, Dropout, Batchnorm ... That is 4 x 4 x 4 x 4 = 256 different configurations! Model performance = f(model architecture, hyperparameters, ...) → trial and error. Need for speed → $$$ (distributed DL) → better utilization of resources
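To make the combinatorial blow-up on the slide above concrete, here is a minimal Python sketch (mine, not from the talk); the fourth architecture option is a hypothetical stand-in for the slide's trailing ellipsis.

    from itertools import product

    batch_sizes = [8, 16, 64, 256]
    architectures = ["3-layer CNN", "5-layer CNN", "LSTM", "GRU"]  # "GRU" is a hypothetical 4th option
    learning_rates = [0.1, 0.01, 0.001, 0.0001]
    regularizers = ["L2", "L1", "Dropout", "Batchnorm"]

    configs = list(product(architectures, batch_sizes, learning_rates, regularizers))
    print(len(configs))  # 4 x 4 x 4 x 4 = 256 different configurations to trial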
  5. 5. Outline 1. Background a. Mini-batch SGD b. Task Parallelism c. Data Parallelism 2. Model Hopper Parallelism (MOP) 3. MOP on Apache Spark a. Implementation b. APIs c. Tests
  6. 6. Outline 1. Background a. Mini-batch SGD b. Task Parallelism c. Data Parallelism 2. Model Hopper Parallelism (MOP) 3. MOP on Apache Spark a. Implementation b. APIs c. Tests
  7. 7. Introduction - mini-batch SGD. The most popular algorithm family for training deep nets. [Diagram: a mini-batch of rows (features X1, X2; label y) is drawn from the dataset; the model is updated as updated model = model - η × (avg. of gradients over the mini-batch), where η is the learning rate.]
  8. 8. Introduction - mini-batch SGD. [Diagram: one epoch = a sequential pass over all mini-batches of the dataset.]
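As a minimal illustration of the two slides above, here is a small NumPy sketch (mine, not from the talk) of one epoch of mini-batch SGD for a linear model with squared loss; one epoch is a sequential pass over the mini-batches.

    import numpy as np

    def sgd_epoch(w, X, y, lr=0.01, batch_size=32):
        """One epoch: a sequential pass over all mini-batches."""
        n = X.shape[0]
        for start in range(0, n, batch_size):
            xb, yb = X[start:start + batch_size], y[start:start + batch_size]
            grad = xb.T @ (xb @ w - yb) / len(xb)  # avg. gradient of squared loss on the mini-batch
            w = w - lr * grad                      # update: model <- model - eta * avg. gradient
        return w

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(256, 2)), rng.integers(0, 2, size=256).astype(float)
    w = sgd_epoch(np.zeros(2), X, y)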
  9. 9. Outline 1. Background a. Mini-batch SGD b. Task Parallelism c. Data Parallelism 2. Model Hopper Parallelism (MOP) 3. MOP on Apache Spark a. Implementation b. APIs c. Tests
  10. 10. Task Parallelism - Problem Setting. [Diagram: models (tasks) are assigned to machines, each holding a replicated copy of the dataset.]
  11. 11. (Embarrassing) Task Parallelism. Con: wasted storage, since every machine keeps a full replica of the dataset
  12. 12. (Embarrassing) Task Parallelism. Con: wasted network, if machines instead repeatedly read the data from a shared FS or data repo
  13. 13. Outline 1. Background a. Mini-batch SGD b. Task Parallelism c. Data Parallelism 2. Model Hopper Parallelism (MOP) 3. MOP on Apache Spark a. Implementation b. APIs c. Tests
  14. 14. Data Parallelism - Problem Setting. Models (tasks) trained over partitioned data → high data scalability
  15. 15. Data Parallelism. [Diagram: workers train on one mini-batch or their full partition, then send updates to a shared model via a queue.]
      ● Update only per epoch: bulk synchronous parallelism (model averaging)
        ○ Bad convergence
      ● Update per mini-batch: sync parameter server
        ○ + Async updates: async parameter server
        ○ + Decentralized: MPI allreduce (Horovod)
        ○ High communication cost
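A minimal NumPy sketch (illustrative only, not from the talk) of the synchronous flavor of data parallelism described above: each worker computes a gradient on its own partition, the gradients are averaged (the parameter-server / allreduce step), and a single shared model is updated. Communicating gradients at every step is where the high cost comes from.

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1024, 8)), rng.normal(size=1024)
    partitions = np.array_split(np.arange(1024), 4)        # data partitioned across 4 "workers"

    w, lr = np.zeros(8), 0.01
    for step in range(100):
        grads = []
        for idx in partitions:                             # runs in parallel in a real system
            xb, yb = X[idx], y[idx]
            grads.append(xb.T @ (xb @ w - yb) / len(idx))  # local gradient on the worker's partition
        w -= lr * np.mean(grads, axis=0)                   # sync step: average and update the shared model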
  16. 16. Task Parallelism: + high throughput, - low data scalability, - memory/storage wastage.
      Data Parallelism: + high data scalability, - low throughput, - high communication cost.
      Model Hopper Parallelism (Cerebro): + high throughput, + high data scalability, + low communication cost, + no memory/storage wastage.
  17. 17. Outline 1. Background a. Mini-batch SGD b. Task Parallelism c. Data Parallelism 2. Model Hopper Parallelism (MOP) 3. MOP on Apache Spark a. Implementation b. APIs c. Tests
  18. 18. Model Hopper Parallelism - Problem Setting. Models (tasks) trained over partitioned data
  19. 19. Model Hopper Parallelism Training on full local partitions One sub-epoch
  20. 20. Model Hopper Parallelism Training on full local partitions Model hopping & training One sub-epoch
  21. 21. Model Hopper Parallelism Training on full local partitions Model hopping & training Model hopping & training One sub-epoch
  22. 22. Model Hopper Parallelism. [Diagram: training on full local partitions, then repeated model hopping & training; each stage is one sub-epoch, and the whole sequence of sub-epochs is one epoch.]
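The four slides above build up one MOP epoch. Here is a minimal single-process sketch (mine, not the Cerebro implementation) of the idea: each model trains a sub-epoch on one worker's partition, then hops to the next, so after p sub-epochs every model has seen all the data exactly once without any data movement.

    # Toy data partitions, one per worker, and toy "models" (running sums standing in for training state).
    partitions = [list(range(i * 100, (i + 1) * 100)) for i in range(4)]
    models = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0}

    num_workers = len(partitions)
    names = list(models)
    for sub_epoch in range(num_workers):
        # In sub-epoch s, model j sits on worker (j + s) mod p; data never moves, only models hop.
        for j, name in enumerate(names):
            part = partitions[(j + sub_epoch) % num_workers]
            models[name] += sum(part)   # stand-in for a sub-epoch of SGD on the local partition
    print(models)                       # after p sub-epochs, every model has seen every partition once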
  23. 23. Heterogeneous Tasks. [Gantt chart over time: with tasks of different lengths, a fixed hopping order leaves workers queued at a redundant sync barrier.]
  24. 24. Randomized Scheduler. [Gantt chart over time]
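A simplified sketch of the randomized scheduling idea on the two slides above (the assumption of instantaneous task completion is mine; the real Cerebro scheduler works with actual task durations): an idle worker is given a randomly chosen model that is not currently running and has not yet trained on that worker's partition in the current epoch, which avoids the lock-step barrier of a fixed hopping order.

    import random

    def random_mop_schedule(models, workers):
        remaining = {m: set(workers) for m in models}   # partitions each model still needs this epoch
        schedule = []                                   # (worker, model) launches, in order
        while any(remaining.values()):
            busy = set()                                # models already launched in this round
            for w in workers:
                candidates = [m for m in models if m not in busy and w in remaining[m]]
                if not candidates:
                    continue                            # this worker idles for the round
                m = random.choice(candidates)           # randomization breaks the fixed, lock-step order
                busy.add(m)
                remaining[m].discard(w)
                schedule.append((w, m))
        return schedule

    print(random_mop_schedule(models=["M1", "M2", "M3"], workers=["w1", "w2", "w3", "w4"]))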
  25. 25. Cerebro -- Data System with MOP
  26. 26. Outline 1. Background a. Mini-batch SGD b. Task Parallelism c. Data Parallelism 2. Model Hopper Parallelism (MOP) 3. MOP on Apache Spark a. Implementation b. APIs c. Tests
  27. 27. MOP (Cerebro) on Spark. [Architecture diagram: the Cerebro Scheduler runs on the Spark Driver; a Cerebro Worker runs on each Spark Worker; all nodes share a Distributed File System (HDFS, NFS).]
  28. 28. Implementation Details ▪ Spark DataFrames converted to partitioned Parquet and locally cached in workers ▪ TensorFlow threads run training on local data partitions ▪ Model Hopping implemented via shared file system
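A simplified sketch of how the last bullet can work (the paths, column names, and caching layout are my assumptions, not the exact Cerebro code): a worker loads a model's checkpoint from the shared file system, trains one sub-epoch on its locally cached Parquet partition, and writes the checkpoint back so the next worker can pick it up; the model "hops" through the file system.

    import pandas as pd
    import tensorflow as tf

    SHARED_FS = "/mnt/shared/cerebro"   # hypothetical HDFS/NFS mount visible to all workers
    LOCAL_CACHE = "/tmp/cerebro_cache"  # hypothetical local cache of this worker's Parquet partition

    def train_sub_epoch(model_id, partition_id, batch_size=32):
        ckpt = f"{SHARED_FS}/models/{model_id}"
        model = tf.keras.models.load_model(ckpt)                     # resume where the previous worker stopped
        df = pd.read_parquet(f"{LOCAL_CACHE}/partition_{partition_id}.parquet")
        x = df.drop(columns=["label"]).to_numpy()
        y = df["label"].to_numpy()
        model.fit(x, y, epochs=1, batch_size=batch_size, verbose=0)  # one sub-epoch on the local partition
        model.save(ckpt)                                             # the "hop": persist for the next worker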
  29. 29. Outline 1. Background a. Mini-batch SGD b. Task Parallelism c. Data Parallelism 2. Model Hopper Parallelism (MOP) 3. MOP on Apache Spark a. Implementation b. APIs c. Tests
  30. 30. Example: Grid Search on Model Selection + Hyperparameter Search ▪ Two model architectures: {VGG16, ResNet50} ▪ Two learning rates: {1e-4, 1e-6} ▪ Two batch sizes: {32, 256}
  31. 31. Initialization

      from pyspark.sql import SparkSession
      import cerebro

      spark = SparkSession.builder.master(...).getOrCreate()  # initialize Spark

      spark_backend = cerebro.backend.SparkBackend(
          spark_context=spark.sparkContext,
          num_workers=num_workers  # number of Spark workers available for training
      )  # initialize Cerebro

      data_store = cerebro.storage.HDFSStore('hdfs://...')  # set the shared data storage
  32. 32. Define the Models

      import tensorflow as tf

      params = {'model_arch': ['vgg16', 'resnet50'],
                'learning_rate': [1e-4, 1e-6],
                'batch_size': [32, 256]}

      def estimator_gen_fn(params):
          '''A model factory that returns an estimator, given the input
          hyper-parameters as well as the model architecture.'''
          if params['model_arch'] == 'resnet50':
              model = ...  # tf.keras model
          elif params['model_arch'] == 'vgg16':
              model = ...  # tf.keras model

          optimizer = tf.keras.optimizers.Adam(lr=params['learning_rate'])  # choose optimizer
          loss = ...  # define loss

          estimator = cerebro.keras.SparkEstimator(model=model,
                                                   optimizer=optimizer,
                                                   loss=loss,
                                                   batch_size=params['batch_size'])
          return estimator
  33. 33. Run Grid Search

      df = ...  # read data in as a Spark DataFrame

      grid_search = cerebro.tune.GridSearch(spark_backend, data_store,
                                            estimator_gen_fn, params,
                                            epoch=5, validation=0.2,
                                            feature_columns=['features'],
                                            label_columns=['labels'])
      model = grid_search.fit(df)
  34. 34. Outline 1. Background a. Mini-batch SGD b. Task Parallelism c. Data Parallelism 2. Model Hopper Parallelism (MOP) 3. MOP on Apache Spark a. Implementation b. APIs c. Tests
  35. 35. Tests - Setups - Hardware ▪ 9-node cluster: 1 master + 8 workers ▪ On each node: ▪ Intel Xeon 10-core 2.20 GHz CPU x 2 ▪ 192 GB RAM ▪ Nvidia P100 GPU x 1
  36. 36. Tests - Setups - Workload ▪ Model selection + hyperparameter tuning on ImageNet ▪ Adam optimizer ▪ Grid search space: ▪ Model architecture: {ResNet50, VGG16} ▪ Learning rate: {1e-4, 1e-6} ▪ Batch size: {32, 256} ▪ L2 regularization: {1e-4, 1e-6}
  37. 37. Tests - Results - Learning Curves
  38. 38. Tests - Results - Learning Curves
  39. 39. Tests - Results - Per Epoch Runtimes * Horovod uses GPU kernels for communication. Thus, it has high GPU utilization.
  40. 40. Tests - Results - Runtimes

      System             | Train (hrs/epoch) | Validation (hrs/epoch) | GPU Util. (%) | Storage Footprint (GiB)
      TF PS - Async      |                   |                        | 8.6           | 250
      Horovod            |                   |                        | 92.1          | 250
      Cerebro-Spark      | 2.63              | 0.57                   | 42.4          | 250
      TF Model Averaging | 1.94              | 0.03                   | 72.1          | 250
      Celery             | 1.69              | 0.03                   | 82.4          | 2000
      Cerebro-Standalone | 1.72              | 0.05                   | 79.8          | 250

      * Horovod uses GPU kernels for communication. Thus, it has high GPU utilization.
  41. 41. Tests - Cerebro-Spark Gantt Chart ▪ Only overhead: stragglers, randomly caused by TF 2.1 Keras model saving/loading; overheads range from 1% to 300%. [Gantt chart with stragglers marked]
  42. 42. Tests - Cerebro-Spark Gantt Chart ▪ One epoch of training ▪ (Almost) optimal!
  43. 43. Tests - Cerebro-Standalone Gantt Chart
  44. 44. Other Available Hyperparameter Tuning Algorithms ▪ PBT ▪ HyperBand ▪ ASHA ▪ Hyperopt
  45. 45. More Features to Come ▪ Grouped learning ▪ API for transfer learning ▪ Model parallelism
  46. 46. References ▪ Cerebro project site ▪ https://adalabucsd.github.io/cerebro-system ▪ Github repo ▪ https://github.com/adalabucsd/cerebro-system ▪ Blog post ▪ https://adalabucsd.github.io/research-blog/cerebro.html ▪ Tech report ▪ https://adalabucsd.github.io/papers/TR_2020_Cerebro.pdf
  47. 47. Questions?
  48. 48. Thank you!
  49. 49. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
