© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Effective Distributed Training and Model Optimization for Deep Learning Models
김무현, Data Scientist
AWS ML Solutions Lab
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon ML Solutions Lab
Brainstorming • Modeling • Teaching
Leverage Amazon experts with decades of ML experience, gained with technologies like Amazon Echo, Amazon Alexa, Prime Air, and Amazon Go.
The Amazon ML Solutions Lab provides ML expertise.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Now let’s make it as
fast, efficient and inexpensive
as possible
Put machine learning in the
hands of every developer
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
The Amazon ML Stack: Broadest & Deepest Set of Capabilities
AI SERVICES (Vision, Speech, Language, Chatbots, Forecasting, Recommendations): Rekognition Image, Rekognition Video, Polly, Transcribe, Translate, Comprehend, Comprehend Medical, Lex, Textract, Forecast, Personalize
ML SERVICES: Amazon SageMaker (Build / Train / Deploy) — pre-built algorithms & notebooks, data labeling (Ground Truth), algorithms & models (AWS Marketplace), one-click model training & tuning, models without training data (reinforcement learning, RL Coach), optimization (Neo), one-click deployment & hosting
ML FRAMEWORKS & INFRASTRUCTURE: frameworks, interfaces, and infrastructure — EC2 P3 & P3dn, EC2 C5, FPGAs, Greengrass, Elastic Inference
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
• Optimizing Infrastructure and Frameworks
• Distributed training for TensorFlow, MXNet, Keras, PyTorch
• Let’s tune models using Amazon SageMaker HPO
• Optimizing the trained model for deployment
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Where to train and deploy deep learning models
• Amazon SageMaker
• Amazon Elastic Container Service for Kubernetes (EKS)
• Amazon Elastic Container Service (ECS)
• AWS Deep Learning Containers
• Amazon EC2
• AWS Deep Learning AMIs
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Making TensorFlow faster
Training a ResNet-50 benchmark with the synthetic ImageNet dataset
using our optimized build of TensorFlow 1.11 on a c5.18xlarge instance
type is 11x faster than training on the stock binaries.
https://aws.amazon.com/about-aws/whats-new/2018/10/chainer4-4_theano_1-0-2_launch_deep_learning_ami/
October 2018
Available with Amazon SageMaker,
AWS Deep Learning AMIs, and AWS Deep Learning Containers
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon EC2 P3dn
https://aws.amazon.com/blogs/aws/new-ec2-p3dn-gpu-instances-with-100-gbps-networking-local-nvme-storage-for-faster-machine-learning-p3-price-reduction/
Reduce machine learning training time • Better GPU utilization • Support larger, more complex models
KEY FEATURES
• 100 Gbps of networking bandwidth
• 8 NVIDIA Tesla V100 GPUs
• 32 GB of memory per GPU (2x more than P3)
• 96 Intel Skylake vCPUs (50% more than P3), with AVX-512
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
The Amazon EC2 P3 instance type has the most powerful GPU, the NVIDIA V100.
But are you fully utilizing your GPUs?
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Tensor Core and mixed-precision training
https://arxiv.org/abs/1710.03740
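As a quick illustration of why mixed-precision training needs loss scaling (this numeric example is mine, not from the paper): float16 cannot represent values much below roughly 6e-8, so small gradients silently underflow to zero unless the loss is scaled up before backpropagation and the gradients scaled back down afterwards.

import numpy as np

small_grad = 1e-8                        # a gradient magnitude that occurs in practice
print(np.float16(small_grad))            # 0.0 -> the value underflows in float16
scale = 1024.0
scaled = np.float16(small_grad * scale)  # ~1.0e-05, representable in float16
print(np.float32(scaled) / scale)        # ~1.0e-08, recovered after unscaling in float32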
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to port training scripts for mixed precision
https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
Port the model to use the FP16 data type where appropriate:
1. Use the float16 data type in models containing convolutions or matrix multiplications
2. Check that trainable variables are stored as float32 before casting them to float16
3. Use float32 for the softmax calculation
Add loss scaling to preserve small gradient values:
1. Multiply the loss by a scale factor before computing gradients
2. Divide the computed gradients by the same scale factor
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Code snippet for mixed-precision training in TensorFlow

MLP normal implementation:
x = tf.placeholder(tf.float32, [None, 784])
W1 = tf.Variable(tf.truncated_normal([784, FLAGS.num_hunits]))
b1 = tf.Variable(tf.zeros([FLAGS.num_hunits]))
z = tf.nn.relu(tf.matmul(x, W1) + b1)
W2 = tf.Variable(tf.truncated_normal([FLAGS.num_hunits, 10]))
b2 = tf.Variable(tf.zeros([10]))
y = tf.matmul(z, W2) + b2
y_ = tf.placeholder(tf.int64, [None])
cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=y)
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

MLP mixed-precision implementation:
data = tf.placeholder(tf.float16, shape=(None, 784))
W1 = tf.get_variable('w1', (784, FLAGS.num_hunits), tf.float16)
b1 = tf.get_variable('b1', (FLAGS.num_hunits), tf.float16,
                     initializer=tf.zeros_initializer())
z = tf.nn.relu(tf.matmul(data, W1) + b1)
W2 = tf.get_variable('w2', (FLAGS.num_hunits, 10), tf.float16)
b2 = tf.get_variable('b2', (10), tf.float16,
                     initializer=tf.zeros_initializer())
y = tf.matmul(z, W2) + b2
y_ = tf.placeholder(tf.int64, shape=(None))
loss = tf.losses.sparse_softmax_cross_entropy(y_, tf.cast(y, tf.float32))

* Source code from https://github.com/khcs/fp16-demo-tf
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Code snippet for mixed-precision training in TensorFlow

MLP normal implementation:
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
# Train
for _ in range(3000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

MLP mixed-precision implementation:
def gradients_with_loss_scaling(loss, variables, loss_scale):
    return [grad / loss_scale
            for grad in tf.gradients(loss * loss_scale, variables)]

with tf.device('/gpu:0'), \
     tf.variable_scope(
         'fp32_storage', custom_getter=float32_variable_storage_getter):
    data, target, logits, loss = create_model(nbatch, nin, nout, dtype)

variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
grads = gradients_with_loss_scaling(loss, variables, loss_scale)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
training_step_op = optimizer.apply_gradients(zip(grads, variables))
init_op = tf.global_variables_initializer()

sess.run(init_op)
for step in range(6000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    np_loss, _ = sess.run([loss, training_step_op],
                          feed_dict={data: batch_xs, target: batch_ys})

* Source code from https://github.com/khcs/fp16-demo-tf
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
For other deep learning frameworks such as Apache MXNet and PyTorch, please refer to:
AWS Deep Learning AMI Developer Guide
https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-gpu-opt-training.html
NVIDIA Deep Learning SDK documentation
https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
(A minimal MXNet Gluon sketch follows below.)
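For MXNet users, here is a minimal Gluon sketch of the same idea (my own illustration, not taken from the guides above): the network and input are cast to float16, while multi_precision=True keeps a float32 master copy of the weights for the update. The network, data loader, and hyperparameters are placeholders.

import mxnet as mx
from mxnet import gluon, autograd

ctx = mx.gpu(0)
net = gluon.model_zoo.vision.resnet50_v1(classes=10)   # placeholder network
net.initialize(mx.init.Xavier(), ctx=ctx)
net.cast('float16')                                    # run the network in float16

# multi_precision=True keeps a float32 master copy of the weights for the update step
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'momentum': 0.9, 'multi_precision': True})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

for data, label in train_data:                         # train_data: your DataLoader
    data = data.astype('float16').as_in_context(ctx)
    label = label.as_in_context(ctx)
    with autograd.record():
        output = net(data)
        # cast logits back to float32 for the loss/softmax, as recommended earlier
        loss = loss_fn(output.astype('float32'), label)
    loss.backward()
    trainer.step(data.shape[0])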
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Scaling TensorFlow near-linearly to 256 GPUs
https://aws.amazon.com/about-aws/whats-new/2018/11/tensorflow-scalability-to-256-gpus/
Stock TensorFlow: 65% scaling efficiency with 256 GPUs, 30m training time
AWS-Optimized TensorFlow: 90% scaling efficiency with 256 GPUs, 14m training time
Available with Amazon SageMaker and the AWS Deep Learning AMIs
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
I also have a huge amount of data or large models to train.
How do I scale deep learning training tasks?
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Infra for distributed training - scale up
A single Amazon EC2 instance with multiple GPUs (the diagram shows eight), backed by Amazon Elastic Block Store (EBS).
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Infra for distributed training - scale out
Multiple Amazon EC2 instances, each backed by Amazon Elastic Block Store (EBS).
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Multi-GPU and multi-node options
Using the DL framework's built-in features
• TensorFlow
  - Multi-tower approach for multi-GPU training
  - Parameter server for multi-node training
• Apache MXNet
  - Multi-GPU training by defining a context with a list of GPUs
  - Parameter server for multi-node training
Using Horovod
• https://eng.uber.com/horovod/
• Open-source distributed training framework based on the Message Passing Interface (MPI)
• Built on Baidu's draft implementation of the TensorFlow ring-allreduce algorithm
• Supports popular deep learning frameworks such as TensorFlow, MXNet, Keras, and PyTorch
(Chart: performance scalability using Horovod)
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Using Horovod
Install Horovod and related packages
→ The AWS Deep Learning AMI and Deep Learning Containers already include them
Modify your training code to use Horovod
Run multi-GPU or distributed training using the mpirun command
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Using Horovod with TensorFlow
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes
# during initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)

( source code from https://github.com/horovod/horovod )
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Using Horovod with Apache MXNet
import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd
# Initialize Horovod
hvd.init()
# Pin GPU to be used to process local rank
context = mx.gpu(hvd.local_rank())
num_workers = hvd.size()
# Build model
model = ...
model.hybridize()
# Create optimizer
optimizer_params = ...
opt = mx.optimizer.create('sgd', **optimizer_params)
# Initialize parameters
model.initialize(initializer, ctx=context)
# Fetch and broadcast parameters
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)
# Create DistributedTrainer, a subclass of gluon.Trainer
trainer = hvd.DistributedTrainer(params, opt)
# Create loss function
loss_fn = ...
# Train model
for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = batch.data[0].as_in_context(context)
        label = batch.label[0].as_in_context(context)
        with autograd.record():
            output = model(data.astype(dtype, copy=False))
            loss = loss_fn(output, label)
        loss.backward()
        trainer.step(batch_size)
( source code from https://github.com/horovod/horovod )
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Using Horovod with Keras
import math
import keras
import keras.backend as K
import tensorflow as tf
import horovod.keras as hvd

# Horovod: initialize Horovod.
hvd.init()

# Horovod: pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Horovod: adjust number of epochs based on number of GPUs.
epochs = int(math.ceil(12.0 / hvd.size()))

model = ...

# Horovod: adjust learning rate based on number of GPUs.
opt = keras.optimizers.Adadelta(1.0 * hvd.size())

# Horovod: add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=opt, metrics=['accuracy'])

callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

model.fit(x_train, y_train,
          batch_size=batch_size,
          callbacks=callbacks,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
( source code from https://github.com/horovod/horovod )
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Using Horovod in Amazon EC2
https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-horovod-tensorflow.html
STEP 1. Configure Horovod Hosts file
172.100.1.200 slots=8
172.200.8.99 slots=8
172.48.3.124 slots=8
localhost slots=8
STEP 2. Configure the nodes to disable SSH StrictHostKeyChecking
STEP 3. Execute training script using mpirun command
~/anaconda3/envs/tensorflow_p36/bin/mpirun -np $gpus -hostfile ~/hosts -mca plm_rsh_no_tree_spawn 1 \
    -bind-to socket -map-by slot \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216 \
    -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib \
    -x NCCL_SOCKET_IFNAME=$INTERFACE -mca btl_tcp_if_exclude lo,docker0 \
    -x TF_CPP_MIN_LOG_LEVEL=0 \
    python -W ignore ~/examples/horovod/tensorflow/train_imagenet_resnet_hvd.py \
    --data_dir ~/data/tf-imagenet/ --num_epochs 90 --increased_aug -b $BATCH_SIZE \
    --mom 0.977 --wdecay 0.0005 --loss_scale 256. --use_larc \
    --lr_decay_mode linear_cosine --warmup_epochs 5 --clear_log
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Using Horovod in Amazon EKS
https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-eks-tutorials-distributed-gpu-training.html
STEP 1. Install Kubeflow to setup a cluster for distributed training
STEP 2. Set the app name and initialize it.
STEP 3. Install mpi-operator from kubeflow
STEP 4. Create an MPI Job template, defining the number of nodes (replicas) and the number of GPUs each node has (gpusPerReplica).
STEP 5. Apply the manifest to the default environment. The MPI Job will create a launch pod.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Using Horovod in Amazon SageMaker
from sagemaker.tensorflow import TensorFlow

distributions = {'mpi': {'enabled': True, 'processes_per_host': 2}}

# METHOD 1 - Using the Amazon SageMaker provided VPC
estimator = TensorFlow(entry_point=train_script,
                       role=sagemaker_iam_role,
                       train_instance_count=2,
                       train_instance_type='ml.p3.8xlarge',
                       script_mode=True,
                       framework_version='1.12',
                       distributions=distributions)

# METHOD 2 - Using your own VPC for training performance improvement
estimator = TensorFlow(entry_point=train_script,
                       role=sagemaker_iam_role,
                       train_instance_count=2,
                       train_instance_type='ml.p3.8xlarge',
                       script_mode=True,
                       framework_version='1.12',
                       distributions=distributions,
                       security_group_ids=['sg-0919a36a89a15222f'],
                       subnets=['subnet-0c07198f3eb022ede', 'subnet-055b2819caae2fd1f'])

estimator.fit({"train": s3_train_path, "test": s3_test_path})
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Examples of hyperparameters
Neural Networks: number of layers, hidden layer width, learning rate, embedding dimensions, dropout, …
Decision Trees: tree depth, max leaf nodes, gamma, eta, lambda, alpha, …
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Automatic Model Tuning
Finding the optimal set of hyperparameters
1. Manual Search ("I know what I'm doing")
2. Grid Search (“X marks the spot”)
• Typically training hundreds of models
• Slow and expensive
3. Random Search (“Spray and pray”)
• Works better and faster than Grid Search
• But… but… but… it’s random!
4. HPO: use Machine Learning
• Training fewer models
• Gaussian Process Regression and Bayesian Optimization
• You can now resume from a previous tuning job
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to use Amazon SageMaker HPO
You define an Estimator and a tuning Configuration; SageMaker launches the Training Jobs and returns the Resulting Models. A minimal code sketch follows below.
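A minimal sketch of launching a tuning job with the SageMaker Python SDK, reusing the TensorFlow script-mode estimator style shown earlier. The hyperparameter names, ranges, and the metric regex are assumptions and must match what your training script actually accepts and logs.

from sagemaker.tensorflow import TensorFlow
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

estimator = TensorFlow(entry_point=train_script,          # your training script
                       role=sagemaker_iam_role,
                       train_instance_count=1,
                       train_instance_type='ml.p3.2xlarge',
                       script_mode=True,
                       framework_version='1.12')

# Ranges to explore; the names must match arguments parsed by the training script
hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(1e-4, 1e-1),
    'batch_size': IntegerParameter(32, 256),
}

tuner = HyperparameterTuner(estimator,
                            objective_metric_name='validation:accuracy',
                            objective_type='Maximize',
                            hyperparameter_ranges=hyperparameter_ranges,
                            # the regex must match a line your script prints, e.g. "val_acc = 0.93"
                            metric_definitions=[{'Name': 'validation:accuracy',
                                                 'Regex': 'val_acc = ([0-9\\.]+)'}],
                            max_jobs=20,
                            max_parallel_jobs=2)

tuner.fit({'train': s3_train_path, 'test': s3_test_path})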
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hardware optimization is extremely complex
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Neo is a compiler and runtime for machine learning
• Compiler: processor vendors can integrate hardware-specific optimizations
• Runtime: device makers can embed the runtime into edge devices and IoT
Open source under the Apache Software License: github.com/neo-ai
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to compile a model
https://docs.aws.amazon.com/sagemaker/latest/dg/neo-job-compilation-cli.html
Configure the compilation job
{
    "RoleArn": $ROLE_ARN,
    "InputConfig": {
        "S3Uri": "s3://jsimon-neo/model.tar.gz",
        "DataInputConfig": "{\"data\": [1, 3, 224, 224]}",
        "Framework": "MXNET"
    },
    "OutputConfig": {
        "S3OutputLocation": "s3://jsimon-neo/",
        "TargetDevice": "rasp3b"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 300
    }
}
Compile the model
$ aws sagemaker create-compilation-job \
    --cli-input-json file://config.json \
    --compilation-job-name resnet50-mxnet-pi
$ aws s3 cp s3://jsimon-neo/model-rasp3b.tar.gz .
$ gtar tfz model-rasp3b.tar.gz
compiled.params
compiled_model.json
compiled.so
Predict with the compiled model
from dlr import DLRModel
model = DLRModel('resnet50', input_shape, output_shape, device)
out = model.run(input_data)
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Model compilation using AWS console
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Performance improvement result
Image file name | MXNet model (seconds) | Neo-compiled model (seconds) | Improvement (MXNet / Neo-compiled)
input_001       | 0.0299                | 0.0128                       | 233.59%
input_002       | 0.0223                | 0.0129                       | 172.86%
input_003       | 0.0275                | 0.0125                       | 220.00%
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Do I really need
neural networks that complex and deep
to meet the required accuracy?
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Compressing deep learning models
• Compression is the process of reducing the size of a trained network, either by removing certain layers or by shrinking layers, while maintaining accuracy.
• A smaller model will predict faster and require less memory.
• The number of possible combinations makes it difficult to perform this task manually, or even programmatically.
• Reinforcement learning to the rescue!
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Defining the problem
• Objective: find the smallest possible network architecture from a pre-trained network architecture, while producing the best accuracy.
• Environment: a custom-built environment that accepts from the RL agent a Boolean array of layers to remove and produces an observation describing the layers (a minimal sketch follows below).
• State: the layers.
• Action: a Boolean array, one entry per layer.
• Reward: a combination of compression ratio and accuracy.
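A minimal, hypothetical Gym-style sketch of such an environment. The class name, observation encoding, reward weighting, and the evaluate_fn helper are illustrative assumptions, not the implementation used in the SageMaker example linked on the next slide.

import numpy as np
import gym
from gym import spaces

class NetworkCompressionEnv(gym.Env):
    def __init__(self, num_layers, evaluate_fn):
        # evaluate_fn(keep_mask) -> (accuracy, compression_ratio) is assumed to
        # rebuild the pruned network, fine-tune it briefly, and measure both metrics
        self.num_layers = num_layers
        self.evaluate_fn = evaluate_fn
        # Action: one Boolean per layer (1 = remove the layer, 0 = keep it)
        self.action_space = spaces.MultiBinary(num_layers)
        # Observation: a simple per-layer descriptor (here, just the keep mask)
        self.observation_space = spaces.Box(0.0, 1.0, shape=(num_layers,), dtype=np.float32)

    def reset(self):
        self.keep_mask = np.ones(self.num_layers, dtype=np.float32)
        return self.keep_mask

    def step(self, action):
        # Apply the removal decisions proposed by the agent
        self.keep_mask = 1.0 - np.asarray(action, dtype=np.float32)
        accuracy, compression_ratio = self.evaluate_fn(self.keep_mask)
        # Reward trades accuracy off against compression (the weighting here is arbitrary)
        reward = accuracy + 0.5 * compression_ratio
        done = True   # one-shot episode: propose a pruned architecture, receive a score
        return self.keep_mask, reward, done, {'accuracy': accuracy}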
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon SageMaker RL
Reinforcement learning for every developer and data scientist
KEY FEATURES
• Broad support for frameworks
• Broad support for simulation environments: 2D & 3D physics environments and OpenAI Gym support; supports Amazon Sumerian, AWS RoboMaker, and the open-source Robot Operating System (ROS) project
• Fully managed
• Example notebooks and tutorials
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning/rl_network_compression_ray_custom
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Predictions drive complexity and cost in production
Training: 10% • Inference: 90%
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Are you making the most of your infrastructure?
Low utilization and high costs • One size does not fit all
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon Elastic Inference
https://aws.amazon.com/blogs/aws/amazon-elastic-inference-gpu-powered-deep-learning-inference-acceleration/
KEY FEATURES
• Match capacity to demand: available between 1 and 32 TFLOPS
• Integrated with Amazon EC2, Amazon SageMaker, and the AWS Deep Learning AMIs
• Support for TensorFlow, Apache MXNet, and ONNX, with PyTorch coming soon
• Single and mixed-precision operations
• Lower inference costs by up to 75%
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Using Elastic Inference with TensorFlow
OPTION 1 - Using Elastic Inference TensorFlow Serving
$ amazonei_tensorflow_model_server --model_name=ssdresnet \
    --model_base_path=/tmp/ssd_resnet50_v1_coco --port=9000
OPTION 2 - Using Elastic Inference TensorFlow Predictor
from tensorflow.contrib.ei.python.predictor.ei_predictor import EIPredictor
import matplotlib.image as mpimg
import numpy as np

img = mpimg.imread(FLAGS.image)  # FLAGS.image: path to the input image
img = np.expand_dims(img, axis=0)
ssd_resnet_input = {'inputs': img}
eia_predictor = EIPredictor(model_dir='/tmp/ssd_resnet50_v1_coco/1/')
pred = eia_predictor(ssd_resnet_input)
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Using Elastic Inference with Apache MXNet
OPTION 1 - Use EI with the MXNet Symbol API
import mxnet as mx
data = mx.sym.var('data', shape=(1,))
sym = mx.sym.exp(data)
# Pass mx.eia() as context during simple bind operation
executor = sym.simple_bind(ctx=mx.eia(), grad_req='null')
# Forward call is performed on remote accelerator
executor.forward(data=mx.nd.ones((1,)))
print('Inference %d, output = %s' % (i, executor.outputs[0]))
OPTION 2 - Use EI with the Module API
ctx = mx.eia()
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-152', 0)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
# Bind and load the parameters before running inference
# (the input shape below is illustrative for ResNet-152)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params)
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Other tips
SageMaker Pipe Mode using the TensorFlow PipeModeDataset extension (a minimal usage sketch follows below)
https://github.com/aws/sagemaker-tensorflow-extensions
Apache MXNet can read training data directly from Amazon S3
https://mxnet.incubator.apache.org/versions/master/faq/s3_integration.html
* Benchmark dataset: a 3.9 GB CSV file containing 2 million records, each with 100 comma-separated, single-precision floating-point values.
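A minimal sketch of reading a Pipe Mode channel with the PipeModeDataset from sagemaker-tensorflow-extensions inside a script-mode training script. The channel name, record format, and feature schema below are assumptions; see the linked repository for the exact API.

import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset

def parse_record(record):
    # hypothetical schema: a 784-float feature vector and an integer label
    features = tf.parse_single_example(record, {
        'data': tf.FixedLenFeature([784], tf.float32),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    return features['data'], features['label']

ds = PipeModeDataset(channel='train', record_format='TFRecord')
ds = ds.repeat().map(parse_record).batch(64).prefetch(1)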
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Summary
Training
• Make sure to utilize Tensor Cores by using mixed-precision training
• Learn to use Horovod for efficient multi-GPU or multi-node distributed training
• Find the optimal hyperparameters using SageMaker HPO
Deployment
• Compile your model using Amazon SageMaker Neo
• Use Amazon Elastic Inference to reduce inference cost if applicable
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Dive into Deep Learning
An interactive deep learning book
with code, math, and discussions
http://d2l.ai/
http://ko.d2l.ai/
STAT 157 Course at UC Berkeley, Spring 2019
The Korean version of the first 4 chapters is available now.
• GitHub pull requests for corrections are welcome
• Raise issues at https://github.com/d2l-ai/d2l-ko/issues
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Getting started
https://ml.aws
https://aws.amazon.com/blogs/machine-learning
https://aws.amazon.com/sagemaker
https://github.com/awslabs/amazon-sagemaker-examples
https://medium.com/@julsimon
