[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Inference Cost up to 75% (AIM366) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon Elastic Inference: Reduce deep learning inference costs by up to 75%
AIM366
Dominic Divakaruni
Sr. Product Manager
Sudipta Sengupta
Sr. Principal Technologist
AWS – Machine Learning
Peter Jones
Head of AI Engineering
Liviu Calin
AI Systems Engineer
Autodesk AI Lab

Agenda
❖ Challenges of scaling deep learning applications
❖ Our solution to the cost-efficiency and flexibility challenges
❖ Autodesk’s experience

Machine learning – the centerpiece for transformation
Customer experience
Business operations
Decision-making
Innovation
Competitive advantage

Inference (prediction): 90%
Training: 10%

The challenges of inference in production

A closer look at GPU utilization for inference
Chart annotation: 90% underutilized for single-batch-size inference

More sessions don’t solve the problem

How cost effective are GPU instances for inference?
Smaller P2 instances are more effective for real time inference with small batch sizes

How do we optimize resources and reduce costs?

Introducing

Amazon Elastic Inference
Reduce deep learning inference costs up to 75%
• Integrated with Amazon EC2 and Amazon SageMaker
• Support for TensorFlow, Apache MXNet, and ONNX, with PyTorch coming soon
• Single and mixed-precision operations

Acceleration sizes tailored for inference
Accelerator type   FP32 throughput (TOPS)   FP16 throughput (TOPS)   Accelerator memory (GB)   Price ($/hr, US)
eia1.medium        1                        8                        1                         $0.13
eia1.large         2                        16                       2                         $0.26
eia1.xlarge        4                        32                       4                         $0.52
Now available in N. Virginia, Ohio, Oregon, Dublin, Tokyo, and Seoul
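The sizing table above can be turned into a small lookup, which may help when deciding which accelerator to try first. The sketch below is illustrative only: the figures are copied from the table, while the `Accelerator` class and the `smallest_fit` helper are hypothetical names introduced here, not part of any AWS SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Accelerator:
    """One row of the accelerator sizing table (hypothetical helper type)."""
    name: str
    fp32_tops: int
    fp16_tops: int
    memory_gb: int
    price_per_hour: float


# Values copied from the table on this slide (US pricing as presented).
ACCELERATORS = [
    Accelerator("eia1.medium", 1, 8, 1, 0.13),
    Accelerator("eia1.large", 2, 16, 2, 0.26),
    Accelerator("eia1.xlarge", 4, 32, 4, 0.52),
]


def smallest_fit(model_memory_gb: float,
                 min_fp16_tops: float = 0.0) -> Optional[Accelerator]:
    """Return the cheapest accelerator that covers the stated needs, or None."""
    for acc in sorted(ACCELERATORS, key=lambda a: a.price_per_hour):
        if acc.memory_gb >= model_memory_gb and acc.fp16_tops >= min_fp16_tops:
            return acc
    return None


def price_per_fp16_tops(acc: Accelerator) -> float:
    """Hourly price divided by FP16 throughput, for comparing sizes."""
    return acc.price_per_hour / acc.fp16_tops
```

With these numbers the price per FP16 TOPS is the same for all three sizes, so the choice comes down to how much memory and throughput the model needs rather than to a bulk discount.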

Inference Performance with EI and GPU

How does Elastic Inference work with Amazon EC2?
VPC
Region
Availability Zone

Scale capacity in EC2 Auto Scaling groups
Auto Scaling group

How does Elastic Inference work with SageMaker?
SageMaker Notebooks
SageMaker Hosted Endpoints

Model Support
• TensorFlow: Amazon EI enabled TensorFlow Serving
• Apache MXNet: Amazon EI enabled Apache MXNet
• ONNX: applied using Apache MXNet

Loading models and serving requests
TensorFlow models using Amazon EI enabled TensorFlow Serving
AmazonEI_TensorFlow_Serving_v1.11_v1 --model_name=inception --model_base_path=[model location] --port=9000
python inception_client.py --server=localhost:9000 --image Siberian_Husky_bi-eyed_Flickr.jpg

Loading models and serving requests
Load MXNet models using Amazon EI enabled Apache MXNet
import mxnet as mx
# Example for MXNet models: load a saved checkpoint
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-152', 0)
# Pass mx.eia() as context while creating the Module object
mod = mx.mod.Module(symbol=sym, context=mx.eia(), label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1,3,224,224))],
         label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_missing=True)
Load ONNX models using Amazon EI enabled Apache MXNet
import mxnet as mx
from mxnet.contrib import onnx as onnx_mxnet
# For ONNX models, use MXNet's import_model API as follows
# (onnx_model_file is the path to the .onnx file):
sym, arg, aux = onnx_mxnet.import_model(onnx_model_file)
# Pass mx.eia() as context while creating the Module object
mod = mx.mod.Module(symbol=sym, context=mx.eia())

How to choose?
Considerations as you choose an instance and accelerator type combination
for your model:
➢ What is the target latency SLA for your application, and what are your
constraints?
➢ Start small and size up if you need more capacity.
➢ Input/output data payload has an impact on latency.
➢ Convert to FP16 for lower latency and higher throughput.
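One way to act on the "start small and size up" advice is to measure latency on the smallest configuration and compare it with the SLA before moving to a larger one. The following is a minimal sketch under that assumption: `measure_latency` and `meets_sla` are hypothetical helper names, and `predict` stands for whatever callable sends one inference request in your setup.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(predict: Callable[[], object],
                    warmup: int = 5,
                    runs: int = 50) -> Dict[str, float]:
    """Time repeated calls to predict() and summarize latency in milliseconds."""
    # Warm-up calls are discarded so one-time setup cost does not skew results.
    for _ in range(warmup):
        predict()

    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    # Simple rank-based estimate of the 99th percentile, clamped to the last sample.
    p99_index = min(len(samples) - 1, int(0.99 * len(samples)))
    return {
        "p50_ms": statistics.median(samples),
        "p99_ms": samples[p99_index],
        "max_ms": samples[-1],
    }


def meets_sla(stats: Dict[str, float], sla_ms: float) -> bool:
    """True when the measured p99 latency is within the target SLA."""
    return stats["p99_ms"] <= sla_ms
```

If `meets_sla` returns False on the smallest accelerator, the same measurement can be repeated on the next size up, keeping the payload and batch size fixed so the runs stay comparable.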

Peter Jones
Head of AI Engineering
Autodesk AI Lab
Liviu-Mihai Calin
AI Systems Engineer
Autodesk AI Lab

MORE
IS INEVITABLE

LESS
IS A REALITY

Image courtesy of Tesla Motors, Inc. Image courtesy of Gensler.
The Martian © 2015 Twentieth Century Fox. All rights reserved.

OPPORTUNITY OF
BETTER

AI LAB

Multi-view Convolutional Neural Network (MVCNN)
2d-conv, 2d-conv, 2d-conv, 2d-conv, 2d-conv, batch-max, dense, dense, dense
Softmax Classifier
Embedding

Variational Autoencoder (VAE)

Instance Setup with Elastic Inference
aws ec2 run-instances \
    --image-id <preconfigured_ami_id> \
    --instance-type <ec2_instance_type> \
    --key-name <key_name> \
    --subnet-id <subnet_id> \
    --security-group-ids <security_group_id> \
    --iam-instance-profile Name="iam_profile_name" \
    --elastic-inference-accelerator Type=eia1.<size>
• Just like setting up a normal EC2 instance
• Create instance with preconfigured AMI and reference to accelerator
• A VPC endpoint to allow EC2 instance to connect to accelerator (done once)
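When the launch is scripted, the same command can be assembled programmatically. The sketch below only builds the argument list using the flags shown on this slide; the function name `build_run_instances_command` is hypothetical, and the allowed sizes are taken from the eia1 accelerator table earlier in the deck.

```python
from typing import List

# Accelerator sizes listed in the eia1 sizing table.
VALID_SIZES = ("medium", "large", "xlarge")


def build_run_instances_command(image_id: str,
                                instance_type: str,
                                key_name: str,
                                subnet_id: str,
                                security_group_id: str,
                                iam_profile_name: str,
                                accelerator_size: str) -> List[str]:
    """Build the `aws ec2 run-instances` argument list from the slide's flags."""
    if accelerator_size not in VALID_SIZES:
        raise ValueError(f"unknown accelerator size: {accelerator_size!r}")
    return [
        "aws", "ec2", "run-instances",
        "--image-id", image_id,
        "--instance-type", instance_type,
        "--key-name", key_name,
        "--subnet-id", subnet_id,
        "--security-group-ids", security_group_id,
        "--iam-instance-profile", f"Name={iam_profile_name}",
        "--elastic-inference-accelerator", f"Type=eia1.{accelerator_size}",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where the AWS CLI is installed and configured. Passing a list rather than a single string avoids shell quoting problems with the `Name=...` and `Type=...` values.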

Using Elastic Inference
• Serve saved model with EI version of TensorFlow model server
• Send requests to the server to predict with test data
• Elastic Inference takes care of accelerating the operations

Creating a Saved Model
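import tensorflow as tf
classifier = tf.estimator.Estimator(…)  # arguments elided on the slide
# Placeholder matching the input layout the model expects
input_tensor = tf.placeholder(dtype=tf.float32,
                              shape=[1, 80, 128, 128, 1],
                              name='images_tensor')
input_map = {'images': input_tensor}
# Export a SavedModel that TensorFlow Serving can load
classifier.export_savedmodel(
    model_dir,
    tf.estimator.export.build_raw_serving_input_receiver_fn(input_map))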
• The MVCNN model is in TF Estimator format and has been trained
• It expects grayscale multi-view images named “images” as input
• Dimensions: [batch_size , num_views , width , height , color_channels]
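Because the earlier "How to choose?" slide notes that the input/output payload affects latency, it can be useful to work out how large a single request is for this input shape. The sketch below is plain arithmetic on the dimensions above; `payload_bytes` is a hypothetical helper, and it counts raw tensor data only, not serialization or network overhead.

```python
import math
from typing import Sequence

# Bytes per element for the two precisions mentioned in the talk.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2}

# [batch_size, num_views, width, height, color_channels] from the slide.
MVCNN_SHAPE = (1, 80, 128, 128, 1)


def payload_bytes(shape: Sequence[int], dtype: str = "float32") -> int:
    """Raw size in bytes of one dense tensor with the given shape and dtype."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype]


fp32_size = payload_bytes(MVCNN_SHAPE, "float32")  # 5,242,880 bytes (5 MiB)
fp16_size = payload_bytes(MVCNN_SHAPE, "float16")  # 2,621,440 bytes (2.5 MiB)
```

A single float32 request for this model therefore carries about 5 MiB of image data. The float16 figure is shown only for comparison and assumes the inputs themselves would be sent in half precision, which the slides do not state.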

Predicting with EI TensorFlow Serving
AmazonEI_TensorFlow_Serving_v1.11_v1 --model_name=mvcnn \
    --model_base_path=model_dir --port=9000
• Have one process serve the previously exported saved model
• Have another process send requests containing input data

Predicting with EI TensorFlow Serving
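import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
tf.app.flags.DEFINE_string('server', 'localhost:9000',
                           'PredictionService host:port')
FLAGS = tf.app.flags.FLAGS
# Open a gRPC channel to the EI-enabled model server
channel = grpc.insecure_channel(FLAGS.server)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
# Build a request for the default serving signature of the exported model
request = predict_pb2.PredictRequest()
request.model_spec.name = 'mvcnn'
request.model_spec.signature_name = 'serving_default'
input_array = get_next_input()  # not defined on the slide; supplies the next input
request.inputs['images'].CopyFrom(
    tf.contrib.util.make_tensor_proto(input_array, dtype=tf.float32,
                                      shape=[1, 80, 128, 128, 1]))
result = stub.Predict(request, 30.0)  # 30-second timeout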

Results - MVCNN
Chart axes: inference time (seconds), hourly cost ($)

Results - VAE
Chart axes: inference time (seconds), hourly cost ($)

Autodesk, the Autodesk logo, and Revit are registered trademarks or trademarks of Autodesk, Inc., and/or its subsidiaries and/or affiliates in the USA and/or other countries. All other brand names, product names, or trademarks belong to their respective holders. Autodesk reserves the right to alter product and services offerings, and specifications and pricing at any time without notice, and is not responsible for typographical or graphical errors that may appear in this document.
© 2018 Autodesk. All rights reserved.

Summary
• EI accelerators are available in a range of sizes suitable for inference workloads.
• Configure to launch with any EC2 instance type; scale capacity with Auto Scaling
groups.
• EI configuration is also available through CloudFormation as you configure your
instance resource.
• Deploy TensorFlow, MXNet, and ONNX models with no code changes.
• Integrated with SageMaker for a fully managed experience.
aws.amazon.com/machine-learning/elastic-inference/

Thank you!
