This document describes how Amazon generates product recommendations at scale with Apache Spark and Amazon DSSTNE. Data scientists develop queries in an agile manner using Spark with Zeppelin notebooks; deep learning jobs run on GPUs via Amazon ECS, while CPU jobs run on Amazon EMR. DSSTNE is optimized for large sparse neural networks and lets engineers define networks in a human-readable JSON format, which makes it well suited to Amazon's large recommendation problems.
8. This Is A Huge Sparse Data Problem
• Uncompressed sparse data either eats a lot of memory, or it eats a lot of bandwidth uploading it to the GPU
• Naively running networks with uncompressed sparse data leads to lots of multiplications of zero by zero. This wastes memory, power, and time
• Product Recommendation Networks can have billions of parameters that cannot fit in a single GPU, so summarizing...
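To make the waste concrete, here is a minimal illustrative sketch (not DSSTNE code; the catalog size and indices are made up) contrasting a dense purchase-history vector with a sparse index encoding. A sparse-aware dot product touches only the nonzero entries:

```python
import numpy as np

# A customer's purchase history over a 1M-product catalog:
# dense storage burns ~4 MB (float32) to represent a handful of purchases.
catalog_size = 1_000_000
purchased = [17, 4_242, 99_871]           # indices of purchased products

dense = np.zeros(catalog_size, dtype=np.float32)
dense[purchased] = 1.0                    # ~4 MB, almost all zeros

# Sparse encoding: just the nonzero indices (12 bytes here).
sparse = np.asarray(purchased, dtype=np.int32)

# First-layer activation for one hidden unit with weight vector w:
w = np.random.rand(catalog_size).astype(np.float32)
dense_act = dense @ w                     # 1M multiply-adds, mostly 0 * w[i]
sparse_act = w[sparse].sum()              # 3 loads and 2 adds, same result

assert np.isclose(dense_act, sparse_act, rtol=1e-4)
```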
9. Framework Requirements (2014)
• Efficient support for large input and output layers
• Efficient handling of sparse data (i.e. don't store zero)
• Automagic multi-GPU support for large networks and scaling
• Avoids multiplying zero and/or by zero
• 24-hour-or-less training and recommendations turnaround
• Human-readable descriptions of networks
10. DSSTNE: Deep Sparse Scalable Tensor Network Engine*
• A neural network framework released into OSS by Amazon
• Optimized for large sparse data problems and for fully connected layers
• Extremely efficient model-parallel multi-GPU support
• 100% deterministic execution
• Full SM 3.x, 5.x, and 6.x support (Kepler or better GPUs)
• Distributed training support OOTB (~20 lines of MPI calls)
*"Destiny"
12. Summary for DSSTNE
• Very efficient performance for sparse, fully connected neural networks
• Multi-GPU support via both model parallelism and data parallelism
• Networks declared in a human-readable format: a JSON definition
• 100% deterministic execution
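As an illustration of the JSON format, here is a minimal sketch of a three-layer sparse network in the style of the examples shipped with the DSSTNE repository. The network name, layer sizes, and dataset names ("gl_input", "gl_output") are placeholders; consult the repository's documentation for the authoritative schema:

```json
{
  "Version": 0.7,
  "Name": "ProductAutoencoder",
  "Kind": "FeedForward",
  "ErrorFunction": "ScaledMarginalCrossEntropy",
  "Layers": [
    { "Name": "Input",  "Kind": "Input",  "N": "auto", "DataSet": "gl_input",  "Sparse": true },
    { "Name": "Hidden", "Kind": "Hidden", "Type": "FullyConnected", "N": 128, "Activation": "Sigmoid", "Sparse": true },
    { "Name": "Output", "Kind": "Output", "Type": "FullyConnected", "N": "auto", "DataSet": "gl_output", "Activation": "Sigmoid", "Sparse": true }
  ]
}
```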
14. Productivity
• Agile iteration matters most for productivity: design => train => predict => evaluate => design => ...
• Training: GPU (DSSTNE and others)
• Pre/post-processing: CPU
• How do we unify these different workloads? Data scientists don't want to juggle too many tools.
16. What are Containers?
• OS virtualization
• Process isolation
• Images
• Automation
(Diagram: container stack — server, guest OS, shared bins/libs, and isolated apps App1 and App2.)
17. Deep Learning meets Docker (Containers)
• There are a lot of deep learning frameworks: DSSTNE, Caffe, Theano, TensorFlow, etc.
• To compare each framework using the same input and output, containerize each framework
• Then just swap the container image and configuration
• No more worrying about setting up machines!
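The swap can be as small as changing one string. Below is a minimal sketch using the Python docker SDK; the image names, mount paths, and command are hypothetical, and the deck itself does not prescribe a client library:

```python
import docker

client = docker.from_env()

# Same training job, same mounted data; only the framework image changes.
FRAMEWORK_IMAGES = {
    "dsstne":     "example/dsstne:latest",       # hypothetical image names
    "tensorflow": "example/tf-trainer:latest",
}

def run_training(framework: str, config_path: str) -> str:
    """Run one containerized training job and return its logs."""
    return client.containers.run(
        image=FRAMEWORK_IMAGES[framework],
        command=["train", "--config", config_path],
        volumes={"/mnt/data": {"bind": "/data", "mode": "ro"}},
        remove=True,                   # clean up the container afterwards
    ).decode("utf-8")

print(run_training("dsstne", "/data/config.json"))
```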
19. Spark moves at interactive speed
(Diagram: an execution DAG of map, join, filter, and groupBy operations over RDDs A–F, divided into Stages 1–3, with cached partitions marked.)
• Massively parallel
• Uses DAGs instead of map-reduce for execution
• Minimizes I/O by storing data in DataFrames in memory
• Partitioning-aware to avoid network-intensive shuffle
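A minimal PySpark sketch of the same idea (illustrative only; the paths and column names are made up): transformations merely build the DAG, caching keeps the hot DataFrame in memory, and work happens only when an action runs.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

views = spark.read.parquet("s3://example-bucket/views/")   # hypothetical paths
items = spark.read.parquet("s3://example-bucket/items/")

# Transformations only build the DAG; nothing executes yet.
popular = (views.filter(F.col("dwell_seconds") > 30)
                .join(items, "item_id")
                .groupBy("category")
                .count())

popular.cache()     # keep the result in memory across queries
popular.count()     # action: triggers the staged, parallel execution
popular.show(10)    # served from the cached partitions
```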
22. Controlling the CPU cluster and the GPU cluster
• Both CPU and GPU jobs are submitted via the Spark driver
• CPU jobs: normal Spark tasks running on Amazon EMR
• GPU jobs: Spark submits jobs to Amazon ECS
• Not only DSSTNE but also other DL frameworks, thanks to Docker
26. Why EMR? Decoupled Architecture
• Separate compute and storage
• Resize and shut down with no data loss
• Point multiple clusters at the same data on Amazon S3
• Easily evolve infrastructure as technology evolves
• HDFS for iterative and disk-I/O-intensive workloads
• Save with Spot and Reserved Instances
27. Why EMR? Decouple Storage and Compute
(Diagram: Amazon Kinesis (Streams, Firehose) feeding Amazon S3 storage, with workload-specific clusters of different sizes and versions all reading the same data: a persistent cluster for interactive queries (Spark SQL | Presto | Impala), a transient cluster for batch Hadoop jobs (X hours nightly, adding/removing nodes), ETL jobs, and a Hive external metastore, e.g. Amazon RDS.)
create external table t_name (..) ...
location 's3://bucketname/path-to-file/';
30. Amazon EC2 Container Service (ECS)
• Container management at any scale
• Flexible container placement
• Integration with the AWS platform
31. Components of Amazon ECS
• Task: the actual containers running on instances
• Task Definition: the definition of the containers and environment for a task
• Cluster: the fleet of EC2 instances on which tasks run
• Manager: manages cluster resources and the state of tasks
• Scheduler: places tasks considering cluster status
• Agent: coordinates EC2 instances and the Manager
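A task definition can be registered programmatically. The deck uses the AWS SDK for Java on the EMR cluster; the sketch below shows the equivalent boto3 (Python) call for brevity, with hypothetical names throughout:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")    # region is an assumption

# Register a task definition: one GPU training container per framework.
ecs.register_task_definition(
    family="dsstne-training",                         # hypothetical family name
    containerDefinitions=[{
        "name": "trainer",
        "image": "example/dsstne:latest",             # hypothetical image
        "memory": 8192,                               # MiB reserved for the task
        "command": ["train", "--config", "/data/config.json"],
    }],
)
```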
33. Integration with Spark and ECS
• Install the AWS SDK for Java on the EMR cluster
• Create a Task Definition for each deep learning framework
• Call the RunTask API; the ECS scheduler will try to find enough space to run the task
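Continuing the boto3 sketch above (again standing in for the Java SDK the deck names), the driver kicks off a GPU job by calling RunTask against the cluster; the ECS scheduler then looks for an instance with enough free resources:

```python
# Called from the Spark driver when a GPU stage is ready to run.
response = ecs.run_task(
    cluster="gpu-cluster",                 # hypothetical ECS cluster name
    taskDefinition="dsstne-training",      # registered above
    overrides={"containerOverrides": [{
        "name": "trainer",
        "command": ["train", "--config", "/data/job-42.json"],
    }]},
)
print(response["tasks"][0]["taskArn"])     # track the placed task
```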
42. Amazon Personalization runs on AWS
• Spark and Zeppelin provide a single interface for data scientists
• DSSTNE makes deep learning practical on huge, sparse neural networks
• Amazon EMR handles CPU jobs; Amazon ECS handles GPU jobs
• You can do it!