AWS re:Invent
Role of Tensors in Large-Scale Machine Learning
Anima Anandkumar
Principal Scientist, AWS
Bren Professor, Caltech
Tensors are Multi-Dimensional
Modern data is inherently multi-dimensional
Tensors to encode multi-dimensional data
• Images: 3 dimensions
• Videos: 4 dimensions
• Visual Q&A ("Are there trees in the image?"): ? dimensions
Tensors to encode higher-order moments
• Pairwise correlations
• Third-order correlations
Operations on Tensors: Tensor Contraction
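A tensor contraction sums two tensors over one or more shared modes, generalizing matrix multiplication to higher orders. As a minimal illustration (shapes and names chosen only for this example, not taken from the talk), contracting the second mode of a 3-way tensor with a matrix can be written with einsum:

import numpy as np

T = np.random.randn(4, 5, 6)            # 3-way tensor of shape (I, J, K)
M = np.random.randn(3, 5)               # factor matrix applied to the second mode
out = np.einsum('ijk,rj->irk', T, M)    # sum over the shared mode j
print(out.shape)                        # (4, 3, 6)

This mode-wise product is the basic building block reused by the tensorized layers below.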
Tensorized Neural Networks
https://github.com/awslabs/amazon-sagemaker-examples
Deep Neural Nets: Transforming Tensors
Deep Tensorized Networks
Tensor Contraction & Regression Networks, J. Kossaifi, Z. C. Lipton, A. Khanna, T. Furlanello, A. Anandkumar, 2017.
Space Saving in Deep Tensorized Networks
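The space saving comes from replacing the flatten-plus-fully-connected head of a network with low-rank tensor operations. Below is a minimal, hypothetical sketch of a tensor contraction layer in PyTorch (an illustration of the idea, not the authors' implementation; the class name and shapes are made up for the example): each non-batch mode of the activation tensor is contracted with a small factor matrix, shrinking the activations while preserving their multi-dimensional structure.

import torch
import torch.nn as nn

class TensorContractionLayer(nn.Module):
    def __init__(self, input_shape, output_shape):
        # input_shape / output_shape: sizes of the non-batch modes, e.g. (256, 14, 14) -> (64, 7, 7)
        super().__init__()
        self.factors = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(out_dim, in_dim))
             for in_dim, out_dim in zip(input_shape, output_shape)]
        )

    def forward(self, x):
        # x: (batch, *input_shape); contract mode k+1 of x with factors[k]
        for k, factor in enumerate(self.factors):
            x = torch.tensordot(x, factor, dims=([k + 1], [1]))
            x = torch.movedim(x, -1, k + 1)   # tensordot appends the new mode last; move it back
        return x

tcl = TensorContractionLayer((256, 14, 14), (64, 7, 7))
print(tcl(torch.randn(8, 256, 14, 14)).shape)   # torch.Size([8, 64, 7, 7])

The parameter count is the sum of the small factor matrices rather than the product of all dimensions, which is where the compression comes from.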
TensorLy: Framework for Tensor Algebra
• Python programming
• User-friendly API
• Multiple backends: flexible + scalable
• Example notebooks in repository
TensorLy with PyTorch backend

import tensorly as tl
from tensorly import tucker_to_tensor
from tensorly.random import tucker_tensor
import torch
from torch.autograd import Variable

tl.set_backend('pytorch')                        # Set PyTorch backend

# Target tensor and hyperparameters (illustrative values, not given on the slide)
tensor = torch.randn(5, 5, 5)
lr, n_iter = 1e-2, 1000

core, factors = tucker_tensor((5, 5, 5), rank=(3, 3, 3))   # random Tucker form

# Attach gradients
core = Variable(core, requires_grad=True)
factors = [Variable(f, requires_grad=True) for f in factors]

# Set optimizer
optimiser = torch.optim.Adam([core] + factors, lr=lr)

for i in range(1, n_iter):
    optimiser.zero_grad()
    rec = tucker_to_tensor(core, factors)        # rebuild the full tensor from the Tucker form
    loss = (rec - tensor).pow(2).sum()           # reconstruction error
    for f in factors:
        loss = loss + 0.01 * f.pow(2).sum()      # L2 penalty on the factors
    loss.backward()
    optimiser.step()
Speeding up Tensor Contractions
https://github.com/awslabs/amazon-sagemaker-examples
New Primitive for Tensor Contractions
Performance of StridedBatchedGEMM
Performance on par with pure GEMM (P100 and beyond).
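The reason a new primitive helps: a single-mode tensor contraction can be viewed as one batched GEMM over strided views of the operands, with no explicit transpose or copy. A rough sketch of that equivalence in PyTorch (illustrative shapes; this is not the cuBLAS-level code itself):

import torch

m, k, n, p = 64, 32, 48, 100
A = torch.randn(m, k, p)
B = torch.randn(k, n)

# The contraction C[m, n, p] = sum_k A[m, k, p] * B[k, n], written directly:
C_ref = torch.einsum('mkp,kn->mnp', A, B)

# Batched-GEMM view: treat p as the batch index. A.permute(2, 0, 1) and
# B.expand(p, k, n) are stride-only views, and bmm runs one batched GEMM.
C_bmm = torch.bmm(A.permute(2, 0, 1), B.expand(p, k, n)).permute(1, 2, 0)

print(torch.allclose(C_ref, C_bmm, atol=1e-5))   # True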
Tensors in Time Series
https://github.com/awslabs/amazon-sagemaker-examples
Tensors for long-term forecasting
Difficulties in long-term forecasting:
• Long-term dependencies
• High-order correlations
• Error propagation
RNNs: First-Order Markov Models
Input state $x_t$, hidden state $h_t$, output $y_t$:
$h_t = f(x_t, h_{t-1}; \theta), \qquad y_t = g(h_t; \theta)$
Tensorized RNNs
L-order Markov:
$h_t = f(x_t, h_{t-1}, h_{t-2}, \ldots, h_{t-L}; \theta), \qquad y_t = g(h_t; \theta)$
Polynomial interactions:
$s_{t-1} = [1, h_{t-1}, h_{t-2}, \ldots, h_{t-L}]$
$h_{t,\alpha} = f\Big( W^{hx} x_t + \sum_{i_1, \ldots, i_P} \mathcal{W}_{\alpha, i_1, \ldots, i_P} \; s_{t-1; i_1} \otimes \cdots \otimes s_{t-1; i_P} ;\; \theta \Big)$
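To make the recurrence concrete, here is a toy sketch (P = 2, a single dense weight tensor, no tensor-train factorization yet; all sizes and names are illustrative): it builds s_{t-1} from the last L hidden states and applies a second-order polynomial interaction.

import torch

d_in, d_h, L = 8, 16, 3
W_hx = 0.1 * torch.randn(d_h, d_in)
d_s = 1 + L * d_h                        # size of the concatenated state s_{t-1}
W = 0.01 * torch.randn(d_h, d_s, d_s)    # W[alpha, i1, i2]

def step(x_t, h_prev):
    s = torch.cat([torch.ones(1)] + h_prev)        # s_{t-1} = [1, h_{t-1}, ..., h_{t-L}]
    poly = torch.einsum('aij,i,j->a', W, s, s)     # sum_{i1,i2} W[alpha,i1,i2] s[i1] s[i2]
    return torch.tanh(W_hx @ x_t + poly)

h_hist = [torch.zeros(d_h) for _ in range(L)]
for t in range(5):
    h_t = step(torch.randn(d_in), h_hist)
    h_hist = [h_t] + h_hist[:-1]                   # keep only the last L hidden states
print(h_t.shape)                                   # torch.Size([16])

For realistic L and P the dense tensor W is far too large, which is exactly what the tensor-train decomposition on the next slide addresses.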
Tensor-Train Decomposition
• Reduce the number of parameters
• Linear tensor network
• Dimension-free decomposition
$\mathcal{W}_{i_1 \cdots i_P} = \sum_{\alpha_0, \ldots, \alpha_P} \mathcal{A}_{\alpha_0 i_1 \alpha_1} \cdots \mathcal{A}_{\alpha_{P-1} i_P \alpha_P}$
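To see the format in action, the sketch below (plain NumPy, illustrative shapes, not tied to any particular library) rebuilds the full tensor W from its tensor-train cores by chaining the rank indices; only the small cores are ever stored.

import numpy as np

def tt_to_full(cores, dims):
    # cores[k] has shape (r_{k-1}, n_k, r_k), with r_0 = r_P = 1
    full = cores[0]
    for core in cores[1:]:
        # contract the trailing rank index with the leading rank index of the next core
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.reshape(dims)

dims = [4, 5, 6]                 # n_1, n_2, n_3
ranks = [1, 3, 3, 1]             # r_0, ..., r_3
cores = [np.random.randn(ranks[k], dims[k], ranks[k + 1]) for k in range(len(dims))]
W = tt_to_full(cores, dims)
print(W.shape)                   # (4, 5, 6), stored with only sum_k r_{k-1} * n_k * r_k numbers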
Tensor-Train RNNs and LSTMs
• Seq2seq architecture
• TT-LSTM cells
Tensor LSTM for Long-term Forecasting
[Figures: forecasting results on a climate dataset and a traffic dataset]
Approximation Guarantees
Theorem 1 [Yu et al. 2017]: Let the state transition function $f \in \mathcal{H}_\mu^k$ be Hölder continuous, with bounded derivatives up to order $k$ and finite Fourier magnitude distribution $C_f$. Then a single-layer Tensor-Train RNN can approximate $f$ with estimation error $\varepsilon$ using $h$ hidden units, where
$h \;\le\; \frac{C_f^2}{\varepsilon} \cdot \frac{(d-1)\,(r+1)^{-(k-1)}}{k-1} \;+\; \frac{C_f^2}{\varepsilon}\, C(k)\, p^{-k}$
where $d$ is the size of the state space, $r$ is the tensor-train rank, and $p$ is the degree of the higher-order polynomial, i.e., the order of the tensor.
• Easier to approximate if the function is smooth
• Polynomial interactions are more efficient
Long-term Forecasting using Tensor-Train RNNs, R. Yu, S. Zheng, A. Anandkumar, Y. Yue, 2017.
Visual Question & Answering
• Tensors for multiple modalities
• Tensor sketching algorithms
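One way tensors enter VQA is through the tensor (count) sketch trick for multimodal pooling: the sketch of the outer product of the image and question features equals the circular convolution of their individual count sketches, so the huge bilinear feature is never materialized. A rough NumPy sketch of this idea (dimensions and names are illustrative, not from the talk):

import numpy as np

rng = np.random.default_rng(0)
d, D = 512, 2048                              # feature dim, sketch dim

# One random hash h: [d] -> [D] and sign vector s: [d] -> {-1, +1} per modality
h = [rng.integers(0, D, size=d) for _ in range(2)]
s = [rng.choice([-1.0, 1.0], size=d) for _ in range(2)]

def count_sketch(v, h, s):
    out = np.zeros(D)
    np.add.at(out, h, s * v)                  # scatter-add signed entries into buckets
    return out

x, y = rng.standard_normal(d), rng.standard_normal(d)    # e.g. image and question features
phi = np.fft.irfft(np.fft.rfft(count_sketch(x, h[0], s[0])) *
                   np.fft.rfft(count_sketch(y, h[1], s[1])), n=D)
print(phi.shape)                              # (2048,): compact stand-in for the d*d outer product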
Tensors for Topic Modeling
https://github.com/awslabs/amazon-sagemaker-examples
Tensors for Topic Detection
LDA Topic Model
[Figure: documents modeled as mixtures of topics such as Government, Information Technology, Politics, Justice, Education, and Sports, each topic a distribution over words (e.g., brain, compute, data, evolve, gene, neuron)]
Learning LDA Model
• Topic-word matrix: P[word = i | topic = j]
• Topic proportions: P[topic = j | document]
• Moment tensor: co-occurrence of word triplets decomposes as a sum of rank-one components, one per topic (e.g., Crime, Sports, Education)
• Spectral decomposition recovers the components
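As a rough illustration of the spectral approach (a toy sketch, not the SageMaker implementation; it uses random placeholder data and omits the lower-order correction terms of the full LDA moment estimator), one can form an empirical third-order word co-occurrence tensor and factor it with a CP decomposition in TensorLy, with each rank-one component corresponding to one topic.

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

vocab_size, num_topics, num_docs = 50, 5, 1000

# Toy corpus statistics (random placeholder): per-document word distributions
counts = np.random.poisson(1.0, size=(num_docs, vocab_size)) + 1e-3
probs = counts / counts.sum(axis=1, keepdims=True)

# Empirical third-order moment: average outer product of the word distributions
M3 = np.einsum('di,dj,dk->ijk', probs, probs, probs) / num_docs

# CP decomposition; recent TensorLy versions return (weights, factors)
weights, factors = parafac(tl.tensor(M3), rank=num_topics)
topic_word = factors[0]          # columns relate to topic-word distributions (up to scaling)
print(topic_word.shape)          # (50, 5)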
Why Tensors?
Statistical reasons:
• Incorporate higher order relationships in data
• Discover hidden topics (not possible with matrix methods)
Computational reasons:
• Tensor algebra is parallelizable like linear algebra.
• Faster than other algorithms for LDA
• Flexible: Training and inference decoupled
• Guaranteed in theory to converge to global optimum
A. Anandkumar et al., Tensor Decompositions for Learning Latent Variable Models, JMLR 2014.
Spectral LDA on AWS SageMaker
SageMaker LDA training is faster
[Charts: training time in minutes vs. number of topics (5 to 100), Spectral LDA vs. Mallet, on NYTimes (300,000 documents) and PubMed (8 million documents)]
• Spectral LDA training is 22x and 12x faster than Mallet on average on the two corpora
• Mallet is an open-source framework for topic modeling
• Mallet does training and inference together
• Benchmarks run on the AWS SageMaker platform
Introducing Amazon SageMaker
The quickest and easiest way to get ML models from idea to production
• End-to-end machine learning platform
• Zero setup
• Flexible model training
• Pay by the second
Many optimized algorithms on SageMaker
• XGBoost, FM, and Linear for classification and regression
• K-means and PCA for clustering and dimensionality reduction
• Image classification with convolutional neural networks
• LDA and NTM for topic modeling, seq2seq for translation
AWS ML Stack
• Application Services: Vision (Rekognition Image, Rekognition Video), Speech (Polly, Transcribe), Language (Lex, Translate, Comprehend)
• Platform Services: Amazon SageMaker, AWS DeepLens, Amazon Machine Learning, Mechanical Turk, Spark & EMR
• Frameworks & Infrastructure: AWS Deep Learning AMI; Apache MXNet, PyTorch, Cognitive Toolkit, Keras, Caffe2 & Caffe, TensorFlow, Gluon; GPU (P3 Instances), CPU (C5 Instances), Mobile, IoT (Greengrass)
Conclusion
• Tensors are higher-order extensions of matrices.
• They encode data dimensions, modalities, and higher-order relationships.
• Tensor algebra is richer than matrix algebra: it enables richer neural network architectures and more compact networks with better accuracy.
• Tensor operations are embarrassingly parallel, and new primitives accelerate them.
THANK YOU!

Editor's Notes

  • Introducing Amazon SageMaker (end-to-end machine learning platform): Amazon SageMaker offers a familiar integrated development environment so that you can start processing your training dataset and developing your algorithms immediately. With one-click training, it provides a distributed training environment complete with high-performance machine learning algorithms and built-in hyperparameter optimization for auto-tuning your models. When you're ready to deploy, launching a secure and elastically scalable production environment is as simple as clicking a button in the Amazon SageMaker management console.
    Zero setup: Amazon SageMaker provides hosted Jupyter notebooks that require no setup, so you can begin processing your training datasets and developing your algorithms immediately. With a few clicks in the Amazon SageMaker console, you can create a fully managed notebook instance, pre-loaded with useful libraries for machine learning and deep learning frameworks such as TensorFlow and Apache MXNet. You need only add your data.
    Flexible model training: With native support for bring-your-own algorithms and frameworks, model training in Amazon SageMaker is flexible. It provides native Apache MXNet and TensorFlow support and offers a range of built-in, high-performance machine learning algorithms, in addition to supporting popular open-source algorithms. If you want to train with another algorithm or an alternative deep learning framework, you simply bring your own algorithm or framework via a Docker container.
    Pay by the second: With Amazon SageMaker, you pay only for what you use. Authoring, training, and hosting are billed by the second, with no minimum fees and no upfront commitments. Pricing is broken down by on-demand ML instances, ML storage, and fees for data processing in notebooks and hosting instances.
  • Many optimized algorithms on SageMaker: 1) Linear Learner (regression), 2) Linear Learner (classification), 3) K-means, 4) Principal Component Analysis, 5) Factorization Machines, 6) Neural Topic Modeling, 7) Latent Dirichlet Allocation, 8) XGBoost, 9) Seq2Seq, 10) Image classification (ResNet).