What are algorithms? How can I build a machine learning model? In machine learning, training large models on massive amounts of data usually improves results. Our customers report, however, that training and deploying such models is either operationally prohibitive or outright impossible for them. At Amazon, we created a collection of machine learning algorithms that scale to any amount of data, including k-means clustering for data segmentation, factorization machines for recommendations, and time-series forecasting. This talk covers those algorithms, where and how they can be used, and the design choices behind them.
And Then There Are Algorithms
1. And Then There Are Algorithms
Danilo Poccia
Evangelist, Serverless
danilop@amazon.com
@danilop
danilop
2. Letter from Ada Lovelace to Charles Babbage, 1843
In this letter, Lovelace suggests an example of a calculation which “may be worked out by the engine without having been worked out by human head and hands first”.
6. What is an Algorithm?
Euclid’s algorithm for the GCD of two numbers (flowchart by Somepics, CC BY-SA 4.0, via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Euclid_flowchart.svg)
Worked trace for A = 12, B = 18 (a minimal implementation follows):
A  | B
12 | 18
12 | 6
6  | 6
6  | 0
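A minimal Python sketch of the subtraction-based variant shown in the flowchart, assuming positive integer inputs:

```python
def gcd(a, b):
    """Euclid's subtraction-based GCD, matching the flowchart's trace."""
    while b != 0:
        if a > b:
            a -= b   # replace the larger value with the difference
        else:
            b -= a
    return a

print(gcd(12, 18))  # -> 6, matching the trace above
```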
7. What is an Algorithm?
“You use code to tell a computer what to do. Before you write code you need an algorithm. An algorithm is a list of rules to follow in order to solve a problem.”
BBC Bitesize
8. The Master Algorithm
“The future belongs to those who understand at a very deep level how to combine their unique expertise with what algorithms do best.”
Pedro Domingos
17. Stochastic Gradient Descent (SGD)
https://en.wikipedia.org/wiki/Himmelblau's_function
Global vs. local minimum (figure: the surface of Himmelblau’s function; a descent sketch follows)
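As a hedged illustration of the idea (not code from the talk), here is plain gradient descent on Himmelblau’s function; which minimum it reaches depends on the starting point. True SGD additionally estimates the gradient from random samples of the training data.

```python
# Himmelblau's function: f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2
import numpy as np

def grad(p):
    x, y = p
    a = x ** 2 + y - 11
    b = x + y ** 2 - 7
    # Partial derivatives of f with respect to x and y
    return np.array([4 * a * x + 2 * b, 2 * a + 4 * b * y])

for start in (np.array([0.0, 0.0]), np.array([-3.0, 3.0])):
    p = start.copy()
    for _ in range(2000):
        p -= 0.01 * grad(p)  # fixed learning rate, full gradient
    print(start, "->", p.round(3))  # different starts reach different minima
```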
18. Factorization Machines
• A factorization machine is an extension of a linear model, designed to parsimoniously capture interactions between features within high-dimensional sparse datasets
• Factorization machines are a good choice for tasks such as click prediction and item recommendation (see the sketch after this list)
• They are usually trained by stochastic gradient descent (SGD), alternating least squares (ALS), or Markov chain Monte Carlo (MCMC)
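An illustrative numpy sketch (not the talk’s code) of the second-order factorization machine prediction, computed with the standard O(nk) reformulation of the pairwise term:

```python
# y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j
# where the pairwise sum equals
#   0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
import numpy as np

def fm_predict(x, w0, w, V):
    """x: (n,) features; w0: bias; w: (n,) linear weights; V: (n, k) latent factors."""
    s = V.T @ x                  # (k,) sums of V[i,f] * x_i
    s2 = (V ** 2).T @ (x ** 2)   # (k,) sums of V[i,f]^2 * x_i^2
    pairwise = 0.5 * np.sum(s ** 2 - s2)
    return w0 + w @ x + pairwise

rng = np.random.default_rng(0)
n, k = 6, 3  # toy sizes
print(fm_predict(rng.random(n), 0.1, rng.random(n), rng.random((n, k))))
```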
23. XGBoost
• Ensemble methods use multiple learning algorithms
to improve predictions
• Boosting: “Can a set of weak learners create a single
strong learner?”
• Gradient Boosting: using gradient descent over a
function space
• eXtreme Gradient Boosting
• https://github.com/dmlc/xgboost
• Supports regression, classification, ranking, and user-defined objectives (a minimal regression sketch follows)
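A minimal XGBoost regression sketch; it assumes the xgboost and scikit-learn packages are installed, and the synthetic dataset and parameters are illustrative only:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic regression data: linear signal plus noise
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Gradient-boosted trees with a squared-error objective
params = {"objective": "reg:squarederror", "max_depth": 4, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dtest, "test")], verbose_eval=False)
print(booster.predict(dtest)[:5])
```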
27. Convolutional Neural Networks (CNNs)
By Debarko De @debarko
https://hackernoon.com/what-is-a-capsnet-or-capsule-network-2bfbe48769cc
28. Sequence to Sequence (seq2seq)
• seq2seq is a supervised learning algorithm where the
input is a sequence of tokens (for example, text,
audio) and the output generated is another sequence
of tokens.
• Example applications include:
• machine translation (input a sentence from
one language and predict what that sentence
would be in another language)
• text summarization (input a longer string of
words and predict a shorter string of words
that is a summary)
• speech-to-text (audio clips converted into
output sentences in tokens).
29. Sequence to Sequence (seq2seq)
• Recently, problems in this domain have been
successfully modeled with deep neural networks
that show a significant performance boost over
previous methodologies.
• Amazon released the open-source Sockeye package, which implements encoder-decoder architectures with attention, using Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models.
• https://github.com/awslabs/sockeye
30. Sequence to Sequence (seq2seq)
https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
31. Sequence to Sequence (seq2seq)
https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
“Das grüne Haus” ⇾ “the Green House”
34. Principal Component Analysis (PCA)
• PCA is an unsupervised learning algorithm that
attempts to reduce the dimensionality (number of
features) within a dataset while still retaining as
much information as possible
• This is done by finding a new set of features called
components, which are composites of the original
features that are uncorrelated with one another
• They are also constrained so that the first component accounts for the largest possible variability in the data, the second component for the second most variability, and so on (a toy sketch follows)
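A toy PCA sketch using scikit-learn (assumed available); the synthetic data is built so that most of its variance lies in two directions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples, 5 features, generated from 2 latent directions plus small noise
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + rng.normal(scale=0.05, size=(200, 5))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)               # uncorrelated composite components
print(pca.explained_variance_ratio_)   # first component explains the most variance
```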
38. Latent Dirichlet Allocation (LDA)
• As an extremely simple example, given a set of documents where the only words that occur within them are eat, sleep, play, meow, and bark, LDA might produce topics like the following (a runnable sketch follows):

Topic           | eat | sleep | play | meow | bark
Topic 1 (cats?) | 0.1 | 0.3   | 0.2  | 0.4  | 0.0
Topic 2 (dogs?) | 0.2 | 0.1   | 0.4  | 0.0  | 0.3
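A hypothetical scikit-learn sketch of LDA on tiny “cat” and “dog” documents; the corpus, vocabulary, and topic count are illustrative assumptions (recent scikit-learn assumed):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "meow sleep eat meow play",
    "bark play eat bark play",
    "meow meow sleep play",
    "bark eat play sleep",
]
vectorizer = CountVectorizer().fit(docs)
X = vectorizer.transform(docs)  # term-count matrix, documents x vocabulary

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    dist = weights / weights.sum()  # normalize to a per-topic word distribution
    print(f"Topic {t + 1}:", dict(zip(vocab, dist.round(2))))
```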
39. Neural Topic Model (NTM)
Architecture (diagram): input term-counts vector ⇾ encoder (feedforward net) ⇾ document posterior ⇾ sampled document representation ⇾ decoder (softmax) ⇾ output term-counts vector (a toy sketch follows)
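A toy PyTorch sketch of this VAE-style encoder-decoder shape; this is a hypothetical minimal model for illustration, not SageMaker’s NTM implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniNTM(nn.Module):
    def __init__(self, vocab_size, num_topics, hidden=100):
        super().__init__()
        self.enc = nn.Linear(vocab_size, hidden)     # encoder: feedforward net
        self.mu = nn.Linear(hidden, num_topics)      # document posterior (mean)
        self.logvar = nn.Linear(hidden, num_topics)  # document posterior (log-variance)
        self.dec = nn.Linear(num_topics, vocab_size) # decoder: softmax over vocabulary

    def forward(self, bow):
        h = F.relu(self.enc(bow))                    # input: term-counts vector
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sampled representation
        return F.log_softmax(self.dec(z), dim=-1), mu, logvar # output term distribution

model = MiniNTM(vocab_size=5, num_topics=2)
bow = torch.tensor([[1.0, 3.0, 2.0, 4.0, 0.0]])  # term counts for one document
recon, mu, logvar = model(bow)
print(recon.shape)  # torch.Size([1, 5])
```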
40. Time Series Forecasting (DeepAR)
• DeepAR is a supervised learning algorithm for forecasting
scalar time series using recurrent neural networks
(RNN)
• Classical forecasting methods fit one model to each
individual time series, and then use that model to
extrapolate the time series into the future
• In many applications you might have many similar time
series across a set of cross-sectional units
• For example, demand for different products, load of servers,
requests for web pages, and so on
• In this case, it can be beneficial to train a single model
jointly over all of these time series
• DeepAR takes this approach, training a single model to predict over a large set of (related) time series (the input format is sketched below)
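A sketch of the JSON Lines input format used by the SageMaker DeepAR algorithm: one JSON object per time series, with a start timestamp and the target values; the optional "cat" field marks the category of each series. The file name and values are illustrative.

```python
import json

series = [
    {"start": "2018-01-01 00:00:00", "target": [5.0, 5.5, 6.1, 5.9], "cat": [0]},
    {"start": "2018-01-01 00:00:00", "target": [1.0, 1.2, 0.9, 1.4], "cat": [1]},
]

with open("train.json", "w") as f:
    for s in series:
        f.write(json.dumps(s) + "\n")  # one related series per line
```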
43. Word2vec ⇾ Word Embedding
• Continuous Bag-Of-Words (CBOW): predict a word given its context
• Skip-Gram with Negative Sampling (SGNS): predict the context given a word
(a minimal sketch of both modes follows)
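A minimal gensim Word2vec sketch (gensim 4.x API assumed; the toy corpus is illustrative). The sg flag switches between the two modes: sg=0 selects CBOW, sg=1 selects skip-gram; negative=5 enables negative sampling.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "play"],
]

# Skip-gram with negative sampling (set sg=0 for CBOW instead)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=50, seed=0)
print(model.wv.most_similar("cat", topn=3))
```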
49. Our Customers use ML at a massive scale
“We collect 160M events
daily in the ML pipeline and
run training over the last 15
days and need it to
complete in one hour.
Effectively there's 100M
features in the model.”
Valentino Volonghi, CTO
“We process 3 million ad
requests a second,
100,000 features per
request. That’s 250 trillion
per day. Not your run of
the mill Data science
problem!”
Bill Simmons, CTO
“Our data warehouse is
100TB and we are
processing 2TB daily.
We're running mostly
gradient boosting (trees),
LDA and K-Means
clustering and collaborative
filtering.”
Shahar Cizer Kobrinsky,
VP Architecture
64. Amazon SageMaker
• Hosted Jupyter notebooks that
require no setup, so that you can
start processing your training
dataset and developing your
algorithms immediately
• One-click, on-demand distributed
training that sets up and tears
down the cluster after training.
• Built-in, high-performance ML algorithms, re-engineered for greater speed, accuracy, and data throughput
(Diagram: Exploration ⇾ Training ⇾ Hosting)
65. Amazon SageMaker
• Built-in model tuning (hyperparameter optimization) that can automatically explore hundreds of different combinations of algorithm parameters
• An elastic, secure, and scalable environment to host your models, with one-click deployment (a training-and-hosting sketch follows)
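As referenced above, a hypothetical sketch of training and hosting a built-in algorithm with the SageMaker Python SDK (v2-style API); the IAM role, S3 paths, and instance types are placeholders to adapt:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role ARN

# Resolve the container image for the built-in Linear Learner algorithm
image = image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(image_uri=image,
                      role=role,
                      instance_count=1,
                      instance_type="ml.m5.xlarge",
                      sagemaker_session=session)
estimator.set_hyperparameters(predictor_type="binary_classifier")

# Launch the training job; the cluster is set up and torn down automatically
estimator.fit({"train": "s3://my-bucket/linear/train"})  # placeholder S3 path

# Deploy the trained model to a hosted endpoint
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")
```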
67. Automatic Model Tuning
Run a large set of training jobs with varying hyperparameters... and search the hyperparameter space for improved accuracy (a tuning sketch follows).
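A hedged sketch of automatic model tuning with the SageMaker Python SDK (v2-style API); it reuses the estimator from the previous sketch, and the metric name and ranges are illustrative Linear Learner settings:

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=estimator,  # estimator from the previous sketch
    objective_metric_name="validation:objective_loss",
    objective_type="Minimize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-4, 1e-1),
        "l1": ContinuousParameter(1e-7, 1.0),
    },
    max_jobs=20,          # total training jobs to run
    max_parallel_jobs=4,  # jobs to run at a time
)

# Search the hyperparameter space across many training jobs
tuner.fit({"train": "s3://my-bucket/linear/train"})  # placeholder S3 path
print(tuner.best_training_job())
```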
70. And Then There Are (Built-in) Algorithms

Algorithm                          | Scope                                                        | Pipe Input Mode
Linear Learner                     | classification, regression                                   | Y
Factorization Machines             | classification, regression, sparse datasets                  | Y
XGBoost                            | regression, classification (binary and multiclass), ranking |
Image Classification               | CNNs (ResNet, DenseNet, Inception)                           |
Sequence to Sequence (seq2seq)     | translation, text summarization, speech-to-text (RNNs, CNN)  |
K-Means Clustering                 | clustering, unsupervised                                     | Y
Principal Component Analysis (PCA) | dimensionality reduction, unsupervised                       | Y
Latent Dirichlet Allocation (LDA)  | topic modeling, unsupervised                                 | Y
Neural Topic Model (NTM)           | topic modeling, unsupervised                                 | Y
Time Series Forecasting (DeepAR)   | time series forecasting (RNN)                                |
BlazingText (Word2vec)             | word embeddings                                              |
Random Cut Forest (RCF)            | anomaly detection                                            | Y
71. And Then There Are Algorithms
Danilo Poccia
Evangelist, Serverless
danilop@amazon.com
@danilop
danilop