What are algorithms? How can I build a machine learning model? In machine learning, training large models on massive amounts of data usually improves results. Our customers report, however, that training and deploying such models is either operationally prohibitive or outright impossible for them. At Amazon, we created a collection of machine learning algorithms that scale to any amount of data, including k-means clustering for data segmentation, factorization machines for recommendations, and time-series forecasting. This talk covers those algorithms, where and how they can be used, and the design choices behind them.
And Then There Are Algorithms
1. And Then There Are Algorithms
Danilo Poccia
Evangelist, Serverless
danilop@amazon.com
@danilop
danilop
2. Letter from Ada Lovelace to Charles Babbage, 1843
In this letter, Lovelace suggests an example of a calculation which “may be worked out by the engine without having been worked out by human head and hands first”.
6. What is an Algorithm?
Euclid’s algorithm for the GCD of two numbers (flowchart by Somepics, CC BY-SA 4.0, via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Euclid_flowchart.svg)
Worked trace for A = 12, B = 18 (a minimal implementation follows):
A  | B
12 | 18
12 | 6
6  | 6
6  | 0
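A minimal Python sketch of the subtraction-based variant shown in the flowchart, assuming positive integer inputs:

```python
def gcd(a, b):
    """Euclid's subtraction-based GCD, matching the flowchart's trace."""
    while b != 0:
        if a > b:
            a -= b   # replace the larger value with the difference
        else:
            b -= a
    return a

print(gcd(12, 18))  # -> 6, matching the trace above
```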
7. What is an Algorithm?
“You use code to tell a computer what to do. Before you write code you need an algorithm. An algorithm is a list of rules to follow in order to solve a problem.”
BBC Bitesize
8. The Master Algorithm
“The future belongs to those who understand at a very deep level how to combine their unique expertise with what algorithms do best.”
Pedro Domingos
17. Stochastic Gradient Descent (SGD)
https://en.wikipedia.org/wiki/Himmelblau's_function
Global vs. local minimum (figure: the surface of Himmelblau’s function; a descent sketch follows)
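As a hedged illustration of the idea (not code from the talk), here is plain gradient descent on Himmelblau’s function; which minimum it reaches depends on the starting point. True SGD additionally estimates the gradient from random samples of the training data.

```python
# Himmelblau's function: f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2
import numpy as np

def grad(p):
    x, y = p
    a = x ** 2 + y - 11
    b = x + y ** 2 - 7
    # Partial derivatives of f with respect to x and y
    return np.array([4 * a * x + 2 * b, 2 * a + 4 * b * y])

for start in (np.array([0.0, 0.0]), np.array([-3.0, 3.0])):
    p = start.copy()
    for _ in range(2000):
        p -= 0.01 * grad(p)  # fixed learning rate, full gradient
    print(start, "->", p.round(3))  # different starts reach different minima
```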
18. Factorization Machines
• A factorization machine is an extension of a linear model, designed to parsimoniously capture interactions between features within high-dimensional sparse datasets
• Factorization machines are a good choice for tasks such as click prediction and item recommendation (see the sketch after this list)
• They are usually trained by stochastic gradient descent (SGD), alternating least squares (ALS), or Markov chain Monte Carlo (MCMC)
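An illustrative numpy sketch (not the talk’s code) of the second-order factorization machine prediction, computed with the standard O(nk) reformulation of the pairwise term:

```python
# y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j
# where the pairwise sum equals
#   0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
import numpy as np

def fm_predict(x, w0, w, V):
    """x: (n,) features; w0: bias; w: (n,) linear weights; V: (n, k) latent factors."""
    s = V.T @ x                  # (k,) sums of V[i,f] * x_i
    s2 = (V ** 2).T @ (x ** 2)   # (k,) sums of V[i,f]^2 * x_i^2
    pairwise = 0.5 * np.sum(s ** 2 - s2)
    return w0 + w @ x + pairwise

rng = np.random.default_rng(0)
n, k = 6, 3  # toy sizes
print(fm_predict(rng.random(n), 0.1, rng.random(n), rng.random((n, k))))
```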
23. XGBoost
• Ensemble methods use multiple learning algorithms
to improve predictions
• Boosting: “Can a set of weak learners create a single
strong learner?”
• Gradient Boosting: using gradient descent over a
function space
• eXtreme Gradient Boosting
• https://github.com/dmlc/xgboost
• Supports regression, classification, ranking, and user-defined objectives (a minimal regression sketch follows)
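A minimal XGBoost regression sketch; it assumes the xgboost and scikit-learn packages are installed, and the synthetic dataset and parameters are illustrative only:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic regression data: linear signal plus noise
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Gradient-boosted trees with a squared-error objective
params = {"objective": "reg:squarederror", "max_depth": 4, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dtest, "test")], verbose_eval=False)
print(booster.predict(dtest)[:5])
```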
27. Convolutional Neural Networks (CNNs)
By Debarko De @debarko
https://hackernoon.com/what-is-a-capsnet-or-capsule-network-2bfbe48769cc
28. Sequence to Sequence (seq2seq)
• seq2seq is a supervised learning algorithm where the
input is a sequence of tokens (for example, text,
audio) and the output generated is another sequence
of tokens.
• Example applications include:
• machine translation (input a sentence from
one language and predict what that sentence
would be in another language)
• text summarization (input a longer string of
words and predict a shorter string of words
that is a summary)
• speech-to-text (audio clips converted into
output sentences in tokens).
29. Sequence to Sequence (seq2seq)
• Recently, problems in this domain have been
successfully modeled with deep neural networks
that show a significant performance boost over
previous methodologies.
• Amazon released the open-source Sockeye package, which implements encoder-decoder architectures with attention, using Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models.
• https://github.com/awslabs/sockeye
30. Sequence to Sequence (seq2seq)
https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
31. Sequence to Sequence (seq2seq)
https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
“Das grüne Haus” ⇾ “the Green House”
34. Principal Component Analysis (PCA)
• PCA is an unsupervised learning algorithm that
attempts to reduce the dimensionality (number of
features) within a dataset while still retaining as
much information as possible
• This is done by finding a new set of features called
components, which are composites of the original
features that are uncorrelated with one another
• They are also constrained so that the first component accounts for the largest possible variability in the data, the second component for the second most variability, and so on (a toy sketch follows)
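A toy PCA sketch using scikit-learn (assumed available); the synthetic data is built so that most of its variance lies in two directions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples, 5 features, generated from 2 latent directions plus small noise
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + rng.normal(scale=0.05, size=(200, 5))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)               # uncorrelated composite components
print(pca.explained_variance_ratio_)   # first component explains the most variance
```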
38. Latent Dirichlet Allocation (LDA)
• As an extremely simple example, given a set of documents where the only words that occur within them are eat, sleep, play, meow, and bark, LDA might produce topics like the following (a runnable sketch follows):

Topic           | eat | sleep | play | meow | bark
Topic 1 (cats?) | 0.1 | 0.3   | 0.2  | 0.4  | 0.0
Topic 2 (dogs?) | 0.2 | 0.1   | 0.4  | 0.0  | 0.3
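A hypothetical scikit-learn sketch of LDA on tiny “cat” and “dog” documents; the corpus, vocabulary, and topic count are illustrative assumptions (recent scikit-learn assumed):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "meow sleep eat meow play",
    "bark play eat bark play",
    "meow meow sleep play",
    "bark eat play sleep",
]
vectorizer = CountVectorizer().fit(docs)
X = vectorizer.transform(docs)  # term-count matrix, documents x vocabulary

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    dist = weights / weights.sum()  # normalize to a per-topic word distribution
    print(f"Topic {t + 1}:", dict(zip(vocab, dist.round(2))))
```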
39. Neural Topic Model (NTM)
Architecture (diagram): input term-counts vector ⇾ encoder (feedforward net) ⇾ document posterior ⇾ sampled document representation ⇾ decoder (softmax) ⇾ output term-counts vector (a toy sketch follows)
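A toy PyTorch sketch of this VAE-style encoder-decoder shape; this is a hypothetical minimal model for illustration, not SageMaker’s NTM implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniNTM(nn.Module):
    def __init__(self, vocab_size, num_topics, hidden=100):
        super().__init__()
        self.enc = nn.Linear(vocab_size, hidden)     # encoder: feedforward net
        self.mu = nn.Linear(hidden, num_topics)      # document posterior (mean)
        self.logvar = nn.Linear(hidden, num_topics)  # document posterior (log-variance)
        self.dec = nn.Linear(num_topics, vocab_size) # decoder: softmax over vocabulary

    def forward(self, bow):
        h = F.relu(self.enc(bow))                    # input: term-counts vector
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sampled representation
        return F.log_softmax(self.dec(z), dim=-1), mu, logvar # output term distribution

model = MiniNTM(vocab_size=5, num_topics=2)
bow = torch.tensor([[1.0, 3.0, 2.0, 4.0, 0.0]])  # term counts for one document
recon, mu, logvar = model(bow)
print(recon.shape)  # torch.Size([1, 5])
```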
40. Time Series Forecasting (DeepAR)
• DeepAR is a supervised learning algorithm for forecasting
scalar time series using recurrent neural networks
(RNN)
• Classical forecasting methods fit one model to each
individual time series, and then use that model to
extrapolate the time series into the future
• In many applications you might have many similar time
series across a set of cross-sectional units
• For example, demand for different products, load of servers,
requests for web pages, and so on
• In this case, it can be beneficial to train a single model
jointly over all of these time series
• DeepAR takes this approach, training a single model to predict over a large set of (related) time series (the input format is sketched below)
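A sketch of the JSON Lines input format used by the SageMaker DeepAR algorithm: one JSON object per time series, with a start timestamp and the target values; the optional "cat" field marks the category of each series. The file name and values are illustrative.

```python
import json

series = [
    {"start": "2018-01-01 00:00:00", "target": [5.0, 5.5, 6.1, 5.9], "cat": [0]},
    {"start": "2018-01-01 00:00:00", "target": [1.0, 1.2, 0.9, 1.4], "cat": [1]},
]

with open("train.json", "w") as f:
    for s in series:
        f.write(json.dumps(s) + "\n")  # one related series per line
```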
43. Word2vec ⇾ Word Embedding
• Continuous Bag-Of-Words (CBOW): predict a word given its context
• Skip-Gram with Negative Sampling (SGNS): predict the context given a word
(a minimal sketch of both modes follows)
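A minimal gensim Word2vec sketch (gensim 4.x API assumed; the toy corpus is illustrative). The sg flag switches between the two modes: sg=0 selects CBOW, sg=1 selects skip-gram; negative=5 enables negative sampling.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "play"],
]

# Skip-gram with negative sampling (set sg=0 for CBOW instead)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=50, seed=0)
print(model.wv.most_similar("cat", topn=3))
```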
49. Our Customers use ML at a massive scale
“We collect 160M events
daily in the ML pipeline and
run training over the last 15
days and need it to
complete in one hour.
Effectively there's 100M
features in the model.”
Valentino Volonghi, CTO
“We process 3 million ad
requests a second,
100,000 features per
request. That’s 250 trillion
per day. Not your run of
the mill Data science
problem!”
Bill Simmons, CTO
“Our data warehouse is
100TB and we are
processing 2TB daily.
We're running mostly
gradient boosting (trees),
LDA and K-Means
clustering and collaborative
filtering.”
Shahar Cizer Kobrinsky,
VP Architecture
64. Amazon SageMaker
• Hosted Jupyter notebooks that
require no setup, so that you can
start processing your training
dataset and developing your
algorithms immediately
• One-click, on-demand distributed
training that sets up and tears
down the cluster after training.
• Built-in, high-performance ML algorithms, re-engineered for greater speed, accuracy, and data throughput
(Diagram: Exploration ⇾ Training ⇾ Hosting)
65. Amazon SageMaker
• Built-in model tuning (hyperparameter optimization) that can automatically explore hundreds of different combinations of algorithm parameters
• An elastic, secure, and scalable environment to host your models, with one-click deployment (a training-and-hosting sketch follows)
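As referenced above, a hypothetical sketch of training and hosting a built-in algorithm with the SageMaker Python SDK (v2-style API); the IAM role, S3 paths, and instance types are placeholders to adapt:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role ARN

# Resolve the container image for the built-in Linear Learner algorithm
image = image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(image_uri=image,
                      role=role,
                      instance_count=1,
                      instance_type="ml.m5.xlarge",
                      sagemaker_session=session)
estimator.set_hyperparameters(predictor_type="binary_classifier")

# Launch the training job; the cluster is set up and torn down automatically
estimator.fit({"train": "s3://my-bucket/linear/train"})  # placeholder S3 path

# Deploy the trained model to a hosted endpoint
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")
```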
67. Automatic Model Tuning
Run a large set of training jobs with varying hyperparameters... and search the hyperparameter space for improved accuracy (a tuning sketch follows).
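A hedged sketch of automatic model tuning with the SageMaker Python SDK (v2-style API); it reuses the estimator from the previous sketch, and the metric name and ranges are illustrative Linear Learner settings:

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=estimator,  # estimator from the previous sketch
    objective_metric_name="validation:objective_loss",
    objective_type="Minimize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-4, 1e-1),
        "l1": ContinuousParameter(1e-7, 1.0),
    },
    max_jobs=20,          # total training jobs to run
    max_parallel_jobs=4,  # jobs to run at a time
)

# Search the hyperparameter space across many training jobs
tuner.fit({"train": "s3://my-bucket/linear/train"})  # placeholder S3 path
print(tuner.best_training_job())
```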
70. And Then There Are (Built-in) Algorithms

Algorithm                          | Scope                                                        | Pipe Input Mode
Linear Learner                     | classification, regression                                   | Y
Factorization Machines             | classification, regression, sparse datasets                  | Y
XGBoost                            | regression, classification (binary and multiclass), ranking |
Image Classification               | CNNs (ResNet, DenseNet, Inception)                           |
Sequence to Sequence (seq2seq)     | translation, text summarization, speech-to-text (RNNs, CNN)  |
K-Means Clustering                 | clustering, unsupervised                                     | Y
Principal Component Analysis (PCA) | dimensionality reduction, unsupervised                       | Y
Latent Dirichlet Allocation (LDA)  | topic modeling, unsupervised                                 | Y
Neural Topic Model (NTM)           | topic modeling, unsupervised                                 | Y
Time Series Forecasting (DeepAR)   | time series forecasting (RNN)                                |
BlazingText (Word2vec)             | word embeddings                                              |
Random Cut Forest (RCF)            | anomaly detection                                            | Y
71. And Then There Are Algorithms
Danilo Poccia
Evangelist, Serverless
danilop@amazon.com
@danilop
danilop