Day 2 (Lecture 1): Introduction to Statistical Machine Learning and Applications

We Build Together
Shakir Mohamed
Research Scientist, DeepMind
Lead, Deep Learning Indaba
@shakir_za shakir@deepmind.com
IndabaXGhana April 2019

2017
Strengthen Machine Learning and
Artificial Intelligence in Africa

2018
Masakhane!
A partnership of a community
determined to take responsibility for
its own upliftment.

IndabaX
Building Local
Leadership in ML and AI
across our Continent

Masakhane
We Build Together

Statistical Machine Learning
from Principles to Practice

Principles to Products
Principles: Probability Theory, Bayesian Analysis, Hypothesis Testing, Estimation Theory, Asymptotics
Uncertainty, Information Gain, Causality, Information, Prediction, Planning, Explanation, Rapid Learning, World Simulation, Objects and Relations, Reasoning
Applications: Advancing Science, Assistive Technology, Climate and Energy, Healthcare, Fairness and Safety, Autonomous systems

Probability
Some definitions of probability
Probability is sufficient for the task of reasoning under uncertainty.
• Statistical Probability: Frequency ratio of items.
• Subjective Probability: Probability as a degree of belief.
• Logical Probability: Degree of confirmation of a hypothesis based on logical analysis.
• Probability as Propensity: Probability used for predictions.

Statistical Operations
Modelling
Estimation and Learning
Hypothesis Testing
Experimental Design
Data
Enumeration
Summarisation
Comparison
Inference

Probabilistic Models
Model: Description of the world, of data, of
potential scenarios, of processes.
Most models in machine learning
are probabilistic.
Probabilistic models let you learn
probability distributions of data.
[Figure: graphical model with nodes Peak hour, Bad Weather, Accident, Traffic Jam, Sirens]
Example queries: prob(Traffic Jam), prob(Sirens | Accident), prob(Peak hour | Traffic Jam)
You can choose what to learn: Just
the mean. Or the entire distribution.
A probabilistic model writes out these models
using the language of probability

Centrality of Inference
The core questions of AI will be
those of probabilistic inference
Artificial Intelligence will be the refined
instantiation of these statistical
operations.
Data
Enumeration
Summarisation
Comparison
Inference

Inference and Decision-making
We have many of the tools needed to build plausible reasoning systems:
1. Flexible ways of building rich probabilistic models.
2. Ability to learn, make consistent inferences and maintain beliefs.
3. Reason about potential outcomes and take actions.
Inference: What we can know about our data.
Decision-making: What we can do with our data.

Linear Regression
Generalised Linear Regression
$\eta = w^\top x + b$
$p(y \mid x) = p(y \mid g(\eta); \theta)$
• $g(\cdot)$ is an inverse link function that we'll refer to as an activation function.
• The basic function can be any linear function, e.g., affine, convolution.
Optimise the negative log-likelihood
$\mathcal{L} = -\log p(y \mid g(\eta); \theta)$

Table 1: Correspondence between link and activation functions in generalised regression.
Target | Regression | Link | Inverse link | Activation
Real | Linear | Identity | Identity |
Binary | Logistic | Logit: $\log\frac{\mu}{1-\mu}$ | Sigmoid: $\frac{1}{1+\exp(-\eta)}$ | Sigmoid
Binary | Probit | Inverse Gaussian CDF: $\Phi^{-1}(\mu)$ | Gaussian CDF: $\Phi(\eta)$ | Probit
Binary | Gumbel | Complementary log-log: $\log(-\log(\mu))$ | Gumbel CDF: $e^{-e^{-\eta}}$ |
Binary | Logistic | | Hyperbolic tangent: $\tanh(\eta)$ | Tanh
Categorical | Multinomial | | Multinomial logit: $\frac{\exp(\eta_i)}{\sum_j \exp(\eta_j)}$ | Softmax
Counts | Poisson | $\log(\mu)$ | $\exp(\eta)$ |
Counts | Poisson | $\sqrt{\mu}$ | $\eta^2$ |
Non-negative | Gamma | Reciprocal: $\frac{1}{\mu}$ | $\frac{1}{\eta}$ |
Sparse | Tobit | | max: $\max(0; \eta)$ | ReLU
Ordered | Ordinal | | Cumulative logit: $\sigma(\phi_k - \eta)$ |

[Figure: $x \to \eta = Bx \to g(\cdot) \to \mathbb{E}[y]$]
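As a concrete instance of the recipe on this slide, here is a minimal sketch of logistic regression: a linear predictor, a sigmoid inverse link, and gradient descent on the Bernoulli negative log-likelihood. The synthetic data, the learning rate, the step count and the function names are assumptions made for illustration, not part of the lecture.

```python
import numpy as np


def sigmoid(eta):
    """Inverse link (activation) for binary targets."""
    return 1.0 / (1.0 + np.exp(-eta))


def nll(w, b, X, y):
    """Negative log-likelihood of Bernoulli targets y with mean sigmoid(Xw + b)."""
    mu = sigmoid(X @ w + b)
    eps = 1e-12  # guards against log(0)
    return -np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))


def fit(X, y, lr=0.1, steps=2000):
    """Plain gradient descent on the mean negative log-likelihood."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        mu = sigmoid(X @ w + b)
        # For the canonical link the gradient is (prediction - target) times input.
        w -= lr * X.T @ (mu - y) / len(y)
        b -= lr * np.mean(mu - y)
    return w, b


# Synthetic data from a known (hypothetical) parameter setting.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
w_true, b_true = np.array([2.0, -1.0]), 0.5
y = (rng.uniform(size=500) < sigmoid(X @ w_true + b_true)).astype(float)

w_hat, b_hat = fit(X, y)
nll_start = nll(np.zeros(2), 0.0, X, y)
nll_end = nll(w_hat, b_hat, X, y)
```

Swapping `sigmoid` for another inverse link from Table 1, together with the matching likelihood, gives the other regression models in the same way.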

Deep Networks
Recursive Generalised Linear Regression
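A general, flexible framework for building
non-linear, parametric models
• Recursively compose the basic linear functions.
• Gives a deep neural network.
$\mathbb{E}[y] = h_L \circ \cdots \circ h_l \circ \cdots \circ h_0(x)$
[Figure: stacked layers, $\eta_1 = B x_1 \to g(\cdot) \to \cdots \to \eta_l = B x_l \to g(\cdot) \to \mathbb{E}[y]$]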
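The recursive composition can be sketched in a few lines: each layer is a generalised linear step, an affine map followed by an activation, and the network is those steps applied in sequence. The layer sizes, the random weights, the bias terms and the choice of tanh and sigmoid activations are illustrative assumptions.

```python
import numpy as np


def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))


def layer(x, B, b, g):
    """One generalised linear step: h(x) = g(Bx + b)."""
    return g(B @ x + b)


def network(x, params, activations):
    """Compose the layers from h_0 (first applied) up to h_L (last applied)."""
    for (B, b), g in zip(params, activations):
        x = layer(x, B, b, g)
    return x


rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]  # input, two hidden layers, one output (hypothetical)
params = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
activations = [np.tanh, np.tanh, sigmoid]

x = rng.normal(size=3)
mean_y = network(x, params, activations)  # E[y] for a binary target

# The same value written as an explicit nested composition.
(B0, b0), (B1, b1), (B2, b2) = params
nested = sigmoid(B2 @ np.tanh(B1 @ np.tanh(B0 @ x + b0) + b1) + b2)
```

With a single layer and a sigmoid activation this reduces to the logistic regression model of the previous slide.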

Estimation Theory
Maximum Likelihood
Probabilistic Model, Likelihood function, Optimisation Objective
• Straightforward and natural way to learn parameters.
• Can be biased in finite samples, e.g., the maximum likelihood estimate of a Gaussian variance divides by N rather than N-1.
• Easy to observe overfitting of parameters.
[Figure: stacked layers, $\eta_1 = B x_1 \to g(\cdot) \to \cdots \to \eta_l = B x_l \to g(\cdot) \to \mathbb{E}[y]$]
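The finite-sample bias of the Gaussian variance estimate can be checked by simulation. This is a minimal sketch: the true variance, the sample size N and the number of repetitions are arbitrary choices. In theory the maximum likelihood estimate has expectation (N-1)/N times the true variance, so with N = 5 it underestimates by a fifth on average.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
N = 5              # small samples make the bias visible
trials = 100_000   # number of repeated experiments

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, N))

# Maximum likelihood estimate: divide the sum of squares by N.
var_mle = samples.var(axis=1, ddof=0)
# Unbiased estimate: divide by N - 1 instead.
var_unbiased = samples.var(axis=1, ddof=1)

mean_mle = var_mle.mean()            # close to (N - 1) / N * true_var
mean_unbiased = var_unbiased.mean()  # close to true_var
```

The two estimators differ only by the factor N/(N-1), so the gap vanishes as N grows.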

Bayesian Analysis
Issues arise as a consequence of:
• Reasoning only about the most likely solution, and
• Not maintaining knowledge of the underlying variability (and averaging over this).
Pragmatic Bayesian Approach for
Probabilistic Reasoning in Deep Networks.
(and all of machine learning)
Bayesian reasoning over some, but not all parts of our models (yet).
Motivates learning more than the mean. This
is the core of a Bayesian philosophy.

Two Streams of Machine Learning
Bayesian Reasoning
+ Unified framework for model building, inference, prediction and decision making
+ Explicit accounting for uncertainty and variability of outcomes
+ Robust to overfitting; tools for model selection and composition.
- Mainly conjugate and linear models
- Potentially intractable inference, computationally expensive or long simulation time.
Deep Learning
+ Rich non-linear models for classification and sequence prediction.
+ Scalable learning using stochastic approximation and conceptually simple.
+ Easily composable with other gradient-based methods
- Only point estimates
- Hard to score models, do selection and complexity penalisation.
Natural to consider the marriage of these approaches: Bayesian Deep Learning

Regression and Classification
Probabilistic models over functions
• Make predictions of the future based on past correlations.
• Ways of learning distributions over functions and maintaining uncertainty over functions.
• Many ways to learn the posterior distribution.
[Figure: prior, observation model and posterior over functions, with target y]
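One of the simplest cases in which prior, observation model and posterior are all available in closed form is Bayesian linear regression. The sketch below assumes a zero-mean Gaussian prior on the weights with precision `alpha` and Gaussian observation noise with precision `beta`; these names, the synthetic data and the straight-line feature map are illustrative assumptions, not taken from the slides.

```python
import numpy as np


def posterior(Phi, y, alpha, beta):
    """Gaussian posterior N(m, S) over weights.

    Prior: w ~ N(0, I / alpha). Observation model: y ~ N(Phi w, I / beta).
    """
    d = Phi.shape[1]
    S = np.linalg.inv(alpha * np.eye(d) + beta * Phi.T @ Phi)
    m = beta * S @ Phi.T @ y
    return m, S


def predict(Phi_new, m, S, beta):
    """Predictive mean and variance at new inputs."""
    mean = Phi_new @ m
    # Noise variance plus the variance due to uncertainty in the weights.
    var = 1.0 / beta + np.sum(Phi_new @ S * Phi_new, axis=1)
    return mean, var


def features(x):
    """Feature map for a straight line: [1, x]."""
    return np.column_stack([np.ones_like(x), x])


rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
w_true = np.array([0.5, -2.0])  # hypothetical ground truth
noise_std = 0.2
y = features(x) @ w_true + rng.normal(scale=noise_std, size=50)

alpha, beta = 1.0, 1.0 / noise_std**2
m, S = posterior(features(x), y, alpha, beta)

# Predict inside the data range (x = 0) and far outside it (x = 3).
pred_mean, pred_var = predict(features(np.array([0.0, 3.0])), m, S, beta)
```

The predictive variance is larger at x = 3 than at x = 0, which is the uncertainty over functions that a single point estimate of the weights would not carry.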

Density Estimation
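Learn probability distributions over the data itself
Factor Analysis / PCA
$z \sim \mathcal{N}(z \mid \mu, \Sigma)$
$y \sim \mathcal{N}(y \mid Wz, \sigma_y^2 I)$
[Figure: graphical model with latent z, observed y, weights W, parameters μ, Σ, and a plate over n = 1, …, N]
• How can you learn from data without any labels? By learning the structure of the data.
• Deep Generative Models and Unsupervised learning.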
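To make the factor analysis model concrete, the following sketch samples from it and compares the empirical covariance of y with the covariance the model implies, W Σ Wᵀ + σ_y² I. The dimensions, the random loading matrix and the noise level are hypothetical, and a zero latent mean with identity latent covariance is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N = 2, 4, 100_000   # latent dimension, data dimension, sample count

mu = np.zeros(K)          # latent mean (assumed zero)
Sigma = np.eye(K)         # latent covariance (assumed identity)
W = rng.normal(size=(D, K))
sigma_y = 0.5

# Ancestral sampling: draw z first, then y given z.
z = rng.multivariate_normal(mu, Sigma, size=N)       # shape (N, K)
y = z @ W.T + sigma_y * rng.normal(size=(N, D))      # shape (N, D)

# Marginal covariance of y implied by the model.
cov_model = W @ Sigma @ W.T + sigma_y**2 * np.eye(D)
cov_empirical = np.cov(y, rowvar=False)
max_abs_error = np.max(np.abs(cov_model - cov_empirical))
```

Learning runs in the opposite direction: given only samples of y, estimate W and σ_y so that the implied covariance matches the data.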

Decision-making
Probabilistic models of environments and actions
Setup is common in experimental design, causal learning, reinforcement learning.
[Figure: interaction loop in which the Decision-maker sends an Action to the External Environment and receives an Observation/Sensation]
• Prior over actions
• Interaction only
• Reward/Utility

Applications
Products: Super-resolution, Compression, Text-to-speech
Science: Proteomics, Drug Discovery, Astronomy, High-energy physics
AI: Planning, Exploration, Intrinsic motivation, Model-based RL

Machine Translation

Reducing
Energy Consumption

Compression-Communication
Compression rate: 0.2 bits/dimension
[Figure: image panels labelled Original, JPEG-2000, JPEG, VAE1, VAE2]

Assistive Tools
Fully-observed conditional generative model

Assistive Tools

Creative Tools

Style Transfer

Stellar Initial Mass Functions
The distribution of star masses after a star
formation event within a specified volume
of space.
Can explore new models, like those that
simulate preferential attachment.
R.N. Bailey, Wikipedia
Cisewski-Kehe

Advancing Science

Electronic Health Records
Non-linear data
Sequential
representation

Medical Imaging

Molecular Structures
Proposing candidate molecules and improving prediction

Foundations
How will you approach your ML research and practice?
In general: Human-centred, interdisciplinary approach
Sun’s Phenomenological Levels: Sociological, Psychological, Componential, Physiological
For the ML Core: Probabilistic and pragmatic in approach
Model-Inference-Algorithm
Architecture-Loss

Architecture-Loss
1. Computational Graphs
X: Input
W¹: Weight, T¹: Times, B¹: Weight, P¹: Plus, S¹: Sigmoid
W²: Weight, T²: Times, B²: Weight, P²: Plus, O: Softmax
2. Error propagation
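The two steps can be sketched for the graph on this slide: a forward pass through Times, Plus, Sigmoid, Times, Plus, Softmax, followed by error propagation back through the same nodes in reverse order. The cross-entropy loss, the layer sizes and the random initial values are assumptions for illustration; the slide does not specify them.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def softmax(a):
    e = np.exp(a - a.max())  # subtract the max for numerical stability
    return e / e.sum()


def forward(params, x):
    """Evaluate the graph node by node, in the order named on the slide."""
    W1, B1, W2, B2 = params
    t1 = W1 @ x        # T1: Times
    p1 = t1 + B1       # P1: Plus
    s1 = sigmoid(p1)   # S1: Sigmoid
    t2 = W2 @ s1       # T2: Times
    p2 = t2 + B2       # P2: Plus
    o = softmax(p2)    # O: Softmax
    return o, s1


def loss(params, x, label):
    """Cross-entropy of the softmax output against an integer class label."""
    o, _ = forward(params, x)
    return -np.log(o[label])


def backward(params, x, label):
    """Propagate the error from the output back to every parameter."""
    W1, B1, W2, B2 = params
    o, s1 = forward(params, x)
    d_p2 = o.copy()
    d_p2[label] -= 1.0                 # gradient of softmax + cross-entropy
    d_W2 = np.outer(d_p2, s1)          # through T2
    d_B2 = d_p2                        # through P2
    d_s1 = W2.T @ d_p2
    d_p1 = d_s1 * s1 * (1.0 - s1)      # through S1
    d_W1 = np.outer(d_p1, x)           # through T1
    d_B1 = d_p1                        # through P1
    return d_W1, d_B1, d_W2, d_B2


rng = np.random.default_rng(0)
x = rng.normal(size=3)
params = [rng.normal(size=(4, 3)), np.zeros(4),
          rng.normal(size=(2, 4)), np.zeros(2)]
label = 1
grads = backward(params, x, label)

# Finite-difference check on a single entry of W1.
eps = 1e-6
bumped = [p.copy() for p in params]
bumped[0][0, 0] += eps
numeric = (loss(bumped, x, label) - loss(params, x, label)) / eps
analytic = grads[0][0, 0]
```

Automatic differentiation libraries build and traverse this kind of graph for you; writing it out by hand once shows what they compute.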

Model-Inference-Algorithm
1. Models
2. Learning Principles
3. Algorithms

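Models
Fully-observed and Latent Variable
Directed and Undirected
Parametric, Non-parametric and Semi-parametric
[Figure: graphical models with observed variables y1, y2, …, yD, latent variables z1, z2, …, zD, parameters μ, Σ, and a plate over n = 1, …, N]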

Learning Principles
Statistical Inference (Direct, Indirect)
• Laplace approximation
• Maximum Likelihood
• Maximum a posteriori
• Cavity Methods
• Integrated Nested Laplace Approximation
• Expectation Maximisation
• Markov chain Monte Carlo
• Variational Inference
• Sequential Monte Carlo
• Noise Contrastive
• Two Sample Comparison
• Transportation methods
• Approximate Bayesian Computation
• Method of Moments
• Max Mean Discrepancy

Algorithms
A given model and learning principle can be implemented in many ways.
Convolutional neural network + penalised maximum likelihood
• Optimisation methods (SGD, Adagrad)
• Regularisation (L1, L2, batchnorm, dropout)
Latent variable model + variational inference
• VEM algorithm
• Expectation propagation
• Approximate message passing
• Variational auto-encoders (VAE)
Restricted Boltzmann Machine + maximum likelihood
• Contrastive Divergence
• Persistent CD
• Parallel Tempering
• Natural gradients
Implicit Generative Model + Two-sample testing
• Unsupervised-as-supervised learning
• Approximate Bayesian Computation (ABC)
• Generative adversarial network (GAN)

Critical Practice for ML
Consider the uses of our algorithms.
What are the dual uses of generative models? How do we think critically
about these uses, and educate, regulate and co-design these tools?

Dual Uses and Value Alignment

Neutrality and Universality
Neutrality Traps
• The Portability Trap: Failure to understand how repurposing algorithmic solutions designed for one
social context may be inaccurate or do harm when applied to a different context.
• The Formalism Trap: Failure to account for the full meaning of social concepts such as fairness,
which cannot be resolved through mathematical formalisms.
• The Ripple Effect Trap: Failure to understand how the insertion of technology into an existing social
system changes the behaviours and embedded values of the pre-existing system.
• The Solutionism Trap: Failure to recognise the possibility that the best solution to a problem may not
involve technology.
Universality
‘A mono-cultural view of ethics conceives itself as the only valid one. In order to avoid this kind of ethical
chauvinism and colonialism it is necessary that transcultural ethics arise from an intercultural dialogue instead of
thinking of itself as universal without noticing its own cultural bias.’ Capurro, 2004

We Build Together
Statistical Machine
Learning from
Principles to Practice
