- Brian Reich gave a presentation on climate informatics and machine learning.
- He discussed different conceptual views of statistics, including parametric modeling, linear regression, inferential statistics, and machine learning.
- Reich provided examples of unsupervised learning techniques like principal component analysis (PCA) and supervised learning using deep neural networks.
- The presentation concluded with a challenge for the audience to build and evaluate a neural network model on simulated wildfire detection data.
2. Outline
- Conceptual views of statistics
- Definitions
- Example of unsupervised learning: PCA
- Example of supervised learning: Deep neural networks
- Data challenge!
3. Conceptual view of parametric statistics
- Last week, Prof Haran discussed an ice sheet study that was a great example of parametric modeling
- They assumed that the entire glacial system was known up to a few parameters
- The role of statistics is then to compare the observed data to the model output to refine our understanding of these parameters
- This delivers new scientific insights, at least within the current paradigm
4. Conceptual view of linear regression
- For more complex systems we will not know the system up to a few parameters
- We might then use a first-order approximation via linear regression
- If the approximation is obviously wrong, we might try to patch it up with, e.g., quadratic terms
- This is now an approximate system that is known up to a few parameters (slopes, variances, etc.)
- If we have a decent statistical representation of the system we can carefully interpret the parameters scientifically
5. Conceptual view of inferential statistics
- Rather than model the entire system, we can conduct experiments to learn about specific relationships
- For example, it is impossible to model the entire biological system that leads to cancer under different treatments
- Instead, we might conduct a randomized clinical trial to compare treatments
- This doesn't necessarily add to our biological understanding of cancer, but it is surely useful
6. Conceptual view of machine learning
- The premise of machine learning is that we can train an algorithm to mimic how a complex system operates without understanding the fundamental science
- Last week you called this a statistical emulator
- Often this requires a huge amount of training data
- The algorithm that mimics reality may be a black box that is not a function of parameters or equations we can interpret
- However, in many applications, prediction even without scientific understanding is powerful
7. Example: Short-term weather forecasting
- Physical scientist: study physics, chemistry, etc. and encode this in a mathematical model that takes the current state and projects it forward
- Data scientist: dump 100M observations of current meteorological variables into a deep learning algorithm and make statistical predictions
- What are the advantages and disadvantages of each approach?
8. When to use parametric stats vs machine learning
9. A made-up example
- https://satelliteliaisonblog.com/2017/03/03/using-goes-16-to-detect-wildfires/
- I made up some fake data in the R workspace trainingdata
- It is meant to represent taking a bunch of subregions and recording for each snapshot $i = 1, \ldots, 10\mathrm{K}$:
  - $Y_i$, a measure of fire in the region
  - $X_{ij}$, the gray scale of pixel $j = 1, \ldots, 100$
- Goal: predict the response given the image predictor
- A good spot for machine learning?
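A minimal R sketch of data with this shape (the actual trainingdata workspace is not reproduced here; the fire-response rule below is invented purely for illustration):

```r
# Hypothetical stand-in for the trainingdata workspace:
# 10K snapshots, each a 100-pixel gray-scale image
set.seed(1)
n <- 10000; p <- 100
X <- matrix(runif(n * p), n, p)                # X[i, j]: gray scale of pixel j
Y <- 10 * (rowSums(X > 0.99) >= 2) + rnorm(n)  # invented fire measure
```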
10. Flow chart of machine learning algorithms (from SAS)
11. Definitions - Unsupervised learning
- The data consist of several variables, but none is the "response"
- That is, we have $X$ but not $Y$
- Usually unsupervised methods try to identify the main patterns in the variables
- Clustering: put the $n$ observations into $L < n$ clusters
- Principal components analysis (PCA): explain the correlations between variables concisely
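Both methods are one-liners in R; a sketch on random data (the data and the choice of 3 clusters are arbitrary):

```r
# Sketch: the two unsupervised methods above, on random data
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)
km  <- kmeans(X, centers = 3)     # clustering: 3 clusters for 200 rows
pca <- prcomp(X, scale. = TRUE)   # PCA: summarize the correlations
table(km$cluster)                 # cluster sizes
summary(pca)                      # variance explained per component
```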
12. Definitions - Supervised learning
- The data consist of both independent variables $X$ and dependent variables $Y$
- The goal is to study the effects of $X$ on $Y$ and/or predict $Y|X$
- Regression is the obvious example, where we estimate $E(Y|X) = f(X)$
  - Examples: linear regression, trees, nets
- Classification is another example, where the data are from $Q$ unordered classes, $Y \in \{1, \ldots, Q\}$
- The goal is to assign $Y$ to a class based on $X$
  - Examples: logistic regression, support vector machines
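A sketch of one of each in R, on simulated data (the generating models are made up):

```r
# Regression: estimate E(Y|X) with least squares
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 1 + 2 * df$x + rnorm(100)
fit_reg <- lm(y ~ x, data = df)

# Classification: logistic regression for a binary class label
df$cls <- rbinom(100, 1, plogis(df$x))
fit_cls <- glm(cls ~ x, family = binomial, data = df)
```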
13. Principal components analysis (PCA)
- Say we have a collection of variables $X = (X_1, \ldots, X_p)^T$
- For simplicity, assume all $p$ variables are centered (mean zero) and scaled (variance one)
- We observe $n$ samples from their joint distribution, $X_1, \ldots, X_n$
- Linear relationships are summarized by the $p \times p$ sample correlation matrix,
$$S \approx \mathrm{Cor}(X)$$
- If $p = 1000$, $S$ is huge and hard to interpret
14. PCA
- The eigen decomposition of a matrix can be used to approximate the full matrix with a few vectors
- This dimension reduction highlights the most important trends
- Denote the eigen decomposition as $S = \Gamma \Lambda \Gamma^T$
- $\Gamma$'s columns, $\gamma_1, \ldots, \gamma_p$, are the $p$ orthonormal eigenvectors
- That is, $\gamma_j^T \gamma_k = 0$ for $j \neq k$, and thus the components $\gamma_j^T X$ and $\gamma_k^T X$ are uncorrelated
- $\Lambda$ is diagonal with eigenvalues $\lambda_1 \geq \ldots \geq \lambda_p$ on the diagonal
15. PCA
- The full eigen decomposition is
$$S = \sum_{l=1}^{p} \lambda_l \gamma_l \gamma_l^T$$
- The $\lambda_l$ are decreasing and so order the terms by importance
- The "best" approximation with $L < p$ terms is
$$S \approx \sum_{l=1}^{L} \lambda_l \gamma_l \gamma_l^T$$
- The proportion of $S$'s variance explained by $L$ terms is
$$\frac{\sum_{l=1}^{L} \lambda_l}{\sum_{l=1}^{p} \lambda_l}$$
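These quantities map directly onto R's eigen function; a sketch on random data:

```r
# Sketch: eigen decomposition and rank-L approximation of S
set.seed(1)
X <- matrix(rnorm(500 * 10), 500, 10)
S <- cor(X)                            # p x p sample correlation matrix
e <- eigen(S)                          # e$vectors = Gamma, e$values = lambdas
L <- 3
S_L <- e$vectors[, 1:L] %*% diag(e$values[1:L]) %*% t(e$vectors[, 1:L])
sum(e$values[1:L]) / sum(e$values)     # proportion of variance explained
```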
16. Using EOFs to study spatiotemporal trends
- PCA is referred to as empirical orthogonal functions (EOFs) in the climate literature
- As an example, say the $p$ variables are the daily values of precipitation at $p$ spatial locations
- The eigenvectors $\gamma_l$ are spatial surfaces
- They give uncorrelated weighted averages of the data, $Z_{lt} = \gamma_l^T X_t$
- A plot of $Z_{1t}$ by $t$ reveals spatiotemporal trends
- http://www4.stat.ncsu.edu/~reich/SpatialStats/code/EOF.html
17. PCA
- How would you find the first eigenvector if $p = 1{,}000{,}000$?
- http://www4.stat.ncsu.edu/~reich/SAMSI/PCA.html
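One standard answer (an assumption here; the linked page may take a different route) is power iteration, which only needs matrix-vector products with the $n \times p$ data matrix and never forms the $p \times p$ matrix $S$:

```r
# Sketch: power iteration for the first eigenvector of S = X^T X / (n-1),
# computed via S v = X^T (X v) so S itself is never built
power_iter_pc1 <- function(X, iters = 100) {
  X <- scale(X)                  # center and scale, as on the PCA slide
  v <- rnorm(ncol(X))
  for (i in 1:iters) {
    v <- crossprod(X, X %*% v)   # S v, up to a constant factor
    v <- v / sqrt(sum(v^2))      # renormalize each step
  }
  v                              # converges to the first eigenvector
}
```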
18. PC regression (PCR)
- PCA leads to an efficient high-dimensional regression model
- Linear regression:
$$E(Y|X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$$
- High-dimensional regression has $p > n$
- Least squares doesn't exist!
- Need some form of dimension reduction
- LASSO is one way, PCR is another
19. PC regression (PCR)
- In PCA, we reduced the dimension from $p$ correlated variables to $L \ll p$ uncorrelated variables, $Z_l = \gamma_l^T X$
- PCR uses the covariates $Z_1, \ldots, Z_L$ as predictors,
$$E(Y|X) = b + \sum_{l=1}^{L} Z_l w_l,$$
where the $w_l$ are fit via least squares
- This is still linear in $X$,
$$E(Y|X) = b + \sum_{l=1}^{L} \sum_{j=1}^{p} w_l \gamma_{lj} X_j,$$
but because we use the correlation of $X$, we only have to estimate $L < p$ parameters
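A sketch of PCR in R (the data and $L = 5$ are made up; prcomp's scores are exactly the $Z_l$):

```r
# Sketch: principal components regression with L components
set.seed(1)
n <- 200; p <- 50; L <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] + rnorm(n)                    # placeholder response
pca <- prcomp(X, center = TRUE, scale. = TRUE)
Z <- pca$x[, 1:L]                         # Z_l = gamma_l^T X, l = 1..L
fit <- lm(Y ~ Z)                          # b and w_l by least squares
```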
20. Nonparametric regression
- General set-up: $Y = f(X) + \varepsilon$
  - $Y$ is the continuous response
  - $f$ is the response surface or mean function
  - $\varepsilon$ are iid errors
- In linear regression, $f$ is assumed to be linear in $X$
- Life isn't linear
- If we want an algorithm to mimic life it can't be linear
- NP regression tries to model $f$ non-linearly
21. Nonparametric regression
- The response surface $f$ is a function
  - 1D: $f(X)$ is a curve
  - 2D: $f(X)$ is a surface
  - 3D: $f(X)$ is complicated
- What properties do we want in an algorithm to estimate $f$?
- We want it to be able to approximate any continuous function $f$ that takes a $p$-dimensional input and returns a univariate output
- If this is the case, then as the sample size goes to infinity we will be able to estimate any such system
22. Types of nonparametric regression
- Polynomial regression
- Generalized additive models (GAM)
- Gaussian process regression
- Regression trees
- Random forests
- ...
- Neural networks
23. Single-layer neural network (NN)
- Construct covariates (similar to PCR):
$$z_l = b_l^0 + \sum_{j=1}^{p} w_{lj}^0 X_j$$
- PCR: the weights $w_{lj}^0$ are determined by the sample covariance of $X$ and the mean is linear,
$$f(X) = \alpha + z_1 \beta_1 + \ldots + z_L \beta_L$$
- NN: the weights $w_{lj}^0$ (and biases $b_l^0$) are fit using least squares and the mean is non-linear,
$$f(X) = b^1 + \sigma(z_1) w_1^1 + \ldots + \sigma(z_L) w_L^1$$
- The activation function $\sigma$ is chosen by the user, e.g., $\sigma(x) = \exp(x)/[1 + \exp(x)]$ or $\sigma(x) = x_+$ (the positive part)
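The forward pass is a few lines of R; a sketch with made-up dimensions and random (unfitted) weights:

```r
# Sketch: forward pass of the single-layer NN above
sigma <- plogis                    # logistic activation exp(x)/(1+exp(x))
p <- 4; L <- 3
X  <- rnorm(p)                     # one input vector
W0 <- matrix(rnorm(L * p), L, p)   # weights w0[l, j]
b0 <- rnorm(L)                     # biases b0[l]
w1 <- rnorm(L); b1 <- rnorm(1)     # output weights and bias
z  <- b0 + W0 %*% X                # z_l = b0_l + sum_j w0_lj X_j
f  <- b1 + sum(w1 * sigma(z))      # f(X) = b1 + sum_l sigma(z_l) w1_l
```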
24. Analogy with brain activity
$$f(X) = b^1 + \sigma(z_1) w_1^1 + \ldots + \sigma(z_L) w_L^1, \qquad z_l = b_l^0 + \sum_{j=1}^{p} w_{lj}^0 X_j$$
- The response is the sum of activity from $L$ neurons
- $z_l$ measures the intensity on neuron $l$
- When the intensity gets high enough, it fires, and $\sigma(z_l) = 1$
- When neuron $l$ fires it adds $w_l^1$ to the response
25. Let’s do a proof!
- Say there is a single covariate ($p = 1$) and
$$\sigma(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$$
- Prove that any smooth curve $f(X)$ for $X \in [0, 1]$ can be approximated by a function of the form
$$f(X) \approx b^1 + \sigma(z_1) w_1^1 + \ldots + \sigma(z_L) w_L^1, \qquad z_l = b_l^0 + w_l^0 X$$
- Argue without math that your proof will extend to $p = 2$
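A numerical version of the idea (a sketch, not the proof): put knots on a grid over $[0, 1]$ and let neuron $l$ switch on the increment of $f$ at knot $l$, so the network traces $f$ as a staircase:

```r
# Sketch: approximate a smooth curve by a sum of step-function neurons
sigma <- function(x) as.numeric(x > 0)   # the indicator activation above
f <- function(x) sin(2 * pi * x)         # any smooth curve on [0, 1]
L <- 50
g  <- seq(0, 1, length.out = L + 1)      # grid g_0, ..., g_L
w1 <- diff(f(g))                         # w1_l = f(g_l) - f(g_{l-1})
b1 <- f(0)
fhat <- function(x)                      # z_l = x - g_{l-1} (w0_l = 1)
  b1 + sum(w1 * sigma(x - g[1:L]))
c(f(0.3), fhat(0.3))                     # agree up to O(1/L)
```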
26. Optimization
- To fit a NN we need to estimate all the biases $b$ and weights $w$ to minimize
$$\mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[ Y_i - f(X_i) \right]^2$$
- It's common to use gradient descent
- The gradients with respect to $b$ and $w$ have a nice form
- They have a recursive structure which leads to the back-propagation algorithm
27. Recursive gradients for back-propagation
$$\frac{\partial e_i^2}{\partial b^1} = -2 e_i \qquad\qquad \frac{\partial e_i^2}{\partial w_l^1} = \frac{\partial e_i^2}{\partial b^1}\, \sigma(z_l)$$
$$\frac{\partial e_i^2}{\partial b_l^0} = \frac{\partial e_i^2}{\partial b^1}\, w_l^1\, \sigma'(z_l) \qquad\qquad \frac{\partial e_i^2}{\partial w_{lj}^0} = \frac{\partial e_i^2}{\partial b_l^0}\, X_j$$
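A bare-bones sketch of gradient descent using exactly these recursive gradients (assumed details: logistic $\sigma$, a made-up step size, gradients averaged over the batch for stability):

```r
# Sketch: fit the single-layer NN by gradient descent / back-propagation
set.seed(1)
n <- 200; p <- 3; L <- 4; eta <- 0.1
X <- matrix(rnorm(n * p), n, p)
Y <- sin(X[, 1]) + rnorm(n, sd = 0.1)        # toy data
W0 <- matrix(rnorm(L * p), L, p); b0 <- rnorm(L)
w1 <- rnorm(L); b1 <- 0

for (it in 1:5000) {
  Z  <- sweep(X %*% t(W0), 2, b0, "+")       # Z[i, l] = z_l for obs i
  S  <- plogis(Z)                            # sigma(z); sigma' = S(1 - S)
  e  <- as.vector(Y - (b1 + S %*% w1))       # residuals e_i
  d1 <- -2 * e                               # d e_i^2 / d b^1
  D  <- (d1 %o% w1) * S * (1 - S)            # d e_i^2 / d b_l^0
  b1 <- b1 - eta * mean(d1)
  w1 <- w1 - eta * as.vector(crossprod(S, d1)) / n  # uses d1 * sigma(z_l)
  b0 <- b0 - eta * colSums(D) / n
  W0 <- W0 - eta * crossprod(D, X) / n       # uses D[i, l] * X_ij
}
mean(e^2)                                    # training error after descent
```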
28. Deep learning
- Deep learning just adds more layers
- Here is a two-layer NN:
$$f(X) = b^2 + \sigma(z_1^2) w_1^2 + \ldots + \sigma(z_L^2) w_L^2$$
$$z_l^2 = b_l^1 + \sigma(z_1^1) w_{l1}^1 + \ldots + \sigma(z_M^1) w_{lM}^1$$
$$z_k^1 = b_k^0 + w_{k1}^0 X_1 + \ldots + w_{kp}^0 X_p$$
- There are $L$ and $M$ neurons in the two layers, respectively
- The mean is a non-linear function of non-linear functions of the inputs
29. DNN in our fire example
- $X_j$ is the intensity for pixel $j = 1, \ldots, p$
- $X = (X_1, \ldots, X_p)^T$ is the image
- Truth: $f(X) = 10$ if at least two regions have high intensity and $f(X) = 0$ otherwise
- This is a weird non-linear function!
- How to pick the biases and weights to get this function?
30. DNN in our fire example
- Let $\sigma(x) = 1$ if $x > 0$ and $\sigma(x) = 0$ otherwise
- Set $L = 1$ and $M = p$
- Let $\sum_{j=1}^{p} w_{kj}^0 X_j$ be the average intensity around pixel $k$
- Pick $b_k^0$ so that $\sigma(z_k^1) = 1$ indicates that the intensity around pixel $k$ is high
- Pick $w_{1k}^1 = 1$ so that $w_{11}^1 \sigma(z_1^1) + \ldots + w_{1M}^1 \sigma(z_M^1)$ is the number of high-intensity regions
- Pick $b_1^1$ so that $\sigma(z_1^2) = 1$ indicates the number of high-intensity regions exceeds 1
- Finally, set $b^2 = 0$ and $w_1^2 = 10$
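This construction can be written out directly; a sketch assuming a $10 \times 10$ image, $3 \times 3$ neighborhoods, and an invented "high intensity" threshold of 0.8:

```r
# Sketch: the hand-built two-layer NN for the fire example
step <- function(x) as.numeric(x > 0)
side <- 10; p <- side^2; thresh <- 0.8      # assumed image size / threshold

# W0[k, ] averages the 3x3 neighborhood of pixel k
coords <- expand.grid(row = 1:side, col = 1:side)
W0 <- matrix(0, p, p)
for (k in 1:p) {
  nbr <- which(abs(coords$row - coords$row[k]) <= 1 &
               abs(coords$col - coords$col[k]) <= 1)
  W0[k, nbr] <- 1 / length(nbr)
}
b0 <- rep(-thresh, p)          # sigma(z1_k) = 1 iff local average > thresh

f <- function(X) {
  h <- step(W0 %*% X + b0)     # which neighborhoods are "high"
  z2 <- sum(h) - 1.5           # b1_1 = -1.5: fires iff count >= 2
  10 * step(z2)                # b2 = 0, w2_1 = 10
}
f(runif(p))                    # typically 0 for a featureless image
```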
33. Black magic
- We need to pick the number of layers and the number of neurons in each layer
- For large datasets computing the gradients is slow and stochastic gradient descent is a common remedy
- How many mini-batches?
- Drop-out rate?
- Regularization?
- Transformations?
- More to come!
34. Data challenge!
- Get in small groups and load the R workspace trainingdata
- Fit a NN model to the training data
- The set-up is the same as the worked example, but the true response surface is different
- Soon I will email everyone the test data
- Evaluate your predictions on the test data
- Winner take all! No prisoners!