- Brian Reich gave a presentation on climate informatics and machine learning.
- He discussed different conceptual views of statistics, including parametric modeling, linear regression, inferential statistics, and machine learning.
- Reich provided examples of unsupervised learning techniques like principal component analysis (PCA) and supervised learning using deep neural networks.
- The presentation concluded with a challenge for the audience to build and evaluate a neural network model on simulated wildfire detection data.
2. Outline
- Conceptual views of statistics
- Definitions
- Example of unsupervised learning: PCA
- Example of supervised learning: Deep neural networks
- Data challenge!
3. Conceptual view of parametric statistics
- Last week, Prof Haran discussed an ice sheet study that was a great example of parametric modeling
- They assumed that the entire glacial system was known up to a few parameters
- The role of statistics is then to compare the observed data to the model output to refine our understanding of these parameters
- This delivers new scientific insights, at least within the current paradigm
4. Conceptual view of linear regression
- For more complex systems we will not know the system up to a few parameters
- We might then use a first-order approximation via linear regression
- If the approximation is obviously wrong, we might try to patch it up with, e.g., quadratic terms
- This is now an approximate system that is known up to a few parameters (slopes, variances, etc.)
- If we have a decent statistical representation of the system we can carefully interpret the parameters scientifically
5. Conceptual view of inferential statistics
- Rather than model the entire system, we can conduct experiments to learn about specific relationships
- For example, it is impossible to model the entire biological system that leads to cancer under different treatments
- Instead, we might conduct a randomized clinical trial to compare treatments
- This doesn't necessarily add to our biological understanding of cancer, but it is surely useful
6. Conceptual view of machine learning
- The premise of machine learning is that we can train an algorithm to mimic how a complex system operates without understanding the fundamental science
- Last week you called this a statistical emulator
- Often this requires a huge amount of training data
- The algorithm that mimics reality may be a black box that is not a function of parameters or equations we can interpret
- However, in many applications, prediction even without scientific understanding is powerful
7. Example: Short-term weather forecasting
- Physical scientist: study physics, chemistry, etc. and encode this in a mathematical model that takes the current state and projects it forward
- Data scientist: dump 100M observations of current meteorological variables into a deep learning algorithm and make statistical predictions
- What are the advantages and disadvantages of each approach?
8. When to use parametric stats vs machine learning
9. A made-up example
- https://satelliteliaisonblog.com/2017/03/03/using-goes-16-to-detect-wildfires/
- I made up some fake data in the R workspace trainingdata
- It is meant to represent taking a bunch of subregions and recording for each snapshot $i = 1, \ldots, 10\mathrm{K}$:
  - $Y_i$, a measure of fire in the region
  - $X_{ij}$, the gray scale of pixel $j = 1, \ldots, 100$
- Goal: predict the response given the image predictor
- A good spot for machine learning?
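A minimal R sketch of data with this shape (the actual trainingdata workspace is not reproduced here; the fire-response rule below is invented purely for illustration):

```r
# Hypothetical stand-in for the trainingdata workspace:
# 10K snapshots, each a 100-pixel gray-scale image
set.seed(1)
n <- 10000; p <- 100
X <- matrix(runif(n * p), n, p)                # X[i, j]: gray scale of pixel j
Y <- 10 * (rowSums(X > 0.99) >= 2) + rnorm(n)  # invented fire measure
```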
10. Flow chart of machine learning algorithms (from SAS)
11. Definitions - Unsupervised learning
- The data consist of several variables, but none is the "response"
- That is, we have $X$ but not $Y$
- Usually unsupervised methods try to identify the main patterns in the variables
- Clustering: put the $n$ observations into $L < n$ clusters
- Principal components analysis (PCA): explain the correlations between variables concisely
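Both methods are one-liners in R; a sketch on random data (the data and the choice of 3 clusters are arbitrary):

```r
# Sketch: the two unsupervised methods above, on random data
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)
km  <- kmeans(X, centers = 3)     # clustering: 3 clusters for 200 rows
pca <- prcomp(X, scale. = TRUE)   # PCA: summarize the correlations
table(km$cluster)                 # cluster sizes
summary(pca)                      # variance explained per component
```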
12. Definitions - Supervised learning
- The data consist of both independent variables $X$ and dependent variables $Y$
- The goal is to study the effects of $X$ on $Y$ and/or predict $Y|X$
- Regression is the obvious example, where we estimate $E(Y|X) = f(X)$
  - Examples: linear regression, trees, nets
- Classification is another example, where the data are from $Q$ unordered classes, $Y \in \{1, \ldots, Q\}$
- The goal is to assign $Y$ to a class based on $X$
  - Examples: logistic regression, support vector machines
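A sketch of one of each in R, on simulated data (the generating models are made up):

```r
# Regression: estimate E(Y|X) with least squares
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 1 + 2 * df$x + rnorm(100)
fit_reg <- lm(y ~ x, data = df)

# Classification: logistic regression for a binary class label
df$cls <- rbinom(100, 1, plogis(df$x))
fit_cls <- glm(cls ~ x, family = binomial, data = df)
```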
13. Principal components analysis (PCA)
- Say we have a collection of variables $X = (X_1, \ldots, X_p)^T$
- For simplicity, assume all $p$ variables are centered (mean zero) and scaled (variance one)
- We observe $n$ samples from their joint distribution, $X_1, \ldots, X_n$
- Linear relationships are summarized by the $p \times p$ sample correlation matrix,
$$S \approx \mathrm{Cor}(X)$$
- If $p = 1000$, $S$ is huge and hard to interpret
14. PCA
- The eigen decomposition of a matrix can be used to approximate the full matrix with a few vectors
- This dimension reduction highlights the most important trends
- Denote the eigen decomposition as $S = \Gamma \Lambda \Gamma^T$
- $\Gamma$'s columns, $\gamma_1, \ldots, \gamma_p$, are the $p$ orthonormal eigenvectors
- That is, $\gamma_j^T \gamma_k = 0$ for $j \neq k$, and thus the components $\gamma_j^T X$ and $\gamma_k^T X$ are uncorrelated
- $\Lambda$ is diagonal with eigenvalues $\lambda_1 \geq \ldots \geq \lambda_p$ on the diagonal
15. PCA
- The full eigen decomposition is
$$S = \sum_{l=1}^{p} \lambda_l \gamma_l \gamma_l^T$$
- The $\lambda_l$ are decreasing and so order the terms by importance
- The "best" approximation with $L < p$ terms is
$$S \approx \sum_{l=1}^{L} \lambda_l \gamma_l \gamma_l^T$$
- The proportion of $S$'s variance explained by $L$ terms is
$$\frac{\sum_{l=1}^{L} \lambda_l}{\sum_{l=1}^{p} \lambda_l}$$
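These quantities map directly onto R's eigen function; a sketch on random data:

```r
# Sketch: eigen decomposition and rank-L approximation of S
set.seed(1)
X <- matrix(rnorm(500 * 10), 500, 10)
S <- cor(X)                            # p x p sample correlation matrix
e <- eigen(S)                          # e$vectors = Gamma, e$values = lambdas
L <- 3
S_L <- e$vectors[, 1:L] %*% diag(e$values[1:L]) %*% t(e$vectors[, 1:L])
sum(e$values[1:L]) / sum(e$values)     # proportion of variance explained
```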
16. Using EOFs to study spatiotemporal trends
- PCA is referred to as empirical orthogonal functions (EOFs) in the climate literature
- As an example, say the $p$ variables are the daily values of precipitation at $p$ spatial locations
- The eigenvectors $\gamma_l$ are spatial surfaces
- They give uncorrelated weighted averages of the data, $Z_{lt} = \gamma_l^T X_t$
- A plot of $Z_{1t}$ by $t$ reveals spatiotemporal trends
- http://www4.stat.ncsu.edu/~reich/SpatialStats/code/EOF.html
17. PCA
- How would you find the first eigenvector if $p = 1{,}000{,}000$?
- http://www4.stat.ncsu.edu/~reich/SAMSI/PCA.html
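One standard answer (an assumption here; the linked page may take a different route) is power iteration, which only needs matrix-vector products with the $n \times p$ data matrix and never forms the $p \times p$ matrix $S$:

```r
# Sketch: power iteration for the first eigenvector of S = X^T X / (n-1),
# computed via S v = X^T (X v) so S itself is never built
power_iter_pc1 <- function(X, iters = 100) {
  X <- scale(X)                  # center and scale, as on the PCA slide
  v <- rnorm(ncol(X))
  for (i in 1:iters) {
    v <- crossprod(X, X %*% v)   # S v, up to a constant factor
    v <- v / sqrt(sum(v^2))      # renormalize each step
  }
  v                              # converges to the first eigenvector
}
```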
18. PC regression (PCR)
- PCA leads to an efficient high-dimensional regression model
- Linear regression:
$$E(Y|X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$$
- High-dimensional regression has $p > n$
- Least squares doesn't exist!
- Need some form of dimension reduction
- LASSO is one way, PCR is another
19. PC regression (PCR)
- In PCA, we reduced the dimension from $p$ correlated variables to $L \ll p$ uncorrelated variables, $Z_l = \gamma_l^T X$
- PCR uses the covariates $Z_1, \ldots, Z_L$ as predictors,
$$E(Y|X) = b + \sum_{l=1}^{L} Z_l w_l,$$
where the $w_l$ are fit via least squares
- This is still linear in $X$,
$$E(Y|X) = b + \sum_{l=1}^{L} \sum_{j=1}^{p} w_l \gamma_{lj} X_j,$$
but because we use the correlation of $X$, we only have to estimate $L < p$ parameters
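A sketch of PCR in R (the data and $L = 5$ are made up; prcomp's scores are exactly the $Z_l$):

```r
# Sketch: principal components regression with L components
set.seed(1)
n <- 200; p <- 50; L <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] + rnorm(n)                    # placeholder response
pca <- prcomp(X, center = TRUE, scale. = TRUE)
Z <- pca$x[, 1:L]                         # Z_l = gamma_l^T X, l = 1..L
fit <- lm(Y ~ Z)                          # b and w_l by least squares
```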
20. Nonparametric regression
- General set-up: $Y = f(X) + \varepsilon$
  - $Y$ is the continuous response
  - $f$ is the response surface or mean function
  - $\varepsilon$ are iid errors
- In linear regression, $f$ is assumed to be linear in $X$
- Life isn't linear
- If we want an algorithm to mimic life it can't be linear
- NP regression tries to model $f$ non-linearly
21. Nonparametric regression
- The response surface $f$ is a function
  - 1D: $f(X)$ is a curve
  - 2D: $f(X)$ is a surface
  - 3D: $f(X)$ is complicated
- What properties do we want in an algorithm to estimate $f$?
- We want it to be able to approximate any continuous function $f$ that takes a $p$-dimensional input and returns a univariate output
- If this is the case, then as the sample size goes to infinity we will be able to estimate any such system
22. Types of nonparametric regression
- Polynomial regression
- Generalized additive models (GAM)
- Gaussian process regression
- Regression trees
- Random forests
- ...
- Neural networks
23. Single-layer neural network (NN)
- Construct covariates (similar to PCR):
$$z_l = b_l^0 + \sum_{j=1}^{p} w_{lj}^0 X_j$$
- PCR: the weights $w_{lj}^0$ are determined by the sample covariance of $X$ and the mean is linear,
$$f(X) = \alpha + z_1 \beta_1 + \ldots + z_L \beta_L$$
- NN: the weights $w_{lj}^0$ (and biases $b_l^0$) are fit using least squares and the mean is non-linear,
$$f(X) = b^1 + \sigma(z_1) w_1^1 + \ldots + \sigma(z_L) w_L^1$$
- The activation function $\sigma$ is chosen by the user, e.g., $\sigma(x) = \exp(x)/[1 + \exp(x)]$ or $\sigma(x) = x_+$ (the positive part)
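The forward pass is a few lines of R; a sketch with made-up dimensions and random (unfitted) weights:

```r
# Sketch: forward pass of the single-layer NN above
sigma <- plogis                    # logistic activation exp(x)/(1+exp(x))
p <- 4; L <- 3
X  <- rnorm(p)                     # one input vector
W0 <- matrix(rnorm(L * p), L, p)   # weights w0[l, j]
b0 <- rnorm(L)                     # biases b0[l]
w1 <- rnorm(L); b1 <- rnorm(1)     # output weights and bias
z  <- b0 + W0 %*% X                # z_l = b0_l + sum_j w0_lj X_j
f  <- b1 + sum(w1 * sigma(z))      # f(X) = b1 + sum_l sigma(z_l) w1_l
```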
24. Analogy with brain activity
$$f(X) = b^1 + \sigma(z_1) w_1^1 + \ldots + \sigma(z_L) w_L^1, \qquad z_l = b_l^0 + \sum_{j=1}^{p} w_{lj}^0 X_j$$
- The response is the sum of activity from $L$ neurons
- $z_l$ measures the intensity on neuron $l$
- When the intensity gets high enough, it fires, and $\sigma(z_l) = 1$
- When neuron $l$ fires it adds $w_l^1$ to the response
25. Let’s do a proof!
- Say there is a single covariate ($p = 1$) and
$$\sigma(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$$
- Prove that any smooth curve $f(X)$ for $X \in [0, 1]$ can be approximated by a function of the form
$$f(X) \approx b^1 + \sigma(z_1) w_1^1 + \ldots + \sigma(z_L) w_L^1, \qquad z_l = b_l^0 + w_l^0 X$$
- Argue without math that your proof will extend to $p = 2$
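A numerical version of the idea (a sketch, not the proof): put knots on a grid over $[0, 1]$ and let neuron $l$ switch on the increment of $f$ at knot $l$, so the network traces $f$ as a staircase:

```r
# Sketch: approximate a smooth curve by a sum of step-function neurons
sigma <- function(x) as.numeric(x > 0)   # the indicator activation above
f <- function(x) sin(2 * pi * x)         # any smooth curve on [0, 1]
L <- 50
g  <- seq(0, 1, length.out = L + 1)      # grid g_0, ..., g_L
w1 <- diff(f(g))                         # w1_l = f(g_l) - f(g_{l-1})
b1 <- f(0)
fhat <- function(x)                      # z_l = x - g_{l-1} (w0_l = 1)
  b1 + sum(w1 * sigma(x - g[1:L]))
c(f(0.3), fhat(0.3))                     # agree up to O(1/L)
```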
26. Optimization
- To fit a NN we need to estimate all the biases $b$ and weights $w$ to minimize
$$\mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[ Y_i - f(X_i) \right]^2$$
- It's common to use gradient descent
- The gradients with respect to $b$ and $w$ have a nice form
- They have a recursive structure which leads to the back-propagation algorithm
27. Recursive gradients for back-propagation
$$\frac{\partial e_i^2}{\partial b^1} = -2 e_i \qquad\qquad \frac{\partial e_i^2}{\partial w_l^1} = \frac{\partial e_i^2}{\partial b^1}\, \sigma(z_l)$$
$$\frac{\partial e_i^2}{\partial b_l^0} = \frac{\partial e_i^2}{\partial b^1}\, w_l^1\, \sigma'(z_l) \qquad\qquad \frac{\partial e_i^2}{\partial w_{lj}^0} = \frac{\partial e_i^2}{\partial b_l^0}\, X_j$$
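A bare-bones sketch of gradient descent using exactly these recursive gradients (assumed details: logistic $\sigma$, a made-up step size, gradients averaged over the batch for stability):

```r
# Sketch: fit the single-layer NN by gradient descent / back-propagation
set.seed(1)
n <- 200; p <- 3; L <- 4; eta <- 0.1
X <- matrix(rnorm(n * p), n, p)
Y <- sin(X[, 1]) + rnorm(n, sd = 0.1)        # toy data
W0 <- matrix(rnorm(L * p), L, p); b0 <- rnorm(L)
w1 <- rnorm(L); b1 <- 0

for (it in 1:5000) {
  Z  <- sweep(X %*% t(W0), 2, b0, "+")       # Z[i, l] = z_l for obs i
  S  <- plogis(Z)                            # sigma(z); sigma' = S(1 - S)
  e  <- as.vector(Y - (b1 + S %*% w1))       # residuals e_i
  d1 <- -2 * e                               # d e_i^2 / d b^1
  D  <- (d1 %o% w1) * S * (1 - S)            # d e_i^2 / d b_l^0
  b1 <- b1 - eta * mean(d1)
  w1 <- w1 - eta * as.vector(crossprod(S, d1)) / n  # uses d1 * sigma(z_l)
  b0 <- b0 - eta * colSums(D) / n
  W0 <- W0 - eta * crossprod(D, X) / n       # uses D[i, l] * X_ij
}
mean(e^2)                                    # training error after descent
```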
28. Deep learning
- Deep learning just adds more layers
- Here is a two-layer NN:
$$f(X) = b^2 + \sigma(z_1^2) w_1^2 + \ldots + \sigma(z_L^2) w_L^2$$
$$z_l^2 = b_l^1 + \sigma(z_1^1) w_{l1}^1 + \ldots + \sigma(z_M^1) w_{lM}^1$$
$$z_k^1 = b_k^0 + w_{k1}^0 X_1 + \ldots + w_{kp}^0 X_p$$
- There are $L$ and $M$ neurons in the two layers, respectively
- The mean is a non-linear function of non-linear functions of the inputs
29. DNN in our fire example
- $X_j$ is the intensity for pixel $j = 1, \ldots, p$
- $X = (X_1, \ldots, X_p)^T$ is the image
- Truth: $f(X) = 10$ if at least two regions have high intensity and $f(X) = 0$ otherwise
- This is a weird non-linear function!
- How to pick the biases and weights to get this function?
30. DNN in our fire example
- Let $\sigma(x) = 1$ if $x > 0$ and $\sigma(x) = 0$ otherwise
- Set $L = 1$ and $M = p$
- Let $\sum_{j=1}^{p} w_{kj}^0 X_j$ be the average intensity around pixel $k$
- Pick $b_k^0$ so that $\sigma(z_k^1) = 1$ indicates that the intensity around pixel $k$ is high
- Pick $w_{1k}^1 = 1$ so that $w_{11}^1 \sigma(z_1^1) + \ldots + w_{1M}^1 \sigma(z_M^1)$ is the number of high-intensity regions
- Pick $b_1^1$ so that $\sigma(z_1^2) = 1$ indicates the number of high-intensity regions exceeds 1
- Finally, set $b^2 = 0$ and $w_1^2 = 10$
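This construction can be written out directly; a sketch assuming a $10 \times 10$ image, $3 \times 3$ neighborhoods, and an invented "high intensity" threshold of 0.8:

```r
# Sketch: the hand-built two-layer NN for the fire example
step <- function(x) as.numeric(x > 0)
side <- 10; p <- side^2; thresh <- 0.8      # assumed image size / threshold

# W0[k, ] averages the 3x3 neighborhood of pixel k
coords <- expand.grid(row = 1:side, col = 1:side)
W0 <- matrix(0, p, p)
for (k in 1:p) {
  nbr <- which(abs(coords$row - coords$row[k]) <= 1 &
               abs(coords$col - coords$col[k]) <= 1)
  W0[k, nbr] <- 1 / length(nbr)
}
b0 <- rep(-thresh, p)          # sigma(z1_k) = 1 iff local average > thresh

f <- function(X) {
  h <- step(W0 %*% X + b0)     # which neighborhoods are "high"
  z2 <- sum(h) - 1.5           # b1_1 = -1.5: fires iff count >= 2
  10 * step(z2)                # b2 = 0, w2_1 = 10
}
f(runif(p))                    # typically 0 for a featureless image
```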
33. Black magic
- We need to pick the number of layers and the number of neurons in each layer
- For large datasets computing the gradients is slow and stochastic gradient descent is a common remedy
- How many mini-batches?
- Drop-out rate?
- Regularization?
- Transformations?
- More to come!
34. Data challenge!
- Get in small groups and load the R workspace trainingdata
- Fit a NN model to the training data
- The set-up is the same as the worked example, but the true response surface is different
- Soon I will email everyone the test data
- Evaluate your predictions on the test data
- Winner take all! No prisoners!