Machine Learning for Scientific Applications

http://davidlary.info
David Lary
Need: Accounting for complex multi-variate context
which is often not fully described by theory
Monday, August 11, 14

Long Term Data Sets:
Uncertainty, Cross-Calibration,
Data Fusion & Machine Learning
Motivated by Data Assimilation
With examples from Land, Atmosphere & Ocean

Bias Detection
“Who may discern his errors, ....” Psalm 19:12

Why is it an issue?
• With fusion of multiple datasets, bias is
often an issue (very relevant for climate
variables).
• Data assimilation is a least-squares method, a
Best Linear Unbiased Estimator (BLUE), so it
assumes that the observations are unbiased.
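The sensitivity of a least-squares analysis to bias can be illustrated with a small sketch. It assumes two independent observations of one scalar with known error variances, for which the BLUE reduces to an inverse-variance weighted mean. The function name `blue_combine` and all numbers are hypothetical, chosen only for illustration.

```python
def blue_combine(observations, variances):
    """Inverse-variance weighted mean of independent observations of one scalar.

    This is the BLUE only if every observation is unbiased.
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, observations)) / total_weight
    return estimate, 1.0 / total_weight


# Hypothetical numbers: a true value of 3.0 seen by two instruments.
truth = 3.0
variances = [0.04, 0.01]

unbiased_estimate, analysis_variance = blue_combine([truth, truth], variances)

# Give the first instrument a constant bias of +0.5.
biased_estimate, _ = blue_combine([truth + 0.5, truth], variances)

print(unbiased_estimate)        # close to 3.0
print(biased_estimate - truth)  # close to 0.1: a fifth of the bias leaks through
```

The first instrument carries a fifth of the total weight, so a fifth of its bias ends up in the analysis, and the reported analysis variance gives no hint of it.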

.... runs deeper still
• Instrument teams have a keen sense of faithfully reporting the
data, as it is, warts and all. They are naturally loath to empirically
correct biases; they would like to theoretically understand the
cause of the bias and data issues from first principles.
Because the Earth System is so complex, with many interacting processes,
and the instruments are often complex too, this is not always
possible.
Residual data issues can, and usually do, remain.
• Modelers know that data biases exist, but are very reluctant to make
changes to data products.
.... we therefore have a problem of closure.

The problem!
• Biases are ubiquitous, and not all of them can be explained
theoretically. Yet we typically need to fuse multiple datasets to
construct long-term time series and/or improve global coverage.
• If the biases are not corrected before data fusion, we introduce
further problems, such as:
• spurious trends, leading to the possibility of unsuitable
policy decisions;
• when assimilation is involved, the suboptimal use of
observations, non-physical structures in the
analysis, biases in the assimilated fields, and extrapolation
of biases due to multivariate background constraints.

A Further Problem
The instruments whose data we would like
to fuse are often not making coincident
measurements in time or space.
It is imperative to inter-compare observations in
their appropriate context.

Integrate multiple satellite datasets
for applications
The comparison above shows the total ozone column
observed by EP TOMS and Aura OMI. The high
resolution coverage that Aura OMI provides is clearly
seen. In the particular event shown there is a
tropopause fold event over Texas.

An Example
Representativeness
[Figure: relative frequency of O3 v.m.r., titled "All years 01" (1900 K < θ < 2300 K, 90° < φel < 79°). Legend: Aura MLS O3 (23), CLAES v9 O3 (207), ISAMS v10 O3 (19), UARS MLS v5 183 GHz O3 (379), UARS MLS v5 205 GHz O3 (490), SAGE 2 v6.2 O3 (21), SBUV v8 O3 (33).]

Geophysical Insights
[Four-panel figure, (a) to (d)]
Figure 2: N2O equivalent PV latitude - potential temperature cross sections
of (a) representativeness uncertainty (v.m.r.), (b) observational uncertainty

Bias is Spatially Dependent
[Figure: % bias (UARS MLS v5 183 GHz O3 − HALOE v19 O3) for January of all years, plotted against equivalent PV latitude (−75 to 75) and potential temperature (250 to 2500 K), with a color scale from −30 to 30.]

So what can we do
about this?
.... we do not have a theoretical explanation

Machine Learning
for when our understanding is incomplete
... and that is quite often!

What is Machine Learning?
• Machine learning is a sub-field of artificial
intelligence that is concerned with the design and
development of algorithms that allow computers
to learn the behavior of data sets empirically.
• A major focus of machine-learning research is to
produce (induce) empirical models from data
automatically.
• This approach is usually used when adequate and
complete theoretical models, which are
conceptually more desirable, are absent.

What is Machine Learning?
The use of machine learning can actually help
us to construct a more complete theoretical
model, as it allows us to determine which
factors are statistically capable of providing
the data mappings we seek, e.g. the
multi-variate, non-linear, non-parametric
mapping between satellite
radiances and a suite of ocean products.

Machine Learning
Is for:
Regression
➡ Multivariate, non-linear, non-parametric
Classification
➡ Supervised and unsupervised

Machine Learning
Comes in Several Flavors, for example:
• Neural Networks
• Support Vector Machines
• Gaussian Process Models
• Decision Trees
• Random Forests

Machine Learning Regression
Training data: inputs x1, x2, x3, x4, x5, …, xn and output(s) y
y = f(x1, x2, x3, x4, x5, …, xn)
Multivariate, non-linear, non-parametric
n can be very large
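A non-parametric mapping y = f(x1, …, xn) learned purely from training data can be sketched with k-nearest-neighbour regression. This is a deliberately simple stand-in and not one of the methods named in these slides; the function `knn_regress` and the synthetic training data are hypothetical.

```python
import math


def knn_regress(train_x, train_y, query, k=3):
    """Predict y at `query` as the mean output of the k nearest training inputs."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Synthetic training data: a non-linear function of two inputs on a regular grid.
train_x = [(a / 4, b / 4) for a in range(9) for b in range(9)]
train_y = [x1 ** 2 + x2 for x1, x2 in train_x]

# No functional form is assumed; the prediction comes from nearby examples only.
prediction_k1 = knn_regress(train_x, train_y, (1.0, 1.0), k=1)
prediction_k5 = knn_regress(train_x, train_y, (1.0, 1.0), k=5)
print(prediction_k1, prediction_k5)
```

With k = 1 the query coincides with a training point and its output is returned unchanged; with k = 5 the four grid neighbours are averaged in, which smooths the estimate.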

Machine Learning Supervised
Classification
Training data: inputs x1, x2, x3, x4, x5, …, xn and output(s) y
n can be very large

Machine Learning
Unsupervised Classification
Training data: inputs x1, x2, x3, x4, x5, …, xn (no output y)
n can be very large

Neural Networks
In a neural network model, simple
nodes (neurons) are connected
together to form a network of
nodes. Its practical use comes
with algorithms designed to alter
the strength (weights) of the
connections in the network to
produce a desired signal flow.
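The idea of an algorithm that alters connection weights can be sketched with a single sigmoid neuron trained by gradient descent. This is a minimal illustration, assuming a cross-entropy loss and a toy logical-OR data set; `train_neuron`, the learning rate and the epoch count are hypothetical choices, not the networks used in this work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, epochs=2000, rate=0.5):
    """Adjust two weights and a bias so the neuron output approaches the targets."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            # For a sigmoid with cross-entropy loss, the gradient with respect
            # to the pre-activation is simply (output - target).
            error = output - target
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias


# Toy training data: logical OR of two binary inputs.
or_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(or_gate)

predictions = [
    round(sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias))
    for inputs, _ in or_gate
]
print(predictions)
```

A practical network stacks many such nodes in layers and propagates the same kind of error signal backwards through them.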

Support Vector Machines
Support vector machines (SVMs)
are a set of related supervised
learning methods used for
classification and regression.
Intuitively, an SVM model is a
representation of the training
examples as points in space,
mapped so that the examples of
the separate categories are
divided by a clear gap that is as
wide as possible.
Vladimir Vapnik

Gaussian Process Models
Gaussian processes (GPs)
(Rasmussen and Williams 2006) fit a
multivariate Gaussian probability
distribution to any set of regressors,
allowing for analytic inference. As a
principled Bayesian technique, GPs
go beyond SVMs by allowing us to
supply a full posterior distribution
for our regressors, giving us both
mean estimates as well as an
indication of the uncertainty in them.

Random Forest
Random forests are an ensemble learning method for
classification (and regression) that operate by
constructing a multitude of decision trees, hence a
forest. The approach was developed by Leo Breiman
and Adele Cutler.

A key issue is training
dataset size: the bigger
the better!
..... until we run out of memory

Variations in Stratospheric Cly
Between 1991 and the present
David Lary, Anne Douglass, Darryn Waugh,
Richard Stolarski, Paul Newman, Hamse Mussa
• Data can be biased,
maybe as a function of
many parameters.
• May be observing a
proxy for what we
really want to know.

[Page excerpt headed "21st Century Ozone Layer"; the text is cut off at the edges of the page image.]
... ozone reductions there (SOCOL and E39C), and the model with the largest cold bias in the Antarctic lower stratosphere in spring (LMDZrepro) simulates very low ozone. CCMs show a large range of ozone trends over the past 25 years (see left panels in Figure 3-26 of Chapter 3) and large differences from observations. Some of these differences may in part be related to differences in the sim- ...
... recovery due to declining ODSs, we place importance on the models’ ability to correctly simulate stratospheric Cly as well as the representation of transport characteristics and polar temperatures. Therefore, more credence is given to those models that realistically simulate these processes. Figure 6-7 shows a subset of the diagnostics used to evaluate these processes and CCMs shown with solid curves ...
Figure 6-8. October zonal mean values of total inorganic chlorine (Cly in ppb) at 50 hPa and 80°S from CCMs. Panel (a) shows Cly and panel (b) difference in Cly from that in 1980. The symbols in (a) show estimates of Cly in the Antarctic lower stratosphere in spring from measurements from the UARS satellite in 1992 and the Aura satellite in 2005, yielding values around 3 ppb (Douglass et al., 1995; Santee et al., 1996) and around 3.3 ppb (see Figure 4-8), respectively.
[Panels: (a) Cly (ppbv) and (b) Cly − Cly(1980) (ppbv) versus year, both at 50 hPa, 80°S, October.]

A large range of Cly in
the model simulations
Constrained by a limited number of
Cly observations

• We need to know the distribution of
inorganic chlorine (Cly) in the
stratosphere to:
• Attribute changes in stratospheric
ozone to changes in halogens.
• Assess the realism of chemistry-
climate models.

Cly = HCl + ClONO2 + ClO + HOCl + 2 Cl2O2 + 2 Cl2
Long time-series
Sporadic
Long time-series
Since 2004
Estimating Cly is hampered by lack of observations
Estimating Cly is hampered by inter-instrument biases
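The Cly budget above is a plain sum in which the species containing two chlorine atoms are counted twice. A minimal sketch, with the function name `total_inorganic_chlorine` and the example mixing ratios made up purely for illustration (they are not measurements):

```python
def total_inorganic_chlorine(hcl, clono2, clo, hocl, cl2o2, cl2):
    """Cly = HCl + ClONO2 + ClO + HOCl + 2 Cl2O2 + 2 Cl2.

    All arguments must share the same mixing-ratio units (e.g. ppbv).
    Cl2O2 and Cl2 each carry two chlorine atoms, hence the factor 2.
    """
    return hcl + clono2 + clo + hocl + 2.0 * cl2o2 + 2.0 * cl2


# Made-up mixing ratios in ppbv, for illustration only.
cly = total_inorganic_chlorine(
    hcl=1.5, clono2=0.5, clo=0.2, hocl=0.1, cl2o2=0.1, cl2=0.05
)
print(cly)
```

In practice each term comes from a different instrument with its own coverage and bias, which is why the sum cannot simply be taken from coincident observations.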

Using PDFs for Bias Detection
[Figure: relative frequency of HCl v.m.r. for 2005/01 (460 K < θ < 590 K, 49° < φel < 61°). Legend: ACE v2.2 HCl (75), Aura MLS HCl (1544), HALOE v19 HCl (101).]
http://www.pdfcentral.info/
HALOE - Aura HCl
If we now repeat this globally for all periods of overlap
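Comparing probability distribution functions from two instruments over the same region and period can be sketched as follows. The slides do not state how the PDFs are compared numerically, so the use of a median offset as the bias indicator is an assumption; `relative_frequency` and the sample values are hypothetical.

```python
import statistics


def relative_frequency(values, lo, hi, n_bins):
    """Histogram of `values` on [lo, hi), normalised so the bins sum to 1."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            # Guard against floating-point rounding at the upper edge.
            index = min(int((v - lo) / width), n_bins - 1)
            counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


# Made-up samples from two instruments viewing the same air mass;
# instrument B reads 0.1 higher than instrument A throughout.
instrument_a = [1.0, 1.1, 1.2, 1.2, 1.3, 1.4]
instrument_b = [v + 0.1 for v in instrument_a]

pdf_a = relative_frequency(instrument_a, 0.8, 2.0, 12)
pdf_b = relative_frequency(instrument_b, 0.8, 2.0, 12)

# A shift between the two distributions points to an inter-instrument bias.
median_offset = statistics.median(instrument_b) - statistics.median(instrument_a)
print(median_offset)
```

Because the samples only need to share a region and period, not exact locations and times, this comparison works even when the instruments never make coincident measurements.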

HCl Inter-comparisons
[Scatter plots against HALOE HCl (ppbv), each showing the data, the 1:1 line and a weighted fit.]
Comparison                           Slope    Intercept (ppbv)
ATMOS HCl vs. HALOE HCl              1.05     0.23
ACE HCl v2.2 vs. HALOE HCl           1.18     −0.050
MLS HCl vs. HALOE HCl                1.09     (not shown)
MLS HCl vs. HALOE HCl NN adjusted    0.995    0.0093
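The slopes and intercepts in these inter-comparisons come from weighted straight-line fits. The sketch below assumes the weights are 1/σ² on the y values only (the slides do not say whether uncertainties in both instruments were used), and it uses synthetic data. The final linear adjustment is a simple stand-in for the neural-network adjustment used in the slides.

```python
def weighted_line_fit(x, y, sigma_y):
    """Fit y = intercept + slope * x, weighting each point by 1 / sigma_y**2."""
    weights = [1.0 / s ** 2 for s in sigma_y]
    sw = sum(weights)
    swx = sum(w * xi for w, xi in zip(weights, x))
    swy = sum(w * yi for w, yi in zip(weights, y))
    swxx = sum(w * xi * xi for w, xi in zip(weights, x))
    swxy = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    intercept = (swy - slope * swx) / sw
    return slope, intercept


# Synthetic example: a second instrument reading 10% high with a 0.2 offset.
reference = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
other = [0.2 + 1.1 * r for r in reference]
sigma = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3]

slope, intercept = weighted_line_fit(reference, other, sigma)

# Invert the fitted line to put the second instrument on the reference scale.
adjusted = [(o - intercept) / slope for o in other]
print(slope, intercept)
```

A slope near 1 and an intercept near 0 after adjustment, as in the NN-adjusted comparison above, indicate that the two instruments are on a common scale.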

Neurological algorithms
[Diagram labels: Inputs, Process, Outputs]

An example neural network
[Diagram labels: Inputs, Process, Outputs]

Objective design of neural networks
using genetic algorithms

Re-calibration
using a Neural Network
[Figure, two panels of network outputs A against targets T, each showing the data points, the best linear fit and the line A = T:
HCl Training Outputs vs. Targets, R=0.98739, linear fit A = (0.97) T + (5e11)
HCl Validation Outputs vs. Targets, R=0.99232, linear fit A = (0.98) T + (2.9e11)]
Totally independent
validation

Long-term continuity
Applied Neural Network
Re-calibration to HALOE

[Figure: two panels of October monthly-average Cly against year (1995 to 2005), with panel titles extracted as "Monthly average 2°" and "Monthly average 61°". Legend: 800 K, 525 K, and 6, 5, 4, 3 and 2 Year Age.]
Use neural networks to infer Cly from HCl, CH4, ϕpv, and θ.
Long-term continuity for Cly


[Page excerpt headed "21st Century Ozone Layer", page 6.26; the text is cut off at the edges of the page image.]
... ulated Cly, e.g., E39C and SOCOL show a trend smaller than observed, whereas AMTRAC and UMETRAC show a trend larger than observed in extrapolar area weighted mean column ozone. However, other factors also contribute, e.g., biases in tropospheric ozone (Austin and Wilson, 2006).
The CCM evaluation discussed above and in Eyring et al. (2006) has guided the level of confidence we place on each model simulation. The CCMs vary in their skill in representing different processes and characteristics of the atmosphere. Because the focus here is on ozone ...
... in Figures 6-7, 6-8, 6-10 and 6-12 to 6-14 are those that are in good agreement with the observations in Figure 6-7. However, these line styles should not be over-interpreted as both the ability of the CCMs to represent these processes as well as the relative importance of Cly, temperature, and transport vary between different regions and altitudes. Also, analyses of model dynamics in the Arctic, and differences in the chlorine budget/partitioning in these models, when available, might change this evaluation for some regions and altitudes.

Other uses of machine
learning
• Cross calibration of vegetation indices from AVHRR, MODIS,
SPOT and SeaWiFS
• Inferring CO2 fluxes from vegetation indices and surface
temperature
• Inferring ocean pigment concentrations and other parameters
• Inferring drought stress and endophyte infection in cacao (cocoa)
• Learning the chaotically tumbling orbit of the Hubble Space
Telescope
• Detecting online eBay fraud
• Acceleration of expensive code elements

Another application
dissolved organic carbon

Method used to estimate DOC      R
SeaWiFS bands GP NL              0.99977
MODIS bands GP NL                0.9997
All bands GP NL                  0.99901
UV & SeaWiFS bands GP NL         0.99899
All bands NN                     0.95859
UV & SeaWiFS bands NN            0.94609
MODIS bands NN                   0.92585
SeaWiFS bands NN                 0.91653

[Taylor diagram: standard deviation, correlation coefficient and RMSD for points A to H.]
Gaussian Process Models

Relative Importance of the Inputs
Wavelength    Relative Importance
Rrs490        0.00087123    (most important)
Rrs555        0.011976
Rrs670        1.5876
Rrs510        9.8423
Rrs443        13.0898
Rrs412        20.2553       (least important)
The GPM hyper-parameters give an indication of the relative
importance of the inputs. For the DOC SeaWiFS bands, the best inputs
are those with the smallest values; here they are sorted in order of
importance.
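The ranking rule stated above (smaller hyper-parameter value means a more important input) can be written as a one-line sort. The values are the ones listed in the table; the dictionary name `hyperparameters` is hypothetical.

```python
# Hyper-parameter values for the DOC SeaWiFS bands, as listed in the table.
hyperparameters = {
    "Rrs412": 20.2553,
    "Rrs443": 13.0898,
    "Rrs490": 0.00087123,
    "Rrs510": 9.8423,
    "Rrs555": 0.011976,
    "Rrs670": 1.5876,
}

# Smallest value first, i.e. most important input first.
ranked = sorted(hyperparameters, key=hyperparameters.get)
print(ranked)
```

Sorting this way reproduces the order of the table, with Rrs490 first and Rrs412 last.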

[Figure: salinity (0 to 40) against a412−a443 (−0.5 to 2), showing the data and three fits: Polynomial (r² = 0.928), NN (r² = 0.933), SVM (r² = 0.933).]

Visibility
Variable    R
Td          -0.29
q           -0.26
T           -0.19
U           -0.18
RH          0.13
SLP         0.05

High Resolution Identification of
Dust Sources Using Machine
Learning and Remote Sensing Data
Annette Walker and David J. Lary
A42A-08

NRL High-resolution Dust Source Database
Approach and Methodology
• 10 years of DEP (2 yr MSG/RGB) imagery
• COAMPS 10 m wind overlays
• Surface weather plots
• ENVI (GIS-like software)
• NGDC topographical 10° × 10° tiles
• Overlay 0.25° grid or use Google Earth (GE)
• Dust source area entered into database
(cursor location tool = 1 km precision)
• Cross-correlate land and water features
using maps, atlases, Landsat images
(detailed topographic, geographic,
and geomorphic information, GE)
• Technical and governmental reports
[Images: 20030820 NRL DEP (Iran, Pakistan); 20110630 NRL MSG/RGB (Saudi Arabia); 20030820 MODIS True Color]


Solid red and purple shapes identify dust source
areas located using DEP and MSG.
[Maps: SW Asia DSD (Saudi Arabia) and East Asia DSD (Mongolia)]

Self-Organizing Map
Self-organizing maps (SOMs) are a
data visualization and unsupervised
classification technique invented by
Professor Teuvo Kohonen (Kohonen
1982; 1990) that reduce the
dimensions of data through the use
of self-organizing neural networks.
They help us address the issue that
humans simply cannot visualize high
dimensional data.

Self-Organizing Map
SOMs reduce dimensionality by
producing a map that objectively plots
the similarities of the data by grouping
similar data items together.
SOMs learn to classify input vectors
according to how they are grouped in
the input space.
SOMs learn both the distribution and
topology of the input vectors they are
trained on. This approach allows SOMs
to accomplish two things: reduce
dimensions and display similarities.
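How a SOM groups similar input vectors onto nearby map nodes can be sketched in a few lines. This is a minimal one-dimensional map with hypothetical settings (4 nodes, linearly decaying learning rate and neighbourhood radius, a fixed random seed) and toy two-cluster data; it is not the 1000-class configuration used for the dust-source maps.

```python
import math
import random


def best_matching_unit(nodes, vec):
    """Index of the map node whose weight vector is closest to `vec`."""
    return min(range(len(nodes)), key=lambda j: math.dist(nodes[j], vec))


def train_som(data, n_nodes=4, epochs=200, rate0=0.5, radius0=2.0, seed=0):
    """Train a 1-D chain of nodes; neighbours of the winner are dragged along."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        rate = rate0 * frac                # learning rate decays to zero
        radius = max(radius0 * frac, 0.5)  # neighbourhood shrinks over time
        for vec in data:
            best = best_matching_unit(nodes, vec)
            for j, node in enumerate(nodes):
                influence = math.exp(-((j - best) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    node[d] += rate * influence * (vec[d] - node[d])
    return nodes


# Toy data: two well-separated clusters in a 2-D input space.
cluster_a = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05)]
cluster_b = [(0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
nodes = train_som(cluster_a + cluster_b)

classes_a = {best_matching_unit(nodes, v) for v in cluster_a}
classes_b = {best_matching_unit(nodes, v) for v in cluster_b}
print(classes_a, classes_b)
```

The index of the best-matching node plays the role of the SOM class: vectors from the same cluster land on the same or adjacent nodes, while the two clusters occupy different parts of the map.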

Detecting Dust Sources

Self-Organizing Map Classification
7 Bands
MODIS MCD43C3
bihemispherical reflectance

All 1000-Classes mapped for North Africa

Libyan Dust Event: May 9, 2010 (8Z – 12Z)
Jabal al Akhdar (الجبل الأخضر, Al Ǧabal al 'Aḫḍar, English: Green Mountains)
A coastal mountain range with height 1.0-1.5 km.

Libyan Dust Event: May 9, 2010 (6Z – 8Z)
Plumes originate on the leeward side of
Al Jabal al Akhdar, where drainage occurs
along slopes.
Corresponding SOM-Classes: 49, 93, 94

Chad: Bodélé Depression
Dust Event: March 16, 2010 (7Z - 12Z)
The Bodélé Depression, located at the southern edge of the Sahara Desert in north central Africa, is the lowest point in Chad. Dust storms from the Bodélé Depression occur on
average about 100 days per year. The Bodélé Depression is a single spot in the Sahara that provides most of the mineral dust to the Amazon forest.

Selected SOM Classes
NRL MSG-RGB 20110109
Source area is not
designated in first pass of
MODIS reflectance and land
surface classification.
1000 SOM Classes

Selected Classes with Class 137
NRL MSG-RGB 20110109
Class 137 maps diatom
sediment in depression.
1000 SOM Classes

Selected Classes Without Class 137
NRL MSG-RGB 20110109
1000 SOM Classes
Class 137 maps diatom
sediment in depression.

Solid black circles/ovals show plume source
Corresponding SOM Classes within open circles/ovals
Northern Sahara: 36, 40, 63, 100
Sahel: 147, 229, 230, 405
West Africa: Feb 2, 2011 13Z

Selected Classes for North Africa
(This involves 40 distinct classes)

Jan 1, 2006 True Color
Jan 1, 2006 NRL DEP
Sources along New Mexico/Texas border
The North American sources have a different
spectral signature than those we saw in SW Asia
Agricultural areas on the high plains
Blue desert areas

Sources in Arizona and Colorado
Apr 17, 2006 NRL DEP
Apr 17, 2006 True color

Selected Classes for North America (n=64)

All 1000-Classes mapped for South America
Blue colored SOM-Classes are concentrated in
Atacama and Salar de Uyuni deserts
White areas are salt flats

South America: Bolivia and Chile
July 18, 2010 MODIS Terra True Color
Selected SOM-Classes in 200s, 300s, and 400s

• SOMs provide an effective mechanism for
automating the identification of dust sources.
• Using SOMs lets us map dust sources globally
at high resolution (1-10 km).
• SOMs saved time in finding dust sources when
comparing against satellite imagery.
• This can be done in real time to give dynamically
changing dust sources.

Model
Existing
New

• Personalized Health
Care
• Proactive Health
Care System
• Business Analytics
• Smart Logistics
• Disaster Response
• Fraud Detection
http://holistics3.com

Visualization
Decision Support
Machine Learning
Insight & Discovery
Existing
• Social Media
• Socioeconomic, Census
• News feeds
• Environmental
• Weather
• Satellite
• Sensors
• Health
• Economic
New
• Business Analytics 2.0
• UAVs
• Hyper-spectral Imaging
• Smart Dust
• Wearable Sensors
• Autonomous Cars
Simulation
• Global Weather Models
• Economic Models
• Earthquake Models
GigaPop Pipe
TACC
