CLIM: Transition Workshop - Discussion of Statistics in Oceanography - Michael Stein, May 15, 2018

Discussion for “Statistics in Oceanography”
Michael L. Stein
University of Chicago
SAMSI 2018

Kuusela, trends in ocean heat content
Heat uptake of ocean has a large effect on transient global warming, so
measuring it better is of great importance.
My work has shown that if observations are sufficiently dense, mean functions
have negligible impact on interpolation.
Despite appearances, Argo data not all that dense in space-time.
So mean function may matter, especially where gaps in data larger.
Still, I don’t like polynomial means, especially in time.
Extrapolate poorly, especially if coefficients of quadratic in time vary with
space.
Difficult to interpret scientifically.
I prefer to
use measured/meaningful covariates if possible (results from an ocean
model?)
use a random field model that fills in large gaps flexibly
Local model may struggle with this?

How to model covariance structure?
Local modeling simplifies treatment of nonstationarity and reduces
computation.
Computing loglikelihood for global model probably not feasible.
Use bigger regions and nonstationary models?
Or use global model and do computations locally (independent block
model)?
Need to recognize that locations on opposite sides of Isthmus of Panama
aren’t close.
Use SPDE model?
Integrating first and then interpolating in horizontal is clearly a good idea, but it
would be very interesting to model variation in all three spatial dimensions.
Above mixed layer very different from below.
How to capture horizontal/vertical interactions (presumably not
separable).
Easier but still interesting problem? Separate ocean into several vertical layers
and see if time trends differ by layer.
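
The "global model, local computations" option above can be sketched as an independent-block log-likelihood: one set of covariance parameters shared everywhere, but the likelihood is a sum over blocks treated as independent. This is a minimal sketch under stated assumptions, not the method used in the talk. The exponential covariance with a nugget, the Euclidean distances, and the names exp_cov, gaussian_loglik and block_loglik are all illustrative choices.

```python
import numpy as np

def exp_cov(coords, sigma2, range_, nugget):
    """Exponential covariance matrix with nugget (illustrative model choice)."""
    # Pairwise Euclidean distances; real ocean data would need a distance
    # that respects the sphere and land barriers.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma2 * np.exp(-d / range_) + nugget * np.eye(len(coords))

def gaussian_loglik(y, cov):
    """Zero-mean Gaussian log-likelihood via a Cholesky factorization."""
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, y)  # whitened observations, z = L^{-1} y
    # log det(cov) = 2 * sum(log(diag(L)))
    return -0.5 * (z @ z) - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

def block_loglik(y, coords, block_ids, sigma2, range_, nugget):
    """Sum of per-block log-likelihoods; blocks are treated as independent."""
    total = 0.0
    for b in np.unique(block_ids):
        m = block_ids == b
        total += gaussian_loglik(y[m], exp_cov(coords[m], sigma2, range_, nugget))
    return total
```

With n observations split into blocks of size m, the cost drops from order n^3 to order n m^2, at the price of ignoring dependence across block boundaries. Assigning block_ids by ocean basin would also keep points on opposite sides of a land barrier from being treated as neighbors.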

Of possible theoretical interest?
If use exponential covariance function for process at a single depth, should you
use something different for integrated quantity?
If $f(\omega, \nu)$ is spectral density on $\mathbb{R}^2 \times \mathbb{R}$, then spectral density for process at
fixed level of third dimension is
$$\int_{\mathbb{R}} f(\omega, \nu) \, d\nu$$
and spectral density for process averaged over third dimension from $a$ to $b$ is
$$\int_{\mathbb{R}} f(\omega, \nu) \, \frac{4 \sin^2\left(\frac{1}{2}(b - a)\nu\right)}{(b - a)^2 \nu^2} \, d\nu.$$
Second gives less weight to high frequency vertical variations. Practical
relevance?
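
The extra factor in the averaged spectral density is the squared modulus of the average of exp(i nu z) over z in [a, b], which equals (sin x / x)^2 with x = (b - a) nu / 2. A small numerical check of that identity is sketched below. The function names and the trapezoidal grid size are assumptions made for illustration.

```python
import numpy as np

def averaging_weight(nu, a, b):
    """Weight 4 sin^2((b-a) nu / 2) / ((b-a)^2 nu^2) multiplying f(omega, nu)."""
    x = 0.5 * (b - a) * np.asarray(nu, dtype=float)
    # np.sinc(t) = sin(pi t) / (pi t), so np.sinc(x / pi) = sin(x) / x,
    # with the limiting value 1 at x = 0.
    return np.sinc(x / np.pi) ** 2

def averaging_weight_direct(nu, a, b, n=20001):
    """|(1/(b-a)) * integral_a^b exp(i nu z) dz|^2 by the trapezoidal rule."""
    z = np.linspace(a, b, n)
    vals = np.exp(1j * nu * z)
    integral = np.sum((vals[1:] + vals[:-1]) / 2) * (z[1] - z[0])
    return abs(integral / (b - a)) ** 2
```

The weight equals 1 at nu = 0 and decays like 1 / nu^2, which is the sense in which averaging over depth downweights high-frequency vertical variation.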

Giglio, mapping ocean O2
What does the estimated regression surface look like?
Are there substantial nonlinearities in temperature or salinity?
Other than between latitude and longitude, are there any substantial
interactions? If so, are these interpretable?
What to make of chlorophyll not being informative? Can this be universally
true?
General issue for black box regressions is how to visualize regression surface.
Possibilities here:
Contour map of O2 as function of salinity and temperature for fixed time
and location with a map and calendar in corner that one could click on to
change location and time.
Contour map of O2 as function of spatial location for fixed salinity,
temperature and time with three sliders for these covariates.

Spatial statistics
Spatial models (e.g., Gaussian processes) deal naturally with irregularly sited
spatial data.
Is there something being lost here by not using such models?
To answer this, I would look for spatial (or spatial-temporal) dependence in the
residuals from the regression.
If dependence scales are short relative to larger gaps in float locations, taking
account of dependence may not help much where interpolation is hardest.
Still worth investigating.
Use temperature and salinity as covariates in mean function and model
remainder as space-time random field.
Model temperature, salinity and O2 as multivariate process.
Latter helpful if want O2 where don’t have temperature and salinity, although
can first interpolate and use interpolated surfaces in mean function.
Uncertainties? Can random forest give defensible uncertainties that vary in
time and locations? Spatial statistics can in principle.
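
One simple way to look for spatial dependence in the regression residuals, as suggested above, is a binned empirical semivariogram. The sketch below is illustrative only: it assumes residuals from any fitted regression, coordinates in a projected plane (Euclidean distance, so raw longitude/latitude would need a great-circle distance instead), and the name empirical_variogram is hypothetical.

```python
import numpy as np

def empirical_variogram(coords, resid, bin_edges):
    """Binned semivariogram: mean of 0.5 * (r_i - r_j)^2 per distance bin."""
    i, j = np.triu_indices(len(resid), k=1)          # all distinct pairs
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    half_sq = 0.5 * (resid[i] - resid[j]) ** 2
    which = np.digitize(dist, bin_edges) - 1         # bin index per pair
    nbins = len(bin_edges) - 1
    gamma = np.full(nbins, np.nan)                   # NaN marks empty bins
    counts = np.zeros(nbins, dtype=int)
    for k in range(nbins):
        m = which == k                               # pairs outside the edges match no bin
        counts[k] = m.sum()
        if counts[k] > 0:
            gamma[k] = half_sq[m].mean()
    return gamma, counts
```

A semivariogram that is flat across bins suggests little residual dependence at those scales. One that rises with distance before leveling off suggests dependence, and the distance at which it levels off can be compared with the size of the gaps between floats.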

Data structure
“Streaks” in locations of O2 observations.
Multiple observations from same float.
Floats generally take observations every 10 days.
If validation set is a random subset of all observations, then training set will
likely contain observations very nearby in space and time.
Danger of overfitting?
Probably better to leave out floats for validation data.
But given highly clustered locations of O2 measurements, still might
overfit.
Maybe should leave out all observations in fairly large regions for validation
sets?
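
The two validation schemes suggested above (hold out whole floats, or hold out everything in a region) can be sketched as boolean masks over the observations. This is a minimal sketch under stated assumptions: float_ids, lon and lat are hypothetical arrays with one entry per observation, and the function names, the default 20% fraction and the rectangular box are illustrative choices.

```python
import numpy as np

def leave_floats_out(float_ids, frac=0.2, seed=0):
    """Validation mask that holds out whole floats, never single profiles."""
    rng = np.random.default_rng(seed)
    floats = np.unique(float_ids)
    n_val = max(1, int(round(frac * len(floats))))
    val_floats = rng.choice(floats, size=n_val, replace=False)
    return np.isin(float_ids, val_floats)

def leave_region_out(lon, lat, lon_range, lat_range):
    """Validation mask for all observations inside a lon/lat box.

    The box is a plain rectangle and does not handle wrap-around
    at the date line.
    """
    return ((lon >= lon_range[0]) & (lon <= lon_range[1]) &
            (lat >= lat_range[0]) & (lat <= lat_range[1]))
```

In both cases the training set is the complement of the mask, so no held-out float (or region) contributes observations to training.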

Some important problems we didn’t get to
Using position and T/S information simultaneously in one analysis.
Models/analysis for processes in three spatial dimensions and time.
Statistical models that take account of eddies.
Statistical models for mixed layer depth.
