Improving Physical Parametrizations in Climate Models using Machine Learning
October 3, 2018
George Mason University
Nathan Kutz (Applied Math)
Chris Bretherton (Applied Math and Atmospheric Sciences)
What is a weather model?
If it is true, as any scientist believes, that
subsequent states of the atmosphere develop from
preceding ones according to physical laws, one will
agree that the necessary and sufficient conditions
for a rational solution of the problem of
meteorological prediction are the following:
1. One has to know with sufficient accuracy the
state of the atmosphere at a given time.
2. One has to know with sufficient accuracy the
laws according to which one state of the
atmosphere develops from another.
— Vilhelm Bjerknes (1904)
High resolution models produce realistic clouds
Giga-LES (Khairoutdinov et al. 2009)
100x100 km with 100 m grid
Can only solve at coarse resolution
Image from NOAA
How we usually build parametrizations
High resolution simulation or observations
Climate models have biases in mean state
Hwang and Frierson (2013)
…and in variability (e.g. MJO)
Jiang, X. et al. (2015)
Parametrization is a function approximation
Inputs: q, T, U, V, … → Outputs: Q1, Q2, Q3
Machine learning builds black boxes
• Many 1000s of parameters
• Need a lot of data
• Designed to be trained, not interpreted
• Examples: Decision trees, neural networks, support vector machines
Easy to tune/train (many parameters) ↔ Easy to interpret (few parameters)
Training Approach 1
1. Use finite differences to compute residual tendencies
2. Train neural network:
Inputs: q, s, SHF, LHF, TOA radiation → Outputs: Q1, Q2
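Step 1 above can be sketched as follows. This is a minimal numpy illustration: the variable names (humidity q, dry static energy s) and the idea of subtracting a known large-scale tendency follow the slides, while the array shapes and function names are assumptions.

```python
import numpy as np

def apparent_sources(s, q, g_ls_s, g_ls_q, dt):
    """Estimate apparent heating Q1 and moistening Q2 by finite differences.

    s, q           : arrays of shape (time, levels), snapshots dt apart
    g_ls_s, g_ls_q : known large-scale (advective) tendencies, same shape
    dt             : sampling interval in seconds
    """
    # total tendency estimated from successive snapshots
    ds_dt = (s[1:] - s[:-1]) / dt
    dq_dt = (q[1:] - q[:-1]) / dt
    # residual = total tendency minus the known large-scale forcing
    q1 = ds_dt - g_ls_s[:-1]
    q2 = dq_dt - g_ls_q[:-1]
    return q1, q2
```

The network is then trained to map (q, s, SHF, LHF, TOA) to these approximate residuals.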
The diagnostic performance is good!
What about prognostic performance?
Single column dynamics
Uh oh…temperature = 1035 K after 1 day
Is this why most past studies only show diagnostic results?
Is fitting the tendencies the right approach?
• Assumes that model dynamics are continuous in time
• But they are not (Donahue and Caldwell, 2018)
• Assumes moist physics tendencies are available
• Not true for DYAMOND outputs
• Not true for observations
• Does not ensure good predictions over many time steps
Fitting the approximate Q1 and Q2 is equivalent to
minimizing one-step error
…but that does not ensure longer term performance
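The distinction between the one-step and accumulated objectives can be sketched as follows. This is a toy numpy version using forward Euler; in the actual work the loss is minimized for a neural network with automatic differentiation, which this sketch omits.

```python
import numpy as np

def one_step_loss(f, x_true, g_ls, dt):
    """MSE of a single forward-Euler step taken from every snapshot.

    x_true : (n+1, dims) trajectory of training snapshots
    g_ls   : (n, dims) prescribed large-scale forcing between snapshots
    """
    x_pred = x_true[:-1] + dt * (g_ls + f(x_true[:-1]))
    return np.mean((x_pred - x_true[1:]) ** 2)

def multi_step_loss(f, x_true, g_ls, dt, n_steps):
    """Accumulated error of an n_steps rollout started from x_true[0].

    Penalizing the whole trajectory, rather than each single step, is the
    change that stabilized the scheme; errors made early in the rollout
    feed into later steps and are therefore penalized repeatedly."""
    x = x_true[0]
    loss = 0.0
    for k in range(n_steps):
        x = x + dt * (g_ls[k] + f(x))
        loss += np.mean((x - x_true[k + 1]) ** 2)
    return loss / n_steps
```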
The scheme is now stable
Simulated time series at x=1000 km, y=5198 km
Matches NG-Aqua better than CAM
Community Atmosphere Model, Version 5 (CAM5), Single Column Mode (default physics)
Humidity Anomaly (from true zonal mean) (g/kg)
Implementation in a GCM
Weather forecasting tests
Coarse Resolution Atmospheric Model
Coarse resolution model (cSAM)
• System for Atmospheric Modeling
• 160 km resolution
• "Dry" anelastic dynamics
• Advection of water vapor
• Virtual temperature effect on buoyancy
• Damping + diffusion
• Meridionally varying Coriolis force
• Double precision important for mass conservation
• 3 layers of 256 neurons each
• Trained with full global NGAqua
Estimating large scale forcing for training
• SAM has advection and diffusion
• To compute known forcing using SAM:
1. Initialize SAM with data at time t: x(t)
2. Evolve forward for 10 minutes
3. SAM forcing = (x(t+10 min) – x(t))/10 min
• Could also account for radiation and other model physics
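The three steps above can be sketched as follows; `step_dry_model` stands in for coarse-resolution SAM's advection and diffusion, and the 10-minute interval follows the slide. Everything else is an assumption for illustration.

```python
import numpy as np

def estimate_forcing(step_dry_model, x0, dt=600.0):
    """Estimate the resolved 'large-scale' forcing by running the coarse
    model briefly and differencing the states.

    step_dry_model : callable advancing the state by dt seconds
                     (stands in for coarse SAM's advection + diffusion)
    x0             : coarse-grained state snapshot at time t
    """
    x1 = step_dry_model(x0, dt)   # evolve forward for 10 minutes
    return (x1 - x0) / dt         # finite-difference forcing estimate
```

The same procedure could include radiation or other model physics simply by adding those tendencies inside `step_dry_model`.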
Another possible solution: Data Assimilation
• Filter the unresolvable scales in the training data using Digital Filter
Initialization (Lynch 1997)
• Defines model error (Kaas et al. 1998; Rodwell and Palmer, 2007)
From: Rodwell and Palmer (2007)
Conclusions and Future Directions
• Neural network parametrization for unresolved physics
• Numerically stable in single column and spatially extended mode
• Only trained from coarse resolution snapshots of a GCRM
• To do list
• Predict momentum source (Q3)
• More realistic training data (DYAMOND will be great resource)
• Stochastic parametrization
• Data Assimilation
• Brenowitz, N. D. & Bretherton, C. S. Prognostic Validation of a Neural Network Unified Physics Parameterization. Geophys. Res. Lett. 45 (2018).
• S. Rasp, M. S. Pritchard, P. Gentine, Deep learning to represent
subgrid processes in climate models. Proc. Natl. Acad. Sci. U. S. A.
115, 9684–9689 (2018).
Noah Brenowitz (firstname.lastname@example.org)
Neural networks are a popular machine learning method
graphic: Krueger and Bogenschutz
Mass conserving initial conditions
• 4km SAM and 160km cSAM are on a staggered grid
• Zonal velocity and meridional velocity are averages along the
interfaces (40 grid points)
• All other variables: averaged over full 160 km box (40^2 grid points)
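The two averaging rules above can be sketched as follows, assuming the velocities live on cell interfaces of a staggered (C-type) grid. The factor of 40 follows the slide (4 km → 160 km); the function names and array layout are assumptions.

```python
import numpy as np

def coarsen_cell(field, factor=40):
    """Average a cell-centered field (e.g. humidity) over
    factor x factor fine-grid boxes (40^2 points per coarse cell)."""
    ny, nx = field.shape
    return field.reshape(ny // factor, factor,
                         nx // factor, factor).mean(axis=(1, 3))

def coarsen_u(u, factor=40):
    """Average zonal velocity along coarse-cell western interfaces:
    keep every factor-th column in x and average factor points along y."""
    ny, nx = u.shape
    return u[:, ::factor].reshape(ny // factor, factor, -1).mean(axis=1)
```

Averaging the velocities along interfaces (rather than over boxes) is what keeps the coarse-grained winds consistent with mass conservation on the staggered grid.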
Parametrized source on the equator
Super-parametrized CAM (SPCAM)
Inputs: u, v, w, q, T, SHF, LHF Outputs: Q1, Q2, TOA Radiative flux, Precip
Used a neural network. The diagnosed tendencies and precipitation match
SPCAM, but they show no prognostic tests.
…they do show nice prognostic results in a very recent manuscript.
Is using more layers another way to break the instability?
This is really nice work, but there are some caveats:
• Still uses true tendencies as inputs
• Not clear it will work with GCRM outputs
• Takes 8 hours to train (vs 5-20 minutes for our approach)
Example: Decision Trees
Not much convection
…but can be hard to interpret
Very similar to a lookup table…non-continuous outputs
Videos available here: https://cimss.ssec.wisc.edu/data/
I’d first like to acknowledge my co-mentor Nathan Kutz, and the following organizations for funding my work
But what is a weather/climate model?
The father of meteorology, Vilhelm Bjerknes, said this quote...which I will read.
In this talk I will focus on number 2 <click>
Traditionally, data has had its strongest impact on point 1, but in this work we focus on using data in point 2.
Now, on certain scales, we know the physical laws describing atmospheric motions well enough to create extremely realistic depictions of clouds and other small-scale atmospheric phenomena.
This is a visualization from a large-eddy-simulation performed using the SAM model on a 100m grid.
Now, this simulation uses a similar number of degrees of freedom for one 100x100 km box as a climate model does for the whole planet.
So it’s just not practical to perform simulations at that level of accuracy.
Instead of solving the fine resolution equations we end up solving the coarse-resolution equivalent. Here, I am showing the coarse resolution budgets of dry static energy, water vapor mixing ratio, and momentum. These budgets are forced by some terms which are known on the coarse-grid scale, such as advection and the Coriolis force.
Bar denotes the average over a GCM grid box. The grad bar is a coarse grained approximation of the gradient.
The right hand side denotes the residual terms, which are known as the apparent heating (Q1), which includes radiative and latent heating, and the apparent moistening (Q2), which describes phase changes of water. Likewise there is a residual term in the momentum budget (Q3) related to turbulent mixing.
The task of parametrization is to close these budgets and write Q1–Q3 as functions of the coarse-resolution variables alone.
A scientist will interpret a complex set of simulations or observations and reduce them to a simplified flow chart. This flow chart will then be turned into a Fortran code and, if successful, integrated into a climate model.
While this approach has yielded great benefits, it is evident that it can be improved. For example, the pace of development in parametrization is very slow. In particular, deep convective parameterizations have not changed much in the past two decades even though many new high resolution models and observations have become available. I will argue that the main bottleneck in this process is this person…the scientist.
Perhaps as a result, many climate models have large biases in the mean state.
Here I am showing a plot of the famous double-ITCZ bias, which is a precipitation bias in the tropics.
On the left are observations, and on the right is the multi-model mean of many climate models.
There are also problems in the variability. In particular, models struggle to simulate the main intraseasonal oscillation in the tropics, the Madden Julian Oscillation. Each panel of this plot depicts lag regression diagrams of the rainfall averaged between 10S and 10N.
One of the MJO’s most basic features is its eastward propagation. Unfortunately, many models can not replicate this.
So how do we move past this situation?
Essentially, parametrization is a function approximation problem. It takes in the profiles of humidity, temperature, etc. and produces the apparent source terms Q1, Q2, and Q3.
By using simplified physical models, we essentially limit the class of functions that we can use to just those which are interpretable by humans at the cost of accuracy. Here, I will argue that treating parametrizations as a black box machine learning problem can help us move past this bottleneck. While many may not like black boxes, the fact is many or most physically based parametrizations already are black boxes to most climate modelers. This is even more true when considering the fully coupled interactions of many different parametrizations.
The defining attribute of machine learning is the complexity of the models it uses.
To illustrate this point, I would like to use this complexity spectrum. On the right are models, like most current parametrizations, that are easy to interpret because they have few parameters. On the other side are models that have many thousands of parameters, which are easier to train. This might seem paradoxical, but in a higher dimensional parameter space there might be many more "right answers" than in a low-dimensional space.
Some examples include…<read slides>
Here is a toy example of how an excessively low-dimensional parameter space can orbit an optimal solution.
Let this space ship be a typical physics parametrization, and let our task be to land at the optimum. If the space ship only controls the azimuthal angle theta, then it will only ever orbit the optimum. To make it land, it will need at least one more parameter (the radius).
I’ve talked about the models a little bit, so let’s talk about the training datasets. This section will serve as an introduction to some past work on machine learning parametrization.
So how do we train a machine learning model with many 1000s of parameters? The key is to base everything off of a suitable training dataset.
The simplest possible training dataset is the output of a known deterministic function. This is known as emulation.
Indeed, Krasnopolsky trained a simple neural network model to reproduce the output of the Community Atmosphere Model's radiation code.
Other, more interesting datasets are simulations which explicitly resolve atmospheric convection by having grid sizes less than 4 km or so. Because of their computational expense, these are usually run in limited area domains.
One issue with this data is that the tendencies we aim to estimate (e.g. Q1, Q2, and Q3) are not outputs of the model, and have to be estimated.
Krasnopolsky trained a neural network to reproduce the mean effect of clouds in a CRM simulation, and showed some interesting diagnostic results. In other words, given only the humidity and temperature profiles, they could predict the instantaneous amount of precipitation and other fields like cloud fraction.
In my view, the most exciting datasets that we can use for machine learning are from global cloud resolving models.
One important thing to note, is that there are no plans to output tendency information.
Unfortunately, unlike the emulation work, neither they nor (until very recently) anyone else tested their methods prognostically.
In this section, I will discuss some work we recently published which prognostically tests a NN parametrization in a single column setting.
We published this work recently in GRL.
Our training dataset is a simplified GCRM simulation performed using the System for Atmospheric Modeling (by Marat Khairoutdinov). We call it the near-global aquaplanet simulation, or NGAqua.
Mention the aspects of the simulation
Panels: diurnal cycle; mid-latitude cyclones; tropics (with eastward disturbance)
Now, this dataset was not designed for machine learning, so they did not store moisture or temperature budget information. And the sampling interval is relatively coarse in time.
For this study, we focused on the tropics, where moist atmospheric convection plays an especially important role.
We coarse grain the 4 km NGAqua data onto a 160km grid, representative of a typical climate model resolution. This resolution was partially chosen so that the precipitation de-correlates over a time scale of about 3-6 hours.
Within each grid box we have measurements at different heights of humidity and dry static energy.
There are 34 vertical grid points for each variable, which we concatenate into a single 68 dimensional vector.
The data are normalized by subtracting the mean and dividing by a mass-weighted variance.
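The normalization step can be sketched as follows. The exact weighting used in the paper is not spelled out here, so this sketch assumes per-layer masses `rho_dz` and a single mass-weighted scale per variable; the function name and shapes are assumptions.

```python
import numpy as np

def normalize(x, rho_dz):
    """Remove the sample mean and scale a stacked profile variable by one
    standard deviation derived from a mass-weighted variance, so that all
    vertical levels share a common scale.

    x      : array of shape (samples, levels)
    rho_dz : mass of each model layer, shape (levels,)
    """
    anomaly = x - x.mean(axis=0)
    w = rho_dz / rho_dz.sum()
    # single scalar scale: mass-weighted average of per-level variances
    sigma = np.sqrt((w * anomaly.var(axis=0)).sum())
    return anomaly / sigma
```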
So, as I said before, we did not have the moist physics tendencies available, so the first approach we took was to estimate them using finite differences in time, and then train a neural network which depends on the humidity, static energy, surface fluxes, and TOA flux to predict these approximate tendencies.
We find that when we train this way, we get good diagnostic performance.
In the top panel I show the approximate apparent heating rates at a given location on the equator. The bottom shows the heating rates predicted by the neural network. The timing and magnitude of the strong heating events (which are associated with heavy rainfall and atmospheric convection) are quite similar.
So what about prognostic performance? We test this in the single column model framework. First, a bit of notation: x are the prognostic variables which will be evolved forward in time, y are the auxiliary inputs which we prescribe. Likewise, gLS(t) are the coarse-grid advection tendencies, which we compute offline using centered differences.
Then, the single-column dynamics treat g_LS and y as prescribed forcings which vary in time, but do not respond to the local fluctuations of humidity and temperature. This provides a simple prognostic testing framework that can be simulated cheaply.
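The single-column test just described can be sketched as follows; `f_nn` stands in for the trained network, and the forward-Euler update and function names are assumptions for illustration.

```python
import numpy as np

def run_single_column(f_nn, x0, g_ls, y, dt):
    """Integrate the single-column dynamics dx/dt = g_LS(t) + f_NN(x, y(t)).

    g_ls and y are prescribed time series: they vary in time but do not
    react to the column state, which is what makes this test cheap.
    """
    xs = [x0]
    for k in range(len(g_ls)):
        x = xs[-1] + dt * (g_ls[k] + f_nn(xs[-1], y[k]))
        xs.append(x)
    return np.array(xs)
```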
However, when we run the model forward for many time steps we get dramatically unrealistic results.
The first problem: we know model dynamics are not continuous in time. Changing the order of physical parametrizations in a GCM can dramatically alter results.
As it turns out, approximating Q1 and Q2 using finite differences over three hours is equivalent to minimizing the one-step error made by the SCM.
…but that does not ensure longer term performance. To fix this, we instead minimize the accumulated error, over multiple time steps.
When we train the model in this way, we obtain a stable scheme.
Moreover, this scheme matches the NG-Aqua data much better than the single column version of CAM does.
Here are moisture anomalies…
And temperature ones…
The next logical step is to couple this neural network to large-scale dynamics. In other words, put the neural network in a coarse resolution model.
To do this, we run the CRM we used to generate the NGAqua dataset, at the coarse-graining resolution of 160km. We chose to use this model, rather than a more established GCM like CAM, because the NGAqua data was not generated by a primitive equation model, and did not have spherical geometry.
Perhaps the most challenging aspect of the work I am about to present has been developing coarse-resolution capabilities for the SAM model. This includes adding hyperdiffusion and reading initial conditions from a netCDF file. For instance, we found that coarse-resolution SAM does not conserve mass when compiled in single precision.
Another difficulty is that the neural network code is in Python, but the model is in Fortran. So I have had to develop an interface layer for calling Python from within Fortran.
On the other hand, the neural network training is pretty similar. We use a slightly deeper neural network with 3 layers. It was not much more difficult to expand the training region outside of the tropics.
We also changed how we estimated the resolved/known “large-scale” forcing. Before, I just computed the advective derivative with centered differences, but now I actually use SAM to compute this. This procedure could also account for radiation and other model physics.
After training the model with these large-scale forcings, we then ran a simulation in coarse resolution SAM with the neural network on.
Here are 6 snapshots of the precipitable water field through the course of this simulation.
The hope is that these snapshots replicate the original NGAqua simulation very closely.
And when we compute quantitative notions of forecast error, this seems to be the case. Here I have plotted the domain RMS forecast errors for 4 atmospheric variables. The blue curve is the simulation with the NN, the gray curve is a dry-dynamics only simulation, and the orange curve is a persistence forecast.
At certain heights, the NN significantly improves the accuracy for the humidity and temperature.
It struggles with the horizontal winds, and does an especially poor job with the vertical winds.
One big issue, that was visible a couple slides ago is that the ITCZ narrows dramatically in the NN+SAM simulation.
Here, I plot the zonal average of net precipitation in the final forecast day of the NN simulation compared with the all-time mean from NGAqua. The precipitation is much more peaked near the “equator” of the model.
This can also be seen by looking at zonal average vertical motion in the tropics.
Since we do not predict the source of momentum, and SAM does not have a cumulus momentum transport or boundary layer scheme, we think that these problems might not be caused by problems with the NN's prediction of temperature and humidity forcing.
In particular, it appears that there is not enough momentum mixing in the tropical lower atmosphere. For instance, the easterlies above the equator are too weak near the surface, but too strong aloft. Conceivably, this could alter the lower-level convergence in the tropical regions.
An obvious solution is to parametrize CMT either using an off the shelf scheme or a neural network.
Another problem is a loss of stochasticity compared to the training data.
This is clear by comparing the net precipitation after 1 day between NGAqua and the NN+SAM simulation. The NGAqua simulation is much more speckled and has much higher peak rain rates.
One obvious solution is to make the output of the parametrization stochastic.
We are starting up some collaborations to do this.
Another possible solution is to make the training data itself less stochastic, by using data assimilation to filter the "unresolvable" scales.
Indeed, some studies already discuss using data assimilation to define model error
This hints at a potential algorithm….
This also provides a loose coupling between the neural network training process and the large-scale dynamics.
This provides a path towards using observations!
Now, I have always found that description a little confusing, so here are the equations. The values at each layer are just a linear transform of the previous layer followed by applying a nonlinear function. This nonlinear function is known as an activation function.
So the parameters of a neural net are the weight matrices A_1,… and the constant vectors b_1…
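The equations just described can be written out as a short sketch. The ReLU activation and the linear final layer are assumptions for illustration; the paper's description only fixes the linear-transform-plus-activation structure with parameters A_i and b_i.

```python
import numpy as np

def relu(z):
    """A common activation function (assumed here for illustration)."""
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Evaluate a fully connected network: each hidden layer applies a
    linear transform (A_i, b_i) followed by the activation; the final
    layer is a plain linear transform."""
    for A, b in zip(weights[:-1], biases[:-1]):
        x = relu(A @ x + b)
    return weights[-1] @ x + biases[-1]
```

The trainable parameters are exactly the weight matrices A_1, … and bias vectors b_1, … mentioned above.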
Also the bias is much smaller with the neural network.
For example, the model needs to be initialized with winds that obey mass conservation.
We trained a neural network with these new large-scale forcings and ran a coarse-resolution simulation.
More recently, Pierre Gentine, Mike Pritchard, and co-authors.
Increasing nhid tends to capture the short term errors.
Increasing nhid has much more effect on the 64-step errors.
Increasing T is necessary for longer-time accuracy, but does not hurt the short-term performance very much.
Our results are actually pretty sensitive to the type of time-stepping we use to advance the ODE forward in time.
We find the best results when using split-in-time time stepping.
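The split-in-time stepping can be sketched as follows. The forward-Euler physics update and the function names are assumptions; the point is only that the dynamics and the NN physics are applied as separate sub-steps rather than one summed tendency.

```python
import numpy as np

def split_step(x, dyn_step, nn_source, dt):
    """One split-in-time step: first advance the resolved ('dry')
    dynamics, then apply the neural-network physics as a separate
    update on the intermediate state."""
    x = dyn_step(x, dt)            # dynamics sub-step
    return x + dt * nn_source(x)   # physics sub-step on the new state
```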