Fine particulate matter (PM2.5) is a mixture of air pollutants that, at high concentrations, has adverse effects on human health. An interesting statistical problem is to estimate exposure to these pollutants across the entire US; such estimates can inform policy and decision making. During the workshop, we will work with two major sources of air quality data that the EPA uses to estimate pollutant exposures: monitoring data and output from the Community Multiscale Air Quality (CMAQ) model. The monitoring stations provide fairly accurate measurements of the pollutants; however, they are sparse in space and take measurements at a coarse time resolution, typically 1-in-3 or 1-in-6 days. The CMAQ model, on the other hand, provides daily concentration levels of each component with complete spatial coverage on a grid; these model outputs, however, need to be evaluated against and calibrated to the monitoring data. We will explore these air quality data for the summer of 2011 and brainstorm statistical models for estimating air pollutant exposures.
Group members: Meixi Chen, Vincent Gonzales, Alan Ji, Chandni Malhotra, Hongyu Mao, Sharon Sung
2. What is PM2.5?
May 2018, SAMSI Workshop
PM2.5 particles bypass the nose and throat and penetrate deep into the lungs and circulatory system.
➢ Particulate matter
➢ Diameter < 2.5 micrometers
➢ About 3% of the diameter of a human hair
3. PM2.5 Monitoring Systems in the US
➢ Monitoring stations are sparse
➢ Need predictions for locations without a monitoring station
4. What is CMAQ?
CMAQ (Community Multiscale Air Quality) is a numerical air quality model used to predict the concentration of air pollutants.
5. CMAQ Inaccuracies
● Regions of high topography showed the largest errors
● Areas with more monitoring stations had the best predictions
6. The Big Question/Goal
What is the best statistical model for predicting PM2.5 concentration levels across the entire U.S. using numerical model outputs and other available covariates?
13. Variable Selection & Transformation
31 plots: each covariate vs. PM2.5 (the response variable)
Residual plots: residuals of PM2.5 vs. each covariate
Adjusted R-squared: used to decide which covariate to exclude when two are highly correlated
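One way to read the adjusted R-squared criterion is sketched below: fit PM2.5 against each member of a correlated pair and keep the covariate with the higher adjusted R-squared. This is a minimal illustration, not the group's actual procedure; the names `cov_a` and `cov_b` and all numbers are invented, and single-covariate fits are used only to keep the example short.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def adjusted_r2(y, y_hat, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Toy values, invented for illustration (not workshop data).
pm25 = [8.0, 10.0, 12.5, 15.0, 17.0, 21.0]
candidates = {
    "cov_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "cov_b": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],  # correlated with cov_a
}

scores = {}
for name, x in candidates.items():
    intercept, slope = fit_line(x, pm25)
    fitted = [intercept + slope * a for a in x]
    scores[name] = adjusted_r2(pm25, fitted, n_predictors=1)

keep = max(scores, key=scores.get)  # covariate to retain from the pair
```

With more covariates in the model, the same comparison would be made between the two full models that each drop one member of the pair.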
14. Variable Selection & Transformation
Residual Plot
➢ Regress PM2.5 on CMAQ (PM2.5 ~ CMAQ)
➢ Plot the residuals against the other covariates
In the end, 15 covariates were selected.
[Figure: residuals plotted against boundary layer height]
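The residual check on this slide can be sketched as follows: fit PM2.5 ~ CMAQ, take the residuals, and see whether a remaining covariate still carries signal. The sketch reports a correlation instead of drawing the plot; all values are invented, and `boundary_layer_height` is used only because it is the covariate named in the figure.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy values, invented for illustration (not workshop data).
cmaq = [5.0, 8.0, 12.0, 15.0, 20.0, 25.0]
pm25 = [6.0, 7.5, 9.0, 13.0, 14.0, 19.0]
boundary_layer_height = [1200.0, 900.0, 1100.0, 600.0, 800.0, 400.0]

intercept, slope = fit_line(cmaq, pm25)
residuals = [p - (intercept + slope * c) for c, p in zip(cmaq, pm25)]

# A clearly non-zero correlation suggests the covariate explains
# variation in PM2.5 that CMAQ alone does not capture.
r_blh = pearson(boundary_layer_height, residuals)
```

A scatter plot of `residuals` against each covariate gives the same information visually and also reveals non-linear patterns that a correlation would miss.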
15. Random Forest
No. of trees: 500
No. of variables tried at each split: 5
Mean of squared residuals (log scale): 0.1075135
% Variance explained: 72.25
Some fun math behind the models…
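The two summary figures for the random forest are linked by a simple identity, assuming they come from a standard out-of-bag (OOB) summary such as the one printed by R's randomForest package (the slide does not say which software was used). Here $y$ denotes log PM2.5, and the implied variance is derived from the two reported numbers, not reported in the source.

```latex
\[
\%\,\text{Var explained}
  = 100\left(1 - \frac{\mathrm{MSE}_{\mathrm{OOB}}}{\widehat{\operatorname{Var}}(y)}\right)
\quad\Longrightarrow\quad
\widehat{\operatorname{Var}}(y)
  \approx \frac{0.1075135}{1 - 0.7225} \approx 0.387 .
\]
```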
16. Spatial Model
Covariance matrix
Conditional normality
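The equations for this slide did not survive in the text. The labels presumably refer to the standard Gaussian setup, given here in textbook form as an assumption. All symbols are introduced for illustration: $Y_0$ is PM2.5 at an unmonitored location, $\mathbf{Y}$ the vector of monitored values, $\Sigma$ their covariance matrix, and $\mathbf{c}$ the covariances between $Y_0$ and $\mathbf{Y}$.

```latex
\[
\begin{pmatrix} Y_0 \\ \mathbf{Y} \end{pmatrix}
\sim N\!\left(
  \begin{pmatrix} \mu_0 \\ \boldsymbol{\mu} \end{pmatrix},
  \begin{pmatrix} \sigma_0^2 & \mathbf{c}^{\top} \\ \mathbf{c} & \Sigma \end{pmatrix}
\right)
\]
\[
Y_0 \mid \mathbf{Y} = \mathbf{y}
\;\sim\;
N\!\left(
  \mu_0 + \mathbf{c}^{\top}\Sigma^{-1}(\mathbf{y} - \boldsymbol{\mu}),\;
  \sigma_0^2 - \mathbf{c}^{\top}\Sigma^{-1}\mathbf{c}
\right)
\]
```

The conditional mean is a weighted combination of the observed values, with weights $\Sigma^{-1}\mathbf{c}$ determined by the covariance structure.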
17. The Kriging Concept
“The basic idea of kriging is to predict the value of a function at a given point by computing a weighted average of the known values of the function in the neighborhood of the point.”
(Source: Wikipedia)
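The weighted-average idea in the quote can be made concrete with a small simple-kriging sketch. It assumes a known constant mean and an exponential covariance function with made-up parameters; the slides do not state which covariance model or kriging variant the group used, and the site coordinates and values are invented.

```python
import math


def exp_cov(p, q, sill=1.0, range_=1.0):
    """Exponential covariance between two points (assumed model)."""
    return sill * math.exp(-math.dist(p, q) / range_)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def simple_krige(sites, values, target, mean):
    """Predict at `target` as mean + weighted sum of deviations from mean."""
    cov = [[exp_cov(p, q) for q in sites] for p in sites]
    c0 = [exp_cov(p, target) for p in sites]
    weights = solve(cov, c0)  # kriging weights
    pred = mean + sum(w * (v - mean) for w, v in zip(weights, values))
    return pred, weights


# Toy monitoring sites and PM2.5 values, invented for illustration.
sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
values = [10.0, 12.0, 9.0, 15.0]
mean = sum(values) / len(values)

pred_new, weights = simple_krige(sites, values, (0.5, 0.5), mean)
pred_at_site, _ = simple_krige(sites, values, (1.0, 0.0), mean)
```

Without a nugget term, the prediction at a monitored site reproduces the observed value, and nearby sites receive larger weights than distant ones.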
22. 5-Fold Cross-Validation
➢ Divide the whole dataset into 5 folds
➢ Train the model on 4 of them and hold out the fifth
➢ Make predictions on the held-out fold and compute the MSE and MAD; repeat so that each fold is held out once
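The steps above can be sketched with a simple linear regression as the model. This is an illustration only: MAD is taken here to mean the mean absolute deviation of the prediction errors, the errors are pooled over folds before averaging, and the data are simulated; none of these choices is confirmed by the slides.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_cv(x, y, k=5, seed=0):
    """Return (MSE, MAD) of held-out predictions over k folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    sq_err, abs_err = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:  # predict only on the held-out fold
            err = y[i] - (b0 + b1 * x[i])
            sq_err.append(err ** 2)
            abs_err.append(abs(err))
    return sum(sq_err) / len(sq_err), sum(abs_err) / len(abs_err)


# Simulated data for illustration (not workshop data).
rng = random.Random(1)
cmaq = [float(v) for v in range(1, 21)]
pm25 = [2.0 + 0.8 * c + rng.gauss(0.0, 1.0) for c in cmaq]

mse, mad = k_fold_cv(cmaq, pm25)
```

Because every observation is predicted exactly once by a model that never saw it, the resulting MSE and MAD estimate out-of-sample error and can be compared across models.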
23. Model Comparison Based on Cross-Validation
Model               MSE      MAD
CMAQ                51.734   4.681
Simple LR           23.220   3.103
Random forest       13.254   2.177
Spatial analysis     9.734   1.718
24. Prediction Maps for Jan 1st, 2011
MSE of CMAQ = 51.734, MSE of LR = 23.220, MSE of RF = 13.254, MSE of Spatial Analysis = 9.734
25. Prediction Maps for Aug 1st, 2011
MSE of CMAQ = 51.734, MSE of LR = 23.220, MSE of RF = 13.254, MSE of Spatial Analysis = 9.734
26. Summary
➢ Spatial analysis gives the best predictions (lowest cross-validated MSE and MAD)
➢ Potential improvements:
○ Examine interactions between covariates
○ Try other machine learning methods, such as neural networks
○ Seasonal analysis
○ Mid-west?
27. Special thanks to Yawen, Amanda, Suman, and Doug