PROJECT REPORT
Regression Analysis
MTH 416A
Indian Institute of Technology, Kanpur
Department of Mathematics and Statistics
2016-17
SPECULATING DAILY MAXIMUM
CARBON MONOXIDE (CO) LEVEL
Supervised by: Dr. Sharmishtha Mitra
Authored by: Bhanu Yadav (13198), Nakul Surana (13418)
SPECULATING DAILY MAXIMUM CARBON MONOXIDE (CO) LEVEL
Bhanu Yadav & Nakul Surana
Department of Mathematics and Statistics, Indian Institute of Technology, Kanpur, India
Email: bhanuydv@iitk.ac.in, nakuls@iitk.ac.in
__________________________________________________________________________________
Objective
Considering the rising pollution levels in the city and their harmful effects on children's health, this study
aims to predict carbon monoxide (CO) levels from the readings of various sensors. A CO level between 2 ppm
and 9 ppm is considered tolerable.
Forecasting Description
The aim is to forecast the daily maximum carbon monoxide (CO) level for the next week (5th April 2005 to
11th April 2005) using data on various air pollutants, including CO, recorded from 10th March 2004 to
4th April 2005.
Data Description
The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical
sensors embedded in an Air Quality Chemical Multisensor Device. Data were recorded from 10th March
2004 to 4th April 2005 (about one year). Ground-truth hourly averaged concentrations of CO, Non-Methane
Hydrocarbons (NMHC), Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by
a co-located certified reference analyzer.
Source: UCI machine learning repository- Air Quality data set
(http://archive.ics.uci.edu/ml/datasets/Air+Quality#)
Attribute Information
0 Date (DD/MM/YYYY)
1 Time (HH.MM.SS)
2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)
3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
4 True hourly averaged overall Non-Methane Hydrocarbons (NMHC) concentration in micro g/m^3 (reference analyzer)
5 True hourly averaged Benzene concentration in micro g/m^3 (reference analyzer)
6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
7 True hourly averaged NOx concentration in ppb (reference analyzer)
8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
9 True hourly averaged NO2 concentration in micro g/m^3 (reference analyzer)
10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
12 Temperature in °C
13 Relative Humidity (%)
14 AH: Absolute Humidity
Key Characteristics
The data contain missing values, which are coded as "-200". The data show monthly seasonality and also
vary with the day of the week, which could be due to the varying number of automobiles (emitting air
pollutants) on weekdays and weekends.
Variables in the Data Set
1. Y variable: CO(GT)
2. Possible X variables: PT08.S1(CO), NMHC(GT), C6H6(GT), PT08.S2(NMHC), NOx(GT),
PT08.S3(NOx), NO2(GT), PT08.S4(NO2), PT08.S5(O3), T, RH and AH
3. The X variable NMHC(GT) had more than 90% missing values and was excluded from the set of possible X variables.
4. All other variables had less than 10% missing values.
5. An isolated missing value was replaced by the previous hour's value; consecutive missing values were
replaced by the values from the same hours one week earlier.
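The imputation rule in item 5 can be sketched as follows. This is a minimal illustration, assuming the hourly series is a plain Python list with -200 as the missing marker. The function name `impute_hourly` and the reading of "last week-hour" as "the same hour one week earlier" are our assumptions, not the authors' code.

```python
MISSING = -200            # missing-value marker used in the data set
HOURS_PER_WEEK = 24 * 7

def impute_hourly(values):
    """Fill missing hourly readings (hypothetical helper).

    Isolated gap -> copy the previous hour's reading.
    Run of gaps  -> copy the reading from the same hour one week earlier.
    Gaps that cannot be filled (no earlier data) are left as MISSING.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if values[i] != MISSING:
            i += 1
            continue
        # Locate the end of this run of missing values in the raw data.
        j = i
        while j < n and values[j] == MISSING:
            j += 1
        run_length = j - i
        for k in range(i, j):
            if run_length == 1 and k >= 1:
                filled[k] = filled[k - 1]
            elif k >= HOURS_PER_WEEK and filled[k - HOURS_PER_WEEK] != MISSING:
                filled[k] = filled[k - HOURS_PER_WEEK]
        i = j
    return filled
```

A gap within the first week of a long run has no earlier value to copy, so the sketch leaves it unfilled rather than guessing.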
Plot of CO vs Time
[Figure: CO concentration vs. time. X-axis: day index (e.g., day 1 is 5th April '04); Y-axis: concentration of CO in ppm.]
The plot suggests a seasonality of CO with respect to the day of the year. To account for it, we introduce
a dummy variable:
X4 = 1 if the day of the year is between 200 and 300
   = 0 otherwise
Plot of CO vs Week Time
[Figure: CO concentration vs. day of the week. X-axis: days 1 to 7 (Monday to Sunday); Y-axis: concentration of CO in ppm; different colors represent different months of the year.]
The plot suggests a seasonality of CO with respect to the day of the week. To account for it, we introduce
a dummy variable:
X5 = 1 if the day is Monday, Tuesday, Saturday or Sunday
   = 0 otherwise
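The two dummy variables can be built as in the sketch below. The function names are hypothetical, the day index is taken as an input because the report's day numbering is not fully specified, and treating the endpoints 200 and 300 as inclusive is an assumption.

```python
from datetime import date

def monthly_dummy(day_index):
    """X4: 1 if the report's day index lies between 200 and 300, else 0.

    Inclusive endpoints are an assumption; the report does not say.
    """
    return 1 if 200 <= day_index <= 300 else 0

# date.weekday() numbers the days Monday=0 ... Sunday=6.
FLAGGED_WEEKDAYS = {0, 1, 5, 6}   # Monday, Tuesday, Saturday, Sunday

def weekly_dummy(day):
    """X5: 1 if `day` is a Monday, Tuesday, Saturday or Sunday, else 0."""
    return 1 if day.weekday() in FLAGGED_WEEKDAYS else 0
```

For example, `weekly_dummy(date(2005, 4, 4))` gives 1 because 4th April 2005 was a Monday.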
Input Variables:
Linear correlation coefficients among the analyzed species, computed from the field-recorded data:
Pair         r
NMHC-C6H6    0.98
CO-NOx       0.78
CO-NO2       0.67
C6H6-NOx     0.72
C6H6-NO2     0.60
NOx-NO2      0.76
CO-C6H6      0.90
Regarding the benzene-NMHC coefficient, note that it was computed using only the first 8 days of
measurements, after which the NMHC-targeted analyzer went out of service.
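A linear (Pearson) correlation coefficient of the kind tabulated above can be computed as in this sketch. The function name `pearson_r` is hypothetical, and dropping every pair in which either value is the -200 missing marker is our assumption about how incomplete hours would be handled.

```python
import math

MISSING = -200   # missing-value marker used in the data set

def pearson_r(x, y):
    """Pearson correlation of two equally long series, using complete pairs only."""
    pairs = [(a, b) for a, b in zip(x, y) if a != MISSING and b != MISSING]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)
```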
After examining the available variables, we decided that the following variables can affect the CO
levels:
Regressors:
• Daily maximum C6H6 (lag 7)
• Daily maximum T (lag 7)
• Daily maximum AH (lag 7)
• Monthly dummy variables
• Weekly dummy variable
FAQ:
Q1: Why use lag 7?
Ans: So that the CO concentration can be forecast a week ahead.
Q2: Why use T and AH as regressors?
Ans: Temperature (T) and absolute humidity (AH) are among the key factors affecting the CO concentration
in the atmosphere, as indicated by the literature review and the correlation coefficients.
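The lag-7 construction can be sketched as below: hourly readings are first reduced to daily maxima, and each day's response is then paired with the regressor value observed seven days earlier. Both function names and the input layouts are assumptions made for illustration.

```python
def daily_maximum(hourly_records):
    """Reduce (day, value) pairs to a {day: maximum value} mapping."""
    daily = {}
    for day, value in hourly_records:
        if day not in daily or value > daily[day]:
            daily[day] = value
    return daily

def lagged_pairs(regressor_by_day, response_by_day, lag=7):
    """Pair each day's response with the regressor seen `lag` days earlier.

    Both inputs are lists ordered by consecutive calendar day.
    """
    pairs = []
    for t in range(lag, len(response_by_day)):
        pairs.append((regressor_by_day[t - lag], response_by_day[t]))
    return pairs
```

With a lag of 7, the regressor values for the last observed week are exactly what is needed to forecast the following week.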
Multiple Regression Analysis
Y = Xβ + ε (Model)
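For the model above, the least-squares estimate of β solves the normal equations (X'X)β = X'Y. The sketch below does this in plain Python for small problems. It is an illustration with hypothetical helper names, not the software used to produce the tables in this report.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols_fit(rows, y):
    """Return [intercept, slope_1, ...] for regressor rows and responses y."""
    X = [[1.0] + list(r) for r in rows]                 # prepend intercept column
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear(xtx, xty)
```

For data lying exactly on y = 1 + 2x, `ols_fit` recovers an intercept of 1 and a slope of 2 up to rounding error.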
Full analysis:
1. Coefficient table
              Estimate          SE               t-Stat            p-Value
(Intercept)   2.208278182       0.2283410655     9.670963816       8.19E-20
x1            0.1454091155      0.006748314924   21.54747032       1.11E-66
x2            -0.05443110318    0.01089796337    -4.994612418      9.22E-07
x3            -0.01977522209    0.2179249985     -0.09074324759    0.9277472139
x4            0.3087663162      0.1681926407     1.835789693       0.0672156882
x5            0.1594614238      0.1342518955     1.18777782        0.2357061988
Here x1 = daily maximum C6H6, x2 = daily maximum T, x3 = daily maximum AH (all at lag 7), x4 = monthly
dummy and x5 = weekly dummy.
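Each t-statistic in the coefficient table is the estimate divided by its standard error. The sketch below recomputes them from the tabulated values and lists the terms whose reported p-value falls below a 5% level. The 5% threshold is an assumption, since the report does not state a significance level.

```python
# (estimate, standard error, reported p-value), copied from the coefficient table
coefficients = {
    "(Intercept)": (2.208278182, 0.2283410655, 8.19e-20),
    "x1": (0.1454091155, 0.006748314924, 1.11e-66),
    "x2": (-0.05443110318, 0.01089796337, 9.22e-07),
    "x3": (-0.01977522209, 0.2179249985, 0.9277472139),
    "x4": (0.3087663162, 0.1681926407, 0.0672156882),
    "x5": (0.1594614238, 0.1342518955, 0.2357061988),
}

ALPHA = 0.05   # assumed significance level

# t-statistic = estimate / standard error
t_stats = {name: est / se for name, (est, se, _) in coefficients.items()}

# Terms whose reported p-value is below the assumed level
significant = [name for name, (_, _, p) in coefficients.items() if p < ALPHA]
```

At the assumed 5% level, only the intercept, x1 and x2 have reported p-values below the threshold.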
2. ANOVA
              SumSq          DF     MeanSq         F               pValue
Total         1416.818082    364    3.892357369    NaN             NaN
Model         936.0920362    5      187.2184072    139.8122876     5.41E-82
Residual      480.726046     359    1.339069766    NaN             NaN
Lack of fit   458.2127127    352    1.301740661    0.4047461339    0.9826200603
Pure error    22.51333333    7      3.216190476    NaN             NaN
SSres = 480.726045999331 || SSreg = 936.092036192449 || SSTotal = 1416.81808219178
MSres = 1.34 || MSreg = 187.22
R2 = 0.660700232413988 || R2_adjusted = 0.655974608910004
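The summary quantities follow directly from the sums of squares and degrees of freedom in the ANOVA table. The short check below reproduces them; the variable names are ours.

```python
# Sums of squares and degrees of freedom from the ANOVA table
ss_total, df_total = 1416.818082, 364
ss_model, df_model = 936.0920362, 5
ss_resid, df_resid = 480.726046, 359
ss_lof, df_lof = 458.2127127, 352      # lack of fit
ss_pe, df_pe = 22.51333333, 7          # pure error

ms_model = ss_model / df_model         # mean square for regression
ms_resid = ss_resid / df_resid         # residual mean square
f_model = ms_model / ms_resid          # overall F statistic

f_lack_of_fit = (ss_lof / df_lof) / (ss_pe / df_pe)

r2 = ss_model / ss_total
r2_adjusted = 1 - (ss_resid / df_resid) / (ss_total / df_total)

print(round(r2, 4), round(r2_adjusted, 4), round(f_model, 2))
```

The adjusted R2 penalizes the residual mean square by the degrees of freedom, which is why it is slightly smaller than R2.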
Residual Analysis:
Normal probability plot of the residuals: This is a graph designed so that the cumulative normal distribution
plots as a straight line. Let t[1] < t[2] < . . . < t[n] be the externally studentized residuals ranked in
increasing order. If we plot t[i] against the cumulative probability P_i = (i - 1/2)/n, i = 1, 2, . . . , n, on the
normal probability plot, the resulting points should lie approximately on a straight line.
Plot of residuals against the fitted values ŷ_i: A plot of the residuals (preferably the externally studentized
residuals, t_i) versus the corresponding fitted values ŷ_i is useful for detecting several common types of model
inadequacies.
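The coordinates of a normal probability plot can be generated as in the sketch below, which ranks the residuals and attaches the cumulative probability P_i = (i - 1/2)/n to each. It assumes the externally studentized residuals have already been computed; the function name is hypothetical, and the standard normal quantile is added for convenience when plotting on ordinary axes.

```python
from statistics import NormalDist

def normal_probability_points(studentized_residuals):
    """Return (t_[i], P_i, z_i) triples for a normal probability plot.

    t_[i] : i-th smallest residual
    P_i   : cumulative probability (i - 1/2) / n
    z_i   : standard normal quantile at P_i
    """
    ranked = sorted(studentized_residuals)
    n = len(ranked)
    std_normal = NormalDist()
    points = []
    for i, t in enumerate(ranked, start=1):
        p = (i - 0.5) / n
        points.append((t, p, std_normal.inv_cdf(p)))
    return points
```

If the errors are normal, plotting t_[i] against z_i should give points close to a straight line.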
Conclusions:
Y = 2.2 + 0.15 (Max C6H6) - 0.05 (Max T) - 0.02 (Max AH) + 0.31 (Monthly dummy) + 0.16 (Weekly dummy)
R2_adjusted = 0.656 => our model explains about 66% of the variability in the data.
The normal probability plot of the residuals behaves properly.
The plot of residuals against the fitted values ŷ_i behaves properly too.
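The fitted equation from the conclusions can be evaluated as in the sketch below. The function and argument names are hypothetical; the coefficients are the rounded values quoted above, and the first three inputs are the lag-7 daily maxima.

```python
def predict_max_co(max_c6h6, max_t, max_ah, monthly_dummy, weekly_dummy):
    """Predicted daily maximum CO from the rounded fitted equation."""
    return (2.2
            + 0.15 * max_c6h6
            - 0.05 * max_t
            - 0.02 * max_ah
            + 0.31 * monthly_dummy
            + 0.16 * weekly_dummy)

# Illustrative inputs only (not taken from the data set)
example = predict_max_co(max_c6h6=10.0, max_t=20.0, max_ah=1.0,
                         monthly_dummy=0, weekly_dummy=1)
```

Because the coefficients are rounded to two decimals, predictions from this sketch differ slightly from those of the full-precision coefficient table.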
References:
S. De Vito, E. Massera, M. Piga, L. Martinotto, G. Di Francia, "On field calibration of an electronic nose for
benzene estimation in an urban pollution monitoring scenario."
UCI Machine Learning Repository, Air Quality Data Set, https://archive.ics.uci.edu/ml/datasets/Air+Quality