This document describes a study that uses the Mahalanobis Taguchi System (MTS) to create a heat vulnerability index for New York City neighborhoods. The MTS is a statistical method that combines Mahalanobis distances and Taguchi orthogonal arrays to identify important variables. The study collects geographical, socioeconomic, and tree cover data for NYC census tracts. MTS is applied to calculate Mahalanobis distances between "normal" and "outside" tract groups. Taguchi arrays are used to test variables. The results are inconclusive due to limited data. More medical data is needed to better differentiate tract groups and identify significant variables for a heat vulnerability index.
Heat Vulnerability Index Using Mahalanobis Taguchi System
Heat Vulnerability Indexes for Urban Environments Using the
Mahalanobis Taguchi System
Danton Zhao
Advisors: Professor Lindsey Van Wagenen & Professor Michel Lobenberg
INTRODUCTION
The Mahalanobis Taguchi System (MTS) is a multivariate statistical method that combines the
Mahalanobis distance with testing based on Taguchi orthogonal arrays. The Mahalanobis distance
is a metric of how far a sample deviates from the normal/training group. Should the training
group fall within a Gaussian distribution, the mean of the training group’s Mahalanobis
distances should be approximately equal to 1. The Taguchi orthogonal arrays are two-level
matrices that specify which variables to keep on or turn off while running multiple variations of
an experiment. These arrays aim to increase testing efficiency by reducing the number of trials
needed to categorize variables as beneficial or harmful to experimental data. By
incorporating the two into MTS, the essential variables for the Mahalanobis distance can be
determined from the Signal-to-Noise Ratios (SNRs) yielded by the Taguchi testing.
Previous research into climate change has shown that the frequency and risk of heat-related
illnesses will rise alongside temperatures. However, these cases are not evenly distributed
geographically; in other words, some areas will be more at risk than others[2]. Many studies
have attempted to create indexes for this risk by analyzing historical health and climate
data[3]. In 2011, an extensive case study of vulnerabilities for communities in New York City
was concluded, which utilized Z-scores to build a composite vulnerability index[1]. These
Z-scores were drawn from variables related not only to climate and health data, but also to
geographical and socioeconomic data. The Mahalanobis distance is similar to the Z-score in
the sense that it assigns a metric to the deviation of data, but it differs in that it uses a
multivariate approach. This potentially enables us to create a heat
vulnerability index more efficiently, utilizing data from a broader selection of sources.
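The difference between per-variable Z-scores and a multivariate distance can be illustrated with a small R sketch. The data here are simulated and purely illustrative (they are not the study's data): two strongly correlated variables are generated, and a point that looks unremarkable on each variable separately turns out to be far from the group once the correlation is taken into account. The sketch uses base R's `mahalanobis()`, which returns the squared distance.

```r
# Simulated, illustrative data: two variables with correlation of about 0.9.
set.seed(7)
x <- rnorm(500)
y <- 0.9 * x + sqrt(1 - 0.9^2) * rnorm(500)
X <- cbind(x, y)

# A point that is moderately high on x but moderately low on y.
p <- c(1.5, -1.5)

# Univariate view: one Z-score per variable.
z <- (p - colMeans(X)) / apply(X, 2, sd)

# Multivariate view: squared Mahalanobis distance from the group.
d2 <- mahalanobis(p, colMeans(X), cov(X))

print(round(z, 2))
print(round(d2, 1))
```

Each Z-score stays below 2 in absolute value, while the squared Mahalanobis distance is large, because the point contradicts the positive relationship between the two variables.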
REFERENCES
[1] Jaime Madrigano, Kazuhiko Ito, Sarah Johnson, Patrick L. Kinney, and Thomas Matte. A Case-Only Study of
Vulnerability to Heat Wave Related Mortality in New York City (2000-2011). Environ Health Perspect 123;
doi:10.1289/ehp.1408178.
[2] G. Brooke Anderson and Michelle L. Bell. Heat Waves in the United States: Mortality Risk during Heat Waves and
Effect Modification by Heat Wave Characteristics in 43 U.S. Communities. Environ Health Perspect 119:210-218;
doi:10.1289/ehp.1002313.
[3] California Environmental Public Tracking Network, 20140624, Heat-related inpatient hospitalizations and
emergency room visits among California residents, May-September, 2000-2012.
[4] Quantum GIS Development Team (2016). Quantum GIS Geographic Information System. Open Source Geospatial
Foundation Project. http://qgis.osgeo.org
[5] R Core Team (2016) R: A Language and Environment for Statistical
[6] MATLAB and Statistics Toolbox Release 2012b, The MathWorks, Inc., Natick, Massachusetts, United States.
METHODOLOGY
Collecting Data
Member tracts of the normal (low vulnerability) and outside groups were identified via their corresponding
colors, which were extracted from the raster image provided by the New York City case study[1]. Using
112 points, the raster image was mapped to and aligned with the 2010 New York City TIGER shapefile. The
primary color for each census tract was found by using QGIS’s Zonal Statistics plugin to identify
the most frequently occurring pixel value within the red, green, and blue color bands. Tracts colored
blue for low vulnerability were classified as the normal group, while tracts colored red
or orange were classified as the outside group. Although many previous studies on heat vulnerability made
use of health data from state repositories, the resource and time constraints of this project prevented us
from utilizing that data. Instead, publicly available geographical and socioeconomic data were collected and
processed.
RESULTS
CONCLUSION AND FUTURE GOALS
• No separation in Mahalanobis distances could be identified with the selected variables.
• The negative SNRs from the Taguchi arrays support the previous statement.
• Findings are inconclusive; more data, most likely medical data, will be needed.
• Hopefully, more complete datasets can be gathered in the future to increase the sample
size of eligible census tracts.
• Crime data may be an interesting topic to explore in relation to heat vulnerability.
Greenery coverage was found to be an important variable that helped to differentiate at-risk and
not-at-risk communities[1]. Trees classified with “Good” or “Excellent” health were imported into RStudio from
the 2005 Street Tree Census comma-separated value file, hosted by NYC Open Data. Approximately 86% of
all trees linked to Staten Island were not listed with a borough census tract code. To preserve
the integrity of the dataset, Staten Island trees were removed from the table alongside the other 17,573
unlisted trees. Tree count, mean tree diameter at breast
height (DBH), and the standard deviation of tree DBH for each listed tract code were aggregated into a data
frame.
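The per-tract aggregation described above can be sketched in base R as follows. This is a minimal illustration on a toy table, not the script used in the study: the column names `boro_ct`, `status`, and `dbh` are hypothetical stand-ins for the corresponding fields of the street tree file.

```r
# Toy stand-in for the street tree table; column names are hypothetical.
trees <- data.frame(
  boro_ct = c("1000100", "1000100", "1000100", "3000200", "3000200", NA),
  status  = c("Good", "Excellent", "Poor", "Good", "Good", "Good"),
  dbh     = c(10, 14, 30, 6, 8, 12),
  stringsAsFactors = FALSE
)

# Keep only healthy trees that are listed with a tract code.
healthy <- trees[trees$status %in% c("Good", "Excellent") &
                   !is.na(trees$boro_ct), ]

# Group the DBH values by tract, then summarize each group.
by_tract <- split(healthy$dbh, healthy$boro_ct)
tree_stats <- data.frame(
  boro_ct    = names(by_tract),
  tree_count = unname(lengths(by_tract)),
  mean_dbh   = unname(sapply(by_tract, mean)),
  sd_dbh     = unname(sapply(by_tract, sd)),
  stringsAsFactors = FALSE
)

print(tree_stats)
```

Note that `sd()` returns `NA` for a tract with a single tree, so such tracts would need to be handled before the standard deviation is used as a variable.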
Looking at the colored vulnerability index map, the at-risk communities appear to be primarily located in
very urban neighborhoods. Gentrification has somewhat shifted the demographics of the people living
within these neighborhoods, but we still wanted to observe whether this geographical data played any significant
role in the vulnerability. Tax lot data was extracted from the November 2011 file (11v2) within the NYC
PLUTO archive. To simplify the process of merging datasets, an R script was written that outputs the
corresponding 7-digit borough tract code when a tract county and tract code are supplied. From this data,
we extracted several variables describing the area committed to certain land use categories in each census
tract into a separate data frame. Number of buildings, area of land allocated to hospitals, and area of land
allocated to schools were a few of the variables tested.
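A lookup of this kind might look like the sketch below. It is an illustration rather than the original script: the function name `boro_ct()` is hypothetical, and the inputs are assumed to be a county FIPS code and a 6-digit census tract code. The borough digits follow the standard New York City numbering (1 Manhattan, 2 Bronx, 3 Brooklyn, 4 Queens, 5 Staten Island).

```r
# Hypothetical helper: build a 7-digit borough tract code from a county
# FIPS code and a 6-digit census tract code (both assumed input formats).
boro_ct <- function(county_fips, tract_code) {
  # County FIPS -> borough digit for the five NYC counties.
  boro <- c("061" = "1",  # New York County (Manhattan)
            "005" = "2",  # Bronx County
            "047" = "3",  # Kings County (Brooklyn)
            "081" = "4",  # Queens County
            "085" = "5")  # Richmond County (Staten Island)

  # Zero-pad so that numeric and character inputs behave the same.
  county <- sprintf("%03d", as.integer(county_fips))
  tract  <- sprintf("%06d", as.integer(tract_code))

  digit <- boro[county]
  if (anyNA(digit)) stop("unknown county FIPS code")

  unname(paste0(digit, tract))
}

print(boro_ct("061", "000100"))
print(boro_ct(c(47, 85), c(200, 300)))
```

Keeping the result as a character string preserves the code as a join key when merging the tax lot, tree, and socioeconomic tables.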
Socioeconomic data was extracted from the 2011 American Community Survey. Referring
back to the colored map, the at-risk communities also appear to be located in areas of low income.
Variables such as the per capita income, unemployment count, and time required for travelling to work
were among the collection of variables aggregated to test socioeconomic impact.
METHODOLOGY (CONT.)
Writing the Program for Analysis
The Mahalanobis distance is canonically expressed, for the ith row of a matrix with i rows and j columns, as:

$$MD_i = \frac{1}{k} Z_i C^{-1} Z_i^{T}$$

where k is the number of columns/variables being tested, Z is the standardized version of the
matrix (Z_i being its ith row), and C is the covariance matrix for Z.
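The quantities k, Z, and C defined above can be exercised in a short R sketch. It assumes the scaled form commonly used in MTS, in which the distance for row i is the quadratic form of that row of Z with the inverse of C, divided by k. The function name `scaled_md()` and the simulated data are illustrative only.

```r
# Scaled Mahalanobis distance for every row of a standardized matrix Z,
# given the covariance matrix C of Z (assumed MTS form: quadratic form / k).
scaled_md <- function(Z, C) {
  k <- ncol(Z)
  # Row-wise quadratic form Z_i %*% solve(C) %*% t(Z_i), without a loop.
  rowSums((Z %*% solve(C)) * Z) / k
}

# Simulated training group: 200 samples, 3 variables.
set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)
Z <- scale(X)   # column-wise standardization
C <- cov(Z)     # covariance of the standardized data

md <- scaled_md(Z, C)
print(mean(md))
```

For a training group scored against its own mean and covariance, the average of these scaled distances is (n - 1) / n, here 0.995, which matches the expectation that the training group's mean distance is close to 1.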
In R, arithmetic between a matrix and a one-dimensional vector pairs the Nth row of the matrix
with the Nth element of the vector, because the vector is recycled down each column. A vector of
column means or standard deviations therefore cannot simply be applied to the matrix directly. In order to
standardize an incoming numerical data frame or matrix, the following operations would need to be
implemented.
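One way to carry out this standardization is sketched below; it is an assumption about the approach, since the original operations are not reproduced here. The matrix is transposed so that each variable becomes a row, which lets R's recycling line up the Nth mean and standard deviation with the Nth variable, and the result is transposed back. The function name `standardize()` is hypothetical.

```r
# Column-wise standardization that works with R's recycling rule.
standardize <- function(X) {
  X <- as.matrix(X)
  mu    <- colMeans(X)       # one mean per variable
  sigma <- apply(X, 2, sd)   # one standard deviation per variable

  # After t(), variables are rows, so element N of mu and sigma is
  # recycled against row N (variable N). Transpose back at the end.
  t((t(X) - mu) / sigma)
}

set.seed(2)
X <- matrix(rnorm(50 * 4, mean = 10, sd = 3), ncol = 4)
Z <- standardize(X)

print(round(colMeans(Z), 10))
print(apply(Z, 2, sd))
```

Base R's `scale()` and `sweep()` give the same result; the explicit version is shown only to make the row-versus-column behavior visible.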
The Taguchi portion of the MTS program consisted of looping the Mahalanobis distance
calculations on the outside group. Each loop would use a different row of the orthogonal Taguchi
array, which was provided by the DoE.base R package, as a reference for which variables would
be tested. The SNRs for each loop were aggregated into a one-dimensional vector, which would
later be used to find the average SNR for trials where a variable was on and for trials where a
variable was off.
Fig 3: Transformed Case-study Raster
Fig 2: Points georeferenced to TIGER shapefile
Fig 4: Summary of Tree Counties
Fig 5: Mahalanobis Distances of Variables with Highest SNR Difference
Fig 6: Average SNR from Taguchi Trials
Figure 1: A Multivariate Gaussian distribution with Mahalanobis Distances Depicted on a Color Scale
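The Taguchi loop described under "Writing the Program for Analysis" can be sketched in R as follows. Several parts are assumptions made for illustration: the data are simulated, a small two-level L4 array is hard-coded instead of being generated with DoE.base, level 1 is taken to mean "variable on", and the larger-the-better SNR commonly used in MTS is assumed because the poster does not state which SNR was used.

```r
# Scaled Mahalanobis distance of 'outside' rows, measured against the
# mean, standard deviation, and covariance of the 'normal' group.
md_vs_normal <- function(normal, outside) {
  mu    <- colMeans(normal)
  sigma <- apply(normal, 2, sd)
  zn <- sweep(sweep(normal,  2, mu), 2, sigma, "/")
  zo <- sweep(sweep(outside, 2, mu), 2, sigma, "/")
  rowSums((zo %*% solve(cov(zn))) * zo) / ncol(normal)
}

# Simulated groups: only variable 1 separates outside from normal.
set.seed(42)
normal  <- matrix(rnorm(100 * 3), ncol = 3)
outside <- matrix(rnorm(30 * 3), ncol = 3)
outside[, 1] <- outside[, 1] + 4

# Hard-coded L4 two-level orthogonal array (1 = on, 2 = off; assumed coding).
L4 <- matrix(c(1, 1, 1,
               1, 2, 2,
               2, 1, 2,
               2, 2, 1), ncol = 3, byrow = TRUE)

# One SNR per run, using only the variables switched on in that run.
snr <- apply(L4, 1, function(run) {
  keep <- which(run == 1)
  md <- md_vs_normal(normal[, keep, drop = FALSE],
                     outside[, keep, drop = FALSE])
  -10 * log10(mean(1 / md))   # larger-the-better SNR (assumed)
})

# Gain per variable: average SNR when on minus average SNR when off.
gain <- sapply(1:3, function(v) {
  mean(snr[L4[, v] == 1]) - mean(snr[L4[, v] == 2])
})

print(round(snr, 2))
print(round(gain, 2))
```

In this simulated case variable 1 has the largest, positive gain, marking it as the variable worth keeping. Runs whose outside distances sit near zero produce negative SNRs, which is the pattern the poster reports for the selected variables.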
Mathematics Department
NYU Tandon School of Engineering
6 MetroTech Center, Brooklyn, NY 11201
Email: dantonz@smu.edu