Introduction
Clustering techniques can play an important role in analyzing
ecological data. Our goal was to apply these clustering techniques to a
large dataset consisting of plant species information from the US and
Canada and group the states, territories, and provinces into
geographic regions according to the distribution of plant species. We
used the Kulczynski dissimilarity coefficient to measure the similarity
and dissimilarity among the regions. These measures, along with the
average linkage clustering method, were then used to group the
regions. In order to visualize the clusters, we created a dendrogram
and identified 13 distinct clusters that best indicate plant species
trends across our target geographic area. Finally, we displayed the
clustered regions on a geographic map and compared them with
typical forest maps.
Preparing the Data
● We began with the dataset shown in Figure 1, in which each row
gives the genus and species of a plant and the columns list the
states in which that plant is present.
● We wrote a program to convert this raw data into a binary
presence-absence matrix which could be handled much more
naturally by R (Figure 2).
● Many of the species were only present in a couple of locations,
which would provide minimal information for clustering. We
decided to eliminate any plants that appeared in three or fewer
locations, which meant that the remaining plants showed stronger
ties between geographic regions.
● These steps reduced the dataset from 34,000 plants (rows) to
12,000.
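The preparation steps above can be sketched as follows. The authors worked in R and their conversion program is not shown, so this is only a minimal Python illustration under stated assumptions: the comma-separated line format, the lowercase location codes, the made-up species records, and the helper names `parse_records`, `filter_rare`, and `presence_absence` are all hypothetical.

```python
# Hypothetical raw records: one line per plant, the species name followed by
# the codes of the locations where it occurs (format assumed for illustration).
raw_lines = [
    "abies balsamea,me,nh,vt,ny,on,qc",
    "acer rubrum,me,nh,vt,ny,fl,ga",
    "rare plantus,hi",
    "yucca brevifolia,ca,nv,az,ut",
    "local weedus,fl,ga,al",
]

def parse_records(lines):
    """Map each species name to the set of locations where it is present."""
    records = {}
    for line in lines:
        name, *locations = [field.strip() for field in line.split(",")]
        records[name] = set(locations)
    return records

def filter_rare(records, min_locations=4):
    """Keep species present in at least `min_locations` locations.

    The default of 4 drops plants found in three or fewer locations.
    """
    return {name: locs for name, locs in records.items()
            if len(locs) >= min_locations}

def presence_absence(records):
    """Build a binary matrix: one row per species, one column per location."""
    locations = sorted(set().union(*records.values()))
    species = sorted(records)
    matrix = [[1 if loc in records[name] else 0 for loc in locations]
              for name in species]
    return species, locations, matrix

records = filter_rare(parse_records(raw_lines))
species, locations, matrix = presence_absence(records)
```

In this toy input, the two plants recorded in three or fewer locations are dropped before the matrix is built, so their locations contribute no columns unless another retained plant also occurs there.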
Figure 1
Figure 2
A Cluster Analysis of North American Locations Based on Plant Species
Sally Dufek and Parker Kain
Faculty Mentors: Dhanuja Kasturiratna and Aimee Krug
Kulczynski
To quantify the distance between each pair of states, we used the R package
“prabclus”, which includes a method for calculating the Kulczynski distance
between every pair of states. We chose the Kulczynski distance because it
works better for presence-absence data, particularly when there are
significantly more zeroes than ones. The Kulczynski distance is given by the
following formula, where A1 and A2 are the sets of plant species present in
the two states being compared and A1 ∩ A2 is the set of species they share:

$$d(A_1, A_2) = 1 - \frac{1}{2}\left(\frac{|A_1 \cap A_2|}{|A_1|} + \frac{|A_1 \cap A_2|}{|A_2|}\right)$$
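As a worked illustration of the Kulczynski distance, the sketch below computes it directly from two sets of species, assuming the standard definition (one minus the average of the two shared-species proportions). It is written in Python for illustration and is not the prabclus implementation; the function name `kulczynski_distance` and the toy species lists are made up.

```python
def kulczynski_distance(species_a, species_b):
    """Kulczynski dissimilarity between two sets of species.

    d = 1 - 0.5 * (|A1 & A2| / |A1| + |A1 & A2| / |A2|)
    """
    if not species_a or not species_b:
        raise ValueError("each location needs at least one species")
    shared = len(species_a & species_b)
    return 1.0 - 0.5 * (shared / len(species_a) + shared / len(species_b))

# Toy species lists (made up for illustration).
state_x = {"acer rubrum", "abies balsamea", "pinus strobus", "quercus alba"}
state_y = {"acer rubrum", "quercus alba"}
state_z = {"yucca brevifolia"}

print(kulczynski_distance(state_x, state_y))  # 0.25
print(kulczynski_distance(state_x, state_z))  # 1.0
```

In the first pair, the two shared species make up half of one list and all of the other, so the distance is 1 - 0.5 * (0.5 + 1) = 0.25. Two states with no species in common are at the maximum distance of 1.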
After creating a table of Kulczynski distances between states, we could cluster
the states using hierarchical clustering methods. We chose average linkage to
minimize the variance in distances between our clusters, as it focuses on a
central measure of location rather than merely the closest or furthest points.
We chose 13 clusters so that each cluster was distinct and tightly packed,
while limiting the number of one- and two-state clusters (Figure 3). Some of
the clusters are significantly larger than others, and the smallest contains
a single state, Hawaii. We then used the “maps” package in R to draw a
colored map of our areas of interest to visualize the clusters, and compared
it with a topographical map of US forests (Figure 5). The two maps are very
similar, differing only in minor border regions.
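The average linkage step described above can be illustrated with a small sketch. It is written in plain Python rather than R, does not reproduce the authors' analysis or build a dendrogram, and simply merges the two clusters with the smallest average pairwise distance until a chosen number of clusters remains. The function name `average_linkage`, the five location codes, and the toy distances are assumptions made for illustration.

```python
def average_linkage(labels, dist, n_clusters):
    """Agglomerative clustering with average linkage.

    labels: item names; dist: dict mapping frozenset({a, b}) to a distance;
    n_clusters: stop merging once this many clusters remain.
    """
    clusters = [[label] for label in labels]

    def between(c1, c2):
        # Average of all pairwise distances between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: between(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]

# Toy distances between five locations (made up for illustration).
pairs = {
    ("me", "nh"): 0.10, ("fl", "ga"): 0.15,
    ("me", "fl"): 0.60, ("me", "ga"): 0.65,
    ("nh", "fl"): 0.62, ("nh", "ga"): 0.63,
    ("me", "hi"): 0.95, ("nh", "hi"): 0.95,
    ("fl", "hi"): 0.95, ("ga", "hi"): 0.95,
}
dist = {frozenset(p): d for p, d in pairs.items()}
labels = ["me", "nh", "fl", "ga", "hi"]

three = average_linkage(labels, dist, n_clusters=3)
two = average_linkage(labels, dist, n_clusters=2)
```

With these toy distances, asking for three clusters groups the two northern codes together and the two southern codes together, and leaves `hi` on its own, since it is far from every other location.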
Results
Our 13 clusters (Figure 4) made sense when overlaid with various
topographical forest maps of the US and Canada (Figure 5). Our
geographic map was able to recreate typical forest maps rather
accurately with only small variations in borders, and in some places
allowed for smaller, more tightly packed clusters than the forest
maps.
Acknowledgements
We would like to thank the UR-STEM program at NKU for
funding our research project.
Figure 3
Figure 4
Figure 5
References
"Facts and Information about the Continent of North America." Natural History on the Net.
N.p., 07 July 2016. Web. 27 July 2016.
Hennig, Christian, and Bernhard Hausdorf. Design of Dissimilarity Measures: A New
Dissimilarity between Species Distribution Areas. UCL - London's Global University. UCL,
n.d. Web. 15 July 2016.
Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Boston:
Pearson Addison Wesley, 2005. Print.