Data Mining on a bird habitat dataset
Pusheng Zhang
Department of Computer Science & Engineering,
University of Minnesota, MN 55455
pusheng@cs.umn.edu
1. Introduction
Widespread use of spatial databases is leading to increasing interest in mining interesting, useful, but implicit spatial patterns. In addition to non-spatial attributes, spatial datasets are more complex and include extended objects such as coordinates, points, polygons, and cells in a grid. The values of attributes referenced by spatial locations tend to vary gradually over space, while classical data mining techniques, either explicitly or implicitly, assume that the data are independently generated. The spatial distributions of attributes sometimes have distinct local trends that contradict the global trends. This is most vivid in Figure 1(b), where the spatial distribution of vegetation durability is jagged in the western section of the wetland, compared to the overall impression of uniformity across the wetland. Thus spatial data is not only not independent, it is also not identically distributed.
Figure 1: (a) Training dataset: The geometry of the marshland and the locations of nests, (b) The spatial
distribution of vegetation durability over the marshland, (c) The spatial distribution of water depth, and (d)
The spatial distribution of distance to open water
We are given data about two wetlands, named Darr and Stubble, on the shores of Lake Erie in Ohio, USA, in order to predict the spatial distribution of a marsh-breeding bird, the red-winged blackbird. The data were collected from April to June in two successive years, 1995 and 1996. For data collection, a grid of roughly 5000 cells was superimposed on the wetlands; the cells of the grid had square geometries of size 5 meters by 5 meters. In each cell the values of several structural and environmental factors were recorded, including vegetation durability, stem density, stem height, distance to open water, distance to edge, and water depth. For each cell it was also recorded whether a red-winged blackbird nest was present. The attributes are shown in Table 1.
Attribute Type Description
Vegetation Durability (VD) Numeric Scale from 10 to 100
Stem Density (SD) Numeric In number of stems/m^2
Stem Height (SH) Numeric In centimeters above water
Distance to Open Water (DOP) Numeric In meters
Distance to Edge (DOE) Numeric In meters
Water Depth (WD) Numeric In centimeters
Nest Binary Record the presence/absence of bird nest in the cell
Table 1: The non-spatial attributes in the dataset
For the bird habitat dataset, our collaborators have put forward solutions [1,2] for some of the problems. Here we propose techniques to improve the performance of data mining on this dataset, and expect these techniques to carry over to many domains, including ecology and environmental management, public safety, transportation, business logistics, and tourism.
2. Challenges in the dataset
There are several challenges in this dataset; I will categorize them into spatial issues and non-spatial issues.
The challenges arising from the spatial nature of this dataset are the following:
1. Neighboring regions tend to have the same or similar properties.
2. Nests are not distributed everywhere, i.e., they are not identically distributed.
3. Nests tend to be close together, but not too crowded. Thus some locations near nests are nest-worthy, yet they are mislabeled as "Non-Nest".
4. Measures of prediction accuracy.
First, classical data mining deals with numbers and categories. In contrast, spatial datasets are more complex and include extended objects such as coordinates, points, and cells in a grid. Second, classical data mining algorithms often make assumptions (e.g., independent, identical distributions) and treat each input as independent of the other inputs, which violates the first law of geography: everything is related to everything else, but nearby things are more related than distant things [3]. In other words, the values of attributes of nearby spatial objects tend to systematically affect each other. In spatial statistics, an area within statistics devoted to the analysis of spatial data, this is called spatial autocorrelation [4], and spatial patterns often exhibit continuity and high autocorrelation among nearby features.
However, some instances in this dataset are mislabeled. For example, some locations near nests may be nest-worthy, yet they are labeled "Non-Nest" because no real nest was present during data collection, perhaps simply because the birds avoid overcrowding when constructing nests, even though those locations are also very suitable. We can observe in Figure 2 that some nest-worthy locations around the actual nests are mislabeled as non-nest.
Figure 2: Some nest-worthy locations are mislabeled as the non-nest class in the dataset
Moreover, the appropriate measure of spatial accuracy may be substantially different from classical measures. For a binary-class problem, the standard way to measure classification accuracy is to calculate the percentage of correctly classified objects. This measure may not be the most suitable for spatial data: if we apply the classical accuracy measure in Figure 3, the accuracies of the two models are the same. However, domain experts prefer (c) over (b), since the predicted nest locations are on average closer to actual nest locations. The classical accuracy measure cannot distinguish between Figures 3(b) and 3(c), and we need a measure of spatial accuracy to capture this preference.
Figure 3: (a) The pixels with actual nests, (b) Locations predicted by one model, (c) Locations predicted by another model. Prediction (c) is spatially more accurate than (b)
In addition to the challenges due to the spatial nature of the dataset, we also have some general challenges for the non-spatial issues:
5. Different class sizes
6. Outliers
7. Relative thresholds across regions
8. Temporal nature
In this dataset, nest instances are only a small proportion; the majority of records are non-nest locations. Since we are interested only in the nest pattern, it is very difficult to capture the features of nests. Furthermore, since we use C4.5 classification, performance with such different class sizes is poor. Thus we have to develop techniques to balance the numbers of the two classes.
Furthermore, there are outliers and noise in the dataset. By noise we mean contradictory records: those with the same (or very similar) values for the training attributes but belonging to different classes. This is actually similar to the mislabeling scenario associated with the spatial nature. By outliers we mean records whose attribute values are far from those of nearby points; e.g., there are some outliers on the upper margin in Figure 4 below.
Figure 4: Outlier in the attribute “Distance to Edge” distribution for the Dataset Darr 96
Different regions have different geometries and environmental conditions, and birds tend to choose the local maxima for their nests. For example, vegetation durability is one of the key attributes in the dataset. Consider two regions with different distributions: if we build a learning model on the region in Figure 5(a), we may derive a rule like "vegetation durability >= 90" for nests, but that model will fail to predict nest locations in the region of Figure 5(b), even though the rules for nests follow the same vegetation-durability pattern in both datasets. We have to capture the different thresholds across regions and datasets.
Figure 5: (a) Local maximum in one region, (b) Local maximum in another region
This dataset also has temporal issues: the attribute distributions and nest distributions differ for the same wetland in the two successive years. In Figure 6 we can observe that the vegetation durability distributions in the two years are different, and Figure 7 shows that even the nest location distributions differ. It is very difficult to capture the temporal pattern.
Figure 6: (a) Vegetation durability distribution for Darr wetland in 1995, (b) Vegetation durability
distribution for Darr wetland in 1996
Figure 7: Different nest locations distributions for the Darr wetland in 1995 and 1996
3. Related Work
Ozesmi [5,6] applied classical data mining techniques, logistic regression and neural networks, to build spatial habitat models. Logistic regression was used because the dependent variable is binary (nest/non-nest) and the logistic function "squashes" the real line into the unit interval; values in the unit interval can then be interpreted as probabilities. They concluded that logistic regression classified nests at a rate 24% better than random. The use of neural networks actually decreased classification accuracy, but led to a better understanding of the interactions between the explanatory and dependent variables.
There are two important reasons why, despite extensive domain knowledge, the results of classical data mining are not satisfactory. First, classical techniques, e.g., logistic regression, assume independent distributions for the properties of each pixel, ignoring spatial autocorrelation. Second, a more subtle but equally important reason is the objective function: the classification accuracy measure. Ozesmi still used the classical accuracy measure, which may not suit this spatial dataset.
In the spatial autoregressive model, the spatial dependencies of the dependent variable are directly modeled in the regression equation [10]. Assuming that the dependent values y_i are related to each other, i.e., y_i = f(y_j), i != j, the regression formula can be defined as

y = ρWy + βX + ε

Here W is the neighborhood-relationship contiguity matrix, and ρ is a parameter that reflects the strength of the spatial dependencies between the elements of the dependent variable. Spatial autocorrelation measures depend crucially on the choice and design of the contiguity matrix W, whose design is predicated on answering the question "what constitutes a neighborhood of influence?" Two common choices are the four- and the eight-neighborhood. Given a lattice structure and a point S in the lattice, the four-neighborhood assumes that S influences all cells that share an edge with S; in the eight-neighborhood, S influences all cells that share an edge or a vertex with S. A contiguity matrix is shown in Figure 8: the contiguity matrix of the uneven lattice (left) is shown on the right-hand side. The contiguity matrix plays a crucial role in the spatial extension of the regression model.
Figure 8: A spatial neighborhood and its contiguity matrix
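The four- and eight-neighborhood definitions above translate directly into a contiguity matrix; a sketch for a regular grid (binary W, no row normalization assumed):

```python
import numpy as np

def contiguity_matrix(nrows, ncols, eight=False):
    """Binary contiguity matrix W for a regular grid: four-neighborhood
    uses shared edges; eight-neighborhood also uses shared vertices."""
    n = nrows * ncols
    W = np.zeros((n, n))
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if eight:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    for r in range(nrows):
        for c in range(ncols):
            for dr, dc in steps:
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    W[r * ncols + c, rr * ncols + cc] = 1
    return W

W4 = contiguity_matrix(3, 3)               # interior cell: 4 neighbors
W8 = contiguity_matrix(3, 3, eight=True)   # interior cell: 8 neighbors
print(int(W4[4].sum()), int(W8[4].sum()))  # -> 4 8
```

Boundary cells get fewer neighbors automatically, which is one reason some formulations row-normalize W before fitting the SAM.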
We will refer to the regression formula above as the spatial autoregressive model (SAM). Notice that when ρ = 0, this equation collapses to the classical regression model. The benefits of modeling spatial autocorrelation are many: (1) the residual error will have much lower spatial autocorrelation; with the proper choice of W, the residual error should, at least theoretically, have no systematic variation. (2) If the spatial autocorrelation coefficient is statistically significant, it quantifies the presence of spatial autocorrelation and indicates the extent to which variation in the dependent variable y is explained by the average of neighboring observation values. (3) Finally, the model will have a better fit.
SAM can deal with spatial challenges 1 and 2, capturing the spatial information in the regression. In [1,2], PLUMS, a framework for spatial data mining on this dataset, was proposed. PLUMS also uses a spatial autoregressive model with a proper contiguity matrix W. Moreover, it introduces a new spatial accuracy measure, ADNP (Average Distance to Nearest Prediction), defined as the average distance from each actual nest location to the nearest predicted nest location. In addition to challenges 1 and 2, PLUMS thus deals with challenge 4, the spatial accuracy measure. However, some challenges remain unsolved, so we propose the following techniques for this dataset:
• Smoothing using eight/four neighbors, a Gaussian distribution, and iterative relabeling algorithms to solve challenges 1, 2, and 3 (mislabeling).
• Balancing the numbers of the two classes to solve challenge 5 (different class sizes).
• Relabeling using iterative rule-based, clustering, or regression algorithms to solve challenge 3.
• For relative thresholds (challenge 7) across different regions, using peak selection in a given window and choosing different thresholds for different regions in the raster. We can also add an attribute to the dataset capturing the local ranking of each attribute, and include it in the classification or regression.
• Eliminating the detected outlier data for challenge 6.
• Temporal data mining for challenge 8.
In the following sections, I will elaborate on these techniques and show the experimental results.
4. Techniques used
4.1 Balancing the numbers of the two classes
In this dataset we are most interested in the nest patterns, but nest records are only a small proportion; e.g., the dataset Darr95 has 5372 records in total, of which only 85 are nests. Given the unbalanced numbers of nest and non-nest records, we can add more weight to the nest records to help capture the nest patterns. One possible solution is to randomly sample the same number of records from the non-nest class and mine the sample plus the nest records; the problem is that we lose information by choosing only a fraction of the non-nest locations. Another is to simply replicate the nest records according to the ratio between the number of non-nest and nest records, which yields roughly equal numbers for the two classes. Here I use the second method.
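The replication scheme above can be sketched as follows (the field name `Nest` is illustrative, standing in for the dataset's binary nest attribute):

```python
import random

def balance_by_replication(records, label_key="Nest", seed=0):
    """Replicate minority-class records by the majority/minority ratio
    so both classes have roughly equal counts."""
    pos = [r for r in records if r[label_key] == 1]
    neg = [r for r in records if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    ratio = len(majority) // len(minority)     # e.g. 5287 // 85 = 62
    balanced = majority + minority * ratio
    random.Random(seed).shuffle(balanced)
    return balanced

# Toy records mirroring Darr95's imbalance (85 nests vs 5287 non-nests).
data = [{"Nest": 1}] * 85 + [{"Nest": 0}] * 5287
b = balance_by_replication(data)
print(sum(r["Nest"] for r in b), len(b) - sum(r["Nest"] for r in b))
# -> 5270 5287, matching the counts reported in Table 3
```

Replicating by the integer ratio (62 here) gives roughly, not exactly, equal class counts, which suffices for the decision-tree learner.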
Here I use the data from the Darr wetland in 1995 (Darr95) as the training dataset, which includes 85 nest locations out of 5372 records in total, and the data from the same year but a different wetland (Stubble95) as the testing dataset, which includes 30 nest locations out of 1818 records.
Using the original dataset Darr95 with C4.5 classification, we cannot capture any nest feature or predict any nest location; the model always predicts non-nest everywhere. The decision tree generated by C4.5, shown in Figure 9, is just one node classified as non-nest. Even though the accuracy of this model is not too bad, 5287/5372 = 0.984, its recall is 0. Hence the model is useless, since it cannot predict any nest location.
Non-Nest
Figure 9: Decision tree for the classification on the original dataset Darr95
Actual\Predicted Nest Non-Nest
Nest 0 85
Non-Nest 0 5287
Table 2: Classification matrix on the original dataset Darr 95
Here I apply the balancing method to Darr95 with ratio = 62, yielding the balanced dataset shown in Table 3.
Nest Number Non-Nest Number
Before Balancing 85 5287
After Balancing 5270 5287
Table 3: Datasets before and after balancing
Then we build the learning model on the balanced Darr95 dataset. The training results are very good, as follows:
Actual\Predicted Nest Non-Nest Precision = 0.91
Nest* 5270 0 Recall = 1
Non-Nest 508 4779
Table 3: Training results for the balanced dataset of Darr95
* Note: Here the nest number is the number after balancing
Since we are interested mainly in the nest records, we prefer high recall, and on the training data we achieve perfect recall. As for the testing results, we can see from Table 4 that the model predicts 6 out of 30 nest locations.
Actual\Predicted Nest Non-Nest Precision = 0.05
Nest 6 24 Recall = 0.2
Non-Nest 119 1669
Table 4: Testing results for dataset Stubble95
Compared to the poor classification performance on the original dataset, balancing the dataset achieves a large gain for classification, especially for decision-tree-based methods. It is a very useful technique, and I will combine it with other techniques to address the remaining challenges.
4.2 Relabeling using neighborhood information
4.2.1 Algorithm description
Since we know that some nest-worthy locations near nests are mislabeled as non-nest, we want to relabel them as nest according to neighborhood information. For the neighborhood, we can adopt the immediate four-neighbors or eight-neighbors. We assign each nest cell a weight of 1 and assume it affects its immediate four/eight neighbors, so we assign the neighbors weights according to their distance, as shown in Figure 10.
Figure 10: Weight distributions for nest cell (a) Four-neighbor, (b) Eight-neighbor
We then calculate the cumulative weight for each non-nest cell and choose a threshold to discretize the results into the two classes. Thus we can claim more nest locations. We can see from Figure 11 that the locations relabeled from non-nest to nest are always near several nests.
Figure 11: After relabeling based on the eight-neighbor scheme, 22 non-nest locations near nests are relabeled as nest locations (discretization threshold 0.5) in dataset Darr95 (red are nest locations, blue are non-nest locations)
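A minimal sketch of this neighbor-weight relabeling; since Figure 10's exact distance-based weights are not reproduced here, edge neighbors get an assumed weight of 0.5 and corner neighbors 0.35:

```python
import numpy as np

def relabel_by_neighbors(nest_mask, threshold=0.5):
    """Each nest cell (weight 1) spreads weight to its eight neighbors
    (assumed: 0.5 for edge neighbors, 0.35 for corner neighbors); a
    non-nest cell whose cumulative weight exceeds `threshold` is
    relabeled as nest."""
    weights = {(-1, 0): 0.5, (1, 0): 0.5, (0, -1): 0.5, (0, 1): 0.5,
               (-1, -1): 0.35, (-1, 1): 0.35, (1, -1): 0.35, (1, 1): 0.35}
    acc = np.zeros(nest_mask.shape)
    nr, nc = nest_mask.shape
    for r, c in zip(*np.nonzero(nest_mask)):
        for (dr, dc), wt in weights.items():
            rr, cc = r + dr, c + dc
            if 0 <= rr < nr and 0 <= cc < nc:
                acc[rr, cc] += wt
    return nest_mask | ((acc > threshold) & ~nest_mask)

grid = np.zeros((5, 5), dtype=bool)
grid[2, 1] = grid[2, 3] = True            # two nests one cell apart
new = relabel_by_neighbors(grid)
print(bool(new[2, 2]))                    # cell between the nests: 0.5 + 0.5 > 0.5
```

Only cells influenced by several nearby nests cross the threshold, matching the observation above that relabeled locations lie near several nests.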
4.2.2 Experiment Results
Based on the relabeled dataset, we run C4.5 classification and obtain the following training and testing results:
Actual\Predicted Nest Non-Nest Precision = 0.88
Nest* 15 102 Recall = 0.128
Non-Nest 2 5253
Table 5: Training results for the balanced dataset of Darr95
* Note: Here the nest number is the number after balancing
Actual\Predicted Nest Non-Nest Precision = 0.03
Nest 5 25 Recall = 0.17
Non-Nest 154 1634
Table 6: Testing results for dataset Stubble95
However, the weight assignment is somewhat arbitrary and the performance is not good enough, so we want a more precise weight-assignment function. In the next section I introduce the Gaussian distribution scheme for smoothing.
4.3 Relabeling using a Gaussian distribution
4.3.1 Algorithm description
Similar to the four/eight-neighborhood scheme, we use a Gaussian distribution as the nest influence function for assigning weights to nearby locations. We distribute the weight of each nest location over a 7x7 window as shown in Figure 12, and calculate the cumulative weight for each non-nest location.
Figure 12: A 7x7 window for the nest influence using a Gaussian distribution
After calculating the cumulative weight for every non-nest location, we can observe the weight distribution shown in Figure 13.
Figure 13: Weight distribution after calculating the cumulative weight for every non-nest location
We then discretize the weight values into class labels based on a chosen threshold; Figure 14 shows the nest distribution, including the relabeled locations, after discretization.
Figure 14: Relabeling based on the threshold
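A sketch of the 7x7 Gaussian influence window and the cumulative-weight computation (the kernel width `sigma` is an assumed parameter; the report does not state it):

```python
import numpy as np

def gaussian_window(size=7, sigma=1.5):
    """Gaussian influence kernel for a nest cell, scaled so the nest
    cell itself has weight 1 (sigma is assumed, not from the report)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    k = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return k / k[half, half]

def cumulative_weights(nest_mask, kernel):
    """Add the kernel, centered at every nest cell, into a weight grid."""
    half = kernel.shape[0] // 2
    nr, nc = nest_mask.shape
    acc = np.zeros((nr, nc))
    for r, c in zip(*np.nonzero(nest_mask)):
        for dr in range(-half, half + 1):
            for dc in range(-half, half + 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nr and 0 <= cc < nc:
                    acc[rr, cc] += kernel[dr + half, dc + half]
    return acc

nests = np.zeros((9, 9), dtype=bool)
nests[4, 4] = True
w = cumulative_weights(nests, gaussian_window())
print(bool(w[4, 4] >= w[4, 5] >= w[4, 6]))   # influence decays with distance
```

Thresholding `w` at the cut-off chosen above then produces the relabeled nest class of Figure 14.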
4.3.2 Experiment Results
Based on the relabeled dataset, we first balance it using the technique from Section 4.1, then run C4.5 classification on the balanced dataset, obtaining the following training and testing results:
Actual\Predicted Nest Non-Nest Precision = 0.818
Nest* 4130 490 Recall = 0.894
Non-Nest 921 3791
Table 5: Training results for the balanced dataset of Darr95
* Note: Here the nest number is the number after relabeling and balancing
Actual\Predicted Nest Non-Nest Precision = 0.02
Nest 10 20 Recall = 0.33
Non-Nest 497 1291
Table 6: Testing results for dataset Stubble95
We can see that Gaussian smoothing achieves some gain compared to using the balancing technique alone.
4.4 Relabeling based on attribute similarity
Since the mislabeled records near nests hold the same or similar attribute values as those of nearby actual nest locations, we can relabel non-nest locations based on attribute similarity. The basic process is shown in Figure 15.
for each non-nest record:
    for each nest record:
        compare the corresponding attribute values of the nest record and the non-nest record
        if all attribute values of the non-nest record are similar to those of the nest record:
            relabel this non-nest location as nest; break
Figure 15: Algorithm for the relabeling based on attribute similarity
12.
Here we can set a threshold on the similarity of attribute values between two records, e.g., 2% or 5%. If we know the relative significance of each attribute, we can assign a different threshold per attribute. Since the spatial data exhibits high autocorrelation and contiguity, we can expect the relabeled records to lie near the actual nests.
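The loop of Figure 15 can be made concrete as follows, with a relative tolerance standing in for the 2-5% threshold mentioned above (the attribute names are illustrative):

```python
def relabel_by_similarity(nests, non_nests, attrs, tol=0.05):
    """Relabel a non-nest record as nest when every attribute value is
    within a relative tolerance `tol` of some nest record."""
    def similar(a, b):
        scale = max(abs(a), abs(b), 1e-9)    # avoid division by zero
        return abs(a - b) / scale <= tol
    relabeled = []
    for rec in non_nests:
        for nest in nests:
            if all(similar(rec[k], nest[k]) for k in attrs):
                relabeled.append(rec)
                break                        # one similar nest suffices
    return relabeled

attrs = ["VD", "WD"]
nests = [{"VD": 90.0, "WD": 20.0}]
non_nests = [{"VD": 91.0, "WD": 20.5},       # within 5% on both -> relabel
             {"VD": 40.0, "WD": 5.0}]        # far away -> keep non-nest
print(len(relabel_by_similarity(nests, non_nests, attrs)))   # -> 1
```

Per-attribute thresholds, as suggested above, would replace the single `tol` with a dictionary keyed by attribute.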
4.5 Iterative Relabeling using C4.5
4.5.1 Algorithm Description
Since we are interested mainly in the nest records, we focus on that desirable class. Before starting the relabeling process, we first balance the original dataset to help capture the nest pattern. We then construct a learning model using classification, obtain the rule set for the nest class, and apply the high-confidence rules to the original dataset to relabel non-nest records. In fact, we maintain a support threshold during the construction of the decision tree, so the derived rules also satisfy the support threshold. We then iterate the relabeling as follows.
Figure 16: Process graph of the iterative relabeling algorithm. Starting from the original dataset, each iteration (i) balances the dataset so the two classes have equal numbers, (ii) generates the rule set using C4.5, (iii) applies the rules with high confidence for the desirable class to the dataset, and (iv) relabels records of the non-desirable class to obtain the next dataset (Dataset-1, Dataset-2, ...). The process iterates until it converges to a stable dataset, i.e., no more records can be relabeled according to the stopping criteria, which yields the final dataset.
13.
The algorithm can be described as follows (initially i = 1):
1. Balance the dataset d_i to d_i'.
2. Construct the learning model on the balanced dataset d_i' using C4.5 rule classification and obtain the rule set.
3. Choose the rules for the desirable class with high confidence and support.
4. Apply the rules to the dataset d_i and relabel the records of the non-desirable class to get a new dataset d_{i+1}.
5. Monitor the classification performance on the new dataset.
6. Repeat the steps above until no more records can be relabeled or the recall on the dataset d_{i+1} begins to decrease.
Figure 17: Algorithm for iterative relabeling using classification
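A runnable skeleton of the Figure 17 loop, with the balancing, rule-learning, and rule-application steps injected as callbacks (the toy rule learner below is illustrative, not C4.5):

```python
def iterative_relabel(dataset, balance, learn_rules, apply_rules, recall_of,
                      max_iters=10):
    """Iterate: balance, learn rules, relabel; stop when nothing changes
    or recall on the new dataset starts to decrease (steps 1-6 above)."""
    best, best_recall = dataset, recall_of(dataset)
    for _ in range(max_iters):
        rules = learn_rules(balance(best))
        relabeled = apply_rules(rules, best)
        if relabeled == best or recall_of(relabeled) < best_recall:
            break                              # converged or recall dropped
        best, best_recall = relabeled, recall_of(relabeled)
    return best

# Toy run: records are (value, label); the learned "rule" is a threshold
# slightly below the lowest nest value, so near-nest records get relabeled.
data = [(95, 1), (92, 0), (40, 0), (30, 0)]
res = iterative_relabel(
    data,
    balance=lambda d: d,                                   # no-op here
    learn_rules=lambda d: min(v for v, y in d if y == 1) - 5,
    apply_rules=lambda t, d: [(v, 1 if v >= t else y) for v, y in d],
    recall_of=lambda d: sum(y for _, y in d))
print(res)   # (92, 0) is relabeled to (92, 1); the far records stay non-nest
```

In the actual experiments, `learn_rules` is C4.5 rule induction on the balanced dataset and `apply_rules` applies only rules above the confidence threshold.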
4.5.2 Experiment Results
Here I again use the data from the Darr wetland in 1995 (Darr95), with 85 nest locations out of 5372 records, as the training dataset, and Stubble95, with 30 nest locations out of 1818 records, as the testing dataset.
For the first iteration, we balance the dataset using the technique from Section 4.1, then run C4.5 classification on the balanced dataset, obtaining the following training and testing results.
Actual\Predicted Nest Non-Nest Precision = 0.731
Nest* 4836 434 Recall = 0.918
Non-Nest 1778 3509
Table 7: Training results for the balanced dataset of Darr95 in first iteration
* Note: Here the nest number is the number after balancing
Actual\Predicted Nest Non-Nest Precision = 0.02
Nest 8 22 Recall = 0.267
Non-Nest 387 1401
Table 8: Testing results for dataset Stubble95 based on the model above
We apply the rules derived from the training set with high confidence (threshold = 0.9) to the original dataset to relabel non-nest records. This claims 228 more nests from the non-nest class, bringing the number of nest locations to 313 (85 + 228). We then balance the relabeled dataset, build the learning model on it, and derive the rule sets again, obtaining the following training and testing results in the second iteration.
Actual\Predicted Nest Non-Nest Precision = 0.93
Nest* 4624 384 Recall = 0.92
Non-Nest 347 4712
Table 9: Training results for the balanced dataset of Darr95 in second iteration
* Note: Here the nest number is the number after relabeling and balancing
Actual\Predicted Nest Non-Nest Precision = 0.03
Nest 10 20 Recall = 0.33
Non-Nest 293 1495
Table 10: Testing results for dataset Stubble95 based on the model above
We can observe that both precision and recall on the testing dataset are better than in the first iteration, i.e., the learning model has improved a little. So we iterate again: balance the dataset, derive the rule set for the nest locations, and apply the high-confidence rules to the once-relabeled dataset to perform a second relabeling. After relabeling, we balance the dataset, run the classification, and monitor its performance on the testing dataset.
Actual\Predicted Nest Non-Nest Precision = 0.93
Nest* 4720 384 Recall = 0.93
Non-Nest 341 4712
Table 11: Training results for the balanced dataset of Darr95 in third iteration
* Note: Here the nest number is the number after twice relabeling and balancing
Actual\Predicted Nest Non-Nest Precision = 0.03
Nest 7 23 Recall = 0.23
Non-Nest 261 1527
Table 12: Testing results for dataset Stubble95 based on the model above
We can observe that the third iteration's performance on the testing dataset is not as good as the second iteration's; the likely reason is that the relabeling has gone down a wrong branch in this iteration. So we stop at this point in this experiment and keep the previous iteration's result as the final dataset. Since we are most interested in the nest patterns, I use recall as the key factor in the stopping criteria; we could develop more subtle and precise stopping criteria combining recall with other factors.
4.5.3 Discussion on the classification measure
For iterative algorithms, the stopping criteria are a crucial issue, and for the measure we cannot simply use traditional precision and recall. In the traditional precision and recall scheme of Table 13, the two classes are considered separately.
Actual\Predicted Nest Non-Nest
Nest a b
Non-Nest c d
Table 13: Traditional precision and recall scheme
Precision = a / (a + c)
Recall = a / (a + b)
F = 2 * Precision * Recall / (Precision + Recall)
Since we relabel some non-nest records as nest records, we need measures that capture this relabeling. We introduce some new measures, based on an adjusted matrix:
Actual\Predicted Nest Non-Nest
Nest a b
Relabeled c d
Non-Nest e f
Table 13: Adjusted performance matrix for the relabeling case
Here we consider recall and precision under the circumstance that some records have been relabeled.
One recall-like measure is

R = (a + c) / (a + b + c + d)

Considering the fraction of relabeled instances added, we define the coefficient

α = (a + b) / (a + b + c + d)

This coefficient measures the proportion of original nest records among the nest and relabeled records after relabeling. We want to relabel the potential nest-worthy locations as nest while relabeling the minimum number of non-nest records, so we track this coefficient alongside the performance. One precision-like measure is

P = (a + c) / (a + c + e)

and we can combine the measures above into a comprehensive measure:

X = 2αRP / (P + R)
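These definitions translate directly into code (the argument names follow the entries of the adjusted matrix above):

```python
def relabeling_measures(a, b, c, d, e, f):
    """R, alpha, P, and the combined X-measure from the adjusted matrix:
    rows are actual nest (a, b), relabeled (c, d), non-nest (e, f);
    the first column of each row is predicted nest."""
    R = (a + c) / (a + b + c + d)          # recall over nest + relabeled
    alpha = (a + b) / (a + b + c + d)      # original-nest fraction
    P = (a + c) / (a + c + e)              # precision over predicted nest
    X = 2 * alpha * R * P / (P + R)        # alpha-weighted F-style combination
    return R, alpha, P, X

# Illustrative matrix entries (not experimental values from the report).
R, alpha, P, X = relabeling_measures(a=10, b=2, c=4, d=6, e=3, f=100)
print(round(R, 3), round(alpha, 3), round(P, 3), round(X, 3))
```

Note that `f` does not enter any of the measures; it is kept only to mirror the full matrix.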
4.6 Iterative Relabeling using clustering
4.6.1 Algorithm descriptions
In relabeling using classification techniques, we apply the rules for the nest class to the original dataset. We can also use clustering techniques: cluster the nest records, find the attribute boundaries, and apply these boundaries to the original dataset to relabel it. To apply clustering, we make the following assumptions:
1. It is a binary-class problem (there are just two classes).
2. The records in the desirable class are accurate; only records in the other class may be mislabeled.
3. The mislabeled instances are distributed at random.
4. There is no overlap in the attribute space between the instances of the two classes, i.e., they are linearly separable, so we can find a boundary for each class.
5. There are no holes in the sample distribution.
6. There are some mislabeled instances in the dataset.
7. The percentage of mislabeled instances does not exceed 50%.
The relabeling process using clustering is as follows:
1. Choose the instances of the desired class.
2. Cluster the instances from step 1 using some clustering algorithm.
3. Shrink the boundary of each cluster.
4. Iterate step 3 until the cluster boundaries are stable.
5. Apply the boundaries to the dataset to relabel instances of the non-desired class.
For the clustering algorithm, we can use classical algorithms such as K-Means, especially when the mislabeling rate is not too high (around 20-40%). However, K-Means has difficulty with clusters of varying shape and size, so it is not good enough; we can try Chameleon combined with a boundary-shrinking technique like DBSCAN's to get a more precise boundary for the nest records and apply it to the original dataset. I have tried this technique only on a synthetic dataset, where it works well; later I will try it on real datasets, such as UCI datasets satisfying the assumptions.
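A sketch of steps 1-5 under the assumptions above, using a minimal K-Means (as noted, Chameleon would handle irregular cluster shapes better) and, as an assumed stand-in for the boundary-shrinking step, an axis-aligned bounding box shrunk by a fixed fraction per side:

```python
import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    """Minimal K-Means (stand-in for step 2 of the process above)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) for j in range(k)])
    return labels

def relabel_by_cluster_boundary(nest_X, other_X, shrink=0.1, k=2):
    """Cluster the desired class, shrink each cluster's bounding box by
    `shrink` per side, and flag other-class points inside a box."""
    labels = kmeans(nest_X, k)
    boxes = []
    for j in range(k):
        pts = nest_X[labels == j]
        lo, hi = pts.min(0), pts.max(0)
        pad = (hi - lo) * shrink
        boxes.append((lo + pad, hi - pad))
    keep = np.zeros(len(other_X), dtype=bool)
    for lo, hi in boxes:
        keep |= ((other_X >= lo) & (other_X <= hi)).all(1)
    return keep

nests = np.array([[0., 0.], [1., 1.], [0., 1.],
                  [10., 10.], [11., 11.], [10., 11.]])   # two nest clusters
others = np.array([[0.5, 0.5], [5., 5.], [10.5, 10.5]])
print(relabel_by_cluster_boundary(nests, others).tolist())  # -> [True, False, True]
```

The two points falling inside a shrunken nest-cluster box are the candidates for relabeling; the point between the clusters is left as non-nest.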
4.7 Relabeling using Regression
In this dataset all attributes are numeric, so regression is a natural way to fit a model. Since this is a binary-response problem, we can use a logit or probit model.
If we first ignore spatial autocorrelation, we can build a probit regression model; the resulting regression-value distributions are shown in Figures 18 and 19.
Figure 18: Probit regression for Darr95
Figure 19: Probit regression for Stubble95
Moreover, our collaborators [2] have added spatial regression by including the neighborhood information through the contiguity matrix in the regression formula, as in SAM. I will use SAM for relabeling: we will relabel the non-nest records based on the regression value, which is the cumulative probability of nest presence in the spatial autoregressive regression.
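A minimal sketch of the binary-response fit; for simplicity this uses the logit link fitted by gradient ascent rather than probit (the two differ only in the link function), and the single rescaled feature is illustrative:

```python
import numpy as np

def fit_logit(X, y, lr=0.5, iters=5000):
    """Fit a logistic (logit) model by gradient ascent on the
    log-likelihood; a probit model would swap the link function."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xb @ w))           # squash onto (0, 1)
        w += lr * Xb.T @ (y - p) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return 1 / (1 + np.exp(-Xb @ w))

# Toy data: nest probability rises with (rescaled) vegetation durability.
X = np.array([[0.2], [0.4], [0.6], [0.8], [0.95]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
p = predict_proba(fit_logit(X, y), X)
print(bool(p[0] < 0.5 < p[-1]))    # low VD -> low probability, high VD -> high
```

The SAM-based relabeling described above would replace this independent fit with the spatial autoregressive formulation, thresholding the resulting probabilities to relabel non-nest records.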
5. Summary
There are many interesting research issues in this dataset, and we have applied several techniques to deal with its challenges. These techniques solve some of the challenges, and we are working on new techniques to deal with the others.
Challenge (spatial issues) | Not Independent | Not Identically Distributed | Mislabeled Nest-Worthy Locations | Measure of Accuracy
SAM | x | x | |
PLUMS | x | x | | x
Our techniques | 4.2, 4.3 | 4.2, 4.3 | 4.2, 4.3, 4.4, 4.5, 4.6, 4.7 |
Table 14: Comparison of solutions to the spatial challenges
Challenge (non-spatial issues) | Different Class Sizes | Outliers | Relative Thresholds | Temporal
SAM | | | |
PLUMS | | | |
Our techniques | 4.1 | Detect and eliminate outliers | Proposed in future work | Proposed in future work
Table 15: Comparison of solutions to the non-spatial challenges
We will apply these techniques to other datasets, and expect to solve similar problems across domains.
6. Future Work
I will try the iterative relabeling algorithm using clustering techniques, finding a better clustering algorithm to obtain more precise boundaries for the nest patterns.
I will try the relabeling algorithm using the spatial autoregressive model with a proper contiguity matrix W, which captures the neighborhood information.
For the relative thresholds across regions and datasets, I will add additional attributes, e.g., the local ranking of each attribute for each cell within a certain window size, to capture the local information; we then build the learning model on the extended dataset including the attribute rankings. I am running this experiment now.
For the prediction accuracy measure, ADNP in PLUMS is actually not good enough: if we predict nest everywhere, ADNP is perfect, yet the quality of the prediction is poor. So we should also keep an eye on the number of predictions, e.g., use the average distance to the nearest prediction together with the average distance to the nearest actual location. We will work on other good measures for classification accuracy and relabeling properties.
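The ADNP measure and its "predict everywhere" degenerate case can be illustrated directly (the cell coordinates below are toy values):

```python
import numpy as np

def adnp(actual, predicted):
    """Average Distance to Nearest Prediction: mean, over actual nests,
    of the distance to the closest predicted nest (lower is better)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    d = np.linalg.norm(actual[:, None] - predicted[None, :], axis=-1)
    return d.min(axis=1).mean()              # nearest prediction per actual nest

actual = [(2, 2), (7, 7)]
sparse = [(2, 3), (7, 6)]                            # two nearby predictions
everywhere = [(r, c) for r in range(10) for c in range(10)]
print(adnp(actual, sparse))                          # -> 1.0
print(adnp(actual, everywhere))                      # -> 0.0: predicting every
                                                     # cell "wins", the flaw above
```

This is exactly why a companion measure sensitive to the number of predictions (such as the average distance to the nearest actual location) is needed.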
References
[1] S. Chawla, S. Shekhar, W. Wu, U. Ozesmi. Predicting Locations Using Map Similarity (PLUMS): A Framework for Spatial Data Mining.
[2] S. Chawla, S. Shekhar, W. Wu, U. Ozesmi. Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations.
[3] P. Gould. The Geographer at Work. Routledge and Kegan Paul, London, 1985.
[4] N. A. Cressie. Statistics for Spatial Data (Revised Edition). Wiley, New York, 1993.
[5] S. Ozesmi and U. Ozesmi. An Artificial Neural Network Approach to Spatial Habitat Modeling with Interspecific Interaction. Ecological Modeling, Elsevier Science B.V., (116): 15-31, 1999.
[6] U. Ozesmi and W. Mitsch. A Spatial Habitat Model for the Marsh-breeding Red-winged Blackbird (Agelaius phoeniceus L.) in Coastal Lake Erie Wetland. Ecological Modeling, Elsevier Science B.V., (101): 139-152, 1997.
[7] J. I. Maletic and A. Marcus. Data Cleansing: Beyond Integrity Analysis. Proceedings of the Conference on Information Quality (IQ2000), Massachusetts Institute of Technology, October 20-22, 2000.
[8] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, August 2000.
[9] K. Koperski, J. Han, and N. Stefanovic. An Efficient Two-Step Method for Classification of Spatial Data.
[10] L. Anselin. Spatial Econometrics: Methods and Models. Kluwer, Dordrecht, Netherlands, 1988.
[11] B. Flury. A First Course in Multivariate Statistics. Springer, 1997.