References

[1] S. Chawla, S. Shekhar, W. Wu, U. Ozesmi. Predicting Locations Using Map Similarity (PLUMS): A Framework...
Data Mining on a Bird Habitat Dataset

Pusheng Zhang
Department of Computer Science & Engineering, University of Minnesota, MN 55455
pusheng@cs.umn.edu

1. Introduction

Widespread use of spatial databases is leading to increasing interest in mining interesting, useful, but implicit spatial patterns. In addition to non-spatial attributes, a spatial dataset is more complex and includes extended objects such as coordinates, points, polygons, and grid cells. The values of attributes referenced by spatial locations tend to vary gradually over space, while classical data mining techniques either explicitly or implicitly assume that the data are independently generated. Moreover, the spatial distributions of attributes sometimes have distinct local trends that contradict the global trend. This is most vivid in Figure 1(b), where the spatial distribution of vegetation durability is jagged in the western section of the wetland, in contrast to the overall impression of uniformity across the wetland. Thus spatial data are not only not independent, they are also not identically distributed.

Figure 1: (a) Training dataset: the geometry of the marshland and the locations of nests, (b) the spatial distribution of vegetation durability over the marshland, (c) the spatial distribution of water depth, and (d) the spatial distribution of distance to open water

We are given data about two wetlands on the shores of Lake Erie in Ohio, USA, in order to predict the spatial distribution of a marsh-breeding bird, the red-winged blackbird. The wetlands are named Darr and Stubble, and the data were collected from April to June in two successive years, 1995 and 1996. For data collection, a grid of roughly 5000 cells was superimposed on each wetland; each cell is a 5 meter by 5 meter square. In each cell the values of several structural and environmental factors were recorded: vegetation durability, stem density, stem height, distance to open water, distance to edge, and water depth. For each cell we also recorded whether a red-winged blackbird nest was present. The attributes are shown in Table 1.

Attribute                      Type     Description
Vegetation Durability (VD)     Numeric  Scale from 10 to 100
Stem Density (SD)              Numeric  In number of stems per square meter
Stem Height (SH)               Numeric  In centimeters above water
Distance to Open Water (DOP)   Numeric  In meters
Distance to Edge (DOE)         Numeric  In meters
Water Depth (WD)               Numeric  In centimeters
Nest                           Binary   Presence/absence of a bird nest in the cell

Table 1: The non-spatial attributes in the dataset

For the bird habitat dataset, our collaborators have put forward solutions [1,2] to some of the problems. Here we propose some techniques to improve the performance of data mining on this dataset, and we expect these techniques to carry over to many domains, including ecology and environmental management, public safety, transportation, business logistics, and tourism.

2. Challenges in the dataset

There are several challenges in this dataset; I will categorize them into spatial issues and non-spatial issues. The spatial challenges are the following:

1. Neighboring regions tend to have the same or similar properties.
2. Nests are not distributed everywhere, i.e., they are not identically distributed.
3. Nests tend to be close to each other, yet not too crowded together. Thus, even though some locations near nests are nest worthy, they are mislabeled as "Non-Nest".
4. Measures of prediction accuracy.

First, classical data mining deals with numbers and categories. In contrast, a spatial dataset is more complex and includes extended objects such as coordinates, points, and grid cells. Second, classical data mining algorithms often make assumptions (e.g., independent, identical distributions) and treat each input as independent of the other inputs, which violates the first law of geography: everything is related to everything else, but nearby things are more related than distant things [3]. In other words, the values of attributes of nearby spatial objects tend to systematically affect each other. In spatial statistics, an area within statistics devoted to the analysis of spatial data, this is called spatial autocorrelation [4], and spatial patterns often exhibit continuity and high autocorrelation among nearby features.

However, some instances in this dataset are mislabeled. For example, some locations near nests may be nest worthy but were sampled as "Non-Nest" because no real nests were present during data collection, perhaps simply because the birds do not like to be overcrowded, even though those locations are also very suitable for nests. We can observe in Figure 2 that some nest worthy locations around the actual nests are mislabeled as non-nest.
Figure 2: Mislabeling of some nest worthy locations as non-nest in the dataset

Moreover, the appropriate measure of spatial accuracy may be substantially different from classical measures. For a binary-class problem, the standard way to measure classification accuracy is to calculate the percentage of correctly classified objects. This measure may not be the most suitable for spatial data. For example, under the classical accuracy measure, the accuracies of the two models in Figure 3 are the same. However, domain experts prefer (c) over (b), since its predicted nest locations are closer on average to actual nest locations. The classical accuracy measure cannot distinguish between Figure 3(b) and 3(c), so we need a measure of spatial accuracy to capture this preference.

Figure 3: (a) The pixels with actual nests, (b) locations predicted by one model, (c) locations predicted by another model. Prediction (c) is spatially more accurate than (b).

In addition to the challenges due to the spatial nature of the dataset, we also have some general non-spatial challenges:

5. Different class sizes.
6. Outliers.
7. Relative thresholds across different regions.
8. Temporal nature.

In this dataset, nest instances are only a small proportion of the records; the majority of the dataset consists of non-nest locations. Since we are interested precisely in the nest pattern, it is very difficult to capture the features of the nests. Furthermore, since we use C4.5 classification, performance on highly unequal class sizes is poor. Thus we have to develop techniques to balance the numbers of records in the two classes.

Furthermore, there are outliers and noise in the datasets. By noise we mean contradictory records: records with the same (or very similar) values for the training attributes but belonging to different classes. This is similar to the mislabeling scenario associated with the spatial nature of the data. By outliers we mean attribute values that are far from those of nearby points; for example, there are some outliers on the upper margin in Figure 4 below.
Figure 4: Outliers in the "Distance to Edge" distribution for the dataset Darr 96

There are different geometries and environmental conditions in different regions, and birds tend to choose the local maximum for their nests. For example, vegetation durability is one of the key attributes in the dataset. Consider two regions with different distributions: if we build a learning model on the region in Figure 5(a), we may derive a rule like "vegetation durability >= 90" for nests; this model will fail to predict nest locations in the region of Figure 5(b), even though the nests in the two regions follow the same pattern in terms of the vegetation durability rules. We have to capture the different thresholds across the different regions and datasets.

Figure 5: (a) Local maximum in one region, (b) local maximum in another region

This dataset also raises temporal issues: the attribute distributions and nest distributions for the same wetland differ between the two successive years. In Figure 6 we can observe that the vegetation durability distributions in the two successive years are different, and Figure 7 shows that even the nest location distributions differ. It is very difficult to capture the temporal pattern.
Figure 6: (a) Vegetation durability distribution for the Darr wetland in 1995, (b) vegetation durability distribution for the Darr wetland in 1996

Figure 7: Different nest location distributions for the Darr wetland in 1995 and 1996

3. Related Work

Uygar [5,6] applied classical data mining techniques such as logistic regression and neural networks to build spatial habitat models. Logistic regression was used because the dependent variable is binary (nest/non-nest) and the logistic function "squashes" the real line into the unit interval; values in the unit interval can then be interpreted as probabilities. They concluded that using logistic regression the nests could be classified at a rate 24% better than random. The use of neural networks actually decreased the classification accuracy, but it led to a better understanding of the interactions between the explanatory and dependent variables.

There are two important reasons why, despite extensive domain knowledge, the results of classical data mining are not satisfactory. First, classical techniques, e.g. logistic regression, assume independent distributions for the properties of each pixel, ignoring spatial autocorrelation. Second, a more subtle but equally important reason is the objective function: the classification accuracy measure. Uygar still used the classical accuracy measure, which may not suit this spatial dataset.
In a spatial autoregressive model, the spatial dependencies of the dependent variable are directly modeled in the regression equation [10]. Assume that the dependent values y_i are related to each other, i.e., y_i = f(y_j), i != j. The regression formula can then be defined as

    y = ρWy + βX + ε

Here W is the neighborhood relationship contiguity matrix and ρ is the parameter that reflects the strength of the spatial dependencies between the elements of the dependent variable. Spatial autocorrelation measures depend crucially on the choice and design of the contiguity matrix W. The design of the matrix itself is predicated on determining what constitutes a "neighborhood of influence". Two common choices are the four-neighborhood and the eight-neighborhood. Given a lattice structure and a point S in the lattice, the four-neighborhood assumes that S influences all cells that share an edge with S; in the eight-neighborhood, S influences all cells that share an edge or a vertex with S. A contiguity matrix is shown in Figure 8: the contiguity matrix of the uneven lattice (left) is shown on the right-hand side. The contiguity matrix plays a crucial role in the spatial extension of the regression model.

Figure 8: A spatial neighborhood and its contiguity matrix

We will refer to the regression formula above as the spatial autoregressive model (SAM). Notice that when ρ = 0, the equation collapses to the classical regression model. The benefits of modeling spatial autocorrelation are many: (1) the residual error will have much lower spatial autocorrelation; with the proper choice of W, the residual error should, at least theoretically, have no systematic variation. (2) If the spatial autocorrelation coefficient is statistically significant, it quantifies the presence of spatial autocorrelation and indicates the extent to which variation in the dependent variable (y) is explained by the average of the neighboring observation values. (3) Finally, the model will have a better fit.
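To make the role of W concrete, here is a minimal sketch (not from the paper) that builds a binary four-neighbor contiguity matrix for a regular grid; the function name, row-major cell indexing, and grid shape are illustrative assumptions:

```python
def contiguity_matrix(rows, cols):
    """Binary four-neighbor contiguity matrix W for a rows x cols grid.

    Cells are indexed row-major; W[i][j] = 1 iff cells i and j share an edge.
    """
    n = rows * cols
    W = [[0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    W[i][rr * cols + cc] = 1
    return W
```

In SAM, each row of W is then typically normalized to sum to 1 so that Wy averages the neighboring values of y.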
SAM can deal with spatial challenges 1 and 2, capturing the spatial information in the regression. In [1,2], PLUMS, a framework for spatial data mining on this dataset, was proposed. PLUMS also uses a spatial autoregressive model with a proper contiguity matrix W. Moreover, it develops a new spatial accuracy measure, ADNP (Average Distance to Nearest Prediction), defined as the average distance from each actual nest location to the nearest predicted nest location. In addition to challenges 1 and 2, PLUMS can therefore deal with challenge 4, the spatial accuracy measure.

However, some challenges remain unsolved, so we propose the following techniques for this dataset:

• Smoothing using four/eight neighbors, a Gaussian distribution, and iterative relabeling algorithms to address challenges 1, 2, and 3 (mislabeling).
• Balancing the numbers of records in the two classes to address challenge 5 (different class sizes).
• Relabeling using iterative rule-based/clustering/regression algorithms to address challenge 3.
• For relative thresholds across different regions (challenge 7), using peak selection within a given window and choosing different thresholds for the different regions in the raster. We can also add one more attribute to the dataset to capture the local ranking of each attribute, and include it in the classification or regression.
• Eliminating the detected outliers for challenge 6.
• Temporal data mining for challenge 8.

In the following sections, I will elaborate on these techniques and show the experimental results.

4. Techniques used

4.1 Balancing the numbers of the two classes

In this dataset we are mainly interested in the nest patterns, but nest records are only a small proportion of the data: e.g., the dataset Darr95 has 5372 records in total, of which only 85 are nests. Given the unbalanced numbers of nest and non-nest records, we can add more weight to the nest records to help capture the nest patterns. One possible solution is to randomly draw a sample of the same size from the non-nest records and mine the sample plus the nest records; the problem is that we lose information by choosing only a fraction of the non-nest records. Another solution is to simply replicate the nest records according to the ratio between the numbers of non-nest and nest records, yielding roughly equal numbers for the two classes. Here I use the second method.

I use the data from the Darr wetland in 1995 (Darr95) as the training dataset, which includes 85 nest locations out of 5372 records in total, and the data from the same year but the other wetland (Stubble95) as the testing dataset, which includes 30 nest locations out of 1818 records. Using the original dataset Darr95, C4.5 cannot capture any nest feature or predict any nest location: the model always predicts non-nest everywhere, and the decision tree generated by C4.5, shown in Figure 9, is just one node classified as non-nest.
Even though the accuracy of this model is not too bad, 5287/5372 = 0.984, its recall is 0. Hence the model is useless, since it cannot predict any nest location.

Figure 9: Decision tree for the classification on the original dataset Darr95 (a single "Non-Nest" node)

A \ P       Nest   Non-Nest
Nest        0      85
Non-Nest    0      5287

Table 2: Classification matrix on the original dataset Darr95

Here I apply the balancing method on Darr95 with ratio = 62, obtaining the balanced dataset shown in Table 3.

                   Nest Number   Non-Nest Number
Before Balancing   85            5287
After Balancing    5270          5287

Table 3: Datasets before and after balancing
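The replication scheme described above can be sketched as follows; the record representation (a list of dicts with a "Nest" key) is an assumption for illustration:

```python
def balance_by_replication(records, label_key="Nest", minority=1):
    """Replicate minority-class records so the two classes are roughly equal.

    records: list of dicts; label_key: name of the binary class attribute.
    """
    minority_recs = [r for r in records if r[label_key] == minority]
    majority_recs = [r for r in records if r[label_key] != minority]
    if not minority_recs:
        return list(records)
    ratio = len(majority_recs) // len(minority_recs)  # e.g. 5287 // 85 = 62
    return majority_recs + minority_recs * max(ratio, 1)
```

With 85 nest and 5287 non-nest records the ratio is 5287 // 85 = 62, which yields the 5270 replicated nest records shown in Table 3.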
Then we build the learning model on the balanced dataset for Darr95. The training result is very good:

A \ P       Nest   Non-Nest
Nest*       5270   0             Precision = 0.91
Non-Nest    508    4779          Recall = 1

Table 3: Training results for the balanced dataset of Darr95 (* the nest number is the number after balancing)

Since we are mainly interested in the nest records, we prefer to keep recall high, and on the training data we get perfect recall. As for the testing results, Table 4 shows that the model predicts 6 out of 30 nest locations.

A \ P       Nest   Non-Nest
Nest        6      24            Precision = 0.05
Non-Nest    119    1669          Recall = 0.2

Table 4: Testing results for the dataset Stubble95

Compared to the poor classification performance on the original dataset, balancing the dataset achieves a large gain for classification, especially for decision-tree-based methods. It is a very useful technique, and I will combine it with the other techniques to address the challenges.

4.2 Relabeling using neighborhood information

4.2.1 Algorithm description

Since we know that some nest worthy locations near nests are mislabeled as non-nest, we want to relabel them as nest according to neighborhood information. For the neighborhood, we can adopt the immediate four-neighborhood or eight-neighborhood. We assign each nest cell a weight of 1 and assume that it affects its immediate four/eight neighbors, so we assign the neighbors proper weights according to distance, as shown in Figure 10.

Figure 10: Weight distributions for a nest cell: (a) four-neighbor, (b) eight-neighbor

We then calculate the cumulative weight for each non-nest cell and choose a threshold to discretize the results into the two classes, thereby claiming more nest locations. We can see from Figure 11 that the locations relabeled from non-nest to nest are always near several nests.
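A minimal sketch of the cumulative-weight computation just described, for the four-neighbor case; the per-neighbor weight of 0.25 is an illustrative choice, not the value from Figure 10:

```python
def relabel_by_neighbors(nest_grid, weight=0.25, threshold=0.5):
    """Relabel non-nest cells whose cumulative neighbor weight reaches a threshold.

    nest_grid: 2-D list of 0/1 (1 = nest).  Each nest contributes `weight`
    to each of its four immediate neighbors (the per-neighbor weight here
    is an illustrative choice, not the paper's).
    """
    rows, cols = len(nest_grid), len(nest_grid[0])
    relabeled = [row[:] for row in nest_grid]
    for r in range(rows):
        for c in range(cols):
            if nest_grid[r][c] == 1:
                continue
            w = 0.0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and nest_grid[rr][cc] == 1:
                    w += weight
            if w >= threshold:
                relabeled[r][c] = 1
    return relabeled
```

With this choice of weight and threshold, a non-nest cell is relabeled only when at least two of its four neighbors are nests.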
Figure 11: After relabeling based on the eight-neighbor scheme, 22 non-nest locations near nests are relabeled as nest locations (discretization threshold 0.5) in the dataset Darr95 (red: nest locations, blue: non-nest locations)

4.2.2 Experiment Results

Based on the relabeled dataset, we do the classification using C4.5 and get the following training and testing results:

A \ P       Nest   Non-Nest
Nest*       15     102           Precision = 0.88
Non-Nest    2      5253          Recall = 0.128

Table 5: Training results for the balanced dataset of Darr95 (* the nest number is the number after balancing)

A \ P       Nest   Non-Nest
Nest        5      25            Precision = 0.03
Non-Nest    154    1634          Recall = 0.17

Table 6: Testing results for the dataset Stubble95

However, this weight assignment is somewhat arbitrary and the performance is not good enough, so we want a more precise weight assignment function. In the next section I introduce the Gaussian distribution scheme for smoothing.

4.3 Relabeling using a Gaussian distribution

4.3.1 Algorithm description
Similar to the four/eight-neighborhood scheme, we use a Gaussian distribution as the nest influence function for assigning weights to nearby locations. We distribute the weight of each nest location over a 7x7 window as shown in Figure 12, and calculate the cumulative weight for each non-nest location.

Figure 12: 7x7 window for the nest influence using a Gaussian distribution

After calculating the cumulative weights for every non-nest location, we can observe the weight distribution shown in Figure 13.

Figure 13: Weight distribution after calculating the cumulative weight for every non-nest location

Then we discretize the weight values into class labels based on a threshold; Figure 14 shows the nest distribution, including relabeled locations, after discretization.
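The 7x7 Gaussian window can be generated as below; the spread sigma = 1.0 is an illustrative assumption, as the paper does not report the value it used:

```python
import math

def gaussian_kernel(size=7, sigma=1.0):
    """size x size Gaussian weight window centered on a nest cell.

    sigma is an illustrative choice; weights are normalized to sum to 1.
    """
    half = size // 2
    k = [[math.exp(-(dr * dr + dc * dc) / (2 * sigma * sigma))
          for dc in range(-half, half + 1)]
         for dr in range(-half, half + 1)]
    total = sum(v for row in k for v in row)
    return [[v / total for v in row] for row in k]
```

Each nest cell distributes this window over its neighborhood, and every non-nest cell sums the weights of all windows that overlap it to get its cumulative weight.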
Figure 14: Relabeling based on a threshold

4.3.2 Experiment Results

Based on the relabeled dataset, we first balance it using the technique discussed in Section 4.1, then do the classification using C4.5 on the balanced dataset, and get the following training and testing results:

A \ P       Nest   Non-Nest
Nest*       4130   490           Precision = 0.818
Non-Nest    921    3791          Recall = 0.894

Table 5: Training results for the balanced dataset of Darr95 (* the nest number is the number after relabeling and balancing)

A \ P       Nest   Non-Nest
Nest        10     20            Precision = 0.02
Non-Nest    497    1291          Recall = 0.33

Table 6: Testing results for the dataset Stubble95

We can see that Gaussian smoothing achieves some gain compared to using the balancing technique alone.

4.4 Relabeling based on attribute similarity

The mislabeled records near nests hold the same or similar attribute values as the nearby actual nest locations, so we can relabel non-nest locations based on attribute similarity. The basic process is shown in Figure 15.

For each non-nest record
    For each nest record
        Compare each attribute value of the nest record with the non-nest record;
        If all attribute values of the non-nest record are similar to those of a nest record
        Then relabel this non-nest location as nest; break;
    End
End

Figure 15: Algorithm for relabeling based on attribute similarity
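A runnable version of the Figure 15 pseudocode, using a relative tolerance as the similarity test; the 2% tolerance value and the dict-based record layout are illustrative assumptions:

```python
def relabel_by_similarity(records, attrs, tol=0.02):
    """Relabel non-nest records whose attribute values all lie within a
    relative tolerance of some nest record (tol=0.02 means 2%)."""
    nests = [r for r in records if r["Nest"] == 1]
    out = []
    for r in records:
        rec = dict(r)
        if rec["Nest"] == 0:
            for n in nests:
                if all(abs(rec[a] - n[a]) <= tol * max(abs(n[a]), 1e-9)
                       for a in attrs):
                    rec["Nest"] = 1  # all attributes similar to a nest record
                    break
        out.append(rec)
    return out
```

The inner loop breaks on the first matching nest record, mirroring the "break" in the pseudocode.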
Here we can set a threshold for the similarity of attribute values between two records, e.g., 2% or 5%. If we know the relative significance of each attribute, we can assign a different threshold to each attribute accordingly. Since the spatial data exhibit high autocorrelation and contiguity, we can expect the relabeled records to be located near actual nests.

4.5 Iterative Relabeling using C4.5

4.5.1 Algorithm Description

Since we are mainly interested in the nest records, we focus on the desirable class. Before starting the relabeling process, we first balance the original dataset to help capture the nest pattern. Then we construct a learning model using classification, get the rule set for the nest class, and apply the rules with high confidence to the original dataset to relabel non-nest records. Since we keep a support threshold during construction of the decision tree, the derived rules also satisfy the support threshold. We then iterate the relabeling as follows: balance the original dataset into a dataset with equal numbers for the two classes; generate the rule set using C4.5; apply the rules with high confidence for the desirable class to the dataset, relabeling records of the non-desirable class to obtain Dataset-1; balance Dataset-1 and repeat, obtaining Dataset-2, and so on. Finally the process converges to a stable dataset, i.e., no more records can be relabeled according to the stopping criteria, and this gives the final dataset.

Figure 16: Process graph of the iterative relabeling algorithm
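The process in Figure 16 can be sketched as a generic loop; `learn_rules`, `apply_rules`, and `evaluate` are placeholders for the C4.5 rule induction, rule application, and recall monitoring, and balancing is assumed to happen inside `learn_rules`:

```python
def iterative_relabel(dataset, learn_rules, apply_rules, evaluate, max_iter=10):
    """Skeleton of the iterative relabeling process.

    learn_rules(data) -> rule set (balancing assumed to happen inside);
    apply_rules(rules, data) -> relabeled data; evaluate(data) -> recall.
    Stops when nothing changes or recall starts to decrease.
    """
    best, best_recall = dataset, evaluate(dataset)
    for _ in range(max_iter):
        rules = learn_rules(best)             # generate rules (e.g., C4.5)
        candidate = apply_rules(rules, best)  # relabel non-desirable records
        recall = evaluate(candidate)          # monitor performance
        if candidate == best or recall < best_recall:
            break                             # stopping criterion
        best, best_recall = candidate, recall
    return best
```

Keeping `best` one step behind the failing candidate mirrors the paper's choice of retaining the previous iteration's result as the final dataset.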
The algorithm can be described as follows:

Steps (initially i = 1):
1. Balance the dataset d_i to d_i'.
2. Construct the learning model on the balanced dataset d_i' using C4.5 rule classification and get the rule set.
3. Choose the rules for the desirable class with high confidence and support.
4. Apply the rules to the dataset d_i and relabel the records of the non-desirable class to get a new dataset d_{i+1}.
5. Monitor the classification performance on the new dataset.
6. Repeat the above steps until no more records can be relabeled or the recall on d_{i+1} begins to decrease.

Figure 17: Algorithm for iterative relabeling using the classification technique

4.5.2 Experiment Results

Here I again use the data from the Darr wetland in 1995 (Darr95) as the training dataset (85 nest locations out of 5372 records in total) and the data from the same year but the other wetland (Stubble95) as the testing dataset (30 nest locations out of 1818 records). For the first iteration, we balance the dataset using the technique discussed in Section 4.1, then classify with C4.5 on the balanced dataset, obtaining the following training and testing results.

A \ P       Nest   Non-Nest
Nest*       4836   434           Precision = 0.731
Non-Nest    1778   3509          Recall = 0.918

Table 7: Training results for the balanced dataset of Darr95 in the first iteration (* the nest number is the number after balancing)

A \ P       Nest   Non-Nest
Nest        8      22            Precision = 0.02
Non-Nest    387    1401          Recall = 0.267

Table 8: Testing results for the dataset Stubble95 based on the model above

We choose the rules derived from the training set with high confidence (threshold = 0.9) and apply them to the original dataset to relabel non-nest records. This claims 228 more nests from the non-nest class, bringing the number of nest locations to 313 (85 + 228).
Then we balance the relabeled dataset, build the learning model on it, and derive the rule set again, obtaining the following training and testing results in the second iteration.

A \ P       Nest   Non-Nest
Nest*       4624   384           Precision = 0.93
Non-Nest    347    4712          Recall = 0.92

Table 9: Training results for the balanced dataset of Darr95 in the second iteration (* the nest number is the number after relabeling and balancing)
A \ P       Nest   Non-Nest
Nest        10     20            Precision = 0.03
Non-Nest    293    1495          Recall = 0.33

Table 10: Testing results for the dataset Stubble95 based on the model above

We can observe that the classification precision and recall on the testing dataset are both better than in the first iteration, i.e., the learning model has improved a little. So we iterate the process again: balance the dataset, derive the rule set for the nest locations, choose the rules with high confidence, and apply them to the dataset after the first relabeling to do the second relabeling. After relabeling, we balance the dataset, do the classification, and monitor the classification performance on the testing dataset.

A \ P       Nest   Non-Nest
Nest*       4720   384           Precision = 0.93
Non-Nest    341    4712          Recall = 0.93

Table 11: Training results for the balanced dataset of Darr95 in the third iteration (* the nest number is the number after relabeling twice and balancing)

A \ P       Nest   Non-Nest
Nest        7      23            Precision = 0.03
Non-Nest    261    1527          Recall = 0.23

Table 12: Testing results for the dataset Stubble95 based on the model above

The performance of the third iteration on the testing dataset is worse than that of the second iteration; a possible reason is that the relabeling went down a wrong branch in this iteration. So in this experiment we stop at this point and keep the previous iteration's result as the final dataset. Since we are mainly interested in the nest patterns, I use recall as the key factor in the stopping criterion; more subtle and precise stopping criteria could combine recall with other factors.

4.5.3 Discussion of the classification measure

For iterative algorithms, the stopping criteria are a crucial issue, and for the measure we cannot just consider traditional precision and recall. In the traditional precision and recall scheme, shown in Table 13, we consider the two classes separately.
A \ P       Nest   Non-Nest
Nest        a      b
Non-Nest    c      d

Table 13: Traditional precision and recall scheme

    Precision = a / (a + c)
    Recall = a / (a + b)
    F = 2 * Precision * Recall / (Precision + Recall)

Since we relabel some non-nest records into nest records, we have to develop measures that capture this. Here we introduce some new measures based on an adjusted matrix:

A \ P       Nest   Non-Nest
Nest        a      b
Relabeled   c      d
Non-Nest    e      f

Table 13: Adjusted performance matrix for the relabeling case

Here we consider recall and precision under the circumstance that some records have been relabeled. A recall-like measure is

    R = (a + c) / (a + b + c + d)

Considering the fraction of relabeled instances added, we define the coefficient

    α = (a + b) / (a + b + c + d)

This coefficient measures the proportion of original nest records among the nest and relabeled records after relabeling. We want to relabel the potentially nest worthy locations into the nest class, but we track this coefficient because we want to achieve better performance while relabeling the minimum number of non-nest records. A precision-like measure is

    P = (a + c) / (a + c + e)

Or we can combine the measures above into one comprehensive measure:

    X = 2αRP / (P + R)

4.6 Iterative Relabeling using clustering

4.6.1 Algorithm description

In relabeling using classification techniques, we apply the rules for the nest class to the original dataset. We can also use clustering techniques: cluster the nest records, find the attribute boundaries, and apply these boundaries to the original dataset for relabeling. Applying clustering requires the following assumptions:

1. Binary-class problem (there are only two classes).
2. The records in the desirable class are accurate; only records in the other class may be mislabeled.
3. The distribution of mislabeled instances is random.
4. There is no overlap in attribute space between the instances of the two classes, i.e., they are linearly separable, so we can find a boundary for each class.
5. There are no holes in the sample distribution.
6. There are some mislabeled instances in the dataset.
7. The percentage of mislabeled instances should not exceed 50%.

For relabeling using clustering techniques, the algorithm is:

1. Choose the instances of the desired class.
2. Cluster the instances chosen in step 1 using some clustering algorithm.
3. Shrink the boundary of each cluster.
4. Iterate step 3 until the cluster boundaries are stable.
5. Apply the boundaries to the dataset to relabel instances of the non-desired class.

For the clustering algorithm, we can use classical algorithms such as K-Means, especially when the mislabeling rate is not too high (roughly 20-40%). However, K-Means has difficulty with clusters of varying shape and size, so it is not good enough on its own. We can try Chameleon combined with a boundary-shrinking technique like that of DBSCAN to obtain a more precise boundary for the nest records and apply it to the original dataset. I have tried this technique on a synthetic dataset, where it works well; later I will try it on real datasets, such as UCI datasets satisfying the assumptions.

4.7 Relabeling using Regression

In this dataset all attributes are numerical, so regression is a natural way to fit a model. Since this is a binary response problem, we can use Logit/Probit regression. If we first ignore the spatial autocorrelation, we can build a probit regression model and obtain the regression value distributions shown in Figures 18 and 19.
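As a deliberately simplified sketch of steps 1, 3, and 5 above: a single cluster stands in for K-Means/Chameleon, a bounding box in a two-attribute space stands in for the cluster boundary, and the 10% shrink factor is arbitrary; all of these are illustrative assumptions, not the paper's method:

```python
def boundary_relabel(points, labels, shrink=0.1):
    """Single-cluster simplification of the clustering relabel idea.

    Take the bounding box of desired-class (label 1) points in attribute
    space, shrink it by a fraction on each side, and relabel other-class
    points that fall inside.  points: list of (x, y); labels: list of 0/1.
    """
    xs = [p[0] for p, l in zip(points, labels) if l == 1]
    ys = [p[1] for p, l in zip(points, labels) if l == 1]
    dx, dy = (max(xs) - min(xs)) * shrink, (max(ys) - min(ys)) * shrink
    lo = (min(xs) + dx, min(ys) + dy)
    hi = (max(xs) - dx, max(ys) - dy)
    return [1 if l == 0 and lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]
            else l
            for p, l in zip(points, labels)]
```

A real implementation would repeat the shrink step per cluster until the boundaries stabilize, as in step 4.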
Figure 18: Probit regression for Darr 95

Figure 19: Probit regression for Stubble 95

Moreover, our collaborators [2] have added spatial regression (SAM) by including the neighborhood information as a contiguity matrix in the regression formula. I will use SAM for relabeling: we will relabel non-nest records based on the regression value, which is the cumulative probability of the nest distribution under the spatial autoregressive regression.
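As background for the contiguity matrix used in the spatial autoregressive formulation, here is a sketch of a row-normalized four-neighbor (rook) contiguity matrix for a raster grid; the function name and the rook neighborhood scheme are assumptions for illustration, since the actual W used in [2] is not specified here.

```python
import numpy as np

def contiguity_matrix(rows, cols):
    """Row-normalized 4-neighbor (rook) contiguity matrix W for a raster grid.

    Cell (i, j) is flattened to index i * cols + j; W[p, q] > 0 iff q is a
    north/south/east/west neighbor of p, and each row of W sums to 1."""
    n = rows * cols
    W = np.zeros((n, n))
    for i in range(rows):
        for j in range(cols):
            p = i * cols + j
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    W[p, ni * cols + nj] = 1.0
    W /= W.sum(axis=1, keepdims=True)  # row-normalize so neighbors average to 1
    return W
```

In the spatial autoregressive model y = ρWy + Xβ + ε, each row of W averages a cell's neighborhood: an interior cell weights its four neighbors at 0.25 each, while a corner cell weights its two neighbors at 0.5 each.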
5. Summary

There are many interesting research issues in this dataset, and we have applied several techniques to deal with its challenges. These techniques solve some of the challenges, and we are working on new techniques for the remaining ones.

    Challenge          Not Independent   Not Identical   Mislabeled Nest-    Measure of
    (spatial issues)   Distribution      Distribution    Worthy Locations    Accuracy
    SAM                x                 x
    PLUMS              x                 x                                   X
    Our techniques     4.2, 4.3          4.2, 4.3        4.2, 4.3, 4.4,
                                                         4.5, 4.6, 4.7

Table 14: Comparison for solving the challenges of a spatial nature

    Challenge              Different      Outliers             Relative        Temporal
    (non-spatial issues)   Class Sizes                         Thresholds
    SAM
    PLUMS
    Our techniques         4.1            Detect and           Proposed in
                                          eliminate outliers   future work

Table 15: Comparison for solving the challenges of a non-spatial nature

We will apply these techniques to other datasets, and expect to solve similar problems across domains.

6. Future Work

I will try the iterative relabeling algorithms using clustering techniques, and look for better clustering algorithms that find more precise boundaries for the nest patterns. I will try a relabeling algorithm using the spatial autoregressive model with a proper contiguity matrix W, which captures the neighborhood information. For the relative thresholds across different regions and datasets, I will add additional attributes, e.g., the local ranking of each attribute for each cell within a certain window size, to capture the local information; we will then build the learning model on the extended dataset including the attribute rankings, and I am running this experiment now. For the prediction accuracy measure, ADNP in PLUMS is not good enough by itself: if we predict a nest everywhere, ADNP is perfect, yet the quality of the prediction is poor. So we should also keep an eye on the number of predictions; we can use the average distance to the nearest prediction and the average distance to the nearest actual nest location together.
We will also work on other good measures for classification accuracy and for relabeling quality.
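The point about pairing the two distance measures can be illustrated with a small sketch; the grid size and nest coordinates below are made-up illustration values.

```python
import numpy as np

def avg_dist_to_nearest(from_pts, to_pts):
    """Mean over from_pts of the distance to the closest point in to_pts."""
    d = np.linalg.norm(from_pts[:, None] - to_pts[None], axis=2)
    return d.min(axis=1).mean()

# Hypothetical 10x10 raster and two actual nest locations, for illustration only.
grid = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)
actual = np.array([[2.0, 2.0], [7.0, 7.0]])

predict_all = grid                              # degenerate: predict nest everywhere
predict_two = np.array([[2.0, 3.0], [7.0, 6.0]])  # two sensible predictions

# Actual-to-nearest-prediction distance (ADNP) alone rewards predicting everywhere:
adnp_all = avg_dist_to_nearest(actual, predict_all)   # 0.0 -- looks perfect
adnp_two = avg_dist_to_nearest(actual, predict_two)   # 1.0

# Pairing it with the prediction-to-nearest-actual distance exposes the degeneracy:
adna_all = avg_dist_to_nearest(predict_all, actual)   # large for predict-everywhere
adna_two = avg_dist_to_nearest(predict_two, actual)   # 1.0
```

Predicting everywhere drives the actual-to-prediction distance to zero, but the companion prediction-to-actual distance blows up, which is exactly why the two measures should be used together.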
References

[1] S. Chawla, S. Shekhar, W. Wu, and U. Ozesmi. Predicting Locations Using Map Similarity (PLUMS): A Framework for Spatial Data Mining.
[2] S. Chawla, S. Shekhar, W. Wu, and U. Ozesmi. Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations.
[3] P. Gould. The Geographer at Work. Routledge and Kegan Paul, London, 1985.
[4] N. A. Cressie. Statistics for Spatial Data (Revised Edition). Wiley, New York, 1993.
[5] S. Ozesmi and U. Ozesmi. An Artificial Neural Network Approach to Spatial Habitat Modeling with Interspecific Interaction. Ecological Modelling, Elsevier Science B.V., (116): 15-31, 1999.
[6] U. Ozesmi and W. Mitsch. A Spatial Habitat Model for the Marsh-breeding Red-winged Blackbird (Agelaius phoeniceus L.) in Coastal Lake Erie Wetlands. Ecological Modelling, Elsevier Science B.V., (101): 139-152, 1997.
[7] J. I. Maletic and A. Marcus. Data Cleansing: Beyond Integrity Analysis. Proceedings of the Conference on Information Quality (IQ2000), Massachusetts Institute of Technology, October 20-22, 2000.
[8] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, August 2000.
[9] K. Koperski, J. Han, and N. Stefanovic. An Efficient Two-Step Method for Classification of Spatial Data.
[10] L. Anselin. Spatial Econometrics: Methods and Models. Kluwer, Dordrecht, Netherlands, 1988.
[11] B. Flury. A First Course in Multivariate Statistics. Springer, 1997.
