1. An Analysis of Instance Selection for Neural Networks to Improve Training Speed
Xunhu Sun
Department of Computer Sciences
Florida Institute of Technology
Melbourne, FL 32901, USA
sunx2013@my.fit.edu
Philip K. Chan
Department of Computer Sciences
Florida Institute of Technology
Melbourne, FL 32901, USA
pkc@cs.fit.edu
Speaker: Xunhu Sun
2. Reducing Training Instances
Instance selection
Select a subset
ignore the rest
Tradeoff
Amount of data vs. predictive accuracy
Overall research question
How to select a subset without sacrificing much accuracy?
3. How effectively can existing instance selection algorithms for kNN be applied to ANN?
Research Question 1
7. Remove Far Instances (RFI)
For each instance x
Calculate "enemyDistance": the distance to the closest instance of another class (a proxy for distance from the decision boundary)
Calculate the average and standard deviation of enemyDistance
Remove instances whose enemyDistance is farther than:
average + standard deviation
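The RFI steps above can be sketched in Python. This is a minimal illustration with brute-force distances; the function and variable names are mine, not from the paper:

```python
import numpy as np

def remove_far_instances(X, y):
    """RFI sketch: drop instances whose 'enemyDistance' (distance to the
    closest instance of another class) exceeds mean + one std."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # Pairwise Euclidean distances (brute force; fine for small data).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # enemyDistance: distance to the nearest instance of a different class.
    enemy = np.array([dist[i, y != y[i]].min() for i in range(len(X))])
    threshold = enemy.mean() + enemy.std()
    keep = enemy <= threshold
    return X[keep], y[keep]

# Toy example: the class-1 point far from the boundary is removed.
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [10, 10]]
y = [0, 0, 0, 1, 1, 1]
X_kept, y_kept = remove_far_instances(X, y)
```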
8. RFI: Accuracy and Retention Rate

Accuracy (%)
Algorithm   Bupa   HaberMan  Heart  Ionosphere  Iris   WDBC   Wine   Average
FDS         71.5   74.5      83.2   89.7        96.3   97.3   98.0   87.2
Random      68.1   74.2      82.5   85.4        94.1   97.1   97.1   85.5
RFI         71.8*  74.7      81.6   86.1        96.1*  97.2   98.0   86.5

Retention Rate (% of full data set)
RFI         86.5   87.3      81.7   82.5        76.8   87.3   83.9   83.7

* RFI has significantly higher accuracy than Random, based on a t-test with 95% confidence.
10. Remove Dense Instances (RDI)
k-distance: the distance of an instance from its k-th nearest neighbor of the same class
For each class c
For each instance x in class c, calculate its k-distance
Calculate the average k-distance, which is the dense threshold for class c
While the lowest k-distance is less than the dense threshold
Remove the instance y with the lowest k-distance
Update the k-distance of instances that had y as one of their k nearest neighbors
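The RDI loop above can be sketched as follows. For simplicity this version recomputes all k-distances in a class after each removal, rather than updating only the affected neighbors as the slide describes; names are mine, not from the paper:

```python
import numpy as np

def remove_dense_instances(X, y, k=3):
    """RDI sketch: within each class, repeatedly remove the instance with
    the smallest k-distance (distance to its k-th nearest same-class
    neighbor) while that k-distance is below the class's dense threshold
    (the average k-distance, computed once at the start)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = np.ones(len(X), dtype=bool)

    def k_distance(idx, pool):
        # Distance from X[idx] to its k-th nearest neighbor among pool.
        d = np.sort(np.sqrt(((X[pool] - X[idx]) ** 2).sum(axis=1)))
        return d[k] if len(d) > k else np.inf  # d[0] is the point itself

    for c in np.unique(y):
        members = list(np.where(y == c)[0])
        kd = {i: k_distance(i, members) for i in members}
        threshold = np.mean(list(kd.values()))  # dense threshold for class c
        while members and min(kd[i] for i in members) < threshold:
            densest = min(members, key=lambda i: kd[i])
            members.remove(densest)
            keep[densest] = False
            # Recompute k-distances for the remaining class members.
            kd = {i: k_distance(i, members) for i in members}
    return X[keep], y[keep]

# Toy example: thinning a tight class-0 cluster and an even class-1 line.
X = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5], [6, 6],
     [100, 0], [101, 0], [102, 0]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_kept, y_kept = remove_dense_instances(X, y, k=2)
```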
11. RDI: Accuracy and Retention Rate

Accuracy (%)
Algorithm   Bupa   HaberMan  Heart  Ionosphere  Iris   WDBC   Wine   Average
FDS         71.5   74.5      83.2   89.7        96.3   97.3   98.0   87.2
Random      68.1   74.2      82.5   85.4        94.1   97.1   97.1   85.5
RDI         68.6   74.1      83.4   87.6*       97.8*  96.5   98.0   86.6

Retention Rate (% of full data set)
RDI         55.7   64.2      65.6   65.4        66.7   58.8   67.8   63.4

* RDI has significantly higher accuracy than Random, based on a t-test with 95% confidence.
12. What is the tradeoff between
accuracy and speed?
Research Question 3
16. Concluding Remarks
Applying existing instance selection algorithms to ANN
Generally lower accuracy when fewer instances are retained
Randomly selecting 50% is competitive
Proposed Algorithms
RFI – removing instances far from the decision boundary
RDI – removing instances from dense regions
Tradeoff between accuracy and training time
AIR Ratio: RDI is more effective in 5 out of 7 data sets
18. Training ANN
Iterative update until convergence
Relatively slow compared to other ML algorithms
Could take minutes/hours…
Reducing training time
More effective updates to speed up convergence
Fewer training instances/data
19. Artificial Neural Networks (ANN)
A machine learning (ML) algorithm
Used in many applications
E.g. credit card fraud detection, autonomous driving
https://www.youtube.com/watch?v=DWNtsS2kZWs
20. Related Work on Instance Selection
k-Nearest Neighbor (kNN) – studied extensively:
Garcia et al., 2012; Olvera-Lopez et al., 2010; Wilson and Martinez, 2000
Condensation methods
Retaining boundary points (to help identify the boundary)
Relatively larger data reduction
Edition methods
Removing boundary and noisy points to reduce overfitting
Relatively smaller data reduction
Hybrid methods
Artificial Neural Network
El Hindi and Al-Akhras, 2011
Smoothing the decision boundary to reduce overfitting
21. Data Sets (UCI ML Repository)

              Bupa  HaberMan  Heart  Ionosphere  Iris  WDBC  Wine
#Attributes      6         3     13          34     4    30    13
#Classes         2         2      2           2     3     2     3
#Instances     345       306    270         351   150   569   177
22. ANN
Training set: 2/3
Test set: 1/3
Parameters
Number of output units
1 (two-class problems); n (n-class problems)
Number of hidden units
3, 5, and 10
Number of iterations
Determined by 3-fold cross validation on the training set
Learning rate and momentum
0.1
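The settings on this slide might look as follows in scikit-learn. This is a hypothetical reconstruction for illustration only; the authors' actual ANN implementation is not shown in the slides, and `max_iter` here merely stands in for the iteration count they chose by 3-fold cross validation:

```python
from sklearn.neural_network import MLPClassifier

def make_ann(hidden_units, max_iter):
    # One single-hidden-layer network per setting on the slide.
    return MLPClassifier(
        hidden_layer_sizes=(hidden_units,),
        solver="sgd",            # gradient descent with momentum
        learning_rate_init=0.1,  # learning rate from the slide
        momentum=0.1,            # momentum from the slide
        max_iter=max_iter,       # placeholder for the CV-selected count
    )

# The three hidden-unit settings used in the experiments.
models = [make_ann(h, max_iter=200) for h in (3, 5, 10)]
```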
23. Evaluation Criteria
Accuracy of ANN on the test set
Retention rate (% of instances retained)
Average of 30 runs (10 runs * 3 hidden-layer settings)
24. Research Questions
1. How effectively can existing instance selection algorithms for kNN be applied to ANN?
2. Can we design more effective instance selection algorithms for ANN?
3. What is the tradeoff between accuracy and speed?
25. Existing Instance Selection Algorithms
Criteria for inclusion in our experiments
Relatively fast
Relatively accurate in earlier comparative studies
FCNN, HMNEI (Garcia et al., 2012)
SPOCNN, RPOCNN (Olvera-Lopez et al., 2010)
DROP3 (Wilson and Martinez, 2000)
ENN (El Hindi and Al-Akhras, 2011)
26. Distance vs. Accuracy

Accuracy (%)
Data set       All regions  Border  Middle  Far   Border+Middle
Checker board  93.0         93.6    85.1    81.7  92.0
Nested square  83.0         80.5    80.7    72.9  83.0
Speaker Notes
Speeding up ANN learning means reducing instances; how does the reduction affect accuracy?
Mention the x-axis; mention FDS and Random. Bars show retention rate in descending order; the red line shows accuracy. As the retention rate decreases, accuracy generally decreases. Random does quite well: at 50% retention its accuracy is similar to ENN's. We will use Random as the reference later on.
Instances near the boundary are more important: they help define the shape of the boundary.
Remove RFI+ENN. Mention the columns and rows; focus on the RFI and Random rows; explain what blue and red mean. RFI is better than Random in 6 out of 7 data sets.
Remove RDI+ENN. RDI and RFI are more accurate than Random but keep more instances than Random, so the gain might be luck: keeping more instances usually gives higher accuracy. Is it because the algorithm is good, or because of the extra instances?
For each percent of reduction, how does the accuracy change? (Normalize by the reduction.) Usually negative, because removing instances reduces accuracy.
AIR table: explain the columns and rows, sorted by average AIR; red numbers mark the best algorithm for each data set. We observed that RDI is the best on average and the best in 4 out of 7 data sets. Do spend time on the positive values.
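The AIR ratio discussed above can be illustrated with the averages from the RFI and RDI result tables. The slides do not give AIR's exact definition, so this sketch assumes it is the accuracy change (relative to the full data set, FDS) per percent of instances removed; under that assumption, values closer to zero mean a better tradeoff:

```python
def air(acc_subset, acc_full, retention_pct):
    """Assumed AIR: accuracy change per percent of instances removed.
    Usually negative, since removing instances tends to reduce accuracy."""
    return (acc_subset - acc_full) / (100.0 - retention_pct)

# Averages from the result tables (FDS accuracy = 87.2).
air_rfi = air(86.5, 87.2, 83.7)  # RFI retains 83.7% on average
air_rdi = air(86.6, 87.2, 63.4)  # RDI retains 63.4% on average
# RDI loses less accuracy per percent of instances removed,
# consistent with the conclusion that RDI offers the better tradeoff.
```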