1. An Analysis of Instance Selection for Neural Networks to Improve Training Speed
Xunhu Sun
Department of Computer Sciences
Florida Institute of Technology
Melbourne, FL 32901, USA
sunx2013@my.fit.edu
Philip K. Chan
Department of Computer Sciences
Florida Institute of Technology
Melbourne, FL 32901, USA
pkc@cs.fit.edu
Speaker: Xunhu Sun
2. Reducing Training Instances
Instance selection
Select a subset
ignore the rest
Tradeoff
Amount of data vs. predictive accuracy
Overall research question
How to select a subset without sacrificing much accuracy?
3. How effectively can existing instance selection algorithms for kNN be applied to ANN?
Research Question 1
7. Remove Far Instances (RFI)
For each instance x
Calculate "enemyDistance": the distance to the closest instance of another class (a proxy for distance from the decision boundary)
Calculate the average and standard deviation of enemyDistance
Remove instances whose enemyDistance is farther than:
average + standard deviation
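The RFI steps above can be sketched in Python. This is a minimal illustration with brute-force distances; the function and variable names are mine, not from the paper:

```python
import numpy as np

def remove_far_instances(X, y):
    """RFI sketch: drop instances whose 'enemyDistance' (distance to the
    closest instance of another class) exceeds mean + one std."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # Pairwise Euclidean distances (brute force; fine for small data).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # enemyDistance: distance to the nearest instance of a different class.
    enemy = np.array([dist[i, y != y[i]].min() for i in range(len(X))])
    threshold = enemy.mean() + enemy.std()
    keep = enemy <= threshold
    return X[keep], y[keep]

# Toy example: the class-1 point far from the boundary is removed.
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [10, 10]]
y = [0, 0, 0, 1, 1, 1]
X_kept, y_kept = remove_far_instances(X, y)
```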
8. RFI: Accuracy and Retention Rate

Accuracy (%)
Algorithm   Bupa   HaberMan  Heart  Ionosphere  Iris   WDBC   Wine   Average
FDS         71.5   74.5      83.2   89.7        96.3   97.3   98.0   87.2
Random      68.1   74.2      82.5   85.4        94.1   97.1   97.1   85.5
RFI         71.8*  74.7      81.6   86.1        96.1*  97.2   98.0   86.5

Retention Rate (% of full data set)
RFI         86.5   87.3      81.7   82.5        76.8   87.3   83.9   83.7

* RFI has significantly higher accuracy than Random, based on a t-test with 95% confidence.
10. Remove Dense Instances (RDI)
k-distance: the distance of an instance from its k-th nearest neighbor of the same class
For each class c
For each instance x in class c, calculate its k-distance
Calculate the average k-distance, which is the dense threshold for class c
While the lowest k-distance is less than the dense threshold
Remove the instance y with the lowest k-distance
Update the k-distance of instances that had y as one of their k nearest neighbors
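The RDI loop above can be sketched as follows. For simplicity this version recomputes all k-distances in a class after each removal, rather than updating only the affected neighbors as the slide describes; names are mine, not from the paper:

```python
import numpy as np

def remove_dense_instances(X, y, k=3):
    """RDI sketch: within each class, repeatedly remove the instance with
    the smallest k-distance (distance to its k-th nearest same-class
    neighbor) while that k-distance is below the class's dense threshold
    (the average k-distance, computed once at the start)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = np.ones(len(X), dtype=bool)

    def k_distance(idx, pool):
        # Distance from X[idx] to its k-th nearest neighbor among pool.
        d = np.sort(np.sqrt(((X[pool] - X[idx]) ** 2).sum(axis=1)))
        return d[k] if len(d) > k else np.inf  # d[0] is the point itself

    for c in np.unique(y):
        members = list(np.where(y == c)[0])
        kd = {i: k_distance(i, members) for i in members}
        threshold = np.mean(list(kd.values()))  # dense threshold for class c
        while members and min(kd[i] for i in members) < threshold:
            densest = min(members, key=lambda i: kd[i])
            members.remove(densest)
            keep[densest] = False
            # Recompute k-distances for the remaining class members.
            kd = {i: k_distance(i, members) for i in members}
    return X[keep], y[keep]

# Toy example: thinning a tight class-0 cluster and an even class-1 line.
X = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5], [6, 6],
     [100, 0], [101, 0], [102, 0]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_kept, y_kept = remove_dense_instances(X, y, k=2)
```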
11. RDI: Accuracy and Retention Rate

Accuracy (%)
Algorithm   Bupa   HaberMan  Heart  Ionosphere  Iris   WDBC   Wine   Average
FDS         71.5   74.5      83.2   89.7        96.3   97.3   98.0   87.2
Random      68.1   74.2      82.5   85.4        94.1   97.1   97.1   85.5
RDI         68.6   74.1      83.4   87.6*       97.8*  96.5   98.0   86.6

Retention Rate (% of full data set)
RDI         55.7   64.2      65.6   65.4        66.7   58.8   67.8   63.4

* RDI has significantly higher accuracy than Random, based on a t-test with 95% confidence.
12. What is the tradeoff between
accuracy and speed?
Research Question 3
16. Concluding Remarks
Applying existing instance selection algorithms to ANN
Generally lower accuracy when fewer instances are retained
Randomly selecting 50% is competitive
Proposed Algorithms
RFI – removing instances far from the decision boundary
RDI – removing instances from dense regions
Tradeoff between accuracy and training time
AIR Ratio: RDI is more effective in 5 out of 7 data sets
18. Training ANN
Iterative update until convergence
Relatively slow compared to other ML algorithms
Could take minutes/hours…
Reducing training time
More effective updates to speed up convergence
Fewer training instances/data
19. Artificial Neural Networks (ANN)
A machine learning (ML) algorithm
Used in many applications
E.g. credit card fraud detection, autonomous driving
https://www.youtube.com/watch?v=DWNtsS2kZWs
20. Related Work on Instance Selection
k-Nearest Neighbor (kNN) – studied extensively:
Garcia et al., 2012; Olvera-Lopez et al., 2010; Wilson and Martinez, 2000
Condensation methods
Retaining boundary points (to help identify the boundary)
Relatively larger data reduction
Edition methods
Removing boundary and noisy points to reduce overfitting
Relatively smaller data reduction
Hybrid methods
Artificial Neural Network
El Hindi and Al-Akhras, 2011
Smoothing the decision boundary to reduce overfitting
21. Data Sets (UCI ML Repository)

              Bupa  HaberMan  Heart  Ionosphere  Iris  WDBC  Wine
#Attributes      6         3     13          34     4    30    13
#Classes         2         2      2           2     3     2     3
#Instances     345       306    270         351   150   569   177
22. ANN
Training set: 2/3
Test set: 1/3
Parameters
Number of output units
1 (two-class problems); n (n-class problems)
Number of hidden units
3, 5, and 10
Number of iterations
Determined by 3-fold cross validation on the training set
Learning rate and momentum
0.1
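The settings on this slide might look as follows in scikit-learn. This is a hypothetical reconstruction for illustration only; the authors' actual ANN implementation is not shown in the slides, and `max_iter` here merely stands in for the iteration count they chose by 3-fold cross validation:

```python
from sklearn.neural_network import MLPClassifier

def make_ann(hidden_units, max_iter):
    # One single-hidden-layer network per setting on the slide.
    return MLPClassifier(
        hidden_layer_sizes=(hidden_units,),
        solver="sgd",            # gradient descent with momentum
        learning_rate_init=0.1,  # learning rate from the slide
        momentum=0.1,            # momentum from the slide
        max_iter=max_iter,       # placeholder for the CV-selected count
    )

# The three hidden-unit settings used in the experiments.
models = [make_ann(h, max_iter=200) for h in (3, 5, 10)]
```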
23. Evaluation Criteria
Accuracy of ANN on the test set
Retention rate (% of instances retained)
Average of 30 runs (10 runs * 3 hidden-layer settings)
24. Research Questions
1. How effectively can existing instance selection algorithms for kNN be applied to ANN?
2. Can we design more effective instance selection algorithms for ANN?
3. What is the tradeoff between accuracy and speed?
25. Existing Instance Selection Algorithms
Criteria for inclusion in our experiments
Relatively fast
Relatively accurate in earlier comparative studies
FCNN, HMNEI (Garcia et al., 2012)
SPOCNN, RPOCNN (Olvera-Lopez et al., 2010)
DROP3 (Wilson and Martinez, 2000)
ENN (El Hindi and Al-Akhras, 2011)
26. Distance vs. Accuracy

Accuracy (%)
Data set       All regions  Border  Middle  Far   Border+Middle
Checker board  93.0         93.6    85.1    81.7  92.0
Nested square  83.0         80.5    80.7    72.9  83.0
Speaker Notes
Speeding up ANN learning means reducing instances; how does the reduction affect accuracy?
Mention the x-axis; mention FDS and Random. Bars show retention rate in descending order; the red line shows accuracy. As the retention rate decreases, accuracy generally decreases. Random does quite well: at 50% retention its accuracy is similar to ENN's. We will use Random as the reference later on.
Instances near the boundary are more important: they help define the shape of the boundary.
Remove RFI+ENN. Mention the columns and rows; focus on the RFI and Random rows; explain what blue and red mean. RFI is better than Random in 6 out of 7 data sets.
Remove RDI+ENN. RDI and RFI are more accurate than Random but keep more instances than Random, so the gain might be luck: keeping more instances usually gives higher accuracy. Is it because the algorithm is good, or because of the extra instances?
For each percent of reduction, how does the accuracy change? (Normalize by the reduction.) Usually negative, because removing instances reduces accuracy.
AIR table: explain the columns and rows, sorted by average AIR; red numbers mark the best algorithm for each data set. We observed that RDI is the best on average and the best in 4 out of 7 data sets. Do spend time on the positive values.
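The AIR ratio discussed above can be illustrated with the averages from the RFI and RDI result tables. The slides do not give AIR's exact definition, so this sketch assumes it is the accuracy change (relative to the full data set, FDS) per percent of instances removed; under that assumption, values closer to zero mean a better tradeoff:

```python
def air(acc_subset, acc_full, retention_pct):
    """Assumed AIR: accuracy change per percent of instances removed.
    Usually negative, since removing instances tends to reduce accuracy."""
    return (acc_subset - acc_full) / (100.0 - retention_pct)

# Averages from the result tables (FDS accuracy = 87.2).
air_rfi = air(86.5, 87.2, 83.7)  # RFI retains 83.7% on average
air_rdi = air(86.6, 87.2, 63.4)  # RDI retains 63.4% on average
# RDI loses less accuracy per percent of instances removed,
# consistent with the conclusion that RDI offers the better tradeoff.
```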