The document describes a study exploring the noise resilience of classification algorithms, comparing the Naive Bayes classifier, the k-Nearest Neighbors (kNN) classifier, and the Combined Sturges classifier. It introduces a noise model that adds artificial noise to datasets at the attribute level. Several artificial and real-world datasets are tested at different noise levels, and the three classifiers are evaluated on the noisy datasets using metrics such as accuracy, precision, recall, and F-measure. Preliminary results on noise-free datasets show that kNN generally performs best, while Combined Sturges and Naive Bayes results vary across datasets. The document analyzes how classifier performance is affected by increasing noise levels.
3. Motivation
A study on Noise?
Real-world datasets are noisy:
Recordings under normal environmental conditions
Equipment measurement error
Most algorithms ignore noise.
Relatively little research has been done on noise resilience.
Aim: explore the robustness of classification algorithms to noise.
Which algorithm is least affected by noisy datasets?
Akrita Agarwal, Exploring the Noise Resilience of the Combined Sturges Algorithm, November 7, 2015
5. Classification
Classification: assigning a new observation to one of a set of known categories.
Companies store large amounts of data.
An effective classifier can assist in making good predictions and informed business decisions.
E.g., whether to recommend Prime products to non-Prime customers based on their behavior.
6. Classification Algorithms
Two broad kinds of classifiers:
Frequency-based classifiers use the frequency of data points in the dataset to determine the class membership of a given test point.
Geometry-based classifiers leverage geometrical aspects of the dataset, such as distance.
7. Naive Bayes
The Naive Bayes Classifier
Frequency based classifier
Computes the probability that a test data point belongs to each class
Class probabilities are extracted from the training data.
Pros
Intuitive to understand and build.
Easily trained, even with a small dataset
It’s fast
Cons
Assumes conditional independence of the attributes
Ignores the underlying geometry of the data.
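As a minimal sketch, the classifier described above can be exercised with scikit-learn's GaussianNB (one common Naive Bayes variant; the slides do not say which variant the study implemented). The Iris data and split parameters here are illustrative assumptions:

```python
# Sketch: Gaussian Naive Bayes on Iris (one of the study's datasets).
# The variant, split size, and random seed are assumptions for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fits one Gaussian per (class, attribute), reflecting the conditional
# independence assumption listed under "Cons" above.
clf = GaussianNB().fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # held-out accuracy
```

Training is a single pass over the data, which is why the slide can say "easily trained, even with a small dataset".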
8. k Nearest Neighbors
The k Nearest Neighbors Classifier
Geometry based classifier
Assigns a class to a test data point by taking the majority class of its k nearest points
Pros
Easy to implement and understand
Classes don’t have to be linearly separable
Cons
Tends to ignore the relative importance of attributes; uses all of them equally
Only indirectly takes the frequency of the data into account
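A matching sketch for the kNN classifier above, again using scikit-learn; k = 5 and the split parameters are assumptions, since the slides do not specify the value of k used in the study:

```python
# Sketch: k-Nearest Neighbors on Iris. k = 5 is an assumed, illustrative choice.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Prediction is a majority vote among the 5 nearest training points,
# measured by Euclidean distance over all attributes equally (the "Cons"
# above: every attribute gets the same weight).
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # held-out accuracy
```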
10. Combined Sturges
The Combined Sturges (CS) Classifier
Explicitly uses geometry + frequency
Data are represented as a frequency distribution per class.
A classification score is computed for each class.
The test point is assigned to the class with the lowest score (the score acts as a penalty).
Continuous data values are binned.
No. of bins = 1 + log2(n)
Sturges, 1926, "The Choice of a Class Interval"
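The binning rule above is easy to state in code; rounding up to a whole number of bins is an assumption here, since the slide gives only the formula:

```python
import math

def sturges_bins(n):
    """Number of histogram bins by Sturges' (1926) rule: 1 + log2(n).

    Rounding up to an integer is an assumption; the slide states only
    the formula itself.
    """
    return math.ceil(1 + math.log2(n))

print(sturges_bins(150))  # Iris has 150 samples -> 9 bins
print(sturges_bins(768))  # Pima Diabetes has 768 samples
```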
18. Combined Sturges
Combined Criterion
Test point: T1 = (3, 4)
Per-value weighted distance: d · f = |T1 − A| × f(A), for each attribute value A
Aggregate Expected Distance per class: ED = ED_A1 × ED_A2
The test point goes to the class with the minimum Expected Distance, ED.

Table: Aggregate Expected Distance, ED

Class 0:
A1   f(A1)   d·f      A2   f(A2)   d·f
1    0.25    0.50     1    0.50    1.50
3    0.25    0        2    0.25    0.50
4    0.50    0.50     3    0.25    0.25
ED0_A1 = 1.00         ED0_A2 = 2.25

Class 1:
A1   f(A1)   d·f      A2   f(A2)   d·f
1    0.25    0.50     2    0.75    1.50
2    0.25    0.25     3    0.25    0.25
3    0.50    0
ED1_A1 = 0.75         ED1_A2 = 1.75
19. Combined Sturges
Classification Penalty
S(0)_ED = ED0_A1 × ED0_A2 = 1.00 × 2.25 = 2.25
S(0) = ED × (1 − P(Class 0)) = 2.25 × (1 − 0.5) = 1.125
S(1)_ED = ED1_A1 × ED1_A2 = 0.75 × 1.75 = 1.31
S(1) = ED × (1 − P(Class 1)) = 1.31 × (1 − 0.5) = 0.655
S(0) > S(1), so T1 is assigned to Class 1.
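The worked example on slides 18 and 19 can be reproduced with a short sketch. The frequency tables and equal class priors below are read off the slides' numbers; the function name and structure are illustrative:

```python
# Sketch of the Combined Sturges scoring step from the worked example.
# Class-conditional frequency tables, one dict per attribute: value -> f(value).
freq = {
    0: [{1: 0.25, 3: 0.25, 4: 0.50},   # attribute A1, class 0
        {1: 0.50, 2: 0.25, 3: 0.25}],  # attribute A2, class 0
    1: [{1: 0.25, 2: 0.25, 3: 0.50},   # attribute A1, class 1
        {2: 0.75, 3: 0.25}],           # attribute A2, class 1
}
prior = {0: 0.5, 1: 0.5}  # equal priors, consistent with the slide's arithmetic
test_point = (3, 4)       # T1

def cs_score(x, cls):
    """Penalty: product of per-attribute expected distances, times (1 - prior)."""
    ed = 1.0
    for attr_val, table in zip(x, freq[cls]):
        # Expected distance for one attribute: sum of |x - v| * f(v).
        ed *= sum(abs(attr_val - v) * f for v, f in table.items())
    return ed * (1 - prior[cls])

scores = {c: cs_score(test_point, c) for c in freq}
print(scores)                       # {0: 1.125, 1: 0.65625}
print(min(scores, key=scores.get))  # lowest penalty wins: class 1
```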
20. The Noise Model
21. The Noise Model
Dealing with Noise
Brodley & Friedl, 1999: detect and reduce noise
Kubica & Moore, 2003: identify noise using a probabilistic model and remove it
Kalapanidas et al., 2003: developed a noise model based on data properties
22. The Noise Model
Additive noise: x'_ij = x_ij + δx_ij
δx_ij = σ_xj × z_ij
σ_xj: standard deviation of attribute j
z_ij = CDF(p_ij)

x'_ij = x_ij           if p_ij ≥ n
x'_ij = x_ij + δx_ij   if p_ij < n    (1)

Based on noise level n ∈ {0, 0.15, 0.30, 0.50, 0.80}
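A minimal sketch of the attribute-level noise model above. The slide leaves the exact CDF transform implicit, so drawing the noise magnitude z as a standard normal deviate is an assumption here, as is the function name:

```python
import numpy as np

def add_attribute_noise(X, n, seed=None):
    """Additive attribute-level noise, sketching Eq. (1) above.

    For each cell (i, j): draw p ~ Uniform(0, 1); if p < n, add
    sigma_j * z, where sigma_j is attribute j's standard deviation and
    z is a standard normal deviate (an assumed reading of the CDF step).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    sigma = X.std(axis=0)                  # per-attribute standard deviation
    p = rng.uniform(size=X.shape)          # noise-trigger draws
    z = rng.standard_normal(X.shape)       # noise magnitudes
    return np.where(p < n, X + sigma * z, X)

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(add_attribute_noise(X, n=0.0, seed=0))  # n = 0: no cells perturbed
```

With n = 0 the data pass through unchanged, matching the n = 0 entry in the study's noise-level set.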
25. Datasets
Artificial datasets
Multivariate Normal:
x1 = random Normal vector, t = random Normal vector
x2 = 0.8·x1 + 0.6·t
x3 = 0.6·x1 + 0.8·t
x4 = t
Linear Function with Non-normal inputs:
x2 = (x1)^2 + 0.5·t
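The Multivariate Normal construction above can be sketched directly; the sample size of 200 matches the A1/A2 datasets in the next slide's table, while the seed is an assumption:

```python
import numpy as np

# Sketch of the "Multivariate Normal" artificial dataset construction.
# n = 200 matches the artificial datasets A1/A2; the seed is illustrative.
rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)  # random Normal vector
t = rng.standard_normal(n)   # random Normal vector

x2 = 0.8 * x1 + 0.6 * t      # correlated with x1 (population corr = 0.8)
x3 = 0.6 * x1 + 0.8 * t
x4 = t
X = np.column_stack([x1, x2, x3, x4])
print(X.shape)  # (200, 4)
```

The coefficients (0.8, 0.6) keep each derived attribute at unit variance while controlling its correlation with x1.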
26. Datasets
2 Artificial datasets with different imbalance ratios
3 Real datasets

Table: Comparison of physical properties of the datasets.

Dataset         Samples   Classes   Attributes   Attribute Values   Imbalance Ratio
Haberman        306       2         3            Integer            2.78
A1              200       3         4            Real               6.66
A2              200       3         4            Real               39
Iris            150       3         4            Real               2
Pima Diabetes   768       2         8            Integer, Real      1.87
27. Process Flow
1 Create Artificial Datasets
2 Implement the Noise model on all Datasets
3 Apply the three algorithms
4 Compare the results
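The four steps above can be sketched end to end. Everything here is an illustrative assumption: the dataset construction, the noise transform, k = 5, and the choice to show only two of the three classifiers (scikit-learn ships no Combined Sturges implementation):

```python
# End-to-end sketch of the process flow: create data, inject noise,
# apply classifiers, compare. All parameters are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# 1. Create an artificial two-class dataset (class means 2 sigma apart).
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

# 2. Implement the noise model: perturb each cell with probability n.
n = 0.30
mask = rng.uniform(size=X.shape) < n
X_noisy = X + mask * X.std(axis=0) * rng.standard_normal(X.shape)

# 3. Apply the algorithms (CS omitted: no off-the-shelf implementation).
X_tr, X_te, y_tr, y_te = train_test_split(X_noisy, y, test_size=0.3, random_state=0)
results = {
    "kNN": KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te),
    "NaiveBayes": GaussianNB().fit(X_tr, y_tr).score(X_te, y_te),
}

# 4. Compare the results.
print(results)
```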
38. Conclusion
No single algorithm is best overall.
In general, kNN has better accuracy, but CS is more robust to noise.
Naive Bayes degrades under noise much more than the others.
Also:
CS performs well on imbalanced datasets.
39. Future Work
Test with more datasets.
Test performance on imbalanced datasets.
Only an additive noise model was used; try other variations.
Compare with more algorithms.