Modifications to a Minimizing Expected Risk-based Multi-class Image Classification Algorithm on Value-of-Information (VoI)
Zhuo Li (zhuol1@andrew.cmu.edu)
Abstract: Real-world image classification constantly faces the problem that vast numbers of images can be obtained easily through many kinds of technologies, yet few of them are correctly labeled, because manual labeling at that scale requires too much human work. However, by applying a proper active learning algorithm, a computer can complete the labeling process starting from a small number of human-labeled images and interactively querying the oracle (a human) for the true labels of the images most informative for learning. In my project, I made some modifications to an existing active learning algorithm [1] based on VoI (value-of-information) to perform the task of multi-class image classification.
Keywords: Machine Learning; Active Learning; Uncertainty Sampling
Introduction to the algorithm I used
My algorithm is an adaptation of the existing active learning algorithm proposed by Joshi, Ajay J., Fatih Porikli, and Nikolaos P. Papanikolopoulos [1], whose query selection strategy is Minimizing Expected Risk, from which I derived an uncertainty sampling algorithm to implement this multi-class bio-image classification project.
My adapted algorithm considers the misclassification risk of every image in the active (unlabeled) pool during the query selection phase, and uses a support vector machine (SVM) as the base learner.
Specifically, I randomly choose 300 samples, around 1/10 of the query limit, as the "seed" for both the active and random learners, and use batch-mode query selection with a batch size of 50. A minimal sketch of this loop appears below.
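The following Python sketch shows how this seed-and-batch loop could be organized; the names (X_pool, y_pool, select_batch, query_limit) are illustrative assumptions, not the project's actual code.

```python
import numpy as np
from sklearn.svm import SVC

SEED_SIZE = 300   # ~1/10 of the query limit
BATCH_SIZE = 50   # batch-mode query selection

rng = np.random.default_rng(0)

def active_learn(X_pool, y_pool, query_limit, select_batch):
    # Seed set: 300 samples drawn at random before active learning begins.
    labeled = set(rng.choice(len(X_pool), size=SEED_SIZE, replace=False).tolist())
    model = SVC(probability=True)  # SVM base learner

    queried = 0
    while queried < query_limit:
        idx = sorted(labeled)
        model.fit(X_pool[idx], y_pool[idx])      # retrain on the labeled pool
        unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
        # Pick the next 50 images to query (risk-based for the active learner,
        # uniform for the random learner).
        batch = select_batch(model, X_pool, unlabeled, BATCH_SIZE)
        labeled.update(batch)                    # oracle reveals y_pool[batch]
        queried += len(batch)
    return model
```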
Modifications:
In this project, I chose to modify the existing algorithm [1] built on the framework of VoI, whose query selection strategy accounts for two things: the misclassification risk and the cost of user annotation. I consider only the misclassification risk, not the annotation cost, because in this project the cost of querying any image in the training set is the same; there is only a limit on the number of queries, not a per-query cost. Hence, misclassification risk is the sole metric for selecting images to query in active learning.
The second modification I made is introduced as a "Note" in the first part of "Why this algorithm is suitable".
Why this algorithm is suitable
The algorithm I used is suitable for this multi-class bio-image classification task for the following two reasons:
1. The misclassification-risk strategy used in query selection:
In the query selection phase, the original algorithm computes the overall risk [1] of the whole system after learning each candidate image from the unlabeled pool, and compares it to the overall risk of the system before learning any of them. The algorithm then queries the image that causes the largest risk reduction, in other words, the one that reduces the overall risk the most.
The computation of the overall risk involves a risk matrix M, which encodes the weight of misclassifying each label as each other label. These weights can be set according to the real-world cost of each kind of error. For example, if the algorithm is used to recognize genes that cause different diseases, the weight for misclassifying a gene that causes tumors as one that causes color blindness can be very high, since that mistake is expensive, while the weight for the reverse mistake could be low.
NOTE:
However, computing the posterior risk of every image under every newly learned model (as a true Minimizing Expected Risk algorithm requires) has a very high time complexity: thousands of new models would have to be trained in a single iteration. I therefore made my second modification to the query selection phase: I instead compute the risk of misclassifying every image in the unlabeled pool under the current model, move the 50 images with the largest risks (batch mode) to the labeled pool, and retrain the active learning model on the new labeled pool. This turns the algorithm from a time-consuming Minimizing Expected Risk method into a time-complexity-friendly one.
The risk of misclassifying an image $x$ in the unlabeled pool is as follows:

$$\mathcal{R}_{\mathcal{L}}(x) = \sum_{i=1}^{k}\sum_{j=1}^{k} M_{ij}\, p_x^{i}(\mathcal{L})\, p_x^{j}(\mathcal{L})$$

where $\mathcal{L}$ is the labeled pool in each iteration of query selection, $k$ is the number of labels, $M$ is the risk matrix mentioned above, and $p_x^{i}(\mathcal{L})$ is the posterior probability of classifying image $x$ as label $i$ given the labeled pool, which can be computed without training thousands of new models.
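As a sketch of this modified risk computation (assuming model is the SVC trained on the current labeled pool and M is the k-by-k risk matrix; the variable names are mine):

```python
import numpy as np

def misclassification_risk(model, X_unlabeled, M):
    # p[n, i] = posterior probability of label i for image n, i.e. p_x^i(L).
    p = model.predict_proba(X_unlabeled)
    # R_L(x) = sum_i sum_j M_ij * p_x^i(L) * p_x^j(L), vectorized over all images.
    return np.einsum('ni,ij,nj->n', p, M, p)

# M may be asymmetric, as in the tumor / color-blindness example above.
# The 50 highest-risk images form the next batch:
#   batch = np.argsort(risks)[-50:]
```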
2. The support vector machine used as the base learner for multi-class classification:
Since the training data has 8 labels altogether, a single binary classifier is not sufficient. SVMs handle multi-class problems in mainly two ways, one-versus-rest and one-versus-one. I call SVC from Python's sklearn.svm module, which implements multi-class classification through the one-versus-one method [2]. With SVC, it is possible to train a multi-class model and obtain probabilities for every label, as in the usage sketch below.
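A minimal usage sketch with toy data (the arrays here are placeholders, not the project's bio-image features):

```python
import numpy as np
from sklearn.svm import SVC

X_train = np.random.rand(80, 20)        # toy stand-in data, 8 classes
y_train = np.random.randint(0, 8, 80)

clf = SVC(C=1.0, probability=True)      # one-versus-one under the hood
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_train[:5])  # one probability per label
print(proba.shape)                      # (5, 8)
```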
Performance of this algorithm
Besides the test error as a function of the amount of labeled data, as the requirements specify, I used another metric, success rate, to evaluate the active learner against a random learner. Success rate is effectively a projection of test error: it documents the rate at which a model's predictions on the test set are correct, and depicts prediction quality more explicitly. Two kinds of figures are provided here for evaluation.
In addition, I ran the EASY and MODERATE datasets 10 times each and averaged the success rate and test error, to smooth out the randomness of the "seed set" (which is picked at random from the training data before the active learning process begins) and so make the evaluation more comprehensive.
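For concreteness, a small sketch of how the two metrics relate; the helper name and inputs are assumptions:

```python
import numpy as np
from sklearn.metrics import accuracy_score

def evaluate(per_run_predictions, y_test):
    # per_run_predictions: one array of test-set predictions per run (10 runs).
    success_rates = [accuracy_score(y_test, y_hat) for y_hat in per_run_predictions]
    avg_success_rate = float(np.mean(success_rates))
    avg_test_error = 1.0 - avg_success_rate   # the two metrics mirror each other
    return avg_success_rate, avg_test_error
```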
Note that the figures include a parameter C, the penalty parameter of the SVM, which is explained later in the Findings part.
EASY DATASET:
Figure 1: One-time Test Errors and Success Rate versus Amount of Labeled Points for EASY dataset
Findings & Explanations of Figures
For Parameter C
I set C to 1.0 for the EASY and DIFFICULT datasets and 0.9 for the MODERATE dataset, for the following reasons:
C is the penalty parameter of the base learner SVM; it controls the influence of misclassification on the objective function [3]. In other words, C determines the model's "faith" in the training data: if C is too large, the SVM "trusts" the training data too much, which may cause overfitting; if C is too small, the SVM does not "trust" the training data enough, which may cause underfitting. Choosing a good C therefore matters.
In SVC, C defaults to 1.0, a trade-off between bias and variance. For the low-noise EASY and DIFFICULT datasets, I keep the default of 1.0. For the MODERATE set, whose training data contains some noise whose influence I need to minimize, I choose C = 0.9, which makes the model "trust" the training data a little less and so helps avoid overfitting.
Four "success rate" graphs with different values of C are provided in Fig. 6 below; the differences in prediction accuracy between the active and random learners account for the suitability of C = 0.9, rather than other values, for the MODERATE set. A sketch of how such a comparison can be run appears after the figure caption.
Figure 6: Average Test Errors for MODERATE dataset with different Cs
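A hedged sketch of comparing candidate C values via cross-validation; apart from 0.9 and 1.0, the grid of values is my assumption, and the data is a stand-in:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_train = np.random.rand(200, 20)      # stand-in for the MODERATE training data
y_train = np.random.randint(0, 8, 200)

for C in (0.5, 0.8, 0.9, 1.0):         # candidate penalty values
    scores = cross_val_score(SVC(C=C), X_train, y_train, cv=5)
    print(f"C={C}: mean accuracy {scores.mean():.3f}")
```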
For Easy Dataset
For the EASY dataset, we can observe that the active learner outperforms the random learner, both in prediction accuracy and in the speed at which it reaches its best performance.
The active learner ends with around 77 errors out of 1000 predictions, while the random learner ends with around 92 errors out of 1000. These are the final performances of both learners: in Fig. 2, the average performance over 10 runs, the curves of the two learners flatten out at the end.
From the average performance, we can see that, at the beginning, the active learner may perform slightly worse than the random learner. This is because the active learner queries the most informative images, in other words, the riskiest ones, which are most likely to lie near the decision boundaries, while the random learner picks images to query at random; early on, this can temporarily cause the active learner to underperform. However, as the amount of labeled points increases, the active learner clearly outperforms the random one.
What's more, the success rate of the active learner reaches its peak of 92.4% well before the random learner reaches its peak of 91.1%, which shows that the active learner learns faster than the random one.
For Moderate Dataset
For the MODERATE dataset, I set the penalty parameter C to 0.9 to reduce the influence of noise. We can observe in Fig. 4 that the active learner outperforms the random learner, both in prediction accuracy and in learning speed.
The active learner has around 150 errors out of 1000 predictions, while the random learner has around 166 errors out of 1000.
From the average performance, we can see that, at the beginning, the active learner may perform slightly worse than the random learner, as in the EASY set; overall, however, the active learner outperforms the random one. The performance on the MODERATE set is not as good as on the EASY set because the MODERATE training set contains a certain amount of noise.
In addition, the active learner's success rate peaks at 85%, while the random learner's peak accuracy is 83.6%. The active learner's curve also flattens earlier as the two learners approach their peaks, which again shows that the active learner learns faster than the random one.
For Difficult Dataset
For the DIFFICULT dataset, I performed feature selection both before and within each iteration of active learning.
A tree-based feature selection method from Python's sklearn.feature_selection module [4] is applied here, and feature selection is done before training the active and random learners in order to exclude the negative influence of unrelated features. A sketch of this step appears below.
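The following sketch mirrors the tree-based approach in the scikit-learn documentation [4], using ExtraTreesClassifier with SelectFromModel; the data shapes are stand-ins:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X_train = np.random.rand(200, 50)      # stand-in: ~50 features, about half noise
y_train = np.random.randint(0, 8, 200)

trees = ExtraTreesClassifier(n_estimators=100).fit(X_train, y_train)
selector = SelectFromModel(trees, prefit=True)  # keeps features whose importance
X_reduced = selector.transform(X_train)         # exceeds the mean importance
print(X_reduced.shape)                 # roughly half of the features survive
```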
From Fig. 7 below, we can see that the number of related features in the training data for the DIFFICULT set is around 23 to 26, which means nearly half of the features are unrelated; these are successfully excluded by the feature selection process.
Figure 7: Numbers of selected features for the active and random learners at the end
In addition, Fig. 5 shows that the active learner has around 130 errors out of 1000 predictions, while the random learner has around 153 errors out of 1000. For nearly the whole run, the active learner outperforms the random learner, with a peak accuracy of 87% against the random learner's peak of 84.7%.
References
1. Joshi, Ajay J., Fatih Porikli, and Nikolaos P. Papanikolopoulos. "Scalable active learning for multiclass image classification." IEEE Transactions on Pattern Analysis and Machine Intelligence 34.11 (2012): 2259-2273.
2. http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn-svm-svc
3. http://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel
4. http://scikit-learn.org/stable/modules/feature_selection.html