Semi-automatic ground truth generation using unsupervised clustering and limited manual labeling: Application to handwritten character recognition
Szilárd Vajda, Yves Rangoni, Hubert Cecotti
Pattern Recognition Letters, 2015
6. Feature representations
• Raw pixels
– pixel intensities of the raw image
• Profiles (upper/lower/left/right)
– consider only the outer shape of the character
– e.g. the upper profile records, for each column, the distance from the top edge of the image to the closest foreground pixel
• Local Binary Patterns (LBP)
– a local, rotation-invariant texture representation
L. Heutte, T. Paquet, J.V. Moreau, Y. Lecourtier, C. Olivier, A structural/statistical feature-based vector for handwritten character recognition, Pattern Recognit. Lett. 19 (7) (1998) 629–641.
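The upper-profile feature described above can be sketched as follows; this is a minimal NumPy version (the exact thresholding and normalization used in the paper may differ):

```python
import numpy as np

def upper_profile(img, threshold=0):
    """For each column, the distance from the top edge of the image to
    the first foreground pixel (the image height if the column is empty)."""
    h, w = img.shape
    fg = img > threshold                              # foreground mask
    # argmax gives the row index of the first True per column;
    # columns with no foreground pixel fall back to h
    return np.where(fg.any(axis=0), fg.argmax(axis=0), h)

# Toy 5x4 "character": a diagonal stroke
img = np.zeros((5, 4), dtype=np.uint8)
for c in range(4):
    img[c + 1, c] = 1
print(upper_profile(img))  # [1 2 3 4]
```

The lower, left, and right profiles follow the same pattern by flipping the image or scanning along the other axis.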
7. Feature representations
• Radon transform
– takes multiple parallel-beam projections of the image from different angles
• Encoder network
– a type of deep learning architecture
– data-driven representation
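A parallel-beam projection is just a sum of pixel intensities along one direction. The sketch below computes the two axis-aligned projections (0° and 90°) in plain NumPy; projections at arbitrary angles additionally require rotating the image first (e.g. with `skimage.transform.radon`):

```python
import numpy as np

def projections_0_90(img):
    """Parallel-beam projections at 0 and 90 degrees: sums of pixel
    intensities along columns and along rows, respectively."""
    return img.sum(axis=0), img.sum(axis=1)

img = np.array([[0, 1, 0],
                [1, 1, 1],
                [0, 1, 0]], dtype=float)
p0, p90 = projections_0_90(img)
print(p0)   # [1. 3. 1.]
print(p90)  # [1. 3. 1.]
```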
8. Classifiers
• Definitions
– the number of patterns that should be assigned to the i-th class
– the number of patterns that are assigned to that class after classification
– 𝑁𝑝 = 𝑁𝑑𝑒𝑐 + 𝑁𝑟𝑒𝑗
– 𝑁𝑑𝑒𝑐: patterns that have a class assigned; 𝑁𝑟𝑒𝑗: patterns that are rejected (no class assigned)
– 𝑁+ / 𝑁−: patterns that have been correctly / incorrectly classified
• Voting scheme: consensus, majority voting
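The two voting rules and the decomposition N_p = N_dec + N_rej can be illustrated with a small sketch (the function and its tie-handling are illustrative assumptions, not the paper's exact rule):

```python
from collections import Counter

def majority_vote(labels, consensus=False):
    """Combine the predictions of several classifiers for one pattern.
    consensus=True: all classifiers must agree, otherwise reject (None).
    consensus=False: plain majority voting; reject on a tie."""
    counts = Counter(labels).most_common()
    if consensus:
        return labels[0] if len(counts) == 1 else None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None                        # tie -> rejected
    return counts[0][0]

# Three patterns, each voted on by three classifiers
votes = [["3", "3", "3"], ["3", "3", "8"], ["3", "8", "5"]]
decisions = [majority_vote(v) for v in votes]
n_dec = sum(d is not None for d in decisions)
n_rej = sum(d is None for d in decisions)
assert n_dec + n_rej == len(votes)         # N_p = N_dec + N_rej
print(decisions, n_dec, n_rej)  # ['3', '3', None] 2 1
```

Consensus voting rejects more patterns but the kept labels are more reliable, which matters when the labels feed a downstream classifier.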
9. Classifiers
• Unsupervised clustering
– K-means clustering (Lloyd's algorithm)
– Self-Organizing Map (SOM): a neural network trained in an unsupervised fashion to produce a two-dimensional mapping of the input data
– Growing Neural Gas (GNG): imposes no constraints on the topology, unlike the SOM
• Supervised classification
– the k-nearest neighbor (k-NN) classifier
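A minimal k-NN classifier, as used here for the supervised step, can be written in a few lines of NumPy (Euclidean distance and majority vote; the paper's distance metric and k are not assumed):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Label of x by majority vote among its k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = y_train[np.argsort(d)[:k]]      # labels of the k closest
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

# Toy 2-D data: two well-separated classes
X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.5])))  # 0
print(knn_predict(X, y, np.array([5.1, 5.5])))  # 1
```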
10. Classifiers
• Evaluation
– measures that combine inter-class and intra-class variances
– measure the reliability of the labeling strategy
– X: total number of vectors to be clustered
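One common way to combine inter-class and intra-class variances is their ratio: tight clusters that are far apart score high. The sketch below implements such a generic variance-ratio index; it is not necessarily the exact measure used in the paper:

```python
import numpy as np

def variance_ratio(X, labels):
    """Between-class variance divided by within-class variance.
    Larger values indicate tighter, better-separated clusters.
    (A generic index, assumed for illustration.)"""
    mean = X.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        centroid = Xc.mean(axis=0)
        between += len(Xc) * np.sum((centroid - mean) ** 2)
        within += np.sum((Xc - centroid) ** 2)
    return between / within

# Two compact, well-separated clusters -> large ratio
X = np.array([[0., 0.], [0.1, 0.], [5., 5.], [5.1, 5.]])
labels = np.array([0, 0, 1, 1])
print(variance_ratio(X, labels))
```

Scaling this ratio by the numbers of clusters and samples gives the classic Calinski–Harabasz index.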
11. Dataset
• MNIST
– Arabic digits
– 10 classes (0, 1, …, 9)
– 60,000 training / 10,000 test images
• Lampung
– a multi-writer handwritten collection produced by 82 high school students from Bandar Lampung, Indonesia
– 20 character classes
– 23,447 characters for training
– 7,853 characters for testing
18. Results
• Classification performance under different voting schemes
– a fully connected multi-layer perceptron (MLP) classifier
– accuracies: 96.69 / 96.74 / 96.77
– the network is more sensitive to samples with wrong labels than the k-NN classifier
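For reference, the forward pass of a fully connected one-hidden-layer MLP can be sketched as below. The layer sizes (784 inputs for a flattened 28×28 MNIST digit, 100 hidden units, 10 classes) and the sigmoid activation are illustrative assumptions, not the architecture reported in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer with sigmoid activation, softmax output."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))  # hidden activations
    z = W2 @ h + b2                           # class scores
    e = np.exp(z - z.max())                   # numerically stable softmax
    return e / e.sum()                        # class probabilities

n_in, n_hidden, n_classes = 784, 100, 10      # MNIST-sized layers (assumed)
W1 = rng.normal(0, 0.01, (n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.01, (n_classes, n_hidden))
b2 = np.zeros(n_classes)

x = rng.random(n_in)                          # stand-in for a flattened digit
p = mlp_forward(x, W1, b1, W2, b2)
print(p.argmax(), round(float(p.sum()), 6))
```

Because the weights are fit to the (possibly noisy) labels by gradient descent, mislabeled samples shift the decision boundaries directly, which is consistent with the sensitivity noted above.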
19. Conclusion
• Semi-automatic labeling scheme with minimal human
involvement.
• The labels discovered by this scheme are evaluated with a k-NN classifier and compared against randomly selected labeled samples and the fully labeled dataset.