Machine Learning – Spring 2009 – Project 1
Due Date: March 6, 2009
Consider the images of handwritten characters in Figure 1.
Figure 1. Some handwritten character images.
Without looking at the figure on the next page, how well can you identify the characters? (Your answer
should be “Not very well”). However, if you look at Figure 4, the identities of the characters should be
obvious. This observation leads to a somewhat paradoxical situation if we want to use machine learning
to write a program that reads handwritten words, i.e. we can’t read the characters without reading the
words, and we can’t read the words without reading the characters! One response to this is to have the
program segment an image of a handwritten word into many small segments, each hopefully consisting of
a single character or part of a character. Then, given a string representation of a handwritten
word, use dynamic programming to find the best way to put the segments together to match the string and
assign the match a score. The match score is built from scores assigned to individual segments as
depicted in Figure 2.
[Figure 2 diagram: an input image passes through image segmentation to produce primitives numbered 1 to 25. Best match to "Richmond": R=53, i=27, c=52, h=61, m=70, o=43, n=61, d=88. Best match to "Edmund": E=12, d=79, m=85, u=25, n=61, d=88.]
Figure 2. Dynamic Programming Approach to Handwritten Word Recognition
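The dynamic programming match described above can be sketched as follows. This is only an illustration under stated assumptions, not the method used by any particular system: the match score is taken to be the sum of the per-character scores, every primitive must be used, each character covers between 1 and `max_span` consecutive primitives, and `score_fn(start, end, ch)` is a hypothetical callback that returns how much the union of primitives `start..end-1` looks like the character `ch`.

```python
import math

def best_word_match(num_primitives, word, score_fn, max_span=4):
    """Best segmentation of primitives 0..num_primitives-1 into len(word)
    consecutive groups, maximizing the summed character scores (assumed rule).
    Returns (total_score, [(start, end), ...]) or (-inf, None) if impossible."""
    NEG = -math.inf
    n_chars = len(word)
    # best[j][p]: best score matching the first j characters to the first p primitives
    best = [[NEG] * (num_primitives + 1) for _ in range(n_chars + 1)]
    back = [[0] * (num_primitives + 1) for _ in range(n_chars + 1)]
    best[0][0] = 0.0
    for j in range(1, n_chars + 1):
        for p in range(1, num_primitives + 1):
            for span in range(1, min(max_span, p) + 1):
                prev = best[j - 1][p - span]
                if prev == NEG:
                    continue  # the earlier characters cannot end at this primitive
                cand = prev + score_fn(p - span, p, word[j - 1])
                if cand > best[j][p]:
                    best[j][p] = cand
                    back[j][p] = span
    total = best[n_chars][num_primitives]
    if total == NEG:
        return NEG, None
    # Walk the back-pointers to recover which primitives form each character.
    cuts, p = [], num_primitives
    for j in range(n_chars, 0, -1):
        span = back[j][p]
        cuts.append((p - span, p))
        p -= span
    return total, cuts[::-1]
```

In this sketch the character-level models only have to supply `score_fn`; the word-level search never asks which character a segment is, only how well it fits the character the string requires at that position.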
This changes the character-level problem from asking the question “Which character does the input image
represent?” to “How much does the input image look like it represents a specific character (such as u or
v)?”. This approach actually works quite well and is the basis of several (at least two that I know of
anyway) operational systems. However, the problem with this approach is that the regression now has to
assign low scores to many images that are not actually characters. It also has to assign reasonable scores
to images that look quite a bit like several different characters, as shown in Figure 3.
[Figure 3 panels: (a) "Ha" or "tta"; (b) "J" or "U"; (c) "Cowlesville"; (d) incorrect interpretation as "Avenue", segmented as "a", "v", "e", "n", "u", "e".]
Figure 3. Some other problems that can occur in handwritten word recognition
Figure 4. Word image from which the character images in Figure 1 were segmented.
This project will use this problem to investigate the ability of Bayesian linear regression techniques to
develop robust mappings from features calculated on input images to membership values in different
classes. You will be given access to a set of handwritten character images with their corresponding class
labels from a subset of all the possible classes and with associated feature vectors consisting of Edge
Histograms. This is your training set X. C is the set of corresponding class labels. You should do the
following:
STEP 1. DEFINE TARGET OUTPUTS USING K-NEAREST NEIGHBORS
For each character image I with corresponding feature vector xi
Find the K nearest neighbors of xi in the set X (xi can be its own neighbor)
Let Li = (ci1, ci2, …, ciK) denote the list of class labels of the K nearest neighbors
For each class c ∈ C, let tic = (number of times c occurs in Li)/K
OPTIONAL: for the “true” class ci of xi, let tici = 1
End For
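A minimal NumPy sketch of Step 1 is given below. It assumes Euclidean distance between feature vectors (the handout does not fix a metric) and computes all pairwise distances at once, which is only practical for modest training set sizes. The function name and the `force_true_class` flag (which implements the OPTIONAL line) are illustrative choices, not part of the assignment.

```python
import numpy as np

def knn_soft_targets(X, labels, classes, k, force_true_class=False):
    """Return T of shape (n_samples, n_classes), where T[i, j] is the fraction
    of the k nearest neighbors of X[i] (itself included) labeled classes[j]."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances; each point is at distance 0 from itself.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Indices of the k smallest distances in each row.
    nn = np.argsort(d2, axis=1, kind="stable")[:, :k]
    neighbor_labels = labels[nn]                      # shape (n_samples, k)
    T = np.zeros((X.shape[0], len(classes)))
    for j, c in enumerate(classes):
        T[:, j] = np.mean(neighbor_labels == c, axis=1)
    if force_true_class:
        # OPTIONAL step: set the target for each sample's own class to 1.
        column = {c: j for j, c in enumerate(classes)}
        for i, c in enumerate(labels):
            T[i, column[c]] = 1.0
    return T
```

Without the optional step each row of `T` sums to 1. With it, rows can sum to more than 1, so the targets are better read as per-class memberships than as probabilities.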
STEP 2. BUILD BAYESIAN REGRESSION MODELS FOR CALCULATING MEMBERSHIP
For each class c ∈ C in the training set
Build Bayesian models to map the inputs xi to the outputs tic calculated above
*Note that I’m not using enough subscripts but this is pseudo-code*
You should build a MAP model and a predictive distribution model
If you are able, investigate the possibility of estimating the hyperparameters
End For
You can also try to do something above and beyond this if you like
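As one possible starting point for Step 2, the sketch below fits a Bayesian linear regression for a single class under the common textbook assumptions of a zero-mean isotropic Gaussian prior on the weights (precision `alpha`) and Gaussian target noise (precision `beta`). Both hyperparameters are treated as fixed inputs here, and the basis is just the raw features plus a bias term; all of these are assumptions rather than requirements of the project. With a Gaussian prior and Gaussian likelihood the posterior is Gaussian, so the MAP weights coincide with the posterior mean.

```python
import numpy as np

def design_matrix(X):
    """Prepend a constant bias column to the feature vectors (assumed basis)."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_posterior(Phi, t, alpha, beta):
    """Posterior over weights: mean m (also the MAP estimate) and covariance S."""
    A = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi   # posterior precision
    S = np.linalg.inv(A)
    m = beta * S @ Phi.T @ np.asarray(t, dtype=float)
    return m, S

def predict(Phi_new, m, S, beta):
    """Predictive distribution: mean and variance for each row of Phi_new."""
    mean = Phi_new @ m
    # Noise variance plus the uncertainty contributed by the weights.
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_new, S, Phi_new)
    return mean, var
```

To use it for class c, pass the column of targets tic as `t`. Comparing the MAP point prediction with the predictive variance is one way to see how confident the model is on inputs far from the training data, such as non-character images.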
STEP 3. EVALUATE YOUR MODELS
You will be supplied with a test set consisting of more character samples and images that don’t represent
characters at all. The evaluation will be both qualitative and quantitative. For each character test sample,
you can qualitatively evaluate the memberships by displaying the image and the set of memberships
obtained by evaluating your models on the test feature vectors. You can do this qualitative analysis for
the non-character test samples as well. In addition, for each model you can calculate the histogram of
output values for all the characters and the histogram of output values for all the non-characters and
compare them by displaying them and by calculating the separation index (the squared difference of the
means divided by the sum of the variances).
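The quantitative comparison can be sketched as below. The separation index follows the definition given above; the use of the population variance (NumPy's default) and of shared bin edges for the two histograms are assumptions made for illustration.

```python
import numpy as np

def separation_index(char_outputs, nonchar_outputs):
    """Squared difference of the means divided by the sum of the variances."""
    a = np.asarray(char_outputs, dtype=float)
    b = np.asarray(nonchar_outputs, dtype=float)
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var())

def paired_histograms(char_outputs, nonchar_outputs, bins=20):
    """Histogram both output sets on the same bin edges so they can be overlaid."""
    a = np.asarray(char_outputs, dtype=float)
    b = np.asarray(nonchar_outputs, dtype=float)
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    counts_a, _ = np.histogram(a, bins=edges)
    counts_b, _ = np.histogram(b, bins=edges)
    return edges, counts_a, counts_b
```

A larger index means the model's outputs on characters and on non-characters overlap less.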
STEP 4. WRITE A REPORT
Write a report that is no fewer than 5 pages and no more than 10 pages of text.
The report should be organized as follows:
• Abstract
• Overview of what was done (don’t retype equations from the book, just refer to them)
• Overview of results
• Experimental Design
• Design and implement experiments that test some of the various claims and qualities of Bayesian
regression mentioned in the textbook
• Clear yet detailed description of what you implemented and what experiments you conducted
Not code
You can go beyond what I’ve described if you get a creative idea.
• Discuss implementation choices you had to make
• Experimental Results
• Discuss outcomes precisely
• Put your histograms and qualitative displays here.
• Observations
o Discuss in detail your observations concerning the various claims and qualities of
Bayesian regression mentioned in the textbook.
You will have to think about this when you design your experiments
Don’t make idle statements; observations should be backed up by experimental
results