Machine Learning and Understanding the Web
By Mark Chavira and Ulises Robles
February 14, 2000
Introduction
The World Wide Web contains information on many subjects, and on many subjects, the World Wide Web contains
a lot of information. Undoubtedly, as it has grown, this store of data has developed into a mass of digitized
knowledge that is unprecedented in both breadth and depth. Although researchers, hobbyists, and others have
already discovered sundry uses for the resource, the sheer size of the WWW limits its use in many ways. To help
manage the complexities of size, users have enlisted the aid of computers in ways that go beyond the simplistic
ability to access pages by typing in URLs or by following the hyperlink structure. For example, Internet search
engines allow users to find information in a way that is more convenient than, and not always explicit in, the
hyperlinks. Although computers already help us manage the Web, we would like them to do more. We would like to be
able to ask a computer general questions, questions to which answers exist on the Web, questions like, "Who is the
chair of the Computer Science Department at University X?" However, for computers to give such assistance, they
must be able to understand a large portion of the semantic content of the Web. Computers do not currently
understand this content. Of course, there is good reason. The Web was not designed for computerized
understanding. Instead, it was designed for human understanding. As a result, one could argue that the idea of
getting a computer to understand the content of the Web is akin to the idea of getting computers to understand
natural language, a goal that remains elusive. Two main opinions exist regarding solutions to this problem. This
paper gives an overview of both, and then it gives three examples of ways in which researchers are exploring the
second possible solution.
The Structuring Solution
The first solution, sometimes championed by database experts and those accustomed to working with other types of
highly structured data, is to change the Web. An extreme notion of this view claims that the data in the WWW is
completely unstructured, and so long as it remains so, computers will have no hope of providing a type of assistance
that is analogous to answering SQL queries. Proponents of this view would like to impose structure on the
information in the Web in a way that is analogous to the way that information is structured within a relational
database. At the very least, they would like a new form of markup language to replace HTML, one that would
essentially give the computer clues that would help it to infer the semantics of the text. In the extreme, proponents
of this "structuring solution" would like pages on the Web to be like forms that authors fill, forms that computers
know about and can interpret. To be universally applicable to the WWW, both the moderate and the extreme
approach require that existing data be put into a different form and that new data published to the Web conform to
the new structure.
The Learning Solution
The second solution, sometimes championed by Artificial Intelligence experts and those accustomed to working
with uncertainty, is to make computers smarter. Proponents of this view argue against the structuring solution,
saying that, because humans would need to do the structuring, the task is simply not feasible. Dr. Christopher
Manning, a professor in Linguistics and Computer Science at Stanford University, believes that there is simply too
much information to convert; that the vast majority of people who publish pages to the Web would never agree to
conform; and that the structure imposed would, to some extent, limit flexibility and expressiveness. Dr. Manning
believes that the content on the Web is not unstructured. Rather, it possesses a "difficult structure," a structure that
is possibly closer to the structure of natural language than to that of the "easy structure" found within a database
[Manning, 2000]. Those who champion the "learning solution" believe that computerized understanding--at least some form
of simple comprehension--is an achievable goal. Moreover, some believe that the problem of understanding the
Web is a simpler one than the problem of understanding natural language, because the Web does impose more
structure than, say, a telephone conversation. There are, after all, links, tags, URL's and standard styles of Web page
design that can give the computer hints about semantic intent. Even looking at the text in isolation, without making
use of additional information within the page, has the potential to give good results. As a result, attempts at learning
from text have direct applicability to learning from the Web. The remainder of this paper explores a small sampling
of the work of researchers who champion the learning solution. The researchers in these examples work primarily
on learning from text, without special consideration for other information that is embedded within Web pages.
Obviously, the task is big. Initial work in the area has focused on solving similar but much smaller problems, with
the hope that solutions to these smaller problems will lead to something more.
Naive Bayes
Naive Bayes is the standard baseline algorithm for classifying textual documents. In a paper entitled "Learning to
Extract Symbolic Knowledge from the World Wide Web" [Craven et al., 1998] researchers at Carnegie Mellon
University describe some of their experience applying Naive Bayes to the Web. The researchers describe their goal
as follows:
[The research effort has] the long term goal of automatically creating and maintaining a computer-
understandable knowledge base whose content mirrors that of the World Wide Web...Such a “World Wide
Knowledge Base” would consist of computer understandable assertions in symbolic, probabilistic form,
and it would have many uses. At a minimum, it would allow much more effective information retrieval by
supporting more sophisticated queries than current keyword-based search engines. Going a step further, it
would enable new uses of the Web to support knowledge-based inference and problem solving.
The essential idea is to design a system which, when given (1) an "ontology specifying the classes and relations of
interest" and (2) "training examples that represent instances of the ontology classes and relations," would then learn
"general procedures for extracting new instances of these classes and relations from the Web." In their preliminary
work, the researchers simplified the problem by focusing on a small subset of the Web and by attempting to train the
computer to recognize a very limited set of concepts and relationships. More specifically, the team acquired
approximately ten thousand pages from the sites of various Computer Science Departments. From these pages, they
attempted to train the computer to recognize the kinds of objects depicted in Figure 1.
Entity: Other, Activity, Person, Department
Activity: Research Project, Course
Person: Faculty, Staff, Student
Figure 1 CMU Entity Hierarchy
In addition, they attempted to have the system learn the following relationships among the objects:
• (InstructorOfCourse A B)
• (MembersOfProject A B)
• (DepartmentOfPerson A B)
The above relationships can be read, "A is the instructor of course B;" "A is a member of project B;" and "A is the
department of person B." The two main goals of the research were (1) to train the computer to accurately assign one
of the eight leaf classes to a given Web page from a CS department and (2) to train the computer to recognize the
above relationships among the pages that represent the entities. The team used a combination of methods. We will
consider goal (1), since it is to this goal that the group applied Naive Bayes.
Underlying the work was a set of assumptions that further simplified the task. For example, during their work, the
team assumed that each instance of an entity corresponds to exactly one page in their sample. As a consequence of
this assumption, a student, for example, corresponds to the student's home page. If the student has a collection of
pages at his or her site, then the main page would be matched with the student and the other pages would be
categorized as "other." These simplifying assumptions are certainly a problem. However, as the project progresses,
the team intends to remove many of them to make the work more general.
During the first phase of the experiment, researchers hand-classified the pages. Afterward, the researchers applied a
Naive Bayes learner to a large subset of the ten thousand pages in order to generate a system that could identify the
class corresponding to a page it had not yet seen. The team then applied the system to the remaining pages to test
the coverage and accuracy of the results, using four-fold cross validation.
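Four-fold cross validation of this kind can be sketched as follows. The round-robin partitioning below is an illustrative assumption; the paper does not specify how the folds were drawn:

```python
def four_fold_splits(pages):
    """Yield (train, test) splits for four-fold cross validation:
    partition the labeled pages into four folds, hold each fold out
    as a test set in turn, and train on the remaining three."""
    folds = [pages[i::4] for i in range(4)]  # round-robin partition (an assumption)
    for k in range(4):
        test = folds[k]
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield train, test
```

Each labeled page is tested exactly once, so accuracy averaged over the four splits uses every hand-labeled example for both training and testing.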
The page classification sub-problem demonstrates the kinds of results the group achieved. To classify a page, the
researchers used a classifier that assigned a class c' to a document d according to the following equation:
c' = argmax_c [ (log Pr(c)) / n + Σ_{i=1..T} Pr(w_i | d) log( Pr(w_i | c) / Pr(w_i | d) ) ]

Equation 1
The paper describes the terms in the equation as follows: "where n is the number of words in d, T is the size of the
vocabulary, and w_i is the i-th word in the vocabulary. Pr(w_i | c) thus represents the probability of drawing w_i given a
document from class c, and Pr(w_i | d) represents the frequency of occurrences of w_i in document d." The approach is
familiar. Define a discriminant function for each class. For a given document d, run the document through each of
the discriminant functions and choose the class that corresponds to the largest result. Each discriminant function
makes use of Bayes' law and, in this experiment, assumes feature independence (hence the "naive" part of the name).
Because of this assumption, the method used at CMU does not suffer terribly from the curse of dimensionality,
which would otherwise become severe, as each word in the vocabulary adds an additional dimension. With
coverage of approximately 20%, the average accuracy for each class was approximately 60%. With higher coverage
the accuracy goes down, so that at 60% coverage, accuracy is roughly 40%. These numbers do not seem
impressive. However, the simplifying assumptions contributed to the low performance. For example, for a collection of
pages that comprise a student's Web site, the classifier might choose many of them to correspond to an instance of
student. Recall that, because of an artificial, simplifying assumption, only one page corresponds to the student,
while the others get grouped into the "other" class. As the researchers remove assumptions and adjust their learning
algorithms accordingly, performance should improve. As a side note, when the researchers introduced some
additional heuristics, such as heuristics that examine patterns in the URL, accuracy improved to 90% for 20%
coverage and 70% for 60% coverage. We shall see how some other techniques perform better than Naive Bayes
used in isolation.
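A minimal sketch of a classifier built around the discriminant of Equation 1 might look like the following. The Laplace smoothing and the toy training data are assumptions for illustration, not details from the paper:

```python
import math
from collections import Counter

def train_naive_bayes(docs_by_class, vocab):
    """Estimate Pr(c) and Pr(w|c) with Laplace smoothing from
    hand-labeled documents (each document is a list of words)."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    priors, word_probs = {}, {}
    for c, docs in docs_by_class.items():
        priors[c] = len(docs) / total_docs
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts[w] for w in vocab)
        word_probs[c] = {w: (counts[w] + 1) / (total + len(vocab))
                         for w in vocab}
    return priors, word_probs

def classify(doc, priors, word_probs, vocab):
    """Score each class with the discriminant of Equation 1 and
    return the argmax class."""
    n = len(doc)
    counts = Counter(doc)
    freqs = {w: counts[w] / n for w in counts if w in vocab}  # Pr(w_i | d)
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c]) / n
        for w, f in freqs.items():
            score += f * math.log(word_probs[c][w] / f)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

The per-class sum touches only the words that actually occur in the document, which is why the naive independence assumption keeps the method tractable even with a vocabulary-sized feature space.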
Maximum Entropy
Over the years, other techniques for supervised learning in text classification have emerged. Nigam, Lafferty, and
McCallum [1999] describe one of them: Maximum Entropy, which has been applied to a variety of natural language
tasks. Maximum Entropy estimates class probability distributions from a given set of labeled training data. The
methodology the authors present is an iterative scaling algorithm that finds the maximum-entropy distribution
consistent with constraints derived from the features of the classes. The algorithm defines a model of the class
distributions. The model begins as a
uniform distribution since nothing is yet known about the distributions of the classes. The algorithm then changes
the model in an iterative fashion. With each iteration, the algorithm uses the labeled training data to constrain the
model so that it matches the data more closely. After the algorithm concludes, the model gives a good estimate of
the distributions of the class labels given a document. The authors’ experiments show that Maximum Entropy is
better than Naive Bayes but that Maximum Entropy sometimes suffers from over-fitting the training data due to poor
feature selection. When priors are used together with Maximum Entropy, performance improves.
The researchers first selected a set of features. The features used in this experiment were word counts. For each
(class, word) pair, the algorithm computes the expected number of times the word appears in a document of the
class, divided by the total number of words in the document. If a word occurs frequently in a given class, a
corresponding weight for that class-word pair is set to be higher than for other class-word pairs having a different
class and the same word. The authors point out that this method is typical in natural language classification.
Maximum Entropy starts by restricting the model distribution so that each class has the same expectation for a given
feature. In other words, the researchers initialized the expectations for each feature by taking the average of that
feature over all the documents, as expressed by the following equation:
(1/|D|) Σ_{d∈D} f_i(d, c(d)) = (1/|D|) Σ_{d∈D} Σ_c P(c | d) f_i(d, c)

Equation 2
where each f_i(d, c) is a feature of document d for class c. From Equation 2, the paper concludes that the parametric
form of the conditional distribution of each class has the exponential form:
P(c | d) = (1/Z(d)) exp( Σ_i λ_i f_i(d, c) )

Equation 3
where λ_i is the parameter to be learned for feature i and Z(d) is a normalizing constant.
The authors then present the IIS (Improved Iterative Scaling) procedure, which is a hill climbing algorithm used to
compute the parameters of the classifier, given the labeled data. The algorithm works in log likelihood space.
Given the training data, we can compute the log-likelihood of the model as follows:
l(Λ | D) = log Π_{d∈D} P_Λ(c(d) | d) = Σ_{d∈D} Σ_i λ_i f_i(d, c(d)) − Σ_{d∈D} log Σ_c exp( Σ_i λ_i f_i(d, c) )

Equation 4
where Λ denotes the vector of parameters λ_i that define the conditional distributions of the form in Equation 3, and
D is the set of documents. A general outline of the Improved Iterative Scaling algorithm follows: given the set of
labeled documents D and
a set of features f, perform the following steps:
1. For all features f_i, compute the expected value over the training data according to Equation 2.
2. Initialize all the feature parameters to be estimated (λ_i) to 0.
3. Iterate until convergence occurs, i.e., until we reach the global maximum:
a. Calculate the expected class labels for each document with the current parameters, P_Λ(c | d), i.e.,
evaluate Equation 3.
b. For each λ_i:
i. Using standard hill climbing, find the step size δ_i that increases the log-likelihood.
ii. Set λ_i = λ_i + δ_i.
4. Output: The text classifier.
The analysis presented here shows that at each step, we can find changes to each λ_i, reaching convergence to the
single global maximum in the likelihood "surface." The paper states that there are no local maxima.
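The outline above can be sketched in code. The version below substitutes plain gradient ascent for the IIS step-size computation (the log-likelihood surface is concave, so both climb to the same global maximum); the feature functions, learning rate, and iteration count in the usage are illustrative assumptions:

```python
import math

def maxent_train(docs, classes, features, n_iters=200, lr=0.5):
    """Fit the exponential model of Equation 3 by gradient ascent on the
    log-likelihood of Equation 4. docs is a list of (words, label) pairs;
    features is a list of functions f(d, c)."""
    lambdas = [0.0] * len(features)  # step 2: initialize parameters to 0
    # step 1: empirical expectation of each feature (left side of Equation 2)
    empirical = [sum(f(d, c) for d, c in docs) / len(docs) for f in features]
    for _ in range(n_iters):
        # model expectation of each feature (right side of Equation 2)
        model = [0.0] * len(features)
        for d, _ in docs:
            scores = [math.exp(sum(l * f(d, c) for l, f in zip(lambdas, features)))
                      for c in classes]
            z = sum(scores)  # the normalizer Z(d) of Equation 3
            for i, f in enumerate(features):
                model[i] += sum((s / z) * f(d, c)
                                for s, c in zip(scores, classes)) / len(docs)
        # gradient of Equation 4 is (empirical - model); step uphill
        lambdas = [l + lr * (e - m) for l, e, m in zip(lambdas, empirical, model)]
    return lambdas

def maxent_classify(d, classes, features, lambdas):
    """Return argmax_c P(c | d) under Equation 3 (Z(d) cancels in the argmax)."""
    return max(classes, key=lambda c: sum(l * f(d, c)
                                          for l, f in zip(lambdas, features)))
```

At convergence the model expectation of each feature matches its empirical expectation, which is exactly the constraint Equation 2 imposes.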
Compared to Naive Bayes techniques, Maximum Entropy does not require that the features be independent. For
instance, the phrase "Palo Alto" contains two words that almost always occur together and rarely occur by
themselves. Naive Bayes will consider the two words independently and count the phrase twice. Maximum
Entropy, however, will reduce the weight of these features by half, since the constraints are based on the expectation
of the number of counts.
The authors use three data sets for evaluating the performance of the algorithm:
• The WebKB data set [Craven et al., 1998] contains Web documents from university computer science
departments. In the present research, they use those pages in this data set that correspond to student,
faculty, course, and project pages (4199 pages).
• The Industry Sector Hierarchy data set [McCallum and Nigam, 1998] contains company Web pages
classified into a hierarchy of industry sectors. The 6440 pages divide into 71 classes, and the hierarchy is two
levels deep.
• The Newsgroup data set [Joachims, 1997] contains about 20,000 articles divided into 20 UseNet discussion
groups. The project removed words that occur only once.
The authors also considered Naive Bayes as a textual classification algorithm. They present a comparison using two
variants of Naive Bayes: scaled (the word count is scaled so that each document has a constant number of word
occurrences) and un-scaled. They mention that in most cases the scaled version is better than the regular
Naive Bayes.
The experimenters used cross validation to test results. For the Newsgroup and the Industry Sector data sets, the
algorithm sometimes produced over-fitting results. To prevent this occurrence, the researchers stopped the iterations
early. In all the test cases, Maximum Entropy performed better than regular Naive Bayes, especially on the WebKB
dataset, where the algorithm achieved a 40% reduction in error over Naive Bayes. However, compared to the scaled
Naive Bayes, the Maximum Entropy results were sometimes better, sometimes slightly worse. The authors attribute
the worse performance to over-fitting.
To help deal with over-fitting, the authors experimented with using Gaussian priors with mean zero and
a diagonal covariance matrix; in these experiments, all features share the same variance.
Equation 5 describes the Gaussian distribution of the priors.
P(Λ) = Π_i (1/√(2πσ_i²)) exp( −λ_i² / (2σ_i²) )

Equation 5
where λ_i is the parameter for the i-th feature and σ_i² is its variance.
The paper also shows that over-fitting is reduced when using a Gaussian prior with Maximum Entropy, and
classification error improves as well. As a consequence, the performance is better than that obtained when using
scaled Naive Bayes. When no over-fitting is encountered (without using priors), the performance is almost
unchanged.
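The effect of the prior on training is visible in the gradient: maximizing the log-likelihood plus log P(Λ) adds a term −λ_i/σ² to each component of the gradient. A minimal sketch of the resulting MAP update follows; the function name and learning-rate convention are assumptions for illustration:

```python
def map_update(lambdas, empirical, model, lr=0.1, sigma2=1.0):
    """One MAP gradient step. The zero-mean Gaussian prior of Equation 5
    contributes -lambda_i^2 / (2 sigma^2) to the log-posterior, so its
    gradient adds an extra -lambda_i / sigma^2 term that shrinks large
    weights toward zero, discouraging over-fitting."""
    return [l + lr * (e - m - l / sigma2)
            for l, e, m in zip(lambdas, empirical, model)]
```

Even when a feature's empirical and model expectations already agree, the prior term keeps pulling its weight toward zero, which is precisely the shrinkage behavior the authors rely on to control over-fitting.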
The authors point out a few shortcomings of their approach. One shortcoming is that the researchers used the same
features for every class. This need not be the case, and the learner would be more flexible if features could vary by
class. Another relevant limitation is that the researchers used the same Gaussian prior variance in
all the experiments. This approach is not correct, particularly when the training data is sparse. A possible
improvement is to adjust the prior based on the amount of training data. The authors hypothesize that another
improvement would result from using feature functions of the form log(count) or some other sub-linear
representation instead of the counts themselves. They have observed that using un-scaled counts gives decreased
accuracy.
Classifying with Unlabeled Training Data (Bootstrapping and EM)
Many machine learning tasks begin with (1) a set of classifications c1..cn and (2) a set of instances that need to be
classified according to (1). The first step is to hand-label a number of training instances for input into the learning
algorithm and test instances for input into the resulting classifier. In many cases, to obtain good results, the number
of hand-labeled instances must be large. Herein lies a problem. Hand-labeling is usually difficult and time-
consuming. It would be desirable to skip this step. This problem also exists when turning to the Web and to
classifying text in general. Hence the motivation for the research described in "Text Classification by Bootstrapping
with Keywords, EM and Shrinkage" [McCallum and Nigam, 1999]. In this paper, the researchers describe their
approach to classifying text that does not require hand-labeled training instances. The researchers begin with the
goal of classifying Computer Science papers into 70 different topics (e.g., NLP, Interface Design, Multimedia),
which are sub-disciplines of Computer Science. The researchers proceed as follows:
1. For each class ci, define a short list of key words.
2. Classify the documents according to the lists of key words. Some documents will remain unclassified.
3. Construct a classifier using the instances that have been classified.
4. Use the classifier to classify all instances.
5. Iterate over steps (3) and (4) until convergence occurs.
Step (1) is to choose keywords for each class. A person chooses these keywords in a way he or she believes will
help identify instances of the class. This selection process is not easy; it requires much thought, trial, and error.
However, it requires much less work than hand-labeling data. Some of the key words the researchers chose follow:
Topic Keywords
NLP language, natural, processing, information, text
Interface Design interface, design, user, sketch, interfaces
Multimedia real, time, data, media
Step (2) classifies documents according to the keywords. To perform this step, the computer essentially searches for
the first keyword in the given document and assigns the document to the corresponding class. Doing so leaves many
documents unclassified and some documents misclassified. In step (3), the researchers use a Naive Bayes learner to
construct an initial classifier using the documents classified in step (2), the labels obtained from step (2), and
discriminant equations derived from Equation 6.
P(c_j | d_i) ∝ P(c_j) P(d_i | c_j) ∝ P(c_j) Π_{k=1..|d_i|} P(w_{d_i,k} | c_j)

Equation 6
where c_j is the class, d_i is the document being considered, and w_{d_i,k} is the k-th word in document d_i. Finally,
the researchers use the results obtained thus far to "bootstrap" the EM algorithm. That is, they use the results as a
starting point for the EM learner, which successively improves the quality of the classifier.
The paper describes EM as follows:
EM is a class of iterative algorithms for maximum likelihood or maximum a posteriori parameter
estimation in problems with incomplete data [Dempster et al., 1977]. Given a model of data generation and
data with some missing values, EM iteratively uses the current model to estimate the missing values, and
then uses the missing value estimates to improve the model. Using all the available data, EM will locally
maximize the likelihood of the parameters and give estimates for the missing values. In our scenario, the
class labels of the unlabeled data are the missing values.
EM essentially consists of two phases. The "E" phase calculates probabilistically-weighted class labels, P(c_j | d_i), for
every document using the current classifier and Equation 6. The “M” phase constructs a new classifier according to
Naive Bayes from all of the classified instances. EM then iterates over the E and M phases until convergence
occurs. In addition to EM, the authors also applied a technique known as shrinkage to assist with the sparseness of
the data.
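The whole pipeline (keyword labeling, a Naive Bayes classifier built from Equation 6, and EM's alternating E and M phases) can be sketched as below. The Laplace smoothing, the uniform treatment of unmatched documents, and the toy keyword lists in the usage are assumptions, and shrinkage is omitted:

```python
import math
from collections import Counter

def keyword_label(doc, keywords):
    """Step (2): return the first class whose keyword list matches a word
    in the document, or None if no keyword matches."""
    for c, words in keywords.items():
        if any(w in doc for w in words):
            return c
    return None

def em_bootstrap(docs, keywords, vocab, n_iters=10):
    """Bootstrap a Naive Bayes classifier from keyword labels, then run EM:
    the M phase re-estimates Pr(c) and Pr(w|c) from weighted counts, and the
    E phase recomputes soft labels P(c_j | d_i) via Equation 6."""
    classes = list(keywords)
    labels = []
    for d in docs:  # initial labels; unmatched documents get uniform weight
        c0 = keyword_label(d, keywords)
        if c0 is None:
            labels.append({c: 1.0 / len(classes) for c in classes})
        else:
            labels.append({c: 1.0 if c == c0 else 0.0 for c in classes})
    for _ in range(n_iters):
        # M phase: Laplace-smoothed parameters from the current soft labels
        priors = {c: (1 + sum(l[c] for l in labels)) / (len(classes) + len(docs))
                  for c in classes}
        probs = {}
        for c in classes:
            wc = Counter()
            for d, l in zip(docs, labels):
                for w in d:
                    wc[w] += l[c]
            total = sum(wc[w] for w in vocab)
            probs[c] = {w: (1 + wc[w]) / (len(vocab) + total) for w in vocab}
        # E phase: soft labels from Equation 6, normalized over the classes
        for i, d in enumerate(docs):
            logp = {c: math.log(priors[c]) +
                       sum(math.log(probs[c][w]) for w in d if w in vocab)
                    for c in classes}
            m = max(logp.values())
            z = sum(math.exp(v - m) for v in logp.values())
            labels[i] = {c: math.exp(logp[c] - m) / z for c in classes}
    return priors, probs, labels
```

Documents that no keyword touched acquire labels anyway, because words they share with keyword-labeled documents pull their soft labels toward the corresponding class on each EM iteration.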
The researchers found that keywords alone provided 45% accuracy. The classifier that they constructed using
bootstrapping, EM, and shrinkage obtained 66% accuracy. The paper notes that this level of accuracy approaches
the estimated human agreement level, which is 72%. Exactly what human agreement means, the paper leaves
unclear. The paper does not discuss coverage.
Conclusion
Computers should be able to help us obtain information from the Web in ways that are more sophisticated than those
used by current search engines. Different approaches to achieving this goal have emerged. The structuring
approach seeks to change the Web, so that the Web is easier for computers to understand. The learning approach
seeks to make computers smarter, so that they can understand the Web as it is. Because learning from the Web is
similar to learning from text, textual approaches serve as one of the foundations for work within the learning
approach. The default learning algorithm for producing text document classifiers is Naive Bayes. When applied to
Web page classification, Naive Bayes demonstrates results that are similar to those achieved when applied to text
document classification. To improve on Naive Bayes, researchers have explored other learners, including Maximum
Entropy and EM, which replace and/or augment Naive Bayes. In some cases, these other learners outperform Naive
Bayes. Work in training computers to understand Web content and to be able to use that understanding to provide
solutions is still in preliminary stages. The research has given some promising results.
References
• [Manning, 2000] C. Manning. January 2000. Lecture before the Digital Library Group at Stanford University.
• [Craven et al., 1998] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S.
Slattery. 1998. Learning to Extract Symbolic Knowledge from the World Wide Web.
• [McCallum and Nigam, 1999] A. McCallum, K. Nigam. 1999. Text Classification by Bootstrapping with
Keywords, EM and Shrinkage.
• [Nigam et al., 1999] K. Nigam, J. Lafferty, A. McCallum. 1999. Using Maximum Entropy for Text
Classification.
• [McCallum and Nigam, 1998] A. McCallum, K. Nigam. 1998. A Comparison of Event Models for Naive
Bayes Text Classification. In AAAI-98 Workshop on Learning for Text Categorization. Tech. rep.
WS-98-05, AAAI Press. http://www.cs.cmu.edu/~mccallum.
• [Joachims, 1997] T. Joachims. 1997. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for
Text Categorization. In Machine Learning: Proceedings of the Fourteenth International Conference (ICML
'97), pages 143-151.
• [Dempster et al., 1977] A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. Maximum Likelihood from
Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.