Machine Learning and Understanding the Web
                                         By Mark Chavira and Ulises Robles
                                                February 14, 2000


Introduction
The World Wide Web contains information on many subjects, and on many subjects, the World Wide Web contains
a lot of information. Undoubtedly, as it has grown, this store of data has developed into a mass of digitized
knowledge that is unprecedented in both breadth and depth. Although researchers, hobbyists, and others have
already discovered sundry uses for the resource, the sheer size of the WWW limits its use in many ways. To help
manage the complexities of size, users have enlisted the aid of computers in ways that go beyond the simplistic
ability to access pages by typing in URL's or by following the hyper-link structure. For example, Internet search
engines allow users to find information in ways that are more convenient than, and not always explicit in, the hyper-link
structure. Although computers already help us manage the Web, we would like them to do more. We would like to be
able to ask a computer general questions, questions to which answers exist on the Web, questions like, "Who is the
chair of the Computer Science Department at University X?" However, for computers to give such assistance, they
must be able to understand a large portion of the semantic content of the Web. Computers do not currently
understand this content. Of course, there is good reason. The Web was not designed for computerized
understanding. Instead, it was designed for human understanding. As a result, one could argue that the idea of
getting a computer to understand the content of the Web is akin to the idea of getting computers to understand
natural language, a goal that remains elusive. Two broad opinions have emerged about how to solve this problem. This
paper gives an overview of both and then presents three examples of ways in which researchers are exploring the
second.

The Structuring Solution
The first solution, sometimes championed by database experts and those accustomed to working with other types of
highly structured data, is to change the Web. An extreme notion of this view claims that the data in the WWW is
completely unstructured, and so long as it remains so, computers will have no hope of providing a type of assistance
that is analogous to answering SQL queries. Proponents of this view would like to impose structure on the
information in the Web in a way that is analogous to the way that information is structured within a relational
database. At the very least, they would like a new form of markup language to replace HTML, one that would
essentially give the computer clues that would help it to infer the semantics of the text. In the extreme, proponents
of this "structuring solution" would like pages on the Web to be like forms that authors fill, forms that computers
know about and can interpret. To be universally applicable to the WWW, both the moderate and the extreme
approach require that existing data be put into a different form and that new data published to the Web conform to
the new structure.

The Learning Solution
The second solution, sometimes championed by Artificial Intelligence experts and those accustomed to working
with uncertainty, is to make computers smarter. Proponents of this view argue against the structuring solution,
saying that, because humans would need to do the structuring, the task is simply not feasible. Dr. Christopher
Manning, a professor in Linguistics and Computer Science at Stanford University, believes that there is simply too
much information to convert; that the vast majority of people who publish pages to the Web would never agree to
conform; and that the structure imposed would, to some extent, limit flexibility and expressiveness. Dr. Manning
believes that the content on the Web is not unstructured. Rather, it possesses a "difficult structure," a structure that
is possibly closer to the structure of natural language than to that of the "easy structure" found within a database
[Manning, 2000]. Those who champion the "learning solution" believe that computerized understanding--at least some form
of simple comprehension--is an achievable goal. Moreover, some believe that the problem of understanding the
Web is a simpler one than the problem of understanding natural language, because the Web does impose more
structure than, say, a telephone conversation. There are, after all, links, tags, URL's and standard styles of Web page
design that can give the computer hints about semantic intent. Even looking at the text in isolation, without making
use of additional information within the page, has the potential to give good results. As a result, attempts at learning
from text have direct applicability to learning from the Web. The remainder of this paper explores a small sampling
of the work of researchers who champion the learning solution. The researchers in these examples work primarily
on learning from text, without special consideration for other information that is embedded within Web pages.
Obviously, the task is big. Initial work in the area has focused on solving similar but much smaller problems, with
the hope that solutions to these smaller problems will lead to something more.

Naive Bayes
Naive Bayes is the standard baseline algorithm for classifying text documents. In a paper entitled "Learning to
Extract Symbolic Knowledge from the World Wide Web" [Craven et al., 1998] researchers at Carnegie Mellon
University describe some of their experience applying Naive Bayes to the Web. The researchers describe their goal
as follows:

         [The research effort has] the long term goal of automatically creating and maintaining a computer-
         understandable knowledge base whose content mirrors that of the World Wide Web...Such a “World Wide
         Knowledge Base” would consist of computer understandable assertions in symbolic, probabilistic form,
         and it would have many uses. At a minimum, it would allow much more effective information retrieval by
         supporting more sophisticated queries than current keyword-based search engines. Going a step further, it
         would enable new uses of the Web to support knowledge-based inference and problem solving.

The essential idea is to design a system which, when given (1) an "ontology specifying the classes and relations of
interest" and (2) "training examples that represent instances of the ontology classes and relations," would then learn
"general procedures for extracting new instances of these classes and relations from the Web." In their preliminary
work, the researchers simplified the problem by focusing on a small subset of the Web and by attempting to train the
computer to recognize a very limited set of concepts and relationships. More specifically, the team acquired
approximately ten thousand pages from the sites of various Computer Science Departments. From these pages, they
attempted to train the computer to recognize the kinds of objects depicted in Figure 1.

                      [Figure 1 CMU Entity Hierarchy: Entity branches into Other, Activity, Person, and Department;
                      Activity branches into Research Project and Course; Person branches into Faculty, Staff, and Student.]

In addition, they attempted to have the system learn the following relationships among the objects:

    •    (InstructorOfCourse A B)
    •    (MembersOfProject A B)
    •    (DepartmentOfPerson A B)

The above relationships can be read, "A is the instructor of course B;" "A is a member of project B;" and "A is the
department of person B." The two main goals of the research were (1) to train the computer to accurately assign one
of the eight leaf classes to a given Web page from a CS department and (2) to train the computer to recognize the
above relationships among the pages that represent the entities. The team used a combination of methods. We will
consider goal (1), since it is to this goal that the group applied Naive Bayes.



Underlying the work was a set of assumptions that further simplified the task. For example, during their work, the
team assumed that each instance of an entity corresponds to exactly one page in their sample. As a consequence of
this assumption, a student, for example, corresponds to the student's home page. If the student has a collection of
pages at his or her site, then the main page would be matched with the student and the other pages would be
categorized as "other." These simplifying assumptions are certainly a problem. However, as the project progresses
into the future, the team intends to remove many of the simplifying assumptions to make their work more general.

During the first phase of the experiment, researchers hand-classified the pages. Afterward, the researchers applied a
Naive Bayes learner to a large subset of the ten thousand pages in order to generate a system that could identify the
class corresponding to a page it had not yet seen and for which it did not already have the answers. The team then
applied the system to the remaining pages to test the coverage and accuracy of the results. They used four-fold cross
validation to check their results.

The page classification sub-problem demonstrates the kinds of results the group achieved. To classify a page, the
researchers used a classifier that assigned a class c' to a document d according to the following equation:


                       c' = \arg\max_c \left[ \frac{\log \Pr(c)}{n} + \sum_{i=1}^{T} \Pr(w_i \mid d)\, \log \frac{\Pr(w_i \mid c)}{\Pr(w_i \mid d)} \right]

                                                      Equation 1

The paper describes the terms in the equation as follows: "where n is the number of words in d, T is the size of the
vocabulary, and wi is the i-th word in the vocabulary. Pr(wi | c) thus represents the probability of drawing wi given a
document from class c, and Pr(wi | d) represents the frequency of occurrences of wi in document d." The approach is
familiar. Define a discriminant function for each class. For a given document d, run the document through each of
the discriminant functions and choose the class that corresponds to the largest result. Each discriminant function
makes use of Bayes' law and, in this experiment, assumes feature independence (hence the "naive" part of the name).
Because of this assumption, the method used at CMU does not suffer terribly from the curse of dimensionality,
which would otherwise become severe, as each word in the vocabulary adds an additional dimension. With
coverage of approximately 20%, the average accuracy for each class was approximately 60%. With higher coverage
the accuracy goes down, so that at 60% coverage, accuracy is roughly 40%. These numbers don't seem all that
great. However, the simplifying assumptions contributed to low performance. For example, for a collection of
pages that comprise a student's Web site, the classifier might choose many of them to correspond to an instance of
student. Recall that, because of an artificial, simplifying assumption, only one page corresponds to the student,
while the others get grouped into the "other" class. As the researchers remove assumptions and adjust their learning
algorithms accordingly, performance should improve. As a side note, when the researchers introduced some
additional heuristics, such as heuristics that examine patterns in the URL, accuracy improved to 90% for 20%
coverage and 70% for 60% coverage. We shall see how some other techniques perform better than Naive Bayes
used in isolation.
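
To make Equation 1 concrete, the following sketch scores a tokenized page against each class and returns the
highest-scoring one. It is a minimal illustration rather than the CMU system; the prior, word_prob, and vocab inputs
are hypothetical structures that would be estimated from the hand-labeled training pages, with Pr(wi | c) smoothed so
that it is never zero.

    import math
    from collections import Counter

    def classify_page(doc_words, classes, prior, word_prob, vocab):
        """Assign the class c' that maximizes the discriminant of Equation 1.

        doc_words       : list of word tokens for the page d
        prior[c]        : Pr(c), the class prior estimated from training pages (assumed input)
        word_prob[c][w] : Pr(w | c), smoothed so it is never zero (assumed input)
        vocab           : set of vocabulary words; words outside it are ignored
        """
        n = len(doc_words)
        freq = Counter(w for w in doc_words if w in vocab)
        best_class, best_score = None, float("-inf")
        for c in classes:
            score = math.log(prior[c]) / n
            # Terms with Pr(w | d) = 0 contribute nothing, so only words present in d are summed.
            for w, count in freq.items():
                p_w_d = count / n                     # Pr(w | d): frequency of w in d
                score += p_w_d * math.log(word_prob[c][w] / p_w_d)
            if score > best_score:
                best_class, best_score = c, score
        return best_class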

Maximum Entropy
Over the years, other techniques for supervised learning in text classification have emerged. Nigam, Lafferty, and
McCallum [1999] describe one of them: Maximum Entropy, which has been applied to a variety of natural language
tasks. Maximum Entropy estimates class probability distributions from a given set of labeled training data. The
methodology the authors present is an iterative scaling algorithm that finds the maximum-entropy distribution consistent
with the feature constraints. The algorithm defines a model of the class distributions. The model begins as a
uniform distribution since nothing is yet known about the distributions of the classes. The algorithm then changes
the model in an iterative fashion. With each iteration, the algorithm uses the labeled training data to constrain the
model so that it matches the data more closely. After the algorithm concludes, the model gives a good estimate of
the distributions of the class labels given a document. The authors’ experiments show that Maximum Entropy is
better than Naive Bayes but that Maximum Entropy sometimes suffers from over-fitting the training data due to poor
feature selection. If priors are used together with Maximum Entropy, performance is better.



The researchers first selected a set of features. The features used in this experiment were word counts. For each
(class, word) pair, the algorithm computes the expected number of times the word appears in a document of
the class, divided by the total number of words in the document. If a word occurs frequently in a given class, a
corresponding weight for that class-word pair is set to be higher than for other class-word pairs having a different
class and the same word. The authors point out that this method is typical in natural language classification.

Maximum Entropy starts by restricting the model distribution so that each class has the same expectation for a given
feature. In other words, the researchers initialize the expectation for each feature by taking the average of that
feature over all the documents, as expressed by the following equation:

                       \frac{1}{|D|} \sum_{d \in D} f_i(d, c(d)) \;=\; \frac{1}{|D|} \sum_{d \in D} \sum_{c} P(c \mid d)\, f_i(d, c)

                                                       Equation 2

where each fi(d, c) is a feature of document d for class c. From Equation 2, the paper concludes that the parametric
form of the conditional distribution of each class has the exponential form:

                                         P(c \mid d) = \frac{1}{Z(d)} \exp\!\Big( \sum_i \lambda_i f_i(d, c) \Big)

                                                        Equation 3

where λi is the parameter to be learned for feature i and Z(d) is a normalizing constant.
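
As a sketch of how Equation 3 is evaluated for a single document, the function below turns a parameter vector into
class probabilities; lambdas and the features(d, c) callback are hypothetical stand-ins for the λi values and the
feature functions fi(d, c).

    import math

    def class_posteriors(doc, classes, lambdas, features):
        """Evaluate the exponential form of Equation 3 for one document.

        lambdas         : list of parameters, one λ_i per feature
        features(d, c)  : returns the list of feature values f_i(d, c)
        """
        scores = {c: sum(l * f for l, f in zip(lambdas, features(doc, c))) for c in classes}
        m = max(scores.values())                       # shift for numerical stability; cancels in Z(d)
        unnormalized = {c: math.exp(s - m) for c, s in scores.items()}
        z = sum(unnormalized.values())                 # the normalizing constant Z(d)
        return {c: u / z for c, u in unnormalized.items()}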

The authors then present the IIS (Improved Iterative Scaling) procedure, which is a hill climbing algorithm used to
compute the parameters of the classifier, given the labeled data. The algorithm works in log likelihood space.
Given the training data, we can compute the log-likelihood of the model as follows:


             l(\Lambda \mid D) = \log \prod_{d \in D} P_\Lambda(c(d) \mid d) = \sum_{d \in D} \sum_i \lambda_i f_i(d, c(d)) \;-\; \sum_{d \in D} \log \sum_c \exp\!\Big( \sum_i \lambda_i f_i(d, c) \Big)

                                                        Equation 4

where Λ denotes a vector of conditional distributions of the form in Equation 3, one for each class, and D is the set of
documents. A general outline of the Improved Iterative Scaling algorithm follows: Given the set of labeled documents D and
a set of features f, perform the following steps:

    1.   For all features fi, compute the expected value over the training data according to Equation 2.
    2.   Initialize all the feature parameters to be estimated (λi ) to 0.
    3.   Iterate until convergence occurs; i.e., until we reach the global maximum
              a. Calculate the expected class labels for each document with the current parameters PΛ(c | d), i.e.,
                   evaluate Equation 3
              b. For each (λi)
                          i.    Using standard Hill Climbing, find the step size δi that increases the log-likelihood.
                         ii.    λi =λi + δi
    4.   Output: The text classifier.

The analysis presented in the paper shows that at each step we can find changes to each λi that move toward the
single global maximum of the likelihood "surface". The paper states that there are no local maxima, since the
log-likelihood is concave in the parameters.
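
The following sketch follows the outline above but, in step 3b, substitutes a plain gradient step on the
log-likelihood of Equation 4 for the exact IIS step-size computation; the gradient with respect to λi is the
empirical feature total minus the total expected under the current model, the same two quantities the outline
compares. Since the likelihood has a single global maximum, this simpler ascent reaches the same solution, if more
slowly. It reuses the class_posteriors helper sketched earlier; the step and n_iters values are arbitrary choices.

    def fit_parameters(docs, labels, classes, features, n_features, n_iters=100, step=0.1):
        """Gradient ascent on the log-likelihood of Equation 4 (simplified stand-in for IIS)."""
        lambdas = [0.0] * n_features                      # step 2: initialize all λ_i to 0
        empirical = [0.0] * n_features                    # step 1: empirical feature totals
        for d, c in zip(docs, labels):
            for i, f in enumerate(features(d, c)):
                empirical[i] += f
        for _ in range(n_iters):                          # step 3: iterate toward the maximum
            expected = [0.0] * n_features
            for d in docs:
                post = class_posteriors(d, classes, lambdas, features)   # step 3a: Equation 3
                for c in classes:
                    for i, f in enumerate(features(d, c)):
                        expected[i] += post[c] * f
            for i in range(n_features):                   # step 3b: simplified update in place of δ_i
                lambdas[i] += step * (empirical[i] - expected[i])
        return lambdas                                    # step 4: parameters of the text classifier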




Compared to Naive Bayes techniques, Maximum Entropy does not require that the features be independent. For
instance, the phrase "Palo Alto" consists of two words that almost always occur together and rarely occur by
themselves. Naive Bayes considers the two words independently and effectively counts the same evidence twice.
Maximum Entropy, however, will reduce the weights of these features by roughly half, since the constraints are based
on the expected counts.

The authors use three data sets for evaluating the performance of the algorithm:

    •    The WebKB data set [Craven et al., 1998] contains Web documents from university computer science
         departments. In the present research, they use those pages in this data set that correspond to student,
         faculty, course, and project pages (4199 pages).
    •    The Industry Sector Hierarchy data set [McCallum and Nigam, 1998] contains company Web pages
         classified into a hierarchy of industry sectors. The 6,440 pages are divided into 71 classes, and the hierarchy is
         two levels deep.
    •    The Newsgroup data set [Joachims, 1997] contains about 20,000 articles divided into 20 UseNet discussion
         groups. The researchers removed words that occur only once.

The authors also considered Naive Bayes as a textual classification algorithm. They present a comparison using two
variants of Naive Bayes: scaled (the word count is scaled so that each document has a constant number of word
occurrences) and un-scaled. They mention that in most cases the scaled version performs better than regular
Naive Bayes.

The experimenters used cross validation to test results. For the Newsgroup and the Industry Sector data sets, the
algorithm sometimes over-fit. To prevent this, the researchers stopped the iterations
early. In all the test cases, Maximum Entropy performed better than regular Naive Bayes, especially on the WebKB
data set, where the algorithm achieved a 40% reduction in error over Naive Bayes. However, compared to scaled
Naive Bayes, the Maximum Entropy results were sometimes better and sometimes slightly worse. The authors attribute
the worse performances to over-fitting.

To help deal with over-fitting, the authors experimented with using Gaussian priors with mean zero and
a diagonal covariance matrix; in these experiments, every feature uses the same variance.
Equation 5 describes the Gaussian distribution of the priors.

                                          P(\Lambda) = \prod_i \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left( \frac{-\lambda_i^2}{2\sigma_i^2} \right)

                                                          Equation 5

where λi is the parameter for the i-th feature and σi2 is its variance.

The paper also shows that using a Gaussian prior with Maximum Entropy reduces over-fitting and lowers classification
error, so the resulting performance exceeds that obtained with scaled Naive Bayes. In cases where no over-fitting is
encountered without priors, adding the prior leaves performance almost unchanged.
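
Concretely, maximizing the posterior rather than the likelihood adds a penalty term to Equation 4. Under the stated
assumptions (zero mean and a per-feature variance), the objective becomes

             l(\Lambda \mid D) + \log P(\Lambda) \;=\; \sum_{d \in D} \log P_\Lambda(c(d) \mid d) \;-\; \sum_i \frac{\lambda_i^2}{2\sigma_i^2} \;+\; \text{const}

where the constant from the Gaussian normalizer can be dropped because it does not affect the maximization. The
penalty shrinks each λi toward zero, which is what counters over-fitting.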

The authors point out a few shortcomings of their approach. One shortcoming is that the researchers used the same
features for every class. This need not be the case, and the learner would be more flexible if features were selected
per class. Another limitation is that the researchers used the same Gaussian prior variance in
all the experiments. This choice is problematic, particularly when the training data is sparse. A possible
improvement is to adjust the prior based on the amount of training data. The authors hypothesize that another
improvement would result from using feature functions of the form log(count) or some other sub-linear
representation instead of the counts themselves. They have observed that using un-scaled counts gives decreased
accuracy.
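
As a hypothetical illustration of such a sub-linear feature (the paper does not give an explicit formula), a
word-class feature might use the logarithm of the raw count rather than the count itself:

                                          f_{w,c'}(d, c) \;=\; \mathbf{1}[c = c'] \cdot \log\bigl(1 + N(w, d)\bigr)

where N(w, d) is the number of times word w occurs in document d and the indicator restricts the feature to class c'.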




Classifying with Unlabeled Training Data (Bootstrapping and EM)
Many machine learning tasks begin with (1) a set of classifications c1..cn and (2) a set of instances that need to be
classified according to (1). The first step is to hand-label a number of training instances for input into the learning
algorithm and test instances for input into the resulting classifier. In many cases, to obtain good results, the number
of hand-labeled instances must be large. Herein lies a problem. Hand-labeling is usually difficult and time-
consuming. It would be desirable to skip this step. This problem also exists when turning to the Web and to
classifying text in general. Hence the motivation for the research described in “Text Classification by Bootstrapping
with Keywords, EM and Shrinkage” [McCallum and Nigam, 1999]. In this paper, the researchers describe an
approach to classifying text that does not require hand-labeled training instances. The researchers begin with the
goal of classifying Computer Science papers into 70 different topics (e.g., NLP, Interface Design, Multimedia),
each a sub-discipline of Computer Science. The researchers proceed as follows:

    1. For each class ci, define a short list of key words.
    2.   Classify the documents according to the list of keywords. Some documents will remain unclassified.
    3.   Construct a classifier using the instances that have been classified.
    4.   Use the classifier to classify all instances.
    5.   Iterate over steps (3) and (4) until convergence occurs.

Step (1) is to choose keywords for each class. A person chooses these keywords in a way he or she believes will
help identify instances of the class. This selection process is not easy; it requires much thought, trial, and error.
However, it requires much less work than hand-labeling data. Some of the key words the researchers chose follow:

         Topic                       Keywords

         NLP                         language, natural, processing, information, text
         Interface Design            interface, design, user, sketch, interfaces
         Multimedia                  real, time, data, media

Step (2) classifies documents according to the keywords. To perform this step, the computer essentially searches for
the first keyword in the given document and assigns the document to the corresponding class. Doing so leaves many
documents unclassified and some documents misclassified. In step (3), the researchers use a Naive Bayes learner to
construct an initial classifier using the documents classified in step (2), the labels obtained from step (2), and
discriminant equations derived from Equation 6.

                              P(c_j \mid d_i) \;\propto\; P(c_j)\, P(d_i \mid c_j) \;\propto\; P(c_j) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j)

                                                        Equation 6

where cj is the class, di is the document being considered, and wdi,k is the k-th word in document di. Finally, the
researchers use the results obtained thus far to “bootstrap” the EM algorithm. That is, they use the results as a
starting point for the EM learner, which successively improves the quality of the classifier.
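
A minimal sketch of steps (1) through (3) follows, assuming each document has already been tokenized. The keywords
dictionary corresponds to the table above; the helper names, Laplace smoothing constant, and vocabulary handling are
illustrative choices rather than details taken from the paper.

    from collections import Counter

    def keyword_label(doc_words, keywords):
        """Step 2: label a document with the first class whose keyword list it matches, else None."""
        words = set(doc_words)
        for cls, kws in keywords.items():
            if words & set(kws):
                return cls
        return None

    def fit_naive_bayes(docs, labels, classes, vocab, alpha=1.0):
        """Step 3: multinomial Naive Bayes (Equation 6) fit on the keyword-labeled documents."""
        class_counts = Counter(labels)
        word_counts = {c: Counter() for c in classes}
        for d, c in zip(docs, labels):
            word_counts[c].update(w for w in d if w in vocab)
        prior = {c: (class_counts[c] + alpha) / (len(labels) + alpha * len(classes)) for c in classes}
        word_prob = {}
        for c in classes:
            total = sum(word_counts[c].values())
            word_prob[c] = {w: (word_counts[c][w] + alpha) / (total + alpha * len(vocab))
                            for w in vocab}
        return prior, word_prob

    # Steps 1-3: label what the keywords can, then fit the initial classifier on that subset.
    # keywords = {"NLP": ["language", "natural", "processing", "information", "text"], ...}
    # labeled = [(d, keyword_label(d, keywords)) for d in documents]
    # labeled = [(d, c) for d, c in labeled if c is not None]
    # prior, word_prob = fit_naive_bayes([d for d, _ in labeled], [c for _, c in labeled], classes, vocab)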

The paper describes EM as follows:

         EM is a class of iterative algorithms for maximum likelihood or maximum a posteriori parameter
         estimation in problems with incomplete data [Dempster et al., 1977]. Given a model of data generation and
         data with some missing values, EM iteratively uses the current model to estimate the missing values, and
         then uses the missing value estimates to improve the model. Using all the available data, EM will locally
         maximize the likelihood of the parameters and give estimates for the missing values. In our scenario, the
         class labels of the unlabeled data are the missing values.




EM essentially consists of two phases. The “E” phase calculates probabilistically-weighted class labels, P(cj| di), for
every document using the current classifier and Equation 6. The “M” phase constructs a new classifier according to
Naive Bayes from all of the classified instances. EM then iterates over the E and M phases until convergence
occurs. In addition to EM, the authors also applied a technique known as shrinkage to assist with the sparseness of
the data.
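
The loop below sketches those two phases, starting from the bootstrapped prior and word_prob of the previous sketch
and re-estimating them from fractional counts. It omits shrinkage, and the smoothing constant and iteration count are
arbitrary choices.

    import math
    from collections import defaultdict

    def em_iterate(all_docs, classes, vocab, prior, word_prob, n_iters=10, alpha=1.0):
        """Alternate the E and M phases of EM over all documents, labeled or not."""
        for _ in range(n_iters):
            # E phase: probabilistically weighted class labels P(c | d) from Equation 6.
            posteriors = []
            for d in all_docs:
                log_scores = {c: math.log(prior[c]) +
                                 sum(math.log(word_prob[c][w]) for w in d if w in vocab)
                              for c in classes}
                m = max(log_scores.values())
                exps = {c: math.exp(s - m) for c, s in log_scores.items()}
                z = sum(exps.values())
                posteriors.append({c: e / z for c, e in exps.items()})
            # M phase: rebuild the Naive Bayes parameters from the fractional counts.
            class_mass = {c: sum(p[c] for p in posteriors) for c in classes}
            prior = {c: (class_mass[c] + alpha) / (len(all_docs) + alpha * len(classes))
                     for c in classes}
            word_mass = {c: defaultdict(float) for c in classes}
            for d, p in zip(all_docs, posteriors):
                for w in d:
                    if w in vocab:
                        for c in classes:
                            word_mass[c][w] += p[c]
            word_prob = {}
            for c in classes:
                total = sum(word_mass[c].values())
                word_prob[c] = {w: (word_mass[c][w] + alpha) / (total + alpha * len(vocab))
                                for w in vocab}
        return prior, word_prob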

The researchers found that keywords alone provided 45% accuracy. The classifier that they constructed using
bootstrapping, EM, and shrinkage, obtained 66% accuracy. The paper notes that this level of accuracy
approaches the estimated level of human agreement, which is 72%. Exactly what human agreement means is left
unclear, and the paper does not discuss coverage.

Conclusion
Computers should be able to help us obtain information from the Web in ways that are more sophisticated than those
used by current search engines. Different approaches to achieving this goal have emerged. The structuring
approach seeks to change the Web, so that the Web is easier for computers to understand. The learning approach
seeks to make computers smarter, so that they can understand the Web as it is. Because learning from the Web is
similar to learning from text, textual approaches serve as one of the foundations for work within the learning
approach. The standard baseline algorithm for producing text document classifiers is Naive Bayes. When applied to
Web page classification, Naive Bayes demonstrates results that are similar to those achieved when applied to text
document classification. To improve on Naive Bayes, researchers have explored other learners, including Maximum
Entropy and EM, which replace and/or augment Naive Bayes. In some cases, these other learners outperform Naive
Bayes. Work in training computers to understand Web content and to be able to use that understanding to provide
solutions is still in preliminary stages. The research has given some promising results.

References

    •    [Manning, 2000] C. Manning. January 2000. Lecture before the Digital Library Group at Stanford University.
    •    [Craven et al., 1998] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S.
         Slattery. 1998. Learning to Extract Symbolic Knowledge from the World Wide Web.
    •    [McCallum and Nigam, 1999] A. McCallum and K. Nigam. 1999. Text Classification by Bootstrapping with
         Keywords, EM and Shrinkage.
    •    [Nigam et al., 1999] K. Nigam, J. Lafferty, and A. McCallum. 1999. Using Maximum Entropy for Text
         Classification.
    •    [McCallum and Nigam, 1998] A. McCallum and K. Nigam. 1998. A Comparison of Event Models for Naive
         Bayes Text Classification. In AAAI-98 Workshop on Learning for Text Categorization. Tech. rep.
         WS-98-05, AAAI Press. http://www.cs.cmu.edu/~mccallum.
    •    [Joachims, 1997] T. Joachims. 1997. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text
         Categorization. In Machine Learning: Proceedings of the Fourteenth International Conference (ICML '97),
         pages 143-151.
    •    [Dempster et al., 1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum Likelihood from
         Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.




                                                     Page 7 of 7

More Related Content

What's hot

Image retrieval from the world wide web issues, techniques, and systems
Image retrieval from the world wide web issues, techniques, and systemsImage retrieval from the world wide web issues, techniques, and systems
Image retrieval from the world wide web issues, techniques, and systemsunyil96
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCSCJournals
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
Defense
DefenseDefense
Defenseac2182
 
The semanticweb may2001_timbernerslee
The semanticweb may2001_timbernersleeThe semanticweb may2001_timbernerslee
The semanticweb may2001_timbernersleegrknsfk
 
The end of the scientific paper as we know it (or not...)
The end of the scientific paper as we know it (or not...)The end of the scientific paper as we know it (or not...)
The end of the scientific paper as we know it (or not...)Frank van Harmelen
 
The end of the scientific paper as we know it (in 4 easy steps)
The end of the scientific paper as we know it (in 4 easy steps)The end of the scientific paper as we know it (in 4 easy steps)
The end of the scientific paper as we know it (in 4 easy steps)Frank van Harmelen
 
Medlink revision course in a box
Medlink revision course in a boxMedlink revision course in a box
Medlink revision course in a boxJames Craven
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for SciencePaul Groth
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper ProvenancePaul Groth
 

What's hot (11)

Broad Data
Broad DataBroad Data
Broad Data
 
Image retrieval from the world wide web issues, techniques, and systems
Image retrieval from the world wide web issues, techniques, and systemsImage retrieval from the world wide web issues, techniques, and systems
Image retrieval from the world wide web issues, techniques, and systems
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector Machine
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
Defense
DefenseDefense
Defense
 
The semanticweb may2001_timbernerslee
The semanticweb may2001_timbernersleeThe semanticweb may2001_timbernerslee
The semanticweb may2001_timbernerslee
 
The end of the scientific paper as we know it (or not...)
The end of the scientific paper as we know it (or not...)The end of the scientific paper as we know it (or not...)
The end of the scientific paper as we know it (or not...)
 
The end of the scientific paper as we know it (in 4 easy steps)
The end of the scientific paper as we know it (in 4 easy steps)The end of the scientific paper as we know it (in 4 easy steps)
The end of the scientific paper as we know it (in 4 easy steps)
 
Medlink revision course in a box
Medlink revision course in a boxMedlink revision course in a box
Medlink revision course in a box
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
 

Viewers also liked

Gitpractice01
Gitpractice01Gitpractice01
Gitpractice01mmm110
 
Google gmail services support ppt
Google gmail services support pptGoogle gmail services support ppt
Google gmail services support pptVictoria Martin
 
Meet Ukraine - full version
Meet Ukraine - full versionMeet Ukraine - full version
Meet Ukraine - full versionPatrick Tahiri
 
Artificial Neural Network Based Fault Classifier for Transmission Line Protec...
Artificial Neural Network Based Fault Classifier for Transmission Line Protec...Artificial Neural Network Based Fault Classifier for Transmission Line Protec...
Artificial Neural Network Based Fault Classifier for Transmission Line Protec...IJERD Editor
 
Alexander Krizhanovsky Krizhanovsky Hpds
Alexander Krizhanovsky Krizhanovsky HpdsAlexander Krizhanovsky Krizhanovsky Hpds
Alexander Krizhanovsky Krizhanovsky Hpdsguest092df8
 

Viewers also liked (9)

Lc 532 2010
Lc 532 2010Lc 532 2010
Lc 532 2010
 
Gitpractice01
Gitpractice01Gitpractice01
Gitpractice01
 
Rojo
RojoRojo
Rojo
 
Google gmail services support ppt
Google gmail services support pptGoogle gmail services support ppt
Google gmail services support ppt
 
Meet Ukraine - full version
Meet Ukraine - full versionMeet Ukraine - full version
Meet Ukraine - full version
 
TwoTwoFive
TwoTwoFiveTwoTwoFive
TwoTwoFive
 
IT Offshoring seminar
IT Offshoring seminarIT Offshoring seminar
IT Offshoring seminar
 
Artificial Neural Network Based Fault Classifier for Transmission Line Protec...
Artificial Neural Network Based Fault Classifier for Transmission Line Protec...Artificial Neural Network Based Fault Classifier for Transmission Line Protec...
Artificial Neural Network Based Fault Classifier for Transmission Line Protec...
 
Alexander Krizhanovsky Krizhanovsky Hpds
Alexander Krizhanovsky Krizhanovsky HpdsAlexander Krizhanovsky Krizhanovsky Hpds
Alexander Krizhanovsky Krizhanovsky Hpds
 

Similar to Equation 2.doc

A Comparative Study Of Citations And Links In Document Classification
A Comparative Study Of Citations And Links In Document ClassificationA Comparative Study Of Citations And Links In Document Classification
A Comparative Study Of Citations And Links In Document ClassificationWhitney Anderson
 
Challenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document ClusteringChallenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document ClusteringIOSR Journals
 
Project MLExAI: Machine Learning Experiences in AI
Project MLExAI: Machine Learning Experiences in AIProject MLExAI: Machine Learning Experiences in AI
Project MLExAI: Machine Learning Experiences in AIbutest
 
Project MLExAI: Machine Learning Experiences in AI
Project MLExAI: Machine Learning Experiences in AIProject MLExAI: Machine Learning Experiences in AI
Project MLExAI: Machine Learning Experiences in AIbutest
 
Ay3313861388
Ay3313861388Ay3313861388
Ay3313861388IJMER
 
The Revolution Of Cloud Computing
The Revolution Of Cloud ComputingThe Revolution Of Cloud Computing
The Revolution Of Cloud ComputingCarmen Sanborn
 
Semantic Technologies in Learning Environments
Semantic Technologies in Learning EnvironmentsSemantic Technologies in Learning Environments
Semantic Technologies in Learning EnvironmentsDragan Gasevic
 
Usability Assessment 2004 02
Usability Assessment 2004 02Usability Assessment 2004 02
Usability Assessment 2004 02jessicaward1
 
Web Mining Based Framework for Ontology Learning
Web Mining Based Framework for Ontology LearningWeb Mining Based Framework for Ontology Learning
Web Mining Based Framework for Ontology Learningcsandit
 
A LITERATURE REVIEW ON SEMANTIC WEB – UNDERSTANDING THE PIONEERS’ PERSPECTIVE
A LITERATURE REVIEW ON SEMANTIC WEB – UNDERSTANDING THE PIONEERS’ PERSPECTIVEA LITERATURE REVIEW ON SEMANTIC WEB – UNDERSTANDING THE PIONEERS’ PERSPECTIVE
A LITERATURE REVIEW ON SEMANTIC WEB – UNDERSTANDING THE PIONEERS’ PERSPECTIVEcsandit
 
Redesigning a Website Using Information Architecture Principals
Redesigning a Website Using Information Architecture PrincipalsRedesigning a Website Using Information Architecture Principals
Redesigning a Website Using Information Architecture PrincipalsJenny Emanuel
 
InternshipPoster
InternshipPosterInternshipPoster
InternshipPosterRu Zhao
 
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...IJwest
 
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...dannyijwest
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpagescsandit
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpagescsandit
 

Similar to Equation 2.doc (20)

A Comparative Study Of Citations And Links In Document Classification
A Comparative Study Of Citations And Links In Document ClassificationA Comparative Study Of Citations And Links In Document Classification
A Comparative Study Of Citations And Links In Document Classification
 
J017145559
J017145559J017145559
J017145559
 
Challenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document ClusteringChallenging Issues and Similarity Measures for Web Document Clustering
Challenging Issues and Similarity Measures for Web Document Clustering
 
Project MLExAI: Machine Learning Experiences in AI
Project MLExAI: Machine Learning Experiences in AIProject MLExAI: Machine Learning Experiences in AI
Project MLExAI: Machine Learning Experiences in AI
 
Project MLExAI: Machine Learning Experiences in AI
Project MLExAI: Machine Learning Experiences in AIProject MLExAI: Machine Learning Experiences in AI
Project MLExAI: Machine Learning Experiences in AI
 
Research Statement
Research StatementResearch Statement
Research Statement
 
Ay3313861388
Ay3313861388Ay3313861388
Ay3313861388
 
The Revolution Of Cloud Computing
The Revolution Of Cloud ComputingThe Revolution Of Cloud Computing
The Revolution Of Cloud Computing
 
Semantic Technologies in Learning Environments
Semantic Technologies in Learning EnvironmentsSemantic Technologies in Learning Environments
Semantic Technologies in Learning Environments
 
Usability Assessment 2004 02
Usability Assessment 2004 02Usability Assessment 2004 02
Usability Assessment 2004 02
 
Web Mining Based Framework for Ontology Learning
Web Mining Based Framework for Ontology LearningWeb Mining Based Framework for Ontology Learning
Web Mining Based Framework for Ontology Learning
 
A LITERATURE REVIEW ON SEMANTIC WEB – UNDERSTANDING THE PIONEERS’ PERSPECTIVE
A LITERATURE REVIEW ON SEMANTIC WEB – UNDERSTANDING THE PIONEERS’ PERSPECTIVEA LITERATURE REVIEW ON SEMANTIC WEB – UNDERSTANDING THE PIONEERS’ PERSPECTIVE
A LITERATURE REVIEW ON SEMANTIC WEB – UNDERSTANDING THE PIONEERS’ PERSPECTIVE
 
Redesigning a Website Using Information Architecture Principals
Redesigning a Website Using Information Architecture PrincipalsRedesigning a Website Using Information Architecture Principals
Redesigning a Website Using Information Architecture Principals
 
InternshipPoster
InternshipPosterInternshipPoster
InternshipPoster
 
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...
 
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...
A BOOTSTRAPPING METHOD FOR AUTOMATIC CONSTRUCTING OF THE WEB ONTOLOGY INSTANC...
 
LuisValeroInterests
LuisValeroInterestsLuisValeroInterests
LuisValeroInterests
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpages
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpages
 
Web Mining
Web MiningWeb Mining
Web Mining
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Equation 2.doc

  • 1. Machine Learning and Understanding the Web By Mark Chavira and Ulises Robles February 14, 2000 Introduction The World Wide Web contains information on many subjects, and on many subjects, the World Wide Web contains a lot of information. Undoubtedly, as it has grown, this store of data has developed into a mass of digitized knowledge that is unprecedented in both breadth and depth. Although researchers, hobbyists, and others have already discovered sundry uses for the resource, the sheer size of the WWW limits its use in many ways. To help manage the complexities of size, users have enlisted the aid of computers in ways that go beyond the simplistic ability to access pages by typing in URL's or by following the hyper-link structure. For example, Internet search engines allow users to find information in a way that is more convenient than and not always explicit in the hyper- links. Although computers already help us manage the Web, we would like them to do more. We would like to be able to ask a computer general questions, questions to which answers exist on the Web, questions like, "Who is the chair of the Computer Science Department at University X?" However, for computers to give such assistance, they must be able to understand a large portion of the semantic content of the Web. Computers do not currently understand this content. Of course, there is good reason. The Web was not designed for computerized understanding. Instead, it was designed for human understanding. As a result, one could argue that the idea of getting a computer to understand the content of the Web is akin to the idea of getting computers to understand natural language, a goal that remains elusive. Among the opinions regarding solutions to this problem are two. This paper gives an overview of both, and then it gives three examples of ways in which researchers are exploring the second possible solution. The Structuring Solution The first solution, sometimes championed by database experts and those accustomed to working with other types of highly structured data, is to change the Web. An extreme notion of this view claims that the data in the WWW is completely unstructured, and so long as it remains so, computers will have no hope of providing a type of assistance that is analogous to answering SQL queries. Proponents of this view would like to impose structure on the information in the Web in a way that is analogous to the way that information is structured within a relational database. At the very least, they would like a new form of markup language to replace HTML, one that would essentially give the computer clues that would help it to infer the semantics of the text. In the extreme, proponents of this "structuring solution" would like pages on the Web to be like forms that authors fill, forms that computers know about and can interpret. To be universally applicable to the WWW, both the moderate and the extreme approach require that existing data be put into a different form and that new data published to the Web conform to the new structure. The Learning Solution The second solution, sometimes championed by Artificial Intelligence experts and those accustomed to working with uncertainty, is to make computers smarter. Proponents of this view argue against the structuring solution, saying that, because humans would need to do the structuring, the task is simply not feasible. Dr. 
Christopher Manning, a professor in Linguistics and Computer Science at Stanford University, believes that there is simply too much information to convert; that the vast majority of people who publish pages to the Web would never agree to conform; and that the structure imposed would, to some extent, limit flexibility and expressiveness. Dr. Manning believes that the content on the Web is not unstructured. Rather, it possesses a "difficult structure," a structure that is possibly closer to the structure of natural language than to that of the "easy structure" found within a database [2000.] Those who champion the "learning solution" believe that computerized understanding--at least some form of simple comprehension--is an achievable goal. Moreover, some believe that the problem of understanding the Web is a simpler one than the problem of understanding natural language, because the Web does impose more structure than, say, a telephone conversation. There are, after all, links, tags, URL's and standard styles of Web page Page 1 of 7
  • 2. design that can give the computer hints about semantic intent. Even looking at the text in isolation, without making use of additional information within the page, has the potential to give good results. As a result, attempts at learning from text have direct applicability to learning from the Web. The remainder of this paper explores a small sampling of the work of researchers who champion the learning solution. The researchers in these examples work primarily on learning from text, without special consideration for other information that is embedded within Web pages. Obviously, the task is big. Initial work in the area has focused on solving similar but much smaller problems, with the hope that solutions to these smaller problems will lead to something more. Naive Bayes Naive Bayes is the default learning algorithm used to classify textual documents. In a paper entitled "Learning to Extract Symbolic Knowledge from the World Wide Web" [Craven et al., 1998] researchers at Carnegie Mellon University describe some of their experience applying Naive Bayes to the Web. The researchers describe their goal as follows: [The research effort has] the long term goal of automatically creating and maintaining a computer- understandable knowledge base whose content mirrors that of the World Wide Web...Such a “World Wide Knowledge Base” would consist of computer understandable assertions in symbolic, probabilistic form, and it would have many uses. At a minimum, it would allow much more effective information retrieval by supporting more sophisticated queries than current keyword-based search engines. Going a step further, it would enable new uses of the Web to support knowledge-based inference and problem solving. The essential idea is to design a system which, when given (1) an "ontology specifying the classes and relations of interest" and (2) "training examples that represent instances of the ontology classes and relations," would then learn "general procedures for extracting new instances of these classes and relations from the Web." In their preliminary work, the researches simplified the problem by focusing on a small subset of the Web and by attempting to train the computer to recognize a very limited set of concepts and relationships. More specifically, the team acquired approximately ten thousand pages from the sites of various Computer Science Departments. From these pages, they attempted to train the computer to recognize the kinds of objects depicted in Figure 1. E n t i ty O th er A c ti v i ty P e rso n D e p a rtm e n t R e se a rc h P ro je c t C o u rse F a c u lty S ta ff S tu d ent Figure 1 CMU Entity Hierarchy In addition, they attempted to have the system learn the following relationships among the objects: • (InstructorOfCourse A B) • (MembersOfProject A B) • (DepartmentOfPerson A B) The above relationships can be read, "A is the instructor of course B;" "A is a member of project B;" and "A is the department of person B." The two main goals of the research were (1) to train the computer to accurately assign one of the eight leaf classes to a given Web page from a CS department and (2) to train the computer to recognize the above relationships among the pages that represent the entities. The team used a combination of methods. We will consider goal (1), since it is to this goal that the group applied Naive Bayes. Page 2 of 7
  • 3. Underlying the work were a set of assumptions that further simplified the task. For example, during their work, the team assumed that each instance of an entity corresponds to exactly one page in their sample. As a consequence of this assumption, a student, for example, corresponds to the student's home page. If the student has a collection of pages at his or her site, then the main page would be matched with the student and the other pages would be categorized as "other." These simplifying assumptions are certainly a problem. However, as the project progresses into the future, the team intends to remove many of the simplifying assumptions to make their work more general. During the first phase of the experiment, researchers hand-classified the pages. Afterward, the researchers applied a Naive Bayes learner to a large subset of the ten thousand pages in order to generate a system that could identify the class corresponding to a page it had not yet seen and for which it did not already have the answers. The team then applied the system to the remaining pages to test the coverage and accuracy of the results. They used four-fold cross validation to check their results. The page classification sub-problem demonstrates the kinds of results the group achieved. To classify a page, the researchers used a classifier that assigned a class c' to a document d according to the following equation:  logPr(c) T  Pr(w i | c)   c' = argmax  + ∑ Pr(w i | d)log  Pr(w | d)    c  n i =1  i  Equation 1 The paper describes the terms in the equation as follows: "where n is the number of words in d, T is the size of the vocabulary, and wi is the i-th word in the vocabulary. Pr(w i | c) thus represents the probability of drawing wi given a document from class c, and Pr(wi | d) represents the frequency of occurrences of wi in document d." The approach is familiar. Define a discriminant function for each class. For a given document d, run the document through each of the discriminant functions and choose the class that corresponds to the largest result. Each discriminant function makes use of Bayes law and, in this experiment, assumes feature independence (hence the “naive” part of name.) Because of this assumption, the method used at CMU does not suffer terribly from the curse of dimensionality, which would otherwise become severe, as each word in the vocabulary adds an additional dimension. With coverage of approximately 20%, the average accuracy for each class was approximately 60%. With higher coverage the accuracy goes down, so that at 60% coverage, accuracy is roughly 40%. These numbers don't seem all that great. However, the simplifying assumptions contributed to low performance. For example, for a collection of pages that comprise a student's Web site, the classifier might choose many of them to correspond to an instance of student. Recall that, because of an artificial, simplifying assumption, only one page corresponds to the student, while the others get grouped into “other” class. As the researchers remove assumptions and adjust their learning algorithms accordingly, performance should improve. As a side note, when the researchers introduced some additional heuristics, such as heuristics that examine patterns in the URL, accuracy improved to 90% for 20% coverage and 70% for 60% coverage. We shall see how some other techniques perform better than Naive Bayes used in isolation. Maximum Entropy Over the years, other techniques for supervised learning in text classification have emerged. 
Maximum Entropy
Over the years, other techniques for supervised learning in text classification have emerged. Nigam, Lafferty, and McCallum [1999] describe one of them: Maximum Entropy, which has been applied to a variety of natural language tasks. Maximum Entropy estimates class probability distributions from a given set of labeled training data. The method the authors present is an iterative scaling algorithm that finds the maximum-entropy distribution consistent with the feature constraints derived from the labeled classes.

The algorithm defines a model of the class distributions. The model begins as a uniform distribution, since nothing is yet known about the distributions of the classes. The algorithm then changes the model in an iterative fashion. With each iteration, the algorithm uses the labeled training data to constrain the model so that it matches the data more closely. After the algorithm concludes, the model gives a good estimate of the distribution of the class labels given a document. The authors' experiments show that Maximum Entropy is better than Naive Bayes, but that Maximum Entropy sometimes suffers from over-fitting the training data due to poor feature selection. When priors are used together with Maximum Entropy, performance improves.
The researchers first selected a set of features. The features used in this experiment were word counts. For each (class, word) pair, the algorithm computes the expected number of times the word appears in a document of the class, divided by the total number of words in the document. If a word occurs frequently in a given class, the corresponding weight for that class-word pair is set higher than for pairs having a different class and the same word. The authors point out that this choice of features is typical in natural language classification.

Maximum Entropy starts by constraining the model distribution so that, for each feature, the expectation under the model equals the average of that feature over the labeled documents, as expressed by the following equation:

    \frac{1}{|D|} \sum_{d \in D} f_i(d, c(d)) \;=\; \frac{1}{|D|} \sum_{d \in D} \sum_{c} P(c \mid d)\, f_i(d, c)

Equation 2

where each fi(d, c) is a feature of document d for class c. From Equation 2, the paper concludes that the conditional distribution of each class has the parametric, exponential form:

    P(c \mid d) = \frac{1}{Z(d)} \exp\!\Big( \sum_i \lambda_i f_i(d, c) \Big)

Equation 3

where λi is the parameter to be learned for feature i and Z(d) is a normalizing constant.

The authors then present Improved Iterative Scaling (IIS), a hill-climbing algorithm used to compute the parameters of the classifier given the labeled data. The algorithm works in log-likelihood space. Given the training data, we can compute the log-likelihood of the model as follows:

    l(\Lambda \mid D) = \log \prod_{d \in D} P_{\Lambda}(c(d) \mid d) = \sum_{d \in D} \sum_i \lambda_i f_i(d, c(d)) \;-\; \sum_{d \in D} \log \sum_c \exp\!\Big( \sum_i \lambda_i f_i(d, c) \Big)

Equation 4

where Λ denotes the vector of parameters λi that defines the conditional distribution in Equation 3, and D is the set of labeled documents.

A general outline of the Improved Iterative Scaling algorithm follows. Given the set of labeled documents D and a set of features f, perform the following steps:
1. For all features fi, compute the expected value over the training data according to Equation 2.
2. Initialize all the feature parameters to be estimated (λi) to 0.
3. Iterate until convergence occurs, i.e., until the global maximum is reached:
   a. Calculate the expected class labels for each document with the current parameters, PΛ(c | d), i.e., evaluate Equation 3.
   b. For each λi:
      i. Using standard hill climbing, find a step size δi that increases the log-likelihood.
      ii. Set λi = λi + δi.
4. Output the text classifier.

The analysis presented in the paper shows that at each step we can find changes to each λi that move the model toward convergence at the single global maximum of the likelihood "surface"; the paper states that there are no local maxima.
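The sketch below shows the parametric form of Equation 3 together with a simplified training loop. For clarity it replaces the paper's IIS step-size search with plain gradient ascent on the log-likelihood of Equation 4 (whose gradient with respect to λi is the observed feature value minus its expectation under the current model). The per-class word-frequency features, the learning rate, and the toy data are assumptions made for illustration.

    import math

    # Simplified maximum-entropy classifier (Equation 3), trained by gradient ascent
    # on the log-likelihood of Equation 4 instead of the paper's IIS step-size search.
    # Features f_i(d, c): one per (class, word) pair, valued as word frequency in d.

    def make_features(vocab, classes):
        return [(c, w) for c in classes for w in vocab]

    def feature_value(feature, doc_tokens, c):
        fc, w = feature
        if fc != c or not doc_tokens:
            return 0.0
        return doc_tokens.count(w) / len(doc_tokens)

    def predict(lambdas, features, doc_tokens, classes):
        """P(c | d) from Equation 3."""
        scores = {c: math.exp(sum(l * feature_value(f, doc_tokens, c)
                                  for l, f in zip(lambdas, features))) for c in classes}
        z = sum(scores.values())   # Z(d)
        return {c: s / z for c, s in scores.items()}

    def train(labeled_docs, classes, iterations=200, step=0.5):
        vocab = sorted({w for d, _ in labeled_docs for w in d})
        features = make_features(vocab, classes)
        lambdas = [0.0] * len(features)              # step 2: initialize to zero
        for _ in range(iterations):                  # step 3: iterate
            grad = [0.0] * len(features)
            for doc, label in labeled_docs:
                p = predict(lambdas, features, doc, classes)   # step 3a
                for i, f in enumerate(features):
                    observed = feature_value(f, doc, label)
                    expected = sum(p[c] * feature_value(f, doc, c) for c in classes)
                    grad[i] += observed - expected
            # step 3b, simplified: averaged gradient step instead of the IIS delta
            lambdas = [l + step * g / len(labeled_docs) for l, g in zip(lambdas, grad)]
        return lambdas, features

    # Hypothetical usage
    docs = [(["lecture", "homework", "exam"], "course"),
            (["resume", "advisor", "hobbies"], "student")]
    lambdas, features = train(docs, ["course", "student"])
    print(predict(lambdas, features, ["homework", "exam"], ["course", "student"]))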
Compared to Naive Bayes, Maximum Entropy does not require that the features be independent. For instance, the phrase "Palo Alto" consists of two words that almost always occur together and rarely occur on their own. Naive Bayes will consider the two words independently and, in effect, count the phrase twice. Maximum Entropy, however, will reduce the weight of these features by half, since the constraints are based on the expected counts.

The authors use three data sets to evaluate the performance of the algorithm:
• The WebKB data set [Craven et al., 1998] contains Web documents from university computer science departments. In the present research, the authors use those pages in this data set that correspond to student, faculty, course, and project pages (4,199 pages).
• The Industry Sector hierarchy data set [McCallum and Nigam, 1998] contains company Web pages classified into a hierarchy of industry sectors. The 6,440 pages are divided into 71 classes, and the hierarchy is two levels deep.
• The Newsgroup data set [Joachims, 1997] contains about 20,000 articles divided into 20 UseNet discussion groups. The project removed words that occur only once.

The authors also consider Naive Bayes as a text classification algorithm. They present a comparison using two variants of Naive Bayes: scaled (the word counts are scaled so that each document has a constant number of word occurrences) and un-scaled. They mention that in most cases the scaled version performs better than regular Naive Bayes. The experimenters used cross validation to test results.

For the Newsgroup and the Industry Sector data sets, the algorithm sometimes over-fit. To prevent this, the researchers stopped the iterations early. In all the test cases, Maximum Entropy performed better than regular Naive Bayes, especially on the WebKB data set, where the algorithm achieved a 40% reduction in error over Naive Bayes. However, compared to scaled Naive Bayes, the Maximum Entropy results were sometimes better and sometimes slightly worse. The authors attribute the worse performances to over-fitting.

To help deal with over-fitting, the authors experimented with Gaussian priors with mean zero and a diagonal covariance matrix; in these experiments, the same variance is used for every feature. Equation 5 describes the Gaussian distribution of the priors:

    P(\Lambda) = \prod_i \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left( \frac{-\lambda_i^2}{2\sigma_i^2} \right)

Equation 5

where λi is the parameter for the i-th feature and σi² is its variance.

The paper shows that using a Gaussian prior with Maximum Entropy reduces over-fitting and lowers the classification error, yielding performance better than that obtained with scaled Naive Bayes. In the cases where no over-fitting is encountered without priors, the performance is almost unchanged.

The authors point out a few shortcomings of their approach. One shortcoming is that the same features are used for every class; this need not be the case, and the learner would be more flexible if features could differ by class. Another limitation is that the same Gaussian prior variance is used in all the experiments. This choice is not ideal, particularly when the training data is sparse. A possible improvement is to adjust the prior based on the amount of training data.
The authors hypothesize that another improvement would result from using feature functions of the form log(count), or some other sub-linear transformation, instead of the raw counts themselves; they have observed that using un-scaled counts decreases accuracy.
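For reference, the effect of the Gaussian prior can be written out by combining Equation 4 with the logarithm of Equation 5. Dropping terms that do not depend on Λ, the objective being maximized becomes a penalized log-likelihood (this is a restatement of the equations above, not an additional result from the paper):

    l(\Lambda \mid D) + \log P(\Lambda) \;=\; l(\Lambda \mid D) \;-\; \sum_i \frac{\lambda_i^2}{2\sigma_i^2} \;+\; \text{const}

In other words, the prior simply penalizes large parameter values, which is why it curbs over-fitting; a larger variance σi² corresponds to a weaker penalty.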
Classifying with Unlabeled Training Data (Bootstrapping and EM)
Many machine learning tasks begin with (1) a set of classifications c1..cn and (2) a set of instances that need to be classified according to (1). The first step is to hand-label a number of training instances for input into the learning algorithm and test instances for input into the resulting classifier. In many cases, to obtain good results, the number of hand-labeled instances must be large. Herein lies a problem. Hand-labeling is usually difficult and time-consuming, and it would be desirable to skip this step. The problem applies to the Web and to classifying text in general. Hence the motivation for the research described in "Text Classification by Bootstrapping with Keywords, EM and Shrinkage" [McCallum and Nigam, 1999]. In this paper, the researchers describe an approach to classifying text that does not require hand-labeled training instances.

The researchers begin with the goal of classifying Computer Science papers into 70 different topics (e.g., NLP, Interface Design, Multimedia), which are sub-disciplines of Computer Science. They proceed as follows:
1. For each class ci, define a short list of keywords.
2. Classify the documents according to the list of keywords. Some documents will remain unclassified.
3. Construct a classifier using the instances that have been classified.
4. Use the classifier to classify all instances.
5. Iterate over steps (3) and (4) until convergence occurs.

Step (1) is to choose keywords for each class. A person chooses these keywords in a way he or she believes will help identify instances of the class. This selection process is not easy; it requires much thought, trial, and error. However, it requires much less work than hand-labeling data. Some of the keywords the researchers chose follow:

    Topic               Keywords
    NLP                 language, natural, processing, information, text
    Interface Design    interface, design, user, sketch, interfaces
    Multimedia          real, time, data, media

Step (2) classifies documents according to the keywords. To perform this step, the computer essentially searches for the first keyword that appears in the given document and assigns the document to the corresponding class. Doing so leaves many documents unclassified and some documents misclassified. In step (3), the researchers use a Naive Bayes learner to construct an initial classifier from the documents classified in step (2), the labels obtained from step (2), and discriminant equations derived from Equation 6:

    P(c_j \mid d_i) \;\propto\; P(c_j)\, P(d_i \mid c_j) \;\propto\; P(c_j) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j)

Equation 6

where cj is the class, di is the document being considered, and wdi,k is the k-th word in document di.

Finally, the researchers use the results obtained thus far to "bootstrap" the EM algorithm. That is, they use the results as a starting point for the EM learner, which successively improves the quality of the classifier. The paper describes EM as follows:

EM is a class of iterative algorithms for maximum likelihood or maximum a posteriori parameter estimation in problems with incomplete data [Dempster et al., 1977]. Given a model of data generation and data with some missing values, EM iteratively uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the parameters and give estimates for the missing values.
In our scenario, the class labels of the unlabeled data are the missing values.

EM essentially consists of two phases. The "E" phase calculates probabilistically-weighted class labels, P(cj | di), for every document using the current classifier and Equation 6. The "M" phase constructs a new Naive Bayes classifier from all of the (now probabilistically labeled) instances. EM then iterates over the E and M phases until convergence occurs.
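The sketch below ties the pieces together: keyword labeling (steps 1 and 2), an initial Naive Bayes model built from the keyword-labeled documents via Equation 6, and the E and M phases just described. The keyword lists follow the table above; the smoothing, the stopping rule, and the toy documents are assumptions made for illustration, and the paper's shrinkage technique is omitted.

    import math
    from collections import defaultdict

    # Sketch of bootstrapping with keywords plus EM over a Naive Bayes model (Equation 6).
    # Keyword lists follow the table above; everything else is an illustrative assumption.
    KEYWORDS = {
        "NLP":              ["language", "natural", "processing", "information", "text"],
        "Interface Design": ["interface", "design", "user", "sketch", "interfaces"],
        "Multimedia":       ["real", "time", "data", "media"],
    }

    def keyword_label(doc):
        """Steps 1-2: assign the class of the first matching keyword list, else None."""
        for c, words in KEYWORDS.items():
            if any(w in doc for w in words):
                return c
        return None

    def m_step(docs, weights, classes, vocab):
        """Build Naive Bayes parameters from probabilistically weighted documents."""
        prior, word_prob = {}, {}
        for c in classes:
            mass = sum(weights[i][c] for i in range(len(docs)))
            prior[c] = mass / len(docs)
            counts = defaultdict(float)
            for i, d in enumerate(docs):
                for w in d:
                    counts[w] += weights[i][c]
            total = sum(counts.values())
            word_prob[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}  # add-one smoothing
        return prior, word_prob

    def e_step(docs, prior, word_prob, classes):
        """P(c_j | d_i) via Equation 6, normalized over classes."""
        weights = []
        for d in docs:
            log_p = {c: math.log(prior[c] or 1e-12) + sum(math.log(word_prob[c][w]) for w in d)
                     for c in classes}
            m = max(log_p.values())
            p = {c: math.exp(v - m) for c, v in log_p.items()}
            z = sum(p.values())
            weights.append({c: p[c] / z for c in classes})
        return weights

    def bootstrap_em(docs, iterations=10):
        classes = list(KEYWORDS)
        vocab = sorted({w for d in docs for w in d})
        # Step 3: initial hard labels from keyword matching (unmatched docs get zero weight).
        weights = [{c: 1.0 if keyword_label(d) == c else 0.0 for c in classes} for d in docs]
        for _ in range(iterations):                 # steps 4-5: iterate M and E phases
            prior, word_prob = m_step(docs, weights, classes, vocab)
            weights = e_step(docs, prior, word_prob, classes)
        return prior, word_prob

    # Hypothetical usage
    docs = [["natural", "language", "parsing"], ["user", "interface", "sketch"],
            ["streaming", "media", "data"], ["statistical", "parsing", "corpora"]]
    prior, word_prob = bootstrap_em(docs)
    print(prior)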
In addition to EM, the authors also applied a technique known as shrinkage to help with the sparseness of the data. The researchers found that keywords alone provided 45% accuracy. The classifier that they constructed using bootstrapping, EM, and shrinkage obtained 66% accuracy. The paper notes that this level of accuracy approaches the estimated human agreement level, which is 72%. Exactly what human agreement means is left unclear, and the paper does not discuss coverage.

Conclusion
Computers should be able to help us obtain information from the Web in ways that are more sophisticated than those used by current search engines. Different approaches to achieving this goal have emerged. The structuring approach seeks to change the Web so that it is easier for computers to understand. The learning approach seeks to make computers smarter, so that they can understand the Web as it is. Because learning from the Web is similar to learning from text, textual approaches serve as one of the foundations for work within the learning approach. The default learning algorithm for producing text document classifiers is Naive Bayes. When applied to Web page classification, Naive Bayes demonstrates results similar to those achieved on ordinary text document classification. To improve on Naive Bayes, researchers have explored other learners, including Maximum Entropy and EM, which replace and/or augment Naive Bayes. In some cases, these other learners outperform Naive Bayes. Work in training computers to understand Web content, and to use that understanding to provide solutions, is still in its preliminary stages, but the research has given some promising results.

References
• [Manning, 2000] C. Manning. January 2000. Lecture before the Digital Library Group at Stanford University.
• [Craven et al., 1998] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. 1998. Learning to Extract Symbolic Knowledge from the World Wide Web.
• [McCallum and Nigam, 1999] A. McCallum and K. Nigam. 1999. Text Classification by Bootstrapping with Keywords, EM and Shrinkage.
• [Nigam et al., 1999] K. Nigam, J. Lafferty, and A. McCallum. 1999. Using Maximum Entropy for Text Classification.
• [McCallum and Nigam, 1998] A. McCallum and K. Nigam. 1998. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI-98 Workshop on Learning for Text Categorization. Technical Report WS-98-05, AAAI Press. http://www.cs.cmu.edu/~mccallum.
• [Joachims, 1997] T. Joachims. 1997. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Machine Learning: Proceedings of the Fourteenth International Conference (ICML '97), pages 143-151.
• [Dempster et al., 1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.