Term Project of CS570 Artificial Intelligence
Webpage Categorization: An Application of Text Categorization
Team Golf
20043561 Ha-Yong, Jung
20040000 Hoang, G. Vu.
20043350 Jung-Ki, Yoo
2004/12/19
1. Introduction
Text Categorization (TC) is the task of automatically classifying documents, based on
their contents (i.e., on text only, not on images, metadata, etc.), into a number of
predefined categories. Each document may be assigned to several categories, to exactly
one, or to none at all.
TC has a wide range of applications including document organization, document
filtering, web page categorization, and word sense disambiguation.
In this project, we develop a web page categorization application capable of
automatically classifying Korean scientific web pages into a number of ‘favorite’
categories. We choose Support Vector Machines (SVM) as our classification technique
because SVM is powerful, theoretically rigorous, and able to handle very
high-dimensional feature spaces. Furthermore, some recent TC systems have shown that
SVM outperforms most other machine learning techniques, such as neural networks,
RBF networks, and decision trees. A text classifier based on SVM uses a document vector
as the feature vector that represents a text, in both the training step and the
classification step. Our target documents, web pages, contain far more unnecessary text
than the content text we actually need, so a web page has to go through several
transformation steps before it can be used as a document vector for the classifier.
We implemented our system with a client-server structure, which has several advantages.
First, the training set and its categories can be changed independently of the
application. We can also replace the classifier with another machine learning algorithm,
or optimize it. None of these changes affects the application, so the performance and
coverage of our system can be improved without changing the clients; that is, users do
not need to update their client program. The overall system flow is as follows:
Figure 1: System Flow
This paper is organized as follows. Section 2 discusses text representation.
Section 3 introduces SVM and its use in our application. Section 4 describes the details
of our application, and Section 5 concludes.
2. Text Representation
In this section, we describe the text representation needed for text categorization.
The Text Representation module consists of several steps, and the partial result of
each step can be viewed on the web. Our system can therefore be accessed not only from
the client application but also from a general web browser.
Figure 2: Web Interface of the Text Representation Module
2.1. Contents Extraction from Webpage
The target documents of our application are web pages, so we must first extract the
actual contents from each page, because a page contains far more text that is unrelated
to the contents (HTML tags, JavaScript, and so on). HTML is loosely structured in
practice, and the same thing can be expressed in many different ways, so extracting the
contents has to deal with many exceptions.
Two kinds of text must be eliminated from an HTML page. One is meta text, including
HTML tags; the other is visible text that is not related to the real contents. First,
we eliminate the HTML tags delimited by the “<” and “>” characters. Then we eliminate
JavaScript and style sheets. Deciding where these blocks start and end is ambiguous,
because their opening can be written in various ways and their bodies may themselves
contain special characters such as “<” and “>”. Finally, we eliminate visible text that
is not related to the real contents. For example, the text between anchor tags
represents a link to another web page. The linked page is meaningful, of course, but in
many cases the link text itself is not, so we exclude it.
The issues mentioned above are only a small part of the whole problem. There are many
such exceptions, and fixing them takes a long time because many of them are discovered
only in the parsing step, not in this step. The “contents” part marked by the blue box
in Figure 2 shows the contents text after this process.
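The elimination rules above can be sketched with Python's standard `html.parser` module. This is only an illustration of the idea, not the extraction code used in the project: the class name `ContentExtractor` and the function `extract_contents` are hypothetical, and real pages need many more special cases than this sketch handles.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect visible text, skipping scripts, style sheets, and link text."""

    # Elements whose text is dropped (assumption: anchor text is always noise).
    SKIPPED_TAGS = {"script", "style", "a"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Tags never reach this method, so only text needs filtering here.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_contents(html):
    """Return the content text of an HTML page as a single string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p { color: red }</style>"
    "<script>if (a < b) { x = 1; }</script></head>"
    '<body><p>SVM text</p><a href="x.html">click here</a><p>more</p></body></html>'
)
contents = extract_contents(page)
```

The parser treats the body of `<script>` and `<style>` as raw text until the matching end tag, which sidesteps the “<” and “>” ambiguity described above for well-formed pages.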
2.2. Parsing Contents
The contents produced by the previous step could be used directly to build the document
vector, but they do not represent the features of each document well because they
contain every term in the contents. Most of the important features of a document are
carried by nouns and a few verbs, yet the contents also include adjectives, adverbs,
prepositions, conjunctions, and so on. Moreover, because the text is natural language,
a term with a single meaning appears in various surface forms. We therefore parse the
contents and extract only the noun terms. We keep only nouns because verbs carry little
of a document's meaning relative to nouns, and much of the meaning carried by verbs is
also covered by nouns. For parsing, we used the Korean parser tool “HanNaNum.” The
“Parsed Data” part marked by the light blue box in Figure 2 shows the parsed result of
the contents text after this process.
2.3. Document Vector
Each document is represented by a sparse vector of weights
$d_j = (w_{1j}, w_{2j}, \ldots, w_{rj})$
that has one weight for every term that occurs at least once in the collection of
documents, where r is an index of terms and j is an index of documents. We call this
sparse vector the Document Vector. It is very sparse because the index r ranges over the
terms of the whole training document set, while a single document contains only a few
of them. The index r must be globally unique across the whole training document set
because each index is used as one feature dimension in the training step of the machine
learning algorithm. We therefore built an index table, implemented as a hash table, that
assigns an index r to each term, and we look up this table whenever we need the index
of a term. An index table for the category of each document is built in the same way.
In the training step of text classification, we need a document vector matrix in which
each row is the document vector of one parsed training document. We therefore built a
training matrix that contains each document vector together with its category. In the
application, we also need a document vector for each input document. The “Document
Vector” part marked by the dark brown box in Figure 2 shows the document vector of the
parsed text after this process. It shows only the terms used in one document; the index
assigned to each term is shown in the next step.
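The index table and the training matrix described in this section might look roughly like the following sketch. The names `IndexTable` and `build_training_matrix` are hypothetical, indices are assumed to start at 1, and the weights here are raw term counts; the example terms and categories are made up for illustration.

```python
class IndexTable:
    """Hash table that assigns a globally unique integer index to each key."""

    def __init__(self):
        self._index = {}

    def get_or_add(self, key):
        # Assumption: indices start at 1 and grow in order of first appearance.
        if key not in self._index:
            self._index[key] = len(self._index) + 1
        return self._index[key]

    def lookup(self, key):
        """Return the index of a known key, or None for an unseen key."""
        return self._index.get(key)

    def __len__(self):
        return len(self._index)


def build_training_matrix(parsed_docs, terms, categories):
    """Turn (noun list, category name) pairs into (category index, sparse row)."""
    rows = []
    for nouns, category in parsed_docs:
        counts = {}
        for noun in nouns:
            r = terms.get_or_add(noun)
            counts[r] = counts.get(r, 0) + 1
        rows.append((categories.get_or_add(category), counts))
    return rows


terms = IndexTable()
categories = IndexTable()
training = build_training_matrix(
    [(["cell", "blood", "cell"], "Bio"), (["star", "blood"], "Ast")],
    terms,
    categories,
)
```

Each row stores only the nonzero entries as a dictionary, which keeps memory use proportional to the number of distinct terms in a document rather than to the size of the whole vocabulary.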
2.4. Document Vector for SVM
We already built the document vector in the previous step, but SVM needs a slightly
different format, so the document vector must be converted. The first constraint is
that the feature terms of the document vector and its category must be expressed as the
integer indices described above, and the feature indices must be in order. We therefore
converted the document vector into the format expected by SVM. We did one more thing in
this step to obtain higher precision: we used TFIDF as the weight of each term.
The TFIDF weight is computed as follows:
$$w_{rj} = \mathrm{tfidf}(t_r, d_j) = \#(t_r, d_j) \cdot \log \frac{C_0}{\#(d_j)}$$
Here $\#(t_r, d_j)$ is the number of times term $t_r$ occurs in document $d_j$. In the
usual TFIDF definition, the numerator of the logarithm is the number of documents in the
collection and the denominator is the number of those documents in which $t_r$ occurs.
This weighting function generally gives higher precision because it assigns higher
weights to more discriminative terms: a term that occurs in only a few documents has
more discriminative power, and the logarithmic factor grows as the number of documents
containing the term shrinks. The “Document Vector for SVM” part marked by the purple box
in Figure 2 shows the document vector for SVM after this process.
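The conversion in this section can be sketched as follows. The sketch assumes the standard TFIDF reading (term count times the logarithm of the collection size over the number of documents containing the term) and the sparse text format read by LIBSVM, `label index:value ...` with ascending indices. The function names and the tiny data set are hypothetical.

```python
import math


def document_frequencies(rows):
    """Count, for each term index, the number of documents containing it."""
    df = {}
    for _label, counts in rows:
        for r in counts:
            df[r] = df.get(r, 0) + 1
    return df


def tfidf_weights(counts, df, n_docs):
    """Weight each term by count * log(n_docs / document frequency)."""
    weights = {}
    for r, tf in counts.items():
        if r in df:  # terms never seen in training have no feature dimension
            w = tf * math.log(n_docs / df[r])
            if w != 0.0:  # zero weights are simply omitted from a sparse vector
                weights[r] = w
    return weights


def to_svm_line(label, weights):
    """Format one document as 'label index:value ...' with ascending indices."""
    features = " ".join(f"{r}:{w:.6g}" for r, w in sorted(weights.items()))
    return f"{label} {features}".rstrip()


# Hypothetical training rows: (category index, {term index: raw count}).
rows = [(1, {1: 2, 2: 1}), (2, {3: 1, 2: 1})]
df = document_frequencies(rows)
lines = [to_svm_line(label, tfidf_weights(counts, df, len(rows)))
         for label, counts in rows]
```

In this toy collection term 2 occurs in every document, so its logarithmic factor is zero and it drops out of both vectors, which is the behavior the weighting is meant to produce for non-discriminative terms.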
3. Text Categorization with Support Vector Machines
In this section, we introduce SVM and TC with SVM. The specific use of SVM in our
application is described afterward.
3.1. Support Vector Machines
SVM is a relatively new machine learning technique based on Statistical Learning
theory. SVM has proved to be very effective in dealing with high-dimensional feature
spaces, which are the most challenging problem for other machine learning techniques
because of the so-called curse of dimensionality.
For the sake of simplicity, let us examine the basic idea of SVM in the linearly
separable case (i.e., the training sample is separable by a hyperplane). Based on the
Structural Risk Minimization principle, the SVM algorithm tries to find a hyperplane
for which the margin (i.e., the minimal distance of any training point to the
hyperplane, see Figure 3) is as large as possible. To find this optimal margin,
quadratic optimization is used, in which the basic computation is the dot product of
two points in the input space.
Figure 3: Linear classifier and margins.
For the case that is not linearly separable, SVM uses kernel-based methods to map the
input space to a so-called feature space. The basic idea of kernel methods is to find a
map Φ from the input space, in which the data are not linearly separable, to a feature
space in which they are (see Figure 4). However, the feature space usually has a very
large or even infinite number of dimensions, so computing dot products in this space
directly is intractable. Fortunately, kernel methods overcome this problem by choosing
maps for which computing a dot product in the feature space reduces to computing a
kernel function in the input space:
$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle,$$
where $k(x, y)$ is a kernel function in the input space and
$\langle \Phi(x), \Phi(y) \rangle$ is a dot product in the feature space. Therefore,
the dot product in the feature space can be computed even if the map Φ is unknown. Some
of the most widely used kernel functions are Gaussian RBF, Polynomial, Sigmoidal, and
B-Splines.
Figure 4: Example of mapping from input space to feature space.
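As a small worked illustration of this identity, consider the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, for which the map Φ happens to be known explicitly. The example points are arbitrary and the function names are chosen only for this sketch.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for k(x, y) = (x . y)^2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed entirely in the input space."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
in_input_space = poly_kernel(x, y)        # one 2-D dot product, then squared
in_feature_space = dot(phi(x), phi(y))    # explicit 3-D dot product
```

Both values agree up to floating-point rounding, yet the kernel never builds the three-dimensional vectors. For kernels such as the Gaussian RBF the feature space is infinite-dimensional, so the kernel form is the only practical way to compute the dot product.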
3.2. Why Does SVM Work Well for Text Categorization?
SVM is known to be very effective in TC for the following reasons:
• High dimensional input spaces: In TC the dimension of the input space is normally
very large, which is very challenging for other machine learning techniques. However,
the behavior of SVM does not depend on the number of features, and thus it has the
potential to handle large feature spaces.
• Few irrelevant features: In TC, there are very few irrelevant features. Therefore,
using feature selection to reduce the number of features, as other machine learning
approaches do, also tends to decrease the accuracy of the classifier. In contrast,
SVM can handle a very large feature space, so feature selection is optional.
• Document vectors are sparse: In TC, each document vector has only a few entries
that are not zero (i.e., it is sparse). Moreover, there is considerable theoretical and
empirical evidence that SVM is well suited for problems with dense concepts and sparse
instances, like TC.
• Most TC problems are linearly separable: In practice, many TC problems are known
to be linearly separable. Therefore, SVM is suitable for these tasks.
3.3. Using SVM in our application
The SVM classifier of our application is built on the LIBSVM library. The library
implements C-support vector classification (C-SVC), ν-support vector classification
(ν-SVC), and ν-support vector regression (ν-SVR), incorporates efficient techniques
such as caching, chunking, and sequential minimal optimization, and performs well on
moderate-sized problems (tens of thousands of training data points). In our
application, we use only C-SVC with the RBF kernel function
$K(x, y) = e^{-\gamma \|x - y\|^2}$.
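For sparse document vectors, the RBF kernel only needs the entries that are nonzero in at least one of the two vectors. The following is a minimal sketch of that computation, not LIBSVM's own implementation; it assumes vectors stored as dictionaries from term index to weight, and the example values are arbitrary.

```python
import math


def squared_distance(x, y):
    """Squared Euclidean distance between two sparse vectors (index -> weight)."""
    keys = set(x) | set(y)
    return sum((x.get(k, 0.0) - y.get(k, 0.0)) ** 2 for k in keys)


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * squared_distance(x, y))


doc_a = {1: 1.0, 3: 2.0}
doc_b = {1: 1.0, 2: 1.0}
similarity = rbf_kernel(doc_a, doc_b, gamma=0.1)
```

The kernel equals 1 for identical documents and decays toward 0 as the documents move apart, with γ controlling how quickly it decays.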
To train our SVM classifier we use a collection of 7566 Korean scientific web documents
that have already been classified into 16 broad categories by domain experts. After
parsing and pre-processing, the total number of terms is 189815, meaning that each
document is represented by a 189815-dimensional sparse vector. A dimension this large
is considered extremely difficult for other machine learning techniques such as neural
networks or decision trees. Aside from the training data set, a separate testing data
set of 490 web documents, already classified into the same 16 categories, is used to
evaluate the classifier.
To choose the best parameters for the classifier (i.e., C and γ), a grid search over
pairs (C, γ) is normally performed. However, since a full grid search is very
time-consuming, we fix the γ parameter at 1/k (k is the number of training data) and
try only various values of C. We use the recommended range
$2^0, 2^1, 2^2, \ldots, 2^{10}$ for C, which is known to be good enough in practice. The
classification accuracies obtained on the training and testing data for these values of
C are shown in Table 1 below.
Table 1: Classification accuracy over training and testing data for various values of the C parameter.
     C    Accuracy over training data    Accuracy over testing data
     1              59.91%                         44.90%
     2              69.53%                         53.06%
     4              78.09%                         57.35%
     8              85.47%                         61.22%
    16              90.92%                         64.69%
    32              94.14%                         68.57%
    64              96.52%                         71.22%
   128              97.82%                         71.02%
   256              98.23%                         71.02%
   512              98.49%                         72.45%
  1024              98.61%                         72.45%
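As Table 1 shows, the classifier is most accurate with C=1024: it reaches the highest
accuracy on the training data and ties with C=512 for the highest accuracy on the
testing data (72.45%). Therefore, we choose C=1024 for our classifier.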
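The selection of C can be written down as a small sketch over the figures reported in Table 1. The tie-breaking rule (prefer higher training accuracy when testing accuracy is equal) is an assumption made for this sketch; the names `C_GRID`, `TABLE_1`, and `select_c` are hypothetical.

```python
# Candidate values of C: 2^0, 2^1, ..., 2^10.
C_GRID = [2 ** i for i in range(11)]

# Accuracies from Table 1, in percent: C -> (training, testing).
TABLE_1 = {
    1: (59.91, 44.90),
    2: (69.53, 53.06),
    4: (78.09, 57.35),
    8: (85.47, 61.22),
    16: (90.92, 64.69),
    32: (94.14, 68.57),
    64: (96.52, 71.22),
    128: (97.82, 71.02),
    256: (98.23, 71.02),
    512: (98.49, 72.45),
    1024: (98.61, 72.45),
}


def select_c(results):
    """Pick the C with the best testing accuracy; break ties by training accuracy."""
    return max(results, key=lambda c: (results[c][1], results[c][0]))


best_c = select_c(TABLE_1)
```

With these figures the rule returns 1024, matching the choice made in the text. A full grid search would repeat the same comparison over pairs (C, γ) instead of over C alone.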
4. Application Implementation
This section describes how we implemented our application and what it does. We used a
simple web-browser-style program to demonstrate the functions of our module.
4.1 Basic Structure
Our program consists of a client side (a web browser) and a server side (the
processing module).
Figure 5: Basic concept of the program
The client part is implemented as a simple web browser. While surfing the web, a user
who wants to add the current page to the favorite list can do so simply by pressing the
‘Add Favorite’ button at the top of the program. After showing the HTML code of the
current page, the program sends that code to the web server. The web server converts it
into a form suitable for SVM (by parsing and converting) and sends the result back to
the client program. After receiving that data (a preprocessed array of numbers suitable
for the SVM module), the client classifies the page into one of the 16 predefined
categories using the SVM module. Finally, the client program creates a .URL file for
the page in the directory of the assigned category. The .URL file is a common format
that can be used directly in the favorite list of Explorer.
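The last step, writing the shortcut into the category directory, might be sketched as follows. A .URL file is a small text file with an `[InternetShortcut]` section and a `URL=` entry. The function name `save_favorite`, the directory layout of one subdirectory per category, and the UTF-8 encoding are assumptions made for this sketch.

```python
import os


def save_favorite(base_dir, category, name, url):
    """Write name.url under base_dir/category and return the file path."""
    directory = os.path.join(base_dir, category)
    os.makedirs(directory, exist_ok=True)  # create the category folder if needed
    path = os.path.join(directory, name + ".url")
    # Assumption: UTF-8 is acceptable for the shortcut file in this sketch.
    with open(path, "w", encoding="utf-8") as shortcut:
        shortcut.write("[InternetShortcut]\n")
        shortcut.write(f"URL={url}\n")
    return path
```

For example, `save_favorite("Favorites", "Bio", "temporary", "http://example.com/")` would create `Favorites/Bio/temporary.url`; the URL and directory names here are placeholders.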
We focused on the science domain because we cannot cover every subject found on the
Internet. Pages are classified into the 16 categories listed below.
"Sub","Spa","Phy","Mea","Mat","Geo","Eng","Ene","Ecl","Ear","Che","Bio","Ast","Agr","Aer","Aco"
Sub: Time travel   Spa: Space         Phy: Physics   Mea: Measure   Mat: Mathematics
Geo: Geology       Eng: Engineering   Ene: Energy    Ecl: Ecology   Ear: Earth
Che: Chemistry     Bio: Biology       Ast: Astronomy Agr: Agriculture
Aer: Aero          Aco: Acoustics
4.2 How to use this program?
The following procedure shows how to use this program.
Figure 6: Main window of Client Program
The client program has an interface similar to Internet Explorer. If you want to add
the current page to your favorite list and have it categorized automatically, press the
‘Add Favorite’ button (Figure 6).
Figure 7: Source Code View
After pressing the button, you will see a dialog box (Figure 7) showing the HTML code
of the current web page. Because the code contains many unnecessary parts such as tags,
it must be sent to the preprocessing server.
You can also enter a name for the page in the ‘바로가기’ (“Shortcut”) box. If you leave
the box empty, the page is added to its category as ‘temporary.url’.
Figure 8: Result Dialog
After some tens of seconds, a dialog box (Figure 8) shows the result of the processing:
the category (here the page is categorized into the Bio directory) and the name of your
favorite file (혈액형.url, “blood type”).
You can use this result (the directory and its category subdirectories) simply by
copying and pasting it into your favorite list. If you would rather avoid this effort,
press the ‘Setup’ button and change the base directory of the favorite list (Figure 9),
as shown below.
Figure 9: Setup Dialog
Figure 10: After changing
If you set the base directory of our program to the favorite directory of Explorer, you
can use the result directly, as shown above (Figure 10).
5. Conclusion and Further Work
We tested the feasibility of applying a learning algorithm to web page categorization
in a web browser, using simple modules. In particular, we used SVM to classify HTML
pages because SVM is known to work well for character and document classification; we
chose this classifier to sort the HTML code obtained from the current page into the
user's favorite lists.
The client program in the current structure has too much work to do, such as waiting
for the preprocessing result from the server and running the SVM module. These
functions were originally planned to run on the web server. However, because of some
problems (possibly differences in protocol usage or in data handling such as
floating-point precision), we could not move all of the processing to our server.
If the server could perform all of the heavy computation and send only the result to
the client (Figure 11), the client would no longer spend time on these steps. This
structure would also make the module easier to manage, fix, and update, since that
would only require replacing the SVM model on the server.
Figure 11: Ideal Structure of server-client relationship (diagram labels: Client; Internet; Preprocess and SVM)
Moreover, this structure can be easily expanded to other applications like spam-mail
filtering, E-mail categorization, and Patent Classification.