Term Project of CS570 Artificial Intelligence



Webpage Categorization: An Application of Text Categorization




                                                                       Team Golf
                                                                 20043561 Ha-Yong, Jung
                                                                  20040000 Hoang, G. Vu.
                                                                   20043350 Jung-Ki, Yoo




                                     2004 / 12 / 19
1. Introduction
Text Categorization (TC) is the task of automatically assigning documents, based on their
contents (i.e., on the text only, not on images, metadata, etc.), to a number of predefined
categories. A document may be assigned to several categories, to exactly one, or to none
at all.
TC has a wide range of applications, including document organization, document
filtering, web page categorization, and word sense disambiguation.

    In this project, we develop a web page categorization application capable of
automatically classifying Korean scientific web pages into a number of ‘favorite’
categories. We choose Support Vector Machines (SVM) as our classification technique
because SVM is powerful, theoretically rigorous, and able to handle very high-
dimensional feature spaces. Furthermore, several recent TC systems have shown that
SVM outperforms most other machine learning techniques, such as neural networks,
RBF networks, and decision trees. Our SVM-based text classifier represents each text as
a document vector, which serves as the feature vector both during training and during
classification. Because a web page contains far more irrelevant text than the content text
we actually need, the page must be transformed through several steps before it can be
used as a document vector for the classifier.

    We implemented the system with a client-server structure, which has several
advantages. First, the training set and its categories can be changed independently of the
client application. We can also replace the classifier with another machine learning
algorithm, or optimize it, without affecting the client at all. As a result, the performance
and coverage of the system can be improved without changing the clients; users never
need to update their client programs. The overall system flow is as follows:




                                   Figure 1: System Flow

    This paper is organized as follows. Section 2 describes text representation. Section 3
introduces SVM and its use in our application. Section 4 describes the details of the
application itself.
2. Text Representation
In this section, we describe the text representation needed for text categorization. The
text representation module consists of several steps, and the partial result of each step
can be viewed on the web, so the system can be accessed not only from the client
application but also from an ordinary web browser.




                    Figure 2: Web Interface of the Text Representation Module

2.1. Contents Extraction from Webpage
The target documents of our application are web pages, so we need to extract only the
content from each page: most of the text in a page is unrelated to the content, for
example HTML tags, JavaScript, and so on. A web page is written in HTML, and HTML
is not a well-structured language, so content extraction is not easy; there are many
exceptions and many different ways of expressing the same thing in HTML.

    Two kinds of text should be eliminated from an HTML-formatted page. One is meta
text, including HTML tags; the other is body text that is unrelated to the real content.
First, we eliminate HTML tags marked by the “<” and “>” characters. Then we eliminate
JavaScript and style sheets; there is some ambiguity in deciding where they start and end,
because their entry points are expressed in various ways and they contain special
characters, including “<” and “>”. Besides this meta text, we also eliminate body text that
is unrelated to the real content. For example, the text between anchor tags is the label of
a link to another page; the linked page may well be meaningful, but in many cases the
link text itself is not, so we exclude it.
    The cases mentioned above are only a small part of the whole problem. There are
many such exceptions, and handling them takes a long time because many of them are
discovered only in the parsing step, not in this step. The “contents” part marked by the
blue box in Figure 2 shows the content text after this process.
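As a rough illustration of this cleaning step, the following Python sketch strips script and
style blocks, comments, anchor text, and the remaining tags with regular expressions. It
is not the project's actual code; the patterns are simplifications and, as noted above, a
real page requires handling many more special cases.

```python
import re

def extract_contents(html: str) -> str:
    """Rule-based content extraction (illustrative sketch only)."""
    # Drop <script> and <style> blocks including their bodies.
    text = re.sub(r"<script\b.*?</script>", " ", html, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r"<style\b.*?</style>", " ", text, flags=re.IGNORECASE | re.DOTALL)
    # Drop HTML comments.
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.DOTALL)
    # Drop anchor elements entirely: link labels are rarely real content.
    text = re.sub(r"<a\b[^>]*>.*?</a>", " ", text, flags=re.IGNORECASE | re.DOTALL)
    # Drop all remaining tags marked by "<" and ">".
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse whitespace.
    return re.sub(r"\s+", " ", text).strip()

if __name__ == "__main__":
    sample = ('<html><head><style>p{color:red}</style></head>'
              '<body><p>Blood types</p><a href="x.html">next page</a></body></html>')
    print(extract_contents(sample))   # -> "Blood types"
```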

2.2. Parsing Contents
The content produced by the previous step could be used to build a document vector, but
it is not yet a good representation of the document's features because it still contains
every term in the content. Most of the important features of a document are carried by
noun terms (and a few verb terms), yet the raw content also contains adjectives, adverbs,
prepositions, conjunctions, and so on. Moreover, because this is natural language, a term
with the same meaning appears in many surface forms. We therefore parse the content
and extract only noun terms: verbs carry relatively little of a document's meaning
compared with nouns, and much of what verbs do express is also covered by nouns. For
parsing we used the Korean morphological analyzer “HanNaNum.” The “Parsed Data”
part marked by the light blue box in Figure 2 shows the parsed result of the content text
after this process.
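The noun-filtering idea can be sketched as follows. The real system uses HanNaNum's
output; here we only assume a hypothetical analyzer that yields (morpheme, POS-tag)
pairs, and the tag strings are placeholders, not HanNaNum's actual tag set.

```python
from typing import Iterable, List, Tuple

def extract_nouns(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep only noun morphemes from (morpheme, POS-tag) pairs.

    The tag set here is hypothetical; HanNaNum's real tags differ.
    """
    return [morpheme for morpheme, tag in tagged if tag.startswith("N")]

# Hypothetical analyzer output for a short Korean sentence.
tagged_output = [("혈액형", "N"), ("은", "J"), ("유전", "N"), ("되", "V"), ("다", "E")]
print(extract_nouns(tagged_output))   # -> ['혈액형', '유전']
```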

2.3. Document Vector
Each document $d_j$ is represented by a sparse vector of weights
$$ d_j = \langle w_{1j}, w_{2j}, \dots, w_{rj} \rangle, $$
where $r$ is the number of distinct terms that occur at least once in the document
collection, $j$ indexes documents, and $w_{ij}$ is the weight of term $i$ in document $j$. We
call this sparse vector the Document Vector. It is very sparse because the term index runs
over the whole training collection, while a single document contains only a few of those
terms. A term index must be globally unique across the whole training set, because each
index corresponds to one feature dimension in the machine learning training step. We
therefore built an index table, implemented as a hash table, that assigns an index to each
term; to look up the index of a term, we consult this table. An index table for the category
of each document is built in the same way.
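A minimal sketch of such an index table is shown below, assuming a hash map from term
to integer. The 1-based indices and the behaviour for unseen terms are our illustrative
choices, not necessarily those of the original implementation.

```python
class IndexTable:
    """Assigns a globally unique integer index to each term (sketch).

    Built once over the whole training collection so that the same
    term always maps to the same feature dimension.
    """
    def __init__(self):
        self._index = {}          # term -> integer index (hash table)

    def get(self, term: str, grow: bool = True) -> int:
        if term not in self._index:
            if not grow:
                return -1         # unseen term at classification time
            # 1-based indices, as LIBSVM-style sparse formats conventionally expect.
            self._index[term] = len(self._index) + 1
        return self._index[term]

terms = IndexTable()
doc = ["혈액형", "유전", "혈액형"]
print(sorted({terms.get(t) for t in doc}))   # -> [1, 2]
```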

    For the training step of text classification we need a document vector matrix in which
each row is the document vector of one parsed training document, so we built a training
matrix containing every document vector together with its category. In the application,
the same representation must also be produced for each input document. The “Document
Vector” part marked by the dark brown box in Figure 2 shows the document vector of the
parsed text after this process; at this stage it lists only the terms used in the document,
and the index assigned to each term appears in the next step.

2.4. Document Vector for SVM
The document vector built in the previous step is almost what SVM needs, but SVM
requires a slightly different format, so the vector has to be converted. The first constraint
is that the feature terms of the document vector, and its category, must be the integer
indices described above; the second is that the feature indices must appear in ascending
order. We therefore converted the document vector into this format. In the same step we
do one more thing to obtain higher precision: we use TFIDF as the weight of each term.
TFIDF is computed as
$$ w_{rj} = \mathrm{tfidf}(t_r, d_j) = \#(t_r, d_j)\cdot \log\frac{|C_0|}{\#_{C_0}(t_r)}, $$
where $\#(t_r, d_j)$ is the number of occurrences of term $t_r$ in document $d_j$, $|C_0|$ is
the number of documents in the training collection $C_0$, and $\#_{C_0}(t_r)$ is the number of
training documents in which $t_r$ occurs. This weighting function generally yields higher
precision because it assigns larger weights to more discriminative terms: a term that
occurs in only a few documents has more discriminative power, and TFIDF gives such
terms higher weight. The “Document Vector for SVM” part marked by the purple box in
Figure 2 shows the document vector for SVM after this process.
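Putting the pieces together, the following sketch converts one parsed document into a
sorted sparse feature line with TFIDF weights, in the ascending ‘index:value’ style that
LIBSVM expects. The helper names and example numbers are illustrative, and it assumes
document frequencies have already been counted over the training collection.

```python
import math
from collections import Counter
from typing import Dict, List

def document_vector_for_svm(nouns: List[str],
                            term_index: Dict[str, int],
                            doc_freq: Dict[str, int],
                            num_docs: int) -> str:
    """Build the sparse feature part of one LIBSVM-style line (sketch).

    Weights follow the TFIDF formula in the text:
        w = tf(t, d) * log(|C0| / df(t)),
    where df(t) is the number of training documents containing t.
    """
    tf = Counter(nouns)
    features = {}
    for term, count in tf.items():
        if term not in term_index or doc_freq.get(term, 0) == 0:
            continue                                  # skip terms unseen in training
        idx = term_index[term]
        features[idx] = count * math.log(num_docs / doc_freq[term])
    # SVM input requires feature indices in ascending order.
    return " ".join(f"{idx}:{features[idx]:.6f}" for idx in sorted(features))

term_index = {"혈액형": 1, "유전": 2}          # hypothetical index table
doc_freq   = {"혈액형": 12, "유전": 430}       # hypothetical document frequencies
print(document_vector_for_svm(["혈액형", "유전", "혈액형"], term_index, doc_freq, 7566))
```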

3. Text Categorization with Support Vector Machines
In this section, we introduce SVM and text categorization with SVM, and then describe
the specific use of SVM in our application.

3.1. Support Vector Machines
SVM is a relatively new machine learning technique based on statistical learning theory.
It has proved very effective in dealing with high-dimensional feature spaces, the most
challenging problem for other machine learning techniques due to the so-called curse of
dimensionality.

     For simplicity, let us examine the basic idea of SVM in the linearly separable case
(i.e., the training sample is separable by a hyperplane). Based on the Structural Risk
Minimization principle, the SVM algorithm tries to find a hyperplane such that the margin
(i.e., the minimal distance of any training point to the hyperplane; see Figure 3) is
maximal. Finding this optimal margin is a quadratic optimization problem in which the
basic computation is the dot product of two points in the input space.
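For reference, the standard hard-margin formulation behind this statement (textbook
form, not specific to our system) is the quadratic program
$$
\min_{w,\,b}\ \tfrac{1}{2}\,\|w\|^2 \quad \text{subject to} \quad
y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1,\quad i = 1, \dots, n,
$$
whose solution maximizes the margin $2/\|w\|$ between the two classes.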




                            Figure 3: Linear classifier and margins.

    For the nonlinearly separable case, SVM uses kernel-based methods to map the input
space to a so-called feature space. The basic idea of kernel methods is to find a map Φ
from the input space, which is not linearly separable, to a linearly separable feature space
(see Figure 4). The problem with the feature space is that it usually has a very large or
even infinite number of dimensions, so computing dot products in it directly is
intractable. Fortunately, kernel methods overcome this problem by choosing maps for
which computing a dot product in the feature space reduces to evaluating a kernel
function in the input space,
$$ k(x, y) = \langle \Phi(x), \Phi(y) \rangle, $$
where $k(x, y)$ is a kernel function in the input space and $\langle \Phi(x), \Phi(y) \rangle$ is a dot
product in the feature space. The dot product in the feature space can therefore be
computed even if the map Φ is not known explicitly. The most widely used kernel
functions are the Gaussian RBF, polynomial, sigmoidal, and B-spline kernels.
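A standard textbook example (not taken from our system) makes the kernel trick
concrete: for two-dimensional inputs, the degree-2 homogeneous polynomial kernel
corresponds to an explicit three-dimensional feature map, so the feature-space dot
product can be evaluated directly in the input space:
$$
\Phi(x) = \bigl(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\bigr)
\quad\Longrightarrow\quad
\langle \Phi(x), \Phi(y) \rangle = (x_1 y_1 + x_2 y_2)^2 = \langle x, y\rangle^2 = k(x, y).
$$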




                Figure 4: Example of mapping from input space to feature space.

3.2. Why Does SVM Work Well for Text Categorization?
SVM is known to be very effective for TC for the following reasons:
• High-dimensional input spaces: In TC the dimension of the input space is normally
    very large, which is very challenging for other machine learning techniques. SVM,
    however, does not depend directly on the number of features and thus has the
    potential to handle large feature spaces.
• Few irrelevant features: In TC there are very few irrelevant features, so using feature
    selection, as other machine learning approaches do to reduce the number of features,
    tends to decrease classifier accuracy. In contrast, SVM can handle very large feature
    spaces, so feature selection is only optional.
• Document vectors are sparse: In TC each document vector has only a few non-zero
    entries (i.e., it is sparse), and there is considerable theoretical and empirical evidence
    that SVM is well suited to problems with dense concepts and sparse instances such
    as TC.
• Most TC problems are linearly separable: In practice, many TC problems are known
    to be linearly separable, and SVM is well suited to such tasks.

3.3. Using SVM in our application
The SVM classifier of our application is implemented on top of the LIBSVM library. The
library provides C-support vector classification (C-SVC), ν-support vector classification
(ν-SVC), and ν-support vector regression (ν-SVR), incorporates many efficiency features
such as caching, chunking, and sequential minimal optimization, and performs well on
moderate-sized problems (up to tens of thousands of training data points). In our
application we use only C-SVC with the RBF kernel
$$ K(x, y) = e^{-\gamma \|x - y\|^2}. $$

To train our SVM classifier we use a collection of 7,566 Korean scientific web documents
that were already categorized into 16 broad categories by domain experts. After parsing
and pre-processing, the total number of terms is 189,815, meaning that each document is
represented by a 189,815-dimensional sparse vector, a dimensionality considered
extremely difficult for other machine learning techniques such as neural networks or
decision trees. Aside from the training set, a separate test set of 490 web documents, also
already classified into the same 16 categories, is used to evaluate the classifier.

To choose the best parameters (i.e., C and γ) for the classifier, a grid search over pairs
(C, γ) should be performed. However, since a full grid search is very time-consuming, we
fix γ = 1/k (where k is the number of training documents) and try only various values of
C. A recommended range of values for C is 2^0, 2^1, 2^2, …, 2^10, which is known to be
good enough in practice. The classification accuracies obtained over the training and test
data for various values of C are shown in Table 1 below.

Table 1: Classification accuracy over training and testing data for various values of C parameter.
               C                   Accuracy over training data        Accuracy over testing data
                1                           59.91%                            44.90%
                2                           69.53%                            53.06%
                4                           78.09%                            57.35%
                8                           85.47%                            61.22%
               16                           90.92%                            64.69%
               32                           94.14%                            68.57%
               64                           96.52%                            71.22%
              128                           97.82%                            71.02%
              256                           98.23%                            71.02%
              512                           98.49%                            72.45%
             1024                           98.61%                            72.45%

As Table 1 shows, the classifier performs most accurately with C=1024, so we choose
C=1024 for our classifier.
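The parameter search described above can be reproduced roughly with the sketch below,
which assumes LIBSVM's Python interface (svmutil) and placeholder file names for the
training and test sets in LIBSVM's sparse format. The project itself drove LIBSVM from
its own code, so this is an illustration rather than the actual setup.

```python
# Depending on the LIBSVM installation, the import may be
# "from libsvm.svmutil import ..." instead.
from svmutil import svm_read_problem, svm_train, svm_predict

y_train, x_train = svm_read_problem("train.svm")   # sparse "label idx:val ..." lines
y_test,  x_test  = svm_read_problem("test.svm")

gamma = 1.0 / len(y_train)                         # gamma fixed as 1/k, k = #training docs
best_c, best_acc = None, 0.0
for log2c in range(0, 11):                         # C = 2^0 ... 2^10
    c = 2 ** log2c
    # -s 0: C-SVC, -t 2: RBF kernel, -q: quiet mode
    model = svm_train(y_train, x_train, f"-s 0 -t 2 -c {c} -g {gamma} -q")
    _, (acc, _, _), _ = svm_predict(y_test, x_test, model)
    if acc > best_acc:
        best_c, best_acc = c, acc

print(f"best C = {best_c}, test accuracy = {best_acc:.2f}%")
```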

4. Application Implementation
 This section describes how we implemented our application and its functions. We used a
simple web-browser-style program to demonstrate the functionality of our modules.

4.1 Basic Structure
      Basically, our program consists of a client side (a web browser) and a server side
(the processing module).




                           Figure 5: Basic concept of the program
The client part is implemented as a simple web browser. While surfing, if the user wants
  to add the current page to his or her favorites list, it can be added simply by pressing the
  ‘Add Favorite’ button at the top of the program. After showing the HTML code of the
  current page, the program sends that code to the web server. The web server processes it
  into a form suitable for SVM (by parsing and converting) and sends the result back to the
  client program. After receiving that data (a preprocessed array of numbers suitable for
  the SVM module), the client categorizes the page into one of the 16 pre-defined
  categories using the SVM module. Finally, the client writes a .URL file for the page (a
  common format that can be used directly in the Explorer favorites list, as sketched
  below) into the directory of the chosen category.
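For illustration, writing the favorite entry amounts to creating a standard Windows
Internet Shortcut (.url) file inside the category's folder, roughly as in the sketch below;
the function name and directory layout are our own placeholders.

```python
import os

def save_favorite(base_dir: str, category: str, name: str, url: str) -> str:
    """Write a Windows Internet Shortcut (.url) file into a category folder (sketch).

    The .url content is the standard INI-style format used by Internet Explorer's
    Favorites; the per-category folders mirror the layout described in the text.
    """
    folder = os.path.join(base_dir, category)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, (name or "temporary") + ".url")
    with open(path, "w", encoding="utf-8") as f:
        f.write("[InternetShortcut]\n")
        f.write(f"URL={url}\n")
    return path

print(save_favorite(r"C:\Favorites", "Bio", "혈액형", "http://example.com/article"))
```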
     We focus on the science domain in particular because we cannot cover every subject
  of news on the Internet. Pages are categorized into the 16 categories listed below.

  "Sub","Spa","Phy","Mea","Mat","Geo","Eng","Ene","Ecl","Ear","Che","Bio","Ast","Agr","Aer","Aco"
   Sub : Time travel     Spa : Space Phy : Physics Mea: Measure Mat : Mathematics
   Geo : Geology         Eng : Engineering Ene :Energy Ecl : Ecology      Ear: Earth
   Che : chemistry       Bio: Biology      Ast :Astronology Agr : agriculture
   Aer : Aero            Aco : Acoustic


4.2 How to use this program?
     The following steps show how to use the program.




                          Figure 6: Main window of Client Program

  The client program has an interface similar to Internet Explorer. If you want to add the
current page to the favorites list and have it categorized automatically, press the
‘Add favorite’ button (Figure 6).
Figure 7: Source Code View

   After pressing the button, a dialog box (Figure 7) shows the HTML code of the current
web page. Because it contains many unnecessary parts such as tags, this code is sent to
the preprocessing server.
   You can also type a specific name for the page in the ‘바로가기’ (shortcut) box; if you
leave it empty, the page will be added to its category as ‘temporary.url’.




                                    Figure 8: Result Dialog

   After a few tens of seconds, a dialog box (Figure 8) shows the result of processing: the
category (in this example the page is categorized under Bio, as shown by the directory
above) and the name of the favorite file (혈액형.url).
You can use this result (the base directory and its categorized subdirectories) by simply
copying and pasting it into your favorites list. If you want to avoid that effort, press the
‘Setup’ button and change the program's base directory for the favorites list (Figure 9),
as shown below.




                                   Figure 9: Setup Dialog




                               Figure 10: After changing the base directory

  If you set the program's base directory to Explorer's favorites directory, the result can
be used directly, as shown above (Figure 10).
5. Conclusion and Further Work
   We tested the feasibility of applying a learning algorithm to web-browser
categorization using simple modules. In particular, SVM is used to classify HTML
documents because SVM is known to work well for text and document classification; we
chose this classifier to sort the HTML obtained from the browser into the user's favorites
lists.
   In the current structure the client program has too much work to do, such as waiting
for the preprocessing result from the server and running the SVM module. These
functions were originally planned to be implemented on the web server. However,
because of some problems (possibly differences in protocol usage or in how data such as
floating-point values is handled), we could not move all of the processing onto the server.
If this problem is solved, we believe the client module could handle these parts with no
perceptible delay (Figure 11).
   If the server performed all of the heavy computation and simply sent the result to the
client, there would be no delay from these processes on the client side. Moreover, this
would make the system easier to manage, fix, and update, since only the SVM model on
the server would need to be replaced.


                  Figure 11: Ideal structure of the server-client relationship
                             (Internet, Preprocess, SVM, Client)

   Moreover, this structure could easily be extended to other applications such as
spam-mail filtering, e-mail categorization, and patent classification.
