CSCI 6505 Project: Construct a Search Engine Using a Machine Learning Approach

CSCI 6505 Course Project Report

Student Name: Yuan An (Email: yuana@cs.dal.ca)
Student Name: Suihong Liang (Email: abacus@cs.dal.ca)
Date: 4 Dec 2000
Topic: Construct a topic-based search engine using a machine learning approach (instance-based learning method) for a given website.
1. DESCRIPTION OF PROBLEM:

Many organizations and individuals have their own websites for posting information to the public. As the number of files on a website grows, it becomes convenient to index all the files for search purposes. Many commercial search engines, such as Yahoo! and Google, provide the ability to search for related web pages across all websites around the world. It is also useful for a single website to provide search over its own pages. The simplest technique for indexing HTML files is to count the number of times given keywords occur in an HTML file, or to look at the HEAD part of the HTML file for related information. We instead use a machine learning approach to classify an HTML file into a topic; specifically, we use an instance-based learning algorithm, k-nearest neighbors, to do this work. Our original intention was to classify an HTML file related to computer science into its most related topic, and then to index the HTML files of a given website to provide topic-based search for that website. Because of the limited time for this project, we did not use the classifier to index any specific website. For the experiment, we downloaded test HTML files related to the topics 'artificial intelligence', 'programming language', 'operating system', 'database', 'graphics', and 'software engineering', and used them as training and test data. We extracted a vocabulary of 2,901 words related to computer science from an online technology dictionary (http://www.oasismanagement.com/TECHNOLOGY/GLOSSARY/index.html). We then built a classifier using an instance-based learning algorithm. The details are discussed in the following sections.

2. CORE IDEA:

2.1 Instance-based learning approach:

Instance-based learning methods such as k-nearest neighbor are straightforward approaches to approximating real-valued or discrete-valued target functions.
Learning in this algorithm consists simply of storing the presented training data. When a new query instance is encountered, a set of similar instances is retrieved from memory and used to classify the new query instance. Since the Weka package implements almost all common machine learning algorithms in Java, including k-nearest neighbors, we use this package to implement our project.

2.2 Representation of HTML files:

Since this project is for studying machine learning, we do not focus on document representation. We use the simplest method to represent an instance, i.e., an HTML file: we define a vector of words as the attributes of instances. The vector is extracted from an online technology dictionary and contains 2,901 words related to the computer science domain. We defined several categories of HTML files: 'artificial intelligence', 'programming language', 'operating system', 'database', 'graphics', and 'software engineering'. First, we collect a set of training data to train the classifier. Then, we use the trained classifier to index all HTML files of a given website.
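The word-count representation described above can be sketched in plain Java. This is only an illustration under assumed names (BagOfWords and makeVector are not the project's classes; the real system builds Weka instances over the 2,901-word vocabulary rather than raw arrays):

```java
import java.util.Arrays;

// Illustrative sketch: each HTML file (already stripped to plain text)
// becomes a vector of word counts over a fixed keyword vocabulary.
public class BagOfWords {

    // Count how often each vocabulary word occurs in the text.
    public static int[] makeVector(String text, String[] vocabulary) {
        String[] tokens = text.toLowerCase().split("\\W+");
        int[] counts = new int[vocabulary.length];
        for (int i = 0; i < vocabulary.length; i++) {
            for (String token : tokens) {
                if (token.equals(vocabulary[i])) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] vocab = {"database", "query", "graphics"};
        int[] v = makeVector("A database stores data; a query reads the database.", vocab);
        System.out.println(Arrays.toString(v)); // [2, 1, 0]
    }
}
```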
2.3 Classification:

In our implementation, there is a keyword vector containing 2,901 words related to computer science. When a new HTML file comes in, our system first transfers it into a text file by discarding all HTML tags and comments as well as the HEAD part of the file. The system then turns the result into an instance by calling the makeInstance() method. Finally, the trained classifier is called to classify the new instance using the k-nearest neighbors algorithm.

2.4 Indexing:

There is a crawler in our system that crawls the directory tree for a given website's URL. Whenever the crawler encounters an HTML file along its path, it calls the trained classifier to classify the file into the corresponding category. The crawler writes the pair of classification label and URL of the file into a TreeMap; we chose a TreeMap with later ranking in mind. After crawling, the crawler writes the TreeMap into a text file for user searches.

3. MAIN COMPONENTS AND INTERACTIVE DIAGRAM:

The project consists of the following components:
(1) A command-line utility for indexing all HTML files into various topics for a given home directory of a website; it crawls all subdirectories of the given home directory automatically.
(2) Server-side CGI or Java servlets for replying to the user's query.
(3) A user interface displayed in the browser.

3.1 Description of modules:

1. HTML file classifier: This module is used to train a classifier from scratch, update the classifier with more training data, and classify new documents. The function for transferring an HTML file into a text file is also in this module. The Weka package is imported here, and its implementation of the k-nearest neighbors algorithm and other helper utilities are used.

2. Crawler or indexer: This module crawls the directory tree of a given website to index all HTML files residing in the website. The crawler is a command-line utility used by the webmaster after updating the website.
The crawler takes the home URL as its starting point and loads the trained classifier; it then crawls all subdirectories using a breadth-first search strategy. Whenever it encounters a new HTML file, it classifies the file into the corresponding category and stores the pair of label and address in a map. After crawling, it writes the map into a text file for user searches.

3. Server-side searcher: This module replies with search results to users who submit a query. Since all HTML files have been indexed and the index information has been written to a text file, the server-side searcher just searches the index file, finds the matched records, and replies to the user. Several server-side techniques are available for this, such as CGI and servlets.
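The searcher's record-matching step can be illustrated with a small sketch. The one-record-per-line, label-then-URL index format and the names SearcherSketch and searchByTopic are assumptions for illustration; the report does not specify the actual on-disk format:

```java
import java.util.*;

// Illustrative sketch of the server-side searcher: the index written
// by the crawler is assumed to hold one "label<TAB>url" record per
// line; a topic query scans the records and collects matching URLs.
public class SearcherSketch {

    public static List<String> searchByTopic(List<String> indexRecords, String topic) {
        List<String> hits = new ArrayList<>();
        for (String record : indexRecords) {
            String[] parts = record.split("\t", 2);
            if (parts.length == 2 && parts[0].equals(topic)) {
                hits.add(parts[1]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList(
            "database\t/docs/sql.html",
            "graphics\t/docs/opengl.html",
            "database\t/docs/btree.html");
        System.out.println(searchByTopic(index, "database"));
        // [/docs/sql.html, /docs/btree.html]
    }
}
```

An empty result list corresponds to the 'no pages found' reply described in Section 5.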
3.2 Interactive diagram:

(The original diagram shows the three modules below, together with the classifier stored on disk, the indexed information file on disk, and the user's browser; its content is reproduced as lists.)

HTML classifier module:
1. Build the classifier.
2. Update the classifier.
3. Classify new documents.
4. Transfer files.
The classifier is stored on disk.

Crawler:
1. Load the classifier from disk.
2. Crawl along the directory tree.
3. Classify encountered files.
The indexed information is written to a file on disk.

Searcher:
1. Accept the user's query.
2. Search the indexed file.
3. Reply with results.
Results are displayed in the user's browser.
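The classifier module's "classify new document" step uses k-nearest neighbors (Section 2.3). A self-contained sketch of the majority vote, with invented names (KnnSketch, classify) and plain arrays standing in for Weka's instance types:

```java
import java.util.*;

// Illustrative k-nearest-neighbors sketch (the project itself uses
// Weka's implementation): training instances are simply stored, and a
// new instance takes the majority label among the k stored instances
// closest to it in Euclidean distance.
public class KnnSketch {

    static double distance(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // train: stored word-count vectors; labels: their topic labels.
    public static String classify(int[][] train, String[] labels, int[] query, int k) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort training indices by distance to the query.
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(train[i], query)));
        // Majority vote among the k nearest neighbors.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) {
            votes.merge(labels[order[i]], 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        int[][] train = {{5, 0}, {4, 1}, {0, 5}, {1, 4}};
        String[] labels = {"database", "database", "graphics", "graphics"};
        System.out.println(classify(train, labels, new int[]{4, 0}, 3)); // database
    }
}
```

With k = 3 and word-count vectors, the query is assigned the label held by the majority of its three closest training vectors.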
4. IMPLEMENTATION:

In this section, we list all Java classes used in this project.

The following classes are used to transfer an HTML file into a text file:

1. public interface HTMLContent.
2. public class HTMLContentList: extends ArrayList.
3. public class HTMLTag: stores a name and an optional attribute list.
4. public class HTMLText: stores the text of an HTML file.
5. public class HTMLToken: stores tokens of an HTML file.
6. public class HTMLTokenizer: parses the HTML file into tokens.
7. public class HTMLTokenList: extends ArrayList.
8. public class Parser: takes an HTMLTokenList as input and converts it into an HTMLContentList.
9. public class HTMLAttribute: stores an attribute of an HTML tag.
10. public class HTMLAttributeList: extends ArrayList to store all attributes of a tag.

The following classes are used for indexing, classifying, and searching:

11. public class HTMLIndex: extends HashMap, implementing two methods: (1) addString(), which takes class label and title/filename arguments and creates a mapping between each label and the respective file; (2) writeFile(), which streams the index content to a file.
12. public class HTMLIndexer: a command-line utility that traverses the directories from a given root path.
13. public class HTMLClassifier: the k-nearest neighbors classifier, implementing these methods: (1) HTMLClassifier(), the constructor, which builds the classifier from scratch or loads it from a file; (2) updateModel(), which trains the classifier on training data; (3) classifyMessage(), which classifies a new instance; (4) makeInstance(), which makes a new instance; (5) htmlToText(), which transfers an HTML file into a text file.
14. public interface Searcher: a search engine that returns the matched records in the index file.
15. public class HTMLSearch: implements the interface Searcher.
16. public class SearchServlet: wraps the Searcher with an appropriate interface to handle a POST request with a string argument named 'search'. The result is returned on the output stream.
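The HTML-to-text conversion performed by HTMLTokenizer/Parser and HTMLClassifier.htmlToText() amounts to discarding the HEAD part, comments, and tags. A crude regular-expression stand-in (the real project uses a proper tokenizer; this sketch ignores edge cases such as '>' inside attribute values, and the name HtmlToTextSketch is invented):

```java
// Illustrative approximation of the HTML-to-text step: strip the HEAD
// part, then comments, then all remaining tags, and collapse whitespace.
public class HtmlToTextSketch {

    public static String htmlToText(String html) {
        return html
            .replaceAll("(?is)<head.*?</head>", " ")  // drop the HEAD part
            .replaceAll("(?s)<!--.*?-->", " ")        // drop comments
            .replaceAll("(?s)<[^>]*>", " ")           // drop remaining tags
            .replaceAll("\\s+", " ")
            .trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>T</title></head>"
                    + "<body><!-- note --><p>Hello <b>world</b></p></body></html>";
        System.out.println(htmlToText(html)); // Hello world
    }
}
```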
5. SAMPLE RESULTS:

Our implementation is a combination of a keyword search engine and a topic-specific search engine. The prototype was tested only on our own website. The user interface is a text field and two submit buttons (see Figure 1): one button labeled 'keyword' and another labeled 'topic'.

Figure 1

When a user wants to search for relevant documents by keyword, he just types the keywords in the text field and clicks the button bearing the 'keyword' label (see Figure 1). If any documents match the keywords, the matched documents' names are returned as hyperlinks, along with the number of keyword hits in the corresponding documents (see Figure 2). If no document matches, the result is 'no pages found'.
Figure 2

When a user wants to search for relevant documents by topic, he just types the topic in the text field and clicks the button bearing the 'topic' label (see Figure 1). This implementation only accepts searches for the following topics: 'artificial intelligence', 'programming language', 'operating system', 'database', 'graphics', 'software engineering'. If any documents match the label, the matched documents' names are returned as hyperlinks, along with the number of keyword hits in the corresponding documents (see Figure 3). If no document matches, the result is 'no pages found'.

Figure 3

6. DISCUSSION:

In this project, we implemented a topic-based search engine for a given website. The key point of such a search engine is training a classifier to classify HTML files into their corresponding categories. We used the k-nearest neighbors algorithm, as implemented in the WEKA machine learning package, to train a classifier for files related to computer science. Since this is a course project focused on machine learning, the document representation and the collection of training data are kept simple. Several open problems could be addressed further:

(1) The k-nearest neighbors algorithm needs to store the training data somewhere. When a new instance comes in, the k nearest neighbors are retrieved and compared to
decide the classification of the new instance. Obviously, if there is a lot of training data, this is not efficient. In the future, we may develop a more efficient classifier using a more efficient machine learning algorithm.

(2) We represent documents using a vector extracted from an online technology dictionary. This is a fairly simple representation and can be improved.

(3) Our implementation has no ability to rank the relevant pages for topic search. In keyword search, we simply count the keyword hits in each relevant page, but for topic search we did not come up with any ranking strategy. Such a ranking strategy is, however, desirable in a search engine.
