THE UNIVERSITY OF THE GAMBIA
SENIOR PROJECT
WEB CRAWLER
DOCUMENTATION
Written by:
Seedy Ahmed Jallow 2121210
Salieu Sallah 2112465
Landing Jatta 2121750
Table of Contents
INTRODUCTION
DESCRIPTION
THEORETICAL BACKGROUND
DOM PARSER
    Using A DOM Parser
SOFTWARE ANALYSIS
    Problem Definition
    Functional Requirement
    Non-Functional Requirements
    Target User
    Requirement Specification
    Acceptance Criteria
    System Assumption
    Relationship Description
    Structure of the website
SOFTWARE DESIGN
    System Development Environment
        System Development Languages
    Classes
        Main Class
        Web Crawler Class
SOFTWARE TESTING
BIBLIOGRAPHY AND REFERENCES
INTRODUCTION
This is an implementation of a web crawler using the Java programming language. The
project is implemented fully from scratch, using a DOM parser to parse our XML files.
The project takes a fully built XML website and recursively visits all the pages that are
present in the website, searching for links, saving them in a hash table and later printing
the links. In other words, the web crawler fetches data from the already built XML site.
Starting with an initial URL, which is not limited to the index page of the website, it
crawls through all the pages of the website recursively. This approach also makes it
possible to traverse the hierarchy and generate DOM events instead of outputting an
XML document directly, so that different content handlers can be plugged in to do
different things or to generate different versions of the XML.
The Internet has become a basic necessity, and without it life would be very difficult.
With the help of the Internet, a person can get a huge amount of information on any
topic. A person uses a search engine to get information about a topic of interest: the
user simply enters a keyword, or sometimes a phrase, in the text field of a search engine
to get the related information. The links to different web pages appear in the form of a
ranked list, generated by the necessary processing in the system. This ranking is
essentially due to the indexing done inside the system in order to show the user relevant
results containing the exact information sought. The user clicks on a relevant link from
the ranked list of web pages and navigates through the respective pages.
Similarly, there is sometimes a need to get the text of a web page using a parser, and for
this purpose many HTML parsers are available to extract the data in the form of text.
Once the tags are removed from a web page, some processing of the text is needed in
order to index the words and obtain relevant results about the words and the set of data
present in that web page.
DESCRIPTION
A Web crawler is an Internet bot which systematically browses the World Wide Web,
typically for the purpose of Web indexing. A Web crawler may also be called a Web
spider, an ant, an automatic indexer, or (in a software context) a Web scutter.
Web search engines and some other sites use Web crawling or spidering software to
update their web content or their indexes of other sites' web content. Web crawlers can
copy all the pages they visit for later processing by a search engine, which indexes the
downloaded pages so that users can search much more efficiently.
Crawlers can validate hyperlinks and HTML/XML code. They can also be used for
web scraping.
Web crawlers are a key component of web search engines, where they are used to collect
the pages that are to be indexed. Crawlers have many applications beyond general
search, for example in web data mining (e.g. Attributor, a service that mines the web for
copyright violations, or ShopWiki, a price comparison service).
THEORETICAL BACKGROUND
Web crawlers are almost as old as the web itself. In the spring of 1993, just months after
the release of NCSA Mosaic, Matthew Gray wrote the first web crawler,
the World Wide Web Wanderer, which was used from 1993 to 1996 to compile statistics
about the growth of the web. A year later, David Eichmann wrote the
first research paper containing a short description of a web crawler, the RBSE spider.
Burner provided the first detailed description of the architecture of a web
crawler, namely the original Internet Archive crawler. Brin and Page’s seminal paper on
the (early) architecture of the Google search engine contained a brief description of the
Google crawler, which used a central database for coordinating the crawling.
Conceptually, the algorithm executed by a web crawler is extremely simple: select a
URL from a set of candidates, download the associated web pages, extract the URLs
(hyperlinks) contained therein, and add those URLs that have not been encountered
before to the candidate set. Indeed, it is quite possible to implement a simple functioning
web crawler in a few lines of a high-level scripting language such as Perl. However,
building a web-scale web crawler imposes major engineering challenges, all of which
are ultimately related to scale. In order to maintain a search engine corpus of say, ten
billion web pages, in a reasonable state of freshness, say with pages being refreshed
every 4 weeks on average, the crawler must download over 4,000 pages/second. In order
to achieve this, the crawler must be distributed over multiple computers, and each
crawling machine must pursue multiple downloads in parallel. But if a distributed and
highly parallel web crawler were to issue many concurrent requests to a single web
server, it would in all likelihood overload and crash that web server. Therefore, web
crawlers need to implement politeness policies that rate-limit the amount of traffic
directed to any particular web server (possibly informed by that server’s observed
responsiveness). There are many possible politeness policies; one that is particularly
easy to implement is to disallow concurrent requests to the same web server; a slightly
more sophisticated policy would be to wait for time proportional to the last download
time before contacting a given web server again. In some web crawler designs (e.g. the
original Google crawler and PolyBot) the page downloading processes are distributed,
while the major data structures – the set of discovered URLs and the set of URLs that
have to be downloaded – are maintained by a single machine. This design is
conceptually simple, but it does not scale indefinitely; eventually the central data
structures become a bottleneck. The alternative is to partition the major data structures
over the crawling machines.
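To make the conceptual algorithm concrete, a minimal crawl loop in Java might look like the sketch below. This is an illustration of the general technique, not this project's code; fetch() and extractLinks() are hypothetical placeholders for the download and link-extraction steps.

import java.net.URL;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawlLoop {
    public static void crawl(URL seed) {
        Set<URL> seen = new HashSet<URL>();          // URLs encountered before
        Queue<URL> frontier = new ArrayDeque<URL>(); // the candidate set
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            URL url = frontier.poll();                  // select a URL
            String page = fetch(url);                   // download the associated page
            for (URL link : extractLinks(page, url)) {  // extract the hyperlinks
                if (seen.add(link)) {                   // keep only URLs not seen before
                    frontier.add(link);
                }
            }
        }
    }

    // Hypothetical placeholders for the download and link-extraction steps.
    private static String fetch(URL url) { return ""; }
    private static List<URL> extractLinks(String page, URL base) { return Collections.emptyList(); }
}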
This program starts by creating hash tables of Strings to store the attributes and the
hyperlinks:
static Hashtable<String, String> openList = new Hashtable<String, String>();
static Hashtable<String, String> extList = new Hashtable<String, String>();
static Hashtable<String, String> closeList = new Hashtable<String, String>();
A HASHTABLE is a data structure used to implement an associative array, a structure
that can map keys to values. A hash table uses a hash function to compute an index into
an array of buckets or slots, from which the correct value can be found. In the context of
this web crawler, it is used to map our key (the anchor element, a) to our value (href).
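As a small illustration (assuming java.util.Hashtable has been imported, as in the code below), values can be stored, looked up, and tested for membership in roughly constant time; the keys and values shown here are only illustrative:

Hashtable<String, String> openList = new Hashtable<String, String>();
openList.put("file:///site/feathers.html", "feathers.html");      // key -> value
if (!openList.containsKey("file:///site/identification.html")) {
    // this key has not been stored yet, so the page can still be crawled
}
String value = openList.get("file:///site/feathers.html");        // constant-time lookup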
After importing all the necessary packages, we then parse the XML files into the DOM. The
Document Object Model (DOM) is a programming interface for HTML, XML and SVG
documents. It provides a structured representation of the document (a tree) and it defines
a way that the structure can be accessed from programs so that they can change the
document structure, style and content. The DOM provides a representation of the
document as a structured group of nodes and objects that have properties and methods.
Nodes can also have event handlers attached to them, and once that event is triggered the
event handlers get executed. Essentially, it connects web pages to scripts or
programming languages.
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Hashtable;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
public static void parsePage(URL url) {
    String xmlPath = url.getFile();   // absolute file path of the page to parse
    File xmlFile = new File(xmlPath); // file object handed to the DOM parser
    String page = null;
    // ... the rest of the method parses the file into a DOM document, as described below
}
The Document Object Model (DOM) is a set of language-independent interfaces for
programmatic access to the logical XML document. We use the Java DOM interfaces,
which correspond to the DOM Level 1 interfaces specified by the W3C; IBM’s XML4J
parser supports the latest version of the specification soon after it becomes available.
As we have learned, the structure of a well-formed XML document can be expressed
logically as a tree. The single interface that encapsulates the structural connections
between the XML constructs is called Node. The Node interface contains member
functions that express structural connections, such as Node#getChildNodes(),
Node#getNextSibling(), and Node#getParentNode().
The DOM interfaces also contain separate interfaces for XML’s high-level constructs
such as Element. Each of these interfaces extends Node. For example, there are
interfaces for Element, Attribute, Comment, Text, and so on. Each of these provides
getter and setter functions for its own specific data. For example, the Attribute interface
has Attribute#getName() and other Attribute member functions, and the Element
interface has the means to get and set attributes via functions like
Element#getAttributeNode(java.lang.String) and Element#setAttributeNode(Attribute).
Always remember that the various high-level interfaces such as Element, Attribute,
Text, Comment, and so on, all extend Node. This means the structural member functions
of Node (such as getNodeName()) are available to Element, Attribute, and all the other
node types. An illustration of this is that any node, such as an Element or Text, knows
what it is by re-implementing getNodeType(). This allows the programmer to query the
type using Node#getNodeType() instead of Java’s more expensive run-time type check,
instanceof.
So, in Java you can write a simple recursive function to traverse a DOM tree:
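A minimal sketch of such a traversal (illustrative, not this project's exact code) prints the name of every element node it visits:

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DomWalker {
    // Recursively visit a node and all of its children.
    public static void traverse(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            System.out.println(node.getNodeName());   // do something with the element
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            traverse(children.item(i));               // recurse into each child node
        }
    }
}

The traversal is typically started from the root element, for example traverse(document.getDocumentElement()).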
The root of the DOM tree is the Document interface. We have waited until now to
introduce it because it serves multiple purposes. First, it represents the whole document
and contains the methods by which you can get to the global document information and
the root Element.
Second, it serves as a general constructor or factory for all XML types, providing
methods to create the various constructs of an XML document. If an XML parser gives
you a DOM Document reference, you may still invoke the create methods to build more
DOM nodes, and use appendChild() and other functions to add them to the document
node or to other nodes. If the client programmer changes, adds, or removes nodes from
the DOM tree, there is no DOM requirement to check validity. This burden is left to the
programmer (with possible help from the specific DOM or parser implementation).
The final step, processing the parsed content, is the complicated one. Once you know the
contents of the XML document, you might want to, for example, generate a Web page,
create a purchase order, or build a pie chart. Considering the infinite range of data that
could be contained in an XML document, the task of writing an application that correctly
processes any potential input is intimidating. Fortunately, the common XML parsing
tools discussed here can make the task much, much simpler.
DOM PARSER
The XML Parser for Java provides a way for your applications to work
with XML data on the Web. The XML Parser provides classes for
parsing, generating, manipulating, and validating XML documents. You
can include the XML Parser in Business-to-Business (B2B) and other
applications that manage XML documents, work with metacontent,
interface with databases, and exchange messages and data. The XML
Parser is written entirely in Java, and conforms to the XML 1.0
Recommendation and associated standards, such as Document Object
Model (DOM) 1.0, Simple API for XML (SAX) 1.0, and the XML
Namespaces Recommendation.
DOM implementations
The Document Object Model is an application programmer’s interface
to XML data. XML parsers produce a DOM representation of the parsed
XML. Your application uses the methods defined by the DOM to access
and manipulate the parsed XML. The IBM XML Parser provides two
DOM implementations:
– Standard DOM: provides the standard DOM Level 1 API, and is highly
tuned for performance
– TX Compatibility DOM: provides a large number of features not
provided by the standard DOM API, and is not tuned for performance.
You choose the DOM implementation you need for your application when you write
your code. You cannot, however, use both DOMs in the XML Parser at the same time.
In the XML Parser, the DOM API is implemented using the SAX API.
Modular design
The XML Parser has a modular architecture. This means that you can customize the
XML Parser in a variety of different ways, including the following:
– Construct different types of parsers using the classes provided, including:
  – Validating and non-validating SAX parser
  – Validating and non-validating DOM parser
  – Validating and non-validating TXDOM parser
  To see all the classes for the XML Parser, look in the W3C for Java IDE for the W3C
  XML Parser for Java project and the org.w3c.xml.parsers package.
– Specify two catalog file formats: the SGML Open catalog and the XCatalog format.
– Replace the DTD-based validator with a validator based on some other method, such
  as the Document Content Description (DCD), Schema for Object-Oriented XML (SOX),
  or Document Definition Markup Language (DDML) proposals under consideration by
  the World Wide Web Consortium (W3C).
Constructing a parser with only the features your application needs reduces the number
of class files or the size of the JAR file you need. For more information about
constructing the XML Parser, refer to the related tasks in the parser documentation.
Constructing a parser
You construct a parser by instantiating one of the classes in the
com.ibm.xml.parsers package. You can instantiate the classes in one
of the following ways:
– Using a parser factory
– Explicitly instantiating a parser class
– Extending a parser class
For more information about constructing a parser, refer to the related tasks in the parser
documentation.
Samples
We provide the following sample programs in the IBM XML Parser for
Java Examples project. The sample programs demonstrate the
features of the XML Parser using the SAX and DOM APIs:
– SAXWriter and DOMWriter: parse a file, and print out the file in XML
format.
– SAXCount and DOMCount: parse your input file, and output the total
parse time along with counts of elements, attributes, text characters,
and white space characters you can ignore. SAXCount and DOMCount
also display any errors or warnings that occurred during the parse.
– DOMFilter: searches for specific elements in your XML document.
– TreeViewer: displays the input XML file in a graphical tree-style
interface. It also
highlights lines that have validation errors or are not well-formed.
Creating a DOM parser
You can construct a parser in your application in one of the following ways:
– Using a parser factory
– Explicitly instantiating a parser class
– Extending a parser class
To create a DOM parser, use one of the methods listed above, and specify
com.ibm.xml.parsers.DOMParser to get a validating parser, or
com.ibm.xml.parsers.NonValidatingDOMParser to get a non-validating parser. To access
the DOM tree, your application can call the getDocument() method on the parser.
For more information about constructing a parser, refer to the related tasks in the parser documentation.
Using A DOM Parser
import com.ibm.xml.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

// Constructing a parser by instantiating a parser object,
// in this case from DOMParser.
public class example2 {
    static public void main( String[] argv ) {
        String xmlFile = "file:///xml_document_to_parse";
        DOMParser parser = new DOMParser();
        try {
            parser.parse(xmlFile);
        } catch (SAXException se) {
            se.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
        // The next lines are only for DOM parsers
        Document doc = ((DOMParser) parser).getDocument();
        if ( doc != null ) {
            try {
                // use the print method from dom.DOMWriter
                (new dom.DOMWriter( false )).print( doc );
            } catch ( UnsupportedEncodingException ex ) {
                ex.printStackTrace();
            }
        }
    }
}
SOFTWARE ANALYSIS
Problem Definition
For our senior project, we were asked to write a search engine program that lists all the
pages present in a particular off-line website, as well as all the external links that are
reachable from one of the internal pages.
Search engines consist of many features such as web crawling, word extraction,
indexing, ranking, searching, search querying, etc. In this project we concentrate on
crawling through the website, indexing the pages, and outputting them as well as the
external links that are reachable through one of the internal pages.
Functional Requirement
Functional requirements describe the functional modules to be produced by the
proposed system. The only functional module for this system is web crawling. The
crawler takes the index page of the website as input. It then scans through all the
elements on the page, extracting the hyperlink references to other pages and storing
them in a list to be scanned through later. The crawler scans through the pages
recursively, storing all scanned pages in a hash table to make sure that circular
references are handled.
Non-Functional Requirements
This program isn't meant to be an end-user program, so very little emphasis is placed on
the user interface. As a result, no user interface was developed; input and output are
through the terminal. It is also worth noting that this is not a professional program, so
issues such as product security are not considered.
Target User
Aside from the project instructor and supervisor, the target users of this program are
members of the general programming community who want to see a very basic
implementation of a search engine. They are allowed to use, reuse, and share our code
as long as we are credited for it.
Requirement Specification
The logical model is a data flow diagram overview showing the processes required for
the proposed system. Details of the processes are explained in the physical design
below.
Process descriptions:
Input: The index page of the website that is to be crawled is input by the user.
Create URL: Creates a URL from the path of a file.
Parse Page: Creates document builders to break the structure of the page down into a
tree of nodes, and traverses the nodes to collect hyperlink references to other pages.
Save Links: Stores the hyperlink references in a list and provides links to the crawler.
Internal Links: Gets all the URLs whose references are internal pages of the website, in
other words those with the "file" protocol.
External Links: Gets all the URLs that reference pages external to the website, in other
words those with the "http" protocol.
Save in Table: Stores all the links in their respective hash tables.
Html Page: Checks whether the URL references a valid HTML page and not an image,
a port, etc.
Print: Outputs the URLs.
Acceptance Criteria
On the day of completion of the project, all the features explained above will be
provided, mainly web crawling.
System Assumption
The crawler performs URL search only; all processes of the web crawler are made to
process URL information only. It does not handle other kinds of search, such as image
search, and the results for other inputs are unexpected, though the use of anything other
than URLs will not lead to system errors. It also assumes that the user is well versed in
command-line input, output and the other command-line operations necessary to run
the program.
Relationship Description
Each page has many links, so the relationship between the pages and links are one to
many.
Structure of the website
Federn is the name of the website to be crawled. The website contains information on
feathers: there are hundreds of feathers whose description and identification are given in
this website. The website is available in three languages: German, English and French.
Each page of the website has a link tab at the top of the page. That tab contains links to
the home page, feathers, identification, news, bibliography, services and help.
The home page of Federn contains the description of the idea behind the website, the
authors, the concept behind this project and the acknowledgement of the contribution of
others in the development of the website. As you can see it contains a lot of links
referring to other pages. All the links though are internal links.
The feathers page contains all the feathers that were identified and described in this
website. The list of feathers is arranged in two formats: according to their genus and
family names on one side of the page, and in alphabetical order on the other side. Each
feather name is a link to the page containing the description of the feather and scanned
images.
The identification page contains an image of the feather with a picture of a bird which
had that type of feather. It also contains detailed descriptions of the different types,
colors and shapes of that feather and the main function of the feather in flight and
temperature regulation.
There is a news tab that contains any new information found on the feathers or any
discovery made about feathers.
The bibliography contains links and resources from which the information on this
website was gathered. It also contains service and help pages.
As you can see, this website is a huge one: each page, aside from the main index page,
has three copies of itself in three different languages.
SOFTWARE DESIGN
System Development Environment
System Development Languages
The only language used in the development of this program is Java. Java is a highly
dynamic language, and it contains most of the functionality that was needed in the
development of the program.
The Java IO package allowed me to utilize the File class, which I used to create file
objects that are fed to the parser to parse the pages of the website. I created the files by
using the absolute file paths extracted from the URLs.
The Java net package contains classes which were used to create URLs from file paths.
A URL can be created, as in my case, by passing the absolute file path of the parent
page and the name of the page being processed. The use of URLs in my program is very
crucial, bearing in mind that I needed to check the referenced locations of the URLs
being processed to make sure that they refer to pages that are local to the website being
crawled, or in other words are stored in the file system. You can check the protocol of a
URL by using the getProtocol() method. If it returns "file", then the page being
referenced by the URL is local to the file system. If it returns "http", then the URL is
referring to a page outside the website being crawled.
Getting the base URI:
Creating a URL:
Checking the protocol of a URL:
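A sketch of these three steps is shown below. The names are illustrative: doc is assumed to be the parsed DOM document and href the value of an href attribute taken from the page.

import java.net.MalformedURLException;
import java.net.URL;
import org.w3c.dom.Document;

public class UrlHelpers {
    public static URL resolve(Document doc, String href) throws MalformedURLException {
        // Getting the base URI of the page currently being crawled
        String baseUri = doc.getBaseURI();

        // Creating a URL for the referenced page, relative to its parent
        URL base = new URL(baseUri);
        URL link = new URL(base, href);

        // Checking the protocol of the URL
        if (link.getProtocol().equals("file")) {
            // the referenced page is local to the website being crawled
        } else if (link.getProtocol().equals("http")) {
            // the referenced page is external to the website
        }
        return link;
    }
}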
The Java util package enables us to use a structure called a hash table. Hash tables are
used to store data objects, in this case URLs. I created two instances of the Hashtable
class: one to store URLs on the website being crawled that refer to pages internal to the
website, and one to store URLs that refer to pages outside it. Though there are other
storage options that could be used, such as MySQL, array lists and arrays, the hash table
was chosen because, unlike MySQL, it is very simple to implement and use, and unlike
array lists, it is faster at storing, searching and retrieving data, which is very important
considering that thousands of URLs can be stored and searched through over and over
again.
Creating hash tables to store internal and external links
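A sketch of the declarations, mirroring those shown in the theoretical background section:

// Hash tables mapping link keys to values for the two kinds of links.
static Hashtable<String, String> openList = new Hashtable<String, String>(); // internal pages
static Hashtable<String, String> extList  = new Hashtable<String, String>(); // external links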
The Java library also has a very important package, which is by far the most important
tool used in my program: the XML parsers package. This package contains the
document builder factory, which is used to create document builders containing the
parsers we are going to use to break down the pages. The parser parses the content of
the file it is fed as an XML document and returns a new DOM Document object. This
package also contains methods which validate the XML documents and verify whether
the documents are well formed.
Getting the document builder factory and document builder, and parsing an XML
document into a DOM Document object:
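A sketch of that sequence, assuming the File object has already been created from the URL's path:

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParseSketch {
    // Parse an XML file into a DOM Document object.
    public static Document parse(File xmlFile) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder(); // create the document builder
        Document document = builder.parse(xmlFile);             // parse the file into a DOM tree
        document.getDocumentElement().normalize();              // normalize before traversal
        return document;
    }
}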
It is very important that the parser does not validate the XML pages, because validation
would require Internet access and, as you already know, the program is crawling
off-line web pages. If it tries, it will lead to system errors.
Turning off the validating features of the parser:
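One way to do this with the standard DocumentBuilderFactory is sketched below. The feature URIs are Xerces-specific assumptions; setFeature() throws a ParserConfigurationException if the underlying parser does not recognise them.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

public class NonValidatingFactory {
    // Turn off validation and external DTD loading so the parser never
    // tries to reach the network while parsing off-line pages.
    public static DocumentBuilderFactory create() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        factory.setNamespaceAware(false);
        factory.setFeature("http://xml.org/sax/features/validation", false);
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        return factory;
    }
}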
The external org.w3c.dom package is used to create documents which store the parsed
page content. The document contains elements in a tree-like structure, with each
element corresponding to a node on the tree. Traversing the tree with any appropriate
traversal method, all the nodes corresponding to a-element (anchor) tags are collected
and stored in a list of nodes. Looping through that list of nodes, one is able to extract all
the a-element tags containing hyperlink references.
Classes
In my implementation of the program, I used only two classes. The first class contains
the main method, while the second class contains the main implementation of the web
crawler.
Main Class
The main class contains the main method. The main method contains the prompt for the
user to enter the absolute file path of the index page of the website to be crawled. When
the user complies, the path is converted to a url object and is stored in the hash table
containing internal links. The main method also contains the first call of the recursive
method processPage(URL url).
Structure of the main method:
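A sketch of the main method is shown below. The class and field names (WebCrawler, openList, processPage) are illustrative and may differ from the project's exact identifiers.

import java.net.URL;
import java.util.Scanner;

public class Main {
    public static void main(String[] args) throws Exception {
        // Prompt the user for the absolute file path of the index page.
        Scanner in = new Scanner(System.in);
        System.out.print("Enter the absolute path of the index page: ");
        String path = in.nextLine().trim();

        // Convert the path to a URL object and record it as an internal link.
        URL index = new URL("file://" + path);
        WebCrawler.openList.put(index.toString(), index.toString());

        // First call of the recursive crawl.
        WebCrawler.processPage(index);
    }
}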
Web Crawler Class
This class contains 80% of the implementation. It has only one method definition: that
of the recursive method processPage(). At the beginning of the class, the hash tables are
declared, followed by the definition of the processPage() method.
The processPage() method takes only one parameter, the URL object that is passed in.
Inside the method, the absolute path of the URL is extracted and a file object is created
from it. The document builder object is created from the declaration and initialization
of the document builder factory and document builder in the preceding lines of code.
The method also contains the code snippet making sure that the parser does not validate
the XML pages. The parser is then called to parse the XML document, and the DOM
document is normalized. Thereafter, the root element of the document is extracted and
the traversal of the nodes of the document begins. All a-element tags are selected and
stored in a list of nodes. The nodes are then looped through and, for all the a-element
tags containing the "href" attribute, the values of the attributes are extracted and a URL
is created for the page that the href is referencing. As explained before, the URL is
created from the base URL of the parent file of the page being referenced and the page
name of that file.
The protocol of the URL is then checked. If it is "file", the program proceeds to check
whether the URL refers to an actual page and not to an image, a port, etc. It then makes
sure that the URL is not already stored in the hash table containing links to internal
pages of the website. If it is already stored in the hash table, the link is discarded and
the next link on the node list is processed. If it is not stored in the hash table, the URL
is stored and that page is processed for more URLs.
If the protocol is tested and it returns "http", the program proceeds to check whether the
URL is already stored in the hash table containing links to external pages. If it is, the
URL is discarded and the next link on the list is processed. If not, the URL is stored in
the table and then printed to the screen.
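Putting these steps together, a simplified sketch of the crawler class and its processPage() method might look like the following. Details such as the page-extension check and the hash-table keys are illustrative, not the project's exact code.

import java.io.File;
import java.net.URL;
import java.util.Hashtable;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class WebCrawler {
    static Hashtable<String, String> openList = new Hashtable<String, String>(); // internal pages
    static Hashtable<String, String> extList  = new Hashtable<String, String>(); // external links

    public static void processPage(URL url) throws Exception {
        // Build a non-validating parser and parse the page into a DOM document.
        File xmlFile = new File(url.getFile());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(xmlFile);
        doc.getDocumentElement().normalize();

        // Collect every a-element and examine its href attribute.
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (!a.hasAttribute("href")) continue;

            // Resolve the reference against the base URI of the current page.
            URL link = new URL(new URL(doc.getBaseURI()), a.getAttribute("href"));
            String key = link.toString();

            if (link.getProtocol().equals("file")) {
                // Internal page: skip non-page resources and already-seen links,
                // otherwise record it, print it, and crawl it recursively.
                if ((key.endsWith(".html") || key.endsWith(".xml")) && !openList.containsKey(key)) {
                    openList.put(key, key);
                    System.out.println("internal: " + key);
                    processPage(link);
                }
            } else if (link.getProtocol().equals("http")) {
                // External link: record and print it once, but do not crawl it.
                if (!extList.containsKey(key)) {
                    extList.put(key, key);
                    System.out.println("external: " + key);
                }
            }
        }
    }
}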
SOFTWARE TESTING
During the testing of the program, many problems were encountered. One of the first
problems we had during the initial tests was with the validation performed by the
parser. It is standard that all XML documents are checked to see whether they are well
formed and valid.
The website we are crawling, as you already know, is off-line, and if the parser tries to
validate it, errors occur because it needs to connect to the Internet to perform the
checks.
To solve this, we set all features of the document builder factory that could trigger
validation of the XML documents to false, as shown earlier.
Another problem we encountered during the implementation of the program was how to
get the absolute paths corresponding to the relative paths of the pages we found on each
page we had already crawled. All that the crawler returned was the names of the files
referenced from the page being crawled. What we later did was get the base URI of the
file being crawled, which returns its absolute file path, and attach to it the names of the
pages found in that file. That way we were able to create a URL for each of the links
and process them.
Aside from the problems mentioned above, the program passed the final tests without
any major bugs, bringing us successfully to the end of the implementation of the
program. Although it was not an easy ride, it was worth every bit of effort we invested
in it. Below is an outline of compiling and running the program; the results are the links
found on the website being crawled. No graphical interface was developed, so the
terminal is used for all input and output.
Command to compile the Program:
Running the Program:
Prompt and input of Index page:
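The exact commands depend on the source file names; assuming the two classes are saved as Main.java and WebCrawler.java, compiling and running might look like this:

javac Main.java WebCrawler.java
java Main
Enter the absolute path of the index page: /path/to/federn/index.html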
The program ran smoothly and proceeded to print out all the links found on the website,
labelling them internal or external depending on where they reference and on the
protocol they contain.
BIBLIOGRAPHY AND REFERENCES
[HREF1] What is a "Web Crawler"? ( http://research.compaq.com/SRC/mercator/faq.html )
[HREF2] Inverted index ( http://burks.brighton.ac.uk/burks/foldoc/86/59.htm )
[MARC] Marckini, Fredrick. Secrets to Making Your Internet Web Pages Achieve Top
Rankings (ResponseDirect.com, Inc., c1999)
http://en.wikipedia.org/wiki/Web_crawler
http://research.microsoft.com/pubs/102936/eds-webcrawlerarchitecture.pdf