Jan Zizka (Eds) : CCSIT, SIPP, AISC, PDCTA - 2013
pp. 509–516, 2013. © CS & IT-CSCP 2013 DOI : 10.5121/csit.2013.3655
A SEMANTIC BASED APPROACH FOR
INFORMATION RETRIEVAL FROM HTML
DOCUMENTS USING WRAPPER
INDUCTION TECHNIQUE
A.M. Abirami1, Dr. A. Askarunisa2, T.M. Aishwarya1 and K.S. Eswari1

1 Department of Information Technology,
Thiagarajar College of Engineering, Madurai, Tamil Nadu, India.
abiramiam@tce.edu, {aish2292,eswainnov}@gmail.com

2 Department of Computer Science and Engineering,
Vickram College of Engineering, Enathi, Tamil Nadu, India.
nishanazer@yahoo.com
ABSTRACT
Most internet applications are built using web technologies such as HTML. Web pages are
designed either to display data records from underlying databases or to display text in an
unstructured format, but usually following some fixed template. Summarizing data that is
dispersed across different web pages is tedious and consumes considerable time and manual
effort. A supervised learning technique called Wrapper Induction can be used across web pages
to learn data extraction rules. Applying these learnt rules to web pages makes information
extraction an easier process. This paper focuses on developing a tool for information extraction
from such unstructured data. The use of semantic web technologies greatly simplifies the
process. The tool enables us to query data scattered over multiple web pages in flexible ways.
This is accomplished by the following steps – extracting the data from multiple web pages,
storing it in the form of RDF triples, integrating multiple RDF files using an ontology,
generating a SPARQL query based on the user query, and generating a report in the form of
tables or charts from the results of the SPARQL query. The relationships between related web
pages are identified using the ontology and used to query in better ways, thus enhancing the
searching efficacy.
KEYWORDS
Information Retrieval, Ontology, Structured Information Extraction, RDF, SPARQL, Semantic
Web.
1. INTRODUCTION
Today's web content is designed mainly for human consumption; it is not machine-understandable
or machine-readable. Retrieving information from web pages is therefore a complex task, and it is
time consuming when one needs to access many HTML pages to collect data and summarize them.
The semantic web technologies [16] overcome these difficulties to a greater extent. A core data
representation format for semantic web is Resource Description Framework (RDF). RDF is a data
model for web pages. RDF is a framework for representing information about resources in a
graph form. It was primarily intended for representing metadata about WWW resources, such as
the title, author, and modification date of a Web page, but it can be used for storing any other
data. It is based on subject-predicate-object triples that form a graph of data [18].
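To make the triple model concrete, the following is a minimal sketch using the Apache Jena API (the Jena toolkit is cited later in Section 2.3). The page URI and the Dublin Core title/creator properties are illustrative choices, not data from the paper; older Jena releases used the com.hp.hpl.jena package names instead.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.DC;

    public class TripleExample {
        public static void main(String[] args) {
            // An in-memory RDF graph (model)
            Model model = ModelFactory.createDefaultModel();

            // Subject: a web page; predicates: Dublin Core title and creator
            Resource page = model.createResource("http://www.example.org/page.html");
            page.addProperty(DC.title, "Faculty Profile");      // illustrative values
            page.addProperty(DC.creator, "A. M. Abirami");

            // Serialize the subject-predicate-object triples as RDF/XML
            model.write(System.out, "RDF/XML");
        }
    }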
All data in the semantic web use RDF as the primary representation language [9]. RDF Schema
(RDFS) can be used to describe taxonomies of classes and properties and use them to create
lightweight ontologies. Ontologies describe the conceptualization, the structure of the domain,
which includes the domain model with possible restrictions [11]. More detailed ontologies can be
created with Web Ontology Language (OWL). It is syntactically embedded into RDF, so like
RDFS, it provides additional standardized vocabulary. For querying RDF data as well as RDFS
and OWL ontologies with knowledge bases, the SPARQL Protocol and RDF Query Language
(SPARQL) is available. SPARQL is an SQL-like language, but it uses RDF triples and resources
both for matching part of the query and for returning results [10].
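As a rough illustration of such a lightweight ontology, the sketch below defines one class and two datatype properties with Jena's ontology API; the namespace and the names JournalPublication, title and year are assumptions made for illustration only.

    import org.apache.jena.ontology.DatatypeProperty;
    import org.apache.jena.ontology.OntClass;
    import org.apache.jena.ontology.OntModel;
    import org.apache.jena.rdf.model.ModelFactory;

    public class LightweightOntology {
        static final String NS = "http://www.example.org/pub#";   // illustrative namespace

        public static void main(String[] args) {
            // An OWL ontology model (current Apache Jena package names)
            OntModel ont = ModelFactory.createOntologyModel();

            // One class and two datatype properties form a lightweight ontology
            OntClass journal = ont.createClass(NS + "JournalPublication");
            DatatypeProperty title = ont.createDatatypeProperty(NS + "title");
            DatatypeProperty year  = ont.createDatatypeProperty(NS + "year");
            title.addDomain(journal);
            year.addDomain(journal);

            ont.write(System.out, "RDF/XML");   // serialize the vocabulary
        }
    }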
Ujjal Marjit et al. [1] presented a linked data approach to discover resume information by
providing a standard data model, data sharing and data reusing. Arup Sarkar et al. [2] provided an
approach to publish the data from legacy databases as linked data on the web, thus enabling
machine readability. David Camacho et al. [3] extracted information from HTML documents and
added semantics through XML to generate rules using Web Mantic with the help of user
interaction. Urvi Shah et al. [4] provided a framework for retrieval of documents containing both
free text and semantically enriched markup using the DAML+OIL semantic web language. Olaf
Hartig et al. [5] developed an approach to execute SPARQL queries over the web of linked data to
discover relevant data for query answering during query execution itself, using RDF links between
data sources. Peter Haase et al. [6] assumed a globally shared domain ontology, which can not
only be used for subsequently classifying the bibliographic metadata, but also for supporting an
improved query refinement process. Abirami et al. [7] proposed a semantic based approach for
generating reports from HTML pages by converting those HTML pages into pre-processed and
formatted CSV files, generating OWL files for the domain, separating contents from the CSV files
based on the OWL files and rules to generate RDF files, and querying the RDF files using SPARQL.
This paper proposes a model for storing the content of HTML pages into RDF format using the
rules generated by the wrapper induction technique. For example, the university web site may
contain the details of all the staff members of all the departments, mostly in HTML pages. Every
year, this data is collected and consolidated largely by hand. Though the content is readily
available in HTML pages, manual effort is wasted while consolidating the report. The
proposed model generates the report from different HTML documents using RDF and SPARQL.
Only the relevant information is extracted and stored in RDF format, which makes further
querying easier. This paper is organized as follows: Section 2 describes the methodology,
Section 3 presents the experimental results and Section 4 gives the conclusion.
2. METHODOLOGY
The proposed model includes five different processes namely – (i) extracting structured data from
multiple HTML pages (ii) generating RDF files with the structured data extracted (iii) creating
ontology for specific domain relationship (iv) generating corresponding SPARQL queries for the
user query and (v) generating reports according to the user needs. Figure 1 shows the various
processes in the implementation, which are dealt with in the subsequent sections.
Figure. 1. Model for Information Retrieval
2.1 Pre-Processing of HTML pages
It is important to start from clean HTML pages, because a typical web page contains a large
amount of noise, which can adversely affect data extraction accuracy. The pre-processing phase is
carried out before the actual extraction starts, as depicted in Figure 2. The raw HTML is fed into
HTML Tidy [13], which checks for the presence of a close tag for every open tag. The resultant
well-formed HTML is fed into the noise removal phase, whose output is a clean HTML page.
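The paper's own tool chain is not published; the sketch below shows only the tag-balancing step, assuming JTidy (a Java port of HTML Tidy [13]) and placeholder file names. The noise removal phase is separate and is not shown.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.w3c.tidy.Tidy;

    public class CleanHtml {
        public static void main(String[] args) throws Exception {
            // JTidy balances open/close tags and emits well-formed markup
            Tidy tidy = new Tidy();
            tidy.setXHTML(true);          // produce well-formed XHTML
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);

            try (FileInputStream in = new FileInputStream("faculty.html");
                 FileOutputStream out = new FileOutputStream("faculty_clean.html")) {
                tidy.parse(in, out);      // cleaned page is written to the output stream
            }
        }
    }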
2.2 Structured Data Extraction & RDF Generation
Once the pre-processing phase is completed, the clean HTML pages are fed into the structured data
extraction phase. This is accomplished using the Wrapper Induction Technique [17], which is
a supervised, semi-automatic learning approach. In this approach, a set of extraction rules
is learned from a collection of manually labeled pages. The rules are then employed to extract
target data items from other similarly formatted pages. The process is pictorially represented in
Figure 3.
Most HTML tags work in pairs. Each pair consists of an open tag and a close tag indicated by < >
and </> respectively. Within each corresponding tag-pair, there can be other pairs of tags,
resulting in nested structures. Thus, HTML tags can naturally encode nested data. Notable points
are: (a) there are no designated tags for each type as HTML was not designed as a data encoding
language and any HTML tag can be used for any type (b) for a tuple type, values (data items) of
different attributes are usually encoded differently to distinguish them and to highlight important
items. In case of any conflicts, they are identified and corrected by replacing or adding rules.
These two processes go on iteratively. Once the resultant structured data agrees with the human
oracle, the extracted structured data is sent to the RDF generation phase.
Figure. 2. Pre-processing of web pages
Figure. 3. Structured data extraction

The extraction is done using a tree structure called the EC tree (embedded catalog tree) [17],
which models the data embedding in an HTML page. The EC tree used in this paper is given in
Figure 4. The wrapper [18] uses a tree structure based on this to facilitate extraction rule
learning and data extraction. Given the EC tree and the rules, any node can be extracted by
following the tree path P from the root to the node, extracting each node in P from its parent.

The extraction rules are based on the idea of landmarks. Each landmark is a sequence of
consecutive tokens and is used to locate the beginning or the end of a target item.

Some of the rules identified by the tool are:
- An HTML page containing the word "journal" shows that the faculty member has published journals.
- The pattern says that the very first list ("<ul> </ul>") is the Journal list.
- In a list item (<li> </li>) within the journal list, the contents before the bold tag (<b>) are the
  author names.
- The content within the bold tag is the title of the paper published.
- The content after the bold tag is the journal name.
- Within the journal name, the four consecutive digits within the range 19** to 20** give the
  year of journal publication.
- Some of the author names lie within the bold tag, so they get wrongly included in the journal
  name during extraction. After careful assessment, an "<auth> </auth>" tag is inserted
  between individual author names. During extraction, the presence of the "auth" tag within the
  bold tag is examined to extract the author names and title precisely.

Figure. 4. Embedded catalog tree

Those rules were used to extract the structured data from the HTML pages of the same format. The
output is checked against the human oracle. In case of conflict, the conflicting items are
identified and corrected by replacing or adding rules. These two processes go iteratively. Once
the resultant structured data agrees with the human oracle, the extracted structured data is
sent to the RDF generation phase. With the default format of the RDF file and the structured data
extracted, the RDF file is generated for its corresponding HTML file.
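The induced rules and landmarks are learned from labeled pages and are not given in executable form in the paper; the sketch below merely mirrors the bold-tag landmarks described above using plain Java regular expressions, with a made-up list item as input.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class JournalItemRule {
        // Landmarks: "<b>" and "</b>" split a <li> item into authors / title / journal
        private static final Pattern ITEM =
                Pattern.compile("<li>(.*?)<b>(.*?)</b>(.*?)</li>", Pattern.DOTALL);
        private static final Pattern YEAR = Pattern.compile("(19|20)\\d{2}");

        public static void main(String[] args) {
            String li = "<li>A. Author, B. Author <b>A Sample Title</b> "
                      + "Journal of Examples, 2012.</li>";          // made-up example
            Matcher m = ITEM.matcher(li);
            if (m.find()) {
                String authors = m.group(1).trim();   // content before the bold tag
                String title   = m.group(2).trim();   // content within the bold tag
                String journal = m.group(3).trim();   // content after the bold tag
                Matcher y = YEAR.matcher(journal);
                String year = y.find() ? y.group() : "";  // 19**-20** within the journal name
                System.out.println(authors + " | " + title + " | " + journal + " | " + year);
            }
        }
    }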
2.3 OWL Creation
Domain-specific vocabularies and relationships are established in the ontology creation step
using the Protégé tool [8]. The relationships between the RDF files are established by the .owl
file [12]. The ontology developed in this step is very useful during report generation for
correctly identifying the RDF file to search. The SPARQL query [9] is analyzed and the report is
generated in the required format using the Jena APIs [11]. This is further explained in the
subsequent section.
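A minimal sketch of the integration step, assuming placeholder file names for the Protégé-built ontology and the generated per-page RDF files, is to read them all into a single Jena model so that one query can span several departments:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class IntegrateRdf {
        public static void main(String[] args) {
            // One model unions the domain ontology with the per-page RDF files
            Model model = ModelFactory.createDefaultModel();
            model.read("file:publications.owl");   // ontology created with Protege (placeholder name)
            model.read("file:cse_faculty1.rdf");   // RDF generated from HTML pages (placeholder names)
            model.read("file:ece_faculty2.rdf");

            System.out.println("Triples loaded: " + model.size());
        }
    }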
3. EXPERIMENTAL RESULTS
The proposed model is evaluated on the college web site (www.tce.edu) to generate a report on the
journal publications of the staff members of various departments for a particular time period.
HTML pages containing journal details are pre-processed. Some faculty members may not have
updated their details in their HTML pages; such pages are rejected before the pre-processing phase.
Table 1. Experimental data considered

Dept     # HTML Pages    # Pre-processed Pages
CSE      28              11
ECE      32              26
EEE      30              14
MECH     34              18
CIVIL    23              14
IT       20              11
3.1 HTML to RDF conversion
Each cleaned HTML page is converted into RDF [9] as shown in Figure 5.
Figure. 5. RDF for HTML page
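Figure 5 itself is not reproduced in this text version. As a sketch of the generation step, assuming the illustrative pub: vocabulary and placeholder values rather than the paper's actual schema, one extracted journal entry could be written to an RDF file with Jena as follows:

    import java.io.FileOutputStream;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;

    public class JournalRdfWriter {
        static final String NS = "http://www.example.org/pub#";  // illustrative namespace

        public static void main(String[] args) throws Exception {
            Model model = ModelFactory.createDefaultModel();
            Property author  = model.createProperty(NS, "author");
            Property title   = model.createProperty(NS, "title");
            Property journal = model.createProperty(NS, "journal");
            Property year    = model.createProperty(NS, "year");

            // One extracted journal entry becomes a set of triples (placeholder values)
            Resource entry = model.createResource(NS + "entry1");
            entry.addProperty(author, "A. Author")
                 .addProperty(title, "A Sample Title")
                 .addProperty(journal, "Journal of Examples")
                 .addProperty(year, "2012");

            try (FileOutputStream out = new FileOutputStream("faculty1.rdf")) {
                model.write(out, "RDF/XML");   // one RDF file per cleaned HTML page
            }
        }
    }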
3.2 Evaluation Metrics
We considered precision as a measure [17] before and after rule refinement. Precision is
calculated as the ratio of the correctly retrieved information to the total information. The
precision value is measured as the number of items correctly extracted under the right RDF tag
across the 94 HTML pages. Table 2 gives the precision values for the various tags.
Table 2. Precision for RDF conversion

Extraction of    Before Rule Refinement    After Rule Refinement
Author           51% (48)                  89% (84)
Title            75% (71)                  94% (89)
Journal          100% (94)                 100% (94)
Year             68% (64)                  88% (83)
The lower precision in extracting author names is due to improper alignment of the author names,
which cannot be corrected in the pre-processing phase. Precision is increased by rule refinement
for the following problems: many consecutive commas, misplaced commas and missing commas. The
following corrective actions are also taken: (i) within the journal name, the four consecutive
digits within the range 19** to 20** are taken as the year; (ii) some of the author names lie
within the bold tag and are wrongly placed, so an <auth> tag is added between the author names.
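For reference, the percentages in Table 2 are consistent with reading the counts in parentheses as pages correctly extracted out of the 94 processed pages:

    Precision(tag) = (pages with the item correctly extracted under the right RDF tag)
                     / (total pages processed)

For example, for the Author tag before rule refinement, 48 / 94 is approximately 51%.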
3.3 Query Results
For instance, a general query [15] for retrieving all details is given below.
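The query itself appears as a figure in the original and is not reproduced here. A minimal sketch of such a SELECT query, assuming the illustrative pub: vocabulary from the earlier sketches and executed with Jena, might look like this:

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.query.ResultSetFormatter;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class RetrieveAllDetails {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            model.read("file:cse_faculty1.rdf");   // placeholder RDF file name

            // "Retrieve all details" over the assumed journal vocabulary
            String q =
                "PREFIX pub: <http://www.example.org/pub#> " +
                "SELECT ?author ?title ?journal ?year WHERE { " +
                "  ?entry pub:author ?author ; pub:title ?title ; " +
                "         pub:journal ?journal ; pub:year ?year . }";

            Query query = QueryFactory.create(q);
            try (QueryExecution exec = QueryExecutionFactory.create(query, model)) {
                ResultSet results = exec.execSelect();
                ResultSetFormatter.out(System.out, results, query);
            }
        }
    }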
3.4 Report Generation
One way of presenting the output is through charts. The tool uses JFreeChart [14] and provides an
easy query interface for various factors, which are depicted in the series of Figures 6-9.
Figure. 6. Number of journals published by each department in the range of years.
Figure. 7. Number of journals published by a given author in a given range of years.
Figure. 8. Number of journals published by each department in a given year.
Figure. 9. Number of journals published by a given department in the range of years.
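The charting step is not shown in code in the paper; a minimal sketch with JFreeChart (1.0.x class names), using placeholder counts in place of the SPARQL results, could be:

    import java.io.File;
    import org.jfree.chart.ChartFactory;
    import org.jfree.chart.ChartUtilities;
    import org.jfree.chart.JFreeChart;
    import org.jfree.chart.plot.PlotOrientation;
    import org.jfree.data.category.DefaultCategoryDataset;

    public class PublicationChart {
        public static void main(String[] args) throws Exception {
            // Counts per department would normally come from the SPARQL result set;
            // the numbers below are placeholders.
            DefaultCategoryDataset dataset = new DefaultCategoryDataset();
            dataset.addValue(11, "Journals", "CSE");
            dataset.addValue(26, "Journals", "ECE");
            dataset.addValue(14, "Journals", "EEE");

            JFreeChart chart = ChartFactory.createBarChart(
                    "Journals published per department", "Department", "Count",
                    dataset, PlotOrientation.VERTICAL, false, true, false);

            ChartUtilities.saveChartAsPNG(new File("journals.png"), chart, 640, 480);
        }
    }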
The time taken by the manual effort includes the time to traverse the links to access the required
HTML document, scan the page for the required details, identify and enter the details in a
separate location (e.g. a spreadsheet) and generate the report. These steps take time in the order
of minutes, and in the worst case hours, depending on the person doing the job. The toolkit, in
contrast, enables the repository to be updated on a regular basis. Once the details are updated,
the required information can be obtained and reports can be generated in seconds. The time taken
by the tool includes generating the SPARQL query, identifying the required RDF files from the
ontology, extracting details from the identified RDF files and generating the report. The time
taken to generate the reports is shown in Table 3.
Table 3. Report generation time

Report No.    Manual Effort (in min)    Toolkit's time (in sec)
R1            1416                      4
R2            184                       3
R3            670                       4
R4            14                        2
4. CONCLUSION
We developed a toolkit that retrieves data from the web using RDF and ontology, with well-refined
rules to extract the data. Quick and useful inferences can be made through the easy query building
and report generation process. As a future enhancement, inter-department relationships can be
modelled using multiple ontologies to bring out inter-college relationships.
REFERENCES
[1] Ujjal Marjit, Kumar Sharma and Utpal Biswas, "Discovering Resume Information using Linked
    Data", International Journal of Web & Semantic Technology (IJWesT), Vol. 3, No. 2, April 2012.
[2] Arup Sarkar, Ujjal Marjit and Utpal Biswas, "Linked Data Generation for the University Data
    from Legacy Database", International Journal of Web & Semantic Technology (IJWesT), Vol. 2,
    No. 3, July 2011.
[3] David Camacho and Maria D. R-Moreno, "Web Data Extraction using Semantic Generators", VSP
    International Science Publishers, 2006.
[4] Urvi Shah, Tim Finin, Anupam Joshi, R. Scott Cost and James Mayfield, "Information Retrieval
    on the Semantic Web".
[5] Olaf Hartig, Christian Bizer and Johann-Christoph Freytag, "Executing SPARQL Queries over the
    Web of Linked Data".
[6] Peter Haase, Nenad Stojanovic, York Sure and Johanna Volker, "Personalized Information
    Retrieval in Bibster, a Semantics-Based Bibliographic Peer-to-Peer System".
[7] A.M. Abirami and Dr. A. Askarunisa, "A Proposal for the Semantic based Report Generation of
    Related HTML Documents", International Journal of Software Engineering and Technology,
    Sept 2011.
[8] http://protege.stanford.edu/
[9] http://www.w3schools.com/rdf/rdf_example.asp
[10] http://www.w3.org/TR/rdf-sparql-query/
[11] http://jena.sourceforge.net/tutorial/RDF_API/
[12] http://www.obitko.com/tutorials/ontologies-semanticweb/ontologies.htm
[13] http://tidy.sourceforge.net/
[14] http://www.jfree.org/jfreechart/
[15] http://code.google.com/
[16] http://www.ai.sri.com/~muslea/PS/jaamas-2k.pdf
[17] Bing Liu, "Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data".
[18] Grigoris Antoniou and Frank van Harmelen, "A Semantic Web Primer".

More Related Content

What's hot

F0362036045
F0362036045F0362036045
F0362036045theijes
 
Paper id 25201463
Paper id 25201463Paper id 25201463
Paper id 25201463IJRAT
 
IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET Journal
 
Jarrar: Data Fusion using RDF
Jarrar: Data Fusion using RDFJarrar: Data Fusion using RDF
Jarrar: Data Fusion using RDFMustafa Jarrar
 
Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014
Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014
Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014Robert Meusel
 
Automatically converting tabular data to
Automatically converting tabular data toAutomatically converting tabular data to
Automatically converting tabular data toIJwest
 
SURVEY ON MONGODB: AN OPEN-SOURCE DOCUMENT DATABASE
SURVEY ON MONGODB: AN OPEN-SOURCE DOCUMENT DATABASESURVEY ON MONGODB: AN OPEN-SOURCE DOCUMENT DATABASE
SURVEY ON MONGODB: AN OPEN-SOURCE DOCUMENT DATABASEIAEME Publication
 
Annotations chicago
Annotations chicagoAnnotations chicago
Annotations chicagoTimothy Cole
 
A category theoretic model of rdf ontology
A category theoretic model of rdf ontologyA category theoretic model of rdf ontology
A category theoretic model of rdf ontologyIJwest
 
Using linguistic analysis to translate
Using linguistic analysis to translateUsing linguistic analysis to translate
Using linguistic analysis to translateIJwest
 
14. Files - Data Structures using C++ by Varsha Patil
14. Files - Data Structures using C++ by Varsha Patil14. Files - Data Structures using C++ by Varsha Patil
14. Files - Data Structures using C++ by Varsha Patilwidespreadpromotion
 

What's hot (18)

Infos2014
Infos2014Infos2014
Infos2014
 
F0362036045
F0362036045F0362036045
F0362036045
 
Paper id 25201463
Paper id 25201463Paper id 25201463
Paper id 25201463
 
IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction Framework
 
Going for GOLD - Adventures in Open Linked Metadata
Going for GOLD - Adventures in Open Linked MetadataGoing for GOLD - Adventures in Open Linked Metadata
Going for GOLD - Adventures in Open Linked Metadata
 
Dspace OAI-PMH
Dspace OAI-PMHDspace OAI-PMH
Dspace OAI-PMH
 
RDF and Java
RDF and JavaRDF and Java
RDF and Java
 
Jarrar: Data Fusion using RDF
Jarrar: Data Fusion using RDFJarrar: Data Fusion using RDF
Jarrar: Data Fusion using RDF
 
Web Spa
Web SpaWeb Spa
Web Spa
 
Extract, transform and load architecture for metadata collection
Extract, transform and load architecture for metadata collectionExtract, transform and load architecture for metadata collection
Extract, transform and load architecture for metadata collection
 
Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014
Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014
Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014
 
Semantic Web Nature
Semantic Web NatureSemantic Web Nature
Semantic Web Nature
 
Automatically converting tabular data to
Automatically converting tabular data toAutomatically converting tabular data to
Automatically converting tabular data to
 
SURVEY ON MONGODB: AN OPEN-SOURCE DOCUMENT DATABASE
SURVEY ON MONGODB: AN OPEN-SOURCE DOCUMENT DATABASESURVEY ON MONGODB: AN OPEN-SOURCE DOCUMENT DATABASE
SURVEY ON MONGODB: AN OPEN-SOURCE DOCUMENT DATABASE
 
Annotations chicago
Annotations chicagoAnnotations chicago
Annotations chicago
 
A category theoretic model of rdf ontology
A category theoretic model of rdf ontologyA category theoretic model of rdf ontology
A category theoretic model of rdf ontology
 
Using linguistic analysis to translate
Using linguistic analysis to translateUsing linguistic analysis to translate
Using linguistic analysis to translate
 
14. Files - Data Structures using C++ by Varsha Patil
14. Files - Data Structures using C++ by Varsha Patil14. Files - Data Structures using C++ by Varsha Patil
14. Files - Data Structures using C++ by Varsha Patil
 

Similar to A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING WRAPPER INDUCTION TECHNIQUE

A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...cscpconf
 
A semantic based approach for knowledge discovery and acquistion from multipl...
A semantic based approach for knowledge discovery and acquistion from multipl...A semantic based approach for knowledge discovery and acquistion from multipl...
A semantic based approach for knowledge discovery and acquistion from multipl...csandit
 
Vision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsVision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsIJMER
 
A Novel Data Extraction and Alignment Method for Web Databases
A Novel Data Extraction and Alignment Method for Web DatabasesA Novel Data Extraction and Alignment Method for Web Databases
A Novel Data Extraction and Alignment Method for Web DatabasesIJMER
 
Extraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web EngineeringExtraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web EngineeringIRJET Journal
 
The Data Records Extraction from Web Pages
The Data Records Extraction from Web PagesThe Data Records Extraction from Web Pages
The Data Records Extraction from Web Pagesijtsrd
 
Semi Automatic to Improve Ontology Mapping Process in Semantic Web Data Analysis
Semi Automatic to Improve Ontology Mapping Process in Semantic Web Data AnalysisSemi Automatic to Improve Ontology Mapping Process in Semantic Web Data Analysis
Semi Automatic to Improve Ontology Mapping Process in Semantic Web Data AnalysisIRJET Journal
 
How to Find a Needle in the Haystack
How to Find a Needle in the HaystackHow to Find a Needle in the Haystack
How to Find a Needle in the HaystackAdrian Stevenson
 
Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...IRJET Journal
 
Semantic Knowledge Acquisition of Information for Syntactic web
Semantic Knowledge Acquisition of Information for Syntactic web Semantic Knowledge Acquisition of Information for Syntactic web
Semantic Knowledge Acquisition of Information for Syntactic web dannyijwest
 
IRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description FrameworkIRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description FrameworkIRJET Journal
 
2008 Industry Standards for C2 CDM and Framework
2008 Industry Standards for C2 CDM and Framework2008 Industry Standards for C2 CDM and Framework
2008 Industry Standards for C2 CDM and FrameworkBob Marcus
 
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...Computer Science Journals
 
Annotation for query result records based on domain specific ontology
Annotation for query result records based on domain specific ontologyAnnotation for query result records based on domain specific ontology
Annotation for query result records based on domain specific ontologyijnlc
 
Combining and easing the access of the eswc semantic web data 0
Combining and easing the access of the eswc semantic web data 0Combining and easing the access of the eswc semantic web data 0
Combining and easing the access of the eswc semantic web data 0STIinnsbruck
 
Search Engine Scrapper
Search Engine ScrapperSearch Engine Scrapper
Search Engine ScrapperIRJET Journal
 
Annotating Search Results from Web Databases
Annotating Search Results from Web Databases Annotating Search Results from Web Databases
Annotating Search Results from Web Databases Mohit Sngg
 
Web Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features ConceptWeb Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features Conceptijceronline
 

Similar to A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING WRAPPER INDUCTION TECHNIQUE (20)

A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
 
A semantic based approach for knowledge discovery and acquistion from multipl...
A semantic based approach for knowledge discovery and acquistion from multipl...A semantic based approach for knowledge discovery and acquistion from multipl...
A semantic based approach for knowledge discovery and acquistion from multipl...
 
Vision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsVision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result Records
 
A Novel Data Extraction and Alignment Method for Web Databases
A Novel Data Extraction and Alignment Method for Web DatabasesA Novel Data Extraction and Alignment Method for Web Databases
A Novel Data Extraction and Alignment Method for Web Databases
 
Extraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web EngineeringExtraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web Engineering
 
The Data Records Extraction from Web Pages
The Data Records Extraction from Web PagesThe Data Records Extraction from Web Pages
The Data Records Extraction from Web Pages
 
Semi Automatic to Improve Ontology Mapping Process in Semantic Web Data Analysis
Semi Automatic to Improve Ontology Mapping Process in Semantic Web Data AnalysisSemi Automatic to Improve Ontology Mapping Process in Semantic Web Data Analysis
Semi Automatic to Improve Ontology Mapping Process in Semantic Web Data Analysis
 
How to Find a Needle in the Haystack
How to Find a Needle in the HaystackHow to Find a Needle in the Haystack
How to Find a Needle in the Haystack
 
Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...
 
Semantic Knowledge Acquisition of Information for Syntactic web
Semantic Knowledge Acquisition of Information for Syntactic web Semantic Knowledge Acquisition of Information for Syntactic web
Semantic Knowledge Acquisition of Information for Syntactic web
 
H017554148
H017554148H017554148
H017554148
 
IRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description FrameworkIRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description Framework
 
2008 Industry Standards for C2 CDM and Framework
2008 Industry Standards for C2 CDM and Framework2008 Industry Standards for C2 CDM and Framework
2008 Industry Standards for C2 CDM and Framework
 
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
 
Annotation for query result records based on domain specific ontology
Annotation for query result records based on domain specific ontologyAnnotation for query result records based on domain specific ontology
Annotation for query result records based on domain specific ontology
 
Combining and easing the access of the eswc semantic web data 0
Combining and easing the access of the eswc semantic web data 0Combining and easing the access of the eswc semantic web data 0
Combining and easing the access of the eswc semantic web data 0
 
Search Engine Scrapper
Search Engine ScrapperSearch Engine Scrapper
Search Engine Scrapper
 
Annotating Search Results from Web Databases
Annotating Search Results from Web Databases Annotating Search Results from Web Databases
Annotating Search Results from Web Databases
 
Web Presen
Web PresenWeb Presen
Web Presen
 
Web Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features ConceptWeb Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features Concept
 

More from cscpconf

ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR cscpconf
 
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATIONcscpconf
 
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...cscpconf
 
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIESPROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIEScscpconf
 
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGICA SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGICcscpconf
 
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS cscpconf
 
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS cscpconf
 
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTIC
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICTWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTIC
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICcscpconf
 
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAIN
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINDETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAIN
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINcscpconf
 
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...cscpconf
 
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEM
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMIMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEM
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMcscpconf
 
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...cscpconf
 
AUTOMATED PENETRATION TESTING: AN OVERVIEW
AUTOMATED PENETRATION TESTING: AN OVERVIEWAUTOMATED PENETRATION TESTING: AN OVERVIEW
AUTOMATED PENETRATION TESTING: AN OVERVIEWcscpconf
 
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORK
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKCLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORK
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKcscpconf
 
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...cscpconf
 
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAPROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAcscpconf
 
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCH
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHCHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCH
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHcscpconf
 
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...cscpconf
 
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGESOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGEcscpconf
 
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTGENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTcscpconf
 

More from cscpconf (20)

ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR
 
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION
 
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...
 
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIESPROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES
 
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGICA SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
 
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
 
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS
 
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTIC
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICTWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTIC
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTIC
 
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAIN
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINDETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAIN
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAIN
 
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...
 
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEM
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMIMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEM
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEM
 
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...
 
AUTOMATED PENETRATION TESTING: AN OVERVIEW
AUTOMATED PENETRATION TESTING: AN OVERVIEWAUTOMATED PENETRATION TESTING: AN OVERVIEW
AUTOMATED PENETRATION TESTING: AN OVERVIEW
 
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORK
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKCLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORK
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORK
 
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...
 
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAPROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
 
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCH
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHCHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCH
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCH
 
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...
 
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGESOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE
 
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTGENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
 

Recently uploaded

Introduction to TechSoup’s Digital Marketing Services and Use Cases
Introduction to TechSoup’s Digital Marketing  Services and Use CasesIntroduction to TechSoup’s Digital Marketing  Services and Use Cases
Introduction to TechSoup’s Digital Marketing Services and Use CasesTechSoup
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17Celine George
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
Play hard learn harder: The Serious Business of Play
Play hard learn harder:  The Serious Business of PlayPlay hard learn harder:  The Serious Business of Play
Play hard learn harder: The Serious Business of PlayPooky Knightsmith
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfPondicherry University
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 
Economic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesEconomic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesSHIVANANDaRV
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jisc
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
dusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningdusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningMarc Dusseiller Dusjagr
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSAnaAcapella
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
PANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxPANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxakanksha16arora
 
What is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxWhat is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxCeline George
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 
How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17Celine George
 

Recently uploaded (20)

Introduction to TechSoup’s Digital Marketing Services and Use Cases
Introduction to TechSoup’s Digital Marketing  Services and Use CasesIntroduction to TechSoup’s Digital Marketing  Services and Use Cases
Introduction to TechSoup’s Digital Marketing Services and Use Cases
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
Our Environment Class 10 Science Notes pdf
Our Environment Class 10 Science Notes pdfOur Environment Class 10 Science Notes pdf
Our Environment Class 10 Science Notes pdf
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Play hard learn harder: The Serious Business of Play
Play hard learn harder:  The Serious Business of PlayPlay hard learn harder:  The Serious Business of Play
Play hard learn harder: The Serious Business of Play
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Economic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesEconomic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food Additives
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
dusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningdusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learning
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
PANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxPANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptx
 
What is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxWhat is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptx
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17
 

A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING WRAPPER INDUCTION TECHNIQUE

  • 1. Jan Zizka (Eds) : CCSIT, SIPP, AISC, PDCTA - 2013 pp. 509–516, 2013. © CS & IT-CSCP 2013 DOI : 10.5121/csit.2013.3655 A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING WRAPPER INDUCTION TECHNIQUE A.M.Abirami1 , Dr.A.Askarunisa2 , T.M.Aishwarya1 and K.S.Eswari1 1 Department of Information Technology, Thiagarajar College of Engineering, Madurai, Tamil Nadu, India. abiramiam@tce.edu, {aish2292,eswainnov}@gmail.com 2 Department of Computer Science and Engineering, Vickram College of Engineering, Enathi, Tamil Nadu, India. nishanazer@yahoo.com ABSTRACT Most of the internet applications are built using web technologies like HTML. Web pages are designed in such a way that it displays the data records from the underlying databases or just displays the text in an unstructured format but using some fixed template. Summarizing these data which are dispersed in different web pages is hectic and tedious and consumes most of the time and manual effort. A supervised learning technique called Wrapper Induction technique can be used across the web pages to learn data extraction rules. By applying these learnt rules to web pages, enables the information extraction an easier process. This paper focuses on developing a tool for information extraction from the unstructured data. The use of semantic web technologies much simplifies the process. This tool enables us to query the data being scattered over multiple web pages, in distinguished ways. This can be accomplished by the following steps – extracting the data from multiple web pages, storing them in the form of RDF triples, integrating multiple RDF files using ontology, generating SPARQL query based on user query and generating report in the form of tables or charts from the results of SPARQL query. The relationship between various related web pages are identified using ontology and used to query in better ways thus enhancing the searching efficacy. KEYWORDS Information Retrieval, Ontology, Structured Information Extraction, RDF, SPARQL, Semantic Web. 1. INTRODUCTION Today’s web contents are suitable only for humans; it is not machine understandable and readable. So retrieving information from web pages is a complex task. Time consumption is more if one needs to refer or access too many HTML pages to collect data and make summary on them.
  • 2. 510 Computer Science & Information Technology (CS & IT) The semantic web technologies [16] overcome these difficulties to a greater extent. A core data representation format for semantic web is Resource Description Framework (RDF). RDF is a data model for web pages. RDF is a framework for representing information about resources in a graph form. It was primarily intended for representing metadata about WWW resources, such as the title, author, and modification date of a Web page, but it can be used for storing any other data. It is based on triples subject-predicate-object that form graph of data [18]. All data in the semantic web use RDF as the primary representation language [9]. RDF Schema (RDFS) can be used to describe taxonomies of classes and properties and use them to create lightweight ontologies. Ontologies describe the conceptualization, the structure of the domain, which includes the domain model with possible restrictions [11]. More detailed ontologies can be created with Web Ontology Language (OWL). It is syntactically embedded into RDF, so like RDFS, it provides additional standardized vocabulary. For querying RDF data as well as RDFS and OWL ontologies with knowledge bases, a Simple Protocol and RDF Query Language (SPARQL) is available. SPARQL is SQL-like language, but uses RDF triples and resources for both matching part of the query and for returning results [10]. Ujjal Marjit et.al. [1] presented a linked data approach to discover resume information by providing a standard data model, data sharing and data reusing. Arup Sarkar et.al. [2] provided an approach to publish the data from legacy databases as linked data on the web thus enabling the machine readability. David Camacho et.al. [3] extracted information from HTML documents and added semantics through XML to generate rules using Web Mantic with the help of user interaction. Uriv shah et.al [4] provided a framework for retrieval of documents containing both free text and semantically enriched markup using DAML+OIL semantic web language. Olaf Hartig et.al. [5] developed an approach to execute SPARQL queries over web of linked data to discover relevant data for query answering during query execution itself using RDF links between data sources. Peter Haase et. al. [6] assumed a globally shared domain ontology, which can not only be used for subsequently classifying the bibliographic metadata, but also for supporting an improved query refinement process. Abirami et. al [7] proposed a Semantic based approach for generating reports from HTML pages by converting those HTML pages into pre-processed and formatted CVS files, generating OWL files for domain, separating contents from CVS files based on OWL files and rules to generate RDF files and querying RDF files using SPARQL. This paper proposes a model for storing the content of HTML pages into RDF format using the rules generated by the wrapper induction technique. For example, the university web site may contain the details of all the staff members of all the departments mostly in HTML pages. The data is collected and consolidated mostly manually by every year. Though the content is readily available in HTML pages, the manual effort is wasted while consolidating the report. The proposed model generates the report from different HTML documents using RDF and SPARQL. Only the relevant information is extracted and stored in RDF format which makes further querying easier. 
This paper is organized as follows: Section 2 describes the methodology, Section 3 presents the experiments conducted, and Section 4 gives the conclusion.

2. METHODOLOGY

The proposed model includes five different processes, namely (i) extracting structured data from multiple HTML pages, (ii) generating RDF files from the extracted structured data, (iii) creating an ontology for the domain-specific relationships, (iv) generating the corresponding SPARQL queries for the user query, and (v) generating reports according to the user's needs.
Figure 1 shows the various processes in the implementation, which are dealt with in the subsequent sections.

Figure 1. Model for Information Retrieval

2.1 Pre-Processing of HTML Pages

It is important to start from clean HTML pages, because a typical web page contains a large amount of noise, which can adversely affect data extraction accuracy. The pre-processing phase is carried out before the actual extraction starts, as depicted in Figure 2. The raw HTML is fed into HTML Tidy [13]; this tool checks for the presence of a close tag for every open tag. The resulting well-formed HTML is fed into the noise removal phase, whose output is a clean HTML page.

Figure 2. Pre-processing of web pages
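The cleaning step of Section 2.1 can be sketched with the Java port of HTML Tidy (JTidy). This is only an illustrative fragment under that assumption; the file names are placeholders and are not part of the actual toolkit.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.w3c.tidy.Tidy;

public class PreProcessor {
    public static void main(String[] args) throws Exception {
        // Illustrative file names; the tool processes faculty profile pages.
        try (FileInputStream in = new FileInputStream("faculty_raw.html");
             FileOutputStream out = new FileOutputStream("faculty_clean.html")) {
            Tidy tidy = new Tidy();
            tidy.setXHTML(true);          // emit well-formed XHTML so every open tag is closed
            tidy.setQuiet(true);          // suppress progress messages
            tidy.setShowWarnings(false);
            tidy.parse(in, out);          // tidied markup is written to the output stream
        }
        // Noise removal (navigation bars, advertisements, etc.) would follow here.
    }
}
```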
2.2 Structured Data Extraction & RDF Generation

Once the pre-processing phase is completed, the clean HTML pages are fed into the structured data extraction phase. This is accomplished using the Wrapper Induction Technique [17], a supervised, semi-automatic learning approach: a set of extraction rules is learned from a collection of manually labeled pages, and the rules are then employed to extract the target data items from other, similarly formatted pages. The process is pictorially represented in Figure 3.

Figure 3. Structured data extraction

Most HTML tags work in pairs. Each pair consists of an open tag and a close tag, indicated by < > and </ > respectively. Within each corresponding tag pair there can be other pairs of tags, resulting in nested structures; thus HTML tags can naturally encode nested data. Two points are notable: (a) there are no designated tags for each data type, since HTML was not designed as a data encoding language, and any HTML tag can be used for any type; (b) for a tuple type, the values (data items) of different attributes are usually encoded differently, to distinguish them and to highlight important items.

The extraction is done using a tree structure called the EC tree (embedded catalog tree) [17], which models the data embedding in an HTML page; the EC tree used in this paper is given in Figure 4. The wrapper [18] uses a tree structure based on the EC tree to facilitate extraction rule learning and data extraction. Given the EC tree and the rules, any node can be extracted by following the tree path P from the root to that node and extracting each node in P from its parent. The extraction rules are based on the idea of landmarks: each landmark is a sequence of consecutive tokens and is used to locate the beginning or the end of a target item.

Figure 4. Embedded catalog tree

Some of the rules identified by the tool are:
- An HTML page containing the word "journal" indicates that the faculty member has published journal papers.
- The very first list ("<ul> ... </ul>") on such a page is the journal list.
- Within a list item ("<li> ... </li>") of the journal list, the content before the bold tag (<b>) gives the author names.
- The content within the bold tag is the title of the published paper.
- The content after the bold tag is the journal name.
- Within the journal name, the four consecutive digits in the range 19** to 20** give the year of publication.
- Some author names appear inside the bold tag and therefore get wrongly included in the journal name during extraction. After careful assessment, an "<auth> </auth>" tag is inserted between individual author names; during extraction, the presence of the "auth" tag within the bold tag is examined so that the author names and the title are extracted precisely.

These rules were used to extract structured data from HTML pages of the same format. The output is checked against the human oracle; in case of conflict, the conflicting rules are identified and corrected by replacing or adding rules. These two processes go on iteratively. Once the resultant structured data agrees with the human oracle, it is sent to the RDF generation phase, where, given the default format of the RDF file and the structured data extracted, an RDF file is generated for the corresponding HTML page.
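To make the rule format concrete, the following is a simplified, illustrative sketch of how rules like those listed above could be applied to one cleaned journal list. It deliberately uses plain regular expressions instead of the learned landmark wrapper, and the class, pattern and sample data are our own, not part of the paper's toolkit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JournalRuleExtractor {

    // Landmarks in the sense above: the <li>, <b> and year patterns locate each target item.
    private static final Pattern ITEM  = Pattern.compile("<li>(.*?)</li>", Pattern.DOTALL);
    private static final Pattern TITLE = Pattern.compile("<b>(.*?)</b>", Pattern.DOTALL);
    private static final Pattern YEAR  = Pattern.compile("\\b(19|20)\\d{2}\\b");

    public static void extract(String journalListHtml) {
        Matcher item = ITEM.matcher(journalListHtml);
        while (item.find()) {
            String entry = item.group(1);
            Matcher title = TITLE.matcher(entry);
            if (!title.find()) {
                continue;                                                // no bold tag: not a journal entry
            }
            String authors = entry.substring(0, title.start()).trim();   // before <b>: author names
            String paper   = title.group(1).trim();                      // inside <b>: paper title
            String journal = entry.substring(title.end()).trim();        // after <b>: journal name
            Matcher year   = YEAR.matcher(journal);
            String pubYear = year.find() ? year.group() : "unknown";     // 19**/20** inside the journal name
            System.out.printf("%s | %s | %s | %s%n", authors, paper, journal, pubYear);
        }
    }

    public static void main(String[] args) {
        // Illustrative input only; real input is a cleaned faculty page.
        extract("<ul><li>A. Author, B. Author<b>A Sample Title</b>Sample Journal, 2012.</li></ul>");
    }
}
```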
2.3 OWL Creation

Domain-specific vocabularies and relationships are established in the ontology creation step using the Protégé tool [8]. The relationship between the generated RDF files is established by an .owl file [12]. The ontology developed in this step is very useful during report generation, since it identifies the correct RDF file to search. The SPARQL query [9] is analysed and the report is generated in the required format using the Jena APIs [11]. This is further explained in the subsequent section.

3. EXPERIMENTAL RESULTS

The proposed model is evaluated on the college web site (www.tce.edu) to generate a report on the journal publications of the staff members of the various departments for a particular time period. The HTML pages containing journal details are pre-processed. Some faculty members may not have updated their details in their HTML pages; such pages are rejected before the pre-processing phase.

Table 1. Experimental data considered

  Dept     # HTML Pages    # Pre-processed Pages
  CSE           28                  11
  ECE           32                  26
  EEE           30                  14
  MECH          34                  18
  CIVIL         23                  14
  IT            20                  11

3.1 HTML to RDF Conversion

Each cleaned HTML page is converted into RDF [9], as shown in Figure 5.

Figure 5. RDF for an HTML page
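A minimal sketch of this RDF generation step, assuming the Apache Jena RDF API (the Jena release cited in [11] uses the older com.hp.hpl.jena package prefix). The namespace, property names and sample values are illustrative, since the paper's actual vocabulary appears only in Figure 5.

```java
import java.io.FileOutputStream;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class RdfGenerator {
    public static void main(String[] args) throws Exception {
        // Hypothetical namespace for the college publication vocabulary.
        String ns = "http://www.tce.edu/publications#";

        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("pub", ns);

        Property author  = model.createProperty(ns, "author");
        Property title   = model.createProperty(ns, "title");
        Property journal = model.createProperty(ns, "journal");
        Property year    = model.createProperty(ns, "year");

        // One extracted journal entry becomes one resource with four properties (placeholder values).
        Resource entry = model.createResource(ns + "journal_1")
                .addProperty(author, "Author Name")
                .addProperty(title, "Sample Paper Title")
                .addProperty(journal, "Sample Journal Name")
                .addProperty(year, "2012");

        try (FileOutputStream out = new FileOutputStream("cse_faculty.rdf")) {
            model.write(out, "RDF/XML");   // one RDF file per cleaned HTML page
        }
    }
}
```

One such file would be produced per faculty page, and the per-department files are later tied together by the ontology described in Section 2.3.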
3.2 Evaluation Metrics

Precision is used as the evaluation measure [17], both before and after rule refinement. Precision is calculated as the ratio of the correctly retrieved information to the total information; here it is measured as the number of items extracted under the right RDF tag out of the 94 pre-processed HTML pages. Table 2 gives the precision values for the various tags.

Table 2. Precision of RDF conversion

  Extraction of    Before Rule Refinement    After Rule Refinement
  Author                 51% (48)                  89% (84)
  Title                  75% (71)                  94% (89)
  Journal               100% (94)                 100% (94)
  Year                   68% (64)                  88% (83)

The lower precision in extracting author names is due to improper alignment of the author names, which cannot be corrected in the pre-processing phase. Precision is increased by rule refinement addressing the following problems: multiple consecutive commas, misplaced commas and missing commas. The following corrective actions are also taken: (i) within the journal name, the four consecutive digits within the range 19** to 20** are taken as the year; (ii) some author names lie within the bold tag, which is wrongly placed, so an <auth> tag is added between the author names.

3.3 Query Results

As an instance, a general SPARQL query [15] is used to retrieve all publication details from the RDF files.
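The paper's original query listing is not reproduced in this text, so the following is only an illustrative reconstruction of such a "retrieve all details" query, executed through Jena's ARQ API and reusing the hypothetical vocabulary from the RDF generation sketch above.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class PublicationQuery {
    public static void main(String[] args) throws Exception {
        // Load one generated RDF file (file name and vocabulary are illustrative).
        Model model = ModelFactory.createDefaultModel();
        try (InputStream in = new FileInputStream("cse_faculty.rdf")) {
            model.read(in, null);           // parse RDF/XML into the in-memory model
        }

        String queryString =
            "PREFIX pub: <http://www.tce.edu/publications#> " +
            "SELECT ?author ?title ?journal ?year " +
            "WHERE { ?entry pub:author ?author ; " +
            "               pub:title ?title ; " +
            "               pub:journal ?journal ; " +
            "               pub:year ?year . }";

        Query query = QueryFactory.create(queryString);
        try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
            ResultSet results = qexec.execSelect();
            ResultSetFormatter.out(System.out, results);   // print every binding as a table
        }
    }
}
```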
3.4 Report Generation

One way of presenting the output is through charts. The tool uses JFreeChart [14] and provides an easy query interface over various factors; sample reports are depicted in Figures 6-9.

Figure 6. Number of journals published by each department in the given range of years
Figure 7. Number of journals published by a given author in the given range of years
Figure 8. Number of journals published by each department in a given year
Figure 9. Number of journals published by a given department in the given range of years

The manual effort includes the time taken to traverse the links to reach the required HTML document, scan the page for the required details, identify and feed the details into another location (e.g. a spreadsheet) and generate the report. These steps take time in the order of minutes, and in the worst case hours, depending on the person doing the job. The toolkit, in contrast, allows the repository to be updated on a regular basis; once the details are updated, the required information can be obtained and reports can be generated in seconds. The tool's time includes the generation of the SPARQL query, identification of the required RDF files from the ontology, extraction of the details from the identified RDF files and generation of the report. The time taken to generate the reports is shown in Table 3.

Table 3. Report generation time

  Report No.    Manual Effort (min)    Toolkit's Time (sec)
  R1                  1416                     4
  R2                   184                     3
  R3                   670                     4
  R4                    14                     2
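As a rough illustration of the chart-based reporting in Section 3.4, the sketch below builds one bar chart with JFreeChart. The dataset values are placeholders (not the paper's results), and ChartUtilities is the helper class of the 1.0.x releases (newer releases rename it ChartUtils).

```java
import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.category.DefaultCategoryDataset;

public class DepartmentChart {
    public static void main(String[] args) throws Exception {
        // Counts per department would come from the SPARQL result set;
        // the numbers below are placeholders, not experimental data.
        DefaultCategoryDataset dataset = new DefaultCategoryDataset();
        dataset.addValue(12, "Journals", "CSE");
        dataset.addValue(9,  "Journals", "ECE");
        dataset.addValue(7,  "Journals", "IT");

        JFreeChart chart = ChartFactory.createBarChart(
                "Journals published per department",   // chart title
                "Department",                          // category axis label
                "Number of journal papers",            // value axis label
                dataset,
                PlotOrientation.VERTICAL,
                false, false, false);                  // legend, tooltips, URLs

        ChartUtilities.saveChartAsPNG(new File("journals_per_dept.png"), chart, 640, 480);
    }
}
```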
4. CONCLUSION

We developed a toolkit that retrieves web data using RDF and an ontology, with well-refined rules for extracting the data. Quick and useful inferences can be made through its easy query building and report generation process. As a future enhancement, inter-department relationships can be modelled using multiple ontologies to bring out inter-college relationships.

REFERENCES

[1] Ujjal Marjit, Kumar Sharma and Utpal Biswas, "Discovering Resume Information Using Linked Data", International Journal of Web & Semantic Technology (IJWesT), Vol. 3, No. 2, April 2012.
[2] Arup Sarkar, Ujjal Marjit and Utpal Biswas, "Linked Data Generation for the University Data from Legacy Database", International Journal of Web & Semantic Technology (IJWesT), Vol. 2, No. 3, July 2011.
[3] David Camacho and Maria D. R-Moreno, "Web Data Extraction Using Semantic Generators", VSP International Science Publishers, 2006.
[4] Urvi Shah, Tim Finin, Anupam Joshi, R. Scott Cost and James Mayfield, "Information Retrieval on the Semantic Web".
[5] Olaf Hartig, Christian Bizer and Johann-Christoph Freytag, "Executing SPARQL Queries over the Web of Linked Data".
[6] Peter Haase, Nenad Stojanovic, York Sure and Johanna Volker, "Personalized Information Retrieval in Bibster, a Semantics-Based Bibliographic Peer-to-Peer System".
[7] A. M. Abirami and A. Askarunisa, "A Proposal for the Semantic Based Report Generation of Related HTML Documents", International Journal of Software Engineering and Technology, Sept 2011.
[8] http://protege.stanford.edu/
[9] http://www.w3schools.com/rdf/rdf_example.asp
[10] http://www.w3.org/TR/rdf-sparql-query/
[11] http://jena.sourceforge.net/tutorial/RDF_API/
[12] http://www.obitko.com/tutorials/ontologies-semanticweb/ontologies.htm
[13] http://tidy.sourceforge.net/
[14] http://www.jfree.org/jfreechart/
[15] http://code.google.com/
[16] http://www.ai.sri.com/~muslea/PS/jaamas-2k.pdf
[17] Bing Liu, "Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data".
[18] Grigoris Antoniou and Frank van Harmelen, "A Semantic Web Primer".