XML Retrieval: A Slot-Filling Approach
Abstract
Extensible Markup Language (XML) is widely used in data exchange and knowledge
representation. A retrieval system that manages the content of XML documents is
strongly desired. In order to improve the efficiency of XML retrieval systems, we design a set
of methods based on an ontology called the slot-tree, and use slot-trees to help the XML
retrieval process.
One obstacle to building smart computers is that computers cannot understand natural
language as well as humans. This is called the semantic gap between humans and computers. For
XML retrieval systems, the semantic gap lies on both the query side and the document side. The
semantic gap on the query side is due to the difficulty humans have in writing structured queries. The
semantic gap on the document side is due to the difficulty computers have in understanding XML
documents. In order to reduce the semantic gap, we design an XML retrieval system based on the
notion of a slot-tree ontology.
The slot-tree ontology is an object-based knowledge representation. In this thesis we develop
the slot-tree ontology to represent the inner structure of an object. We then introduce a slot-filling
algorithm that maps XML documents into the slot-tree ontology in order to capture their
semantics. After that, we design an XML retrieval system based on the slot-tree ontology and the
slot-filling algorithm. The system includes a slot-based query interface, a semantic retrieval
model for XML, and a program that extracts summaries for browsing.
Since constructing a slot-tree is not an easy job, we also develop a slot-mining
algorithm to construct the slot-tree automatically. Our slot-mining algorithm is a statistical
approach based on correlation analysis between tags and words. Highly correlated
terms are filled into the slot-tree as values. This algorithm eases the construction of the
slot-tree.
Two XML collections, one on butterflies and another on proteins, are used as test beds for
our XML retrieval system. We found that our XML retrieval system is easy to use and performs
well in retrieval effectiveness and in the quality of browsing. Furthermore, the slot-mining
algorithm can fill important words into each slot. However, the mining results should be
revised manually in order to improve the quality of the slot-tree.
Finally, we summarize our contributions to XML retrieval and compare our methods
with some other methods. A qualitative analysis is given in the last chapter. We also suggest
directions for further research.
XML Retrieval : A Slot-Filling Approach
Ph.D. Dissertation
Chen, Chung Chen
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan
E-mail : johnson@turing.csie.ntu.edu.tw
Advisor : Jieh Hsiang
23 July 2002
Contents
Part 1 : Tutorial of This Thesis
1 Introduction
1.1 Motivation
1.2 Research problems
1.3 Research approaches
1.4 Outline of this thesis
2 Background: XML and Information Retrieval
2.1 XML
2.2 Information retrieval
2.3 XML querying and retrieval
2.4 Using ontology to help the XML retrieval process
2.5 Discussion
Part 2 : Slot-Tree Based Methods for XML Retrieval
3 Slot-Tree Ontology and Slot-Filling Algorithm
3.1 Introduction
3.2 Slot-tree ontology
3.3 Slot-filling algorithm
3.4 Discussion
4 An Ontology Based Approach for XML Querying, Retrieval and Browsing
4.1 Introduction
4.2 XML documents
4.3 Indexing structure
4.4 Query language and query interface
4.5 Ranking strategy
4.6 Browsing XML documents
4.7 Discussion
5 The Construction of Slot-Tree Ontology
5.1 Introduction
5.2 Background
5.3 The process of building a slot-tree
5.4 Slot-mining algorithm
5.5 Discussion
Part 3 : Case Studies
6 Case Study - A Digital Museum of Butterflies
6.1 Introduction
6.2 The representation of butterflies in XML
6.3 Slot-tree ontology for butterflies
6.4 Query interface
6.5 Slot-filling algorithm
6.6 XML retrieval
6.7 Slot-mining algorithm
6.8 Discussion
7 Case Study - Protein Information Resource
7.1 Introduction
7.2 The representation of proteins in XML
7.3 Slot-tree ontology for proteins
7.4 Query interface
7.5 Slot-filling algorithm
7.6 XML retrieval
7.7 Slot-mining algorithm
7.8 Discussion
Part 4 : Conclusions
8 Conclusions and Contributions
8.1 Comparison
8.2 Contributions
8.3 Discussion and future work
Reference
Appendix 1 : A Museum of Butterflies in Taiwan
Appendix 2 : Protein Information Resource
Part 1 : Tutorial of This Thesis
1 Introduction
This thesis introduces an information retrieval (IR) method for XML. One major problem for
information retrieval is that computers cannot understand documents as well as people do. This
problem is called the semantic gap problem. Our goal is to build an information retrieval
system that reduces the semantic gap between humans and computers on XML. Our approach is
to use an ontology to help the search processes for XML, including querying, retrieval and
browsing. This chapter opens with our motivation in section 1.1. Our research problems are
presented in section 1.2. Our research approaches are described in section 1.3. An overview of
this thesis is given in section 1.4.
1.1 Motivation
Extensible Markup Language (XML) [XML98] is a standard for encoding semi-structured
documents. XML is useful for data representation, data exchange and data publishing on the
web. Many people believe that XML will become a widespread standard in the future. For this
reason, XML has gained much attention in both the information community and the field of
database research.
XML is a markup language with extensible tags. Anyone may define their own markup
language based on XML. In fact, hundreds of specifications based on XML were
proposed from 1997 to 2002. These specifications are designed to fulfill the needs of particular
domains or applications. For example, the Protein Information Resource (PIR)
(http://pir.georgetown.edu/) is an XML collection designed to record data about proteins.
UDDI [UDDI00] is an XML specification designed to record the profiles of business
companies.
XML is designed to be easily understood by both humans and computers. XML is encoded in text
format so that humans can read and understand it easily. Tags in XML provide semantic context that
helps computers interpret the content correctly. XML can therefore serve as a bridge between human
writing and computer understanding.
A smart computer program that understands XML documents would be useful. However, building
such a program is still very difficult. In this thesis, we
propose methods for computers to understand XML documents.
The natural language processing (NLP) community has focused on the processing and
understanding of natural language documents for a long time [Grosz86]. However,
understanding natural language documents is still very difficult for computer programs. No
existing approach is powerful enough to solve the understanding problem. Building a smart
computer program to understand natural language text is very difficult because of the
"semantic gap". The semantic gap is described as follows.
Computers cannot understand natural language as well as humans.
The semantic gap causes difficulties for information retrieval systems. For example,
an information retrieval system cannot understand our natural language queries, and so retrieves
many documents that are not semantically related to our queries.
There are two semantic gaps for an information retrieval system, one in understanding
queries and another in understanding documents. These gaps are listed as follows.
Gap 1 : Computers cannot understand queries as well as humans.
Gap 2 : Computers cannot understand documents as well as humans.
Figure 1.1 : Semantic gaps of natural language
In order to reduce the semantic gap, researchers in the NLP community have been
trying hard to answer the following question.
How can we make computers understand natural language?
However, natural language is currently too difficult for computer programs to understand.
Although many people have devoted themselves to the problem for more than thirty years,
designing a computer program that understands natural language is still an open research
problem.
Computers do not understand natural language well. Why don't we design a structured
language that is easy for computers to understand and easy for humans to write? If we can design
such a language, then we have a common language between humans and computers. People may
write documents in this language, and we may then build computer
programs to understand documents in this language.
XML is such a language in that it is easy for humans to write. However, we have no method for
computers to understand XML documents easily. If we can design such a computer program, we
may reduce the semantic gap for XML, so that XML may serve as a bridge between humans and
computers.
In this thesis, our goal is to reduce the semantic gap on XML. Our approach is to design
methods for computers to understand XML documents. Our research problems are described in the
next section.
1.2 Research problems
XML is a markup language with extensible tags. People have to understand the tags before writing
XML documents. If there are too many tags for an XML writer to remember, the writer cannot write
XML documents easily. If a writer has to mark up every word in an XML document, writing is
not easy either. On the other hand, if a writer marks documents up only roughly, they are difficult for
computers to understand. The tradeoff between human writing and computer understanding is
called the "human-computer dilemma of XML".
Our goal is to design an XML retrieval system that resolves the human-computer dilemma
of XML. For an XML retrieval system, there are two semantic gaps between humans and
computers, one on the query side and another on the document side. Figure 1.2 shows these
two gaps.
Figure 1.2 : Semantic gaps of XML
On the document side, an XML document may be easy for humans to write but not so easy for
computers to understand. An XML document that contains much natural language text is such a
case. Example 1.1 shows an XML document that contains natural
language text in the "color" block and the "size" block. It is not so easy for computers to
understand this XML document.
Example 1.1 : An XML document that is not easy for computers to understand
<butterfly name="kodairai">
<color>with black wing and white spots on it</color>
<size>middle size butterflies, from 50mm to 60mm</size>
</butterfly>
Conversely, an XML document may be easy for computers to understand but not so easy
for humans to read and write. An XML document that marks up every word is such a case.
Example 1.2 shows an XML document that is not easy for humans to
read and write.
Example 1.2 : An XML document that is not easy for humans to read and write
<butterfly name="kodairai">
<color><wing>black</wing><texture>white spot</texture></color>
<size>
<classification>middle size</classification>
<from>50mm</from><to>60mm</to>
</size>
</butterfly>
The same thing happens on the query side. An XML query may be easy for humans to write but
not so easy for computers to understand; a query written in natural language is such a case.
Example 1.3 shows an XML query that is not so easy for computers to
understand.
Example 1.3 : An XML query that is not easy for computers to understand
<butterfly>in black color with white spots</butterfly>
Conversely, an XML query may be easy for computers to understand but not easy for
humans to read and write; a fully structured XML query is such a case.
Example 1.4 shows an XML query that is not so easy for humans to read and write.
Example 1.4 : An XML query that is not easy for humans to read and write
for $b in //butterfly
where $b/color = "black" and $b/texture = "white spots"
return $b
Two approaches may be used to reduce the semantic gap between humans and computers on
XML. The first approach is to build computer programs that understand XML documents or
queries. The second approach is to build tools that help humans write XML documents or queries.
We adopt the first approach on the document side and the second approach on the
query side. That is, we build a computer program to understand roughly tagged XML
documents, and we build a tool for humans to write XML queries easily. The following section
describes our approach.
1.3 Research approaches
In this thesis, we build an XML retrieval system to reduce the semantic gap between humans
and computers on XML. An ontology called the slot-tree is used to help the retrieval process. A user
may use the query interface to write queries easily. The slot-tree ontology also helps the
computer to understand XML documents. Figure 1.3 shows a scenario of our XML
retrieval system.
Figure 1.3 : A scenario of our XML retrieval system.
On the document side, we build a computer program to understand XML documents. The
understanding process is based on an ontology called the slot-tree. A slot-tree is a frame-like
representation with embedded XPath [XPATH99] expressions. In order to make computers
understand XML documents, we designed a slot-filling algorithm that maps XML documents into
the slot-tree.
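The idea of mapping a document into slots located by path expressions can be illustrated with a minimal Python sketch. This is not the slot-tree syntax or the slot-filling algorithm of chapter 3; the dictionary layout, the slot names, the candidate values, and the `fill_slots` helper are all assumptions made for illustration, and the paths use only the XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical slot-tree: each slot carries a path that locates text in the
# document and a list of values the slot may take. Illustrative only.
SLOT_TREE = {
    "color": {"path": "./color", "values": ["black", "white", "yellow"]},
    "size": {"path": "./size", "values": ["small", "middle", "large"]},
}


def fill_slots(xml_text, slot_tree):
    """Return {slot name: candidate values found in the text the path locates}."""
    root = ET.fromstring(xml_text)
    filled = {}
    for slot, spec in slot_tree.items():
        # Collect all text under every node matched by the slot's path.
        text = " ".join(
            "".join(node.itertext()) for node in root.findall(spec["path"])
        ).lower()
        filled[slot] = [value for value in spec["values"] if value in text]
    return filled


# The roughly tagged document of Example 1.1.
DOC = """<butterfly name="kodairai">
  <color>with black wing and white spots on it</color>
  <size>middle size butterflies, from 50mm to 60mm</size>
</butterfly>"""

result = fill_slots(DOC, SLOT_TREE)
print(result)
```

In this sketch the natural language text inside each block is reduced to the slot values it mentions, which is the sense in which a roughly tagged document becomes easier for a program to work with.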
On the query side, we build a query interface for humans to write queries easily. The
interface is built by transforming the ontology into a web page. A user may build a
structural query through the interface just by choosing or typing values into slots.
In our approach, the slot-tree ontology is a key component for both understanding
documents and building queries. The slot-tree ontology mediates between queries and documents in
the retrieval process to reduce the semantic gaps on both the query side and the document side.
However, building the slot-tree ontology is not an easy job. The ontology constructor
needs tools to build it. The problem of constructing a slot-tree automatically from
a set of XML documents is called the slot-mining problem. It is described as follows.
How can we mine the slot-tree ontology from a collection of XML documents?
In order to handle the slot-mining problem, we developed a statistical method that builds
the slot-tree automatically. The method is called the slot-mining algorithm, and it is based on
correlation analysis between tags and terms in XML documents.
1.4 Outline of this thesis
This thesis is divided into four parts: the tutorial part, the methods part, the case study
part and the conclusion part.
Part 1 sets the stage for all the others. Chapter 1 outlines the research problems and
approaches. Chapter 2 reviews the background literature for our research, which is designing an
XML retrieval system to reduce the semantic gap problem.
Part 2 is a detailed description of our methods. Our methods are based on a knowledge
representation structure called the slot-tree. The slot-tree is used to capture the semantics of XML
documents. It helps our XML retrieval system to understand XML documents.
Chapter 3 presents the syntax and semantics of the slot-tree ontology, and a method called the
slot-filling algorithm that uses the slot-tree to capture the semantics of XML documents.
Chapter 4 outlines an XML information retrieval system based on the slot-tree. The slot-tree
ontology and the slot-filling algorithm are used to reduce the semantic gap of XML retrieval.
Chapter 5 describes the process of constructing a slot-tree ontology. The steps of constructing
a slot-tree are outlined. After that, a method that constructs the slot-tree automatically is proposed.
The method is a statistical program called the slot-mining algorithm. The slot-mining
algorithm mines slot-trees from XML documents based on correlation analysis between
tags and terms. It helps people construct the slot-tree ontology for a given XML collection.
Part 3 contains the test beds of the slot-tree based approach. The approach is
examined in this part using two cases. Chapter 6
presents the first case, an XML collection about butterflies. The collection is a set of XML
documents in Chinese about butterflies in Taiwan. Chapter 7 presents the second case, the
Protein Information Resource (PIR). PIR is a large set of XML documents released by
Georgetown University. The experiments on these two cases are used to analyze the strengths
and weaknesses of the slot-tree based approach.
Part 4 is the conclusion part. Chapter 8 analyzes the strengths of the slot-tree based approach.
We compare the slot-tree based methods with some other XML retrieval methods, and point out
our contributions, conclusions and future work.
2 Background: XML and Information Retrieval
In chapter 1, we introduced our motivation, goals and research approaches. Briefly,
we would like to build an XML retrieval system that reduces the semantic gap
between humans and computers on XML. In this chapter, we survey the related research in
order to provide background knowledge for our work. Since our approach uses a slot-tree
ontology to help the XML retrieval process, we survey the topics of XML, information
retrieval and ontology in this chapter.
In section 2.1, we focus on XML and survey the related specifications and
technologies. In section 2.2, we survey information retrieval technologies. After that, we
survey the current status and state of the art in XML retrieval in section 2.3. Finally, we
outline the relationship between ontology and XML retrieval in section 2.4.
2.1 XML
We have to understand XML in order to build an XML retrieval system that reduces the semantic gap.
In this section, we survey XML-related specifications and technologies, especially the literature
on knowledge representation and information retrieval.
XML was proposed by the World Wide Web Consortium (W3C) (http://www.w3c.org) in 1998. It is a
tree-structured markup language with extensible tags. The following example is an XML document for a
phonebook.
Example 2.1 An XML document
<?xml version="1.0"?>
<!DOCTYPE phonebooks SYSTEM "phonebook.dtd">
<phonebooks xmlns="http://www.ntu.edu.tw/phonebook">
<people id="001">
<name>Johnson Chen</name>
<tel>02-34134345</tel>
</people>
<people id="002">
<name>Fanny Chen</name>
<tel>02-33451294</tel>
</people>
</phonebooks>
In example 2.1, the head part <?xml version="1.0"?> indicates that this document is an XML document.
The second line is the document type definition (DTD) part of this XML document. A DTD is used to
validate the syntax of XML documents. The DTD part is optional and can be removed to skip the
syntax validation process.
The third line, with the phonebooks tag, is the root node of this XML document. An XML
document has one and only one root node. In this line, xmlns="http://www.ntu.edu.tw/phonebook"
declares the default namespace of this XML document. Namespaces [XMLNS99] in XML are used to
distinguish tags with the same name from each other, so that people can define their own tags and
use others' tags without having to worry about the same tag name being used with different meanings.
A node in XML contains a tag, attributes and text. In the example above, phonebooks, people,
name and tel are tags; xmlns and id are attributes; http://www.ntu.edu.tw/phonebook is an attribute
value; and Johnson Chen and 02-34134345 are text parts.
XPath [XPATH99] is a specification used to locate nodes in XML documents. If we would
like to locate all the people nodes, we may use the XPath expression //people.
The // operator matches all descendant nodes. If we would like to locate the
people node with id="001", we may use the XPath expression //people[@id="001"].
The @ symbol means that id is an attribute name. XPath is used in the slot-tree ontology
discussed in chapter 3. We embed XPath expressions in the slot-tree to locate nodes in XML documents,
and use them to map XML documents into the slot-tree ontology.
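The two XPath expressions above can be tried with a short Python sketch. It uses the standard library module `xml.etree.ElementTree`, which supports only a subset of XPath and requires a leading `.` for descendant searches from an element. The default namespace of example 2.1 is left out here as a simplifying assumption, since with a namespace the tag names in the paths would have to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# The phonebook of example 2.1, without the default namespace (assumption
# made to keep the path expressions short).
PHONEBOOK = """<phonebooks>
  <people id="001"><name>Johnson Chen</name><tel>02-34134345</tel></people>
  <people id="002"><name>Fanny Chen</name><tel>02-33451294</tel></people>
</phonebooks>"""

root = ET.fromstring(PHONEBOOK)

# //people : every people node below the root.
all_people = root.findall(".//people")

# //people[@id="001"] : the people node whose id attribute equals 001.
person_001 = root.findall(".//people[@id='001']")

names = [person.findtext("name") for person in all_people]
tel_001 = person_001[0].findtext("tel")
print(names)
print(tel_001)
```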
Many XML-related specifications have been proposed since 1997. XML has become a widespread
specification used in many domains and applications, such as data exchange, data
presentation, data querying, and knowledge representation. For data exchange, UDDI and
ebXML are used to mediate the data exchange process between business enterprises. For data
presentation, XSLT can be used to transform XML into HTML for presentation on the web. For data
querying, XQL, XML-QL and XQuery are used to query data in XML documents. For knowledge
representation, RDF/RDFS, DAML/DAMLS and XML topic maps have been proposed to represent knowledge in
XML format. We survey specifications for data querying in section 2.3, which discusses XML
querying and retrieval, and specifications for knowledge representation in section 2.4, which
discusses ontology.
2.2 Information retrieval
In order to build an XML retrieval system that reduces the semantic gap, we have to understand
information retrieval technologies, and how to use natural language understanding technologies to
reduce the semantic gap of XML.
The evolution of IR techniques is closely related to the structure of the target documents. Each time a new
document structure is proposed, a new IR technique is developed. Between 1970 and 1980, the Vector Space Model was
developed to retrieve text documents. Between 1990 and 1999, the Random Walk Model was developed to retrieve HTML
documents. Today, XML documents are becoming widespread, and many researchers are trying to develop new
retrieval models for XML.
Text Retrieval
Text retrieval technology is almost as old as computer technology itself. There are many models for
text retrieval. The best known is the Vector Space Model (VSM) [Salton75]. In this model, each
document is represented by a k-dimensional vector of term weights. A plain-text document d is expressed as follows.
$d = (d_{t_1}, d_{t_2}, \ldots, d_{t_k})$, where $d_{t_i}$ is the weight of term $t_i$ in document $d$
In the expression above, k equals the number of index terms in the collection. The order of
words in the text is discarded.
A query is represented by a k-dimensional vector of term weights, too. The query q may be represented
as the following vector.
$q = (q_{t_1}, q_{t_2}, \ldots, q_{t_k})$, where $q_{t_i}$ is the weight of term $t_i$ in query $q$
The cosine coefficient is a popular measure of the similarity between a document and a query. Cosine
similarity is defined as the cosine of the angle between the document vector d and the query
vector q.
$$\mathrm{Similarity}(d, q) = \frac{d \cdot q}{\|d\| \, \|q\|} = \frac{\sum_{i=1}^{k} d_{t_i} q_{t_i}}{\sqrt{\sum_{i=1}^{k} d_{t_i}^{2}} \, \sqrt{\sum_{i=1}^{k} q_{t_i}^{2}}}$$
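As a small illustration of the cosine measure, the following Python sketch computes the similarity of two weight vectors of equal length. The function name `cosine_similarity` and the convention of returning 0.0 for an all-zero vector are assumptions of this sketch, not something specified in the text.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    if len(d) != len(q):
        raise ValueError("d and q must have the same number of terms")
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0.0 or norm_q == 0.0:
        # Assumption: a vector with no weighted terms is similar to nothing.
        return 0.0
    return dot / (norm_d * norm_q)


# A document and a query that share one of two index terms.
score = cosine_similarity([1.0, 0.0], [1.0, 1.0])
print(score)
```

Because the measure depends only on the angle between the vectors, multiplying all weights of a document by a constant does not change its score.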
One question is how to set the weights $d_{t_i}$ and $q_{t_i}$ in the vector space model. The tfidf function is a simple and
commonly used weighting function. The tfidf weight is defined as the product of term frequency (tf)
and inverse document frequency (idf).
Term frequency (tf) : tf(t, d) is the number of occurrences of term t in document d.
Document frequency (df) : df(t) is the number of documents containing term t.
Inverse document frequency (idf) : the logarithm of the inverse fraction of documents in which the term occurs,
idf(t) = log(N / df(t)), where N is the number of documents.
For a given document d, $d_{t_i}$ = tfidf($t_i$, d) = tf($t_i$, d) * idf($t_i$)
For a given query q, $q_{t_i}$ = tfidf($t_i$, q) = tf($t_i$, q) * idf($t_i$)
The SMART system experiments led by Salton [Salton88] show that the tfidf term-weighting function
performed best among the 287 distinct combinations of term-weighting assignments tested. The tfidf weighting
function has thus been shown empirically to be a good measure for the vector space model.
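The definitions above can be turned into a minimal Python sketch. The function name `tfidf_vectors`, the use of the natural logarithm, and the toy collection are assumptions made for illustration; the text does not fix the base of the logarithm.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, one weight vector per doc)."""
    n = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # df(t): number of documents containing term t.
    df = Counter(term for doc in docs for term in set(doc))
    # idf(t) = log(N / df(t)), natural logarithm assumed.
    idf = {term: math.log(n / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf(t, d): occurrences of t in d
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors


# Hypothetical three-document collection.
DOCS = [
    ["black", "wing", "white", "spots"],
    ["black", "wing", "black", "body"],
    ["yellow", "wing"],
]
vocab, vectors = tfidf_vectors(DOCS)
for doc_vector in vectors:
    print(dict(zip(vocab, doc_vector)))
```

A term such as "wing" that occurs in every document receives idf = log(1) = 0 and therefore contributes nothing to the similarity, while a term confined to one document receives the largest idf.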
HTML Retrieval
The main issue in HTML retrieval is measuring the importance of a document. An HTML retrieval
system retrieves documents that match the query and then sorts them by importance. On the web, there are too
many matching documents to read. The importance measure helps the user decide what to read.
Documents on the web differ from a plain text collection because of their hyperlink structure.
The measure of importance for HTML documents is based on hyperlink analysis. Historically, hyperlink
analysis was developed from citation analysis. A simple strategy for measuring the
importance of a web page is to count the number of hyperlinks that reference it: a web page
referenced by many other pages is important.
In 1998, a random walk model for weighting the importance of web pages was proposed
[Brin98][Page98]. The random walk model was then used in the Google search engine. In the random
walk model, a page is important if it is cited by many important pages. Formally, each web
page d in the random walk model has a weight w(d). An iterative process recalculates
w(d) in each iteration.
$$w(p) \leftarrow \sum_{q : (q, p) \in E} w(q)$$
Conceptually, the random walk model simulates a person clicking through web pages at random.
The random walker chooses a web page at random as a start page. After that, the walker randomly clicks a
link in the page and repeats the process on each visited page. In the random walk model, an
important page is visited with high probability.
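The iterative update can be sketched in Python as follows, where E is taken to be the set of hyperlinks and a pair (q, p) in E means that page q links to page p. The sketch follows the simplified update given above; the rescaling of the weights so that they sum to 1 after each iteration, the uniform starting weights, the fixed number of iterations, and the example graph are assumptions of this sketch. The model published in [Brin98][Page98] additionally divides each weight by the number of outgoing links and adds a damping term, which is not reproduced here.

```python
def random_walk_weights(edges, iterations=50):
    """Iterate w(p) <- sum of w(q) over all links (q, p), rescaling to sum 1."""
    pages = sorted({page for edge in edges for page in edge})
    w = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_w = {page: 0.0 for page in pages}
        for q, p in edges:  # q links to p, so p inherits weight from q
            new_w[p] += w[q]
        total = sum(new_w.values())
        if total == 0.0:
            break  # no weight was passed on; keep the previous weights
        # Rescale so that the weights stay comparable between iterations.
        w = {page: value / total for page, value in new_w.items()}
    return w


# Hypothetical four-link web: a -> b, a -> c, b -> c, c -> a.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
weights = random_walk_weights(EDGES)
print(sorted(weights.items(), key=lambda item: -item[1]))
```

In this small graph page c is cited by both a and b, so it ends up with the largest weight, and a, which is cited by the important page c, outranks b.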
Kleinberg proposed a hub-authority model to weight the impact of a web page [Kleinberg98]. Web
pages are divided into two classes in this model, hub pages and authority pages. The hub-authority model
is an iterative process. A hub page (h) is important if it points to many important
authority pages. An authority page (a) is important if it is cited by many important
hub pages. Formally, each page d in the hub-authority model has two weights, the
hub weight h(d) and the authority weight a(d). An iterative process
recalculates h(d) and a(d) in each iteration. Figure 2.1 shows the concept of the hub-authority model.
Figure 2.1 The hub-authority model
A set of web pages (D) contains many hyperlinks (E). For each page d in D, h(d) is the hub weight of
d, and a(d) is the authority weight of d. At first, we may set both h(d) and a(d) to 1/|D|, where |D| is the
number of documents in D. After that, each iteration recalculates h(d) and a(d) based on the
following recurrence equations.
$$a(p) \leftarrow \sum_{q : (q, p) \in E} h(q)$$
$$h(p) \leftarrow \sum_{q : (p, q) \in E} a(q)$$
The hub-authority model is used to weight the importance of a web page and to decide whether a page is
a hub or an authority. Besides weighting importance, the hub-authority model thus provides a mechanism for
classifying the type of a web page.
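The two recurrence equations can be sketched in Python as follows. The starting weights of 1/|D| come from the text; rescaling each weight set to sum 1 after every update, computing the hub weights from the freshly updated authority weights, the fixed number of iterations, and the example graph are assumptions of this sketch rather than details taken from [Kleinberg98].

```python
def hub_authority(edges, iterations=50):
    """Iterate a(p) <- sum h(q) over (q, p) in E and h(p) <- sum a(q) over (p, q) in E."""
    pages = sorted({page for edge in edges for page in edge})
    h = {page: 1.0 / len(pages) for page in pages}
    a = {page: 1.0 / len(pages) for page in pages}

    def rescale(weights):
        total = sum(weights.values())
        if total == 0.0:
            return weights
        return {page: value / total for page, value in weights.items()}

    for _ in range(iterations):
        new_a = {page: 0.0 for page in pages}
        for q, p in edges:  # q links to p: p gains authority from hub q
            new_a[p] += h[q]
        a = rescale(new_a)
        new_h = {page: 0.0 for page in pages}
        for p, q in edges:  # p links to q: p gains hub weight from authority q
            new_h[p] += a[q]
        h = rescale(new_h)
    return h, a


# Hypothetical graph: three link pages pointing at two content pages.
EDGES = [
    ("h1", "a1"), ("h1", "a2"),
    ("h2", "a1"), ("h2", "a2"),
    ("h3", "a1"),
]
hub, authority = hub_authority(EDGES)
print("hubs:", hub)
print("authorities:", authority)
```

Pages that only point to others end up with authority weight 0 and pages that are only pointed to end up with hub weight 0, which is how the two weights can also be read as a classification of page type.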
Both the hub-authority model and the random walk model use an iterative approach to decide the
importance of a web page. Convergence analysis based on eigenvalues in linear algebra is used to
analyze the behavior of the recurrence equations used in these models. The papers of Kleinberg
[Kleinberg98] and Page et al. [Page98] discuss the theory of these models further.
2.3 XML Querying and Retrieval
In order to manage XML documents, the database community and the IR community have recently
focused on research into storing, indexing, querying, and retrieving XML documents. For
storing, database management systems have been extended to support the storage of XML
documents. One way is to extend a relational database system to store XML documents; another
way is to store XML documents in an object-oriented database (OODB) system. For indexing,
Patricia tries and inverted files are used to index XML documents. For querying, several XML
query languages have been proposed to retrieve XML nodes. For searching, several systems have been
designed to search XML documents. In this section, we focus on surveying XML query
languages and XML retrieval systems.
XML Query Language
Designing query languages is a hot research topic for XML. XML query languages are much
more complex than the queries used in text retrieval and HTML retrieval, and they are more flexible than
database query languages. Many XML query languages have been proposed in recent years, such as Lorel
[Loral97], XML-QL [XML-QL98], XML-GL [XML-GL99], and XQuery [XQuery01].
Querying an XML collection is like querying a database. In a relational database, we usually query
tables in SQL. The following example shows a query that retrieves the names and birthdays
of United States presidents.
SELECT name, birthday FROM people WHERE nation = 'US' AND job = 'president'
An XML query language has to retrieve nodes in the tree of XML nodes. The following example
shows an XQuery query that retrieves the names and birthdays of United States presidents.
for $p in //people
let $n := $p/name, $b := $p/birthday
where $p/job = "president" and $p/nation = "US"
return ($n, $b)
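For comparison, the same selection over a tree of XML nodes can be sketched in Python with the standard library. This is only an illustration of what the query asks for, not an XQuery processor; the `registry` wrapper element and the two sample records are made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical data shaped like the people nodes the query assumes.
PEOPLE = """<registry>
  <people>
    <name>George Washington</name><birthday>1732-02-22</birthday>
    <nation>US</nation><job>president</job>
  </people>
  <people>
    <name>Winston Churchill</name><birthday>1874-11-30</birthday>
    <nation>UK</nation><job>prime minister</job>
  </people>
</registry>"""

root = ET.fromstring(PEOPLE)

# for $p in //people ... where job and nation match ... return name, birthday
results = [
    (p.findtext("name"), p.findtext("birthday"))
    for p in root.iter("people")
    if p.findtext("job") == "president" and p.findtext("nation") == "US"
]
print(results)
```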
XML-GL is a graphical notation used to query XML documents. Figure 2.2 shows an example that
retrieves orders that ship books with the title "Introduction to XML" to Los Angeles.
Figure 2.2 An example of XML-GL
XML retrieval systems
Several XML retrieval systems have been proposed in recent years. We survey
these systems in this section.
Lore was a pioneering research project on XML retrieval at Stanford University. In this project,
an object-oriented database was used to store XML documents, and the XML query language
Lorel was developed. In addition, a query interface called DataGuider was developed for querying
XML documents. Figure 2.3 is a screenshot of the DataGuider system.
Figure 2.3 The query interface of DataGuider system
XYZfind is a commercial system that splits the querying process into four steps. The following
figures show the retrieval steps of the XYZfind retrieval system.
Step 1 : The user types in a query to start the category searching process.
Step 2 : The XYZfind system finds several related categories. The user has to click the target application.
Step 3 : The user uses the query interface to build a query.
Step 4 : The XYZfind system retrieves XML documents and shows them in the browser.
Figure 2.4 Retrieval steps of the XYZfind system
2.4 Using Ontology to Help the XML Retrieval Process
In order to reduce the semantic gap, we have to survey the technologies that used to make
computer understand natural language text. The design of XML does not eliminate the usage of
natural language text in the content of XML documents. Natural language texts are frequently
embedded in XML documents. The natural language understanding technologies that used to
reduce the semantic gap is still needed in the understanding process of XML documents. In this
section, we will focus on how to use natural language understanding technologies that based on
ontology representation to understand XML documents.
Natural language processing community has been trying to resolve the semantic gap
problem for a long time. Natural language understanding is a field that focuses on building
computer programs to understand natural language text [Grosz86] [Allen94]. However, the
word understanding used here is a misleading word. Computers do not really understand
natural language text as human. Calculation and symbolic reasoning is what computers can do.
Computers understand natural language text by mapping text into internal representation.
The internal representation guides the computer to do symbolic reasoning and act as it know
the meaning of natural language text.
Alan Turing designed the Turing test [Turing50] to test whether a computer understands
natural language text or not. For information retrieval, we adopt a definition similar to the
Turing test. If a computer program retrieves what we want, discards what we do not want, and
organizes the retrieval result into what we like to browse, then we say the computer program
understands documents and our queries. A computer that can do what we like it to do is a smart
computer. A retrieval system that retrieves only what we want and organizes the result into what
we like is a smart retrieval system.
A data structure called an ontology, which represents the concepts in the human mind, is used in the
process of understanding. Generally speaking, understanding is the process of mapping natural
language text into an ontology. After the mapping, the computer can act based on the mapping.
This is the style of computer "understanding".
An ontology may be represented in different structures. The research topic that focuses on the
structure of ontology is called knowledge representation [Brachman85a]. Roughly speaking, there are
two approaches to representing knowledge and ontology: the logic-based approach and the object-based
approach. We introduce and compare these two approaches here. They are the basis of our slot-tree
ontology, which is discussed in chapter 3.
The logic-based approach encodes knowledge into logic statements for reasoning,
including propositional logic, first-order logic, probabilistic logic, etc. Prolog is the
best-known programming language based on logic.
Based on these logic statements, a reasoning process is used to infer unforeseen true
statements from the predefined logic statements.
First-order logic is a powerful theory for representing knowledge and deriving conclusions.
It is a monotonic logic system whose expressions contain predicates and quantifiers.
In first-order logic, we use logic statements to represent the ontology. The
following example shows the logic statements that describe the inheritance relationship
between butterfly, insect and animal.
is(butterfly, insect)
is(insect, animal)
∀x, y, z is(x,y) ∧ is(y,z) → is(x,z)
The power of first-order logic lies in its ability to do monotonic reasoning. Monotonic
reasoning means that any conclusion made will never be retracted in the future. The 100% certainty of
facts, rules and conclusions must be assured in the first-order logic reasoning process. The following
example shows a reasoning process for the example above. The reasoning process infers that "butterfly is
a kind of animal".
∀x, y, z is(x,y) ∧ is(y,z) → is(x,z) (bind x to butterfly, y to insect, z to animal)
-----------------------------------------------------------------------------------------------
is(butterfly,insect) ∧ is(insect,animal) → is(butterfly,animal)
-----------------------------------------------------------------------------------------------
conclusion : is(butterfly, animal)
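The derivation above can be reproduced mechanically. The following Python sketch is only an illustration and is not part of the original system: the function name transitive_closure is hypothetical, and it implements nothing more than repeated application of the single rule is(x,y) ∧ is(y,z) → is(x,z) to a set of facts.

```python
def transitive_closure(facts):
    """Apply is(x,y) and is(y,z) -> is(x,z) until no new fact appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(known):
            for (y2, z) in list(known):
                if y == y2 and (x, z) not in known:
                    known.add((x, z))   # a derived fact is only ever added
                    changed = True
    return known


facts = {("butterfly", "insect"), ("insect", "animal")}
closure = transitive_closure(facts)
print(("butterfly", "animal") in closure)
```

Facts are only added and never removed, which is the monotonic property described above.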
A difficulty is that many uncertain situations are encountered in the natural language
understanding process. The 100% certainty required by first-order logic cannot always be assured.
Probabilistic logic and fuzzy logic were developed to handle the uncertainty. However, the monotonic
property is lost in the uncertain reasoning process.
After reviewing the logic-based approach, we now introduce the object-based approach. The object-based
approach contains a set of representation methods, including frames, semantic networks and scripts.
Generally speaking, frames are used to represent the internal structure of an object, semantic networks
are used to represent the relations between objects, and scripts are used to describe an active scenario
involving many objects.
The frame was proposed by Minsky in 1975 [Minsky75] in the seminal paper "A Framework for
Representing Knowledge". A frame is a method of representation that organizes knowledge into
chunks. However, Minsky did not formalize the frame concept into a mathematical model;
he explicitly argued in favor of staying flexible and nonformal. After that, several AI
systems were built on the frame representation, such as the KL-ONE system [Brachman85b]
and the KRL language [Bobrow77].
Generally speaking, a frame is a structure that describes the internal structure of an object.
Frames are composed of slots (attributes) for which fillers (scalar values, references to
other frames, or procedures) have to be specified or computed. A slot can be expressed as a
tuple of the form (object, slot, filler). It is easy to transform such a tuple into a logic
predicate of the form slot(object, filler).
A frame that inherits from another frame is called a sub-frame. The inheritance property may
be expressed as the "is" relation between frames in the form is(object, object). Inheritance
organizes frames into a hierarchy. The frame concept, which organizes statements into
object-based structures, is easy for humans to read and write. It was later adopted by
object-oriented programming languages to make programs easier to write. The following
example shows a frame for kodairai, which is a species of butterfly.
<object name="kodairai">
  <is>butterfly</is>
  <texture>eyespots</texture>
</object>
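The correspondence between a frame, its (object, slot, filler) tuples and the predicates slot(object, filler) can be sketched in a few lines of Python. This is an illustration under assumptions, not code from the original system: the helper names frame_to_tuples and tuple_to_predicate are hypothetical, and the frame is assumed to be written as well-formed XML with quoted attribute values.

```python
import xml.etree.ElementTree as ET

FRAME = """
<object name="kodairai">
  <is>butterfly</is>
  <texture>eyespots</texture>
</object>
"""


def frame_to_tuples(frame_xml):
    """Turn every slot of the frame into an (object, slot, filler) tuple."""
    frame = ET.fromstring(frame_xml)
    obj = frame.get("name")
    return [(obj, slot.tag, slot.text) for slot in frame]


def tuple_to_predicate(triple):
    """Rewrite (object, slot, filler) as the predicate slot(object, filler)."""
    obj, slot, filler = triple
    return f"{slot}({obj}, {filler})"


tuples = frame_to_tuples(FRAME)
predicates = [tuple_to_predicate(t) for t in tuples]
print(predicates)
```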
Semantic networks concentrate on categories of objects and the relations between them
[Quillian66] [Wood75]. The basic idea of a semantic network is to draw graphs that represent
the relationships between objects. In these graphs, a link may be represented as a tuple of the
form (object, relation, object). It is easy to transform such a tuple into a logic predicate of the
form relation(object, object).
Scripts are used to describe a scenario involving many objects [Schank77]. The steps in the
scenario are described as lattices. A step may be triggered when its preceding steps are
finished. For example, the following script shows the process of making a cup of coffee.
1. Put an empty cup on the table. → put_on(cup, table)
2. Put coffee powder into the cup. → put_into(coffee powder, cup)
3. Fill hot water into the cup. → fill(hot water, cup)
4. Mix the powder and the water with a spoon. → mix_by_spoon(powder, water)
5. Process finished.
In fact, we may translate object-based representations into logic rules. The difference between
the logic-based representation and the object-based representation lies in the organizing principle.
The logic-based representation encodes knowledge into logic expressions, and the object-based
representation organizes these expressions into frames, semantic networks and scripts.
Reasoning is not a standardized part of object-based systems [Ifikes85]. The information stored in
frames has often been treated as the "database" of the knowledge system, whereas the control of
reasoning has been left to other parts of the system. The most popular and effective reasoning
mechanism for frames is production rules [Stefik83] [Kehler84]. Production rules are rules of the
form pattern/action. They are a subset of predicate calculus with an added prescriptive component
indicating how the information in the rules is to be used during reasoning. Whenever a pattern is
matched, the production system triggers the corresponding frame, and the action is performed to do
something that helps the understanding process. After the pattern/action process, some values are filled
into frames as the conclusion. The reasoning process in an object-based system that maps natural language
text into the slot-tree ontology is what we call the "slot-filling process".
Both the logic-based representation and the object-based representation may be used to represent an
ontology and to reason on the basis of that ontology. Reasoning is helpful but not necessary for
computers to understand natural language. However, computers do need a process that maps natural language
text into an ontology in order to understand it.
The mapping process for XML documents is easier than the mapping process for natural language
documents, because tags provide semantic contexts that make the mapping easier. In chapter
3, we will propose a slot-filling algorithm that maps XML documents into the slot-tree ontology in order to
reduce the semantic gap between human and computer on XML.
2.5 Discussion
In this chapter, we reviewed the research background of XML, information retrieval and ontology.
However, current XML retrieval technology is not yet good enough and needs further research.
In fact, researchers in the information retrieval community have recently been working hard to
develop methods for XML retrieval.
In the summary of the ACM SIGIR 2000 workshop on XML and information retrieval, Carmel et al.
[Carmel00] discuss several unsolved problems for XML retrieval. We list these problems as
follows.
1. Using an XML query language is likely to improve precision. However, XML query
languages are not easy for people to use. How can they be made easier to use?
2. A heterogeneous XML collection contains documents that come from different
sources, so the tag names and document structures may be different and idiosyncratic.
How can heterogeneous XML documents be retrieved?
3. XML is specified using Unicode. The tag names coming from different sources may be
given in different languages. Since a word can have more than one translation or even no
translation, how to find or make the appropriate translation is an interesting issue for
multilingual information retrieval. How can multilingual XML documents be retrieved?
4. Browsing XML retrieval results should be better than browsing text documents. How should
the retrieval results be organized for browsing? Is it the entire document, a part of the XML tree,
or perhaps a graph?
In this thesis, we try to resolve these problems by developing an XML retrieval system. The
system is mainly designed to reduce the semantic gap between human and computer. In this
system, we develop programs that let the computer understand XML documents easily, and that let
humans write queries and browse query results easily. These methods are based on an ontology
representation called the slot-tree. We describe these methods in the next part. In chapter 3, we
show how to represent a slot-tree and how to map XML documents into it. In chapter 4, we
show how to use the slot-tree ontology to help the XML retrieval process. In chapter 5, we
design a method to build slot-trees automatically.
Part 2 : Slot-Tree Based Methods for XML Retrieval
3 Slot-Tree Ontology and Slot-Filling Algorithm
In part 1, we introduced our motivation, goals and research approaches in chapter 1, and
reviewed the related research on XML, information retrieval and ontology in chapter 2. In part
2, we show our method for reducing the semantic gap of XML retrieval. In order to reduce
the semantic gap, an ontology called the slot-tree is used to help the XML retrieval process in our
system. In this part, we focus on the usage of the slot-tree ontology in our XML retrieval system.
Part 2 contains three chapters. In chapter 3, we will describe the syntax, semantics and
usage of slot-tree. In chapter 4, we will use the slot-tree to reduce the semantic gap in the XML
retrieval process. In chapter 5, we will show how to construct the slot-tree ontology, and design
a mining algorithm to build the slot-tree ontology automatically.
This chapter contains four sections. In section 3.1, we outline the structure of the slot-tree
ontology and its usage in the process of understanding XML documents. In section 3.2, we
describe the syntax and semantics of the slot-tree ontology. In section 3.3, we design the
slot-filling algorithm, which maps XML documents into the slot-tree ontology and is the core of
the understanding process. Finally, we discuss the slot-tree ontology and the slot-filling
algorithm in section 3.4.
3.1 Introduction
In this chapter, we design an object-based representation called the slot-tree ontology, and then use the
slot-tree to understand XML documents. As we said in section 2.4, the word "understand" used
here means the process of mapping the text in XML into the slot-tree. This enables a computer to trigger
the corresponding procedure to do what the user would like it to do, such as answering questions or
retrieving the documents that the user wants. We outline the slot-tree ontology and the slot-filling
algorithm that maps XML documents in this section, and describe the details of the slot-tree in section 3.2
and of the slot-filling algorithm in section 3.3.
The slot-tree representation is an object-based approach that, like the frame, represents the internal
structure of objects. We surveyed the object-based approaches to knowledge representation, including frames,
semantic networks and scripts, in section 2.4. Generally speaking, frames are used to represent the internal
structures of objects, semantic networks are used to represent relations between objects, and scripts are
used to represent scenarios that involve many objects. The object-based approach is conceptually consistent
with our notion of the world, because the world as we perceive it is composed of many objects. The
difference between a slot-tree and a frame is that a slot in a slot-tree contains a set of paths that locate
nodes in XML documents. A path in a slot is in the XPath format that was described in section 2.1. For example,
"//butterfly//color" is used to locate "color" nodes in the block of "butterfly".
In our XML retrieval system, a slot-tree is encoded in XML format, as in the following
example.
Example 3.1 A simple slot-tree in XML format
<s slot="butterfly" path="//butterfly">
  <s slot="color" path="//butterfly//adult//color">
    <v value="brown"/>
    <v value="white"/>
  </s>
</s>
Based on the slot-tree ontology, we design a slot-filling algorithm that maps
XML documents into the slot-tree ontology in the process of understanding. In the slot-filling
algorithm, a path in a slot is used, like a hand, to grab a block in the XML document, and a matching
process is used to map the content of the block into the slot. After the matching process, the words that
matched any value of a slot are filled into the slot. The filled slot-tree that results from the matching
process is then used as the semantic structure of the XML document. We show the details of the
slot-tree ontology in section 3.2 and the details of the slot-filling algorithm in section 3.3.
3.2 Slot-Tree Ontology
In this section, we propose an ontology representation called the slot-tree. Like the frame described in
section 2.4, the slot-tree is an object-based representation that describes the internal structure of an
object. We describe the syntax and semantics of the slot-tree and give examples in this
section.
Definition 3.1 : A slot-tree is a tree (T) in which each node contains a tuple (s, Ps, Vs), where
s is the name of the slot, Ps is a set of paths, and Vs is a set of values. The name of a slot is a label that
uniquely identifies the slot. A path (p) in Ps is a string in XPath format that is used to locate nodes in
XML documents. A value (v) in Vs is a term that contains a set of semantically identical words or
patterns.
Figure 3.1 shows the structure of a slot-tree. The {p} in each node represents a set of paths and the
{v} in each node represents a set of values. For a slot-tree that represents the internal structure of an
object, a slot in the tree may be used to represent a property of the object, such as "color", "shape",
"texture", "size", etc. A value in the slot is a possible value of the property. For example, "black" is
a possible value in the "color" slot.
Figure 3.1. The structure of a slot-tree
A slot-tree can be encoded as an XML document in which each slot is encoded as a node with the tag "s".
The attribute "slot" of the node is the label of the slot. The attribute "path" contains a set of paths in
XPath format that encode the {p} part of each slot. A node with the tag "v" is a value that encodes the
{v} part of each slot. Example 3.2 shows a slot-tree for butterflies in XML format and figure 3.2 shows the
graph representation of the example.
Example 3.2. A slot-tree for butterflies in XML format
<s slot="butterfly" path="//butterfly">
  <s slot="name" path="//butterfly//name"/>
  <s slot="adult" path="//butterfly//adult">
    <s slot="color" path="//butterfly//adult//color">
      <v value="black"/>
      <v value="brown"/>
      <v value="black&white"/>
    </s>
    <s slot="texture" path="//butterfly//adult//color">
      <v value="lines"/>
      <v value="spots"/>
    </s>
  </s>
</s>
[Figure: tree diagram in which every node is labeled s {p} {v}]
Figure 3.2 : The graph representation of slot-tree
Formally, the syntax of the slot-tree is defined by the grammar in figure 3.3. A slot (S) contains a label
(NAME), a set of paths (P*) and a set of values (V*). A slot may also contain a set of sub-slots (S*). A
value (V) contains a label (NAME), a set of keys (KEY*) and a set of matching rules (R*).
S → <s slot="NAME" path="P*"> V* S* </s>
V → <v value="NAME" keys="KEY*" match="R*"/>
NAME → Alphabetical String
KEY → Alphabetical String
where P is a path in XPath format and R is a rule.
Figure 3.3 : The grammar of slot-tree
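To make the (s, Ps, Vs) structure concrete, the following Python sketch reads a slot-tree written in the XML encoding above into a small in-memory tree. It is a minimal illustration, not the implementation used in our system: the class name Slot and the function parse_slot are hypothetical, and the sketch assumes that several paths in one "path" attribute are separated by whitespace, which the grammar does not specify.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Slot:
    name: str                                      # s: the slot label
    paths: list                                    # Ps: XPath strings
    values: list = field(default_factory=list)     # Vs: value labels
    children: list = field(default_factory=list)   # sub-slots (S*)


def parse_slot(node):
    """Build a Slot from an <s> element and, recursively, its sub-slots."""
    return Slot(
        name=node.get("slot"),
        paths=node.get("path", "").split(),        # assumed separator: whitespace
        values=[v.get("value") for v in node.findall("v")],
        children=[parse_slot(child) for child in node.findall("s")],
    )


SLOT_TREE = """
<s slot="butterfly" path="//butterfly">
  <s slot="color" path="//butterfly//adult//color">
    <v value="brown"/>
    <v value="white"/>
  </s>
</s>
"""

tree = parse_slot(ET.fromstring(SLOT_TREE))
print(tree.name, [child.name for child in tree.children])
```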
The symbol P used in figure 3.3 is a path in the format of the XML Path Language (XPath). XPath
is a specification proposed by the World Wide Web Consortium (W3C) that is used to locate nodes in XML
documents. The symbol "/" is used to match child nodes, and the symbol "//" is used to match nodes at any
depth inside the current node. A name prefixed with the "@" symbol denotes an attribute. Example 3.3 shows
several examples of XPath.
Example 3.3 : Examples of XML path language (XPath)
a. /butterfly/adult/color
b. //insect//color
c. //insect[@type='butterfly']//color
The path of example 3.3.a is used to locate "color" nodes that are children of an "adult" node, where
the "adult" node is a child of the "butterfly" node. The path of example 3.3.b is used to locate any
"color" nodes that are in the block of an "insect" node. The path of example 3.3.c is used to locate any
"color" nodes that are in the block of an "insect" node whose attribute "type" has the value "butterfly".
For more about XPath, please see the XPath specification at http://www.w3.org/TR/xpath.
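The three kinds of path can be tried out with the XPath subset supported by Python's standard xml.etree.ElementTree module. The sketch below is illustrative only. The sample document and the helper texts are made up for this example, and because ElementTree accepts only relative paths, each path is written relative to a wrapping "collection" element (for instance "./butterfly/adult/color" instead of "/butterfly/adult/color").

```python
import xml.etree.ElementTree as ET

# A made-up collection used only to exercise the three path styles.
root = ET.fromstring("""
<collection>
  <butterfly>
    <adult><color>brown</color></adult>
    <larva><color>green</color></larva>
  </butterfly>
  <insect type="butterfly"><adult><color>white</color></adult></insect>
  <insect type="moth"><adult><color>grey</color></adult></insect>
</collection>
""")


def texts(path):
    """Return the text of every node located by the path."""
    return [node.text for node in root.findall(path)]


a = texts("./butterfly/adult/color")                 # child steps only
b = texts(".//insect//color")                        # any depth below insect
c = texts(".//insect[@type='butterfly']//color")     # attribute predicate
print(a, b, c)
```

Path a skips the "larva" color because "/" follows child links only, while b and c reach "color" nodes at any depth below "insect".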
A rule in the slot-tree is used to match a string in XML. The syntax of a rule (R) is further
defined by the grammar in figure 3.4. A rule may contain the "&" operator, the "|" operator and
the "-" operator. The symbol E is an expression that is part of a rule. Each expression contains
either a single literal L or a pattern of the form "L..L".
R → (R & R)
R → (R | R)
R → E
R → -E
E → L {..L}
Figure 3.4 : The grammar of rules in slot tree
The "&" operator is a logical "and". A rule "R1 & R2" is satisfied if and only if both
R1 and R2 are satisfied. The "|" operator is a logical "or". A rule "R1 | R2" is satisfied if
and only if R1 or R2 is satisfied. The "-" operator is a negation, as example 3.4.b illustrates: a rule
"-E" is satisfied if and only if E is not satisfied. The ".." symbol in the syntax of E means a far
connection. A rule "L1 .. L2" is satisfied if an L1 string is followed by an L2 string in one sentence.
The following example shows several rules.
Example 3.4 : Matching rules in slot-tree
a. R = white & black
b. R = lines & -spots
c. R = black .. head
The rule of example 3.4.a matches a sentence like "a butterfly that is mixed of black and white
color", or "a butterfly with white wing and black head". The rule of example 3.4.b matches a
sentence such as "a butterfly with brown lines on wings", but cannot match the sentence "a butterfly
with brown lines and white spots on wings". The rule of example 3.4.c matches a sentence such
as "a butterfly with black color on head", but cannot match the sentence "a butterfly has green
head and black wings".
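The behaviour of the rules in example 3.4 can be checked with a small evaluator. The Python sketch below is an illustration under assumptions rather than the matcher used in our system: rules are given as already-parsed nested tuples instead of strings, the operator names "and", "or", "not" and "far" are hypothetical labels for "&", "|", "-" and "..", and a literal is taken to match when it occurs as a case-insensitive substring of the sentence.

```python
def matches(rule, sentence):
    """Evaluate a parsed rule against one sentence."""
    s = sentence.lower()
    if isinstance(rule, str):                 # literal L
        return rule.lower() in s
    op, *args = rule
    if op == "and":                           # R1 & R2
        return all(matches(r, s) for r in args)
    if op == "or":                            # R1 | R2
        return any(matches(r, s) for r in args)
    if op == "not":                           # -E
        return not matches(args[0], s)
    if op == "far":                           # L1 .. L2: L1 followed by L2
        pos = 0
        for literal in args:
            idx = s.find(literal.lower(), pos)
            if idx < 0:
                return False
            pos = idx + len(literal)
        return True
    raise ValueError(f"unknown operator: {op}")


rule_a = ("and", "white", "black")
rule_b = ("and", "lines", ("not", "spots"))
rule_c = ("far", "black", "head")

print(matches(rule_a, "a butterfly with white wing and black head"))
print(matches(rule_b, "a butterfly with brown lines and white spots on wings"))
print(matches(rule_c, "a butterfly has green head and black wings"))
```

Substring matching is a simplification: it lets the key "line" match "lines", but it would also match unrelated words that happen to contain the key.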
3.3 Slot-Filling Algorithm
A slot in a slot-tree is a container that may contain several fillers. A filler can be a value or a
sub-slot. A slot-filling algorithm is a method for mapping fillers into slots. In this section, we
describe how to map an XML document into the slot-tree ontology.
Example 3.5 : An XML document for a butterfly
<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
    <size>Middle size, 50-60mm</size>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>
Example 3.6. A slot-tree for butterflies
<s slot="butterfly" path="//butterfly">
  <s slot="name" path="//butterfly//name" type="copy"/>
  <s slot="adult" path="//butterfly//adult">
    <s slot="color" path="//butterfly//adult//color">
      <v value="black"/>
      <v value="brown"/>
      <v value="black&white"/>
    </s>
    <s slot="texture" path="//butterfly//adult//color">
      <v value="lines"/>
      <v value="spots"/>
    </s>
  </s>
</s>
One simple way to fill values into the corresponding slot is by copying. A copy-slot is a slot with the
attribute type="copy" in it. A copy-slot is used to extract a value from a specified field. In the
slot-filling process for example 3.5, the value "Athyma_fortuna_kodairai" is filled into the "name" slot
of example 3.6 just by copying.
Another way to fill values into slots is by keyword matching. A value is filled into a slot if the
value matches a sentence in the target XML document. The following example shows the process of
matching the "spotted" value of the "texture" slot against the "color" nodes in the XML document.
Example 3.7 An example of filling a value into a slot by keyword matching
Texture block :
<color> Brown background color, Eye spots in white color </color>
Texture Slot :
<s slot="texture" path="//butterfly//texture">
  <v value="single color" keys="single, mono, uniform"/>
  <v value="spotted" keys="spot"/>
  <v value="lines" keys="line"/>
</s>
→ Matching result : <s slot="texture" values="spotted"/>
A slot-filling algorithm is designed to fill values into the slots of a slot-tree. In order to
understand an XML document, we use the slot-filling algorithm to fill the XML document
into the slot-tree. The output of our slot-filling algorithm is a filled slot-tree, where each node
in the tree is filled with values. For a given XML document d, ds is the part of the document that
is covered by slot s. The output of the slot-filling algorithm is a set of slot-value (s,v) pairs,
where w(v, ds) is the weight with which value v matches ds and ε is a threshold.
Slot-Filling(d, T) = { (s,v) | s ∈ T, v ∈ Vs, w(v, ds) > ε }
The following figure shows the pseudocode of the slot-filling algorithm. Here M(T) is the set of
(slot, path) pairs in T, and (p, c) ∈ d means that path p locates content c in d.
Algorithm Slot-Filling(d, T)
  SV = {}
  for each s in T
    ds = {c | (s, p) ∈ M(T), (p, c) ∈ d}
    for each v in Vs
      if w(v, ds) > ε then put (s, v : w(v, ds)) into SV
    end for
  end for
  return SV
Figure 3.5 : The pseudo code of slot-filling algorithm
The time complexity of the slot-filling algorithm is ∑s |ds|*|Vs|, where |ds| is the size of ds and |Vs|
is the number of values in slot s.
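The pseudocode of figure 3.5 can be turned into a short runnable sketch. The Python code below is one possible reading and not the implementation used in our system. Several details are assumptions made for illustration: the weight w(v, ds) is taken to be the number of times any key of v occurs in ds, the threshold ε defaults to 0, the function name slot_filling is hypothetical, and the "texture" slot is pointed at the "texture" node of the sample document. Because ElementTree accepts only relative paths, the document is placed under a synthetic wrapper element and each slot path is prefixed with ".".

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
  </adult>
</butterfly>
"""

SLOT_TREE = """
<s slot="butterfly" path="//butterfly">
  <s slot="color" path="//butterfly//adult//color">
    <v value="brown" keys="brown"/>
    <v value="black" keys="black"/>
  </s>
  <s slot="texture" path="//butterfly//adult//texture">
    <v value="spotted" keys="spot"/>
    <v value="lines" keys="line"/>
  </s>
</s>
"""


def slot_filling(doc_xml, tree_xml, eps=0):
    """Return (slot, value, weight) for every value whose weight exceeds eps."""
    wrapper = ET.Element("wrapper")            # lets ".//butterfly" find the root
    wrapper.append(ET.fromstring(doc_xml))
    filled = []
    for s in ET.fromstring(tree_xml).iter("s"):
        # ds: the text of every block located by one of the slot's paths
        blocks = ["".join(node.itertext()).lower()
                  for path in s.get("path", "").split()
                  for node in wrapper.findall("." + path)]
        for v in s.findall("v"):
            keys = [k.strip().lower()
                    for k in v.get("keys", v.get("value")).split(",")]
            # assumed weight: how often any key occurs in ds
            weight = sum(block.count(key) for block in blocks for key in keys)
            if weight > eps:
                filled.append((s.get("slot"), v.get("value"), weight))
    return filled


filled = slot_filling(DOCUMENT, SLOT_TREE)
print(filled)
```

The two nested loops visit each value of each slot once and scan ds for it, which corresponds to the ∑s |ds|*|Vs| cost stated above.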
3.4 Discussion
In this chapter, we described the slot-tree ontology in section 3.2 and the slot-filling
algorithm in section 3.3. The slot-filling algorithm is used to map XML documents into the
slot-tree ontology in the understanding process. In chapter 4, we will use the slot-tree and the
slot-filling algorithm to develop an ontology-based XML retrieval method, and use the
method to reduce the semantic gap between human and computer.
4 An Ontology-Based Approach for XML Querying, Retrieval and
Browsing
In the previous chapter, we showed the slot-tree ontology and its usage. A mapping
between the slot-tree and XML documents is built by the slot-filling algorithm. The
mapping process helps our XML retrieval system reduce the semantic gap between human
and computer. In this chapter, we outline the relationship between our XML retrieval
system and the slot-tree ontology, and show the power of the slot-tree.
In section 4.1, we describe the process of our XML retrieval system and outline the
important components of the system. We describe how to represent an XML document
for retrieval in section 4.2, and describe the index structure in section 4.3. After that, the query
interface is described in section 4.4 and the ranking strategies are described in section 4.5. We then
show how to organize retrieval results for browsing in section 4.6. Finally, we
discuss our XML retrieval system in section 4.7.
4.1 Introduction
Two technologies are needed in the process of searching for documents: retrieving and browsing.
Retrieving is the process of retrieving documents from a collection. After that, the retrieved documents
should be organized for browsing. Browsing is the process of reading and traversing the collection of
documents. We usually use retrieving and browsing techniques alternately in a searching process. A
model that integrates retrieving and browsing may be used to improve the quality of searching.
Our research focuses on using ontology to improve the XML retrieval and browsing process. We
will focus on the following questions in this chapter.
1. How to encode XML documents for retrieval?
2. How to use slot-tree ontology to improve the efficiency of querying?
3. How to use slot-tree ontology to improve the efficiency of retrieval?
4. How to use slot-tree ontology to improve the efficiency of browsing?
Figure 4.1 shows a scenario of our approach to retrieving XML documents. First, a user builds a
query by clicking or typing on slots in the query interface, and then submits the query to the XML retrieval
system. The retrieval system retrieves XML documents, and then summarizes them for the user to browse.
Figure 4.1 : A scenario of our XML retrieval system
The ontology in figure 4.1 is the slot-tree ontology that was described in chapter 3. It is the core of our
XML retrieval system. The slot-tree ontology is used to build the query interface, to retrieve documents
and to summarize the retrieved documents for browsing. The XML queries, the XML documents and the query
interface are the important objects in our system. Retrieval and extraction are the important processes in
our system. We introduce these objects and processes in this chapter.
4.2 XML Documents
An XML document is encoded as tree-structured text. Figure 4.2 shows an XML document that
describes a butterfly.
<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
    <size>Middle size, 50-60mm</size>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>
Figure 4.2 : An XML document of butterfly
For conceptual simplicity, the XML example above is expressed as a sequence of (path,
value) pairs that describe the object.
(butterfly, )
(butterfly@about, Athyma_fortuna_kodairai.jpg)
(butterflyadult, )
(butterflyadulttexture, There are some eye spots in each wing)
(butterflyadultcolor, Brown background color, Eye spots in white color)
(butterflyadultsize, Middle size, 50-60mm)
(butterflyadult, )
(butterflygeography, )
(butterflygeographytaiwan, North-Taiwan, 1000-2000meters mountain area)
(butterflygeographyglobal, Central China Area)
(butterflygeography, )
(butterfly, )
Figure 4.3 : The (path, value) expression of an XML document
The (path, value) expression can be thought of as an object concept model. A path specifies a
property of an object. A value specifies a value for the property. The object concept model above is a
binary relation that may be expressed as path(object, value), so a path represents a logical predicate with
two arguments. An object in this model is expressed as a set of (path, value) pairs.
Storing Structure
The (path, value) representation does not reflect the tree structure of an XML document. In order to
represent the tree structure, we use a pair of indices to represent the beginning and end of each block. In
other words, we extend each (path, value) pair with a (begin, end) pair to represent the begin node and end
node of each block. The butterfly example above is expressed as the following structure.
1, 12 (butterfly, )
2, 2 (butterfly@about, Athyma_fortuna_kodairai.jpg)
3, 7 (butterflyadult, )
4, 4 (butterflyadulttexture, There are some eye spots in each wing)
5, 5 (butterflyadultcolor, Brown background color, Eye spots in white color)
6, 6 (butterflyadultsize, Middle size, 50-60mm)
7, 7 (butterflyadult, )
8, 11 (butterflygeography, )
9, 9 (butterflygeographytaiwan, North-Taiwan, 1000-2000 meters mountain area)
10, 10 (butterflygeographyglobal, Central China Area)
11, 11 (butterflygeography, )
12, 12 (butterfly, )
Figure 4.4 : The storing structure of an XML document
In the example above, each node is led by a (begin, end) pair. The begin index of a node is
always identical to the ID of the node. A block with a (begin, end) pair covers all nodes
between the begin node and the end node. For example, the first block 1,12 (butterfly,) covers nodes
from 1 to 12, and the third block 3,7 (butterflyadult,) covers nodes from 3 to 7. In this way, the tree
structure of XML is expressed as the cover/covered relations between nodes.
The begin-end pair structure fully reflects the hierarchical structure of XML documents. In
our XML storage system, we store the (begin, end) pairs in a table instead of storing them as a tree.
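The storing structure of figure 4.4 can be generated with a short traversal. The following Python sketch is an illustration, not the code of our storage system: the function names flatten and covers are hypothetical, and the separator between path steps is left as a parameter (empty by default) so that the output reproduces the paths as they are printed in the figure.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
    <size>Middle size, 50-60mm</size>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>
"""


def flatten(xml_text, sep=""):
    """Return rows (begin, end, path, value) in document order."""
    rows = []

    def visit(elem, prefix):
        path = prefix + elem.tag
        text = (elem.text or "").strip()
        if len(elem) == 0 and not elem.attrib:   # leaf: a single row
            n = len(rows) + 1
            rows.append([n, n, path, text])
            return
        open_row = [len(rows) + 1, None, path, text]
        rows.append(open_row)                    # row that opens the block
        for name, value in elem.attrib.items():
            n = len(rows) + 1
            rows.append([n, n, path + "@" + name, value])
        for child in elem:
            visit(child, path + sep)
        n = len(rows) + 1
        rows.append([n, n, path, ""])            # row that closes the block
        open_row[1] = n                          # the block ends at its closing row

    visit(ET.fromstring(xml_text), "")
    return [tuple(row) for row in rows]


def covers(block, node_id):
    """A block covers every node whose ID lies between its begin and end."""
    begin, end = block[0], block[1]
    return begin <= node_id <= end


rows = flatten(DOCUMENT)
for row in rows:
    print(row)
```

Because the rows form a flat list, the cover test needs only two integer comparisons, which is what makes a table a convenient replacement for the tree.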
4.3 Indexing structure
Based on the Path Vector Space Model (PVSM, see section 4.5), we index (path, term) pairs (p,t) instead of
single terms (t) in our XML retrieval system. There are several data structures for full-text indexing, such
as the inverted file, the signature file and the Patricia trie. For simplicity, we use the inverted file as
the index structure of our XML retrieval system.
The following example is a simple XML document. We will show how to index this
XML document, for both text fields and number fields.
Example 4.4 An XML document for butterfly
<butterfly about="kodairai">
  <adult>
    <color>brown</color>
    <texture>spot</texture>
    <size>50-60mm</size>
  </adult>
</butterfly>
Indexing Text : The inverted file is currently stored in a relational database. The following figure
shows the inverted file for the example above.
#path, #term                      #object list
...                               ...
#butterflyadultcolor, #brown      ..., #kodairai, ...
...                               ...
#butterflyadulttexture, #spot     ..., #kodairai, ...
...                               ...
Figure 4.5 An example of text index in inverted-file format
Indexing Number : Traditional full-text indexing technology does not index numbers. In our system,
number indexing is important for the browsing process, because the number index lets us sort the search
results in a specified order. In the indexing process, we extract the numbers from XML
documents and put them into a number table as follows.
#object, #path                    Number
...                               ...
#kodairai, #butterflyadultsize    50
#kodairai, #butterflyadultsize    60
...                               ...
Figure 4.6 An example of number index
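Both indexes can be built in a single pass over the leaf nodes of each document. The Python sketch below is illustrative only. Our system stores the inverted file in a relational database, whereas the sketch keeps it in memory, and the names leaf_fields and build_indexes, the tokenizer, and the choice of dictionary keys as object identifiers are all assumptions made for this example. Path steps are concatenated without a separator to match the paths as printed in the figures.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def leaf_fields(xml_text):
    """Return (path, text) for every leaf element, in document order."""
    def visit(elem, prefix):
        path = prefix + elem.tag
        if len(elem) == 0:
            yield path, (elem.text or "")
        for child in elem:
            yield from visit(child, path)
    return list(visit(ET.fromstring(xml_text), ""))


def build_indexes(objects):
    text_index = defaultdict(list)   # (path, term) -> list of object ids
    number_index = []                # rows of (object id, path, number)
    for obj_id, xml_text in objects.items():
        for path, text in leaf_fields(xml_text):
            # assumed tokenizer: runs of digits or runs of letters
            for token in re.findall(r"\d+|[a-z]+", text.lower()):
                if token.isdigit():
                    number_index.append((obj_id, path, int(token)))
                elif obj_id not in text_index[(path, token)]:
                    text_index[(path, token)].append(obj_id)
    return text_index, number_index


OBJECTS = {
    "kodairai": """
<butterfly about="kodairai">
  <adult>
    <color>brown</color>
    <texture>spot</texture>
    <size>50-60mm</size>
  </adult>
</butterfly>
""",
}

text_index, number_index = build_indexes(OBJECTS)
print(text_index[("butterflyadultcolor", "brown")])
print(number_index)
```

Sorting retrieval results by a numeric field then amounts to ordering the objects by their rows in number_index for the chosen path.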
4.4 Query Language and Query Interface
XML may be used to encode metadata instead of data. Metadata is data that is used to describe data.
We may use metadata to describe objects such as audio, video, people, etc. Based on metadata, we may
index images, video and audio in text format, so that objects can be queried by number and text fields in
our XML retrieval system.
In our system, we design a program that transforms a slot-tree into an HTML-based query interface. A
template in Extensible Stylesheet Language Transformations (XSLT) is used to do the transformation.
In our query interface, a value can be expressed as a string, a range of numbers, or an
object. A user may specify the value for a slot just by clicking a value or an icon in the slot. Our
retrieval system is used to retrieve not only text-based documents but also
images and video. The following figure shows a query interface for butterflies.
Figure 4.7 : The Query Interface for Butterflies
A user may select a slot with one click, and then select a value in the slot or type keywords into the
slot. The user may also specify a field for sorting. A query is built and submitted to the XML retrieval
system when the user presses the submit button.
A query in our system is a filled slot-tree. The following example shows the query "find all
butterflies with broken wing and brown color".
<s slot="butterfly" path="//butterfly">
  <s slot="color" path="//butterfly//adult//color" keys="brown"/>
  <s slot="shape" path="//butterfly//adult//shape" keys="broken"/>
</s>
4.5 Ranking Strategy
The ranking strategy for XML retrieval is closer to that of a database than to that of text retrieval. We
may rank the retrieval result by any field in the XML documents. For example, we may sort the retrieval
result by the size of the butterflies. We may also sort the retrieval result by the similarity between
document and query or by the importance of the documents. In this section, we show the ranking strategies
that are used to sort the retrieval results.
Ranking by Field
To organize the retrieval results for browsing, a user may specify the ranking
strategy. As in a database, a user may specify any field by which to sort the results. A
field can be sorted numerically or alphabetically as strings, in either
increasing or decreasing order. The variety of ranking strategies gives users a way to
organize the retrieval results into a list for browsing.
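The difference between sorting a field as numbers and as strings can be illustrated with a short sketch. It assumes that each retrieved document has already been reduced to a dictionary of field values; the function name `rank_by_field` and the sample records are made up for the example.

```python
def rank_by_field(results, field, as_number=False, descending=False):
    """Sort results by one field, compared as numbers or as strings."""
    convert = float if as_number else str.lower
    return sorted(results, key=lambda doc: convert(doc[field]),
                  reverse=descending)

# Hypothetical retrieval results; sizes are kept as text, as in XML.
butterflies = [
    {"name": "A", "size": "55"},
    {"name": "b", "size": "120"},
    {"name": "C", "size": "30"},
]

by_size = rank_by_field(butterflies, "size", as_number=True, descending=True)
by_text = rank_by_field(butterflies, "size")
print([doc["name"] for doc in by_size])
print([doc["name"] for doc in by_text])
```

Sorted as numbers, 120 comes first in decreasing order; sorted as strings, "120" precedes "30", which is why the field type has to be part of the ranking strategy.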
Ranking by Importance
In section 2.2, we introduced how to measure the importance of a web page based on hyperlinks.
Hyperlinks in XML may also be used to decide the importance of an XML document. In our XML retrieval
system, ranking by importance is the default ranking strategy. A simple way to measure the
importance of an XML document is to count the references to it. We use this strategy
in our system for simplicity. In the future, we will try to incorporate the random-walk model and the
hub-authority model to measure the importance of XML documents in our XML retrieval system.
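Reference counting can be sketched as below, assuming the links between documents have already been extracted into a mapping from each document to the documents it references. The function name and the file names are illustrative only; ignoring self-references and duplicate links is a choice made for the sketch.

```python
from collections import Counter

def importance_by_references(links):
    """Count, for every document, how many other documents reference it."""
    counts = Counter()
    for source, targets in links.items():
        counts[source] += 0  # make sure unreferenced documents are listed
        for target in set(targets):  # count each referring document once
            if target != source:     # ignore self-references
                counts[target] += 1
    return counts

# Hypothetical link structure: document -> documents it refers to.
links = {
    "a.xml": ["b.xml", "c.xml"],
    "b.xml": ["c.xml"],
    "c.xml": [],
}
counts = importance_by_references(links)
ranked = sorted(counts, key=lambda doc: (-counts[doc], doc))
print(ranked)
```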
Ranking by Similarity
For text retrieval, a ranking strategy based on the vector space model (VSM) and the TFIDF weighting
function performs well. A brief survey of VSM and TFIDF was given in section 2.3. However, an
XML object is not just a sequence of words like a text; it also contains many tags. For XML, we
extend VSM by attaching a path to each term; the result is called the Path Vector Space Model (PVSM). An XML
document (d) can be expressed as the following vector v(d).
$$v(d) = (d_{p_1,t_1}, \ldots, d_{p_1,t_k}, \ldots, d_{p_n,t_1}, \ldots, d_{p_n,t_k})$$
where $d_{p_i,t_j}$ is the weight of the $(p_i, t_j)$ pair in document object d.
When several paths have similar meanings, we may cluster them into a slot for retrieval. The model after
path clustering is called the Slot Vector Space Model (SVSM).
$$v(d) = (d_{s_1,t_1}, \ldots, d_{s_1,t_k}, \ldots, d_{s_n,t_1}, \ldots, d_{s_n,t_k})$$
where $d_{s_i,t_j}$ is the weight of the $(s_i, t_j)$ pair in document object d.
We may use the cosine coefficient to measure the similarity between queries and documents in SVSM,
just as in VSM.
$$\mathrm{Similarity}(d, q) = \frac{d \cdot q}{\|d\| \, \|q\|}$$
However, we do not know what kind of weighting function is good for measuring the value $d_{s_i,t_j}$. Is TFIDF
good enough in the SVSM, or do we need another measure? In our system, we express $d_{s_i,t_j}$ as the
product of $w_{s_i,t_j}$ and $tf_{s_i,t_j}$, where $tf_{s_i,t_j}$ is the term frequency of term $t_j$ in slot $s_i$, and $w_{s_i,t_j}$ is the
weighting coefficient.
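A small sketch of the SVSM similarity follows. It represents a slot vector as a dictionary keyed by (slot, term) pairs and uses a weighting coefficient of 1.0 wherever none is given; that default, and the names `slot_vector` and `cosine`, are assumptions for illustration, since the text leaves the choice of weighting function open.

```python
import math
from collections import Counter

def slot_vector(slot_terms, weights=None):
    """Build {(slot, term): w * tf} from a {slot: [terms]} mapping."""
    weights = weights or {}
    tf = Counter((slot, term)
                 for slot, terms in slot_terms.items()
                 for term in terms)
    # Assumed default: weighting coefficient 1.0 for every pair.
    return {pair: weights.get(pair, 1.0) * count for pair, count in tf.items()}

def cosine(d, q):
    """Cosine coefficient between two sparse slot vectors."""
    dot = sum(weight * q.get(pair, 0.0) for pair, weight in d.items())
    norm = (math.sqrt(sum(w * w for w in d.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0

doc = slot_vector({"color": ["brown", "white"], "texture": ["spot"]})
same_slot = slot_vector({"color": ["brown"]})
other_slot = slot_vector({"texture": ["brown"]})
print(cosine(doc, same_slot), cosine(doc, other_slot))
```

The term "brown" only contributes to the similarity when it occurs in the same slot in both the query and the document, which is the point of attaching slots to terms.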
A difficulty for retrieval systems today is that too many documents are retrieved. When there are too many
retrieval results to browse, the ranking strategy is used to present what users want first. A user
may like to see large butterflies, important butterflies, or butterflies that are similar to a query. The
variety of ranking strategies in XML retrieval gives users ways to retrieve only what they would like to browse.
4.6 Browsing XML documents
For an information retrieval system, the retrieved documents should be summarized and
organized into a readable format for people to browse. In our XML retrieval system, the slot-filling
algorithm is used to map the retrieved documents into filled slot-trees for browsing. The filled
slot-tree is a well-organized summary of the documents that is easy to browse. In this
section, we show an example in which the slot-filling algorithm fills an XML document into a
slot-tree. Before that, we have to show the slot-tree and the XML document used in the algorithm.
The following example shows a simple slot-tree for butterflies.
<s slot="butterfly" path="//butterfly">
<s slot="name" path="//butterfly//name" />
<s slot="adult" path="//butterfly//adult">
<s slot="color" path="//butterfly//adult//color">
<v value="black" />
<v value="brown" />
<v value="black&amp;white" /></s>
<s slot="texture" path="//butterfly//adult//texture">
<v value="lines" />
<v value="spots" /></s>
</s>
</s>
We may use the slot-filling algorithm to extract values from the following XML document.
<butterfly about="Athyma_fortuna_kodairai.jpg">
<adult>
<texture>There are some eye spots in each wing</texture>
<color>Brown background color, Eye spots in white color</color>
</adult>
<geography>
<taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
<global>Central China Area</global>
</geography>
</butterfly>
The slot-filling algorithm will fill values into slot-tree. The following example shows the result of
filling.
<s slot="butterfly" values="Athyma_fortuna_kodairai">
<s slot="adult">
<s slot="texture" values="spot" />
<s slot="color" values="brown" /></s>
<s slot="geography">
<s slot="Taiwan" values="North" />
<s slot="Global" values="China" /></s>
</s>
The result of the slot-filling algorithm is a filled slot-tree. For a human, it is easier to browse filled slot-trees
than to browse the source documents. The filled slot-tree is a well-organized summary of the XML
document.
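The filling step can be sketched in a few lines of Python. This is a simplified illustration, not the system's implementation: it assumes that a slot path of the form `//a//b//c` matches any element whose ancestor chain contains those tags in order, and that a value is filled when its keyword occurs as a substring of the element text. The names `walk`, `matches`, and `fill` are hypothetical, and the output lists every matching value rather than the single values shown in the example above.

```python
import xml.etree.ElementTree as ET

SLOT_TREE = """
<s slot="butterfly" path="//butterfly">
  <s slot="adult" path="//butterfly//adult">
    <s slot="color" path="//butterfly//adult//color">
      <v value="black"/><v value="brown"/><v value="white"/>
    </s>
    <s slot="texture" path="//butterfly//adult//texture">
      <v value="lines"/><v value="spots"/>
    </s>
  </s>
</s>
"""

DOCUMENT = """
<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
  </adult>
</butterfly>
"""

def walk(element, prefix=()):
    """Yield (tag path, text) for every element in document order."""
    path = prefix + (element.tag,)
    yield path, (element.text or "").strip()
    for child in element:
        yield from walk(child, path)

def matches(slot_path, tag_path):
    """True if the slot path steps occur in order and end at this element."""
    steps = [step for step in slot_path.split("//") if step]
    if not steps or steps[-1] != tag_path[-1]:
        return False
    remaining = iter(tag_path)
    return all(step in remaining for step in steps)

def fill(slot_tree_xml, document_xml):
    """Map each value-bearing slot to the values found in the document."""
    slots = ET.fromstring(slot_tree_xml)
    document = ET.fromstring(document_xml)
    filled = {}
    for slot in slots.iter("s"):
        values = [v.get("value") for v in slot.findall("v")]
        if not values:
            continue
        for tag_path, text in walk(document):
            if matches(slot.get("path"), tag_path):
                found = [v for v in values if v in text.lower()]
                filled.setdefault(slot.get("slot"), []).extend(found)
    return filled

filled = fill(SLOT_TREE, DOCUMENT)
print(filled)
```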
4.7 Discussion
In this chapter, we designed an XML retrieval system to reduce the semantic gap between human
and computer. The slot-tree ontology and the slot-filling algorithm are used in our XML
retrieval system to understand XML documents. Based on the slot-tree, we designed a query
interface to reduce the semantic gap on the query side. The interface helps people write XML
queries easily. Based on the slot-filling algorithm, we designed the slot vector space model
(SVSM) to retrieve XML documents. The SVSM helps the computer understand XML
documents. In addition, the slot-filling algorithm helps the computer extract summaries from
XML documents for browsing. Our goal of reducing the semantic gap between human and
computer is almost achieved by using the slot-tree as a core representation.
We will study two cases of our XML retrieval system in chapter 6 and chapter 7. In
chapter 6, we use the domain of butterflies as an example; in chapter 7, we use the domain of
proteins. For each domain, we will show the slot-tree, the query interface, the retrieved
results, and the summary.
5 The Construction of Slot-Tree Ontology
We introduced the slot-tree ontology in chapter 3, and then showed an XML retrieval
system based on the slot-tree ontology in chapter 4. However, building a slot-tree ontology is not
an easy job. In order to reduce the effort of building the slot-tree ontology, we have developed the
slot-mining algorithm. The slot-mining algorithm is a statistical approach that learns a slot-tree
from a collection of XML documents.
An overview of mining approaches is given in section 5.1. Section 5.2 provides
background on text-mining technology. Section 5.3 shows how a human constructs a slot-tree for a
given XML collection. Section 5.4 describes a method to mine a slot-tree from XML documents,
called the slot-mining algorithm. Finally, we discuss the building of slot-trees in
section 5.5.
5.1 Introduction
The goal of text mining is to find important patterns in a text collection and organize these patterns
into an ontology. In this thesis, we use the ontology to help XML retrieval and browsing. Mining
technology may be used to help us in the construction of the slot-tree ontology. In this section, we
focus on the text-mining problem for XML.
The slot-tree is an ontology representation method. Our mining approach is to build an XML-mining
program that induces values for each slot. In this section, we assume for simplicity that each value is represented by a
term (or a word). Based on this assumption, we developed a statistical program to mine
values for each slot.
The semi-structured property of XML makes the mining program work. For a given XML
collection, the distribution of a term depends heavily on the tags. For example, the following terms
show up more frequently in the <color> block than in other blocks.
<color> black , white , yellow , blue , green </color>
The problem of mining the important values for each slot is called the Slot-Mining Problem. We will
propose a mining algorithm that is based on a simple observation: the distribution of terms depends on
the tag. A term that shows up more frequently under a tag than elsewhere is likely to be a key value for the corresponding slot.
5.2 Background
The goal of text mining is to discover regularities in text data. A text-mining program induces rules
from text or learns a grammar from a corpus; these rules are used in the process of natural language
understanding and information extraction.
For natural language processing, the inside-outside algorithm is a popular tool for learning a probabilistic
context-free grammar (PCFG) from a tree-bank corpus. However, a tree-bank corpus is not easy to build;
building a tree-bank by hand is a time-consuming job. Some other text-learning methods have been
developed to learn from a text corpus. For example, link grammar is a simple head-driven grammar that
was developed to parse natural language sentences, and a learning algorithm has been developed to learn the
link grammar from a text corpus. Besides that, transducer learning induces finite-state
automata from a given text corpus. Learning a transducer is easier than learning a context-free grammar.
For information extraction, a wrapper is a program that learns a simple grammar from structured text,
such as web pages. A wrapper induces rules for wrapping the documents. For example, a simple
wrapper may learn the prefix and postfix of each field from a collection of program-generated web pages.
We may then extract fields from web pages based on these prefixes and postfixes. A transducer may also be used to
learn extraction rules from a collection of web pages.
However, these methods learn the grammar of the input text; they do not learn an ontology from a
given document collection. In section 5.4, we will propose a learning algorithm that mines a slot-tree
ontology from a given XML collection. The algorithm is called the slot-mining algorithm.
It is a tool that helps the domain-knowledge designer design the slot-tree ontology.
Before we present the slot-mining algorithm, we show in section 5.3 the process by which a human builds a
slot-tree, in order to observe what is needed in designing such an algorithm.
5.3 The process of building a slot-tree
In order to show the ontology design process, we trace the design steps of a simple slot-tree for
butterflies. There are six steps in designing a slot-tree.
1. Browse XML data.
2. Identify object boundaries.
3. List all tags in this domain.
4. Identify slots for this domain.
5. Map each slot to tags (or XPath).
6. Identify values for each slot.
Browsing XML data : The first step in designing a slot-tree is to browse the data in order to understand it.
What is the structure of the XML collection? Can we identify the object boundaries in the XML documents?
What is the meaning of each tag? Does each tag correspond to a slot? What are the candidate values for a
slot? We have to answer these questions before constructing a slot-tree.
Identifying object boundaries : An object-block is an XML block that corresponds to an object. We have
to identify the boundaries of object-blocks to find out what objects the collection contains. For example, in
our butterfly collection, a <butterfly></butterfly> block is the boundary of a butterfly object.
Listing all tags in this collection : An XML tag usually has a strong semantic meaning. For example, the
<color> tag represents the color of a butterfly. We may list all tags to understand the semantics of each
tag. For the simple butterfly collection, we list all tags as follows.
Butterfly, adult, texture, color, size, geography, Taiwan, global
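Listing the tags by hand is feasible for a small collection; for a larger one, the tag paths can be collected with a few lines of code. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `list_paths` and the sample document `DOC` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def list_paths(xml_text):
    """Return the sorted set of tag paths that occur in one XML document."""
    paths = set()

    def visit(element, prefix):
        path = prefix + "/" + element.tag
        paths.add(path)
        for child in element:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return sorted(paths)

# Hypothetical miniature butterfly document.
DOC = (
    "<butterfly><adult><texture>t</texture><color>c</color></adult>"
    "<geography><taiwan>n</taiwan><global>g</global></geography></butterfly>"
)
paths = list_paths(DOC)
for path in paths:
    print(path)
```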
Identifying slots for this collection : We are lucky to find that these tags are not ambiguous.
The semantics of the tags are clear and definite, so we may build a slot for each tag.
Mapping slots to tags (or XPath) : For the simple butterfly collection, we can map each tag to one slot
directly. The following example shows the schema of the slot-tree.
<s slot="butterfly">
<s slot="adult">
<s slot="texture" />
<s slot="color" />
<s slot="size" />
</s>
<s slot="geography">
<s slot="Taiwan" />
<s slot="Global" />
</s>
</s>
Identifying values for each slot : In order to identify values for each slot, we have to read the data for
each slot. For example, if we read the data in the <color> tag, we may find that black, white,
brown, orange, yellow, green, blue, purple, and gray are key values for this slot. We may fill
them into the value list of the color slot. After we fill in values for each slot, the slot-tree
building process is finished. The following XML document shows a slot-tree for the simple butterfly collection.
<s slot="butterfly">
<s slot="adult">
<s slot="color">
<v value="black" /><v value="white" /><v value="brown" />
<v value="yellow" /><v value="orange" /><v value="green" />
<v value="blue" /><v value="purple" /><v value="gray" />
</s>
<s slot="texture">
<v value="single color" keys="single, mono, uniform" />
<v value="spotted" keys="spot" />
<v value="lines" keys="line" />
</s>
<s slot="size">
<v value="small" /><v value="middle" /><v value="large" />
</s>
</s>
<s slot="geography">
<s slot="Taiwan">
<v value="north" /><v value="center" /><v value="south" /><v value="east" />
</s>
<s slot="Global">
<v value="Europe" /><v value="China" /><v value="India" />
<v value="America" /><v value="Australia" />
</s>
</s>
</s>
In the slot-tree example above, a <v> tag represents a value in a slot. The simplest value is a
keyword. We may also specify a set of keywords or rules for a value, such as the "single color" value in
the texture slot.
The last step, identifying values for each slot, is the most labor-intensive step in the whole
slot-tree building process. In order to construct the slot-tree automatically, we develop the slot-mining
algorithm, described in the next section, to mine a slot-tree from XML documents.
5.4 Slot-mining algorithm
A slot-mining algorithm mines a slot-tree from XML documents. The first step is to extract the paths
in the XML documents to build a schema. The second step uses statistical correlation analysis
to find out which terms are important for these paths. After that, a slot-tree is built in which each slot
corresponds to a path in the XML documents. The following figure shows a conceptual model of the
slot-mining algorithm.
Figure 5.1 The process of slot-mining algorithm
Before we describe the algorithm, we have to define some mathematical notation.
Definition : Slot-Vector
A slot-vector is a vector of (slot, term) pairs for a given collection of XML blocks (B).
$$v(B) = (B_{s_1,t_1}, \ldots, B_{s_1,t_k}, \ldots, B_{s_n,t_1}, \ldots, B_{s_n,t_k})$$
$B_{s_i,t_j}$ is the weight with which term $t_j$ shows up in the blocks for slot $s_i$ of B.
$|B|$ is the abbreviation for $\sum_{s,t} B_{s,t}$.
$|B_t|$ is the abbreviation for $\sum_{s} B_{s,t}$.
$|B_s|$ is the abbreviation for $\sum_{t} B_{s,t}$.
Definition : Slot-Vector Space Model (SVSM)
The model that represents XML documents by slot-vectors is called the Slot-Vector Space Model.
Example
1. A slot-vector for a given collection (D) is represented by the following formula.
$$v(D) = (D_{s_1,t_1}, \ldots, D_{s_1,t_k}, \ldots, D_{s_n,t_1}, \ldots, D_{s_n,t_k})$$
2. A slot-vector for a specified slot (s) of collection (D) is represented by the following formula.
$$v(D_s) = (D_{s,t_1}, \ldots, D_{s,t_k})$$
3. A slot-vector for a given document (d) is represented by the following formula.
$$v(d) = (d_{s_1,t_1}, \ldots, d_{s_1,t_k}, \ldots, d_{s_n,t_1}, \ldots, d_{s_n,t_k})$$
4. A slot-vector for a specified slot (s) of document (d) is represented by the following formula.
$$v(d_s) = (d_{s,t_1}, \ldots, d_{s,t_k})$$
Slot-Mining Problem
Given a collection of XML documents (D) and a set of slots (S), find the key values v(s) for each slot s.
Slot-Mining Algorithm
The slot vector for D is $v(D) = (D_{s_1,t_1}, \ldots, D_{s_1,t_k}, \ldots, D_{s_n,t_1}, \ldots, D_{s_n,t_k})$.
Let $|D_t| = \sum_{i} D_{s_i,t}$.
The slot vector for $D_s$ is $v(D_s) = (D_{s,t_1}, \ldots, D_{s,t_k})$.
$$v_{>r}(s) = \{\, t \mid D_{s,t} / |D_s| > r \cdot |D_t| / |D| \,\}$$
$v_{>r}(s)$ is called the r-key-set for slot (s).
In our XML-mining system, we set the parameter r = 2.0 to extract the key values for each slot.
The following figure shows the pseudo code of the slot-mining algorithm.
Algorithm Slot-Mining (D, r)
  P = {p | p is a path in D}
  PT = {(p, t) | term t shows up under path p in D}
  for each occurrence of (p, t) in D
    |Dp,t| = |Dp,t| + 1
    |Dp| = |Dp| + 1
    |Dt| = |Dt| + 1
    |D| = |D| + 1
  end for
  SV = {}
  for each (p, t) in PT
    p(t | p) = |Dp,t| / |Dp|
    p(t) = |Dt| / |D|
    if p(t | p) / p(t) > r then put (p, t) into SV
  end for
  return SV
Figure 5.2 The Pseudo Code of Slot-Mining Algorithm
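The pseudo code can be turned into a short runnable sketch. The version below is an illustration under simplifying assumptions: terms are obtained by splitting element text on non-word characters (which suits English test data only), each element's own text is counted but tail text is ignored, and the sample documents in `DOCS` are invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

def path_terms(xml_text):
    """Yield one (path, term) pair for every term occurrence in a document."""
    def visit(element, prefix):
        path = prefix + "/" + element.tag
        for term in re.findall(r"\w+", (element.text or "").lower()):
            yield path, term
        for child in element:
            yield from visit(child, path)
    yield from visit(ET.fromstring(xml_text), "")

def slot_mining(documents, r=2.0):
    """Return {path: key terms} where p(t | path) / p(t) exceeds r."""
    pair_count, path_count, term_count = Counter(), Counter(), Counter()
    total = 0
    for document in documents:
        for path, term in path_terms(document):
            pair_count[(path, term)] += 1
            path_count[path] += 1
            term_count[term] += 1
            total += 1
    key_values = {}
    for (path, term), count in pair_count.items():
        p_term_given_path = count / path_count[path]
        p_term = term_count[term] / total
        if p_term_given_path / p_term > r:
            key_values.setdefault(path, set()).add(term)
    return key_values

# Hypothetical two-document collection.
DOCS = [
    "<butterfly><color>brown wing</color><shape>broken wing</shape>"
    "<size>large wing</size></butterfly>",
    "<butterfly><color>white wing</color><shape>round wing</shape>"
    "<size>small wing</size></butterfly>",
]
key_values = slot_mining(DOCS, r=2.0)
print(key_values)
```

In this toy collection the word "wing" occurs equally often under every path, so its ratio is 1 and it is filtered out, while "brown" and "white" occur only under the color path and are kept.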
The slot-mining algorithm mines values from an XML collection D. The mined values should be
revised and organized into a slot-tree to improve quality. Let's have a look at a mining
example for the slot "color".
Example :
<color> head, brown, yellow, body, white, wing, gray, blue, black, background, line, spot </color>
In the mining result above, brown, yellow, white, gray, blue, and black are what we want, but head,
body, wing, background, line, and spot are noise words. So far, we cannot distinguish these two
groups by statistical methods alone, so we have to find a way to distinguish them. One possible solution is
to use a dictionary such as WordNet to distinguish the two groups. We will try this solution in
the future.
5.5 Discussion
In order to help people construct a slot-tree ontology, we developed a slot-mining algorithm
that mines slot-trees from XML documents. The slot-mining algorithm is used as an authoring tool
for constructing the slot-tree ontology.
The slot-mining algorithm mines slot-trees from a collection of XML documents. Our
approach is based on statistical correlation analysis between tags and terms. The correlation
analysis decides which terms are important for a given tag and fills those terms into the slot of this
tag.
Some modification of the automatically constructed slot-tree is needed in order to
improve its quality. First, we have to merge paths with the same meaning into one slot in order
to simplify the structure of the slot-tree. Second, we have to delete incorrectly mined values
and merge values with the same meaning in order to improve the quality of each slot.
The slot-mining algorithm is used to construct the ontology for butterflies in section 6.7
and the ontology for proteins in section 7.7. We will show the full version of the
mined slot-trees in these sections.
Part 3 : Case Studies
6 Case Study - A Digital Museum of Butterflies
In part 2, we described our methods, including the slot-tree ontology, the slot-filling algorithm,
the slot vector space model, and the slot-mining algorithm. These methods are used to build a semantic
retrieval system for XML. In this part, we use two XML collections to test our methods:
a collection for butterflies and a collection for proteins.
In chapter 6, we test our methods on the collection "A Museum of Butterflies in
Taiwan" (MBT). In chapter 7, we test our methods on the collection "Protein
Information Resource" (PIR). Both collections are encoded in XML format.
In this chapter, an overview of MBT is given in section 6.1. A source XML document of
MBT is shown in section 6.2. A slot-tree for MBT is described in section 6.3. A query
interface based on the slot-tree is described in section 6.4. The slot-filling process for MBT is
described in section 6.5. The retrieval process for MBT is discussed in section 6.6. The mining
process to build the slot-tree for MBT is discussed in section 6.7. A discussion of our approach on
MBT is given in section 6.8.
6.1 Introduction
The Digital Museum of Butterflies is a collection of documents about butterflies in Taiwan. Each document in this collection
describes one species of butterfly in Taiwan. The following table is a profile of this collection.
Table 6.1 : A Museum of Butterflies in Taiwan
Collection      A Museum of Butterflies in Taiwan
Working Group   NMNS : National Museum of Natural Science, Taiwan
                URL : http://www.nmns.edu.tw/
                NCNU : National Chi-Nan University, Taiwan
                URL : http://dlm.ncnu.edu.tw/butterfly/index.htm
                NTU : National Taiwan University, Taiwan
                URL : http://turing.csie.ntu.edu.tw/ncnudlm/
Size            356 species, 356 XML documents
Language        Tags in English, content in Chinese
The Digital Museum of Butterflies in Taiwan contains XML documents for 356 species of butterflies in
Taiwan. Roughly speaking, the tags may be classified into the following groups.
Table 6.2 : XML tags for butterflies in Taiwan
Group            Fields
Classification   name, family, cfamily (Chinese family), genus, species, subspecies
Host             Host plant, Honey plant
Geography        Taiwan, global
Egg              Color, shape, feature, characteristic, days of growth, enemy
Larva            Color, shape, feature, characteristic, days of growth, enemy
Pupa             Color, shape, feature, characteristic, days of growth, enemy
Adult            Color, shape, texture, characteristic, life period, enemy
6.2 The Representation of Butterflies in XML
The following figure shows an XML document for the butterfly "kodairai".
<butterfly>
<cname> </cname>
<classification>
<family>Nymphalidae</family>
<cfamily> </cfamily>
<genus>Athyma</genus>
<species>fortuna</species>
<sub_species>kodairai</sub_species></classification>
<hostplant> (Caprifoliaceae) (Viburnum luzonicum var. matsudai) </hostplant>
<honeyplant> </honeyplant>
<geographic><taiwan> 1000-2000 </taiwan>
<global> </global></geographic>
<life_stage>
<egg>
<feature> </feature>
<color> </color> <size> 1.1-1.3mm </size>
<predator> </predator>
<days_of_growth> 5-6 </days_of_growth></egg>
<larva>
<feature> </feature>
<color> </color>
<size> 33-41 mm </size>
<predator> </predator>
<days_of_growth> </days_of_growth>
<defense> </defense></larva>
<pupa>
<feature> </feature>
<color> </color>
<size> 22-27mm </size>
<predator> </predator>
<days_of_growth> 15-20 </days_of_growth>
<defense> </defense></pupa>
<adult>
<feature> </feature>
<color> </color>
<size> 50-60mm </size>
<characteristic> </characteristic>
<habitate> </habitate>
<predator> </predator>
<days_of_growth> </days_of_growth>
<defense> </defense>
<season> </season>
<behavior> </behavior>
</adult>
</life_stage>
</butterfly>
Figure 6.1 : An XML document for butterfly (Full List)
6.3 Slot-Tree Ontology for Butterflies
Our ontology is represented as a slot-tree in XML format. The slot-tree we designed for butterflies is
consistent with the target collection; both follow the schema below.
<butterfly>
<classification/>
<Geography/>
<life-period>
<Egg/>
<Larva/>
<Pupa/>
<Adult/>
</life-period>
</butterfly>
Each object in the life period (egg, larva, pupa, adult) has a sub-schema that describes the object. The
sub-schema looks like the following tree.
<object>
<Color/>
<shape/>
<feature/>
<size/>
</object>
The consistency between the slot-tree and the documents eases our design process. It
also eliminates ambiguity in our retrieval and browsing processes. In contrast, a poor
design of the XML document structure makes the domain-knowledge design process difficult, and
makes the domain knowledge less helpful to the retrieval and browsing processes.
A fragment of the slot-tree for butterflies is shown in the following figure. For the full slot-tree,
please see appendix 1.
<butterfly>
<family slot=" " path="//butterfly//cfamily//">
<v value=" " keys="Hesperiidae" /><v value=" " keys="Lycaenidae" /> …</family>
<adult slot=" " keys="Adult" path="//butterfly//adult//">
<shape slot=" " keys="Adult:Shape" path="//butterfly//adult//shape//">
<v value=" " image="swallowtail.gif"/> <v value=" " /> </shape>
<color slot=" " keys="Adult:Color" path="//butterfly//adult//color//">
<v value=" " keys="Black" /> … <v value=" " keys="Black_White"/> </color>
<texture slot=" " keys="Adult:Texture" path="//butterfly//adult//texture//">
<v value=" " image="mono.gif" /><v value=" " image="spot.gif" /> </texture>
</adult>
<pupa slot=" " keys="Pupa" path="//butterfly//pupa//">
<s slot=" " path="//butterfly//pupa//"><v value=" " keys="Skin_Stick" /> </s>
<s slot=" " keys="Pupa:Color" path="//butterfly//pupa//color//">
<v value=" " keys="Green"/> <v value=" " keys="Wood" /> </s>
<s slot=" " keys="Pupa:Feature" path="//butterfly//pupa//feature//">
<v value=" " keys="Laying_Pupa"/><v value=" " keys="Hanging_Pupa"/>
</s></pupa>
<egg slot=" " keys="Egg" path="//butterfly//egg//">
<s slot=" " keys="Egg:Shape" path="//butterfly//egg//feature//">
<v value=" " keys="Ball" image="egg_ball.jpg" />
<v value=" " keys=" +Half_Ball" image="egg_half_ball.jpg" /> </s>
<s slot=" " keys="Egg:Color" path="//butterfly//egg//color//">
<v value=" " keys="Milk_White" /> </s>
<s slot=" " keys="Egg:Texture" path="//butterfly//egg//feature//">
<v value=" " keys="Smooth" /> <v value=" " keys="Square_Texture"/> </s>
</egg>
<larva slot=" " keys="Larva+ " path="//butterfly//larva//">
<s slot=" " keys="Larva:shape" path="//butterfly//larva//feature//">
<v value=" " keys="Like_Shuttle" /><v value=" " keys="Like_Bird’s_Shit" /> </s>
<s slot=" " keys="Larva:Color" path="//butterfly//larva//color//">
<v value=" " keys="Green" /><v value=" " keys="Brown" /> </s>
<s slot=" " keys="Larva:Texture" path="//butterfly/life_stage/larva/characteristic">
<v value=" " keys="Short_Hair" /><v value=" " keys="Long_Hair" /> …</s>
</larva>
<s slot=" " keys="Taiwan" path="//butterfly//geographic//taiwan//">
<v value=" " keys="North_Taiwan+ " /> </s>
<s slot=" " path="//butterfly//geographic//global//">
<v value=" " keys="South_Asia " /><v value=" " keys="China" /> … </s>
<s slot=" " keys="Size" path="//butterfly//adult//size//">
<v value=" " keys="Small_Size+ " /><v value=" " keys="Middle_Size+ " /></s>
<s slot=" " keys=" =Habitate" path="//butterfly//adult//habitate//">
<v value=" " keys="Ground" /> … <v value=" " keys="High_Mountain" /> </s>
<s slot=" " keys="Hostplant+ " path="//butterfly//hostplant//">
<v value=" " keys="Leguminosae" /><v value=" " keys="Euphorbiaceae" /> </s>
<s slot=" " keys="Eat Food" path="//butterfly//adult//behavior//;//butterfly//honeyplant//">
<v value=" " keys="Nectar" /><v value=" " keys="Juice " /> </s>
</butterfly>
Figure 6.2 : A slot-tree for butterflies
6.4 Query Interface
The query interface is generated automatically from the slot-tree by an XSLT template. The
template transforms the slot-tree into an HTML document, which is then shown as a web page in
the browser. The following figure shows the query interface for the butterfly domain.
Figure 6.3 A Query Interface for Butterflies
The query interface above generates the following query.
<query sort_by=" " path="/butterfly/adult/size$meter" order="-">
<s slot=" " path="//butterfly//adult//texture" value=" " />
<s slot=" " path="//butterfly//geographic//Taiwan" value=" " />
</query>
After the interface submits the query to our XML retrieval system, the retrieval results are shown.
The query above specifies both the query expression and the ranking strategy. The ranking strategy is by the
size of the adult butterfly in decreasing order. Based on the query, the XML retrieval system retrieves
the butterfly objects and ranks them by size. We show the query results in section 6.6.
6.5 Slot-Filling Algorithm
We have to parse XML objects before filling documents into the slot-tree. For example, the following
XML document describes a butterfly called "maraho".
<butterfly>
<cname> </cname><geographic>
<taiwan> 1000-1500 …</taiwan>
</geographic>
<egg><feature> </feature></egg><adult><color> </color></adult>
<footnote> … </footnote>
</butterfly>
The example above is parsed into a sequence of (path, value) pairs as follows.
(butterfly//cname, )
(butterfly//geographic//taiwan, 1000-1500 …)
(butterfly//egg//feature, )
(butterfly//adult//color, )
Then we may fill them into the corresponding slots as follows.
(butterfly//egg//feature, )
→ <slot name=" " path="butterfly//egg//feature">
<value name=" "/>
<value name=" "/>
…
6.6 XML Retrieval
After the user submits the query to the XML retrieval system, the XML retrieval system
retrieves the query results. Then an XML extraction algorithm extracts values for each slot.
After that, a sorting function sorts the result by the size of butterflies. The following figure
shows the query results.
Figure 6.4 A Query Result for Butterflies
6.7 Slot-Mining Algorithm
Chinese Word Learning
A problem for the Chinese language is word boundary detection. In English, there is a space between
words in a sentence, but in Chinese there are no spaces between words. This causes some
difficulty in our XML text-mining problem. One way to solve the problem is to use a dictionary to find
the words that show up in a sentence. The deficiency of this approach is that no dictionary contains all
words, and many unknown words are used in a special domain. We have to learn words
dynamically to overcome the problem. In our system, we adopt the keyword-learning algorithm proposed
by L.F. Chien [Chien97]. This keyword-learning algorithm is based on the following observation:
both the right-hand side and the left-hand side of a word should be "free". Here "free" means that a word can
connect to many neighbors statistically. For example, we may extract the word ` ` from the
following sentences based on the statistical freedom of this word.
` `
` `
` `
` `
` ` left neighbor { } right neighbor { }
` ` left neighbor { } right neighbor { }
` ` left neighbor { } right neighbor { }
For the string ` `, both the left side and the right side have four neighbors. But for ` `, there is only
one left neighbor, and for ` `, there is only one right neighbor. A string with many neighbors on both
sides is very likely to be a word, so ` ` is put into the learning dictionary for the
following XML text-mining step.
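The neighbor test can be sketched as follows. This is only an illustration of the observation, not the algorithm of [Chien97]: the threshold `min_neighbors` is a hypothetical parameter, and the sample strings are Latin letters standing in for unsegmented text.

```python
def neighbor_sets(candidate, sentences):
    """Collect the distinct characters seen left and right of a candidate."""
    left, right = set(), set()
    for sentence in sentences:
        start = sentence.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                left.add(sentence[start - 1])
            if end < len(sentence):
                right.add(sentence[end])
            start = sentence.find(candidate, start + 1)
    return left, right

def is_free(candidate, sentences, min_neighbors=3):
    """A candidate is 'free' if it has enough distinct neighbors on both sides."""
    left, right = neighbor_sets(candidate, sentences)
    return len(left) >= min_neighbors and len(right) >= min_neighbors

# Stand-in corpus: "abc" plays the role of the word to be learned.
sentences = ["xabcy", "pabcq", "mabcn", "zabcw"]
for candidate in ("abc", "ab", "bc"):
    left, right = neighbor_sets(candidate, sentences)
    print(candidate, len(left), len(right), is_free(candidate, sentences))
```

Here "abc" has four distinct neighbors on each side, while "ab" has only one right neighbor and "bc" only one left neighbor, so only "abc" is accepted as a word.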
Slot-Mining
After the word-learning step, the slot-mining algorithm described in section 5.4 is used to extract
the important words for each slot. The following table shows some results of slot-mining (part of the
slot-tree).
Table 6.3 : A Result of Slot-Mining Algorithm for Butterflies
Slot                                      Value List
butterfly/classification/cfamily          (Chinese terms)
butterfly/classification/family           Satyridae, Pieridae, Papilionidae, Papilio, Nymphalidae, Lycaenidae,
                                          Hesperiidae, Danaidae
butterfly/cname                           (Chinese terms)
butterfly/footnote                        (Chinese terms)
butterfly/geographic/global               (Chinese terms)
butterfly/honeyplant                      (Chinese terms)
butterfly/life_stage/adult/predator       (Chinese terms)
butterfly/life_stage/egg/characteristic   (Chinese terms)
butterfly/life_stage/egg/feature          (Chinese terms)
butterfly/life_stage/adult/color          (Chinese terms)
6.8 Discussion
In this chapter, we have studied our methods on the case of butterflies. We described the
following methods.
1. Modeling XML documents of butterflies.
2. Constructing slot-tree ontology for butterflies.
3. Using slot-filling algorithm to map XML documents into slot-tree of butterflies.
4. Using slot-tree ontology to build query interface for butterflies.
5. Using slot-tree ontology to help XML retrieval for butterflies.
6. Mining slot-tree ontology from XML documents of butterflies.
These methods reduce the semantic gap between human and computer in the domain of
butterflies. The query interface enables users to write queries easily. The slot-filling algorithm
helps the computer understand XML documents. Finally, the mining algorithm helps us
construct the slot-tree ontology.
  • 5. XML Retrieval : A Slot-Filling Approach Ph.D. Dissertation Chen, Chung Chen Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail : johnson@turing.csie.ntu.edu.tw Advisor : Jieh Hsiang 23 July 2002
  • 6. Content Part 1 : Tutorial of This Thesis 1 Introduction 1 1.1 Motivation 1 1.2 Research problems 3 1.3 Research approaches 5 1.4 Outline of this thesis 7 2 Background XML and Information Retrieval 8 2.1 XML 8 2.2 Information retrieval 9 2.3 XML querying and retrieval 12 2.4 Using ontology to help the XML retrieval process 16 2.5 Discussion 20 Part 2 : Slot-Tree Based Methods for XML Retrieval 3 Slot-Tree Ontology and Slot-Filling Algorithm 21 3.1 Introduction 21 3.2 Slot-tree ontology 22 3.3 Slot-filling algorithm 26 3.4 Discussion 28 4 An Ontology Based Approach for XML Querying, Retrieval and Browsing 29 4.1 Introduction 29 4.2 XML documents 30 4.3 Indexing structure 32 4.4 Query language and query interface 33 4.5 Ranking strategy 34 4.6 Browsing XML documents 36 4.7 Discussion 37 5 The Construction of Slot-Tree Ontology 38 5.1 Introduction 38 5.2 Background 39 5.3 The process of building a slot-tree 39
  • 7. 5.4 Slot-mining algorithm 41 5.5 Discussion 44 Part 3 : Case Studies 6 Case Study - A Digital Museum of Butterflies 46 6.1 Introduction 46 6.2 The representation of butterflies in XML 47 6.3 Slot-tree ontology for butterflies 48 6.4 Query interface 51 6.5 Slot-filling algorithm 52 6.6 XML retrieval 53 6.7 Slot-mining algorithm 53 6.8 Discussion 56 7 Case Study - Protein Information Resource 57 7.1 Introduction 57 7.2 The representation of proteins in XML 58 7.3 Slot-tree ontology for proteins 58 7.4 Query interface 59 7.5 Slot-filling algorithm 60 7.6 XML retrieval 61 7.7 Slot-mining algorithm 62 7.8 Discussion 64 Part 4 : Conclusions 8 Conclusions and Contributions 65 8.1 Comparison 65 8.2 Contributions 69 8.3 Discussion and future work 70 Reference 71 Appendix 1 : A Museum of Butterflies in Taiwan 77 Appendix 2 : Protein Information Resource 85
  • 8. Part 1 : Tutorial of This Thesis 1 Introduction This thesis introduces an information retrieval (IR) method for XML. One big problem for information retrieval is that computer cannot understand documents as good as people. The problem is called the semantic gap problem. Our goal is building an information retrieval system to reduce the semantic gap between human and computer on XML. Our approach is using ontology to help the searching processes for XML, include querying, retrieval and browsing. This thesis is opened with our motivation in section 1.1. Our research problems are proposed in section 1.2. Our research approaches are described in section 1.3. An overview of this thesis is outlined in section 1.4. 1.1 Motivation Extensible Markup Language (XML) [XML98] is a standard to encode semi-structured documents. XML is useful in data representation, data exchanging and data publishing on the web. Many people believes that XML will be a widely spread standard in the future. For this reason, XML has gained much attention in both the information community and in the field of database research. XML is a markup language with extensible tags. Everyone may define his own markup language based on XML. In fact, hundreds of specifications based on XML have been proposed from 1997 to 2002. These specifications are designed to fulfill the need of some domains or some applications. For example, Protein Information Resource (PIR) (http://pir.georgetown.edu/) is an XML collections designed to record the data about proteins. UDDI [UDDI00] is an XML specifications designed to record the profile of business companies. XML is designed to be easy understood by human and computer. XML is encoded in text format for human to read and understand easily. Tags in XML provide semantic background for computer to understand the content correctly. XML can be used as a bridge between human writing and computer understanding. 
A smart computer program that understands XML documents is useful. However, building a computer program to understand XML documents is still very difficult. In this thesis, we propose methods for computer to understand XML documents.
  • 9. The natural language processing (NLP) community has been focus on the processing and understanding of natural language documents for a long time [Grosz86]. However, understanding natural language documents is still very difficult for computer programs. No effective approach is powerful enough to solve the understanding problem. Building a smart computer program to understand natural language texts is very difficult because of the semantic gap . The semantic gap is described as following. Computer cannot understand natural language as good as human. The semantic gap causes some difficulties for information retrieval systems. For example, an information retrieval system cannot understand our natural language queries, and retrieve many documents that are not semantically related to our queries. There are two semantic gaps for an information retrieval system, one for queries understanding and another for documents understanding. These gaps are list as following. Gap 1 : Computer cannot understand queries as good as human. Gap 2 : Computer cannot understand documents as good as human. Figure 1.1 : Semantic gaps of natural language In order to reduce the semantic gap problem, researchers in NLP community have been trying hard to resolve the following question.
  • 10. How to make computers understand natural language? However, natural language is too difficult for computer programs to understand now. Although many people have been devoted to solve the problem for more than thirty years, designing a computer program to understand natural language is still an open research problem. Computers do not understand natural language well. Why don t we design a structured language that is easy for computer to understand and easy for human to write. If we can design such a language, then we have a common language between human and computer. People may write documents in this language for computer to understand. Then we may build computer programs to understand documents in this language. XML is such a language that is easy for human to write. However, we have no method for computer to understand XML documents easily. If we can design such a computer program, we may reduce the semantic gap for XML, so that XML may plays as a bridge between human and computer. In this thesis, our goal is to reduce the semantic gap on XML. Our approach is to design methods for computer to understand XML documents. Our research problem is described in the next section. 1.2 Research problems XML is a markup language with extensible tags. People have to understand tags before writing XML documents. If there are too many tags for an XML writer to remember, he cannot write XML documents easily. If a writer has to mark each word up in XML documents, he cannot write it easily, too. On the other hand, if a writer mark documents up roughly, it is difficult for computer to understand. The tradeoff between human writing and computer understanding is called the human-computer dilemma of XML . Our goal is to design an XML retrieval system to resolve the human-computer dilemma of XML . For an XML retrieval system, there are two semantic gaps between human and computer, one gap on query side and another gap on document side. Figure 1.2 shows these two gaps.
  • 11. Figure 1.2 : Semantic gaps of XML On the document side, an XML document may be easy for human to write but not so easy for computer to understand. An XML document with many natural language texts is not so easy for computer to understand. Example 1.1 shows an XML document that contains natural language text in the color block and size block. It is not so easy for computers to understand the XML document. Example 1.1 : An XML document that is not easy for computer to understand <butterfly name= kodairai > <color>with black wing and white spots on it</color> <size>middle size butterflies, from 50mm to 60mm</size> </butterfly> On the contrarily, an XML document may be easy for computer to understand but not so easy for human to read and write. An XML document that marks each word up is not so easy for human to read and write. Example 1.2 shows an XML document that is not easy for human to read and write. Example 1.2 An XML document that is not easy for human to read and write <butterfly name= kodairai > <color><wing>black<wing><texture>white spot</texture></color> <size> <classification>middle size</classification> <from>50mm</from><to>60mm</to> </size>
  • 12. </butterfly> The same things happen on the query side, an XML query may be easy for human to write but not so easy for computer to understand. An XML query with natural language is not so easy for computer to understand. Example 1.3 shows an XML query that is not so easy for computer to understand. Example 1.3 An XML query that is not easy for computer to understand <butterfly>in black color with white spots</butterfly> On the contrarily, an XML query may be easy for computer to understand but not easy for human to read and write. A structuralized XML query is not so easy for human to read and write. Example 1.4 shows an XML query that is not so easy for human to read and write. Example 1.4 An XML query that is not easy for human to read and write For $b in //butterfly Where ?b/color = black¡¤and ?b/texture= white spots Return ?b Two approaches may be used to reduce semantic gap between human and computer on XML. The first approach is building computer programs to understand XML documents or queries. The second approach is building tools for human to write XML documents or queries. We adopt the first approach on the document side and adopt the second approach on the query side. It means that we build a computer program to understand roughly tagged XML documents, and we build a tool for human to write XML queries easily. The following section shows our approach. 1.3 Research approaches In this thesis, we build an XML retrieval system to reduce the semantic gap between human and computer on XML. An ontology called slot-tree is used to help the retrieval process. A user may use the query interface to write queries easily. The slot-tree ontology also helps the computer to understand XML documents easily. Figure 1.3 shows a scenario of our XML retrieval system.
  • 13. Figure 1.3 : A scenario of our XML retrieval system.

On the document side, we build a computer program to understand XML documents. The understanding process is based on an ontology called the slot-tree, a frame-like representation with embedded XPath [XPATH99] expressions. To make the computer understand XML documents, we designed a slot-filling algorithm that maps XML documents into the slot-tree.

On the query side, we build a query interface that lets humans write queries easily. The interface is built by transforming the ontology into a web page. A user builds a structured query simply by choosing or typing values into slots.

In our approach, the slot-tree ontology is a key component for both understanding documents and building queries. It mediates between queries and documents in the retrieval process, and so reduces the semantic gap on both the query side and the document side.

However, building the slot-tree ontology is not easy, and the ontology constructor needs tools for the job. The problem of constructing a slot-tree automatically from a set of XML documents is called the slot-mining problem. It is stated as follows.

How can the slot-tree ontology be mined from a collection of XML documents?
  • 14. To handle the slot-mining problem, we developed a statistical method that builds the slot-tree automatically. The method, called the slot-mining algorithm, is based on correlation analysis between tags and terms in XML documents.

1.4 Outline of this thesis

This thesis is divided into four parts: a tutorial part, a methods part, a case study part and a conclusion part.

Part 1 sets the stage for all the others. Chapter 1 outlines the research problems and approaches. Chapter 2 reviews the background literature for our research, which is the design of an XML retrieval system that reduces the semantic gap.

Part 2 describes our methods in detail. The methods are based on a knowledge representation structure called the slot-tree, which captures the semantics of XML documents and helps our XML retrieval system to understand them. Chapter 3 presents the syntax and semantics of the slot-tree ontology, together with the slot-filling algorithm, which uses the slot-tree to capture the semantics of XML documents. Chapter 4 outlines an XML information retrieval system based on the slot-tree, in which the slot-tree ontology and the slot-filling algorithm are used to reduce the semantic gap of XML retrieval. Chapter 5 describes the process of constructing a slot-tree ontology. It first outlines the construction steps and then proposes a statistical method, the slot-mining algorithm, that constructs the slot-tree automatically. The slot-mining algorithm mines slot-trees from XML documents by correlation analysis between tags and terms, and helps people construct the slot-tree ontology for a given XML collection.

Part 3 contains the test beds in which the slot-tree based approach is examined. Two cases are used. Chapter 6 presents the first case, an XML collection about butterflies.
The collection is a set of XML documents in Chinese about butterflies in Taiwan. Chapter 7 presents the second case, the Protein Information Resource (PIR). PIR is a large set of XML documents released by Georgetown University. The experiments on these two cases are used to analyze the strengths and weaknesses of the slot-tree based approach.

Part 4 is the conclusion. Chapter 8 analyzes the strengths of the slot-tree based approach. We compare the slot-tree based methods with some other XML retrieval methods, and present our contributions, conclusions and future work.
  • 15. 2 Background XML and Information Retrieval

In chapter 1, we introduced our motivation, goals and research approaches. Briefly, we would like to build an XML retrieval system that reduces the semantic gap between human and computer on XML. In this chapter, we survey related research to provide background for our work. Since our approach uses a slot-tree ontology to help the XML retrieval process, we cover XML, information retrieval and ontology. Section 2.1 surveys the XML specifications and technologies. Section 2.2 surveys information retrieval technologies. Section 2.3 surveys the state of the art in XML retrieval. Finally, section 2.4 outlines the relationship between ontology and XML retrieval.

2.1 XML

We have to understand XML in order to build an XML retrieval system that reduces the semantic gap. In this section, we survey the XML related specifications and technologies, especially the literature on knowledge representation and information retrieval.

XML was proposed by the World Wide Web Consortium (W3C) (http://www.w3c.org) in 1998. It is a tree-structured markup language with extensible tags. The following example is an XML document for a phonebook.

Example 2.1 An XML document
<?xml version="1.0"?>
<!DOCTYPE phonebook SYSTEM "phonebook.dtd">
<phonebooks xmlns="http://www.ntu.edu.tw/phonebook">
  <people id="001">
    <name>Johnson Chen</name>
    <tel>02-34134345</tel>
  </people>
  <people id="002">
    <name>Fanny Chen</name>
    <tel>02-33451294</tel>
  </people>
</phonebooks>
  • 16. In example 2.1, the head part <?xml version="1.0"?> indicates that this document is an XML document. The second line is the document type definition (DTD) part. A DTD is used to validate the syntax of XML documents. The DTD part is optional and can be removed to skip syntax validation. The third line, with the phonebooks tag, is the root node. An XML document has one and only one root node. In this line, xmlns="http://www.ntu.edu.tw/phonebook" is the default namespace of the document. Namespaces [XMLNS99] are used in XML to distinguish tags that have the same name, so that people can define their own tags and use the tags of others without worrying that the same tag name carries different meanings.

A node in XML contains a tag, attributes and text. In the example above, phonebooks, people, name and tel are tags; xmlns and id are attributes; and http://www.ntu.edu.tw/phonebook, Johnson Chen and 02-34134345 are text parts.

XPath [XPATH99] is a specification for locating nodes in XML documents. To locate all the people nodes, we may use the XPath expression //people. The // operator matches every descendant node. To locate the people node with id = 001, we may use the XPath expression //people[@id="001"]. The @ symbol indicates that id is an attribute name. XPath is used in the slot-tree ontology discussed in chapter 3: we embed XPath expressions in the slot-tree to locate nodes in XML, and use them to map XML documents into the slot-tree ontology.

Many XML related specifications have been proposed since 1997. XML has become a widespread specification that is used in many domains and applications, such as data exchange, data presentation, data querying and knowledge representation.
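The two XPath expressions described above can be tried with a short sketch. It is illustrative only and uses Python's standard xml.etree.ElementTree module, which supports a subset of XPath. As assumptions made here to keep the sketch short, the DOCTYPE and the default namespace of Example 2.1 are left out, and each path is written relative to the root element (.//people rather than //people).

```python
import xml.etree.ElementTree as ET

# The phonebook of Example 2.1, without the DOCTYPE and the default
# namespace (an assumption of this sketch, so plain tag names can be used).
PHONEBOOK = """
<phonebooks>
  <people id="001"><name>Johnson Chen</name><tel>02-34134345</tel></people>
  <people id="002"><name>Fanny Chen</name><tel>02-33451294</tel></people>
</phonebooks>
"""

root = ET.fromstring(PHONEBOOK)

# //people : every people node below the root.
all_people = root.findall(".//people")

# //people[@id="001"] : the people node whose id attribute equals 001.
person_001 = root.findall(".//people[@id='001']")

names = [person.findtext("name") for person in all_people]
tel_001 = person_001[0].findtext("tel")

print(names)
print(tel_001)
```

With the namespace declaration kept, the tag names in the paths would need to be qualified with the namespace URI.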
For data exchange, UDDI and ebXML are used to mediate the exchange of data between business enterprises. For data presentation, XSLT can be used to transform XML into HTML for presentation on the web. For data querying, XQL, XML-QL and X-Query are used to query data in XML documents. For knowledge representation, RDF/RDFS, DAML/DAMLS and XML topic maps have been proposed to represent knowledge in XML format. We survey the data querying specifications in section 2.3, which discusses XML querying and retrieval, and the knowledge representation specifications in section 2.4, which discusses ontology.

2.2 Information retrieval

To build an XML retrieval system that reduces the semantic gap, we have to understand information retrieval technologies and how natural language understanding technologies can be used to reduce the semantic gap of XML.
  • 17. The evolution of IR techniques is closely related to the structure of the target documents. Each time a new document structure was proposed, a new IR technique was developed. In 1970~1980, the Vector Space Model was developed to retrieve text documents. In 1990~1999, the Random Walk Model was developed to retrieve HTML documents. Today, XML documents are widespread, and many researchers are trying to develop new retrieval models for XML.

Text Retrieval

Text retrieval technology is almost as old as computer technology. There are many models for text retrieval. The best known is the Vector Space Model (VSM) [Salton75]. In this model, each document is represented by a k-dimensional vector of terms, where k is the number of index terms in the collection. A plain text d is expressed as follows.

$$d = (d_{t_1}, d_{t_2}, \ldots, d_{t_k})$$

where $d_{t_i}$ is the weight of term $t_i$ in document d. The order of the words in the text is discarded. A query q is represented by a k-dimensional vector of terms, too.

$$q = (q_{t_1}, q_{t_2}, \ldots, q_{t_k})$$

where $q_{t_i}$ is the weight of term $t_i$ in query q.

The cosine coefficient is a popular measure of the similarity between a document and a query. It is defined as the cosine of the angle between the document vector d and the query vector q.

$$\mathrm{Similarity}(d, q) = \frac{d \cdot q}{\|d\| \, \|q\|} = \frac{\sum_{i=1}^{k} d_{t_i} q_{t_i}}{\sqrt{\sum_{i=1}^{k} d_{t_i}^{2}} \, \sqrt{\sum_{i=1}^{k} q_{t_i}^{2}}}$$

One question is how to set the weights $d_{t_i}$ and $q_{t_i}$ in the vector space model. The tfidf function is a simple and commonly used weighting function. It is defined as the product of the term frequency (tf) and the inverse document frequency (idf).

Term frequency (tf) : tf(t, d) is the number of occurrences of term t in document d.
Document frequency (df) : $df_t$ is the number of documents containing term t.
Inverse document frequency (idf) : the inverse of the number of documents in which the term occurs.
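As a worked illustration of the vector space model, the following sketch builds tfidf vectors for a tiny collection and ranks the documents against a query by cosine similarity. It uses the logarithmic idf, log(N/df_t), as in the text. The three toy documents, the whitespace tokenisation and the function names are assumptions made for this example only.

```python
import math
from collections import Counter


def build_idf(docs):
    """idf(t) = log(N / df_t) for every term t in the tokenised collection."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(tokens, vocab, idf):
    """Weight of term t is tf(t, text) * idf(t); terms outside vocab are ignored."""
    tf = Counter(tokens)
    return [tf[term] * idf[term] for term in vocab]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection (not taken from the thesis data).
docs = [
    "black wing white spots".split(),
    "brown wing lines".split(),
    "white wing lines".split(),
]
idf = build_idf(docs)
vocab = sorted(idf)
doc_vectors = [tfidf_vector(doc, vocab, idf) for doc in docs]
query_vector = tfidf_vector("black spots".split(), vocab, idf)

scores = [cosine(vec, query_vector) for vec in doc_vectors]
print(scores)
```

The term wing occurs in every document, so its idf is log(3/3) = 0 and it contributes nothing to any similarity score.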
  • 18. $$idf(t) = \log(N / df_t)$$, where N is the number of documents.

For a given document d, $d_{t_i} = tfidf(t_i, d) = tf(t_i, d) \cdot idf(t_i)$.
For a given query q, $q_{t_i} = tfidf(t_i, q) = tf(t_i, q) \cdot idf(t_i)$.

The SMART system experiments led by Salton [Salton88] show that the tfidf term weighting function is the best among 287 distinct combinations of term-weighting assignments. The tfidf weighting function has proved to be a good measure for the vector space model.

HTML Retrieval

The main issue in HTML retrieval is measuring the importance of a document. An HTML retrieval system retrieves the documents that match the query and then sorts them by importance. There are too many documents on the web to read them all, so the importance measure helps users decide what they should read.

Documents on the web differ from a text collection because of their hyperlink structure. The measure of HTML importance is based on hyperlink analysis, which historically grew out of citation analysis. A simple strategy for measuring the importance of a web page is to count the number of hyperlinks that refer to it: a web page referenced by many other pages is important.

In 1998, a random walk model for weighting the importance of web pages was proposed [Brin98][Page98]. The model was then used in the Google search engine. In the random walk model, a page is important if it is cited by many important pages. Formally, each web page d has a weight w(d), which is recalculated in each step of an iterative process.

$$w(p) \leftarrow \sum_{q : (q, p) \in E} w(q)$$

Conceptually, the random walk model simulates a person who clicks on web pages at random. The random walker chooses a start page at random, then randomly clicks a link on that page, and repeats the process on each page visited. In the random walk model, an important page is visited with high probability.
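The iterative weighting can be sketched in a few lines. This is a hedged illustration and not the exact procedure of [Brin98][Page98]. Unlike the simplified update rule above, it divides the weight of each page among its outgoing links and adds a damping factor so that the weights stay normalised and the iteration converges. The damping value, the iteration count, the handling of pages without outgoing links and the four-page graph are all assumptions of this sketch.

```python
def random_walk_weights(links, damping=0.85, iterations=50):
    """Iteratively estimate page weights from a dict {page: [linked pages]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    weights = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a random page.
        new_weights = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Otherwise the walker follows one of the links on p.
                share = damping * weights[p] / len(targets)
                for q in targets:
                    new_weights[q] += share
            else:
                # A page without links passes its weight to every page.
                share = damping * weights[p] / n
                for q in pages:
                    new_weights[q] += share
        weights = new_weights
    return weights


# Hypothetical graph: pages a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
weights = random_walk_weights(links)
print(max(weights, key=weights.get))
```

In this graph page c is cited by three pages and receives the largest weight, and page a, which is cited by the important page c, outranks pages b and d, which are cited by nobody.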
Kleinberg proposed a hub-authority model to weight the impact of a web page [Kleinberg98]. In this model, web pages are divided into two classes, hub pages and authority pages. The hub-authority model is an iterative process. A hub page (h) is important if it points to many important
  • 19. authority-pages. An authority page (a) is important if it is cited by many important hub pages. Formally, each page d in the hub-authority model has two weights, the hub weight h(d) and the authority weight a(d). Both are recalculated in each step of an iterative process. Figure 2.1 shows the concept of the hub-authority model.

Figure 2.1 The hub-authority model

A set of web pages (D) contains many hyperlinks (E). For each page d in D, h(d) is the hub weight of d and a(d) is the authority weight of d. Initially, both h(d) and a(d) may be set to 1/|D|, where |D| is the number of documents in D. After that, h(d) and a(d) are recalculated iteratively with the following recurrence equations.

$$a(p) \leftarrow \sum_{q : (q, p) \in E} h(q)$$

$$h(p) \leftarrow \sum_{q : (p, q) \in E} a(q)$$

The hub-authority model is used to weight the importance of a web page and to decide whether the page is a hub or an authority. Besides weighting importance, it therefore provides a mechanism for classifying the type of a web page.

Both the hub-authority model and the random walk model use an iterative approach to decide the importance of a web page. Convergence analysis based on eigenvalues in linear algebra is used to analyze the behavior of the recurrence equations in these models. The papers of Kleinberg [Kleinberg98] and Page et al. [Page98] discuss the theory of these models further.

2.3 XML Querying and Retrieval

In order to manage XML documents, the database community and IR community have recently
  • 20. focused on research into storing, indexing, querying and retrieving XML documents. For storing, database management systems have been extended to store XML documents. One way is to extend a relational database system; another is to store XML documents in an object-oriented database (OODB) system. For indexing, Patricia tries and inverted files are used to index XML documents. For querying, several XML query languages have been proposed to retrieve XML nodes. For searching, several systems have been designed to search XML documents. In this section, we focus on XML query languages and XML retrieval systems.

XML Query Language

Designing query languages is an active research topic for XML. XML query languages are much more complex than the queries used in text retrieval and HTML retrieval, and they are more flexible than database query languages. Many XML query languages have been proposed in recent years, such as Lorel [Loral97], XML-QL [XML-QL98], XML-GL [XML-GL99] and X-Query [XQuery01].

Querying an XML collection is similar to querying a database. In a relational database, we usually query tables with SQL. The following example shows a query that retrieves the names and birthdays of United States presidents.

SELECT name, birthday FROM people WHERE nation = 'US' AND job = 'president'

An XML query language has to retrieve nodes from a tree of XML nodes. The following X-Query example retrieves the names and birthdays of United States presidents.

For $p in //people
Let $n = $p/name, $b = $p/birthday
Where $p/job = "president" and $p/nation = "US"
Return $n, $b

XML-GL is a graphical notation used to retrieve XML documents. Figure 2.2 shows an example that retrieves orders that ship books with the title Introduction to XML to Los Angeles.
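To make the For/Where/Return query for presidents concrete, the following sketch evaluates the same selection over a small XML document with Python's standard xml.etree.ElementTree module. It is not an X-Query implementation. The sample people document and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Small illustrative collection; the element names follow the query in the text.
PEOPLE = """
<root>
  <people>
    <name>George Washington</name><birthday>1732-02-22</birthday>
    <nation>US</nation><job>president</job>
  </people>
  <people>
    <name>Benjamin Franklin</name><birthday>1706-01-17</birthday>
    <nation>US</nation><job>printer</job>
  </people>
  <people>
    <name>Abraham Lincoln</name><birthday>1809-02-12</birthday>
    <nation>US</nation><job>president</job>
  </people>
</root>
"""


def us_presidents(root):
    """Return (name, birthday) pairs for people who are US presidents."""
    results = []
    for person in root.findall(".//people"):                # For $p in //people
        name = person.findtext("name")                      # Let $n = $p/name
        birthday = person.findtext("birthday")              # Let $b = $p/birthday
        if (person.findtext("job") == "president"
                and person.findtext("nation") == "US"):     # Where ...
            results.append((name, birthday))                # Return $n, $b
    return results


presidents = us_presidents(ET.fromstring(PEOPLE))
print(presidents)
```

The loop walks the tree of nodes explicitly, which is the work that an XML query language performs on behalf of the user.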
  • 21. Figure 2.2 An example of XML-GL

XML retrieval systems

Several XML retrieval systems have been proposed in recent years. We survey them in this section.

Lore was a pioneering research project on XML retrieval at Stanford University. In this project, an object-oriented database was used to store XML documents, and the XML query language Lorel was developed. In addition, a query interface called DataGuider was developed for querying XML documents. Figure 2.3 is a screen capture of the DataGuider system.
  • 22. Figure 2.3 The query interface of the DataGuider system

XYZfind is a commercial system that splits the querying process into four steps. The following figures show the retrieval steps of the XYZfind retrieval system.

Step 1 : The user types in a query to start the category search.
Step 2 : The XYZfind system finds several related categories. The user has to click the target application.
Step 3 : The user builds a query with the query interface.
  • 23. Step 4 : The XYZfind system retrieves XML documents and shows them in the browser.

Figure 2.4 Retrieval steps of the XYZfind system

2.4 Using Ontology to Help the XML Retrieval Process

To reduce the semantic gap, we have to survey the technologies that make computers understand natural language text. The design of XML does not eliminate natural language text from the content of XML documents. Natural language text is frequently embedded in XML documents, so the natural language understanding technologies that reduce the semantic gap are still needed to understand them. In this section, we focus on how natural language understanding technologies based on ontology representation can be used to understand XML documents.

The natural language processing community has been trying to resolve the semantic gap problem for a long time. Natural language understanding is a field that focuses on building computer programs that understand natural language text [Grosz86] [Allen94]. However, the word understanding is misleading here. Computers do not really understand natural language text as humans do. What computers can do is calculation and symbolic reasoning. Computers understand natural language text by mapping the text into an internal representation. The internal representation guides the computer in symbolic reasoning, so that it acts as if it knew the meaning of the text.

Alan Turing designed the Turing Test [Turing50] to test whether a computer understands natural language text. For information retrieval, we adopt a definition similar to the Turing Test. If a computer program retrieves what we want, discards what we do not want, and organizes the retrieval result in a form we like to browse, then we say that the program understands the documents and our queries. A computer that can do what we want it to do is a smart computer.
A retrieval system that retrieves only what we want and organizes the result in the form we like is a smart retrieval system.
  • 24. A data structure called an ontology, which represents the concepts in the human mind, is used in the process of understanding. Generally speaking, understanding is the process of mapping natural language text into an ontology. After the mapping, the computer can act on the result. This is the style of computer understanding.

An ontology may be represented in different structures. The research topic that focuses on the structure of ontology is called knowledge representation [Brachman85a]. Roughly speaking, there are two approaches to representing knowledge and ontology, the logic-based approach and the object-based approach. We introduce and compare these two approaches here, because they are the basis of the slot-tree ontology discussed in chapter 3.

The logic-based approach encodes knowledge into logic statements for reasoning. It includes propositional logic, first-order logic, probabilistic logic and so on. Prolog is the best known logic-based programming language. Given a set of predefined logic statements, a reasoning process is used to infer true statements that were not stated explicitly.

First-order logic is a powerful theory for representing knowledge and reasoning to conclusions. It is a monotonic logic system whose expressions contain predicates and quantifiers. In first-order logic, we use logical statements to represent the ontology. The following example shows the logic statements that describe the inheritance relationship between butterfly, insect and animal.

is(butterfly, insect)
is(insect, animal)
∀x, y, z : is(x, y) ∧ is(y, z) → is(x, z)

The power of first-order logic lies in its ability to reason monotonically. Monotonic reasoning means that a conclusion, once made, will never be retracted. The facts, rules and conclusions in a first-order reasoning process must therefore be 100% certain.
The following example shows a reasoning process for the example above. The reasoning process infers that a butterfly is a kind of animal.

∀x, y, z : is(x, y) ∧ is(y, z) → is(x, z)    (bind x to butterfly, y to insect, z to animal)
----------------------------------------------------------------
is(butterfly, insect) ∧ is(insect, animal) → is(butterfly, animal)
  • 25. ----------------------------------------------------------------
conclusion : is(butterfly, animal)

A difficulty is that many uncertain situations are encountered in natural language understanding, so the 100% certainty required by first-order logic cannot always be assured. Probabilistic logic and fuzzy logic were developed to handle uncertainty. However, the monotonic property is lost in uncertain reasoning.

Having reviewed the logic-based approach, we now introduce the object-based approach. It comprises a set of representation methods, including frames, semantic networks and scripts. Generally speaking, frames are used to represent the internal structure of an object, semantic networks are used to represent the relations between objects, and scripts are used to describe an active scenario involving many objects.

The frame was proposed by Minsky in 1975 [Minsky75] in the seminal paper "A framework for representing knowledge". A frame is a method of representation that organizes knowledge into chunks. However, Minsky did not formalize the frame concept as a mathematical model; he explicitly argued in favor of staying flexible and informal. Later, some AI systems were built on the frame representation, such as the KL-ONE system [Brachman85b] and the KRL language [Bobrow77].

Generally speaking, a frame is a structure that describes the internal structure of an object. Frames are composed of slots (attributes) for which fillers (scalar values, references to other frames, or procedures) have to be specified or computed. A slot can be expressed as a tuple of the form (object, slot, filler). It is easy to transform such a tuple into a logic predicate of the form slot(object, filler). A frame that inherits from another frame is called a sub-frame. The inheritance property may be expressed as the is relation between frames, in the form is(object, object).
The inheritance property organizes frames into a hierarchy.

The frame concept, which organizes statements into object-based structures, is easy for humans to read and write. It was later adopted by object-oriented programming languages to make programs easier to write. The following example shows a frame for kodairai, a species of butterfly.

<object name="kodairai">
  <is>butterfly</is>
  <texture>eyespots</texture>
</object>
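The correspondence between frames, tuples and logic predicates can be illustrated with a small sketch. The kodairai frame is the one shown above, and the butterfly and insect frames follow the earlier logic example. The dictionary layout and the function names are assumptions of this sketch, not part of any frame system cited here.

```python
# Each frame maps slot names to fillers; the "is" slot names the parent frame.
frames = {
    "kodairai": {"is": "butterfly", "texture": "eyespots"},
    "butterfly": {"is": "insect"},
    "insect": {"is": "animal"},
}


def to_tuples(frames):
    """List every slot as an (object, slot, filler) tuple."""
    return [(obj, slot, filler)
            for obj, slots in frames.items()
            for slot, filler in slots.items()]


def to_predicates(frames):
    """Rewrite each tuple as a logic predicate slot(object, filler)."""
    return [f"{slot}({obj}, {filler})" for obj, slot, filler in to_tuples(frames)]


def ancestors(frames, obj):
    """Follow the is relation upwards to collect the inheritance chain."""
    chain = []
    seen = {obj}
    parent = frames.get(obj, {}).get("is")
    while parent is not None and parent not in seen:  # guard against cycles
        chain.append(parent)
        seen.add(parent)
        parent = frames.get(parent, {}).get("is")
    return chain


tuples = to_tuples(frames)
predicates = to_predicates(frames)
chain = ancestors(frames, "kodairai")
print(predicates)
print(chain)
```

Following the is slot repeatedly reaches the same conclusion as the transitivity rule of the logic-based example: kodairai is a butterfly, an insect and an animal.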
  • 26. Semantic networks concentrate on categories of objects and the relations between them [Quillian66] [Wood75]. The basic idea of a semantic network is to draw graphs that represent the relationships between objects. In these graphs, a link may be represented as a tuple of the form (object, relation, object). It is easy to transform such a tuple into a logic predicate of the form relation(object, object).

Scripts are used to describe a scenario involving many objects [Schank77]. The steps of the scenario are described as a lattice, and a step may be triggered when its preceding steps are finished. For example, the following script shows the process of making a cup of coffee.

1. Put an empty cup on the table. → put_on(cup, table)
2. Put coffee powder into the cup. → put_into(coffee powder, cup)
3. Fill the cup with hot water. → fill(hot water, cup)
4. Mix the powder and the water with a spoon. → mix_by_spoon(powder, water)
5. Process finished.

In fact, object-based representations can be translated into logic rules. The difference between logic-based and object-based representation lies in the organization principle. Logic-based representation encodes knowledge into logic expressions, while object-based representation organizes these expressions into frames, semantic networks and scripts.

Reasoning is not a standardized part of object-based systems [Ifikes85]. The information stored in frames has often been treated as the "database" of the knowledge system, whereas the control of reasoning has been left to other parts of the system. The most popular and effective reasoning mechanism for frames is production rules [Stefik83] [Kehler84]. Production rules are rules of the form pattern/action. They are a subset of predicate calculus with an added prescriptive component indicating how the information in the rules is to be used during reasoning.
Whenever a pattern is matched, the production system triggers the corresponding frame and performs the action, which does something that helps the understanding process. After the pattern/action process, some values are filled into frames as the conclusion. The reasoning process in an object-based system that maps natural language text into the slot-tree ontology is what we call the slot-filling process.

Both logic-based and object-based representation may be used to represent an ontology and to reason on the basis of it. Reasoning is helpful but not necessary for computers to understand natural language. However, computers do need a process that maps natural language text into an ontology in order to understand it.
  • 27. The mapping process for XML documents is easier than the mapping process for natural language documents, because tags provide semantic contexts that make the mapping easier. In chapter 3, we propose a slot-filling algorithm that maps XML documents into the slot-tree ontology in order to reduce the semantic gap between human and computer on XML.

2.5 Discussion

In this chapter, we reviewed the research background of XML, information retrieval and ontology. However, current XML retrieval technology is not yet good enough and needs further research. In fact, researchers in the information retrieval community have recently been working hard to develop methods for XML retrieval. In the summary of the ACM SIGIR 2000 workshop on XML and information retrieval, Carmel et al. [Carmel00] discuss several unsolved problems for XML retrieval. We list these problems as follows.

1. Using an XML query language is likely to improve precision. However, XML query languages are not easy for people to use. How can they be made easier to use?
2. A heterogeneous XML collection contains documents from different sources, and their tag names and document structures may be different and idiosyncratic. How can heterogeneous XML documents be retrieved?
3. XML is specified using Unicode. Tag names from different sources may be given in different languages. Since a word can have more than one translation, or even no translation, finding or making the appropriate translation is an interesting issue for multilingual information retrieval. How can multilingual XML documents be retrieved?
4. Browsing XML retrieval results should be better than browsing text documents. How should the retrieval results be organized for browsing? Should the result be the entire document, a part of the XML tree, or perhaps a graph?

In this thesis, we try to resolve these problems by developing an XML retrieval system. The system is mainly designed to reduce the semantic gap between human and computer.
In this system, we develop programs that let the computer understand XML documents easily, and that let humans write queries and browse query results easily. These methods are based on an ontology representation called the slot-tree. We describe these methods in the next part. In chapter 3, we show how to represent a slot-tree and how to map XML documents into it. In chapter 4, we show how to use the slot-tree ontology to help the XML retrieval process. In chapter 5, we design a method to build the slot-tree automatically.
  • 28. Part 2 : Slot-Tree Based Methods for XML Retrieval

3 Slot-Tree Ontology and Slot-Filling Algorithm

In part 1, we introduced our motivation, goals and research approaches in chapter 1, and reviewed the related research on XML, information retrieval and ontology in chapter 2. In part 2, we present our method for reducing the semantic gap of XML retrieval. An ontology called the slot-tree is used to help the XML retrieval process in our system, and this part focuses on how the slot-tree ontology is used.

Part 2 contains three chapters. In chapter 3, we describe the syntax, semantics and usage of the slot-tree. In chapter 4, we use the slot-tree to reduce the semantic gap in the XML retrieval process. In chapter 5, we show how to construct the slot-tree ontology, and design a mining algorithm that builds it automatically.

This chapter contains four sections. In section 3.1, we outline the structure of the slot-tree ontology and its use in understanding XML documents. In section 3.2, we describe the syntax and semantics of the slot-tree ontology. In section 3.3, we design the slot-filling algorithm, which maps XML documents into the slot-tree ontology and is the core of the understanding process. Finally, we discuss the slot-tree ontology and the slot-filling algorithm in section 3.4.

3.1 Introduction

In this chapter, we design an object-based representation called the slot-tree ontology, and then use the slot-tree to understand XML documents. As we said in section 2.4, the word understand here means the process of mapping the text in XML into the slot-tree. This enables a computer to trigger the corresponding procedure and do what the user wants, such as answering questions or retrieving the documents that the user wants.
We outline the slot-tree ontology and the slot-filling algorithm in this section, and describe the slot-tree in detail in section 3.2 and the slot-filling algorithm in section 3.3.

The slot-tree is an object-based representation that, like a frame, represents the internal structure of objects. We surveyed the object-based approaches to knowledge representation, including frames, semantic networks and scripts, in section 2.4. Generally speaking, frames are used to represent the internal structure of objects, semantic networks are used to represent relations between objects, and scripts are used to represent scenarios that involve many objects. The object-based approach is conceptually consistent with our notion of the world, because we perceive the world as composed of many objects. The difference between a slot-tree and a frame is that a slot in a slot-tree contains a set of paths to locate nodes in
  • 29. XML documents. A path in a slot is in the XPath format described in section 2.1. For example, //butterfly//color is used to locate color nodes in the block of butterfly. In our XML retrieval system, a slot-tree is encoded in XML format as in the following example.

Example 3.1 A simple slot-tree in XML format
<s slot="butterfly" path="//butterfly">
  <s slot="color" path="//butterfly//adult//color">
    <v value="brown"/>
    <v value="white"/>
  </s>
</s>

Based on the slot-tree ontology, we design a slot-filling algorithm that maps XML documents into the slot-tree ontology in the process of understanding. In the slot-filling algorithm, a path in a slot is used like a hand to grasp a block in the XML document, and a matching process maps the content of the block into the slot. After the matching process, the words that match any value in a slot are filled into the slot. The filled slot-tree is then used as the semantic structure of the XML document. We give the details of the slot-tree ontology in section 3.2 and of the slot-filling algorithm in section 3.3.

3.2 Slot-Tree Ontology

In this section, we propose an ontology representation called the slot-tree. The slot-tree is an object-based representation that, like a frame, describes the internal structure of an object. We described the frame representation in section 2.4. Here we describe the syntax and semantics of the slot-tree and give examples.

Definition 3.1 : A slot-tree is a tree (T) in which each node contains a tuple (s, Ps, Vs), where s is the name of the slot, Ps is a set of paths, and Vs is a set of values. The name of a slot is a label that uniquely identifies the slot. A path (p) in Ps is a string in XPath format that is used to locate nodes in XML documents. A value (v) in Vs is a term that contains a set of semantically identical words or patterns.
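The idea of slot filling outlined in section 3.1 can be sketched with the slot-tree of Example 3.1. This is a minimal illustration under stated assumptions, not the slot-filling algorithm of section 3.3. Each slot is assumed to carry a single path, a value is considered matched when it occurs as a substring of the located text, and the sample butterfly document, the wrapper element and the function names are hypothetical. The sketch uses Python's xml.etree.ElementTree, which supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Slot-tree of Example 3.1.
SLOT_TREE = """
<s slot="butterfly" path="//butterfly">
  <s slot="color" path="//butterfly//adult//color">
    <v value="brown"/>
    <v value="white"/>
  </s>
</s>
"""

# Hypothetical document to be mapped into the slot-tree.
DOCUMENT = """
<butterfly name="kodairai">
  <adult><color>brown wings with white spots</color></adult>
</butterfly>
"""


def locate(doc_root, path):
    """Return the nodes matched by a path such as //butterfly//color."""
    # ElementTree evaluates paths relative to an element and never matches
    # the element itself, so the document is placed under a wrapper node.
    wrapper = ET.Element("wrapper")
    wrapper.append(doc_root)
    return wrapper.findall("." + path)


def fill_slots(slot, doc_root, filled):
    """Fill each slot with the values found in the text located by its path."""
    values = [v.get("value") for v in slot.findall("v")]
    text = " ".join(" ".join(node.itertext())
                    for node in locate(doc_root, slot.get("path")))
    filled[slot.get("slot")] = [v for v in values if v in text]
    for child in slot.findall("s"):  # recurse into sub-slots
        fill_slots(child, doc_root, filled)
    return filled


filled = fill_slots(ET.fromstring(SLOT_TREE), ET.fromstring(DOCUMENT), {})
print(filled)
```

The butterfly slot stays empty because it defines no values, while the color slot receives both brown and white from the free text of the color block.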
Figure 3.1 shows the structure of a slot-tree: the {p} in each node represents a set of paths and the {v} in each node represents a set of values. In a slot-tree that represents the internal structure of an object, a slot may represent a property of the object, such as its color, shape, texture or size. A value in the slot is a possible value of the property. For example, "black" is a possible value in the "color" slot.

Figure 3.1. The structure of a slot-tree (each node s carries a path set {p} and a value set {v})

A slot-tree can be encoded as an XML document in which each slot is encoded as a node with tag s. The attribute slot of the node is the label of the slot. The attribute path contains a set of paths in XPath format and encodes the {p} part of the slot. A node with tag v is a value and encodes the {v} part of the slot. Example 3.2 shows a slot-tree for butterflies in XML format and figure 3.2 shows the graph representation of the example.

Example 3.2. A slot-tree for butterflies in XML format

  <s slot="butterfly" path="//butterfly">
    <s slot="name" path="//butterfly//name"/>
    <s slot="adult" path="//butterfly//adult">
      <s slot="color" path="//butterfly//adult//color">
        <v value="black"/>
        <v value="brown"/>
        <v value="black&white"/>
      </s>
      <s slot="texture" path="//butterfly//adult//color">
        <v value="lines"/>
        <v value="spots"/>
      </s>
    </s>
  </s>
Figure 3.2 : The graph representation of slot-tree

Formally, the syntax of a slot-tree is defined by the grammar in figure 3.3. A slot (S) contains a label (NAME), a set of paths (P*) and a set of values (V*). The slot may also contain a set of sub-slots (S*). A value (V) contains a label (NAME), a set of keys (KEY*) and a set of matching rules (R*).

  S    → <s slot="NAME" path="P*"> V* S* </s>
  V    → <v value="NAME" keys="KEY*" match="R*"/>
  NAME → Alphabetical String
  KEY  → Alphabetical String
  where P is a path in XPath format and R is a rule.

Figure 3.3 : The grammar of slot-tree

The symbol P in figure 3.3 is a path in the format of the XML Path Language (XPath). XPath is a specification proposed by the World Wide Web Consortium (W3C) for locating nodes in XML documents. The symbol / matches child nodes, and the symbol // matches nodes at any depth inside the current node. A name prefixed with the @ symbol denotes an attribute. Example 3.3 shows several examples of XPath.

Example 3.3 : Examples of XML path language (XPath)

  a. /butterfly/adult/color
  b. //insect//color
  c. //insect[@type='butterfly']//col

The path of example 3.3.a locates color nodes that are children of an adult node, where the adult node is a child of the butterfly node. The path of example 3.3.b locates any color nodes inside the block of an insect node. The path of example 3.3.c locates any color nodes inside the block of an insect node whose attribute type has the value butterfly. For more about XPath, please see the XPath specification at http://www.w3.org/TR/xpath.

A rule in the slot-tree is used to match a string in XML. The syntax of a rule (R) is defined by the grammar in figure 3.4. A rule may contain the & operator, the | operator and the - operator. The symbol E is an expression that is part of a rule. Each expression is either a single literal L or a pattern of the form L..L.

  R → (R & R)
  R → (R | R)
  R → E
  R → -E
  E → L {..L}

Figure 3.4 : The grammar of rules in slot tree

The & operator is a logical and: a rule R1 & R2 is satisfied if and only if both R1 and R2 are satisfied. The | operator is a logical or: a rule R1 | R2 is satisfied if and only if R1 or R2 is satisfied. The - operator is a negation: a rule -E is satisfied if E is not matched, as example 3.4.b illustrates. The .. symbol in the syntax of E means a far connection: a rule L1 .. L2 is satisfied if an L1 string is followed by an L2 string within one sentence. The following example shows several rules.

Example 3.4 : Matching rules in slot-tree

  a. R = white & black
  b. R = lines & -spots
  c. R = black .. head

The rule of example 3.4.a matches a sentence like "a butterfly that is mixed of black and white color" or "a butterfly with white wing and black head". The rule of example 3.4.b matches a sentence such as "a butterfly with brown lines on wings", but does not match the sentence "a butterfly with brown lines and white spots on wings". The rule of example 3.4.c matches a sentence such as "a butterfly with black color on head", but does not match the sentence "a butterfly that has green head and black wings".
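The rule grammar of figure 3.4 can be exercised with a small recursive-descent evaluator. The sketch below is an illustration under stated assumptions rather than the matcher used in the thesis: the function names tokenize and matches are hypothetical, literals are matched as case-insensitive substrings, parentheses are optional, and & and | are applied left to right with equal precedence.

```python
import re

# Tokens of the rule language: "..", "(", ")", "&", "|", "-", or a literal.
TOKEN = re.compile(r"\s*(\.\.|[()&|-]|[A-Za-z]+)")


def tokenize(rule: str) -> list[str]:
    tokens, pos = [], 0
    rule = rule.strip()
    while pos < len(rule):
        m = TOKEN.match(rule, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}: {rule[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def matches(rule: str, sentence: str) -> bool:
    """Return True if the sentence satisfies the rule (figure 3.4)."""
    tokens = tokenize(rule)
    text = sentence.lower()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expression() -> bool:
        # E -> L {..L}: the literals must occur in this order in the sentence.
        start, ok = 0, True
        while True:
            literal = take().lower()
            if not literal.isalpha():
                raise ValueError(f"expected a literal, got {literal!r}")
            idx = text.find(literal, start) if ok else -1
            if idx < 0:
                ok = False
            else:
                start = idx + len(literal)
            if peek() == "..":
                take()
            else:
                return ok

    def term() -> bool:
        if peek() == "-":              # R -> -E
            take()
            return not expression()
        if peek() == "(":              # parenthesised sub-rule
            take()
            value = rule_body()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        return expression()            # R -> E

    def rule_body() -> bool:
        value = term()
        while peek() in ("&", "|"):
            op = take()
            right = term()
            value = (value and right) if op == "&" else (value or right)
        return value

    result = rule_body()
    if pos != len(tokens):
        raise ValueError("trailing tokens in rule")
    return result


print(matches("lines & -spots", "a butterfly with brown lines on wings"))
print(matches("black .. head", "a butterfly that has green head and black wings"))
```

Substring matching is the simplest choice and lets the key "line" match "lines"; a production matcher would more likely compare word stems.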
3.3 Slot-Filling Algorithm

A slot in a slot-tree is a container that may contain several fillers. The filler can be a value of a sub-slot. A slot-filling algorithm is a method to map fillers into slots. In this section, we describe how to map an XML document into the slot-tree ontology.

Example 3.5 : An XML document for a butterfly

  <butterfly about="Athyma_fortuna_kodairai.jpg">
    <adult>
      <texture>There are some eye spots in each wing</texture>
      <color>Brown background color, Eye spots in white color</color>
      <size>Middle size, 50-60mm</size>
    </adult>
    <geography>
      <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
      <global>Central China Area</global>
    </geography>
  </butterfly>

Example 3.6. A slot-tree for butterflies

  <s slot="butterfly" path="//butterfly">
    <s slot="name" path="//butterfly//name" type="copy"/>
    <s slot="adult" path="//butterfly//adult">
      <s slot="color" path="//butterfly//adult//color">
        <v value="black"/>
        <v value="brown"/>
        <v value="black&white"/>
      </s>
      <s slot="texture" path="//butterfly//adult//color">
        <v value="lines"/>
        <v value="spots"/>
      </s>
    </s>
  </s>

One simple way to fill values into the corresponding slot is by copying. A copy-slot is a slot that carries the attribute type="copy". The copy-slot is used to extract a value from a specified field. When the document of example 3.5 is filled into the slot-tree of example 3.6, the value Athyma_fortuna_kodairai is filled into the name slot simply by copying.
Another way to fill values into slots is by keyword matching. A value is filled into a slot if the value matches a sentence in the target XML document. The following example shows the process of matching the spotted value of the texture slot against the color node of an XML document.

Example 3.7 An example of filling a value into slot by keyword matching

  Texture block :
    <color> Brown background color, Eye spots in white color </color>

  Texture slot :
    <s slot="texture" path="//butterfly//texture">
      <v value="single color" keys="single, mono, uniform"/>
      <v value="spotted" keys="spot"/>
      <v value="lines" keys="line"/>
    </s>

  Matching result :
    <s slot="texture" values="spotted"/>

The slot-filling algorithm is designed to fill values into the slots of a slot-tree. In order to understand an XML document, we use the slot-filling algorithm to fill the document into the slot-tree. The output of the algorithm is a filled slot-tree, in which each node is filled with values. For a given XML document d, $d_s$ is the part of the document that is covered by slot s, $w(v, d_s)$ is the weight with which value v matches $d_s$, and $\varepsilon$ is a threshold. The output of the slot-filling algorithm is a set of slot-value (s, v) pairs.

  $\text{Slot-Filling}(d, T) = \{ (s, v) \mid s \in T,\ v \in V_s,\ w(v, d_s) > \varepsilon \}$

The following figure shows the pseudo code of the slot-filling algorithm, where M(T) denotes the (slot, path) pairs of T.

  Algorithm Slot-Filling(d, T)
    SV = {}
    for each s in T
      ds = {c | (s, p) ∈ M(T), (p, c) ∈ d}
      for each v in s
        if w(v, ds) > ε then
          put (s, v : w(v, ds)) into SV
      end for
    end for
    return SV

Figure 3.5 : The pseudo code of slot-filling algorithm
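The pseudo code of figure 3.5 can be turned into a short runnable sketch. The version below makes several assumptions that the text does not fix: the weight w(v, ds) is taken to be the number of key occurrences in ds, the threshold ε defaults to 0, a plain dictionary named SLOTS stands in for the slot-tree, and the helper names covered_text, weight and slot_filling are hypothetical. Paths are resolved with the limited XPath support of xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

DOC = """<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
    <size>Middle size, 50-60mm</size>
  </adult>
</butterfly>"""

# slot name -> (paths, {value: keys}); a stand-in for the slot-tree T.
SLOTS = {
    "color": (["//butterfly//adult//color"],
              {"black": ["black"], "brown": ["brown"], "white": ["white"]}),
    "texture": (["//butterfly//texture"],
                {"single color": ["single", "mono", "uniform"],
                 "spotted": ["spot"],
                 "lines": ["line"]}),
}


def covered_text(doc, paths):
    """d_s: the text of every node that some path of the slot locates."""
    # A synthetic parent lets a path such as //butterfly match the root element.
    holder = ET.Element("holder")
    holder.append(doc)
    chunks = []
    for path in paths:
        for node in holder.findall("." + path):
            chunks.append("".join(node.itertext()))
    return " ".join(chunks).lower()


def weight(keys, text):
    """w(v, d_s), assumed here to be the number of key occurrences in d_s."""
    return sum(text.count(key) for key in keys)


def slot_filling(doc, slots, epsilon=0):
    """Return {(slot, value): weight} for every value whose weight exceeds epsilon."""
    filled = {}
    for slot, (paths, values) in slots.items():
        text = covered_text(doc, paths)
        for value, keys in values.items():
            w = weight(keys, text)
            if w > epsilon:
                filled[(slot, value)] = w
    return filled


result = slot_filling(ET.fromstring(DOC), SLOTS)
print(result)
```

The two nested loops correspond directly to the two for-loops of the pseudo code, so the running time grows with the size of each covered block times the number of values in the slot.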
The time complexity of the slot-filling algorithm is $\sum_s |d_s| \cdot |V_s|$, where $|d_s|$ is the size of $d_s$ and $|V_s|$ is the number of values in slot s.

3.4 Discussion

In this chapter, we have described the slot-tree ontology in section 3.2 and the slot-filling algorithm in section 3.3. The slot-filling algorithm maps XML documents into the slot-tree ontology as part of the understanding process. In chapter 4, we use the slot-tree and the slot-filling algorithm to develop an ontology-based XML retrieval method, and use that method to reduce the semantic gap between human and computer.
4 An Ontology-Based Approach for XML Querying, Retrieval and Browsing

The previous chapter presented the slot-tree ontology and its usage. The slot-filling algorithm builds a mapping between the slot-tree and XML documents, and this mapping helps our XML retrieval system reduce the semantic gap between human and computer. In this chapter, we outline the relationship between our XML retrieval system and the slot-tree ontology, and show the power of the slot-tree.

Section 4.1 describes the process of our XML retrieval system and outlines its important components. Section 4.2 describes how to represent XML documents for retrieval, and section 4.3 describes the index structure. The query interface is described in section 4.4 and the ranking strategies in section 4.5. Section 4.6 shows how to organize retrieval results for browsing. Finally, section 4.7 discusses our XML retrieval system.

4.1 Introduction

Searching for documents requires two techniques: retrieving and browsing. Retrieving is the process of selecting documents from a collection. The retrieved documents should then be organized for browsing. Browsing is the process of reading and traversing the collection of documents. In a search we usually alternate between retrieving and browsing, so a model that integrates the two may improve the quality of searching. Our research focuses on using an ontology to improve the XML retrieval and browsing process. We focus on the following questions in this chapter.

1. How to encode XML documents for retrieval?
2. How to use slot-tree ontology to improve the efficiency of querying?
3. How to use slot-tree ontology to improve the efficiency of retrieval?
4. How to use slot-tree ontology to improve the efficiency of browsing?
Figure 4.1 shows a scenario of our approach to retrieving XML documents. First, a user builds a query by clicking or typing on slots in the query interface, and then submits the query to the XML retrieval system. The retrieval system retrieves XML documents and then summarizes them for the user to browse.
Figure 4.1 : A scenario of our XML retrieval system

The ontology in figure 4.1 is the slot-tree ontology described in chapter 3. It is the core of our XML retrieval system. The slot-tree ontology is used to build the query interface, to retrieve documents and to summarize the retrieved documents for browsing. The XML queries, the XML documents and the query interface are the important objects in our system, and retrieval and extraction are its important processes. We introduce these objects and processes in this chapter.

4.2 XML Documents

An XML document is encoded as tree-structured text. Figure 4.2 shows an XML document that describes a butterfly.

  <butterfly about="Athyma_fortuna_kodairai.jpg">
    <adult>
      <texture>There are some eye spots in each wing</texture>
      <color>Brown background color, Eye spots in white color</color>
      <size>Middle size, 50-60mm</size>
    </adult>
    <geography>
      <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
      <global>Central China Area</global>
    </geography>
  </butterfly>

Figure 4.2 : An XML document of butterfly
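As a hedged illustration of how such tree-structured text can be handled by a program, the sketch below parses the butterfly document with Python's standard xml.etree.ElementTree module and lists every node as a (path, value) entry together with the (begin, end) indices of its block, which is the representation this section goes on to describe. The function name flatten and the "/" separator inside paths are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
    <size>Middle size, 50-60mm</size>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>"""


def flatten(root):
    """Return (begin, end, path, value) rows in document order."""
    rows = []

    def next_id():
        return len(rows) + 1

    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}" if prefix else elem.tag
        children = list(elem)
        text = " ".join((elem.text or "").split())
        # Opening entry; its end index is filled in once the block is closed.
        opening = [next_id(), None, path, "" if children else text]
        rows.append(opening)
        for name, value in elem.attrib.items():
            rows.append([next_id(), next_id(), f"{path}@{name}", value])
        if not children and not elem.attrib:
            opening[1] = opening[0]          # a leaf covers only itself
            return
        for child in children:
            visit(child, path)
        closing = next_id()
        rows.append([closing, closing, path, ""])   # closing entry of the block
        opening[1] = closing

    visit(root, "")
    return [tuple(row) for row in rows]


rows = flatten(ET.fromstring(DOC))
for row in rows:
    print(row)
```

Because a block's (begin, end) range encloses the ranges of everything inside it, ancestor and descendant relations can be tested by comparing integers, without rebuilding the tree.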
For conceptual simplicity, the XML example above is expressed as a sequence of (path, value) pairs that describe the object.

  (butterfly, )
  (butterfly@about, Athyma_fortuna_kodairai.jpg)
  (butterfly/adult, )
  (butterfly/adult/texture, There are some eye spots in each wing)
  (butterfly/adult/color, Brown background color, Eye spots in white color)
  (butterfly/adult/size, Middle size, 50-60mm)
  (butterfly/adult, )
  (butterfly/geography, )
  (butterfly/geography/taiwan, North-Taiwan, 1000-2000meters mountain area)
  (butterfly/geography/global, Central China Area)
  (butterfly/geography, )
  (butterfly, )

Figure 4.3 : The (path, value) expression of an XML document

The (path, value) expression can be thought of as an object concept model. A path specifies a property of an object, and a value specifies a value for that property. The object concept model is a binary relation that may be written as path(object, value): a path acts as a logical predicate with two arguments. An object in this model is expressed as a set of (path, value) pairs.

Storing Structure

The (path, value) representation does not reflect the tree structure of an XML document. In order to represent the tree structure, we use a pair of indices to mark the beginning and the end of each block. In other words, we extend each (path, value) pair with a (begin, end) pair that identifies the begin node and the end node of the block. The butterfly example above is expressed as the following structure.

  1, 12    (butterfly, )
  2, 2     (butterfly@about, Athyma_fortuna_kodairai.jpg)
  3, 7     (butterfly/adult, )
  4, 4     (butterfly/adult/texture, There are some eye spots in each wing)
  5, 5     (butterfly/adult/color, Brown background color, Eye spots in white color)
  6, 6     (butterfly/adult/size, Middle size, 50-60mm)
  7, 7     (butterfly/adult, )
  8, 11    (butterfly/geography, )
  9, 9     (butterfly/geography/taiwan, North-Taiwan, 1000-2000meters mountain area)
  10, 10   (butterfly/geography/global, Central China Area)
  11, 11   (butterfly/geography, )
  12, 12   (butterfly, )

Figure 4.4 : The storing structure of an XML document

In the example above, each node is led by a (begin, end) pair. The begin index of a node is always identical to the ID of the node. A block with (begin, end) covers all nodes between the begin node and the end node. For example, the first block 1,12 (butterfly, ) covers nodes 1 to 12, and the third block 3,7 (butterfly/adult, ) covers nodes 3 to 7. In this way, the tree structure of XML is expressed as cover/covered relations between nodes, and the begin-end pairs fully reflect the hierarchical structure of the XML document. In our XML storage system, we store the (begin, end) pairs in a table instead of storing a tree.

4.3 Indexing structure

Based on the Path Vector Space Model (PVSM, see section 4.5), we index (p, t) pairs instead of single terms (t) in our XML retrieval system. There are several data structures for full-text indexing, such as the inverted file, the signature file and the Patricia trie. For simplicity, we use the inverted file as the index structure of our XML retrieval system. The following example is a simple XML document. We show how to index it, for both text fields and number fields.

Example 4.4 An XML document for butterfly

  <butterfly about="kodairai">
    <adult>
      <color>brown</color>
      <texture>spot</texture>
      <size>50-60mm</size>
    </adult>
  </butterfly>

Indexing Text : The inverted file is currently stored in a relational database. The following figure shows an inverted file for the example above.

  #path, #term                        #object list
  ...                                 ...
  #butterfly/adult/color, #brown      ..., #kodairai, #...
  ...                                 ...
  #butterfly/adult/texture, #spot     ..., #kodairai, ...
  ...                                 ...

Figure 4.5 An example of text index in inverted file format

Indexing Number : Traditional full-text indexing technology does not index numbers. In our system, number indexing is important for the browsing process, because it allows us to sort the search results in a specified order. In the indexing process, we extract numbers from the XML documents and put them into a number table as follows.

  #object, #path                      Number
  ...                                 ...
  #kodairai, #butterfly/adult/size    50
  #kodairai, #butterfly/adult/size    60
  ...                                 ...

Figure 4.6 An example of number index

4.4 Query Language and Query Interface

XML may be used to encode metadata instead of data. Metadata is data that describes other data. We may use metadata to describe objects such as audio, video, people, etc. Based on metadata, we may index images, video and audio in text format, so that objects can be queried by number and text fields in our XML retrieval system.

In our system, we design a program that transforms a slot-tree into an HTML-based query interface. A template in Extensible Stylesheet Language Transformations (XSLT) is used to do the transformation. In our query interface, a value can be expressed as a string, a range of numbers, or an object. A user may specify the value for a slot simply by clicking a value or an icon in the slot. Our retrieval system is therefore used not only to retrieve text-based documents, but also to retrieve images or video. The following figure shows a query interface for butterflies.
Figure 4.7 : The Query Interface for Butterflies

A user may select a slot with one click, and then select a value in the slot or type keywords into it. The user may also specify a field for sorting. When the user presses the submit button, a query is built and submitted to the XML retrieval system. A query in our system is a filled slot-tree. The following example shows the query "find all butterflies with broken wing and brown color".

  <s slot="butterfly" path="//butterfly">
    <s slot="color" path="//butterfly//adult//color" keys="brown"/>
    <s slot="shape" path="//butterfly//adult//shape" keys="broken"/>
  </s>

4.5 Ranking Strategy

The ranking strategy for XML retrieval is closer to that of a database than to that of text retrieval. We may rank the retrieval result by any field in the XML documents. For example, we may sort the retrieval result by the size of the butterflies. We may also sort the retrieval result by the similarity between document and query, or by the importance of the documents. In this section, we show the ranking strategies used to sort the retrieval results.

Ranking by Field

In order to organize the retrieval result for browsing, a user may specify the ranking strategy. As in a database, the user may specify any field by which to sort the result. A field can be sorted as numbers by magnitude or as strings in alphabetical order, in either increasing or decreasing order. This variety of ranking strategies gives users a way to organize the retrieval result into a list for browsing.

Ranking by Importance

Section 2.2 introduced how to measure the importance of a web page based on hyperlinks. Hyperlinks in XML may be used to decide the importance of an XML document, too. In our XML retrieval system, ranking by importance is the default ranking strategy. A simple way to measure the importance of an XML document is to count the references to it. We use this strategy in our system for simplicity. In the future, we will try to incorporate the random-walk model and the hub-authority model to measure the importance of XML documents in our XML retrieval system.

Ranking by Similarity

For text retrieval, a ranking strategy based on the vector space model (VSM) and the TFIDF weighting function performs well. A brief survey of VSM and TFIDF was given in section 2.3. However, an XML object is not only a sequence of words like a text, but also contains many tags. For XML, we extend VSM by attaching a path to each term; the result is called the Path Vector Space Model (PVSM). An XML document (d) can be expressed as the following vector v(d).

  $v(d) = (d_{p_1,t_1}, \ldots, d_{p_1,t_k}, \ldots, d_{p_n,t_1}, \ldots, d_{p_n,t_k})$

  where $d_{p_i,t_j}$ is the weight of the $(p_i, t_j)$ pair in document object d.

When several paths have similar meaning, we may cluster them into a slot for retrieval.
The model obtained after clustering paths is called the Slot Vector Space Model (SVSM).

  $v(d) = (d_{s_1,t_1}, \ldots, d_{s_1,t_k}, \ldots, d_{s_n,t_1}, \ldots, d_{s_n,t_k})$

  where $d_{s_i,t_j}$ is the weight of the $(s_i, t_j)$ pair in document object d.

We may use the cosine coefficient to measure the similarity between queries and documents in SVSM, just as in VSM.

  $\mathrm{Similarity}(d, q) = \dfrac{d \cdot q}{\|d\| \, \|q\|}$
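The clustering of paths into slots and the cosine coefficient can be illustrated with sparse vectors keyed by (slot, term). This is a minimal sketch under assumptions: every pair is given weight 1, the table PATH_TO_SLOT and the path butterfly/larva/color are hypothetical, and the function names slot_vector and cosine are chosen for this illustration.

```python
import math
from collections import Counter

# Hypothetical clustering of paths with similar meaning into slots.
PATH_TO_SLOT = {
    "butterfly/adult/color": "color",
    "butterfly/larva/color": "color",
    "butterfly/adult/texture": "texture",
}


def slot_vector(pairs):
    """Map (path, term) pairs to a sparse slot vector {(slot, term): weight}."""
    vector = Counter()
    for path, term in pairs:
        slot = PATH_TO_SLOT.get(path)
        if slot is not None:
            vector[(slot, term)] += 1.0   # assumed weight: raw frequency
    return vector


def cosine(d, q):
    """Cosine coefficient between two sparse vectors."""
    dot = sum(weight * q.get(key, 0.0) for key, weight in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)


document = slot_vector([
    ("butterfly/adult/color", "brown"),
    ("butterfly/adult/color", "white"),
    ("butterfly/adult/texture", "spot"),
])
query = slot_vector([("butterfly/larva/color", "brown")])

print(cosine(document, query))
```

The query path differs from the document path, yet both fall into the color slot, so the brown term still contributes to the similarity. This is the effect that clustering paths into slots is meant to achieve.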
However, we do not yet know which weighting function is suitable for the value $d_{s_i,t_j}$: is TFIDF good enough in the SVSM, or do we need another measure? In our system, we express $d_{s_i,t_j}$ as the product of $w_{s_i,t_j}$ and $tf_{s_i,t_j}$, where $tf_{s_i,t_j}$ is the term frequency of the term $t_j$ in slot $s_i$ and $w_{s_i,t_j}$ is the weighting coefficient.

A difficulty for retrieval systems today is that too many documents are retrieved. When there are too many retrieval results to browse, the ranking strategy is used to bring what users want to the top. A user may like to see large butterflies, important butterflies or butterflies that are similar to a query. The variety of ranking strategies for XML gives users ways to retrieve only what they want to browse.

4.6 Browsing XML documents

In an information retrieval system, the retrieved documents should be summarized and organized into a readable format for people to browse. In our XML retrieval system, the slot-filling algorithm is used to map the retrieved documents into filled slot-trees for browsing. The filled slot-tree is a summary of a document that is well organized and easy to browse. In this section, we show an example in which the slot-filling algorithm fills an XML document into a slot-tree. Before that, we show the XML document and the slot-tree used in the example. The following example shows a simple slot-tree for butterflies.

  <s slot="butterfly" path="//butterfly">
    <s slot="name" path="//butterfly//name"/>
    <s slot="adult" path="//butterfly//adult">
      <s slot="color" path="//butterfly//adult//color">
        <v value="black"/>
        <v value="brown"/>
        <v value="black&white"/>
      </s>
      <s slot="texture" path="//butterfly//adult//color">
        <v value="lines"/>
        <v value="spots"/>
      </s>
    </s>
  </s>

We may use the slot-filling algorithm to extract values from the following XML document.

  <butterfly about="Athyma_fortuna_kodairai.jpg">
    <adult>
      <texture>There are some eye spots in each wing</texture>
      <color>Brown background color, Eye spots in white color</color>
    </adult>
    <geography>
      <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
      <global>Central China Area</global>
    </geography>
  </butterfly>

The slot-filling algorithm fills values into the slot-tree. The following example shows the result of filling.

  <s slot="butterfly" values="Athyma_fortuna_kodairai">
    <s slot="adult">
      <s slot="texture" values="spot"/>
      <s slot="color" values="brown"/>
    </s>
    <s slot="geography">
      <s slot="Taiwan" values="North"/>
      <s slot="Global" values="China"/>
    </s>
  </s>

The result of the slot-filling algorithm is a filled slot-tree. For a human reader, it is easier to browse filled slot-trees than to browse the source documents. The filled slot-tree is a well-organized summary of the XML document.

4.7 Discussion

In this chapter, we designed an XML retrieval system to reduce the semantic gap between human and computer. The slot-tree ontology and the slot-filling algorithm are used in our XML retrieval system to understand XML documents. Based on the slot-tree, we designed a query interface that reduces the semantic gap on the query side: the interface helps people write XML queries easily. Based on the slot-filling algorithm, we designed the slot vector space model (SVSM) to retrieve XML documents. The SVSM helps the computer understand XML documents. In addition, the slot-filling algorithm helps the computer extract summaries from XML documents for browsing. By using the slot-tree as the core representation, our goal of reducing the semantic gap between human and computer is largely achieved.

We study two cases of our XML retrieval system in chapter 6 and chapter 7. Chapter 6 uses the domain of butterflies as an example and chapter 7 uses the domain of proteins. For each domain, we show the slot-tree, the query interface, the retrieved results and the summary.
5 The Construction of Slot-Tree Ontology

We introduced the slot-tree ontology in chapter 3, and then showed an XML retrieval system based on the slot-tree ontology in chapter 4. However, building a slot-tree ontology is not an easy job. In order to reduce the effort needed, we have developed the slot-mining algorithm, a statistical approach that learns a slot-tree from a collection of XML documents.

Section 5.1 gives an overview of mining approaches. Section 5.2 provides background on text-mining technology. Section 5.3 shows how to construct a slot-tree by hand for a given XML collection. Section 5.4 describes a method to mine a slot-tree from XML documents, called the slot-mining algorithm. Finally, section 5.5 discusses the building of slot-trees.

5.1 Introduction

The goal of text mining is to find important patterns in a text collection and organize these patterns into an ontology. In this thesis, we use the ontology to help XML retrieval and browsing. Mining technology may be used to help us construct the slot-tree ontology. In this section, we focus on the text-mining problem for XML.

The slot-tree is an ontology representation method. Our mining approach is to build an XML-mining program that induces values for each slot. In this section, we assume for simplicity that each value is represented by a term (or a word). Based on this assumption, we developed a statistical program to mine values for each slot.

The semi-structured nature of XML is what makes the mining program work. In a given XML collection, the distribution of a term depends strongly on the tags. For example, the following terms show up more frequently in the <color> block than in other blocks.

  <color> black, white, yellow, blue, green </color>

The problem of mining the important values for each slot is called the Slot-Mining Problem.
We propose a mining algorithm that is based on a simple observation: the distribution of terms depends on the tag. A term that shows up more frequently inside a tag than elsewhere is likely to be a key value for the corresponding slot.
5.2 Background

The goal of text mining is to discover regularities in text data. A text-mining program induces rules from text or learns a grammar from a corpus; these rules are then used in natural language understanding and information extraction.

For natural language processing, the inside-outside algorithm is a popular tool for learning a probabilistic context-free grammar (PCFG) from a tree-bank corpus. However, a tree-bank corpus is not easy to build: building a tree-bank by hand is a time-consuming job. Other text-learning methods have been developed to learn from a plain text corpus. For example, link grammar is a simple head-driven grammar developed to parse natural language sentences, and a learning algorithm has been developed to learn a link grammar from a text corpus. Besides that, transducer learning induces finite-state automata from a given text corpus. Learning a transducer is easier than learning a context-free grammar.

For information extraction, a wrapper-learning algorithm learns a simple grammar from structured text such as web pages, and induces rules for wrapping the documents. For example, a simple wrapper may learn the prefix and postfix of each field from a collection of program-generated web pages. We may then extract fields from web pages based on these prefixes and postfixes. A transducer may also be used to learn extraction rules from a collection of web pages.

However, these methods learn the grammar of the input text; they do not learn an ontology from a given document collection. In section 5.4 we propose a learning algorithm, called the slot-mining algorithm, that mines a slot-tree ontology from a given XML collection. The algorithm is a tool that helps the domain-knowledge designer design the slot-tree ontology.
Before presenting the slot-mining algorithm, section 5.3 shows how a person builds a slot-tree, in order to observe what such an algorithm needs.

5.3 The process of building a slot-tree

In order to show the ontology design process, we trace the design steps of a simple slot-tree for butterflies. There are six steps in designing a slot-tree.

1. Browse XML data.
2. Identify object boundary.
3. List all tags in this domain.
4. Identify slots for this domain.
5. Map each slot to tags (or XPath).
6. Identify values for each slot.
Browsing XML data : The first step in designing a slot-tree is to browse the data in order to understand it. What is the structure of the XML collection? Can we identify the object boundaries in the XML documents? What is the meaning of each tag? Does each tag correspond to a slot? What are the candidate values for a slot? We have to answer these questions before constructing a slot-tree.

Identifying object boundaries : An object-block is an XML block that corresponds to an object. We have to identify the boundaries of object-blocks to find out what objects the collection contains. For example, in our butterfly collection, a <butterfly></butterfly> block is the boundary of a butterfly object.

Listing all tags in this collection : An XML tag usually has strong semantic meaning. For example, the <color> tag represents the color of a butterfly. We may list all tags in order to understand the semantics of each one. For the simple butterfly collection, the tags are as follows.

  butterfly, adult, texture, color, size, geography, taiwan, global

Identifying slots for this collection : Fortunately, these tags are not ambiguous. The semantics of the tags are clear and definite, so we may build a slot for each tag.

Mapping slots to tags (or XPath) : For the simple butterfly collection, we can map each tag to one slot directly. The following example shows the schema of the slot-tree.

  <s slot="butterfly">
    <s slot="adult">
      <s slot="texture"/>
      <s slot="color"/>
      <s slot="size"/>
    </s>
    <s slot="geography">
      <s slot="Taiwan"/>
      <s slot="Global"/>
    </s>
  </s>

Identifying values for each slot : In order to identify the values for each slot, we have to read the data for that slot. For example, if we read the data in the <color> tag, we may find that black, white, brown, orange, yellow, green, blue, purple and gray are key values for this slot. We may fill them into the value list of the color slot. Once values have been filled in for every slot, the slot-tree building process is finished.
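The tag-listing and slot-mapping steps above are mechanical enough to be sketched in code. The following Python illustration assumes, as the text does for the simple butterfly collection, that every tag is unambiguous and maps to exactly one slot. It works on a single document, and the function names collect_paths and schema are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<butterfly about="Athyma_fortuna_kodairai.jpg">
  <adult>
    <texture>There are some eye spots in each wing</texture>
    <color>Brown background color, Eye spots in white color</color>
    <size>Middle size, 50-60mm</size>
  </adult>
  <geography>
    <taiwan>North-Taiwan, 1000-2000meters mountain area </taiwan>
    <global>Central China Area</global>
  </geography>
</butterfly>"""


def collect_paths(root):
    """List every distinct tag path in document order (listing all tags)."""
    seen = []

    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        if path not in seen:
            seen.append(path)
        for child in elem:
            visit(child, path)

    visit(root, "")
    return seen


def schema(elem):
    """Build one <s> slot per tag (identifying slots and mapping them to tags)."""
    slot = ET.Element("s", {"slot": elem.tag})
    names = set()
    for child in elem:
        # Simplification: only the first sibling with a given tag is expanded.
        if child.tag not in names:
            names.add(child.tag)
            slot.append(schema(child))
    return slot


document = ET.fromstring(DOC)
paths = collect_paths(document)
tree = schema(document)

print(paths)
print(ET.tostring(tree, encoding="unicode"))
```

Identifying the values for each slot is the step that this sketch leaves out, and it is also the most laborious one to do by hand.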
The following XML document shows a slot-tree for the simple butterfly collection.
  <s slot="butterfly">
    <s slot="adult">
      <s slot="color">
        <v value="black"/><v value="white"/><v value="brown"/>
        <v value="yellow"/><v value="orange"/><v value="green"/>
        <v value="blue"/><v value="purple"/><v value="gray"/>
      </s>
      <s slot="texture">
        <v value="single color" keys="single, mono, uniform"/>
        <v value="spotted" keys="spot"/>
        <v value="lines" keys="line"/>
      </s>
      <s slot="size">
        <v value="small"/><v value="middle"/><v value="large"/>
      </s>
    </s>
    <s slot="geography">
      <s slot="Taiwan">
        <v value="north"/><v value="center"/><v value="south"/><v value="east"/>
      </s>
      <s slot="Global">
        <v value="Europe"/><v value="China"/><v value="India"/>
        <v value="America"/><v value="Australia"/>
      </s>
    </s>
  </s>

In the slot-tree example above, a <v> tag represents a value in a slot. The simplest value is a keyword. We may also specify a set of keywords or rules for a value, as for the "single color" value in the texture slot.

The last step, identifying values for each slot, is the most labor-intensive step of the whole slot-tree building process. In order to construct slot-trees automatically, we develop the slot-mining algorithm, which mines a slot-tree from XML documents, in the next section.

5.4 Slot-mining algorithm
  • 49. A slot-mining algorithm mines slot-tree from XML documents. The first step is to extract paths in XML documents to build a schema. The second step is using statistical correlation analysis to find out what terms is important for these paths. After that, a slot-tree is built that each slot corresponds to a path in XML documents. The following figure shows a concept model of the slot-mining algorithm. Figure 5.1 The process of slot-mining algorithm Before we describe the algorithm, we have to define some mathematics notation for it. Definition : Slot-Vector A slot-vector is a vector of (slot, term) pairs for a given collection of XML blocks (B). v(B) = (Bs1,t1, , B s1,tk , ,Bsn,t1, ,Bsn,tk) B si,tj is the weight of (tj) shows up in blocks for slot(sj) of B |B| is the abbreviation for ∑s,t Bs,t |Bt| is the abbreviation for ∑s Bs,t |Bs| is the abbreviation for ∑ t Bs,t Definition : Slot-Vector Space Model (SVSM) The model of represent XML document by Slot-Vector is called Slot-Vector Space Model. Example 1. A slot-vector for a given collection (D) is represented as the following formula.
  • 50. $v(D) = (D_{s_1,t_1}, \dots, D_{s_1,t_k}, \dots, D_{s_n,t_1}, \dots, D_{s_n,t_k})$
2. A slot-vector for a specified slot (s) of collection (D) is represented by the following formula.
$v(D_s) = (D_{s,t_1}, \dots, D_{s,t_k})$
3. A slot-vector for a given document (d) is represented by the following formula.
$v(d) = (d_{s_1,t_1}, \dots, d_{s_1,t_k}, \dots, d_{s_n,t_1}, \dots, d_{s_n,t_k})$
4. A slot-vector for a specified slot (s) of document (d) is represented by the following formula.
$v(d_s) = (d_{s,t_1}, \dots, d_{s,t_k})$
Slot-Mining Problem
Given an XML document collection (D) and a set of slots (S), find the key values for each slot : v(s).
Slot-Mining Algorithm
The slot-vector for D is $v(D) = (D_{s_1,t_1}, \dots, D_{s_1,t_k}, \dots, D_{s_n,t_1}, \dots, D_{s_n,t_k})$
Let $|D_t| = \sum_{i} D_{s_i,t}$
The slot-vector for $D_s$ is $v(D_s) = (D_{s,t_1}, \dots, D_{s,t_k})$
$v_{>r}(s) = \{\, t \mid D_{s,t} / |D_s| > r \cdot |D_t| / |D| \,\}$
$v_{>r}(s)$ is called the r-key-set for slot (s). A term belongs to it when its relative frequency inside the slot is more than r times its relative frequency in the whole collection.
In our XML-mining system, we set the parameter (r = 2.0) to extract the key values for each slot. The following figure shows the pseudo code of the slot-mining algorithm.
Algorithm Slot-Mining (D)
  P = {p | p is a path in D}
  for each (p,t) in D
    |Dp,t| = |Dp,t|+1
    |Dp| = |Dp|+1
    |Dt| = |Dt|+1
  • 51.     |D| = |D|+1
  end for
  for each (p,t) in PT
    p(t|p) = |Dp,t| / |Dp|
    p(t) = |Dt| / |D|
    if p(t|p)/p(t) > r then put (p,t) into SV
  end for
  return SV
Figure 5.2 The Pseudo Code of Slot-Mining Algorithm
The slot-mining algorithm mines values from the XML collection D. The mined values should then be edited and organized into a slot-tree to improve its quality. Let us look at a mining example for the slot "color".
Example :
<color> head, brown, yellow, body, white, wing, gray, blue, black, background, line, spot </color>
In the mining result above, brown, yellow, white, gray, blue and black are what we want, but head, body, wing, background, line and spot are noise words. So far, we cannot distinguish these two groups by statistical methods alone. One possible solution is to use a dictionary such as WordNet to tell them apart. We will try this solution in the future.
5.5 Discussion
In order to help people construct slot-tree ontologies, we developed a slot-mining algorithm that mines slot-trees from a collection of XML documents. The algorithm serves as an authoring tool for the slot-tree ontology. Our approach is based on statistical correlation analysis between tags and terms. The correlation analysis decides which terms are important for a given tag, and fills those terms into the slot of that tag.
The automatically constructed slot-tree needs some manual modification to improve its quality. First, we have to merge paths with the same meaning into a single slot in order to simplify the structure of the slot-tree. Second, we have to delete incorrectly mined values and merge values with the same meaning in order to improve the quality of each slot.
  • 52. The slot-mining algorithm is used to construct the ontology for butterflies in section 6.7 and the ontology for proteins in section 7.7. The mined slot-trees are shown in those sections.
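The pseudo code in Figure 5.2 can be sketched in Python as follows. This is a minimal illustration, not the system's actual implementation: it assumes the collection has already been flattened into (path, term) occurrences, and the function name `slot_mining` and the toy data are hypothetical.

```python
from collections import Counter


def slot_mining(pairs, r=2.0):
    """Return the (path, term) pairs whose ratio p(t|p) / p(t) exceeds r.

    `pairs` is an iterable of (path, term) occurrences; this input format
    is an assumption made for the sketch.
    """
    pairs = list(pairs)
    count_pt = Counter(pairs)                # |D_{p,t}|
    count_p = Counter(p for p, _ in pairs)   # |D_p|
    count_t = Counter(t for _, t in pairs)   # |D_t|
    total = len(pairs)                       # |D|

    selected = set()
    for (p, t), c in count_pt.items():
        p_t_given_p = c / count_p[p]
        p_t = count_t[t] / total
        if p_t_given_p / p_t > r:
            selected.add((p, t))
    return selected


# Toy data: "wing" appears equally often under every path, so its ratio is 1.0.
toy = (
    [("butterfly/adult/color", "brown")] * 4 + [("butterfly/adult/color", "wing")]
    + [("butterfly/adult/size", "large")] * 4 + [("butterfly/adult/size", "wing")]
    + [("butterfly/adult/shape", "round")] * 4 + [("butterfly/adult/shape", "wing")]
)
key_values = slot_mining(toy, r=2.0)
```

In this toy collection each path-specific term has a ratio of 3.0 and is kept, while "wing" is dropped. A noise word that really is concentrated in one slot, such as "head" in the color example of section 5.4, would still pass the test, which is the limitation discussed there.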
  • 53. Part 3 : Case Studies
6 Case Study - A Digital Museum of Butterflies
In part 2, we described our methods, including the slot-tree ontology, the slot-filling algorithm, the slot-vector space model and the slot-mining algorithm. These methods are used to build a semantic retrieval system for XML. In this part, we test our methods on two XML collections: a collection for butterflies and a collection for proteins. In chapter 6, we test our methods on the collection "A Museum of Butterflies in Taiwan" (MBT). In chapter 7, we test our methods on the collection "Protein Information Resource" (PIR). Both collections are encoded in XML format.
In this chapter, an overview of MBT is given in section 6.1. A source XML document of MBT is shown in section 6.2. A slot-tree for MBT is described in section 6.3. A query interface based on the slot-tree is described in section 6.4. The slot-filling process for MBT is described in section 6.5. The retrieval process for MBT is discussed in section 6.6. The mining process that builds the slot-tree for MBT is discussed in section 6.7. A discussion of our approach on MBT is given in section 6.8.
6.1 Introduction
The Digital Museum of Butterflies is a collection about the butterflies of Taiwan. Each document in this collection describes one species of butterfly in Taiwan. The following table is a profile of this collection.
Table 6.1 : A Museum of Butterflies in Taiwan
Collection : A Museum of Butterflies in Taiwan
Working Group :
  NMNS : National Museum of Natural Science, Taiwan. URL : http://www.nmns.edu.tw/
  NCNU : National Chi-Nan University, Taiwan. URL : http://dlm.ncnu.edu.tw/butterfly/index.htm
  NTU : National Taiwan University, Taiwan. URL : http://turing.csie.ntu.edu.tw/ncnudlm/
Size : 356 species, 356 XML documents
Language : Tags in English, content in Chinese
The collection contains one XML document for each of the 356 species of butterfly in Taiwan.
Roughly speaking, the tags may be classified into the following groups.
  • 54. Table 6.2 : XML tags for butterflies in Taiwan
Group : Fields
Classification : name, family, cfamily (Chinese family), genus, species, subspecies
Host : host plant, honey plant
Geography : Taiwan, global
Egg : color, shape, feature, characteristic, days of growth, enemy
Larva : color, shape, feature, characteristic, days of growth, enemy
Pupa : color, shape, feature, characteristic, days of growth, enemy
Adult : color, shape, texture, characteristic, life period, enemy
6.2 The Representation of Butterflies in XML
The following figure shows an XML document for the butterfly "kodairai".
<butterfly>
  <cname> </cname>
  <classification>
    <family>Nymphalidae</family>
    <cfamily> </cfamily>
    <genus>Athyma</genus>
    <species>fortuna</species>
    <sub_species>kodairai</sub_species>
  </classification>
  <hostplant> (Caprifoliaceae) (Viburnum luzonicum var. matsudai) </hostplant>
  <honeyplant> </honeyplant>
  <geographic>
    <taiwan> 1000-2000 </taiwan>
    <global> </global>
  </geographic>
  <life_stage>
    <egg>
      <feature> </feature>
      <color> </color>
      <size> 1.1-1.3mm </size>
      <predator> </predator>
      <days_of_growth> 5-6 </days_of_growth>
    </egg>
    <larva>
      <feature> </feature>
      <color> </color>
  • 55.       <size> 33-41 mm </size>
      <predator> </predator>
      <days_of_growth> </days_of_growth>
      <defense> </defense>
    </larva>
    <pupa>
      <feature> </feature>
      <color> </color>
      <size> 22-27mm </size>
      <predator> </predator>
      <days_of_growth> 15-20 </days_of_growth>
      <defense> </defense>
    </pupa>
    <adult>
      <feature> </feature>
      <color> </color>
      <size> 50-60mm </size>
      <characteristic> </characteristic>
      <habitate> </habitate>
      <predator> </predator>
      <days_of_growth> </days_of_growth>
      <defense> </defense>
      <season> </season>
      <behavior> </behavior>
    </adult>
  </life_stage>
</butterfly>
Figure 6.1 : An XML document for butterfly (Full List)
6.3 Slot-Tree Ontology for Butterflies
  • 56. Our ontology is represented as a slot-tree in XML format. The slot-tree we designed for butterflies is consistent with the target collection; both follow the schema below.
<butterfly>
  <classification/>
  <Geography/>
  <life-period>
    <Egg/>
    <Larva/>
    <Pupa/>
    <Adult/>
  </life-period>
</butterfly>
Each object in the life period (egg, larva, pupa, adult) has a sub-schema that describes the object. The sub-schema looks like the following tree.
<object>
  <Color/>
  <shape/>
  <feature/>
  <size/>
</object>
The consistency between the slot-tree and the documents eases our design process. It also eliminates ambiguity in the retrieval and browsing processes. Conversely, a poorly designed XML document structure makes the domain-knowledge design process difficult, and makes the domain knowledge less helpful for retrieval and browsing.
A fragment of the slot-tree for butterflies is shown in the following figure. For the full slot-tree, please see appendix 1.
<butterfly>
  <family slot=" " path="//butterfly//cfamily//">
    <v value=" " keys="Hesperiidae"/><v value=" " keys="Lycaenidae"/> …
  </family>
  <adult slot=" " keys="Adult" path="//butterfly//adult//">
    <shape slot=" " keys="Adult:Shape" path="//butterfly//adult//shape//">
      <v value=" " image="swallowtail.gif"/> <v value=" "/>
    </shape>
  • 57.     <color slot=" " keys="Adult:Color" path="//butterfly//adult//color//">
      <v value=" " keys="Black"/> … <v value=" " keys="Black_White"/>
    </color>
    <texture slot=" " keys="Adult:Texture" path="//butterfly//adult//texture//">
      <v value=" " image="mono.gif"/><v value=" " image="spot.gif"/>
    </texture>
  </adult>
  <pupa slot=" " keys="Pupa" path="//butterfly//pupa//">
    <s slot=" " path="//butterfly//pupa//">
      <v value=" " keys="Skin_Stick"/>
    </s>
    <s slot=" " keys="Pupa:Color" path="//butterfly//pupa//color//">
      <v value=" " keys="Green"/> <v value=" " keys="Wood"/>
    </s>
    <s slot=" " keys="Pupa:Feature" path="//butterfly//pupa//feature//">
      <v value=" " keys="Laying_Pupa"/><v value=" " keys="Hanging_Pupa"/>
    </s>
  </pupa>
  <egg slot=" " keys="Egg" path="//butterfly//egg//">
    <s slot=" " keys="Egg:Shape" path="//butterfly//egg//feature//">
      <v value=" " keys="Ball" image="egg_ball.jpg"/>
      <v value=" " keys=" +Half_Ball" image="egg_half_ball.jpg"/>
    </s>
    <s slot=" " keys="Egg:Color" path="//butterfly//egg//color//">
      <v value=" " keys="Milk_White"/>
    </s>
    <s slot=" " keys="Egg:Texture" path="//butterfly//egg//feature//">
      <v value=" " keys="Smooth"/> <v value=" " keys="Square_Texture"/>
    </s>
  </egg>
  <larva slot=" " keys="Larva+ " path="//butterfly//larva//">
    <s slot=" " keys="Larva:shape" path="//butterfly//larva//feature//">
      <v value=" " keys="Like_Shuttle"/><v value=" " keys="Like_Bird’s_Shit"/>
    </s>
    <s slot=" " keys="Larva:Color" path="//butterfly//larva//color//">
      <v value=" " keys="Green"/><v value=" " keys="Brown"/>
    </s>
    <s slot=" " keys="Larva:Texture" path="//butterfly/life_stage/larva/characteristic">
      <v value=" " keys="Short_Hair"/><v value=" " keys="Long_Hair"/> …
    </s>
  </larva>
  <s slot=" " keys="Taiwan" path="//butterfly//geographic//taiwan//">
    <v value=" " keys="North_Taiwan+ "/>
  </s>
  <s slot=" " path="//butterfly//geographic//global//">
    <v value=" " keys="South_Asia "/><v value=" " keys="China"/> …
  </s>
  <s slot=" " keys="Size" path="//butterfly//adult//size//">
  • 58.     <v value=" " keys="Small_Size+ "/><v value=" " keys="Middle_Size+ "/>
  </s>
  <s slot=" " keys=" =Habitate" path="//butterfly//adult//habitate//">
    <v value=" " keys="Ground"/> … <v value=" " keys="High_Mountain"/>
  </s>
  <s slot=" " keys="Hostplant+ " path="//butterfly//hostplant//">
    <v value=" " keys="Leguminosae"/><v value=" " keys="Euphorbiaceae"/>
  </s>
  <s slot=" " keys="Eat Food" path="//butterfly//adult//behavior//;//butterfly//honeyplant//">
    <v value=" " keys="Nectar"/><v value=" " keys="Juice "/>
  </s>
</butterfly>
Figure 6.2 : A slot-tree for butterflies
6.4 Query Interface
The query interface is generated automatically from the slot-tree. An XSLT template transforms the slot-tree into an HTML document, which is then displayed as a web page in the browser. The following figure shows the query interface for the butterfly domain.
Figure 6.3 A Query Interface for Butterflies
  • 59. The query interface above generates the following query.
<query sort_by=" " path="/butterfly/adult/size$meter" order="-">
  <s slot=" " path="//butterfly//adult//texture" value=" "/>
  <s slot=" " path="//butterfly//geographic//Taiwan" value=" "/>
</query>
After the interface submits the query to our XML retrieval system, the retrieval results are displayed. The query above specifies both the query expression and the ranking strategy. The ranking strategy is the size of the adult butterfly in decreasing order. Based on the query, the XML retrieval system retrieves the matching butterfly objects and ranks them by size. We show the query results in section 6.6.
6.5 Slot-Filling Algorithm
We have to parse the XML objects before we fill the documents into the slot-tree. For example, the following XML document describes a butterfly called "maraho".
<butterfly>
  <cname> </cname>
  <geographic> <taiwan> 1000-1500 … </taiwan> </geographic>
  <egg><feature> </feature></egg>
  <adult><color> </color></adult>
  <footnote> … </footnote>
</butterfly>
The example above is parsed into a sequence of (path, value) pairs as follows.
(butterfly/cname, )
(butterfly/geographic/taiwan, 1000-1500 … )
(butterfly/egg/feature, )
(butterfly/adult/color, )
Then we fill each pair into the corresponding slot as follows.
(butterfly/egg/feature, ) →
<slot name=" " path="butterfly//egg//feature">
  <value name=" "/>
  • 60.   <value name=" "/>
  …
6.6 XML Retrieval
After the user submits the query, the XML retrieval system retrieves the matching documents. An XML extraction algorithm then extracts the values for each slot. After that, a sorting function sorts the results by the size of the butterflies. The following figure shows the query results.
Figure 6.4 A Query Result for Butterflies
6.7 Slot-Mining Algorithm
Chinese Word Learning
A problem specific to Chinese is word boundary detection. In English, words in a sentence are separated by spaces, but in Chinese there are no spaces between words. This causes some difficulty for our XML text-mining problem. One way to solve it is to use a dictionary to find the words that appear in a sentence. The deficiency of this approach is that no dictionary contains all words, and a special domain uses many words that are unknown to a general dictionary. We therefore have to learn words dynamically. In our system, we adopt the keyword-learning algorithm proposed
  • 61. by L.F.Chien [Chien97]. This keyword-learning algorithm is based on the following observation: both the right-hand side and the left-hand side of a word should be "free". Here "free" means that the word can connect to many different neighbors, statistically. For example, we may extract a word from a set of sentences based on the statistical freedom of this word, by comparing the neighbors of three candidate strings.
" " : left neighbors { }, right neighbors { }
" " : left neighbors { }, right neighbors { }
" " : left neighbors { }, right neighbors { }
For the first string, both the left side and the right side have four neighbors. The second string has only one left neighbor, and the third string has only one right neighbor. A string with many neighbors on both sides is very likely to be a word, so the first string is put into the learning dictionary for the following XML text-mining step.
Slot-Mining
After the word-learning step, the slot-mining algorithm described in section 5.4 is used to extract the important words for each slot. The following table shows some results of the slot-mining (part of the slot-tree).
Table 6.3 : A Result of Slot-Mining Algorithm for Butterflies
Slot : Value List
butterfly/classification/cfamily : [Chinese terms]
butterfly/classification/family : Satyridae, Pieridae, Papilionidae, Papilio, Nymphalidae, Lycaenidae, Hesperiidae, Danaidae
butterfly/cname : [Chinese terms]
butterfly/footnote : [Chinese terms]
butterfly/geographic/global : [Chinese terms]
  • 62. butterfly/honeyplant : [Chinese terms]
butterfly/life_stage/adult/predator : [Chinese terms]
butterfly/life_stage/egg/characteristic : [Chinese terms]
butterfly/life_stage/egg/feature : [Chinese terms]
butterfly/life_stage/adult/color : [Chinese terms]
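The "free on both sides" observation from the word-learning step of this section can be illustrated with a small Python sketch. It is a simplified illustration of the neighbor-counting idea only, not the algorithm of [Chien97] itself. The function names, the threshold `min_neighbors` and the Latin-letter toy sentences (standing in for unsegmented Chinese text) are all hypothetical.

```python
def neighbor_sets(candidate, sentences):
    """Collect the distinct characters found directly left and right of `candidate`."""
    left, right = set(), set()
    for sentence in sentences:
        start = sentence.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                left.add(sentence[start - 1])
            if end < len(sentence):
                right.add(sentence[end])
            start = sentence.find(candidate, start + 1)
    return left, right


def is_free(candidate, sentences, min_neighbors=3):
    """Treat a string as a likely word if it has many distinct neighbors on both sides."""
    left, right = neighbor_sets(candidate, sentences)
    return len(left) >= min_neighbors and len(right) >= min_neighbors


# Toy corpus without word boundaries: "ab" occurs in four different contexts.
corpus = ["xabq", "yabr", "zabs", "wabt"]
learned = [c for c in ("ab", "a", "b") if is_free(c, corpus)]
```

Here "ab" has four distinct neighbors on each side and is accepted, while "a" is always followed by "b" and "b" is always preceded by "a", so both are rejected. This mirrors the three-candidate comparison in the text.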
  • 63. 6.8 Discussion
In this chapter, we have applied our methods to the case of butterflies. We described the following steps.
1. Modeling XML documents of butterflies.
2. Constructing a slot-tree ontology for butterflies.
3. Using the slot-filling algorithm to map XML documents into the slot-tree of butterflies.
4. Using the slot-tree ontology to build a query interface for butterflies.
5. Using the slot-tree ontology to help XML retrieval for butterflies.
6. Mining the slot-tree ontology from XML documents of butterflies.
These methods reduce the semantic gap between human and computer in the domain of butterflies. The query interface enables users to write queries easily. The slot-filling algorithm helps the computer interpret XML documents. Finally, the mining algorithm makes it easier for us to construct the slot-tree ontology.
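The slot-filling step of section 6.5 (parse a document into (path, value) pairs, then match each value against the keys of a slot) can be sketched as follows. This is a minimal sketch under several assumptions: paths are matched exactly rather than with the `//` patterns used in the slot-tree, values are matched by simple substring search, and the English sample text, the dictionary layout `slot_keys` and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def path_value_pairs(xml_text):
    """Flatten an XML document into (path, text) pairs, one per non-empty element."""
    pairs = []

    def walk(node, prefix):
        path = prefix + [node.tag]
        text = (node.text or "").strip()
        if text:
            pairs.append(("/".join(path), text))
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return pairs


def fill_slots(pairs, slot_keys):
    """Map each (path, text) pair to the slot values whose keywords occur in the text.

    `slot_keys` has the assumed shape {path: {value_name: [keyword, ...]}}.
    """
    filled = {}
    for path, text in pairs:
        for value_name, keywords in slot_keys.get(path, {}).items():
            if any(keyword in text for keyword in keywords):
                filled.setdefault(path, []).append(value_name)
    return filled


doc = (
    "<butterfly><cname>maraho</cname>"
    "<egg><feature>smooth ball</feature></egg>"
    "<adult><color>black with white spots</color></adult></butterfly>"
)
slot_keys = {
    "butterfly/adult/color": {"black": ["black"], "white": ["white"], "green": ["green"]},
    "butterfly/egg/feature": {"ball": ["ball"]},
}
pairs = path_value_pairs(doc)
filled = fill_slots(pairs, slot_keys)
```

The sketch ignores text that follows a child element (`node.tail`) and does not rank or weight the matches, so it only shows the shape of the mapping from documents to slots.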