A Schema-Based Approach To Modeling And Querying WWW Data

S. Comai (1), E. Damiani (2), R. Posenato (3), L. Tanca (1,3)
Email: comai,tanca@elet.polimi.it, edamiani@crema.unimi.it, posenato@sci.univr.it
(1) Politecnico di Milano, Dipartimento di Elettronica e Informazione
(2) Università di Milano, Polo di Crema
(3) Università di Verona, Facoltà di Scienze MM. FF. NN.
Contact author: Letizia Tanca, Politecnico di Milano, Dip. di Elettronica e Informazione
Via Ponzio, 34/5, I20133 Milano, Italy, Tel. +39-23993531, Fax +39-2-23993411
Abstract
The steady growth of the amount of data published via the World Wide Web has led to a number of attempts
to support effective Web querying, as a complement to conventional navigation and indexing techniques.
Many approaches to WWW querying try to compensate for the lack of a common data model by considering
the hypertextual structure of Web sites; unfortunately, this structure is not necessarily related to data semantics.
In this paper, we propose to specify both logical and structural/navigational aspects of a Web site, through
the unifying notion of schema. Following the style of such languages as Good and G-Log, site schemata, instances
and queries are represented as graphs: this also allows for a uniform representation of queries and views, the latter
expressing alternative, customized access structures to the site information.
Our approach is particularly suitable for Intranet applications, and can be smoothly transferred to Internet
Web sites as more and more of them are produced on the basis of hypermedia design methodologies.
After introducing the proposed architecture, data model and language, we classify types of queries according
to graph topology and semantic features and investigate their execution complexity. Then we present the design
of our prototype and compare our approach to some related work.
1 Introduction
Design methodologies are now widely employed to develop multimedia hypertextual applications. Besides providing
conventional or object-oriented design elements, such as E/R-like entities or OMT-like classes, nearly all modern
hypermedia specification languages are associated with a navigation and a presentation semantics, indicating
respectively how entities are to be navigated and presented to the user.
Though no explicit notion of schema is currently associated with the semi-structured, hypertext-style information
contained in most Web sites, several design methodologies such as HDM or RMM [GMP95, ISB95] already provide
some more or less formal means to express data semantics. Based on these methodologies, research prototypes of
Web site generators [FP] and commercial Web authoring environments are now available.
In our opinion, much of the semantic information produced during the hypermedia design process is
eventually lost when this process ends, since no query support is generally offered for hypermedia products, whose
use is based on free user navigation. This loss is felt even more when the hypermedia we are trying to
access is available as a Web site, since the need for efficient and effective retrieval of information on the net is much
more critical.
We believe that an effective description and manipulation language for Web sites, running side by side with the
current browsing and searching environments, should be available; this should allow Web users to exploit the rich
amount of semantics contained in the Web site, by means of database-like querying capabilities.
The need for Web access tools that allow for the definition and exploitation of information on data semantics
has also been recognized, for example, by the electronic commerce community [Ham97], which is particularly keen on
developing better search capabilities to overcome the inherent limitations of keyword-based searches.
Our approach relies on the availability of a clean notion of site schema as the basis for a semantics-aware search
for information through the Web: the availability of a schema is in our opinion an essential prerequisite for the
development of an effective query mechanism, since schemata carry, in a synthetic form, most of the semantic
information needed for querying.
In this paper we describe WG-Log, a graph-oriented language supporting the representation and querying of
navigation, presentation and logical aspects of hypermedia. Graphs have long held a significant place in
the database design process, and have also been used at the logical specification level [LZ96, GPdBG94, CM90] since
the introduction of object-oriented and semantic data models; moreover, they are naturally connected to graphical
interfaces.
WG-Log has its formal basis in the graph-oriented language G-Log [PPT95]: in G-Log, directed labeled graphs
are used as the formalism to specify and represent database schemata, instances and queries. The nodes of the
graphs stand for objects and the edges indicate relationships between objects. G-Log, being conceived as an
object-oriented database language, provides two kinds of nodes: printable nodes, depicted as ellipses, indicate objects with
a representable value; non-printable nodes, depicted as rectangles, indicate abstract objects. Graph edges denote
logical relationships between objects.
WG-Log extends G-Log by including some standard hypermedia design notations [GMP95, ISB95], such as
navigational links: WG-Log descriptions cleanly denote both logical and structural (navigational) concepts,
with the additional possibility to specify some typical hypermedia features like index pages or entry points, which
are essentially related to the hypertext presentation style.
The advantages of this approach are many, the first being the immediate availability of a uniform mechanism
for query and view formulation. When a new view over a site is established, clients can formulate the query so as to
include additional information and a good deal of restructuring in the result. In fact, this powerful mechanism can
be exploited to reuse existing sites' content as well as schemata to produce new sites.
Another advantage of schemata is that they convey information about site contents that can be used for site
classification and clustering purposes.
Our approach is particularly suitable for Intranet applications, since we can easily envisage an organization whose
Web sites conform to the same structure, or alternatively a set of federated Web sites whose structure is captured
by a unique, global schema. The proposal can be smoothly transferred to Internet Web sites as more and more of
them are produced on the basis of hypermedia design methodologies and tools.
This paper is organized as follows: Section 2 contains a survey of related work; in Section 3 we present an outline
of our approach, with a preliminary description of the system architecture. Section 4 presents the data model and
language of WG-Log, while Section 5 introduces a classification of types of queries according to graph topology
and semantic features, and investigates their execution complexity. Section 6 draws the conclusions. Appendix A
describes the implementation choices we made in our prototype while Appendix B contains an example of query
execution.
2 Related Work
In this section we provide a brief overview of related work (see also [Tor96]). Our discussion is based on how the
various approaches deal with the representation of site semantics.
Free text indexing - No representation of semantics. Early approaches to Web indexing tried to collect
and index title-like information about every reachable page of data on the WWW and then build Boolean keyword
searches into the resulting document. Many current Web search engines are still partially based on this approach,
where search results are flat lists of HTML pages; however, nowadays one could hardly find a Web search engine
relying on term indexing alone. Some keyword-based indexes, like the World Wide Web Worm (WWWW) [McB94],
the WebCrawler [Pin94] and Lycos [Lyc], try to complement keyword indexing by taking into account the HTML
document structure in order to make educated guesses about semantics.
Representation of semantics via taxonomies. Several search engines do not use keyword indexing but
exploit a taxonomy representing sites' content. The popular Yahoo [Yah] search engine relies on a broad hierarchical
classification system of subjects, much like those used by the Library of Congress. Yahoo's success has spawned
multiple similar tools, all based on the idea of providing large, monolithic servers holding indexes of site contents
(WebCompass [Qua], SavvySearch [Dre] and others).
Structural representation of sites. A considerable amount of research has been devoted to providing
database-style support for querying the Web, and three main WWW query languages have been proposed so far:
Web3QL [KS95], WebSQL [MMM96] and WebLog [LSS96]. The first two languages are modelled after the standard SQL
used for RDBMSs, while the third is inspired by the Datalog language. However, all three languages refrain from
the explicit representation of semantics. Indeed, Web3QL and WebSQL offer a standard relational representation of
Web pages, such as Document(url, title, text, type, length), which can be easily constructed from standard
HTML tagging. The user can present SQL-like queries to Web sites based on that relational representation.
Content-related queries (for instance: Document.text = ``Italy'') are mapped to free-text searches using a conventional
search engine. In addition to similar query capabilities, Web3QL offers an elementary graph pattern search facility,
allowing users to search for simple paths in the graph representing the navigational structure of a Web site. Finally,
WebLog proposes an O-O instance representation technique which leads to a powerful deductive query language, fully
equipped with recursion; but again it lacks any representation of data semantics.
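As a rough illustration of the relational view of Web pages mentioned above, the following sketch stores a few pages in a Document(url, title, text, type, length) table and answers a content-related query. It is only an assumption-laden mock-up: the in-memory SQLite database, the sample URLs and the function name content_query are hypothetical, and a plain substring match stands in for the free-text search engine that the actual languages would delegate to.

```python
import sqlite3

# Hypothetical in-memory table mirroring Document(url, title, text, type, length).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Document (url TEXT, title TEXT, text TEXT, type TEXT, length INTEGER)"
)
pages = [
    ("http://example.org/a.html", "Travel notes", "A trip to Italy", "html", 15),
    ("http://example.org/b.html", "Recipes", "Bread and cheese", "html", 16),
]
conn.executemany("INSERT INTO Document VALUES (?, ?, ?, ?, ?)", pages)


def content_query(keyword):
    """Return the URLs of pages whose text mentions the keyword.

    A substring match (LIKE) is used here as a stand-in for the
    conventional free-text search engine.
    """
    rows = conn.execute(
        "SELECT url FROM Document WHERE text LIKE ? ORDER BY url",
        (f"%{keyword}%",),
    )
    return [url for (url,) in rows]
```

Note that nothing in such a table says what a page is about: the query can only test page text and markup-level attributes, which is the limitation the paragraph above points out.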
Instance-based representation of semantics. Nowadays it is widely recognized that to build effective Web-based
services, developers must be able to impose some sort of semantic structure upon Web sites in order to
support efficient information capture [Ham97]. A well-known technique for instance-based semantics representation
is semantic tagging, i.e. the use of extended HTML tags to represent semantic information. The basic idea underlying
this approach is that a new kind of HTML tag can be used to superimpose a representation of semantics (based,
for instance, on standard entity-relationship techniques) on the navigational structure of a Web site. Semantic tags
can be used to associate the data stored in a Web page with an entity and to denote relationships as semantic links that
are not meant only to be followed in navigation, but also to be used for querying purposes. Several variations of the
semantic tagging idea [KMSS97, DL96] have been proposed by various researchers; moreover, HTML standard committees
[Wor] seem to be considering its partial endorsement. However, no effective query support based on semantic tagging
is available yet.
Other approaches try to address the problem of Web indexing and querying in the more general framework of
dealing with semi-structured data. For instance, the Tsimmis system [GMPQ+97] proposes an OEM object model
to represent semistructured information together with a powerful query language, Lorel. For each Web site, the user
defines OEM classes to be used in its Tsimmis representation. Then, an extraction technique based on a textual
filter is applied, initializing objects from Web pages' data. Moreover, Tsimmis' additional DataGuide facility makes it
possible to identify regularities in the extracted instance representation to produce a full-fledged site schema.
Schema-based representation of semantics. With the partial exception of Tsimmis, all the approaches
described above lack an explicit notion of schema. This may be due to the fact that, while the advantages of
schema-aware query formulation are widely recognized in the database context, this technique has been considered unfeasible
on the WWW because Web sites are considered too dynamic to allow schematization. However, this situation is
evolving as an increasing number of sites, particularly on Intranets, are being designed using well-specified design
methodologies such as HDM [GMP95], RMM [ISB95], YOO [BMY95] and the like. Some of these tools even translate
the site schema into a relational representation [FP]. A representation of semantics based on a standard relational
schema is also used in the Araneus project [AM97], where Web site crawling is employed to induce schemata of Web
pages. These fine-grained page schemata are later combined into a site-wide schema, and a special-purpose
language, Ulixes, is used to build relational views over it. The resulting relational views can be queried using standard
SQL, or transformed into autonomous Web sites using a second special-purpose language, Penelope. It is worth
observing that the Araneus approach to schema induction requires semi-structured Web site data to be converted into
relational tables to allow database-style querying. In WG-Log, graph-based instance and schema representations are
used for querying, while Web site data remain in their original, semi-structured form.
3 An Outline of Our Approach
Our approach to Web querying involves five basic steps: 1) schema identification, 2) schema retrieval, 3) query
formulation, 4) instance retrieval and results restructuring, 5) presentation of results. These steps, together with the
basic system modules needed to perform them, will be briefly described in the following.
An outline of our system architecture is shown in Figure 1, while a detailed description of the implementation of
our prototype is in Appendix A.
The schema identification and retrieval steps involve a dialogue between the client module and a distributed
object-oriented server called Schema Robot. Essentially, a Schema Robot is a trader object which provides Web
site schemata, stored and presented in WG-Log form, to clients in execution over the Net. Users can either query
Robots by using simple free-text, keyword search in the schema repository or adopt some more sophisticated style
of interaction. Interacting with Schema Robots, users identify Web servers holding the information they need on
the basis of their schema, i.e. on the semantics of their content. Exploiting schema information, users can choose
whether to query the chosen servers or resort to conventional Web navigation.
After helping the user in schema identification, clients must provide facilities for easy and effective query formulation.
Since many current Web users are well acquainted with graph-like representations of the hypertextual structure
of Web sites, the main purpose of our client module is to provide user-friendly visual tools for query formulation,
based on graph-like schema and query representation. Novice users can rely on a cut-and-paste approach to put
together ready-made basic blocks, together with schema parts, to compose their queries, while experts retain the
high expressive power of a full-fledged query language.
[Figure 1: Outline of our system architecture. Labels recoverable from the figure: Client Area, Schema Search Area
(Robot), Remote Site Area, User, Interface, Query (graphical), Schema Robot, Thesaurus, Schema Repository, Keywords,
Answer Schemata (+ remote site info), Local Query Manager, Remote Query Manager, Schema, Instance, Query, Answer,
partial answer, presentation issues.]
Effective support for query execution and instance retrieval is essential in the development of Web-based
applications. In our approach, queries specify both the information to be extracted from a site instance, and the
integration and results restructuring operations to be executed on it. This is especially useful in constructing site
views, i.e. queries whose result is not volatile but forms a new, independently browsable Web site. Views to be stored
at the client side are proposed as a powerful tool for reusing Web sites' content as well as structure.
Queries are delivered to object-oriented servers called Remote Query Managers, which are in execution side by
side with the conventional Web servers at target Web sites. Remote Query Managers use an internal instance
representation to execute queries and to return results in the format selected by the user.
There are two basic modes for the remote Query Manager to return results: as a list of handles (e.g. URLs),
in order to keep at a minimum network and client resource consumption; and as a result instance graph which
includes complete restructuring of hypertextual links inside the set of HTML pages. The result instance graph is
then transmitted to the client over the network; advanced compression techniques may well be of help to reduce
resource needs for this execution mode.
The partial result computed by the Remote Query Manager is processed at the client side by a module called Local
Query Manager, which has access to the additional information requested by the user, and produces the complete
query result as specified.
The WG-Log language also provides facilities for the presentation of results. This mechanism allows the client to
require customized presentation styles on the basis, for instance, of constraints on the resources available at the user
location.
Note that, in a sophisticated view of the interaction with Schema Robots, these can be regarded as browsable,
similarity-based hypertexts of schemata, providing information about Web sites, or used by specialized agents as a
source of meta-information for reasoning about Web sites content [LSRH97]. Although in this paper we shall not
deal with site classification issues, it is worthwhile to remark that Schema Robots need not be all equal; for instance,
communicating domain schema robots might form a partitioned distributed repository of all known schemata on the
Net, while category schema robots might allow for a subject-related classification and search of Web sites.
Finally, several current research projects address the problem of building a representation of site semantics by
means of knowledge discovery on the Web [AM97, GMPQ+97]. An interesting possibility is translating information
about the sites' content into a WG-Log schema form, to be classified and stored by the Schema Robots system.
4 The Data Model and Language
In this section we introduce the data model and the language of WG-Log; formal presentations of this material can
be found in the papers on G-Log [PPT95] and in previous papers on WG-Log [DT97].
In WG-Log, directed labeled graphs are used as the formalism to specify and represent Web site schemata,
instances, views (allowing customized access paths or structures), and queries. The nodes of the graphs stand
for objects and the edges indicate logical or navigational relationships between objects. The whole lexicon of
WG-Log is depicted in Figure 2. In WG-Log schemata, instances and queries we distinguish four kinds of nodes:
- slots (also called concrete nodes), depicted as ellipses, indicate objects with a representable value; instances
  of slots are strings, texts, pictures, sound tracks, numbers, movies or movie frames (depending on the desired
  granularity of representation);
- entities, depicted as rectangles, indicate abstract objects such as monuments, professors, or cities; note that
  an abstract object can be chosen to correspond to one or more Web pages, possibly linked to each other in
  different ways: it is for the designer to decide which level of granularity the schema is meant to convey;
- collections, represented by a rectangle containing a set of horizontal lines, indicate collections or aggregates of
  objects, generally of the two types above; an instance of such a node is the index of all painters in a certain
  gallery. This is the first example of a concept dictating the presentation style, since the presence of an index is
  related to layout rather than to semantic issues;
- entry points, depicted as triangles, represent the unique page that gives access to a portion of the site (or to an
  alternative view of the site), for instance the site home page. To each entry point type corresponds only one
  node in the site instance.
It is worth noticing that entry points, like collection nodes, are very useful when a complex query is issued that
requires an appropriate access structure to alternative, customized presentations of Web portions, or when designing
a new view of the site.
We also distinguish three kinds of graph edges:
- structural edges, representing navigational links between pages; such an edge may stand for the link between a
  collection node representing the painter index and the entity of type painter;
- logical edges, labeled with relation names, representing logical relationships between objects; such an edge
  might connect painters to their paintings. The presence of a logical relationship does not necessarily imply the
  presence of a navigational link between the two entities at the instance level;
- double edges, representing a navigational link coupled with a logical link; they are depicted as double labeled
  edges, and might connect a painter to his/her paintings, also indicating that there is a navigational link that
  allows paintings to be reached from their author.
At the schema level, each kind of edge may have a single or a double arrow: double arrows indicate that the
represented link is multi-valued, i.e., there may be more than one destination entity in the site instance; single
arrows impose the single-valuedness of the edge [PPT95]. As an example of the use of these lexical elements, Figure 3
shows the WG-Log schema, while Figure 4 contains a simplified instance, of our experimental WWW site representing
Monumental Verona, whose URL is http://romeo.sci.univr.it/vrtour. This WG-Log description was easily
obtained as a part of the design process of the site.
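The lexicon described above (four node kinds, three edge kinds, single or double arrows) can be written down as a small set of data types. The following is only a sketch: the class and field names are our own, and the painter example reuses the informal example of the text rather than any schema given in the paper.

```python
from dataclasses import dataclass
from enum import Enum


class NodeKind(Enum):
    SLOT = "slot"                # concrete node with a representable value
    ENTITY = "entity"            # abstract object
    COLLECTION = "collection"    # index or aggregate of objects
    ENTRY_POINT = "entry point"  # unique access page


class EdgeKind(Enum):
    STRUCTURAL = "structural"    # navigational link only
    LOGICAL = "logical"          # logical relationship only
    DOUBLE = "double"            # logical relationship coupled with a link


@dataclass(frozen=True)
class SchemaEdge:
    source: str
    kind: EdgeKind
    label: str          # relation name; empty for structural edges
    target: str
    multi_valued: bool  # True for a double arrow, False for a single arrow

    def is_navigable(self):
        """A page-to-page link exists for structural and double edges."""
        return self.kind in (EdgeKind.STRUCTURAL, EdgeKind.DOUBLE)

    def is_logical(self):
        """A named relationship exists for logical and double edges."""
        return self.kind in (EdgeKind.LOGICAL, EdgeKind.DOUBLE)


# Hypothetical schema edges for the painter example used in the text.
index_link = SchemaEdge("Painter index", EdgeKind.STRUCTURAL, "", "Painter", True)
painted = SchemaEdge("Painter", EdgeKind.DOUBLE, "Painted", "Painting", True)
```

Keeping the edge kind separate from the multi-valued flag mirrors the text, where the arrow style is an independent property of every kind of edge.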
[Figure 2: WG-Log lexicon. Node symbols: Entity, Slot (attribute), Index, Entry-Point. Link symbols: Logical Link,
Navigational Link, Coupled Logical-Navigational Link, each in a mono-valued and a multi-valued form; logical and
coupled links carry a relation name.]
4.1 WG-Log schemata
A (site) schema contains information about the structure of the Web site. This includes the (types of) objects that
are allowed in the Web site, how they can be related and what values they can take. Logical as well as navigational
elements can be included in a site schema, thus allowing for flexibility in the choice of the level of detail.
Formally, a schema contains the following disjoint sets: 1) a set EP of Entry Point labels, 2) a set SL of concrete
object (or Slot) Labels, 3) a set ENL of ENtity Labels, 4) a set COL of COllection Labels, 5) a set LEL of Logical
Edge Labels, 6) one Structural Edge Label SEL (which in practice is omitted), 7) a set DEL of Double Edge Labels,
and 8) a set P of productions (see footnote 1).
Sometimes we refer to the first four kinds of labels as object labels. Each of the three edge label sets is partitioned
into the two sets of mono-valued and multi-valued edge labels. The productions dictate the structure of WG-Log instances
(which are the actual sites); the productions are triples representing the types of the edges in the instance graph.
The first component of a production always belongs to $ENL \cup COL \cup EP$, since only non-concrete objects can be
related to other objects. The second component is an edge label and the third component is an object label of any
type. Note that we allow the presence of slots as components of entry points and collection nodes.
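The formal definition above can be made concrete with a small container that holds the label sets and the productions and enforces the two stated constraints: the label sets are disjoint, and a production may only start from an entity, collection or entry point label. This is a minimal sketch, not the prototype's data structure; the class name Schema, its field names and the string "SEL" used for the structural edge label are assumptions, and the split into mono-valued and multi-valued edge labels is left out for brevity.

```python
from dataclasses import dataclass, field

SEL = "SEL"  # stand-in for the single structural edge label


@dataclass
class Schema:
    entry_points: frozenset   # EP
    slots: frozenset          # SL
    entities: frozenset       # ENL
    collections: frozenset    # COL
    logical_edges: frozenset  # LEL
    double_edges: frozenset   # DEL
    productions: set = field(default_factory=set)  # P, a set of triples

    def __post_init__(self):
        groups = [self.entry_points, self.slots, self.entities, self.collections,
                  self.logical_edges, self.double_edges, frozenset({SEL})]
        # Disjoint sets: the sizes must add up to the size of the union.
        if sum(len(g) for g in groups) != len(frozenset().union(*groups)):
            raise ValueError("label sets must be pairwise disjoint")

    def object_labels(self):
        return self.entry_points | self.slots | self.entities | self.collections

    def add_production(self, source, edge, target):
        """Add the triple (source, edge, target) after checking its components."""
        if source not in self.entities | self.collections | self.entry_points:
            raise ValueError(f"{source!r} cannot be the first component of a production")
        if edge not in self.logical_edges | self.double_edges | {SEL}:
            raise ValueError(f"{edge!r} is not an edge label of this schema")
        if target not in self.object_labels():
            raise ValueError(f"{target!r} is not an object label of this schema")
        self.productions.add((source, edge, target))  # duplicates collapse in a set
```

Storing the productions in a set gives the "no duplicate productions" requirement for free: adding the same triple twice leaves the schema unchanged.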
A Web site schema is easily represented as a directed labelled graph, by taking all the objects as nodes and all
the productions as edges. Note that two nodes might be connected by more than one edge. If multiple logical edges
connect two nodes, they represent different relationships between those objects; the presence of a structural edge
between two nodes represents the (possible) presence of a direct navigational link between the corresponding pages
in the site instance: no two nodes, however, can be connected by more than one structural edge, since this would be
meaningless. Besides these edges, we also assume an implicit equality edge (a logical edge with an equality sign as
label) going from each node of the schema to itself.
Finally, we assume a function associating to each slot label a set of constants, which is its domain; for instance,
the domain of a slot of type image might be the set of all JPEG files.
Footnote 1: We require that the productions form a set because we do not allow duplicate productions in a schema.
[Figure 3: The WG-Log schema of the monumental Verona site. Labels recoverable from the figure: Home, Index,
Monument, Author, Place, Period, String, Text, Image, Name, Description, Photo, Map, Biography, Created_by,
Author_of, Lived_in, Created_in, Contains, M_contains.]
[Figure 4: A site instance. Values recoverable from the figure include the monuments Arena, Palazzo Malfatti,
Palazzo Barbieri, Portoni, Gran Guardia and T. Filarmonico, the authors Sanmicheli, Bibiena and Barbieri, the places
Liston, V. Mazzini, P.zza Erbe, V. Roma and P.zza Bra, and the periods Venetian, Neoclassical, Baroque and Roman;
instance nodes are numbered from 0 to 24.]
4.2 WG-Log instances
A (Web site) instance over a schema $S$ contains the actual information that is stored in the Web site pages. It is
a directed labeled graph $I = (N, E)$. $N$ is a set of labeled nodes. Each node represents an object whose type is
specified by its label. The label $\lambda(n)$ of a node $n$ of $N$ belongs to $EP \cup SL \cup ENL \cup COL$. If $\lambda(n)$ is in $EP$, then
$n$ is an entry point node, and is the only one with label $\lambda(n)$; if $\lambda(n)$ is in $SL$, then $n$ is a concrete node (or a slot);
if $\lambda(n)$ is in $ENL$, then $n$ is an abstract object, which can coincide with one or more site pages; otherwise $n$ is a
collection node. If $n$ is concrete, it has an additional label $print(n)$, called the print label, which must be a constant
in the domain of the label $\lambda(n)$. $print(n)$ indicates the value of the concrete object, for instance an image
file or a text. Thus, all the instance nodes must conform to some schema node.
$E$ is a set of directed labeled edges with only one arrow. An edge $e$ of $E$ going from node $n$ to $n'$ is denoted
$(n, \alpha, n')$; $\alpha$ is the label of $e$ and belongs to $LEL \cup \{SEL\} \cup DEL$. The edges must conform to the productions of
the schema, so $(\lambda(n), \alpha, \lambda(n'))$ must belong to $P$ and, if $\alpha$ is a mono-valued edge label, no two distinct edges
$(n, \alpha, n')$ and $(n, \alpha, n'')$ are allowed. As in WG-Log schemata, we also assume an implicit equality edge going from
each node of the instance to itself. Figure 4 contains an instance over the schema of Figure 3.
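The conformance conditions just listed (known node labels, unique entry points, edges matching a production, mono-valued edges having a single target) lend themselves to a direct check. The sketch below is illustrative only: the function name violations, the dictionary-based encoding of the schema and the choice of Created_by as a mono-valued label are assumptions made for the example, and print labels are not checked.

```python
from collections import Counter, defaultdict


def violations(nodes, edges, schema):
    """List the ways an instance fails to conform to a schema.

    nodes maps a node id to its label; edges is an iterable of
    (source, edge_label, target) triples over those ids.
    """
    problems = []
    for node, label in nodes.items():
        if label not in schema["object_labels"]:
            problems.append(f"node {node}: label {label} is not in the schema")
    counts = Counter(nodes.values())
    for label in sorted(schema["entry_points"]):
        if counts[label] > 1:
            problems.append(f"entry point {label} occurs {counts[label]} times")
    targets = defaultdict(set)
    for source, edge_label, target in edges:
        triple = (nodes[source], edge_label, nodes[target])
        if triple not in schema["productions"]:
            problems.append(f"edge {source}->{target}: no production {triple}")
        targets[(source, edge_label)].add(target)
    for (source, edge_label), found in targets.items():
        if edge_label in schema["mono_valued"] and len(found) > 1:
            problems.append(f"node {source}: mono-valued {edge_label} has {len(found)} targets")
    return problems


# Hypothetical schema fragment and instance nodes for the example.
schema = {
    "object_labels": {"Home", "Monument", "Author"},
    "entry_points": {"Home"},
    "productions": {("Monument", "Created_by", "Author")},
    "mono_valued": {"Created_by"},  # assumed mono-valued for illustration
}
nodes = {0: "Home", 1: "Monument", 2: "Author", 3: "Author"}
```

For example, an instance with the single edge (1, "Created_by", 2) conforms, while adding (1, "Created_by", 3) breaks the mono-valued constraint and reversing the edge breaks the production constraint.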
4.3 WG-Log Rules and Queries
A WG-Log query is a (set of) graph(s) whose nodes can belong to all four node types used in the schema; moreover,
a dummy node is allowed, whose purpose is to match all the node types of the schema. The edges can be logical,
double or navigational, but no double arrows are allowed, since for the moment queries cannot specify cardinality
constraints over the instance links.
This allows for purely logical, mixed or purely navigational queries; in all three cases, in fact, the query results
in a transformation performed on the instance. In the sequel we give examples of all three query types.
WG-Log queries use rules, programs and goals to deduce or restructure information from the information
contained in the Web site pages. Rules and goals are themselves graphs, which can be arranged in programs in such
a way that new views (or perspectives) of (parts of) the Web site become available to the user. WG-Log rules are
constructed from patterns.
[Figure 5: A WG-Log rule. Labels recoverable from the figure: nodes Monument list, Monument, Author and String
(with value Bibiena); edges Created_by and name.]
A pattern over a Web site schema is similar to an instance over that schema. There are three differences: 1) in
a pattern equality edges may occur between different nodes having the same label, 2) in a pattern concrete nodes
may have no print label, and 3) a pattern may contain entity nodes with the dummy label, used to refer to a generic
instance node. The purpose of the last two elements is to allow incomplete specification of the information when
formulating a query.
A pattern denotes a graph that has to be embedded in an instance, i.e. matched to a part of that instance. An
equality edge between two different nodes indicates that they must be mapped to the same node of the instance.
Like Horn clauses, rules in WG-Log represent implications. To distinguish the body of the rule from the head
in the graph (pattern) P representing the rule, the part of P that corresponds to the body is coloured red, and the
part that corresponds to the head is green. Since this paper is in black and white, we use thin lines for red nodes
and edges and thick lines for green ones. Figure 5 contains a WG-Log rule over the Web site schema of Figure 3. It
expresses the query: Find all the monuments whose author is Bibiena.
The application of a rule r to a site instance I produces a minimal superinstance of I that satisfies r. For the sake
of clarity, we first consider queries expressed by a single rule that does not involve negation. Informally, an instance
satisfies such a rule if every matching of the red part of the rule in the instance can be extended to a matching of
the whole rule in the instance. The matchings of (parts of) rules in instances are called embeddings. For example, the
instance I of Figure 4 does not satisfy the rule r of Figure 5 because one possible embedding i of the red part of r
in I (the Monument-nodes pertaining to Bibiena) is not connected to a Monument list-node in I.
Because I does not satisfy r, I is extended in a minimal way such that it satisfies r. In this case, the effect of
rule application is that a Monument list-node is created and linked to all the appropriate Monument-nodes by a
navigational edge. Now the instance satisfies the rule, and no smaller superinstance of I does, so this is the result of
the query specified by the rule.
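The effect of applying the rule of Figure 5 can be mimicked on a toy instance. The sketch below hard-codes that single rule instead of implementing general pattern embedding, so it should be read as an illustration of the "minimal superinstance" idea only; the function names, the node identifier "monument_list" and the label "SEL" for navigational edges are assumptions.

```python
NAV = "SEL"             # stand-in label for navigational (structural) edges
LIST = "Monument list"  # label of the node in the green part of the rule


def red_embeddings(nodes, prints, edges, author_name):
    """Monuments m such that m -Created_by-> a -Name-> s and print(s) == author_name."""
    found = set()
    for (m, l1, a) in edges:
        if l1 == "Created_by" and nodes[m] == "Monument" and nodes[a] == "Author":
            for (a2, l2, s) in edges:
                if a2 == a and l2 == "Name" and prints.get(s) == author_name:
                    found.add(m)
    return found


def apply_rule(nodes, prints, edges, author_name, new_id="monument_list"):
    """Return a superinstance in which every matched monument is reachable
    from a Monument list node through a navigational edge."""
    nodes, edges = dict(nodes), set(edges)
    linked = {dst for (src, label, dst) in edges
              if label == NAV and nodes[src] == LIST}
    missing = red_embeddings(nodes, prints, edges, author_name) - linked
    if missing:  # extend only when the rule is not yet satisfied
        nodes[new_id] = LIST
        edges |= {(new_id, NAV, m) for m in missing}
    return nodes, edges


# Hypothetical two-monument instance.
nodes = {"m1": "Monument", "m2": "Monument", "a1": "Author", "a2": "Author",
         "s1": "String", "s2": "String"}
prints = {"s1": "Bibiena", "s2": "Sanmicheli"}
edges = {("m1", "Created_by", "a1"), ("m2", "Created_by", "a2"),
         ("a1", "Name", "s1"), ("a2", "Name", "s2")}
```

Applying the rule a second time to its own result changes nothing, which reflects the fact that an instance already satisfying the rule is its own minimal superinstance.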
[Figure 6: Two WG-Log goals. Labels recoverable from the figure: (a) Monument List, Monument, String (name), Text
(description); (b) Monument List, Monument, String (name), Image (photo).]
Note that the new instance, obtained from the query, contains a new access structure (the node Monument list
and its links to all Bibiena's monuments), allowing the retrieval of the nodes in a novel way, which was not possible
in the initial instance.
However, one might object that the probable intention of the user was not to construct a whole superinstance of
the initial instance, but only to get access to Bibiena's monuments. To this purpose, WG-Log allows the expression
of goals, in order to filter out non-interesting information from the instance obtained from the query.
A goal over a schema S is a subschema of S, and is used in combination with a query. The effect of applying a goal G
over a schema S to an instance I over S is called I restricted to G (notation: $I|_G$) and is the maximal subinstance of
I that is an instance over G. The definition of satisfaction of a WG-Log rule is easily extended to rules with goals.
If R is a WG-Log rule, then an instance I over G satisfies R with goal G if there exists an instance $I'$ satisfying R
such that $I'|_G = I$.
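As a rough illustration of the restriction of an instance to a goal, the following Python sketch keeps only the nodes and edges whose types occur in the goal. It rests on assumptions that are not in the paper: slots are modelled as ordinary typed nodes, a goal is given as a set of node labels plus a set of (source label, edge label, target label) triples, and the function name `restrict` and all identifiers are hypothetical.

```python
def restrict(inst_nodes, inst_edges, goal_labels, goal_edges):
    """Keep nodes whose label occurs in the goal and edges whose
    (source label, edge label, target label) triple occurs in the goal."""
    nodes = {n: lab for n, lab in inst_nodes.items() if lab in goal_labels}
    edges = {(s, lab, t) for (s, lab, t) in inst_edges
             if s in nodes and t in nodes
             and (inst_nodes[s], lab, inst_nodes[t]) in goal_edges}
    return nodes, edges

# Toy instance: one monument with a name, a description and a photo.
inst_nodes = {"l1": "MonumentList", "m1": "Monument",
              "s1": "String", "t1": "Text", "i1": "Image"}
inst_edges = {("l1", "SEL", "m1"), ("m1", "name", "s1"),
              ("m1", "description", "t1"), ("m1", "photo", "i1")}

# A goal in the spirit of Figure 6(a): names and descriptions only.
goal_labels = {"MonumentList", "Monument", "String", "Text"}
goal_edges = {("MonumentList", "SEL", "Monument"),
              ("Monument", "name", "String"),
              ("Monument", "description", "Text")}

nodes, edges = restrict(inst_nodes, inst_edges, goal_labels, goal_edges)
```

With this goal the photo node and its edge are dropped, while the access structure and the two remaining slots survive.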
In this particular case, we might use the goal of Figure 6(a) to present to the user only the access structure
Monument list with its links to Bibiena's monuments. Note that the goal specifies that the user is only interested
in the name and description of each monument, and the other information (namely, the photo) is to be dispensed with.
At the system level, to make querying more effective, if no goal is specified and no further information is added,
the system takes the rule pattern itself as the default goal specification, and excludes from the presentation of the query
result those nodes in the result instance that are not in the rule graph.
Rules in WG-Log can also contain negation in the body: we use solid lines to represent positive information and
dashed lines to represent negative information. So a WG-Log rule can contain three colours: red solid (RS), red
dashed (RD), and green solid (GS).
The rule of Figure 7 expresses the query "find all monuments of the Venetian period whose author is not Bibiena".
The instance I of Figure 4 does not satisfy this rule either. There are two possible embeddings of the RS part of
r in I, which cannot be extended to embeddings of the whole red (RS and RD) part of r in I. For each of these
two embeddings, the Monument-nodes of I should be connected to a Monument list-node (to embed also the
GS part of r in I), and this is not the case in I. Again, the solution is a superinstance of I that also contains the
appropriate Monument list-node, and again it is possible to apply a goal that filters out uninteresting parts
of the instance. For instance, the goal of Figure 6(b) chooses all the monuments with their names and photos. Note
that the rule of Figure 7 requires a logical Created_In link between Monument and Period. The answer will contain
both logical and double links of the instance, since the latter are a special case of logical relationships.
Figure 7: A WG-Log rule involving negation. (Nodes: Monument, Period with Name "Venetian", Author with Name "Bibiena", Monument list; edges: Created_In, Created_By.)
To express more complex queries in WG-Log, we can combine several rules and apply them simultaneously to the
same instance: a WG-Log set A is a finite set of WG-Log rules that work on the same instance. The generalization
of satisfaction to the case of WG-Log rule sets is straightforward. Let A be a WG-Log set; an instance I satisfies A
if I satisfies every rule of A.
There is a strong connection between G-Log and first-order predicate calculus. In [PPT95] G-Log is seen as a
graphical counterpart of logic. WG-Log is only a syntactic variant of G-Log, whose semantics we want to retain in
order to keep its expressive power and representation capability; thus the same correspondence holds for WG-Log.
Consider for instance the rule of Figure 7. This may be expressed in First Order Logic as follows:

$$\forall m\; \forall p\; \forall a\; \exists\, \textit{Monument-list} :\;
\mathit{created\_in}(m, p) \land \mathit{period}(p, \mathrm{Venetian}) \land \mathit{name}(a, \mathrm{Bibiena}) \land \lnot\, \mathit{created\_by}(m, a)
\Rightarrow \mathit{SEL}(\textit{Monument-list}, m)$$
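To make the reading of the formula concrete, the sketch below evaluates its body over a toy set of facts and collects the monuments m that would receive a SEL link. All identifiers and facts are hypothetical, the relations are plain Python sets and dictionaries, and the existential over the author is read literally from the formula (some author named Bibiena who did not create m).

```python
# Hypothetical facts; none of these come from the paper's Figure 4.
created_in = {("m1", "p1"), ("m2", "p1"), ("m3", "p2")}
period_name = {"p1": "Venetian", "p2": "OtherPeriod"}
author_name = {"a1": "Bibiena", "a2": "Sammicheli"}
created_by = {("m1", "a1"), ("m2", "a2"), ("m3", "a1")}

def select_monuments():
    """Monuments m for which the body of the formula holds for some p and a."""
    bibienas = [a for a, name in author_name.items() if name == "Bibiena"]
    return sorted({m for (m, p) in created_in
                   if period_name.get(p) == "Venetian"
                   and any((m, a) not in created_by for a in bibienas)})

selected = select_monuments()
sel_edges = {("monument_list", m) for m in selected}  # the green SEL links
```

Here m1 is excluded because Bibiena created it, m3 because it is not Venetian, so only m2 is linked to the new list node.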
In the previous section we defined when an instance satisfies a WG-Log rule set; by examining the logical counterpart
of WG-Log, we get an intuition of the meaning of a WG-Log rule. However, in order to use WG-Log as a query language
we need to define its effect, i.e. the way it acts on instances to produce other instances (see footnote 2).
Informally, the semantics Sem(A) of a WG-Log set A is the set of instance pairs (I, J) (a relation over instances),
such that J satisfies A and J is a minimal superinstance of I. In general, given an instance, there will be more than
one result of applying a WG-Log set of rules, which corresponds to the fact that WG-Log is non-deterministic and
the semantics is a relation and not a function [PPT95, DT97].
Footnote 2: Note that simpler languages like Datalog do not capture the whole expressive power of G-Log or WG-Log: a Datalog rule is expressed
in WG-Log by a simple rule containing red solid nodes and edges, and only one green edge. Thus, it is not possible to express the semantics
of WG-Log by translating it into Datalog.
In WG-Log, sets of rules can also be sequenced. A WG-Log program P is a finite list of WG-Log sets to be
applied in sequence according to a specified order. The semantics Sem(P) of a WG-Log program $P = \langle A_1, \ldots, A_n \rangle$
is the set of pairs of instances $(I_1, I_{n+1})$ such that there is a chain of instances $I_2, \ldots, I_n$ for which $(I_j, I_{j+1})$
belongs to the semantics of $A_j$, for all j. If a number of WG-Log rules are put in sequence instead of in one set,
then, because minimization is applied after each rule, fewer minimal models are allowed: sequencing can be used
to achieve deterministic behaviour. The notion of goal is straightforwardly extended to be applied to the results of
WG-Log rule sets or of whole WG-Log programs.
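The relational composition behind Sem(P) can be illustrated with a small Python sketch. It assumes, purely for illustration, that each rule set is available as a function returning the set of its possible result instances (a stand-in for Sem(A_j)), and that instances are frozensets of facts; `run_program` and `add_one_of` are hypothetical names.

```python
def run_program(rule_sets, instance):
    """Thread every possible result of one rule set into the next one,
    mirroring the chain I_1, I_2, ..., I_{n+1}."""
    current = {instance}
    for apply_set in rule_sets:
        current = {j for i in current for j in apply_set(i)}
    return current

def add_one_of(*options):
    """Toy non-deterministic rule set: extend the instance with one option."""
    return lambda inst: {inst | {o} for o in options}

results = run_program([add_one_of("a", "b"), add_one_of("c")], frozenset())
```

The first toy set has two possible outcomes and the second has one, so the program relates the empty instance to two final instances.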
The use of all the complexity levels (rules, sets and programs), possibly in conjunction with a goal, guarantees
that WG-Log is computationally complete [PPT95], i.e., it can produce any desired superinstance of a given instance.
Normally, one or two rules, together with a goal, are sufficient to express most of the interesting queries we can
pose to a Web site; however, some important queries do require the full language complexity. As an example, suppose
we want to find all the pairs of nodes that are unreachable from each other by navigation; in other words, we want
all the pairs that are not in the transitive closure of the relationship expressed by label SEL. An easy and natural
way to solve this query is to compute the transitive closure stc of SEL, and then take the complement ctc of that
relation. The WG-Log program of Figure 8 solves this problem. It is a sequence of two sets of rules. The first set,
which consists of two rules, adds stc-edges (logical) between all nodes that are linked by a SEL-path. The second
set has only one rule and takes the complement of the transitive closure by adding a ctc-edge if there is no stc-edge.
Note the use of dummy nodes in the query specification. Finally, a goal can be added to select only the interesting
node pairs; for instance, a query might ask for all the nodes that are not reachable from the page of the artist Bibiena
by using the goal of Figure 9. Note that the presence of goals dramatically reduces the amount of information that
has to be transmitted from the remote side to the client. Moreover, such goals can be used to optimize computation;
however, this is outside the scope of this paper.
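The two-step computation (transitive closure, then complement) can be mimicked outside WG-Log with a short Python sketch. It is not the graphical program of Figure 8: the SEL relation is simply a set of node pairs, and excluding pairs of identical nodes from the complement is an assumption of this sketch.

```python
def transitive_closure(sel):
    """stc: all pairs (a, b) such that b is reachable from a via SEL edges."""
    stc = set(sel)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(stc):
            for (c, d) in sel:
                if b == c and (a, d) not in stc:
                    stc.add((a, d))
                    changed = True
    return stc

def complement(nodes, stc):
    """ctc: ordered pairs of distinct nodes with no stc edge between them."""
    return {(a, b) for a in nodes for b in nodes
            if a != b and (a, b) not in stc}

nodes = {"a", "b", "c"}
sel = {("a", "b"), ("b", "c")}
stc = transitive_closure(sel)
ctc = complement(nodes, stc)
```

The first function plays the role of the first rule set (two rules adding stc-edges), the second that of the single rule adding ctc-edges.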
Figure 8: The complement of the navigational transitive closure. (Edge labels: SEL-label, stc, ctc.)
Figure 9: A goal on the complement of the navigational transitive closure. (Nodes: Author with name "Bibiena"; edge: ctc.)
5 Query Evaluation and Classification
In order to be able to express a rich set of queries, we have conceived WG-Log as a language with a complex semantics;
this gives rise to a computation algorithm that, in the general case, is quite inefficient. However, in most cases the
queries on the Web are expressed by only one or two rules, and possibly a goal which contributes to improving the
efficiency of program computation.
In this section we first present the query evaluation algorithm in its most abstract and general form (see footnote 3) and then
give a classification of the possible queries on Web sites. In Subsection 5.3 a computation algorithm for the most
frequent queries is presented, together with its complexity.
For the sake of simplicity, from now on we call objects the abstract objects, while concrete objects will always
be referred to as slots.
5.1 General Query Evaluation
The general algorithm GenComp computes the result of a generic WG-Log set. Suppose we are given a set of rules
$A = \{r_1, \ldots, r_k\}$ on schema S, and a finite instance I over S. The procedure GenComp will try to extend the
instance I to an instance J, in such a way that J is finite and $(I, J) \in Sem(A)$. If this is impossible, it will print the
message: No solution. GenComp calls the function Extend, which recursively adds elements to J until J satisfies
A, or until J cannot be extended anymore to satisfy A. In this last case, the function backtracks to points where it
made a choice among a number of minimal extensions and continues with the next possible minimal choice. If the
function backtracks to its first call, then there is no solution. In this sense, GenComp is reminiscent of the backtracking
fixpoint procedure that computes stable models [SZ90].
The algorithm uses the notion of legal, minimal extension of an instance. By legal, we mean that the extension
may only contain nodes and edges not belonging to the schema S of the initial instance. Minimal indicates that no
subpart of the extension is sufficient to make the embedding under consideration extendible.
Footnote 3: It is interesting to note that the algorithm reduces to the standard fixpoint computation for those WG-Log programs that are the
graphical counterpart of Datalog, i.e. sets of rules that consist of a red solid part and one green solid edge.
In [PPT95] we proved that the GenComp algorithm is sound and finitely complete for every WG-Log set A, that
is, for every input instance I, GenComp produces every finite instance J such that $(I, J) \in Sem(A)$.
GenComp(I, A)   {Input: I = instance graph, A = set of rule graphs}
  J := I;
  if (Extend(J, A))
  then
    minimize(J);
    output(J);
  else
    output(No solutions);

proc Extend(J, A)   {Input: J = instance graph, A = set of rule graphs}
begin
  for (every rule r of A) do
    for (every embedding of the RS part of rule r in J) do
      if (J does not satisfy the whole rule r)
      then
        SetExt := ∅;
        if (the graph of r has a RD part)
        then
          for (every legal, minimal RD extension Ext of J) do
            SetExt := SetExt ∪ Ext;
          od
        for (every legal, minimal GS extension Ext of J) do
          SetExt := SetExt ∪ Ext;
        od
        while (SetExt ≠ ∅) do
          Ext := select(SetExt);     {select Ext from SetExt and remove it from SetExt}
          J := add(Ext, J);          {add Ext to graph J}
          if (Extend(J, A))
          then return(true);
          else J := remove(Ext, J);  {remove Ext from graph J}
        od
        return(false);
    od
  od
  return(true);
end
The result of the application of a WG-Log program is obtained by sequencing several applications of the algorithm
for sets.
In the most general case, the result of a program with goal is computed by pruning the result instance of the
nodes and edges not specified by the goal. Of course, the best use we can make of a goal is using its information to
avoid redundant computation. This is a matter of future research for us; however, the algorithm we present in the
next section for the so-called simple queries already applies this principle partially, since only nodes and links that
are mentioned in the query graph take part in the computation and are finally reported to the user.
5.2 Query classification
The complexity of the general algorithm GenComp is accounted for by the high expressive power of the language,
but if we consider particular classes of queries it is possible to provide more efficient algorithms. For this reason we
first classify WG-Log queries w.r.t. different orthogonal dimensions. Then, for a particularly frequent class of
queries, we present an algorithm with low complexity.
We consider the following dimensions: 1) complexity of the query, 2) kind of information associated to each node
of the rule graphs, 3) graph connectivity and 4) topology of each colored part of the rule graph.
The first dimension takes into account the whole query, formed by rules and (possibly) a goal. The other
dimensions refer to the single rules of the query: in particular, the last two dimensions study the single colored parts
(RS, RD and GS) of a rule pattern.
1. Complexity: we consider the number of patterns which constitute a single query. We call 1-queries the queries
formed by a single rule r, i.e. represented by a single pattern; otherwise we use the general term query. For
example, the queries of Figures 5 and 7 are 1-queries, whereas the query of Figure 8 is not. As already noted
in the previous section, a typical query on the Web is composed of one or two rules possibly combined with a
goal; queries with more than two rules are conceptually very complex, and thus not frequent.
2. Kind of information: we call detailed queries the queries whose object nodes are not labeled with dummy,
and non-detailed queries the queries which present at least one dummy label. For example the queries of
Figures 5 and 7 are detailed and the query of Figure 8 is non-detailed. Dummy nodes are often used to
express the concept of reachability (logical or navigational) of a generic object, as seen in the computation of
the transitive closure.
3. Graph connectivity: a colored part (RS, RD or GS) of a rule pattern is connected if for any two distinct nodes
there exists at least one chain (i.e. a non-directed path) of that color between the two nodes, and disconnected
otherwise. For example in the query of Figure 7 there are two RS parts separated by the RD edge labeled
with Created_By: for every matching of the two RS parts the absence of the edge Created_By has to be checked.
Notice that the combination of unconnected colored parts allows very complex queries to be formulated.
4. Topology: we classify each colored part (RS, RD, GS) of the rule as:
- list, when all the object nodes of the colored part are connected by a simple chain: e.g. the RS part of the
query of Figure 5 is a list.
- star, when only one object node i is directly connected to n object nodes and these nodes are connected
only to node i. An example of a query whose RS part is a star is the query of Figure 10(a), which finds the
list of the monuments created in the Venetian period by author Sammicheli and contained in the place Liston:
node Monument is the center of the star.
- spider, when there is one object node directly connected to n lists (of any depth). Notice that the star is
a particular case of spider. For example the RS part of the query "find the list of the monuments created
by author Sammicheli, contained in the place Liston, and created in the period when author Bibiena lived"
of Figure 10(b) is a spider: its center is the Monument-node.
Stars and spiders are typical topologies of the RS part of a query, since we are generally interested in extracting
the information of a particular set of abstract objects (possibly one), e.g. the monuments, which satisfy
properties obtained through one or more links with the abstract objects expressing those properties, e.g. created in
a particular period or created by a specific author.
Note that a classification according to topologies can also characterize WG-Log schemata and WG-Log
instances. The classifications w.r.t. the topology and the connectivity of the pattern of a single colored part
are particularly meaningful for the RS part of the rule, since the algorithm we present next first searches the
embedding of the RS part in the instance graph and then extends the embedding to the other two colored parts:
the form of the RS part and also the aggregation of the colored parts in the rule can influence the evaluation
of the query w.r.t. a given instance. Note also that the position of slots does not influence the classification by
topology; this is due to the fact that slots will be represented as parts (or attributes) of the abstract objects.
Figure 10: Examples of rule topologies. (a) star centred on Monument (Period "Venetian", Author "Sammicheli", Place "Liston"); (b) spider centred on Monument (Author "Sammicheli", Place "Liston", Period linked by Lived_in to Author "Bibiena").
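The topology classes above can be recognised mechanically. The following Python sketch classifies an undirected graph of object nodes (slots omitted) as list, star, spider or other. It is an illustrative reading of the definitions rather than code from the prototype: a non-empty node collection is assumed, cyclic or disconnected parts are reported as "other", and a chain is reported as "list" even though short chains are also degenerate stars.

```python
from collections import defaultdict

def classify(nodes, edges):
    """Classify an undirected graph as 'list', 'star', 'spider' or 'other'."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # Connectivity check by depth-first traversal from an arbitrary node.
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - seen)
    n_edges = len({frozenset(e) for e in edges})
    if seen != set(nodes) or n_edges != len(nodes) - 1:
        return "other"                     # disconnected or cyclic
    degrees = [len(adj[v]) for v in nodes]
    hubs = [d for d in degrees if d > 2]
    if not hubs:
        return "list"                      # a simple chain
    if len(hubs) == 1:
        leaves_only = all(d == 1 for d in degrees if d <= 2)
        return "star" if leaves_only else "spider"
    return "other"                         # more than one branching node

star = classify(["Monument", "Period", "Author", "Place"],
                [("Monument", "Period"), ("Monument", "Author"),
                 ("Monument", "Place")])
spider = classify(["Monument", "Author", "Place", "Period", "Author2"],
                  [("Monument", "Author"), ("Monument", "Place"),
                   ("Monument", "Period"), ("Period", "Author2")])
chain = classify(["Monument", "Author"], [("Monument", "Author")])
```

The first two inputs mirror the shapes described for Figure 10(a) and 10(b); the node name Author2 is a hypothetical stand-in for the second Author node.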
The problem of graph matching has been widely studied in different forms. For example, [MW89] studies the
complexity of matching graph paths to a generic graph (instance). The problem is found to be NP-hard in the
general case, while the authors show that it is polynomial in the case of conflict-free instance graphs, i.e. graphs
where each instance node does not belong to more than one pattern.
In [Bat97] G. Di Battista analysed the problem in terms of graph homomorphisms; under the same hypothesis
of conflict-free instance graphs as before, the problem is reduced to the classical MaxFlow, and thus proved to be
polynomial, in the case of star-shaped rule graphs. However, it is interesting to remark that ours is a problem of graph
isomorphisms rather than graph homomorphisms, since our semantics requires that edges in the query be mapped
to other edges in the instance, and not to generic paths.
We now prepare to solve the problem for our particular case, by setting the context of the first class of (very
frequent) queries we study.
We call simple query a detailed 1-query having a connected RS part of any form, a (possibly unconnected) RD
part formed by lists of length 1, and a connected GS part formed by a single node linked to the rest of the graph by
an edge. E.g. the queries of Figure 5 and Figure 10 are simple.
Simple queries are rather expressive and cover the most common queries on the Web. A GS part
formed by a single node expresses a very frequent situation, since it allows the information which
satisfies the red part of the query to be organized in a list (more complex GS parts are instead generally used in views to restructure the
information); moreover, RD parts formed by lists of length 1 are sufficiently expressive, since generally we want to
negate some relations directly linked to the information to be extracted, while a whole list of dashed object nodes
becomes quite difficult to understand.
5.3 The Simple-Query Computation Algorithm
We now present SA (Simple Algorithm), which computes the result of a WG-Log simple query by using a kind of
depth-first search technique. Appendix B contains an example of query computation.
Suppose we are given a rule R on schema S and a finite instance I over S. The algorithm SA will produce a
graph $J'$ in such a way that $J = I \cup J'$ is finite and $(I, J) \in Sem(R)$. Intuitively, $J'$ is an instance over the schema
represented by the solid part of the rule. The usefulness of instance $J'$ is that it is obtained by setting a kind of
implicit goal on the query in order to prune the final instance of all the information that is not mentioned in the
query itself. This allows a significant amount of optimization, because during computation the algorithm will only
retain query-related information. If the user is interested in the whole final instance, it is always possible to
eliminate the implicit goal by merging the output $J'$ with the input instance I.
The algorithm SA tries to determine $J'$ by finding all the embeddings of the RS part of the rule in the instance
graph and verifying that these embeddings are not extendible in the instance with the RD part of the query. The
search for all the embeddings is made by a depth-first search in the graph of I guided by a depth-first search in the
graph of R. Starting from an object node x in R, the algorithm searches all the corresponding object nodes in I and,
for each of these, tries to find the embedding for the whole RS part. We say that an instance node corresponds
to x if it has the same label as x and the same slots as x, or a superset thereof.
The Depth_First_Search procedure, given an object node x of the rule graph and an object node y of the instance
graph, searches all object nodes adjacent to y corresponding to the object nodes adjacent to x.
The algorithm uses eight auxiliary functions. Starting_Object(R) returns the RS node with the fewest corresponding
nodes in the instance; Corresponding_Instance_Nodes(I, x) returns the set of the nodes of I that correspond to the
node x of the rule; Create_Green_Node(Y) creates a new instance node of the same type as the green node of the rule
and links it to the nodes in the set Y; Add_Node_To_Instance(J, y) adds the instance node y to the graph J if it
is not already present. Corresponding(x, y) verifies the correspondence of x and y according to the color of x: if
the node x is RS, the function returns true if the instance node y has the same slots as x, or a superset thereof, and
false otherwise; if the node x is RD, the function returns true if all the slots of the instance node y are different
from the slots of x, and false otherwise. Incident_Edges(x) returns the set of labels of the edges incident to node x;
Adjacent_Nodes(x, e_x) returns the set of object nodes linked to x by an edge labeled e_x; Connected_to_Green(x)
returns true if x is connected to the GS node (of the rule).
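A possible reading of the Corresponding test is sketched below in Python. The encoding of nodes as dictionaries with "label", "color" and "slots" keys is an assumption made for illustration, as is the interpretation of the RD case (every slot value requested by the rule node must differ in the instance node); the label comparison follows the definition of correspondence given earlier.

```python
def corresponding(rule_node, inst_node):
    """Colour-dependent correspondence between a rule node and an instance node."""
    if rule_node["label"] != inst_node["label"]:
        return False
    rule_slots, inst_slots = rule_node["slots"], inst_node["slots"]
    if rule_node["color"] == "RS":
        # The instance node must carry the same slots, or a superset of them.
        return all(inst_slots.get(k) == v for k, v in rule_slots.items())
    if rule_node["color"] == "RD":
        # Assumed reading: each slot of the rule node must differ in the instance.
        return all(inst_slots.get(k) != v for k, v in rule_slots.items())
    raise ValueError("only RS and RD rule nodes are matched against the instance")

rs_author = {"label": "Author", "color": "RS", "slots": {"name": "Bibiena"}}
rd_author = {"label": "Author", "color": "RD", "slots": {"name": "Bibiena"}}
# Toy instance nodes; the extra "page" slot is hypothetical.
bibiena = {"label": "Author", "slots": {"name": "Bibiena", "page": "b.html"}}
sammicheli = {"label": "Author", "slots": {"name": "Sammicheli"}}

checks = (corresponding(rs_author, bibiena),
          corresponding(rs_author, sammicheli),
          corresponding(rd_author, bibiena),
          corresponding(rd_author, sammicheli))
```

Since this comparison is the operation counted in the complexity analysis, keeping it a simple dictionary lookup per slot is a natural design choice.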
We denote by $J' = SA(I, R)$ the output instance of SA, given the input instance I and rule R. It is straightforward
to show that the instance $J = I \cup J'$ is such that $(I, J) \in Sem(R)$ and, therefore, SA is sound. Moreover, finite
completeness can be derived by observing that SA is a simple optimization of the GenComp algorithm, which is shown
to be sound and finitely complete in [PPT95].
The time complexity of algorithm SA can be determined by evaluating the number of comparisons between the
query object nodes and the instance object nodes (function Corresponding()), because this is the most expensive
operation of the algorithm. The order of the number of comparisons is essentially a function of the length of the
longest simple path in the query graph.
SA(I, R)   {Input: I = instance graph, R = rule graph}
begin
  J := null;
  to_green := null;   {set of instance nodes connected to green node}
  x := Starting_Object(R);
  set_of_y := Corresponding_Instance_Nodes(I, x);
  for y ∈ set_of_y do
    Make_Unvisited_Nodes(R);   {prepare for a depth-first search}
    Depth_First_Search(I, J, R, x, y, to_green)
  od
  if (to_green ≠ ∅)
  then
    y' := Create_Green_Node(to_green);
    Add_Node_To_Instance(J, y');
    output(J);
  else
    output(No solutions);
end

proc Depth_First_Search(I, J, R, x, y, to_green)
  {Input: I = instance graph, J = result graph, R = rule graph, x = rule node,}
  {y = instance node, to_green = set of instance nodes}
  if Corresponding(x, y)
  then
    visit[x] := true;
    corresponding[x] := y;
    for e'_x ∈ Incident_Edges(x) do
      x' := Adjacent_Nodes(x, e'_x);
      embedding := false;
      for y' ∈ Adjacent_Nodes(y, e'_x) do
        if ¬(visit[x'])
        then
          if (Depth_First_Search(I, J, R, x', y', to_green))
          then embedding := true;
          visit[x'] := false;
          corresponding[x'] := null;
        else
          if (corresponding[x'] = y')
          then embedding := true;
      od
      if ¬(embedding) then return(false);
    od
    Add_Node_To_Instance(J, y);
    if Connected_to_Green(x) then to_green := to_green ∪ y;
    return(true);
  else return(false);
Let $R = (V, E)$, with $|V| = n$, be the rule graph and $I = (V', E')$, with $|V'| = m$, the instance graph. Without
loss of generality, let $x_1$ be the starting node of the algorithm and

$$x_1, x^1_2, \ldots, x^1_{j_1};$$
$$\vdots$$
$$x_1, x^k_2, \ldots, x^k_{j_k};$$

be the k ($0 \le k \le n$) disjoint simple paths (ignoring the common starting point $x_1$) such that $j_1 \ge j_2 \ge \cdots \ge j_k$
and $\bigcup x^j_i = V$.
An upper bound to the comparisons made by the algorithm is

$$\#\text{comparisons} \le m + \sum_{i=1}^{k} \prod_{l=1}^{j_i} m_{\lambda(x^i_l)}$$

where $m_i$ ($\sum_j m_j = m$) is the cardinality of the instance object nodes of type i and $\lambda$ is the label function (Ref.
Sec. 4.2).
The worst case for the algorithm is given, for example, when the rule is a list of n nodes of the same type and the
instance has m nodes of this type; it is simple to show that in this case the time complexity is $O(m^n)$. However, if
we consider the most probable star topology for the rule, even if the nodes are of the same type, the time complexity
already reduces to $O(nm^2)$, which is polynomial in the size of the two graphs.
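The two cases can be checked numerically by evaluating the upper bound directly. In the sketch below each simple path is given as the list of type labels of its nodes, starting with the type of the starting node, and `type_count` maps each type to the number of instance nodes of that type; the function name and the toy numbers are illustrative assumptions.

```python
from math import prod

def comparison_bound(paths, type_count):
    """m plus, for each simple path, the product of the type cardinalities
    of the nodes along that path."""
    m = sum(type_count.values())
    return m + sum(prod(type_count[t] for t in path) for path in paths)

m_nodes = {"T": 4}                                   # m = 4 nodes of one type
list_bound = comparison_bound([["T", "T", "T"]], m_nodes)   # list with n = 3
star_bound = comparison_bound([["T", "T"]] * 3, m_nodes)    # star with n = 4
```

For the list the bound is m + m^n = 4 + 4^3, while for the star it is m + (n - 1) m^2 = 4 + 3 * 4^2, in line with the orders of growth stated above.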
To improve the performance of SA in its current implementation, we have optimized the data structures of
the algorithm. There are two main data structures: the Typed Adjacency List TAL and the Instance Table IT.
The graphs are represented by the TAL, where each entry gives the type (navigational, logical or coupled) and
the orientation (in, out) of the links incident to the object node; this allows the graphs to be visited more efficiently. The
slots are not stored in the TAL; it is sufficient, and surely less space-consuming, to store them in auxiliary data
structures pertaining to each single object, in order to allow fast label matching. IT associates the label of each
schema object with a list of unique numbers called instance identifiers; this allows SA to efficiently trace instances
of schema-defined objects in the instance graph. In this way, even if in the worst case the time complexity remains
the same, the performance can be significantly improved. An example of rule evaluation is shown in Appendix B.
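One way to picture the two data structures is the following Python sketch. It is a guess at a reasonable layout, not the prototype's implementation: the class and method names are hypothetical, instance identifiers are consecutive integers, and slots are kept in a separate per-object dictionary as the text suggests.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    neighbour: int      # instance identifier of the node at the other end
    label: str          # edge label, e.g. "Created_By"
    kind: str           # "navigational", "logical" or "coupled"
    direction: str      # "in" or "out", seen from the owning node

class InstanceStore:
    def __init__(self):
        self.tal = defaultdict(list)   # TAL: identifier -> typed incident links
        self.it = defaultdict(list)    # IT: schema label -> instance identifiers
        self.slots = {}                # auxiliary per-object slot storage

    def add_object(self, label, **slots):
        oid = len(self.slots)
        self.slots[oid] = dict(slots)
        self.it[label].append(oid)
        return oid

    def add_link(self, src, dst, label, kind):
        self.tal[src].append(Link(dst, label, kind, "out"))
        self.tal[dst].append(Link(src, label, kind, "in"))

    def neighbours(self, oid, label, direction="out"):
        return [l.neighbour for l in self.tal[oid]
                if l.label == label and l.direction == direction]

store = InstanceStore()
m1 = store.add_object("Monument", name="monument-1")   # toy data
a1 = store.add_object("Author", name="Bibiena")
store.add_link(m1, a1, "Created_By", "logical")
```

The IT lookup `store.it["Monument"]` gives the candidate instance nodes for a rule node directly, and `neighbours` gives the adjacent nodes along a labeled edge without scanning the whole graph.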
6 Conclusions and future work
In this paper we have presented WG-Log, a new approach to querying Web sites which is based on the specification
of site semantics by a notion of site schema. In WG-Log, graph-based instance, schema and query representations are
used, while Web site data remain in their original, semi-structured form, thus allowing the system to run side-by-side
with the more traditional searching and browsing mechanisms.
This approach is particularly suited for Intranets, but our future research will address specific problems related
to its smooth extension to the Internet, providing the possibility to express and evaluate queries across federated or
totally unrelated sites. Other future work concerns the study of the complexity of different classes of queries with
their optimization, the diffusion of our prototype to different Internet sites in order to verify the practical feasibility
of the approach, the presentation issues related to query answering and the full integration of the WG-Log querying
environment with the existing site exploration tools.
Acknowledgments
The authors wish to thank R. Torlone for putting forward the idea that gave rise to the present research, and A.
Bertoni for many useful discussions on the subject. Thanks are also due to the students M. Baldi, F. Insaccanebbia
and S. Serafini, from the University of Verona, for their contribution to the development of the prototype as a part
of their Master's Thesis.
References
[AM97] Paolo Atzeni and Giansalvatore Mecca. To Weave the Web. In Proceedings of VLDB'97, pages 206-215, 1997.
[Bat97] G. Di Battista. Personal communications. 1997.
[BMY95] V. Balasubramanian, Bang Min Ma, and Joonhee Yoo. A systematic approach to designing a WWW application.
Communications of the ACM, 38(8):47-48, August 1995. http://www.acm.org/pubs/toc/Abstracts/0001-0782/208355.html.
[CM90] Mariano P. Consens and Alberto O. Mendelzon. The G+/GraphLog visual query system. SIGMOD
Record (ACM Special Interest Group on Management of Data), 19(2):388, June 1990.
[DL96] Andrew Davison and Seng Wai Loke. Logicweb: Enhancing the web with logic programming. Submitted
to JLP. http://www.cs.mu.oz.au/~swloke/papers/lw.ps.gz, 1996.
[Dre] D. Dreilinger. SavySearch Home Page. http://www.lycos.com.
[DT97] E. Damiani and L. Tanca. Semantic Approach to Structuring and Querying the Web Sites. In Proceedings
of 7th IFIP Work. Conf. on Database Semantics (DS-97), 1997.
[FP] Pietro Fraternali and Paolo Paolini. Autoweb: Automatic Generation of Web Applications from Declarative
Specifications. http://www.ing.unico.it/Autoweb.
[GMP95] Franca Garzotto, Luca Mainetti, and Paolo Paolini. Hypermedia design, analysis, and evaluation issues.
Communications of the ACM, 38(8):74-86, August 1995.
[GMPQ+97] Héctor García-Molina, Yannis Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, Jeffrey Ullman,
V. Vassalos, and Jennifer Widom. The TSIMMIS Approach to Mediation: Data Models and Languages.
In Proceedings of JIIS, volume 2, pages 117-132, 1997.
[GPdBG94] Marc Gyssens, Jan Paredaens, Jan Van den Bussche, and Dirk Van Gucht. A graph-oriented object
database model. IEEE Transactions on Knowledge and Data Engineering, 6(4):572-586, August 1994.
ftp://wins.uia.ac.be/pub/good/good.ps.Z.
[Ham97] Scott Hamilton. E-commerce for the 21st century. Computer, 30(5):44-47, May 1997.
[ISB95] Tomás Isakowitz, Edward A. Stohr, and P. Balasubramanian. RMM: A methodology for
structured hypermedia design. Communications of the ACM, 38(8):34-44, August 1995.
[KMSS97] Y. Kogan, D. Michaeli, Y. Sagiv, and O. Shmueli. Utilizing the Multiple Facets of WWW Contents.
In Proceedings of NGITS, 1997.
[KS95] D. Konopnicki and O. Shmueli. W3QL: A Query System for the World Wide Web. In Proceedings of
the 21st International Conf. on Very Large Databases, pages 54-65, Zurich, 1995.
[LSRH97] Sean Luke, Lee Spector, David Rager, and James Hendler. Ontology-based web agents. In W. Lewis
Johnson, editor, Proceedings of the First International Conference on Autonomous Agents, New York,
1997. ACM Press.
[LSS96] L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. A declarative language for querying and
restructuring the Web. In IEEE, editor, Sixth International Workshop on Research Issues in Data
Engineering: interoperability of nontraditional database systems: proceedings, February 26-27, 1996,
New Orleans, Louisiana, pages 12-21, 1109 Spring Street, Suite 300, Silver Spring, MD 20910, USA,
1996. IEEE Computer Society Press.
[Lyc] Lycos, Inc. The Lycos Catalog of the Internet. http://www.lycos.com.
[LZ96] Dario Lucarella and Antonella Zanzi. A visual retrieval environment for hypermedia information systems.
ACM Transactions on Information Systems, 1(14):3-29, 1996.
[McB94] Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. In O. Nierstrasz, editor,
Proceedings of the First International World Wide Web Conference, page 15, CERN, Geneva, May
1994. http://www.cs.colorado.edu/home/mcbryan/mypapers/www94.ps.
[MMM96] A. O. Mendelzon, G. A. Mihaila, and T. Milo. Querying the World Wide Web. In IEEE, editor,
Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems:
December 18-20, 1996, Miami Beach, Florida, pages ??-??, 1109 Spring Street, Suite 300, Silver Spring,
MD 20910, USA, 1996. IEEE Computer Society Press.
[MW89] A. O. Mendelzon and P. T. Wood. Finding regular simple paths in graph databases. In Proceedings
of the 15th Conference on Very Large Databases, Morgan Kaufmann pubs. (Los Altos CA), Amsterdam,
pages 185-193, August 1989.
[Pin94] Brian Pinkerton. Finding What People Want: Experiences with the WebCrawler. In Proceedings of
Second Annual WWW Conference, 1994.
[PPT95] Jan Paredaens, Peter Peelman, and Letizia Tanca. G-Log: A graph-based query language. IEEE
Transactions on Knowledge and Data Engineering, 7(3):436-453, June 1995.
[Qua] Quarterdeck, Inc. Webcompass Fact Sheet. http://www.arachnid.qdeck.com/qdeck/products/webcompass.
[Sie96] Jon Siegel. CORBA: Fundamentals and Programming. John Wiley & Sons Inc., New York, 1st edition,
1996.
[SZ90] Domenico Saccà and Carlo Zaniolo. Stable models and non-determinism in logic programs with negation.
In ACM-SIGART, ACM-SIGACT, ACM-SIGMOD, editor, Proceedings of the 9th ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems, pages 205-217, Nashville, TN, April
1990. ACM Press.
[Tor96] Riccardo Torlone. Linguaggi di interrogazione per il World Wide Web. In Proceedings of SEBD 96,
1996.
[Wor] World Wide Web Consortium. HTML 4.0 Specification Working Draft.
http://www.w3.org/TR/WD-html40/.
[Yah] Yahoo, Inc. Yahoo! http://www.yahoo.com.
A The Design of a Classification and Querying System based on WG-Log
In order to show the feasibility of our approach, we designed and developed a prototype of our system that is
currently under testing at the University of Verona, Italy. The prototype design aims at addressing the following
three problems:
Performance. The inherent complexity of our approach makes performance a critical issue. We designed our
prototype as a three-tier client-server system, where server system modules (Schema Robots and Remote Query
Managers) are implemented as object-oriented wrappers which use standard, high performance database
technology to store their local data. Our thin client module is implemented as a platform independent, downloadable
Java applet; to achieve acceptable performance at the client site, this applet is mainly devoted to graphical
user-interface tasks.
Modularity. Server system modules (Schema Robots, Local and Remote Query Managers) are seen by clients
as distributed objects exporting a standard set of services to be invoked by clients via Java's standard Remote
Method Invocation facility. Since the invocation arguments are themselves objects, the message granularity of an
Interchange Object Protocol is supported, allowing for future migration to a full-fledged CORBA framework
[Sie96]. The distributed objects approach allows complete decoupling between system modules.
Scalability. The current prototype allows Schema Robots to communicate via a reserved set of RMI primitives,
exchanging information to build an evolvable, distributed classification hierarchy. This will be particularly
important for future developments, as our approach is currently aimed at Intranets but potentially extendable
to a wider set of Internet sites.
To clarify how our current implementation operates, let us briefly summarize the steps required to issue a query.
1. When a user connects to a Schema Robot via a conventional browser, an ordinary HTTP connection is
established, downloading the client Java applet to the user's machine. After downloading, all interaction between
system modules takes place in the form of RMI invocations.
2. The Schema Robot's FindSchema primitive is first invoked by the client applet, specifying as input argument a
Java object containing the keywords specified by the user. Following the three-tier client-server paradigm,
schema identification is performed by a conventional query posed by the Schema Robot to a relational database
via a JDBC gateway. As a result, FindSchema's output parameter sent to the client is again an object wrapping
the found schema in textual form.
3. After receiving the schema object, the client applet renders it graphically, allowing the user
to visually build a query object. This object is then passed as the input parameter of one of the Remote Query
Manager's Query primitives. The client module, following the user's directions, can choose the primitive
returning the kind of output it prefers, selecting between a list of pages and a fully restructured hypertext.
4. While offering an RMI interface, the Remote Query Manager comprises a fast compiled C++ module that stores
an optimized instance representation (in the well-known form of adjacency lists) and implements the query
execution algorithm described in the previous sections to build the query response.
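
The four steps above can be sketched very loosely as follows. This is an illustration in Python, not the Java/RMI
and C++ code of the prototype: RMI invocations are replaced by ordinary method calls, an in-memory SQLite database
stands in for the relational database reached through the JDBC gateway, and the class names, method names, table
layout and textual schema are all hypothetical.

```python
import sqlite3


class SchemaRobot:
    """Hypothetical stand-in for a Schema Robot: maps keywords to a stored schema."""

    def __init__(self):
        # Assumption: an in-memory SQLite table replaces the real relational database.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE schemas (keyword TEXT, site TEXT, schema_text TEXT)"
        )
        self.db.execute(
            "INSERT INTO schemas VALUES (?, ?, ?)",
            ("monuments", "example-site", "Monument -author-> Author"),
        )

    def find_schema(self, keywords):
        """Step 2: a conventional query returns the schema in textual form."""
        if not keywords:
            return None
        placeholders = ",".join("?" for _ in keywords)
        row = self.db.execute(
            f"SELECT site, schema_text FROM schemas WHERE keyword IN ({placeholders})",
            list(keywords),
        ).fetchone()
        return None if row is None else {"site": row[0], "schema": row[1]}


class RemoteQueryManager:
    """Hypothetical stand-in for a Remote Query Manager using adjacency lists."""

    def __init__(self, adjacency, labels):
        self.adjacency = adjacency  # node id -> list of linked node ids
        self.labels = labels        # node id -> schema label

    def query_pages(self, start_node, wanted_label):
        """Steps 3-4: return the 'list of pages' flavour of output."""
        return sorted(
            node
            for node in self.adjacency.get(start_node, [])
            if self.labels.get(node) == wanted_label
        )


robot = SchemaRobot()
schema = robot.find_schema(["monuments"])           # client -> Schema Robot
manager = RemoteQueryManager({9: [12]}, {9: "Author", 12: "Monument"})
pages = manager.query_pages(9, "Monument")          # client -> Remote Query Manager
```

The sketch only mirrors the division of labour: schema lookup is a plain database query, while query answering
runs over an adjacency-list representation of the instance.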
B An example of Rule Evaluation
We shall now briefly comment on how SA can be used, at least in principle, to execute a WG-Log rule. Our sample
query execution is based on two data structures:
the (Typed) Adjacency List TAL;
the Instance Table IT linking schema entities and their instances.
The role of the instance table is in many respects similar to that of the ontology introduced in [LSRH97]. Each
entry of the adjacency list gives the type (navigational, logical or coupled) and the orientation (in, out) of the links
incident to the object node. For the sake of conciseness, Table 1 does not show the full list but a simplified one,
restricted to the nodes involved in the rule of Figure 5, where each entry gives the type and orientation of a link
(n=navigational, l=logical, c=coupled, i=in, o=out) and the node it connects to. Moreover, rows and columns
pertaining to slots are not listed in TAL, since it is sufficient, and surely less space-consuming, to store them in
auxiliary data structures pertaining to each single entity, in order to allow fast label matching. IT (Table 1)
associates the label of each schema entity with a list of unique numbers called instance identifiers; this allows the
algorithm to trace instances of schema-defined entities in the instance graph.
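
As an illustration, the two structures can be written down in Python using only the rows visible in Table 1. This
is a sketch under assumptions, not the prototype's C++ representation: the names Link, label_of and neighbours are
hypothetical, and slots are left out, as in the text.

```python
from typing import NamedTuple


class Link(NamedTuple):
    kind: str         # 'n' = navigational, 'l' = logical, 'c' = coupled
    orientation: str  # 'i' = incoming, 'o' = outgoing
    node: int         # identifier of the node at the other end of the link


# Typed adjacency list (TAL): only the rows shown in Table 1.
TAL = {
    8: [Link("l", "o", 7), Link("l", "o", 22), Link("l", "i", 3), Link("l", "i", 11)],
    9: [Link("l", "o", 7), Link("l", "o", 23), Link("l", "i", 3), Link("l", "i", 12)],
    11: [Link("l", "o", 7), Link("l", "o", 8), Link("l", "i", 18),
         Link("l", "i", 2), Link("l", "i", 22)],
    12: [Link("l", "o", 7), Link("l", "o", 9), Link("l", "i", 17), Link("l", "i", 2)],
}

# Instance table (IT): schema label -> instance identifiers, as in Table 1.
IT = {
    "Home": [0],
    "Places": [1],
    "Monuments": [2],
    "Authors": [3],
    "Period": [4, 5, 6, 7],
    "Author": [8, 9, 10],
    "Monument": [11, 12, 13, 14, 15, 16],
    "Place": [17, 18, 19, 20, 21],
    "Author of Index": [22, 23, 24],
}


def label_of(node):
    """Reverse lookup in IT: the schema label of an instance node, or None."""
    for label, nodes in IT.items():
        if node in nodes:
            return label
    return None


def neighbours(node, kind=None, orientation=None):
    """Nodes linked to `node`, optionally filtered by link type and orientation."""
    return [
        link.node
        for link in TAL.get(node, [])
        if (kind is None or link.kind == kind)
        and (orientation is None or link.orientation == orientation)
    ]
```

For example, neighbours(9, orientation="i") lists the nodes with a link pointing to node 9, and label_of tells
which schema entity each of them instantiates.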
Consider the query asking for the monuments whose author is Bibiena; the initial values of the SA parameters
are the whole instance of Figure 4 and the single rule of Figure 5.
Node   Incident links with nodes
...
8      (l,o,7), (l,o,22), (l,i,3), (l,i,11)
9      (l,o,7), (l,o,23), (l,i,3), (l,i,12)
...
11     (l,o,7), (l,o,8), (l,i,18), (l,i,2), (l,i,22)
12     (l,o,7), (l,o,9), (l,i,17), (l,i,2)
...

Labels            Instance nodes
Home              0
Places            1
Monuments         2
Authors           3
Period            4, 5, 6, 7
Author            8, 9, 10
Monument          11, 12, 13, 14, 15, 16
Place             17, 18, 19, 20, 21
Author of Index   22, 23, 24

Table 1: The simplified instance list and the instance table

At first, the algorithm SA chooses the Author as the starting rule node because it is the most discriminant node
of the query, i.e. it is the RS node with the fewest corresponding nodes in the instance. The Corresponding Instance
Entities returns as set_of_y the list {9}. Now, we are ready to follow the tree of recursive calls of the Depth First
Search for our sample SA execution.
J := null;
to_green := null;
x := Author(Bibiena);
for y ∈ set_of_y
    y := 9;
    Depth_First_Search(I, J, x, y)
        x' := Monument;
        y' := 12;
        Depth_First_Search(I, J, x', y')
            return(true);
        J := J ∪ {12};
        to_green := to_green ∪ {12};
        return(true);
end for
y' := new node linked to 12;
J := J ∪ y';
output(J);
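
The trace can be reproduced with a small runnable sketch in Python. It is not the SA algorithm of the previous
sections but a simplification under stated assumptions: the rule of Figure 5 is reduced to two rule nodes (Author
and Monument) joined by a link, link type and orientation are ignored during matching, the slot value "Bibiena" is
attached to instance 9 only because the text reports {9} as the matching set, and the final creation of a new node
linked to 12 is omitted. All function and variable names are hypothetical.

```python
# Rows of Table 1 needed here: node -> [(type, orientation, other node)].
TAL = {
    9: [("l", "o", 7), ("l", "o", 23), ("l", "i", 3), ("l", "i", 12)],
    12: [("l", "o", 7), ("l", "o", 9), ("l", "i", 17), ("l", "i", 2)],
}
IT = {"Author": [8, 9, 10], "Monument": [11, 12, 13, 14, 15, 16]}

# Assumption: only the slot value mentioned in the text is known.
NAME_SLOT = {9: "Bibiena"}

# Hypothetical two-node rule: each rule node lists the rule nodes linked to it.
RULE = {"Author": ["Monument"], "Monument": ["Author"]}


def starting_node(rule, it):
    """Pick the most discriminant rule node: the one with fewest instances."""
    return min(rule, key=lambda label: len(it[label]))


def depth_first_search(rule, x, y, visited, matched):
    """Match rule node x on instance node y, then recurse on linked rule nodes."""
    visited.add(x)
    matched.setdefault(x, set()).add(y)
    for x_next in rule[x]:
        if x_next in visited:
            continue
        # Simplification: any link counts, whatever its type and orientation.
        for _kind, _orientation, y_next in TAL.get(y, []):
            if y_next in IT[x_next]:
                depth_first_search(rule, x_next, y_next, visited, matched)
    return True


x = starting_node(RULE, IT)
set_of_y = [y for y in IT[x] if NAME_SLOT.get(y) == "Bibiena"]
J = set()
for y in set_of_y:
    matched = {}
    depth_first_search(RULE, x, y, set(), matched)
    J |= matched.get("Monument", set())

print(x, set_of_y, sorted(J))  # prints: Author [9] [12]
```

As in the trace, the search starts from Author instance 9, follows its link to node 12, recognises 12 as a
Monument through IT, and adds it to J.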