10.1.1.21.5883


Enabling Technologies for Interoperability

Ubbo Visser, Heiner Stuckenschmidt, Holger Wache, Thomas Vögele
TZI, Center for Computing Technologies
University of Bremen
D-28215 Bremen, Germany
{visser|heiner|wache|vogele}@informatik.uni-bremen.de

Abstract

We present a new approach that aims to minimize the numerous problems standing in the way of fully interoperable GIS. We discuss the existence of these heterogeneity problems and the fact that they must be solved to achieve interoperability. These problems are addressed on three levels: the syntactic, structural and semantic level. In addition, we identify the needs for an approach performing semantic translation for interoperability and introduce a uniform description of contexts. Furthermore, we discuss a conceptual architecture, Buster (Bremen University Semantic Translation for Enhanced Retrieval), which can provide intelligent information integration based on a re-classification of information entities in a new context. Lastly, we demonstrate our theories by sketching a real-life scenario.

Introduction

Over the last few years much work has been conducted on the research topic of fully interoperable GIS. Vckovski (Vckovski, 1998), for example, gives an overview of the problems regarding data integration and geographical information systems. Furthermore, the proceedings of the 2nd International Conference on Interoperating Geographic Information Systems (Interop99) (Vckovski et al., 1999) contain numerous contributions on this research topic (e. g. (Wiederhold, 1999), (Landgraf, 1999)). In addition, recent studies in areas such as data warehousing (Wiener et al., 1996) and information integration (Galhardas et al., 1998) have also addressed interoperability problems.

GISs share the need to store and process large amounts of diverse data, which is often geographically distributed. Most GISs use specific data models and databases for this purpose. This implies that making new data available to the system requires the data to be transferred into the system's specific data format, a process which is very time consuming and tedious. Data acquisition, automatic or semi-automatic, often makes large-scale investment in technical infrastructure and/or manpower inevitable. These obstacles are some of the motivation behind the concept of information integration. Information integration applies here because existing information can be accessed by remote systems in order to supplement their own data basis.

The advantages of successful information integration are obvious for many reasons:

• Quality improvement of data due to the availability of large and complete data.
• Improvement of existing analysis and application of the new analysis.
• Cost reduction resulting from the multiple use of existing information sources.
• Avoidance of redundant data and conflicts that can arise from redundancy.

However, before we can establish efficient information integration, difficulties arising from organizational and competence questions, as well as many technical problems, have to be solved. Firstly, a suitable information source must be located which contains the data needed for a given task. Once the information source has been found, access to the data contained therein has to be provided. Furthermore, access has to be provided on both a technical and an informational level. In short, information integration not only needs to provide full accessibility to the data, it also requires that the accessed data can be interpreted by the remote system. While the problem of providing access to information has been largely solved by the invention of large-scale computer networks, the problem of processing and interpreting retrieved information remains an important research topic. This paper will address three of the problems mentioned above:

• finding suitable information sources,
• enabling a remote system to process the accessed data,
• and helping the remote system to interpret the accessed data as intended by its source.

In addressing these questions we will explore technologies which enable systems to interoperate, always bearing in mind the special needs of GIS.
Levels of Integration

Our modern information society requires complete access to all information available. The opening of information systems towards integrated access, which has been encouraged in order to satisfy this demand, creates new challenges for many areas of computer science. In this paper, we distinguish different integration tasks that need to be solved in order to achieve complete integrated access to information:

Syntactic Integration: Many standards have evolved that can be used to integrate different information sources. Besides classical database interfaces such as ODBC, web-oriented standards such as HTML and XML are gaining importance.

Structural Integration: The first problem that goes beyond a purely syntactic level is the integration of heterogeneous structures. This problem is normally solved by mediator systems defining mapping rules between different information structures.

Semantic Integration: In the following, we use the term semantic integration or semantic translation, respectively, to denote the resolution of semantic conflicts that make a one-to-one mapping between concepts or terms impossible.

Our approach provides an overall solution to the problem of information integration, taking into account all three levels of integration and combining several technologies, including standard markup languages, mediator systems, ontologies, and a knowledge-based classifier.

Enabling Technologies

In order to overcome the obstacles mentioned earlier, it is not sufficient to solve the heterogeneity problems separately. It is important to note that these problems can only be solved with a system taking all three levels of integration into account. In the following subsections we will give a short introduction to what we mean by problems concerning syntactic, structural and semantic integration.

Syntactic Integration

The typical task of syntactic data integration is to specify the information source on a syntactic level. This means that different data type problems can be solved (e. g. short int vs. int and/or long). This first data abstraction is used to re-structure the information source. The standard technology to overcome problems on this level is the wrapper. Wrappers hide the internal data structure model of a source and transform the contents to a uniform data structure model.

Structural Integration

The task of structural data integration is to re-format the data structures to a new homogeneous data structure. This can be done with the help of a formalism that is able to construct one specific information source out of numerous other information sources. This is a classical middleware task, which can be done with CORBA (OMG, 1992) on a low level or with rule-based mediators (Wiederhold, 1992) on a higher level.

Mediators provide flexible integration of several information systems such as database management systems, GIS, or the world wide web. A mediator combines, integrates, and abstracts the information provided by the sources. Normally the sources are encapsulated by wrappers.

Over the last few years numerous mediators have been developed. A popular example is the rule-driven TSIMMIS mediator (Chawathe et al., 1994), (Papakonstantinou et al., 1996). The rules in the mediator describe how information of the sources can be mapped to the integrated view. In simple cases, a rule mediator converts the information of the sources into information on the integrated view. The mediator uses the rules to split the query, which is formulated with respect to the integrated view, into several sub-queries for each source and combines the results according to a query plan.

A mediator has to solve the same problems which are discussed in the federated database research area, i. e. structural heterogeneity (schematic heterogeneity) and semantic heterogeneity (data heterogeneity) (Kim and Seo, 1991), (Naiman and Ouksel, 1995), (Kim et al., 1995). Structural heterogeneity means that different information systems store their data in different structures. Semantic heterogeneity considers the content and semantics of an information item. In rule-based mediators, rules are mainly designed in order to reconcile structural heterogeneity, whereas discovering semantic heterogeneity problems and their reconciliation play a subordinate role. But for the reconciliation of the semantic heterogeneity problems, the semantic level must also be considered. Contexts are one possibility to describe the semantic level. A context contains "meta data relating to its meaning, properties (such as its source, quality, and precision), and organization" (Kashyap and Sheth, 1997). A value has to be considered in its context and may be transformed into another context (so-called context transformation).

Semantic Integration

The semantic integration process is by far the most complicated process and presents a real challenge. As with database integration, semantic heterogeneities are the main problems that have to be solved within spatial data integration (Vckovski, 1998). Other authors from the GIS community call this problem inconsistencies (Shepherd, 1991). Worboys & Deen (Worboys and Deen, 1991) have identified two types of semantic heterogeneity in distributed geographic databases:
• Generic semantic heterogeneity: Heterogeneity resulting from field- and object-based databases.
• Contextual semantic heterogeneity: Heterogeneity based on different meanings of concepts and schemes.

The generic semantic heterogeneity is based on the different concepts of space or data models being used. In this paper, we will focus on contextual semantic heterogeneity, which is based on different semantics of the local schemata.

In order to discover semantic heterogeneities, a formal representation is needed. Lately, WWW standardized markup languages such as XML and RDF have been developed by the W3C community for this purpose (W3C, 1998), (W3C, 1999). We will describe the value of these languages for the semantic description of concepts and also argue that we need more sophisticated approaches to overcome the semantic heterogeneity problem.

Ontologies have been identified as useful for the integration/interoperation process (Visser et al., 2000). The advantages and disadvantages of this technology will be discussed in a separate subsection.

Ontologies can be used to describe information sources. However, how does the actual integration process work? This will be briefly discussed in the following subsections. We call this process semantic mapping.

XML/RDF and semantic modeling

XML and RDF have been developed for the semantic description of information sources.

XML – Exchanging Information: In order to overcome the purely visualization-oriented annotation provided e. g. by HTML, XML was proposed as an extensible language allowing the user to define his own tags in order to indicate the type of its content. It follows that the main benefit of XML actually lies in the opportunity to exchange data in a structured way. Recently, this idea has been emphasized by introducing XML schemata that could be seen as a definition language for data structures. In the following paragraphs we sketch the idea behind XML and describe XML schema definitions and their potential use for data exchange.

The General Idea: A data object is said to be an XML document if it follows the guidelines for well-formed XML documents provided by the W3C community. The specification provides a formal grammar used in well-formed documents. In addition to the general grammar, the user can impose further grammatical constraints on the structure of a document using a document type definition (DTD). An XML document is valid if it has an associated type definition and complies with the grammatical constraints of that definition. A DTD specifies elements that can be used in an XML document. In the document, the elements are delimited by a start and an end tag. An element has a type and may have a set of attribute specifications consisting of a name and a value.

The additional constraints in a DTD refer to the logical structure of the document. This especially includes the nesting of tags inside the information body that is allowed and/or required. Further restrictions that can be expressed in a DTD concern the type of the attributes and default values to be used when no attribute value is provided.

Schema Definitions and Mappings: An XML schema is itself an XML document defining the valid structure of an XML document in the spirit of a DTD. The elements used in a schema definition are of the type 'element' and have attributes that define the restrictions already mentioned above. The information in such an element is a list of further element definitions that have to be nested inside the defined element. Furthermore, XML schemata have some additional features that are very useful for defining data structures, such as:

• Support for basic data types.
• Constraints on attributes such as occurrence constraints.
• Sophisticated structures such as type definitions derived by extending or restricting other types.
• A name-space mechanism allowing the combination of different schemata.

We will not discuss these features at length. However, it should be mentioned that the additional features make it possible to encode rather complex data structures. This enables us to map the data-models of applications whose information we want to share with others onto an XML schema. From this point, we can encode our information in terms of an XML document and make it (together with the schema, which is also an XML document) available over the internet.

This procedure has great potential for the actual exchange of data. However, the user must commit to our data-model in order to make use of the information. We must point out that an XML schema defines the structure of data while providing no information about the content or the potential use for others. Therefore, it lacks an important advantage of meta-information.

We argued that XML is designed to provide an interchange format for weakly structured data by defining the underlying data-model in a schema and by using annotations from the schema in order to clarify the role of single statements. Two things are important in this claim from the information sharing point of view:

• XML is purely syntactic/structural in nature.
• XML describes data on the object level.
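The two observations above can be illustrated with a minimal sketch. It assumes two hypothetical sources that describe the same land parcel with different vocabularies; the documents and every tag and attribute name in them are invented for illustration and do not come from the paper. The sketch uses Python's standard xml.etree.ElementTree parser, which checks well-formedness only and does not validate against a DTD or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents from two sources; all tag and attribute names are invented.
SOURCE_A = '<parcel id="17"><landuse>forest</landuse><area unit="ha">12.5</area></parcel>'
SOURCE_B = '<plot nr="17"><cover>woodland</cover><size unit="m2">125000</size></plot>'


def is_well_formed(xml_text):
    """Syntactic check only: does the text parse as XML?"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def structure(xml_text):
    """Structural view: the root tag and the tags of its direct children."""
    root = ET.fromstring(xml_text)
    return root.tag, [child.tag for child in root]


struct_a = structure(SOURCE_A)
struct_b = structure(SOURCE_B)

# Both documents are well-formed, yet their structures share no tag name.
# Nothing in the XML itself says that 'landuse' and 'cover' denote the same concept.
shared_tags = set(struct_a[1]) & set(struct_b[1])

print(is_well_formed(SOURCE_A), is_well_formed(SOURCE_B))
print(struct_a)
print(struct_b)
print(shared_tags)
```

A parser can confirm syntax and expose structure for each document, but relating the two vocabularies requires knowledge that lies outside the documents.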
Consequently, we have to find other approaches if we want to describe information on the meta level and define its meaning. In order to fill this gap, the RDF standard has been proposed as a data model for representing meta-data about web pages and their content using an XML syntax.

RDF – A Standard Format: The basic model underlying RDF is very simple: every kind of information about a resource, which may be a web page or an XML element, is expressed in terms of a triple (resource, property, value).

Here, the property is a two-placed relation that connects a resource to a certain value of that property. This value can be a simple data-type or a resource. Additionally, the value can be replaced by a variable representing a resource that is further described by nested triples making assertions about the properties of the resource that is represented by the variable. Furthermore, RDF allows multiple values for a single property. For this purpose, the model contains three built-in data types called collections, namely unordered lists (bag), ordered lists (seq), and sets of alternatives (alt), providing some kind of aggregation mechanism.

A further requirement arising from the nature of the web is the need to avoid name-clashes that might occur when referring to different web-sites that use different RDF-models to annotate meta-data. RDF defines name-spaces for this purpose. Name-spaces are defined by referring to a URL that provides the names and connecting it to a source id that is then used to annotate each name in an RDF specification, defining the origin of that particular name: source id:name

A standard syntax has been developed to express RDF-statements, making it possible to identify the statements as meta-data, thereby providing a low-level language for expressing the intended meaning of information in a machine-processable way.

RDF/S – A Basic Vocabulary: The very simple model underlying ordinary RDF-descriptions leaves a lot of freedom for describing meta-data in arbitrary ways. However, if people want to share this information, there has to be an agreement on a standard core of vocabulary in terms of modeling primitives that should be used to describe meta-data. RDF schemes (RDF/S) attempt to provide such a standard vocabulary.

Looking closer at the modeling components reveals that RDF/S actually borrows from frame systems well known from the area of knowledge representation. RDF/S provides a notion of concepts (class), slots (property), inheritance (SubclassOf, SubslotOf) and range restrictions (Constraint Property). Unfortunately, no well-defined semantics exist for these modeling primitives in the current state. Further, parts such as the re-identification mechanism are not well defined even on an informal level. Lastly, there is no reasoning support available, not even for property inheritance.

Semantic modeling: After introducing the W3C standards for information exchange and meta-data annotation, we have to investigate their usefulness for information integration with reference to the three layers of integration (see section ). Firstly, we saw that XML is mainly concerned with the issue of syntactic integration. XML defines structures as well, but there are no sophisticated mechanisms for mapping different structures. Secondly, RDF is designed to provide some information on the semantic level by enabling us to include meta-information in the description of a web-page. As mentioned in the last section, RDF in its current state fails to really provide semantic descriptions. Rather, it provides a common syntax and a basic vocabulary that can be used when describing this meta-data. Fortunately, the designers of RDF are aware that there is a strong need for an additional 'logical level' which defines a clear semantics for RDF-expressions and provides a basis for integration mechanisms.

Our conclusion about current web standards is that using XML, and especially XML schemata, is a suitable way of exchanging data with a well-defined syntax and structure. Furthermore, simple RDF provides a uniform syntax for exchanging meta-information in a machine-readable format. However, in their current state neither XML nor RDF provides sufficient support for the integration of heterogeneous structures or different meanings of terms. There is a need for semantic modeling and reasoning about structure and meaning. Promising candidates for semantic modeling approaches can be found in the area of knowledge representation as well as in the distributed databases community. We will discuss some of these approaches in the following section.

Ontologies

Recently, the use of formal ontologies to support information systems has been discussed (Guarino, 1998), (Bishr and Kuhn, 1999). The term 'Ontology' has been used in many ways and across different communities (Guarino and Giaretta, 1995). If we want to motivate the use of ontologies for information integration, we have to define what we mean when we refer to ontologies. In the following sections, we will introduce ontologies as an explication of some shared vocabulary or conceptualization of a specific subject matter. Further, we describe the way an ontology explicates concepts and their properties and finally argue for the benefit of this explication in many typical application scenarios.

Shared Vocabularies and Conceptualizations: In general, each person has an individual view on the world and the things he/she has to deal with every day. However, there is a common basis of understanding in terms of the language we use to communicate with each other. Terms from natural language can therefore be assumed to be a shared vocabulary relying on a (mostly) common understanding of certain concepts
with very little variety. This common understanding relies on a specific idea of how the world is organized. We often call these ideas a conceptualization of the world. These conceptualizations provide a terminology that can be used for communication between people.

The example of our natural language demonstrates that a conceptualization cannot be universally valid; rather, only a limited number of persons are committed to a particular conceptualization. This fact is reflected in the existence of different languages, which differ more (English and Japanese) or less (German and Dutch). Confusion can become worse when we consider terminologies developed for special scientific or economic areas. In these cases, we often find situations where one term refers to different phenomena. The use of the term 'ontology' in philosophy and in computer science serves as an example. The consequence of this confusion is a separation into different groups that share a terminology and its conceptualization. These groups are then called information communities.

The main problem with the use of a shared terminology according to a specific conceptualization of the world is that much information remains implicit. When a mathematician talks about a binomial coefficient, he is referring to a wider scope than just the formula itself. Possibly, he will also consider its interpretation (the number of subsets of a certain size) and its potential uses (e. g. estimating the chance of winning in a lottery).

Ontologies set out to overcome this problem of implicit and hidden knowledge by making the conceptualization of a domain (e. g. mathematics) explicit. This corresponds to one of the definitions of the term ontology most popular in computer science (Gruber, 1993):

An ontology is an explicit specification of a conceptualization.

An ontology is used to make assumptions about the meaning of a term available. It can also be viewed as an explication of the context in which a term is normally used. Lenat (Lenat, 1998), for example, describes context in terms of twelve independent dimensions that have to be known in order to understand a piece of knowledge completely. He also demonstrates how these dimensions can be explicated using the 'Cyc' ontology.

Specification of Context Knowledge: There are many different ways in which an ontology may explicate a conceptualization and the corresponding context knowledge. The possibilities range from a purely informal natural language description of a term, corresponding to a glossary, up to a strictly formal approach with the expressive power of full first order predicate logic or even beyond (e. g. Ontolingua (Gruber, 1991)). Jasper and Uschold (Jasper and Uschold, 1999) distinguish two ways in which the mechanisms for the specification of context knowledge by an ontology can be compared:

Level of Formality:
The specification of a conceptualization and its implicit context knowledge can be done at different levels of formality. As already mentioned above, a glossary of terms can also be seen as an ontology, despite its purely informal character. A first step to gain more formality is to describe a structure to be used for the description. A good example of this approach is the standard web annotation language XML (see section ). The DTD is an ontology describing the terminology of a web page on a low level of formality. Unfortunately, the rather informal character of XML encourages its misuse. While the hierarchy of an XML specification was originally designed to describe a layout, it can also be exploited to represent sub-type hierarchies (van Harmelen and Fensel, 1999), which may lead to confusion. Fortunately, this problem can be solved by assigning formal semantics to the structures used for the description of the ontology. An example of this is the conceptual modeling language CML (Schreiber et al., 1994). CML offers primitives that describe a domain, which can be given a formal semantics in terms of first order logic (Aben, 1993). However, a formalization is only available for the structural part of a specification. Assertions about terms and the description of dynamic knowledge are not formalized, which offers total freedom for a description. On the other hand, there are specification languages which are completely formal. A prominent example is the Knowledge Interchange Format (KIF) (Genesereth and Fikes, 1992), which was designed to enable different knowledge-based systems to exchange knowledge. KIF has been used as a basis for the Ontolingua language (Gruber, 1991), which supplies formal semantics to that language as well.

Extent of Explication:
The other comparison criterion is the extent of explication that is reached by the ontology. This criterion is strongly connected with the expressive power of the specification language used. We already mentioned DTDs, which are mainly a simple hierarchy of terms. We can generalize this by saying that the least expressive specification of an ontology consists of an organization of terms in a network using two-placed relations. This idea goes back to the use of semantic networks in the seventies. Many extensions of this basic idea have been proposed. One of the most influential ones was the use of roles that could be filled by entities of a certain type (Brachman, 1977). This kind of value restriction can still be found in recent approaches. RDF schema descriptions (Brickley and Guha, 2000), which might become a new standard for the semantic description of web-pages, are an example of this. An RDF schema contains class definitions with associated properties that can be restricted by so-called constraint-properties. However, default values and value range descriptions are not expressive enough
to cover all possible conceptualizations. More expressive power can be provided by allowing classes to be specified by logical formulas. These formulas can be restricted to a decidable subset of first order logic. This is the approach of description logics (Borgida and Patel-Schneider, 1994). Nevertheless, there are also approaches that allow for even more expressive descriptions. In Ontolingua, for example, classes can be defined by arbitrary KIF-expressions. Beyond the expressiveness of full first-order predicate logic, there are also special purpose languages that have an extended expressiveness to cover specific needs of their application area. Examples are specification languages for knowledge-based systems, which often include variants of dynamic logic to describe system dynamics.

Applications: Ontologies are useful for many different applications that can be classified into several areas. Each of these areas has different requirements on the level of formality and the extent of explication provided by the ontology. We will briefly review common application areas, namely the support of communication processes, the specification of systems and information entities, and the interoperability of computer systems.

Information communities are useful because they ease communication and cooperation among members through the use of shared terminology with well-defined meaning. On the other hand, the formalization of information communities makes communication between members from different information communities very difficult, generally because they do not agree on a common conceptualization. Although they may use the shared vocabulary of natural language, most of the vocabulary used in their information communities is highly specialized and not shared with other communities. This situation demands an explication and explanation of the use of terminology. Informal ontologies with a large extent of explication are a good choice to overcome these problems. While definitions have always played an important role in scientific literature, conceptual models of certain domains are rather new. Nowadays systems analysis and related fields like software engineering rely on conceptual modeling to communicate structure and details of a problem domain, as well as the proposed solution, between domain experts and engineers. Prominent examples of ontologies used for communication are Entity-Relationship diagrams and object-oriented modeling languages such as UML.

ER-diagrams as well as UML are not only used for communication; they also serve as building plans for data and systems, guiding the process of building (engineering) the system. The use of ontologies for the description of information and systems has many benefits. The ontology can be used to identify requirements as well as inconsistencies in a chosen design. Further, it can help to acquire or search for available information. Once a system component has been implemented, its specification can be used for maintenance and extension purposes. Another very challenging application of ontology-based specification is the reuse of existing software. In this case, the specifying ontology serves as a basis to decide if an existing component matches the requirements of a given task.

Depending on the purpose of the specification, ontologies of different formal strength and expressiveness are to be utilized. While the process of communicating design decisions and the acquisition of additional information normally benefit from rather informal and expressive ontology representations (often graphical), the directed search for information needs a rather strict specification with a limited vocabulary to limit the computational effort. At the moment, the support of semi-automatic software reuse seems to be one of the most challenging applications of ontologies, because it requires expressive ontologies with a high level of formal strength.

The previously discussed considerations might give the impression that the benefits of ontologies are limited to systems analysis and design. However, an important application area of ontologies is the integration of existing systems. The ability to exchange information at run time, also known as interoperability, is a valid and important topic. The attempt to provide interoperability suffers from problems similar to those associated with the communication amongst different information communities. The important difference is that the actors are not people able to perform abstraction and common sense reasoning about the meaning of terms, but machines. In order to enable machines to understand each other, we also have to explicate the context of each system on a much higher level of formality. Ontologies are often used as inter-linguas in order to provide interoperability: they serve as a common format for data interchange. Each system that wants to inter-operate with other systems has to transfer its data information into this common framework. Interoperability is achieved by explicitly considering contextual knowledge in the translation process.

Semantic Mapper

For an appropriate support of an integration of heterogeneous information sources, an explicit description of the semantics (i. e. an ontology) of each source is required. In principle, there are three ways in which ontologies can be applied:

• a centralized approach, where each source is related to one common domain ontology,
• a decentralized approach, where every source is related to its own ontology, or
• a hybrid approach, where every source is related to its own ontology but the vocabulary of these ontologies stems from a common domain ontology.

A common domain ontology describes the semantics of the domain in the SIMS mediator (Arens et al., 1996). In the global domain model of these approaches all terms of a domain are arranged in a complex structure. Each information source is related to the terms
of the global ontology (e.g. with articulation axioms (Collet et al., 1991)). However, the scalability of such a fixed and static common domain model is low (Mitra et al., 1999), because the kind of information sources which can be integrated in the future is limited.

In OBSERVER (Mena et al., 1996) and SKC (Mitra et al., 1999) it is assumed that a predefined ontology for each information source exists. Consequently, new information sources can easily be added and removed. But the comparison of the heterogeneous ontologies leads to many homonym, synonym, etc. problems, because the ontologies use their own vocabulary. In SKC (Mitra et al., 1999) the ontology of each source is described by graphs. Graph transformation rules are used to transport information from one ontology into another ontology (Mitra et al., 2000). These rules can only solve the schematic heterogeneities between the ontologies.

In MESA (Wache et al., 1999) the third, hybrid approach is used. Each source is related to its source ontology. In order to make the source ontologies comparable, a common global vocabulary is used, organized in a common domain ontology. This hybrid approach provides the greatest flexibility because new sources can easily be integrated and, in contrast to the decentralized approach, the source ontologies remain comparable.

In the next section we will describe how ontologies can help to solve heterogeneity problems.

BUSTER - An Approach for Comprehensive Interoperability

In chapter 2 we described the methods needed to achieve structural, syntactic, and semantic interoperability. In this chapter, we propose the Buster approach (Bremen University Semantic Translator for Enhanced Retrieval), which provides a comprehensive solution to reconcile all heterogeneity problems.

During an acquisition phase all desired information for providing a network of integrated information sources is acquired. This includes the acquisition of a Comprehensive Source Description (CSD) of each source together with the Integration Knowledge (IK), which describes how the information can be transformed from one source to another.

In the query phase, a user or an application (e.g. a GIS) formulates a query to an integrated view of sources. Several specialized components in the query phase use the acquired information, i.e. the CSDs and IKs, to select the desired data from several information sources and to transform it to the structure and the context of the query.

All software components in both phases are associated with three levels: the syntactic, the structural and the semantic level. The components on each level deal with the corresponding heterogeneity problems. The components in the query phase are responsible for solving the corresponding heterogeneity problems, whereas the components in the acquisition phase use the CSDs of the sources to provide the specific knowledge for the corresponding component in the query phase. A mediator, for example, which is associated with the structural level, is responsible for the reconciliation of the structural heterogeneity problems. The mediator is configured by a set of rules that describe the structural transformation of data from one source to another. The rules are acquired in the acquisition phase with the help of the rule generator.

An important characteristic of the Buster architecture is the semantic level, where two different types of tools exist for solving the semantic heterogeneity problems. This reflects the focus of the Buster system: providing a solution for this type of problem. Furthermore, the need for two types of tools shows that the reconciliation of semantic problems is very difficult and must be supported by a hybrid architecture where different components are combined.

In the following sections we describe the two phases and the components in detail.

Query Phase

In the query phase a user submits a query request to one or more data sources in the network of integrated data sources. In this query phase several components of different levels interact (see Fig. 1).

On the syntactic level, wrappers are used to establish a communication channel to the data source(s) that is independent of specific file formats and system implementations. Each generic wrapper covers a specific file or data format. For example, generic wrappers may exist for ODBC data sources, XML data files, or specific GIS formats. Still, these generic wrappers have to be configured for the specific requirements of a data source.

The mediator on the structural level uses information obtained from the wrappers and "combines, integrates and abstracts" (Wiederhold, 1992) them. In the Buster approach, we use generic mediators which are configured by transformation rules (query definition rules, QDR). These rules describe in a declarative style how the data from several sources can be integrated and transformed to the data structure of the original source.

On the semantic level, we use two different tools specialized for solving the semantic heterogeneity problems. Both tools are responsible for the context transformation, i.e. transforming data from a source-context to a goal-context. There are several ways in which the context transformation can be applied. In Buster we consider the functional context transformation and context transformation by re-classification (Stuckenschmidt and Wache, 2000).

In the functional context transformation, the conversion of data is done by application of predefined functions. A function is declaratively represented in Context Transformation Rules (CTRs). These describe from which source-context to which goal-context data can be transformed by the application of which function. The context transformation rules are invoked by the CTR-Engine. The functional context transformation can be
used, for example, in the transformation of area measures in hectares to area measures in acres, or the transformation of one coordinate system into another. All context transformation rules can be described with the help of mathematical functions.

Figure 1: The query phase of the BUSTER architecture

Further to the functional context transformation, Buster also allows the classification of data into another context. This is utilized to automatically map the concepts of one data source to concepts of another data source. To be more precise, the context description (i.e. the ontological description of the data) is re-classified. The source-context description, to which the data is annotated, is obtained from the CSD, completed with the data information, and related to goal-context descriptions. After the context re-classification the data is simply replaced with the data which is annotated with the related goal-context. Context re-classification together with the data replacement is useful for the transformation of catalog terms, e.g. exchanging the term of a source catalog for a term from the goal catalog.

A Query Example We demonstrate the query phase and the interaction of the components by a real world example. The scenario presents a typical user, for example an environmental engineer in a public administration, who is involved in some kind of urban planning process. The basis for his work is a set of digital maps and a GIS to view, evaluate, and manipulate these maps.

In our example, the engineer uses a set of ATKIS maps in an ArcView (ESRI, 1994) environment. ATKIS stands for "Amtliches Topographisch-Kartographisches Informationssystem", i.e. the official German information system related to maps and topographical information (AdV, 1998). Among others, the ATKIS data source offers detailed information with respect to land-use types in urban and rural areas of Germany.

The ATKIS data sets are generated and maintained by a working group of several public agencies on a federal and state level. The complexity of the task of keeping all data sets up-to-date, together with the underlying administrative structure, causes a certain delay in the production and delivery of new updated maps. Consequently, the engineer in our application example is likely to work with ATKIS maps that are not quite up-to-date but show discrepancies with respect to features observable in reality. The engineer needs tools to compare his potentially inconsistent base data with more recent representations of reality, in order to define potential problem areas.

In our example the CORINE land cover (EEA, 1999) data base provides satellite images. From 1985 to 1990, the European Commission carried out the CORINE Programme (Co-ordination of Information on the Environment). The results are essentially of three types, which correspond to the three aims of the Programme: (a) An information system on the state of the environment in the European Community has been created (the CORINE system). It is composed of a series of data bases describing the environment in the European Community, as well as the data bases with background information. (b) Nomenclatures and methodologies were developed for carrying out the programme, which are now used as the reference in the areas concerned at the community level. (c) A systematic effort was made to concert activities with all the bodies involved in the production of environmental information, especially at international level. As a result of this activity, and indeed of the whole programme, several groups of international scientists have been working together towards agreed targets. They now share a pool of expertise on various themes of environmental information.

The technologies of syntactic, structural, and semantic integration described in section can be applied to facilitate this task.

The following is a step-by-step example of how a typical user interaction with the system in the query phase could look:

1. The user starts the query from within his native GIS tool (here: ATKIS maps in ArcView). He defines the parameters of the query, such as the properties and formats of the originating system, the specified area of interest (bounding rectangle, coordinate system etc.), and information about the requested attribute data (here: "land use"). Then he submits the query to the network of integrated data sources.

2. The query is matched against the central network database, and a decision is made about which of the participating data sources a.) cover the area of interest and b.) hold information on the attribute "land use". A list of all compatible data sources is created and sent back to the user.
From this list, the user selects one or more data sources and re-submits the query to the system. In our example, the engineer selects a set of CORINE land-cover satellite images.

3. The system consults the central database and retrieves basic information needed to access the selected data source(s). This includes information about technical, syntactical, and structural details as well as rules needed for the access and exchange of data from these sources.

4. The information is used to select and configure suitable wrappers from a repository of generic wrappers. Once the wrappers are properly installed, a suitable mediator is selected from a repository of generic mediators. Among others, the mediator rules describe the fields that hold the requested information (here: the fields holding land-use information).
With the help of wrappers and mediators, a direct connection to the selected data source(s) can be established, and individual instances of data can be accessed.

5. For the context transformation from the source context into the query context the mediator queries the CTR-Engine. For example, the CTR-Engine transforms the area measure in hectares to area measures in acres.
If the CTR-Engine cannot transform the context, because no appropriate CTRs exist, it queries the re-classifier for a context mapping. In our example, it is used to re-classify the CORINE land-use attributes of all polygons in the selected area of interest to make them consistent with the ATKIS classification scheme.
If no context transformation can be performed, the mediator rejects the data.

6. The result of the whole process is a new map for the selected area that shows CORINE data re-classified to the ATKIS framework. The engineer in our example can overlay the original ATKIS set of maps with the new map. He can then apply regular GIS tools to make immediate decisions about which areas of the ATKIS maps are inconsistent with the CORINE satellite images and consequently need to be updated.

Data Acquisition Phase

Before the first query can be submitted, the knowledge, in fact the Comprehensive Source Description (CSD) and Integration Knowledge (IK), has to be acquired. The first step of the data acquisition phase consists of gathering information about the data source that is to be integrated (Fig. 2). This information is stored in a source-specific data base, the Comprehensive Source Descriptor (CSD). A CSD has to be created for each data source that participates in a network of integrated data sources.

Figure 2: The data acquisition phase of the BUSTER Architecture

The Comprehensive Source Description Each CSD consists of meta data that describe technical and administrative details of the data source as well as its structural and syntactic schema and annotations. In addition, the CSD comprises a source ontology, i.e. a detailed and computer-readable description of the concepts stored in the data source. The CSD is attached to the respective data source. It should be available in a highly interchangeable format (for example XML) that allows easy data exchange over computer networks.

Setting up a CSD is the task of the domain specialist responsible for the creation and maintenance of the specific data source. With the help of specialized tools that use repositories of pre-existing general ontologies and terminologies, the tedious task of setting up a CSD can be supported. These tools examine existing CSDs of other but similar sources and generate hypotheses for similar parts of the new CSDs. The domain specialist must verify, and possibly modify, the hypotheses and add them to the CSD of the new source. With these acquisition tools the creation of new CSDs can be simplified (Wache et al., 1999).

The Integration Knowledge In a second step of the data acquisition phase, the data source is added to the network of integrated data sources. In order for the new data source to be able to exchange data with the other data sources in the network, Integration Knowledge (IK) must be acquired. The IK is stored in a centralized database that is part of the network of integrated data sources.

The IK consists of several separate parts which provide specific knowledge for the components in the query phase. For example, the rule generator exam-
ines several CSDs and creates rules for the mediator (Wache et al., 1999). The wrapper configurator uses the information about the sources in order to adapt generic wrappers to the heterogeneous sources.

Creating the IK is the task of the person responsible for operating and maintaining the network of integrated data sources. Due to the complexity of the IK needed for the integration of multiple heterogeneous data sources and the unavoidable semantic ambiguities, it may not be possible to accomplish this task automatically. However, the acquisition of the IK can be supported by semi-automatic tools. In general, such acquisition tools use the information stored in the CSDs to pre-define parts of the IK and propose them to the human operator, who makes the final decision about whether to accept, edit, or reject them.

Summary

In order to make GIS interoperable, several problems have to be solved. We argued that these problems can be divided into three levels of integration: the syntactic, structural, and semantic level. In our opinion it is crucial to note that the problem of interoperable GIS can only be solved if solutions (modules) on all three levels of integration are working together. We believe that it is not possible to solve the heterogeneity problems separately.

The Buster approach uses different components for different tasks on different levels and provides a conceptual solution for these problems. The components can be any existing systems. We use wrappers for the syntactic level, mediators for the structural level, and both context transformation rule engines (CTR-Engines) and classifiers (mappers) for the semantic level. CORBA is used as low-level middleware for the communication of the components.

At the moment, a few wrappers are available (e.g. ODBC and XML wrappers); a wrapper for shape files will be available soon. We are currently developing a mediator and the CTR-Engine and use FaCT (Fast Classification of Terminologies) (Horrocks, 1999) as a reasoner for our prototype system. Buster is a first attempt to solve the heterogeneity problems mentioned in this paper; however, a lot of work has to be done in various areas.

References

[Aben, 1993] Aben, M. (1993). Formally specifying reusable knowledge model components. Knowledge Acquisition Journal, 5:119–141.

[AdV, 1998] AdV (1998). Amtliches Topographisch-Kartographisches Informationssystem ATKIS. Landesvermessungsamt NRW, Bonn.

[Arens et al., 1996] Arens, Y., Hsu, C.-N., and Knoblock, C. A. (1996). Query processing in the sims information mediator. In Advanced Planning Technology, California, USA. AAAI Press.

[Bergamashi et al., 1999] Bergamashi, Castano, Vincini, and Beneventano (1999). Intelligent techniques for the extraction and integration of heterogeneous information. In Workshop Intelligent Information Integration, IJCAI 99, Stockholm, Sweden.

[Bishr and Kuhn, 1999] Bishr, Y. and Kuhn, W. (1999). The Role of Ontology in Modelling Geospatial Features, volume 5 of IFGI prints. Institut für Geoinformatik, Universität Münster, Münster.

[Borgida and Patel-Schneider, 1994] Borgida, A. and Patel-Schneider, P. (1994). A semantics and complete algorithm for subsumption in the classic description logic. JAIR, 1:277–308.

[Brachman, 1977] Brachman, R. (1977). What's in a concept: Structural foundations for semantic nets. International Journal of Man-Machine Studies, 9:127–152.

[Brickley and Guha, 2000] Brickley, D. and Guha, R. (2000). Resource description framework (rdf) schema specification 1.0. Technical Report PR-rdf-schema, W3C. http://www.w3.org/TR/2000/CR-rdf-schema-20000327/.

[Chawathe et al., 1994] Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J., and Widom, J. (1994). The TSIMMIS Project: Integration of Heterogeneous Information Sources. In Proceedings of IPSJ Conference, pages 7–18.

[Collet et al., 1991] Collet, C., Huhns, M. N., and Shen, W.-M. (1991). Resource integration using a large knowledge base in carnot. IEEE Computer, 24(12):55–62.

[EEA, 1999] EEA (1997-1999). Corine land cover. Technical guide, European Environmental Agency, ETC/LC, European Topic Centre on Land Cover.

[ESRI, 1994] ESRI (1994). Introducing ArcView. Environmental Systems Research Institute (ESRI), Redlands, CA, USA.

[Galhardas et al., 1998] Galhardas, H., Simon, E., and Tomasic, A. (1998). A framework for classifying environmental metadata. In AAAI, Workshop on AI and Information Integration, Madison, WI.

[Genesereth and Fikes, 1992] Genesereth, M. and Fikes, R. (1992). Knowledge interchange format version 3.0 reference manual. Report of the Knowledge Systems Laboratory KSL 91-1, Stanford University.

[Gruber, 1991] Gruber, T. (1991). Ontolingua: A mechanism to support portable ontologies. KSL Report KSL-91-66, Stanford University.

[Gruber, 1993] Gruber, T. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2).

[Guarino, 1998] Guarino, N. (1998). Formal ontology and information systems. In Guarino, N., editor, FOIS 98, Trento, Italy. IOS Press.
[Guarino and Giaretta, 1995] Guarino, N. and Giaretta, P. (1995). Ontologies and knowledge bases: Towards a terminological clarification. In Mars, N., editor, Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing, pages 25–32. Amsterdam.

[Horrocks, 1999] Horrocks, I. (1999). FaCT and iFaCT. In (Lambrix et al., 1999), pages 133–135.

[Jasper and Uschold, 1999] Jasper, R. and Uschold, M. (1999). A framework for understanding and classifying ontology applications. In Proceedings of the 12th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop. University of Calgary/Stanford University.

[Kashyap and Sheth, 1997] Kashyap, V. and Sheth, A. (1997). Cooperative Information Systems: Current Trends and Directions, chapter Semantic Heterogeneity in Global Information Systems: The role of Metadata, Context and Ontologies. Academic Press.

[Kim et al., 1995] Kim, W., Choi, I., Gala, S., and Scheevel, M. (1995). Modern Database: The Object Model, Interoperability, and Beyond, chapter On Resolving Schematic Heterogeneity in Multidatabase Systems, pages 521–550. ACM Press / Addison-Wesley Publishing Company.

[Kim and Seo, 1991] Kim, W. and Seo, J. (1991). Classifying schematic and data heterogeneity in multidatabase systems. IEEE Computer, 24(12):12–18.

[Lambrix et al., 1999] Lambrix, P., Borgida, A., Lenzerini, M., Möller, R., and Patel-Schneider, P., editors (1999). Proceedings of the International Workshop on Description Logics (DL'99).

[Landgraf, 1999] Landgraf, G. (1999). Evolution of eo/gis interoperability; towards an integrated application infrastructure. In Vckovski, A., editor, Interop99, volume 1580 of Lecture Notes in Computer Science, Zürich, Switzerland. Springer.

[Lenat, 1998] Lenat, D. (1998). The dimensions of context space. Available on the web-site of the Cycorp Corporation. (http://www.cyc.com/publications).

[Maguire et al., 1991] Maguire, D. J., Goodchild, M. F., and Rhind, D. W., editors (1991). Geographical Information Systems: Principles and applications. Longman, London, UK.

[Mena et al., 1996] Mena, E., Kashyap, V., Illarramendi, A., and Sheth, A. (1996). Managing multiple information sources through ontologies: Relationship between vocabulary heterogeneity and loss of information. In Baader, F., Buchheit, M., Jeusfeld, M. A., and Nutt, W., editors, Proceedings of the 3rd Workshop Knowledge Representation Meets Databases (KRDB '96).

[Mitra et al., 1999] Mitra, P., Wiederhold, G., and Jannink, J. (1999). Semi-automatic integration of knowledge sources. In Fusion '99, Sunnyvale CA.

[Mitra et al., 2000] Mitra, P., Wiederhold, G., and Kersten, M. (2000). A graph-oriented model for articulation of ontology interdependencies. In Proc. Extending DataBase Technologies, EDBT 2000, volume Lecture Notes on Computer Science, Konstanz, Germany. Springer Verlag.

[Naiman and Ouksel, 1995] Naiman, C. F. and Ouksel, A. M. (1995). A classification of semantic conflicts in heterogeneous database systems. Journal of Organizational Computing, pages 167–193.

[OMG, 1992] OMG (1992). The common object request broker: Architecture and specification. OMG Document 91.12.1, The Object Management Group. Revision 1.1.92.

[Papakonstantinou et al., 1996] Papakonstantinou, Y., Garcia-Molina, H., and Ullman, J. (1996). Medmaker: A mediation system based on declarative specifications. In International Conference on Data Engineering, pages 132–141, New Orleans.

[Schreiber et al., 1994] Schreiber, A., Wielinga, B., Akkermans, H., Velde, W., and Anjewierden, A. (1994). Cml the commonkads conceptual modeling language. In et al., S., editor, A Future of Knowledge Acquisition, Proc. 8th European Knowledge Acquisition Workshop (EKAW 94), number 867 in Lecture Notes in Artificial Intelligence. Springer.

[Shepherd, 1991] Shepherd, I. D. H. (1991). Information integration in gis. In Maguire, D. J., Goodchild, M. F., and Rhind, D. W., editors, Geographical Information Systems: Principles and applications. Longman, London, UK.

[Stuckenschmidt and Wache, 2000] Stuckenschmidt, H. and Wache, H. (2000). Context modelling and transformation for semantic interoperability. In Knowledge Representation Meets Databases (KRDB 2000). To appear.

[van Harmelen and Fensel, 1999] van Harmelen, F. and Fensel, D. (1999). Practical knowledge representation for the web. In Fensel, D., editor, Proceedings of the IJCAI'99 Workshop on Intelligent Information Integration.

[Vckovski, 1998] Vckovski, A. (1998). Interoperable and Distributed Processing in GIS. Taylor & Francis, London.

[Vckovski et al., 1999] Vckovski, A., Brassel, K., and Schek, H.-J., editors (1999). Proceedings of the 2nd International Conference on Interoperating Geographic Information Systems, volume 1580 of Lecture Notes in Computer Science, Zürich. Springer.

[Visser et al., 2000] Visser, U., Stuckenschmidt, H., Schuster, G., and Vögele, T. (2000). Ontologies for geographic information processing. Computers & Geosciences. Submitted.

[W3C, 1998] W3C (1998). Extensible markup language (xml) 1.0. W3C Recommendation.

[W3C, 1999] W3C (1999). Resource description framework (rdf) schema specification. W3C Proposed Recommendation.
[Wache et al., 1999] Wache, H., Scholz, T., Stieghahn, H., and König-Ries, B. (1999). An integration method for the specification of rule-oriented mediators. In Kambayashi, Y. and Takakura, H., editors, Proceedings of the International Symposium on Database Applications in Non-Traditional Environments (DANTE'99), pages 109–112, Kyoto, Japan.

[Wiederhold, 1992] Wiederhold, G. (1992). Mediators in the architecture of future information systems. IEEE Computer, 25(3):38–49. Standard reference for mediators.

[Wiederhold, 1999] Wiederhold, G. (1999). Mediation to deal with heterogeneous data sources. In Vckovski, A., editor, Interop99, volume 1580 of Lecture Notes in Computer Science, Zürich, Switzerland. Springer.

[Wiener et al., 1996] Wiener, J., Gupta, H., Labio, W., Zhuge, Y., Garcia-Molina, H., and Widom, J. (1996). Whips: A system prototype for warehouse view maintenance. In Workshop on materialized views, pages 26–33, Montreal, Canada.

[Worboys and Deen, 1991] Worboys, M. F. and Deen, S. M. (1991). Semantic heterogeneity in distributed geographical databases. SIGMOD Record, 20(4).
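
Illustrative sketch. Step 5 of the query example describes a fallback chain on the semantic level: the mediator first asks the CTR-Engine for a functional context transformation, then asks the re-classifier for a context mapping, and rejects the data if neither succeeds. The Python sketch below shows one way this control flow might look. It is not the Buster implementation. All class, function, and context names are hypothetical, the two land-use terms are invented placeholders rather than real CORINE or ATKIS classes, and the re-classifier is replaced by a plain lookup table, whereas the prototype described in the paper relies on a description logic reasoner (FaCT).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple, Union

Context = str  # assumption: a context is identified by a plain label
Value = Union[float, str]


@dataclass(frozen=True)
class Datum:
    """A data value annotated with the context it is expressed in."""
    value: Value
    context: Context


class CTREngine:
    """Functional context transformation: (source, goal) -> function."""

    def __init__(self) -> None:
        self._rules: Dict[Tuple[Context, Context], Callable[[float], float]] = {}

    def add_rule(self, source: Context, goal: Context,
                 fn: Callable[[float], float]) -> None:
        self._rules[(source, goal)] = fn

    def transform(self, datum: Datum, goal: Context) -> Optional[Datum]:
        fn = self._rules.get((datum.context, goal))
        if fn is None:
            return None  # no appropriate rule exists for this context pair
        return Datum(fn(datum.value), goal)


class ReClassifier:
    """Stand-in for re-classification: a fixed term-to-term lookup table."""

    def __init__(self, table: Dict[Tuple[Context, Context], Dict[str, str]]) -> None:
        self._table = table

    def reclassify(self, datum: Datum, goal: Context) -> Optional[Datum]:
        term = self._table.get((datum.context, goal), {}).get(datum.value)
        return None if term is None else Datum(term, goal)


def to_goal_context(datum: Datum, goal: Context, ctr: CTREngine,
                    reclassifier: ReClassifier) -> Optional[Datum]:
    """Try the CTR-Engine first, then the re-classifier.

    A result of None means no transformation was possible, in which
    case the mediator would reject the datum.
    """
    result = ctr.transform(datum, goal)
    if result is None:
        result = reclassifier.reclassify(datum, goal)
    return result


HECTARE_TO_ACRE = 2.47105  # approximate conversion factor

ctr = CTREngine()
ctr.add_rule("area:hectare", "area:acre", lambda ha: ha * HECTARE_TO_ACRE)

# Placeholder terms only; not taken from the real CORINE or ATKIS catalogs.
reclassifier = ReClassifier({
    ("landuse:corine", "landuse:atkis"): {"corine-forest": "atkis-woodland"},
})
```

With this setup, `to_goal_context(Datum(10.0, "area:hectare"), "area:acre", ctr, reclassifier)` is handled by the rule engine, a `"corine-forest"` term falls through to the lookup table, and an unknown term yields `None`. Keeping the two tools behind one function mirrors the hybrid design argued for in the text: cheap functional rules where they exist, classification where they do not.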
