The Semantic Web: Has the DB Community missed the bus?
Vipul Kashyap, National Library of Medicine, NIH
3 April, 2002
There is widespread interest in various research communities in issues related to the Semantic Web.
In this white paper, we try to understand the reasons behind the popularity of the current “Syntactic Web”
and to identify factors that might lead to the success of the “Semantic Web”. The DB and
IS communities, among others, completely missed the bus when it came to the Web, and may well
miss the bus again when it comes to the “Semantic Web”. The research issues in enabling the Semantic
Web are organized using a “layered” semantic networking metaphor, and the various problem components
are identified. Interestingly, we observe that DB research has spanned all the problem components
identified, and that the theme of the Semantic Web can provide a unifying framework for all these
components. We conclude by proposing a set of critical problems that DB researchers are well positioned
to solve and by identifying crucial assumptions that the DB community has to adopt in order to
address (successfully) the research issues related to the Semantic Web.
The Success of the “Syntactic” Web
It is widely acknowledged that the current success of the Web is far beyond what
was ever imagined. It was primarily conceived as a means for physicists at CERN to
share scientific data with each other. However, it has since blossomed into a worldwide
infrastructure for data and knowledge exchange and e-commerce transactions
involving technologies as diverse as databases, user interfaces, hypermedia,
internetworking protocols, distributed object computing, machine learning/data mining,
etc. Why was the web so hugely successful and, more importantly, why did the major
technology areas on which the web critically depends for sustenance fail to anticipate
the sudden emergence of the web? We feel that the answers to these questions are critical to
understanding the feasibility and predicting the success of the Semantic Web vision. Let’s try to
analyze the success of the web along the following major dimensions:
• Technology: An argument can be made that the success of the web was due to the
sophistication of the underlying technology that enabled it. However, internet
protocols such as telnet, ftp and gopher; DBMS servers; distributed object computing
frameworks such as CORBA/RMI; and hypermedia-based systems existed long
before the web came into being. Why is it that none of these component areas was able
to anticipate the web?
• Multimedia: The ability to put up multimedia information probably contributed to
the success of the web, as it is “cognitively easier” for people to browse multimedia
information (a picture is worth a thousand words) than text documents and
database tables. This, we believe, may be a contributory factor in the success of the
web.
• Ease of use: The most important reason for the success of the web is that the
technology is simple enough to attract a critical mass of adherents.
For example, it is very easy to publish a (multimedia) web page and also to “jump”
from one document to another in the web hyperspace. This was due to effective
utilization of hypermedia technology to insulate users from protocol specific
• A Sociological Experiment: The web is more accurately described as a
sociological experiment than as a technological invention. Its ease of use enables
people to use the web infrastructure to exchange information and data with
each other. The introduction of instant messaging services enhances the feel of a
“virtual society” on the web.
It is clear from the above discussion that technology played a relatively minor role in the
success of the web, even though the web critically depends on its various technological
components. We need to think afresh about how the semantic web can make things
easier for users and enhance their web experience. At the same time, the DB
research community needs to re-examine its assumptions, especially in the context of
deployment and usability of the various technologies being developed. We now discuss
the various research issues for enabling the Semantic Web using the multi-layer
Research Issues for the Semantic Web: The multi-layered network
In this section, we borrow a metaphor from arguably the most successful research
community, that of network and internet devices and protocols. The metaphor is that of the
network protocol stack, where one layer of the stack depends on one or more layers
below it. One may view information as increasingly “semantic” as one goes up the layers,
from physical signals, to bits and bytes, to data frames and packets, to objects and
methods, to entities and relationships, to processes. We now try to organize the various
research issues for the Semantic Web according to this metaphor, as illustrated in Figure 1.
Building on the success of the data networking and middleware communities, the
figure tries to organize and relate the semantic web efforts along multiple
layers, some of which are described below:
• Object Interoperability: This is the layer at which current industry middleware
products are aimed. However, these objects are primarily defined as containers
for software and for streamlining the software development process. The CORBA
and EJB object models are examples of standards at this layer.
• Meta-Model Interoperability: This is the layer at which the cross-over from the
“data” space to the “knowledge” space takes place. The objects here are viewed as
containers of knowledge to be fleshed out by upper layers. The OKBC and RDF(S)
core models are examples of standards at this layer.
• Ontology Interoperability: This is the layer where ontologies, schemas and
classifications are built upon common underlying standardized meta-models. The
ability to use different ontologies to specify and query information constitutes
interoperability at this layer.
• Meta-Data (View/Query) Interoperability: Semantic metadata descriptions can be
constructed from one or more underlying ontologies. Issues at this layer would be to
decompose information requests into those supported by the individual semantic
metadata descriptions corresponding to the information sources.
Figure 1: The multi-layered stack metaphor for the Semantic Web
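The decomposition step described for the Meta-Data (View/Query) layer can be sketched in Python. This is only a minimal illustration under stated assumptions, not a method proposed in this paper: the source names, the terms and the `decompose` helper are all hypothetical, and each semantic metadata description is reduced to the set of ontology terms that a source can answer.

```python
from typing import Dict, Set, Tuple

# Hypothetical metadata descriptions: each information source is described
# by the set of ontology terms it can answer (an assumption for illustration).
SOURCE_METADATA: Dict[str, Set[str]] = {
    "library_catalog": {"author", "title", "subject"},
    "citation_index": {"title", "cited_by"},
}


def decompose(
    request: Set[str], metadata: Dict[str, Set[str]]
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Split a request (a set of terms) into per-source sub-requests.

    Returns (sub_requests, unsupported): sub_requests maps a source name to
    the terms assigned to it, and unsupported holds the terms that no
    metadata description covers.
    """
    sub_requests: Dict[str, Set[str]] = {}
    remaining = set(request)
    # Visit sources in sorted order so that the assignment is deterministic.
    for source in sorted(metadata):
        covered = remaining & metadata[source]
        if covered:
            sub_requests[source] = covered
            remaining -= covered
    return sub_requests, remaining


if __name__ == "__main__":
    subs, unsupported = decompose({"author", "cited_by", "price"}, SOURCE_METADATA)
    print(subs)
    print(unsupported)
```

Each term is assigned greedily to the first source that covers it; a real mediator would also weigh cost, overlap between sources and the joins needed to recombine partial results.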
The ability to organize semantic web research along these layers helps us organize the
work required to build out the underlying infrastructure of the semantic web. The issues
that arise include the development of standards and industry-wide APIs at each of the layers,
and the building of semantic-web-specific functions such as semantic routing and “semantic”
content delivery networks. Specification of further application layers may also be
required. An interesting research topic arises at the “crossover” from the Distributed
Object Computing world to the Semantic World, i.e., interoperation across meta (or data)
models such as frame-based and object-oriented models. However, the most interesting
“semantic” issues arise from the meta-model layer upwards, as we expect the semantic
web community to standardize either on a rich meta-model or on a limited set of
meta-models with mappings across them. We visualize the semantic web fabric as a collection
of ontologies and metadata descriptions and inter-relationships and correspondence
The Semantic Web Fabric: A Collection of Metadata Descriptions and Ontologies
One way of visualizing the Semantic Web is illustrated in Figure 2, as a collection of
ontologies corresponding to different domains and user communities, and metadata
descriptions constructed from those ontologies. Ontologies have been identified as the
crucial component for capturing and representing semantics. The same information viewed from
differing perspectives may be captured using different ontologies, and inter-ontology
interoperation is the key problem that needs to be addressed in order to make the
semantic web a reality. Information requests specified using a particular ontology have to
be transformed into similar requests expressed in the terms used in other ontologies.
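The transformation of a request from one ontology into the terms of another can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the two ontologies, the term names, the relationship labels and the `translate_request` helper are hypothetical, and a request is simplified to a list of terms.

```python
from typing import Dict, List, Tuple

# Hypothetical inter-ontology correspondences from ontology A to ontology B.
# Each entry maps a term in A to (term in B, relationship). Names are made up.
MAPPINGS: Dict[str, Tuple[str, str]] = {
    "Physician": ("Doctor", "synonym"),
    "Cardiologist": ("Doctor", "hypernym"),  # B only offers the broader term
    "Publication": ("Document", "hypernym"),
}


def translate_request(
    terms: List[str], mappings: Dict[str, Tuple[str, str]]
) -> Tuple[List[str], bool, List[str]]:
    """Rewrite a list of ontology-A terms into ontology-B terms.

    Returns (translated, exact, untranslated): the rewritten terms, a flag
    that is False when any rewrite used a non-synonym mapping (so the meaning
    may have shifted), and the terms that have no correspondence at all.
    """
    translated: List[str] = []
    untranslated: List[str] = []
    exact = True
    for term in terms:
        if term not in mappings:
            untranslated.append(term)
            continue
        target, relation = mappings[term]
        translated.append(target)
        if relation != "synonym":
            exact = False
    return translated, exact, untranslated


if __name__ == "__main__":
    print(translate_request(["Physician", "Cardiologist", "Nurse"], MAPPINGS))
```

The `exact` flag is one simple way to signal that a translated request is only similar to the original, not equivalent to it.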
[Figure 2 diagram labels: User Query/Information Request (three instances); Metadata; Distributed Computing Infrastructure (J2EE, .NET, CORBA, Agents)]
Figure 2: The Semantic Web Fabric: A Collection of Metadata and Ontologies
This necessitates an infrastructure with components that manage ontologies,
inter-ontological relationships, metadata descriptions constructed from the various ontologies,
and mappings of these metadata descriptions to the various data and information sources
on the web. The component functions crucial for enabling the semantic web are:
• Bootstrapping, Creation and Maintenance of Semantic Knowledge
o Collaborative and Sociological Processes, Statistical Techniques
o Ontology Building, Maintenance and Versioning Tools
• Re-use of Existing Semantic Knowledge (Ontologies)
• Annotation/Association/Extraction of Knowledge with/from Underlying Data
• Information Retrieval and Analysis (Distributed Querying/Search/Inference
• Semantic Discovery and Composition of Services
• Distributed Computing/Communication Infrastructures
o Component based technologies, Agent based systems, Web Services
• Repositories for managing data and semantic knowledge
o Relational Databases, Content Management Systems, Knowledge Base
As enumerated above, the issues involved in Semantic Web research
are wide-ranging and span multiple disciplines and research areas. Surprisingly, even
though the database community has not taken up issues related to the Semantic Web,
work done within the community has spanned almost all the categories mentioned
above; this is the topic of discussion for the next section.
DB Research and the Semantic Web
We now discuss the areas of DB research that overlap with the Semantic Web effort,
which are as follows:
• Semantic Data Models: Database researchers have been working on various types of
semantic data models with constructs at higher levels of abstraction, such as
generalization and aggregation. The main focus of this work, however, was
support for queries at a higher level of abstraction and efficient indexing structures for
the same. Inference based on semantics was not the main focus of this area of work.
• Multi-database Schema Heterogeneity and Schema Integration: There is a wide
body of work in multidatabase literature that attempts to identify and enumerate
various schema heterogeneities, and techniques for resolving those heterogeneities in
the context of schema integration. Attempts have also been made to use domain
ontologies for integration of data across multiple databases.
• Schema Evolution: Even though for the most part, the database schema has been
assumed to be relatively static, there has been work on schema evolution and
versioning in the context of object oriented databases.
• Object Oriented/XML Databases: Specialized databases, such as object oriented
databases in the 90s and XML databases today, have been and are being developed to
address the specialized needs of web content and complex data.
• Deductive Databases/Rule Based Systems: Rule based approaches for handling and
manipulating data have been implemented in various deductive database prototypes.
Rule based approaches are also visible in implementations of triggers in commercial
relational database systems.
• Mediators and Wrappers: The availability of non-traditional (non-relational) data
sources on the web created a need for exporting a “relational” view of the underlying
data. Wrappers focused on encapsulating a data source into a relational or an object-
relational model, whereas mediators focused on partitioning queries and combining
results from multiple data sources.
• Multidatabase/Federated Database Query Processing: Query processing across
multiple autonomous databases has been a significant endeavor in the federated
database field and frameworks to support mappings and query decomposition
algorithms have been proposed.
• Data Mining: The presence of huge amounts of data in corporate databases
(compounded by the data explosion on the internet), has given rise to the need for
automatically “mining” patterns from the databases to come up with insights that can
be applied to derive further business efficiencies. The data mining field of work has
focused on coming up with scalable and efficient algorithms for the same.
• Probabilistic Databases: Though not a part of mainstream database literature, there
has been a significant amount of work on storage, manipulation and querying of
• Workflow-based Coordination Systems: Work has been done on the definition of
task-based workflow processes and on their control and coordination. There has been
work on dynamic instantiation and combination of workflows, and there have been
approaches to re-use this work in the context of web services.
• Security in Database Systems: Security in database systems research has focused on
specification of access and authorization policies based on group membership and
techniques to enforce and prove the correctness of the policies specified.
• Multimedia Databases: The explosion of data on the web has given rise to
specialized databases dealing with multimedia information such as text, images and
Based on the above discussion, it may be observed that even though the database
community has not bought into the Semantic Web vision, work has been done across all
the problem components crucial for the Semantic Web. We believe that the Semantic
Web provides the underlying theme that can tie together all these disparate pieces of work. We
now discuss the gaps that need to be addressed to make the Semantic
Web a reality and what DB research can contribute in that context.
The critical gaps in Semantic Web research, and the ways in which the
DB community can respond to them, are as follows:
• Ontology Impedance/Integration/Interoperation: Ontology impedance may be
defined as the semantic mismatch between two or more ontologies that are being
merged. The ontology integration problem differs slightly from the
schema integration problem, as it focuses on the semantics of the relationships and
domain specific constraints on the information. Work needs to be done to estimate the
loss of information that results from this impedance. Schema integration
work may prove to be a good starting point for ontology integration and
• Scalability/Performance: The scalability of web servers serving semantic
web content is a critical issue on which the future semantic web depends. Work is
needed on techniques that exploit “semantics” to design better caching
schemes, e.g., semantic content distribution networks. There is also a need for
metrics and measurements to evaluate how well algorithms for the semantic web will
perform and scale. In general, this has been the strong point of DB research, and it is
an area where a significant contribution from the database community is most likely.
• Dynamic Ontologies: A fundamental but flawed assumption being made by the
database research community is that database schemas and, by extension,
ontologies are static in nature. Real world ontologies are likely to be dynamic and
evolve over time and algorithms and techniques should be based on this important
• Semantic Metadata Extraction: Two crucial factors that will determine the success
of the semantic web are: the ease and cost of developing and maintaining ontologies,
mappings and articulation rules; and the ease of constructing semantic annotations.
Tools that drive the extraction process based on text processing and NLP techniques
(most of the data on the web is textual) are important. There is significant work on
mappings, etc., in Federated Database work and this needs to be augmented by
looking at techniques from Information Retrieval (clustering) and NLP.
• Inferences based on the Semantics of the Data: The DB community has focused
more on issues such as indexing, caching, etc., where the schemas and range of
queries are known ahead of time. However, on the semantic web, where ontologies
are dynamic and user requests might change over time, inferences based on the
semantics of data might be an important tool to address some of these issues.
• Semantics of Multimedia Data: There is a need to focus more on non-traditional data
such as text, images and video. The challenge before the DB community is to be able
to evolve a data model that is as simple and effective as the relational model and a
query language similar to SQL, which treats structured and unstructured data in a
• Semantics of Processes/Plans/Workflows: Whereas there has been a lot of work in
task and process-based workflow by the DB community, there is a need for the ability
to map high level semantic descriptions to workflow instances and compose existing
workflow instances on the fly. Once again, the DB community is well positioned to
• Digital Rights Management: The appearance of new types of multimedia data
creates a need for digital rights management, which the DB community should
respond to. This is a new area for DB research, as the need for “watermarking”
relational data was never felt.
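One possible way to make the “loss of information” mentioned under Ontology Impedance concrete is to compare the answers returned through a translated request with the answers that an ideal, lossless translation would return. The Python sketch below is illustrative only: precision and recall are standard information retrieval measures that we substitute here as an assumption, and the `precision_recall` helper and the document identifiers are hypothetical.

```python
from typing import Hashable, Set, Tuple


def precision_recall(
    ideal: Set[Hashable], returned: Set[Hashable]
) -> Tuple[float, float]:
    """Return (precision, recall) of `returned` measured against `ideal`.

    Convention assumed here: precision is 1.0 when nothing is returned
    (no wrong answers), and recall is 1.0 when the ideal set is empty
    (nothing to miss).
    """
    hits = len(ideal & returned)
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(ideal) if ideal else 1.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical answer sets: what the user wanted vs. what came back
    # after the request was rewritten into another ontology.
    ideal = {"d1", "d2", "d3", "d4"}
    returned = {"d1", "d2", "d5"}
    print(precision_recall(ideal, returned))
```

A drop in recall indicates answers lost in translation, while a drop in precision indicates answers that were admitted only because a broader or mismatched term was used.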
We believe that the next wave of research will focus on re-using data
models/ontologies/schemas in an open and dynamic environment. This requires the DB
community to change its assumptions and think “out of the box” in order to make an
In general, we believe that the database research community is well positioned to address
the challenges involved in enabling the Semantic Web, and furthermore, the Semantic
Web theme serves as a good unifying framework for pulling together disparate pieces of
work being performed by the various researchers in the database community. However,
there is a need to think “outside the box” and change some of the underlying assumptions
in order to make an impact in this area:
• Data Models/Schemas/Ontologies will form the critical infrastructure for the
Semantic Web. More attention should be paid to issues such as model manipulation,
management and querying.
• Re-use of pre-existing data models/schemas/ontologies is crucial in describing the
semantics of various information sources, i.e., issues regarding this layer must be given
the same level of attention as issues related to data management.
• There is a need to relax consistency and completeness requirements and estimate the
“error” in the results returned.
• Semantics of information should be used to minimize “error” in the information
• The new environment is likely to be more “dynamic” in nature – schemas,
workflows, queries, etc. can no longer be assumed to be static.
We believe that if the DB community adapts to these requirements, it stands a good
chance of making an impact; otherwise, it might miss the bus again!