Realizing Semantic Web: Lightweight Semantics and Beyond
Krishnaprasad Thirunarayan (T. K. Prasad)
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, OH-45435


Outline
• Domain Goals and Challenges
• Cyberinfrastructure Investments in Science
• Utility and Continuum of Machine-Processable
Semantics: An Architecture
– What?: Nature of Data and Granularity of Semantics
– Why?: Lightweight semantics and its benefits
– How?: Community-ratified Ontologies
+ Semantic Annotations of Data and Documents
+ Linked Open Materials Data
• Research: Processing Tabular Data

Domain Goals and Challenges
• Sharing, discovery, and application of materials
science and engineering data and information are
possible only if domain scientists are able and
willing to share.
• Technological challenges
– Computational tools and repositories conducive to easy
exchange, curation, attribution, and analysis of data

• Cultural challenges
– Proper protection, control, and credit for sharing data

Category of Geoscience Data: Characteristics, Strategy for Reuse, CI Strategy

• Short tail science data created by large organizations and projects
– Characteristics: Few, large (TB+), structured, spatially rich (e.g., remote sensing), largely homogeneous, highly visible, curated
– Strategy for Reuse: Planned integration strategies, could use formal ontologies / domain models and vocabularies, visualization tools and APIs
– CI Strategy: Data centers / grids generally using relational databases and files, maintained by people with significant IT skills

• Long tail science data created by individual scientists and small groups
– Characteristics: Many, small (GB+), heterogeneous, invisible (except via publications), poorly curated
– Strategy for Reuse: Multi-domain and broad vocabularies (including community established ones), create semantic metadata (annotations) and optionally publish, search and download legacy data, or use an open data initiative
– CI Strategy: Web-based easy to learn and use semantic tools for annotation, publication, search and download that can be used by individual scientists without significant IT skills

Our Thesis
Associating machine-processable semantics
with materials science and engineering data
and documents can help overcome
challenges associated with data discovery,
integration and interoperability caused by
data heterogeneity.


What?: Nature of Data and Documents
• Structured Data (e.g., relational)
• Semi-structured, Heterogeneous Documents
(e.g., publications and technical specs usually
include text, numerics, units of measure, images
and equations)
• Tabular data (e.g., ad hoc spreadsheets and
complex tables incorporating “irregular” entries)

Fragment of Materials and Process spec for Ti Alloy
Bars, Wire, Forgings, and Rings.


What?: Granularity of Semantics and Applications: Examples

• Synonyms
– Chemistry, Chemical Composition, Chemical Analysis, ...
– Bend Test, Bending, ...
– Delivery Condition, Process/Surface Finish, Temper, "as received by
purchaser", ...

• Coreference vs broadening/narrowing
– Tubing vs welded tubing vs flash-welded part

• Capturing characteristic-value pairs
– Recognize and Normalize: “0.1 inch and under in nominal thickness”
is translated to “Thickness <= 0.1 in”.
– Glean elided characteristic: controlled term “solution heat treated”
implies the characteristic “heat treat type”.
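
The two characteristic-value cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression, the `IMPLIED_CHARACTERISTIC` table, and the function names are hypothetical, and a real extraction tool would need a much larger pattern set and vocabulary.

```python
import re

# Hypothetical controlled-term table: maps a controlled term to the
# characteristic it implies (the "elided characteristic" case).
IMPLIED_CHARACTERISTIC = {
    "solution heat treated": "heat treat type",
}

# Assumed pattern for phrases such as
# "0.1 inch and under in nominal thickness".
_BOUND = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>inch|in)\.?\s+and\s+"
    r"(?P<dir>under|over)\s+in\s+(?:nominal\s+)?(?P<char>\w+)",
    re.IGNORECASE,
)


def normalize_bound(phrase):
    """Rewrite a recognized bound phrase as 'Characteristic <op> value in'."""
    m = _BOUND.search(phrase)
    if m is None:
        return None  # phrase not recognized; leave it for manual tagging
    op = "<=" if m.group("dir").lower() == "under" else ">="
    return f"{m.group('char').capitalize()} {op} {m.group('value')} in"


def glean_characteristic(term):
    """Return (characteristic, value) for a controlled term, if known."""
    key = term.strip().lower()
    if key in IMPLIED_CHARACTERISTIC:
        return (IMPLIED_CHARACTERISTIC[key], key)
    return None
```

Unrecognized phrases return `None` rather than a guess, which fits a semi-automatic workflow in which a domain expert reviews whatever the patterns miss.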

What?: Granularity of Semantics and Associated Applications

• Lightweight semantics: File and document-level
annotation to enable discovery and sharing
• Richer semantics: Data-level annotation and
extraction for semantic search and summarization
• Fine-grained semantics: Data integration,
interoperability and reasoning in Linked Open
Materials Science Data

Computer Assisted Document Extraction Tool
[Figure: typical view of the tagged spec; tree/structure view of the spec]

Few More Examples: Procedure Melt Methods
[Figure: tag editor; view of the original spec; tagged spec]

Few More Examples: Procedure Melt Methods
[Figure: tag editor; the SDL]

Why?: Benefits of Lightweight Semantics
• Ease of use by domain experts
– Faster and wider adoption, promoting evolution

• Low upfront cost to support

• Shallow semantics has wider applicability to a
range of documents/data and appeals to a broader
community of geoscientists
• Bottom line: “Learn to Walk before we Run”

How?: Using Semantic Web Technologies
Machine-processable semantics achieved by
addressing
• Syntactic Heterogeneity: Using XML syntax and
the RDF data model (labelled graph structure)
• Semantic Heterogeneity:
– Using “common” controlled vocabularies, taxonomies
and ontologies
– Using federated data sources, exchanges, querying,
and services
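
As a rough sketch of how the two kinds of heterogeneity might be addressed together, the Python below maps source-specific field names onto a shared vocabulary and emits subject-predicate-object triples, the basic unit of the RDF labelled graph. The `COMMON_TERM` table, the predicate names, and the sample records are hypothetical; only the synonym groups come from the earlier slide.

```python
# Hypothetical mapping from source-specific field names to terms of a
# shared controlled vocabulary (predicate names are made up for this sketch).
COMMON_TERM = {
    "chemistry": "chemicalComposition",
    "chemical analysis": "chemicalComposition",
    "chemical composition": "chemicalComposition",
    "bending": "bendTest",
    "bend test": "bendTest",
}


def to_triples(subject, record):
    """Turn one flat record into (subject, predicate, object) triples."""
    triples = []
    for field_name, value in record.items():
        predicate = COMMON_TERM.get(field_name.strip().lower())
        if predicate is not None:  # skip fields outside the vocabulary
            triples.append((subject, predicate, value))
    return triples


# Two illustrative records that use different words for the same things.
spec_a = {"Chemistry": "Ti-6Al-4V", "Bending": "required"}
spec_b = {"Chemical Analysis": "Ti-6Al-4V", "Bend Test": "required"}

triples_a = to_triples("specA", spec_a)
triples_b = to_triples("specB", spec_b)
```

After mapping, both records use the same predicates, so a single query over `chemicalComposition` finds both of them.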

How?: Ingredients for Semantics-based Cyber Infrastructure

• Use of community-ratified controlled vocabularies
and lightweight ontologies (upper-level,
hierarchies)
• Ease registration, publishing, and discovery
• Provide support for provenance and access control
• Track data citation for credit for data sharing
• Semi-automatic annotation of data and documents:
Manual + Automatic

How?: Search Continuum
• Keyword-based full-text search
• + Manually provided content and source metadata
– Uses upper-level ontology
• + Automatically extracted metadata
– Map text to concepts/properties/values
– Semantic + faceted search using background knowledge
• + Deeper semi-automatic content annotation and
extraction
– Aggregating related pieces of information; conditioning
– Integration and Interoperation
• + Linked Open Materials Science Data
• + Federated and Faceted Querying and Services
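
The first steps of this continuum can be illustrated with a small Python sketch that starts from keyword search over text and then narrows the hits using facet metadata. The documents, facet names, and function names below are hypothetical and serve only to show the difference between the two search styles.

```python
from collections import Counter

# Hypothetical annotated documents: free text plus facet metadata.
DOCS = [
    {"id": "doc1", "text": "Titanium alloy bars, heat treated",
     "facets": {"material": "titanium alloy", "form": "bar"}},
    {"id": "doc2", "text": "Titanium alloy wire",
     "facets": {"material": "titanium alloy", "form": "wire"}},
    {"id": "doc3", "text": "Steel tubing, welded",
     "facets": {"material": "steel", "form": "tubing"}},
]


def keyword_search(docs, keyword):
    """Plain full-text match: the starting point of the continuum."""
    kw = keyword.lower()
    return [d for d in docs if kw in d["text"].lower()]


def facet_filter(docs, **facets):
    """Keep documents whose metadata matches every requested facet value."""
    return [d for d in docs
            if all(d["facets"].get(k) == v for k, v in facets.items())]


def facet_counts(docs, facet):
    """Count values of one facet, e.g. to show refinement options."""
    return Counter(d["facets"][facet] for d in docs if facet in d["facets"])


hits = keyword_search(DOCS, "titanium")
wire_hits = facet_filter(hits, form="wire")
```

Facet counts over the current hit list are what a faceted interface would typically display next to each refinement option.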

Linked Open Data – Why do we need data?


Linked Open Data – Just data is not enough
• More and more data are available, but isolated
islands of data are not enough; they are akin to
the web of documents without hyperlinks.

[Figure: isolated data sets A, B, C, D, E, and F]

• Need to interlink data over the web to enable
content-rich applications.

[Figure: Linked Data, with data sets A through F interlinked]

Linked Open Data – A Realization
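[Figure: RDF graph, listed here as subject, predicate, object]
http://ex./John_Kennedy  owl:sameAs  http://dbpedia../John_F._Kennedy
http://ex./John_Kennedy  http://ex./AuthoredBook  http://ex./A_Nation_of_Immigrants
http://ex./A_Nation_of_Immigrants  http://ex./publishedIn  1964
http://ex./A_Nation_of_Immigrants  http://ex./genre  http://ex./non-fiction
http://dbpedia../John_F._Kennedy  http://dbpedia../Profession  http://dbpedia../politician
http://dbpedia../John_F._Kennedy  http://dbpedia../BirthDate  1917-05-29
http://dbpedia../John_F._Kennedy  http://dbpedia../BirthPlace  http://dbpedia../Massachusetts
http://dbpedia../Massachusetts  http://dbpedia../Capital  http://dbpedia../Boston
http://dbpedia../Massachusetts  http://dbpedia../Country  http://dbpedia../United_States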
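
To show what the owl:sameAs link in this slide makes possible, the Python sketch below merges the facts known about a resource under all of its linked names. The `ex:` and `dbpedia:` prefixes are shorthand for the abbreviated URIs on the slide, the assignment of edges to nodes is one reading of the diagram, and the function names are hypothetical.

```python
# A few triples in the spirit of the slide; "ex:" and "dbpedia:" abbreviate
# the (already shortened) URIs shown there. Edge placement is an assumption.
TRIPLES = [
    ("ex:John_Kennedy", "owl:sameAs", "dbpedia:John_F._Kennedy"),
    ("ex:John_Kennedy", "ex:AuthoredBook", "ex:A_Nation_of_Immigrants"),
    ("ex:A_Nation_of_Immigrants", "ex:publishedIn", "1964"),
    ("dbpedia:John_F._Kennedy", "dbpedia:BirthDate", "1917-05-29"),
    ("dbpedia:John_F._Kennedy", "dbpedia:Profession", "dbpedia:politician"),
]


def aliases(node, triples):
    """All names reachable from node through owl:sameAs, in either direction."""
    seen = {node}
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != "owl:sameAs":
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


def describe(node, triples):
    """Facts about node, pooled across every data set that names it."""
    names = aliases(node, triples)
    return {(p, o) for s, p, o in triples
            if s in names and p != "owl:sameAs"}


facts = describe("ex:John_Kennedy", TRIPLES)
```

Without the owl:sameAs triple, a query on `ex:John_Kennedy` would return only the authored book; with it, the birth date and profession from the second data set are included as well.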

Linked Open Data

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

Example: Lightweight Semantic Registration of Data
• Title of data
• Type of data: maps, excel files, images, text
• Keywords: selected from five tier vocabulary provided
• Data format: structured or unstructured
• Description of data: brief unstructured description of content
• Contact information of provider(s): name of provider(s), email for verification, lineage
• Spatial extent of data and reference system: location
• Temporal extent of data: date range in time or age range if not recent
• Date and type of Related Publication(s): Journal, Thesis, Agency report, not published
• Host site for publication: Journal, Library, Personal computer
• Access restrictions: copyright regulations
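
A registration form of this kind could be backed by a simple record type. The Python sketch below covers a subset of the fields on the slide; the class name, attribute names, and validation rules are hypothetical and only show how lightweight such a record can be.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataRegistration:
    """Hypothetical record mirroring some of the registration fields above."""
    title: str
    data_type: str                  # e.g. "maps", "excel files", "images", "text"
    keywords: List[str]             # drawn from a controlled vocabulary
    data_format: str                # "structured" or "unstructured"
    description: str
    provider_name: str
    provider_email: str
    spatial_extent: Optional[str] = None
    temporal_extent: Optional[str] = None
    access_restrictions: Optional[str] = None

    def problems(self):
        """Return a list of human-readable issues; empty means acceptable."""
        issues = []
        if not self.title.strip():
            issues.append("title is empty")
        if self.data_format not in ("structured", "unstructured"):
            issues.append("data_format must be 'structured' or 'unstructured'")
        if "@" not in self.provider_email:
            issues.append("provider_email does not look like an email address")
        return issues


record = DataRegistration(
    title="Example survey data",
    data_type="excel files",
    keywords=["example"],
    data_format="structured",
    description="Illustrative entry only",
    provider_name="A. Provider",
    provider_email="provider@example.org",
)
```

Keeping the optional fields optional lets a scientist register data with a handful of required entries, which is the point of the lightweight approach.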

System Architecture and Components


Deeper Issues: Semantic Formalization
of Tabular Data
Problems and A Practical Approach
(“When rubber meets the road”)


Nature of tables
• Compact structures for sharing information
– Minimize duplication

• Types of Tables
– Regular : Dense Grid with explicit schema
information in terms of column and row
headings => Tractable
– Irregular: Sparse Grid with implicit schema and
ad hoc placement of heading => Hard
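
One simple way to separate the two kinds of table is a check like the Python sketch below. It treats a table as regular when it is a dense rectangle whose first row can serve as column headings; this is an assumed simplification of the slide's criteria (it does not look at row headings), and `is_regular` is a hypothetical helper name.

```python
def is_regular(grid):
    """Return True if grid is a dense rectangle with a fully filled first row.

    grid is a list of rows, each a list of cell values. An empty string or
    None counts as a missing cell.
    """
    if not grid or not grid[0]:
        return False
    width = len(grid[0])
    if any(len(row) != width for row in grid):
        return False  # ragged rows: not a simple rectangular grid

    def filled(cell):
        return cell is not None and str(cell).strip() != ""

    # Dense grid: every cell, including the header row, must be present.
    return all(filled(cell) for row in grid for cell in row)


regular = [
    ["Product", "Thickness", "Condition"],
    ["Bar", "0.1", "annealed"],
    ["Wire", "0.05", "annealed"],
]
irregular = [
    ["Product", "", "Condition"],
    ["Bar", "0.1"],
]
```

Tables that fail the check are the ones that would need manual restructuring before any automatic translation.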

Challenges Associated with Typical Spreadsheet/Table

• Meant for human consumption
• Irregular
– Not a simple rectangular grid
• Heterogeneous
– Not all rows are interpreted similarly
• Complex
– Meaning of each row and each column is context
dependent
• Footnotes modify meaning of entries (esp. in materials
and process specifications)

Practical Semi-Automatic Content Extraction
• DESIGN: Develop regular data structures that
can be used to formalize tabular information.
– Provide a natural expression of data
– Provide semantics to data, thereby removing potential
ambiguities
– Enable automatic translation

• USE: Manual population of regular tables and
automatic translation into LOD
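
The automatic translation step might look like the Python sketch below, which turns a manually populated regular table into N-Triples lines. The `http://example.org/` namespace, the choice of the first column as row identifier, and the function names are assumptions for illustration, and every value is emitted as a plain string literal.

```python
BASE = "http://example.org/"  # hypothetical namespace, not from the slides


def _slug(text):
    """Join whitespace-separated words with underscores for use in a URI."""
    return "_".join(str(text).split())


def table_to_ntriples(header, rows, key_column=0):
    """Emit one N-Triples line per non-key cell of a regular table."""
    lines = []
    for row in rows:
        subject = f"<{BASE}{_slug(row[key_column])}>"
        for i, (name, value) in enumerate(zip(header, row)):
            if i == key_column:
                continue  # the key cell names the row; it is not a property
            predicate = f"<{BASE}{_slug(name)}>"
            # Escape backslashes and quotes so the literal stays well formed.
            text = str(value).replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} {predicate} "{text}" .')
    return lines


header = ["Product", "Thickness", "Condition"]
rows = [["Bar A", "0.1", "annealed"]]
ntriples = table_to_ntriples(header, rows)
```

Because the table is regular, each column heading maps to exactly one predicate, so no per-row interpretation is needed.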


Kno.e.sis
Thank you, and please visit us at

http://knoesis.org/

Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA

