The talk "Realizing Semantic Web: Lightweight Semantics and Beyond" was given by Prof. T. K. Prasad at the ICMSE-MGI Digital Data Workshop, held at the Kno.e.sis Center on November 13-14, 2013. The talk emphasized an annotation and search framework.
Workshop page: http://wiki.knoesis.org/index.php/ICMSE-MGI_Digital_Data_Workshop
1. Realizing Semantic Web: Lightweight Semantics and Beyond
Krishnaprasad Thirunarayan (T. K. Prasad)
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, OH-45435
2. Outline
• Domain Goals and Challenges
• Cyberinfrastructure Investments in Science
• Utility and Continuum of Machine-Processable Semantics: An Architecture
– What?: Nature of Data and Granularity of Semantics
– Why?: Lightweight semantics and its benefits
– How?: Community-ratified Ontologies
+ Semantic Annotations of Data and Documents
+ Linked Open Materials Data
• Research: Processing Tabular Data
3. Domain Goals and Challenges
• Sharing, discovery, and application of materials science and engineering data and information are possible only if domain scientists are able and willing to do so.
• Technological challenges
– Computational tools and repositories conducive to easy
exchange, curation, attribution, and analysis of data
• Cultural challenges
– Proper protection, control, and credit for sharing data
4. Category of Geoscience Data: Characteristics, Strategy for Reuse, CI Strategy
• Short tail science data created by large organizations and projects
– Characteristics: Few, large (TB+), structured, spatially rich (e.g., remote sensing), largely homogeneous, highly visible, curated
– Strategy for Reuse: Planned integration strategies, could use formal ontologies / domain models and vocabularies, visualization tools and APIs
– CI Strategy: Data centers / grids generally using relational databases and files, maintained by people with significant IT skills
• Long tail science data created by individual scientists and small groups
– Characteristics: Many, small (GB+), heterogeneous, invisible (except via publications), poorly curated
– Strategy for Reuse: Multi-domain and broad vocabularies (including community established ones), create semantic metadata (annotations) and optionally publish, search and download legacy data, or use an open data initiative
– CI Strategy: Web-based easy to learn and use semantic tools for annotation, publication, search and download that can be used by individual scientists without significant IT skills
5. Our Thesis
Associating machine-processable semantics
with materials science and engineering data
and documents can help overcome
challenges associated with data discovery,
integration and interoperability caused by
data heterogeneity.
6. What?: Nature of Data and Documents
• Structured Data (e.g., relational)
• Semi-structured, Heterogeneous Documents
(e.g., publications and technical specs usually
include text, numerics, units of measure, images
and equations)
• Tabular data (e.g., ad hoc spreadsheets and
complex tables incorporating “irregular” entries)
7. Fragment of Materials and Process spec for Ti Alloy
Bars, Wire, Forgings, and Rings.
8. What?: Granularity of Semantics and Applications: Examples
• Synonyms
– Chemistry, Chemical Composition, Chemical Analysis, ...
– Bend Test, Bending, ...
– Delivery Condition, Process/Surface Finish, Temper, "as received by
purchaser", ...
• Coreference vs broadening/narrowing
– Tubing vs welded tubing vs flash-welded part
• Capturing characteristic-value pairs
– Recognize and Normalize: “0.1 inch and under in nominal thickness”
is translated to “Thickness <= 0.1 in”.
– Glean elided characteristic: controlled term “solution heat treated”
implies the characteristic “heat treat type”.
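The synonym, normalization, and elided-characteristic examples on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synonym table, the regular expression, the choice of canonical terms, and the function names (`canonical_term`, `normalize_constraint`, `implied_characteristic`) are hypothetical and are not the extraction tool described in the talk.

```python
import re

# Hypothetical thesaurus: surface phrase -> controlled term.
# The phrases come from the slide; the canonical term chosen is an assumption.
SYNONYMS = {
    "chemistry": "Chemical Composition",
    "chemical composition": "Chemical Composition",
    "chemical analysis": "Chemical Composition",
    "bend test": "Bend Test",
    "bending": "Bend Test",
}

# Hypothetical table: controlled term -> the characteristic it implies.
IMPLIED_CHARACTERISTIC = {"solution heat treated": "heat treat type"}

# Matches phrases such as "0.1 inch and under in nominal thickness".
CONSTRAINT = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?:inch(?:es)?|in\.?)\s+and\s+"
    r"(?P<rel>under|over)\s+in\s+nominal\s+(?P<name>[a-z]+)",
    re.IGNORECASE,
)
OPERATORS = {"under": "<=", "over": ">="}


def canonical_term(phrase):
    """Map a surface phrase to its controlled term, or None if unknown."""
    return SYNONYMS.get(phrase.strip().lower())


def normalize_constraint(text):
    """Rewrite '<value> inch and under/over in nominal <name>' as '<Name> <op> <value> in'."""
    match = CONSTRAINT.search(text)
    if match is None:
        return None
    op = OPERATORS[match.group("rel").lower()]
    return f"{match.group('name').capitalize()} {op} {match.group('value')} in"


def implied_characteristic(term):
    """Return the characteristic elided by a controlled term, if known."""
    return IMPLIED_CHARACTERISTIC.get(term.strip().lower())


print(normalize_constraint("0.1 inch and under in nominal thickness"))
print(canonical_term("Bending"))
print(implied_characteristic("solution heat treated"))
```

A pattern-based approach like this only covers phrasings it has been written for; the coreference and broadening/narrowing cases on the slide would need a knowledge base rather than a lookup table.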
9. What?: Granularity of Semantics and Associated Applications
• Lightweight semantics: File and document-level
annotation to enable discovery and sharing
• Richer semantics: Data-level annotation and
extraction for semantic search and summarization
• Fine-grained semantics: Data integration,
interoperability and reasoning in Linked Open
Materials Science Data
10. Computer Assisted Document Extraction Tool
Typical view of the tagged Spec
Tree/Structure view of the Spec
11. Computer Assisted Document Extraction Tool
Tag Editor
Few More Examples: Procedure Melt Methods
View of the Original Spec
Tagged Spec
12. Computer Assisted Document Extraction Tool
Tag Editor
Few More Examples: Procedure Melt Methods
The SDL
13. Why?: Benefits of Lightweight Semantics
• Ease of use by domain experts
– Faster and wider adoption, promoting evolution
• Low upfront cost to support
• Shallow semantics has wider applicability to a
range of documents/data and appeal to a broader
community of geoscientists
• Bottom-line: “Learn to Walk before we Run”
14. How?: Using Semantic Web Technologies
Machine-processable semantics achieved by
addressing
• Syntactic Heterogeneity: Using XML syntax and
RDF data model (labelled graph structure)
• Semantic Heterogeneity:
– Using “common” controlled vocabularies, taxonomies
and ontologies
– Using federated data sources, exchanges, querying,
and services
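As a minimal sketch of the RDF data model mentioned above, a labelled graph can be written as a list of (subject, predicate, object) triples. The `http://example.org/` namespace, the property names, and the helpers `to_ntriples` and `objects` are hypothetical; a real system would use an RDF library and a community-ratified vocabulary instead.

```python
EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Each triple is one labelled edge: subject --predicate--> object.
triples = [
    (EX + "spec1", EX + "describesMaterial", EX + "TiAlloy"),
    (EX + "spec1", EX + "thicknessMaxInches", 0.1),
    (EX + "TiAlloy", EX + "label", "Ti Alloy"),
]


def to_ntriples(triples):
    """Serialize triples in a simplified N-Triples-like form.

    IRIs go in angle brackets and everything else is quoted as a plain literal;
    datatypes and escaping are deliberately left out of this sketch.
    """
    lines = []
    for s, p, o in triples:
        if isinstance(o, str) and o.startswith("http"):
            obj = f"<{o}>"
        else:
            obj = f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


def objects(triples, subject, predicate):
    """Follow one labelled edge type out of a node."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(to_ntriples(triples))
print(objects(triples, EX + "spec1", EX + "thicknessMaxInches"))
```

Because every statement has the same three-part shape, data from different sources can be concatenated without agreeing on a file layout first, which is the sense in which the graph model addresses syntactic heterogeneity.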
15. How?: Ingredients for Semantics-based Cyber Infrastructure
• Use of community-ratified controlled vocabularies
and lightweight ontologies (upper-level,
hierarchies)
• Ease registration, publishing, and discovery
• Provide support for provenance and access control
• Track data citation for credit for data sharing
• Semi-automatic annotation of data and documents: Manual + Automatic
16. How?: Search Continuum
• Keyword-based full-text search
• + Manually provided content and source metadata
– Uses upper-level ontology
• + Automatically extracted metadata
– Map text to concepts/properties/values
– Semantic + faceted search using background knowledge
• + Deeper semi-automatic content annotation and extraction
– Aggregating related pieces of information; conditioning
– Integration and Interoperation
• + Linked Open Material Science Data
• + Federated and Faceted Querying and Services
18. Linked Open Data – Just data is not enough
• More and more data are available, but isolated islands of data are not enough; they are akin to the web of documents without hyperlinks.
[Figure: isolated data sets A, B, C, D, E, F]
• Need to interlink data over the web to enable content-rich applications.
[Figure: Linked Data, with data sets A, B, C, D, E, F interlinked]
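19. Linked Open Data – A Realization
[Figure: RDF graph linking an example data set to DBpedia; edges listed as subject, property, object]
– http://ex./John_Kennedy, owl:sameAs, http://dbpedia../John_F._Kennedy
– http://ex./John_Kennedy, http://ex./AuthoredBook, http://ex./A_Nation_of_Immigrants
– http://ex./A_Nation_of_Immigrants, http://ex./publishedIn, 1964
– http://ex./A_Nation_of_Immigrants, http://ex./genre, http://ex./non-fiction
– http://dbpedia../John_F._Kennedy, http://dbpedia../Profession, http://dbpedia../politician
– http://dbpedia../John_F._Kennedy, http://dbpedia../BirthDate, 1917-05-29
– http://dbpedia../John_F._Kennedy, http://dbpedia../BirthPlace, http://dbpedia../Massachusetts
– http://dbpedia../Massachusetts, http://dbpedia../Capital, http://dbpedia../Boston
– http://dbpedia../Massachusetts, http://dbpedia../Country, http://dbpedia../United_States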
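The role of owl:sameAs in the graph on this slide can be illustrated with a small sketch: once two identifiers are declared the same, facts stated about either one can be gathered in a single lookup. The identifiers below are copied from the slide in their abbreviated form and treated as opaque strings, not resolvable URLs, and the helper names `aliases` and `describe` are hypothetical. This is not how a full OWL reasoner works; it only follows sameAs links.

```python
SAME_AS = "owl:sameAs"

# A subset of the slide's graph; identifiers are kept in the abbreviated
# form shown on the slide and used purely as opaque labels.
triples = [
    ("http://ex./John_Kennedy", SAME_AS, "http://dbpedia../John_F._Kennedy"),
    ("http://ex./John_Kennedy", "http://ex./AuthoredBook", "http://ex./A_Nation_of_Immigrants"),
    ("http://dbpedia../John_F._Kennedy", "http://dbpedia../BirthDate", "1917-05-29"),
    ("http://dbpedia../John_F._Kennedy", "http://dbpedia../BirthPlace", "http://dbpedia../Massachusetts"),
]


def aliases(entity, triples):
    """Return the entity plus every identifier reachable through sameAs links.

    sameAs is treated as symmetric and transitive.
    """
    seen, todo = {entity}, [entity]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if p != SAME_AS:
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in seen:
                    seen.add(b)
                    todo.append(b)
    return seen


def describe(entity, triples):
    """Collect (property, value) pairs stated about the entity under any of its aliases."""
    names = aliases(entity, triples)
    return {(p, o) for s, p, o in triples if s in names and p != SAME_AS}


for prop, value in sorted(describe("http://ex./John_Kennedy", triples)):
    print(prop, value)
```

Starting from the local identifier, the lookup returns the book from the local data set together with the birth date and birth place from the linked one, which is the kind of cross-data-set question isolated data sets cannot answer.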
20. Linked Open Data
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
21. Example: Lightweight Semantic Registration of Data
• Title of data
• Type of data: maps, excel files, images, text
• Keywords: selected from five tier vocabulary provided
• Data format: structured or unstructured
• Description of data: brief unstructured description of content
• Contact information of provider(s): name of provider(s), email for verification, lineage
• Spatial extent of data and reference system: location
• Temporal extent of data: date range in time or age range if not recent
• Date and type of Related Publication(s): Journal, Thesis, Agency report, not published
• Host site for publication: Journal, Library, Personal computer
• Access restrictions: copyright regulations
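A registration record of this kind could be captured as a small data structure. The sketch below is an assumption-laden illustration: the field names paraphrase the slide, the split into required and optional fields is hypothetical, and the sample values are made up rather than taken from any real registration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataRegistration:
    """Lightweight registration record; field names paraphrase the slide."""
    title: str
    data_type: str
    data_format: str
    description: str
    provider_name: str
    provider_email: str
    keywords: List[str] = field(default_factory=list)
    spatial_extent: Optional[str] = None
    temporal_extent: Optional[str] = None
    related_publication: Optional[str] = None
    host_site: Optional[str] = None
    access_restrictions: Optional[str] = None

    def missing_required(self):
        """Names of the (assumed) required text fields that were left empty."""
        required = ("title", "data_type", "data_format",
                    "description", "provider_name", "provider_email")
        return [name for name in required if not getattr(self, name).strip()]


def matches(record, keyword):
    """Case-insensitive keyword search over title, description, and keywords."""
    needle = keyword.lower()
    haystack = [record.title, record.description] + record.keywords
    return any(needle in text.lower() for text in haystack)


# Made-up sample record, for illustration only.
record = DataRegistration(
    title="Example tensile test results",
    data_type="excel file",
    data_format="structured",
    description="Placeholder description of the content",
    provider_name="A. Researcher",
    provider_email="",
    keywords=["tensile test"],
)
print(record.missing_required())
print(matches(record, "Tensile"))
```

Keeping the record flat and mostly free-text is what makes registration cheap for the provider; richer semantics can be layered on later without changing what was collected.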
23. Deeper Issues: Semantic Formalization of Tabular Data
Problems and A Practical Approach
(“When rubber meets the road”)
24. Nature of tables
• Compact structures for sharing information
– Minimize duplication
• Types of Tables
– Regular : Dense Grid with explicit schema
information in terms of column and row
headings => Tractable
– Irregular: Sparse Grid with implicit schema and
ad hoc placement of heading => Hard
26. Challenges Associated with Typical Spreadsheet/Table
• Meant for human consumption
• Irregular
– Not a simple rectangular grid
• Heterogeneous
– Not all rows are interpreted similarly
• Complex
– Meaning of each row and each column is context dependent
• Footnotes modify meaning of entries (esp. in materials and process specifications)
27. Practical Semi-Automatic Content Extraction
• DESIGN: Develop regular data structures that
can be used to formalize tabular information.
– Provide a natural expression of data
– Provide semantics to data, thereby removing potential
ambiguities
– Enable automatic translation
• USE: Manual population of regular tables and
automatic translation into LOD
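The automatic half of this approach, translating a manually populated regular table into triples, can be sketched as follows. Everything concrete here is an assumption made for illustration: the `http://example.org/` namespace, the column names, the placeholder values, and the helpers `is_regular` and `table_to_triples`. It is not the wizard used in the project.

```python
BASE = "http://example.org/"  # hypothetical namespace

# A "regular" table: every row carries the same explicit column headings.
# Element names and numbers are placeholders, not values from any spec.
table = [
    {"element": "X", "min_percent": 1.0, "max_percent": 2.0},
    {"element": "Y", "min_percent": 0.5, "max_percent": None},
]


def is_regular(rows):
    """Treat a table as regular when all rows share the same set of headings."""
    return len({tuple(sorted(row)) for row in rows}) <= 1


def table_to_triples(rows, key_column, table_id):
    """Mechanically translate a regular table into (subject, predicate, object) triples.

    Each row becomes a subject named after its key cell; each remaining
    non-blank cell becomes one triple whose predicate is the column heading.
    """
    if not is_regular(rows):
        raise ValueError("irregular table: needs manual formalization first")
    triples = []
    for row in rows:
        subject = f"{BASE}{table_id}/{row[key_column]}"
        for column, value in row.items():
            if column == key_column or value is None:
                continue  # blank cells produce no triple
            triples.append((subject, BASE + column, value))
    return triples


for triple in table_to_triples(table, "element", "composition"):
    print(triple)
```

The translation itself is purely mechanical; the judgment calls (which column identifies a row, what a footnote changes, how a nested table is flattened) stay with the person who fills in the regular table.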
28. Kno.e.sis
Thank you, and please visit us at
http://knoesis.org/
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA
Editor's Notes
(1) Vision: short-term and long-term goals, approaches, benefits, and challenges (reflecting cost/convenience/pay-off trade-offs); a continuum to preserve investment. (2) Glimpse at semantic formalization of tabular data. Semantics-enhanced cyberinfrastructure for ICMSE: interoperability, analytics, and applications.
Able (technological challenges) and willing (cultural challenges). Realization requires semantics-empowered techniques and, at this point, can benefit from semantic technologies and tools for “convenient” discovery (using useful and usable technologies). Motive: economic, and shortening time to development and market.
Significant current investments in the sciences (“relative comparison”). Data centers: NASA, USGS, NOAA. Cf. Big Data Volume and Velocity challenge; cf. Big Data Variety and Veracity challenge. Ultimately we want to derive value.
Additional areas that can benefit => Geoscience. Syntactic (format) and semantic (domain models, perspectives) heterogeneity: [text vs Excel vs XML] (units of measure, well-entrenched vocabularies).
Use case: Materials and process specifications. Variety challenge: sources of heterogeneity are syntactic (Excel, XML, text) vs semantic (UOM, controlled terms). Attribute-value pairs: explicit vs implicit; conditioned on shape and dimension (making these connections explicit from a text document is non-trivial). Table captions: use text-based metadata to help mediate => tabular data.
SAE Intl.: AMS 4928N: Aerospace Material Specification, Ti Alloy Bars, Wire, Forgings, and Rings. UNS R56400 (Issued 7/1/1957, Revised 4/1/1993).
Human comprehensible, but requires understanding that: (1) the spec is for four types of material: Bars, Wire, Forgings, and Rings; (2) the tolerance for hydrogen content for forgings is lax … + references to other specs that need to be inherited [[+ test frequency / lot size]]; (3) residual elements?? => need background knowledge: is it zero/traces?
Why also specify ppm? Convenience? 0.05% = 500 ppm (0.05% = 0.05/100 = 500 x 10^-6).
Two issues: What is the semantics? How can that be obtained automatically or, more realistically, semi-automatically?
(Ref: B 50T26 S7, Sections 1, 4.2, 4.4)
Synonyms: from stemming (syntactic) to a richer thesaurus (simple KB), to map document text strings to domain concepts / ontology.
Coreference issues: Purpose of semantics => what is literally given vs what is really meant? E.g., the KB says welded tubing ISA tubing, but a paragraph that describes ‘welded tubing’ can refer to it as “the tubing”.
RECALL: materials and process specs typically describe composition, processing, testing, and packaging of material. Formalizing a procedure (a process or a test) as an aggregation of characteristic/parameter-value pairs: besides determining related phrases using clause, line, paragraph boundary, etc., we may need to use a semantic/domain model/ontology to normalize or fill in implicit details.
PLUS NLP-lite issues: There is confusion regarding the distribution of “and” over “or”, and over the interpretation of “and” and “or”. For instance, is “X or Y and Z” = “X and Z or Y and Z”? Similarly, “and” in the context “P is X and Y” connotes intersection, while “and” in the context of “P and Q are X” connotes union.
Ingot chemistry vs product chemistry.
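The “and”/“or” ambiguity noted above can be made concrete by enumerating truth values. The sketch below, with hypothetical function names, compares the two possible readings of “X or Y and Z” and checks which one equals “X and Z or Y and Z”; it illustrates the logic only and says nothing about which reading a given spec author intended.

```python
from itertools import product


def reading_or_first(x, y, z):
    """'X or Y and Z' read as '(X or Y) and Z'."""
    return (x or y) and z


def reading_and_first(x, y, z):
    """'X or Y and Z' read as 'X or (Y and Z)', the usual logical precedence."""
    return x or (y and z)


def distributed(x, y, z):
    """'X and Z or Y and Z'."""
    return (x and z) or (y and z)


# Truth assignments on which the two readings give different answers.
disagreements = [
    (x, y, z)
    for x, y, z in product([False, True], repeat=3)
    if reading_or_first(x, y, z) != reading_and_first(x, y, z)
]
print(disagreements)

# Only the '(X or Y) and Z' reading matches 'X and Z or Y and Z' everywhere.
print(all(
    reading_or_first(x, y, z) == distributed(x, y, z)
    for x, y, z in product([False, True], repeat=3)
))
```

The two readings diverge exactly when X holds and Z does not, so an extractor has to decide, or ask, which grouping the text means before emitting characteristic-value pairs.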
Semantics at different levels of detail and developed in stages: “Rome was not built in a day”! Cost-benefit trade-offs.
ANALOGY: Table of contents (top-down, prescribed, static) vs index (bottom-up, gleaned to describe, dynamic).
Controlled vocabularies <= Lightweight ontologies [legacy vocab + community-agreed semantic relationships] <= Formal ontologies. Original document vs its translation => traceability (provenance).
Past research: We have dealt with the top-down UMLS ontology vs bottom-up facts from PubMed in HPCO (literature-based discovery, LBD).
Pick from existing upper-level ontology vocabulary => manual; indexing table columns, rows, captions. Semi-automatic metadata generation/embedding => annotation: mapping text to concepts; summarization: triple extraction => semantic search with background KB. Translation and summarization [integration and interoperation require alignment of vocabularies]. Graphical representation and querying. Literature-based discovery: navigate through the documents based on path search through their LOD renditions (extractions).
RECALL: materials and process specs typically describe composition, processing, testing, and packaging of material. Formalizing a procedure (a process or a test) as an aggregation of characteristic/parameter-value pairs = LOD. Eventually allows combining and comparing specs.
Biomaterials use case: gold surface affinity of peptide sequence.
Compare, manipulate, and combine specs.
Light-weight semantics as the first step
More detailed annotation for extraction … Elided/surfaced: melt quantity. And-Or issues. In the translation: atmosphere specified only for nonconsumable electrode.
The syntax is a tad inelegant because it is meant more for machine processing than for human consumption. In reality, it serves both purposes, for traceability and verification. Supports faceted search, and comparing and combining specs.
Even if sold on deeper formal semantics, LWS is a necessary first step and we advocate it …
Use case: Materials and process specifications. Variety challenge: sources of heterogeneity are syntactic (Excel, XML, text) vs semantic (UOM, controlled terms). Attribute-value pairs: explicit vs implicit; conditioned on shape and dimension (making these connections explicit from a text document is non-trivial). Table captions: use text-based metadata to help mediate => tabular data. Unification – integration vs federation – interoperation/mediation.
Less training. ASTM, NIST, MIL-stds (Handbook 21, 5). Flat list of terms and their associated definitions. Hierarchical organization of properties, alloys, performance metrics, … Cross relationships: (1) qualitative dependencies (proportionality); (2) quantitative dependencies (equations/formulas).
Document text – table captions + provenance and content keywords + annotation tools + attribute-value pairs; consolidation of related pieces; conditioning. Biomaterials use case: gold-binding peptides. Recognize gold surfaces, peptide sequences, and then their relationships.
http://www.bbc.co.uk/music/ => [search by artist] “Elton John”. How do you build a site like the BBC's that is constantly kept up to date (e.g., artist details)? Either site editors manually edit the data and the web site, or data is automatically extracted from other web sites that are kept current (say, by the crowd). E.g., the BBC uses external open datasets such as Wikipedia and MusicBrainz. How do we build an agile infrastructure that can benefit from sharable, evolving, open data sets?
Open data sets alone in isolation are analogous to the web of documents without hyperlinks. We need the capability to access the data via standard technologies, and interlink data over the web… That is what we call linked open data
Primary principles of linked open data: 1. Use URIs to describe the data (machine processable, standard). 2. Associate descriptions with the data. 3. Interlink data whenever possible; data integration happens via interlinking at the data level and instance level. These generic linked data principles are realized using Semantic Web technologies (RDF, RDFS, and OWL). The nature of issues to be dealt with / resolved / accomplished in creating LOD.
The diagram shows the domain coverage of LOD as of 2011.
How do we search: content-based and context-based (provenance). Hierarchy of terms for manual selection; publications: who wrote, where appeared, when.
NSF-SBIR “Computer-Assisted Document Interpretations Tools” [Materials and Process specs relevant to aircraft and automobile industries]
Use case: Materials and Process specs (e.g., composition table, tensile test information): conditional constructs; cross-section info placed on a row by itself; nested tables; blank values. [Cannot reproduce GE specs due to copyright issues.]
AMS 4928N
http://www.youtube.com/watch?v=D8U4G5kcpcM
http://www.ndt-ed.org/EducationResources/CommunityCollege/Materials/Mechanical/Mechanical.htm
Most structural materials are anisotropic, which means that their material properties vary with orientation. In products such as sheet and plate, the rolling direction is called the longitudinal direction, the width of the product is called the (long) transverse direction, and the thickness is called the short transverse direction.
Use case: Materials and Process specs. Compact structures for sharing information. Minimize duplication.
In content extraction from tables, a human extractor formalizes the data using “predefined” tables, and a wizard then generates LOD from it. The extractor is responsible for gleaning the semantics (manual part); the wizard is responsible for the mechanical translation (automatic part). The yardstick of success is the extent to which regular parts of the table can be automatically assimilated and translated, while leaving more complex parts for manual guidance.