Amit Sheth with TK Prasad, "Semantic Technologies for Big Science and Astrophysics", Invited Plenary Presentation, at Earthcube Solar-Terrestrial End-User Workshop, NJIT, Newark, NJ, August 13, 2014.
Like many other fields of Big Science, Astrophysics and Solar Physics deal with the challenges of Big Data, including Volume, Variety, Velocity, and Veracity. There is already significant work on handling volume-related challenges, including the use of high-performance computing. In this talk, we will mainly focus on the other challenges from the perspective of collaborative sharing and reuse of a broad variety of data created by multiple stakeholders, large and small, along with tools that offer semantic variants of search, browsing, integration and discovery capabilities. We will borrow examples of tools and capabilities from state-of-the-art work in supporting physicists (including astrophysicists) [1], life sciences [2], and material sciences [3], and describe the role of semantics and semantic technologies that make these capabilities possible or easier to realize. This applied and practice-oriented talk will complement more vision-oriented counterparts [4].
[1] Science Web-based Interactive Semantic Environment: http://sciencewise.info/
[2] NCBO Bioportal: http://bioportal.bioontology.org/ , Kno.e.sis’s work on Semantic Web for Healthcare and Life Sciences: http://knoesis.org/amit/hcls
[3] MaterialWays (a Materials Genome Initiative related project): http://wiki.knoesis.org/index.php/MaterialWays
[4] From Big Data to Smart Data: http://wiki.knoesis.org/index.php/Smart_Data
Semantic Technologies for Big Sciences including Astrophysics
1. Semantic Technologies for Big Science and Astrophysics
Invited presentation: EarthCube Solar-Terrestrial End-User Workshop
NJIT, Newark NJ, August 13-15, 2014
Amit Sheth, T. K. Prasad
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
2. Astrophysics
Lots of data
Heterogeneous
Complex
http://en.wikipedia.org/wiki/Astrophysics#mediaviewer/File:NGC_4414_%28NASA-med%29.jpg
3. Challenge
• How can we handle this vast, heterogeneous,
and complex data space?
• Focus on complexity rather than raw processing:
integration, collaboration, reuse
Can Semantic (Web) technologies ease
the challenges and empower the scientists?
4. The Semantic Web vision: 1999-2001
• Sir Tim Berners-Lee, in his 1999 book “Weaving the Web”,
emphasized the significance of metadata about Web
documents.
• A well-known May 2001 article presented an agent- and AI-based
vision for the “next generation of the World Wide Web”
with content amenable to automation.
• With Taalee (later Voquette, then Semagix), which I founded in
1999, I pursued a highly practical realization with semantic
search, browsing and analysis products. It had commercial
applications starting in 2000 and a patent awarded in 2001.
6. 1
• Agreement and Knowledge: Agreement about a
common vocabulary/nomenclature, conceptual
models and domain knowledge, ontology
– Codified as Schema + Knowledge Base.
– Agreement is what enables interoperability.
– Formal machine processable description is what
leads to automation.
– Manual, semi-automated, automated creation of
ontologies
7. 2
• Semantic Annotation (Metadata Extraction):
Associating meaning with data, or labeling data so
it is more meaningful to the system and people.
– Manual
– Semi-automatic (automatic with human
verification)
– Automatic
8. 3
• Reasoning/Computation, Applications:
– Semantics enabled search, browsing
– Data integration, collaboration
– Visualization
– Analyses including pattern discovery, mining, hypothesis
validation
– Answering complex queries, making connections (paths,
sub graphs), supporting discovery
10. Using Semantics to Climb Levels of Abstraction: an example
(Figure: four levels of abstraction; figure labels: SSN Ontology, Intellego)
• Level 3: Interpreted data (abductive) [in OWL], e.g., diagnosis: Hyperthyroidism
• Level 2: Interpreted data (deductive) [in OWL], e.g., threshold: Elevated Blood Pressure
• Level 1: Annotated data [in RDF], e.g., label: Systolic blood pressure of 150 mmHg
• Level 0: Raw data [in TEXT], e.g., number: “150”
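The four levels above can be sketched in a few lines of Python. This is only an illustration, not how Intellego or the SSN ontology work: the class and function names, the 140 mmHg cut-off, and the tiny "explains" table are all assumptions made for the example, and none of it is clinical guidance.

```python
from dataclasses import dataclass
from typing import List, Set

# Level 0: raw data, a bare string with no meaning attached.
raw = "150"


@dataclass(frozen=True)
class Observation:
    """Level 1: annotated data (the slide keeps this level in RDF)."""
    observed_property: str
    value: float
    unit: str


def annotate(raw_value: str) -> Observation:
    # Assumption: context tells us the reading is a systolic blood
    # pressure in mmHg; a real system would read this from metadata.
    return Observation("systolic blood pressure", float(raw_value), "mmHg")


# Assumed threshold, for illustration only.
ELEVATED_SYSTOLIC_MMHG = 140.0


def deduce(obs: Observation) -> Set[str]:
    """Level 2: deductive interpretation, a threshold rule on the annotation."""
    findings: Set[str] = set()
    if (obs.observed_property == "systolic blood pressure"
            and obs.value >= ELEVATED_SYSTOLIC_MMHG):
        findings.add("elevated blood pressure")
    return findings


# Hypothetical background knowledge: condition -> findings it can explain.
EXPLAINS = {
    "hyperthyroidism": {"elevated blood pressure"},
    "some other condition": {"some other finding"},
}


def abduce(findings: Set[str]) -> List[str]:
    """Level 3: abductive interpretation, conditions that explain all findings."""
    if not findings:
        return []
    return sorted(c for c, explained in EXPLAINS.items() if findings <= explained)


obs = annotate(raw)
findings = deduce(obs)
print(obs, findings, abduce(findings))
```

Each step adds meaning that the previous level lacks: the string becomes a typed observation, the observation becomes a finding, and the finding becomes a candidate explanation.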
11. Semantic Web technologies – in practice
● Ontologies to capture domain knowledge (sometimes
taxonomy/nomenclature is good enough)
● Languages to represent/capture domain knowledge
and data - OWL, RDF/RDFS.
● Data sharing and publishing online (e.g., LOD).
● Annotation, semantic search, semantic browsing
● Provenance,…
Widely used in biomedicine; quite a few applications in
healthcare, growing use and explorations in geosciences
and more…
12. In this talk, I will review/borrow from
• ScienceWISE at EPFL which uses semantic
technology to serve Physicists including
Astrophysicists: shared vocabulary, annotation,
browsing for related concepts
• Semantic (web) technologies for health care and
life sciences encompassing collaborative research,
prototypes, open source tools and ontologies,
deployed applications, commercialization,…
• MaterialWays: our project in the Materials Genome
Initiative …
13. “Ontology” in physics domain – ScienceWISE
● ScienceWISE
WISE - Web based Interactive Semantic Environment
● An interactive and crowdsourced tool to capture
knowledge from scientists’ daily routine work.
● Core consists of a community built ontology.
● Literature gets annotated and bookmarked using
the ontology.
15. Value Proposition
Associating machine-processable semantics
with scientific, engineering data and
documents can help overcome challenges
associated with data discovery, integration
and interoperability caused by data
heterogeneity.
16. Benefits of using semantics for Astrophysicists (and other sciences)
• Challenges
– Massive volume
– Heterogeneity (i.e., from many sources, format/structure, text,
images).
– Interoperability and sharing data
– Provenance and Access Control.
• Need techniques beyond ScienceWISE
– Interested in data beyond scientific publications
– Data sharing (and credit/data citation for data sharing)
– Provenance and Access control
– A framework to capture, search, and discover astrophysical
data
17. Nature of Data and Documents
Relational/Tabular Data
XML document
Image
Technical Specs
Irregular Tables
Publications
18. Granularity of Semantics and Applications: Examples
• Synonyms
– Chemistry, Chemical Composition, Chemical Analysis, ...
– Bend Test, Bending, ...
– Delivery Condition, Process/Surface Finish, Temper, "as received by
purchaser", ...
• Coreference vs broadening/narrowing
– Tubing vs welded tubing vs flash-welded part
• Capturing characteristic-value pairs
– Recognize and Normalize: “0.1 inch and under in nominal thickness”
is translated to “Thickness <= 0.1 in”.
– Glean elided characteristic: controlled term “solution heat treated”
implies the characteristic “heat treat type”.
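The two characteristic-value bullets above can be illustrated with a small Python sketch. The regular expression, the comparator table, and the lookup table are assumptions written for this one example phrase; a real specification parser would need far broader coverage.

```python
import re
from typing import Optional, Tuple

# Phrases that signal a comparison operator (assumed, not exhaustive).
COMPARATORS = {"and under": "<=", "and over": ">="}
# Unit spellings mapped to one normalized form (assumed).
UNITS = {"inch": "in", "inches": "in", "in": "in"}

PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>inches|inch|in)\.?\s+"
    r"(?P<cmp>and under|and over)\s+in\s+(?:nominal\s+)?(?P<char>\w+)",
    re.IGNORECASE,
)


def normalize(text: str) -> Optional[str]:
    """Recognize a 'value unit and under/over in characteristic' phrase
    and rewrite it as 'Characteristic <op> value unit'."""
    match = PATTERN.search(text)
    if match is None:
        return None
    op = COMPARATORS[match.group("cmp").lower()]
    unit = UNITS[match.group("unit").lower()]
    return f"{match.group('char').capitalize()} {op} {match.group('value')} {unit}"


# Hypothetical lookup from a controlled term to the characteristic it implies.
IMPLIED_CHARACTERISTIC = {"solution heat treated": "heat treat type"}


def glean(term: str) -> Optional[Tuple[str, str]]:
    """Return (characteristic, value) for a controlled term that omits
    its characteristic, or None if the term is unknown."""
    key = term.lower()
    characteristic = IMPLIED_CHARACTERISTIC.get(key)
    return (characteristic, key) if characteristic else None


print(normalize("0.1 inch and under in nominal thickness"))
print(glean("Solution heat treated"))
```

The first function handles the explicit pair; the second fills in a characteristic that the text leaves implicit, which is where a domain vocabulary earns its keep.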
19. Granularity of Semantics and Associated Applications
• Lightweight semantics: File and document-level
annotation to enable discovery and sharing
• Richer semantics: Data-level annotation and
extraction for semantic search and summarization
• Fine-grained semantics: Data integration and
interoperability.
20. Using Semantic Web Technologies
Machine-processable semantics achieved by
addressing
• Syntactic Heterogeneity: Using XML syntax and
RDF datamodel (labelled graph structure)
• Semantic Heterogeneity:
– Using “common” controlled vocabularies, taxonomies
and ontologies
– Using federated data sources, exchanges, querying,
and services
21. Ingredients for Semantics-based Cyber Infrastructure
• Use of community-ratified controlled
vocabularies and lightweight ontologies
(upper-level, hierarchies)
• Semi-automatic annotation of data and
documents
• Support for provenance and access control
22. A proposed “light-weight semantics” approach
(for highly distributed community, low start up time, long tail science)…
23. Our applications in the Materials Genome Initiative
MaterialWays (our project related to the Materials Genome Initiative):
http://wiki.knoesis.org/index.php/MaterialWays
24. Matvocab home page
Search and discovery
Annotate documents
Visualize the knowledge base
Create process assertions
Query vocabulary
View, edit, and add
26. Annotate, search, and track provenance
• Vocabulary is used to annotate documents.
• Annotated documents can be indexed.
• Documents can be integrated reliably based
on common terms of interest and
provenance information.
28. Create process assertions (OnCET)
• Add information about inputs to and outputs
of a process as assertions in triple form
using standard vocabulary.
• Add assertions about materials domain
knowledge using vocabulary terms and the
relationships among them, e.g., about
process control parameters and
performance characteristics.
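A minimal sketch of process assertions in triple form, in plain Python. This is not OnCET's data model: the vocabulary terms, the relation names, and the helper functions are hypothetical and chosen only to show the idea of restricting assertions to an agreed vocabulary.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical controlled vocabulary; real terms would come from the
# community-agreed vocabulary.
TERMS = {"autoclave cure", "PMC lay-up", "PMC", "cure temperature", "tensile strength"}
RELATIONS = {"has input", "has output", "has control parameter", "affects"}


def make_assertion(subject: str, predicate: str, obj: str) -> Triple:
    """Build a triple, rejecting anything outside the agreed vocabulary."""
    for term in (subject, obj):
        if term not in TERMS:
            raise ValueError(f"unknown vocabulary term: {term!r}")
    if predicate not in RELATIONS:
        raise ValueError(f"unknown relationship: {predicate!r}")
    return Triple(subject, predicate, obj)


assertions: List[Triple] = [
    # Inputs and outputs of a process.
    make_assertion("autoclave cure", "has input", "PMC lay-up"),
    make_assertion("autoclave cure", "has output", "PMC"),
    # Domain knowledge about control parameters and performance.
    make_assertion("autoclave cure", "has control parameter", "cure temperature"),
    make_assertion("cure temperature", "affects", "tensile strength"),
]


def objects(subject: str, predicate: str) -> List[str]:
    """Answer 'what does <subject> <predicate>?' over the stored triples."""
    return [t.obj for t in assertions if t.subject == subject and t.predicate == predicate]


print(objects("autoclave cure", "has input"))
```

Because every assertion uses the same terms, assertions from different authors can be queried together without first reconciling their wording.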
29. Provenance Metadata
• Explains the origin of an artifact, such as
– How was it created?
– Who created it?
– When was it created?
• Example: for a given material X
– Which processes are involved in making the material and
what are the relevant performance properties?
– What are the inputs, control parameters and outputs of a
process?
– Which research/engineering team performed an
experiment?
30. Capturing provenance metadata - iExplore
(Figure: process-flow graph. Nodes: generic PMC prepreg, generic hand lay-up, generic PMC lay-up, generic autoclave cure, generic PMC. Edge labels: subjected to, yields.)
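The provenance graph on this slide can be walked backwards to answer "which processes and inputs led to this material?". The sketch below is an assumption-laden illustration: the ordering of the edges is one plausible reading of the figure's node and edge labels, and the function is not part of iExplore.

```python
from typing import List, Tuple

# Assumed reading of the figure, as (subject, predicate, object) edges.
EDGES = [
    ("generic PMC prepreg", "subjected to", "generic hand lay-up"),
    ("generic hand lay-up", "yields", "generic PMC lay-up"),
    ("generic PMC lay-up", "subjected to", "generic autoclave cure"),
    ("generic autoclave cure", "yields", "generic PMC"),
]


def lineage(product: str) -> List[Tuple[str, str]]:
    """Walk backwards from a product and return (process, input material)
    steps, most recent first. Assumes the graph has no cycles."""
    # product -> process that yields it
    produced_by = {o: s for s, p, o in EDGES if p == "yields"}
    # process -> material that was subjected to it
    input_of = {o: s for s, p, o in EDGES if p == "subjected to"}
    steps = []
    current = product
    while current in produced_by:
        process = produced_by[current]
        material = input_of.get(process)
        if material is None:
            break
        steps.append((process, material))
        current = material
    return steps


for process, material in lineage("generic PMC"):
    print(f"{process} <- {material}")
```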
31. Vocabulary Provenance
ASM Handbook
MIL Handbook 5
MIL Handbook 17
Vocabulary terms
Wiki-based Crowd-sourcing Vocabulary
Vocabulary term expressed in RDF and published online (http://knoesis.org/matvocab/A-basis)
33. Our proposal - Astrophysics
• Tagging, annotation, search
• Knowledgebase ->
Ontology
• Provenance – at every data
level
• Data access control
• Capture process flows
• Capture relationships
between concept instances
• Visualization of process
flows
ScienceWISE - Physics
• Tagging, annotation, search
• Ontology ->
Knowledgebase
• Provenance
34. Our approach to help in Astrophysics
• Access control and provenance details at every
data level, to handle the huge amount of
astrophysics data.
• Create relationships between concepts and
visualize them in graph format.
• Add facts or assertions about each concept.
37. Public-Private Data Sharing
• Enhance publicly available datasets while
retaining intellectual property data privately for
businesses
Private data and metadata
(e.g. ongoing experimental processes, intellectual property data)
Selectively shared data and metadata
(e.g. with ongoing collaborators, licensed data)
Public data and metadata
(e.g., released products, material specifications)
38. Federated Architecture
(Figure: a Federal Endpoint with 1. User Authentication, 2. Federated Semantic Query Processor, and 3. Semantics Mappings, connected to three components: OEM partner A, OEM partner B, and OEM supplier C. Each component holds Private, Shared, and Public data behind its own AC Processor and Semantic Query Processor.)
39. Principles of a Federation
• Each component controls access to its local data
independently (local autonomy).
• A query is decomposed into multiple sub-queries;
each sub-query is executed at one component.
• Results from sub-queries are combined by the
federated query processor, which controls global access.
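The three principles can be sketched as follows. This is a toy model rather than the architecture's actual query processor: the component records, the visibility labels, the user names, and the one-sub-query-per-predicate decomposition are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Component:
    name: str
    # Each record is (subject, predicate, object, visibility).
    records: List[Tuple[str, str, str, str]]
    collaborators: Set[str] = field(default_factory=set)

    def execute(self, predicate: str, user: str) -> List[Tuple[str, str, str]]:
        """Run one sub-query locally under this component's own access rules
        (local autonomy). Private records are never returned."""
        allowed = {"public"}
        if user in self.collaborators:
            allowed.add("shared")
        return [(s, p, o) for s, p, o, vis in self.records
                if p == predicate and vis in allowed]


def federated_query(predicates: List[str], components: List[Component],
                    user: str) -> List[Tuple[str, str, str]]:
    results = []
    for predicate in predicates:          # decompose the query
        for component in components:      # each sub-query runs at one component
            results.extend(component.execute(predicate, user))
    return sorted(set(results))           # combine and de-duplicate


partner_a = Component(
    "OEM partner A",
    [("alloy-1", "has density", "2.7 g/cm3", "public"),
     ("alloy-1", "has supplier", "supplier-C", "shared"),
     ("alloy-1", "has cost", "confidential", "private")],
    collaborators={"alice"},
)
partner_b = Component(
    "OEM partner B",
    [("alloy-2", "has density", "4.5 g/cm3", "public")],
)

print(federated_query(["has density"], [partner_a, partner_b], "bob"))
print(federated_query(["has supplier", "has cost"], [partner_a, partner_b], "alice"))
```

The federated processor never inspects visibility itself; it only merges what each component chooses to return.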
40. Can we choose any part of our
Semantic Web data
to share with the public community,
or with selected collaborators?
41. Different levels of granularity
– Individual resources
• Example: a material product, a manufacturing process
– Individual triples
• Example: properties of a product, or process
– Entire datasets
Enable flexible selection of any piece of data to be
shared at any time
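A minimal sketch of sharing at the three granularities listed above: individual resources, individual triples, and entire datasets. The class, the dataset name, and the sample triples are hypothetical and are not the project's access control implementation.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

DATASET = "materials-catalog"  # hypothetical dataset name
TRIPLES: Set[Triple] = {
    ("product-1", "has name", "Alloy sheet"),
    ("product-1", "has yield strength", "276 MPa"),
    ("process-7", "has temperature", "180 C"),
}


class SharingPolicy:
    """Tracks what has been shared at each level of granularity."""

    def __init__(self) -> None:
        self.shared_datasets: Set[str] = set()
        self.shared_resources: Set[str] = set()
        self.shared_triples: Set[Triple] = set()

    def is_shared(self, dataset: str, triple: Triple) -> bool:
        # A triple is visible if its dataset, its subject resource,
        # or the triple itself has been shared.
        return (dataset in self.shared_datasets
                or triple[0] in self.shared_resources
                or triple in self.shared_triples)

    def visible(self, dataset: str, triples: Set[Triple]) -> List[Triple]:
        return sorted(t for t in triples if self.is_shared(dataset, t))


policy = SharingPolicy()
policy.shared_triples.add(("process-7", "has temperature", "180 C"))
print(policy.visible(DATASET, TRIPLES))   # one triple

policy.shared_resources.add("product-1")
print(policy.visible(DATASET, TRIPLES))   # all triples about product-1, plus the one above

policy.shared_datasets.add(DATASET)
print(policy.visible(DATASET, TRIPLES))   # everything in the dataset
```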
42. Federal Endpoint
(Figure: 1. Query Rewriting and 2. AC-embedded Query Execution at the Federal Endpoint. Local Component A runs the AC processes: Creating Resources, Granting Permissions, Inferring Permissions. Actors: User X, of either the Public group or Collaborators, and Manager Y of component A.)
43. Various Policies
• Role-based Access Control (RBAC)
• Mandatory Access Control (MAC)
• Attribute-based Access Control (ABAC)
• Discretionary Access Control (DAC)
1. Which policy? Depends on the
organization’s needs!
2. Our AC mechanism can be extended to
support any of these policies.
44. Advanced capability: semantic browsing
• Example of Scooner:
http://wiki.knoesis.org/index.php/Scooner
• Demo:
http://knoesis.wright.edu/library/demos/scooner-demo/
45. Take Away
Use of semantic web technologies
can help overcome challenges associated with
data discovery, integration, and interoperability,
caused by data heterogeneity.
Using provenance and access control information
helps share/exchange data reliably.
46. Kno.e.sis
Thank you, and please visit us at
http://knoesis.org/
http://wiki.knoesis.org/index.php/MaterialWays
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA
Special Thanks (MaterialWays team): Clare Paul (AFRL),
Kalpa Gunaratna, Vinh Nguyen, Sarasi Lalithsena, Swapnil Soni, Nitisha Jayakumar, Siva Cheekula.
Editor's Notes
Additional areas that can benefit => Geoscience
Syntactic (format) and semantic (domain models, perspectives) heterogeneity
(text vs Excel vs XML) (units of measure, well-entrenched vocabularies)
Use Case: Materials and process specifications
Variety challenge: Sources of heterogeneity
syntactic (excel, XML, text) vs semantic (UOM, controlled terms)
Attribute value-pairs : explicit vs implicit
: conditioned on shape, dimension
(making these connections explicit from text doc non-trivial)
Table captions : Use text-based metadata to help mediate
=> tabular data
(Ref: B 50T26 S7, Sections 1, 4.2, 4.4)
Synonyms: stemming (syntactic) to richer thesaurus (simple KB)
(to MAP doc / text strings to domain concepts / ontology)
Coreference issues:
Purpose of Semantics => What is literally given vs what is really meant?
E.g., KB says welded tubing ISA tubing, but in a paragraph that
describes ‘welded tubing’, one can refer to it using “the tubing”.
RECALL: materials and process specs typically describe: composition, processing, testing, and packaging of material
Formalizing a procedure (a process or a test) as an aggregation of characteristic/parameter-value pairs
Besides determining related phrases using clause, line, paragraph boundary, etc.
we may need to use semantic/domain model/ontology to normalize or fill-in implicit details
==============================
PLUS NLP-lite issues:
There is confusion regarding the distribution of “and” over “or”, and over the interpretation of “and” and “or”.
For instance, is “X or Y and Z” = “X and Z or Y and Z”?
Similarly, “and” in the context “P is X and Y” connotes intersection,
while “and” in the context of “P and Q are X” connotes union.
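The question about “X or Y and Z” can be checked mechanically by comparing the two candidate readings over all truth values. The sketch below assumes the usual precedence, where “and” binds tighter than “or”, for the first reading; the function names are illustrative.

```python
from itertools import product


def reading_a(x: bool, y: bool, z: bool) -> bool:
    # "X or (Y and Z)": "and" binds tighter than "or".
    return x or (y and z)


def reading_b(x: bool, y: bool, z: bool) -> bool:
    # "(X and Z) or (Y and Z)", which is the same as "(X or Y) and Z".
    return (x and z) or (y and z)


# Every assignment of truth values on which the two readings disagree.
disagreements = [
    values for values in product([False, True], repeat=3)
    if reading_a(*values) != reading_b(*values)
]
print(disagreements)
```

The readings disagree exactly when X holds and Z does not, so a specification clause that mixes “and” and “or” cannot be normalized without deciding which reading the author meant.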
------------
Ingot chemistry vs product chemistry
Semantics at different levels of detail and developed in stages : “Rome was not built in a day”! :
Cost-benefit trade-offs
------------------------------------------------------
ANALOGY: Table of content (top-down, prescribed, static) vs Index (bottom-up, gleaned to describe, dynamic)
--------------------------------------------------------
Controlled vocabularies <= Lightweight ontologies [ legacy vocab + community agreed semantic relationships] <= Formal ontologies
Original document vs its translation => traceability (provenance)
---------
Past Research: We have dealt with top-down UMLS ontology vs bottom-up facts from PubMed in HPCO
(Literature-based discovery -> LBD)
---------
Pick from existing upper-level ontology vocabulary => manual ; indexing
table columns, rows, captions
Semi-automatic metadata generation/embedding =>
annotation: mapping text to concept; summarization: triple extraction => semantic search with bg KB
Translation and summarization - [Integration and Interoperation requires Alignment of vocabularies]
Graphical representation and querying
Literature-based discovery: navigate through the documents based on path search through their LOD renditions (extractions)
-----------------------------
= LOD Eventually allows combining and comparing specs
==============================
Biomaterials use case: Gold surface affinity of peptide sequence
===================
--------------
Compare, manipulate, and combine specs
-----------------------
Unification – integration vs federation – interoperation/mediation
Less training
ASTM, NIST, MIL-stds (Handbook 21, 5)
Flat list of terms and their associated definitions
Hierarchical organization of properties, alloys, performance metrics, …
Cross relationships: (1) Qualitative dependencies (proportionality)
(2) Quantitative dependencies (equations/formula)
Vocabulary created in the previous step used to automatically annotate the set of documents
Definition
Example
Our tool i-Explore lets users browse how a particular output product is created, the processes involved, and the input materials.
We created a MediaWiki extension for managing and editing vocabulary terms. Due to the nature of the materials science domain, one term may have been defined differently in multiple sources.
The source/right provenance metadata is captured in our provenance data model.
Data is spread across heterogeneous sources but is inaccessible to researchers and engineers: private lab info, a desktop, a notebook, behind a firewall.
To make it easy for everyone: a single access point to search for all publicly available information about materials?
For each organization, such as a research lab or an industrial company, there are three kinds of data: private, selectively shared, and public.
Semantics Mappings
To meet customized needs of different organizations
By capturing the access control primitive operators in processes
1) A manager Y of a local component can grant access to individual users or to a group of users.
The Public group is dedicated to the entire federated system. Any resource granted to this Public group is available to everyone.
2) Meanwhile, we are also able to track any access rights in the system.
One important scenario: a manager Y can ask why a suspicious user has access to an important resource.
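The scenario in the last note, asking why a user has access to a resource, can be sketched by recording every grant and then searching the grants that apply. The data model below is an assumption for illustration (grant records, group table, and user names are hypothetical), not the system's actual access control processes.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Grant:
    grantor: str
    grantee: str   # a user name or a group name
    resource: str


PUBLIC = "Public"  # every user belongs to this group
GROUP_MEMBERS: Dict[str, Set[str]] = {"Collaborators": {"user-x"}}

# Hypothetical grant log kept by the access control processes.
GRANTS: List[Grant] = [
    Grant("manager-y", "Collaborators", "process-7"),
    Grant("manager-y", PUBLIC, "product-1"),
    Grant("manager-y", "user-z", "process-7"),
]


def explain_access(user: str, resource: str) -> List[Grant]:
    """Return every recorded grant that gives `user` access to `resource`,
    whether granted directly or through a group."""
    reasons = []
    for grant in GRANTS:
        if grant.resource != resource:
            continue
        direct = grant.grantee == user
        via_group = (grant.grantee == PUBLIC
                     or user in GROUP_MEMBERS.get(grant.grantee, set()))
        if direct or via_group:
            reasons.append(grant)
    return reasons


for reason in explain_access("user-x", "process-7"):
    print(f"{reason.grantor} granted {reason.resource} to {reason.grantee}")
```

An empty result means no recorded grant justifies the access, which is itself useful to a manager auditing a suspicious account.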