Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analytics, and Applications

http://wiki.knoesis.org/index.php/MaterialWays
http://www.knoesis.org/?q=research/semMat

Abstract

The sharing, discovery, and application of materials science and engineering data and documents are possible only if domain scientists are able and willing to share them. We need to overcome technological challenges, such as developing convenient computational tools and repositories conducive to easy exchange, curation, attribution, and analysis of data, and cultural challenges, such as ensuring proper protection, control, and credit for sharing data. Our thesis and value proposition is that associating machine-processable semantics with materials science and engineering data and documents can provide a solid foundation for overcoming the challenges of data discovery, integration, and interoperability caused by data heterogeneity. Specifically, lightweight semantics in the form of file-level annotation, which is easy to use and has a low upfront cost, can enable document discovery and sharing, while deeper data-level annotation using standardized ontologies can benefit semantic search and summarization. Machine processability achieved through fine-grained semantic annotation, extraction, and translation can enable data integration, interoperability, and reasoning, ultimately leading to Linked Open Materials Science Data. Thus, different granularities of semantics provide a continuum of trade-offs between cost/ease of use and expressiveness. In this presentation, we also show the application of semantic techniques for content extraction from materials and process specifications, which are semi-structured and table-rich, and the application of semantic web techniques and technologies for materials vocabulary integration and curation (via Semantic MediaWiki), semantic web visualization, efficient representation of provenance metadata and access control (via the singleton property), and biomaterials information extraction.

  1. 1. Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analytics, and Applications Krishnaprasad Thirunarayan (T. K. Prasad) and Amit Sheth Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing 1
  2. 2. Relevant Funded Projects : A Brush with Pain Points and Promise • Semantic Web-based Data Exchange and Interoperability for OEM-Supplier Collaboration (Pratt and Whitney) (2014-2015) • KDDM: Federated Semantic Services Platform for Open Materials Science and Engineering (AFRL) (2013-2016) • Computer Assisted Document Interpretation Tools. (NSF SBIR Phases I and II with Cohesia Corp.) (1999-2002) • Document => Materials and Process Specs (alloys) 2
  3. 3. Selected URLs and Publications • http://www.knoesis.org/?q=research/semMat • http://wiki.knoesis.org/index.php/MaterialWays • Nishita Jaykumar, PavanKalyan Yallamelli, Vinh Nguyen, Sarasi Lalithsena, Krishnaprasad Thirunarayan, Amit Sheth. KnowledgeWiki: An OpenSource Tool for Creating Community-Curated Vocabulary, with a Use Case in Materials Science. In LDOW - WWW 2016. Montreal, Canada; 2016. • Vinh Nguyen, Olivier Bodenreider, Amit Sheth. Don't like RDF Reification? Making Statements about Statements using Singleton Property. 23rd International conference on World Wide Web (WWW 2014). NY: ACM; 2014. p. 759-770. • Krishnaprasad Thirunarayan, Amit Sheth, Kalpa Gunaratna, Vinh Nguyen, Siva Cheekula, Sarasi Lalithsena, Nishita Jaykumar, Swapnil Soni, Clare Paul. Architecture and Prototype for Materials Knowledge Management System using Semantic Web Technologies and Techniques: A Preliminary Report. WSU, 2014 3
  4. 4. Selected URLs and Publications • Krishnaprasad Thirunarayan, On Embedding Machine-Processable Semantics into Documents, In: IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 7, pp. 1014-1018, July 2005. • K. Thirunarayan, A. Berkovich, and D. Sokol, An Information Extraction Approach to Reorganizing and Summarizing Specifications, In: Information and Software Technology Journal, Vol. 47, Issue 4, pp. 215-232, 2005. • K. Thirunarayan, A. Berkovich, and D. Sokol, Semi-automatic Content Extraction from Specifications, In: Proceedings of 6th International Conference on Applications of Natural Language to Information Systems, LNCS 2553, pp. 40-51, June 2002. 4
  5. 5. Outline • Domain Goals and Challenges • Utility and Continuum of Machine-Processable Semantics : An Architecture • What?: Nature of Data and Granularity of Semantics • Why?: Lightweight semantics and its benefits • How?: Community-ratified Ontologies + Semantic Annotations of Data and Documents + Linked Open Materials Data • Applications: • (Skip) Long-term Research: Processing Tabular Data • Integrating vocabularies : Matvocab KnowledgeWiki use case • Document Annotation : Biomaterials use case • Visualization and Navigation : iExplore • Private-Public Data Sharing • Conclusion 5
  6. 6. Domain Goals and Challenges • Materials Science and Engineering Data and Document sharing, discovery, and application are possible only if domain scientists are able and willing to do so. • Technological challenges – Computational tools and repositories conducive to easy exchange, curation, attribution, and analysis of data • Cultural challenges – Proper protection, control, and credit for sharing data 6
  7. 7. Our Thesis / Value Proposition Associating machine-processable semantics with materials science and engineering data and documents can help overcome challenges associated with data discovery, integration and interoperability caused by data heterogeneity. 7
  8. 8. What?: Nature of Data • Structured Data (e.g., relational) • Semi-structured, Heterogeneous Documents (e.g., publications and technical specs which usually include text, numerics, units of measure, images and equations) • Tabular data (e.g., ad hoc spreadsheets and complex tables incorporating “irregular” entries) 8
  9. 9. 9 Fragment of Materials and Process spec for: Ti Alloy Bars, Wire, Forgings, and Rings.
  10. 10. What?: Granularity of Semantics and Applications: Examples • Synonyms – Chemistry, Chemical Composition, Chemical Analysis, ... – Bend Test, Bending, ... – Delivery Condition, Process/Surface Finish, Temper, "as received by purchaser", ... • Co-reference vs broadening/narrowing – Tubing vs welded tubing vs flash-welded part • Capturing characteristic-value pairs – Recognize and Normalize: “0.1 inch and under in nominal thickness” is translated to “Thickness <= 0.1 in”. – Glean elided characteristic: controlled term “solution heat treated” implies the attribute “heat treat type”. 10
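The "recognize and normalize" and "glean elided characteristic" steps on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' extraction tool: the regular expression, the `COMPARATORS` and `UNITS` tables, and the function names are hypothetical, and only the two examples given on the slide are known to be covered.

```python
import re

# Hypothetical lookup tables; the slide only shows the "and under" / inch case.
COMPARATORS = {"and under": "<=", "and over": ">="}
UNITS = {"inch": "in", "inches": "in"}

# Matches phrases shaped like "<number> inch and under in nominal <attribute>".
PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>inch(?:es)?)\s+"
    r"(?P<cmp>and under|and over)\s+in\s+nominal\s+(?P<attr>\w+)",
    re.IGNORECASE,
)

# Controlled term -> attribute it implies (second slide example).
IMPLIED_ATTRIBUTE = {"solution heat treated": "heat treat type"}


def normalize(phrase):
    """Translate a size phrase into 'Attribute op value unit', or None if unrecognized."""
    m = PATTERN.search(phrase)
    if m is None:
        return None
    attr = m.group("attr").capitalize()
    op = COMPARATORS[m.group("cmp").lower()]
    unit = UNITS[m.group("unit").lower()]
    return f"{attr} {op} {m.group('value')} {unit}"


def glean(term):
    """Return (attribute, value) for a controlled term whose attribute is elided."""
    attribute = IMPLIED_ATTRIBUTE.get(term.lower())
    return (attribute, term) if attribute else None


normalized = normalize("0.1 inch and under in nominal thickness")
gleaned = glean("solution heat treated")
```

A table-driven design keeps the domain knowledge (comparators, units, implied attributes) in data that a domain expert could extend without touching the matching logic.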
  11. 11. 1 2 3 of Semantic Web
  12. 12. 1 • Ontology: Agreement about a common vocabulary/nomenclature, conceptual models and domain knowledge – Codified as Schema + Knowledge Base. – Agreement is what enables interoperability. – Formal machine processable description is what leads to automation.
  13. 13. 2 • Semantic Annotation (Metadata Extraction): Associating meaning with data, or labeling data so it is more meaningful to the system and people. – Manual – Semi-automatic (automatic with human verification) – Automatic
  14. 14. 3 • Reasoning/Computation: – Semantics enabled search – Data integration – Answering complex queries and making connections (paths, sub-graphs) – Analyses including pattern discovery, mining, hypothesis validation – Visualization
  15. 15. How to integrate well? From Syntax to Semantics 15
  16. 16. SSN Ontology 2 Interpreted data (deductive) [in OWL] e.g., threshold 1 Annotated Data [in RDF] e.g., label 0 Raw Data [in TEXT] e.g., number Using Semantics to Climb Levels of Abstraction: an example 3 Interpreted data (abductive) [in OWL] e.g., diagnosis Intellego “150” Systolic blood pressure of 150 mmHg Elevated Blood Pressure Hyperthyroidism …… 16
  17. 17. Semantic Web Data Subject Object Predicate A triple is in the format (Subject, Predicate, Object). An RDF Dataset is a set of triples.
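A minimal sketch of the triple model described on this slide, using only the Python standard library. The `ex:` identifiers and the `match` helper are illustrative assumptions and do not come from the presentation; a real system would use an RDF library or a triple store.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(dataset, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every argument that is not None."""
    return {
        t for t in dataset
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    }


# An RDF dataset is a *set* of triples, so a repeated triple is stored once.
# The ex: names below are hypothetical.
dataset = {
    Triple("ex:Autoclave", "rdf:type", "ex:Equipment"),
    Triple("ex:Autoclave", "ex:hasDefinition", "A closed vessel"),
}
dataset.add(Triple("ex:Autoclave", "rdf:type", "ex:Equipment"))  # duplicate, no effect

type_triples = match(dataset, predicate="rdf:type")
```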
  18. 18. What?: Granularity of Semantics and Associated Applications • Lightweight semantics: File and document-level annotation to enable discovery and sharing • Richer semantics: Data-level annotation and extraction for semantic search and summarization • Fine-grained semantics: Data integration, interoperability and reasoning in Linked Open Materials Science Data 18
  19. 19. Computer Assisted Document Extraction Tool • Tree/Structure view of the Spec • Typical view of the tagged Spec
  20. 20. Computer Assisted Document Extraction Tool Example: Procedure Melt Methods View of the Original Spec Tagged Spec Tag Editor
  21. 21. Computer Assisted Document Extraction Tool • The SDL • Few More Examples: Procedure Melt Methods • Tag Editor
  22. 22. Why?: Benefits of Lightweight Semantics • Ease of use by domain experts – Faster and wider adoption, promoting evolution • Low upfront cost to support • Shallow semantics has wider applicability to a range of documents/data and appeal to a broader community • Bottom-line: “Learn to Walk before we Run” 22
  23. 23. How?: Using Semantic Web Technologies Machine-processable semantics achieved by addressing • Syntactic Heterogeneity: Using XML syntax and RDF datamodel (labelled graph structure) • Semantic Heterogeneity: – Using “common” controlled vocabularies, taxonomies and ontologies – Using federated data sources, exchanges, querying, and services 23
  24. 24. How?: Ingredients for Semantics-based Cyber Infrastructure • Use of community-ratified controlled vocabularies and lightweight ontologies (upper-level, hierarchies) • Ease registration, publishing, and discovery • Provide support for provenance and access control • Track data citation for credit for data sharing • Semi-automatic annotation of data and documents : Manual + Automatic 24
  25. 25. How?: Search Continuum • Keyword-based full-text search • + Manually provided content and source metadata • Uses upper-level ontology • + Automatically extracted metadata • Map text to concepts/properties/values • Semantic + faceted search using background knowledge • + Deeper semi-automatic content annotation and extraction • Aggregating related pieces of information; conditioning • Integration and Interoperation • + Linked Open Material Science Data • + Federated and Faceted Querying and Services 25
  26. 26. Linked Open Data • Use “URIs” as identifiers to describe things http://dbpedia.org/resource/John_F._Kennedy • Associate descriptions with the identifiers, e.g., (db:John_F._Kennedy, db:Profession, db:Politician) 26
  27. 27. Linked Open Data • Connect things together: (db:John_F._Kennedy, db:Profession, db:Politician); (ex:John_Kennedy, ex:authored_book, ex:A_Nation_of_Immigrants); (db:John_F._Kennedy, owl:sameAs, ex:John_Kennedy) 27
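The linking step on this slide can be sketched as follows: once two identifiers are connected by `owl:sameAs`, descriptions attached to either one can be read together. This is a simplified, standard-library illustration; the helper names `aliases` and `describe` are hypothetical, and real OWL reasoning covers far more than this symmetric, transitive closure.

```python
TRIPLES = [
    ("db:John_F._Kennedy", "db:Profession", "db:Politician"),
    ("ex:John_Kennedy", "ex:authored_book", "ex:A_Nation_of_Immigrants"),
    ("db:John_F._Kennedy", "owl:sameAs", "ex:John_Kennedy"),
]


def aliases(triples, entity):
    """Collect every identifier reachable from `entity` through owl:sameAs links."""
    same = {entity}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p != "owl:sameAs":
                continue
            if s in same and o not in same:
                same.add(o)
                changed = True
            elif o in same and s not in same:
                same.add(s)
                changed = True
    return same


def describe(triples, entity):
    """Return the (predicate, object) pairs known about the entity under any alias."""
    ids = aliases(triples, entity)
    return {(p, o) for s, p, o in triples if s in ids and p != "owl:sameAs"}


description = describe(TRIPLES, "db:John_F._Kennedy")
```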
  28. 28. Linked Open Data 28
  29. 29. Example: Lightweight Semantic Registration of Data • Title of data • Keywords: selected from five tier vocabulary provided • Type of data: maps, excel files, images, text • Data format: structured or unstructured • Description of data: brief unstructured description of content • Contact information of provider(s): name of provider(s), email for verification, lineage • Spatial extent of data and reference system: location • Temporal extent of data: date range in time or age range if not recent • Date and type of Related Publication(s): Journal, Thesis, Agency report, not published • Host site for publication: Journal, Library, Personal computer • Access restrictions: copyright regulations 29
  30. 30. System Architecture and Components 30
  31. 31. Problems and A Practical Approach (“When rubber meets the road”) Deeper Issues: Semantic Formalization of Tabular Data 31 skip
  32. 32. Nature of tables • Compact structures for sharing information – Minimize duplication • Types of Tables – Regular : Dense Grid with explicit schema information in terms of column and row headings => Tractable – Irregular: Sparse Grid with implicit schema and ad hoc placement of heading => Hard 32
  33. 33. 33
  34. 34. Challenges Associated with Typical Spreadsheet/Table • Meant for human consumption • Irregular : – Not simple rectangular grid • Heterogeneous – All rows not interpreted similarly • Complex – Meaning of each row and each column context dependent • Footnotes modify meaning of entries (esp. in materials and process specifications) 34
  35. 35. Practical Semi-Automatic Content Extraction • DESIGN: Develop regular data structures that can be used to formalize tabular information. – Provide a natural expression of data – Provide semantics to data, thereby removing potential ambiguities – Enable automatic translation • USE: Manual population of regular tables and automatic translation into LOD 35
  36. 36. 36 Our applications in Materials Genome Initiative
  37. 37. Matvocab home page Search and discovery Annotate documents Visualize the knowledge base Query vocabulary View, edit, and add Create and process assertions
  38. 38. 38 Vocabulary Creation / Curation N. Jaykumar, P. Yallamelli, V. Nguyen, S. Lalithsena, K. Thirunarayan, A. Sheth, C. Paul: KnowledgeWiki: An OpenSource Tool for Creating Community Curated Vocabulary, with a Use Case in Materials Science (Linked Data on the Web, World Wide Web Conference 2016)
  39. 39. KnowledgeWiki: An OpenSource Tool for Creating Community-Curated Vocabulary, with a Use Case in Materials Science WWW - LDOW 2016, Canada Nishita Jaykumar, Pavankalyan Yallamelli, Vinh Nguyen, Sarasi Lalithsena, Krishnaprasad Thirunarayan, Amit Sheth Kno.e.sis, Wright State University Clare Paul *Air Force Research Laboratory, Wright-Patterson AFB
  40. 40. 40 • Collaboration with AFRL Context for Research ASM HNDBK MIL HNDBK-5 MIL HNDBK-17 (Standardized Vocabularies) SKOS Dublin Core QUDT VAEM … Crowdsourcing from domain experts Consolidated vocabulary (MatVocab)
  41. 41. Motivating Example • Facts (Name | Definition | Source): – A-Basis | The mechanical property value is the value above which … | ASM Handbook, Volume 21: Composites. – ABasis | A statistically-based material property; a 95% lower… | Composite Materials Handbook - Volume 1. MIL-HDBK-17F-1F, 17 June 2002 – A-Basis | The lower of either a statistically calculated number… | Metallic Materials and Elements for Aerospace Vehicle Structures, MIL-HDBK-5J, 31 January 2003 41
  42. 42. Motivating Example • Facts (Name | Definition | Source): – YoungsModulus | The ratio of normal stress to corresponding … | ASM Handbook, Volume 21: Composites. – ModulusYoungs | The ratio of change in stress to change … | MIL-HDBK-17 • The same term can have multiple definitions, each of which needs to be represented with its provenance information, such as source and time. 42
  43. 43. 43 Related Work Auxiliary node approach A-Basis Auxiliary node1 … A statistically-based material … P26v P26s P580q P582q … • Properties represented in the wikidata model do not correspond to RDF properties • Ad hoc: Lack of formal semantics
  44. 44. • Extension to Mediawiki • We use the Semantic Form extension of Semantic Mediawiki for our task • Inability to represent metadata about the metadata 44 Semantic Mediawiki http://www.slideshare.net/cool_uk/semantic-mediawiki-simple-tutorial Representing entities and simple metadata The '''United Kingdom''' is a country located in [[Located in::Europe]].
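The inline annotation on this slide, `[[Located in::Europe]]`, encodes a property and a value for the page it appears on. A minimal sketch of reading such annotations into triples follows; it handles only the simple `[[property::value]]` form (with an optional `|display text` suffix), which is an assumption covering a small subset of Semantic MediaWiki syntax, and `extract_annotations` is a hypothetical helper, not part of SMW.

```python
import re

# [[Property::Value]] or [[Property::Value|display text]]
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_annotations(page_title, wikitext):
    """Return (page, property, value) triples for each simple inline annotation."""
    return [
        (page_title, prop.strip(), value.strip())
        for prop, value in ANNOTATION.findall(wikitext)
    ]


text = "The '''United Kingdom''' is a country located in [[Located in::Europe]]."
triples = extract_annotations("United Kingdom", text)
```

Each result describes the page itself, which is why this form cannot attach a source or license to an individual statement, the limitation the slide notes as the inability to represent metadata about the metadata.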
  45. 45. 45
  46. 46. Our Approach • Adopted the Singleton Property method for capturing triple metadata in SMW • Importing legacy data with provenance in bulk using the Singleton Property method • Importing existing RDF datasets with provenance into SMW for curation 46
  47. 47. Singleton Property • Facts (Subject | Predicate | Object | Source | License): – Autoclave | hasDefinition | “A closed vessel for producing…” | MIL-HDBK-17F-1F, 17 | All rights reserved • Singleton Property Translation (Subject | Predicate | Object): – hasDefinition#1 | rdf:sp | hasDefinition – Autoclave | hasDefinition#1 | “A closed vessel for producing…” – hasDefinition#1 | hasSource | MIL-HDBK-17 – hasDefinition#1 | hasLicense | All rights reserved • A singleton property represents one specific relationship between two entities under a certain context. It is assigned a URI, as any other property, and can be considered a subproperty or an instance of a generic property. ("Don't like RDF reification?: making statements about statements using singleton property." Proceedings of the 23rd international conference on World wide web. ACM, 2014.) 47
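The translation shown on this slide can be sketched as a small function: a fact plus its metadata becomes one triple linking a fresh singleton property to the generic property, one triple asserting the fact through the singleton property, and one triple per metadata item. The function name, the `#n` numbering scheme via a counter, and the plain-tuple representation are assumptions for illustration; the `rdf:sp` predicate is copied from the slide.

```python
from itertools import count


def to_singleton_triples(subject, predicate, obj, metadata, counter):
    """Translate one fact with metadata into singleton-property triples."""
    singleton = f"{predicate}#{next(counter)}"   # unique per asserted fact
    triples = [
        (singleton, "rdf:sp", predicate),        # link to the generic property
        (subject, singleton, obj),               # the fact itself
    ]
    # Metadata is attached to the singleton property, not to the subject.
    triples.extend((singleton, key, value) for key, value in metadata.items())
    return triples


example = to_singleton_triples(
    "Autoclave",
    "hasDefinition",
    "A closed vessel for producing…",
    {"hasSource": "MIL-HDBK-17", "hasLicense": "All rights reserved"},
    count(1),
)
```

Because the source and license hang off `hasDefinition#1` rather than off `Autoclave`, a second definition from another handbook would get its own singleton property and its own provenance without ambiguity.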
  48. 48. Why use Singleton Property? • Formal semantics defined • Scalable, e.g., to LOD • Compatible with existing standards – RDF, RDFS, SPARQL • Can be used to capture multiple types of metadata – Provenance, time, location • References: Fu, Gang, et al. "Exposing Provenance Metadata Using Different RDF Models." arXiv preprint arXiv:1509.02822 (2015). Nguyen, Vinh, Olivier Bodenreider, and Amit Sheth. Hernández, Daniel, Aidan Hogan, and Markus Krötzsch. "Reifying RDF: What Works Well With Wikidata?." Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems co-located with 14th International Semantic Web Conference (ISWC 2015), Bethlehem, PA, USA. 2015. 48
  49. 49. 49 Singleton v/s Regular Template Autoclave Definition Text Image Source Rights Autoclave Definition Text Image Source Rights Source Rights
  50. 50. Regular vs. Singleton Templates • Singleton template triples (Subject | Predicate | Object): – Autoclave | hasDefinition#1 | “A closed vessel…” – hasDefinition#1 | singletonPropertyOf | skos:definition – hasDefinition#1 | source | “ASM Handbook” – hasDefinition#1 | license | “Reproduced by…” – Autoclave | hasImage#1 | “Image.jpg” – hasImage#1 | singletonPropertyOf | mv:image • Regular template triples (Subject | Predicate | Object): – Autoclave | hasDefinition | “A closed vessel…” – Autoclave | source | “ASM Handbook” – Autoclave | license | “Reproduced by…” – Autoclave | hasImage | “Image.jpg” 50
  51. 51. 51 Overall Architecture
  52. 52. Use Case in Materials Science • Properties of interest to domain experts: – Definition Text – Source – License – Creator – Abbreviation – Synonyms – Units – ….. • mv: is the matvocab namespace 52
  53. 53. Statistics of the Vocabulary Import (# | Use Case Type | SMW): – 1 | Number of vocabularies imported | 3 – 2 | Total number of terms imported from ASM | 1295 – 3 | Total number of terms imported from MIL-HDBK-5 | 19 – 4 | Total number of terms imported from MIL-HDBK-17 | 179 – 5 | Total number of Singleton Templates created | 6 – 6 | Total number of Regular Templates created | 5 – 7 | Total number of pages created | 1,685 53
  54. 54. 54 Search & Discovery
  55. 55. Annotate, search, and track provenance • Vocabulary is used to annotate documents. • Annotated documents can be indexed. • Documents can be integrated reliably based on common terms of interest and provenance information. 55
  56. 56. 56 Annotate documents using standard vocabulary
  57. 57. Provenance Metadata • Explains the origin of an artifact, such as – How was it created? – Who created it? – When was it created? • Example: for a given material X – Which processes are involved in making the material and what are the relevant performance properties? – What are the inputs, control parameters and outputs of a process? – Which research/engineering team performed an experiment?
  58. 58. Capturing and Exploring provenance metadata - iExplore • Graph shown: generic PMC prepreg (subjected to) generic hand lay-up (yields) generic PMC lay-up (subjected to) generic autoclave cure (yields) generic PMC 58
  59. 59. 59 Capturing and Exploring Vocabulary Provenance - iExplore Definition Rights Source Vocabulary term
  60. 60. Biomaterials Knowledge Extraction : Protein/Peptides/Amino Acids-Precious Metal Bindings • Recognition and extraction of crystalline surface patterns for precious metals (e.g., Gold/Silver surface patterns via Miller Indices - Au(100), Au(110), Ag(111)), protein/peptide/amino acid sequences, and indicators of binding relationship. – Example Input: They found that an alanine-substituted peptide (AYSSGAPPAPPF) exhibited the highest affinity for gold, while a proline-substituted peptide (AYPPGAPPMPPF) showed almost no affinity. 60
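A rough sketch of the three recognition targets named on this slide (metal surfaces written with Miller indices, peptide sequences, and indicators of binding), applied to the slide's example input. The regular expressions, the minimum peptide length of five residues, the cue phrases, and the "pair each peptide with the next cue" heuristic are all assumptions made for illustration; they are not the extraction method used in the project.

```python
import re

# Gold/silver surface patterns such as Au(100) or Ag(111).
SURFACE = re.compile(r"\b(Au|Ag)\((\d{3})\)")
# Runs of one-letter codes for the 20 standard amino acids (length >= 5 is assumed).
PEPTIDE = re.compile(r"\b[ACDEFGHIKLMNPQRSTVWY]{5,}\b")
# Hypothetical cue phrases signalling the strength of a binding relationship.
CUE = re.compile(r"\b(highest|high|low|almost no|no)\s+affinity\b")


def find_surfaces(text):
    """Return (metal, Miller index) pairs in order of appearance."""
    return SURFACE.findall(text)


def pair_peptides_with_cues(text):
    """Pair each peptide with the first affinity cue that follows it."""
    cues = [(m.start(), m.group(0)) for m in CUE.finditer(text)]
    pairs = []
    for pep in PEPTIDE.finditer(text):
        following = [cue for pos, cue in cues if pos > pep.end()]
        if following:
            pairs.append((pep.group(0), following[0]))
    return pairs


sentence = (
    "They found that an alanine-substituted peptide (AYSSGAPPAPPF) exhibited "
    "the highest affinity for gold, while a proline-substituted peptide "
    "(AYPPGAPPMPPF) showed almost no affinity."
)
pairs = pair_peptides_with_cues(sentence)
surfaces = find_surfaces("Au(100), Au(110), Ag(111)")
```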
  61. 61. 61
  62. 62. Semantic Web Based Data Exchange and Interoperability for OEM-Supplier Collaboration 62
  63. 63. Goal and Example Accomplishment • Implement a Collaboration Platform using Semantic Web technology in the backend. – Semantic Web representation (RDF) and querying (SPARQL) hidden from the users (domain scientists) for convenience. • Example functionality incorporated in the “Beta” version of the PW-11 Collaboration Platform – Creation of a project by its owner and assigning users to groups (e.g., ordinary, external, foreign) in a project – Assigning access control rights based on group/user/file – Searching, requesting, and uploading files respecting access restrictions 63
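The functionality listed on this slide (a project owner assigning users to groups, rights granted per group, user, or file, and searches that respect those rights) can be sketched as a small in-memory model. Everything here is hypothetical: the class and method names, the `group:`/`user:` principal strings, the sample users and files, and the policy that the owner can read every file. The actual platform stores this information as triples, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    owner: str
    groups: dict = field(default_factory=dict)    # group name -> set of user names
    file_acl: dict = field(default_factory=dict)  # file name -> set of principals

    def assign(self, user, group):
        """Put a user into a group such as 'ordinary', 'external', or 'foreign'."""
        self.groups.setdefault(group, set()).add(user)

    def upload(self, filename):
        """Register a file; nobody but the owner can read it until rights are granted."""
        self.file_acl.setdefault(filename, set())

    def grant(self, filename, principal):
        """Grant read access to a principal written as 'group:<name>' or 'user:<name>'."""
        self.file_acl.setdefault(filename, set()).add(principal)

    def can_read(self, user, filename):
        if filename not in self.file_acl:
            return False
        if user == self.owner:  # assumed policy: the owner sees everything
            return True
        allowed = self.file_acl[filename]
        if f"user:{user}" in allowed:
            return True
        return any(
            f"group:{group}" in allowed
            for group, members in self.groups.items()
            if user in members
        )

    def search(self, user):
        """List only the files this user is allowed to see."""
        return sorted(f for f in self.file_acl if self.can_read(user, f))


project = Project(owner="alice")
project.assign("bob", "ordinary")
project.assign("carol", "foreign")
project.upload("spec.pdf")
project.grant("spec.pdf", "group:ordinary")
project.upload("results.csv")
project.grant("results.csv", "user:bob")
```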
  64. 64. Overall Plan • Implement necessary user interfaces and backend processing to facilitate the Collaboration use cases. – Develop and document user interfaces to support flexible access control and data exchange – Store information as metadata in the form of triples to support light-weight reasoning • Virtuoso triple store – Upload and store files (in the server’s file system) respecting user-project access control restrictions • Ubuntu, Java VM, Apache Tomcat Web Server 64
  65. 65. Pre-requisites • Pre-populated set of authorized users (for authentication) – Realistically this will require significant scrutiny of a user outside the collaboration platform. • Simple access control architecture and mechanisms (that can be extended further based on user feedback). • Kno.e.sis prototype assumed availability of an ITAR certified container to host the collaboration platform. Thus, the development of additional infrastructure for ITAR compliance was out of scope. 65
  66. 66. Public-Private Data Sharing • Enhance publicly available datasets while retaining intellectual property data privately for businesses 66 Private data and metadata (e.g. ongoing experimental processes, intellectual property data) Selectively shared data and metadata (e.g. with ongoing collaborators, licensed data) Public data and metadata (e.g., released products, material specifications)
  67. 67. OEM partner A Federated Architecture 67 Private Shared Public Federal Endpoint 1. User Authentication 2. Federated Semantic Query Processor AC Processor Semantic Query Processor OEM partner B Private Shared Public AC Processor Semantic Query Processor OEM supplier C Private Shared Public AC Processor Semantic Query Processor 3. Semantics Mappings
  68. 68. Principles of a Federation • Each component controls access to its local data independently (local autonomy). • A query is decomposed to multiple sub-queries, each sub-query is executed at one component. • Results from sub-queries are combined by the federated query processor (control global access)
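The three principles on this slide can be sketched with a toy federation: each component decides locally which of its private, shared, and public tiers a user may see, answers its own sub-query, and a federated processor combines the answers. This is a simplification under stated assumptions: the sub-query sent to each component is the same single predicate filter, the membership-based policy and all `ex:` data are hypothetical, and a real deployment would use SPARQL endpoints with semantic mappings between them.

```python
class Component:
    """One federation member with locally enforced access control."""

    def __init__(self, name, data, collaborators=(), members=()):
        self.name = name
        self.data = data  # tier ("public" | "shared" | "private") -> list of triples
        self.collaborators = set(collaborators)
        self.members = set(members)

    def allowed_tiers(self, user):
        # Local autonomy: this policy lives in the component, not in the federation.
        tiers = ["public"]
        if user in self.collaborators or user in self.members:
            tiers.append("shared")
        if user in self.members:
            tiers.append("private")
        return tiers

    def query(self, user, predicate):
        """Answer a sub-query using only the tiers this user may access."""
        return [
            triple
            for tier in self.allowed_tiers(user)
            for triple in self.data.get(tier, [])
            if triple[1] == predicate
        ]


def federated_query(components, user, predicate):
    """Send one sub-query to each component and combine the results."""
    results = []
    for component in components:
        results.extend(component.query(user, predicate))
    return results


partner_a = Component(
    "OEM partner A",
    {
        "public": [("ex:AlloyX", "ex:hasSpec", "ex:Spec1")],
        "private": [("ex:AlloyX", "ex:hasSpec", "ex:DraftSpec")],
    },
    members={"ann"},
)
supplier_c = Component(
    "OEM supplier C",
    {"shared": [("ex:AlloyX", "ex:hasSpec", "ex:SupplierSpec")]},
    collaborators={"ann"},
)
federation = [partner_a, supplier_c]
ann_results = federated_query(federation, "ann", "ex:hasSpec")
guest_results = federated_query(federation, "guest", "ex:hasSpec")
```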
  69. 69. Kno.e.sis Tools • Doozer: Ontology creator from Wikipedia category hierarchy • Scooner: Tool for trailblazing using semantic triples • Kino: Faceted Search Engine • iExplore: Visualize and navigate semantic / linked data • BLOOMS: Ontology alignment tool 69
  70. 70. Take Away Use of semantic web technologies can help overcome challenges associated with data discovery, integration, and interoperability, caused by data heterogeneity, and use of provenance and access control can help to share/exchange data reliably. 70
  71. 71. 71 thank you, and please visit us at http://knoesis.org/ Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing Wright State University, Dayton, Ohio, USA Kno.e.sis
