Realizing Semantic Web - Light Weight semantics and beyond

The talk "Realizing Semantic Web - Light Weight semantics and beyond" was given by Prof. T.K. Prasad at the ICMSE-MGI Digital Data Workshop, held at the Kno.e.sis Center on November 13-14, 2013. The talk emphasized an annotation and search framework.

workshop page: http://wiki.knoesis.org/index.php/ICMSE-MGI_Digital_Data_Workshop

Published in: Education, Technology

No notes for slide
  • (1) Vision: short-term and long-term goals, approaches, benefits, and challenges (reflecting cost/convenience/pay-off trade-offs); a continuum to preserve investment. (2) Glimpse at semantic formalization of tabular data. Semantics-enhanced cyberinfrastructure for ICMSE: interoperability, analytics, and applications.
  • Able (technological challenges) and willing (cultural challenges). Realization requires semantics-empowered techniques, and at this point can benefit from semantic technologies and tools for “convenient” discovery (using useful and usable technologies). Motive: economics and shortening time to development and market.
  • Significant current investments in the sciences (“relative comparison”). Data centers: NASA, USGS, NOAA. Cf. Big Data volume and velocity challenges; cf. Big Data variety and veracity challenges. Ultimately we want to derive value.
  • Additional areas that can benefit => geoscience. Syntactic (format) and semantic (domain models, perspectives) heterogeneity: text vs Excel vs XML; units of measure, well-entrenched vocabularies.
  • Use case: materials and process specifications. Variety challenge: sources of heterogeneity are syntactic (Excel, XML, text) vs semantic (UOM, controlled terms). Attribute-value pairs: explicit vs implicit, conditioned on shape and dimension (making these connections explicit from a text document is non-trivial). Table captions: use text-based metadata to help mediate => tabular data.
  • SAE Intl.: AMS 4928N: Aerospace Material Specification, Ti Alloy Bars, Wire, Forgings, and Rings. UNS R56400 (issued 7/1/1957, revised 4/1/1993).
    Human-comprehensible, but requires understanding that (1) the spec covers four types of material: bars, wire, forgings, and rings; (2) the tolerance for hydrogen content for forgings is lax …, plus references to other specs that need to be inherited [[+ test frequency / lot size]]; (3) residual elements?? => need background knowledge: is it zero or traces?
    Why also specify ppm? Convenience? 0.05% = 0.05/100 = 500 x 10^-6 = 500 ppm.
    Two issues: What is the semantics? How can it be obtained automatically, or, more realistically, semi-automatically?
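The percent-to-ppm arithmetic in the note above can be checked with a small helper. This is only an illustrative sketch; the function names are assumptions and are not part of the talk or of any spec tooling.

```python
def percent_to_ppm(percent: float) -> float:
    """Convert a percentage to parts per million (ppm)."""
    # 1 percent = 1/100 and 1 ppm = 1/1_000_000, so 1 percent = 10_000 ppm.
    return percent * 10_000


def ppm_to_percent(ppm: float) -> float:
    """Convert parts per million (ppm) back to a percentage."""
    return ppm / 10_000


if __name__ == "__main__":
    # The value from the note: 0.05 percent expressed in ppm.
    print(f"{percent_to_ppm(0.05):.1f} ppm")
```

A normalizer that reads specs could store every limit in one unit this way, so that a value stated as a percentage and the same value stated in ppm compare as equal.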
  • (Ref: B 50T26 S7, Sections 1, 4.2, 4.4)
    Synonyms: from stemming (syntactic) to a richer thesaurus (simple KB), to map document text strings to domain concepts / ontology.
    Coreference issues: purpose of semantics => what is literally given vs what is really meant? E.g., the KB says welded tubing ISA tubing, but a paragraph that describes ‘welded tubing’ can refer to it using “the tubing”.
    RECALL: materials and process specs typically describe composition, processing, testing, and packaging of material. Formalizing a procedure (a process or a test) as an aggregation of characteristic/parameter-value pairs. Besides determining related phrases using clause, line, paragraph boundaries, etc., we may need to use a semantic/domain model/ontology to normalize or fill in implicit details.
    PLUS NLP-lite issues: There is confusion regarding the distribution of “and” over “or”, and over the interpretation of “and” and “or”. For instance, is “X or Y and Z” = “X and Z or Y and Z”? Similarly, “and” in the context “P is X and Y” connotes intersection, while “and” in the context of “P and Q are X” connotes union.
    Ingot chemistry vs product chemistry.
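The “X or Y and Z” ambiguity raised in the note above can be made concrete by treating X, Y, and Z as truth values and comparing the two candidate readings. This is a minimal sketch with hypothetical function names; it illustrates the ambiguity and is not a method from the talk.

```python
from itertools import product
from typing import List, Tuple


def or_of_x_and_pair(x: bool, y: bool, z: bool) -> bool:
    """Reading 1: "X or (Y and Z)"."""
    return x or (y and z)


def distributed_and(x: bool, y: bool, z: bool) -> bool:
    """Reading 2: "(X or Y) and Z", i.e. "X and Z or Y and Z"."""
    return (x or y) and z


def disagreements() -> List[Tuple[bool, bool, bool]]:
    """List every (x, y, z) assignment on which the two readings differ."""
    return [
        (x, y, z)
        for x, y, z in product([False, True], repeat=3)
        if or_of_x_and_pair(x, y, z) != distributed_and(x, y, z)
    ]


if __name__ == "__main__":
    for assignment in disagreements():
        print(assignment)
```

The two readings differ exactly when X holds and Z does not, so an extraction tool cannot pick one silently; the choice has to come from domain knowledge or from a human reviewer.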
  • Semantics at different levels of detail, developed in stages (“Rome was not built in a day”!): cost-benefit trade-offs.
    ANALOGY: table of contents (top-down, prescribed, static) vs index (bottom-up, gleaned to describe, dynamic).
    Controlled vocabularies <= lightweight ontologies [legacy vocab + community-agreed semantic relationships] <= formal ontologies. Original document vs its translation => traceability (provenance).
    Past research: we have dealt with the top-down UMLS ontology vs bottom-up facts from PubMed in HPCO (literature-based discovery -> LBD).
    Pick from existing upper-level ontology vocabulary => manual; indexing table columns, rows, captions. Semi-automatic metadata generation/embedding => annotation: mapping text to concepts; summarization: triple extraction => semantic search with background KB. Translation and summarization [integration and interoperation require alignment of vocabularies]. Graphical representation and querying. Literature-based discovery: navigate through the documents based on path search through their LOD renditions (extractions).
    RECALL: materials and process specs typically describe composition, processing, testing, and packaging of material. Formalizing a procedure (a process or a test) as an aggregation of characteristic/parameter-value pairs = LOD. Eventually allows combining and comparing specs.
    Biomaterials use case: gold surface affinity of peptide sequence.
    PLUS NLP-lite issues: There is confusion regarding the distribution of “and” over “or”, and over the interpretation of “and” and “or”. For instance, is “X or Y and Z” = “X and Z or Y and Z”? Similarly, “and” in the context “P is X and Y” connotes intersection, while “and” in the context of “P and Q are X” connotes union.
    Compare, manipulate, and combine specs.
  • Light-weight semantics as the first step
  • More detailed annotation for extraction … Elided/surfaced: melt quantity. And-or issues. In the translation: atmosphere specified only for nonconsumable electrode.
  • The syntax is a tad inelegant because it is meant to be machine-processable more readily than for human consumption. In reality, it serves both purposes, for traceability and verification. Supports faceted search, comparing and combining specs.
  • Even if sold on deeper formal semantics, LWS is a necessary first step and we advocate it …
  • Use case: materials and process specifications. Variety challenge: sources of heterogeneity are syntactic (Excel, XML, text) vs semantic (UOM, controlled terms). Attribute-value pairs: explicit vs implicit, conditioned on shape and dimension (making these connections explicit from a text document is non-trivial). Table captions: use text-based metadata to help mediate => tabular data. Unification – integration vs federation – interoperation/mediation.
  • Less training. ASTM, NIST, MIL-stds (Handbook 21, 5). Flat list of terms and their associated definitions. Hierarchical organization of properties, alloys, performance metrics, … Cross relationships: (1) qualitative dependencies (proportionality); (2) quantitative dependencies (equations/formulas).
  • Document text – table captions + provenance and content keywords + annotation tools + attribute-value pairs; consolidation of related pieces; conditioning. Biomaterials use case: gold-binding peptides. Recognize gold surfaces, peptide sequences, and then their relationships.
  • http://www.bbc.co.uk/music/ => [search by artist] “Elton John”. How do we build a site like the BBC's that is constantly kept up to date (e.g. artist details)? Either site editors manually edit the data and the web site, or the data is automatically extracted from other web sites that are kept current (say, by the crowd). E.g., the BBC uses external open datasets such as Wikipedia and MusicBrainz. How do we build an agile infrastructure that can benefit from sharable, evolving, open data sets?
  • Open data sets alone in isolation are analogous to the web of documents without hyperlinks. We need the capability to access the data via standard technologies, and interlink data over the web… That is what we call linked open data
  • Primary principles of linked open data: 1. Use URIs to describe the data (machine-processable, standard). 2. Associate descriptions with the data. 3. Interlink data whenever possible; data integration happens via interlinking at the data level and the instance level. These generic linked data principles are realized using Semantic Web technologies (RDF, RDFS, and OWL). The nature of issues to be dealt with / resolved / accomplished in creating LOD.
  • The diagram below shows the domain coverage of LOD as of 2011.
  • How do we search? Content- and context-based (provenance): hierarchy of terms for manual selection; publications: who wrote, where appeared, when.
  • Virtuoso/SPARQL, PROV, BLOOMS, Kino, Sig.ma. Dataverse project: data citation identifiers – technological details.
  • NSF-SBIR “Computer-Assisted Document Interpretations Tools” [Materials and Process specs relevant to aircraft and automobile industries]
  • Use case: Materials and Process specs (e.g., composition table, tensile test information (conditional constructs; cross-section info placed on a row by itself; nested tables; blank values)) [cannot reproduce GE specs due to copyright issues]
  • AMS 4928N. http://www.youtube.com/watch?v=D8U4G5kcpcM http://www.ndt-ed.org/EducationResources/CommunityCollege/Materials/Mechanical/Mechanical.htm Most structural materials are anisotropic, which means that their material properties vary with orientation. In products such as sheet and plate, the rolling direction is called the longitudinal direction, the width of the product is called the (long) transverse direction, and the thickness is called the short transverse direction.
  • Use case: Materials and Process specs. Compact structures for sharing information. Minimize duplication.
  • In content extraction from tables, a human extractor formalizes the data using “predefined” tables, and a wizard then generates LOD from it. The extractor is responsible for gleaning the semantics (manual part); the wizard is responsible for the mechanical translation (automatic part). The yardstick of success is the extent to which regular parts of the table can be automatically assimilated and translated, while leaving more complex parts for manual guidance.
  • Transcript

    • 1. Realizing Semantic Web: Lightweight Semantics and Beyond. Krishnaprasad Thirunarayan (T. K. Prasad), Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing, Wright State University, Dayton, OH-45435
    • 2. Outline • Domain Goals and Challenges • Cyberinfrastructure Investments in Science • Utility and Continuum of Machine-Processable Semantics: An Architecture – What?: Nature of Data and Granularity of Semantics – Why?: Lightweight semantics and its benefits – How?: Community-ratified Ontologies + Semantic Annotations of Data and Documents + Linked Open Materials Data • Research: Processing Tabular Data
    • 3. Domain Goals and Challenges • Materials Science and Engineering data and information sharing, discovery, and application are possible only if domain scientists are able and willing to do so. • Technological challenges – Computational tools and repositories conducive to easy exchange, curation, attribution, and analysis of data • Cultural challenges – Proper protection, control, and credit for sharing data
    • 4. Table columns: Category of Geoscience Data; Characteristics; Strategy for Reuse; CI Strategy
        Short tail science data, created by large organizations and projects
            Characteristics: Few, large (TB+), structured, spatially rich (e.g., remote sensing), largely homogeneous, highly visible, curated
            Strategy for Reuse: Planned integration strategies, could use formal ontologies / domain models and vocabularies, visualization tools and APIs
            CI Strategy: Data centers / grids generally using relational databases and files, maintained by people with significant IT skills
        Long tail science data, created by individual scientists and small groups
            Characteristics: Many, small (GB+), heterogeneous, invisible (except via publications), poorly curated
            Strategy for Reuse: Multi-domain and broad vocabularies (including community established ones), create semantic metadata (annotations) and optionally publish, search and download legacy data, or use an open data initiative
            CI Strategy: Web-based, easy to learn and use semantic tools for annotation, publication, search and download that can be used by individual scientists without significant IT skills
    • 5. Our Thesis: Associating machine-processable semantics with materials science and engineering data and documents can help overcome challenges associated with data discovery, integration and interoperability caused by data heterogeneity.
    • 6. What?: Nature of Data and Documents • Structured Data (e.g., relational) • Semi-structured, Heterogeneous Documents (e.g., publications and technical specs usually include text, numerics, units of measure, images and equations) • Tabular data (e.g., ad hoc spreadsheets and complex tables incorporating “irregular” entries)
    • 7. Fragment of Materials and Process spec for Ti Alloy Bars, Wire, Forgings, and Rings.
    • 8. What?: Granularity of Semantics and Applications: Examples • Synonyms – Chemistry, Chemical Composition, Chemical Analysis, ... – Bend Test, Bending, ... – Delivery Condition, Process/Surface Finish, Temper, "as received by purchaser", ... • Coreference vs broadening/narrowing – Tubing vs welded tubing vs flash-welded part • Capturing characteristic-value pairs – Recognize and Normalize: “0.1 inch and under in nominal thickness” is translated to “Thickness <= 0.1 in”. – Glean elided characteristic: controlled term “solution heat treated” implies the characteristic “heat treat type”.
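The "Recognize and Normalize" example on slide 8 can be sketched with a regular expression. This is a minimal illustration under stated assumptions: the pattern, the unit and comparator tables, and the `normalize` function are hypothetical, cover only phrases shaped like the slide's example, and are not the tool described in the talk.

```python
import re
from typing import Optional, Tuple

# Hypothetical lookup tables; a real system would draw these from a
# controlled vocabulary rather than hard-coding them.
UNITS = {"inch": "in", "inches": "in", "mm": "mm"}
COMPARATORS = {"and under": "<=", "and over": ">=", "under": "<", "over": ">"}

PATTERN = re.compile(
    r"(?P<value>\d*\.?\d+)\s*(?P<unit>inch(?:es)?|mm)\s+"
    r"(?P<cmp>and under|and over|under|over)\s+in\s+(?:nominal\s+)?"
    r"(?P<characteristic>\w+)",
    re.IGNORECASE,
)


def normalize(phrase: str) -> Optional[Tuple[str, str, float, str]]:
    """Return (characteristic, comparator, value, unit), or None if no match."""
    match = PATTERN.search(phrase)
    if match is None:
        return None
    return (
        match.group("characteristic").capitalize(),
        COMPARATORS[match.group("cmp").lower()],
        float(match.group("value")),
        UNITS[match.group("unit").lower()],
    )


if __name__ == "__main__":
    print(normalize("0.1 inch and under in nominal thickness"))
```

Returning a structured tuple rather than a rewritten string keeps the characteristic, comparator, value, and unit separately available for later comparison across specs.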
    • 9. What?: Granularity of Semantics and Associated Applications • Lightweight semantics: File and document-level annotation to enable discovery and sharing • Richer semantics: Data-level annotation and extraction for semantic search and summarization • Fine-grained semantics: Data integration, interoperability and reasoning in Linked Open Materials Science Data
    • 10. Computer Assisted Document Extraction Tool Typical view of the tagged Spec Tree/Structure view of the Spec
    • 11. Computer Assisted Document Extraction Tool Tag Editor Few More Examples: Procedure Melt Methods View of the Original Spec Tagged Spec
    • 12. Computer Assisted Document Extraction Tool Tag Editor Few More Examples: Procedure Melt Methods The SDL
    • 13. Why?: Benefits of Lightweight Semantics • Ease of use by domain experts – Faster and wider adoption, promoting evolution • Low upfront cost to support • Shallow semantics has wider applicability to a range of documents/data and appeal to a broader community of geoscientists • Bottom-line: “Learn to Walk before we Run”
    • 14. How?: Using Semantic Web Technologies Machine-processable semantics achieved by addressing • Syntactic Heterogeneity: Using XML syntax and RDF data model (labelled graph structure) • Semantic Heterogeneity: – Using “common” controlled vocabularies, taxonomies and ontologies – Using federated data sources, exchanges, querying, and services
    • 15. How?: Ingredients for Semantics-based Cyber Infrastructure • Use of community-ratified controlled vocabularies and lightweight ontologies (upper-level, hierarchies) • Ease registration, publishing, and discovery • Provide support for provenance and access control • Track data citation for credit for data sharing • Semi-automatic annotation of data and documents: Manual + Automatic
    • 16. How?: Search Continuum • Keyword-based full-text search • + Manually provided content and source metadata – Uses upper-level ontology • + Automatically extracted metadata – Map text to concepts/properties/values – Semantic + faceted search using background knowledge • + Deeper semi-automatic content annotation and extraction – Aggregating related pieces of information; conditioning – Integration and Interoperation • + Linked Open Material Science Data • + Federated and Faceted Querying and Services
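The first steps of the search continuum on slide 16 (keyword search, then search narrowed by attached metadata) can be sketched as follows. The `Document` class, the facet names, and the sample records are all invented for illustration; they are assumptions, not data or APIs from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Document:
    """A document plus the metadata attached to it by annotation."""
    title: str
    text: str
    facets: Dict[str, str] = field(default_factory=dict)


def search(docs: List[Document], keyword: Optional[str] = None,
           **required_facets: str) -> List[Document]:
    """Return documents matching the keyword (if given) and every facet."""
    hits = []
    for doc in docs:
        if keyword is not None:
            haystack = f"{doc.title} {doc.text}".lower()
            if keyword.lower() not in haystack:
                continue  # fails the plain full-text stage
        if all(doc.facets.get(name) == value
               for name, value in required_facets.items()):
            hits.append(doc)  # passes the metadata (facet) stage
    return hits


# Hypothetical records, invented for illustration only.
DOCS = [
    Document("Spec A", "Titanium alloy bars, solution heat treated",
             {"material": "titanium alloy", "form": "bar"}),
    Document("Spec B", "Titanium alloy forgings, annealed",
             {"material": "titanium alloy", "form": "forging"}),
    Document("Report C", "Notes on aluminum sheet",
             {"material": "aluminum alloy", "form": "sheet"}),
]

if __name__ == "__main__":
    print([d.title for d in search(DOCS, keyword="titanium")])
    print([d.title for d in search(DOCS, keyword="titanium", form="forging")])
```

Keyword search alone returns both titanium specs; adding a facet narrows the result without any change to the document text, which is the payoff the slide attributes to lightweight metadata.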
    • 17. Linked Open Data – Why do we need data?
    • 18. Linked Open Data – Just data is not enough • More and more data are available, but isolated islands of data are not enough, akin to the web of documents without hyperlinks. [Figure: isolated data sets A–F] Need to interlink data over the web to enable content-rich applications. [Figure: Linked Data, the same data sets A–F interlinked]
    • 19. Linked Open Data – A Realization [Figure: RDF graph; node and edge labels follow] http://dbpedia../politician http://ex./John_Kennedy http://dbpedia../Profession owl:sameAs http://ex./AuthoredBook http://dbpedia../John_F._Kennedy http://ex./A_Nation_of_Immigrants http://ex./publishedIn 1964 http://dbpedia../BirthDate 1917-05-29 http://ex./genre http://ex./non-fiction http://dbpedia../Capital http://dbpedia../Boston http://dbpedia../BirthPlace http://dbpedia../Massachusetts http://dbpedia../Country http://dbpedia../United_States
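The interlinking shown on slide 19 can be illustrated with plain subject-predicate-object tuples. This is a sketch under assumptions: the `ex:` and `dbpedia:` prefixes are shorthand for the abbreviated URIs on the slide, the choice of which nodes each edge connects is one reading of the flattened diagram, and the helper functions are hypothetical. No RDF library is used.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
SAME_AS = "owl:sameAs"

# One reading of the slide's graph, written with shorthand prefixes.
TRIPLES: List[Triple] = [
    ("ex:John_Kennedy", "ex:AuthoredBook", "ex:A_Nation_of_Immigrants"),
    ("ex:A_Nation_of_Immigrants", "ex:publishedIn", "1964"),
    ("ex:A_Nation_of_Immigrants", "ex:genre", "ex:non-fiction"),
    ("ex:John_Kennedy", SAME_AS, "dbpedia:John_F._Kennedy"),
    ("dbpedia:John_F._Kennedy", "dbpedia:Profession", "dbpedia:politician"),
    ("dbpedia:John_F._Kennedy", "dbpedia:BirthDate", "1917-05-29"),
    ("dbpedia:John_F._Kennedy", "dbpedia:BirthPlace", "dbpedia:Massachusetts"),
]


def aliases(entity: str, triples: List[Triple]) -> Set[str]:
    """Collect every name linked to `entity` by a chain of sameAs edges."""
    known = {entity}
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if predicate != SAME_AS:
                continue
            if subject == current:
                other = obj
            elif obj == current:
                other = subject
            else:
                continue
            if other not in known:
                known.add(other)
                frontier.append(other)
    return known


def describe(entity: str, triples: List[Triple]) -> Dict[str, Set[str]]:
    """Merge the properties of an entity with those of all its aliases."""
    names = aliases(entity, triples)
    merged: Dict[str, Set[str]] = defaultdict(set)
    for subject, predicate, obj in triples:
        if subject in names and predicate != SAME_AS:
            merged[predicate].add(obj)
    return dict(merged)


if __name__ == "__main__":
    for predicate, values in sorted(describe("ex:John_Kennedy", TRIPLES).items()):
        print(predicate, sorted(values))
```

Asking about the local `ex:` node also returns the birth date and profession held in the other data set, which is the integration-by-interlinking effect the slide depicts.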
    • 20. Linked Open Data “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
    • 21. Example: Lightweight Semantic Registration of Data Title of data Type of data Selected from five tier vocabulary provided Keywords maps, excel files, images, text Data format structured or unstructured Description of data brief unstructured description of content Contact information of provider(s) name of provider(s), email for verification, lineage location Spatial extent of data and reference system Temporal extent of data date range in time or age range if not recent Date and type of Related Publication(s) Host site for publication Journal, Thesis, Agency report, not published Access restrictions copyright regulations Journal, Library, Personal computer
    • 22. System Architecture and Components
    • 23. Deeper Issues: Semantic Formalization of Tabular Data: Problems and a Practical Approach (“When rubber meets the road”) skip
    • 24. Nature of tables • Compact structures for sharing information – Minimize duplication • Types of Tables – Regular: Dense grid with explicit schema information in terms of column and row headings => Tractable – Irregular: Sparse grid with implicit schema and ad hoc placement of headings => Hard
    • 25.
    • 26. Challenges Associated with Typical Spreadsheet/Table • Meant for human consumption • Irregular – Not a simple rectangular grid • Heterogeneous – Not all rows are interpreted similarly • Complex – Meaning of each row and each column is context dependent • Footnotes modify meaning of entries (esp. in materials and process specifications)
    • 27. Practical Semi-Automatic Content Extraction • DESIGN: Develop regular data structures that can be used to formalize tabular information. – Provide a natural expression of data – Provide semantics to data, thereby removing potential ambiguities – Enable automatic translation • USE: Manual population of regular tables and automatic translation into LOD
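The "USE" step on slide 27 (a regular table filled in by hand, then translated mechanically) might look like the sketch below. Everything in it is an assumption made for illustration: the `http://example.org/spec/` namespace, the column names, the `hasEntry` predicate, and the numeric values are invented and do not come from any real specification or from the authors' wizard.

```python
from typing import Dict, List, Optional, Tuple, Union

Value = Union[str, float]
Triple = Tuple[str, str, Value]

BASE = "http://example.org/spec/"  # hypothetical namespace, not from the talk


def table_to_triples(spec_id: str, key_column: str,
                     rows: List[Dict[str, Optional[Value]]]) -> List[Triple]:
    """Translate a regular table (one dict per row) into triples."""
    spec_uri = BASE + spec_id
    triples: List[Triple] = []
    for row in rows:
        # Each row becomes a resource named after its key cell.
        row_uri = f"{spec_uri}/{row[key_column]}"
        triples.append((spec_uri, BASE + "hasEntry", row_uri))
        for column, value in row.items():
            if column == key_column or value is None:
                continue  # the key names the row; blank cells yield no triple
            triples.append((row_uri, BASE + column, value))
    return triples


# Invented placeholder values, not taken from any real specification.
ROWS: List[Dict[str, Optional[Value]]] = [
    {"element": "Al", "minPercent": 1.0, "maxPercent": 2.0},
    {"element": "H", "minPercent": None, "maxPercent": 0.5},
]
TRIPLES = table_to_triples("demo-spec", "element", ROWS)

if __name__ == "__main__":
    for triple in TRIPLES:
        print(triple)
```

Because the schema is explicit in the column names, the translation needs no interpretation; the judgment calls (which cell is a heading, what a footnote modifies) stay with the person who fills in the regular table.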
    • 28. Thank you, and please visit us at http://knoesis.org/ Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing, Wright State University, Dayton, Ohio, USA