My name is Johann Höchtl. I am from Danube University Krems, Austria, and I will present some challenges of semantic interoperability and recent research to overcome them. Semantic interoperability is largely about connecting concepts, hence the term semantic “bridging”. Istanbul would not be a metropolis of its importance without the two big bridges connecting Europe and Asia. When thinking about Europe and Asia, certain associations arise. Both have a characteristic food culture, traditional clothing and distinct medical cultures. Terms such as Corn and Rice, Red Wine and Sake, Bachblüten and Reiki have something in common: a relationship which can be modeled on a higher level.
The first three pairs fall into the Food domain, with Corn and Rice being an important protein source. Lederhosen and Sari are both sub-concepts of the concept Clothing and share the property Natural Material, and Bachblüten and Reiki are alternative medical treatments. To complicate things further, you can identify horizontal properties: all of them can be bought, which belongs to the Finance domain. What we identify here are relationships, properties and hierarchy attributes. In knowledge engineering these levels are termed superconcepts and sub-concepts, or higher ontology vs. lower ontology. As a knowledge worker you may ask yourself whether you are a generalist or a specialist.
After this short introduction to what semantic bridging is about, some information about my workplace. I work for Danube University Krems, the only publicly owned university for continuing education in Austria. The research focus of the Center for E-Government is E-Democracy and the impact of electronic participation on society. You will find out more about what we do when you browse to and participate in our public blog. If you are interested, you may submit a paper to the E-Journal of E-Democracy and Open Government.
So why are we, as a center for e-government, interested in semantic, ontology-driven data exchange? Because the current state of affairs in semantic land does not permit unguided exchange on the semantic level. As long as only technical interoperability is concerned, for example when you can strictly follow an XML Schema specification, things are fine. That is not the case for semantic systems without enriched domain knowledge. In the research we carried out together with the CIO section of the Austrian Chancellery, we found that the recall rate of semantic bridging systems which focus on domain knowledge is higher than that of systems which try to extract or reconstruct that domain knowledge by dictionary lookups, word frequency analysis or stemming approaches. Three months ago Netbase made a new service publicly available, a Content Intelligence platform for healthcare. Based on user input, it returns treatment advice and possible causes and cures for diseases. Some of the results may be funny, but taken too seriously such advice can do more harm than good. Here are some funny assertions made by the system. Since its release the system has improved, as those funny assertions are no longer returned.
Some fundamental properties of semantics. First and foremost, semantic bridging is about detecting similarity in a computerized manner. When semantic information is available, for example, in OWL-DL format, it first has to be converted into a representation suited to computation, which usually is a matrix. The two dimensions of the matrix hold the identified concepts, and each cell holds the similarity of a concept pair, expressed between 0 and 1, with 0 meaning no similarity and 1 meaning identity or a full semantic match. A matrix is not the most intuitive way to visualize semantic information for a human; Directed Acyclic Graphs or, for special inheritance relationships, trees are more sensible graphical representations. The naïve approach to computing similarity is to enumerate all concepts completely and compare them pairwise. The processing power this would require for a complete DNA analysis or for Internet data mining called for new comparison algorithms which keep the computational effort tractable. A prominent early example is the ant colony (“marching ants”) algorithm, which finds good approximate solutions to the traveling salesman problem in reasonable time.
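The naïve pairwise comparison can be sketched in a few lines of Python. This is only an illustration under assumptions: the element names are invented, and `difflib.SequenceMatcher` from the standard library stands in for whatever similarity measure a real matcher would use.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_matrix(concepts_a, concepts_b, sim=name_similarity):
    """Brute force: one comparison per pair, len(A) * len(B) calls in total."""
    return [[sim(a, b) for b in concepts_b] for a in concepts_a]


# Hypothetical element names of two document standards.
concepts_a = ["InvoiceNumber", "BuyerName", "TotalAmount"]
concepts_b = ["InvoiceNo", "CustomerName", "AmountTotal"]

matrix = similarity_matrix(concepts_a, concepts_b)
for name, row in zip(concepts_a, matrix):
    print(name, [round(value, 2) for value in row])
```

The number of comparisons grows with the product of both concept counts, which is why this approach does not scale to very large inputs.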
Many of those semantic similarity problems have their origins in detecting structural similarity, for example the similarity between graphs. Especially in the realm of graph similarity, semantic similarity research has resulted in new approaches and algorithms. While counting the edit operations needed to transform a tree A into a structurally equivalent tree B is a rather old approach, similarity flooding is a fairly new methodology. The idea behind similarity flooding is the fundamental assumption that two concepts are similar if their neighbors are similar. This algorithm iteratively traverses the graph at least twice and has terrible runtime complexity, but additional sensible constraints help to improve the performance, for example a maximum depth up to which the similarity of a node is propagated based on its surrounding nodes, or branch prediction to stop comparing branches which are unlikely to match given a certain threshold. Besides the structural similarity of graphs, the element names and their assigned data types also contain semantic information. Dictionary-based algorithms calculate the relatedness of words, and similar words may be identified by the Soundex or Levenshtein algorithm. Combining multiple similarity measures into one value, e.g. the structural similarity between two nodes and their Soundex similarity, is another challenge. Once the similarity matrix has been established, the most likely matching pairs have to be determined. Based on the similarity indices in the matrix, the concepts of A can be seen as feature vectors and compared to the feature vectors of the concepts of B with the Euclidean distance, the well-known cosine similarity or the Jaccard coefficient. The Jaccard coefficient measures the similarity between sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets.
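The three vector and set measures named above are short enough to write out. The following is a minimal sketch with made-up input values; treating two empty sets as identical is a convention chosen here, not something the talk prescribes.

```python
import math


def jaccard(set_a: set, set_b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not set_a and not set_b:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def euclidean_distance(u, v) -> float:
    """Straight-line distance; 0.0 means the vectors are identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical feature vectors: rows taken from two similarity matrices.
concept_a = [0.9, 0.1, 0.3]
concept_b = [0.8, 0.2, 0.4]
print(cosine_similarity(concept_a, concept_b))
print(euclidean_distance(concept_a, concept_b))
print(jaccard({"invoice", "number"}, {"invoice", "id"}))
```

Note that cosine and Jaccard are similarities (higher means closer), while the Euclidean measure is a distance (lower means closer), so they cannot be mixed without normalizing first.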
While the previous slide presented algorithms derived from schema matching which are applicable to ontology matching, these algorithms do not sufficiently account for the semantics in an ontology. A frequent problem is to identify the most specific common ancestor of two concepts in an ontology. The EDGE and LEACOCK algorithms, for example, measure the relatedness of concepts entirely by the distance between them along the edges of the ontology, represented as a directed graph. In 1995 RESNIK proposed a similarity approach which accounts for the depth of the concepts in the graph: a node carries less information the higher it is found along the inheritance line. Dekang Lin refined this concept in 1998 with a very clever, universally applicable, domain- and resource-neutral definition. He defines similarity as the amount of information two concepts share, given by their most specific common ancestor, in relation to the information needed to describe each concept fully. To give you an idea of how complex this is: in 2005 a paper was presented at the WWW Conference in Chiba, Japan. The Department of Computer Science of Indiana University, US, compared a traditional tree-based approach to a graph-based analysis of similarity between all concepts available on DMOZ.org, excluding World and Regional. In 2005 DMOZ.org had 150,000 pages. The calculation of graph-based similarity on the hierarchical component and the two non-hierarchical components, symbolic and related cross-links, required a total of 5000 CPU-hours on a massively parallel CPU cluster consisting of 416 Prestonian cores. But abbreviations or association words add a level of complexity which prevents automatic inference of concepts. In these cases either custom dictionary knowledge represented in SWRL predicate logic or simply a human-made mapping can solve these mapping problems.
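To make the information-based measures concrete, here is a small sketch of the Resnik and Lin similarities on a toy taxonomy. The taxonomy and the occurrence counts are invented for illustration; real systems estimate the probabilities from a corpus or a resource such as WordNet.

```python
import math

# Hypothetical toy taxonomy (child -> parent) with made-up occurrence counts.
PARENT = {"Food": "Thing", "Clothing": "Thing",
          "Corn": "Food", "Rice": "Food", "Sari": "Clothing"}
COUNT = {"Thing": 0, "Food": 0, "Clothing": 0, "Corn": 6, "Rice": 6, "Sari": 4}


def ancestors(concept):
    """The concept itself followed by its ancestors up to the root."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def probability(concept):
    """Share of all observations on the concept or one of its descendants."""
    total = sum(COUNT.values())
    covered = sum(n for c, n in COUNT.items() if concept in ancestors(c))
    return covered / total


def information_content(concept):
    """IC(c) = log(1 / p(c)); the root has p = 1 and carries no information."""
    return math.log(1 / probability(concept))


def most_specific_common_ancestor(a, b):
    ancestors_b = set(ancestors(b))
    return next(c for c in ancestors(a) if c in ancestors_b)


def resnik(a, b):
    """Resnik: information content of the most specific common ancestor."""
    return information_content(most_specific_common_ancestor(a, b))


def lin(a, b):
    """Lin: shared information relative to the information of both concepts."""
    denominator = information_content(a) + information_content(b)
    return 2 * resnik(a, b) / denominator if denominator else 1.0


print(lin("Corn", "Rice"))  # share the ancestor Food
print(lin("Corn", "Sari"))  # share only the root
```

Because Lin normalizes by the information content of both concepts, its result always lies between 0 and 1, which makes it directly usable as a cell value in a similarity matrix.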
Therefore the work in the SET TC deliberately limits itself to mapping document standards derived from UN/CEFACT's Core Component Technical Specification. This specification imposes some challenging properties, as document artifacts are described as Basic Business Information Entities which are derived from Core Components. Those business entities in turn may be composed into Aggregate Business Information Entities. As the inclusion of Business Elements may not be feasible, an Association type exists. This type information carries the semantics which have to be preserved while deriving the ontology from the Excel files provided by UN/CEFACT. The schema information of OAGIS 9.1, UBL 2.0 and GS1 XML can be automatically traced back to its origin in the UN/CEFACT Core Component Library, and the RacerPro inference engine, enriched with rules expressed in predicate logic, computes the higher ontology. This inferred semantic knowledge is used for the automated creation of XPath and XSLT expressions to map document artifacts between these document standards.
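How a mapping between document artifacts might be turned into an XSLT stylesheet can be sketched as follows. This is not the SET TC implementation: the function name, the element names and the XPath expressions are hypothetical, and real UBL or GS1 XML documents would additionally require namespace handling.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def build_xslt(target_root: str, mappings: dict) -> str:
    """Create a stylesheet from {target element name: source XPath} pairs."""
    ET.register_namespace("xsl", XSL_NS)
    sheet = ET.Element(f"{{{XSL_NS}}}stylesheet", {"version": "1.0"})
    template = ET.SubElement(sheet, f"{{{XSL_NS}}}template", {"match": "/"})
    root = ET.SubElement(template, target_root)
    for target_name, source_xpath in mappings.items():
        element = ET.SubElement(root, target_name)
        # xsl:value-of copies the text found at the source XPath.
        ET.SubElement(element, f"{{{XSL_NS}}}value-of", {"select": source_xpath})
    return ET.tostring(sheet, encoding="unicode")


# Hypothetical mapping as it could result from the inferred ontology.
stylesheet = build_xslt("Order", {
    "OrderNumber": "/PurchaseOrder/ID",
    "BuyerName": "/PurchaseOrder/BuyerParty/Name",
})
print(stylesheet)
```

Generated stylesheets of this kind could then be stored and reused, so the inference step does not have to run for every document exchange.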
To further improve match results, the semantics automatically derived from these document standards have to be enriched with explicit rules. On the one hand, these rules add the information needed to match semantically equivalent yet structurally different document artifacts. On the other hand, some rules deal with the fact that parts of the document standards have their equivalent in UN/CEFACT's Core Component Technical Specification but are used differently in practice. Structural differences are discovered in Association Document Components and their relation to Basic Document Component pairs, and in Basic Business Document Information Entities. In tests the success rate is higher than that of semantic mapping frameworks leveraging general inference techniques such as association lists, dictionaries, stemming algorithms or word distance algorithms. OASIS SET TC is still an ongoing effort, but you may point your browser to the address in this presentation; documentation is available in the installation package you find in the link section on the last slide as well as on the OASIS SET TC homepage.
The OASIS SET TC framework is one deliverable of the iSurf project, which will create an Interoperability Service Utility for Collaborative Supply Chain Planning across Multiple Domains Supported by RFID Devices and is targeted towards the needs of SMEs. OASIS SET TC will not replace existing document communication standards but enrich their interoperability. It is an expandable model. Documents which derive from the UN/CEFACT CCL specification are easy to incorporate; other formats require a fundamental redefinition of the higher ontology. For my research focus this is just one more cornerstone towards interoperable electronic services of public administration. Recent efforts concentrate a lot on public procurement, with the PEPPOL project or the objectives of the semic.eu platform. Norway, Iceland, Finland, Sweden and Denmark work together to define a more appropriate version of UBL for public procurement, called the Northern European Subset of UBL. Iceland has already exchanged eInvoices in this format. Once this specification sees widespread application, the need to exchange data with standards such as HL7 XML will arise, and the SET TC framework will be the right tool to do so.
E Challenges 2009 Workshop 10b Semantic Interoperability Methodologies
Semantic Interoperability Methodologies Johann Höchtl Danube University Krems Center for E-Government Austria
Bridges <ul><li>Bridging Concepts </li><li>Corn vs. Rice </li><li>Red Wine vs. Sake </li><li>Surströmming vs. thousand-year egg </li><li>Lederhosen vs. Sari </li><li>Bachblüten vs. Reiki </li><li>… </li></ul>[Figure: Istanbul Bridge, map by Openstreetmap.org, labeled Europe and Asia]
Super – Sub - Concepts <ul><li>Food <ul><li>Protein: Corn vs. Rice </li><li>Alcohol: Red Wine vs. Sake </li><li>Nat. Preservative: Surströmming vs. thousand-year egg </li></ul></li><li>Clothing <ul><li>Natural Materials: Lederhosen vs. Sari </li></ul></li><li>Medicine <ul><li>Alternative Medicine: Bachblüten vs. Reiki </li></ul></li><li>Horizontal concepts: Finance (Buy), Logistic (Store) </li><li>Superconcept / Higher Ontology vs. Sub-Concept / Lower Ontology </li><li>Know Everything vs. Domain Specific Knowledge? </li></ul>
Who we are and What we do <ul><li>Danube University Krems, Austria </li></ul><ul><ul><li>Only State-Owned Post Graduate University </li></ul></ul><ul><li>Center for E-Government </li></ul><ul><ul><li>E-Inclusion and E-Participation and their impacts on electronic society </li></ul></ul><ul><ul><li>http://digitalgovernment.wordpress.com </li></ul></ul><ul><ul><li>Journal of E-Democracy and Open Government http://www.jedem.org </li></ul></ul><ul><li>About the Presenter </li></ul><ul><ul><li>E-Participative Processes and Models of Incorporation </li></ul></ul><ul><ul><li>Doctoral Thesis University of Vienna and Technical University of Vienna, Business Informatics, Vienna </li></ul></ul>
The Problem – State of Computer Automated Semantic Understanding <ul><li>“The tragi-comic failure of Netbase can teach a lot to every company in the Semantic space.” </li></ul><ul><li>Lesson 1: Don’t even try to boil the ocean of the WWW with these technologies. [The] Internet is full of valuable information but crap (or opinions) is 90% [of it], the cost of getting rid of this crap and save only the good stuff is very high </li></ul><ul><li>Lesson 2: Linguistic approaches are likely going to fail because search engines (and machines) can’t distinguish joke/seriousness, sarcasm/shame and sentiments in general. The semantic meaning is right there not in the words of a text. </li></ul><ul><li>Lesson 3: If you choose to apply such approaches to one specific topic like Medicine (good choice) then stick to that topic, that means accept as INPUT only medical terms and provide as OUTPUTS only medical terms. </li></ul><ul><li>This last point requires human intervention and predefined taxonomies/ontologies but Netbase claims that they don’t need them both, [i.e., that] their engine is fully automatic the failure too.” </li></ul>Reddit: Source: http://marklogic.blogspot.com/2009/09/netbase-tragicomedy-perils-of-magic-and.html
Notions of Similarity <ul><li>How can a computer system declare two data fragments similar and to what extent? </li></ul><ul><li>Starting point: Data transformation into a computable dimension </li></ul><ul><ul><li>Canonical data structure is Matrix, X and Y Dimensions contain identified terms and their respective similarity </li></ul></ul><ul><ul><li>Visualization as tree or directed graph </li></ul></ul><ul><li>Required Computational effort is very high </li></ul><ul><ul><li>Brute-force approach: Compare every identified term of document instance A with every identified term of document instance B </li></ul></ul><ul><ul><li>Recent approaches: Genetic algorithms: “90:9:1 syndrome”: 90% of results are very good, 9% are acceptable and one percent degenerates </li></ul></ul>
Similarities <ul><li>Structural Similarity </li></ul><ul><ul><li>Number of Edit Operations to transform tree A (document artifact A) into tree B </li></ul></ul><ul><ul><li>Maximum common subgraph, Minimum common supergraph </li></ul></ul><ul><ul><li>Similarity Flooding </li></ul></ul><ul><ul><ul><li>Two graphs are similar if the neighborhoods for every Node are similar. </li></ul></ul></ul><ul><li>Element based Similarity </li></ul><ul><ul><li>Element names </li></ul></ul><ul><ul><li>Data types </li></ul></ul><ul><li>Similarity algorithms </li></ul><ul><ul><li>Strings: Levenshtein distance, linguistic similarity (Soundex) </li></ul></ul><ul><ul><li>Logical structure: Jaccard index, Dice coefficient, cosine similarity </li></ul></ul>
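As an illustration of the string measures on this slide, the following sketch implements the Levenshtein distance and combines a name-based score with a structural score. The weighted average and its default weight are assumptions made for the example; the slide does not prescribe how measures are combined.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(a: str, b: str) -> float:
    """Scale the edit distance to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def combined_similarity(name_sim: float, structure_sim: float,
                        name_weight: float = 0.5) -> float:
    """Assumption: a plain weighted average of two scores in [0, 1]."""
    return name_weight * name_sim + (1 - name_weight) * structure_sim


print(levenshtein("kitten", "sitting"))
print(string_similarity("InvoiceNumber", "InvoiceNo"))
print(combined_similarity(0.8, 0.4, name_weight=0.75))
```

Choosing the weight is itself part of the combination problem: a matcher tuned for standards with consistent naming may favor the name score, while one for heterogeneous schemas may rely more on structure.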
Ontology Similarity - Approaches <ul><li>What is the most specific common ancestor of a pair of concepts in an ontology – distance of concepts? </li></ul><ul><ul><li>Assign different similarity weights according to relationship (Synonyms, Hypernyms, Antonyms, Meronyms, …) </li></ul></ul><ul><ul><li>Measure through ontology collection (OpenCyc, WordNet, Wikipedia, DMOZ) or Knowledge Base </li></ul></ul><ul><ul><li>EDGE, LEACOCK, RESNIK, LIN, JIANG </li></ul></ul><ul><li>BUT: Abbreviations, AssocWords with delimiters (ArrivalAirportIn), Suffix/Prefix (hasName), misspellings, freely invented words, … </li></ul><ul><ul><li>Human interaction (interactive mapping, enriched logic) is necessary </li></ul></ul><ul><ul><li>Solution: Knowledge Base, but high computational overhead! </li></ul></ul>
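The naming problems listed on this slide are usually tackled by normalizing element names before they are compared. Below is a minimal sketch, assuming a small hand-made abbreviation dictionary and prefix list; both are hypothetical and would in practice come from a knowledge base or a human-made mapping.

```python
import re

# Hypothetical, hand-made lookup tables for illustration only.
ABBREVIATIONS = {"no": "number", "qty": "quantity", "amt": "amount"}
PREFIXES = {"has", "is", "get"}


def tokenize(name: str) -> list:
    """Split a compound element name into lower-case, expanded tokens."""
    # Acronyms (XML), capitalized or lower-case words (Airport, has), digits.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    tokens = [part.lower() for part in parts]
    if len(tokens) > 1 and tokens[0] in PREFIXES:
        tokens = tokens[1:]  # drop a leading technical prefix such as "has"
    return [ABBREVIATIONS.get(token, token) for token in tokens]


print(tokenize("ArrivalAirportIn"))
print(tokenize("hasName"))
print(tokenize("Invoice_No"))
```

The resulting token lists can then be compared with set-based measures, but misspellings and freely invented words still escape such rules, which is why human interaction remains necessary.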
Research Approach <ul><li>Domain focus: Mapping of Document standards derived from UN/CEFACT Core Component Technical Specification (CCTS) </li></ul><ul><li>Support OAGIS 9.1, GS1 XML, UBL 2.0 and UN/CEFACT CCL 07B as common ancestor </li></ul><ul><li>Specialist approach: Reuse semantic knowledge instead of re-creating already existing knowledge </li></ul><ul><ul><li>Automatically inferable semantic knowledge is in data types, structural similarity, element names. </li></ul></ul><ul><li>Feed an inference engine to create the upper ontology </li></ul><ul><li>The „harmonized“ upper ontology contains the relationship between document artifacts </li></ul>
Results <ul><li>Explicit rules incorporated in inference process </li></ul><ul><li>Heuristics to Discover Structurally Different </li></ul><ul><ul><li>Association Document Component and Basic Document Component Pairs </li></ul></ul><ul><ul><li>Different Basic Document Component </li></ul></ul><ul><ul><li>Association Document Components </li></ul></ul><ul><li>Recall rate for a domain specific mapper is higher than for one relying on automatic inference: </li></ul><ul><ul><li>Success rate in identifying UBL ABIEs to GS1 XML ABIEs 88.1% </li></ul></ul><ul><ul><li>False positive hits ~ 10% </li></ul></ul><ul><li>Repository of XSLT mappings as cache </li></ul><ul><li>GUI: Semantic Interoperability Service Utility (“ISU”) </li></ul><ul><li>Tryout at http://184.108.40.206:9090/ISU/web </li></ul><ul><li>OASIS SET TC at http://www.oasis-open.org/committees/set/ </li></ul>
Conclusion and outlook <ul><li>Targeted towards SMEs to overcome different communication standards in different domains </li></ul><ul><ul><li>RosettaNet vs. OAGIS vs. HL7 vs. … </li></ul></ul><ul><li>Current implementation focuses on CCL 07B derivatives </li></ul><ul><ul><li>But expandable model! </li></ul></ul><ul><li>Applicability beyond SMEs and Supply Chain / Invoicing </li></ul><ul><ul><li>Northern European Subset of UBL (NES) </li></ul></ul><ul><ul><li>Cooperation on e-commerce and e-procurement </li></ul></ul><ul><ul><li>Purpose is to facilitate harmonization of different types of e-procurement documents in countries that are already using UBL </li></ul></ul><ul><ul><li>Consequence: Data exchange between NES and OAGIS, UBL, HL7 … a use case for SET! </li></ul></ul>
THANK YOU! – Questions? <ul><li>Links: </li></ul><ul><li>http://www.srdc.metu.edu.tr/iSURF/OASIS-SET-TC/tools/ISU-latest.zip </li></ul><ul><li>http://220.127.116.11:9090/ISU/web </li></ul><ul><li>http://www.oasis-open.org/committees/set/ </li></ul><ul><li>http://www.oasis-open.org/committees/download.php/32369/20090504SemanticRepresentationOfDocumentArtifacts.pdf </li></ul><ul><li>http://www.oasis-open.org/committees/download.php/33577/SET-TC.odp </li></ul><ul><li>Gerti Kappel, Horst Kargl, Gerhard Kramler, Andrea Schauerhuber, Martina Seidl, Michael Strommer, and Manuel Wimmer, “Matching Metamodels with Semantic Systems - An Experience Report,” Mainz, 2007, pp. 38-52. </li></ul><ul><li>Fabien Duchateau and Zohra Bellahsène, “Designing a Benchmark for the Assessment of XML Schema Matching Tools,” Vienna, Austria: ACM, 2007. </li></ul><ul><li>Hong-Hai Do and Erhard Rahm, “Matching large schemas: Approaches and evaluation,” Science Direct, 2007, pp. 857-885. </li></ul>