Fusing semantic data

Transcript

  • 1. Fusing automatically extracted annotations for the Semantic Web. Andriy Nikolov, Knowledge Media Institute, The Open University
  • 2. Outline
    • Motivation
    • Handling fusion subtasks
      • problem-solving method approach
    • Processing inconsistencies
      • applying uncertainty reasoning
    • Overcoming schema heterogeneity
      • Linked Data scenario
  • 3. Database scenario
    • Classical scenario (database domain)
      • Merging information from datasets containing partially overlapping information
      Dataset A:
        Name       | Year of birth | Address | E-mail
        H. Schmidt | 1972          | …       | [email_address]
        J. Smith   | 1983          | …       | [email_address]
      Dataset B:
        Name          | Year of birth | Job position | E-Mail
        Schmidt, Hans | 1973          | …            | [email_address]
        Wen, Zhao     | 1980          | …            | [email_address]
  • 4. Database scenario
    • Coreference resolution (record linkage)
      • Resolving ambiguous identities
    [Figure: the two overlapping person tables from slide 3, with the ambiguous records “H. Schmidt, 1972” and “Schmidt, Hans, 1973” to be linked]
  • 5. Database scenario
    • Inconsistency resolution
      • Handling contradictory pieces of data
    [Figure: the two overlapping person tables from slide 3, with contradictory values for the linked records (e.g. year of birth 1972 vs. 1973)]
  • 6. Semantic data scenario
    • Database domain:
      • A record belongs to a single table
      • Table structure defines relevant attributes
      • Inconsistency of values
    • Semantic data:
      • Classes are organised into hierarchies
      • One individual may belong to several classes
      • Available properties depend on the level of granularity
      • Other types of inconsistencies are possible
        • E.g., class disjointness
    [Figure: ontology fragment – foaf:Person / sweto:Person with foaf:name and foaf:mbox (xsd:string); sweto:Researcher with sweto:lives_in sweto:Place, sweto:affiliated_with sweto:Organization, and some:has_degree (xsd:string)]
  • 7. Motivating scenario – X-Media
    [Figure: annotation pipeline – text, images and other data from internal corporate reports (Intranet) and pre-defined public sources (WWW) are annotated into RDF and fused by KnoFuss, guided by a domain ontology, into a knowledge base]
  • 8. Outline
    • Motivation
    • Handling fusion subtasks
      • problem-solving method approach
    • Processing inconsistencies
      • applying uncertainty reasoning
    • Overcoming schema heterogeneity
      • Linked Data scenario
  • 9. Handling fusion subtasks
    • For each subtask, several available methods exist
    • Example: coreference resolution
      • Aggregated attribute similarity
        • [Fellegi&Sunter 1969]
      • String similarity
        • Levenshtein, Jaro, Jaro-Winkler
      • Machine learning
        • Clustering
        • Classification
      • Rule-based
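String-similarity matchers compare attribute values such as names or titles directly. As a minimal, self-contained illustration of one of the metrics named above (plain Levenshtein edit distance with a normalised score; not the implementation used in KnoFuss):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]

def similarity(a: str, b: str) -> float:
    """Normalise the distance into a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

A matcher would then compare the score for a chosen attribute (e.g. `rdfs:label`) against a threshold, which is exactly the kind of configuration choice discussed on the next slide.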
  • 10. Handling fusion subtasks
    • All methods have their pros and cons
      • Rule-based
        • High precision
        • Restricted to a specific domain
      • Machine learning
        • Require sufficient training data
      • String similarity
        • Lower precision
        • Still need configuration (e.g., distance metric, threshold, set of attributes to include)
    • Trade-off between the quality of results and applicability range
      • better precision requires more domain-specific knowledge
  • 11. Problem-solving method approach
    • Fusion task is decomposed into subtasks
    • Algorithms defined as methods solving a particular task
    • Each method is formally described using the fusion ontology
      • Task handled by the method
      • Applicability criteria
      • Domain knowledge required
      • Reliability of output
    • Methods are selected based on their capabilities
  • 12. KnoFuss architecture
    • Method library
      • Contains implementation of each technique for specific subtasks (problem-solving method [Motta 1999])
    • Fusion ontology
      • Describes method capabilities
      • Defines intermediate structures (mappings, conflict sets, etc.)
    [Figure: KnoFuss architecture – new data flows through CoreferenceResolutionMethod, ConflictDetectionMethod and ConflictResolutionMethod from the method library; intermediate data is kept in a fusion KB before the main KB is updated]
  • 13. Task decomposition
    [Figure: the knowledge fusion task (source KB → fused target KB) decomposes into coreference resolution (with model configuration and link discovery) and knowledge base updating (with dependency identification and dependency resolution)]
  • 14. Method selection
    [Figure: adaptive learning matcher with three application contexts – owl:Thing (any datatypeProperty ?x, reliability 0.4), sweto:Publication (rdfs:label ?x, sweto:year ?y, reliability 0.8) and sweto:Article (rdfs:label ?x, sweto:year ?y, sweto:journal ?z, sweto:volume ?a, reliability 0.9)]
    • Depends on:
      • Range of applicability
      • Reliability
    • Configuration parameters
      • Generic (for individuals of unknown types)
      • Context-dependent
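Capability-based selection can be sketched as picking, for a given class, the most reliable configuration whose application context covers that class. `MethodContext`, `LIBRARY` and `PARENT` below are hypothetical stand-ins for the fusion-ontology descriptions, not the actual KnoFuss API:

```python
from dataclasses import dataclass

@dataclass
class MethodContext:
    """One application context of a method (illustrative structure)."""
    method: str
    context_class: str      # class the configuration applies to
    reliability: float

# Hypothetical library mirroring slide 14: a generic configuration
# plus increasingly specific, more reliable ones.
LIBRARY = [
    MethodContext("adaptive-matcher", "owl:Thing", 0.4),
    MethodContext("adaptive-matcher", "sweto:Publication", 0.8),
    MethodContext("adaptive-matcher", "sweto:Article", 0.9),
]

# Toy subsumption hierarchy (child -> parent); a real system would ask a reasoner.
PARENT = {"sweto:Article": "sweto:Publication", "sweto:Publication": "owl:Thing"}

def applicable(ctx: MethodContext, cls: str) -> bool:
    """A configuration applies to its context class and all subclasses."""
    while cls is not None:
        if cls == ctx.context_class:
            return True
        cls = PARENT.get(cls)
    return False

def select(cls: str) -> MethodContext:
    """Pick the applicable configuration with the highest reliability."""
    candidates = [c for c in LIBRARY if applicable(c, cls)]
    return max(candidates, key=lambda c: c.reliability)
```

For `sweto:Article` individuals this picks the most specific context (reliability 0.9), while individuals of unknown type fall back to the generic `owl:Thing` configuration.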
  • 15. Using class hierarchy
    • Configuring machine-learning methods:
      • Using training instances for a subclass to learn generic models for superclasses
    [Figure: class hierarchy owl:Thing → foaf:Person / foaf:Document → sweto:Publication → sweto:Article, sweto:Article_in_Proceedings, with properties name, label, year, journal_name, volume, book_title; the same training individuals Ind1–Ind3 appear with {label, year, book_title} at the subclass level, {label, year} at sweto:Publication, and {label} at owl:Thing]
  • 16. Using class hierarchy
    • Configuring machine-learning methods:
      • Combining training instances for subclasses to learn a generic model for a superclass
    [Figure: training individuals Ind1–Ind3 for sweto:Article_in_Proceedings ({label, year, book_title}) and Ind4–Ind6 for sweto:Article ({label, year, journal_name, volume}) are combined into sweto:Publication training data over the shared attributes {label, year}]
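The pooling idea above can be sketched as: a model for the superclass may only use the attributes that every subclass's training instances provide. The function name and data layout below are illustrative assumptions:

```python
def shared_attributes(training_sets):
    """Given training instances grouped by subclass (each instance
    represented by its set of populated attributes), return the
    attributes usable for a model of the common superclass: only
    those present in every training instance survive."""
    common = None
    for instances in training_sets.values():
        for attrs in instances:
            common = set(attrs) if common is None else common & set(attrs)
    return common or set()
```

With the attribute sets from slide 16, pooling `sweto:Article` and `sweto:Article_in_Proceedings` instances leaves `{label, year}` for the `sweto:Publication` model.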
  • 17. Outline
    • Motivation
    • Handling fusion subtasks
      • problem-solving method approach
    • Processing inconsistencies
      • applying uncertainty reasoning
    • Overcoming schema heterogeneity
      • Linked Data scenario
  • 18. Data quality problems
    • Causes of inconsistency
      • Data errors
        • Obsolete data
        • Mistakes of manual annotators
        • Errors of information extraction algorithms
      • Coreference resolution errors
        • Automatic methods are not 100% reliable
    • Applying uncertainty reasoning
      • Estimated reliability of separate pieces of data
      • Domain knowledge defined in the ontology
  • 19. Refining fused data
    • Additional evidence:
      • Ontological schema restrictions
        • Disjointness
        • Cardinality
      • Neighborhood graph
        • Mappings between related entities
      • Provenance
        • Uncertainty of candidate mappings
        • Uncertainty of data statements
        • “Cleanness” of data sources
  • 20. Dempster-Shafer theory of evidence
    • Bayesian probability theory:
      • Assigns probabilities to atomic alternatives:
        • p(true) = 0.6 ⇒ p(false) = 0.4
      • Sometimes hard to assign
      • Negative bias:
        • An extraction confidence below 0.5 acts as negative evidence rather than merely insufficient evidence
    • Dempster-Shafer theory:
    • Assigning confidence degrees (masses) to sets of alternatives
      • m({true}) = 0.6
      • m({false}) = 0.1
      • m({true;false})=0.3
    support ≤ probability ≤ plausibility
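With a mass assignment like the one above, support (belief) and plausibility bound the unknown probability. A minimal sketch using frozensets over the frame {true, false}; this is a generic textbook computation, not KnoFuss code:

```python
def belief(masses, hypothesis):
    """Belief (support): total mass committed to subsets of the hypothesis."""
    return sum(m for s, m in masses.items() if s <= hypothesis)

def plausibility(masses, hypothesis):
    """Plausibility: total mass of focal sets intersecting the hypothesis."""
    return sum(m for s, m in masses.items() if s & hypothesis)

# Masses from the slide over the frame {True, False}.
m = {frozenset({True}): 0.6,
     frozenset({False}): 0.1,
     frozenset({True, False}): 0.3}
```

Here `belief(m, frozenset({True}))` is 0.6 and `plausibility(m, frozenset({True}))` is 0.9, so 0.6 ≤ p(true) ≤ 0.9.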
  • 21. Dependency detection
    • Identifying and localizing conflicts
      • Using formal diagnosis [Reiter 1987] in combination with standard ontological reasoning
    [Figure: conflict example – Paper_10 has rdf:type Article and rdf:type Proceedings (Article owl:disjointWith Proceedings), hasYear 2007 and hasYear 2006 (hasYear is an owl:FunctionalProperty), and authors E. Motta and V.S. Uren]
  • 22. Belief networks (cont)
    • Valuation networks [Shenoy and Shafer 1990]
    • Network nodes – OWL axioms
      • Variable nodes
        • ABox statements (I ∈ X, R(I1, I2))
        • One variable – the statement itself
      • Valuation nodes
        • TBox axioms (X ⊔ Y)
        • Mass distribution between several variables (I ∈ X, I ∈ Y, I ∈ X ⊔ Y)
  • 23. Belief networks (cont)
    • Belief network construction
      • Using translation rules
      • Rule antecedents:
        • Existence of specific OWL axioms (one rule per OWL construct)
        • Existence of network nodes
      • Example rule:
        • Explicit ABox statements:
          • IF I ∈ X THEN CREATE N1(I ∈ X)
        • TBox inferencing:
          • IF Trans(R) AND EXIST N1(R(I1, I2)) AND EXIST N2(R(I2, I3)) THEN CREATE N3(Trans(R)) AND CREATE N4(R(I1, I3))
  • 24. Example
    [Figure: network fragment – #Paper_10 rdf:type Article, #Paper_10 rdf:type Proceedings, Article owl:disjointWith Proceedings]
  • 25. Example
    • Variable nodes created: #Paper_10 ∈ Article, #Paper_10 ∈ Proceedings
  • 26. Example
    • Valuation node added for the axiom: Article ⊑ ¬Proceedings
  • 27. Example
    • Initial mass assignments:
      • #Paper_10 ∈ Article: m({true}) = 0.8, m({false}) = 0, m({true;false}) = 0.2
      • #Paper_10 ∈ Proceedings: m({true}) = 0.6, m({false}) = 0, m({true;false}) = 0.4
      • Disjointness valuation: m = 1.0 on the combinations excluding (true, true), m = 0.0 on (true, true)
  • 28. Example
    • Combining the assignments with Dempster’s rule over the joint frame:
      • m((true, false)) = 0.62, m((false, true)) = 0.23, m(remaining combinations) = 0.15
  • 29. Example
    • Propagated marginal masses:
      • #Paper_10 ∈ Article: m({true}) = 0.62, m({false}) = 0.23, m({true;false}) = 0.15
      • #Paper_10 ∈ Proceedings: m({true}) = 0.23, m({false}) = 0.62, m({true;false}) = 0.15
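The numbers on slides 27–29 can be reproduced with a small implementation of Dempster's rule on the joint frame of the two statements. This is an illustrative sketch, not the KnoFuss belief-propagation engine:

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule: intersect focal sets, renormalise the conflict away."""
    out, conflict = {}, 0.0
    for (s1, w1), (s2, w2) in product(m1.items(), m2.items()):
        inter = s1 & s2
        if inter:
            out[inter] = out.get(inter, 0.0) + w1 * w2
        else:
            conflict += w1 * w2
    return {s: w / (1.0 - conflict) for s, w in out.items()}

# Joint frame: pairs (Paper_10 in Article, Paper_10 in Proceedings).
frame = frozenset(product([True, False], repeat=2))

def cylinder(var, value):
    """Extend a mass committed to one variable onto the joint frame."""
    return frozenset(p for p in frame if p[var] == value)

m_article = {cylinder(0, True): 0.8, frame: 0.2}   # slide 27
m_proc    = {cylinder(1, True): 0.6, frame: 0.4}   # slide 27
# Disjointness valuation: all mass on combinations other than (True, True).
m_disjoint = {frozenset(p for p in frame if p != (True, True)): 1.0}

joint = combine(combine(m_article, m_proc), m_disjoint)

def marginal(m, var):
    """Project the combined joint masses back onto a single statement."""
    out = {}
    for s, w in m.items():
        proj = frozenset(p[var] for p in s)
        out[proj] = out.get(proj, 0.0) + w
    return out
```

Marginalising the combined masses gives m({true}) ≈ 0.62 for #Paper_10 ∈ Article and m({true}) ≈ 0.23 for #Paper_10 ∈ Proceedings, matching the slides.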
  • 30. Belief propagation
    • Translating subontology into a belief network
      • Using provenance and confidence values of data statements
      • Coreferencing algorithm precision for owl:sameAs mappings
    • Data refinement:
      • Detecting spurious mappings
      • Removing unreliable data statements
    [Figure: belief network for the candidate mapping Ind1 = Ind2 with statements Article(Ind1), in_Proc(Ind2) and year values 2006/2007, under the axioms Article ⊑ ¬in_Proc and Functional(year); (support; plausibility) intervals before/after propagation: (0.99;1.0)→(0.97;0.98), (0.9;1.0)→(0.74;0.82), (0.92;1.0)→(0.2;0.21), (0.85;1.0)→(0.72;0.85), (0.95;1.0)→(0.91;0.96)]
  • 31. Neighbourhood graph
    • Non-functional relations: varying impact
    [Figure: two neighbourhood examples – papers Paper_10 and Paper_11 (Proceedings) linked by hasAuthor to “H. Schmidt” and “Schmidt, Hans” (Person), with owl:sameAs confidences 0.9 and 0.3; and persons “H. Schmidt” and “Schmidt, Hans” both citizen_of Germany (Country), with owl:sameAs confidences 1.0 and 0.3 – the impact of the shared neighbour varies with the relation]
  • 32. Neighborhood graph
    • Implicit relations: set co-membership
    [Figure: co-authorship example – the mappings “Bard, J.B.L.” = “Jonathan Bard” and “Webber, B.L.” = “Bonnie L. Webber” reinforce each other through Coauthor(Person11, Person22), Coauthor(Person21, Person22) and Coauthor(Person12, Person22) statements; beliefs 0.84/(0.86;1.0), 0.16/(0.83;1.0), 1.0/(1.0;1.0)]
  • 33. Provenance
    • Initial belief assignments:
      • Data statements (source AND/OR extractor confidence)
      • Candidate mappings (precision of attribute similarity algorithms)
      • Source “cleanness” – contains duplicates or not
    [Figure: “Arlington” matches both “Arlington, Virginia” (0.95) and “Arlington, Texas” (0.9); with Arl_Va ≠ Arl_Tx held at 1.0, propagation lowers the beliefs to (0.65;0.69) and (0.31;0.35)]
  • 34. Experiments
    • Datasets:
      • Publication datasets
        • AKT
        • Rexa
        • SWETO-DBLP
      • Cora
        • database community benchmark
        • translated into RDF
        • 2 versions used
          • different structure
          • different gold standard
  • 35. Experiments
    • Publication individuals
      • Ontological restrictions mainly influence precision
      Dataset   | No | Matcher         | Prec. | Recall | F1    | Prec. | Recall | F1
      (the first Prec./Recall/F1 group is the baseline matcher, the second after refinement)
      Cora (I)  | 7  | Monge-Elkan     | 0.735 | 0.931  | 0.821 | 0.939 | 0.836  | 0.884
      Cora (I)  | 6  | L2 Jaro-Winkler | 0.546 | 0.982  | 0.702 | 0.823 | 0.981  | 0.895
      Cora (I)  | 4  | L2 Jaro-Winkler | 0.389 | 0.984  | 0.558 | 0.838 | 0.983  | 0.905
      Cora (II) | 8  | Monge-Elkan     | 0.698 | 0.986  | 0.817 | 0.958 | 0.956  | 0.957
      Rexa/DBLP | 5  | Jaro-Winkler    | 0.899 | 0.933  | 0.916 | 0.944 | 0.932  | 0.938
      AKT/DBLP  | 3  | Jaro-Winkler    | 0.922 | 0.952  | 0.937 | 0.992 | 0.952  | 0.971
      AKT/DBLP  | 2  | L2 Jaro-Winkler | 0.879 | 0.956  | 0.916 | 0.923 | 0.956  | 0.939
      AKT/Rexa  | 1  | Jaro-Winkler    | 0.950 | 0.833  | 0.887 | 0.969 | 0.832  | 0.895
  • 36. Experiments
    • Person individuals
      • Evidence coming from the neighborhood graph
      • Mainly influences recall
      Dataset   | No | Matcher         | Prec. | Recall | F1    | Prec. | Recall | F1
      (the first Prec./Recall/F1 group is the baseline matcher, the second after refinement)
      Cora (I)  | 10 | L2 Jaro-Winkler | 0.983 | 0.879  | 0.928 | 0.981 | 0.895  | 0.936
      Cora (II) | 11 | L2 Jaro-Winkler | 0.999 | 0.994  | 0.997 | 0.999 | 0.994  | 0.997
      Rexa/DBLP | 9  | Jaro-Winkler    | 0.965 | 0.755  | 0.846 | 0.968 | 0.876  | 0.920
      AKT/DBLP  | 8  | L2 Jaro-Winkler | 0.532 | 0.746  | 0.621 | 0.583 | 0.921  | 0.714
      AKT/Rexa  | 7  | L2 Jaro-Winkler | 0.738 | 0.888  | 0.806 | 0.788 | 0.935  | 0.855
  • 37. Outline
    • Motivation
    • Handling fusion subtasks
      • problem-solving method approach
    • Processing inconsistencies
      • applying uncertainty reasoning
    • Overcoming schema heterogeneity
      • Linked Data scenario
  • 38. Advanced scenario
    • Linked Data cloud: network of public RDF repositories [Bizer et al. 2009]
    • Added value: coreference links (owl:sameAs)
  • 39. Data linking: current state
    • Automatic instance matching algorithms
      • SILK, ODDLinker, KnoFuss, …
    • Pairwise matching of datasets
      • Requires significant configuration effort
    • Transitive closure of links
      • Use of “reference” datasets
  • 40. Reference datasets
  • 41. Problems
    • Transitive closures often incomplete
      • Reference dataset is incomplete
      • Missing intermediate links
      • Direct comparison of relevant datasets is desirable
    • Schema heterogeneity
      • Which instances to compare?
      • Which properties are relevant?
    [Figure: datasets A and B are each linked to a reference dataset but not directly to each other]
  • 42. Schema matching
    • Interpretation mismatches
      • dbpedia:Actor = professional actor
      • movie:actor = anybody who participated in a movie
    • Class interpretation “as used” vs “as designed”
      • FOAF: foaf:Person = any person
      • DBLP: foaf:Person = computer scientist
    • Instance-based ontology matching
      Repository            | David Garrick | Richard Nixon
      DBPedia dbpedia:Actor | +             | -
      LinkedMDB movie:Actor | -             | +
  • 43. KnoFuss – enhanced
    [Figure: extended architecture – the source KB is matched to the target KB via ontology matching and SPARQL query translation, instances are transformed, then coreference resolution, dependency resolution and knowledge fusion feed knowledge base integration and ontology integration]
  • 44. Schema matching
    • Step 1: inferring schema mappings from pre-existing instance mappings
    • Step 2: utilizing schema mappings to produce new instance mappings
    [Figure: step 1 lifts pre-existing instance mappings between Dataset 1 and Dataset 2 into schema mappings between Ontology 1 and Ontology 2; step 2 applies those schema mappings to produce new instance mappings]
  • 45. Overview
    • Background knowledge:
      • Data-level (intermediate repositories)
      • Schema-level (datasets with more fine-grained schemas)
  • 46. Algorithm
    • Step 1:
      • Obtaining transitive closure of existing mappings
    LinkedMDB movie:music_contributor/2490 = MusicBrainz music:artist/a16…9fdf = DBPedia dbpedia:Ennio_Morricone
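The transitive closure in step 1 can be sketched with a union-find structure that groups URIs connected by chains of owl:sameAs links into identity clusters. The URIs in the usage example are toy placeholders, and this is my illustration rather than the KnoFuss code:

```python
def transitive_closure(links):
    """Group URIs into identity clusters so that owl:sameAs links
    chained through intermediate datasets end up in one cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)           # union the two clusters

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

For example, links (a, b) and (b, c) yield the single cluster {a, b, c}, mirroring how the LinkedMDB contributor and the DBPedia resource are connected through MusicBrainz.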
  • 47. Algorithm
    • Step 2: Inferring class and property mappings
      • ClassOverlap and PropertyOverlap mappings
      • Confidence(classes A, B) = |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|) (overlap coefficient)
      • Confidence(properties r1, r2) = |c(X)| / |c(Y)|
        • X – identity clusters with equivalent values of r1 and r2
        • Y – all identity clusters which have values for both r1 and r2
    [Figure: the identity cluster movie:music_contributor/2490 = music:artist/a16…9fdf = dbpedia:Ennio_Morricone provides evidence for a mapping between movie:music_contributor and dbpedia:Artist]
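The overlap-coefficient confidence for class mappings can be sketched as follows; the identity clusters and URIs in the usage example are toy stand-ins, not the actual datasets:

```python
def class_overlap(instances_a, instances_b, clusters):
    """Overlap coefficient between two classes, computed over identity
    clusters: |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|), where c(·) maps each
    class's instances onto their identity clusters."""
    # Map each instance URI to a cluster representative (itself if unlinked).
    rep = {uri: i for i, cluster in enumerate(clusters) for uri in cluster}
    ca = {rep.get(u, u) for u in instances_a}
    cb = {rep.get(u, u) for u in instances_b}
    if not ca or not cb:
        return 0.0
    return len(ca & cb) / min(len(ca), len(cb))
```

Using min(|c(A)|, |c(B)|) rather than the union makes the score robust when one class is much smaller than the other, which fits the "as used" interpretation mismatches from slide 42.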
  • 48. Algorithm
    • Step 3: Inferring data patterns
      • Functionality restrictions
      • IF two equivalent movies do not have overlapping actors AND have different release dates THEN break the equivalence link
    • Note:
      • Only usable if not taken into account at the initial instance matching stage
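The pattern in step 3 can be sketched as a filter over candidate owl:sameAs links; the data structures below are illustrative assumptions, not the KnoFuss representation:

```python
def filter_movie_links(mappings, actors, release_date):
    """Apply the rule from the slide: if two supposedly equivalent
    movies share no actors AND have different release dates, set the
    owl:sameAs link aside as potentially spurious."""
    kept, spurious = [], []
    for a, b in mappings:
        no_shared_actors = not (actors.get(a, set()) & actors.get(b, set()))
        dates_differ = (a in release_date and b in release_date
                        and release_date[a] != release_date[b])
        if no_shared_actors and dates_differ:
            spurious.append((a, b))
        else:
            kept.append((a, b))
    return kept, spurious
```

Note that the rule only discriminates if the initial matcher ignored actors and dates; otherwise every surviving link already satisfies it, which is the caveat on the slide.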
  • 49. Algorithm
    • Step 4: utilizing mappings and patterns
      • Run instance-level matching for individuals of strongly overlapping classes
      • Use patterns to filter out existing mappings
    • LinkedMDB:
        SELECT ?uri
        WHERE { ?uri rdf:type movie:music_contributor . }
    • DBPedia:
        SELECT ?uri
        WHERE { ?uri rdf:type dbpedia:Artist . }
  • 50. Results
    • Class mappings:
      • Improvement in recall
        • Previously omitted mappings were discovered after direct comparison of instances
    • Data patterns
      • Improved precision
        • Filtered out spurious mappings
        • Identified 140 mappings between movies as “potentially spurious”
        • 132 of these were correctly identified as spurious
    [Results shown for the DBPedia/DBLP, DBPedia/LinkedMDB and DBPedia/BookMashup dataset pairs]
  • 51. Future work
    • From the pairwise scenario to the network of repositories
    • Combining schema and data integration in an efficient way
    • Evaluating data sources
      • Which data source(s) to link to?
      • Which data source(s) to select data from?
  • 52. Questions?
    • Thanks for your attention
  • 53. References
    • [Shenoy and Shafer 1990] P. Shenoy, G. Shafer. Axioms for probability and belief-function propagation. In: Readings in uncertain reasoning. San Francisco: Morgan Kaufmann, pp. 575-610, 1990
    • [Motta 1999] E. Motta. Reusable components for knowledge modelling. Amsterdam: IOS Press, 1999
    • [Bizer et al. 2009] C. Bizer, T. Heath, T. Berners-Lee. Linked Data - the story so far. International Journal on Semantic Web and Information Systems, 5(3):1-22, 2009
    • [Fellegi and Sunter 1969] Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of American Statistical Association, 64(328):1183-1210, 1969
    • [Reiter 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57-95, 1987