  • 1. Data Integration: Meenakshi Nagarajan, PhD Student, KNO.E.SIS, Wright State University
  • 2. The Big Picture
    • Data Integration
      • Integrating data in multiple, possibly heterogeneous information sources
      • Combining databases; Web Portals that syndicate information
    • Several parts of the problem
      • Schema level : matching & mapping
      • Instance level : Reference Reconciliation, Deduplication, Data cleaning etc.
  • 3. Data Integration - The Works
    • Schema level
      • Schema Matching: The process of identifying that two objects are semantically similar
      • Mappings: Transformations required to convert an instance of one schema into an instance of the other
    • Example:
      • DB1 Student (Name, SSN, Level, Major, Marks)
      • DB2 Grad-Student (Name, ID, Major, Grades)
      • Output of schema matching: <Student, Grad-Student>; <SSN, ID>; <Marks, Grades>
      • Possible transformations:
        • Marks to Grades (100-90 A; 90-80 B..)
        • Student to Grad-Student (omit Level field)
        • Grad-Student to Student (include entry for Level field)
  • 4. Data Integration - The Works
    • Schema level
      • Mappings: Transformations required to convert an instance of one schema into an instance of the other
      • Example:
      • DB1 Student (Name, SSN, Level, Major, Marks)
      • DB2 Grad-Student (Name, ID, Major, Grades)
      • Possible transformations:
        • Marks to Grades (100-90 A; 90-80 B..)
        • Student to Grad-Student (omit Level field)
        • Grad-Student to Student (include entry for Level field)
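The transformations listed on these two slides can be sketched in Python. This is only an illustration, not the presenter's implementation: the class and function names are hypothetical, treating a mark of exactly 90 as an A is an assumption (the slide's ranges overlap at the boundary), grade bands below 80 are left undefined because the slide does not give them, and the default `level` value is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Student:  # DB1: Student (Name, SSN, Level, Major, Marks)
    name: str
    ssn: str
    level: str
    major: str
    marks: Optional[int]


@dataclass
class GradStudent:  # DB2: Grad-Student (Name, ID, Major, Grades)
    name: str
    id: str
    major: str
    grades: Optional[str]


def marks_to_grade(marks: int) -> Optional[str]:
    """Marks to Grades. Only the A and B bands appear on the slide."""
    if marks >= 90:  # assumption: 90 itself counts as an A
        return "A"
    if marks >= 80:
        return "B"
    return None  # bands below 80 are not given on the slide


def student_to_grad(s: Student) -> GradStudent:
    """Student to Grad-Student: SSN maps to ID, Level is omitted."""
    return GradStudent(s.name, s.ssn, s.major, marks_to_grade(s.marks))


def grad_to_student(g: GradStudent, level: str = "Graduate") -> Student:
    """Grad-Student to Student: a Level entry must be supplied.

    A letter grade cannot be turned back into exact marks, so marks stay None.
    """
    return Student(g.name, g.id, level, g.major, None)
```

Note that the two directions are not symmetric: going from Student to Grad-Student loses information (Level, exact marks), so the reverse mapping has to supply or leave out those values.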
  • 5. Data Integration - The Works
    • Instance level
      • Reference reconciliation: Reconciling multiple references to the same entity
        • While integrating similar or heterogeneous domains
      • Deduplication: Detecting and eliminating duplicate records referring to the same entity
        • Record linkage
    • Hard because of several data-level inconsistencies
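As a rough illustration of deduplication, the sketch below groups records whose names normalize to the same key. The normalization rule (lowercase, drop punctuation, sort the tokens) and the function names are assumptions made for this example; real record linkage typically needs fuzzier comparison than exact key equality.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Build an order-insensitive key: lowercase tokens, punctuation dropped."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def deduplicate(records):
    """Group records that share a normalized name; each group is one entity."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize(record["name"])].append(record)
    return list(groups.values())


records = [
    {"name": "Libby Miller", "source": "FOAF"},
    {"name": "Miller, Libby", "source": "DBLP"},
    {"name": "Amit Sheth", "source": "DBLP"},
]
clusters = deduplicate(records)
```

Here the first two records fall into one cluster and the third into its own. Exact-key grouping of this kind misses misspellings and abbreviated names, which is part of why the instance-level problem is hard.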
  • 6. Why is Data Integration Hard?
    • Data models created by different people, for different purposes, evolved differently over time
    • Various heterogeneities
      • Model / Representation : relational vs. network vs. hierarchical models
      • Structural / schematic :
        • Domain Incompatibilities
        • Entity Definition Incompatibilities
        • Data Value Incompatibilities
        • Abstraction level Incompatibilities
    (Sheth/Kashyap 1992, Kim/Seo 1993, Kashyap/Sheth 1996)
  • 7. Disambiguation : A Conflict of Interest Application
    • Should Arpinar review Verma’s paper?
    • [Figure: relationship graph with nodes Verma, Sheth, Miller, Aleman-M., Thomas, Arpinar]
  • 8. Our experiences
    • Reference reconciliation in structured information spaces
      • Disambiguating 2 real world datasets
      • Social Networking - FOAF
      • Bibliography - DBLP
    • Used in a ‘conflict-of-interest’ detection conference management application
      • B. Aleman-Meza et al. "Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection", WWW 2006 (Nominated for Best Paper Award)
  • 9. FOAF, DBLP schemas
    • To disambiguate
      • FOAF vs. FOAF
      • DBLP vs. DBLP
      • FOAF vs. DBLP
    • Heterogeneous schemas
  • 10. Reference Reconciliation
    • Nature of the dataset
      • Entities have multiple entries within and across datasets
        • <rdf:Description rdf:about="#4_0.14956114797143916">
        • <rdfs:URL xml:lang="en"></rdfs:URL>
        • <rdfs:label xml:lang="en">Libby Miller</rdfs:label>
        • <sweto:foaf_homepage></sweto:foaf_homepage>
        • </rdf:Description>
        • <rdf:Description rdf:about="#4_0.7134997862220993">
        • <rdfs:URL xml:lang="en">genid:libby</rdfs:URL>
        • <rdfs:label xml:lang="en">Libby Miller</rdfs:label>
        • <sweto:foaf_workplaceHomepage></sweto:foaf_workplaceHomepage>
        • <sweto:foaf_workplaceHomepage></sweto:foaf_workplaceHomepage>
        • <sweto:friend rdf:resource="#4_0.9087751274968384"/>
        • <sweto:foaf_workplaceHomepage></sweto:foaf_workplaceHomepage>
        • <sweto:foaf_mbox></sweto:foaf_mbox>
        • </rdf:Description>
      • Task at hand is to identify if the entities refer to the same real world entity
      • FOAF and DBLP are datasets from very different domains
        • Comparable attributes are few
        • Most of the FOAF dataset is created by the entities themselves; more susceptible to spelling errors and incomplete profiles
  • 11. Dataset Statistics
    • FOAF
      • Total number of entities 476419
      • Total number of comparable entities 29357
    • DBLP
      • Total number of entities 44358
      • Total number of comparable entities 17885
    • DBLP vs. FOAF
      • Total number of entities 5354
      • Total number of comparable entities 3633
  • 12. Salient Features of the Algorithm
    • Building contexts
      • A context comprises the atomic attributes of an entity and the other entities it is related to (up to a certain degree of separation)
    • Weighted Entity Relationship graphs
      • Relationships weighted by
        • importance to domain
        • Number of entities participating in the relationship
    • Rules that implicitly encode the disambiguation / similarity function
      • Given a pair of entities determine confidence in similarity
      • Three classes of similarity: SAMEAS, AMBIGUOUS, NOTSAMEAS
    • Using past reconciliation decisions (expensive but tunable)
      • Adapted from Dong, X., Halevy, A., and Madhavan, J. 2005. Reference reconciliation in complex information spaces. ACM SIGMOD 2005
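To make the idea of a weighted, rule-based similarity function concrete, here is a minimal sketch. It is not the algorithm from the talk or from Dong et al.: the attribute names, the weights, the thresholds and the scoring rule (weighted share of comparable attributes that agree) are all assumptions chosen for illustration. Only the three output classes come from the slide.

```python
from enum import Enum


class Verdict(Enum):
    SAMEAS = "sameAs"
    AMBIGUOUS = "ambiguous"
    NOTSAMEAS = "notSameAs"


# Hypothetical weights: how much agreement on each attribute counts.
WEIGHTS = {"mbox": 1.0, "homepage": 0.8, "label": 0.4, "friend": 0.3}


def similarity(a, b, weights=WEIGHTS):
    """Weighted share of comparable attributes on which two contexts agree.

    a and b map attribute names to sets of values. An attribute is comparable
    only if both contexts have a value for it. Returns None if nothing is
    comparable.
    """
    comparable = [k for k in weights if a.get(k) and b.get(k)]
    if not comparable:
        return None
    total = sum(weights[k] for k in comparable)
    agree = sum(weights[k] for k in comparable if a[k] & b[k])
    return agree / total


def classify(a, b, high=0.75, low=0.25):
    """Map a similarity score to one of the three classes (assumed thresholds)."""
    score = similarity(a, b)
    if score is None:
        return Verdict.AMBIGUOUS  # not enough information to decide
    if score >= high:
        return Verdict.SAMEAS
    if score <= low:
        return Verdict.NOTSAMEAS
    return Verdict.AMBIGUOUS
```

For example, two contexts that share a label but have different mailboxes score 0.4 / 1.4 under these weights and land in the ambiguous class, while two contexts that share both label and mailbox are classified as sameAs.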
  • 13. In some detail..
    • 1. Domain knowledge + Schema information + Statistical information on dataset + Rules → Disambiguation Function
    • 2. INDEX the DATASET into Groups g1, g2 .. gx
    • 3. Run samples of g1, g2.. through the disambiguation function
    • 4. Evaluate results (sameas, ambiguous, notsameas) + alter disambiguation function
    • 5. Repeat Steps 3 and 4 till user satisfied with disambiguation results
    • End of this exercise, we have a disambiguation function and a whole dataset to disambiguate
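The indexing step on this slide, which splits the dataset into groups g1, g2 .. gx, can be sketched as follows. The grouping key used here (the last token of the lowercased label) is an assumption for illustration; the slide does not say which key the index uses.

```python
from collections import defaultdict
from itertools import combinations


def index_into_groups(entities, key):
    """Assign each entity to a group according to key(entity)."""
    groups = defaultdict(list)
    for entity in entities:
        groups[key(entity)].append(entity)
    return groups


def candidate_pairs(groups):
    """Yield only the pairs that fall inside the same group."""
    for members in groups.values():
        yield from combinations(members, 2)


def last_token(label: str) -> str:
    # Hypothetical grouping key: the last word of the label, lowercased.
    return label.lower().split()[-1]


labels = ["Libby Miller", "L. Miller", "Amit Sheth", "John Miller"]
groups = index_into_groups(labels, last_token)
pairs = list(candidate_pairs(groups))
```

With this key the four labels produce three candidate pairs (all within the "miller" group) instead of the six pairs an all-against-all comparison would need. Samples from each group can then be passed to the disambiguation function and the results reviewed.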
  • 14. Results
    • WWW06 results
      • Number of FOAF entities 38,015
      • Number of DBLP entities 21,307
      • Total number of entities 59,322
      • Number of entity pairs to be compared 42,433
      • Number of entity pairs for which a sameAs was established 633
      • Number of entity pairs compared yet without sufficient information to be reconciled 6,347
    • False Positives and False Negatives
      • A false positive in the sameAs set indicates an incorrectly reconciled pair of entities and a false negative in the ambiguous set indicates a pair of entities that should have been reconciled, but were not.
      • With a confidence level of 95%, using this algorithm on this dataset, the proportion of false negatives in any ambiguous set is estimated to be between 2.8% and 7.8%. The proportion of false positives was estimated with the same level of confidence to be between 0.3% and 0.9%.
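Intervals like the ones quoted above are typically obtained by hand-checking a random sample of pairs and computing a confidence interval for the observed error proportion. The sketch below uses the normal approximation; whether the study used this exact method is an assumption, and the sample size in the usage line is invented purely to show the call.

```python
import math
from typing import Tuple


def proportion_ci(errors: int, sample_size: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion.

    z = 1.96 corresponds to a 95% confidence level.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = errors / sample_size
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    # Clamp to [0, 1] because the approximation can overshoot near the ends.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical numbers, not taken from the study: 50 errors in 100 checked pairs.
low, high = proportion_ci(50, 100)
```

The normal approximation is least reliable for very small proportions such as the false positive rate quoted above; a Wilson or exact interval would be a safer choice in that range.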
  • 15. Observations
    • Challenges
      • Asian names
      • Using past reconciliation decisions was extremely useful but also very expensive
        • Total number of comparable entities 29357; significant time complexity
    • Current investigation
      • Performance major concern
      • Accuracy – can always do better
      • Active learning + past reconciliation feedback
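To put the time-complexity remark on this slide in perspective, the snippet below counts how many pairs an all-against-all comparison of the 29357 comparable entities would involve. Treating the cost as one comparison per unordered pair is an assumption; the slide only says the time complexity is significant.

```python
import math

comparable_entities = 29357

# Number of unordered pairs: n * (n - 1) / 2
all_pairs = math.comb(comparable_entities, 2)
print(f"{all_pairs:,} pairwise comparisons")
```

That is roughly 431 million pairs before any reuse of past reconciliation decisions, which re-examines pairs as new evidence arrives.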
  • 16. 2. Our experiences – thus far and here on..
    • Schema Matching and Mapping – quick recap
      • DB and Ontology schema matching techniques overlap significantly
        • Not much advancement since DB schema integration efforts
      • Ontologies formalize the semantics of a domain, but matching is still primarily syntactic / structural.
        • The semantics of ‘named relationships’ is largely unexploited
      • The real semantics lies in the relationships connecting entities
        • Modeled as first class objects in Ontologies
        • In DB, they are not explicit and have to be inferred
  • 17. Ontologies: matching and mapping
    • Using Ontologies to provide integrated semantic access to information sources (unstructured, structured and semi-structured)
    • The kind of relationships that need to be identified between Ontology schemas are different from those identified between database schemas.
      • set membership relationships like overlap / disjointness / exclusion / equivalence / subsumption VS. arbitrary named relationships
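The set membership relationships named above can be read off directly from the instance sets of two classes. The following is a small sketch of that idea with hypothetical function and label names. Arbitrary named relationships such as "causes" cannot be derived this way, which is the contrast the slide draws.

```python
def set_relationship(a: set, b: set) -> str:
    """Classify how the instance sets of two classes relate."""
    if a == b:
        return "equivalence"
    if a < b:
        return "subsumption (B subsumes A)"
    if a > b:
        return "subsumption (A subsumes B)"
    if a & b:
        return "overlap"
    return "disjointness"


grad_students = {"s1", "s2"}
students = {"s1", "s2", "s3"}
staff = {"s3", "e1"}
faculty = {"f1"}
```

With these toy sets, `grad_students` is subsumed by `students`, `students` and `staff` overlap, and `students` and `faculty` are disjoint.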
  • 18. Advancing the State of Art in Schema Matching
    • Discovering simple to complex named relationships:
      • Past matching techniques have exhausted Schema + Instance properties
      • Since Ontology modeling decouples schema + instance base
        • Tremendous opportunity to exploit knowledge present outside the ontology knowledge base
  • 20. A Vision for Ontology Matching
    • SIMPLE TO COMPLEX MATCHES
      • Possible identifiable matches: equivalence / inclusion / overlap / disjointness
      • Possible to identify more complex relationships from the corpus.
    • [Figure labels: Ontologies, Heterogeneous data, Semantic metadata]
    • Sample text: "Today, the Food and Drug Administration (FDA) is announcing that it has asked Pfizer, Inc. to voluntarily withdraw Bextra from the market. Pfizer has agreed to suspend sales and marketing of Bextra in the [...], pending further discussions with the agency."
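  • 21. The Intuition..
    • [Figure: UMLS classes (Biologically active substance, Lipid, Disease or Syndrome) linked by named relationships (causes, affects, complicates)]
    • MeSH instances: Fish Oils and Raynaud’s Disease, attached to the classes by instance_of; the relationship between the two instances is marked ???????
    • PubMed document counts shown: 9284 documents, 4733 documents, 5 documents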
  • 22. Challenges, Open Ended Questions
    • Translating instance level findings to the schema level
      • GOING FROM several discovered relationships like “Deficiency in migraine causes Migraine” TO “substance X causes condition Y”
    • Generating Mappings: not always simple mathematical / string transformations
      • Examples of complex mappings
        • Associations / paths between classes ; Graph based / form fitting functions
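One simple way to think about translating instance-level findings to the schema level is to replace each instance by its class and count how often each class-level pattern recurs. The sketch below does only that; the identifiers and the example triples are toy placeholders, not findings from the corpus, and a real approach would need evidence thresholds and a way to handle instances with several classes.

```python
from collections import Counter


def lift_to_schema(triples, instance_of):
    """Count (subject class, relationship, object class) patterns."""
    patterns = Counter()
    for subject, relationship, obj in triples:
        patterns[(instance_of[subject], relationship, instance_of[obj])] += 1
    return patterns


# Toy placeholders for discovered instance-level relationships.
instance_of = {
    "substance_1": "substance",
    "substance_2": "substance",
    "condition_1": "condition",
    "condition_2": "condition",
}
triples = [
    ("substance_1", "causes", "condition_1"),
    ("substance_2", "causes", "condition_2"),
    ("substance_1", "affects", "condition_2"),
]
patterns = lift_to_schema(triples, instance_of)
```

A pattern such as (substance, causes, condition) that is supported by many instance-level triples becomes a candidate for a schema-level relationship of the form "substance X causes condition Y".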
  • 23. To summarize…
    • The distinction between schema and instances is slowly disappearing
    • Integrating new and external data sources, mining and analyzing them is gaining importance.
    • Tremendous opportunities and challenges in using more information than what is modeled in a schema and captured in an instance base.
    • Need to go beyond well-mannered schemas and knowledge representations; and relatively simpler mappings
  • 24. Digressing..
    • Semantic Document Classification – Investigative work over Summer 06 at Hewlett Packard Research Labs
    • Motivation
      • Storage labs starting to look inside containers
    • Goal
      • To investigate how the use of Ontologies (especially the named semantic relationships) as background knowledge affects the task of document classification
  • 25. Procedure
    • Using a combination of statistical and domain information to alter document term vectors by amplifying weights of discriminative terms
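The idea of amplifying the weights of discriminative terms in a document term vector can be sketched as below. How the discriminative terms were chosen and by how much they were amplified is not stated on the slide, so the term set, the factor and the function name are all assumptions.

```python
def amplify(term_vector, discriminative_terms, factor=2.0):
    """Return a copy of the term vector with selected term weights scaled up."""
    return {
        term: weight * factor if term in discriminative_terms else weight
        for term, weight in term_vector.items()
    }


# Toy document vector (term -> weight) and an assumed set of discriminative terms.
base_vector = {"storage": 0.5, "disk": 0.25, "the": 0.75}
enhanced_vector = amplify(base_vector, {"storage", "disk"}, factor=2.0)
```

In this toy case the weights of "storage" and "disk" double while "the" is left unchanged, so a classifier working on the enhanced vector gives the domain terms more influence.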
  • 26. Preliminary results
    • Comparing classification techniques using the base document vector and a semantically enhanced document vector