2. Annotating the Graph (cont.) Downward propagation: Every expression that a node acquires is propagated to its children ...
2. Annotating the Graph (cont.) Eq. propagation: Every expression that a node acquires is propagated to the nodes related to...
2. Annotating the Graph (cont.) Apply the rules until no more rules can be applied ...
3. Generation of Transformation Queries Generate the query fragment: The for each clause is converted to a query fragment: I...
3. Generation of Transformation Queries Perform a depth-first traversal on the Graph ...
3. Generation of Transformation Queries x0.name ...
Finally we have the Query:
Clio: Conclusion Providing tools that help in automating and managing the problem of Data Conversion The key contributi...
Thanks
Back ups Clio Requirements Complex mappings: using association Definitions:    Mapping language    Paths    Schema&T...
the Clio project- overview of the requirements                                                                            ...
Formalize correspondences   Companies                                           Using tuple generating dependency(tgd):   ...
Correspondences alone are not enough   How individual data values should be connected in the target?   Companies       Nam...
More complex mappings are needed   Companies        Name              v1       Organizations        Address               ...
Yet more complex...   Companies       Name              v1        Organizations                ∀g, r, a, s, m Grants(g,r,a...
Yet more complex...                           Companies                                                  Name            v...
The Mapping Language- Syntax         foreach x1 in g1, . . . , xn in gn         xi in gi (generator)             where B1 ...
Primary and Relative paths Primary path (given a schema root R, that is a first level   element in the schema):     x1 i...
Schema and types A schema: a sequence of labels(roots) each with associated  type, defined by this grammar:              ...
CorrespondencesInformation Systems Group   Leila Jalali, Candidacy Exam
the data exchange problemInformation Systems Group   Leila Jalali, Candidacy Exam
Query generation challenges1. Creation of New Values in the TargetOptional: Null                                          ...
Query generation challenges1. Creation of New Values in the TargetRefrential constraints  Information Systems Group       ...
Query generation challenges2. Grouping Nested elements  Information Systems Group     Leila Jalali, Candidacy Exam
Query generation challenges3. Value Creation interacts with Grouping Information Systems Group                  Leila Jala...
Recursion in XML schemaInformation Systems Group   Leila Jalali, Candidacy Exam
the Chase Given as association, repeatedly applying a chase rule to the "current"  association (initialed as the input on...
Clio: Analysis and Conclusion Termination and Complexity of the Chase:     the Chase with general dependecies may not be...
Clio mapping A Clio mapping:               for each AS exists AT with E    AS , AT : logical associations (on source and...
Structural Association Structural association:   − from P           (with P primary path)                                ...
Nested Referential Integrity (NRI) constraints The basis for discovery of associations: capture relation foreign key and ...
Logical Association Logical association: semantic relationships between schema  elements   Obtained by starting with a s...
Logical Association- the Chase                                                      start with a structural association   ...
Logical Association Relationships A2 dominates A1 (A1 ≤ A2 ) if    the from and where clauses of A1 are subsets of those...
Mapping Generation Algorithm    Inputs: S , T , Correspondences                                         AS : from c in com...

Data Integration


Clio Schema Mapping

Published in: Technology
  • Clio provides tools that help automate and manage data conversion. It uses schema mappings (specifications that describe the relationship between data in two different schemas) to transform data between two representations. Schema mappings are used to generate either a view that reformulates queries (data integration) or code that transforms data (data exchange).
  • Contributions of the paper
  • The example holds information about companies and grants. It uses a nested relational representation, which can express both relational and XML schemas. Schema S is a relational schema with three tables: Companies, Grants and Contacts. A grant has a grant identifier, a recipient (the name of the company that receives the grant) and an amount. The green lines are referential constraints (foreign keys or dependencies). The target is an XML schema in which the funding that an organization receives is nested inside the organization record. The dashed arrows are correspondences: relationships between the schemas that may be produced by a schema matcher or drawn by the user. v1 states that the company name in the first schema corresponds to the organization code in the second schema. There is no line between the two Year elements because they are different concepts: the year the company was founded versus the year of its initial public offering. The approach does not care how the correspondences were created, but it takes into account that matchings are incomplete and sometimes incorrect. For simplicity, these four correspondences are assumed to be correct.
  • A correspondence can be expressed formally as a tuple generating dependency (tgd) using shared variables: for each company there must be an organization whose code is the same as companies.name. All the shared variables are underlined.
  • In foreach xi in gi (a generator), xi is a variable and gi is a set (either a root or a set nested within it). The where clause B1 is a conjunction of equalities over the xi variables. The with clause e1 = e'1 … lists equalities between a source expression and a target expression. The mapping is a source-to-target constraint: "the result of QT (over the target, projected as in the with-clause) must contain the result of QS (over the source, projected as in the with-clause)".
  • Logical association: An association obtained by "chasing" constraints (starting with a structural or a user association) Logical associations are meaningful combinations of correspondences A set of correspondences can be interpreted together if there are two logical associations (one in the source and one in the target) that cover them
  • ∀n,d,y Companies(n,d,y) → ∃y',F Organizations(n,y',F); ∀n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) → ∃y',F,f Organizations(n,y',F), F(g,f); ∀g,r,a,s,m Grants(g,r,a,s,m) → ∃f,p Finances(f,a,p); ∀c,e,p Contacts(c,e,p) → ∃f,b Finances(f,b,p)
  • A set of correspondences can be interpreted together if there are two logical associations (one in the source and one in the target) that cover them
  • The schema mapping specifies how the data of two schemas relate to each other. For data exchange, an instance of the source schema must be transformed into an instance of the target schema. Note that the schema mapping might not determine all the target values, and may not specify the grouping or nesting semantics for target data.
  • When one of the schemas is XML, Clio can generate a data exchange query in XQuery or XSLT. The paper describes how to generate XQuery; SQL generation is similar, but without nested elements.
  • Obvious relationships
  • finally
  • The annotation facilitates the generation of Skolem functions. These source elements will be the arguments of the potential Skolem functions.
  • Every expression that a node acquires is propagated to its children if they do not already have it and if they are not equal to any of the source nodes. The annotation facilitates the generation of Skolem functions. These source elements will be the arguments of the potential Skolem functions.
  • This step is straightforward: Clio binds one variable to each term and adds the conditions in the where clause. The result is denoted Q S M1. It is not the complete query, because it does not produce a result yet. It will be used repeatedly in the larger query.
  • The traversal starts at the target schema root in the query graph and proceeds depth first, reusing the query fragment Q S M1 repeatedly. If a node is a complex-type element (like y1 fundings), the element is generated by visiting its children. If the node is an atomic type and is linked to a source node (like y1.fid), a simple element is created with the value equal to the source. Otherwise, if it is an optional element, nothing is generated; if it is a nullable element, a null value is generated; else (like y1.finId) a value is generated using a new Skolem function whose arguments are all the expressions that annotate the node (taking care that all the nodes equal to this node receive the same Skolem function name). If the node is a variable, a for-where-return query is produced: the query fragment Q S M1 is copied, all its variables are renamed, and its annotation is compared with that of its parent variable; for each common expression, a correlated subquery is generated.
  • If the node is a variable, a for-where-return query is produced: the query fragment Q S M1 is copied, all its variables are renamed, and its annotation is compared with that of its parent variable; for each common expression, a correlated subquery is generated.
  • The fragment will be used repeatedly in the larger query. The traversal starts at the target schema root in the query graph and proceeds depth first.
  • The paths in an NRI constraint require matching to determine the variables in the path. Matching is exponential in the size of the path, which is often small, and some matchings are not possible because of schema restrictions. A chase step can therefore take exponential time in the worst case, since there may be multiple ways of matching a variable in a path.
  • Clio provides tools that help automate and manage data conversion. It makes no assumption about the schemas, their relationships or how they were created. The mapping language is more general than those of TSIMMIS and Information Manifold. It is able to map between relational schemas and nested schemas. It maps at different levels of granularity: fine-grained mappings such as translating a salary from francs to dollars, and broader concepts (documents from one schema to the other schema). The mapping algorithms are incremental: sometimes the complete mapping is not the goal (only a single concept needs to be mapped), or only partial knowledge of the schemas is available, so incomplete mappings are supported as well.
  • Correspondences alone do not specify how individual values should be connected in the target. For example, fundings is nested inside organizations, which means there is a semantic association between them. To learn about the association in the target, we look for an association between organization information and funding information in the source. One such association is f1: each grant is associated with a company. Thus, in the target, we can associate a set of fundings with each organization. The algorithm uses logical inference to find all associations represented by the referential constraints and by a schema's relational and nesting structure.
  • F is a set identifier: the set of fundings that an organizations tuple has. This mapping says what must be true in the target when a pattern occurs in the source data: if a grant joins with a company, there must be an organization whose code is the name of the company, with a funding inside it whose fid equals the gid.
  • v3 alone does not recognize that grant amounts are associated with specific gids. Using f4, the better mapping is the one shown here.
  • To complete the example, consider v4. There are two ways to associate the grant amount (budget) with a phone: using f2, the supervisor's phone, or using f3, the manager's phone.
  • Consider this simple mapping. An employee in the source has atomic elements A, B and C; the employee record in the target has A’, B’, C’ and an extra element E’. A and B are mapped to A’ and B’, but C’ and E’ are left unmapped. What should the values of C’ and E’ be? 1. When neither is used in a schema constraint, creating a null value is sufficient. 2. If E’ is a key in the target (not nullable and not optional, like an employee id), values are created using a one-to-one Skolem function; E’ depends only on A and B, not on C.
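The employee example above can be sketched in Python. This is only an illustration under stated assumptions: the function names `skolem` and `exchange`, the Skolem name `Sk_E` and the sample rows are hypothetical, and a Skolem term is modelled as a plain tuple so that equal arguments always give the same value and different arguments give different values.

```python
def skolem(name, *args):
    """Model a Skolem term as (function name, arguments).

    Tuples compare by value, so the term is one-to-one in its arguments.
    """
    return (name, args)


def exchange(employees):
    """Copy A and B, leave the unmapped nullable C' null, and invent the key E'."""
    target = []
    for e in employees:
        target.append({
            "A'": e["A"],
            "B'": e["B"],
            "C'": None,                              # unmapped and nullable: null
            "E'": skolem("Sk_E", e["A"], e["B"]),    # depends on A and B, not on C
        })
    return target


# Made-up source rows: the first two differ only in C.
rows = exchange([
    {"A": 1, "B": "x", "C": "p"},
    {"A": 1, "B": "x", "C": "q"},
    {"A": 2, "B": "x", "C": "p"},
])
```

Because C is not an argument of the Skolem term, the first two rows receive the same E’ value, while the third receives a different one.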
  • E’ is the reference, page 224
  • Target schema contains two levels.
  • One reason for using XSLT is that there is no efficient, robust implementation of XQuery today. I give the size of the largest schemas and some idea of compilation/interpretation times.
  • Primary path (given a schema root R, that is, a first-level element in the schema): x1 in g1, x2 in g2, …, xn in gn, where g1 is an expression on R (just R?) and each gi (for i ≥ 2) is an expression on x(i-1). Examples: c in companies; o in organizations, f in o.fundings. Relative path with respect to a variable x: x1 in g1, x2 in g2, …, xn in gn, where g1 is an expression on x and each gi (for i ≥ 2) is an expression on x(i-1). Example: f in o.fundings. The chase: given an association, repeatedly apply a chase rule to the "current" association (initialized as the input one). If there is an NRI constraint foreach X exists Y where B such that the "current" association contains X and does not contain a Y that satisfies B, then add Y to the generators and B to the where clause. Example: if we start with from g in grants, then we have to add various components and obtain from g in grants, c in companies, s in contacts, m in contacts where g.recipient = c.name and g.supervisor = s.cid and g.manager = m.cid
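The chase rule above can be sketched in Python for the flat, relational case. This is a simplification, not Clio's algorithm for nested schemas: each constraint is reduced to a plain foreign key `(source root, source field, target root, target field)`, the generated variable names `v1, v2, …` are arbitrary, and the constraints are assumed to be acyclic so that the loop terminates.

```python
def chase(generators, conditions, constraints):
    """Extend an association until every foreign key is witnessed.

    generators:  list of (variable, root) pairs, e.g. ("g", "grants")
    conditions:  list of (lhs, rhs) equalities, e.g. ("g.recipient", "c.name")
    constraints: list of (src_root, src_field, tgt_root, tgt_field)
    """
    gens = list(generators)
    conds = list(conditions)
    counter = 0
    changed = True
    while changed:
        changed = False
        for var, root in list(gens):            # snapshot: gens grows in the loop
            for src_root, src_field, tgt_root, tgt_field in constraints:
                if root != src_root:
                    continue
                lhs = f"{var}.{src_field}"
                # Is there already a generator over tgt_root joined to lhs?
                satisfied = any(
                    l == lhs and r == f"{y}.{tgt_field}"
                    for (l, r) in conds
                    for (y, y_root) in gens
                    if y_root == tgt_root
                )
                if not satisfied:
                    counter += 1
                    new_var = f"v{counter}"      # fresh variable name (arbitrary)
                    gens.append((new_var, tgt_root))
                    conds.append((lhs, f"{new_var}.{tgt_field}"))
                    changed = True
    return gens, conds


# f1, f2 and f3 of the running example, written as plain foreign keys.
FKS = [
    ("grants", "recipient", "companies", "name"),
    ("grants", "supervisor", "contacts", "cid"),
    ("grants", "manager", "contacts", "cid"),
]
result_gens, result_conds = chase([("g", "grants")], [], FKS)
```

Starting from `from g in grants`, the sketch adds one generator over companies and two over contacts, which mirrors the association given in the example up to variable names.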
  • NRI constraints capture relational foreign key and referential constraints as well as XML keyref constraints. Referential integrity is essential in this approach as the basis for the discovery of "associations". Given the nested model, they need a rather complex definition. Primary path (given a schema root R, that is, a first-level element in the schema): x1 in g1, x2 in g2, …, xn in gn, where g1 is an expression on R (just R?) and each gi (for i ≥ 2) is an expression on x(i-1). Examples: c in companies; o in organizations, f in o.fundings. Relative path with respect to a variable x: x1 in g1, x2 in g2, …, xn in gn, where g1 is an expression on x and each gi (for i ≥ 2) is an expression on x(i-1). Example: f in o.fundings

    1. Clio: Schema Mapping Creation and Data Exchange. Presented by Leila Jalali, Information Systems Group, Candidacy Exam, Jan. 2010
    2. The Clio project. The user wants data from S, understands T, and may not understand S. A schema mapping relates source schema S to target schema T; the source and target data each "conform to" their schema, and a query Q transforms the data (data exchange). Clio addresses two main problems: how to generate schema mappings, and how to use them for data exchange.
    3. Outline: The Motivating Example; 1. Schema Mapping Generation (Mapping generation algorithm); 2. Data Exchange (Query generation algorithm); Conclusions
    4. A Motivating Example. Schema S: Companies: Set of Rcd (Name, Address, Year); Grants: Set of Rcd (Gid, Recipient, Amount, Supervisor, Manager); Contacts: Set of Rcd (Cid, Email, Phone). Schema T: Organizations: Set of Rcd (Code, Year, Fundings: Set of Rcd (FId, FinId)); Finances: Set of Rcd (FinId, Budget, Phone). Referential constraints: f1, f2, f3 in S and f4 in T. Correspondences (given by a "schema matcher" or a "user"): v1 (Name to Code), v2 (Gid to FId), v3 (Amount to Budget), v4 (Phone to Phone).
    5. Correspondences. Using a tuple generating dependency (tgd), v1: ∀n,d,y Companies(n,d,y) → ∃y',F Organizations(n,y',F). In the mapping language: foreach c in companies exists o in organizations with o.code = c.name
    6. More complex mappings. ∀n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) → ∃y',F,f,p Organizations(n,y',F), F(g,f), Finances(f,a,p). In the mapping language: foreach c in companies, g in grants where c.name = g.recipient exists o in organizations, f in o.fundings, i in finances where f.finId = i.finId with o.code = c.name and f.fId = g.gId and i.budget = g.amount
    7. More complex mappings (cont.). In the same mapping, the foreach clause is a query on the source (QS), the exists clause is a query on the target (QT), and the with clause carries the correspondences: QS ⊆ QT.
    8. Outline: The Motivating Example; 1. Schema Mapping Generation (Mapping generation algorithm); 2. Data Exchange (Query generation algorithm); Conclusions
    9. Mapping Generation. Source schema: generate all possible associations within the source. Target schema: generate all possible associations within the target. These are the structural associations.
    10. Mapping Generation: structural associations. Examples: from p in companies; from g in grants; from o in organizations
    11. Mapping Generation: logical associations. Build larger associations in the source (AS) and the target (AT).
    12. Mapping Generation: logical associations (cont.). Build AS by starting with a structural association and "chasing" constraints (f1, f2, f3 over Companies, Grants and Contacts).
    13. Mapping Generation: Clio mappings. Use a pair <AS, AT> and the correspondences covered by <AS, AT> to generate a Clio mapping: foreach AS exists AT with W, where W is the conjunction of equalities h(eS) = h'(eT) (captured from the correspondences).
    14. Clio mapping, example. Generate a Clio mapping foreach AS exists AT with W, where W is the conjunction of equalities h(eS) = h'(eT). AS: from g in grants, c in companies, s in contacts, m in contacts where g.recipient = c.name and g.supervisor = s.cid and g.manager = m.cid. AT: from o in organizations, f in o.fundings, i in finances where f.finId = i.finId. v1, v2, v3 are covered. Resulting mapping: foreach g in grants, c in companies, s in contacts, m in contacts where g.recipient = c.name and g.supervisor = s.cid and g.manager = m.cid exists o in organizations, f in o.fundings, i in finances where f.finId = i.finId with c.name = o.code and g.gId = f.fId and g.amount = i.budget
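A minimal Python sketch of what the slide-14 mapping demands of a pair of instances. The dictionary layout, the sample rows and the function name `satisfies` are illustrative assumptions and not part of Clio; the check simply looks for a target witness for every source binding of the foreach clause.

```python
def satisfies(source, target):
    """Return True if every source binding has a matching target binding."""
    for g in source["grants"]:
        for c in source["companies"]:
            for s in source["contacts"]:
                for m in source["contacts"]:
                    # where clause of the foreach part
                    if not (g["recipient"] == c["name"]
                            and g["supervisor"] == s["cid"]
                            and g["manager"] == m["cid"]):
                        continue
                    # exists part: where clause plus the with equalities
                    witness = any(
                        f["finId"] == i["finId"]
                        and o["code"] == c["name"]
                        and f["fId"] == g["gId"]
                        and i["budget"] == g["amount"]
                        for o in target["organizations"]
                        for f in o["fundings"]
                        for i in target["finances"]
                    )
                    if not witness:
                        return False
    return True


# Made-up sample data for illustration only.
source = {
    "companies": [{"name": "Acme", "address": "1 Main St", "year": 1999}],
    "grants": [{"gId": "g1", "recipient": "Acme", "amount": 1000,
                "supervisor": "c1", "manager": "c2"}],
    "contacts": [{"cid": "c1", "email": "a@example.org", "phone": "555-0001"},
                 {"cid": "c2", "email": "b@example.org", "phone": "555-0002"}],
}
good_target = {
    "organizations": [{"code": "Acme", "year": None,
                       "fundings": [{"fId": "g1", "finId": "F1"}]}],
    "finances": [{"finId": "F1", "budget": 1000, "phone": None}],
}
bad_target = {
    "organizations": good_target["organizations"],
    "finances": [{"finId": "F1", "budget": 999, "phone": None}],
}
```

Here `good_target` satisfies the mapping and `bad_target` does not, because its budget no longer equals the grant amount. The target Year and Phone values are left null because v1, v2 and v3 say nothing about them.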
    15. Dominance. A2 dominates A1 (A1 ≤ A2) if the from and where clauses of A1 are subsets of those of A2 (after suitable renaming). Example: A2: from g in grants, c in companies, s in contacts, m in contacts where g.recipient = c.name and g.supervisor = s.cid and g.manager = m.cid. A1: from g in grants, c in companies where g.recipient = c.name
    16. Coverage of a correspondence. A correspondence v: foreach PS exists PT with eS = eT is covered by a pair of associations <AS, AT> if PS ≤ AS and PT ≤ AT with some renaming h, h'. Example: AS: from c in companies; AT: from o in organizations; v: foreach c in companies exists o in organizations with c.name = o.code
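The dominance and coverage tests of slides 15 and 16 can be sketched in Python as a brute-force search for a renaming. This is an illustration under assumptions: associations are modelled as `(generators, conditions)` pairs of strings, the renaming is taken to be injective for simplicity, and the names `rename`, `dominates` and `covers` are hypothetical helpers rather than Clio's API.

```python
from itertools import permutations


def rename(expr, h):
    """Apply renaming h to the leading variable of 'g.recipient' or 'o.fundings'."""
    head, dot, rest = expr.partition(".")
    return h.get(head, head) + dot + rest


def dominates(big, small):
    """True if small <= big: some renaming maps small's clauses into big's."""
    gens_s, conds_s = small
    gens_b, conds_b = big
    vars_s = [v for v, _ in gens_s]
    vars_b = [v for v, _ in gens_b]
    gen_set = set(gens_b)
    # Equalities are symmetric, so compare them as unordered pairs.
    cond_set = {frozenset(c) for c in conds_b}
    for image in permutations(vars_b, len(vars_s)):
        h = dict(zip(vars_s, image))
        gens_ok = all((h[v], rename(e, h)) in gen_set for v, e in gens_s)
        conds_ok = all(
            frozenset((rename(l, h), rename(r, h))) in cond_set
            for l, r in conds_s
        )
        if gens_ok and conds_ok:
            return True
    return False


def covers(a_s, a_t, correspondence):
    """A correspondence (PS, PT) is covered if PS <= AS and PT <= AT."""
    p_s, p_t = correspondence
    return dominates(a_s, p_s) and dominates(a_t, p_t)


A2 = ([("g", "grants"), ("c", "companies"), ("s", "contacts"), ("m", "contacts")],
      [("g.recipient", "c.name"), ("g.supervisor", "s.cid"), ("g.manager", "m.cid")])
A1 = ([("x", "grants"), ("y", "companies")], [("y.name", "x.recipient")])
AT = ([("o", "organizations"), ("f", "o.fundings"), ("i", "finances")],
      [("f.finId", "i.finId")])
small_AS = ([("c", "companies")], [])
small_AT = ([("o", "organizations")], [])
v1 = (([("c", "companies")], []), ([("o", "organizations")], []))
v2 = (([("g", "grants")], []), ([("a", "organizations"), ("b", "a.fundings")], []))
```

With these definitions, A2 dominates A1 even though A1 uses different variable names and writes its equality the other way round. The small pair of associations covers v1 but not v2, while the pair <A2, AT> covers both.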
    17. Mapping Generation (cont.). Use a pair <AS, AT> and the correspondences covered by <AS, AT> to generate a Clio mapping: foreach AS exists AT with W, where W is the conjunction of equalities h(eS) = h'(eT) (captured from the correspondences).
    18. Mapping Generation (cont.). Add the Clio mapping to the set of mappings.
    19. Logical associations are meaningful combinations of correspondences. Find maximal sets of correspondences that can be interpreted together; discard the "larger" mapping; generate a Clio mapping.
    20. Outline: The Motivating Example; 1. Schema Mapping Generation (Mapping generation algorithm); 2. Data Exchange (Query generation algorithm); Conclusions
    21. Query generation for data exchange: mapping generation (over the source schema and the target schema) is followed by query generation.
    22. Overview of Query Generation. Input: a Clio mapping. 1. A query graph is constructed, which represents the key portions of the query. 2. The graph is annotated to generate Skolem terms. 3. The graph is traversed to produce the query. Output: the data exchange query (in SQL, XQuery, or XSLT). (Figure: the annotated query graph for the example.)
    23. 1. Constructing the Query Graph. Add a node for each variable in the exists clause: y0 (organizations), y1 (fundings), y2 (finances).
    24. 1. Constructing the Query Graph (cont.). Add nodes for all the atomic-type elements reachable from these nodes via record projection: y0.code, y0.year, y1.fid, y1.finId, y2.finId, y2.budget, y2.phone. [Figure: target schema with Organizations (Code, Year, Fundings (FId, FinId)), Finances (FinId, Budget, Phone) and constraint f4.]
    25. 1. Constructing the Query Graph (cont.). Add structural edges to reflect the relationships between nodes. [Figure: the target schema and the query graph nodes listed on the previous slide.]
    26. 1. Constructing the Query Graph (cont.). Add the source nodes for all source expressions in the with clause: x0.name, x2.phone, x1.amount, x1.gid.
    27. 1. Constructing the Query Graph (cont.). Attach the source nodes to the target nodes to which they are “equal”.
    28. 1. Constructing the Query Graph (cont.). Use the equalities in the where clause to add edges between target nodes.
    29. 2. Annotating the Graph. Each node is annotated with a set of source expressions. Upward propagation: every expression that a node acquires is propagated to its parent node, unless the (acquiring) node is a variable. [Figure: query graph with the annotations after this step.]
    30. 2. Annotating the Graph (cont.). Downward propagation: every expression that a node acquires is propagated to its children. [Figure: query graph with the annotations after this step.]
    31. 2. Annotating the Graph (cont.). Eq. propagation: every expression that a node acquires is propagated to the nodes related to it through equality edges. [Figure: query graph with the annotations after this step.]
    32. 2. Annotating the Graph (cont.). Apply the rules until no more rules can be applied. [Figure: fully annotated query graph.]
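
The three propagation rules on slides 29 to 32 can be read as a fixpoint computation. The following is a minimal Python sketch of that reading, not Clio's implementation. The graph shape (y1 nested under y0, an equality edge between y1.finId and y2.finId) and the seed annotations (for example y0.code = x0.name) are assumptions reconstructed from the node labels on the slides; the function name `annotate` and all parameter names are hypothetical.

```python
from collections import defaultdict

def annotate(parent, variables, eq_edges, seeds):
    """Propagate source expressions until no rule adds anything new.

    parent:    child node -> parent node (structural edges)
    variables: nodes that are variables (they do not propagate upward)
    eq_edges:  pairs of target nodes linked by a where-clause equality
    seeds:     target node -> source expressions attached to it
    """
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    neighbours = defaultdict(set)
    for a, b in eq_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    ann = defaultdict(set)
    for node, exprs in seeds.items():
        ann[node] |= set(exprs)

    changed = True
    while changed:
        changed = False
        for node in list(ann):
            exprs = set(ann[node])
            # Downward and equality propagation always apply.
            targets = list(children[node]) + list(neighbours[node])
            # Upward propagation applies only to non-variable nodes.
            if node not in variables and node in parent:
                targets.append(parent[node])
            for t in targets:
                if not exprs <= ann[t]:
                    ann[t] |= exprs
                    changed = True
    return dict(ann)

# Assumed graph for the running example (reconstructed, not from the slides' text).
parent = {
    "y0.code": "y0", "y0.year": "y0", "y1": "y0",
    "y1.fid": "y1", "y1.finId": "y1",
    "y2.finId": "y2", "y2.budget": "y2", "y2.phone": "y2",
}
variables = {"y0", "y1", "y2"}
eq_edges = [("y1.finId", "y2.finId")]
seeds = {
    "y0.code": {"x0.name"}, "y1.fid": {"x1.gid"},
    "y2.budget": {"x1.amount"}, "y2.phone": {"x2.phone"},
}
annotations = annotate(parent, variables, eq_edges, seeds)
for node in sorted(annotations):
    print(node, sorted(annotations[node]))
```

Under these assumptions, nodes with no seed of their own (such as y0.year or the two finId nodes) end up with a non-empty set of source expressions, which is what a later step can use as arguments of a Skolem term.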
    33. 3. Generation of Transformation Queries. Generate the query fragment: the foreach clause is converted to a query fragment.
    34. 3. Generation of Transformation Queries (cont.). Perform a depth-first traversal on the graph. [Figure: fully annotated query graph.]
    35. 3. Generation of Transformation Queries (cont.). [Figure: fully annotated query graph.]
    36. Finally we have the query:
    37. Clio: Conclusion. Clio provides tools that help automate and manage the problem of data conversion. The key contributions of Clio: schema mapping generation (mapping as a query discovery problem; capable of mapping between relational and nested schemas); query generation for data exchange (SQL, XQuery, XSLT, generating Skolems, ...).
    38. Thanks. Information Systems Group, Candidacy Exam, Jan. 2010
    39. Backups. Clio requirements. Complex mappings: using associations. Definitions: mapping language, paths, schema and types, dominance. Query generation challenges; the problem of recursion in XML schema. Nested referential integrity (NRI) constraints. The chase.
    40. The Clio project: overview of the requirements. [Figure: source schema S and target schema T connected by a schema mapping; data “conforms to” each schema; query Q.] Requirements: no assumptions about the schemas; a general mapping language; mapping at different levels of granularity; incremental mapping algorithms; capable of mapping between relational schemas and nested schemas.
    41. Formalize correspondences using tuple generating dependencies (tgds). v1: ∀n,d,y Companies(n,d,y) → ∃y',F Organizations(n,y',F). v2: ∀n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) → ∃y',F,f Organizations(n,y',F), F(g,f). v3: ∀g,r,a,s,m Grants(g,r,a,s,m) → ∃f,p Finances(f,a,p). v4: ∀c,e,p Contacts(c,e,p) → ∃f,b Finances(f,b,p). [Figure: source schema S with Companies (Name, Address, Year), Grants (Gid, Recipient, Amount, Supervisor, Manager), Contacts (Cid, Email, Phone) and foreign keys f1, f2, f3; target schema T with Organizations (Code, Year, Fundings (FId, FinId)), Finances (FinId, Budget, Phone) and constraint f4; correspondences v1 to v4.]
    42. Correspondences alone are not enough. How should individual data values be connected in the target? Example source instance: Companies (Name, Address, Year): (MS, SA, 1976), (AT&T, TX, 1980), (IBM, NY, 1955); Grants (GId, Rec.t, Amt): (301, MS, 30), (302, MS, 40), (303, IBM, 30). [Figure: schema diagram with correspondences v1 to v4 and the target instance with Organizations codes MS, AT&T, IBM.]
    43. More complex mappings are needed. The "association" between companies and grants in the source is suggested by f1 (a foreign key): ∀n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) → ∃y',F,f Organizations(n,y',F), F(g,f). Example source instance: Companies (Name, Address, Year): (MS, SA, 1976), (AT&T, TX, 1980), (IBM, NY, 1955); Grants (GId, Rec.t, Amt): (301, MS, 30), (302, MS, 40), (303, IBM, 30). Target instance: Organizations MS (Fundings 301, 302), AT&T, IBM (Fundings 303). [Figure: schema diagram with correspondences v1 to v4.]
    44. Yet more complex... v3: ∀g,r,a,s,m Grants(g,r,a,s,m) → ∃f,p Finances(f,a,p). Combined with the previous mapping: ∀n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) → ∃y',F,f,p Organizations(n,y',F), F(g,f), Finances(f,a,p). • Three tuples are generated for each pair of related companies and grants. • The mapping specifies that there exists an f, appearing in two places, without saying what its value must be. [Figure: schema diagram with correspondences v1 to v4.]
    45. Yet more complex... v4: ∀c,e,p Contacts(c,e,p) → ∃f,b Finances(f,b,p). • How do we obtain the phone to be put in Finances? Is it the supervisor's or the manager's? • FKs suggest either (or even both). • Human intervention is needed to choose. [Figure: schema diagram with correspondences v1 to v4.]
    46. The Mapping Language: Syntax. foreach x1 in g1, ..., xn in gn where B1 exists y1 in g1', ..., ym in gm' where B2 with e1 = e1' and ... and ek = ek'. Here xi in gi is a generator: xi is a variable and gi is a set (either the root or a set nested within it). B1 is a conjunction of equalities over the xi variables. Each ei = ei' is an equality between a source expression and a target expression. The example: foreach c in companies, g in grants where c.name = g.recipient exists o in organizations, f in o.fundings, i in finances where f.finId = i.finId with o.code = c.name and f.fId = g.gId and i.budget = g.amount
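
To make the shape of the mapping language concrete, here is a minimal Python sketch that holds the slide's example mapping as plain data and prints it back in the foreach / exists / with form. The class and function names (`Clause`, `Mapping`, `render`) are hypothetical and chosen only for illustration; they are not part of Clio.

```python
from dataclasses import dataclass, field

@dataclass
class Clause:
    """Generators `x in g` plus an optional conjunction of equalities."""
    generators: list                            # [(variable, set expression), ...]
    where: list = field(default_factory=list)   # [(left expr, right expr), ...]

@dataclass
class Mapping:
    foreach: Clause   # source side
    exists: Clause    # target side
    with_: list       # [(target expr, source expr), ...]

def render_clause(keyword, clause):
    text = keyword + " " + ", ".join(f"{v} in {g}" for v, g in clause.generators)
    if clause.where:
        text += " where " + " and ".join(f"{a} = {b}" for a, b in clause.where)
    return text

def render(mapping):
    with_text = "with " + " and ".join(f"{a} = {b}" for a, b in mapping.with_)
    return "\n".join([
        render_clause("foreach", mapping.foreach),
        render_clause("exists", mapping.exists),
        with_text,
    ])

# The example mapping from the slide.
example = Mapping(
    foreach=Clause([("c", "companies"), ("g", "grants")],
                   [("c.name", "g.recipient")]),
    exists=Clause([("o", "organizations"), ("f", "o.fundings"), ("i", "finances")],
                  [("f.finId", "i.finId")]),
    with_=[("o.code", "c.name"), ("f.fId", "g.gId"), ("i.budget", "g.amount")],
)
print(render(example))
```

Keeping generators and equalities as separate lists mirrors the syntax on the slide: the foreach and exists parts each carry their own where clause, and only the with clause mixes source and target expressions.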
    47. Primary and Relative Paths. Primary path (given a schema root R, that is, a first-level element in the schema): x1 in g1, x2 in g2, ..., xn in gn, where g1 is an expression on R (just R?) and gi (for i ≥ 2) is an expression on x(i-1). Examples: c in companies; o in organizations, f in o.fundings. Relative path with respect to a variable x: x1 in g1, x2 in g2, ..., xn in gn, where g1 is an expression on x and gi (for i ≥ 2) is an expression on x(i-1). Example: f in o.fundings.
    48. Schema and Types. A schema: a sequence of labels (roots), each with an associated type, defined by this grammar: complex types; atomic types; a set type; all and choice model-groups; repeated elements. Instances: associate with each schema root a value: a value for atomic types; a setID; an unordered tuple of pairs; a pair.
    49. Correspondences
    50. The data exchange problem
    51. Query generation challenges. 1. Creation of new values in the target. Optional: null. Not nullable: one-to-one Skolem function. But if it is emp ID [Figure: record with name, salary, spouse, dateofbirth.]
    52. Query generation challenges. 1. Creation of new values in the target. Referential constraints.
    53. Query generation challenges. 2. Grouping nested elements.
    54. Query generation challenges. 3. Value creation interacts with grouping.
    55. Recursion in XML schema
    56. The Chase. Given an association, repeatedly apply a chase rule to the "current" association (initialized as the input one). If there is an NRI constraint foreach X exists Y where B such that the "current" association contains X and does not contain a Y that satisfies B, then add Y to the generators and B to the where clause. Example: if we start with from g in grants, then we have to add various components and obtain from g in grants, c in companies, s in contacts, m in contacts where g.recipient = c.name and g.supervisor = s.cid and g.manager = m.cid
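
The chase rule above can be sketched in a few lines of Python for the relational case. This is a deliberately simplified illustration, not Clio's algorithm: it assumes every constraint has exactly one generator on each side and one equality (which fits the foreign keys f1, f2, f3 of the example but not nested NRI constraints), and it assumes the constraints are acyclic so the loop terminates. The tuple encoding, the variable-name hints and the names `chase` and `fresh` are all hypothetical.

```python
def fresh(hint, used):
    """Return a variable name based on `hint` that is not yet in `used`."""
    name, n = hint, 1
    while name in used:
        name, n = f"{hint}{n}", n + 1
    used.add(name)
    return name

def chase(generators, where, constraints):
    """Chase an association with single-generator foreign-key constraints.

    generators:  [(variable, set name), ...]
    where:       [((var, attr), (var, attr)), ...]
    constraints: [(source set, source attr, target set, target attr, hint), ...]
                 read as: foreach x in source exists y in target
                          where x.source_attr = y.target_attr
    """
    generators, where = list(generators), list(where)
    used = {v for v, _ in generators}
    changed = True
    while changed:
        changed = False
        for src_set, src_attr, tgt_set, tgt_attr, hint in constraints:
            for var, set_name in list(generators):
                if set_name != src_set:
                    continue
                lhs = (var, src_attr)
                # Is there already a Y in the association that satisfies B?
                satisfied = any(
                    a == lhs and b[1] == tgt_attr and (b[0], tgt_set) in generators
                    for a, b in where
                )
                if not satisfied:
                    new_var = fresh(hint, used)
                    generators.append((new_var, tgt_set))    # add Y
                    where.append((lhs, (new_var, tgt_attr)))  # add B
                    changed = True
    return generators, where

# Foreign keys of the source schema, encoded under the assumptions above.
constraints = [
    ("grants", "recipient", "companies", "name", "c"),   # f1
    ("grants", "supervisor", "contacts", "cid", "s"),    # f2
    ("grants", "manager", "contacts", "cid", "m"),       # f3
]
gens, eqs = chase([("g", "grants")], [], constraints)
print("from " + ", ".join(f"{v} in {s}" for v, s in gens))
print("where " + " and ".join(f"{a}.{x} = {b}.{y}" for (a, x), (b, y) in eqs))
```

Starting from `from g in grants`, the sketch adds one generator per foreign key and stops once every constraint is satisfied, reproducing the association shown on the slide.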
    57. Clio: Analysis and Conclusion. Termination and complexity of the chase: the chase with general dependencies may not terminate (cyclic dependencies). NRIs: for a weakly acyclic set, the number of chase steps is polynomial. Conclusion.
    58. Clio mapping. A Clio mapping: foreach AS exists AT with E. AS, AT: logical associations (on source and target, resp.). E is a conjunction of equalities: for each correspondence v in C covered by <AS, AT>, E includes the equality h(eS) = h'(eT) which is the result of the coverage, for one of the coverages.
    59. Structural Association. Structural association: from P (with P a primary path). Starts from the root of the schema. [Figure: source schema (Companies, Grants, Contacts) and target schema (Organizations with nested Fundings, Finances).]
    60. Nested Referential Integrity (NRI) constraints. The basis for discovery of associations: they capture relational foreign key and referential constraints as well as XML keyref constraints. Form: foreach P1 exists P2 where B. P1 is a primary path (e.g. o in organizations, f in o.fundings). P2 is a primary path or a relative path with respect to a variable in P1 (e.g. f in o.fundings). B is a conjunction of equalities between an expression on a variable of P1 and an expression on a variable of P2. Example (f4): foreach o in organizations, f in o.fundings exists i in finances where f.finId = i.finId. [Figure: target schema with Organizations (Code, Year, Fundings (FId, FinId)), Finances (FinId, Budget, Phone) and constraint f4.]
    61. Logical Association. Logical association: semantic relationships between schema elements. Obtained by starting with a structural association and "chasing" NRI constraints.
    62. Logical Association: the Chase. Start with a structural association. [Figure: schema diagram with correspondences v1 to v4 and constraints f1 to f4.]
    63. Logical Association Relationships. A2 dominates A1 (A1 ≤ A2) if the from and where clauses of A1 are subsets of those of A2 (after suitable renaming). A2: from g in grants, c in companies, s in contacts, m in contacts where g.recipient = c.name and g.supervisor = s.cid and g.manager = m.cid. A1: from g in grants, c in companies where g.recipient = c.name
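
The dominance test can be illustrated with a small brute-force check in Python. This is a sketch under stated assumptions rather than Clio's procedure: "suitable renaming" is read here as an injective renaming of A1's variables to A2's variables, generators range over root sets only (no nested sets such as o.fundings), and equalities are treated as unordered pairs. The function name `dominates` and the tuple encoding are hypothetical.

```python
from itertools import permutations

def dominates(a2, a1):
    """Return True if a2 dominates a1 (A1 <= A2) under some variable renaming.

    An association is (generators, where) with
      generators: [(variable, set name), ...]
      where:      [((var, attr), (var, attr)), ...]
    """
    gens1, where1 = a1
    gens2, where2 = a2
    vars1 = [v for v, _ in gens1]
    vars2 = [v for v, _ in gens2]
    gens2_set = set(gens2)
    where2_set = {frozenset(eq) for eq in where2}   # equalities are unordered

    # Try every injective renaming of A1's variables into A2's variables.
    for image in permutations(vars2, len(vars1)):
        h = dict(zip(vars1, image))
        from_ok = all((h[v], s) in gens2_set for v, s in gens1)
        where_ok = all(
            frozenset(((h[lv], la), (h[rv], ra))) in where2_set
            for (lv, la), (rv, ra) in where1
        )
        if from_ok and where_ok:
            return True
    return False

# The two associations from the slide.
A2 = (
    [("g", "grants"), ("c", "companies"), ("s", "contacts"), ("m", "contacts")],
    [(("g", "recipient"), ("c", "name")),
     (("g", "supervisor"), ("s", "cid")),
     (("g", "manager"), ("m", "cid"))],
)
A1 = (
    [("g", "grants"), ("c", "companies")],
    [(("g", "recipient"), ("c", "name"))],
)
# Same association as A1 with different variable names.
A1_renamed = (
    [("x", "grants"), ("y", "companies")],
    [(("x", "recipient"), ("y", "name"))],
)
print(dominates(A2, A1), dominates(A1, A2), dominates(A2, A1_renamed))
```

The brute-force search is exponential in the number of variables, which is acceptable for a handful of generators as in the example; it is meant to make the definition concrete, not to be efficient.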
    64. Mapping Generation Algorithm. Inputs: S, T, correspondences. Logical associations are meaningful combinations of correspondences. Generate all logical associations AS, AT (e.g. AS: from c in companies; AT: from o in organizations). Which correspondences can be interpreted together? For each suitable pair <AS, AT>: find the correspondences covered by the pair with some renaming <h, h'>; check for dominance; generate the Clio mapping foreach AS exists AT with W, where W is the equality h(eS) = h'(eT); add the Clio mapping to the set of mappings (e.g. M: foreach c in companies exists o in organizations with c.name = o.code). Output: the set of schema mappings.