  1. 1. Data Integration Techniques Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 30, 2003 Some slide content may be courtesy of Susan Davidson, Dan Suciu, & Raghu Ramakrishnan
  2. 2. We Left Off with TSIMMIS <ul><li>“The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew </li></ul><ul><li>An instance of a “global-as-view” mediation system </li></ul><ul><li>One of the first systems to support semi-structured data, which predated XML by several years </li></ul><ul><li>This system, like the Information Manifold, focused on querying web sources </li></ul><ul><ul><li>Real-world integration companies (IBM, BEA, Actuate, …) are focusing on the enterprise – more $$$! </li></ul></ul>
  3. 3. Queries in TSIMMIS <ul><li>Specified in OQL-style language called Lorel </li></ul><ul><ul><li>OQL was an object-oriented query language </li></ul></ul><ul><ul><li>Lorel is a predecessor to XQuery; OEM is a predecessor to XML </li></ul></ul><ul><li>Based on path expressions over OEM structures: </li></ul><ul><li>select book where book.title = “DB2 UDB” and book.author = “Chamberlin” </li></ul><ul><li>This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Restating the query above: </li></ul><ul><ul><li>for $b in document(“mediated-schema”)/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b </li></ul></ul>
  4. 4. Query Answering in TSIMMIS <ul><li>Basically, it’s view unfolding , i.e., composing a query with a view </li></ul><ul><ul><li>The query is the one being asked </li></ul></ul><ul><ul><li>The views are the MSL templates for the wrappers </li></ul></ul><ul><ul><li>Some of the views may actually require parameters, e.g., an author name, before they’ll return answers </li></ul></ul><ul><ul><ul><li>These are called input bindings </li></ul></ul></ul><ul><ul><ul><li>Common for web forms (see Amazon, Google, …) </li></ul></ul></ul><ul><ul><ul><li>XQuery functions (XQuery’s version of views) support parameters as well, so we’ll use these to illustrate </li></ul></ul></ul>
  5. 5. A Wrapper Definition in MSL, Translated to XQuery <ul><li>Wrappers have templates and binding patterns ($X) in MSL: </li></ul><ul><li>B :- B: <book {<author $X>}> // $$ = “select * from book where author=“ $X // </li></ul><ul><ul><li>This reformats a SQL query over Book(author, year, title) </li></ul></ul><ul><li>In XQuery, this might look like: </li></ul><ul><ul><li>define function GetBook($x AS xsd:string) as book* { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book>$b<author>$x</author></book> </li></ul></ul><ul><ul><li>} </li></ul></ul>
  6. 6. How to Answer the Query <ul><li>Given our query: </li></ul><ul><ul><li>for $b in document(“mediated-schema”)/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b </li></ul></ul><ul><li>We want to find all wrapper definitions that: </li></ul><ul><ul><li>Either output enough information that we can evaluate all of our conditions over the output </li></ul></ul><ul><ul><ul><li>They return a book’s title and author, so we can test against these </li></ul></ul></ul><ul><ul><li>Or have already “enforced” the conditions for us! </li></ul></ul><ul><ul><ul><li>They already do a selection on author=“Chamberlin,” etc. </li></ul></ul></ul>
  7. 7. Query Composition with Views <ul><li>We find all views that define book with author and title, and we compose the query with each of these </li></ul><ul><li>In our example, we find one wrapper definition that matches: </li></ul><ul><ul><li>define function GetBook($x AS xsd:string) as book* { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book>$b<author>$x</author></book> </li></ul></ul><ul><ul><li>} </li></ul></ul><ul><ul><li>for $b in document(“mediated-schema”)/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b </li></ul></ul>
  8. 8. Matching View Output to Our Query’s Conditions <ul><li>Determine that the query tests for $x=“Chamberlin” by matching the query’s XPath, $b/author/text() , on the function’s output: </li></ul><ul><ul><li>define function GetBook($x AS xsd:string) as book* { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book> $b <author>$x</author> </book> </li></ul></ul><ul><ul><li>} </li></ul></ul><ul><ul><li>let $x := “Chamberlin” for $b in GetBook($x)/book where $b/title/text() = “DB2 UDB” return $b </li></ul></ul>
  9. 9. The Final Step: Unfolding <ul><li>The expression: </li></ul><ul><ul><li>let $x := “Chamberlin” for $b in { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book> $b <author>$x</author> </book> }/book where $b/title/text() = “DB2 UDB” return $b </li></ul></ul><ul><li>Can be unnested (“unfolded”) and simplified to: </li></ul><ul><ul><li>for $b in sql(“select * from book where author=‘Chamberlin’”) where $b/title/text() = “DB2 UDB” return $b </li></ul></ul>
  10. 10. What Is the Answer? <ul><li>Given schema book(author, year, title) and Datalog rules defining an instance: </li></ul><ul><ul><li>book(“Chamberlin”, “1992”, “DB2 UDB”) </li></ul></ul><ul><ul><li>book(“Chamberlin”, “1995”, “DB2/CS”) </li></ul></ul><ul><ul><li>book(“Bernstein”, “1997”, “Transaction Processing”) </li></ul></ul><ul><li>TSIMMIS is an instance of a global-as-view mediator with a semistructured data model </li></ul><ul><ul><li>Can also have GAV mediators using Datalog or SQL, which work on similar principles </li></ul></ul><ul><li>Queries and mappings are unfolded (macro-expanded + simplified) </li></ul>
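The unfolded query can be run against the three facts above to see what it returns. This is only a sketch: the in-memory SQLite table stands in for the wrapped source, and `get_book` and `unfolded_query` are hypothetical names for the wrapper view and the unfolded query, not part of TSIMMIS.

```python
import sqlite3

# Instance from the slide: book(author, year, title).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (author TEXT, year TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO book VALUES (?, ?, ?)",
    [
        ("Chamberlin", "1992", "DB2 UDB"),
        ("Chamberlin", "1995", "DB2/CS"),
        ("Bernstein", "1997", "Transaction Processing"),
    ],
)


def get_book(author):
    """Wrapper view with an input binding: the author must be supplied."""
    # Parameterized form of: select * from book where author = '<author>'
    return conn.execute(
        "SELECT author, year, title FROM book WHERE author = ?", (author,)
    ).fetchall()


def unfolded_query():
    """Author test is pushed to the source; title test runs in the mediator."""
    return [row for row in get_book("Chamberlin") if row[2] == "DB2 UDB"]


answer = unfolded_query()
print(answer)  # [('Chamberlin', '1992', 'DB2 UDB')]
```

Only the first fact survives both conditions, so the answer is the single 1992 book.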
  11. 11. Limitations of Global-As-View <ul><li>Some data sources may contain data that falls within certain ranges or has certain known properties </li></ul><ul><ul><li>“ Books by Aho”, “Students at UPenn”, … </li></ul></ul><ul><ul><li>How do we express these? (Important so we reduce the number of sources we query!) </li></ul></ul><ul><li>Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema! </li></ul><ul><ul><li>Not good for scalability or flexibility </li></ul></ul>
  12. 12. Observations of Levy et al. in Information Manifold Paper <ul><li>When you integrate something, you have a conceptual model of the integrated domain </li></ul><ul><ul><li>Define that as a basic frame of reference – not the data that’s in the sources </li></ul></ul><ul><li>May have overlapping/incomplete sources </li></ul><ul><ul><li>Define each source as a subset of a query over the mediated schema </li></ul></ul><ul><ul><li>We can use selection or join predicates to specify that a source contains a range of values: </li></ul></ul><ul><ul><ul><li>ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers” </li></ul></ul></ul>
  13. 13. The Information Manifold <ul><li>Defines the mediated schema independently of the sources! </li></ul><ul><ul><li>“Local-as-view” instead of “global-as-view” </li></ul></ul><ul><ul><li>Assumes that we can only see a small subset of all the possible facts – “open-world assumption” </li></ul></ul><ul><ul><li>Allows us to specify information about data sources </li></ul></ul><ul><ul><li>Focuses on relations (with OO extensions), Datalog </li></ul></ul><ul><li>Guarantees soundness of answers, completeness of “certain answers” – those tuples that must exist </li></ul><ul><ul><li>Maximal set of tuples in query answer that are logically implied by data at the sources, plus all mappings’ constraints </li></ul></ul>
  14. 14. The Local-as-View Model <ul><li>Properties: </li></ul><ul><ul><li>“Local” sources are views over the mediated schema </li></ul></ul><ul><ul><li>Sources have the data – mediated schema is virtual </li></ul></ul><ul><ul><li>Sources may not have all the data from the domain – “open-world assumption” </li></ul></ul><ul><li>The system must use the sources (views) to answer queries over the mediated schema </li></ul><ul><ul><li>“Answering queries using views” … </li></ul></ul>
  15. 15. Answering Queries Using Views <ul><li>Our assumption for today: conjunctive queries , set semantics </li></ul><ul><ul><li>Suppose we have a mediated schema: author(aID, isbn, year), book(isbn, title, publisher) </li></ul></ul><ul><ul><li>A conjunctive query might be: q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB” </li></ul></ul><ul><li>Recall intuitions about this class of queries: </li></ul><ul><ul><li>Adding a conjunct to a query removes answers from the result but never adds any </li></ul></ul><ul><ul><li>Any conjunctive query with at least the same constraints & conjuncts will give valid answers </li></ul></ul>
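The first intuition can be checked on a toy instance. In this sketch the tuples in `author` and `book` are invented for illustration (only the schema comes from the slide), and Python sets stand in for set semantics.

```python
# Hypothetical instance of the mediated schema; all values are made up.
author = {  # author(aID, isbn, year)
    ("Chamberlin", "isbn-1", "1998"),
    ("Bernstein", "isbn-2", "1997"),
}
book = {  # book(isbn, title, publisher)
    ("isbn-1", "DB2 UDB", "Publisher A"),
    ("isbn-2", "Transaction Processing", "Publisher A"),
}


def q_base():
    """q(a, t, p) :- author(a, i, _), book(i, t, p)"""
    return {
        (a, t, p)
        for (a, i, _) in author
        for (i2, t, p) in book
        if i == i2  # join on the shared variable i
    }


def q_with_selection():
    """Same body plus the extra conjunct t = "DB2 UDB"."""
    return {(a, t, p) for (a, t, p) in q_base() if t == "DB2 UDB"}


print(q_base())
print(q_with_selection())
```

On any instance, `q_with_selection()` is a subset of `q_base()`: the extra conjunct can only filter tuples out.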
  16. 16. Query Answering <ul><li>Suppose we have the same query: </li></ul><ul><li> q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB” </li></ul><ul><li>and sources: </li></ul><ul><ul><ul><ul><li>s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123” </li></ul></ul></ul></ul><ul><ul><ul><ul><li>s2(a,t) ⊆ author(a, i, _), book(i, t, p), t = “DB2 UDB” </li></ul></ul></ul></ul><ul><ul><ul><ul><li>s3(a,t,p) ⊆ author(a, i, _), book(i, t, p), t = “123” </li></ul></ul></ul></ul><ul><ul><ul><ul><li>s4(a,i) ⊆ author(a, i, _), a = “Smith” </li></ul></ul></ul></ul><ul><ul><ul><ul><li>s5(a,i) ⊆ author(a, i, _) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>s6(i,p) ⊆ book(i, t, p) </li></ul></ul></ul></ul><ul><li>We want to compose the query with the source mappings – but they’re in the wrong direction! </li></ul>
  17. 17. Inverse Rules <ul><li>We can take every mapping and “invert” it, though sometimes we may have insufficient information: </li></ul><ul><ul><li>If </li></ul></ul><ul><ul><ul><ul><li>s5(a,i) ⊆ author(a, i, _) </li></ul></ul></ul></ul><ul><ul><li>then we can also infer that: </li></ul></ul><ul><ul><ul><ul><li>author(a, i, ??? ) ⊇ s5(a,i) </li></ul></ul></ul></ul><ul><ul><li>But how to handle the absence of the 3rd attribute? </li></ul></ul><ul><ul><li>We know that there must be AT LEAST one instance of ??? in author for each (a,i) pair </li></ul></ul><ul><ul><li>So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)… </li></ul></ul>
  18. 18. But NULLs Lose Information <ul><li>Suppose we take these rules and ask for: </li></ul><ul><ul><ul><ul><li>q(a,t) :- author(a, i, _), book(i, t, p) </li></ul></ul></ul></ul><ul><li>If we look at the rule: </li></ul><ul><ul><ul><ul><li>s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123” </li></ul></ul></ul></ul><ul><li>Clearly q(a,t) :- s1(a,t) </li></ul><ul><li>But if we apply our inversion procedure, we get: </li></ul><ul><ul><ul><ul><li>author(a, NULL , NULL) ⊇ s1(a,t) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>book( NULL , t, p) ⊇ s1(a,t), t = “123” </li></ul></ul></ul></ul><ul><li>and there’s no way to figure out how to join author and book on NULL! </li></ul><ul><ul><li>We need “a special NULL for each a-t combo” so we can figure out which a’s and t’s go together </li></ul></ul>
  19. 19. The Solution: “Skolem Functions” <ul><li>Skolem functions: </li></ul><ul><ul><li>“Perfect” hash functions </li></ul></ul><ul><ul><li>Each function returns a unique, deterministic value for each combination of input values </li></ul></ul><ul><ul><li>Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values) </li></ul></ul><ul><li>Skolem functions won’t ever be part of the answer set or the computation </li></ul><ul><ul><li>They’re just a way of logically generating “special NULLs” </li></ul></ul>
  20. 20. Revisiting Our Example <ul><li>Query: </li></ul><ul><ul><ul><ul><li>q(a,t) :- author(a, i, _), book(i, t, p) </li></ul></ul></ul></ul><ul><li>Mapping rule: </li></ul><ul><ul><ul><ul><li>s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123” </li></ul></ul></ul></ul><ul><li>Inverse rules: </li></ul><ul><ul><ul><ul><li>author(a, f(a,t) , NULL) ⊇ s1(a,t) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>book( f(a,t) , t, p) ⊇ s1(a,t), t = “123” </li></ul></ul></ul></ul><ul><li>We can now expand the query: </li></ul><ul><ul><li>q(a,t) :- author(a, i, NULL), book(i, t, p), i = f(a,t) </li></ul></ul><ul><ul><li>q(a,t) :- s1(a,t), s1(a,t), t = “123”, i = f(a,t) </li></ul></ul>
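The difference between plain NULLs and Skolem terms can be made concrete with a small sketch. The contents of `s1` are hypothetical, the Skolem term f(a,t) is modeled as the Python tuple `("f", a, t)`, and the join assumes SQL-style semantics in which NULL never equals NULL. The unknown publisher `p` is also modeled as `None`, since `s1` does not expose it.

```python
# Hypothetical contents of source s1(a, t); titles follow the slide's t = "123".
s1 = {("Smith", "123"), ("Jones", "123")}


def invert_with_nulls(source):
    """Inverse rules that fill every unknown attribute with NULL (None)."""
    author = {(a, None, None) for (a, t) in source}
    book = {(None, t, None) for (a, t) in source}
    return author, book


def invert_with_skolems(source):
    """Inverse rules that fill the unknown isbn with the Skolem term f(a, t)."""
    author = {(a, ("f", a, t), None) for (a, t) in source}
    book = {(("f", a, t), t, None) for (a, t) in source}
    return author, book


def q(author, book):
    """q(a, t) :- author(a, i, _), book(i, t, p), with NULL never joining."""
    return {
        (a, t)
        for (a, i, _) in author
        for (i2, t, p) in book
        if i is not None and i == i2
    }


print(q(*invert_with_nulls(s1)))    # set()
print(q(*invert_with_skolems(s1)))  # the two original (a, t) pairs
```

With NULLs the join produces nothing; with Skolem terms each author is paired with exactly the title it came with, which recovers `s1`.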
  21. 21. Query Answering Using Inverse Rules <ul><li>Invert all rules using the procedures described </li></ul><ul><li>Take the query and the possible rule expansions and execute them in a Datalog interpreter </li></ul><ul><ul><li>In the previous query, we expand with all combinations of expansions of book and of author – every possible way of combining and cross-correlating info from different sources </li></ul></ul><ul><ul><li>Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent) </li></ul></ul>
  22. 22. Levy et al. Alternative Approach: The Bucket Algorithm <ul><li>Given a query Q with relations and predicates </li></ul><ul><ul><li>Create a bucket for each subgoal in Q </li></ul></ul><ul><ul><li>Iterate over each view (source mapping) </li></ul></ul><ul><ul><ul><li>If source includes bucket’s subgoal: </li></ul></ul></ul><ul><ul><ul><ul><li>Create mapping between Q’s variables and the view’s variables at the same positions </li></ul></ul></ul></ul><ul><ul><ul><ul><li>If satisfiable with substitutions, add to bucket </li></ul></ul></ul></ul><ul><ul><li>Do cross-product of buckets, see if result is contained in the query (recall we saw an algorithm to do that) </li></ul></ul>
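The bucket-building phase might be sketched as follows, using the sources s1 to s6 and the query q(a,t) from the earlier slides. This is a simplified illustration, not the full algorithm: the dictionary encoding, `fits`, and `build_buckets` are assumptions of this sketch, the satisfiability test only compares constants on corresponding variables, and the final containment check on each candidate is left out.

```python
from itertools import product

AUTHOR = ("author", ("a", "i", "_"))
BOOK = ("book", ("i", "t", "p"))


def rule(head, body, consts=None):
    """Hypothetical encoding of a query or view: head vars, body atoms, constants."""
    return {"head": head, "body": body, "consts": consts or {}}


# q(a, t) :- author(a, i, _), book(i, t, p)
QUERY = rule(("a", "t"), [AUTHOR, BOOK])

VIEWS = {
    "s1": rule(("a", "t"), [AUTHOR, BOOK], {"t": "123"}),
    "s2": rule(("a", "t"), [AUTHOR, BOOK], {"t": "DB2 UDB"}),
    "s3": rule(("a", "t", "p"), [AUTHOR, BOOK], {"t": "123"}),
    "s4": rule(("a", "i"), [AUTHOR], {"a": "Smith"}),
    "s5": rule(("a", "i"), [AUTHOR]),
    "s6": rule(("i", "p"), [BOOK]),
}


def fits(query, goal, view):
    """Can `view` supply the query subgoal `goal` (simplified test)?"""
    pred, q_args = goal
    for v_pred, v_args in view["body"]:
        if v_pred != pred:
            continue
        ok = True
        for q_var, v_var in zip(q_args, v_args):  # map vars position by position
            # A variable the query returns must be exposed in the view's head.
            if q_var in query["head"] and v_var not in view["head"]:
                ok = False
            # Constants on corresponding variables must not contradict.
            q_c, v_c = query["consts"].get(q_var), view["consts"].get(v_var)
            if q_c is not None and v_c is not None and q_c != v_c:
                ok = False
        if ok:
            return True
    return False


def build_buckets(query, views):
    """One bucket (list of view names) per subgoal of the query."""
    return [
        [name for name, view in views.items() if fits(query, goal, view)]
        for goal in query["body"]
    ]


buckets = build_buckets(QUERY, VIEWS)
candidates = list(product(*buckets))  # one view per bucket
print(buckets)
print(len(candidates))
```

Each tuple in `candidates` is only a candidate rewriting. It still has to pass the containment test against the query before it can be used.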
  23. 23. Source Capabilities <ul><li>The simplest form is to annotate the attributes of a relation as bound (b) or free (f): </li></ul><ul><ul><li>Book<sup>bff</sup>(auth,title,pub) </li></ul></ul><ul><li>But many data integration efforts had more sophisticated models </li></ul><ul><ul><li>Can a data source support joins between its relations? </li></ul></ul><ul><ul><li>Can a data source be sent a relation that it should join with? </li></ul></ul><ul><li>In the end, we need to perform parts of the query in the mediator, and other parts at the sources </li></ul>
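A binding-pattern annotation can be checked mechanically. The sketch below assumes the usual reading of the adornment, where `b` marks an attribute whose value must be supplied and `f` marks one the source returns; `can_call` is a hypothetical helper name.

```python
def can_call(adornment, attributes, bound):
    """True if every attribute marked 'b' has a value in `bound`."""
    return all(
        attr in bound
        for flag, attr in zip(adornment, attributes)
        if flag == "b"
    )


BOOK_ATTRS = ("auth", "title", "pub")

# Book^bff: the author must be given; title and publisher come back.
print(can_call("bff", BOOK_ATTRS, {"auth": "Chamberlin"}))  # True
print(can_call("bff", BOOK_ATTRS, {"title": "DB2 UDB"}))    # False
```

A mediator could use a check like this to decide whether a source can be queried yet, or whether another source has to be queried first to obtain the bound values.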
  24. 24. Contributions of the Info Manifold <ul><li>More robust way of defining mediated schemas and sources </li></ul><ul><ul><li>Mediated schema is clearly defined, less likely to change </li></ul></ul><ul><ul><li>Sources can be more accurately described </li></ul></ul><ul><li>Relatively efficient algorithms for query reformulation, creating executable plans </li></ul><ul><li>Still requires standardization on a single schema </li></ul><ul><ul><li>Can be hard to get consensus </li></ul></ul><ul><li>Some other aspects were captured in related papers </li></ul><ul><ul><li>Overlap between sources; coverage of data at sources </li></ul></ul><ul><ul><li>Semi-automated creation of mappings </li></ul></ul><ul><ul><li>Semi-automated construction of wrappers </li></ul></ul>
  25. 25. Later Integration Systems Focused on Better Performance <ul><li>Tukwila/Piazza [Ives+99,Halevy+02] – Washington </li></ul><ul><ul><li>Descendants of the Information Manifold </li></ul></ul><ul><ul><li>Similar capabilities, but with adaptive processing of XML as it is read across streams </li></ul></ul><ul><li>Niagara [DeWitt+99] – Wisconsin </li></ul><ul><ul><li>XML querying of web sources </li></ul></ul><ul><ul><li>Giving answers a screenful at a time </li></ul></ul><ul><li>TelegraphCQ [Chandrasekaran+03] – Berkeley </li></ul><ul><ul><li>Adaptive, select-project-join queries over infinite streams </li></ul></ul>
