# 11/11 Slides


1. 1. Bridging Different Data Representations Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 11, 2004
2. 2. A Special Type of Query: Conjunctive Queries <ul><li>A single Datalog rule with no “∨,” “¬,” or “∀” can express select, project, and join – a conjunctive query </li></ul><ul><ul><li>Conjunctive queries are possible to reason about statically </li></ul></ul><ul><ul><li>(Note that we can write CQ’s in other languages, e.g., SQL!) </li></ul></ul><ul><li>We know how to “minimize” conjunctive queries </li></ul><ul><ul><li>An important simplification that can’t be done for general SQL </li></ul></ul><ul><li>We can test whether one conjunctive query’s answers always contain another conjunctive query’s answers (for ANY instance) </li></ul><ul><ul><li>Why might this be useful? </li></ul></ul>
3. 3. Example of Containment <ul><li>Suppose we have two queries: q1(S,C) :- Student(S, N), Takes(S, C), Course(C, X), inCIS(C), Course(C, “DB & Info Systems”) q2(S,C) :- Student(S, N), Takes(S, C), Course(C, X) </li></ul><ul><li>Intuitively, q1 must return the same or fewer answers than q2: </li></ul><ul><ul><li>It has all of the same conditions, plus the extra conjuncts inCIS(C) and Course(C, “DB & Info Systems”) (i.e., it’s more restricted ) </li></ul></ul><ul><ul><li>There’s no union or any other way it can add more data </li></ul></ul><ul><li>We can say that q2 contains q1 because this holds for any instance of our DB {Student, Takes, Course} </li></ul>
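The containment claim on this slide can be checked mechanically. The sketch below is one standard way to do it, not something the slides spell out: q2 contains q1 exactly when q2's atoms can be mapped onto q1's atoms with the head variables preserved (a containment mapping). The tuple encoding of queries, the convention that constants are written with embedded double quotes, and the function names are all assumptions made for illustration.

```python
def is_constant(term):
    # Assumed convention: constants carry embedded double quotes, variables do not.
    return term.startswith('"')


def contains(outer, inner):
    """Return True if every answer of `inner` is also an answer of `outer`.

    A query is (head_terms, body), where body is a list of (predicate, args).
    We search for a mapping from outer's variables to inner's terms that sends
    the outer head to the inner head and each outer atom to some inner atom.
    """
    (outer_head, outer_body), (inner_head, inner_body) = outer, inner
    if len(outer_head) != len(inner_head):
        return False

    def unify(term, target, mapping):
        if is_constant(term):
            return mapping if term == target else None
        if term in mapping:
            return mapping if mapping[term] == target else None
        return {**mapping, term: target}

    def unify_args(args, targets, mapping):
        for term, target in zip(args, targets):
            mapping = unify(term, target, mapping)
            if mapping is None:
                return None
        return mapping

    def search(i, mapping):
        # Backtracking: place outer atom i on some compatible inner atom.
        if i == len(outer_body):
            return True
        pred, args = outer_body[i]
        for inner_pred, inner_args in inner_body:
            if inner_pred == pred and len(inner_args) == len(args):
                extended = unify_args(args, inner_args, mapping)
                if extended is not None and search(i + 1, extended):
                    return True
        return False

    start = unify_args(outer_head, inner_head, {})
    return start is not None and search(0, start)


q1 = (("S", "C"), [("Student", ("S", "N")), ("Takes", ("S", "C")),
                   ("Course", ("C", "X")), ("inCIS", ("C",)),
                   ("Course", ("C", '"DB & Info Systems"'))])
q2 = (("S", "C"), [("Student", ("S", "N")), ("Takes", ("S", "C")),
                   ("Course", ("C", "X"))])

print(contains(q2, q1))  # q2 contains q1
print(contains(q1, q2))  # q1 does not contain q2
```

The search is exponential in the worst case, which is acceptable here because queries are small compared with data.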
4. 4. Wrapping up Datalog… <ul><li>We’ve seen a new language, Datalog </li></ul><ul><ul><li>It’s basically a glorified DRC with a special feature, recursion </li></ul></ul><ul><ul><li>It’s much cleaner than SQL for reasoning about </li></ul></ul><ul><ul><li>… But negation (as in the DRC) poses some challenges </li></ul></ul><ul><li>We’ve seen that a particular kind of query, the conjunctive query, is written naturally in Datalog </li></ul><ul><ul><li>Conjunctive queries are possible to reason about </li></ul></ul><ul><ul><li>We can minimize them, or check containment </li></ul></ul><ul><ul><li>Conjunctive queries are very commonly used in our next problem, data integration </li></ul></ul>
5. 5. A Problem <ul><li>We’ve seen that even with normalization and the same needs, different people will arrive at different schemas </li></ul><ul><li>In fact, most people also have different needs! </li></ul><ul><li>Often people build databases in isolation, then want to share their data </li></ul><ul><ul><li>Different systems within an enterprise </li></ul></ul><ul><ul><li>Different information brokers on the Web </li></ul></ul><ul><ul><li>Scientific collaborators </li></ul></ul><ul><ul><li>Researchers who want to publish their data for others to use </li></ul></ul><ul><li>This is the goal of data integration : tie together different sources, controlled by many people, under a common schema </li></ul>
6. 6. Building a Data Integration System <ul><li>Create a middleware “mediator” or “data integration system” over the sources </li></ul><ul><ul><li>Can be warehoused (a data warehouse) or virtual </li></ul></ul><ul><ul><li>Presents a uniform query interface and schema </li></ul></ul><ul><ul><li>Abstracts away multitude of sources; consults them for relevant data </li></ul></ul><ul><ul><ul><li>Unifies different source data formats (and possibly schemas) </li></ul></ul></ul><ul><ul><ul><li>Sources are generally autonomous , not designed to be integrated </li></ul></ul></ul><ul><ul><li>Sources may be local DBs or remote web sources/services </li></ul></ul><ul><ul><li>Sources may require certain input to return output (e.g., web forms): “ binding patterns ” describe these </li></ul></ul>
7. 7. Typical Data Integration Components [Diagram labels: Query; Results; Data Integration System / Mediator; Mediated Schema; Source Catalog; Mappings in Catalog; Wrapper (×3); Source Relations]
8. 8. Typical Data Integration Architecture [Diagram labels: Query; Reformulator; Source Descrs.; Source Catalog; Query over sources; Query Processor; Queries + bindings; Wrapper (×3); Data in mediated format; Results]
9. 9. Challenges of Mapping Schemas <ul><li>In a perfect world, it would be easy to match up items from one schema with another </li></ul><ul><ul><li>Every table would have a similar table in the other schema </li></ul></ul><ul><ul><li>Every attribute would have an identical attribute in the other schema </li></ul></ul><ul><ul><li>Every value would clearly map to a value in the other schema </li></ul></ul><ul><li>Real world: as with human languages, things don’t map clearly! </li></ul><ul><ul><li>May have different numbers of tables – different decompositions </li></ul></ul><ul><ul><li>Metadata in one relation may be data in another </li></ul></ul><ul><ul><li>Values may not exactly correspond </li></ul></ul><ul><ul><li>It may be unclear whether a value is the same </li></ul></ul>
10. 10. A Few Simple Examples <ul><li>Movie(Title, Year, Director, Editor, Star1, Star2) vs. PieceOfArt(ID, Artist, Subject, Title, TypeOfArt) </li></ul><ul><li>Movie(Title, Year, Director, Editor, Star1, Star2) vs. MotionPicture(ID, Title, Year), Participant(ID, Name, Role) </li></ul><ul><li>(CustName = “Ives, Z.”, CustID = 1234) vs. (EmpName = “Zachary Ives”, PennID = 46732) </li></ul>
11. 11. How Do We Relate Schemas? <ul><li>General approach is to use a view to define relations in one schema (typically either the mediated schema or the source schema), given data in the other schema </li></ul><ul><ul><li>This allows us to “restructure” or “recompose + decompose” our data in a new way </li></ul></ul><ul><li>We can also define mappings between values in a view </li></ul><ul><ul><li>We use an intermediate table defining correspondences – a “concordance table” </li></ul></ul><ul><ul><li>It can be filled in using some type of code, and corrected by hand </li></ul></ul>
12. 12. Mapping Our Examples <ul><li>Movie(Title, Year, Director, Editor, Star1, Star2) vs. PieceOfArt(ID, Artist, Subject, Title, TypeOfArt) </li></ul><ul><li>Movie(Title, Year, Director, Editor, Star1, Star2) vs. MotionPicture(ID, Title, Year), Participant(ID, Name, Role) </li></ul><ul><li>PieceOfArt(I, A, S, T, “Movie”) :- Movie(T, Y, A, _, S1, S2), I = T || Y, S = S1 || S2 </li></ul><ul><li>Movie(T, Y, D, E, S1, S2) :- MotionPicture(I, T, Y), Participant(I, D, “Dir”), Participant(I, E, “Editor”), Participant(I, S1, “Star1”), Participant(I, S2, “Star2”) </li></ul><ul><li>(CustName = “Ives, Z.”, CustID = 1234) vs. (EmpName = “Zachary Ives”, PennID = 46732): need a concordance table from CustIDs to PennIDs </li></ul>(Diagram labels: T1, T2.)
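The two mapping rules on this slide are ordinary views, so they can be evaluated directly over sample data. The following is a minimal sketch, assuming relations are lists of tuples in the attribute order shown on the slide; the sample rows and the function names `movie_view` and `piece_of_art_view` are made up for illustration.

```python
from itertools import product

# Hypothetical sample instance of the source schema.
motion_picture = [("m1", "Example Film", 2004)]
participant = [
    ("m1", "Dir A", "Dir"),
    ("m1", "Editor B", "Editor"),
    ("m1", "Star C", "Star1"),
    ("m1", "Star D", "Star2"),
]


def movie_view(motion_picture, participant):
    """Movie(T, Y, D, E, S1, S2) :- MotionPicture(I, T, Y), four Participant joins on I."""
    def names(movie_id, role):
        return [n for (i, n, r) in participant if i == movie_id and r == role]

    rows = []
    for movie_id, title, year in motion_picture:
        combos = product(names(movie_id, "Dir"), names(movie_id, "Editor"),
                         names(movie_id, "Star1"), names(movie_id, "Star2"))
        for director, editor, star1, star2 in combos:
            rows.append((title, year, director, editor, star1, star2))
    return rows


def piece_of_art_view(movie):
    """PieceOfArt(I, A, S, T, "Movie") :- Movie(T, Y, A, _, S1, S2), I = T || Y, S = S1 || S2."""
    return [(f"{title}{year}", director, f"{star1}{star2}", title, "Movie")
            for (title, year, director, _editor, star1, star2) in movie]


movies = movie_view(motion_picture, participant)
print(movies)
print(piece_of_art_view(movies))
```

A movie with no editor row produces no Movie tuple at all under this rule, which is one concrete way in which different decompositions fail to map cleanly.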
13. 13. Two Important Approaches <ul><li>TSIMMIS [Garcia-Molina+97] – Stanford </li></ul><ul><ul><li>Focus: semistructured data (OEM), OQL-based language (Lorel) </li></ul></ul><ul><ul><li>Creates a mediated schema as a view over the sources </li></ul></ul><ul><ul><li>Spawned a UCSD project called MIX, which led to a company now owned by BEA Systems </li></ul></ul><ul><ul><li>Other important systems in this vein: Kleisli/K2 @ Penn </li></ul></ul><ul><li>Information Manifold [Levy+96] – AT&T Research </li></ul><ul><ul><li>Focus: local-as-view mappings, relational model </li></ul></ul><ul><ul><li>Sources defined as views over mediated schema </li></ul></ul><ul><ul><ul><li>Requires a special </li></ul></ul></ul><ul><ul><li>Spawned Tukwila at Washington, and eventually a company as well </li></ul></ul><ul><ul><li>Led to peer-to-peer integration approaches (Piazza, etc.) </li></ul></ul>
14. 14. TSIMMIS and Information Manifold <ul><li>Focus: Web-based queryable sources </li></ul><ul><ul><li>CGI forms, online databases, maybe a few RDBMSs </li></ul></ul><ul><ul><li>Each needs to be mapped into the system – not as easy as web search – but the benefits are significant vs. query engines </li></ul></ul><ul><li>A few parenthetical notes: </li></ul><ul><ul><li>Part of a slew of works on wrappers, source profiling, etc. </li></ul></ul><ul><ul><li>The creation of mappings can be partly automated – systems such as LSD, Cupid, Clio, … do this </li></ul></ul><ul><ul><li>Today most people look at integrating large enterprises (that’s where the \$\$\$ is!) – Nimble, BEA, IBM </li></ul></ul>
15. 15. TSIMMIS <ul><li>“The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew </li></ul><ul><li>An instance of a “global-as-view” mediation system </li></ul><ul><li>One of the first systems to support semi-structured data, which predated XML by several years </li></ul>
16. 16. Semi-structured Data: OEM <ul><li>Observation: given a particular schema, its attributes may be unavailable from certain sources – inherent irregularity </li></ul><ul><li>Proposal: Object Exchange Model, OEM </li></ul><ul><ul><ul><li>OID: <label, type, value> </li></ul></ul></ul><ul><li>… How does it relate to XML? </li></ul><ul><li>… What problems does OEM solve, and not solve, in a heterogeneous system? </li></ul>
17. 17. OEM Example Show this XML fragment in OEM: <book> <author>Bernstein</author> <author>Newcomer</author> <title>Principles of TP</title> </book> <book> <author>Chamberlin</author> <title>DB2 UDB</title> </book>
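One possible answer to this exercise can be sketched in code. The slides give the OEM object shape as OID: <label, type, value>; everything else below is an assumption made for illustration: the `&n` style of object identifiers, the type names `"set"` and `"string"`, and the helper name `to_oem`. Complex objects hold the OIDs of their children as their value.

```python
import xml.etree.ElementTree as ET
from itertools import count

FRAGMENT = """
<book> <author>Bernstein</author> <author>Newcomer</author>
       <title>Principles of TP</title> </book>
<book> <author>Chamberlin</author> <title>DB2 UDB</title> </book>
"""


def to_oem(xml_fragment):
    """Return (top-level OIDs, {oid: (label, type, value)}) for an XML fragment."""
    # The fragment has two top-level elements, so wrap it in a throwaway root.
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    objects = {}
    ids = count(1)

    def visit(elem):
        oid = f"&{next(ids)}"
        children = list(elem)
        if children:
            # Complex object: its value is the list of child OIDs.
            objects[oid] = (elem.tag, "set", [visit(child) for child in children])
        else:
            # Atomic object: its value is the element's text.
            objects[oid] = (elem.tag, "string", (elem.text or "").strip())
        return oid

    top_level = [visit(elem) for elem in root]
    return top_level, objects


top_level, objects = to_oem(FRAGMENT)
for oid in sorted(objects, key=lambda o: int(o[1:])):
    print(oid, objects[oid])
```

The second book simply has one author object where the first has two, so no schema change is needed to hold both. XML attributes and document order beyond child lists are ignored in this sketch.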
18. 18. Queries in TSIMMIS <ul><li>Specified in OQL-style language called Lorel </li></ul><ul><ul><li>OQL was an object-oriented query language </li></ul></ul><ul><ul><li>Lorel is, in many ways, a predecessor to XQuery </li></ul></ul><ul><li>Based on path expressions over OEM structures: </li></ul><ul><li>select book where book.title = “DB2 UDB” and book.author = “Chamberlin” </li></ul><ul><li>This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Previous query restated = </li></ul><ul><ul><li>for \$b in AllData()/book where \$b/title/text() = “DB2 UDB” and \$b/author/text() = “Chamberlin” return \$b </li></ul></ul>
19. 19. Query Answering in TSIMMIS <ul><li>Basically, it’s view unfolding , i.e., composing a query with a view </li></ul><ul><ul><li>The query is the one being asked </li></ul></ul><ul><ul><li>The views are the MSL templates for the wrappers </li></ul></ul><ul><ul><li>Some of the views may actually require parameters, e.g., an author name, before they’ll return answers </li></ul></ul><ul><ul><ul><li>Common for web forms (see Amazon, Google, …) </li></ul></ul></ul><ul><ul><ul><li>XQuery functions (XQuery’s version of views) support parameters as well, so we’ll see these in action </li></ul></ul></ul>
20. 20. A Wrapper Definition in MSL <ul><li>Wrappers have templates and binding patterns (\$X) in MSL: </li></ul><ul><li>B :- B: <book {<author \$X>}> // \$\$ = “select * from book where author=“ \$X // </li></ul><ul><ul><li>This reformats a SQL query over Book(author, year, title) </li></ul></ul><ul><li>In XQuery, this might look like: </li></ul><ul><ul><li>define function GetBook(\$x AS xsd:string) as book { for \$b in sql(“Amazon.DB”, “select * from book where author=‘” + \$x + “’”) return <book>{\$b/title}<author>{\$x}</author></book> </li></ul></ul><ul><ul><li>} </li></ul></ul>[Diagram: book with title and author] GetBook’s results are unioned with those of other wrappers to form the view AllData()
21. 21. How to Answer the Query <ul><li>Given our query: </li></ul><ul><ul><li>for \$b in AllData()/book where \$b/title/text() = “DB2 UDB” and \$b/author/text() = “Chamberlin” return \$b </li></ul></ul><ul><li>Find all wrapper definitions that: </li></ul><ul><ul><li>Output enough “structure” to match the conditions of the query </li></ul></ul><ul><ul><li>Or have already tested the conditions for us! </li></ul></ul>
22. 22. Query Composition with Views <ul><li>We find all views that define book with author and title, and we compose the query with each: </li></ul><ul><ul><li>define function GetBook(\$x AS xsd:string) as book { for \$b in sql(“Amazon.DB”, “select * from book where author=‘” + \$x + “’”) return <book> {\$b/title} <author>{\$x}</author></book> </li></ul></ul><ul><ul><li>} </li></ul></ul><ul><ul><li>for \$b in AllData()/book where \$b/title/text() = “DB2 UDB” and \$b/author/text() = “Chamberlin” return \$b </li></ul></ul>[Diagram: book with title and author]
23. 23. Matching View Output to Our Query’s Conditions <ul><li>Determine that \$b/book/author/text() corresponds to \$x by matching the pattern on the function’s output: </li></ul><ul><ul><li>define function GetBook(\$x AS xsd:string) as book { for \$b in sql(“Amazon.DB”, “select * from book where author=‘” + \$x + “’”) return <book>{ \$b/title } <author>{\$x}</author> </book> </li></ul></ul><ul><ul><li>} </li></ul></ul><ul><ul><li>let \$x := “Chamberlin” for \$b in GetBook(\$x)/book where \$b/title/text() = “DB2 UDB” return \$b </li></ul></ul>[Diagram: book with title and author]
24. 24. The Final Step: Unfolding <ul><ul><li>let \$x := “Chamberlin” for \$b in ( for \$b’ in sql(“Amazon.com”, “select * from book where author=‘” + \$x + “’”) return <book>{ \$b’/title } <author>{\$x}</author></book> </li></ul></ul><ul><ul><li> )/book where \$b/title/text() = “DB2 UDB” return \$b </li></ul></ul><ul><li>How do we simplify further to get to here? </li></ul><ul><ul><li>for \$b in sql(“Amazon.com”, “select * from book where author=‘Chamberlin’”) where \$b/title/text() = “DB2 UDB” return \$b </li></ul></ul>
25. 25. Virtues of TSIMMIS <ul><li>Early adopter of semistructured data, greatly predating XML </li></ul><ul><ul><li>Can support data from many different kinds of sources </li></ul></ul><ul><ul><li>Obviously, doesn’t fully solve heterogeneity problem </li></ul></ul><ul><li>Presents a mediated schema that is the union of multiple views </li></ul><ul><ul><li>Query answering based on view unfolding </li></ul></ul><ul><li>Easily composed in a hierarchy of mediators </li></ul>
26. 26. Limitations of TSIMMIS’ Approach <ul><li>Some data sources may contain data with certain ranges or properties </li></ul><ul><ul><li>“Books by Aho”, “Students at UPenn”, … </li></ul></ul><ul><ul><li>If we ask a query for students at Columbia, don’t want to bother querying students at Penn… </li></ul></ul><ul><ul><li>How do we express these? </li></ul></ul><ul><li>Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema </li></ul>
27. 27. An Alternate Approach: The Information Manifold (Levy et al.) <ul><li>When you integrate something, you have some conceptual model of the integrated domain </li></ul><ul><ul><li>Define that as a basic frame of reference, everything else as a view over it </li></ul></ul><ul><ul><li>“Local as View” </li></ul></ul><ul><li>May have overlapping/incomplete sources </li></ul><ul><ul><li>Define each source as the subset of a query over the mediated schema </li></ul></ul><ul><ul><li>We can use selection or join predicates to specify that a source contains a range of values : </li></ul></ul><ul><ul><ul><li>ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers” </li></ul></ul></ul>
28. 28. The Local-as-View Model <ul><li>The basic model is the following: </li></ul><ul><ul><li>“Local” sources are views over the mediated schema </li></ul></ul><ul><ul><li>Sources have the data – mediated schema is virtual </li></ul></ul><ul><ul><li>Sources may not have all the data from the domain – “open-world assumption” </li></ul></ul><ul><li>The system must use the sources (views) to answer queries over the mediated schema </li></ul>
29. 29. Answering Queries Using Views <ul><li>Assumption: conjunctive queries , set semantics </li></ul><ul><ul><li>Suppose we have a mediated schema: author(aID, isbn, year), book(isbn, title, publisher) </li></ul></ul><ul><ul><li>A conjunctive query might be: q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB” </li></ul></ul><ul><li>Recall intuitions about this class of queries: </li></ul><ul><ul><li>Adding a conjunct to a query removes answers from the result but never adds any </li></ul></ul><ul><ul><li>Any conjunctive query with at least the same constraints & conjuncts will give valid answers </li></ul></ul>
30. 30. Query Answering <ul><li>Suppose we have the query: </li></ul><ul><li> q(a, t, p) :- author(a, i, _), book(i, t, p) </li></ul><ul><li>and sources: </li></ul><ul><ul><ul><ul><li>s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123” </li></ul></ul></ul></ul><ul><ul><ul><ul><li>… </li></ul></ul></ul></ul><ul><ul><ul><ul><li>s5(a,i) ⊆ author(a, i, _) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>s6(i,p) ⊆ book(i, t, p) </li></ul></ul></ul></ul><ul><li>We want to compose the query with the source mappings – but they’re in the wrong direction! </li></ul>
31. 31. Inverse Rules <ul><li>We can take every mapping and “invert” it, though sometimes we may have insufficient information: </li></ul><ul><ul><li>If </li></ul></ul><ul><ul><ul><ul><li>s5(a,i) ⊆ author(a, i, _) </li></ul></ul></ul></ul><ul><ul><li>then we can also infer that: </li></ul></ul><ul><ul><ul><ul><li>author(a, i, ??? ) ⊇ s5(a,i) </li></ul></ul></ul></ul><ul><li>But how to handle the absence of the 3rd (year) attribute? </li></ul><ul><ul><li>We know that there must be AT LEAST one instance of ??? in author for each (a,i) pair </li></ul></ul><ul><ul><li>So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)… </li></ul></ul>
32. 32. But NULLs Lose Information <ul><li>Suppose we take these rules and ask for: </li></ul><ul><ul><ul><ul><li>q(a,t) :- author(a, i, _), book(i, t, p) </li></ul></ul></ul></ul><ul><li>If we look at the rule: </li></ul><ul><ul><ul><ul><li>s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123” </li></ul></ul></ul></ul><ul><li>Clearly q(a,t) ⊇ s1(a,t) </li></ul><ul><li>But if we apply our inversion procedure, we get: </li></ul><ul><ul><ul><ul><li>author(a, NULL , NULL) ⊇ s1(a,t) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>book( NULL , t, p) ⊇ s1(a,t), t = “123” </li></ul></ul></ul></ul><ul><li>and there’s no way to know how to join author and book on NULL! </li></ul><ul><ul><li>We need “a special NULL for each a-t combo” so we can figure out which a’s and t’s go together </li></ul></ul>
33. 33. The Solution: “Skolem Functions” <ul><li>Skolem functions: </li></ul><ul><ul><li>Conceptual “perfect” hash functions </li></ul></ul><ul><ul><li>Each function returns a unique, deterministic value for each combination of input values </li></ul></ul><ul><ul><li>Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values) </li></ul></ul><ul><li>Skolem function values won’t ever be part of the answer set or the computation – they aren’t real values </li></ul><ul><ul><li>They’re just a way of logically generating “special NULLs” </li></ul></ul>
34. 34. Revisiting Our Example <ul><li>Query: </li></ul><ul><ul><ul><ul><li>q(a,t) :- author(a, i, _), book(i, t, p) </li></ul></ul></ul></ul><ul><li>Mapping rule: </li></ul><ul><ul><ul><ul><li>s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123” </li></ul></ul></ul></ul><ul><li>Inverse rules: </li></ul><ul><ul><ul><ul><li>author(a, f(a,t) , NULL) ⊇ s1(a,t) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>book( f(a,t) , t, p) ⊇ s1(a,t), t = “123” </li></ul></ul></ul></ul><ul><li>Expand the query as follows: </li></ul><ul><ul><li>q(a,t) :- author(a, i, NULL), book(i, t, p), i = f(a,t) </li></ul></ul><ul><ul><li>q(a,t) :- s1(a,t), s1(a,t), t = “123”, i = f(a,t) </li></ul></ul>
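The difference between plain NULLs and Skolem terms can be seen by running the inverse rules over a small source instance. This is a sketch under several assumptions: relations are lists of tuples, the Skolem term f(a,t) is modeled as the tuple `("f", a, t)`, `None` stands in for NULL and never joins with anything, and the attributes that are not joined on (year and p) are filled with `None` in both variants. The sample rows and all function names are illustrative.

```python
def invert_with_nulls(s1_rows):
    """Inverse rules where the unknown isbn is a plain NULL."""
    rows = [(a, t) for (a, t) in s1_rows if t == "123"]
    author = [(a, None, None) for (a, t) in rows]
    book = [(None, t, None) for (a, t) in rows]
    return author, book


def skolem_f(a, t):
    # A unique, deterministic placeholder for each (a, t) combination.
    return ("f", a, t)


def invert_with_skolem(s1_rows):
    """Inverse rules where the unknown isbn is the Skolem term f(a, t)."""
    rows = [(a, t) for (a, t) in s1_rows if t == "123"]
    author = [(a, skolem_f(a, t), None) for (a, t) in rows]
    book = [(skolem_f(a, t), t, None) for (a, t) in rows]
    return author, book


def q(author, book):
    """q(a,t) :- author(a, i, _), book(i, t, p); NULL never equals NULL."""
    return sorted({(a, t)
                   for (a, i, _year) in author
                   for (j, t, _p) in book
                   if i is not None and i == j})


s1 = [("a1", "123"), ("a2", "123")]  # hypothetical contents of source s1

print(q(*invert_with_nulls(s1)))   # the join on NULL finds nothing
print(q(*invert_with_skolem(s1)))  # the Skolem terms let author and book join
```

The Skolem terms appear only in the join column and never in the answers, which contain just the a and t values taken from s1.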
35. 35. Query Answering Using Inverse Rules <ul><li>Invert all rules using the procedures described </li></ul><ul><li>Take the query and the possible rule expansions and execute them in a Datalog interpreter </li></ul><ul><ul><li>In the previous query, we expand with all combinations of expansions of book and of author – every possible way of combining and cross-correlating info from different sources </li></ul></ul><ul><ul><li>Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent) </li></ul></ul><ul><li>More efficient, but equivalent, algorithms now exist: </li></ul><ul><ul><li>Bucket algorithm [Levy et al.] </li></ul></ul><ul><ul><li>MiniCon [Pottinger & Halevy] </li></ul></ul><ul><ul><li>Also related: “chase and backchase” [Popa, Tannen, Deutsch] </li></ul></ul>
36. 36. Summary of Data Integration <ul><li>Local-as-view integration has replaced global-as-view as the standard </li></ul><ul><ul><li>More robust way of defining mediated schemas and sources </li></ul></ul><ul><ul><li>Mediated schema is clearly defined, less likely to change </li></ul></ul><ul><ul><li>Sources can be more accurately described </li></ul></ul><ul><li>Methods exist for query reformulation, including inverse rules </li></ul><ul><li>Integration requires standardization on a single schema </li></ul><ul><ul><li>Can be hard to get consensus </li></ul></ul><ul><ul><li>Today we have peer-to-peer data integration, e.g., Piazza [Halevy et al.], Orchestra [Ives et al.], Hyperion [Miller et al.] </li></ul></ul><ul><li>Some other aspects of integration were addressed in related papers </li></ul><ul><ul><li>Overlap between sources; coverage of data at sources </li></ul></ul><ul><ul><li>Semi-automated creation of mappings and wrappers </li></ul></ul><ul><li>Data integration capabilities in commercial products: BEA’s Liquid Data, IBM’s DB2 Information Integrator, numerous packages from middleware companies </li></ul>