Transcript

  • 1. Data Integration Techniques Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 30, 2003 Some slide content may be courtesy of Susan Davidson, Dan Suciu, & Raghu Ramakrishnan
  • 2. We Left Off with TSIMMIS
    • “The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew
    • An instance of a “global-as-view” mediation system
    • One of the first systems to support semi-structured data, which predated XML by several years
    • This system, like the Information Manifold, focused on querying web sources
      • Real-world integration companies (IBM, BEA, Actuate, …) are focusing on the enterprise – more $$$!
  • 3. Queries in TSIMMIS
    • Specified in OQL-style language called Lorel
      • OQL was an object-oriented query language
      • Lorel is a predecessor to XQuery; OEM is a predecessor to XML
    • Based on path expressions over OEM structures:
    • select book where book.title = “DB2 UDB” and book.author = “Chamberlin”
    • This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Restating the query above:
      • for $b in document(“mediated-schema”)/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b
  • 4. Query Answering in TSIMMIS
    • Basically, it’s view unfolding , i.e., composing a query with a view
      • The query is the one being asked
      • The views are the MSL templates for the wrappers
      • Some of the views may actually require parameters, e.g., an author name, before they’ll return answers
        • These are called input bindings
        • Common for web forms (see Amazon, Google, …)
        • XQuery functions (XQuery’s version of views) support parameters as well, so we’ll use these to illustrate
  • 5. A Wrapper Definition in MSL, Translated to XQuery
    • Wrappers have templates and binding patterns ($X) in MSL:
    • B :- B: <book {<author $X>}> // $$ = “select * from book where author=“ $X //
      • This reformats a SQL query over Book(author, year, title)
    • In XQuery, this might look like:
      • define function GetBook($x AS xsd:string) as book* { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book>$b<author>$x</author></book>
      • }
  • 6. How to Answer the Query
    • Given our query:
      • for $b in document(“mediated-schema”)/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b
    • We want to find all wrapper definitions that:
      • Either output enough information that we can evaluate all of our conditions over the output
        • They return a book’s title and author, so we can test our conditions against these
      • Or have already “enforced” the conditions for us!
        • They already do a selection on author=“Chamberlin,” etc.
  • 7. Query Composition with Views
    • We find all views that define book with author and title, and we compose the query with each of these
    • In our example, we find one wrapper definition that matches:
      • define function GetBook($x AS xsd:string) as book* { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book>$b<author>$x</author></book>
      • }
      • for $b in document(“mediated-schema”)/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b
  • 8. Matching View Output to Our Query’s Conditions
    • Determine that the query tests for $x=“Chamberlin” by matching the query’s XPath, $b/author/text() , on the function’s output:
      • define function GetBook($x AS xsd:string) as book* { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book> $b <author>$x</author> </book>
      • }
      • let $x := “Chamberlin” for $b in GetBook($x)/book where $b/title/text() = “DB2 UDB” return $b
  • 9. The Final Step: Unfolding
    • The expression:
      • let $x := “Chamberlin” for $b in { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book> $b <author>$x</author> </book> }/book where $b/title/text() = “DB2 UDB” return $b
    • Can be unnested (“unfolded”) and simplified to:
      • for $b in sql(“select * from book where author=‘Chamberlin’”) where $b/title/text() = “DB2 UDB” return $b
  • 10. What Is the Answer?
    • Given schema book(author, year, title) and Datalog rules defining an instance:
      • book(“Chamberlin”, “1992”, “DB2 UDB”)
      • book(“Chamberlin”, “1995”, “DB2/CS”)
      • book(“Bernstein”, “1997”, “Transaction Processing”)
    • TSIMMIS is an instance of a global-as-view mediator with a semistructured data model
      • Can also have GAV mediators using Datalog or SQL, which work on similar principles
    • Queries and mappings are unfolded (macro-expanded + simplified)
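As a rough illustration of what the unfolded query returns, here is a minimal Python sketch. It assumes the three book facts above are the entire source instance. The names `BOOKS`, `get_book`, and `answer` are hypothetical stand-ins for the SQL source, the wrapper function, and the unfolded mediator query; they are not part of TSIMMIS.

```python
# Source instance for book(author, year, title), copied from the slide.
BOOKS = [
    ("Chamberlin", "1992", "DB2 UDB"),
    ("Chamberlin", "1995", "DB2/CS"),
    ("Bernstein", "1997", "Transaction Processing"),
]


def get_book(author):
    """Hypothetical wrapper (the view): needs an author as its input binding."""
    return [
        {"author": a, "year": y, "title": t}
        for (a, y, t) in BOOKS
        if a == author
    ]


def answer(title, author):
    """Unfolded query: the author condition is pushed into the wrapper call,
    and the title condition is checked at the mediator."""
    return [b for b in get_book(author) if b["title"] == title]


result = answer("DB2 UDB", "Chamberlin")
print(result)
```

Under these assumptions only the 1992 Chamberlin book satisfies both conditions.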
  • 11. Limitations of Global-As-View
    • Some data sources may contain data that falls within certain ranges or has certain known properties
      • “Books by Aho”, “Students at UPenn”, …
      • How do we express these? (Important so we reduce the number of sources we query!)
    • Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema!
      • Not good for scalability or flexibility
  • 12. Observations of Levy et al. in Information Manifold Paper
    • When you integrate something, you have a conceptual model of the integrated domain
      • Define that as a basic frame of reference – not the data that’s in the sources
    • May have overlapping/incomplete sources
      • Define each source as the subset of a query over the mediated schema
      • We can use selection or join predicates to specify that a source contains a range of values :
        • ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”
  • 13. The Information Manifold
    • Defines the mediated schema independently of the sources!
      • “Local-as-view” instead of “global-as-view”
      • Assumes that we can only see a small subset of all the possible facts – “open-world assumption”
      • Allows us to specify information about data sources
      • Focuses on relations (with OO extensions), Datalog
    • Guarantees soundness of answers, completeness of “certain answers” – those tuples that must exist
      • Maximal set of tuples in query answer that are logically implied by data at the sources, plus all mappings’ constraints
  • 14. The Local-as-View Model
    • Properties:
      • “Local” sources are views over the mediated schema
      • Sources have the data – mediated schema is virtual
      • Sources may not have all the data from the domain – “open-world assumption”
    • The system must use the sources (views) to answer queries over the mediated schema
      • “Answering queries using views” …
  • 15. Answering Queries Using Views
    • Our assumption for today: conjunctive queries , set semantics
      • Suppose we have a mediated schema: author(aID, isbn, year), book(isbn, title, publisher)
      • A conjunctive query might be: q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
    • Recall intuitions about this class of queries:
      • Adding a conjunct to a query removes answers from the result but never adds any
      • Any conjunctive query with at least the same constraints & conjuncts will give valid answers
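The first intuition can be checked on a toy instance. The sketch below is only illustrative: the `AUTHOR` and `BOOK` tuples, the ISBN values `i1` to `i3`, and the publisher names `P1` and `P2` are made-up data, not from the slides. It evaluates the query with and without the extra conjunct t = “DB2 UDB”.

```python
# Hypothetical instance of the mediated schema:
# author(aID, isbn, year) and book(isbn, title, publisher).
AUTHOR = {
    ("Chamberlin", "i1", "1992"),
    ("Chamberlin", "i2", "1995"),
    ("Bernstein", "i3", "1997"),
}
BOOK = {
    ("i1", "DB2 UDB", "P1"),
    ("i2", "DB2/CS", "P1"),
    ("i3", "Transaction Processing", "P2"),
}


def q_join():
    # q0(a, t, p) :- author(a, i, _), book(i, t, p)
    return {(a, t, p) for (a, i, _year) in AUTHOR for (j, t, p) in BOOK if i == j}


def q_db2():
    # q(a, t, p) :- author(a, i, _), book(i, t, p), t = 'DB2 UDB'
    return {(a, t, p) for (a, t, p) in q_join() if t == "DB2 UDB"}


all_rows = q_join()
db2_rows = q_db2()
# The extra conjunct can only remove tuples, so q is contained in q0.
print(db2_rows <= all_rows)
```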
  • 16. Query Answering
    • Suppose we have the same query:
    • q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
    • and sources:
          • s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123”
          • s2(a,t) ⊆ author(a, i, _), book(i, t, p), t = “DB2 UDB”
          • s3(a,t,p) ⊆ author(a, i, _), book(i, t, p), t = “123”
          • s4(a,i) ⊆ author(a, i, _), a = “Smith”
          • s5(a,i) ⊆ author(a, i, _)
          • s6(i,p) ⊆ book(i, t, p)
    • We want to compose the query with the source mappings – but they’re in the wrong direction!
  • 17. Inverse Rules
    • We can take every mapping and “invert” it, though sometimes we may have insufficient information:
      • If
          • s5(a,i) ⊆ author(a, i, _)
      • then we can also infer that:
          • author(a, i, ???) ⊇ s5(a,i)
      • But how to handle the absence of the 3rd attribute?
      • We know that there must be AT LEAST one instance of ??? in author for each (a,i) pair
      • So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)…
  • 18. But NULLs Lose Information
    • Suppose we take these rules and ask for:
          • q(a,t) :- author(a, i, _), book(i, t, p)
    • If we look at the rule:
          • s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123”
    • Clearly q(a,t) :- s1(a,t)
    • But if we apply our inversion procedure, we get:
          • author(a, NULL, NULL) ⊇ s1(a,t)
          • book(NULL, t, p) ⊇ s1(a,t), t = “123”
    • and there’s no way to figure out how to join author and book on NULL!
      • We need “a special NULL for each a-t combo” so we can figure out which a’s and t’s go together
  • 19. The Solution: “Skolem Functions”
    • Skolem functions:
      • “Perfect” hash functions
      • Each function returns a unique, deterministic value for each combination of input values
      • Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values)
    • Skolem functions won’t ever be part of the answer set or the computation
      • They’re just a way of logically generating “special NULLs”
  • 20. Revisiting Our Example
    • Query:
          • q(a,t) :- author(a, i, _), book(i, t, p)
    • Mapping rule:
          • s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123”
    • Inverse rules:
          • author(a, f(a,t), NULL) ⊇ s1(a,t)
          • book(f(a,t), t, p) ⊇ s1(a,t), t = “123”
    • We can now expand the query:
      • q(a,t) :- author(a, i, NULL), book(i, t, p), i = f(a,t)
      • q(a,t) :- s1(a,t), s1(a,t), t = “123”, i = f(a,t)
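The difference between plain NULLs and Skolem terms can be sketched in a few lines of Python. This is an illustration under assumptions, not the Information Manifold's implementation: the contents of `S1` are made up, a Skolem term f(a,t) is modelled as the tagged tuple `("f", a, t)`, and the unknown year and publisher are both set to `None` for simplicity.

```python
# Hypothetical contents of source s1(a, t); the mapping forces t = "123".
S1 = {("Smith", "123"), ("Jones", "123")}


def skolem_f(a, t):
    # Equal inputs give equal terms; different inputs give different terms.
    return ("f", a, t)


def invert_with_null(s1):
    # author(a, NULL, NULL) and book(NULL, t, NULL) for every s1(a, t)
    author = {(a, None, None) for (a, t) in s1}
    book = {(None, t, None) for (a, t) in s1}
    return author, book


def invert_with_skolem(s1):
    # author(a, f(a,t), NULL) and book(f(a,t), t, NULL) for every s1(a, t)
    author = {(a, skolem_f(a, t), None) for (a, t) in s1}
    book = {(skolem_f(a, t), t, None) for (a, t) in s1}
    return author, book


def q(author, book):
    # q(a, t) :- author(a, i, _), book(i, t, p); a NULL never joins.
    return {
        (a, t)
        for (a, i, _year) in author
        for (j, t, _pub) in book
        if i is not None and i == j
    }


null_answers = q(*invert_with_null(S1))
skolem_answers = q(*invert_with_skolem(S1))
print(null_answers, skolem_answers)
```

With NULLs the join on i yields nothing, while the Skolem terms pair each author only with the title it came with, so the answers are exactly the tuples of `S1`.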
  • 21. Query Answering Using Inverse Rules
    • Invert all rules using the procedures described
    • Take the query and the possible rule expansions and execute them in a Datalog interpreter
      • In the previous query, we expand with all combinations of expansions of book and of author – every possible way of combining and cross-correlating info from different sources
      • Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent)
  • 22. Levy et al.’s Alternative Approach: The Bucket Algorithm
    • Given a query Q with relations and predicates
      • Create a bucket for each subgoal in Q
      • Iterate over each view (source mapping)
        • If source includes bucket’s subgoal:
          • Create a mapping between Q’s variables and the view’s variables at the same positions
          • If satisfiable with substitutions, add to bucket
      • Do cross-product of buckets, see if result is contained in the query (recall we saw an algorithm to do that)
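The bucket-construction step can be sketched as follows, using the sources from the Query Answering slide. This is a simplified illustration, not the published algorithm: the rule encoding (head variables, body atoms, and a dict of variable-to-constant constraints) and the names `covers` and `build_buckets` are assumptions, the query head is taken to be q(a, t) rather than q(a, t, p), and the final containment test is left out.

```python
from itertools import product

BOTH = [("author", ("a", "i", "y")), ("book", ("i", "t", "p"))]

# A rule is (head_vars, body_atoms, constraints on variables).
QUERY = (("a", "t"), BOTH, {"t": "DB2 UDB"})

VIEWS = {
    "s1": (("a", "t"), BOTH, {"t": "123"}),
    "s2": (("a", "t"), BOTH, {"t": "DB2 UDB"}),
    "s3": (("a", "t", "p"), BOTH, {"t": "123"}),
    "s4": (("a", "i"), [("author", ("a", "i", "y"))], {"a": "Smith"}),
    "s5": (("a", "i"), [("author", ("a", "i", "y"))], {}),
    "s6": (("i", "p"), [("book", ("i", "t", "p"))], {}),
}


def covers(q_atom, view):
    """True if some subgoal of the view can stand in for the query subgoal."""
    q_head, _, q_cons = QUERY
    v_head, v_body, v_cons = view
    rel, q_args = q_atom
    for v_rel, v_args in v_body:
        if v_rel != rel:
            continue
        ok = True
        for qv, vv in zip(q_args, v_args):
            # A variable the query returns must be exported by the view.
            if qv in q_head and vv not in v_head:
                ok = False
            # Constants required on both sides must agree (satisfiability).
            if qv in q_cons and vv in v_cons and q_cons[qv] != v_cons[vv]:
                ok = False
        if ok:
            return True
    return False


def build_buckets():
    _, q_body, _ = QUERY
    return [[n for n, v in VIEWS.items() if covers(atom, v)] for atom in q_body]


buckets = build_buckets()
# Cross-product of buckets: each combination is only a candidate rewriting
# and would still have to pass the containment check.
candidates = list(product(*buckets))
print(buckets)
```

Here the author bucket holds s1 through s5, while the book bucket holds only s2: s1 and s3 contradict t = “DB2 UDB”, and s6 does not export t.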
  • 23. Source Capabilities
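    • The simplest form is to annotate each attribute of a relation as bound (b) or free (f), written as a superscript on the relation name:
      • Book^bff(auth, title, pub), i.e., auth must be supplied as input; title and pub are returned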
    • But many data integration efforts had more sophisticated models
      • Can a data source support joins between its relations?
      • Can a data source be sent a relation that it should join with?
    • In the end, we need to perform parts of the query in the mediator, and other parts at the sources
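A binding-pattern annotation such as bff can be checked mechanically before a source is called. The sketch below is a minimal illustration; the function name `can_call` and the way bound attributes are passed in are assumptions, not part of any of the systems discussed.

```python
BOOK_ATTRS = ("auth", "title", "pub")


def can_call(adornment, attrs, supplied):
    """Return True if every attribute marked 'b' (bound) has a value supplied.

    adornment: string of 'b'/'f' flags, one per attribute, e.g. "bff"
    attrs:     attribute names in the same order
    supplied:  set of attribute names the mediator can provide values for
    """
    return all(
        attr in supplied
        for flag, attr in zip(adornment, attrs)
        if flag == "b"
    )


with_author = can_call("bff", BOOK_ATTRS, {"auth"})
title_only = can_call("bff", BOOK_ATTRS, {"title"})
print(with_author, title_only)
```

A query that supplies only a title cannot be sent to this source directly; the mediator would first have to obtain author values from some other source.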
  • 24. Contributions of the Info Manifold
    • More robust way of defining mediated schemas and sources
      • Mediated schema is clearly defined, less likely to change
      • Sources can be more accurately described
    • Relatively efficient algorithms for query reformulation, creating executable plans
    • Still requires standardization on a single schema
      • Can be hard to get consensus
    • Some other aspects were captured in related papers
      • Overlap between sources; coverage of data at sources
      • Semi-automated creation of mappings
      • Semi-automated construction of wrappers
  • 25. Later Integration Systems Focused on Better Performance
    • Tukwila/Piazza [Ives+99,Halevy+02] – Washington
      • Descendants of the Information Manifold
      • Similar capabilities, but with adaptive processing of XML as it is read across streams
    • Niagara [DeWitt+99] – Wisconsin
      • XML querying of web sources
      • Giving answers a screenful at a time
    • TelegraphCQ [Chandrasekaran+03] – Berkeley
      • Adaptive, select-project-join queries over infinite streams