• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
11/11 Slides

11/11 Slides






Total Views
Views on SlideShare
Embed Views



0 Embeds 0

No embeds



Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

    11/11 Slides 11/11 Slides Presentation Transcript

    • Bridging Different Data Representations Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 11, 2004
    • A Special Type of Query: Conjunctive Queries
      • A single Datalog rule with no “ Ç ,” “ : ,” “ 8 ” can express select, project, and join – a conjunctive query
        • Conjunctive queries are possible to reason about statically
        • (Note that we can write CQ’s in other languages, e.g., SQL!)
      • We know how to “minimize” conjunctive queries
        • An important simplification that can’t be done for general SQL
      • We can test whether one conjunctive query’s answers always contain another conjunctive query’s answers (for ANY instance)
        • Why might this be useful?
    • Example of Containment
      • Suppose we have two queries: q1(S,C) :- Student(S, N), Takes(S, C), Course(C, X), inCIS(C), Course(C, “DB & Info Systems”) q2(S,C) :- Student(S, N), Takes(S, C), Course(C, X)
      • Intuitively, q1 must contain the same or fewer answers vs. q2:
        • It has all of the same conditions, except one extra conjunction (i.e., it’s more restricted )
        • There’s no union or any other way it can add more data
      • We can say that q2 contains q1 because this holds for any instance of our DB {Student, Takes, Course}
    • Wrapping up Datalog…
      • We’ve seen a new language, Datalog
        • It’s basically a glorified DRC with a special feature, recursion
        • It’s much cleaner than SQL for reasoning about
        • … But negation (as in the DRC) poses some challenges
      • We’ve seen that a particular kind of query, the conjunctive query, is written naturally in Datalog
        • Conjunctive queries are possible to reason about
        • We can minimize them, or check containment
        • Conjunctive queries are very commonly used in our next problem, data integration
    • A Problem
      • We’ve seen that even with normalization and the same needs, different people will arrive at different schemas
      • In fact, most people also have different needs!
      • Often people build databases in isolation, then want to share their data
        • Different systems within an enterprise
        • Different information brokers on the Web
        • Scientific collaborators
        • Researchers who want to publish their data for others to use
      • This is the goal of data integration : tie together different sources, controlled by many people, under a common schema
    • Building a Data Integration System
      • Create a middleware “mediator” or “data integration system” over the sources
        • Can be warehoused (a data warehouse) or virtual
        • Presents a uniform query interface and schema
        • Abstracts away multitude of sources; consults them for relevant data
          • Unifies different source data formats (and possibly schemas)
          • Sources are generally autonomous , not designed to be integrated
        • Sources may be local DBs or remote web sources/services
        • Sources may require certain input to return output (e.g., web forms): “ binding patterns ” describe these
    • Typical Data Integration Components Data Integration System / Mediator Mediated Schema Wrapper Wrapper Wrapper Source Relations Mappings in Catalog Source Catalog Query Results
    • Typical Data Integration Architecture Reformulator Query Processor Source Catalog Wrapper Wrapper Wrapper Query Query over sources Source Descrs. Queries + bindings Data in mediated format Results
    • Challenges of Mapping Schemas
      • In a perfect world, it would be easy to match up items from one schema with another
        • Every table would have a similar table in the other schema
        • Every attribute would have an identical attribute in the other schema
        • Every value would clearly map to a value in the other schema
      • Real world: as with human languages, things don’t map clearly!
        • May have different numbers of tables – different decompositions
        • Metadata in one relation may be data in another
        • Values may not exactly correspond
        • It may be unclear whether a value is the same
    • A Few Simple Examples
      • Movie(Title, Year, Director, Editor, Star1, Star2)
      • Movie(Title, Year, Director, Editor, Star1, Star2)
      • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
      • MotionPicture(ID, Title, Year) Participant(ID, Name, Role)
      Ives, Z. 1234 CustName CustID Zachary Ives 46732 EmpName PennID
    • How Do We Relate Schemas?
      • General approach is to use a view to define relations in one schema (typically either the mediated schema or the source schema), given data in the other schema
        • This allows us to “restructure” or “recompose + decompose” our data in a new way
      • We can also define mappings between values in a view
        • We use an intermediate table defining correspondences – a “concordance table”
        • It can be filled in using some type of code, and corrected by hand
    • Mapping Our Examples
      • Movie(Title, Year, Director, Editor, Star1, Star2)
      • Movie(Title, Year, Director, Editor, Star1, Star2)
      • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
      • MotionPicture(ID, Title, Year) Participant(ID, Name, Role)
      PieceOfArt(I, A, S, T, “Movie”) :- Movie(T, Y, A, _, S1, S2), ID = T || Y, S = S1 || S2 Movie(T, Y, D, E, S1, S2) :- MotionPicture(I, T, Y), Participant(I, D, “Dir”), Participant(I, E, “Editor”), Participant(I, S1, “Star1”), Participant(I, S2, “Star2”) T1 T2 Need a concordance table from CustIDs to PennIDs Ives, Z. 1234 CustName CustID Zachary Ives 46732 EmpName PennID
    • Two Important Approaches
      • TSIMMIS [Garcia-Molina+97] – Stanford
        • Focus: semistructured data (OEM), OQL-based language (Lorel)
        • Creates a mediated schema as a view over the sources
        • Spawned a UCSD project called MIX, which led to a company now owned by BEA Systems
        • Other important systems of this vein: Kleisli/K2 @ Penn
      • Information Manifold [Levy+96] – AT&T Research
        • Focus: local-as-view mappings, relational model
        • Sources defined as views over mediated schema
          • Requires a special
        • Spawned Tukwila at Washington, and eventually a company as well
        • Led to peer-to-peer integration approaches (Piazza, etc.)
    • TSIMMIS and Information Manifold
      • Focus: Web-based queryable sources
        • CGI forms, online databases, maybe a few RDBMSs
        • Each needs to be mapped into the system – not as easy as web search – but the benefits are significant vs. query engines
      • A few parenthetical notes:
        • Part of a slew of works on wrappers, source profiling, etc.
        • The creation of mappings can be partly automated – systems such as LSD, Cupid, Clio, … do this
        • Today most people look at integrating large enterprises (that’s where the $$$ is!) – Nimble, BEA, IBM
      • “The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew
      • An instance of a “global-as-view” mediation system
      • One of the first systems to support semi-structured data, which predated XML by several years
    • Semi-structured Data: OEM
      • Observation: given a particular schema, its attributes may be unavailable from certain sources – inherent irregularity
      • Proposal: Object Exchange Model, OEM
          • OID: <label, type, value>
      • … How does it relate to XML?
      • … What problems does OEM solve, and not solve, in a heterogeneous system?
    • OEM Example Show this XML fragment in OEM: <book> <author>Bernstein</author> <author>Newcomer</author> <title>Principles of TP</title> </book> <book> <author>Chamberlin</author> <title>DB2 UDB</title> </book>
    • Queries in TSIMMIS
      • Specified in OQL-style language called Lorel
        • OQL was an object-oriented query language
        • Lorel is, in many ways, a predecessor to XQuery
      • Based on path expressions over OEM structures:
      • select book where book.title = “DB2 UDB” and book.author = “Chamberlin”
      • This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Previous query restated =
        • for $b in AllData()/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b
    • Query Answering in TSIMMIS
      • Basically, it’s view unfolding , i.e., composing a query with a view
        • The query is the one being asked
        • The views are the MSL templates for the wrappers
        • Some of the views may actually require parameters, e.g., an author name, before they’ll return answers
          • Common for web forms (see Amazon, Google, …)
          • XQuery functions (XQuery’s version of views) support parameters as well, so we’ll see these in action
    • A Wrapper Definition in MSL
      • Wrappers have templates and binding patterns ($X) in MSL:
      • B :- B: <book {<author $X>}> // $$ = “select * from book where author=“ $X //
        • This reformats a SQL query over Book(author, year, title)
      • In XQuery, this might look like:
        • define function GetBook($x AS xsd:string) as book { for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x +”’”) return <book>{$b/title}<author>$x</author></book>
        • }
      book title author … … … The union of GetBook’s results is unioned with others to form the view AllData()
    • How to Answer the Query
      • Given our query:
        • for $b in AllData()/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b
      • Find all wrapper definitions that:
        • Contain output enough “structure” to match the conditions of the query
        • Or have already tested the conditions for us!
    • Query Composition with Views
      • We find all views that define book with author and title, and we compose the query with each:
        • define function GetBook($x AS xsd:string) as book { for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x + “’”) return <book> {$b/ title } <author> {$x}</author></book>
        • }
        • for $b in AllData()/ book where $b/ title /text() = “DB2 UDB” and $b/ author /text() = “Chamberlin” return $b
      book title author … …
    • Matching View Output to Our Query’s Conditions
      • Determine that $b/book/author/text()  $x by matching the pattern on the function’s output:
        • define function GetBook( $x AS xsd:string) as book { for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x + “’”) return <book>{ $b/title } <author>{$x}</author> </book>
        • }
        • let $x := “Chamberlin” for $b in GetBook($x)/book where $b/title/text() = “DB2 UDB” return $b
      book title author … …
    • The Final Step: Unfolding
        • let $x := “Chamberlin” for $b in ( for $b’ in sql(“Amazon.com”, “select * from book where author=‘” + $x + “’”) return <book>{ $b/title } <author>{$x}</author></book>
        • )/book where $b/title/text() = “DB2 UDB” return $b
      • How do we simplify further to get to here?
        • for $b in sql(“Amazon.com”, “select * from book where author=‘Chamberlin’”) where $b/title/text() = “DB2 UDB” return $b
    • Virtues of TSIMMIS
      • Early adopter of semistructured data, greatly predating XML
        • Can support data from many different kinds of sources
        • Obviously, doesn’t fully solve heterogeneity problem
      • Presents a mediated schema that is the union of multiple views
        • Query answering based on view unfolding
      • Easily composed in a hierarchy of mediators
    • Limitations of TSIMMIS’ Approach
      • Some data sources may contain data with certain ranges or properties
        • “ Books by Aho”, “Students at UPenn”, …
        • If we ask a query for students at Columbia, don’t want to bother querying students at Penn…
        • How do we express these?
      • Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema
    • An Alternate Approach: The Information Manifold (Levy et al.)
      • When you integrate something, you have some conceptual model of the integrated domain
        • Define that as a basic frame of reference, everything else as a view over it
        • “ Local as View”
      • May have overlapping/incomplete sources
        • Define each source as the subset of a query over the mediated schema
        • We can use selection or join predicates to specify that a source contains a range of values :
          • ComputerBooks(…)  Books(Title, …, Subj), Subj = “Computers”
    • The Local-as-View Model
      • The basic model is the following:
        • “Local” sources are views over the mediated schema
        • Sources have the data – mediated schema is virtual
        • Sources may not have all the data from the domain – “open-world assumption”
      • The system must use the sources (views) to answer queries over the mediated schema
    • Answering Queries Using Views
      • Assumption: conjunctive queries , set semantics
        • Suppose we have a mediated schema: author(aID, isbn, year), book(isbn, title, publisher)
        • A conjunctive query might be: q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
      • Recall intuitions about this class of queries:
        • Adding a conjunct to a query removes answers from the result but never adds any
        • Any conjunctive query with at least the same constraints & conjuncts will give valid answers
    • Query Answering
      • Suppose we have the query:
      • q(a, t, p) :- author(a, i, _), book(i, t, p)
      • and sources:
            • s1(a,t)  author(a, i, _), book(i, t, p), t = “123”
            • s5(a,i)  author(a, i, _)
            • s6(i,p)  book(i, t, p)
      • We want to compose the query with the source mappings – but they’re in the wrong direction!
    • Inverse Rules
      • We can take every mapping and “invert” it, though sometimes we may have insufficient information:
        • If
            • s5(a,i)  author(a, i, _)
        • then we can also infer that:
            • author(a, i, ??? )  s5(a,i)
      • But how to handle the absence of the 3 rd (publisher) attribute?
        • We know that there must be AT LEAST one instance of ??? in author for each (a,i) pair
        • So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)…
    • But NULLs Lose Information
      • Suppose we take these rules and ask for:
            • q(a,t) :- author(a, i, _), book(i, t, p)
      • If we look at the rule:
            • s1(a,t)  author(a, i, _), book(i, t, p), t = “123”
      • Clearly q(a,t)  s1(a,t)
      • But if apply our inversion procedure, we get:
            • author(a, NULL , NULL)  s1(a,t)
            • book( NULL , t, p)  s1(a,t), t = “123”
      • and there’s no way to kow to join author and book on NULL!
        • We need “a special NULL for each a-t combo” so we can figure out which a’s and t’s go together
    • The Solution: “Skolem Functions”
      • Skolem functions:
        • Conceptual “perfect” hash functions
        • Each function returns a unique, deterministic value for each combination of input values
        • Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values)
      • Skolem functions won’t ever be part of the answer set or the computation – it doesn’t produce real values
        • They’re just a way of logically generating “special NULLs”
    • Revisiting Our Example
      • Query:
            • q(a,t) :- author(a, i, _), book(i, t, p)
      • Mapping rule:
            • s1(a,t)  author(a, i, _), book(i, t, p), t = “123”
      • Inverse rules:
            • author(a, f(a,t) , NULL)  s1(a,t)
            • book( f(a,t) , t, p)  s1(a,t), t = “123”
      • Expand the query as follows:
        • q(a,t) :- author(a, i, NULL), book(i, t, p), i = f(a,t)
        • q(a,t) :- s1(a,t), s1(a,t), t = “123”, i = f(a,t)
    • Query Answering Using Inverse Rules
      • Invert all rules using the procedures described
      • Take the query and the possible rule expansions and execute them in a Datalog interpreter
        • In the previous query, we expand with all combinations of expansions of book and of author – every possible way of combining and cross-correlating info from different sources
        • Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent)
      • More efficient, but equivalent, algorithms now exist:
        • Bucket algorithm [Levy et al.]
        • MiniCon [Pottinger & Halevy]
        • Also related: “chase and backchase” [Popa, Tannen, Deutsch]
    • Summary of Data Integration
      • Local-as-view integration has replaced global-as-view as the standard
        • More robust way of defining mediated schemas and sources
        • Mediated schema is clearly defined, less likely to change
        • Sources can be more accurately described
      • Methods exist for query reformulation, including inverse rules
      • Integration requires standardization on a single schema
        • Can be hard to get consensus
        • Today we have peer-to-peer data integration, e.g., Piazza [Halevy et al.], Orchestra [Ives et al.], Hyperion [Miller et al.]
      • Some other aspects of integration were addressed in related papers
        • Overlap between sources; coverage of data at sources
        • Semi-automated creation of mappings and wrappers
      • Data integration capabilities in commercial products: BEA’s Liquid Data, IBM’s DB2 Information Integrator, numerous packages from middleware companies