2 rel-algebra


Published on

Published in: Design, Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

2 rel-algebra

  1. 1. Relational Model & Algebra Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 9, 2011 Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan
  2. 2. Recall Our Initial Discussion… <ul><li>There are a variety of ways of representing data, each with trade-offs </li></ul><ul><ul><li>Free text [often need a human] </li></ul></ul><ul><ul><li>Shapes/points in space </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><ul><li>“ Objects” with “properties” </li></ul></ul><ul><li>In general, our emphasis will be on the last item </li></ul><ul><ul><li>… though there are spatial databases, OO databases, text databases, and the like… </li></ul></ul>
  3. 3. The Relational Data Model (1970) <ul><li>Lessons from the Codd paper </li></ul><ul><ul><li>Let’s separate physical implementation from logical </li></ul></ul><ul><ul><li>Model the data independently from how it will be used (accessed, printed, etc.) </li></ul></ul><ul><ul><ul><li>Describe the data minimally and mathematically </li></ul></ul></ul><ul><ul><ul><ul><li>A relation describes an association between data items – tuples with attributes </li></ul></ul></ul></ul><ul><ul><ul><ul><li>We generally think of tables and rows, but that’s somewhat imprecise </li></ul></ul></ul></ul><ul><ul><ul><li>Use standard mathematical (logical) operations over the data – these are the relational algebra or relational calculus </li></ul></ul></ul><ul><ul><ul><li>How does this model relate to objects, properties? What are its abilities and limitations? </li></ul></ul></ul>
  4. 4. Why Did It Take So Many Years to Implement Relational Databases? <ul><li>Codd’s original work: 1969-70 </li></ul><ul><li>Earliest relational database research: ~1976 </li></ul><ul><li>Oracle “2.0”: 1979 </li></ul><ul><li>Why the gap? </li></ul><ul><ul><li>“ You could do the same thing in other ways” </li></ul></ul><ul><ul><li>“ Nobody wants to write math formulas” </li></ul></ul><ul><ul><li>“ Why would I turn my data into tables?” </li></ul></ul><ul><ul><li>“ It won’t perform well” </li></ul></ul><ul><li>What do you think? </li></ul>
  5. 5. Getting More Concrete: Building a Database and Application <ul><li>Start with a conceptual model </li></ul><ul><ul><li>“ On paper” using certain techniques we’ll discuss next week </li></ul></ul><ul><ul><li>We ignore low-level details – focus on logical representation </li></ul></ul><ul><li>Design & implement schema </li></ul><ul><ul><li>Design and codify (in SQL) the relations/tables </li></ul></ul><ul><ul><li>Do physical layout – indexes, etc. </li></ul></ul><ul><li>Import the data </li></ul><ul><li>Write applications using DBMS and other tools </li></ul><ul><ul><li>Many of the hard problems are taken care of by other people (DBMS, API writers, library authors, web server, etc.) </li></ul></ul>
  6. 6. Conceptual Design for CIS Student Course Survey STUDENT COURSE Takes name sid cid name PROFESSOR Teaches semester fid name exp-grade “ Who’s taking what, and what grade do they expect?” This design is independent of the final form of the report!
  7. 7. Example Schema <ul><li>Our focus now: relational schema – set of tables </li></ul><ul><li>Can have other kinds of schemas – XML, object, … </li></ul>STUDENT Takes COURSE PROFESSOR Teaches sid name 1 Jill 2 Qun 3 Nitin fid name 1 Ives 2 Taskar 8 Martin sid exp-grade cid 1 A 550-0109 1 A 520-1009 3 C 500-0109 cid subj sem 550-0109 DB F09 520-1009 AI S09 501-0109 Arch F09 fid cid 1 550-0109 2 520-1009 8 501-0109
  8. 8. Some Terminology <ul><li>Columns of a relation are called attributes or fields </li></ul><ul><li>The number of these columns is the arity of the relation </li></ul><ul><li>The rows of a relation are called tuples </li></ul><ul><li>Each attribute has values taken from a domain , e.g., subj has domain string </li></ul><ul><li>Theoretically: a relation is a set of tuples; no tuple can occur more than once </li></ul><ul><ul><li>Real systems may allow duplicates for efficiency or other reasons – we’ll ignore this for now </li></ul></ul><ul><ul><li>Objects and XML may also have the same content with different “identity” </li></ul></ul>
  9. 9. Describing Relations <ul><li>A schema can be represented many ways </li></ul><ul><li>In relational DBs, we use relation ( attribute:domain ) </li></ul><ul><li>To the DBMS, use data definition language (DDL) – like programming language type definitions </li></ul>STUDENT (sid:int, name:string) Takes (sid:int, exp-grade:char[2], cid:string) COURSE (cid:string, subj:string, sem:char[3]) Teaches (fid:int, cid:string) PROFESSOR (fid:int, name:string)
  10. 10. More on Attribute Domains <ul><li>Relational DBMSs have very limited “built-in” domains: either tables or scalar attributes – int, string, byte sequence, date, etc. </li></ul><ul><li>But more generally: </li></ul><ul><ul><li>We can have “nested relations” </li></ul></ul><ul><ul><li>Object-oriented, object-relational systems allow complex, user-defined domains – lists, classes, etc. </li></ul></ul><ul><ul><li>XML systems allow for XML trees (or lists of trees) that follow certain structural constraints </li></ul></ul><ul><li>Database people, when they are discussing design, often assume domains are evident to the reader: STUDENT (sid, name) </li></ul>
  11. 11. Integrity Constraints <ul><li>Domains and schemas are one form of constraint on a valid data instance </li></ul><ul><li>Other important constraints include: </li></ul><ul><ul><li>Key constraints : </li></ul></ul><ul><ul><ul><li>Subset of fields that uniquely identifies a tuple, and for which no subset of the key has this property </li></ul></ul></ul><ul><ul><ul><li>May have several candidate keys; one is chosen as the primary key </li></ul></ul></ul><ul><ul><ul><li>A superkey is a subset of fields that includes a key </li></ul></ul></ul><ul><ul><li>Inclusion dependencies (referential integrity constraints): </li></ul></ul><ul><ul><ul><li>A field in one relation may refer to a tuple in another relation by including its key </li></ul></ul></ul><ul><ul><ul><li>The referenced tuple must exist in the other relation for the database instance to be valid </li></ul></ul></ul>
  12. 12. SQL: Structured Query Language <ul><li>The standard language for relational data </li></ul><ul><ul><li>Invented by folks at IBM, esp. Don Chamberlin </li></ul></ul><ul><ul><li>Actually not a particularly elegant language… </li></ul></ul><ul><ul><li>Beat a more elegant competing standard, QUEL, from Berkeley </li></ul></ul><ul><li>Separated into a DML (data manipulation language) & DDL </li></ul><ul><li>DML based on relational algebra & (mostly) calculus , which we discuss this week </li></ul><ul><li>Later we’ll see how it’s embedded in a host language </li></ul>
  13. 13. Table Definition: SQL-92 DDL and Constraints CREATE TABLE Takes (sid INTEGER, exp-grade CHAR(2), cid STRING(8), PRIMARY KEY (sid, cid), FOREIGN KEY (sid) REFERENCES STUDENT, FOREIGN KEY (cid) REFERENCES COURSE ) CREATE TABLE STUDENT (sid INTEGER, name CHAR(20), )
  14. 14. Example Data Instance STUDENT Takes COURSE PROFESSOR Teaches sid name 1 Jill 2 Qun 3 Nitin fid name 1 Ives 2 Taskar 8 Martin sid exp-grade cid 1 A 550-0109 1 A 520-1009 3 C 501-0109 cid subj sem 550-0109 DB F09 520-1009 AI S09 501-0109 Arch F09 fid cid 1 550-0109 2 700-1009 8 501-0109
  15. 15. From Tables  SQL  Web Application <ul><li><html> </li></ul><ul><li><body> </li></ul><ul><li><!-- hypotheticalEmbeddedSQL: </li></ul><ul><li>SELECT * FROM STUDENT, Takes, COURSE </li></ul><ul><li>WHERE STUDENT.sid = Takes.sID </li></ul><ul><li>AND Takes.cID = cid </li></ul><ul><li>--> </li></ul><ul><li></body> </li></ul><ul><li></html> </li></ul>C -> machine code sequence -> microprocessor Java -> bytecode sequence -> JVM SQL -> relational algebra expression -> query execution engine
  16. 16. Codd’s Relational Algebra <ul><li>A set of mathematical operators that compose, modify, and combine tuples within different relations </li></ul><ul><li>Relational algebra operations operate on relations and produce relations (“closure”) </li></ul><ul><ul><li>f: Relation  Relation f: Relation x Relation  Relation </li></ul></ul>
  17. 17. Codd’s Logical Operations: The Relational Algebra <ul><li>Six basic operations: </li></ul><ul><ul><li>Projection   (R) </li></ul></ul><ul><ul><li>Selection   (R) </li></ul></ul><ul><ul><li>Union R 1 [ R 2 </li></ul></ul><ul><ul><li>Difference R 1 – R 2 </li></ul></ul><ul><ul><li>Product R 1 £ R 2 </li></ul></ul><ul><ul><li>(Rename)     (R) </li></ul></ul><ul><li>And some other useful ones: </li></ul><ul><ul><li>Join R 1 ⋈  R 2 </li></ul></ul><ul><ul><li>Semijoin R 1 ⋉  R 2 </li></ul></ul><ul><ul><li>Intersection R 1 Å R 2 </li></ul></ul><ul><ul><li>Division R 1 ¥ R 2 </li></ul></ul>
  18. 18. Data Instance for Operator Examples STUDENT Takes COURSE PROFESSOR Teaches sid name 1 Jill 2 Qun 3 Nitin 4 Marty fid name 1 Ives 2 Taskar 8 Martin sid exp-grade cid 1 A 550-0109 1 A 520-1009 3 A 520-1009 3 C 501-0109 4 C 501-0109 cid subj sem 550-0109 DB F09 520-1009 AI S09 501-0109 Arch F09 fid cid 1 550-0109 2 520-1009 8 501-0109
  19. 19. Projection,  
  20. 20. Selection,  
  21. 21. Product X
  22. 22. Join, ⋈  : A Combination of Product and Selection
  23. 23. Union 
  24. 24. Difference –
  25. 25. Rename,     <ul><li>The rename operator can be expressed several ways: </li></ul><ul><ul><li>The book has a very odd definition that’s not algebraic </li></ul></ul><ul><ul><li>An alternate definition: </li></ul></ul><ul><ul><ul><li>    (x) Takes the relation with schema  Returns a relation with the attribute list  </li></ul></ul></ul><ul><ul><li>Rename isn’t all that useful, except if you join a relation with itself </li></ul></ul><ul><ul><ul><li>Why would it be useful here? </li></ul></ul></ul>
  26. 26. Mini-Quiz <ul><li>This completes the basic operations of the relational algebra. We shall soon find out in what sense this is an adequate set of operations. </li></ul><ul><li>Try writing queries for these: </li></ul><ul><ul><li>The names of students named “Bob” </li></ul></ul><ul><ul><li>The names of students expecting an “A” </li></ul></ul><ul><ul><li>The names of students in Milo Martin’s 501 class </li></ul></ul><ul><ul><li>The sids and names of students not enrolled </li></ul></ul>
  27. 27. Deriving Intersection <ul><li>Intersection: as with set operations, derivable from difference </li></ul>A-B B-A A B A Å B ≡ (A [ B) – (A – B) – (B – A) ≡ (A – B) – (B – A)
  28. 28. Division <ul><li>A somewhat messy operation that can be expressed in terms of the operations we have already defined </li></ul><ul><li>Used to express queries such as “The fid's of faculty who have taught all subjects” </li></ul><ul><li>Paraphrased: “The fid’s of professors for which there does not exist a subject that they haven’t taught” </li></ul>
  29. 29. Division: R 1  R 2 <ul><li>Requirement: schema(R 1 ) ¾ schema(R 2 ) </li></ul><ul><li>Result schema: schema(R 1 ) – schema(R 2 ) </li></ul><ul><li>“ Professors who have taught all courses”: </li></ul><ul><li>What about “Courses that have been taught by all faculty”? </li></ul> fid (  fid,subj ( Teaches ⋈ COURSE)   subj (COURSE))
  30. 30. Division Using Our Existing Operators <ul><li>All possible teaching assignments: Allpairs : </li></ul><ul><li>NotTaught , all (fid,subj) pairs for which professor fid has not taught subj : </li></ul><ul><li>Answer is all faculty not in NotTaught : </li></ul> fid,subj (PROFESSOR £  subj (COURSE)) Allpairs -  fid,subj (Teaches ⋈ COURSE)  fid (PROFESSOR) -  fid (NotTaught) ´  fid (PROFESSOR) -  fid (  fid,subj (PROFESSOR £  subj (COURSE)) -  fid,subj (Teaches ⋈ COURSE))
  31. 31. The Big Picture: SQL to Algebra to Query Plan to Web Page SELECT * FROM STUDENT, Takes, COURSE WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid Optimizer Execution Engine Storage Subsystem Web Server / UI / etc Query Plan – an operator tree STUDENT Takes COURSE Merge Hash by cid by cid
  32. 32. Hint of Future Things: Optimization Is Based on Algebraic Equivalences <ul><li>Relational algebra has laws of commutativity, associativity, etc. that imply certain expressions are equivalent in semantics </li></ul><ul><li>They may be different in cost of evaluation! </li></ul> c Ç d (R) ´  c (R) [  d (R)  c (R 1 £ R 2 ) ´ R 1 ⋈ c R 2  c Ç d (R) ´  c (  d (R)) <ul><li>Query optimization finds the most efficient representation to evaluate (or one that’s not bad) </li></ul>
  33. 33. Next Time: An Equivalent, But Very Different, Formalism <ul><li>Codd invented a relational calculus that he proved was equivalent in expressiveness </li></ul><ul><ul><li>Based on a subset of first-order logic – declarative, without an implicit order of evaluation </li></ul></ul><ul><ul><li>More convenient for describing certain things, and for certain kinds of manipulations </li></ul></ul><ul><ul><li>… And, in fact, the basis of SQL! </li></ul></ul>