3 f6 8_databases
Published in: Technology, Real Estate

Transcript

  • 1. Databases
    Elena Punskaya, op205@cam.ac.uk
  • 2. Big Data
    • Facebook stores over 100 petabytes of media (photos and videos) uploaded by its 845 million users
    • There are 762 billion objects stored in Amazon S3, which processes over 500,000 requests per second for these objects at peak times
    Image credits: Bryce Durbin, Techcrunch; aws.typepad.com; Amazon
    © 2012 Elena Punskaya, Cambridge University Engineering Department
  • 3. Big Data
    • Storing large amounts of data requires managing complexity:
      - mapping the real world to data
      - providing concurrent access for creating, reading and changing the data
      - providing distributed access to, and storage of, the data
    • Database Management Systems decouple the business logic of applications working with data from the details of physical storage and transaction (operations on data) management
    • Any non-trivial system needs to store its application data:
      - user/password
      - credit cards
      - product information
      - health records ...
    • It is possible to store all data directly as files, but a typical filesystem isn't built for transaction management and high performance
    Image: Cambridge High Performance Computing Cluster Darwin
  • 4. Transaction Processing
    • Databases are a common component of many distributed systems. They store records for a large number of distinct entities and will typically support a small set of operations to access and manipulate those entities. These operations can be assumed to be atomic, i.e. they cannot be interrupted.
    • External clients execute transactions, which are sequences of operations applied to one or more database entities, designed to achieve a single logical effect.
    [Diagram: Client A and Client B submit transactions TA and TB to the Transaction Manager, which performs atomic operations on the Database; a Recovery Log is also shown]
    • The transaction manager ensures that transactions appear atomic to clients. The client receives an acknowledgement of every successful transaction.
  • 5. Example: Bank Transfer
    • Each account is represented by a different database object, which guarantees that each operation is atomic

        class Account {
            // link to required account records
            DBaseAccessInfo dbinfo;
        public:
            // Constructor - open an account
            Account(string account_name);
            // Atomic operations
            void debit(float amount);
            void credit(float amount);
            float read_balance();
        };

        // A typical transaction would be
        void transfer(Account& A, Account& B, float amount) {
            float balance = A.read_balance();
            if (balance >= amount) {
                A.debit(amount);
                B.credit(amount);
            }
        }

    • A key issue is what happens if there is a failure part-way through the transaction
  • 6. System Crash
    • What happens if the system crashes in the middle of a transaction?

        // a typical transaction
        void transfer(Account& A, Account& B, float amount) {
            float balance = A.read_balance();
            if (balance >= amount) {
                A.debit(amount);   // <----- CRASH!

    • Account A will have had its money debited, but the money will never appear in account B: an invalid state
    • The transaction manager (or any transaction processing system) must have a means of recovering from errors, and must always leave the system in a valid state
    • Need to ensure that Credit/Debit is ATOMIC, i.e. can only be performed as a WHOLE, not in parts
  • 7. ACID
    • A transaction may fail in many different ways (e.g. two clients try to access the same entity at the same time, temporary network failure, software fault, disk crash, etc). The transaction processor tries to ensure that transactions have the following properties:
    • Atomicity - Either all or none of the transaction's operations are performed
    • Consistency - Transactions transform the system from one consistent state to another
    • Isolation - An incomplete transaction cannot reveal its result to other transactions before it is complete
    • Durability - Once the transaction is committed, the system must guarantee that the results of its operations will persist, even if there are subsequent system failures
  • 8. Recovery
    • In order to maintain the ACID properties, a transaction processor must be able to recover from errors by restoring the system to a consistent state.
    • To achieve this, transactions are modelled on a state machine (shown as a diagram on the slide).

        // A typical transaction with commit
        void transfer(Account& A, Account& B, float amount) {
            // Record transaction start
            int id = BeginTransaction();
            try {
                float balance = A.read_balance();
                if (balance >= amount) {
                    A.debit(amount);
                    B.credit(amount);
                }
                // transaction success, so commit (finish)
                Commit(id);
            } catch (...) {
                // transaction failed, recover/revert
                Abort(id);
            }
        }

    Slide annotation, with an arrow pointing into the code: "Transaction processor might invalidate this (see last slide)"
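The begin/commit/abort pattern above can be tried out with SQLite. What follows is only a minimal sketch, not the transaction manager from the slides: it assumes a hypothetical `account` table with `name` and `balance` columns, and a `fail_midway` flag invented here to simulate a crash between the debit and the credit. Python's `sqlite3` connection, used as a context manager, commits if the block finishes and rolls back if an exception escapes it.

```python
import sqlite3

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst atomically; return True if committed."""
    try:
        with conn:  # commit on success, roll back if an exception escapes
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)).fetchone()
            if balance < amount:
                return False
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
            if fail_midway:  # hypothetical switch: simulate a crash after the debit
                raise RuntimeError("simulated crash after the debit")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except RuntimeError:
        return False  # the debit has been rolled back

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100.0), ("B", 0.0)])
conn.commit()

ok_failed = transfer(conn, "A", "B", 30.0, fail_midway=True)
after_failure = balances(conn)   # unchanged: the partial debit was undone
ok_success = transfer(conn, "A", "B", 30.0)
after_success = balances(conn)
```

The failed attempt leaves both balances untouched, which is the atomicity property: the debit never becomes visible without the matching credit.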
  • 9. Concurrency
    • In practice, a database transaction processor will be receiving a stream of transaction requests, and will need to execute transactions in parallel in order to provide acceptable response times.
    • When two transactions reference the same account, uncontrolled interleaving of operations can produce an incorrect result. There are three classes of concurrency problem.
    • The uncommitted dependency problem

        Time   Transaction 1   Transaction 2
        t1     –               A.write()
        t2     A.read()        –
        t3     –               abort()

    • In this case, transaction 1 reads an updated account value, but transaction 2 aborts, undoing the effect of the update. Transaction 1 is then left holding an incorrect account value.
    Note: A.read() indicates any operation which reads a value from ac-
  • 10. Concurrency
    • The lost update problem

        Time   Transaction 1   Transaction 2
        t1     A.read()        –
        t2     –               A.read()
        t3     A.write()       –
        t4     –               A.write()

    • In this case, the change made to account A at t3 by transaction 1 is lost because it is overwritten at time t4 by transaction 2.
    • The inconsistent analysis problem

        Time   Transaction 1   Transaction 2
        t1     A.read()        –
        t2     –               A.read()
        t3     –               A.write()
        t4     –               commit()

    • In this case, transaction 2 updates account A after transaction 1 has read its value. Hence, transaction 1 is left holding an incorrect value for account A.
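The lost update interleaving can be replayed step by step. This is a toy sketch in plain Python rather than a real database: the starting balance of 100 and the two deposits of 10 and 20 are made-up numbers, and each line stands for one row of the t1 to t4 timetable.

```python
account = {"A": 100}

# Interleaved schedule: both transactions read before either writes.
t1_seen = account["A"]           # t1: transaction 1 reads 100
t2_seen = account["A"]           # t2: transaction 2 reads 100
account["A"] = t1_seen + 10      # t3: transaction 1 writes 110
account["A"] = t2_seen + 20      # t4: transaction 2 writes 120, overwriting t3
lost_update_result = account["A"]

# The same two deposits run one after the other, for comparison.
account["A"] = 100
account["A"] = account["A"] + 10
account["A"] = account["A"] + 20
serial_result = account["A"]
```

The interleaved schedule ends with 120 while the serial one ends with 130: the deposit made by transaction 1 has disappeared.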
  • 11. Managing Concurrency
    • The problems discussed can be managed by applying Pessimistic or Optimistic concurrency control
    • Pessimistic
      - When a transaction wishes to access an account, it first secures a lock on that account; when it has finished, it releases the lock. If a lock is already taken, the transaction must wait until it is released.
      - Locking could be on the whole table or a single row, and could be declared at different levels of exclusivity (e.g. no one else can access the data, or some access is allowed)
      - Could cause deadlocks, e.g. T1 and T2 require two resources R1 and R2 to proceed:
        ‣ T1 holds R1 and is waiting for R2
        ‣ T2 holds R2 and is waiting for R1
      - Useful when there is a lot of data that is often updated by many users
    • Optimistic
      - Allows uncontrolled access to accounts, and then simply aborts any transactions which might have suffered a conflict
      - Implemented by creating a new copy of the data that may be updated and, when the update is completed, checking whether the master copy has changed in the meantime
        ‣ if changed: the transaction is aborted
        ‣ if not: the transaction completes
      - Useful when most operations are reading data and changes occur rarely
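One common way to realise the optimistic scheme is a version counter on each row. The sketch below assumes that approach (the slides only say that the master copy is checked for changes); the `version` column and the helper names `read` and `write_if_unchanged` are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 100, 0)")
conn.commit()

def read(conn, name):
    """Return (balance, version) as seen by a transaction's private copy."""
    return conn.execute(
        "SELECT balance, version FROM account WHERE name = ?", (name,)).fetchone()

def write_if_unchanged(conn, name, new_balance, seen_version):
    """Apply the update only if nobody changed the row since it was read."""
    cursor = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE name = ? AND version = ?",
        (new_balance, name, seen_version))
    conn.commit()
    return cursor.rowcount == 1  # False means a conflict: abort and retry

b1, v1 = read(conn, "A")   # transaction 1 takes its copy
b2, v2 = read(conn, "A")   # transaction 2 takes its copy
first = write_if_unchanged(conn, "A", b1 + 10, v1)    # succeeds
second = write_if_unchanged(conn, "A", b2 + 20, v2)   # stale version: rejected
final_balance, final_version = read(conn, "A")
```

The second writer is rejected instead of silently overwriting the first, so the lost update cannot happen; the aborted transaction would normally re-read and try again.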
  • 12. Relational Databases
    • By the late 1960s, the "Software Crisis" had already been declared, and data storage wasn't doing much better:
        Computer calculations cost hundreds of dollars a minute, so great human effort was spent to make programs as efficient as possible before they were run. Early databases used either a rigid hierarchical structure or a complex navigational plan of pointers to the physical locations of the data on magnetic tapes. Teams of programmers were needed to express queries to extract meaningful information. While such databases could be efficient in handling the specific data and queries they were designed for, they were absolutely inflexible. New types of queries required complex reprogramming, and adding new types of data forced a total redesign of the database itself.
        IBM Research News, www.research.ibm.com/resources/news/20030423_edgarpassaway.shtml
    • In 1970, Edgar Codd, an English mathematician working for IBM, published the paper "A Relational Model of Data for Large Shared Data Banks". It started with the following words: "Future users of large data banks must be protected from having to know how the data is organised in the machine..."
    Image: Edgar "Ted" Codd, 1923-2003, image © IBM
  • 13. Relational Databases
    • Codd suggested moving away from the hierarchical or navigational structure of early databases to simple tables with rows and columns
    • Based on Relational Algebra, this approach greatly simplified database queries (the ability to access and analyse data)
    • Many relational database management systems (RDBMS) are accessed using SQL (Structured Query Language)
      - SQL is defined by industry standards and has been developed over many revisions, from SQL-87 to SQL 2008
    • There are many free and commercial databases available:
      - Free: PostgreSQL, MySQL, SQLite...
      - Commercial: Oracle, DB2, SQL Server...
    • SQLite is the easiest database to start using, as it requires no setup and is available on the teaching system.
      - Type: sqlite3 <db-name>
      - Then enter SQL commands, each followed by a ';'. The database will be stored in a file called <db-name>, which will be created if it does not already exist.
  • 14. The Relational Model
    • The relational model is related to set theory. A relation is a table. A relation contains a set of tuples (rows).

        course (relation)
        scheme   Title                        Leader       Lectures
        t1       RISC Processors              Sanchez      8
        t2       QAM for modems               Sanchez      34
        t3       Introduction to Mainframes   Belford      20
        t4       Fast refresh LCDs            Richard      1
        t5       t5[Title]                    t5[Leader]   t5[Lectures]
        t6       t6[Title, Leader]

    • The meaning of the data is described by the scheme, which is a set of column names. Column names are known as attributes.
        course scheme = (Title, Leader, Lectures)
    • There is no ordering or grouping of attributes. The table is a relation over this scheme. A relation $r$ over a scheme $R$ is written as $r(R)$. Each column has a domain, $D$. So:
        $D_{\text{Title}} = \text{strings}$, $D_{\text{Lectures}} = \mathbb{Z}^+$
    • So each element $t_i[j] \in D_j$ and $t_i \in D_1 \times D_2 \times \cdots \times D_n$
    • For example, for the scheme $(x, y)$ with $D_x = D_y = \mathbb{R}$, the domain of the tuples is the domain of all two dimensional vectors.
  • 15. The Relational Model
    • SQL (the column types give the domains; the CHECK clause is a constraint):

        CREATE TABLE course (Title text, Leader text, Lectures int, CHECK(Lectures > 0))
        INSERT INTO course VALUES ('RISC Processors', 'Sanchez', 10)
        UPDATE course SET Lectures=8 WHERE Leader='Sanchez'
        DELETE FROM course WHERE Lectures=8 AND Leader='Sanchez'
        DROP TABLE course

    • SQL allows the domain of tuples to be restricted: $D_t \subseteq D_1 \times D_2 \times \cdots \times D_n$.
    • Built on principles of Relational Algebra
      - Projection, Selection, Union, Intersection, Subtraction, Join
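The statements on this slide can be run as they stand through Python's `sqlite3` module. The sketch below follows the slide's sequence and adds one extra insert to show the CHECK constraint rejecting a tuple outside the allowed domain; the 'Empty course' row is a made-up value used only to trigger the constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE course (Title text, Leader text, Lectures int, CHECK(Lectures > 0))")
conn.execute("INSERT INTO course VALUES ('RISC Processors', 'Sanchez', 10)")
conn.execute("UPDATE course SET Lectures=8 WHERE Leader='Sanchez'")
rows_after_update = conn.execute("SELECT * FROM course").fetchall()

# A course with zero lectures violates CHECK(Lectures > 0).
try:
    conn.execute("INSERT INTO course VALUES ('Empty course', 'Nobody', 0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

conn.execute("DELETE FROM course WHERE Lectures=8 AND Leader='Sanchez'")
rows_after_delete = conn.execute("SELECT * FROM course").fetchall()
```

SQLite reports the violated constraint as an `IntegrityError`, so the invalid tuple never enters the relation.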
  • 16. Relational algebra: Projection Π
    • The projection operator, $\Pi$, removes columns by listing the ones to be retained. The operator is written as: $\Pi_{\text{column1},\text{column2},\ldots}(\text{relation})$.
    • An example of applying projection is:

        $\Pi_{\text{Leader},\text{Lectures}}(\text{course})$ =
        Leader    Lectures
        Sanchez   8
        Sanchez   34
        Belford   20
        Richard   1

    • Consider a relation $r(R)$ where $R = (x, y, z)$ and $x, y, z \in \mathbb{R}$. Each row represents a 3D vector. The relation $\Pi_{x,y}(r)$ contains the projection of the vectors onto the $x, y$ plane.
  • 17. Relational algebra: Projection Π
    • In SQL, the SELECT statement performs all of the primitive relational algebra functionality. The projection above is rendered as:
        SELECT Leader, Lectures FROM course
    • The general form being:
        SELECT Col1[, Col2[, ...]] FROM table
    • Note that SQL is not entirely relational, and the expression:
        SELECT Leader FROM course
      has duplicate rows. To remove duplicates, use:
        SELECT DISTINCT Leader FROM course
    • There is a shorthand for the identity projection:
        SELECT * FROM table
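The difference between SELECT and SELECT DISTINCT is easy to see on the course table. This is a small sketch using an in-memory SQLite database filled with the four course rows from the earlier slide; the Python variable names are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (Title text, Leader text, Lectures int)")
conn.executemany("INSERT INTO course VALUES (?, ?, ?)", [
    ("RISC Processors", "Sanchez", 8),
    ("QAM for modems", "Sanchez", 34),
    ("Introduction to Mainframes", "Belford", 20),
    ("Fast refresh LCDs", "Richard", 1),
])

# Projection onto (Leader, Lectures)
projection = conn.execute("SELECT Leader, Lectures FROM course").fetchall()

# Plain SELECT keeps duplicates; DISTINCT gives the set-like result.
leaders = [row[0] for row in conn.execute("SELECT Leader FROM course")]
distinct_leaders = [row[0] for row in conn.execute("SELECT DISTINCT Leader FROM course")]

# Identity projection: every column, in declaration order.
column_names = [d[0] for d in conn.execute("SELECT * FROM course").description]
```

Here `leaders` has four entries with Sanchez twice, while `distinct_leaders` has three, which is what a true relational projection would return.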
  • 18. Relational algebra: Selection σ
    • The selection operator accepts a predicate, $\Theta$, and a relation. Rows matching the predicate are retained:

        $\sigma_{\text{Leader}=\text{"Sanchez"}}(\text{course})$ =
        Title             Leader    Lectures
        RISC Processors   Sanchez   8
        QAM for modems    Sanchez   34

    • The general form of the resulting relation can be written in set builder notation:
        $\sigma_\Theta(r) = \{\, t \mid t \in r,\ \Theta(t) \,\}$
    • That is, the result consists of all tuples $t$ such that each tuple is both in the relation $r$ and satisfies the predicate, i.e. $\Theta(t)$ is true.
    • In SQL, selection is also performed with the SELECT statement, with the predicate being specified by the WHERE clause:
        SELECT * FROM course WHERE Leader='Sanchez'
  • 19. Relational algebra: Selection σ
    • Predicates can contain expressions involving any or all of the columns. SQL has more or less the same set of numeric operators as C, and also AND, OR, NOT, BETWEEN:
        SELECT * FROM course WHERE Lectures BETWEEN 2 AND 10
      and IN:
        WHERE Leader IN ('Belford', 'Richard')
    • Projection and selection can be readily composed, so in general:
        $\Pi_S(\sigma_\Theta(r))$ translates to SELECT S FROM r WHERE Θ
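To make the algebra itself concrete, selection and projection can also be written directly over a list of tuples, without SQL. The sketch below is one possible encoding, assuming each tuple is a Python dict keyed by attribute name; the function names `select` and `project` are chosen here for illustration.

```python
course = [
    {"Title": "RISC Processors", "Leader": "Sanchez", "Lectures": 8},
    {"Title": "QAM for modems", "Leader": "Sanchez", "Lectures": 34},
    {"Title": "Introduction to Mainframes", "Leader": "Belford", "Lectures": 20},
    {"Title": "Fast refresh LCDs", "Leader": "Richard", "Lectures": 1},
]

def select(relation, predicate):
    """sigma_Theta(r): keep the tuples t in r for which Theta(t) is true."""
    return [t for t in relation if predicate(t)]

def project(relation, attributes):
    """Pi_S(r): keep only the attributes in S, dropping duplicate tuples."""
    seen, result = set(), []
    for t in relation:
        key = tuple(t[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

# Pi_Title(sigma_Leader="Sanchez"(course)),
# i.e. SELECT Title FROM course WHERE Leader='Sanchez'
titles = project(select(course, lambda t: t["Leader"] == "Sanchez"), ["Title"])

# sigma with a BETWEEN-style predicate
short_courses = select(course, lambda t: 2 <= t["Lectures"] <= 10)
```

Because `project` removes duplicates, it behaves like SELECT DISTINCT rather than plain SELECT.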
  • 20. Union, intersection, subtraction
    • In SQL, union, intersection and subtraction behave much more like set theory than relational algebra. For these operations it is the order of the attributes, not the names of the attributes, which has significance.
    • Set union, $\cup$, aggregates the rows of two sets together. If there are two relations, $r(R)$ and $s(R)$, then the union $r \cup s$ can be computed:
        SELECT * FROM r UNION SELECT * FROM s
    • Likewise, intersection can be computed using:
        SELECT * FROM r INTERSECT SELECT * FROM s
    • Set differencing is either MINUS or EXCEPT, depending on the database:
        SELECT * FROM r EXCEPT SELECT * FROM s
    [Venn diagram: r, s and r − s]
    • Since ordering, not naming, matters, with the schemes R=(a,b), S=(b,a) and the tables r(R), s(S):

        r          s          r − s
        a   b      b   a      a   b
        1   2      3   5      3   4
        3   4      1   2
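The positional behaviour can be checked in SQLite. This is a sketch using the values of the r and s example as read from the slide; the final query, which lines the columns up by name, is an addition for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (a int, b int)")
conn.execute("CREATE TABLE s (b int, a int)")   # same attributes, other order
conn.executemany("INSERT INTO r VALUES (?, ?)", [(1, 2), (3, 4)])
conn.executemany("INSERT INTO s VALUES (?, ?)", [(3, 5), (1, 2)])

union = set(conn.execute("SELECT * FROM r UNION SELECT * FROM s"))
intersection = set(conn.execute("SELECT * FROM r INTERSECT SELECT * FROM s"))
difference = set(conn.execute("SELECT * FROM r EXCEPT SELECT * FROM s"))

# Matching the columns by name instead gives a different answer.
difference_by_name = set(
    conn.execute("SELECT a, b FROM r EXCEPT SELECT a, b FROM s"))
```

With `SELECT *` the row (1, 2) of s cancels the row (1, 2) of r even though its values belong to b and a respectively; naming the columns explicitly avoids that surprise.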
  • 21. Join / cartesian product ×
    • The cartesian product is the only primitive operator which combines two tables with different schemes. Joining two relations, $a \times b$, generates a new relation with every row in $a$ paired with every row in $b$. Joining is very useful for extracting related information.
    • Joining students and labs:

        students                   labs
        Student   Supervisor       Lab     Demonstrator
        Gibson    Sanchez          3F27    Cook
        Murphy    Belford          3F89    Libby
        Libby     Goldstein        4F185   Margo
        Cook      Sanchez          3F34    Ray

    • The table students × labs is on the next slide. Note that the attributes get augmented with the table name to avoid ambiguity. The table name may be omitted if it is not ambiguous. SQL:
        SELECT * FROM students, labs
    • Find all students of "Sanchez" who are demonstrating:
        $\Pi_{\text{Student}}(\sigma_{\text{Student}=\text{Demonstrator} \wedge \text{Supervisor}=\text{"Sanchez"}}(\text{students} \times \text{labs}))$
        SELECT Student FROM students, labs WHERE Student=Demonstrator AND Supervisor='Sanchez'
    • The result is Cook.
  • 22. Join / cartesian product ×

        students × labs
        students.Student   students.Supervisor   labs.Lab   labs.Demonstrator
        Gibson             Sanchez               3F27       Cook
        Gibson             Sanchez               3F89       Libby
        Gibson             Sanchez               4F185      Margo
        Gibson             Sanchez               3F34       Ray
        Murphy             Belford               3F27       Cook
        Murphy             Belford               3F89       Libby
        Murphy             Belford               4F185      Margo
        Murphy             Belford               3F34       Ray
        Libby              Goldstein             3F27       Cook
        Libby              Goldstein             3F89       Libby
        Libby              Goldstein             4F185      Margo
        Libby              Goldstein             3F34       Ray
        Cook               Sanchez               3F27       Cook
        Cook               Sanchez               3F89       Libby
        Cook               Sanchez               4F185      Margo
        Cook               Sanchez               3F34       Ray

    • Selection is often composed with joining, so the combination is given its own non-primitive operator, the theta join:
        $a \bowtie_\Theta b \equiv \sigma_\Theta(a \times b)$
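The product and the theta join built on it can be reproduced with the students and labs tables. This is a sketch using SQLite from Python; it only restates the queries given on the slides and counts the rows they return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (Student text, Supervisor text)")
conn.execute("CREATE TABLE labs (Lab text, Demonstrator text)")
conn.executemany("INSERT INTO students VALUES (?, ?)", [
    ("Gibson", "Sanchez"), ("Murphy", "Belford"),
    ("Libby", "Goldstein"), ("Cook", "Sanchez"),
])
conn.executemany("INSERT INTO labs VALUES (?, ?)", [
    ("3F27", "Cook"), ("3F89", "Libby"), ("4F185", "Margo"), ("3F34", "Ray"),
])

# Cartesian product: every student row paired with every lab row.
product = conn.execute("SELECT * FROM students, labs").fetchall()

# Theta join followed by projection: students of Sanchez who demonstrate.
demonstrating = [row[0] for row in conn.execute(
    "SELECT Student FROM students, labs "
    "WHERE Student = Demonstrator AND Supervisor = 'Sanchez'")]
```

Four students and four labs give 16 product rows, of which the predicate keeps exactly one.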
  • 23. Natural Join
    • A 'natural join' is a join followed by some selection and projection:
      - Perform a join
      - Perform selection so that attributes with the same name must be equal
      - Perform projection to remove duplicated attributes
    • Note that there are no attribute ambiguities.
    • If attributes with the same name are semantically the same, then the natural join is usually the correct kind of join to use. In addition to the 'labs' table, we also have a table listing lab sessions:

        sessions
        Lab     Title
        3F27    Mainframe filesystems
        3F27    Filesystem security
        3F89    Large vehicle control
        4F185   Networks for finance systems
        3F34    Magnetic storage forensics

    • The natural join matches up the shared attributes:

        $\text{sessions} \bowtie \text{labs}$ =
        Demonstrator   Lab     Title
        Cook           3F27    Filesystem security
        Cook           3F27    Mainframe filesystems
        Libby          3F89    Large vehicle control
        Margo          4F185   Networks for finance systems
        Ray            3F34    Magnetic storage forensics
  • 24. Natural Join
    • More formally: there are two relations $r(R)$ and $s(S)$. The set of shared attributes is $A$:
        $A = \{A_1, \cdots, A_n\} = R \cap S$
      where $n = |A|$. The set of all attributes with no duplicates is $R \cup S$.
    • The natural join is therefore:
        $r \bowtie s \equiv \Pi_{R \cup S}\, \sigma_{r.A_1 = s.A_1 \wedge \cdots \wedge r.A_n = s.A_n}(r \times s)$
    • In SQL, natural joins are performed with NATURAL JOIN:
        SELECT * FROM sessions NATURAL JOIN labs
    • In practice, you will usually design databases by considering the type of data, how it is stored in tables and how to extract the relevant information. Relational algebra will not crop up much in day-to-day design, but it is essential for understanding how the various operations in a relational database work.
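The formal definition can be compared against NATURAL JOIN directly. The sketch below loads the sessions and labs tables into SQLite and checks that the built-in natural join returns the same rows as the product, selection and projection spelled out by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (Lab text, Title text)")
conn.execute("CREATE TABLE labs (Lab text, Demonstrator text)")
conn.executemany("INSERT INTO sessions VALUES (?, ?)", [
    ("3F27", "Mainframe filesystems"),
    ("3F27", "Filesystem security"),
    ("3F89", "Large vehicle control"),
    ("4F185", "Networks for finance systems"),
    ("3F34", "Magnetic storage forensics"),
])
conn.executemany("INSERT INTO labs VALUES (?, ?)", [
    ("3F27", "Cook"), ("3F89", "Libby"), ("4F185", "Margo"), ("3F34", "Ray"),
])

# The shared attribute Lab appears only once in the natural join.
cursor = conn.execute("SELECT * FROM sessions NATURAL JOIN labs")
natural_columns = [d[0] for d in cursor.description]

natural = set(conn.execute(
    "SELECT Demonstrator, Lab, Title FROM sessions NATURAL JOIN labs"))

# The definition by hand: product, then selection on the shared attribute,
# then projection onto R union S.
by_definition = set(conn.execute(
    "SELECT Demonstrator, sessions.Lab, Title FROM sessions, labs "
    "WHERE sessions.Lab = labs.Lab"))
```

In the hand-written version the shared column has to be qualified as `sessions.Lab`, which is the ambiguity that the natural join removes.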
  • 25. Example
    • Let's consider an example of a movies database for LOVEFiLM.com
    • It is likely to have:

        movie
        Title                  Year   Actor
        Pulp Fiction           1994   John Travolta
        Hackers                1995   Angelina Jolie
        The Matrix             1999   Keanu Reeves
        The Devil's Advocate   1997   Keanu Reeves

    • SQL:
        CREATE TABLE movie (Title text, Year int, Actor text)
        INSERT INTO movie VALUES ('Pulp Fiction', 1994, 'John Travolta')
        INSERT INTO movie VALUES ('Hackers', 1995, 'Angelina Jolie')
        etc.
  • 26. Example
    • Projection
        SELECT Actor FROM movie

        Actor
        John Travolta
        Angelina Jolie
        Keanu Reeves
        Keanu Reeves

    • Distinct
        SELECT DISTINCT Actor FROM movie

        Actor
        John Travolta
        Angelina Jolie
        Keanu Reeves

    • Selection
        SELECT * FROM movie WHERE Actor='Keanu Reeves'

        Title                  Year   Actor
        The Matrix             1999   Keanu Reeves
        The Devil's Advocate   1997   Keanu Reeves

    • Projection and Selection composed
        SELECT Title FROM movie WHERE Actor='Keanu Reeves'

        Title
        The Matrix
        The Devil's Advocate
  • 27. Example
    • Selection may use AND, OR, NOT, BETWEEN, IN, etc.
        SELECT * FROM movie WHERE Year BETWEEN 1995 AND 1997
      (BETWEEN 1995 AND 1997 is inclusive)

        Title                  Year   Actor
        Hackers                1995   Angelina Jolie
        The Devil's Advocate   1997   Keanu Reeves
  • 28. Example
    • Let us now take a simplified table:

        movie
        Title          Actor
        Pulp Fiction   John Travolta
        Hackers        Angelina Jolie

    • Imagine we also have some information regarding the number of Oscars won:

        people
        Actor            Oscars
        John Travolta    0
        Angelina Jolie   1
  • 29. Example
    • Cartesian product
        SELECT Title, movie.Actor, people.Actor, Oscars FROM movie, people

        movie × people
        movie.Title    movie.Actor      people.Actor     people.Oscars
        Pulp Fiction   John Travolta    John Travolta    0
        Pulp Fiction   John Travolta    Angelina Jolie   1
        Hackers        Angelina Jolie   John Travolta    0
        Hackers        Angelina Jolie   Angelina Jolie   1

      - it is the only operator that can create new records (if one doesn't count renaming)
      - BUT it creates too many records!
    • A natural join would give information on whether there are Oscar winning actors in the movie:
        SELECT * FROM movie, people WHERE movie.Actor = people.Actor
      or
        SELECT * FROM movie NATURAL JOIN people

        Title          Actor            Oscars
        Pulp Fiction   John Travolta    0
        Hackers        Angelina Jolie   1
  • 30. Example
    • Let us consider two tables with Oscar and BAFTA nominations:

        Oscar                                  BAFTA
        John Travolta    Pulp Fiction          John Travolta     Pulp Fiction
        Angelina Jolie   Girl, Interrupted     Angelina Jolie    Changeling
        Angelina Jolie   Changeling            Jesse Eisenberg   The Social Network

    • Union
        (SELECT * FROM Oscar) UNION (SELECT * FROM BAFTA)

        Oscar ∪ BAFTA
        John Travolta     Pulp Fiction
        Angelina Jolie    Girl, Interrupted
        Angelina Jolie    Changeling
        Jesse Eisenberg   The Social Network
  • 31. Example
    • Intersection
        (SELECT * FROM Oscar) INTERSECT (SELECT * FROM BAFTA)

        Oscar ∩ BAFTA
        John Travolta    Pulp Fiction
        Angelina Jolie   Changeling

    • Difference
        (SELECT * FROM Oscar) EXCEPT (SELECT * FROM BAFTA)

        Oscar − BAFTA
        Angelina Jolie   Girl, Interrupted

    NOTE: some operators are treated differently in different databases, and some may not be present
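Because UNION, INTERSECT and EXCEPT behave like ordinary set operations, the three results can be cross-checked with Python sets. This is a sketch that treats each nomination as an (actor, film) pair taken from the two tables above; no database is involved.

```python
oscar = {
    ("John Travolta", "Pulp Fiction"),
    ("Angelina Jolie", "Girl, Interrupted"),
    ("Angelina Jolie", "Changeling"),
}
bafta = {
    ("John Travolta", "Pulp Fiction"),
    ("Angelina Jolie", "Changeling"),
    ("Jesse Eisenberg", "The Social Network"),
}

union = oscar | bafta          # UNION: every nomination, duplicates merged
intersection = oscar & bafta   # INTERSECT: nominated for both awards
difference = oscar - bafta     # EXCEPT: Oscar nominations with no BAFTA match
```

A row can only appear in the difference if it was in the first operand, so a BAFTA-only nomination shows up in the union but never in Oscar − BAFTA.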
  • 32. Keys and Uniqueness
    • Rows in a relation can be uniquely identified by a key, which can consist of one or more columns
      - A key must be able to uniquely identify all possible rows that the relation could have in the domain of tuples, not just the rows that currently exist.
    • Superkey
      - Any collection of columns which can uniquely identify a row. There may be more than one valid superkey.
    • Candidate key
      - A minimal superkey, i.e. a superkey from which no column can be removed: no proper subset of the columns in a candidate key is itself a superkey. There may be more than one candidate key.
    • Primary key
      - A superkey or candidate key which has been selected to have a special status. A table can have at most one primary key. It should be small and constant.
    • Foreign key
      - If two relations r and s share a key k, then r[k] is a foreign key if k is the primary key of s. Therefore, the foreign key k does not necessarily uniquely identify the rows of r.
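Primary and foreign keys can be declared in SQL so that the database enforces them. The sketch below reuses the labs and sessions tables and assumes, for the sake of illustration, that Lab is the primary key of labs and a foreign key in sessions. The `rejected` helper and the lab code '9Z99' are invented; SQLite only checks foreign keys once `PRAGMA foreign_keys = ON` has been issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # off by default in SQLite
conn.execute("CREATE TABLE labs (Lab text PRIMARY KEY, Demonstrator text)")
conn.execute("CREATE TABLE sessions (Lab text REFERENCES labs(Lab), Title text)")
conn.execute("INSERT INTO labs VALUES ('3F27', 'Cook')")
conn.executemany("INSERT INTO sessions VALUES (?, ?)", [
    ("3F27", "Mainframe filesystems"),
    ("3F27", "Filesystem security"),
])

def rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# A second row with the same primary key value is refused.
duplicate_primary_key = rejected("INSERT INTO labs VALUES ('3F27', 'Libby')")
# A session that points at a lab which does not exist is refused.
dangling_foreign_key = rejected("INSERT INTO sessions VALUES ('9Z99', 'No such lab')")
# The foreign key does not have to be unique within sessions.
sessions_for_3f27 = conn.execute(
    "SELECT COUNT(*) FROM sessions WHERE Lab = '3F27'").fetchone()[0]
```

Two sessions share the value 3F27, which illustrates that a foreign key need not uniquely identify the rows of the table that holds it.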
  • 33. Keys

      Name             Address                 DoB              Gender  Relationship  Email
      John Smith       34 West rd, Cambridge   2 Jan 1981       Male    Single        john@smith.com
      Thomas Anderson  Flat 303,               11 March 1962    Male    Single        neo@matrix.org
      ...
      Mia Wallace      20 Sunset rd, Carlsbad  10 October 1994  Female  Married       m.wallace@hotmail.com
  • 34. Keys

      id     Name             Address                     DoB              Gender  Relationship  Email
      1      John Smith       34 West rd, Cambridge       2 Jan 1981       Male    Single        john@smith.com
      2      Thomas Anderson  Flat 303, 101 Red st, Zion  11 March 1962    Male    Single        neo@matrix.org
      ...    ...              ...                         ...              ...     ...           ...
      10001  Mia Wallace      20 Sunset rd, Carlsbad      10 October 1994  Female  Married       m.wallace@hotmail.com
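The added id column is a surrogate key: it carries no meaning of its own but stays unique even when two people share a name. The sketch below assumes SQLite, where a column declared `INTEGER PRIMARY KEY` is filled in automatically when omitted; the table name `people`, the reduced column set and the second email address are made up for illustration.

```python
import sqlite3


def surrogate_key_demo():
    """Show that an id column keeps rows distinct when names collide."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE people (
            id    INTEGER PRIMARY KEY,  -- surrogate key, assigned by SQLite if omitted
            name  TEXT NOT NULL,
            email TEXT
        )
    """)
    # Two different people who happen to share a name (hypothetical data).
    conn.execute("INSERT INTO people (name, email) VALUES (?, ?)",
                 ("John Smith", "john@smith.com"))
    conn.execute("INSERT INTO people (name, email) VALUES (?, ?)",
                 ("John Smith", "j.smith@example.org"))
    rows = conn.execute("SELECT id, name FROM people ORDER BY id").fetchall()

    # Reusing an existing id violates the primary key and is refused.
    try:
        conn.execute("INSERT INTO people (id, name) VALUES (1, 'Mia Wallace')")
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True
    conn.close()
    return rows, duplicate_rejected
```

Name alone could not serve as a key here, which is the point made by the two versions of the table.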
  • 35. Normalization
    • If a database has duplicated information then it is subject to update anomalies, and the information can become inconsistent. Imagine adding contact details to the ‘course’ table to allow lecturers to be contacted easily:

      Title                       Leader   Lectures  Telephone
      RISC Processors             Sanchez  8         65960
      QAM for modems              Sanchez  34        65960
      Introduction to Mainframes  Belford  20        65536
      Low latency LCD screens     Richard  1         32768

    • If the table is updated, for instance with the SQL command:
      UPDATE course SET Leader='Libby' WHERE Title='RISC Processors'
    • Then the contact details will become incorrect. The process of normalizing a database involves splitting up large tables with only weakly related information into a number of smaller tables. Normalized data is then accessed by joining tables together and performing selections on the results
  • 36. Normalization
    • The database above is not normalized because there is duplicated data. More intuitively, the telephone number has merely been inserted as a convenience and has nothing directly to do with courses
    • Much like type safety and object-oriented design, database normalization allows databases to be designed such that certain errors (for instance data inconsistency) are less likely
    • Normalization is the process of designing the database to comply with normal forms (1NF, 2NF, 3NF, BCNF, 4NF, 5NF and DKNF)
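The update anomaly is easy to reproduce. The following sketch assumes SQLite through Python's `sqlite3` module and stores the telephone number as text; the function name `update_anomaly` is illustrative.

```python
import sqlite3


def update_anomaly():
    """Change a course leader and show that the stale phone number remains."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE course (Title TEXT, Leader TEXT, Lectures INT, Telephone TEXT)"
    )
    conn.executemany("INSERT INTO course VALUES (?, ?, ?, ?)", [
        ("RISC Processors", "Sanchez", 8, "65960"),
        ("QAM for modems", "Sanchez", 34, "65960"),
        ("Introduction to Mainframes", "Belford", 20, "65536"),
        ("Low latency LCD screens", "Richard", 1, "32768"),
    ])
    # Only the leader is changed; nothing forces the telephone to follow.
    conn.execute("UPDATE course SET Leader = 'Libby' WHERE Title = 'RISC Processors'")
    row = conn.execute(
        "SELECT Leader, Telephone FROM course WHERE Title = 'RISC Processors'"
    ).fetchone()
    conn.close()
    return row
```

After the update the row pairs the new leader, Libby, with 65960, which is still Sanchez's number: the database accepted the change without complaint and is now inconsistent.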
  • 37. First Normal Form (1NF)
    • Make sure that your database really obeys the relational model:
      - No ordering over rows
      - No ordering over columns
      - No duplicates
    • Each row/column intersection contains exactly one datum
    • Consider trying to extend the earlier design to allow for multiple phone numbers for the Leader

      BAD:
      Title  Lectures  ID   Numbers
      ···    8         456  65960, 60294, 70231
      ···    34        456  65960, 60294, 70231
      ···    20        9    65536
      ···    1         82   32768, 16384

      BAD:
      Title  Lectures  ID   Phone 1  Phone 2  Phone 3
      ···    8         456  65960    60294    70231
      ···    34        456  65960    60294    70231
      ···    20        9    65536
      ···    1         82   32768    16384

    • Note the use of IDs to avoid duplicates, as names make bad keys
  • 38. First Normal Form
    • Employees table is used for details of course Leaders
    • Adding a Phone number to store employee’s contact details
    • To support multiple phone numbers we need to duplicate Name/ID data

      employees
      Name     ID   Phone
      Sanchez  456  65960
      Belford  9    65536
      Richard  82   32768
      Sanchez  456  60294
      Sanchez  456  70231
      Richard  82   16384

    • The list of phone numbers for the leader of a particular course can now be extracted using relational algebra:
      Π_Phone(σ_Title=“RISC Processors”(course ⋈ employees))
    • or in SQL:
      SELECT Phone FROM course NATURAL JOIN employees WHERE Title='RISC Processors'
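The join query can be run against a small copy of the two tables. This is a sketch under a few assumptions: SQLite is the database, the course table has been reduced to (Title, Lectures, ID) as in the 1NF discussion, and the function name `leader_phones` is made up. NATURAL JOIN matches the two tables on their only shared column, ID.

```python
import sqlite3


def leader_phones(title):
    """Return every phone number of the leader of the given course."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE course (Title TEXT, Lectures INT, ID INT)")
    conn.execute("CREATE TABLE employees (Name TEXT, ID INT, Phone TEXT)")
    conn.executemany("INSERT INTO course VALUES (?, ?, ?)", [
        ("RISC Processors", 8, 456),
        ("QAM for modems", 34, 456),
        ("Introduction to Mainframes", 20, 9),
        ("Low latency LCD screens", 1, 82),
    ])
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
        ("Sanchez", 456, "65960"), ("Belford", 9, "65536"),
        ("Richard", 82, "32768"), ("Sanchez", 456, "60294"),
        ("Sanchez", 456, "70231"), ("Richard", 82, "16384"),
    ])
    # Join on the shared ID column, select the course, project the phone.
    rows = conn.execute(
        "SELECT Phone FROM course NATURAL JOIN employees WHERE Title = ?",
        (title,),
    ).fetchall()
    conn.close()
    return sorted(phone for (phone,) in rows)
```

The three steps in the SQL mirror the relational algebra expression: the join, then the selection on Title, then the projection onto Phone.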
  • 39. Second Normal Form (2NF)
    • A table is in second normal form if it satisfies:
      - It is in first normal form (1NF)
      - All non-prime attributes depend on the whole candidate key
    • From the previous example, the complete relation, employees (E), is:

      employees
      Name     ID   Phone
      Sanchez  456  65960
      Belford  9    65536
      Richard  82   32768
      Sanchez  456  60294
      Sanchez  456  70231
      Richard  82   16384

    • Lack of normalization allows buggy programs to create inconsistencies:
      - Inserting the record (“Belford”, 10, 131072) leads to a mismatch between the name and id
      - An employee name change requires updates across multiple rows, which may be done incorrectly. It also requires more locking
    • The candidate key is C = (ID, Phone). The non-prime attribute is therefore E − C = (Name). The employees’ names do not depend on the phone number, only the ID. Therefore the table is not in 2NF
  • 40. Second Normal Form (2NF)
    • A 2NF design is:

      employee_names
      Name     ID
      Sanchez  456
      Belford  9
      Richard  82

      contacts
      ID   Phone
      456  65960
      9    65536
      82   32768
      456  60294
      456  70231
      82   16384

    • ID is the Primary Key in employee_names
    • Phone is the Primary Key in contacts
    • ID is a Foreign Key in contacts connecting employee names and their phone numbers
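The 2NF design can be written down as two tables with the keys listed above. The sketch assumes SQLite and illustrative column types; `second_normal_form` is a made-up function name. It renames one employee and then rebuilds the original three-column relation with a join.

```python
import sqlite3


def second_normal_form():
    """Rename an employee in the 2NF design and rebuild the wide relation."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employee_names (Name TEXT, ID INTEGER PRIMARY KEY);
        CREATE TABLE contacts (
            ID    INTEGER REFERENCES employee_names(ID),  -- foreign key
            Phone TEXT PRIMARY KEY
        );
        INSERT INTO employee_names VALUES ('Sanchez', 456), ('Belford', 9), ('Richard', 82);
        INSERT INTO contacts VALUES
            (456, '65960'), (9, '65536'), (82, '32768'),
            (456, '60294'), (456, '70231'), (82, '16384');
    """)
    # The name is stored once, so a rename touches a single row.
    cursor = conn.execute("UPDATE employee_names SET Name = 'Libby' WHERE ID = 456")
    rows_touched = cursor.rowcount
    joined = conn.execute(
        "SELECT Name, ID, Phone FROM employee_names NATURAL JOIN contacts ORDER BY Phone"
    ).fetchall()
    conn.close()
    return rows_touched, joined
```

In the unnormalized employees table the same rename would have to be applied to three rows, one per phone number.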
  • 41. Third Normal Form (3NF)
    • “I swear by Codd that each non-prime attribute shall depend upon the key, the whole key and nothing but the key.”
    • More formally a table over R is in 3NF if and only if:
      - It is in 2NF (and therefore 1NF)
      - Every non-prime attribute is directly dependent on every candidate key of R

      Practical              Date       Demonstrator  Contact  Pay rate
      Acoustic coupling      Mon 1 Feb  Dade          45102    10
      Acoustic coupling      Sat 7 Feb  Dade          45102    15
      Self-propagating code  Tue 2 Mar  Joey          67822    10
      Self-propagating code  Sun 9 Mar  Kate          62341    15

    • The candidate key is (Practical, Date); however, the table is not fully normalized because there is repetition of data (the contact numbers and the pay rates). The table is not in 3NF because:
      - Pay rate depends on the key, but not the whole key. Specifically, it only depends on the date
      - Contact depends upon the whole key, but the dependence is transitive, not direct, that is: Contact → Demonstrator → (Practical, Date), reading → as “depends on”
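One possible 3NF decomposition of this table moves each dependency into its own relation: contact numbers keyed by demonstrator, pay rates keyed by date, and the remaining attributes keyed by (Practical, Date). The slides do not give this decomposition, so the table names `demonstrators`, `pay_rates` and `practicals`, the column `PayRate` and the use of SQLite are all assumptions of this sketch.

```python
import sqlite3

ORIGINAL_ROWS = [
    ("Acoustic coupling", "Mon 1 Feb", "Dade", "45102", 10),
    ("Acoustic coupling", "Sat 7 Feb", "Dade", "45102", 15),
    ("Self-propagating code", "Tue 2 Mar", "Joey", "67822", 10),
    ("Self-propagating code", "Sun 9 Mar", "Kate", "62341", 15),
]


def third_normal_form():
    """Split the practicals table into three relations and join them back."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE demonstrators (Demonstrator TEXT PRIMARY KEY, Contact TEXT);
        CREATE TABLE pay_rates (Date TEXT PRIMARY KEY, PayRate INTEGER);
        CREATE TABLE practicals (
            Practical TEXT, Date TEXT, Demonstrator TEXT,
            PRIMARY KEY (Practical, Date)
        );
    """)
    for practical, date, demonstrator, contact, rate in ORIGINAL_ROWS:
        # OR IGNORE skips a demonstrator or date that is already stored.
        conn.execute("INSERT OR IGNORE INTO demonstrators VALUES (?, ?)",
                     (demonstrator, contact))
        conn.execute("INSERT OR IGNORE INTO pay_rates VALUES (?, ?)", (date, rate))
        conn.execute("INSERT INTO practicals VALUES (?, ?, ?)",
                     (practical, date, demonstrator))
    demonstrator_count = conn.execute(
        "SELECT COUNT(*) FROM demonstrators").fetchone()[0]
    rebuilt = conn.execute("""
        SELECT Practical, Date, Demonstrator, Contact, PayRate
        FROM practicals
        JOIN demonstrators USING (Demonstrator)
        JOIN pay_rates USING (Date)
    """).fetchall()
    conn.close()
    return demonstrator_count, rebuilt
```

Each contact number is now stored once per demonstrator, and joining the three tables recovers the rows of the original table.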
  • 42. SQL Constraints
    • In addition to normal forms, which can be represented in relational algebra, SQL allows tables to be constructed with additional constraints which make the database more robust. Unlike normalization, constraints do not make it impossible to construct errors. However, constraints do make errors cause transactions to abort, rather than make inconsistent data
    • Types of constraints
      - NOT NULL – ensures that the value of this column can not be omitted
      - UNIQUE – ensures that the value of this column is unique
      - PRIMARY/FOREIGN KEY – designates the column as a key
    • NOT NULL prevents missing attributes (helpful for 1NF)
      CREATE TABLE course (Name string NOT NULL, ...)
    • A primary key can be specified. This will ensure that ID is unique, and therefore all rows are also unique
      CREATE TABLE people (Name string, ID int PRIMARY KEY)
    • Known candidate keys can be marked as unique:
      CREATE TABLE r (a, b, c, d, UNIQUE(a, b), UNIQUE(a, c, d))
    • A particularly important constraint is FOREIGN KEY, which ensures that an attribute is a primary key in another table:
      CREATE TABLE course (Title string PRIMARY KEY, ID int, Lectures int, FOREIGN KEY (ID) REFERENCES employees)
    • The ID of the course leader is now constrained to be a valid em-
  • 43. Entity-Relationship (E/R) Modelling
    • As in the Object Oriented approach, designing a database schema requires finding conceptual abstractions (that represent the data) and defining relationships between them
    • [Diagram: entity sets Employee and Course joined by the relationship set Leads; attributes shown are Name, Title and No. lectures]
    • Notation suggested by Peter Chen in “The Entity Relationship Model: Toward a Unified View of Data”, 1976
      - UML can also be used
    • Relationships have cardinality
      - 1 to 1
      - 1 to Many
      - Many to Many etc.
  • 44. Entity-Relationship (E/R) Modelling
    • [Diagram: a larger E/R model with entity sets Employee, Mechanic, Salesman, RepairJob, Car and Client; an ISA link from Employee to Mechanic and Salesman; relationship sets Does, Repairs, Buys and Sells, with seller and buyer roles; other labels: Name, Number, Description, Price, Date, Value, License, Parts, Cost, Commission, Work, Year, ID, Manufacturer, Model, Phone, Address]
    Pável Calado, http://www.texample.net/tikz/examples/entity-relationship-diagram/
  • 45. Objects and Databases
    • Relational Database Management Systems were mature, stable products by the 1980s
    • The Object-Oriented approach reached wide adoption in the 1990s
    • Any large software system still needs to persist data, hence store it in databases
    • Question: how do we map Objects in a software system at runtime to Data stored in databases?
    • Originally, two options emerged:
      - Object-Relational Mapping – a software layer that provides database persistence to an OO system (e.g. Hibernate, TopLink) – commonly used
      - Object Databases – a nice idea that failed to reach mainstream adoption
    • Most recently, further developments included non-relational approaches (NoSQL) to working with large distributed datasets, e.g. Hadoop (hadoop.apache.org) and related projects
      - Map/Reduce: distributed processing of large data sets on compute clusters
      - Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
      - Cassandra: a scalable multi-master database with no single points of failure
  • 46. No(t only)SQL at guardian.co.uk
    • The Guardian online, 1999
    • [Diagram: “Guardian journalism online: 1999”, captioned “I bring you NEWS!!!”. Components shown: Web server (×3), App server (×3), Memcached (20Gb), Oracle, CMS, Data feeds]
    Matthew Wall, Simon Willison, www.slideshare.net/matwall/nosql-presentation
  • 47. No(t only)SQL at guardian.co.uk
    • The Guardian online, 2010
    • [Diagram: “Guardian journalism online: 2010”, divided into Core, Out and In. Components shown: Web servers, Proxy, App server, several App and Solr instances, Memcached (20Gb), CMS, Data feeds, M/Q, rdbms, CouchDB?, external hosting, Cloud (EC2, app engine etc.)]
    Matthew Wall, Simon Willison, www.slideshare.net/matwall/nosql-presentation
  • 48. Security and SQL Injection
    • Consider the following example

      // allowing a user to search by product name
      string name;
      cout << "Enter product name:" << endl;
      getline(cin, name);
      // the user's input is pasted straight into the SQL text
      string query = "SELECT * FROM products WHERE name='" + name + "'";
      do_sql(query);

    • What happens if the user enters:
      '; DROP TABLE products; --
    • The query becomes

      // going to delete the table products
      SELECT * FROM products WHERE name=''; DROP TABLE products; --'

    • SQL Injection could be used to steal data from a database (news.bbc.co.uk/1/hi/8206305.stm)
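A common defence is to pass user input as a query parameter instead of concatenating it into the SQL text. The sketch below assumes SQLite through Python's `sqlite3` module; the functions `make_db`, `search_unsafe` and `search_safe` and the sample products are made up. Because `sqlite3`'s `execute` accepts only a single statement, the DROP TABLE payload from the slide would fail rather than run here, so the demonstration uses a payload that widens the WHERE clause instead.

```python
import sqlite3


def make_db():
    """Create a small products table (sample data is hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (name TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?)",
                     [("widget", 1.5), ("gadget", 2.5)])
    return conn


def search_unsafe(conn, name):
    # Vulnerable: the input becomes part of the SQL statement itself.
    query = "SELECT name FROM products WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def search_safe(conn, name):
    # The ? placeholder passes the input as a value, never as SQL.
    return conn.execute(
        "SELECT name FROM products WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    db = make_db()
    payload = "' OR '1'='1"
    print(search_unsafe(db, payload))  # every product is returned
    print(search_safe(db, payload))    # no product has this literal name
```

With the unsafe version the payload turns the condition into `name = '' OR '1'='1'`, which is true for every row. With the parameterized version the same text is simply compared against the name column and matches nothing.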