# Understanding databases and querying

## by Usman Sharif, Product Manager at Smiley Network Inc. on May 21, 2013

## Understanding databases and querying: Presentation Transcript

• Understanding Databases and Querying. Usman Sharif
• History of databases? There is a need to organize data structurally, and various models exist to fulfill this need. The most common technique is called relational modelling. Databases that support the relational model are called Relational Database Management Systems (RDBMS).
• Relational Model: All data is represented in terms of tuples. A tuple is an extension of a pair: a pair relates two items, and a tuple relates N items, where N is a finite whole number. Tuples are grouped into relations. In mathematical terms, the relational model is based on first-order predicate logic.
• Example of Tuples and Relations: Assume a road repair company wants to track its activities on different roads. Let's restrict the activities to ‘Patching’, ‘Overlay’ and ‘Crack Sealing’. The company overlaid I-95 on 1/12/01 and I-66 on 2/8/01. How can we represent this information in a relational model using tuples? First, we see that there are two distinct things here: Activities and Work. Next, we define tuples for both of these items as follows: Activities = {activityName}; Works = {activity, date, routeNumber}.
• Example of Tuples and Relations: We see a relation between Activities and Work: the activity that is to be performed. In the relational model we use the concept of ‘keys’ to describe the relationship between different tuples. In our example, activityName can act as a ‘key’ to describe the relation, which can be named ‘ActivityWorks’. For optimization reasons, keys are generally of a numeric type. Therefore we modify Activities and Works to add a numeric ID: Activities = {activityId, activityName}; Works = {activityId, date, routeNumber}.
• Describing the example graphically
• Relational Databases: Relational modelling is a mathematical concept. When we translate this concept into an RDBMS, we describe tuples as rows, items in tuples as columns, and a group of tuples as a table. Relations between tables are called relations (or relationships) in RDBMS terminology as well. The example of our road repair company, when translated into an RDBMS, would have two tables. Table Name: Activities. Columns: activityId (Primary Key, number type), activityName (string type). Table Name: Works. Columns: activityId (Foreign Key, number type), date (date type), routeNumber (string type). It would have the following relation. Relation Name: ActivityWorks. Participating Columns: Primary Table: Activities, Key Column: activityId; Secondary Table: Works, Key Column: activityId.
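
As a rough illustration of the two tables and the ActivityWorks relation, here is a minimal sketch using Python's built-in `sqlite3` module. Several things are assumptions made only for this example: the numeric IDs 23 and 25, ISO-8601 text dates (SQLite has no date type, and 1/12/01 is read as month/day/year), and the rejected row with ID 99 on route I-81, which is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.execute("""
    CREATE TABLE Activities (
        activityId   INTEGER PRIMARY KEY,
        activityName TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE Works (
        activityId  INTEGER NOT NULL REFERENCES Activities (activityId),
        date        TEXT NOT NULL,   -- ISO-8601 text, an assumption of this sketch
        routeNumber TEXT NOT NULL
    )""")

# IDs 23 and 25 are assumed for illustration.
conn.executemany("INSERT INTO Activities VALUES (?, ?)",
                 [(23, "Patching"), (24, "Overlay"), (25, "Crack Sealing")])
conn.executemany("INSERT INTO Works VALUES (?, ?, ?)",
                 [(24, "2001-01-12", "I-95"), (24, "2001-02-08", "I-66")])

# A Works row pointing at a non-existent activity violates the foreign key.
try:
    conn.execute("INSERT INTO Works VALUES (99, '2001-03-01', 'I-81')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

works_count = conn.execute("SELECT COUNT(*) FROM Works").fetchone()[0]
print(fk_rejected, works_count)
```

The foreign key is what turns the shared activityId column into an enforced relation: the database itself refuses Works rows that have no matching Activities row.
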
• More on Relations: The relation in the previous example is commonly called a ‘one-to-many’ or a ‘Master-Child’ relationship. There are three types of relationships in total. One-to-One: for a row in the primary table there can be at most one row in the secondary table. Commonly used to spread a single tuple across two tables based on logical reasoning. One-to-Many: for a row in the primary table there can be multiple rows in the secondary table. Commonly used to reduce redundancy or duplication of the same data. Many-to-Many: for multiple rows in the primary table there can be multiple rows in the secondary table. Used to describe complex relationships. Relations are always directional.
• Querying databases: Databases provide an interface to define and manipulate data. This interface consists of queries. There are two types of queries. Data Definition Language (DDL) queries are used to create and modify the database structure; the DB structure is called a schema definition. Data Manipulation Language (DML) queries are used to query the database for data and to change it. There are four major DML queries: SELECT, INSERT, UPDATE and DELETE.
• SELECT Query: A SELECT query is the way to fetch data from a database. At a minimum, it has two parts (called clauses): the SELECT clause and the FROM clause. For example: SELECT activityId, activityName FROM Activities; This query would return all rows in the Activities table. Apart from the SELECT and FROM clauses, there are a number of other clauses that are optional. These include (but are not limited to): WHERE, ORDER BY and GROUP BY.
• SELECT Query - The SELECT Clause: It enables you to define the columns you want. Sometimes you want all columns; in those cases you can use the wildcard operator (*). For example, the previous query can be modified as: SELECT * FROM Activities; A good practice is to name the columns rather than using *. The primary use of the SELECT clause is to define a projection, that is, a subset of columns, so that the result is restricted to those columns only.
• SELECT Query - FROM Clause: This is where you tell the database the name of the table(s) where it should look for the columns you named in the SELECT clause. When fetching data from multiple tables, list all tables and describe the relation between them. For example, let us try to fetch data for all the activities that have been performed on various routes, along with dates: SELECT Activities.activityName, Works.date, Works.routeNumber FROM Activities INNER JOIN Works ON Activities.activityId = Works.activityId; Notice the keywords ‘INNER JOIN’ and the part ‘ON Activities.activityId = Works.activityId’. The ON … part tells the database which columns to match results on. It is also called the join condition. There can be more than one join condition, depending on the underlying database schema.
• SELECT Query - FROM Clause - JOINs: JOIN is a keyword that lets the database know that there are multiple tables you intend to fetch data from. There is a table mentioned before JOIN and another after it. The one before is called the left table and the one after is called the right table. Three types of joins are covered here: INNER JOIN, LEFT OUTER JOIN and RIGHT OUTER JOIN.
• INNER JOIN: INNER JOIN is also sometimes called a ‘strict’ join. Some RDBMS systems support dropping the ‘INNER’ and implicitly assume it. This type of join means: for each row in the left table, find the matching rows in the right table, and skip the row if no match is found. This type of join helps in eliminating empty records. For example, in our road repair example, it would omit all Activities rows that don’t have records in the Works table.
• OUTER JOINs: In case we don’t want to omit empty records, we can use OUTER JOINs. A LEFT OUTER JOIN keeps every row in the left table and pairs it with all matching rows in the right table. A RIGHT OUTER JOIN keeps every row in the right table and pairs it with all matching rows in the left table. For example, let us find all Activities and related Works. We can do this by: SELECT Activities.activityName, Works.date, Works.routeNumber FROM Activities LEFT OUTER JOIN Works ON Activities.activityId = Works.activityId; This query would return all Activities along with their associated Works. For the Activities that don’t have corresponding Works, it would put ‘NULL’ under the date and routeNumber columns.
• The JOIN Conditions: The ON … part is called the join condition. It is essentially an assertion describing columns of the left and right tables and the way they are to be compared. In most circumstances the columns (from the left and right tables) are matched with an = operator; however, in some cases that might not be true. Other conditional operators such as not equal, greater than, less than, etc. are also supported. There can be more than one join condition.
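
The difference between the two join types can be seen on the road repair data. The sketch below runs both queries against an in-memory SQLite database through Python's `sqlite3` module; the IDs 23 and 25, the ISO-8601 text dates and the added ORDER BY clauses (for a stable output order) are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
    CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
    INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay'), (25, 'Crack Sealing');
    INSERT INTO Works VALUES (24, '2001-01-12', 'I-95'), (24, '2001-02-08', 'I-66');
""")

# INNER JOIN: activities without any work are dropped.
inner_rows = conn.execute("""
    SELECT Activities.activityName, Works.date, Works.routeNumber
    FROM Activities INNER JOIN
         Works ON Activities.activityId = Works.activityId
    ORDER BY Works.date
""").fetchall()

# LEFT OUTER JOIN: every activity is kept; missing work shows up as NULL (None).
left_rows = conn.execute("""
    SELECT Activities.activityName, Works.date, Works.routeNumber
    FROM Activities LEFT OUTER JOIN
         Works ON Activities.activityId = Works.activityId
    ORDER BY Activities.activityName, Works.date
""").fetchall()

for row in inner_rows:
    print("inner:", row)
for row in left_rows:
    print("left: ", row)
```

Only ‘Overlay’ appears in the inner join, while the left outer join also lists ‘Patching’ and ‘Crack Sealing’ with empty date and routeNumber values.
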
•  QUESTION: What would happen if you skip the ON … part?
• SELECT Query - WHERE clause: The WHERE clause allows you to describe conditions on the data you want fetched. For example, if we are interested in all Overlay Works (assuming 24 is the activityId of ‘Overlay’), we’ll write the query: SELECT * FROM Works WHERE activityId = 24; Another way to do the same without using an ID is: SELECT Works.* FROM Works INNER JOIN Activities ON Works.activityId = Activities.activityId WHERE Activities.activityName = 'Overlay'; However, the second example would be a bit slower and non-optimal, because there is a certain overhead in joining and in matching on string columns.
• SELECT Query - ORDER BY Clause: Theoretically speaking, the records in a table are unordered. However, most RDBMS usually store them in some kind of ordering (usually in the order of the primary key column). In any case, there might be a requirement to order the results in a particular way. The ORDER BY clause allows you to describe the ordering of the data and its direction. For example, if we want all Activities along with their associated Works, ordered alphabetically and sorted by date in descending order, we can do that by: SELECT Activities.activityName, Works.date, Works.routeNumber FROM Activities INNER JOIN Works ON Activities.activityId = Works.activityId ORDER BY Activities.activityName ASC, Works.date DESC; The ASC keyword is implicit and can be skipped.
• Aggregating Results: Sometimes we want to fetch aggregated results. For example, in the road repair example we want to find out the number of times each Activity has been carried out. The GROUP BY clause provides this functionality: SELECT Activities.activityName, COUNT(Works.routeNumber) AS countActivity FROM Activities INNER JOIN Works ON Activities.activityId = Works.activityId GROUP BY Activities.activityName; COUNT is an aggregate function. Other commonly used aggregate functions are SUM, AVG, MIN and MAX.
• SELECT Query - GROUP BY Clause: When a GROUP BY clause is defined, every column in the SELECT and ORDER BY clauses needs to be either part of an aggregate function or mentioned in the GROUP BY clause. For example, the following query is invalid in standard SQL, because Works.date is neither aggregated nor grouped (some systems accept it anyway and return an arbitrary date per group): SELECT Activities.activityName, Works.date, COUNT(Works.routeNumber) AS countActivity FROM Activities INNER JOIN Works ON Activities.activityId = Works.activityId GROUP BY Activities.activityName;
• Sub-queries: A SELECT query works on a table or a group of tables, meaning tables are the operands of a SELECT operation. The output of a SELECT query is (a kind of) a table. Therefore, the output of a SELECT query can act as an input/operand for another SELECT query.
• Why use sub-queries? (1) Query optimization, by breaking a large/complex query into smaller queries that use WHERE clauses to reduce the data size. (2) Retrieving single-valued records from related tables based on values of some other columns in another query, such as retrieving the most recent (or oldest) record in a table that holds data for a single record with updates over a period of time. This is a reference to a common data warehousing use case: storing data that changes over time when you want to preserve those changes. It is sometimes also referred to as a Slowly Changing Dimension (SCD). (3) Using a sub-query in a WHERE clause to specify a match on a range of values.
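
To see the effect of ASC and DESC, the following sketch runs the ORDER BY query through Python's `sqlite3` module. The Patching work on I-81 is a hypothetical row added only so that more than one activity shows up; the IDs and ISO-8601 text dates are likewise assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
    CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
    INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay');
    -- The I-81 row is hypothetical, added for illustration only.
    INSERT INTO Works VALUES (24, '2001-01-12', 'I-95'),
                             (23, '2001-03-05', 'I-81'),
                             (24, '2001-02-08', 'I-66');
""")

base_query = """
    SELECT Activities.activityName, Works.date, Works.routeNumber
    FROM Activities INNER JOIN
         Works ON Activities.activityId = Works.activityId
"""

# Explicit ASC on the name, DESC on the date.
explicit = conn.execute(
    base_query + " ORDER BY Activities.activityName ASC, Works.date DESC"
).fetchall()

# Same ordering with the implicit ASC left out.
implicit = conn.execute(
    base_query + " ORDER BY Activities.activityName, Works.date DESC"
).fetchall()

for row in explicit:
    print(row)
```

Within each activity name the newest work comes first, and dropping the ASC keyword does not change the result.
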
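
A small runnable sketch of the counting query, again using Python's `sqlite3` module with assumed IDs and ISO-8601 text dates. It also runs a LEFT OUTER JOIN variant, which is not in the slides, to show that COUNT over a column skips NULLs and so reports 0 for activities that were never carried out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
    CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
    INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay'), (25, 'Crack Sealing');
    INSERT INTO Works VALUES (24, '2001-01-12', 'I-95'), (24, '2001-02-08', 'I-66');
""")

# Count of works per activity; activities without works do not appear.
inner_counts = conn.execute("""
    SELECT Activities.activityName, COUNT(Works.routeNumber) AS countActivity
    FROM Activities INNER JOIN
         Works ON Activities.activityId = Works.activityId
    GROUP BY Activities.activityName
    ORDER BY Activities.activityName
""").fetchall()

# Variant (an addition of this sketch): keep all activities, counting 0 where none match.
left_counts = conn.execute("""
    SELECT Activities.activityName, COUNT(Works.routeNumber) AS countActivity
    FROM Activities LEFT OUTER JOIN
         Works ON Activities.activityId = Works.activityId
    GROUP BY Activities.activityName
    ORDER BY Activities.activityName
""").fetchall()

print(inner_counts)
print(left_counts)
```
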
• Sub-queries for optimization Assume that we have aservice with one millionusers. There are only about100,000 users that havespent money on ourservice. Of the 100,000 users, onlyabout 1,000 users haveever spent 100 dollars ormore in one go. We would most likely havea database with thetables as shown in thediagram
• Sub-queries for optimization You are required to analyze transcations with amount greater than100 dollars. Write down the query that fetches users (userId, name, gender,country) and their transactions (transactionDate, amount). A sub-optimal query follows on the next slide but don’t peak ahead.Write down one yourself and compare with it later.
• Sub-queries for optimizationSELECT users.userId, users.username, users.gender, users.country,transactions.transactionDate, transactions.transactionAmountFROM users INNER OUTER JOINtransactions ON users.userId = transactions.userIdWHERE transactions.transactionAmount > 100; Problems: There were 100,000 users that had spent money. Of those there were only a 1,000 instanceswhere a the amount spent was greater than 100. Assume that on average there are 2 transactions per user. The query above would result in retrieval of 200,000 records and then the WHERE clausewould be applied to it to pick out the 1,000 such records where the amount was greater than100. This means that 99.5% of data fetched initially was of no use and wasted server resources(time and memory).
• Sub-queries for optimization First, we know that we are only interested in transactions worth more than 100 dollars.Following query gets use only these transactions:SELECT transactions.userId, transactions.transactionDate, transactions.transactionAmountFROM transactionsWHERE transactions.transactionAmount > 100 Since, the output of the above query would be a table, we’ll use this one to JOINwith users table. The resulting query would be:SELECT users.userId, users.username, users.gender, users.country,t1.transactionDate, t1.transactionAmountFROM users INNER OUTER JOIN(SELECT transactions.userId, transactions.transactionDate,transactions.transactionAmountFROM transactionsWHERE transactions.transactionAmount > 100) AS t1 ON users.userId = t1.userId
• Sub-queries for Retrieving SCD From the previous example, assume that now we’re interested inknowing when was the last time each of our users spent moneyalong with their gender and country. How can we go about doing this? The query that does that is on the next slide, but first try thinking outhow you can do that.
• Sub-queries for Retrieving SCD First, lets write a query that retrieves the latest transaction.SELECT MAX(transactions.transactionDate) AS lastTransactionDateFROM transactionsORSELECT transactions.transactionDateFROM transactionsORDER BY transactions.transactionDate DESCLIMIT 1 But we want to know the last transaction for each user. We can modify the first example as:SELECT transactions.userId, MAX(transactions.transactionDate) AS lastTransactionDateFROM transactionsGROUP BY transactions.userId The second one cannot be modified in a way that would give us the desired because??SELECT transactions.userId, transactions.transactionDateFROM transactionsORDER BY transactions.transactionDate DESCLIMIT 1
• Sub-queries for Retrieving SCD Now, we need to combine the result with user’s gender andcountry.SELECT users.userId, users.gender, users.country,MAX(transactions.transactionDate) AS lastTransactionDateFROM users LEFT OUTER JOINtransactions ON users.userId = transactions.userIdGROUP BY users.userId, users.gender, users.location The query above gives us the desired result, but it has one problem.What?
• Sub-queries for Retrieving SCD We can use the discarded query two slides back if we can parameterize it somehow so that itevaluates for each user and gives us the last date. The following query does that:SELECT users.userId, users.gender, users.country,(SELECT transactions.transactionDateFROM transactionsWHERE transactions.userId = users.userIdORDER BY transactionDate DESCLIMIT 1) AS lastTransactionDateFROM users The query above does not have a join. It does not use an aggregate function in the main query and enables us to easily add morecolumns without worrying about the GROUP BY clause. Modify the query above (or the one on previous slide) so that we now get the last transactiondates for transactions worth more than 50 dollars for each user. (Answer on next slide)
• Sub-queries for Retrieving SCDSELECT users.userId, users.gender, users.country,(SELECT transactions.transactionDateFROM transactionsWHERE transactions.userId = users.userIdAND transactions.transactionAmount > 50LIMIT 1) AS lastTransactionDateFROM users
• Handling NULL Values The query on previous slide would return rows for all one million userswith most of them having lastTransactionDate as NULL. NULLs don’t look good on a result set and are of no value for furtheranalysis. We can resolve this situation in two ways. Assume that we do need to see all one million users and would liketo put a default value for the users that don’t have a transaction(such as 1.Jan.1900). Such values are called ‘sentinels’. To replace a NULL, we can use a function ISNULL to replace theNULL with a sentinel value.
• Handling NULL ValuesSELECT users.userId, users.gender, users.country,ISNULL((SELECT transactions.transactionDateFROM transactionsWHERE transactions.userId = users.userIdAND transactions.transactionAmount > 50LIMIT 1), ‘1.Jan.1900’) AS lastTransactionDateFROM users
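
A runnable sketch of the sentinel idea, using Python's `sqlite3` module. SQLite has no two-argument ISNULL function, so this example swaps in COALESCE, which returns its first non-NULL argument. The sample rows are hypothetical, dates are stored as ISO-8601 text, so the sentinel is written as '1900-01-01', and the sub-query orders by date so that LIMIT 1 picks the latest qualifying transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (userId INTEGER PRIMARY KEY, username TEXT,
                        gender TEXT, country TEXT);
    CREATE TABLE transactions (userId INTEGER, transactionDate TEXT,
                               transactionAmount REAL);
    -- Hypothetical sample rows; user 3 has no transactions.
    INSERT INTO users VALUES (1, 'alice', 'F', 'US'), (2, 'bob', 'M', 'PK'),
                             (3, 'carol', 'F', 'DE');
    INSERT INTO transactions VALUES (1, '2013-01-05', 20.0),
                                    (1, '2013-02-10', 150.0),
                                    (2, '2013-03-01', 99.0),
                                    (2, '2013-03-15', 30.0);
""")

# COALESCE stands in for ISNULL here (an assumption of this sketch).
rows = conn.execute("""
    SELECT users.userId, users.gender, users.country,
           COALESCE((SELECT transactions.transactionDate
                     FROM transactions
                     WHERE transactions.userId = users.userId
                       AND transactions.transactionAmount > 50
                     ORDER BY transactions.transactionDate DESC
                     LIMIT 1), '1900-01-01') AS lastTransactionDate
    FROM users
    ORDER BY users.userId
""").fetchall()

for row in rows:
    print(row)
```

User 2's transaction of 30 dollars on 2013-03-15 is ignored by the amount filter, and user 3 receives the sentinel date instead of NULL.
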
• Sub-queries in WHERE clause Or, we can modify the same query as:SELECT users.userId, users.gender, users.country,(SELECT transactions.transactionDateFROM transactionsWHERE transactions.userId = users.userIdAND transactions.transactionAmount > 50LIMIT 1) AS lastTransactionDateFROM usersWHERE users.userId IN (SELECT transactions.userIdFROM transactionsWHERE transactions.transactionAmount > 50) However, this is (and in general queries that user a sub-query in WHERE clauseare) sub-optimal to the point that it is quite a bad query.
• Many-to-Many Relation Example We are tasked to design a system for a college. There are students and there are courses. We need to provide a basic model that can store data for students,courses and enrollment of students in courses over years andsemesters. A student may have enrolled in multiple courses. A course may have enrollment of multiple students. A student may enroll in a course only once in a give semester of ayear. Try modelling the above scenario. The slide following this shows acommon way to go about doing this.
• Many-to-Many Relation Example
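
One common way to model this scenario is a junction table between students and courses. The sketch below, written with Python's `sqlite3` module, is an assumption rather than a reproduction of the slide's diagram: the table names, column names and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE students (studentId INTEGER PRIMARY KEY, studentName TEXT NOT NULL);
    CREATE TABLE courses  (courseId  INTEGER PRIMARY KEY, courseName  TEXT NOT NULL);

    -- Junction table: one row per enrollment of a student in a course.
    CREATE TABLE enrollments (
        studentId INTEGER NOT NULL REFERENCES students (studentId),
        courseId  INTEGER NOT NULL REFERENCES courses (courseId),
        year      INTEGER NOT NULL,
        semester  INTEGER NOT NULL,
        -- A student may enroll in a course only once per semester of a year.
        UNIQUE (studentId, courseId, year, semester)
    );

    INSERT INTO students VALUES (1, 'Student A'), (2, 'Student B');
    INSERT INTO courses VALUES (10, 'Databases'), (11, 'Algorithms');
    INSERT INTO enrollments VALUES (1, 10, 2012, 1), (1, 11, 2012, 1),
                                   (2, 10, 2012, 1), (1, 10, 2013, 1);
""")

# Repeating an enrollment for the same semester and year is rejected.
try:
    conn.execute("INSERT INTO enrollments VALUES (1, 10, 2012, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

enrollment_count = conn.execute("SELECT COUNT(*) FROM enrollments").fetchone()[0]
print(duplicate_rejected, enrollment_count)
```

The many-to-many relation is split into two one-to-many relations (students to enrollments, courses to enrollments), and the same student can still take the same course again in a different year.
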
• Many-to-Many Relation Example Write a query that retrieves records of enrollment for all studentsordered chronologically. Write a query that retrieves semester-wise enrollment count for allcourses Write a query that displays students that have enrolled in the samecourse more than once along with the number of times they hadenrolled. Write a query to display last enrollment for all students.
• Questions