Objectives Developers tend to learn & enhance skill on programming logic/language/framework/tool more, and, put less effort in learning SQL. Because of lacking good knowledge in SQL, the data access operations are not written in best way in most of the cases. So, overall application performs slow. Objective of this session is to re-visit and re-explore SQL to enhance our SQL knowledge as best as possible so that: We can write better SQL We can tune data access of our applications
Assumptions Audience have basic SQL working knowledge Audience have experience in database based application development Audience have working experience in a database server (Say, SQL server 2000 or later) Audience are familiar with common database objects
Topics covered Stay clean, follow good conventions Data types. Re-visit the basics. Joining. SubQueries. Joins Versus Subqueries. Group By. Set Operations. Case Statements.
Stay clean, follow good conventions Using a good convention allows us to write consistent and manageable code. It also increase readability and helps troubleshoot problems. 3 basic conventions: Use uppercase letters for the keywords, which includes SELECT, FROM, and WHERE. Use lowercase letters for the user-supplied words Align the keywords SELECT, FROM, and WHERE on separate lines Write appropriate comments. For example : SELECT dname, age /* displays the dependent name and age */ FROM Dependent d /* from the Dependent table */ WHERE d.age > 5 /* where the age of the dependent is greater than 5 */
Data types Always use the data type that will require the least storage size. Following are the SQL server 2005 data types. CHAR and VARCHAR Both non-Unicode character data types with a maximum length of 8,000. Each character occupies 1 byte. Storage size for CHAR is n bytes while for VARCHAR is the actual length in bytes of the data entered Use CHAR data type when the data values in a column are expected to be consistently the same size (Say, ProductCode) NCHAR and NVARCHAR Both are Unicode character data types with a maximum length of 4,000 characters. Each character occupies 2 bytes.
Data types TINYINT, SMALLINT, INT and BIGINT If you simply need to store a value between 0 and 255 then you should define your column as TINYINT,not INT
Data types FLOAT and REAL Data types for use with floating point numeric data. Floating point data is approximate; not all values in the data type range can be precisely represented. Declared as float [(n)] n is the number of bits used to store the mantissa of the float number in scientific notation and thus dictates the precision and storage size. n must be a value from 1 through 53. The default value of n is 53. If value of n is 24, then, it is REAL.
Data types NUMERIC and DECIMAL Numeric data types that have fixed precision and scale. DECIMAL (p [, s ]) NUMERIC (p [, s ]). They are synonymous to each other and either one can be used. Precision is the total number of digits stored, and scale is the maximum number of digits to the right of the decimal point. The storage space of NUMERIC/DECIMAL data varies according to the precision. Float and real follow the IEEE 754 specification on approximate numeric. data types and cannot store values accurately. So, when accuracy is important (For financial applications), NUMERIC/DECIMAL should be used.
Data types DATETIME and SMALLDATETIME Values with the datetimedata type are stored internally DateTime is stored as two 4-byte integers. The first 4 bytes store number of days before or after the base date, January 1, 1900. Rest 4 bytes store time of day in milliseconds after midnight. SmallDateTimevalues are stored as two 2-byte integers. The first 2 bytes store the number of days after January 1, 1900. The other 2 bytes store the number of minutes since midnight. UseSmallDateTime when you don't need to store the time of the day. It’ll save a good amount of storage space.
Data types MONEY and SMALLMONEY Data types for representing currency values. The differences between these 2 data types are in the minimum and maximum values each can hold . MONEY uses 8 bytes of storage space while SMALLMONEY uses 4 bytes of storage space. If the value is less than 214,748.3647 then you should use SMALLMONEY. Otherwise use MONEY. BOOLEAN In SQL Server, there's no BOOLEANdata type. Nearest data type that can be used in place of boolean data is the BIT data type, which is basically an INTEGER data type that can accept a value of 1, 0 or NULL value only.
Data types TEXT,NTEXT and IMAGE TEXT :Single-byte non-unicode character strings, can be used to store more than 8,000 bytes (2 GB). NTEXT :Double-byte unicode character strings, can be used to store more than 8,000 bytes (2 GB). IMAGE :Variable-length binary data from 0 byte through 2 GB BINARY and VERBINARY BINARY [ ( n ) ] is Fixed-length binary data with a length of n bytes, where n is a value from 1 through 8,000. The storage size is n bytes. VERBINARY [ ( n )] isVariable-length binary data. n can be a value from 1 through 8,000.
Data types VARCHAR(MAX), NVARCHAR(MAX) and VARBINARY(MAX) In SQL Server 2000, a row cannot exceed 8000 bytes in size. This limitation is due to the 8 KB internal page size SQL Server. To store more data in a single column, you needed to use the TEXT, NTEXT, or IMAGE data types (BLOBs) which are stored in a collection of 8 KB data pages . These are unlike the data pages that store the other data in the same table. Rather, these pages are arranged in a B-tree structure. These data cannot be used as variables in a procedure or a function and they cannot be used inside string functions such as REPLACE, CHARINDEX or SUBSTRING. In most cases, you have to use READTEXT, WRITETEXT, and UPDATETEXT To solve this problem, Use VARCHAR(MAX), NVARCHAR(MAX) and VARBINARY(MAX) in SQL Server 2005. These data types can hold the same amount of data BLOBs can hold (2 GB) and they are stored in the same type of data pages used for other data types. When data in a MAX data type exceeds 8 KB, an over-flow page is used (In the ROW_OVERFLOW allocation unit) and a pointer to the page is left in the original data page in the IN_ROW allocation unit.
Data types Other data types UNIQUEIDENTIFIER : also referred to as globally unique identifier (GUID) or universal unique identifier (UUID), is a 128-bit generated value that guarantees uniqueness worldwide, even among unconnected computers XML : A new data type that has been added to SQL Server 2005 to store/retrieve XML data. The XML data type can be typed (Must conform to an XML Schema) or, untyped. XML data type will allow you to store complete XML documents or fragments of XML documents. The data type has methods (That accepts Xquery) to store/retrieve/udpate XML data. Can store upto 2 GB of data. SQL_VARIANT : Values stored in a SQL_VARIANT column can be any data type except TEXT or IMAGE. Should be avoided for several reasons. TIMESTAMP : Data type used to store automatically generated unique incrementing number. It has nothing to do with preserving a date or a time. This is useful to determine whether any row has changed since the last read - if there's any change in the row - timestamp gets updated automatically.
Re-visit the basics. Query processing order : According to ANSI standard the query should be written in the following way: SELECT [...] FROM [T1] JOIN [T2] ON [condition] WHERE [...] GROUP BY [...] HAVING [...] ORDER BY [...] However the query will be processed by MSSQL in the following order like so: FROM [T1] ON [condition] JOIN [T2] WHERE [...] GROUP BY [...] HAVING [...] SELECT [...] ORDER BY [...] Notes : ORDER BY is the last to be processed, only TOP clause would be processed after ORDER BY. DISTINCT will be processed after SELECT and before ORDER BY
Re-visit the basics Obtaining record count count(*) is used to get the number of total records in the table. But, use count(primary_key_field) for faster result. To count the number of values (not null) in a column, use count(field_name). Limiting number of records in query :Two ways to do that SET ROWCOUNT: - takes a variable as well as a constant,- affects ALL future queries, until another SET ROWCOUNT is executed,TOP: - takes only a constant,- affects only the current query,
Re-visit the basics Aliases Aliases are used to improve the readability of a query and its output Column Aliases : --SELECT dname 'Dependent Name', age 'Dependent Age', FROM Dependent --SELECT dname AS [Dependent Name], age AS [Dependent Age], FROM Dependent --SELECT dname AS "Dependent Name", age AS "Dependent Age", FROM Dependent --SELECT [Dependent Name] = dname, [Dependent Age] = age, FROM Dependent Table Aliases: --SELECT * FROM Dependent d WHERE d.age > 5 --SELECT v.column1, v.column2, ... FROM (SELECT column1, column2, ... FROM Table WHERE ...) v
Re-visit the basics Synonyms SQL Server 2005 allows to create synonyms for tables. Synonyms are usually shorter names that can be used in place of the table name. If a change is made in the original table or its data, this change will be reflected when the synonym is used. And, if a change is made in the data of the table using a synonym, this change will be reflected in the original table Synonyms are more permanent; they are available for use until they are deleted Example CREATE SYNONYM s1 FOR Student SELECT * FROM s1 DROP SYNONYM s1
Joining SQL joins are used to query data from two or more tables, based on a relationship between certain columns in these tables. In SQL Server 2005, the join is accomplished using the ANSI JOIN SQL syntax The general form of the ANSI JOIN SQL syntax is: SELECT columns FROM table1 JOIN table2 ON table1.column1=table2.column1 Different SQL JOINs INNER JOIN: Return rows when there is match between two tables OUTER JOIN: Does not require each record in the two joined tables to have a matching record. The joined table retains each record—even if no other matching record exists SELF JOIN: A self-join is joining a table to itself
Joining INNER joins : Can be classified into followings Equi-join: Uses only equality(=) comparisons in the join-predicate Examle: SELECT Employee.lastName,, Department.DepartmentName FROM Employee INNER JOIN Department ON Employee.DepartmentID = Department.DepartmentID; Natural join:ImplecitEqui-join. Uses a common column across two tables to join. The common column appears once only in the result set (Not supported in all databases) SELECT * FROM employee NATURAL JOIN department Cross join: A cross join, cartesian join or product provides the foundation upon which all types of inner joins operate. A cross join returns the cartesian product of the sets of records from the two joined tables SELECT * FROM employee CROSS JOIN department, or, SELECT * FROM employee, department; (Implicit cross join)
Joining OUTER joins : Can be classified into followings Left outer join:Result of a left outer join (or left join) for tables A and B always contains all records of the "left" table (A), even if the join-condition does not find any matching record in the "right" table (B). Example:SELECT * FROM employee LEFT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID Right outer join: join:Result of a right outer join (or right join) for tables A and B always contains all records of the “right" table (B), even if the join-condition does not find any matching record in the “left" table (A). Example:SELECT * FROM employee RIGHT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID Full outer join: A full outer join combines the results of both left and right outer joins Example:SELECT * FROM employee FULL OUTER JOIN department ON employee.DepartmentID = department.DepartmentID Note : --In case of outer join queries, the missing matching fields are replaced as NULL values in the result set --Some Databases do not support Full outer join (Say, MySql). In these cases, Full outer joins could be obtained by using UNION of Left outer join and Right outer join
Joining Self Join: In a regular join, each row in a table is compared against other rows in another table on a condition. But, if each row in a table is to be compared in other rows in the same table, then, self join is required. In a typical self join query, following two join conditions are used: Table1.Field =Table2.Field (To select duplicate/common values) Table1.PK <> Table2.PK (To filter the original row) Example: Select Employees and their corresponding Managers select e.Name as [Employee Name], m.Name as [Manager Name] from Employee e left outer join Employee m on e.MgrID = m.ID (Left join is required to list all employees including those who don’t have managers) Example: List products that are supplied by different vendors (Duplicate entries on a condition). SELECT DISTINCT pv1.ProductID, pv1.VendorID FROM Purchasing.ProductVendor pv1 INNER JOIN Purchasing.ProductVendor pv2 ON pv1.ProductID = pv2.ProductID AND pv1.VendorID <> pv2.VendorID ORDER BY pv1.ProductID
SubQueries select au_lname, au_fname from authors where city in (select city from publishers) There are two basic types of subqueries: Expression subqueries: Used with a comparison operator, must return a single value, and can be used almost anywhere an expression is allowed in SQL. Quantified predicate subqueries: Operate on lists introduced with in or not in, or, exists or not exists or, with a comparison operator modified by any or all. These subqueries return zero or more values. Subqueries of either type are either noncorrelated or correlated (repeating). Noncorrelatedsubquery: Can be evaluated as if it were an independent query. Conceptually, the results of the subquery are substituted in the main statement, or outer query Correlated subquery: Cannot be evaluated as an independent query, but can reference columns in a table listed in the from list of the outer query.
SubQueries select au_lname, au_fname from authors where city in (select city from publishers) Expression SubQueries : Two types are there SubQuery in a select list (Correlated Subquery) Example : SELECT P.Name, P.ListPrice, (SELECT M.Name FROM Model M where M.ProductModelID = P.ProductID) Model FROM Product P Subqueries in a where or having clause connected by a comparison operator (Non-Correlated Subquery) Examples: SELECT customerid FROM orders WHERE empID= (SELECT empID from employees WHERE lastname = ‘Chang’); SELECT title FROM titles WHERE price > (SELECT min(price) FROM titles)
SubQueries select au_lname, au_fname from authors where city in (select city from publishers) Quantified predicate subqueries : 3 types are there any/all subqueries : > all means greater than every value, or greater than the maximum value. For example, > all (1, 2, 3) means greater than 3. > any means greater than at least one value, or greater than the minimum value. Therefore, > any (1, 2, 3) means greater than 1. Example : select title from titles where price > all (select price from titles where type = "mod_cook") select title from titles where price > any(select price from titles where type = "mod_cook") In/not in subqueries : Example : Select lname, fname from authors where city in (select city from publishers) Select lname, fname from authors where city not in (select city from publishers)
SubQueries select au_lname, au_fname from authors where city in (select city from publishers) Quantified predicate subqueries : Continued Exists/not Exists subqueries : The subquery Exists/NotExists evaluates to TRUE or FALSE rather than returning any data. With EXISTS/NOT EXISTS, The select list of the subquery usually consists of the asterisk (*) Example : Select lname, fname from authors where exists (select * from publishers where authors.city = publishers.city) Select lname, fname from authors where not exists (select * from publishers where authors.city = publishers.city) If the inner queries in any of these types of sub queries contains any column use of the outer query, then, it becomes a non-correlated sub query
SubQueries select au_lname, au_fname from authors where city in (select city from publishers) Limitations The subquery_select_list can consist of only one column name, except in the exists subquery, where an (*) is usually used Subqueries can be nested inside the WHERE or HAVING clause of an outer SELECT, INSERT, UPDATE, or DELETE statement, inside another subquery, or in a select list. Examples : UPDATE titles SET price = price * 2 WHERE pub_id in (SELECT pub_id FROM publishers WHERE pub_name = "New Age Books") DELETE salesdetail WHERE title_id in (SELECT title_id FROM titles WHERE type = "business") IF EXISTS (SELECT title_id FROM titles WHERE type = "business") BEGIN DELETE salesdetail WHERE title_id in (SELECT title_id FROM titles WHERE type = "business") END In Transact-SQL, a subquery can appear almost anywhere an expression can be used, if it returns a single value You cannot use subqueries in an order by, group by text and image datatypes are not allowed in subqueries
Joins Versus Subqueries Subqueriesmay often be used as alternatives to joins, but, not always SubQueries resembles to “Programming Logics” (Hence, seem more logical), where, Joining resembles to “Set operations” more. Joining queries usually perform less efficient than the SubQueries when number of involved tables grows, and, on some particular conditions. (See next slide) Both SubQueries and Join queries have their limitations. Based upon situation, one might perform better than the other.
Joins Versus Subqueries A query written in both way Suppose we have two tables “Student” and “Grade”. List the students who have obtained “A” or “B” grades. The SubQuery (Programming logic)way Logic : 1. Select student ID’s who has “A” or “B” grades 2. Take those students who has a ID in the student ID list SQL: SELECT StudentName FROM Student WHERE ID IN(SELECT StID from Grade where StdGrade=“A” OR StdGrade=“B”) The Join way: (Wrong one!) SELECT StudentName from Student INNER JOIN Grade ON Student.ID = Grade.StdID AND StdGrade=“A” OR StdGrade=“B” Though the above query looks OK, this query produces more results comparing to the subquery version. Why? The Join query version shows duplicate student names who have both “A” and “B” grades in the Grade table (Because, the results are combined by “OR”). So, the DINTINCT has to be applied after the SELECT in the above query. But, to show DINTINCT results, the result set has to be sorted internally and the DINTINCT records has to be calculated. So, the SubQuery version might be a little efficient in this case. Why this problem does not occur in the SubQuery? (Think yourself)
Joins Versus Subqueries When the Join Cannot Be Turned into a Subquery When a column from a table needs to be in the result set, that table has to be in the outer query. If two tables are being used, and if columns from both tables have to be in the result set, a join is necessary. This type of join cannot be turned into a subquery If the result set does not need the columns from more than one table, then the join can be turned into a subquery. In this case, the inner query only acts as a “List of valid values” to filter against a value in the outer query In the previous example,if the Student Name and Grade had been asked to show, then, the query had to be written using a Join.
Group By If an aggregate function (Say, count(*)) is used in a query, it is applied to all (Filtered if there is any “where condition”) rows in the table. The GROUP BY statement is used in conjunction with the aggregate functions to group the result-set by one or more columns. SQL GROUP BY Syntax SELECT column_name, aggregate_function(column_name) FROM table_name WHERE column_name operator value GROUP BY column_name Example: SELECT Customer,SUM(OrderPrice) FROM Orders GROUP BY Customer
Group By GROUP BY More Than One Column We can also use the GROUP BY statement on more than one column, like this: SELECT Customer,OrderDate,SUM(OrderPrice) FROM Orders GROUP BY Customer,OrderDate Note: If any aggregate function is used in any select query, except the aggregate functions, All the fields in the select list must be presented in a group by list But, the reverse is not true. That means, a field in the group by list does not necessarily have to be presented in the select list. That means, the select list can only contain aggregate function. The HAVING clause has to be used be used to compare the aggregate functions calculated value with any value. SQL HAVING example SELECT Customer,SUM(OrderPrice) FROM Orders WHERE Customer='Hansen' OR Customer='Jensen' GROUP BY Customer HAVING SUM(OrderPrice)>1500
Group By Using ORDER BY The result set of a group by query can be ordered by any column from the result set using the ORDER BY clause Note: If alias is used for the aggregate function, that alias could be used in the ORDER BY clause in the GROUP BY query Group by query processing sequence The where clause excludes rows that do not meet its search conditions. The group by clause collects the remaining rows into one group for each unique value in the group by expression. Aggregate functions specified in the select list calculate summary values for each group. The having clause excludes rows from the final results that do not meet its search conditions.
If any column value is null and if that column is used in an aggregation function, then, the corresponding row is not considered in the calcuation. Workaround: Use aggregate_func(ISNULL(nullable_column, 0)) SQL Server 2005 will not allow you to write nested aggregation function with a GROUP BY query. For example, following query is invalid SELECT MIN(COUNT(stno)) FROM Student GROUP BY class The above query has to be written using a SubQuery SELECT MIN(Scount.count) FROM ( SELECT COUNT(stno) as count FROM Student GROUP BY class) ) as Scount
Set Operations In relation database concepts, collection of rows are considered as “Sets" Set operations are used in SQL to retrieve data from multiple sets, and apply union, intersection and set difference operations to obtain the result Explicit set operations are used in SQL: UNION, INTERSECT, and MINUS (for set difference) In SQL server, the MINUS set operation is achieved by using the NOT IN operator. The format of a set statement is as follows: set OPERATOR set where OPERATOR is a UNION, INTERSECT or MINUS, and where "set" is defined by a SELECT.
Set Operations Union Compatibility When using set operations, the two sets (the results of two SELECTs) have to have the same number of similar columns and the columns have to have compatible data types For union compatibility, the three basic data types are numeric, string, and dates All numeric columns are compatible with one another, all string columns are compatible with one another, and all date columns are compatible with one another The two sets being unioned must have the same number of columns in the result sets of the SELECT clauses.
Set Operations UNION ALL UNION ALL works exactly like UNION, but does not expunge duplicates If we want to union two result sets that do not have the same number of columns, we have to use NULL (or other) values in the column-places as place holders
Case Statements CASE expressions can be used in SQL anywhere an expression can be used (Inside SELECT,INSERT,UPDATE,DELETE statements and functions). A CASE expression returns a particular value. Two types of CASE statements are there Simple CASE function: Checks one expression against multiple values. Allows only an equality check. SELECT CASE @TestVal WHEN 1 THEN 'First‘ WHEN 2 THEN 'Second' WHEN 3 THEN 'Third' ELSE 'Other' END Searched CASE function: Doesn’t have expression after the CASE statement. Allows evaluating complex expression in the WHEN clauses.
Case Statements CASE statement in Select Query SELECT 'Price Category' = CASE WHEN price IS NULL THEN 'Not yet priced' WHEN price < 10 THEN 'Very Reasonable Title' ELSE 'Expensive book!' END, Title FROM Titles CASE Statement in a functionSELECT title_id, SUM(CASE WHEN month(ord_date)between 1 and 3 THEN qty ELSE 0 END) AS Q1, SUM(CASE WHEN month(ord_date) between 4 and 6 THEN qty ELSE 0 END) AS Q2
Case Statements Searched CASE Statement example SELECT CASE WHEN @TestVal <=3 THEN 'Top 3' ELSE 'Other' END CASE statement in Select Query SELECT 'Price Category' = CASE WHEN price IS NULL THEN 'Not yet priced' WHEN price < 10 THEN 'Very Reasonable Title' ELSE 'Expensive book!' END, Title FROM Titles
Case Statements CASE Statement in a functionSELECT title_id, SUM(CASE WHEN month(ord_date)between 1 and 3 THEN qty ELSE 0 END) AS Q1, SUM(CASE WHEN month(ord_date) between 4 and 6 THEN qty ELSE 0 END) AS Q2 CASE Statement in an UPDATE statement UPDATE dbo.Customer SET stateDescription = CASE WHEN statecode = 'MA' THEN 'Massachusetts' WHEN statecode = 'VA' THEN 'Virginia' WHEN statecode = 'PA' THEN 'Pennsylvania' ELSE NULL END