2. Topics
• What is windowing?
• Window aggregate functions
• Set-Based vs. Iterative Programming
• Use of window functions
• String concatenation
• Ranking functions
• Common Table Expressions (CTE)
• Optimising ranking functions
• Creating a sequence
• Removing duplicate entries
• Pivoting
• Running totals
• What’s new with 2012?
• Benchmarks
3. What is windowing?
• A window function is a function applied to a set of rows. A window
is the term standard SQL uses to describe the context for the
function to operate in. SQL uses a clause called OVER in which you
provide the window specification.
• Window functions are functions applied to sets of rows defined by
a clause called OVER. They are used mainly for analytical purposes
allowing you to calculate running totals, calculate moving averages,
identify gaps and islands in your data, and perform many other
computations. These functions are based on an amazingly profound
concept in standard SQL (which is both an ISO and ANSI standard)—
the concept of windowing. The idea behind this concept is to allow
you to apply various calculations to a set, or window, of rows and
return a single value. Window functions can help to solve a wide
variety of querying tasks by helping you express set calculations
more easily, intuitively, and efficiently than ever before.
4. Window aggregate functions
• COUNT()
• SUM()
• AVG()
• MAX()
• MIN()
-- sum_product is the detail row's own value; sum_all is the total for the whole order.
-- Each detail row is already unique, so no GROUP BY is needed.
SELECT SalesOrderID, SalesOrderDetailID,
LineTotal AS sum_product,
SUM(LineTotal) OVER(PARTITION BY SalesOrderID) AS sum_all
FROM Sales.SalesOrderDetail;
5. Set-Based vs. Iterative Programming
To get to the bottom of this, one first needs to
understand the foundations of T-SQL and what the
set-based approach truly is. When you do, you realize that
the set-based approach is nonintuitive for most people,
whereas the iterative approach is intuitive. It’s just the way our
brains are programmed, and I will try to clarify this
shortly. The gap between iterative and set-based thinking
is quite big. It can be closed, though doing so certainly
isn’t easy. This is where window functions
can play an important role: I find them to be a great tool
that can help bridge the gap between the two approaches
and allow a more gradual transition to set-based thinking.
6. Use of window functions
• Paging
• De-duplicating data
• Returning top n rows per group
• Computing running totals
• Performing operations on intervals such as packing intervals, and
calculating the maximum number of concurrent sessions
• Identifying gaps and islands
• Computing percentiles
• Computing the mode of the distribution
• Sorting hierarchies
• Pivoting
• Computing recency (newness)
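One item from the list above, returning the top n rows per group, can be sketched as follows. This is a minimal illustration rather than a definitive pattern: the inline Orders data, its column names, and the choice of n = 2 are all assumptions made for the example.

```sql
-- Hypothetical sample data; table and column names are illustrative only.
WITH Orders AS
(
  SELECT custid, orderid, val
  FROM (VALUES (1, 101, 50.00),
               (1, 102, 75.00),
               (1, 103, 20.00),
               (2, 201, 10.00),
               (2, 202, 90.00)) AS D(custid, orderid, val)
),
Ranked AS
(
  -- Number each customer's orders from the highest value down;
  -- orderid breaks ties so the numbering is deterministic.
  SELECT custid, orderid, val,
    ROW_NUMBER() OVER(PARTITION BY custid
                      ORDER BY val DESC, orderid) AS rownum
  FROM Orders
)
SELECT custid, orderid, val
FROM Ranked
WHERE rownum <= 2   -- keep the top 2 orders per customer (n = 2 is an assumption)
ORDER BY custid, rownum;
```

The same shape covers paging: filter on a range of row numbers instead of an upper bound.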
7. String concatenation
• Among the many ways to achieve ordered string
concatenation in SQL Server, one of the more
efficient techniques is based on XML manipulation,
using the FOR XML option with the PATH mode.
-- ORDER BY makes the concatenation order explicit;
-- STUFF removes the leading comma without truncating the result.
SELECT STUFF((SELECT ',' + Name AS [text()]
FROM Production.ProductCategory
ORDER BY Name
FOR XML PATH('')),
1, 1, '') AS Names;
8. Ranking functions
• ROW_NUMBER()
• RANK()
• DENSE_RANK()
• NTILE()
SELECT StateProvinceID, StateProvinceCode, CountryRegionCode,
RANK() OVER(ORDER BY CountryRegionCode) AS rnk_all,
RANK() OVER(PARTITION BY CountryRegionCode ORDER BY StateProvinceCode) AS rnk_cust
FROM Person.StateProvince
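The four ranking functions differ mainly in how they treat ties. The following sketch runs them side by side over a small inline set of scores; the names and scores are hypothetical and exist only to produce one tie.

```sql
-- Hypothetical scores with one tie (B and C), to show how the functions differ.
SELECT name, score,
  ROW_NUMBER() OVER(ORDER BY score DESC, name) AS rownum,    -- 1, 2, 3, 4
  RANK()       OVER(ORDER BY score DESC)       AS rnk,       -- 1, 2, 2, 4
  DENSE_RANK() OVER(ORDER BY score DESC)       AS dense_rnk, -- 1, 2, 2, 3
  NTILE(2)     OVER(ORDER BY score DESC, name) AS tile       -- 1, 1, 2, 2
FROM (VALUES ('A', 90), ('B', 80), ('C', 80), ('D', 70)) AS D(name, score)
ORDER BY score DESC, name;
```

ROW_NUMBER and NTILE are given name as a tiebreaker here so that their results are deterministic; RANK and DENSE_RANK give tied rows the same value by design.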
9. Common Table Expressions (CTE)
• Set-based programming
• Can be self-referencing (recursive)
• Can replace the temp table technique
WITH tblA AS
(
SELECT PC.*,
COUNT(PSC.ProductSubcategoryID)
OVER(PARTITION BY PC.ProductCategoryID) AS SubCategoryCount
FROM Production.ProductCategory PC
JOIN Production.ProductSubcategory PSC
ON PC.ProductCategoryID = PSC.ProductCategoryID
)
SELECT DISTINCT * FROM tblA
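A CTE that refers to itself is a recursive CTE. As a minimal sketch, the following query generates the integers 1 through 10; the name Nums and the upper bound are chosen only for illustration.

```sql
-- Recursive CTE: the anchor member returns 1, and the recursive member
-- references the CTE itself to produce the next value until n reaches 10.
WITH Nums AS
(
  SELECT 1 AS n            -- anchor member
  UNION ALL
  SELECT n + 1             -- recursive member (self-reference)
  FROM Nums
  WHERE n < 10
)
SELECT n
FROM Nums;                 -- returns the integers 1 through 10
```

SQL Server stops a recursive CTE after 100 levels by default, so longer sequences need a MAXRECURSION query hint or a different technique.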
10. Optimising ranking functions
• Forward scan
SELECT actid, tranid, val,
ROW_NUMBER() OVER(PARTITION BY actid ORDER BY val) AS rownum
FROM dbo.Transactions;
11. • Create an index on the partitioning and ordering columns that also covers the query (a POC index: Partitioning, Ordering, Covering)
CREATE INDEX idx_actid_val_i_tranid
ON dbo.Transactions(actid /* P */, val /* O */)
INCLUDE(tranid /* C */);
SELECT actid, tranid, val,
ROW_NUMBER() OVER(PARTITION BY actid ORDER BY val) AS rownum
FROM dbo.Transactions;
With the index in place, the execution plan performs an
ordered scan of the index to satisfy the ordering
requirement, so no Sort iterator is needed. With larger sets,
the performance difference can be greater.
12. • Backward scan
SELECT actid, tranid, val,
ROW_NUMBER() OVER(PARTITION BY actid ORDER BY val DESC) AS rownum
FROM dbo.Transactions;
13. SELECT actid, tranid, val,
ROW_NUMBER() OVER(PARTITION BY actid ORDER BY val DESC) AS rownum
FROM dbo.Transactions
ORDER BY actid DESC;
What’s even more curious is what happens if you add a presentation ORDER BY
clause that requests to order by the partitioning column in descending order.
Suddenly, the iterators that compute the window function are willing to
consume the partitioning values in descending order and can rely on index
ordering for this. So simply adding a presentation ORDER BY clause with actid
DESC to our query removes the need for a Sort iterator.
14. • For ranking functions the window order clause is
mandatory, and SQL Server doesn’t allow the ordering
to be based on a constant (e.g. ORDER BY NULL).
Surprisingly, when you pass an expression based on a
subquery that returns a constant, SQL Server accepts it.
The optimizer un-nests, or expands, the expression and
realizes that the ordering is the same for all rows, so it
removes the ordering requirement from the input data.
SELECT actid, tranid, val,
ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM dbo.Transactions;
15. Creating a sequence
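CREATE FUNCTION dbo.GetSequence (@low AS BIGINT, @high AS BIGINT)
RETURNS TABLE AS
RETURN
WITH
-- Each level cross-joins the previous one with itself, squaring the row count:
-- L0 = 2 rows, L1 = 4, L2 = 16, L3 = 256, L4 = 65,536, L5 = 4,294,967,296.
L0 AS (SELECT c FROM (VALUES(1),(1)) AS D(c)),
L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
-- Number the rows; ORDER BY (SELECT NULL) avoids imposing a sort.
Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM L5)
-- Shift the numbering to start at @low and return only as many rows as requested.
SELECT @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum
OFFSET 0 ROWS FETCH FIRST @high - @low + 1 ROWS ONLY;
GO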
SELECT n FROM dbo.GetSequence(11, 20);
16. Not only numbers
DECLARE @start AS DATE = '20120201',
@end AS DATE = '20120212';
SELECT DATEADD(day, n, @start) AS dt
FROM dbo.GetSequence(0, DATEDIFF(day, @start, @end)) AS Nums;
dt
---------------
2012-02-01
2012-02-02
2012-02-03
…
2012-02-10
2012-02-11
2012-02-12
17. Removing duplicate entries
WITH C AS
(
SELECT orderid,
ROW_NUMBER() OVER(PARTITION BY orderid
ORDER BY (SELECT NULL)) AS n
FROM Sales.MyOrders
)
DELETE FROM C
WHERE n > 1;
18. Pivoting
• Pivoting is a technique to aggregate and rotate
data from a state of rows to columns.
• When pivoting data, you need to identify
three elements: the element you want to see
on rows (the grouping element), the element
you want to see on columns (the spreading
element), and the element you want to see in
the data portion (the aggregation element).
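The three elements can be made concrete with a grouped query that uses CASE expressions, which is a standard alternative to the PIVOT operator. This is only a sketch: the inline MonthlySales data and its column names are hypothetical, and only two spreading values are shown.

```sql
-- Hypothetical sample data; names are illustrative only.
WITH MonthlySales AS
(
  SELECT orderyear, ordermonth, val
  FROM (VALUES (2011, 1, 100.00),
               (2011, 1,  50.00),
               (2011, 2,  30.00),
               (2012, 1,  70.00)) AS D(orderyear, ordermonth, val)
)
SELECT orderyear,                                        -- grouping element (rows)
  SUM(CASE WHEN ordermonth = 1 THEN val END) AS [1],     -- spreading value 1 (column)
  SUM(CASE WHEN ordermonth = 2 THEN val END) AS [2]      -- spreading value 2 (column)
FROM MonthlySales
GROUP BY orderyear
ORDER BY orderyear;
```

GROUP BY supplies the grouping element, each CASE predicate selects one spreading value, and SUM is the aggregation element. A year with no rows for a given month yields NULL in that column.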
19. WITH C AS
(
SELECT YEAR(ModifiedDate) AS OrderYear,
MONTH(ModifiedDate) AS OrderMonth,
LineTotal as Amount
FROM Sales.SalesOrderDetail
)
SELECT *
FROM C
PIVOT(SUM(Amount)
FOR OrderMonth IN ([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])) AS P;
• Suppose you need to query the Sales.OrderValues view and
return a row for each order year, a column for each order
month, and the sum of order values for each year and
month intersection. In this request, the on rows, or
grouping element is YEAR(orderdate); the on cols, or
spreading element is MONTH(orderdate); the distinct
spreading values are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12;
and the data, or aggregation element is SUM(val).
21. WITH C AS
(
SELECT YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, val
FROM Sales.OrderValues
)
SELECT orderyear, (CAST([1] AS VARCHAR(20)) + ', ' + CAST([2] AS VARCHAR(20))) AS Totals
FROM C
PIVOT(SUM(val)
FOR ordermonth IN ([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])) AS P;
orderyear Totals
----------------------------------------
2000 11011.75, 10952.62
2001 10926.32, 10759.00
2002 10856.55, 10682.69
2003 11016.10, 10953.25
2004 10924.99, 10875.14
2005 11058.40, 10956.20
2006 10826.33, 10679.79
...
22. Running totals
• Prior to SQL Server 2012, the set-based
solutions used to calculate running totals were
extremely expensive. Therefore, people often
resorted to iterative solutions that weren’t
very fast but in certain data distribution
scenarios were faster than the set-based
solutions.
23. • Set-based solution (the self-join query on this slide)
• Cursor-based solution
• CLR-based solution
SELECT T1.ProductID, T1.TransactionID, T1.ActualCost,
SUM(T2.ActualCost) AS balance
FROM Production.TransactionHistory AS T1
JOIN Production.TransactionHistory AS T2
ON T2.ProductID = T1.ProductID
AND T2.TransactionID <= T1.TransactionID
GROUP BY T1.ProductID, T1.TransactionID, T1.ActualCost;
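From SQL Server 2012 onward, a window aggregate can carry its own ORDER BY and frame, so the same running total can be sketched without the self-join. The query below is an illustration against the same table, not a benchmarked replacement; it assumes TransactionID is unique, so that the running order within each product is deterministic.

```sql
-- Requires SQL Server 2012 or later (window ordering and framing for aggregates).
SELECT ProductID, TransactionID, ActualCost,
  SUM(ActualCost) OVER(PARTITION BY ProductID
                       ORDER BY TransactionID
                       ROWS BETWEEN UNBOUNDED PRECEDING
                                AND CURRENT ROW) AS balance
FROM Production.TransactionHistory;
```

The frame is spelled out with ROWS deliberately: when ORDER BY is given without a frame, the default is RANGE UNBOUNDED PRECEDING, which treats tied ordering values as one group and is generally handled less efficiently.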
24. What’s new with 2012?
• Distribution functions are PERCENT_RANK,
CUME_DIST, PERCENTILE_CONT, and
PERCENTILE_DISC.
• Offset functions are LAG, LEAD, FIRST_VALUE,
LAST_VALUE, and NTH_VALUE. (There’s no support
for the NTH_VALUE function yet in SQL Server as of SQL
Server 2012.)
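As a minimal sketch of the supported offset functions and two of the distribution functions, the following query runs them over a small inline set of transactions. The data and the column names (actid, tranid, val) are hypothetical, and the query requires SQL Server 2012 or later.

```sql
-- Hypothetical sample data; names are illustrative only.
SELECT actid, tranid, val,
  LAG(val)  OVER(PARTITION BY actid ORDER BY tranid) AS prev_val,  -- NULL on the first row
  LEAD(val) OVER(PARTITION BY actid ORDER BY tranid) AS next_val,  -- NULL on the last row
  FIRST_VALUE(val) OVER(PARTITION BY actid ORDER BY tranid
                        ROWS BETWEEN UNBOUNDED PRECEDING
                                 AND CURRENT ROW) AS first_val,
  LAST_VALUE(val)  OVER(PARTITION BY actid ORDER BY tranid
                        ROWS BETWEEN CURRENT ROW
                                 AND UNBOUNDED FOLLOWING) AS last_val,
  PERCENT_RANK() OVER(PARTITION BY actid ORDER BY val) AS pct_rank,
  CUME_DIST()    OVER(PARTITION BY actid ORDER BY val) AS cum_dist
FROM (VALUES (1, 1, 10.00), (1, 2, 25.00), (1, 3, 5.00),
             (2, 1, 40.00), (2, 2, 15.00)) AS T(actid, tranid, val)
ORDER BY actid, tranid;
```

LAST_VALUE is given an explicit frame that extends to UNBOUNDED FOLLOWING. With the default frame, which ends at the current row, it would simply return the current row's value.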