A session I presented at BGOUG in June 2016
Even though DBAs and developers write SQL queries every day, advanced SQL techniques such as multidimensional aggregation and analytic functions remain relatively unknown. In this session, we will explore some common real-world uses for analytic functions and see how to take advantage of this useful tool. We will dive deep into ranking based on values and groups, understand aggregation over multiple dimensions without a GROUP BY, see how to do inter-row calculations, and much more.
Together we will see how we can unleash the power of analytics using Oracle 11g best practices and Oracle 12c new features.
2. Who am I?
• Zohar Elkayam, CTO at Brillix
• DBA, team leader, database trainer, public speaker, and a
senior consultant for over 18 years
• Oracle ACE Associate
• Involved with Big Data projects since 2011
• Blogger – www.realdbamagic.com and www.ilDBA.co.il
3. About Brillix
• Brillix is a leading company that specializes in Data
Management
• We provide consulting, training, and professional services for
various Databases, Security, NoSQL, and Big Data solutions
• Providing the Brillix Big Data Experience Center
4. Agenda: Advanced SQL
• “Basic” aggregation: Rollup, Cube, and Grouping Sets
• Analytic functions
• Reporting Functions
• Ranking Functions
• Inter-row Functions
• Using the Window clause
• Oracle 12c new features overview
• Top-N queries
• Pattern matching
6. Basics
• Group functions will return a single row for each group of rows
• We can select other columns alongside group functions only if
we group by those columns using the GROUP BY clause
• Common group functions: SUM, MIN, MAX, AVG, etc.
• We can filter out rows after aggregation by using the HAVING
clause
7. GROUP BY With the ROLLUP and CUBE Operators
• Use ROLLUP or CUBE with GROUP BY to produce superaggregate
rows by cross-referencing columns
• ROLLUP grouping produces a result set containing the regular
grouped rows and the subtotal and grand total values
• CUBE grouping produces a result set containing the rows from
ROLLUP and cross-tabulation rows
8. Using the ROLLUP Operator
• ROLLUP is an extension of the GROUP BY clause
• Use the ROLLUP operation to produce cumulative aggregates,
such as subtotals
SELECT [column,] group_function(column)...
FROM table
[WHERE condition]
[GROUP BY [ROLLUP] group_by_expression]
[HAVING having_expression]
[ORDER BY column];
9. Using the ROLLUP Operator: Example
SELECT department_id, job_id, SUM(salary)
FROM hr.employees
WHERE department_id < 60
GROUP BY ROLLUP(department_id, job_id);
Result set callouts:
(1) Total by DEPARTMENT_ID and JOB_ID
(2) Total by DEPARTMENT_ID
(3) Grand total
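To make the three levels of rows concrete, here is a minimal sketch that runs in Python's built-in sqlite3 module. SQLite has no ROLLUP operator, so the sketch writes out each grouping level with UNION ALL; in Oracle, GROUP BY ROLLUP(department_id, job_id) should return the same set of rows from a single statement. The table contents are made up for illustration and are not the real hr.employees data.

```python
import sqlite3

# Hypothetical miniature of hr.employees; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department_id INT, job_id TEXT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(10, "AD_ASST", 4400), (20, "MK_MAN", 13000),
     (20, "MK_REP", 6000), (20, "MK_REP", 5000)],
)

# SQLite has no ROLLUP, so each grouping level is written out explicitly.
rollup_rows = conn.execute("""
    SELECT department_id, job_id, SUM(salary)   -- (1) by department and job
      FROM employees GROUP BY department_id, job_id
    UNION ALL
    SELECT department_id, NULL, SUM(salary)     -- (2) by department
      FROM employees GROUP BY department_id
    UNION ALL
    SELECT NULL, NULL, SUM(salary)              -- (3) grand total
      FROM employees
""").fetchall()

for row in rollup_rows:
    print(row)
```

The NULLs mark the columns that were rolled up, just as they do in the Oracle result set.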
10. Using the CUBE Operator
• CUBE is an extension of the GROUP BY clause
• You can use the CUBE operator to produce cross-tabulation
values with a single SELECT statement
SELECT [column,] group_function(column)...
FROM table
[WHERE condition]
[GROUP BY [CUBE] group_by_expression]
[HAVING having_expression]
[ORDER BY column];
11. Using the CUBE Operator: Example
SELECT department_id, job_id, SUM(salary)
FROM hr.employees
WHERE department_id < 60
GROUP BY CUBE (department_id, job_id);
Result set callouts:
(1) Grand total
(2) Total by JOB_ID
(3) Total by DEPARTMENT_ID and JOB_ID
(4) Total by DEPARTMENT_ID
12. GROUPING SETS
• The GROUPING SETS syntax is used to define multiple
groupings in the same query
• All groupings specified in the GROUPING SETS clause are
computed and the results of individual groupings are combined
with a UNION ALL operation
• Grouping set efficiency:
• Only one pass over the base table is required
• There is no need to write complex UNION statements
• The more elements GROUPING SETS has, the greater the performance
benefit
13. GROUPING SETS: Example
SELECT department_id, job_id,
manager_id, AVG(salary)
FROM hr.employees
GROUP BY GROUPING SETS
((department_id,job_id), (job_id,manager_id));
14. Composite Columns
• A composite column is a collection of columns that are treated
as a unit.
ROLLUP (a,(b,c), d)
• Use parentheses within the GROUP BY clause to group columns,
so that they are treated as a unit while computing ROLLUP or
CUBE operators.
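• When used with ROLLUP or CUBE, composite columns cause
aggregation to be skipped across certain levels. For example,
ROLLUP (a,(b,c), d) computes the groupings (a,b,c,d), (a,b,c),
(a), and the grand total, but never (a,b).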
15. Composite Columns: Example
SELECT department_id, job_id, manager_id,
SUM(salary)
FROM hr.employees
GROUP BY ROLLUP( department_id,(job_id, manager_id));
17. Overview of SQL for Analysis and Reporting
• Oracle has enhanced SQL's analytical processing capabilities
by introducing a family of analytic SQL functions
• These analytic functions enable you to calculate and perform:
• Reporting operations
• Rankings and percentiles
• Moving window calculations
• Inter-row calculations (LAG/LEAD, FIRST/LAST etc.)
• Pivoting operations (11g)
• Pattern matching (12c)
• Linear regression and predictions
18. Why Use Analytic Functions?
• Ability to see one row from another row in the results
• Avoid self-join queries
• Summary data in detail rows
• Slice and dice within the results
• Performance improvement, in some cases
19. Concepts Used in Analytic Functions
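• Result set partitions: The query result set is divided into groups
of rows called partitions. They are created after GROUP BY
processing, so they are available to any aggregate results such as
sums and averages. The term “partitions” is unrelated to the
table partitions feature.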
• Window: For each row in a partition, you can define a sliding
window of data, which determines the range of rows used to
perform the calculations for the current row.
• Current row: Each calculation performed with an analytic
function is based on a current row within a partition. It serves as
the reference point determining the start and end of the window.
20. Reporting Functions
• We can use aggregate functions as analytic functions (e.g.
SUM, AVG, MIN, MAX, COUNT)
• Each row gets the aggregate value for its partition without the
need for a GROUP BY clause, so we can have several different
groupings on the same row
• We get the raw data along with the aggregated value
• Use ORDER BY to get cumulative aggregations
21. Reporting Functions: Example
SELECT last_name, salary, department_id,
ROUND(AVG(salary) OVER (PARTITION BY department_id),2) A,
COUNT(*) OVER (PARTITION BY manager_id) B,
SUM(salary) OVER (PARTITION BY department_id ORDER BY salary) C,
MAX(salary) OVER () D
FROM hr.employees;
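The effect of each OVER clause is easier to see on a handful of rows. The following is a minimal sketch using Python's built-in sqlite3 module (it assumes SQLite 3.25 or later for window functions); the four employees are made up for illustration. Note how the cumulative SUM gives tied salaries the same value, because the default window with ORDER BY is a RANGE window that includes all peers of the current row.

```python
import sqlite3

# Hypothetical data; not the real hr.employees rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (last_name TEXT, department_id INT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", 10, 100), ("Bob", 10, 250), ("Cid", 10, 250), ("Dee", 20, 300)],
)

report_rows = conn.execute("""
    SELECT last_name, department_id, salary,
           AVG(salary) OVER (PARTITION BY department_id)  AS dept_avg,
           SUM(salary) OVER (PARTITION BY department_id
                             ORDER BY salary)             AS running_sum,
           MAX(salary) OVER ()                            AS overall_max
      FROM employees
     ORDER BY department_id, salary, last_name
""").fetchall()

for row in report_rows:
    print(row)
# Bob and Cid are tied on salary, so both get running_sum = 600.
```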
23. Using the Ranking Functions
• A ranking function computes the rank of a record compared to
other records in the data set based on the values of a set of
measures. The types of ranking function are:
• RANK and DENSE_RANK functions
• ROW_NUMBER function
• PERCENT_RANK function
• NTILE function
24. Working with the RANK Function
• The RANK function calculates the rank of a value in a group of values,
which is useful for top-N and bottom-N reporting.
• When using the RANK function, ascending is the default sort order,
which you can change to descending.
• Rows with equal values for the ranking criteria receive the same rank.
• Oracle Database then adds the number of tied rows to the tied rank to
calculate the next rank.
RANK ( ) OVER ( [query_partition_clause] order_by_clause )
25. Using the RANK Function: Example
SELECT department_id, last_name, salary,
RANK() OVER (PARTITION BY department_id
ORDER BY salary DESC) "Rank"
FROM employees
WHERE department_id = 60
ORDER BY department_id, "Rank", salary;
26. RANK and DENSE_RANK Functions: Example
SELECT department_id, last_name, salary,
RANK() OVER (PARTITION BY department_id
ORDER BY salary DESC) "Rank",
DENSE_RANK() OVER (PARTITION BY department_id
ORDER BY salary DESC) "Drank"
FROM employees
WHERE department_id = 60
ORDER BY department_id, last_name, salary DESC, "Rank" DESC;
DENSE_RANK ( ) OVER ([query_partition_clause] order_by_clause)
27. Working with the ROW_NUMBER Function
• The ROW_NUMBER function calculates a sequential number of
a value in a group of values.
• When using the ROW_NUMBER function, ascending is the
default sort order, which you can change to descending.
• Rows with equal values for the ranking criteria receive a
different number.
ROW_NUMBER ( ) OVER ( [query_partition_clause] order_by_clause )
28. ROW_NUMBER vs. ROWNUM
• ROWNUM is a pseudocolumn; ROW_NUMBER is an actual
function
• ROWNUM is assigned as rows are returned, before any
ORDER BY is applied
• ROWNUM therefore requires sorting the entire dataset first in
order to return an ordered list
• ROW_NUMBER will only sort the required rows, thus giving
better performance
29. Using the PERCENT_RANK Function
• Uses rank values in its numerator and returns the percent rank of a
value relative to a group of values
• PERCENT_RANK of a row is calculated as follows:
(rank of row in its partition - 1) /
(number of rows in the partition - 1)
• The range of values returned by PERCENT_RANK is 0 to 1,
inclusive. The first row in any set has a PERCENT_RANK of 0. The
return value is NUMBER. Its syntax is:
PERCENT_RANK () OVER ([query_partition_clause]
order_by_clause)
30. Using the PERCENT_RANK Function: Example
SELECT department_id, last_name, salary, PERCENT_RANK()
OVER (PARTITION BY department_id ORDER BY salary DESC)
AS pr
FROM hr.employees
ORDER BY department_id, pr, salary;
31. Working with the NTILE Function
• Not really a rank function
• Divides an ordered data set into a number of buckets indicated by
expr and assigns the appropriate bucket number to each row
• The buckets are numbered 1 through expr
NTILE ( expr ) OVER ([query_partition_clause] order_by_clause)
32. Summary of Ranking Functions
• Different ranking functions may return different results if the
data has ties
SELECT last_name, salary, department_id,
ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) A,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) B,
DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) C,
PERCENT_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) D,
NTILE(4) OVER (PARTITION BY department_id ORDER BY salary DESC) E
FROM hr.employees;
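To see how the ranking functions diverge on ties, here is a minimal sketch using Python's built-in sqlite3 module (assuming SQLite 3.25 or later). The five rows are made up, and the PARTITION BY is dropped because the sketch has only one department. One deliberate deviation from the slide: ROW_NUMBER gets last_name as a tie-breaker, since without it the order of tied rows is not deterministic.

```python
import sqlite3

# Hypothetical single-department data with one tie on salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (last_name TEXT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("A", 9000), ("B", 6000), ("C", 6000), ("D", 4800), ("E", 4200)],
)

rank_rows = conn.execute("""
    SELECT last_name, salary,
           ROW_NUMBER()   OVER (ORDER BY salary DESC, last_name) AS a,
           RANK()         OVER (ORDER BY salary DESC)            AS b,
           DENSE_RANK()   OVER (ORDER BY salary DESC)            AS c,
           PERCENT_RANK() OVER (ORDER BY salary DESC)            AS d,
           NTILE(2)       OVER (ORDER BY salary DESC)            AS e
      FROM employees
     ORDER BY salary DESC, last_name
""").fetchall()

for row in rank_rows:
    print(row)
# RANK skips 3 after the tie (1, 2, 2, 4, 5); DENSE_RANK does not (1, 2, 2, 3, 4).
# PERCENT_RANK is (rank - 1) / (rows - 1), e.g. (2 - 1) / (5 - 1) = 0.25 for B and C.
```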
34. Using the LAG and LEAD Analytic Functions
• LAG provides access to more than one row of a table at the same
time without a self-join.
• Given a series of rows returned from a query and a position of the
cursor, LAG provides access to a row at a given physical offset
before that position.
• If you do not specify the offset, its default is 1.
• If the offset goes beyond the scope of the window, the optional
default value is returned. If you do not specify the default, its value is
NULL.
{LAG | LEAD}(value_expr [, offset ] [, default ])
OVER ([ query_partition_clause ] order_by_clause)
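The offset and default arguments can be tried on a tiny table. This is a minimal sketch using Python's built-in sqlite3 module (assuming SQLite 3.25 or later); the sales table and its three rows are hypothetical. LAG is called without a default, so it yields NULL on the first row, while LEAD is given an explicit default of 0 for the last row.

```python
import sqlite3

# Hypothetical daily sales figures.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day_no INT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 100), (2, 150), (3, 120)])

lag_lead_rows = conn.execute("""
    SELECT day_no, amount,
           LAG(amount)        OVER (ORDER BY day_no) AS prev_amount,  -- offset 1, default NULL
           LEAD(amount, 1, 0) OVER (ORDER BY day_no) AS next_amount   -- explicit default 0
      FROM sales
     ORDER BY day_no
""").fetchall()

for row in lag_lead_rows:
    print(row)
```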
35. Using the LAG and LEAD Analytic Functions: Example
SELECT time_id, TO_CHAR(SUM(amount_sold),'9,999,999') AS SALES,
TO_CHAR(LAG(SUM(amount_sold),1) OVER (ORDER BY
time_id),'9,999,999') AS LAG1,
TO_CHAR(LEAD(SUM(amount_sold),1) OVER (ORDER BY
time_id),'9,999,999') AS LEAD1
FROM sales
WHERE time_id >= TO_DATE('10-OCT-2000', 'DD-MON-YYYY') AND
time_id <= TO_DATE('14-OCT-2000', 'DD-MON-YYYY')
GROUP BY time_id;
36. Using FIRST_VALUE/LAST_VALUE
• Returns the first/last value in an ordered set of values
• If the first value in the set is null, then the function returns NULL
unless you specify IGNORE NULLS. This setting is useful for
data densification.
FIRST_VALUE (expr [ IGNORE NULLS ]) OVER (analytic_clause)
LAST_VALUE (expr [ IGNORE NULLS ]) OVER (analytic_clause)
37. Using FIRST_VALUE Analytic Function: Example
SELECT department_id, last_name, salary,
FIRST_VALUE(last_name) OVER
(ORDER BY salary ASC ROWS UNBOUNDED PRECEDING) AS lowest_sal,
LAST_VALUE(last_name) OVER (ORDER BY salary ASC ROWS BETWEEN UNBOUNDED
PRECEDING and UNBOUNDED FOLLOWING) AS highest_sal
FROM (SELECT * FROM employees WHERE department_id = 30 ORDER BY employee_id)
ORDER BY department_id, last_name, salary;
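The explicit window on LAST_VALUE in the query above matters. The following minimal sketch, using Python's built-in sqlite3 module (assuming SQLite 3.25 or later) and three made-up rows, compares LAST_VALUE with the default window against LAST_VALUE with the window extended to UNBOUNDED FOLLOWING. With the default window, each row only sees rows up to itself, so LAST_VALUE simply returns the current row's name.

```python
import sqlite3

# Hypothetical rows with distinct salaries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (last_name TEXT, salary INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("A", 2500), ("B", 2800), ("C", 3100)])

fl_rows = conn.execute("""
    SELECT last_name, salary,
           FIRST_VALUE(last_name) OVER (ORDER BY salary) AS lowest_sal,
           LAST_VALUE(last_name)  OVER (ORDER BY salary) AS last_default_window,
           LAST_VALUE(last_name)  OVER (ORDER BY salary
               ROWS BETWEEN UNBOUNDED PRECEDING
                        AND UNBOUNDED FOLLOWING)         AS highest_sal
      FROM employees
     ORDER BY salary
""").fetchall()

for row in fl_rows:
    print(row)
```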
38. Using NTH_VALUE Analytic Function
• Returns the N-th value in an ordered set of values
• Note the default window when ORDER BY is present: RANGE
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
NTH_VALUE (measure_expr, n)
[ FROM { FIRST | LAST } ][ { RESPECT | IGNORE } NULLS ]
OVER (analytic_clause)
39. Using NTH_VALUE Analytic Function: Example
SELECT prod_id, channel_id, MIN(amount_sold),
NTH_VALUE ( MIN(amount_sold), 2) OVER (PARTITION BY
prod_id ORDER BY channel_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
FOLLOWING) nv
FROM sh.sales
WHERE prod_id BETWEEN 13 and 16
GROUP BY prod_id, channel_id;
40. Using the LISTAGG Function
• For a specified measure, LISTAGG orders data within each group
specified in the ORDER BY clause and then concatenates the
values of the measure column
• WARNING: Output is limited to 4000 characters (otherwise, an
error is raised at runtime)
LISTAGG(measure_expr [, 'delimiter'])
WITHIN GROUP (order_by_clause) [OVER
query_partition_clause]
41. Using the LISTAGG Function: Example
SELECT department_id "Dept", hire_date "Date",
last_name "Name",
LISTAGG(last_name, ', ') WITHIN GROUP (ORDER BY
hire_date, last_name)
OVER (PARTITION BY department_id) as "Emp_list"
FROM hr.employees
WHERE hire_date < DATE '2003-09-01'
ORDER BY "Dept", "Date", "Name";
43. Window Functions
• The windowing_clause gives some analytic functions a
further degree of control over this window within the current
partition
• The windowing_clause can only be used if an
order_by_clause is present
• The windows are always limited to the current partition
• Without an order_by_clause, the window is the entire partition;
with one, the default window is RANGE BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW unless said otherwise
44. Windowing Clause Useful Usages
• Cumulative aggregation
• Sliding average over preceding and/or following rows
• Using the RANGE parameter to filter aggregation records
45. Windows can be by RANGE or ROWS
RANGE BETWEEN start_point AND end_point
ROWS BETWEEN start_point AND end_point
Possible values for start_point and end_point:
• UNBOUNDED PRECEDING: The window starts at the first row of the
partition. Only available for start points.
• UNBOUNDED FOLLOWING: The window ends at the last row of the
partition. Only available for end points.
• CURRENT ROW: The window starts or ends at the current row.
• value_expr PRECEDING: A physical or logical offset before the
current row. When used with RANGE, can also be an interval literal.
• value_expr FOLLOWING: As above, but an offset after the current row.
46. Shortcuts
• Useful shortcuts for the windowing clause:
• ROWS UNBOUNDED PRECEDING is short for
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
• ROWS 10 PRECEDING is short for
ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
• ROWS CURRENT ROW is short for
ROWS BETWEEN CURRENT ROW AND CURRENT ROW (1 row)
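The two most common uses of the windowing clause, a sliding average and a cumulative sum, can be sketched as follows. The sketch uses Python's built-in sqlite3 module (assuming SQLite 3.25 or later, which accepts the same ROWS syntax and the ROWS UNBOUNDED PRECEDING shortcut); the sales table is hypothetical.

```python
import sqlite3

# Hypothetical daily sales figures.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day_no INT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30), (4, 40)])

window_rows = conn.execute("""
    SELECT day_no, amount,
           -- sliding average over the previous, current, and next row
           AVG(amount) OVER (ORDER BY day_no
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS sliding_avg,
           -- shortcut for ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           SUM(amount) OVER (ORDER BY day_no
               ROWS UNBOUNDED PRECEDING)                 AS running_sum
      FROM sales
     ORDER BY day_no
""").fetchall()

for row in window_rows:
    print(row)
# At the partition edges the window simply shrinks: day 1 averages only days 1 and 2.
```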
47. Oracle 12c New Features Overview
Just a couple; we could talk for hours about all the new features.
48. What’s New in Oracle 12c
• Top-N queries and pagination: returning the top N rows of a
query
• Syntactic sugar: just a syntax enhancement, not a performance
enhancement
• Pattern matching: new MATCH_RECOGNIZE syntax for finding
patterns across rows
49. Top-N Examples
SELECT last_name, salary
FROM hr.employees
ORDER BY salary
FETCH FIRST 4 ROWS ONLY;
SELECT last_name, salary
FROM hr.employees
ORDER BY salary
FETCH FIRST 4 ROWS WITH TIES;
SELECT last_name, salary
FROM hr.employees
ORDER BY salary DESC
FETCH FIRST 10 PERCENT ROWS ONLY;
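Since the 12c syntax is only a syntax enhancement, the same results can be written with ranking functions. Below is a minimal sketch of that older formulation using Python's built-in sqlite3 module (assuming SQLite 3.25 or later; SQLite does not support FETCH FIRST). ROW_NUMBER plays the role of ROWS ONLY and RANK plays the role of WITH TIES. The data is made up, and last_name is added as a tie-breaker so the ROWS ONLY result is deterministic.

```python
import sqlite3

# Hypothetical data with a tie on the second-lowest salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (last_name TEXT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("A", 2100), ("B", 2200), ("C", 2200), ("D", 2400), ("E", 2500)],
)

# Counterpart of FETCH FIRST 2 ROWS ONLY: exactly two rows.
rows_only = conn.execute("""
    SELECT last_name, salary
      FROM (SELECT last_name, salary,
                   ROW_NUMBER() OVER (ORDER BY salary, last_name) AS rn
              FROM employees)
     WHERE rn <= 2
     ORDER BY salary, last_name
""").fetchall()

# Counterpart of FETCH FIRST 2 ROWS WITH TIES: rows tied with the 2nd are kept.
with_ties = conn.execute("""
    SELECT last_name, salary
      FROM (SELECT last_name, salary,
                   RANK() OVER (ORDER BY salary) AS rk
              FROM employees)
     WHERE rk <= 2
     ORDER BY salary, last_name
""").fetchall()

print(rows_only)
print(with_ties)
```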
50. What is Pattern Matching?
• A new syntax that allows us to identify and group rows with
consecutive values
• Consecutive in this regard means row after row
• Uses regular-expression-like syntax to find patterns
• Finds complex behavior we couldn’t find before, or that needed
PL/SQL
51. Example: Pages in a Book
• Our goal: find uninterrupted sequences in a book
• This can be useful for detecting missing records or sequential
behavior
(source: “Database 12c Row Pattern Matching” (OOW2014 session), by Stew Ashton).
52. Pattern Matching Example
SELECT *
FROM book_pages                    -- 1. Define input
MATCH_RECOGNIZE (                  -- 2. Pattern matching
ORDER BY page                      -- 3. Order input
MEASURES                           -- 7. Output: columns per row
A.page firstpage,
LAST(page) lastpage,
COUNT(*) cnt
ONE ROW PER MATCH                  -- 6. Output: rows per match
AFTER MATCH SKIP PAST LAST ROW     -- 8. Where to go after match?
PATTERN (A B*)                     -- 4. Process pattern
DEFINE B AS page = PREV(page)+1    -- 5. Using defined conditions
);
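For readers without a 12c database, the expected output of the book-pages query can be reproduced with a different technique. The sketch below does not use MATCH_RECOGNIZE; it swaps in the classic "gaps and islands" trick, where page minus ROW_NUMBER() is constant within each uninterrupted run. It uses Python's built-in sqlite3 module (assuming SQLite 3.25 or later), the page numbers are made up, and it assumes page values are unique.

```python
import sqlite3

# Hypothetical page numbers with gaps after 3 and after 6.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book_pages (page INT)")
conn.executemany("INSERT INTO book_pages VALUES (?)",
                 [(1,), (2,), (3,), (5,), (6,), (8,)])

# page - ROW_NUMBER() stays constant inside a run of consecutive pages,
# so grouping by it yields one row per uninterrupted sequence.
sequences = conn.execute("""
    SELECT MIN(page) AS firstpage, MAX(page) AS lastpage, COUNT(*) AS cnt
      FROM (SELECT page,
                   page - ROW_NUMBER() OVER (ORDER BY page) AS grp
              FROM book_pages)
     GROUP BY grp
     ORDER BY firstpage
""").fetchall()

for row in sequences:
    print(row)
```

The columns match the MEASURES of the query above: one row per match, with the first page, the last page, and the number of pages in each sequence.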
55. Summary
• We talked about advanced aggregation clauses, multidimensional
aggregation, and how using them can save us time and effort
• Analytic functions are really important both for performance and for
code clarity
• We saw how ranking functions work and how to use windows
• We explored some Oracle 12c enhancements; more information
about them can be found in my blog: www.realdbamagic.com