Extreme querying with analytics
 

Presentation given to the Sydney Oracle meetup on June 30th 2010.

Covering Oracle analytics and advanced aggregate functions

Usage Rights: CC Attribution-NonCommercial License

  • Standard SQL functionality focuses on rows. When you want to explore the relationship between rows, those rows would have to come from different sources, even if it was the same table with a different alias. And those different data sources would have to be joined. Analytic functions allow the rows in a result set to 'peek' at each other, avoiding the need for joining duplicated data sources.
  • A partition separates discrete and independent slices of data. A window dictates how far back or forward the row can peek WITHIN the partition. The default window is from the 'start' of the partition to the current row. The concept of 'start' generally requires an explicit ORDER BY.
  • Without analytics, all the expressions and columns in a row of a result set had to come from within that row. Analytics allow you to 'peek' at your neighbours.
  • You can have multiple analytics in a single select, all with different partitioning and/or ordering characteristics. You may want to partition by Customer or Order, and order by dates ascending and descending. Like any SQL, there will be limits on how complex you ALLOW things to get.
  • This 'New Aggregate' section is pretty much only here to emphasize this function.
  • If I want the biggest selling item, there's little point in telling me how many I've sold if you can't tell me WHAT I sold.
  • 'X' is made up, just to give a row in Cities that has the same population as Sydney
  • It looks better with your own user defined collections.
  • XML allows complex or 'wide' data sets (lots of columns) to be pulled together. They are a (mostly) reliable mechanism for grouping data so that it can be ungrouped at a later point.
  • You can group up an entire record this way, treat it as a single value as you move it around (eg in a MAX … KEEP) and then decompose it back at the end.
  • Wraps the elements into a higher level record.
  • Turn the results into a single row. This is the AGGREGATION function.
  • Finally, you may need to wrap the XML records into a higher level chunk. Here we have a PARENT, with multiple LINE entries each of which consists of a NAME and POP element.
  • Better than the old days when you had to rely on DECODE, which was a right royal PITA if you needed greater than/less than style comparisons, especially with strings where you couldn't use SIGN.
  • These are also available in SQL Server: Google "DAT317: T-SQL Power! The OVER Clause: Your Key to No-Sweat Problem Solving", which was presented at Tech Ed North America 2010.
  • The most common scenario is a Top-N query.
  • Not wanting to lose all the employees from a particular sector, Smithers brings a report ranking the employees within their department (or sector).
  • NOTE: Since I used WAGE DESC as the order by, nulls were put to the top. I can avoid this with a NULLS LAST clause:
      select name, wage, sector,
             row_number() over (partition by sector order by wage desc nulls last) rn,
             rank()       over (partition by sector order by wage desc nulls last) rnk,
             dense_rank() over (partition by sector order by wage desc nulls last) drnk
      from emp
      where sector in ('7G','9i')
      order by sector, wage desc nulls last
    ROW_NUMBER gives non-duplicating, consecutive numbers. If the ORDER BY is not deterministic, the results may differ. It can give you a "Give me the two highest paid employees" and guarantee no more than two rows, but with the risk that it isn't deterministic. You might get Lenny or Carl.
    RANK gives the same number when the ORDER BY values match, but will skip numbers. It can be the best for "Give me the two highest paid employees" with the caveat that you may get more than two records if there are 'ties' at the end. For 7G, Homer, Lenny and Carl would be returned.
    DENSE_RANK gives the same number when the ORDER BY values match, and the next number is always consecutive. In this case "Give me the three highest salaries and the people to whom they are paid" would return the four people in 7G with salaries of 200, 100 and 50.
    If there are no ties, the results are equivalent.
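The difference between the three ranking functions can be tried out with a small sketch. It runs the SQL through Python's built-in sqlite3 module as a stand-in for Oracle (assuming a bundled SQLite of 3.25 or later, which supports window functions). The sample rows are assumptions: only Homer, Lenny and Carl are named in the notes, and 'Barney' is a made-up fourth employee on the 50 wage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (name text, wage integer, sector text)")
# Hypothetical sample data; 'Barney' is invented for illustration.
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [("Homer", 200, "7G"), ("Lenny", 100, "7G"),
     ("Carl", 100, "7G"), ("Barney", 50, "7G")],
)

rows = conn.execute("""
    select name, wage,
           row_number() over (partition by sector order by wage desc) rn,
           rank()       over (partition by sector order by wage desc) rnk,
           dense_rank() over (partition by sector order by wage desc) drnk
    from emp
    order by wage desc, name
""").fetchall()

# Map each name to its (rank, dense_rank) pair for easy inspection.
ranks = {name: (rnk, drnk) for name, wage, rn, rnk, drnk in rows}
for row in rows:
    print(row)
```

Lenny and Carl share a RANK and a DENSE_RANK, while ROW_NUMBER has to pick an arbitrary order between them. After the tie, RANK skips to 4 and DENSE_RANK continues at 3.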
  • Cumulative amount demonstrates the ORDER BY. It is generally less confusing if, where you have an ORDER BY in an analytic, you have the same ORDER BY at the bottom of the query.
  • Because Lenny and Carl both earn 100, the SUM analytic 'groups' both together. However the ROW_NUMBER analytic orders them uniquely. Using the ROW_NUMBER derived value as a filter means that a group is broken and the results look wrong.
  • The default for the windowing clause is RANGE, which is deterministic and has rows of equivalent value at the same level. The alternative is ROWS which puts in an artificial, and arbitrary, differentiator as a tie-breaker. The UNBOUNDED PRECEDING is also the default. It just means start at the beginning (eg the beginning of the partition, but we'll get on to partitions later).
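A minimal sketch of the RANGE versus ROWS behaviour described above, again using sqlite3 as a stand-in for Oracle (SQLite 3.25+ assumed). The three wages follow the Homer, Lenny and Carl example. Adding `name` as a tie-breaker in the ROWS version is my own choice, made so the output is repeatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (name text, wage integer)")
conn.executemany("insert into emp values (?, ?)",
                 [("Homer", 200), ("Lenny", 100), ("Carl", 100)])

rows = conn.execute("""
    select name, wage,
           -- default window: RANGE, so tied wages are summed together
           sum(wage) over (order by wage desc) range_sum,
           -- explicit ROWS window: ties are split one row at a time
           sum(wage) over (order by wage desc, name
                           rows between unbounded preceding
                                    and current row) rows_sum
    from emp
    order by wage desc, name
""").fetchall()

for row in rows:
    print(row)
```

With the default RANGE window both 100-wage rows report the same cumulative total, which is why filtering on a ROW_NUMBER value can make the running total look wrong.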
  • NTILE (and the similar PERCENT_RANK) can be useful for excluding the outliers. For example, you can select only rows where the PERCENT_RANK is between 0.1 and 0.9 to exclude the top and bottom 10% of 'weird' values, or select the middle third. PERCENT_RANK always returns a fraction between 0 and 1, whereas NTILE allows you to choose your own number of buckets. NTILE is also available in SQL Server. Postgres also has analytics, which are enhanced in Postgres 9.
  • create table test_mill as
      select round(dbms_random.normal,3) n
      from dual
      connect by level < 100000;

    select round(n*2) label, count(*) val
    from (select n, ntile(9) over (order by n) nt
          from test_mill)
    where nt [not] in (1,9)
    group by round(n*2)
    order by 2
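As a small, deterministic illustration of trimming outliers with NTILE, the sketch below uses nine made-up readings instead of the random TEST_MILL data. The table name `readings` and its values are assumptions, and sqlite3 (3.25+) stands in for Oracle. It also checks the range of PERCENT_RANK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table readings (n integer)")
# Hypothetical data: seven ordinary values plus one extreme at each end.
values = [1, 50, 51, 52, 53, 54, 55, 56, 1000]
conn.executemany("insert into readings values (?)", [(v,) for v in values])

# Nine rows in nine buckets: dropping buckets 1 and 9 drops the extremes.
trimmed = [n for (n,) in conn.execute("""
    select n
    from (select n, ntile(9) over (order by n) nt from readings)
    where nt not in (1, 9)
    order by n
""")]

# PERCENT_RANK runs from 0 for the first row to 1 for the last.
low, high = conn.execute("""
    select min(pr), max(pr)
    from (select percent_rank() over (order by n) pr from readings)
""").fetchone()

print(trimmed, low, high)
```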
  • IGNORE NULLS
  • Emphasize that May went from 130 to 170, an increase of 40 (or about 31%). Not much use for LEAD, but it is pretty similar.
      create table sales (period date, amount number);

      insert into sales
      select add_months(trunc(sysdate,'YYYY'), rownum -1),
             round(dbms_random.value(100, 500),-1)
      from dual
      connect by level < 10;

      column perc format 9999.99

      select to_char(period,'Month') mon, amount,
             lag(amount) over (order by period) prev_amt,
             100*(amount - (lag(amount) over (order by period)))
                /(lag(amount) over (order by period)) perc
      from sales
      order by period
      /
  • Ignore nulls syntax (11g)
      select to_char(period,'Month') mon, amount,
             lag(amount) ignore nulls over (order by period) prev_amt
      from sales
      order by period
      /
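The LAG calculation can be reproduced with the nine monthly amounts shown on the slide rather than random values. This is a sketch using sqlite3 (3.25+) in place of Oracle. The year 2010 and the text date format are assumptions, and IGNORE NULLS is left out because I am not aware of SQLite supporting it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table sales (period text, amount integer)")
# Amounts for January to September, as listed on the slide.
amounts = [340, 340, 150, 130, 170, 210, 350, 270, 380]
conn.executemany(
    "insert into sales values (?, ?)",
    [(f"2010-{month:02d}-01", amt) for month, amt in enumerate(amounts, start=1)],
)

rows = conn.execute("""
    select period, amount,
           lag(amount) over (order by period) prev_amt,
           round(100.0 * (amount - lag(amount) over (order by period))
                       / lag(amount) over (order by period), 2) perc
    from sales
    order by period
""").fetchall()

for row in rows:
    print(row)
```

January has no previous row, so both PREV_AMT and PERC come back as NULL (None in Python).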
  • Partitioning by Quarter Get the sales value for the first month of each quarter.
  • Answers questions that I never need to ask, such as the rolling total of the last three months.
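Both ideas, the first month of each quarter and a rolling three-month total, can be sketched against the slide's monthly amounts. sqlite3 (3.25+) stands in for Oracle here. Since SQLite has no TRUNC(period,'Q'), the quarter is derived from the month number, which is my substitution; the year 2010 is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table sales (period text, amount integer)")
amounts = [340, 340, 150, 130, 170, 210, 350, 270, 380]
conn.executemany(
    "insert into sales values (?, ?)",
    [(f"2010-{month:02d}-01", amt) for month, amt in enumerate(amounts, start=1)],
)

rows = conn.execute("""
    select period, amount,
           -- partition by quarter: (month + 2) / 3 with integer division
           first_value(amount) over (
               partition by (cast(strftime('%m', period) as integer) + 2) / 3
               order by period) quarter_start_amt,
           -- explicit window: this row and the two before it
           sum(amount) over (
               order by period
               rows between 2 preceding and current row) rolling_3m
    from sales
    order by period
""").fetchall()

quarter_start = [r[2] for r in rows]
rolling = [r[3] for r in rows]
print(quarter_start)
print(rolling)
```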
  • Filter is applied AFTER the analytic
      SQL> explain plan for
           select * from
             (select order_id, line_id,
                     sum(value) over (order by line_id) cumul
              from order_lines)
           where order_id =10;

      Explained.

      SQL> select * from table(dbms_xplan.display);

      PLAN_TABLE_OUTPUT
      -----------------------------------------------------------------------------------
      Plan hash value: 2716399136

      -----------------------------------------------------------------------------------
      | Id  | Operation           | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
      -----------------------------------------------------------------------------------
      |   0 | SELECT STATEMENT    |             |     6 |   234 |     4  (25)| 00:00:01 |
      |*  1 |  VIEW               |             |     6 |   234 |     4  (25)| 00:00:01 |
      |   2 |   WINDOW SORT       |             |     6 |   234 |     4  (25)| 00:00:01 |
      |   3 |    TABLE ACCESS FULL| ORDER_LINES |     6 |   234 |     3   (0)| 00:00:01 |
      -----------------------------------------------------------------------------------

      Predicate Information (identified by operation id):
      ---------------------------------------------------
         1 - filter("ORDER_ID"=10)
  • In this case, it allows for predicate pushing
      SQL> explain plan for
           select * from
             (select order_id, line_id,
                     sum(value) over
                       (partition by order_id order by line_id) cumul
              from order_lines)
           where order_id =10;

      Explained.

      SQL> select * from table(dbms_xplan.display);

      PLAN_TABLE_OUTPUT
      -----------------------------------------------------------------------------------
      Plan hash value: 2716399136

      -----------------------------------------------------------------------------------
      | Id  | Operation           | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
      -----------------------------------------------------------------------------------
      |   0 | SELECT STATEMENT    |             |     3 |   117 |     4  (25)| 00:00:01 |
      |   1 |  VIEW               |             |     3 |   117 |     4  (25)| 00:00:01 |
      |   2 |   WINDOW SORT       |             |     3 |   117 |     4  (25)| 00:00:01 |
      |*  3 |    TABLE ACCESS FULL| ORDER_LINES |     3 |   117 |     3   (0)| 00:00:01 |
      -----------------------------------------------------------------------------------

      Predicate Information (identified by operation id):
      ---------------------------------------------------
         3 - filter("ORDER_ID"=10)

      PLAN_TABLE_OUTPUT
      -----------------------------------------------------------------------------------
      SQL_ID  7zc4srwzv21gq, child number 0
      -------------------------------------
      SELECT * FROM ORD_LN_VW WHERE ORDER_ID = :B1

      Plan hash value: 2082499838

      -----------------------------------------------------------------------------------
      | Id  | Operation           | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
      -----------------------------------------------------------------------------------
      |   0 | SELECT STATEMENT    |             |       |       |     4 (100)|          |
      |   1 |  VIEW               | ORD_LN_VW   |     1 |    39 |     4  (25)| 00:00:01 |
      |   2 |   WINDOW SORT       |             |     1 |    39 |     4  (25)| 00:00:01 |
      |*  3 |    TABLE ACCESS FULL| ORDER_LINES |     1 |    39 |     3   (0)| 00:00:01 |
      -----------------------------------------------------------------------------------

      Predicate Information (identified by operation id):
      ---------------------------------------------------
         3 - filter("ORDER_ID"=:B1)
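The effect on the results, rather than on the plan, can be seen with a tiny ORDER_LINES table. This is a sketch with made-up data (two orders of three lines each), run through sqlite3 (3.25+) instead of Oracle. It shows what the cumulative sum returns for order 10 with and without the PARTITION BY; it does not reproduce Oracle's optimizer behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table order_lines (order_id integer, line_id integer, value integer)")
# Hypothetical data: order 9 and order 10, three lines each.
conn.executemany(
    "insert into order_lines values (?, ?, ?)",
    [(9, 1, 5), (9, 2, 5), (9, 3, 5), (10, 1, 10), (10, 2, 20), (10, 3, 30)],
)

# The outer filter is the same in both runs; only the OVER clause changes.
query = """
    select line_id, cumul
    from (select order_id, line_id,
                 sum(value) over ({over}) cumul
          from order_lines)
    where order_id = 10
    order by line_id
"""
no_partition = conn.execute(query.format(over="order by line_id")).fetchall()
with_partition = conn.execute(
    query.format(over="partition by order_id order by line_id")).fetchall()

print(no_partition)    # totals include the lines of order 9
print(with_partition)  # totals are confined to order 10
```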
  • select decode(grouping(to_char(period,'Q')),1,'Total',
                  nvl(to_char(period,' MM Month'),' Subtotal')) mnth,
           sum(amount) amt
    from sales
    group by rollup(to_char(period,'Q'), period);

    Sometimes use GROUPING in a filter predicate to avoid duplicate results. [Should be able to use HAVING, but ORA-0600 in XE]

    select m1, m2, amt
    from (select m1, m2, sum(amount) amt,
                 grouping(m1) gm1, grouping(m2) gm2
          from (select to_char(Period,'Month') m1,
                       to_char(period,'MM') m2,
                       amount
                from sales)
          group by rollup (m1, m2))
    where gm1 = gm2
    order by m2, m1, amt
    /
  • The example above excludes the detail rows shown below.
      SQL> select colour, shape, count(*)
           from stc
           group by cube(colour,shape)
           /

      COLOUR     SHAPE        COUNT(*)
      ---------- ---------- ----------
                                   249
                 Oval               83
                 Round              83
                 Square             83
      Red                           50
      Red        Oval               16
      Red        Round              17
      Red        Square             17
      Blue                          49
      Blue       Oval               17
      Blue       Round              16
      Blue       Square             16
      Green                         50
      Green      Oval               17
      Green      Round              16
      Green      Square             17
      White                         50
      White      Oval               16
      White      Round              17
      White      Square             17
      Yellow                        50
      Yellow     Oval               17
      Yellow     Round              17
      Yellow     Square             16

      24 rows selected.
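One way to see what CUBE produces is to build the same rows by hand: one GROUP BY per subset of the grouping columns, with the left-out columns returned as NULL. The sketch below does that in sqlite3, which as far as I know has no CUBE or ROLLUP, so this is an emulation rather than the Oracle syntax. The four-row STC table is made up.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("create table stc (colour text, shape text)")
# Hypothetical miniature of the STC table.
conn.executemany(
    "insert into stc values (?, ?)",
    [("Red", "Oval"), ("Red", "Round"), ("Blue", "Oval"), ("Blue", "Oval")],
)

cols = ["colour", "shape"]
cube = []
# Every subset of the columns: (), (colour), (shape), (colour, shape).
for size in range(len(cols) + 1):
    for kept in combinations(cols, size):
        select_list = ", ".join(c if c in kept else f"null as {c}" for c in cols)
        group_by = f" group by {', '.join(kept)}" if kept else ""
        sql = f"select {select_list}, count(*) from stc{group_by}"
        cube += conn.execute(sql).fetchall()

for row in cube:
    print(row)
```

ROLLUP would keep only the 'prefix' subsets, here (), (colour) and (colour, shape), and GROUPING SETS lets you list exactly the subsets you want.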
  • Number, datatype and names of columns are fixed at parse time. There can't be any chance of a subsequent execution, potentially with different bind variables, returning a differently structured data set.

Presentation Transcript

  •  
    • blah blah NOT LIABLE blah blah blah, I NEVER SAID THAT blah blah READ THE DOCUMENTATION blah blah blah NO PROMISES blah I GET PAID BY THE WORD blah blah
    Read my blog at HTTP://BLOG.SYDORACLE.COM
  •  
    • Aggregate functions are the basis of many Analytics
    • All the standard aggregates (MIN, MAX, COUNT, SUM, etc) can be used with analytic clauses.
    • Min / Max (with added KEEP)
    • KEEP means keep the column value for the highest ranked record.
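In Oracle the KEEP form looks like `MAX(name) KEEP (DENSE_RANK LAST ORDER BY pop)`. SQLite has no KEEP clause, so the sketch below swaps in a different technique with the same outcome: filter to the rows holding the top population, then take MIN or MAX of the name. 'Sydney' and 'X' with a population of 2 million come from the notes; 'Smallville' and its population are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table cities (name text, pop integer)")
conn.executemany(
    "insert into cities values (?, ?)",
    [("Sydney", 2000000), ("X", 2000000), ("Smallville", 500000)],  # partly hypothetical
)

# Emulates, for the highest-ranked population:
#   MIN(name) KEEP (DENSE_RANK LAST ORDER BY pop)
#   MAX(name) KEEP (DENSE_RANK LAST ORDER BY pop)
min_name, max_name = conn.execute("""
    select min(name), max(name)
    from cities
    where pop = (select max(pop) from cities)
""").fetchone()

print(min_name, max_name)
```

Because two cities tie on the highest population, MIN and MAX pick different names; without a tie they would return the same one.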
  • Which of their cities has the most potential slaves ?
  • SYDNEY and X both have a population of 2 million
  • MIN or MAX only makes a difference if there are multiple entries of the same ORDER BY rank
    • Min / Max (with added KEEP)
    • Collect
      • Create a collection of all the individual values
      • A list of large cities …
  •  
    • Min / Max (with added KEEP)
    • Collect
    • XMLAgg (in four steps)
      • Collect the column(s) into an XML document
  •  
    • Min / Max (with added KEEP)
    • Collect
    • XMLAGG
    • ListAgg
      • 11g function to create a single VARCHAR2 value from a collection of individual VARCHAR2s
  •  
    • Wrap the aggregate around a CASE expression to give more aggregation possibilities.
        SELECT
          SUM(case when state='VIC' then pop end) vic_pop,
          SUM(case when state='NSW' then pop end) nsw_pop
        FROM cities;
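The CASE-inside-an-aggregate pattern is standard SQL, so it runs unchanged in sqlite3. A minimal sketch follows; the city rows and population figures are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table cities (name text, state text, pop integer)")
# Hypothetical figures, chosen only to make the totals easy to check.
conn.executemany(
    "insert into cities values (?, ?, ?)",
    [("Sydney", "NSW", 2000000), ("Newcastle", "NSW", 300000),
     ("Melbourne", "VIC", 1800000)],
)

# Rows that fail the CASE test yield NULL, which SUM ignores.
vic_pop, nsw_pop = conn.execute("""
    select sum(case when state = 'VIC' then pop end) vic_pop,
           sum(case when state = 'NSW' then pop end) nsw_pop
    from cities
""").fetchone()

print(vic_pop, nsw_pop)
```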
  • (at last)
    • Dense Rank / Rank / Row Number
  • Smithers, Bring me a list of our highest paid employees … and the poisoned donuts.
      select name, wage, sector,
             row_number() over (partition by sector order by wage desc) rn,
             rank()       over (partition by sector order by wage desc) rnk,
             dense_rank() over (partition by sector order by wage desc) drnk
      from emp
      order by sector, wage desc;
  •  
    • Using ROW_NUMBER with other analytics can confuse…
        select name, wage, cum_wage from
          (select name, wage,
                  sum(wage) over (order by wage desc) cum_wage,
                  row_number() over (order by wage desc) rn
           from emp
           where sector = '7G')
        where rn < 3

        NAME   WAGE  CUM_WAGE
        Homer   200       200
        Lenny   100       400
  •  
    • Dense Rank / Rank / Row Number
    • NTILE
      • The "Snobs" and "Yobs" function
      • Ignore the outliers and extremes
      • Or ignore the 'huddled masses'
  • Exclude the most common 90% Focus on the most common 10%
    • Dense Rank / Rank / Row Number
    • NTILE
    • Lag / Lead
      • Look around for the previous or next row
        MONTH         AMOUNT   PREV_AMT     PERC
        January          340
        February         340        340      .00
        March            150        340   -55.88
        April            130        150   -13.33
        May              170        130    30.77
        June             210        170    23.53
        July             350        210    66.67
        August           270        350   -22.86
        September        380        270    40.74
        MON            AMOUNT   PREV_AMT
        ---------- ---------- ----------
        January           340
        February          340        340
        March             150        340
        April             130        150
        May               170        130
        June                         170
        July              350        170
        August            270        350
        September         380        270
    • Dense Rank / Rank / Row Number
    • Percent Rank
    • Lag / Lead
    • First / Last
      • Look further ahead or behind
        select to_char(period,'Month') mon,
               amount,
               first_value(amount) over
                 (partition by trunc(period,'Q')
                  order by period) prev_amt
        from sales
        order by period

        MON            AMOUNT   PREV_AMT
        ---------- ---------- ----------
        January           340        340
        February          340        340
        March             150        340
        April             130        130
        May               170        130
        June              210        130
        July              350        350
        August            270        350
        September         380        350
    • Rarely needed in practice
    • Partition By and Order By normally enough
    • If you omit the PARTITION clause, especially with in-line views, the results can be BAD
  • In the inline view, the SUM analytic applies to ALL the Orders in the table.
  • (if we have time)
    • Rollup
    • Grouping sets
    • Cube
  •  
    • Rollup
    • Cube
        • CUBE allows combinations of columns to be totaled
  •  
    • Rollup
    • Cube
    • Grouping sets
      • Perform grouping across multiple columns
      • Without the lower level totals of CUBE
  •  
    • If you think you have a problem which the MODEL clause solves then
      • Go have a coffee
      • Go have a bar of chocolate
      • Go have a beer
      • Go have a lie down
    • BUT do something else until the feeling wears off