Row patternmatching12ctech14

My presentation on a new Oracle Database feature called "Row Pattern Matching". I presented on 9 December 2014 at UKOUG Tech 14. Be sure to download for the animations.

Published in: Technology

  1. “Row Pattern Matching” with Database 12c MATCH_RECOGNIZE: Beating the Best Pre-12c Solutions. Stew Ashton, UKOUG Tech 14
  2. Agenda
     • Who am I?
     • Pre-12c solutions compared to row pattern matching with MATCH_RECOGNIZE
       – For all sizes of data
       – Thinking in patterns
     • Watch out for “catastrophic backtracking”
     • Other things to keep in mind (time permitting)
  3. Who am I?
     • 33 years in IT
       – Developer, Technical Sales Engineer, Technical Architect
       – Aeronautics, IBM, Finance
       – Mainframe, client-server, Web apps
     • 25 years as an American in Paris
     • 9 years using Oracle database
       – Performance analysis
       – Replace Java with SQL
     • 2 years as internal “Oracle Development Expert”
  4. 1) “Fixed Difference”
     • Identify and group rows with consecutive values
     • My presentation: print slides to keep
     • Math: subtract known consecutives
       – If A-1 = B-2 then A = B-1
       – Else A <> B-1
       – Consecutive becomes equality, non-consecutive becomes inequality
     • “Consecutive” = fixed difference of 1
     PAGE: 1 2 3 5 6 7 10 11 12 36
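The “fixed difference” idea can be tried outside the database. The following is a minimal Python sketch, not the SQL from the talk: the function name `group_consecutive` is made up for illustration, and it assumes the page numbers are distinct integers. Subtracting each value's position from the value gives the same result for every member of a consecutive run, so that difference works as a group key.

```python
from itertools import groupby

def group_consecutive(pages):
    """Group distinct integers into runs of consecutive values.

    Returns one (first, last, count) tuple per run.
    """
    # Pair each sorted value with (value - position); consecutive values
    # share the same difference, non-consecutive values do not.
    keyed = [(page - pos, page) for pos, page in enumerate(sorted(pages), start=1)]
    runs = []
    for _, members in groupby(keyed, key=lambda pair: pair[0]):
        run = [page for _, page in members]
        runs.append((run[0], run[-1], len(run)))
    return runs

pages = [1, 2, 3, 5, 6, 7, 10, 11, 12, 36]
print(group_consecutive(pages))
# [(1, 3, 3), (5, 7, 3), (10, 12, 3), (36, 36, 1)]
```

Because the difference never decreases over sorted distinct values, grouping adjacent equal keys is enough; no second sort is needed.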
  5. 1) Pre-12c
     select min(page) firstpage, max(page) lastpage, count(*) cnt
     FROM (
       SELECT page,
         page - Row_Number() over(order by page) as grp_id
       FROM t
     )
     GROUP BY grp_id;

     PAGE  [RN]  GRP_ID
        1     1       0
        2     2       0
        3     3       0
        5     4       1
        6     5       1
        7     6       1
       10     7       3
       11     8       3
       12     9       3
       42    10      32

     FIRSTPAGE  LASTPAGE  CNT
             1         3    3
             5         7    3
            10        12    3
            36        36    1
  6. Think “match a row pattern”
     • PATTERN
       – Uninterrupted series of input rows
       – Described as a list of conditions (“regular expressions”)
       PATTERN (A B*)
       "A": 1 row, "B": 0 or more rows, as many as possible
     • DEFINE each row condition
       [A undefined = TRUE]
       B AS page = PREV(page)+1
     • Each series that matches the pattern is a “match”
       – "A" and "B" identify the rows that meet their conditions
  7. Input, Processing, Output
     SELECT *
     FROM t                                   -- 1. Define input
     MATCH_RECOGNIZE (
       ORDER BY page                          -- 2. Order input
       MEASURES A.page firstpage,             -- 6. Output: columns per row
                LAST(page) lastpage,
                COUNT(*) cnt
       ONE ROW PER MATCH                      -- 5. Output: rows per match
       AFTER MATCH SKIP PAST LAST ROW         -- 7. Go where after match?
       PATTERN (A B*)                         -- 3. Process pattern
       DEFINE B AS page = PREV(page)+1        -- 4. using defined conditions
     );
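To make the processing order concrete, here is a rough Python sketch of what PATTERN (A B*) with DEFINE B AS page = PREV(page)+1, ONE ROW PER MATCH and AFTER MATCH SKIP PAST LAST ROW amounts to for this data. It is an illustration only, not how Oracle implements MATCH_RECOGNIZE, and the function name `match_a_b_star` is hypothetical.

```python
def match_a_b_star(pages):
    """Single ordered pass: A is any row, B rows follow while page = PREV(page) + 1."""
    rows = sorted(pages)                    # ORDER BY page
    matches = []
    i = 0
    while i < len(rows):
        start = i                           # A has no condition, so it is always TRUE
        i += 1
        while i < len(rows) and rows[i] == rows[i - 1] + 1:
            i += 1                          # B: take as many rows as possible
        # ONE ROW PER MATCH: A.page, LAST(page), COUNT(*)
        matches.append((rows[start], rows[i - 1], i - start))
        # SKIP PAST LAST ROW: the next match starts at row i
    return matches

print(match_a_b_star([1, 2, 3, 5, 6, 7, 10, 11, 12, 36]))
# [(1, 3, 3), (5, 7, 3), (10, 12, 3), (36, 36, 1)]
```

The sketch reads each row once after sorting, which is consistent with the single sort step that appears in the MATCH_RECOGNIZE execution plans later in the deck.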
  8. 1) Run_Stats comparison
     For one million rows:
     Stat                      Pre 12c  Match_R  Pct
     Latches                   4090     4079     100%
     Elapsed Time              5.51     5.56     101%
     CPU used by this session  5.5      5.55     101%
     “Latches” are serialization devices: fewer means more scalable
  9. 1) Execution Plans
     Pre-12c:
     Id  Operation            Name  Starts  E-Rows  A-Rows  A-Time       Buffers  OMem  1Mem   Used-Mem
      0  SELECT STATEMENT           1               400K    00:00:01.83  1594
      1   HASH GROUP BY             1       1000K   400K    00:00:01.83  1594     41M   5035K  40M (0)
      2    VIEW                     1       1000K   1000K   00:00:12.69  1594
      3     WINDOW SORT             1       1000K   1000K   00:00:03.46  1594     22M   1749K  20M (0)
      4      TABLE ACCESS FULL T    1       1000K   1000K   00:00:02.53  1594

     MATCH_RECOGNIZE:
     Id  Operation                                          Name  Starts  E-Rows  A-Rows  A-Time       Buffers  OMem  1Mem   Used-Mem
      0  SELECT STATEMENT                                         1               400K    00:00:03.45  1594
      1   VIEW                                                    1       1000K   400K    00:00:03.45  1594
      2    MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO         1       1000K   400K    00:00:01.87  1594     22M   1749K  20M (0)
      3     TABLE ACCESS FULL                               T     1       1000K   1000K   00:00:02.09  1594
  10. 2) “Start of Group”
     • Identify group boundaries, often using LAG()
     • 3 steps instead of 2:
       1. For each row: if start of group, assign 1, else assign 0
       2. Running total of 1s and 0s produces a group identifier
       3. Group by the group identifier
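The three steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the talk's SQL: the function name `merge_contiguous` is invented, rows are (group_name, start_ts, end_ts) tuples, and timestamps are ISO-style strings so that plain sorting orders them correctly.

```python
from itertools import accumulate, groupby

def merge_contiguous(rows):
    """Merge rows whose start_ts equals the previous end_ts within the same group_name."""
    rows = sorted(rows)
    # Step 1: assign 1 when a row starts a new group, else 0.
    flags = []
    for i, (name, start, _end) in enumerate(rows):
        contiguous = i > 0 and rows[i - 1][0] == name and rows[i - 1][2] == start
        flags.append(0 if contiguous else 1)
    # Step 2: the running total of the flags is a group identifier.
    # (A change of group_name always sets the flag, so one global total suffices here.)
    grp_ids = accumulate(flags)
    # Step 3: group by the identifier and keep the outer boundaries.
    merged = []
    for _, pairs in groupby(zip(grp_ids, rows), key=lambda pair: pair[0]):
        group = [row for _, row in pairs]
        merged.append((group[0][0],
                       min(r[1] for r in group),
                       max(r[2] for r in group)))
    return merged

sample = [
    ("X", "2014-01-01 00:00", "2014-02-01 00:00"),
    ("X", "2014-03-01 00:00", "2014-04-01 00:00"),
    ("X", "2014-04-01 00:00", "2014-05-01 00:00"),
    ("Y", "2014-06-01 03:00", "2014-06-01 04:00"),
    ("Y", "2014-06-01 04:00", "2014-06-01 05:00"),
]
for row in merge_contiguous(sample):
    print(row)
```

The sample rows are a subset of the ranges used in the requirement slide; the second and third X rows merge into one range, as do the two Y rows.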
  11. 2) Requirement
     Merge contiguous date ranges in same group
     GROUP_NAME  START_TS          END_TS
     X           2014-01-01 00:00  2014-02-01 00:00
     X           2014-03-01 00:00  2014-04-01 00:00
     X           2014-04-01 00:00  2014-05-01 00:00
     X           2014-06-01 00:00  2014-06-01 01:00
     X           2014-06-01 01:00  2014-06-01 02:00
     X           2014-06-01 02:00  2014-06-01 03:00
     Y           2014-06-01 03:00  2014-06-01 04:00
     Y           2014-06-01 04:00  2014-06-01 05:00
     Y           2014-07-03 08:00  2014-09-29 17:00
  12. with grp_starts as (
       select a.*,
         case when start_ts = lag(end_ts) over(
           partition by group_name order by start_ts
         ) then 0 else 1 end grp_start
       from t a
     ), grps as (
       select b.*,
         sum(grp_start) over(
           partition by group_name order by start_ts
         ) grp_id
       from grp_starts b
     )
     select group_name, min(start_ts) start_ts, max(end_ts) end_ts
     from grps
     group by group_name, grp_id;

     GROUP_NAME  START_TS     END_TS       GRP_START  GRP_ID
     X           01-01 00:00  02-01 00:00  1          1
     X           03-01 00:00  04-01 00:00  1          2
     X           04-01 00:00  05-01 00:00  0          2
     X           06-01 00:00  06-01 01:00  1          3
     X           06-01 01:00  06-01 02:00  0          3
     X           06-01 02:00  06-01 03:00  0          3
     Y           06-01 03:00  06-01 04:00  1          1
     Y           06-01 04:00  06-01 05:00  0          1
     Y           07-03 08:00  09-29 17:00  1          2
  13. 2) Match_Recognize
     SELECT * FROM t
     MATCH_RECOGNIZE(
       PARTITION BY group_name
       ORDER BY start_ts
       MEASURES A.start_ts start_ts, end_ts end_ts,
                next(start_ts) - end_ts gap
       PATTERN(A B*)
       DEFINE B AS start_ts = prev(end_ts)
     );
     New this time:
     • Added PARTITION BY
     • MEASURES added gap using row outside the match!
     • ONE ROW PER MATCH and SKIP PAST LAST ROW are the defaults
     One solution replaces two methods: simple!
  14. Which row do we mean?
     Column name by itself = “current” row
     • Define: row being evaluated
     • All rows: each row being output
     • One row: last row being output
     START_TS  END_TS  DEFINE   MEASURES ALL ROWS  MEASURES ONE ROW
     00:00     01:00   FIRST()  FIRST()            FIRST()
     01:00     02:00   Current  Current            Current
     02:00     03:00   LAST()   LAST()             LAST()
     04:00     05:00            FINAL LAST         FINAL LAST
  15. Which row do we mean?
     Expression          DEFINE                        MEASURES ALL ROWS…            MEASURES ONE ROW…
     FIRST(start_ts)     First row of match            First row of match            First row of match
     start_ts            current row                   current row                   last row of match
     LAST(end_ts)        current row                   current row                   last row of match
     FINAL LAST(end_ts)  ORA-62509                     last row of match             last row of match
     B.start_ts          most recent B row             most recent B row             last B row
     PREV(), NEXT()      Physical offset from referenced row (all three cases)
     COUNT(*)            from first to current row     from first to current row     all rows in match
     COUNT(B.*)          B rows including current row  B rows including current row  all B rows
  16. 2) Run_Stats comparison
     For 500,000 rows:
     Stat                      Pre 12c  Match_R  Pct
     Latches                   10165    8066     79%
     Elapsed Time              32.16    20.58    64%
     CPU used by this session  31.94    19.67    62%
  17. 2) Execution Plans
     Pre-12c:
     Operation              Used-Mem
     SELECT STATEMENT
      HASH GROUP BY         20M (0)
       VIEW
        WINDOW BUFFER       32M (0)
         VIEW
          WINDOW SORT       27M (0)
           TABLE ACCESS FULL

     MATCH_RECOGNIZE:
     Operation                                        Used-Mem
     SELECT STATEMENT
      VIEW
       MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO 27M (0)
        TABLE ACCESS FULL
  18. 2) Matching within a group
     SELECT * FROM (
       SELECT * from t
       WHERE group_name = 'X'
     )
     MATCH_RECOGNIZE ( … );
     Filter before MATCH_RECOGNIZE to avoid extra work
  19. 2) Predicate pushing
     Select * from <view> where group_name = 'X'
     Operation                                        Name  A-Rows  Buffers
     SELECT STATEMENT                                       3       4
      VIEW                                                  3       4
       MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO       3       4
        TABLE ACCESS BY INDEX ROWID BATCHED           T     6       4
         INDEX RANGE SCAN                             TI    6       3
  20. 3) “Bin fitting”: fixed size
     • Requirement
       – Order by study_site
       – Put in “bins” with size = 65,000 max
     STUDY_SITE  CNT      STUDY_SITE  CNT
     1001        3407     1026        137
     1002        4323     1028        6005
     1004        1623     1029        76
     1008        1991     1031        4599
     1011        885      1032        1989
     1012        11597    1034        3427
     1014        1989     1036        879
     1015        5282     1038        6485
     1017        2841     1039        3
     1018        5183     1040        1105
     1020        6176     1041        6460
     1022        2784     1042        968
     1023        25865    1044        471
     1024        3734     1045        3360

     FIRST_SITE  LAST_SITE  SUM_CNT
     1001        1022       48081
     1023        1044       62203
     1045        1045       3360
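The requirement can be stated procedurally: walk the sites in order and start a new bin whenever adding the next count would exceed the limit. The Python sketch below illustrates that rule with the slide's data; the function name `fill_bins` is hypothetical, and this is not the SQL used in the talk.

```python
def fill_bins(rows, capacity):
    """Pack (site, cnt) rows, in site order, into bins whose total stays <= capacity.

    Returns one (first_site, last_site, sum_cnt) tuple per bin.
    """
    bins = []
    first = last = None
    total = 0
    for site, cnt in sorted(rows):
        if first is not None and total + cnt > capacity:
            bins.append((first, last, total))   # close the current bin
            first, total = None, 0
        if first is None:
            first = site                        # this row opens a new bin
        last = site
        total += cnt
    if first is not None:
        bins.append((first, last, total))
    return bins

sites = [
    (1001, 3407), (1002, 4323), (1004, 1623), (1008, 1991), (1011, 885),
    (1012, 11597), (1014, 1989), (1015, 5282), (1017, 2841), (1018, 5183),
    (1020, 6176), (1022, 2784), (1023, 25865), (1024, 3734), (1026, 137),
    (1028, 6005), (1029, 76), (1031, 4599), (1032, 1989), (1034, 3427),
    (1036, 879), (1038, 6485), (1039, 3), (1040, 1105), (1041, 6460),
    (1042, 968), (1044, 471), (1045, 3360),
]
print(fill_bins(sites, 65000))
# [(1001, 1022, 48081), (1023, 1044, 62203), (1045, 1045, 3360)]
```

In this sketch a single count larger than the capacity ends up alone in its own bin.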
  21. SELECT s first_site, MAX(e) last_site, MAX(sm) sum_cnt
     FROM (
       SELECT s, e, cnt, sm FROM t
       MODEL
         DIMENSION BY (row_number() over(order by study_site) rn)
         MEASURES (study_site s, study_site e, cnt, cnt sm)
         RULES (
           sm[rn > 1] = CASE
             WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000
             THEN cnt[cv()]
             ELSE sm[cv() - 1] + cnt[cv()]
           END,
           s[rn > 1] = CASE
             WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000
             THEN s[cv()]
             ELSE s[cv() - 1]
           END
         )
     )
     GROUP BY s;
     • DIMENSION with row_number orders data and processing
     • rn can be used like a subscript
     • cv() means current row
     • cv()-1 means previous row
  22. SELECT * FROM t
     MATCH_RECOGNIZE (
       ORDER BY study_site
       MEASURES FIRST(study_site) first_site,
                LAST(study_site) last_site,
                SUM(cnt) sum_cnt
       PATTERN (A+)
       DEFINE A AS SUM(cnt) <= 65000
     );
     New this time:
     • PATTERN (A+) replaces (A B*); it means 1 or more rows
     • Why? In previous examples I used PREV(), which returns NULL on the first row.
     One solution replaces 3 methods: simpler!
  23. 3) Run_Stats comparison
     For one million rows:
     Stat                      Pre 12c  Match_R  Pct
     Latches                   357448   4622     1%
     Elapsed Time              32.85    2.9      9%
     CPU used by this session  31.31    2.88     9%
  24. 3) Execution Plans
     Pre-12c:
     Id  Operation              Used-Mem
      0  SELECT STATEMENT
      1   HASH GROUP BY         7534K (0)
      2    VIEW
      3     SQL MODEL ORDERED   105M (0)
      4      WINDOW SORT        27M (0)
      5       TABLE ACCESS FULL

     MATCH_RECOGNIZE:
     Id  Operation                                        Used-Mem
      0  SELECT STATEMENT
      1   VIEW
      2    MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO 27M (0)
      3     TABLE ACCESS FULL
  25. 4) “Bin fitting”: fixed number
     • Requirement
       – Distribute values in 3 parts as equally as possible
     • “Best fit decreasing”
       – Sort values in decreasing order
       – Put each value in least full “bin”
     Input:
     Name  Val
     1     1
     2     2
     3     3
     4     4
     5     5
     6     6
     7     7
     8     8
     9     9
     10    10

     Processing, in decreasing order of Val:
     Val  BIN1  BIN2  BIN3
     10   10
     9    10    9
     8    10    9     8
     7    10    9     15
     6    10    15    15
     5    15    15    15
     4    19    15    15
     3    19    18    15
     2    19    18    17
     1    19    18    18
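“Best fit decreasing” as described above is short enough to sketch directly. The Python below is an illustration, not the talk's SQL; the function name `best_fit_decreasing` is made up, and ties are assumed to go to the lowest-numbered bin, which matches the processing table on the slide.

```python
def best_fit_decreasing(values, n_bins=3):
    """Distribute values over n_bins, largest first, always into the least full bin."""
    bins = [0] * n_bins
    for value in sorted(values, reverse=True):
        least_full = bins.index(min(bins))  # first bin holding the smallest total
        bins[least_full] += value
    return bins

print(best_fit_decreasing(range(1, 11)))
# [19, 18, 18]
```

The final totals agree with the last row of the slide's table (BIN1 = 19, BIN2 = 18, BIN3 = 18).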
  26. 4) Brilliant pre 12c solution
     SELECT bin, Max (bin_value) bin_value
     FROM (
       SELECT * FROM items
       MODEL
         DIMENSION BY (Row_Number() OVER (ORDER BY item_value DESC) rn)
         MEASURES (
           item_name,
           item_value,
           Row_Number() OVER (ORDER BY item_value DESC) bin,
           item_value bin_value,
           Row_Number() OVER (ORDER BY item_value DESC) rn_m,
           0 min_bin,
           Count(*) OVER () - 3 - 1 n_iters
         )
         RULES ITERATE(100000) UNTIL (ITERATION_NUMBER >= n_iters[1]) (
           min_bin[1] = Min(rn_m) KEEP (DENSE_RANK FIRST ORDER BY bin_value)[rn<= 3],
           bin[ITERATION_NUMBER + 3 + 1] = min_bin[1],
           bin_value[min_bin[1]] = bin_value[CV()] + Nvl(item_value[ITERATION_NUMBER+4], 0)
         )
     )
     WHERE item_name IS NOT NULL
     group by bin;
  27. SELECT * from items
     MATCH_RECOGNIZE (
       ORDER BY item_value desc
       MEASURES sum(bin1.item_value) bin1,
                sum(bin2.item_value) bin2,
                sum(bin3.item_value) bin3
       PATTERN ((bin1|bin2|bin3)+)
       DEFINE
         bin1 AS count(bin1.*) = 1
           OR sum(bin1.item_value)-bin1.item_value
              <= least( sum(bin2.item_value), sum(bin3.item_value) ),
         bin2 AS count(bin2.*) = 1
           OR sum(bin2.item_value)-bin2.item_value
              <= sum(bin3.item_value)
     );
     • ()+ = 1 or more of whatever is inside
     • '|' = alternatives, “preferred in the order specified”
     • Bin1 condition:
       – No rows here yet,
       – Or this bin least full
     • Bin2 condition:
       – No rows here yet, or
       – This bin less full than 3
  28. 4) Run_Stats comparison
     For 10,000 rows:
     Stat                      Pre 12c  Match_R  Pct
     Latches                   3124     47       2%
     Elapsed Time              28       0.02     0%
     CPU used by this session  26.39    0.03     0%
  29. 4) Execution Plans
     Pre-12c:
     Id  Operation              Used-Mem
      0  SELECT STATEMENT
      1   HASH GROUP BY         817K (0)
      2    VIEW
      3     SQL MODEL ORDERED   1846K (0)
      4      WINDOW SORT        424K (0)
      5       TABLE ACCESS FULL

     MATCH_RECOGNIZE:
     Id  Operation                 Used-Mem
      0  SELECT STATEMENT
      1   VIEW
      2    MATCH RECOGNIZE SORT    330K (0)
      3     TABLE ACCESS FULL
  30. Backtracking
     • What happens when there is no match?
     • “Greedy” quantifiers (* + {2,}) are not that greedy
       – Take all the rows they can, BUT give rows back if necessary, one at a time
     • Regular expression engines will test all possible combinations to find a match
  31. Repeating conditions
     select 'match' from (
       select level n from dual connect by level <= 100
     )
     match_recognize(
       pattern(a b* c)
       define b as n > prev(n)
            , c as n = 0
     );
     Runs in 0.005 secs

     select 'match' from (
       select level n from dual connect by level <= 100
     )
     match_recognize(
       pattern(a b* b* b* c)
       define b as n > prev(n)
            , c as n = 0
     );
     Runs in 5.4 secs
  32. 32. 32 123456789 A AB ABBB ABBBB ABBBBB ABBBBBB ABBBBBBB ABBBBBBBC ABBBBBBC ABBBBBC ABBBBC ABBBC ABBC ABC AC Backtracking in action: 1. Find A 2. Find all the Bs you can 3. At the end, look for a C 4. No C? Backtrack through the Bs 5. Still no C? No Match!
  33. Imprecise Conditions
     Test data:
     CREATE TABLE Ticker (
       SYMBOL VARCHAR2(10),
       tstamp DATE,
       price NUMBER
     );
     insert into ticker
     select 'ACME', sysdate + level/24/60/60, 10000-level
     from dual connect by level <= 5000;

     SELECT * FROM Ticker
     MATCH_RECOGNIZE (
       PARTITION BY symbol
       ORDER BY tstamp
       MEASURES FIRST(tstamp) AS start_tstamp,
                LAST(tstamp) AS end_tstamp
       AFTER MATCH SKIP TO LAST UP
       PATTERN (STRT DOWN+ UP+ DOWN+ UP+)
       DEFINE DOWN AS price < PREV(price),
              UP AS price > PREV(price),
              STRT AS price >= nvl(PREV(PRICE),0)
     );
     • Without the STRT condition (DEFINE ends at UP AS price > PREV(price)): runs in 24 seconds (INMEMORY: 13 seconds)
     • With the STRT condition: runs in 0.02 seconds
  34. Keep in Mind
     • Backtracking
       – Precise conditions
       – Test data with no matches
     • To debug:
       – Measures classifier() cl, match_number() mn
       – All rows per match with unmatched rows
     • No DISTINCT, no LISTAGG
     • MEASURES columns must have aliases
     • “Reluctant quantifier” = ? = JDBC bind variable
     • “Pattern variables” are range variables, not bind variables
  35. Output Row “shape”
     Per Match  PARTITION BY  ORDER BY  MEASURES  Other input
     ONE ROW    X             Omitted   X         Omitted
     ALL ROWS   X             X         X         X
     ORA-00918, anyone?
  36. Questions? More details at: stewashton.wordpress.com
