
SQL Tuning, takes 3 to tango

Session about SQL Tuning Diagnostics, delivered at OakTable World SF 2017



  1. 1. SQL Tuning Takes three to tango Mauro Pagano
  2. 2. Mauro Pagano @mautro • Worked at Oracle, been at Enkitec (AEG) a while now • Spend most of the time on performance problems • Free tools: SQLd360, TUNAs360 etc (at Oracle: SQLT, SQLHC etc) • Strong British accent • “Newbie old fart” (approved by Bryn)
  3. 3. but….
  4. 4. Poll time – “Historical” SQL Tuning
      1. Do you use AWR for SQL Tuning?
         1. Do you start from it? 2. Why? 3. How do you use it?
      2. Do you use ASH for SQL Tuning?
         1. Do you start from it? 2. Why? 3. How do you use it?
      3. Do you use SQL Monitoring for SQL tuning?
         1. Do you start from it? 2. Why? 3. How do you use it?
  5. 5. What are we doing here today? • Oracle has a ton of diagnostics (awesome!) • People tend to rely on GV$ / AWR more than ASH • Some questions are harder (if possible at all) to answer from GV$/AWR data • Today’s goal is: • Present scenarios where multiple sources are needed • Explain why & where to gather the missing info, and make sense of it • Knowing what each source represents allows better use of it • Focus is on diagnostics
  6. 6. What are we NOT doing here today? • Arguing about which one is better • They complement each other, not exclude each other • Need all (often AWR+ASH is enough) to have a full picture • One could be enough depending on the case, still the other adds value • Providing solutions to the scenarios presented • Today it’s about diagnostics, not problem X or Y • Once the behavior is identified correctly, the solution is often easier to find (if it exists) • Talking about license / cost associated with the Packs used
  7. 7. Some (incorrect) terminology we’ll use today • GV$ all views on X$, except X$ASH • For Example, GV$SQL • AWR all the tables in AWR except ASH • For example, DBA_HIST_SQLSTAT and DBA_HIST_SQL_PLAN • ASH as GV$ACTIVE_SESSION_HISTORY and DBA_HIST_ACTIVE_SESS_HISTORY • SQL Mon as SQL Monitor • Both raw data (GV$) and reports (current and historical)
  8. 8. How are AWR / (historical) ASH populated? • AWR takes a picture every N minutes (or manually) • Source views store accumulated data, AWR takes a pic of that at time T • Historical ASH filters out samples from memory ASH • Filtered samples may still show info not important enough to show up in the accumulated data • Source data includes info for all active sessions individually (not aggregated) • Ratio is generally 1:10 • X$KEW[A|R]* help to narrow down what to collect • Might make things a little harder to “break” in isolation
  9. 9. Why do we need both? AWR: knows exactly how much water. ASH: knows roughly who, when, how…
  10. 10. ASH samples - Why can we live with it? [Chart: Elapsed time / execution vs. Frequency of execution] How you move these bars depends on your app
  11. 11. Before we begin… • Every case is artificial, represents a real-life case without noise • Case itself is just a means to an end, not really the focus of the scenario • Cases build on each other, start simple and get a little more complex • OF COURSE I’m cheating! • What we’ll see can be applied to any environment • Knowing how to interpret and spot things helps in dev too • Charts used just to present a large amount of info in a small space • Not trying to push for any specific tool • Get into DataViz anyway, it makes life so much easier!
  12. 12. Two tables, all we need
      -- 8x dba_objects
      create table t_case1 as
        select * from dba_objects, (select rownum n1 from dual connect by rownum <= 8);
      create index t_case1_objtype on t_case1(object_type);
      -- 32x dba_objects
      create table t_case3 as
        select * from t_case1, (select rownum n2 from dual connect by rownum <= 4);
  13. 13. 1 - How long does my SQL take?
      SQL ID: apg0k1r43s8ak
      SQL Text: select * from t_case1 where object_type = :b1;
      Plan hash value: 3696583251
      --------------------------------------------------------------
      | Id | Operation                           | Name            |
      --------------------------------------------------------------
      |  0 | SELECT STATEMENT                    |                 |
      |  1 | TABLE ACCESS BY INDEX ROWID BATCHED | T_CASE1         |
      |* 2 | INDEX RANGE SCAN                    | T_CASE1_OBJTYPE |
      --------------------------------------------------------------
      2 - access("OBJECT_TYPE"=:B1)
  14. 14. 1 - How long does my SQL take? Total
      <<removed plan_hash_value & child_number just to make it fit, one child only>>
      select elapsed_time, buffer_gets, executions,
             trunc(elapsed_time/executions,2) elapsed_exec,
             trunc(buffer_gets/executions,2) lio_exec
        from gv$sql
       where sql_id = 'apg0k1r43s8ak';
      ELAPSED_TIME BUFFER_GETS EXECUTIONS ELAPSED_EXEC LIO_EXEC
      ------------ ----------- ---------- ------------ --------
         2,446,627      48,884         11   222,420.63    4,444
      Elapsed/exec ~220ms, Gets/exec ~4.4k
  15. 15. 1 - How long does my SQL take? Good run
      var b1 varchar2(20)
      exec :b1 := 'RULE';
      select * from t_case1 where object_type = :b1;
      ----------------------------------------------------------------------------
      | Id|Operation                      |Name           |A-Rows|A-Time  |Buffers|
      ----------------------------------------------------------------------------
      |  0|SELECT STATEMENT               |               |     8|00:00.01|     13|
      |  1| TABLE ACCESS BY INDEX ROWID B |T_CASE1        |     8|00:00.01|     13|
      |* 2| INDEX RANGE SCAN              |T_CASE1_OBJTYPE|     8|00:00.01|      5|
      ----------------------------------------------------------------------------
      Elapsed: 00:00:00.07
      ~10ms, 13 buffer gets
  16. 16. 1 - How long does my SQL take? Bad run
      var b1 varchar2(20)
      exec :b1 := 'JAVA CLASS';
      select * from t_case1 where object_type = :b1;
      -----------------------------------------------------------------------------
      |Id|Operation                  |Name           |A-Rows|A-Time  |Buffers|Reads|
      -----------------------------------------------------------------------------
      | 0|SELECT STATEMENT           |               |  305K|00:01.80|  54926| 8009|
      | 1| TABLE ACCESS BY INDEX R B |T_CASE1        |  305K|00:01.80|  54926| 8009|
      |*2| INDEX RANGE SCAN          |T_CASE1_OBJTYPE|  305K|00:00.63|  22737| 1300|
      -----------------------------------------------------------------------------
      Elapsed: 00:05:11.88
      ~1.80s, 55k buffer gets
      Can we look at the data from a different POV?
  17. 17. 1 - How long does my SQL take? ASH POV
      select sample_time, sql_exec_id, sql_exec_start
        from gv$active_session_history
       where sql_id = 'apg0k1r43s8ak'
       order by sample_time;
      SAMPLE_TIME                 SQL_EXEC_ID SQL_EXEC_START
      --------------------------- ----------- -------------------
      20-AUG-17 11.39.58.754 AM      16777222 2017-08-20/11:39:46
      20-AUG-17 11.40.11.761 AM
      20-AUG-17 11.40.31.781 AM
      20-AUG-17 11.40.40.791 AM
      20-AUG-17 11.40.43.792 AM
      20-AUG-17 11.40.48.799 AM
      20-AUG-17 11.40.51.800 AM
      20-AUG-17 11.40.57.801 AM
      20-AUG-17 11.41.03.809 AM
      20-AUG-17 11.41.05.811 AM
      20-AUG-17 11.41.23.833 AM
      20-AUG-17 11.41.35.846 AM
      20-AUG-17 11.41.52.863 AM
      20-AUG-17 11.41.54.864 AM
      20-AUG-17 11.42.03.870 AM
      20-AUG-17 11.42.08.875 AM
      20-AUG-17 11.42.21.882 AM
      20-AUG-17 11.42.31.891 AM
      20-AUG-17 11.42.34.896 AM
      Jumps in time: session not always busy during the missing samples
      User experience is 5 minutes, not 2 secs
      Not much we can do from the DB perspective here
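The gap between sampled DB time and wall-clock time can be summarized per session rather than read off the raw samples. The query below is a minimal sketch, not part of the original deck: it assumes the usual reading that each row in GV$ACTIVE_SESSION_HISTORY stands for roughly one second of activity, and the column aliases are made up for illustration.

```sql
-- Sketch (not from the original slides): sampled DB time vs wall-clock span.
-- Assumption: one in-memory ASH sample ~ one second of DB activity.
select inst_id, session_id, session_serial#,
       count(*)                            approx_db_secs,  -- sampled busy seconds
       min(sample_time)                    first_sample,
       max(sample_time)                    last_sample,
       max(sample_time) - min(sample_time) sampled_span     -- wall-clock interval covered
  from gv$active_session_history
 where sql_id = 'apg0k1r43s8ak'
 group by inst_id, session_id, session_serial#
 order by first_sample;
```

When approx_db_secs is much smaller than sampled_span, most of the clock time was likely spent outside the database, which is the point the slide makes.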
  18. 18. 1 - How long does my SQL take? Summary • Questions answered • GV$SQL (and similar) report time spent in DB calls, not user experience • GV$SQL (and similar) aggregates time over executions of same cursor • ASH sampled data helps understand how DB Time is spread over clock time • In this case showing how clock time was likely NOT spent inside the DB • ASH data has many dimensions, can help narrow down further • For example, all slow executions come from app server X • Question not solved • Why slow execution was slow (was easy this time, we provided the bind) • Historical binds are sampled, no direct correlation with specific execution • Ideally pick up value and run SQL to reproduce
  19. 19. 2 - How long did my SQL take? AWR
      SQL ID: 8gv4bwmnp8kmq
      select /*+ LEADING(A) USE_NL(B) */ count(*) from t_case1 a, t_case1 b where rownum <= 1e10;
      select snap_id, executions_delta e_d, executions_total e_t,
             end_of_fetch_count_delta eof_d,
             trunc(elapsed_time_delta/1e6) et_d_s, trunc(elapsed_time_total/1e6) et_t_s,
             buffer_gets_delta bg_d, buffer_gets_total bg_t
        from dba_hist_sqlstat
       where sql_id = '8gv4bwmnp8kmq'
       order by snap_id;
      SNAP_ID E_D E_T EOF_D ET_D_S ET_T_S       BG_D        BG_T
      ------- --- --- ----- ------ ------ ---------- -----------
         3341   0   1     0    187    188 77,221,193  77,693,626
         3342   0   1     0    126    314 51,883,866 129,577,492
         3343   0   1     0    128    442 52,887,666 182,465,158
      No info from snapshots on when the SQL started & ended
  20. 20. 2 - How long did my SQL take? AWR report No trivial way to determine #concurrent execs. Doable from *_TOTAL raw info
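One possible reading of "doable from *_TOTAL raw info" is to compare how many executions have started with how many have fetched to completion at each snapshot. This is only a sketch under that assumption, not the author's method: the `not_completed` alias is invented here, the difference also counts executions that were abandoned or cancelled before the last row was fetched, and the *_TOTAL columns start over if the cursor ages out and is reloaded.

```sql
-- Sketch (assumption, not from the slides): executions started but not yet
-- fetched to completion as of each AWR snapshot.
select snap_id, instance_number, plan_hash_value,
       executions_total,
       end_of_fetch_count_total,
       executions_total - end_of_fetch_count_total not_completed
  from dba_hist_sqlstat
 where sql_id = '8gv4bwmnp8kmq'
 order by snap_id, instance_number;
```

As the summary slide notes, this can only say how many executions were in flight, not which ones.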
  21. 21. 2 - How long did my SQL take? Concurr Execs [Diagram: timeline (“Time passing”) across SNAP_ID 3341, 3342 and 3343 for Session 1, Session 2 and Session 3. Annotations: “Exec #1, starts second and completes second”; “Not expensive enough to get captured”; “Exec #2, starts last and completes first”]
  22. 22. 2 - How long did my SQL take? ASH data
      select sql_exec_id, sql_exec_start,
             min(sample_time) first_sample, max(sample_time) last_sample,
             max(sample_time)-sql_exec_start elapsed
        from dba_hist_active_sess_history
       where sql_id = '8gv4bwmnp8kmq'
       group by sql_exec_id, sql_exec_start;
      SQL_EXEC_ID SQL_EXEC_START      FIRST_SAMPLE
      ----------- ------------------- ---------------------------
         16777216 2017-08-20/13:04:43 20-AUG-17 01.04.52.779 PM
      LAST_SAMPLE                ELAPSED
      -------------------------- --------------------------
      20-AUG-17 01.12.32.799 PM  +000000000 00:07:49.79
      Only one execution, took ~8 mins
  23. 23. 2 - How long did my SQL take? Summary • Questions answered • AWR only captures what mattered for the snapshot • Can miss the start / stop “slice” of info if not impacting enough within the snapshot • Raw info, not the AWR report, allows determining the number of concurrent executions • Can only say how many started / ended, not which one • ASH keeps only a subset of samples, but for each exec • With approximation, allows determining the who, when, where of each exec • Questions not answered • What if my execution takes very little time? Sampling is a compromise / it doesn’t matter
  24. 24. 3 – How is my PX doing?
      SQL ID: frzgf5tc9cscc
      select /*+ LEADING(A) PARALLEL(4) */ count(*) from t_case1 a, t_case1 b where a.owner = b.owner;
      << while SQL running >>
      select child_number, elapsed_time, buffer_gets, executions, px_servers_executions
        from gv$sql
       where sql_id = 'frzgf5tc9cscc';
      CHILD_NUMBER ELAPSED_TIME BUFFER_GETS EXECUTIONS PX_SERVERS_EXECUTIONS
      ------------ ------------ ----------- ---------- ---------------------
                 0       11,677         172          1                     0
                 1   34,682,909           6          0                     0
  25. 25. 3 – How is my PX doing? Running slow!
      << SQL still running >>
      CHILD_NUMBER ELAPSED_TIME BUFFER_GETS EXECUTIONS PX_SERVERS_EXECUTIONS
      ------------ ------------ ----------- ---------- ---------------------
                 0       13,682         172          1                     0
                 1   59,852,041       2,734          0                     0
      after CTRL+C (was taking too long)
      CHILD_NUMBER ELAPSED_TIME BUFFER_GETS EXECUTIONS PX_SERVERS_EXECUTIONS
      ------------ ------------ ----------- ---------- ---------------------
                 0      519,951         172          1                     0
                 1   96,314,353      13,205          0                     8
      Up to this point we only know that 8 sessions were involved, plus aggregated stats
  26. 26. 3 – How is my PX doing? Checking plan
      << SQL was still running >>
      --------------------------------------------------------------------------------------
      | Id|Operation             |Name    |E-Rows|Cost (%CPU)| TQ  |IN-OUT|PQ Distrib |
      --------------------------------------------------------------------------------------
      |  0|SELECT STATEMENT      |        |      |19462 (100)|     |      |           |
      |  1| SORT AGGREGATE       |        |     1|           |     |      |           |
      |  2| PX COORDINATOR       |        |      |           |     |      |           |
      |  3| PX SEND QC (RANDOM)  |:TQ10002|     1|           |Q1,02| P->S |QC (RAND)  |
      |  4| SORT AGGREGATE       |        |     1|           |Q1,02| PCWP |           |
      |* 5| HASH JOIN            |        |   27G|19462  (86)|Q1,02| PCWP |           |
      |  6| PX RECEIVE           |        |  940K| 1377   (1)|Q1,02| PCWP |           |
      |  7| PX SEND HYBRID HASH  |:TQ10000|  940K| 1377   (1)|Q1,00| P->P |HYBRID HASH|
      |  8| STATISTICS COLLECTOR |        |      |           |Q1,00| PCWC |           |
      |  9| PX BLOCK ITERATOR    |        |  940K| 1377   (1)|Q1,00| PCWC |           |
      |*10| TABLE ACCESS FULL    |T_CASE1 |  940K| 1377   (1)|Q1,00| PCWP |           |
      | 11| PX RECEIVE           |        |  940K| 1377   (1)|Q1,02| PCWP |           |
      | 12| PX SEND HYBRID HASH  |:TQ10001|  940K| 1377   (1)|Q1,01| P->P |HYBRID HASH|
      | 13| PX BLOCK ITERATOR    |        |  940K| 1377   (1)|Q1,01| PCWC |           |
      |*14| TABLE ACCESS FULL    |T_CASE1 |  940K| 1377   (1)|Q1,01| PCWP |           |
      --------------------------------------------------------------------------------------
      Nothing surprising, plan you’d expect when dealing with large #rows
      Maybe PX Skewness? Can’t use V$PQ_TQSTAT, we CTRL+Ced exec
      Not downgraded, used 8 processes
  27. 27. 3 – How is my PX doing? PX Skew & ASH data
      select session_id, session_serial#, program, count(*)
        from gv$active_session_history
       where sql_id = 'frzgf5tc9cscc' and sql_exec_id = 16777217
       group by session_id, session_serial#, program;
      SESSION_ID SESSION_SERIAL# PROGRAM              COUNT(*)
      ---------- --------------- -------------------- --------
               8           55195 oracle@oel7 (P003)        217
             133           37006 oracle@oel7 (P001)         12
      QC not showing, nor most of the other processes; P003 is the top consumer
      By adding more ASH columns to the SQL we can drill down, e.g. to the plan step where time goes
  28. 28. 3 – How is my PX doing? PX Skew & SQL Mon • Much of the PX info in SQL Mon does NOT come from ASH
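The drill-down mentioned on slide 27 could look like the following. It is a sketch rather than the query used in the session: it adds the plan line and wait event dimensions to the same ASH filter, and treats a NULL event as time on CPU, which is the common convention for ASH samples.

```sql
-- Sketch (not from the original slides): where do the samples of each PX
-- process land in the plan, and on which wait event?
select program,
       sql_plan_line_id,
       sql_plan_operation,
       sql_plan_options,
       nvl(event, 'ON CPU') event,   -- NULL event is read here as CPU time
       count(*)             samples
  from gv$active_session_history
 where sql_id = 'frzgf5tc9cscc'
   and sql_exec_id = 16777217
 group by program, sql_plan_line_id, sql_plan_operation, sql_plan_options,
          nvl(event, 'ON CPU')
 order by samples desc;
```

The top rows point to the process and plan step absorbing most of the sampled time, which is usually enough to confirm or rule out skew.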
  29. 29. 3 – How is my PX doing? PX Skew Summary • Questions answered • Presence of skewness during / after SQL execution • Regardless of V$PQ_TQSTAT view (tricky to use) • Needs SQL Monitor to have low level info (buffer gets, accurate time, etc) • Questions not answered • What causes the skewness and how to resolve it (not investigated here)
  30. 30. 4 – My PX SQL performance is unstable
      SQL ID: 8nkpzgz08mdc8
      select /*+ PARALLEL(4) */ count(*) from t_case3 a, t_case3 b where a.object_id = b.object_id;
      select child_number, elapsed_time, buffer_gets, executions, px_servers_executions
        from gv$sql
       where sql_id = '8nkpzgz08mdc8';
      CHILD_NUMBER ELAPSED_TIME BUFFER_GETS EXECUTIONS PX_SERVERS_EXECUTIONS
      ------------ ------------ ----------- ---------- ---------------------
                 0    3,498,326      97,015          3                     0
                 1   13,187,086     196,073          0                    12
  31. 31. 4 – PX SQL perf unstable - Checking plan
      -------------------------------------------------------------------------------------
      | Id|Operation             |Name    |E-Rows|Cost(%CPU)| TQ  |IN-OUT|PQ Distrib |
      -------------------------------------------------------------------------------------
      |  0|SELECT STATEMENT      |        |      |7375 (100)|     |      |           |
      |  1| SORT AGGREGATE       |        |     1|          |     |      |           |
      |  2| PX COORDINATOR       |        |      |          |     |      |           |
      |  3| PX SEND QC (RANDOM)  |:TQ10002|     1|          |Q1,02| P->S |QC (RAND)  |
      |  4| SORT AGGREGATE       |        |     1|          |Q1,02| PCWP |           |
      |* 5| HASH JOIN            |        |   74M|7375   (1)|Q1,02| PCWP |           |
      |  6| PX RECEIVE           |        | 2363K|3664   (1)|Q1,02| PCWP |           |
      |  7| PX SEND HYBRID HASH  |:TQ10000| 2363K|3664   (1)|Q1,00| P->P |HYBRID HASH|
      |  8| STATISTICS COLLECTOR |        |      |          |Q1,00| PCWC |           |
      |  9| PX BLOCK ITERATOR    |        | 2363K|3664   (1)|Q1,00| PCWC |           |
      |*10| TABLE ACCESS FULL    |T_CASE3 | 2363K|3664   (1)|Q1,00| PCWP |           |
      | 11| PX RECEIVE           |        | 2363K|3664   (1)|Q1,02| PCWP |           |
      | 12| PX SEND HYBRID HASH  |:TQ10001| 2363K|3664   (1)|Q1,01| P->P |HYBRID HASH|
      | 13| PX BLOCK ITERATOR    |        | 2363K|3664   (1)|Q1,01| PCWC |           |
      |*14| TABLE ACCESS FULL    |T_CASE3 | 2363K|3664   (1)|Q1,01| PCWP |           |
      -------------------------------------------------------------------------------------
      Does 4 slaves make sense looking at this plan vs SQL?
  32. 32. 4 – PX SQL perf unstable – GV$SQL “history”
      After 1st exec
      CHILD_NUMBER ELAPSED_TIME BUFFER_GETS EXECUTIONS PX_SERVERS_EXECUTIONS
      ------------ ------------ ----------- ---------- ---------------------
                 0       16,630         212          1                     0
                 1    8,496,684      98,491          0                     8
      After 2nd exec
      CHILD_NUMBER ELAPSED_TIME BUFFER_GETS EXECUTIONS PX_SERVERS_EXECUTIONS
      ------------ ------------ ----------- ---------- ---------------------
                 0    3,491,757      96,975          2                     0
                 1    8,496,684      98,491          0                     8
      After 3rd exec
      CHILD_NUMBER ELAPSED_TIME BUFFER_GETS EXECUTIONS PX_SERVERS_EXECUTIONS
      ------------ ------------ ----------- ---------- ---------------------
                 0    3,498,326      97,015          3                     0
                 1   13,187,086     196,073          0                    12
      We got lucky here
      Info is accumulated, thus it is very hard to spot downgrades
  33. 33. 4 – PX SQL perf unstable – ASH data
      select distinct sql_exec_id, sql_exec_start,
             case when px_flags is null then 'SERIAL'
                  else 'DoP '||trunc(px_flags/2097152) end dop
        from gv$active_session_history
       where sql_id = '8nkpzgz08mdc8'
       order by 2;
      SQL_EXEC_ID SQL_EXEC_START      DOP
      ----------- ------------------- ---------
         16777216 2017-08-20/17:08:50 DoP 4
         16777217 2017-08-20/17:09:16 SERIAL
         16777218 2017-08-20/17:10:03 DoP 2
      No need for luck, we got ASH
  34. 34. 5 – My PX SQL perf is unstable – more fun
      SQL ID: gcgmgk8m8v4vm
      with a as (select /*+ materialize parallel(4)*/ a.object_id, b.object_name
                   from t_case3 a, t_case3 b
                  where a.object_id = b.object_id and rownum <= 1e6)
      select count(*)
        from (select /*+ parallel(4) no_merge */ c.object_name
                from t_case3 c, a
               where a.object_id = c.object_id and a.object_name = c.object_name);
      CHILD_NUMBER ELAPSED_TIME BUFFER_GETS EXECUTIONS PX_SERVERS_EXECUTIONS
      ------------ ------------ ----------- ---------- ---------------------
                 0    4,097,033      50,679          1                     0
                 1   14,191,651      98,499          0                     8
  35. 35. 5 – My PX SQL perf is unstable - Checking plan
      ---------------------------------------------------------------------------------------------------------------------
      | Id |Operation                                |Name                      |E-Rows| Cost| TQ  |IN-OUT|PQ Distrib |
      ---------------------------------------------------------------------------------------------------------------------
      |   0|SELECT STATEMENT                         |                          |      |11502|     |      |           |
      |   1| TEMP TABLE TRANSFORMATION               |                          |      |     |     |      |           |
      |   2| LOAD AS SELECT (CURSOR DURATION MEMORY) |SYS_TEMP_0FD9D6B81_119A63B|      |     |     |      |           |
      |*  3| COUNT STOPKEY                           |                          |      |     |     |      |           |
      |   4| PX COORDINATOR                          |                          |      |     |     |      |           |
      |   5| PX SEND QC (RANDOM)                     |:TQ10002                  |   74M| 7375|Q1,02| P->S |QC (RAND)  |
      |   6| BUFFER SORT                             |                          | 1000K|     |Q1,02| PCWP |           |
      |*  7| COUNT STOPKEY                           |                          |      |     |Q1,02| PCWC |           |
      |*  8| HASH JOIN                               |                          |   74M| 7375|Q1,02| PCWP |           |
      |   9| PX RECEIVE                              |                          | 2363K| 3664|Q1,02| PCWP |           |
      |  10| PX SEND HYBRID HASH                     |:TQ10000                  | 2363K| 3664|Q1,00| P->P |HYBRID HASH|
      |  11| STATISTICS COLLECTOR                    |                          |      |     |Q1,00| PCWC |           |
      |  12| PX BLOCK ITERATOR                       |                          | 2363K| 3664|Q1,00| PCWC |           |
      |* 13| TABLE ACCESS FULL                       |T_CASE3                   | 2363K| 3664|Q1,00| PCWP |           |
      |  14| PX RECEIVE                              |                          | 2363K| 3664|Q1,02| PCWP |           |
      |  15| PX SEND HYBRID HASH                     |:TQ10001                  | 2363K| 3664|Q1,01| P->P |HYBRID HASH|
      |  16| PX BLOCK ITERATOR                       |                          | 2363K| 3664|Q1,01| PCWC |           |
      |* 17| TABLE ACCESS FULL                       |T_CASE3                   | 2363K| 3664|Q1,01| PCWP |           |
      |  18| SORT AGGREGATE                          |                          |     1|     |     |      |           |
      |  19| PX COORDINATOR                          |                          |      |     |     |      |           |
      |  20| PX SEND QC (RANDOM)                     |:TQ20002                  |     1|     |Q2,02| P->S |QC (RAND)  |
      |  21| SORT AGGREGATE                          |                          |     1|     |Q2,02| PCWP |           |
      |  22| VIEW                                    |                          | 1000K| 4127|Q2,02| PCWP |           |
      |* 23| HASH JOIN                               |                          | 1000K| 4127|Q2,02| PCWP |           |
      |  24| PX RECEIVE                              |                          | 1000K|  461|Q2,02| PCWP |           |
      |  25| PX SEND HASH                            |:TQ20000                  | 1000K|  461|Q2,00| P->P |HASH       |
      |  26| VIEW                                    |                          | 1000K|  461|Q2,00| PCWP |           |
      |  27| PX BLOCK ITERATOR                       |                          | 1000K|  461|Q2,00| PCWC |           |
      |* 28| TABLE ACCESS FULL                       |SYS_TEMP_0FD9D6B81_119A63B| 1000K|  461|Q2,00| PCWP |           |
      |  29| PX RECEIVE                              |                          | 2363K| 3664|Q2,02| PCWP |           |
      |  30| PX SEND HASH                            |:TQ20001                  | 2363K| 3664|Q2,01| P->P |HASH       |
      |  31| PX BLOCK ITERATOR                       |                          | 2363K| 3664|Q2,01| PCWC |           |
      |* 32| TABLE ACCESS FULL                       |T_CASE3                   | 2363K| 3664|Q2,01| PCWP |           |
      ---------------------------------------------------------------------------------------------------------------------
  36. 36. 5 – My PX SQL perf is unstable – ASH solution
      select sample_time, program, sql_plan_line_id, case when px_flags … dop
        from ash
       where sql_exec_id = 16777216 and sql_id = 'gcgmgk8m8v4vm'
       order by 1, 2;
      SAMPLE_TIME                 PROGRAM              SQL_PLAN_LINE_ID DOP
      --------------------------- -------------------- ---------------- ------
      20-AUG-17 05.50.13.290 PM   oracle@oel7 (P000)                  6 DoP 4
      20-AUG-17 05.50.13.290 PM   oracle@oel7 (P001)                  6 DoP 4
      20-AUG-17 05.50.13.290 PM   oracle@oel7 (P002)                  6 DoP 4
      20-AUG-17 05.50.13.290 PM   oracle@oel7 (P003)                  6 DoP 4
      20-AUG-17 05.50.14.290 PM   oracle@oel7 (P000)                  6 DoP 4
      20-AUG-17 05.50.14.290 PM   oracle@oel7 (P001)                  6 DoP 4
      20-AUG-17 05.50.14.290 PM   oracle@oel7 (P002)                  6 DoP 4
      20-AUG-17 05.50.14.290 PM   oracle@oel7 (P003)                  6 DoP 4
      20-AUG-17 05.50.15.289 PM   oracle@oel7 (P000)                  6 DoP 4
      20-AUG-17 05.50.15.289 PM   oracle@oel7 (P001)                  6 DoP 4
      20-AUG-17 05.50.15.289 PM   oracle@oel7 (P002)                  6 DoP 4
      20-AUG-17 05.50.15.289 PM   oracle@oel7 (P003)                  6 DoP 4
      20-AUG-17 05.50.16.294 PM   sqlplus@Mauros-MBP.w                2 SERIAL
      20-AUG-17 05.50.17.296 PM   sqlplus@Mauros-MBP.w               23 SERIAL
      20-AUG-17 05.50.18.296 PM   sqlplus@Mauros-MBP.w               23 SERIAL
      20-AUG-17 05.50.19.296 PM   sqlplus@Mauros-MBP.w               23 SERIAL
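The query on slide 36 is abbreviated. A complete version might read as follows, under two assumptions that are not confirmed by the slide: that `ash` stands for GV$ACTIVE_SESSION_HISTORY, and that the elided CASE expression is the same DoP decoding shown on slide 33.

```sql
-- Sketch: full form of the abbreviated slide-36 query.
-- Assumptions: "ash" = gv$active_session_history; CASE reused from slide 33.
select sample_time,
       program,
       sql_plan_line_id,
       case when px_flags is null then 'SERIAL'
            else 'DoP '||trunc(px_flags/2097152)
       end dop
  from gv$active_session_history
 where sql_exec_id = 16777216
   and sql_id = 'gcgmgk8m8v4vm'
 order by sample_time, program;
```

Reading sql_plan_line_id next to dop shows which part of the plan ran parallel (line 6, in the first DFO tree) and which ran serial (line 23, in the second), which is what makes a per DFO-tree downgrade visible.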
  37. 37. 5 – My PX SQL perf is unstable – SQLMon sol
  38. 38. 5 – My PX SQL perf is unstable – SQLM poking
      select sid, process_name, px_maxdop, px_servers_requested, px_servers_allocated,
             px_server#, px_server_group, px_server_set, px_qcsid
        from gv$sql_monitor
       where sql_exec_id = 16777216 and sql_id = 'gcgmgk8m8v4vm'
       order by px_server_set nulls first, px_server# nulls first;
      SID PROCE PX_MAXDOP PX_S_REQUESTED PX_S_ALLOC PX_SERVER# PX_SERVER_GROUP PX_SERVER_SET PX_QCSID
      --- ----- --------- -------------- ---------- ---------- --------------- ------------- --------
      244 ora           4             16          8
      373 p000                                               1               1             1      244
      132 p001                                               2               1             1      244
      256 p002                                               3               1             1      244
       13 p003                                               4               1             1      244
      133 p004                                               1               1             2      244
      255 p005                                               2               1             2      244
      372 p006                                               3               1             2      244
       12 p007                                               4               1             2      244
  39. 39. 4 & 5 PX SQL perf unstable – Summary • Questions answered • Ability to determine DoP during / after execution • Regardless of V$PX_SESSION (and other) views • Ability to determine DoP on a per DFO-tree basis • Pretty much impossible from GV$ / AWR • Multiple dimensions can be added to drill down into slave execs (e.g. waits) • SQL Monitor is the only way to extract low level info per slave • For example, buffer gets, accurate time, #rows, starts, etc • Questions not answered • What causes the downgrade (not investigated here)
  40. 40. Trivia – My SQL blew up TEMP
      SQL ID: 0qnb575hn2mkr (FAILS) & dm53symv2vmy6 (WORKS)
      select /*+ PARALLEL(4) LEADING(B A C) USE_SWAP(c) USE_HASH(A) USE_HASH(C) FAILS|WORKS */ count(*)
        from t_case3 a, t_case3 b, t_case3 c
       where a.object_id = b.object_id and a.object_id = c.object_id;
      ERROR at line 1:
      ORA-12801: error signaled in parallel query server P000
      ORA-01652: unable to extend temp segment by 128 in tablespace TEMP
      <<hint, I’m messing with the env and with you>>
  41. 41. Trivia – My SQL blew up TEMP
      select sql_id, child_number, executions, px_servers_executions,
             buffer_gets, disk_reads, direct_reads, direct_writes
        from gv$sql
       where sql_id in ('dm53symv2vmy6','0qnb575hn2mkr')
       order by 1,2;
      SQL_ID CHILD EXECS PX_EXECS BUFFER_GETS DISK_READS DIRECT_READS DIRECT_WRITES
      ------ ----- ----- -------- ----------- ---------- ------------ -------------
      0qnb57     0     1        0         104          0            0             0
      0qnb57     1     0        8     145,756    147,300      147,300        73,253
      dm53sy     0     1        0          15          0            0             0
      dm53sy     1     0        8     145,752    145,068      145,068             0
  42. 42. Trivia – My SQL blew up TEMP
      sql_id = 'dm53symv2vmy6' group by sample_time order by 1;
      SAMPLE_TIME                 SUM(PGA_ALLOCATED) SUM(TEMP_SPACE_ALLOCATED)
      --------------------------- ------------------ -------------------------
      20-AUG-17 07.16.06.259 PM           84,414,464                         0
      20-AUG-17 07.16.07.259 PM          766,644,224                         0
      20-AUG-17 07.16.08.264 PM          766,644,224                         0
      <<…>>
      20-AUG-17 07.16.24.316 PM          766,644,224                         0
      20-AUG-17 07.16.25.316 PM          766,644,224                         0
      sql_id = '0qnb575hn2mkr' group by sample_time order by 1;
      SAMPLE_TIME                 SUM(PGA_ALLOCATED) SUM(TEMP_SPACE_ALLOCATED)
      --------------------------- ------------------ -------------------------
      20-AUG-17 07.17.49.463 PM           17,075,200                61,865,984
      20-AUG-17 07.17.50.463 PM           40,308,736               148,897,792
      <<…>>
      20-AUG-17 07.17.55.467 PM           58,396,672               509,607,936
      20-AUG-17 07.17.56.466 PM           58,396,672               583,008,256
      Used less PGA but spilled to TEMP
  43. 43. Trivia – My SQL blew up TEMP – SQL Monitor (dm53symv2vmy6 and 0qnb575hn2mkr)
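Slide 42 shows only the tail of each query. Judging from the column headers in the output, the full statement was probably close to the following; the FROM clause is an assumption, since the slide does not show which ASH view was queried.

```sql
-- Sketch: likely shape of the slide-42 query (FROM clause is assumed).
-- Sums PGA and TEMP across all sessions (QC + PX servers) at each sample.
select sample_time,
       sum(pga_allocated),
       sum(temp_space_allocated)
  from gv$active_session_history
 where sql_id = 'dm53symv2vmy6'   -- or '0qnb575hn2mkr' for the failing run
 group by sample_time
 order by 1;
```

Summing per sample_time matters for parallel SQL because every PX server contributes its own PGA and TEMP figures to the same sample.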
  44. 44. Trivia – My SQL blew up TEMP
      select sql_id, child_number, optimizer_env_hash_value
        from gv$sql
       where sql_id in ('dm53symv2vmy6','0qnb575hn2mkr');
      SQL_ID        CHILD_NUMBER OPTIMIZER_ENV_HASH_VALUE
      ------------- ------------ ------------------------
      0qnb575hn2mkr            0               3821565029
      0qnb575hn2mkr            1                128879201
      dm53symv2vmy6            0               3821565029
      dm53symv2vmy6            1                128879201
      Same CBO environment, aka same CBO params
      Not all _smm_* params make it into the CBO env!!!
  45. 45. 6 – SQL blew up TEMP – prevention!!
      SQL ID: 8d5h5p8znx8mx
      select /*+ PARALLEL(4) LEADING(B A C) USE_SWAP(c) USE_HASH(A) USE_HASH(C) */ count(*)
        from t_case1 a, t_case1 b, t_case1 c
       where a.object_id = b.object_id and a.object_id = c.object_id;
      << not using GV$/AWR because we need to differentiate per exec >>
  46. 46. 6 – SQL blew up TEMP – history
      1st run
      SAMPLE_TIME                 SQL_EXEC_ID    PGA TEMP
      --------------------------- ----------- ------ ----
      21-AUG-17 10.09.12.674 AM      16777217 105.22   61
      2nd run – data is growing
      --------------------------- ----------- ------ ----
      21-AUG-17 10.15.32.182 AM      16777218  17.12   20
      21-AUG-17 10.15.33.182 AM      16777218  70.12  113
      3rd run – data keeps growing
      --------------------------- ----------- ------ ----
      21-AUG-17 10.16.31.259 AM      16777219   2.26    0
      21-AUG-17 10.16.32.259 AM      16777219  21.69   40
      21-AUG-17 10.16.33.261 AM      16777219  70.12  110
  47. 47. 6 – SQL blew up TEMP – Aggregated history
      Aggregating over a few runs the trend is obvious (increasing memory usage)
      SQL_EXEC_ID SQL_EXEC_START      APPROX_ET                  PGA TEMP
      ----------- ------------------- ----------------------- ------ ----
         16777216 2017-08-21/10:07:19 +000000000 00:00:03.578  47.37   82
         16777217 2017-08-21/10:09:11 +000000000 00:00:01.674 105.22   61
         16777218 2017-08-21/10:15:30 +000000000 00:00:03.182  70.12  113
         16777219 2017-08-21/10:16:31 +000000000 00:00:02.261  70.12  110
         16777220 2017-08-21/10:17:21 +000000000 00:00:04.342 126.12  160
         16777221 2017-08-21/10:18:14 +000000000 00:00:05.449 134.94  188
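The query behind the aggregated history is not shown on the slides. One way to produce a table of that shape is sketched below; it assumes the PGA and TEMP columns are in MB and are the per-execution peak of the per-sample sums, which matches the numbers on slides 46 and 47 but is not stated by the author. The inline view alias and column aliases are made up for illustration.

```sql
-- Sketch (not from the slides): per-execution peak PGA / TEMP from ASH.
-- Assumptions: values reported in MB; peak = max over samples of the
-- sum across all sessions working on that execution.
select sql_exec_id,
       sql_exec_start,
       max(sample_time) - sql_exec_start    approx_et,
       round(max(pga_bytes)/1024/1024, 2)   pga,
       round(max(temp_bytes)/1024/1024)     temp
  from (select sql_exec_id, sql_exec_start, sample_time,
               sum(pga_allocated)        pga_bytes,
               sum(temp_space_allocated) temp_bytes
          from gv$active_session_history
         where sql_id = '8d5h5p8znx8mx'
         group by sql_exec_id, sql_exec_start, sample_time) per_sample
 group by sql_exec_id, sql_exec_start
 order by sql_exec_start;
```

Adding session_id of the query coordinator to the grouping would give the per-session view used later to explain the executions that break the pattern.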
  48. 48. 6 – SQL blew up TEMP – Chart your data! • ASH info is really easy to chart • Faster to consume!
  49. 49. 6 – SQL blew up TEMP – Keep executing
      One new “break the pattern” execution showed up
      SQL_EXEC_ID SQL_EXEC_START      APPROX_ET                  PGA TEMP
      ----------- ------------------- ----------------------- ------ ----
         16777216 2017-08-21/10:07:19 +000000000 00:00:03.578  47.37   82
         16777217 2017-08-21/10:09:11 +000000000 00:00:01.674 105.22   61
         16777218 2017-08-21/10:15:30 +000000000 00:00:03.182  70.12  113
         16777219 2017-08-21/10:16:31 +000000000 00:00:02.261  70.12  110
         16777220 2017-08-21/10:17:21 +000000000 00:00:04.342 126.12  160
         16777221 2017-08-21/10:18:14 +000000000 00:00:05.449 134.94  188
         16777222 2017-08-21/10:26:27 +000000000 00:00:07.482  69.69  109
      Touched less PGA / TEMP but took longer
  50. 50. 6 – SQL blew up TEMP – Drill into 1 exec
      sql_id = '8d5h5p8znx8mx' and sql_exec_id = 16777222
      SAMPLE_TIME                 SID PROGRAM              EVENT
      --------------------------- --- -------------------- -------------------------------
      21-AUG-17 10.26.28.480 AM   254 sqlplus@Mauros-iMac. enq: KO - fast object checkpoint
      21-AUG-17 10.26.29.480 AM   254 sqlplus@Mauros-iMac. enq: KO - fast object checkpoint
      21-AUG-17 10.26.30.481 AM   254 sqlplus@Mauros-iMac. enq: KO - fast object checkpoint
      21-AUG-17 10.26.31.480 AM   254 sqlplus@Mauros-iMac. enq: KO - fast object checkpoint
      21-AUG-17 10.26.32.480 AM   254 sqlplus@Mauros-iMac. enq: KO - fast object checkpoint
      21-AUG-17 10.26.33.480 AM   253 oracle@oel7 (P005)
      21-AUG-17 10.26.33.480 AM   362 oracle@oel7 (P006)
      21-AUG-17 10.26.34.482 AM    16 oracle@oel7 (P003)   direct path write temp
      21-AUG-17 10.26.34.482 AM   135 oracle@oel7 (P000)   direct path write temp
      21-AUG-17 10.26.34.482 AM   255 oracle@oel7 (P001)   direct path write temp
      21-AUG-17 10.26.34.482 AM   373 oracle@oel7 (P002)   direct path write temp
  51. 51. 6 – SQL blew up TEMP – Keep executing
      One more execution showed up
      SQL_EXEC_ID SQL_EXEC_START      APPROX_ET                  PGA TEMP
      ----------- ------------------- ----------------------- ------ ----
         16777216 2017-08-21/10:07:19 +000000000 00:00:03.578  47.37   82
         16777217 2017-08-21/10:09:11 +000000000 00:00:01.674 105.22   61
         16777218 2017-08-21/10:15:30 +000000000 00:00:03.182  70.12  113
         16777219 2017-08-21/10:16:31 +000000000 00:00:02.261  70.12  110
         16777220 2017-08-21/10:17:21 +000000000 00:00:04.342 126.12  160
         16777221 2017-08-21/10:18:14 +000000000 00:00:05.449 134.94  188
         16777222 2017-08-21/10:26:27 +000000000 00:00:07.482  69.69  109
         16777223 2017-08-21/10:34:05 +000000000 00:00:02.126  69.12  108
      Same PGA / TEMP as previous but much faster
  52. 52. 6 – SQL blew up TEMP – Mystery solved
      One more execution showed up, but they are from different sessions
      SID SQL_EXEC_ID SQL_EXEC_START      APPROX_ET                  PGA TEMP
      --- ----------- ------------------- ----------------------- ------ ----
      130    16777219 2017-08-21/10:16:31 +000000000 00:00:02.261  70.12  110
      130    16777220 2017-08-21/10:17:21 +000000000 00:00:04.342 126.12  160
      130    16777221 2017-08-21/10:18:14 +000000000 00:00:05.449 134.94  188
      254    16777222 2017-08-21/10:26:27 +000000000 00:00:07.482  69.69  109
      254    16777223 2017-08-21/10:34:05 +000000000 00:00:02.126  69.12  108
      130    16777224 2017-08-21/10:38:01 +000000000 00:00:05.402 139.94  201
  53. 53. 6 – SQL blew up TEMP – Why not AWR?
      select child_number, executions, px_servers_executions, elapsed_time, direct_writes,
             elapsed_time/nvl(nullif(px_servers_executions,0),executions) et_exec,
             direct_writes/nvl(nullif(px_servers_executions,0),executions) direct_writes_exec
        from gv$sql
       where sql_id = '8d5h5p8znx8mx';
      CHILD_NUMBER EXECS PX_EXECS  ELAP_TIME DIRECT_W      ET_EXEC DIRECT_W_EXEC
      ------------ ----- -------- ---------- -------- ------------ -------------
                 0     9        0  7,803,972        0      867,108             0
                 1     0       71 91,106,462  154,559 1,283,189.61      2,176.88
      You might be able to figure it out from GV$, but you need a lot of imagination and luck
  54. 54. 6 – SQL blew up TEMP – Summary • Questions answered • Ability to monitor spill on a per-execution and per-session basis • AWR would only show aggregated info • Similar info available for IOPS and IO bytes (and memory scan in V$ASH) • Charting info allows easy monitoring • Large amount of info consumed quickly • SQL Monitor relies on the same ASH info • Even without SQL Mon, tons of info can be extracted from ASH
  55. 55. 7 – Making sense of “strange” executions
      SQL ID: 06pbgg9w0bmgp
      select /*+ mauro */ a.*
        from t_case1 a, t_case1 b
       where a.owner = l_owner
         and a.object_id = b.object_id
         and burn_cpu(a.object_id/b.object_id) = 1
      select child_number, executions, end_of_fetch_count, elapsed_time, fetches, rows_processed
        from gv$sql
       where sql_id = '06pbgg9w0bmgp';
      CHILD EXECS EOF_COUNT ELAPSED_TIME FETCHES ROWS_PROCESSED
      ----- ----- --------- ------------ ------- --------------
          0     3         0   30,419,821       6             30
      This is a single session executing the SQL
      Why did none reach EOF?
  56. 56. 7 – Making sense of “strange” executions
      sql_id = '06pbgg9w0bmgp' and session_id = 377 order by sample_time;
      SAMPLE_TIME        SQLEXECID SEXECSTA
      ------------------ --------- --------
      06.04.37.450 PM     16777222 18:04:36
      06.04.38.450 PM     16777222 18:04:36
      06.04.39.450 PM     16777222 18:04:36
      06.04.40.450 PM     16777222 18:04:36
      06.04.41.450 PM     16777223 18:04:41
      06.04.42.450 PM     16777223 18:04:41
      06.04.43.450 PM     16777223 18:04:41
      06.04.44.450 PM     16777223 18:04:41
      06.04.45.450 PM     16777223 18:04:41
      06.04.46.450 PM     16777224 18:04:46
      06.04.47.450 PM     16777224 18:04:46
      06.04.48.450 PM     16777224 18:04:46
      06.04.49.450 PM     16777224 18:04:46
      06.04.50.450 PM     16777224 18:04:46
      06.04.51.450 PM     16777224 18:04:46
      06.04.52.450 PM     16777223 18:04:41
      06.04.53.450 PM     16777223 18:04:41
      06.04.54.450 PM     16777223 18:04:41
      06.04.55.450 PM     16777223 18:04:41
      06.04.56.450 PM     16777223 18:04:41
      06.04.57.450 PM     16777224 18:04:46
      06.04.58.450 PM     16777224 18:04:46
      06.04.59.450 PM     16777224 18:04:46
      06.05.00.450 PM     16777224 18:04:46
      06.05.01.450 PM     16777224 18:04:46
      06.05.02.450 PM     16777222 18:04:36
      06.05.03.450 PM     16777222 18:04:36
      06.05.04.450 PM     16777222 18:04:36
      06.05.05.450 PM     16777222 18:04:36
      06.05.06.450 PM     16777222 18:04:36
  57. 57. 7 – Making sense of “strange” executions - Summary • Questions answered • ASH data can be used to “slice” GV$ data and make more sense out of it • In this specific case maybe not a cursor leak • Since the cursor is used multiple times • Same approach could be used to potentially spot a cursor leak • Would require the SQL to take “long” enough to spot it • Question not answered • Why would somebody do anything like this
  58. 58. Something worth knowing • ASH data uses default values until the real value is “ready to consume” • Adaptive Plans could take a while to resolve and until then PHV is 0
      select /*+ LEADING(a) */ count(a.object_id)
        from (select /*+ no_merge leading (a) */ 1 object_id, 'a' owner
                from (select rownum from dual connect by rownum <= 1000) a,
                     (select rownum from dual connect by rownum <= 1000) b) a,
             (select a.object_id
                from t1 a, t2 b
               where a.object_id = b.n1
                 and a.data_object_id = 1
                 and a.owner = 'SYS') b
       where a.object_id = b.object_id
  59. 59. Something worth knowing
      ------------------------------------------------------------------------------------
      |  Id  |Operation                      |Name|E-Rows|Cost (%CPU)| Pstart| Pstop |
      ------------------------------------------------------------------------------------
      |     0|SELECT STATEMENT               |    |      |  679 (100)|       |       |
      |     1| SORT AGGREGATE                |    |     1|           |       |       |
      |- *  2| HASH JOIN                     |    |     1|  679   (1)|       |       |
      |     3| NESTED LOOPS                  |    |     1|  679   (1)|       |       |
      |-    4| STATISTICS COLLECTOR          |    |      |           |       |       |
      |  *  5| HASH JOIN                     |    |     1|  405   (1)|       |       |
      |     6| VIEW                          |    |     1|    4   (0)|       |       |
      |     7| MERGE JOIN CARTESIAN          |    |     1|    4   (0)|       |       |
      |     8| VIEW                          |    |     1|    2   (0)|       |       |
      |     9| COUNT                         |    |      |           |       |       |
      |    10| CONNECT BY WITHOUT FILTERING  |    |      |           |       |       |
      |    11| FAST DUAL                     |    |     1|    2   (0)|       |       |
      |    12| BUFFER SORT                   |    |     1|    4   (0)|       |       |
      |    13| VIEW                          |    |     1|    2   (0)|       |       |
      |    14| COUNT                         |    |      |           |       |       |
      |    15| CONNECT BY WITHOUT FILTERING  |    |      |           |       |       |
      |    16| FAST DUAL                     |    |     1|    2   (0)|       |       |
      |  * 17| TABLE ACCESS FULL             |T1  |     1|  401   (1)|       |       |
      |    18| PARTITION RANGE ITERATOR      |    | 49999|  274   (0)|  KEY  |  KEY  |
      |  * 19| TABLE ACCESS FULL             |T2  | 49999|  274   (0)|  KEY  |  KEY  |
      |-   20| PARTITION RANGE JOIN-FILTER   |    | 49999|  274   (0)|:BF0000|:BF0000|
      |-   21| TABLE ACCESS FULL             |T2  | 49999|  274   (0)|:BF0000|:BF0000|
      ------------------------------------------------------------------------------------
  60. 60. Something worth knowing
      select sample_time, sql_plan_hash_value, sql_plan_line_id
        from gv$active_session_history
       where sql_id = '8x52hyvsh1j45'
       order by sample_time
      SAMPLE_TIME                 SQL_PLAN_HASH_VALUE SQL_PLAN_LINE_ID
      --------------------------- ------------------- ----------------
      28-AUG-17 03.42.43.251 PM                     0                7
      28-AUG-17 03.42.44.251 PM                     0                6
      28-AUG-17 03.42.45.251 PM                     0                6
      28-AUG-17 03.42.46.252 PM                     0                7
      28-AUG-17 03.42.47.252 PM                     0                7
      28-AUG-17 03.42.48.252 PM                     0                5
  61. 61. Things we just can’t do (as of now) • Current diagnostics are very comprehensive • They allow answering many questions around SQL execution • Still some questions unanswered, some examples • SQL Plan Baseline / SQL Patch used or not in the past (AWR limitation) • High Version Count in the past (AWR “limitation”) • Details of “old” CBO environment (encoded, no public API) • Historical binds for slow execution (unless captured, requires luck) • Changes in NLS environment in the past (current, V$SQL_SHARED_CURSOR) • Probably not a big problem, unless you hit it
  62. 62. Summary • Oracle diagnostics rock when used properly • No single source of info, needs combining to get the full picture • ASH provides a different point of view into SQL execution • Needed more than expected • Regardless of the source, visualizing things makes it easier • But this is Enkitec so you are stuck with me & SQL*Plus • SQL Monitoring fills some of the gaps • Still, much of the info comes from ASH, available even historically (more than SQLMon) • Statspack + free ASH can provide useful info • Unfortunately not as comprehensive as the “real” ones
  64. 64. Contact Information • Blog: http://mauro-pagano.com • Free tools • SQLd360 • TUNAs360 • Pathfinder • An “interesting” post every N posts • Email: mauro.pagano@gmail.com • Twitter: @mautro
