Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1

8,421 views

Published on

Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1

Published in: Technology

Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1

  1. 1. Troubleshoo4ng  the  Most  Complex  Performance   Issue  I’ve  ever  seen   Tanel  Poder   hAp://blog.tanelpoder.com   hAp://tech.e2sn.com       www.enkitec.com   1      
  2. 2. Intro:  About  me   •  Tanel  Põder   Oracle  Database  Performance  geek   Exadata  Performance  geek   Hadoop  Performance  geek   •  Enkitec   •  Consultant   •  Researcher   •  Technology  Evangelist   •  Just  moved  to  Dallas   •  ANer  Tallinn  -­‐>  Stockholm  -­‐>  London  -­‐>   Cancun  -­‐>  Singapore   www.enkitec.com   Expert  Oracle  Exadata   book   (with  Kerry  Osborne  and   Randy  Johnson  of  Enkitec)   2      
  3. 3. Two  issues  -­‐  actually   •  For  warm-­‐up:   •  cursor  pin:  S  wait  events  and  sporadic  CPU  spikes   •  Read  more  from  my  blog  entry:   •  hAp://blog.tanelpoder.com/2010/04/21/cursor-­‐pin-­‐s-­‐waits-­‐sporadic-­‐cpu-­‐ spikes-­‐and-­‐systema4c-­‐troubleshoo4ng/   •  Or  just  google  for  “cursor  pin  s”       www.enkitec.com   3      
  4. 4. Environment   •  •  •  •  High-­‐concurrency,  high-­‐visibility  OLTP  database   Oracle  11.1.0.7  single-­‐instance,  dedicated  server  processes   HP-­‐UX  on  Itanium   32  CPUs,  128  GB  RAM   •  Thousands  of  end  users   •  Mul4ple  WebLogic  applica4on  servers  talking  to  database  via   connec4on  pools   www.enkitec.com   4      
  5. 5. The  problem   •  •  •  •  Sporadic  extreme  slowness  of  Oracle  DB  and  the  server   Slowness  lasts  for  1  ..  20  minutes  at  a  4me…   Queries  don’t  answer  or  extremely  slow   Can’t  even  log  on  to  OS  during  that  4me   •  New  SSH  connec4ons  succeded  once  the  spike  was  over   •  It  takes  minutes  to  run  simple  OS  commands  during  the  problem  4me   •  This  is  a  global  server-­‐wide  problem  –  everyone  complains!   So,  the  scope  of  this  problem   is  global,  server-­‐wide.   Therefore  we  can  use  global,   server-­‐wide  metrics  to   diagnose  the  problem.   www.enkitec.com   5      
  6. 6. Let’s  pick  and  diagnose  one  occurrence  of  this  problem   •  The  database  response  4mes  extremely  bad  again  around   18:10  and  this  lasted  for  about  5  minutes…   •  If  it’s  the  users  who  report  the  problem  (as  opposed  to  applica4on   side  measurements),  then  there  may  be  some  discrepancies  in  the   user  reported  4mes  vs  actual  problem  4me     www.enkitec.com   6      
  7. 7. Ini4al  AWR  Report   Number  of  sessions   has  grown  by  ~500!   Snap Id Snap Time Sessions Curs/Sess --------- ------------------- -------- --------Begin Snap: 61921 30-Oct-10 18:00:10 2,383 28.9 Using  a  20  minute   End Snap: 61922 30-Oct-10 18:20:20 2,863 24.7 report  for  diagnosing   Elapsed: 20.17 (mins) a  5  minute  problem?!   DB Time: 559.31 (mins) Avg % DB Event Waits Time(s) (ms) time Wait Class --------------------------- --------- ----------- ------ ------ ---------db file sequential read 2,135,668 21,468 10 64.0 User I/O DB CPU 5,860 17.5 log file sync 92,720 1,498 16 4.5 Commit read by other session 91,676 1,307 14 3.9 User I/O SQL*Net message from dblink 525 1,132 2155 3.4 Network This  66%  idle  is  an   average  over  20   minutes!   Host CPU (CPUs: ~~~~~~~~ 32 Cores: 32 Sockets: 32) Load Average Begin End %User %System %WIO %Idle --------- --------- --------- --------- --------- --------0.37 0.31 16.3 17.6 12.1 66.1 www.enkitec.com   7      
  8. 8. ASH  data  (shown  in  OEM)   •  Average  ac4ve  sessions  showed  something  different   •  Note  that  this  data  is  from  another  period  of  4me  when  a  similar  spike  happened   •  In  worst  4mes  there  were  up  to  220  ac4ve  sessions  trying  to  be  on  CPU!   •  Thanks  to  beAer  granularity  we  see  the  spikes  instead  of  some  20-­‐minute  or  hourly   averages…   •  The  problem  with  ASH  samples  is  that  it  looks  into  session  state  from  inside  Oracle   •  Perhaps  the  starva5on  is  due  to  some  other  applica5on  /  instance  in  the  server?   www.enkitec.com   8      
  9. 9. How  many  logons  were  done?   Number  of  sessions   has  grown  by  ~500!   Snap Id Snap Time Sessions Curs/Sess --------- ------------------- -------- --------Begin Snap: 61921 30-Oct-10 18:00:10 2,383 28.9 End Snap: 61922 30-Oct-10 18:20:20 2,863 24.7 Elapsed: 20.17 (mins) DB Time: 559.31 (mins) Statistic Total per Second -------------------------------- ------------------ -------------index fetch by key 24,174,148 19,971.0 index scans kdiixs1 24,565,055 20,293.9 leaf node 90-10 splits 5,865 4.9 leaf node splits 14,529 12.0 lob reads 34,480 28.5 lob writes 1,623,273 1,341.0 lob writes unaligned 1,623,266 1,341.0 logons cumulative 2,550 2.1 messages received 133,740 110.5 messages sent 133,740 110.5 min active SCN optimization appl 538,358 444.8 no buffer to keep pinned count 6,331 5.2 no work - consistent read gets 146,703,542 121,196.1 opened cursors cumulative 4,168,700 3,443.9 www.enkitec.com   2.1  logons  per   second,  but  we   don’t  know  how   these  logons  are     per Trans ------------distributed  over   43.6 the  20  minute   44.3 period!   0.0 0.0 0.1 2.9 2.9 0.0 0.2 0.2 1.0 0.0 264.7 7.5 9      
  10. 10. OS  level  metrics  don’t  lie  (well,  they  do,  but  less  ;-­‐)   | | | Date | Time |CPU % | 10/30/2010|17:45:00| 25.06| 10/30/2010|17:50:00| 24.77| 10/30/2010|17:55:00| 24.60| 10/30/2010|18:00:00| 25.95| 10/30/2010|18:05:00| 22.88| 10/30/2010|18:10:00| 66.89| 10/30/2010|18:15:00| 24.51| 10/30/2010|18:20:00| 25.47| 10/30/2010|18:25:00| 28.38| Phys | IO Rt | 7149.5| 5334.8| 7176.4| 7556.2| 5379.5| 4544.6| 7544.9| 5144.0| 10139.5| Phys |Memory|Pg Out | VM Pg | KB Rt | % | Rate | Scan Rt | 145817.6| 74.07| 0.0| 0.0| 60928.0| 73.98| 0.0| 0.0| 186368.0| 73.98| 0.0| 0.0| 192307.2| 74.11| 0.0| 0.0| 67584.0| 74.15| 0.0| 0.0| 58060.8| 76.97| 0.0| 0.0| 159334.4| 76.40| 0.0| 0.0| 59187.2| 75.04| 0.0| 0.0| 151552.0| 74.37| 0.0| 0.0| 1)  What  does  the  Time  18:10:00  mean,  beginning  of  the   monitoring  interval  or  end?   2)  66.89%  busy  during  5  minutes  may  actually  mean  100%  busy   during  ~3  minutes  out  of  5,  but  we  don’t  know  that  for  sure   without  measuring  in  more  detail  (beAer  granularity)…   www.enkitec.com   10      
  11. 11. Measuring  CPU  u4liza4on  in  more  detail   The  spike  lasted  from   18:11  to  18:14  (3  min)   Around  90%  in   Kernel  mode!!!   www.enkitec.com   11      
  12. 12. Checkpoint  –  measured  evidence  so  far   •  Fact:  We  have  a  100%  CPU  u4liza4on  spike,  las4ng  3  minutes   •  Fact:  90%  of  it  is  spent  in  KERNEL  mode   •  Fact:  We  have  over  2500  logons  done  during  20  minute   period   •  2.1  logons  /  second  on  average  (which  doesn’t  sound  bad)   •  Kernel  mode  CPU  usage  is  usually  caused  by  system  calls   •  …or  some  internal  kernel  thread  ac4vity   100%  CPU  usage  doesn’t  always  automa4cally  mean  you  have  a  serious  CPU   starva4on  problem.  The  CPU  runqueue  length  would  indicate  you  how  much   starva4on  (wai4ng  for  CPU  service)  there  is.  However  seeing  90%  of  CPU  used  in   KERNEL  mode  is  definitely  not  normal  for  an  Oracle  database  server.   www.enkitec.com   12      
  13. 13. Diagnosing  90%  kernel-­‐mode  CPU  usage  spikes…   1.  Systema4c   •  •  •  •  Break  down  this  90%  of  Kernel  mode  CPU  usage   Profiling!   Oh,  this  is  a  produc4on  system  and  the  problem  is  acute  &  ongoing   On  Solaris,  I’d  have  used  Dtrace  stack()  probe  to  record  OS  kernel   stack  traces  most  common  on  CPU  (google  for  dstackprof)   •  •  Or  lockstat  as  it  reports  spins  on  spinlocks  (which  consume  kernel  CPU)   But  this  was  HP-­‐UX  and  I  didn’t  know  the  tools  needed   •  •  But  I  knew  what  numbers  I  wanted  to  see!   We  sent  a  request  to  HP-­‐UX  support:     “How  do  we  measure  &  break  down  where  is  kernel  mode  CPU  used?”   2.  Check  for  usual  suspects   •  Fast,  cheap  checks  to  rule  out  or  find  known  troublemakers   www.enkitec.com   13      
  14. 14. Kernel  mode  CPU  usage  spikes  –  the  usual  suspects   •  Before  star4ng  the  systema4c  troubleshoo4ng  &  drilldown,   do  quick  checks  for  usual  suspects   •  Remember,  the  client  has  a  business  problem,  4me  is  of  essence…   1.  Logon  (or  logoff)  storms   •  Spawning,  ini4alizing  new  processes,  opening  files  and  aAaching  to   SGA  means  system  calls,  kernel  CPU  usage   2.  Oracle  code  gevng  into  some  crazy  loop  (due  to  a  bug)   •  Semop(),  yield(),  read  /proc/…,  getrusage(),  etc  loop   3.  OS  kernel  spinlock  conten4on   •  •  Not  so  usual  suspects  really…   Variety  of  reasons…   ONen  due  to  bugs  in  OS  or  some  kernel  module   www.enkitec.com   14      
  15. 15. Measuring  logon  storms   •  Use  the  AUD$  records  or  “logons  cumula@ve”  number  from     V$SYSSTAT  or  AWR,  right?   •  Wrong!   •  logons  cumula5ve  number  is  incremented  by  the  session  itself   –  aNer  it  has  logged  on,  the  same  applies  to  audit  records!   1.  2.  3.  4.  5.  ANer  the  listener  connec4on  has  been  established…   The  process  has  been  started…   It  has  aAached  to  SGA  SHM  segments…   Audit  file  has  been  wriAen  (if  needed)  …   Process,  session  SGA  structures  have  been  created   •  Memory  from  OS  and  shared  pool  allocated  (shared  pool  latches!)   6.  Session  has  been  authen4cated   7.  Then  the  logons  cumula4ve  is  incremented!     www.enkitec.com   The  logon  storm  may   have  started  way   before  these  logons   finally  succeeded!     15      
  16. 16. Measuring  logon  storms   •  Logon  storms  should  be  measured  at  the  listener  level   •  Process  listener.log  using  a  script:   $ tail listener.log 30-OCT-2010 23:22:03 30-OCT-2010 23:22:08 30-OCT-2010 23:22:08 30-OCT-2010 23:22:09 * * * * (CONNECT_DATA=…) (CONNECT_DATA=…) (CONNECT_DATA=…) service_update * * establish * E2SNDB * 0 * establish * E2SNDB * 0 * establish * E2SNDB * 0 E2SNDB * 0 $ fgrep "30-OCT-2010" listener.log | fgrep "establish" | awk '{ print $1 " " $2 }' | awk -F: '{ print $1 }' | sort | uniq –c … 5564 30-OCT-2010 20 5571 30-OCT-2010 21 5703 30-OCT-2010 22 5703  logons  during   4828 30-OCT-2010 23 22:00  –  22:59   … www.enkitec.com   16      
  17. 17. Measuring  logon  storms   •  Logon  storms  should  be  measured  at  the  listener  level   •  Process  listener.log  using  a  script:   $ fgrep "30-OCT-2010 22:" listener.log | fgrep "establish" | awk '{ print $1 " " $2 }' | awk -F: '{ print $1 ":" $2 }' | sort | uniq –c … 88 30-OCT-2010 22:00 120 30-OCT-2010 22:01 94 30-OCT-2010 22:02 How  many  connec4ons   94 30-OCT-2010 22:03 listener  established  every   95 30-OCT-2010 22:04 minute  (this  data  from  non-­‐ 120 30-OCT-2010 22:05 problem  4me)   79 30-OCT-2010 22:06 101 30-OCT-2010 22:07 85 30-OCT-2010 22:08 100 30-OCT-2010 22:09 85 30-OCT-2010 22:10 89 30-OCT-2010 22:11 83 30-OCT-2010 22:12 93 30-OCT-2010 22:13 www.enkitec.com   17      
  18. 18. Con4nuing  the  OS  kernel  mode  CPU  usage  diagnosis   Let’s  trace  system   calls  by  one  CPU   heavy  Oracle  process   # tusc -cp 8021 ( Attached to process 8021 ("oracleXYZ (LOCAL=NO)") [64-bit] ) ( Detaching from process 8021 ("oracleXYZ (LOCAL=NO)") ) Syscall open ----Total Interes4ngly  3  open()  syscalls   take  over  a  second  in  total.   Could  this  be  caused  by  the   general  CPU  starva4on  in  the   server?   Seconds 1.05 ----1.05 Calls 3 ----3 The  next  step  should  have   been  to  check  which  file  did   we  try  to  open  (but  the  spike   ended  before  we  could  do   that)   www.enkitec.com   Errors 3 ----3 Seeing  errors  isn’t  a   problem  itself,  as  this   is  how  Oracle  and  libs   some4mes  check  for   existence  of  a  file…   18      
  19. 19. Con4nuing  the  OS  kernel  mode  CPU  usage  diagnosis   •  HP-­‐UX  support  got  back  to  us  and  recommended  the  use  of   runki  tool  for  measuring  kernel  CPU  u4liza4on  in  detail   •  It  had  to  be  installed  as  root   •  It  measured  a  lot  of  things  happening  in  kernel,  wri4ng  a  big  output   file   •  on  Solaris  there’s  a  tool  called  TNF  trace,  AIX  has  trace  command  for  such   full  kernel  tracing   •  Now  we  had  to  wait  for  the  problem  to  happen  again   •  Didn’t  have  to  wait  for  too  long…   •  We  sent  the  raw  trace  dump  to  HP  Support,  so  they  could  run   something  like  “tkprof”  on  that  tracefile   •  It  basically  just  summed  up  the  kernel  spinlock  wait,  spin  events  by  lock,   object  locked  etc   www.enkitec.com   19      
  20. 20. kiprof  –  profiled  runki  output   Total Hardclock traces: 30239 ================================ State Count Percent USER 7130 23.58 SYS 22965 75.94 IDLE 24 0.08 SSYS 120 0.40   That’s  basically   spinning  for  locks   (adap4ve  decisions  to   spin  or  wait)   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Kernel Functions executed during profile Count Pct State Function Self-­‐explanatory.  We   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 15615 51.64% SYS timed_preArbitration are  spinning  for  a  lock   7130 23.58% USER OTHER 1275 4.22% SYS spinlock 1008 3.33% SYS wait_for_lock_spinner VxFS  directory  block   488 1.61% SYS vx_dirbread read?   464 1.53% SYS vx_bc_getblk 417 1.38% SYS preArbitration 338 1.12% SYS vx_dirscan VxFS  directory   291 0.96% SYS vx_bc_brelse contents  scan  !!!   www.enkitec.com   20      
  21. 21. Drilling  down  to  kernel  spinlock  spinning   The  main  spinlock  experiencing  spinning  was  related  to  VxFS:   spn%kern 7.24 0.11 0.04 0.04 0.01 cumpct 7.24 7.36 7.40 7.44 7.45 seconds 71.96 1.14 0.44 0.42 0.12 spn%cpu lock name 3.77 FS:vxfs:bc_freelist_lock spin 0.06 FS:vxfs:inode spin for sleep lock 0.02 FS:vxfs:i_spinspin lock 0.02 Sleep Queue lock 0.01 v_count_lock Oracle  processes  were  the  main  ones  spinning:   spn%spn 99.17 0.60 0.08 0.05 0.03 0.03 cumpct 99.17 99.77 99.85 99.91 99.93 99.96 spnsec 73.82 0.45 0.06 0.04 0.02 0.02 usrsec 880.32 0.00 1.42 0.00 0.00 0.00 kernsec spn%kern process name 543.70 13.58 oracle 7.37 6.11 vxfsd 11.52 0.52 tnslsnr 402.60 0.01 [IDLE] 0.50 4.00 xyz 0.43 4.65 sadc www.enkitec.com   21      
  22. 22. Which  system  calls?   $ /opt/ktools/bin/kiinfo -kitrace pid=28679,syscalls -ts 1103_1307 | grep open | sort -r -n -k 13,13 | more 2.550292 cpu=6 seqcnt=43808107 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 3.295138 ktcsys= 0.000000 A0=0x9fffffff A1= 0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/ admin/xyzdb/adump/xyzdb_ora_28679_21.aud' 8.789893 cpu=6 seqcnt=43825746 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 3.172851 ktcsys= 0.000000 A0=0x9fffffff A1= 0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/ admin/xyzdb/adump/xyzdb_ora_28679_23.aud' 5.617009 cpu=6 seqcnt=43816702 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 3.066686 ktcsys= 0.000000 A0=0x9fffffff A1= 0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/ admin/xyzdb/adump/xyzdb_ora_28679_22.aud' 18.143066 cpu=6 seqcnt=43848837 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 2.576237 ktcsys= 0.000000 A0=0x9fffffff A1= 0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/ admin/xyzdb/adump/xyzdb_ora_28679_26.aud' www.enkitec.com   22      
  23. 23. Audit  file  des4ona4on   •  New  audit  file  name  format  in  11g…   •  A  new  file  is  created  for  each  audit  file  where  the  SPID  collides  with  a   previous  file…   •  Every  @me  when  crea@ng  a  new  audit  file,  Oracle  has  to  check  whether   such  file  already  exists  with  suffix  _1,  then  _2,  _3,  etc..   $ cd /u01/app/oracle/admin/E2SNDB/adump $ ls -l | head total 4788 -rw-r----- 1 oracle dba 735 Feb 28 16:06 -rw-r----- 1 oracle dba 710 Oct 16 17:58 -rw-r----- 1 oracle dba 735 Oct 16 17:58 -rw-r----- 1 oracle dba 735 Feb 27 17:53 -rw-r----- 1 oracle dba 736 Oct 16 17:58 -rw-r----- 1 oracle dba 740 Oct 16 17:58 -rw-r----- 1 oracle dba 735 Feb 28 16:07 -rw-r----- 1 oracle dba 735 Feb 24 17:44 -rw-r----- 1 oracle dba 735 Dec 22 21:28 www.enkitec.com   e2sndb_ora_10028_1.aud e2sndb_ora_10082_1.aud e2sndb_ora_10082_2.aud e2sndb_ora_10095_1.aud e2sndb_ora_10120_1.aud e2sndb_ora_10125_1.aud e2sndb_ora_10158_1.aud e2sndb_ora_10206_1.aud e2sndb_ora_10482_1.aud 23      
  24. 24. Shouldn’t  the  audit  files  be  created  only  for  SYSDBA  and   SYSOPER  access?   •  In  theory,  yes.   •  In  prac4ce  in  our  case,  no.   •  Bug  9744092:  EXCESSIVE  AMOUNT  OF  AUD  FILES  BEING   GENERATED  IN  11.1   •  Oracle  generated  a  new  .aud  file  for  every  new  database  connec4on!   •  Not  just  SYSOPER/SYSDBA  like  normally   •  This  is  all  despite  having  audit_trail  =  DB     •  Normally  the  .aud  files  in  audit_dump_dest  are  not  related  to  regular   audit  trail,  but  for  SYSOPER/SYSDBA  logon,  startup/shutdown  audi4ng   •  …and  if  AUDIT_SYS_OPERATIONS  =  TRUE  then  all  commands  issued  as  SYS   www.enkitec.com   24      
  25. 25. Bug  9744092:  EXCESSIVE  AMOUNT  OF  AUD  FILES  BEING   GENERATED  IN  11.1   PROBLEM: -------After upgrade to 11.1 the system creates 10 - 16 *aud files per minute in audit_file_dest. Out of 9528 *.aud files that the customer uploaded, 9124 files recorded ACTION:[3] "102". DIAGNOSTIC ANALYSIS: -------------------The change of behavior (move audit action 102 from aud$ to audit_file_dest when audit_trail=DB.) is due to a fix for an unpublished bug 5476184 in 11.1. It is not an intended feature for 11G. WORKAROUND: ----------Manual delete of audit files www.enkitec.com   25      
  26. 26. Diagnosis   1.  Thanks  to  bug  9744092  and  a  behavior  change  in  Oracle  11.1   a  new  audit  file  was  created  for  each  new  connec4on  to  DB   •  If  a  file  already  existed,  Oracle  checked  if  a  similar  file  name  with   larger  suffix  value  (_2,  _3,  _4  etc)  existed   2.  The  audit_file_dest  eventually  had  over  1.5  M  files  in  it!   •  For  each  logon,  mul4ple  file  existence  checks  (open()  syscalls)  had  to   be  done   3.  Checking  whether  a  file  existed  (open  syscall  -­‐>  directory   entry  scan)  became  very  slow  –  and  it’s  done  in  kernel  mode   •  •  A  spinlock  was  held  during  the  directory  entry  scan   Other  new  Oracle  processes  also  wanted  to  do  the  directory  scan,   resul4ng  in  spinlock  conten4on  and  further  Kernel  mode  CPU  usage   4.  When  the  DB  got  slow  –  app  servers  fired  up  hundreds  of   new  connec4ons  to  “make  things  faster”   •  This  all  fed  back  to  the  problem  –  even  more  conten4on  &  spinning   www.enkitec.com   26      
  27. 27. Limi4ng  logon  storms   Use  Oracle  Listener  connec4on  rate  limiter  (11gR1+)   listener.ora: LISTENER= (ADDRESS_LIST= (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521)(RATE_LIMIT=5)) (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1522)(RATE_LIMIT=10)) (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1523)) )   Oracle  Documenta5on:  Oracle  Net  Listener  Parameters  (listener.ora)   hQp://download.oracle.com/docs/cd/B28359_01/network.111/b28317/listener.htm     Also,  it  is  possible  to  limit  logoff  storm  rate    _logout_storm_rate  parameter  (instance-­‐wide)   www.enkitec.com   27      
  28. 28. Troubleshoo4ng  sporadic  system  performance  issues     Right  Data  !!!     •  Right  scope  –  if  your  problem  lasts  for  seconds,  this  should  be   the  granularity  of  your  data  too   •  OS  level  data,  in  addi4on  to  the  database  metrics   •  Ideally  OS  level  metrics  sampled  mul4ple  4mes  per  minute   www.enkitec.com   28      
  29. 29. Conclusions   •  Logon  storms  are  evil!   •  They  will  amplify  any  performance  hiccups  as  they  cause  extra  load   just  when  the  resources  are  scarcest   •  Connec4on  pools  firing  up  hundreds  of  new  connec4ons  are   evil!   •  Know  your  limits  (both  max  connec4ons  and  max  connect  rate  /  sec)   •  Here’s  a  thought:   •  If  you  have  planned  the  servers’  capacity  to  support  N-­‐thousand   connec4ons  anyway  (by  allowing  connec4on  pools  grow  that  high),  why   not  create  this  amount  of  connec4ons  right  away?   •  This  would  avoid  logon  storms  during  worst  4mes  as  all  connec4ons  have   already  been  created!   www.enkitec.com   29      

×