SlideShare a Scribd company logo
Troubleshoo4ng	
  the	
  Most	
  Complex	
  Performance	
  
Issue	
  I’ve	
  ever	
  seen	
  
Tanel	
  Poder	
  
hAp://blog.tanelpoder.com	
  
hAp://tech.e2sn.com	
  
	
  
	
  
www.enkitec.com	
  

1	
  	
  	
  
Intro:	
  About	
  me	
  
•  Tanel	
  Põder	
  
Oracle	
  Database	
  Performance	
  geek	
  
Exadata	
  Performance	
  geek	
  
Hadoop	
  Performance	
  geek	
  

•  Enkitec	
  
•  Consultant	
  
•  Researcher	
  
•  Technology	
  Evangelist	
  

•  Just	
  moved	
  to	
  Dallas	
  
•  ANer	
  Tallinn	
  -­‐>	
  Stockholm	
  -­‐>	
  London	
  -­‐>	
  
Cancun	
  -­‐>	
  Singapore	
  
www.enkitec.com	
  

Expert	
  Oracle	
  Exadata	
  
book	
  

(with	
  Kerry	
  Osborne	
  and	
  
Randy	
  Johnson	
  of	
  Enkitec)	
  

2	
  	
  	
  
Two	
  issues	
  -­‐	
  actually	
  
•  For	
  warm-­‐up:	
  
•  cursor	
  pin:	
  S	
  wait	
  events	
  and	
  sporadic	
  CPU	
  spikes	
  
•  Read	
  more	
  from	
  my	
  blog	
  entry:	
  
•  hAp://blog.tanelpoder.com/2010/04/21/cursor-­‐pin-­‐s-­‐waits-­‐sporadic-­‐cpu-­‐
spikes-­‐and-­‐systema4c-­‐troubleshoo4ng/	
  
•  Or	
  just	
  google	
  for	
  “cursor	
  pin	
  s”	
  
	
  

	
  

www.enkitec.com	
  

3	
  	
  	
  
Environment	
  
• 
• 
• 
• 

High-­‐concurrency,	
  high-­‐visibility	
  OLTP	
  database	
  
Oracle	
  11.1.0.7	
  single-­‐instance,	
  dedicated	
  server	
  processes	
  
HP-­‐UX	
  on	
  Itanium	
  
32	
  CPUs,	
  128	
  GB	
  RAM	
  

•  Thousands	
  of	
  end	
  users	
  
•  Mul4ple	
  WebLogic	
  applica4on	
  servers	
  talking	
  to	
  database	
  via	
  
connec4on	
  pools	
  

www.enkitec.com	
  

4	
  	
  	
  
The	
  problem	
  
• 
• 
• 
• 

Sporadic	
  extreme	
  slowness	
  of	
  Oracle	
  DB	
  and	
  the	
  server	
  
Slowness	
  lasts	
  for	
  1	
  ..	
  20	
  minutes	
  at	
  a	
  4me…	
  
Queries	
  don’t	
  answer	
  or	
  extremely	
  slow	
  
Can’t	
  even	
  log	
  on	
  to	
  OS	
  during	
  that	
  4me	
  
•  New	
  SSH	
  connec4ons	
  succeded	
  once	
  the	
  spike	
  was	
  over	
  
•  It	
  takes	
  minutes	
  to	
  run	
  simple	
  OS	
  commands	
  during	
  the	
  problem	
  4me	
  

•  This	
  is	
  a	
  global	
  server-­‐wide	
  problem	
  –	
  everyone	
  complains!	
  
So,	
  the	
  scope	
  of	
  this	
  problem	
  
is	
  global,	
  server-­‐wide.	
  
Therefore	
  we	
  can	
  use	
  global,	
  
server-­‐wide	
  metrics	
  to	
  
diagnose	
  the	
  problem.	
  
www.enkitec.com	
  

5	
  	
  	
  
Let’s	
  pick	
  and	
  diagnose	
  one	
  occurrence	
  of	
  this	
  problem	
  
•  The	
  database	
  response	
  4mes	
  extremely	
  bad	
  again	
  around	
  
18:10	
  and	
  this	
  lasted	
  for	
  about	
  5	
  minutes…	
  
•  If	
  it’s	
  the	
  users	
  who	
  report	
  the	
  problem	
  (as	
  opposed	
  to	
  applica4on	
  
side	
  measurements),	
  then	
  there	
  may	
  be	
  some	
  discrepancies	
  in	
  the	
  
user	
  reported	
  4mes	
  vs	
  actual	
  problem	
  4me	
  	
  

www.enkitec.com	
  

6	
  	
  	
  
Ini4al	
  AWR	
  Report	
  

Number	
  of	
  sessions	
  
has	
  grown	
  by	
  ~500!	
  

Snap Id
Snap Time
Sessions Curs/Sess
--------- ------------------- -------- --------Begin Snap:
61921 30-Oct-10 18:00:10
2,383
28.9
Using	
  a	
  20	
  minute	
  
End Snap:
61922 30-Oct-10 18:20:20
2,863
24.7
report	
  for	
  diagnosing	
  
Elapsed:
20.17 (mins)
a	
  5	
  minute	
  problem?!	
  
DB Time:
559.31 (mins)
Avg
% DB
Event
Waits
Time(s)
(ms)
time Wait Class
--------------------------- --------- ----------- ------ ------ ---------db file sequential read
2,135,668
21,468
10
64.0 User I/O
DB CPU
5,860
17.5
log file sync
92,720
1,498
16
4.5 Commit
read by other session
91,676
1,307
14
3.9 User I/O
SQL*Net message from dblink
525
1,132
2155
3.4 Network

This	
  66%	
  idle	
  is	
  an	
  
average	
  over	
  20	
  
minutes!	
  

Host CPU (CPUs:
~~~~~~~~

32 Cores:
32 Sockets:
32)
Load Average
Begin
End
%User
%System
%WIO
%Idle
--------- --------- --------- --------- --------- --------0.37
0.31
16.3
17.6
12.1
66.1
www.enkitec.com	
  

7	
  	
  	
  
ASH	
  data	
  (shown	
  in	
  OEM)	
  
•  Average	
  ac4ve	
  sessions	
  showed	
  something	
  different	
  
•  Note	
  that	
  this	
  data	
  is	
  from	
  another	
  period	
  of	
  4me	
  when	
  a	
  similar	
  spike	
  happened	
  

•  In	
  worst	
  4mes	
  there	
  were	
  up	
  to	
  220	
  ac4ve	
  sessions	
  trying	
  to	
  be	
  on	
  CPU!	
  
•  Thanks	
  to	
  beAer	
  granularity	
  we	
  see	
  the	
  spikes	
  instead	
  of	
  some	
  20-­‐minute	
  or	
  hourly	
  
averages…	
  
•  The	
  problem	
  with	
  ASH	
  samples	
  is	
  that	
  it	
  looks	
  into	
  session	
  state	
  from	
  inside	
  Oracle	
  
•  Perhaps	
  the	
  starva5on	
  is	
  due	
  to	
  some	
  other	
  applica5on	
  /	
  instance	
  in	
  the	
  server?	
  

www.enkitec.com	
  

8	
  	
  	
  
How	
  many	
  logons	
  were	
  done?	
  

Number	
  of	
  sessions	
  
has	
  grown	
  by	
  ~500!	
  

Snap Id
Snap Time
Sessions Curs/Sess
--------- ------------------- -------- --------Begin Snap:
61921 30-Oct-10 18:00:10
2,383
28.9
End Snap:
61922 30-Oct-10 18:20:20
2,863
24.7
Elapsed:
20.17 (mins)
DB Time:
559.31 (mins)
Statistic
Total
per Second
-------------------------------- ------------------ -------------index fetch by key
24,174,148
19,971.0
index scans kdiixs1
24,565,055
20,293.9
leaf node 90-10 splits
5,865
4.9
leaf node splits
14,529
12.0
lob reads
34,480
28.5
lob writes
1,623,273
1,341.0
lob writes unaligned
1,623,266
1,341.0
logons cumulative
2,550
2.1
messages received
133,740
110.5
messages sent
133,740
110.5
min active SCN optimization appl
538,358
444.8
no buffer to keep pinned count
6,331
5.2
no work - consistent read gets
146,703,542
121,196.1
opened cursors cumulative
4,168,700
3,443.9

www.enkitec.com	
  

2.1	
  logons	
  per	
  
second,	
  but	
  we	
  
don’t	
  know	
  how	
  
these	
  logons	
  are	
  	
  
per Trans
------------distributed	
  over	
  
43.6
the	
  20	
  minute	
  
44.3
period!	
  
0.0
0.0
0.1
2.9
2.9
0.0
0.2
0.2
1.0
0.0
264.7
7.5

9	
  	
  	
  
OS	
  level	
  metrics	
  don’t	
  lie	
  (well,	
  they	
  do,	
  but	
  less	
  ;-­‐)	
  
|
|
|
Date
| Time |CPU % |
10/30/2010|17:45:00| 25.06|
10/30/2010|17:50:00| 24.77|
10/30/2010|17:55:00| 24.60|
10/30/2010|18:00:00| 25.95|
10/30/2010|18:05:00| 22.88|
10/30/2010|18:10:00| 66.89|
10/30/2010|18:15:00| 24.51|
10/30/2010|18:20:00| 25.47|
10/30/2010|18:25:00| 28.38|

Phys
|
IO Rt
|
7149.5|
5334.8|
7176.4|
7556.2|
5379.5|
4544.6|
7544.9|
5144.0|
10139.5|

Phys
|Memory|Pg Out | VM Pg
|
KB Rt | %
| Rate | Scan Rt |
145817.6| 74.07|
0.0|
0.0|
60928.0| 73.98|
0.0|
0.0|
186368.0| 73.98|
0.0|
0.0|
192307.2| 74.11|
0.0|
0.0|
67584.0| 74.15|
0.0|
0.0|
58060.8| 76.97|
0.0|
0.0|
159334.4| 76.40|
0.0|
0.0|
59187.2| 75.04|
0.0|
0.0|
151552.0| 74.37|
0.0|
0.0|

1)  What	
  does	
  the	
  Time	
  18:10:00	
  mean,	
  beginning	
  of	
  the	
  
monitoring	
  interval	
  or	
  end?	
  
2)  66.89%	
  busy	
  during	
  5	
  minutes	
  may	
  actually	
  mean	
  100%	
  busy	
  
during	
  ~3	
  minutes	
  out	
  of	
  5,	
  but	
  we	
  don’t	
  know	
  that	
  for	
  sure	
  
without	
  measuring	
  in	
  more	
  detail	
  (beAer	
  granularity)…	
  
www.enkitec.com	
  

10	
  	
  	
  
Measuring	
  CPU	
  u4liza4on	
  in	
  more	
  detail	
  
The	
  spike	
  lasted	
  from	
  
18:11	
  to	
  18:14	
  (3	
  min)	
  

Around	
  90%	
  in	
  
Kernel	
  mode!!!	
  

www.enkitec.com	
  

11	
  	
  	
  
Checkpoint	
  –	
  measured	
  evidence	
  so	
  far	
  
•  Fact:	
  We	
  have	
  a	
  100%	
  CPU	
  u4liza4on	
  spike,	
  las4ng	
  3	
  minutes	
  
•  Fact:	
  90%	
  of	
  it	
  is	
  spent	
  in	
  KERNEL	
  mode	
  
•  Fact:	
  We	
  have	
  over	
  2500	
  logons	
  done	
  during	
  20	
  minute	
  
period	
  
•  2.1	
  logons	
  /	
  second	
  on	
  average	
  (which	
  doesn’t	
  sound	
  bad)	
  

•  Kernel	
  mode	
  CPU	
  usage	
  is	
  usually	
  caused	
  by	
  system	
  calls	
  
•  …or	
  some	
  internal	
  kernel	
  thread	
  ac4vity	
  

100%	
  CPU	
  usage	
  doesn’t	
  always	
  automa4cally	
  mean	
  you	
  have	
  a	
  serious	
  CPU	
  
starva4on	
  problem.	
  The	
  CPU	
  runqueue	
  length	
  would	
  indicate	
  you	
  how	
  much	
  
starva4on	
  (wai4ng	
  for	
  CPU	
  service)	
  there	
  is.	
  However	
  seeing	
  90%	
  of	
  CPU	
  used	
  in	
  
KERNEL	
  mode	
  is	
  definitely	
  not	
  normal	
  for	
  an	
  Oracle	
  database	
  server.	
  
www.enkitec.com	
  

12	
  	
  	
  
Diagnosing	
  90%	
  kernel-­‐mode	
  CPU	
  usage	
  spikes…	
  
1.  Systema4c	
  
• 
• 
• 
• 

Break	
  down	
  this	
  90%	
  of	
  Kernel	
  mode	
  CPU	
  usage	
  
Profiling!	
  
Oh,	
  this	
  is	
  a	
  produc4on	
  system	
  and	
  the	
  problem	
  is	
  acute	
  &	
  ongoing	
  
On	
  Solaris,	
  I’d	
  have	
  used	
  Dtrace	
  stack()	
  probe	
  to	
  record	
  OS	
  kernel	
  
stack	
  traces	
  most	
  common	
  on	
  CPU	
  (google	
  for	
  dstackprof)	
  
• 

• 

Or	
  lockstat	
  as	
  it	
  reports	
  spins	
  on	
  spinlocks	
  (which	
  consume	
  kernel	
  CPU)	
  

But	
  this	
  was	
  HP-­‐UX	
  and	
  I	
  didn’t	
  know	
  the	
  tools	
  needed	
  
• 
• 

But	
  I	
  knew	
  what	
  numbers	
  I	
  wanted	
  to	
  see!	
  
We	
  sent	
  a	
  request	
  to	
  HP-­‐UX	
  support:	
  	
  
“How	
  do	
  we	
  measure	
  &	
  break	
  down	
  where	
  is	
  kernel	
  mode	
  CPU	
  used?”	
  

2.  Check	
  for	
  usual	
  suspects	
  
• 

Fast,	
  cheap	
  checks	
  to	
  rule	
  out	
  or	
  find	
  known	
  troublemakers	
  
www.enkitec.com	
  

13	
  	
  	
  
Kernel	
  mode	
  CPU	
  usage	
  spikes	
  –	
  the	
  usual	
  suspects	
  
•  Before	
  star4ng	
  the	
  systema4c	
  troubleshoo4ng	
  &	
  drilldown,	
  
do	
  quick	
  checks	
  for	
  usual	
  suspects	
  
•  Remember,	
  the	
  client	
  has	
  a	
  business	
  problem,	
  4me	
  is	
  of	
  essence…	
  

1.  Logon	
  (or	
  logoff)	
  storms	
  
• 

Spawning,	
  ini4alizing	
  new	
  processes,	
  opening	
  files	
  and	
  aAaching	
  to	
  
SGA	
  means	
  system	
  calls,	
  kernel	
  CPU	
  usage	
  

2.  Oracle	
  code	
  gevng	
  into	
  some	
  crazy	
  loop	
  (due	
  to	
  a	
  bug)	
  
• 

Semop(),	
  yield(),	
  read	
  /proc/…,	
  getrusage(),	
  etc	
  loop	
  

3.  OS	
  kernel	
  spinlock	
  conten4on	
  
• 
• 

Not	
  so	
  usual	
  suspects	
  really…	
  
Variety	
  of	
  reasons…	
  
ONen	
  due	
  to	
  bugs	
  in	
  OS	
  or	
  some	
  kernel	
  module	
  
www.enkitec.com	
  

14	
  	
  	
  
Measuring	
  logon	
  storms	
  
•  Use	
  the	
  AUD$	
  records	
  or	
  “logons	
  cumula@ve”	
  number	
  from	
  	
  
V$SYSSTAT	
  or	
  AWR,	
  right?	
  
•  Wrong!	
  

•  logons	
  cumula5ve	
  number	
  is	
  incremented	
  by	
  the	
  session	
  itself	
  
–	
  aNer	
  it	
  has	
  logged	
  on,	
  the	
  same	
  applies	
  to	
  audit	
  records!	
  
1. 
2. 
3. 
4. 
5. 

ANer	
  the	
  listener	
  connec4on	
  has	
  been	
  established…	
  
The	
  process	
  has	
  been	
  started…	
  
It	
  has	
  aAached	
  to	
  SGA	
  SHM	
  segments…	
  
Audit	
  file	
  has	
  been	
  wriAen	
  (if	
  needed)	
  …	
  
Process,	
  session	
  SGA	
  structures	
  have	
  been	
  created	
  
• 

Memory	
  from	
  OS	
  and	
  shared	
  pool	
  allocated	
  (shared	
  pool	
  latches!)	
  

6.  Session	
  has	
  been	
  authen4cated	
  
7.  Then	
  the	
  logons	
  cumula4ve	
  is	
  incremented!	
  
	
  
www.enkitec.com	
  

The	
  logon	
  storm	
  may	
  
have	
  started	
  way	
  
before	
  these	
  logons	
  
finally	
  succeeded!	
  	
  
15	
  	
  	
  
Measuring	
  logon	
  storms	
  
•  Logon	
  storms	
  should	
  be	
  measured	
  at	
  the	
  listener	
  level	
  
•  Process	
  listener.log	
  using	
  a	
  script:	
  
$ tail listener.log
30-OCT-2010 23:22:03
30-OCT-2010 23:22:08
30-OCT-2010 23:22:08
30-OCT-2010 23:22:09

*
*
*
*

(CONNECT_DATA=…)
(CONNECT_DATA=…)
(CONNECT_DATA=…)
service_update *

* establish * E2SNDB * 0
* establish * E2SNDB * 0
* establish * E2SNDB * 0
E2SNDB * 0

$ fgrep "30-OCT-2010" listener.log | fgrep "establish" | 
awk '{ print $1 " " $2 }' | awk -F: '{ print $1 }'
| 
sort | uniq –c
…
5564 30-OCT-2010 20
5571 30-OCT-2010 21
5703 30-OCT-2010 22
5703	
  logons	
  during	
  
4828 30-OCT-2010 23
22:00	
  –	
  22:59	
  
…
www.enkitec.com	
  

16	
  	
  	
  
Measuring	
  logon	
  storms	
  
•  Logon	
  storms	
  should	
  be	
  measured	
  at	
  the	
  listener	
  level	
  
•  Process	
  listener.log	
  using	
  a	
  script:	
  
$ fgrep "30-OCT-2010 22:" listener.log | fgrep "establish"
| 
awk '{ print $1 " " $2 }' | awk -F: '{ print $1 ":" $2 }' | 
sort | uniq –c
…
88 30-OCT-2010 22:00
120 30-OCT-2010 22:01
94 30-OCT-2010 22:02
How	
  many	
  connec4ons	
  
94 30-OCT-2010 22:03
listener	
  established	
  every	
  
95 30-OCT-2010 22:04
minute	
  (this	
  data	
  from	
  non-­‐
120 30-OCT-2010 22:05
problem	
  4me)	
  
79 30-OCT-2010 22:06
101 30-OCT-2010 22:07
85 30-OCT-2010 22:08
100 30-OCT-2010 22:09
85 30-OCT-2010 22:10
89 30-OCT-2010 22:11
83 30-OCT-2010 22:12
93 30-OCT-2010 22:13
www.enkitec.com	
  

17	
  	
  	
  
Con4nuing	
  the	
  OS	
  kernel	
  mode	
  CPU	
  usage	
  diagnosis	
  
Let’s	
  trace	
  system	
  
calls	
  by	
  one	
  CPU	
  
heavy	
  Oracle	
  process	
  

# tusc -cp 8021
( Attached to process 8021 ("oracleXYZ (LOCAL=NO)") [64-bit] )
( Detaching from process 8021 ("oracleXYZ (LOCAL=NO)") )
Syscall
open
----Total

Interes4ngly	
  3	
  open()	
  syscalls	
  
take	
  over	
  a	
  second	
  in	
  total.	
  
Could	
  this	
  be	
  caused	
  by	
  the	
  
general	
  CPU	
  starva4on	
  in	
  the	
  
server?	
  

Seconds
1.05
----1.05

Calls
3
----3

The	
  next	
  step	
  should	
  have	
  
been	
  to	
  check	
  which	
  file	
  did	
  
we	
  try	
  to	
  open	
  (but	
  the	
  spike	
  
ended	
  before	
  we	
  could	
  do	
  
that)	
  
www.enkitec.com	
  

Errors
3
----3

Seeing	
  errors	
  isn’t	
  a	
  
problem	
  itself,	
  as	
  this	
  
is	
  how	
  Oracle	
  and	
  libs	
  
some4mes	
  check	
  for	
  
existence	
  of	
  a	
  file…	
  
18	
  	
  	
  
Con4nuing	
  the	
  OS	
  kernel	
  mode	
  CPU	
  usage	
  diagnosis	
  
•  HP-­‐UX	
  support	
  got	
  back	
  to	
  us	
  and	
  recommended	
  the	
  use	
  of	
  
runki	
  tool	
  for	
  measuring	
  kernel	
  CPU	
  u4liza4on	
  in	
  detail	
  
•  It	
  had	
  to	
  be	
  installed	
  as	
  root	
  
•  It	
  measured	
  a	
  lot	
  of	
  things	
  happening	
  in	
  kernel,	
  wri4ng	
  a	
  big	
  output	
  
file	
  
•  on	
  Solaris	
  there’s	
  a	
  tool	
  called	
  TNF	
  trace,	
  AIX	
  has	
  trace	
  command	
  for	
  such	
  
full	
  kernel	
  tracing	
  

•  Now	
  we	
  had	
  to	
  wait	
  for	
  the	
  problem	
  to	
  happen	
  again	
  
•  Didn’t	
  have	
  to	
  wait	
  for	
  too	
  long…	
  
•  We	
  sent	
  the	
  raw	
  trace	
  dump	
  to	
  HP	
  Support,	
  so	
  they	
  could	
  run	
  
something	
  like	
  “tkprof”	
  on	
  that	
  tracefile	
  
•  It	
  basically	
  just	
  summed	
  up	
  the	
  kernel	
  spinlock	
  wait,	
  spin	
  events	
  by	
  lock,	
  
object	
  locked	
  etc	
  
www.enkitec.com	
  

19	
  	
  	
  
kiprof	
  –	
  profiled	
  runki	
  output	
  
Total Hardclock traces: 30239
================================
State
Count
Percent
USER
7130
23.58
SYS
22965
75.94
IDLE
24
0.08
SSYS
120
0.40

	
  

That’s	
  basically	
  
spinning	
  for	
  locks	
  
(adap4ve	
  decisions	
  to	
  
spin	
  or	
  wait)	
  

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Kernel Functions executed during profile
Count
Pct State Function
Self-­‐explanatory.	
  We	
  
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
15615 51.64% SYS
timed_preArbitration
are	
  spinning	
  for	
  a	
  lock	
  
7130 23.58% USER
OTHER
1275
4.22% SYS
spinlock
1008
3.33% SYS
wait_for_lock_spinner
VxFS	
  directory	
  block	
  
488
1.61% SYS
vx_dirbread
read?	
  
464
1.53% SYS
vx_bc_getblk
417
1.38% SYS
preArbitration
338
1.12% SYS
vx_dirscan
VxFS	
  directory	
  
291
0.96% SYS
vx_bc_brelse

contents	
  scan	
  !!!	
  

www.enkitec.com	
  

20	
  	
  	
  
Drilling	
  down	
  to	
  kernel	
  spinlock	
  spinning	
  
The	
  main	
  spinlock	
  experiencing	
  spinning	
  was	
  related	
  to	
  VxFS:	
  
spn%kern
7.24
0.11
0.04
0.04
0.01

cumpct
7.24
7.36
7.40
7.44
7.45

seconds
71.96
1.14
0.44
0.42
0.12

spn%cpu lock name
3.77 FS:vxfs:bc_freelist_lock spin
0.06 FS:vxfs:inode spin for sleep lock
0.02 FS:vxfs:i_spinspin lock
0.02 Sleep Queue lock
0.01 v_count_lock

Oracle	
  processes	
  were	
  the	
  main	
  ones	
  spinning:	
  
spn%spn
99.17
0.60
0.08
0.05
0.03
0.03

cumpct
99.17
99.77
99.85
99.91
99.93
99.96

spnsec
73.82
0.45
0.06
0.04
0.02
0.02

usrsec
880.32
0.00
1.42
0.00
0.00
0.00

kernsec spn%kern process name
543.70
13.58 oracle
7.37
6.11 vxfsd
11.52
0.52 tnslsnr
402.60
0.01 [IDLE]
0.50
4.00 xyz
0.43
4.65 sadc

www.enkitec.com	
  

21	
  	
  	
  
Which	
  system	
  calls?	
  
$ /opt/ktools/bin/kiinfo -kitrace pid=28679,syscalls -ts
1103_1307 | grep open | sort -r -n -k 13,13 | more
2.550292 cpu=6 seqcnt=43808107 pid=28679 ktid=526366 utid=1
open [5] err=17 ret1=12 ret2=1 syscallbeg=
3.295138
ktcsys=
0.000000 A0=0x9fffffff A1=
0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/
admin/xyzdb/adump/xyzdb_ora_28679_21.aud'
8.789893 cpu=6 seqcnt=43825746 pid=28679 ktid=526366 utid=1
open [5] err=17 ret1=12 ret2=1 syscallbeg=
3.172851
ktcsys=
0.000000 A0=0x9fffffff A1=
0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/
admin/xyzdb/adump/xyzdb_ora_28679_23.aud'
5.617009 cpu=6 seqcnt=43816702 pid=28679 ktid=526366 utid=1
open [5] err=17 ret1=12 ret2=1 syscallbeg=
3.066686
ktcsys=
0.000000 A0=0x9fffffff A1=
0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/
admin/xyzdb/adump/xyzdb_ora_28679_22.aud'
18.143066 cpu=6 seqcnt=43848837 pid=28679 ktid=526366 utid=1
open [5] err=17 ret1=12 ret2=1 syscallbeg=
2.576237
ktcsys=
0.000000 A0=0x9fffffff A1=
0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/
admin/xyzdb/adump/xyzdb_ora_28679_26.aud'
www.enkitec.com	
  

22	
  	
  	
  
Audit	
  file	
  des4ona4on	
  
•  New	
  audit	
  file	
  name	
  format	
  in	
  11g…	
  
•  A	
  new	
  file	
  is	
  created	
  for	
  each	
  audit	
  file	
  where	
  the	
  SPID	
  collides	
  with	
  a	
  
previous	
  file…	
  
•  Every	
  @me	
  when	
  crea@ng	
  a	
  new	
  audit	
  file,	
  Oracle	
  has	
  to	
  check	
  whether	
  
such	
  file	
  already	
  exists	
  with	
  suffix	
  _1,	
  then	
  _2,	
  _3,	
  etc..	
  
$ cd /u01/app/oracle/admin/E2SNDB/adump
$ ls -l | head
total 4788
-rw-r----- 1 oracle dba 735 Feb 28 16:06
-rw-r----- 1 oracle dba 710 Oct 16 17:58
-rw-r----- 1 oracle dba 735 Oct 16 17:58
-rw-r----- 1 oracle dba 735 Feb 27 17:53
-rw-r----- 1 oracle dba 736 Oct 16 17:58
-rw-r----- 1 oracle dba 740 Oct 16 17:58
-rw-r----- 1 oracle dba 735 Feb 28 16:07
-rw-r----- 1 oracle dba 735 Feb 24 17:44
-rw-r----- 1 oracle dba 735 Dec 22 21:28
www.enkitec.com	
  

e2sndb_ora_10028_1.aud
e2sndb_ora_10082_1.aud
e2sndb_ora_10082_2.aud
e2sndb_ora_10095_1.aud
e2sndb_ora_10120_1.aud
e2sndb_ora_10125_1.aud
e2sndb_ora_10158_1.aud
e2sndb_ora_10206_1.aud
e2sndb_ora_10482_1.aud

23	
  	
  	
  
Shouldn’t	
  the	
  audit	
  files	
  be	
  created	
  only	
  for	
  SYSDBA	
  and	
  
SYSOPER	
  access?	
  
•  In	
  theory,	
  yes.	
  
•  In	
  prac4ce	
  in	
  our	
  case,	
  no.	
  
•  Bug	
  9744092:	
  EXCESSIVE	
  AMOUNT	
  OF	
  AUD	
  FILES	
  BEING	
  
GENERATED	
  IN	
  11.1	
  
•  Oracle	
  generated	
  a	
  new	
  .aud	
  file	
  for	
  every	
  new	
  database	
  connec4on!	
  
•  Not	
  just	
  SYSOPER/SYSDBA	
  like	
  normally	
  
•  This	
  is	
  all	
  despite	
  having	
  audit_trail	
  =	
  DB	
  	
  
•  Normally	
  the	
  .aud	
  files	
  in	
  audit_dump_dest	
  are	
  not	
  related	
  to	
  regular	
  
audit	
  trail,	
  but	
  for	
  SYSOPER/SYSDBA	
  logon,	
  startup/shutdown	
  audi4ng	
  
•  …and	
  if	
  AUDIT_SYS_OPERATIONS	
  =	
  TRUE	
  then	
  all	
  commands	
  issued	
  as	
  SYS	
  

www.enkitec.com	
  

24	
  	
  	
  
Bug	
  9744092:	
  EXCESSIVE	
  AMOUNT	
  OF	
  AUD	
  FILES	
  BEING	
  
GENERATED	
  IN	
  11.1	
  
PROBLEM:
-------After upgrade to 11.1 the system creates 10 - 16
*aud files per minute in audit_file_dest.
Out of 9528 *.aud files that the customer uploaded,
9124 files recorded ACTION:[3] "102".
DIAGNOSTIC ANALYSIS:
-------------------The change of behavior (move audit action 102 from
aud$ to audit_file_dest when audit_trail=DB.) is due
to a fix for an unpublished bug 5476184 in 11.1.
It is not an intended feature for 11G.
WORKAROUND:
----------Manual delete of audit files
www.enkitec.com	
  

25	
  	
  	
  
Diagnosis	
  
1.  Thanks	
  to	
  bug	
  9744092	
  and	
  a	
  behavior	
  change	
  in	
  Oracle	
  11.1	
  
a	
  new	
  audit	
  file	
  was	
  created	
  for	
  each	
  new	
  connec4on	
  to	
  DB	
  
• 

If	
  a	
  file	
  already	
  existed,	
  Oracle	
  checked	
  if	
  a	
  similar	
  file	
  name	
  with	
  
larger	
  suffix	
  value	
  (_2,	
  _3,	
  _4	
  etc)	
  existed	
  

2.  The	
  audit_file_dest	
  eventually	
  had	
  over	
  1.5	
  M	
  files	
  in	
  it!	
  
• 

For	
  each	
  logon,	
  mul4ple	
  file	
  existence	
  checks	
  (open()	
  syscalls)	
  had	
  to	
  
be	
  done	
  

3.  Checking	
  whether	
  a	
  file	
  existed	
  (open	
  syscall	
  -­‐>	
  directory	
  
entry	
  scan)	
  became	
  very	
  slow	
  –	
  and	
  it’s	
  done	
  in	
  kernel	
  mode	
  
• 
• 

A	
  spinlock	
  was	
  held	
  during	
  the	
  directory	
  entry	
  scan	
  
Other	
  new	
  Oracle	
  processes	
  also	
  wanted	
  to	
  do	
  the	
  directory	
  scan,	
  
resul4ng	
  in	
  spinlock	
  conten4on	
  and	
  further	
  Kernel	
  mode	
  CPU	
  usage	
  

4.  When	
  the	
  DB	
  got	
  slow	
  –	
  app	
  servers	
  fired	
  up	
  hundreds	
  of	
  
new	
  connec4ons	
  to	
  “make	
  things	
  faster”	
  
• 

This	
  all	
  fed	
  back	
  to	
  the	
  problem	
  –	
  even	
  more	
  conten4on	
  &	
  spinning	
  
www.enkitec.com	
  

26	
  	
  	
  
Limi4ng	
  logon	
  storms	
  
Use	
  Oracle	
  Listener	
  connec4on	
  rate	
  limiter	
  (11gR1+)	
  
listener.ora:
LISTENER=
(ADDRESS_LIST=
(ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521)(RATE_LIMIT=5))
(ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1522)(RATE_LIMIT=10))
(ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1523))
)

	
  
Oracle	
  Documenta5on:	
  Oracle	
  Net	
  Listener	
  Parameters	
  (listener.ora)	
  
hQp://download.oracle.com/docs/cd/B28359_01/network.111/b28317/listener.htm	
  

	
  
Also,	
  it	
  is	
  possible	
  to	
  limit	
  logoff	
  storm	
  rate	
  
	
  _logout_storm_rate	
  parameter	
  (instance-­‐wide)	
  
www.enkitec.com	
  

27	
  	
  	
  
Troubleshoo4ng	
  sporadic	
  system	
  performance	
  issues	
  

	
  
Right	
  Data	
  !!!	
  
	
  

•  Right	
  scope	
  –	
  if	
  your	
  problem	
  lasts	
  for	
  seconds,	
  this	
  should	
  be	
  
the	
  granularity	
  of	
  your	
  data	
  too	
  
•  OS	
  level	
  data,	
  in	
  addi4on	
  to	
  the	
  database	
  metrics	
  
•  Ideally	
  OS	
  level	
  metrics	
  sampled	
  mul4ple	
  4mes	
  per	
  minute	
  

www.enkitec.com	
  

28	
  	
  	
  
Conclusions	
  
•  Logon	
  storms	
  are	
  evil!	
  
•  They	
  will	
  amplify	
  any	
  performance	
  hiccups	
  as	
  they	
  cause	
  extra	
  load	
  
just	
  when	
  the	
  resources	
  are	
  scarcest	
  

•  Connec4on	
  pools	
  firing	
  up	
  hundreds	
  of	
  new	
  connec4ons	
  are	
  
evil!	
  
•  Know	
  your	
  limits	
  (both	
  max	
  connec4ons	
  and	
  max	
  connect	
  rate	
  /	
  sec)	
  
•  Here’s	
  a	
  thought:	
  
•  If	
  you	
  have	
  planned	
  the	
  servers’	
  capacity	
  to	
  support	
  N-­‐thousand	
  
connec4ons	
  anyway	
  (by	
  allowing	
  connec4on	
  pools	
  grow	
  that	
  high),	
  why	
  
not	
  create	
  this	
  amount	
  of	
  connec4ons	
  right	
  away?	
  
•  This	
  would	
  avoid	
  logon	
  storms	
  during	
  worst	
  4mes	
  as	
  all	
  connec4ons	
  have	
  
already	
  been	
  created!	
  
www.enkitec.com	
  

29	
  	
  	
  

More Related Content

What's hot

Oracle statistics by example
Oracle statistics by exampleOracle statistics by example
Oracle statistics by example
Mauro Pagano
 
Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle
Kyle Hailey
 
Same plan different performance
Same plan different performanceSame plan different performance
Same plan different performance
Mauro Pagano
 
Oracle db performance tuning
Oracle db performance tuningOracle db performance tuning
Oracle db performance tuningSimon Huang
 
Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360
Carlos Sierra
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and Underscores
Jitendra Singh
 
New Generation Oracle RAC Performance
New Generation Oracle RAC PerformanceNew Generation Oracle RAC Performance
New Generation Oracle RAC Performance
Anil Nair
 
Oracle AWR Data mining
Oracle AWR Data miningOracle AWR Data mining
Oracle AWR Data mining
Yury Velikanov
 
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Sandesh Rao
 
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...Aaron Shilo
 
Your tuning arsenal: AWR, ADDM, ASH, Metrics and Advisors
Your tuning arsenal: AWR, ADDM, ASH, Metrics and AdvisorsYour tuning arsenal: AWR, ADDM, ASH, Metrics and Advisors
Your tuning arsenal: AWR, ADDM, ASH, Metrics and Advisors
John Kanagaraj
 
Understanding oracle rac internals part 2 - slides
Understanding oracle rac internals   part 2 - slidesUnderstanding oracle rac internals   part 2 - slides
Understanding oracle rac internals part 2 - slides
Mohamed Farouk
 
Stop the Chaos! Get Real Oracle Performance by Query Tuning Part 1
Stop the Chaos! Get Real Oracle Performance by Query Tuning Part 1Stop the Chaos! Get Real Oracle Performance by Query Tuning Part 1
Stop the Chaos! Get Real Oracle Performance by Query Tuning Part 1
SolarWinds
 
Ash and awr deep dive hotsos
Ash and awr deep dive hotsosAsh and awr deep dive hotsos
Ash and awr deep dive hotsos
Kellyn Pot'Vin-Gorman
 
Oracle Client Failover - Under The Hood
Oracle Client Failover - Under The HoodOracle Client Failover - Under The Hood
Oracle Client Failover - Under The Hood
Ludovico Caldara
 
Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive
Glen Hawkins
 
Performance Tuning With Oracle ASH and AWR. Part 1 How And What
Performance Tuning With Oracle ASH and AWR. Part 1 How And WhatPerformance Tuning With Oracle ASH and AWR. Part 1 How And What
Performance Tuning With Oracle ASH and AWR. Part 1 How And Whatudaymoogala
 
Understanding oracle rac internals part 1 - slides
Understanding oracle rac internals   part 1 - slidesUnderstanding oracle rac internals   part 1 - slides
Understanding oracle rac internals part 1 - slides
Mohamed Farouk
 
Chasing the optimizer
Chasing the optimizerChasing the optimizer
Chasing the optimizer
Mauro Pagano
 
Deep review of LMS process
Deep review of LMS processDeep review of LMS process
Deep review of LMS process
Riyaj Shamsudeen
 

What's hot (20)

Oracle statistics by example
Oracle statistics by exampleOracle statistics by example
Oracle statistics by example
 
Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle
 
Same plan different performance
Same plan different performanceSame plan different performance
Same plan different performance
 
Oracle db performance tuning
Oracle db performance tuningOracle db performance tuning
Oracle db performance tuning
 
Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and Underscores
 
New Generation Oracle RAC Performance
New Generation Oracle RAC PerformanceNew Generation Oracle RAC Performance
New Generation Oracle RAC Performance
 
Oracle AWR Data mining
Oracle AWR Data miningOracle AWR Data mining
Oracle AWR Data mining
 
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
 
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...
 
Your tuning arsenal: AWR, ADDM, ASH, Metrics and Advisors
Your tuning arsenal: AWR, ADDM, ASH, Metrics and AdvisorsYour tuning arsenal: AWR, ADDM, ASH, Metrics and Advisors
Your tuning arsenal: AWR, ADDM, ASH, Metrics and Advisors
 
Understanding oracle rac internals part 2 - slides
Understanding oracle rac internals   part 2 - slidesUnderstanding oracle rac internals   part 2 - slides
Understanding oracle rac internals part 2 - slides
 
Stop the Chaos! Get Real Oracle Performance by Query Tuning Part 1
Stop the Chaos! Get Real Oracle Performance by Query Tuning Part 1Stop the Chaos! Get Real Oracle Performance by Query Tuning Part 1
Stop the Chaos! Get Real Oracle Performance by Query Tuning Part 1
 
Ash and awr deep dive hotsos
Ash and awr deep dive hotsosAsh and awr deep dive hotsos
Ash and awr deep dive hotsos
 
Oracle Client Failover - Under The Hood
Oracle Client Failover - Under The HoodOracle Client Failover - Under The Hood
Oracle Client Failover - Under The Hood
 
Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive
 
Performance Tuning With Oracle ASH and AWR. Part 1 How And What
Performance Tuning With Oracle ASH and AWR. Part 1 How And WhatPerformance Tuning With Oracle ASH and AWR. Part 1 How And What
Performance Tuning With Oracle ASH and AWR. Part 1 How And What
 
Understanding oracle rac internals part 1 - slides
Understanding oracle rac internals   part 1 - slidesUnderstanding oracle rac internals   part 1 - slides
Understanding oracle rac internals part 1 - slides
 
Chasing the optimizer
Chasing the optimizerChasing the optimizer
Chasing the optimizer
 
Deep review of LMS process
Deep review of LMS processDeep review of LMS process
Deep review of LMS process
 

Similar to Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1

Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Linux Perf Tools
Linux Perf ToolsLinux Perf Tools
Linux Perf Tools
Raj Pandey
 
Linux Performance Tools
Linux Performance ToolsLinux Performance Tools
Linux Performance Tools
Brendan Gregg
 
QCon 2015 Broken Performance Tools
QCon 2015 Broken Performance ToolsQCon 2015 Broken Performance Tools
QCon 2015 Broken Performance Tools
Brendan Gregg
 
KoprowskiT - SQLBITS X - 2am a disaster just began
KoprowskiT - SQLBITS X - 2am a disaster just beganKoprowskiT - SQLBITS X - 2am a disaster just began
KoprowskiT - SQLBITS X - 2am a disaster just beganTobias Koprowski
 
Hotsos 2012
Hotsos 2012Hotsos 2012
Hotsos 2012
Connor McDonald
 
OOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goOOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goKristofferson A
 
MeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance Analysis
Brendan Gregg
 
Fine grained monitoring
Fine grained monitoringFine grained monitoring
Fine grained monitoring
Iben Rodriguez
 
Linux Performance Tools 2014
Linux Performance Tools 2014Linux Performance Tools 2014
Linux Performance Tools 2014
Brendan Gregg
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented Design
Rodrigo Campos
 
OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?
ScyllaDB
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
Brendan Gregg
 
Oracle Performance Tuning Fundamentals
Oracle Performance Tuning FundamentalsOracle Performance Tuning Fundamentals
Oracle Performance Tuning Fundamentals
Carlos Sierra
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
Florian Lautenschlager
 
Cassandra serving netflix @ scale
Cassandra serving netflix @ scaleCassandra serving netflix @ scale
Cassandra serving netflix @ scale
Vinay Kumar Chella
 
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
Amazon Web Services
 
Oracle Database In-Memory Option in Action
Oracle Database In-Memory Option in ActionOracle Database In-Memory Option in Action
Oracle Database In-Memory Option in Action
Tanel Poder
 
In Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry OsborneIn Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry Osborne
Enkitec
 

Similar to Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1 (20)

Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Linux Perf Tools
Linux Perf ToolsLinux Perf Tools
Linux Perf Tools
 
Linux Performance Tools
Linux Performance ToolsLinux Performance Tools
Linux Performance Tools
 
QCon 2015 Broken Performance Tools
QCon 2015 Broken Performance ToolsQCon 2015 Broken Performance Tools
QCon 2015 Broken Performance Tools
 
KoprowskiT - SQLBITS X - 2am a disaster just began
KoprowskiT - SQLBITS X - 2am a disaster just beganKoprowskiT - SQLBITS X - 2am a disaster just began
KoprowskiT - SQLBITS X - 2am a disaster just began
 
Quick Wins
Quick WinsQuick Wins
Quick Wins
 
Hotsos 2012
Hotsos 2012Hotsos 2012
Hotsos 2012
 
OOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goOOW 2013: Where did my CPU go
OOW 2013: Where did my CPU go
 
MeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance Analysis
 
Fine grained monitoring
Fine grained monitoringFine grained monitoring
Fine grained monitoring
 
Linux Performance Tools 2014
Linux Performance Tools 2014Linux Performance Tools 2014
Linux Performance Tools 2014
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented Design
 
OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 
Oracle Performance Tuning Fundamentals
Oracle Performance Tuning FundamentalsOracle Performance Tuning Fundamentals
Oracle Performance Tuning Fundamentals
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
 
Cassandra serving netflix @ scale
Cassandra serving netflix @ scaleCassandra serving netflix @ scale
Cassandra serving netflix @ scale
 
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
 
Oracle Database In-Memory Option in Action
Oracle Database In-Memory Option in ActionOracle Database In-Memory Option in Action
Oracle Database In-Memory Option in Action
 
In Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry OsborneIn Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry Osborne
 

More from Tanel Poder

Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesLow Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
Tanel Poder
 
Modern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application TroubleshootingModern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application Troubleshooting
Tanel Poder
 
GNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for DatabasesGNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for Databases
Tanel Poder
 
SQL in the Hybrid World
SQL in the Hybrid WorldSQL in the Hybrid World
SQL in the Hybrid World
Tanel Poder
 
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTroubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Tanel Poder
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
Tanel Poder
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known FeaturesOracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known Features
Tanel Poder
 
Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder
 
Oracle Latch and Mutex Contention Troubleshooting
Oracle Latch and Mutex Contention TroubleshootingOracle Latch and Mutex Contention Troubleshooting
Oracle Latch and Mutex Contention Troubleshooting
Tanel Poder
 
Oracle LOB Internals and Performance Tuning
Oracle LOB Internals and Performance TuningOracle LOB Internals and Performance Tuning
Oracle LOB Internals and Performance Tuning
Tanel Poder
 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata MigrationsTanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder
 

More from Tanel Poder (11)

Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesLow Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
 
Modern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application TroubleshootingModern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application Troubleshooting
 
GNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for DatabasesGNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for Databases
 
SQL in the Hybrid World
SQL in the Hybrid WorldSQL in the Hybrid World
SQL in the Hybrid World
 
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTroubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contention
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known FeaturesOracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known Features
 
Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)
 
Oracle Latch and Mutex Contention Troubleshooting
Oracle Latch and Mutex Contention TroubleshootingOracle Latch and Mutex Contention Troubleshooting
Oracle Latch and Mutex Contention Troubleshooting
 
Oracle LOB Internals and Performance Tuning
Oracle LOB Internals and Performance TuningOracle LOB Internals and Performance Tuning
Oracle LOB Internals and Performance Tuning
 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata MigrationsTanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata Migrations
 

Recently uploaded

Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 

Recently uploaded (20)

Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 

Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1

  • 1. Troubleshoo4ng  the  Most  Complex  Performance   Issue  I’ve  ever  seen   Tanel  Poder   hAp://blog.tanelpoder.com   hAp://tech.e2sn.com       www.enkitec.com   1      
  • 2. Intro:  About  me   •  Tanel  Põder   Oracle  Database  Performance  geek   Exadata  Performance  geek   Hadoop  Performance  geek   •  Enkitec   •  Consultant   •  Researcher   •  Technology  Evangelist   •  Just  moved  to  Dallas   •  ANer  Tallinn  -­‐>  Stockholm  -­‐>  London  -­‐>   Cancun  -­‐>  Singapore   www.enkitec.com   Expert  Oracle  Exadata   book   (with  Kerry  Osborne  and   Randy  Johnson  of  Enkitec)   2      
  • 3. Two  issues  -­‐  actually   •  For  warm-­‐up:   •  cursor  pin:  S  wait  events  and  sporadic  CPU  spikes   •  Read  more  from  my  blog  entry:   •  hAp://blog.tanelpoder.com/2010/04/21/cursor-­‐pin-­‐s-­‐waits-­‐sporadic-­‐cpu-­‐ spikes-­‐and-­‐systema4c-­‐troubleshoo4ng/   •  Or  just  google  for  “cursor  pin  s”       www.enkitec.com   3      
  • 4. Environment   •  •  •  •  High-­‐concurrency,  high-­‐visibility  OLTP  database   Oracle  11.1.0.7  single-­‐instance,  dedicated  server  processes   HP-­‐UX  on  Itanium   32  CPUs,  128  GB  RAM   •  Thousands  of  end  users   •  Mul4ple  WebLogic  applica4on  servers  talking  to  database  via   connec4on  pools   www.enkitec.com   4      
  • 5. The  problem   •  •  •  •  Sporadic  extreme  slowness  of  Oracle  DB  and  the  server   Slowness  lasts  for  1  ..  20  minutes  at  a  4me…   Queries  don’t  answer  or  extremely  slow   Can’t  even  log  on  to  OS  during  that  4me   •  New  SSH  connec4ons  succeded  once  the  spike  was  over   •  It  takes  minutes  to  run  simple  OS  commands  during  the  problem  4me   •  This  is  a  global  server-­‐wide  problem  –  everyone  complains!   So,  the  scope  of  this  problem   is  global,  server-­‐wide.   Therefore  we  can  use  global,   server-­‐wide  metrics  to   diagnose  the  problem.   www.enkitec.com   5      
  • 6. Let’s  pick  and  diagnose  one  occurrence  of  this  problem   •  The  database  response  4mes  extremely  bad  again  around   18:10  and  this  lasted  for  about  5  minutes…   •  If  it’s  the  users  who  report  the  problem  (as  opposed  to  applica4on   side  measurements),  then  there  may  be  some  discrepancies  in  the   user  reported  4mes  vs  actual  problem  4me     www.enkitec.com   6      
  • 7. Ini4al  AWR  Report   Number  of  sessions   has  grown  by  ~500!   Snap Id Snap Time Sessions Curs/Sess --------- ------------------- -------- --------Begin Snap: 61921 30-Oct-10 18:00:10 2,383 28.9 Using  a  20  minute   End Snap: 61922 30-Oct-10 18:20:20 2,863 24.7 report  for  diagnosing   Elapsed: 20.17 (mins) a  5  minute  problem?!   DB Time: 559.31 (mins) Avg % DB Event Waits Time(s) (ms) time Wait Class --------------------------- --------- ----------- ------ ------ ---------db file sequential read 2,135,668 21,468 10 64.0 User I/O DB CPU 5,860 17.5 log file sync 92,720 1,498 16 4.5 Commit read by other session 91,676 1,307 14 3.9 User I/O SQL*Net message from dblink 525 1,132 2155 3.4 Network This  66%  idle  is  an   average  over  20   minutes!   Host CPU (CPUs: ~~~~~~~~ 32 Cores: 32 Sockets: 32) Load Average Begin End %User %System %WIO %Idle --------- --------- --------- --------- --------- --------0.37 0.31 16.3 17.6 12.1 66.1 www.enkitec.com   7      
  • 8. ASH  data  (shown  in  OEM)   •  Average  ac4ve  sessions  showed  something  different   •  Note  that  this  data  is  from  another  period  of  4me  when  a  similar  spike  happened   •  In  worst  4mes  there  were  up  to  220  ac4ve  sessions  trying  to  be  on  CPU!   •  Thanks  to  beAer  granularity  we  see  the  spikes  instead  of  some  20-­‐minute  or  hourly   averages…   •  The  problem  with  ASH  samples  is  that  it  looks  into  session  state  from  inside  Oracle   •  Perhaps  the  starva5on  is  due  to  some  other  applica5on  /  instance  in  the  server?   www.enkitec.com   8      
  • 9. How  many  logons  were  done?   Number  of  sessions   has  grown  by  ~500!   Snap Id Snap Time Sessions Curs/Sess --------- ------------------- -------- --------Begin Snap: 61921 30-Oct-10 18:00:10 2,383 28.9 End Snap: 61922 30-Oct-10 18:20:20 2,863 24.7 Elapsed: 20.17 (mins) DB Time: 559.31 (mins) Statistic Total per Second -------------------------------- ------------------ -------------index fetch by key 24,174,148 19,971.0 index scans kdiixs1 24,565,055 20,293.9 leaf node 90-10 splits 5,865 4.9 leaf node splits 14,529 12.0 lob reads 34,480 28.5 lob writes 1,623,273 1,341.0 lob writes unaligned 1,623,266 1,341.0 logons cumulative 2,550 2.1 messages received 133,740 110.5 messages sent 133,740 110.5 min active SCN optimization appl 538,358 444.8 no buffer to keep pinned count 6,331 5.2 no work - consistent read gets 146,703,542 121,196.1 opened cursors cumulative 4,168,700 3,443.9 www.enkitec.com   2.1  logons  per   second,  but  we   don’t  know  how   these  logons  are     per Trans ------------distributed  over   43.6 the  20  minute   44.3 period!   0.0 0.0 0.1 2.9 2.9 0.0 0.2 0.2 1.0 0.0 264.7 7.5 9      
  • 10. OS  level  metrics  don’t  lie  (well,  they  do,  but  less  ;-­‐)   | | | Date | Time |CPU % | 10/30/2010|17:45:00| 25.06| 10/30/2010|17:50:00| 24.77| 10/30/2010|17:55:00| 24.60| 10/30/2010|18:00:00| 25.95| 10/30/2010|18:05:00| 22.88| 10/30/2010|18:10:00| 66.89| 10/30/2010|18:15:00| 24.51| 10/30/2010|18:20:00| 25.47| 10/30/2010|18:25:00| 28.38| Phys | IO Rt | 7149.5| 5334.8| 7176.4| 7556.2| 5379.5| 4544.6| 7544.9| 5144.0| 10139.5| Phys |Memory|Pg Out | VM Pg | KB Rt | % | Rate | Scan Rt | 145817.6| 74.07| 0.0| 0.0| 60928.0| 73.98| 0.0| 0.0| 186368.0| 73.98| 0.0| 0.0| 192307.2| 74.11| 0.0| 0.0| 67584.0| 74.15| 0.0| 0.0| 58060.8| 76.97| 0.0| 0.0| 159334.4| 76.40| 0.0| 0.0| 59187.2| 75.04| 0.0| 0.0| 151552.0| 74.37| 0.0| 0.0| 1)  What  does  the  Time  18:10:00  mean,  beginning  of  the   monitoring  interval  or  end?   2)  66.89%  busy  during  5  minutes  may  actually  mean  100%  busy   during  ~3  minutes  out  of  5,  but  we  don’t  know  that  for  sure   without  measuring  in  more  detail  (beAer  granularity)…   www.enkitec.com   10      
  • 11. Measuring  CPU  u4liza4on  in  more  detail   The  spike  lasted  from   18:11  to  18:14  (3  min)   Around  90%  in   Kernel  mode!!!   www.enkitec.com   11      
  • 12. Checkpoint  –  measured  evidence  so  far   •  Fact:  We  have  a  100%  CPU  u4liza4on  spike,  las4ng  3  minutes   •  Fact:  90%  of  it  is  spent  in  KERNEL  mode   •  Fact:  We  have  over  2500  logons  done  during  20  minute   period   •  2.1  logons  /  second  on  average  (which  doesn’t  sound  bad)   •  Kernel  mode  CPU  usage  is  usually  caused  by  system  calls   •  …or  some  internal  kernel  thread  ac4vity   100%  CPU  usage  doesn’t  always  automa4cally  mean  you  have  a  serious  CPU   starva4on  problem.  The  CPU  runqueue  length  would  indicate  you  how  much   starva4on  (wai4ng  for  CPU  service)  there  is.  However  seeing  90%  of  CPU  used  in   KERNEL  mode  is  definitely  not  normal  for  an  Oracle  database  server.   www.enkitec.com   12      
  • 13. Diagnosing  90%  kernel-­‐mode  CPU  usage  spikes…   1.  Systema4c   •  •  •  •  Break  down  this  90%  of  Kernel  mode  CPU  usage   Profiling!   Oh,  this  is  a  produc4on  system  and  the  problem  is  acute  &  ongoing   On  Solaris,  I’d  have  used  Dtrace  stack()  probe  to  record  OS  kernel   stack  traces  most  common  on  CPU  (google  for  dstackprof)   •  •  Or  lockstat  as  it  reports  spins  on  spinlocks  (which  consume  kernel  CPU)   But  this  was  HP-­‐UX  and  I  didn’t  know  the  tools  needed   •  •  But  I  knew  what  numbers  I  wanted  to  see!   We  sent  a  request  to  HP-­‐UX  support:     “How  do  we  measure  &  break  down  where  is  kernel  mode  CPU  used?”   2.  Check  for  usual  suspects   •  Fast,  cheap  checks  to  rule  out  or  find  known  troublemakers   www.enkitec.com   13      
  • 14. Kernel  mode  CPU  usage  spikes  –  the  usual  suspects   •  Before  star4ng  the  systema4c  troubleshoo4ng  &  drilldown,   do  quick  checks  for  usual  suspects   •  Remember,  the  client  has  a  business  problem,  4me  is  of  essence…   1.  Logon  (or  logoff)  storms   •  Spawning,  ini4alizing  new  processes,  opening  files  and  aAaching  to   SGA  means  system  calls,  kernel  CPU  usage   2.  Oracle  code  gevng  into  some  crazy  loop  (due  to  a  bug)   •  Semop(),  yield(),  read  /proc/…,  getrusage(),  etc  loop   3.  OS  kernel  spinlock  conten4on   •  •  Not  so  usual  suspects  really…   Variety  of  reasons…   ONen  due  to  bugs  in  OS  or  some  kernel  module   www.enkitec.com   14      
  • 15. Measuring  logon  storms   •  Use  the  AUD$  records  or  “logons  cumula@ve”  number  from     V$SYSSTAT  or  AWR,  right?   •  Wrong!   •  logons  cumula5ve  number  is  incremented  by  the  session  itself   –  aNer  it  has  logged  on,  the  same  applies  to  audit  records!   1.  2.  3.  4.  5.  ANer  the  listener  connec4on  has  been  established…   The  process  has  been  started…   It  has  aAached  to  SGA  SHM  segments…   Audit  file  has  been  wriAen  (if  needed)  …   Process,  session  SGA  structures  have  been  created   •  Memory  from  OS  and  shared  pool  allocated  (shared  pool  latches!)   6.  Session  has  been  authen4cated   7.  Then  the  logons  cumula4ve  is  incremented!     www.enkitec.com   The  logon  storm  may   have  started  way   before  these  logons   finally  succeeded!     15      
  • 16. Measuring  logon  storms   •  Logon  storms  should  be  measured  at  the  listener  level   •  Process  listener.log  using  a  script:   $ tail listener.log 30-OCT-2010 23:22:03 30-OCT-2010 23:22:08 30-OCT-2010 23:22:08 30-OCT-2010 23:22:09 * * * * (CONNECT_DATA=…) (CONNECT_DATA=…) (CONNECT_DATA=…) service_update * * establish * E2SNDB * 0 * establish * E2SNDB * 0 * establish * E2SNDB * 0 E2SNDB * 0 $ fgrep "30-OCT-2010" listener.log | fgrep "establish" | awk '{ print $1 " " $2 }' | awk -F: '{ print $1 }' | sort | uniq –c … 5564 30-OCT-2010 20 5571 30-OCT-2010 21 5703 30-OCT-2010 22 5703  logons  during   4828 30-OCT-2010 23 22:00  –  22:59   … www.enkitec.com   16      
  • 17. Measuring  logon  storms   •  Logon  storms  should  be  measured  at  the  listener  level   •  Process  listener.log  using  a  script:   $ fgrep "30-OCT-2010 22:" listener.log | fgrep "establish" | awk '{ print $1 " " $2 }' | awk -F: '{ print $1 ":" $2 }' | sort | uniq –c … 88 30-OCT-2010 22:00 120 30-OCT-2010 22:01 94 30-OCT-2010 22:02 How  many  connec4ons   94 30-OCT-2010 22:03 listener  established  every   95 30-OCT-2010 22:04 minute  (this  data  from  non-­‐ 120 30-OCT-2010 22:05 problem  4me)   79 30-OCT-2010 22:06 101 30-OCT-2010 22:07 85 30-OCT-2010 22:08 100 30-OCT-2010 22:09 85 30-OCT-2010 22:10 89 30-OCT-2010 22:11 83 30-OCT-2010 22:12 93 30-OCT-2010 22:13 www.enkitec.com   17      
  • 18. Con4nuing  the  OS  kernel  mode  CPU  usage  diagnosis   Let’s  trace  system   calls  by  one  CPU   heavy  Oracle  process   # tusc -cp 8021 ( Attached to process 8021 ("oracleXYZ (LOCAL=NO)") [64-bit] ) ( Detaching from process 8021 ("oracleXYZ (LOCAL=NO)") ) Syscall open ----Total Interes4ngly  3  open()  syscalls   take  over  a  second  in  total.   Could  this  be  caused  by  the   general  CPU  starva4on  in  the   server?   Seconds 1.05 ----1.05 Calls 3 ----3 The  next  step  should  have   been  to  check  which  file  did   we  try  to  open  (but  the  spike   ended  before  we  could  do   that)   www.enkitec.com   Errors 3 ----3 Seeing  errors  isn’t  a   problem  itself,  as  this   is  how  Oracle  and  libs   some4mes  check  for   existence  of  a  file…   18      
  • 19. Con4nuing  the  OS  kernel  mode  CPU  usage  diagnosis   •  HP-­‐UX  support  got  back  to  us  and  recommended  the  use  of   runki  tool  for  measuring  kernel  CPU  u4liza4on  in  detail   •  It  had  to  be  installed  as  root   •  It  measured  a  lot  of  things  happening  in  kernel,  wri4ng  a  big  output   file   •  on  Solaris  there’s  a  tool  called  TNF  trace,  AIX  has  trace  command  for  such   full  kernel  tracing   •  Now  we  had  to  wait  for  the  problem  to  happen  again   •  Didn’t  have  to  wait  for  too  long…   •  We  sent  the  raw  trace  dump  to  HP  Support,  so  they  could  run   something  like  “tkprof”  on  that  tracefile   •  It  basically  just  summed  up  the  kernel  spinlock  wait,  spin  events  by  lock,   object  locked  etc   www.enkitec.com   19      
  • 20. kiprof  –  profiled  runki  output   Total Hardclock traces: 30239 ================================ State Count Percent USER 7130 23.58 SYS 22965 75.94 IDLE 24 0.08 SSYS 120 0.40   That’s  basically   spinning  for  locks   (adap4ve  decisions  to   spin  or  wait)   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Kernel Functions executed during profile Count Pct State Function Self-­‐explanatory.  We   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 15615 51.64% SYS timed_preArbitration are  spinning  for  a  lock   7130 23.58% USER OTHER 1275 4.22% SYS spinlock 1008 3.33% SYS wait_for_lock_spinner VxFS  directory  block   488 1.61% SYS vx_dirbread read?   464 1.53% SYS vx_bc_getblk 417 1.38% SYS preArbitration 338 1.12% SYS vx_dirscan VxFS  directory   291 0.96% SYS vx_bc_brelse contents  scan  !!!   www.enkitec.com   20      
  • 21. Drilling  down  to  kernel  spinlock  spinning   The  main  spinlock  experiencing  spinning  was  related  to  VxFS:   spn%kern 7.24 0.11 0.04 0.04 0.01 cumpct 7.24 7.36 7.40 7.44 7.45 seconds 71.96 1.14 0.44 0.42 0.12 spn%cpu lock name 3.77 FS:vxfs:bc_freelist_lock spin 0.06 FS:vxfs:inode spin for sleep lock 0.02 FS:vxfs:i_spinspin lock 0.02 Sleep Queue lock 0.01 v_count_lock Oracle  processes  were  the  main  ones  spinning:   spn%spn 99.17 0.60 0.08 0.05 0.03 0.03 cumpct 99.17 99.77 99.85 99.91 99.93 99.96 spnsec 73.82 0.45 0.06 0.04 0.02 0.02 usrsec 880.32 0.00 1.42 0.00 0.00 0.00 kernsec spn%kern process name 543.70 13.58 oracle 7.37 6.11 vxfsd 11.52 0.52 tnslsnr 402.60 0.01 [IDLE] 0.50 4.00 xyz 0.43 4.65 sadc www.enkitec.com   21      
  • 22. Which  system  calls?   $ /opt/ktools/bin/kiinfo -kitrace pid=28679,syscalls -ts 1103_1307 | grep open | sort -r -n -k 13,13 | more 2.550292 cpu=6 seqcnt=43808107 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 3.295138 ktcsys= 0.000000 A0=0x9fffffff A1= 0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/ admin/xyzdb/adump/xyzdb_ora_28679_21.aud' 8.789893 cpu=6 seqcnt=43825746 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 3.172851 ktcsys= 0.000000 A0=0x9fffffff A1= 0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/ admin/xyzdb/adump/xyzdb_ora_28679_23.aud' 5.617009 cpu=6 seqcnt=43816702 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 3.066686 ktcsys= 0.000000 A0=0x9fffffff A1= 0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/ admin/xyzdb/adump/xyzdb_ora_28679_22.aud' 18.143066 cpu=6 seqcnt=43848837 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 2.576237 ktcsys= 0.000000 A0=0x9fffffff A1= 0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/ admin/xyzdb/adump/xyzdb_ora_28679_26.aud' www.enkitec.com   22      
  • 23. Audit  file  des4ona4on   •  New  audit  file  name  format  in  11g…   •  A  new  file  is  created  for  each  audit  file  where  the  SPID  collides  with  a   previous  file…   •  Every  @me  when  crea@ng  a  new  audit  file,  Oracle  has  to  check  whether   such  file  already  exists  with  suffix  _1,  then  _2,  _3,  etc..   $ cd /u01/app/oracle/admin/E2SNDB/adump $ ls -l | head total 4788 -rw-r----- 1 oracle dba 735 Feb 28 16:06 -rw-r----- 1 oracle dba 710 Oct 16 17:58 -rw-r----- 1 oracle dba 735 Oct 16 17:58 -rw-r----- 1 oracle dba 735 Feb 27 17:53 -rw-r----- 1 oracle dba 736 Oct 16 17:58 -rw-r----- 1 oracle dba 740 Oct 16 17:58 -rw-r----- 1 oracle dba 735 Feb 28 16:07 -rw-r----- 1 oracle dba 735 Feb 24 17:44 -rw-r----- 1 oracle dba 735 Dec 22 21:28 www.enkitec.com   e2sndb_ora_10028_1.aud e2sndb_ora_10082_1.aud e2sndb_ora_10082_2.aud e2sndb_ora_10095_1.aud e2sndb_ora_10120_1.aud e2sndb_ora_10125_1.aud e2sndb_ora_10158_1.aud e2sndb_ora_10206_1.aud e2sndb_ora_10482_1.aud 23      
  • 24. Shouldn’t  the  audit  files  be  created  only  for  SYSDBA  and   SYSOPER  access?   •  In  theory,  yes.   •  In  prac4ce  in  our  case,  no.   •  Bug  9744092:  EXCESSIVE  AMOUNT  OF  AUD  FILES  BEING   GENERATED  IN  11.1   •  Oracle  generated  a  new  .aud  file  for  every  new  database  connec4on!   •  Not  just  SYSOPER/SYSDBA  like  normally   •  This  is  all  despite  having  audit_trail  =  DB     •  Normally  the  .aud  files  in  audit_dump_dest  are  not  related  to  regular   audit  trail,  but  for  SYSOPER/SYSDBA  logon,  startup/shutdown  audi4ng   •  …and  if  AUDIT_SYS_OPERATIONS  =  TRUE  then  all  commands  issued  as  SYS   www.enkitec.com   24      
  • 25. Bug  9744092:  EXCESSIVE  AMOUNT  OF  AUD  FILES  BEING   GENERATED  IN  11.1   PROBLEM: -------After upgrade to 11.1 the system creates 10 - 16 *aud files per minute in audit_file_dest. Out of 9528 *.aud files that the customer uploaded, 9124 files recorded ACTION:[3] "102". DIAGNOSTIC ANALYSIS: -------------------The change of behavior (move audit action 102 from aud$ to audit_file_dest when audit_trail=DB.) is due to a fix for an unpublished bug 5476184 in 11.1. It is not an intended feature for 11G. WORKAROUND: ----------Manual delete of audit files www.enkitec.com   25      
  • 26. Diagnosis   1.  Thanks  to  bug  9744092  and  a  behavior  change  in  Oracle  11.1   a  new  audit  file  was  created  for  each  new  connec4on  to  DB   •  If  a  file  already  existed,  Oracle  checked  if  a  similar  file  name  with   larger  suffix  value  (_2,  _3,  _4  etc)  existed   2.  The  audit_file_dest  eventually  had  over  1.5  M  files  in  it!   •  For  each  logon,  mul4ple  file  existence  checks  (open()  syscalls)  had  to   be  done   3.  Checking  whether  a  file  existed  (open  syscall  -­‐>  directory   entry  scan)  became  very  slow  –  and  it’s  done  in  kernel  mode   •  •  A  spinlock  was  held  during  the  directory  entry  scan   Other  new  Oracle  processes  also  wanted  to  do  the  directory  scan,   resul4ng  in  spinlock  conten4on  and  further  Kernel  mode  CPU  usage   4.  When  the  DB  got  slow  –  app  servers  fired  up  hundreds  of   new  connec4ons  to  “make  things  faster”   •  This  all  fed  back  to  the  problem  –  even  more  conten4on  &  spinning   www.enkitec.com   26      
  • 27. Limi4ng  logon  storms   Use  Oracle  Listener  connec4on  rate  limiter  (11gR1+)   listener.ora: LISTENER= (ADDRESS_LIST= (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521)(RATE_LIMIT=5)) (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1522)(RATE_LIMIT=10)) (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1523)) )   Oracle  Documenta5on:  Oracle  Net  Listener  Parameters  (listener.ora)   hQp://download.oracle.com/docs/cd/B28359_01/network.111/b28317/listener.htm     Also,  it  is  possible  to  limit  logoff  storm  rate    _logout_storm_rate  parameter  (instance-­‐wide)   www.enkitec.com   27      
  • 28. Troubleshoo4ng  sporadic  system  performance  issues     Right  Data  !!!     •  Right  scope  –  if  your  problem  lasts  for  seconds,  this  should  be   the  granularity  of  your  data  too   •  OS  level  data,  in  addi4on  to  the  database  metrics   •  Ideally  OS  level  metrics  sampled  mul4ple  4mes  per  minute   www.enkitec.com   28      
  • 29. Conclusions   •  Logon  storms  are  evil!   •  They  will  amplify  any  performance  hiccups  as  they  cause  extra  load   just  when  the  resources  are  scarcest   •  Connec4on  pools  firing  up  hundreds  of  new  connec4ons  are   evil!   •  Know  your  limits  (both  max  connec4ons  and  max  connect  rate  /  sec)   •  Here’s  a  thought:   •  If  you  have  planned  the  servers’  capacity  to  support  N-­‐thousand   connec4ons  anyway  (by  allowing  connec4on  pools  grow  that  high),  why   not  create  this  amount  of  connec4ons  right  away?   •  This  would  avoid  logon  storms  during  worst  4mes  as  all  connec4ons  have   already  been  created!   www.enkitec.com   29