This is a presentation I created for RMOUG 2014 which I was sadly unable to attend. However, I wanted to share it with the Oracle community so that you can learn a bit about metrics that are frequently cited, frequently demonized, and frequently misused. In this deck we will go through the steps to diagnose issues and what NOT to blame as you go through the process.
The topics and concepts discussed here were originally formed in a blog post on the OracleAlchemist.com site: http://www.oraclealchemist.com/news/these-arent-the-metrics-youre-looking-for/
2. Steve Karam
Technical Manager at Delphix
Oracle Certified Master, ACE, and other
acronyms
Just a little social
Blog: http://www.oraclealchemist.com
Twitter: @OracleAlchemist
Google Plus: +SteveKaram
Facebook: OracleAlchemist
3. Hunting for Metrics
Oracle has more metrics than you can
shake a stick at
Automatic Workload Repository (AWR)
Active Session History (ASH)
STATSPACK
V$ and X$ views
4. These Aren't the Metrics You're
Looking For
The problem is not a
lack of data, it's
knowing how—and
when—to use it.
5. The Database is…
Broken
Slow
Down
Not working
Giving me errors
Step 1: What is the actual problem?
6. I'm going to…
Gather stats
Add an index to something
Bounce the database
Blame the SysAdmins
Blame the code
Kill the backup
Step 2: Don't be hasty. Suppress knee-jerk
reactions. They have no place in problem
analysis.
7. I think I know what to do!
That's great! Thinking is good. But if you only
think you found a solution, chances are good
that there's more to it.
Step 3: Don't immediately think in terms of
fixes. Think in terms of findings and
recommendations.
8. Gathering Stats
Just like the optimizer needs to gather stats for proper query
analysis, you need to gather stats for problem analysis.
Think of it like a popular TV medical drama:
Your database is the patient. Its job is to be sick.
Your end users are the concerned family and friends.
It's their job to be panicky.
You are the doctor and team. It's your job to be brilliant.
Step 4. Be brilliant.
9. Problem Analysis
Okay, so "Be Brilliant" isn't a good step. At
this point, what you really need to do is
choose a path to solving the issue at hand.
There are a few methods for doing this:
Top down: Review events and waits at a
global level and drill down from there.
Scientific Method: Do background
research, form a hypothesis, test your
hypothesis, analyze the outcome.
Differential Diagnosis: narrow down the
list of possible causes using the
process of elimination.
10. The Top Down Approach
Top down tuning is a viable
method, and is almost always
preferable to bottom up tuning.
This is very useful when you
know the issue is global and
you need to drill down into a
root cause. It's good when
things suddenly go wrong;
however, it can be difficult when
there are multiple root causes.
11. Scientific Method
This method is highly
effective at ensuring
factual resolutions to
problems. While it may
not always be suitable
for quickly resolving a
critical issue, it‘s always
suitable for case studies
and post-fix root cause
analysis.
12. Differential Diagnosis (DDX)
This method is great for global issues where the
root cause is unknown and no significant change
has occurred.
Gather information
List symptoms
List possible conditions
based on the symptoms
Test Test Test
Eliminate conditions
Don't kill the patient
13. Speaking of House
In the show "House", the main character
has a saying: Everybody lies.
DBA: So everyone, what
changed?
Developer: Nothing.
SysAdmin: Nothing.
Network Admin: Nothing.
Project Manager: Nothing.
14. You never told us the real Step
4
Step 4 is simple.
Solve the problem.
15. Well sure, but how?
The methods we've discussed are all well
and good for looking into problems and
figuring out the cause and a solution.
For the most part, it will be up to you to:
Gather the right metrics
Synthesize your data
Create findings and recommendations
Test for success
16. What are the right metrics?
There are tons of papers and articles out
there on wait events, metrics, and other
metadata you should look for. We're not
here for that.
There are guides on how to use the
metrics you find. We're not here for that
either.
No, we're here to discuss…
18. #5: db file scattered read
What it is:
An indication of a multiblock I/O
What it is not:
A full table scan
A reason to panic
The culprit (not always, anyway)
19. #5: db file scattered read
The 'db file scattered read' event happens
when Oracle performs a multiblock I/O;
for instance, when a full table scan occurs.
Index full scans and fast full scans also
result in multiblock I/O. But those don‘t
sound so horrible, now do they?
Why is that?
20. #5: db file scattered read
Over the years, DBAs and developers have
cultivated a mortal terror of full table scans. Of
course, they can be a problem, but are they
always the problem? Of course not.
Some facts about db file scattered reads:
They are an incredibly efficient way to use
the disk to gather large amounts of unordered
data
They aren't the only indication of full scans or
multiblock I/O; the 'direct path read' and 'db file
parallel read' events also indicate multiblock reads.
21. #5: db file scattered read
Before you go off on a witch hunt
because of a 'db file scattered read'
event, consider the following:
Are there any indications that full
scans are actually the problem?
Are you sure that an index read
would be more efficient in this
case?
Do your other symptoms match
up with the conclusion that a
query performing a full table scan
is your culprit?
Full table scans
are the devil!
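One quick sanity check before the witch hunt: see how much time the multiblock-read events actually account for relative to their single-block cousin. This is a sketch against the documented V$SYSTEM_EVENT view; which events matter in your workload is still your call.

```sql
-- Sketch: how much time do the multiblock-read events actually account for?
-- V$SYSTEM_EVENT is cumulative since instance startup; AWR deltas are
-- better for analyzing a specific time window.
SELECT event,
       total_waits,
       ROUND(time_waited_micro / 1000000) AS seconds_waited
  FROM v$system_event
 WHERE event IN ('db file scattered read',
                 'db file sequential read',
                 'direct path read',
                 'db file parallel read')
 ORDER BY time_waited_micro DESC;
```

If 'db file scattered read' barely registers next to everything else, full scans are probably not your culprit.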
22. #4: Parse to Execute Ratio
What it is:
An indication of how often you're parsing
vs. executing queries
What it is not:
An indication of how often you're hard
parsing vs. executing queries
23. #4: Parse to Execute Ratio
Based on this formula:
round(100*(1-:parse/:execute),2)
If you hard parse a query and then
execute it, your Execute to Parse % is 0.
If you soft parse a query and then execute
it, your Execute to Parse % is 0.
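The formula above can be run directly against V$SYSSTAT. A sketch, using the real statistic names 'parse count (total)' and 'execute count'; note this gives an instance-lifetime figure, whereas AWR computes it over a snapshot interval:

```sql
-- Sketch: AWR-style "Execute to Parse %" from cumulative V$SYSSTAT counters
SELECT ROUND(100 * (1 - p.value / e.value), 2) AS execute_to_parse_pct
  FROM (SELECT value FROM v$sysstat WHERE name = 'parse count (total)') p,
       (SELECT value FROM v$sysstat WHERE name = 'execute count') e;
```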
24. #4: Parse to Execute Ratio
What about all those articles and forum
posts that say adding bind variables will
improve your Execute to Parse %?
They're not wrong, just incomplete. Adding
bind variables will improve your Execute
to Parse %... IF you have some form of
statement caching enabled.
25. #4: Parse to Execute Ratio
Hard Parses can take up valuable CPU cycles
Soft Parses can still cripple your Oracle
instance
The best way to reduce library cache
contention is to not touch it at all!
26. #4: Parse to Execute Ratio
Tom Kyte said it best:
"there are three types of parses (well, maybe four) in Oracle...
there is the dreaded hard parse - they are VERY VERY
VERY bad.
there is the hurtful soft parse - they are VERY VERY very
bad.
there is the hated softer soft parse you might be able to
achieve with session cached cursors - they are VERY very
very bad.
then there is the absence of a parse, no parse, silence. This
is golden, this is perfect, this is the goal."
27. #4: Parse to Execute Ratio
To see hard parses vs. soft parses, check
out the Parse Count (total) and Parse
Count (hard) in an AWR report or V$ views
To reduce parsing as a whole (the actual
goal), make sure the code does not
explicitly parse per execution OR that the
client software has statement caching
enabled.
For example, in JBoss, you can set the
prepared-statement-cache-size parameter
SESSION_CACHED_CURSORS is not the
same thing!
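A minimal sketch of the V$ route, using the real V$SYSSTAT statistic names:

```sql
-- Sketch: total vs. hard parses since instance startup
SELECT name, value
  FROM v$sysstat
 WHERE name IN ('parse count (total)', 'parse count (hard)');
```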
28. #3: Buffer Hit Ratio
What it is:
Another ratio
A proportional view of LIOs to PIOs
What it is not:
A silver bullet
A magic ratio
A valuable performance indicator
29. #3: Buffer Hit Ratio
Wait, buffer hit ratio isn't valuable?
Okay, maybe that was a little heavy-handed.
It can be valuable as an "at-a-glance"
metric to see if something is
absolutely abysmal.
30. #3: Buffer Hit Ratio
It is important to remember
that a high buffer hit ratio
doesn't necessarily mean
the data you needed was
available in cache when it
was needed. It also doesn't
mean the queries you're
running are optimal…they
just happen to be getting
their data from cache.
100% of crap in RAM is still
crap. It's just logical crap.
31. #3: Buffer Hit Ratio
So what is it good for?
If you know your queries are perfect
(lolright) then it can indicate that you
don't have enough RAM allocated to
your buffer cache.
That's it, I just have a second bullet here
to keep the other one company.
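For reference, the classic ratio can be computed from V$SYSSTAT like this (the statistic names are real; remember, a pretty number here proves nothing on its own):

```sql
-- Sketch: the classic buffer cache hit ratio from cumulative V$SYSSTAT counters
SELECT ROUND(100 * (1 - phy.value / (db.value + con.value)), 2) AS buffer_hit_pct
  FROM (SELECT value FROM v$sysstat WHERE name = 'physical reads') phy,
       (SELECT value FROM v$sysstat WHERE name = 'db block gets') db,
       (SELECT value FROM v$sysstat WHERE name = 'consistent gets') con;
```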
32. #2: CPU %
What it is:
CPU Usage per CPU
What it is not:
Equivalent to your laptop's CPU %
A viable measure of CPU usage (alone)
A way to diagnose performance
33. #2: CPU %
This isn't your Windows
laptop.
When your PC shows 99%
or 100% CPU usage, you
panic. That‘s because you
only have one CPU
(usually), and 99% means
you can barely drag a
window from one side of
the screen to the other.
34. #2: CPU %
In the multi-processor world, it's not as big a
problem. In fact, it can be a huge benefit.
You have multiple CPUs on your servers.
99% usage of one or more is probably not
a big deal.
The CPU is the part of the
system that performs work (as opposed to
waiting). You want it to be heavily utilized.
35. #2: CPU %
What do you pay licensing
based on?
Number of CPUs.
So what do you actually
want to be as fully utilized
as possible?
36. #2: CPU %
Instead, we should be looking at:
Runqueue length – Provided by
vmstat, uptime, top, and other tools.
Shows the number of processes
actively waiting or working on CPU
at any given time.
Oracle Average Active Sessions –
This metric is usually more pertinent
from the DBA side, as it shows the
number of sessions actively waiting
or working at any given time.
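Oracle exposes Average Active Sessions directly. A sketch against the documented V$SYSMETRIC view (the metric name is real; group_id = 2 selects the long-duration interval):

```sql
-- Sketch: current Average Active Sessions over the last long-duration interval
SELECT metric_name, ROUND(value, 2) AS avg_active_sessions
  FROM v$sysmetric
 WHERE metric_name = 'Average Active Sessions'
   AND group_id = 2;  -- 2 = long-duration (60-second) metric group
```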
37. #2: CPU %
The focus should be on concurrency
Using a single CPU heavily is only a
problem if the other CPUs are fairly
dormant…but that's another issue
entirely.
Even run queue is not a perfect metric—
some things, like uninterruptible I/O
wait, can skew the results.
I/O wait should be part of the bigger picture
along with run queue length.
38. #1: Cost
What it is:
A numerical estimation proportional to
the expected resources necessary to
execute a statement with a given plan.
What it is not:
Anything else.
39. #1: Cost
This one comes up
all. the. time.
Here's a simple thing
to keep in mind:
Oracle's optimizer
is cost based
Your tuning
practices are not
40. #1: Cost
Cost is good to understand, so you can
understand why Oracle chose the plan it
did.
However, you shouldn't try to tune
specifically to reduce cost.
Cost is not proportional to time. A high or
low cost doesn't necessarily mean a
query will be slower or faster.
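To see where the cost number actually lives, a sketch using EXPLAIN PLAN and DBMS_XPLAN (the table and predicate are illustrative, not from the original deck):

```sql
-- Sketch: look at, but don't tune to, the Cost column of a plan
EXPLAIN PLAN FOR
  SELECT * FROM employees WHERE department_id = 10;  -- illustrative query

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- The Cost column is a relative internal estimate, not elapsed time.
```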
41. #1: Cost
Why is cost misused?
"Gather stats" is like the
"restart Windows" of the
Oracle world. Gathering
stats changes plans.
Plans have costs. I
should tune costs.
The cost based
optimizer changed my
plan. It's cost based. I'm
cost based.
42. #1: Cost
Cost is not a bottleneck, nor is it
indicative of actual work. It's indicative of
relative work based on parameters that
exist purely in the calculations of your
particular Oracle instance.
Instead of tuning to reduce cost, tune to
reduce bottlenecks. Those are real
things that cause real wait.
43. #1: Cost
Real things to tune
Reduce block touches (both physical and
logical) by improving your query selectivity,
join order, index usage, etc.
Reduce parses, both hard and soft.
Investigate execution plans and use
statistics, hints, or other methods to improve
Oracle's costing—just don't try to 'tune down
cost' directly.
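To find the block-touch heavy hitters worth tuning, a sketch against V$SQL (the view and columns are real; the FETCH FIRST syntax requires Oracle 12c or later):

```sql
-- Sketch: statements doing the most logical I/O per execution
SELECT sql_id,
       buffer_gets,
       executions,
       ROUND(buffer_gets / executions) AS gets_per_exec
  FROM v$sql
 WHERE executions > 0
 ORDER BY gets_per_exec DESC
 FETCH FIRST 10 ROWS ONLY;
```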
44. Step 4…
Step 4, if you remember, was "solve the
problem."
That advice still stands.
But make sure you use
the right metrics to do it.
And good luck!