The document describes troubleshooting a complex performance issue in an Oracle database. Key details:
- The problem was sporadic extreme slowness of the Oracle database and server lasting 1-20 minutes.
- Initial AWR reports and OS metrics showed a spike at 18:10 with CPU usage at 66.89%, confirming a problem occurred then.
- Further investigation using additional metrics was needed to fully understand the root cause, as initial diagnostics did not provide enough context about this brief problem period.
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
1. Troubleshooting the Most Complex Performance Issue I’ve ever seen
Tanel Poder
http://blog.tanelpoder.com
http://tech.e2sn.com
www.enkitec.com
2. Intro: About me
• Tanel Põder
  • Oracle Database Performance geek
  • Exadata Performance geek
  • Hadoop Performance geek
• Enkitec
  • Consultant
  • Researcher
  • Technology Evangelist
• Just moved to Dallas
  • After Tallinn -> Stockholm -> London -> Cancun -> Singapore
Expert Oracle Exadata book (with Kerry Osborne and Randy Johnson of Enkitec)
3. Two issues - actually
• For warm-up:
  • cursor: pin S wait events and sporadic CPU spikes
• Read more from my blog entry:
  • http://blog.tanelpoder.com/2010/04/21/cursor-pin-s-waits-sporadic-cpu-spikes-and-systematic-troubleshooting/
• Or just google for “cursor pin s”
4. Environment
• High-concurrency, high-visibility OLTP database
• Oracle 11.1.0.7 single-instance, dedicated server processes
• HP-UX on Itanium
• 32 CPUs, 128 GB RAM
  • Thousands of end users
  • Multiple WebLogic application servers talking to database via connection pools
5. The problem
• Sporadic extreme slowness of Oracle DB and the server
• Slowness lasts for 1 .. 20 minutes at a time…
• Queries don’t answer or are extremely slow
• Can’t even log on to OS during that time
  • New SSH connections succeeded once the spike was over
  • It takes minutes to run simple OS commands during the problem time
• This is a global server-wide problem – everyone complains!
So, the scope of this problem is global, server-wide. Therefore we can use global, server-wide metrics to diagnose the problem.
6. Let’s pick and diagnose one occurrence of this problem
• The database response times were extremely bad again around 18:10 and this lasted for about 5 minutes…
• If it’s the users who report the problem (as opposed to application side measurements), then there may be some discrepancies between the user-reported times and the actual problem time
7. Initial AWR Report

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     61921 30-Oct-10 18:00:10     2,383      28.9
  End Snap:     61922 30-Oct-10 18:20:20     2,863      24.7
   Elapsed:               20.17 (mins)
   DB Time:              559.31 (mins)

                                                     Avg   % DB
Event                           Waits     Time(s)   (ms)   time Wait Class
--------------------------- --------- ----------- ------ ------ ----------
db file sequential read     2,135,668      21,468     10   64.0 User I/O
DB CPU                                      5,860          17.5
log file sync                  92,720       1,498     16    4.5 Commit
read by other session          91,676       1,307     14    3.9 User I/O
SQL*Net message from dblink       525       1,132   2155    3.4 Network

Host CPU (CPUs: 32 Cores: 32 Sockets: 32)
~~~~~~~~       Load Average
               Begin       End     %User   %System      %WIO     %Idle
           --------- --------- --------- --------- --------- ---------
                0.37      0.31      16.3      17.6      12.1      66.1

Slide annotations:
• Number of sessions has grown by ~500!
• Using a 20 minute report for diagnosing a 5 minute problem?!
• This 66% idle is an average over 20 minutes!
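As a quick sanity check on the AWR header figures above: dividing DB Time by the elapsed time gives the average number of active sessions over the snapshot, and dividing an event's time by DB Time reproduces its "% DB time" column. The following is a minimal sketch with the numbers copied from the report; the function names are illustrative only and not part of any Oracle tool.

```python
def avg_active_sessions(db_time_min: float, elapsed_min: float) -> float:
    """Average active sessions = DB Time / elapsed wall-clock time."""
    return db_time_min / elapsed_min


def pct_of_db_time(event_seconds: float, db_time_min: float) -> float:
    """Share of DB Time spent in one event, in percent."""
    return 100.0 * event_seconds / (db_time_min * 60.0)


# Values copied from the AWR report above.
DB_TIME_MIN = 559.31
ELAPSED_MIN = 20.17

aas = avg_active_sessions(DB_TIME_MIN, ELAPSED_MIN)
print(f"average active sessions over the snapshot: {aas:.1f}")
print(f"db file sequential read: {pct_of_db_time(21468, DB_TIME_MIN):.1f}% of DB time")
print(f"DB CPU: {pct_of_db_time(5860, DB_TIME_MIN):.1f}% of DB time")
```

Roughly 28 active sessions on average over 20 minutes says very little about what happened during a 5-minute problem inside that window, which is the point of the annotations on this slide.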
8. ASH data (shown in OEM)
• Average active sessions showed something different
  • Note that this data is from another period of time when a similar spike happened
• In worst times there were up to 220 active sessions trying to be on CPU!
  • Thanks to better granularity we see the spikes instead of some 20-minute or hourly averages…
• The problem with ASH sampling is that it looks into session state from inside Oracle
  • Perhaps the starvation is due to some other application / instance in the server?
9. How many logons were done?

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     61921 30-Oct-10 18:00:10     2,383      28.9
  End Snap:     61922 30-Oct-10 18:20:20     2,863      24.7
   Elapsed:               20.17 (mins)
   DB Time:              559.31 (mins)

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
index fetch by key                       24,174,148       19,971.0          43.6
index scans kdiixs1                      24,565,055       20,293.9          44.3
leaf node 90-10 splits                        5,865            4.9           0.0
leaf node splits                             14,529           12.0           0.0
lob reads                                    34,480           28.5           0.1
lob writes                                1,623,273        1,341.0           2.9
lob writes unaligned                      1,623,266        1,341.0           2.9
logons cumulative                             2,550            2.1           0.0
messages received                           133,740          110.5           0.2
messages sent                               133,740          110.5           0.2
min active SCN optimization appl            538,358          444.8           1.0
no buffer to keep pinned count                6,331            5.2           0.0
no work - consistent read gets          146,703,542      121,196.1         264.7
opened cursors cumulative                 4,168,700        3,443.9           7.5

Slide annotations:
• Number of sessions has grown by ~500!
• 2.1 logons per second, but we don’t know how these logons are distributed over the 20 minute period!
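To make the last annotation concrete, the sketch below computes the logon rate for the same 2,550 logons under different assumptions about how they were distributed. The 5-minute and 1-minute burst lengths are hypothetical values chosen only for illustration; the report itself cannot tell us which, if any, applies.

```python
LOGONS = 2550          # "logons cumulative" total from the report above
ELAPSED_MIN = 20.17    # snapshot length from the report above


def logons_per_second(logons: int, window_min: float) -> float:
    """Logon rate if `logons` were spread evenly over `window_min` minutes."""
    return logons / (window_min * 60.0)


print(f"even spread over the snapshot: {logons_per_second(LOGONS, ELAPSED_MIN):.1f}/s")

# Hypothetical burst lengths, chosen only to illustrate the effect.
for burst_min in (5, 1):
    rate = logons_per_second(LOGONS, burst_min)
    print(f"all logons inside {burst_min} min: {rate:.1f}/s")
```

The same total yields anything from about 2 to over 40 logons per second depending on the assumed distribution, so the AWR average alone cannot confirm or rule out a logon storm.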
10. OS level metrics don’t lie (well, they do, but less ;-)

          |        |      |  Phys  |  Phys   |Memory|Pg Out| VM Pg  |
   Date   |  Time  |CPU % | IO Rt  |  KB Rt  |  %   | Rate |Scan Rt |
10/30/2010|17:45:00| 25.06|  7149.5| 145817.6| 74.07|   0.0|     0.0|
10/30/2010|17:50:00| 24.77|  5334.8|  60928.0| 73.98|   0.0|     0.0|
10/30/2010|17:55:00| 24.60|  7176.4| 186368.0| 73.98|   0.0|     0.0|
10/30/2010|18:00:00| 25.95|  7556.2| 192307.2| 74.11|   0.0|     0.0|
10/30/2010|18:05:00| 22.88|  5379.5|  67584.0| 74.15|   0.0|     0.0|
10/30/2010|18:10:00| 66.89|  4544.6|  58060.8| 76.97|   0.0|     0.0|
10/30/2010|18:15:00| 24.51|  7544.9| 159334.4| 76.40|   0.0|     0.0|
10/30/2010|18:20:00| 25.47|  5144.0|  59187.2| 75.04|   0.0|     0.0|
10/30/2010|18:25:00| 28.38| 10139.5| 151552.0| 74.37|   0.0|     0.0|

1) What does the Time 18:10:00 mean, beginning of the monitoring interval or end?
2) 66.89% busy during 5 minutes may actually mean 100% busy during ~3 minutes out of 5, but we don’t know that for sure without measuring in more detail (better granularity)…
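The arithmetic behind note 2 can be sketched as follows. The model is an assumption made for illustration: the CPU is either at a steady baseline or at 100%, and the 25% baseline is read off the neighbouring samples in the table. Under that model, the 5-minute average pins down how many minutes were spent at full load.

```python
def minutes_at_full_load(avg_pct: float, baseline_pct: float, interval_min: float) -> float:
    """Solve avg*T = 100*x + baseline*(T - x) for x, the minutes spent at 100% CPU."""
    return interval_min * (avg_pct - baseline_pct) / (100.0 - baseline_pct)


# 66.89% is the 18:10 sample; 25% is an assumed baseline taken from the other rows.
x = minutes_at_full_load(avg_pct=66.89, baseline_pct=25.0, interval_min=5.0)
print(f"{x:.1f} minutes at 100% CPU would explain the 66.89% average")
```

With these assumptions the answer is just under 3 minutes, consistent with the "~3 minutes out of 5" estimate, but only finer-grained measurement can confirm it.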
11. Measuring CPU utilization in more detail
[Chart: CPU utilization over time, at finer granularity]
• The spike lasted from 18:11 to 18:14 (3 min)
• Around 90% in Kernel mode!!!
12. Checkpoint - measured evidence so far
• Fact: We have a 100% CPU utilization spike, lasting 3 minutes
• Fact: 90% of it is spent in KERNEL mode
• Fact: We have over 2500 logons done during the 20 minute period
  • 2.1 logons / second on average (which doesn’t sound bad)
• Kernel mode CPU usage is usually caused by system calls
  • …or some internal kernel thread activity
100% CPU usage doesn’t always automatically mean you have a serious CPU starvation problem. The CPU runqueue length would indicate how much starvation (waiting for CPU service) there is.
However, seeing 90% of CPU used in KERNEL mode is definitely not normal for an Oracle database server.
13. Diagnosing 90% kernel-mode CPU usage spikes…
1. Systematic
  • Break down this 90% of kernel mode CPU usage
  • Profiling!
  • Oh, this is a production system and the problem is acute & ongoing
  • On Solaris, I’d have used the DTrace stack() probe to record the OS kernel stack traces most common on CPU (google for dstackprof)
    • Or lockstat, as it reports spins on spinlocks (which consume kernel CPU)
  • But this was HP-UX and I didn’t know the tools needed
    • But I knew what numbers I wanted to see!
    • We sent a request to HP-UX support: “How do we measure & break down where is kernel mode CPU used?”
2. Check for usual suspects
  • Fast, cheap checks to rule out or find known troublemakers
14. Kernel mode CPU usage spikes - the usual suspects
• Before starting the systematic troubleshooting & drilldown, do quick checks for usual suspects
  • Remember, the client has a business problem, time is of the essence…
1. Logon (or logoff) storms
  • Spawning, initializing new processes, opening files and attaching to SGA means system calls, kernel CPU usage
2. Oracle code getting into some crazy loop (due to a bug)
  • semop(), yield(), read /proc/…, getrusage(), etc loop
3. OS kernel spinlock contention
  • Not so usual suspects really…
  • Variety of reasons… Often due to bugs in OS or some kernel module
15. Measuring logon storms
• Use the AUD$ records or the “logons cumulative” number from V$SYSSTAT or AWR, right?
• Wrong!
• The logons cumulative number is incremented by the session itself, after it has logged on; the same applies to audit records!
1. After the listener connection has been established…
2. The process has been started…
3. It has attached to SGA SHM segments…
4. Audit file has been written (if needed) …
5. Process, session SGA structures have been created
  • Memory from OS and shared pool allocated (shared pool latches!)
6. Session has been authenticated
7. Then the logons cumulative is incremented!
The logon storm may have started way before these logons finally succeeded!
17. Measuring logon storms
• Logon storms should be measured at the listener level
• Process listener.log using a script:
$ fgrep "30-OCT-2010 22:" listener.log | fgrep "establish" |
    awk '{ print $1 " " $2 }' | awk -F: '{ print $1 ":" $2 }' |
    sort | uniq -c
…
     88 30-OCT-2010 22:00
    120 30-OCT-2010 22:01
     94 30-OCT-2010 22:02
     94 30-OCT-2010 22:03
     95 30-OCT-2010 22:04
    120 30-OCT-2010 22:05
     79 30-OCT-2010 22:06
    101 30-OCT-2010 22:07
     85 30-OCT-2010 22:08
    100 30-OCT-2010 22:09
     85 30-OCT-2010 22:10
     89 30-OCT-2010 22:11
     83 30-OCT-2010 22:12
     93 30-OCT-2010 22:13
How many connections the listener established every minute (this data is from a non-problem time)
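The same per-minute count can be sketched in Python, which is easier to extend (for example to per-second buckets). This is a minimal sketch, not the author's script: it assumes, like the shell pipeline above, that each listener.log entry starts with a date field and an HH:MM:SS time field and that connection entries contain the word "establish". The sample lines are made up for the demonstration.

```python
from collections import Counter
from typing import Iterable


def connections_per_minute(lines: Iterable[str]) -> Counter:
    """Count 'establish' entries per minute, keyed by 'DD-MON-YYYY HH:MM'."""
    counts: Counter = Counter()
    for line in lines:
        if "establish" not in line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        date, time_of_day = fields[0], fields[1]
        minute = ":".join(time_of_day.split(":")[:2])  # HH:MM:SS -> HH:MM
        counts[f"{date} {minute}"] += 1
    return counts


# Made-up sample lines in the assumed listener.log layout.
sample = [
    "30-OCT-2010 22:00:01 * (CONNECT_DATA=(SID=XYZ)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.0.0.1)(PORT=40001)) * establish * XYZ * 0",
    "30-OCT-2010 22:00:30 * service_update * XYZ * 0",
    "30-OCT-2010 22:00:59 * (CONNECT_DATA=(SID=XYZ)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.0.0.2)(PORT=40002)) * establish * XYZ * 0",
    "30-OCT-2010 22:01:05 * (CONNECT_DATA=(SID=XYZ)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.0.0.3)(PORT=40003)) * establish * XYZ * 0",
]

per_minute = connections_per_minute(sample)
for minute, n in sorted(per_minute.items()):
    print(f"{n:7d} {minute}")
```

To run it against a real log, pass an open file object instead of the sample list, for example `connections_per_minute(open("listener.log"))`.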
18. Continuing the OS kernel mode CPU usage diagnosis
Let’s trace system calls by one CPU heavy Oracle process
# tusc -cp 8021
( Attached to process 8021 ("oracleXYZ (LOCAL=NO)") [64-bit] )
( Detaching from process 8021 ("oracleXYZ (LOCAL=NO)") )

Syscall   Seconds   Calls   Errors
open         1.05       3        3
-----       -----   -----    -----
Total        1.05       3        3

• Interestingly, 3 open() syscalls take over a second in total. Could this be caused by the general CPU starvation in the server?
• The next step should have been to check which file we tried to open (but the spike ended before we could do that)
• Seeing errors isn’t a problem in itself, as this is how Oracle and libs sometimes check for the existence of a file…
19. Continuing the OS kernel mode CPU usage diagnosis
• HP-UX support got back to us and recommended the use of the runki tool for measuring kernel CPU utilization in detail
  • It had to be installed as root
  • It measured a lot of things happening in the kernel, writing a big output file
  • On Solaris there’s a tool called TNF trace, AIX has the trace command for such full kernel tracing
• Now we had to wait for the problem to happen again
  • Didn’t have to wait for too long…
• We sent the raw trace dump to HP Support, so they could run something like “tkprof” on that tracefile
  • It basically just summed up the kernel spinlock wait, spin events by lock, object locked etc
20. kiprof - profiled runki output
Total Hardclock traces: 30239
================================
State    Count  Percent
USER      7130    23.58
SYS      22965    75.94
IDLE        24     0.08
SSYS       120     0.40
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Kernel Functions executed during profile
Count     Pct  State  Function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
15615  51.64%  SYS    timed_preArbitration
 7130  23.58%  USER   OTHER
 1275   4.22%  SYS    spinlock
 1008   3.33%  SYS    wait_for_lock_spinner
  488   1.61%  SYS    vx_dirbread
  464   1.53%  SYS    vx_bc_getblk
  417   1.38%  SYS    preArbitration
  338   1.12%  SYS    vx_dirscan
  291   0.96%  SYS    vx_bc_brelse

Slide annotations:
• That’s basically spinning for locks (adaptive decisions to spin or wait)
• Self-explanatory. We are spinning for a lock
• vx_dirbread: VxFS directory block read?
• vx_dirscan: VxFS directory contents scan !!!
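One way to read such a profile is to group the sampled kernel functions and compare their shares of all samples. The sketch below does that for the rows shown above. The grouping by the `vx_` name prefix is an assumption made here for illustration (these functions appear to belong to VxFS); it is not something the kiprof tool reports.

```python
TOTAL_SAMPLES = 30239  # "Total Hardclock traces" from the profile above

# (samples, kernel function) rows copied from the profile above; USER/OTHER omitted.
SYS_FUNCTIONS = [
    (15615, "timed_preArbitration"),
    (1275, "spinlock"),
    (1008, "wait_for_lock_spinner"),
    (488, "vx_dirbread"),
    (464, "vx_bc_getblk"),
    (417, "preArbitration"),
    (338, "vx_dirscan"),
    (291, "vx_bc_brelse"),
]


def share(samples: int) -> float:
    """Percentage of all hardclock samples."""
    return 100.0 * samples / TOTAL_SAMPLES


vxfs_samples = sum(n for n, name in SYS_FUNCTIONS if name.startswith("vx_"))
other_samples = sum(n for n, _ in SYS_FUNCTIONS) - vxfs_samples

print(f"vx_* functions:              {vxfs_samples:6d} samples, {share(vxfs_samples):.1f}%")
print(f"remaining listed SYS funcs:  {other_samples:6d} samples, {share(other_samples):.1f}%")
```

The directly visible VxFS functions account for only a few percent of samples; most of the kernel time sits in the arbitration and spinlock functions, which is why the next step drills down into which lock is being spun on.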
21. Drilling down to kernel spinlock spinning
The main spinlock experiencing spinning was related to VxFS:

spn%kern  cumpct  seconds  spn%cpu  lock name
    7.24    7.24    71.96     3.77  FS:vxfs:bc_freelist_lock spin
    0.11    7.36     1.14     0.06  FS:vxfs:inode spin for sleep lock
    0.04    7.40     0.44     0.02  FS:vxfs:i_spinspin lock
    0.04    7.44     0.42     0.02  Sleep Queue lock
    0.01    7.45     0.12     0.01  v_count_lock

Oracle processes were the main ones spinning:

spn%spn  cumpct  spnsec  usrsec  kernsec  spn%kern  process name
  99.17   99.17   73.82  880.32   543.70     13.58  oracle
   0.60   99.77    0.45    0.00     7.37      6.11  vxfsd
   0.08   99.85    0.06    1.42    11.52      0.52  tnslsnr
   0.05   99.91    0.04    0.00   402.60      0.01  [IDLE]
   0.03   99.93    0.02    0.00     0.50      4.00  xyz
   0.03   99.96    0.02    0.00     0.43      4.65  sadc
23. Audit file destination
• New audit file name format in 11g…
  • A new file is created for each audit file where the SPID collides with a previous file…
• Every time when creating a new audit file, Oracle has to check whether such a file already exists with suffix _1, then _2, _3, etc.
$ cd /u01/app/oracle/admin/E2SNDB/adump
$ ls -l | head
total 4788
-rw-r----- 1 oracle dba 735 Feb 28 16:06 e2sndb_ora_10028_1.aud
-rw-r----- 1 oracle dba 710 Oct 16 17:58 e2sndb_ora_10082_1.aud
-rw-r----- 1 oracle dba 735 Oct 16 17:58 e2sndb_ora_10082_2.aud
-rw-r----- 1 oracle dba 735 Feb 27 17:53 e2sndb_ora_10095_1.aud
-rw-r----- 1 oracle dba 736 Oct 16 17:58 e2sndb_ora_10120_1.aud
-rw-r----- 1 oracle dba 740 Oct 16 17:58 e2sndb_ora_10125_1.aud
-rw-r----- 1 oracle dba 735 Feb 28 16:07 e2sndb_ora_10158_1.aud
-rw-r----- 1 oracle dba 735 Feb 24 17:44 e2sndb_ora_10206_1.aud
-rw-r----- 1 oracle dba 735 Dec 22 21:28 e2sndb_ora_10482_1.aud
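The suffix probing described above can be modelled in a few lines. This is a sketch of the behaviour as the slide describes it, not Oracle's actual code: the function name is hypothetical, the file name pattern is copied from the listing, and `os.path.exists` stands in for the open() based existence check. It shows that every SPID collision adds one more existence check per new file.

```python
import os
import tempfile
from typing import List, Tuple


def next_audit_file_name(directory: str, prefix: str, spid: int) -> Tuple[str, int]:
    """Return the first free '<prefix>_ora_<spid>_<n>.aud' name and the number of existence checks made."""
    checks = 0
    suffix = 1
    while True:
        name = f"{prefix}_ora_{spid}_{suffix}.aud"
        checks += 1
        if not os.path.exists(os.path.join(directory, name)):
            return name, checks
        suffix += 1


# Demo in a throwaway directory: the same SPID is reused three times.
results: List[Tuple[str, int]] = []
with tempfile.TemporaryDirectory() as adump:
    for _ in range(3):
        name, checks = next_audit_file_name(adump, "e2sndb", 10082)
        open(os.path.join(adump, name), "w").close()
        results.append((name, checks))
        print(f"{name}: {checks} existence check(s)")
```

Each check is a file lookup in the audit directory, so its cost depends on how expensive a lookup is in that directory, which becomes the key question once the directory holds a very large number of files.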
24. Shouldn’t the audit files be created only for SYSDBA and SYSOPER access?
• In theory, yes.
• In practice in our case, no.
• Bug 9744092: EXCESSIVE AMOUNT OF AUD FILES BEING GENERATED IN 11.1
  • Oracle generated a new .aud file for every new database connection!
  • Not just SYSOPER/SYSDBA like normally
• This is all despite having audit_trail = DB
  • Normally the .aud files in audit_file_dest are not related to the regular audit trail, but are for SYSOPER/SYSDBA logon, startup/shutdown auditing
  • …and if AUDIT_SYS_OPERATIONS = TRUE then all commands issued as SYS
25. Bug 9744092: EXCESSIVE AMOUNT OF AUD FILES BEING GENERATED IN 11.1
PROBLEM:
--------
After upgrade to 11.1 the system creates 10 - 16
*aud files per minute in audit_file_dest.
Out of 9528 *.aud files that the customer uploaded,
9124 files recorded ACTION:[3] "102".

DIAGNOSTIC ANALYSIS:
--------------------
The change of behavior (move audit action 102 from
aud$ to audit_file_dest when audit_trail=DB.) is due
to a fix for an unpublished bug 5476184 in 11.1.
It is not an intended feature for 11G.

WORKAROUND:
-----------
Manual delete of audit files
26. Diagnosis
1. Thanks to bug 9744092 and a behavior change in Oracle 11.1, a new audit file was created for each new connection to the DB
  • If a file already existed, Oracle checked if a similar file name with a larger suffix value (_2, _3, _4 etc) existed
2. The audit_file_dest eventually had over 1.5 M files in it!
  • For each logon, multiple file existence checks (open() syscalls) had to be done
3. Checking whether a file existed (open syscall -> directory entry scan) became very slow, and it’s done in kernel mode
  • A spinlock was held during the directory entry scan
  • Other new Oracle processes also wanted to do the directory scan, resulting in spinlock contention and further kernel mode CPU usage
4. When the DB got slow, app servers fired up hundreds of new connections to “make things faster”
  • This all fed back to the problem: even more contention & spinning
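Since the size of the audit directory turned out to be the trigger, a simple count of its `.aud` files is a cheap thing to check early. The sketch below is an illustration only: the function name is hypothetical and the demo runs against a temporary directory standing in for audit_file_dest. It uses `os.scandir`, which iterates over entries without building a sorted list the way `ls` does.

```python
import os
import tempfile


def count_audit_files(directory: str, suffix: str = ".aud") -> int:
    """Count directory entries whose name ends with `suffix`."""
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if entry.name.endswith(suffix))


# Demo against a throwaway directory standing in for audit_file_dest.
with tempfile.TemporaryDirectory() as adump:
    for n in range(5):
        open(os.path.join(adump, f"e2sndb_ora_{10000 + n}_1.aud"), "w").close()
    open(os.path.join(adump, "notes.txt"), "w").close()
    demo_count = count_audit_files(adump)
    print(f"{demo_count} .aud files")
```

On a directory that already holds over a million entries the scan itself takes a while and adds file system load, so it is better run outside the problem window.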
27. Limiting logon storms
Use the Oracle Listener connection rate limiter (11gR1+)
listener.ora:
LISTENER=
(ADDRESS_LIST=
(ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521)(RATE_LIMIT=5))
(ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1522)(RATE_LIMIT=10))
(ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1523))
)
Oracle Documentation: Oracle Net Listener Parameters (listener.ora)
http://download.oracle.com/docs/cd/B28359_01/network.111/b28317/listener.htm
Also, it is possible to limit the logoff storm rate with the _logout_storm_rate parameter (instance-wide)
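The idea behind a connection rate limit can be illustrated with a generic fixed-window limiter. This is a sketch of the general technique only: how the Oracle listener enforces RATE_LIMIT internally is not described in this presentation, and the class and method names below are hypothetical. The clock is injectable so the demo does not need to sleep.

```python
import time
from typing import Callable


class ConnectionRateLimiter:
    """Admit at most `rate_per_sec` new connections per one-second window."""

    def __init__(self, rate_per_sec: int, clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.clock = clock
        self.window_start = clock()
        self.admitted = 0

    def try_admit(self) -> bool:
        """Return True if a new connection may proceed now, False if it must wait."""
        now = self.clock()
        if now - self.window_start >= 1.0:
            # A new one-second window begins: reset the counter.
            self.window_start = now
            self.admitted = 0
        if self.admitted < self.rate:
            self.admitted += 1
            return True
        return False


# Demo with a fake clock: 8 connection attempts arrive at once, limit is 5 per second.
fake_now = [0.0]
limiter = ConnectionRateLimiter(rate_per_sec=5, clock=lambda: fake_now[0])
first_second = [limiter.try_admit() for _ in range(8)]
fake_now[0] = 1.0
next_second = limiter.try_admit()
print(first_second.count(True), "of 8 admitted in the first second;",
      "admitted again after 1s:", next_second)
```

Whatever the mechanism, the effect is the same: a burst of connection requests is spread over time instead of hitting process creation, SGA attach and audit file creation all at once.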
28. Troubleshooting sporadic system performance issues
Right Data !!!
• Right scope: if your problem lasts for seconds, this should be the granularity of your data too
• OS level data, in addition to the database metrics
  • Ideally OS level metrics sampled multiple times per minute
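A fine-grained sampler does not need to be complicated. The sketch below is a minimal, hypothetical example: `sample_metric` takes any function that returns a number (on a real system this might read CPU utilization or the run queue length from an OS tool) and records timestamped samples at a fixed interval. The demo uses canned values so it runs anywhere and finishes instantly.

```python
import time
from typing import Callable, List, Tuple


def sample_metric(
    read_metric: Callable[[], float],
    samples: int,
    interval_sec: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> List[Tuple[float, float]]:
    """Collect `samples` (timestamp, value) pairs, `interval_sec` apart."""
    collected = []
    for i in range(samples):
        collected.append((clock(), read_metric()))
        if i < samples - 1:
            sleep(interval_sec)
    return collected


# Demo with a canned metric instead of a real OS reading.
canned = iter([25.0, 100.0, 100.0, 24.5])
readings = sample_metric(lambda: next(canned), samples=4, interval_sec=0.0)
peak = max(value for _, value in readings)
average = sum(value for _, value in readings) / len(readings)
print(f"peak {peak:.1f}%, average {average:.1f}%")
```

Keeping the individual samples, rather than only their average, is what makes a short spike visible afterwards.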
29. Conclusions
• Logon storms are evil!
  • They will amplify any performance hiccups as they cause extra load just when the resources are scarcest
• Connection pools firing up hundreds of new connections are evil!
  • Know your limits (both max connections and max connect rate / sec)
• Here’s a thought:
  • If you have planned the servers’ capacity to support N-thousand connections anyway (by allowing connection pools to grow that high), why not create this amount of connections right away?
  • This would avoid logon storms during worst times as all connections have already been created!
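The "create all connections right away" idea can be sketched as a fixed-size pool that performs every logon at startup and never grows afterwards. This is an illustration under stated assumptions, not a replacement for a real connection pool: the class name is hypothetical, and `fake_connect` stands in for whatever function opens a database connection.

```python
import queue
from typing import Any, Callable, List, Optional


class PrewarmedPool:
    """Fixed-size pool: every connection is created up front, none on demand."""

    def __init__(self, connect: Callable[[], Any], size: int):
        self._idle: "queue.Queue[Any]" = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())  # all logons happen here, at startup

    def acquire(self, timeout: Optional[float] = None) -> Any:
        """Hand out an idle connection; wait (or raise queue.Empty) if none is free."""
        return self._idle.get(timeout=timeout)

    def release(self, conn: Any) -> None:
        """Return a connection to the pool."""
        self._idle.put(conn)


# Demo with a stand-in connect function that just counts logons.
logons: List[int] = []


def fake_connect() -> str:
    logons.append(len(logons) + 1)
    return f"conn-{len(logons)}"


pool = PrewarmedPool(fake_connect, size=4)
conn = pool.acquire()
pool.release(conn)
print(f"logons at startup: {len(logons)}; first connection handed out: {conn}")
```

When the database slows down, callers of such a pool wait for an idle connection instead of triggering new logons, so the pool cannot feed a logon storm.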