Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Using Machine Learning to Debug Oracle
RAC Issues
Anil Nair
Sr. Principal Product Manager,
Oracle Real Application Clusters (RAC)
Dec 4th, 2018
@RACMasterPM, @OracleRACpm
http://www.linkedin.com/in/anil-nair-01960b6
http://www.slideshare.net/AnilNair27/
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.
Program Agenda
1. Introduction
2. How do we diagnose issues?
3. What’s new?
4. Walk through some common scenarios
5. FAQ
Introduction
Scalability without Application code change(s)
[Chart: SAP certified SD Benchmark results; Users vs. # of Cores across RAC Nodes; series: 2 Nodes, 3 Nodes, 4 Nodes, 5 Nodes]
# of Cores: 4, 8, 32, 48, 64, 80
Users (data labels, in chart order): 2035, 4010, 15520, 22416, 30016, 37040
Active-active instances scale write, read and hybrid workloads
Oracle Real Application Cluster Family of Solutions
• Integrated set of tools that work
cohesively to provide High Availability
and Scalability
• The functionality provided by Oracle RAC
Family of Solutions can be used by
licensed Oracle RAC, Oracle RAC One
Node and Single Instance customers
without any additional charge
How do we diagnose issues?
Anatomy of issue diagnosis
1. Detect
– Monitoring script/User feedback
2. React
– Log in to the system
– Go through the stack
– Pinpoint the issue
– Identify a possible solution
3. Fix
– Implement
– Go back to Step 1
[Diagram: Detect → React → Fix cycle]
Let's walk through a sample problem resolution
There is a quiz at the end, so pay attention!
Node Eviction – Node 1 ocssd.trc [ Part 1 ]
• 2010-08-13 17:00:22.818: [ CSSD][4106599328]clssnmPollingThread: node anair2 (2) at 50% heartbeat fatal, removal in 14.520 seconds
– Network heartbeat missing from node 2 for 15 consecutive seconds
• 2010-08-13 17:00:29.833: [ CSSD][4106599328]clssnmPollingThread: node anair2 (2) at 75% heartbeat fatal, removal in 7.500 seconds
– Network heartbeat is still missing
• 2010-08-13 17:00:37.337: [ CSSD][4106599328]clssnmPollingThread: Removal started for node anair2 (2), flags
– Finally eviction starts
• 2010-08-13 17:00:37.340: [ CSSD][4085619616]clssnmCheckSplit: Node 2, anair2, is alive, DHB (1281744040, 1396854) more than disk timeout of 27000 after the last NHB (1281744011, 1367154)
– Node 2 is still updating the Voting disks
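The clssnmCheckSplit message can be checked by hand. The sketch below assumes that the first number in each DHB/NHB pair is a Unix timestamp in seconds and the second a millisecond tick counter; the trace does not state this, but the values are consistent with it. Under that reading, the last disk heartbeat (DHB) arrived 29 seconds after the last network heartbeat (NHB), which exceeds the 27000 ms disk timeout quoted in the message.

```python
import re

# Message copied from the Node 1 ocssd.trc excerpt.
LINE = ("clssnmCheckSplit: Node 2, anair2, is alive, DHB (1281744040, 1396854) "
        "more than disk timeout of 27000 after the last NHB (1281744011, 1367154)")

PATTERN = re.compile(
    r"DHB \((\d+), (\d+)\) more than disk timeout of (\d+) "
    r"after the last NHB \((\d+), (\d+)\)"
)


def heartbeat_gap(line):
    """Return (gap_seconds, gap_ticks, disk_timeout) for a clssnmCheckSplit line.

    Assumption: the first value of each pair is in seconds and the second
    value is a millisecond tick counter.
    """
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a clssnmCheckSplit DHB/NHB line")
    dhb_sec, dhb_tick, timeout, nhb_sec, nhb_tick = map(int, match.groups())
    return dhb_sec - nhb_sec, dhb_tick - nhb_tick, timeout


gap_seconds, gap_ticks, disk_timeout = heartbeat_gap(LINE)
print(gap_seconds, gap_ticks, gap_ticks > disk_timeout)
```

Running the same function against the matching line in the Node 2 trace lets you compare how each node saw the other at the moment of the split.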
Node Eviction – Node 1 ocssd.trc [ Part 2 ]
• 2010-08-13 17:00:37.340: [ CSSD][4085619616](:CSSNM00007:) clssnmrEvict: Evicting node 2, anair2, from the cluster in incarnation 169934272, node birth incarnation 169934271, death incarnation 169934272, stateflags 0x24000
– Eventually Node 2 eviction process starts
• 2010-08-13 17:01:07.705: [ CSSD][4043389856]clssgmCMReconfig: reconfiguration successful, incarnation 169934272 with 1 nodes, local node number 1, master node number 1
– Concluding with a reconfiguration
Node Eviction – Node 2 ocssd.trc [ Part 1 ]
• 2010-08-13 17:00:26.213: [ CSSD][4073040800]clssnmPollingThread: node anair1 (1) at 50% heartbeat fatal, removal in 14.540 seconds
• 2010-08-13 17:00:40.702: [ CSSD][4073040800]clssnmPollingThread: Removal started for node anair1 (1), flags
– Huh! So who is right?
• 2010-08-13 17:00:40.706: [ CSSD][4052061088]clssnmCheckSplit: Node 1, anair1, is alive, DHB (1281744036, 1243744) more than disk timeout of 27000 after the last NHB (1281744007, 1214144)
– It also detects that Node 1 is still updating the Voting disks
Node Eviction – Node 2 ocssd.trc [ Part 2 ]
• 2010-08-13 17:00:40.707: [ CSSD][4052061088](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, anair2, is smaller than cohort of 1 nodes led by node 1, anair1, based on map type 2
– It correctly detects and aborts the local node to prevent split brain
• 2010-08-13 17:00:40.707: [ CSSD][4052061088]###################################
• 2010-08-13 17:00:40.707: [ CSSD][4052061088]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
• 2010-08-13 17:00:40.707: [ CSSD][4052061088]###################################
– And does the right thing
So what is the conclusion?
Hmmm... I think it is the network.
What does OS Watcher say?
• netstat does not show any issues
# grep -E "zzz|udpInOverflows|ipReasmFails"
– No issues seen in netstat & traceroute
• OSW data itself is missing, possibly due to scheduling issues
– Why is the OSW data missing?
• Just prior to the issue, top reports
top - 13:23:52 up 25 days, 21:08, 1 user, load average: 3.43, 3.01, 3.02
Cpu(s): 16.8%us, 23.2%sy, 0.0%ni, 56.5%id, 3.1%wa, 0.1%hi, 0.3%si,
Mem: 74027752k total, 73689744k used, 338008k free, 1516k
Swap: 16771852k total, 9069988k used, 7701864k free, 25836528k
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1049 root 11 -5 0 0 0 R 55.0 0.0 0:57.42 [kswapd0]
– Oh no! It is swapping
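The "it is swapping" reading can be backed with a little arithmetic on the top output. The sketch below parses only the Mem and Swap lines exactly as shown on the slide (the truncated trailing fields are ignored) and derives the share of free memory and used swap. The function name and the choice of percentages are illustrative, not part of OSWatcher.

```python
import re

# Mem and Swap lines copied from the top output on the slide.
TOP_LINES = [
    "Mem: 74027752k total, 73689744k used, 338008k free, 1516k",
    "Swap: 16771852k total, 9069988k used, 7701864k free, 25836528k",
]


def parse_top_line(line):
    """Return (label, {"total": kB, "used": kB, "free": kB}) for one top line."""
    label, rest = line.split(":", 1)
    fields = {name: int(kb)
              for kb, name in re.findall(r"(\d+)k (total|used|free)", rest)}
    return label, fields


stats = dict(parse_top_line(line) for line in TOP_LINES)
mem_free_pct = round(100.0 * stats["Mem"]["free"] / stats["Mem"]["total"], 2)
swap_used_pct = round(100.0 * stats["Swap"]["used"] / stats["Swap"]["total"], 2)
print(mem_free_pct, swap_used_pct)
```

Less than half a percent of physical memory is free while more than half of the swap space is in use, which fits kswapd0 consuming 55% of a CPU.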
Summary
Delayed and inconclusive diagnosis
• Network heartbeats could be delayed due to lack of resources
– Incorrect sizing
– Configuration issues
• OSW may not capture all the required data
– May not be scheduled
– No correlation between data points
"I think it was either a network or CPU issue"
What’s new?
DBAs need a solution that can
• Scale across deployments
• Collect detailed diagnostic data
• Correlate and smartly analyze data
[Diagram: OS logs, DB logs, GI logs]
Cloud Challenge – Scale
[Diagram: step 1, a large fleet of servers]
Collect diagnostics across the entire stack
[Diagram: step 2, collect detailed diagnostics for the entire stack on every server in the fleet]
Collect, Correlate & Analyze Data
[Diagram: step 3, Operating System logs, DB logs and Grid Infrastructure logs are correlated; symptoms such as GC blocks lost, high CPU and swapping lead to a root cause]
Introducing Oracle Autonomous Health Framework (AHF)
• Integrated next generation tools
running as components - 24/7
• Discovers Potential Issues and takes
Corrective Actions
• Speeds up Issue Diagnosis and
Resolution
• Maintains Database Performance
and Availability
AHF: Handling Scale utilizing Machine Learning
[Diagram: a large fleet of servers]
1. Identify troubled servers
2. Attempt to resolve the issue
3. Collect diagnostic information for RCA
Machine Learning for Automatic Diagnosis
Diagnostic data collected from the OS and the database is analyzed using a Bayesian Belief Network for cause-and-effect analysis to automatically discover potential issues or take corrective actions
[Diagram: Diagnostic Data → AHF real-time analysis (Machine Learning, Pattern Recognition, BN Engines) → Faults, Alarms, Incidents, Root Causes, Corrective Actions]
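To make the cause-and-effect idea concrete, here is a toy example of Bayesian inference over two symptoms. It is not AHF's model: it uses a naive Bayes structure, the simplest kind of Bayesian network, and every cause, symptom and probability below is invented for illustration.

```python
from fractions import Fraction as F

# Hypothetical prior probability of each cause (illustrative values only).
PRIOR = {
    "memory_pressure": F(1, 10),
    "network_fault": F(1, 10),
    "healthy": F(8, 10),
}

# Hypothetical P(symptom observed | cause); symptoms are assumed to be
# conditionally independent given the cause.
LIKELIHOOD = {
    "memory_pressure": {"missed_heartbeat": F(6, 10), "swapping": F(9, 10)},
    "network_fault": {"missed_heartbeat": F(9, 10), "swapping": F(1, 10)},
    "healthy": {"missed_heartbeat": F(1, 100), "swapping": F(5, 100)},
}


def posterior(observed):
    """Return P(cause | observed symptoms) for every cause, via Bayes' rule."""
    joint = {}
    for cause, prior in PRIOR.items():
        probability = prior
        for symptom in observed:
            probability *= LIKELIHOOD[cause][symptom]
        joint[cause] = probability
    total = sum(joint.values())
    return {cause: p / total for cause, p in joint.items()}


result = posterior(["missed_heartbeat", "swapping"])
most_likely = max(result, key=result.get)
print(most_likely, float(result[most_likely]))
```

With these made-up numbers, seeing both a missed heartbeat and swapping shifts most of the probability mass to memory pressure, even though a missed heartbeat on its own points at the network.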
Create Reusable optimized Models for runtime deployment
• Utilize Machine Learning for efficient diagnosis in house
• Reduce the amount of trace files to be read manually
• Create reusable models for different problem scenarios
• Compare problems with the models in customer environments
[Diagram: in the Oracle environment, logs, ASH and metrics feed ML knowledge extraction and model generation under human supervision by subject matter experts; the optimized models are applied in the customer environment, with feedback flowing back]
Components of Autonomous Health Framework (AHF)
Cluster Health Monitor (CHM)
• One master daemon, ologgerd, is responsible for writes
• Every node runs a slave osysmond, which sends detailed OS statistics to the master ologgerd
• Master/slave roles are automatically maintained by Grid Infrastructure during node start/stop
• Listens to CSS, GIPC events and Clusterware Agent notifications
• Reads Oracle Database wait information directly from SGA
[Diagram: osysmond on each node sends OS data to ologgerd (master), which writes to the GIMR (Grid Infrastructure Management Repository)]
Cluster Health Advisor (CHA)
• CHA is the core component that is running on every node
• Detects node and database performance problems by applying ML on the collected diagnostics data
• Provides early-warning alerts and corrective action
• Supports on-site calibration to improve sensitivity
• Standalone interactive GUI and command line tools provided for analysis
[Diagram: ochad takes OS data from CHM and DB data, runs the Node Health Engine and the Database Health Engine, and uses the GIMR]
Hang Manager - Workings
• Resolves cross-layer hangs
– Considers cross-layer hangs between ASM and database instances
• Resolves ASM hangs
– Including Flex ASM
• Resolves deadlocks
[Diagram: Session hung? Detect → Analyze → Evaluate → Hang Resolution]
The Database point of view
What is happening in the Database?
• Database processes can be waiting on
– Allocating space in SGA due to change in workload
– Higher load on LMS*
• Even though there are adequate resources
– CPU and Memory
• Remaster Blocks
• Reduce Brownouts
[Diagram label: Allocate space in SGA]
Oracle RAC Optimization Design Goals
• Monitor
• Reduce Brownouts
• Optimize CPU usage
• Dynamically Adjust
Introducing Database Reliability Framework
• Monitors for problems before service disruption
– E.g. HB for critical processes
• Detects the cause of the problem
• Uses collected data across all nodes to identify the root cause
– E.g. Waits on GRD
• Resolves the problem with minimal disruption
– E.g. Resize internal structures
[Diagram: Resources, classified by Resource Utilization (Normal, Busy, Extremely Busy) and Resource Types (Type 1, Type 2, Type 3)]
Database Reliability Framework in Action
[Diagram: Monitor → Detect → Review → Resolve]
• Increase in the number of resources in the Global Resource Directory (GRD)
• Resulting in higher wait times for GRD
• Several solutions possible
– Is the wait time due to high CPU load?
– Would increasing the number of LMS processes help?
– Would increasing CR slaves help?
– Would increasing internal thresholds help?
Database Reliability Framework Details
• Runs in LMHB Process
– Re-startable
– Non Fatal
• Relies on Metrics and Actions

Action                                        Related metrics
report high cpu oracle instance processes     cpu load, cpu threshold, bg heartbeat, cpu load (global)
report high memory oracle instance processes  memory load, memory threshold
kill instance memory hog                      memory load, memory threshold
enable rm plan                                cpu load, bg slowing (2)
switch process to elevated priority           cpu load, bg slowing (3)
switch process from elevated priority         cpu load, bg heartbeat
shrink or grow resource cache                 library cache pin waits
cap total processes at elevated priority      cpu number
enable drm                                    cpu load, drm checks
disable drm                                   cpu load, drm checks
increase default number of lms                cr blocks congested, current blocks congested

Metric                     Scope
o/s memory                 Node Global
o/s load                   Node Global
bg heartbeat               Global
DRM health check           Global
library cache pin waits    Local
CFIO waits                 Global
gc block lost              Local
gc block busy              Local
Examples and DRF Views
• Busy FG process(es) using CPU
• Potential upcoming memory starvation
• LGWR constrained by CPU
• Too many RT processes
• Insufficient CR slaves
• DLM resource cache incorrectly sized
• Control file IO (CFIO) stall
• v$ views
– v$gcr_metrics - details on all defined metrics
– v$gcr_actions - details on all defined actions
– v$gcr_log - metric/action history summary log
– v$gcr_status - details on latest metric/action status
Cache Fusion Optimizations
• Increase the maximum number of LMSs
– Based on system utilization (DRF)
• Each LMS will spawn a dedicated CR slave
– Threshold of rollback changes
– Threaded CR slave in 18c
• Optimized for multi-core/thread architecture
• Remastering slaves (RMV0..)
– Offloads heavy remastering work to slaves
Walk through some common scenarios
Scenario 1 – Remember our Node eviction issue
You could have reached the same conclusion using CHAG
• CHAG (Cluster Health Advisor Graphical), Doc ID 2340062.1
• CHAG is the GUI to utilize the benefits of AHF
• Can be run on the cluster node
– Not recommended*; install locally on a separate Linux host instead
– Set ORACLE_HOME to the GI home
– Connects to the GIMR using wallets
$ export ORACLE_HOME=/u01/app/12.2.0/grid
$ ./chag
CHAG logging to log/chagout_20873.log
Initializing DB reader
Connect via 'jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=…..
CHAG MDB feed open. Timings: Load JDBC driver: 147.50 ms, Connect to MDB: 1398.85 ms
1st Query with 60 minutes of data (15:00:56..16:00:56): 16546 ms
1st CLOB : (59290 lines, 1.484 MB) parsing time: 644 ms, 10.49 mics/line
Use CHAG on a Remote system
• Remote mode requires the GIMR data to be exported
– Execute the following on the cluster node
– $ chactl export repository -format mdb -start '<start time>' -end '<end time>'
• Copy the MDB to the local node and execute
– chag -f <mdb_file>
• The MDB file includes all the data from all the nodes for post mortem analysis
$ chactl export repository -format mdb -start '2017-12-15 00:00:00' -end '2017-12-20 00:00:00'
successfully dumped the CHA statistics to location
"/u01/app/gridbase/crsdata/anair/trace/chad/cha_dump_20171215_000000_20171215_010000.mdb"
Cluster wide view
Individual node details
Use Expert mode to
Details
Correlated data is highlighted
Sample Problems and Resolution
Problem 2 – Slow I/O
Simulate I/O performance issue
• Start the Database on all instances to simulate physical reads
• Simulate I/O on shared storage
– Loading data (sqlldr, expdp) OR
– Swingbench
• **Behavior may depend on your test setup, HCA, HBA etc.
[Diagram: swingbench running against two instances]
Three ways to get Data in Oracle RAC
Ballpark numbers
• Locally (local cache) → nanoseconds
• Remote (global cache) → microseconds
• From disk
– Flash cache → microseconds
– Disk controller cache → microseconds
– Spinning disk → milliseconds
[Diagram: steps 1 to 4 between the shadow process and LGWR; related waits: gc current block busy, gc buffer busy acquire, gc buffer busy release]
Use AWR to Identify Performance Issues
[AWR report screenshot with callouts]
• Global cache wait events: 40%, significantly higher than expected
• Local sessions waiting for transfer
• Block pinged out; sessions waiting for its return
• Transfer delayed by log flush on other node(s)
• Variance and outliers indicate that I/O to the log file disk group affects performance in the cluster
26+44=73
What does OS Watcher say?
• iostat confirms I/O performance issue
archive/oswiostat/xxxxxxxx_iostat_17.03.31.1000.dat
• Increase in reads/writes
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 0.00 0.00 4.00 0.00 32.00 8.00 0.00 0.25 0.25 0.10
xvdb 0.00 0.00 6.50 2.00 146.00 21.00 19.65 0.00 0.24 0.24 0.20
---
---
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 0.00 0.00 144.00 434.00 53.00 12.00 0.00 0.45 1.13 0.10
xvdb 0.00 0.00 6.50 219.00 414.00 67.00 13.65 0.00 0.33 9.44 0.20
There is more overall I/O, as seen by the increase in the number of writes and reads
* Values may change depending on test env.
** Output has been formatted for presentation
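Comparing two iostat samples by eye is error-prone, so a small helper can compute the change per device. This is a sketch that uses only the two samples shown on the slide; the function name and the choice of columns to compare are illustrative.

```python
# Header and device rows copied from the two iostat samples on the slide.
HEADER = "Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util"
BEFORE = [
    "xvda 0.00 0.00 0.00 4.00 0.00 32.00 8.00 0.00 0.25 0.25 0.10",
    "xvdb 0.00 0.00 6.50 2.00 146.00 21.00 19.65 0.00 0.24 0.24 0.20",
]
AFTER = [
    "xvda 0.00 0.00 0.00 144.00 434.00 53.00 12.00 0.00 0.45 1.13 0.10",
    "xvdb 0.00 0.00 6.50 219.00 414.00 67.00 13.65 0.00 0.33 9.44 0.20",
]


def parse_iostat(header, rows):
    """Map device name -> {column name: value} for one iostat sample."""
    columns = header.split()[1:]  # drop the leading "Device:" label
    table = {}
    for row in rows:
        device, *values = row.split()
        table[device] = dict(zip(columns, map(float, values)))
    return table


before = parse_iostat(HEADER, BEFORE)
after = parse_iostat(HEADER, AFTER)

# Change between the two samples for a few columns of interest.
delta = {
    device: {col: round(after[device][col] - before[device][col], 2)
             for col in ("r/s", "w/s", "svctm")}
    for device in before
}
for device, change in delta.items():
    print(device, change)
```

On these samples the write rate rises by 140 writes/s on xvda and 217 writes/s on xvdb, while the read rate on xvdb is unchanged.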
You could have reached the same conclusion using CHAG
chactl query diagnosis -db
$ chactl query diagnosis -db sales -start "2017-03-31 10:00:50" -end "2017-03-31 10:25:50"
2017-03-31 10:01:10.0 Database sales DB Control File IO Performance (sales_1) [detected]
2017-03-31 10:01:10.0 Database sales DB Control File IO Performance (sales_2) [detected]
2017-03-31 10:01:13.0 Database sales DB CPU Utilization (sales_2) [detected]
2017-03-31 10:01:33.0 Database sales DB Log File Switch (sales_1) [detected]
Consolidates and displays information from all instances
Note that we used the command line option to query the data collected by AHF and find the root cause very rapidly
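When the diagnosis output needs to be fed into other tooling, it can be parsed line by line. The sketch below assumes the line layout seen on the slide (timestamp, target type, target name, problem, optional instance in parentheses, status in brackets); the regular expression is derived from these samples only and is not a documented chactl format.

```python
import re
from collections import defaultdict
from datetime import datetime

# Output lines copied from the slide.
OUTPUT = """\
2017-03-31 10:01:10.0 Database sales DB Control File IO Performance (sales_1) [detected]
2017-03-31 10:01:10.0 Database sales DB Control File IO Performance (sales_2) [detected]
2017-03-31 10:01:13.0 Database sales DB CPU Utilization (sales_2) [detected]
2017-03-31 10:01:33.0 Database sales DB Log File Switch (sales_1) [detected]
"""

# Layout inferred from the sample lines (an assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+)\s+(?P<kind>Database|Host)\s+(?P<target>\S+)\s+"
    r"(?P<problem>.*?)\s*(?:\((?P<instance>\S+)\))?\s*\[(?P<status>\w+)\]$"
)


def parse_diagnosis(text):
    """Return a list of event dicts, one per recognised output line."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            event = match.groupdict()
            event["ts"] = datetime.strptime(event["ts"], "%Y-%m-%d %H:%M:%S.%f")
            events.append(event)
    return events


events = parse_diagnosis(OUTPUT)

# Group affected instances by problem to see which issues are cluster-wide.
by_problem = defaultdict(list)
for event in events:
    by_problem[event["problem"]].append(event["instance"])

for problem, instances in by_problem.items():
    print(problem, instances)
```

Grouping by problem shows at a glance that the control file I/O issue was detected on both instances, while the other two findings are specific to one instance each.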
Sample Problems and Resolution
Problem 3 – High CPU usage
Simulate CPU load
• Ensure Grid Infrastructure is running for at least an hour
• Run some normal load
• Simulate excessive CPU using a CPU stressing program on 3 nodes
– stress -- C program
[Diagram: stress running on three of the nodes]
What does OS Watcher say?
• OSWatcher confirms chactl diagnosis
• mpstat (platform dependent)
zzz ***Fri Mar 31 10:10:29 PST 2017
10:10:29 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10:10:29 all 74.01 0.00 7.95 12.20 0.00 0.13 0.13 0.00 5.10
10:10:29 0 44.89 0.00 8.79 12.09 0.00 0.00 0.00 0.00 31.23
10:10:29 1 56.00 0.00 4.00 18.00 0.00 0.00 0.00 0.00 23.00
• Top also reports the CPU stress program
Tasks: 454 total, 4 running, 450 sleeping, 0 stopped, 0 zombie
Cpu(s): 94.9%us, 4.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20752 racusr 20 0 2498m 45m 39m R 85.9 0.3 752:32.40 stress
System is indeed CPU starved
** Output has been formatted for presentation
chactl query diagnosis -cluster
$ chactl query diagnosis -start "2017-03-31 10:00:50" -end "2017-03-31 10:25:50"
2017-03-31 10:01:25.0 Host rwsxxxxx9 Host Memory Consumption [detected]
2017-03-31 10:01:29.0 Host rwsxxxxx0 Host Memory Consumption [detected]
2017-03-31 10:01:23.0 Host rwsxxxxx1 Host Memory Consumption [detected]
Problem: Host Memory Consumption
Description: CHA detected that more memory than expected is consumed on this server. The memory is not allocated by sessions of this database.
Cause: The Cluster Health Advisor (CHA) detected an increase in memory consumption by other databases or by applications not connected to a database on this node.
Action: Identify the top memory consumers by using the Cluster Health Monitor (CHM).
Note: This is a 4 node cluster, but the memory consumption issue is reported on 3 of the 4 nodes
chactl actions & resolutions in text or html format
Sample Problems and Resolution
Problem 4 – Why did my Database Instance move to a different node?
Common Consolidation Scenario
[Diagram: PDB1 through PDB12 consolidated on Oracle RAC instances Inst1, Inst2, Inst3 and Inst4]
Clusterware Activity Log
• Provides a common view of cluster-wide activities in a coordinated fashion
• Customer readable summary of all actions in a cluster
• Clusterwide information provided from any single node
• Further details are in the trace files
"On Friday, my instances were running on Nodes 1 & 2, but today they are only running on Node 1"
Now you get the idea
Instead of
• Login to each node
• Check GI Alert log
• Check Database Alert log
• Check Listener logs
• Check Root Agent logs for Network changes
• Check Agent logs for Instance/Service changes
Use crsctl query calog from a single node
$ crsctl query calog -aftertime "2017-03-08 15:09:46.522-07:00"
2017-04-12 20:05:04.668000 : Attempting to start 'ora.anair1.vip' on 'anair1' : 14920191617156230/1194/11 :
…..
2017-04-12 20:05:06.559000 : Attempting to start 'ora.LISTENER.lsnr' on 'anair1' : 14920191617156230/1194/16
…
2017-04-12 20:05:32.038000 : Start of 'ora.FRA.dg' on 'anair1' succeeded : 14920191617156230/1194/27 :
2017-04-12 20:05:32.040000 : Attempting to start 'ora.sales.db' on 'anair1' : 14920191617156230/1194/28 :
2017-04-12 20:05:59.415000 : Start of 'ora.sales.db' on 'anair1' succeeded : 14920191617156230/1194/30 :
-- Format of output records is:
DATE & TIME (YYYY-MM-DD HH24:MI:SS[.FF][[+-]HH:MM]): Event text: ACTID
Possibly network issues caused VIP relocation
Use filter clause for focused diagnosis
$ crsctl query calog -filter "actid == 14920191617156230/2449732/1"
2017-08-03 16:30:24.678000 : Attempting to start 'ora.sscdb.db' on 'anair1' : 14920191617156230/2449732/1 :
2017-08-03 16:30:24.698000 : Start of 'ora.sscdb.db' on 'anair1' succeeded : 14920191617156230/2449732/1 :
$ crsctl query calog -filter "actid ~= 14920191617156230"
2017-08-03 16:25:20.658000 : Stop of 'ora.sscdb.test.svc' on 'anair1' succeeded : 14920191617156230/2449007/2 :
-filter: Use ~= or == on actid to find related actions
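The same kind of filtering can be done offline on saved calog output. The sketch below parses records in the "DATE & TIME : Event text : ACTID" layout shown on the slides and mimics the two filters; treating `~=` as a substring match is an assumption made for this illustration, and the function names are hypothetical.

```python
# Records copied from the crsctl query calog output on the slide.
RECORDS = [
    "2017-08-03 16:25:20.658000 : Stop of 'ora.sscdb.test.svc' on 'anair1' succeeded : 14920191617156230/2449007/2 :",
    "2017-08-03 16:30:24.678000 : Attempting to start 'ora.sscdb.db' on 'anair1' : 14920191617156230/2449732/1 :",
    "2017-08-03 16:30:24.698000 : Start of 'ora.sscdb.db' on 'anair1' succeeded : 14920191617156230/2449732/1 :",
]


def parse_record(line):
    """Split one calog record into time, event text and ACTID."""
    body = line.strip().rstrip(":").rstrip()  # drop the trailing " :"
    timestamp, rest = body.split(" : ", 1)
    text, actid = rest.rsplit(" : ", 1)
    return {"time": timestamp, "text": text, "actid": actid}


def filter_actid(records, value, exact=True):
    """exact=True mimics 'actid == value'; exact=False mimics 'actid ~= value'
    (assumed here to behave like a substring match)."""
    parsed = [parse_record(record) for record in records]
    if exact:
        return [p for p in parsed if p["actid"] == value]
    return [p for p in parsed if value in p["actid"]]


same_action = filter_actid(RECORDS, "14920191617156230/2449732/1")
related = filter_actid(RECORDS, "14920191617156230", exact=False)
print(len(same_action), len(related))
```

The exact match returns the two records that belong to a single start action, while the looser match also pulls in the earlier service stop that shares the same ACTID prefix.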
Sample Problems and Resolution
Problem 5 – Why was my Session killed?
Hang Manager interventions reported via ORA-32701
Dump file …/diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
Oracle Database 12c Enterprise Edition Release 12.2.0.0.0 - 64bit Beta
With the Partitioning, Real Application Clusters, OLAP, Advanced Analytics
and Real Application Testing options
Build label: RDBMS_MAIN_LINUX.X64_151013
ORACLE_HOME: …/3775268204/oracle
System name: Linux
Node name: slc05kyr
Release: 2.6.39-400.211.1.el6uek.x86_64
Version: #1 SMP Fri Nov 15 13:39:16 PST 2013
Machine: x86_64
VM name: Xen Version: 3.4 (PVM)
Instance name: hm62
Redo thread mounted by this instance: 2
Oracle process number: 19
Unix process pid: 12656, image: oracle@slc05kyr (DIA0)
*** 2015-10-13T16:47:59.541509+17:00
*** SESSION ID:(96.41299) 2015-10-13T16:47:59.541519+17:00
*** CLIENT ID:() 2015-10-13T16:47:59.541529+17:00
*** SERVICE NAME:(SYS$BACKGROUND) 2015-10-13T16:47:59.541538+17:00
*** MODULE NAME:() 2015-10-13T16:47:59.541547+17:00
*** ACTION NAME:() 2015-10-13T16:47:59.541556+17:00
*** CLIENT DRIVER:() 2015-10-13T16:47:59.541565+17:00
2015-10-13T16:47:59.435039+17:00
Errors in file /oracle/log/diag/rdbms/hm6/hm6/trace/hm6_dia0_12433.trc (incident=7353):
ORA-32701: Possible hangs up to hang ID=1 detected
Incident details in: …/diag/rdbms/hm6/hm6/incident/incdir_7353/hm6_dia0_12433_i7353.trc
2015-10-13T16:47:59.506775+17:00
DIA0 requesting termination of session sid:40 with serial # 43179 (ospid:13031) on instance 2
due to a GLOBAL, HIGH confidence hang with ID=1.
Hang Resolution Reason: Automatic hang resolution was performed to free a
significant number of affected sessions.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=1.
2015-10-13T16:47:59.538673+17:00
Errors in file …/diag/rdbms/hm6/hm62/trace/hm62_dia0_12656.trc (incident=5753):
ORA-32701: Possible hangs up to hang ID=1 detected
Incident details in: …/diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
2015-10-13T16:48:04.222661+17:00
DIA0 terminating blocker (ospid: 13031 sid: 40 ser#: 43179) of hang with ID = 1
requested by master DIA0 process on instance 1
Hang Resolution Reason: Automatic hang resolution was performed to free a
significant number of affected sessions.
by terminating session sid:40 with serial # 43179 (ospid:13031)
[Callouts on the slide repeat the alert log entries above: the ORA-32701 incident, the DIA0 blocker termination (ospid: 13031 sid: 40 ser#: 43179) requested by the master DIA0 process on instance 1, and the hang resolution reason]
Sample Problems and Resolution
Problem 6 – How long did the reconfiguration take?
Reconfiguration Diagnosability
**************** BEGIN DLM RCFG HA STATS ****************
Total dlm rcfg time (inc 6): 3.586 secs (394926177, 394929763)
Begin step .........: 0.005 secs (394926177, 394926182)
Freeze step ........: 0.019 secs (394926182, 394926201)
Sync 1 step ........: 0.002 secs (394926264, 394926266)
Sync 2 step ........: 0.024 secs (394926266, 394926290)
Enqueue cleanup step: 0.002 secs (394926290, 394926292)
Sync pcm1 step .....: 0.004 secs (394926293, 394926297)
……
….
Enqueue dubious step: 0.004 secs (394926432, 394926436)
Sync 5 step ........: 0.000 secs (394926436, 394926436)
Enqueue grant step .: 0.001 secs (394926436, 394926437)
Sync 6 step ........: 0.012 secs (394926437, 394926449)
Fixwrt replay step .: 0.885 secs (394928837, 394929722)
Sync 8 step ........: 0.040 secs (394929722, 394929762)
End step ...........: 0.001 secs (394929762, 394929763)
Number of replayed enqueues sent / received .......: 2246 / 893
Number of replayed fusion locks sent / received ...: 124027 / 0
Number of enqueues mastered before / after rcfg ...: 2058 / 1384
**************** END DLM RCFG HA STATS *****************
Detailed timing breakdown available in LMON trace file
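The step timings in the LMON trace can be cross-checked against the two counters printed in parentheses. The sketch below assumes those counters are start and end ticks in milliseconds, which the trace does not state but which matches every line shown. Only a subset of the steps from the slide is included, and the elided steps stay elided.

```python
import re

# A subset of the DLM RCFG HA STATS lines shown on the slide.
STATS = """\
Total dlm rcfg time (inc 6): 3.586 secs (394926177, 394929763)
Begin step .........: 0.005 secs (394926177, 394926182)
Freeze step ........: 0.019 secs (394926182, 394926201)
Sync 2 step ........: 0.024 secs (394926266, 394926290)
Fixwrt replay step .: 0.885 secs (394928837, 394929722)
Sync 8 step ........: 0.040 secs (394929722, 394929762)
End step ...........: 0.001 secs (394929762, 394929763)
"""

STEP_RE = re.compile(r"^(.*?)\s*\.*:\s*([\d.]+) secs \((\d+), (\d+)\)$")


def parse_steps(text):
    """Return (name, reported_seconds, end_tick - start_tick) per line."""
    steps = []
    for line in text.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            name, secs, start, end = match.groups()
            steps.append((name, float(secs), int(end) - int(start)))
    return steps


steps = parse_steps(STATS)

# Assumption check: reported seconds equal the tick difference in milliseconds.
consistent = all(abs(secs * 1000 - ticks) < 0.5 for _, secs, ticks in steps)

# Skip the first entry (the total) when looking for the slowest listed step.
slowest = max(steps[1:], key=lambda step: step[2])
print(consistent, slowest)
```

Among the steps listed here, the fixwrite replay step is the largest at 885 ms of the 3586 ms total, so that is where a closer look would start.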
Sample Problems and Resolution
Problem 7 – Is Dynamic Resource Management (DRM) helping my workload?
DRM Diagnosability
Dynamic Remastering Statistics DB/Inst: SALES/sales1 Snaps: 393-452
-> Affinity objects - Affinity objects mastered at the begin/end snapshot
-> Read-mostly objects - Read-mostly objects mastered at the begin/end snapshot
                                                         per    Begin      End
Name                                    Total    Remaster Op     Snap     Snap
-------------------------------- ------------ -------------- -------- --------
remaster ops                               24           1.00
remastered objects                         24           1.00
remaster time (s)                         7.4           0.31
freeze time (s)                           1.5           0.06
cleanup time (s)                          2.4           0.10
replay time (s)                           0.3           0.01
fixwrite time (s)                         2.4           0.10
sync time (s)                             0.1           0.01
affinity objects                                         N/A        3       27
read-mostly objects                                      N/A        0        0
read-mostly objects (persistent)                         N/A        0        0
Detailed timing breakdown available in AWR Report
FAQ
Frequently asked Question # 1
Why does MGMT DB need so much space?
GIMR space requirements

Cluster Type                                          Redundancy   MGMT DG (GB)
Domain Services Cluster                               External     188
(2 Node DSC with 4 Member Clusters of 2 Nodes each)   Normal       376
                                                      High         564
                                                      Flex         376
Standalone Cluster                                    External     38
(4 Node Cluster)                                      Normal       76
                                                      High         114
                                                      Flex         76

• The Oracle GI 12c Release 2 feature AHF (Autonomous Health Framework) collects, correlates and stores diagnostics data from the OS and DB in MGMT
• In a DSC, one PDB per member cluster is provisioned to store member cluster diagnostics data
• The data is used by AHF components like Cluster Health Advisor to both prevent and help diagnose issues
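The disk group sizes in the table follow a simple pattern: Normal is twice and High is three times the External figure, which is what two-way and three-way ASM mirroring would require, and Flex matches Normal in this table. The sketch below encodes that observation; the multipliers are read off the table, not taken from an Oracle sizing formula, and the function name is hypothetical.

```python
# External-redundancy sizes (GB) taken from the table.
EXTERNAL_GB = {
    "Domain Services Cluster": 188,
    "Standalone Cluster": 38,
}

# Multipliers inferred from the table (an observation, not an official formula).
MULTIPLIER = {"External": 1, "Normal": 2, "High": 3, "Flex": 2}


def mgmt_dg_gb(cluster_type, redundancy):
    """Return the MGMT disk group size in GB implied by the table's pattern."""
    return EXTERNAL_GB[cluster_type] * MULTIPLIER[redundancy]


for cluster_type in EXTERNAL_GB:
    sizes = {r: mgmt_dg_gb(cluster_type, r) for r in MULTIPLIER}
    print(cluster_type, sizes)
```

The figures apply only to the two configurations on the slide (a 2-node DSC with four 2-node member clusters, and a 4-node standalone cluster); other cluster sizes need their own base value.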
Frequently asked Question # 2
Should I continue to use OSWatcher?
Should I continue to use OSWatcher?
• Hopefully by now, the benefits of AHF are clear
• AHF continues to be enhanced to improve
– Diagnosing a large number of deployments
– Correlating data to speed diagnosis
– Preventing issues in the first place
– Utilizing the latest technologies like Machine Learning
• Customers can choose to use both OSW and AHF
Frequently asked Question # 3
But it is still one more database for me to manage
• opatch automatically patches the MGMT database if required
• Clients of MGMT connect using encrypted credentials
• The MGMT listener is automatically maintained by the clusterware agent

Using Machine Learning to Debug Oracle RAC Issues

  • 1. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | Using Machine Learning to Debug Oracle RAC Issues Anil Nair Sr. Principal Product Manager, Oracle Real Application Clusters (RAC) Dec 4th, 2018 @RACMasterPM, @OracleRACpm http://www.linkedin.com/in/anil-nair-01960b6 http://www.slideshare.net/AnilNair27/
  • 2. Safe Harbor Statement: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  • 3. Program Agenda: 1. Introduction 2. How do we diagnose issues? 3. What’s new? 4. Walk through some common scenarios 5. FAQ
  • 4. Program Agenda: 1. Introduction
  • 5. Scalability without application code change(s). [Chart: SAP certified SD benchmark results, users versus number of cores across RAC nodes, with series for 2, 3, 4 and 5 nodes; plotted values 2035, 4010, 15520, 22416, 30016 and 37040 users; core counts 4, 8, 32, 48, 64 and 80.] Active-active instances scale writes, reads and hybrid workloads.
  • 6. Oracle Real Application Cluster Family of Solutions
      • Integrated set of tools that work cohesively to provide High Availability and Scalability
      • The functionality provided by the Oracle RAC Family of Solutions can be used by licensed Oracle RAC, Oracle RAC One Node and Single Instance customers without any additional charge
  • 7. Program Agenda: 2. How do we diagnose issues?
  • 8. Anatomy of issue diagnosis (cycle: Detect, React, Fix)
      1. Detect: monitoring script or user feedback
      2. React: log in to the system, go through the stack, pinpoint the issue, identify a possible solution
      3. Fix: implement, then go back to Step 1
  • 9. Let's walk through a sample problem resolution. There is a quiz at the end, so pay attention!
  • 10. Node Eviction – Node 1 ocssd.trc [ Part 1 ]
      2010-08-13 17:00:22.818: [ CSSD][4106599328] clssnmPollingThread: node anair2 (2) at 50% heartbeat fatal, removal in 14.520 seconds
          (Network heartbeat missing from node 2 for 15 consecutive seconds)
      2010-08-13 17:00:29.833: [ CSSD][4106599328] clssnmPollingThread: node anair2 (2) at 75% heartbeat fatal, removal in 7.500 seconds
          (Network heartbeat is still missing)
      2010-08-13 17:00:37.337: [ CSSD][4106599328] clssnmPollingThread: Removal started for node anair2 (2), flags
          (Finally eviction starts)
      2010-08-13 17:00:37.340: [ CSSD][4085619616]clssnmCheckSplit: Node 2, anair2, is alive, DHB (1281744040, 1396854) more than disk timeout of 27000 after the last NHB (1281744011, 1367154)
          (Node 2 is still updating the Voting disks)
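The clssnmCheckSplit message on this slide carries the numbers behind the decision, so the gap between the last network heartbeat (NHB) and the latest disk heartbeat (DHB) can be checked by hand. The following is a minimal Python sketch, not an Oracle tool: the regular expression and the function name `heartbeat_gap` are illustrative, and it assumes that the second number in each pair is a millisecond counter (the first numbers differ by 29 seconds, the second by 29700, which is consistent with that reading).

```python
import re

# Sample line copied from the slide; the regex below assumes this exact layout.
CHECK_SPLIT = (
    "2010-08-13 17:00:37.340: [ CSSD][4085619616]clssnmCheckSplit: Node 2, anair2, "
    "is alive, DHB (1281744040, 1396854) more than disk timeout of 27000 "
    "after the last NHB (1281744011, 1367154)"
)

PATTERN = re.compile(
    r"clssnmCheckSplit: Node (?P<num>\d+), (?P<name>\w+), is alive, "
    r"DHB \((?P<dhb_sec>\d+), (?P<dhb_ms>\d+)\) more than disk timeout of "
    r"(?P<timeout>\d+) after the last NHB \((?P<nhb_sec>\d+), (?P<nhb_ms>\d+)\)"
)

def heartbeat_gap(line):
    """Return (node name, ms between last NHB and latest DHB, disk timeout in ms)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a clssnmCheckSplit line in the assumed format")
    # Assumption: the second value of each pair is a millisecond counter.
    gap_ms = int(m["dhb_ms"]) - int(m["nhb_ms"])
    return m["name"], gap_ms, int(m["timeout"])

name, gap_ms, timeout_ms = heartbeat_gap(CHECK_SPLIT)
print(f"{name}: disk heartbeat is {gap_ms} ms newer than the last network "
      f"heartbeat (disk timeout {timeout_ms} ms)")
```

In other words, node 2 kept writing to the voting disks for longer than the disk timeout after its network heartbeat stopped arriving, which is why node 1 treats it as alive but unreachable.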
  • 11. Node Eviction – Node 1 ocssd.trc [ Part 2 ]
      2010-08-13 17:00:37.340: [ CSSD][4085619616](:CSSNM00007:) clssnmrEvict: Evicting node 2, anair2, from the cluster in incarnation 169934272, node birth incarnation 169934271, death incarnation 169934272, stateflags 0x24000
          (Eventually Node 2 eviction process starts)
      2010-08-13 17:01:07.705: [ CSSD][4043389856]clssgmCMReconfig: reconfiguration successful, incarnation 169934272 with 1 nodes, local node number 1, master node number 1
          (Concluding with a reconfiguration)
  • 12. Node Eviction – Node 2 ocssd.trc [ Part 1 ]
      2010-08-13 17:00:26.213: [ CSSD][4073040800] clssnmPollingThread: node anair1 (1) at 50% heartbeat fatal, removal in 14.540 seconds
      2010-08-13 17:00:40.702: [ CSSD][4073040800] clssnmPollingThread: Removal started for node anair1 (1), flags
      2010-08-13 17:00:40.706: [ CSSD][4052061088]clssnmCheckSplit: Node 1, anair1, is alive, DHB (1281744036, 1243744) more than disk timeout of 27000 after the last NHB (1281744007, 1214144)
      Callouts: Huh! So who is right? It also detects that Node 1 is still updating the Voting disks.
  • 13. Node Eviction – Node 2 ocssd.trc [ Part 2 ]
      2010-08-13 17:00:40.707: [ CSSD][4052061088](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, anair2, is smaller than cohort of 1 nodes led by node 1, anair1, based on map type 2
      2010-08-13 17:00:40.707: [ CSSD][4052061088]###################################
      2010-08-13 17:00:40.707: [ CSSD][4052061088]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
      2010-08-13 17:00:40.707: [ CSSD][4052061088]###################################
      Callouts: It correctly detects and aborts the local node to prevent split brain. And does the right thing.
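The clssnmCheckDskInfo message shows two cohorts of one node each, with the cohort led by node 1 surviving. A toy Python sketch of that kind of decision follows. It assumes the rule "larger cohort wins, and on a tie the cohort containing the lowest node number wins", which matches the log above but is a simplification; the function `surviving_cohort` is hypothetical and is not how CSSD is implemented.

```python
def surviving_cohort(cohorts):
    """Pick the cohort that stays up: largest first, lowest node number on a tie.

    cohorts is a list of lists of node numbers. The tie-break rule is an
    assumption that is consistent with the log on the slide.
    """
    return min(cohorts, key=lambda nodes: (-len(nodes), min(nodes)))

# The situation on the slides: node 1 and node 2 each form a cohort of one.
winner = surviving_cohort([[2], [1]])
print("surviving cohort:", winner)
```

Under this rule node 2 aborts itself, which is what its ocssd.trc reports.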
  • 14. So what is the conclusion? Hmmm... I think it is the network.
  • 15. What does OS Watcher say?
      • netstat does not show any issues
          # grep -E "zzz|udpInOverflows|ipReasmFails"
      • OSW data itself is missing, possibly due to scheduling issues
      • Just prior to the issue, top reports:
          top - 13:23:52 up 25 days, 21:08, 1 user, load average: 3.43, 3.01, 3.02
          Cpu(s): 16.8%us, 23.2%sy, 0.0%ni, 56.5%id, 3.1%wa, 0.1%hi, 0.3%si,
          Mem: 74027752k total, 73689744k used, 338008k free, 1516k
          Swap: 16771852k total, 9069988k used, 7701864k free, 25836528k
          PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
          1049 root 11 -5 0 0 0 R 55.0 0.0 0:57.42 [kswapd0]
      Callouts: No issues seen in netstat & traceroute. Oh no! It is swapping. Why is the OSW data missing?
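The "it is swapping" conclusion can be read straight off the Mem and Swap lines of the top output. As a rough illustration, the Python sketch below computes the used fraction of each; the helper name `used_fraction` is made up for this example and the parsing assumes the line layout shown on the slide.

```python
import re

# Lines copied from the top output on the slide (the trailing values are truncated there).
TOP_LINES = [
    "Mem: 74027752k total, 73689744k used, 338008k free, 1516k",
    "Swap: 16771852k total, 9069988k used, 7701864k free, 25836528k",
]

def used_fraction(line):
    """Parse a 'Mem:' or 'Swap:' line from top and return (label, used / total)."""
    label = line.split(":", 1)[0]
    total = int(re.search(r"(\d+)k total", line).group(1))
    used = int(re.search(r"(\d+)k used", line).group(1))
    return label, used / total

usage = dict(used_fraction(line) for line in TOP_LINES)
for label, fraction in usage.items():
    print(f"{label}: {fraction:.1%} used")
```

Physical memory is almost fully used and more than half of the swap space is in use, which fits the kswapd0 process consuming 55% CPU.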
  • 16. Summary: delayed and inconclusive diagnosis ("I think it was either a network or CPU issue")
      • Network heartbeats could be delayed due to lack of resources: incorrect sizing, configuration issues
      • OSW may not capture all the required data: it may not be scheduled, and there is no correlation between data points
  • 17. Program Agenda: 3. What’s new?
  • 18. DBAs need a solution that can: scale across deployments; collect detailed diagnostic data (OS logs, DB logs, GI logs); correlate and smartly analyze data
  • 19. Cloud Challenge – Scale (1) [Diagram: a large fleet of servers]
  • 20. Collect diagnostics across the entire stack (2) [Diagram: a large fleet of servers] Collect detailed diagnostics for the entire stack
  • 21. Collect, Correlate & Analyze Data (3) [Diagram labels: Operating System logs, DB logs, Grid Infrastructure logs; GC blocks lost, High CPU, Swapping; Root Cause]
  • 22. Introducing Oracle Autonomous Health Framework (AHF)
      • Integrated next-generation tools running as components, 24/7
      • Discovers potential issues and takes corrective actions
      • Speeds up issue diagnosis and resolution
      • Maintains database performance and availability
  • 23. AHF: Handling Scale utilizing Machine Learning [Diagram: a large fleet of servers]
      1. Identify troubled servers
      2. Attempt to resolve the issue
      3. Collect diagnostic information for RCA
  • 24. Machine Learning for Automatic Diagnosis. Diagnostic data collected from the OS and the database is analyzed using a Bayesian Belief Network for cause-and-effect analysis to automatically discover potential issues or take corrective actions. [Diagram labels: Diagnostic Data; AHF Real time Analysis; Machine Learning; Pattern Recognition; BN Engines; Faults, Alarms, Incidents, Root Causes, Corrective Actions]
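To make the cause-and-effect idea concrete, here is a toy Python sketch of the simplest possible Bayesian network: one "root cause" node with two symptom nodes that are assumed conditionally independent (a naive Bayes structure). Every probability below is invented for illustration; none of it comes from AHF, and the real models are far richer.

```python
# Hypothetical prior probability of each root cause (illustrative numbers only).
PRIOR = {"network fault": 0.2, "memory pressure": 0.3, "healthy": 0.5}

# Hypothetical P(symptom observed | cause); symptoms assumed conditionally independent.
LIKELIHOOD = {
    "network fault":   {"missed heartbeat": 0.90, "swapping": 0.05},
    "memory pressure": {"missed heartbeat": 0.60, "swapping": 0.90},
    "healthy":         {"missed heartbeat": 0.01, "swapping": 0.02},
}

def posterior(symptoms):
    """Apply Bayes' rule: P(cause | symptoms) is proportional to prior * likelihoods."""
    scores = {}
    for cause, prior in PRIOR.items():
        score = prior
        for symptom in symptoms:
            score *= LIKELIHOOD[cause][symptom]
        scores[cause] = score
    total = sum(scores.values())
    return {cause: score / total for cause, score in scores.items()}

result = posterior(["missed heartbeat", "swapping"])
best = max(result, key=result.get)
print(best, round(result[best], 3))
```

With these made-up numbers, observing both a missed heartbeat and swapping makes memory pressure far more likely than a network fault, which mirrors the node eviction example earlier in the deck.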
  • 25. Create reusable optimized models for runtime deployment
      • Utilize machine learning for efficient in-house diagnosis
      • Reduce the number of trace files to be read manually
      • Create reusable models for different problem scenarios
      • Compare problems with the models in customer environments
      [Diagram labels: Subject Matter Expert, Log, ASH, Metrics, ML Knowledge Extraction, Model Generation, Human Supervision, Application, Optimized Models, Feedback, Oracle Environment, Customer Environment]
  • 26. Components of Autonomous Health Framework (AHF)
  • 27. Cluster Health Monitor (CHM)
      • One master daemon, ologgerd, is responsible for writes
      • Every node runs an osysmond daemon that sends detailed OS statistics to the master ologgerd
      • The master/slave roles are automatically maintained by Grid Infrastructure during node start/stop
      • Listens to CSS, GIPC events and Clusterware Agent notifications
      • Reads Oracle Database wait information directly from the SGA
      [Diagram labels: osysmond (one per node), OS Data, ologgerd (master), GIMR (Grid Infrastructure Management Repository)]
  • 28. Components of Autonomous Health Framework (AHF)
  • 29. Cluster Health Advisor (CHA)
      • CHA is the core component and runs on every node
      • Detects node and database performance problems by applying ML to the collected diagnostic data
      • Provides early-warning alerts and corrective actions
      • Supports on-site calibration to improve sensitivity
      • Standalone interactive GUI and command-line tools are provided for analysis
      [Diagram labels: OS Data, DB Data, CHM, ochad, Node Health Engine, Database Health Engine, GIMR]
  • 30. Components of Autonomous Health Framework (AHF)
  • 31. Hang Manager - Workings
      • Resolves cross-layer hangs (considers cross-layer hangs between ASM and database instances)
      • Resolves ASM hangs, including Flex ASM
      • Resolves deadlocks
      [Diagram labels: Detect, Analyze, Evaluate, Session Hung?, Hang Resolution]
  • 32. The Database point of view: what is happening in the database?
      • Database processes can be waiting on: allocating space in the SGA due to a change in workload; higher load on LMS*
      • Even though there are adequate resources (CPU and memory)
      • Remaster blocks
      • Reduce brownouts
      [Diagram label: Allocate space in SGA]
  • 33. Oracle RAC Optimization Design Goals: Monitor; Reduce Brownouts; Optimize CPU usage; Dynamically Adjust
  • 34. Introducing Database Reliability Framework
      • Monitors for problems before service disruption (e.g. heartbeats for critical processes)
      • Detects the cause of the problem
      • Uses collected data across all nodes to identify the root cause (e.g. waits on the GRD)
      • Resolves the problem with minimal disruption (e.g. resizing internal structures)
      [Diagram labels: Resources; Resource Types: Type 1, Type 2, Type 3; Resource Utilization: Normal, Busy, Extremely Busy]
  • 35. Database Reliability Framework in Action (Monitor, Detect, Review, Resolve)
      • Increase in the number of resources in the Global Resource Directory (GRD)
      • Resulting in higher wait times for the GRD
      • Several solutions are possible:
          – Is the wait time due to high CPU load?
          – Would increasing the number of LMS processes help?
          – Would increasing the number of CR slaves help?
          – Would increasing internal thresholds help?
  • 36. Database Reliability Framework Details
      • Runs in the LMHB process (re-startable, non-fatal)
      • Relies on metrics and actions

      Action: related metrics
          report high cpu oracle instance processes: cpu load, cpu threshold, bg heartbeat, cpu load (global)
          report high memory oracle instance processes: memory load, memory threshold
          kill instance memory hog: memory load, memory threshold
          enable rm plan: cpu load, bg slowing (2)
          switch process to elevated priority: cpu load, bg slowing (3)
          switch process from elevated priority: cpu load, bg heartbeat
          shrink or grow resource cache: library cache pin waits
          cap total processes at elevated priority: cpu number
          enable drm: cpu load, drm checks
          disable drm: cpu load, drm checks
          increase default number of lms: cr blocks congested, current blocks congested

      Metric: scope
          o/s memory: Node, Global
          o/s load: Node, Global
          bg heartbeat: Global
          DRM health check: Global
          library cache pin waits: Local
          CFIO waits: Global
          gc block lost: Local
          gc block busy: Local
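The action-to-metric table on this slide is effectively a lookup structure. The Python sketch below encodes it as a dictionary and inverts it, to answer "which actions depend on a given metric?". The names `ACTION_METRICS` and `actions_for` are illustrative only; the action and metric strings are copied from the slide, and nothing here reflects how the framework stores them internally.

```python
# Action -> related metrics, transcribed from the slide.
ACTION_METRICS = {
    "report high cpu oracle instance processes":
        ["cpu load", "cpu threshold", "bg heartbeat", "cpu load (global)"],
    "report high memory oracle instance processes": ["memory load", "memory threshold"],
    "kill instance memory hog": ["memory load", "memory threshold"],
    "enable rm plan": ["cpu load", "bg slowing (2)"],
    "switch process to elevated priority": ["cpu load", "bg slowing (3)"],
    "switch process from elevated priority": ["cpu load", "bg heartbeat"],
    "shrink or grow resource cache": ["library cache pin waits"],
    "cap total processes at elevated priority": ["cpu number"],
    "enable drm": ["cpu load", "drm checks"],
    "disable drm": ["cpu load", "drm checks"],
    "increase default number of lms":
        ["cr blocks congested", "current blocks congested"],
}

def actions_for(metric):
    """Return, in alphabetical order, every action whose metric list names `metric`."""
    return sorted(a for a, metrics in ACTION_METRICS.items() if metric in metrics)

print(actions_for("drm checks"))
print(actions_for("memory load"))
```

The same metric can feed several actions (for example, "cpu load" appears in six of the eleven rows), which is why the framework needs thresholds and more than one metric per decision.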
  • 37. Examples and DRF Views
      Examples:
      • Busy FG process(es) using CPU
      • Potential upcoming memory starvation
      • LGWR constrained by CPU
      • Too many RT processes
      • Insufficient CR slaves
      • DLM resource cache incorrectly sized
      • Control file IO (CFIO) stall
      v$ views:
      • v$gcr_metrics: details on all defined metrics
      • v$gcr_actions: details on all defined actions
      • v$gcr_log: metric/action history summary log
      • v$gcr_status: details on latest metric/action status
  • 38. Cache Fusion Optimizations
      • Increase the maximum number of LMSs, based on system utilization (DRF)
      • Each LMS will spawn a dedicated CR slave
          – Threshold of rollback changes
          – Threaded CR slave in 18c
      • Optimized for multi-core/multi-thread architecture
      • Remastering slaves (RMV0..): offload heavy remastering work to slaves
  • 39. Program Agenda: 4. Walk through some common scenarios
  • 40. Scenario 1 – Remember our node eviction issue
  • 41. You could have reached the same conclusion using CHAG
      • CHAG (Cluster Health Advisor Graphical), Doc ID 2340062.1
      • CHAG is the GUI to utilize the benefits of AHF
      • Can be run on the cluster node (not recommended*): set ORACLE_HOME to the GI home; CHAG connects to the GIMR using wallets
      • Instead, install locally on a separate Linux host
          $ export ORACLE_HOME=/u01/app/12.2.0/grid
          $ ./chag
          CHAG logging to log/chagout_20873.log
          Initializing DB reader
          Connect via 'jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=…..
          CHAG MDB feed open. Timings:
          Load JDBC driver: 147.50 ms, Connect to MDB: 1398.85 ms
          1st Query with 60 minutes of data (15:00:56..16:00:56): 16546 ms
          1st CLOB : (59290 lines, 1.484 MB) parsing time: 644 ms, 10.49 mics/line
  • 43. Use CHAG on a remote system
      • Remote mode requires the GIMR data to be exported. Execute the following on the cluster node:
          $ chactl export repository -format mdb -start '' -end ''
      • Copy the MDB to the local node and execute:
          chag -f <mdb_file>
      • The MDB file includes all the data from all the nodes for post-mortem analysis
      Example:
          $ chactl export repository -format mdb -start '2017-12-15 00:00:00' -end '2017-12-20 00:00:00'
          successfully dumped the CHA statistics to location "/u01/app/gridbase/crsdata/anair/trace/chad/cha_dump_20171215_000000_20171215_010000.mdb"
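When the export has to be scripted for an incident window, the command line on this slide can be assembled from two timestamps. The Python sketch below only builds and prints the command; it does not run it. The function `export_command` is hypothetical, and the flags are exactly those shown on the slide (`-format mdb -start ... -end ...`) with the timestamp format taken from the example.

```python
import shlex
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # format used in the slide's example

def export_command(start, end):
    """Build the argv for the chactl export shown on the slide."""
    if end <= start:
        raise ValueError("end must be later than start")
    return [
        "chactl", "export", "repository", "-format", "mdb",
        "-start", start.strftime(TIME_FORMAT),
        "-end", end.strftime(TIME_FORMAT),
    ]

argv = export_command(datetime(2017, 12, 15), datetime(2017, 12, 20))
command_line = shlex.join(argv)  # quotes the timestamps because they contain spaces
print(command_line)
```

Keeping the arguments as a list avoids shell-quoting mistakes if the list is later passed to something like `subprocess.run` on the cluster node.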
  • 44. Cluster wide view
  • 45. Individual node details
  • 46. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | Use Expert mode to 46 Details
  • 47. Correlated data is highlighted
  • 48. Sample Problems and Resolution: Problem 2 – Slow I/O
  • 49. Simulate I/O performance issue
      • Start the database on all instances to simulate physical reads
      • Simulate I/O on shared storage: loading data (sqlldr, expdp) OR Swingbench
      • **Behavior may depend on your test setup, HCA, HBA, etc.
      [Diagram labels: swingbench, swingbench]
  • 50. Three ways to get data in Oracle RAC (ballpark numbers)
      • Locally (local cache) → nanoseconds
      • Remote (global cache) → microseconds
      • From disk:
          Flash cache → microseconds
          Disk controller cache → microseconds
          Spinning disk → milliseconds
      [Diagram labels: 1, 2, 3, 4; Shadow Process; LGWR; gc current block busy, gc buffer busy acquire, gc buffer busy release]
  • 51. Use AWR to Identify Performance Issues
      • Global cache wait events: 40%, significantly higher than expected
      • (1) Local sessions waiting for transfer
      • Transfer delayed by log flush on other node(s)
      • (4) Variance and outliers indicate that I/O to the log file disk group affects performance in the cluster
      • 26+44=73
      • Block pinged out; sessions waiting for its return
  • 52. What does OS Watcher say?
      • iostat confirms the I/O performance issue: archive/oswiostat/xxxxxxxx_iostat_17.03.31.1000.dat
      • Increase in reads/writes

      Device:  rrqm/s  wrqm/s   r/s     w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
      xvda       0.00    0.00  0.00    4.00    0.00   32.00      8.00      0.00   0.25   0.25   0.10
      xvdb       0.00    0.00  6.50    2.00  146.00   21.00     19.65      0.00   0.24   0.24   0.20
      ---
      ---
      Device:  rrqm/s  wrqm/s   r/s     w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
      xvda       0.00    0.00  0.00  144.00  434.00   53.00     12.00      0.00   0.45   1.13   0.10
      xvdb       0.00    0.00  6.50  219.00  414.00   67.00     13.65      0.00   0.33   9.44   0.20

      There is more overall I/O, as seen by the increase in the number of writes and reads.
      * Values may change depending on the test environment. ** Output has been formatted for presentation.
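Comparing two iostat samples by eye gets tedious across many devices and files. The Python sketch below parses the two samples from this slide and reports how much a chosen column grew. The helper names `parse` and `growth` are made up for this example, and the column list assumes the extended iostat layout shown above.

```python
COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rsec/s", "wsec/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]

# Device rows copied from the two samples on the slide.
BEFORE = """\
xvda 0.00 0.00 0.00 4.00 0.00 32.00 8.00 0.00 0.25 0.25 0.10
xvdb 0.00 0.00 6.50 2.00 146.00 21.00 19.65 0.00 0.24 0.24 0.20"""
AFTER = """\
xvda 0.00 0.00 0.00 144.00 434.00 53.00 12.00 0.00 0.45 1.13 0.10
xvdb 0.00 0.00 6.50 219.00 414.00 67.00 13.65 0.00 0.33 9.44 0.20"""

def parse(sample):
    """Turn iostat device rows into {device: {column: value}}."""
    rows = {}
    for line in sample.splitlines():
        device, *values = line.split()
        rows[device] = dict(zip(COLUMNS, map(float, values)))
    return rows

def growth(before, after, column):
    """Ratio after/before for one column, skipping devices that were at zero."""
    b, a = parse(before), parse(after)
    return {dev: a[dev][column] / b[dev][column] for dev in b if b[dev][column] > 0}

write_growth = growth(BEFORE, AFTER, "w/s")
for device, ratio in write_growth.items():
    print(f"{device}: writes per second grew {ratio:.1f}x")
```

For these samples the write rate grows by a factor of 36 on xvda and about 110 on xvdb, which is the increase the slide points out.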
  • 53. You could have reached the same conclusion using CHAG
  • 54. chactl query diagnosis -db
          $ chactl query diagnosis -db sales -start "2017-03-31 10:00:50" -end "2017-03-31 10:25:50"
          2017-03-31 10:01:10.0 Database sales DB Control File IO Performance (sales_1) [detected]
          2017-03-31 10:01:10.0 Database sales DB Control File IO Performance (sales_2) [detected]
          2017-03-31 10:01:13.0 Database sales DB CPU Utilization (sales_2) [detected]
          2017-03-31 10:01:33.0 Database sales DB Log File Switch (sales_1) [detected]
      Consolidates and displays information from all instances. Note that we used the command-line option to utilize the AHF-collected data and find the root cause very rapidly.
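Because the diagnosis output is one finding per line, it is easy to summarize by problem. The Python sketch below groups the findings from this slide by problem name and lists the affected instances. The regular expression assumes the line layout shown above, and `problems_by_instance` is an illustrative helper, not part of chactl.

```python
import re
from collections import defaultdict

# Output lines copied from the slide.
OUTPUT = """\
2017-03-31 10:01:10.0 Database sales DB Control File IO Performance (sales_1) [detected]
2017-03-31 10:01:10.0 Database sales DB Control File IO Performance (sales_2) [detected]
2017-03-31 10:01:13.0 Database sales DB CPU Utilization (sales_2) [detected]
2017-03-31 10:01:33.0 Database sales DB Log File Switch (sales_1) [detected]
"""

LINE = re.compile(
    r"^(?P<when>\S+ \S+) Database (?P<db>\S+) (?P<problem>.+) "
    r"\((?P<instance>[^)]+)\) \[(?P<state>\w+)\]$"
)

def problems_by_instance(text):
    """Map each problem name to the list of instances that reported it."""
    found = defaultdict(list)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            found[match["problem"]].append(match["instance"])
    return dict(found)

summary = problems_by_instance(OUTPUT)
for problem, instances in summary.items():
    print(f"{problem}: {', '.join(instances)}")
```

A problem reported by every instance (here, control file I/O performance) points at shared storage rather than at a single node.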
  • 55. Sample Problems and Resolution: Problem 3 – High CPU usage
  • 56. Simulate CPU load
      • Ensure Grid Infrastructure has been running for at least an hour
      • Run some normal load
      • Simulate excessive CPU usage with a CPU-stressing program on 3 nodes (stress, a C program)
      [Diagram labels: stress, stress, stress]
  • 57. What does OS Watcher say?
      • OSWatcher confirms the chactl diagnosis
      • mpstat (platform dependent)
          zzz ***Fri Mar 31 10:10:29 PST 2017
          10:10:29  CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %idle
          10:10:29  all  74.01   0.00  7.95    12.20  0.00   0.13    0.13    0.00   5.10
          10:10:29    0  44.89   0.00  8.79    12.09  0.00   0.00    0.00    0.00  31.23
          10:10:29    1  56.00   0.00  4.00    18.00  0.00   0.00    0.00    0.00  23.00
      • top also reports the CPU stress program
          Tasks: 454 total, 4 running, 450 sleeping, 0 stopped, 0 zombie
          Cpu(s): 94.9%us, 4.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
          PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
          20752 racusr 20 0 2498m 45m 39m R 85.9 0.3 752:32.40 stress
      The system is indeed CPU starved. ** Output has been formatted for presentation.
  • 58. chactl query diagnosis -cluster
          $ chactl query diagnosis -start "2017-03-31 10:00:50" -end "2017-03-31 10:25:50"
          2017-03-31 10:01:25.0 Host rwsxxxxx9 Host Memory Consumption [detected]
          2017-03-31 10:01:29.0 Host rwsxxxxx0 Host Memory Consumption [detected]
          2017-03-31 10:01:23.0 Host rwsxxxxx1 Host Memory Consumption [detected]
          Problem: Host Memory Consumption
          Description: CHA detected that more memory than expected is consumed on this server. The memory is not allocated by sessions of this database.
          Cause: The Cluster Health Advisor (CHA) detected an increase in memory consumption by other databases or by applications not connected to a database on this node.
          Action: Identify the top memory consumers by using the Cluster Health Monitor (CHM).
      Note: This is a 4-node cluster, but the memory consumption issue is reported on 3 of the 4 nodes.
  • 59. chactl actions & resolutions in text or HTML format
  • 60. Sample Problems and Resolution: Problem 4 – Why did my database instance move to a different node?
  • 61. Common Consolidation Scenario [Diagram: PDB1 to PDB12 hosted on an Oracle RAC database with instances Inst1, Inst2, Inst3 and Inst4]
  • 62. Clusterware Activity Log
      • Provides a common view of cluster-wide activities in a coordinated fashion
      • Customer-readable summary of all actions in a cluster
      • Cluster-wide information is provided from any single node
      • Further details are in the trace files
      Callout: On Friday, my instances were running on Nodes 1 & 2, but today they are only running on Node 1.
  • 63. Now you get the idea. Instead of:
      • Logging in to each node
      • Checking the GI alert log
      • Checking the database alert log
      • Checking the listener logs
      • Checking the root agent logs for network changes
      • Checking the agent logs for instance/service changes
  • 64. Use crsctl query calog from a single node
          $ crsctl query calog -aftertime "2017-03-08 15:09:46.522-07:00"
          2017-04-12 20:05:04.668000 : Attempting to start 'ora.anair1.vip' on 'anair1' : 14920191617156230/1194/11 :
          …..
          2017-04-12 20:05:06.559000 : Attempting to start 'ora.LISTENER.lsnr' on 'anair1' : 14920191617156230/1194/16
          …
          2017-04-12 20:05:32.038000 : Start of 'ora.FRA.dg' on 'anair1' succeeded : 14920191617156230/1194/27 :
          2017-04-12 20:05:32.040000 : Attempting to start 'ora.sales.db' on 'anair1' : 14920191617156230/1194/28 :
          2017-04-12 20:05:59.415000 : Start of 'ora.sales.db' on 'anair1' succeeded : 14920191617156230/1194/30 :
          -- Format of output records is: DATE & TIME (YYYY-MM-DD HH24:MI:SS[.FF][[+-]HH:MM]): Event text: ACTID
      Callouts: ACTID. Possibly network issues caused VIP relocation.
  • 65. Use the filter clause for focused diagnosis
          $ crsctl query calog -filter "actid == 14920191617156230/2449732/1"
          2017-08-03 16:30:24.678000 : Attempting to start 'ora.sscdb.db' on 'anair1' : 14920191617156230/2449732/1 :
          2017-08-03 16:30:24.698000 : Start of 'ora.sscdb.db' on 'anair1' succeeded : 14920191617156230/2449732/1 :
          $ crsctl query calog -filter "actid ~= 14920191617156230"
          2017-08-03 16:25:20.658000 : Stop of 'ora.sscdb.test.svc' on 'anair1' succeeded : 14920191617156230/2449007/2 :
      -filter: use ~= or == on actid to find related actions
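Once calog output has been saved to a file, related actions can also be grouped offline by their ACTID. The Python sketch below parses a few records in the "timestamp : event text : ACTID :" layout from these slides and groups them by the leading part of the ACTID. The helper names are hypothetical, the sample records are normalized copies of the slide lines, and treating `~=` as a substring match is an assumption based on the slide's example rather than on documented semantics.

```python
from collections import defaultdict

# Records copied from the slides, normalized to end with ' :'.
CALOG = """\
2017-04-12 20:05:04.668000 : Attempting to start 'ora.anair1.vip' on 'anair1' : 14920191617156230/1194/11 :
2017-04-12 20:05:06.559000 : Attempting to start 'ora.LISTENER.lsnr' on 'anair1' : 14920191617156230/1194/16 :
2017-04-12 20:05:32.038000 : Start of 'ora.FRA.dg' on 'anair1' succeeded : 14920191617156230/1194/27 :
2017-08-03 16:30:24.678000 : Attempting to start 'ora.sscdb.db' on 'anair1' : 14920191617156230/2449732/1 :
"""

def parse_record(line):
    """Split 'TIMESTAMP : event text : ACTID :' into its three fields."""
    fields = [f.strip() for f in line.rstrip(" :").split(" : ")]
    return {"time": fields[0], "event": " : ".join(fields[1:-1]), "actid": fields[-1]}

def related(records, fragment):
    """Emulate the ~= filter as a substring match on ACTID (assumed semantics)."""
    return [r for r in records if fragment in r["actid"]]

records = [parse_record(line) for line in CALOG.splitlines() if line.strip()]

# Group events by the ACTID without its last component.
by_parent = defaultdict(list)
for record in records:
    by_parent[record["actid"].rsplit("/", 1)[0]].append(record["event"])

for parent, events in by_parent.items():
    print(parent, len(events), "event(s)")
```

Events that share an ACTID prefix belong to the same chain of actions, which is what makes a VIP relocation and the restarts that follow it easy to read as one story.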
  • 66. Sample Problems and Resolution: Problem 5 – Why was my session killed?
  • 67. Hang Manager interventions reported via ORA-32701
      Incident trace file:
          Dump file …/diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
          Oracle Database 12c Enterprise Edition Release 12.2.0.0.0 - 64bit Beta
          With the Partitioning, Real Application Clusters, OLAP, Advanced Analytics and Real Application Testing options
          Build label: RDBMS_MAIN_LINUX.X64_151013
          ORACLE_HOME: …/3775268204/oracle
          System name: Linux
          Node name: slc05kyr
          Release: 2.6.39-400.211.1.el6uek.x86_64
          Version: #1 SMP Fri Nov 15 13:39:16 PST 2013
          Machine: x86_64
          VM name: Xen Version: 3.4 (PVM)
          Instance name: hm62
          Redo thread mounted by this instance: 2
          Oracle process number: 19
          Unix process pid: 12656, image: oracle@slc05kyr (DIA0)
          *** 2015-10-13T16:47:59.541509+17:00
          *** SESSION ID:(96.41299) 2015-10-13T16:47:59.541519+17:00
          *** CLIENT ID:() 2015-10-13T16:47:59.541529+17:00
          *** SERVICE NAME:(SYS$BACKGROUND) 2015-10-13T16:47:59.541538+17:00
          *** MODULE NAME:() 2015-10-13T16:47:59.541547+17:00
          *** ACTION NAME:() 2015-10-13T16:47:59.541556+17:00
          *** CLIENT DRIVER:() 2015-10-13T16:47:59.541565+17:00
      Alert log:
          2015-10-13T16:47:59.435039+17:00
          Errors in file /oracle/log/diag/rdbms/hm6/hm6/trace/hm6_dia0_12433.trc (incident=7353):
          ORA-32701: Possible hangs up to hang ID=1 detected
          Incident details in: …/diag/rdbms/hm6/hm6/incident/incdir_7353/hm6_dia0_12433_i7353.trc
          2015-10-13T16:47:59.506775+17:00
          DIA0 requesting termination of session sid:40 with serial # 43179 (ospid:13031) on instance 2 due to a GLOBAL, HIGH confidence hang with ID=1.
          Hang Resolution Reason: Automatic hang resolution was performed to free a significant number of affected sessions.
          DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=1.
          2015-10-13T16:47:59.538673+17:00
          Errors in file …/diag/rdbms/hm6/hm62/trace/hm62_dia0_12656.trc (incident=5753):
          ORA-32701: Possible hangs up to hang ID=1 detected
          Incident details in: …/diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
          2015-10-13T16:48:04.222661+17:00
          DIA0 terminating blocker (ospid: 13031 sid: 40 ser#: 43179) of hang with ID = 1 requested by master DIA0 process on instance 1
          Hang Resolution Reason: Automatic hang resolution was performed to free a significant number of affected sessions.
      Callouts (highlighted alert log excerpts): the hang was resolved by terminating session sid:40 with serial # 43179 (ospid:13031). ORA-32701: Possible hangs up to hang ID=1 detected; incident details in …/../hm62_dia0_12656_i5753.trc. DIA0 terminating blocker (ospid: 13031 sid: 40 ser#: 43179) requested by master DIA0 process on instance 1. Hang Resolution Reason: Automatic hang resolution was performed to free a significant number of affected sessions.
  • 68. Sample Problems and Resolution: Problem 6 – How long did the reconfiguration take?
  • 69. Reconfiguration Diagnosability (detailed timing breakdown available in the LMON trace file)
          **************** BEGIN DLM RCFG HA STATS ****************
          Total dlm rcfg time (inc 6): 3.586 secs (394926177, 394929763)
          Begin step .........: 0.005 secs (394926177, 394926182)
          Freeze step ........: 0.019 secs (394926182, 394926201)
          Sync 1 step ........: 0.002 secs (394926264, 394926266)
          Sync 2 step ........: 0.024 secs (394926266, 394926290)
          Enqueue cleanup step: 0.002 secs (394926290, 394926292)
          Sync pcm1 step .....: 0.004 secs (394926293, 394926297)
          ……
          ….
          Enqueue dubious step: 0.004 secs (394926432, 394926436)
          Sync 5 step ........: 0.000 secs (394926436, 394926436)
          Enqueue grant step .: 0.001 secs (394926436, 394926437)
          Sync 6 step ........: 0.012 secs (394926437, 394926449)
          Fixwrt replay step .: 0.885 secs (394928837, 394929722)
          Sync 8 step ........: 0.040 secs (394929722, 394929762)
          End step ...........: 0.001 secs (394929762, 394929763)
          Number of replayed enqueues sent / received .......: 2246 / 893
          Number of replayed fusion locks sent / received ...: 124027 / 0
          Number of enqueues mastered before / after rcfg ...: 2058 / 1384
          **************** END DLM RCFG HA STATS *****************
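The per-step timings in the DLM RCFG HA STATS block lend themselves to a quick script that finds the slowest step. The Python sketch below parses the step lines shown on this slide (the steps elided on the slide are naturally missing) and also checks that each duration equals the difference of its two counters divided by 1000. The regular expression and variable names are illustrative, and reading the counters as milliseconds is an inference from the numbers, not a documented fact.

```python
import re

# Step lines copied from the slide; elided steps are not available.
STATS = """\
Total dlm rcfg time (inc 6): 3.586 secs (394926177, 394929763)
Begin step .........: 0.005 secs (394926177, 394926182)
Freeze step ........: 0.019 secs (394926182, 394926201)
Sync 1 step ........: 0.002 secs (394926264, 394926266)
Sync 2 step ........: 0.024 secs (394926266, 394926290)
Enqueue cleanup step: 0.002 secs (394926290, 394926292)
Sync pcm1 step .....: 0.004 secs (394926293, 394926297)
Enqueue dubious step: 0.004 secs (394926432, 394926436)
Sync 5 step ........: 0.000 secs (394926436, 394926436)
Enqueue grant step .: 0.001 secs (394926436, 394926437)
Sync 6 step ........: 0.012 secs (394926437, 394926449)
Fixwrt replay step .: 0.885 secs (394928837, 394929722)
Sync 8 step ........: 0.040 secs (394929722, 394929762)
End step ...........: 0.001 secs (394929762, 394929763)
"""

STEP = re.compile(
    r"^(?P<name>.+? step)[ .]*: (?P<secs>\d+\.\d+) secs "
    r"\((?P<start>\d+), (?P<end>\d+)\)$",
    re.MULTILINE,
)

steps = [
    (m["name"], float(m["secs"]), int(m["start"]), int(m["end"]))
    for m in STEP.finditer(STATS)
]

# Each duration should equal (end - start) / 1000 if the counters are milliseconds.
consistent = all(abs((end - start) / 1000 - secs) < 1e-9 for _, secs, start, end in steps)
slowest = max(steps, key=lambda step: step[1])
listed_total = sum(secs for _, secs, _, _ in steps)

print(f"slowest step: {slowest[0]} ({slowest[1]:.3f} s)")
print(f"steps shown account for {listed_total:.3f} s of the 3.586 s total")
```

On this sample the fixwrite replay step dominates the steps that are shown, and roughly 2.6 seconds of the total fall in the steps that the slide elides.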
  • 70. Sample Problems and Resolution: Problem 7 – Is Dynamic Resource Management (DRM) helping my workload?
  • 71. DRM Diagnosability (detailed timing breakdown available in the AWR report)
          Dynamic Remastering Statistics    DB/Inst: SALES/sales1    Snaps: 393-452
          -> Affinity objects - Affinity objects mastered at the begin/end snapshot
          -> Read-mostly objects - Read-mostly objects mastered at the begin/end snapshot

          Name                               Total   per Remaster Op   Begin Snap   End Snap
          --------------------------------   -----   ---------------   ----------   --------
          remaster ops                          24              1.00
          remastered objects                    24              1.00
          remaster time (s)                    7.4              0.31
          freeze time (s)                      1.5              0.06
          cleanup time (s)                     2.4              0.10
          replay time (s)                      0.3              0.01
          fixwrite time (s)                    2.4              0.10
          sync time (s)                        0.1              0.01
          affinity objects                                       N/A            3         27
          read-mostly objects                                    N/A            0          0
          read-mostly objects (persistent)                       N/A            0          0
  • 72. Program Agenda: 5. FAQ
  • 73. Frequently asked Question # 1: Why does MGMT DB need so much space?
  • 74. GIMR space requirements

          Cluster Type                                                                   Redundancy   MGMT DG (GB)
          Domain Services Cluster (2 Node DSC with 4 Member Clusters of 2 Nodes each)    External     188
                                                                                         Normal       376
                                                                                         High         564
                                                                                         Flex         376
          Standalone Cluster (4 Node Cluster)                                            External     38
                                                                                         Normal       76
                                                                                         High         114
                                                                                         Flex         76

      • The Oracle GI 12c Release 2 feature AHF (Autonomous Health Framework) collects, correlates and stores diagnostic data from the OS and DB in MGMT
      • In a DSC, one PDB per member cluster is provisioned to store member cluster diagnostic data
      • The data is used by AHF components such as Cluster Health Advisor to both prevent and help diagnose issues
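The sizes in this table follow a simple pattern: the Normal and Flex figures are twice the External figure and the High figure is three times it. The Python sketch below reproduces the table from the two External values. The multipliers are inferred from the slide's numbers and are an assumption, not an official sizing formula; the names `BASE_GB`, `MIRROR_FACTOR` and `mgmt_dg_gb` are illustrative.

```python
# External-redundancy sizes (GB) taken from the slide.
BASE_GB = {
    "Domain Services Cluster": 188,   # 2-node DSC with 4 member clusters of 2 nodes each
    "Standalone Cluster": 38,         # 4-node cluster
}

# Multipliers inferred from the table (assumption, not an official formula).
MIRROR_FACTOR = {"External": 1, "Normal": 2, "High": 3, "Flex": 2}

def mgmt_dg_gb(cluster_type, redundancy):
    """Estimated MGMT disk group size in GB for the configurations on the slide."""
    return BASE_GB[cluster_type] * MIRROR_FACTOR[redundancy]

for cluster_type in BASE_GB:
    sizes = {r: mgmt_dg_gb(cluster_type, r) for r in MIRROR_FACTOR}
    print(cluster_type, sizes)
```

This also makes the DSC figure easier to explain: most of the space goes to the per-member-cluster PDBs, and mirroring then multiplies it.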
  • 75. Frequently asked Question # 2: Should I continue to use OSWatcher?
  • 76. Should I continue to use OSWatcher?
      • Hopefully by now, the benefits of AHF are clear
      • AHF continues to be enhanced to improve:
          – Diagnosing large numbers of deployments
          – Correlating data to speed up diagnosis
          – Preventing issues in the first place
          – Utilizing the latest technologies such as machine learning
      • Customers can choose to use both OSW and AHF
  • 77. Frequently asked Question # 3: But it is still one more database for me to manage
      • opatch automatically patches the MGMT database if required
      • Clients of MGMT connect using encrypted credentials
      • The MGMT listener is automatically maintained by the clusterware agent