Exadata Cell metrics

Testing - Exadata Cell metrics

Published in: Education, Technology

Transcript

  • 1. Exadata Cell metrics

    Exadata CELLSRV periodically records important runtime properties, called metrics, for cell components such as CPUs, cell disks, grid disks, flash cache, and IORM statistics. These metrics are recorded in memory. Based on its own metric collection schedule, the Management Server (MS) gets the set of metric data accumulated by CELLSRV.

    MS provides Exadata cell management and configuration functions. It is responsible for sending alerts, and it collects some statistics in addition to those collected by CELLSRV. Each cell is individually managed with the Exadata cell command-line interface (CellCLI).

    Locate the MS process
    ---------------------
    $ ps -ef | grep ms.err
    1000      3940  3723  0 01:42 pts/0    00:00:00 grep ms.err
    root     24541 24540  0 Sep28 ?        00:01:32 /usr/java/jdk1.5.0_15/bin/java -Xms256m -Xmx512m -Djava.library.path=/opt/oracle/

    Check the Alert History
    -----------------------
    MS triggers an alert when it discovers a:
      • Cell hardware issue
      • Cell software or configuration issue
      • CELLSRV internal error
      • Metric that has exceeded a threshold defined in the cell

    CellCLI> list alerthistory
    1    2013-09-26T22:51:15-04:00  critical  "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
    2_1  2013-09-26T22:52:07-04:00  warning   "Hugepage allocation failure in service cellsrv. Number of Hugepages allocated is 0, failed to allocate 110"
    3    2013-09-26T22:54:08-04:00  critical  "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
    4    2013-09-28T13:05:21-04:00  critical  "RS-7445 [Serv RS_BACKUP is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
    5    2013-09-28T22:05:38-04:00  critical  "RS-7445 [Serv CELLSRV is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"

    Create and check for disk I/O errors
    ------------------------------------
    CellCLI> create threshold CD_IO_ERRS_MIN comparison='>', warning=0, occurrences=1, observation=1
  • 2. Threshold CD_IO_ERRS_MIN successfully created

    CellCLI> list threshold CD_IO_ERRS_MIN detail
    name:         CD_IO_ERRS_MIN
    comparison:   >
    observation:  1
    occurrences:  1
    warning:      0.0

    CellCLI> list alerthistory where severity='warning';
    2_1  2013-09-26T23:02:12-04:00  warning   "Hugepage allocation failure in service cellsrv. Number of Hugepages allocated is 0, failed to allocate 110"

    CellCLI> list alerthistory where severity='critical';
    1    2013-09-26T23:01:18-04:00  critical  "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
    3    2013-09-26T23:04:11-04:00  critical  "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
    4    2013-10-01T06:42:39-04:00  critical  "RS-7445 [Serv CELLSRV is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"

    CellCLI> list alerthistory where severity='clear';
    CellCLI> list alerthistory where severity='info';

    MetricType:
      • cumulative: Cumulative statistics since the metric was created
      • instantaneous: Value at the time that the metric is collected
      • rate: Rates computed by averaging statistics over observation periods
      • transition: Collected at the time when the value of the metric has changed; typically captures important transitions in hardware status

    CellCLI> list metriccurrent attributes name,metrictype,metricobjectname,metricvalue,collectionTime where metrictype='Rate'

    Monitoring Exadata with Active Requests
    ---------------------------------------
    CellCLI> LIST ACTIVEREQUEST WHERE IoType = 'predicate pushing' DETAIL

    ioType identifies the type of active request. Possible values are file initialization, read, write, predicate pushing, filtered backup read, and predicate push read.

    Check retention period for metric and alert history
    ---------------------------------------------------
    CellCLI> list cell attributes metricHistoryDays
    7

    CellCLI> alter cell metrichistorydays=5
  • 3. Cell qr03cel02 successfully altered

    CellCLI> list cell attributes metrichistorydays
    5

    CellCLI> list cell attributes name,interconnectCount
    qr03cel02  2

    Configure the cell to automatically send email and/or SNMP messages to a designated set of Exadata administrators
    ------------------------------------------------------------------------------------------------------------------
    CellCLI> alter cell smtpServer='my_mail.example.com', -
                        smtpFromAddr='monowar.mukul@example.com', -
                        smtpFrom='monowar mukul', -
                        smtpToAddr='jane.smith@example.com', -
                        notificationPolicy='critical,warning,clear', -
                        notificationMethod='mail'

    Watching for Undelivered Alerts
    -------------------------------
    It is important to check the storage servers periodically to make sure that raised alerts have actually been delivered (via email and/or to Grid or Cloud Control).

    CellCLI> LIST ALERTHISTORY where notificationState != 1 and examinedBy=''

    dcli -g cell_group cellcli -e "LIST ALERTHISTORY where notificationState != 1 and examinedBy='' "
    1    2013-09-26T23:01:18-04:00  critical  "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
    2_1  2013-09-26T23:02:12-04:00  warning   "Hugepage allocation failure in service cellsrv. Number of Hugepages allocated is 0, failed to allocate 110"
    3    2013-09-26T23:04:11-04:00  critical  "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
    4    2013-10-01T06:42:39-04:00  critical  "RS-7445 [Serv CELLSRV is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"

    Drop Alert History
    ------------------
    CellCLI> drop alerthistory all
    Alert 1 successfully dropped
    Alert 2_1 successfully dropped
    Alert 3 successfully dropped

    Checking Threshold
    ------------------
    CellCLI> list threshold
    cl_fsut./
  • 4. cl_fsut./u01

    CellCLI> create threshold cl_fsut."/u01" comparison='>', warning=80
    Threshold cl_fsut."/u01" successfully created

    CellCLI> list threshold detail
    name:        cl_fsut./
    comparison:  >
    warning:     70.0

    name:        cl_fsut./u01
    comparison:  >
    warning:     80.0

    CellCLI> alter threshold cl_fsut."/" comparison='>', warning=50
    Threshold cl_fsut."/" successfully altered

    CellCLI> list threshold detail
    name:        cl_fsut./
    comparison:  >
    warning:     50.0

    name:        cl_fsut./u01
    comparison:  >
    warning:     80.0

    Execute the following command inside the cell operating system. It creates a 512-MB file on the root file system, which increases the utilization metric. After the metric crosses the threshold, an alert is generated.

    [celladmin@qr03cel02 ~]$ dd if=/dev/zero of=/tmp/file.out bs=1024 count=500000
    500000+0 records in
    500000+0 records out
    512000000 bytes (512 MB) copied, 4.25551 seconds, 120 MB/s

    [celladmin@qr03cel02 ~]$ cellcli
    CellCLI: Release 11.2.3.1.0 - Production on Mon Sep 30 01:36:45 EDT 2013
    Copyright (c) 2007, 2011, Oracle. All rights reserved.
    Cell Efficiency Ratio: 26M

    CellCLI> list alerthistory
    1_1  2013-09-30T01:32:46-04:00  warning  "The warning threshold for the following metric has been crossed.
         Metric Name : CL_FSUT
         Metric Description : Percentage of total space on this file system that is currently used
         Object Name : /
         Current Value : 56.0 %
         Threshold Value : 50.0 % "

    CellCLI> alter alerthistory 1_1 examinedby='investigator'
    Alert 1_1 successfully altered

    CellCLI> list alerthistory detail
    name:               1_1
  • 5. alertMessage:       "The warning threshold for the following metric has been crossed.
                        Metric Name : CL_FSUT
                        Metric Description : Percentage of total space on this file system that is currently used
                        Object Name : /
                        Current Value : 56.0 %
                        Threshold Value : 50.0 % "
    alertSequenceID:    1
    alertShortName:     CL_FSUT
    alertType:          Stateful
    beginTime:          2013-09-30T01:32:46-04:00
    endTime:
    examinedBy:         investigator
    metricObjectName:   "/"
    metricValue:        56.0
    notificationState:  0
    sequenceBeginTime:  2013-09-30T01:32:46-04:00
    severity:           warning
    alertAction:        "Examine the metric value that is violating the specified threshold, and take appropriate actions if needed."

    The value of the name attribute is a composite of abbreviations. The prefix identifies the object type:
      • CL_ (cell)
      • CD_ (cell disk)
      • GD_ (grid disk)
      • FC_ (flash cache)
      • DB_ (database)
      • CG_ (consumer group)
      • CT_ (category)
      • N_ (interconnect network)

    Monitoring IORM with CellCLI commands
    -------------------------------------
    I/O-related metrics:
      • IO_RQ (number of requests)
      • IO_BY (number of MB)
      • IO_TM (I/O latency)
      • IO_WT (I/O wait time)

    Qualifiers that follow the I/O metric:
      • _R for read
      • _W for write
      • _SM for small I/O
      • _LG for large I/O
      • _SEC for per second
      • _RQ for per request

    Examples:
      • CD_IO_WT_R_SM is the cell disk (CD_) I/O wait time (IO_WT) to read (_R) small blocks (_SM).
      • GD_IO_RQ_W_LG_SEC is the number of requests (IO_RQ) per second (_SEC) to write (_W) large blocks (_LG) on a grid disk (GD_).
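
The naming convention described above can be sketched as a small lookup in Python. This is only an illustrative helper, not part of CellCLI or any Oracle tool: the function name `decode_metric_name` and the three dictionaries are hypothetical, and their contents are copied from the abbreviation lists in this transcript. Tokens that are not in those lists (for example FSUT in CL_FSUT) are reported as unrecognized rather than guessed.

```python
# Hypothetical helper: decodes an Exadata metric name using ONLY the
# abbreviations listed in the transcript above. Not an Oracle API.

OBJECT_PREFIXES = {
    "CL": "cell",
    "CD": "cell disk",
    "GD": "grid disk",
    "FC": "flash cache",
    "DB": "database",
    "CG": "consumer group",
    "CT": "category",
    "N": "interconnect network",
}

IO_MEASURES = {
    "IO_RQ": "number of requests",
    "IO_BY": "number of MB",
    "IO_TM": "I/O latency",
    "IO_WT": "I/O wait time",
}

QUALIFIERS = {
    "R": "read",
    "W": "write",
    "SM": "small I/O",
    "LG": "large I/O",
    "SEC": "per second",
    "RQ": "per request",
}


def decode_metric_name(name):
    """Return a list of plain-English parts for a metric name."""
    tokens = name.upper().split("_")
    prefix, rest = tokens[0], tokens[1:]
    parts = [OBJECT_PREFIXES.get(prefix, f"unrecognized {prefix!r}")]

    # The I/O measure spans two tokens (for example IO + WT), so it is
    # matched before the single-token qualifiers.
    measure = "_".join(rest[:2])
    if measure in IO_MEASURES:
        parts.append(IO_MEASURES[measure])
        rest = rest[2:]

    for token in rest:
        parts.append(QUALIFIERS.get(token, f"unrecognized {token!r}"))
    return parts


if __name__ == "__main__":
    for metric in ("CD_IO_WT_R_SM", "GD_IO_RQ_W_LG_SEC", "CL_FSUT"):
        print(metric, "->", ", ".join(decode_metric_name(metric)))
```

Matching the two-token I/O measure first keeps IO_RQ (number of requests) distinct from a trailing _RQ (per request), which would otherwise be ambiguous.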