A Deep Dive into ASM Redundancy in Exadata

Exadata Database Machine provides a solid storage redundancy infrastructure using ASM. Physical disks on multiple storage cell servers are logically partitioned, grouped and managed centrally by ASM. Exadata uses ASM under its own rules: grid disks, the ASM background processes, failgroups and redundancy options all differ from non-Exadata systems. This storage configuration can seem complicated to Exadata Database Machine administrators. It is important to be able to answer the following questions, which are the topics of this presentation:
To what degree disk and cell failures are tolerated;
How to tell whether ASM is able to rebuild redundancy after disk or cell failures;
What happens when multiple disks fail at the same time, and whether it matters which disks failed;
What to pay attention to in terms of redundancy during administrative tasks such as rolling restarts of cell servers, resizing diskgroups, etc.

Published in: Technology


  1. A Deep Dive into ASM Redundancy in Exadata. Emre Baransel, Advanced Support Engineer, Employee ACE, Oracle. Agenda: 1 – Overview, 2 – Failure, 3 – Second Failure, 4 – Usable Space, 5 – ASMCMD "lsdg" Output
  2. Storage Servers Notation. We'll consider 3 storage servers in the examples: Storage Server 1, Storage Server 2, Storage Server 3.
  3. Disks on Storage Servers. Diagram labels: Storage Server 1, Storage Server 2, Storage Server 3; disks numbered 1 to 12.
  4. Physical Disks. Diagram labels: Storage Server 1, Storage Server 2, Storage Server 3; physical disks.
  5. Logical Partitions/Diskgroups. Diagram labels: Storage Server 1, Storage Server 2, Storage Server 3; system partitions, DBFS DG, RECO DG, DATA DG.
  6. Grid Disks (Partitions). Diagram labels: Storage Server 1, Storage Server 2, Storage Server 3; grid/ASM disks; system partitions, DBFS DG, RECO DG, DATA DG.
  7. Disks Usage Notation. Diagram labels: Storage Server 1, Storage Server 2, Storage Server 3; system partitions, DBFS DG, RECO DG, DATA DG.
  8. Normal Redundancy Diskgroups. Diagram labels: Failgroup 1, Failgroup 2, Failgroup 3; normal redundancy.
  9. High Redundancy Diskgroups. Diagram labels: Failgroup 1, Failgroup 2, Failgroup 3; high redundancy.
  10. Types of Failures.
      - Disk failure
        - transient disk failure
        - physical disk failure
      - Storage server failure
      This presentation examines failures in groups, in order to provide clarity. There may be exceptional cases.
  11. Transient Disk Failures. Diagram: a transient failure takes a disk offline on one storage server (labels: Storage Server 1, Storage Server 2, Storage Server 3; system partitions, DBFS DG, RECO DG, DATA DG).
  12. Transient Disk Failures. The failure is corrected, or the disk is replaced with a new disk, before DISK_REPAIR_TIME is exceeded.
  13. Transient Disk Failures. The disk is resynced with ASM fast mirror resync.
  14. Transient Disk Failures. If DISK_REPAIR_TIME is exceeded, ASM drops the disks and rebalances the data, if there is enough space.
  15. DISK_REPAIR_TIME Attribute.
      - DISK_REPAIR_TIME is a diskgroup attribute.
      - Default is 3.6 hours.
      - Example: alter diskgroup data set attribute 'disk_repair_time' = '4.5h';
      - Altering the DISK_REPAIR_TIME attribute has no effect on disks that are already offline.
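The repair window can be reasoned about with a few lines of Python. This is only a sketch: the helper names `parse_repair_time` and `will_be_dropped` are hypothetical, the value is assumed to be a number followed by `m` (minutes) or `h` (hours), and a bare number is assumed to mean hours.

```python
from datetime import timedelta


def parse_repair_time(value: str) -> timedelta:
    """Convert a DISK_REPAIR_TIME-style string such as '3.6h' or '30m'."""
    text = value.strip().lower()
    if text.endswith("m"):
        return timedelta(minutes=float(text[:-1]))
    if text.endswith("h"):
        return timedelta(hours=float(text[:-1]))
    # Assumption: a value without a unit is read as hours.
    return timedelta(hours=float(text))


def will_be_dropped(offline_for: timedelta, repair_time: str = "3.6h") -> bool:
    """True once a transiently failed disk has been offline longer than the window."""
    return offline_for > parse_repair_time(repair_time)


# The default of 3.6 hours is 216 minutes.
print(parse_repair_time("3.6h"))
# A disk offline for 4 hours is past the default window but inside a 4.5h window.
print(will_be_dropped(timedelta(hours=4)))
print(will_be_dropped(timedelta(hours=4), "4.5h"))
```

Because changing the attribute does not affect disks that are already offline, a check like this is only meaningful with the value that was in force when the disk went offline.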
  16. Physical Disk Failures. Diagram: a physical disk fails on one storage server (labels: Storage Server 1, Storage Server 2, Storage Server 3; system partitions, DBFS DG, RECO DG, DATA DG).
  17. Physical Disk Failures. ASM doesn't wait for DISK_REPAIR_TIME; it drops the disk and rebalances the data, if there is enough space (Pro-Active Disk Quarantine - 11.2.1.3.1).
  18. Physical Disk Failures. When the disk is replaced, the grid disks are created and a second rebalance starts automatically.
  19. Auto Disk Management. The auto disk management feature in Exadata:
      - Exadata Automation Manager (XDMG) initiates automation tasks and monitors all configured storage cells for state changes.
      - Exadata Automation Worker (XDWK) performs automation tasks requested by XDMG.
      - _AUTO_MANAGE_EXADATA_DISKS controls the auto disk management feature. To disable the feature, set this parameter to FALSE. Range of values: TRUE (default) or FALSE.
      - _AUTO_MANAGE_NUM_TRIES controls the maximum number of attempts to perform an automatic operation. Range of values: 1-10. Default value is 2.
      - _AUTO_MANAGE_MAX_ONLINE_TRIES controls the maximum number of attempts to ONLINE a disk. Range of values: 1-10. Default value is 3.
      - NOTE:1484274.1 - Auto disk management feature in Exadata
  20. Storage Server Failures. Diagram: one of the three storage servers is marked FAILED (labels: Storage Server 1, Storage Server 2, Storage Server 3).
  21. Storage Server Failures.
      - When a storage server fails, the whole failgroup fails in ASM.
      - ASM does not drop the disks before DISK_REPAIR_TIME is exceeded.
      - The same applies when rebooting the storage server.
  22. Storage Server Failures. If the server is back before DISK_REPAIR_TIME is exceeded, the disks are synced and there is no rebalance.
  23. Storage Server Failures. If DISK_REPAIR_TIME is exceeded, ASM rebalances the data, if there is enough space.
  24. Storage Server Failures. When the storage server comes back, there is a second rebalance.
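The three failure types in this section differ mainly in whether ASM waits for DISK_REPAIR_TIME. The following Python sketch condenses the slides into one decision function; the names `Failure` and `first_failure_outcome` are hypothetical, and the wording of the not-enough-space case is an assumption, since the slides only describe what happens when there is enough space.

```python
from enum import Enum


class Failure(Enum):
    TRANSIENT_DISK = "transient disk failure"
    PHYSICAL_DISK = "physical disk failure"
    STORAGE_SERVER = "storage server failure"


def first_failure_outcome(kind, offline_hours, repair_hours=3.6, enough_space=True):
    """Summarize how ASM reacts to a first failure, following the slides."""
    # A physical disk failure is dropped at once; transient disk failures and
    # storage server failures wait for the DISK_REPAIR_TIME window.
    waits_for_repair_time = kind is not Failure.PHYSICAL_DISK
    if waits_for_repair_time and offline_hours <= repair_hours:
        return "keep disks offline; resync on return, no rebalance"
    if enough_space:
        return "drop disks and rebalance"
    # Assumption: the slides do not spell out this case.
    return "drop disks; not enough space to rebalance"


print(first_failure_outcome(Failure.STORAGE_SERVER, offline_hours=1))
print(first_failure_outcome(Failure.STORAGE_SERVER, offline_hours=5))
print(first_failure_outcome(Failure.PHYSICAL_DISK, offline_hours=0))
```

In the drop-and-rebalance cases, a second rebalance follows once the replaced disk or the returning storage server is added back.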
  25. Second Failure / Bad Chance. In normal redundancy, what happens at a second failure depends first on when it occurs:
      - If it occurs after the rebalance/sync is completed, the procedure is the same as for the first failure.
      - If it occurs before the rebalance/sync is completed, the outcome depends on which disk failed:
        - If the first and second failed disks are not partner disks, a new rebalance takes place, if there is enough space.
        - If the first and second failed disks are partner disks, data loss occurs.
      Notes:
      - This is a small possibility but needs consideration.
      - Partner disks are on different storage servers (failgroups).
      - The first incident doesn't have to be a failure; a storage server reboot has the same effect.
      - Exadata Database Machine : How to identify cell failgroups and Partner disks for a grid disk (Doc ID 1431697.1)
  26. Second Failure / Bad Chance. In high redundancy, there are three copies of each extent, so a second failure never causes data loss.
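The second-failure rules for normal and high redundancy can be written as a small decision function. This is a sketch of the logic stated on the slides, not of any ASM interface; the function name and its parameters are hypothetical.

```python
def second_failure_outcome(redundancy, first_recovery_done, partner_disks, enough_space=True):
    """Outcome of a second failure, following the rules on the slides.

    redundancy          -- "normal" or "high"
    first_recovery_done -- True if the rebalance/sync after the first failure completed
    partner_disks       -- True if the two failed disks are partner disks
    """
    if redundancy == "high":
        # Three copies of each extent: a second failure does not lose data.
        return "no data loss"
    if redundancy != "normal":
        raise ValueError("redundancy must be 'normal' or 'high'")
    if first_recovery_done:
        return "handled like a first failure"
    if partner_disks:
        return "data loss"
    return "new rebalance" if enough_space else "no rebalance: not enough space"


print(second_failure_outcome("normal", first_recovery_done=False, partner_disks=True))
print(second_failure_outcome("normal", first_recovery_done=False, partner_disks=False))
print(second_failure_outcome("high", first_recovery_done=False, partner_disks=True))
```

The dangerous combination is the one in the middle of the function: normal redundancy, recovery from the first incident still running, and a partner disk failing. A rolling cell restart counts as a first incident here.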
  27. A New Feature. The "MOUNT RESTRICTED FORCE FOR RECOVERY" feature:
      - Available in >= 11.2.0.4 BP16 and >= 12.1.0.2 BP4.
      - Applicable to NORMAL redundancy diskgroups only.
      - Potential use cases that this procedure is applicable to:
        1. Exadata cell rolling upgrade/patching and a partner disk failure at the same time.
        2. Transient disk failure in a cell followed by a permanent partner disk failure before the first failed disk comes back online.
      - NOTE:1968642.1 - Recover from diskgroup failure using the 12.1.0.2 "mount restricted force for recovery" feature - An Example
  28. Example Procedure. "MOUNT RESTRICTED FORCE FOR RECOVERY" example:
      - Cell 1: CellCLI> Alter cell shutdown services all;
      - Cell 2: alter physicaldisk <disk> simulate failureType=failed;
      - The database crashes.
      - SQL> alter diskgroup datac1 mount restricted force for recovery;
      - CellCLI> Alter cell start services all;
      - SQL> alter diskgroup datac1 online disks in failgroup CELLFG1;
      - Wait until the MODE_STATUS column in v$asm_disk for the disks being onlined changes from SYNCING to ONLINE.
      - Do NOT execute the subsequent steps if the MODE_STATUS column shows SYNCING. It will lead to data corruption.
      - During the resync, because of the second disk failure, ARB0 cannot read some of the required extents (which are on the failed second disk) and therefore marks those missing extents with BADFDA7A. (arb0 trace file => WARNING: group 1, file 258, extent 100: filling extent with BADFDA7A during recovery)
      - SQL> alter diskgroup datac1 dismount;
      - SQL> alter diskgroup datac1 mount;
      - Start the database and perform RMAN block media recovery.
  29. What kind of Usable Space? In an Exadata ASM diskgroup, we can distinguish the following disk spaces:
      - Total Raw Size (TRS)
      - Used Raw Size (URS)
      - Free Raw Size (FRS)
      - Total Allocatable Size (TAS) = TRS / Redundancy Factor
      - Used Allocatable Size (UAS) = URS / Redundancy Factor
      - Free Allocatable Size (FAS) = FRS / Redundancy Factor
      - Size Needed for Disk Failure Coverage (SNDFC) = largest disk (or 2 disks for high redundancy)
      - Size Needed for Cell Failure Coverage (SNCFC) = largest cell (or 2 cells for high redundancy)
      - Total Disk Failure Safe Allocatable Size = (TRS - SNDFC) / Redundancy Factor
      - Total Cell Failure Safe Allocatable Size = (TRS - SNCFC) / Redundancy Factor
      - Free Disk Failure Safe Allocatable Size = (FRS - SNDFC) / Redundancy Factor
      - Free Cell Failure Safe Allocatable Size = (FRS - SNCFC) / Redundancy Factor
  30. Calculations for Normal Redundancy.

      | Measure | Normal Redundancy |
      | --- | --- |
      | Total Raw Size (TRS) | 360 |
      | Used Raw Size (URS) | 120 |
      | Free Raw Size (FRS) | 240 |
      | Total Allocatable Size (TAS) | TRS / 2 = 180 |
      | Used Allocatable Size (UAS) | URS / 2 = 60 |
      | Free Allocatable Size (FAS) | FRS / 2 = 120 |
      | Size Needed for Disk Failure Coverage (SNDFC) | 10 |
      | Size Needed for Cell Failure Coverage (SNCFC) | 120 |
      | Total Disk Failure Safe Allocatable Size | (TRS - SNDFC) / 2 = 175 |
      | Total Cell Failure Safe Allocatable Size | (TRS - SNCFC) / 2 = 120 |
      | Free Disk Failure Safe Allocatable Size | (FRS - SNDFC) / 2 = 115 |
      | Free Cell Failure Safe Allocatable Size | (FRS - SNCFC) / 2 = 60 |

  31. Calculations for High Redundancy.

      | Measure | Normal Redundancy | High Redundancy |
      | --- | --- | --- |
      | Total Raw Size (TRS) | 360 | 360 |
      | Used Raw Size (URS) | 120 | 120 |
      | Free Raw Size (FRS) | 240 | 240 |
      | Total Allocatable Size (TAS) | TRS / 2 = 180 | TRS / 3 = 120 |
      | Used Allocatable Size (UAS) | URS / 2 = 60 | URS / 3 = 40 |
      | Free Allocatable Size (FAS) | FRS / 2 = 120 | FRS / 3 = 80 |
      | Size Needed for Disk Failure Coverage (SNDFC) | 10 | 20 |
      | Size Needed for Cell Failure Coverage (SNCFC) | 120 | 240 |
      | Total Disk Failure Safe Allocatable Size | (TRS - SNDFC) / 2 = 175 | (TRS - SNDFC) / 3 = 113.3 |
      | Total Cell Failure Safe Allocatable Size | (TRS - SNCFC) / 2 = 120 | N/A for Quarter Rack |
      | Free Disk Failure Safe Allocatable Size | (FRS - SNDFC) / 2 = 115 | (FRS - SNDFC) / 3 = 73.3 |
      | Free Cell Failure Safe Allocatable Size | (FRS - SNCFC) / 2 = 60 | N/A for Quarter Rack |
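The usable-space formulas can be checked with a short Python sketch that reproduces the figures from the two calculation slides. The function `usable_space` and its parameter names are hypothetical. It assumes equally sized disks and cells (so "2 disks" is twice the largest disk), and it assumes that a cell-failure-safe size only exists when the cells remaining after the covered failures still number at least the redundancy factor, which is one way to arrive at the "N/A for Quarter Rack" entries.

```python
def usable_space(total_raw, used_raw, largest_disk, largest_cell, cell_count, redundancy):
    """Apply the slide formulas; sizes are in whatever unit the inputs use."""
    factor = {"normal": 2, "high": 3}[redundancy]
    failures = factor - 1              # failures to cover: 1 for normal, 2 for high
    free_raw = total_raw - used_raw
    sndfc = largest_disk * failures    # Size Needed for Disk Failure Coverage
    sncfc = largest_cell * failures    # Size Needed for Cell Failure Coverage
    # Assumption: cell-failure-safe sizes need enough surviving cells (failgroups).
    cell_safe_possible = cell_count - failures >= factor

    def safe(raw, reserve, possible=True):
        return (raw - reserve) / factor if possible else None

    return {
        "total_allocatable": total_raw / factor,
        "used_allocatable": used_raw / factor,
        "free_allocatable": free_raw / factor,
        "sndfc": sndfc,
        "sncfc": sncfc,
        "total_disk_safe": safe(total_raw, sndfc),
        "total_cell_safe": safe(total_raw, sncfc, cell_safe_possible),
        "free_disk_safe": safe(free_raw, sndfc),
        "free_cell_safe": safe(free_raw, sncfc, cell_safe_possible),
    }


# Figures from the slides: 3 cells, raw size 360, 120 used, disk size 10, cell size 120.
for level in ("normal", "high"):
    print(level, usable_space(360, 120, 10, 120, cell_count=3, redundancy=level))
```

For normal redundancy this gives 175, 120, 115 and 60 for the four "safe" sizes; for high redundancy the disk-safe sizes come out at about 113.3 and 73.3, and the cell-safe sizes are `None`.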
  32. What we have in ASMCMD.

      ASMCMD> lsdg
      State    Type    Rebal  Sector  Block  AU       Total_MB  Free_MB   Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
      MOUNTED  NORMAL  N      512     4096   4194304  27942912  16708892  9314304          3697294         0              N             DATAC1/
      MOUNTED  NORMAL  N      512     4096   4194304  1038240   1036984   346080           345452          0              Y             DBFS_DG/
      MOUNTED  NORMAL  N      512     4096   4194304  11973312  7966060   3991104          1987478         0              N             RECOC1/

      - Total_MB = Total Raw Size (TRS)
      - Free_MB = Free Raw Size (FRS)
      - Req_mir_free_MB:
        - ≥11.2.0.4.9 & ≥12.1.0.2: Size Needed for Disk Failure Coverage (SNDFC)
        - <11.2.0.4.9 & <12.1.0.2: Size Needed for Cell Failure Coverage (SNCFC)
      - Usable_file_MB:
        - ≥11.2.0.4.9 & ≥12.1.0.2: Free Disk Failure Safe Allocatable Size
        - <11.2.0.4.9 & <12.1.0.2: Free Cell Failure Safe Allocatable Size
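The relationship between the lsdg columns can be verified against the sample output with a small Python sketch. It is an illustration, not a tool shipped with ASM: the parsing helper `parse_lsdg` is hypothetical and assumes whitespace-separated columns with no embedded spaces, and the data is the sample from the slide.

```python
LSDG_OUTPUT = """\
State Type Rebal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name
MOUNTED NORMAL N 512 4096 4194304 27942912 16708892 9314304 3697294 0 N DATAC1/
MOUNTED NORMAL N 512 4096 4194304 1038240 1036984 346080 345452 0 Y DBFS_DG/
MOUNTED NORMAL N 512 4096 4194304 11973312 7966060 3991104 1987478 0 N RECOC1/
"""

REDUNDANCY_FACTOR = {"NORMAL": 2, "HIGH": 3}


def parse_lsdg(text):
    """Turn lsdg output into a list of dicts keyed by column name."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def expected_usable_file_mb(row):
    """(Free_MB - Req_mir_free_MB) / redundancy factor, as in the safe-size formulas."""
    factor = REDUNDANCY_FACTOR[row["Type"]]
    return (int(row["Free_MB"]) - int(row["Req_mir_free_MB"])) // factor


for row in parse_lsdg(LSDG_OUTPUT):
    matches = expected_usable_file_mb(row) == int(row["Usable_file_MB"])
    # Share of the raw size that is reserved for failure coverage.
    reserved_share = int(row["Req_mir_free_MB"]) / int(row["Total_MB"])
    print(row["Name"], matches, round(reserved_share, 3))
```

For all three diskgroups in the sample the formula reproduces Usable_file_MB exactly, and Req_mir_free_MB is one third of Total_MB. With three equally sized cells that is the size of one cell, which suggests the sample was taken on a version that reports cell failure coverage.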
  33. References.
      - Oracle Exadata Database Machine Maintenance Guide
      - Automatic Storage Management Administrator's Guide
      - NOTE:1484274.1 - Auto disk management feature in Exadata
      - NOTE:443835.1 - ASM Fast Mirror Resync - Example To Simulate Transient Disk Failure And Restore Disk
      - NOTE:1431697.1 - Exadata Database Machine : How to identify cell failgroups and Partner disks for a grid disk
      - NOTE:1968642.1 - Recover from diskgroup failure using the 12.1.0.2 "mount restricted force for recovery" feature - An Example
      - NOTE:1386147.1 - How to Replace a Hard Drive in an Exadata Storage Server (Hard Failure)
      - NOTE:1339373.1 - Operational Steps for Recovery after Losing a Disk Group in an Exadata Environment
      - NOTE:1551288.1 - Understanding ASM Capacity and Reservation of Free Space in Exadata
      - NOTE:1319567.1 - ASM Usable Space Calculations in Exadata Environment along with cell failure considerations
