B35 Inside rac by Julian Dyke
Transcript

  • 1. Inside RAC. Julian Dyke, Independent Consultant. DB Tech Showcase - Osaka, May 2013. © 2013 Julian Dyke, juliandyke.com
  • 2. Agenda
      • OPS versus RAC
      • Buffer Cache
      • Global Cache Services
  • 3. RAC Overview
      [Diagram: Node 1 to Node 4, each running one instance (Instance 1 to Instance 4); labels: Public Network, Private Network (Interconnect), Storage Network, Shared Storage]
  • 4. OPS versus RAC: OPS - Oracle 8.0.6 and below
      [Diagram: Instance 1 on Node 1 and Instance 2 on Node 2, with the Interconnect and Shared Storage; arrows labelled Current Writes, Consistent Reads, Current Reads]
      • All I/O uses shared storage
      • Enqueues only use interconnect
  • 5. OPS versus RAC: OPS - Oracle 8.1.5 to Oracle 8.1.7 - Cache Fusion Phase 1
      [Diagram: Instance 1 on Node 1 and Instance 2 on Node 2, with the Interconnect and Shared Storage; arrows labelled Current Writes, Consistent Reads, Current Reads]
      • Current I/O always uses shared storage
      • Consistent reads can use interconnect
  • 6. OPS versus RAC: RAC - Oracle 9.0.1 and above - Cache Fusion Phase 2
      [Diagram: Instance 1 on Node 1 and Instance 2 on Node 2, with the Interconnect and Shared Storage; arrows labelled Current Writes, Consistent Reads, Current Reads]
      • Current I/O and consistent reads can use interconnect
  • 7. Buffer Cache Single Block Reads
      [Animated diagram: list of buffers running from Head of Hot End to Head of Cold End; each buffer shows its Block Number and Touch Count]
      • Read Block 42
          • Get first available buffer from cold end
          • Update buffer contents
          • Insert buffer at head of cold end
      • Read Block 11
          • Get first available buffer from cold end
          • Update buffer contents
          • Insert buffer at head of cold end
      • Read Block 42
          • Update touch count for block 42
      • Read Block 33
          • Move block 71 to head of hot end
          • Set touch count on block 71 to zero
          • Get first available buffer from cold end
          • Update buffer contents
          • Insert buffer at head of cold end
      • Read Block 34
          • Update touch count for block 34
      • Read Block 87
          • Move block 42 to head of hot end
          • Set touch count on block 42 to zero
          • Get first available buffer from cold end
          • Update buffer contents
          • Insert buffer at head of cold end
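The steps on the slide above can be sketched as a small simulation. This is a simplified model, not Oracle's implementation: the class name `TouchCountCache`, the `hot_threshold` of 2 (chosen because the slide promotes blocks whose touch count is 2), and the rule that the hot list keeps a fixed size by demoting its tail to the head of the cold list are all assumptions made for illustration.

```python
from collections import deque


class TouchCountCache:
    """Toy model of a buffer list with a hot end and a cold end."""

    def __init__(self, cold_blocks, hot_blocks=(), hot_threshold=2):
        # Index 0 of each deque is the "head"; victims are taken from the
        # right-hand end of the cold deque.
        self.hot = deque(hot_blocks)
        self.cold = deque(cold_blocks)
        self.touch = {b: 1 for b in list(self.hot) + list(self.cold)}
        self.hot_size = len(self.hot)       # assumption: fixed hot-list size
        self.hot_threshold = hot_threshold  # assumption: promote at count >= 2

    def read(self, block):
        if block in self.touch:
            # Block already cached: only the touch count changes.
            self.touch[block] += 1
            return "hit"
        while True:
            victim = self.cold.pop()
            if self.touch[victim] >= self.hot_threshold:
                # Frequently touched buffer: move to head of hot end, reset count.
                self.touch[victim] = 0
                self.hot.appendleft(victim)
                if len(self.hot) > self.hot_size:
                    self.cold.appendleft(self.hot.pop())
                continue
            # First available buffer found: reuse it for the new block.
            del self.touch[victim]
            break
        self.cold.appendleft(block)  # insert at head of cold end
        self.touch[block] = 1
        return "miss"


if __name__ == "__main__":
    cache = TouchCountCache(cold_blocks=[52, 71, 66], hot_blocks=[92, 34])
    for b in (42, 42, 71, 11):
        print(b, cache.read(b))
    print("hot:", list(cache.hot), "cold:", list(cache.cold))
```

Touch counts are only inspected when a buffer reaches the cold end, which is why a re-read block is not moved immediately in this model.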
  • 8. Buffer Cache Multi Block Reads (DB_FILE_MULTIBLOCK_READ_COUNT = 4)
      [Animated diagram: list of buffers running from Head of Hot End to Head of Cold End, holding blocks 1 to 8]
      • Read Block 1
          • Get first four available buffers from cold end
          • Read next four blocks into buffers
          • Insert buffers at head of cold end
          • Move block 1 to cold end
      • Read Block 2
          • Move block 2 to cold end
      • Read Block 3
          • Move block 3 to cold end
      • Read Block 4
          • Move block 4 to cold end
      • Read Block 5
          • Get next four available buffers from cold end
          • Read next four blocks into buffers
          • Insert buffers at head of cold end
          • Move block 5 to cold end
      • Read Block 6
          • Move block 6 to cold end
      • Read Block 7
          • Move block 7 to cold end
      • Read Block 8
          • Move block 8 to cold end
  • 9. Global Services Overview
      • Resource
          • Object to which access must be controlled at instance level
      • Enqueue
          • Memory structure that serializes access to a resource
      • Global Resources
          • Object to which access must be controlled at cluster level
      • Global Enqueue
          • Locks and enqueues which need to be consistent between all instances
  • 10. Global Services Overview
      • Global Resource Directory (GRD)
          • Records current state and owner of each resource
          • Contains convert and write queues
          • Distributed across all instances in cluster
          • Maintained by GCS and GES
      • Global Cache Services (GCS)
          • Implements cache coherency for database
          • Coordinates access to database blocks for instances
      • Global Enqueue Services (GES)
          • Controls access to other resources (locks) including library cache and dictionary cache
          • Performs deadlock detection
  • 11. Global Cache Services Introduction
      • Global Cache Services exist to implement Cache Fusion
      • Cache Fusion allows blocks to be updated by multiple instances
      • Only one instance can have the updatable (current) version of a block
      • GCS must ensure that only one instance can update a block at any time
      • Many instances can have read-only (consistent read) versions of a block
      • Instances can have multiple copies of same block at different SCNs
  • 12. Global Cache Services 2-way Current Read
      [Diagram: Instance 1 to Instance 4 and the Resource Master; block value 1318; lock states S and N; steps 1 to 4]
      • Instance 2 requests current read on block
      • Messages shown: Request shared resource; Request granted; Read request; Block returned
  • 13. Global Cache Services 3-way Current Read
      [Diagram: Instance 1 to Instance 4 and the Resource Master; block values 1318 and 1320; lock states S, N and X; steps 1 to 4]
      • Instance 1 requests exclusive read on block
      • Messages shown: Request exclusive resource; Transfer block to Instance 1 for exclusive access; Block and resource status; Resource status
  • 14. Global Cache Services 3-way Current Read (Dirty Block)
      [Diagram: Instance 1 to Instance 4 and the Resource Master; block values 1318, 1320 and 1323; lock states S, N and X; steps 1 to 4]
      • Instance 4 requests exclusive read on block
      • Messages shown: Request block in exclusive mode; Transfer block to Instance 4 in exclusive mode; Block and resource status; Resource status
      • Note that Instance 1 will create a past image (PI) of the dirty block
  • 15. Global Cache Services 3-way Current (Without Downgrade)
      [Diagram: Instance 1 to Instance 4 and the Resource Master; block values 1318, 1320 and 1323; lock states N and X; steps 1 to 4]
      • Instance 2 requests current read on block
      • Messages shown: Request block in shared mode; Transfer block to Instance 2 in shared mode; Block and resource status; Resource status
      • In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions
  • 16. Global Cache Services 3-way Current (With Downgrade)
      [Diagram: Instance 1 to Instance 4 and the Resource Master; block values 1318, 1320 and 1323; lock states N, X and S; steps 1 to 4]
      • Instance 2 requests current read on block
      • Messages shown: Request block in shared mode; Transfer block to Instance 2 in shared mode; Block and resource status; Resource status
      • In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions
  • 17. Global Cache Services Past Images
      • When an instance passes a dirty block to another instance it
          • Flushes redo buffer to redo log
          • Retains past image (PI) of block in buffer cache
      • PI is retained until another instance writes block to disk
      • Used to reduce recovery times
      • Recorded in V$BH.STATUS as PI
          • Based on X$BH.STATE (value 8 in Oracle 10.2)
  • 18. Global Cache Services Past Images
      [Animated diagram: Instance 1 and Instance 2, each with a Buffer Cache and its own redo log (Redo Log 1, Redo Log 2); column value changes from 7123 to 7129]
      • Assume table t1 contains a single row in block 42
      • Instance 1 updates column to 7124 (UPDATE t1 SET c1 = 7124; COMMIT;)
          • Block 42 is read from disk
          • Undo/Redo written to Redo Log 1
          • Block 42 is updated in buffer cache
      • Instance 1 updates column to 7125, 7126, 7127 and 7128 in turn (UPDATE t1 SET c1 = 7125; COMMIT; and so on); for each update
          • Undo/Redo written to Redo Log 1
          • Block 42 is updated in buffer cache
      • Instance 2 updates column to 7129 (UPDATE t1 SET c1 = 7129; COMMIT;)
          • GCS transfers block from Instance 1 to Instance 2
          • Instance 1 makes block 42 a Past Image block
          • Undo/redo written to Redo Log 2
          • Block 42 is updated in buffer cache
      • Instance 2 crashes
          • Contents of buffer cache are lost
          • DBWR has not written changes to block 42 back to disk yet
      • Instance 1 must perform recovery for Instance 2
          • Block 42 needs recovery
          • Instance 1 uses Past Image
          • Undo/redo is applied from Redo Log 2
          • Block 42 is subsequently written back to disk by DBWR
  • 19. Global Cache Services Wait Events
      • Wait events show reads where messages have been exchanged with other instances
      • Can include:
          • gc cr grant 2-way
          • gc cr block 2-way
          • gc cr block 3-way
          • gc cr multi block request
          • gc current grant 2-way
          • gc current block 2-way
          • gc current block 3-way
          • gc current multi block request
  • 20. Global Cache Services gc cr block 3-way wait event
      Source         Destination    Description                    Bytes
      RAC4 - Server  RAC2 - LMS1    Request file 8 block 15          456
      RAC2 - LMS1    RAC4 - Server  OK                               212
      RAC2 - LMS1    RAC3 - LMS1    Send file 8 block 15 to RAC4     480
      RAC3 - LMS1    RAC2 - LMS1    OK                               212
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 1    1500
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 2    1500
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 3    1500
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 4    1500
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 5    1500
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 6     868
  • 21. Global Cache Services gc cr block 3-way wait event
      [Diagram: RAC1 to RAC4 and the Resource Master; statement UPDATE t1 SET c2 = 50 WHERE c1 = 2; steps 1 to 10; block contents shown as rows 1,40 2,44 and 1,42 2,44; block value 1318]
  • 22. Global Cache Services gc cr block 2-way wait event
      • 2-way Consistent Read
      Source         Destination    Description                    Bytes
      RAC4 - Server  RAC3 - LMS1    Request file 6 block 69          400
      RAC3 - LMS1    RAC4 - Server  OK                               212
      RAC3 - LMS1    RAC4 - Server  Block file 6 block 69 part 1    1500
      RAC3 - LMS1    RAC4 - Server  Block file 6 block 69 part 2    1500
      RAC3 - LMS1    RAC4 - Server  Block file 6 block 69 part 3    1500
      RAC3 - LMS1    RAC4 - Server  Block file 6 block 69 part 4    1500
      RAC3 - LMS1    RAC4 - Server  Block file 6 block 69 part 5    1500
      RAC3 - LMS1    RAC4 - Server  Block file 6 block 69 part 6     868
  • 23. Global Cache Services gc cr block 2-way wait event
      [Diagram: RAC1 to RAC4 and the Resource Master; statement UPDATE t1 SET c2 = 50 WHERE c1 = 2; steps 1 to 8; block contents shown as rows 1,40 2,44; block value 1318]
  • 24. Global Cache Services gc current block 3-way wait event
      • 3-way Current Read
      Source         Destination    Description                    Bytes
      RAC4 - Server  RAC2 - LMS1    Request file 8 block 15          456
      RAC2 - LMS1    RAC4 - Server  OK                               212
      RAC2 - LMS1    RAC3 - LMS1    Send file 8 block 15 to RAC4     480
      RAC3 - LMS1    RAC2 - LMS1    OK                               212
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 1    1500
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 2    1500
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 3    1500
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 4    1500
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 5    1500
      RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 6     868
      RAC4 - LMS1    RAC2 - LMS1    Received file 8 block 15         244
      RAC2 - LMS1    RAC4 - LMS1    OK                               212
  • 25. Global Cache Services gc current block 3-way wait event
      [Diagram: RAC1 to RAC4 and the Resource Master; statements UPDATE t1 SET c2 = 50 WHERE c1 = 2; and UPDATE t1 SET c2 = 42 WHERE c1 = 1; steps 1 to 12; block contents shown as rows 1,40 2,44, then 1,42 2,44, then 1,42 2,50; block value 1318]
      • RAC3 saves past image of the dirty block until RAC4 writes the block to disk
  • 26. Global Cache Services gc cr grant 2-way wait event
      • 2-way Consistent Read
      Source         Destination    Description                    Bytes
      RAC4 - Server  RAC3 - LMS1    Request file 6 block 69          400
      RAC3 - LMS1    RAC4 - Server  OK                               212
      RAC3 - LMS1    RAC4 - Server  Grant read file 6 block 69       276
      RAC4 - Server  RAC3 - LMS1    OK                               212
  • 27. Global Cache Services gc cr grant 2-way wait event
      [Diagram: RAC1 to RAC4 and the Resource Master; statement SELECT c2 FROM t1 WHERE c1 = 1; steps 1 to 6; block contents shown as rows 1,40 2,44; block value 1318]
  • 28. Global Cache Services gc cr multi block request wait event
      Source         Destination    Description                        Bytes
      RAC4 - Server  RAC3 - LMS1    Request file 8 blocks 69-76         1872
      RAC3 - LMS1    RAC4 - Server  OK                                   212
      RAC3 - LMS1    RAC4 - Server  Grant file 8 blocks 69-76 to RAC4    772
      RAC4 - Server  RAC3 - LMS1    OK                                   212
  • 29. Global Cache Services gc cr multi block request wait event
      [Diagram: RAC1 to RAC4 and the Resource Master; statement SELECT c2 FROM t1 WHERE c1 = 1; steps 1 to 6; multiple blocks, each shown with rows 1,40 2,44; block value 1318]
  • 30. Global Cache Services gc cr multi block request wait event
      • The following event 10046 level 8 trace is for a gc cr multi block request
          WAIT #2: nam='gc cr multi block request' ela= 722 file#=4 block#=248 class#=1 obj#=51866 tim=1169728375495574
          WAIT #2: nam='db file scattered read' ela= 10437 file#=4 block#=244 blocks=5 obj#=51866 tim=1169728375506092
      • This trace can be misleading because:
          • the gc cr multi block request specifies the LAST block in the range
          • the gc cr multi block request does not specify how many blocks should be read
          • the gc cr multi block request does not specify how many blocks have been returned from another instance
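To make the "LAST block in the range" point concrete, here is a minimal sketch that parses the two WAIT lines from the slide and compares the block numbers. The function name `parse_wait` and the regular expressions are illustrative assumptions, written only against the line layout shown above, not against any documented trace format.

```python
import re

# Matches the fixed prefix of a WAIT line and captures the remaining key=value text.
WAIT_RE = re.compile(r"WAIT #(\d+): nam='([^']*)'(.*)")
# Keys may contain '#', and a space may follow '=' (as in "ela= 722").
PAIR_RE = re.compile(r"([\w#]+)=\s*(\d+)")


def parse_wait(line):
    """Return a dict for one WAIT line, or None if the line does not match."""
    match = WAIT_RE.match(line.strip())
    if match is None:
        return None
    event = {"cursor": int(match.group(1)), "nam": match.group(2)}
    for key, value in PAIR_RE.findall(match.group(3)):
        event[key] = int(value)
    return event


if __name__ == "__main__":
    gc_wait = parse_wait(
        "WAIT #2: nam='gc cr multi block request' ela= 722 file#=4 "
        "block#=248 class#=1 obj#=51866 tim=1169728375495574"
    )
    disk_read = parse_wait(
        "WAIT #2: nam='db file scattered read' ela= 10437 file#=4 "
        "block#=244 blocks=5 obj#=51866 tim=1169728375506092"
    )
    last_block_read = disk_read["block#"] + disk_read["blocks"] - 1
    print(gc_wait["block#"], last_block_read)
```

In this example the scattered read starts at block 244 and covers 5 blocks, so it ends at the same block number that the gc cr multi block request reports.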
  • 31. Global Cache Services Block Mastering
      • Each block is mastered on one instance
      • Block DBA is reported by X$KJBR.KJBRNAME
      • Names have the format: [<block_number>][<file_number>][BL]
      • For example, [0x137][0x40000][BL] is file# 4, block# 311
      • Ordering by X$KJBR.KJBRNAME is difficult because the resource names sort as strings rather than in numeric order, e.g.:
          [0x12E][0x40000][BL]
          [0x12F][0x40000][BL]
          [0x13][0x40000][BL]
          [0x130][0x40000][BL]
          [0x131][0x40000][BL]
          etc...
  • 32. Global Cache Services Block Mastering
      • Some useful functions
          -- File number: second hex value in the resource name, divided by 65536
          CREATE OR REPLACE FUNCTION get_file_number (p_resource_name VARCHAR2)
          RETURN INTEGER
          IS
            pos1 INTEGER := INSTR (p_resource_name,'x',1,2);
            pos2 INTEGER := INSTR (p_resource_name,']',1,2);
            s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
          BEGIN
            RETURN TO_NUMBER (s,'XXXXXXXX') / 65536;
          END;
          /
          -- Block number: first hex value in the resource name
          CREATE OR REPLACE FUNCTION get_block_number (p_resource_name VARCHAR2)
          RETURN INTEGER
          IS
            pos1 INTEGER := INSTR (p_resource_name,'x',1,1);
            pos2 INTEGER := INSTR (p_resource_name,']',1,1);
            s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
          BEGIN
            RETURN TO_NUMBER (s,'XXXXXXXX');
          END;
          /
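For readers who want to check the decoding outside the database, the same arithmetic can be sketched in Python. The function name `parse_resource_name` is hypothetical; the only facts taken from the slides are the [block][file][BL] layout, the division of the second value by 65536, and the example names.

```python
import re

# [<block_number>][<file_number>][BL], both numbers in hexadecimal.
_RESOURCE_NAME = re.compile(r"\[0x([0-9A-Fa-f]+)\]\[0x([0-9A-Fa-f]+)\]\[BL\]")


def parse_resource_name(name):
    """Return (file_number, block_number) for a BL resource name."""
    match = _RESOURCE_NAME.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"not a BL resource name: {name!r}")
    block_number = int(match.group(1), 16)
    file_number = int(match.group(2), 16) // 65536
    return file_number, block_number


if __name__ == "__main__":
    names = [
        "[0x12E][0x40000][BL]",
        "[0x12F][0x40000][BL]",
        "[0x13][0x40000][BL]",
        "[0x130][0x40000][BL]",
        "[0x131][0x40000][BL]",
    ]
    # Sorting on the decoded (file, block) pair gives numeric order,
    # which plain string sorting of the names does not.
    for name in sorted(names, key=parse_resource_name):
        print(name, parse_resource_name(name))
```

Sorting on the decoded pair moves [0x13] (block 19) ahead of [0x12E] (block 302), which is the ordering problem described on the previous slide.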
  • 33. Global Cache Services Block Mastering
      • In Oracle 10.2 block mastering is determined by _lm_contiguous_res_count
          • Specifies number of contiguous blocks that will hash to the same HV bucket
          • Defaults to 128
      • For example (two tables on the slide, with the labels Instance 0 and Instance 1):
          Start  End
          0x080  0x0FF
          0x180  0x1FF
          0x280  0x2FF
          0x380  0x3FF
          0x480  0x4FF
          0x580  0x5FF
          etc    etc

          Start  End
          0x000  0x07F
          0x100  0x17F
          0x200  0x27F
          0x300  0x37F
          0x400  0x47F
          0x500  0x57F
          etc    etc
  • 34. Global Cache Services Block Mastering
      • The following table shows that masters are still assigned to ranges of 128 contiguous blocks in a four-node cluster
          Start Block  End Block  Master
                    0        127       1
                  128        255       2
                  256        383       2
                  384        511       3
                  512        639       3
                  640        767       3
                  768        895       1
                  896       1023       0
                 1024       1279       2
                 1280       1407       1
  • 35. Global Cache Services Dynamic Remastering
      • In Oracle 9.2
          • documentation describes dynamic remastering
          • not implemented in code
      • In Oracle 10.1
          • works at data file level
          • very high threshold so difficult to test
          • does occur on some customer sites
      • In Oracle 10.2 and above
          • works at segment level
          • thresholds are relatively low
  • 36. Global Cache Services Dynamic Remastering
      • Object remastering is recorded in V$GCSPFMASTER_INFO
      • Instances are internally numbered 0, 1 etc
      • Initially contains no rows
      • After remastering object 52084 to instance 0
          SELECT object_id, current_master, previous_master FROM v$gcspfmaster_info;

          Object ID  Current Master  Previous Master
              52084               0            32767
      • After remastering object 52084 to instance 1
          Object ID  Current Master  Previous Master
              52084               1                0
  • 37. Global Cache Services Dynamic Remastering
      • In Oracle 10.2 and above, information about Dynamic Remastering operations is also reported in the following fixed views
          • X$KJDRMREQ: Dynamic Remastering Requests
          • X$KJDRMAFNSTATS: File Remastering Statistics
          • X$KJDRMHVSTATS: Hash Value Statistics
  • 38. Global Cache Services Dynamic Remastering
      • In Oracle 11.1 and above, Dynamic Remastering statistics are reported in V$DYNAMIC_REMASTER_STATS
          Column Name              Data Type
          REMASTER_OPS             NUMBER
          REMASTER_TIME            NUMBER
          REMASTERED_OBJECTS       NUMBER
          QUIESCE_TIME             NUMBER
          FREEZE_TIME              NUMBER
          CLEANUP_TIME             NUMBER
          REPLAY_TIME              NUMBER
          FIXWRITE_TIME            NUMBER
          SYNC_TIME                NUMBER
          RESOURCES_CLEANED        NUMBER
          REPLAYED_LOCKS_SENT      NUMBER
          REPLAYED_LOCKS_RECEIVED  NUMBER
          CURRENT_OBJECTS          NUMBER
  • 39. Global Cache Services Dynamic Remastering
      • Dynamic remastering is coordinated by the LMD0 background process
      • The LMD0 background process includes limited details of dynamic remastering operations
      • Excessive dynamic remastering can cause instance freezes
          • Observed in both Oracle 10.1 and 10.2
      • Oracle Support occasionally recommends that dynamic remastering is disabled using the following parameters:
          _gc_affinity_time = 0
          _gc_undo_affinity = FALSE
  • 40. Thank you for listening. info@juliandyke.com
  • 41. Backup
  • 42. Interconnect Overview
      • Instances communicate with each other over the interconnect (network)
      • Information transferred between instances includes
          • data blocks
          • locks
          • SCNs
      • Typically 1Gb Ethernet
          • UDP protocol
          • Often teamed in pairs to avoid single points of failure (SPOFs)
      • Can also use Infiniband
          • Fewer levels in stack
      • Other proprietary protocols are available
  • 43. Interconnect TCP/IP Five Layer Model
      • All messages travel down through layers, across physical layer then up again
      [Diagram: two protocol stacks side by side, each with layers 5 Application, 4 Transport, 3 Network, 2 Data Link, 1 Physical]
  • 44. Interconnect TCP/IP Five Layer Model
      • TCP/IP has a four or five layer model
      • Five-layer model shown below
          Layer          TCP/IP Suite
          5 Application  DHCP, DNS, FTP, HTTP, SSH, NFS, NTP, SMTP, SNMP, TELNET, RPC, SOAP
          4 Transport    TCP, UDP
          3 Network      IP (IPv4, IPv6), ICMP, ARP, RARP
          2 Data Link    Ethernet, Token Ring, 802.11, Wi-Fi, FDDI, PPP
          1 Physical     10BASE-T, 100BASE-T, 1000BASE-T, Optical Fibre, Twisted Pair
      • Four-layer model combines data link and physical layers
  • 45. Interconnect TCP/IP Transport Layer
      • Transport Layer
          • Connection-oriented (TCP)
          • Connectionless (UDP)
      [Diagram: protocol stack with Physical Layer, Ethernet, IP, TCP and UDP, with Clusterware and RAC on top]
  • 46. Interconnect Encapsulation
      [Diagram: Data is wrapped in turn by a UDP Header (8 bytes), an IP Header (20 bytes), and an Ethernet Header (14 bytes) plus Ethernet Trailer (4 bytes); the MTU Size is marked on the frame]
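The header sizes above can be used to sketch how one large UDP message is split across packets. This is an illustration under stated assumptions, not a description of how Oracle sends blocks: it assumes a single UDP datagram carried over IPv4 with standard IP fragmentation, an MTU of 1500 bytes, and that a packet's size is counted as IP header plus data. The function name `ip_fragment_sizes` and the 8240-byte payload are hypothetical; that payload was chosen because, under these assumptions, it reproduces the 5 x 1500 + 868 byte pattern shown in the gc cr block tables earlier in the deck.

```python
IP_HEADER = 20   # bytes, as on the slide
UDP_HEADER = 8   # bytes, as on the slide


def ip_fragment_sizes(udp_payload, mtu=1500):
    """Return the IP packet sizes needed to carry one UDP datagram.

    The UDP header travels once, at the start of the datagram; every
    fragment carries its own IP header. Fragment data lengths other than
    the last must be a multiple of 8 bytes.
    """
    remaining = UDP_HEADER + udp_payload
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    sizes = []
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(IP_HEADER + chunk)
        remaining -= chunk
    return sizes


if __name__ == "__main__":
    print(ip_fragment_sizes(8240))
    print(ip_fragment_sizes(100))
```

A larger MTU (for example jumbo frames) reduces the number of packets per block, which can be explored by passing a different `mtu` value.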
  • 47. Oracle Clusterware Node Heartbeat Messages
      • Sent to each node in cluster every second in both directions
      • Checks nodes are still members of cluster
      • Sent by ocssd.bin using TCP well-known port 49895
      • Outgoing message is 134 bytes (80 byte payload)
      • Incoming message is 66 bytes (12 byte payload)
      [Diagram: Node 1 to Node 4 exchanging Outgoing and Incoming messages]
  • 48. Oracle Clusterware Node Status Messages
      • Number of packets exchanged by a node is determined by number of nodes in cluster
      • Number of packets per node per hour is
          • (#nodes - 1) * 4 messages * 3600 seconds
          Number of nodes  Packets per hour
                        2            14,400
                        3            28,800
                        4            43,200
                        5            57,600
                        6            72,000
                        7            86,400
                        8           100,800
                       16           216,000
                       32           446,400
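The formula on the slide is easy to check directly. The sketch below simply evaluates (#nodes - 1) * 4 * 3600 for the cluster sizes in the table; the function name `status_packets_per_hour` is hypothetical.

```python
def status_packets_per_hour(nodes, messages_per_second=4):
    """Packets exchanged by one node per hour, using the slide's formula."""
    return (nodes - 1) * messages_per_second * 3600


if __name__ == "__main__":
    for nodes in (2, 3, 4, 5, 6, 7, 8, 16, 32):
        print(f"{nodes:>2} nodes: {status_packets_per_hour(nodes):>7,} packets per hour")
```

The count grows linearly with the number of other nodes, since each node exchanges messages with every other node.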
  • 49. RAC Background Processes Overview
      [Diagram: Node 1 running Instance 1 and Node 2 running Instance 2; each instance has a Shared Pool, a Buffer Cache and the background processes DIAG, LMON, LCK0, LMD0, LMSn, PMON, SMON, DBWR, LGWR, CKPT and ARCn; storage shown as Datafiles, Controlfiles and two sets of Redo Logs]
  • 50. RAC Background Processes LMSn
      • LMSn
          • Global Cache Service Process
          • Manage requests for data access across cluster
          • Up to 20 in Oracle 10.1
              • LMS0-LMS9 LMSa-LMSj
          • Up to 36 in Oracle 10.2
              • LMS0-LMS9 LMSa-LMSz
      • In Oracle 10.1 and above, number of GCS server processes can be configured using gcs_server_processes parameter
          • Default value is 1 (single CPU system)
          • Can also be configured using _lm_lms parameter
  • 51. RAC Background Processes LMSn
      • In Oracle 10.2 and above
          • LMS processes run in real-time mode
          • Remaining processes run in time-share mode
      • Check using:
          [oracle@server3 ~]$ ps -eo pid,user,opri,cmd | grep ora_lm
          8596 oracle 75 ora_lmon_TEST1
          8598 oracle 75 ora_lmd0_TEST1
          8601 oracle 58 ora_lms0_TEST1
      • 58 is real time; 75 or 76 is time share
      • You can also check process scheduling policies using chrt
          [oracle@server3 ~]$ chrt -p 8601   # lms0 - Real Time
          pid 8601's current scheduling policy: SCHED_RR
          pid 8601's current scheduling priority: 1
          [oracle@server3 ~]$ chrt -p 8596   # lmon - Time Share
          pid 8596's current scheduling policy: SCHED_OTHER
          pid 8596's current scheduling priority: 0
  • 52. RAC Background Processes LCK0
      • LCK0
          • Instance Enqueue Process
          • Part of KCL (Kernel Cache Library)
          • Manages
              • instance resource requests
              • cross-instance call operations
          • Assists LMS processes
      • Formerly known as lock process
      • One LCK0 process per instance
      • In 9.0.1 and below, number of lock processes may be configurable using _gc_lck_procs parameter
  • 53. RAC Background Processes LMD0
      • LMD0
          • Global Enqueue Service Daemon
          • Manages requests for global enqueues
          • Updates status of enqueues when granted to / revoked from an instance
          • Responsible for deadlock detection
      • One LMD0 process per instance
      • In 8.1.7 and below number of lock daemons may be configurable using _lm_dlmd_processes parameter
  • 54. RAC Background Processes LMON
      • LMON
          • Global Enqueue Service Monitor
          • One LMON process per instance
          • Monitors cluster to maintain global enqueues and resources
          • Manages
              • instance and process expirations
              • recovery processing for cluster enqueues
  • 55. RAC Background Processes DIAG
      • DIAG - Diagnosability Process
          • Collects diagnostic data in the event of a failure
          • Creates subdirectories in BACKGROUND_DUMP_DEST directory
      • In Oracle 9.0.1 and above can be disabled using _diag_daemon parameter
          • Do not try this on a production system
  • 56. Global Cache Services UDP Messages
      • There are two types of message exchanged within RAC
      • These are PROBABLY defined as follows
          • Synchronous
              • These messages require an acknowledgement for each packet
              • In some cases the acknowledgement packet can be larger than the original request
              • e.g. SCN synchronization
          • Asynchronous
              • These messages do not require an individual acknowledgement for each packet
              • e.g. block transfers between instances
  • 57. Global Cache Services Lock Modes
      • Lock modes can be:
          • Null
              • Another instance can hold an exclusive or shared lock
          • Shared
              • Another instance can hold a shared lock but not an exclusive lock
          • Exclusive
              • No other instances can hold shared or exclusive locks
      • Locks can also be:
          • Local
              • No other instance has held an exclusive lock
          • Global
              • Another instance has held an exclusive lock in the past
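The three lock modes above amount to a small compatibility table. The sketch below encodes only what the slide states; the single-letter codes N, S and X follow the diagrams earlier in the deck, and the function name `can_coexist` is hypothetical.

```python
# For each mode held by one instance, the modes another instance may hold
# at the same time, as described on the slide.
_ALLOWED_WITH = {
    "N": {"N", "S", "X"},  # Null: others may hold exclusive or shared
    "S": {"N", "S"},       # Shared: others may hold shared, not exclusive
    "X": {"N"},            # Exclusive: others may hold neither S nor X
}


def can_coexist(mode_a, mode_b):
    """True if two instances may hold mode_a and mode_b on the same block."""
    return mode_b in _ALLOWED_WITH[mode_a]


if __name__ == "__main__":
    for a in "NSX":
        print(a, {b: can_coexist(a, b) for b in "NSX"})
```

The table is symmetric, which is a quick sanity check that the three descriptions on the slide are consistent with each other.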
  • 58. Global Cache Services Fairness Threshold
      • Intended to prevent unnecessary lock downgrades when other instances only require read-only copies
      • For write to read transfers
          • Writing instance retains X lock
          • Reading instance retains null lock
      • If _fairness_threshold reached then
          • Writing instance downgrades X lock to S lock
          • Reading instance receives S lock
      • _fairness_threshold default value is 4
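The behaviour described above can be sketched as a counter on the writing instance. This is a toy model only: the slide does not say exactly what is counted or when the counter resets, so the sketch assumes that each read request served while the writer holds X increments the counter, and that the downgrade happens on the request that reaches the threshold. The class name `FairnessCounter` is hypothetical.

```python
class FairnessCounter:
    """Toy model of the fairness threshold for write-to-read transfers."""

    def __init__(self, threshold=4):
        self.threshold = threshold  # slide: _fairness_threshold defaults to 4
        self.served = 0             # assumption: reads served while holding X
        self.writer_mode = "X"

    def serve_read(self):
        """Serve one read request; return the lock mode the reader ends up with."""
        if self.writer_mode == "S":
            # Already downgraded: readers can share the block.
            return "S"
        self.served += 1
        if self.served >= self.threshold:
            # Threshold reached: writer downgrades X to S, reader receives S.
            self.writer_mode = "S"
            self.served = 0
            return "S"
        # Below threshold: writer keeps X, reader keeps a null lock.
        return "N"


if __name__ == "__main__":
    counter = FairnessCounter()
    print([counter.serve_read() for _ in range(5)], counter.writer_mode)
```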
  • 59. Global Cache Services Lock Elements
      • Lock elements are externalized in the V$LOCK_ELEMENT dynamic performance view
          • Based on X$LE
          • Additional information is available in the X$LE view
      • Past image buffers do not have a lock element
      • In OPS one lock element could manage a contiguous range of blocks
          • Still can in RAC using GC_FILES_TO_LOCKS parameter
          • Disables Cache Fusion
  • 60. Global Cache Services Lock Elements
      • Contain embedded GCS Client structures (KJBL)
      [Diagram: Lock Elements, each containing a GCS Client and linked to one or more Buffer Headers]
  • 61. Global Cache Services Memory Structures
      [Diagram: KJBR and KJBL structures linked to LE and BH structures; legend: GCS Client, GCS Shadow, GCS Resource, Block Header, Lock Element]
      • GCS Shadow describes blocks held by other instances, but mastered locally
  • 62. Global Cache Services Memory Structures
      • GCS Resources (KJBR)
          • Stored in segmented array
          • Number of GCS resource structures determined by _gcs_resources parameter
          • Externalized in X$KJBR
          • Number of free GCS resource structures in X$KJBRFX
      • GCS Enqueues (Clients / Shadows) (KJBL)
          • GCS clients embedded in lock elements
          • GCS shadows stored in segmented array
          • Number of GCS shadow structures determined by _gcs_shadow_locks parameter
          • Externalized in X$KJBL
          • Number of free GCS shadow structures in X$KJBLFX
  • 63. Global Cache Services Dynamic Remastering
      • Example
          SELECT data_object_id FROM dba_objects WHERE owner = 'US01' AND object_name = 'T1';

          OBJECT_ID
          ---------
              52084
      • To remaster object at current instance use:
          ORADEBUG LKDEBUG -m pkey 52084
      • All blocks now mastered by the current instance
      • To redistribute masters to all available instances use:
          ORADEBUG LKDEBUG -m dpkey 52084
      • Blocks mastered by both (all) instances again
  • 64. Global Cache Services Block Mastering
      • In Oracle 10.1 and below block mastering is determined by a hash function
      • Algorithm applied to groups of 1289 contiguous blocks
      • In two node cluster
          • Instance 0 has 645 blocks
          • Instance 1 has 644 blocks
          • etc
      • In three node cluster
          • Instance 0 has 430 blocks
          • Instance 2 has 215 blocks
          • Instance 1 has 430 blocks
          • Instance 2 has 214 blocks
          • etc
      • Beware of small hot tables and indexes....
  • 65. Global Cache Services Dumps
      • To dump the contents of the global cache use:
          ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME GC_ELEMENTS LEVEL 1';

          GLOBAL CACHE ELEMENT DUMP (address: 0x21fecd18):
            id1: 0x3591 id2: 0x10000 obj: 181 block: (1/13713)
            lock: SL rls: 0x0000 acq: 0x0000 latch: 0
            flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp'
            bscn: 0x0.18a9c bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
            GCS CLIENT 0x21fecd60,1 sq[(nil),(nil)] resp[(nil),0x3591.10000] pkey 181
              grant 1 cvt 0 mdrole 0x21 st 0x20 GRANTQ rl LOCAL
              master 1 owner 0 sid 0 remote[(nil),0] hist 0x7c
              history 0x3c.0x1.0x0.0x0.0x0.0x0.
              cflag 0x0 sender 2 flags 0x0 replay# 0
              disk: 0x0000.00000000 write request: 0x0000.00000000
              pi scn: 0x0000.00000000
              msgseq 0x1 updseq 0x0 reqids[1,0,0] infop 0x0
            pkey 181
              hv 107 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 6, dom 0]
              kjga st 0x4, step 0.0.0, cinc 8, rmno 10, flags 0x0
              lb 0, hb 0, myb 178, drmb 178, apifrz 0
  • 66. Global Cache Services Dumps
      • Continued
          GLOBAL CACHE ELEMENT DUMP (address: 0x237f4358):
            id1: 0x6a39 id2: 0x10000 obj: 74 block: (1/27193)
            lock: SL rls: 0x0000 acq: 0x0000 latch: 0
            flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp'
            bscn: 0x0.26992 bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
            GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74
              grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
              master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5
              .....
            GCS RESOURCE 0x2ee64e74 hashq [0x2ee61894,0x2ff57390] name[0x6a39.10000] pkey 74
              grant 0x2eff3858 cvt (nil) send (nil),0 write (nil),0@65535
              flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
              .....
            GCS SHADOW 0x2eff3858,1 sq[0x237f43a0,0x2ee64e8c] resp[0x2ee64e74,0x6a39.10000] pkey 74
              grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
              master 0 owner 1 sid 0 remote[0x23fea160,1] hist 0x65f
              .....
            GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74
              grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
              master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5
              .....
  • 67. Global Cache Services System Change Number
      • In RAC clusters SCN must be maintained across all nodes in cluster
      • SCN propagation scheme differs according to version
      • In Oracle 10.1 and below defaults to Lamport algorithm
          • Lamport in alert.log
          • SCN piggy-backed on GCS/GES messages
          • Recorded in redo log
          • Default delay of 7 seconds
      • In Oracle 10.2 and above defaults to Broadcast on Commit algorithm
          • SCN negotiated immediately
          • Apparently no delay
  • 68. Global Cache Services System Change Number
      • System Change Number algorithm is determined by the MAX_COMMIT_PROPAGATION_DELAY parameter
      • In Oracle 10.1 and below
          • Initialization parameter specified in centiseconds
          • Default value is 700 centiseconds (7 seconds)
          • Specifies maximum time taken for a COMMIT on one node to be reflected on other nodes in the cluster
      • For some applications performing rapid updates and queries of the same data from different instances, value must be set to 0 (Broadcast on commit)
      • Examples include:
          • E-Business suite
          • SAP
  • 69. Global Cache Services System Change Number
      • In Oracle 10.2 and above
          • Default value of MAX_COMMIT_PROPAGATION_DELAY parameter is 0
          • SCN broadcast on commit method is used
          • SCN updates are synchronized immediately
      • SCN is synchronized
          • after current read
          • before block updated
      • This ensures correct SCN is written to block
  • 70. Global Cache Services Broadcast on Commit
      • Ethernet broadcast is not used
      • SCN is synchronized by updating instance
          • Sends UDP SCN synchronization message to each remote instance
          • Remote instances respond with their current SCN
      • Another round of messages may be required if remote SCNs are more recent than local SCN
      • Synchronization occurs every time an instance needs a new SCN
      • Synchronization is always performed by the updating instance
      • Number of messages = 4 x (number of instances - 1)
  • 71. Global Cache Services Broadcast on Commit
      • In a 4-node cluster 12 messages are exchanged
          Source     Destination  Description       Bytes
          RAC4-LMS0  RAC1-LMS0    Send current SCN    192
          RAC1-LMS0  RAC4-LMS0    OK                  212
          RAC4-LMS0  RAC2-LMS0    Send current SCN    192
          RAC2-LMS0  RAC4-LMS0    OK                  212
          RAC4-LMS0  RAC3-LMS0    Send current SCN    192
          RAC3-LMS0  RAC4-LMS0    OK                  212
          RAC1-LMS0  RAC4-LMS0    Send current SCN    192
          RAC4-LMS0  RAC1-LMS0    OK                  212
          RAC2-LMS0  RAC4-LMS0    Send current SCN    192
          RAC4-LMS0  RAC2-LMS0    OK                  212
          RAC3-LMS0  RAC4-LMS0    Send current SCN    192
          RAC4-LMS0  RAC3-LMS0    OK                  212
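The message pattern in the table can be generated from the rule "4 x (number of instances - 1)". The sketch below is an illustration only: the function name `broadcast_on_commit_messages` is hypothetical, and the byte sizes are simply the values from the table above rather than anything derived from a protocol definition.

```python
def broadcast_on_commit_messages(updating, instances, scn_bytes=192, ack_bytes=212):
    """List (source, destination, description, bytes) tuples for one SCN sync.

    Mirrors the table on the slide: the updating instance sends its SCN to
    each remote instance and is acknowledged, then each remote instance sends
    its SCN back and is acknowledged.
    """
    remotes = [name for name in instances if name != updating]
    messages = []
    for remote in remotes:
        messages.append((updating, remote, "Send current SCN", scn_bytes))
        messages.append((remote, updating, "OK", ack_bytes))
    for remote in remotes:
        messages.append((remote, updating, "Send current SCN", scn_bytes))
        messages.append((updating, remote, "OK", ack_bytes))
    return messages


if __name__ == "__main__":
    cluster = ["RAC1", "RAC2", "RAC3", "RAC4"]
    messages = broadcast_on_commit_messages("RAC4", cluster)
    for message in messages:
        print(*message)
    print(len(messages), "messages,", sum(m[3] for m in messages), "bytes")
```

Because every SCN request is paired with an acknowledgement in each direction, the message count scales with the number of remote instances rather than with the amount of data changed.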
  • 72. © 2013 Julian Dyke juliandyke.com72 Global Cache Service Read Consistency  When a read consistent version of a block is requested it may be necessary to apply undo to a more recent version of that block  Undo can be applied by LMSn background process in  Remote instance  Local instance  If undo applied by remote instance, any outstanding redo must first be flushed from redo buffer of remote instance to redo log  Can have significant performance impact on consistent reads  Particularly on extended clusters
  • 73. © 2013 Julian Dyke juliandyke.com73 Global Cache Service Read Consistency  Statistics on inter-instance consistent reads are reported in V$CR_BLOCK_SERVER  Reports statistics for blocks served by local instances to remote instances including  Number of consistent reads served  Number of current reads served  Number of data blocks served  Number of undo blocks served  Number of undo headers served  Number of fairness down converts  Number of log flushes  Number of times light works rule invoked
  • 74. © 2013 Julian Dyke juliandyke.com74 Global Cache Service Read Consistency  In theory, once a block has been written to disk, the LMS process will not attempt to read it again when responding to a consistent read request  Light Works Rule  Prevents LMS processes from going to disk when responding to CR requests for data, undo or undo segment blocks  Can prevent LMS process from completing its response to a CR request
  • 75. © 2013 Julian Dyke juliandyke.com75 Global Cache Service Read Consistency  Uncommitted changes MUST be flushed to the redo log before the LMS process can ship a consistent block to another instance  Reading process must wait until redo log changes have been written to redo log by LMS process  Bad for standard RAC databases  Reads must wait for redo log writes  Worse for extended / stretch RAC clusters  Increased latency of cross site disk communications
  • 76. © 2013 Julian Dyke juliandyke.com76 Global Cache Service Read Consistency  For each block on which a consistent read is performed, a redo log flush must first be performed  Number of redo log flushes is recorded in the FLUSHES column of V$CR_BLOCK_SERVER  Redo log flush time  is recorded in the gc cr block flush time statistic for the LMS process  will increase time taken to serve consistent block  will increase time taken to perform consistent read  If LMS processes become very busy, consistent reads will experience high wait times, e.g. gc cr multi block request waits during a full table scan
  • 77. © 2013 Julian Dyke juliandyke.com77 Global Cache Services Read Consistency Committed transaction on RAC2 - All blocks still in buffer cache [Diagram: RAC1 and RAC2, each with a Buffer Cache and Redo Buffer, plus the Redo Log; steps numbered 1 to 3]
  • 78. © 2013 Julian Dyke juliandyke.com78 Global Cache Services Read Consistency Committed transaction on RAC2 - Some blocks written to disk [Diagram: RAC1 and RAC2, each with a Buffer Cache and Redo Buffer, plus the Redo Log; steps numbered 1 to 4]
  • 79. © 2013 Julian Dyke juliandyke.com79 Global Cache Services Read Consistency Uncommitted transaction on RAC2 - All blocks still in buffer cache [Diagram: RAC1 and RAC2, each with a Buffer Cache and Redo Buffer, plus the Redo Log; steps numbered 1 to 6]
  • 80. © 2013 Julian Dyke juliandyke.com80 Global Cache Services Read Consistency Uncommitted transaction on RAC2 - Some blocks written to disk [Diagram: RAC1 and RAC2, each with a Buffer Cache and Redo Buffer, plus the Redo Log; steps numbered 1 to 8]
  • 81. © 2013 Julian Dyke juliandyke.com81 Global Cache Services Jumbo Frames  By default Maximum Transmission Unit (MTU) is 1500 bytes  MTU includes  IP header  UDP header  Data  Requires six packets to transmit one 8192 byte block  On some adapters MTU can be increased to around 9000  e.g. Intel PRO/1000  At command line ifconfig eth1 mtu 9000 up  or in /etc/sysconfig/network-scripts/ifcfg-eth<x> MTU=9000
  • 82. © 2013 Julian Dyke juliandyke.com82 Global Cache Services Jumbo Frames  Example - cost of sending one 8192 byte block
     MTU=1500 (default)
    Frame#  Ethernet Header  IP Header  UDP Header  Data  Ethernet Trailer  Total
    1       14               20         8           1472  4                 1518
    2       14               20         8           1472  4                 1518
    3       14               20         8           1472  4                 1518
    4       14               20         8           1472  4                 1518
    5       14               20         8           1472  4                 1518
    6       14               20         8           840   4                 886
    Total   84               120        48          8200  24                8476
     MTU=9000
    Frame#  Ethernet Header  IP Header  UDP Header  Data  Ethernet Trailer  Total
    1       14               20         8           8200  4                 8246
    Total   14               20         8           8200  4                 8246
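The totals in the two tables can be reproduced with a few lines of arithmetic. The sketch below follows the slide's own accounting, which charges an Ethernet header, IP header, UDP header and Ethernet trailer to every frame; real IP fragmentation carries the UDP header only in the first fragment, so treat this as a model of the table rather than of the wire format. The function name `frame_cost` is hypothetical, and the 8200-byte payload is taken directly from the Data column of the slide.

```python
import math

# Per-frame overheads in bytes, as listed in the slide's table.
ETH_HEADER = 14
IP_HEADER = 20
UDP_HEADER = 8
ETH_TRAILER = 4


def frame_cost(payload_bytes, mtu):
    """Return (frames, total_bytes) using the slide's per-frame accounting."""
    # The MTU covers the IP header, UDP header and data, so this is the data room per frame.
    data_per_frame = mtu - IP_HEADER - UDP_HEADER
    frames = math.ceil(payload_bytes / data_per_frame)
    per_frame_overhead = ETH_HEADER + IP_HEADER + UDP_HEADER + ETH_TRAILER
    return frames, payload_bytes + frames * per_frame_overhead


standard = frame_cost(8200, 1500)  # 6 frames, 8476 bytes in total
jumbo = frame_cost(8200, 9000)     # 1 frame, 8246 bytes in total
print(standard, jumbo)
```

Under this accounting, jumbo frames save 230 bytes per block, but the larger gain is sending one frame instead of six.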
  • 83. © 2013 Julian Dyke juliandyke.com83 Global Cache Services Jumbo Frames  Not all network adapter drivers support jumbo frames  Particularly cheap ones...  All network adapters in private interconnect must have same MTU size  Switch must also be configured to support jumbo frames  Lots of bugs and compatibility issues e.g.  Bug 4447620: RAC UDP MTU size restricted to 1500 or 9000  affects 10.1.0.5, 10.2.0.1  fixed in 10.2.0.2 and above