Severalnines Training: MySQL Cluster - Part X
 

Part X of our self-training slides on MySQL Cluster, focused on troubleshooting MySQL Cluster
Topics:
- common problems encountered by users
- error logs and trace files
- recovery and escalation procedures

    Presentation Transcript

    • 1 Copyright 2013 Severalnines AB. Control your database infrastructure. 10th Installment, MySQL Cluster Self-Training: Part 9 – Troubleshooting MySQL Cluster
    • 2 Topics
      • Common problems
      • Error logs and trace files
      • Recovery and escalation procedures
    • 3 Common Problems
      • The most common problems are
        – Configuration changes
        – Out of disk space
        – Out of RAM
        – Network issues (switch failures, network reorganization, upgrade of RST)
        – Swapping
          • echo "0" > /proc/sys/vm/swappiness
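Two of the problems listed above, low disk space and a swappiness setting that lets the kernel swap, can be checked from a script. The following is a minimal sketch and not part of the original slides: the function names `free_fraction` and `read_swappiness`, the 10% threshold, and the use of the current directory as a stand-in for the data node's DATADIR are all illustrative assumptions. The swappiness file exists only on Linux.

```python
import shutil
from pathlib import Path


def free_fraction(path):
    """Return the fraction of free space (0.0 to 1.0) on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total if usage.total else 0.0


def read_swappiness(proc_file="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness value, or None if it cannot be read."""
    try:
        return int(Path(proc_file).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    datadir = "."  # hypothetical: replace with the data node's DATADIR
    if free_fraction(datadir) < 0.10:  # illustrative threshold, not from the slides
        print(f"warning: less than 10% free space under {datadir}")
    swappiness = read_swappiness()
    if swappiness not in (None, 0):
        print(f"warning: vm.swappiness is {swappiness}; the slides suggest 0")
```

Returning `None` instead of raising keeps the check usable on systems where the file is missing or unreadable.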
    • 4 Localizing the problem
      • Look in the cluster log on the management node
      • Determine which node or nodes crashed, and in what order
      • Go to those nodes
        – View the error log file for each node.
        – Look at the recommended restart action
          • Initial Node Recovery
          • Node Recovery
        – It could also be a permanent error
          • Filesystem is full
          • Directory does not exist
    • 5 Error logs
      • Each data node stores its error log in its DATADIR
        – ndb_X_error.log
        – X is the node id of the node
      • ndb_X_out.log contains debug messages, but is usually not interesting to look at.
      • ndb_X_trace.log.n contains the last execution steps before the data node stopped or crashed.
    • 6 Ndb_X_cluster.log
        2011-05-24 08:12:44 [MgmtSrvr] INFO -- Node 3: Start with all nodes 3 and 4
        2011-05-24 08:12:44 [MgmtSrvr] INFO -- Node 3: CM_REGCONF president = 3, own Node = 3, our dynamic id = 0/1
        2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 4: CM_REGCONF president = 3, own Node = 4, our dynamic id = 0/2
        2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 3: Node 4: API mysql-5.1.51 ndb-7.1.10
        2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 4: Node 3: API mysql-5.1.51 ndb-7.1.10
        2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 4: Start phase 1 completed
        2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 3: Start phase 1 completed
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Start phase 2 completed (system restart)
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Start phase 2 completed (system restart)
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Start phase 3 completed (system restart)
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Start phase 3 completed (system restart)
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Restarting cluster to GCI: 231577
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Starting to restore schema
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Restore of schema complete
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Starting to restore schema
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Restore of schema complete
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: DICT: activate index 8 done (sys/def/7/PRIMARY)
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Node: 3 StartLog: [GCI Keep: 227381 LastCompleted: 231577 NewestRestorable: 231577]
        2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Node: 4 StartLog: [GCI Keep: 227381 LastCompleted: 231577 NewestRestorable: 231577]
        2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2306: 'Pointer too large(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
        2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
        2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
        2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 1: Node 4 Disconnected
        2011-05-24 08:12:49 [MgmtSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 1, 4, 5 and 9.
        2011-05-24 08:12:49 [MgmtSrvr] INFO -- Mgmt server state: nodeid 4 freed, m_reserved_nodes 1, 5 and 9.
        2011-05-24 08:12:57 [MgmtSrvr] INFO -- Mgmt server state: nodeid 4 reserved for ip 192.168.100.112, m_reserved_nodes 1, 4, 5 and 9.
    • 7 Closer Inspection
      • There was a system restart ongoing
      • Node 3 crashed
        – Forced node shutdown completed. Occured during startphase 4. Caused by error 2306: 'Pointer too large(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'
      • Node 4
        – Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'
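Finding which nodes crashed, and in what order, can be automated by scanning the cluster log for the ALERT lines quoted above. The following is a minimal sketch under stated assumptions: it assumes each log entry sits on one line in the format shown in the sample log (including the log's own spelling "Occured"), and the names `SHUTDOWN_RE`, `find_forced_shutdowns` and `SAMPLE` are illustrative, not part of any MySQL Cluster tool.

```python
import re

# Matches the ALERT entry written when a data node is forced to shut down.
# The spelling "Occured" is copied from the sample log on purpose.
SHUTDOWN_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmtSrvr\] ALERT\s+-- "
    r"Node (?P<node>\d+): Forced node shutdown completed\."
    r"(?: Occured during startphase (?P<phase>\d+)\.)?"
    r"(?: Caused by error (?P<error>\d+): '(?P<message>.*)')?"
)


def find_forced_shutdowns(log_lines):
    """Return forced-shutdown events in the order they appear in the log."""
    events = []
    for line in log_lines:
        match = SHUTDOWN_RE.match(line.strip())
        if not match:
            continue
        phase, error = match.group("phase"), match.group("error")
        events.append({
            "time": match.group("ts"),
            "node": int(match.group("node")),
            "start_phase": int(phase) if phase else None,
            "error": int(error) if error else None,
            "message": match.group("message"),
        })
    return events


# Entries taken from the sample cluster log in this transcript.
SAMPLE = [
    "2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Restarting cluster to GCI: 231577",
    "2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. "
    "Occured during startphase 4. Caused by error 2306: 'Pointer too large(Internal error, "
    "programming error or missing error message, please report a bug). "
    "Temporary error, restart node'.",
    "2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected",
    "2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. "
    "Occured during startphase 4. Caused by error 2308: 'Another node failed during system "
    "restart, please investigate error(s) on other node(s)(Restart error). "
    "Temporary error, restart node'.",
]

if __name__ == "__main__":
    for event in find_forced_shutdowns(SAMPLE):
        print(event["time"], "node", event["node"], "error", event["error"])
```

The order of the returned events mirrors the order in the log, which is what the slides ask you to establish before opening the per-node error logs.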
    • 8 Closer Inspection
      • Next, the error logs of the data nodes need to be inspected.
        – ndb_3_error.log
        – ndb_4_error.log
    • 9 Ndb_X_error.log
        Time: Tuesday 24 May 2011 - 02:36:36
        Status: Temporary error, restart node
        Message: Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)
        Error: 2308
        Error data: Node 3 disconnected
        Error object: QMGR (Line: 3050) 0x00000002
        Program: /usr/local//mysql/bin//ndbd
        Pid: 3501
        Version: mysql-5.1.51 ndb-7.1.10
        Trace: /data/mysqlcluster//ndb_4_trace.log.5
        ***EOM***
    • 10 Error logs
      • Looking at the error logs usually gives a good hint about what needs to be done
        – In the above example, one node failed during system restart.
        – This caused
    • 11 Ndb_X_error.log
        Time: Tuesday 24 May 2011 - 08:53:40
        Status: Temporary error, restart node
        Message: Pointer too large (Internal error, programming error or missing error message, please report a bug)
        Error: 2306
        Error data: dblqh/DblqhMain.cpp
        Error object: DBLQH (Line: 15725) 0x00000002
        Program: /usr/local//mysql/bin//ndbd
        Pid: 4790
        Version: mysql-5.1.51 ndb-7.1.10
        Trace: /data/mysqlcluster//ndb_3_trace.log.25
        ***EOM***
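An error log entry like the ones above is a block of `Key: value` lines terminated by `***EOM***`, so the recommended restart action (the Status field) and the trace file path can be pulled out with a few lines of code. This is a minimal sketch, assuming one field per line as in the sample; the function name `parse_error_log` and the `SAMPLE_ENTRY` constant are illustrative.

```python
def parse_error_log(text):
    """Split error log text into entries; each entry is a dict of field name -> value."""
    entries, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "***EOM***":  # end-of-message marker closes the current entry
            if current:
                entries.append(current)
            current = {}
            continue
        # Split on the first colon only, so values such as
        # "DBLQH (Line: 15725) 0x00000002" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    return entries


# The node 3 entry from this transcript.
SAMPLE_ENTRY = """\
Time: Tuesday 24 May 2011 - 08:53:40
Status: Temporary error, restart node
Message: Pointer too large (Internal error, programming error or missing error message, please report a bug)
Error: 2306
Error data: dblqh/DblqhMain.cpp
Error object: DBLQH (Line: 15725) 0x00000002
Program: /usr/local//mysql/bin//ndbd
Pid: 4790
Version: mysql-5.1.51 ndb-7.1.10
Trace: /data/mysqlcluster//ndb_3_trace.log.25
***EOM***
"""

if __name__ == "__main__":
    for entry in parse_error_log(SAMPLE_ENTRY):
        print(entry["Error"], "|", entry["Status"], "|", entry["Trace"])
```

Values are kept as strings; converting `Error` or `Pid` to integers is left to the caller.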
    • 12 Ndb_X_out.log
        RESTORE table: 2 1039 rows applied
        RESTORE table: 2 1012 rows applied
        RESTORE table: 3 2 rows applied
        RESTORE table: 3 2 rows applied
    • 13 Ndb_X_trace.log.N
        --------------- Signal ----------------
        r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 75928 gsn: 164 "CONTINUEB" prio: 1
        s.bn: 247 "DBLQH", s.proc: 3, s.sigId: 75923 length: 2 trace: 1 #sec: 0 fragInf: 0
         H'00000006 H'00000000
        --------------- Signal ----------------
        r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 75927 gsn: 262 "FSREADCONF" prio: 0
        s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 75926 length: 1 trace: 1 #sec: 0 fragInf: 0
         UserPointer: 1
        --------------- Signal ----------------
        r.bn: 253 "NDBFS", r.proc: 3, r.sigId: 75926 gsn: 164 "CONTINUEB" prio: 1
        s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 75922 length: 1 trace: 1 #sec: 0 fragInf: 0
         Scanning the memory channel again with no delay
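To get a quick overview of the last execution steps, the receiving block and signal name of each signal in the trace can be listed. This is a minimal sketch based only on the three signals shown above; the regular expression, the function name `list_signals` and the `SAMPLE_TRACE` constant are illustrative assumptions, and real trace files may contain sections this pattern does not cover.

```python
import re

# Captures the receiver line of each signal:
#   r.bn: <block no> "<block name>", r.proc: <node>, r.sigId: <id> gsn: <no> "<signal name>"
SIGNAL_RE = re.compile(
    r'r\.bn:\s*\d+\s+"(?P<block>\w+)",\s*r\.proc:\s*(?P<node>\d+),'
    r'\s*r\.sigId:\s*(?P<sig_id>\d+)\s+gsn:\s*\d+\s+"(?P<signal>\w+)"'
)


def list_signals(trace_text):
    """Return (signal id, receiving block, signal name) tuples in file order."""
    return [
        (int(m.group("sig_id")), m.group("block"), m.group("signal"))
        for m in SIGNAL_RE.finditer(trace_text)
    ]


# The three signals from the sample trace in this transcript.
SAMPLE_TRACE = '''
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 75928 gsn: 164 "CONTINUEB" prio: 1
s.bn: 247 "DBLQH", s.proc: 3, s.sigId: 75923 length: 2 trace: 1 #sec: 0 fragInf: 0
 H'00000006 H'00000000
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 75927 gsn: 262 "FSREADCONF" prio: 0
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 75926 length: 1 trace: 1 #sec: 0 fragInf: 0
 UserPointer: 1
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 3, r.sigId: 75926 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 75922 length: 1 trace: 1 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
'''

if __name__ == "__main__":
    for sig_id, block, signal in list_signals(SAMPLE_TRACE):
        print(sig_id, block, signal)
```

In the sample the signal ids decrease from one entry to the next, which suggests the most recent signal is listed first.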
    • 14 Recovery and Escalation Procedures
      • There are two escalation steps to recover failed data nodes (in this case the Cluster is still STARTED)
        – Optimized Node Recovery (NR)
        – Initial Node Recovery (INR)
      • A failed Cluster can be recovered in two ways:
        – System Restart
          • The individual nodes may have to be restarted in a combination of NR and INR.
        – Initial System Restart + Restore Backup
    • 15 Optimized Node Recovery
      • A failed node can be recovered using Optimized Node Recovery.
        – This is the fastest way to recover a failed node
        – The node will recover from the Local Checkpoint and apply the Redo log.
        – It then copies changes from the other node in the same node group.
    • 16 Optimized Node Recovery
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups, holding partitions 0 to 3 of a table with columns subid and data (rows 1 A to 8 H). Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • Multiple failed nodes can recover in parallel
      • The first step is to try to restart the failed nodes in Optimized Node Recovery mode:
        – ndbmtd
    • 17 Initial Node Recovery
      • If a node fails to complete Optimized Node Recovery, the next step in the escalation chain is to perform an Initial Node Recovery.
        – This can be because of a corrupted file system
      • During Initial Node Recovery the data node will
        – Clear out its local filesystem (rm -rf /datadir/ndbd/*)
        – Recreate the REDO LOG
        – Copy all data from the other node in the node group.
      • This recovery usually takes a lot longer than Optimized Node Recovery.
    • 18 Initial Node Recovery Is Needed
    • 19 Initial Node Recovery Is Needed
    • 20 Initial Node Recovery
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups, holding partitions 0 to 3 of a table with columns subid and data (rows 1 A to 8 H). Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • Suppose that Node 2 failed to recover
      • In this case it can be recovered with
        – ndbmtd --initial
    • 21 System Restart
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups, holding partitions 0 to 3 of a table with columns subid and data (rows 1 A to 8 H). Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • All nodes have failed.
      • Every data node can be restarted with
        – ndbmtd
    • 22 System Restart
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups, holding partitions 0 to 3 of a table with columns subid and data (rows 1 A to 8 H). Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • If one node fails during the System Restart, the system restart is aborted
        – All nodes crash again
      • Error logs must be inspected
    • 23 System Restart
      • Some data nodes will write out in the error log
        – "Another data node failed during system restart"
        – Start these nodes with
          • ndbmtd
        – Then let the cluster perform a partial start (start with the nodes that are OK, at least one from each node group). You may have to try multiple combinations.
      • The goal is to find the "another node"
        – "Filesystem inconsistency", "DBDIH pointer too large"
        – Start this node with
          • ndbmtd --initial
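The "try multiple combinations" step can be made systematic by enumerating the smallest candidate sets for a partial start, that is, exactly one node from each node group. This is a minimal sketch: the function name `partial_start_candidates`, the `excluded` parameter, and the layout in `EXAMPLE_GROUPS` (node group ids 0 and 1 with two nodes each) are hypothetical and only loosely follow the diagrams in this deck.

```python
from itertools import product


def partial_start_candidates(node_groups, excluded=()):
    """Return every way of picking one startable node from each node group.

    node_groups: dict mapping node group id -> list of data node ids
    excluded:    node ids that should not be tried (for example, a node
                 already known to need an initial start)
    """
    excluded = set(excluded)
    usable = [
        [node for node in nodes if node not in excluded]
        for nodes in node_groups.values()
    ]
    # product() yields nothing if any group has no usable node,
    # which corresponds to "a partial start is not possible".
    return list(product(*usable))


# Hypothetical layout: two node groups with two data nodes each.
EXAMPLE_GROUPS = {0: [1, 2], 1: [3, 4]}

if __name__ == "__main__":
    for combo in partial_start_candidates(EXAMPLE_GROUPS, excluded={2}):
        print("try starting nodes:", combo)
```

An empty result means no combination with one node per node group is left to try.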
    • 24 System Restart
      • If all nodes in one node group have written out something like:
        – "Filesystem inconsistency", "DBDIH pointer too large"
        – System restart is not possible
      • OR
        – It is not possible to perform a partial start (i.e., one node from each node group), and all possible combinations are exhausted
      • Then an Initial System Restart is needed!
        – Which basically means you have to restore from backup.
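The decision described on the System Restart slides can be written down as a small function: nodes that only report "another node failed" are started normally, nodes with a local problem are started with --initial, and if every node in some node group has a local problem the only option left is an Initial System Restart. This is a minimal sketch of that rule as stated in the slides, not an official procedure; the function name `plan_system_restart`, the `damaged_nodes` input and the example layout are hypothetical.

```python
def plan_system_restart(node_groups, damaged_nodes):
    """Return a start command per node, or None if an Initial System Restart is needed.

    node_groups:   dict mapping node group id -> list of data node ids
    damaged_nodes: node ids whose own error log reports a local problem
                   (e.g. "Filesystem inconsistency" or "pointer too large")
    """
    damaged = set(damaged_nodes)
    for nodes in node_groups.values():
        # If no node in a group is intact, a system restart is not possible.
        if all(node in damaged for node in nodes):
            return None
    return {
        node: ("ndbmtd --initial" if node in damaged else "ndbmtd")
        for nodes in node_groups.values()
        for node in nodes
    }


# Hypothetical layout: two node groups with two data nodes each.
EXAMPLE_LAYOUT = {0: [1, 2], 1: [3, 4]}

if __name__ == "__main__":
    plan = plan_system_restart(EXAMPLE_LAYOUT, damaged_nodes={3})
    if plan is None:
        print("Initial System Restart needed: start all nodes with --initial, then restore a backup")
    else:
        for node, command in sorted(plan.items()):
            print(f"node {node}: {command}")
```

The function never returns a plan in which every node of a node group is started with --initial, which matches the rule stated in the summary slide.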
    • 25 System Restart
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups, holding partitions 0 to 3 of a table with columns subid and data (rows 1 A to 8 H). Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • This situation requires an Initial System Restart because all nodes in one node group have failed in such a way that they are impossible to restart.
        – Luckily, this is not very common.
    • 26 Initial System Restart
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups, holding partitions 0 to 3 of a table with columns subid and data (rows 1 A to 8 H). Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • Restart all nodes with
        – ndbmtd --initial
      • Restore a backup
    • 27 Summary
      • The whole exercise is to try different combinations, but never start all nodes in one node group with --initial.
    • 28 Coming next in Installment 11: Connectivity Overview
    • 29 Disclaimer © Copyright 2013 Severalnines AB. All rights reserved. Severalnines & the Severalnines logo(s) are trademarks of Severalnines AB. MySQL is a registered trademark of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.