The document discusses troubleshooting MySQL Cluster. The most common problems include configuration changes, running out of disk space or RAM, and network issues. When problems occur, error logs and trace files should be checked to localize the issue. If a node fails, optimized node recovery or initial node recovery may be used to restore it. If all nodes fail, a system restart or initial system restart with restore from backup may be required.
MySQL Cluster Troubleshooting Guide
Copyright 2013 Severalnines AB. Control your database infrastructure
10th Installment
MySQL Cluster Self-Training
Part 9 – Troubleshooting MySQL Cluster
Topics
• Common problems
• Error logs and Trace files
• Recovery and Escalation procedures
Common Problems
• The most common problems are
– Configuration changes
– Out of disk space
– Out of RAM
– Network issues (switch failures, network reorganization, upgrade of RST)
– Swapping
• echo “0” > /proc/sys/vm/swappiness
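The swappiness setting can be checked before changing it; a minimal sketch, assuming a Linux system (the persistent-setting lines are shown as comments because they require root):

```shell
# Read the current swappiness (note the correct path is
# /proc/sys/vm/swappiness, not /proc/vm/swapiness)
cat /proc/sys/vm/swappiness
# Lower it at runtime (requires root):
#   echo 0 > /proc/sys/vm/swappiness
# Persist it across reboots by adding this line to /etc/sysctl.conf:
#   vm.swappiness = 0
```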
Localizing the problem
• Look in the cluster log on the management node
• Determine which node(s) crashed, and in what order
• Go to those nodes
– View the error log file for each node.
– Look at the recommended restart action
• Initial node recovery
• Node Recovery
– It could also be a Permanent error
• Filesystem is full
• Directory does not exist
Error logs
• Each data node stores its error log in its DATADIR
– ndb_X_error.log
– X is the node id of the node
• The ndb_X_out.log contains debug messages and is usually not worth inspecting.
• The ndb_X_trace.log.n files contain the last execution steps before the data node stopped or crashed.
ndb_X_cluster.log
2011-05-24 08:12:44 [MgmtSrvr] INFO -- Node 3: Start with all nodes 3 and 4
2011-05-24 08:12:44 [MgmtSrvr] INFO -- Node 3: CM_REGCONF president = 3, own Node = 3, our dynamic id = 0/1
2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 4: CM_REGCONF president = 3, own Node = 4, our dynamic id = 0/2
2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 3: Node 4: API mysql-5.1.51 ndb-7.1.10
2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 4: Node 3: API mysql-5.1.51 ndb-7.1.10
2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 4: Start phase 1 completed
2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 3: Start phase 1 completed
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Start phase 2 completed (system restart)
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Start phase 2 completed (system restart)
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Start phase 3 completed (system restart)
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Start phase 3 completed (system restart)
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Restarting cluster to GCI: 231577
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Starting to restore schema
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Restore of schema complete
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Starting to restore schema
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Restore of schema complete
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: DICT: activate index 8 done (sys/def/7/PRIMARY)
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Node: 3 StartLog: [GCI Keep: 227381 LastCompleted: 231577
NewestRestorable: 231577]
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Node: 4 StartLog: [GCI Keep: 227381 LastCompleted: 231577
NewestRestorable: 231577]
2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 4.
Caused by error 2306: 'Pointer too large(Internal error, programming error or missing error message,
please report a bug). Temporary error, restart node'.
2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Occured during startphase 4.
Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other
node(s)(Restart error). Temporary error, restart node'.
2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 1: Node 4 Disconnected
2011-05-24 08:12:49 [MgmtSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 1, 4, 5 and 9.
2011-05-24 08:12:49 [MgmtSrvr] INFO -- Mgmt server state: nodeid 4 freed, m_reserved_nodes 1, 5 and 9.
2011-05-24 08:12:57 [MgmtSrvr] INFO -- Mgmt server state: nodeid 4 reserved for ip 192.168.100.112,
m_reserved_nodes 1, 4, 5 and 9.
Closer Inspection
• There was a system restart ongoing
• Node 3 crashed
– Forced node shutdown completed. Occured during startphase 4. Caused by error 2306: 'Pointer too large (Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'
• Node 4
– Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s) (Restart error). Temporary error, restart node'
Closer Inspection
• Next, the error logs of the data nodes need to be inspected:
– ndb_3_error.log
– ndb_4_error.log
ndb_X_error.log
Time: Tuesday 24 May 2011 - 02:36:36
Status: Temporary error, restart node
Message: Another node failed during system
restart, please investigate error(s) on
other node(s) (Restart error)
Error: 2308
Error data: Node 3 disconnected
Error object: QMGR (Line: 3050) 0x00000002
Program: /usr/local//mysql/bin//ndbd
Pid: 3501
Version: mysql-5.1.51 ndb-7.1.10
Trace: /data/mysqlcluster//ndb_4_trace.log.5
***EOM***
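The Status, Message, and Error fields in these reports are what decide the next step. A minimal sketch that recreates the report above as sample data and pulls those fields out (the /tmp path is only for illustration; on a real node, use the ndb_X_error.log in the DATADIR):

```shell
# Recreate the error report from the slide as sample data.
cat > /tmp/ndb_4_error.log <<'EOF'
Time: Tuesday 24 May 2011 - 02:36:36
Status: Temporary error, restart node
Message: Another node failed during system
Error: 2308
Error data: Node 3 disconnected
EOF
# Pull out the fields that decide the restart action.
grep -E '^(Status|Error):' /tmp/ndb_4_error.log
# -> Status: Temporary error, restart node
# -> Error: 2308
```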
Error logs
• Looking at the error logs usually gives a good hint of what needs to be done.
– In the above example, node 3 failed during the system restart (error 2306).
– This caused node 4 to shut down as well (error 2308: 'Another node failed during system restart').
ndb_X_error.log
Time: Tuesday 24 May 2011 - 08:53:40
Status: Temporary error, restart node
Message: Pointer too large (Internal error,
programming error or missing error message,
please report a bug)
Error: 2306
Error data: dblqh/DblqhMain.cpp
Error object: DBLQH (Line: 15725) 0x00000002
Program: /usr/local//mysql/bin//ndbd
Pid: 4790
Version: mysql-5.1.51 ndb-7.1.10
Trace: /data/mysqlcluster//ndb_3_trace.log.25
***EOM***
Recovery and Escalation Procedures
• There are two escalation steps to recover failed data nodes while the cluster as a whole is still STARTED:
– Optimized Node Recovery (NR)
– Initial Node Recovery (INR)
• A failed cluster can be recovered in two ways:
– System Restart
• The individual nodes may have to be restarted in a combination of NR and INR.
– Initial System Restart + Restore Backup
Optimized Node Recovery
• A failed node can be recovered using Optimized Node Recovery.
– This is the fastest way to recover a failed node.
– The node recovers from its Local Checkpoint and applies the REDO log.
– It then copies the remaining changes from the other node in the same node group.
Optimized Node Recovery
[Diagram: four data nodes in two node groups (nodes 1 and 2, nodes 3 and 4); each partition has a PRIMARY replica (Px) on one node and a SECONDARY replica (Sx) on the other node of the same node group]
• Multiple failed nodes can recover in parallel
• The first step is to try to restart the failed nodes in Optimized Node Recovery mode:
– ndbmtd
Initial Node Recovery
• If a node fails to complete Optimized Node Recovery, the next step in the escalation chain is to perform an Initial Node Recovery.
– This can be caused by a corrupted filesystem, for example.
• During Initial Node Recovery the data node will
– Clear out its local filesystem (rm -rf /datadir/ndbd/*)
– Recreate the REDO log
– Copy all data from the other node in the node group.
• This recovery usually takes much longer than Optimized Node Recovery.
Initial Node Recovery
[Diagram: four data nodes in two node groups with primary (Px) and secondary (Sx) partition replicas; node 2 has failed]
• Pretend that Node 2 failed to recover
• In this case it can be recovered with
– ndbmtd --initial
System Restart
[Diagram: four data nodes in two node groups with primary (Px) and secondary (Sx) partition replicas; all nodes have failed]
• All nodes have failed.
• Every data node can be restarted with
– ndbmtd
System Restart
[Diagram: four data nodes in two node groups with primary (Px) and secondary (Sx) partition replicas]
• If one node fails during the System Restart, the system restart is aborted
– All nodes crash again
• The error logs must be inspected
System Restart
• Some data nodes will write out in the error log:
– “Another data node failed during system restart”
– Start these nodes with
• ndbmtd
– Then let the cluster perform a partial start (start with the nodes that are OK, at least one from each node group) -> you may have to try multiple combinations.
• The goal is to find the “another node”:
– “Filesystem inconsistency”, “DBDIH pointer too large”
– Start this node with
• ndbmtd --initial
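One way to find the “another node” is to look for the error log whose failure was not caused by error 2308. A minimal sketch with sample logs mirroring the two reports earlier in this guide (the /tmp paths are only for illustration):

```shell
# Two sample error logs: node 4 failed because of node 3 (error 2308),
# while node 3 is the root cause (error 2306 in DBLQH).
mkdir -p /tmp/ndblogs
printf 'Error: 2306\nError data: dblqh/DblqhMain.cpp\n' > /tmp/ndblogs/ndb_3_error.log
printf 'Error: 2308\nError data: Node 3 disconnected\n'  > /tmp/ndblogs/ndb_4_error.log
# grep -L lists the files with NO match: the node that did not fail
# because of "another node" is the one to restart with --initial.
grep -L 'Error: 2308' /tmp/ndblogs/ndb_*_error.log
# -> /tmp/ndblogs/ndb_3_error.log
```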
System Restart
• If all nodes in one node group have written out something like:
– “Filesystem inconsistency”, “DBDIH pointer too large”
– then a system restart is not possible
• OR
– It is not possible to perform a partial start (i.e., one node from each node group), and all possible combinations have been exhausted, then ..
• An Initial System Restart is needed!
– Which basically means you have to restore from backup.
System Restart
[Diagram: four data nodes in two node groups with primary (Px) and secondary (Sx) partition replicas; both nodes in one node group have failed]
• This situation requires an Initial System Restart, because all nodes in one node group have failed in such a way that they are impossible to restart.
– Luckily, this is not very common at all.
Initial System Restart
[Diagram: four data nodes in two node groups with primary (Px) and secondary (Sx) partition replicas]
• Restart all nodes with
– ndbmtd --initial
• Restore a backup
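Restoring the backup is done with the ndb_restore tool. A sketch of the usual sequence, shown as comments; the connect string, backup id, node ids, and paths are assumptions for illustration:

```shell
# After all nodes have come up empty with --initial, restore the backup.
# Restore the metadata (-m) once, from one node's backup files:
#   ndb_restore -c mgm_host:1186 -n 3 -b 1 -m \
#       --backup_path=/backups/BACKUP/BACKUP-1
# Then restore the data (-r) once for every data node id in the backup:
#   ndb_restore -c mgm_host:1186 -n 3 -b 1 -r \
#       --backup_path=/backups/BACKUP/BACKUP-1
#   ndb_restore -c mgm_host:1186 -n 4 -b 1 -r \
#       --backup_path=/backups/BACKUP/BACKUP-1
```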
Summary
• The whole exercise is to try different combinations, but never start all the nodes in one node group with --initial.
Coming next in Installment 11:
Connectivity Overview