Severalnines Training: MySQL Cluster - Part X

Part X of our self-training slides on MySQL Cluster, focused on troubleshooting MySQL Cluster
Topics:
- common problems encountered by users
- error logs and trace files
- recovery and escalation procedures

Transcript of "Severalnines Training: MySQL Cluster - Part X"

  1. MySQL Cluster Self-Training, 10th Installment. Part 10: Troubleshooting MySQL Cluster. Copyright 2013 Severalnines AB. Control your database infrastructure.
  2. Topics
     • Common problems
     • Error logs and trace files
     • Recovery and escalation procedures
  3. Common Problems
     • The most common problems are
       – Configuration changes
       – Out of disk space
       – Out of RAM
       – Network issues (switch failures, network reorganization, upgrade of RST)
       – Swapping
         • echo "0" > /proc/sys/vm/swappiness
  4. Localizing the problem
     • Look in the cluster log on the management node
     • Find out which node(s) crashed, and in what order
     • Go to those node(s)
       – View the error log file for each node.
       – Look at the recommended restart action
         • Initial node recovery
         • Node recovery
       – It could also be a permanent error
         • Filesystem is full
         • Directory does not exist
  5. Error logs
     • Each data node stores its error log in its DATADIR
       – ndb_X_error.log
       – X is the node id of the node
     • The ndb_X_out.log contains debug messages but is usually not worth looking at.
     • The ndb_X_trace.log.n contains the last execution steps before the data node stopped/crashed.
  6. ndb_X_cluster.log
     2011-05-24 08:12:44 [MgmtSrvr] INFO -- Node 3: Start with all nodes 3 and 4
     2011-05-24 08:12:44 [MgmtSrvr] INFO -- Node 3: CM_REGCONF president = 3, own Node = 3, our dynamic id = 0/1
     2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 4: CM_REGCONF president = 3, own Node = 4, our dynamic id = 0/2
     2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 3: Node 4: API mysql-5.1.51 ndb-7.1.10
     2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 4: Node 3: API mysql-5.1.51 ndb-7.1.10
     2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 4: Start phase 1 completed
     2011-05-24 08:12:45 [MgmtSrvr] INFO -- Node 3: Start phase 1 completed
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Start phase 2 completed (system restart)
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Start phase 2 completed (system restart)
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Start phase 3 completed (system restart)
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Start phase 3 completed (system restart)
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Restarting cluster to GCI: 231577
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Starting to restore schema
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Restore of schema complete
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Starting to restore schema
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 4: Restore of schema complete
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: DICT: activate index 8 done (sys/def/7/PRIMARY)
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Node: 3 StartLog: [GCI Keep: 227381 LastCompleted: 231577 NewestRestorable: 231577]
     2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Node: 4 StartLog: [GCI Keep: 227381 LastCompleted: 231577 NewestRestorable: 231577]
     2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2306: 'Pointer too large(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
     2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
     2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
     2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 1: Node 4 Disconnected
     2011-05-24 08:12:49 [MgmtSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 1, 4, 5 and 9.
     2011-05-24 08:12:49 [MgmtSrvr] INFO -- Mgmt server state: nodeid 4 freed, m_reserved_nodes 1, 5 and 9.
     2011-05-24 08:12:57 [MgmtSrvr] INFO -- Mgmt server state: nodeid 4 reserved for ip 192.168.100.112, m_reserved_nodes 1, 4, 5 and 9.
  7. Closer Inspection
     • There was a system restart ongoing
     • Node 3 crashed
       – Forced node shutdown completed. Occured during startphase 4. Caused by error 2306: 'Pointer too large(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'
     • Node 4
       – Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'
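The inspection above can also be done mechanically. The following is a minimal sketch, assuming each cluster log entry sits on a single line as in the excerpt; `find_shutdowns` and `SAMPLE_CLUSTER_LOG` are hypothetical names for illustration, not part of MySQL Cluster. It lists forced shutdowns in the order they were logged, which answers "what nodes crashed and in what order".

```python
import re

# Pattern for the ALERT text shown in the cluster log excerpt
# ("Occured" is spelled exactly as it appears in the log).
SHUTDOWN_RE = re.compile(
    r"Node (\d+): Forced node shutdown completed\. "
    r"Occured during startphase (\d+)\. "
    r"Caused by error (\d+): '(.*)'"
)


def find_shutdowns(log_text):
    """Return (node_id, start_phase, error_code, message) tuples in log order."""
    shutdowns = []
    for line in log_text.splitlines():
        match = SHUTDOWN_RE.search(line)
        if match:
            node, phase, code, message = match.groups()
            shutdowns.append((int(node), int(phase), int(code), message))
    return shutdowns


# Four entries copied from the excerpt above.
SAMPLE_CLUSTER_LOG = """\
2011-05-24 08:12:46 [MgmtSrvr] INFO -- Node 3: Restore of schema complete
2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2306: 'Pointer too large(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2011-05-24 08:12:48 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
"""

for node, phase, code, _message in find_shutdowns(SAMPLE_CLUSTER_LOG):
    print(f"node {node} stopped in start phase {phase} with error {code}")
```

Node 3 is reported first, so its error log is the one to read first; node 4 only followed.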
  8. Closer Inspection
     • Next, the error logs of the data nodes need to be inspected.
       – ndb_3_error.log
       – ndb_4_error.log
  9. ndb_X_error.log (node 4)
     Time: Tuesday 24 May 2011 - 02:36:36
     Status: Temporary error, restart node
     Message: Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)
     Error: 2308
     Error data: Node 3 disconnected
     Error object: QMGR (Line: 3050) 0x00000002
     Program: /usr/local//mysql/bin//ndbd
     Pid: 3501
     Version: mysql-5.1.51 ndb-7.1.10
     Trace: /data/mysqlcluster//ndb_4_trace.log.5
     ***EOM***
  10. Error logs
      • Looking at the error logs usually gives a good hint about what needs to be done
        – In the above example one node failed during system restart.
        – This caused
  11. ndb_X_error.log (node 3)
      Time: Tuesday 24 May 2011 - 08:53:40
      Status: Temporary error, restart node
      Message: Pointer too large (Internal error, programming error or missing error message, please report a bug)
      Error: 2306
      Error data: dblqh/DblqhMain.cpp
      Error object: DBLQH (Line: 15725) 0x00000002
      Program: /usr/local//mysql/bin//ndbd
      Pid: 4790
      Version: mysql-5.1.51 ndb-7.1.10
      Trace: /data/mysqlcluster//ndb_3_trace.log.25
      ***EOM***
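To pull the recommended restart action (the `Status` field) out of an error log, a small parser is enough. This is a sketch under the assumption that each field occupies one `Key: value` line and that every entry ends with `***EOM***`; `parse_error_log` and `SAMPLE_ERROR_LOG` are hypothetical names, and a real file may wrap long messages differently.

```python
def parse_error_log(text):
    """Split ndb_X_error.log text into one dict of fields per entry."""
    entries, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "***EOM***":
            # End-of-message marker closes the current entry.
            if current:
                entries.append(current)
            current = {}
        elif ": " in line:
            # Split on the first ": " only, so values may contain colons.
            key, _, value = line.partition(": ")
            current[key] = value
    return entries


# The node 3 entry from the slide above.
SAMPLE_ERROR_LOG = """\
Time: Tuesday 24 May 2011 - 08:53:40
Status: Temporary error, restart node
Message: Pointer too large (Internal error, programming error or missing error message, please report a bug)
Error: 2306
Error data: dblqh/DblqhMain.cpp
Error object: DBLQH (Line: 15725) 0x00000002
Program: /usr/local//mysql/bin//ndbd
Pid: 4790
Version: mysql-5.1.51 ndb-7.1.10
Trace: /data/mysqlcluster//ndb_3_trace.log.25
***EOM***
"""

for entry in parse_error_log(SAMPLE_ERROR_LOG):
    print(entry["Error"], "|", entry["Status"], "|", entry["Trace"])
```

The `Trace` field names the trace file that belongs to the crash, which is the next file to open if the message alone is not enough.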
  12. ndb_X_out.log
      RESTORE table: 2 1039 rows applied
      RESTORE table: 2 1012 rows applied
      RESTORE table: 3 2 rows applied
      RESTORE table: 3 2 rows applied
  13. ndb_X_trace.log.N
      --------------- Signal ----------------
      r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 75928 gsn: 164 "CONTINUEB" prio: 1
      s.bn: 247 "DBLQH", s.proc: 3, s.sigId: 75923 length: 2 trace: 1 #sec: 0 fragInf: 0
       H'00000006 H'00000000
      --------------- Signal ----------------
      r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 75927 gsn: 262 "FSREADCONF" prio: 0
      s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 75926 length: 1 trace: 1 #sec: 0 fragInf: 0
       UserPointer: 1
      --------------- Signal ----------------
      r.bn: 253 "NDBFS", r.proc: 3, r.sigId: 75926 gsn: 164 "CONTINUEB" prio: 1
      s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 75922 length: 1 trace: 1 #sec: 0 fragInf: 0
       Scanning the memory channel again with no delay
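A trace file is long, so a one-line summary per signal makes it easier to see which blocks were talking just before the crash. The sketch below assumes the field layout of the excerpt (`r.bn` is the receiving block, `gsn` the signal name, `s.bn` the sending block); `summarize_signals` and `SAMPLE_TRACE` are hypothetical names. In the excerpt the receiver signal ids decrease from one record to the next, so the first record appears to be the most recent one.

```python
import re

# One match per signal record: receiver block, signal name, sender block.
# re.DOTALL lets the pattern span the line breaks inside a record.
SIGNAL_RE = re.compile(
    r'r\.bn: \d+ "(?P<receiver>\w+)".*?'
    r'gsn: \d+ "(?P<signal>\w+)".*?'
    r's\.bn: \d+ "(?P<sender>\w+)"',
    re.DOTALL,
)


def summarize_signals(trace_text):
    """Return (sender, signal, receiver) for each signal record, in file order."""
    return [
        (m["sender"], m["signal"], m["receiver"])
        for m in SIGNAL_RE.finditer(trace_text)
    ]


# The three records from the slide above.
SAMPLE_TRACE = """\
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 75928 gsn: 164 "CONTINUEB" prio: 1
s.bn: 247 "DBLQH", s.proc: 3, s.sigId: 75923 length: 2 trace: 1 #sec: 0 fragInf: 0
 H'00000006 H'00000000
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 75927 gsn: 262 "FSREADCONF" prio: 0
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 75926 length: 1 trace: 1 #sec: 0 fragInf: 0
 UserPointer: 1
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 3, r.sigId: 75926 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 75922 length: 1 trace: 1 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
"""

for sender, signal, receiver in summarize_signals(SAMPLE_TRACE):
    print(f"{sender} -> {receiver}: {signal}")
```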
  14. Recovery and Escalation Procedures
      • There are two escalation steps to recover failed data nodes
        – In this case the Cluster is still STARTED
        – Optimized Node Recovery (NR)
        – Initial Node Recovery (INR)
      • A failed Cluster can be recovered in two ways:
        – System Restart
          • The individual nodes may have to be restarted in a combination of NR and INR.
        – Initial System Restart + Restore Backup
  15. Optimized Node Recovery
      • A failed node can be recovered using Optimized Node Recovery.
        – This is the fastest way to recover a failed node
        – The node will recover from its Local Checkpoint and apply the Redo log.
        – It then copies changes from the other node in the same node group.
  16. Optimized Node Recovery
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups; a table with columns subid (1 to 8) and data (A to H) split into partitions 0 to 3. Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • Multiple failed nodes can recover in parallel
      • The first step is to try to restart the failed nodes in Optimized Node Recovery mode:
        – ndbmtd
  17. Initial Node Recovery
      • If a node fails to complete Optimized Node Recovery, the next step in the escalation chain is to perform an Initial Node Recovery.
        – This can be because of a corrupted file system
      • During Initial Node Recovery the data node will
        – Clear out its local filesystem (rm -rf /datadir/ndbd/*)
        – Recreate the REDO LOG
        – Copy all data from the other node in the node group.
      • Usually this recovery takes a lot longer to perform than Optimized Node Recovery.
  18. Initial Node Recovery Is Needed
  19. Initial Node Recovery Is Needed
  20. Initial Node Recovery
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups; a table with columns subid (1 to 8) and data (A to H) split into partitions 0 to 3. Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • Pretend that Node 2 failed to recover
      • In this case it can be recovered with
        – ndbmtd --initial
  21. System Restart
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups; a table with columns subid (1 to 8) and data (A to H) split into partitions 0 to 3. Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • All nodes have failed.
      • Every data node can be restarted with
        – ndbmtd
  22. System Restart
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups; a table with columns subid (1 to 8) and data (A to H) split into partitions 0 to 3. Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • If one node fails during the System Restart, the system restart is aborted
        – All nodes crash again
      • Error logs must be inspected
  23. System Restart
      • Some data nodes will write out in the error log
        – "Another data node failed during system restart"
        – Start these nodes with
          • ndbmtd
        – Then let the cluster perform a partial start (start with the nodes that are OK, at least one from each node group) -> you may have to try multiple combinations.
      • The goal is to find the "another node"
        – "Filesystem inconsistency", "DBDIH pointer too large"
        – Start this node with
          • ndbmtd --initial
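"Try multiple combinations" can be made systematic by listing every set of nodes that contains at least one node from each node group. The sketch below is an illustration only: `partial_start_candidates` is a hypothetical helper, and the layout (nodes 1 and 2 in one group, nodes 3 and 4 in another, with group ids 0 and 1) is an assumption modelled on the diagrams in these slides.

```python
from itertools import combinations, product


def nonempty_subsets(nodes):
    """Yield every non-empty subset of one node group, largest first."""
    for size in range(len(nodes), 0, -1):
        yield from combinations(nodes, size)


def partial_start_candidates(node_groups):
    """Return node sets holding at least one node from each node group.

    Candidates are ordered largest first, so the attempts that start the
    most nodes are tried before the smaller ones.
    """
    per_group = [list(nonempty_subsets(sorted(nodes))) for nodes in node_groups.values()]
    candidates = [
        tuple(node for subset in choice for node in subset)
        for choice in product(*per_group)
    ]
    return sorted(candidates, key=lambda c: (-len(c), c))


# Hypothetical layout: group ids and node ids are assumptions.
EXAMPLE_GROUPS = {0: [1, 2], 1: [3, 4]}

for candidate in partial_start_candidates(EXAMPLE_GROUPS):
    print("try starting nodes:", candidate)
```

With two groups of two nodes this gives nine candidates, from all four nodes down to the four one-node-per-group pairs; a set such as nodes 1 and 2 alone is never proposed because it leaves the other node group empty.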
  24. System Restart
      • If all nodes in one node group have written out something like:
        – "Filesystem inconsistency", "DBDIH pointer too large"
        – System restart is not possible
      • OR
        – It is not possible to perform a partial start (i.e., one node from each node group), and all possible combinations are exhausted, then ...
      • Initial System Restart is needed!
        – Which basically means you have to restore from backup.
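The escalation rule from the last two slides can be written down as a small decision function. This is a sketch of the reasoning, not a tool shipped with MySQL Cluster: `plan_restart`, the hint list, and the node group layout are all assumptions made for illustration, and the messages would come from the `Message` field of each node's error log.

```python
# Lowercased substrings that, following the slides, mark a node that cannot
# restart from its own files. The exact list is an illustrative assumption.
NEEDS_INITIAL_HINTS = ("filesystem inconsistency", "pointer too large")


def plan_restart(node_groups, error_messages):
    """Return ("system restart", {node: command}) or ("initial system restart", {}).

    node_groups:    {group id: [node ids]} (hypothetical layout)
    error_messages: {node id: message text from ndb_X_error.log}
    """
    needs_initial = {
        node
        for node, message in error_messages.items()
        if any(hint in message.lower() for hint in NEEDS_INITIAL_HINTS)
    }
    # If every node of one group needs --initial, nothing is left to copy
    # that group's data from, so only a restore from backup remains.
    for nodes in node_groups.values():
        if set(nodes) <= needs_initial:
            return ("initial system restart", {})
    commands = {
        node: "ndbmtd --initial" if node in needs_initial else "ndbmtd"
        for nodes in node_groups.values()
        for node in nodes
    }
    return ("system restart", commands)


# Hypothetical case: node 3 hit "Pointer too large", node 4 only followed.
EXAMPLE_LAYOUT = {0: [1, 2], 1: [3, 4]}
EXAMPLE_MESSAGES = {
    3: "Pointer too large (Internal error, programming error or missing error message, please report a bug)",
    4: "Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)",
}

action, commands = plan_restart(EXAMPLE_LAYOUT, EXAMPLE_MESSAGES)
print(action, commands)
```

Only node 3 is started with `--initial` here, because node 4 in the same group can still supply the data. If node 4 had reported the same kind of error, the function would fall through to an initial system restart.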
  25. System Restart
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups; a table with columns subid (1 to 8) and data (A to H) split into partitions 0 to 3. Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • This situation requires an Initial System Restart because all nodes in one node group have failed in such a way that they are impossible to restart.
        – Luckily this is not very common at all.
  26. Initial System Restart
      [Diagram: storage layer with data nodes 1 to 4 arranged in node groups; a table with columns subid (1 to 8) and data (A to H) split into partitions 0 to 3. Px == PRIMARY partition x, Sx == SECONDARY partition x.]
      • Restart all nodes with
        – ndbmtd --initial
      • Restore a backup
  27. Summary
      • The whole exercise is to try different combinations, but never start all nodes in one node group with --initial.
  28. Coming next in Installment 11: Connectivity Overview
  29. Disclaimer
      © Copyright 2013 Severalnines AB. All rights reserved. Severalnines & the Severalnines logo(s) are trademarks of Severalnines AB. MySQL is a registered trademark of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.
