Keeping data-safe-webinar-2010-11-01

  1. Keeping your data safe
     Richard M Kreuter, 10gen Inc.
     November 1, 2010
     Keeping your data safe (webinar)
  2. Aspects of data safety
     - Replication
     - Cross-data-center replication
     - Application-controlled replication
     - Backup
     - Disaster recovery
  3. Replication
     - MongoDB supports automatic replication (data mirroring).
     - Recommended for failover, durability, and backups (essentially all deployments).
     - Works well over wide area networks.
     - Also good for horizontal read scaling: clients can conditionally read from any of a number of slaves.
  4. Replication Overview
     - MongoDB’s replication is similar to that of many databases.
     - Writes are accepted only by a Primary-mode (master, writable) mongod.
     - Writes are recorded in a normalized format in the operation log (oplog).
     - Secondary-mode (slave, read-only) mongods periodically query the oplog and apply its operations.
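The write path described on this slide can be illustrated with a small in-memory model. This is only a sketch, not MongoDB code: the `Primary` and `Secondary` classes, the `applyOp` helper, and the operation shape (`kind`, `id`, `doc`, `ts`) are all hypothetical names chosen for illustration.

```javascript
// Hypothetical in-memory model of oplog-based replication (not MongoDB internals).

// Apply a single operation to a data store (a Map from id to document).
function applyOp(data, op) {
  if (op.kind === "insert" || op.kind === "update") {
    data.set(op.id, op.doc);
  } else if (op.kind === "delete") {
    data.delete(op.id);
  }
}

class Primary {
  constructor() {
    this.data = new Map(); // id -> document
    this.oplog = [];       // ordered record of every write
  }

  // Accept a write: apply it locally, then record it in the oplog.
  write(op) {
    applyOp(this.data, op);
    this.oplog.push({ ts: this.oplog.length + 1, ...op });
  }
}

class Secondary {
  constructor() {
    this.data = new Map();
    this.lastAppliedTs = 0; // position reached in the primary's oplog
  }

  // Query the primary's oplog for entries newer than the last one applied,
  // and apply them in order. Returns the number of entries applied.
  poll(primary) {
    const pending = primary.oplog.filter((e) => e.ts > this.lastAppliedTs);
    for (const entry of pending) {
      applyOp(this.data, entry);
      this.lastAppliedTs = entry.ts;
    }
    return pending.length;
  }
}

// Usage: two writes on the primary, then one poll from the secondary.
const primary = new Primary();
const secondary = new Secondary();
primary.write({ kind: "insert", id: 1, doc: { name: "a" } });
primary.write({ kind: "update", id: 1, doc: { name: "b" } });
const applied = secondary.poll(primary);
```

Because the secondary only remembers how far it has read, a second poll with no new writes applies nothing.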
  5. Replica set replication
     [Diagram. Before failover: Master (write server) and three Slaves (read replicas). After failover: Old Master, two Slaves (read replicas), and New master.]
  6. Replica Set Failover and Invariants
     - Replicating mongods track replica set membership.
     - If secondaries can’t see the master, but can see a majority of replica set votes, an election is triggered.
     - The election selects exactly one node, from among those with the most recent writes, as primary.
     - A primary steps down to secondary when it can’t see a majority of replica set votes.
     - On set reintegration, unreplicated data on old primaries is rolled back to offline storage (e.g., for manual intervention).
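The majority rule on this slide can be sketched as a pair of functions. This is a simplified model from one node's point of view, not the actual election protocol: the member shape (`name`, `votes`, `reachable`, `lastWriteTs`) is hypothetical, and the tie-break by name is an assumption added only to make the result deterministic.

```javascript
// Hypothetical model of the majority check and election outcome.
// members: array of { name, votes, reachable, lastWriteTs }, where
// `reachable` is as seen from the node running the check (itself included).

// True when the reachable members hold a strict majority of all votes.
function canSeeMajority(members) {
  const total = members.reduce((sum, m) => sum + m.votes, 0);
  const visible = members
    .filter((m) => m.reachable)
    .reduce((sum, m) => sum + m.votes, 0);
  return visible * 2 > total;
}

// Returns the name of the elected primary, or null when no majority is visible.
function electPrimary(members) {
  if (!canSeeMajority(members)) return null;
  const candidates = members.filter((m) => m.reachable);
  // Prefer the most recent write; break ties by name (assumption).
  candidates.sort(
    (a, b) => b.lastWriteTs - a.lastWriteTs || a.name.localeCompare(b.name)
  );
  return candidates[0].name;
}

// Usage: the old primary "a" is unreachable; "b" and "c" still form a majority.
const afterFailure = [
  { name: "a", votes: 1, reachable: false, lastWriteTs: 105 },
  { name: "b", votes: 1, reachable: true, lastWriteTs: 100 },
  { name: "c", votes: 1, reachable: true, lastWriteTs: 103 },
];
const elected = electPrimary(afterFailure);

// A node isolated on its own sees one vote out of three, so no election succeeds.
const isolated = [
  { name: "a", votes: 1, reachable: true, lastWriteTs: 105 },
  { name: "b", votes: 1, reachable: false, lastWriteTs: 100 },
  { name: "c", votes: 1, reachable: false, lastWriteTs: 103 },
];
const electedWhenIsolated = electPrimary(isolated);
```

In the first scenario the writes on "a" after timestamp 103 were never replicated, which is the data the slide says gets rolled back when "a" rejoins.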
  7. getLastError()
     - Data manipulation operations are “fire and forget” by default; that is, they return immediately, and don’t wait for any server process.
     - The database command getLastError() is the interface for forcing operation synchrony:

         db.getLastError() // returns null for "no error",
                           // otherwise, a document containing
                           // an error message
  8. getLastError() and write replication
     - When running in a replicated configuration, getLastError() can also force data writes to replicating slaves:

         // write to 4 servers, timeout after 3 seconds
         db.runCommand({getlasterror: 1, w: 4, wtimeout: 3000})
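The meaning of `w` and `wtimeout` can be illustrated with a small model that involves no server. This is a sketch under stated assumptions: the function name `waitForReplication`, its input (the time in milliseconds at which each server has the write, with the primary at 0), and the shape of its return value are all hypothetical and do not reproduce the real command's reply.

```javascript
// Hypothetical model of waiting for a write to reach `w` servers.
// ackTimesMs: for each server that ever receives the write, the time in ms
// (measured from the write) at which it has the data; the primary is 0.
function waitForReplication(ackTimesMs, w, wtimeout) {
  const sorted = [...ackTimesMs].sort((a, b) => a - b);
  // The write is on `w` servers once the w-th earliest server has it.
  const reachedAt = sorted.length >= w ? sorted[w - 1] : Infinity;
  if (reachedAt <= wtimeout) {
    return { ok: true, waitedMs: reachedAt };
  }
  // Fewer than `w` servers had the write before the timeout expired.
  return { ok: false, reason: "timeout", waitedMs: wtimeout };
}

// Usage: four servers, the slowest of which has the write after 900 ms.
const fastEnough = waitForReplication([0, 40, 120, 900], 4, 3000);

// Same request, but the fourth server lags by 5 seconds.
const tooSlow = waitForReplication([0, 40, 120, 5000], 4, 3000);
```

A timeout in this model says only that the caller stopped waiting; it does not say the write was lost.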
  9. getLastError() and drivers, deployments
     - All officially supported MongoDB drivers have a SafeMode feature that implicitly invokes getLastError() after insert, update, and delete operations. This way, application programmers have control over write replication separately from data manipulation logic.
     - Replica Sets support a getLastErrorDefaults setting, whose values are used whenever a client calls getLastError() without parameters. This way, application architects and operations staff can design a system whose write replication can be configured independently of application code, if desired.
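The interaction between client-supplied parameters and getLastErrorDefaults can be sketched as follows. The `settings.getLastErrorDefaults` field mirrors the setting named on the slide; the values shown, the `effectiveOptions` helper, and the all-or-nothing fallback rule are illustrative assumptions, not server code.

```javascript
// Illustrative replica set settings; the values are assumptions.
const replicaSetSettings = {
  getLastErrorDefaults: { w: 2, wtimeout: 3000 },
};

// Hypothetical helper: decide which options a getLastError() call runs with.
// Defaults apply only when the client passes no parameters at all.
function effectiveOptions(clientOptions, settings) {
  const hasParams =
    clientOptions != null && Object.keys(clientOptions).length > 0;
  if (hasParams) {
    return { ...clientOptions };
  }
  return { ...(settings.getLastErrorDefaults || {}) };
}

// Usage: a bare call picks up the defaults; an explicit call does not.
const bareCall = effectiveOptions(undefined, replicaSetSettings);
const explicitCall = effectiveOptions({ w: 4, wtimeout: 1000 }, replicaSetSettings);
```

Keeping the defaults in the replica set configuration is what lets operations staff change replication guarantees without touching application code.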
  10. Backup strategies
      - MongoDB tools (mongoexport, mongodump)
      - More generic tools (fs snapshots, file copying commands)
      - Storage device features (SAN, EBS snapshots)
  11. MongoDB tools
      MongoDB comes with a couple of pairs of tools for backups:
      - mongodump & mongorestore: produce/consume BSON dumps of database content. Good for making compact backups. Note that indexes are reconstructed on mongorestore.
      - mongoexport & mongoimport: produce/consume JSON/CSV text files of database content. Intended more for cross-software transfers (e.g., transferring data between MongoDB and a spreadsheet program), but can be used for backup/recovery.
  12. Backing up database files
      MongoDB’s data files (under the --dbpath argument) can be backed up using any technique available for files:
      - File System/Volume Manager snapshots: some OSes’ file systems (ZFS, XFS, etc.) and some Volume Managers (e.g., LVM) support point-in-time snapshotting. These snapshots can serve as backups.
      - Plain ol’ file copying: you can just copy the database’s files around.
  13. Storage-layer backups
      Some storage devices have snapshotting features; you can use these snapshots as backups.
      - Commercial SANs often have point-in-time block-level snapshotting.
      - Amazon’s EBS supports snapshotting (but Amazon recommends unmounting the EBS volumes to quiesce the data).
  14. Locking the database for backups
      - All backup strategies can, in principle, be performed on a live (a.k.a. “hot”) database, but with varying levels of efficacy.
      - To ensure a clean backup, it’s recommended that you lock the database for the duration of your backup procedure.

          > use admin
          switched to db admin
          > db.runCommand({fsync:1,lock:1})
          // now use mongodump/snapshotting/etc., and then
          > db.$cmd.sys.unlock.findOne();

      - In general, this procedure is best performed on replicating secondaries, which don’t accept writes.
  15. Disaster Recovery
      The general solution for recovering a failed server is as follows:
      1. Repair/replace any failed hardware or operating system layers (e.g., replace disks, provision new hosts or virtual machines, etc.).
      2. If step 1 completes quickly enough and the mongod’s data directory is trustworthy (e.g., if the mongod was cleanly shut down, say, after a UPS-induced system halt), bring the mongod online and it will attempt to replay the replica set primary’s oplog.
      3. If the data directory is suspect, you can move it aside or delete it, and then
         1. either bring up the mongod with an empty data directory, in which case it will clone the primary’s databases,
         2. or else seed the mongod’s data directory with a recent snapshot or mongodump backup.
      4. The mongod will attempt to replay all the primary’s oplog records.
  16. Some aspects of disaster recovery
      - Cloning the primary can impose notable load on the primary, so it’s probably preferable to initialize a new secondary from a snapshot or a database dump.
      - If you operate in multiple data centers, it’s advisable to keep snapshots/database backups “nearby” in data center space to avoid having to transfer large amounts of data during disaster recovery events. For example, you might make periodic snapshots/backups of a secondary in each of your data centers, and use these for initializing new secondaries.
      - The primary’s oplog can “roll over” before a recovering secondary catches up. See for more details.
      - In general, avoiding a disaster is better than recovering from one. Employ monitoring tools!
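The oplog “roll over” problem can be illustrated with a fixed-size log. This is a sketch, not MongoDB's implementation: `CappedOplog`, `canCatchUp`, and the integer timestamps are hypothetical. The check assumes a conservative rule, namely that the secondary's last applied entry must still be present in the primary's oplog, because timestamps need not be consecutive and otherwise nothing proves that no entry was missed.

```javascript
// Hypothetical fixed-size oplog: once full, the oldest entry is discarded
// for every new entry appended.
class CappedOplog {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = []; // ordered oldest first
  }

  append(ts) {
    this.entries.push({ ts });
    if (this.entries.length > this.capacity) {
      this.entries.shift(); // the oldest entry rolls off
    }
  }
}

// A recovering secondary can replay the oplog only if the entry it last
// applied has not rolled off (assumed rule, see the note above).
function canCatchUp(oplog, lastAppliedTs) {
  if (oplog.entries.length === 0) return true; // nothing to replay
  return lastAppliedTs >= oplog.entries[0].ts;
}

// Usage: the primary keeps 3 entries and has written 5, so it retains 3, 4, 5.
const oplog = new CappedOplog(3);
for (let ts = 1; ts <= 5; ts++) {
  oplog.append(ts);
}
const recentBackup = canCatchUp(oplog, 3); // seeded from a recent snapshot
const staleBackup = canCatchUp(oplog, 2);  // seeded from an older snapshot
```

A secondary seeded from the older snapshot cannot replay its way forward and has to be re-initialized, which is one reason to keep recent backups close at hand.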