Hello and welcome to SQL Saturday in Washington (or Chevy Chase). I’m Joey D’Antoni, and today we will be talking about High Availability and Disaster Recovery in SQL Server. Some of this is specific to 2012, and some of it will apply to older versions of the software.
A little bit about myself. I’m @jdanton on Twitter. How many of you are on Twitter? It’s a really great resource for the SQL community: we have a lot of interaction and discussion there, and there is a great hashtag called SQLHelp where you can get questions answered by experts. My blog is at joedantoni.wordpress.com. I have posts on a lot of the topics we will talk about today, and instructions on setting up an AlwaysOn environment there. You can also reach me by email at firstname.lastname@example.org. I have a blog post with my slides and additional resources from today’s presentation up at this bit.ly URL. Lastly, stop me at any time if you have questions for me; I’ll do my best to answer, or direct you to an answer.
To start, we’re going to talk a good bit about disaster recovery. How many of you know if your organization has a disaster recovery plan? It should. Even if it’s as simple as saying we move back to a paper-based system if our computers break, that’s a plan that can be followed when everyone is freaking out. Or your company may be an e-commerce site that immediately starts losing money the second something goes down. Then you need a different strategy.
Next we’re going to talk about high availability: what it means, and several different ways to implement it within your infrastructure. How many of you have worked with clustering? After the high-level material, we will discuss all of the different options for data protection in SQL Server, along with their pros and cons, costs, and complexity. And don’t worry, I will cover options in both Standard and Enterprise Edition.
Lastly, I will detail what you need to know about one of SQL 2012’s cornerstone features, AlwaysOn Availability Groups, and we will do a live demo and build a new AG.
So, when talking about disaster recovery, we have to talk about disasters. This first disaster happened just recently in Springfield, MA. A gas worker was responding to a gas leak and accidentally damaged a pipe. He followed his procedure, though, and quickly evacuated all of the buildings in the area. These actions were all according to plan, and as a result there were no fatalities. Unfortunately, a gentleman’s club was destroyed in the process, and the resulting cloud of glitter could be seen for several days.
The next picture is Hurricane Sandy. Growing up in New Orleans and starting my professional career in North Carolina, I’ve been through a lot of hurricanes and written disaster recovery plans to cover these situations. When I moved to the northeast, I thought it became less of a consideration. However, twice in the last two years, we’ve had major storms hit the eastern seaboard. Some firms had really good DR plans and continued operating as normal. Others, however, had the fuel tanks for their generators in the basement and had to organize bucket brigades to run fuel to the generators.
The third picture is one I use in my SAN presentations to describe RAID 0: a car hitting a tree. It is here more to show the human aspect of DR, and to remind ourselves that a very important part of the process is to have human backups and well-written documentation.
The last picture is of another classic disaster scenario, a building fire. In this case, employees of this firm, Initech, had been stealing money via a computer system, but the firm wasn’t able to track them down, because it didn’t have a DR plan.
So before you get started on a disaster recovery plan, there are a couple of things you need to know. Depending on the size of your company and the nature of your business, this can get pretty complicated. Even in a medium-size business you will probably want to split systems based on criticality. How do we determine the criticality? RTO and RPO.
Recovery Time Objective is how long your systems can be down before your company starts losing money. For a customer-facing e-commerce site, this is basically instantly, so you are going to want to dedicate a lot of DR resources to that system. However, a back-office HR reporting system could be down for several days before there is any impact, so maybe that doesn’t get clustering or mirroring.
Recovery Point Objective aligns with this pretty closely: it’s how much data you can lose before the business is impacted. Similarly, you wouldn’t want to start losing orders and invoices, so those systems need a high level of protection.
In most of my experience doing this work, I’ve grouped systems into tiers, usually 3 or 4, based on application needs. This is a really good first DR exercise to do, even if you aren’t planning on implementing any HA or DR in your environment. It just gives you an idea of which systems are most critical to recover.
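The tiering exercise described above can be roughed out in a few lines of Python. This is only a sketch: the system names, the hour values, and the tier thresholds are all made up for illustration, and a real exercise would take these numbers from the business owners.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    rto_hours: float  # how long the system can be down before the business is hurt
    rpo_hours: float  # how much data, measured in time, the business can afford to lose


def assign_tier(system: System) -> int:
    """Return a tier from 1 (most critical) to 3 (least critical).

    The thresholds (1 hour, 24 hours) are hypothetical; pick your own.
    """
    tightest = min(system.rto_hours, system.rpo_hours)
    if tightest <= 1:
        return 1
    if tightest <= 24:
        return 2
    return 3


# Hypothetical inventory, loosely based on the examples in the talk.
systems = [
    System("e-commerce storefront", rto_hours=0, rpo_hours=0),
    System("warehouse scheduling", rto_hours=8, rpo_hours=4),
    System("HR reporting", rto_hours=72, rpo_hours=48),
]

for s in sorted(systems, key=assign_tier):
    print(f"Tier {assign_tier(s)}: {s.name}")
```

Using the tighter of the two objectives to pick the tier is a design choice, not a rule. Some shops tier on RTO and RPO separately, because a system can tolerate downtime but not data loss.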
One more thing about myself: I really like auto racing, Formula 1 specifically. Talk about a highly available environment. Anyway, I’ve always seen this quote in terms of racing: How fast do you want to go? How much do you want to spend? The first car here is the Red Bull RB8. It won both the drivers’ and constructors’ championships in Formula 1 this year. It is custom developed for each race, can corner with a force of 4x gravity, and the team has a budget of about $400M/yr, just to build two of these cars and race them.
The second car is the Tata. It’s an Indian car that costs less than $5000. Its top speed is under 70 mph.
I use these illustrations to demonstrate something: both of these vehicles can get you from point A to point B, just in a different fashion. Some businesses will need extremely available systems with multi-site clusters and tertiary systems, while other companies will feel comfortable with shipping their backups offsite.
A DR plan is really nothing more than an insurance policy. You may never need it, but when you do you will be really thankful.
Most of my experience is in the health care industry, and those firms tend to have pretty low tolerance for data loss. Financial services firms also have a low tolerance for data loss and downtime. It tends to cost money in a hurry. Another consideration is the actual location of your business and what sort of natural disasters can impact you. It’s no accident that Google, Facebook and Apple have all built data centers in western North Carolina and Oregon. Those places tend to be out of the way of most disasters. I used to think the mid-Atlantic was pretty safe, but…
I can guarantee you 5 9s of availability. But it’s going to cost a lot of money. Redundant SANs, Enterprise SQL Server licenses, secondary data centers: these things all cost money. A lot of money.
However, if you don’t work for Goldman Sachs, fear not. There are some decent options for data protection even with Standard Edition. It may not be as automatic, and you may lose a bit of data. But you can still protect yourself, even if it’s only from hardware failure.
Since I’m mentioning the word cloud, everybody drink. OK, that’s better. Seriously though, if you are implementing a solution on Amazon or Azure, think about DR. Amazon in Reston has had several outages, and customers who were running only in that data center had outages. The ones who spanned Amazon DCs stayed up.
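To put the "5 9s" figure mentioned above in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes a 365-day year and counts all downtime, planned or not; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted at the given number of nines.

    For example, nines=3 means 99.9% availability.
    """
    unavailable_fraction = 10 ** -nines
    return MINUTES_PER_YEAR * unavailable_fraction


for nines in range(2, 6):
    availability = 100 * (1 - 10 ** -nines)
    minutes = allowed_downtime_minutes(nines)
    print(f"{availability:.3f}% available: about {minutes:,.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes of downtime a year, which is why it takes redundant everything to get there.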
So what are the DR options within SQL Server? Starting at the top we have AlwaysOn Availability Groups, which are only available in the Enterprise Edition of 2012. Database mirroring is a very similar option, and is available in Standard Edition (synchronous only). It started in SP1 of SQL Server 2005. Log shipping has been around forever, and is available in Standard Edition as well, as is replication. Multi-site clustering is not a cheap option. We’ll talk a little bit more about it later, but there is a great deal of expense involved in setting up a multi-site failover cluster.
Lastly, both Hyper-V and VMWare offer the ability to migrate a guest operating system from one location to another. These also tend to be pretty expensive, and they aren’t really in the hands of the DBA, so we won’t talk too much more about them here.
A little bit about high availability. There are two main things I think of High Availability solutions protecting against: hardware failure, in this case the complete meltdown of a server, and operating system failure, i.e. the blue screen of death. Or some combination of the two. I’ve had memory fail and lead to blue screens of death. Fortunately it was in a cluster, so my downtime was minimal.
This is how Wikipedia defines High Availability. In my mind high availability is generally local and doesn’t necessarily provide DR. One of my favorite horror stories from an old job relates to this a bit. We had pretty highly available systems, clusters and VMWare clusters, that were all running on a single storage array. I was on call, and got paged about a database being down. I logged into the server and only saw the C: drive; the other drives were SAN attached. I called my boss and asked if she knew what was going on. She said, “Oh sorry, I was going to call. HP came in to decommission a SAN and they took the wrong one.” So about a week later we finally got everything back. Total mess.
Even with HA, there tends to be a single point of failure at the storage layer. It’s pretty common in most cluster solutions, which is why they provide HA and not DR. Don’t confuse the two!
So what are the major HA technologies in SQL? The most common one is Failover Cluster Instances. One nice thing to know about Failover Clusters is that traditionally they have been dependent on the Enterprise version of Windows. Starting with Windows Server 2012, Standard Edition has failover clustering built in. Also starting with SQL 2012, SMB file shares are supported as cluster disk, so you might not even need a SAN.
The other options I have listed here are VMWare vMotion and Hyper-V Live Migration. Both of these solutions are completely transparent to SQL Server (you don’t have to do anything), but neither offers protection against OS failures. They work really well for hardware failures, though.
Like I mentioned on the previous slide, both of these options do have a single point of failure with storage.
Just to summarize, these are the native options that we are going to explore in detail here.
Disaster Recovery Terms
• Recovery Time Objective: How long can your systems be down before your business is impacted?
• Recovery Point Objective: How much data can your business lose before being impacted?
• These will vary highly by your industry and your business model, but they apply to every application
Risk Management
“How fast do you want to go? How much do you want to spend?” (attribution unknown)
Risk Management
• In a nutshell, preparing a DR policy is just like buying insurance
• Based on your firm’s tolerance for risk, business model, and geography
• Extremely high levels of availability and protection are available, at a very expensive cost
• Very reasonable levels of protection and availability can be had at a low cost
• If you use a cloud provider, you still need to think about this!
DR Solutions in SQL Server
• AlwaysOn Availability Groups
• Database Mirroring
• Log Shipping
• Multi-site Replication
• Multi-site Clustering
• Virtualization Multi-site failover
High Availability
“High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.” (Wikipedia)
• System design allows for minimal downtime in the event of hardware and operating system failures
High Availability in SQL Server
• SQL Server Failover Cluster Instances
• VMWare vMotion/Hyper-V Live Migration
• Both of these technologies have a single point of failure in shared storage
Review: SQL Server HA and DR Options
• AlwaysOn Availability Groups
• Database Mirroring
• Failover Cluster Instances
• Log Shipping
• Replication
Log Shipping
• Transaction Log Backups take place on primary
• External Process ships logs to secondary server(s)
• Data can be read on secondary (except during t-log apply)
[Diagram: Primary Server DB (P) produces a Log Backup that is shipped to Secondary Server DB (S), and to an optional second Secondary Server DB (S)]
Log Shipping Pros/Cons
Pros:
• Standard Edition
• Supports Multiple Targets
• Can Read Secondary Copies
Cons:
• Dependent on Backup on Primary
• Manual Failover Process
• Reasonably High Complexity
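Because log shipping depends on the log backups taken on the primary, the backup schedule more or less sets your RPO. The sketch below is a simplified model rather than anything SQL Server reports: it assumes the worst case is one full backup interval of work plus whatever time the copy job takes to get the file to the secondary. The function names and minute values are hypothetical.

```python
def worst_case_data_loss_minutes(backup_interval_min: float,
                                 copy_delay_min: float = 0.0) -> float:
    """Rough upper bound on lost work if the primary is destroyed.

    Assumed model: everything since the last log backup that reached the
    secondary is gone, so the bound is the interval plus the copy delay.
    """
    return backup_interval_min + copy_delay_min


def meets_rpo(backup_interval_min: float, copy_delay_min: float,
              rpo_min: float) -> bool:
    """True if the assumed worst case fits inside the RPO."""
    return worst_case_data_loss_minutes(backup_interval_min, copy_delay_min) <= rpo_min


# Hypothetical schedules checked against a 30 minute RPO.
print(meets_rpo(backup_interval_min=15, copy_delay_min=5, rpo_min=30))
print(meets_rpo(backup_interval_min=60, copy_delay_min=5, rpo_min=30))
```

If the second kind of schedule is all you can manage, that system either gets a looser RPO or a different technology.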
Replication
• This is a really high level view of replication
• There are numerous topologies and options involved in replication
• This is the nuts and bolts of it
Image credit: MS Books Online
Replication Pros/Cons
Pros:
• Can Replicate to Multiple Servers
• Replicate subset of data
• Standard Edition (transactional)
Cons:
• Manual Failover
• Unknown RPO
• Can be fragile
• Re-sync process can be ugly
• Also requires connection change for failover
Failover Cluster Instances
[Diagram: Cluster Virtual Name and InstanceName in front of a SQL Instance, hosted on Node A and Node B within a Windows Failover Cluster]
Failover Cluster Instances Requirements
• Shared Storage (SAN or SMB Share*)
• Windows Cluster (Windows Server 2012 Standard Edition)
• SQL Server Standard Edition (Two Node Limit)
• Cluster Network
• Quorum Disk
SQL Server Failover Cluster Instances Pros/Cons
Pros:
• Connections are transparent
• Failover is automatic
• Allows for whole instance protection
• Multiple servers can be involved
Cons:
• Setup is complex
• Hardware can sit idle in some configs
• Single storage doesn’t allow for data protection*
*More on this later
Database Mirroring
• Database transactions are compressed and shipped to secondary server (2008+)
• The optional witness server facilitates automatic failover
• Transfer may be sync or async*
[Diagram: Primary Server with Mirror DB, Secondary Server with Mirror DB, and a Witness Server]
*Enterprise Edition Only
Database Mirroring Pros/Cons
Pros:
• Automatic Failover (w/witness)
• Configuration is fast and easy
• Failover happens quickly
• Corrupted pages get fixed on secondary
Cons:
• Is per database; multiple DB failovers need scripting
• Async only available in EE
• Marked as deprecated in SQL 2012
• Secondaries are inaccessible (except for snapshots)
AlwaysOn Availability Groups
[Diagram: Instance 1 with AG (P) on Node A in Washington and Instance 2 with AG (S) on Node B in Chicago, behind a Listener Name (AD VCO), all within a Windows Cluster]
AlwaysOn Availability Groups
• Requires SQL Server Enterprise Edition
• Windows Cluster
• All servers in same Windows Domain
• Databases Failover as a group
• No Shared Storage Needed
• Async and Sync Modes
• Automatic and Manual Failover
• Supports up to 4 replica copies
• Replicas can be read
• Backups on secondary copies
AlwaysOn Availability Groups Pros/Cons
Pros:
• Readable secondaries allow for load distribution
• No shared storage can reduce hardware costs
• Multiple databases failing together is great for complex apps
• Connection string handled gracefully by listener
• Administration all through SSMS
• Config is easy
Cons:
• Large topologies lead to $$$ license costs
• Enterprise Edition only
• New feature, so some growing pains
• Changes in application code needed
SQL 2012 What’s New (Clustering)
• Can cluster using SMB shares; becomes more viable option with SMB 3.0 in Windows Server 2012
• Failover Process is changed: isAlive and LooksAlive go away. Replaced with sp_server_diagnostics
• Multi-subnet clustering is now available; this is designed for stretch clustering using SAN replication
SQL Server 2012 DR Features
• Availability Groups
• Mirroring is marked as deprecated
• Not sure of the long-term impact of this for the new Standard Edition and DR
• No real changes to replication or log shipping
Windows Server 2012 Cluster Aware Updating
• Great concept: allows for clusters to be automatically rebooted
• Works perfectly with SQL Server Failover Cluster Instances
• Doesn’t work with AlwaysOn Availability Groups, at the moment
Summary
• Understand your business need before designing a HA and DR strategy
• DR is just like buying insurance: you don’t need it until you do
• Lots of good options for HA and DR in SQL Server for many price points
• Always have a plan!