Infrastructure and software failures are painful and, unfortunately, sometimes unavoidable. Luckily, they don't always have to result in downtime for your application or service. High Availability (HA) to the rescue!
4. HA is Redundancy
✓ RAID: Disk crash? Another disk still works!
✓ Virtualization: Physical host crashes? VM available on another physical host!
✓ Clustering: Server crashes? Another server still works!
✓ Power: Power outage? Redundant power supply!
✓ Network: Switch or NIC crashes? 2nd network route available!
✓ Geographical: Datacenter offline? Another DC available to perform work!
11. States and sessions
o Multiple requests can be served by different backend servers
o Store session state in a database or NoSQL cache
o The load balancer can “stick” a single backend server to a user…
o … but not in all cases!
[Diagram: requests 1, 2 and 3 from the same user are distributed across app servers 1–4]
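A minimal sketch of the shared-session idea above: session state lives in a store every backend can reach, so any app server can handle any request. The store, function names, and data here are illustrative; in production the store would be Redis, memcached, or a database.

```python
import json
import uuid

# Hypothetical shared store; stands in for Redis, memcached,
# or a database reachable by every app server.
SHARED_STORE = {}

def create_session(user_data):
    """Persist session state centrally and hand back only the ID."""
    session_id = str(uuid.uuid4())
    SHARED_STORE[session_id] = json.dumps(user_data)
    return session_id

def load_session(session_id):
    """Any backend server can rebuild the session from the shared store."""
    raw = SHARED_STORE.get(session_id)
    return json.loads(raw) if raw is not None else None

# A request handled by "app 1" creates the session...
sid = create_session({"user": "alice", "cart": ["book"]})
# ...and a later request hitting "app 3" still sees the same state.
assert load_session(sid) == {"user": "alice", "cart": ["book"]}
```

Because only the session ID travels with the client (typically in a cookie), no load-balancer stickiness is required.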
13. Shared storage - NAS
o Network Attached Storage
o A NAS handles the complete filesystem
o Relies on protocols like:
NFS: Network File System
SMB/CIFS: Windows File Sharing
o Simple to implement
o Redundancy is very hard to achieve, often single point of failure
o Performance is mediocre and bottlenecks can occur
14. Shared storage - SAN
o Storage Area Network
o A SAN handles only the “block level” part of the filesystem
o Relies on protocols like:
iSCSI: IP based SCSI
Fibre Channel: Optical fiber transport protocol
AoE: ATA over Ethernet
o Hard to implement, expensive
o Redundancy can be achieved to avoid single point of failure
o Performance and scalability are (reasonably) good
15. Shared storage – Cluster Filesystem
o Filesystem shared on multiple servers using special software / drivers
o Windows implementation:
DFS: Windows Distributed File System
o Linux implementations:
HDFS: Hadoop Distributed Filesystem
Ceph: Object Storage Platform
GlusterFS: Red Hat Cluster Filesystem
o Relatively easy to implement
o Redundancy can easily be achieved
o Performance and scalability are (reasonably) good
16. Database High Availability
o High Availability for an RDBMS (relational database management system) is often the most difficult part of a Highly Available setup
o Hardware resources and data need to be redundant
o Remember that it isn’t just data, it is constantly changing data
o High Availability means the operation can continue uninterrupted, not by restoring a new/backup server
17. Database HA - Replication
o Asynchronous by default
o One master, many slaves
o No write scale-out possible
o Difficult to recover from a failover situation
o Prone to inconsistency when not used properly
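With one master and many slaves, the application (or a proxy) must route writes to the master and spread reads over the replicas. A minimal routing sketch, assuming hypothetical connection names; real drivers and proxies classify statements far more carefully:

```python
import itertools

class ReplicatedPool:
    """Sketch of read/write splitting for master-slave replication:
    writes go to the single master, reads round-robin over slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def route(self, query):
        # Naive classification by first keyword; illustrative only.
        is_write = query.lstrip().upper().startswith(
            ("INSERT", "UPDATE", "DELETE", "REPLACE"))
        return self.master if is_write else next(self._slaves)

pool = ReplicatedPool("master", ["slave1", "slave2"])
assert pool.route("INSERT INTO t VALUES (1)") == "master"
assert pool.route("SELECT * FROM t") == "slave1"
assert pool.route("SELECT * FROM t") == "slave2"
```

Note the inconsistency risk mentioned above: because replication is asynchronous, a read routed to a slave may not yet see a write that just went to the master.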
18. Database HA - Sharding
o Separate data over multiple database back-ends using keyed distribution
o Multi master setup possible
o Excellent scalability
o Redundancy needs to be obtained through a complementary methodology
o Requires more complex application logic
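The keyed distribution above can be sketched as a hash of the record key modulo the number of shards. The shard names are illustrative; real deployments often use consistent hashing or a lookup directory so shards can be added without reshuffling every key:

```python
import hashlib

# Hypothetical shard back-ends.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]

def shard_for(key: str) -> str:
    """Deterministically map a key (e.g. a user ID) to one back-end.
    A stable hash is used so every app server agrees on the mapping."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always lands on the same shard...
assert shard_for("user:42") == shard_for("user:42")
# ...while different keys spread across the pool.
assert shard_for("user:42") in SHARDS
```

This is the "more complex application logic" the slide refers to: every query must first resolve which shard holds the data, and cross-shard queries need extra work.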
19. Database HA – Clustering I
o Synchronous by default
o Multi master setup possible
o Write scale-out possible
o Near-automatic fault recovery
o Requires code-level replication conflict resolution
20. Database HA – Clustering II
Clustering for Microsoft SQL (from 2012)
o Always On Availability Groups
o Each node requires WSFC (Windows Server Failover Clustering)
o Asynchronous and synchronous commit mode supported
o Up to 8 “warm” availability replicas can be set up
o These replicas can be used for read transactions and backups
o Availability group listener to automatically redirect clients to the best available server
o Not a “real” cluster, no master-master replication possible
21. Database HA – Clustering III
Clustering for MySQL (MariaDB)
o Galera (wsrep) plugin to enable clustering
(included in MariaDB 10.1 by default)
o Multi-master, virtually synchronous replication (certification based)
o Read and write scalability
o Automatic membership control, node joining and dropping
o No listener functionality that redirects clients to available nodes
22. Clustering – Quorum I
”A quorum is the minimum number of members of a deliberative
assembly necessary to conduct the business of that group”
- Wikipedia
23. Clustering – Quorum II
o Node Majority: each node that is available and in communication can vote. The cluster functions only with a majority of the votes.
o When a network partition occurs, the nodes in the minority part will go into lockdown to avoid a “split brain” situation
o When a network partition resolves, the minority part will rejoin the active cluster after a state transfer to retrieve the data that was changed in the meantime
o A cluster should contain an odd number of nodes to prevent a total lockdown during a node failure or network partition
24. Clustering – Scenario 1
o Node A is gracefully stopped
o Other nodes receive “leave” message
and quorum is reduced by 1
o Cluster is online
o Node B and C continue to serve
requests because they have the
majority of votes (2 of 2)
25. Clustering – Scenario 2
o Nodes A and B are gracefully stopped
o Node C receives “leave” messages from A and B and the quorum is reduced by 2
o Cluster is online
o Node C continues to serve clients since it has the majority of votes in the quorum (1 of 1)
26. Clustering – Scenario 3
o All nodes are gracefully stopped
o Cluster is offline
o There is a potential problem in starting the cluster again: the most recent (last stopped) node should be used to bootstrap the cluster or there is potential data loss
27. Clustering – Scenario 4
o Node A disappears from the cluster due to unforeseen circumstances
o Nodes B and C will try to reconnect to A but will eventually remove A from the cluster, maintaining the quorum (3)
o Cluster is online
o Nodes B and C continue to serve requests because they have the majority of votes (2 of 3)
28. Clustering – Scenario 5
o Nodes A and B disappear from the cluster due to unforeseen circumstances
o Node C will try to reconnect to A and B but will eventually remove both from the cluster, maintaining the quorum (3)
o Cluster is offline
o The cluster is offline because Node C cannot acquire a majority of the votes (1 of 3) and will remain in lockdown
29. Clustering – Scenario 6
o All nodes disappear from the cluster due to unforeseen circumstances
o Cluster is offline (obviously)
o This is a potential problem, as the node with the most recent data should be used to bootstrap the cluster again to avoid data loss
30. Clustering – Scenario 7
o A network split causes Nodes A, B and C to lose connectivity with Nodes D, E and F
o Cluster is offline
o Nodes A, B and C have no majority (3 of 6) and Nodes D, E and F also have no majority (3 of 6). All nodes go into lockdown
34. Clustering – Multiple Datacenters IV
[Diagram: node 1 and node 2 in DC 1, node 3 and node 4 in DC 2, node 5 and node 6 in DC 3]
35. Health Endpoint Monitoring
o Monitor applications for availability in a HA pool
o Monitor middle-tier services for availability
o Automatic removal of misbehaving endpoints from the pool
o Endpoints that are healthy again after a service interruption are automatically re-added
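The removal and re-adding behaviour above can be sketched as a periodic pass over the endpoint list. The `check` callable stands in for a real probe against an HTTP health endpoint (with a timeout); all names here are illustrative:

```python
def refresh_pool(endpoints, check):
    """Return the endpoints that currently pass their health check.
    Unhealthy ones drop out of the pool; recovered ones reappear
    automatically because every pass re-evaluates the full list."""
    return [ep for ep in endpoints if check(ep)]

# Simulated endpoint health; a real monitor would probe each server.
STATUS = {"app1": True, "app2": False, "app3": True}
probe = lambda ep: STATUS[ep]

pool = refresh_pool(["app1", "app2", "app3"], probe)
assert pool == ["app1", "app3"]      # app2 is misbehaving: removed

STATUS["app2"] = True                # app2 recovers...
pool = refresh_pool(["app1", "app2", "app3"], probe)
assert pool == ["app1", "app2", "app3"]  # ...and is re-added
```

Real load balancers add thresholds (e.g. N consecutive failures before removal) so a single slow response does not evict a healthy server.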
38. Monitoring Strategy
[Diagram: a load balancer in front of app servers 1–3; each app server reaches db nodes 1–3 through a DB load balancer]
39. Design Patterns for HA environments
o Safeguard performance
o Increase fault tolerance
o Improve consistency
40. Queue based load leveling pattern I
o Temporal decoupling
o Load leveling
o Load balancing
o Loose coupling
[Diagram: tasks arrive as requests received at a variable rate, enter a message queue, and are processed by the service at a more consistent rate]
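A minimal stand-alone sketch of the pattern using only the standard library: a bursty producer fills a queue, while a single worker drains it at its own steady pace. The queue absorbs the burst, decoupling the two rates:

```python
import queue
import threading

tasks = queue.Queue()   # the message queue that levels the load
processed = []

def worker():
    """Consume tasks at the service's own pace, whatever the
    arrival rate was. A None item is the shutdown sentinel."""
    while True:
        item = tasks.get()
        if item is None:
            break
        processed.append(item)   # stand-in for real work
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()

for i in range(5):               # bursty producer: 5 requests at once
    tasks.put(i)
tasks.put(None)                  # signal shutdown
t.join()

assert processed == [0, 1, 2, 3, 4]
```

In production the in-process `queue.Queue` would be a durable broker (RabbitMQ, SQS, Kafka, …) so messages survive a service restart.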
41. Queue based load leveling pattern II
When to use?
o Any type of application or service that is subject to overloading
When not to use?
o Not suitable if a response with minimal latency is expected from the application or service
42. Throttling pattern I
o Reject or delay requests to the application when a certain number of requests in a certain amount of time is reached
o Disable or degrade functionality of selected nonessential services so that essential services can run unimpeded with sufficient resources
43. Throttling pattern II
When to use?
o To ensure that a system continues to meet service level agreements
o To prevent a single tenant from monopolizing the resources provided by an application
o To handle bursts in activity
o To help cost-optimize a system by limiting the maximum resource levels needed to keep it functioning
48. A/B Deployments I
[Diagram: the load balancer maps www.live.nl to deployment A (webserver A serving /deploy/A on app servers 1 and 2) and www.shadow.nl to deployment B (webserver B serving /deploy/B on the same app servers)]
51. Deployment best practices
o Never introduce backwards-incompatible changes to the database
o Thoroughly test the shadow-live environment, as it is the closest to the real live deployment
o Maintain tight release versioning, based on semantic versioning
o Releasing at the end of the day or on a Friday is not recommended
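Semantic versioning makes releases trivially orderable. A simplified sketch (it handles only `MAJOR.MINOR.PATCH` and ignores pre-release and build-metadata suffixes):

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into a comparable tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

# Tuples compare element-wise, so ordering releases is trivial,
# and string comparison pitfalls ("2.10" < "2.9") are avoided.
assert parse_semver("2.10.0") > parse_semver("2.9.3")
assert parse_semver("1.0.0") < parse_semver("1.0.1")
```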