Distributed systems involve complex interactions among many components. This increases the number of possible failures that could bring a whole system down. Software architects, designers, and developers need to architect, design, and program functional requirements with the possibility of failures in mind, so that the system keeps running despite them. This presentation tackles part of the problem, focusing on redundancy, different types of groups, replication, and eventual consistency, and finishes with a presentation of the CAP theorem.
Presentation delivered at the IV Cloud Computing and Big Data event at Universidad Nacional de La Plata http://www.jcc.info.unlp.edu.ar/jcc2016/wordpress/index.php/cronograma/
Design (Cloud systems) for Failures
1. Design for Failures
(and for Availability)
IV Jornadas de Cloud
Computing & Big Data
Rodolfo Kohn
Cloud Architect
Intel Security
Rodolfo.kohn@intel.com
2. Original Agenda
Remembering “Distributed System Design” and availability
Introduction to Design for Failures
• Failure modes
• Redundancy (process and data)
• Failure detection
• Failure recovery
• Cascade failures and recovery
Redundancy and high availability in AWS
Eventual consistency problems
Performance and scalability problems
Operations monitoring
• Techniques to avoid false positives
Logs and counters
Design software for failures
Testing availability
Measuring availability
Education
3. Agenda
Remembering “Distributed System Design” and availability
Introduction to Design for Failures
• Redundancy
– Process
– Data: Replication (multi-master, master-slave)
– Flat groups and hierarchical groups
• Synchronization Model
• Stateful vs Stateless
• Eventual consistency
• CAP Theorem
• Failure detection
• Failure recovery
• Cascade failures and recovery
4. (Cloud or Distributed) Applications are
Complex
[Architecture diagram: DNS (root and .com servers), global load balancers, authentication, two datacenters, a service tier, multiple caches, disk, network, SMTP, CDN, NoSQL and SQL stores, plus monitoring, logs, and configuration management]
Multiple Opportunities for Unexpected Failures
Brittle Systems shall not Survive
Load bursts & Response time deterioration
5. Micro-services dependencies
In distributed systems and cloud systems there are complex dependencies between components, such that the failure of one component can bring down the whole system
6. What is Availability?
Distributed Systems: Principles and Paradigms (2nd Edition),
Andrew Tanenbaum, Maarten Van Steen
“Availability is defined as the property that a system is ready to
be used immediately. In general, it refers to the probability
that the system is operating correctly at any given
moment and is available to perform its functions on
behalf of its users. In other words, a highly available
system is one that will most likely be working at a given
instant in time.”
3/4/5 9’s of Availability: see Wikipedia :)
The system is always running correctly
When users access it, they have it
10. We don’t avoid failures, we live with
them
Design for Failures is about focusing
on the Error Path
PAINFUL AND TIME CONSUMING
11. Failures affecting Availability
Different types of failures
• Infrastructure failures
• Software failures
• Operations failures
• Deployment failures
System updates or upgrades may affect availability if they
require downtime
Bad response time affects availability
• Unacceptable response time = system unavailable
• Bad scalability eventually affects response time
– Vulnerability to load peaks
Manual Path to Production affects availability
Neglected business/process situations affect availability
12. Valid for all businesses
As core business moves to the Internet, downtime costs money
More possibilities of failure:
• (Cloud) systems are becoming increasingly complex
• Software runs under stringent conditions
• There is a demand for excellent user experience
• In the cloud, applications run on commodity hardware
13. It’s about the whole big machinery
[Diagram: the pipeline Product/Service Requirements → Development → Deployment and Operation, connected by the path to production]
• Requirements: PDM and CXD must think about alternative paths on error conditions
• Development: architects design for availability (software and infrastructure); agile teams with distributed-systems skills and an availability, scalability, and performance mindset
• Path to production: fast, automated, error free
• Deployment and operation: DevOps, monitoring, operations automation
14. From Architecture to Development
Architecture:
redundancy model and management, dependency
management, state model, synchronization model, failure
detection, recovery, scalability model,
administration/configuration management
Design: logging design, monitoring design, dependency
handling, state management design (stateful and stateless),
consistency, fallback actions on failures per operation…
Development: consistency handling, retries, error analysis,
logging, error path (if ... else …), …
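For the development-level items above, the sketch below shows one shape this can take: retries for transient failures plus an explicit fallback on the error path. It is a minimal Python illustration; `charge` and `enqueue_for_later` are hypothetical callables, not part of any system referenced in this deck.

```python
import logging
import random
import time

log = logging.getLogger("payments")

class TransientError(Exception):
    """An error worth retrying (timeout, connection reset, 5xx)."""

def call_with_retries(operation, attempts=3, base_delay=0.2):
    """Run an operation, retrying transient failures with jittered exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up and let the caller's error path decide
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))

def charge_or_queue(charge, enqueue_for_later):
    """Error path chosen per operation: if the charge keeps failing, queue it instead of losing it."""
    try:
        return call_with_retries(charge)
    except Exception:
        log.error("charge failed after retries, queuing for later processing")
        enqueue_for_later()
        return None
```

The jitter keeps retries from synchronizing across many clients, which matters precisely under the load peaks mentioned earlier.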
15. Topics
Redundancy (process and data)
Flat (P2P) Groups vs Hierarchical groups
State: stateless vs stateful
Replication
Synchronization: asynchronous vs. synchronous
Eventual Consistency
CAP
Failure detection
Recovery actions
Cascade Failures
Client recovery in client/server
16. Redundancy
It is about provisioning in excess, replicating
hardware or software components or data
It allows masking failures as a mechanism of fault
tolerance
Additional hardware equipment or software
processes are provided
When a component fails, another one in the group
takes over its work
Data replication, together with component replication,
keeps data safe in the face of a component failure
17. Redundancy and groups
Process redundancy implies the creation of groups of
replicated processes
The group is seen by other processes as a single
process
• Replication is abstracted to be seen as one entity
• The same happens with hardware
18. Two types of groups
[Diagram: a flat (peer-to-peer) group, where all members are equal, vs. a hierarchical group with a coordinator and workers]
19. Design Considerations
Group creation and destruction
• Group bootstrapping
Group membership
• Processes can join and leave a group
Decision making
• Task distribution, synchronization, consistency, etc.
20. Different challenges
Hierarchical group
• The coordinator, primary, or master knows and controls all
workers
• Simpler control and management
• If the coordinator fails, the group crashes
Flat group or peer-to-peer
• There is a need for agreement or consensus algorithms
– For Coordinator election
– For consistency
– Synchronization
– For faulty process detection
– Membership change detection
• Data distribution
• If any member crashes, the group continues working; it just shrinks
21. Hierarchical group:
Pool of servers controlled by a Load Balancer
[Diagram: a load balancer in front of a pool of servers]
• The load balancer detects an unresponsive server and removes it
• A new server is added to the pool, manually or automatically
• All other processes/applications/systems sending requests to this group see it as just one process
• The LB distributes work and controls the workers
22. Faulty process and server detection
The load balancer sends health checks to the servers in the
pool to detect failing servers
• It can monitor at different stack layers
– In the case of AWS ELB: TCP, SSL, HTTP, HTTPS
– F5 can also test at different stack layers
• Failing servers can be automatically de-registered
• New healthy servers can be added to the pool
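As a rough illustration (this is not the ELB or F5 API; the pool addresses and the `/health` path are made-up assumptions), the sketch below shows conceptually what such a health-checking loop does:

```python
import urllib.request

# Hypothetical backend pool; a real load balancer gets this from its configuration
POOL = {"10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"}
healthy = set(POOL)

def http_health_check(server, path="/health", timeout=2.0):
    """Return True if the server answers the health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f"http://{server}{path}", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False  # timeout, refused connection, or HTTP error

def run_health_checks():
    """De-register failing servers; re-register servers that recovered."""
    for server in POOL:
        if http_health_check(server):
            healthy.add(server)      # healthy again: back into rotation
        else:
            healthy.discard(server)  # unresponsive: stop routing traffic to it
```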
23. Flat group: Cassandra
A cluster of Cassandra nodes
• Information is transmitted
with a gossip protocol
• If a node detects a new node or a faulty node, it
transmits this information through the gossip protocol
• Heartbeats with other nodes
to detect faulty nodes with
Phi Accrual Failure
Detectors
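A simplified sketch of the Phi Accrual idea, assuming heartbeat inter-arrival times roughly follow a normal distribution; real detectors, including Cassandra's, add minimum sample counts, a floor on the standard deviation, and other refinements:

```python
import math
from collections import deque

class PhiAccrualDetector:
    """phi = -log10(probability that a heartbeat arrives later than the time already elapsed)."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps, in seconds
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-3)
        elapsed = now - self.last_heartbeat
        # P(next heartbeat arrives later than `elapsed`) under the normal approximation
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-12))
```

A node is suspected once phi crosses a threshold (Cassandra's default phi_convict_threshold is 8): instead of a binary alive/dead verdict, the suspicion level grows continuously the longer heartbeats stay missing.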
25. Flat group: OSPF
I would say OSPF routers
form a flat group
• Routers use a link-state routing protocol to
transmit connectivity information
• Routers can detect neighbor failures through
the Hello protocol and transmit the data as
link states
26. Data Redundancy
Data stores may be replicated for high availability
• Database replication
• Disk replication
Data redundancy is also found at other levels
• RAID disks
• In communications: CDMA uses Hamming code to
recover from errors
We focus on higher level failures that affect operations: a
database, SAN, whole platform, datacenter
27. Data Redundancy
SQL and NoSQL Databases allow different replication
models
• Master-Master
– All replicas can be read and written
• Master-Slave
– All replicas can be read; only the master can be written
– In case of master failure, a slave must take over
28. Database Replication (1)
Replication: Data is replicated in all instances
Partitioning: Data is partitioned across different instances
• This is not replication
[Diagram: with replication the same data is stored in every instance; with partitioning the data is split across instances, e.g. clients from America on one instance and clients from Europe on another]
30. Database Replication (3)
Master-master replication (also called multi-master or peer-to-peer): write to any instance, read from any instance
Possibility of conflicts in asynchronous mode:
• Same row updated in different replicas
• Two inserts in different replicas
• Delete and insert/update
[Diagram: clients write to and read from any of the replicas; the replicas exchange updates through replication]
31. Synchronous vs. Asynchronous
Replication
Synchronous replication assures a write will occur in all
instances at the same time
• Either multi-master or master-slave
In asynchronous replication the write is sent to one node and
then replicated to the other nodes
• Either multi-master or master-slave
• There is a lag in write replication
• At a point in time data might not be the same in all
nodes (eventual consistency)
32. Synchronous Replication
Synchronous replication assures that a write will occur in all instances at the same
time
• All servers (both masters and slaves) have up-to-date data (A and C in ACID)
• Provides ACID capabilities
• High availability
• Simpler for developers
• Implementation through two-phase commit or distributed locks, which may
slow the system down (see the sketch below)
• No write scalability
• Performance might be affected
• Possibility of deadlocks
Galera cluster for MySQL
http://galeracluster.com/
Galera Replication is a
synchronous multi-master
replication plug-in for
InnoDB
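As a toy illustration of the two-phase commit mentioned above (a generic sketch, not how Galera replicates internally; real implementations add durable logs, timeouts, and coordinator recovery):

```python
class Participant:
    """Toy replica taking part in a two-phase commit."""

    def prepare(self, txn):
        # Acquire locks and write the transaction to a durable log, then vote.
        return True  # a real node may vote "no" or time out

    def commit(self, txn):
        pass  # make the prepared changes visible

    def abort(self, txn):
        pass  # release locks and discard the prepared changes


def two_phase_commit(txn, participants):
    """Phase 1: every replica prepares and votes. Phase 2: commit only if all voted yes."""
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txn))
        except Exception:
            votes.append(False)  # an unreachable replica counts as a "no" vote

    if all(votes):
        for p in participants:
            p.commit(txn)  # every replica applies the write
        return "committed"
    for p in participants:
        p.abort(txn)       # one "no" aborts the write everywhere
    return "aborted"
```

The blocking in phase 1 while every replica prepares is where the slowdown, the lack of write scalability, and the deadlock risk listed above come from.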
33. Asynchronous Replication
The write occurs in one node and is then replicated to the other
nodes
• Less complex (no two-phase commit or distributed
locks)
• High availability across datacenters
• Better write scalability
• Eventual consistency
• Write conflicts among masters
• Loss of synchronization is a problem to solve
• More difficult for developers (eventual consistency, write
conflicts)
This type of replication is the basic one offered by MySQL,
PostgreSQL and MariaDB (and SQL Server???)
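One common automated way to resolve write conflicts between masters is last-write-wins; a minimal sketch (the `Version` record is illustrative, not any specific database's format):

```python
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    timestamp: float  # wall-clock time recorded by the master that accepted the write
    node_id: str      # tie-breaker when timestamps are equal

def resolve_lww(a: Version, b: Version) -> Version:
    """Last-write-wins: keep the version with the newest timestamp."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))
```

It is simple and automatic, but it silently drops one of the conflicting writes and relies on reasonably synchronized clocks, which is exactly the "determine order by comparing times" trap flagged later in this deck; where that is unacceptable, per-case repair logic (or alarm and repair scripts) is needed.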
34. Multi-master with Cassandra
Source: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Asynchronous replication
Tunable consistency
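A small example of the tunable consistency mentioned above, using the DataStax Python driver (cluster addresses, keyspace, and table are made up; the pattern is the point, not the schema):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("demo_keyspace")

# The write waits for a quorum of replicas to acknowledge it
insert = SimpleStatement(
    "INSERT INTO users (id, email) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (42, "user@example.com"))

# The read touches a single replica: lower latency, but it may not see the latest write
select = SimpleStatement(
    "SELECT email FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(select, (42,)).one()
```

With QUORUM on both reads and writes (R + W > N) a read is guaranteed to overlap the latest acknowledged write; dropping the read level to ONE trades that guarantee for latency and availability.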
35. P2P Database Solutions
• Amazon Dynamo
– http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
• Cassandra
– https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
• Netflix’s Dynomite (Redis and Memcached)
– http://techblog.netflix.com/2014/11/introducing-dynomite.html
– https://github.com/Netflix/dynomite
36. Consistency and Design for Failures
When working with asynchronous replication you need to
deal with eventual consistency
• With asynchronous processes in general, it is possible
that when a process goes to read something that should
be there, it is not there yet
It could take milliseconds or many seconds
Under heavy load it turns worse
Write conflicts are another issue you need to deal with
• Need to have alarm and repair scripts if an automated
solution is not possible
Asynchronous, Fire and forget, Future, Let it be …
38. Eventual consistency problem
[Diagram: an application writes to one data store through the load balancer (1-WRITE); replication to the other data store only happens after some time; a subsequent read routed to the not-yet-updated replica (4-READ) does not find the data]
• Read-after-write problem
• Specific solution needed for each case
• Cannot trust that replication will have occurred after some given time
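One possible mitigation for the read-after-write case, sketched with a hypothetical `read_replica` function: re-read with a short backoff before concluding the data is absent. As the slide says, the real fix is specific to each case; alternatives include routing the follow-up read to the node that took the write, or carrying the written value forward inside the application.

```python
import time

def read_with_grace_period(read_replica, key, attempts=5, delay=0.1):
    """Retry a missing read a few times before treating the key as absent,
    since the replica being queried may simply not have caught up yet.

    `read_replica(key)` is a hypothetical callable returning None when the
    key is not (yet) visible on that replica.
    """
    for attempt in range(attempts):
        value = read_replica(key)
        if value is not None:
            return value
        time.sleep(delay * (attempt + 1))  # back off while replication lags
    return None  # still missing: fall back to the per-case handling above
```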
39. From Architecture to Development
Designers and developers must understand the
consequences of each architecture
Typical questions/comments that predict issues in
distributed systems (100% certainty)
• By comparing operations’ time I can determine
order
• How long does it take to replicate data?
• We tested it and it is replicating very fast, no
problems
• It’s fast. It’s just fire and forget (asynchronous):
check if there is a subsequent read associated
41. Why the hassle of P2P/flat
Best solution for high availability
Self-managed system
Best horizontal and dynamic scalability
Usually, writes are still possible after a network partition
42. Brewer’s Conjecture and
CAP Theorem
• Consistency, Availability, and Partition Tolerance are all
desired features of database systems.
• However it is not possible to have all of them: pick only
two.
[Diagram: the CAP triangle]
Consistency: all clients always have the same view of the data
Availability: each client can always read and write
Partition Tolerance: the system works well despite physical network partitions
CA: RDBMS
AP: Dynamo, Cassandra
CP: MongoDB, …