Distributed systems involve complex interactions among many components. This increases the number of possible failures that could bring a whole system down. Software architects, designers, and developers need to architect, design, and program functional requirements with the possibility of failures in mind, so that the system keeps running despite them. This presentation tackles part of the problem, focusing on redundancy, different types of groups, replication, and eventual consistency, and finishes with a presentation of the CAP theorem.
Presentation delivered at the IV Cloud Computing and Big Data event at Universidad Nacional de La Plata http://www.jcc.info.unlp.edu.ar/jcc2016/wordpress/index.php/cronograma/
Design (Cloud systems) for Failures
1. Design for Failures
(and for Availability)
IV Jornadas de Cloud
Computing & Big Data
Rodolfo Kohn
Cloud Architect
Intel Security
Rodolfo.kohn@intel.com
2. Original Agenda
Remembering “Distributed System Design” and availability
Introduction to Design for Failures
• Failure modes
• Redundancy (process and data)
• Failure detection
• Failure recovery
• Cascade failures and recovery
Redundancy and high availability in AWS
Eventual consistency problems
Performance and scalability problems
Operations monitoring
• Techniques to avoid false positives
Logs and counters
Design software for failures
Testing availability
Measuring availability
Education
3. Agenda
Remembering “Distributed System Design” and availability
Introduction to Design for Failures
• Redundancy
– Process
– Data: Replication (multi-master, master-slave)
– Flat groups and hierarchical groups
• Synchronization Model
• Stateful vs Stateless
• Eventual consistency
• CAP Theorem
• Failure detection
• Failure recovery
• Cascade failures and recovery
4. (Cloud or Distributed) Applications are
Complex
[Architecture diagram: DNS (root and .com servers), global load balancers, authentication, two datacenters, a service tier, multiple caches, disk, network, SMTP, CDN, NoSQL and SQL stores, plus monitoring, logs, and configuration management]
Multiple Opportunities for Unexpected Failures
Brittle Systems shall not Survive
Load bursts & Response time deterioration
5. Micro-services dependencies
In distributed systems and cloud systems there are complex dependencies between components, such that the failure of one component can bring down the whole system
6. What is Availability?
Distributed Systems: Principles and Paradigms (2nd Edition),
Andrew Tanenbaum, Maarten Van Steen
“Availability is defined as the property that a system is ready to
be used immediately. In general, it refers to the probability
that the system is operating correctly at any given
moment and is available to perform its functions on
behalf of its users. In other words, a highly available
system is one that will most likely be working at a given
instant in time.”
3/4/5 9’s of Availability: see Wikipedia :)
The system is always running correctly
When users access it, they have it
10. We don’t avoid failures, we live with
them
Design for Failures is about focusing
on the Error Path
PAINFUL AND TIME CONSUMING
11. Failures affecting Availability
Different types of failures
• Infrastructure failures
• Software failures
• Operations failures
• Deployment failures
System updates or upgrades may affect availability if they
require downtime
Bad response time affects availability
• Unacceptable response time = system unavailable
• Bad scalability eventually affects response time
– Vulnerability to load peaks
Manual Path to Production affects availability
Neglected business/process situations affect availability
12. Valid for all businesses
As core business moves to the Internet, downtime costs money
More possibilities of failure:
• (Cloud) systems are becoming increasingly complex
• Software runs under stringent conditions
• There is a demand for excellent user experience
• In the cloud, applications run on commodity hardware
13. It’s about the whole big machinery
[Diagram: the pipeline Product/Service Requirements → Development → Deployment and Operation, connected by the path to production]
• Requirements: PDM and CXD must think about alternative paths on error conditions
• Development: architects design for availability (software and infrastructure); agile teams with distributed-systems skills and an availability, scalability, and performance mindset
• Path to production: fast, automated, error free
• Deployment and operation: DevOps, monitoring, operations automation
14. From Architecture to Development
Architecture:
redundancy model and management, dependency
management, state model, synchronization model, failure
detection, recovery, scalability model,
administration/configuration management
Design: logging design, monitoring design, dependency
handling, state management design (stateful and stateless),
consistency, fallback actions on failures per operation…
Development: consistency handling, retries, error analysis,
logging, error path (if ... else …), …
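For the development-level items above, the sketch below shows one shape this can take: retries for transient failures plus an explicit fallback on the error path. It is a minimal Python illustration; `charge` and `enqueue_for_later` are hypothetical callables, not part of any system referenced in this deck.

```python
import logging
import random
import time

log = logging.getLogger("payments")

class TransientError(Exception):
    """An error worth retrying (timeout, connection reset, 5xx)."""

def call_with_retries(operation, attempts=3, base_delay=0.2):
    """Run an operation, retrying transient failures with jittered exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up and let the caller's error path decide
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))

def charge_or_queue(charge, enqueue_for_later):
    """Error path chosen per operation: if the charge keeps failing, queue it instead of losing it."""
    try:
        return call_with_retries(charge)
    except Exception:
        log.error("charge failed after retries, queuing for later processing")
        enqueue_for_later()
        return None
```

The jitter keeps retries from synchronizing across many clients, which matters precisely under the load peaks mentioned earlier.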
15. Topics
Redundancy (process and data)
Flat (P2P) Groups vs Hierarchical groups
State: stateless vs stateful
Replication
Synchronization: asynchronous vs. synchronous
Eventual Consistency
CAP
Failure detection
Recovery actions
Cascade Failures
Client recovery in client/server
16. Redundancy
It is about provisioning in excess, replicating
hardware or software components or data
It allows masking failures as a mechanism of fault
tolerance
Additional hardware equipment or software
processes are provided
When a component fails, another one in the group
takes over its work
Data replication, together with component replication,
keeps data safe in the face of a component failure
17. Redundancy and groups
Process redundancy implies the creation of groups of
replicated processes
The group is seen by other processes as a single
process
• Replication is abstracted to be seen as one entity
• The same happens with hardware
18. Two types of groups
[Diagram: a flat (peer-to-peer) group, where all members are equal, vs. a hierarchical group with a coordinator and workers]
19. Design Considerations
Group creation and destruction
• Group bootstrapping
Group membership
• Processes can join and leave a group
Decision making
• Task distribution, synchronization, consistency, etc.
20. Different challenges
Hierarchical group
• The coordinator, primary, or master knows and controls all
workers
• Simpler control and management
• If the coordinator fails, the group crashes
Flat group or peer-to-peer
• There is a need for agreement or consensus algorithms
– For Coordinator election
– For consistency
– Synchronization
– For faulty process detection
– Membership change detection
• Data distribution
• If any member crashes, the group continues working; it just shrinks
21. Hierarchical group:
Pool of servers controlled by a Load Balancer
[Diagram: a load balancer in front of a pool of servers]
• The load balancer detects an unresponsive server and removes it
• A new server is added to the pool, manually or automatically
• All other processes/applications/systems sending requests to this group see it as just one process
• The LB distributes work and controls the workers
22. Faulty process and server detection
The load balancer sends health checks to the servers in the
pool to detect failing servers
• It can monitor at different stack layers
– In the case of AWS ELB: TCP, SSL, HTTP, HTTPS
– F5 can also test at different stack layers
• Failing servers can be automatically de-registered
• New healthy servers can be added to the pool
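As a rough illustration (this is not the ELB or F5 API; the pool addresses and the `/health` path are made-up assumptions), the sketch below shows conceptually what such a health-checking loop does:

```python
import urllib.request

# Hypothetical backend pool; a real load balancer gets this from its configuration
POOL = {"10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"}
healthy = set(POOL)

def http_health_check(server, path="/health", timeout=2.0):
    """Return True if the server answers the health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f"http://{server}{path}", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False  # timeout, refused connection, or HTTP error

def run_health_checks():
    """De-register failing servers; re-register servers that recovered."""
    for server in POOL:
        if http_health_check(server):
            healthy.add(server)      # healthy again: back into rotation
        else:
            healthy.discard(server)  # unresponsive: stop routing traffic to it
```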
23. Flat group: Cassandra
A cluster of Cassandra nodes
• Information is transmitted
with a gossip protocol
• If a node detects a new node or a faulty node, it
transmits this information through the gossip protocol
• Heartbeats with other nodes
to detect faulty nodes with
Phi Accrual Failure
Detectors
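A simplified sketch of the Phi Accrual idea, assuming heartbeat inter-arrival times roughly follow a normal distribution; real detectors, including Cassandra's, add minimum sample counts, a floor on the standard deviation, and other refinements:

```python
import math
from collections import deque

class PhiAccrualDetector:
    """phi = -log10(probability that a heartbeat arrives later than the time already elapsed)."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps, in seconds
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-3)
        elapsed = now - self.last_heartbeat
        # P(next heartbeat arrives later than `elapsed`) under the normal approximation
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-12))
```

A node is suspected once phi crosses a threshold (Cassandra's default phi_convict_threshold is 8): instead of a binary alive/dead verdict, the suspicion level grows continuously the longer heartbeats stay missing.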
25. Flat group: OSPF
I would say OSPF routers
form a flat group
• Routers use a link-state routing protocol to
transmit connectivity information
• Routers can detect neighbor failures through
the Hello protocol and transmit the data as
link states
26. Data Redundancy
Data stores may be replicated for high availability
• Database replication
• Disk replication
Data redundancy is also found at other levels
• RAID disks
• In communications: CDMA uses Hamming code to
recover from errors
We focus on higher level failures that affect operations: a
database, SAN, whole platform, datacenter
27. Data Redundancy
SQL and NoSQL Databases allow different replication
models
• Master-Master
– All replicas can be read and written
• Master-Slave
– All replicas can be read; only the master can be written
– In case of master failure, a slave must take over
28. Database Replication (1)
Replication: Data is replicated in all instances
Partitioning: Data is partitioned across different instances
• This is not replication
[Diagram: with replication the same data is stored in every instance; with partitioning the data is split across instances, e.g. clients from America on one instance and clients from Europe on another]
30. Database Replication (3)
Master-master replication (also called multi-master or peer-to-peer): write to any instance, read from any instance
Possibility of conflicts in asynchronous mode:
• Same row updated in different replicas
• Two inserts in different replicas
• Delete and insert/update
[Diagram: clients write to and read from any of the replicas; the replicas exchange updates through replication]
31. Synchronous vs. Asynchronous
Replication
Synchronous replication assures a write will occur in all
instances at the same time
• Either multi-master or master-slave
In asynchronous replication the write is sent to one node and
then replicated to the other nodes
• Either multi-master or master-slave
• There is a lag in write replication
• At a point in time data might not be the same in all
nodes (eventual consistency)
32. Synchronous Replication
Synchronous replication assures that a write will occur in all instances at the same
time
• All servers (both masters and slaves) have up-to-date data (A and C in ACID)
• Provides ACID capabilities
• High availability
• Simpler for developers
• Implementation through two-phase commit or distributed locks, which may
slow the system down (see the sketch below)
• No write scalability
• Performance might be affected
• Possibility of deadlocks
Galera cluster for MySQL
http://galeracluster.com/
Galera Replication is a
synchronous multi-master
replication plug-in for
InnoDB
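As a toy illustration of the two-phase commit mentioned above (a generic sketch, not how Galera replicates internally; real implementations add durable logs, timeouts, and coordinator recovery):

```python
class Participant:
    """Toy replica taking part in a two-phase commit."""

    def prepare(self, txn):
        # Acquire locks and write the transaction to a durable log, then vote.
        return True  # a real node may vote "no" or time out

    def commit(self, txn):
        pass  # make the prepared changes visible

    def abort(self, txn):
        pass  # release locks and discard the prepared changes


def two_phase_commit(txn, participants):
    """Phase 1: every replica prepares and votes. Phase 2: commit only if all voted yes."""
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txn))
        except Exception:
            votes.append(False)  # an unreachable replica counts as a "no" vote

    if all(votes):
        for p in participants:
            p.commit(txn)  # every replica applies the write
        return "committed"
    for p in participants:
        p.abort(txn)       # one "no" aborts the write everywhere
    return "aborted"
```

The blocking in phase 1 while every replica prepares is where the slowdown, the lack of write scalability, and the deadlock risk listed above come from.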
33. Asynchronous Replication
The write occurs in one node and is then replicated to the other
nodes
• Less complex (no two-phase commit or distributed
locks)
• High availability across datacenters
• Better write scalability
• Eventual consistency
• Write conflicts among masters
• Loss of synchronization is a problem to solve
• More difficult for developers (eventual consistency, write
conflicts)
This type of replication is the basic one offered by MySQL,
PostgreSQL and MariaDB (and SQL Server???)
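One common automated way to resolve write conflicts between masters is last-write-wins; a minimal sketch (the `Version` record is illustrative, not any specific database's format):

```python
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    timestamp: float  # wall-clock time recorded by the master that accepted the write
    node_id: str      # tie-breaker when timestamps are equal

def resolve_lww(a: Version, b: Version) -> Version:
    """Last-write-wins: keep the version with the newest timestamp."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))
```

It is simple and automatic, but it silently drops one of the conflicting writes and relies on reasonably synchronized clocks, which is exactly the "determine order by comparing times" trap flagged later in this deck; where that is unacceptable, per-case repair logic (or alarm and repair scripts) is needed.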
34. Multi-master with Cassandra
Source: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Asynchronous replication
Tunable consistency
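A small example of the tunable consistency mentioned above, using the DataStax Python driver (cluster addresses, keyspace, and table are made up; the pattern is the point, not the schema):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("demo_keyspace")

# The write waits for a quorum of replicas to acknowledge it
insert = SimpleStatement(
    "INSERT INTO users (id, email) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (42, "user@example.com"))

# The read touches a single replica: lower latency, but it may not see the latest write
select = SimpleStatement(
    "SELECT email FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(select, (42,)).one()
```

With QUORUM on both reads and writes (R + W > N) a read is guaranteed to overlap the latest acknowledged write; dropping the read level to ONE trades that guarantee for latency and availability.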
35. P2P Database Solutions
• Amazon Dynamo
– http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
• Cassandra
– https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
• Netflix’s Dynomite (Redis and Memcached)
– http://techblog.netflix.com/2014/11/introducing-dynomite.html
– https://github.com/Netflix/dynomite
36. Consistency and Design for Failures
When working with asynchronous replication you need to
deal with eventual consistency
• With asynchronous processes in general, it is possible
that when a process goes to read something that should
be there, it is not there yet
It could take milliseconds or many seconds
Under heavy load it turns worse
Write conflicts are another issue you need to deal with
• Need to have alarm and repair scripts if an automated
solution is not possible
Asynchronous, Fire and forget, Future, Let it be …
38. Eventual consistency problem
[Diagram: an application writes to one data store through the load balancer (1-WRITE); replication to the other data store only happens after some time; a subsequent read routed to the not-yet-updated replica (4-READ) does not find the data]
• Read-after-write problem
• Specific solution needed for each case
• Cannot trust that replication will have occurred after some given time
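One possible mitigation for the read-after-write case, sketched with a hypothetical `read_replica` function: re-read with a short backoff before concluding the data is absent. As the slide says, the real fix is specific to each case; alternatives include routing the follow-up read to the node that took the write, or carrying the written value forward inside the application.

```python
import time

def read_with_grace_period(read_replica, key, attempts=5, delay=0.1):
    """Retry a missing read a few times before treating the key as absent,
    since the replica being queried may simply not have caught up yet.

    `read_replica(key)` is a hypothetical callable returning None when the
    key is not (yet) visible on that replica.
    """
    for attempt in range(attempts):
        value = read_replica(key)
        if value is not None:
            return value
        time.sleep(delay * (attempt + 1))  # back off while replication lags
    return None  # still missing: fall back to the per-case handling above
```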
39. From Architecture to Development
Designers and developers must understand the
consequences of each architecture
Typical questions/comments that predict issues in
distributed systems (100% certainty)
• By comparing operations’ time I can determine
order
• How long does it take to replicate data?
• We tested it and it is replicating very fast, no
problems
• It’s fast. It’s just fire and forget (asynchronous):
check if there is a subsequent read associated
41. Why the hassle of P2P/flat
Best solution for high availability
Self-managed system
Best horizontal and dynamic scalability
Usually, writes are still possible after a network partition
42. Brewer’s Conjecture and
CAP Theorem
• Consistency, Availability, and Partition Tolerance are all
desired features of database systems.
• However it is not possible to have all of them: pick only
two.
[Diagram: the CAP triangle]
Consistency: all clients always have the same view of the data
Availability: each client can always read and write
Partition Tolerance: the system works well despite physical network partitions
CA: RDBMS
AP: Dynamo, Cassandra
CP: MongoDB, …