Design for Failures
(and for Availability)
IV Jornadas de Cloud
Computing & Big Data
Rodolfo Kohn
Cloud Architect
Intel Security
Rodolfo.kohn@intel.com
Original Agenda
Remembering “Distributed System Design” and availability
Introduction to Design for Failures
• Failure modes
• Redundancy (process and data)
• Failure detection
• Failure recovery
• Cascade failures and recovery
Redundancy and high availability in AWS
Eventual consistency problems
Performance and scalability problems
Operations monitoring
• Techniques to avoid false positives
Logs and counters
Design software for failures
Testing availability
Measuring availability
Education
7/10/2016
Agenda
Remembering “Distributed System Design” and availability
Introduction to Design for Failures
• Redundancy
– Process
– Data: Replication (multi-master, master-slave)
– Flat groups and hierarchical groups
• Synchronization Model
• Stateful vs Stateless
• Eventual consistency
• CAP Theorem
• Failure detection
• Failure recovery
• Cascade failures and recovery
(Cloud or Distributed) Applications are
Complex
[Diagram: DNS (Root, .com, Server), GLB, Auth, Service and Cache tiers in Datacenter-1 and Datacenter-2, plus Disk, Network, SMTP, CDN, NoSQL, SQL, Monitoring, Logs and Configuration Management]
Multiple Opportunities for Unexpected Failures
Brittle Systems shall not Survive
Load bursts & Response time deterioration
Micro-services dependencies
In distributed systems, and cloud systems, there are complex
dependencies between systems such that failure of one
component can bring down the whole system
What is Availability?
Distributed Systems: Principles and Paradigms (2nd Edition),
Andrew Tanenbaum, Maarten Van Steen
“Availability is defined as the property that a system is ready to
be used immediately. In general, it refers to the probability
that the system is operating correctly at any given
moment and is available to perform its functions on
behalf of its users. In other words, a highly available
system is one that will most likely be working at a given
instant in time.”
3/4/5 9’s of Availability: see Wikipedia :)
The system is always running correctly
When users access it, they have it
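The "nines" shorthand maps directly to a yearly downtime budget. A minimal sketch of that arithmetic, assuming a 365-day year; the function name `allowed_downtime_seconds` is made up for this illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines (3 means 99.9%)."""
    unavailability = 10 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = allowed_downtime_seconds(n) / 60
        print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Three nines leave roughly 8.76 hours of downtime per year; five nines leave a little over five minutes.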
Systems fail …
http://techcrunch.com/2012/10/22/aws-ec2-issues-in-north-virginia-affect-heroku-reddit-and-others-heroku-still-down/
What started as a small issue affecting some instances of Amazon’s
Elastic Cloud Compute (EC2) in North Virginia became a full-blown
outage of AWS in North Virginia. Major services, such as
Reddit, Foursquare, Minecraft and Heroku, are down. GitHub,
imgur, Pocket, HipChat, Coursera and others are affected …
And DOWNTIME COMES …
Consequences of Unavailability
http://blog.smartbear.com/news/motorolas-site-collapses-under-cyber-monday-traffic/
Talk about failures
We don’t avoid failures, we live with
them
Design for Failures is about focusing
on the Error Path
PAINFUL AND TIME CONSUMING
Failures affecting Availability
Different types of failures
• Infrastructure failures
• Software failures
• Operations failures
• Deployment failures
System updates or upgrades may affect availability if they
require downtime
Bad response time affects availability
• Unacceptable response time = system unavailable
• Bad scalability eventually affects response time
– Vulnerability to load peaks
Manual Path to Production affects availability
Neglected business/process situations affect availability
Valid for all businesses
As core business moves to the Internet, downtime means
money
More possibilities of failure:
• (Cloud) systems are becoming increasingly complex
• Software operates under stringent conditions
• There is a demand for excellent user experience
• In the cloud, applications run on commodity hardware
It’s about the whole big machinery
[Diagram: stages Product/Service Requirements, Development, Deployment and Operation, connected by the Path to production, with these annotations]
• PDM and CXD must think about alternative paths on error conditions
• Architects design for Availability (Software and Infrastructure)
• Agile teams
• Distributed Systems Skills
• Availability, Scalability, Performance mindset
• Fast, automated, error free
• DevOps, Monitoring, Operations Automation
From Architecture to Development
Architecture:
redundancy model and management, dependency
management, state model, synchronization model, failure
detection, recovery, scalability model,
administration/configuration management
Design: logging design, monitoring design, dependency
handling, state management design (stateful and stateless),
consistency, fallback actions on failures per operation…
Development: consistency handling, retries, error analysis,
logging, error path (if ... else …), …
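As an illustration of what "retries" and "fallback actions on failures per operation" can look like in code, here is a minimal sketch. The function `call_with_retries`, its parameters, and the choice of `ConnectionError` as the retryable error are assumptions made for this example, not something prescribed by the slides:

```python
import random
import time


def call_with_retries(operation, fallback, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `operation`; retry on failure with exponential backoff, then fall back."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt < attempts - 1:
                # Exponential backoff with jitter, so that many clients
                # do not retry at exactly the same moment.
                sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    # Error path: every attempt failed, so return the degraded result.
    return fallback()
```

The fallback could be a cached value, a default answer, or a queued request, depending on what the operation can tolerate.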
Topics
Redundancy (process and data)
Flat (P2P) Groups vs Hierarchical groups
State: stateless vs stateful
Replication
Synchronization: asynchronous vs. synchronous
Eventual Consistency
CAP
Failure detection
Recovery actions
Cascade Failures
Client recovery in client/server
Redundancy
It is about provisioning in excess, replicating
hardware or software components or data
It allows masking failures as a mechanism of fault
tolerance
Additional hardware equipment or software
processes are provided
When a component fails another one in the group
takes over its work
Data replication, together with component
replication, keeps data safe in the face of a component
failure
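Failure masking through redundancy can be sketched in a few lines. `ReplicaGroup` is a hypothetical class for illustration: callers talk to the group, and a failing replica is hidden as long as another one can take over.

```python
class ReplicaGroup:
    """A group of replicas that callers address as a single component."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def call(self, request):
        for replica in self.replicas:
            try:
                return replica(request)
            except ConnectionError:
                # This replica failed; the next one takes over its work.
                continue
        raise RuntimeError(f"all {len(self.replicas)} replicas failed")
```

The caller only sees an error when every replica in the group has failed.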
Redundancy and groups
Process redundancy implies the creation of groups of
replicated processes
The group is seen by other processes as a single
process
• Replication is abstracted to be seen as one entity
• The same happens with hardware
Two types of groups
[Diagram: a flat (peer-to-peer) group next to a hierarchical group, in which a Coordinator controls Workers]
Design Considerations
Group creation and destruction
• Group bootstrapping
Group membership
• Processes can join and leave a group
Decision making
• Task distribution, synchronization, consistency, etc.
Different challenges
Hierarchical group
• The coordinator, primary, or master knows and controls all
workers
• Simpler control and management
• If the coordinator fails, the group crashes
Flat group or peer-to-peer
• Agreement or consensus algorithms are needed
– For Coordinator election
– For consistency
– Synchronization
– For faulty process detection
– Membership change detection
• Data distribution
• If any member crashes, the group continues working; it just
shrinks
Hierarchical group:
Pool of servers controlled by a Load Balancer
• The LB distributes work and controls workers
• The load balancer detects an unresponsive server and removes it
• A new server is added to the pool, manually or automatically
• All other processes/applications/systems sending requests to this
group see it as just one process
Faulty process and server detection
The load balancer sends health checks to the servers in the
pool to detect failing servers
• It can monitor at different stack layers
– In the case of AWS ELB: TCP, SSL, HTTP, HTTPS
– F5 can also test at different stack layers
• Failing servers can be automatically de-registered
• New healthy servers can be added to the pool
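The health-check loop above can be sketched as follows. `ServerPool`, `unhealthy_threshold`, and the `probe` callback are hypothetical names for this illustration; they do not reproduce the API of AWS ELB or F5, and real load balancers typically also require several successful checks before putting a server back in service.

```python
class ServerPool:
    """Pool of servers whose membership is driven by health checks."""

    def __init__(self, servers, unhealthy_threshold=2):
        self.failures = {server: 0 for server in servers}
        self.in_service = set(servers)
        self.unhealthy_threshold = unhealthy_threshold

    def run_health_checks(self, probe):
        """`probe(server)` returns True when the server answers the check."""
        for server in list(self.failures):
            if probe(server):
                self.failures[server] = 0
                self.in_service.add(server)
            else:
                self.failures[server] += 1
                # De-register only after consecutive failures, which
                # avoids reacting to a single lost health check.
                if self.failures[server] >= self.unhealthy_threshold:
                    self.in_service.discard(server)

    def register(self, server):
        """Add a new healthy server to the pool."""
        self.failures[server] = 0
        self.in_service.add(server)
```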
Flat group: Cassandra
A cluster of Cassandra nodes
• Cluster state is propagated with a gossip protocol
• If a node detects a new node or a faulty node, it propagates
that information through the gossip protocol
• Nodes exchange heartbeats and detect faulty nodes with
Phi Accrual Failure Detectors
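The idea behind a phi accrual failure detector is to output a suspicion level instead of a yes/no answer. The sketch below is a simplification that assumes heartbeat intervals follow an exponential distribution; the class name `PhiAccrualDetector` and its methods are made up for this example and are not Cassandra's implementation.

```python
import math


class PhiAccrualDetector:
    """Suspicion level (phi) that grows the longer a heartbeat is overdue."""

    def __init__(self):
        self.intervals = []         # observed gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Under the exponential assumption, the probability that the next
        # heartbeat arrives later than `elapsed` is exp(-elapsed / mean).
        # phi = -log10(that probability) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))
```

A node is declared faulty when phi crosses a configured threshold, so the threshold trades detection speed against false positives.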
Flat group: OSPF
I would say OSPF routers form a flat group
• Routers use a link-state routing protocol to transmit
connectivity information
• Routers detect neighbor failures through the Hello protocol
and advertise the change as link states
Data Redundancy
Data stores may be replicated for high availability
• Database replication
• Disk replication
Data redundancy is also found at other levels
• RAID disks
• In communications: error-correcting codes (for example,
Hamming codes) allow recovery from transmission errors
We focus on higher level failures that affect operations: a
database, SAN, whole platform, datacenter
Data Redundancy
SQL and NoSQL Databases allow different replication
models
• Master-Master
– All replicas can be read and written
• Master-Slave
– All replicas can be read; only the master can be written
– In case of master failure, a slave must take over
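One common way to choose which slave takes over is to promote the reachable one that has applied the most writes. A minimal sketch, where the dictionary fields `reachable` and `applied_position` are hypothetical names for this illustration:

```python
def elect_new_master(slaves):
    """Pick the reachable slave that has applied the most replicated writes."""
    candidates = [s for s in slaves if s["reachable"]]
    if not candidates:
        raise RuntimeError("no reachable slave to promote")
    # The highest applied position loses the fewest acknowledged writes.
    return max(candidates, key=lambda s: s["applied_position"])
```

With asynchronous replication even the best candidate may be missing the last writes accepted by the failed master.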
Database Replication (1)
Replication: Data is replicated in all instances
Partitioning: Data is partitioned across different instances
• This is not replication
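[Diagram: with replication, every instance holds the same data; with partitioning, each instance holds a subset, for example clients from America in one instance and clients from Europe in another]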
Database Replication (2)
Replication Master-Slave: Write in one instance, Read
from all instances
[Diagram: WRITE goes to one data instance, REPLICATION copies it to the others, READ is served by all instances]
Database Replication (3)
Replication Master-Master or Multi-master or peer-to-peer:
Write in all instances, Read from all instances
Possibility of conflicts in asynchronous mode:
• Same row updated in different replicas
• Two inserts in different replicas
• Delete and insert/update
[Diagram: WRITE and READ go to every data instance, with REPLICATION between all of them]
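The three conflict cases listed above can all be detected by comparing what each replica changed since the last synchronization. A minimal sketch, assuming each replica exposes its pending changes as a dictionary from row key to new value; `DELETED` and `find_conflicts` are hypothetical names:

```python
DELETED = object()  # marker for a row deleted on one replica


def find_conflicts(changes_a, changes_b):
    """Row keys changed differently on both replicas since the last sync."""
    both = changes_a.keys() & changes_b.keys()
    return sorted(key for key in both if changes_a[key] != changes_b[key])
```

Detection is the easy part; deciding which version wins is application specific.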
Synchronous vs. Asynchronous
Replication
Synchronous replication ensures a write is applied in all
instances before it is acknowledged
• Either multi-master or master-slave
In asynchronous replication a write is sent to one node and
then replicated to the other nodes
• Either multi-master or master-slave
• There is a lag in write replication
• At a point in time data might not be the same in all
nodes (eventual consistency)
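The difference between the two models can be sketched with in-memory stand-ins. `Replica`, `AsyncPrimary`, and `SyncPrimary` are hypothetical classes for this illustration and do not correspond to any database API:

```python
class Replica:
    def __init__(self):
        self.data = {}


class AsyncPrimary:
    """Acknowledges a write before the replicas have received it."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = []  # acknowledged writes not yet replicated

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))
        return "ack"

    def replicate(self):
        """Runs later; until then the replicas lag behind the primary."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


class SyncPrimary:
    """Acknowledges a write only after every replica has applied it."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        for replica in self.replicas:
            replica.data[key] = value
        self.data[key] = value
        return "ack"
```

With `AsyncPrimary`, a read from a replica between `write` and `replicate` misses the new value; with `SyncPrimary` it cannot, at the cost of a slower write.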
Synchronous Replication
Synchronous replication ensures a write is applied in all instances before it is
acknowledged
• All servers (both masters and slaves) have up-to-date data (A and C in ACID)
• Provides ACID capabilities
• High availability
• Simpler for developers
• Implemented through two-phase commit or distributed locks, which may
slow the system down
• No write scalability
• Performance might be affected
• Possibility of deadlocks
Galera cluster for MySQL
http://galeracluster.com/
Galera Replication is a
synchronous multi-master
replication plug-in for
InnoDB
Asynchronous Replication
Write occurs in one node and then it replicates to other
nodes
• Less complex (no two-phase commit or distributed
locks)
• High availability across datacenters
• Better write scalability
• Eventual consistency
• Write conflicts among masters
• Loss of synchronization is a problem to solve
• More difficult for developers (eventual consistency, write
conflicts)
This type of replication is the basic one offered by MySQL,
PostgreSQL and MariaDB (and possibly SQL Server)
Multi-master with Cassandra
Source: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Asynchronous replication
Tunable consistency
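Tunable consistency is usually reasoned about with the rule R + W > N: if the number of replicas that must answer a read (R) plus the number that must acknowledge a write (W) exceeds the replication factor (N), every read overlaps with the latest acknowledged write. A minimal sketch of that arithmetic; the function names are made up for this example:

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def read_sees_latest_write(n: int, w: int, r: int) -> bool:
    """True when every read set overlaps every write set (R + W > N)."""
    return r + w > n
```

For example, with N = 3, reading and writing at quorum (2 and 2) overlaps, while reading and writing a single replica (1 and 1) does not and yields eventual consistency.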
P2P Database Solutions
• Dynamo
– http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
• Cassandra
– https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
• Netflix’s Dynomite (Redis and Memcached)
– http://techblog.netflix.com/2014/11/introducing-dynomite.html
– https://github.com/Netflix/dynomite
Consistency and Design for Failures
When working with asynchronous replication you need to
deal with eventual consistency
• With asynchronous processes in general, a process may
try to read something that should be there but has not
arrived yet
The delay can range from milliseconds to many seconds
Under heavy load it gets worse
Write conflicts are another issue you need to deal with
• Need to have alarm and repair scripts if an automated
solution is not possible
Asynchronous, Fire and forget, Future, Let it be …
Eventual consistency
[Diagram: applications behind a load balancer and two data stores; an application writes to one data store (1-WRITE) and replication to the other data store happens after some time (numbered steps 1 to 4)]
• Eventually both DB instances have the same data
Eventual consistency problem
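[Diagram: applications behind a load balancer and two data stores; an application writes to one data store (1-WRITE) and a later request reads from the other data store (4-READ), possibly before replication has occurred (numbered steps 1 to 7)]
• Read-after-write problem
• Specific solution for each case
• Cannot trust that replication will occur after some time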
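One possible solution for the read-after-write case, among several, is to have the writer keep a version token and to fall back to the primary when the replica has not caught up. The sketch below uses hypothetical in-memory names (`Store`, `write`, `replicate`, `read`) and is not tied to any specific database:

```python
class Store:
    def __init__(self):
        self.version = 0
        self.data = {}


def write(primary, key, value):
    """Apply the write on the primary and return a version token."""
    primary.version += 1
    primary.data[key] = value
    return primary.version


def replicate(primary, replica):
    """Asynchronous replication, whenever it finally runs."""
    replica.data = dict(primary.data)
    replica.version = primary.version


def read(primary, replica, key, min_version):
    """Read from the replica only if it has seen the caller's last write."""
    source = replica if replica.version >= min_version else primary
    return source.data.get(key)
```

The caller passes the token from its last write as `min_version`, so its own write is always visible to it even while replication lags.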
From Architecture to Development
Designers and developers must understand the
consequences of each architecture
Typical questions/comments that predict issues in
distributed systems (100% certainty)
• By comparing operations’ time I can determine
order
• How long does it take to replicate data?
• We tested it and it is replicating very fast, no
problems
• It’s fast. It’s just fire and forget (asynchronous):
check if there is a subsequent read associated
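The first comment in the list fails because wall clocks on different machines drift. A small worked example with made-up numbers, followed by a logical clock (a Lamport clock), which is one standard way to order causally related events without trusting wall time:

```python
# Node A's clock runs 5 seconds fast. A writes at real time 100.0 and
# B writes afterwards at real time 102.0, yet the timestamps say otherwise.
stamp_a = 100.0 + 5.0   # timestamp recorded by A
stamp_b = 102.0 + 0.0   # timestamp recorded by B
timestamps_reverse_order = stamp_a > stamp_b


class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event, for example a write."""
        self.time += 1
        return self.time

    def receive(self, sent_time):
        """Message received: jump past the sender's time."""
        self.time = max(self.time, sent_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
write_a = a.tick()             # A writes and sends the write to B
write_b = b.receive(write_a)   # B applies it; later events on B order after it
```

Lamport clocks only order events that are causally connected; concurrent writes still need a conflict resolution rule.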
Asynchronous Replication
in Active-Active
Network partitioning
[Diagram: the two-datacenter architecture (DNS, GLB, Auth, Service, Cache and Disk in Datacenter-1 and Datacenter-2), illustrating a network partition between the datacenters]
Why the hassle of P2P/flat
Best solution for high availability
Self-managed system
Best horizontal and dynamic scalability
Writes are usually still possible after a network partition
Brewer’s Conjecture and
CAP Theorem
• Consistency, Availability, and Partition Tolerance are all
desired features of database systems.
• However it is not possible to have all of them: pick only
two.
[Diagram: CAP triangle with vertices A, C and P]
Availability: Each client can always read and write
Consistency: All clients always have the same view of the data
Partition Tolerance: System works well despite physical network partitions
CA: RDBMS
AP: Dynamo, Cassandra
CP: MongoDB
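The CP versus AP choice shows up when a node cannot reach all of its peers. A minimal sketch of the two behaviors during a partition; `handle_write` and its parameters are hypothetical and deliberately ignore many real-world details:

```python
def handle_write(mode, reachable_replicas, total_replicas):
    """Decide whether to accept a write while some replicas are unreachable.

    `reachable_replicas` counts the replicas this side of the partition
    can reach, including the node handling the request.
    """
    has_majority = reachable_replicas > total_replicas // 2
    if mode == "CP":
        # Consistency first: refuse writes without a majority.
        return "accepted" if has_majority else "rejected"
    if mode == "AP":
        # Availability first: accept locally and reconcile after the partition.
        return "accepted" if reachable_replicas >= 1 else "rejected"
    raise ValueError(f"unknown mode: {mode}")
```

On the minority side of a partition the CP system stays consistent by becoming unavailable for writes, while the AP system stays available and has to resolve conflicts later.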
MongoDB
Source: https://docs.mongodb.com/manual/core/replica-set-elections/

Design (Cloud systems) for Failures

  • 1. Design for Failures (and for Availability) IV Jornadas de Cloud Computing & Big Data Rodolfo Kohn Cloud Architect Intel Security Rodolfo.kohn@intel.com
  • 2. Original Agenda Remembering “Distributed System Design” and availability Introduction to Design for Failures • Failure modes • Redundancy (process and data) • Failure detection • Failure recovery • Cascade failures and recovery Redundancy and high availability in AWS Eventual consistency problems Performance and scalability problems Operations monitoring • Techniques to avoid false positives Logs and counters Design software for failures Testing availability Measuring availability Education 7/10/20162
  • 3. Agenda Remembering “Distributed System Design” and availability Introduction to Design for Failures • Redundancy – Process – Data: Replication (multi-master, master-slave) – Flat groups and hierarchical groups • Synchronization Model • Stateful vs Stateless • Eventual consistency • CAP Theorem • Failure detection • Failure recovery • Cascade failures and recovery 7/10/20163
  • 4. (Cloud or Distributed) Applications are Complex 7/10/20164 DNS Server .com Root GLB Auth Datacenter-1 GLB Auth Datacenter-2 Service Cache Cache Cache Cache DNS Disk Network SMTP CDN NoSQL SQL Monitoring Logs Configuration Management Multiple Opportunities for Unexpected Failures Brittle Systems shall not Survive Load bursts & Response time deterioration
  • 5. Micro-services dependencies In distributed systems, and cloud systems, there are complex dependencies between systems such that failure of one component can bring down the whole system 7/10/20165
  • 6. What is Availability? Distributed Systems: Principles and Paradigms (2nd Edition), Andrew Tanenbaum, Maarten Van Steen “Availability is defined as the property that a system is ready to be used immediately. In general, it refers to the probability that the system is operating correctly at any given moment and is available to perform its functions on behalf of its users. In other words, a highly available system is one that will most likely be working at a given instant in time.” 3/4/5 9’s of Availability: see Wikipedia :) 7/10/20166 The system is always running correctly When users access it, they have it
  • 7. Systems fail … 7/10/20167 http://techcrunch.com/2012/10/22/aws-ec2-issues-in-north- virginia-affect-heroku-reddit-and-others-heroku-still-down/ What started as a small issue affecting some instances of Amazon’s Elastic Cloud Compute (EC2) in North Virginia became a full- blown outage of AWS in North Virginia. Major services, such as Reddit, Foursquare, Minecraft and Heroku, are down. GitHub, imgur, Pocket, HipChat, Coursera and others are affected … And DOWNTIME COMES …
  • 10. We don’t avoid failures, we live with them Design for Failures is about focusing on the Error Path 7/10/201610 PAINFUL AND TIME CONSUMIG
  • 11. Failures affecting Availability Different types of failures • Infrastructure failures • Software failures • Operations failures • Deployment failures System updates or upgrades may affect availability if they require downtime Bad response time affects availability • Unacceptable response time = system unavailable • Bad scalability eventually affects response time – Vulnerability to load peaks Manual Path to Production affects availability Neglected business/process situations affect availability 7/10/201611
  • 12. Valid for all businesses As core business moves to the Internet, downtime means money More possibilities of failure: • (Cloud) systems are becoming increasingly complex • Software undergoes stringent conditions • There is a demand for excellent user experience • In the cloud, applications run on commodity hardware
  • 13. It’s about the whole big machinery 7/10/201613 Product/Service Requirements Development Deployment and Operation Path to production PDM and CXD must think about alternative paths on error conditions Architects design for Availability (Software and Infrastructure) Agile teams Distributed Systems Skills Availability, Scalability, Performance mindset Fast, automated, error free DevOps, Monitoring, Operations Automation
  • 14. From Architecture to Development Architecture: redundancy model and management, dependency management, state model, synchronization model, failure detection, recovery, scalability model, administration/configuration management Design: logging design, monitoring design, dependency handling, state management design (stateful and stateless), consistency, fallback actions on failures per operation… Development: consistency handling, retries, error analysis, logging, error path (if ... else …), …
  • 15. Topics Redundancy (process and data) Flat (P2P) Groups vs Hierarchical groups State: stateless vs stateful Replication Synchronization: asynchronous vs. synchronous Eventual Consistency CAP Failure detection Recovery actions Cascade Failures Client recovery in client/server
  • 16. Redundancy It is about provisioning in excess, replicating hardware or software components or data It allows masking failures as a mechanism of fault tolerance Additional hardware equipment or software processes are provided When a component fails another one in the group takes over its work Data replication, associated with a component replication, keeps data safe in face of a component failure 7/10/201616
  • 17. Redundancy and groups Process redundancy implies the creation of groups of replicated processes The group is seen by other processes as a single process • Replication is abstracted to be seen as one entity • The same happens with hardware 7/10/201617
  • 18. Two types of groups 7/10/201618 Flat group or peer-to-peer Hierarchical group Coordinator Worker
  • 19. Design Considerations Group creation and destruction • Group bootstrapping Group membership • Processes can join and leave a group Decision making • Task distribution, synchronization, consistency, etc.
  • 20. Different challenges Hierarchical group • The coordinator, primary, or master knows and controls all workers • Simpler control and management • If the coordinator fails, the group crashes Flat group or peer-to-peer • There is a need for agreement or consensus algorithms – For coordinator election – For consistency – For synchronization – For faulty process detection – For membership change detection • Data distribution • If any member crashes, the group continues working; it just shrinks
  • 21. Hierarchical group: Pool of servers controlled by a Load Balancer 7/10/201621 Load balancer detects unresponsive server and removes it A new server is added to the pool. Manually or automatically. All other processes/applications/systems sending requests to this group see it as just one process The LB distributes work and controls workers
  • 22. Faulty process and server detection Load balancer sends health checks to servers in the pool detecting failing servers • It can monitor at different stack layers – In the case of AWS ELB: TCP, SSL, HTTP, HTTPS – F5 can also test at different stack layers • Failing servers can be automatically de-registered • New healthy servers can be added to the pool 7/10/201622
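The health-check loop described above can be sketched as follows. This is a minimal illustration, not the behavior of AWS ELB or F5: the `ServerPool` class, the `probe` callback and the `unhealthy_threshold` parameter are hypothetical names, and the probe stands in for a real TCP or HTTP check.

```python
from typing import Callable, Dict, Iterable, List, Set


class ServerPool:
    """Toy load-balancer pool that de-registers servers failing health checks."""

    def __init__(self, servers: Iterable[str], probe: Callable[[str], bool],
                 unhealthy_threshold: int = 2) -> None:
        self.probe = probe  # returns True when the server answers the check
        self.unhealthy_threshold = unhealthy_threshold
        self.failures: Dict[str, int] = {s: 0 for s in servers}
        self.in_service: Set[str] = set(self.failures)

    def register(self, server: str) -> None:
        """Add a new (or replacement) server to the pool."""
        self.failures[server] = 0
        self.in_service.add(server)

    def run_health_checks(self) -> None:
        """Probe every known server once and update the in-service set."""
        for server in list(self.failures):
            if self.probe(server):
                self.failures[server] = 0
                self.in_service.add(server)
            else:
                self.failures[server] += 1
                # Require several consecutive failures to avoid false positives.
                if self.failures[server] >= self.unhealthy_threshold:
                    self.in_service.discard(server)

    def healthy_servers(self) -> List[str]:
        return sorted(self.in_service)
```

Requiring more than one consecutive failure before removing a server is a design choice that trades detection speed for fewer false positives.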
  • 23. Flat group: Cassandra A cluster of Cassandra nodes • Information is transmitted with a gossip protocol • If a node detects a new node or a faulty node, it transmits that information through the gossip protocol • Heartbeats with other nodes to detect faulty nodes with Phi Accrual Failure Detectors
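A phi accrual failure detector outputs a suspicion level (phi) that grows with the time since the last heartbeat, instead of a binary up/down verdict. The sketch below is a simplified illustration that assumes exponentially distributed heartbeat intervals; the class and method names are hypothetical and it is not Cassandra's actual implementation.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual detector (assumes exponential heartbeat intervals)."""

    def __init__(self, window: int = 100) -> None:
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: -log10 of the probability that a heartbeat is merely late."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # P(no heartbeat yet) = exp(-elapsed / mean); phi = -log10(P).
        return elapsed / (mean * math.log(10))
```

A caller would compare `phi(now)` against a threshold chosen for the deployment: a higher threshold means slower detection but fewer false suspicions under variable network latency.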
  • 24. Flat group: Cassandra A cluster of Cassandra nodes 7/10/201624
  • 25. Flat group: OSPF I would say OSPF routers form a flat group • Routers use a link-state routing protocol to transmit connectivity information • Routers can detect neighbor failures through the Hello protocol and transmit the data as link states
  • 26. Data Redundancy Data stores may be replicated for high availability • Database replication • Disk replication Data redundancy is also found at other levels • RAID disks • In communications: CDMA uses Hamming code to recover from errors We focus on higher level failures that affect operations: a database, SAN, whole platform, datacenter 7/10/201626
  • 27. Data Redundancy SQL and NoSQL Databases allow different replication models • Master-Master – All replicas can be read and written • Master-Slave – All replicas read, only master can be written – In case of master failure, a slave must take over 7/10/201627
  • 28. Database Replication (1) Replication: Data is replicated in all instances Partitioning: Data is partitioned across different instances • This is not replication Data Data Data Data Data Clients from America Clients from Europe
  • 29. Database Replication (2) Replication Master-Slave: Write in one instance, Read from all instances Data Data Data WRITE READ REPLICATION
  • 30. Database Replication (3) Replication Master-Master or Multi-master or peer-to-peer: Write in all instances, Read from all instances Possibility of conflicts in asynchronous mode: • Same row updated in different replicas • Two inserts in different replicas • Delete and insert/update Data Data Data WRITE READ REPLICATION
  • 31. Synchronous vs. Asynchronous Replication Synchronous replication assures a write will occur in all instances at the same time • Either multi-master or master-slave In asynchronous replication write is sent to one node and then replicated to other nodes • Either multi-master or master-slave • There is a lag in write replication • At a point in time data might not be the same in all nodes (eventual consistency)
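The difference between the two modes can be shown with a toy in-memory store. This is a minimal sketch under simplifying assumptions: replicas are plain dictionaries, replica 0 accepts all writes, and the `ReplicatedStore` name and its methods are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple


class ReplicatedStore:
    """Toy store: replica 0 takes writes, the others receive copies."""

    def __init__(self, replica_count: int, synchronous: bool) -> None:
        self.replicas: List[Dict[str, Any]] = [dict() for _ in range(replica_count)]
        self.synchronous = synchronous
        self.pending: List[Tuple[str, Any]] = []  # writes not yet shipped

    def write(self, key: str, value: Any) -> None:
        self.replicas[0][key] = value
        if self.synchronous:
            # The write returns only after every replica has applied it.
            for replica in self.replicas[1:]:
                replica[key] = value
        else:
            # The write returns immediately; replication happens later.
            self.pending.append((key, value))

    def replicate(self) -> None:
        """Ship queued writes to the other replicas (the replication lag ends here)."""
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()

    def read(self, key: str, replica: int) -> Optional[Any]:
        return self.replicas[replica].get(key)
```

In asynchronous mode, a read from replica 1 or 2 between `write` and `replicate` returns stale or missing data, which is the window that eventual consistency refers to.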
  • 32. Synchronous Replication Synchronous replication assures a write will occur in all instances at the same time • All servers (both masters and slaves) have up-to-date data (A and C in ACID) • Provides ACID capabilities • High availability • Simpler for developers • Implementation through two-phase commit or distributed locks, which may slow the system down • No write scalability • Performance might be affected • Possibility of deadlocks Galera cluster for MySQL http://galeracluster.com/ Galera Replication is a synchronous multi-master replication plug-in for InnoDB
  • 33. Asynchronous Replication Write occurs in one node and then it replicates to other nodes • Less complex (no two-phase commit or distributed locks) • High availability across datacenters • Better write scalability • Eventual consistency • Write conflicts among masters • Loss of synchronization is a problem to solve • More difficult for developers (eventual consistency, write conflicts) This type of replication is the basic one offered by MySQL, PostgreSQL and MariaDB (and SQL Server???)
  • 34. Multi-master with Cassandra Source: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html Asynchronous replication Tunable consistency
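Tunable consistency is often explained with the overlap rule R + W > N: with N replicas, a read that contacts R replicas is guaranteed to reach at least one replica that acknowledged the latest write when that write was acknowledged by W replicas. The helper names below are illustrative only.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(n: int, w: int, r: int) -> bool:
    """True when any R replicas read must include one of the W replicas written."""
    return r + w > n


if __name__ == "__main__":
    n = 3
    # Quorum writes and quorum reads overlap: 2 + 2 > 3.
    print(read_overlaps_write(n, quorum(n), quorum(n)))
    # Writing to one replica and reading from one does not: 1 + 1 <= 3.
    print(read_overlaps_write(n, 1, 1))
```

Lower values of R and W give faster, more available operations at the cost of possibly stale reads, which is the trade-off the slide calls "tunable".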
  • 35. P2P Database Solutions • Dynamo DB – http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf • Cassandra – https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf • Netflix’s Dynomite (Redis and Memcached) – http://techblog.netflix.com/2014/11/introducing-dynomite.html – https://github.com/Netflix/dynomite
  • 36. Consistency and Design for Failures When working with asynchronous replication you need to deal with eventual consistency • With asynchronous processes in general, it is possible that when a process goes to read something that should be there, it is not there yet It could take milliseconds or many seconds Under heavy load it gets worse Write conflicts are another issue you need to deal with • Need to have alarm and repair scripts if an automated solution is not possible Asynchronous, Fire and forget, Future, Let it be …
  • 37. Eventual consistency Applications Data Applications Applications Data Load Balancer Applications Replication after some time 1-WRITE •Eventually both DB instances have the same data 2 3 4
  • 38. Eventual consistency problem Applications Data Applications Applications Data Load Balancer Applications Replication after some time 1-WRITE 4-READ • Read-after-write problem • Specific solution for each case • Cannot trust replication will occur after some time 2 3 5 6 7
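One possible case-specific mitigation for the read-after-write problem is to have the reader demand a minimum version and fall back to the instance that accepted the write. The sketch below is only an illustration under stated assumptions: records are `(version, value)` tuples, and `read_replica`, `read_primary` and `min_version` are hypothetical names, not part of any particular database API.

```python
import time
from typing import Any, Callable, Optional, Tuple

Record = Tuple[int, Any]  # (version, value); an assumed record shape


def read_at_least(read_replica: Callable[[str], Optional[Record]],
                  read_primary: Callable[[str], Optional[Record]],
                  key: str, min_version: int,
                  attempts: int = 3, delay_s: float = 0.05) -> Optional[Record]:
    """Read `key`, refusing replica data older than the version just written."""
    for _ in range(attempts):
        record = read_replica(key)
        if record is not None and record[0] >= min_version:
            return record  # the replica has caught up
        time.sleep(delay_s)  # bounded wait; never assume replication "soon"
    # Replication did not arrive in time: go to the instance that took the write.
    return read_primary(key)
```

The bounded retry matters: as the slide notes, the code cannot trust that replication will occur after some fixed time, so it needs an explicit fallback path.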
  • 39. From Architecture to Development Designers and developers must understand the consequences of each architecture Typical questions/comments that predict issues in distributed systems (100% certainty) • By comparing operations’ time I can determine order • How long does it take to replicate data? • We tested it and it is replicating very fast, no problems • It’s fast. It’s just fire and forget (asynchronous): check if there is a subsequent read associated 7/10/201639
  • 40. Asynchronous Replication in Active-Active Network partitioning 7/10/201640 DNS Server .com Root GLB Auth Datacenter-1 GLB Auth Datacenter-2 Service Cache Cache Cache Cache DNS Disk Disk
  • 41. Why the hassle of P2P/flat Best solution for high availability Self-managed system Best horizontal and dynamic scalability Usually, can still write after network partition 7/10/201641
  • 42. Brewer’s Conjecture and CAP Theorem • Consistency, Availability, and Partition Tolerance are all desired features of database systems. • However, it is not possible to have all of them: pick only two. A C P Availability: Each client can always read and write Consistency: All clients always have the same view of the data Partition Tolerance: System works well despite physical network partitions CA: RDBMS AP: Dynamo, Cassandra CP: MongoDB,