Zero Downtime Architectures
Alexander Penev
ByteSource Technology Consulting GmbH
Neubaugasse 43
1070, Vienna
Austria
whoami
Alexander Penev
Email: alexander.penev@bytesource.net
Twitter: @apenev
@ByteSourceNet
JEE, Databases, Linux, TCP/IP
Fan of (automatic) testing, TDD, ADD, BDD…..
Likes to design highly available and scalable systems :-)
Zero Downtime Architectures
● Based on a customer project with the classic JEE application stack
● Classic web applications with server side code
● HTTP based APIs
● Goals, Concepts and Implementation Techniques
● Constraints and limitations
● Development guidelines
● How these concepts can be applied to the new cutting-edge technologies
● Single-page JavaScript-based apps
● Mobile clients
● REST APIs
● Node.js
● NoSQL stores
Zero Downtime Architecture?
● My database server has 99.999% uptime
● We have a Tomcat cluster
● Redundant power supply
● Second Datacenter
● Load Balancer
● Distribute routes over OSPF
● Deploy my application online
● Second ISP
● Session Replication
● Monitoring
● Data Replication
● Auto restarts
Zero Downtime architecture: our definition
From the end user's point of view, the services are always available
Our Vision
Identify all sources of downtime and remove them all
http://www.meteleco.com/wp-content/uploads/2011/09/p360.jpg
When could we have a downtime (unplanned)?
● Human errors
● Server node has crashed
● Power supply is broken, RAM Chip burned out, OS just crashed
● Server Software just crashed
● IO errors, software bug, tablespace full
● Network is unavailable
● Router crashed, Uplink down
● Datacenter is down
● Uplinks down (the notorious excavator :-) )
● Flood/Fire
● Air conditioning broken
● Hit by a nuke (not so often :-) )
When could we need a downtime (planned)?
● Replace a hardware part
● Replace a router/switch
● Firmware upgrade
● Upgrade/exchange the storage
● Configuration of the connection pool
● Configuration of the cluster
● Upgrade the cluster software
● Recover from a logical data error
● Upgrade the database software
● Deploy a new version of our software
● Move the application to another data center
How can we avoid downtime?
● Redundancy
● Hardware, network
● Uplinks
● Datacenters
● Software
● Monitoring
● Detect exhausted resources before the application notices it
● Detect a failed node and replace it
● Software design
● Idempotent service calls
● Backwards compatibility
● Live releases
● Scalability
● Scale on more load
● Protect from attacks (e.g. DDoS)
Requirements for a Zero Downtime Architecture:
handling of events of failure or maintenance
Event/Application category | Online applications | Batch jobs
Failure or maintenance of an internet uplink/router/switch | Yes | Yes
Failure or maintenance of a firewall node, loadbalancer node or a network component | Yes | Yes
Failure or maintenance of a webserver node | Yes | N/A
Failure or maintenance of an application server node | Yes | partly (will be restarted)
Failure or maintenance of a database node | Yes | partly
Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
New application deployment | Yes | Yes
Upgrade of operating system | Yes | Yes
Upgrade of an arbitrary middleware software | Yes | Yes
Upgrade of database software | Yes | Yes
Overload of processing nodes | Yes | Yes
Failure of a single JVM | Yes | No
Failure of a node due to leak of system resources | Yes | No
Our goals and constraints
● Reduce downtime to 0
● Keep the costs low
● No expensive proprietary hardware
● Minimize the potential application changes/rewrites
http://www.signwarehouse.com/blog/how-to-keep-fixed-costs-low/
Our Concepts 1/4
● Independent Applications or Application Groups
● One Application (Group) = IP Address
● Communication between applications exclusively over this IP address!
http://www.binaryguys.de/media/catalog/product/cache/1/image/313x313/9df78eab33525d08d6e5fb8d27136e95/3/6/36.noplacelikelocalhost_1_4.jpg
Our Concepts 2/4
Treat the internet and internal traffic
independently
Our Concepts 3/4
● Reduce the downtime within a datacenter to 0
● High available network
● Redundant firewalls and load balancers
● Web server farms
● Application server clusters with session replication
● Oracle RAC Cluster
● Downtime free application deployments
Our Concepts 4/4
● Replicate the data on both datacenters
● and make the applications switchable
Implementation: Network (Layer 2)
Concepts: Internet traffic, BGP (Border Gateway Protocol) 1/2
● Every datacenter has fully redundant uplinks
● Own provider-independent IP address range (assigned by RIPE)
● Hard to get at the moment (but not impossible)
● Propagate these addresses to the rest of the internet through both ISPs using BGP
● Both DCs announce our addresses
● The network path of one announcement could be preferred (for cost reasons)
● Switch of internet traffic
● Gracefully, by changing the preferences of the announcements
– No single TCP session lost
● In case of disaster the backup route is propagated automatically within seconds to minutes (depending on the internet distance)
● Protects us from connectivity problems between our ISPs and our customers' ISPs
[Diagram: 10.8.8.0/24 announced from both datacenters]
Concepts: Internet traffic, use DNS ? 2/2
● We don't use DNS for switching
● A datacenter switch based on DNS could take up to months to reach all customers and their software (e.g. JVMs caching DNS entries, the default behaviour; see the sketch below)
● No need to restart browsers, applications and proxies on the customer side. The customer doesn't see any change at all (except that the route to us has changed)
● DNS is good for load balancing but not for High Availability!
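As a side note on the JVM DNS caching mentioned above: if a setup did have to rely on DNS switching, the JVM cache TTL can at least be lowered. A minimal sketch, assuming the standard networking security properties (by default a JVM running with a security manager caches successful lookups indefinitely):
// Lower the JVM-wide DNS cache TTLs early at startup.
public class DnsCacheConfig {
    public static void main(String[] args) {
        // cache successful DNS lookups for 60 seconds instead of indefinitely
        java.security.Security.setProperty("networkaddress.cache.ttl", "60");
        // cache failed lookups only briefly
        java.security.Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}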
Concepts: Internal traffic
● OSPF (Open Shortest Path First) protocol for dynamic routing
● Deals with redundant paths completely transparently
● Can also do load balancing
● The second level firewalls (in front of the load balancers) announce the address to
the rest of the routers
● To switch the processing of a service, its firewall just has to announce the route (could also be a /32) with a higher priority; after a second the traffic goes through the new route.
● Could also be used for an unattended switch of the whole datacenter
● Just announce the same IPs from both sites with different priorities
● If one datacenter dies, only the announcements from the other one remain
[Diagram: 10.8.8.23 announced from both datacenters]
Our Concepts
● Independent Applications or Application Groups
● Independent Internet and internal network traffic
● Reduce Downtime within a DC
● Replicate the data between the DCs and make the application switchable
Zero Downtime within a datacenter
● High Available network
● Redundant switches
– Again using Spanning Tree Protocol
● Redundant firewalls, routers, load balancers
– Active/Passive Clusters
– VRRP protocol implemented by keepalived
– iptables with conntrackd
● Web Server Apache farms
● Managed by load balancer
● Application Server Cluster
● Weblogic Cluster
● With session replication, automatic retries and restarts
● Oracle RAC database cluster
● Deployment without
downtime
Failover within one datacenter: Apache plugin (mod_wl)
Session ID Format: sessionid!primary_server_id!secondary_server_id
Source: http://egeneration.beasys.com/wls/docs100/cluster/wwimages/cluster-06-1-2.gif
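Purely illustrative (not the mod_wl implementation): a small Java sketch that splits a session ID of the format above to see which node serves the session and which holds the replica; the ID value is made up:
public class SessionRoute {
    public static void main(String[] args) {
        // hypothetical value in the format sessionid!primary_server_id!secondary_server_id
        String sessionId = "Bx7rTq0Zk!node1!node2";
        String[] parts = sessionId.split("!");
        String primary = parts.length > 1 ? parts[1] : null;   // node serving the session
        String secondary = parts.length > 2 ? parts[2] : null; // node holding the replicated copy
        System.out.println("primary=" + primary + ", secondary=" + secondary);
    }
}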
Development guidelines (HTTPSession)
● If you need a session then you most probably want to replicate it
● Example (weblogic.xml)
● Generally all requests of one session go to the same application instance
● When it fails (answers with a 50x, dies, or does not answer within a given period), the backup instance takes over
● The session attributes are only replicated to the backup node when HTTPSession.setAttribute was called. HTTPSession.getAttribute("foo").changeSomething() will not be replicated! (see the sketch below)
● Every attribute stored in the HTTPSession must be serializable!
● The ServletContext will not be replicated in any case.
● If you implement caches they will probably have different contents on every node (unless we use a 3rd-party cluster-aware cache). Probably the best practice is not to rely on the data being present and to declare the cache transient
● Keep the session small in size and reattach (setAttribute) attributes regularly.
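A minimal sketch of the setAttribute and transient-cache guidelines above, assuming the servlet API is on the classpath; the "cart" attribute and the Cart class are made up:
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import javax.servlet.http.HttpSession;

// Illustrative only: session attributes are Serializable, every mutation is followed by
// setAttribute() so the backup node sees the change, and node-local caches are transient.
public class CartSessionHelper {

    public static void addItem(HttpSession session, String item) {
        Cart cart = (Cart) session.getAttribute("cart");
        if (cart == null) {
            cart = new Cart();
        }
        cart.add(item);
        // Mutating the object returned by getAttribute() is not replicated on its own;
        // calling setAttribute() again is what marks the attribute for replication.
        session.setAttribute("cart", cart);
    }

    public static class Cart implements Serializable {
        private final List<String> items = new ArrayList<>();
        // derived, node-local data: never rely on it being present after a failover
        private transient List<String> recommendationCache;

        void add(String item) {
            items.add(item);
        }
    }
}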
Development guidelines (cluster handling)
● Return proper HTTP return codes to the client
● Common practice is to return a well formed error page with HTTP code 200
● It is a good practice if you are sure that the cluster is incapable of recovering from it (example: a
missing page will be missing on the other node too)
● But an exhausted resource (like heap, datasource) could be present on the other node
● This is hard to implement yourself, therefore Weblogic offers you help:
● You can bind the number of execution threads to a datasource capacity
● Shut down the node if an OutOfMemoryError occurs but use it with extreme care!
● Design for idempotence
● Make all your methods idempotent as far as possible.
● For those that cannot be idempotent (e.g. sendMoney(Money money, Account account)) prevent re-execution:
– By using a ticketing service (see the sketch below)
– By declaring it as not idempotent:
<LocationMatch /pathto/yourservlet>
               SetHandler weblogic-handler
               Idempotent OFF
</LocationMatch>
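A minimal sketch of the ticketing idea for non-idempotent calls; all names are invented, and a real implementation would keep the consumed tickets in a store visible to both cluster nodes (e.g. a database table) instead of in memory:
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: the client obtains a ticket first and sends it along with the request.
// If mod_wl retries the request on the backup node, the same ticket arrives twice
// and the second execution is rejected instead of sending the money twice.
public class MoneyTransferService {

    private final Set<String> usedTickets = ConcurrentHashMap.newKeySet();

    public void sendMoney(String ticket, long amountCents, String account) {
        if (!usedTickets.add(ticket)) {
            return; // ticket already consumed: the call was retried, do nothing
        }
        // ... perform the actual, non-idempotent transfer exactly once ...
    }
}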
Development guidelines (Datasources)
● Don't build your own connection pools, take them from the Application Server by JNDI
Lookup
● As we are using Oracle RAC, the datasource must be a multipool consisting of single datasources per RAC node
– One can take one of the single datasources out of the multipool (online)
– Load balancing is guaranteed
– Reconfiguring the pool online
● Example Spring config:
● Example without Spring:
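The Spring and non-Spring examples were screenshots in the original slides and are not part of this transcript. A minimal sketch of the non-Spring case, looking up the server-managed (multi)pool via JNDI instead of building an own pool; the JNDI name is made up:
import java.sql.Connection;
import java.sql.SQLException;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

// Sketch: obtain the container-managed datasource (the multipool) by JNDI lookup.
public class OrdersDao {

    private final DataSource dataSource;

    public OrdersDao() throws NamingException {
        InitialContext ctx = new InitialContext();
        this.dataSource = (DataSource) ctx.lookup("jdbc/ordersMultiPool"); // made-up name
    }

    public void doWork() throws SQLException {
        try (Connection con = dataSource.getConnection()) {
            // use the connection; close() returns it to the server-managed pool
        }
    }
}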
Basic monitoring
● Different possibilities for monitoring on Weblogic
● Standard admin console
– Threads (stuck, in use, etc), JVM (heap size, usage etc.), online thread dumps
– Connection pools statistics
– Transaction manager statistics
– Application statistics (per servlet), WorkManager statistics
● Diagnostic console
– Online monitoring only
– All attributes exposed by Weblogic MBeans can be monitored (a generic JMX sketch follows after this list)
– Demo: diagnostics console
● Diagnostic images
– On demand, on shutdown, regularly
– Useful for problem analysis (especially for post-crash analysis)
– For analysing resource leaks: Demo: analyse a connection leak and a stuck thread
● SNMP and diagnostic modules
– All MBean attributes can be monitored by SNMP
– Gauge, string, counter monitors, log filters, attribute changes
– Collected metrics, watches and notifications
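A generic JMX sketch of reading an MBean attribute remotely, as referenced in the list above. It reads the standard JVM memory MBean rather than a Weblogic-specific one; host, port and the 90% threshold are made up:
import java.lang.management.MemoryMXBean;
import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: poll heap usage over JMX so an exhausted resource is detected before
// the application notices it.
public class HeapMonitor {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://appserver.example:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            MemoryMXBean memory = JMX.newMXBeanProxy(
                    conn, new ObjectName("java.lang:type=Memory"), MemoryMXBean.class);
            long used = memory.getHeapMemoryUsage().getUsed();
            long max = memory.getHeapMemoryUsage().getMax();
            if (max > 0 && used > 0.9 * max) {
                System.err.println("Heap usage above 90%: " + used + " of " + max + " bytes");
            }
        } finally {
            connector.close();
        }
    }
}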
Zero downtime deployment
● 2 clusters within one datacenter
● Managed by Apache LB
● (simple script based on the session ID)
● Both are active during normal operations
● Before we deploy the new release we
switch off cluster 1
● Old sessions go to both cluster 1 and 2
● New sessions go to cluster 2 only
● When all sessions of cluster 1 expire we deploy
the new version
● Test it
● If everything is OK, we put it back into the Apache load balancer
● Now we take cluster 2 off
● Until all sessions expire
● The same procedure as above
● Then we deploy on the second datacenter
Our Concepts
● Independent Applications or Application Groups
● Independent Internet and internal network traffic
● Reduce/avoid Downtime within a DC
● Replicate the data between the DCs and make
the application switchable
Our requirements again
Event/Application category | Online applications | Batch jobs
Failure or maintenance of an internet uplink/router/switch | Yes | Yes
Failure or maintenance of a firewall node, loadbalancer node or a network component | Yes | Yes
Failure or maintenance of a webserver node | Yes | N/A
Failure or maintenance of an application server node | Yes | partly (will be restarted)
Failure or maintenance of a database node | Yes | partly
Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
New application deployment | Yes | Yes
Upgrade of operating system | Yes | Yes
Upgrade of an arbitrary middleware software | Yes | Yes
Upgrade of database software | Yes | Yes
Overload of processing nodes | Yes | Yes
Failure of a single JVM | Yes | No
Failure of a node due to leak of system resources | Yes | No
Replicate the data between the DCs
● Bidirectional data replication between DCs
● Oracle Streams/Golden Gate
http://docs.oracle.com/cd/E11882_01/server.112/e10705/man_gen_rep.htm#STREP013
Cross Cluster replication: 2 clusters in 2 datacenters
Application groups
● One or more applications without hard dependencies to or from other applications
● Why application groups
● Switching many applications at once leads to long downtimes and higher risk
● Switching a single one is not possible if there are hard dependencies on database level to other applications
● Identify groups of applications that are critically dependent on each other but not on other applications outside the group
● Always switch such groups at once
● The bigger the group, the longer the downtime
– A single application in the HA category will be able to switch without any downtime, just delayed requests
● A dependency is critical (hard) if it leads to issues (editing the same record in different DCs will definitely be problematic, reading data for reporting is not)
– Must be identified on a case-by-case basis
Identify application groups
Switch application by application
Example of a switch procedure of an application group
Applications: Limitations
Limitation/Categories
No bulk transactions
No DB sequences
No file based sequences
No shared file system storage
Use a central batch system
All new releases have to be compatible with the previous release.
Stick to the infrastructure
Our Concepts
● Independent Applications or Application Groups
● Independent Internet and internal network traffic
● Reduce/avoid Downtime within a DC
● Replicate the data between the DCs and make
the application switchable
Our requirements once again
Event/Application category | Online applications | Batch jobs
Failure or maintenance of an internet uplink/router/switch | Yes | Yes
Failure or maintenance of a firewall node, loadbalancer node or a network component | Yes | Yes
Failure or maintenance of a webserver node | Yes | N/A
Failure or maintenance of an application server node | Yes | partly (will be restarted)
Failure or maintenance of a database node | Yes | partly
Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
New application deployment | Yes | Yes
Upgrade of operating system | Yes | Yes
Upgrade of an arbitrary middleware software | Yes | Yes
Upgrade of database software | Yes | Yes
Overload of processing nodes | Yes | Yes
Failure of a single JVM | Yes | No
Failure of a node due to leak of system resources | Yes | No
Modern Architectures: how do the concepts fit?
Modern Architectures: Application Layer
● Web apps
● Completely independent of the backend
● Using only REST APIs
● 90% of the state is locally managed (supported by frameworks like AngularJS and BackboneJS)
● Must be compatible with different versions of the REST API (at least 2 versions)
● If websockets are used, it gets more tricky, see backend.
● New mobile versions managed by app stores
● Good to have an upgrade reminder (to limit the supported versions)
● REST APIs must be versioned and backwards compatible (see the sketch after this list)
● Messaging over message clouds is transparent. HA is managed by the vendors
● Stateful services
● e.g. OAuth v1/v2
– Normally by DB persistence
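A hedged sketch of path-based REST API versioning with JAX-RS, as referenced above; resource paths and payloads are invented, and a real API would use proper DTOs instead of hand-built JSON strings:
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Sketch: keep at least two API versions deployed so older web and mobile clients
// keep working while a new release is rolled out.
@Path("/api")
public class CustomerResource {

    @GET
    @Path("/v1/customers/{id}")
    @Produces(MediaType.APPLICATION_JSON)
    public String getCustomerV1(@PathParam("id") String id) {
        // old representation, kept backwards compatible for existing clients
        return "{\"id\":\"" + id + "\",\"name\":\"example\"}";
    }

    @GET
    @Path("/v2/customers/{id}")
    @Produces(MediaType.APPLICATION_JSON)
    public String getCustomerV2(@PathParam("id") String id) {
        // new representation; only additive changes relative to v1 where possible
        return "{\"id\":\"" + id + "\",\"name\":\"example\",\"segment\":\"retail\"}";
    }
}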
Session Replication
● Less needed than with server-side applications
● Frameworks like AngularJS, BackboneJS, Ember etc. manage their own sessions, routings etc.
● but still needed
● Weblogic: no change
● Tomcat, possibly with a JDBC store
● Jetty with Terracotta
● Node.js: secure (digitally signed) sessions stored in cookies (a Java sketch of the same idea follows below)
– Senchalabs Connect
– Mozilla/node-client-sessions
● https://hacks.mozilla.org/2012/12/using-secure-client-side-sessions-to-build-simple-and-scalable-node-js-applications-a-node-js-holiday-season-part-3/
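The Node.js libraries above keep the session in a digitally signed cookie. A minimal sketch of the same idea, written in Java only to keep this deck's examples in one language; key management, expiry handling and constant-time signature comparison are left out:
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Sketch: the session payload travels in the cookie itself, protected by an HMAC,
// so no server-side session store or replication is needed.
public class SignedCookieSession {

    private final byte[] key;

    public SignedCookieSession(byte[] key) {
        this.key = key.clone();
    }

    public String encode(String sessionJson) throws Exception {
        String payload = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(sessionJson.getBytes(StandardCharsets.UTF_8));
        return payload + "." + sign(payload);
    }

    // Returns the session payload, or null if the cookie was tampered with.
    public String decode(String cookieValue) throws Exception {
        int dot = cookieValue.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String payload = cookieValue.substring(0, dot);
        String signature = cookieValue.substring(dot + 1);
        if (!sign(payload).equals(signature)) {
            return null;
        }
        return new String(Base64.getUrlDecoder().decode(payload), StandardCharsets.UTF_8);
    }

    private String sign(String payload) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        byte[] signature = mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
        return Base64.getUrlEncoder().withoutPadding().encodeToString(signature);
    }
}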
Backend: Bidirectional Data Replication
● Elasticsearch
● Currently no cross-cluster replication
● But it is on their roadmap
● CouchDB
● Very flexible replication, whether within one datacenter or across several
● Bidirectional replication is possible (see the sketch below)
● MongoDB
● One-directional replication is possible and mature
● Bidirectional is not possible at the moment
● Workaround would be: one MongoDB per app and strict separation of the apps
● Hadoop HDFS
● Currently no cross-cluster replication available
● e.g. Facebook wrote their own replication for Hive
● Will possibly arrive soon with Apache Falcon http://falcon.incubator.apache.org/
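For the CouchDB case above, bidirectional replication between two datacenters amounts to two continuous replications, one per direction, started via CouchDB's _replicate endpoint. A hedged sketch; hostnames and the database name are made up:
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: start a continuous replication DC A -> DC B and another one DC B -> DC A.
public class CouchBidirectionalReplication {

    public static void main(String[] args) throws Exception {
        String dcA = "http://couch.dc-a.example:5984";
        String dcB = "http://couch.dc-b.example:5984";
        startReplication(dcA, dcA + "/orders", dcB + "/orders"); // DC A -> DC B
        startReplication(dcB, dcB + "/orders", dcA + "/orders"); // DC B -> DC A
    }

    static void startReplication(String server, String source, String target) throws Exception {
        String body = "{\"source\":\"" + source + "\",\"target\":\"" + target + "\",\"continuous\":true}";
        HttpURLConnection con = (HttpURLConnection) new URL(server + "/_replicate").openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("_replicate returned HTTP " + con.getResponseCode());
    }
}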
Questions?
Thank you for your attention !
Some pictures on this presentation were purchased from iStockphoto LP. The price paid applies for the use of the pictures within the
scope of a standard license, which includes among other things, online publications including websites up to a maximum image size of
800 x 600 pixels (video: 640 x 480 pixels).
Some icons from https://www.iconfinder.com/ are used under the Creative Commons public domain license from the following authors:
Artbees, Neurovit and Pixel Mixer (http://pixel-mixer.com)
All other trademarks mentioned herein are the property of their respective owners.
Backup slides
Big picture example architecture
Key features
● 2 datacenters
● Both active (both datacenters active but probably different applications running on them)
● Independent uplinks
● Redundant interconnect
● Applications are deployed and running on both
● Application cluster in every datacenter
● Session replication within every datacenter
● Cross replication between the 2 datacenters
● e.g. with Weblogic Cluster
● Bidirectional database replication
● e.g. 2 independent Oracle RAC in each datacenter
● Replication over streams/Golden Gate
● Monitoring of all critical resources
● Hardware nodes
● Connection pools
● JVM heaps
● Application switch
Concepts: other network components
● Firewalls
● First level firewalls
– Cisco routers
– Stateless firewalls
– Not very restrictive
● Second level firewalls (in front of the application load balancers)
– Should be stateful
– based on Linux/iptables with conntrackd (for failover)
– Stateful, connection tracking
– Very restrictive
– Rate limiting of new connections (DoS or slashdot effect)
● All firewalls will be/are in active/hot standby mode.
● On a controlled failover (both are running and we switch them) not a single TCP connection should be affected (except small delays)
● In a disaster case it takes some seconds until the cluster software detects the crash of the node and initiates the failover. No TCP connections should be lost, but there is a very small risk
Example of a switch procedure of an application group
● Preparation steps
● Check the health of the replication processes.
● Stop all batch applications (by stopping the job scheduling system). If the time pressure for the switch is high, just kill all running jobs (they should be restartable anyway, even today).
● Switch off the keepalive feature on all httpd servers
● Switching steps
● Change the firewall rules on the second-layer firewalls, so that any new connection request (SYN flag set) is dropped.
● Wait until the data is synchronized on both sides (e.g. by monitoring a
heartbeat table) and no more httpd processes are active.
● Switch the application traffic to the other DC (by changing the routing of their
IP addresses).
● Clean up (remove the dropping of SYN packets on the “old” site etc.)
● This procedure is done per application group until all applications are running
Application clusters (Weblogic)
● Features of Weblogic that we use
● mod_wl
– Manages the stickiness and failover to backup nodes
– Automatic retry of failed requests
● On time-outs
● On response header 50x
● Multipools
– Gracefully remove a database node out of the pool
– Gracefully change parameters of connection pools
– Guaranteed balance of connections between database nodes
● Binding execution threads to connection pools
● Auto shutdown (+ restart) of nodes on OutOfMemoryError
● Session replication (also over both DCs)
● Thread monitoring (detect dead or long running threads etc.)
● Diagnostic images and alarms
Apache plugin failover
Source: e-docs.bea.com
Deployment of connection pools
● One datasource per Oracle RAC node
● Set the initial capacity to a value that will be sufficient for the usual load for the application
– Creation of new connections is expensive
● Set the max capacity to a value that will be sufficient in a high load scenario
– The overall number of connections should match the connection limit on the database side
● Set JDBC parameters in the connection pool and not globally (e.g. v8compatibility=true)
● Check connections on reserve
● You can set db session parameters in the init SQL property (e.g. alter session set
NLS_SORT='GERMAN')
● Enable 2 phase commit only if you need it (expensive)
● Prepared statement caching does not bring much performance (at least for Oracle databases) but costs open cursors in the database (per connection!), so don't use it unless you have a very good reason to do so.
● One Multipool containing all single datasources for one database
● Strategy: load balancing
More Related Content

What's hot

Building Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache KafkaBuilding Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache KafkaBrian Ritchie
 
Integrating Hybrid Cloud Database-as-a-Service with Cloud Foundry’s Service​ ...
Integrating Hybrid Cloud Database-as-a-Service with Cloud Foundry’s Service​ ...Integrating Hybrid Cloud Database-as-a-Service with Cloud Foundry’s Service​ ...
Integrating Hybrid Cloud Database-as-a-Service with Cloud Foundry’s Service​ ...VMware Tanzu
 
Migration of Microsoft Workloads
Migration of Microsoft WorkloadsMigration of Microsoft Workloads
Migration of Microsoft WorkloadsAmazon Web Services
 
Using Oracle Database with Amazon Web Services
Using Oracle Database with Amazon Web ServicesUsing Oracle Database with Amazon Web Services
Using Oracle Database with Amazon Web Servicesguest484c12
 
Scaling Database Modernisation with MongoDB - Infosys
Scaling Database Modernisation with MongoDB - InfosysScaling Database Modernisation with MongoDB - Infosys
Scaling Database Modernisation with MongoDB - InfosysMongoDB
 
Containers Docker Kind Kubernetes Istio
Containers Docker Kind Kubernetes IstioContainers Docker Kind Kubernetes Istio
Containers Docker Kind Kubernetes IstioAraf Karsh Hamid
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...Continuent
 
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...Data Con LA
 
As a Service: Cloud Foundry on OpenStack - Lessons Learnt
As a Service: Cloud Foundry on OpenStack - Lessons LearntAs a Service: Cloud Foundry on OpenStack - Lessons Learnt
As a Service: Cloud Foundry on OpenStack - Lessons LearntAnimesh Singh
 
Multi-master, multi-region MySQL deployment in Amazon AWS
Multi-master, multi-region MySQL deployment in Amazon AWSMulti-master, multi-region MySQL deployment in Amazon AWS
Multi-master, multi-region MySQL deployment in Amazon AWSContinuent
 
Docker based Hadoop provisioning - anywhere
Docker based Hadoop provisioning - anywhereDocker based Hadoop provisioning - anywhere
Docker based Hadoop provisioning - anywhereDataWorks Summit
 
How to build a Citrix infrastructure on AWS
How to build a Citrix infrastructure on AWSHow to build a Citrix infrastructure on AWS
How to build a Citrix infrastructure on AWSDenis Gundarev
 
Container Orchestration with Docker Swarm and Kubernetes
Container Orchestration with Docker Swarm and KubernetesContainer Orchestration with Docker Swarm and Kubernetes
Container Orchestration with Docker Swarm and KubernetesWill Hall
 
AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global A...
AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global A...AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global A...
AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global A...Amazon Web Services
 

What's hot (20)

Kafka Security
Kafka SecurityKafka Security
Kafka Security
 
Building Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache KafkaBuilding Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache Kafka
 
Integrating Hybrid Cloud Database-as-a-Service with Cloud Foundry’s Service​ ...
Integrating Hybrid Cloud Database-as-a-Service with Cloud Foundry’s Service​ ...Integrating Hybrid Cloud Database-as-a-Service with Cloud Foundry’s Service​ ...
Integrating Hybrid Cloud Database-as-a-Service with Cloud Foundry’s Service​ ...
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Migration of Microsoft Workloads
Migration of Microsoft WorkloadsMigration of Microsoft Workloads
Migration of Microsoft Workloads
 
Using Oracle Database with Amazon Web Services
Using Oracle Database with Amazon Web ServicesUsing Oracle Database with Amazon Web Services
Using Oracle Database with Amazon Web Services
 
Svc 202-netflix-open-source
Svc 202-netflix-open-sourceSvc 202-netflix-open-source
Svc 202-netflix-open-source
 
Scaling Database Modernisation with MongoDB - Infosys
Scaling Database Modernisation with MongoDB - InfosysScaling Database Modernisation with MongoDB - Infosys
Scaling Database Modernisation with MongoDB - Infosys
 
Containers Docker Kind Kubernetes Istio
Containers Docker Kind Kubernetes IstioContainers Docker Kind Kubernetes Istio
Containers Docker Kind Kubernetes Istio
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...
 
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...
Data Con LA 2019 - Orchestration of Blue-Green deployment model with AWS Docu...
 
As a Service: Cloud Foundry on OpenStack - Lessons Learnt
As a Service: Cloud Foundry on OpenStack - Lessons LearntAs a Service: Cloud Foundry on OpenStack - Lessons Learnt
As a Service: Cloud Foundry on OpenStack - Lessons Learnt
 
Multi-master, multi-region MySQL deployment in Amazon AWS
Multi-master, multi-region MySQL deployment in Amazon AWSMulti-master, multi-region MySQL deployment in Amazon AWS
Multi-master, multi-region MySQL deployment in Amazon AWS
 
Docker based Hadoop provisioning - anywhere
Docker based Hadoop provisioning - anywhereDocker based Hadoop provisioning - anywhere
Docker based Hadoop provisioning - anywhere
 
How to build a Citrix infrastructure on AWS
How to build a Citrix infrastructure on AWSHow to build a Citrix infrastructure on AWS
How to build a Citrix infrastructure on AWS
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
PASS Summit 2020
PASS Summit 2020PASS Summit 2020
PASS Summit 2020
 
Container Orchestration with Docker Swarm and Kubernetes
Container Orchestration with Docker Swarm and KubernetesContainer Orchestration with Docker Swarm and Kubernetes
Container Orchestration with Docker Swarm and Kubernetes
 
AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global A...
AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global A...AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global A...
AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global A...
 

Viewers also liked

Continuous Delivery and Zero Downtime
Continuous Delivery and Zero DowntimeContinuous Delivery and Zero Downtime
Continuous Delivery and Zero DowntimeAxel Fontaine
 
Moving Towards Zero Downtime
Moving Towards Zero DowntimeMoving Towards Zero Downtime
Moving Towards Zero DowntimeBCM Institute
 
A Proposal for an Alternative to MTBF/MTTF
A Proposal for an Alternative to MTBF/MTTFA Proposal for an Alternative to MTBF/MTTF
A Proposal for an Alternative to MTBF/MTTFASQ Reliability Division
 
Troubleshooting
TroubleshootingTroubleshooting
TroubleshootingJulia .
 
The New Simple: Predictive Analytics for the Mainstream
The New Simple: Predictive Analytics for the Mainstream The New Simple: Predictive Analytics for the Mainstream
The New Simple: Predictive Analytics for the Mainstream Inside Analysis
 
Zero Downtime Deployment
Zero Downtime DeploymentZero Downtime Deployment
Zero Downtime DeploymentJoel Dickson
 
Obstacle escalation process
Obstacle escalation processObstacle escalation process
Obstacle escalation processRavi Tadwalkar
 
Deploying and releasing applications
Deploying and releasing applicationsDeploying and releasing applications
Deploying and releasing applicationsMa Xuebin
 
Unit 9 implementing the reliability strategy
Unit 9  implementing the reliability strategyUnit 9  implementing the reliability strategy
Unit 9 implementing the reliability strategyCharlton Inao
 
10 Things an Operations Supervisor can do Today to Improve Reliability
10 Things an Operations Supervisor can do Today to Improve Reliability10 Things an Operations Supervisor can do Today to Improve Reliability
10 Things an Operations Supervisor can do Today to Improve ReliabilityRicky Smith CMRP, CMRT
 
Asset Reliability Begins With Your Operators
Asset Reliability Begins With Your OperatorsAsset Reliability Begins With Your Operators
Asset Reliability Begins With Your OperatorsRicky Smith CMRP, CMRT
 
Reliability - Availability
Reliability -  AvailabilityReliability -  Availability
Reliability - AvailabilityTom Jacyszyn
 
Software Availability by Resiliency
Software Availability by ResiliencySoftware Availability by Resiliency
Software Availability by ResiliencyReza Samei
 
The Seven Deadly Sins in Measuring Asset Reliability
The Seven Deadly Sins in Measuring Asset ReliabilityThe Seven Deadly Sins in Measuring Asset Reliability
The Seven Deadly Sins in Measuring Asset ReliabilityRicky Smith CMRP, CMRT
 
Draft comparison of electronic reliability prediction methodologies
Draft comparison of electronic reliability prediction methodologiesDraft comparison of electronic reliability prediction methodologies
Draft comparison of electronic reliability prediction methodologiesAccendo Reliability
 
Tracker Lifetime Cost: MTBF, Lifetime and Other Events
Tracker Lifetime Cost: MTBF, Lifetime and Other EventsTracker Lifetime Cost: MTBF, Lifetime and Other Events
Tracker Lifetime Cost: MTBF, Lifetime and Other EventsArray Technologies, Inc.
 
Efficient Reliability Demonstration Tests - by Guangbin Yang
Efficient Reliability Demonstration Tests - by Guangbin YangEfficient Reliability Demonstration Tests - by Guangbin Yang
Efficient Reliability Demonstration Tests - by Guangbin YangASQ Reliability Division
 

Viewers also liked (20)

Continuous Delivery and Zero Downtime
Continuous Delivery and Zero DowntimeContinuous Delivery and Zero Downtime
Continuous Delivery and Zero Downtime
 
Moving Towards Zero Downtime
Moving Towards Zero DowntimeMoving Towards Zero Downtime
Moving Towards Zero Downtime
 
A Proposal for an Alternative to MTBF/MTTF
A Proposal for an Alternative to MTBF/MTTFA Proposal for an Alternative to MTBF/MTTF
A Proposal for an Alternative to MTBF/MTTF
 
Troubleshooting
TroubleshootingTroubleshooting
Troubleshooting
 
The New Simple: Predictive Analytics for the Mainstream
The New Simple: Predictive Analytics for the Mainstream The New Simple: Predictive Analytics for the Mainstream
The New Simple: Predictive Analytics for the Mainstream
 
Zero Downtime Deployment
Zero Downtime DeploymentZero Downtime Deployment
Zero Downtime Deployment
 
Obstacle escalation process
Obstacle escalation processObstacle escalation process
Obstacle escalation process
 
Deploying and releasing applications
Deploying and releasing applicationsDeploying and releasing applications
Deploying and releasing applications
 
Unit 9 implementing the reliability strategy
Unit 9  implementing the reliability strategyUnit 9  implementing the reliability strategy
Unit 9 implementing the reliability strategy
 
How to measure reliability
How to measure reliabilityHow to measure reliability
How to measure reliability
 
10 Things an Operations Supervisor can do Today to Improve Reliability
10 Things an Operations Supervisor can do Today to Improve Reliability10 Things an Operations Supervisor can do Today to Improve Reliability
10 Things an Operations Supervisor can do Today to Improve Reliability
 
How to measure reliability 2
How to measure reliability 2How to measure reliability 2
How to measure reliability 2
 
Asset Reliability Begins With Your Operators
Asset Reliability Begins With Your OperatorsAsset Reliability Begins With Your Operators
Asset Reliability Begins With Your Operators
 
Reliability - Availability
Reliability -  AvailabilityReliability -  Availability
Reliability - Availability
 
Software Availability by Resiliency
Software Availability by ResiliencySoftware Availability by Resiliency
Software Availability by Resiliency
 
The Seven Deadly Sins in Measuring Asset Reliability
The Seven Deadly Sins in Measuring Asset ReliabilityThe Seven Deadly Sins in Measuring Asset Reliability
The Seven Deadly Sins in Measuring Asset Reliability
 
Draft comparison of electronic reliability prediction methodologies
Draft comparison of electronic reliability prediction methodologiesDraft comparison of electronic reliability prediction methodologies
Draft comparison of electronic reliability prediction methodologies
 
Misuses of MTBF
Misuses of MTBFMisuses of MTBF
Misuses of MTBF
 
Tracker Lifetime Cost: MTBF, Lifetime and Other Events
Tracker Lifetime Cost: MTBF, Lifetime and Other EventsTracker Lifetime Cost: MTBF, Lifetime and Other Events
Tracker Lifetime Cost: MTBF, Lifetime and Other Events
 
Efficient Reliability Demonstration Tests - by Guangbin Yang
Efficient Reliability Demonstration Tests - by Guangbin YangEfficient Reliability Demonstration Tests - by Guangbin Yang
Efficient Reliability Demonstration Tests - by Guangbin Yang
 

Similar to Zero Downtime Architectures: Concepts and Implementation Techniques

USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 
Design patterns for scaling web applications
Design patterns for scaling web applicationsDesign patterns for scaling web applications
Design patterns for scaling web applicationsIvan Dimitrov
 
Challenges in Cloud Computing – VM Migration
Challenges in Cloud Computing – VM MigrationChallenges in Cloud Computing – VM Migration
Challenges in Cloud Computing – VM MigrationSarmad Makhdoom
 
Boyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experienceBoyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experienceShapeBlue
 
Cpp In Soa
Cpp In SoaCpp In Soa
Cpp In SoaWSO2
 
Rohit Yadav - The future of the CloudStack Virtual Router
Rohit Yadav - The future of the CloudStack Virtual RouterRohit Yadav - The future of the CloudStack Virtual Router
Rohit Yadav - The future of the CloudStack Virtual RouterShapeBlue
 
Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]Jimmy Angelakos
 
Tokyo azure meetup #12 service fabric internals
Tokyo azure meetup #12   service fabric internalsTokyo azure meetup #12   service fabric internals
Tokyo azure meetup #12 service fabric internalsTokyo Azure Meetup
 
SDN & NFV Introduction - Open Source Data Center Networking
SDN & NFV Introduction - Open Source Data Center NetworkingSDN & NFV Introduction - Open Source Data Center Networking
SDN & NFV Introduction - Open Source Data Center NetworkingThomas Graf
 
Network Virtualization & Software-defined Networking
Network Virtualization & Software-defined NetworkingNetwork Virtualization & Software-defined Networking
Network Virtualization & Software-defined NetworkingDigicomp Academy AG
 
Node.js Presentation
Node.js PresentationNode.js Presentation
Node.js PresentationExist
 
RedisConf17 - Dynomite - Making Non-distributed Databases Distributed
RedisConf17 - Dynomite - Making Non-distributed Databases DistributedRedisConf17 - Dynomite - Making Non-distributed Databases Distributed
RedisConf17 - Dynomite - Making Non-distributed Databases DistributedRedis Labs
 
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver VMworld
 
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...Edge AI and Vision Alliance
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Brian Brazil
 
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITThings You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITOpenStack
 
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)Apache Apex
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...InfluxData
 
Introduction to PaaS and Heroku
Introduction to PaaS and HerokuIntroduction to PaaS and Heroku
Introduction to PaaS and HerokuTapio Rautonen
 

Similar to Zero Downtime Architectures: Concepts and Implementation Techniques (20)

USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Design patterns for scaling web applications
Design patterns for scaling web applicationsDesign patterns for scaling web applications
Design patterns for scaling web applications
 
Challenges in Cloud Computing – VM Migration
Challenges in Cloud Computing – VM MigrationChallenges in Cloud Computing – VM Migration
Challenges in Cloud Computing – VM Migration
 
Boyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experienceBoyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experience
 
Cpp In Soa
Cpp In SoaCpp In Soa
Cpp In Soa
 
Rohit Yadav - The future of the CloudStack Virtual Router
Rohit Yadav - The future of the CloudStack Virtual RouterRohit Yadav - The future of the CloudStack Virtual Router
Rohit Yadav - The future of the CloudStack Virtual Router
 
Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]
 
Dynomite @ RedisConf 2017
Dynomite @ RedisConf 2017Dynomite @ RedisConf 2017
Dynomite @ RedisConf 2017
 
Tokyo azure meetup #12 service fabric internals
Tokyo azure meetup #12   service fabric internalsTokyo azure meetup #12   service fabric internals
Tokyo azure meetup #12 service fabric internals
 
SDN & NFV Introduction - Open Source Data Center Networking
SDN & NFV Introduction - Open Source Data Center NetworkingSDN & NFV Introduction - Open Source Data Center Networking
SDN & NFV Introduction - Open Source Data Center Networking
 
Network Virtualization & Software-defined Networking
Network Virtualization & Software-defined NetworkingNetwork Virtualization & Software-defined Networking
Network Virtualization & Software-defined Networking
 
Node.js Presentation
Node.js PresentationNode.js Presentation
Node.js Presentation
 
RedisConf17 - Dynomite - Making Non-distributed Databases Distributed
RedisConf17 - Dynomite - Making Non-distributed Databases DistributedRedisConf17 - Dynomite - Making Non-distributed Databases Distributed
RedisConf17 - Dynomite - Making Non-distributed Databases Distributed
 
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
VMworld 2013: How to Replace Websphere Application Server (WAS) with TCserver
 
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITThings You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT
 
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
 
Introduction to PaaS and Heroku
Introduction to PaaS and HerokuIntroduction to PaaS and Heroku
Introduction to PaaS and Heroku
 

Recently uploaded

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Recently uploaded (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Zero Downtime Architectures: Concepts and Implementation Techniques

  • 1. Zero Downtime Architectures Alexander Penev ByteSource Technology Consulting GmbH Neubaugasse 43 1070, Vienna Austria
  • 2. whoami Alexander Penev Email: alexander.penev@bytesource.net Twitter: @apenev @ByteSourceNet JEE, Databases, Linux, TCP/IP Fan of (automatic) testing, TDD, ADD, BDD….. Like to design high available and scalable systems :-)
  • 3. Zero Downtime Architectures ● Base on a customer project with the classic JEE Application Stack ● Classic web applications with server side code ● HTTP based APIs ● Goals, Concepts and Implementation Techniques ● Constraints and limitations ● Developement guidelines ● How these concepts can be applied to the new cuttung edge technolgies ● Single page Java Script based Apps ● Mobile clients ● Rest APIs ● Node.js ● NoSQL stores
  • 4. Zero Downtime Architecture? ● My database server has 99.999% uptime ● We have Tomcat cluster ● Redundant power supply ● Second Datacenter ● Load Balancer ● Distribute routes over OSPF ● Deploy my application online ● Second ISP ● Session Replication ● Monitoring ● Data Replication ● Auto restarts
  • 5. Zero Downtime architecture: our definition The services from the end user point of view could be always available
  • 6. Our Vision Identify all sources of downtime and remove all them http://www.meteleco.com/wp-content/uploads/2011/09/p360.jpg
  • 7. When could we have a downtime (unplanned)? ● Human errors ● Server node has crashed ● Power supply is broken, RAM Chip burned out, OS just crashed ● Server Software just crashed ● IO errors, software bug, tablespace full ● Network is unavailable ● Router crashed, Uplink down ● Datacenter is down ● Uplinks down ( notorious bagger :-) ) ● Flood/Fire ● Aircondition broken ● Hit by a nuke (not so often :-) )
  • 8. When could we need a downtime (planned)? ● Replace a hardware part ● Replace a router/switch ● Firmware upgrade ● Upgrade/exchange the storage ● Configuration of the connection pool ● Configuration of the cluster ● Upgrade the cluster software ● Recover from a logical data error ● Upgrade the database software ● Deploy a new version of our software ● Move the application to another data center
  • 9. How can we avoid downtime ● Redunancy ● Hardware, network ● Uplinks ● Datacenters ● Software ● Monitoring ● Detect exhausted resources before the application notices it ● Detect a failed node and replace it ● Software design ● Idempotent service calls ● Backwards compatibility ● Live releases ● Scalability ● Scale on more load ● Protect from attacks (e.g. DDoS)
  • 10. Requirements for a Zero Downtime Architecture: handling of events of failure or maintenance Event/Application category Online applications Batch jobs Failure or maintenance of an internet uplink/router/switch Yes Yes Failure or maintenance of a firewall node, loadbalancer node or a network component Yes Yes Failure or maintenance of a webserver node Yes N/A Failure or maintenance of an application server node Yes partly (will be restarted) Failure or maintenance of a database node Yes partly Switchover of a datacenter: switching only one application (group) Yes Yes (maintenance) partly (failure) Switchover of a datacenter: switching all applications Yes Yes (maintenance) partly (failure) New application deployment Yes Yes Upgrade of operating system Yes Yes Upgrade of an arbitrary middleware software Yes Yes Upgrade of database software Yes Yes Overload of processing nodes Yes Yes Failure of a single JVM Yes No Failure of a node due to leak of system resources Yes No
  • 11. Our goals and constraints ● Reduce downtime to 0 ● Keep the costs low ● No expensive propriatery hardware ● Minimize the potential application changes/rewrites http://www.signwarehouse.com/blog/how-to-keep-fixed-costs-low/
  • 12. Our Concepts 1/4 ● Independent Applications or Application Groups ● One Application (Group) = one IP Address ● Communication between Applications exclusively over this IP Address! http://www.binaryguys.de/media/catalog/product/cache/1/image/313x313/9df78eab33525d08d6e5fb8d27136e95/3/6/36.noplacelikelocalhost_1_4.jpg
  • 13. Our Concepts 2/4 Treat the internet and internal traffic independently
  • 14. Our Concepts 3/4 ● Reduce the downtime within a datacenter to 0 ● Highly available network ● Redundant firewalls and load balancers ● Web server farms ● Application server clusters with session replication ● Oracle RAC Cluster ● Downtime-free application deployments
  • 15. Our Concepts 4/4 ● Replicate the data on both datacenters ● and make the applications switchable
  • 17. Concepts: Internet traffic, BGP (Border Gateway Protocol) 1/2 ● Every datacenter has fully redundant uplinks ● Own provider independent IP address range (assigned by RIPE) ● Hard to get at the moment (but not impossible) ● Propagate these addresses to the rest of the internet through both ISPs using BGP ● Both DCs announce our addresses ● The network path of one announcement could be preferred (for cost reasons) ● Switch of internet traffic ● Gracefully, by changing the preferences of the announcements – not a single TCP session is lost ● In case of disaster the backup route is propagated automatically within seconds to minutes (depending on the internet distance) ● Protects us from connectivity problems between our ISPs and our customers' ISPs (Diagram: 10.8.8.0/24 announced from both datacenters)
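As a rough illustration only (not part of the original slides), the announcement side could look like the following IOS-style sketch. The AS numbers and the neighbor address are placeholders, and the prefix follows the slide's example; a real setup would announce the RIPE-assigned provider independent range.

    ! Edge router of one datacenter (sketch; AS numbers and addresses are placeholders)
    router bgp 64512
     network 10.8.8.0 mask 255.255.255.0
     neighbor 192.0.2.1 remote-as 64496
     neighbor 192.0.2.1 route-map TO-ISP out
    !
    ip prefix-list OUR-RANGE seq 5 permit 10.8.8.0/24
    !
    ! The backup datacenter announces the same prefix with a longer AS path,
    ! so the primary path is preferred; moving the prepend to the other side
    ! switches the traffic gracefully without dropping established sessions.
    route-map TO-ISP permit 10
     match ip address prefix-list OUR-RANGE
     set as-path prepend 64512 64512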
  • 18. Concepts: Internet traffic, use DNS? 2/2 ● We don't use DNS for switching ● A datacenter switch based on DNS could take up to months to reach all customers and their software (e.g. JVMs caching DNS entries, the default behaviour) ● No need to restart browsers, applications and proxies on the customer site. The customer doesn't see any change at all (except that the route to us has changed) ● DNS is good for load balancing but not for High Availability!
  • 19. Concepts: Internal traffic ● OSPF (Open Shortest Path First) protocol for dynamic routing ● Deals with redundant paths completely transparently ● Can also do load balancing ● The second level firewalls (in front of the load balancers) announce the address to the rest of the routers ● To switch the processing of a service, its firewall just has to announce the route (could also be a /32) with a higher priority; after a second the traffic goes through the new route ● Could also be used for an unattended switch of the whole datacenter ● Just announce the same IPs from both sites with different priorities ● If one datacenter dies there are only announcements from the other one (Diagram: the service address 10.8.8.23 announced from both datacenters)
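A minimal, hypothetical sketch of how such an announcement might look on the second level firewall, in FRR/Quagga-style syntax (the route-map name, metric values and the /32 address are placeholders following the diagram):

    ! Inject the service address into OSPF with a site-dependent metric
    router ospf
     redistribute connected route-map SERVICE-IP
    !
    ip prefix-list SERVICE seq 5 permit 10.8.8.23/32
    !
    route-map SERVICE-IP permit 10
     match ip address prefix-list SERVICE
     set metric 100
    ! e.g. "set metric 200" on the backup datacenter; lowering it there
    ! (or raising it here) moves the traffic to the other site within seconds.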
  • 20. Our Concepts ● Independent Applications or Application Groups ● Independent Internet and internal network traffic ● Reduce Downtime within a DC ● Replicate the data between the DCs and make the applications switchable
  • 21. Zero Downtime within a datacenter ● Highly available network ● Redundant switches – again using the Spanning Tree Protocol ● Redundant firewalls, routers, load balancers – Active/Passive Clusters – VRRP protocol implemented by keepalived – iptables with conntrackd ● Apache web server farms ● Managed by the load balancer ● Application Server Cluster ● Weblogic Cluster ● with session replication, ● automatic retries and restarts ● Oracle RAC database cluster ● Deployment without downtime
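For the VRRP part, a minimal keepalived sketch (interface name, virtual router ID, password and virtual address are placeholders, not taken from the original setup):

    # /etc/keepalived/keepalived.conf on the active firewall/load balancer node
    vrrp_instance FW_VIP {
        state MASTER              # BACKUP on the standby node
        interface eth0
        virtual_router_id 51
        priority 150              # lower value (e.g. 100) on the standby node
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass s3cr3t
        }
        virtual_ipaddress {
            10.8.8.1/24
        }
    }

If the master stops advertising, the standby node takes over the virtual address within a few advertisement intervals.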
  • 22. Failover within one datacenter: Apache plugin (mod_wl) Session ID Format: sessionid!primary_server_id!secondary_server_id Source: http://egeneration.beasys.com/wls/docs100/cluster/wwimages/cluster-06-1-2.gif
  • 23. Development guidelines (HTTPSession) ● If you need a session then you most probably want to replicate it ● Example (weblogic.xml): see the sketch after this slide ● Generally all requests of one session go to the same application instance ● When it fails (answers with a 50x, dies, or does not answer within a given period) the backup instance takes over ● The session attributes are only replicated to the backup node when HTTPSession.setAttribute was called. HTTPSession.getAttribute("foo").changeSomething() will not be replicated! ● Every attribute stored in the HTTPSession must be serializable! ● The ServletContext will not be replicated in any case. ● If you implement caches they will probably have different contents on every node (unless we use a 3rd-party cluster-aware cache). The best practice is not to rely on the data being present and to declare the cache transient ● Keep the session small in size and do regular reattaching.
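The weblogic.xml example referenced above is not reproduced in the transcript; a minimal sketch of such a descriptor (values are illustrative) and of the setAttribute pattern described in the bullets:

    <!-- weblogic.xml (sketch): enable in-memory session replication when clustered -->
    <weblogic-web-app xmlns="http://xmlns.oracle.com/weblogic/weblogic-web-app">
      <session-descriptor>
        <persistent-store-type>replicated_if_clustered</persistent-store-type>
      </session-descriptor>
    </weblogic-web-app>

    // Java sketch: mutating an attribute in place is not replicated; re-set it
    // so the backup node receives the change. Cart is a placeholder class and
    // must implement java.io.Serializable.
    Cart cart = (Cart) session.getAttribute("cart");
    cart.addItem(item);                  // change is only local to the primary node
    session.setAttribute("cart", cart);  // triggers replication to the secondary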
  • 24. Development guidelines (cluster handling) ● Return proper HTTP return codes to the client ● Common practice is to return a well-formed error page with HTTP code 200 ● It is a good practice if you are sure that the cluster is incapable of recovering from it (example: a missing page will be missing on the other node too) ● But an exhausted resource (like heap, datasource) could be present on the other node ● It is hard to implement, therefore Weblogic offers you help: ● You can bind the number of execution threads to a datasource capacity ● Shut down the node if an OutOfMemoryError occurs, but use it with extreme care! ● Design for idempotence ● Make all your methods idempotent as far as possible. ● For those that cannot be idempotent (e.g. sendMoney(Money money, Account account)) prevent re-execution: – By using a ticketing service (see the sketch after this slide) – By declaring it as not idempotent: <LocationMatch "/pathto/yourservlet"> SetHandler weblogic-handler Idempotent OFF </LocationMatch>
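A minimal sketch of the ticketing idea (class and method names are hypothetical; in a real system the ticket store must be shared by all cluster nodes, e.g. a database table with a unique constraint, not a per-JVM set):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: a one-time ticket prevents a retried, non-idempotent request
    // from being executed twice.
    public class MoneyTransferService {

        // Placeholder store; replace with a cluster-wide store in production.
        private final Set<String> consumedTickets = ConcurrentHashMap.newKeySet();

        public boolean sendMoney(String ticketId, long amountCents, String targetAccount) {
            if (!consumedTickets.add(ticketId)) {
                // Ticket already consumed: the plugin retried a request that
                // was actually processed, so do not transfer again.
                return false;
            }
            // ... execute the actual transfer here ...
            return true;
        }
    }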
  • 25. Development guidelines (Datasources) ● Don't build your own connection pools, take them from the Application Server by JNDI lookup ● As we are using Oracle RAC, the datasource must be a multipool consisting of single datasources, one per RAC node – One can take one of the single datasources out of the multipool (online) – Load balancing is guaranteed – Reconfiguring the pool online ● Example Spring config and example without Spring: see the sketches after this slide
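The Spring and plain-Java examples are not reproduced in the transcript; a minimal sketch of how the server-managed (multi)pool is typically obtained via JNDI (the JNDI name jdbc/AppMultiDS is a placeholder):

    <!-- Spring (sketch): look up the server-managed multipool via JNDI -->
    <beans xmlns="http://www.springframework.org/schema/beans"
           xmlns:jee="http://www.springframework.org/schema/jee"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.springframework.org/schema/beans
                               http://www.springframework.org/schema/beans/spring-beans.xsd
                               http://www.springframework.org/schema/jee
                               http://www.springframework.org/schema/jee/spring-jee.xsd">

      <jee:jndi-lookup id="dataSource" jndi-name="jdbc/AppMultiDS" resource-ref="true"/>

    </beans>

    // Without Spring (sketch): plain JNDI lookup of the same datasource.
    import javax.naming.InitialContext;
    import javax.naming.NamingException;
    import javax.sql.DataSource;

    public final class DataSourceLocator {
        public static DataSource lookup() throws NamingException {
            return (DataSource) new InitialContext().lookup("jdbc/AppMultiDS");
        }
    }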
  • 26. Basic monitoring ● Different possibilities for monitoring on Weblogic ● Standard admin console – Threads (stuck, in use, etc.), JVM (heap size, usage etc.), online thread dumps – Connection pool statistics – Transaction manager statistics – Application statistics (per servlet), WorkManager statistics ● Diagnostic console – Online monitoring only – All attributes exposed by Weblogic MBeans can be monitored – Demo: diagnostics console ● Diagnostic images – On demand, on shutdown, regularly – Useful for problem analysis (especially for after-crash analysis) – For analysing resource leaks: Demo: analyse a connection leak and a stuck thread ● SNMP and diagnostic modules – All MBean attributes can be monitored by SNMP – Gauge, string, counter monitors, log filters, attribute changes – Collected metrics, watches and notifications
  • 27. Zero downtime deployment ● 2 clusters within the one datacenter ● Managed by the Apache LB ● (simple script based on the session ID) ● Both are active during normal operations ● Before we deploy the new release we switch off cluster 1 ● Old sessions go to both cluster 1 and 2 ● New sessions go to cluster 2 only ● When all sessions of cluster 1 have expired we deploy the new version ● Test it ● If everything is OK, we put it back into the Apache load balancer ● Now we take cluster 2 off ● Until all its sessions expire ● The same procedure as above ● Then we deploy on the second datacenter
  • 28. Our Concepts ● Independent Applications or Application Groups ● Independent Internet and internal network traffic ● Reduce/avoid Downtime within a DC ● Replicate the data between the DCs and make the applications switchable
  • 29. Our requirements again
  Event/Application category | Online applications | Batch jobs
  Failure or maintenance of an internet uplink/router/switch | Yes | Yes
  Failure or maintenance of a firewall node, loadbalancer node or a network component | Yes | Yes
  Failure or maintenance of a webserver node | Yes | N/A
  Failure or maintenance of an application server node | Yes | partly (will be restarted)
  Failure or maintenance of a database node | Yes | partly
  Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
  Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
  New application deployment | Yes | Yes
  Upgrade of operating system | Yes | Yes
  Upgrade of an arbitrary middleware software | Yes | Yes
  Upgrade of database software | Yes | Yes
  Overload of processing nodes | Yes | Yes
  Failure of a single JVM | Yes | No
  Failure of a node due to leak of system resources | Yes | No
  • 30. Replicate the data between the DCs ● Bidirectional data replication between the DCs ● Oracle Streams/GoldenGate (see the sketch after this slide) http://docs.oracle.com/cd/E11882_01/server.112/e10705/man_gen_rep.htm#STREP013
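As a rough illustration only (parameter files written from memory; process, schema and trail names are placeholders, and the loop-prevention settings needed for a bidirectional setup are omitted), the GoldenGate side might look like:

    -- Extract on DC1 (sketch)
    EXTRACT ext_dc1
    USERID ogg, PASSWORD ********
    EXTTRAIL ./dirdat/aa
    TABLE APP_SCHEMA.*;

    -- Replicat on DC2 applying the trail shipped from DC1
    REPLICAT rep_dc2
    USERID ogg, PASSWORD ********
    ASSUMETARGETDEFS
    MAP APP_SCHEMA.*, TARGET APP_SCHEMA.*;

    -- A mirrored extract/replicat pair in the other direction completes the
    -- bidirectional setup; the capture processes must then exclude the
    -- replication user's own transactions to avoid replication loops.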
  • 31. Cross Cluster replication: 2 clusters in 2 datacenters
  • 32. Application groups ● One or more applications without hard dependencies to or from other applications ● Why application groups? ● Switching many applications at once leads to long downtimes and higher risk ● Switching a single one is not possible if there are hard dependencies on the database level to other applications ● Identify groups of applications that are critically dependent on each other but not on other applications outside the group ● Always switch such groups at once ● The bigger the group, the longer the downtime – A single application in the category HA will be able to switch without any downtime, just delayed requests ● A dependency is critical (hard) if it leads to issues (editing the same record on different DCs will definitely be problematic, reading data for reporting is not) – Must be identified on a case-by-case basis
  • 34. Switch application by application
  • 35. Example of a switch procedure of an application group
  • 36. Applications: Limitations ● No bulk transactions ● No DB sequences ● No file-based sequences ● No shared file system storage ● Use a central batch system ● All new releases have to be compatible with the previous release ● Stick to the infrastructure
  • 37. Our Concepts ● Independent Applications or Application Groups ● Independent Internet and internal network traffic ● Reduce/avoid Downtime within a DC ● Replicate the data between the DCs and make the applications switchable
  • 38. Our requirements once again
  Event/Application category | Online applications | Batch jobs
  Failure or maintenance of an internet uplink/router/switch | Yes | Yes
  Failure or maintenance of a firewall node, loadbalancer node or a network component | Yes | Yes
  Failure or maintenance of a webserver node | Yes | N/A
  Failure or maintenance of an application server node | Yes | partly (will be restarted)
  Failure or maintenance of a database node | Yes | partly
  Switchover of a datacenter: switching only one application (group) | Yes | Yes (maintenance), partly (failure)
  Switchover of a datacenter: switching all applications | Yes | Yes (maintenance), partly (failure)
  New application deployment | Yes | Yes
  Upgrade of operating system | Yes | Yes
  Upgrade of an arbitrary middleware software | Yes | Yes
  Upgrade of database software | Yes | Yes
  Overload of processing nodes | Yes | Yes
  Failure of a single JVM | Yes | No
  Failure of a node due to leak of system resources | Yes | No
  • 39. Modern Architectures: how do the concepts fit?
  • 40. Modern Architectures: Application Layer ● Web apps ● Completely independent of the backend ● Using only REST APIs ● 90% of the state is locally managed (supported by frameworks like AngularJS and BackboneJS) ● Must be compatible with different versions of the REST API (at least 2 versions) ● If websockets are used it gets more tricky, see backend ● New mobile versions are managed by the app stores ● Good to have an upgrade reminder (to limit the supported versions) ● The REST API must be versioned and backwards compatible (see the sketch after this slide) ● Messaging over message clouds is transparent, HA is managed by the vendors ● Stateful Services ● e.g. OAuth v1/v2 – normally handled by DB persistence
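A minimal sketch of keeping two REST API versions deployed in parallel, in JAX-RS style (paths, payloads and class names are illustrative; in a real project each resource would be a public class in its own file):

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;

    // Old clients (already loaded single page apps, not-yet-updated mobile
    // apps) keep calling v1 while new clients use v2.
    @Path("/api/v1/accounts")
    class AccountResourceV1 {
        @GET
        @Produces("application/json")
        public String list() {
            return "[{\"id\":1,\"name\":\"demo\"}]";
        }
    }

    // v2 only adds fields and never removes or renames v1 fields, so it
    // stays backwards compatible for at least one release cycle.
    @Path("/api/v2/accounts")
    class AccountResourceV2 {
        @GET
        @Produces("application/json")
        public String list() {
            return "[{\"id\":1,\"name\":\"demo\",\"currency\":\"EUR\"}]";
        }
    }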
  • 41. Session Replication ● Less needed than with server-side applications ● Frameworks like AngularJS, BackboneJS, Ember etc. manage their own sessions, routings etc. ● but still needed ● Weblogic: no change ● Tomcat, possibly with a JDBC store ● Jetty with Terracotta ● Node.js: secure (digitally signed) sessions stored in cookies – Senchalabs Connect – Mozilla/node-client-sessions ● https://hacks.mozilla.org/2012/12/using-secure-client-side-sessions-to-build-simple-and-scalable-node-js-applications-a-node-js-holiday-season-part-3/
  • 42. Backend: Bidirectional Data Replication ● Elasticsearch ● Currently no cross cluster replication ● But it is on their roadmap ● CouchDB ● Very flexible replication, whether within one datacenter or across several ● Bidirectional replication is possible (see the sketch after this slide) ● MongoDB ● One-directional replication is possible and mature ● Bidirectional is not possible at the moment ● A workaround would be: one MongoDB per app and strict separation of the apps ● Hadoop HDFS ● Currently no cross cluster replication available ● e.g. Facebook wrote their own replication for Hive ● Will possibly arrive soon with Apache Falcon http://falcon.incubator.apache.org/
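For the CouchDB case, continuous replication can be triggered over its HTTP API; a sketch (host names and the database name are placeholders):

    # Continuous replication DC1 -> DC2 ...
    curl -X POST http://dc1-couch:5984/_replicate \
         -H "Content-Type: application/json" \
         -d '{"source":"http://dc1-couch:5984/appdb","target":"http://dc2-couch:5984/appdb","continuous":true}'

    # ... and the reverse direction DC2 -> DC1 makes it bidirectional.
    curl -X POST http://dc2-couch:5984/_replicate \
         -H "Content-Type: application/json" \
         -d '{"source":"http://dc2-couch:5984/appdb","target":"http://dc1-couch:5984/appdb","continuous":true}'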
  • 43. Questions? Thank you for your attention!
  • 44. Some pictures on this presentation were purchased from iStockphoto LP. The price paid applies for the use of the pictures within the scope of a standard license, which includes among other things, online publications including websites up to a maximum image size of 800 x 600 pixels (video: 640 x 480 pixels). Some icons from https://www.iconfinder.com/ are used under the Creative Commons public domain license from the following authors: Artbees, Neurovit and Pixel Mixer (http://pixel-mixer.com) All other trademarks mentioned herein are the property of their respective owners.
  • 46. Big picture example architecture
  • 47. Key features ● 2 datacenters ● Both active (both datacenters active, but possibly different applications running on each) ● Independent uplinks ● Redundant interconnect ● Applications are deployed and running on both ● Application cluster in every datacenter ● Session replication within every datacenter ● Cross replication between the 2 datacenters ● e.g. with Weblogic Cluster ● Bidirectional database replication ● e.g. two independent Oracle RAC clusters, one in each datacenter ● Replication over Streams/GoldenGate ● Monitoring of all critical resources ● Hardware nodes ● Connection pools ● JVM heaps ● Application switch
  • 48. Concepts: other network components ● Firewalls ● First level firewalls – Cisco routers – Stateless firewalls – Not very restrictive ● Second level firewalls (in front of the application load balancers) – Should be stateful – Based on Linux/iptables with conntrackd (for failover) – Stateful, with connection tracking – Very restrictive – Rate limiting of new connections (DoS or slashdot effect) ● All firewalls will be/are in active/hot standby mode ● On a controlled failover (both are running and we switch them) not a single TCP connection should be affected (except small delays) ● In a disaster case it takes some seconds until the cluster software detects the crash of the node and initiates the failover. No TCP connections should be lost, but there is a very small risk
  • 49. Example of a switch procedure of an application group ● Preparation steps ● Check the health of the replication processes ● Stop all batch applications (by stopping the job scheduling system). If the time pressure for the switch is high, just kill all running jobs (they should be restartable anyway) ● Switch off the keepalive feature on all httpd servers ● Switching steps ● Change the firewall rules on the second layer firewalls so that any new connection request (SYN flag set) is dropped (see the sketch after this slide) ● Wait until the data is synchronized on both sides (e.g. by monitoring a heartbeat table) and no more httpd processes are active ● Switch the application traffic to the other DC (by changing the routing of their IP addresses) ● Clean up (remove the dropping of SYN packets on the “old” site etc.) ● This procedure is repeated per application group until all applications are running on the target datacenter
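A sketch of the SYN-dropping step on the second level firewalls (chain and address are placeholders for the real rule set; established connections keep flowing thanks to connection tracking):

    # Drop new connection attempts to the application address during the switch
    iptables -I FORWARD -p tcp --syn -d 10.8.8.23 -j DROP

    # Clean-up step on the "old" site once the traffic has been switched
    iptables -D FORWARD -p tcp --syn -d 10.8.8.23 -j DROP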
  • 50. Application clusters (Weblogic) ● Features of Weblogic that we use ● mod_wl – Manages the stickiness and the failover to backup nodes – Automatic retry of failed requests ● On time-outs ● On a 50x response ● Multipools – Gracefully remove a database node out of the pool – Gracefully change parameters of connection pools – Guaranteed balance of connections between database nodes ● Binding execution threads to connection pools ● Auto shutdown (+ restart) of nodes on OutOfMemoryError ● Session replication (also over both DCs) ● Thread monitoring (detect dead or long running threads etc.) ● Diagnostic images and alarms
  • 52. Deployment of connection pools ● One datasource per Oracle RAC node ● Set the initial capacity to a value that is sufficient for the usual load of the application – Creating new connections is expensive ● Set the max capacity to a value that is sufficient in a high load scenario – The overall number of connections should match the connection limit on the database side ● Set JDBC parameters in the connection pool and not globally (e.g. v8compatibility=true) ● Check connections on reserve ● You can set DB session parameters in the init SQL property (e.g. alter session set NLS_SORT='GERMAN') ● Enable 2-phase commit only if you need it (expensive) ● Prepared statement caching does not bring much performance (at least for Oracle databases) but costs open cursors in the database (per connection!), so don't use it unless you have a very good reason to ● One multipool containing all single datasources for one database ● Strategy: load balancing ● A sketch of such a datasource configuration follows this slide
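As a sketch only (element names as commonly used in WebLogic JDBC module descriptors; names, URL and capacities are placeholders and should be verified against the server documentation), the pool settings above map roughly to a per-RAC-node datasource module like this:

    <!-- One per-RAC-node datasource module (sketch; all values are placeholders) -->
    <jdbc-data-source xmlns="http://xmlns.oracle.com/weblogic/jdbc-data-source">
      <name>rac-node1-ds</name>
      <jdbc-driver-params>
        <url>jdbc:oracle:thin:@rac-node1:1521/APPSRV</url>
        <driver-name>oracle.jdbc.OracleDriver</driver-name>
        <properties>
          <property>
            <name>user</name>
            <value>app_user</value>
          </property>
        </properties>
      </jdbc-driver-params>
      <jdbc-connection-pool-params>
        <initial-capacity>20</initial-capacity>
        <max-capacity>60</max-capacity>
        <test-connections-on-reserve>true</test-connections-on-reserve>
        <init-sql>SQL alter session set NLS_SORT='GERMAN'</init-sql>
      </jdbc-connection-pool-params>
    </jdbc-data-source>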

Editor's Notes

  Reduce downtime to 0; keep the costs low; use Linux; use x64 hardware; keep software licenses as low as possible; minimize changes to the applications.