How will you prepare for your application being featured in a TV report? Will you manage to absorb that massive traffic spike, in other words the "Effet Capital"?
9. The CDN
That's life, even for dynamic content
10. The CDN
Amazon Cloudfront
cache.monsite.com CNAME xxx.cloudfront.net
Default (*): Min TTL = 2s
Q: What if the origin is unavailable?
Q: Should error pages be cached?
Bonus : If-Modified-Since → HTTP 304
Feature: Origin Failover
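The If-Modified-Since bonus above boils down to a date comparison at the origin: if the client's cached copy is still current, answer 304 with no body. A minimal sketch of that revalidation logic (function and variable names are mine, not CloudFront's):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional

def conditional_get(if_modified_since: Optional[str], last_modified: datetime) -> int:
    """Status code an origin should return for a conditional GET."""
    if if_modified_since:
        try:
            client_date = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparsable header: send the full response
        if last_modified <= client_date:
            return 304  # Not Modified: no body, cheap for the origin to serve
    return 200

modified = datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc)
print(conditional_get(format_datetime(modified), modified))        # 304
print(conditional_get("Mon, 01 Feb 2021 12:00:00 GMT", modified))  # 200
```

A 304 lets the CDN refresh a 2-second TTL for the price of a header exchange instead of a full page render.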
16. What about timeouts?
Q: Which value should you pick? p99.9
[Architecture diagram: CDN → Load balancer → Application → API, backed by ElastiCache for Redis and a MySQL instance; each "?" marks a hop whose timeout must be chosen]
17. How did the load tests go?
We don't run any; they're not representative of real traffic
23. Testing failover
”A failure event results in a brief interruption, during which read and write operations fail
with an exception. However, service is typically restored in less than 120 seconds, and
often less than 60 seconds.”
Read Replica
Separate the "INSERT"s from the "SELECT"s (PHP, Java)
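Splitting INSERTs from SELECTs amounts to a small statement router: writes must hit the primary, plain reads can be spread over replicas. That is essentially what mysqlnd_ms does in PHP and what read/write-splitting JDBC setups do in Java. A hypothetical illustration (not any library's actual API, and the endpoint names are made up):

```python
import random

class ReadWriteRouter:
    """Route SQL statements to the primary or to read replicas."""

    WRITE_PREFIXES = ("insert", "update", "delete", "replace")

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas or [primary]

    def endpoint_for(self, sql: str) -> str:
        first_word = sql.lstrip().split(None, 1)[0].lower()
        if first_word in self.WRITE_PREFIXES:
            return self.primary          # writes always go to the primary
        return random.choice(self.replicas)  # naive spreading of reads

router = ReadWriteRouter("primary.example.internal",
                         ["replica-1.example.internal", "replica-2.example.internal"])
print(router.endpoint_for("INSERT INTO orders VALUES (1)"))  # primary endpoint
print(router.endpoint_for("SELECT * FROM orders"))           # one of the replicas
```

Real routers must also pin transactions and anything session-stateful to the primary; this sketch only shows the statement-level split.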
Amazon RDS Proxy
“With RDS Proxy, failover times for Aurora and RDS databases are reduced by
up to 66%”
Amazon RDS / Aurora
28. Healthcheck
AWS Well-Architected Framework > Operational Excellence > “OPS 8: How do you understand the health of your workload?”
Amazon Builders’ Library > ”Implementing health checks”
Workshop ”Health check and dependencies”
Timeout
Amazon Builders’ Library > “Timeouts, retries and backoff with jitter”
“Resources consumed by idle PostgreSQL connections”
Incident management
AWS re:Invent 2020 session: Incident management in a distributed organization
AWS Gameday
Load testing
Distributed Load Testing on AWS
Resilience testing
AWS Fault Injection Simulator
What's next?
Hello everyone, and thank you for welcoming me to this meetup.
Imagine you arrive at work.
You see your boss being interviewed by TV journalists about the quality and originality of your company's products, which are available on your website or your application.
The piece will air in four days, during the 8 p.m. news on a major TV channel.
It's free publicity and a great opportunity for your company.
They are going to showcase your website and your mobile application.
Every viewer will connect to your application at the same time, to the second. And it's almost certain it won't be available.
I had the chance to work for a large broadcasting group in the 2010s.
I was always surprised by the risk of success, even a recurring one. In the early weeks, I crashed miserably.
You don't need the element of surprise to crash.
What's interesting about the opening ceremony of the Cannes Film Festival is that it happens on the same date every year.
Even so, after several runs up the red-carpet steps, I ended up flat on the ground, because of external factors: a marketing plan, an advertising deal.
The topic has been raised several times by our customers.
And by Sébastien.
I'll start from the assumption that you already have an existing system and that you are on-premises.
Put quick wins in place so that it isn't catastrophic. You have to own the fact that you may crash.
Without changing the architecture
Without development
Easy to set up
Something you can put in place on your own
If I had applied these quick wins to my own applications, I could have avoided some disasters.
And you'll be able to watch the 8 p.m. news quietly from your couch.
I like asking my clients this question during architecture reviews. Suddenly there's silence and the video cuts out. You'd think it was an Amazon Chime incident, but not at all.
Health checks detect and deal with single-server failures.
An error page is often faster to serve; if the health check is misconfigured, the load balancer could route traffic to that server precisely because it responds faster.
A server can fail because of a dependency, which is then clearly a false positive.
Liveness checks: HTTP 200
Local health checks: verify that the application works locally: disk read/write, the application process, support processes (monitoring & logging)
Dependency Health Checks : A common pattern is a Read API that queries a database but caches responses locally for some time. If the database is down, the service can still serve cached reads until the database is back online.
https://aws.amazon.com/builders-library/implementing-health-checks/
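The cached-read pattern quoted above can be sketched as follows; the class and parameter names are illustrative, not from the article:

```python
import time

class CachedReadAPI:
    """Serve reads from a local cache when the database dependency is down.

    A database outage then degrades to slightly stale reads instead of
    hard failures, until the database comes back online.
    """

    def __init__(self, db_read, ttl_seconds=30.0):
        self._db_read = db_read   # callable that raises while the DB is down
        self._ttl = ttl_seconds
        self._cache = {}          # key -> (value, stored_at)

    def read(self, key):
        try:
            value = self._db_read(key)
            self._cache[key] = (value, time.monotonic())
            return value
        except Exception:
            if key in self._cache:
                value, stored_at = self._cache[key]
                if time.monotonic() - stored_at <= self._ttl:
                    return value  # stale-but-usable answer
            raise                 # nothing cached: surface the outage

db = {"user:1": "Alice"}
api = CachedReadAPI(lambda key: db[key])
print(api.read("user:1"))  # "Alice" from the database, now cached locally
del db["user:1"]           # simulate the database going down
print(api.read("user:1"))  # still "Alice", served from the local cache
```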
Have you checked the health-check configuration of the DNS, the load balancer, the Auto Scaling group, and the virtual instance?
Icons made by Freepik from www.flaticon.com
Route 53 Traffic Flow with several rules, including failover and latency.
What is your failover? An error page.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#ResponseCustomOriginUnavailable
CloudFront either serves the expired version of the object or serves a custom error page.
https://docs.aws.amazon.com/fr_fr/AmazonCloudFront/latest/DeveloperGuide/HTTPStatusCodes.html#HTTPStatusCodes-custom-error-pages
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html
What is your failover?
When a client is waiting longer than usual for a request to complete, it also holds on to the resources it was using for that request for a longer time. When a number of requests hold on to resources for a long time, the server can run out of those resources.
Have you properly set the application timeouts to your databases, your partner services, and everything else?
Load balancers
OS
A good practice for choosing a timeout for calls within an AWS Region is to start with the latency metrics of the downstream service. So at Amazon, when we make one service call another service, we choose an acceptable rate of false timeouts (such as 0.1%). Then, we look at the corresponding latency percentile on the downstream service (p99.9 in this example).
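The selection rule above is easy to mechanize: sort the observed latencies of the downstream service, read off the p99.9 value, and use it as the timeout, accepting roughly 0.1% false timeouts. A rough sketch (approximate nearest-rank percentile; function name and sample data are mine):

```python
def percentile(samples, p):
    """Approximate nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 1,000 latency samples: most calls near 40 ms, a slow tail up to 500 ms
latencies_ms = [40] * 990 + [80] * 9 + [500]
timeout_ms = percentile(latencies_ms, 99.9)
print(timeout_ms)  # 80: only the single 500 ms outlier would time out (~0.1%)
```

In practice you would read these percentiles from your monitoring system (e.g. CloudWatch) rather than raw samples, but the reasoning is the same.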
c5.24xlarge: 192 GB of RAM, 19 Gbit/s of EBS bandwidth
When a single server fails, that's not a problem, but in a traffic surge to the service, the last thing we want is to shrink the size of the service. Taking servers out of service during an overload can cause a downward spiral. Forcing the remaining servers to take even more traffic makes them more likely to become overloaded, also fail a health check, and shrink the fleet even more.
The problem is not that overloaded servers return errors when they're overloaded. It's that servers don't respond to the load balancer ping request in time.
After all, load balancer health checks are configured with timeouts, just like any other remote service call.
Fortunately, there are some straightforward configuration best practices that we follow to help prevent this kind of downward spiral. Tools like iptables, and even some load balancers, support the notion of “max connections.” In this case, the OS (or load balancer) limits the number of connections to the server so that the server process is not flooded with concurrent requests that would have slowed it down.
https://helecloud.com/blog/handling-hundreds-of-thousands-of-concurrent-http-connections-on-aws/
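The "max connections" idea can be mimicked at the application layer with a non-blocking semaphore: beyond the limit, shed the connection immediately instead of queueing it, so the process stays fast enough to keep answering its health checks. An illustrative sketch, not how iptables or any load balancer actually implements it:

```python
import threading

class MaxConnections:
    """Admission control: reject excess connections instead of queueing them."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self):
        # Non-blocking: returns False immediately when all slots are taken,
        # so the caller can fail fast (e.g. return HTTP 503) under overload.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()

guard = MaxConnections(limit=2)
accepted = [guard.try_acquire() for _ in range(3)]
print(accepted)  # [True, True, False]: the third connection is shed
```

Failing fast here is the point: an overloaded server that returns errors quickly is healthier, from the load balancer's perspective, than one that answers everything slowly.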
https://www.php.net/manual/en/mysqlnd-ms.rwsplit.php
https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing
Q: What happens during Multi-AZ failover and how long does it take?
Failover is automatically handled by Amazon RDS so that you can resume database operations as quickly as possible without administrative intervention. When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer.
Failovers, as defined by the interval between the detection of the failure on the primary and the resumption of transactions on the standby, typically complete within one to two minutes. Failover time can also be affected by whether large uncommitted transactions must be recovered; the use of adequately large instance types is recommended with Multi-AZ for best results. AWS also recommends the use of Provisioned IOPS with Multi-AZ instances, for fast, predictable, and consistent throughput performance.
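The connection retry that AWS recommends at the application layer pairs naturally with capped exponential backoff and jitter (see the Builders' Library article cited earlier): during the one-to-two-minute failover window, connections fail, then succeed once the CNAME points at the promoted standby. A hedged sketch, with the failover simulated by stubbed connection attempts:

```python
import random
import time

def call_with_retries(operation, attempts=5, base_delay=0.05,
                      max_delay=2.0, sleep=time.sleep):
    """Retry a flaky call with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            backoff = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, backoff))  # full jitter avoids thundering herds

# Simulate a failover: the first two connection attempts fail, then recover.
outcomes = iter([ConnectionError, ConnectionError, "connected"])
def connect():
    result = next(outcomes)
    if result is ConnectionError:
        raise ConnectionError("primary unreachable during failover")
    return result

print(call_with_retries(connect, sleep=lambda s: None))  # connected
```

Real retry windows for Multi-AZ failover need enough attempts x backoff to cover the one-to-two-minute gap; the short delays here are just to keep the sketch readable.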
---
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/rds-proxy.html
In the long run, there is another way to organize yourselves to handle these production incidents.
Democratize the use of chaos engineering.
Thank you for listening. Now you know how to keep your applications from crashing while you sit on your couch watching the TV news.