How will you prepare for your application being featured in a TV report? Will you manage to absorb that massive traffic spike, in other words the "Effet Capital"?
9. The CDN
That's life, even for dynamic content
10. The CDN
Amazon Cloudfront
cache.monsite.com CNAME xxx.cloudfront.net
Default (*): Min TTL = 2s
Q: What if the origin is unavailable?
Q: Should error pages be cached?
Bonus : If-Modified-Since → HTTP 304
Feature: Origin Failover
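The If-Modified-Since bonus above boils down to a date comparison at the origin: if the client's cached copy is still current, answer 304 with no body. A minimal sketch of that revalidation logic (function and variable names are mine, not CloudFront's):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional

def conditional_get(if_modified_since: Optional[str], last_modified: datetime) -> int:
    """Status code an origin should return for a conditional GET."""
    if if_modified_since:
        try:
            client_date = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparsable header: send the full response
        if last_modified <= client_date:
            return 304  # Not Modified: no body, cheap for the origin to serve
    return 200

modified = datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc)
print(conditional_get(format_datetime(modified), modified))        # 304
print(conditional_get("Mon, 01 Feb 2021 12:00:00 GMT", modified))  # 200
```

A 304 lets the CDN refresh a 2-second TTL for the price of a header exchange instead of a full page render.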
16. What about timeouts?
Q: Which value should you pick? p99.9
[Architecture diagram: CDN → Load balancer → Application → API, backed by ElastiCache for Redis and a MySQL instance; each "?" marks a hop whose timeout must be chosen]
17. How did the load tests go?
We don't run any; they're not representative of real traffic
23. Testing failover
”A failure event results in a brief interruption, during which read and write operations fail
with an exception. However, service is typically restored in less than 120 seconds, and
often less than 60 seconds.”
Read Replica
Separate the "INSERT"s from the "SELECT"s (PHP, Java)
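Splitting INSERTs from SELECTs amounts to a small statement router: writes must hit the primary, plain reads can be spread over replicas. That is essentially what mysqlnd_ms does in PHP and what read/write-splitting JDBC setups do in Java. A hypothetical illustration (not any library's actual API, and the endpoint names are made up):

```python
import random

class ReadWriteRouter:
    """Route SQL statements to the primary or to read replicas."""

    WRITE_PREFIXES = ("insert", "update", "delete", "replace")

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas or [primary]

    def endpoint_for(self, sql: str) -> str:
        first_word = sql.lstrip().split(None, 1)[0].lower()
        if first_word in self.WRITE_PREFIXES:
            return self.primary          # writes always go to the primary
        return random.choice(self.replicas)  # naive spreading of reads

router = ReadWriteRouter("primary.example.internal",
                         ["replica-1.example.internal", "replica-2.example.internal"])
print(router.endpoint_for("INSERT INTO orders VALUES (1)"))  # primary endpoint
print(router.endpoint_for("SELECT * FROM orders"))           # one of the replicas
```

Real routers must also pin transactions and anything session-stateful to the primary; this sketch only shows the statement-level split.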
Amazon RDS Proxy
“With RDS Proxy, failover times for Aurora and RDS databases are reduced by
up to 66%”
Amazon RDS / Aurora
28. Healthcheck
AWS Well-Architected Framework > Operational Excellence > “OPS 8: How do you understand the health of your workload?”
Amazon Builders’ Library > ”Implementing health checks”
Workshop ”Health check and dependencies”
Timeout
Amazon Builders’ Library > “Timeouts, retries and backoff with jitter”
“Resources consumed by idle PostgreSQL connections”
Incident management
AWS re:Invent 2020 session: Incident management in a distributed organization
AWS Gameday
Load testing
Distributed Load Testing on AWS
Resilience testing
AWS Fault Injection Simulator
What's next?
Hello everyone, and thank you for welcoming me to this meetup.
Imagine you arrive at work.
You see your boss being interviewed by TV journalists about the quality and originality of your company's products, which are available on your website or your application.
The piece will air in four days, during the 8 p.m. news on a major TV channel.
It's free publicity and a great opportunity for your company.
They are going to showcase your website and your mobile application.
Every viewer will connect to your application at the same time, to the second. And it's almost certain it won't be available.
I had the chance to work for a large broadcasting group in the 2010s.
I was always surprised by the risk of success, even a recurring one. In the early weeks, I crashed miserably.
You don't need the element of surprise to crash.
What's interesting about the opening ceremony of the Cannes Film Festival is that it happens on the same date every year.
Even so, after several runs up the red-carpet steps, I ended up flat on the ground, because of external factors: a marketing plan, an advertising deal.
The topic has been raised several times by our customers.
And by Sébastien.
I'll start from the assumption that you already have an existing system and that you are on-premises.
Put quick wins in place so that it isn't catastrophic. You have to own the fact that you may crash.
Without changing the architecture
Without development
Easy to set up
Something you can put in place on your own
If I had applied these quick wins to my own applications, I could have avoided some disasters.
And you'll be able to watch the 8 p.m. news quietly from your couch.
I like asking my clients this question during architecture reviews. Suddenly there's silence and the video cuts out. You'd think it was an Amazon Chime incident, but not at all.
Health checks detect and deal with single-server failures.
An error page is often faster to serve; if the health check is misconfigured, the load balancer could route traffic to that server precisely because it responds faster.
A server can fail because of a dependency, which is then clearly a false positive.
Liveness checks: HTTP 200
Local health checks: verify that the application works locally: disk read/write, the application process, support processes (monitoring & logging)
Dependency Health Checks : A common pattern is a Read API that queries a database but caches responses locally for some time. If the database is down, the service can still serve cached reads until the database is back online.
https://aws.amazon.com/builders-library/implementing-health-checks/
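The cached-read pattern quoted above can be sketched as follows; the class and parameter names are illustrative, not from the article:

```python
import time

class CachedReadAPI:
    """Serve reads from a local cache when the database dependency is down.

    A database outage then degrades to slightly stale reads instead of
    hard failures, until the database comes back online.
    """

    def __init__(self, db_read, ttl_seconds=30.0):
        self._db_read = db_read   # callable that raises while the DB is down
        self._ttl = ttl_seconds
        self._cache = {}          # key -> (value, stored_at)

    def read(self, key):
        try:
            value = self._db_read(key)
            self._cache[key] = (value, time.monotonic())
            return value
        except Exception:
            if key in self._cache:
                value, stored_at = self._cache[key]
                if time.monotonic() - stored_at <= self._ttl:
                    return value  # stale-but-usable answer
            raise                 # nothing cached: surface the outage

db = {"user:1": "Alice"}
api = CachedReadAPI(lambda key: db[key])
print(api.read("user:1"))  # "Alice" from the database, now cached locally
del db["user:1"]           # simulate the database going down
print(api.read("user:1"))  # still "Alice", served from the local cache
```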
Have you checked the health-check configuration of the DNS, the load balancer, the Auto Scaling group, and the virtual instance?
Icons made by Freepik from www.flaticon.com
Route 53 Traffic Flow with several rules, including failover and latency.
What is your failover? An error page.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#ResponseCustomOriginUnavailable
CloudFront either serves the expired version of the object or serves a custom error page.
https://docs.aws.amazon.com/fr_fr/AmazonCloudFront/latest/DeveloperGuide/HTTPStatusCodes.html#HTTPStatusCodes-custom-error-pages
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html
What is your failover?
When a client is waiting longer than usual for a request to complete, it also holds on to the resources it was using for that request for a longer time. When a number of requests hold on to resources for a long time, the server can run out of those resources.
Have you properly set the application timeouts to your databases, your partner services, and everything else?
Load balancers
OS
A good practice for choosing a timeout for calls within an AWS Region is to start with the latency metrics of the downstream service. So at Amazon, when we make one service call another service, we choose an acceptable rate of false timeouts (such as 0.1%). Then, we look at the corresponding latency percentile on the downstream service (p99.9 in this example).
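The selection rule above is easy to mechanize: sort the observed latencies of the downstream service, read off the p99.9 value, and use it as the timeout, accepting roughly 0.1% false timeouts. A rough sketch (approximate nearest-rank percentile; function name and sample data are mine):

```python
def percentile(samples, p):
    """Approximate nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 1,000 latency samples: most calls near 40 ms, a slow tail up to 500 ms
latencies_ms = [40] * 990 + [80] * 9 + [500]
timeout_ms = percentile(latencies_ms, 99.9)
print(timeout_ms)  # 80: only the single 500 ms outlier would time out (~0.1%)
```

In practice you would read these percentiles from your monitoring system (e.g. CloudWatch) rather than raw samples, but the reasoning is the same.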
c5.24xlarge: 192 GB of RAM, 19 Gbit/s of EBS bandwidth
When a single server fails, that's not a problem, but in a traffic surge to the service, the last thing we want is to shrink the size of the service. Taking servers out of service during an overload can cause a downward spiral. Forcing the remaining servers to take even more traffic makes them more likely to become overloaded, also fail a health check, and shrink the fleet even more.
The problem is not that overloaded servers return errors when they're overloaded. It's that servers don't respond to the load balancer ping request in time.
After all, load balancer health checks are configured with timeouts, just like any other remote service call.
Fortunately, there are some straightforward configuration best practices that we follow to help prevent this kind of downward spiral. Tools like iptables, and even some load balancers, support the notion of “max connections.” In this case, the OS (or load balancer) limits the number of connections to the server so that the server process is not flooded with concurrent requests that would have slowed it down.
https://helecloud.com/blog/handling-hundreds-of-thousands-of-concurrent-http-connections-on-aws/
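The "max connections" idea can be mimicked at the application layer with a non-blocking semaphore: beyond the limit, shed the connection immediately instead of queueing it, so the process stays fast enough to keep answering its health checks. An illustrative sketch, not how iptables or any load balancer actually implements it:

```python
import threading

class MaxConnections:
    """Admission control: reject excess connections instead of queueing them."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self):
        # Non-blocking: returns False immediately when all slots are taken,
        # so the caller can fail fast (e.g. return HTTP 503) under overload.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()

guard = MaxConnections(limit=2)
accepted = [guard.try_acquire() for _ in range(3)]
print(accepted)  # [True, True, False]: the third connection is shed
```

Failing fast here is the point: an overloaded server that returns errors quickly is healthier, from the load balancer's perspective, than one that answers everything slowly.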
https://www.php.net/manual/en/mysqlnd-ms.rwsplit.php
https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing
Q: What happens during Multi-AZ failover and how long does it take?
Failover is automatically handled by Amazon RDS so that you can resume database operations as quickly as possible without administrative intervention. When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer.
Failovers, as defined by the interval between the detection of the failure on the primary and the resumption of transactions on the standby, typically complete within one to two minutes. Failover time can also be affected by whether large uncommitted transactions must be recovered; the use of adequately large instance types is recommended with Multi-AZ for best results. AWS also recommends the use of Provisioned IOPS with Multi-AZ instances, for fast, predictable, and consistent throughput performance.
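The connection retry that AWS recommends at the application layer pairs naturally with capped exponential backoff and jitter (see the Builders' Library article cited earlier): during the one-to-two-minute failover window, connections fail, then succeed once the CNAME points at the promoted standby. A hedged sketch, with the failover simulated by stubbed connection attempts:

```python
import random
import time

def call_with_retries(operation, attempts=5, base_delay=0.05,
                      max_delay=2.0, sleep=time.sleep):
    """Retry a flaky call with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            backoff = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, backoff))  # full jitter avoids thundering herds

# Simulate a failover: the first two connection attempts fail, then recover.
outcomes = iter([ConnectionError, ConnectionError, "connected"])
def connect():
    result = next(outcomes)
    if result is ConnectionError:
        raise ConnectionError("primary unreachable during failover")
    return result

print(call_with_retries(connect, sleep=lambda s: None))  # connected
```

Real retry windows for Multi-AZ failover need enough attempts x backoff to cover the one-to-two-minute gap; the short delays here are just to keep the sketch readable.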
---
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/rds-proxy.html
In the long run, there is another way to organize yourselves to handle these production incidents.
Democratize the use of chaos engineering.
Thank you for listening. Now you know how to keep your applications from crashing while you sit on your couch watching the TV news.