#DevoxxUS
Architecting for failures in microservices:
Patterns and lessons learned
Bhakti Mehta
@bhakti_mehta
INTRODUCTION
➤ Platform@Atlassian
➤ In the past Platform Lead at BlueJeans Network
➤ Worked at Sun Microsystems/Oracle for 13 years
➤ Committer to numerous open source projects including
GlassFish Application Server
MY RECENT BOOK
PREVIOUS BOOK
ATLASSIAN
Microservices
PATH TO MICROSERVICES
➤ Advantages
➤ Simplicity
➤ Isolation of problems
➤ Scale up and scale down
➤ Easy deployment
➤ Polyglotism and heterogeneity
Sounds great!!
In reality……..
MONOLITHS TO MICRO SERVICES
RESILIENT SYSTEM
➤ Processes transactions, even when there are transient
impulses, persistent stresses
➤ Functions even when there are component failures disrupting
normal processing
➤ Accepts that failures will happen
➤ Design for crumple zones
RESILIENT SYSTEM
Be the duck
Behave normally when the system is not performing as expected in the face of outages
How should the customer perceive you? Behaving normally
RESILIENT SYSTEM
How does the system need to function?
Heal quickly before customers notice
KINDS OF FAILURES
➤ Challenges at scale
➤ Integration point failures
➤ Network errors
➤ Semantic errors.
➤ Slow responses
➤ Outright hang
➤ GC issues
THE NEW WAY OF LIFE
You build it, you run it!!
You own it, you plan for it!!!
PERFECT STORM
THINGS THAT WENT WRONG
➤ Bad node in load balancer group
➤ Deployment of new code
➤ Gradual increase in latency
➤ Abuse by clients
➤ Not enough production-like data in staging
➤ No easy way to trigger stale/lenient fallbacks
➤ Too few alerts
LESSONS LEARNED
Errors can be frequent, but latencies are consequential!!
ACTION PLAN
➤ Circuit breakers
➤ Fallback (lenient acceptable values)
➤ Predictive caching
➤ Reduce surface area by clients
➤ Load tests
➤ Failure injection testing
➤ Monitor
➤ Alerts
Development time
Before a deploy
Post deploy
The more you sweat on the field
the less you bleed in war!!!
RESILIENCY PLANNING STAGE 1
➤ When developing code
➤ Avoiding Cascading failures
➤ Circuit breaker
➤ Timeouts
➤ Retry
➤ Bulkhead
➤ Cache optimisations
➤ Avoid malicious clients
➤ Rate limiting
RESILIENCY PLANNING STAGE 2
➤ Planning for dealing with failures before deploy to prod
➤ load test
➤ a/b test
➤ longevity
➤ dark launch features
RESILIENCY PLANNING STAGE 3
➤ Watching out for failures after deploy to prod
➤ health check
➤ metrics
CASCADING FAILURES
Caused by chain reactions
For example:
One node in a load balancer group fails
The other nodes need to pick up its work
Eventually performance can degrade
HYSTRIX: CIRCUIT BREAKER PATTERN
• Fault tolerance pattern as a library
• Automatic fail fast
• Automatic fail over
• Metrics: circuit breaker open, calls/sec, execution time (median, 90th, 95th, 99th percentile)
• If a command has had a high failure rate in the last 10 seconds, it is unlikely to succeed now
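A minimal sketch of the fail-fast idea, in Python rather than Hystrix itself. It is a simplification: it counts consecutive failures, whereas a library such as Hystrix tracks the failure rate over a rolling window. The names `CircuitBreaker`, `failure_threshold` and `reset_timeout` are illustrative assumptions, not any library's API.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests need no sleeping
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not touch the struggling dependency at all.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            if fallback is not None:
                return fallback()              # lenient, acceptable value
            raise
        # A success closes the circuit and resets the count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the dependency is not called at all, which gives it room to recover and keeps caller threads from piling up behind slow responses.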
TIMEOUTS PATTERN
RETRY PATTERN AND TIMEOUTS
➤ Retry on network failures, timeouts, or server errors
➤ Helps with transient problems such as dropped connections or server failover
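A sketch of a bounded retry, assuming exponential backoff with full jitter so that many clients do not retry in lockstep. The function name `retry` and its parameters are hypothetical. The wrapped call is expected to enforce its own timeout and raise `TimeoutError` when it expires.

```python
import random
import time


def retry(func, attempts=4, base_delay=0.1, max_delay=2.0,
          retry_on=(ConnectionError, TimeoutError),
          sleep=time.sleep, rng=random.random):
    """Call func, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Cap the exponential delay, then wait a random fraction of it.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
```

Only errors listed in `retry_on` are retried. Anything else, such as a validation error, propagates immediately because repeating the call would not help.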
BULKHEAD
RATE LIMITING
➤ Restricting the number of requests that can be made by a
client
➤ Client can be identified based on the access token used
➤ Additionally clients can be identified based on IP address
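One common way to restrict requests per client is a token bucket. This is a sketch under that assumption; the slides do not say which algorithm was used. The client key can be an access token or an IP address. `RateLimiter`, `rate` and `burst` are illustrative names.

```python
import time


class RateLimiter:
    """Token bucket per client: `rate` requests per second, bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets = {}  # client id -> (tokens left, time of last update)

    def allow(self, client_id):
        now = self.clock()
        tokens, last = self.buckets.get(client_id, (self.burst, now))
        # Refill in proportion to elapsed time, never above the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client_id] = (tokens - 1, now)
            return True
        self.buckets[client_id] = (tokens, now)
        return False
```

A request for which `allow` returns `False` would typically be rejected with HTTP 429 (Too Many Requests).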
CACHE OPTIMIZATIONS
Getting from first-level cache
Getting from second-level cache
Getting from the DB
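The three lookup paths above can be sketched as a read-through function. Plain dicts stand in for the two cache levels, and `load_from_db` is a hypothetical loader supplied by the caller.

```python
def cached_get(key, l1, l2, load_from_db):
    """Look in the first-level cache, then the second-level cache, then the DB.

    Returns (value, source) so callers and metrics can see which path was taken.
    """
    if key in l1:
        return l1[key], "l1"
    if key in l2:
        l1[key] = l2[key]      # promote so the next read is a first-level hit
        return l1[key], "l2"
    value = load_from_db(key)  # slowest path
    l2[key] = value
    l1[key] = value
    return value, "db"
```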
TALE OF THE NEVER LEAVING CACHE ENTRIES
➤ Longer TTL
➤ Not evicted soon enough
➤ Bottlenecks
➤ Failures
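A sketch of one way to keep entries from overstaying: give every entry an expiry time, evict it on read once expired, and allow explicit invalidation. `TTLCache` is an illustrative class, not a specific library.

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (value, expiry time)

    def put(self, key, value):
        self.entries[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        item = self.entries.get(key)
        if item is None:
            return default
        value, expires_at = item
        if self.clock() >= expires_at:
            del self.entries[key]  # evict on read so stale data is never served
            return default
        return value

    def invalidate(self, key):
        """Drop an entry as soon as the underlying data changes."""
        self.entries.pop(key, None)
```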
LOGGING BEST PRACTICES
➤ Use a detailed, consistent pattern across service logs
➤ Obfuscate sensitive data
➤ Identify caller or initiator as part of logs
➤ Do not log payloads
➤ Request tracing across services
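A sketch of these practices with Python's standard `logging` module: one shared format, a request id and caller on every line, and a filter that masks sensitive values. The field names `request_id` and `caller`, and the list of sensitive keys, are assumptions for illustration.

```python
import logging
import re

# Keys whose values must never reach the logs (illustrative list).
SENSITIVE = re.compile(r"(?i)\b(password|token|authorization)=([^\s&]+)")
LOG_FORMAT = "%(levelname)s request_id=%(request_id)s caller=%(caller)s %(message)s"


class RedactingFilter(logging.Filter):
    """Mask the values of sensitive keys before a record is emitted."""

    def filter(self, record):
        record.msg = SENSITIVE.sub(r"\1=***", record.getMessage())
        record.args = ()  # the message is already fully formatted
        return True


def build_logger(name, stream=None):
    """Create a logger that applies the shared format and the redaction filter."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    handler.addFilter(RedactingFilter())
    logger.addHandler(handler)
    return logger
```

Each call passes the tracing fields through `extra`, for example `log.info("login user=%s", user, extra={"request_id": rid, "caller": "gateway"})`. Passing the same request id to downstream services lets one request be followed across their logs.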
RESILIENCE PLANNING STAGE 2
➤ Before deploy
➤ Load testing
➤ Longevity testing
➤ Capacity planning
LOAD TESTING
➤ Ensure that you test for load on APIs
➤ Plan for longevity testing
CAPACITY PLANNING
➤ Anticipate growth
➤ Design for handling exponential growth
RESILIENCE PLANNING STAGE 3
➤ After deploy
➤ Health check
➤ Metrics and Monitoring
➤ Phased rollout of features
Health Check
HEALTH CHECK
➤ Memory
➤ CPU
➤ Threads
➤ Error rate
➤ If any check exceeds its threshold, send an alert
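The threshold rule can be sketched as follows. The metric names and limits are hypothetical, and `send_alert` stands in for whatever paging or email hook is in use.

```python
def evaluate_health(readings, thresholds):
    """Return the names of the checks whose reading exceeds its threshold."""
    return sorted(name for name, value in readings.items()
                  if name in thresholds and value > thresholds[name])


def health_check(readings, thresholds, send_alert):
    """Alert on every breached check and report an overall status."""
    breaches = evaluate_health(readings, thresholds)
    for name in breaches:
        send_alert(f"{name}={readings[name]} exceeds threshold {thresholds[name]}")
    return {"status": "unhealthy" if breaches else "healthy", "breaches": breaches}
```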
Metrics and Monitoring
METRICS
➤ Response times, throughput
➤ Identify slow running DB queries
➤ GC rate and pause duration
➤ Garbage collection can cause slow responses
➤ Monitor unusual activity
➤ Create alerts when thresholds are exceeded
➤ Run books for actions to be taken on alerts
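A sketch of a percentile-based latency alert, using the nearest-rank definition. Averages hide tail latency, which is why the threshold here is on the 95th percentile. The function names and the threshold are illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alerts(samples_ms, p95_threshold_ms):
    """Return alert messages for one window of response times in milliseconds."""
    p95 = percentile(samples_ms, 95)
    if p95 > p95_threshold_ms:
        return [f"p95 latency {p95} ms exceeds {p95_threshold_ms} ms"]
    return []
```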
Thoughts of the on call person paged at 3 am
debugging an issue in your code
MONITORING
[Diagram: monitoring server, environment, checks, alerts, email]
SAVED BY THE METRICS AND ALERTS
➤ MaxDBConnection alert
➤ CPU utilisation spiking
➤ Analysed slow-running queries
➤ Some SELECT queries were taking very long: average 718 ms, 95th percentile 2030 ms
➤ Root cause, initially unidentified: a bug fix had introduced pagination, and its ORDER BY clause needed to match a function-based index
ROLLOUT OF NEW FEATURES
➤ Phasing rollout of new features
➤ Dark launch features
➤ Have a way to turn features off if not behaving as expected
➤ Alerts and more alerts!
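A sketch of a phased rollout with a kill switch. Users are hashed into 100 stable buckets, so raising the percentage only ever adds users, and setting it to 0 turns the feature off. `FeatureFlags` and the feature names are hypothetical.

```python
import hashlib


def in_rollout(feature, user_id, percent):
    """Deterministically place a user in a bucket 0-99 for this feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


class FeatureFlags:
    def __init__(self):
        self.percent = {}  # feature -> rollout percentage; unknown features are off

    def set_rollout(self, feature, percent):
        self.percent[feature] = max(0, min(100, percent))

    def kill(self, feature):
        """Turn the feature off for everyone if it misbehaves."""
        self.percent[feature] = 0

    def is_enabled(self, feature, user_id):
        return in_rollout(feature, user_id, self.percent.get(feature, 0))
```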
AWS S3 OUTAGE
➤ S3 outage in US East
➤ Number of services affected
➤ Third-party services we depend on had degraded performance
➤ Lots of key takeaways from this
Cheat sheet
A  Alerts                  K  Key invalidations
B  Bulkheads               L  Logging
C  Circuit breakers        M  Metrics & monitoring
D  Data obfuscation        N  Network latencies
E  Eventual consistency    O  Optimizing queries
F  Fallbacks & Hystrix     P  Phased rollouts
G  GC settings             Q  Queues (bounded)
H  Health checks           R  Run books
I  Injecting failure       S  Staged deployments
J  Jitter with retries     T  Timeouts
TAKEAWAY
➤ Inevitability of failures
➤ Expect systems will fail
➤ Failure prevention: plan for failures. Not if, but when
➤ Automate
Keep Calm and Cloud On!
#DevoxxUS
Questions
@bhakti_mehta

Devoxx2017

  • 1. #DevoxxUS Architecting for failures in micro services: Patterns and lessons learned Bhakti Mehta @bhakti_mehta
  • 2. INTRODUCTION ➤ Platform@Atlassian ➤ In the past Platform Lead at BlueJeans Network ➤ Worked at Sun Microsystems/Oracle for 13 years ➤ Committer to numerous open source projects including GlassFish Application Server
  • 7. PATH TO MICROSERVICES ➤ Advantages ➤ Simplicity ➤ Isolation of problems ➤ Scale up and scale down ➤ Easy deployment ➤ Polyglotism and heterogeneity
  • 10. MONOLITHS TO MICRO SERVICES
  • 11. RESILIENT SYSTEM ➤ Processes transactions, even when there are transient impulses or persistent stresses ➤ Functions even when component failures disrupt normal processing ➤ Accepts that failures will happen ➤ Design for crumple zones
  • 12. RESILIENT SYSTEM How should the customer perceive you? Be the duck: behave normally when the system is not performing as expected in the face of outages
  • 13. RESILIENT SYSTEM How does the system need to function? Heal quickly, before customers notice
  • 14. KINDS OF FAILURES ➤ Challenges at scale ➤ Integration point failures ➤ Network errors ➤ Semantic errors. ➤ Slow responses ➤ Outright hang ➤ GC issues
  • 15. THE NEW WAY OF LIFE You build it, you run it!! You own it, you plan for it!!!
  • 18. THINGS THAT WENT WRONG ➤ Bad node in load balancer group ➤ Deployment of new code ➤ Gradual increase in latency ➤ Abuse by clients ➤ Not enough prod-like data in staging ➤ No easy way to trigger stale/lenient fallbacks ➤ Too few alerts
  • 19. LESSONS LEARNED Errors can be frequent but latencies are consequential !!
  • 20. ACTION PLAN ➤ Circuit breakers ➤ Fallback (lenient acceptable values) ➤ Predictive caching ➤ Reduce surface area by clients ➤ Load tests ➤ Failure injection testing ➤ Monitor ➤ Alerts (Stages: development time, before a deploy, post deploy)
  • 21. The more you sweat on the field the less you bleed in war!!!
  • 22. RESILIENCY PLANNING STAGE 1 ➤ When developing code ➤ Avoiding Cascading failures ➤ Circuit breaker ➤ Timeouts ➤ Retry ➤ Bulkhead ➤ Cache optimisations ➤ Avoid malicious clients ➤ Rate limiting
  • 23. RESILIENCY PLANNING STAGE 2 ➤ Planning for dealing with failures before deploy to prod ➤ load test ➤ a/b test ➤ longevity ➤ dark launch features
  • 24. RESILIENCY PLANNING STAGE 3 ➤ Watching out for failures after deploy to prod ➤ health check ➤ metrics
  • 26. CASCADING FAILURES Caused by chain reactions. For example: one node in a load balancer group fails, the others need to pick up its work, and eventually performance can degenerate.
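The bulkhead listed earlier under "Avoiding cascading failures" is one way to stop a slow dependency from absorbing every worker in a service. The following is a minimal sketch, assuming a thread-based service; `Bulkhead` and `BulkheadFullError` are hypothetical names chosen for illustration, not part of any library mentioned in the deck.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when every slot of the bulkhead is already in use."""


class Bulkhead:
    """Caps the number of concurrent calls allowed into one dependency."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func):
        # Reject immediately instead of queueing, so callers are not
        # left waiting behind a dependency that is already saturated.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return func()
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` means a hang in one of them can exhaust only its own slots rather than the whole thread pool.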
  • 27. HYSTRIX - CIRCUIT BREAKER PATTERN • Fault tolerance pattern as a library • Automatic fail fast • Automatic fail over • Metrics: circuit breaker open, calls/sec, execution time (median, 90th, 95th and 99th percentile) • If a command has had a high failure rate in the last 10 seconds, it is unlikely to succeed now
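The fail-fast idea on this slide can be sketched without Hystrix itself. The class below is an illustrative, count-based simplification and does not reproduce Hystrix's API: `CircuitBreaker`, `CircuitOpenError` and all parameter names are assumptions, and the 10-second window simply mirrors the figure on the slide.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_seconds=10.0,
                 reset_seconds=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = []     # timestamps of recent failures
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        now = self.clock()
        if self.opened_at is not None and now - self.opened_at < self.reset_seconds:
            # Open: fail fast without touching the dependency.
            if fallback is not None:
                return fallback()
            raise CircuitOpenError("circuit is open")
        # Closed, or half-open after reset_seconds: attempt the call.
        try:
            result = func()
        except Exception:
            self._record_failure(now)
            if fallback is not None:
                return fallback()
            raise
        # A success closes the circuit and clears the failure history.
        self.failures.clear()
        self.opened_at = None
        return result

    def _record_failure(self, now):
        # Keep only failures inside the rolling window, then add this one.
        self.failures = [t for t in self.failures if now - t <= self.window_seconds]
        self.failures.append(now)
        # A failed trial call in half-open state re-opens immediately.
        if self.opened_at is not None or len(self.failures) >= self.failure_threshold:
            self.opened_at = now
```

The `fallback` argument is where a lenient acceptable value, as named in the action plan, would be returned while the dependency recovers.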
  • 29. RETRY PATTERN AND TIMEOUTS ➤ Retry on network failures, timeouts or server errors ➤ Helps with transient network errors such as dropped connections or server fail over
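A retry loop is usually paired with a bounded number of attempts and a randomised delay, which is what the cheat sheet calls "Jitter with Retries". This is a minimal sketch under stated assumptions: the function name `retry`, its parameters and the choice of "full jitter" exponential backoff are illustrative, and the per-attempt timeout is assumed to be enforced inside `func` (for example by the HTTP client it wraps).

```python
import random
import time


def retry(func, attempts=3, base_delay=0.1, max_delay=2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, retrying only on the exception types in retry_on."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff capped at max_delay, with a random
            # delay in [0, cap] so clients do not retry in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

Exceptions outside `retry_on` propagate on the first attempt, which keeps semantic errors from being retried as if they were transient.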
  • 32. RATE LIMITING ➤ Restricting the number of requests that can be made by a client ➤ Client can be identified based on the access token used ➤ Additionally clients can be identified based on IP address
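Per-client limiting as described on this slide can be sketched with a token bucket keyed by whatever identifies the client (an access token or an IP address). The token bucket is one common choice and the slide does not say which algorithm was used; `RateLimiter` and its parameters are hypothetical names.

```python
import time


class RateLimiter:
    """Token bucket per client key."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock
        self.buckets = {}  # client key -> (tokens, time of last update)

    def allow(self, client_key):
        now = self.clock()
        # A client seen for the first time starts with a full bucket.
        tokens, last = self.buckets.get(client_key, (self.burst, now))
        # Refill in proportion to elapsed time, never above the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client_key] = (tokens - 1, now)
            return True
        self.buckets[client_key] = (tokens, now)
        return False
```

A caller would typically answer a rejected request with HTTP 429; in a multi-instance service the bucket state would need to live in shared storage rather than in process memory.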
  • 33. CACHE OPTIMIZATIONS Getting from first level cache Getting from second level cache Getting from the DB
  • 34. TALE OF THE NEVER LEAVING CACHE ENTRIES ➤ Longer TTL ➤ Not evicted soon enough ➤ Bottlenecks ➤ Failures
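The lookup order from the cache slide (first level, second level, then the DB) and the TTL concern from this one can be sketched together. This is an in-memory illustration only: `TTLCache`, `lookup` and `load_from_db` are hypothetical names, and a real second-level cache would normally be a shared store rather than a dict.

```python
import time


class TTLCache:
    """Cache whose entries expire ttl_seconds after being written."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Evict on read so a stale entry never outlives its TTL.
            del self.entries[key]
            return None
        return value

    def put(self, key, value):
        self.entries[key] = (value, self.clock() + self.ttl)


def lookup(key, first_level, second_level, load_from_db):
    """Read through both cache levels before falling back to the DB."""
    value = first_level.get(key)
    if value is not None:
        return value
    value = second_level.get(key)
    if value is None:
        value = load_from_db(key)
        second_level.put(key, value)
    first_level.put(key, value)
    return value
```

Because `None` signals a miss, this sketch cannot cache a `None` value. A short TTL on the first level with a longer one on the second bounds how long a stale entry can be served.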
  • 35. LOGGING BEST PRACTICES ➤ Include detailed, consistent pattern across service logs ➤ Obfuscate sensitive data ➤ Identify caller or initiator as part of logs ➤ Do not log payloads ➤ Request tracing across services
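Two of these practices, obfuscating sensitive data and identifying the caller and request in every line, can be sketched with a standard-library `logging.Filter`. The field names `request_id` and `caller`, the list of sensitive keys and the class name are assumptions made for illustration.

```python
import logging
import re

# Keys whose values are masked; the list is an illustrative assumption.
SENSITIVE = re.compile(r"(?i)(password|token|secret)=([^\s&]+)")


def obfuscate(message):
    """Replace the value of any sensitive key=value pair with ***."""
    return SENSITIVE.sub(lambda m: m.group(1) + "=***", message)


class RequestContextFilter(logging.Filter):
    """Adds request_id and caller to each record and masks secrets."""

    def __init__(self, request_id, caller):
        super().__init__()
        self.request_id = request_id
        self.caller = caller

    def filter(self, record):
        record.request_id = self.request_id
        record.caller = self.caller
        # Format first, then mask, so secrets passed as args are covered.
        record.msg = obfuscate(record.getMessage())
        record.args = ()
        return True
```

With a formatter such as `"%(request_id)s %(caller)s %(message)s"`, passing the same request ID on to downstream services lets one request be traced across their logs.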
  • 36. RESILIENCE PLANNING STAGE 2 ➤ Before deploy ➤ Load testing ➤ Longevity testing ➤ Capacity planning
  • 37. LOAD TESTING ➤ Ensure that you test for load on APIs ➤ Plan for longevity testing
  • 38. CAPACITY PLANNING ➤ Anticipate growth ➤ Design for handling exponential growth
  • 39. RESILIENCE PLANNING STAGE 3 ➤ After deploy ➤ Health check ➤ Metrics and Monitoring ➤ Phased rollout of features
  • 41. HEALTH CHECK ➤ Memory ➤ CPU ➤ Threads ➤ Error rate ➤ If any of the checks exceeds a threshold, send an alert
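The threshold rule on this slide amounts to comparing each reading with its limit and alerting on any breach. A minimal sketch, assuming readings arrive as a plain dict; the metric names and limits in the usage note are made up for illustration.

```python
def check_health(readings, thresholds):
    """Return one alert message per metric that breaches its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = readings.get(name)
        if value is None:
            # A missing reading is itself worth an alert.
            alerts.append(f"{name}: no reading")
        elif value > limit:
            alerts.append(f"{name}: {value} exceeds threshold {limit}")
    return alerts
```

For example, `check_health({"cpu": 0.97, "error_rate": 0.01}, {"cpu": 0.9, "error_rate": 0.05})` returns a single alert for `cpu`; an empty list means the instance is healthy.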
  • 43. METRICS ➤ Response times, throughput ➤ Identify slow running DB queries ➤ GC rate and pause duration ➤ Garbage collection can cause slow responses ➤ Monitor unusual activity ➤ Create alerts when thresholds are exceeded ➤ Run books for actions to be taken on alerts
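Response-time metrics are usually reported as an average plus high percentiles, since a few slow requests can hide behind a healthy mean. The sketch below uses the nearest-rank definition of a percentile, which is one of several in common use; `percentile` and `summarize` are hypothetical helper names and the sample values in the usage note are invented.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples_ms):
    """Average and 95th percentile of a list of response times."""
    return {
        "avg": statistics.mean(samples_ms),
        "p95": percentile(samples_ms, 95),
    }
```

For the samples `[100, 200, 300, 400, 2000]` the average is 600 ms while the 95th percentile is 2000 ms, which is why an alert on the percentile catches what an alert on the mean can miss.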
  • 44. Thoughts of the on call person paged at 3 am debugging an issue in your code
  • 46. SAVED BY THE METRICS AND ALERTS ➤ MaxDBConnection alert ➤ CPU utilisation spiking ➤ Analysed slow-running queries ➤ Some SELECT queries were taking very long: average 718 ms, 95th percentile 2030 ms ➤ The cause, unidentified at first, was a bug fix that introduced pagination; its ORDER BY clause needed to match a function-based index
  • 47. ROLLOUT OF NEW FEATURES ➤ Phasing rollout of new features ➤ Dark launch features ➤ Have a way to turn features off if not behaving as expected ➤ Alerts and more alerts!
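A phased rollout with an off switch can be sketched as a percentage gate over a stable hash of the user. Hashing on feature name plus user ID is an assumption about how users are bucketed, and `in_rollout` is a hypothetical helper, not a reference to any feature-flag product.

```python
import hashlib


def in_rollout(feature, user_id, percent):
    """True if this user falls inside the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # The same feature and user always map to the same bucket, so a user
    # does not flip in and out of the feature between requests.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Raising `percent` step by step phases the feature in, and setting it to 0 turns the feature off for everyone without a deploy, provided the value is read from configuration at runtime.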
  • 48. AWS S3 OUTAGE ➤ S3 outage in US East ➤ A number of services affected ➤ Third-party services we depend on had degraded performance ➤ Lots of key takeaways from this
  • 49. CHEAT SHEET ➤ A Alerts ➤ B Bulkheads ➤ C Circuit Breakers ➤ D Data obfuscation ➤ E Eventual consistency ➤ F Fallbacks & Hystrix ➤ G GC settings ➤ H Health checks ➤ I Injecting failure ➤ J Jitter with Retries ➤ K Key invalidations ➤ L Logging ➤ M Metrics & monitoring ➤ N Network latencies ➤ O Optimizing queries ➤ P Phased rollouts ➤ Q Queues bounded ➤ R Run books ➤ S Staged deployments ➤ T Timeouts
  • 50. TAKEAWAY ➤ Inevitability of failures ➤ Expect systems will fail ➤ Failure prevention: plan for failures. Not if, but when ➤ Automate. Keep Calm and Cloud On!
  • 51. REFERENCES ➤ https://commons.wikimedia.org/wiki/File:Bulkhead_PSF.png ➤ http://www.constructionlawtoday.com/uploads/image/Expect-Delays-sign(1).jpg ➤ http://cdn.idigitaltimes.com/sites/idigitaltimes.com/files/2016/04/27/wolverinex-menapocalpse.jpg ➤ https://www.freevector.com/uploads/vector/preview/13242/FreeVector-Swimming-Duck.jpg ➤ http://weknowyourdreams.com/image.php?pic=/images/happiness/happiness-04.jpg ➤ http://www.fitnessandpower.com/wp-content/uploads/2013/10/military-fitness.jpg ➤ http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2010/10/speed-limit-change-sign-resized_2.jpg ➤ https://www.askideas.com/media/51/Funny-Grumpy-Cat-Some-People-Just-Need-A-Hug-Around-The-Neck-With-A-Rope-Image.jpg ➤ https://www.flickr.com/photos/skynoir/ Beer in hand: skynoir/Flickr/Creative Commons License