Monitoring Elixir Applications
John Kelly, Sr. Engineer, Bleacher Report
Availability
Time-Based Availability [2] = uptime / (uptime + downtime)
Nine Nines
99.9999999%
“Evidence for the long-term operational stability of the system had also not
been collected in any systematic way. For the Ericsson AXD301 the only
information on the long-term stability of the system came from a power-point
presentation showing some figures claiming that a major customer had run an
11 node system with a 99.9999999% reliability, though how these figure had
been obtained was not documented.” [4]
The Cloud
AWS EC2 SLA: 99.95% [5]
What if you were in multiple regions?
2 Region AWS Availability
99.95% = .9995
Errors = .0005
.0005 * .0005 = 0.00000025
1 - 0.00000025 = 0.99999975
~99.99998% (more than six nines, almost seven!)
(~8 s of downtime in a year [7,8], but…)
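The two-region arithmetic above can be reproduced in a few lines. This is a minimal Python sketch, not code from the talk: it assumes the regions fail independently (as the slide does) and a 365-day year, and the function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, assuming a 365-day year


def combined_availability(per_region: float, regions: int) -> float:
    """Availability of N independent regions: down only when all are down."""
    return 1 - (1 - per_region) ** regions


def downtime_seconds_per_year(availability: float) -> float:
    """Expected downtime per year implied by an availability fraction."""
    return (1 - availability) * SECONDS_PER_YEAR


two_regions = combined_availability(0.9995, 2)
yearly_downtime = downtime_seconds_per_year(two_regions)
print(f"{two_regions:.8f}")
print(f"{yearly_downtime:.1f} s/year")
```

Using the unrounded 0.99999975, this comes to roughly 7.9 s per year; rounding the percentage to 99.99998% first gives about 6.3 s.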
Failover between Regions isn’t free
DNS TTL + (Interval * Threshold) [6]
60s + (5 * 10s) = 110s
Now we’re in the minutes range
Let’s say 1 region fails 3 times in a year [10]
That would be 5.5 minutes (3 × 110 s)
Uh oh, five nines allows only 5.26 minutes a year
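The failover cost can be checked against the five-nines budget directly. A small Python sketch using the slide's example values (60 s TTL, 10 s health-check interval, failure threshold of 5, three regional failures a year); the helper names are hypothetical and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year


def failover_seconds(dns_ttl_s: int, check_interval_s: int, failure_threshold: int) -> int:
    """Worst-case time to shift traffic: detect the failure, then wait out the DNS TTL."""
    return dns_ttl_s + check_interval_s * failure_threshold


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


per_failover_s = failover_seconds(dns_ttl_s=60, check_interval_s=10, failure_threshold=5)
yearly_failover_min = 3 * per_failover_s / 60
five_nines_budget_min = downtime_budget_minutes(0.99999)

print(per_failover_s, yearly_failover_min, round(five_nines_budget_min, 2))
print("budget exceeded:", yearly_failover_min > five_nines_budget_min)
```

Failover time alone exceeds the five-nines budget here, before counting any downtime in which both regions are out.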
and we have more problems...
Some ISPs don’t honor DNS TTLs [11]
(so some long tail will hit the region with the outage)
I haven’t even touched the issue of
data replication across geographically
distant data centers (regions)
Auto Scaling from increased traffic
takes around 5 minutes...
Sharks! Really. [12]
Does any of this matter?
It depends*
“Put simply, a user on a 99% reliable
smartphone cannot tell the difference
between 99.99% and 99.999% service
reliability!” [9]
So where do we go from here?
Aggregate Availability [3] = Successful Requests / Total Requests
HTTP API Availability = (2XX + 3XX + 4XX) / (2XX + 3XX + 4XX + 5XX)
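As a sketch of how this ratio might be computed from raw status codes, here is a short Python example. It is illustrative, not B/R's implementation: the function name is hypothetical, and it follows the formula in counting only 2XX through 5XX responses, with 5XX as the sole failure class.

```python
from collections import Counter
from typing import Iterable, Optional


def http_availability(status_codes: Iterable[int]) -> Optional[float]:
    """(2XX + 3XX + 4XX) / (2XX + 3XX + 4XX + 5XX), or None with no traffic."""
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes[c] for c in (2, 3, 4, 5))
    if total == 0:
        return None  # no requests in the window: availability is undefined
    return (total - classes[5]) / total


# 4XX responses count as available: the service answered correctly,
# the client sent a bad request.
sample = [200] * 90 + [304] * 5 + [404] * 4 + [503] * 1
print(http_availability(sample))
```

Returning `None` for an empty window, rather than 0 or 1, avoids treating a quiet period as either an outage or a perfect score.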
B/R Elixir Services at Origin LB
99.26%
Average of 9 services, April 15 - May 15
B/R Ruby Services at Origin LB
99.91%
Average of 6 services, April 15 - May 15
Wait, that’s better than Elixir!
70% of outages are due to changes in a
live system [1]
Step 1:
If 70% of outages are caused by
change, the best place to start
monitoring is your change/release
process
B/R’s Release Process
Step 2:
Set up centralized request logging with
X-Request-ID headers
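The idea behind this step is that every log line for a request carries the same ID, so one search in the central log store returns the whole request path across services. A minimal Python sketch of that idea follows; the field names and helper functions are assumptions for illustration, not the format B/R uses.

```python
import json
import uuid

REQUEST_ID_HEADER = "x-request-id"


def ensure_request_id(headers: dict) -> str:
    """Reuse the caller's X-Request-ID if present, otherwise mint a new one."""
    for name, value in headers.items():
        if name.lower() == REQUEST_ID_HEADER and value:
            return value
    return uuid.uuid4().hex


def log_line(request_id: str, method: str, path: str, status: int, duration_ms: float) -> str:
    """One JSON object per request, ready to ship to a central log store."""
    return json.dumps(
        {
            "request_id": request_id,
            "method": method,
            "path": path,
            "status": status,
            "duration_ms": duration_ms,
        },
        sort_keys=True,
    )


incoming = {"X-Request-Id": "abc123"}
rid = ensure_request_id(incoming)
print(log_line(rid, "GET", "/articles", 200, 12.5))
# Pass the same ID on any downstream call so its logs can be joined later.
outgoing_headers = {"X-Request-ID": rid}
```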
B/R’s Logging Process
Step 3:
Measure aggregate availability
Step 4:
Alerting
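One conservative way to alert on aggregate availability while limiting false positives is to page only after several consecutive bad measurement windows. The Python sketch below illustrates that approach; the 99% threshold and the three-window rule are assumed values for the example, not B/R's settings.

```python
from typing import Sequence


def should_alert(
    window_availabilities: Sequence[float],
    threshold: float = 0.99,   # assumed alert threshold
    consecutive: int = 3,      # assumed number of bad windows required
) -> bool:
    """True when the most recent `consecutive` windows are all below `threshold`."""
    if len(window_availabilities) < consecutive:
        return False
    recent = window_availabilities[-consecutive:]
    return all(value < threshold for value in recent)


# A single bad window (for example, one deploy blip) does not page anyone.
print(should_alert([0.999, 0.98, 0.999]))
# A sustained drop does.
print(should_alert([0.999, 0.98, 0.97, 0.96]))
```

Requiring more windows delays detection, so the window length and count are a trade-off between response time and midnight wake-up calls.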
B/R’s Alerting & On Call Process
Step 5:
Measure Throughput
Step 6:
Measure Latency -
End to End and at the service level
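The speaker notes for this step stress percentiles over averages. A short Python sketch shows why, using made-up latency samples and a simple nearest-rank percentile; the helper is illustrative, and monitoring tools may use other percentile definitions.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 98 fast requests and 2 very slow ones (milliseconds).
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean_ms)
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
```

The mean here describes no real request: most users see the p50, and the slowest 2% see something far worse, which only the high percentile reveals.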
Step 7:
Monitor Business Metrics
Step 8:
Monitor System Metrics
Step 9:
Measure VM Metrics
Application:
Use a logging system
Use an aggregate data system
Log Releases
Use an alerting system
B/R’s Application of Monitoring
Logz
Datadog
Jenkins
Opsgenie
Remember:
The application / writing the code for
monitoring is the easy part
References
[1] Site Reliability Engineering, p. 10
[2] Site Reliability Engineering, Equation 3-1
[3] Site Reliability Engineering, Equation 3-2
[4] http://erlang.org/download/armstrong_thesis_2003.pdf, p. 191
[5] https://aws.amazon.com/ec2/sla/
[6] https://aws.amazon.com/blogs/aws/route-53-health-check-improvements-faster-interval-and-configurable-failover/
[7] https://en.wikipedia.org/wiki/High_availability
[8] https://aws.amazon.com/blogs/developer/working-with-different-aws-regions/
[9] Site Reliability Engineering, p. 25
[10] Fairly conservative: based on the SLA, it would be 4.38 hours for a region in a year.
[11] https://www.godaddy.com/help/what-factors-affect-dns-propagation-time-1746
[12] http://www.wired.co.uk/article/shark-cables
THANKS!
johnkelly@github
jkellyj@linkedin
code_jk@twitter

Empex - Monitoring Elixir Applications

Editor's Notes

  1. Mention B/R and consumer team. Lead on monitoring. Mention the initial focus on availability. Mention the shift then to how to monitor, with examples from B/R.
  2. The first metric to discuss. What is it? Is it a useful metric? Deep dive on availability, but this same process should be applied to all metrics you rely on.
  3. Conventional definition.
  4. Erlang’s nine nines. Where does that figure come from? What does it mean? I looked for the source.
  5. This quote comes from the 2003 thesis that I believe introduced the nine nines. Having seen the source, how reputable is this claim?
  6. The name “cloud” evokes this nice simple visualization; in reality, TCP/IP in a virtualized environment is very complex. Virtualized environments add a lot more complexity. Talk about our AWS setup: multiple availability zones, load balancers.
  7. When you enable an Availability Zone for your load balancer, Elastic Load Balancing creates a load balancer node in the Availability Zone. If you register instances in an Availability Zone but do not enable the Availability Zone, these registered instances do not receive traffic. Note that your load balancer is most effective if you ensure that each enabled Availability Zone has at least one registered instance.
  8. Say over 3 nines. This is a contrived worst case. Financially incentivized to not go lower. Of course it could drop lower, and it could be much higher. “Region Unavailable” and “Region Unavailability” mean that more than one Availability Zone in which you are running an instance, within the same Region, is “Unavailable” to you. Our Elixir apps are therefore limited to 99.95% availability. Can we do better?
  9. We’re in a single region, but let’s explore going multi-region. What kind of complexity does this introduce?
  10. Route 53 Failover Routing Policy (active-passive failover); there are others. Each AWS region is a completely independent stack of services, totally isolated from other regions. https://aws.amazon.com/blogs/developer/working-with-different-aws-regions/
  11. This assumes regions are independent, which is what AWS states. This downtime is the downtime where both systems are down simultaneously.
  12. Walk through this slide. What does the interval mean? TTL is how long the caching name servers cache the record from the authoritative name server. What does the threshold mean? Why not a threshold of 1?
  13. Quick slide
  14. Depends on the persistence layer: Kafka, Cassandra vs. MySQL, Postgres.
  15. Possible mitigation vectors for this: burstable instances, container orchestration, serverless. What else?
  16. Possible mitigation vectors for this: burstable instances, container orchestration, serverless.
  17. Quick slide
  18. From Google: how reliable is a cellular network that fails 1 out of 100, or 1 out of 1000, times?
  19. Quick slide
  20. Google introduces Aggregate Availability in their SRE book. Contrast with the traditional uptime version. Can you accurately measure time in a cloud system? VMs are preempted for tens of milliseconds.
  21. What we use at B/R is a modified version.
  22. Let’s revisit the architecture of one of B/R’s many Elixir services. We collect metrics from the LB nodes and send them to Datadog via CloudWatch. WE CHEAT: usage of a CDN, and this is origin.
  23. We didn’t spend 2 years switching to get worse! Quick slide.
  24. Ruby apps were under less active development and being sunsetted in favor of Elixir apps. Ruby apps aren’t changed except in exceptional cases. This is a comparison of microservices. The Elixir apps see deploys daily or even multiple times a day.
  25. Version control for software. Version control for infrastructure. Record releases with date, time, and releaser. Planning for rollback. We don’t deploy after 3pm*.
  26. What is happening in my system?
  27. ELK, Logz, plug_logger_json, ecto_logger_json. Bug me on the docs.
  28. Is my system working correctly?
  29. Is my system working correctly?
  30. When something goes wrong, how do I know? The danger of false positives. Start conservative. Alerting and avoiding midnight wake-up calls.
  31. Opsgenie. 1-week shifts. Different severities, different response times. Anyone in the company can submit an issue, not just automation. We have a Slack channel for these alerts. An alert is then split out into a dedicated channel with stakeholders. The goal is to triage and try to solve it yourself, but you have permission to drag in someone not on call if necessary. Management also gets paged. Phone calls if not answered within 2 minutes.
  32. When something goes wrong, how do I know? The danger of false positives. Start conservative.
  33. Needed for capacity planning and figuring out user behaviors
  34. Needed for capacity planning and figuring out user behaviors
  35. There have been many studies on the importance of latency with regard to customer behavior. A 100 ms difference can have a measurable impact on revenue or satisfaction. Percentiles, not averages! Clocks: how accurately can you actually measure latency?
  36. We tend to shove business metrics into a separate analytics bucket. But what is analytics? Is that not business logic monitoring? How many users are using this piece of software, etc.
  37. How many users are using this piece of software, etc.
  38. CPU, open file descriptors, memory, disk space, network
  39. CPU, open file descriptors, memory, disk space, network. These can creep up over time.
  40. Check time Run Queue
  41. Run queue. At 100% CPU, after a few minutes the run queue would increase to 20ish. Infinite loop caused by a missed pattern-match case.
  42. Run Queue
  43. You need to deep dive like we did with availability for all the metrics you monitor