Empex - Monitoring Elixir Applications

1) Monitoring availability is important for Elixir applications. Running in two AWS regions gives a theoretical availability of about 99.99998%, but failover between regions is not instant: DNS TTLs and health-check intervals add delay.
2) About 70% of outages are caused by changes to a live system, so the release process is the first thing to monitor. Centralized logging with request IDs and aggregated request data help track availability and failures.
3) Measuring business, system, and VM metrics together gives full visibility. Bleacher Report's monitoring includes logging to Logz, metrics to Datadog, release tracking with Jenkins, and alerting with Opsgenie.

1. Monitoring Elixir Applications - John Kelly, Sr. Engineer, Bleacher Report

2. Availability

3. Time-Based Availability [2]: $\text{availability} = \dfrac{\text{uptime}}{\text{uptime} + \text{downtime}}$
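
A minimal Python sketch of the time-based definition above, plus the downtime budget that a given availability target implies. The function names and the 365-day year are assumptions made for illustration; they are not from the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; assumes a 365-day year


def time_based_availability(uptime_s: float, downtime_s: float) -> float:
    """Return uptime / (uptime + downtime) as a fraction between 0 and 1."""
    total = uptime_s + downtime_s
    if total <= 0:
        raise ValueError("uptime + downtime must be positive")
    return uptime_s / total


def allowed_downtime_s(availability: float, period_s: float = SECONDS_PER_YEAR) -> float:
    """Downtime budget (in seconds) implied by an availability target over a period."""
    return (1.0 - availability) * period_s


if __name__ == "__main__":
    for target in (0.999, 0.9995, 0.99999):
        minutes = allowed_downtime_s(target) / 60
        print(f"{target:.5%} availability allows {minutes:.2f} minutes of downtime per year")
```

With these assumptions, a 99.95% target allows about 4.38 hours of downtime a year and a 99.999% target about 5.26 minutes, which matches the figures quoted in the references and on the later slides.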

4. Nine Nines: 99.9999999%

5. “Evidence for the long-term operational stability of the system had also not been collected in any systematic way. For the Ericsson AXD301 the only information on the long-term stability of the system came from a power-point presentation showing some figures claiming that a major customer had run an 11 node system with a 99.9999999% reliability, though how these figure had been obtained was not documented.” [4]

6. The Cloud

8. AWS EC2 SLA: 99.95% [5]

9. What if you were in multiple regions?

10.

11. 2 Region AWS Availability
    99.95% = .9995
    Errors = .0005
    .0005 * .0005 = 0.00000025
    1 - 0.00000025 = 0.99999975
    ~99.99998% (more than 6 nines, almost 7!)
    ~8 s of downtime in a year [7, 8], but…
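
The arithmetic on this slide can be checked with a short Python sketch. It assumes, as the slide does, that regions fail independently, so the probability that all of them are down at once is the product of the individual failure probabilities. The function name and the 365-day year are assumptions for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def combined_availability(per_region: float, regions: int = 2) -> float:
    """Probability that at least one of N independent regions is available."""
    if regions < 1:
        raise ValueError("need at least one region")
    return 1.0 - (1.0 - per_region) ** regions


two_region = combined_availability(0.9995, regions=2)
downtime_s = (1.0 - two_region) * SECONDS_PER_YEAR

if __name__ == "__main__":
    print(f"two-region availability: {two_region:.6%}")
    print(f"simultaneous downtime: {downtime_s:.1f} s/year")
```

Under these assumptions the combined figure is 0.99999975, which is a few seconds of simultaneous downtime per year (about 7.9 s). It says nothing about the time spent failing over from one region to the other.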

12. Failover between Regions isn’t free
    DNS TTL + (Interval * Threshold) [6]
    60s + (10s * 5) = 110s
    Now we’re in the minutes range
    Let’s say 1 region fails 3 times in a year [10]
    That would be 5.5 minutes
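
The failover estimate on this slide can be written as a small Python sketch. The parameter names are assumptions; the values (60 s TTL, 10 s health-check interval, threshold of 5, three regional failures a year) are the ones used on the slide.

```python
def failover_seconds(dns_ttl_s: float, check_interval_s: float, failure_threshold: int) -> float:
    """Worst-case time before clients are routed away from a failed region.

    The health checker needs `failure_threshold` consecutive failed checks,
    one every `check_interval_s`, and resolvers may then keep serving the
    old record for up to `dns_ttl_s`.
    """
    return dns_ttl_s + check_interval_s * failure_threshold


def yearly_failover_minutes(failures_per_year: int, per_failover_s: float) -> float:
    """Total minutes per year spent waiting for failover."""
    return failures_per_year * per_failover_s / 60


if __name__ == "__main__":
    per_failover = failover_seconds(dns_ttl_s=60, check_interval_s=10, failure_threshold=5)
    print(per_failover)                              # 110 seconds per failover
    print(yearly_failover_minutes(3, per_failover))  # 5.5 minutes per year
```

Lowering the threshold shortens failover but makes a single failed health check more likely to trigger an unnecessary switch, which is the trade-off behind the speaker's "why not threshold 1" question.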

13. Uh oh, 5.26 minutes of downtime a year is five 9s, and we have more problems...

14. Some ISPs don’t honor DNS TTLs [11] (so some long tail will hit the region with the outage)

15. I haven’t even touched the issue of data replication across geographically distant data centers (regions)

16. Auto Scaling from increased traffic takes around 5 minutes...

17. Sharks! Really. [12]

18. Does any of this matter? It depends*

19. “Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability!” [9]

20. So where do we go from here?

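21. Aggregate Availability [3]: $\text{availability} = \dfrac{\text{successful requests}}{\text{total requests}}$

22. HTTP API Availability: $\text{availability} = \dfrac{\text{2XX} + \text{3XX} + \text{4XX}}{\text{2XX} + \text{3XX} + \text{4XX} + \text{5XX}}$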
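
A Python sketch of the HTTP API availability ratio above, computed from a stream of response status codes. The function name is hypothetical. Counting 4XX responses as successes follows the slide's formula; presumably the reasoning is that a client error is not a failure of the service. Codes outside the 2XX to 5XX classes are ignored, since the formula does not mention them.

```python
from collections import Counter
from typing import Iterable, Optional


def http_availability(status_codes: Iterable[int]) -> Optional[float]:
    """Return (2XX + 3XX + 4XX) / (2XX + 3XX + 4XX + 5XX), or None with no traffic."""
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes[c] for c in (2, 3, 4, 5))
    if total == 0:
        return None  # no requests in the window: availability is undefined
    return (total - classes[5]) / total


if __name__ == "__main__":
    sample = [200, 200, 301, 404, 500, 200, 503, 200]
    print(http_availability(sample))  # 6 of 8 responses are non-5XX: 0.75
```

Returning `None` for an empty window is a design choice for this sketch: a period with no traffic is reported as "unknown" rather than as 100% or 0% available.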

23.

24. B/R Elixir Services at Origin LB: 99.26% (average of 9, April 15 - May 15)

25. B/R Ruby Services at Origin LB: 99.91% (average of 6, April 15 - May 15)

26. Wait, that’s better than Elixir!

27. 70% of outages are due to changes in a live system [1]

28. Step 1: If 70% of outages are caused by change, the best place to start monitoring is your change/release process
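
One way to start monitoring the release process is to record every release as a structured event, so that an incident can be lined up against recent changes. The Python sketch below is an illustration only: the field names and both helper functions are hypothetical, and it is not a description of B/R's tooling. The speaker notes mention recording the date, time, and releaser, which is what the record holds.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def release_event(service: str, version: str, releaser: str,
                  released_at: Optional[datetime] = None) -> dict:
    """Build a release record: what was released, by whom, and when (UTC)."""
    released_at = released_at or datetime.now(timezone.utc)
    return {
        "event": "release",
        "service": service,
        "version": version,
        "releaser": releaser,
        "released_at": released_at.isoformat(),
    }


def releases_before(events: list, incident_at: datetime,
                    window: timedelta = timedelta(hours=1)) -> list:
    """Return the releases that landed within `window` before an incident."""
    return [
        e for e in events
        if incident_at - window <= datetime.fromisoformat(e["released_at"]) <= incident_at
    ]


if __name__ == "__main__":
    deploy = release_event("example-api", "1.4.2", "jk",
                           datetime(2017, 5, 15, 14, 30, tzinfo=timezone.utc))
    incident = datetime(2017, 5, 15, 15, 0, tzinfo=timezone.utc)
    print(releases_before([deploy], incident))
```

Sending the same record to the logging or metrics system lets release markers appear next to availability and latency graphs.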

29. B/R’s Release Process

30. Step 2: Set up centralized request logging with X-Request-Id headers
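
The idea behind request IDs is that every service handling a request logs the same identifier, so a central log store can stitch the hops back together. The Python sketch below shows the pattern in a language-neutral way; the function names and log fields are assumptions, and in an Elixir application this job is normally done by the web framework's plugs and logger rather than by hand-written code.

```python
import json
import uuid
from datetime import datetime, timezone

REQUEST_ID_HEADER = "x-request-id"  # HTTP header names are case-insensitive


def ensure_request_id(headers: dict) -> str:
    """Reuse the caller's request ID if present; otherwise mint a new one."""
    for name, value in headers.items():
        if name.lower() == REQUEST_ID_HEADER and value:
            return value
    return uuid.uuid4().hex


def log_line(request_id: str, service: str, message: str, **fields) -> str:
    """Render one JSON log line that a central log store can index by request_id."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "request_id": request_id,
        "message": message,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    incoming = {"X-Request-Id": "abc123", "Accept": "application/json"}
    rid = ensure_request_id(incoming)
    print(log_line(rid, "example-api", "request completed", status=200, duration_ms=12))
    # Pass the same ID to downstream services so their logs can be correlated.
    outgoing_headers = {"X-Request-Id": rid}
```

Searching the log store for one `request_id` then returns every line written for that request, across all services.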

31. B/R’s Logging Process

32. Step 3: Measure aggregate availability

33.

34. Step 4: Alerting

35. B/R’s Alerting & On Call Process

36.

37. Step 5: Measure Throughput

38.

39. Step 6: Measure Latency - End to End and at the service level
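
The speaker notes for this step stress percentiles over averages. The Python sketch below uses the nearest-rank method to show why: a few slow requests barely move the median but dominate the tail. The function name and the sample data are made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # 98 fast requests and 2 very slow ones (milliseconds, invented data).
    latencies_ms = [10] * 98 + [1000, 2000]
    mean = sum(latencies_ms) / len(latencies_ms)
    print(f"mean={mean} p50={percentile(latencies_ms, 50)} p99={percentile(latencies_ms, 99)}")
```

Here the mean is 39.8 ms, which describes neither the typical request (10 ms) nor the slow ones (1000 ms at p99). Metrics systems typically compute percentiles from histograms or sketches rather than from raw samples, so treat this as an explanation of the concept, not of any particular tool.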

40.

41.

42. Step 7: Monitor Business Metrics

43.

44. Step 8: Monitor System Metrics

45.

46. Step 9: Measure VM Metrics

47.

48.

49. Application:
    Use a logging system
    Use an aggregate data system
    Log releases
    Use an alerting system

50. B/R’s Application of Monitoring:
    Logz
    Datadog
    Jenkins
    Opsgenie

51. Remember: The application / writing the code for monitoring is the easy part

52. References
    [1] Site Reliability Engineering - p.10
    [2] Site Reliability Engineering - Equation 3-1
    [3] Site Reliability Engineering - Equation 3-2
    [4] http://erlang.org/download/armstrong_thesis_2003.pdf p.191
    [5] https://aws.amazon.com/ec2/sla/
    [6] https://aws.amazon.com/blogs/aws/route-53-health-check-improvements-faster-interval-and-configurable-failover/

53. References
    [7] https://en.wikipedia.org/wiki/High_availability
    [8] https://aws.amazon.com/blogs/developer/working-with-different-aws-regions/
    [9] Site Reliability Engineering - p.25
    [10] (fairly conservative; based on the SLA it would be 4.38 hours for a region in a year)
    [11] https://www.godaddy.com/help/what-factors-affect-dns-propagation-time-1746
    [12] http://www.wired.co.uk/article/shark-cables

54. THANKS! johnkelly@github jkellyj@linkedin code_jk@twitter

Editor's Notes

Mention B/R and the consumer team. Lead on monitoring. Mention the initial focus on availability. Mention the shift then to how to monitor, with examples from B/R.
The first metric to discuss. What is it? Is it a useful metric? Deep dive on availability, but this same process should be applied to all metrics you rely on.
Conventional definition
Erlang’s nine nines. Where does that figure come from? What does it mean? I looked for the source.
This quote comes from the 2003 thesis that I believe introduced the nine nines. Having seen the source, how reputable is this claim?
The name “cloud” invokes this nice simple visualization; in reality TCP/IP in a virtualized environment is very complex. Virtualized environments add a lot more complexity. Talk about our AWS setup: multiple availability zones, load balancers.
When you enable an Availability Zone for your load balancer, Elastic Load Balancing creates a load balancer node in the Availability Zone. If you register instances in an Availability Zone but do not enable the Availability Zone, these registered instances do not receive traffic. Note that your load balancer is most effective if you ensure that each enabled Availability Zone has at least one registered instance.
Say over 3 nines. This is a contrived worst case. Financially incentivised to not go lower. Of course it could drop lower and it could be much higher. “Region Unavailable” and “Region Unavailability” mean that more than one Availability Zone in which you are running an instance, within the same Region, is “Unavailable” to you. Our Elixir apps are therefore limited to 99.95% availability. Can we do better?
We’re in a single region, but let’s explore going multi-region. What kind of complexity does this introduce?
Route 53 Failover Routing Policy (active-passive failover). There are others. Each AWS region is a completely independent stack of services, totally isolated from other regions. https://aws.amazon.com/blogs/developer/working-with-different-aws-regions/
This assumes regions are independent, which is what AWS states. This downtime is the downtime where both systems are down simultaneously.
Walk through this slide. What does the interval mean? TTL is how long the caching name servers cache the record from the authoritative name server. What does the threshold mean? Why not threshold 1?
Quick slide
Depends on the persistence layer: Kafka, Cassandra vs. MySQL, Postgres.
Possible mitigation vectors for this: burstable instances, container orchestration, serverless. What else?
Possible mitigation vectors for this: burstable instances, container orchestration, serverless.
Quick slide
From Google. How reliable is the cellular network? 1 out of 100 or 1 out of 1000 fail.
Quick slide
Google introduces Aggregate Availability in their SRE book. Contrast with the traditional uptime version. Can you accurately measure time in a cloud system? VMs can be preempted for tens of milliseconds.
What we use at BR is a modified version
Let’s revisit the architecture of one of B/R’s many Elixir services. We collect metrics from the LB nodes and send them to Datadog via CloudWatch. We cheat: we use a CDN, and this is measured at the origin.
We didn’t spend 2 years switching to get worse! Quick slide.
Ruby apps were under less active development and being sunsetted in favor of Elixir apps. Ruby apps aren’t changed except in exceptional cases. This is a comparison of microservices. The Elixir apps see deploys daily or even multiple times a day.
Version control for software. Version control for infrastructure. Record releases with date, time, and releaser. Planning for rollback. We don’t deploy after 3pm*
What is happening in my system?
ELK, Logz, plug_logger_json, ecto_logger_json. Bug me on the docs.
Is my system working correctly?
Is my system working correctly?
When something goes wrong, how do I know? The danger of false positives. Start conservative. Alerting and avoiding midnight wake-up calls.
Opsgenie. 1-week shifts. Different severities / different response times. Anyone in the company can submit an issue, not just automation. We have a Slack channel for these alerts. An alert is then split out into a dedicated channel with stakeholders. Goal is to triage and try to solve it yourself, but have permission to drag in someone not on call if necessary. Management also gets paged. Phone calls if not answered within 2 minutes.
When something goes wrong, how do I know? The danger of false positives. Start conservative.
Needed for capacity planning and figuring out user behaviors
Needed for capacity planning and figuring out user behaviors
There have been many studies on the importance of latency with regard to customer behavior. A 100ms difference can have a measurable impact on revenue or satisfaction. Percentiles, not averages! Clocks: how accurately can you actually measure latency?
We tend to shove business metrics into a separate analytics bucket. But what is analytics? Is that not business logic monitoring? How many users are using this piece of software, etc.
How many users are using this piece of software, etc.
CPU, open file descriptors, memory, disk space, network
CPU, open file descriptors, memory, disk space, network. These can creep up over time.
Check time. Run queue.
Run queue. 100% CPU: after a few minutes, the run queue would increase to 20ish. Infinite loop caused by a missed pattern match case.
Run Queue
You need to deep dive like we did with availability for all the metrics you monitor
