True Observability
Why, What & Architecture
Jeremy Proffitt
Ally Financial
Director DevOps & SRE C3
Even the most dependable employee is not 100% reliable.
Agenda
01 What We Monitor & Why
We’ll discuss logs vs metrics, active checks, infrastructure beyond just memory, CPU and hard drive - and APM monitoring of applications.
02 Production Reliability
A discussion of how uptime and outages impact customer satisfaction and revenue.
03 Auditing Your Monitoring
The largest gap in the industry today is the tooling required to bring together our ability to inventory our systems and validate monitoring.
04 How Do I Get Started?
The reality of getting started, the tools and what they give you and, finally, what gives the best return on your investment.
Logs vs Metrics
Imagine collecting information such as when your application throws an error, the time required to perform actions, or client activity such as logging in.
Logs allow us to capture events, in detail, and store them for analysis. One of the most common logs is the access log for web services, i.e. a record of who has visited your website, what page they visited, their browser type and even how long it took to return information to the user.
Metrics are the aggregation of events in a time period, for example, 1,045 web site visits in the last minute.
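To make the distinction concrete, here is a minimal sketch that turns individual log events into a per-minute metric. The event tuples and the `visits_per_minute` helper are made up for illustration; real access logs would need parsing first.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access-log events: (ISO timestamp, path, status, response ms).
LOG_EVENTS = [
    ("2024-04-08T10:00:05", "/login", 200, 120),
    ("2024-04-08T10:00:41", "/cart", 500, 950),
    ("2024-04-08T10:01:02", "/login", 200, 110),
]


def visits_per_minute(events):
    """Collapse individual log events into a visit count per minute (a metric)."""
    counts = Counter()
    for timestamp, _path, _status, _ms in events:
        # Truncate each event time to the start of its minute.
        minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
        counts[minute.isoformat()] += 1
    return dict(counts)


metrics = visits_per_minute(LOG_EVENTS)
```

The logs still hold the single 500 error on `/cart`; the metric only knows that two visits happened in that minute.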
Logging
The raw truth
The individual nature of logs allows us to pick out singular events, such as a single error or delayed response. Logs are often chained together with other logs to form a story that shows how users and information flow through your system.
We don’t just look for errors. The absence of logs from a source, looked at either individually or as a count by server, can provide valuable insight when a server or service goes offline.
Logs provide us raw data that we can aggregate in differing ways, letting us write queries to answer specific questions.
By watching for specific events in logs, for example high response times or errors, we can build alerting that informs us when systems are not performing correctly.
Not only can we look for the absence of logs, we can also track users page to page to determine where users are leaving, or even whether a website isn’t allowing users to advance to the next step.
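The "absence of logs" check can be sketched as follows. This assumes you already track the time of the most recent log line per source; the source names and the five minute limit are illustrative, not recommendations.

```python
from datetime import datetime, timedelta


def silent_sources(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return the sources whose most recent log line is older than max_silence."""
    return sorted(
        source for source, seen in last_seen.items() if now - seen > max_silence
    )


# Hypothetical "last log line received" times per server.
now = datetime(2024, 4, 8, 10, 30)
last_seen = {
    "web-1": datetime(2024, 4, 8, 10, 29),  # logged one minute ago
    "web-2": datetime(2024, 4, 8, 10, 12),  # silent for 18 minutes
}
quiet = silent_sources(last_seen, now)
```

A source that appears in `quiet` has not failed loudly; it has simply stopped talking, which is often the only sign that a server or service is offline.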
Metrics
The truth in aggregate
When we look at logs, the majority of the time we’re looking at an aggregate view: how many of what in what amount of time. Multiple cloud providers now provide this information as an easy-to-use metric, for example the number of visitors, average response time or even the number of times error pages are displayed.
In the cloud, the same metric is often available across multiple provider offerings. For example, HTTP 500 errors are a metric common to Load Balancers, CloudFront and other web services in AWS.
Unlike logs, aggregated metrics are fixed: they count in one boilerplate, out-of-the-box way.
Alerting on metrics provides a quick and simple way of asking: do I have any errors? Is my web server running slow? Am I using too much memory, CPU or disk space?
Metrics are less expensive to generate and keep than logs. They are a low-cost method of wrapping our systems in layers of alerting, with the downside of not being able to drill into a singular event.
Watching Infrastructure - aka “The Servers”.
Memory usage on a server should be monitored. Some systems will automatically reset your container or application when it runs out of memory, while others simply produce unpredictable results.
Hard drives, or where the data often goes, are critical. Imagine not being able to write a new customer’s order to a database because you ran out of space.
Host health - is the physical box healthy, is the operating system working?
CPU on a server should be monitored. Low CPU availability can lead to longer wait times for customers and even timeouts or failure to process items in a set amount of time.
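As a small illustration of an infrastructure check, the sketch below reads disk usage with Python's standard `shutil.disk_usage` and classifies it. The 80% and 90% thresholds and the `classify_usage` helper are assumptions chosen for the example.

```python
import shutil


def classify_usage(percent_used, warn_at=80.0, alert_at=90.0):
    """Map a usage percentage to 'ok', 'warning' or 'alert'."""
    if percent_used >= alert_at:
        return "alert"
    if percent_used >= warn_at:
        return "warning"
    return "ok"


def disk_percent_used(path="."):
    """Percentage of the disk holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


status = classify_usage(disk_percent_used("."))
```

The same classification could be applied to memory or CPU readings, which usually come from an agent or the cloud provider rather than the standard library.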
Hey! But ... I’m SERVERLESS!
YO! It still runs on a server.
Serverless architecture, regardless of provider or advertised features, must still run on a server. Lambdas still require memory and processing time - metrics which must be kept in check. Because you pay for what you use, if your Lambdas take twice as long now as they did yesterday, they will cost you twice as much. Serverless databases are often constrained by queries per minute and concurrent connections, and can scale automatically without your consent.
While there are many advantages to serverless, in some ways it’s like giving a teenager a cell phone - be careful, the overages can kill you!!
Memory
Monitor memory usage for your serverless applications.
Processing Time
Time is literally money: the longer your process takes, the more it costs.
Connections & Queries
Be very mindful of the limitations and cost of serverless data storage mechanisms.
Scaling up is Paying Up
When traffic increases, serverless resources increase, which means you pay more.
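The "twice as long costs twice as much" point can be checked with a simplified cost model, sketched here. It assumes billing proportional to memory multiplied by duration (GB-seconds) and ignores per-request fees and free tiers; the price is a placeholder, not a quoted rate.

```python
def invocation_cost(duration_ms, memory_mb, price_per_gb_second):
    """Compute-only cost of one invocation under a GB-second billing model."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price_per_gb_second


# Placeholder price per GB-second; check your provider's current pricing.
PRICE_PER_GB_SECOND = 0.0000166667

yesterday = invocation_cost(200, 512, PRICE_PER_GB_SECOND)
today = invocation_cost(400, 512, PRICE_PER_GB_SECOND)  # same code, twice as slow
cost_ratio = today / yesterday
```

Under this model, alerting on average duration is also alerting on cost.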
APM
Monitoring Your Code
APM, or Application Performance Monitoring, is fast becoming a new tool for improving reliability and development cycles. With the ability to simply drop modules into your code, server or container, APMs promise the visibility to track errors and slowdowns in your code, along with how your systems interact.
What?
APM often hooks into the “back door” of code, providing essential information such as how long an application waits for an API call or database call to complete, memory usage (often broken out by different types or heaps of memory) and, of course, application exceptions and errors.
APM can also let you see transactions as they flow between applications, provided they all have the same APM instrumentation. This gives you a map of the path traveled and helps pinpoint issues in larger systems.
It should be noted that APMs are often metric-based systems and are known to use only a sample or subset of the actual data to represent information on the screen. While still useful, this can introduce subtle differences between APMs, making on-the-fence or hard-to-find issues even more difficult to track.
WARNING!!!!
APMs often will not capture an error that a developer has already caught in a try/catch statement! Imagine the frustration of not seeing all the errors you were promised! Most APMs provide a way to send the exception to the APM with a code change - but this substantially increases the rollout cost of an APM.
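The caught-exception problem looks roughly like the sketch below. `FakeApmClient` and its `report_exception` method are hypothetical stand-ins; each real APM agent has its own API for reporting handled errors.

```python
class FakeApmClient:
    """Hypothetical stand-in for a vendor APM agent (not a real library)."""

    def __init__(self):
        self.reported = []

    def report_exception(self, exc, context=None):
        # A real agent would send this to the APM backend.
        self.reported.append((type(exc).__name__, str(exc), context or {}))


apm = FakeApmClient()


def parse_quantity(raw):
    """Parse an order quantity, falling back to 0 on bad input."""
    try:
        return int(raw)
    except ValueError as exc:
        # The error is handled here, so an agent that only sees uncaught
        # exceptions would miss it. Reporting it explicitly is the code change.
        apm.report_exception(exc, {"raw": raw})
        return 0


result = parse_quantity("twelve")
```

Every existing try/catch block that matters needs a line like this, which is where the extra rollout cost comes from.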
Active Monitoring
Synthetic Monitoring
Actively accessing your site, simulating a user experience. This often includes a robotic login and clicking about to validate the functionality of a site.
Watch Dogs
To ensure a system is alive, we often have it send an outbound signal, or make an HTTP call, every x minutes. If no call comes in after a period of time, an alert is fired.
Simple Web Monitoring
Simply accessing a web page or API and checking for specific results or result codes (like HTTP 200 success responses) provides a simple way of asking: is my website or API alive?
Certificate Validation
Actively checking HTTPS certificates and providing notification 15 to 30 days before they expire will save the embarrassment of expired certificates causing revenue and user experience impacts.
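Two of these checks reduce to simple date arithmetic, sketched below. The certificate date string uses the format Python's `ssl` module returns in a certificate's `notAfter` field; fetching the certificate over the network is left out, and the dates and the 10 minute watchdog gap are illustrative.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Whole days between `now` and a certificate 'notAfter' date string."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def certificate_needs_renewal(not_after, now, notice_days=30):
    """True once the certificate is within the notice window."""
    return days_until_expiry(not_after, now) <= notice_days


def watchdog_expired(last_heartbeat, now, max_gap_minutes=10):
    """True if no heartbeat has arrived within the allowed gap."""
    return (now - last_heartbeat).total_seconds() > max_gap_minutes * 60


now = datetime(2024, 5, 15, 12, 0, tzinfo=timezone.utc)
days_left = days_until_expiry("Jun  1 12:00:00 2024 GMT", now)
renew = certificate_needs_renewal("Jun  1 12:00:00 2024 GMT", now)
stale = watchdog_expired(datetime(2024, 5, 15, 11, 40, tzinfo=timezone.utc), now)
```

In practice `not_after` would come from the live certificate of the site being checked, and the heartbeat time from whatever records the watchdog calls.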
The IMPACT!
Revenue & Customer Experience
The false belief that downtime is preventable!
Repeat after me, “Downtime will happen.” Say it again, “Downtime will happen!”
There is only one promise I can make as we go through this presentation today: at some point, systems have downtime - it’s absolutely unavoidable. How we plan to recover from those outages, the decisions to build out redundancy, and the impact not just on revenue but on reputation are all part of a complex equation.
Beyond the discussion of application and system architecture, redundancy through the use of multiple servers, resources, data centers and networking layers is part of a larger business discussion. I’ll say that again: for the most part, reliability through redundant resources is a business decision, not a technical one.
When reviewing downtime, customer and revenue impact should be thought of as a time-based equation. Downtime for customers can have crippling impacts on app ratings and word of mouth or, worse, lead to customer rants that ward off new customers.
Be aware of the 100% uptime promise! It’s often riddled with exceptions for “emergency” maintenance windows used to cover up production issues.
The IMPACT!
Communication - Internal & External
Communication is about perception - and is perhaps one of the most important aspects of both an outage and career advancement.
Communication is a balance. Do we communicate a 5 minute database slowdown to our customers and CEO? Likely not. The general rule I’ve used is that if the issue is over before the communication can be sent, or the impact is close to resolution, we’ll typically not send communication beyond IT leadership. Applying this rule takes a fair amount of hands-on learning for your organization: be flexible, and validate that your communication is appropriate and accurate.
We communicate to our direct supervisor, and hopefully up the chain to our CIO/CTO, because you never want these people to be asked what’s going on without immediately having an answer for other members of the C-Suite. This empowers your managers and will have a positive impact on your career.
Communication to customers is a business decision, and for this I would bring in marketing and legal for a discussion of pre-canned messages. This is extremely critical for larger outages - and remember, think about how you’d feel as a customer.
When you have an outage, nothing is better than the truth. Want to know how to do it right? Research Johnson and Johnson’s Tylenol recall.
Monitoring Auditing - Beyond Monitoring as Code
Monitoring as code
Monitoring alerts can be written as code and pushed out through APIs to different systems, saving the time of manual implementation and reducing the cost of errors made during it.
Cloud systems and APIs
Cloud providers can give you a list of all the architecture in your account via an API call, library or command line interface. This allows us to ask: what servers do I have?
Tagging - the what is it
Tagging on architecture is typically descriptive, like “MySQL Customer Database”. If we look deeper, we find example after example of tags being used to define how other parts of the architecture work together.
The power of the audit
Not only can we ask a cloud provider for a list of all cloud architecture elements, we can do the same in our monitoring system. By comparing these lists, we can validate that all systems are being monitored.
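At its core the audit is a comparison of two lists, which might look like the sketch below. The resource identifiers are invented; in practice one list would come from the cloud provider's inventory API and the other from the monitoring system's API.

```python
def audit_monitoring(inventory, monitored):
    """Compare the cloud inventory against what the monitoring system watches."""
    inventory_ids = set(inventory)
    monitored_ids = set(monitored)
    return {
        # Exists in the cloud account but has no monitoring.
        "unmonitored": sorted(inventory_ids - monitored_ids),
        # Still monitored but no longer exists in the account.
        "orphaned": sorted(monitored_ids - inventory_ids),
    }


# Hypothetical identifiers standing in for real API results.
cloud_inventory = ["db-customers", "web-1", "web-2", "queue-orders"]
monitoring_targets = ["db-customers", "web-1", "web-old"]
report = audit_monitoring(cloud_inventory, monitoring_targets)
```

The "orphaned" list is a useful by-product: it shows monitors left behind by decommissioned systems.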
Monitoring as Code - Rolling out modifiable templates
Monitoring as Code
Monitoring is often rolled out per application and per need, as multiple independent systems.
But there is a more elegant way to roll out automated monitoring using code - here’s how.
Start with Monitoring Templates
Review all the architecture you use or want to use, and generate alerts that address your business needs. From these, you can build a standard set of alerts.
Step through your architecture!
Step through each object in your cloud account, capturing both the object name and the tags that define custom template adjustments.
Roll it out!
Now that you have the objects and their custom adjustments, alerting can be created or updated from the templates.
Rinse and Repeat
Repeating the rollout daily ensures manual changes are rolled back and new architecture is monitored quickly - and without intervention.
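The template, step-through and rollout stages could be sketched like this. The tag names (`monitor:template`, `monitor:<metric>`), the templates and the resources are all hypothetical conventions made up for the example, and the final push to a monitoring API is omitted.

```python
# Hypothetical standard alert templates, keyed by template name.
TEMPLATES = {
    "database": [
        {"metric": "disk_used_percent", "threshold": 90},
        {"metric": "cpu_percent", "threshold": 85},
    ],
    "web": [{"metric": "http_5xx_count", "threshold": 10}],
}


def build_alerts(resources, templates):
    """Expand templates into concrete alerts, applying per-resource tag overrides."""
    alerts = []
    for resource in resources:
        tags = resource["tags"]
        for rule in templates.get(tags.get("monitor:template"), []):
            # A tag such as monitor:cpu_percent=95 overrides the template value.
            override = tags.get(f"monitor:{rule['metric']}")
            threshold = float(override) if override is not None else float(rule["threshold"])
            alerts.append(
                {"resource": resource["name"], "metric": rule["metric"], "threshold": threshold}
            )
    return alerts


# Hypothetical objects as they might be listed from a cloud account.
resources = [
    {"name": "db-customers", "tags": {"monitor:template": "database", "monitor:cpu_percent": "95"}},
    {"name": "web-1", "tags": {"monitor:template": "web"}},
    {"name": "scratch-box", "tags": {}},  # no template tag, so no alerts
]
alerts = build_alerts(resources, TEMPLATES)
```

Running a script like this on a daily schedule is what gives the "rinse and repeat" behavior: the generated alerts are always rebuilt from the templates and tags.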
Did you know?
None of the current leaders in
Monitoring offer a system so simple
or clean - and one has to wonder
why?
Routing your Monitoring Alerts
Alerts Generated
Ensure your alerts include metadata that can be used to determine importance and team ownership.
Alert Processing (PagerDuty)
Processing alerts can be complex: routing alerts to teams and setting hours for specific levels of alerting.
Alert Fatigue - the fastest way to start searching for new Team Mates
Alert fatigue, the cry wolf of IT, happens when alerts that don’t require immediate resolution are routed incorrectly, or when their importance is not in alignment with the company’s needs. I.e. when you get alerts that you shouldn’t and they wake you up at 3am continuously, those LinkedIn invites for a new job become more and more tempting.
Set Paging Escalation and Expectations
When you reach out to employees after hours, you should have an escalation policy that reaches out to the backup person automatically.
It’s very important that expectations are set with the entire team. Does a laptop travel with on-call engineers? Do engineers come in late if they’ve been up all night? When do you wake up your entire team for a team response?
Alerts vs Warnings! Consider adding warnings for most alerts, for example hard drive space. This allows working-hours support, which can prevent the 3am wake up.
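A routing rule driven by alert metadata might look like the sketch below. The field names (`environment`, `severity`, `team`), the 9 to 5 working hours and the `page`/`notify`/`queue` destinations are assumptions for illustration; a tool such as PagerDuty would express this through its own configuration.

```python
from datetime import time


def route_alert(alert, local_time, day_start=time(9, 0), day_end=time(17, 0)):
    """Decide how an alert reaches its owning team, based on its metadata."""
    in_working_hours = day_start <= local_time < day_end
    if alert["environment"] == "production" and alert["severity"] == "critical":
        # Direct, immediate production impact: page at any hour.
        return f"page:{alert['team']}"
    if in_working_hours:
        return f"notify:{alert['team']}"
    # Everything else waits for the next business day.
    return f"queue:{alert['team']}"


prod_outage = {"environment": "production", "severity": "critical", "team": "payments"}
dev_disk_warning = {"environment": "dev", "severity": "warning", "team": "payments"}

night_prod = route_alert(prod_outage, time(3, 0))
night_dev = route_alert(dev_disk_warning, time(3, 0))
day_dev = route_alert(dev_disk_warning, time(10, 30))
```

None of this works unless the alert carries the metadata in the first place, which is why it belongs in the alert templates.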
Getting Started!
Monitoring and alerting have the highest return on investment you’ll likely see in your business, if you keep the costs under control.
Starting with the alerting capabilities of your cloud provider is the quickest and least expensive way to begin, and it helps you understand the data available from your systems when and if you purchase a third party tool.
Generating a script to apply templates should be quick and simple. Adding tag-based rule exceptions, or even entire template versioning, provides a level of coverage quickly and at a low cost.
Ensure that when monitoring is triggered, you are treating the alert appropriately, i.e. dev servers shouldn’t wake you at 3am. Keeping production alerts with direct and immediate impact going to engineers, while capturing other alerts for processing the next business day, will not only keep your staff happy but also prevent churn in an industry where demand is very high.
Finally, talk about outages and downtime as a business decision. Like a fire drill, know what the response will be, know who is required to do what and who their backup is, and ultimately determine what is to be released, how, and to whom.

Building Reliability - The Realities of Observability

  • 1. True Observability: Why, What & Architecture. Jeremy Proffitt, Ally Financial, Director DevOps & SRE C3
  • 2. Even the most dependable employee is not 100% reliable.
  • 3. Agenda. 01 What We Monitor & Why: We’ll discuss logs vs metrics, active checks, infrastructure beyond just memory, CPU and hard drive, and APM monitoring of applications. 02 Production Reliability: Discussion of how uptime and outages impact customer satisfaction and revenue. 03 Auditing Your Monitoring: The largest gap in the industry today is the tooling required to bring together our ability to inventory our systems and validate monitoring. 04 How Do I Get Started?: The reality of getting started, the tools and what they give you and, finally, what gives the best return on your investment.
  • 4. Logs vs Metrics. Imagine collecting information such as when your application throws an error, the time required to perform actions, or client activity such as logging in. Logs allow us to capture events, in detail, and store them for analysis. One of the most common logs is the access log for web services, i.e. a record of who has visited your website, what page they visited, their browser type and even how long it took to return information to the user. Metrics are the aggregation of events over a time period, for example, 1,045 web site visits in the last minute.
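The relationship between the two can be sketched in a few lines of Python: each log entry is one event, and a metric is a count of those events per time window. The event tuples and field order below are assumptions made for illustration, not a real access-log format.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access-log events: (timestamp, path, HTTP status, response time in ms).
LOG_EVENTS = [
    ("2024-04-08T10:00:05", "/login", 200, 120),
    ("2024-04-08T10:00:41", "/cart", 500, 950),
    ("2024-04-08T10:01:10", "/login", 200, 80),
]


def visits_per_minute(events):
    """Aggregate individual log events into a per-minute visit count (a metric)."""
    counts = Counter()
    for timestamp, _path, _status, _ms in events:
        minute = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:%M")
        counts[minute] += 1
    return dict(counts)


def server_errors(events):
    """Keep the individual events with a 5xx status; only logs allow this drill-down."""
    return [event for event in events if 500 <= event[2] <= 599]
```

The per-minute count is cheap to store and alert on, while the filtered list keeps the detail of each failing request, which the aggregated number alone cannot give back.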
  • 5. Logging: The raw truth. The individual nature of logs allows us to pick out singular events, such as a single error or delayed response. Logs are often chained together with other logs to form a story of how users and information flow through your system. We don’t just look for errors: the absence of logs from a source, whether looked at individually or as a count by server, can provide valuable insight when a server or service goes offline. Logs provide us raw data that we can aggregate in differing ways, letting us build queries that answer specific questions. By watching for specific events in logs, for example high response times or errors, we can build alerting to inform us when systems are not performing correctly. Beyond looking for the absence of logs, we can track users page to page to determine where users are leaving, or even whether a website isn’t allowing users to advance to the next step.
  • 6. Metrics: The truth in aggregate. When we look at logs, the majority of the time we’re looking at an aggregate view: how many of what in what amount of time. Multiple cloud providers now provide this information as easy-to-use metrics, for example the number of visitors, average response time or even the number of times error pages are displayed. In the cloud, the same metric is often available across multiple provider offerings; for example, HTTP 500 web errors are a metric common to Load Balancers, CloudFront and other web services in AWS. Unlike logs, aggregated metrics are specifically set to count only in a singular, boilerplate, out-of-the-box way. Alerting on metrics provides us with a quick and simple method of asking: do I have any errors? Is my web server running slow? Am I using too much memory, CPU or disk space? Metrics are less expensive to generate and keep than logs; they are a low-cost method of wrapping our systems in layers of alerting, with the downside of not being able to drill into a singular event.
  • 7. Watching Infrastructure - aka “The Servers”. (Gauge graphics: 90%, 90%, 20%, Pass / Fail, 90%.) Memory: memory usage on a server should be monitored; some systems will automatically reset your container or application when out of memory, while others simply produce unpredictable results. Hard Drives: where the data often goes, these are critical; imagine not being able to write a new customer’s order to a database because you ran out of space. Host Health: is the physical box healthy, and is the operating system working? CPU: CPU on a server should be monitored; low CPU availability can lead to longer wait times for customers and even timeouts or failure to process items in a set amount of time.
  • 8. Hey! But ... I’m SERVERLESS! YO! It still runs on a server. Serverless architecture, regardless of provider or advertised features, must still run on a server. Lambdas still require memory and processing time, metrics which must be kept in check. Because you pay for what you use, if your Lambdas take twice as long now as they did yesterday, they will cost you twice as much. Serverless databases are often constrained by queries per minute and concurrent connections, and can automatically scale without your consent. While there are many advantages to serverless, in some ways it’s like giving a teenager a cell phone: be careful, the overages can kill you!! Memory: monitor memory usage for your serverless applications. Processing Time: time is literally money; the longer your process takes, the more it costs. Connections & Queries: be very mindful of the limitations and cost of serverless data storage mechanisms. Scaling up is Paying Up: when traffic increases, serverless resources increase, which means the more you pay.
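The "twice as long costs twice as much" point can be checked with a little arithmetic. The sketch below assumes billing proportional to memory multiplied by execution time (GB-seconds) and covers only that duration-based portion of a bill; the price used in the usage note is an illustrative placeholder, not a quoted rate.

```python
def compute_cost(invocations, avg_duration_ms, memory_mb, price_per_gb_second):
    """Estimate the duration-based compute cost of a serverless function.

    Assumes billing proportional to memory x execution time (GB-seconds).
    Per-request fees and free tiers are deliberately left out.
    """
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second


def cost_ratio(duration_before_ms, duration_after_ms):
    """How much the compute bill scales when only the average duration changes."""
    return duration_after_ms / duration_before_ms
```

For example, with an assumed price of 0.0000167 per GB-second, `compute_cost(1_000_000, 400, 512, 0.0000167)` comes out at twice `compute_cost(1_000_000, 200, 512, 0.0000167)`, which is why average duration is worth tracking as a metric in its own right.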
  • 9. APM: Monitoring Your Code. APM, or Application Performance Monitoring, is fast becoming a new tool for improving reliability and development cycles. With the ability to simply drop modules into your code, server or container, APMs promise the visibility to track errors and slowdowns in your code along with inter-system operability. What? APM often hooks into the “back door” of code, providing essential information such as how long an application waits for an API call or database call to complete, the memory usage (often broken out by different types or heaps of memory) and, of course, application exceptions and errors. APM can also let you see transactions as they flow between different applications if they all have the same APM instrumentation, showing a map of the path traveled and helping pinpoint issues in larger systems. It should be noted that APMs are often metric-based systems, and are often known to use only a sampling or subset of the actual data to represent information on the screen. While still useful, this can introduce subtle differences in APMs, making on-the-fence or hard-to-find issues even more difficult to track. WARNING!!!! APMs often will not capture any error a developer has already caught in a try/catch statement! Imagine the frustration of not seeing all the errors you were promised! Most APMs provide a method of making a code change to send the exception to your APM, but it substantially increases the rollout cost of an APM.
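The warning about caught exceptions can be illustrated with a small sketch. `report_exception` below is a hypothetical stand-in for whatever "record this exception" call a given APM agent offers; it is not a real vendor API, and `load_order` is an invented example function.

```python
# Collected reports; a real agent would send these to the APM backend instead.
reported = []


def report_exception(exc, context):
    """Hypothetical stand-in for an APM agent's 'record this exception' call."""
    reported.append((type(exc).__name__, str(exc), context))


def load_order(order_id, orders):
    """Look up an order, handling a missing one gracefully."""
    try:
        return orders[order_id]
    except KeyError as exc:
        # The error is handled here, so an agent that only watches for
        # unhandled exceptions would never see it. Forward it explicitly.
        report_exception(exc, {"order_id": order_id})
        return None
```

Every existing try/except block needs a line like this added by hand, which is where the extra rollout cost mentioned on the slide comes from.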
  • 10. Active Monitoring. Synthetic Monitoring: actively accessing your site, simulating a user experience. This often includes a robotic login and clicking about to validate the functionality of a site. Watch Dogs: to ensure a system is alive, we often have it send an outbound signal, or make an HTTP web call, every x minutes. If no call comes in after a period of time, an alert is fired. Simple Web Monitoring: simply accessing a web page or API and checking for specific results or result codes (like HTTP 200 success responses) provides a simple way of asking, is my website or API alive? Certificate Validation: actively checking HTTPS certificates and providing notification 15 to 30 days before they expire will save the embarrassment of expired certificates causing revenue and user experience impacts.
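Two of these checks, the watch dog and the certificate expiry warning, come down to simple time comparisons. The sketch below assumes the last check-in times and the certificate expiry date have already been collected by some other means; the function names and the 30-day default are illustrative choices taken from the range on the slide.

```python
from datetime import datetime, timedelta, timezone


def missed_heartbeats(last_seen, now, max_silence):
    """Return the names of systems whose last check-in is older than max_silence."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_silence)


def days_until_expiry(not_after, now):
    """Whole days remaining before a certificate's expiry date."""
    return (not_after - now).days


def certificate_needs_renewal(not_after, now, warn_days=30):
    """True once the certificate is within warn_days of expiring."""
    return days_until_expiry(not_after, now) <= warn_days


# Example inputs (assumed values for illustration).
NOW = datetime(2024, 4, 8, 12, 0, tzinfo=timezone.utc)
LAST_SEEN = {
    "billing-worker": NOW - timedelta(minutes=2),
    "report-cron": NOW - timedelta(minutes=45),
}
```

With these inputs, `missed_heartbeats(LAST_SEEN, NOW, timedelta(minutes=10))` flags only `report-cron`, the job that has gone silent.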
  • 12. The IMPACT! Revenue & Customer Experience. The false belief that downtime is preventable! Repeat after me, “Downtime will happen.” Say it again, “Downtime will happen!” There is only one promise I can make as we go through this presentation today: at some point, systems have downtime. It’s absolutely unavoidable. How we plan to recover from those outages, the decisions to build out redundancy, and the impact not just on revenue but on reputation are all part of a complex equation. Beyond the discussion of application and system architecture, redundancy through the use of multiple servers, resources, data centers and networking layers is part of a larger business discussion. I’ll say that again: for the most part, reliability through redundant resources is a business decision, not a technical one. When reviewing downtime, customer and revenue impact should be thought of as a time-based equation, and downtime for customers can have crippling impacts on app ratings and word of mouth or, worse, lead to customer rants that ward off new customers. Beware of the 100% uptime promise! It’s often riddled with exceptions for “emergency” maintenance windows used to cover up production issues.
  • 13. The IMPACT! Communication - Internal & External. The false belief that downtime is preventable! Communication is about perception, and is perhaps one of the most important aspects of both an outage and career advancement. Communication is a balance: do we communicate a 5-minute database slowdown to our customers and CEO? Likely not. The general rule I’ve used is that if it’s over before the communication can be sent, or the impact is close to resolution, we’ll typically not send communication beyond IT leadership. Applying this rule takes a fair amount of hands-on learning for your organization; be flexible, and validate that your communication is appropriate and accurate. We communicate to our direct supervisor, and hopefully up the chain to our CIO/CTO, because you never want these people to be asked what’s going on without immediately having an answer for other members of the C-suite. This empowers your managers and will have a positive impact on your career. Communication to customers is a business decision, and for this I would bring in marketing and legal for a discussion of pre-canned messages. This is extremely critical for larger outages, and remember to think about how you’d feel as a customer. When you have an outage, nothing is better than the truth. Want to know how to do it right? Research Johnson & Johnson’s Tylenol recall.
  • 15. Monitoring Auditing - Beyond Monitoring as Code. Monitoring as code: the idea that monitoring alerts can be defined as code and pushed out through APIs to different systems, saving the time of manual implementation and reducing the cost of errors made during it. Cloud Systems and APIs: cloud providers can give you a list of all the architecture in your account via an API call, library or command line interface. This allows us to ask, what servers do I have? Tagging - the what is it: tagging on architecture is typically descriptive, like “MySQL Customer Database”. If we look at this deeper, we find example after example of using tags to define how other parts of the architecture react together. The power of the Audit: not only can we ask a cloud provider for a list of all cloud architecture elements, we can do the same in our monitoring system. By comparing these lists, we can validate that all systems are being monitored.
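The audit itself is a comparison of two lists. A minimal sketch, assuming both the cloud inventory and the monitoring system can each be reduced to a list of resource identifiers (how those lists are fetched depends on the provider and tool, and is not shown here):

```python
def audit_monitoring(inventory, monitored):
    """Compare what exists in the cloud account with what the monitoring system knows.

    inventory: resource identifiers returned by the cloud provider.
    monitored: resource identifiers that have alerting configured.
    """
    inventory_ids = set(inventory)
    monitored_ids = set(monitored)
    return {
        # Exists in the account but has no monitoring: the gap the audit is for.
        "unmonitored": sorted(inventory_ids - monitored_ids),
        # Still monitored but no longer exists: stale alerts to clean up.
        "stale": sorted(monitored_ids - inventory_ids),
    }
```

For example, `audit_monitoring(["db-1", "web-1", "web-2"], ["db-1", "web-1", "old-web"])` reports `web-2` as unmonitored and `old-web` as stale.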
  • 16. Monitoring as Code - Rolling out modifiable templates. Monitoring is often rolled out based on separate applications and needs, as multiple independent systems. But there is a more elegant way to roll out automated monitoring using code. Here’s how. Start with Monitoring Templates: review all the architecture you use or want to use, and generate alerts to address your business needs. From these, you can build a standard set of alerts. Step through your architecture! Step through each object in your cloud account, capturing both the object name and the tags defining custom template adjustments. Roll it out! Now that you have the objects and their custom adjustments, alerting can be created or updated using the templates. Rinse and Repeat: repeating the rollout daily ensures manual changes are rolled back and new architecture is monitored quickly, without intervention. Did you know? None of the current leaders in monitoring offer a system so simple or clean, and one has to wonder why?
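The four steps can be sketched as one function that is safe to rerun. The template names, metric names and `monitoring:` tag keys below are hypothetical conventions invented for this example, and pushing the result to a monitoring system through its API is left out.

```python
# Standard alert templates, keyed by a template name (hypothetical names and thresholds).
TEMPLATES = {
    "database": [
        {"metric": "disk_used_percent", "threshold": 90},
        {"metric": "cpu_used_percent", "threshold": 90},
    ],
    "web": [{"metric": "http_5xx_count", "threshold": 10}],
}


def build_alerts(resources, templates):
    """Produce the desired alert thresholds for every resource.

    Each resource names its template in the tag 'monitoring:template' and may
    override a threshold with a tag such as 'monitoring:disk_used_percent'.
    Both tag keys are assumed conventions, not a standard.
    """
    alerts = {}
    for resource in resources:
        tags = resource.get("tags", {})
        for rule in templates.get(tags.get("monitoring:template"), []):
            override = tags.get("monitoring:" + rule["metric"])
            threshold = float(override) if override is not None else rule["threshold"]
            alerts[(resource["name"], rule["metric"])] = threshold
    return alerts
```

Because the output is keyed by resource and metric and rebuilt from scratch each time, running it daily and applying the result overwrites manual edits and picks up newly created resources, which is the "rinse and repeat" step.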
  • 17. Routing your Monitoring Alerts. Alerts Generated: ensure your alerts include metadata that can be used to determine importance and team ownership. Alert Processing (PagerDuty): processing alerts can be complex, involving routing alerts to teams and setting hours for specific levels of alerting. Alert Fatigue - the fastest way to start searching for new Team Mates: alert fatigue, the cry wolf of IT, happens when alerts that don’t require immediate resolution are routed incorrectly, or when their importance is not in alignment with the company’s needs. I.e. when you get alerts that you shouldn’t and they continuously wake you up at 3am, those LinkedIn invites for a new job become more and more tempting. Set Paging Escalation and Expectations: when you reach out to employees after hours, you should have an escalation policy that reaches out to the backup person automatically. It’s very important that expectations are set with the entire team: does a laptop travel with on-call engineers? Do engineers come in late if they’ve been up all night? When do you wake up your entire team for a team response? Alerts vs Warnings! Consider adding warnings for most alerts, for example hard drive space. This allows working-hours support, which can prevent the 3am wake up.
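A routing rule built on alert metadata might look like the sketch below. The field names (`team`, `environment`, `severity`), the 9-to-5 working hours and the returned strings are all assumptions for illustration; a real setup would express the same logic in the alert processing tool itself.

```python
from datetime import time


def route_alert(alert, now):
    """Decide where an alert goes, based on its metadata and the time of day.

    alert: dict with 'team', 'environment' and 'severity' ('alert' or 'warning').
    now:   a datetime.time for the moment the alert fired.
    """
    team = alert["team"]
    is_production = alert["environment"] == "production"
    is_critical = alert["severity"] == "alert"

    # Only production alerts with immediate impact are allowed to page someone.
    if is_production and is_critical:
        return "page:" + team

    # Everything else waits for working hours instead of waking anyone up.
    in_business_hours = time(9, 0) <= now <= time(17, 0)
    return ("notify:" if in_business_hours else "queue:") + team
```

A disk-space warning at 3am therefore lands in a queue for the morning, while a production outage pages the owning team at any hour.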
  • 19. Getting Started! Monitoring and alerting has the highest return on investment you’ll likely see in your business, if you keep the costs under control. Starting with the alerting capabilities of your cloud provider is the quickest and least expensive way to begin, and it helps you understand the data available from your systems when and if you purchase a third-party tool. Generating a script to apply templates should be quick and simple; adding tag-based rule exceptions, or even entire template versioning, provides a level of coverage quickly and at a low cost. Ensure that when monitoring is triggered, you are treating the alert appropriately, i.e. dev servers shouldn’t wake you at 3am. Keeping production alerts with direct and immediate impact going to engineers, while capturing other alerts for processing the next business day, will not only keep your staff happy but prevent churn in an industry where demand is very high. Finally, talk about outages and downtime as a business decision. Like a fire drill, know what the response will be, know who is required to do what and who their backup is, and ultimately determine what is to be released, how, and to whom.