Kks sre book_ch10

SRE Book Ch 10
KKStream SRE Study Group
Presenter: Chris Huang
2018/08/29

Service Reliability Hierarchy
● The hierarchy starts from the most basic requirements
needed for a system to function as a service
● The higher levels permit self-actualization: taking
active control of the direction of the service
rather than reactively fighting fires

Incident Response
● On-call support is a tool we use to achieve
our larger mission and remain in touch with
how distributed computing systems actually
work (and fail!).
● If we could find a way to relieve ourselves of
carrying a pager, we would.

Postmortem and Root-Cause Analysis
● We aim to be alerted on and manually solve
only new and exciting problems presented
by our service; it’s woefully boring to "fix"
the same issue over and over.
● This mindset is one of the key differentiators
between the SRE philosophy and some
more traditional operations-focused
environments.

Chapter 10 -
Practical Alerting

Practical Alerting
● Being alerted for single-machine failures is unacceptable because such data is too noisy to be
actionable.
● Instead we try to build systems that are robust against failures in the systems they depend on.
● A large system should be designed to aggregate signals and prune outliers.
● We need monitoring systems that allow us to alert for high-level service objectives, but retain the
granularity to inspect individual components as needed.
Something to think about: which CloudWatch features are equivalent to
Borgmon?
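A minimal sketch of the idea above: alert on an aggregated, service-level error ratio instead of on any single machine, while keeping the per-task data around for inspection. The task names, counter values, and the 5% threshold are hypothetical, chosen only for illustration.

```python
from typing import Dict, Tuple

# Hypothetical per-task counters: task name -> (requests, errors).
TaskCounters = Dict[str, Tuple[int, int]]


def service_error_ratio(tasks: TaskCounters) -> float:
    """Aggregate counters across all tasks into one service-level ratio."""
    total_requests = sum(req for req, _ in tasks.values())
    total_errors = sum(err for _, err in tasks.values())
    return total_errors / total_requests if total_requests else 0.0


def should_alert(tasks: TaskCounters, threshold: float = 0.05) -> bool:
    """Alert on the aggregate, not on an individual failing task."""
    return service_error_ratio(tasks) > threshold


tasks = {
    "web-0": (1000, 2),
    "web-1": (1000, 3),
    "web-2": (10, 10),  # one broken task that serves little traffic
}
print(service_error_ratio(tasks))
print(should_alert(tasks))
```

With these sample numbers the single broken task does not page anyone, but its counters are still in `tasks` if someone needs to look at that component.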

Getting Metrics - varz
● Every Google service has a built-in HTTP server to export internal metrics. Borgmon can
collect all of a target’s metrics with a single HTTP fetch.
● A Borgmon can collect metrics from another Borgmon, so we can build hierarchies that follow the
topology of the service, aggregating and summarizing information and discarding some strategically
at each level.
chris@prod-server [~] $ curl http://webserver:80/varz
http_responses map:code 200:25 404:0 500:12
chris@prod-server [~] $ curl http://webserver:80/varz
http_requests 37
errors_total 12
● The /varz HTTP handler simply lists all the exported variables in plain text. A later extension added
a mapped variable, which allows the exporter to define several labels on a variable name, and then
export a table of values or a histogram.

JMX-like Approach, But Simplified
● JMX (Java Management Extensions)

AWS CloudWatch
● AWS services (ELB, RDS) export
default metrics to CloudWatch.
There is a CloudWatch agent that
sends instance metrics (CPU, disk,
memory) to CloudWatch.
● For user applications, the app
itself has to publish its custom
metrics.

CloudWatch Concepts
The following terminology and concepts
are central to your understanding and use
of Amazon CloudWatch:
● Namespaces
● Metrics
● Dimensions
● Statistics
● Percentiles
● Alarms
Metrics
● Metrics are the fundamental concept in CloudWatch. A
metric represents a time-ordered set of data points that are
published to CloudWatch.
● AWS services send metrics to CloudWatch, and you can
send your own custom metrics to CloudWatch.
● Metrics are uniquely defined by a name, a namespace, and
zero or more dimensions. Each data point has a time
stamp, and (optionally) a unit of measure. When you
request statistics, the returned data stream is identified by
namespace, metric name, dimension, and (optionally) the
unit.

Dimensions
● A dimension is a name/value pair that uniquely identifies a metric. You can assign up to 10 dimensions to a
metric.
● Every metric has specific characteristics that describe it, and you can think of dimensions as categories for
those characteristics. Dimensions help you design a structure for your statistics plan.
● AWS services that send data to CloudWatch attach dimensions to each metric. You can use dimensions to
filter the results that CloudWatch returns. For example, you can get statistics for a specific EC2 instance by
specifying the InstanceId dimension when you search for metrics.
● For metrics produced by certain AWS services, such as Amazon EC2, CloudWatch can aggregate data
across dimensions.
● CloudWatch does not aggregate across dimensions for your custom metrics.
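The last point can be illustrated with a small in-memory model (not the CloudWatch API): each distinct combination of dimensions is its own metric, so a roll-up such as "all iOS logins" has to be computed by the caller or published as an additional metric. The helper names and the `Premium` value are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

# A metric's identity: namespace, name, and its exact set of dimensions.
MetricKey = Tuple[str, str, FrozenSet[Tuple[str, str]]]


def metric_key(namespace: str, name: str, **dimensions: str) -> MetricKey:
    return (namespace, name, frozenset(dimensions.items()))


store: Dict[MetricKey, List[float]] = defaultdict(list)
store[metric_key("VP/API", "LoginCount", Platform="iOS", Subscribe="Freemium")].append(1)
store[metric_key("VP/API", "LoginCount", Platform="iOS", Subscribe="Premium")].append(1)
store[metric_key("VP/API", "LoginCount", Platform="Android", Subscribe="Freemium")].append(1)


def total_where(name: str, **wanted: str) -> float:
    """Sum every stored series whose dimensions include all wanted pairs."""
    return sum(
        sum(values)
        for (_, metric, dims), values in store.items()
        if metric == name and set(wanted.items()) <= dims
    )


print(len(store))                                 # three separate metrics
print(total_where("LoginCount", Platform="iOS"))  # roll-up done by the caller
print(total_where("LoginCount"))
```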

Metrics Statistics
● Statistics are metric data aggregations over specified periods of time. Aggregations are made using the
namespace, metric name, dimensions, and the data point unit of measure, within the time period you
specify.
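To make "aggregations over specified periods" concrete, the sketch below groups raw data points into period-aligned buckets and computes the basic CloudWatch statistics (SampleCount, Sum, Average, Minimum, Maximum) for each bucket. It is a local model of the concept, not how CloudWatch is implemented, and the sample latency values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def statistics(points: List[Tuple[int, float]], period: int) -> Dict[int, Dict[str, float]]:
    """Group (epoch_seconds, value) points into period-aligned buckets."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        buckets[timestamp - timestamp % period].append(value)
    return {
        start: {
            "SampleCount": len(vals),
            "Sum": sum(vals),
            "Average": sum(vals) / len(vals),
            "Minimum": min(vals),
            "Maximum": max(vals),
        }
        for start, vals in sorted(buckets.items())
    }


# Hypothetical LoginLatency samples in milliseconds.
points = [(0, 200.0), (30, 100.0), (60, 300.0)]
stats = statistics(points, period=60)
print(stats[0]["Average"])
print(stats[60]["Sum"])
```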

Publish Custom Metrics
You can publish your own metrics to CloudWatch using the AWS CLI or an API. You can view statistical graphs of
your published metrics with the AWS Management Console.
chris@prod-server [~] $ aws cloudwatch put-metric-data --namespace VP/API \
    --metric-name LoginCount --unit Count --value 1 \
    --dimensions Platform=iOS,Subscribe=Freemium
chris@prod-server [~] $ aws cloudwatch put-metric-data --namespace VP/API \
    --metric-name LoginLatency --unit Milliseconds --value 200.0 \
    --dimensions Platform=iOS,Subscribe=Freemium
We can then aggregate and visualize these metrics on a CloudWatch dashboard, for example:
● Total login count in the last 6 hours
● iOS login count in the last 6 hours
● Average login latency for Android Freemium users in the last 6 hours
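The same data point can also be published from application code. The sketch below assumes the boto3 SDK and configured AWS credentials for the actual send; the helper names `build_metric_datum` and `publish` are hypothetical. The payload builder runs without boto3, so the request shape can be checked locally.

```python
from typing import Any, Dict


def build_metric_datum(name: str, value: float, unit: str, **dimensions: str) -> Dict[str, Any]:
    """Build one MetricData entry in the shape put_metric_data expects."""
    return {
        "MetricName": name,
        "Value": value,
        "Unit": unit,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
    }


def publish(namespace: str, datum: Dict[str, Any]) -> None:
    """Send one datum to CloudWatch; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=[datum])


datum = build_metric_datum("LoginCount", 1, "Count", Platform="iOS", Subscribe="Freemium")
print(datum)
# publish("VP/API", datum)  # uncomment to actually send the data point
```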

Black-Box vs. White-Box Monitoring
● Borgmon (or CloudWatch) is a white-box
monitoring system—it inspects the internal state
of the target service, and the rules are written with
knowledge of the internals in mind. The transparent
nature of this model makes it possible to quickly
identify which components are failing.
● But you only see the queries that arrive at the
target; the queries that never make it due to a DNS
error are invisible, while queries lost due to a server
crash never make a sound.
● Black-box monitoring, such as Pingdom, is a good way
to see the service from the user’s perspective.
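A black-box check can be as simple as fetching a public URL from outside the service and recording whether it answered and how long it took. The sketch below uses only the Python standard library; it is not Pingdom's API, and the `probe` name and the injectable `fetch` parameter are assumptions made so the logic can be exercised without network access.

```python
import time
import urllib.request
from typing import Callable, Tuple


def probe(url: str, timeout: float = 5.0,
          fetch: Callable = urllib.request.urlopen) -> Tuple[bool, float]:
    """Fetch url from the outside; return (success, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with fetch(url, timeout=timeout) as response:
            ok = 200 <= response.status < 400
    except OSError:
        # URLError and HTTPError subclass OSError, so DNS failures, refused
        # connections, timeouts, and HTTP error statuses all land here.
        ok = False
    return ok, time.monotonic() - start


# Example call (placeholder URL, performs a real request when uncommented):
# ok, seconds = probe("https://example.com/")
```

Unlike the white-box metrics, this check also fails when DNS is broken or the server is down, which are exactly the cases the target cannot report about itself.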

We’re Hiring
https://jobs.lever.co/kkstream

Thank you!
KKStream (Japan)
KKBOX Japan LLC, 6F Urbanprem
Shibuya,
1-4-2 Shibuya, Shibuya-ku,
Tokyo, 150-0002, Japan
Tel: +81 3 6758-7400
Fax: +81 3 6758-7401
Email: biz_info_jp@kkstream.com.tw
KKStream (Taiwan)
8F, 19-11, Sanchong Rd,
Nangang Dist, Taipei City 115,
Taiwan
Tel: +886 2 2655-0369
Fax: +886 2 2655-0929
Email: biz_info_tw@kkstream.com.tw
Copyright © 2018 KKStream Limited. All rights reserved.
www.kkstream.com.tw
