Kks sre book_ch10


Published in: Technology

  1. SRE Book Ch 10
     KKStream SRE Study Group
     Presenter: Chris Huang
     2018/08/29
  2. Service Reliability Hierarchy
     ● The hierarchy starts from the most basic requirements needed for a system to function as a service.
     ● Its higher levels permit self-actualization: taking active control of the direction of the service rather than reactively fighting fires.
  3. Service Reliability Hierarchy: Incident Response
     ● On-call support is a tool we use to achieve our larger mission and remain in touch with how distributed computing systems actually work (and fail!).
     ● If we could find a way to relieve ourselves of carrying a pager, we would.
  4. Service Reliability Hierarchy: Postmortem and Root-Cause Analysis
     ● We aim to be alerted on and manually solve only new and exciting problems presented by our service; it’s woefully boring to "fix" the same issue over and over.
     ● This mindset is one of the key differentiators between the SRE philosophy and some more traditional operations-focused environments.
  5. Chapter 10 - Practical Alerting
  6. Practical Alerting
     ● Being alerted for single-machine failures is unacceptable because such data is too noisy to be actionable.
     ● Instead we try to build systems that are robust against failures in the systems they depend on.
     ● A large system should be designed to aggregate signals and prune outliers.
     ● We need monitoring systems that allow us to alert for high-level service objectives, but retain the granularity to inspect individual components as needed.
     To think about: which CloudWatch features are equivalent to Borgmon?
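
As a rough illustration of alerting on an aggregate instead of on single machines, the sketch below sums per-task counters into one service-level error ratio and fires only when that ratio crosses a threshold. The function names, the sample task names, and the 10% threshold are assumptions made for illustration; none of them come from Borgmon or CloudWatch.

```python
from typing import Dict, Tuple

# Hypothetical per-task counters: task name -> (requests, errors).
TaskCounters = Dict[str, Tuple[int, int]]


def service_error_ratio(tasks: TaskCounters) -> float:
    """Sum counters across all tasks and return errors / requests."""
    total_requests = sum(req for req, _ in tasks.values())
    total_errors = sum(err for _, err in tasks.values())
    if total_requests == 0:
        return 0.0  # no traffic, so there is nothing to alert on
    return total_errors / total_requests


def should_alert(tasks: TaskCounters, threshold: float = 0.10) -> bool:
    """Alert on the service-level ratio, never on a single task."""
    return service_error_ratio(tasks) > threshold


if __name__ == "__main__":
    # One low-traffic task fails every request, yet the service is healthy.
    tasks = {"web-0": (1000, 2), "web-1": (1000, 1), "web-2": (10, 10)}
    print(service_error_ratio(tasks), should_alert(tasks))
```

The per-task counters are still there for inspection when someone needs to drill down; only the paging decision is made on the aggregate.
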
  7. Getting Metrics - varz
     ● Every Google service has a built-in HTTP server that exports internal metrics. Borgmon can fetch all of a target’s metrics with a single HTTP request.
     ● A Borgmon can collect metrics from another Borgmon, so we can build hierarchies that follow the topology of the service, aggregating and summarizing information and strategically discarding some of it at each level.
     ● The /varz HTTP handler simply lists all the exported variables in plain text. A later extension added a mapped variable, which allows the exporter to define several labels on a variable name, and then export a table of values or a histogram.
       chris@prod-server [~] $ curl http://webserver:80/varz
       http_responses map:code 200:25 404:0 500:12
       chris@prod-server [~] $ curl http://webserver:80/varz
       http_requests 37
       errors_total 12
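
To make the plain-text format concrete, here is a minimal sketch of a parser for /varz-style output. The line shapes it accepts are inferred only from the two curl outputs on this slide, with one variable per line assumed; the function name `parse_varz` is hypothetical and this is not Borgmon's actual parser.

```python
from typing import Dict, Union

VarValue = Union[int, Dict[str, int]]


def parse_varz(text: str) -> Dict[str, VarValue]:
    """Parse /varz-style plain text into a dict.

    Assumed line shapes (inferred from the slide, not an official spec):
      http_requests 37
      http_responses map:code 200:25 404:0 500:12
    """
    variables: Dict[str, VarValue] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, rest = fields[0], fields[1:]
        if rest[0].startswith("map:"):
            # Mapped variable: the remaining fields are label_value:count pairs.
            table: Dict[str, int] = {}
            for pair in rest[1:]:
                key, _, count = pair.rpartition(":")
                table[key] = int(count)
            variables[name] = table
        else:
            # Plain variable: a single integer value.
            variables[name] = int(rest[0])
    return variables


if __name__ == "__main__":
    sample = "http_requests 37\nhttp_responses map:code 200:25 404:0 500:12\n"
    print(parse_varz(sample))
```
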
  8. JMX-like Approach, but Simplified
     ● JMX (Java Management Extensions)
  9. AWS CloudWatch
     ● AWS services (ELB, RDS) export default metrics to CloudWatch. The CloudWatch agent sends instance metrics (CPU, disk, memory) to CloudWatch.
     ● For user applications, the application itself has to send custom metrics.
  10. Alerting
  11. CloudWatch Concepts
      The following terminology and concepts are central to your understanding and use of Amazon CloudWatch:
      ● Namespaces
      ● Metrics
      ● Dimensions
      ● Statistics
      ● Percentiles
      ● Alarms
      Metrics
      ● Metrics are the fundamental concept in CloudWatch. A metric represents a time-ordered set of data points that are published to CloudWatch.
      ● AWS services send metrics to CloudWatch, and you can send your own custom metrics to CloudWatch.
      ● Metrics are uniquely defined by a name, a namespace, and zero or more dimensions. Each data point has a time stamp, and (optionally) a unit of measure. When you request statistics, the returned data stream is identified by namespace, metric name, dimension, and (optionally) the unit.
  12. Dimensions
      ● A dimension is a name/value pair that is part of the identity of a metric. You can assign up to 10 dimensions to a metric.
      ● Every metric has specific characteristics that describe it, and you can think of dimensions as categories for those characteristics. Dimensions help you design a structure for your statistics plan.
      ● AWS services that send data to CloudWatch attach dimensions to each metric. You can use dimensions to filter the results that CloudWatch returns. For example, you can get statistics for a specific EC2 instance by specifying the InstanceId dimension when you search for metrics.
      ● For metrics produced by certain AWS services, such as Amazon EC2, CloudWatch can aggregate data across dimensions.
      ● CloudWatch does not aggregate across dimensions for your custom metrics.
  13. Metrics Statistics
      ● Statistics are metric data aggregations over specified periods of time. Aggregations are made using the namespace, metric name, dimensions, and the data point unit of measure, within the time period you specify.
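
The idea of aggregating data points over fixed periods can be sketched in a few lines. The code below only illustrates the concept on a single metric held in memory; it is not how CloudWatch is implemented, and the function name `statistics_by_period` is hypothetical. The statistic names mirror the ones CloudWatch reports (SampleCount, Sum, Average, Minimum, Maximum).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def statistics_by_period(
    points: List[Tuple[int, float]], period: int
) -> Dict[int, Dict[str, float]]:
    """Group (unix_timestamp, value) points into fixed periods of `period`
    seconds and compute per-period statistics, keyed by period start."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        start = timestamp - (timestamp % period)  # align to the period start
        buckets[start].append(value)
    return {
        start: {
            "SampleCount": float(len(values)),
            "Sum": sum(values),
            "Average": sum(values) / len(values),
            "Minimum": min(values),
            "Maximum": max(values),
        }
        for start, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Three latency samples in milliseconds, aggregated per 60-second period.
    print(statistics_by_period([(0, 100.0), (30, 300.0), (60, 200.0)], period=60))
```
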
  14. Publish Custom Metrics
      You can publish your own metrics to CloudWatch using the AWS CLI or an API. You can view statistical graphs of your published metrics with the AWS Management Console.
        chris@prod-server [~] $ aws cloudwatch put-metric-data --namespace VP/API --metric-name LoginCount --unit Count --value 1 --dimensions Platform=iOS,Subscribe=Freemium
        chris@prod-server [~] $ aws cloudwatch put-metric-data --namespace VP/API --metric-name LoginLatency --unit Milliseconds --value 200.0 --dimensions Platform=iOS,Subscribe=Freemium
      We can then aggregate and visualize these metrics on a CloudWatch dashboard, for example:
      ● Total login count in the last 6 hours
      ● iOS login count in the last 6 hours
      ● Average login latency for Android Freemium users in the last 6 hours
      Because CloudWatch does not aggregate custom metrics across dimensions, views that span several dimension values need metric math on the dashboard or an extra data point published without those dimensions.
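
The same data point can be published through the API instead of the CLI. The sketch below builds the keyword arguments that boto3's CloudWatch `put_metric_data` call accepts; the helper name `build_put_metric_data` is hypothetical, and the actual API call is left as a comment because it needs boto3 and AWS credentials, which are assumed and not exercised here.

```python
from typing import Any, Dict


def build_put_metric_data(
    namespace: str,
    metric_name: str,
    value: float,
    unit: str,
    dimensions: Dict[str, str],
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's CloudWatch put_metric_data."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": metric_name,
                # Each dimension is a Name/Value pair, as on the CLI.
                "Dimensions": [
                    {"Name": name, "Value": val}
                    for name, val in dimensions.items()
                ],
                "Unit": unit,
                "Value": value,
            }
        ],
    }


if __name__ == "__main__":
    kwargs = build_put_metric_data(
        "VP/API", "LoginLatency", 200.0, "Milliseconds",
        {"Platform": "iOS", "Subscribe": "Freemium"},
    )
    print(kwargs)
    # With boto3 installed and credentials configured (assumed, not run here):
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_data(**kwargs)
```

Keeping the payload construction separate from the network call makes it easy to unit test the metric names and dimensions an application emits.
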
  15. Black-Box vs. White-Box Monitoring
      ● Borgmon (or CloudWatch) is a white-box monitoring system: it inspects the internal state of the target service, and the rules are written with knowledge of the internals in mind. The transparent nature of this model provides great power to quickly identify which components are failing.
      ● But you only see the queries that arrive at the target; the queries that never make it due to a DNS error are invisible, while queries lost due to a server crash never make a sound.
      ● Black-box monitoring, such as Pingdom, is a good way to see the service from the user’s perspective.
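
A black-box check can be as small as an HTTP request made from outside the service. The sketch below is a stand-in for what a hosted prober does, not Pingdom's implementation; the function name `probe`, the 5-second timeout, and the injectable `opener` parameter (there so the check can be exercised without a network) are assumptions for illustration.

```python
import urllib.request
from typing import Callable


def probe(
    url: str,
    timeout: float = 5.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Return True if the URL answers with a 2xx status, as a user would see it."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # urllib's URLError and HTTPError are OSError subclasses, as are
        # socket timeouts. DNS failures, refused connections and crashed
        # servers all land here, which are exactly the failures that
        # server-side white-box metrics never record.
        return False


if __name__ == "__main__":
    # Hypothetical target taken from the varz slide; replace with a real URL.
    print(probe("http://webserver:80/varz"))
```
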
  16. We’re Hiring
  17. Thank you!
      KKStream (Japan): KKBOX Japan LLC, 6F Urbanprem Shibuya, 1-4-2 Shibuya, Shibuya-ku, Tokyo, 150-0002, Japan. Tel: +81 3 6758-7400, Fax: +81 3 6758-7401
      KKStream (Taiwan): 8F, 19-11, Sanchong Rd, Nangang Dist, Taipei City 115, Taiwan. Tel: +886 2 2655-0369, Fax: +886 2 2655-0929
      Copyright © 2018 KKStream Limited. All rights reserved.