Amazon CloudWatch - Observability and Monitoring

Published in: Engineering

  2. 2. AWS CloudWatch: Observability and Monitoring ● Rick Hwang ● rick_kyhwang@hotmail.com ● 2017/12/28
  3. 3. http://www.cwb.gov.tw/V7/observe/satellite/Sat_T.htm?type=1
  4. 4. https://env.healthinfo.tw/air/
  6. 6. CloudWatch Overview, Event-Driven, Automation, AI / ML
  7. 7. Agenda ● CloudWatch Metric ● CloudWatch Dashboard ● CloudWatch Alarm ● CloudWatch Event / Rules ● CloudWatch Logs
  8. 8. Related AWS Services ● SNS: Simple Notification Service ● SES: Simple Email Service ● SQS: Simple Queue Service ● Lambda: Serverless ● Auto Scaling ● CloudTrail
  9. 9. Questions ● How do we know the state of the system? ● Where do the system's metrics come from? ● Which layers of the system need to be known? Who needs to know, and how? ● What do we do once we know, and how? Proactively or reactively? ● What are "monitoring" (監) and "control" (控)?
  10. 10. How Amazon CloudWatch Works: CloudWatch Basic Concepts
  11. 11. [Diagram] EC2 Instances (/var/log/app/*.log) → Log Shipper → Logs → Log Groups → Log Streams A, B, C ... N ● Filters: [ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...] → Metrics → Dashboard, Alarms ● Export: S3 ● Streaming: Amazon ES, Lambda ● Push: SNS Topics, Lambda ● Sample log lines: 2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0; 2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0; 2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0; 2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0; 2017-06-11T08:47:01 app1 NGX 53 0 53 0 0 0; 2017-06-11T08:47:01 app2 NGX 53 0 53 0 0 0; 2017-06-11T08:48:01 app1 NGX 59 0 59 0 0 0; 2017-06-11T08:48:01 app2 NGX 52 0 51 0 0 0; 2017-06-11T08:49:01 app1 NGX 48 0 48 0 0 0
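A minimal Python sketch of what the space-delimited filter on this slide does conceptually: name each column of a log line, keep only lines whose scope is NGX, and turn one column into a number that can be aggregated. The helper names are illustrative assumptions, and the columns the slide elides with "..." are left unnamed. This is not CloudWatch's own implementation.

```python
from collections import defaultdict

# Field names taken from the slide's filter pattern; the slide elides the
# remaining columns with "...", so they are kept unnamed in "rest".
FIELDS = ["ts", "hostname", "scope", "tcp_all", "tcp_time_wait", "tcp_established"]

SAMPLE_LOG = """\
2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0
2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0
2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0
2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0
"""

def parse_line(line, scope="NGX"):
    """Split one space-delimited line; return None unless its scope matches."""
    tokens = line.split()
    if len(tokens) < len(FIELDS):
        return None
    record = dict(zip(FIELDS, tokens))
    if record["scope"] != scope:
        return None
    for name in ("tcp_all", "tcp_time_wait", "tcp_established"):
        record[name] = int(record[name])
    record["rest"] = tokens[len(FIELDS):]
    return record

def established_per_minute(log_text):
    """Sum tcp_established across hosts for each timestamp."""
    totals = defaultdict(int)
    for line in log_text.splitlines():
        record = parse_line(line)
        if record is not None:
            totals[record["ts"]] += record["tcp_established"]
    return dict(totals)

print(established_per_minute(SAMPLE_LOG))
```

Because the log is already structured, extracting a metric is a matter of naming columns rather than writing a custom parser.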
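  12. 12. Source: AWS Summit 2016: Big Data Architectural Patterns and Best Practices
  13. 13. Key Points ● Produce structured, meaningful logs ○ Structured: csv, json ○ Meaningful: data that can be aggregated → sum, max, min, average, count … ○ Can be queried with SQL ● Think about what you will need to know once the system is live, and where that information will come from ● Avoid ETL (Extract, Transform, Load) wherever possible ○ Expensive and wasteful ○ Maintenance cost ○ Communication cost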
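To illustrate the "structured and meaningful" point, here is a small Python sketch that writes events as one JSON object per line and then computes the statistics the slide lists straight from those lines. The field names (`ts`, `host`, `latency_ms`) and values are hypothetical.

```python
import json
import statistics

def log_event(**fields):
    """Serialize one event as a single JSON line (sorted keys for stable output)."""
    return json.dumps(fields, sort_keys=True)

# Hypothetical application events; in practice these lines would go to a log file.
lines = [
    log_event(ts="2017-06-11T08:45:01", host="app1", latency_ms=120),
    log_event(ts="2017-06-11T08:45:02", host="app2", latency_ms=80),
    log_event(ts="2017-06-11T08:45:03", host="app1", latency_ms=100),
]

# No ETL step is needed: every line parses directly into a record.
values = [json.loads(line)["latency_ms"] for line in lines]
summary = {
    "sum": sum(values),
    "max": max(values),
    "min": min(values),
    "average": statistics.mean(values),
    "count": len(values),
}
print(summary)
```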
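  15. 15. CloudWatch Metrics: every metric has its own story behind it
  16. 16. Source: http://booklook.morningstar.com.tw/pdf/0139022.pdf ● The indicators in a health check report all come from countless clinical experience (testing) and scientific experiments (measurement, observation).
  17. 17. Metric - CPU Utilization (UTC)
  18. 18. CloudWatch Metric ● Period: the time interval of each sample ○ EC2 defaults to 5m (free) and can be changed to 1m (billed separately) ○ ELB defaults to 1m ○ Custom metrics support high resolution: 1s ● Statistics: the aggregation method; each metric has its own default ○ Sum ○ Average ○ Max ○ Min ○ Sample Count ● Unit ○ Percent ○ Count ○ Bytes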
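The effect of Period and Statistics can be sketched in a few lines of Python. The sample values below are hypothetical, and the `aggregate` helper is only an illustration of the idea, not how CloudWatch stores data internally.

```python
from collections import defaultdict

def aggregate(samples, period_seconds):
    """Group (epoch_seconds, value) samples into fixed periods and summarize each."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % period_seconds].append(value)
    return {
        start: {
            "Sum": sum(vals),
            "Average": sum(vals) / len(vals),
            "Maximum": max(vals),
            "Minimum": min(vals),
            "SampleCount": len(vals),
        }
        for start, vals in sorted(buckets.items())
    }

# Hypothetical CPU samples in percent, one per minute.
samples = [(0, 10.0), (60, 20.0), (120, 90.0), (180, 20.0), (240, 10.0), (300, 30.0)]

five_min = aggregate(samples, 300)  # basic monitoring period
one_min = aggregate(samples, 60)    # detailed monitoring period
print(five_min[0])
print(one_min[120])
```

With a 5-minute period the Average for the first bucket is 30, while the 1-minute view (or the Maximum statistic) still shows the 90% spike. Choosing the period and the statistic decides what you are able to see.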
  19. 19. Statistics - Long Tail ● Wikipedia: Long tail ● Amazon CloudWatch Update – Percentile Statistics and New Dashboard Widgets
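Why percentiles matter for long-tailed data can be shown with a short Python sketch. It uses the simple nearest-rank definition of a percentile, which is an assumption for illustration; CloudWatch's own percentile computation may differ in detail. The latency values are made up.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies: 95 fast requests and a long tail of 5 slow ones.
latencies_ms = [20] * 95 + [2000] * 5

average = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(average, p50, p99)
```

The average (119 ms) describes no real request: most finish in 20 ms and the tail takes 2000 ms. The p50 and p99 values expose both groups.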
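  20. 20. Metric Types ● Metrics Provided by AWS ● Custom Metric ○ Upload sample data (json) through the AWS CLI / SDK → hard to get right and error-prone ○ Ship logs to CloudWatch Logs with awslogs or the CloudWatch Agent (new), then define a Filter that generates the Metric ■ The pipeline is longer, but it is not difficult ■ This is the recommended approach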
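For the first option (uploading samples through the SDK), the sketch below builds the request body in plain Python so it can be inspected without AWS credentials. The keys follow the shape of the CloudWatch `PutMetricData` request as exposed by boto3; the namespace, metric name, instance ID, and value are hypothetical.

```python
import json
from datetime import datetime, timezone

def build_metric_datum(name, value, unit, instance_id, timestamp):
    """Build one entry of the MetricData list for a PutMetricData request."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Timestamp": timestamp,
        "Value": value,
        "Unit": unit,
    }

payload = {
    "Namespace": "Custom/App",  # hypothetical namespace
    "MetricData": [
        build_metric_datum(
            "MemoryUtilization",      # hypothetical metric name
            63.5,
            "Percent",
            "i-0123456789abcdef0",    # hypothetical instance ID
            datetime(2017, 12, 28, tzinfo=timezone.utc),
        )
    ],
}

# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("cloudwatch").put_metric_data(**payload)
print(json.dumps(payload, default=str, indent=2))
```

Every sample needs a correct name, dimension set, unit, and timestamp, which is part of why the slide calls this route error-prone compared with deriving metrics from logs.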
  21. 21. EC2 Metrics: each metric reflects a different phenomenon ● Amazon EC2 Metrics and Dimensions
  22. 22. Metric Description ● CPUUtilization: The percentage of allocated EC2 compute units that are currently in use on the instance. This metric identifies the processing power required to run an application upon a selected instance. To use the percentiles statistic, you must enable detailed monitoring. Depending on the instance type, tools in your operating system can show a lower percentage than CloudWatch when the instance is not allocated a full processor core. Units: Percent ● DiskReadOps: Completed read operations from all instance store volumes available to the instance in a specified period of time. To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number of seconds in that period. Units: Count ● DiskWriteOps: Completed write operations to all instance store volumes available to the instance in a specified period of time. To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number of seconds in that period. Units: Count
  23. 23. Metric Description ● DiskReadBytes: Bytes read from all instance store volumes available to the instance. This metric is used to determine the volume of the data the application reads from the hard disk of the instance. This can be used to determine the speed of the application. The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes ● DiskWriteBytes: Bytes written to all instance store volumes available to the instance. This metric is used to determine the volume of the data the application writes onto the hard disk of the instance. This can be used to determine the speed of the application. The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes
  24. 24. Metric Description ● NetworkIn: The number of bytes received on all network interfaces by the instance. This metric identifies the volume of incoming network traffic to a single instance. The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes ● NetworkOut: The number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from a single instance. The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes ● NetworkPacketsIn: The number of packets received on all network interfaces by the instance. This metric identifies the volume of incoming traffic in terms of the number of packets on a single instance. This metric is available for basic monitoring only. Units: Count. Statistics: Minimum, Maximum, Average ● NetworkPacketsOut: The number of packets sent out on all network interfaces by the instance. This metric identifies the volume of outgoing traffic in terms of the number of packets on a single instance. This metric is available for basic monitoring only. Units: Count. Statistics: Minimum, Maximum, Average
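The per-second conversions described in these metric tables (divide by 300 for basic monitoring, by 60 for detailed monitoring, or divide operations by the seconds in the period for IOPS) come down to one small function. The datapoint values below are hypothetical.

```python
BASIC_PERIOD_SECONDS = 300     # five-minute (basic) monitoring
DETAILED_PERIOD_SECONDS = 60   # one-minute (detailed) monitoring

def per_second(total, period_seconds):
    """Convert a per-period total (bytes or operations) into a per-second rate."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return total / period_seconds

# Hypothetical datapoints.
network_in_basic = per_second(1_500_000, BASIC_PERIOD_SECONDS)      # bytes/second
network_in_detailed = per_second(300_000, DETAILED_PERIOD_SECONDS)  # bytes/second
disk_read_iops = per_second(18_000, BASIC_PERIOD_SECONDS)           # operations/second

print(network_in_basic, network_in_detailed, disk_read_iops)
```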
  25. 25. EC2 Metrics ● Default Period = 5min (free) ○ Detailed Monitoring: period = 1min ($$) ● Memory and disk usage are not provided; they must be collected another way ○ CloudWatch Agent (released 2017-12) ○ telegraf, collectd, cacti, nagios …
  26. 26. ELB Metrics: load balancing ● Elastic Load Balancing Metrics and Dimensions
  27. 27. Metric Description ● Latency: [HTTP listener] The total time elapsed, in seconds, from the time the load balancer sent the request to a registered instance until the instance started to send the response headers. [TCP listener] The total time elapsed, in seconds, for the load balancer to successfully establish a connection to a registered instance. ○ Reporting criteria: There is a nonzero value ○ Statistics: The most useful statistic is Average. Use Maximum to determine whether some requests are taking substantially longer than the average. Note that Minimum is typically not useful. ○ Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1 instance in us-west-2a have a higher latency. The average for us-west-2a has a higher value than the average for us-west-2b. ● RequestCount: The number of requests completed or connections made during the specified interval (1 or 5 minutes). [HTTP listener] The number of requests received and routed, including HTTP error responses from the registered instances. [TCP listener] The number of connections made to the registered instances. ○ Reporting criteria: There is a nonzero value ○ Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average all return 1. ○ Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that 100 requests are sent to the load balancer. There are 60 requests sent to us-west-2a, with each instance receiving 30 requests, and 40 requests sent to us-west-2b, with each instance receiving 20 requests. With the AvailabilityZone dimension, there is a sum of 60 requests in us-west-2a and 40 requests in us-west-2b. With the LoadBalancerName dimension, there is a sum of 100 requests.
  28. 28. Metric Description ● HealthyHostCount: The number of healthy instances registered with your load balancer. A newly registered instance is considered healthy after it passes the first health check. If cross-zone load balancing is enabled, the number of healthy instances for the LoadBalancerName dimension is calculated across all Availability Zones. Otherwise, it is calculated per Availability Zone. ○ Reporting criteria: There are registered instances ○ Statistics: The most useful statistics are Average and Maximum. These statistics are determined by the load balancer nodes. Note that some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is healthy. ○ Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, us-west-2a has 1 unhealthy instance, and us-west-2b has no unhealthy instances. With the AvailabilityZone dimension, there is an average of 1 healthy and 1 unhealthy instance in us-west-2a, and an average of 2 healthy and 0 unhealthy instances in us-west-2b. ● UnHealthyHostCount: The number of unhealthy instances registered with your load balancer. An instance is considered unhealthy after it exceeds the unhealthy threshold configured for health checks. An unhealthy instance is considered healthy again after it meets the healthy threshold configured for health checks. ○ Reporting criteria: There are registered instances ○ Statistics: The most useful statistics are Average and Minimum. These statistics are determined by the load balancer nodes. Note that some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is healthy. ○ Example: See HealthyHostCount.
  29. 29. Metric Description ● HTTPCode_Backend_2XX, HTTPCode_Backend_3XX, HTTPCode_Backend_4XX, HTTPCode_Backend_5XX: [HTTP listener] The number of HTTP response codes generated by registered instances. This count does not include any response codes generated by the load balancer. ○ Reporting criteria: There is a nonzero value ○ Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1. ○ Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1 instance in us-west-2a result in HTTP 500 responses. The sum for us-west-2a includes these error responses, while the sum for us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a. ● HTTPCode_ELB_4XX: [HTTP listener] The number of HTTP 4XX client error codes generated by the load balancer. Client errors are generated when a request is malformed or incomplete. ○ Reporting criteria: There is a nonzero value ○ Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1. ○ Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that client requests include a malformed request URL. As a result, client errors would likely increase in all Availability Zones. The sum for the load balancer is the sum of the values for the Availability Zones. ● HTTPCode_ELB_5XX: [HTTP listener] The number of HTTP 5XX server error codes generated by the load balancer. This count does not include any response codes generated by the registered instances. The metric is reported if there are no healthy instances registered to the load balancer, or if the request rate exceeds the capacity of the instances (spillover) or the load balancer. ○ Reporting criteria: There is a nonzero value ○ Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1. ○ Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a fills and clients receive a 503 error. If us-west-2b continues to respond normally, the sum for the load balancer equals the sum for us-west-2a.
  30. 30. Metric Description ● BackendConnectionErrors: The number of connections that were not successfully established between the load balancer and the registered instances. Because the load balancer retries the connection when there are errors, this count can exceed the request rate. Note that this count also includes any connection errors related to health checks. ○ Reporting criteria: There is a nonzero value ○ Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are not typically useful. However, the difference between the minimum and maximum (or peak to average or average to trough) might be useful to determine whether a load balancer node is an outlier. ○ Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that attempts to connect to 1 instance in us-west-2a result in back-end connection errors. The sum for us-west-2a includes these connection errors, while the sum for us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a.
  31. 31. Metric Description ● SpilloverCount: The total number of requests that were rejected because the surge queue is full. [HTTP listener] The load balancer returns an HTTP 503 error code. [TCP listener] The load balancer closes the connection. ○ Reporting criteria: There is a nonzero value ○ Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are not typically useful. ○ Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer node in us-west-2a fills, resulting in spillover. If us-west-2b continues to respond normally, the sum for the load balancer will be the same as the sum for us-west-2a. ● SurgeQueueLength: The total number of requests that are pending routing. The load balancer queues a request if it is unable to establish a connection with a healthy instance in order to route the request. The maximum size of the queue is 1,024. Additional requests are rejected when the queue is full. For more information, see SpilloverCount. ○ Reporting criteria: There is a nonzero value. ○ Statistics: The most useful statistic is Maximum, because it represents the peak of queued requests. The Average statistic can be useful in combination with Minimum and Maximum to determine the range of queued requests. Note that Sum is not useful. ○ Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a fills, with clients likely experiencing increased response times. If this continues, the load balancer will likely have spillovers (see the SpilloverCount metric). If us-west-2b continues to respond normally, the max for the load balancer will be the same as the max for us-west-2a.
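The RequestCount example from the ELB tables (60 requests in us-west-2a, 40 in us-west-2b, 100 for the load balancer) shows how the same datapoints give different sums depending on the dimension. A Python sketch of that arithmetic, with hypothetical instance IDs:

```python
from collections import defaultdict

# (availability_zone, instance_id, request_count); counts follow the
# RequestCount example, instance IDs are hypothetical.
datapoints = [
    ("us-west-2a", "i-a1", 30),
    ("us-west-2a", "i-a2", 30),
    ("us-west-2b", "i-b1", 20),
    ("us-west-2b", "i-b2", 20),
]

def sum_by_zone(points):
    """Sum with the AvailabilityZone dimension."""
    totals = defaultdict(int)
    for zone, _instance, count in points:
        totals[zone] += count
    return dict(totals)

def sum_for_load_balancer(points):
    """Sum with the LoadBalancerName dimension (all zones together)."""
    return sum(count for _zone, _instance, count in points)

print(sum_by_zone(datapoints))
print(sum_for_load_balancer(datapoints))
```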
  32. 32. Too many to list here ... See: Amazon CloudWatch Metrics and Dimensions Reference
  33. 33. Metrics you need to understand ● EC2 ● EBS ● ELB: CLB, ALB, NLB ○ Classic Load Balancer ○ Application Load Balancer ○ Network Load Balancer
  34. 34. Every metric has a story to tell.
  35. 35. Question and Think: where do the EC2 / ELB metrics come from?
  37. 37. CloudWatch Dashboard: raise your vantage point and see the whole picture
  38. 38. Star Trek (USS Enterprise)
  39. 39. Passengers (film)
  40. 40. Passengers (film)
  41. 41. CloudWatch Dashboard ● widget: line, stacked, number, text (markdown) ● auto refresh ● local timezone ○ EC2 metric is UTC ● time range ● Horizontal annotation ● Right / Left Y axis ● full screen (dark / light mode)
  42. 42. Tips ● Dashboards can be imported / exported as json ● Dashboards can be updated automatically through the API ● $3.00 per dashboard per month (ap-northeast-1) ● Time zone
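Since a dashboard is just a JSON document, it can be generated from code. The Python sketch below builds a dashboard body with one line widget (including a horizontal annotation) and one markdown text widget. The keys follow the CloudWatch dashboard body structure as I understand it; the instance ID, titles, threshold, and dashboard name are hypothetical.

```python
import json

def cpu_widget(instance_id, region, threshold):
    """A line widget for EC2 CPUUtilization with a horizontal annotation."""
    return {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "CPU Utilization",
            "view": "timeSeries",
            "stacked": False,
            "region": region,
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
            "period": 300,
            "stat": "Average",
            "annotations": {"horizontal": [{"label": "threshold", "value": threshold}]},
        },
    }

def text_widget(markdown):
    """A text widget rendered from markdown."""
    return {
        "type": "text",
        "x": 12, "y": 0, "width": 12, "height": 6,
        "properties": {"markdown": markdown},
    }

dashboard_body = json.dumps(
    {
        "widgets": [
            cpu_widget("i-0123456789abcdef0", "ap-northeast-1", 80),
            text_widget("# System Level\nTimes on EC2 metrics are UTC."),
        ]
    },
    indent=2,
)

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="system-level", DashboardBody=dashboard_body)
print(dashboard_body)
```

Keeping this JSON in version control is one way to treat dashboards "as code".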
  43. 43. LetSSL - System Level
  44. 44. LetSSL - Application Level
  45. 45. Demo: CloudWatch Dashboard Widgets, X/Y Axis, Annotation
  47. 47. CloudWatch Alarm: Event-driven, Feedback
  48. 48. CloudWatch Alarm ● An action triggered once a threshold is reached, for example ○ Within five minutes ○ CPU >= 80% ○ Five times ● Action types ○ EC2 actions: reboot, stop, terminate. Usually combined with the EC2 System Status check. ○ SNS to: ■ SES ■ SQS ■ Lambda ■ HTTP Request
  49. 49. CloudWatch Alarm - Status ● ALARM: over threshold ● INSUFFICIENT_DATA: not enough data to decide ● OK: within threshold
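The threshold rule and the three alarm states can be sketched as a small Python function. This is a deliberately simplified model that assumes one datapoint per minute and requires every datapoint in the window to breach; the real service also has configurable handling of missing data, which is not modeled here.

```python
def evaluate_alarm(datapoints, threshold=80.0, evaluation_periods=5):
    """Return "OK", "ALARM", or "INSUFFICIENT_DATA" from the most recent datapoints.

    A value of None stands for a missing datapoint.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(v is None for v in recent):
        return "INSUFFICIENT_DATA"
    if all(v >= threshold for v in recent):
        return "ALARM"
    return "OK"

# Hypothetical one-minute CPU datapoints in percent.
print(evaluate_alarm([85, 90, 82, 95, 88]))   # five breaches in five minutes
print(evaluate_alarm([85, 90, 70, 95, 88]))   # one datapoint below the threshold
print(evaluate_alarm([85, 90]))               # not enough datapoints yet
```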
  50. 50. Demo: CloudWatch Alarm
  51. 51. Event-driven → Feedback → Automation (CW Alarm) ● Source: 『自動化XXX』的陷阱 (The Pitfalls of "Automating XXX")
  53. 53. CloudWatch Events: Rules, Cron, Scheduler
  54. 54. CloudWatch Event ● Event Source ○ Event Pattern ○ Schedule ● Targets ○ Up to 5 targets per rule (fixed) ○ Type: Lambda, EC2, Stream, ECS, SSM, Step Function, Pipeline, SNS, SQS …
  55. 55. CloudWatch Events ● Event Source ○ Event Pattern: DynamoDB, EC2, AutoScaling, RDS … too many to list ○ Schedule ● Targets ○ Up to 5 targets per rule (fixed) ○ Type: Lambda, EC2, Stream, ECS, SSM, Step Function, Pipeline, SNS, SQS … too many to list
  56. 56. Common scenarios ● EC2 preventive automation: ○ An instance that should not be stopped gets stopped: restart it automatically ○ An instance suffers a hardware failure: restart it automatically ○ Reacting to state changes ● After an S3 action ○ Action: PutObject ○ Trigger: Lambda, Put Message to SQS
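The first scenario (an instance that should stay up gets stopped) can be sketched as an event pattern plus a handler. The pattern uses the EC2 state-change event fields; the `matches` function is a simplified stand-in for the service's rule matching, and the handler, instance ID, and "protected" set are hypothetical.

```python
# Event pattern for a rule: lists mean "any of these values".
EC2_STOPPED_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

def matches(pattern, event):
    """Simplified matcher: every pattern key must be present and match; dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True

def handler(event, protected_instances):
    """Decide what a Lambda target might do; returns (action, instance_id)."""
    instance_id = event["detail"]["instance-id"]
    if matches(EC2_STOPPED_PATTERN, event) and instance_id in protected_instances:
        return ("start", instance_id)
    return ("ignore", instance_id)

# Hypothetical event, trimmed to the fields used here.
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
print(handler(event, {"i-0123456789abcdef0"}))
```

In a real Lambda function the "start" branch would call the EC2 API; it is left as a return value here so the sketch runs without AWS access.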
  57. 57. Demo: CloudWatch Events
  59. 59. CloudWatch Logs: Filter, Custom Metric, Log Shipper
  60. 60. [Diagram] EC2 Instances (/var/log/app/*.log) → Log Shipper → Logs → Log Groups → Log Streams A, B, C ... N ● Filters: [ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...] → Metrics → Dashboard, Alarms ● Export: S3 ● Streaming: Amazon ES, Lambda ● Push: SNS Topics, Lambda ● Sample log lines: 2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0; 2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0; 2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0; 2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0; 2017-06-11T08:47:01 app1 NGX 53 0 53 0 0 0; 2017-06-11T08:47:01 app2 NGX 53 0 53 0 0 0; 2017-06-11T08:48:01 app1 NGX 59 0 59 0 0 0; 2017-06-11T08:48:01 app2 NGX 52 0 51 0 0 0; 2017-06-11T08:49:01 app1 NGX 48 0 48 0 0 0
  61. 61. CloudWatch Logs (CWL) ● Prerequisite: the EC2 instance needs the awslogs driver or the CloudWatch agent installed ○ On ECS instances it is simply an option to select ● Ship logs to CWL in real time ○ Logs can be queried directly in CWL (usable, if basic) ○ No need to worry about storage filling up or about maintaining it ○ Log rotation can be configured ● Build Custom Metrics through Filters ○ Dashboards can be built on them ○ Alarms can be built on them → Event-driven ■ To Lambda, Slack ■ ETL ■ Automation … endless possibilities
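Creating a Custom Metric from a Filter amounts to one API request. The Python sketch below only builds the request so it can be read and checked offline. The keys follow the CloudWatch Logs `PutMetricFilter` request as exposed by boto3, and the filter pattern is the one shown on the architecture slide; the log group name, filter name, and namespace are hypothetical.

```python
# Filter pattern as shown on the architecture slide (space-delimited fields).
NGX_PATTERN = "[ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...]"

def metric_filter_request(log_group, field, namespace="Custom/NGX"):
    """Build a PutMetricFilter request that publishes one named field as a metric."""
    return {
        "logGroupName": log_group,
        "filterName": f"ngx-{field}",
        "filterPattern": NGX_PATTERN,
        "metricTransformations": [
            {
                "metricName": field,
                "metricNamespace": namespace,
                # "$field" publishes the value of that named column.
                "metricValue": f"${field}",
            }
        ],
    }

request = metric_filter_request("/app/ngx", "tcp_established")

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("logs").put_metric_filter(**request)
print(request)
```

Once the filter exists, the resulting metric can be used by dashboards and alarms like any other metric.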
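  62. 62. Metric ● Data obtained by sampling the target being measured ○ Data per unit of time, e.g. every millisecond, second, or minute ● The higher the sampling frequency, the more precise the data ● Audio quality (sample rate per second) ○ CD quality: 44.1kHz ○ Studio recording: 192kHz ● Image resolution ○ HD ○ Full-HD ○ 4k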
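A tiny Python illustration of why sampling frequency matters, using made-up values: a one-second spike is visible when sampling every second and disappears entirely when sampling once a minute.

```python
def downsample(values, step):
    """Keep every step-th sample, as a slower sampler would."""
    return values[::step]

# Hypothetical CPU readings taken every second for one minute, with a brief spike.
per_second = [10] * 60
per_second[37] = 95

per_minute = downsample(per_second, 60)

print(max(per_second))   # the spike is captured
print(max(per_minute))   # the spike is missed
```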
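  63. 63. Demo: CloudWatch Logs
  65. 65. Everything described above can be done `as Code`
  66. 66. Questions ● How do we know the state of the system? ○ Observe, Measure ● Where do the system's metrics come from? ○ Metrics are found by analyzing logs after systematic testing (System Test) ● Which layers of the system need to be known? Who needs to know? ○ Business, Application, OS/Hardware, Network ● What do we do once we know, and how? Proactively or reactively? ● What are monitoring and control? ○ 監: Watch ○ 控: Control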
  68. 68. The fundamental question
  69. 69. What is Monitoring? (什麼是監控?)
  70. 70. 監 (watch)
  71. 71. 監 (watch) ● 控 (control)
  72. 72. 監 (watch) ● 控 (control)
  73. 73. 監: Watch, Monitor, Observe, Measure ● 控
  74. 74. 監: Watch, Monitor, Observe, Measure ● 控: Control, Command, Handle, Manage
  75. 75. 監: Watch, Monitor, Observe, Measure → Dashboard ● 控: Control, Command, Handle, Manage
  76. 76. 監: Watch, Monitor, Observe, Measure → Dashboard ● 控: Control, Command, Handle, Manage → Console
  77. 77. Dashboard: Star Trek (USS Enterprise)
  78. 78. Console: a concert mixing console
  80. 80. [Diagram] Target Services / Systems
  81. 81. [Diagram] Target Services / Systems ● Watchers
  82. 82. [Diagram] Target Services / Systems ● Watchers ● Controllers
  83. 83. [Diagram] Target Services / Systems ● Watchers ● Controllers ● Push or Pull Data (Observability, Measure) ● Dashboard => Show Something ○ Health Status ○ Sum of Biz TX ○ Sys Resources ○ …
  84. 84. [Diagram] Target Services / Systems ● Watchers ● Controllers ● Push or Pull Data (Observability, Measure) ● Events (Conditions / Thresholds) ● Dashboard => Show Something ○ Health Status ○ Sum of Biz TX ○ Sys Resources ○ … ● Console => Do Something ○ Reset or Clean Cache ○ On / Off Functions ○ Notification ○ ...
  85. 85. [Diagram] Target Services / Systems ● Watchers ● Controllers ● Push or Pull Data (Observability, Measure) ● Events (Conditions / Thresholds) ● Commands ● Dashboard => Show Something ○ Health Status ○ Sum of Biz TX ○ Sys Resources ○ … ● Console => Do Something ○ Reset or Clean Cache ○ On / Off Functions ○ Notification ○ ...
  86. 86. [Diagram] Target Services / Systems ● Watchers ● Controllers ● Push or Pull Data (Observability, Measure) ● Events (Conditions / Thresholds) ● Commands ● Feedback (Adjust Conditions / Thresholds by ML) ● Dashboard => Show Something ○ Health Status ○ Sum of Biz TX ○ Sys Resources ○ … ● Console => Do Something ○ Reset or Clean Cache ○ On / Off Functions ○ Notification ○ ...
  87. 87. [Diagram] Target Services / Systems ● Watchers ● Controllers ● Push or Pull Data (Observability, Measure) ● Events (Conditions / Thresholds) ● Commands ● Feedback (Adjust Conditions / Thresholds by ML) ● Dashboard => Show Something ○ Health Status ○ Sum of Biz TX ○ Sys Resources ○ … ● Console => Do Something ○ Reset or Clean Cache ○ On / Off Functions ○ Notification ○ ... ● 監 (watch)
  88. 88. [Diagram] Target Services / Systems ● Watchers ● Controllers ● Push or Pull Data (Observability, Measure) ● Events (Conditions / Thresholds) ● Commands ● Feedback (Adjust Conditions / Thresholds by ML) ● Dashboard => Show Something ○ Health Status ○ Sum of Biz TX ○ Sys Resources ○ … ● Console => Do Something ○ Reset or Clean Cache ○ On / Off Functions ○ Notification ○ ... ● 監 (watch) ● 控 (control)
  89. 89. Observability vs Monitoring ● Measure ● Observe ● Weather bureau ○ Observability ○ Measurement ● Government ○ Monitoring ○ Alert ○ Action ○ Feedback
  90. 90. http://www.cwb.gov.tw/V7/observe/satellite/Sat_T.htm?type=1
  91. 91. Measure → Sample from Log ● Observe → Metric ● Feedback → Analyze, Condition, Alarm ● Control → Automation, hands-off operation
  92. 92. Without measurement, there is no observation. Without observation, there is no feedback. Without feedback, there is no control.
  93. 93. Logs matter: without structured logs or data, you pay heavily in ETL cost and time
  94. 94. Event-driven → Feedback → Automation (CW Alarm) ● Source: 『自動化XXX』的陷阱 (The Pitfalls of "Automating XXX")
  96. 96. Why CloudWatch ● Serverless Monitoring System ● Event-driven ● Programmable and Automation ● Realtime and Backup ● Monitoring Monitoring System at Netflix - 2017/05/22 ● CloudWatch meets the needs of "Basic Monitoring"
  97. 97. Source: Microservice Prerequisites
  98. 98. Why not choose another monitoring tool? ● We do not want to build and look after our own servers ● However well a monitoring system is built, it is still only a cost ● A monitoring system is not Big Data ● Some solutions' architectures do not account for HA, e.g. Prometheus
  100. 100. Alarm System using Serverless
  101. 101. System architecture: CloudWatch + SNS + Lambda + Slack ● [Diagram labels] EC2, Auto Scaling, CloudWatch Alarms, CloudWatch Event (time-based), Health-Checker, SNS Topic (Info, Warning), SNS Topic (Urgent), SNS-Adapter, Slack-Notifier, SMS; recipients: Operators, Developers, Testers ● Urgent: SMS, Slack ● Warning: Slack w/ tag ● Info: Slack w/o tag
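The three severity levels on this slide map naturally onto a small routing function, roughly what a Lambda such as the Slack-Notifier might do. This Python sketch is an illustration only: the `route` function and the `@oncall` tag are hypothetical, and tagging on-call for Urgent messages is an assumption (the slide specifies the tag only for Warning).

```python
SEVERITIES = ("urgent", "warning", "info")

def route(severity, message, on_call="@oncall"):
    """Return a list of (channel, text) deliveries for one alert."""
    severity = severity.lower()
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    deliveries = []
    if severity == "urgent":
        deliveries.append(("sms", message))
    # Warning is tagged per the slide; tagging Urgent as well is an assumption.
    tagged = severity in ("urgent", "warning")
    text = f"{on_call} {message}" if tagged else message
    deliveries.append(("slack", text))
    return deliveries

print(route("urgent", "API 5XX rate is rising"))
print(route("warning", "Disk usage at 85%"))
print(route("info", "Deployment finished"))
```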
  102. 102. CloudWatch Reporter - System Architecture ● [Diagram labels] CloudWatch Reporter / Alarmer; CloudWatch Event (time-based): Schedule; Read CW Metrics; Metric Configs (Namespace, Stats); Target Services: Loading; Info / Alert Channels; Operators (on duty); Operators; Developers (On Call); Developers: development, maintain, PR, Feature Request
  104. 104. Best Practice ● Make full use of cloud SaaS offerings such as AWS CloudWatch and GCP Stackdriver ● Design the deployment and configuration process to be configurable ● Design logs in a structured format (csv or json) ● Use a Big Data solution for log query needs, such as AWS Athena or GCP BigQuery ● Ship logs through a shipper (awslogs, statsd, collectd, fluentd, telegraf ...) to both ○ S3 as a backup, to meet audit requirements ○ CloudWatch for debugging / monitoring ● Massive log streaming data needs a queue to help ○ AWS Kinesis ○ GCP Pub/Sub
  105. 105. Reference - User Guide ● CloudWatch User Guide ● CloudWatch Events User Guide ● CloudWatch Logs User Guide
  106. 106. Reference - Youtube, Blog ● AWS re:Invent 2015: Log, Monitor and Analyze your IT with Amazon CloudWatch (DVO315) ● Amazon CloudWatch Update – Percentile Statistics and New Dashboard Widgets ● New – High-Resolution Custom Metrics and Alarms for Amazon CloudWatch ● 淺談系統監控與 CloudWatch 的應用 (An Introduction to System Monitoring and Applying CloudWatch) - AWS User Group Taiwan ● Study Notes - CloudWatch ● SRE CH6 Monitoring Distributed Systems ● 高品質微服務 - CH6 監控 (High-Quality Microservices, CH6 Monitoring)
  107. 107. 107 /* End of Slide */
