
KOCOON – KAKAO Automatic K8S Monitoring

Open & Intelligent Infrastructure kr 2019
Day2 Track2 14:00 ~ 14:40

  1. 2019.07 KOCOON KAKAO Automatic k8s Monitoring @@cloud.telemetry / issac.lim (임성국)
  2. Before We Start
  3. Who am I?
  4. Telemetry? Telemetry is an automated communications process by which measurements and other data are collected at remote or inaccessible points and transmitted to receiving equipment for monitoring. ref: https://en.wikipedia.org/wiki/Telemetry
  5. cloud.telemetry? • Remote & Inaccessible points • Baremetal, IaaS, CaaS … • We develop & provide an automated communications process • MaaS (Monitoring as a Service) • KEMI Stats, KEMI Logs, KOCOON ref: https://en.wikipedia.org/wiki/Telemetry
  6. KEMI Stats
  7. KEMI Logs
  8. Is it enough?
  9. Head: KEMI-* Longtail: ? ref: https://mgcabral.wordpress.com/2012/03/04/thelongtaileconomics/
  10. Longtail • Users want: get their own resources, deal with resources in their own way • We want: divide resources by user, provide a self-monitoring service
  11. DKOSv3 • k8s based container orchestrator @KAKAO • Kubernetes v1.11.5 • "Kakao Cloud's Kubernetes as a Service, seen through the Kakao T Taxi case" (openinfradays day 2, Track 2, 12:00 ~ 12:40)
  12. If so… let's separate monitoring resources per service cluster!
  13. KOCOON img ref: https://www.treehugger.com/green-architecture/cocoon-tree-prefab-spherical-treehouse-pod.html
  14. KOCOON • KakaO COntainer based service mONitoring: collection, querying, and alerting are all handled inside the service's own resources
  15. KOCOON-* Overview
  16. Metric based Self Monitoring
  17. Log Routing
  18. Rule based Log Event
  19. How to Use? (KOCOON-Prometheus)
  20. KOCOON-Prometheus Install
      • Prometheus-operator, Prometheus, Alertmanager
      • PrometheusRule (for alarm & recreate metric)
      • Cupido (Prometheus webhook manager for kakao)
      • kube state metrics, node exporter
      • Grafana & Dashboard
      • kakao etcd & kakao ingress controller service monitor
      helm install kakao-stable/kocoon-prometheus --name $(kubectl config current-context) --namespace monitoring
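A quick sanity check after the install, using plain helm/kubectl (not part of the chart itself); the release is named after the current kube context and everything lands in the monitoring namespace.
## release status (helm v2 style, matching the --name flag above)
helm status $(kubectl config current-context)
## the operator, Prometheus, Alertmanager, Grafana and the exporters should all come up here
kubectl get pods -n monitoring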
  21. KOCOON-Prometheus Set LB
      ## grafana ingress
      helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set grafana.ingress.enabled=true --set grafana.ingress.hosts[0]=example-grafana.dev.9rum.cc
      ## prometheus ingress
      helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set prometheus.ingress.enabled=true --set prometheus.ingress.hosts[0]=example-prometheus.dev.9rum.cc
      ## alertmanager ingress
      helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set alertmanager.ingress.enabled=true --set alertmanager.ingress.hosts[0]=example-alertmanager.dev.9rum.cc
  22. KOCOON-Prometheus Set Kakaotalk
      ## when alarms should go to several watchcenter group ids, e.g. 6663 and 4443
      helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set cupido.watchcenterGroups="6663;4443;"
      ## when alarms should go to a single watchcenter group id (6663)
      helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set cupido.watchcenterGroups="6663;"
      ## even if no watchcenter group id is set, alarms also go to the cloud.deploy cell that operates dkosv3 when a problem occurs
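To double-check what actually ended up on the release after these upgrades (plain helm, nothing chart-specific), the user-supplied values can be listed:
## shows e.g. cupido.watchcenterGroups as currently set on the release
helm get values $(kubectl config current-context)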
  23. k8s Status View
  24. k8s DrillDown View
  25. k8s Ingress Controller View
  26. k8s ETCD View
  27. KOCOON-Prometheus Alarm
  28. Any Problems?
  29. Goals • Workers: 500 • Metric scrape interval: 30s ~ 60s • Check resource status • service namespace (default) • kube-system • ingress
  30. Step 1 • Basic Resources for Prometheus • CPU: request 1, limit 4 • Memory: request 2GB, limit 6GB • OK up to ~ 100 workers, but… • Internal node IP issue (max 128 nodes…)
  31. Step 2
      Workers | Scrape Interval | CPU | Memory | # of metrics/sec
      12      | 30s             | 0.1 | 780MB  | 4605
      271     | 30s             | -   | OOM    | -
      • Basic Resources for Prometheus • CPU: request 1, limit 4 • Memory: request 2GB, limit 6GB
  32. Out Of Memory… • Prometheus killed due to OOM at large scale • https://groups.google.com/forum/#!topic/prometheus-users/DELLNNSVCSw • https://github.com/prometheus/prometheus/issues/4553 • https://github.com/prometheus/prometheus/issues/1358
  33. We considered … • # of targets • # of metrics • # of rules • frequency of scrapes & rule evaluations • …
  34. Must item • # of targets • # of rules • frequency of scrapes & rule evaluations (every minute)
  35. # of metrics & topk
      • # of metrics per sec: rate(prometheus_tsdb_head_samples_appended_total[5m])
      • Top 10 metrics: topk(10, count({job=~".+"}) by(__name__))
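The same two queries can also be run against the Prometheus HTTP API once the ingress from slide 21 is up; the hostname below is the example host used there and is illustrative only.
## samples appended per second over the last 5 minutes
curl -sG 'http://example-prometheus.dev.9rum.cc/api/v1/query' --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[5m])'
## top 10 metric names by series count
curl -sG 'http://example-prometheus.dev.9rum.cc/api/v1/query' --data-urlencode 'query=topk(10, count({job=~".+"}) by(__name__))'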
  36. Step 3
      Workers | Scrape Interval | CPU  | Memory | # of metrics/sec | topk
      271     | 30s             | -    | OOM    | -                | -
      271     | 60s             | 1.15 | OOM    | 27500            | container_network_tcp_usage_total: 246676
      • Upgrade Resources for Prometheus • CPU: request 1, limit 4 • Memory: request 4GB, limit 8GB • Scrape interval: 30s -> 60s
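A sketch of the 30s -> 60s interval change via helm, assuming kocoon-prometheus exposes the upstream prometheus-operator value key prometheus.prometheusSpec.scrapeInterval (the actual key in the kakao chart may differ):
## relax the scrape interval to reduce samples/sec and memory pressure
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set prometheus.prometheusSpec.scrapeInterval=60s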
  37. Useless Metrics • cadvisor • container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s|memory_failures_total) • container_(cpu_schedstat_run_seconds_total|cpu_schedstat_runqueue_seconds_total|cpu_schedstat_run_periods_total|cpu_system_seconds_total|cpu_user_seconds_total) • container_(last_seen|memory_working_set_bytes|memory_cache|memory_failcnt|memory_max_usage_bytes|memory_swap|start_time_seconds) • container_(network_receive_packets_total|network_transmit_packets_dropped_total|network_receive_errors_total|network_receive_packets_dropped_total|network_transmit_packets_total|network_transmit_errors_total)
  38. Useless Metrics • cadvisor • container_(spec_([a-z_]+)|fs_([a-z_]+)) • kubelet_(runtime_operations_latency_microseconds|docker_operations_latency_microseconds)
  39. Useless Metrics • kube api • apiserver_(admission_controller_admission_latencies_seconds_bucket|admission_step_admission_latencies_seconds_bucket|admission_controller_admission_latencies_seconds_sum|admission_controller_admission_latencies_seconds_count|admission_step_admission_latencies_seconds_summary) • apiserver_(request_latencies_bucket|response_sizes_bucket|request_latencies_summary)
  40. Useless Metrics • kube state metrics • kube_pod_container_status_waiting_reason
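With prometheus-operator, scrape-time drops like these are usually expressed as metricRelabelings on the relevant ServiceMonitors (presumably the kubelet/cadvisor, apiserver and kube-state-metrics ones managed by the chart). A minimal sketch of that mechanism follows; the ServiceMonitor name, selector and port are hypothetical, and the regex shows only one of the drop patterns above.
kubectl apply -n monitoring -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                # hypothetical ServiceMonitor, for illustration only
spec:
  selector:
    matchLabels:
      app: example-app             # hypothetical label on the target Service
  endpoints:
    - port: metrics                # hypothetical metrics port name on the target Service
      metricRelabelings:
        # drop series whose metric name matches the regex before they are stored
        - sourceLabels: [__name__]
          regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s|memory_failures_total)
          action: drop
EOF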
  41. Step 4
      Workers | Scrape Interval | CPU  | Memory | # of metrics/sec | topk
      271     | 60s             | 1.15 | OOM    | 27500            | container_network_tcp_usage_total: 246676
      310     | 60s             | 0.4  | 2.9GB  | 9748             | storage_operation_duration_seconds_bucket: 31834, container_memory_usage_bytes: 25737, container_memory_rss: 25737
      • Upgrade Resources for Prometheus • CPU: request 1, limit 4 • Memory: request 4GB, limit 8GB • Drop useless metrics: kube api, cadvisor, kube state metrics • 1629 pods
  42. Step 5
      Workers | Scrape Interval | CPU | Memory                 | # of metrics/sec | topk
      500     | 60s             | 1.1 | 12.88GB, 8.35GB (RSS)  | 18324            | storage_operation_duration_seconds_bucket: 50402, container_memory_rss: 45149, container_memory_usage_bytes: 45149
      • Upgrade Resources for Prometheus • CPU: request 1, limit 4 • Memory: request 9GB, limit 14GB • 4107 pods
  43. KOCOON-Prometheus Default SLA • Metric retention: 3 days • Metric scrape interval: every 60s • Alarm interval: if an event keeps repeating for a set period (5m), alert once per hour • Target Cluster • ~ 200 nodes, ~ 1500 pods • 5-minute average of appended metrics: ~ 10,000/sec • works with the current default settings • cpu 1~4 cores, memory 4~6GB • Want to run more nodes / pods?
  44. KOCOON-Prometheus SLA for 500 • Target Cluster • ~ 500 nodes (worker + ingress), ~ 4000 pods • 5-minute average of appended metrics: 18,000/sec • upgrade memory!
      ## Upgrade prometheus memory request & limit
      helm upgrade $(kubectl config current-context) kocoon-stg/kocoon-prometheus --reuse-values --set prometheus.prom
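A sketch of the "upgrade memory" step using the Step 5 numbers (request 9GB, limit 14GB), assuming the chart exposes the upstream prometheus-operator resource keys; the value path is cut off in the slide and the exact kocoon-prometheus key may differ:
## bump the Prometheus memory request/limit for the ~500 node / ~4000 pod target
helm upgrade $(kubectl config current-context) kocoon-stg/kocoon-prometheus --reuse-values --set prometheus.prometheusSpec.resources.requests.memory=9Gi --set prometheus.prometheusSpec.resources.limits.memory=14Gi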
  45. KOCOON-Prometheus Requirements • ~ 200 nodes / ~ 1500 pods • krane VM / PM: 8 Core, 8GB * 2 • ~ 500 nodes / ~ 4000 pods • krane VM / PM: 8 Core, 16GB * 2
  46. Considerations for Service • Resource Request & Limit • set cpu/memory requests to the minimum and remove the limits • while running performance tests, check the cpu/memory actually needed with the drill-down (namespace to pod) view shared earlier • when going into production, set request/limit based on those test results • a 500-node cluster is possible, but • placing services / service groups in clusters of ~100 nodes, • with 5 master nodes and 2 Prometheus & Alertmanager instances, • and reaching each cluster round-robin through an LB makes operation more stable
  47. Sum Up • KOCOON-* is provided as helm charts • KOCOON-Prometheus -> today's main topic • KOCOON-Cupido: included in KOCOON-Prometheus • KOCOON-Hermes -> will be presented at ifKakao 2019 • KOCOON-DIKE -> in development… helm?: a package manager for k8s https://helm.sh/
  48. Thanks @@cloud.telemetry with andi, beemo, issac, jenny, joanne & cloud part
