The Path of Monitoring 2.0: Everything Has Changed / Vsevolod Polyakov (Grammarly)

An overview of monitoring at Grammarly, which I presented at the previous RootConf.

Why we decided to change everything again after moving to Docker, and how we arrived at zipper-stack, go-carbon, and carbon-c-relay (including benchmarks of alternative solutions); how to ingest a million unique metrics per second; how we concluded that tags are necessary when instances are nameless, and how we implemented them; how zipper-stack works; and, overall, the architecture of our current uber-monitoring.

  1. 1. MONITORING. AGAIN. Vsevolod Polyakov
  2. 2. Platform Engineer, Grammarly. ctrlok.com
  3. 3. What are metrics?
  4. 4. Success
  5. 5. Count
  6. 6. Time
  7. 7. Interaction
  8. 8. Internal processes
  9. 9. System metrics
  10. 10. Why do we need metrics?
  11. 11. Alerts
  12. 12. Analytics
  13. 13. Graphite
  14. 14. Default graphite architecture
  15. 15. what?
  16. 16. what? • RRD-like (gram.ly/gfsx)
  17. 17. what? • RRD-like (gram.ly/gfsx) • so.it.is.my.metric → /so/it/is/my/metric.wsp
  18. 18. what? • RRD-like (gram.ly/gfsx) • so.it.is.my.metric → /so/it/is/my/metric.wsp • Fixed retention (by namepattern)
  19. 19. what? • RRD-like (gram.ly/gfsx) • so.it.is.my.metric → /so/it/is/my/metric.wsp • Fixed retention (by namepattern) • Fixed size (actually no)
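
The name-to-file mapping on these slides can be sketched in a few lines of Go. This is only an illustration of the rule "each dot becomes a directory separator, the last segment becomes a `.wsp` file"; the function name `MetricPath` and the storage root `/` are assumptions for the example, not part of Graphite's API.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// MetricPath maps a dotted Graphite metric name to the whisper file
// that would store it under root. Each dot-separated segment becomes
// a directory; the last segment becomes the .wsp file name.
// (Hypothetical helper, written for illustration only.)
func MetricPath(root, metric string) string {
	parts := strings.Split(metric, ".")
	return path.Join(root, path.Join(parts...)) + ".wsp"
}

func main() {
	fmt.Println(MetricPath("/", "so.it.is.my.metric")) // /so/it/is/my/metric.wsp
}
```

One consequence of this layout is that every unique metric name costs one file, which is why the number of unique metrics matters so much in the load tests later in the deck.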
  20. 20. Retention and size
  21. 21. Retention and size • 1s:1d → 1 036 828 bytes
  22. 22. Retention and size • 1s:1d → 1 036 828 bytes • 10s:10d → 1 036 828 bytes
  23. 23. Retention and size • 1s:1d → 1 036 828 bytes • 10s:10d → 1 036 828 bytes whisper calc
  24. 24. Retention and size • 1s:1d → 1 036 828 bytes • 10s:10d → 1 036 828 bytes • 1s:365d → 378 432 028 bytes (1 TB ~ 3 000) whisper calc
  25. 25. Retention and size • 1s:1d → 1 036 828 bytes • 10s:10d → 1 036 828 bytes • 1s:365d → 378 432 028 bytes (1 TB ~ 3 000) • 10s:365d → 37 843 228 bytes (1 TB ~ 30 000) whisper calc
  26. 26. Retention and size
  27. 27. Retention and size • 10s:30d,1m:120d,10m:365d → 4 564 864 bytes
  28. 28. Retention and size • 10s:30d,1m:120d,10m:365d → 4 564 864 bytes • 240 864 metrics in 1 TB
  29. 29. Retention and size • 10s:30d,1m:120d,10m:365d → 4 564 864 bytes • 240 864 metrics in 1 TB • aggregation: average, sum, min, max, and last.
  30. 30. Retention and size • 10s:30d,1m:120d,10m:365d → 4 564 864 bytes • 240 864 metrics in 1 TB • aggregation: average, sum, min, max, and last. • can be assigned per metric
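
The single-archive sizes above follow from the whisper file layout: a 16-byte metadata header, 12 bytes of archive info per retention level, and 12 bytes per stored point (4-byte timestamp plus 8-byte value). The sketch below is a minimal calculator under that assumption; the type and function names are made up for the example. It reproduces the single-archive figures from the slides; it makes no claim about how the calculator used in the talk handles multi-archive schemas.

```go
package main

import "fmt"

const (
	metadataSize    = 16 // aggregation type, max retention, xFilesFactor, archive count
	archiveInfoSize = 12 // offset, seconds per point, number of points
	pointSize       = 12 // 4-byte timestamp + 8-byte float64 value
	day             = 24 * 60 * 60
)

// Archive describes one retention level, e.g. 10s:10d.
type Archive struct {
	SecondsPerPoint int64
	RetentionSecs   int64
}

// WhisperSize returns the size in bytes of a fully allocated whisper
// file holding the given archives.
func WhisperSize(archives []Archive) int64 {
	size := int64(metadataSize)
	for _, a := range archives {
		points := a.RetentionSecs / a.SecondsPerPoint
		size += archiveInfoSize + points*pointSize
	}
	return size
}

func main() {
	fmt.Println(WhisperSize([]Archive{{1, 1 * day}}))    // 1036828
	fmt.Println(WhisperSize([]Archive{{10, 10 * day}}))  // 1036828
	fmt.Println(WhisperSize([]Archive{{1, 365 * day}}))  // 378432028
	fmt.Println(WhisperSize([]Archive{{10, 365 * day}})) // 37843228
}
```

The first two schemas give the same size because both hold 86 400 points: file size depends on the number of points, not on the time span they cover.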
  31. 31. How • terraform (https://www.terraform.io/) • docker (https://www.docker.com/) • ansible (https://www.ansible.com/) • rocker (https://github.com/grammarly/rocker) • rocker-compose (https://github.com/grammarly/rocker-compose)
  32. 32. Default graphite architecture
  33. 33. Default graphite architecture
  34. 34. carbon-cache.py link
  35. 35. carbon-cache.py • single-core link
  36. 36. carbon-cache.py • single-core • many options in config file link
  37. 37. carbon-cache.py • single-core • many options in config file • default link
  38. 38. architecture carbon-cache.py
  39. 39. Start load testing
  40. 40. Start load testing • m4.xlarge instance (4 CPU, 16 GB ram, 256 GB disk EBS gp2)
  41. 41. Start load testing • m4.xlarge instance (4 CPU, 16 GB ram, 256 GB disk EBS gp2) • retentions = 1s:1d
  42. 42. Start load testing • m4.xlarge instance (4 CPU, 16 GB ram, 256 GB disk EBS gp2) • retentions = 1s:1d • MAX_CACHE_SIZE, MAX_UPDATES_PER_SECOND, MAX_CREATES_PER_MINUTE = inf
  43. 43. Start load testing • m4.xlarge instance (4 CPU, 16 GB ram, 256 GB disk EBS gp2) • retentions = 1s:1d • MAX_CACHE_SIZE, MAX_UPDATES_PER_SECOND, MAX_CREATES_PER_MINUTE = inf • defaults
  44. 44. Start load testing • m4.xlarge instance (4 CPU, 16 GB ram, 256 GB disk EBS gp2) • retentions = 1s:1d • MAX_CACHE_SIZE, MAX_UPDATES_PER_SECOND, MAX_CREATES_PER_MINUTE = inf • defaults • almost 1.5h to get limit :(
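
The slides do not say how the test load was generated. As a rough sketch, assuming Graphite's plaintext protocol (`<path> <value> <timestamp>\n`, one datapoint per line), a generator of unique metric names could look like this. The naming scheme `prefix.metricN` and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// PlaintextLine formats one datapoint in Graphite's plaintext protocol:
// "<path> <value> <unix timestamp>\n".
func PlaintextLine(path string, value float64, ts int64) string {
	return fmt.Sprintf("%s %g %d\n", path, value, ts)
}

// Batch builds n datapoints, each with a unique metric name under
// prefix (hypothetical naming scheme), all sharing one timestamp.
func Batch(prefix string, n int, ts int64) string {
	var b strings.Builder
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("%s.metric%d", prefix, i)
		b.WriteString(PlaintextLine(name, float64(i), ts))
	}
	return b.String()
}

func main() {
	// Prints three lines, one datapoint per unique metric.
	fmt.Print(Batch("loadtest", 3, 1500000000))
}
```

In a real test the batch would be written to a TCP connection to the carbon daemon (port 2003 by default for the plaintext listener), increasing `n` over time until the daemon stops keeping up.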
  45. 45. carbon-cache.py cache size → 75k ms
  46. 46. updates
  47. 47. upd time
  48. 48. results • 75 000 ms max • 60 000 ms flagman speed • IO :(
  49. 49. Try to tune! • WHISPER_SPARSE_CREATE = true (don’t allocate space on creation): non-linear IO load. • CACHE_WRITE_STRATEGY = sorted (default)
  50. 50. cache size 1k → 195k ms
  51. 51. results • 120 000 ms flagman speed • cache flush problem :(
  52. 52. Try to tune! • CACHE_WRITE_STRATEGY = max: gives a strong flush preference to frequently updated metrics and also reduces random file IO.
  53. 53. from 1k to 150k
  54. 54. results • 90 000 ms flagman speed • cache flush problem :(
  55. 55. Try to tune! • CACHE_WRITE_STRATEGY = naive: just flushes. Better with random IO.
  56. 56. from 45k to 135k
  57. 57. results • 120 000 ms flagman speed • still CPU
  58. 58. sorted max naive
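
To make the difference between the three write strategies concrete, here is a simplified Go model of how each one picks the next metric to flush. It is a sketch of the idea as carbon's configuration describes it, not carbon-cache's actual implementation; the `Cache` type and function names are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Cache maps a metric name to the number of datapoints waiting to be
// flushed to disk. (Simplified model, not carbon's data structure.)
type Cache map[string]int

// SortedOrder models "sorted": snapshot the cache once and flush in
// order of queued datapoints, largest first.
func SortedOrder(c Cache) []string {
	names := make([]string, 0, len(c))
	for name := range c {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if c[names[i]] != c[names[j]] {
			return c[names[i]] > c[names[j]]
		}
		return names[i] < names[j] // tie-break by name for determinism
	})
	return names
}

// MaxNext models "max": before every flush, pick whichever metric has
// the most queued datapoints right now.
func MaxNext(c Cache) (string, bool) {
	best, found := "", false
	for name, n := range c {
		if !found || n > c[best] || (n == c[best] && name < best) {
			best, found = name, true
		}
	}
	return best, found
}

// NaiveNext models "naive": take any metric, with no ordering at all.
// Go's map iteration order is unspecified, which fits the intent.
func NaiveNext(c Cache) (string, bool) {
	for name := range c {
		return name, true
	}
	return "", false
}

func main() {
	c := Cache{"a": 3, "b": 10, "c": 1}
	fmt.Println(SortedOrder(c)) // [b a c]
	next, _ := MaxNext(c)
	fmt.Println(next) // b
}
```

The trade-off visible in the model: `sorted` pays for a full sort per pass, `max` pays for a scan per flush but always writes the largest batch, and `naive` spends no CPU on ordering at the cost of smaller, more random writes.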
  59. 59. • Maybe it’s IO EBS limitation? → 512 GB disk.
  60. 60. • Maybe it’s IO EBS limitation? → 512 GB disk. • No.
  61. 61. • Maybe it’s IO EBS limitation? → 512 GB disk. • No.
  62. 62. go-carbon link
  63. 63. go-carbon • multi-core single daemon link
  64. 64. go-carbon • multi-core single daemon • written in golang link
  65. 65. go-carbon • multi-core single daemon • written in golang • not many options to tune :( link
  66. 66. Start load testing
  67. 67. Start load testing • m4.xlarge instance (4 CPU, 16 GB ram, 256 GB disk EBS gp2)
  68. 68. Start load testing • m4.xlarge instance (4 CPU, 16 GB ram, 256 GB disk EBS gp2) • retentions = 1s:1d
  69. 69. Start load testing • m4.xlarge instance (4 CPU, 16 GB ram, 256 GB disk EBS gp2) • retentions = 1s:1d • max-size = 0
  70. 70. Start load testing • m4.xlarge instance (4 CPU, 16 GB ram, 256 GB disk EBS gp2) • retentions = 1s:1d • max-size = 0 • max-updates-per-second = 0
  71. 71. Start load testing • m4.xlarge instance (4 CPU, 16 GB ram, 256 GB disk EBS gp2) • retentions = 1s:1d • max-size = 0 • max-updates-per-second = 0 • almost 1h to get limit :(
  72. 72. 1k → 130k ms ~3k/min
  73. 73. 1k → 130k ms ~3k/min
  74. 74. 1k → 130k ms ~3k/min
  75. 75. results
  76. 76. results • 120 000 ms flagman speed
  77. 77. results • 120 000 ms flagman speed • but it’s without sparse.
  78. 78. results • 120 000 ms flagman speed • but it’s without sparse. • try to implement
  79. 79. try to tune!
        remaining := whisper.Size() - whisper.MetadataSize()
        whisper.file.Seek(int64(remaining-1), 0)
        whisper.file.Write([]byte{0})
        chunkSize := 16384
        zeros := make([]byte, chunkSize)
        for remaining > chunkSize {
            // if _, err = whisper.file.Write(zeros); err != nil {
            //     return nil, err
            // }
            remaining -= chunkSize
        }
        if _, err = whisper.file.Write(zeros[:remaining]); err != nil {
            return nil, err
        }
  80. 80. Already in go-carbon
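
The patch on the previous slide skips the zero-fill loop and extends the file by seeking near the end and writing a single byte. As a standalone illustration of the same idea, the sketch below uses `Truncate` instead (a different technique from the slide's seek-and-write, chosen here because it is the shortest way to set a file's logical size) and compares it with writing zeros. All function names are made up for the example, and whether the file is actually stored sparsely depends on the filesystem.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// CreateSparse creates a file with the given logical size without
// writing its contents. On filesystems that support sparse files,
// disk blocks are allocated only when data is written later.
func CreateSparse(path string, size int64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return f.Truncate(size)
}

// CreateZeroFilled creates a file of the given size by writing zeros
// in 16 KiB chunks, so every block is allocated up front.
func CreateZeroFilled(path string, size int64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	zeros := make([]byte, 16384)
	for remaining := size; remaining > 0; {
		n := int64(len(zeros))
		if remaining < n {
			n = remaining
		}
		if _, err := f.Write(zeros[:n]); err != nil {
			return err
		}
		remaining -= n
	}
	return nil
}

// demo creates both variants in a temporary directory and returns
// their logical sizes as reported by Stat.
func demo(size int64) (int64, int64, error) {
	dir, err := os.MkdirTemp("", "wsp-demo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	sparsePath := filepath.Join(dir, "sparse.wsp")
	filledPath := filepath.Join(dir, "filled.wsp")
	if err := CreateSparse(sparsePath, size); err != nil {
		return 0, 0, err
	}
	if err := CreateZeroFilled(filledPath, size); err != nil {
		return 0, 0, err
	}
	s, err := os.Stat(sparsePath)
	if err != nil {
		return 0, 0, err
	}
	z, err := os.Stat(filledPath)
	if err != nil {
		return 0, 0, err
	}
	return s.Size(), z.Size(), nil
}

func main() {
	sparse, filled, err := demo(1036828) // size of a 1s:1d whisper file
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sparse, filled) // same logical size, very different write cost
}
```

Both files report the same logical size, but the sparse one costs almost no IO at creation time, which matters when thousands of new metrics appear per minute.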
  81. 81. 180 000 ms !
  82. 82. try to tune! • max update operation = 1500
  83. 83. results • TLDR 210 000 - 240 000 ms flagman speed • 31 000 000 cache size!
  84. 84. try to tune! • max update operation = 0 • input-buffer = 400 000
  85. 85. results • 270 000 ms flagman speed • 10-20kk cache size!
  86. 86. try to tune! • vm.dirty_background_ratio=40 • vm.dirty_ratio=60
  87. 87. 300 000 reqs
  88. 88. results • 300 000 ms flagman speed • 180k+ ms ±without cache
  89. 89. Re:Lays
  90. 90. Default graphite architecture
  91. 91. Default graphite architecture
  92. 92. arch forward
  93. 93. arch namedregexp
  94. 94. arch hash
  95. 95. arch hash replicafactor: 2
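
The "hash" architecture with a replication factor of 2 can be sketched as a consistent-hash ring that returns two distinct destinations per metric. This is a deliberately simplified ring written for illustration: it is not the hashing scheme used by carbon-relay.py or carbon-c-relay, and the node names and virtual-node count are assumptions.

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
	"sort"
)

type ringEntry struct {
	pos  uint32
	node string
}

// Ring is a simplified consistent-hash ring (illustrative only).
type Ring struct {
	entries []ringEntry
}

// hashKey maps a string to a position on the ring.
func hashKey(key string) uint32 {
	sum := md5.Sum([]byte(key))
	return binary.BigEndian.Uint32(sum[:4])
}

// NewRing places vnodes virtual points per node on the ring so that
// metrics spread evenly across nodes.
func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			pos := hashKey(fmt.Sprintf("%s:%d", n, i))
			r.entries = append(r.entries, ringEntry{pos, n})
		}
	}
	sort.Slice(r.entries, func(i, j int) bool {
		if r.entries[i].pos != r.entries[j].pos {
			return r.entries[i].pos < r.entries[j].pos
		}
		return r.entries[i].node < r.entries[j].node
	})
	return r
}

// Route returns up to factor distinct nodes for a metric, walking the
// ring clockwise from the metric's hash position.
func (r *Ring) Route(metric string, factor int) []string {
	if len(r.entries) == 0 {
		return nil
	}
	h := hashKey(metric)
	start := sort.Search(len(r.entries), func(i int) bool {
		return r.entries[i].pos >= h
	})
	seen := map[string]bool{}
	var out []string
	for i := 0; i < len(r.entries) && len(out) < factor; i++ {
		e := r.entries[(start+i)%len(r.entries)]
		if !seen[e.node] {
			seen[e.node] = true
			out = append(out, e.node)
		}
	}
	return out
}

func main() {
	// Hypothetical backend names.
	ring := NewRing([]string{"carbon-a", "carbon-b", "carbon-c"}, 100)
	fmt.Println(ring.Route("so.it.is.my.metric", 2)) // two distinct backends
}
```

Because the destination depends only on the metric name, every relay with the same node list routes a given metric to the same pair of backends, and losing one backend still leaves a full copy on the other.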
  96. 96. carbon-relay.py • twisted based • native
  97. 97. Start load testing
  98. 98. Start load testing • c4.xlarge instance (4 CPU, 7.5 GB ram)
  99. 99. Start load testing • c4.xlarge instance (4 CPU, 7.5 GB ram) • ~1 Gb lan
  100. 100. Start load testing • c4.xlarge instance (4 CPU, 7.5 GB ram) • ~1 Gb lan • default parameters
  101. 101. Start load testing • c4.xlarge instance (4 CPU, 7.5 GB ram) • ~1 Gb lan • default parameters • hashing
  102. 102. Start load testing • c4.xlarge instance (4 CPU, 7.5 GB ram) • ~1 Gb lan • default parameters • hashing • 10 connections
  103. 103. WTF!
  104. 104. carbon-relay-ng link
  105. 105. carbon-relay-ng • golang-based link
  106. 106. carbon-relay-ng • golang-based • web-panel link
  107. 107. carbon-relay-ng • golang-based • web-panel • live-updates link
  108. 108. carbon-relay-ng • golang-based • web-panel • live-updates • aggregators link
  109. 109. carbon-relay-ng • golang-based • web-panel • live-updates • aggregators • spooling link
  110. 110. <150 000 reqs
  111. 111. carbon-c-relay • written in C • advanced cluster management
  112. 112. from 100 000 to 1 600 000 reqs
  113. 113. 1 400 000 flagman speed. Or not?
  114. 114. 1 400 000 flagman speed. Or not?
  115. 115. 1 400 000 flagman speed. Or not?
  116. 116. So… go-carbon + carbon-c-relay = ♡
  117. 117. Containers
  118. 118. Everything is mixed up
  119. 119. Differences • Environment • Role • Track (Modifier) • IP • Datacenter • Anything else
  120. 120. Tags
  121. 121. TSDBs with tags • influxDB • openTSDB (hbase) • cyanite (cassandra) • newTS (cassandra) • Prometheus
  122. 122. (cluster) influx, 130k metrics
  123. 123. openTSDB single instance + hbase cluster = up to 150k metrics
  124. 124. Compaction
  125. 125. Graphite
  126. 126. Find what is unique
  127. 127. Works with Grafana
  128. 128. Zipper • https://github.com/grobian/carbonserver • https://github.com/dgryski/carbonzipper • https://github.com/dgryski/carbonapi
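
The slides list the zipper components without describing the merge step. As a rough sketch of the idea behind a zipper, assuming it queries several replicas for the same series and fills gaps in one response with values from another, the merge could look like this. The function name and the use of NaN as the "missing" marker are assumptions for the example, not carbonzipper's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// Merge combines the same series as returned by several replicas.
// NaN marks a missing point; for each position the first replica
// that has a real value wins. (Illustrative sketch only.)
func Merge(replicas ...[]float64) []float64 {
	n := 0
	for _, r := range replicas {
		if len(r) > n {
			n = len(r)
		}
	}
	out := make([]float64, n)
	for i := range out {
		out[i] = math.NaN()
		for _, r := range replicas {
			if i < len(r) && !math.IsNaN(r[i]) {
				out[i] = r[i]
				break
			}
		}
	}
	return out
}

func main() {
	nan := math.NaN()
	a := []float64{1, nan, 3, nan}
	b := []float64{nan, 2, 3, nan}
	fmt.Println(Merge(a, b)) // [1 2 3 NaN]
}
```

With a replication factor of 2 on the write path, a merge like this lets reads survive a backend that was down for part of the requested interval.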
  129. 129. ALSO • https://github.com/jssjr/carbonate • https://github.com/jjneely/buckytools • https://github.com/dgryski/carbonmem • https://github.com/grobian/carbonwriter
  130. 130. Plans • Patch statsd → ES • Patch carbonserver → carbonlink
  131. 131. feel free to ask • Vsevolod Polyakov • ctrlok@gmail.com • skype: ctrlok1987 • github.com/ctrlok • twitter.com/ctrlok • slack: HangOps • Gitter: dev_ua/devops • skype: DevOps from Ukraine • slack.ukrops.club
  132. 132. feel free to ask • Vsevolod Polyakov • ctrlok@gmail.com • skype: ctrlok1987 • github.com/ctrlok • twitter.com/ctrlok • slack: HangOps • Gitter: dev_ua/devops • skype: DevOps from Ukraine • slack.ukrops.club We're hiring!
