
Metrics: where and how

Graphite tuning story from Kyiv Devops Day 2016



  1. Metrics: where and how (a graphite-oriented story)
  2. • Vsevolod Polyakov • Platform Engineer at Grammarly
  3. Graphite (and all whisper-based systems)
  4. Default graphite architecture
  5. What? • RRD-like (gram.ly/gfsx) • so.it.is.my.metric → /so/it/is/my/metric.wsp • fixed retention (by name pattern) • fixed size (actually, no)
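The name-to-path mapping above can be sketched in a few lines; the helper name and storage directory below are illustrative, not graphite's actual API:

```python
# Hypothetical sketch of how whisper-based systems map a dotted metric
# name onto the filesystem: each dot becomes a directory separator and
# the leaf component gets a .wsp extension.
import os

def metric_to_path(metric, storage_dir="/var/lib/graphite/whisper"):
    """so.it.is.my.metric -> <storage_dir>/so/it/is/my/metric.wsp"""
    return os.path.join(storage_dir, *metric.split(".")) + ".wsp"
```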
  6. Retention and size • 1s:1d → 1 036 828 bytes • 10s:10d → 1 036 828 bytes • 1s:365d → 378 432 028 bytes (~3 000 metrics per TB) • 1TB0s:365d → 37 843 228 bytes (~30 000 metrics per TB) • whisper-calc
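The sizes above follow from the whisper on-disk layout: a 16-byte header, 12 bytes per archive header, and 12 bytes per datapoint. A back-of-the-envelope calculator (the parser handles only the shorthand units used on the slide) reproduces the numbers:

```python
# Whisper file size from a retention spec like "1s:1d":
# 16-byte file header + 12 bytes per archive header + 12 bytes per point.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 31536000}

def seconds(spec):
    return int(spec[:-1]) * UNITS[spec[-1]]

def whisper_size(retentions):
    """retentions like ["1s:1d", "1m:30d"] -> file size in bytes"""
    points = [seconds(r.split(":")[1]) // seconds(r.split(":")[0])
              for r in retentions]
    return 16 + 12 * len(points) + 12 * sum(points)

print(whisper_size(["1s:1d"]))  # 1036828, matching the slide
```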
  7. Retention and size • 10s:30d,1m:120d,10m:365d → 4 564 864 bytes • 240 864 metrics in 1 TB • aggregation: average, sum, min, max, and last • can be assigned per metric
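Per-metric aggregation assignment is done via name patterns in graphite's storage-aggregation.conf; the patterns below are illustrative, not from the talk:

```ini
# storage-aggregation.conf -- example per-metric aggregation rules
[min]
pattern = \.min$
aggregationMethod = min

[count]
pattern = \.count$
aggregationMethod = sum

[default]
pattern = .*
aggregationMethod = average
```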
  8. How • terraform (https://www.terraform.io/) • docker (https://www.docker.com/) • ansible (https://www.ansible.com/) • rocker (https://github.com/grammarly/rocker) • rocker-compose (https://github.com/grammarly/rocker-compose)
  9. Default graphite architecture
  10. carbon-cache.py • single-core • many options in the config file • defaults: link
  11. architecture: carbon-cache.py
  12. Start load testing • m4.xlarge instance (4 CPU, 16 GB RAM, 256 GB EBS gp2 disk) • retentions = 1s:1d • MAX_CACHE_SIZE, MAX_UPDATES_PER_SECOND, MAX_CR • defaults • almost 1.5 h to reach the limit :(
  13. carbon-cache.py: cache size → 75k reqs
  14. Results • 75 000 reqs max • 60 000 reqs sustained speed • IO :(
  15. Try to tune! • WHISPER_SPARSE_CREATE = true (don't preallocate space on creation; gives non-linear IO load) • CACHE_WRITE_STRATEGY = sorted (the default)
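The two options named on this slide live in the [cache] section of carbon.conf; the fragment below shows them alongside the limits mentioned earlier (the other values are illustrative, not the talk's exact settings):

```ini
# carbon.conf [cache] fragment -- tuning tried on this slide
[cache]
WHISPER_SPARSE_CREATE = True
CACHE_WRITE_STRATEGY = sorted
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 500
```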
  16. cache size 1k → 195k reqs
  17. Results • 120 000 reqs sustained speed • cache flush problem :(
  18. Try to tune! • CACHE_WRITE_STRATEGY = max: gives a strong flush preference to frequently updated metrics and also reduces random file IO.
  19. from 1k to 150k
  20. Results • 90 000 reqs sustained speed • cache flush problem :(
  21. Try to tune! • CACHE_WRITE_STRATEGY = naive: just flush; better with random IO.
  22. from 45k to 135k
  23. Results • 120 000 reqs sustained speed • still CPU-bound
  24. sorted vs max vs naive
  25. Maybe it's an EBS IO limitation? → 512 GB disk. • No.
  26. go-carbon • multi-core single daemon • written in Go • not many options to tune :( • link
  27. Start load testing • m4.xlarge instance (4 CPU, 16 GB RAM, 256 GB EBS gp2 disk) • retentions = 1s:1d • max-size = 0 • max-updates-per-second = 0 • almost 1 h to reach the limit :(
  28. 1k → 130k reqs, ~3k/min
  29. Results • 120 000 reqs sustained speed • but that's without sparse files • try to implement it
  30. Try to tune!
      remaining := whisper.Size() - whisper.MetadataSize()
      whisper.file.Seek(int64(remaining-1), 0)
      whisper.file.Write([]byte{0})
      chunkSize := 16384
      zeros := make([]byte, chunkSize)
      for remaining > chunkSize {
          // if _, err = whisper.file.Write(zeros); err != nil {
          //     return nil, err
          // }
          remaining -= chunkSize
      }
      if _, err = whisper.file.Write(zeros[:remaining]); err != nil {
          return nil, err
      }
  31. 180 000 reqs!
  32. Try to tune! • max update operations = 1500
  33. Results • TL;DR: 210 000 to 240 000 reqs sustained speed • 31 000 000 cache size!
  34. Try to tune! • max update operations = 0 • input-buffer = 400 000
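go-carbon is configured with a TOML file; a fragment matching the settings tried on these slides might look like the following. Key and section names are reproduced from memory and may differ by go-carbon version, so check your build's example config:

```toml
# go-carbon.conf fragment -- settings from these slides (names unverified)
[cache]
max-size = 0
input-buffer = 400000

[whisper]
max-updates-per-second = 0
workers = 4
sparse-create = true
```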
  35. Results • 270 000 reqs sustained speed • 10 to 20 million entries in cache!
  36. Try to tune! • vm.dirty_background_ratio = 40 • vm.dirty_ratio = 60
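These kernel knobs raise the thresholds at which Linux starts background writeback and forced synchronous writeback, letting the page cache absorb more dirty whisper pages before flushing. They can be set persistently via sysctl.conf:

```ini
# /etc/sysctl.conf fragment from the slide; apply with `sysctl -p`
vm.dirty_background_ratio = 40
vm.dirty_ratio = 60
```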
  37. 300 000 reqs
  38. Results • 300 000 reqs sustained speed • 180k+ reqs roughly without cache
  39. Re:Lays
  40. Default graphite architecture
  41. arch: forward
  42. arch: named regexp
  43. arch: hash
  44. arch: hash, replication factor: 2
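The hash-with-replication routing idea can be sketched as follows. This is a deliberately simplified toy: real relays (carbon-relay.py, carbon-c-relay) use a consistent hash ring with many virtual nodes per destination, while this version just picks consecutive backends from a plain hash:

```python
# Toy hash-based relaying with a replication factor: each metric is
# deterministically assigned to `replicas` consecutive destinations.
import hashlib

def route(metric, destinations, replicas=2):
    h = int(hashlib.md5(metric.encode()).hexdigest(), 16)
    start = h % len(destinations)
    return [destinations[(start + i) % len(destinations)]
            for i in range(replicas)]

dests = ["10.0.0.1:2003", "10.0.0.2:2003", "10.0.0.3:2003"]
# every metric lands on exactly two distinct backends, always the same two
```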
  45. carbon-relay.py • twisted-based • native
  46. Start load testing • c4.xlarge instance (4 CPU, 7.5 GB RAM) • ~1 Gb LAN • default parameters • hashing • 10 connections
  47. WTF!
  48. carbon-relay-ng • Go-based • web panel • live updates • aggregators • spooling • link
  49. <150 000 reqs
  50. carbon-c-relay • written in C • advanced cluster management
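carbon-c-relay's cluster management is driven by a small routing config; a minimal hashing cluster with replication factor 2, like the architecture slide earlier, might look like this (hosts are placeholders):

```
cluster graphite
    fnv1a_ch replication 2
        10.0.0.1:2003
        10.0.0.2:2003
        10.0.0.3:2003
    ;

match *
    send to graphite
    ;
```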
  51. from 100 000 to 1 600 000 reqs
  52. 1 400 000 reqs sustained speed. Or not?
  53. So… go-carbon + carbon-c-relay = ♡
  54. BTW: InfluxDB, 130k reqs on a cluster
  55. influx
  56. OpenTSDB: single instance + HBase cluster = up to 150k reqs
  57. ALSO • zipper: • https://github.com/grobian/carbonserver • https://github.com/grobian/carbonwriter • https://github.com/dgryski/carbonzipper • https://github.com/dgryski/carbonapi • https://github.com/dgryski/carbonmem • https://github.com/jssjr/carbonate
  58. Plans • Cyanite, retest • NewTS • OpenTSDB tuning • zipper tuning
  59. Feel free to ask • Vsevolod Polyakov • ctrlok@gmail.com • skype: ctrlok1987 • github.com/ctrlok • twitter.com/ctrlok • slack: HangOps • Gitter: dev_ua/devops • skype: DevOps from Ukraine
