Distributed HPC monitoring

Statscraft monitoring conference 2019 presentation slides

Published in: Engineering

  1. 1. MONITORING DISTRIBUTED HPC
  2. 2. DO YUO FNID TIHS SMILPE TO RAED ?
  3. 3. ANSCOMBE'S QUARTET (Frank Anscombe, 1918 - 2001; slide labels: Standard, Expected)
        Q1            Q2            Q3            Q4
        X     Y       X     Y       X     Y       X     Y
        10.0  8.04    10.0  9.14    10.0  7.46    8.0   6.58
        8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
        13.0  7.58    13.0  8.74    13.0  12.74   8.0   7.71
        9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
        11.0  8.33    11.0  9.26    11.0  7.81    8.0   8.47
        14.0  9.96    14.0  8.10    14.0  8.84    8.0   7.04
        6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
        4.0   4.26    4.0   3.10    4.0   5.39    19.0  12.50
        12.0  10.84   12.0  9.13    12.0  8.15    8.0   5.56
        7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
        5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89
  4. 4. Summary statistics per dataset
        Statistic     Q1                      Q2                      Q3                      Q4
        Mean (X, Y)   9, 7.50                 9, 7.50                 9, 7.50                 9, 7.50
        Var (X, Y)    11, 4.125               11, 4.125               11, 4.125               11, 4.125
        Cor           0.816                   0.816                   0.816                   0.816
        Σ(^2)         13.7627                 13.7763                 13.7562                 13.7425
        LinReg        y = 0.5x + 3            y = 0.5x + 3            y = 0.5x + 3            y = 0.5x + 3
        Coefficient   0.67                    0.67                    0.67                    0.67
        Intercept     3.00009 ± 0.909545      3.00091 ± 0.909545      3.00245 ± 0.909545      3.00173 ± 0.909545
        Slope         0.500091 ± 0.0953463    0.500000 ± 0.0953463    0.499727 ± 0.0953463    0.499909 ± 0.0953463
  5. 5. Scatter plots with linear fits. Q1: y = 0.5001x + 3.0001, R² = 0.6665. Q2: y = 0.5x + 3.0009, R² = 0.6662. Q3: y = 0.4997x + 3.0025, R² = 0.6663. Q4: y = 0.4999x + 3.0017, R² = 0.6667.
  6. 6. GRAPH != KPI
  7. 7. BATCH PROCESSING TIMELINE: t0
  8. 8. BATCH PROCESSING TIMELINE: t0 → Metrics Collection → t10
  9. 9. BATCH PROCESSING TIMELINE: t0 → Metrics Collection → t10 → Metrics Parsing → t45
  10. 10. BATCH PROCESSING TIMELINE: t0 → Metrics Collection → t10 → Metrics Parsing → t45 → Action → t60
  11. 11. STREAM PROCESSING TIMELINE: t0
  12. 12. STREAM PROCESSING TIMELINE: t0 → Metrics Parsing → t7
  13. 13. STREAM PROCESSING TIMELINE: t0 → Metrics Parsing → t7 → Action → t15
  14. 14. #1 INFRASTRUCTURE MATTERS
  15. 15. #2 NO TIME TO CONSULT HQ (diagram nodes: A, B, C)
  16. 16. #3 NO WAY TO CONSULT HQ
  17. 17. SETUP NUDNIK - INFRASTRUCTURE HEARTBEAT ▸ https://pypi.org/project/nudnik/ ▸ pip install nudnik ▸ https://aur.archlinux.org/packages/nudnik/ ▸ pacman -S nudnik ▸ git clone https://github.com/salosh/nudnik
  18. 18. NUDNIK MESSAGE TYPES - BASELINE ▸ Tiny fingerprint ▸ gRPC / REST ▸ Multiplatform (diagram: nodes A to E, link latencies of 10ms and 20ms)
  19. 19. NUDNIK MESSAGE TYPES - LOAD ▸ CPU ▸ Memory ▸ Disk ▸ Network ▸ Executable (diagram: nodes A to E, link latencies of 10ms and 20ms)
  20. 20. NUDNIK MESSAGE TYPES - CHAOS ▸ Set failure % ▸ Set latency ▸ .. Or randomise (diagram: nodes A to E, labels 10%, 10%, +? ms, ? %, 12ms)
  21. 21. NUDNIK REPORTING ▸ InfluxDB ▸ Elasticsearch ▸ Prometheus PG ▸ Text csv / TTY (diagram nodes: A, B, C, TS)
  22. 22. MULTILAYER DISTRIBUTION (diagram: A, B, C, Node A, Node B)
  23. 23. SCALE PLANNING / TESTING (diagram: A, B, C, RuOK)
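
Slide 20 describes chaos messages as a way to set a failure percentage and an added latency, or to randomise them. The sketch below illustrates that general idea in plain Python. It is not Nudnik's API or implementation: `with_chaos`, `ChaosError`, and all parameter names are hypothetical, and only the 10% and 12ms figures are borrowed from the slide.

```python
import random
import time


class ChaosError(RuntimeError):
    """Raised when a failure is injected on purpose (hypothetical type)."""


def with_chaos(handler, failure_rate=0.10, extra_latency_ms=12.0,
               rng=None, sleep=time.sleep):
    """Wrap `handler` so each call is delayed and fails with a set probability.

    failure_rate     -- probability in [0, 1] that a call raises ChaosError
    extra_latency_ms -- delay added before every call, in milliseconds
    rng, sleep       -- injectable for deterministic tests (assumed design)
    """
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        sleep(extra_latency_ms / 1000.0)   # injected latency
        if rng.random() < failure_rate:    # injected failure
            raise ChaosError("injected failure")
        return handler(*args, **kwargs)

    return wrapped


def heartbeat(node):
    """Stand-in for a real request between two nodes."""
    return f"ok from {node}"


# Usage: 20 heartbeats to node B with 10% failures and 12ms added latency.
chaotic_heartbeat = with_chaos(heartbeat, rng=random.Random(0))
failures = 0
for _ in range(20):
    try:
        chaotic_heartbeat("B")
    except ChaosError:
        failures += 1
print(f"{failures} of 20 calls failed")
```

Making the random source and the sleep function injectable keeps the wrapper testable without real delays, and replacing the fixed values with draws from `rng` would cover the "or randomise" bullet.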
