Distributed HPC monitoring
Statscraft monitoring conference 2019 presentation slides

Published in: Engineering

  1. MONITORING DISTRIBUTED HPC
  2. DO YUO FNID TIHS SMILPE TO RAED ?
  3. ANSCOMBE'S QUARTET (Frank Anscombe, 1918-2001)
              Q1            Q2            Q3            Q4
           X      Y      X      Y      X      Y      X      Y
        10.0   8.04   10.0   9.14   10.0   7.46    8.0   6.58
         8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
        13.0   7.58   13.0   8.74   13.0  12.74    8.0   7.71
         9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
        11.0   8.33   11.0   9.26   11.0   7.81    8.0   8.47
        14.0   9.96   14.0   8.10   14.0   8.84    8.0   7.04
         6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
         4.0   4.26    4.0   3.10    4.0   5.39   19.0  12.50
        12.0  10.84   12.0   9.13   12.0   8.15    8.0   5.56
         7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
         5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89
  4. Summary statistics
     Dataset  Mean X  Mean Y  Var X  Var Y  Cor    Σ(residual²)  Intercept             Slope
     Q1       9       7.50    11     4.125  0.816  13.7627       3.00009 ± 0.909545    0.500091 ± 0.0953463
     Q2       9       7.50    11     4.125  0.816  13.7763       3.00091 ± 0.909545    0.500000 ± 0.0953463
     Q3       9       7.50    11     4.125  0.816  13.7562       3.00245 ± 0.909545    0.499727 ± 0.0953463
     Q4       9       7.50    11     4.125  0.816  13.7425       3.00173 ± 0.909545    0.499909 ± 0.0953463
     All four: linear regression y = 0.5x + 3, coefficient of determination R² = 0.67
  5. [Figure: scatter plots of Q1 to Q4 with linear trend lines. Q1: y = 0.5001x + 3.0001, R² = 0.6665. Q2: y = 0.5x + 3.0009, R² = 0.6662. Q3: y = 0.4997x + 3.0025, R² = 0.6663. Q4: y = 0.4999x + 3.0017, R² = 0.6667.]
  6. GRAPH != KPI
  7. BATCH PROCESSING TIMELINE: t0
  8. BATCH PROCESSING TIMELINE: t0 to t10 metrics collection
  9. BATCH PROCESSING TIMELINE: t0 to t10 metrics collection; t10 to t45 metrics parsing
  10. BATCH PROCESSING TIMELINE: t0 to t10 metrics collection; t10 to t45 metrics parsing; t45 to t60 action
  11. STREAM PROCESSING TIMELINE: t0
  12. STREAM PROCESSING TIMELINE: t0 to t7 metrics parsing
  13. STREAM PROCESSING TIMELINE: t0 to t7 metrics parsing; t7 to t15 action
  14. #1 INFRASTRUCTURE MATTERS
  15. #2 NO TIME TO CONSULT HQ (diagram: A, B, C)
  16. #3 NO WAY TO CONSULT HQ
  17. SETUP: NUDNIK - INFRASTRUCTURE HEARTBEAT
      ▸ https://pypi.org/project/nudnik/
      ▸ pip install nudnik
      ▸ https://aur.archlinux.org/packages/nudnik/
      ▸ pacman -S nudnik
      ▸ git clone https://github.com/salosh/nudnik
  18. NUDNIK MESSAGE TYPES - BASELINE (diagram: nodes A to E; link latencies 20ms, 10ms, 20ms, 10ms, 10ms, 20ms, 10ms, 10ms)
      ▸ Tiny footprint
      ▸ gRPC / REST
      ▸ Multiplatform
  19. NUDNIK MESSAGE TYPES - LOAD (diagram: nodes A to E; link latencies 10ms, 10ms, 10ms, 10ms, 10ms, 10ms, 20ms, 20ms)
      ▸ CPU
      ▸ Memory
      ▸ Disk
      ▸ Network
      ▸ Executable
  20. NUDNIK MESSAGE TYPES - CHAOS (diagram: nodes A to E; labels 10%, 10%, +? ms, ? %, 12ms)
      ▸ Set failure %
      ▸ Set latency
      ▸ ... or randomise
  21. NUDNIK REPORTING (diagram: A, B, TS, C)
      ▸ InfluxDB
      ▸ Elasticsearch
      ▸ Prometheus Pushgateway
      ▸ Text (CSV / TTY)
  22. MULTILAYER DISTRIBUTION (diagram: A, B, C; Node A, Node B)
  23. SCALE PLANNING / TESTING (diagram: A, B, C; RuOK)
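
Slides 7 to 13 contrast time-to-action: in the batch timeline nothing can happen until collection (t0 to t10) and parsing (t10 to t45) are finished, while the stream timeline parses samples as they arrive and has acted by t15. The hypothetical Python model below makes that difference concrete; `Sample`, the per-sample parsing cost and the threshold rule are assumptions chosen for illustration, not part of the talk.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Sample:
    t: float      # arrival time of the metric sample (arbitrary units)
    value: float  # hypothetical metric value, e.g. a node load average


def batch_time_to_action(samples: Iterable[Sample], threshold: float,
                         window_end: float, parse_cost: float,
                         act_cost: float) -> Optional[float]:
    """Batch: wait for the collection window to close, parse every sample, then act."""
    batch = list(samples)
    parsed_at = window_end + parse_cost * len(batch)
    if any(s.value > threshold for s in batch):
        return parsed_at + act_cost
    return None  # nothing crossed the threshold


def stream_time_to_action(samples: Iterable[Sample], threshold: float,
                          parse_cost: float, act_cost: float) -> Optional[float]:
    """Stream: parse each sample as it arrives and act on the first breach."""
    clock = 0.0
    for s in sorted(samples, key=lambda smp: smp.t):
        clock = max(clock, s.t) + parse_cost  # parser is busy until this time
        if s.value > threshold:
            return clock + act_cost
    return None


SAMPLES = [Sample(1, 0.2), Sample(2, 0.9), Sample(5, 0.3), Sample(9, 0.4)]
print("batch:", batch_time_to_action(SAMPLES, 0.8, window_end=10, parse_cost=1, act_cost=5))
print("stream:", stream_time_to_action(SAMPLES, 0.8, parse_cost=1, act_cost=5))
```

With these made-up numbers the stream model acts at t = 8 while the batch model acts at t = 19, because the breach at t = 2 is handled immediately instead of waiting for the window and the full parse.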
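
Slide 20 lists three chaos knobs: a failure percentage, an added latency, or randomised values for both. The sketch below shows one way such a fault injector could work. It is not Nudnik's implementation; the class name `ChaosInjector`, the 0 to 50 ms range used when latency is randomised, and the injectable `send` and `sleep` hooks are all assumptions.

```python
import random
import time
from typing import Callable, Optional


class ChaosInjector:
    """Hypothetical fault injector: drops a share of messages and delays the rest.

    Passing None for a knob means "randomise", mirroring slide 20.
    """

    def __init__(self, failure_pct: Optional[float] = None,
                 extra_latency_ms: Optional[float] = None,
                 rng: Optional[random.Random] = None) -> None:
        self.rng = rng or random.Random()
        self.failure_pct = (self.rng.uniform(0.0, 100.0)
                            if failure_pct is None else failure_pct)
        # Assumed upper bound of 50 ms when the latency is randomised.
        self.extra_latency_ms = (self.rng.uniform(0.0, 50.0)
                                 if extra_latency_ms is None else extra_latency_ms)

    def decide(self) -> Optional[float]:
        """Return the delay (ms) to add, or None if the message should be dropped."""
        if self.rng.random() * 100.0 < self.failure_pct:
            return None
        return self.extra_latency_ms

    def deliver(self, message: bytes, send: Callable[[bytes], None],
                sleep: Callable[[float], None] = time.sleep) -> bool:
        """Send `message` unless it is dropped; return True if it was sent."""
        delay_ms = self.decide()
        if delay_ms is None:
            return False
        sleep(delay_ms / 1000.0)
        send(message)
        return True


# Example: 10% failures and +12 ms latency, seeded for reproducibility.
injector = ChaosInjector(failure_pct=10, extra_latency_ms=12, rng=random.Random(42))
sent = []
delivered = sum(injector.deliver(b"ping", sent.append, sleep=lambda s: None)
                for _ in range(1000))
print(f"delivered {delivered} of 1000 messages")
```

Injecting `sleep` keeps tests fast; in a real heartbeat sender the default `time.sleep` would apply the delay.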
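
Slide 21 lists InfluxDB among the reporting back ends. As a rough illustration of what a heartbeat sample might look like on the wire, the sketch below serializes one record to InfluxDB line protocol (`measurement,tags fields timestamp`). The measurement, tag and field names are hypothetical and this is not Nudnik's reporter; it covers only the common escaping rules and expects finite float values.

```python
from typing import Mapping, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, special: str) -> str:
    """Backslash-escape each character listed in `special`."""
    for ch in special:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value: FieldValue) -> str:
    """Encode a field value: boolean, integer (i suffix), float or quoted string."""
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement: str, tags: Mapping[str, str],
                     fields: Mapping[str, FieldValue], ts_ns: int) -> str:
    """Build one line-protocol record; tags are sorted by key."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")  # measurement: escape commas and spaces
    for key, val in sorted(tags.items()):
        head += f",{_escape(key, ',= ')}={_escape(val, ',= ')}"
    body = ",".join(f"{_escape(k, ',= ')}={_field_value(v)}" for k, v in fields.items())
    return f"{head} {body} {ts_ns}"


print(to_line_protocol("heartbeat", {"src": "A", "dst": "B"},
                       {"latency_ms": 12.5, "ok": True, "seq": 42},
                       1700000000000000000))
```

Sorting tags by key is what the InfluxDB documentation recommends for write performance; the order of fields is kept as given.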
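
Slide 17 frames Nudnik as an infrastructure heartbeat, and slide 23 labels the scale-testing check "RuOK". A common way to turn heartbeats into an are-you-OK verdict is a timeout-based failure detector: a node becomes suspect once it has been silent for several expected intervals. The sketch below illustrates that generic technique, not Nudnik's logic; `HeartbeatMonitor`, `is_ok` and the three-missed-beats tolerance are assumptions.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Timeout-based failure detector (illustrative sketch)."""

    def __init__(self, interval_s: float, missed_beats: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = interval_s * missed_beats  # silence tolerated before suspicion
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def beat(self, node: str) -> None:
        """Record a heartbeat from `node` at the current clock time."""
        self.last_seen[node] = self.clock()

    def suspects(self) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout, sorted by name."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)

    def is_ok(self, node: str) -> bool:
        """True if `node` has been heard from and is not currently suspected."""
        return node in self.last_seen and node not in self.suspects()


# Example with a fake clock so the behaviour is deterministic.
now = [0.0]
mon = HeartbeatMonitor(interval_s=1.0, clock=lambda: now[0])
for node in ("A", "B", "C"):
    mon.beat(node)
now[0] = 2.0
mon.beat("A")
mon.beat("B")
now[0] = 3.5  # C has now been silent for 3.5 s, more than 3 intervals
print(mon.suspects())
```

A fixed timeout is the simplest design; adaptive detectors that learn the normal inter-arrival distribution trade that simplicity for fewer false suspicions under variable latency.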
