Fine-grained Monitoring at HP
Mark Seger
Hewlett Packard
Cloud Services
4/19/2013 1Fine-grained Monitoring
Agenda
• What is the problem we’re trying to solve?
• Introduction to collectl
• Monitoring Swift & Glance
• Monitoring VMs
Conflicting Problem Statements
• (Thousands of nodes) × (hundreds of metrics)
– And you want to centrally monitor them how often?
– And store them in a database for future mining?
– Politely pause for laughter…
• Reality
– Choose a frequency on the order of a minute or more
– Don’t collect all data at the same time
– Don’t collect everything
• BUT when problems arise
– Granularity measured in minutes is too coarse
– If samples aren’t taken at the same time, how do you correlate?
– There’s almost never enough detailed data to provide answers
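To put rough numbers on the problem statement, here is a minimal back-of-the-envelope sketch. The fleet size (5000 nodes) and metric count (300 per node) are assumptions chosen only for illustration; the slides say just "Ks of nodes" and "hundreds of metrics".

```python
def samples_per_day(nodes: int, metrics: int, interval_seconds: float) -> int:
    """Number of individual metric values produced per day by the whole fleet."""
    seconds_per_day = 24 * 60 * 60
    return int(nodes * metrics * (seconds_per_day / interval_seconds))


if __name__ == "__main__":
    # Hypothetical fleet: 5000 nodes, 300 metrics each (not figures from the talk).
    for interval in (1, 60):
        total = samples_per_day(nodes=5000, metrics=300, interval_seconds=interval)
        print(f"sampling every {interval:>2}s: {total:,} values/day")
```

Under these assumptions, moving from one-minute to one-second sampling multiplies the central load by 60, which is the tension the slide describes.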
Isn’t the solution obvious?
• Use one central tool and another local tool
• But this has its own problems as well
– Will the data cross-correlate? Hopefully…
– What about customizations?
– Do you really want to collect the same data twice?
• At HP we’ve chosen a hybrid model
– Use a lightweight local data collection tool, some redundancy
• Collect cpu, disk, net & mem every second; key processes every 5
• Extend it to add OpenStack monitoring capabilities
• Send subset/updates to centralized monitor every minute or more
– Central tool: collectd. Local tool: collectl. No relation!
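The hybrid model above (sample locally every second, forward a subset every minute) can be sketched as a simple roll-up. This is an illustration only, not collectl's implementation; the function name and the choice of avg/max/count as the forwarded subset are assumptions.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_minute(samples):
    """Roll per-second samples up into one summary per minute.

    samples: iterable of (epoch_seconds, value) pairs.
    Returns {minute_start_epoch: {"avg": ..., "max": ..., "count": ...}}.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        minute_start = int(timestamp) - int(timestamp) % 60
        buckets[minute_start].append(value)
    return {
        minute: {"avg": mean(values), "max": max(values), "count": len(values)}
        for minute, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Hypothetical CPU% samples: two in the minute starting at 120, one at 180.
    per_second = [(120, 10), (121, 30), (180, 5)]
    for minute, summary in summarize_by_minute(per_second).items():
        print(minute, summary)
```

The full one-second data stays on the node for diagnosis; only the per-minute summary would travel to the central monitor.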
Introduction to collectl
• Open source tool, in use for many years
• Roots in HPC, so knows how to be efficient
– Think of it as SAR on steroids
– Can monitor at sub-second intervals when needed
– Synchronizes samples across the cluster to within microseconds for correlation
– Can generate data in plottable formats
– Can record data locally and send over a socket
• And do it at different frequencies!
• Can also write to a snapshot file which is what we’re doing
– Has an API for extensibility
– Several utilities for plotting and real-time cluster monitoring
Summary Mode Formats
#<-------CPU--------><-----------Disks-----------><-----------Network--------->
#cpu sys inter ctxsw KBRead Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
10 9 206 94 0 0 0 0 0 1 0 0
26 26 183 80 0 0 1279 27 18 78 12 37
27 27 396 70 0 0 31597 275 0 6 0 5
9 9 341 71 0 0 32629 274 4 43 0 2
### RECORD 3 >>> cag-dl380-01 <<< (1176471932.010) (Fri Apr 13 09:45:32 2007) ###
# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# User Nice Sys Wait IRQ Soft Steal Idle Intr Ctxsw Proc RunQ Run Avg1 Avg5 Avg15
0 0 0 0 0 0 0 99 1070 206 4 246 0 0.10 0.03 0.01
# DISK SUMMARY (/sec)
#KBRead RMerged Reads SizeKB KBWrite WMerged Writes SizeKB
0 0 0 0 0 0 0 0
# NETWORK SUMMARY (/sec)
# KBIn PktIn SizeIn MultI CmpI ErrIn KBOut PktOut SizeO CmpO ErrOut
4 32 133 0 0 0 28 31 926 0 0
Brief (first block above): collectl
Verbose (second block above, from RECORD 3): collectl --verbose
Detail Mode Format
# SINGLE CPU STATISTICS
# Cpu User Nice Sys Wait IRQ Soft Steal Idle
0 2 0 0 0 0 0 0 98
1 0 0 0 0 0 0 0 100
2 0 0 0 0 0 0 0 100
3 2 0 0 0 0 0 0 98
# DISK STATISTICS (/sec)
# <---------reads---------><---------writes---------><--------averages--------> Pct
#Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util
sda 0 0 0 0 0 0 0 0 0 0 0 0 0
sdb 0 0 0 0 0 0 0 0 0 0 0 0 0
sdc 0 0 0 0 0 0 0 0 0 0 0 0 0
sdd 0 0 0 0 0 0 0 0 0 0 0 0 0
hda 0 0 0 0 0 0 0 0 0 0 0 0 0
# NETWORK STATISTICS (/sec)
#Num Name KBIn PktIn SizeIn MultI CmpI ErrIn KBOut PktOut SizeO CmpO ErrOut
0 lo: 0 0 0 0 0 0 0 0 0 0 0
1 eth0: 1 11 144 0 0 0 1 7 257 0 0
2 eth1: 0 1 64 0 0 0 0 1 64 0 0
3 sit0: 0 0 0 0 0 0 0 0 0 0 0
Command: collectl -sCDN (uppercase subsystem letters select per-device detail)
Let’s talk about Swift/Glance monitoring
• The real question is what metrics do they expose?
– Track GETs, PUTs, etc
– Include object sizes and timings
– Also provides error codes/text
• Tail swift logs and write rolling counters every second
– Operation types
– Object size and network bandwidth histograms, though b/w can be misleading
• Also generate hourly/daily summaries, retaining 1 week’s worth
• Same utility also knows how to parse glance logs
– Separates tracking of metadata operations
cat /var/log/perf/ops/opscount.txt
get: 29452211 put: 84775192 del: 12433208 post: 65666 pat: 0 head: 28489510 e4xx: 174473 e500: 4774
get 25679667 547036 58147 28824 234 8839362141 23961566 717292 124028 95240 69479 35804 12355 5796 6814 8794
put 49048442 489078 123922 41085 715 12633635902 36213403 120714 10620 3815 4293 3666 6570 4030 188 0
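As a minimal sketch of how the labeled totals line in opscount.txt could be consumed, the code below parses its `name: value` pairs and differences two successive reads to get per-interval counts. The function names are hypothetical, and the unlabeled `get`/`put` rows are deliberately left unparsed because the slides do not define their columns.

```python
import re


def parse_opscount_totals(line: str) -> dict:
    """Parse 'name: value' pairs from the totals line into a dict of ints."""
    return {name: int(value) for name, value in re.findall(r"(\w+):\s*(\d+)", line)}


def per_interval(previous: dict, current: dict) -> dict:
    """Counters are cumulative, so the activity in an interval is the difference."""
    return {name: current[name] - previous.get(name, 0) for name in current}


if __name__ == "__main__":
    # First line of the opscount.txt sample shown on the slide.
    line = ("get: 29452211 put: 84775192 del: 12433208 post: 65666 "
            "pat: 0 head: 28489510 e4xx: 174473 e500: 4774")
    totals = parse_opscount_totals(line)
    print(totals["get"], totals["put"], totals["e4xx"] + totals["e500"])

    # Hypothetical next read one second later, with three more GETs.
    later = dict(totals, get=totals["get"] + 3)
    print(per_interval(totals, later)["get"])
```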
Collectl plugins
• Use collectl’s import API
– Read the opscount file every monitoring interval, logging to disk
– Can also be used interactively
– Most importantly, supported by collectl utilities
• Use collectl’s export API to write to local file every minute
– Aligned to top-of-minute to avoid RRD messing with the data
• Use collectd’s putval capability to upload when the file changes
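The two ideas on this slide, aligning writes to the top of the minute and handing values to collectd with PUTVAL, can be sketched as follows. This is an assumption-laden illustration rather than the actual HP export plugin: the plugin name `swiftops` and the use of the `gauge` type are hypothetical, and only the general `PUTVAL "host/plugin/type-instance" interval=N time:value` shape of collectd's plain-text protocol is relied on.

```python
import math


def next_top_of_minute(now: float) -> int:
    """Epoch second of the first minute boundary at or after `now`."""
    return int(math.ceil(now / 60.0) * 60)


def putval_line(host: str, plugin: str, name: str,
                interval: int, timestamp: int, value: float) -> str:
    """Format one value in collectd's plain-text PUTVAL syntax.

    The identifier uses the generic 'gauge' type with `name` as the type
    instance; a real deployment might use a different type from types.db.
    """
    identifier = f"{host}/{plugin}/gauge-{name}"
    return f'PUTVAL "{identifier}" interval={interval} {timestamp}:{value}'


if __name__ == "__main__":
    # Hypothetical sample taken at epoch second 61.2, reported at the next boundary.
    stamp = next_top_of_minute(61.2)
    print(putval_line("node1", "swiftops", "gets", 60, stamp, 7))
```

Stamping every value on an exact minute boundary means a round-robin database with a 60-second step has no reason to interpolate between neighbouring slots, which is presumably the "messing with the data" the slide refers to.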
Here’s what we can see locally
# <----CPU[HYPER]-----><----------Network----------><-------Ops-------->
#Time cpu sys inter ctxsw KBIn PktIn KBOut PktOut GKBs PKBs Ops Errs
15:18:16 0 0 214 131 28 53 77 90 0 64 1 0
15:18:17 0 0 556 505 57 171 39 175 0 22 2 0
15:18:18 0 0 390 13776 143 223 614 522 0 0 0 0
15:18:19 0 0 292 12881 3 31 3 27 0 86 7 0
Note: you can mix and match with any standard collectl data in brief mode
# <--------------------operations-------------------->
#Time GetKB PutKB Gets Puts Dels Post Pats Head E4xx E500
15:20:36 0 0 1 2 1 0 0 0 0 0
15:20:37 0 64 1 3 1 0 0 0 0 0
15:20:38 0 0 0 0 0 0 0 0 0 0
15:20:39 0 0 2 0 1 0 0 0 0 0
Lots more detail in verbose mode
# <----------------- network gets up to-----------------><----------------- network puts up to---
# 0MB 10MB 20MB 30MB 40MB 50MB 60MB 70MB 80MB 90MB 100M 0MB 10MB 20MB 30MB 40MB 50MB 60MB 70MB
0 0 0 0 0 0 0 0 0 0 0 64 2 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# OPS SUMMARY (/sec)
# <----------gets---------><----------puts--------->
# 0MB 1MB 10MB 100M 1GB 0MB 1MB 10MB 100M 1GB
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 2 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0
What about KVM?
• Uses several collectl plugins
– Collectl tells us the command line used to start each process
• Parses line for instance ID and mac address
• Another collectl plugin tells us the mac -> vnet name mapping
– Collectl also tracks I/O for each process
– Another plugin monitors our block storage service
– Can use nova-manage to look up user info by instance
• This data is also sent to collectd once per minute
segerm@nv-aw2az1-compute0004:~$ sudo collectl --import vnet:bockc --export kvmsum
# PROCESS SUMMARY (counters are /sec)
# PID THRD S SysT UsrT Pct N AccumTim BckI BckO DskI DskO NetI NetO Instance UserID BockServer(s)
13273 6 S 0.01 0.05 1 4 15:39:28 0 0 0 12 0 0 000d5cc3 31689020408812
18093 14 S 0.02 0.26 7 4 41:15:00 0 0 0 0 0 0 000e33d7 29387151913164
19517 1 S 0.00 0.01 1 1 01:39:54 0 0 0 0 0 0 000e23e1 84575604886783
30287 1 S 0.00 0.00 0 1 05:30:11 0 0 0 0 0 0 000bf753 22420103441357 10.8.14.129
30739 2 S 0.00 0.00 0 2 12:18:42 0 0 0 0 0 0 00061147 30248174159870
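The step "parses line for instance ID and mac address" might look something like the sketch below. The command-line shape is an assumption (a `-name instance-<hex>` argument and `mac=<address>` inside a device option); the slides do not show a real command line, and the sample MAC address is made up.

```python
import re

# Assumed patterns; the real plugin may match different arguments.
INSTANCE_RE = re.compile(r"-name\s+instance-([0-9a-fA-F]+)")
MAC_RE = re.compile(r"mac=((?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})")


def parse_kvm_cmdline(cmdline: str) -> dict:
    """Extract the instance ID and any MAC addresses from a KVM command line."""
    instance = INSTANCE_RE.search(cmdline)
    return {
        "instance": instance.group(1) if instance else None,
        "macs": MAC_RE.findall(cmdline),
    }


if __name__ == "__main__":
    # Hypothetical command line, shaped like a typical libvirt-launched guest.
    cmdline = ("/usr/bin/kvm -name instance-000d5cc3 -m 2048 "
               "-device virtio-net-pci,netdev=hostnet0,mac=fa:16:3e:12:34:56")
    print(parse_kvm_cmdline(cmdline))
```

With the MAC in hand, a second lookup (the mac -> vnet mapping plugin) would tie each process to the network counters of its virtual interface.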
Using the cloud to monitor the cloud
• Each morning slightly after midnight
– Ask each node to generate a summary of yesterday
– Do parallel copy to pull back to central node
– Also do parallel copy of plottable data
– Generate a set of 24-hour plots in batch; slow but worth it
• Investigating parallelizing some of this too
Currently, a very crude prototype
Daily numbers
Error Counts and Bandwidth Too
Hyperlinks to exact error text
Can even get operations by node by hour
colplot
• Web-based plotting tool
• If collectl can collect it, colplot can plot it
Links to plots…
Operations and PUT histograms
Note – the sample sizes are 1 second and plots only 10KB
Collectl Multiplexor: colmux
• Think of it as “top” for anything collectl can collect
– Including any plugins
– Runs collectl in real-time against set of nodes
– Sorts by any column and can dynamically change with arrow keys
– 2 different output formats
• Can also play back historical data for diagnostic analysis
# OPS SUMMARY (/sec) Fri Mar 22 15:45:05 2013 Connected: 14 of 14
# <--------------------operations-------------------->
#Host GetKB PutKB Gets Puts Dels Post Pats Head E4xx E500
sw-aw2az1-proxy014 0 224 0 4 0 0 0 0 0 0
sw-aw2az1-proxy010 0 166 0 3 0 0 0 0 0 0
sw-aw2az1-proxy009 0 100 0 2 0 0 0 0 0 0
sw-aw2az1-proxy005 0 77 0 2 0 0 0 0 0 0
Time 001 003 004 005 007 008 | 001 003 004 005 007 008 | Get Put
16:02:45 0 0 0 1 0 0 | 0 23 0 0 0 0 | 1 23
16:02:46 0 0 0 0 0 0 | 0 18 0 0 0 0 | 0 18
16:02:47 0 0 0 0 0 0 | 0 23 3 0 3 1 | 0 30
16:02:48 0 0 0 0 0 0 | 1 19 15 2 0 0 | 0 37
16:02:49 0 0 0 0 0 0 | 1 10 19 0 2 1 | 0 33
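The "sort by any column" behaviour of the first output format can be imitated in a few lines. This sketch is not colmux itself: it only parses rows shaped like the per-host table above (host name followed by ten integer columns) and ranks them by a chosen column; the function names are hypothetical.

```python
COLUMNS = ["Host", "GetKB", "PutKB", "Gets", "Puts", "Dels",
           "Post", "Pats", "Head", "E4xx", "E500"]

# Rows taken from the per-host table on the slide, deliberately shuffled.
SAMPLE = """\
sw-aw2az1-proxy009 0 100 0 2 0 0 0 0 0 0
sw-aw2az1-proxy014 0 224 0 4 0 0 0 0 0 0
sw-aw2az1-proxy005 0 77 0 2 0 0 0 0 0 0
sw-aw2az1-proxy010 0 166 0 3 0 0 0 0 0 0
"""


def parse_rows(text: str) -> list:
    """Turn whitespace-separated host rows into dicts, skipping '#' header lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if line.startswith("#") or len(fields) != len(COLUMNS):
            continue
        row = {"Host": fields[0]}
        row.update({name: int(v) for name, v in zip(COLUMNS[1:], fields[1:])})
        rows.append(row)
    return rows


def top_by(rows: list, column: str, limit=None) -> list:
    """Rank rows by one column, busiest first, like a top-style display."""
    ranked = sorted(rows, key=lambda row: row[column], reverse=True)
    return ranked if limit is None else ranked[:limit]


if __name__ == "__main__":
    for row in top_by(parse_rows(SAMPLE), "PutKB"):
        print(f'{row["Host"]:<20} {row["PutKB"]:>6}')
```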
Monitoring 192 nodes
[Figure: per-node plots, with regions labeled Idle Nodes, CPU Burst, Very Busy, and Erratic]
You don’t even have to be able to read the output to see what’s happening
Questions?

More Related Content

What's hot

LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFBrendan Gregg
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFBrendan Gregg
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedBrendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
Full PPT Stack
Full PPT StackFull PPT Stack
Full PPT StackWendi Sapp
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFBrendan Gregg
 
LISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF ExporterLISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF ExporterIvan Babrou
 
Debugging linux issues with eBPF
Debugging linux issues with eBPFDebugging linux issues with eBPF
Debugging linux issues with eBPFIvan Babrou
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsBrendan Gregg
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFBrendan Gregg
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPFAlex Maestretti
 
Percona Live UK 2014 Part III
Percona Live UK 2014  Part IIIPercona Live UK 2014  Part III
Percona Live UK 2014 Part IIIAlkin Tezuysal
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
 
bcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesbcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesIO Visor Project
 
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens AxboeKernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens AxboeAnne Nicolas
 

What's hot (20)

LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPF
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
Full PPT Stack
Full PPT StackFull PPT Stack
Full PPT Stack
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
 
LISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF ExporterLISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
 
Debugging linux issues with eBPF
Debugging linux issues with eBPFDebugging linux issues with eBPF
Debugging linux issues with eBPF
 
HPC Examples
HPC ExamplesHPC Examples
HPC Examples
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPF
 
ZFSperftools2012
ZFSperftools2012ZFSperftools2012
ZFSperftools2012
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
 
Percona Live UK 2014 Part III
Percona Live UK 2014  Part IIIPercona Live UK 2014  Part III
Percona Live UK 2014 Part III
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
bcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesbcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challenges
 
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens AxboeKernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 

Viewers also liked

Viewers also liked (8)

Vmworld 2005-sln241
Vmworld 2005-sln241Vmworld 2005-sln241
Vmworld 2005-sln241
 
Lois Soaxe
Lois SoaxeLois Soaxe
Lois Soaxe
 
A world beneath the waves
A world beneath the wavesA world beneath the waves
A world beneath the waves
 
A world beneath the waves
A world beneath the wavesA world beneath the waves
A world beneath the waves
 
Verigraph
VerigraphVerigraph
Verigraph
 
A world beneath the waves
A world beneath the wavesA world beneath the waves
A world beneath the waves
 
Re-Engineering Engineering
Re-Engineering EngineeringRe-Engineering Engineering
Re-Engineering Engineering
 
Sl sulinetwork
Sl sulinetworkSl sulinetwork
Sl sulinetwork
 

Similar to Fine grained monitoring

Linux 系統管理與安全:進階系統管理系統防駭與資訊安全
Linux 系統管理與安全:進階系統管理系統防駭與資訊安全Linux 系統管理與安全:進階系統管理系統防駭與資訊安全
Linux 系統管理與安全:進階系統管理系統防駭與資訊安全維泰 蔡
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016Brendan Gregg
 
MeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisBrendan Gregg
 
Oracle Basics and Architecture
Oracle Basics and ArchitectureOracle Basics and Architecture
Oracle Basics and ArchitectureSidney Chen
 
Linux Performance Tools
Linux Performance ToolsLinux Performance Tools
Linux Performance ToolsBrendan Gregg
 
Analyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE MethodAnalyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE MethodBrendan Gregg
 
IO_Analysis_with_SAR.ppt
IO_Analysis_with_SAR.pptIO_Analysis_with_SAR.ppt
IO_Analysis_with_SAR.pptcookie1969
 
OSDC 2015: Georg Schönberger | Linux Performance Profiling and Monitoring
OSDC 2015: Georg Schönberger | Linux Performance Profiling and MonitoringOSDC 2015: Georg Schönberger | Linux Performance Profiling and Monitoring
OSDC 2015: Georg Schönberger | Linux Performance Profiling and MonitoringNETWAYS
 
Linux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringLinux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringGeorg Schönberger
 
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Anne Nicolas
 
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...aaajjj4
 
OpenZFS data-driven performance
OpenZFS data-driven performanceOpenZFS data-driven performance
OpenZFS data-driven performanceahl0003
 
Reverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemReverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemCyber Security Alliance
 
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoringOSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoringNETWAYS
 
Boost UDP Transaction Performance
Boost UDP Transaction PerformanceBoost UDP Transaction Performance
Boost UDP Transaction PerformanceLF Events
 
MySQL Cluster 7.3 Performance Tuning - Severalnines Slides
MySQL Cluster 7.3 Performance Tuning - Severalnines SlidesMySQL Cluster 7.3 Performance Tuning - Severalnines Slides
MySQL Cluster 7.3 Performance Tuning - Severalnines SlidesSeveralnines
 
Is your SQL Exadata-aware?
Is your SQL Exadata-aware?Is your SQL Exadata-aware?
Is your SQL Exadata-aware?Mauro Pagano
 
BRKRST-3066 - Troubleshooting Nexus 7000 (2013 Melbourne) - 2 Hours.pdf
BRKRST-3066 - Troubleshooting Nexus 7000 (2013 Melbourne) - 2 Hours.pdfBRKRST-3066 - Troubleshooting Nexus 7000 (2013 Melbourne) - 2 Hours.pdf
BRKRST-3066 - Troubleshooting Nexus 7000 (2013 Melbourne) - 2 Hours.pdfaaajjj4
 

Similar to Fine grained monitoring (20)

Linux 系統管理與安全:進階系統管理系統防駭與資訊安全
Linux 系統管理與安全:進階系統管理系統防駭與資訊安全Linux 系統管理與安全:進階系統管理系統防駭與資訊安全
Linux 系統管理與安全:進階系統管理系統防駭與資訊安全
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
MeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance Analysis
 
Oracle Basics and Architecture
Oracle Basics and ArchitectureOracle Basics and Architecture
Oracle Basics and Architecture
 
Linux Performance Tools
Linux Performance ToolsLinux Performance Tools
Linux Performance Tools
 
Analyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE MethodAnalyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE Method
 
IO_Analysis_with_SAR.ppt
IO_Analysis_with_SAR.pptIO_Analysis_with_SAR.ppt
IO_Analysis_with_SAR.ppt
 
OSDC 2015: Georg Schönberger | Linux Performance Profiling and Monitoring
OSDC 2015: Georg Schönberger | Linux Performance Profiling and MonitoringOSDC 2015: Georg Schönberger | Linux Performance Profiling and Monitoring
OSDC 2015: Georg Schönberger | Linux Performance Profiling and Monitoring
 
Linux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringLinux Performance Profiling and Monitoring
Linux Performance Profiling and Monitoring
 
C&C Botnet Factory
C&C Botnet FactoryC&C Botnet Factory
C&C Botnet Factory
 
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
 
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
BRKDCT-3144 - Advanced - Troubleshooting Cisco Nexus 7000 Series Switches (20...
 
OpenZFS data-driven performance
OpenZFS data-driven performanceOpenZFS data-driven performance
OpenZFS data-driven performance
 
Reverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemReverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande Modem
 
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoringOSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
 
Boost UDP Transaction Performance
Boost UDP Transaction PerformanceBoost UDP Transaction Performance
Boost UDP Transaction Performance
 
PoC Oracle Exadata - Retour d'expérience
PoC Oracle Exadata - Retour d'expériencePoC Oracle Exadata - Retour d'expérience
PoC Oracle Exadata - Retour d'expérience
 
MySQL Cluster 7.3 Performance Tuning - Severalnines Slides
MySQL Cluster 7.3 Performance Tuning - Severalnines SlidesMySQL Cluster 7.3 Performance Tuning - Severalnines Slides
MySQL Cluster 7.3 Performance Tuning - Severalnines Slides
 
Is your SQL Exadata-aware?
Is your SQL Exadata-aware?Is your SQL Exadata-aware?
Is your SQL Exadata-aware?
 
BRKRST-3066 - Troubleshooting Nexus 7000 (2013 Melbourne) - 2 Hours.pdf
BRKRST-3066 - Troubleshooting Nexus 7000 (2013 Melbourne) - 2 Hours.pdfBRKRST-3066 - Troubleshooting Nexus 7000 (2013 Melbourne) - 2 Hours.pdf
BRKRST-3066 - Troubleshooting Nexus 7000 (2013 Melbourne) - 2 Hours.pdf
 

More from Iben Rodriguez

Ipv6 test plan for opnfv poc v2.2 spirent-vctlab
Ipv6 test plan for opnfv poc v2.2 spirent-vctlabIpv6 test plan for opnfv poc v2.2 spirent-vctlab
Ipv6 test plan for opnfv poc v2.2 spirent-vctlabIben Rodriguez
 
CENIC Conference agenda 2017_v1
CENIC Conference agenda 2017_v1CENIC Conference agenda 2017_v1
CENIC Conference agenda 2017_v1Iben Rodriguez
 
Incident Handling in a BYOD Environment
Incident Handling in a BYOD EnvironmentIncident Handling in a BYOD Environment
Incident Handling in a BYOD EnvironmentIben Rodriguez
 
New Threats, New Approaches in Modern Data Centers
New Threats, New Approaches in Modern Data CentersNew Threats, New Approaches in Modern Data Centers
New Threats, New Approaches in Modern Data CentersIben Rodriguez
 
Iben from Spirent talks at the SDN World Congress about the importance of and...
Iben from Spirent talks at the SDN World Congress about the importance of and...Iben from Spirent talks at the SDN World Congress about the importance of and...
Iben from Spirent talks at the SDN World Congress about the importance of and...Iben Rodriguez
 

More from Iben Rodriguez (6)

Ipv6 test plan for opnfv poc v2.2 spirent-vctlab
Ipv6 test plan for opnfv poc v2.2 spirent-vctlabIpv6 test plan for opnfv poc v2.2 spirent-vctlab
Ipv6 test plan for opnfv poc v2.2 spirent-vctlab
 
CENIC Conference agenda 2017_v1
CENIC Conference agenda 2017_v1CENIC Conference agenda 2017_v1
CENIC Conference agenda 2017_v1
 
Incident Handling in a BYOD Environment
Incident Handling in a BYOD EnvironmentIncident Handling in a BYOD Environment
Incident Handling in a BYOD Environment
 
New Threats, New Approaches in Modern Data Centers
New Threats, New Approaches in Modern Data CentersNew Threats, New Approaches in Modern Data Centers
New Threats, New Approaches in Modern Data Centers
 
Iben from Spirent talks at the SDN World Congress about the importance of and...
Iben from Spirent talks at the SDN World Congress about the importance of and...Iben from Spirent talks at the SDN World Congress about the importance of and...
Iben from Spirent talks at the SDN World Congress about the importance of and...
 
Getput suite
Getput suiteGetput suite
Getput suite
 

Recently uploaded

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Fine grained monitoring

  • 1. Fine-grained Monitoring at HP
      Mark Seger, Hewlett Packard Cloud Services
      4/19/2013
  • 2. Agenda
      • What is the problem we’re trying to solve
      • Introduction to collectl
      • Monitoring Swift & Glance
      • Monitoring VMs
  • 3. Conflicting Problem Statements
      • (Ks of nodes) X (hundreds of metrics)
          – And you want to centrally monitor them how often?
          – And store them in a database for future mining?
          – Politely pause for laughter…
      • Reality
          – Choose a frequency on the order of a minute or more
          – Don’t collect all data at the same time
          – Don’t collect everything
      • BUT when problems arise
          – Granularity measured in minutes is too coarse
          – If samples aren’t taken at the same time, how do you correlate?
          – There’s almost never enough detailed data to provide answers
  • 4. Isn’t the solution obvious?
      • Use one central tool and another local tool
      • But this has its own problems as well
          – Will the data cross-correlate? Hopefully…
          – What about customizations?
          – Do you really want to collect the same data twice?
      • At HP we’ve chosen a hybrid model
          – Use a lightweight local data collection tool, some redundancy
              • Collect cpu, disk, net & mem every second; key processes every 5 seconds
              • Extend it to add OpenStack monitoring capabilities
              • Send subset/updates to centralized monitor every minute or more
          – Central tool: collectd. Local tool: collectl (no relation!)
  • 5. Introduction to collectl
      • Open source tool, in use for many years
      • Roots in HPC, so knows how to be efficient
          – Think of it as SAR on steroids
          – Can monitor at sub-second intervals when needed
          – Synchronizes samples across cluster to usecs for correlation
          – Can generate data in plottable formats
          – Can record data locally and send over a socket
              • And do it at different frequencies!
              • Can also write to a snapshot file which is what we’re doing
          – Has an API for extensibility
          – Several utilities for plotting and real-time cluster monitoring
  • 6. Summary Mode Formats
      Brief: collectl
      #<-------CPU--------><-----------Disks-----------><-----------Network--------->
      #cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in  netKBo pkt-out
        10   9   206     94      0      0      0      0      0      1       0       0
        26  26   183     80      0      0   1279     27     18     78      12      37
        27  27   396     70      0      0  31597    275      0      6       0       5
         9   9   341     71      0      0  32629    274      4     43       0       2

      Verbose: collectl --verbose
      ### RECORD 3 >>> cag-dl380-01 <<< (1176471932.010) (Fri Apr 13 09:45:32 2007) ###
      # CPU SUMMARY (INTR, CTXSW & PROC /sec)
      # User  Nice   Sys  Wait   IRQ  Soft Steal  Idle  Intr  Ctxsw  Proc  RunQ   Run   Avg1  Avg5 Avg15
           0     0     0     0     0     0     0    99  1070    206     4   246     0   0.10  0.03  0.01
      # DISK SUMMARY (/sec)
      #KBRead RMerged  Reads SizeKB  KBWrite WMerged Writes SizeKB
            0       0      0      0        0       0      0      0
      # NETWORK SUMMARY (/sec)
      # KBIn  PktIn SizeIn  MultI   CmpI  ErrIn  KBOut PktOut  SizeO   CmpO ErrOut
           4     32    133      0      0      0     28     31    926      0      0
  • 7. Detail Mode Format
      collectl -sCDN
      # SINGLE CPU STATISTICS
      #   Cpu  User Nice  Sys Wait IRQ  Soft Steal Idle
            0     2    0    0    0    0    0     0   98
            1     0    0    0    0    0    0     0  100
            2     0    0    0    0    0    0     0  100
            3     2    0    0    0    0    0     0   98
      # DISK STATISTICS (/sec)
      #          <---------reads---------><---------writes---------><--------averages--------> Pct
      #Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
      sda              0      0    0    0       0      0    0    0       0     0     0      0    0
      sdb              0      0    0    0       0      0    0    0       0     0     0      0    0
      sdc              0      0    0    0       0      0    0    0       0     0     0      0    0
      sdd              0      0    0    0       0      0    0    0       0     0     0      0    0
      hda              0      0    0    0       0      0    0    0       0     0     0      0    0
      # NETWORK STATISTICS (/sec)
      #Num    Name   KBIn  PktIn SizeIn  MultI   CmpI  ErrIn  KBOut PktOut  SizeO   CmpO ErrOut
         0     lo:      0      0      0      0      0      0      0      0      0      0      0
         1   eth0:      1     11    144      0      0      0      1      7    257      0      0
         2   eth1:      0      1     64      0      0      0      0      1     64      0      0
         3   sit0:      0      0      0      0      0      0      0      0      0      0      0
  • 8. Let’s talk about swift/glance monitoring
      • The real question is what metrics do they expose?
          – Track GETs, PUTs, etc
          – Include object sizes and timings
          – Also provides error codes/text
      • Tail swift logs and write rolling counters every second
          – Operation types
          – Object size and network bandwidth histograms, though b/w can be misleading
      • Also generate hourly/daily summaries, retaining 1 week’s worth
      • Same utility also knows how to parse glance logs
          – Separates tracking of metadata operations
      cat /var/log/perf/ops/opscount.txt
      get: 29452211 put: 84775192 del: 12433208 post: 65666 pat: 0 head: 28489510 e4xx: 174473 e500: 4774
      get 25679667 547036 58147 28824 234 8839362141 23961566 717292 124028 95240 69479 35804 12355 5796 6814 8794
      put 49048442 489078 123922 41085 715 12633635902 36213403 120714 10620 3815 4293 3666 6570 4030 188 0
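Because opscount.txt holds rolling (ever-increasing) counters, a reader turns them into rates by differencing two snapshots. The following is a minimal Python sketch of that idea, not the collectl plugin itself, which the slides do not show. The function names are hypothetical, and the `name: value` layout is assumed from the sample on slide 8.

```python
import re

# Matches "name: value" pairs such as "get: 29452211". The histogram rows
# ("get 25679667 ...") contain no colon, so this sketch ignores them.
COUNTER_RE = re.compile(r"(\w+):\s*(\d+)")


def parse_counters(text):
    """Return {name: value} for every 'name: value' pair found in text."""
    return {name: int(value) for name, value in COUNTER_RE.findall(text)}


def per_second_rates(previous, current, interval_secs=1.0):
    """Difference two snapshots of rolling counters into per-second rates."""
    return {
        name: (current[name] - previous[name]) / interval_secs
        for name in current
        if name in previous  # skip counters that only exist in one snapshot
    }


# Counter line copied from the slide, followed by a shortened histogram row.
sample = (
    "get: 29452211 put: 84775192 del: 12433208 post: 65666 "
    "pat: 0 head: 28489510 e4xx: 174473 e500: 4774\n"
    "get 25679667 547036 58147\n"
)
first = parse_counters(sample)
# Invented second snapshot: one second later, three more GETs and two more PUTs.
second = dict(first, get=first["get"] + 3, put=first["put"] + 2)
rates = per_second_rates(first, second)
print(rates["get"], rates["put"], rates["del"])
```

Reading counters rather than per-interval values means a missed sample loses resolution but not data: the next difference simply spans a longer interval.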
  • 9. Collectl plugins
      • Use collectl’s import API
          – Read opscount file every monitoring interval, logging to disk
          – Can also be used interactively
          – Most importantly, supported by collectl utilities
      • Use collectl’s export API to write to local file every minute
          – Aligned to top-of-minute to avoid RRD messing with the data
      • Use collectd’s putval capability to upload when file changes
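Slide 9 mentions two mechanics that are easy to show: aligning writes to the top of the minute, and handing values to collectd with PUTVAL. RRD-style stores normalize samples onto fixed step boundaries, so a sample stamped exactly on the minute is stored without interpolation. The Python sketch below illustrates both, and is not the HP export plugin. The helper names, the `swiftops` plugin name, and the choice of the `gauge` type are assumptions; only the general PUTVAL line shape (`PUTVAL identifier interval=N time:value`) comes from collectd's plain-text protocol.

```python
def floor_to_minute(timestamp):
    """Round a Unix timestamp down to the start of its minute."""
    whole = int(timestamp)
    return whole - whole % 60


def seconds_until_next_minute(now):
    """How long to sleep so the next write lands on a minute boundary."""
    return 60.0 - (now % 60.0)


def putval_line(host, plugin, type_instance, timestamp, value, interval=60):
    """Format one PUTVAL command; 'gauge' is an assumed collectd type."""
    identifier = f"{host}/{plugin}/gauge-{type_instance}"
    return f"PUTVAL {identifier} interval={interval} {timestamp}:{value}"


# Hypothetical sample: 4 PUTs/sec reported by one proxy node.
stamp = floor_to_minute(1363967105)
line = putval_line("sw-aw2az1-proxy014", "swiftops", "puts", stamp, 4)
print(line)
print(seconds_until_next_minute(125.0))
```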
  • 10. Here’s what we can see locally
      # <----CPU[HYPER]-----><----------Network----------><-------Ops-------->
      #Time     cpu sys inter  ctxsw   KBIn  PktIn  KBOut  PktOut  GKBs  PKBs   Ops  Errs
      15:18:16    0   0   214    131     28     53     77      90     0    64     1     0
      15:18:17    0   0   556    505     57    171     39     175     0    22     2     0
      15:18:18    0   0   390  13776    143    223    614     522     0     0     0     0
      15:18:19    0   0   292  12881      3     31      3      27     0    86     7     0

      Note you can mix’n match with any standard collectl data in brief mode

      #         <--------------------operations-------------------->
      #Time     GetKB PutKB  Gets  Puts  Dels  Post  Pats  Head  E4xx  E500
      15:20:36      0     0     1     2     1     0     0     0     0     0
      15:20:37      0    64     1     3     1     0     0     0     0     0
      15:20:38      0     0     0     0     0     0     0     0     0     0
      15:20:39      0     0     2     0     1     0     0     0     0     0

      Lots more detail in verbose mode

      # <----------------- network gets up to-----------------><----------------- network puts up to---
      #  0MB 10MB 20MB 30MB 40MB 50MB 60MB 70MB 80MB 90MB 100M  0MB 10MB 20MB 30MB 40MB 50MB 60MB 70MB
           0    0    0    0    0    0    0    0    0    0    0   64    2    0    0    0    0    0    0
           0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0
           0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
      # OPS SUMMARY (/sec)
      # <----------gets---------><----------puts--------->
      #  0MB  1MB 10MB 100M  1GB  0MB  1MB 10MB 100M  1GB
           0    0    0    0    0    1    0    0    0    0
           0    0    0    0    0    2    0    0    0    0
           0    0    0    0    0    1    0    0    0    0
           0    0    0    0    0    0    0    0    0    0
  • 11. What about KVM?
      • Uses several collectl plugins
          – Collectl tells us the command line used to start each process
              • Parses line for instance ID and mac address
              • Another collectl plugin tells us mac -> vnet name mapping
          – Collectl also tracks I/O for each process
          – Another to monitor our block storage service
          – Can use nova-manage to look up user info by instance
      • This data also sent to collectd once/minute
      segerm@nv-aw2az1-compute0004:~$ sudo collectl --import vnet:bockc --export kvmsum
      # PROCESS SUMMARY (counters are /sec)
      # PID  THRD S  SysT  UsrT Pct  N  AccumTim BckI BckO DskI DskO NetI NetO Instance UserID         BockServer(s)
      13273     6 S  0.01  0.05   1  4  15:39:28    0    0    0   12    0    0 000d5cc3 31689020408812
      18093    14 S  0.02  0.26   7  4  41:15:00    0    0    0    0    0    0 000e33d7 29387151913164
      19517     1 S  0.00  0.01   1  1  01:39:54    0    0    0    0    0    0 000e23e1 84575604886783
      30287     1 S  0.00  0.00   0  1  05:30:11    0    0    0    0    0    0 000bf753 22420103441357 10.8.14.129
      30739     2 S  0.00  0.00   0  2  12:18:42    0    0    0    0    0    0 00061147 30248174159870
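Slide 11 says the plugin parses each process command line for an instance ID and a MAC address. The slides do not show the command line, so the Python sketch below assumes the shape libvirt commonly generates for nova guests: a `-name instance-XXXXXXXX` argument and a `mac=` or `macaddr=` option. The sample command line, the regular expressions, and the function name are all hypothetical, not the plugin's actual parsing.

```python
import re

# Assumed patterns: nova names guests "instance-<8 hex digits>", and the NIC
# option carries the MAC as "mac=..." (or "macaddr=..." in older syntax).
INSTANCE_RE = re.compile(r"-name\s+(?:guest=)?instance-([0-9a-fA-F]{8})")
MAC_RE = re.compile(r"mac(?:addr)?=((?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})")


def parse_kvm_cmdline(cmdline):
    """Extract the instance ID and all MAC addresses from one command line."""
    match = INSTANCE_RE.search(cmdline)
    return {
        "instance": match.group(1) if match else None,
        "macs": MAC_RE.findall(cmdline),
    }


# Made-up command line; only the instance ID is taken from the slide.
cmdline = (
    "/usr/bin/kvm -name instance-000d5cc3 -m 2048 "
    "-netdev tap,fd=21,id=hostnet0 "
    "-device virtio-net-pci,netdev=hostnet0,id=net0,"
    "mac=fa:16:3e:12:34:56,bus=pci.0,addr=0x3"
)
info = parse_kvm_cmdline(cmdline)
print(info["instance"], info["macs"])
```

With the MAC in hand, a second lookup (the mac -> vnet mapping the slide mentions) can attribute per-interface network counters to the right guest.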
  • 12. Using the cloud to monitor the cloud
      • Each morning slightly after midnight
          – Ask each node to generate a summary of yesterday
          – Do parallel copy to pull back to central node
          – Also do parallel copy of plottable data
          – Generate a set of 24 hour plots in batch, slow but worth it
              • Investigating parallelizing some of this too
  • 13. Currently, a very crude prototype
      Daily numbers
  • 14. Error Counts and Bandwidth Too
  • 15. Hyperlinks to exact error text
  • 16. Can even get operations by node by hour
  • 17. colplot
      • Web-based plotting tool
      • If collectl can collect it, colplot can plot it
  • 18. Links to plots…
  • 19. Operations and PUT histograms
      Note – the sample sizes are 1 second and plots only 10KB
  • 20. Collectl Multiplexor: colmux
      • Think of collectl top-anything
          – Including any plugins
          – Runs collectl in real-time against set of nodes
          – Sorts by any column and can dynamically change with arrow keys
          – 2 different output formats
      • Can also playback historical data for diagnostic analysis
      # OPS SUMMARY (/sec) Fri Mar 22 15:45:05 2013 Connected: 14 of 14
      #                   <--------------------operations-------------------->
      #Host               GetKB PutKB  Gets  Puts  Dels  Post  Pats  Head  E4xx  E500
      sw-aw2az1-proxy014      0   224     0     4     0     0     0     0     0     0
      sw-aw2az1-proxy010      0   166     0     3     0     0     0     0     0     0
      sw-aw2az1-proxy009      0   100     0     2     0     0     0     0     0     0
      sw-aw2az1-proxy005      0    77     0     2     0     0     0     0     0     0

      Time      001 003 004 005 007 008 | 001 003 004 005 007 008 | Get Put
      16:02:45    0   0   0   1   0   0 |   0  23   0   0   0   0 |   1  23
      16:02:46    0   0   0   0   0   0 |   0  18   0   0   0   0 |   0  18
      16:02:47    0   0   0   0   0   0 |   0  23   3   0   3   1 |   0  30
      16:02:48    0   0   0   0   0   0 |   1  19  15   2   0   0 |   0  37
      16:02:49    0   0   0   0   0   0 |   1  10  19   0   2   1 |   0  33
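The "top-anything" idea on slide 20 comes down to merging one row per node and sorting on whichever column is selected. The Python sketch below illustrates that step with the four rows from the slide, fed in shuffled order. It is not colmux's implementation, and `parse_rows` and `top_by` are hypothetical names.

```python
# Column names copied from the colmux header on the slide.
HEADER = "Host GetKB PutKB Gets Puts Dels Post Pats Head E4xx E500".split()


def parse_rows(lines):
    """Turn whitespace-separated rows into dicts keyed by column name."""
    rows = []
    for line in lines:
        fields = line.split()
        row = {"Host": fields[0]}
        row.update({name: int(v) for name, v in zip(HEADER[1:], fields[1:])})
        rows.append(row)
    return rows


def top_by(rows, column, limit=None):
    """Sort rows by one numeric column, largest first, like a 'top' view."""
    return sorted(rows, key=lambda row: row[column], reverse=True)[:limit]


# Rows from the slide, deliberately out of order.
rows = parse_rows([
    "sw-aw2az1-proxy005 0 77 0 2 0 0 0 0 0 0",
    "sw-aw2az1-proxy014 0 224 0 4 0 0 0 0 0 0",
    "sw-aw2az1-proxy009 0 100 0 2 0 0 0 0 0 0",
    "sw-aw2az1-proxy010 0 166 0 3 0 0 0 0 0 0",
])
ranked = [row["Host"] for row in top_by(rows, "PutKB")]
print(ranked)
```

Changing the sort column only means passing a different name to `top_by`, which is roughly the effect the arrow keys have in the interactive tool.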
  • 21. Monitoring 192 nodes
      [Plot with regions labeled: Idle Nodes, CPU Burst, Very Busy, Erratic]
      You don’t even have to be able to read the output to see what’s happening