Ganglia Overview-v2
Vladimir Vuksan's presentation on Ganglia at the "Not Nagios" episode of The Bay Area Large-Scale Production Engineering meetup: http://www.meetup.com/SF-Bay-Area-Large-Scale-Production-Engineering/events/15481164/

Published in: Technology

Transcript

  • 1. The Ganglia Monitoring Framework
    • Vladimir Vuksan,
    • June 2011
    • http://deanspot.org/content/ganglia-references
  • 2. About this talk
    • Ganglia architecture
    • How to get metrics in.
    • How to get metrics out.
  • 3. Getting Ganglia
    • Build from source:
      • tar xzvf ganglia-3.1.7.tar.gz
      • cd ganglia-3.1.7
      • ./configure --with-gmetad
      • make
      • make install
    • Or use binary packages, e.g.:
      • Ubuntu/Debian : apt-get install ganglia-monitor gmetad
      • Fedora : yum install ganglia-gmond ganglia-gmetad
  • 4. Ganglia Architecture
    • 2 daemons: gmond & gmetad
    • gmond collects or receives metric data on each node
    • One gmetad per grid; it polls one gmond per cluster for data.
    • A node belongs to a cluster; a cluster belongs to a grid.
    • The web UI is a separate component: use it or leave it out.
  • 5.  
  • 6.  
  • 7.  
  • 8.  
  • 9.  
  • 10. Demo
  • 11. Custom Graphs
    • {
    •   "report_name" : "network_report",
    •   "report_type" : "standard",
    •   "title" : "Network",
    •   "vertical_label" : "Bytes/sec",
    •   "series" : [
    •     { "metric": "bytes_in", "color": "33cc33", "label": "In", "line_width": "2", "type": "line" },
    •     { "metric": "bytes_out", "color": "5555cc", "label": "Out", "line_width": "2", "type": "line" }
    •   ]
    • }
  • 12. Replaces this:
    • DEF:'a0'='/var/lib/ganglia/g1/rrds/c-1-2/__SummaryInfo__/bytes_in.rrd':'sum':AVERAGE
    • LINE2:'a0'#33cc33:'In '
    • VDEF:a0_last=a0,LAST
    • VDEF:a0_min=a0,MINIMUM
    • VDEF:a0_avg=a0,AVERAGE
    • VDEF:a0_max=a0,MAXIMUM
    • GPRINT:'a0_last':'Now:%5.1lf%s'
    • GPRINT:'a0_min':'Min:%5.1lf%s'
    • GPRINT:'a0_avg':'Avg:%5.1lf%s'
    • GPRINT:'a0_max':'Max:%5.1lf%s\l'
    • DEF:'a1'='/var/lib/ganglia/g1/rrds/c-1-2/__SummaryInfo__/bytes_out.rrd':'sum':AVERAGE
    • LINE2:'a1'#5555cc:'Out'
    • VDEF:a1_last=a1,LAST
    • VDEF:a1_min=a1,MINIMUM
    • VDEF:a1_avg=a1,AVERAGE
    • VDEF:a1_max=a1,MAXIMUM
    • GPRINT:'a1_last':'Now:%5.1lf%s'
    • GPRINT:'a1_min':'Min:%5.1lf%s'
    • GPRINT:'a1_avg':'Avg:%5.1lf%s'
    • GPRINT:'a1_max':'Max:%5.1lf%s\l'
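
To make the relationship between the JSON report and the rrdtool arguments concrete, here is a minimal sketch that expands a report definition into DEF, LINE and VDEF arguments like those on the slide. It is not the Ganglia web UI's actual code: the function name `report_to_rrd_args` and the way the RRD path is passed in are assumptions made for illustration, and the GPRINT lines are left out for brevity.

```python
import json

# The report definition from the "Custom Graphs" slide.
REPORT_JSON = """
{
  "report_name": "network_report",
  "report_type": "standard",
  "title": "Network",
  "vertical_label": "Bytes/sec",
  "series": [
    {"metric": "bytes_in", "color": "33cc33", "label": "In", "line_width": "2", "type": "line"},
    {"metric": "bytes_out", "color": "5555cc", "label": "Out", "line_width": "2", "type": "line"}
  ]
}
"""

# (suffix, rrdtool VDEF function) pairs matching the slide's a0_last, a0_min, ...
STATS = (("last", "LAST"), ("min", "MINIMUM"), ("avg", "AVERAGE"), ("max", "MAXIMUM"))


def report_to_rrd_args(report, rrd_dir):
    """Hypothetical helper: expand a JSON report into rrdtool graph arguments."""
    args = []
    for i, series in enumerate(report["series"]):
        var = f"a{i}"
        # One DEF per series, reading the 'sum' data source of the metric's RRD file.
        args.append(f"DEF:'{var}'='{rrd_dir}/{series['metric']}.rrd':'sum':AVERAGE")
        # One LINE per series, using the width, color and label from the JSON.
        args.append(f"LINE{series['line_width']}:'{var}'#{series['color']}:'{series['label']}'")
        for suffix, func in STATS:
            args.append(f"VDEF:{var}_{suffix}={var},{func}")
    return args


report = json.loads(REPORT_JSON)
rrd_args = report_to_rrd_args(report, "/var/lib/ganglia/g1/rrds/c-1-2/__SummaryInfo__")
for arg in rrd_args:
    print(arg)
```

Each series costs six arguments here (ten with the GPRINT lines), which is why a short JSON description is easier to maintain than the expanded form.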
  • 13. A quick word about RRD.
  • 14.
    • gmetad creates 1 RRD file for each metric.
    • default retention schedule is defined in gmetad.conf
    • Default schedule (store an average every X, for this long):
      • every 15 sec, for 60 min
      • every 6 min, for 1 day
      • every 42 min, for 1 week
      • every 168 min, for 30 days
      • every 1 day, for 1 year
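
The schedule can be derived from RRA definitions. The sketch below assumes a 15-second base step and the RRA strings as I recall them from the stock gmetad.conf of the 3.1 series; both are assumptions, not taken from the slides, so check your own gmetad.conf. The helper name `describe_rra` is hypothetical.

```python
STEP_SECONDS = 15  # assumed base step of the RRD files

# Assumed default RRAs, in rrdtool's RRA:CF:xff:steps:rows form.
DEFAULT_RRAS = [
    "RRA:AVERAGE:0.5:1:244",
    "RRA:AVERAGE:0.5:24:244",
    "RRA:AVERAGE:0.5:168:244",
    "RRA:AVERAGE:0.5:672:244",
    "RRA:AVERAGE:0.5:5760:374",
]


def describe_rra(rra, step=STEP_SECONDS):
    """Return (consolidation function, seconds per row, total seconds retained)."""
    _, cf, _xff, steps, rows = rra.split(":")
    resolution = int(steps) * step      # each row averages `steps` base intervals
    retention = resolution * int(rows)  # the archive keeps `rows` such rows
    return cf, resolution, retention


schedule = [describe_rra(rra) for rra in DEFAULT_RRAS]
for cf, resolution, retention in schedule:
    print(f"{cf}: one point every {resolution} s, kept for {retention / 3600:.1f} h")
```

Under these assumptions the five archives hold roughly 61 minutes, 1 day, 1 week, 28.5 days and 374 days of data, which lines up with the rounded figures on the slide.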
  • 15. Default schedule fits into 12K per metric
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 bytes_in.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 bytes_out.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 cpu_aidle.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 cpu_idle.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 cpu_nice.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 cpu_num.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 cpu_speed.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 cpu_system.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 cpu_user.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 cpu_wio.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 disk_free.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 disk_total.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 load_fifteen.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 load_five.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 load_one.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 mem_buffers.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 proc_run.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 proc_total.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 swap_free.rrd
    • -rw-rw-rw- 1 nobody root 12224 Jun 16 04:07 swap_total.rrd
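
Because every file has the same fixed size, disk usage is simple to budget. A rough sketch, assuming one RRD file per metric per host at the 12224 bytes shown in the listing; the host counts are made-up examples, and the cluster-level and grid-level summary RRDs are ignored.

```python
BYTES_PER_RRD = 12224  # file size from the listing above


def estimate_rrd_bytes(hosts, metrics_per_host, bytes_per_rrd=BYTES_PER_RRD):
    """Hypothetical helper: disk space for per-host RRD files only."""
    return hosts * metrics_per_host * bytes_per_rrd


per_host = estimate_rrd_bytes(1, 20)  # the 20 metrics listed above
fleet = estimate_rrd_bytes(500, 20)   # hypothetical 500-host deployment
print(f"{per_host} bytes per host, {fleet / 1024 ** 2:.1f} MiB for 500 hosts")
```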
  • 16. If you need more resolution, adjust it in gmetad.conf
    • Getting started with RRD: http://oss.oetiker.ch/rrdtool/tut/rrd-beginners.en.html
    • More help defining RRD files: http://www.cuddletech.com/articles/rrd/ar01s02.html
    • A little more about how RRD works with Ganglia: http://vuksan.com/blog/2010/12/14/misconceptions-about-rrd-storage/
  • 17. Getting data in
    • Via gmond modules, written in C or Python.
    • Via gmetric or libraries that implement the gmetric protocol.
    • Via other daemons designed to feed metrics to ganglia (e.g. sFlow)
  • 18. Zero configuration.
    • Just start sending new metrics.
    • gmetad will create a new RRD file for any new metric it sees.
    • The web UI will draw a basic graph for every metric.
    • You can create nice colored graphs later if you want them.
  • 19. gmond module
    • modules {
    • module {
    • name = "net_module"
    • path = "modnet.so" }
    • }
    • collection_group {
    • collect_every = 40
    • time_threshold = 300
    • metric {
    • name = "bytes_out"
    • value_threshold = 4096
    • title = "Bytes Sent"
    • }
    • metric {
    • name = "bytes_in"
    • value_threshold = 4096
    • title = "Bytes Received"
    • }
    • }
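
The configuration above loads a C module. For the Python route mentioned earlier, a module is an ordinary Python file that, as far as I understand gmond's Python module interface, exposes `metric_init(params)` returning a list of metric descriptors and a `metric_cleanup()` function. The metric itself (`dir_entries`, counting entries in a directory) and the `path` parameter are invented for this sketch.

```python
import os

# Hypothetical module parameter; gmond would pass overrides to metric_init().
_params = {"path": "/tmp"}


def count_entries(name):
    """Callback: gmond calls this with the metric name and expects the value back."""
    return len(os.listdir(_params["path"]))


def metric_init(params):
    """Called once at load time; returns the descriptors of the metrics we offer."""
    _params.update(params)
    return [{
        "name": "dir_entries",        # hypothetical metric name
        "call_back": count_entries,
        "time_max": 90,
        "value_type": "uint",
        "units": "entries",
        "slope": "both",
        "format": "%u",
        "description": "Number of entries in a directory",
        "groups": "example",
    }]


def metric_cleanup():
    """Called when gmond shuts down; nothing to release in this sketch."""
    pass


if __name__ == "__main__":
    # Stand-alone smoke test, outside gmond.
    for descriptor in metric_init({"path": "."}):
        print(descriptor["name"], descriptor["call_back"](descriptor["name"]))
```

Running the file directly is a convenient way to debug a callback before wiring the module into gmond.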
  • 20. gmetric
    • $ gmetric -c /etc/ganglia/gmond.conf --name=foo --value=512 --units=foos --type=uint16 --dmax=60
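
From a script, the simplest way to submit a metric is to shell out to the gmetric binary. The sketch below only uses the flags shown on the slide; the helper names `gmetric_argv` and `send_metric` are hypothetical, and the call is skipped when gmetric is not on the PATH.

```python
import shutil
import subprocess


def gmetric_argv(name, value, mtype, units="", dmax=0, conf="/etc/ganglia/gmond.conf"):
    """Build the gmetric command line as a list (no shell quoting needed)."""
    argv = ["gmetric", "-c", conf, f"--name={name}", f"--value={value}", f"--type={mtype}"]
    if units:
        argv.append(f"--units={units}")
    if dmax:
        argv.append(f"--dmax={dmax}")  # seconds after which a stale metric is dropped
    return argv


def send_metric(**kwargs):
    """Run gmetric if it is installed; return its exit code, or None if missing."""
    argv = gmetric_argv(**kwargs)
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=False).returncode


# uint16 is used here because the value 512 does not fit in a uint8.
argv = gmetric_argv("foo", 512, "uint16", units="foos", dmax=60)
print(" ".join(argv))
```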
  • 21. What kind of metrics can I collect?
    • Load time of your home page?
    • Number of active trouble tickets?
    • LOC in your application?
    • rcov coverage statistics?
    • Execution time of your test suite?
    • Number of user logins?
    • Memory usage by a particular process?
    • Many other metric plugins available at http://github.com/ganglia
  • 22.  
  • 23. Log parsing apps
    • Eat log files and make metrics
    • http://vuksan.com/linux/ganglia/#Apache_Traffic_Stats
    • ganglia-logtailer https://bitbucket.org/maplebed/ganglia-logtailer
    • logster https://github.com/etsy/logster
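
The idea behind these tools can be sketched in a few lines: read log lines, extract a field, and turn the counts into metric values. This is not the API of ganglia-logtailer or logster; the regular expression assumes Apache-style access log lines and the metric names (`http_2xx` and so on) are invented.

```python
import re
from collections import Counter

# Matches the quoted request followed by a three-digit status code.
LOG_LINE = re.compile(r'"\S+ \S+ \S+" (?P<status>\d{3}) ')


def status_counts(lines):
    """Count requests per HTTP status class; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[f"http_{match.group('status')[0]}xx"] += 1
    return counts


SAMPLE_LOG = [
    '10.0.0.1 - - [16/Jun/2011:04:07:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [16/Jun/2011:04:07:01 +0000] "GET /about HTTP/1.1" 200 1024',
    '10.0.0.3 - - [16/Jun/2011:04:07:02 +0000] "GET /missing HTTP/1.1" 404 209',
    '10.0.0.4 - - [16/Jun/2011:04:07:03 +0000] "POST /login HTTP/1.1" 500 0',
    'not an access log line',
]

counts = status_counts(SAMPLE_LOG)
for metric, value in sorted(counts.items()):
    print(metric, value)
```

Each resulting count could then be submitted with gmetric on a fixed schedule.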
  • 24. But I thought Ganglia is only useful for host metrics?
    • No
    • You can create “non-existent” hosts by spoofing
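
One way to do this is gmetric's `--spoof` option, which takes an `IP:hostname` pair and attributes the metric to that host instead of the sender. The sketch below only builds the command line; the IP address, host name and metric are placeholders, and the helper name is hypothetical.

```python
def spoofed_gmetric_argv(ip, hostname, name, value, mtype):
    """Build a gmetric command that reports a metric on behalf of another host."""
    if ":" in ip or ":" in hostname:
        # The spoof string is colon-separated, so neither part may contain one.
        raise ValueError("ip and hostname must not contain ':'")
    return [
        "gmetric",
        f"--spoof={ip}:{hostname}",
        f"--name={name}",
        f"--value={value}",
        f"--type={mtype}",
    ]


# Hypothetical example: home page load time filed under a pseudo-host.
spoof_argv = spoofed_gmetric_argv("10.0.0.99", "www.example.com", "homepage_load_time", 0.42, "float")
print(" ".join(spoof_argv))
```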
  • 25. Custom Metric Demo
  • 26. Integrating Ganglia & Nagios
    • $ check_ganglia_metric.py --gmetad_host=gmetad-server.example.com --metric_host=host.example.com --metric_name=cpu_idle --critical=99
    • Status Critical, CPU Idle = 99.6 %|cpu_idle=99.6%;;99;;
    • https://github.com/mconigliaro/check_ganglia_metric
    • http://vuksan.com/blog/2011/04/19/use-your-trending-data-for-alerting/
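
A Nagios check of this kind boils down to comparing a metric value against thresholds and reporting through the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus a line of performance data. The sketch below illustrates that logic only; it is not the code of check_ganglia_metric, and it uses a plain "greater than or equal" comparison rather than full Nagios range syntax.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_metric(name, value, units, warning=None, critical=None):
    """Return (exit_code, status_line) for a metric value and optional thresholds."""
    if value is None:
        return UNKNOWN, f"Status Unknown, {name} not found"
    if critical is not None and value >= critical:
        code, label = CRITICAL, "Critical"
    elif warning is not None and value >= warning:
        code, label = WARNING, "Warning"
    else:
        code, label = OK, "OK"
    warn_s = "" if warning is None else f"{warning:g}"
    crit_s = "" if critical is None else f"{critical:g}"
    # Perfdata layout: label=value[UOM];warn;crit;min;max
    perfdata = f"{name}={value:g}{units};{warn_s};{crit_s};;"
    return code, f"Status {label}, {name} = {value:g} {units}|{perfdata}"


code, line = check_metric("cpu_idle", 99.6, "%", critical=99)
print(code, line)
```

A real plugin would fetch the value from gmetad first and then exit with the returned code.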
  • 27. Getting Data Out
    • Web UI
      • JSON & CSV export
    • gmond XML
    • gmetad XML
    • gmetad interactive
  • 28. gmond
    • $ telnet localhost 8649
    • <GANGLIA_XML VERSION="3.1.2" SOURCE="gmond">
    • <CLUSTER NAME="Debug" LOCALTIME="1258504343" OWNER="Alex">
    • <HOST NAME="10.0.3.128" IP="10.0.3.128" REPORTED="1258504325"
    • TN="18" TMAX="20" DMAX="0">
    • <METRIC NAME="load_five" VAL="0.15" TYPE="float" UNITS=" "
    • TN="30" TMAX="325" DMAX="0" SLOPE="both">
    • <EXTRA_DATA>
    • <EXTRA_ELEMENT NAME="GROUP" VAL="load"/>
    • <EXTRA_ELEMENT NAME="DESC" VAL="Five minute load average"/>
    • <EXTRA_ELEMENT NAME="TITLE" VAL="Five Minute Load Average"/>
    • </EXTRA_DATA>
    • </METRIC>
    • </HOST>
    • </CLUSTER>
    • </GANGLIA_XML>
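
The same dump can be read programmatically: connect to the port, read until gmond closes the connection, and parse the XML. A minimal sketch using only the standard library; the helper names are hypothetical, and the sample document is a trimmed copy of the output above so the parsing step can be run without a live gmond.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host="localhost", port=8649, timeout=5.0):
    """Read the full XML dump from a gmond (or gmetad) port as bytes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the daemon closes the connection when the dump is done
                break
            chunks.append(data)
    return b"".join(chunks)


def metrics_by_host(xml_data):
    """Map host name -> {metric name: value string} from a Ganglia XML dump."""
    root = ET.fromstring(xml_data)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
    return result


SAMPLE = """<GANGLIA_XML VERSION="3.1.2" SOURCE="gmond">
<CLUSTER NAME="Debug" LOCALTIME="1258504343" OWNER="Alex">
<HOST NAME="10.0.3.128" IP="10.0.3.128" REPORTED="1258504325" TN="18" TMAX="20" DMAX="0">
<METRIC NAME="load_five" VAL="0.15" TYPE="float" UNITS=" " TN="30" TMAX="325" DMAX="0" SLOPE="both"/>
</HOST>
</CLUSTER>
</GANGLIA_XML>"""

parsed = metrics_by_host(SAMPLE)
print(parsed)
```

Against a running daemon, `metrics_by_host(fetch_xml())` would return the same structure for every host the daemon knows about.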
  • 29. gmond
  • 30. gmetad (non-interactive)
    • $ telnet localhost 8651
    • <GANGLIA_XML VERSION="3.1.2" SOURCE="gmetad">
    • <GRID NAME="unspecified" AUTHORITY="http://localhost/~alex/ganglia/"
    • LOCALTIME="1258504287">
    • <CLUSTER NAME="Debug" LOCALTIME="1258504282" OWNER="Alex"
    • LATLONG="unspecified" URL="unspecified">
    • <HOST NAME="10.0.3.128" IP="10.0.3.128" REPORTED="1258504265"
    • TN="21" TMAX="20" DMAX="0" LOCATION="unspecified"
    • GMOND_STARTED="1258477064">
    • <METRIC NAME="load_five" VAL="0.15" TYPE="float" UNITS=" "
    • TN="30" TMAX="325" DMAX="0" SLOPE="both">
    • <EXTRA_DATA>
    • <EXTRA_ELEMENT NAME="GROUP" VAL="load"/>
    • <EXTRA_ELEMENT NAME="DESC" VAL="Five minute load average"/>
    • <EXTRA_ELEMENT NAME="TITLE" VAL="Five Minute Load Average"/>
    • </EXTRA_DATA>
    • </METRIC>
    • </HOST>
    • </CLUSTER>
    • </GRID>
    • </GANGLIA_XML>
  • 31. gmetad (2)
  • 32. gmetad (interactive)
    • $ telnet localhost 8652
    • Connected to localhost.
    • Escape character is '^]'.
    • ./cluster_name/host_name/load_five/
    • ... receive the same XML format as the normal gmetad port, but limited to the metric you request ...
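
The interactive port can be scripted the same way as the telnet session. The sketch assumes the request is a slash-separated path of cluster, host and metric terminated by a newline, which is my reading of the example above rather than a documented contract; the helper names are hypothetical.

```python
import socket


def build_query(cluster=None, host=None, metric=None):
    """Build a request path; stops at the first missing level ('/' asks for everything)."""
    parts = []
    for part in (cluster, host, metric):
        if part is None:
            break
        parts.append(part)
    return "/" + "/".join(parts)


def query_gmetad(path, host="localhost", port=8652, timeout=5.0):
    """Send one request to gmetad's interactive port and return the XML reply as bytes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(path.encode("ascii") + b"\n")
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


query = build_query("cluster_name", "host_name", "load_five")
print(query)
```

Narrowing the request to a single host or metric avoids transferring and parsing the full grid dump on every poll.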
  • 33. Is it scalable?
  • 34. Issues
    • It uses reverse DNS lookups to determine hostname => may cause issues in a cloud (need to use workarounds)
    • Doesn't allow arbitrary metric hierarchy (at this time :-))
  • 35. What's next
    • Support for arbitrary grouping beyond clusters using e.g. tags
    • Better Nagios integration
    • New visualization e.g. heatmaps
    • Logstash integration
  • 36. Questions
    • Twitter: @vvuksan