005 cluster monitoring


Published in: Technology


  1. Cluster Monitoring
     2012/07/26
     Scott Miao
  2. Agenda
     • Course Credit
     • Introduction
     • Metrics Framework
     • Tools
       • Tools on wiki: http://wiki.spn.tw.trendnet.org/wiki/Hadoop_Related_Web_Site_List
  3. Course Credit
     • Showing up: 30 points
     • Asking a question: 5 points per question
     • Hands-on: 40 points
     • 70 points are needed to pass this course
     • Course credit is calculated once for each course finished
     • The course credit will be sent to you and your supervisor by mail
  4. Introduction – (1/2)
     • Using a cluster without monitoring and metrics is…
       • the same as driving a car while blindfolded
     • It is great to run load tests against your HBase cluster
       • but you need to correlate the cluster's performance with what the system is doing under the hood
  5. Introduction – (2/2)
     • Graphing
       • Captures the exposed metrics of a system and displays them in visual charts
       • A picture is worth a thousand words
       • Good for historical, quantitative data
     • Monitoring
       • With graphs alone, it is still difficult to see what a system is doing right now
       • Qualitative data is needed, which is what monitoring systems handle
       • Sends out emails to various recipients
       • Sends SMS messages to telephones
       • Runs customized scripts
  6. The Metrics Framework – Basic Classes from Hadoop
  7. The Metrics Framework – Extended Classes in HBase
  8. The Metrics Framework – Classes Collaboration
  9. The Metrics Framework – Metric Types – (1/3)
     Metric type | Description
     Integer value (IV) | An integer counter. Only updated when the value changes.
     Long value (LV) | A long counter. Only updated when the value changes.
     Rate (R) | A float value representing a rate. When the metric is polled:
       1. The rate is calculated as number of operations / elapsed time in seconds.
       2. The rate is stored in the previous value field.
       3. The internal counter is reset to zero.
       4. The last polled timestamp is set to the current time.
       5. The computed rate is returned to the caller.
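The five steps of the rate metric can be sketched in a few lines. This is a minimal illustration in Python, not the Hadoop or HBase implementation: the names `RateMetric`, `FakeClock`, `inc`, and `poll` are hypothetical, and the sketch assumes that a poll with zero elapsed time simply reports a rate of 0.

```python
import time


class FakeClock:
    """Manually advanced clock (hypothetical helper), used only to make the example deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class RateMetric:
    """Illustrative rate metric: counts operations and reports operations per second when polled."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock           # injectable time source, in seconds
        self._count = 0               # internal counter of operations since the last poll
        self._last_polled = clock()   # timestamp of the previous poll
        self.previous_value = 0.0     # rate computed by the most recent poll

    def inc(self, n=1):
        """Record n operations."""
        self._count += n

    def poll(self):
        now = self._clock()
        elapsed = now - self._last_polled
        # Step 1: rate = number of operations / elapsed time in seconds
        # (assumption: report 0.0 when no time has elapsed).
        rate = self._count / elapsed if elapsed > 0 else 0.0
        self.previous_value = rate    # Step 2: store the rate in the previous value field
        self._count = 0               # Step 3: reset the internal counter to zero
        self._last_polled = now       # Step 4: set the last polled timestamp to the current time
        return rate                   # Step 5: return the computed rate to the caller
```

For example, recording 50 operations and polling after 10 seconds of (fake) clock time yields a rate of 5.0 operations per second; an immediate second poll yields 0.0 because the counter was reset.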
  10. The Metrics Framework – Metric Types – (2/3)
      Metric type | Description
      String (S) | Static, text-based information that is never reset or changed, e.g., HBase version number, build date, and so on.
      Time varying integer (TVI) | The context keeps aggregating the value. When the value is polled, it returns the accrued integer value and resets to zero, then accrues again until the next poll.
      Time varying long (TVL) | Same as TVI, but uses a long.
  11. The Metrics Framework – Metric Types – (3/3)
      Metric type | Description
      Time varying rate (TVR) | The number of operations or events and the time they required to complete. The values for operation count and time accrued are reset once the metric is polled.
      Persistent time varying rate (PTVR) | Same as TVR, but NOT reset on every poll.
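The difference between a time varying rate and its persistent variant is only whether polling resets the accrued values. The following Python sketch illustrates that behavior under stated assumptions; the class name `TimeVaryingRate`, its `persistent` flag, and the millisecond unit are hypothetical and are not taken from the HBase classes.

```python
class TimeVaryingRate:
    """Illustrative TVR/PTVR: accrues an operation count and the time those operations took."""

    def __init__(self, persistent=False):
        self.persistent = persistent  # True models PTVR, False models TVR
        self.ops = 0                  # number of operations accrued so far
        self.time_ms = 0              # time accrued so far (assumed to be milliseconds)

    def inc(self, ops, time_ms):
        """Record that `ops` operations took `time_ms` in total."""
        self.ops += ops
        self.time_ms += time_ms

    def poll(self):
        """Return (operation count, accrued time); reset both unless persistent."""
        snapshot = (self.ops, self.time_ms)
        if not self.persistent:
            self.ops = 0
            self.time_ms = 0
        return snapshot
```

A non-persistent instance returns `(0, 0)` on a second poll with no new activity, while a persistent one keeps returning the running totals.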
  12. The Metrics Framework – Master Metrics
      • The master process exposes all metrics relating to its role in a cluster
      Metric | Property name | Description
      Cluster requests (R) | hbase.master.cluster_requests | The total number of requests to the cluster, aggregated across all region servers
      Split time (PTVR) | hbase.master.splitTime | The time it took to split the write-ahead log files after a restart
      Split size (PTVR) | hbase.master.splitSize | The total size of the write-ahead log files that were split
  13. The Metrics Framework – Region Server Metrics
      • A substantial number of metrics here
      • Includes details about different parts of the overall architecture inside the server
      • Divided into the following groups
        • Block cache metrics
        • Compaction metrics
        • Memstore metrics
        • Store metrics
        • I/O metrics
        • Miscellaneous metrics
  14. Region Server Metrics – Block cache metrics – (1/2)
      Metric | Property name | Description
      count (LV) | hbase.regionserver.blockCacheCount | The number of blocks currently in the cache
      size (LV) | hbase.regionserver.blockCacheSize | The size of the blocks currently in the cache, i.e., the occupied Java heap space
      free (LV) | hbase.regionserver.blockCacheFree | Remaining heap for the cache
      evicted (LV) | hbase.regionserver.blockCacheEvictedCount | The number of blocks that had to be removed because of heap size constraints
  15. Region Server Metrics – Block cache metrics – (2/2)
      Metric | Property name | Description
      cache hit (LV) | hbase.regionserver.blockCacheHitCount | The number of cache block hits
      miss (LV) | hbase.regionserver.blockCacheMissCount | The number of cache block misses
      hit ratio (IV) | hbase.regionserver.blockCacheHitRatio | The number of cache hits in relation to the total number of requests to the cache
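The relationship between the hit count, miss count, and hit ratio can be made concrete with a small calculation. This Python sketch rests on two assumptions not stated on the slide: that the total number of cache requests equals hits plus misses, and that the ratio is reported as a whole-number percentage (suggested by its integer value type). The function name `block_cache_hit_ratio` is hypothetical.

```python
def block_cache_hit_ratio(hit_count, miss_count):
    """Return cache hits as an integer percentage of all cache requests.

    Assumes total requests = hits + misses, and returns 0 when the
    cache has not been accessed yet.
    """
    total = hit_count + miss_count
    if total == 0:
        return 0
    return 100 * hit_count // total  # integer division truncates toward zero
```

With 90 hits and 10 misses this gives 90; a persistently low value suggests that most reads are going to the storage files instead of the cache.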
  16. Region Server Metrics – Compaction metrics
      Metric | Property name | Description
      compaction size (PTVR) | hbase.regionserver.compactionSize | The total size (in bytes) of the storage files that have been compacted
      compaction time (PTVR) | hbase.regionserver.compactionTime | How long that operation took
      compaction queue size (IV) | hbase.regionserver.compactionQueueSize | How many files a region server currently has queued up for compaction (recommended for monitoring)
      Compaction size and compaction time are reported after a completed compaction run.
  17. Region Server Metrics – Memstore metrics
      Metric | Property name | Description
      memstore size MB (IV) | hbase.regionserver.memstoreSizeMB | The total heap space occupied by all memstores (in online regions) for the server, in megabytes
      flush queue size (IV) | hbase.regionserver.flushQueueSize | The number of enqueued regions that will be flushed next (recommended for monitoring)
      flush size (PTVR) | hbase.regionserver.flushSize | The total size (in bytes) of the memstore that has been flushed
      flush time (PTVR) | hbase.regionserver.flushTime | The total time taken to flush the memstore
  18. Region Server Metrics – Store metrics
      Metric | Property name | Description
      store files (IV) | hbase.regionserver.storefiles | The total number of storage files, spread across all stores (regions) managed by the current server
      stores (IV) | hbase.regionserver.stores | The total number of stores for the server, across all regions
      store file index size MB (IV) | hbase.regionserver.storefileIndexSizeMB | The sum of the block index, and optional meta index, for all store files, in megabytes
  19. Region Server Metrics – I/O metrics
      Metric | Property name | Description
      fs read latency (TVR) | hbase.regionserver.fsReadLatency | Filesystem read latency, e.g., the time it takes to load a block from the storage files
      fs write latency (TVR) | hbase.regionserver.fsWriteLatency | The same as above, but for write operations, including the storage files and write-ahead log
      fs sync latency (TVR) | hbase.regionserver.fsSyncLatency | The latency to sync the write-ahead log records to the filesystem
      All numbers are in milliseconds.
  20. Region Server Metrics – Miscellaneous metrics
      Metric | Property name | Description
      read request count (LV) | hbase.regionserver.readRequestCount | The total number of read (such as get()) operations
      write request count (LV) | hbase.regionserver.writeRequestCount | The total number of write (such as put()) operations
      requests (R) | hbase.regionserver.requests | The actual request rate per second
      regions (IV) | hbase.regionserver.regions | The number of regions that are currently online and hosted by this region server
  21. The Metrics Framework – RPC Metrics
      Metric | Property name | Description
      RPC Process Time | rpc.metrics.RpcProcessingTime | The average time taken to process the RPCs on the server side
      RPC Queue Time | rpc.metrics.RpcQueueTime | The time between a call arriving and it actually being processed, i.e., the queue time (recommended for monitoring)
  22. The Metrics Framework – JVM Metrics
      • Tuning the JVM settings is part of optimizing your HBase setup
        • You need to know what is going on in the cluster
      • Divided into the following groups
        • Memory usage metrics
        • Garbage collection metrics
        • Thread metrics
        • System event metrics
  23. JVM Metrics – Memory usage metrics
      Metric | Property name
      Non-heap used memory | jvm.RegionServer.metrics.memNonHeapUsedM
      Non-heap committed memory | jvm.RegionServer.metrics.memNonHeapCommittedM
      Heap used memory | jvm.RegionServer.metrics.memHeapUsedM
      Heap committed memory | jvm.RegionServer.metrics.memHeapCommittedM
      For what used versus committed memory means, see http://docs.oracle.com/javase/6/docs/api/java/lang/management/MemoryUsage.html
  24. JVM Metrics – Garbage collection metrics
      • The garbage collection process causes so-called stop-the-world pauses in certain steps
        • These are difficult to handle when a system is bound by tight SLAs
        • They become a serious problem when the pauses approach the multiminute range, because this can cause a region server to miss its ZooKeeper lease renewal, forcing the master to take evasive actions
        • The so-called "Juliet Pause"
      Metric | Property name | Description
      gc count | jvm.RegionServer.metrics.gcCount | The number of garbage collections
      gc time millis | jvm.RegionServer.metrics.gcTimeMillis | The accumulated time spent in garbage collection
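One way to turn the two garbage collection metrics into something actionable is to compare two consecutive samples. The Python sketch below assumes both values are cumulative between samples (the slide only says the time is accumulated), and the function name `gc_overhead` and its tuple layout are hypothetical. The average it computes is a rough indicator only, since it cannot reveal a single long pause hidden among many short ones.

```python
def gc_overhead(prev, curr, interval_ms):
    """Summarize GC activity between two samples.

    prev and curr are (gc_count, gc_time_millis) pairs taken interval_ms apart,
    both assumed to be cumulative counters.
    Returns (collections, fraction_of_interval_in_gc, average_ms_per_collection).
    """
    collections = curr[0] - prev[0]
    gc_ms = curr[1] - prev[1]
    fraction = gc_ms / interval_ms
    average_ms = gc_ms / collections if collections else 0.0
    return collections, fraction, average_ms
```

For instance, going from `(10, 1000)` to `(14, 3000)` over a 60000 ms interval means 4 collections that took 2000 ms in total, roughly 3% of the interval, at 500 ms per collection on average.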
  25. JVM Metrics – Thread metrics
      Metric | Property name
      new state | jvm.RegionServer.metrics.threadsNew
      runnable state | jvm.RegionServer.metrics.threadsRunnable
      blocked state | jvm.RegionServer.metrics.threadsBlocked
      waiting state | jvm.RegionServer.metrics.threadsWaiting
      timed waiting state | jvm.RegionServer.metrics.threadsTimedWaiting
      terminated state | jvm.RegionServer.metrics.threadsTerminated
      Description: the count for each possible thread state, including new, runnable, blocked, and so on. For details on thread states, refer to the following docs:
      • http://www.programcreek.com/2009/03/thread-status/
      • http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.State.html
  26. JVM Metrics – System event metrics
      Metric | Property name
      log fatal | jvm.RegionServer.metrics.logFatal
      log error | jvm.RegionServer.metrics.logError
      log warn | jvm.RegionServer.metrics.logWarn
      log info | jvm.RegionServer.metrics.logInfo
      Description: system event metrics provide counts for various log-level events, e.g., the log error metric provides the number of log events that occurred on the error level.
  27. The Metrics Framework – Info Metrics
      • Only accessible through JMX
  28. The Metrics Framework
      • If you find other metrics not listed here
        • Please refer to the API docs directly…
        • http://hbase.apache.org/apidocs/index.html?overview-summary.html
  29. Tools - Ganglia
      • A distributed, scalable monitoring system suitable for large cluster systems
      • HBase inherits its native support for Ganglia directly from Hadoop
  30. Ganglia – Three components
      • Ganglia monitoring daemon (gmond)
        • Runs on every machine that is monitored
        • Collects the local data and prepares the statistics to be polled by other systems
      • Ganglia meta daemon (gmetad)
        • Is installed on a central node
        • Acts as the federation node to the entire cluster
        • Polls from one or more monitoring daemons to receive the current cluster status
      • Ganglia PHP web frontend
        • Retrieves the combined statistics from the meta daemon and presents it as HTML
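To give a feel for the statistics a monitoring daemon prepares for polling, the sketch below parses a small XML report into a per-host dictionary. It is an illustration under assumptions: the sample document is made up (host names, values, and cluster name are hypothetical), it mimics the `CLUSTER` / `HOST` / `METRIC` element layout that gmond reports typically use, and the function name `metrics_by_host` is not part of Ganglia.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the style of a gmond XML report.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="hbase-training">
    <HOST NAME="rs1.example.com">
      <METRIC NAME="hbase.regionserver.compactionQueueSize" VAL="3" TYPE="int32" UNITS=""/>
      <METRIC NAME="hbase.regionserver.flushQueueSize" VAL="0" TYPE="int32" UNITS=""/>
    </HOST>
    <HOST NAME="rs2.example.com">
      <METRIC NAME="hbase.regionserver.compactionQueueSize" VAL="7" TYPE="int32" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def metrics_by_host(xml_text):
    """Return {host name: {metric name: value string}} from a gmond-style XML report."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL") for metric in host.iter("METRIC")
        }
    return result
```

Values are kept as strings here because a report can mix numeric and text metrics; a caller would convert them based on the `TYPE` attribute.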
  31. Ganglia - Installation
      • http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_quick_start
  32. Tools - Nagios
      • Polls current metrics on a regular basis and compares them with given thresholds
      • Once the thresholds are exceeded, it will start evasive actions
        • Ranging from sending out emails and SMS messages to telephones, to triggering scripts, or even physically rebooting the server when necessary
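The threshold comparison at the heart of this kind of monitoring can be sketched as a small check function. The status codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); everything else is an assumption for illustration, including the function name `check_threshold`, the example thresholds, and the choice of a queue-size style metric where higher values are worse.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_threshold(value, warn, crit, label="metric"):
    """Compare a metric value with warning and critical thresholds.

    Returns (status code, status line). Higher values are assumed to be worse,
    as with a queue size. A missing value (None) maps to UNKNOWN.
    """
    if value is None:
        return UNKNOWN, f"UNKNOWN - {label} not available"
    if value >= crit:
        return CRITICAL, f"CRITICAL - {label} is {value} (>= {crit})"
    if value >= warn:
        return WARNING, f"WARNING - {label} is {value} (>= {warn})"
    return OK, f"OK - {label} is {value}"
```

A real plugin would print the status line and exit with the status code; the monitoring system then decides which action (mail, SMS, script) to trigger.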
  33. Tools - JMX
      • Java Management Extensions technology
        • The standard for Java applications to export their status
        • Also has the ability to provide operations
      • Common tools for JMX
        • JConsole
        • JMXToolkit
      • http://hbase.apache.org/metrics.html
  34. Hands-on
      • Use the Ganglia “Aggregate Graphs” feature
        • Title with your name
        • Including 5 hosts
        • Use any two metrics
        • Crop the image file, just like this sample
      • Put the image file into Git
        • YOUR_HOME=${GIT_ROOT}/hbase-training/005/hands-on/<your_name>
        • mkdir ${YOUR_HOME}
        • Put your hands-on into ${YOUR_HOME}