Artimon - Apache Flume (incubating) NYC Meetup 20111108
Slides from my Flume Meetup

Presentation Transcript

  • Artimon
    Mathias Herberts - @herberts
    Apache Flume (incubating) User Meetup, Hadoop World 2011 NYC Edition
  • Arkéa Real Time Information Monitoring
  • Scalable metrics collection and analysis framework
    ▪ Collects metrics called variable instances
    ▪ Dynamic discovery, (almost) no conf needed
    ▪ Rich analysis library
    ▪ Fits IT and business needs
    ▪ Adapts to third party metrics
    ▪ Uses Flume and Kafka for transport
  • What's in a variable instance?
    name{label0=value0,label1=value1,...}
    ▪ name is the name of the variable
      linux.proc.diskstats.reads.ms
      hadoop.jobtracker.maps_completed
    ▪ Labels are text strings, they characterize a variable instance
      Some labels are automatically set: dc, rack, module, context, uuid, ...
      Others are user defined
    ▪ Variable instances are typed: INTEGER, DOUBLE, BOOLEAN, STRING
    ▪ Variable instance values are timestamped
    ▪ Variable instance values are Thrift objects
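To make the naming scheme concrete, here is a minimal sketch (class and method names are invented for illustration, not Artimon's actual Thrift classes) of rendering a variable instance identity as name{label0=value0,...}. Labels are emitted in sorted order in this sketch so that identical label maps always produce the same instance string; the real system's ordering is not specified on the slide.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: render a variable instance as name{label0=value0,...}.
public class VarInstance {
    // Sort labels so two identical label maps always yield the same string
    // (an assumption made for this sketch, not documented Artimon behavior).
    public static String format(String name, Map<String, String> labels) {
        StringBuilder sb = new StringBuilder(name).append('{');
        boolean first = true;
        for (Map.Entry<String, String> e : new TreeMap<>(labels).entrySet()) {
            if (!first) sb.append(',');
            sb.append(e.getKey()).append('=').append(e.getValue());
            first = false;
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, String> labels = Map.of("dc", "dc1", "rack", "r42");
        System.out.println(format("linux.proc.diskstats.reads.ms", labels));
        // → linux.proc.diskstats.reads.ms{dc=dc1,rack=r42}
    }
}
```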
  • Exporting metrics
    ▪ Metrics are exported via a Thrift service
    ▪ Each MonitoringContext (context=...) exposes a service
    ▪ MCs register their dynamic port in ZooKeeper
      /zk/artimon/contexts/xxx/ip:port:uuid
    ▪ MonitoringContext wrapped in a BookKeeper class

      public interface ArtimonBookKeeper {
        public void setIntegerVar(String name, final Map<String,String> labels, long value);
        public long addToIntegerVar(String name, final Map<String,String> labels, long delta);
        public Long getIntegerVar(String name, final Map<String,String> labels);
        public void removeIntegerVar(String name, final Map<String,String> labels);
        public void setDoubleVar(String name, final Map<String,String> labels, double value);
        public double addToDoubleVar(String name, final Map<String,String> labels, double delta);
        public Double getDoubleVar(String name, final Map<String,String> labels);
        public void removeDoubleVar(String name, final Map<String,String> labels);
        public void setStringVar(String name, final Map<String,String> labels, String value);
        public String getStringVar(String name, final Map<String,String> labels);
        public void removeStringVar(String name, final Map<String,String> labels);
        public void setBooleanVar(String name, final Map<String,String> labels, boolean value);
        public Boolean getBooleanVar(String name, final Map<String,String> labels);
        public void removeBooleanVar(String name, final Map<String,String> labels);
      }
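To show how the BookKeeper interface behaves, here is a toy in-memory implementation of just the integer methods. The real implementation backs a Thrift service; here a ConcurrentHashMap keyed on name plus sorted labels stands in for it, and the keying scheme is an assumption of this sketch.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory sketch of the integer portion of ArtimonBookKeeper.
public class InMemoryBookKeeper {
    private final ConcurrentHashMap<String, Long> integers = new ConcurrentHashMap<>();

    // Key is the name plus labels in sorted order, so label-map ordering
    // never creates duplicate instances (an assumption of this sketch).
    private static String key(String name, Map<String, String> labels) {
        return name + new TreeMap<>(labels);
    }

    public void setIntegerVar(String name, Map<String, String> labels, long value) {
        integers.put(key(name, labels), value);
    }

    public long addToIntegerVar(String name, Map<String, String> labels, long delta) {
        return integers.merge(key(name, labels), delta, Long::sum);
    }

    public Long getIntegerVar(String name, Map<String, String> labels) {
        return integers.get(key(name, labels));
    }

    public void removeIntegerVar(String name, Map<String, String> labels) {
        integers.remove(key(name, labels));
    }

    public static void main(String[] args) {
        InMemoryBookKeeper bk = new InMemoryBookKeeper();
        Map<String, String> labels = Map.of("context", "demo");
        bk.setIntegerVar("requests", labels, 0);
        bk.addToIntegerVar("requests", labels, 5);
        System.out.println(bk.getIntegerVar("requests", labels)); // prints 5
    }
}
```

In the real system the latest values held by such a store are what the Thrift service exposes to pollers.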
  • Exporting metrics
    ▪ Thrift service returns the latest values of known instances
    ▪ ZooKeeper not mandatory, can use a fixed port
    ▪ Artimon written in Java
    ▪ Checklist for porting to other languages
      ▪ Thrift support
      ▪ Optional ZooKeeper support
  • Collecting Metrics
    ▪ Flume launched on every machine
    ▪ artimon source
      artimon(hosts, contexts, vars[, polling_interval])
      e.g. artimon("self", "*", "~.*")
      ▪ Watches ZooKeeper for contexts to poll
      ▪ Periodically collects latest values
    ▪ artimonProxy decorator
      artimonProxy([[port],[ttl]])
      ▪ Exposes all collected metrics via a local port (no ZooKeeper, no loop)
  • Collecting Metrics
    ▪ Simulated flow using flume.flow event attribute
      artimon(...) | artimonProxy(...)
      value("flume.flow", "artimon") ...
    ▪ Events batched and gzipped
      ... value("flume.flow", "artimon") batch(100,100) gzip() ...
    ▪ Kafka sink
      kafkasink(topic, propname=value...)
      ... gzip()
      < failChain("{ lazyOpen => { stubbornAppend => %s } }",
                  "kafkasink("flume-artimon","zk.connect=quorum:2181/zk/kafka/prod")")
        ? diskFailover("-kafka-flume-artimon")
          insistentAppend stubbornAppend insistentOpen
          failChain("{ lazyOpen => { stubbornAppend => %s } }",
                    "kafkasink("flume-artimon","zk.connect=quorum:2181/zk/kafka/prod")") >;
      ~ kafkaDFOChain
  • Consuming Metrics
    ▪ Kafka source
      kafkasource(topic, propname=value...)
    ▪ Custom BytesWritableEscapedSeqFileEventSink
      bwseqfile(filename[, idle[, maxage]])
      bwseqfile("hdfs://nn/hdfs/data/artimon/%Y/%m/%d/flume-artimon");
      ▪ N archivers in a single Kafka consumer group (same groupid)
      ▪ Metrics stored in HDFS as serialized Thrift in BytesWritables
      ▪ Can add archivers if metrics flow increases
      ▪ Ability to manipulate those metrics using Pig
  • Consuming Metrics
    ▪ In-memory history data (VarHistoryMemStore, VHMS)
      artimonVHMSDecorator(nthreads[0], bucketspan[60000], bucketcount[60],
                           gc_grace_period[600000], port[27847],
                           gc_period[60000], get_limit[100000]) null;
      ▪ Each VHMS in its own Kafka consumer group (each gets all metrics)
      ▪ Multiple VHMS with different granularities: 60x1, 48x5, 96x15, 72x24h
      ▪ Filter to ignore some metrics for some VHMS
        artimonFilter("!~linux.proc.pid.*")
  • Why Kafka?
    ▪ Initially used tsink/rpcSource
      ▪ No ZooKeeper use for Flume (avoid flapping)
      ▪ Collector load balancing using DNS
      ▪ Worked fine for some time...
    ▪ But as metrics volume was increasing...
      ▪ DNS load balancing not ideal (herd effect when restarting collectors)
      ▪ Flume's push architecture got in the way
        Slowdowns not considered failures
        Had to add mechanisms for dropping metrics when congested
  • Why Kafka?
    ▪ Kafka to the rescue! Source/sink coded in less than a day
      ▪ Acts as a buffer between metrics producers and consumers
      ▪ ZooKeeper based discovery and load balancing
      ▪ Easily scalable, just add brokers
    ▪ Performance has increased
      ▪ Producers now push their metrics in less than 2s
      ▪ VHMS/Archivers consume at their pace with no producer slowdown
        => 1.3M metrics in ~10s
    ▪ Ability to go back in time when restarting a VHMS
    ▪ Flume still valuable, notably for DFO (collect metrics during NP)
    ▪ Artimon [pull] Flume [push] Kafka [pull] Flume
  • Analyzing Metrics
    ▪ Groovy library
      ▪ Talks to a VHMS to retrieve time series
      ▪ Manipulates time series, individually or in bulk
    ▪ Groovy scripts for monitoring
      ▪ Use the Artimon library
      ▪ IT Monitoring
      ▪ BAM (Business Activity Monitoring)
    ▪ Ability to generate alerts
      ▪ Each alert is an Artimon metric (archived for SLA compliance)
      ▪ Propagate to Nagios, Kafka in the works (CEP for alert manager)
  • Analyzing Metrics
    ▪ Bulk time series manipulation
      ▪ Equivalence classes based on labels (same values, same class)
      ▪ Apply ops (+ - / * closure) to 2 variables based on equivalence classes

      import static com.arkea.artimon.groovy.LibArtimon.*
      vhmssrc = export["vhms.60"]
      dfvars = fetch(vhmssrc, "~^linux.df.bytes.(free|capacity)$", [:], 60000, -30000)
      dfvars = select(sel_isfinite(), dfvars)
      free = select(dfvars, "=linux.df.bytes.free", [:])
      capacity = select(sel_gt(0), select(dfvars, "=linux.df.bytes.capacity", [:]))
      usage = sort(apply(op_div(), free, capacity, [], "freespace"))
      used50 = select(sel_lt(0.50), usage)
      used75 = select(sel_lt(0.25), usage)
      used90 = select(sel_lt(0.10), usage)
      used95 = select(sel_lt(0.05), usage)
      println "Volumes occupied > 50%: " + used50.size()
      println "Volumes occupied > 75%: " + used75.size()
      println "Volumes occupied > 90%: " + used90.size()
      println "Volumes occupied > 95%: " + used95.size()
      println "Total volumes: " + usage.size()

      The same script can handle any number of volumes, dynamically
  • Analyzing Metrics
    ▪ Map paradigm
      ▪ Apply a Groovy closure on n consecutive values of a time series
        map(closure, vars, nticks, name)
        Predefined: map_delta(), map_rate(), map_{min,max,mean}()
        map(map_delta(), vars, 2, "+:delta")
    ▪ Reduce paradigm
      ▪ Apply a Groovy closure on equivalence classes
      ▪ Generate one time series for each equivalence class
        reduceby(closure, vars, bylabels, name, relabels)
        Predefined: red_sum(), red_{min,max,mean,sd}()
        reduceby(red_mean(), temps, ["dc","rack"], "+:rackavg", [:])
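The map paradigm above can be sketched outside Groovy as well. The following Java sketch (names invented for illustration, not Artimon's API) slides an n-tick window over a series and applies an operator per window; delta() mimics what a predefined map_delta() would compute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Illustrative sketch of a map-style time-series operator.
public class SeriesMap {
    // Apply `op` to every window of `nticks` consecutive values.
    public static List<Long> map(ToLongFunction<long[]> op, long[] values, int nticks) {
        List<Long> out = new ArrayList<>();
        for (int i = 0; i + nticks <= values.length; i++) {
            long[] window = new long[nticks];
            System.arraycopy(values, i, window, 0, nticks);
            out.add(op.applyAsLong(window));
        }
        return out;
    }

    // Analogue of a predefined map_delta(): last minus first value of the window.
    public static ToLongFunction<long[]> delta() {
        return w -> w[w.length - 1] - w[0];
    }

    public static void main(String[] args) {
        long[] counter = {10, 12, 15, 15, 20};
        System.out.println(map(delta(), counter, 2)); // prints [2, 3, 0, 5]
    }
}
```

With nticks = 2 this turns a monotonically growing counter into per-tick increments, which is the typical use of map_delta() on the slide.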
  • Analyzing Metrics
    ▪ A whole lot more:
      getvars, selectbylabels, relabel, fetch, partition, fillprevious, find,
      top, fillnext, findlabels, bottom, fillvalue, display, outliers, map,
      makevar, dropOutliers, reduceby, nticks, resample, settype, timespan,
      normalize, triggerAlert, lasttick, standardize, clearAlert, values,
      sort, CDF, targets, scalar, PDF, getlabels, ntrim, Percentile, dump,
      timetrim, sparkline, select, apply, ...
  • Third Party Metrics
    ▪ JMX Agent
      ▪ Expose any JMX metrics as Artimon metrics

      jmx.kafka.log.logstats:currentoffset{context=kafka,jmx.domain=kafka,jmx.type=kafka.logs.flume-artimon-0} 525762846
      jmx.kafka.log.logstats:currentoffset{context=kafka,jmx.domain=kafka,jmx.type=kafka.logs.flume-artimon-0} 511880426
      jmx.kafka.log.logstats:currentoffset{context=kafka,jmx.domain=kafka,jmx.type=kafka.logs.flume-artimon-0} 492037666
      jmx.kafka.log.logstats:currentoffset{context=kafka,jmx.domain=kafka,jmx.type=kafka.logs.flume-artimon-0} 436896839
      jmx.kafka.log.logstats:currentoffset{context=kafka,jmx.domain=kafka,jmx.type=kafka.logs.flume-artimon-0} 333034505
      jmx.kafka.log.logstats:currentoffset{context=kafka,jmx.domain=kafka,jmx.type=kafka.logs.flume-syslog-0} 163186980
      jmx.kafka.log.logstats:currentoffset{context=kafka,jmx.domain=kafka,jmx.type=kafka.logs.flume-syslog-0} 163047011
      jmx.kafka.log.logstats:currentoffset{context=kafka,jmx.domain=kafka,jmx.type=kafka.logs.flume-syslog-0} 162916713
      jmx.kafka.log.logstats:currentoffset{context=kafka,jmx.domain=kafka,jmx.type=kafka.logs.flume-syslog-0} 162704303
      jmx.kafka.log.logstats:currentoffset{context=kafka,jmx.domain=kafka,jmx.type=kafka.logs.flume-syslog-0} 162565421
      jmx.kafka.network.socketserverstats:numfetchrequests{context=kafka,jmx.domain=kafka,jmx.type=kafka.SocketServerStats} 8835417
      jmx.kafka.network.socketserverstats:numfetchrequests{context=kafka,jmx.domain=kafka,jmx.type=kafka.SocketServerStats} 8794654
      jmx.kafka.network.socketserverstats:numfetchrequests{context=kafka,jmx.domain=kafka,jmx.type=kafka.SocketServerStats} 8793525
      jmx.kafka.network.socketserverstats:numfetchrequests{context=kafka,jmx.domain=kafka,jmx.type=kafka.SocketServerStats} 8741181
      jmx.kafka.network.socketserverstats:numfetchrequests{context=kafka,jmx.domain=kafka,jmx.type=kafka.SocketServerStats} 8019699
      jmx.kafka.network.socketserverstats:numproducerequests{context=kafka,jmx.domain=kafka,jmx.type=kafka.SocketServerStats} 51999885
      jmx.kafka.network.socketserverstats:numproducerequests{context=kafka,jmx.domain=kafka,jmx.type=kafka.SocketServerStats} 51991203
      jmx.kafka.network.socketserverstats:numproducerequests{context=kafka,jmx.domain=kafka,jmx.type=kafka.SocketServerStats} 51986318
      jmx.kafka.network.socketserverstats:numproducerequests{context=kafka,jmx.domain=kafka,jmx.type=kafka.SocketServerStats} 51980976
      jmx.kafka.network.socketserverstats:numproducerequests{context=kafka,jmx.domain=kafka,jmx.type=kafka.SocketServerStats} 48008009
  • Third Party Metrics
    ▪ Flume artimonReader source
      artimonReader(context, periodicity, file0[, fileX])
      ▪ Periodically reads files containing a text representation of metrics
        [timestamp] name{labels} value
      ▪ Exposes those metrics via the standard mechanism
      ▪ Simply create scripts which write those files and add them to crontab
      ▪ Successfully used for NAS, Samba, MQSeries, SNMP, MySQL, ...

      1319718601000 mysql.bytes_received{db=mysql-roller} 296493399
      1319718601000 mysql.bytes_sent{db=mysql-roller} 3655368849
      1319718601000 mysql.com_admin_commands{db=mysql-roller} 673028
      1319718601000 mysql.com_alter_db{db=mysql-roller} 0
      1319718601000 mysql.com_alter_table{db=mysql-roller} 0
      1319718601000 mysql.com_analyze{db=mysql-roller} 0
      1319718601000 mysql.com_backup_table{db=mysql-roller} 0
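Parsing one line of that text format is straightforward. The following Java sketch (the record class and parsing rules are assumptions for illustration; the actual artimonReader parser is not shown on the slide) splits a line into timestamp, name, labels, and value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative parser for one "timestamp name{labels} value" metric line.
public class MetricLine {
    public final long timestamp;
    public final String name;
    public final Map<String, String> labels;
    public final String value;

    private MetricLine(long ts, String name, Map<String, String> labels, String value) {
        this.timestamp = ts; this.name = name; this.labels = labels; this.value = value;
    }

    public static MetricLine parse(String line) {
        // Three whitespace-separated fields: timestamp, name{labels}, value.
        String[] parts = line.trim().split("\\s+", 3);
        long ts = Long.parseLong(parts[0]);
        int brace = parts[1].indexOf('{');
        String name = parts[1].substring(0, brace);
        String body = parts[1].substring(brace + 1, parts[1].length() - 1);
        Map<String, String> labels = new LinkedHashMap<>();
        for (String kv : body.split(",")) {
            if (kv.isEmpty()) continue;
            int eq = kv.indexOf('=');
            labels.put(kv.substring(0, eq), kv.substring(eq + 1));
        }
        return new MetricLine(ts, name, labels, parts[2]);
    }

    public static void main(String[] args) {
        MetricLine m = parse("1319718601000 mysql.bytes_sent{db=mysql-roller} 3655368849");
        System.out.println(m.name + " " + m.labels + " = " + m.value);
    }
}
```

Any cron-driven script that emits lines in this shape can feed metrics into the pipeline without touching Thrift or ZooKeeper, which is what makes the format convenient for NAS, SNMP, MySQL, and similar third-party sources.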
  • Post-Mortem Analysis
    ▪ Extract specific metrics from HDFS
      ▪ Simple Pig script
    ▪ Load extracted metrics into a local VHMS
    ▪ Interact with VHMS using Groovy
      ▪ Existing scripts can be run directly if parameterized correctly
    ▪ Interesting use cases
      ▪ Did we respect our SLAs? Would the new SLAs be respected too?
      ▪ What happened pre/post incident?
      ▪ Would a modified alert condition have triggered an alert?
  • Should we open source this?
    http://www.arkea.com/
    @herberts