Trending with Purpose
        Jason Dixon
Trending
A general direction in which something is
        developing or changing.
Why do we trend?
Planning for growth.
Predicting or
diagnosing failure.
... instead of finding out
   from your customer.
Operational Questions
Just because a host or service reponds,
    how do you know it’s working?
If you haven’t measured good, how will
          you recognize bad?
You don’t know what might break, so
      collect everything now.
"Those who cannot remember the
 past are condemned to repeat it"
                      George Santayana
"I don't care if my servers are on fire as
  long as they're making me money"
                              Business Owner
How can we add
Business Value?
Start simple.
Dance with the one who
     brought you.
Nagios
Nagios

• Fault Detection
Nagios

• Fault Detection
• Notifications
Nagios

• Fault Detection
• Notifications
• Escalations
Nagios

• Fault Detection
• Notifications
• Escalations
• Acknowledgements/Downtime
Nagios

• Fault Detection
• Notifications
• Escalations
• Acknowledgements/Downtime
• http://www.nagios.org/
Nagios
Nagios
• What it does well:
Nagios
• What it does well:
 • Free
Nagios
• What it does well:
 • Free
 • Extensible
Nagios
• What it does well:
 • Free
 • Extensible
   • Plugins
Nagios
• What it does well:
 • Free
 • Extensible
   • Plugins
   • Configuration templates
Nagios
• What it does well:
 • Free
 • Extensible
   • Plugins
   • Configuration templates
 • Popular (lesser of all free evils)
Nagios
• What it does well:
 • Free
 • Extensible
   • Plugins
   • Configuration templates
 • Popular (lesser of all free evils)
 • Log metrics (“performance data”)
Nagios
Nagios
• Where it sucks:
Nagios
• Where it sucks:
 • Interface
Nagios
• Where it sucks:
 • Interface
 • (Lack of) Scalability
Nagios
• Where it sucks:
 • Interface
 • (Lack of) Scalability
 • Promotes bad habits
Nagios
• Where it sucks:
 • Interface
 • (Lack of) Scalability
 • Promotes bad habits
   • Acknowledgements never expire
Nagios
• Where it sucks:
 • Interface
 • (Lack of) Scalability
 • Promotes bad habits
   • Acknowledgements never expire
   • Configuration (over-)flexibility
Nagios
• Where it sucks:
 • Interface
 • (Lack of) Scalability
 • Promotes bad habits
   • Acknowledgements never expire
   • Configuration (over-)flexibility
   • Flapping
Nagios
PNP4Nagios
PNP4Nagios

• Basic graphing & dashboard capabilities
PNP4Nagios

• Basic graphing & dashboard capabilities
• Retrieves Nagios performance data
PNP4Nagios

• Basic graphing & dashboard capabilities
• Retrieves Nagios performance data
• Creates graphs with RRD
PNP4Nagios

• Basic graphing & dashboard capabilities
• Retrieves Nagios performance data
• Creates graphs with RRD
• Limited introspection/correlation
PNP4Nagios

• Basic graphing & dashboard capabilities
• Retrieves Nagios performance data
• Creates graphs with RRD
• Limited introspection/correlation
• http://www.pnp4nagios.org/
PNP4Nagios
Graphite
Graphite

• Metric storage
Graphite

• Metric storage
• Complex graph creation
Graphite

• Metric storage
• Complex graph creation
• Web and “CLI” interfaces
Graphite

• Metric storage
• Complex graph creation
• Web and “CLI” interfaces
• Created and released by Orbitz.com
Graphite

• Metric storage
• Complex graph creation
• Web and “CLI” interfaces
• Created and released by Orbitz.com
• http://graphite.wikidot.com/
Graphite
Graphite
Graphite
• The good stuff:
Graphite
• The good stuff:
 • Horizontally scalable
Graphite
• The good stuff:
 • Horizontally scalable
 • Rapid graph prototyping (CLI)
Graphite
• The good stuff:
 • Horizontally scalable
 • Rapid graph prototyping (CLI)
 • Graph disparate data points
Graphite
• The good stuff:
 • Horizontally scalable
 • Rapid graph prototyping (CLI)
 • Graph disparate data points
 • Numerous formulas available
Graphite
• The good stuff:
 • Horizontally scalable
 • Rapid graph prototyping (CLI)
 • Graph disparate data points
 • Numerous formulas available
   • derive, transform, average, sum, etc...
Graphite
• The good stuff:
 • Horizontally scalable
 • Rapid graph prototyping (CLI)
 • Graph disparate data points
 • Numerous formulas available
   • derive, transform, average, sum, etc...
 • Share graphs with other users
Graphite
• The good stuff:
 • Horizontally scalable
 • Rapid graph prototyping (CLI)
 • Graph disparate data points
 • Numerous formulas available
   • derive, transform, average, sum, etc...
 • Share graphs with other users
 • Supports existing RRD databases
Graphite
Graphite

• The not-so-good stuff:
Graphite

• The not-so-good stuff:
 • Not a dashboard (well, sorta)
Graphite

• The not-so-good stuff:
 • Not a dashboard (well, sorta)
 • No hover details
Graphite

• The not-so-good stuff:
 • Not a dashboard (well, sorta)
 • No hover details
 • Single y-axis
Graphite

Web     Carbon    Metrics




        Whisper
Carbon
Carbon
• agent - starts other daemons, receives metrics and
   pipelines them to cache
Carbon
• agent - starts other daemons, receives metrics and
   pipelines them to cache

• aggregator - aggregate and transform your data
   before storage
Carbon
• agent - starts other daemons, receives metrics and
   pipelines them to cache

• aggregator - aggregate and transform your data
   before storage

• cache - caches metrics for real-time graphing,
   pipelines them to persister
Carbon
• agent - starts other daemons, receives metrics and
   pipelines them to cache

• aggregator - aggregate and transform your data
   before storage

• cache - caches metrics for real-time graphing,
   pipelines them to persister

• persister - writes persistent data to disk
Whisper
Whisper

• Metrics database format
Whisper

• Metrics database format
• Supplanted RRDtool
Whisper

• Metrics database format
• Supplanted RRDtool
• Accepts out-of-order data
Whisper

• Metrics database format
• Supplanted RRDtool
• Accepts out-of-order data
• Supports pipelining of data in a single operation
   (multiplexing)
Coming Soon - Ceres
Coming Soon - Ceres

• Smaller files - doesn’t pad missing datapoints
Coming Soon - Ceres

• Smaller files - doesn’t pad missing datapoints
• Doesn’t store timestamps, calculates them
Coming Soon - Ceres

• Smaller files - doesn’t pad missing datapoints
• Doesn’t store timestamps, calculates them
• Federates individual datapoints
Graphite Web
Graphite Web

• Traditional web interface
Graphite Web

• Traditional web interface
• Javascript CLI
Graphite Web

• Traditional web interface
• Javascript CLI
• Rudimentary dashboard
Graphite Web

• Traditional web interface
• Javascript CLI
• Rudimentary dashboard
• Django application
Sending metrics to Graphite
Sending metrics to Graphite

• Connect to Carbon socket (tcp/2003)
Sending metrics to Graphite

• Connect to Carbon socket (tcp/2003)
• Send your data
Sending metrics to Graphite

• Connect to Carbon socket (tcp/2003)
• Send your data
  my $sock = IO::Socket::INET->new(‘127.0.0.1:2003’);
  $sock->send(“endpoint.app.metric $value $epochn");
Sending metrics to Graphite

• Connect to Carbon socket (tcp/2003)
• Send your data
  my $sock = IO::Socket::INET->new(‘127.0.0.1:2003’);
  $sock->send(“endpoint.app.metric $value $epochn");


• ???
Sending metrics to Graphite

• Connect to Carbon socket (tcp/2003)
• Send your data
  my $sock = IO::Socket::INET->new(‘127.0.0.1:2003’);
  $sock->send(“endpoint.app.metric $value $epochn");


• ???
• Profit!
What should we collect?
App/DB Profiling
App/DB Profiling

• How fast is our:
App/DB Profiling

• How fast is our:
 • function foo() for each iteration
App/DB Profiling

• How fast is our:
 • function foo() for each iteration
 • SQL query
App/DB Profiling

• How fast is our:
 • function foo() for each iteration
 • SQL query
 • 3rd party API service (e.g. payment gateway,
     social media)
App/DB Profiling
App/DB Profiling

  • How many times do we:
App/DB Profiling

  • How many times do we:
   • call function foo()
App/DB Profiling

  • How many times do we:
   • call function foo()
   • register a new user
App/DB Profiling

  • How many times do we:
   • call function foo()
   • register a new user
   • chargeback a sale
IT exists to support
     Business.
Not the other way
     around.
Business Profiling
Business Profiling

• How many sales did we generate this hour:
Business Profiling

• How many sales did we generate this hour:
 • per sku
Business Profiling

• How many sales did we generate this hour:
 • per sku
 • from the recent ad campaign
Business Profiling

• How many sales did we generate this hour:
 • per sku
 • from the recent ad campaign
 • from users in West Bumble, Arkansas
Last week?
Last month?
Last year?
Be Creative.
More Data > Less Data
You probably won’t know what metrics
   you might need until it’s too late.
Don’t be that guy.
Storage is Cheap.
Data Sources

• Nagios     • SQL
• Munin      • Logs
• SNMP       • sar
• Ganglia    • /proc
• collectd   • REST APIs
You can’t be serious!
You can’t be serious!

 • You’re thinking to yourself...
You can’t be serious!

 • You’re thinking to yourself...
    • This sounds like a lot of work.
You can’t be serious!

 • You’re thinking to yourself...
    • This sounds like a lot of work.
    • Our developers will never buy in.
You can’t be serious!

 • You’re thinking to yourself...
    • This sounds like a lot of work.
    • Our developers will never buy in.
    • Is there an easier way?
Duh.
StatsD
StatsD
• Created and released by Etsy
StatsD
• Created and released by Etsy
• "Measure Anything, Measure Everything"
StatsD
• Created and released by Etsy
• "Measure Anything, Measure Everything"
• Aggregate counters and timers
StatsD
• Created and released by Etsy
• "Measure Anything, Measure Everything"
• Aggregate counters and timers
• Pipeline to Graphite
StatsD
• Created and released by Etsy
• "Measure Anything, Measure Everything"
• Aggregate counters and timers
• Pipeline to Graphite
• Fire-and-forget (UDP)
StatsD
• Created and released by Etsy
• "Measure Anything, Measure Everything"
• Aggregate counters and timers
• Pipeline to Graphite
• Fire-and-forget (UDP)
• Perl, PHP, Python and Java clients
StatsD
• Created and released by Etsy
• "Measure Anything, Measure Everything"
• Aggregate counters and timers
• Pipeline to Graphite
• Fire-and-forget (UDP)
• Perl, PHP, Python and Java clients
• https://github.com/etsy/statsd
StatsD
StatsD

• https://github.com/sivy/statsd-client
StatsD

• https://github.com/sivy/statsd-client
   use Net::StatsD::Client;
   my $c = Net::StatsD::Client->new();
   $c->increment('endpoint.customer.app.metric');
   $c->timing('endpoint.customer.app.foo', 200);
StatsD

• https://github.com/sivy/statsd-client
   use Net::StatsD::Client;
   my $c = Net::StatsD::Client->new();
   $c->increment('endpoint.customer.app.metric');
   $c->timing('endpoint.customer.app.foo', 200);


• Too much activity? Sample it!
StatsD

• https://github.com/sivy/statsd-client
   use Net::StatsD::Client;
   my $c = Net::StatsD::Client->new();
   $c->increment('endpoint.customer.app.metric');
   $c->timing('endpoint.customer.app.foo', 200);


• Too much activity? Sample it!
   # sample 10%, StatsD will multiply it up
   $c->increment('endpoint.customer.app.metric', 0.1)
That’s enough?
Too much awesome?
Too Bad!
Logster
Logster

• Yet Another Etsy Project (YAEP)
Logster

• Yet Another Etsy Project (YAEP)
• Rips metrics from log files
Logster

• Yet Another Etsy Project (YAEP)
• Rips metrics from log files
• https://github.com/etsy/logster
Logster

• Yet Another Etsy Project (YAEP)
• Rips metrics from log files
• https://github.com/etsy/logster
   # /usr/sbin/logster --dry-run --output=graphite 
   --graphite-host=graphite.example.com:2003 
   SampleLogster /var/log/httpd/access_log
Questions?
Thank you

Trending with Purpose