Metrics Simplified
      Mark Lin
  mlin@admob.com
why?

"If you can not measure it, you can not improve it"
                                    -Lord Kelvin


99.999% ("five nines") = 5.26 minutes of downtime per year
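The five-nines figure can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the budget is counted per year and that a year averages 365.25 days; the function name is made up for illustration.

```python
# Assumption: downtime is budgeted per year, using an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten, which is why small measurement gaps matter at this level.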
previously ...

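  Sending/Collecting is complicated.

  Single collection server.

  Tedious to configure new metric collection or creation.

  Calculating metrics from files is expensive.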
bottlenecks ...

  Poll based collection server

  Not easy (!fun) to configure new metric collection or
  creation.
     = grunt work for ops-engineer, uhhhh....
enabling technology

  Graphite

  RabbitMQ

  Graphite Local Proxy

  RockSteady ( w/ Esper )
path to graph
1min.juicer.output.apple.sc1.jcr1 20 1276822626



echo "1min.juicer.output.apple.sc1.jcr1 20 1276822626" | nc localhost 3400
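The same "name value timestamp" line can be sent from application code instead of the shell. This is a rough sketch, not the deck's own client: it assumes a plaintext listener on localhost port 3400 as in the nc example (presumably the local proxy), and the function names are hypothetical.

```python
import socket
import time


def format_metric(name, value, timestamp=None):
    """Build one plaintext metric line: "<name> <value> <unix timestamp>\\n"."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {int(timestamp)}\n"


def send_metric(name, value, timestamp=None, host="localhost", port=3400):
    """Send one metric line over TCP, like `echo ... | nc localhost 3400`.

    host/port default to the values in the slide; whether anything listens
    there depends on the local setup (assumption).
    """
    line = format_metric(name, value, timestamp)
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))
    return line


if __name__ == "__main__":
    # Prints the exact line from the slide without opening a connection.
    print(format_metric("1min.juicer.output.apple.sc1.jcr1", 20, 1276822626), end="")
```

Keeping the formatting separate from the socket call makes the wire format easy to test without a running collector.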
graph
graph = post event forensic
Rocksteady, metric as event

1min.juicer.common.version.sc1.jcr1 100 1276822626

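INSERT INTO Deploy
SELECT * FROM Metric(name='common.revision')
MATCH_RECOGNIZE (
  partition by colo, hostname
  measures A.value as revision, A.colo as colo,
           A.hostname as hostname, A.app as app,
           A.timestamp as timestamp
  pattern (A)
  define A as A.value > prev(A.value)
)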
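The idea behind the query on this slide is to treat a rising revision value as a deploy event, tracked separately per colo and hostname. The sketch below imitates that logic in plain Python for illustration only; it is not Esper or Rocksteady code, and the Metric class, its fields, and detect_deploys are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Metric:
    """Hypothetical parsed metric event; field names mirror the query's columns."""
    name: str
    app: str
    colo: str
    hostname: str
    value: float
    timestamp: int


def detect_deploys(metrics: Iterable[Metric], name: str = "common.revision") -> Iterator[Metric]:
    """Yield a metric whenever its value exceeds the previous value
    seen for the same (colo, hostname) pair."""
    last_value: Dict[Tuple[str, str], float] = {}
    for m in metrics:
        if m.name != name:
            continue  # same role as the Metric(name=...) filter
        key = (m.colo, m.hostname)  # same role as "partition by colo, hostname"
        previous = last_value.get(key)
        last_value[key] = m.value
        # The first event per host has no previous value, so it never matches.
        if previous is not None and m.value > previous:
            yield m


if __name__ == "__main__":
    events = [
        Metric("common.revision", "juicer", "sc1", "jcr1", 100, 1276822626),
        Metric("common.revision", "juicer", "sc1", "jcr1", 101, 1276822686),
    ]
    for deploy in detect_deploys(events):
        print(f"deploy on {deploy.hostname}.{deploy.colo}: revision {deploy.value}")
```

A stream engine does this continuously over live events; the dictionary here only stands in for the per-partition state it keeps.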
auto threshold, prediction
correlation

  Deployment-related problems.

  Capture sets of metrics when important ones cross a
  threshold.

  Determine dependencies, such as CPU to requests per
  second or response time.
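One simple way to look for a dependency between two metrics, such as CPU and requests per second, is a Pearson correlation over aligned samples. The deck does not say which method was used, so this is only a sketch of one option; the function name and the sample numbers are made up.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long, aligned series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up samples taken at the same timestamps.
    cpu_percent = [20.0, 35.0, 50.0, 65.0, 80.0]
    requests_per_second = [100.0, 180.0, 240.0, 330.0, 400.0]
    print(f"cpu vs rps: r = {pearson(cpu_percent, requests_per_second):.3f}")
```

A value near 1 or -1 suggests the two metrics move together; it does not show which one drives the other.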
revelation
beyond simple metric

  Timing info per request.

  Actual time spent in each component in an application.

  Map out dependencies, find the exact area of the problem.
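Per-request, per-component timing can be collected with a small timer wrapped around each stage of a request. This is a minimal sketch of the idea, not the instrumentation the deck describes; RequestTimer and the component names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RequestTimer:
    """Accumulates wall-clock seconds spent in each named component of one request."""

    def __init__(self):
        self.elapsed = defaultdict(float)

    @contextmanager
    def component(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the wrapped code raises.
            self.elapsed[name] += time.perf_counter() - start

    def slowest(self):
        """Return the component with the most accumulated time, or None if empty."""
        if not self.elapsed:
            return None
        return max(self.elapsed, key=self.elapsed.get)


if __name__ == "__main__":
    timer = RequestTimer()
    with timer.component("database"):
        time.sleep(0.02)  # stand-in for real work
    with timer.component("render"):
        time.sleep(0.005)
    for name, seconds in timer.elapsed.items():
        print(f"{name}: {seconds * 1000:.1f} ms")
    print("slowest:", timer.slowest())
```

The per-component totals could then be sent as ordinary metrics, one line per component, using the same plaintext format as the rest of the pipeline.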
what we learned?

1. Make metric sending simple.
2. Nice UI to make sense of data.
3. Real-time processing of metrics rocks.

  1. Metrics Simplified. Mark Lin, mlin@admob.com
  2. why? "If you can not measure it, you can not improve it" -Lord Kelvin. 99.999% ("five nines") = 5.26 minutes of downtime per year
  3. previously ... Sending/Collecting is complicated. Single collection server. Tedious to configure new metric collection or creation. Calculating metrics from files is expensive.
  4. bottlenecks ... Poll based collection server. Not easy (!fun) to configure new metric collection or creation. = grunt work for ops-engineer, uhhhh....
  5. enabling technology: Graphite, RabbitMQ, Graphite Local Proxy, RockSteady (w/ Esper)
  6-7. path to graph: 1min.juicer.output.apple.sc1.jcr1 20 1276822626; echo "1min.juicer.output.apple.sc1.jcr1 20 1276822626" | nc localhost 3400
  8-10. graph
  11. graph = post event forensic
  12-13. Rocksteady, metric as event: 1min.juicer.common.version.sc1.jcr1 100 1276822626; INSERT INTO Deploy SELECT * FROM Metric(name='common.revision') MATCH_RECOGNIZE ( partition by colo, hostname measures A.value as revision, A.colo as colo, A.hostname as hostname, A.app as app, A.timestamp as timestamp pattern (A) define A as A.value > prev(A.value) )
  14. auto threshold, prediction
  15-16. correlation: Deployment-related problems. Capture sets of metrics when important ones cross a threshold. Determine dependencies, such as CPU to requests per second or response time.
  17. revelation
  18-19. beyond simple metric: Timing info per request. Actual time spent in each component in an application. Map out dependencies, find the exact area of the problem.
  20. what we learned? 1. Make metric sending simple. 2. Nice UI to make sense of data. 3. Real-time processing of metrics rocks.
