OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age

Graphite is a timeseries data charting package, similar to MRTG and Cacti. This talk will cover Graphite starting from the basics to how booking.com scaled it to millions of datapoints per second.

Published in: Software, Technology, Business

Transcript of "OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age "

  1. 1. Graphite Graphs for the modern age
  4. 4. Graphite basics ● Graphite generates graphs from timeseries data – Think MRTG or Cacti – More flexible than those ● Written in Python – This does impact performance ● Web based and easy to use – For once, not a marketing buzzword
  8. 8. The church of Graphs ● Pattern Recognition ● Correlation ● Analytics ● Anomaly detection
  11. 11. Helpful Graphite features ● Out of order data insertion ● Ability to compare corresponding time periods (time travel) ● Custom retention periods
  15. 15. Moving parts ● Relays – Send data to correct backend store ● Pattern matching on metric names ● Consistent hashing ● Storage – Flat, fixed size files ● These are created when the metric is first recorded ● Changing later is hard ● Webapp – Django based application offering a web API and a JavaScript based frontend application
  19. 19. Data output ● Web API – Everything is an HTTP GET – A number of functions for data manipulation ● Graphite offers outputs in multiple formats – Graphical (PNG, SVG) – Structured (JSON, CSV) – Raw data
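Since everything is an HTTP GET, a graph or a JSON payload is just a URL. The sketch below builds a URL for Graphite's `/render` endpoint and uses the `timeShift` function to overlay last week's data on the current series (the "time travel" comparison mentioned earlier). The host name and the metric name are made up for illustration; nothing here is taken from the speaker's setup.

```python
from urllib.parse import urlencode

# Hypothetical Graphite frontend; replace with a real host.
GRAPHITE_BASE = "http://graphite.example.com"


def render_url(targets, start="-1h", fmt="json", base=GRAPHITE_BASE):
    """Build a GET URL for the /render endpoint.

    Each entry in `targets` becomes its own `target` parameter;
    `fmt` selects the output format (for example png, svg, json, csv, raw).
    """
    params = [("target", t) for t in targets]
    params += [("from", start), ("format", fmt)]
    return f"{base}/render?{urlencode(params)}"


# Hypothetical metric name, following a sys.* convention.
current = "sys.web01.cpu.user"
# timeShift moves a series along the time axis so two periods can be compared.
last_week = f'timeShift({current}, "1w")'

url = render_url([current, last_week], start="-1d")
print(url)
```

Switching `fmt` to `png` gives a URL that can be dropped straight into an `<img>` tag.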
  24. 24. Using Graphite ● Custom pages pulling in PNG images – Just <img src="some url here"> ● Using the default frontend – For single, one-off graphs – Debugging problems ● Using built-in dashboards – Users create their own dashboards – Third-party dashboard tools ● Using third-party libraries – JSON is nice for this – Cubism, D3.js, Rickshaw, etc.
  27. 27. Using Graphite ● API – Monitoring – Runtime performance tuning ● Postmortem analytics ● Performance debugging
  30. 30. Making Graphite scale ● Original setup – Small cluster ● Two frontend boxes, two backend – RAID 1+0 with 4 spinning disks ● This works well, with about 200 machines – All those individual files force a lot of seeks
  34. 34. Scaling out - try 1 ● Add more backend boxes – Manual rules to split traffic – Pattern matching based on metric names ● Balancing traffic is hard
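To illustrate why hand-written rules make balancing hard, here is a minimal sketch of pattern-based routing. The rule patterns, backend names and metric names are all hypothetical, and this is not carbon's own rule format; it only shows the idea of "first matching pattern wins".

```python
import re
from collections import Counter

# Hypothetical routing rules: (pattern, backend). First match wins.
RULES = [
    (re.compile(r"^sys\.web"), "backend-1"),
    (re.compile(r"^sys\.db"), "backend-2"),
    (re.compile(r"^user\."), "backend-3"),
]
# Anything that matches no rule falls through to this backend.
DEFAULT_BACKEND = "backend-1"


def route(metric):
    """Return the backend that should store `metric`."""
    for pattern, backend in RULES:
        if pattern.search(metric):
            return backend
    return DEFAULT_BACKEND


# A made-up sample of incoming metric names.
sample = [
    "sys.web01.cpu.user",
    "sys.web02.cpu.user",
    "sys.web03.load",
    "sys.db01.load",
    "user.alice.test",
    "business.bookings.count",
]

# Counting metrics per backend shows how uneven the split can get.
print(Counter(route(m) for m in sample))
```

Every new naming scheme either needs a new rule or lands on the default backend, so the load distribution depends on how people happen to name their metrics.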
  38. 38. Scaling up ● Replace spinning disks with SSDs ● Massive performance improvement due to more IOPS – Still not as much as we needed ● Losing an SSD meant we had a box die – This has been fixed ● SSDs are not as reliable as spinning rust – SSDs last for between 12 and 14 months
  40. 40. Sharding – take II ● At about 10 storage servers, manually maintaining regular expressions became painful ● Keeping disk usage balanced was even harder – Anyone is allowed to create graphs
  41. 41. Sharding - take II ● Replace regular expressions with consistent hashing ● Switch to RAID 0 – We have switched back to RAID 1 ● Store data on two nodes in each ring ● Mirror rings in datacenters ● Shuffle metrics to avoid losing data and disk space.
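A rough sketch of the idea behind this slide: a consistent hash ring that places each metric on two distinct nodes. This is a generic textbook-style ring, not carbon's actual hashing code; the class name, the node names, the use of MD5 and the number of virtual nodes are assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        self._node_count = len(set(nodes))
        # Each node appears `vnodes` times on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        # First 32 bits of an MD5 digest, used as a position on the ring.
        return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

    def get_nodes(self, metric, replicas=2):
        """Return `replicas` distinct nodes for `metric`, walking clockwise."""
        if replicas > self._node_count:
            raise ValueError("not enough nodes for the requested replicas")
        i = bisect.bisect(self._keys, self._hash(metric))
        chosen = []
        while len(chosen) < replicas:
            node = self._ring[i % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
            i += 1
        return chosen


# Hypothetical ring of ten storage servers.
ring = HashRing([f"store-{n:02d}" for n in range(10)])
print(ring.get_nodes("sys.web01.cpu.user"))
```

No rules need to be maintained by hand, and adding a node only relocates the metrics whose ring position now falls on the new node. Existing files still have to be moved to match, which fits the "shuffle metrics" step on the slide.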
  42. 42. Disk usage ● Graphite uses a lot of disk I/O – Background graph is in thousands on the Y axis. – Individual files increase seek times ● There are a lot of stat(2) calls – This hasn't been investigated yet
  44. 44. Naming conventions ● Graphite has no rules for names ● We adopted: – sys.* is for system metrics – user.* is for testing/other stuff – Anything else which makes sense is acceptable
  46. 46. Collecting metrics ● We have all sorts of homegrown scripts – Shell – Perl – Python – Powershell ● Originally used collectd for system metrics – The version of collectd we were using had memory usage issues ● These were fixed later
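For readers who have not written such a script, a homegrown collector can be very small. The sketch below assumes carbon's plaintext line protocol (one `path value timestamp` line per datapoint, by default on TCP port 2003); the relay host and the metric name are hypothetical, and the talk does not show the scripts actually in use.

```python
import socket
import time

CARBON_HOST = "relay.example.com"  # hypothetical relay address
CARBON_PORT = 2003  # carbon's default plaintext listener port


def format_line(metric, value, timestamp=None):
    """Format one datapoint as a plaintext protocol line."""
    if timestamp is None:
        timestamp = time.time()
    return f"{metric} {value} {int(timestamp)}\n"


def send(lines, host=CARBON_HOST, port=CARBON_PORT):
    """Send already formatted lines to a relay over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Build a line for a made-up system metric under the sys.* prefix.
line = format_line("sys.web01.load.shortterm", 0.42, 1400000000)
print(line, end="")
```

Calling `send([line])` would deliver the datapoint; it is left out of the example so that the snippet runs without a reachable relay.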
  48. 48. Collecting metrics ● System metrics are now collected by diamond ● Diamond is a Python application – Base framework + metric collection scripts – Added custom patches for internal metrics – Added patches to send monitoring data directly to Nagios for passive checks
  53. 53. Relay issues ● The Python relaying implementation eats CPU ● Started with relays directly on the cluster – Still need more CPU ● Added relays in each datacenter – Still need more CPU ● Ran multiple instances on each relay host – Still need more CPU ● Finally rewrote in C and added more relay hosts – This works for us (and we have breathing room)
  55. 55. Data visibility ● We send data to multiple places – Metrics get dropped ● Small application in Go which gets data from multiple locations and gives us a single merged result set – Prototyped in Python, which was too slow
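The merging idea can be sketched in a few lines. This is not the Go application from the talk; it is a simplified illustration that assumes every store returns the same series as an equally long list of values aligned to the same timestamps, with `None` marking a dropped datapoint.

```python
def merge_series(*copies):
    """Merge several copies of one series, filling gaps from other copies.

    For each timestamp slot, the first non-missing value wins. A slot stays
    None only if every copy dropped that datapoint.
    """
    if len({len(c) for c in copies}) > 1:
        raise ValueError("all copies must cover the same number of points")
    return [
        next((v for v in points if v is not None), None)
        for points in zip(*copies)
    ]


# Two hypothetical copies of the same metric, each missing different points.
store_a = [1.0, None, 3.0, None]
store_b = [1.0, 2.0, None, None]

print(merge_series(store_a, store_b))
```

Because each copy tends to drop different datapoints, the merged series has fewer gaps than any single copy.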
  57. 57. statsd ● We had statsd running, but unused for a long time – statsd use is still relatively small – Only a few internal applications use it – We already have an analytics framework for this ● The PCI vulnerability scanner reliably crashed it – This was patched and pushed upstream
  58. 58. Business metrics ● Turns out, developers like Graphite – They don't reliably understand whisper semantics ● Querying Graphite like SQL doesn't work – They create a large number of named metrics ● foo.bar.YYYY-MM-DD ● Disk space use is a sudden concern – Especially when you don't try and restrict this (feature, not bug)
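A back-of-the-envelope calculation shows why date-stamped metric names eat disk space. Each named metric gets its own fixed size file, allocated in full when the metric is first recorded. The byte sizes below (16 byte header, 12 bytes per archive description, 12 bytes per datapoint) reflect my understanding of the whisper file layout and should be treated as an assumption; the retention policy is hypothetical.

```python
# Assumed whisper layout sizes, in bytes (not confirmed by the talk).
METADATA_SIZE = 16
ARCHIVE_INFO_SIZE = 12
POINT_SIZE = 12


def whisper_file_size(archives):
    """Estimate file size for a list of (seconds_per_point, points) archives."""
    header = METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives)
    data = sum(points * POINT_SIZE for _, points in archives)
    return header + data


# Hypothetical retention: one point per minute, kept for 30 days.
one_minute_for_30_days = [(60, 30 * 24 * 60)]
per_file = whisper_file_size(one_minute_for_30_days)

# One new foo.bar.YYYY-MM-DD metric per day for a year means 365 files,
# each preallocated in full although it only receives one day of data.
per_year = 365 * per_file

print(per_file, per_year)
```

Under these assumptions a single logical metric named by date costs roughly 365 times as much disk as the same data written to one metric name.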
  59. 59. Scaling out clusters ● Different groups have different requirements – Multiple backend rings, same frontend ● Unix systems ● Windows ● Networking ● Business metrics ● User testing
  60. 60. Current problems ● Hardware – Need more CPU ● Especially on the frontends where we do a lot of maths – Better disk reliability on SSDs ● Replacing disks is expensive – More disk I/O ● SSDs are now maxed out under stat(2) calls ● Testing Fusion IO cards – 10% faster, but we don't know about reliability yet
  61. 61. Current problems ● People – If you need a graph, put the data in Graphite ● Even if the data isn't time series data ● Frontend scalability – The default frontend doesn't work well with a few thousand hosts ● Software upgrades – Our last Whisper upgrade caused data recording to stop
  62. 62. Current problems ● Manageability – Getting rid of older, non-required metrics is a lot of effort – Adding hosts into a ring requires manual rebalancing effort
  63. 63. Future possibilities ● Testing Cassandra as a backend (cyanite) ● Anomaly detection – Tested Skyline, didn't scale ● More business metrics ● Sparse metrics – Metrics with a lot of nulls, but potentially a lot of named metrics involved
  64. 64. Peopleware ● Hiring people to work on interesting challenges – Sysadmins, developers – http://www.booking.com/jobs ● Booking.com will be sponsoring a Graphite dev summit in June (tentatively just before the devopsdays Amsterdam event)
  65. 65. Reference URLs ● Graphite – https://github.com/graphite-project ● Graphite API – http://graphite.readthedocs.org/en/latest/functions.html ● C Carbon relay – https://github.com/grobian/carbon-c-relay ● Zipper – https://github.com/grobian/carbonserver ● Cyanite – https://github.com/pyr/cyanite – https://github.com/brutasse/graphite-cyanite
  66. 66. ?