Blueflood: Open Source Metrics Processing at CassandraEU 2013

Describes how Blueflood works and future development direction

Published in: Technology, Education


  1. 1. Blueflood Simple Metrics Processing Gary Dusbabek • Cassandra EU 2013
  2. 2. Motivation Building Blocks Future Stuff
  3. 3. Motivation
  4. 4. Get the Data In
  5. 5. Each check generates 2-20 metrics Multiply by data centers
  6. 6. Currently handling 120 million metrics per hour
  7. 7. 40 million aggregate Cassandra write operations per hour
  8. 8. Get the Data Out Fast Graphs! Think: Dashboards SLA is important
  9. 9. Get the Data Out Fast Graphs! Think: Dashboards SLA is important
  10. 10. Get the Data Out Fast Graphs! Think: Dashboards SLA is important
  11. 11. Multitenant
  12. 12. Different SLA expectations
  13. 13. Hard
  14. 14. Tenants imply Metadata
  15. 15. Hampers generic computing Systems
  16. 16. Lipstick system
  17. 17. Nice to Have
  18. 18. Not Mission Critical
  19. 19. Don’t Break the Bank
  20. 20. Avoid Hadoop
  21. 21. HATE Hadoop
  22. 22. HATE Hadoop
  23. 23. We Ended Up With This Ingestion API Ingestion Transform Query API Metadata + Cache Rollup Scheduler State Management Java Ingestion Library Java Rollup Library Database (Cassandra) Java Query Library
  24. 24. We Ended Up With This Ingestion API Ingestion Transform Query API Metadata + Cache Rollup Scheduler State Management Java Ingestion Library Java Rollup Library Database (Cassandra) Java Query Library
  25. 25. Cassandra Database (Cassandra)
  26. 26. Cassandra 1.0, 1.1, 1.2 Compatible No 2.0 yet  
  27. 27. Cassandra Experimented with CQL very early on CQL 1.0 time frame  
  28. 28. Cassandra Experimented with CQL very early on CQL 1.0 time frame  
  29. 29. Cassandra Astyanax now Mostly happy with it Connection pool implementation is very sensitive to network bumps
  30. 30. Cassandra Experimented with various compaction strategies No real winner Leveldb bugs in 1.0 made it almost a non-starter
  31. 31. Cassandra CASSANDRA-5685 Per-CF TTLs Doesn’t help us Might help you
  32. 32. Cassandra CASSANDRA-3974 TTL histogram used to give input on which sstables are good candidates for compaction (size-tiered only)
  33. 33. Cassandra CASSANDRA-5228 Track max TTL per sstable to expire the whole thing. We could use this by using bucketed CFs
  34. 34. Anatomy of a Metric One dimensional signal Has an ID We call this a locator Mostly opaque Tuple of (tenantId [, other things, …])
  35. 35. Anatomy of a Metric Example: 6335,web01,ping,bytes
  36. 36. Anatomy of a Metric Stuff whatever you want in there Just don’t change it It becomes a key
  37. 37. Anatomy of a Metric Has a type associated with it: long, double, string, boolean Type determines on-disk serialization
  38. 38. Example
        {
          "timestamp": 1319222001982,
          "monitoring_zone_id": "mzXXXXXXXX",
          "available": true,
          "status": "code=200,rt=0.257s,bytes=0",
          "metrics": {
            "bytes": {
              "type": "i",
              "data": "0"
            },
            "tt_firstbyte": {
              "type": "I",
              "data": "257"
            },
            "tt_connect": {
              "type": "I",
              "data": "128"
            },
            "code": {
              "type": "s",
              "data": "200"
            },
            "duration": {
              "type": "I",
              "data": "257"
            }
          }
        }
  39. 39. Anatomy of a Metric Sometimes has units Example: seconds, bytes, light years We guess on this
  40. 40. Column Families Metrics Full resolution One per granularity (5m, 20m, 60m, 240m, 1440m) One row per metric Locator is the key
  41. 41. Column Families Metrics No Bucketing Will be required for high frequency metrics Solution is easy Just complicates Locator resolution
  42. 42. Column Families Metadata One row per metric Rollup State Nasty map for tracking shard state Active Metrics Shard to list of locators
  43. 43. Column Families STRING & BOOLEAN Speshul Only updated when values change Plumbing keeps old values in memory
  44. 44. Libraries Java Ingestion Library Java Rollup Library Database (Cassandra) Java Query Library
  45. 45. Ingestion Library insert_metrics(list<metric>)
  46. 46. Ingestion Library update_state(shard, granularity, slot) SLOT == Bucket of time
  47. 47. Rollup Library get_active_locators(shard) get_state(shard, granularity, slot) get_metrics(from, to, locator, granularity) write_rollups(list<rollup>) update_state(shard, granularity, slot)
  48. 48. Rollup Library Supports bulk operations outside of the service Enables tools to be written
  49. 49. Rollup Library Rollups contain count, min, max, mean, variance Serialization is versioned
  50. 50. Query Library get_data(from, to, granularity) get_data(from, to, num_points)
  51. 51. Metadata & Cache Metadata + Cache State Management Java Ingestion Library Java Rollup Library Database (Cassandra) Java Query Library
  52. 52. Metadata & Cache Integrated into services (ingestion & rollup) Backed by Cassandra Supports different eviction strategies based on needs
  53. 53. Metadata & Cache Example 1: TTLs are linked to tenants and are not known when metrics are ingested A separate API must be consulted
  54. 54. Metadata & Cache Example 2: Units are valuable only at query time, but are not included with metrics Heuristically guess and store these
  55. 55. Rollup Schedule Service Metadata + Cache Rollup Scheduler State Management Java Ingestion Library Java Rollup Library Database (Cassandra) Java Query Library
  56. 56. Rollup Schedule Service Problem: Divide time into buckets without scratching at infinity Identify them using a finite set of keys
  57. 57. Rollup Schedule Service Solution: Order preserving consistent hashing for timestamps
  58. 58. Rollup Schedule Service Imagine a two week period divided into slots the size of each granularity
  59. 59. Rollup Schedule Service 4032 5m slots 1008 20m slots 336 60m slots 84 240m slots 14 1440m slots
  60. 60. Rollup Schedule Service Gives us a way of consistently addressing and bucketing time ranges As time increases, so does the slot it hashes to (until it wraps to zero)
  61. 61. Rollup Schedule Service When do we roll up? Whenever an active slot a) has not been updated in N seconds b) is M seconds old
  62. 62. Rollup Schedule Service What about late data? Late data can be ingested for 24 hours
  63. 63. Ingestion Processors Ingestion Transform Metadata + Cache Rollup Scheduler State Management Java Ingestion Library Java Rollup Library Database (Cassandra) Java Query Library
  64. 64. Ingestion Processors Every metric is not built the same way They come from different places Processors allow you to make them consistent Can be synchronous or asynchronous
  65. 65. API Endpoints Ingestion API Ingestion Transform Query API Metadata + Cache Rollup Scheduler State Management Java Ingestion Library Java Rollup Library Database (Cassandra) Java Query Library
  66. 66. API Endpoints Why not ship it with API endpoints? External forces
  67. 67. API Endpoints Decided to make them Modular
  68. 68. API Endpoints We do ship reference API endpoints UDP Ingestion HTTP Ingestion HTTP Query
  69. 69. API Endpoints Downside? More work for you
  70. 70. API Endpoints Upside? We ♥ Pull Requests
  71. 71. How Does It Scale? Ingestion scales linearly Add ingestion nodes until Cassandra is the bottleneck
  72. 72. How Does It Scale? Two ingestors per DC Only one per DC is active Double ingest
  73. 73. How Does It Scale? Rollups scale [almost] linearly by spreading out shard ownership Shards are currently pegged at 128 Ok to have multiple nodes own a shard Zookeeper is a soft-dependency
  74. 74. Future Stuff Local ingestion durability
  75. 75. Future Stuff Richer metadata API Example: tag metrics and then use those tags as a query facet Will require an index Experimenting with ElasticSearch Home-rolled bitmap indexes
  76. 76. Future Stuff Pre-aggregated Metrics Histograms (partially implemented) Counters, Timers, Gauges, Sets
  77. 77. Future Stuff Deep statsd and graphite integration (active work) Statsd is hard because counts get reset after a flush
  78. 78. Future Stuff Graphite is just involved (new rollup types) Whisper DB interface Then hack carbon to support it Already pluggable, just needs integration
  79. 79. Thanks! http://blueflood.io blueflood-discuss@googlegroups.com Freenode: #blueflood github.com/rackerlabs/blueflood Twitter: @gdusbabek
  80. 80. Image Credits All images for this presentation come from the Flickr commons collection http://www.flickr.com/commons/ flood guide motivation cows jet apartments groups lipstick elephant containers anatomy columns library cache money railyard processors terminal fish future thanks http://www.flickr.com/photos/keenepubliclibrary/2593172720/sizes/z/ http://www.flickr.com/photos/field_museum_library/3796303860/ http://www.flickr.com/photos/statelibraryofnsw/4944459226/sizes/l/in/photolist-8wVDt1/ http://www.flickr.com/photos/nationalarchives/7457004362/sizes/l/ http://www.flickr.com/photos/sdasmarchives/4564334397/sizes/o/ http://www.flickr.com/photos/nypl/3110619126/sizes/o/ http://www.flickr.com/photos/fylkesarkiv/4545544268/sizes/l/ http://www.flickr.com/photos/library_of_congress/2179918784/sizes/o/ http://www.flickr.com/photos/statelibraryofnsw/2963006536/sizes/o/ http://www.flickr.com/photos/smu_cul_digitalcollections/9526924556/sizes/l/ http://www.flickr.com/photos/usnationalarchives/5573758997/sizes/l/ http://www.flickr.com/photos/cornelluniversitylibrary/3485933761/sizes/l/ http://www.flickr.com/photos/statelibraryofnsw/4414971043/sizes/l/ http://www.flickr.com/photos/smu_cul_digitalcollections/8519861690/sizes/l/ http://www.flickr.com/photos/nlireland/8443250313/sizes/h/ http://www.flickr.com/photos/national_library_of_australia_commons/6174084474/sizes/l/ http://www.flickr.com/photos/nypl/3110609190/sizes/o/ http://www.flickr.com/photos/hartlepool_museum/4398630456/sizes/o/ http://www.flickr.com/photos/usnationalarchives/7158774350/sizes/l/ http://www.flickr.com/photos/nlireland/9490851253/sizes/l/
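
The slot scheme from the Rollup Schedule Service slides (56 to 60) can be illustrated with a small sketch. It assumes a slot is simply the timestamp divided by the granularity, taken modulo the number of slots in a two-week window. That formula is an assumption chosen to match the slide counts (4032, 1008, 336, 84, 14) and the "wraps to zero" behavior; it is not taken from the Blueflood source. The names `slots_per_window` and `slot_for` are hypothetical.

```python
# Two-week window from the slides, expressed in minutes.
WINDOW_MINUTES = 14 * 24 * 60

# Granularities named in the slides, in minutes.
GRANULARITIES = (5, 20, 60, 240, 1440)


def slots_per_window(granularity_minutes: int) -> int:
    """Number of slots of the given size that fit in the two-week window."""
    return WINDOW_MINUTES // granularity_minutes


def slot_for(timestamp_ms: int, granularity_minutes: int) -> int:
    """Map a millisecond timestamp to a slot index (assumed formula).

    The index grows with time and wraps to zero after two weeks, so a
    finite set of keys can address an unbounded timeline.
    """
    granularity_ms = granularity_minutes * 60 * 1000
    return (timestamp_ms // granularity_ms) % slots_per_window(granularity_minutes)


if __name__ == "__main__":
    for g in GRANULARITIES:
        print(f"{g}m: {slots_per_window(g)} slots")
    # Timestamp taken from the example payload on slide 38.
    print(slot_for(1319222001982, 5))
```

Under this assumption, two timestamps that fall in the same 5-minute bucket map to the same slot, which is what lets the ingestion side mark a (shard, granularity, slot) triple as dirty and the rollup side find it later.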
