Building Rackspace Cloud Monitoring

Slides used at my Strata 2012 presentation about how Rackspace built Cloud Monitoring to monitor thousands of servers.

Published in: Technology, Business
  • Problems = specifically what we’re trying to solve. Hopes/Dreams = things that would be really nice. How We Did It = processes & tools.
  • As the business changed and segments were merged, others were added. Different teams. None of the existing solutions will scale to meet the needs of everyone. Not harnessing the data.
  • Different needs. Internal: hook into ticketing and support. External: monitor anything. One API for everybody.
  • Obvious, really. Proactive.
  • Trending. Capacity planning. Auto scaling.
  • Tooling. API driven: self-service. Awesome reporting.
  • Open source. Distributed systems. Embedded. DevOps.
  • Prefer open source
  • Polyglot system
  • Control: metadata + state; high replication (5 DC); wide rows; easy dump and load. Data: archival; constant writes; 3 DC replication.
  • CRUD operations. Data description. Simple parent-child: simplified relations. Prefixed IDs: an object called ‘foo’ gets an id like ‘foXXXXXXXX’; an object called ‘bar’ gets an id like ‘baXXXXXXXX’. For a 1:* relationship from foo to bar, concatenate the ids to form the column name: foXXXXXXXX:baXXXXXXXX. Selecting all the bars that belong to fo12345678 becomes a slice from fo12345678:ba to fo12345678:ba\xef\xbf\xbf. One-to-many lookups from objects are slices.
  • Stress VISIBILITY!
  • Deployed 120 times in January. System changes treated the same way as code changes.
  • Building Rackspace Cloud Monitoring

    1. Monitoring @ Scale: How Rackspace does it. Gary Dusbabek
    2. Outline: 1. Our Problems; 2. Other Requirements (Hopes and Dreams); 3. How we did it
    3. Our Problems
    4. • Many tens of thousands of servers • Several solutions in place – Merged segments – Wouldn’t scale to all – Tedious (Manpower) – Expensive • Wanted more
    5. Hopes and Dreams
    6. External Users; No Special Cases; Internal Users
    7. Know about problems before customers do.
    8. Gain Insight: Trending; Capacity Planning; More
    9. MONITORING: Throwing away data; Throwing away knowledge; Y U NO BIG DATA?
    10. Availability: Component Failure
    11. Availability: Datacenter Failure
    12. Supportable; Debuggable; Deployable
    13. How We Built It
    14. 100% Ownership; 5 Devs; 2 Ops
    15. Technology Choices
    16. API endpoints; Other services that write to Cassandra; node-whiskey˚; node-cassandra-client˚; node-swiz˚; node-elementtree˚
    17. C projects: Reconnoiter – noitdˆ – stratcondˆ; Scribe
    18. Java projects: Mainly for concurrency; Apache Cassandraˆ; Apache Zookeeper; Apache Thriftˆ; Metric Ingestion; Complex events (Esper)
    19. A Tale of Two Clusters: Used Differently; Control Cluster; Data Cluster
    20. Data Model: Relational Mismatch; Our Approach: One row per account; Prefixed Ids for col names; Concatenate for parent/child
    21. Data Model: Foo objects get ids like ‘foXXX’; Bar objects get ids like ‘baXXX’; Assume 1:* relationship from foo:bar; Bar 456 that is attached to Foo 123 gets column name ‘fo123:ba456’; Select ‘fo123:ba’..‘fo123:ba\xef\xbf\xbf’
    22. Lua/Luvitˆ; Virgo˚ (host agent) https://github.com/racker/virgo ; Vagrant/Chef; Deployment; Development
    23. Open Source Contribution: Dreadnot˚; Inspired by Deployinator; Multi-region deployment tool; https://github.com/racker/dreadnot ; Blog Post - http://bit.ly/xks0qT
    24. Oh noes!
    25. Supportability: Dashboard; KPI Metrics; Logging; Dump/Load (ops tools)
    26. Automation • Testing continuously on buildbot. Deployment • Dreadnot (bb + chef) • Constant deployment and self-monitoring (nagios + graphite) • Single-region upgrades
    27. Documentation: Written by programmers; Examples generated from tests
    28. Thanks! Private Beta: https://cmbeta.api.rackspacecloud.com/ @gdusbabek https://github.com/racker
    29. Image Credits
        Scale: http://www.flickr.com/photos/puuikibeach/4765115333
        Pyramids: http://www.flickr.com/photos/gracewong/93631410
        Hercules: http://www.flickr.com/photos/istolethetv/2203377554/
        Dandelion: http://www.flickr.com/photos/8047705@N02/5572197407/
        Ear: http://www.flickr.com/photos/perpetualplum/3974880498
        Hourglass: http://www.flickr.com/photos/22244945@N00/3278869535
        Eyeball: http://www.flickr.com/photos/miran/6567911705
        Few Gears: http://www.flickr.com/photos/arthurjohnpicton/5364226117
        Many Gears: http://www.flickr.com/photos/mwichary/2294174641
        Tools: http://www.flickr.com/photos/zzpza/3269784239/
        Crane: http://www.flickr.com/photos/katatoniq/2075966238/
        Ants: http://www.flickr.com/photos/dendroica/6170146527
        Choice: http://www.flickr.com/photos/-bast-/349497988
        Old Car: http://www.flickr.com/photos/dok1/353601845/
        Monster Truck: http://www.flickr.com/photos/beadmobile/3279378483/
        Clusters: http://www.flickr.com/photos/nanagyei/6318995952/
        Model: http://www.flickr.com/photos/inl/5097547405/
        Agent: http://www.flickr.com/photos/erix/191965832/
        Chef: http://www.flickr.com/photos/londonmatt/417683733/
        Arrows: http://www.flickr.com/photos/generated/2084287794/
        Dashboard: http://www.flickr.com/photos/80502454@N00/4172458435/
        Legos: http://www.flickr.com/photos/great8/6820722517/
        Xray: http://www.flickr.com/photos/karen_roe/4417259305/
        Document: http://www.flickr.com/photos/tusnelda/6140792529/
        Flowers: http://www.flickr.com/photos/petercastleton/5905455717/
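
The prefixed-id scheme from the Data Model slides (one row per account, parent and child ids concatenated into a column name, children fetched with a slice) can be sketched roughly as below. This is only an illustration under stated assumptions: it uses an in-memory sorted list instead of Cassandra, it assumes column names sort by raw byte order, and `make_id`, `child_column`, and `AccountRow` are hypothetical names that do not come from the talk. The way id suffixes are generated is also a guess, since the slides do not describe it.

```python
import bisect
import secrets
from typing import Optional

# Upper bound of the slice from the slides: the UTF-8 bytes of U+FFFF,
# which sort after any ASCII id suffix.
SLICE_END = b"\xef\xbf\xbf"


def make_id(kind: str, suffix: Optional[str] = None) -> str:
    """Hypothetical helper: first two letters of the object kind plus a suffix."""
    if suffix is None:
        # Assumption: 8 random hex characters; the real scheme is not given.
        suffix = secrets.token_hex(4)
    return kind[:2] + suffix


def child_column(parent_id: str, child_id: str) -> bytes:
    """Concatenate parent and child ids into one column name."""
    return f"{parent_id}:{child_id}".encode("utf-8")


class AccountRow:
    """Toy stand-in for one wide row: column names kept in byte order."""

    def __init__(self):
        self._names = []   # sorted list of column names (bytes)
        self._values = {}  # column name -> value

    def put(self, name: bytes, value: str) -> None:
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start: bytes, end: bytes):
        """Return (name, value) pairs with start <= name <= end."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]

    def children(self, parent_id: str, child_kind: str):
        """All children of one kind for a parent: a single contiguous slice."""
        prefix = f"{parent_id}:{child_kind[:2]}".encode("utf-8")
        return self.slice(prefix, prefix + SLICE_END)


row = AccountRow()
foo = make_id("foo", "123")  # 'fo123'
row.put(child_column(foo, make_id("bar", "456")), "bar 456")
row.put(child_column(foo, make_id("bar", "789")), "bar 789")
row.put(child_column(make_id("foo", "999"), make_id("bar", "001")), "other parent")

print([value for _, value in row.children(foo, "bar")])  # ['bar 456', 'bar 789']
```

Because every child column of `fo123` shares the prefix `fo123:ba`, the one-to-many lookup is a single range scan over adjacent columns rather than a join, which seems to be the point of the "one-to-many from objects are slices" note.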
