#lspe Building a Monitoring Framework using DTrace and MongoDB

A talk I gave at the Large Scale Production Engineering meetup at Yahoo! about building monitoring tools and how to use DTrace to get more out of your monitoring data.

  1. Building a Monitoring Framework Using DTrace and MongoDB
     Dan Kimmel, Software Engineer, Delphix
     dan.kimmel@delphix.com
  2. Background
     ● Building a performance monitoring framework on illumos using DTrace
     ● It's monitoring our data virtualization engine
       ○ That means "database storage virtualization and rigorous administration automation" for those who didn't have time to study up on our marketing lingo
     ● Our users are mostly DBAs
     ● The monitoring framework itself is not released yet
  3. What to collect?
     ● DBAs have one performance metric they care about for their database storage
       ○ I/O latency, because it translates to database I/O latency, which translates to end-user happiness
     ● But to make the performance data actionable, they usually need more than that single measurement
       ○ Luckily, DTrace always has more data
  4. Virtualized Database Storage (as most people imagine it)
     [Diagram: the database I/O path runs from the Database Process (Oracle, SQL Server, others on the way) over the network to the Storage Appliance (the Delphix Engine)]
  5. Virtualized Database Storage
     [Diagram: the database I/O path runs from the Database Process (Oracle, SQL Server, others on the way) through the Database Host OS (Windows, Linux, Solaris, *BSD, HP-UX, AIX) and its Network-Mounted Storage Layer (NFS/iSCSI), across the network to the Delphix OS running on a hypervisor*, then through Delphix FS down to storage]
     * Sometimes the DB host is running on a hypervisor too, or even on the same hypervisor
  6. Latency can come from anywhere
     [Diagram: the same I/O path annotated with possible bottlenecks on the left (out of memory? out of CPU? out of bandwidth? out of IOPS?) and sources of latency on the right (NFS client latency, network latency, queuing latency, FS latency, device latency)]
  7. Investigation Requirements
     Want users to be able to dig deeper during a performance investigation.
     ● Show many different sources of latency and show many possible bottlenecks
       ○ i.e. collect data from all levels of the I/O stack
       ○ This is something that we're still working on, and sadly, not all levels of the stack have DTrace
     ● Allow users to narrow down the cause within one layer
       ○ Concepts were inspired by other DTrace-based analytics tools from Sun and Joyent
  8. Narrowing down the cause
     After looking at a high-level view of the layers, a user sees NFS server latency has some slow outliers.
     1. NFS latency by client IP address
        ○ The client at 187.124.26.12 looks slowest
     2. NFS latency for 187... by operation
        ○ Writes look like the slow operation
     3. NFS write latency for 187... by synchronous
        ○ Synchronous writes are slower than normal
  9. How that exercise helped
     ● The user just learned a lot about the problem
       ○ The user might be able to solve it themselves by (for instance) upgrading or expanding the storage we sit on top of to handle synchronous writes better
       ○ They can also submit a much more useful bug report or speak effectively to our support staff
     ● Saves them time, saves us time!
  10. DTrace is the perfect tool
      ● To split results on a variable, collect the variable and use it as an additional key in your aggregations.
      ● To narrow down a variable, add a condition.
      // Pseudocode alert!
      0. probe { @latency = quantize(timestamp - start); }
      1. probe { @latency[ip] = quantize(timestamp - start); }
      2. probe /ip == "187..."/ { @latency[operation] = quantize(timestamp - start); }
      3. probe /ip == "187..." && operation == "write"/ { @latency[synchronous] = quantize(timestamp - start); }
  11. How we built "narrowing down"
      ● Templated D scripts for collecting data internal to Delphix OS
      ● Allow the user to specify constraints on variables in each template
        ○ Translate these into DTrace conditions
      ● Allow the user to specify which variables they want to display
      ● Fill out a template and run the resulting script
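To make the template idea concrete, here is a minimal sketch of how user constraints and a display variable might be substituted into a D script and run. It is an illustration only, not Delphix's implementation: the template text, fill_template(), and the specific nfsv3 provider probes and fields used here are assumptions.

    # Sketch only (not Delphix's code): fill a D script template with
    # user-chosen constraints and an aggregation key, then run it.
    import subprocess
    import tempfile

    # D script template; {predicate} and {key} are substituted per request.
    NFS_TEMPLATE = """
    nfsv3:::op-read-start, nfsv3:::op-write-start
    {{
        start[args[1]->noi_xid] = timestamp;
    }}

    nfsv3:::op-read-done, nfsv3:::op-write-done
    /start[args[1]->noi_xid] && ({predicate})/
    {{
        @latency[{key}] = quantize(timestamp - start[args[1]->noi_xid]);
        start[args[1]->noi_xid] = 0;
    }}
    """

    def fill_template(constraints, display_var):
        """Turn user constraints into a DTrace predicate, and the variable
        the user wants to split on into an aggregation key."""
        predicate = " && ".join(constraints) if constraints else "1"
        return NFS_TEMPLATE.format(predicate=predicate, key=display_var)

    if __name__ == "__main__":
        # "Show NFS latency for one client, split by operation."
        script = fill_template(
            constraints=['args[0]->ci_remote == "187.124.26.12"'],
            display_var="probename",
        )
        # Write the filled-out script to a file and run it
        # (requires dtrace and sufficient privileges).
        with tempfile.NamedTemporaryFile("w", suffix=".d", delete=False) as f:
            f.write(script)
        subprocess.run(["dtrace", "-q", "-s", f.name])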
  12. Enhancing Supportability
      Our support staff hears this question frequently:
      "We got reports of slow DB accesses last Friday, but now everything is back to normal. Can you help us debug what went wrong?"
  13. Historical data is important too
      ● We always read a few system-wide statistics
      ● We store all readings into MongoDB
        ○ We're not really concerned about ACID guarantees
        ○ We don't know exactly what variables we will be collecting for each collector ahead of time
        ○ MongoDB has a couple of features that are specifically made for logging that we use
        ○ It was easy to configure and use
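Because each collector can record different variables, schemaless documents are a natural fit. Below is a minimal sketch of storing one reading with pymongo; the database, collection, and field names (monitoring.readings, collector, ts, interval, data) are assumptions, not the actual Delphix schema.

    # Sketch only: store one reading as a document; different collectors can
    # write different fields without any schema changes.
    from datetime import datetime, timezone
    from pymongo import MongoClient

    readings = MongoClient("mongodb://localhost:27017")["monitoring"]["readings"]

    readings.insert_one({
        "collector": "nfs_latency",           # which templated script produced this
        "ts": datetime.now(timezone.utc),     # when the reading was taken
        "interval": "1s",                     # width of the reading's time bucket
        "data": {                             # whatever variables this collector emits
            "client": "187.124.26.12",
            "op": "write",
            "sync": True,
            "latency_histogram": {"1ms": 120, "2ms": 45, "4ms": 7},
        },
    })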
  14. Storing (lots of) historical data
      The collected data piles up quickly!
      ● Don't collect data too frequently
      ● Compress readings into larger and larger time intervals as the readings age
        ○ We implemented this in the caller, but could have used MongoDB's MapReduce as well
      ● Eventually, delete them (after ~2 weeks)
        ○ We used MongoDB's "time-to-live indexes" to handle this automatically; they work nicely
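A rough sketch of both aging mechanisms, continuing the assumed schema from the previous sketch: a TTL index (a real MongoDB feature) handles the ~2-week deletion, and a simple caller-side roll-up compresses old per-second readings into per-minute buckets. The field names and interval labels are illustrative only.

    # Sketch only: age out readings, using the assumed schema above.
    from collections import defaultdict
    from datetime import datetime, timedelta, timezone
    from pymongo import MongoClient

    readings = MongoClient()["monitoring"]["readings"]

    # 1. TTL index: MongoDB deletes documents automatically once their
    #    "ts" field is more than ~2 weeks old.
    readings.create_index("ts", expireAfterSeconds=14 * 24 * 3600)

    # 2. Caller-side compression: roll per-second readings older than a day
    #    up into per-minute readings (the talk notes MapReduce could do this too).
    cutoff = datetime.now(timezone.utc) - timedelta(days=1)
    buckets = defaultdict(int)
    for r in readings.find({"interval": "1s", "ts": {"$lt": cutoff}}):
        minute = r["ts"].replace(second=0, microsecond=0)
        buckets[(r["collector"], minute)] += r["data"].get("count", 1)

    for (collector, minute), count in buckets.items():
        readings.insert_one({"collector": collector, "ts": minute,
                             "interval": "1m", "data": {"count": count}})
    readings.delete_many({"interval": "1s", "ts": {"$lt": cutoff}})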
  15. Dealing with the Edge Cases
      ● If an investigation is ongoing, performance data could be compressed or deleted if the investigation takes too long
      ● Users can prevent data from being compressed or deleted by explicitly saving it
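One possible way to implement "saving" on top of the mechanisms above, again an assumption rather than the described implementation: flag the documents so the compression pass skips them, and rename the TTL-indexed field so MongoDB never expires them (a TTL index does not delete documents that lack the indexed field).

    # Sketch only: pin readings so the aging logic leaves them alone.
    from pymongo import MongoClient

    readings = MongoClient()["monitoring"]["readings"]

    def save_readings(query):
        """Mark matching readings as saved (the compression pass should skip
        these) and rename the TTL-indexed "ts" field so the TTL index never
        expires them; documents missing the indexed field are not deleted."""
        readings.update_many(query, {
            "$set": {"saved": True},
            "$rename": {"ts": "saved_ts"},
        })

    # Usage, with placeholder window bounds:
    # save_readings({"collector": "nfs_latency",
    #                "ts": {"$gte": window_start, "$lt": window_end}})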
  16. Summary
      ● We used DTrace to allow customers to dig deeper on performance issues
        ○ Customers will love it*
        ○ Our support staff will love it*
      * at least, that's the hope!
  17. Thanks!
