#lspe Building a Monitoring Framework using DTrace and MongoDB
A talk I gave at the Large Scale Production Engineering meetup at Yahoo! about building monitoring tools and how to use DTrace to get more out of your monitoring data.

Presentation Transcript

  • Building a Monitoring Framework Using DTrace and MongoDB
    Dan Kimmel
    Software Engineer, Delphix
    dan.kimmel@delphix.com
  • Background
    ● Building a performance monitoring framework on illumos using DTrace
    ● It's monitoring our data virtualization engine
      ○ That means "database storage virtualization and rigorous administration automation" for those who didn't have time to study up on our marketing lingo
    ● Our users are mostly DBAs
    ● The monitoring framework itself is not released yet
  • What to collect?
    ● DBAs have one performance metric they care about for their database storage
      ○ I/O latency, because it translates to database I/O latency, which translates to end-user happiness
    ● But to make the performance data actionable, they usually need more than that single measurement
      ○ Luckily, DTrace always has more data
  • Virtualized Database Storage (as most people imagine it)
    [Diagram: a database process (Oracle, SQL Server, others on the way) sends its database I/O over the network to a storage appliance (the Delphix Engine)]
  • Virtualized Database Storage
    [Diagram: the database process (Oracle, SQL Server, others on the way) runs on a database host OS (Windows, Linux, Solaris, *BSD, HP-UX, AIX); the database I/O path goes through a network-mounted storage layer (NFS/iSCSI) and across the network to the Delphix OS, where the Delphix FS sits on top of storage, all running on a hypervisor*]
    * Sometimes the DB host is running on a hypervisor too, or even on the same hypervisor
  • Latency can come from anywhere
    [Diagram: the same stack, annotated with bottlenecks on the left and sources of latency on the right]
    ● Bottlenecks: out of memory? out of CPU? out of bandwidth? out of IOPS?
    ● Sources of latency: NFS client latency, network latency, queuing latency, FS latency, device latency
  • Investigation Requirements
    We want users to be able to dig deeper during a performance investigation.
    ● Show many different sources of latency and show many possible bottlenecks
      ○ i.e. collect data from all levels of the I/O stack
      ○ This is something that we're still working on, and sadly, not all levels of the stack have DTrace
    ● Allow users to narrow down the cause within one layer
      ○ Concepts were inspired by other DTrace-based analytics tools from Sun and Joyent
  • Narrowing down the cause
    After looking at a high-level view of the layers, a user sees NFS server latency has some slow outliers.
    1. NFS latency by client IP address
       ○ The client at 187.124.26.12 looks slowest
    2. NFS latency for 187... by operation
       ○ Writes look like the slow operation
    3. NFS write latency for 187... by synchronous
       ○ Synchronous writes are slower than normal
  • How that exercise helped
    ● The user just learned a lot about the problem
      ○ The user might be able to solve it themselves by (for instance) upgrading or expanding the storage we sit on top of to handle synchronous writes better
      ○ They can also submit a much more useful bug report or speak effectively to our support staff
    ● Saves them time, saves us time!
  • DTrace is the perfect tool
    ● To split results on a variable, collect the variable and use it as an additional key in your aggregations.
    ● To narrow down a variable, add a condition.

    // Pseudocode alert! Latency is the time elapsed since the operation started.
    0. probe { @latency = quantize(timestamp - start); }
    1. probe { @latency[ip] = quantize(timestamp - start); }
    2. probe /ip == "187..."/ { @latency[operation] = quantize(timestamp - start); }
    3. probe /ip == "187..." && operation == "write"/ { @latency[synchronous] = quantize(timestamp - start); }
  • How we built "narrowing down"
    ● Templated D scripts for collecting data internal to Delphix OS
    ● Allow the user to specify constraints on variables in each template
      ○ Translate these into DTrace conditions
    ● Allow the user to specify which variables they want to display
    ● Fill out a template and run the resulting script (sketched below)
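A minimal sketch of the template-filling step, assuming the driver is written in Python (the talk doesn't say what language the framework uses) and that a dtrace binary is on PATH. The nfsv3 provider probes are real illumos probes, but NFS_TEMPLATE, run_collector, and the template slots are hypothetical, not Delphix's actual code:

    import subprocess
    import tempfile
    from string import Template

    # $predicate carries the user's constraints (already translated into a
    # DTrace condition) and $keys carries the variables they want displayed.
    NFS_TEMPLATE = Template("""
    nfsv3:::op-read-start, nfsv3:::op-write-start
    {
        start[args[1]->noi_xid] = timestamp;
    }

    nfsv3:::op-read-done, nfsv3:::op-write-done
    /start[args[1]->noi_xid] != 0 && ($predicate)/
    {
        @latency[$keys] = quantize(timestamp - start[args[1]->noi_xid]);
        start[args[1]->noi_xid] = 0;
    }
    """)

    def run_collector(predicate="1", keys="probename"):
        """Fill out the template and run the resulting script for ~10s."""
        script = NFS_TEMPLATE.substitute(predicate=predicate, keys=keys)
        with tempfile.NamedTemporaryFile("w", suffix=".d", delete=False) as f:
            f.write(script)
        # dtrace prints any aggregations when the child command exits
        return subprocess.run(["dtrace", "-s", f.name, "-c", "sleep 10"],
                              capture_output=True, text=True)

    # Step 2 of the walkthrough: one client's NFS latency, split by operation
    run_collector(predicate='args[0]->ci_remote == "187.124.26.12"',
                  keys="probename")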
  • Enhancing Supportability
    Our support staff hears this question frequently:
    "We got reports of slow DB accesses last Friday, but now everything is back to normal. Can you help us debug what went wrong?"
  • Historical data is important too
    ● We always read a few system-wide statistics
    ● We store all readings into MongoDB (a sketch follows)
      ○ We're not really concerned about ACID guarantees
      ○ We don't know exactly what variables we will be collecting for each collector ahead of time
      ○ MongoDB has a couple of features that are specifically made for logging that we use
      ○ It was easy to configure and use
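As a sketch of why the schemaless model fits, assuming pymongo and a local mongod (the database, collection, and field names here are hypothetical, not Delphix's actual schema), two collectors with completely different shapes can share one collection:

    from datetime import datetime, timedelta, timezone
    from pymongo import MongoClient

    # tz_aware=True so dates round-trip as UTC-aware datetimes
    readings = MongoClient(tz_aware=True)["monitoring"]["readings"]
    now = datetime.now(timezone.utc)

    # No schema is declared up front: each collector stores whichever
    # variables its D script happened to aggregate on.
    readings.insert_one({
        "collector": "nfs-latency",
        "at": now,
        "raw": True,                   # fine-grained; eligible for compression
        "keys": {"client": "187.124.26.12", "op": "write"},
        "value": 1834,                 # e.g. latency in microseconds
        "expireAt": now + timedelta(weeks=2),
    })
    readings.insert_one({
        "collector": "cpu",
        "at": now,
        "raw": True,
        "value": 87.5,                 # percent busy; different shape, same collection
        "expireAt": now + timedelta(weeks=2),
    })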
  • Storing (lots of) historical data
    The collected data piles up quickly!
    ● Don't collect data too frequently
    ● Compress readings into larger and larger time intervals as the readings age
      ○ We implemented this in the caller, but could have used MongoDB's MapReduce as well
    ● Eventually, delete them (after ~2 weeks)
      ○ We used MongoDB's "time-to-live indexes" to handle this automatically; they work nicely
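A sketch of both aging mechanisms, continuing the hypothetical schema above: a TTL index handles the eventual deletion, and a caller-side pass (the slide notes MongoDB's MapReduce could do this instead) rolls old fine-grained readings up into coarser buckets. A fuller version would also re-bucket already-compressed readings into still larger intervals as they age further:

    from datetime import datetime, timedelta, timezone

    # TTL index: mongod's background thread deletes each document shortly
    # after its "expireAt" time passes (it checks roughly once a minute).
    readings.create_index("expireAt", expireAfterSeconds=0)

    def compress(readings, older_than, bucket):
        """Replace raw readings older than `older_than` with one averaged
        reading per `bucket`-sized interval."""
        cutoff = datetime.now(timezone.utc) - older_than
        old_raw = {"at": {"$lt": cutoff}, "raw": True, "saved": {"$ne": True}}
        buckets = {}
        for doc in readings.find(old_raw):
            slot = int(doc["at"].timestamp() // bucket.total_seconds())
            buckets.setdefault((doc["collector"], slot), []).append(doc["value"])
        for (collector, slot), values in buckets.items():
            readings.insert_one({
                "collector": collector,
                "at": datetime.fromtimestamp(slot * bucket.total_seconds(),
                                             timezone.utc),
                "value": sum(values) / len(values),
                "raw": False,          # mark as compressed so this pass skips it
                "expireAt": datetime.now(timezone.utc) + timedelta(weeks=2),
            })
        readings.delete_many(old_raw)  # drop the fine-grained originals

    # e.g. squash anything older than a day into one-minute buckets
    compress(readings, older_than=timedelta(days=1), bucket=timedelta(minutes=1))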
  • Dealing with the Edge Cases
    ● If an investigation is ongoing, performance data could be compressed or deleted if the investigation takes too long
    ● Users can prevent data from being compressed or deleted by explicitly saving it (one possible implementation follows)
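One way to implement the explicit save, continuing the same hypothetical schema: clearing the TTL field exempts a document from expiration (MongoDB never expires documents that lack the indexed date field), and the "saved" flag is exactly what the compression pass above checks for:

    def save_range(readings, collector, start, end):
        """Pin every reading for `collector` in [start, end) so it is
        neither compressed nor deleted during a long investigation."""
        readings.update_many(
            {"collector": collector, "at": {"$gte": start, "$lt": end}},
            {"$set": {"saved": True}, "$unset": {"expireAt": ""}},
        )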
  • Summary
    ● We used DTrace to allow customers to dig deeper on performance issues
      ○ Customers will love it*
      ○ Our support staff will love it*
    * at least, that's the hope!
  • Thanks!