System insight without Interference

Talk at Wordnik HQ about how to monitor application performance and business goals without intrusive engineering work on your core product.

Published in: Technology, Design

  1. Insight without Interference: Monitoring with Scala, Swagger, MongoDB and Wordnik OSS (Tony Tam, @fehguy)
  2. Nagios Dashboard
  3. Monitoring? Disk space, host checks, system load, network: IT Ops 101
  4. Monitoring? Disk space, host checks, system load, network: necessary (but insufficient)
  5. Why Insufficient?
     • What about Services?
       • Database running?
       • HTTP traffic?
     • Install Munin Node!
       • Some (good) service-level insight
  6. Your boss LOVES charts: “OH pretty colors!” “up and to the right!” “it MUST be important!”
  7. Good vs. Bad?
     • Database calls avg 1ms?
       • Great! DB working well
       • But called 1M times per page load/user?
     • Most tools are for system, not your app
     • By the time you know, it’s too late
     Need business metrics monitoring!
  8. Enter APM
     • Application Performance Monitoring
     • Many flavors, degrees of integration
       • Heavy: transaction monitoring, code performance, heap, memory analysis
       • Medium: home-grown profiling
       • Light: digest your logs (failure forensics)
     • What you need depends on architecture, business + technology stage
  9. APM @ Wordnik
     • Micro Services make the System
     [Diagram label: Monolithic application]
  10. APM @ Wordnik
      • Micro Services make the System
      • API Calls are the unit of work!
      [Diagram label: Monolithic application]
  11. Monitoring API Calls
      • Every API must be profiled
      • Other logic as needed
        • Database calls
        • Connection manager
        • etc...
      • Anything that might matter!
  12. How?
      • Wordnik-OSS Profiler for Scala
        • Apache 2.0 License, available in Maven Central
      • Profiling an arbitrary code block:
          import com.wordnik.util.perf.Profile
          Profile("create a cat", {/* do something */})
      • Profiling an API call:
          Profile("/store/purchase", {/* do something */})
  13. Profiler gives you…
      • Nearly free*** tracking
      • Simple aggregation
      • Trigger mechanism
        • Actions on time spent “doing things”:
          Profile.triggers += new Function1[ProfileCounter, Unit] {
            def apply(counter: ProfileCounter): Unit = {
              // Escalate when a "getDb" call exceeds the 5000 threshold (presumably milliseconds)
              if (counter.name == "getDb" && counter.duration > 5000)
                wakeUpSysAdminAndEveryoneWhoCanFixShit(Urgency.NOW)
            }
          }
  14. Profiler gives you… (same slide with a callout on the trigger code)
      This is intrusive on your codebase
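
To make the aggregation and trigger ideas on slides 13 and 14 concrete, here is a minimal, self-contained sketch of a profiler. It is not the Wordnik-OSS implementation: `MiniProfile`, `Counter`, and all field names are hypothetical, invented for illustration, and the real `Profile` API may differ.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a profiler; NOT the Wordnik-OSS Profile class.
object MiniProfile {
  // Aggregated stats for one named code block (durations in milliseconds).
  final case class Counter(name: String, count: Long, totalMs: Long, lastMs: Long)

  private val counters = mutable.Map.empty[String, Counter]

  // Callbacks invoked after every profiled call with the updated counter.
  // Note: this buffer is not thread-safe; register triggers at startup.
  val triggers = mutable.ListBuffer.empty[Counter => Unit]

  // Time `block`, fold the duration into the named counter, then fire triggers.
  def apply[T](name: String)(block: => T): T = {
    val start = System.nanoTime()
    try block
    finally {
      val elapsedMs = (System.nanoTime() - start) / 1000000L
      val updated = synchronized {
        val prev = counters.getOrElse(name, Counter(name, 0L, 0L, 0L))
        val next = prev.copy(
          count = prev.count + 1,
          totalMs = prev.totalMs + elapsedMs,
          lastMs = elapsedMs)
        counters.update(name, next)
        next
      }
      triggers.foreach(t => t(updated))
    }
  }

  // Immutable copy of all counters, e.g. for dumping to a log or an API.
  def snapshot: Map[String, Counter] = synchronized(counters.toMap)
}
```

A call such as `MiniProfile("create a cat") { /* do something */ }` returns the block's result and updates the counter as a side effect, which is why the per-call cost stays close to two clock reads and one map update.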
  15. Accessing Profile Data
      • Easy to get in code
          ProfileScreenPrinter.dump
      • Output where you want
          logger.info(ProfileScreenPrinter.toString)
      • Send to logs, email, etc.
  16. Accessing Profile Data
      • Easier to get via API with Swagger-JAXRS
          import com.wordnik.resource.util
          @Path("/activity.json")
          @Api("/activity")
          @Produces(Array("application/json"))
          class ProfileResource extends ProfileTrait
  17. Accessing Profile Data
  18. Accessing Profile Data: Inspect without bugging devs!
  19. Is Aggregate Data Enough?
      • Probably not
      • Not Actionable
        • Have calls increased? Decreased?
        • Faster response? Slower?
  20. Make it Actionable
      • “In a 3 hour window, I expect 300,000 views per server”
      • Poll & persist the counters
      • Example: Log page views, every min
          {
            "_id" : "web1-word-page-view-20120625151812",
            "host" : "web1",
            "count" : 627172,
            "timestamp" : NumberLong("1340637492247")
          },
          {
            "_id" : "web1-word-page-view-20120625151912",
            "host" : "web1",
            "count" : 627372,
            "timestamp" : NumberLong("1340637552778")
          }
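
Persisted counters only become actionable once consecutive polls are compared. The following sketch shows one way to turn two samples into a windowed figure; `CounterWindow`, `Sample`, and the function names are hypothetical, and it assumes `count` is a monotonically increasing total and `timestamp` is in epoch milliseconds, as the records on slide 20 suggest.

```scala
object CounterWindow {
  // One persisted poll of a counter, mirroring the "host", "count" and
  // "timestamp" fields of the records on slide 20.
  final case class Sample(host: String, count: Long, timestampMs: Long)

  // Number of events that occurred between two polls.
  def delta(earlier: Sample, later: Sample): Long =
    later.count - earlier.count

  // Observed rate between the two polls, scaled to a window of `windowMs`.
  def projected(earlier: Sample, later: Sample, windowMs: Long): Double = {
    val elapsed = later.timestampMs - earlier.timestampMs
    require(elapsed > 0, "samples must be in chronological order")
    delta(earlier, later).toDouble * windowMs / elapsed
  }

  // The two sample records from the slide.
  val first  = Sample("web1", 627172L, 1340637492247L)
  val second = Sample("web1", 627372L, 1340637552778L)
}
```

With the two records from the slide, `delta` is 200 views over roughly 60.5 seconds; `projected(first, second, 3L * 60 * 60 * 1000)` scales that rate to the 3 hour window used in the stated expectation.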
  21. Make it Actionable
  22. Make it Actionable: Your boss LOVES charts
  23. That’s not Actionable! But it’s pretty. What’s missing?
      • Custom time window
      • APIs to track?
      • Low + high watermarks
      • Too much custom engineering
  24. That’s not Actionable! Call to Action!
      • Custom time window
      • APIs to track?
      • Low + high watermarks
      • Too much custom engineering
  25. Make it Actionable
      • Swagger + a tiny bit of engineering
        • Let your *product* people create monitors, set goals
      • A Check: specific API call mapped to a service function
          {
            "name": "word-page-view",
            "path": "/word/*/wordView (post)",
            "checkInterval": 60,
            "healthSpan": 300,
            "minCount": 300,
            "maxCount": 100000
          }
  26. Make it Actionable
      • A Service Type: a collection of checks which make a functional unit
          {
            "name": "www-api",
            "checks": [
              "word-of-the-day",
              "word-page-view",
              "word-definitions",
              "user-login",
              "api-account-signup",
              "api-account-activated"
            ]
          }
  27. Make it Actionable
      • A Host: “directions” to get to the checks
          {
            "host": "ip-10-132-43-114",
            "path": "/v4/health.json/profile?api_key=XYZ",
            "serviceType": "www-api"
          },
          {
            "host": "ip-10-130-134-82",
            "path": "/v4/health.json/profile?api_key=XYZ",
            "serviceType": "www-api"
          }
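
The Check, Service Type, and Host definitions on slides 25 to 27 can be read as a small rule engine. The sketch below shows one plausible evaluation, under stated assumptions: the slides do not spell out the semantics, so it assumes that a check is healthy when the number of calls observed during the last `healthSpan` seconds lies between `minCount` and `maxCount`. `HealthChecks`, `Status`, and the function names are hypothetical.

```scala
object HealthChecks {
  // Mirrors the JSON fields of a Check on slide 25.
  final case class Check(
    name: String,
    path: String,
    checkInterval: Int, // assumed: seconds between polls
    healthSpan: Int,    // assumed: length of the evaluation window in seconds
    minCount: Long,
    maxCount: Long)

  sealed trait Status
  case object Healthy extends Status
  final case class Unhealthy(reason: String) extends Status

  // Assumption: `observed` is the call count seen during the last `healthSpan` seconds.
  def evaluate(check: Check, observed: Long): Status =
    if (observed < check.minCount)
      Unhealthy(s"${check.name}: $observed is below the low watermark ${check.minCount}")
    else if (observed > check.maxCount)
      Unhealthy(s"${check.name}: $observed is above the high watermark ${check.maxCount}")
    else Healthy

  // A service type is healthy only if every one of its checks is healthy.
  def serviceHealthy(results: Seq[Status]): Boolean =
    results.forall(_ == Healthy)
}
```

Keeping the thresholds in data rather than in code is what lets product people adjust a low or high watermark without a deploy.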
  28. Make it Actionable
      • And finally, a simple GUI
  29. Make it Actionable
      • And finally, a simple GUI
  30. Make it Actionable
      • Point Nagios at this!
          serviceHealth.json/status/www-api?explodeOnFailure=true
      • Get a 500, get an alert
      Callouts: Metrics from Product; Based on YOUR app; Treat like system failure
  31. Make it Actionable
  32. Is this Enough?
      • System monitoring
      • Aggregate monitoring
      • Windowed monitoring
      • Object monitoring?
        • Action on a specific event/object
      Why!?
  33. Object-level Actions
      • Any back-end engineer can build this
        • But shouldn’t
      • ETL to a cube?
      • Run BI queries against production?
      • Best way to “siphon” data from production w/o intrusive engineering?
  34. Avoiding Code Invasion
      • We use MongoDB everywhere
      • We use > 1 server wherever we use MongoDB
      • We have an opLog record against everything we do
  35. What is the OpLog?
      • All participating members have one
      • Capped collection of all write ops
      [Diagram: time axis t0, t1, t2, t3 across a primary and two replicas]
  36. So What?
      • It’s a “pseudo-durable global topic message bus” (PDGTMB)
        • WTF?
      • All DB transactions in there
      • It’s persistent (cyclic collection)
      • It’s fast (as fast as your writes)
      • It’s non-blocking
      • It’s easily accessible
  37. More about this
          {
            "ts" : { "t" : 1340948921000, "i" : 1 },
            "h" : NumberLong("5674919573577531409"),
            "op" : "i",
            "ns" : "test.animals",
            "o" : { "_id" : "fred", "type" : "cat" }
          },
          {
            "ts" : { "t" : 1340948935000, "i" : 1 },
            "h" : NumberLong("7701120461899338740"),
            "op" : "i",
            "ns" : "test.animals",
            "o" : { "_id" : "bill", "type" : "rat" }
          }
  38. Tapping into the Oplog
      • Made easy for you! https://github.com/wordnik/wordnik-oss
  39. Tapping into the Oplog
      • Made easy for you! https://github.com/wordnik/wordnik-oss
      • Incremental backup, snapshots, replication: same technique!
  40. Tapping into the Oplog
      • Create an OplogRecordProcessor
          import java.io.IOException
          import scala.collection.mutable.HashSet
          import com.mongodb.BasicDBObject

          class OpLogReader extends OplogRecordProcessor {
            // Observer functions, each called once per oplog record
            val recordTriggers = new HashSet[Function1[BasicDBObject, Unit]]

            @throws(classOf[Exception])
            def processRecord(dbo: BasicDBObject) = {
              recordTriggers.foreach(t => t(dbo))
            }

            @throws(classOf[IOException])
            def close(string: String) = {}
          }
  41. Tapping into the Oplog
      • Attach it to an OplogTailThread
          val util = new OpLogReader
          val coll: DBCollection =
            (MongoDBConnectionManager.getOplog("oplog", "localhost", None, None)).get
          val tailThread = new OplogTailThread(util, coll)
          tailThread.start
  42. Tapping into the Oplog
      • Add some observer functions
          util.recordTriggers += new Function1[BasicDBObject, Unit] {
            def apply(e: BasicDBObject): Unit = Profile("inspectObject", {
              totalExamined += 1
              /* do something here */
            })
          }
  43. /* do something here */
      • Like?
      • Convert to business objects and act!
        • OpLog to domain object is EASY
        • Just process the ns that you care about
            "ns" : "test.animals"
      • How?
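
"Just process the ns that you care about" amounts to a filter in front of the observer functions. Here is a minimal sketch of that filter. To stay runnable without the MongoDB driver it models an oplog entry as a plain `Map[String, Any]` instead of a `BasicDBObject`; that modeling, and the names `OplogDispatch`, `interesting`, and `payload`, are assumptions made for illustration.

```scala
object OplogDispatch {
  // An oplog entry modeled as a plain Map so the sketch needs no MongoDB driver.
  type Entry = Map[String, Any]

  // True for inserts ("op" == "i") into one of the namespaces we care about.
  def interesting(namespaces: Set[String])(e: Entry): Boolean =
    e.get("op").contains("i") && e.get("ns").exists {
      case ns: String => namespaces.contains(ns)
      case _          => false
    }

  // The inserted document stored under "o", if present and map-shaped.
  def payload(e: Entry): Option[Map[String, Any]] =
    e.get("o").collect {
      case m: Map[_, _] => m.asInstanceOf[Map[String, Any]]
    }
}
```

An observer would typically call `interesting(Set("test.animals"))(entry)` first and only then convert `payload(entry)` into a domain object, so records from unrelated collections cost almost nothing.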
  44. Converting OpLog to Object
      • Jackson makes this trivial
          case class User(username: String, email: String, createdAt: Date)
          val user = jacksonMapper.convertValue(
            dbo.get("o").asInstanceOf[DBObject],
            classOf[User])
      • Reuse your DAOs? Bonus points!
      • Got your objects!
  45. Converting OpLog to Object (same slide with callouts)
      • “o” is for “Object”
      • Now What?
  46. Use Case 1: Alert on Action
      • New account!
          obj match {
            case newAccount: UserAccount => { /* ring the bell! */ }
            case _ => { /* ignore it */ }
          }
  47. Use Case 2: What’s Trending?
      • Real-time activity
          case o: VisitLog => Profile("ActivityMonitor:processVisit", {
            wordTracker.add(o.word)
          })
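
The slide does not show what `wordTracker` does with each word. As a sketch under that caveat, the simplest tracker that supports a "what's trending" view is a frequency counter with a top-N query; the `WordTracker` class below is hypothetical and is not taken from the Wordnik codebase.

```scala
import scala.collection.mutable

// Hypothetical tracker; the implementation behind the slide's `wordTracker` is not shown.
final class WordTracker {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  // Record one visit to `word`.
  def add(word: String): Unit = synchronized { counts(word) += 1 }

  // The n most frequent words, most frequent first; ties broken alphabetically.
  def top(n: Int): Seq[(String, Long)] = synchronized {
    counts.toSeq.sortBy { case (word, count) => (-count, word) }.take(n)
  }
}
```

A production version would likely age counts out over a time window so that "trending" reflects recent activity rather than all-time totals; this sketch deliberately leaves that out.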
  48. Use Case 3: External Analytics
          case o: UserProfile => {
            getSqlDatabase().executeSql(
              "insert into user_profile values(?,?,?)",
              o.username, o.email, o.createdAt)
          }
  49. Use Case 3: External Analytics (same slide with callouts)
      • Your data pushes to relational!
      • Don’t mix runtime & OLAP!
  50. Use Case 4: Cloud Analysis
          case o: NewUserAccount => {
            getSalesforceConnector().create(
              Lead(Account.ID, o.firstName, o.lastName,
                o.company, o.email, o.phone))
          }
  51. Use Case 4: Cloud Analysis (same slide with callouts)
      • Pushed directly to Salesforce!
      • We didn’t interrupt core engineering!
  52. Examples: Polling profile APIs cross cluster
  53. Examples: Siphoning hashtags from opLog
  54. Examples: Page view activity from opLog
  55. Examples: Health check w/o engineering
  56. Summary
      • Don’t mix up monitoring servers & your application
      • Leave core engineering alone
      • Make a tiny engineering investment now
      • Let your product folks set metrics
      • FOSS tools are available (and well tested!)
      • The opLog is incredibly powerful
        • Hack it!
  57. Find out more
      • Wordnik: developer.wordnik.com
      • Swagger: swagger.wordnik.com
      • Wordnik OSS: github.com/wordnik/wordnik-oss
      • Atmosphere: github.com/Atmosphere/atmosphere
      • MongoDB: www.mongodb.org
