Distributed systems-radiology

Each of us operates distributed systems. Some of us operate traditional infrastructure
with database, web, and load-balancing tiers. Others require more bespoke infrastructure
that may incorporate non-traditional storage solutions (such as Riak).
Regardless of where each of us falls on this spectrum, the network closely describes the
behavior of our applications. Furthermore, it is the only place we can look to understand
the emergent behavior of applications working in concert. In this talk, we take a
radiological view of network-derived imagery and discuss what it can tell us about our
systems as a whole.

  1. Modern Radiology for Distributed Systems. Dietrich Featherston, @d2fn. Thursday, October 11, 12
  2. This is a talk about monitoring.
  3. But not just any kind of monitoring: non-invasive monitoring.
  4. Non-invasive monitoring: measures taken to describe the state of a system with minimal changes to the system being monitored.
  5. [Chart: Insight vs. Invasiveness, showing Radiographic Imagery]
  6. Preventative care: measures taken to prevent diseases or injuries rather than curing them or treating their symptoms.
  7. Non-invasive monitoring techniques focus primarily on host-based metrics. Why is this a problem?
  8. Because applications are distributed.
  9. [Chart vs. network size: information emitted about nodes in the network grows as n; information emitted about edges in the network grows as n².]
  10. We analyze cell structure because we can't envision the whole organism. We react to disease and injury because we lack preventative care.
  11. We lack preventative care for applications because our non-invasive monitoring techniques are growing less and less meaningful.
  12. Radiology is useful in illuminating non-invasive monitoring of distributed systems.
  13. (image only)
  14. (image only)
  15. (image only)
  16. Context is everything.
  17. How do we use context?
  18. [Diagram: Your Big Dumb Data; Context; !!!]
  19. [Diagram: Radiographic Imagery; Diagnoses + med school; Human brain]
  20. [Diagram: VLA Output; Signal Processing; E.T.]
  21. [Diagram: Network Data; Application Topology; Signal Processing; Expert Brain; Application Behavior]
  22. Dimensions and measurements:
      Dimensions (11): epoch seconds, epoch minutes, epoch hours, node id, source ip, source port, dest ip, dest port, interface, country, network/asn
      Measurements (8): egress packets, egress octets, ingress packets, ingress octets, retransmits, errors, app-rtt, handshake-rtt
  23. Case Study #1: GC death of a distributed JVM application.
  24. (image only)
  25. Case Study #2. Symptoms:
      - Latent Riak handoff
      - Cluster throughput bottoming out
  26. (image only)
  27. busy_dist_port
  28. +zdbbl 8192
  29. (image only)
  30. Case Study #3: Bringing a dead Riak node back online.
  31. (image only)
  32. (image only)
  33. (image only)
  34. Case Study #4: Retransmits at 10% of total network throughput.
  35. (image only)
  36. Scala:
      var put: HttpPut = null
      try {
        // ... put data
      } catch {
        case e: Exception =>
          // ... handle exception
      } finally {
        // finally runs after every request, successful or not
        if (put != null) {
          put.abort()
        }
      }
  37. (Repeats the code from slide 36.)
  38. Apache HttpClient Javadoc for HttpRequestBase.abort(): "public void abort(). Description copied from interface: HttpUriRequest. Aborts execution of the request." THANKS
      Source: http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/client/methods/HttpRequestBase.html#abort()
  39. Implementation of abort():
      public void abort() {
          ClientConnectionRequest localRequest;
          ConnectionReleaseTrigger localTrigger;

          this.abortLock.lock();
          try {
              if (this.aborted) {
                  return;
              }
              this.aborted = true;

              localRequest = connRequest;
              localTrigger = releaseTrigger;
          } finally {
              this.abortLock.unlock();
          }

          // Trigger the callbacks outside of the lock, to prevent
          // deadlocks in the scenario where the callbacks have
          // their own locks that may be used while calling
          // setReleaseTrigger or setConnectionRequest.
          if (localRequest != null) {
              localRequest.abortRequest();
          }
          if (localTrigger != null) {
              try {
                  localTrigger.abortConnection();
              } catch (IOException ex) {
                  // ignore
              }
          }
      }
  40. (image only)
  41. Augmented intelligence precedes artificial intelligence.
  42. Timeline: 1895, Wilhelm Röntgen discovers X-rays; the first medical use of X-rays in human imaging takes place one month later.
  43. Timeline adds 1905: first English text on chest radiography.
  44. Timeline adds 1920: Society of Radiographers formed.
  45. Recognition of radiology as a formal medical discipline was a cultural problem, not a technology problem. (http://www.bshr.org.uk/page13.html)
  46. If you want to talk about the query language we use at Boundary to ask questions of the network data we collect, find me after the talk or hit me up on Twitter. @d2fn, github.com/dietrichf
  47. Find 45 minutes of total traffic seen on meters 1, 2, 226, & 301, starting 18 hours ago, broken down by peer ip; retain the top 10 on packets; retain the ratio of retransmits to packets.
      get volume_1s_meter_ip [
        meter in {1, 2, 226, 301};
        epochMillis from -18h for 45m;
      ]
      categorize
        sum(ingress) as ingress,
        sum(egress) as egress,
        sum(ingressPackets + egressPackets) as packets,
        sum(retransmits) as retransmits,
        mean(appRttUsec/1000) as appRttMs
      by epochMillis, ip
      top 10 on packets
      retransmits/packets per epochMillis
  48-54. (Repeat slide 47.)

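Slide 9's growth argument can be made concrete with simple arithmetic. This Java sketch assumes one metric series per host and one per ordered pair of distinct hosts (an upper bound, since real traffic matrices are usually sparse); the class and method names are hypothetical, not part of any monitoring product.

```java
// Hypothetical illustration of slide 9: per-node information grows linearly
// with network size, while per-edge (host-pair) information grows roughly
// quadratically.
final class SeriesGrowth {
    // Assumption: one time series per host.
    static long nodeSeries(long n) {
        return n;
    }

    // Assumption: one time series per ordered (source, destination) pair of
    // distinct hosts, i.e. n * (n - 1), which approaches n^2 as n grows.
    static long edgeSeries(long n) {
        return n * (n - 1);
    }

    public static void main(String[] args) {
        for (long n : new long[] {10, 100, 1000}) {
            System.out.printf("n=%d nodes=%d edges=%d%n", n, nodeSeries(n), edgeSeries(n));
        }
    }
}
```

At 1,000 hosts this gives 1,000 node series against 999,000 possible edge series.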
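
The dimensions and measurements on slide 22 suggest a simple data model: each sample is keyed by dimensions and carries additive measurements, so coarser views come from dropping dimensions and summing. This Java sketch (Java 16+ for records) models only a handful of the listed fields; the record names, the choice of rolling up to (epoch minute, dest ip), and the sample values are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of slide 22's data model: dimensions say what/where/when,
// measurements say how much. Only a subset of the listed fields is modeled.
final class FlowRollup {
    record FlowSample(
            long epochSeconds, String nodeId, String sourceIp, String destIp, int destPort, // dimensions
            long ingressOctets, long egressOctets, long retransmits) {                      // measurements
        // A coarser time dimension derived from the finer one.
        long epochMinutes() {
            return epochSeconds / 60;
        }
    }

    // The dimensions kept after rolling up; everything else is summed away.
    record MinuteAndDest(long epochMinutes, String destIp) {}

    record Totals(long octets, long retransmits) {
        Totals plus(Totals other) {
            return new Totals(octets + other.octets, retransmits + other.retransmits);
        }
    }

    // Groups samples by (epoch minute, dest ip) and sums their measurements.
    static Map<MinuteAndDest, Totals> rollUp(List<FlowSample> samples) {
        return samples.stream().collect(Collectors.toMap(
                s -> new MinuteAndDest(s.epochMinutes(), s.destIp()),
                s -> new Totals(s.ingressOctets() + s.egressOctets(), s.retransmits()),
                Totals::plus));
    }

    public static void main(String[] args) {
        List<FlowSample> samples = List.of(
                new FlowSample(60, "node-a", "10.0.0.1", "10.0.0.2", 8080, 100, 50, 1),
                new FlowSample(90, "node-b", "10.0.0.3", "10.0.0.2", 8080, 200, 25, 0),
                new FlowSample(125, "node-a", "10.0.0.1", "10.0.0.2", 8080, 10, 10, 2));
        rollUp(samples).forEach((key, totals) -> System.out.println(key + " -> " + totals));
    }
}
```

Because the measurements used here are sums, rollups compose: per-minute totals could themselves be summed into per-hour totals. Averages such as app-rtt do not compose this way and would need a sum and a count kept separately.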
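
Slides 36-39 show why calling abort() in a finally block is costly: in HttpClient 4.x, abort() ends in ConnectionReleaseTrigger.abortConnection(), which that interface's Javadoc describes as a "hard" release that shuts the connection down rather than keeping it alive for reuse. The model below does not use HttpClient at all; Pool, lease, release, and abort are hypothetical stand-ins that only count how many new connections a run of successful requests needs under each pattern.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, self-contained model (NOT Apache HttpClient) contrasting
// releasing a pooled connection with aborting it.
final class PoolChurn {
    static final class Pool {
        private final Deque<Integer> idle = new ArrayDeque<>();
        private int opened = 0; // number of brand-new connections created

        int lease() {
            Integer conn = idle.pollFirst();
            if (conn != null) {
                return conn; // reuse a kept-alive connection
            }
            opened++;
            return opened;   // "open" a new connection, identified by its ordinal
        }

        void release(int conn) {
            idle.addFirst(conn); // keep alive and make available for reuse
        }

        void abort(int conn) {
            // Hard close: the connection is dropped and never returned to the pool.
        }

        int opened() {
            return opened;
        }
    }

    // Runs `requests` successful requests. If abortInFinally is true, mimics
    // the slide-36 pattern of aborting in finally regardless of the outcome.
    static int connectionsOpened(int requests, boolean abortInFinally) {
        Pool pool = new Pool();
        for (int i = 0; i < requests; i++) {
            int conn = pool.lease();
            boolean succeeded = false;
            try {
                // ... send the request and fully read the response ...
                succeeded = true;
            } finally {
                if (abortInFinally || !succeeded) {
                    pool.abort(conn);   // slide-36 pattern, or a genuine failure
                } else {
                    pool.release(conn); // success: return the connection for reuse
                }
            }
        }
        return pool.opened();
    }

    public static void main(String[] args) {
        System.out.println("abort in finally:   " + connectionsOpened(100, true));
        System.out.println("release on success: " + connectionsOpened(100, false));
    }
}
```

With 100 successful requests, the abort-in-finally pattern opens 100 connections and the release-on-success pattern opens 1. With the real HttpClient 4.x API, a connection is typically returned to the pool by fully consuming the response entity (for example with EntityUtils.consume), leaving abort() for the failure path.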
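
The query on slide 47 can be read as a filter, a top-N selection, and a per-epoch ratio. This Java sketch (Java 16+) re-expresses that reading over in-memory samples. It is not Boundary's query engine, models only the packet and retransmit columns, and interprets "top 10 on packets" as the ten peer IPs with the most packets across the whole window. The Sample record, method names, and inputs are hypothetical; a caller would pass fromMillis as the current time minus 18 hours.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical re-expression of the slide-47 query over in-memory samples.
// Field names mirror the query; the data source and semantics are assumptions.
final class RetransmitRatio {
    record Sample(long epochMillis, int meter, String ip,
                  long ingressPackets, long egressPackets, long retransmits) {
        long packets() {
            return ingressPackets + egressPackets;
        }
    }

    static Map<Long, Double> ratioPerEpoch(List<Sample> samples, Set<Integer> meters,
                                           long fromMillis, long durationMillis, int topN) {
        long toMillis = fromMillis + durationMillis;
        // Filter: meter in {...}; epochMillis from <start> for <duration>.
        List<Sample> window = samples.stream()
                .filter(s -> meters.contains(s.meter()))
                .filter(s -> s.epochMillis() >= fromMillis && s.epochMillis() < toMillis)
                .toList();
        // Retain the top N peer IPs by total packets over the window.
        Set<String> topIps = window.stream()
                .collect(Collectors.groupingBy(Sample::ip, Collectors.summingLong(Sample::packets)))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
        // Sum retransmits and packets per epochMillis for those IPs.
        Map<Long, long[]> sums = new TreeMap<>();
        for (Sample s : window) {
            if (!topIps.contains(s.ip())) {
                continue;
            }
            long[] acc = sums.computeIfAbsent(s.epochMillis(), k -> new long[2]);
            acc[0] += s.retransmits();
            acc[1] += s.packets();
        }
        // Ratio of retransmits to packets per epochMillis (0 when no packets).
        Map<Long, Double> ratios = new TreeMap<>();
        sums.forEach((t, acc) -> ratios.put(t, acc[1] == 0 ? 0.0 : (double) acc[0] / acc[1]));
        return ratios;
    }

    public static void main(String[] args) {
        List<Sample> samples = List.of(
                new Sample(1_000, 1, "10.0.0.5", 60, 40, 10),
                new Sample(1_000, 2, "10.0.0.6", 30, 20, 0),
                new Sample(2_000, 226, "10.0.0.5", 50, 50, 5),
                new Sample(2_000, 7, "10.0.0.7", 500, 500, 400)); // meter 7 is filtered out
        System.out.println(ratioPerEpoch(samples, Set.of(1, 2, 226, 301), 0, 45 * 60 * 1000, 10));
    }
}
```

Ties in the top-N step are broken arbitrarily in this sketch.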