Optimum Performance, Maximum Insight: Behind the Scenes with Network Measurement Tools

Peter Van Epp, Network Director, Simon Fraser University
Loki Jorgenson, Chief Scientist

Conference, April 17-18, 2007
Overview
- Network measurement, performance analysis, and troubleshooting are critical elements of effective network management.
- Recommended tools, methodologies, and practices, with a bit of hands-on.
Overview
- Quick-start hands-on
- Elements of Network Performance
- Realities: Industry and Campus
- Contexts
- Methodologies
- Tools
- Demos
- Q&A
Troubleshooting the LAN: NDT (Public Domain)
- Preview - NDT
  - Source - http://e2epi.internet2.edu/ndt/
  - Local server - http://ndtbby.ucs.sfu.ca:7123
    - http://192.75.244.191:7123
    - http://142.58.200.253:7123
  - Local instructions - http://XXX.XXX
Troubleshooting the LAN: AppCritical (Commercial)
- Preview - AppCritical
  - Source - http://apparentNetworks.com
  - Local server - http://XXX.XXXX.XXX
  - Local instructions - http://XXX.XXX
    - Login: "guest", "bcnet2007"
    - Downloads > User Interface > Download User Interface > Install
    - Start and log in (see above)
INTRO
Network Performance
- Measurement
  - How big? How long? How much?
  - Quantification and characterization
- Troubleshooting
  - Where is the problem? What is causing it?
  - Diagnosis and remediation
- Optimization
  - What is the limiter? Which applications are affected?
  - Design analysis and planning
“Functional” vs. “Dysfunctional”
- Functional networks operate as spec'd
  - Consistent with …
  - The only problem is congestion
  - "Bandwidth" is the answer (or QoS)
- Dysfunctional networks operate otherwise
  - "Broken", but ping works
  - Do not meet application requirements
  - Bandwidth and QoS will NOT help
Causes of Degradation
- Five categories of degradation:
  - Exceeds specification
    - Insufficient capacity
  - Diverges from design
    - Failed over to T1; auto-negotiation selects half-duplex
  - Presents dysfunction
    - EM interference on cable
  - Includes devices and interfaces that are mis-configured
    - Duplex mismatch
  - Manifests emergent features
    - Extreme burstiness on high-capacity links; TCP
STATS AND EXPERIENCE
Trillions of Dollars
- Global annual spend on telecom = $2 trillion
  - Network/systems management = $10 billion
- 82% of network problems are identified by end users complaining about application performance (Network World)
- 38% of 20,000 helpdesk tests showed network issues impacting application performance (Apparent Networks)
- 78% of network problems are beyond our control (TELUS)
- 50% of network alerts are false positives (Netuitive)
- 85% of networks are not ready for VoIP (Gartner, 2004)
- 60% of IT problems are due to human error (Networking/CompTIA, 2006)
Real World Customer Feedback
- In a survey of 20,000 customer tests, a serious network issue appeared 38% of the time
  - 20% of networks have bad NIC drivers
  - 29% of devices have packet loss, caused by:
    - 50% high utilization
    - 20% duplex conflicts
    - 11% rate-limiting behaviors
    - 8% media errors
    - 8% firewall issues
Last Mile
- Last 100 m
- LAN
  - Workstations
  - Office environment
  - Servers
- WAN
  - Leased lines
  - Limited capacities
- Service providers / core networks
METHODOLOGIES
Real Examples from the SFU Network
- Two links out: one to CA*net4 at 1G, usually empty
- 100M commodity link, heavily loaded
  - (typically 6 times the volume of the C4 link)
- Physics grad student doing something data-intensive to a grid site in Taiwan
  - First indication: total saturation of the commodity link
  - Argus pointed at the grid transfer as the symptom
  - A routing problem was the cause
Real Examples (cont.)
- The underlying problem was an asymmetric route; the Argus record shows traffic in one direction only (a sketch for spotting this automatically follows below):

    12:45:52 tcp taiwan_ip.port -> sfu_ip.port   809     0   1224826     0
                                                  ^      ^       ^       ^
                                             packets   pkts   bytes   bytes
                                                  in    out      in     out

- Reported the problem to the Canarie NOC, who quickly got it fixed
- The user's throughput increased considerably, and the commodity link became less saturated!
- Use of NDT might have increased the stress!
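One way to spot this symptom in bulk is to scan Argus (ra) text output for flows whose counters are zero in one direction. A minimal sketch, assuming the last four columns are packets in/out and bytes in/out as in the record above (ra's column layout depends on its options, so the field indices may need adjusting):

    # Flag one-directional flows in Argus "ra" text output (asymmetric-route symptom).
    # Assumes each line ends with: packets_in packets_out bytes_in bytes_out
    import sys

    def one_directional(line: str) -> bool:
        fields = line.split()
        if len(fields) < 4:
            return False
        try:
            pkts_in, pkts_out, _bytes_in, _bytes_out = (int(f) for f in fields[-4:])
        except ValueError:
            return False  # not a flow record line
        # Traffic seen in only one direction at this tap suggests the
        # return path bypasses the monitored link.
        return (pkts_in > 0) != (pkts_out > 0)

    for line in sys.stdin:
        if one_directional(line):
            print(line.rstrip())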
Network Life Cycle (NLC)
- Network life cycle:
  - Business case
  - Requirements
  - Request for Proposal
  - Planning
  - Staging
  - Deployment
  - Operation
  - Review

[Diagram: the Network Life Cycle as a loop through the eight stages above]
NLC: Staging/Deployment
- Two hosts with a crossover cable
  - Ensure the end points work
- Move one segment closer to the end (testing each time)
  - Not easy to do if sites are geographically/politically distinct
- Establish connectivity to the end points
- Tune for required throughput
  - One of multiple possible points of failure - lack of visibility
- Tools (even very disruptive tools) can help by stressing the network
  - Localize and characterize
NLC: Staging/Deployment (cont.)
- Various bits of hardware (typically network cards) and software (the IP stack) have flaws
  - Default configurations are often inappropriate for very high throughput networks
- Be careful what you buy
  - Cheapest is not best, and may be disastrous
  - Optical is much better, but also much more expensive, than copper
- Tune the IP stack for high performance
- If possible, try whatever you want to buy in a similar environment (RFP/staging)
- Staging won't guarantee anything
  - Something unexpected will always bite you
NLC: Operation
- Easier if the network was known to work at implementation
- You are probably disrupting work, so the pressure is higher
  - May not be able to use the disruptive tools
  - Problems may occur at a time when staff are unavailable
- Support the user (e.g. NDT)
  - A researcher can point the web browser on their machine at an NDT server
  - Save the results (even if they don't understand them) for a network person to look at and comment on later
NLC: Operation (cont.)
- Automated monitoring / data collection
  - Can be very expensive to implement
  - Someone must eventually interpret it
- Consider these issues and costs when applying for funding
- A passive continuous monitor on the network can make your life (and success) much easier
- Multiple lightpath endpoints or a dynamically routed network can be challenging
  - Issues may be (or appear to be) intermittent
  - Changes happen automatically, which can be maddening
NLC Dependencies

[Diagram: dependencies among the NLC stages - Business Case, Requirements, Request for Proposal, Planning, Staging, Deployment, Operation, Review]
METHODOLOGIES: Measurement
Visibility
- The basic problem is lack of visibility at the network level
- Performance "depends"
  - Application type
  - End-user / task
  - Benchmarks
- Healthy networks have design limits
- Broken networks are everything else
Measurement Methodologies
- Device-centric (NMS)
  - SNMP
  - RTCP/XR / NETCONF
    - e.g. HP OpenView
- Network behaviors
  - Passive
    - Flow-based - e.g. Cisco NetFlow
    - Packet-based - e.g. Network General "Sniffer"
  - Active
    - Flooding - e.g. AdTech AX/4000
    - Probing - e.g. AppCritical
- (A sketch of the arithmetic behind device-centric polling follows below)
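Device-centric polling mostly reduces to sampling interface counters and differencing them over time, which is what MRTG-style tools do with SNMP. A minimal sketch of that arithmetic; the sample values are illustrative, not from any real device, and a real poller would fetch ifInOctets and ifSpeed over SNMP:

    # Average link utilization from two samples of a 32-bit SNMP octet counter.
    COUNTER32_MOD = 2**32  # ifInOctets wraps at 2^32

    def utilization(octets_t0: int, octets_t1: int, interval_s: float,
                    if_speed_bps: int) -> float:
        """Return average utilization (0.0-1.0) over the sample interval."""
        delta = (octets_t1 - octets_t0) % COUNTER32_MOD  # tolerates one counter wrap
        return (delta * 8 / interval_s) / if_speed_bps

    # Example: 45 MB received in a 5-minute interval on a 100 Mbps interface
    print(f"{utilization(1_000_000, 46_000_000, 300, 100_000_000):.1%}")  # 1.2%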
E2E Measurement Challenges
- Layer 1
  - Optical / light paths
  - Wireless
- Layer 2
  - MPLS
  - Ethernet switch fabric
  - Wireless
- Layer 3
- Layer 4
  - TCP
- Layer 5
  - Federation
Existing Observatory Capabilities
- One-way latency, jitter, loss
  - IPv4 and IPv6 ("owamp")
- Regular TCP/UDP throughput tests - ~1 Gbps
  - IPv4 and IPv6; on-demand available ("bwctl")
- SNMP
  - Octets, packets, errors; collected 1/min
- Flow data
  - Addresses anonymized by zeroing the low-order 11 bits
- Routing updates
  - Both IGP and BGP - the measurement device participates in both
- Router configuration
  - Visible backbone - collected 1/hr from all routers
- Dynamic updates
  - Syslog; also alarm generation (~nagios); polling via router proxy
Observatory Databases - Data Types
- Data is collected locally and stored in distributed databases
- Databases:
  - Usage data
  - Netflow data
  - Routing data
  - Latency data
  - Throughput data
  - Router data
  - Syslog data
GARR User Interface
METHODOLOGIES: Troubleshooting
Challenges to Troubleshooting
- Need resolution quickly
- Operational networks
- May not be able to instrument everywhere
- Often relies on expert engineers
- Does not work across 3rd-party networks
- Authorization/access
- Converged networks
- Application-specific symptoms
- End-user driven
HPC Networks
- Three potential problem sources:
  - User site to edge (x2)
  - Core network
- Quickly eliminate as many of these as possible
  - Binary search
- Easiest during the implementation phase
- Ideally, two boxes at the same site, moved one link at a time
  - Often impractical - deploy and pray (and troubleshoot)
HPC Networks (cont.)
- Major difference between dedicated lightpaths and a shared network
- Lightpath: end-to-end test
  - iperf/netperf on loopback
  - Likely too disruptive on a shared network
  - DANGEROUS
- Alternately, NDT to a local server to isolate
  - Recommended to have at least mid-path ping!
HPC Networks (cont.)
- Shared: see if other users have problems
  - If not, there is no core problem and the issue is not common
  - If the core is implicated, outside agencies get involved
- Start troubleshooting
  - Both end-user segments in parallel
- Preventive measures
  - Support user-runnable diagnostics
  - ping and owamp - low-impact monitoring
E2EPI Problem Statement: "The Network is Broken"
- How can users self-diagnose first-mile problems without being network experts?
- How can users do partial path decomposition across multiple administrative domains?
Strategy
- Most problems are local…
- Test ahead of time!
- Is there connectivity and reasonable latency? (ping -> OWAMP; a crude sketch follows below)
- Is routing reasonable? (traceroute)
- Is the host reasonable? (NDT; Web100)
- Is the path reasonable? (iperf -> BWCTL)
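For the first question, even a timed TCP connect gives a rough connectivity-and-latency answer where ICMP is filtered. A minimal sketch; the host and port reuse the local NDT server from the earlier slide and are placeholders for whatever is reachable near you. This is no substitute for ping or OWAMP, which measure per-packet and one-way delay:

    # Crude connectivity/latency probe: time a TCP connect.
    import socket
    import time

    def connect_ms(host: str, port: int, timeout: float = 3.0) -> float:
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0

    print(f"TCP connect: {connect_ms('ndtbby.ucs.sfu.ca', 7123):.1f} ms")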
What Are The Problems?
- TCP: lack of buffer space
  - Forces the protocol into stop-and-wait
  - The number one TCP-related performance problem
  - 70 ms * 1 Gbps = 70*10^6 bits, or about 8.4 MB
  - 70 ms * 100 Mbps = about 855 KB
  - Many stacks default to 64 KB, which caps throughput at about 7.4 Mbps
- (The arithmetic is checked in the sketch below)
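These numbers are just the bandwidth-delay product (buffer needed = bandwidth x RTT) and its inverse (throughput cap = window / RTT). A quick check of the slide's figures, using binary KB/MB as the slide does:

    # Bandwidth-delay product and the throughput cap of a fixed TCP window.
    RTT = 0.070  # seconds, as on the slide

    def bdp_bytes(bandwidth_bps: float, rtt_s: float = RTT) -> float:
        """Buffer needed to keep the path full."""
        return bandwidth_bps * rtt_s / 8

    def window_cap_bps(window_bytes: int, rtt_s: float = RTT) -> float:
        """Best throughput a fixed window allows."""
        return window_bytes * 8 / rtt_s

    print(f"1 Gbps   needs {bdp_bytes(1e9) / 2**20:.1f} MB of buffer")   # ~8.3 MB
    print(f"100 Mbps needs {bdp_bytes(1e8) / 2**10:.0f} KB of buffer")   # ~854 KB
    # ~7.5 Mbps (the slide rounds to 7.4)
    print(f"64 KB window caps at {window_cap_bps(64 * 2**10) / 1e6:.1f} Mbps")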
What Are The Problems?
- Video/audio: lack of buffer space
  - Makes broadcast streams very sensitive to the previous problems
- Application behaviors
  - Stop-and-wait behavior; can't stream
  - Lack of robustness to network anomalies
The Usual Suspects
- Host configuration errors (TCP buffers)
- Duplex mismatch (Ethernet)
- Wiring/fiber problems
- Bad equipment
- Bad routing
- Congestion
  - "Real" traffic
  - Unnecessary traffic (broadcasts, multicast, denial-of-service attacks)
Typical Sources of Performance Degradation
- Half/full-duplex conflicts
- Poorly performing NICs
- MTU conflicts
- Bandwidth bottlenecks
- Rate-limiting queues
- Media errors
- Overlong half-duplex segments
- High latency
Self-Diagnosis
- Find a measurement server "near me"
- Detect common problems in the first mile
- No need to be a network engineer
- Instead of:
  - "The network is broken."
- Hoped-for result:
  - "I don't know what I'm talking about, but I think I have a duplex mismatch problem."
Partial Path Decomposition
- Identify the end-to-end path
- Discover measurement nodes "near to" and "representative of" hops along the route
- Authenticate to multiple measurement domains (locally defined policies)
- Initiate tests between remote hosts
- See test data from already-run tests (future)
Partial Path Decomposition
- Instead of:
  - "Can you give me an account on your machine?"
  - "Can you set up, and leave up, an Iperf server?"
  - "Can you get up at 2 AM to start up Iperf?"
  - "Can you make up a policy on the fly just for me?"
- Hoped-for result:
  - Regular means of authentication
  - Measurement peering agreements
  - No chance of polluted test results
  - Regular and consistent policy for access and limits
METHODOLOGIES: Application Performance
- Network-dependent vendors
- Applications groups (e.g. VoIP)
- Field engineers
- Industry focused on QoE
Simplified Three Layer Model

    OSI Layer   Description
    7           Application
    6           Presentation
    5           Session
    4           Transport
    3           Network
    2           Data Link
    1           Physical

The seven OSI layers collapse into three bands: User Experience (above the stack), Application Behaviors (the upper layers), and Network Behaviors (the lower layers).
New Layer Model
- User Experience
- App Behaviors
- Network Behaviors
App-to-Net Coupling
- Application model inputs:
  - Codec
  - Dynamics
  - Requirements
- Network outcomes:
  - Loss
  - Jitter
  - Latency
E-Model Mapping: R -> MOS
- E-model analysis generates an "R-value" (0-100)
- R maps to the well-known MOS score (QoE); the standard conversion is sketched below
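The R-to-MOS conversion is standardized in ITU-T G.107; a small sketch of that formula:

    # ITU-T G.107 E-model: convert an R-value (0-100) to an estimated MOS.
    def r_to_mos(r: float) -> float:
        if r <= 0:
            return 1.0
        if r >= 100:
            return 4.5
        return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

    for r in (50, 70, 80, 90):
        print(f"R = {r:3d}  ->  MOS {r_to_mos(r):.2f}")
    # R = 80 (the usual "toll quality" target) comes out around MOS 4.0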
Coupling the Layers
- User / Task / Process
- Application Behaviors
- Network Behaviors
- Application models couple the layers: test/monitor for QoE; network requirements (QoS/SLA)
METHODOLOGIES: Optimization
Network Visibility
- End-to-end visibility
- App-to-net coupling
- End-to-end network path
Iterating to Performance
Wizard Gap

[Figure: reprinted with permission (Matt Mathis, PSC), http://www.psc.edu/~mathis/]
Wizard Gap
- Working definition: the ratio of effective network performance attained by an average user to that attainable by a network wizard.
Fix the Network First
- Three steps to performance:
  - Clean the network
    - Pre-deployment
    - Monitoring
  - Model traffic
    - Application requirements for QoS/SLA
    - Monitoring for application performance
  - Deploy QoS
Lessons Learned
- Guy Almes, chief engineer, Abilene:
  - "The general consensus is that it's easier to fix a performance problem by host tuning and healthy provisioning rather than reserving. But it's understood that this may change over time. [...] For example, of the many performance problems being reported by users, very few are problems that would have been solved by QoS if we'd have had it."
Tools
- CAIDA tools (public)
  - http://www.caida.org/tools/
  - Taxonomies:
    - Topology
    - Workload
    - Performance
    - Routing
    - Multicast
Recommended (Public) Tools
- MRTG (SNMP-based router stats)
- iPerf / NetPerf (active stress testing)
- Ethereal/Wireshark (passive sniffing)
- NDT (TCP/UDP e2e active probing)
- Argus (flow-based traffic monitoring)
- perfSONAR (test/monitor infrastructure)
  - Including OWAMP, BWCTL (iPerf), etc.
Tools: OWAMP/BWCTL
- OWAMP: One-Way Active Measurement Protocol
  - Ping by any other name would smell as sweet
  - Depends on a stratum 1 time server at both ends
  - Allows finding one-way latency problems (the idea is sketched below)
- BWCTL: bandwidth control
  - Management front end to iperf
  - Prevents iperf tests from disrupting the network
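The core idea behind one-way measurement: the sender timestamps each packet and the receiver subtracts using its own clock, so the result is only as good as the synchronization between the two hosts, which is why the slide calls for stratum 1 time at both ends. A toy illustration of the principle (not the OWAMP protocol; port 9999 is an arbitrary choice):

    # Toy one-way delay measurement. Run "python owd.py recv" on one host
    # and "python owd.py send <host>" on the other. Accuracy depends entirely
    # on how well the two hosts' clocks are synchronized.
    import socket
    import struct
    import sys
    import time

    PORT = 9999  # arbitrary

    def send(host: str) -> None:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(struct.pack("!d", time.time()), (host, PORT))

    def recv() -> None:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", PORT))
        data, addr = sock.recvfrom(64)
        (sent,) = struct.unpack("!d", data)
        print(f"one-way delay from {addr[0]}: {(time.time() - sent) * 1000:.2f} ms")

    if __name__ == "__main__":
        send(sys.argv[2]) if sys.argv[1] == "send" else recv()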
Tools: BWCTL
- Typical constraints on running iperf:
  - Need software on all test systems
  - Need permissions on all systems involved (usually full shell accounts*)
  - Need to coordinate testing with others*
  - Need to run software on both sides with specified test parameters*
- (* BWCTL was designed to help with these)
Tools: Argus
- http://www.qosient.com/argus
  - Open-source IP auditing tool
  - Entirely passive
  - Operates from network taps
  - Network accounting down to the port level
Traffic Summary from Argus

    From: Wed Aug 25 5:59:00 2004   To: Thu Aug 26 5:59:00 2004
    18,972,261,362 Total    10,057,240,289 Out    8,915,021,073 In

    aaa.bb.cc.ddd          6,064,683,683 Tot   5,009,199,711 Out   1,055,483,972 In
    ww.www.ww.www          1,490,107,096       1,396,534,031          93,573,065
    ww.www.ww.www:11003    1,490,107,096       1,396,534,031          93,573,065
    xx.xx.xx.xxx             574,727,508         548,101,513          26,625,995
    xx.xx.xx.xxx:6885        574,727,508         548,101,513          26,625,995
    yy.yyy.yyy.yyy           545,320,698         519,392,671          25,928,027
    yy.yyy.yyy.yyy:6884      545,320,698         519,392,671          25,928,027
    zzz.zzz.zz.zzz           428,146,146         414,054,598          14,091,548
    zzz.zzz.zz.zzz:6890      428,146,146         414,054,598          14,091,548
Tools: Argus
- Using Argus to identify retransmission-type problems
  - Compare total packet size to application data size
- full (complete packets, including IP headers):

    12:59:06 d tcp sfu_ip.port -> taiwan_ip.port  9217  18455  497718  27940870

- app (application data bytes delivered to the user):

    12:59:06 d tcp sfu_ip.port -> taiwan_ip.port  9217  18455       0  26944300

- The data transfer is one way
  - The ACKs coming back carry no user data
Tools: Argus
- Compare to a misconfigured IP stack
- full:

    15:27:38 * tcp outside_ip.port -> sfu_ip.port  967  964  65885  119588

- app:

    15:27:38 * tcp outside_ip.port -> sfu_ip.port  967  964   2051   55952

- The retransmit rate is constantly above 50%
- Poor throughput
- This should (and did) set off alarm bells
- (The arithmetic behind the comparison is sketched below)
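The comparison is simple arithmetic on the two byte counts: "full" counts every byte on the wire, "app" only the user data delivered, so the gap between them bundles protocol headers plus retransmissions. A sketch using the figures from these two slides:

    # Overhead fraction from Argus "full" vs "app" byte counts for the same flow.
    def overhead(full_bytes: int, app_bytes: int) -> float:
        return (full_bytes - app_bytes) / full_bytes

    # Healthy SFU -> Taiwan transfer (previous slide): mostly real data
    print(f"healthy:       {overhead(27_940_870, 26_944_300):.1%}")  # ~3.6%
    # Misconfigured stack (this slide): over half the bytes are not user data
    print(f"misconfigured: {overhead(119_588, 55_952):.1%}")         # ~53%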
Tools: NDT
- (Many thanks to Lixin Liu)
- Test 1: 50% signal on 802.11g

    WEB100 Enabled Statistics:
    Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
    checking for firewalls . . . . . . . . . . . . . . . . . . . Done
    running 10s outbound test (client-to-server [C2S]) . . . . . 12.00Mb/s
    running 10s inbound test (server-to-client [S2C]) . . . . . . 13.90Mb/s

    ------ Client System Details ------
    OS data: Name = Windows XP, Architecture = x86, Version = 5.1
    Java data: Vendor = Sun Microsystems Inc., Version = 1.5.0_11
Tools: NDT

    ------ Web100 Detailed Analysis ------
    45 Mbps T3/DS3 link found.
    Link set to Full Duplex mode
    No network congestion discovered.
    Good network cable(s) found
    Normal duplex operation found.
    Web100 reports the Round trip time = 13.09 msec; the Packet size = 1460 Bytes; and
    There were 63 packets retransmitted, 447 duplicate acks received, and 0 SACK blocks received
    The connection was idle 0 seconds (0%) of the time
    C2S throughput test: Packet queuing detected: 0.10%
    S2C throughput test: Packet queuing detected: 22.81%
    This connection is receiver limited 3.88% of the time.
    This connection is network limited 95.87% of the time.
    Web100 reports TCP negotiated the optional Performance Settings to:
    RFC 2018 Selective Acknowledgment: OFF
    RFC 896 Nagle Algorithm: ON
    RFC 3168 Explicit Congestion Notification: OFF
    RFC 1323 Time Stamping: OFF
    RFC 1323 Window Scaling: ON
Tools: NDT

    Server 'sniffer.ucs.sfu.ca' is not behind a firewall. [Connection to the ephemeral port was successful]
    Client is not behind a firewall. [Connection to the ephemeral port was successful]
    Packet size is preserved End-to-End
    Server IP addresses are preserved End-to-End
    Client IP addresses are preserved End-to-End

    ... (lots of web100 stats removed!)

    aspd: 0.00000
    CWND-Limited: 4449.30
    The theoretical network limit is 23.74 Mbps
    The NDT server has a 8192.0 KByte buffer which limits the throughput to 9776.96 Mbps
    Your PC/Workstation has a 63.0 KByte buffer which limits the throughput to 38.19 Mbps
    The network based flow control limits the throughput to 38.29 Mbps
    Client Data reports link is 'T3', Client Acks report link is 'T3'
    Server Data reports link is 'OC-48', Server Acks report link is 'OC-12'
Tools: NetPerf
- netperf on the same link
  - Available throughput is less than the maximum

    liu@CLM ~
    $ netperf -l 60 -H sniffer.ucs.sfu.ca -- -s 1048576 -S 1048576 -m 1048576
    TCP STREAM TEST from CLM (0.0.0.0) port 0 AF_INET to sniffer.ucs.sfu.ca (142.58.200.252) port 0 AF_INET
    Recv    Send    Send
    Socket  Socket  Message  Elapsed
    Size    Size    Size     Time     Throughput
    bytes   bytes   bytes    secs.    10^6bits/sec

    2097152 1048576 1048576  60.10     9.91

    (second run)
    2097152 1048576 1048576  61.52     5.32
Tools: NDT
- Test 3: 80% on 802.11a

    WEB100 Enabled Statistics:
    Checking for Middleboxes . . . . . . . . . . . . . . . . . . Done
    checking for firewalls . . . . . . . . . . . . . . . . . . . Done
    running 10s outbound test (client-to-server [C2S]) . . . . . 20.35Mb/s
    running 10s inbound test (server-to-client [S2C]) . . . . . . 20.61Mb/s
    ...
    The theoretical network limit is 26.7 Mbps
    The NDT server has a 8192.0 KByte buffer which limits the throughput to 9934.80 Mbps
    Your PC/Workstation has a 63.0 KByte buffer which limits the throughput to 38.80 Mbps
    The network based flow control limits the throughput to 38.90 Mbps
    Client Data reports link is 'T3', Client Acks report link is 'T3'
    Server Data reports link is 'OC-48', Server Acks report link is 'OC-12'
Tools: NetPerf

    liu@CLM ~
    $ netperf -l 60 -H sniffer.ucs.sfu.ca -- -s 1048576 -S 1048576 -m 1048576
    TCP STREAM TEST from CLM (0.0.0.0) port 0 AF_INET to sniffer.ucs.sfu.ca (142.58.200.252) port 0 AF_INET
    Recv    Send    Send
    Socket  Socket  Message  Elapsed
    Size    Size    Size     Time     Throughput
    bytes   bytes   bytes    secs.    10^6bits/sec

    2097152 1048576 1048576  60.25    21.86

- No one else was using wireless on 802.11a (i.e. the case on a lightpath)
- NetPerf gets full throughput, unlike the 802.11g case
Tools: perfSONAR
- Performance middleware
  - perfSONAR is an international consortium in which Internet2 is a founder and leading participant
  - perfSONAR is a set of protocol standards for interoperability between measurement and monitoring systems
  - perfSONAR is a set of open-source web services that can be mixed and matched and extended to create a performance monitoring framework
- Design goals:
  - Standards-based
  - Modular
  - Decentralized
  - Locally controlled
  - Open source
  - Extensible
perfSONAR Integrates
- Network measurement tools
- Network measurement archives
- Discovery
- Authentication and authorization
- Data manipulation
- Resource protection
- Topology
Performance Measurement: Project Phases
- Phase 1: Tool Beacons (today)
  - BWCTL (complete), http://e2epi.internet2.edu/bwctl
  - OWAMP (complete), http://e2epi.internet2.edu/owamp
  - NDT (complete), http://e2epi.internet2.edu/ndt
- Phase 2: Measurement Domain Support
  - General measurement infrastructure (prototype in progress)
  - Abilene measurement infrastructure deployment (complete), http://abilene.internet2.edu/observatory
- Phase 3: Federation Support (future)
  - AA (prototype - optional AES key, policy file, limits file)
  - Discovery of measurement nodes and databases (prototype - nearest NDT server, web page)
  - Test request/response schema support (prototype - GGF NMWG schema)
Implementation
- Applications
  - bwctld daemon
  - bwctl client
- Built upon a protocol abstraction library
  - Supports one-off applications
  - Allows authentication/policy hooks to be incorporated
LIVE DEMOS
- NDT
- AppCritical
Q&A
 
 
 
Application Ecology
- Paraphrasing the ITU categories:
  - Real-time
    - Jitter sensitive
    - Voice, video, collaborative
  - Synchronous/transactional
    - Response-time (RTT) sensitive
    - Database, remote control
  - Data
    - Bandwidth sensitive
    - Transfer, backup/recover
  - Best-effort
    - Not sensitive