PPT - Towards Automated Performance Diagnosis in a Large IPTV Network


1. Towards Automated Performance Diagnosis in a Large IPTV Network
   Ajay Mahimkar, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang, Qi Zhao
   UT-Austin and AT&T Labs-Research
   mahimkar@cs.utexas.edu
   ACM Sigcomm 2009
2. Internet Protocol Television (IPTV)
   - Television delivered through an IP network, encoded in a series of IP packets
   - Rapid deployment by telecom companies
     - New services: quadruple play (digital voice, TV, data & wireless)
     - More flexibility and interactivity for users
   - One of the largest commercial IPTV deployments in the US
     - By 2008, more than 1 million customers spanning 4 time zones
     - Supports advanced features: digital video recording (DVR), video on demand (VoD), high-definition (HD) channels, choice programming
3. IPTV Service Architecture (diagram)
   Content flows from the SHO (servers) through VHOs and IOs over the provider network to COs and DSLAMs, and into the customer home network, where the RG feeds the STB, PC, TV, and phone.
4. IPTV Characteristics
   - Stringent constraints on reliability and performance: small packet loss or delay can impair video quality
   - Scale: millions of devices (routers, servers, RGs, STBs), and the number is growing
   - Complexity: a new service and a new application for native IP multicast, where operational experience is limited
5. Problem Statement
   - Characterize faults and performance in IPTV networks
     - What are the dominant issues?
     - Is there spatial correlation between different events?
     - Is there a daily pattern of events?
   - Detect and troubleshoot recurring conditions
     - Temporal: repeating over time, e.g., recurring poor picture quality at a TV
     - Spatial: recurring across different spatial locations, e.g., software crashes at multiple set-top boxes within a region
   With lots of alarms, how do you identify recurring conditions of interest to network operators?
6. IPTV Measurement Data
   - Customer care call records: trouble tickets related to billing, provisioning, service disruption
   - Home network performance/activities
     - User activities: STB power on/off, STB resets, RG reboots
     - Performance/faults: STB software crashes
   - Provider network performance
     - Syslogs at SHO, VHO
     - SNMP (CPU, memory, packet counts) at SHO, VHO, IO, CO
     - Video quality alarms at monitors
7. IPTV Characterization
   - Data analyzed over 3 months
   - Customer trouble tickets: performance-related issues, listed in decreasing frequency
   - Sample live TV video tickets: blue screen on TV, picture freezing, poor video quality
   - Small degree of spatial correlation
8. Daily Pattern of Events
   - Lots of activity between 00:00 and 04:00 GMT (evening prime time) and between 12:00 and 23:59 GMT (daytime)
   - Relatively quiet period between 04:00 and 12:00 GMT (customers are sleeping), though the number of syslogs at the SHO/VHO is high (provisioning and maintenance)

9. Giza
   - The first multi-resolution infrastructure to detect and troubleshoot spatial locations that are experiencing serious performance problems
   - Detection: outputs locations (e.g., a VHO or an STB) that have significant event counts
   - Troubleshooting: outputs the list of other event series that are the potential root causes
   - Examples: spatial locations such as a VHO or a customer network; a symptom such as video quality degradation; a root cause such as an STB crash
10. Mining Challenges
   - Massive scale of event series: each device (SHO, VHO, RG) can generate many event series, and blind mining could easily lead to an information snow of results
   - Skewed event distribution: the majority of events have small frequency counts, giving insufficient sample sizes for statistical analysis, while heavy hitters do not contribute to the majority of issues
   - Imperfect timing information: propagation delay from the event location to the measurement process; distributed events propagate from the root of the tree (SHO) towards the RG
11. Giza Pipeline (diagram)
   Symptom events at spatial locations enter hierarchical heavy hitter detection (assessing event impact); other network events pass through a spatial proximity model (filtering irrelevant events); both feed a unified data model (addressing scalability and the skewed event distribution). Statistical correlation is followed by causality discovery, which learns edge directionality (handling imperfect timing) and removes spurious edges (enforcing sparsity), yielding the root causes for the symptom.
12. Hierarchical Heavy Hitter Detection
   - Identify spatial regions where the symptom count is significant, e.g., customers in Texas have poor video quality
   - Significance test accounts for (i) event frequency and (ii) density concentration
   - Null hypothesis: children are drawn independently and uniformly at random from lower locations
   - IPTV pyramid model: SHO → VHO → Metro → IO → CO → DSLAM → STB/RG
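The slide's significance test can be illustrated with a toy version: under the null hypothesis that a parent location's n symptom events fall independently and uniformly on its children, each child's count follows a Binomial(n, 1/#children) distribution, and a small upper-tail probability flags that child as a heavy hitter. A minimal sketch (the function names and the plain binomial test are illustrative; the paper's actual test also accounts for density concentration):

```python
import math

def binom_tail(k, n, p):
    """Upper tail P[X >= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

def significant_children(child_counts, alpha=0.01):
    """Flag children whose symptom count is unlikely under the null
    hypothesis that the parent's events are spread uniformly and
    independently over its children (illustrative simplification)."""
    n = sum(child_counts.values())
    p = 1.0 / len(child_counts)  # uniform share per child under the null
    return [child for child, k in child_counts.items()
            if binom_tail(k, n, p) < alpha]
```

For example, a DSLAM holding 40 of the 58 tickets under its CO would be flagged, while siblings holding 2 tickets each would not; recursing from the SHO down the pyramid yields the significant regions.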
13. Causal Graph Discovery
   - First, learn edge directionality: an edge A → B means A statistically precedes B
     - Idea: use the approximate timing to test statistical precedence via lag correlation
     - Compute cross-correlations at different time lags and compare the range of positive lags against negative lags
     - Addresses imperfect timing and auto-correlation within each event series
   - Second, condense the correlation graph
     - Idea: identify the smallest set of events that best explains the symptoms
     - Use L1-norm minimization with L1 regularization, which achieves a sparse solution
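The lag-correlation idea above can be sketched as follows: compute the cross-correlation of two event series over a range of positive and negative lags, and compare which side carries the stronger correlation; if the positive side wins, occurrences of A tend to precede matching occurrences of B. A minimal sketch with hypothetical helper names (the paper's test also handles auto-correlation, which this toy version ignores):

```python
import math

def cross_corr(a, b, lag):
    """Pearson correlation between a[t] and b[t + lag]."""
    if lag >= 0:
        x, y = a[:len(a) - lag], b[lag:]
    else:
        x, y = a[-lag:], b[:len(b) + lag]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def edge_direction(a, b, max_lag=3):
    """Orient an edge by comparing correlation strength at positive
    lags (a precedes b) against negative lags (b precedes a)."""
    pos = max(cross_corr(a, b, l) for l in range(1, max_lag + 1))
    neg = max(cross_corr(a, b, -l) for l in range(1, max_lag + 1))
    if pos > neg:
        return "a->b"
    if neg > pos:
        return "b->a"
    return "undirected"
```

For instance, if series b is a copy of series a delayed by one time bin, the correlation peaks at a positive lag and edge_direction(a, b) infers a → b.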
14. Giza Experiments
   - Validation: customer trouble tickets are the input symptom; Giza identifies potential root causes; the mitigation actions recorded in the tickets serve as ground truth. Result: a good match between ground truth and Giza's output, with the majority of tickets explained by home network issues
   - Case study on provider network events: identify dependencies with customer trouble tickets. Result: Giza discovered a previously unknown causal dependency
   - Causality discovery comparison: ground truth is networking domain knowledge. Result: Giza performs better than WISE (Sigcomm'08)
15. Case Study: Provider Network Events
   Causal graph (diagram) over layer-1 alarms, link down, configuration changes, OSPF interface state down, MPLS interface state changes, MPLS path re-routes, SAP port state changes, SDP bind state changes, VRRP packet discards, and multicast (PIM) neighbor loss; shadow edges are spurious correlations.
   - The dependency between VRRP packet discards and PIM neighbor loss was previously unknown
   - The behavior is more prevalent within the SHO and the VHOs near the SHO
   - We are investigating with the operations team
16. Comparison to Related Literature (table)
17. Conclusions
   - First characterization study of operational IPTV networks: home networks contribute the majority of performance issues
   - Giza, a multi-resolution troubleshooting infrastructure
     - Hierarchical heavy hitter detection to identify regions
     - Statistical correlation to filter irrelevant events
     - Lag correlation to identify dependency directionality
     - L1-norm minimization to identify the best explanatory root causes
   - Validation and case studies demonstrate effectiveness
   - Future work: troubleshooting home networks; network-wide change detection
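The L1-norm minimization recapped above can be illustrated with an iterative soft-thresholding (ISTA) sketch: minimize ||Ax − y||² + λ||x||₁, where the columns of A are candidate root-cause event series, y is the symptom series, and the L1 penalty drives most coefficients to zero so only a few explanatory causes survive. The solver choice and names here are illustrative, not the paper's exact formulation:

```python
import math

def lasso_ista(A, y, lam, steps=500, lr=0.01):
    """Minimize ||A x - y||_2^2 + lam * ||x||_1 by iterative
    soft-thresholding (ISTA). Nonzero entries of x mark the few
    candidate cause series that best explain the symptom series."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(steps):
        # residual r = A x - y and gradient g = 2 A^T r of the squared loss
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        g = [2.0 * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = [xj - lr * gj for xj, gj in zip(x, g)]
        # proximal step for the L1 term: shrink each entry toward zero
        t = lr * lam
        x = [math.copysign(max(abs(xj) - t, 0.0), xj) for xj in x]
    return x
```

With A the 2×2 identity, y = [2, 0], and lam = 0.5, the solver converges to roughly [1.75, 0]: the penalty shrinks the first coefficient by lam/2 and keeps the second exactly at zero, which is the sparsity effect the slides rely on.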
18. Thank You!
