Feb15.ppt
Slide notes
  • Last class, we talked about software debugging in general and how some of its ideas can be applied in a network context. All three papers deal with how to find faults. The first one is geared toward BGP misconfiguration. The registry values of a PC correspond to the configuration files in each router, and the key idea of PeerPressure was comparing an entry against the majority to find the fault. The first paper proposes a router configuration checker that finds faults via static analysis; the most important, fundamental question is identifying common design patterns within the routers and checking them statically. First paper: detecting BGP configuration faults.
  • So we have a responsibility to build more reliable networks.
  • There are two main reasons that routing is prone to faults. One is complex policies: there are competing ASes as well as peers under contract, and as you know, everyone would like to keep their information secret. The other is simply that the network is very large.
  • This is also from the author’s slide. It is just a note that some of the problems can’t be solved by research; being aware of this keeps us a bit humble.
  • External filtering vs. internal dissemination (external vs. internal). What makes distributed configuration hard? Filtering: who gets what. Dissemination: how they get it (the path by which they get it). Ranking: what they get. Not much else in terms of complicated side effects; this amounts to verifying a distributed program’s correctness.
  • Informal analysis (collected from a mailing-list archive): by no means a complete list. Be careful: incidents are not decreasing, and at larger scale these incidents could be further reaching. The statistical analysis just shows that there are a lot of these incidents.
  • Workflow!! Move away from sr toward operation with a tool that can be used prior to deployment. Problems may be masked, and these are not just mistakes. What kinds of things are people likely to want to change? For example: link provisioning, a flash crowd (a simple example).
  • Somehow include the theory on this slide. Say “local correctness specification.” Play up the fact that building the normalized representation is *not easy*! Interesting engineering side note: the configurations are difficult to parse. Scientific side note: the authors had to come up with a normalized representation so that testing constraints is easy; compiler people try to do this for multiple languages (cf.).
  • How many lines of code are in each of these modules? Complexity of the verifier (could also talk about this at the beginning). Advantages of SQL: extensible, and a declarative language for running deductive queries quickly. Talk about the *complexity* of the database and verifier operations. Normalization means expressing the configuration as centralized tables and checking constraints by issuing queries on those tables.
  • What this diagram shows is that rcc’s constraints are neither complete nor sound: rcc may not find all problematic configurations, and it may report false positives. rcc detects a subset of latent faults. Latent faults are faults that are not actively causing any problems but do violate the correctness constraints. For a potentially active fault, there is at least one input sequence that is certain to trigger it; when deployed, a potentially active fault becomes active if the corresponding input sequence occurs. BGP currently does not have a high-level specification mechanism. It would first need a high-level specification language, the network operators would have to learn that language and put in additional work, and even so, they may well write an incorrect specification. So rcc is simply the more convenient and better option.
  • A path is usable if it reaches the destination and conforms to the routing policies of the ASes on the path.
  • What problem arises? If a client receives a route from a route reflector, it does not re-advertise it.
  • An edge exists iff the configuration of each router endpoint specifies the loopback address of the other endpoint and both routers agree on the session options. The graph should be acyclic to ensure that there exists a stable path assignment.
  • The authors argue that requiring operators to provide a high-level policy specification would require designing a specification language and convincing operators to use it. What is worse, such a specification can itself be erroneous, so there is no guarantee that the results would be more accurate. As an alternative, rcc follows these principles: it simply “believes” that intended policies conform to best common practice, and if something deviates from the common practice, it reports an error.
  • Say rcc is checking this for AS X. First, rcc checks all routes that X exports to Sprint and figures out their common attributes (say, they are all tagged 1000). Then rcc checks the import policies for all sessions to WorldCom, ensuring that no import policy sets route attributes that would cause X to export those routes to Sprint. Violations occur when routers in an AS have different policies set for the same peer; rcc easily checks distributed policies by normalizing all of them. Violations also occur when there is an iBGP signaling partition and routes with equally good attributes may not propagate to all peering routers; rcc checks that the routers advertising routes to the same peer are in the same iBGP signaling partition.
  • This rests on the “belief” that when an AS exchanges routes with a neighboring AS over many sessions, most of those sessions have identical policies. There are legitimate reasons for having slightly different import policies.
  • Most of these stem from the fact that the configurations are distributed. Every AS had errors, independent of the size of the network. An iBGP signaling partition means the full-mesh condition on the top layer doesn’t hold. A duplicate loopback means two routers have the same loopback address; one router may then discard a route learned from the other, thinking the route is one it had announced itself. An incomplete iBGP session is a one-sided session. Inconsistent export/import means that an AS advertised routes that were not equally good at every peering point. Transit between peers means that a route learned from one peer or provider is re-advertised to another peer.
  • A better method of scaling iBGP is needed!
  • WORKFLOW. Say something about routing as a distributed program. The approach is still likely to be useful.
  • Let’s suppose Alice wants to chat with Eve on an instant messenger. Alice types her text, which hops through a bunch of routers and eventually reaches Eve. Now suppose a fault (such as a fiber cut) disrupts the link between two routers. We call this an IP fault, as it disrupts connectivity at the IP layer. IP routing fails over to the alternate path, through which the messages now begin to flow. IP networks are designed to be fault-tolerant. So why care about these IP faults?
  • Of course, just because IP networks are designed to be fault-tolerant to some extent doesn’t mean we can ignore these faults. The operator needs to fix them as soon as possible: the probability of a simultaneous failure increases as the downtime of the primary failure increases, and it is too expensive to provision many alternate paths. Therefore fast repair is necessary. To repair a fault we should first know where it happened, which is what we call localization. Fast localization is an important goal because it decreases the mean time to repair, and it is critical for attaining the five-nines reliability that ISP networks seek. But why is it difficult?
  • As shown in this graph, OSPF areas typically consist of a large number of links, whereas ports comprise only a single circuit. Fiber spans typically have a significant number of IP links sharing them, while SONET network elements typically have fewer. The important observation is that in real IP networks there is a significant degree of sharing of network components that can be exploited for spatial correlation. Thus shared-risk-group analysis is promising for large-scale networks.
  • Hit ratio and coverage ratio: the hit ratio is the fraction of the links in a group that are part of the observation; the coverage ratio is the fraction of the observation explained by that group.
  • Some failure messages are transported using UDP (or other unreliable mechanisms). There is also inaccuracy in the modeling of the shared risk groups.
  • Let’s consider a fault message lost in the monitoring system. Say a particular optical component with six associated links fails, but the failure messages for only five of them reach the monitoring system. Then the hit ratio would be 5/6, so relaxing the error threshold would account for the lost message. But there could also be a genuinely larger number of failures that, by Occam’s principle, the greedy algorithm ignores; in that case relaxing the threshold could lead to a less accurate hypothesis. So there is a trade-off here.
  • At the top of the hierarchy are the performance monitoring data sources. They can consist of SNMP traps, router syslogs, SONET performance monitoring data (if available), or whatever data source is available. They used router syslogs as the mechanism to indicate IP link failures. The data is first sanitized and transformed into a consistent format using data translators. For diagnosis, we apply a suitable data-source-specific fault localization policy to localize the fault. An SRLG database, usually constructed from router configurations, is also created and fed into SCORE. The spatial correlation engine itself is just a black box that takes in an SRLG database coupled with the recorded observations and an optional error threshold, and outputs a hypothesis. The fault localization policy for router syslogs that we have implemented consists of a simple query engine that calls SCORE with different error thresholds and evaluates the hypotheses based on a cost function. The cost function is very simple: it is the ratio of the size of the hypothesis output to the error threshold. The lower the error threshold, the higher the cost; the more SRLG groups in a hypothesis, the higher the cost. Accordingly, we select the right hypothesis, and this is output through a web interface.
  • Next we evaluate the efficacy of the greedy approximation with artificially generated faults but a real SRLG database from a section of the AT&T backbone network. They picked a random set of SRLG components to fail.
  • Demonstrating the accuracy of fault localization does not give any indication of how precise the localization is. Each SRLG actually consists of one or more physical components, so they introduce another term, localization precision, to express the precision the localization achieves: the ratio of the number of suspect components after localization to the number before localization.
  • There could be a genuinely larger number of failures describing the set of observations. An example of a high-level risk group would be all links terminating in a particular point of presence and sharing a power grid; an example of a low-level risk group is some internal risk group within a router. In such cases, relaxing the threshold will give an incorrect hypothesis.
  • The BGP monitor receives millions of updates per day, and they may arrive in bursts. This would overwhelm the operator and makes our system design very challenging. Please note that our work is not the same as root-cause analysis. We are interested in identifying routing changes and their effects; our work focuses on identifying actionable anomalies rather than diagnosis. We attempt to diagnose causes only if the changes occur in or near the AS.
  • The following illustrates the architecture of our system. The system is composed of four components. As we can see, our system reduces millions of BGP routing updates down to only tens of large routing changes as well as flapping prefixes. Next, I’m going to show you the details of each component and the design challenges that we face.
  • The first challenge we face is that a single routing change can lead to multiple update messages and affect routing decisions at multiple routers. In our system, we group all updates for a prefix with inter-arrival time < 70 seconds into events, and we flag events that last > 10 minutes as “persistent flapping prefixes”.
  • There are a few major concerns in network management. The first is changes in reachability. The second is a heavy load of routing messages on the routers: a high volume of routing updates will overload routers’ CPUs. The third is traffic shifts in the network. To better meet operators’ interests, we classify events by the severity of their impact on the network. In particular, we type events into five categories, which I will explain in detail next.
  • Another challenge is that a single routing change can affect multiple destination prefixes. In our system, we group events of the same type that occur close in time into clusters. We found two major contributors to large clusters: eBGP session resets and hot-potato changes.
  • This table shows the statistics of the five event categories. The first three categories vary significantly from day to day. The number of updates per event depends on the type of event and the number of affected routers. For example, gain/loss of reachability involves a long path-exploration process even if only one router is involved, whereas egress-point changes (e.g., hot-potato changes) usually involve only a few messages to move from one egress point to another.
  • Prefixes are not equally popular, and one can imagine that routing changes on popular prefixes may affect more traffic. We weight each cluster by traffic volume and identify large disruptions based on the traffic volume they affect.
  • Transcript

    • 1. ★ Detecting BGP Configuration Faults with Static Analysis ★ IP Fault Localization Via Risk Modeling ★ Finding a Needle in a Haystack: Pinpointing … Nick Feamster et al Ramana Rao Kompella et al Jian Wu et al Presented by Mikyung Han
    • 2. Detecting BGP Configuration Faults 2nd Symposium on Networked Systems Design and Implementation (NSDI) , Boston, MA, May 2005 Nick Feamster Hari Balakrishnan ★ Best Paper Award With Static Analysis
    • 3.
      • The Internet is increasingly becoming part of the mission-critical Infrastructure (a public utility!).
      Is correctness really that important? Big problem: Very poor understanding of how to manage it.
    • 4. Why does routing go wrong?
      • Complex policies
        • Competing / cooperating networks
        • Each with only limited visibility
      • Large scale
        • Tens of thousands of networks
        • … each with hundreds of routers
        • … each routing to hundreds of thousands of IP prefixes
    • 5. What can go wrong? Two-thirds of the problems are caused by configuration of the routing protocol Some things are out of the hands of networking research But…
    • 6. Categories of BGP Configurations: Ranking (route selection: customer vs. competitor, primary vs. backup, …), Dissemination (internal route advertisement), and Filtering (route advertisement). More flexibility brings more COMPLEXITY!
    • 7. These problems are real “… a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.” -- news.com , April 25, 1997 “ Microsoft's websites were offline for up to 23 hours... because of a [router] misconfiguration …it took nearly a day to determine what was wrong and undo the changes.” -- wired.com , January 25, 2001 “ WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to " a route table issue ." -- cnn.com , October 3, 2002 "A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration) .” -- dslreports.com , February 23, 2004
    • 8. Routing Faults Discussed on NANOG mailing List
    • 9. Why is routing hard to get right?
      • Defining correctness is hard
      • Interactions cause unintended consequences
        • Each network independently configured
        • Unintended policy interactions
      • Operators make mistakes
        • Configuration is difficult
        • Complex policies, distributed configuration
    • 10. Today: Tweak-N-Pray
      • Problems cause downtime
      • Problems often not immediately apparent
      What happens if I tweak this policy…? (Cycle: configure → observe → desired effect? If no, revert; if yes, wait for the next problem.)
    • 11. Goal: Proactive Approach
      • Idea: Analyze configuration before deployment
      Many faults can be detected with static analysis. (Workflow: configure → detect faults with rcc → deploy.)
    • 12. Router Configuration Checker ( rcc )
      • A tool that finds faults in BGP configuration with static analysis
        • Does not require additional work of operators
      • Detects
        • Path Visibility Faults
        • Route Validity Faults
        • Only detects faults in single AS
        • Only detects faults that cause persistent failures
    • 13. What is so cool about rcc?
      • Finds faults proactively
        • before deployment
      • Just convenient for now
        • BGP might need a high level specification of policies in the future
        • To do so,
          • High level specification language needed
          • Network operators need to learn and deploy
          • Even so, they may well write it incorrectly!
        • No additional work from network operators!
    • 14. rcc Overview
      • Analyzing complex, distributed configuration
      • Defining a correctness specification
      • Mapping specification to constraints
      (Diagram: distributed router configurations for a single AS feed into rcc, which builds a normalized representation, maps the correctness specification to constraints, and outputs faults.)
    • 15. rcc Implementation: distributed router configurations (Cisco, Avici, Juniper, Procket, etc.) are collected offline and preprocessed into a more parsable version; a parser builds the normalized representation in a relational database (MySQL); a verifier checks the constraints by running simple queries (select, join, etc.) and reports faults.
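A minimal sketch of this “normalize, then query” idea follows. It is not rcc’s actual schema or code; the table layout, router names, and addresses are invented, and the constraint checked here (no one-sided iBGP sessions) corresponds to the “incomplete iBGP session” fault class summarized later in the deck.

```python
# Hedged illustration: express configuration facts as rows, then check a
# correctness constraint with a query. Schema and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE loopbacks (router TEXT, loopback TEXT)")
cur.execute("CREATE TABLE ibgp_sessions (router TEXT, neighbor_loopback TEXT)")
cur.executemany("INSERT INTO loopbacks VALUES (?, ?)",
                [("W", "10.0.0.1"), ("X", "10.0.0.2"),
                 ("Y", "10.0.0.3"), ("Z", "10.0.0.4")])
cur.executemany("INSERT INTO ibgp_sessions VALUES (?, ?)",
                [("W", "10.0.0.2"),   # W -> X
                 ("X", "10.0.0.1"),   # X -> W  (two-sided: fine)
                 ("Y", "10.0.0.4")])  # Y -> Z, but Z never points back: fault

# Constraint: an iBGP session must be configured on both endpoints.
cur.execute("""
    SELECT s.router, l.router AS neighbor
    FROM ibgp_sessions s
    JOIN loopbacks l ON s.neighbor_loopback = l.loopback
    WHERE NOT EXISTS (
        SELECT 1 FROM ibgp_sessions r
        WHERE r.router = l.router
          AND r.neighbor_loopback =
              (SELECT loopback FROM loopbacks WHERE router = s.router)
    )
""")
print(cur.fetchall())   # -> [('Y', 'Z')]: the one-sided session is flagged
```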
    • 16. Which faults does rcc detect? Faults found by rcc Latent faults Potentially active faults End-to-end failures
    • 17. Correctness Specification
      • Safety: the protocol converges to a stable path assignment for every possible initial state and message ordering (the protocol does not oscillate).
      • Path visibility: every destination with a usable path has a route advertisement (if there exists a path, then there exists a route). Example violation: network partition.
      • Route validity: every route advertisement corresponds to a usable path (if there exists a route, then there exists a path). Example violation: routing loop.
    • 18. Path Visibility in iBGP. Default: “full mesh” iBGP, which doesn’t scale, so large ASes use route reflection. Route reflector: re-advertises non-client routes over client sessions and client routes over all sessions. Client: doesn’t re-advertise iBGP routes. (Diagram: route reflectors RR with clients c.)
    • 19. iBGP Fault Example
      • Network Partition
        • W learns r1 via eBGP
        • X does not readvertise to other iBGP sessions
        • Then Y and Z won’t learn r1 to d
      • Suboptimal Routing
        • Even if Y and Z learn a route to d via eBGP, this would be worse than r1 learned by W
    • 20. iBGP Signaling: Static Check Theorem. Suppose the iBGP reflector-client relationship graph contains no cycles. Then, path visibility is satisfied if, and only if, the set of routers that are not route reflector clients forms a full mesh. rcc checks whether iBGP signaling graph G is connected and acyclic , and whether the routers at the top layer of G form a full mesh .
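The static check in the theorem above can be sketched in a few lines. The graph representation and router names below are assumptions for illustration, not rcc’s data structures, and the connectivity test is omitted for brevity.

```python
# Sketch of the iBGP signaling check: the reflector->client graph must be
# acyclic and the non-client ("top layer") routers must form a full iBGP mesh.
from itertools import combinations

# Hypothetical inputs.
routers = {"RR1", "RR2", "RR3", "c1", "c2", "c3"}
rr_clients = {"RR1": {"c1", "c2"}, "RR2": {"c3"}}          # reflector -> clients
top_sessions = {frozenset(("RR1", "RR2")), frozenset(("RR1", "RR3"))}

def has_cycle(edges):
    graph = {r: set() for r in routers}
    for rr, clients in edges.items():
        graph[rr] |= clients
    state = {}                      # 0 = in progress, 1 = finished
    def dfs(u):
        state[u] = 0
        for v in graph[u]:
            if state.get(v) == 0 or (v not in state and dfs(v)):
                return True
        state[u] = 1
        return False
    return any(r not in state and dfs(r) for r in graph)

def top_layer_full_mesh(edges, sessions):
    clients = set().union(*edges.values())
    top = routers - clients
    return all(frozenset(p) in sessions for p in combinations(top, 2))

print("acyclic:", not has_cycle(rr_clients))                        # True
print("full mesh on top:", top_layer_full_mesh(rr_clients, top_sessions))
# False: RR2 and RR3 share no session, so path visibility may be violated.
```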
    • 21. Route Validity: Policy Related Problems
      • rcc operates without a specification of the intended policy
        • For convenience’s sake
      • rcc forms beliefs
        • Assume intended policies conform to best common practice
        • Analyze the configuration for common patterns and look for deviations from those patterns
      • Still useful but some false positives
    • 22. Route Validity: Best Common Practice
      • A route learned from peers should not be re-advertised to another peer
        • Ex: Ensuring no routes learned from Worldcom propagate to Sprint
      • AS should advertise routes with equally good attributes to each peer at every peering point
        • Violations
          • when routers in AS have different policy set to same peer
          • When there exists iBGP signaling partition
    • 23. Route Validity: Configuration Anomalies
      • When the configurations for sessions at different routers to a neighboring AS are the same except at one or two routers, rcc reports faults!
      • False Positives of course …
    • 24. Analyzing Real-World Configuration
      • Downloaded by 70 network operators, some of them shared their configurations
        • Reluctant to share because it’s proprietary
        • And because they don’t like researchers finding faults in their networks
      • Detected more than 1000 faults previously undiscovered in 17 ASes
    • 25. Summary: Faults across 17 ASes (chart: route validity and path visibility faults per AS). Every AS had faults, regardless of network size. Most faults can be attributed to distributed configuration.
    • 26. rcc : Take-home lessons
      • Better intra-AS route dissemination protocol needed
        • Current route reflection causes many faults!
      • BGP needs to be configured with a centralized higher-level specification language
        • The current distributed, low-level nature introduces complexity, obscurity, and the possibility of misconfiguration
        • But! trade-off with flexibility and expressiveness
    • 27. Discussion
      • Strength
        • Proves static configuration analysis uncovers many errors
        • Identifies major causes of error
          • Distributed configuration
          • Intra-AS dissemination is too complex
          • Mechanistic expression of policy
      • Weakness
        • rcc is not sound or complete
        • More room for improvement on ‘beliefs’
    • 28. IP Fault Localization 2nd ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI) , Boston, MA, May 2005 Ramana Rao Kompella Jennifer Yates Albert Greenberg Alex C Snoeren via Risk Modeling
    • 29. IP Network Fault-Tolerance (diagram: Alice chats with Eve across the Internet; when an IP fault hits one link, traffic fails over to an alternate path through other routers). IP networks are designed to be fault-tolerant! Any failure that causes an IP link to fail is termed an “IP fault”.
    • 30. Fault Repair
      • Fast Repair is necessary because
        • Probability of a simultaneous failure increases with down-time
        • Expensive to provision too many alternate paths
      • Fault Localization is a bottleneck for fault repair!
    • 31. What makes fault localization hard?
      • A typical Tier-I ISP network has
        • About a thousand routers
        • A few thousand IP links
        • Tens of thousands of optical components
        • About 50-100 thousand miles of optical fiber
        • Complicated topologies (mesh, ring etc.)
      • Current alarms do not indicate root-cause
      • Often problematic to monitor actual component failure
      • Failure alerts can get lost
      Operators Need an automated tool for fast fault localization
    • 32. Key Ideas: Shared Risk!
      • Risk modeling to localize faults across the IP and optical layers
      • SRLG : Shared Risk Link Groups
        • A physical object represents shared risk for a group of logical entities at IP layer
      • SCORE: Spatial Correlation Engine
        • cross-correlates dynamic fault information from two disparate network layers
    • 33. Logical/Physical IP Network QWEST IP Network Los Angeles San Jose Washington Atlanta Houston
    • 34. Logical/Physical IP Network (diagram: the same IP topology — Los Angeles, San Jose, Washington, Atlanta, Houston — overlaid on the optical layer with DWDMs, O-E-O conversion, and routers; a failed DWDM is the shared risk behind several IP link failures). Links that share a “shared risk” form a Shared Risk Link Group (SRLG).
    • 35. Various types of SRLGs
      • Physical Shared Risks
        • SONET (e.g. DWDM, ADM, Optical Amplifiers)
        • Fiber
        • Fiber Span
        • Router
        • Module
        • Port
      • Logical Shared Risks
        • Autonomous System
        • OSPF Areas
    • 36. SRLG Prevalence (plot: CDF of SRLG cardinality, i.e., number of links per group, on a log scale, with curves for fiber spans, fiber, SONET network elements, ports, router modules, routers, OSPF areas, and the aggregated database). At least 47% of all SRLGs have at least two links; more than 85% of OSPF areas have at least 10 links. Source: a section of the AT&T backbone network.
    • 37. Problem Formulation
      • A set of links
        • C = {c1, c2, …, cn}
      • A set of risk groups
        • G = {G1, G2, …, Gm}
        • Gi = {ci1, ci2, …, cik}, s.t. the cix are likely to fail simultaneously
      • An observation
        • O = {ce1, ce2, …, cem}
      • Find a hypothesis H
        • H = {Gh1, Gh2, …, Ghk} which explains O
          • Every member of O belongs to at least one member of H, and all the members of a given group Ghi belong to O
        • Many such Hs!
      Occam’s Razor: let’s not assume more than what is necessary. Simplicity is best.
    • 38. SRLG Database: R0 – {L0,L1}, R1 – {L0,L2,L3,L4}, R2 – {L4,L5}, R3 – {L3,L5,L6}, R4 – {L1,L2,L6}, D1 – {L0,L1,L2}, D2 – {L3,L5,L6}, D3 – {L3,L4,L5}, F0 – {L0,L1}, F1 – {L0,L2}, … (diagram: routers R0–R4, links L0–L6, DWDMs D1–D3, fiber spans F0–F7).
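For use in the sketches that follow, the example database on this slide can be written directly as a mapping from each risk group to the links it covers (only the groups actually listed are included; the elided F2–F7 entries are left out).

```python
# The slide's example SRLG database as a plain mapping: group -> set of links.
srlg_db = {
    "R0": {"L0", "L1"},
    "R1": {"L0", "L2", "L3", "L4"},
    "R2": {"L4", "L5"},
    "R3": {"L3", "L5", "L6"},
    "R4": {"L1", "L2", "L6"},
    "D1": {"L0", "L1", "L2"},
    "D2": {"L3", "L5", "L6"},
    "D3": {"L3", "L4", "L5"},
    "F0": {"L0", "L1"},
    "F1": {"L0", "L2"},
}
```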
    • 39. Bipartite Graph Formulation (diagram: risk groups — routers R0–R4, DWDM1–DWDM3, fiber spans 0–1 — on one side and links L0–L6 on the other, with the failed links marked). Observation: temporally correlated link failures. Hypothesis: a possible explanation.
    • 40. Bipartite Graph Formulation, continued (same diagram with a different set of failed links). A hypothesis can contain multiple simultaneous failures. Finding a minimal set cover of a given observation is NP-hard.
    • 41. Greedy Approximation (diagram: the observation marks a subset of the links). Hit ratio of R0 = |Gi ∩ O| / |Gi| = 1/2 = 50%. Coverage ratio of R0 = |Gi ∩ O| / |O| = 1/4 = 25%.
    • 42. Greedy Approximation
      • Out of all groups with hit-ratio 100%, pick group with max coverage
      • Prune links associated with this group and add this group to hypothesis
      • Repeat with pruned observation until no unexplained Observation
      (Hit ratio, coverage ratio) for each group in the example: R0=(50%,25%), R1=(75%,75%), R2=(100%,50%), R3=(33%,25%), R4=(66%,50%), D1=(66%,50%), D2=(33%,25%), D3=(66%,50%), F0=(50%,25%), F1=(100%,50%). (Diagram: the failed links and the candidate risk groups.) A sketch of this greedy procedure appears below.
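A small sketch of the greedy approximation described on this slide, operating on the srlg_db mapping given earlier. The failed-link observation used here is hypothetical (not the exact set marked on the slide), and this is an illustration rather than the SCORE implementation.

```python
def score_greedy(srlg_db, observation, hit_threshold=1.0):
    """Greedy set cover: repeatedly pick, among groups whose hit ratio meets
    the threshold, the group with the largest coverage of the still-unexplained
    links. Returns the hypothesis, or None if the observation cannot be fully
    explained at this threshold."""
    unexplained = set(observation)
    hypothesis = []
    while unexplained:
        best, best_cov = None, 0.0
        for group, links in srlg_db.items():
            hit = len(links & unexplained) / len(links)
            cov = len(links & unexplained) / len(unexplained)
            if hit >= hit_threshold and cov > best_cov:
                best, best_cov = group, cov
        if best is None:                  # nothing acceptable explains the rest
            return None
        hypothesis.append(best)
        unexplained -= srlg_db[best]      # prune links explained by this group
    return hypothesis

observation = {"L0", "L1", "L4", "L5"}    # hypothetical failed links
print(score_greedy(srlg_db, observation))
# -> ['R0', 'R2'] (F0 covers the same links as R0; dictionary order breaks the tie)
```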
    • 43. Modeling Imperfections
      • Ideally,
        • If a shared component fails, all associated links fail
        • Sometimes not true in practice
        • Failure message could get lost! (transported by UDP)
        • Inaccurate modeling of risk groups
      • Solution : Use an error threshold for the hit-ratios
        • Accounts for losses in data
        • Inaccurate modeling of SRLGs
    • 44. Modified Greedy Approximation
      • Select groups that have hit ratio > error threshold
      • Out of these groups, identify the group with maximum coverage
      • Prune the set of links that are explained by this group
      • Recursively repeat the above steps until all links are fully explained
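In the sketch above, this modification is just the hit_threshold parameter: groups are admitted when their hit ratio meets the error threshold instead of requiring 100%. A hypothetical call:

```python
# One "lost" failure report: L4 never arrives, yet R1 still explains the rest.
print(score_greedy(srlg_db, {"L0", "L2", "L3"}, hit_threshold=0.6))   # -> ['R1']
```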
    • 45. SCORE Spatial Correlation Module
      • Intelligence is built onto the SRLG database and reflected in the SCORE queries
      • Obtains minimum set hypothesis
    • 46. SCORE System Architecture: data translators sanitize and normalize router syslogs, SNMP traps, and SONET PM data; an SRLG database (built from router configurations) plus the observations feed the spatial correlation engine (SCORE API — input: a set of circuits <Ckt1, Ckt2, …> and an error threshold; output: a set of groups <Grp1, Grp2, …>). The fault localization policy (1) clusters events that are close together in time and (2) queries SCORE with multiple error thresholds for clustered events with similar signatures, outputting the hypothesis H with minimum cost (|H| / eThresh) through a web interface.
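A sketch of the localization heuristic described on this slide, reusing score_greedy from above: query at several error thresholds and keep the hypothesis with the lowest cost, where cost = |H| / threshold. The threshold list is an assumption, not the values used in the deployed tool.

```python
def localize(srlg_db, observation, thresholds=(1.0, 0.8, 0.6, 0.5)):
    best_h, best_cost = None, float("inf")
    for t in thresholds:
        h = score_greedy(srlg_db, observation, hit_threshold=t)
        if h is not None:
            cost = len(h) / t          # fewer groups and a higher threshold = better
            if cost < best_cost:
                best_h, best_cost = h, cost
    return best_h

print(localize(srlg_db, {"L0", "L2", "L3"}))   # -> ['R1'], found at threshold 0.6
```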
    • 47. Evaluation : Artificial Faults
      • Artificially generated faults but real SRLG database from (a section of) AT&T backbone network
      • Picked a set of components to fail
      • Observation then fed to SCORE
        • No losses in data, no database inconsistencies
      • Hypothesis compared with injected faults
    • 48. Perfect Fault Notification (plot: fraction of correct hypotheses vs. number of simultaneously induced failures, with curves for router, OSPF area, SONET, fiber span, port, module, and the aggregated database). Accuracy is greater than 95% for 5 failures.
    • 49. Imperfect Fault Notifications (plot: fraction of correct hypotheses vs. loss probability at eThresh 0.6, with curves for one through five failures). Accuracy trades off almost linearly with loss probability.
    • 50. Evaluation : Real Faults
      • A set of 18 faults studied and diagnosed
        • Where root-cause well-known
      • One Case Study
        • OSPF Area wide problem that affected about 70 links
        • SCORE identified about 20 SRLG groups as hypothesis
        • Further analysis revealed that the error was due to incorrect SRLG modeling
        • Relaxing the error threshold to 0.7 brought the hypothesis down to 4 groups
        • Only OSPF interfaces with MPLS enabled got affected by the protocol bug
    • 51. Evaluation: Real Faults
      • Similarly, SCORE uncovered
        • Database problems
        • Missing error reports from certain links
        • Other inconsistencies
      • Shows how error-thresholds are effective in uncovering these inconsistencies and data losses
    • 52. Localization Precision (plot: CDF of the localization fraction). About 40% of faults could be localized to less than 5% of components; about 80% of faults could be localized to less than 10% of components.
    • 53. Discussion
      • Strength
        • Captured the spatial correlation between IP links
        • Database inconsistencies are resolved in SCORE using a simple error threshold scheme
      • Weakness
        • Fails to model either very high-level or very low-level risk groups
        • Extremely hard to select a single error threshold for all observations!
        • Needs more intelligent heuristics in the fault localization policy
    • 54. Finding a Needle in a Haystack: Proc. Networked Systems Design and Implementation May 2005 Jian Wu Z. Morley Mao Jennifer Rexford Jia Wang Pinpointing Significant BGP Routing Changes in an IP Network
    • 55. Challenges & Goals
      • Large volume of BGP updates
        • Millions daily, very bursty
        • Too much for an operator to manage
      • Different than root-cause analysis
        • Identify changes and their effects
        • Focus on actionable events
        • Diagnose causes only in/near the AS
      • Goal
        • Convert millions of BGP updates into a few dozen actionable reports!
    • 56. System Architecture (pipeline): BGP updates (~10^6) → BGP update grouping → events (~10^5) plus persistent flapping prefixes (~10^1) → event classification → “typed” events → event correlation → clusters (~10^3) plus frequent flapping prefixes (~10^1) → traffic impact prediction (using Netflow data) → large disruptions (~10^1).
    • 57. Grouping BGP Update into Events
      • Challenge : A single routing change
        • leads to multiple update messages
        • affects routing decisions at multiple routers
      • Solution :
      • Group all updates for a prefix with inter-arrival < 70 seconds
      • Flag prefixes with changes lasting > 10 minutes.
      (Diagram: BGP updates → BGP update grouping → events, with persistent flapping prefixes split off.) A sketch of this grouping rule follows.
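A minimal sketch of the grouping rule above (70-second inter-arrival gap, flag events lasting more than 10 minutes). Updates are assumed to be (timestamp-in-seconds, prefix) pairs; this is an illustration, not the paper's system.

```python
from collections import defaultdict

def group_updates(updates, gap=70, flap_threshold=600):
    """Return (events, persistent_flapping_prefixes); an event is
    (prefix, start_ts, end_ts)."""
    by_prefix = defaultdict(list)
    for ts, prefix in sorted(updates):
        by_prefix[prefix].append(ts)

    events, flapping = [], set()
    for prefix, stamps in by_prefix.items():
        start = prev = stamps[0]
        for ts in stamps[1:]:
            if ts - prev > gap:              # a long gap closes the current event
                events.append((prefix, start, prev))
                start = ts
            prev = ts
        events.append((prefix, start, prev))

    for prefix, start, end in events:
        if end - start > flap_threshold:     # event lasted more than 10 minutes
            flapping.add(prefix)
    return events, flapping

updates = [(0, "10.1.0.0/16"), (30, "10.1.0.0/16"), (200, "10.1.0.0/16"),
           (5, "192.0.2.0/24")]
print(group_updates(updates))
# Two events for 10.1.0.0/16 (0-30 and 200-200) and one for 192.0.2.0/24.
```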
    • 58. Event Classification
      • Challenge : Major concerns in network management
        • Changes in reachability
        • Heavy load of routing messages on the routers
        • Change of flow of traffic through the network
      (Diagram: events → event classification → “typed” events.) Solution: classify events by the severity of their impact.
    • 59. Event Correlation
      • Challenge : A single routing change
        • affects multiple destination prefixes
      (Diagram: “typed” events → event correlation → clusters.) Solution: group events of the same type that occur close in time, as sketched below.
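A sketch of the correlation step under the same assumptions: events carry a type label, and events of the same type whose start times fall within a window are merged into one cluster. The 60-second window and the (type, prefix, start) tuple shape are assumptions for illustration.

```python
def correlate(typed_events, window=60):
    """typed_events: iterable of (event_type, prefix, start_ts).
    Returns a list of clusters, each covering one event type."""
    clusters = []
    for etype, prefix, start in sorted(typed_events, key=lambda e: (e[0], e[2])):
        last = clusters[-1] if clusters else None
        if last and last["type"] == etype and start - last["end"] <= window:
            last["prefixes"].add(prefix)     # same type, close in time: merge
            last["end"] = start
        else:
            clusters.append({"type": etype, "prefixes": {prefix}, "end": start})
    return clusters

print(correlate([("loss", "10.1.0.0/16", 0),
                 ("loss", "192.0.2.0/24", 20),
                 ("internal", "198.51.100.0/24", 25)]))
# -> one "internal" cluster and one "loss" cluster covering both prefixes
```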
    • 60. Statistics on Event Classification
      • First 3 categories have significant variations from day to day
      • Updates per event depends on the type of events and the number of affected routers
      Category                        Events    Updates
      No Disruption                   50.3%     48.6%
      Internal Disruption             15.6%     3.4%
      Single External Disruption      20.7%     7.9%
      Multiple External Disruption    7.4%      18.2%
      Loss/Gain of Reachability       6.0%      21.9%
    • 61. Traffic Impact Prediction
      • Challenge: routing changes have different impacts on the network, depending on the popularity of the destinations
      (Diagram: clusters plus Netflow data → traffic impact prediction → large disruptions.) Solution: weigh each cluster by traffic volume, as sketched below.
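Finally, the weighting step can be sketched as follows: each cluster's impact is the Netflow traffic volume of its prefixes, and clusters above a threshold are reported as large disruptions. The threshold and the traffic table are invented for illustration.

```python
def large_disruptions(clusters, traffic_by_prefix, threshold_bytes=10**9):
    """Keep only clusters whose affected prefixes carry significant traffic."""
    reports = []
    for c in clusters:
        volume = sum(traffic_by_prefix.get(p, 0) for p in c["prefixes"])
        if volume >= threshold_bytes:
            reports.append((c["type"], sorted(c["prefixes"]), volume))
    return reports

# Example usage with the clusters produced by correlate() above:
# large_disruptions(clusters, {"10.1.0.0/16": 5 * 10**9})
```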
    • 62. Conclusion
      • BGP anomaly detection
        • Fast, online fashion
        • Significant information reduction (to a few dozen actionable reports!)
      • Uncovered important network behaviors
        • Persistent flapping prefixes
        • Hot-potato changes
        • Session resets and interface failures