Last class, we talked about software debugging in general and how some of its ideas can be applied to the network context. All three papers deal with how to find faults. The first one is geared toward BGP misconfiguration. Recall that the registry values of a PC correspond to the configuration files in each router, and the key idea of PeerPressure was to compare a value against the majority to find faults. Similarly, the first paper proposes a router configuration checker (rcc) that finds faults via static analysis; the fundamental question is identifying common design patterns across routers and checking them statically.
So we have the responsibility to build more reliable networks.
There are two main reasons that routing is prone to faults. One is complex policies: there are competing ASes that are also peers under contract, and, as you know, everyone would like to keep their information secret. The other is simply that the network is too big.
This is also from the author’s slide. It is just a note that some of the problems can’t be solved by research; being aware of this keeps us a bit humble.
What makes distributed configuration hard? (external filtering vs. internal dissemination)
Filtering: who gets what
Dissemination: how they get it (the path by which they get it)
Ranking: what they get
Not much else in terms of complicated side effects; this amounts to verifying a distributed program’s correctness.
Informal analysis (collected from a mailing list archive): by no means a complete list. Be careful: incidents are not decreasing, and at larger scale these incidents could be further reaching. The statistical analysis just shows that there are a lot of these incidents.
Workflow!! Move toward operations with a tool that can be used prior to deployment. Problems may be masked, and it's not just mistakes. What kinds of things are people likely to want to change? For example: link provisioning, or a flash crowd (a simple example).
Somehow include the theory on this slide. Say “local correctness specification.” Play up the fact that the normalized representation is *not easy*! Interesting engineering side note: the configurations are difficult to parse. Scientific side note: had to come up with a normalized representation so that testing constraints is easy. Compiler people try to do this for multiple languages (cf.).
How many lines of code for each of these modules? Complexity of the verifier (could also talk about this at the beginning). Advantage of SQL: extensible. Talk about the *complexity* of the database and verifier operations. Do it quickly! A declarative language to run deductive queries. Normalization means expressing the configuration with centralized tables; constraints are checked by issuing queries on the tables.
What this diagram shows is that rcc’s constraints are neither complete nor sound: rcc may not find all problematic configurations, and it may report false positives. rcc detects a subset of latent faults. Latent faults are faults that are not actively causing any problems but certainly violate the correctness constraints. For potentially active faults, there is at least one input sequence that is certain to trigger the fault; when deployed, a potentially active fault becomes active if the corresponding input sequence occurs. BGP currently does not have a high-level specification mechanism. It would first need a high-level specification language, the network operator would have to learn that language, and there is an additional amount of work the operator would have to put in. Even so, operators may well write incorrect specifications. So rcc is simply more convenient and better.
A path is usable if it reaches the destination and conforms to the routing policies of the ASes on the path.
What problem arises? If a client receives a route from a route reflector, it doesn’t re-advertise it.
An edge exists iff the configuration of each router endpoint specifies the loopback address of the other endpoint and both routers agree on session options. The graph should be acyclic to ensure that a stable path assignment exists.
The authors argue that requiring operators to provide a high-level policy specification would require designing a specification language and convincing operators to use it. What is worse, such a specification can itself be erroneous, so there is no guarantee that the results would be more accurate. As an alternative, rcc adopts the following principle: it simply “believes” that intended policies conform to best common practice, and if something deviates from common practice, it reports an error.
Say rcc is checking this for AS X. First, rcc checks all routes that X exports to Sprint and figures out their common attributes (say, they are all tagged 1000). Then rcc checks the import policies for all sessions to WorldCom, ensuring that no import policy sets route attributes that would make X export the route to Sprint. Violations occur when routers in an AS apply different policies to the same peer; rcc easily checks distributed policies by normalizing all of them. Violations also occur when the iBGP signaling graph is partitioned and routes with equally good attributes may not propagate to all peering routers; rcc checks that the routers advertising routes to the same peer are in the same iBGP signaling partition.
This rests on the “belief” that when an AS exchanges routes with a neighboring AS over many sessions, most of those sessions have identical policies. There are, however, legitimate reasons for having slightly different import policies.
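A minimal sketch of this majority belief: group sessions by neighboring AS and flag any session whose policy deviates from the majority. The function and the policy strings are illustrative; rcc’s real check runs as queries over normalized configuration tables, not over this data structure.

```python
from collections import Counter

def flag_policy_outliers(sessions):
    """sessions: list of (neighbor_as, policy) pairs, where policy is a
    hashable normalized representation of the session's import policy.
    Returns {neighbor_as: [deviant policies]} for neighbors where a
    minority of sessions deviate from the majority policy."""
    by_neighbor = {}
    for neighbor, policy in sessions:
        by_neighbor.setdefault(neighbor, []).append(policy)
    outliers = {}
    for neighbor, policies in by_neighbor.items():
        majority, _ = Counter(policies).most_common(1)[0]
        deviants = [p for p in policies if p != majority]
        if deviants:
            outliers[neighbor] = deviants
    return outliers

# Hypothetical example: three sessions to AS 1239, one with a deviant policy.
sessions = [(1239, "prefer-customer"), (1239, "prefer-customer"),
            (1239, "prefer-peer"), (701, "default")]
print(flag_policy_outliers(sessions))  # {1239: ['prefer-peer']}
```

Since legitimate deviations exist, a real checker would report these as warnings for a human to review rather than as hard errors.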
Most of these stem from the fact that the configurations are distributed. Every AS had errors, independent of the size of the network. An iBGP signaling partition means the full-mesh condition at the top layer doesn’t hold. A duplicate loopback means two routers have the same loopback address; one router may then discard a route learned from the other, thinking the route is one it had announced itself. An incomplete iBGP session is a one-sided session. Inconsistent export/import means that an AS advertised routes that were not equally good at every peering point. Transit between peers means that a route learned from one peer or provider is re-advertised to another peer.
Better method of scaling iBGP is needed!
WORKFLOW. Say something about routing as a distributed program. The approach is still likely to be useful.
Let’s suppose Alice wants to chat with Eve on an instant messenger. Alice types her text, which hops through a bunch of routers here and eventually reaches Eve. Now suppose a fault happens (such as a fiber cut) that disrupts the link between two routers. We call this an IP fault, as it disrupts connectivity at the IP layer. IP routing fails over to the alternate path, through which the messages now flow. IP networks are designed to be fault-tolerant. So why care about these IP faults?
Of course, just because IP networks are designed to be fault-tolerant to some extent doesn’t mean we can ignore these faults. The operator needs to fix them as soon as possible: the probability of a simultaneous second failure increases the longer the primary failure stays down, and it is too expensive to provision many alternate paths. Therefore fast repair is necessary. To repair a failure, we should first know where it happened, and this constitutes what we call localization. Fast localization is therefore an important goal, as it decreases the mean time to repair; it is critical to attaining the five-nines reliability that ISP networks seek. But why is it difficult?
As shown in this graph, OSPF areas typically consist of a large number of links, whereas ports comprise only a single circuit. Fiber spans typically have a significant number of IP links sharing them, while SONET network elements typically have fewer. The important observation is that real IP networks exhibit a significant degree of sharing of network components that spatial correlation can exploit. Thus shared risk group analysis is promising for large-scale networks.
Hit Ratio and Coverage Ratio. The hit ratio is the fraction of the links in the group that are part of the observation. The coverage ratio is the fraction of the observation explained by the group.
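In code, the two ratios are just set operations. The link names and sets below are made up for illustration:

```python
# Hypothetical scenario: a risk group of two links, one of which
# appears in the observed set of failed links.
group = {"L0", "L1"}                    # links in the candidate risk group
observation = {"L0", "L2", "L3", "L4"}  # links reported as failed

# Hit ratio: fraction of the group's links that are in the observation.
hit_ratio = len(group & observation) / len(group)             # 1/2 = 0.5
# Coverage ratio: fraction of the observation explained by this group.
coverage_ratio = len(group & observation) / len(observation)  # 1/4 = 0.25
```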
Some failure messages are transported using UDP (or other unreliable mechanisms). There is also inaccuracy in the modeling of the shared risk groups.
Let’s consider a fault message lost by the monitoring system. Say a failed optical component consists of six links, but the failure message for only five of them reaches the monitoring system. Then the hit ratio would be 5/6, so relaxing the error threshold would account for this message loss. But there could also be a genuinely larger number of failures that, by Occam’s principle, are ignored by the greedy algorithm; if we relax the threshold in that case, it could lead to a more inaccurate hypothesis. So there is a trade-off here.
At the top of the hierarchy are the performance monitoring data sources. They can consist of SNMP traps, router syslogs, SONET performance monitoring data (if available), or whatever data source is available; here, router syslogs were used to indicate IP link failures. The data is first sanitized and transformed into a consistent format by data translators. For diagnosis, a data-source-specific fault localization policy is applied to localize the fault. An SRLG database, usually constructed from router configurations, is also created and fed into SCORE. The spatial correlation engine itself is a black box that takes the SRLG database, the recorded observations, and an optional error threshold, and outputs a hypothesis. The fault localization policy implemented for router syslogs is a simple query engine that calls SCORE with different error thresholds and evaluates the hypotheses with a cost function. The cost function is very simple: the ratio of the size of the hypothesis output to the error threshold. The lower the error threshold, the higher the cost; likewise, the more SRLG groups in a hypothesis, the higher the cost. Accordingly, the right hypothesis is selected and output through a web interface.
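The threshold-sweeping step described above can be sketched as follows. The function name, the candidate thresholds, and the SRLG names are all illustrative; the paper only specifies the cost as |hypothesis| divided by the error threshold:

```python
def pick_hypothesis(results):
    """results: {error_threshold: hypothesis}, where each hypothesis is a
    list of SRLG names returned by a SCORE query at that threshold.
    Cost = |hypothesis| / threshold: lower thresholds and larger
    hypotheses both raise the cost. Returns (threshold, hypothesis)
    with minimum cost."""
    return min(results.items(), key=lambda kv: len(kv[1]) / kv[0])

# Hypothetical SCORE outputs at three error thresholds:
results = {1.0: ["FiberSpan0", "Port3", "Module7"],   # cost 3.0
           0.8: ["FiberSpan0", "Port3"],              # cost 2.5
           0.6: ["FiberSpan0"]}                       # cost ~1.67
thresh, hypothesis = pick_hypothesis(results)
print(thresh, hypothesis)  # 0.6 ['FiberSpan0']
```

The cost function encodes the trade-off from the previous slide: relaxing the threshold is only worthwhile if it shrinks the hypothesis enough to pay for it.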
Next, the authors evaluate the efficacy of the greedy approximation with artificially generated faults, but using a real SRLG database from a section of the AT&T backbone network. They picked a random set of SRLG components to fail.
Demonstrating the accuracy of fault localization does not indicate how precise the localization is: each SRLG actually consists of one or more physical components. So the authors introduce another metric, localization precision, which is the ratio of the number of suspect components after localization to the number before localization.
There could be a genuinely larger number of failures describing the set of observations. An example of a high-level risk group would be all links terminating in a particular point of presence that share a power grid; an example of a low-level risk group is some internal risk group within a router. In this case, relaxing the threshold will give an incorrect hypothesis.
The BGP monitor receives millions of updates per day, and they may arrive in bursts. This would overwhelm the operator and makes the system design very challenging. Please note that this work is not the same as root-cause analysis: it is interested in identifying routing changes and their effects. The work focuses on identifying actionable anomalies rather than diagnosis, and attempts to diagnose causes only if the changes occur in or near the AS.
The following illustrates the architecture of the system, which is composed of four components. As we can see, the system reduces millions of BGP routing updates down to only tens of large routing changes as well as flapping prefixes. Next, I’m going to show the details of each component and the design challenges faced.
The first challenge is that a single routing change can lead to multiple update messages and affect routing decisions at multiple routers. In the system, all updates for a prefix with interarrival time < 70 seconds are grouped into events, and events lasting > 10 minutes are flagged as “persistent flapping prefixes.”
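A minimal sketch of this grouping rule. The 70-second and 10-minute constants are from the slide; the function and variable names are mine:

```python
EVENT_GAP = 70        # seconds: updates closer than this join the same event
FLAP_DURATION = 600   # seconds: events longer than this are flagged as flapping

def group_updates(timestamps):
    """timestamps: sorted arrival times (in seconds) of BGP updates for one
    prefix. Groups updates with interarrival time < EVENT_GAP into events;
    returns (start, end, is_persistent_flapping) per event."""
    events, current = [], [timestamps[0]]
    for t in timestamps[1:]:
        if t - current[-1] < EVENT_GAP:
            current.append(t)   # same event: gap below the threshold
        else:
            events.append(current)
            current = [t]       # gap too large: start a new event
    events.append(current)
    return [(ev[0], ev[-1], ev[-1] - ev[0] > FLAP_DURATION) for ev in events]

# Updates every 60 s from t=0 to t=660 form one long (flapping) event;
# an isolated update at t=2000 forms a second event.
updates = list(range(0, 661, 60)) + [2000]
print(group_updates(updates))  # [(0, 660, True), (2000, 2000, False)]
```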
There are a few major concerns in network management. The first is changes in reachability. The second is a heavy load of routing messages on the routers: a high volume of routing updates can overload a router’s CPU. The third is traffic shifts in the network. To better meet operators’ interests, events are classified by the severity of their impact on the network, into five categories that I will explain in detail next.
Another challenge is that a single routing change can affect multiple destination prefixes. The system groups events of the same type that occur close in time into clusters. There are two major contributors to large clusters: eBGP session resets and hot-potato changes.
This table shows the statistics of the five event categories. The first three categories vary significantly from day to day. The number of updates per event depends on the type of event and the number of affected routers. For example, gain/loss of reachability involves a long path-exploration process even if only one router is involved, whereas egress-point changes (e.g., hot-potato changes) usually involve only a few messages to move from one egress to another.
Prefixes are not equally popular, and one can imagine that routing changes on popular prefixes may affect more traffic. Each cluster is therefore weighted by traffic volume, and large disruptions are identified based on the traffic volume they affect.
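A sketch of this traffic weighting, assuming per-prefix volumes measured from Netflow; the prefixes, volumes, and threshold below are all made up for illustration:

```python
def large_disruptions(clusters, traffic, threshold):
    """clusters: {cluster_id: [affected prefixes]};
    traffic: {prefix: traffic volume, e.g. bytes/day from Netflow}.
    Weights each cluster by the traffic on its affected prefixes and
    returns the clusters whose weight meets the threshold."""
    weight = {cid: sum(traffic.get(p, 0) for p in prefixes)
              for cid, prefixes in clusters.items()}
    return {cid: w for cid, w in weight.items() if w >= threshold}

# Hypothetical data: one popular prefix, one unpopular one.
traffic = {"10.0.0.0/8": 9_000_000, "192.0.2.0/24": 1_000}
clusters = {"c1": ["10.0.0.0/8"], "c2": ["192.0.2.0/24"]}
print(large_disruptions(clusters, traffic, threshold=1_000_000))
# {'c1': 9000000} -- only the cluster affecting heavy traffic is reported
```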
★ Detecting BGP Configuration Faults with Static Analysis (Nick Feamster et al.)
★ IP Fault Localization Via Risk Modeling (Ramana Rao Kompella et al.)
★ Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network (Jian Wu et al.)
Presented by Mikyung Han
Detecting BGP Configuration Faults with Static Analysis. Nick Feamster, Hari Balakrishnan. 2nd Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, May 2005. ★ Best Paper Award
… each routing to hundreds of thousands of IP prefixes
What can go wrong? Two-thirds of the problems are caused by configuration of the routing protocol Some things are out of the hands of networking research But…
Categories of BGP Configuration
Ranking: route selection (customer vs. competitor, primary vs. backup, …)
Dissemination: internal route advertisement
Filtering: route advertisement
More flexibility brings more COMPLEXITY!
These problems are real “… a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.” -- news.com , April 25, 1997 “ Microsoft's websites were offline for up to 23 hours... because of a [router] misconfiguration …it took nearly a day to determine what was wrong and undo the changes.” -- wired.com , January 25, 2001 “ WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to " a route table issue ." -- cnn.com , October 3, 2002 "A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration) .” -- dslreports.com , February 23, 2004
Which faults does rcc detect? Faults found by rcc Latent faults Potentially active faults End-to-end failures
Correctness Specification
Safety: the protocol converges to a stable path assignment for every possible initial state and message ordering; the protocol does not oscillate.
Path Visibility: every destination with a usable path has a route advertisement (if there exists a path, then there exists a route). Example violation: network partition.
Route Validity: every route advertisement corresponds to a usable path (if there exists a route, then there exists a path). Example violation: routing loop.
Path Visibility in iBGP. Default: “full mesh” iBGP, which doesn’t scale, so large ASes use route reflection. A route reflector reflects non-client routes over client sessions and client routes over all sessions; a client doesn’t re-advertise iBGP routes. [Figure: a route-reflector hierarchy with clients (c) attached to route reflectors (RR).]
Even if Y and Z learn a route to d via eBGP, it would be worse than the route r1 learned by W.
iBGP Signaling: Static Check Theorem. Suppose the iBGP reflector-client relationship graph contains no cycles. Then, path visibility is satisfied if, and only if, the set of routers that are not route reflector clients forms a full mesh. rcc checks whether iBGP signaling graph G is connected and acyclic , and whether the routers at the top layer of G form a full mesh .
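A rough sketch of the static check the theorem implies, assuming a simple in-memory representation of the signaling graph. All function and variable names are hypothetical; rcc itself derives this graph from normalized configuration tables:

```python
def check_path_visibility(routers, rr_client_edges, ibgp_sessions):
    """routers: set of router ids; rr_client_edges: set of (reflector,
    client) pairs; ibgp_sessions: set of frozensets {a, b}. Checks the
    theorem's condition: the reflector->client graph is acyclic and the
    routers that are nobody's client form a full iBGP mesh."""
    children = {}
    for rr, c in rr_client_edges:
        children.setdefault(rr, []).append(c)

    # DFS cycle detection on the directed reflector->client graph.
    state = {}  # 0 = on current DFS path, 1 = finished

    def acyclic(node):
        if state.get(node) == 0:
            return False          # back edge: cycle found
        if state.get(node) == 1:
            return True
        state[node] = 0
        ok = all(acyclic(c) for c in children.get(node, []))
        state[node] = 1
        return ok

    if not all(acyclic(r) for r in routers):
        return False

    # Top layer = routers that are not a client of any reflector;
    # every pair of them must share an iBGP session (full mesh).
    clients = {c for _, c in rr_client_edges}
    top = [r for r in routers if r not in clients]
    return all(frozenset((a, b)) in ibgp_sessions
               for i, a in enumerate(top) for b in top[i + 1:])

# Two reflectors meshed at the top, each with one client: passes.
routers = {"rr1", "rr2", "c1", "c2"}
edges = {("rr1", "c1"), ("rr2", "c2")}
sessions = {frozenset(("rr1", "rr2")), frozenset(("rr1", "c1")),
            frozenset(("rr2", "c2"))}
print(check_path_visibility(routers, edges, sessions))  # True
```

Removing the rr1-rr2 session would break the top-layer mesh, and the check would report a path-visibility fault.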
Proves static configuration analysis uncovers many errors
Identifies major causes of error
Intra-AS dissemination is too complex
Mechanistic expression of policy
rcc is not sound or complete
More room for improvement on ‘beliefs’
IP Fault Localization via Risk Modeling. Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, Alex C. Snoeren. 2nd ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, May 2005.
IP Network Fault-Tolerance. [Figure: Alice’s traffic to Eve crosses routers in the Internet; an IP fault breaks a link, and routing fails over to an alternate path.] Any failure that causes an IP link to fail is termed an “IP fault.” IP networks are designed to be fault-tolerant!
Risk modeling to localize faults across the IP and optical layers
SRLG : Shared Risk Link Groups
A physical object represents a shared risk for a group of logical entities at the IP layer
SCORE: Spatial Correlation Engine
cross-correlates dynamic fault information from two disparate network layers
Logical/Physical IP Network. [Figure: QWEST IP network topology with nodes in Los Angeles, San Jose, Washington, Atlanta, and Houston.]
[Figure: the same topology at the physical layer, showing routers, DWDMs, and O-E-O conversions; a failed DWDM takes down several IP links at once.] Links that share a “shared risk” form a Shared Risk Link Group (SRLG).
SRLG Prevalence. [Figure: CDF of SRLG cardinality (number of links per group, log scale) for fiber spans, fiber, SONET network elements, ports, router modules, routers, OSPF areas, and the aggregated database.] At least 47% of all SRLGs have at least two links; more than 85% of OSPF areas have at least 10 links. Source: a section of the AT&T backbone network.
Bipartite Graph Formulation. [Figure: risk groups (DWDM1-DWDM3, Fiber Span0-Span1, routers R0-R4) on one side, links L0-L6 on the other, with failed links marked.] An observation is a set of temporally correlated link failures; a hypothesis is a possible explanation for it.
[Figure: the same bipartite graph with a different set of failed links.] A hypothesis can contain multiple simultaneous failures; finding a minimum set cover of a given observation is NP-hard.
Greedy Approximation. [Figure: the bipartite graph with observation O of failed links.] Hit ratio of R0 = |G_i ∩ O| / |G_i| = 1/2 = 50%. Coverage ratio of R0 = |G_i ∩ O| / |O| = 1/4 = 25%.
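The greedy approximation can be sketched as follows: repeatedly pick the risk group with the best coverage of the still-unexplained links, among groups whose hit ratio passes the error threshold. This is a minimal illustration of the idea, not SCORE’s actual implementation; the group names and link sets are made up:

```python
def score_greedy(groups, observation, err_threshold=1.0):
    """groups: {name: set of links}; observation: set of failed links.
    Greedy approximation to minimum set cover: at each step, among
    groups with hit ratio >= err_threshold, pick the one explaining
    the most still-uncovered links."""
    remaining = set(observation)
    hypothesis = []
    while remaining:
        candidates = [
            (name, links) for name, links in groups.items()
            if links
            and len(links & observation) / len(links) >= err_threshold
            and links & remaining]
        if not candidates:
            break  # nothing passes the threshold; some links stay unexplained
        best, links = max(candidates, key=lambda kv: len(kv[1] & remaining))
        hypothesis.append(best)
        remaining -= links
    return hypothesis

# Hypothetical database: a fiber span carrying L0-L2 fails entirely.
groups = {"FiberSpan0": {"L0", "L1", "L2"},
          "DWDM1": {"L0", "L1"},
          "Router_R0": {"L0", "L5"}}
print(score_greedy(groups, {"L0", "L1", "L2"}))  # ['FiberSpan0']
```

Lowering `err_threshold` below 1.0 lets groups with missing failure notifications (e.g. a lost syslog message) still qualify, which is exactly the trade-off discussed earlier.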
Intelligence is built onto the SRLG database and reflected in the SCORE queries
Obtains minimum set hypothesis
SCORE System Architecture
- Data sources: router syslogs, SNMP traps, SONET PM data, each fed through a data translator.
- SRLG database, queried via an API. Input: <Ckt1, Ckt2 ..>, error threshold. Output: <Grp1, Grp2..>.
- Fault localization policies on top of the spatial correlation engine (SCORE):
  1. Event clustering: captures events close together in time.
  2. Localization heuristics: queries clustered events with similar signatures at multiple error thresholds and outputs the H with minimum cost (|H|/eThresh).
- Results delivered through a WWW interface.
Artificially generated faults but real SRLG database from (a section of) AT&T backbone network
Picked a set of components to fail
Observation then fed to SCORE
No losses in data, no database inconsistencies
Hypothesis compared with injected faults
Perfect Fault Notification. [Figure: fraction of correct hypotheses vs. number of simultaneously induced failures, per SRLG type (router, area, SONET, fiber span, port, module) and aggregated.] Accuracy is greater than 95% for up to 5 simultaneous failures.
Imperfect Fault Notifications. [Figure: fraction of correct hypotheses vs. loss probability (eThresh 0.6), for one through five failures.] Accuracy trades off almost linearly with loss probability.
Shows how error-thresholds are effective in uncovering these inconsistencies and data losses
Localization Precision. [Figure: CDF of localization fraction.] About 40% of faults could be localized to less than 5% of components; about 80% of faults could be localized to less than 10% of components.
Database inconsistencies are resolved in SCORE using a simple error threshold scheme
Fails to model either very high-level or very low-level risk groups
Extremely hard to select a single error threshold for all observations!
Need more intelligent heuristics for the fault localization policy
Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network. Jian Wu, Z. Morley Mao, Jennifer Rexford, Jia Wang. Proc. Networked Systems Design and Implementation (NSDI), May 2005.
Convert millions of BGP updates into a few dozen actionable reports!
System Architecture
- BGP updates (10^6) → BGP Update Grouping → events (10^5), plus persistent flapping prefixes (10^1)
- Events → Event Classification → “typed” events
- Typed events → Event Correlation → clusters (10^3), plus frequent flapping prefixes (10^1)
- Clusters + Netflow data → Traffic Impact Prediction → large disruptions (10^1)