BGP Anomaly Detection in an ISP
BGP Anomaly Detection in an ISP: Presentation Transcript

  • 1. BGP Anomaly Detection in an ISP
    • Jian Wu (U. Michigan), Z. Morley Mao (U. Michigan), Jennifer Rexford (Princeton), Jia Wang (AT&T Labs)
    • http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf
  • 2. Goal
    • Identify important anomalies
      • Lost reachability
      • Persistent flapping
      • Large traffic shifts
    • Contributions:
      • Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time
      • Use the tool to characterize routing disruptions in an operational network
  • 3. Capturing Routing Changes
    • Large operational network (8/16/2004 – 10/10/2004)
    [Figure: a BGP monitor receives updates carrying best routes over iBGP sessions; the network's routers hold eBGP sessions with their neighbors. Node labels: CPE, C, BR.]
  • 4. Challenges
    • Large volume of BGP updates
      • Millions daily, very bursty
      • Too much for an operator to manage
    • Different from root-cause analysis
      • Identify changes and their effects
      • Focus on actionable events
      • Diagnose causes only in/near the AS
  • 5. System Architecture
    [Figure: processing pipeline; the orders of magnitude shown on the slide are in parentheses]
    • BGP Updates (10^6) → BGP Update Grouping → Events (10^5)
    • Events → Event Classification → “Typed” Events
    • “Typed” Events → Event Correlation → Clusters (10^3)
    • Clusters + Netflow Data → Traffic Impact Prediction → Large Disruptions (10^1)
    • Side outputs: Persistent Flapping Prefixes (10^1), Frequent Flapping Prefixes (10^1)
  • 6. Grouping BGP Updates into Events
    • Challenge: a single routing change
      • leads to multiple update messages
      • affects routing decisions at multiple routers
    • Solution:
      • Group all updates for a prefix with inter-arrival time < 70 seconds
      • Flag prefixes with changes lasting > 10 minutes
    [Figure: BGP Updates → BGP Update Grouping → Events, with Persistent Flapping Prefixes as a side output]
  • 7. Grouping Thresholds
    • Based on data analysis and our understanding of BGP
    • Event timeout: 70 seconds
      • 2 * MRAI timer + 10 seconds
      • 98% of inter-arrival times are < 70 seconds
    • Convergence timeout: 10 minutes
      • BGP usually converges within minutes
      • 99.9% of events last < 10 minutes
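The grouping rule and thresholds on slides 6 and 7 can be illustrated with a minimal offline sketch. This is not the authors' tool: the function name `group_updates`, the `(timestamp, prefix)` input format, and the 30-second MRAI value (a commonly used eBGP default, which reproduces the 70-second result of 2 * MRAI + 10) are assumptions made for illustration.

```python
from collections import defaultdict

MRAI_SECONDS = 30                      # assumption: commonly used eBGP default
EVENT_TIMEOUT = 2 * MRAI_SECONDS + 10  # 70 seconds, as on the slide
CONVERGENCE_TIMEOUT = 10 * 60          # 10 minutes, as on the slide


def group_updates(updates):
    """Group (timestamp_seconds, prefix) updates into per-prefix events.

    An event is extended while consecutive updates for the same prefix
    arrive less than EVENT_TIMEOUT apart. Returns (events, persistent):
    events is a list of (prefix, start, end, update_count), and persistent
    is the set of prefixes with an event longer than CONVERGENCE_TIMEOUT.
    """
    by_prefix = defaultdict(list)
    for ts, prefix in updates:
        by_prefix[prefix].append(ts)

    events = []
    for prefix, stamps in by_prefix.items():
        stamps.sort()
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev < EVENT_TIMEOUT:
                count += 1          # same event: still within the timeout
            else:
                events.append((prefix, start, prev, count))
                start, count = ts, 1
            prev = ts
        events.append((prefix, start, prev, count))

    persistent = {p for p, start, end, _ in events
                  if end - start > CONVERGENCE_TIMEOUT}
    return events, persistent
```

A prefix that is updated every 60 seconds for more than ten minutes never closes its event, so it ends up in `persistent`. An online version would flag the prefix as soon as the open event passes the convergence timeout rather than after the fact.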
  • 8. Persistent Flapping Prefixes
    • Causes of persistent flapping
      • Conservative damping parameters (78.6%)
      • Protocol oscillations due to MED (18.3%)
      • Unstable interface or BGP session (3.0%)
    • Surprising finding: 15.2% of updates were caused by persistent flapping prefixes, even though flap damping was enabled!
  • 9. Example: Unstable eBGP Session
    • Flap damping parameters are session-based
    • Damping not implemented for iBGP sessions
    [Figure: ISP, Peer, and Customer networks, prefix p, routers A, B, C, and D]
  • 10. Event Classification
    • Challenge : Major concerns in network management
      • Changes in reachability
      • Heavy load of routing messages on the routers
      • Change of flow of traffic through the network
    • Solution: classify events by the severity of their impact
    [Figure: Events → Event Classification → “Typed” Events]
  • 11. Event Category – “No Disruption”
    • “No Disruption”: no border router sees a traffic shift (50.3% of events)
    [Figure: ISP with routers A to E, neighbors AS 1 and AS 2, prefix p; no traffic shift]
  • 12. Event Category – “Internal Disruption”
    • “Internal Disruption”: all of the traffic shifts are internal traffic shifts (15.6% of events)
    [Figure: ISP with routers A to E, neighbors AS 1 and AS 2, prefix p; internal traffic shift]
  • 13. Event Category – “Single External Disruption”
    • “Single External Disruption”: only one of the traffic shifts is an external traffic shift (20.7% of events)
    [Figure: ISP with routers A to E, neighbors AS 1 and AS 2, prefix p; external traffic shift]
  • 14. Statistics on Event Classification
    • First 3 categories have significant variations from day to day
    • The number of updates per event depends on the type of event and the number of affected routers
    Category                        Events    Updates
    No Disruption                   50.3%     48.6%
    Internal Disruption             15.6%     3.4%
    Single External Disruption      20.7%     7.9%
    Multiple External Disruption    7.4%      18.2%
    Loss/Gain of Reachability       6.0%      21.9%
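The five categories can be read as a simple decision rule over the traffic shifts seen at the border routers. The sketch below is an illustration only: the `"none"` / `"internal"` / `"external"` labels, the `reachability_changed` flag, the order of the checks, and the reading of "multiple external" as more than one external shift are all assumptions, since the slides do not spell them out.

```python
def classify_event(shifts, reachability_changed=False):
    """Return the event category for one event.

    shifts: one label per border router, each "none", "internal" or
    "external" (hypothetical encoding of the per-router traffic shift).
    """
    if reachability_changed:
        # Assumption: reachability changes take precedence over shifts.
        return "Loss/Gain of Reachability"
    shifts = list(shifts)
    external = sum(1 for s in shifts if s == "external")
    internal = sum(1 for s in shifts if s == "internal")
    if external == 0 and internal == 0:
        return "No Disruption"
    if external == 0:
        return "Internal Disruption"
    if external == 1:
        return "Single External Disruption"
    # Assumption: more than one external shift.
    return "Multiple External Disruption"
```

For example, `classify_event(["none", "internal", "external"])` falls into the single external category because exactly one router reports an external shift.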
  • 15. Event Correlation
    • Challenge : A single routing change
      • affects multiple destination prefixes
    • Solution: group events of the same type that occur close in time
    [Figure: “Typed” Events → Event Correlation → Clusters]
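Grouping events of the same type that occur close in time can be sketched as follows. The slides give no value for "close in time", so the `window` parameter (60 seconds here) is a placeholder, and the `(timestamp, event_type, prefix)` tuple format is likewise an assumption.

```python
def correlate(events, window=60.0):
    """Cluster (timestamp, event_type, prefix) events.

    An event joins the open cluster of its type if it occurs within
    `window` seconds of the previous event of that type; otherwise it
    starts a new cluster. Returns a list of (event_type, [prefixes]).
    """
    clusters = []
    open_cluster = {}  # event_type -> (last_timestamp, member list)
    for ts, etype, prefix in sorted(events):
        last = open_cluster.get(etype)
        if last is not None and ts - last[0] <= window:
            last[1].append(prefix)
            open_cluster[etype] = (ts, last[1])
        else:
            members = [prefix]
            clusters.append((etype, members))
            open_cluster[etype] = (ts, members)
    return clusters
```

Each cluster then stands for one candidate routing change that affected several destination prefixes at once.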
  • 16. eBGP Session Reset
    • Caused most “single external disruption” events
    • Check if the number of prefixes using that session as the best route changes dramatically
    • Validated against router syslog reports (95%)
    [Figure: number of prefixes over time, with session failure and session recovery marked]
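The session-reset check above (a dramatic change in the number of prefixes using a session as their best route) can be sketched as a threshold on consecutive samples. The slides do not say what counts as "dramatic", so the `change_fraction` parameter and the regular sampling of counts are assumptions.

```python
def detect_session_resets(counts, change_fraction=0.5):
    """Find sharp changes in the per-session prefix count.

    counts: number of prefixes whose best route uses the session,
    sampled at regular intervals. Returns a list of
    (sample_index, "failure" | "recovery").
    """
    changes = []
    for i in range(1, len(counts)):
        before, after = counts[i - 1], counts[i]
        baseline = max(before, after)
        if baseline == 0:
            continue  # session carried no prefixes in either sample
        if abs(after - before) / baseline >= change_fraction:
            changes.append((i, "failure" if after < before else "recovery"))
    return changes
```

A sharp drop marks a likely session failure and a sharp rise a likely recovery, matching the two points marked in the figure.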
  • 17. Hot-Potato Changes
    • Caused “internal disruption” events
    • Validation with OSPF measurement (95%) [ Teixeira et al – SIGMETRICS’ 04]
    • “Hot-potato routing” = route to the closest egress point
    [Figure: ISP with egress routers A, B, and C toward prefix P; path labels 10, 11, 9]
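Hot-potato routing picks the egress point with the smallest internal cost, so a change in internal costs alone can move traffic to a different egress. A minimal sketch, assuming a hypothetical mapping from egress router to IGP cost; which router carries which of the figure's values 10, 11, and 9 is assumed here:

```python
def closest_egress(igp_cost):
    """Return the egress router with the smallest IGP cost (ties by name)."""
    return min(sorted(igp_cost), key=igp_cost.__getitem__)


def hot_potato_change(before, after):
    """Return (old_egress, new_egress) if the closest egress changed, else None."""
    old, new = closest_egress(before), closest_egress(after)
    return (old, new) if old != new else None


# Assumed assignment of the figure's costs to routers A, B, and C.
before = {"A": 10, "B": 11, "C": 9}
after = {"A": 10, "B": 11, "C": 12}   # hypothetical IGP cost increase toward C
change = hot_potato_change(before, after)
```

In this example the egress moves from C to A even though no BGP route changed, which is the kind of internal disruption described above.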
  • 18. Traffic Impact Prediction
    • Challenge: the impact of a routing change on the network depends on the popularity of the affected destinations
    • Solution: weigh each cluster by traffic volume
    [Figure: Clusters + Netflow Data → Traffic Impact Prediction → Large Disruptions]
  • 19. Traffic Impact Prediction
    • Traffic weight
      • Per-prefix measurement from Netflow
      • 10% of prefixes account for 90% of traffic
    • Traffic weight of a cluster
      • Sum of “traffic weight” of the prefixes
      • A few clusters have large traffic weight
      • Mostly session resets & hot-potato changes
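The cluster weight defined above is a plain sum over the prefixes in the cluster. In the sketch below, the per-prefix weights are taken to be fractions of total traffic, and the `threshold` for calling a cluster a large disruption is a hypothetical parameter; the slides give neither.

```python
def cluster_weight(cluster_prefixes, prefix_weight):
    """Sum the traffic weights of a cluster's prefixes (unknown prefixes count as 0)."""
    return sum(prefix_weight.get(p, 0.0) for p in cluster_prefixes)


def large_disruptions(clusters, prefix_weight, threshold=0.05):
    """Return (cluster_name, weight) pairs at or above `threshold`, heaviest first."""
    weighted = [(name, cluster_weight(prefixes, prefix_weight))
                for name, prefixes in clusters.items()]
    heavy = [(name, w) for name, w in weighted if w >= threshold]
    return sorted(heavy, key=lambda item: item[1], reverse=True)
```

Because a small share of prefixes carries most of the traffic, most clusters fall below any reasonable threshold and only a few are reported.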
  • 20. Performance Evaluation
    • Memory
      • Static memory: “current routes”, 600 MB
      • Dynamic memory: “clusters”, 300 MB
    • Speed
      • 99% of 1-second intervals of updates can be processed within 1 second
      • Occasional execution lag
      • Every 70-second interval of updates can be processed within 70 seconds
    • Measurements were taken on a 900 MHz CPU
  • 21. Conclusion
    • BGP anomaly detection
      • Fast, online fashion
      • Operator concerns (reachability, flapping, traffic)
      • Significant information reduction
    • Uncovered important network behaviors
      • Persistent flapping prefixes
      • Hot-potato changes
      • Session resets and interface failures
  • 22. Detecting Peering Violations
    • Consistent export requirement
      • Peer should advertise prefixes at all peering points, with the same AS path length
      • Allows the AS to do hot-potato routing
    • Detecting violations
      • Using iBGP feeds from the border routers
      • Some inference tricks to identify inconsistencies
    • Results of the study
      • http://www.nanog.org/mtg-0410/feamster.html
      • http://www.cs.princeton.edu/~jrex/papers/imc04.pdf
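The consistent export requirement on slide 22 can be expressed as a check over a peer's advertisements. This sketch assumes full visibility of what the peer advertises at each peering point, stored in a hypothetical `{prefix: {peering_point: as_path_length}}` mapping. The study itself worked from iBGP feeds and needed inference to get there, which is not modeled here.

```python
def inconsistent_prefixes(adverts, peering_points):
    """Find prefixes a peer does not export consistently.

    adverts: {prefix: {peering_point: as_path_length}} for one peer.
    Returns {prefix: reason} for each violating prefix.
    """
    violations = {}
    for prefix, seen in adverts.items():
        missing = sorted(set(peering_points) - set(seen))
        if missing:
            violations[prefix] = "not advertised at " + ", ".join(missing)
        elif len(set(seen.values())) > 1:
            violations[prefix] = "AS path lengths differ"
    return violations
```

A prefix passes only if it is advertised at every peering point with one common AS path length, which is what lets the receiving AS choose its egress by hot-potato routing.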