BGP Error Handling - Developing an Operator-Led Approach in the IETF (UKNOF 18)
1. BGP ERROR HANDLING.
DEVELOPING AN OPERATOR-LED APPROACH IN THE IETF.
Rob Shakir, Cable&Wireless Worldwide.
UKNOF 18 – 20/01/2011 - LONDON
2. A Typical SP Network?
[Diagram: network of PE and P routers with CUSTOMER, TRANSIT and PEER connections; legend: BGP, IGP.]
IGP
Signals customer/Internal prefixes between PEs
EGP
Propagates internal prefixes to neighbouring ASes.
3. A (Modern) Typical SP Network?
[Diagram: network of PE and P routers plus a route reflector (RR), with CUSTOMER, TRANSIT and PEER connections; legend: BGP, IGP.]
IGP
Minimal infrastructure routing information.
BGP
Propagate internal routing and service data.
4. BGP Failures I.
JAN. 09: ERRORS IN AS4_PATH
Erroneous data in the AS4_PATH optional transitive attribute causing BGP session failure (JunOS bug).
FEB. 09: VERY LONG AS_PATH
Very long AS_PATHs in the global BGP table cause session failure. Not the first time this had been seen.
5. BGP Failures II.
AUG. 10: RIPE NCC RIS EXPERIMENTAL
A RIPE NCC RIS/Duke University experiment results in BGP sessions being reset - disrupting global table (IOS XR bug).
??: iBGP FAILURES
Multiple occurrences within xSP networks. Likely to cause higher financial impact (L3VPN margin).
6. Why do we see these events?
[Diagram: RTR A sends an UPDATE to RTR B. Error! RTR B responds to RTR A with a NOTIFICATION.]
7. Cause/Impact.
LIMITED TOOLSET IN STANDARDS.
Must either DISCARD attributes or respond with NOTIFICATION.
SERVICE IMPACT.
Transit/Peering failure - although error source may be remote.
iBGP failure - high impact sessions? Route reflectors?
Results in loss of RIB!
Would you tolerate this in your IGP based on one erroneous LSP?
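A toy sketch of why one bad UPDATE "results in loss of RIB". The `Session` class and its fields are illustrative names only, not any vendor's implementation; the model simply assumes that a NOTIFICATION closes the session and invalidates every route learned over it.

```python
# Toy model of session-reset error handling: one bad UPDATE costs the whole RIB.
# All names here are illustrative, not taken from any BGP implementation.

class Session:
    def __init__(self):
        self.established = True
        self.adj_rib_in = {}  # prefix -> path attributes learned from the peer

    def receive_update(self, prefix, attributes, malformed=False):
        """Process one UPDATE; return the message sent in response, if any."""
        if malformed:
            # Error detected: send NOTIFICATION and close the session,
            # which invalidates every route learned over it.
            self.established = False
            self.adj_rib_in.clear()
            return "NOTIFICATION"
        self.adj_rib_in[prefix] = attributes
        return None


session = Session()
session.receive_update("10.0.0.0/8", {"as_path": [64500]})
session.receive_update("192.168.0.0/16", {"as_path": [64500, 64501]})
reply = session.receive_update("172.16.0.0/12", {}, malformed=True)
print(reply, session.established, len(session.adj_rib_in))  # NOTIFICATION False 0
```

Two valid routes are lost because of a single erroneous one, which is the behaviour the slide questions.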
8. Intent of Work.
DEFINE HOW BGP IS USED.
Document the way xSPs use BGP. Ensure that critical nature of the protocol is understood.
PROVIDE REQUIREMENTS
Determine how OPERATORS think that BGP should fail - and what we’ll compromise on.
TIE TOGETHER IETF WORK ITEMS.
Ensure that tools resulting from existing drafts form a useful framework to make BGP robust.
10. Avoid sending NOTIFICATION.
[Diagram: RTR A sends an UPDATE for 172.16.0.0/12 to RTR B. Error! RTR B treats 172.16.0.0/12 as WITHDRAWN rather than sending a NOTIFICATION.]
WHAT DO WE COMPROMISE ON?
“treat-as-withdraw” mechanism can result in routing inconsistency (possible loops!).
EXISTING WORK ITEMS IN IETF?
draft-chen (eBGP errors) - includes Opt Trans. Needs to be extended to cover iBGP.
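A minimal sketch of the treat-as-withdraw idea and the compromise it brings. The class and variable names are hypothetical; the model assumes the receiver can still identify the affected NLRI in the malformed UPDATE.

```python
# Toy model of "treat-as-withdraw": only the affected prefixes are dropped,
# the session stays up. Names are illustrative, not from any implementation.

class Session:
    def __init__(self):
        self.established = True
        self.adj_rib_in = {}  # prefix -> path attributes learned from the peer

    def receive_update(self, prefixes, attributes, malformed=False):
        if malformed:
            # Assumption: the NLRI could still be parsed, so withdraw only those.
            for prefix in prefixes:
                self.adj_rib_in.pop(prefix, None)
            return "treated-as-withdraw"
        for prefix in prefixes:
            self.adj_rib_in[prefix] = attributes
        return "accepted"


# What RTR A believes it has advertised to RTR B.
sender_adj_rib_out = {"10.0.0.0/8", "172.16.0.0/12"}

receiver = Session()
receiver.receive_update(["10.0.0.0/8"], {"as_path": [64500]})
receiver.receive_update(["172.16.0.0/12"], {}, malformed=True)

# The two speakers now disagree about 172.16.0.0/12.
inconsistent = sender_adj_rib_out - set(receiver.adj_rib_in)
print(receiver.established, sorted(receiver.adj_rib_in), sorted(inconsistent))
# True ['10.0.0.0/8'] ['172.16.0.0/12']
```

The session and the unaffected route survive, but RTR A and RTR B no longer hold the same view of 172.16.0.0/12.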
11. Recover RIB Consistency.
[Diagram: RTR B is missing 172.16.0.0/12 from RTR A. RTR B sends REQUEST 172.16.0.0/12 to RTR A; RTR A replies with UPDATE 172.16.0.0/12.]
HOW CAN THIS BE ACHIEVED?
Mechanisms to re-request missing NLRI. One prefix at once, or whole RIB.
EXISTING WORK ITEMS?
“One-Time Prefix ORF”. Enhanced ROUTE REFRESH.
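A sketch of the two re-request granularities, one prefix at once or the whole RIB. This models only the effect on the receiver's RIB; `Peer` and `re_advertise` are hypothetical names and do not represent the ORF or ROUTE REFRESH wire formats.

```python
# Toy model of re-requesting missing NLRI from a peer.
# Names are illustrative; no real protocol message is encoded here.

class Peer:
    def __init__(self, adj_rib_out):
        self.adj_rib_out = dict(adj_rib_out)  # prefix -> path attributes

    def re_advertise(self, prefix=None):
        """Return the routes to resend: one prefix, or everything if None."""
        if prefix is None:
            return dict(self.adj_rib_out)
        if prefix in self.adj_rib_out:
            return {prefix: self.adj_rib_out[prefix]}
        return {}


rtr_a = Peer({
    "10.0.0.0/8": {"as_path": [64500]},
    "172.16.0.0/12": {"as_path": [64500, 64501]},
})

# RTR B lost 172.16.0.0/12 (for example after treating it as withdrawn).
rtr_b_rib = {"10.0.0.0/8": {"as_path": [64500]}}

# Option 1: re-request the single missing prefix.
rtr_b_rib.update(rtr_a.re_advertise("172.16.0.0/12"))

# Option 2: re-request the whole RIB (costlier, but needs no knowledge
# of which prefix is missing).
full_rib = rtr_a.re_advertise()

print(sorted(rtr_b_rib), rtr_b_rib == full_rib)
```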
12. Reduce Impact of Session Reset.
SESSION RESETS, CAN WE AVOID THEM?
NOTIFICATION has utility for resetting state. Consider that sometimes it is unavoidable.
[Diagram: SESSION RESET and SESSION RE-OPEN between RTR A and RTR B. FORWARDING PLANE UNAFFECTED.]
EXISTING WORK ITEMS IN IETF?
(Expired) “SOFT-NOTIFICATION”. Further work required to revive!
13. Introduce Further Monitoring.
EXISTING ERRORS ARE VERY VISIBLE.
NOCs can see session failures very easily - both via session monitoring and forwarding outage!
FURTHER COMPLEXITY MEANS LESS MANAGEABLE
Mechanisms are required to make error handling visible to both BGP speakers.
EXISTING WORK ITEMS IN IETF?
(In-band) ADVISORY and DIAGNOSTIC. (Out-of-Band) BGP Monitoring Protocol.
14. Complexities of Approach.
[Flowchart nodes: Error!; Know the NLRI?; treat-as-withdraw; Re-request (ORF); Re-request the whole RIB; Hitless Session Reset; OOPS!]
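One plausible reading of the flowchart, written as an escalation ladder. The ordering of the steps is an assumption made for illustration (the slide text does not preserve the arrows), and `recovery_steps` is a hypothetical helper name.

```python
# Sketch of an escalation ladder for handling an erroneous UPDATE.
# Assumption: each step is tried only if the previous one is not enough.

def recovery_steps(nlri_known):
    """Return the ordered list of actions to try after an error."""
    steps = []
    if nlri_known:
        # The affected prefixes can be identified: handle them individually.
        steps.append("treat-as-withdraw")
        steps.append("re-request (ORF)")
    # Otherwise, or if the targeted re-request fails, widen the scope.
    steps.append("re-request the whole RIB")
    steps.append("hitless session reset")
    return steps


print(recovery_steps(nlri_known=True))
print(recovery_steps(nlri_known=False))
```

The point of the sketch is the growing blast radius at each step: the more the speaker knows about the error, the less state it has to throw away.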
15. Why am I standing here?
UKNOF
As Operators, we deal with the fall-out of protocol issues!
SO…
an agreed, operator-recommended approach is required.