Broadcast storms in service provider network, Nafeez Islam



  1. Nafeez Islam, Assistant Manager, NovoCom Limited
  2. ▪ Once in a while, network engineers working in IIGs or ISPs in Bangladesh have to face a phenomenon: a switching loop. In the switch-based part of our network backbone, we have all the recommended loop prevention mechanisms in place. Even so, broadcast storms sometimes occur. This paper discusses my findings on what may have caused these occurrences, along with my recommendations. I first wrote about this topic 14 months ago as a LinkedIn article. I believe the topic is still relevant.
  3. ▪ In NovoCom's and InterCloud's networks, RSTP runs on the switching backbone. ▪ Furthermore, the switching network has a tree-like topology with no rings, so even with STP disabled there would be no scope for a loop to occur. ▪ But however unlikely it seems, we sometimes face broadcast storms.
  4. ▪ These broadcasts come to us from our partner networks. The affected devices are our partner/client-facing switches. ▪ More often than not, we are compelled to connect to our clients or partners at their switches rather than their routers. ▪ These networks may be fully switch-based and may connect to other uplinks/partners/clients through their switches. They might even have STP disabled.
  5. ▪ When the loop occurs, one or more of our switches reach almost 100 percent CPU utilization, management IPs become unreachable, and all clients on these devices face up to 100 percent packet loss. ▪ At 100 percent CPU utilization, it is no longer possible to check logs and interface traffic to isolate which traffic is causing the outage.
  6. ▪ After gaining access to the switches, we have to shut down clients and partners, manually or physically, to find out which network's disconnection returns things to normal. ▪ This is a very time-consuming process. ▪ It is quite frustrating to see our network affected even with the recommended design and configurations in place.
  7. ▪ One interesting point: we come to know, in real time or later, that several service providers are facing the same issue at exactly the same time. ▪ In the earlier days, we used to think something like this was happening:
  8. ▪ But even if the networks are physically connected in this manner, a loop should not take place. ▪ This is because in our network we follow a strict policy of allowing only specific VLANs on client-connected interfaces, and a given VLAN is never repeated.
  9. ▪ So in the diagram, the VLANs allowed at Interface A do not exist at Interface Z. ▪ Therefore the broadcast domains are completely separated.
  10. ▪ To understand it, I performed some very simple tests with BDCOM switches. I used BDCOM(tm) S5612 Software, Version 2.2.0C Build 42666. Here are a few of the findings: ▪ When loops are created intentionally, BDCOM switches detect and prevent them just fine with STP running. ▪ I used a ring of 4 switches to create a loop for specific VLANs; running STP on only 1 of the switches still prevented loops by blocking certain interfaces.
  11. ▪ But the problem occurs when the scenario in my lab test is something like this:
  12. ▪ I intentionally created a loop at A and B to see how it affects Switch Z. ▪ Here the circled portion replicates a network that is connected to us but whose STP/VLAN policies we do not control. ▪ But our focus is Switch Z, which represents our device connected to a client/partner.
  13. ▪ A single ping to a non-existent IP creates a broadcast storm at switches A and B and takes CPU utilization to 100 percent. ▪ The broadcast storm occurs in VLANs 1 and 200, but Switch Z should be discarding every packet that does not carry the VLAN 500 tag. ▪ So Switch Z itself is not taking part in the loop. Yet it still becomes unreachable, and its CPU utilization reaches 100 percent.
  14. ▪ Switch Z could have saved itself if STP could block the port connected to switch A. But a switch (with STP enabled) detects a loop when it sends out a BPDU and receives that BPDU back on another port.
  15. ▪ But switch Z does not receive, via interface Z from switch A, any BPDU of its own that it had sent out through other interfaces. ▪ So STP has no reason to conclude that there is a loop, and it does not put the interface into the blocking (BLK) state.
  16. ▪ A single ping from switch A/B to a non-existent IP creates about 600 Mbps of traffic at the connected interface of Switch Z. ▪ Switch Z is supposed to discard these broadcast packets because they do not belong to VLAN 500, but it still has to inspect every frame, check the VLAN tag, and then drop it. Dealing with so many broadcasts leads to CPU utilization of 100 percent. ▪ I captured packets from interface Z, and they are all broadcasts.
  17. ▪ Afterwards I replaced the BDCOM switch with a Cisco ME 3400 and used it as Switch Z. The result is the same: a spike in CPU utilization to over 95 percent.
  18. ▪ When a router receives a broadcast, it simply drops it. Yet even when I connect a router (a MikroTik CCR1016-12G in my test), the result is very high CPU utilization. ▪ And all of this happens from a broadcast storm that was created by just a single ping.
  19. ▪ So what actually happens to affect several service provider networks at the same time is something like this:
  20. ▪ Not only the network that originates the broadcast storm, but all the attached networks are affected.
  21. ▪ During real L2 looping incidents, we find several of our switches becoming unreachable. ▪ But in my lab setup, when I connect another switch X to switch Z as shown in the diagram below, only switch Z becomes unreachable; switch X is not affected.
  22. ▪ An explanation for this, in my opinion, is that in my lab tests and in our own production environment we allow only specific VLANs on all interfaces. ▪ Therefore, although the directly connected switch reaches 100 percent CPU utilization, it does not propagate the broadcasts to the next switch.
  23. ▪ But my assumption is that many networks may leave all their trunk interfaces at the default configuration and allow all VLANs, including VLAN 1. ▪ If network Q is such a network and is connected to a broadcast storm originator network P, then a broadcast storm from its neighbor P will not only affect its edge switch but will reach the farthest corner of its network. ▪ As a result, all other networks connected to different switches of network Q are also affected. ▪ Maybe this is why the looping incidents are on such a large scale and take down so many networks at the same time.
  24. ▪ (1) Never disable STP in your switching network. ▪ It is almost never advisable to disable STP. If you want to make STP convergence faster, you can use the PortFast and BPDU Guard commands on the interfaces where routers/servers/PCs are connected. But disabling STP altogether is not recommended. ▪ One reason for keeping STP disabled, I assume, is having many VLANs in the network and wanting complete control of the traffic flow direction.
  25. ▪ (1) Never disable STP in your switching network. ▪ Another reason for disabling STP could be the efficient use of all device ports and links, because STP may keep some ports blocked. ▪ However, this can be achieved by using PVST+ and manually assigning primary and secondary root bridges for each VLAN. ▪ This requires extensive and thorough planning, but it will enable you to dictate the traffic flow for each VLAN as you prefer.
  26. ▪ (2) Connect with your client/partner/peer at their routers rather than their switches. ▪ Running STP will not save you if your directly connected network is the broadcast storm originator. So try to persuade your client/partner to let you connect at their router. Your switch will then never have to face a storm.
  27. ▪ (3) Use the keepalive command: ▪ It is advisable to apply the ‘keepalive’ command on client-facing interfaces. ▪ If the neighboring switch is at 100 percent CPU utilization due to a broadcast storm, it will be unable to answer the keepalive query. The command then shuts the interface down, and the switch protects itself. ▪ In my lab tests, the keepalive command worked in 4 out of 6 cases, shutting down the interface before the switch was affected by the broadcast storm.
  28. ▪ These conclusions are based on my own observations and studies. ▪ The findings come from lab tests. ▪ Many other factors may contribute in a production environment.
  29. ▪ Even after deploying all loop prevention mechanisms, you may still face broadcast storms. ▪ These broadcast storms originate in a neighbor network over which you have no control. ▪ They will affect your directly connected devices. ▪ Following the recommendations may save you in such scenarios.
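
The policy on slides 8 and 9 (a VLAN allowed on one client-connected interface is never repeated on another) can be checked mechanically. The sketch below is only an illustration in Python: the function name `shared_vlans`, the interface labels, and the VLAN lists are hypothetical and are not taken from any production configuration.

```python
from itertools import combinations


def shared_vlans(allowed):
    """Return {(iface_a, iface_b): shared VLAN IDs} for every overlapping pair.

    `allowed` maps an interface name to the set of VLAN IDs permitted on it.
    An empty result means no VLAN is repeated across the interfaces.
    """
    overlaps = {}
    for (a, vlans_a), (b, vlans_b) in combinations(sorted(allowed.items()), 2):
        common = set(vlans_a) & set(vlans_b)
        if common:
            overlaps[(a, b)] = common
    return overlaps


# Hypothetical allowed-VLAN lists that follow the "never repeat a VLAN" policy.
policy = {"Interface A": {100, 200}, "Interface Z": {500}}
print(shared_vlans(policy))

# Hypothetical misconfiguration: VLAN 1 left allowed on both interfaces.
misconfigured = {"Interface A": {1, 200}, "Interface Z": {1, 500}}
print(shared_vlans(misconfigured))
```

An empty result corresponds to the "completely separated broadcast domains" case; any reported pair shows where a broadcast could cross from one client-facing interface to another.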
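
To get a feel for why the roughly 600 Mbps of broadcasts on slide 16 can exhaust a switch CPU, the rate can be converted into frames per second. This is back-of-the-envelope arithmetic, not a measurement: the frame sizes are assumptions (the lab capture did not record them), and the 20 bytes of per-frame overhead are the standard Ethernet preamble, start-of-frame delimiter, and inter-frame gap.

```python
def frames_per_second(bit_rate_bps, frame_bytes=64, overhead_bytes=20):
    """Frames per second carried by `bit_rate_bps` of Ethernet traffic.

    frame_bytes    -- assumed frame size including FCS (64 is the Ethernet minimum)
    overhead_bytes -- preamble + SFD (8 bytes) and inter-frame gap (12 bytes)
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return bit_rate_bps / bits_on_wire


storm_bps = 600_000_000  # the ~600 Mbps observed at the interface of Switch Z

# Assumption: the storm consists of minimum-size frames (e.g. ARP requests).
print(round(frames_per_second(storm_bps)))        # 892857

# Assumption: the storm consists of maximum-size untagged frames instead.
print(round(frames_per_second(storm_bps, 1518)))  # 48765
```

Under the minimum-size assumption, Switch Z would have to examine close to 900,000 frames every second just to decide that each one should be dropped.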
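
The reasoning on slides 21 to 23 (a switch with restricted VLANs absorbs the storm but does not pass it on, while a network with default trunks carries it everywhere) can be sketched as a small flooding model. This is a simplified illustration under stated assumptions, not a model of any vendor's forwarding logic: the topology, the switch names other than A, B, Z, and X, and the function `storm_impact` are all hypothetical, and CPU load is represented only as "this switch has to receive and drop the frames".

```python
from collections import deque

ALL = None  # marker for an interface left at default config (all VLANs allowed)


def permits(allowed, vlan):
    """True if an interface with this allowed-VLAN setting carries `vlan`."""
    return allowed is ALL or vlan in allowed


def storm_impact(links, origin, vlan):
    """Return (flooding, dropping) sets of switches for a storm in `vlan`.

    links    -- (switch_a, a_allowed, switch_b, b_allowed) tuples, one per link,
                with each end's allowed-VLAN set (or ALL)
    flooding -- switches that accept the broadcast and forward it further
    dropping -- switches that receive it on a port and must discard it
    """
    ports = {}
    for a, a_allowed, b, b_allowed in links:
        ports.setdefault(a, []).append((b, a_allowed, b_allowed))
        ports.setdefault(b, []).append((a, b_allowed, a_allowed))

    flooding, dropping = {origin}, set()
    queue = deque([origin])
    while queue:
        switch = queue.popleft()
        for neighbor, egress, ingress in ports.get(switch, []):
            if not permits(egress, vlan):
                continue  # the frame never leaves `switch` on this port
            if permits(ingress, vlan):
                if neighbor not in flooding:
                    flooding.add(neighbor)
                    queue.append(neighbor)
            else:
                dropping.add(neighbor)  # receives the storm, drops every frame
    return flooding, dropping - flooding


# Lab-like case: only VLAN 500 is allowed on Switch Z and Switch X.
lab = [("A", ALL, "B", ALL), ("A", ALL, "Z", {500}), ("Z", {500}, "X", {500})]
print(storm_impact(lab, "A", 200))

# Hypothetical network Q with default trunks, attached to originator P.
default_trunks = [("P", ALL, "Q1", ALL), ("Q1", ALL, "Q2", ALL),
                  ("Q2", ALL, "Q3", ALL), ("Q3", ALL, "R", {700})]
print(storm_impact(default_trunks, "P", 1))
```

In the lab-like case only Switch Z ends up in the dropping set and Switch X is untouched, which matches the observation on slide 21. With default trunks, every switch of network Q floods the storm onward, and the attached network R has to absorb it at its edge.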
