There is a general rule that floats around networking forums -- particularly Cisco's -- which states that a network with an STP diameter of 7+ can be a risk for instability and unexpected re-convergence. If you haven't heard of the Beth Israel Deaconess Medical Center meltdown then, please, allow me to summarize:
In 2002 -- I note the year just because 802.1w came out in 2001, but maybe BIDMC had cold feet about the transition -- the infrastructure at BIDMC came to a full stop due to layer 2 instability. After what was described as extensive work with Cisco TAC and engineers who flew in, it was discovered that BIDMC had exceeded an STP diameter of 7.
Going forward, keep in mind that, maybe by the standards of over a decade ago, 7 was unfeasible and possibly even downright impossible, but, today (and as we'll see below) we can get away with a bit more carelessness.
Recall that STP has three major timers:
- Hello = The amount of time between each Hello BPDU sent between switches. 2 seconds, by default, but anywhere from 1-10 seconds.
- FWD Delay = The amount of time spent in the Listening and Learning states, respectively, before transitioning; 15 seconds, by default, but anywhere from 4-30 seconds.
- Max Age = The max length of time that can pass without a refresh before a switch discards its stored Config BPDU info. 20 seconds, by default, but anywhere from 6-40 seconds.
The age timer that counts toward Max Age can only be reset by receiving a new superior BPDU, which changes the local bridge's view on how best to reach the root.
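To keep the three timers and their ranges straight, here's a minimal Python sketch. The class name `StpTimers` and the `RANGES` table are my own inventions for illustration (nothing here comes from IOS); the defaults and ranges are just the ones listed above.

```python
from dataclasses import dataclass

# Allowed ranges in seconds, as listed above (hypothetical helper table).
RANGES = {
    "hello": (1, 10),
    "forward_delay": (4, 30),
    "max_age": (6, 40),
}


@dataclass(frozen=True)
class StpTimers:
    """The three major STP timers, defaulting to 2 / 15 / 20 seconds."""
    hello: int = 2
    forward_delay: int = 15
    max_age: int = 20

    def __post_init__(self):
        # Reject any timer that falls outside its allowed range.
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}-{high} seconds")


defaults = StpTimers()
print(defaults)
```

Something like `StpTimers(max_age=41)` raises a `ValueError`, since 41 is past the 40-second ceiling.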
Each Config BPDU contains these 3 values, but, in addition, contains a little known bonus timer called "Message Age." MSG Age isn't a fixed value. The Root sends all BPDUs with MSG Age=0 and subsequent non-roots that receive the BPDU increment MSG Age by 1 then relay it. Effectively, MSG Age represents how far you are from the Root upon receiving the BPDU.
When a BPDU arrives that is superior to the one currently stored from the current Root (that is, it carries a better BID, which is Bridge Priority plus MAC address), the new, superior BPDU is stored and the Age Timer starts to increment, beginning at a value equal to the MSG Age received in the superior BPDU. If the Age Timer reaches a value equal to the switch's Max Age Timer before another BPDU is received from the Root (remember, the ROOT is always sending Hello BPDUs at a default of 2 sec) then the Age Timer doesn't refresh and the superior BPDU is aged out.
You can see what a problem this might cause with larger STP diameters...
Because we're working in a virtual infrastructure, say we've really let our network burn to the ground, and 18 switches are daisy chained from ROOT to furthest Access Switch. So our diameter = 18.
A majestic, swirling vortex of fail.
Recall that the Root originates its BPDU with MSG AGE = 0, then each switch increases MSG Age by 1 as it relays, so, at the far-end switch...MSG Age = 17 upon receiving the BPDU from the Root.
So, our Age Timer starts at 17 seconds, and we have 3 seconds of hold time (Max Age - MSG Age) before Max Age expires and the superior BPDU is discarded. By default, our Root will only re-send BPDUs every 2 seconds, so, assuming decent line speed/no link saturation, let's say it takes 1 second for the far-end switch to receive the superior BPDU and refresh its Age Timer.
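The arithmetic above is simple enough to sketch in a few lines of Python. The function names `message_age_at` and `headroom` are hypothetical, and the model assumes exactly what's described here: the Root is position 1, and each relaying switch adds 1 to MSG Age.

```python
def message_age_at(position: int) -> int:
    """MSG Age seen by the switch at `position` in the chain (Root = 1)."""
    if position < 1:
        raise ValueError("position starts at 1 (the Root)")
    return position - 1


def headroom(position: int, max_age: int = 20) -> int:
    """Seconds left before Max Age expires: Max Age - MSG Age."""
    return max_age - message_age_at(position)


for position in (1, 2, 3, 6, 13, 18):
    print(f"switch {position:2}: MSG Age {message_age_at(position):2}, "
          f"headroom {headroom(position):2} sec")
```

The last line printed is the one that matters: switch 18 starts at MSG Age 17 with only 3 seconds to spare.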
Let's see how the MSG Age looks as we move down the chain, starting on the Root.
Root Switch in the daisy chain originating Superior BPDU:
Root-SW1#sh spanning-tree vlan 1 detail
VLAN0001 is executing the ieee compatible Spanning Tree protocol
Bridge Identifier has priority 24576, sysid 1, address aabb.cc00.0100
Configured hello time 2, max age 20, forward delay 15
We are the root of the spanning tree
Topology change flag not set, detected flag not set
Number of topology changes 2 last change occurred 00:02:00 ago
from Ethernet0/0
Times: hold 1, topology change 35, notification 2
hello 2, max age 20, forward delay 15
Timers: hello 0, topology change 0, notification 0, aging 300
Port 1 (Ethernet0/0) of VLAN0001 is designated forwarding
!truncated for brevity!
Timers: message age 0
Now, the second switch in the daisy chain receiving the Superior BPDU:
SW2#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 1, forward delay 0, hold 0
We'll see the MSG Age timer increment as the switch waits for the next superior BPDU to arrive:
SW2#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 2, forward delay 0, hold 0
It will drop back to "message age 1" once the next Hello arrives. In the time between one Hello arriving (which set the message age to 1) and the next Hello leaving the Root and arriving, MSG Age will have incremented to 2 (possibly on its way to 3).
Here's the third switch in the daisy chain receiving the superior BPDU:
SW3#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 2, forward delay 0, hold 0
It starts at 2 rather than 1 because the BPDU's Message Age was incremented from 0 to 1 upon being received on SW2 then relayed from SW2 with Message Age = 1, which SW3 then incremented to 2 upon receiving it. It increments towards Max Age as it waits for a superior BPDU to arrive to refresh MSG Age.
SW3#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 3, forward delay 0, hold 0
For the sake of brevity, we'll skip down the chain to SW6 and check its MSG Age timers:
SW6#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 5, forward delay 0, hold 0
SW6#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 6, forward delay 0, hold 0
Now, all of this works nicely because, as you can see it takes maybe 1 second at most for the Root to get its superior BPDU down to the furthest switch and refresh the MSG Age timer. We probably won't see any issues unless we get our diameter up near 19 or 20. Let's see what happens!
As we get up on our 13th switch in the chain, we can see that the BPDU starts to take longer to arrive/be processed:
SW13#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 12, forward delay 0, hold 0
SW13#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 13, forward delay 0, hold 0
SW13#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 14, forward delay 10, hold 0
The superior BPDU that arrived on SW13 should have a MSG Age of 12, so it's taking a full 2 seconds before the Root's new BPDU can arrive and be processed by SW13 to refresh the MSG Age timer.
Let's skip ahead to our 18th switch:
SW18#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 17, forward delay 0, hold 0
SW18#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 18, forward delay 0, hold 0
SW18#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 19, forward delay 0, hold 0
SW18#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 20, forward delay 0, hold 0
SW18#sh spanning-tree vlan 1 detail | incl message age
Timers: message age 0, forward delay 0, hold 0
Yup, this is exactly the sort of thing that wreaks havoc on a network (though, in all fairness, if your network has an STP diameter bad enough where your MSG Age is starting at 17, you're probably already waist-deep in havoc). Notice how the age timer starts at 17 (as it should, since Root originates it at 0 and we're 18 switches deep), but it takes a full 3 seconds for the superior BPDU to reach switch 18.
Since our Max Age was reached before the Hello could be received/processed, the current superior BPDU was discarded, MSG Age resets to 0, and our far-end switch is now undergoing a topology change where it first assumes itself to be the Root:
SW18#sh spanning-tree vlan 1
VLAN0001
Spanning tree enabled protocol ieee
Root ID Priority 32769
Address aabb.cc00.1200
This bridge is the root
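What just happened on SW18 can be sketched as a toy model in Python. This is only arithmetic, not a protocol implementation: `age_timer_trace` is a made-up helper, and `refresh_delay` is my stand-in for "how many seconds pass before the next BPDU from the Root is received and processed."

```python
def age_timer_trace(start_age: int, refresh_delay: int, max_age: int = 20):
    """Tick the age timer once per second until it is refreshed or expires.

    Returns (trace, expired): the values the timer passed through, and
    whether it reached max_age before the refreshing BPDU arrived.
    """
    age = start_age
    trace = [age]
    for _ in range(refresh_delay):
        if age >= max_age:
            break
        age += 1
        trace.append(age)
    return trace, age >= max_age


# SW13: starts at 12, refreshed after about 2 seconds
print(age_timer_trace(12, 2))   # ([12, 13, 14], False)
# SW18: starts at 17, refresh takes about 3 seconds
print(age_timer_trace(17, 3))   # ([17, 18, 19, 20], True)
```

The SW18 trace matches the 17, 18, 19, 20 sequence in the output above: the timer hits Max Age before the refresh lands, so the stored BPDU is aged out.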
Keep in mind, this lab simulation assumes a topology with no production traffic crossing it, no actual cables (so no chance of EMI, bends, or other general causes of CRC errors), and really no overhead on the switches at all. It's entirely possible that, in production, we could have seen this on switch 10 or 11. One should always take into consideration the propensity for lost Hello BPDUs and, in a worst-case scenario, how long the end-to-end BPDU propagation truly takes, given the number of lost Hellos, the frequency with which Hellos go out (Hello Timer), the BPDU Delay (the amount of time it takes a switch to receive a BPDU then relay it, 1 second max), and your STP diameter.
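Cisco has a nice little formula for that propagation delay. Here it is worked through assuming 3 lost Hellos, a BPDU Delay of 1 second per hop, the default Hello of 2 seconds, and a diameter of 7:
End-to-end_BPDU_propa_delay
= ((lost_msg + 1) x hello) + (BPDU_Delay x (dia – 1))
= ((3 + 1) x hello) + (1 x (dia – 1))
= 4 x hello + dia – 1
= 4 x 2 + 6
= 14 sec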
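If you'd rather plug in your own numbers, the formula translates directly into Python. The function name and parameter names below are mine, chosen to mirror the terms in the formula; the defaults are the values used in the worked example (3 lost Hellos, 1 second of BPDU delay per hop, 2-second Hello).

```python
def bpdu_propagation_delay(dia: int, hello: int = 2,
                           lost_msg: int = 3, bpdu_delay: int = 1) -> int:
    """End-to-end BPDU propagation delay in seconds.

    ((lost_msg + 1) x hello) + (BPDU_Delay x (dia - 1))
    """
    return (lost_msg + 1) * hello + bpdu_delay * (dia - 1)


print(bpdu_propagation_delay(7))    # 14, the worked example
print(bpdu_propagation_delay(18))   # 25, our daisy chain of 18
```

With the same assumptions, the 18-switch chain comes out at 25 seconds of worst-case propagation delay, already past the default Max Age of 20.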
Once you have propagation delay, you can also make sense of why Max Age is 20 seconds by default:
max_age
= End-to-end_BPDU_propa_delay + Message_age_overestimate
= 14 + 6
= 20 sec
Where "Message_age_overestimate" accounts for the age the BPDU has accumulated since origination by the Root, at 1 second added by each relaying non-root:
Message_age_overestimate
= (dia – 1) x overestimate_per_bridge
= dia – 1
= 6
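Putting the two pieces together, here's a rough Python sketch that works the formula forwards (what Max Age does a given diameter need?) and backwards (what's the largest diameter a given Max Age covers?). Both function names are hypothetical, and the defaults are the same assumptions as the worked example: 3 lost Hellos, 1 second of BPDU delay and 1 second of overestimate per bridge, 2-second Hello.

```python
def required_max_age(dia: int, hello: int = 2, lost_msg: int = 3,
                     bpdu_delay: int = 1, overestimate_per_bridge: int = 1) -> int:
    """max_age = End-to-end_BPDU_propa_delay + Message_age_overestimate."""
    propagation = (lost_msg + 1) * hello + bpdu_delay * (dia - 1)
    overestimate = (dia - 1) * overestimate_per_bridge
    return propagation + overestimate


def largest_diameter(max_age: int) -> int:
    """Largest diameter whose required Max Age still fits (default assumptions)."""
    dia = 1
    while required_max_age(dia + 1) <= max_age:
        dia += 1
    return dia


print(required_max_age(7))     # 20
print(largest_diameter(20))    # 7
print(required_max_age(18))    # 42
print(largest_diameter(40))    # 17
```

Under these assumptions the default Max Age of 20 lines up exactly with a diameter of 7, and a diameter of 18 would want a Max Age of 42, which is beyond the 40-second maximum the timer allows.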
Once we take cable faults, switching delays, and general network overhead into account, 7 starts to look like a more fault-tolerant STP diameter limit.
So, there you have it! STP diameter and why anyone in their right mind has long since migrated to 802.1w. When I landed my first networking gig, you can imagine my surprise when I saw a few daisy-chained switches. While certainly not conducive to scalability and redundancy, it wasn't the end of the world, but it did merit a change control submission to move to something less...let's call it "horrifying."