Score:4

Linux: STP does not converge between Linux containers


I'm trying to create a lab in GNS3 with docker containers to understand more about spanning-tree. My lab is very simple: there are two Linux/Alpine containers with two links connecting them:

--------                                          --------
| SW-1 | et2 -------------------------------- et2 | SW-2 |
|      | et3 -------------------------------- et3 |      |
--------                                          --------

Each has a bridge br0 configured with the following:

ifconfig eth2 down
ifconfig eth3 down
brctl addbr br0
brctl addif br0 eth2
brctl addif br0 eth3
brctl stp br0 on
ifconfig eth2 0.0.0.0 up
ifconfig eth3 0.0.0.0 up
ifconfig br0 up
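
For reference, the equivalent setup with the newer iproute2 tools would be along these lines (a sketch, assuming the same interface names):

ip link set eth2 down
ip link set eth3 down
ip link add br0 type bridge stp_state 1
ip link set eth2 master br0
ip link set eth3 master br0
ip link set eth2 up
ip link set eth3 up
ip link set br0 up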

The bridges are up, the modules are loaded, and STP seems to be running fine on each host, but they don't converge. All ports stay in the forwarding state and L2 packets keep looping indefinitely:

# BOTH SW-1 and SW-2:
# lsmod | egrep -i 'bridge|stp'
bridge                352256  1 br_netfilter
stp                    16384  1 bridge
llc                    16384  2 bridge,stp

SW-1:
br0
 bridge id              8000.8615aca70489
 designated root        8000.8615aca70489    <<== SW-1 believes it is the root
 root port                 0                    path cost                  0
 max age                  20.00                 bridge max age            20.00
 hello time                2.00                 bridge hello time          2.00
 forward delay            15.00                 bridge forward delay      15.00
 ageing time             300.00
 hello timer               0.36                 tcn timer                  0.00
 topology change timer     0.00                 gc timer                 116.61
 flags


eth3 (2)
 port id                8002                    state                forwarding
 designated root        8000.8615aca70489       path cost                100
 designated bridge      8000.8615aca70489       message age timer          0.00
 designated port        8002                    forward delay timer        0.00
 designated cost           0                    hold timer                 0.00
 flags

eth2 (1)
 port id                8001                    state                forwarding
 designated root        8000.8615aca70489       path cost                100
 designated bridge      8000.8615aca70489       message age timer          0.00
 designated port        8001                    forward delay timer        0.00
 designated cost           0                    hold timer                 0.00
 flags


SW-2:
br0
 bridge id              8000.16d0f207e210
 designated root        8000.16d0f207e210    <<== SW-2 believes it is the root
 root port                 0                    path cost                  0
 max age                  20.00                 bridge max age            20.00
 hello time                2.00                 bridge hello time          2.00
 forward delay            15.00                 bridge forward delay      15.00
 ageing time             300.00
 hello timer               0.57                 tcn timer                  0.00
 topology change timer     0.00                 gc timer                 116.61
 flags


eth3 (2)
 port id                8002                    state                forwarding
 designated root        8000.16d0f207e210       path cost                100
 designated bridge      8000.16d0f207e210       message age timer          0.00
 designated port        8002                    forward delay timer        0.00
 designated cost           0                    hold timer                 0.00
 flags

eth2 (1)
 port id                8001                    state                forwarding
 designated root        8000.16d0f207e210       path cost                100
 designated bridge      8000.16d0f207e210       message age timer          0.00
 designated port        8001                    forward delay timer        0.00
 designated cost           0                    hold timer                 0.00
 flags

When I run tcpdump on eth2 and eth3 on both devices, I see the BPDUs being sent and received, but apparently each device ignores the BPDUs from the other (as a side effect of the loop, the load average on my machine also spikes):
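
To capture the BPDUs, an invocation along these lines works (a sketch; STP BPDUs are addressed to the multicast MAC 01:80:c2:00:00:00):

tcpdump -vni eth2 ether dst 01:80:c2:00:00:00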

SW-1:
Spanning Tree Protocol
    Protocol Identifier: Spanning Tree Protocol (0x0000)
    Protocol Version Identifier: Spanning Tree (0)
    BPDU Type: Configuration (0x00)
    BPDU flags: 0x00
        0... .... = Topology Change Acknowledgment: No
        .... ...0 = Topology Change: No
    Root Identifier: 32768 / 0 / 86:15:ac:a7:04:89
        Root Bridge Priority: 32768
        Root Bridge System ID Extension: 0
        Root Bridge System ID: 86:15:ac:a7:04:89 (86:15:ac:a7:04:89)
    Root Path Cost: 0
    Bridge Identifier: 32768 / 0 / 86:15:ac:a7:04:89
        Bridge Priority: 32768
        Bridge System ID Extension: 0
        Bridge System ID: 86:15:ac:a7:04:89 (86:15:ac:a7:04:89)
    Port identifier: 0x8002
    Message Age: 0
    Max Age: 20
    Hello Time: 2
    Forward Delay: 15

SW-2:
Spanning Tree Protocol
    Protocol Identifier: Spanning Tree Protocol (0x0000)
    Protocol Version Identifier: Spanning Tree (0)
    BPDU Type: Configuration (0x00)
    BPDU flags: 0x00
        0... .... = Topology Change Acknowledgment: No
        .... ...0 = Topology Change: No
    Root Identifier: 32768 / 0 / 16:d0:f2:07:e2:10
        Root Bridge Priority: 32768
        Root Bridge System ID Extension: 0
        Root Bridge System ID: 16:d0:f2:07:e2:10 (16:d0:f2:07:e2:10)
    Root Path Cost: 0
    Bridge Identifier: 32768 / 0 / 16:d0:f2:07:e2:10
        Bridge Priority: 32768
        Bridge System ID Extension: 0
        Bridge System ID: 16:d0:f2:07:e2:10 (16:d0:f2:07:e2:10)
    Port identifier: 0x8002
    Message Age: 0
    Max Age: 20
    Hello Time: 2
    Forward Delay: 15

Each keeps telling the other that it is the root bridge, no matter how long I wait. dmesg shows nothing but:

[37533.507941] br0: received packet on eth2 with own address as source address (addr:86:15:ac:a7:04:89, vlan:0)
[37533.507942] br0: received packet on eth3 with own address as source address (addr:86:15:ac:a7:04:89, vlan:0)

AFAIK, bridges are not aware of VLANs. I tried setting the default_pvid for this bridge to 0 anyway, but it made no difference. There are no ebtables filter rules applied, and I zeroed all the files under /proc/sys/net/bridge/ too. I don't see a reason why the BPDUs are not consumed so that the devices eventually converge.
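
For reference, those checks were along these lines (a sketch):

bridge vlan show                                          # inspect VLAN filtering state / PVIDs
ip link set br0 type bridge vlan_default_pvid 0           # the default_pvid change I tried
ebtables -L                                               # confirm no filter rules are applied
for f in /proc/sys/net/bridge/*; do echo 0 > "$f"; done   # zero the bridge sysctls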

I've tried the same experiment with just one link connecting the bridges (i.e., no loop) and a host behind each bridge on another interface. I configured static IP addresses on the hosts and they successfully pinged one another, i.e., the bridges are switching packets:

--------                                          --------
| SW-1 | et2 -------------------------------- et2 | SW-2 |
--------                                          --------
   et1                                              et1
    |                                                |
   host1                                            host2

I've also tried replacing the containers with Open vSwitch and proprietary images, and those worked fine. Any ideas?

"AFAIK, bridges are not aware of VLANs." They totally are: https://developers.redhat.com/blog/2017/09/14/vlan-filter-support-on-bridge (but I don't think that's relevant to your question)
I'm not familiar with GNS3. How are the links in this example connected? Are `eth2` in both containers simply two ends of a `veth` pair, or are they connected through a Docker bridge?
José Rios:
You are right, they are VLAN-aware, even though it indeed makes no difference here. I said they weren't based on this answer, posted a few months before the link you shared: https://serverfault.com/a/824736/1001960.
Score:3

This was a really interesting question, so I spent some time researching it. It turns out the question has been asked previously, and this answer describes the issue:

The root cause is that STP messages are sent correctly from the bridge slaves, but the receive routine is restricted to the initial network namespace (init_ns) in net/llc/llc_input.c, line 166 (linux-source-5.15.0...)

It looks like the same conditional exists in current 6.1.x kernels; see e.g. here.
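
If you have the kernel sources at hand, you can locate the check directly (a sketch; the exact line number varies between versions):

$ grep -n init_net net/llc/llc_input.c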

You can verify this is the problem by leaving your bridges in the global network namespace. I'm not familiar with GNS3, so I set up a test environment using the command line; I ended up with something like this:

[Diagram: bridges sw1-br0 and sw2-br0 in the global namespace, joined by two veth pairs, with a namespaced veth interface attached to each bridge]

In this diagram, sw1-br0 and sw2-br0 are bridge devices, sw1 and sw2 are network namespaces, and everything else is a veth device. This is largely equivalent to your example (the two bridges are connected by a pair of links), but the bridges live in the global namespace. We attach a namespaced interface to each bridge so that we can test end-to-end connectivity.

I set everything up with this script:

#!/bin/sh

set -ex

for dev in 1 2; do
        ns=sw$dev

        # create namespace
        ip netns add $ns

        # create bridge device
        ip link add $ns-br0 type bridge stp_state 1
        ip link set $ns-br0 up

        # create link from namespace to bridge
        ip link add $ns-int type veth peer name $ns-ext

        # configure internal device
        ip link set netns $ns dev $ns-int
        ip -n $ns link set up dev $ns-int
        ip -n $ns addr add 100.64.10.$(( dev * 10))/24 dev $ns-int

        # add external device to bridge
        ip link set master $ns-br0 dev $ns-ext
        ip link set up dev $ns-ext
done

# create links between bridge devices
for port in port0 port1; do
        ip link add sw1-$port type veth peer name sw2-$port
        ip link set master sw1-br0 sw1-$port
        ip link set master sw2-br0 sw2-$port
done

When this script finishes running, the bridge-to-bridge links are all disabled, while the namespace-to-bridge links are up. This gives us:

$ bridge link | grep br0
5090: sw1-ext@if5091: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw1-br0 state forwarding priority 32 cost 2
5093: sw2-ext@if5094: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw2-br0 state forwarding priority 32 cost 2
5095: sw2-port0@sw1-port0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 master sw2-br0 state disabled priority 32 cost 2
5096: sw1-port0@sw2-port0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 master sw1-br0 state disabled priority 32 cost 2
5097: sw2-port1@sw1-port1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 master sw2-br0 state disabled priority 32 cost 2
5098: sw1-port1@sw2-port1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 master sw1-br0 state disabled priority 32 cost 2

If I bring up the port0 link:

ip link set sw1-port0 up
ip link set sw2-port0 up

We eventually see:

5090: sw1-ext@if5091: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw1-br0 state forwarding priority 32 cost 2
5093: sw2-ext@if5094: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw2-br0 state forwarding priority 32 cost 2
5095: sw2-port0@sw1-port0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw2-br0 state forwarding priority 32 cost 2
5096: sw1-port0@sw2-port0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw1-br0 state forwarding priority 32 cost 2
5097: sw2-port1@sw1-port1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 master sw2-br0 state disabled priority 32 cost 2
5098: sw1-port1@sw2-port1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 master sw1-br0 state disabled priority 32 cost 2

And when I finally bring up the port1 link and wait for STP to converge:

ip link set sw1-port1 up
ip link set sw2-port1 up

We see:

5090: sw1-ext@if5091: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw1-br0 state forwarding priority 32 cost 2
5093: sw2-ext@if5094: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw2-br0 state forwarding priority 32 cost 2
5095: sw2-port0@sw1-port0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw2-br0 state forwarding priority 32 cost 2
5096: sw1-port0@sw2-port0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw1-br0 state forwarding priority 32 cost 2
5097: sw2-port1@sw1-port1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw2-br0 state blocking priority 32 cost 2
5098: sw1-port1@sw2-port1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master sw1-br0 state forwarding priority 32 cost 2

Here you can see that the bridges have successfully detected a loop and marked one of the ports as blocking.
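
You can also query the per-port STP details directly, either with brctl showstp as in the question or with the bridge utility; a sketch:

$ brctl showstp sw1-br0
$ bridge -d link show dev sw2-port1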

We can verify connectivity between the two namespaces:

# ip netns exec sw1 ping -c2 100.64.10.20
PING 100.64.10.20 (100.64.10.20) 56(84) bytes of data.
64 bytes from 100.64.10.20: icmp_seq=1 ttl=64 time=0.052 ms
64 bytes from 100.64.10.20: icmp_seq=2 ttl=64 time=0.067 ms

--- 100.64.10.20 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1004ms
rtt min/avg/max/mdev = 0.052/0.059/0.067/0.007 ms

And by running tcpdump on any of the links we can verify that there is no loop.
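
For example, watching one of the inter-bridge links while the ping above runs should show each ICMP packet exactly once, rather than the storm of duplicates you would see with a loop (a sketch):

$ tcpdump -eni sw1-port0 icmp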

José Rios:
Brilliant! You nailed it, @larsks! I ran the same experiment and STP converged too. Thank you! I also don't know how GNS3 creates the links (`ip netns list` shows me nothing while the lab is running), but it seems to be hitting the same issue indeed.
Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.