I am troubleshooting an issue that has caused network outages across my organization (a picture of our simple topology is linked below). Our network consists of a Router/Firewall [SRX340], 2x Access Switches [EX2300], and 1x Top-of-Rack (ToR) Switch [EX2300].
(Sorry, I couldn't embed the image directly in the post as I do not have enough reputation to do so, but you can find it here: https://i.stack.imgur.com/ldX23.png)
The links between all the switches are trunks carrying VLANs (4, 10, 20...).
Our topology has RSTP (Rapid Spanning Tree Protocol) enabled globally on all devices. The Root Bridge is the (ToR Switch) and, after convergence, the blocked ports are ge-0/0/4 and ge-0/0/5 on the (Router/Firewall).
The problem:
We have been facing issues on our ae0 interface, which connects the (Router/Firewall) to our (ToR Switch). Based on our logs, LACP starts to flap when we connect a new device with a Layer-3 IP address to any of the (ToR Switch) access ports, but only for VLAN 10.
It is worth noting that the issue clears more quickly if I unplug the offending device from the (ToR Switch); otherwise it resolves by itself a couple of minutes after the incident starts.
root@TOR-SW01> show log messages
Apr 4 12:24:04 TOR-SW01 mib2d[17124]: SNMP_TRAP_LINK_DOWN: ifIndex 610, ifAdminStatus up(1), ifOperStatus down(2), ifName ae0
Apr 4 12:24:06 TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/44: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
Apr 4 12:24:06 TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/44) unknown boolean option 112
Apr 4 12:24:06 TOR-SW01 dc-pfe[16887]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr 4 12:24:06 TOR-SW01 dc-pfe[16887]: ifd 705; Ether boolean set error (22)
Apr 4 12:24:06 TOR-SW01 fpc0 ETH: ifd (ge-0/0/44) unknown boolean option 112
Apr 4 12:24:06 TOR-SW01 fpc0 IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr 4 12:24:06 TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/45: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
Apr 4 12:24:06 TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/45) unknown boolean option 112
Apr 4 12:24:06 TOR-SW01 dc-pfe[16887]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr 4 12:24:06 TOR-SW01 dc-pfe[16887]: ifd 706; Ether boolean set error (22)
Apr 4 12:24:06 TOR-SW01 fpc0 ifd 705; Ether boolean set error (22)
Apr 4 12:24:06 TOR-SW01 fpc0 ETH: ifd (ge-0/0/45) unknown boolean option 112
Apr 4 12:24:07 TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/46: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
Apr 4 12:24:07 TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/46) unknown boolean option 112
Apr 4 12:24:07 TOR-SW01 dc-pfe[16887]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr 4 12:24:07 TOR-SW01 dc-pfe[16887]: ifd 707; Ether boolean set error (22)
Apr 4 12:24:07 TOR-SW01 fpc0 IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr 4 12:24:07 TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/47: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
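For anyone who wants to tabulate these transitions per member link, here is a minimal Python sketch that pulls the LACP_INTF_MUX_STATE_CHANGED lines out of the log. The regular expression is derived only from the lines shown above, and the names `parse_mux_change`, `parse_flags` and `FIELDS` are hypothetical. The eight flag positions are assumed to follow the IEEE 802.1AX state bits from Expired down to Activity, which is consistent with the DEF / OUT_OF_SYNC / AGG / SHORT / ACT tokens in the log.

```python
import re

# Regex built from the lacpd lines in this post only; other Junos releases may differ.
MUX_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+lacpd\[\d+\]:\s+"
    r"LACP_INTF_MUX_STATE_CHANGED:\s+(?P<ae>\S+):\s+(?P<member>\S+):\s+"
    r"Lacp state changed from (?P<old>\w+) to (?P<new>\w+),\s+"
    r"actor port state : (?P<actor>\|.*?\|),\s+"
    r"partner port state : (?P<partner>\|.*\|)\s*$"
)

# Assumed order of the eight pipe-separated flags (802.1AX bit 7 down to bit 0).
FIELDS = ("expired", "defaulted", "distributing", "collecting",
          "sync", "aggregation", "timeout", "activity")


def parse_flags(text):
    """Split '|-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|' into a name -> token dict."""
    parts = text.strip().strip("|").split("|")
    return dict(zip(FIELDS, parts))


def parse_mux_change(line):
    """Return a dict for a mux state change line, or None for any other line."""
    m = MUX_RE.match(line.strip())
    if m is None:
        return None
    record = m.groupdict()
    record["actor"] = parse_flags(record["actor"])
    record["partner"] = parse_flags(record["partner"])
    return record
```

Feeding the log through `parse_mux_change` and keeping the non-None results gives one record per member (ge-0/0/44 through ge-0/0/47), which makes it easier to see that all four links leave the bundle within about a second of each other.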
As mentioned above, this issue can only be replicated when a device connected to any access port on the (ToR Switch) obtains an IP address in VLAN 10, either through DHCP or manual configuration. This leads to an entire network outage, as people cannot reach the router. For any other VLAN, the issue does not occur. Furthermore, if a device is connected to any (Access Switch) port and obtains an IP address in VLAN 10, nothing is triggered and no problem occurs.
What Have I Done So Far?
I have already tried manually removing (unplugging) both interfaces ge-0/0/4 and ge-0/0/5 on the (Router/firewall), to minimize any network loops, yet the problem persists.
I tried updating all of our Juniper Devices to the latest version as of the time of writing: (22.4R1).
I have also cleared the MAC Address table from all (Access Switches), as well as ARP on the (Router/Firewall).
I tried rebooting our (Router/Firewall) as well as all switches, yet the issue persists.
My Questions are:
- Why does my ae0 link (LACP) go down when a layer-3 device gets an IP address, and only when that device is connected to a (ToR Switch) access port (RSTP edge)?
- How can I tell if this is a layer-3 issue or a layer-2 issue?
- Why does the issue only occur on the ToR Switch?
- What else can I do to help troubleshoot this issue?
Useful Configuration:
Router/Firewall configuration:
root@RT01> show configuration interfaces ae0
aggregated-ether-options {
    lacp {
        active;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members all;
        }
    }
}
root@RT01> show configuration protocols rstp
bridge-priority 16k;
interface ge-0/0/4 {
    mode point-to-point;
    no-root-port;
}
interface ge-0/0/5 {
    mode point-to-point;
}
interface ae0 {
    mode point-to-point;
}
ToR Switch configuration:
root@TOR-SW01> show configuration interfaces ae0
traceoptions {
    flag all;
}
aggregated-ether-options {
    lacp {
        active;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members all;
        }
    }
}
root@TOR-SW01> show configuration protocols rstp
bridge-priority 8k;
interface xe-0/1/2 {
    mode point-to-point;
}
interface xe-0/1/3 {
    mode point-to-point;
}
interface ae0 {
    mode point-to-point;
}
Useful Logs:
Logs from (Router/Firewall) - Normal behavior:
root@RT01> show lacp interfaces ae0
Aggregated interface: ae0
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
ge-0/0/0 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/0 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/1 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/1 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/2 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/2 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/3 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/3 Partner No No Yes Yes Yes Yes Fast Active
LACP protocol: Receive State Transmit State Mux State
ge-0/0/0 Current Fast periodic Collecting distributing
ge-0/0/1 Current Fast periodic Collecting distributing
ge-0/0/2 Current Fast periodic Collecting distributing
ge-0/0/3 Current Fast periodic Collecting distributing
Logs from (Router/Firewall) - When Issue Starts:
root@RT01> show log messages
Apr 4 12:24:07 RT01 l2cpd[2018]: ROOT_PORT: for Instance 0 in routing-instance default Interface ge-0/0/5.0
Apr 4 12:24:07 RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in routing-instance default generated on port ge-0/0/5.0
Apr 4 12:24:12 RT01 l2cpd[2018]: ROOT_PORT: for Instance 0 in routing-instance default Interface ae0.0
Apr 4 12:24:12 RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in routing-instance default generated on port ae0.0
Apr 4 12:24:18 RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in routing-instance default received on port ae0.0
Apr 4 12:24:20 RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in routing-instance default received on port ae0.0
Apr 4 12:24:27 RT01 l2cpd[2018]: ROOT_PORT: for Instance 0 in routing-instance default Interface ge-0/0/5.0
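To make the root-port movement on the (Router/Firewall) easier to follow, the ROOT_PORT lines can be reduced to a short sequence. This is a minimal Python sketch, assuming the l2cpd message format is exactly as in the lines above; `root_port_changes` is a hypothetical helper name.

```python
import re

# Pattern taken from the l2cpd ROOT_PORT lines in this post (format assumed stable).
ROOT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+l2cpd\[\d+\]:\s+"
    r"ROOT_PORT: for Instance (?P<inst>\d+) in routing-instance (?P<ri>\S+) "
    r"Interface (?P<ifl>\S+)\s*$"
)


def root_port_changes(lines):
    """Return (timestamp, interface) for each ROOT_PORT line, in log order."""
    changes = []
    for line in lines:
        m = ROOT_RE.match(line.strip())
        if m:
            changes.append((m.group("ts"), m.group("ifl")))
    return changes
```

Applied to the excerpt above, this yields ge-0/0/5.0 at 12:24:07, ae0.0 at 12:24:12, and ge-0/0/5.0 again at 12:24:27, so the root port leaves ae0 and returns to it twice in roughly twenty seconds.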
Logs from (ToR Switch) - When Issue Starts:
root@TOR-SW01> show lacp interfaces ae0
Aggregated interface: ae0
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
ge-0/0/44 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/44 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/45 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/45 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/46 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/46 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/47 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/47 Partner No No Yes Yes Yes Yes Fast Active
LACP protocol: Receive State Transmit State Mux State
ge-0/0/44 Current Fast periodic Collecting distributing
ge-0/0/45 Current Fast periodic Collecting distributing
ge-0/0/46 Current Fast periodic Collecting distributing
ge-0/0/47 Current Fast periodic Collecting distributing
(...)
Aggregated interface: ae0
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
ge-0/0/44 Actor Yes No No No Yes Yes Fast Active
ge-0/0/44 Partner No No Yes Yes No Yes Fast Active
ge-0/0/45 Actor Yes No No No Yes Yes Fast Active
ge-0/0/45 Partner No No Yes Yes No Yes Fast Active
ge-0/0/46 Actor Yes No No No Yes Yes Fast Active
ge-0/0/46 Partner No No Yes Yes No Yes Fast Active
ge-0/0/47 Actor Yes No No No Yes Yes Fast Active
ge-0/0/47 Partner No No Yes Yes No Yes Fast Active
LACP protocol: Receive State Transmit State Mux State
ge-0/0/44 Expired Fast periodic Attached
ge-0/0/45 Expired Fast periodic Attached
ge-0/0/46 Expired Fast periodic Attached
ge-0/0/47 Expired Fast periodic Attached
(...)
Aggregated interface: ae0
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
ge-0/0/44 Actor No Yes No No No Yes Fast Active
ge-0/0/44 Partner No Yes No No No Yes Fast Passive
ge-0/0/45 Actor No Yes No No No Yes Fast Active
ge-0/0/45 Partner No Yes No No No Yes Fast Passive
ge-0/0/46 Actor No Yes No No No Yes Fast Active
ge-0/0/46 Partner No Yes No No No Yes Fast Passive
ge-0/0/47 Actor No Yes No No No Yes Fast Active
ge-0/0/47 Partner No Yes No No No Yes Fast Passive
LACP protocol: Receive State Transmit State Mux State
ge-0/0/44 Defaulted Fast periodic Detached
ge-0/0/45 Defaulted Fast periodic Detached
ge-0/0/46 Defaulted Fast periodic Detached
ge-0/0/47 Defaulted Fast periodic Detached
(...)
Aggregated interface: ae0
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
ge-0/0/44 Actor No No No No No Yes Fast Active
ge-0/0/44 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/45 Actor No No No No No Yes Fast Active
ge-0/0/45 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/46 Actor No No No No No Yes Fast Active
ge-0/0/46 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/47 Actor No No No No No Yes Fast Active
ge-0/0/47 Partner No No Yes Yes Yes Yes Fast Active
LACP protocol: Receive State Transmit State Mux State
ge-0/0/44 Current Fast periodic Waiting
ge-0/0/45 Current Fast periodic Waiting
ge-0/0/46 Current Fast periodic Waiting
ge-0/0/47 Current Fast periodic Waiting
(...)
Aggregated interface: ae0
LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity
ge-0/0/44 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/44 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/45 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/45 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/46 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/46 Partner No No Yes Yes Yes Yes Fast Active
ge-0/0/47 Actor No No Yes Yes Yes Yes Fast Active
ge-0/0/47 Partner No No Yes Yes Yes Yes Fast Active
LACP protocol: Receive State Transmit State Mux State
ge-0/0/44 Current Fast periodic Collecting distributing
ge-0/0/45 Current Fast periodic Collecting distributing
ge-0/0/46 Current Fast periodic Collecting distributing
ge-0/0/47 Current Fast periodic Collecting distributing
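The snapshots above were captured by hand. For repeated captures, a small Python sketch like the following could flag any member link that is not in the healthy state. It assumes the "LACP protocol" rows always have a one-word receive state, a two-word transmit state (such as "Fast periodic"), and the remainder as the mux state, which holds for every row shown here; `parse_lacp_protocol` and `degraded_members` are hypothetical names.

```python
def parse_lacp_protocol(text):
    """Return (member, receive_state, transmit_state, mux_state) for each protocol row."""
    rows = []
    in_protocol = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("LACP protocol:"):
            # Rows after this header describe the receive/transmit/mux machines.
            in_protocol = True
            continue
        if not line or line.startswith(("Aggregated interface:", "LACP state:", "(")):
            # A new snapshot, the flag table, or an elision marker ends the section.
            in_protocol = False
            continue
        if in_protocol:
            tok = line.split()
            if len(tok) >= 5:
                rows.append((tok[0], tok[1], " ".join(tok[2:4]), " ".join(tok[4:])))
    return rows


def degraded_members(text):
    """Return rows whose receive state is not Current or whose mux state is not forwarding."""
    return [
        row for row in parse_lacp_protocol(text)
        if row[1] != "Current" or row[3] != "Collecting distributing"
    ]
```

Run against the five snapshots above, the first and last return no degraded members, while the three in between list all four links (Expired/Attached, then Defaulted/Detached, then Current/Waiting).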
Logs from (ToR Switch) - Spanning Tree:
root@TOR-SW01> show log rstp
Apr 4 12:24:24.173413 BDSM: Port ae0.0: Bridge Detection State Machine Called with Event: PORT_DISABLED, State: NOT_EDGE
Apr 4 12:24:24.174609 BDSM: Port ae0.0: Moved to state NOT OPER EDGE
Apr 4 12:24:24.174653 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: OPEREDGE_RESET, State: ACTIVE
Apr 4 12:24:24.174682 TCSM: Port ae0.0: No Operations to perform
Apr 4 12:24:24.174748 PISM: Port ae0.0: Port Info State Machine Called with Event: PORT_DISABLED, State: CURRENT
Apr 4 12:24:24.174783 PISM: Port ae0.0: Moving to state DISABLED
Apr 4 12:24:24.174812 PISM: Port ae0.0: Moved to state DISABLED
Apr 4 12:24:24.186004 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: NOT_DESG_ROOT, State: ACTIVE
Apr 4 12:24:24.186094 TCSM: Port ae0.0: Moved to state LEARNING
Apr 4 12:24:24.186129 TCSM: Port ae0.0 Role is NOT ROOT/DESIGNATED; Changing to INACTIVE state
Apr 4 12:24:24.186158 TCSM: Port ae0.0: Moved to state INACTIVE
Apr 4 12:24:24.192122 TCSM: Learnt Entries on Port ae0.0 have been flushed!
Apr 4 12:24:24.192423 MSG: Management Disabling of Port 3 Success
Apr 4 12:24:29.552531 PISM: Port ae0.0: Port Info State Machine Called with Event: PORT_ENABLED, State: DISABLED
Apr 4 12:24:29.552658 PISM: Port ae0.0: Moved to state AGED
Apr 4 12:24:29.552779 PISM: Port ae0.0: Port Info State Machine Called with Event: UPDATE_INFO, State: AGED
Apr 4 12:24:29.552820 PISM: Port ae0.0: UPDATING port info
Apr 4 12:24:29.552865 PISM: Port ae0.0: Moved to state UPDATE
Apr 4 12:24:29.552968 PISM: Port ae0.0: Moved to state CURRENT
Apr 4 12:24:29.553278 MSG: Management Enabling of Port 3 Success
Apr 4 12:24:32.562537 TMR: Port ae0.0: EDGEDELAYWHILE Timer EXPIRED forInstance: 0
Apr 4 12:24:32.562673 BDSM: Port ae0.0: Bridge Detection State Machine Called with Event: EDGEDELAYWHILE_EXP, State: NOT_EDGE
Apr 4 12:24:32.562711 BDSM: Port ae0.0: Moved to state OPER EDGE
Apr 4 12:24:32.562802 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: LEARN_SET, State: INACTIVE
Apr 4 12:24:32.562844 TCSM: Port ae0.0: Moved to state LEARNING
Apr 4 12:24:32.563595 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: FORWARD, State: LEARNING
Apr 4 12:24:32.563683 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: OPEREDGE_SET, State: LEARNING
Apr 4 12:24:32.563719 TCSM: Port ae0.0: No Operations to perform
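To quantify how long spanning tree considered ae0.0 down on the (ToR Switch), the PORT_DISABLED and PORT_ENABLED events in this trace can be paired up. The following is a minimal Python sketch, assuming the trace line format shown above and that a single outage does not cross midnight (the timestamps carry no year, so only the time of day is compared); `disabled_intervals` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Matches state-machine lines that carry a PORT_DISABLED or PORT_ENABLED event.
EVENT_RE = re.compile(
    r"^\w{3}\s+\d+\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\w+: Port (?P<port>\S+): "
    r".*Event: (?P<event>PORT_DISABLED|PORT_ENABLED),"
)


def disabled_intervals(lines, port):
    """Return seconds from the first PORT_DISABLED to the next PORT_ENABLED for `port`."""
    intervals = []
    down_at = None
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m or m.group("port") != port:
            continue
        t = datetime.strptime(m.group("time"), "%H:%M:%S.%f")
        if m.group("event") == "PORT_DISABLED":
            # Several state machines log the same disable; keep the earliest one.
            if down_at is None:
                down_at = t
        elif down_at is not None:
            intervals.append((t - down_at).total_seconds())
            down_at = None
    return intervals
```

For the trace above, `disabled_intervals(lines, "ae0.0")` pairs 12:24:24.173413 with 12:24:29.552531, an interval of about 5.4 seconds.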