Score:1

How to troubleshoot LACP flapping interface triggered by layer 3 IP Address. (Juniper)

I am currently working on solving a network issue that has led to network outages across my organization. (I have included a picture of our simple topology below.) Our network consists of a Router/Firewall [SRX340], 2x Access Switches [EX2300], and 1x Top-of-Rack Switch (ToR) [EX2300].

(Sorry, I couldn't embed the image directly in the post as I do not have enough reputation to do so, but you can find the image here: https://i.stack.imgur.com/ldX23.png)

  • The links between all the switches are trunk for VLANs (4,10,20...)

  • Our topology has GLOBAL RSTP (rapid-spanning tree) enabled on all devices. The Root Bridge is the (ToR Switch) and after convergence, the blocked ports are ge-0/0/4 and ge-0/0/5 on the (Router/Firewall).

The problem:

We have been facing issues on our ae0 interface, which connects the (Router/Firewall) to our (ToR Switch). Based on our logs, LACP starts to flap when a new device with a Layer-3 IP address is connected to any of the (ToR Switch) access ports, but only for VLAN 10.

It is worth noting that the issue is resolved more quickly if I unplug the offending device from the (ToR Switch); otherwise, it resolves by itself a couple of minutes after the incident starts.

root@TOR-SW01> show log messages
Apr  4 12:24:04  TOR-SW01 mib2d[17124]: SNMP_TRAP_LINK_DOWN: ifIndex 610, ifAdminStatus up(1), ifOperStatus down(2), ifName ae0
Apr  4 12:24:06  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/44: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/44) unknown boolean option 112
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]:   ifd 705; Ether boolean set error (22)
Apr  4 12:24:06  TOR-SW01 fpc0 ETH: ifd (ge-0/0/44) unknown boolean option 112
Apr  4 12:24:06  TOR-SW01 fpc0 IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr  4 12:24:06  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/45: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/45) unknown boolean option 112
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]:   ifd 706; Ether boolean set error (22)
Apr  4 12:24:06  TOR-SW01 fpc0   ifd 705; Ether boolean set error (22)
Apr  4 12:24:06  TOR-SW01 fpc0 ETH: ifd (ge-0/0/45) unknown boolean option 112
Apr  4 12:24:07  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/46: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
Apr  4 12:24:07  TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/46) unknown boolean option 112
Apr  4 12:24:07  TOR-SW01 dc-pfe[16887]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr  4 12:24:07  TOR-SW01 dc-pfe[16887]:   ifd 707; Ether boolean set error (22)
Apr  4 12:24:07  TOR-SW01 fpc0 IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr  4 12:24:07  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/47: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|

As mentioned above, this issue can only be replicated if a device, connected to any access port on the (ToR Switch), gets an IP Address either through DHCP or Manually for VLAN 10, leading to an entire network outage as people cannot reach the router. For any other VLAN, this issue does not occur. Furthermore, if a device is connected to any (Access Switch) port and obtains an IP Address for VLAN 10, nothing will be triggered, and no problem occurs.

What Have I Done So Far?

  • I have already tried manually removing (unplugging) both interfaces ge-0/0/4 and ge-0/0/5 on the (Router/firewall), to minimize any network loops, yet the problem persists.

  • I tried updating all of our Juniper Devices to the latest version as of the time of writing: (22.4R1).

  • I have also cleared the MAC Address table from all (Access Switches), as well as ARP on the (Router/Firewall).

  • I tried rebooting our (Router/Firewall) as well as all switches, yet the issue persists.

My Questions are:

  • Why does my ae0 link (LACP) go down when a layer-3 device gets an IP address, and only when that device is connected to a (ToR Switch) access port (RSTP edge)?
  • How can I tell if this is a layer-3 issue or a layer-2 issue?
  • Why does the issue only occur on the ToR Switch?
  • What else can I do to help troubleshoot this issue?

Useful Configuration:

Router/Firewall configuration:

root@RT01> show configuration interfaces ae0
aggregated-ether-options {
    lacp {
        active;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members all;
        }
    }
}
root@RT01> show configuration protocols rstp
bridge-priority 16k;
interface ge-0/0/4 {
    mode point-to-point;
    no-root-port;
}
interface ge-0/0/5 {
    mode point-to-point;
}
interface ae0 {
    mode point-to-point;
}

ToR Switch configuration:

root@TOR-SW01> show configuration interfaces ae0
traceoptions {
    flag all;
}
aggregated-ether-options {
    lacp {
        active;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members all;
        }
    }
}
root@TOR-SW01> show configuration protocols rstp
bridge-priority 8k;
interface xe-0/1/2 {
    mode point-to-point;
}
interface xe-0/1/3 {
    mode point-to-point;
}
interface ae0 {
    mode point-to-point;
}

Useful Logs:

Logs from (Router/Firewall) - Normal behavior:

root@RT01> show lacp interfaces ae0
Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/0       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/0     Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/1       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/1     Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/2       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/2     Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/3       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/3     Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/0                  Current   Fast periodic Collecting distributing
      ge-0/0/1                  Current   Fast periodic Collecting distributing
      ge-0/0/2                  Current   Fast periodic Collecting distributing
      ge-0/0/3                  Current   Fast periodic Collecting distributing

Logs from (Router/Firewall) - When Issue Starts:

root@RT01> show log messages
Apr  4 12:24:07  RT01 l2cpd[2018]: ROOT_PORT: for Instance 0 in  routing-instance default Interface ge-0/0/5.0
Apr  4 12:24:07  RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in  routing-instance default generated on port ge-0/0/5.0
Apr  4 12:24:12  RT01 l2cpd[2018]: ROOT_PORT: for Instance 0 in  routing-instance default Interface ae0.0
Apr  4 12:24:12  RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in  routing-instance default generated on port ae0.0
Apr  4 12:24:18  RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in  routing-instance default received on port ae0.0
Apr  4 12:24:20  RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in  routing-instance default received on port ae0.0
Apr  4 12:24:27  RT01 l2cpd[2018]: ROOT_PORT: for Instance 0 in  routing-instance default Interface ge-0/0/5.0

Logs from (ToR Switch) - When Issue Starts:

root@TOR-SW01> show lacp interfaces ae0
Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/44      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/44    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/45      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/45    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/46      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/46    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/47      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/47    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/44                 Current   Fast periodic Collecting distributing
      ge-0/0/45                 Current   Fast periodic Collecting distributing
      ge-0/0/46                 Current   Fast periodic Collecting distributing
      ge-0/0/47                 Current   Fast periodic Collecting distributing


(...)

Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/44      Actor   Yes    No    No   No  Yes   Yes     Fast    Active
      ge-0/0/44    Partner    No    No   Yes  Yes   No   Yes     Fast    Active
      ge-0/0/45      Actor   Yes    No    No   No  Yes   Yes     Fast    Active
      ge-0/0/45    Partner    No    No   Yes  Yes   No   Yes     Fast    Active
      ge-0/0/46      Actor   Yes    No    No   No  Yes   Yes     Fast    Active
      ge-0/0/46    Partner    No    No   Yes  Yes   No   Yes     Fast    Active
      ge-0/0/47      Actor   Yes    No    No   No  Yes   Yes     Fast    Active
      ge-0/0/47    Partner    No    No   Yes  Yes   No   Yes     Fast    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/44                 Expired   Fast periodic           Attached
      ge-0/0/45                 Expired   Fast periodic           Attached
      ge-0/0/46                 Expired   Fast periodic           Attached
      ge-0/0/47                 Expired   Fast periodic           Attached

(...)

Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/44      Actor    No   Yes    No   No   No   Yes     Fast    Active
      ge-0/0/44    Partner    No   Yes    No   No   No   Yes     Fast   Passive
      ge-0/0/45      Actor    No   Yes    No   No   No   Yes     Fast    Active
      ge-0/0/45    Partner    No   Yes    No   No   No   Yes     Fast   Passive
      ge-0/0/46      Actor    No   Yes    No   No   No   Yes     Fast    Active
      ge-0/0/46    Partner    No   Yes    No   No   No   Yes     Fast   Passive
      ge-0/0/47      Actor    No   Yes    No   No   No   Yes     Fast    Active
      ge-0/0/47    Partner    No   Yes    No   No   No   Yes     Fast   Passive
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/44               Defaulted   Fast periodic           Detached
      ge-0/0/45               Defaulted   Fast periodic           Detached
      ge-0/0/46               Defaulted   Fast periodic           Detached
      ge-0/0/47               Defaulted   Fast periodic           Detached

(...)

Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/44      Actor    No    No    No   No   No   Yes     Fast    Active
      ge-0/0/44    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/45      Actor    No    No    No   No   No   Yes     Fast    Active
      ge-0/0/45    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/46      Actor    No    No    No   No   No   Yes     Fast    Active
      ge-0/0/46    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/47      Actor    No    No    No   No   No   Yes     Fast    Active
      ge-0/0/47    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/44                 Current   Fast periodic            Waiting
      ge-0/0/45                 Current   Fast periodic            Waiting
      ge-0/0/46                 Current   Fast periodic            Waiting
      ge-0/0/47                 Current   Fast periodic            Waiting

(...)

Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/44      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/44    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/45      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/45    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/46      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/46    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/47      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/47    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/44                 Current   Fast periodic Collecting distributing
      ge-0/0/45                 Current   Fast periodic Collecting distributing
      ge-0/0/46                 Current   Fast periodic Collecting distributing
      ge-0/0/47                 Current   Fast periodic Collecting distributing

Logs from (ToR Switch) - Spanning Tree:

root@TOR-SW01> show log rstp
Apr  4 12:24:24.173413 BDSM: Port ae0.0: Bridge Detection State Machine Called with Event: PORT_DISABLED, State: NOT_EDGE
Apr  4 12:24:24.174609 BDSM: Port ae0.0: Moved to state NOT OPER EDGE
Apr  4 12:24:24.174653 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: OPEREDGE_RESET, State: ACTIVE
Apr  4 12:24:24.174682 TCSM: Port ae0.0: No Operations to perform
Apr  4 12:24:24.174748 PISM: Port ae0.0: Port Info State Machine Called with Event: PORT_DISABLED, State: CURRENT
Apr  4 12:24:24.174783 PISM: Port ae0.0: Moving to state DISABLED
Apr  4 12:24:24.174812 PISM: Port ae0.0: Moved to state DISABLED
Apr  4 12:24:24.186004 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: NOT_DESG_ROOT, State: ACTIVE
Apr  4 12:24:24.186094 TCSM: Port ae0.0: Moved to state LEARNING
Apr  4 12:24:24.186129 TCSM: Port ae0.0 Role is NOT ROOT/DESIGNATED; Changing to INACTIVE state
Apr  4 12:24:24.186158 TCSM: Port ae0.0: Moved to state INACTIVE
Apr  4 12:24:24.192122 TCSM: Learnt Entries on Port ae0.0 have been flushed!
Apr  4 12:24:24.192423 MSG: Management Disabling of Port 3 Success

Apr  4 12:24:29.552531 PISM: Port ae0.0: Port Info State Machine Called with Event: PORT_ENABLED, State: DISABLED
Apr  4 12:24:29.552658 PISM: Port ae0.0: Moved to state AGED
Apr  4 12:24:29.552779 PISM: Port ae0.0: Port Info State Machine Called with Event: UPDATE_INFO, State: AGED
Apr  4 12:24:29.552820 PISM: Port ae0.0: UPDATING port info
Apr  4 12:24:29.552865 PISM: Port ae0.0: Moved to state UPDATE
Apr  4 12:24:29.552968 PISM: Port ae0.0: Moved to state CURRENT
Apr  4 12:24:29.553278 MSG: Management Enabling of Port 3 Success

Apr  4 12:24:32.562537 TMR: Port ae0.0: EDGEDELAYWHILE Timer EXPIRED forInstance: 0
Apr  4 12:24:32.562673 BDSM: Port ae0.0: Bridge Detection State Machine Called with Event: EDGEDELAYWHILE_EXP, State: NOT_EDGE
Apr  4 12:24:32.562711 BDSM: Port ae0.0: Moved to state OPER EDGE
Apr  4 12:24:32.562802 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: LEARN_SET, State: INACTIVE
Apr  4 12:24:32.562844 TCSM: Port ae0.0: Moved to state LEARNING
Apr  4 12:24:32.563595 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: FORWARD, State: LEARNING
Apr  4 12:24:32.563683 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: OPEREDGE_SET, State: LEARNING
Apr  4 12:24:32.563719 TCSM: Port ae0.0: No Operations to perform
Score:1

In brief: the combination of a specific Junos version and the EX2300 is the problem.

My Answers are:

  1. The EX2300's CPU can be temporarily overloaded if security features that process client IP traffic (DHCP snooping, DAI, IPSG) are configured. High CPU usage can cause problems in any OS process, whether lacpd, OSPF, or SNMP. Remember that CPU time is SHARED among all system processes. The Juniper EX2200/EX2300 series are notorious for their limited resources (CPU, RAM, and TCAM sizes), and the EX2300 has a modest single-core ARM Cortex-A9 CPU; to see this, run the CLI command show system boot-messages. If you carefully read the "Aggregated Ethernet Interfaces" section of the "Ethernet Interfaces User Guide for Routing Devices", you will see:

Note: On EX2300 and EX3400 switches, the LACP protocol must be configured with a periodic SLOW timer to prevent flaps during CPU intensive operations events such as routing engine switchover, interface flaps, and exhaustive data collection from the packet forwarding engine.

  2. You can only tell experimentally whether it is a layer-2 or a layer-3 problem. Try adding or removing layer-3 headers from the Ethernet frames (disable IPv4/IPv6 on the connected device, or use a packet generator such as scapy to generate test traffic), or simply try deactivating suspicious Junos features such as DHCP snooping, DAI, and IPSG in a lab environment or during LAN maintenance windows.

  3. It occurs on the ToR switch because it differs from the other switches, either in configuration or in Junos version.

  4. First try set interfaces <ae interface> aggregated-ether-options lacp periodic slow. Then you can try 18.3R3-S4 or another software release (believe me, the latest Junos release is not always the best). As a last resort, you can make the aggregated Ethernet interfaces static (a static LAG instead of LACP).

This is our story...

We had upgraded our EX2300 switches from 18.3R3-S4 to 21.4R3-S3.4, and our monitoring system started reporting STP topology changes. (18.3R3-S4 is quite stable, by the way.)

Our investigation showed spikes of CPU utilization whenever the control plane (Junos running on the integrated CPU) received errors from the data plane (the PFE software running on the Broadcom switch chip, aka the ASIC):

dc-pfe[17947]: ETH: ifd (xe-0/1/0) unknown boolean option 112
dc-pfe[17947]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
dc-pfe[17947]:   ifd 697; Ether boolean set error (22)
dc-pfe[17947]: ETH: ifd (xe-0/1/1) unknown boolean option 112
dc-pfe[17947]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
dc-pfe[17947]:   ifd 698; Ether boolean set error (22)
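
One way to check whether these dc-pfe errors line up in time with LACP state changes is to group the syslog lines by timestamp. The following is only a minimal sketch: the function name correlate and the regular expressions are illustrative assumptions, and the sample lines are copied from the question's show log messages output with the trailing port-state fields trimmed. Matching timestamps show that the two events coincide, not which one causes the other.

```python
import re
from collections import defaultdict

# Sample lines taken from the question's log (port-state fields trimmed).
SAMPLE_LOG = """\
Apr  4 12:24:04  TOR-SW01 mib2d[17124]: SNMP_TRAP_LINK_DOWN: ifIndex 610, ifAdminStatus up(1), ifOperStatus down(2), ifName ae0
Apr  4 12:24:06  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/44: Lacp state changed from ATTACHED to DETACHED
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/44) unknown boolean option 112
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr  4 12:24:06  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/45: Lacp state changed from ATTACHED to DETACHED
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/45) unknown boolean option 112
Apr  4 12:24:07  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/46: Lacp state changed from ATTACHED to DETACHED
Apr  4 12:24:07  TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/46) unknown boolean option 112
"""

# Hypothetical pattern for "<date> <host> <process>[pid]: <message>" lines.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<proc>[\w-]+)(?:\[\d+\])?:?\s+(?P<msg>.*)$"
)
# Physical interface names such as ge-0/0/44 or xe-0/1/0.
MEMBER_RE = re.compile(r"[a-z]+-\d+/\d+/\d+")


def correlate(log_text):
    """Return {timestamp: {"lacp": [...], "pfe": [...]}} for every second
    in which both an LACP mux state change and a dc-pfe error were logged."""
    events = defaultdict(lambda: {"lacp": [], "pfe": []})
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        member = MEMBER_RE.search(m["msg"])
        if member is None:
            continue
        ts = " ".join(m["ts"].split())  # collapse the double space after the month
        if m["proc"] == "lacpd" and "LACP_INTF_MUX_STATE_CHANGED" in m["msg"]:
            events[ts]["lacp"].append(member.group())
        elif m["proc"] == "dc-pfe" and "unknown boolean option" in m["msg"]:
            events[ts]["pfe"].append(member.group())
    return {ts: e for ts, e in events.items() if e["lacp"] and e["pfe"]}


if __name__ == "__main__":
    for ts, e in correlate(SAMPLE_LOG).items():
        print(ts, "LACP:", e["lacp"], "PFE errors:", e["pfe"])
```

Feeding it a full saved copy of the messages file instead of SAMPLE_LOG should work the same way, provided the lines follow the format shown above.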

The EX2300 is unable to transmit LACPDUs in time when this happens. Either some Junos process overloads the CPU so that lacpd and other processes are put in a WAIT state, or the PFE temporarily drops packets.

Either way, the device at the other end of the link stops receiving LACPDUs on schedule, and that causes the AE interface to flap, because LACP must send and receive LACPDUs at fixed time intervals.
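
To put rough numbers on this: IEEE 802.1AX defines a fast periodic interval of 1 second with a 3 second timeout, and a slow periodic interval of 30 seconds with a 90 second timeout. The sketch below is a deliberately simplified model (the function name receiver_expires is made up for illustration, and it only looks at the longest gap between two LACPDUs seen by the receiver), but it shows why a stall of a few seconds is enough to break a fast-timer bundle.

```python
# LACP timer constants from IEEE 802.1AX, in seconds.
FAST_PERIODIC_TIME = 1
SLOW_PERIODIC_TIME = 30
SHORT_TIMEOUT_TIME = 3 * FAST_PERIODIC_TIME   # 3 s
LONG_TIMEOUT_TIME = 3 * SLOW_PERIODIC_TIME    # 90 s


def receiver_expires(gap_seconds, periodic="fast"):
    """Simplified model: the receiver's partner information expires when
    no LACPDU arrives for the whole timeout that matches the timer mode."""
    if periodic == "fast":
        timeout = SHORT_TIMEOUT_TIME
    elif periodic == "slow":
        timeout = LONG_TIMEOUT_TIME
    else:
        raise ValueError("periodic must be 'fast' or 'slow'")
    return gap_seconds >= timeout


if __name__ == "__main__":
    # A hypothetical 5-second stall on the sending switch.
    print(receiver_expires(5, "fast"))  # True: longer than the 3 s timeout
    print(receiver_expires(5, "slow"))  # False: well inside the 90 s timeout
```

This matches the "Fast" timeout shown in the question's show lacp interfaces output, where the receive state goes to Expired shortly after the trouble starts.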

If the AE interface is an RSTP/MSTP non-edge port, an STP topology change occurs, which requires flushing MAC addresses from non-edge ports. That in turn requires Junos to change the state of the PFE under the hood (reprogram the ASIC CAM/TCAM hardware tables, relearn MAC addresses, and so on). The disruption caused by the STP topology change then ripples across the whole LAN...

So we understood that the LACP timeout and STP problems were related to high CPU usage and can be mitigated by configuring the EX2300 like this:

 set interfaces <interface> aggregated-ether-options lacp periodic slow 

And do not forget to do this at both ends of the link: https://supportportal.juniper.net/s/article/EX-How-transmit-rate-LACP-Interval-is-negotiated-between-Actor-and-Partner?language=en_US

mangohost
