This is part one of a three-part series about FabricPath. The slides are from a presentation I gave to my colleagues. The information comes from a lot of Googling, day-to-day practice and this very good piece of documentation: “Nexus 7000 FabricPath White Paper Version 2.0”.
- Part I: FabricPath – The basics (this post).
- Part II: FabricPath – Forwarding example.
- Part III: FabricPath – Advanced topics.
Loop prevention is the most challenging part of Ethernet. For decades we have been running Spanning-Tree as the loop prevention protocol. More recently other loop prevention protocols have emerged: Transparent Interconnection of Lots of Links (TRILL) and Shortest Path Bridging (SPB). Cisco FabricPath is a proprietary implementation based on TRILL. It replaces Spanning-Tree in the Ethernet network core.
A FabricPath network is a routed network: it routes Ethernet frames through the network based on switch IDs (more on switch IDs later). Unlike Spanning-Tree, FabricPath does not block any links in the network; because of its routed nature, blocking ports is not necessary. FabricPath brings advantages of the layer 3 world into layer 2 networks. For example, it can load share over multiple equal cost paths (ECMP), with support for up to 16 paths. FabricPath also introduces a TTL field in the FabricPath header and uses Reverse Path Forwarding (RPF) checks. Together these make loops within the FabricPath domain very unlikely (note: loops that exist in the classical Ethernet part of the network can still be carried across a FabricPath network, see the Spanning-Tree sheets). As in layer 3, you can play with metrics to steer traffic within the FabricPath domain. FabricPath converges extremely fast; I have seen convergence times far under 300 ms. Because of its plug and play nature it is easy to configure, which keeps the chance of errors small. Also very important: FabricPath enables a flat network architecture, which is better suited to modern workloads in data center networks (a flat network handles increasing east-west loads more efficiently).
The figure shows a flat network topology. Switches in a flat network are called spine and leaf switches (they are also called core and edge switches, depending on the topology and which document you read). In this presentation I will use the core and edge terminology. The one and only function of the core switches is to interconnect the edge switches. The edge switches connect the data center servers to the network. You can build any network topology with FabricPath, a flat network is not mandatory in a FabricPath enabled network.
All switches run the FabricPath protocol. The FabricPath protocol runs between the interfaces that are configured as FabricPath interfaces (switchport mode fabricpath). These interfaces are called FabricPath core ports. No Spanning-Tree runs inside the FabricPath domain. Edge switches connect the servers to the data center network; the ports facing the servers are called Classical Ethernet (CE) edge ports. Edge ports are not aware of FabricPath and connect the Classical Ethernet world to the FabricPath network. CE ports run Spanning-Tree.
A VLAN created on a FabricPath enabled switch is in CE mode by default. VLANs in CE mode are not transported over the FabricPath domain; a VLAN must be configured as a FabricPath VLAN to be transported. Traffic from a CE port is therefore only carried over the FabricPath domain if the port is assigned to a FabricPath VLAN. Every FabricPath VLAN is by default transported between the FabricPath core ports.
In the example, server B connects via a tagged link to a CE port on S200. The VLANs server B uses are configured as FabricPath VLANs on S200. Therefore the VLANs are automatically transported over the FabricPath domain. The CE frame is always forwarded over the FabricPath domain with the original 802.1Q tag included.
The FabricPath IS-IS process runs in the control plane and determines the FabricPath forwarding topology. FabricPath IS-IS is based on the standard ISO/IEC 10589 specification. The FabricPath IS-IS process is completely separate from any OSI layer 3 IS-IS routing process that may be running on the same switch. Standard IS-IS Type-Length-Value (TLV) fields are used to exchange FabricPath-specific information such as the FabricPath switch IDs. FabricPath IS-IS enables “routing” of frames at OSI layer 2. Frames are routed based on switch IDs.
Switch ID routing tables are calculated by the shortest path first (SPF) algorithm and the best paths are installed in the routing table. Multiple equal cost paths to the same destination are installed in the routing table (up to 16). This enables FabricPath IS-IS ECMP.
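To make the SPF and ECMP idea a bit more concrete, here is a minimal Python sketch. It is not how FabricPath IS-IS is implemented; it is a plain Dijkstra run that keeps every equal cost first hop. The function name, the topology and the metric values are all assumptions made up for illustration.

```python
import heapq
from collections import defaultdict

def ecmp_next_hops(links, source, max_paths=16):
    """Return {destination switch ID: sorted list of equal cost next hops}.

    `links` is a list of (switch_a, switch_b, metric) tuples describing
    bidirectional core links. Topology and metrics are illustrative only.
    """
    graph = defaultdict(list)
    for a, b, metric in links:
        graph[a].append((b, metric))
        graph[b].append((a, metric))

    dist = {source: 0}
    next_hops = {source: set()}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > dist[node]:
            continue  # stale queue entry, a better path was already found
        for neighbor, metric in graph[node]:
            new_cost = cost + metric
            # Leaving the source, the first hop is the neighbor itself;
            # further away, inherit the first hops used to reach `node`.
            hops = {neighbor} if node == source else next_hops[node]
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                next_hops[neighbor] = set(hops)
                heapq.heappush(queue, (new_cost, neighbor))
            elif new_cost == dist[neighbor]:
                next_hops[neighbor] |= hops  # equal cost path: keep both
    return {d: sorted(h)[:max_paths] for d, h in next_hops.items() if d != source}

# Hypothetical fabric: two edge switches, each connected to two core switches.
links = [(1001, 101, 20), (1001, 102, 20), (2003, 101, 20), (2003, 102, 20)]
print(ecmp_next_hops(links, 1001))
```

In this toy fabric, edge switch 1001 ends up with two equal cost next hops (101 and 102) towards switch 2003, which is the situation ECMP load sharing relies on.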
The switch ID (SID) is the most important identifier in a FabricPath domain. FabricPath routing decisions are based on the SID. Every switch in the FabricPath domain must be uniquely addressable by its FabricPath switch ID. The Dynamic Resource Allocation Protocol (DRAP) allocates switch IDs and forwarding tags (FTAGs); more on FTAGs later. By default the switch ID is selected randomly. The negotiation process executed by DRAP, after the IS-IS adjacency setup, ensures that every switch in the FabricPath domain is assigned a unique switch ID and ensures FTAG consistency. Only switches that have a unique FabricPath switch ID will start data plane forwarding on their FabricPath interfaces. If a conflict is detected, the DRAP process selects a different random switch ID. Switch IDs can also be manually configured (which is considered best practice). In case of a clash between manually configured switch IDs, the switches that are in conflict will stop data plane forwarding on their FabricPath interfaces.
When a Classical Ethernet frame is sent into the FabricPath domain through a FabricPath core port, it is encapsulated in a FabricPath frame. A 16-byte FabricPath header is prepended and a new CRC is calculated. The FabricPath header is called the outer MAC (OMAC). The encapsulated MAC of the CE frame is called the inner MAC (iMAC). The DA and SA fields of the encapsulated frame are called the internal DA (iDA) and the internal SA (iSA).
The FabricPath address fields are encoded into the 48-bit outer MAC address fields. The outer destination address (ODA) and the outer source address (OSA) both consist of the switch ID, the sub-switch ID and the port ID. Standard U/L (Universal/Local) and I/G (Individual/Group) bits are present in the MAC address.
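The allocation behavior can be illustrated with a small toy model in Python. This is not the DRAP protocol itself (there is no message exchange here); it only mimics the outcome described above: random IDs are re-picked on conflict, while clashing manually configured IDs stay in conflict. The function name, the switch names and the ID range of 1 to 4094 are assumptions.

```python
import random

def resolve_switch_ids(static_ids, dynamic_switches, rng=None, id_space=range(1, 4095)):
    """Toy model of switch ID allocation, not the real DRAP protocol.

    static_ids: {switch_name: manually configured switch ID}
    dynamic_switches: names of switches that pick a random switch ID
    Returns (assigned, suspended): the switch ID per switch, and the set of
    manually configured switches whose IDs clash with each other.
    """
    rng = rng or random.Random()

    # Manually configured switches that share an ID stay in conflict
    # and would stop data plane forwarding.
    owners = {}
    for name, sid in static_ids.items():
        owners.setdefault(sid, []).append(name)
    suspended = {n for names in owners.values() if len(names) > 1 for n in names}

    assigned = dict(static_ids)
    used = set(static_ids.values())
    for name in dynamic_switches:
        sid = rng.choice(id_space)
        while sid in used:  # conflict detected: select a different random ID
            sid = rng.choice(id_space)
        assigned[name] = sid
        used.add(sid)
    return assigned, suspended

assigned, suspended = resolve_switch_ids(
    {"edge-a": 100, "edge-b": 100, "core-a": 200}, ["edge-c", "edge-d"]
)
print(assigned, suspended)
```

The two switches that were both manually configured with ID 100 end up in the suspended set, which is one reason to document manually assigned switch IDs carefully.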
Switch ID (SID): The 12 bit switch ID uniquely identifies every switch in the FabricPath domain. The switch ID can be used in the source (OSA) and destination address (ODA) of the frame.
Sub-switch ID (sSID): identifies the VPC+ Port-Channel at a VPC+ switch pair. It addresses a specific VPC+ Port-Channel as source and/or destination address. This behavior can be turned off at the VPC switch pair. If switching on sub-switch ID is turned off, the switch falls back on MAC switching based on the MAC table. The sub-switch ID is locally significant to a VPC+ switch pair. The sub-switch ID can be used in the source (OSA) and destination address (ODA) of the frame.
Port ID: also known as the Local Identifier (LID), it can be used to address a specific physical or logical port on a switch. This allows the egress switch to deliver the frame directly to the right switch port without consulting the MAC table. The port ID is locally significant to each switch.
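As a rough illustration of how these three fields share one address, the Python sketch below packs them into a single integer and unpacks them again. It is deliberately simplified: it ignores the U/L and I/G bits and the exact bit positions inside the 48-bit outer MAC address. Only the 12-bit switch ID width comes from the text above; the 8-bit sub-switch ID and 16-bit port ID widths, and the function names, are assumptions.

```python
SID_BITS = 12   # switch ID width, as stated in the text
SSID_BITS = 8   # sub-switch ID width (assumption)
LID_BITS = 16   # port ID / LID width (assumption)

def pack_fp_address(switch_id, sub_switch_id=0, port_id=0):
    """Pack switch ID, sub-switch ID and port ID into one integer."""
    fields = ((switch_id, SID_BITS), (sub_switch_id, SSID_BITS), (port_id, LID_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"value {value} does not fit in {bits} bits")
    return (switch_id << (SSID_BITS + LID_BITS)) | (sub_switch_id << LID_BITS) | port_id

def unpack_fp_address(value):
    """Reverse of pack_fp_address: return (switch_id, sub_switch_id, port_id)."""
    port_id = value & ((1 << LID_BITS) - 1)
    sub_switch_id = (value >> LID_BITS) & ((1 << SSID_BITS) - 1)
    switch_id = (value >> (SSID_BITS + LID_BITS)) & ((1 << SID_BITS) - 1)
    return switch_id, sub_switch_id, port_id

packed = pack_fp_address(2003, sub_switch_id=5, port_id=11)
print(hex(packed), unpack_fp_address(packed))
```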
Ethernet Type: 0x8903 identifies the Ethernet frame as FabricPath frame.
FTAG: the function of the FTAG depends on the type of frame. If the frame is a multi-destination frame, the FTAG holds the number of the Multi Destination Tree (MDT) the frame is assigned to. In a unicast frame, the FTAG holds the number of the topology the frame is assigned to. Each topology has its own MDTs. The default topology is topology 0.
TTL: default Time To Live (TTL) is 32. Each FabricPath switch decreases the TTL by one. Zero TTL frames are discarded. The default TTL can be configured. Unicast, unknown unicast and broadcast share the same configurable TTL setting. Multicast allows a different configurable TTL setting.
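The last three fields (Ethernet Type, FTAG and TTL) can be modelled in a few lines of Python. The sketch below shows the TTL behavior per hop and one possible way to pack the three fields into 32 bits. The Ethernet Type value and the default TTL come from the text above; the 10-bit FTAG and 6-bit TTL widths, the class and function names, and the choice to discard a frame as soon as its TTL reaches zero are assumptions.

```python
from dataclasses import dataclass, replace

FP_ETHERTYPE = 0x8903
DEFAULT_TTL = 32

@dataclass(frozen=True)
class FabricPathTag:
    ftag: int
    ttl: int = DEFAULT_TTL
    ethertype: int = FP_ETHERTYPE

def forward_one_hop(tag):
    """Decrease the TTL by one; return None when the frame must be discarded."""
    new_ttl = tag.ttl - 1
    if new_ttl <= 0:
        return None  # zero TTL frames are discarded
    return replace(tag, ttl=new_ttl)

def pack_tag(tag):
    """Pack Ethernet Type (16 bits), FTAG (10 bits, assumed), TTL (6 bits, assumed)."""
    if not 0 <= tag.ftag < (1 << 10) or not 0 <= tag.ttl < (1 << 6):
        raise ValueError("FTAG or TTL out of range for the assumed field widths")
    return (tag.ethertype << 16) | (tag.ftag << 6) | tag.ttl

tag = FabricPathTag(ftag=1)
print(hex(pack_tag(tag)), forward_one_hop(tag))
```

A frame caught in a loop loses one TTL count per switch, so it is dropped after a bounded number of hops instead of circulating forever.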
For unicast routing the combination of the SID, sSID and LID is used in the ODA and OSA. For multi-destination traffic this is different:
- For broadcast and multicast frames the inner destination address (iDA) is copied to the ODA of the FabricPath header.
- For unknown unicast frames a reserved MAC address of 01:0F:FF:C1:01:C0 is used in the ODA.
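The ODA rules above can be summarized in a short Python sketch. It is only an illustration of the decision, not switch code: the function names and the shape of the MAC table (destination MAC mapped to a switch ID, sub-switch ID and port ID tuple) are assumptions. The reserved unknown unicast address is the one quoted above, and the group check uses the standard I/G bit (the least significant bit of the first octet).

```python
UNKNOWN_UNICAST_ODA = "01:0F:FF:C1:01:C0"  # reserved address from the text

def is_group_address(mac):
    """True for broadcast and multicast: the I/G bit of the first octet is set."""
    return int(mac.split(":")[0], 16) & 1 == 1

def select_oda(inner_da, mac_table):
    """Decide what goes into the ODA for a frame with inner destination `inner_da`.

    mac_table maps known unicast MACs to (switch_id, sub_switch_id, port_id).
    Returns a (kind, oda) tuple.
    """
    inner_da = inner_da.upper()
    if is_group_address(inner_da):
        return ("multi-destination", inner_da)  # iDA is copied to the ODA
    if inner_da not in mac_table:
        return ("multi-destination", UNKNOWN_UNICAST_ODA)  # unknown unicast
    return ("unicast", mac_table[inner_da])

table = {"F0:92:1C:03:10:53": (2003, 0, 0)}
print(select_oda("ff:ff:ff:ff:ff:ff", table))
print(select_oda("00:11:22:33:44:55", table))
print(select_oda("f0:92:1c:03:10:53", table))
```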
Two tables play a role in FabricPath unicast frame forwarding: the MAC address table and the switch ID table, which holds the best routes to other switches in the FabricPath domain. The two tables are used as follows:
- A classical Ethernet frame enters the ingress FabricPath switch and the destination MAC address is looked up in the MAC address table. The MAC address table associates the destination MAC address with the SID of the egress FabricPath switch. After a successful lookup of the egress SID in the MAC table, the ingress switch looks up the destination SID in the switch ID table. According to the switch ID table the next hop is the core FabricPath switch.
- The core FabricPath switch switches based on the destination SID. The core switch does a lookup of the destination SID in the switch ID table and forwards the frame to the next hop interface towards the egress switch.
- The egress switch determines that it is the egress switch after a lookup in the switch ID table. Then the frame is forwarded to the egress classical Ethernet interface based on the LID in the ODA or based on the egress switch MAC address table.
The example shows these steps for a single frame:
- A classical Ethernet frame with destination MAC address f092.1c03.1053 enters switch 1001. Switch 1001 does a MAC table lookup for the destination. The MAC table points to egress switch 2003.
- A lookup in the switch ID table shows that egress switch ID 2003 is reachable via Po1 or Po2. The hashing algorithm selects Po1. The CE frame is encapsulated in a FabricPath frame and is sent off to the core switch connected to Po1.
- The core switch routes the frame based on the SID found in the ODA of the FabricPath frame. The core switch looks up SID 2003 in its switch ID table. SID 2003 can be reached via Po1. The frame is sent to switch 2003.
- The FabricPath frame enters egress switch 2003. The egress switch looks up the destination SID in the ODA of the FabricPath header. It finds out that the frame is destined to this switch.
- The frame is de-capsulated. Because the LID in the ODA was not used, the switch cannot address the egress interface directly and must do a MAC address table lookup to find the egress interface for destination f092.1c03.1053. The frame is sent out of interface e1/11.
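The steps above can be sketched in Python as three small lookup functions. The table contents follow the example (destination f092.1c03.1053, egress switch 2003, Po1/Po2, e1/11). Everything else is an assumption: the function names, the source MAC, and especially the hash, where a CRC32 over the MAC pair stands in for the real hardware hashing algorithm and may pick a different port-channel than the figure shows.

```python
import zlib

def ingress_lookup(dst_mac, src_mac, mac_table, switch_id_table):
    """MAC table lookup, then switch ID table lookup with ECMP path selection."""
    egress_sid = mac_table.get(dst_mac)
    if egress_sid is None:
        return None  # unknown unicast, handled as multi-destination traffic
    paths = switch_id_table[egress_sid]
    flow_key = f"{src_mac}>{dst_mac}".encode()
    # Stand-in hash: the same flow always maps to the same path.
    return egress_sid, paths[zlib.crc32(flow_key) % len(paths)]

def core_lookup(oda_sid, switch_id_table):
    """Core switches forward on the destination SID only, no MAC lookup."""
    return switch_id_table[oda_sid][0]

def egress_lookup(own_sid, oda_sid, dst_mac, local_mac_table):
    """Without a usable LID, the egress switch falls back to its MAC table."""
    if oda_sid != own_sid:
        raise ValueError("frame is not destined to this switch")
    return local_mac_table[dst_mac]

mac_table_1001 = {"f092.1c03.1053": 2003}
switch_id_table_1001 = {2003: ["Po1", "Po2"]}
switch_id_table_core = {2003: ["Po1"]}
local_mac_table_2003 = {"f092.1c03.1053": "e1/11"}

sid, uplink = ingress_lookup("f092.1c03.1053", "0000.0c11.2233",
                             mac_table_1001, switch_id_table_1001)
print(sid, uplink)
print(core_lookup(sid, switch_id_table_core))
print(egress_lookup(2003, sid, "f092.1c03.1053", local_mac_table_2003))
```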
FabricPath uses multi destination trees (MDTs) to forward multi destination traffic, that is unknown unicast, broadcast and multicast. The left side of the figure shows a physical network topology; the pictures on the right show two multi destination trees originating from two different roots. Currently NX-OS supports two MDTs on which multi destination traffic is forwarded.
- Tree one is used for unknown unicast, broadcast and layer 2 multicast (non IP multicast).
- Tree one and two are used to load share IP Multicast.
Every tree is addressed by a unique ID called the forwarding tag (FTAG). The FTAG field is part of the FabricPath header and is used for multi destination traffic forwarding.
NOTE: The FTAG is also used in unicast to identify different FabricPath topologies (see the FabricPath Topologies sheet). Per topology two MDT trees are supported.
This is an example of the multi destination forwarding table of tree 1 on switch 2000 (with SID 2000). Although the table is shortened, it tells the switch that multi destination packets for all other switches should be forwarded out of port Po1 (Po1 is the only interface connecting to tree 1). Every switch has a unique MDT forwarding table. If we look at the forwarding table of switch 101, it would display (assuming the LAGs of switch 101 are numbered 1 to 4):
- switch 1000 via Po1
- switch 1001 via Po2
- switch 102, 1002 via Po3
- switch 201, 202, 2000, 2001, 2002, 2003, 2004 via Po4
Reverse Path Forwarding (RPF) checks ensure that multi destination frames are never forwarded back out of the interface on which they were received; this prevents Ethernet loops from forming. You can easily see how a multi destination frame sourced by switch 2000 will travel along the tree, reaching all switches in the network.
Every switch calculates its own MDT forwarding table. The MDT root is used as a reference point in the calculation of the tree. The root of the tree should be configured on a central switch. A central switch minimizes the diameter of the tree and therefore the number of hops a multi destination frame has to travel.
First the root for MDT 1 is elected; higher values are better. The parameters for root selection are, in order of preference, the root priority, the system ID and the switch ID. After its election, the root for MDT 1 elects a root for MDT 2 based on the same parameters. Imagine there are two switches with root priority 200 and 201: the root for MDT 1 will be the switch with root priority 201 and the root for MDT 2 will be the switch with root priority 200 (assuming the other switches are left at the default root priority of 64).
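To show where such a per-switch table comes from, here is a minimal Python sketch that derives it from one loop-free tree. The tree below is a reduced, hypothetical version of the figure (only switches 101, 102, 1000, 1001 and 1002), and the function name and port numbering are assumptions. For every local port the sketch simply collects all switches that sit behind that port in the tree.

```python
def mdt_forwarding_table(tree, switch):
    """Return {local port: sorted list of switches reachable via that port}.

    `tree` maps each switch to {port: neighbor} and must describe one
    loop-free multi destination tree.
    """
    table = {}
    for port, neighbor in tree[switch].items():
        reachable, stack, seen = [], [neighbor], {switch}
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # do not walk back towards where we came from
            seen.add(node)
            reachable.append(node)
            stack.extend(tree[node].values())
        table[port] = sorted(reachable)
    return table

# Hypothetical reduced tree rooted at switch 101.
tree = {
    101: {"Po1": 1000, "Po2": 1001, "Po3": 102},
    1000: {"Po1": 101},
    1001: {"Po1": 101},
    102: {"Po1": 101, "Po2": 1002},
    1002: {"Po1": 102},
}
print(mdt_forwarding_table(tree, 101))
print(mdt_forwarding_table(tree, 1002))
```

Switch 101 gets one entry per branch of the tree, while leaf switch 1002 reaches every other switch through its single uplink, just like switch 2000 in the example.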
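The election order can be expressed as a simple sort in Python. This is only a sketch of the comparison described above, not the IS-IS implementation; the class and function names and the system ID values are made up, and system IDs are represented as plain integers for easy comparison.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    # Field order equals the order of preference; higher values win.
    root_priority: int
    system_id: int
    switch_id: int

def elect_mdt_roots(candidates):
    """Return (switch ID of the MDT 1 root, switch ID of the MDT 2 root)."""
    ranked = sorted(candidates, reverse=True)  # tuples compare field by field
    return ranked[0].switch_id, ranked[1].switch_id

candidates = [
    Candidate(root_priority=64, system_id=0x0001, switch_id=1000),
    Candidate(root_priority=201, system_id=0x0002, switch_id=101),
    Candidate(root_priority=200, system_id=0x0003, switch_id=102),
    Candidate(root_priority=64, system_id=0x0004, switch_id=2000),
]
print(elect_mdt_roots(candidates))
```

With these values switch 101 (priority 201) becomes the root of MDT 1 and switch 102 (priority 200) the root of MDT 2. If all priorities are left at the default, the system ID decides.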
Every FabricPath edge switch by default performs IGMP snooping to discover multicast group membership. The switch listens to IGMP reports, IGMP queries and PIM hellos. This process feeds the multicast routing (snooping) table of the switch. The multicast routing table tells the switch which CE interface is interested in which multicast group traffic.
Next to populating the switch multicast routing table, the FabricPath IS-IS IGMP process floods the group membership interest to the other FabricPath switches in the domain via group membership link state PDUs (GM-LSPs). Every switch in the FabricPath domain learns about interested receivers for a given multicast group. The result is that every switch only sends IP multicast traffic to interested switches. Because IP multicast traffic is only sent to interested locations and not everywhere, it is pruned automatically.
This example shows the result of IGMP snooping and the flooding of GM-LSPs. The servers connected to switches 1001 and 1002 send out IGMP membership reports for the same multicast group, written here as (*, G). The IGMP snooping tables of both switches 1001 and 1002 show that traffic destined for G must be sent out of e1/3. The group membership information is also sent via GM-LSPs to all other FP switches in the domain. The result of this is shown in the figure: the IS-IS multicast routing table at switch 1001 shows that multicast traffic destined for (*, G) must also be sent to switch 1002. Switch 1002 shows the opposite: multicast traffic destined for (*, G) will also be sent to switch 1001. Switch 2002 received GM-LSPs from both 1001 and 1002. This is reflected in its IS-IS multicast routing table: multicast traffic destined for (*, G) will be sent to both switches, 1001 and 1002.
The switches also use their multi destination forwarding tables of tree 1 and 2; IP multicast traffic is load shared among both trees, based on hashing (the multi destination forwarding table is not shown in this figure). The information in the multi destination tree is combined with the information in the IS-IS multicast routing table to send IP multicast to interested destinations only. After the IP multicast frame is received at the egress switch, the local IGMP multicast routing (snooping) table is used to send the IP multicast packet out of the interested interfaces.
FabricPath core switches are not involved in MAC learning. Core switches forward FabricPath frames based on the SID in the ODA. The necessary routing information for the core switch is available in the switch ID table. Edge switches learn MAC addresses in two different ways:
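The combination of the two tables boils down to a set intersection, which the following Python sketch illustrates. It is a simplification with made-up names: `mdt_table` is a hypothetical multi destination forwarding table (port mapped to the switches behind it) and `interested_switches` is the set of switches that advertised interest in the group via GM-LSPs.

```python
def pruned_output_ports(mdt_table, interested_switches):
    """Return the tree ports that lead to at least one interested switch."""
    interested = set(interested_switches)
    return sorted(
        port for port, reachable in mdt_table.items()
        if interested & set(reachable)
    )

# Hypothetical tree table of a core switch and the receivers from the example.
mdt_table = {
    "Po1": [1000],
    "Po2": [1001],
    "Po3": [102, 1002],
    "Po4": [2000, 2001, 2002],
}
print(pruned_output_ports(mdt_table, {1001, 1002}))
```

Only the branches that lead to switches 1001 and 1002 receive a copy; the other branches of the tree are pruned for this group.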
- Via CE interfaces: MAC learning is executed as known in classical Ethernet, the switch learns any source MAC address from every frame received at a CE interface. These are called local MAC entries and they point to outgoing CE switch interfaces.
- Via FabricPath interfaces: not all source MAC addresses of frames received on FP core interfaces are learned. These source addresses are only learned if the destination address matches an already learned local MAC entry. This is called conversational learning.
The advantage of conversational MAC learning is that MAC tables don’t fill up with entries that are probably never used. The switch only learns a MAC address from a FP core port if there is a session “conversation” going on between two end hosts.
Source MAC addresses of broadcast frames are never learned from a FP core port. This is because the broadcast destination address (FFFF.FFFF.FFFF) does not exist in the MAC table of the edge switch. Remember: the switch only learns the source MAC address if the destination MAC address matches an already learned local MAC entry.
Broadcast frames play an important role in conversational learning, they are used to update existing entries in the MAC address table. If a host moves it sends a gratuitous ARP which is a broadcast packet. If the gratuitous ARP is received by an edge switch and the source MAC of the ARP is already in the MAC table then the entry is refreshed with the latest information.
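The learning rules for unicast and broadcast frames can be captured in a small Python class. This is a sketch of the logic as described in this post, not of the hardware behavior; the class and method names and the MAC addresses are assumptions, and multicast frames are left out.

```python
BROADCAST = "ffff.ffff.ffff"

class EdgeMacTable:
    """Toy MAC table of a FabricPath edge switch."""

    def __init__(self):
        self.local = {}   # MAC -> CE interface
        self.remote = {}  # MAC -> switch ID of the remote edge switch

    def learn_from_ce(self, src_mac, interface):
        # Classical Ethernet learning: every source MAC is learned.
        self.local[src_mac] = interface

    def learn_from_fabricpath(self, src_mac, dst_mac, src_switch_id):
        """Return True if the source MAC was learned or refreshed."""
        if dst_mac in self.local:
            # A local host is part of the conversation: learn the remote MAC.
            self.remote[src_mac] = src_switch_id
            return True
        if dst_mac == BROADCAST and src_mac in self.remote:
            # Broadcasts only update entries that already exist,
            # for example a gratuitous ARP after a host move.
            self.remote[src_mac] = src_switch_id
            return True
        return False

table = EdgeMacTable()
table.learn_from_ce("aaaa.aaaa.aaaa", "e1/1")
print(table.learn_from_fabricpath("bbbb.bbbb.bbbb", "cccc.cccc.cccc", 2003))
print(table.learn_from_fabricpath("bbbb.bbbb.bbbb", "aaaa.aaaa.aaaa", 2003))
print(table.learn_from_fabricpath("bbbb.bbbb.bbbb", BROADCAST, 2004))
print(table.remote)
```

The first frame is not learned because its destination is not a local host. The second one is part of a conversation with a local host and is learned. The broadcast then moves the existing entry to switch 2004.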
IP and non-IP multicast behave a bit differently. Source MAC addresses from multicast frames are always learned. This is because very common protocols need the source address of the multicast frame for proper communication. This happens at the edge and the core switches. For example, OSPF sends information to various multicast destination addresses. If the receiver never learned the source address of the sender, OSPF could not work in a FabricPath environment.
The opinions expressed in this blog are my own views and not those of Cisco.