Configuring PFC Deadlock Prevention
Overview
Consider a data center running RoCE (RDMA over Converged Ethernet) traffic for high-performance computing workloads. These workloads require low-latency, lossless forwarding, which Priority-based Flow Control (PFC) is used to enforce. However, as network congestion builds up, PFC pause frames are triggered, and a deadlock can result if multiple paused paths end up waiting on one another.
By employing a PFC deadlock prevention solution, the network can identify RoCE flows that are prone to triggering deadlocks. The solution adjusts the queue priorities of these flows so that other critical flows are not blocked, and it reduces the load on congested paths. This prevents circular wait conditions from forming, so business-critical applications keep running without being interrupted by PFC-induced deadlocks.
How PFC Deadlock Prevention Works in Practice
Monitoring and Analytics
Figure 1. PFC Hook Flows
Figure 1 shows a CLOS network, a highly scalable, high-performance switching network commonly used in modern data centers. It is a multi-stage topology, typically built from multiple leaf and spine switches, that is designed to handle massive amounts of data with minimal latency and is often used to interconnect thousands of servers. PFC is usually deployed in such networks to manage flow control and avoid packet loss.
PFC Uplink Port Group
As shown in Figure 1, interfaces te-1/1/1 and te-1/1/2 are the uplinks connecting Leaf2 to the spines. They are added to the PFC uplink port group so that the device manages them as a single entity when assessing traffic flow and PFC behavior.
High-Risk Hook Flow
As shown in Figure 1, when a leaf device detects that the same business flow (i.e., a specific set of traffic identified by its characteristics, such as source/destination IP, port, etc.) is traversing multiple interfaces within the PFC uplink port group, it marks this flow as a high-risk hook flow.
When a high-risk hook flow generates congestion across multiple interfaces (uplinks), PFC pause frames may be issued by the leaf to its upstream spine switches. If both interfaces in the uplink group send pause frames, and the upstream spine switches are also congested, it can result in a circular wait scenario (deadlock). The switches are effectively waiting for each other to release the paused traffic, leading to a network stall.
The Deadlock Prevention solution proactively monitors the data center network for high-risk hook flows that may lead to the generation of PFC pause frames.
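The detection rule described above can be illustrated with a minimal Python sketch. It is not the device's implementation: the FlowKey fields, the list of (flow, interface) observations, the IP addresses, and the port te-1/1/5 are all assumptions made up for illustration. The sketch simply takes the description literally and flags any flow seen on more than one interface of the PFC uplink port group.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class FlowKey(NamedTuple):
    """Hypothetical flow identifier; a real device may key flows differently."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


def find_high_risk_hook_flows(
    observations: Iterable[Tuple[FlowKey, str]],
    uplink_group: Set[str],
) -> Dict[FlowKey, Set[str]]:
    """Return the flows seen on more than one interface of the uplink group."""
    seen: Dict[FlowKey, Set[str]] = defaultdict(set)
    for flow, interface in observations:
        if interface in uplink_group:  # ignore interfaces outside the group
            seen[flow].add(interface)
    return {flow: ifaces for flow, ifaces in seen.items() if len(ifaces) > 1}


# Illustrative data only: te-1/1/1 and te-1/1/2 form the uplink group.
uplinks = {"te-1/1/1", "te-1/1/2"}
flow_a = FlowKey("10.0.1.10", "10.0.2.20", 49152, 4791)
flow_b = FlowKey("10.0.1.11", "10.0.2.21", 49153, 4791)
observations = [
    (flow_a, "te-1/1/1"),
    (flow_a, "te-1/1/2"),  # same flow on a second uplink: high risk
    (flow_b, "te-1/1/1"),
    (flow_b, "te-1/1/5"),  # made-up port that is not in the uplink group
]
high_risk = find_high_risk_hook_flows(observations, uplinks)
```

In this sketch only flow_a is reported, because flow_b touches a single interface of the group.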
Dynamic Queue Management
After the device receives a packet that belongs to a high-risk hook flow, it modifies the DSCP value and the corresponding dot1p priority of the packet, so that the packet is forwarded in the queue of the new dot1p priority using the new DSCP value.
In summary, the PFC deadlock prevention function in CLOS networks works by creating PFC uplink port groups that combine the uplink interfaces on leaf devices. The system detects flows that traverse these grouped uplinks and identifies them as potential deadlock triggers (high-risk hook flows). By preemptively modifying the queue priorities of these flows, the system prevents deadlocks from occurring and keeps the data center network stable and efficient.
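The remapping step can be pictured with a small Python model. This is a sketch under stated assumptions, not the switch's data path: the Packet and RemapRule classes, the is_hook_flow flag, and the starting queue 3 are hypothetical. The values 32, 4, and 48 are taken from the configuration example in this document, and leaving new_dscp unset models the case where no DSCP rewrite is configured.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet view: only the fields the remap touches."""
    dscp: int
    queue: int
    is_hook_flow: bool


@dataclass(frozen=True)
class RemapRule:
    """Hypothetical rule: match on the original DSCP, then set queue and DSCP."""
    original_dscp: int
    new_queue: int
    new_dscp: Optional[int] = None  # None means the DSCP value is left as-is


def apply_rule(packet: Packet, rule: RemapRule) -> Packet:
    """Remap a hook-flow packet whose DSCP matches the rule; otherwise pass it through."""
    if not packet.is_hook_flow or packet.dscp != rule.original_dscp:
        return packet
    dscp = packet.dscp if rule.new_dscp is None else rule.new_dscp
    return replace(packet, dscp=dscp, queue=rule.new_queue)


rule = RemapRule(original_dscp=32, new_queue=4, new_dscp=48)
remapped = apply_rule(Packet(dscp=32, queue=3, is_hook_flow=True), rule)
untouched = apply_rule(Packet(dscp=32, queue=3, is_hook_flow=False), rule)
```

Here remapped ends up in queue 4 with DSCP 48, while untouched keeps its original queue and DSCP because it is not a hook flow.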
Restrictions and Guidelines
When you configure PFC deadlock prevention, follow these restrictions and guidelines:
PFC Deadlock Prevention is only supported on Trident3-X5, Trident3-X7 and Tomahawk3 platforms.
If any Equal-Cost Multi-Path (ECMP) output interfaces exist within the PFC uplink port group, all interfaces of that ECMP group must be included in the PFC uplink port group. Otherwise, queue switching for Layer 3 traffic on the PFC uplink port group interfaces may be incorrect, which leads to unexpected modifications of the DSCP (Differentiated Services Code Point) values and can result in incorrect traffic prioritization.
Each device supports only one PFC uplink port group.
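The ECMP guideline above lends itself to a simple pre-deployment check. The following Python sketch assumes you already have the ECMP membership available as a plain dictionary; the group names, the extra ports, and the function name are hypothetical, and the sketch does not query a device.

```python
from typing import Dict, Iterable, List, Set


def missing_ecmp_members(
    ecmp_groups: Dict[str, Iterable[str]],
    uplink_group: Set[str],
) -> Dict[str, List[str]]:
    """For each ECMP group that overlaps the PFC uplink port group,
    list the ECMP members that are not part of the uplink group."""
    problems: Dict[str, List[str]] = {}
    for name, members in ecmp_groups.items():
        member_set = set(members)
        overlaps = bool(member_set & uplink_group)
        if overlaps and not member_set <= uplink_group:
            problems[name] = sorted(member_set - uplink_group)
    return problems


# Illustrative data only: te-1/1/3 is an ECMP member left out of the group.
uplink_group = {"te-1/1/1", "te-1/1/2"}
ecmp_groups = {
    "to-spines": ["te-1/1/1", "te-1/1/2", "te-1/1/3"],
    "to-servers": ["te-1/1/10", "te-1/1/11"],
}
problems = missing_ecmp_members(ecmp_groups, uplink_group)
```

An empty result means every ECMP group that touches the PFC uplink port group is fully contained in it. In the sample data, the check reports te-1/1/3 as missing.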
Configuring PFC Deadlock Prevention
Procedure
Step 1 Create a PFC uplink port group by adding an uplink interface to it. Repeat the command for each uplink interface.
set class-of-service interface <interface-name> pfc-uplink-group <group-name>
Step 2 Modify the queue priority of hook flow packets that match the PFC uplink port group and the original DSCP value.
set class-of-service pfc-uplink-group <group-name> original-dscp <origin-value> to-code-point <queue>
Step 3 Modify the DSCP value of hook flow packets that match the PFC uplink port group and the original DSCP value. If this command is not configured, the DSCP value carried by the packets is not adjusted.
set class-of-service pfc-uplink-group <group-name> original-dscp <origin-value> dscp <value>
NOTE:
On the Trident3-X5 and Trident3-X7 platforms, the commands in Step 2 and Step 3 must both be configured and submitted in the same commit.
Step 4 Commit the configuration.
commit
Configuration Example
The following commands complete these configurations:
Create a PFC uplink port group group1 that contains the uplink interfaces te-1/1/1 and te-1/1/2.
For hook flow packets that match the PFC uplink port group group1 and the original DSCP value 32, change the queue priority to 4 and the DSCP value to 48.
admin@PICOS# set class-of-service interface te-1/1/1 pfc-uplink-group group1
admin@PICOS# set class-of-service interface te-1/1/2 pfc-uplink-group group1
admin@PICOS# set class-of-service pfc-uplink-group group1 original-dscp 32 to-code-point 4
admin@PICOS# set class-of-service pfc-uplink-group group1 original-dscp 32 dscp 48
admin@PICOS# commit
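When the same settings have to be applied to several leaf devices, the command sequence can be generated rather than typed. The Python sketch below only assembles strings using the command syntax shown in the procedure above. The function name and its parameters are hypothetical, it does not connect to a device, and the only validation it performs is the 0 to 63 range of the 6-bit DSCP field.

```python
from typing import List, Optional, Sequence


def build_pfc_uplink_commands(
    group: str,
    interfaces: Sequence[str],
    original_dscp: int,
    queue: int,
    new_dscp: Optional[int] = None,
) -> List[str]:
    """Assemble the PFC uplink port group commands, ending with a single commit."""
    for value in (original_dscp, new_dscp):
        if value is not None and not 0 <= value <= 63:
            raise ValueError(f"DSCP value out of range: {value}")
    commands = [
        f"set class-of-service interface {name} pfc-uplink-group {group}"
        for name in interfaces
    ]
    prefix = f"set class-of-service pfc-uplink-group {group} original-dscp {original_dscp}"
    commands.append(f"{prefix} to-code-point {queue}")
    if new_dscp is not None:  # omit the DSCP rewrite when it is not wanted
        commands.append(f"{prefix} dscp {new_dscp}")
    commands.append("commit")
    return commands


commands = build_pfc_uplink_commands(
    "group1", ["te-1/1/1", "te-1/1/2"], original_dscp=32, queue=4, new_dscp=48
)
print("\n".join(commands))
```

With these arguments the sketch reproduces the five commands of the example. Because the queue and DSCP commands are emitted before one commit, the output also fits the same-commit requirement for the Trident3-X5 and Trident3-X7 platforms, provided new_dscp is supplied.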
Copyright © 2024 Pica8 Inc. All Rights Reserved.