High CPU Utilization
CPU (Central Processing Unit) utilization is the percentage of time the CPU spends on processing, as compared to the time it sits idle.
Pica8 PicOS switches start to experience problems when CPU utilization approaches 75%. Brief periods of high CPU utilization are not a concern; however, if CPU utilization remains consistently high, the cause needs to be investigated.
The symptoms include:
- Poor system performance
- Switch management slower than usual
- Ping to the management interface times out
- Packet drops
Step 1
To diagnose why a PicOS switch is slow, start with the top command. By default, top runs in interactive mode and updates its output every few seconds.
admin@Switch$top
top - 22:02:10 up 40 days, 13:06,  1 user,  load average: 0.01, 0.02, 0.00
Tasks:  61 total,   2 running,  59 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.5 us,  0.5 sy,  0.0 ni, 94.5 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem:   2073452 total,   243996 used,  1829456 free,    10728 buffers
KiB Swap:        0 total,        0 used,        0 free,   106620 cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6089 root      40   0  128m  31m  13m S  8.6  1.6 576:55.72 pica_lcmgr
 6986 root      40   0 65456  27m 4144 S  1.3  1.3  30:18.82 xorp_rtrmgr
 6080 root      40   0 44156  22m  11m R  0.7  1.1  18:38.42 pica_sif
 7075 root       9 -11 41412 3980 3464 S  0.7  0.2  51:39.71 pica_monitor
    1 root      40   0  2520  852  736 S  0.0  0.0   7:35.99 init
    2 root      40   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      rt   0     0    0    0 S  0.0  0.0   0:01.73 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:04.61 ksoftirqd/0
    5 root      rt   0     0    0    0 S  0.0  0.0   0:02.29 migration/1
    6 root      20   0     0    0    0 S  0.0  0.0   0:06.39 ksoftirqd/1
    7 root      20   0     0    0    0 S  0.0  0.0   0:06.47 events/0
    8 root      20   0     0    0    0 S  0.0  0.0   0:00.14 events/1
    9 root      20   0     0    0    0 S  0.0  0.0   0:00.07 khelper
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.00 async/mgr
  134 root      20   0     0    0    0 S  0.0  0.0   0:00.02 sync_supers
  136 root      20   0     0    0    0 S  0.0  0.0   0:00.20 bdi-default
  137 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kblockd/0
  138 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kblockd/1
  144 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ata/0
CPU-bound load occurs when too many processes contend for CPU resources. To check whether the load is CPU-bound, examine the third line of the output:
%Cpu(s): 4.5 us, 0.5 sy, 0.0 ni, 94.5 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
Each percentage indicates the fraction of CPU time consumed by a category of tasks. The us field shows the CPU time used by user processes, and sy shows the CPU time used by the kernel and other system processes. A CPU-bound load results in a high percentage of either us (user) or sy (system) CPU time.
If the user or system percentage is high, the load is likely to be CPU-bound. To narrow down the root cause, look at the processes that consume the most CPU. By default, top sorts processes by CPU usage, with the top consumers listed first. Once you know which processes have the highest CPU utilization, you can troubleshoot them further.
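To capture a one-shot, non-interactive snapshot of the top CPU consumers (for example, to attach to a support case), a standard ps invocation such as the following can be used. This is a generic Linux sketch and assumes the full procps version of ps is available on the switch:

admin@Switch$ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -10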
Step 2
Check the log messages for any errors or warnings from the following important PicOS modules:
PicOS L2/L3 Mode: pica_lcmgr, pica_sif, and xorp_rtrmgr
PicOS OVS Mode: ovs-vswitchd and ovsdb-server
By default, the log file is /tmp/log/messages. The file may be large, possibly several thousand lines, so examine it with the cat command piped (|) through an appropriate filter:
admin@Switch$cat /tmp/log/messages | grep lcmgr
Aug 29 2015 01:30:57 SPINE-B local0.info : [PICA_MONITOR]Process pica_lcmgr, running, PID 12983
Aug 29 2015 01:30:57 SPINE-B local0.info : [PICA_MONITOR]Monitor for process pica_lcmgr started
Sep 1 2015 18:07:00 SPINE-B local0.warning : [PICA_MONITOR]pica_lcmgr cpu rate limit 0.80, cpu 0.81 working-rate 0.77, happened 1 times
Sep 3 2015 13:13:32 XorPlus local0.err : [RTRMGR]XRL Death: class lcmgr01 instance lcmgr01-87510959a54d13e97bb845d35f267ad0@127.0.0.1
Sep 3 2015 13:14:28 XorPlus local0.info : [PICA_MONITOR]Process pica_lcmgr, running, PID 23228
Sep 3 2015 13:14:28 XorPlus local0.info : [PICA_MONITOR]Monitor for process pica_lcmgr started
Sep 7 2015 05:05:10 SPINE-B local0.warning : [PICA_MONITOR]pica_lcmgr cpu rate limit 0.80, cpu 0.80 working-rate 0.77, happened 1 times
Sep 11 2015 05:37:12 SPINE-B daemon.notice : 05:37:12.480|ovs|00001|lcmgr_shared|INFO|XOVS got system mac (48:6e:73:02:04:64)
Sep 11 2015 05:37:12 SPINE-B daemon.notice : 05:37:12.481|ovs|00002|lcmgr_shared|INFO|XOVS got QE port mode (0)
Sep 11 2015 05:37:12 SPINE-B daemon.notice : 05:37:12.481|ovs|00003|lcmgr_shared|INFO|XOVS got flow count limitation (0, 0, 0, 2000, 2000)
Sep 11 2015 05:37:12 SPINE-B daemon.notice : 05:37:12.492|ovs|00001|lcmgr_shared(sif_handler1)|INFO|sif handler created
<Some output omitted>
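To scan all of the important modules at once for errors and warnings, the filter can be broadened. The following is a sketch assuming the default /tmp/log/messages path; adjust the module names to the mode the switch is running in:

admin@Switch$grep -E 'lcmgr|pica_sif|rtrmgr|ovs-vswitchd|ovsdb-server' /tmp/log/messages | grep -iE 'err|warning'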
Step 3
Check for core dumps in the /pica/core directory.
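For example, listing the directory sorted by modification time shows whether any core files have been produced recently (a minimal sketch; the directory may be empty on a healthy switch):

admin@Switch$ls -lt /pica/core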
Step 4
To display the virtual interfaces configured on the switch, use the ifconfig command at the Linux shell:
admin@Switch$ifconfig
eth0      Link encap:Ethernet  HWaddr 48:6e:73:02:04:63
          inet addr:192.168.42.110  Bcast:192.168.42.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2379952 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1060135 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:354374731 (337.9 MiB)  TX bytes:152816006 (145.7 MiB)
          Base address:0x2000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:98303973 errors:0 dropped:0 overruns:0 frame:0
          TX packets:98303973 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:860429061 (820.5 MiB)  TX bytes:860429061 (820.5 MiB)

vlan.3    Link encap:Ethernet  HWaddr 48:6e:73:02:04:64
          inet addr:10.10.3.1  Bcast:10.10.3.255  Mask:255.255.255.0
          inet6 addr: fe80::4a6e:73ff:302:464/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:36973 errors:0 dropped:0 overruns:0 frame:0
          TX packets:36446 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:2927602 (2.7 MiB)  TX bytes:2743024 (2.6 MiB)
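To spot an interface that is receiving an unusually large amount of traffic, the same counters can be re-read at a fixed interval and compared. A simple sketch using the standard watch utility (if it is available on the switch image), with vlan.3 from the output above:

admin@Switch$watch -n 5 'ifconfig vlan.3 | grep packets'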
Step 5
To display packets on a specific virtual interface, use the tcpdump command at the Linux shell:
admin@Switch$sudo tcpdump -i vlan.3
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vlan.3, link-type EN10MB (Ethernet), capture size 65535 bytes
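If the interface is busy, limit the capture so that tcpdump itself does not add CPU load; for example, capture only 100 packets and skip name resolution (standard tcpdump options, again using vlan.3 from the example above):

admin@Switch$sudo tcpdump -i vlan.3 -nn -c 100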
To debug the protocol messages between the switch and the controller, use the ovs-ofctl snoop command in the OVS mode. The following command debugs the protocol messages exchanged between the br0 bridge and the controller:
admin@Switch$ovs-ofctl snoop br0
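It can also help to check which installed flows punt packets to the CPU in OVS mode; for example, dump the flow table on br0 and filter for CONTROLLER or LOCAL actions (a sketch using standard ovs-ofctl commands):

admin@Switch$ovs-ofctl dump-flows br0 | grep -iE 'controller|local'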
Common Causes
In CrossFlow mode, both the L2/L3 and OVS processes are running. The switch has to process both OVS protocol messages and L2/L3 control packets such as BPDUs, OSPF packets, and BGP packets, so CPU utilization is likely to be higher in CrossFlow mode than in L2/L3 or OVS mode.
Normally, the rate of packets punted to the CPU is below 1000 pps (packets per second), and CPU utilization stays low. However, the eth0 management interface has no rate limiting configured, so an attacker can send a large number of packets to the management interface, making the switch slow or even unusable for legitimate traffic.
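To compare the actual packet rate on eth0 against this rough 1000 pps baseline, the interface's RX counter can be sampled twice. The sketch below reads the standard Linux sysfs counter over a 10-second interval and assumes a POSIX shell on the switch:

# Sample the eth0 RX packet counter twice, 10 seconds apart, and print the average rate.
RX1=$(cat /sys/class/net/eth0/statistics/rx_packets)
sleep 10
RX2=$(cat /sys/class/net/eth0/statistics/rx_packets)
echo "eth0 RX rate: $(( (RX2 - RX1) / 10 )) pps"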
Possible Fixes
The following fixes can be deployed for high CPU utilization:
- Add a default drop flow for table-miss packets, to prevent these packets from causing high CPU utilization (see the sketch after this list).
- Remove some flows whose actions send packets to the CPU, such as CONTROLLER or LOCAL.
- Make sure that the controller is not sending excessive OpenFlow messages to the switch.
- Configure the management interface eth0 at a low speed, such as 10 Mbps, using the ethtool -s eth0 speed 10 command.
- Reload the switch.
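For the table-miss drop flow and the eth0 speed change, the commands would look roughly like the following in OVS mode. This is a sketch: br0 and eth0 are the names used earlier in this article, and ethtool usually requires autonegotiation to be disabled when a speed is forced.

# Install a lowest-priority rule so table-miss packets are dropped instead of reaching the CPU.
admin@Switch$ovs-ofctl add-flow br0 "priority=0,actions=drop"

# Force the eth0 management interface down to 10 Mbps full duplex.
admin@Switch$sudo ethtool -s eth0 speed 10 duplex full autoneg off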
Copyright © 2024 Pica8 Inc. All Rights Reserved.