High CPU Utilization
CPU (Central Processing Unit) utilization is the percentage of time the CPU spends on processing, as compared to the time it sits idle.
Pica8 PicOS switches start to experience problems when CPU utilization approaches 75%. Brief periods of high CPU utilization are not a concern; however, if CPU utilization remains consistently high, the cause needs to be investigated.
The symptoms include:
- Poor system performance
- Switch management slower than usual
- Ping to the management interface times out
- Packet drops
Step 1
To diagnose why a PicOS switch is slow, start with the top command. By default, top runs in interactive mode and updates its output every few seconds.
admin@Switch$top
top - 22:02:10 up 40 days, 13:06,  1 user,  load average: 0.01, 0.02, 0.00
Tasks:  61 total,   2 running,  59 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.5 us,  0.5 sy,  0.0 ni, 94.5 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem:   2073452 total,   243996 used,  1829456 free,    10728 buffers
KiB Swap:        0 total,        0 used,        0 free,   106620 cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6089 root      40   0  128m  31m  13m S  8.6  1.6 576:55.72 pica_lcmgr
 6986 root      40   0 65456  27m 4144 S  1.3  1.3  30:18.82 xorp_rtrmgr
 6080 root      40   0 44156  22m  11m R  0.7  1.1  18:38.42 pica_sif
 7075 root       9 -11 41412 3980 3464 S  0.7  0.2  51:39.71 pica_monitor
    1 root      40   0  2520  852  736 S  0.0  0.0   7:35.99 init
    2 root      40   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      rt   0     0    0    0 S  0.0  0.0   0:01.73 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:04.61 ksoftirqd/0
    5 root      rt   0     0    0    0 S  0.0  0.0   0:02.29 migration/1
    6 root      20   0     0    0    0 S  0.0  0.0   0:06.39 ksoftirqd/1
    7 root      20   0     0    0    0 S  0.0  0.0   0:06.47 events/0
    8 root      20   0     0    0    0 S  0.0  0.0   0:00.14 events/1
    9 root      20   0     0    0    0 S  0.0  0.0   0:00.07 khelper
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.00 async/mgr
  134 root      20   0     0    0    0 S  0.0  0.0   0:00.02 sync_supers
  136 root      20   0     0    0    0 S  0.0  0.0   0:00.20 bdi-default
  137 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kblockd/0
  138 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kblockd/1
  144 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ata/0
CPU-bound load occurs when too many processes contend for CPU resources. To check whether the load is CPU-bound, examine the third line of the output:
%Cpu(s): 4.5 us, 0.5 sy, 0.0 ni, 94.5 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
Each percentage indicates the fraction of CPU time consumed by a category of tasks. The us field shows the CPU time used by user processes, and sy shows the CPU time used by the kernel and other system processes. A CPU-bound load results in a high percentage of either us (user) or sy (system) CPU time.
If the user or system percentage is high, the load is likely to be CPU-bound. To narrow down the root cause, look at the processes that consume the most CPU. By default, top sorts processes by CPU usage, with the top consumers listed first. Once you know which processes have the highest CPU utilization, you can troubleshoot them further.
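To capture a one-shot, non-interactive snapshot of the top CPU consumers (for example, to attach to a support case), a standard ps invocation such as the following can be used. This is a generic Linux sketch and assumes the full procps version of ps is available on the switch:

admin@Switch$ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -10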
Step 2
Check the log messages for any errors or warnings from the following important PicOS modules:
PicOS L2/L3 Mode: pica_lcmgr, pica_sif, and xorp_rtrmgr
PicOS OVS Mode: ovs-vswitchd and ovsdb-server
By default, the log file is /tmp/log/messages. The file may be large, possibly several thousand lines, so examine it with the cat command piped (|) through an appropriate filter:
admin@Switch$cat /tmp/log/messages | grep lcmgr
Aug 29 2015 01:30:57 SPINE-B local0.info : [PICA_MONITOR]Process pica_lcmgr, running, PID 12983
Aug 29 2015 01:30:57 SPINE-B local0.info : [PICA_MONITOR]Monitor for process pica_lcmgr started
Sep 1 2015 18:07:00 SPINE-B local0.warning : [PICA_MONITOR]pica_lcmgr cpu rate limit 0.80, cpu 0.81 working-rate 0.77, happened 1 times
Sep 3 2015 13:13:32 XorPlus local0.err : [RTRMGR]XRL Death: class lcmgr01 instance lcmgr01-87510959a54d13e97bb845d35f267ad0@127.0.0.1
Sep 3 2015 13:14:28 XorPlus local0.info : [PICA_MONITOR]Process pica_lcmgr, running, PID 23228
Sep 3 2015 13:14:28 XorPlus local0.info : [PICA_MONITOR]Monitor for process pica_lcmgr started
Sep 7 2015 05:05:10 SPINE-B local0.warning : [PICA_MONITOR]pica_lcmgr cpu rate limit 0.80, cpu 0.80 working-rate 0.77, happened 1 times
Sep 11 2015 05:37:12 SPINE-B daemon.notice : 05:37:12.480|ovs|00001|lcmgr_shared|INFO|XOVS got system mac (48:6e:73:02:04:64)
Sep 11 2015 05:37:12 SPINE-B daemon.notice : 05:37:12.481|ovs|00002|lcmgr_shared|INFO|XOVS got QE port mode (0)
Sep 11 2015 05:37:12 SPINE-B daemon.notice : 05:37:12.481|ovs|00003|lcmgr_shared|INFO|XOVS got flow count limitation (0, 0, 0, 2000, 2000)
Sep 11 2015 05:37:12 SPINE-B daemon.notice : 05:37:12.492|ovs|00001|lcmgr_shared(sif_handler1)|INFO|sif handler created
<Some output omitted>
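To scan all of the important modules at once for errors and warnings, the filter can be broadened. The following is a sketch assuming the default /tmp/log/messages path; adjust the module names to the mode the switch is running in:

admin@Switch$grep -E 'lcmgr|pica_sif|rtrmgr|ovs-vswitchd|ovsdb-server' /tmp/log/messages | grep -iE 'err|warning'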
Step 3
Check for core dumps in the /pica/core directory.
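For example, listing the directory sorted by modification time shows whether any core files have been produced recently (a minimal sketch; the directory may be empty on a healthy switch):

admin@Switch$ls -lt /pica/core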
Step 4
To display the virtual interfaces configured on the switch, use the ifconfig command at the Linux shell:
admin@Switch$ifconfig
eth0      Link encap:Ethernet  HWaddr 48:6e:73:02:04:63
          inet addr:192.168.42.110  Bcast:192.168.42.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2379952 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1060135 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:354374731 (337.9 MiB)  TX bytes:152816006 (145.7 MiB)
          Base address:0x2000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:98303973 errors:0 dropped:0 overruns:0 frame:0
          TX packets:98303973 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:860429061 (820.5 MiB)  TX bytes:860429061 (820.5 MiB)

vlan.3    Link encap:Ethernet  HWaddr 48:6e:73:02:04:64
          inet addr:10.10.3.1  Bcast:10.10.3.255  Mask:255.255.255.0
          inet6 addr: fe80::4a6e:73ff:302:464/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:36973 errors:0 dropped:0 overruns:0 frame:0
          TX packets:36446 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:2927602 (2.7 MiB)  TX bytes:2743024 (2.6 MiB)
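To spot an interface that is receiving an unusually large amount of traffic, the same counters can be re-read at a fixed interval and compared. A simple sketch using the standard watch utility (if it is available on the switch image), with vlan.3 from the output above:

admin@Switch$watch -n 5 'ifconfig vlan.3 | grep packets'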
Step 5
To display packets on a specific virtual interface, use the tcpdump command at the Linux shell:
admin@Switch$sudo tcpdump -i vlan.3
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vlan.3, link-type EN10MB (Ethernet), capture size 65535 bytes
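If the interface is busy, limit the capture so that tcpdump itself does not add CPU load; for example, capture only 100 packets and skip name resolution (standard tcpdump options, again using vlan.3 from the example above):

admin@Switch$sudo tcpdump -i vlan.3 -nn -c 100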
To debug the protocol messages between the switch and the controller, use the ovs-ofctl snoop command in the OVS mode. The following command debugs the protocol messages exchanged between the br0 bridge and the controller:
admin@Switch$ovs-ofctl snoop br0
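It can also help to check which installed flows punt packets to the CPU in OVS mode; for example, dump the flow table on br0 and filter for CONTROLLER or LOCAL actions (a sketch using standard ovs-ofctl commands):

admin@Switch$ovs-ofctl dump-flows br0 | grep -iE 'controller|local'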
Common Causes
In CrossFlow mode, both the L2/L3 and OVS processes are running. The switch has to process both OVS protocol messages and L2/L3 control packets such as BPDUs, OSPF packets, and BGP packets, so CPU utilization is likely to be higher in CrossFlow mode than in L2/L3 or OVS mode.
Normally, the rate of packets punted to the CPU is below 1000 pps (packets per second), and CPU utilization stays low. However, the eth0 management interface has no rate limiting configured, so an attacker can send a large number of packets to the management interface, making the switch slow or even unusable for legitimate traffic.
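To compare the actual packet rate on eth0 against this rough 1000 pps baseline, the interface's RX counter can be sampled twice. The sketch below reads the standard Linux sysfs counter over a 10-second interval and assumes a POSIX shell on the switch:

# Sample the eth0 RX packet counter twice, 10 seconds apart, and print the average rate.
RX1=$(cat /sys/class/net/eth0/statistics/rx_packets)
sleep 10
RX2=$(cat /sys/class/net/eth0/statistics/rx_packets)
echo "eth0 RX rate: $(( (RX2 - RX1) / 10 )) pps"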
Possible Fixes
The following fixes can be deployed for high CPU utilization:
- Add a default drop flow for table-miss packets, to prevent these packets from causing high CPU utilization (see the sketch after this list).
- Remove some flows whose actions send packets to the CPU, such as CONTROLLER or LOCAL.
- Make sure that the controller is not sending excessive OpenFlow messages to the switch.
- Configure the management interface eth0 at a low speed, such as 10 Mbps, using the ethtool -s eth0 speed 10 command.
- Reload the switch.
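For the table-miss drop flow and the eth0 speed change, the commands would look roughly like the following in OVS mode. This is a sketch: br0 and eth0 are the names used earlier in this article, and ethtool usually requires autonegotiation to be disabled when a speed is forced.

# Install a lowest-priority rule so table-miss packets are dropped instead of reaching the CPU.
admin@Switch$ovs-ofctl add-flow br0 "priority=0,actions=drop"

# Force the eth0 management interface down to 10 Mbps full duplex.
admin@Switch$sudo ethtool -s eth0 speed 10 duplex full autoneg off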
Copyright © 2024 Pica8 Inc. All Rights Reserved.