High latency and firewall concurrent connection count on NSX Edge after upgrading to NSX-V 6.4.12

Update: This issue is resolved in NSX-V 6.4.13

After the upgrade to NSX Data Center for vSphere 6.4.12 in one of my customer environments, we experienced high network latency with jitter in several of our tenants, starting two days after the upgrade. One of our edges became unstable, with latencies of up to 5 seconds, and an outage of a customer environment followed. The affected edge also performed an automatic failover after a few days of uptime.

Compared with its behavior before the upgrade, this edge showed several anomalies:

  • High, jitter-affected latency for packets traversing the edge

  • Slowly but steadily increasing firewall concurrent connection count

  • Firewall ruleset logs displayed on the console (STDOUT) of the active NSX Edge node

  • Higher CPU load than before

 

After a failover to the standby edge, the concurrent connection count drops, but after some time the same behavior can be observed there.

 In the NSX Edge logs the following exception can be observed:

kernel[]: []:  [kern.alert] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048

kernel[]: []:  [kern.alert] IP: [<ffffffff8246a480>] nf_xfrm_me_harder+0x30/0x110 [nf_nat]

It appears that this kernel exception disrupts the stable operation of the NSX Edge.
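To check an exported copy of the Edge logs for this signature, a small shell function like the following could help. It is only a sketch: the function name and the log file name are hypothetical, and it assumes the logs are available as a plain text file on a machine with a standard grep.

```shell
#!/bin/sh
# Sketch: count log lines that carry the kernel exception shown above.
# Assumption: "$1" is a plain text export of the Edge logs (the file name is up to you).
count_nat_null_deref() {
    logfile="$1"
    # Match either the generic BUG line or the nf_nat function named in the trace.
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c -E 'kernel NULL pointer dereference|nf_xfrm_me_harder' "$logfile"
}

# Example call (hypothetical file name):
# count_nat_null_deref ./edge-syslog.txt
```

A count greater than zero would suggest that the edge has hit the same exception.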

 

 Workaround

To work around the issue, a root login to the edge node is necessary. I recommend contacting VMware support.

 

Part 1

First, retrieve the edge root password from the NSX Manager. Make sure you have a backup before you start. The following steps are partly taken from VMware KB 2149630: https://kb.vmware.com/s/article/2149630

 

1.       Log in to the NSX Manager via SSH or console

2.       debug engineeringmode enable

3.       st eng

4.       Password: IAmOnThePhoneWithTechSupport

5.       /home/secureall/secureall/sem/WEB-INF/classes/GetSpockEdgePassword.sh

6.       Copy the root passwords of the edges you need.

 

Part 2

1.       Log in to the affected NSX Edge using the admin account on the VM console. I recommend starting with the standby edge if the firewall console output on the active edge gets in your way.

2.       enable (use the admin password)

3.       debug engineeringmode enable

4.       st eng (use the root password collected from NSX Manager)

5.       cd /opt/vmware/vshield/Framework

6.       Take a backup of the file: cp config_manager.pm config_manager.pm.orig (keep the old file)

7.       vi config_manager.pm

8.       Search for configManagerDone (the match should be around line 227)

9.       Switch to insert mode and add '#' at the beginning of the line

configManagerDone($configManagerData->{"highAvailability"}, $configManagerData->{"iptables"}{"changed"});

The result should look like this: #configManagerDone($configManagerData->{"highAvailability"}, $configManagerData->{"iptables"}{"changed"});

10.    Save and close the file using :wq!

11.   Check with less config_manager.pm that the change was successful
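Steps 6 to 11 could also be scripted instead of edited by hand in vi. The following is a minimal sketch, not a tested procedure: the function name is hypothetical, and it assumes that a sed with -i support (GNU sed or BusyBox) is available on the edge, which I have not verified. It only touches the line that starts with the exact call shown in step 9.

```shell
#!/bin/sh
# Sketch of steps 6 to 11: back up the file, comment out the call, verify the result.
# Assumption: sed supports -i (GNU sed or BusyBox); not verified on an Edge appliance.
disable_config_manager_done() {
    file="$1"
    # Step 6: keep the original file, but never overwrite an existing backup.
    [ -e "$file.orig" ] || cp -p "$file" "$file.orig" || return 1
    # Steps 7 to 10: put '#' in front of the call, keeping its indentation.
    # Lines that are already commented or that only define the sub do not match.
    sed -i 's/^\([[:space:]]*\)\(configManagerDone(\$configManagerData->{"highAvailability"}\)/\1#\2/' "$file" || return 1
    # Step 11: succeed only if a commented call is now present.
    grep -q '^[[:space:]]*#configManagerDone(' "$file"
}

# Example call, run from /opt/vmware/vshield/Framework:
# disable_config_manager_done config_manager.pm
```

Because the backup is only created once, running the function a second time leaves both config_manager.pm.orig and the already commented line unchanged.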

Perform a failover and repeat the same steps on the edge that was active before. A reboot of the NSX Edge is not necessary, but I would personally recommend it. A side note: the workaround is lost if the edges are redeployed. I expect this issue to be fixed in an upcoming NSX-V release.

 Update: This issue has been resolved with NSX-V 6.4.13

 
