CPUG: The Check Point User Group

Resources for the Check Point Community, by the Check Point Community.



Thread: RX-DRP / RX-OVR (FIFO Errors) / ClusterXL State change during policy install

  1. #1
    Join Date
    2017-08-17
    Posts
    4
    Rep Power
    0

    Default RX-DRP / RX-OVR (FIFO Errors) / ClusterXL State change during policy install

    Hi,

    I wonder if anyone could provide a fresh perspective on an issue I'm currently seeing on our VRRP 12600 cluster (running take 292).

    I have been spending some time running through Tim Hall's optimization guide and, lo and behold, we found issues with our cluster immediately at Layer 1. After seeing a huge number of RX-DRP / RX-OVR (FIFO) errors, we decided to bond our interfaces to alleviate the workload on the NICs. I have also ensured our CPU cores are properly assigned to the right number of SecureXL / CoreXL instances (4/8 respectively), and the CoreXL dynamic dispatcher is enabled on both members.

    After implementing the aforementioned steps we saw a huge decrease in RX-DRP packets and consequently RX-OVR packets.

    RX-OVR (FIFO) errors are now only seen during a policy install. Additionally, RX-DRPs (rx_no_buffer_count & rx_missed_errors) seem to occur on the Sync interface during a policy push. However, just to confirm: policy installs complete without fail.
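    Out of interest, I worked the remaining drops out as a fraction of received frames; a quick awk sketch over two of the bond lines from the `netstat -ni` output further down (RX-OK is field 4, RX-DRP is field 6):

    ```shell
    # Quick sketch: RX-DRP as a percentage of RX-OK, using the bond lines
    # copied from `netstat -ni` on the VRRP master.
    cat > /tmp/netstat.sample <<'EOF'
    bond57 1500 0 707206595 0 30631 30631 272991646 0 0 0 BMmRU
    bond1020 1500 0 840536193 0 30208 30208 917547806 0 0 0 BMmRU
    EOF
    awk '{ printf "%s %.4f%%\n", $1, ($6 / $4) * 100 }' /tmp/netstat.sample
    # bond57 0.0043%
    # bond1020 0.0036%
    ```

    So the steady-state drop rate is tiny; it is really only the policy-install window that matters.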

    Bond Groups for both Members:

    Bonding Interface: 57
    Bond Configuration
    xmit-hash-policy layer3+4
    down-delay 200
    primary Not configured
    lacp-rate fast
    mode 8023AD
    up-delay 100
    mii-interval 100
    Bond Interfaces
    eth1-05
    eth1-06
    Bonding Interface: 1020
    Bond Configuration
    xmit-hash-policy layer3+4
    down-delay 200
    primary Not configured
    lacp-rate fast
    mode 8023AD
    up-delay 100
    mii-interval 100
    Bond Interfaces
    eth1-03
    eth1-04



    netstat -ni

    ### VRRP MASTER

    Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
    Mgmt 1500 0 419629 0 0 0 9224 0 0 0 BMRU
    Sync 1500 0 13193458 0 0 0 987349565 0 0 0 BMRU
    bond57 1500 0 707206595 0 30631 30631 272991646 0 0 0 BMmRU
    bond1020 1500 0 840536193 0 30208 30208 917547806 0 0 0 BMmRU
    eth1-01 1500 0 8391266 0 0 0 7900558 0 0 0 BMRU
    eth1-02 1500 0 3125569 0 0 0 3439995 0 0 0 BMRU
    eth1-03 1500 0 415699799 0 9633 9633 529833113 0 0 0 BMsRU
    eth1-04 1500 0 424836492 0 20575 20575 387714738 0 0 0 BMsRU
    eth1-05 1500 0 370501427 0 20848 20848 141704219 0 0 0 BMsRU
    eth1-06 1500 0 336705283 0 9783 9783 131287520 0 0 0 BMsRU
    eth1-07 1500 0 4814660 0 0 0 7135935 0 0 0 BMRU
    eth1-07.77 1500 0 140833 0 0 0 253486 0 0 0 BMRU
    eth1-07.1401 1500 0 4671436 0 0 0 6882432 0 0 0 BMRU
    eth2-01 1500 0 1125899034 0 76492 76492 744094831 0 0 0 BMRU
    eth2-03 1500 0 158541450 0 0 0 142103980 0 0 0 BMRU
    eth2-04 1500 0 2233127 0 0 0 2338893 0 0 0 BMRU
    lo 16436 0 8789081 0 0 0 8789081 0 0 0 LRU


    ### VRRP STANDBY

    Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
    Mgmt 1500 0 419779 0 0 0 64583 0 0 0 BMRU
    Sync 1500 0 986749821 0 10166 0 13126952 0 0 0 BMRU
    bond57 1500 0 3942827 0 0 0 164314643 0 0 0 BMmRU
    bond1020 1500 0 10974665 0 0 0 529804085 0 0 0 BMmRU
    eth1-01 1500 0 288161 0 0 0 187952 0 0 0 BMRU
    eth1-02 1500 0 551027 0 0 0 222817 0 0 0 BMRU
    eth1-03 1500 0 3719492 0 0 0 192491576 0 0 0 BMsRU
    eth1-04 1500 0 7255173 0 0 0 337312539 0 0 0 BMsRU
    eth1-05 1500 0 2493709 0 0 0 58920576 0 0 0 BMsRU
    eth1-06 1500 0 1449118 0 0 0 105394118 0 0 0 BMsRU
    eth1-07 1500 0 372127 0 0 0 47769 0 0 0 BMRU
    eth1-07.77 1500 0 152100 0 0 0 5639 0 0 0 BMRU
    eth1-07.1401 1500 0 219987 0 0 0 42130 0 0 0 BMRU
    eth2-01 1500 0 6674366 0 114 114 28062699 0 0 0 BMRU
    eth2-03 1500 0 1370175 0 0 0 1256169 0 0 0 BMRU
    eth2-04 1500 0 206785 0 0 0 39320 0 0 0 BMRU
    lo 16436 0 847780 0 0 0 847780 0 0 0 LRU




    After further investigation, the OS message files show this during a policy install:

    ### VRRP MASTER

    Feb 23 13:03:31 2018 FW-CL1 kernel: [fw4_1];fwioctl: Policy has started. Extending dead timeouts
    Feb 23 13:03:31 2018 FW-CL1 kernel: [fw4_1];FW-1: [cul_policy_freeze][CUL - Member] fwha_cul_policy_freeze_state_change: set Policy Freeze [ON], FREEZING state machine at ACTIVE (time=1378814, caller=fwioctl: FWHA_CUL_POLICY_STATE_FREEZE, freeze_timeout=300, freeze_event_timeout=150)
    Feb 23 13:03:31 2018 FW-CL1 kernel: [fw4_1];fwha_hp_periodic_run: Policy has ended 120 seconds ago. Returning to regular timeouts
    Feb 23 13:03:38 2018 FW-CL1 kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 1 has changed from 1 to 0
    Feb 23 13:03:42 2018 FW-CL1 kernel: [fw4_0];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:42 2018 FW-CL1 kernel: [fw4_1];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:42 2018 FW-CL1 kernel: [fw4_2];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:43 2018 FW-CL1 kernel: [fw4_3];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:43 2018 FW-CL1 kernel: [fw4_4];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:43 2018 FW-CL1 kernel: [fw4_5];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:43 2018 FW-CL1 kernel: [fw4_6];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:44 2018 FW-CL1 kernel: [fw4_7];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:44 2018 FW-CL1 kernel: [fw4_0];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:45 2018 FW-CL1 kernel: [fw4_0];FW-1: Warning: The eth1-02 interface is not protected by the anti-spoofing feature.
    Feb 23 13:03:46 2018 FW-CL1 kernel: [fw4_0]; Your network may be at risk. In the future, it is recommended that you
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_0]; define anti-spoofing protection before installing the Security Policy.
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_0];FW-1: Warning: The Sync interface is not protected by the anti-spoofing feature.
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_0]; Your network may be at risk. In the future, it is recommended that you
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_0]; define anti-spoofing protection before installing the Security Policy.
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_0];FW-1: Warning: The Mgmt interface is not protected by the anti-spoofing feature.
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_0]; Your network may be at risk. In the future, it is recommended that you
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_0]; define anti-spoofing protection before installing the Security Policy.
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(0) to FAILURE
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_1];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_2];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_3];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_4];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_0];State synchronization on member 1 is in risk. Please examine your synchronization network to avoid further problems !
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_5];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(0) to ACTIVE
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_6];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(0) to FAILURE
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_7];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(0) to ACTIVE
    Feb 23 13:03:47 2018 FW-CL1 kernel: [fw4_0];FW-1: SIM (SecureXL Implementation Module) SecureXL device detected.
    Feb 23 13:03:48 2018 FW-CL1 kernel: [fw4_1];fwioctl: Policy has ended. Continuing extending dead timouts (fwha_cul_policy_done_time=1378963)
    Feb 23 13:03:49 2018 FW-CL1 kernel: [fw4_1];FW-1: [CUL - Member] Policy Freeze mechanism disabled, Enabling state machine at 4 (time=1378963, caller=fwioctl: FWHA_CUL_POLICY_STATE_FREEZE)
    Feb 23 13:03:55 2018 FW-CL1 kernel: [fw4_1]; Sync
    Feb 23 13:03:55 2018 FW-CL1 kernel: [fw4_1];Stopping ClusterXL
    Feb 23 13:03:55 2018 FW-CL1 kernel: [fw4_1];Starting ClusterXL
    Feb 23 13:03:55 2018 FW-CL1 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(0) to ACTIVE


    ### VRRP STANDBY


    Feb 23 13:03:28 2018 FW-CL2 kernel: [fw4_1];fwioctl: Policy has started. Extending dead timeouts
    Feb 23 13:03:28 2018 FW-CL2 kernel: [fw4_1];FW-1: [cul_policy_freeze][CUL - Member] fwha_cul_policy_freeze_state_change: set Policy Freeze [ON], FREEZING state machine at ACTIVE (time=1374159, caller=fwioctl: FWHA_CUL_POLICY_STATE_FREEZE, freeze_timeout=300, freeze_event_timeout=150)
    Feb 23 13:03:28 2018 FW-CL2 kernel: [fw4_1];fwha_hp_periodic_run: Policy has ended 120 seconds ago. Returning to regular timeouts
    Feb 23 13:03:31 2018 FW-CL2 kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 0 has changed from 0 to 1
    Feb 23 13:03:35 2018 FW-CL2 kernel: [fw4_0];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:35 2018 FW-CL2 kernel: [fw4_1];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:36 2018 FW-CL2 kernel: [fw4_2];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:36 2018 FW-CL2 kernel: [fw4_3];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:36 2018 FW-CL2 kernel: [fw4_4];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:36 2018 FW-CL2 kernel: [fw4_5];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_6];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_7];fw_kmalloc_impl: alloc_ranges: allocates 0 bytes
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_0];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_0];FW-1: Warning: The Sync interface is not protected by the anti-spoofing feature.
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_0]; Your network may be at risk. In the future, it is recommended that you
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_0]; define anti-spoofing protection before installing the Security Policy.
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_0];FW-1: Warning: The Mgmt interface is not protected by the anti-spoofing feature.
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_0]; Your network may be at risk. In the future, it is recommended that you
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_0]; define anti-spoofing protection before installing the Security Policy.
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_0];FW-1: Warning: The eth1-02 interface is not protected by the anti-spoofing feature.
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_0]; Your network may be at risk. In the future, it is recommended that you
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_0]; define anti-spoofing protection before installing the Security Policy.
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_1];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_2];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to FAILURE
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_3];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_4];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_5];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to ACTIVE
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_6];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_7];fw_kmalloc_impl: b_replace: allocates 0 bytes
    Feb 23 13:03:37 2018 FW-CL2 kernel: [fw4_0];FW-1: SIM (SecureXL Implementation Module) SecureXL device detected.
    Feb 23 13:03:38 2018 FW-CL2 kernel: [fw4_1];fwioctl: Policy has ended. Continuing extending dead timouts (fwha_cul_policy_done_time=1374253)
    Feb 23 13:03:39 2018 FW-CL2 kernel: [fw4_1];FW-1: [CUL - Member] Policy Freeze mechanism disabled, Enabling state machine at 4 (time=1374253, caller=fwioctl: FWHA_CUL_POLICY_STATE_FREEZE)
    Feb 23 13:03:43 2018 FW-CL2 kernel: [fw4_1]; Sync
    Feb 23 13:03:45 2018 FW-CL2 clish[14742]: user logged from admin
    Feb 23 13:03:45 2018 FW-CL2 xpand[12628]: admin localhost t +volatile:clish:admin:14742 t
    Feb 23 13:03:45 2018 FW-CL2 clish[14742]: User admin logged in with ReadWrite permission
    Feb 23 13:03:45 2018 FW-CL2 clish[14742]: cmd by admin: Start executing : show interfaces ... (cmd md5: 50efb6e261b20cb2200ce9fe0fa3a6d5)
    Feb 23 13:03:45 2018 FW-CL2 clish[14742]: cmd by admin: Processing : show interfaces all (cmd md5: 50efb6e261b20cb2200ce9fe0fa3a6d5)
    Feb 23 13:03:45 2018 FW-CL2 xpand[12628]: admin localhost t -volatile:clish:admin:14742
    Feb 23 13:03:45 2018 FW-CL2 clish[14742]: User admin logged out from CLI shell
    Feb 23 13:03:46 2018 FW-CL2 kernel: [fw4_0];FW-1: State synchronization is in risk. Please examine your synchronization network to avoid further problems !
    Feb 23 13:03:46 2018 FW-CL2 kernel: [fw4_0];FW-1: Please refer to documentation for details on this issue. Any change must be applied to ALL cluster members
    Feb 23 13:03:46 2018 FW-CL2 kernel: [fw4_4];FW-1: fwldbcast_recv: delta sync connection with member 0 was lost and regained.4252 updates were lost.
    Feb 23 13:03:46 2018 FW-CL2 kernel: [fw4_4];FW-1: fwldbcast_recv: received sequence 0xe2dee8 (fragm 0, index 1), last processed seq 0xe2ce4b
    Feb 23 13:03:48 2018 FW-CL2 kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 0 has changed from 1 to 0
    Feb 23 13:03:55 2018 FW-CL2 kernel: [fw4_1];Stopping ClusterXL
    Feb 23 13:03:55 2018 FW-CL2 kernel: [fw4_1];Starting ClusterXL
    Feb 23 13:03:55 2018 FW-CL2 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to ACTIVE


    I'm aware that anti-spoofing is currently not configured on the Sync / Mgmt / eth1-02 interfaces; I'll be addressing this at a later date. From my research and limited understanding of the inner workings, it looks like the ClusterXL state is flapping during a policy install. I hate to assume, but can I assume this is not correct behavior? I know HA is being controlled by our VRRP solution, so I would be surprised if a ClusterXL state change caused problems when using VRRP; however, could this "flapping" of the ClusterXL state cause the RX-DRP / RX-OVR counters to increment? My research also led me to implement the ClusterXL freeze state mechanism (fwha_freeze_state_machine_timeout), although I'm unsure whether this is even taken into account when VRRP is the HA control mechanism. The next step, which I wanted to refrain from taking, is to increase the ring buffer size on the NICs that experience the overruns / RX-DRPs. Is this advisable? The receive ring buffer size is currently at the default (256) on all interfaces.

    The other correlation I have discovered is a spike in CPU utilization during a policy push, specifically from the "fw_full" process. So as far as I can determine, the issue could stem from a sync problem during the policy push and/or high CPU utilization. Please find additional information about the cluster below:


    cphaprob state

    ### VRRP MASTER

    Cluster Mode: Sync only (OPSEC) with IGMP Membership

    Number Unique Address Firewall State (*)

    1 (local) 192.168.222.1 Active
    2 192.168.222.2 Active

    (*) FW-1 monitors only the sync operation and the security policy
    Use OPSEC's monitoring tool to get the cluster status

    ### VRRP STANDBY

    Cluster Mode: Sync only (OPSEC) with IGMP Membership

    Number Unique Address Firewall State (*)

    1 192.168.222.1 Active
    2 (local) 192.168.222.2 Active

    (*) FW-1 monitors only the sync operation and the security policy
    Use OPSEC's monitoring tool to get the cluster status


    According to Check Point, this is the correct status for 3rd-party cluster solutions (https://supportcenter.checkpoint.com...tionid=sk39676)


    fw ctl pstat

    ### VRRP MASTER

    System Capacity Summary:
    Memory used: 48% (2015 MB out of 4163 MB) - below watermark
    Concurrent Connections: 54106 (Unlimited)
    Aggressive Aging is not active

    Hash kernel memory (hmem) statistics:
    Total memory allocated: 1264254976 bytes in 308656 (4096 bytes) blocks using 4 pools
    Initial memory allocated: 436207616 bytes (Hash memory extended by 828047360 bytes)
    Memory allocation limit: 3491758080 bytes using 512 pools
    Total memory bytes used: 742207176 unused: 522047800 (41.29%) peak: 1071841844
    Total memory blocks used: 209386 unused: 99270 (32%) peak: 281346
    Allocations: 2102340433 alloc, 0 failed alloc, 2094256471 free

    System kernel memory (smem) statistics:
    Total memory bytes used: 2155698788 peak: 2240126236
    Total memory bytes wasted: 89962854
    Blocking memory bytes used: 105275976 peak: 128005064
    Non-Blocking memory bytes used: 2050422812 peak: 2112121172
    Allocations: 24443742 alloc, 0 failed alloc, 24413901 free, 0 failed free
    vmalloc bytes used: 2027540816 expensive: no

    Kernel memory (kmem) statistics:
    Total memory bytes used: 1601178360 peak: 1922008788
    Allocations: 2126767822 alloc, 0 failed alloc
    2118657383 free, 0 failed free
    External Allocations: 17838848 for packets, 160360423 for SXL

    Cookies:
    992275190 total, 150985 alloc, 150985 free,
    8217186 dup, 394997496 get, 142318468 put,
    1811398012 len, 3253938 cached len, 0 chain alloc,
    0 chain free

    Connections:
    27182658 total, 20876916 TCP, 6109542 UDP, 83614 ICMP,
    112586 other, 2029 anticipated, 0 recovered, 54125 concurrent,
    126967 peak concurrent

    Fragments:
    3226877 fragments, 1594602 packets, 83 expired, 0 short,
    0 large, 2 duplicates, 181 failures

    NAT:
    55412975/0 forw, 97115755/0 bckw, 94661131 tcpudp,
    1020720 icmp, 60203600-21080316 alloc

    Sync:
    Version: new
    Status: Able to Send/Receive sync packets
    Sync packets sent:
    total : 917259805, retransmitted : 9972, retrans reqs : 0, acks : 363
    Sync packets received:
    total : 1590824, were queued : 194372, dropped by net : 0
    retrans reqs : 2480, received 21898 acks
    retrans reqs for illegal seq : 0
    dropped updates as a result of sync overload: 0



    ### VRRP STANDBY

    System Capacity Summary:
    Memory used: 29% (1215 MB out of 4163 MB) - below watermark
    Concurrent Connections: 5860 (Unlimited)
    Aggressive Aging is not active

    Hash kernel memory (hmem) statistics:
    Total memory allocated: 741761024 bytes in 181094 (4096 bytes) blocks using 2 pools
    Initial memory allocated: 436207616 bytes (Hash memory extended by 305553408 bytes)
    Memory allocation limit: 3491758080 bytes using 512 pools
    Total memory bytes used: 281705088 unused: 460055936 (62.02%) peak: 631434236
    Total memory blocks used: 78690 unused: 102404 (56%) peak: 159870
    Allocations: 2715118592 alloc, 0 failed alloc, 2711749724 free

    System kernel memory (smem) statistics:
    Total memory bytes used: 1529839500 peak: 1644712308
    Total memory bytes wasted: 11533905
    Blocking memory bytes used: 14971448 peak: 30916668
    Non-Blocking memory bytes used: 1514868052 peak: 1613795640
    Allocations: 217612513 alloc, 0 failed alloc, 217605344 free, 0 failed free
    vmalloc bytes used: 1500190968 expensive: no

    Kernel memory (kmem) statistics:
    Total memory bytes used: 1059305020 peak: 1442298112
    Allocations: 2932714274 alloc, 0 failed alloc
    2929341790 free, 0 failed free
    External Allocations: 4352 for packets, 82607945 for SXL

    Cookies:
    2244018020 total, 0 alloc, 0 free,
    58859 dup, 4154121895 get, 1317659282 put,
    3565343498 len, 77867 cached len, 0 chain alloc,
    0 chain free

    Connections:
    267670 total, 151879 TCP, 69548 UDP, 2182 ICMP,
    44061 other, 0 anticipated, 0 recovered, 5860 concurrent,
    121049 peak concurrent

    Fragments:
    93818 fragments, 46343 packets, 226 expired, 0 short,
    0 large, 0 duplicates, 0 failures

    NAT:
    39699249/0 forw, 624255/0 bckw, 627098 tcpudp,
    11244 icmp, 606322-1231442 alloc

    Sync:
    Version: new
    Status: Able to Send/Receive sync packets
    Sync packets sent:
    total : 1681794, retransmitted : 0, retrans reqs : 2480, acks : 21877
    Sync packets received:
    total : 700438138, were queued : 434704951, dropped by net : 9835
    retrans reqs : 0, received 339 acks
    retrans reqs for illegal seq : 0
    dropped updates as a result of sync overload: 2925



    fw ctl affinity -l -r -v

    ### VRRP MASTER

    CPU 0: eth1-05 (irq 186) eth1-01 (irq 59) eth2-03 (irq 155)
    CPU 1: eth1-06 (irq 202) Sync (irq 203) Mgmt (irq 219)
    CPU 2: eth1-02 (irq 75) eth1-03 (irq 91) eth1-04 (irq 107)
    CPU 3: eth1-07 (irq 218) eth2-01 (irq 123) eth2-04 (irq 171)
    CPU 4: fw_7
    CPU 5: fw_6
    CPU 6: fw_5
    CPU 7: fw_4
    CPU 8: fw_3
    CPU 9: fw_2
    CPU 10: fw_1
    CPU 11: fw_0
    All: mpdaemon rad in.geod lpd fwd usrchkd vpnd cprid cpd


    ### VRRP STANDBY

    CPU 0: eth1-01 (irq 75) eth1-02 (irq 91) eth1-04 (irq 123)
    CPU 1: eth1-03 (irq 107) eth2-01 (irq 139) eth2-04 (irq 187)
    CPU 2: eth1-05 (irq 202) eth1-06 (irq 218) eth1-07 (irq 234)
    CPU 3: Sync (irq 203) Mgmt (irq 219) eth2-03 (irq 171)
    CPU 4: fw_7
    CPU 5: fw_6
    CPU 6: fw_5
    CPU 7: fw_4
    CPU 8: fw_3
    CPU 9: fw_2
    CPU 10: fw_1
    CPU 11: fw_0
    All: rad fwd mpdaemon usrchkd in.geod vpnd cprid cpd



    fw ctl multik get_mode


    ### VRRP MASTER

    Current mode is On


    ### VRRP STANDBY

    Current mode is On



    Any help / advice would be greatly appreciated. Please let me know if you require any additional information.


    Cheers,

    Jon

  2. #2
    Join Date
    2009-04-30
    Location
    Colorado, USA
    Posts
    2,224
    Rep Power
    13

    Default Re: RX-DRP / RX-OVR (FIFO Errors) / ClusterXL State change during policy install

    Quote Originally Posted by griff0001 View Post
    I'm aware I currently have no anti-spoof on the sync / mgmt / eth1-02 interfaces, I'll be addressing this at a later date. From my research and limited understanding of the inner workings, it looks like the ClusterXL state is flapping during a policy install. I hate to assume, but can I assume this is not the correct behavior? I know that HA is being controlled by our VRRP solution; so I would be surprised if the ClusterXL state change when utilizing VRRP would cause problems, however could this "flapping" of ClusterXL state cause RX-DRP / RX-OVR to occur.
    My research also led me to implement the ClusterXL Freeze state mechanism (fwha_freeze_state_machine_timeout), however I'm unsure if this is even taken into account when utilizing VRRP as the HA control mechanism. My next step which I wanted to refrain from doing is to increase the ring buffer size on the NICs which experience the overrun / RX-DRPs...is this advisable? The receive ring buffer size is currently default on all interfaces (256).
    Not directly, no. Since VRRP is being used, all ClusterXL is dealing with for the most part is state synchronization and reporting the firewall code's status to VRRP. A flap in ClusterXL (really just state sync) is not that big of a deal with VRRP; the Cluster Under Load (CUL) messages may look a little distressing as well, but they are normal in R77.30 and later.

    The other correlation I have discovered (which does happen during a policy push) is the spike in CPU utilization specifically with the "FW_Full" process. So as far as I can determine, I think the issue could stem from either a sync problem during policy push and / or high CPU utilization. Please find additional information regarding the cluster below:
    All your other posted commands indicate that the cluster is operating as it should. To some degree, RX-DRPs are expected when pushing policy on a busy firewall due to the high CPU utilization that occurs across all CPUs at that time, combined with all traffic being temporarily queued during the atomic load of the new policy into the kernel. Once that atomic load finishes, a CPU-intensive rematch of all open connections against the newly installed policy has to be conducted before traffic can begin flowing again. With all these intensive events occurring, it is entirely possible that the interface ring buffers will fill up before the SND/IRQ cores can empty them, and some RX-DRPs will occur. Typically you would not see RX-OVRs accumulate as well during this period, unless one of the following is occurring:

    1) Some NIC hardware/drivers report RX-DRP and RX-OVR as the same value, so both these counters are always incremented together and show the exact same value at all times.

    2) Some of the newer NICs/drivers in some situations seem to have the ability to detect that the ring buffer is full and instead of just dropping the frame (thus incrementing RX-DRP), the NIC holds onto the frame and tries again later. This "holding on" to the frame requires buffer space in the NIC hardware itself, and if this goes on for long enough the NIC hardware buffer eventually fills up and incoming frames from the wire then begin to be lost (which is exactly what an RX-OVR condition is).
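    The per-driver counters behind those aggregate netstat columns can be pulled with `ethtool -S`. The counter names below (rx_no_buffer_count, rx_missed_errors, rx_fifo_errors) are what Intel igb/ixgbe-class drivers typically expose and may differ on other NICs; the values are sample figures for illustration, not from this cluster:

    ```shell
    # Sketch: filter the buffer/FIFO-related counters from a saved
    # `ethtool -S eth1-05` capture (sample values for illustration only).
    cat > /tmp/ethtool.sample <<'EOF'
         rx_packets: 370501427
         rx_no_buffer_count: 18211
         rx_missed_errors: 2637
         rx_fifo_errors: 2637
    EOF
    grep -E 'rx_(no_buffer_count|missed_errors|fifo_errors)' /tmp/ethtool.sample
    ```

    Watching which of these increments during a policy push tells you whether the drops are happening in the ring buffer (no_buffer) or in the NIC hardware FIFO itself (missed/fifo).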

    So for your specific situation I'd recommend the following:

    1) Try setting "Connection Persistence" on the gateway object to "Keep all connections" instead of "Rematch connections"; this will significantly reduce the duration of high CPU load during policy installs and perhaps help your firewall avoid RX-DRPs completely. Keep in mind, though, that existing connections prohibited by a newly installed policy will still be allowed to continue and will not be killed immediately if you select this option.

    2) Check the load of CPUs 4-11 with top (press 1 to show the individual CPUs). If they are consistently running with at least 50% idle during your busiest period, try reducing the number of kernel instances (firewall workers) by 1 or 2 to allocate more SND/IRQ resources; doing so may help spread the SoftIRQ processing across more SND/IRQ cores and help avoid RX-DRPs.

    3) Finally, assuming that you only accumulate RX-DRPs during policy load and very rarely at any other time, this is one of the VERY LIMITED SITUATIONS in which increasing the interface ring buffer size is recommended and will probably help the firewall make it through a policy load without having any RX-DRPs due to the extra buffering.
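    If you do go that route, check the current and hardware-maximum ring sizes first and apply the same value on both members. The commands below are a sketch from memory; the exact Gaia clish syntax varies by release, so verify it against your version before use:

    ```shell
    # Check current vs. hardware-maximum ring sizes (expert mode):
    ethtool -g eth1-05

    # Temporarily increase the receive ring on one slave interface (expert mode,
    # does not survive reboot):
    ethtool -G eth1-05 rx 1024

    # Or persistently via Gaia clish (syntax may vary by release):
    set interface eth1-05 rx-ringsize 1024
    save config
    ```

    Increase incrementally rather than jumping straight to the hardware maximum, since larger rings add buffering latency.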
    --
    Second Edition of my "Max Power" Firewall Book
    Now Available at http://www.maxpowerfirewalls.com

  3. #3
    Join Date
    2017-08-17
    Posts
    4
    Rep Power
    0

    Default Re: RX-DRP / RX-OVR (FIFO Errors) / ClusterXL State change during policy install

    Many thanks for the response. I know I had a lot of questions there. I also need to say that your book is top class!

    I can confirm that CPUs 4-11 rarely dip below 80% idle (even during peak I will occasionally see one or two drop to 60% for a split second). I can also confirm under "Connection Persistence" that we have "keep connections" selected.
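    For anyone checking the same thing later: per-core idle can also be read non-interactively from /proc/stat, which is handy for capturing the figure during a policy push. A rough sketch (the numbers are cumulative since boot, so take two samples a few seconds apart for a live figure):

    ```shell
    # Rough sketch: per-CPU idle percentage since boot, from /proc/stat.
    # Fields 2-8 are user, nice, system, idle, iowait, irq, softirq;
    # field 5 (idle) over the total gives the idle fraction.
    awk '/^cpu[0-9]/ {
        total = 0
        for (i = 2; i <= 8; i++) total += $i
        printf "%s %.1f%% idle\n", $1, ($5 * 100) / total
    }' /proc/stat
    ```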

    I'll initially try a 6/6 split for SecureXL/CoreXL instances. If I still see the overruns during a policy install at that point, I'll increase the RX ring buffer size on the relevant interfaces incrementally.

    It will probably be a few days before I can implement the changes, however I'll update this post asap.

    Many thanks again for the advice.

    Cheers,

    Jon

  4. #4
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,623
    Rep Power
    9

    Default Re: RX-DRP / RX-OVR (FIFO Errors) / ClusterXL State change during policy install

    Just a few things I noticed. Memory usage looks a little strange: roughly 700 MB higher on the active node. Not sure if you noticed, but peak connections were a bit high as well; that could be from choking during policy install. As noted, top output would be interesting.

  5. #5
    Join Date
    2017-08-17
    Posts
    4
    Rep Power
    0

    Default Re: RX-DRP / RX-OVR (FIFO Errors) / ClusterXL State change during policy install

    @jflemingeds Thanks for the input. Regarding your observation of the active member utilizing around 700 MB more than the standby: I'm not experienced enough to know if this is normal behavior in a VRRP cluster running the IPS, VPN, and URLF blades (we also implement close to 100 NAT rules). Regarding peak connections, our current settings allow the gateway to automatically manage concurrent connection limits and the "connection hash table sizes and memory pool". Although I have nothing to back up my point of view (and I may well be wrong), I would expect the active member to utilize the additional resources in this scenario. What are your thoughts? Please find the top output from both members below:

    ### VRRP ACTIVE MEMBER:

    top - 14:06:42 up 4 days, 15:42, 1 user, load average: 2.77, 2.19, 2.08
    Tasks: 232 total, 2 running, 230 sleeping, 0 stopped, 0 zombie
    Cpu(s): 0.2%us, 0.4%sy, 0.0%ni, 83.0%id, 0.0%wa, 0.4%hi, 16.0%si, 0.0%st
    Mem: 5971980k total, 5710648k used, 261332k free, 52408k buffers
    Swap: 10482404k total, 260k used, 10482144k free, 1071180k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    11713 admin 15 0 0 0 0 S 28 0.0 483:00.03 fw_worker_0
    11720 admin 15 0 0 0 0 S 19 0.0 368:17.32 fw_worker_7
    11715 admin 15 0 0 0 0 S 19 0.0 349:24.30 fw_worker_2
    11717 admin 15 0 0 0 0 S 18 0.0 333:57.31 fw_worker_4
    11714 admin 15 0 0 0 0 S 17 0.0 333:25.76 fw_worker_1
    11718 admin 15 0 0 0 0 R 16 0.0 324:31.41 fw_worker_5
    11716 admin 15 0 0 0 0 S 15 0.0 324:47.63 fw_worker_3
    11719 admin 15 0 0 0 0 S 14 0.0 304:20.86 fw_worker_6
    13227 admin 15 0 683m 256m 26m S 2 4.4 149:15.17 fw_full
    7251 nobody 15 0 27832 10m 4528 S 1 0.2 0:18.94 httpd
    13719 admin 15 0 210m 193m 9m S 1 3.3 10:18.93 rad
    13967 admin 15 0 435m 262m 14m S 1 4.5 27:25.86 usrchkd
    12608 admin 15 0 2876 1556 1304 S 0 0.0 5:23.13 netflowd
    12609 admin 15 0 23736 9544 6680 S 0 0.2 4:38.29 snmpd
    13223 admin 15 0 34172 10m 8496 S 0 0.2 7:51.60 routed
    13970 admin 15 0 52624 30m 9364 S 0 0.5 0:46.60 wstlsd
    19628 admin 15 0 5088 1132 772 S 0 0.0 1:06.43 pkxld
    28678 admin 15 0 2244 1184 832 R 0 0.0 0:00.59 top
    1 admin 15 0 1976 724 624 S 0 0.0 0:01.96 init
    2 admin RT -5 0 0 0 S 0 0.0 0:05.28 migration/0
    3 admin 15 0 0 0 0 S 0 0.0 0:00.06 ksoftirqd/0
    4 admin RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/0
    5 admin RT -5 0 0 0 S 0 0.0 0:02.19 migration/1
    6 admin 15 0 0 0 0 S 0 0.0 0:00.09 ksoftirqd/1
    7 admin RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/1
    8 admin RT -5 0 0 0 S 0 0.0 0:01.74 migration/2
    9 admin 15 0 0 0 0 S 0 0.0 0:00.05 ksoftirqd/2
    10 admin RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/2
    11 admin RT -5 0 0 0 S 0 0.0 0:01.62 migration/3
    12 admin 15 0 0 0 0 S 0 0.0 0:00.07 ksoftirqd/3
    13 admin RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/3

    ### VRRP STANDBY MEMBER

    top - 14:07:07 up 4 days, 15:32, 1 user, load average: 0.35, 0.48, 0.51
    Tasks: 230 total, 1 running, 229 sleeping, 0 stopped, 0 zombie
    Cpu(s): 0.0%us, 0.1%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.1%hi, 1.9%si, 0.0%st
    Mem: 5971980k total, 5302208k used, 669772k free, 305276k buffers
    Swap: 10482404k total, 236k used, 10482168k free, 1496836k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    11796 admin 15 0 0 0 0 S 4 0.0 101:34.49 fw_worker_7
    11792 admin 15 0 0 0 0 S 3 0.0 104:17.27 fw_worker_3
    11790 admin 15 0 0 0 0 S 3 0.0 95:08.93 fw_worker_1
    11789 admin 15 0 0 0 0 S 2 0.0 71:20.86 fw_worker_0
    11794 admin 15 0 0 0 0 S 2 0.0 62:34.92 fw_worker_5
    11795 admin 15 0 0 0 0 S 2 0.0 34:39.59 fw_worker_6
    11791 admin 15 0 0 0 0 S 2 0.0 112:18.29 fw_worker_2
    11793 admin 15 0 0 0 0 S 2 0.0 98:57.98 fw_worker_4
    12692 admin 15 0 149m 37m 26m S 0 0.6 5:13.54 snmpd
    1 admin 15 0 1972 720 624 S 0 0.0 0:01.97 init
    2 admin RT -5 0 0 0 S 0 0.0 0:05.75 migration/0
    3 admin 15 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/0
    4 admin RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/0
    5 admin RT -5 0 0 0 S 0 0.0 0:01.67 migration/1
    6 admin 15 0 0 0 0 S 0 0.0 0:00.03 ksoftirqd/1
    7 admin RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/1
    8 admin RT -5 0 0 0 S 0 0.0 0:01.16 migration/2
    9 admin 15 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/2
    10 admin RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/2
    11 admin RT -5 0 0 0 S 0 0.0 0:00.98 migration/3
    12 admin 15 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/3
    13 admin RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/3
    14 admin RT -5 0 0 0 S 0 0.0 0:01.63 migration/4
    15 admin 15 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/4
    16 admin RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/4
    17 admin RT -5 0 0 0 S 0 0.0 0:01.37 migration/5
    18 admin 15 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/5
    19 admin RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/5
    20 admin RT -5 0 0 0 S 0 0.0 0:05.62 migration/6
    21 admin 15 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/6
    22 admin RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/6


    Cheers,

    Jon

  6. #6
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,623
    Rep Power
    9

    Default Re: RX-DRP / RX-OVR (FIFO Errors) / ClusterXL State change during policy install

    Quote Originally Posted by griff0001 View Post
@jflemingeds Thanks for the input. Regarding your observation of the active member using roughly 700 MB more than the standby, I'm not experienced enough to know whether this is normal behaviour in a VRRP cluster running the IPS, VPN and URLF blades (we also have close to 100 NAT rules). Regarding peak connections, our current settings allow the gateway to automatically manage the concurrent connection limit and the "connection hash table sizes and memory pool". Although I have nothing to back up my point of view (and I may well be wrong), I would expect the active member to consume the additional resources in this scenario. What are your thoughts? Please find the top output from both members below:

[top output from both members snipped — duplicated verbatim from the previous post]
I don't see anything major based on the top output. It's a little strange that snmpd is using so much more memory on the standby, but otherwise it seems fine.
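For a quick apples-to-apples comparison of the two members, the fw_worker lines can be totalled from a batch-mode top snapshot. A minimal sketch — the sample lines below reuse figures from the active member's output above, and the field positions assume the top layout shown in this thread (%CPU in column 9, COMMAND in column 12); in practice you would pipe in `top -b -n 1` from each gateway:

```shell
#!/bin/sh
# Sum fw_worker %CPU from a top snapshot. Sample lines reuse figures from
# the active member's output above; feed real `top -b -n 1` output in practice.
sample='11713 admin 15 0 0 0 0 S 28 0.0 483:00.03 fw_worker_0
11720 admin 15 0 0 0 0 S 19 0.0 368:17.32 fw_worker_7
13227 admin 15 0 683m 256m 26m S 2 4.4 149:15.17 fw_full'

printf '%s\n' "$sample" | awk '
    $12 ~ /^fw_worker/ { total += $9 }               # only CoreXL worker instances
    END { printf "fw_worker total: %d%%\n", total }' # 28 + 19 = 47% for the sample
```

Running the same one-liner on both members at the same moment gives a rough feel for how unevenly the worker load is split between active and standby.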

  7. #7
    Join Date
    2017-08-17
    Posts
    4
    Rep Power
    0

    Default Re: RX-DRP / RX-OVR (FIFO Errors) / ClusterXL State change during policy install

I thought I would share an update, as it may help anyone who comes across this problem in the future.

After changing the CPU core allocation for SecureXL / CoreXL instances to 6/6 and increasing the RX ring buffer to 1024, I have seen a massive decrease in overruns; however, a few hundred still occur during a policy push. For clarity, please see the progress below:

- No bonded interfaces, default RX buffer, default CPU core allocation for SecureXL / CoreXL instances
  RESULT: tens of thousands of overruns

- Bonded interfaces, default RX buffer, CPU core allocation for SecureXL / CoreXL instances of 4/8 respectively
  RESULT: a few thousand overruns

- Bonded interfaces, RX buffer increased to 1024
  RESULT: hundreds of overruns


As you can see, it's steady progress, and I think I'm getting there. Next I'll increase the RX buffer to 2048, which I anticipate will eliminate the overruns altogether.
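To judge whether the remaining few hundred overruns are actually material, it helps to express RX-DRP and RX-OVR as a percentage of RX-OK, which is the ratio the optimization guide evaluates (the usual rule of thumb is to keep RX-DRP under roughly 0.1%). A sketch with made-up counters — in practice feed it `netstat -ni` output from the gateway, adjusting the awk field numbers to your netstat's column layout:

```shell
#!/bin/sh
# Express RX-DRP and RX-OVR as a percentage of RX-OK per interface.
# Counters below are invented for illustration; real input would come from
# `netstat -ni` (check your column layout -- older net-tools adds a Met field).
netstat_sample='Iface MTU RX-OK RX-ERR RX-DRP RX-OVR
bond57 1500 48211340 0 310 295
bond1020 1500 9120034 0 12 9'

printf '%s\n' "$netstat_sample" | awk 'NR > 1 {
    printf "%-10s RX-DRP %.4f%%  RX-OVR %.4f%%\n", $1, $5/$3*100, $6/$3*100
}'
```

Anything comfortably below 0.1% is generally considered noise; sustained percentages above that are worth chasing with larger ring buffers or Multi-Queue.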

    @ShadowPeak.com
    @jflemingeds

    Thanks for taking the time to reply to my post.

    Cheers,

    Jon

  8. #8
    Join Date
    2009-04-30
    Location
    Colorado, USA
    Posts
    2,224
    Rep Power
    13

    Default Re: RX-DRP / RX-OVR (FIFO Errors) / ClusterXL State change during policy install

    Quote Originally Posted by griff0001 View Post
[quoted post snipped — duplicated verbatim from post #7 above]
Thanks for the update. You could also try enabling Multi-Queue on the problematic interfaces (not sure why I didn't mention that option before). However, if all the firewall's CPUs are heavily loaded during the policy load/rematch, spreading SoftIRQ processing for a single interface across multiple SND/IRQ cores via Multi-Queue probably won't help much.
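For reference, on the R77.x-era gateways discussed in this thread, Multi-Queue is inspected and configured with the cpmq tool. A sketch of the usual command sequence — verify the exact commands and supported NIC drivers against your version's Performance Tuning guide before relying on them:

```shell
# Show Multi-Queue status for all supported interfaces
cpmq get -a
# Launch the interactive wizard to enable/disable Multi-Queue per interface
cpmq set
# Afterwards, confirm how SND/IRQ cores are servicing the interfaces
sim affinity -l
fw ctl affinity -l
```

Note that a Multi-Queue change typically does not take effect until the firewall (or at least the affected interfaces) is restarted.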
    --
    Second Edition of my "Max Power" Firewall Book
    Now Available at http://www.maxpowerfirewalls.com

