CPUG: The Check Point User Group

Resources for the Check Point Community, by the Check Point Community.



Thread: Migrate ClusterXL from High Availability to Load Sharing

  1. #1
    Join Date
    2009-01-23
    Location
    France
    Posts
    31
    Rep Power
    0

    Migrate ClusterXL from High Availability to Load Sharing

    Hi everybody,

    We have two DL360 G5 boxes running ClusterXL on SPLAT R75.20 in High Availability mode. They have run for 3 years with no problems.
    This morning we had a big problem with the intranet NIC interface of the active member: a rate of 12k RX errors/s (it's a quad-port NIC).
    We failed over to the 2nd member, but the result was the same (12k RX errors/s). All traffic was slow and some services were unavailable because of this problem.
    There were no malformed packets, but yes, there was a lot of traffic, and I think the NIC interface was saturated (400 Mb/s). CPU and memory were fine during the incident.
    Support advised us to migrate to Load Sharing as a temporary fix.
    Can you advise me on how to migrate from High Availability to Load Sharing Multicast? (We already have licenses for Load Sharing.)

    Thanks a lot
    Last edited by gustave69; 2015-01-12 at 12:20.

  2. #2
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,648
    Rep Power
    9

    Re: Migrate ClusterXL from High Availability to Load Sharing

    I'm sure a lot of people are going to chime in, but in short: first, multicast Load Sharing is not easy to set up. It requires setting static ARP entries all over the place.

    Second, I don't see any way it could lessen the load on the NICs. In multicast Load Sharing every node gets each packet and then decides what to do with it. This means you would have 12k RX errors on both nodes.

    I think you're better off looking at the boxes. Do you know what kind of RX errors they are? ethtool -S $nic will show you. You can increase the buffer size on the NIC as a band-aid, but most of the time the root issue is too much traffic. The options are mostly the following:

    enable SecureXL (if not already)
    run fwaccel stat (these are SecureXL stats) and make sure templates are enabled all the way through your rule set (it will say which rule is disabling them)
    enable CoreXL (if you haven't already)
    set NIC affinity so your major NICs have a dedicated CPU
    use the remaining CPUs for CoreXL and maybe leave 1 CPU free for everything else
    check the rule set for the highest-hit rules and move them up the rule set

  3. #3
    Join Date
    2005-11-25
    Location
    United States, Southeast
    Posts
    857
    Rep Power
    14

    Re: Migrate ClusterXL from High Availability to Load Sharing

    Quote Originally Posted by gustave69
    Support advised us to migrate to Load Sharing as a temporary fix.

    Support is being an idiot. Do not follow the silly recommendation given to you. "Let's change the whole environment in response to one undiagnosed metric." Put that shotgun down, cowboy.

    Is that 12,000 errors a second or 12,000 bits a second in errors?

    What does 'ethtool -S <interface name>' say? For example: ethtool -S eth0
    That should break the RX errors out into more granular, specific counters.
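
    As a rough illustration of reading that output, the sketch below parses the "name: value" lines that ethtool -S prints and keeps only the RX counters that look like error or drop counters and are above zero. The sample text and its values are made up for the example, and the keyword list is an assumption about which counter names matter; real counter names vary by driver.

```python
def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S <nic>` output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip headers such as "NIC statistics:" that carry no number.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def nonzero_rx_problem_counters(stats):
    """Return RX counters that hint at errors or drops and are above zero."""
    # Assumed keyword list; adjust it for the driver in use.
    keywords = ("error", "missed", "no_buffer", "dropped", "failed")
    return {
        name: value
        for name, value in stats.items()
        if name.startswith("rx_") and value > 0
        and any(k in name for k in keywords)
    }


# Hypothetical sample, not taken from any real system.
sample = """NIC statistics:
rx_packets: 1000
rx_errors: 0
rx_crc_errors: 0
rx_no_buffer_count: 25
rx_missed_errors: 30
tx_errors: 0
"""
print(nonzero_rx_problem_counters(parse_ethtool_stats(sample)))
```

    Filtering out the zero counters makes it quicker to see whether the errors are CRC/frame problems (cabling, switch port) or buffer-related drops (the host not keeping up).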

    What does the switch say? Is it seeing a sudden burst in broadcast/multicast/unicast traffic? An ASIC gone bad? Do any other systems report issues?
    Connections table overrun? Perhaps the InfoSec team kicked off a scan during business hours.

    Did you push a new IPS update just before the issue? Update and push another.

    Hopefully you're collecting SNMP stats from the firewalls: NIC counters, connections table counters, accepted packets, dropped packets, etc.

    Analysis of the firewall logs can also indicate whether you had a sudden increase in traffic from a specific source, such as InfoSec's port scanner. Don't forget to check whether there is a local firewall log on the gateway.

    Keep in mind that the firewall is the canary. Don't blame the canary. Find the real problem that's killing the canary.
    Last edited by alienbaby; 2015-01-09 at 19:22.

  4. #4
    Join Date
    2009-04-30
    Location
    Colorado, USA
    Posts
    2,248
    Rep Power
    14

    Re: Migrate ClusterXL from High Availability to Load Sharing

    Load Sharing, ugh. Please post "ethtool -S" stats for the interface in question. Output from "sim affinity -l" would be helpful too.

  5. #5
    Join Date
    2006-09-26
    Posts
    3,190
    Rep Power
    16

    Re: Migrate ClusterXL from High Availability to Load Sharing

    Quote Originally Posted by ShadowPeak.com
    Load Sharing, ugh. Please post "ethtool -S" stats for the interface in question. Output from "sim affinity -l" would be helpful too.

    Do you have any spare interfaces that you could move the troubled network to? Is the NIC Intel or Broadcom?

    If not, is it possible to move off the troubled interface and share another existing interface via 802.1q?

  6. #6
    Join Date
    2009-01-23
    Location
    France
    Posts
    31
    Rep Power
    0

    Re: Migrate ClusterXL from High Availability to Load Sharing

    Thanks all for your responses.

    Our NIC is an Intel.
    The errors were 12,000 errors/s (discards).

    More information and statistics below:
    SecureXL is not installed
    CoreXL is not enabled

    ethtool -i eth0
    driver: e1000
    version: 7.6.12-NAPI
    firmware-version: 5.10-2
    bus-info: 0000:0d:00.0

    ethtool -g eth0
    Ring parameters for eth0:
    Pre-set maximums:
    RX: 4096
    RX Mini: 0
    RX Jumbo: 0
    TX: 4096
    Current hardware settings:
    RX: 4096
    RX Mini: 0
    RX Jumbo: 0
    TX: 4096

    ethtool -S eth0:
    NIC statistics:
    rx_packets: 1528049191
    tx_packets: 1328984250
    rx_bytes: 591319162994
    tx_bytes: 989088472500
    rx_broadcast: 23446128
    tx_broadcast: 12341
    rx_multicast: 37829411
    tx_multicast: 37877223
    rx_errors: 0
    tx_errors: 0
    tx_dropped: 0
    multicast: 37829411
    collisions: 0
    rx_length_errors: 0
    rx_over_errors: 0
    rx_crc_errors: 0
    rx_frame_errors: 0
    rx_no_buffer_count: 60911785
    rx_missed_errors: 64881229
    tx_aborted_errors: 0
    tx_carrier_errors: 0
    tx_fifo_errors: 0
    tx_heartbeat_errors: 0
    tx_window_errors: 0
    tx_abort_late_coll: 0
    tx_deferred_ok: 0
    tx_single_coll_ok: 0
    tx_multi_coll_ok: 0
    tx_timeout_count: 0
    tx_restart_queue: 2511
    rx_long_length_errors: 0
    rx_short_length_errors: 0
    rx_align_errors: 0
    tx_tcp_seg_good: 0
    tx_tcp_seg_failed: 0
    rx_flow_control_xon: 0
    rx_flow_control_xoff: 0
    tx_flow_control_xon: 0
    tx_flow_control_xoff: 0
    rx_long_byte_count: 591319162994
    rx_csum_offload_good: 0
    rx_csum_offload_errors: 0
    rx_header_split: 0
    alloc_rx_buff_failed: 0
    tx_smbus: 0
    rx_smbus: 0
    dropped_smbus: 0
    Last edited by gustave69; 2015-01-10 at 12:54.

  7. #7
    Join Date
    2009-04-30
    Location
    Colorado, USA
    Posts
    2,248
    Rep Power
    14

    Re: Migrate ClusterXL from High Availability to Load Sharing

    Quote Originally Posted by gustave69
    ethtool -S eth0:
    NIC statistics:
    rx_packets: 1528049191
    (snip)
    rx_no_buffer_count: 60911785
    rx_missed_errors: 64881229
    (snip)

    Yeah, I bet your performance isn't very good with 4-8% packet loss due to buffering misses. It looks like you have tried cranking the RX ring buffer to the maximum and it didn't help; no surprise there, as doing that almost never deals with the actual problem. You say the CPU is normal; that's impossible unless you are looking at the overall CPU average and not the individual cores. I can guarantee that if you run the top command and hit "1" while it is running, you will find that CPU 0 is getting smoked in si (SoftIRQ) processing and, to a lesser degree, hi (hardware interrupt) processing. CPU 0 is processing SoftIRQs for all interfaces because you have SecureXL off. Either you need to turn it on (and CoreXL too) so you can pick up automatic interface affinity to spread the IRQ processing around, or you need to set manual affinity for the busy interface(s) via fwaffinity.conf so they can get enough CPU.
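
    One plausible reading of the 4-8% figure is each of the two buffer-related counters taken as a fraction of rx_packets, then both added together. That derivation is an assumption, not something stated in the post, and the two counters may overlap or measure slightly different events, so treat the result as a rough estimate only. The counter values are copied from the ethtool -S output posted in this thread.

```python
# Counters copied from the `ethtool -S eth0` output posted in this thread.
rx_packets = 1528049191
rx_no_buffer_count = 60911785
rx_missed_errors = 64881229


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


no_buffer_pct = pct(rx_no_buffer_count, rx_packets)
missed_pct = pct(rx_missed_errors, rx_packets)

print(f"rx_no_buffer_count / rx_packets: {no_buffer_pct:.2f}%")
print(f"rx_missed_errors   / rx_packets: {missed_pct:.2f}%")
print(f"both counters combined:          {no_buffer_pct + missed_pct:.2f}%")
```

    Each counter alone comes out near 4%, and the two together a little above 8%, which brackets the range quoted above.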

    Your e1000 driver version is somewhat old; it might be worth having Check Point TAC give you the latest e1000 driver. But lacking any other info, I'd say CPU 0 is getting butchered and updating the driver is unlikely to help. You can confirm this by running "cat /proc/interrupts" to see the IRQ distribution, along with the top command mentioned earlier.
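
    To make the /proc/interrupts check concrete, here is a minimal sketch that pulls out the rows for one interface and shows what share of its interrupts each CPU handled. The sample text is hypothetical (IRQ numbers, counts and controller names are invented), and the parser assumes the usual layout of a CPU header row followed by one row per IRQ.

```python
def irq_counts_for(text, device):
    """Return {irq_label: [count_per_cpu, ...]} for rows naming the device."""
    lines = text.strip().splitlines()
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue
        label = fields[0].rstrip(":")
        counts = fields[1:1 + n_cpus]
        # Devices sharing an IRQ are listed comma-separated.
        names = [f.rstrip(",") for f in fields[1 + n_cpus:]]
        # Rows such as ERR carry fewer columns; skip anything irregular.
        if (len(counts) == n_cpus and all(c.isdigit() for c in counts)
                and device in names):
            result[label] = [int(c) for c in counts]
    return result


# Hypothetical sample, for illustration only.
sample = """\
           CPU0       CPU1       CPU2       CPU3
  0:     500000          0          0          0    IO-APIC-edge  timer
 74:    9000000          0          0          0    PCI-MSI  eth0
 82:    7000000          0          0          0    PCI-MSI  eth1
NMI:          0          0          0          0
ERR:          0
"""

for irq, counts in irq_counts_for(sample, "eth0").items():
    total = sum(counts)
    shares = [round(100 * c / total) if total else 0 for c in counts]
    print(irq, shares)
```

    If every busy interface shows nearly all of its interrupts in the CPU0 column, as in this invented sample, that matches the single-core bottleneck described above.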
    Last edited by ShadowPeak.com; 2015-01-10 at 15:47.

  8. #8
    Join Date
    2009-01-23
    Location
    France
    Posts
    31
    Rep Power
    0

    Re: Migrate ClusterXL from High Availability to Load Sharing

    You're right.
    When traffic is "normal", the si (SoftIRQ) load on CPU0 is around 40%.
    We have 2 quad-core Xeon E5440 CPUs at 2.83 GHz per box.
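
    For readers who want the per-core si figure without watching top, the sketch below estimates it from two snapshots of /proc/stat taken a few seconds apart. It assumes the standard column order on the per-CPU rows (user, nice, system, idle, iowait, irq, softirq, steal) and at least seven columns; the two snapshots shown are invented numbers, not data from this cluster.

```python
def softirq_share(before, after):
    """Return {cpu_name: softirq % of elapsed ticks} between two snapshots."""
    def per_cpu(text):
        table = {}
        for line in text.splitlines():
            fields = line.split()
            # Per-core rows look like "cpu0 ..."; the aggregate row is "cpu".
            if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
                table[fields[0]] = [int(x) for x in fields[1:]]
        return table

    first, second = per_cpu(before), per_cpu(after)
    shares = {}
    for cpu, new in second.items():
        old = first.get(cpu)
        if old is None or len(new) < 7 or len(old) < 7:
            continue
        # Only the first eight columns, so guest time is not counted twice.
        deltas = [n - o for n, o in zip(new[:8], old[:8])]
        total = sum(deltas)
        # Index 6 is the softirq column.
        shares[cpu] = 100.0 * deltas[6] / total if total else 0.0
    return shares


# Hypothetical snapshots, for illustration only.
before = """cpu  100 0 100 800 0 0 0 0
cpu0 50 0 50 400 0 0 0 0
cpu1 50 0 50 400 0 0 0 0
"""
after = """cpu  120 0 120 950 0 10 100 0
cpu0 60 0 60 470 0 10 100 0
cpu1 60 0 60 480 0 0 0 0
"""
print(softirq_share(before, after))
```

    On a real gateway the two snapshots would come from reading /proc/stat twice with a short pause in between.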

    [Attached image: top.jpg]

    [Attached image: interrupts.jpg]

    Last year we had a problem when we tried to enable CoreXL (2 instances). After enabling CoreXL we rebooted, and when we pushed policy the cluster went down: the 2 members didn't see each other anymore.
    So first, I will try to install and enable SecureXL.
    Can I set manual affinity for the busy interface via fwaffinity.conf without enabling CoreXL?

    Thanks a lot
    Last edited by gustave69; 2015-01-11 at 09:55.

  9. #9
    Join Date
    2009-04-30
    Location
    Colorado, USA
    Posts
    2,248
    Rep Power
    14

    Re: Migrate ClusterXL from High Availability to Load Sharing

    You must enable CoreXL on both cluster members at the same time, or you will see the exact cluster down/ready situation you just described.

    You can try just enabling SecureXL first, but you really need to enable CoreXL too. If you just enable SecureXL, it will move the IRQ processing onto several different cores, alleviating the SoftIRQ processing bottleneck; but since you have CoreXL off, that single INSPECT firewall instance will go to 100% instantly and become your next bottleneck. Performance will be just as bad if not worse, with high packet latency replacing high packet loss.

    Hopefully you are running at least code level R75.40, where a lot of SecureXL issues were fixed; prior to that, enabling SecureXL could break various legacy applications and some other unusual network situations.

    Look in $FWDIR/conf/fwaffinity.conf and ensure that all you see there is "i default auto", other than comment lines that start with #. I'd strongly recommend trying CoreXL and SecureXL both enabled with automatic affinities first, before trying to set up manual affinities in this file.
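
    The fwaffinity.conf check described above can be sketched as a small script that ignores comments and blank lines and reports any remaining line other than "i default auto". The sample file contents, including the "i eth0 2" entry, are hypothetical and only there to show what a leftover manual affinity might look like.

```python
def unexpected_affinity_lines(conf_text, expected="i default auto"):
    """Return active (non-comment, non-blank) lines that differ from expected."""
    unexpected = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Compare on whitespace-normalised text so extra spaces do not matter.
        if " ".join(line.split()) != expected:
            unexpected.append(line)
    return unexpected


# Hypothetical fwaffinity.conf contents, for illustration only.
sample_conf = """\
# comment lines are ignored
i default auto
i eth0 2
"""
print(unexpected_affinity_lines(sample_conf))
```

    An empty list means the file contains only the automatic default; anything else is a manual setting worth reviewing before enabling SecureXL and CoreXL.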

    However, before doing that: what process is pegging CPU5 in user (us) space? It should be showing up in top just below all the CPU numbers. That looks unusual. Is it fw_worker? fwd?

    Edit: I just noticed in one of your older posts that you were using Traditional Mode VPNs at one point. Please tell me you have converted to Simplified Mode (VPN column in the rulebase); otherwise you won't be able to turn on CoreXL, and your performance will probably continue to be terrible no matter what you do.
    Last edited by ShadowPeak.com; 2015-01-11 at 13:46.

  10. #10
    Join Date
    2009-01-23
    Location
    France
    Posts
    31
    Rep Power
    0

    Re: Migrate ClusterXL from High Availability to Load Sharing


    I've found the process on CPU5. It was an HP Insight Management process. After stopping and restarting it, CPU5 went back to 100% idle.

    This Monday, a workday, the SoftIRQ rate on CPU0 stayed around 96-98% during high traffic.
    My boss wants a quick result at no cost, so I will enable SecureXL first.
    To do this, do I just need to install Performance Pack and enable SecureXL in cpconfig? No other actions?
    We will try later to activate CoreXL with 2 instances (no licenses needed).

    My problematic cluster is a firewall-only cluster. No VPN. The VPNs are on another cluster, but still in Traditional Mode...
    My cluster's version is R75.20, not R75.40. Is that a problem for SecureXL?
    Last edited by gustave69; 2015-01-12 at 14:46.
    *** JE SUIS CHARLIE ***

  11. #11
    Join Date
    2009-04-30
    Location
    Colorado, USA
    Posts
    2,248
    Rep Power
    14

    Re: Migrate ClusterXL from High Availability to Load Sharing

    Quote Originally Posted by gustave69
    I've found the process on CPU5. It was an HP Insight Management process. After stopping and restarting it, CPU5 went back to 100% idle.

    OK great, not Check Point related.

    Quote Originally Posted by gustave69
    This Monday, a workday, the SoftIRQ rate on CPU0 stayed around 96-98% during high traffic.
    My boss wants a quick result at no cost, so I will enable SecureXL first.
    To do this, do I just need to install Performance Pack and enable SecureXL in cpconfig? No other actions?
    We will try later to activate CoreXL with 2 instances (no licenses needed).
    My problematic cluster is a firewall-only cluster. No VPN. The VPNs are on another cluster, but still in Traditional Mode...
    My cluster's version is R75.20, not R75.40. Is that a problem for SecureXL?

    On that version you shouldn't need to install Performance Pack; it should already be there. Just run cpconfig on the active member, then select the menu option to enable SecureXL. It will take effect immediately and persist across reboots. Wait 60 seconds, then run "sim affinity -l" to see the automatic processor allocations by interface.

    R75.20 should be OK. I'd recommend executing a test plan for all your critical applications and components immediately after enabling it on the active member, to be sure. Don't forget to enable SecureXL on the standby once your test plan checks out.

    Once SecureXL is on, you'll need to schedule an outage window to get CoreXL enabled. Yes, you have a cluster and there should not be an outage when enabling CoreXL, but I'd get a window just to CYA based on what you reported earlier. If something is going to break, SecureXL is the most likely place; CoreXL is not known for breaking stuff as long as you enable it on both cluster members at the same time.

  12. #12
    Join Date
    2009-01-23
    Location
    France
    Posts
    31
    Rep Power
    0

    Re: Migrate ClusterXL from High Availability to Load Sharing

    Until today, we hadn't had authorization to activate SecureXL.
    So we tried starting the "irqbalance" process, which is present on SPLAT in /etc/init.d. All IRQs were well balanced across all 8 cores, but the core handling the internal interface's IRQ was 100% busy, and there were still a lot of dropped packets.
    So this afternoon we activated SecureXL.
    SoftIRQs seem to be balanced only across core0 and core1 (is that normal?). The accelerated connections and accelerated packets rates are around 90%. Not bad.
    We will see how it behaves under high traffic tomorrow morning.

    # sim affinity -l
    eth0 : 1
    eth1 : 1
    eth2 : 0
    eth7 : 0
    eth8 : 1
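
    To read that listing at a glance, a small sketch like the following groups the interfaces by the core they are assigned to. It assumes each line has the simple "interface : core" shape shown above; other output shapes are not handled.

```python
from collections import defaultdict


def interfaces_by_core(output):
    """Group 'ethN : core' lines from `sim affinity -l` by core number."""
    groups = defaultdict(list)
    for line in output.splitlines():
        name, sep, core = line.partition(":")
        core = core.strip()
        # Ignore anything that is not a plain "name : number" line.
        if sep and core.isdigit():
            groups[int(core)].append(name.strip())
    return dict(groups)


# Output as posted above.
output = """eth0 : 1
eth1 : 1
eth2 : 0
eth7 : 0
eth8 : 1
"""
for core, names in sorted(interfaces_by_core(output).items()):
    print(f"core {core}: {', '.join(names)}")
```

    With the listing above, this shows eth2 and eth7 on core 0 and eth0, eth1 and eth8 on core 1.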


    So now we will try to enable CoreXL, hoping that this time the activation goes all right. I'll let you know.
    Last edited by gustave69; 2015-01-19 at 18:10.
    *** JE SUIS CHARLIE ***

  13. #13
    Join Date
    2009-04-30
    Location
    Colorado, USA
    Posts
    2,248
    Rep Power
    14

    Re: Migrate ClusterXL from High Availability to Load Sharing

    Quote Originally Posted by gustave69
    Until today, we hadn't had authorization to activate SecureXL.
    So we tried starting the "irqbalance" process, which is present on SPLAT in /etc/init.d. All IRQs were well balanced across all 8 cores, but the core handling the internal interface's IRQ was 100% busy, and there were still a lot of dropped packets.
    So this afternoon we activated SecureXL.
    SoftIRQs seem to be balanced only across core0 and core1 (is that normal?).

    Your box must have 8 cores, so that sounds right since CoreXL is off. Make sure all you have in $FWDIR/conf/fwaffinity.conf is "i default auto", although setting interface affinities in that file should no longer matter now that SecureXL is enabled.

    Quote Originally Posted by gustave69
    The accelerated connections and accelerated packets rates are around 90%. Not bad.
    We will see how it behaves under high traffic tomorrow morning.

    # sim affinity -l
    eth0 : 1
    eth1 : 1
    eth2 : 0
    eth7 : 0
    eth8 : 1

    So now we will try to enable CoreXL, hoping that this time the activation goes all right. I'll let you know.

    Those are great acceleration statistics, and they explain why the box seems to be behaving better. I have a feeling your box will be much healthier during high traffic even without CoreXL enabled. I would still go ahead and enable it for the 2 cores you are licensed for; once you do that, Firewall Workers will run on cores 6 & 7, and your interfaces should spread out across cores 0-5 with automatic interface affinity, if I recall correctly how CoreXL licensing works. Everything might bunch up on the first two cores due to your 2-core license; I can't remember exactly. It has been a while since I dealt with a CoreXL license that permitted fewer cores than were physically present, due to the rise of Check Point appliances, which always come with a matching CoreXL license...

    Edit: After a quick test in my lab: if you are only licensed for 2 cores, everything will bunch up on those 2 cores, including all SND cores and Firewall Worker cores. When only 2 cores are available (whether limited by license or because there are only 2 physical cores), you will definitely want SecureXL on, BUT whether you should enable CoreXL is a bit of a crapshoot in the real world. At the moment, with SecureXL enabled but CoreXL disabled, you have 2 SND cores able to process interface IRQs (this includes all accelerated traffic too) and just 1 Firewall Worker instance to handle non-accelerated traffic. Enabling CoreXL will let you have 2 Firewall Worker instances, but so much of your traffic is accelerated and handled by the 2 SND cores that enabling CoreXL might actually hurt performance in your case. As I said, it is a bit of a crapshoot. I'd advise leaving SecureXL on but CoreXL off for a few days to see how the firewall handles a couple of busy periods. Get some good performance measurements during your busy periods, then try enabling CoreXL and see what happens.
    Last edited by ShadowPeak.com; 2015-01-20 at 02:21.
