CPUG: The Check Point User Group

Resources for the Check Point Community, by the Check Point Community.



Thread: Question regarding failover in ClusterXL (and not only)

  1. #1
    Join Date
    2017-10-10
    Posts
    7
    Rep Power
    0

    Default Question regarding failover in ClusterXL (and not only)

    Hello

    I have a question about failover, especially as it relates to SecureXL. I know that people say 'zero downtime', but does that actually mean that 100% of the traffic is replicated? I mean, do remote access VPN connections, L2L connections, or accelerated packets (SecureXL) also fail over to the standby firewall? Is there any type of traffic that is 'dropped'?

  2. #2
    Join Date
    2009-04-30
    Location
    Colorado, USA
    Posts
    2,252
    Rep Power
    17

    Default Re: Question regarding failover in ClusterXL (and not only)

    Anything being done in a user space process on the active firewall that fails will not survive the failover. Anything tracked in the kernel (which is most operations including all the ones you listed) should survive.

    However the other factor is whether the failure of the active member is catastrophic or just an impairment; this will dictate how much packet traffic is lost during the switchover. Here is an excerpt from the second edition of Max Power which I'm working on right now; there is a whole new chapter dealing specifically with ClusterXL performance issues and in particular "slow" failovers:

    1. The default “dead” timer for ClusterXL is approximately 2.5 seconds. If the active member suffers a catastrophic failure (such as the power cord being pulled or a Gaia system crash/panic), the standby member must wait the dead interval before concluding the active member has failed and going active. During that wait period no traffic will pass through the cluster. However for administrative failovers using clusterXL_admin or other partial failures (such as a single network interface getting unplugged or running a service-impacting command such as fw unloadlocal), failover to the standby should happen immediately with minimal packet loss.

    2. If the active cluster member’s CPUs are running at 80% utilization or higher, by default in R77.30 gateway and later the Cluster Under Load (CUL) mechanism is invoked, which extends the ClusterXL dead timer from 2.5 seconds to 10 seconds. The purpose of CUL is to avoid spurious and unnecessary failovers due to transient high CPU loads on the active cluster member. Needless to say if a catastrophic failure occurs on the active member while CUL is active, the standby member will have to wait much longer before taking over. To determine if the CUL mechanism is currently (or previously) active on your cluster, run grep cul_load_freeze /var/log/messages*, as CUL logs all information about its operation to the gateway’s syslog. Making sure your cluster members are properly tuned and optimized as outlined throughout this book can help keep your cluster members well below 80% CPU utilization and avoid invoking CUL.
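As a rough illustration of the two points in the excerpt, here is a minimal Python sketch of how long the standby waits before going active, plus a plain substring scan that mirrors the grep command. The function and constant names are hypothetical, the timer values are simply the approximate figures quoted above, and treating administrative or partial failures as a 0 second wait is a simplification of "immediately with minimal packet loss".

```python
DEAD_TIMER_S = 2.5        # approximate default ClusterXL dead timer quoted above
CUL_DEAD_TIMER_S = 10.0   # extended dead timer while Cluster Under Load is active
CUL_CPU_THRESHOLD = 80.0  # active member CPU utilization (%) that invokes CUL


def detection_window_s(catastrophic: bool, active_cpu_percent: float) -> float:
    """Approximate time the standby waits before taking over.

    Simplified model: only a catastrophic failure (power loss, crash)
    makes the standby wait out the dead timer; administrative or
    partial failures are modeled as an immediate failover (0 s).
    """
    if not catastrophic:
        return 0.0
    if active_cpu_percent >= CUL_CPU_THRESHOLD:
        return CUL_DEAD_TIMER_S
    return DEAD_TIMER_S


def cul_lines(log_lines):
    """Return the lines that mention cul_load_freeze, like the grep above.

    No assumption is made about the rest of the syslog line format.
    """
    return [line for line in log_lines if "cul_load_freeze" in line]
```

For example, `detection_window_s(True, 35.0)` gives 2.5 and `detection_window_s(True, 92.0)` gives 10.0, which is the gap the excerpt warns about when CUL is active.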
    --
    Third Edition of my "Max Power 2020" Firewall Book
    Now Available at http://www.maxpowerfirewalls.com

  3. #3
    Join Date
    2017-10-10
    Posts
    7
    Rep Power
    0

    Default Re: Question regarding failover in ClusterXL (and not only)

    Thanks for the answer! Yes indeed, I am talking about a controlled failover (clusterXL_admin down), and I am pleasantly surprised to hear that all of the above 'should' be unaffected. I had taken for granted that the remote access connections in particular would 'feel' a disconnection!

    One thing though: what does "user space process" mean? I have never come across the term :D I hope "user" here means the admin himself, not the host users.

  4. #4
    Join Date
    2007-03-30
    Location
    DFW, TX
    Posts
    422
    Rep Power
    16

    Default Re: Question regarding failover in ClusterXL (and not only)

    Transient cryptographic keys are not synchronized, so VPNs will have to renegotiate. That is a fairly fast process. Connections within the VPN shouldn't drop, but you may get a hiccup on VoIP calls, for example.

    "User-space" is a computing term for code that runs as an ordinary process, one a user could conceivably administer, rather than inside the operating system kernel. User-space processes on Check Point include things like the URL filtering lookup (though once a connection is approved, the URL filtering system isn't generally checked again), logging, and dynamic routing.

    Things at this level would need to provide their own failover mechanics. Some do, some don't. For dynamic routing to "survive" a failover, you can use the "graceful restart" functionality in OSPF and BGP. This allows the firewalls to tell neighbors to hold their routing tables static for a few minutes while everything renegotiates and the graph reconverges. The neighbor adjacency does actually drop, but the graceful restart functionality allows traffic forwarding to continue without interruption.
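To make the graceful restart behaviour concrete, here is a toy Python model of a neighbor acting as a helper: it notices the adjacency drop but keeps forwarding on its existing routes until a grace period runs out. The class name, method names, and the 120 second grace period are all hypothetical, and the real OSPF and BGP mechanisms have more rules (for instance, conditions under which a helper gives up early), so read this only as a sketch of the idea.

```python
class HelperNeighbor:
    """Toy model of a routing neighbor helping during a graceful restart."""

    def __init__(self, grace_period_s: float):
        self.grace_period_s = grace_period_s
        self.adjacency_up = True
        self.lost_at = None  # time at which the adjacency dropped, if any

    def adjacency_lost(self, now: float) -> None:
        """The adjacency to the cluster dropped (for example, on failover)."""
        self.adjacency_up = False
        self.lost_at = now

    def adjacency_restored(self) -> None:
        """The new active member has re-established the adjacency."""
        self.adjacency_up = True
        self.lost_at = None

    def still_forwarding(self, now: float) -> bool:
        """True while routes through the cluster are still being used."""
        if self.adjacency_up:
            return True
        # Routes are held static only for the length of the grace period.
        return (now - self.lost_at) <= self.grace_period_s
```

With `n = HelperNeighbor(grace_period_s=120.0)` and `n.adjacency_lost(now=0.0)`, `n.still_forwarding(30.0)` is true even though the adjacency is down, while `n.still_forwarding(200.0)` is false unless `n.adjacency_restored()` was called in the meantime.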

    There are still user-mode systems for antivirus scanning, though this is not the default. I believe threat emulation communications (shuttling the files back and forth) are handled by a user-mode process, though I don't think the connections are. I forget what else is handled by user-mode processes, though.

  5. #5
    Join Date
    2009-04-30
    Location
    Colorado, USA
    Posts
    2,252
    Rep Power
    17

    Default Re: Question regarding failover in ClusterXL (and not only)

    Don't worry the second edition of Max Power will remind you, :-) I call process space on the firewall the "fourth path" (in addition to SXL, PXL & F2F) and will be covering it extensively.

  6. #6
    Join Date
    2007-03-30
    Location
    DFW, TX
    Posts
    422
    Rep Power
    16

    Default Re: Question regarding failover in ClusterXL (and not only)

    Now that I think about it, since transient crypto keys don't synchronize, TLS-MitM'd connections (most notably, HTTPS connections if HTTPS inspection is enabled) won't survive failover. That doesn't matter for most things, but long-running downloads over HTTPS would die.

    Any connection which terminates on the firewall rather than through the firewall would die. The firewalls don't synchronize SSH keys, web UI state, or anything like that. Strictly, the connection table entries would survive the failover, but the client would then be "connected" to a port on the server which isn't listening. This is why the dynamic routing adjacencies have to be rebuilt.
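Pulling together the rules of thumb from this thread so far, a small Python sketch can classify what happens to a connection on failover. The names below are hypothetical and the categories are only the ones posters described here, not an official Check Point classification.

```python
from enum import Enum


class Outcome(Enum):
    SURVIVES = "kernel-tracked state is synchronized"
    RENEGOTIATES = "keys renegotiate, brief hiccup possible"
    DROPS = "state is not synchronized"
    DEPENDS = "needs its own failover mechanism"


def failover_outcome(*, terminates_on_firewall: bool = False,
                     tls_inspected: bool = False,
                     vpn_tunnel: bool = False,
                     user_space: bool = False) -> Outcome:
    """Classify a connection using the rules of thumb from the thread."""
    # SSH or web UI sessions to the firewall, and TLS-MitM'd connections,
    # depend on keys and state that are not synchronized.
    if terminates_on_firewall or tls_inspected:
        return Outcome.DROPS
    # User-space features must provide their own failover mechanics.
    if user_space:
        return Outcome.DEPENDS
    # Transient crypto keys are not synchronized, so the VPN renegotiates.
    if vpn_tunnel:
        return Outcome.RENEGOTIATES
    # Ordinary pass-through traffic tracked in the kernel.
    return Outcome.SURVIVES
```

For example, `failover_outcome(tls_inspected=True)` returns `Outcome.DROPS` (the long-running HTTPS download case), while `failover_outcome(vpn_tunnel=True)` returns `Outcome.RENEGOTIATES`.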



    It strikes me that Check Point should really offer some kind of hardware watchdog to rapidly detect cluster member failure. Build a couple of screw terminals into the firewall. One pair for a relay (SPST-NO, non-latching) to indicate "my" state, one pair to read remote state. Every 10ms (for example), if the LOM card doesn't get a response from an OS-level watchdog daemon, it stops powering the relay. If the member loses power, so does the relay. Bam. Low-latency remote failure indication like a FONIC. If I detect the peer's relay has opened, I have a pretty good idea something has gone wrong. Only works for two-member clusters directly, but it wouldn't be hard to extend, especially if it was treated only as a notification system to trigger in-band checking.

    Tweak the scheduler a little to pin the watchdog to a particular processor core and prevent other processes from being scheduled on that core for more than a millisecond at a time. The default timings and scheduler settings would have to be adjusted for boxes with fewer cores. Still, could be done, and it could significantly reduce failover latency.
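The relay idea above can be simulated in a few lines of Python. This is purely a sketch of the proposal, not anything Check Point ships: the function names are hypothetical, one tick stands for one check interval (10 ms in the example above), and the relay is modeled as normally open and non-latching, so it is closed only while the controller has power and the watchdog daemon answered during that tick.

```python
def relay_states(heartbeat_ticks, total_ticks, powered=True):
    """Return the relay state (True = closed) for each tick.

    heartbeat_ticks: set of tick numbers at which the OS-level watchdog
    daemon answered the controller.
    """
    states = []
    for tick in range(total_ticks):
        # A non-latching, normally open relay is closed only while driven,
        # so a missed heartbeat or a power loss opens it.
        states.append(powered and tick in heartbeat_ticks)
    return states


def first_open_tick(states):
    """First tick at which the peer would see the relay open, else None."""
    for tick, closed in enumerate(states):
        if not closed:
            return tick
    return None
```

If the daemon answers on ticks 0 through 4 and then hangs, `first_open_tick(relay_states(set(range(5)), 8))` returns 5, so the peer would notice within about one check interval and could then start its in-band checks.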

    Edited to add: I said LOM card. Meant SMC.
    Last edited by Bob_Zimmerman; 2017-10-25 at 16:42.

  7. #7
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,668
    Rep Power
    13

    Default Re: Question regarding failover in ClusterXL (and not only)

    Quote Originally Posted by Bob_Zimmerman View Post
    It strikes me that Check Point should really offer some kind of hardware watchdog to rapidly detect cluster member failure. Build a couple of screw terminals into the firewall. One pair for a relay (SPST-NO, non-latching) to indicate "my" state, one pair to read remote state. Every 10ms (for example), if the LOM card doesn't get a response from an OS-level watchdog daemon, it stops powering the relay. If the member loses power, so does the relay. Bam. Low-latency remote failure indication like a FONIC. If I detect the peer's relay has opened, I have a pretty good idea something has gone wrong. Only works for two-member clusters directly, but it wouldn't be hard to extend, especially if it was treated only as a notification system to trigger in-band checking.

    Tweak the scheduler a little to pin the watchdog to a particular processor core and prevent other processes from being scheduled on that core for more than a millisecond at a time. The default timings and scheduler settings would have to be adjusted for boxes with fewer cores. Still, could be done, and it could significantly reduce failover latency.

    Edited to add: I said LOM card. Meant SMC.
    That sounds harder than "good enough".

  8. #8
    Join Date
    2007-03-30
    Location
    DFW, TX
    Posts
    422
    Rep Power
    16

    Default Re: Question regarding failover in ClusterXL (and not only)

    Quote Originally Posted by jflemingeds View Post
    That sounds harder than "good enough".
    Sure, but it's leaps and bounds easier to develop a system like this than it is to develop a FONIC. And this would be used by at least one order of magnitude more people.

    Given how terrible Check Point's "appliances" are in every regard, you'd think they would at least try to give people a reason to want them.

  9. #9
    Join Date
    2017-10-10
    Posts
    7
    Rep Power
    0

    Default Re: Question regarding failover in ClusterXL (and not only)

    Quote Originally Posted by Bob_Zimmerman View Post
    Transient cryptographic keys are not synchronized, so VPNs will have to renegotiate. That is a fairly fast process. Connections within the VPN shouldn't drop, but you may get a hiccup on VoIP calls, for example.
    Thanks for the clarification. I had a failover some time ago (a controlled one) and wasn't sure whether remote VPN users would be affected, so I assumed the worst-case scenario and informed users that they could have a 'momentary' outage. Now I assume I didn't have to, since they should never be affected, if I understood correctly.

  10. #10
    Join Date
    2007-03-30
    Location
    DFW, TX
    Posts
    422
    Rep Power
    16

    Default Re: Question regarding failover in ClusterXL (and not only)

    It depends on what they are doing. VoIP calls, for example, will almost certainly experience a brief drop. Everything should resume with little to no action from the users, though.

  11. #11
    Join Date
    2017-10-10
    Posts
    7
    Rep Power
    0

    Default Re: Question regarding failover in ClusterXL (and not only)

    Thanks again!
