CPUG: The Check Point User Group

Resources for the Check Point Community, by the Check Point Community.



Thread: ClusterXL Issue with Failover

  1. #1
    Join Date
    2016-09-19
    Posts
    5
    Rep Power
    0

    Default ClusterXL Issue with Failover

    I encountered a weird issue that I have never seen with ClusterXL and am hoping to get a better idea of the process, or why this would happen. Any comments and answers are greatly appreciated.

    Current Set-up:
    2x open servers running Gaia R77.30 in Active/Standby (sync interface is a crossover)

    fw1 = active
    fw2 = standby

    monitored interfaces:
    eth0
    eth1
    eth2
    eth3
    eth4

    sync interface:
    eth5 (via crossover)


    Issue:
    Everything started off normal:
    Fw1 was active.
    Fw2 was standby.

    Then, out of the blue we lost all access to our networks to and from this firewall. We were dead in the water.

    So I had to console into the FWs.

    I first consoled into Fw1 and ran "cphaprob stat" which showed:

    Code:
    Number     Unique Address  Assigned Load   State

    1 (local)  x.x.x.1         0%              Down
    2          x.x.x.2         0%              Standby

    I have never seen this before and it looks like Fw1 failed and Fw2 never took over as the Active member.

    I then consoled into FW2, and all that was shown was a blank black screen. No prompt, nothing. The server did appear to have a red light on the power indicator. (Upon research, it looks like I would have to open the server to check the motherboard for lights.)

    Rebooted both boxes and everything came back up fine.

    Decided to start looking through logs and on FW1 I saw this log which caused the fail over:

    Code:
    fw1 kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to FAILURE due to pnote Interface Active Check (desc eth5 interface is down, member 2 (x.x.x.2) reports more interfaces up)
    fw1 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(0) to FAILURE
    fw1 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
    fw1 kernel: [fw4_1];fwha_state_change_implied: Try to update state to ACTIVE because member is down and state might should be changed
    fw1 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
    Did the sync interface go down on FW1 because FW2 had a (hardware-related) issue and it's a crossover connection? Why didn't FW1 try to take over the Active state when FW2 wasn't responding?

    Can someone help explain what happened here?
    Can someone help explain how ClusterXL is supposed to work (like a flow)?

    Thanks!

  2. #2
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,668
    Rep Power
    13

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by The_Dude View Post
    I encountered a weird issue that I have never seen with ClusterXL and am hoping to get a better idea of the process, or why this would happen. ... Can someone help explain what happened here? Can someone help explain how ClusterXL is supposed to work (like a flow)?
    Wow, never seen Down/Standby on ClusterXL. Did you grab /var/log/messages* from both members after the reboot? What was the last thing shown before the failure? $FWDIR/log/fwd.elg would be worth looking at as well; fwd is the process that handles sync packets. It's possible that the eth5 down is from the reboot. Hard to say, since dmesg isn't timestamped.
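
    For reference, a rough collection sketch along those lines (expert mode; the grep pattern is just an illustrative filter, not something from this thread):

    Code:
    # Gather the logs suggested above, on both members
    ls -ltr /var/log/messages*                       # rotated syslog files
    grep -iE "fwha|pnote|state" /var/log/messages*   # cluster state transitions
    tail -n 200 $FWDIR/log/fwd.elg                   # fwd daemon log
    dmesg                                            # kernel ring buffer (no timestamps here)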

    The only time I've seen something close to this is on flash-based IPSO (granted, VRRP). There is an issue where a run of flash cards will hard-lock. The kernel will still send VRRP messages. The firewall will respond to ping (still in the kernel). Anything in userland times out (HTTP, SSH, etc.) and all disk reads never return, deadlocking the userland processes.

    Any chance there is a disk going bad? ClusterXL does a lot of checking to try to figure out the state of the other member and of the local network.

    Oh, BTW... one other thing. Do you have multiple clusters connected to the same VLAN by chance? Like clusterA and clusterB connected to the same VLAN? That can cause some problems.

    Do you have a jumbo hotfix installed by chance?
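
    Two quick checks for those last two questions (a sketch; the fwha_mac_magic parameter names are the ones I recall from sk25977 for separating clusters that share a VLAN, so treat them as an assumption):

    Code:
    # Installed hotfixes / jumbo take
    cpinfo -y all
    # Cluster "magic MAC" values; members of different clusters sharing a VLAN
    # should be configured with different values (per sk25977)
    fw ctl get int fwha_mac_magic
    fw ctl get int fwha_mac_forward_magic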

  3. #3
    Join Date
    2016-09-19
    Posts
    5
    Rep Power
    0

    Default Re: ClusterXL Issue with Failover

    Sorry, I forgot to add the timestamps:
    Code:
    Jan 11 03:54:46 2017 FW1 ctasd[9530]: Save SenderID lists
    Jan 11 03:54:46 2017 FW1 ctasd[9530]: Save SenderId lists finished
    Jan 11 04:00:01 2017 FW1 crond[13221]: (root) CMD (/usr/lib/sa/sa1 1 1)
    Jan 11 04:06:20 2017 FW1 pm[6970]: Restarted /bin/frontstage[28974], count=1882
    Jan 11 04:06:20 2017 FW1 pm[28974]: init LD_LIBRARY_PATH for /bin/frontstage
    Jan 11 04:06:20 2017 FW1 pm[6970]: Reaped:  frontstage[28974]
    Jan 11 04:06:20 2017 FW1 pm[6970]: Scheduled frontstage for +900 secs 
    Jan 11 04:07:25 2017 FW1 kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to FAILURE due to pnote Interface Active Check (desc eth5 interface is down, member 2 (x.x.x.2) reports more interfaces up)
    Jan 11 04:07:25 2017 FW1 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(0) to FAILURE
    Jan 11 04:07:25 2017 FW1 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
    Jan 11 04:07:25 2017 FW1 kernel: [fw4_1];fwha_state_change_implied: Try to update state to ACTIVE because member is down and state might should be changed
    Jan 11 04:07:25 2017 FW1 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
    Jan 11 04:10:01 2017 FW1 crond[7013]: (root) CMD (/usr/lib/sa/sa1 1 1)
    Jan 11 04:20:01 2017 FW1 crond[483]: (root) CMD (/usr/lib/sa/sa1 1 1)
    Jan 11 04:21:20 2017 FW1 pm[6970]: Restarted /bin/frontstage[3138], count=1883
    Jan 11 04:21:20 2017 FW1 pm[3138]: init LD_LIBRARY_PATH for /bin/frontstage
    Jan 11 04:21:20 2017 FW1 pm[6970]: Reaped:  frontstage[3138]
    Jan 11 04:21:20 2017 FW1 pm[6970]: Scheduled frontstage for +900 secs 
    Jan 11 04:30:01 2017 FW1 crond[26508]: (root) CMD (/usr/lib/sa/sa1 1 1)
    Jan 11 04:36:20 2017 FW1 pm[6970]: Restarted /bin/frontstage[9718], count=1884
    Jan 11 04:36:20 2017 FW1 pm[9718]: init LD_LIBRARY_PATH for /bin/frontstage
    Jan 11 04:36:20 2017 FW1 pm[6970]: Reaped:  frontstage[9718]
    Jan 11 04:36:20 2017 FW1 pm[6970]: Scheduled frontstage for +900 secs 
    Jan 11 04:40:01 2017 FW1 crond[20047]: (root) CMD (/usr/lib/sa/sa1 1 1)
    Jan 11 04:41:15 2017 FW1 xpand[6987]: admin localhost t +installer:update_status -1 
    Jan 11 04:41:15 2017 FW1 xpand[6987]: Configuration changed from localhost by user admin by the service dbset
    Jan 11 04:41:15 2017 FW1 xpand[6987]: admin localhost t +installer:packages:HOTFIX_R77.20:legacy_hotfix 1 
    Jan 11 04:41:15 2017 FW1 xpand[6987]: Configuration changed from localhost by user admin by the service dbset
    Jan 11 04:41:15 2017 FW1 xpand[6987]: admin localhost t +installer:packages:HOTFIX_R77.20:pkg_type 1 
    Jan 11 04:41:15 2017 FW1 xpand[6987]: Configuration changed from localhost by user admin by the service dbset
    Those were the last messages from FW1 before it went to the Down state at 4:07 AM.

    We never rebooted FW2 until after we rebooted FW1, so FW2 was just sitting at a blank screen when I consoled into it. Here are the last logs I had from FW2 before the failover happened at 4:07am:
    Code:
    FW2:
    Jan 11 03:55:16 2017 FW2 xpand[6986]: Configuration changed from localhost by user admin by the service dbset
    Jan 11 03:55:18 2017 FW2 xpand[6986]: admin localhost t -installer:packages:Check_Point_R77_20_R77_30_T204.tgz:tag:importance  
    Jan 11 03:55:18 2017 FW2 xpand[6986]: Configuration changed from localhost by user admin by the service dbset
    Jan 11 03:55:19 2017 FW2 xpand[6986]: admin localhost t +installer:packages:Check_Point_R77_20_R77_30_T204.tgz:tag:importance latest 
    Jan 11 03:55:19 2017 FW2 xpand[6986]: Configuration changed from localhost by user admin by the service dbset
    Jan 11 03:55:19 2017 FW2 xpand[6986]: admin localhost t -installer:packages:Check_Point_R77_30_Hotfix_sk112829_FULL.tgz:tag:importance  
    Jan 11 03:55:19 2017 FW2 xpand[6986]: Configuration changed from localhost by user admin by the service dbset
    Jan 11 03:55:19 2017 FW2 xpand[6986]: admin localhost t +installer:packages:Check_Point_R77_30_Hotfix_sk112829_FULL.tgz:tag:importance latest 
    Jan 11 03:55:19 2017 FW2 xpand[6986]: Configuration changed from localhost by user admin by the service dbset
    Jan 11 03:55:19 2017 FW2 xpand[6986]: admin localhost t -installer:packages:Check_Point_R77_30_JUMBO_HF_1_Bundle_T205_FULL.tgz:tag:importance  
    Jan 11 03:55:19 2017 FW2 xpand[6986]: Configuration changed from localhost by user admin by the service dbset
    Jan 11 03:55:19 2017 FW2 xpand[6986]: admin localhost t +installer:packages:Check_Point_R77_30_JUMBO_HF_1_Bundle_T205_FULL.tgz:tag:importance latest 
    Jan 11 03:55:19 2017 FW2 xpand[6986]: Configuration changed from localhost by user admin by the service dbset
    Jan 11 03:55:20 2017 FW2 xpand[6986]: admin localhost p -# Generated by /bin/confd on Wed Jan 11 00:54:54 2017
    Jan 11 03:55:20 2017 FW2 xpand[6986]: admin localhost p +# Generated by /bin/confd on Wed Jan 11 03:55:20 2017
    Jan 11 03:55:20 2017 FW2 xpand[6986]: admin localhost p -installer:last_update_time Wed\ Jan\ 11\ 00\:54\:44\ 2017
    Jan 11 03:55:20 2017 FW2 xpand[6986]: admin localhost p +installer:last_update_time Wed\ Jan\ 11\ 03\:55\:10\ 2017
    Jan 11 03:55:20 2017 FW2 xpand[6986]: load_sql_config>(empty,empty)
    Jan 11 03:55:34 2017 FW2 xpand[6986]: admin localhost t -volatile:configurationChange  
    Jan 11 03:55:34 2017 FW2 xpand[6986]: admin localhost t -volatile:configurationSave  
    Jan 11 04:00:01 2017 FW2 crond[5132]: (root) CMD (/usr/lib/sa/sa1 1 1)
    Jan 11 06:52:17 2017 FW2 syslogd 1.4.1: restart.
    We are currently on Jumbo 165.

    We do not have another cluster in the same VLAN.

    Here is the fwd.elg from FW1:
    Code:
    [FWD 7677 4064483024]@FW1[10 Jan 15:37:43] ha_fetch_callback: Cluster policy installation successful
     fwdgxsam_init(): gx_sam_proxy_create failed.
     Unable to open '/dev/fw6v0': No such file or directory
     coreXL_aff_handler: This is a cb respond to: FW1_INSTALLED msg
     coreXL_aff_handler: User has not enabled auto core affinity
     FireWall-1 daemon going to die on sig  15
     [11 Jan  6:46:17] fwd: restarting vpnd
     [11 Jan  6:46:17] fwd: restarting in.msd
     [11 Jan  6:46:17] fwd: restarting pdpd
     [11 Jan  6:46:17] fwd: restarting pepd
     [11 Jan  6:46:17] fwd: restarting usrchkd
     [11 Jan  6:46:17] fwd: restarting wstlsd
     [11 Jan  6:46:17] fwd: restarting wstlsd
     [11 Jan  6:46:17] fwd: restarting wstlsd
     [11 Jan  6:46:17] fwd: restarting wstlsd
     [11 Jan  6:46:17] fwd: restarting wstlsd
     [11 Jan  6:46:17] fwd: restarting wstlsd
     [11 Jan  6:46:17] fwd: restarting in.acapd
     [11 Jan  6:46:17] fwd: restarting in.geod
     [11 Jan  6:46:17] fwd: restarting ted
    and from FW2:
    Code:
    [FWD 7673 4064507600]@FW2[10 Jan 15:37:34] ha_fetch_callback: Cluster policy installation successful
     fwdgxsam_init(): gx_sam_proxy_create failed.
     coreXL_aff_handler: This is a cb respond to: FW1_INSTALLED msg
     coreXL_aff_handler: User has not enabled auto core affinity
     Unable to open '/dev/fw6v0': No such file or directory
    [fwd 7733 4064708304]@FW2[11 Jan  6:52:44] fwd: Wed Jan 11 06:52:44 2017
    
     Unable to open '/dev/fw6v0': No such file or directory
     Log asynch buffer size was initialized with size: 196608
     Log buffer initialized with size: 4096
    This is the weirdest thing I have ever seen, and CP Support hasn't responded to me yet after 2 days of research.

    What would happen if, all of a sudden, the crossover cable on the sync interface went bad? Would that trigger a failover?
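
    For reference, the sync link can be watched from either member with standard commands:

    Code:
    cphaprob -a if   # per-interface cluster status; flags the sync interface and whether it is up
    fw ctl pstat     # the sync section shows lost updates / retransmission requests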

  4. #4
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,668
    Rep Power
    13

    Default Re: ClusterXL Issue with Failover

    I don't think your problem is eth5 going down. I think the problem is the unknown state one of the firewall members entered because of what seems like it might be a hardware issue.

    I see 3 things you can do about this.

    1. Schedule a test. Get a window to do a test and let the business know it's impacting and that you're trying to establish a root cause for the failure. In said window, run cphaprob stat, cphaprob -a if, and cphaprob list on both members; yank eth5, then run the same commands (see the sketch after this list). My guess is you're going to see Active Attention, with one member reporting Active and the other reporting Down, and life will go on without an issue. If it doesn't, grab the output of those commands plus cpinfos and get a TAC case going.

    2. Take backups of everything, replicate in a VM environment, and do the test from number 1. Downside is this will take a little longer, and you can't say 100% that the same issue will happen since it's basically different hardware (VMs).

    3. Replace and rebuild the firewall that froze, then do option 1 again. You already have some indication of a hardware failure; wouldn't hurt to freshen things up a bit.
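
    A sketch of what the test window in option 1 could look like (the cpinfo file name is just an example):

    Code:
    # Baseline on BOTH members
    cphaprob stat     # member states
    cphaprob -a if    # monitored interfaces + sync
    cphaprob list     # registered pnotes
    # ...pull the eth5 crossover, wait a minute, then repeat the three commands on both members...
    # If the cluster misbehaves, collect evidence for TAC:
    cpinfo -z -o /var/tmp/fw1_sync_test.cpinfo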

  5. #5
    Join Date
    2010-09-20
    Posts
    73
    Rep Power
    13

    Default Re: ClusterXL Issue with Failover

    Our experience is that a crossover cable is generally bad when it comes to troubleshooting issues like this. You may have a split-brain scenario if the sync network is down but the members can still see each other on the other interfaces. You can start by connecting them through a switch and seeing if one of the two sides is down.

    Were there any configuration changes made? Are they both running the exact same software version? Same licenses? Same CoreXL and SecureXL configuration? Mismatches there can cause sync issues; it's possible someone changed the configuration without you knowing.
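
    A quick way to compare the members on those points, run on both and diffed by hand (standard commands):

    Code:
    fw ver               # version / hotfix build
    cplic print          # installed licenses
    fw ctl multik stat   # CoreXL instances and their state
    fwaccel stat         # SecureXL status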

    There are a number of other things to look for, but starting from those above is a good first step.

  6. #6
    Join Date
    2006-09-26
    Posts
    3,200
    Rep Power
    20

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by indeni View Post
    Our experience is that a crossover cable is generally bad when it comes to troubleshooting issues like this. You may have a split-brain scenario if the sync network is down but the members can still see each other on the other interfaces. You can start by connecting them through a switch and seeing if one of the two sides is down.
    I don't think connecting through a switch is good either. That's a single point of failure :-(

  7. #7
    Join Date
    2014-11-14
    Location
    Ottawa Canada
    Posts
    364
    Rep Power
    9

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by cciesec2006 View Post
    That's a single point of failure :-(
    sk92804: Sync Redundancy in ClusterXL

    In short, bond some interfaces, set that bond to be the Sync, and connect each physical interface of the bond to a different switch on the same VLAN/broadcast domain.
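
    Roughly, in Gaia clish that would look something like this (a sketch only; eth6 and the address are placeholders, not from the original post, and the bond then gets defined as the Sync network in the cluster topology):

    Code:
    add bonding group 1
    add bonding group 1 interface eth5
    add bonding group 1 interface eth6       # hypothetical second sync interface
    set bonding group 1 mode active-backup   # mode per your environment; active-backup shown as an example
    set interface bond1 ipv4-address 10.0.0.1 mask-length 30   # example sync address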

  8. #8
    Join Date
    2010-09-20
    Posts
    73
    Rep Power
    13

    Default Re: ClusterXL Issue with Failover

    Agree with both. Moving to a switch is just a first step :)

  9. #9
    Join Date
    2016-09-19
    Posts
    5
    Rep Power
    0

    Default Re: ClusterXL Issue with Failover

    Thanks! I will look into trying a bond in the future! Still waiting for a response from CP on why or how a Down/Standby can happen.

  10. #10
    Join Date
    2009-04-30
    Location
    Colorado, USA
    Posts
    2,252
    Rep Power
    17

    Default Re: ClusterXL Issue with Failover

    1) Check hardware sensors and power supply status over a period of a few minutes on the member that froze; see if anything is bouncing into a bad place or is borderline:

    cpstat -o 5 -f sensors os
    cpstat -o 5 -f power_supply os

    2) Unlikely to show anything, but if the issue happened in the last 30 days try sar like this:

    sar -A -f /var/log/sa/sa(two digit day number of month it happened) > /var/log/stuff
    more /var/log/stuff

    A lot of data, to be sure, but it may show something interesting, or a sudden lack of resources building up around the time the issue started.

    3) User space getting starved of some resource, as theorized by jflemingeds, is a definite possibility; the sar output above may help, provided the system had enough resources left to keep collecting during the issue (see the sketch after this list). I have seen hard drives totally fail yet the firewall keeps on trucking (at least in the kernel), while anything in user space that does not already have all the pages it needs loaded gets hung or eventually killed.

    4) The only other way I can see a Down/Standby state happening is if the cluster was under a freeze at the time due to high CPU load (which may correlate with #3); this may or may not be logged, see sk101649.

    5) Just the sync interface going down while being connected with a crossover cable should not have caused down/standby unless it is some kind of bug.
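
    For point 3, a quick resource snapshot to grab during or right after an event (plain Linux commands, nothing Check Point specific):

    Code:
    free -m                            # memory pressure
    df -h                              # full or hung filesystems
    dmesg | grep -iE "error|i/o|ata"   # kernel-reported disk/controller errors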
    --
    Third Edition of my "Max Power 2020" Firewall Book
    Now Available at http://www.maxpowerfirewalls.com

  11. #11
    Join Date
    2006-09-26
    Posts
    3,200
    Rep Power
    20

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by ShadowPeak.com View Post
    5) Just the sync interface going down while being connected with a crossover cable should not have caused down/standby unless it is some kind of bug.
    What are the interfaces on these servers? Intel, Broadcom, QLogic, etc.? Anything other than Intel is suspect, IMHO.

    ethtool -i ethX (where X is 0, 1, 2, etc...)

    lspci | grep Ethernet

    For example, these below are Intel NICs:

    [Expert@Power-1-P:0]# ethtool -i Mgmt | grep driver
    driver: e1000e
    [Expert@Power-1-P:0]# lspci | grep Ethernet
    03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
    04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
    08:00.0 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)
    08:00.1 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)
    0c:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    0c:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    0d:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    0d:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    0e:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    0e:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    0f:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    0f:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    11:00.0 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)
    11:00.1 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)
    [Expert@Power-1-P:0]#

  12. #12
    Join Date
    2006-03-08
    Location
    Lausanne
    Posts
    1,030
    Rep Power
    18

    Default Re: ClusterXL Issue with Failover

    Down/Standby can only happen if the Standby member cannot process the pnote from the failing member. Considering all of the above, I bet on some HW or SW failure that caused the machine to freeze. Really bad luck.
    -------------

    Valeri Loukine
    CCMA, CCSM, CCSI
    http://checkpoint-master-architect.blogspot.com/

  13. #13
    Join Date
    2006-09-26
    Posts
    3,200
    Rep Power
    20

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by jdmoore0883 View Post
    sk92804: Sync Redundancy in ClusterXL

    In short, bond some interfaces, set that bond to be the Sync, and connect each physical interface of the bond to a different switch on the same VLAN/broadcast domain.
    @jdmoore0883: Did you actually test this?

    Well, I have a ClusterXL R77.30 with JHFA 205, with a bonded interface (LACP 802.3AD, active/active) as the sync interface, connected to a SINGLE Cisco Catalyst switch. These are Intel NICs. I set the bond interface up as standard on the server running Check Point. The switch configuration (Cisco Catalyst 3750) is very standard (EtherChannel with spanning-tree portfast trunk, you know the usual stuff).

    Now my ClusterXL is in the normal Active/Standby state. I then log into the switch and shut down one of the switchports belonging to the bond. As soon as I did that, my cluster became Active/Down.

    That should NOT have happened, right? Well, guess what, it does :-(

    Thoughts?

  14. #14
    Join Date
    2014-11-14
    Location
    Ottawa Canada
    Posts
    364
    Rep Power
    9

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by cciesec2006 View Post
    @jdmoore0883: Did you actually test this?
    A - Personally, myself, with equipment at my immediate disposal? No. I don't have the personal funds to do so, and as a mere Diamond Engineer here at Check Point, we haven't got quite THAT level of non-Check-Point equipment in our lab. That being said, I myself have a customer with this configuration working fine, and have come across several other customers running that configuration without issue.

    Quote Originally Posted by cciesec2006 View Post
    ...my cluster became Active/Down.

    That should NOT have happened, right?
    A - On the face of it, I would tend to agree with you, but we need more details to be absolutely sure where the problem lies... Perhaps it is with the bonding itself; maybe the issue had nothing to do with the interfaces themselves... What was the problem noted in 'cphaprob -ia list'? What was the output of 'cphaprob -a if'? Can you otherwise confirm that the bonding itself is working as expected?
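
    For the next test run, this would be the minimum to capture on both members right after the port is shut (the same commands asked for above, plus the Linux bonding driver's own view of the bond):

    Code:
    cphaprob -ia list            # pnotes, including any in problem state
    cphaprob -a if               # per-interface status, including bond1
    cat /proc/net/bonding/bond1  # bond mode, slave states, LACP partner info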

  15. #15
    Join Date
    2006-03-08
    Location
    Lausanne
    Posts
    1,030
    Rep Power
    18

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by cciesec2006 View Post
    @jdmoore0883: Did you actually test this? ... As soon as I did that, my cluster became Active/Down. That should NOT have happened, right? Well, guess what, it does :-( Thoughts?
    cphaprob stat, cphaprob -a if and cphaprob -a list should show you why the pnote is generated and on which interface. If the bond is correctly configured and in the right LACP state, the cluster members should stay Active/Standby.

    Repeat the test and post the output of the mentioned commands; then we will be able to answer your question properly.
    -------------

    Valeri Loukine
    CCMA, CCSM, CCSI
    http://checkpoint-master-architect.blogspot.com/

  16. #16
    Join Date
    2006-09-26
    Posts
    3,200
    Rep Power
    20

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by varera View Post
    cphaprob stat, cphaprob -a if and cphaprob -a list should show you why the pnote is generated and on which interface. If the bond is correctly configured and in the right LACP state, the cluster members should stay Active/Standby.

    Repeat the test and post the output of the mentioned commands; then we will be able to answer your question properly.
    This is the standard configuration on both the Check Point gateways and the switch. Straightforward, not rocket science. bond1 is the sync interface in 802.3AD active/active mode:

    GW1:
    add bonding group 1
    add bonding group 1 interface eth4
    add bonding group 1 interface eth8
    set bonding group 1 mode 8023AD
    set bonding group 1 down-delay 200
    set bonding group 1 lacp-rate slow
    set bonding group 1 mii-interval 100
    set bonding group 1 up-delay 200
    set bonding group 1 xmit-hash-policy layer3+4
    set interface bond1 ipv4-address 192.0.2.1 mask-length 30

    GW2:
    add bonding group 1
    add bonding group 1 interface eth4
    add bonding group 1 interface eth8
    set bonding group 1 mode 8023AD
    set bonding group 1 down-delay 200
    set bonding group 1 lacp-rate slow
    set bonding group 1 mii-interval 100
    set bonding group 1 up-delay 200
    set bonding group 1 xmit-hash-policy layer3+4
    set interface bond1 ipv4-address 192.0.2.2 mask-length 30

    Switch:

    interface GigabitEthernet101/1/0/34
    description ETH4
    switchport
    switchport mode trunk
    load-interval 30
    spanning-tree portfast edge trunk
    channel-group 3 mode active

    interface GigabitEthernet101/1/0/35
    description ETH8
    switchport
    switchport mode trunk
    load-interval 30
    spanning-tree portfast edge trunk
    channel-group 3 mode active

    interface Port-channel3
    description BOND1
    switchport
    switchport mode trunk
    load-interval 30
    spanning-tree portfast edge trunk
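
    For what it's worth, the switch-side state of that channel during the test could be captured with the usual IOS commands (port-channel 3 as per the config above):

    Code:
    show etherchannel summary        ! Po3 should be (SU) with both member ports bundled (P)
    show lacp neighbor               ! LACP partner (the gateway) seen on each port
    show interfaces port-channel 3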

    @jdmoore0883, did your customer test it in PRODUCTION and verify that it works as expected? Did they actually shut down one of the interfaces that is part of the trunk and verify that the cluster is still Active/Standby?

  17. #17
    Join Date
    2006-03-08
    Location
    Lausanne
    Posts
    1,030
    Rep Power
    18

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by cciesec2006 View Post
    This is the standard configuration on both the Check Point gateways and the switch. Straightforward, not rocket science. bond1 is the sync interface in 802.3AD active/active mode:
    I know how to do this, thanks.

    I was asking about actual info concerning the issue you reported.

    Just to make sure we are on the same wavelength: I have many systems running as described, without Active/Down if just one of the bond links is broken. It should be quite easy to configure and make work, but from time to time there is something that needs to be fixed: a typo, a faulty cable, a broken port. Without the diagnostics I asked for above, we can only guess and brag.

    Granted, it is fun, but it is not productive and not helpful, in case you actually want to fix this thing.
    -------------

    Valeri Loukine
    CCMA, CCSM, CCSI
    http://checkpoint-master-architect.blogspot.com/

  18. #18
    Join Date
    2006-09-26
    Posts
    3,200
    Rep Power
    20

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by varera View Post
    ... Without the diagnostics I asked for above, we can only guess and brag. Granted, it is fun, but it is not productive and not helpful, in case you actually want to fix this thing.
    I think I know what the issue is but I don't have an answer.

    I changed the bond sync interface from 802.3AD active/active (EtherChannel) to active/standby (no EtherChannel). Guess what: if I shut down one of the interfaces that is part of the bond, the cluster stays Active/Standby :-). BTW, it does not matter which interface I shut down; my cluster is still Active/Standby. Therefore, I've confirmed that a cable, a typo, or a broken port is not the issue :-)

    Maybe 802.3AD active/active is not meant for the sync interface.
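
    For anyone following along, the change described is a one-liner in clish ("active-backup" is Gaia's name for an active/standby bond); a sketch:

    Code:
    set bonding group 1 mode active-backup
    show bonding groups
    save config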

  19. #19
    Join Date
    2006-03-08
    Location
    Lausanne
    Posts
    1,030
    Rep Power
    18

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by cciesec2006 View Post
    ... Maybe 802.3AD active/active is not meant for the sync interface.
    Actually no, there still can be an issue of LACP not forming properly. I am okay not to continue, but your logic here is flawed.
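
    The quickest way to see whether LACP actually negotiated on the gateway side is the bonding driver's status file (field names as printed by the Linux bonding driver):

    Code:
    cat /proc/net/bonding/bond1
    # With mode 8023AD, check the "802.3ad info" / "Active Aggregator Info" section:
    # "Number of ports" should match the bundled links, and an all-zeros
    # "Partner Mac Address" means LACP never formed with the switch.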
    -------------

    Valeri Loukine
    CCMA, CCSM, CCSI
    http://checkpoint-master-architect.blogspot.com/

  20. #20
    Join Date
    2006-09-26
    Posts
    3,200
    Rep Power
    20

    Default Re: ClusterXL Issue with Failover

    Quote Originally Posted by varera View Post
    Actually no, there still can be an issue of LACP not forming properly.
    Can you tell me what those issues might be? I've ruled out the switchport, cable, typo, and NIC. As mentioned before, the 802.3AD bond interface setup is a very simple one, which I posted earlier.

    Btw, I use the same switch ports, same cable to connect to my Cisco ASR routers for 802.3AD and it works fine.

    Therefore, I am interested to know what you think might be the issue here.

    Thanks,
