CPUG: The Check Point User Group

Thread: HA Failover appears to be caused by sync interface

  1. #1
    Join Date
    2018-04-18
    Posts
    38
    Rep Power
    0

    Hello,

    Yesterday my HA pair of Check Point 5800s experienced an unexpected failover. I was able to retrieve the local message logs and have included them below. If I am reading them correctly, it appears that the sync interface failed. Today everything is business as usual, and I was able to do an admin failover to the other cluster member and everything worked fine.

    I did notice that the sync interfaces on both of these gateways are set to auto-negotiate and are currently running at 100Mbps full duplex. Should I hard-code these sync interfaces to 1000Mbps full, or is there a reason they are at 100?

    Do you think the low speed setting could have caused the failover? If not, any other ideas?

    Thank you.

    CLUSTER MEMBER A

    Nov 14 18:24:25 PROBLEM DR Enabled; Master To Slave [Problem]


    Nov 14 18:11:00 2018 msgcu-intfw1 kernel: [fw4_0];fwh323_cpas_decide_mon_only: failed
    Nov 14 18:11:32 2018 msgcu-intfw1 last message repeated 54 times
    Nov 14 18:12:43 2018 msgcu-intfw1 last message repeated 63 times
    Nov 14 18:14:11 2018 msgcu-intfw1 last message repeated 61 times
    Nov 14 18:15:32 2018 msgcu-intfw1 last message repeated 11 times
    Nov 14 18:16:37 2018 msgcu-intfw1 last message repeated 25 times
    Nov 14 18:17:46 2018 msgcu-intfw1 last message repeated 17 times
    Nov 14 18:18:47 2018 msgcu-intfw1 last message repeated 22 times
    Nov 14 18:19:57 2018 msgcu-intfw1 last message repeated 2 times
    Nov 14 18:22:02 2018 msgcu-intfw1 last message repeated 11 times
    Nov 14 18:23:09 2018 msgcu-intfw1 kernel: [fw4_0];fwh323_cpas_decide_mon_only: failed
    Nov 14 18:23:25 2018 msgcu-intfw1 last message repeated 9 times
    Nov 14 18:24:08 2018 msgcu-intfw1 kernel: igb: Sync NIC Link is Down
    Nov 14 18:24:10 2018 msgcu-intfw1 kernel: [fw4_1];FW-1: fwha_process_state_msg: Update state of member id 1 to FAILURE due to the member report message
    Nov 14 18:24:10 2018 msgcu-intfw1 kernel: [fw4_1];FW-1: fwha_update_state: ID 1 (state STANDBY -> FAILURE) (time 91367.8)


    CLUSTER MEMBER B

    Nov 14 18:24:08 2018 msgcu-intfw2 kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to FAILURE due to pnote Interface Active Check (desc Sync interface is down, 7 interfaces required, only 6 up)
    Nov 14 18:24:08 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to FAILURE
    Nov 14 18:24:08 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
    Nov 14 18:24:08 2018 msgcu-intfw2 kernel: [fw4_1];fwha_state_change_implied: Try to update state to ACTIVE because member is down (the change may not be allowed).
    Nov 14 18:24:08 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
    Nov 14 18:24:10 2018 msgcu-intfw2 kernel: [fw4_1];fwha_state_change_implied: Try to update state to ACTIVE because member is down (the change may not be allowed).
    Nov 14 18:24:10 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: igb: Sync NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_0];FW-1: State synchronization is in risk. Please examine your synchronization network to avoid further problems !
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_0];FW-1: Please refer to documentation for details on this issue. Any change must be applied to ALL cluster members
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwldbcast_recv: delta sync connection with member 0 was lost and regained.2748 updates were lost.
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwldbcast_recv: received sequence 0x9e9ab6 (fragm 0, index 1), last processed seq 0x9e8ff9
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to ACTIVE due to pnote Interface Active Check (desc <NULL>)
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to STANDBY
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to STANDBY
    Nov 14 18:24:25 2018 msgcu-intfw2 routed[11633]: recv(header) returns 0
    Nov 14 18:24:25 2018 msgcu-intfw2 last message repeated 6 times
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_process_state_msg: Update state of member id 0 to FAILURE due to the member report message
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];fwha_set_backup_mode: Try to update local state to ACTIVE because of ID 0 is not ACTIVE or READY. (This attempt may be blocked by other machines)
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to READY
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to READY
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_update_state: ID 0 (state ACTIVE -> FAILURE) (time 28187.9)
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];member 1 (172.25.2.1) is down
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_state_change_implied: Try to update local state from READY to ACTIVE because all other machines confirmed my READY state
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to ACTIVE
    Nov 14 18:24:25 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to ACTIVE
    Nov 14 18:24:25 2018 msgcu-intfw2 routed[11633]: recv(header) returns 0
    Nov 14 18:24:25 2018 msgcu-intfw2 last message repeated 27 times
    Nov 14 18:24:25 2018 msgcu-intfw2 routed[11627]: recv(header) returns 0
    Nov 14 18:24:25 2018 msgcu-intfw2 routed[11633]: entering cpcl_vrf_master_init()
    Nov 14 18:24:25 2018 msgcu-intfw2 routed[11633]: leaving cpcl_master_init()
    Nov 14 18:24:25 2018 msgcu-intfw2 routed[11633]: cpcl_vrf_master_listen_accept(6294): entering cpcl_vrf_master_listen_accept
    Nov 14 18:24:25 2018 msgcu-intfw2 routed[11633]: cpcl_vrf_master_listen_accept(6383): leaving cpcl_vrf_master_listen_accept
    Nov 14 18:24:25 2018 msgcu-intfw2 routed[11633]: cpcl_vrf_recv_from_instance_manager(6109): instance 0 entering cpcl_vrf_recv_from_instance_manager
    Nov 14 18:24:25 2018 msgcu-intfw2 routed[11633]: cpcl_vrf_recv_from_instance_manager(6166): instance 0 received fd 26
    Nov 14 18:24:25 2018 msgcu-intfw2 routed[11633]: cpcl_vrf_recv_from_instance_manager(6267): instance 0 leaving cpcl_vrf_recv_from_instance_manager
    Nov 14 18:24:32 2018 msgcu-intfw2 kernel: [fw4_0];fwh323_cpas_decide_mon_only: failed
    Nov 14 18:24:42 2018 msgcu-intfw2 last message repeated 14 times
    Nov 14 18:24:42 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_process_state_msg: Update state of member id 0 to STANDBY due to the member report message
    Nov 14 18:24:42 2018 msgcu-intfw2 kernel: [fw4_1];FW-1: fwha_update_state: ID 0 (state FAILURE -> STANDBY) (time 28205.1)
    Nov 14 18:24:44 2018 msgcu-intfw2 kernel: [fw4_0];fwh323_cpas_decide_mon_only: failed
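
A minimal sketch for pulling the local ClusterXL state transitions out of logs like the ones above. It assumes the syslog layout shown in this post (month, day, time, then the rest of the message). The function name `local_state_changes` is made up for illustration, and it matches only the `Local machine state changed to` lines.

```shell
# Read syslog-style lines on stdin and print "<timestamp> <new state>"
# for every local ClusterXL state change. Lines that do not contain
# "Local machine state changed to" are ignored.
local_state_changes() {
    sed -n 's/^ *\([A-Z][a-z][a-z] [0-9 ][0-9] [0-9:]\{8\}\).*Local machine state changed to \([A-Z]*\).*/\1 \2/p'
}
```

Run against the messages file on a member (typically `/var/log/messages`), for example `local_state_changes < /var/log/messages`, this reduces the member B log above to its FAILURE, STANDBY, READY, ACTIVE sequence with timestamps.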

  2. #2
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,637
    Rep Power
    9

    I think you may want to replace that cable. Possibly double-check the sync interface config on both firewalls to make sure they're auto/auto (I think you did that already). That would be the correct setup for 1-gig copper. If they aren't picking up 1000/full, my guess would be that the cable is plain Cat 5 instead of Cat 5e or Cat 6, assuming there isn't a switch in between.
    Last edited by jflemingeds; 3 Weeks Ago at 11:48. Reason: auto / auto

  3. #3
    Join Date
    2018-04-18
    Posts
    38
    Rep Power
    0

    Quote (originally posted by jflemingeds):
    I think you may want to replace that cable. Possibly double-check the sync interface config on both firewalls to make sure they're auto/auto (I think you did that already). That would be the correct setup for 1-gig copper. If they aren't picking up 1000/full, my guess would be that the cable is plain Cat 5 instead of Cat 5e or Cat 6, assuming there isn't a switch in between.

    How can I replace the sync cable without causing the two security gateways to freak out and possibly end up with a split brain?

  4. #4
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,637
    Rep Power
    9

    It shouldn't, but if you want to be safe, do this on the standby:

    clusterXL_admin down

    replace cable

    clusterXL_admin up
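
The sequence above, sketched as a shell function for expert mode on the standby member. `clusterXL_admin`, `cphaprob stat`, and `cphaprob -a if` are standard ClusterXL commands; the function name `swap_sync_cable` and the pause prompt are illustrative only, and this has not been tested on a 5800.

```shell
# Sketch only: run on the STANDBY member, in expert mode.
# Defining the function changes nothing; it acts only when called.
swap_sync_cable() {
    cphaprob stat            # confirm this member is currently Standby
    clusterXL_admin down     # administratively mark this member down so it stays out of the way
    printf 'Replace the sync cable, then press Enter: '
    read -r reply            # wait for the operator; the value itself is not used
    clusterXL_admin up       # clear the administrative down state
    cphaprob -a if           # check that the sync interface is reported as up
    cphaprob stat            # confirm the pair is back to Active / Standby
}
```

Checking `cphaprob stat` before and after is the part worth keeping even if the steps are typed by hand: it shows whether the member really is standby before the cable is pulled and whether the cluster has settled afterwards.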

  5. #5
    Join Date
    2006-09-26
    Posts
    3,172
    Rep Power
    16

    Quote (originally posted by jflemingeds):
    It shouldn't, but if you want to be safe, do this on the standby:

    clusterXL_admin down

    replace cable

    clusterXL_admin up

    It could be the NIC itself. Check Point is notorious for using cheap hardware.

  6. #6
    Join Date
    2018-04-18
    Posts
    38
    Rep Power
    0

    Quote (originally posted by cciesec2006):
    It could be the NIC itself. Check Point is notorious for using cheap hardware.

    I replaced the cable twice with brand-new Cat 6 and the sync ports are still negotiating at 100/full. I also tried disabling the ports in the web GUI and re-enabling them, with the same result.

    If I turn off auto-negotiation and set each gateway to 1000/full, is there any way to see if they are actually operating at that speed?

  7. #7
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,637
    Rep Power
    9

    ethtool Sync (assuming that is the interface name).

    Reply with the output from both. It might be interesting to plug a laptop into the Sync port on both firewalls to see if it runs at 1000/full when set to auto.

    Also reply with the output of this. It will show the driver and version.

    ethtool -i Sync

  8. #8
    Join Date
    2006-09-26
    Posts
    3,172
    Rep Power
    16

    Quote (originally posted by mjensen):
    I replaced the cable twice with brand-new Cat 6 and the sync ports are still negotiating at 100/full. I also tried disabling the ports in the web GUI and re-enabling them, with the same result.

    If I turn off auto-negotiation and set each gateway to 1000/full, is there any way to see if they are actually operating at that speed?

    These are gig ports, so you should NOT do anything to them. They should work at 1G out of the box.

    As I've mentioned before, it looks like Check Point is using cheap-ass hardware. It looks like the sync port is broken. Do you have any spare interfaces on the 5800 that you can use? It comes with 8 interfaces, excluding the Sync interface, right? You can use other interfaces for sync. I don't think it has to be the Sync interface.

  9. #9
    Join Date
    2018-04-18
    Posts
    38
    Rep Power
    0

    Quote (originally posted by cciesec2006):
    These are gig ports, so you should NOT do anything to them. They should work at 1G out of the box.

    As I've mentioned before, it looks like Check Point is using cheap-ass hardware. It looks like the sync port is broken. Do you have any spare interfaces on the 5800 that you can use? It comes with 8 interfaces, excluding the Sync interface, right? You can use other interfaces for sync. I don't think it has to be the Sync interface.


    SECURITY GATEWAY 1

    [Expert@msgcu-intfw1:0]# ethtool Sync
    Settings for Sync:
    Supported ports: [ TP ]
    Supported link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    1000baseT/Full
    Supports auto-negotiation: Yes
    Advertised link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    1000baseT/Full
    Advertised auto-negotiation: Yes
    Speed: 100Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    Supports Wake-on: pumbg
    Wake-on: d
    Current message level: 0x00000007 (7)
    Link detected: yes
    [Expert@msgcu-intfw1:0]#


    [Expert@msgcu-intfw1:0]# ethtool -i Sync
    driver: igb
    version: 4.1.2
    firmware-version: 0. 6-2
    bus-info: 0000:08:00.0
    [Expert@msgcu-intfw1:0]#


    -------

    SECURITY GATEWAY 2

    [Expert@msgcu-intfw2:0]# ethtool Sync
    Settings for Sync:
    Supported ports: [ TP ]
    Supported link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    1000baseT/Full
    Supports auto-negotiation: Yes
    Advertised link modes: 10baseT/Half 10baseT/Full
    100baseT/Half 100baseT/Full
    1000baseT/Full
    Advertised auto-negotiation: Yes
    Speed: 100Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    Supports Wake-on: pumbg
    Wake-on: d
    Current message level: 0x00000007 (7)
    Link detected: yes
    [Expert@msgcu-intfw2:0]#



    [Expert@msgcu-intfw2:0]# ethtool -i Sync
    driver: igb
    version: 4.1.2
    firmware-version: 0. 6-2
    bus-info: 0000:08:00.0
    [Expert@msgcu-intfw2:0]#




    --------

    I will try connecting a laptop to the sync interfaces later today and see what happens
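
A rough way to flag this condition from `ethtool` output like the above: compare the negotiated speed with the highest speed the port advertises. This is a sketch that assumes the output layout shown in this post. `check_negotiated_speed` is a made-up name, and it looks only at what the local port advertises, not at what the link partner offers.

```shell
# Read `ethtool <iface>` output on stdin and report whether the negotiated
# speed is lower than the highest link mode the local port advertises.
check_negotiated_speed() {
    awk '
        /Advertised link modes:/      { in_adv = 1 }   # start of the advertised list
        /Advertised auto-negotiation:/ { in_adv = 0 }  # first line after the list
        in_adv {
            # Tokens look like 1000baseT/Full; adding 0 keeps the leading number.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+base/) {
                    s = $i + 0
                    if (s > max) max = s
                }
        }
        /Speed:/ { speed = $2 + 0 }                    # "100Mb/s" becomes 100
        END {
            if (speed < max)
                printf "DEGRADED: %dMb/s negotiated, %dMb/s advertised\n", speed, max
            else
                printf "OK: %dMb/s\n", speed
        }
    '
}
```

Piped the output shown above (for example `ethtool Sync | check_negotiated_speed`), it prints `DEGRADED: 100Mb/s negotiated, 1000Mb/s advertised` for both gateways, which matches what was observed by eye.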

  10. #10
    Join Date
    2006-09-26
    Posts
    3,172
    Rep Power
    16

    Quote (originally posted by mjensen):
    [Expert@msgcu-intfw1:0]# ethtool -i Sync
    driver: igb
    version: 4.1.2
    firmware-version: 0. 6-2
    bus-info: 0000:08:00.0
    [Expert@msgcu-intfw1:0]#

    [Expert@msgcu-intfw2:0]# ethtool -i Sync
    driver: igb
    version: 4.1.2
    firmware-version: 0. 6-2
    bus-info: 0000:08:00.0
    [Expert@msgcu-intfw2:0]#

    Is it just me, or is the firmware on the NIC really old? I don't have a 5800, but mine looks much newer even though my NIC is already three years old:

    ethtool -i eth8
    driver: igb
    version: 4.1.2
    firmware-version: 1.7, 0x80000d38, 17.5.10
    bus-info: 0000:01:00.0

  11. #11
    Join Date
    2018-04-18
    Posts
    38
    Rep Power
    0

    Quote (originally posted by cciesec2006):
    Is it just me, or is the firmware on the NIC really old? I don't have a 5800, but mine looks much newer even though my NIC is already three years old:

    ethtool -i eth8
    driver: igb
    version: 4.1.2
    firmware-version: 1.7, 0x80000d38, 17.5.10
    bus-info: 0000:01:00.0

    I was under the impression that drivers got updated with Jumbo Hotfixes. Is this not the case?

  12. #12
    Join Date
    2007-03-30
    Location
    DFW, TX
    Posts
    270
    Rep Power
    12

    Drivers might, sure. It looks like firmware is not, though. Not sure how much that matters.
    Zimmie

  13. #13
    Join Date
    2018-04-18
    Posts
    38
    Rep Power
    0

    Quote (originally posted by Bob_Zimmerman):
    Drivers might, sure. It looks like firmware is not, though. Not sure how much that matters.

    I had the same issue happen again this morning. Is it possible for me to update the drivers for these NICs, or is that something only Check Point can do?

  14. #14
    Join Date
    2018-04-18
    Posts
    38
    Rep Power
    0

    Am I supposed to be using a straight-through cable for the sync interface or a crossover cable? Some clusters in my environment use straight-through and others use crossover. I don't know if this is significant or not.

  15. #15
    Join Date
    2006-09-26
    Posts
    3,172
    Rep Power
    16

    Quote (originally posted by mjensen):
    Am I supposed to be using a straight-through cable for the sync interface or a crossover cable? Some clusters in my environment use straight-through and others use crossover. I don't know if this is significant or not.

    It makes no difference whether you use straight-through or crossover cables. The NIC can detect both.

    Why not move your sync interface to another unused port?

  16. #16
    Join Date
    2007-03-30
    Location
    DFW, TX
    Posts
    270
    Rep Power
    12

    Sometimes, you can get newer drivers from the TAC than are currently shipping in generally-available versions. For a while, the shipping e1000 version (7.3.15-NAPI) was pretty janky, and a newer version the TAC had (7.6.15.5) solved a lot of strange traffic problems. I haven't heard of that kind of problem since Check Point switched to the igb driver, but I also haven't been looking that hard.

    As for straight-through versus crossover, the overwhelming majority of copper gigabit ports have Auto-MDI/X. This lets them negotiate which pair is used for what. It's practically required, since gigabit uses all four pairs. As a result of this, you generally don't need to care about straight-through versus crossover anymore except in specific weird situations.

    I highly, highly recommend moving your sync to a bond. Any time I deploy a firewall, almost all of its interfaces are bonds, even if the bond only has one member. This lets me control which physical port backs a given logical interface very easily. It's particularly useful for sync because you can start with a single direct cable between the firewalls, then you can wire another interface with a switch between the firewalls and add it to the bond. You can then make your sync arbitrarily redundant.
    Zimmie

  17. #17
    Join Date
    2018-04-18
    Posts
    38
    Rep Power
    0

    Quote (originally posted by Bob_Zimmerman):
    Sometimes, you can get newer drivers from the TAC than are currently shipping in generally-available versions. For a while, the shipping e1000 version (7.3.15-NAPI) was pretty janky, and a newer version the TAC had (7.6.15.5) solved a lot of strange traffic problems. I haven't heard of that kind of problem since Check Point switched to the igb driver, but I also haven't been looking that hard.

    As for straight-through versus crossover, the overwhelming majority of copper gigabit ports have Auto-MDI/X. This lets them negotiate which pair is used for what. It's practically required, since gigabit uses all four pairs. As a result of this, you generally don't need to care about straight-through versus crossover anymore except in specific weird situations.

    I highly, highly recommend moving your sync to a bond. Any time I deploy a firewall, almost all of its interfaces are bonds, even if the bond only has one member. This lets me control which physical port backs a given logical interface very easily. It's particularly useful for sync because you can start with a single direct cable between the firewalls, then you can wire another interface with a switch between the firewalls and add it to the bond. You can then make your sync arbitrarily redundant.


    Hello,

    That is very interesting. I didn't know I could make a bond interface with only one member and I like your reason for doing it with sync interfaces.

    After several hours of troubleshooting with support, we determined that cluster member B was not capable of operating at 1000Mbps/full: every time we tried to set it manually, we received a message to the effect that it was not supported or capable. Based on this, Check Point support RMA'd me a new 5800.

    I have since installed R77.30, its new license, and the latest GA Jumbo Hotfix on the replacement, and connected it to the network. I checked the link speed on the sync interfaces, and it appears the RMA did not resolve this issue: the sync interfaces are still only negotiating at 100Mbps/full.

    I have a more pressing problem at the moment with this cluster.

    On the new gateway (gateway B) I enabled ClusterXL through "cpconfig" (I must have forgotten to select the answer indicating this gateway was going to be a member of a cluster during the install), and then proceeded with the required reboot to have the change take effect. After the reboot, security gateway B made itself active, and both security gateways reported being in a state of active attention. This caused an outage for my organization :( When I ran "cphaprob stat" on security gateway B, it showed only itself and not the other member.

    To stop the outage, I immediately issued the "halt" command on security gateway B.

    Now I am stuck with security gateway B physically disconnected from all network cables to keep it from trying to become active (and causing another outage), and I don't know how I can safely bring security gateway B into the cluster.

    I really appreciate any suggestions / recommendations.

    I have thought about connecting to the console port of SG 2, doing a "clusterXL_admin down", then going into SmartConsole and removing SG 2 from the cluster, pushing policy, logging in to SG 2 and doing a "clusterXL_admin up", then re-adding SG 2 to the cluster in SmartConsole and applying policy. Maybe SG 2 will then join the cluster properly?

    If I remove a security gateway from a cluster, does that take away the VIP for all or any of the interfaces?

    Again your help is greatly appreciated.

  18. #18
    Join Date
    2018-04-18
    Posts
    38
    Rep Power
    0

    I was able to get my cluster back together correctly without a service interruption. Now I'm just back to the original problem of the sync interfaces not operating at 1000Mbps/full. I will try moving the sync interfaces to different physical ports and see what happens.
