CPUG: The Check Point User Group

Thread: BGP Failover Time

  1. #1
    Join Date
    2017-08-03
    Posts
    8
    Rep Power
    0

    BGP Failover Time

    Hi guys,

    I'm having some strange issues with BGP failover times in my network.
    My topology is the following:
    Two CP firewalls in an HA cluster; each of them connects to a different switch and runs BGP with it.
    One of my test scenarios is powering off the active member and measuring how long it takes to restore my sessions, which run over BGP to an endpoint behind the switches.
    When powering off the active member, the failover took about 10 seconds.
    When powering off the passive member (which had become active), it took up to 30 seconds.
    What I don't understand is why I see such a big difference between the two. Shouldn't they be more or less the same?

  2. #2
    Join Date
    2008-07-31
    Location
    Netherlands, Europe
    Posts
    1,088
    Rep Power
    11

    Re: BGP Failover Time

    Switching members off is not the most common failure mode. Also test how long it takes when you fail over software-wise, e.g. by detaching a cable or by running clusterXL_admin down on the primary member.
    That said, how much time passed between powering the original active member back on and the second failover test, where you switched off the backup?
    Did the primary get enough time to learn all routes properly?
    Regards, Maarten.
    Dual P1 R77.30, VSX, IPSO, SPLAT, GAIA mostly.

  3. #3
    Join Date
    2006-09-26
    Posts
    2,974
    Rep Power
    13

    Re: BGP Failover Time

    Quote Originally Posted by msjouw
    Switching members off is not the most common failure mode
    This is a very common failure method. For example, the "Active" firewall takes a power hit and reboots, so the "Standby" firewall becomes active. It is more common than you think.

    By the way, which Check Point version and hotfix are you running?

  4. #4
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,321
    Rep Power
    8

    Re: BGP Failover Time

    That's a bit strange. I'm not following why you aren't peering with both switches from both cluster members. A cluster in HA acts like a single logical unit. Granted, only the active unit talks BGP, but generally the cluster members have the same config.

    Can you show the BGP config of both members?

    That being said, you haven't explained much about what you're seeing after the cluster failover. You'll need to show BGP information, as msjouw pointed out, to help us understand what the difference is.

  5. #5
    Join Date
    2017-08-03
    Posts
    8
    Rep Power
    0

    Re: BGP Failover Time

    Quote Originally Posted by msjouw
    Switching members off is not the most common failure mode. Also test how long it takes when you fail over software-wise, e.g. by detaching a cable or by running clusterXL_admin down on the primary member.
    That said, how much time passed between powering the original active member back on and the second failover test, where you switched off the backup?
    Did the primary get enough time to learn all routes properly?
    I waited until the original active member had booted to the point where I could log in to it.
    The original primary got time to learn the routes, yes.

    Quote Originally Posted by cciesec2006
    This is a very common failure method. For example, the "Active" firewall takes a power hit and reboots, so the "Standby" firewall becomes active. It is more common than you think.

    By the way, which Check Point version and hotfix are you running?
    I'm using CP 4400 appliances and the version is R77.30 (I don't remember the hotfix right now).

    Quote Originally Posted by jflemingeds
    That's a bit strange. I'm not following why you aren't peering with both switches from both cluster members. A cluster in HA acts like a single logical unit. Granted, only the active unit talks BGP, but generally the cluster members have the same config.

    Can you show the BGP config of both members?

    That being said, you haven't explained much about what you're seeing after the cluster failover. You'll need to show BGP information, as msjouw pointed out, to help us understand what the difference is.
    I'm not peering with both switches because of the lack of physical ports on the FW (all of the ports are in use, connected to different switches).
    Unfortunately I can't show the config right now because I don't have it with me, but I'll try to get it ASAP.
    It's quite straightforward: I'm using AS 10 and the peer is using AS 1.
    The peer is directly connected (P2P) and configured with a local address (the interface which connects to the switch), the ping option, the always keep alive option, logging, etc.
    The firewalls are configured identically except for the local address and the peer address.
    When I run tests such as disconnecting an interface, the failover times are much the same. The only big difference is when rebooting or shutting down the members.

  6. #6
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,321
    Rep Power
    8

    Re: BGP Failover Time

    Oh I see, so you work for CSNET and you're peering with Level 3, OK.

    I'm still not following the connections. I would expect you to need only one port per firewall, but maybe I'm missing something. BTW, you know about VLANs, right? ;) Anyway, have a nice weekend. I'm sure I'll hardly be compulsively hitting refresh at all on the CPUG site.

    Wait... I just noticed. Why is the local address different? Are you not peering to the VIP?

  7. #7
    Join Date
    2008-07-31
    Location
    Netherlands, Europe
    Posts
    1,088
    Rep Power
    11

    Re: BGP Failover Time

    Quote Originally Posted by jflemingeds
    Why is the local address different? Are you not peering to the VIP?
    Indeed, why are you not using the VIP? This is exactly what I was wondering.

    @cciesec2006 We run around 500 gateways for about 150 customers worldwide. In the civilized countries, where most customers put a UPS in front of the firewall, a power outage on one of the two members should not happen. In my experience, it is only in countries where power is very unreliable that we sometimes see a full power outage lasting at least an hour. All DC setups should have uninterrupted power available and should never see a power failure.
    Regards, Maarten.
    Dual P1 R77.30, VSX, IPSO, SPLAT, GAIA mostly.

  8. #8
    Join Date
    2017-08-03
    Posts
    8
    Rep Power
    0

    Re: BGP Failover Time

    Quote Originally Posted by jflemingeds
    Oh I see, so you work for CSNET and you're peering with Level 3, OK.

    I'm still not following the connections. I would expect you to need only one port per firewall, but maybe I'm missing something. BTW, you know about VLANs, right? ;) Anyway, have a nice weekend. I'm sure I'll hardly be compulsively hitting refresh at all on the CPUG site.

    Wait... I just noticed. Why is the local address different? Are you not peering to the VIP?
    What are you talking about??? If you don't want to help that's OK brother...
    The topology is the following:
    [Attached image: bgp.jpg]

    Each of the members peers with one of the switches using a unique address, which is the address of the interface that connects to that switch.
    I didn't use the VIP because the FWs are not connected to the same switch. Was that assumption wrong?

    Quote Originally Posted by msjouw
    Indeed, why are you not using the VIP? This is exactly what I was wondering.

    @cciesec2006 We run around 500 gateways for about 150 customers worldwide. In the civilized countries, where most customers put a UPS in front of the firewall, a power outage on one of the two members should not happen. In my experience, it is only in countries where power is very unreliable that we sometimes see a full power outage lasting at least an hour. All DC setups should have uninterrupted power available and should never see a power failure.
    I understand. We do have UPSs and separate power sources, but we must conduct tests that cover ALL possible scenarios, and this is one of them.
    By the way, it could also happen because of a hardware failure.

  9. #9
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,321
    Rep Power
    8

    Re: BGP Failover Time

    Quote Originally Posted by guyxgreen
    What are you talking about??? If you don't want to help that's OK brother...
    The topology is the following:
    [Attached image: bgp.jpg]

    You left out the AS numbers on your drawing. What is the router on top running? Also, why isn't there an interconnect between the switches? Do they not have the same VLANs on them? </cornfused>

    Look up who owns AS 1 and AS 10, and then read up on private AS numbers. It's a bit like pulling some random subnet out of the sky instead of using an RFC 1918 range.

  10. #10
    Join Date
    2017-08-03
    Posts
    8
    Rep Power
    0

    Re: BGP Failover Time

    Quote Originally Posted by jflemingeds
    You left out the AS numbers on your drawing. What is the router on top running? Also, why isn't there an interconnect between the switches? Do they not have the same VLANs on them? </cornfused>

    Look up who owns AS 1 and AS 10, and then read up on private AS numbers. It's a bit like pulling some random subnet out of the sky instead of using an RFC 1918 range.
    It's an isolated network that doesn't connect to the internet, so we don't really care about private AS numbers.
    The router on top is running OSPF, but it is out of scope here because the switches are L3 and they are the ones running BGP.
    The switches are not connected to each other because that is the client's topology and we can't change it. And yes, they have the same VLANs on them.

  11. #11
    Join Date
    2011-08-02
    Location
    http://spikefishsolutions.com
    Posts
    1,321
    Rep Power
    8

    Re: BGP Failover Time

    You can't expect clustering to work right if the firewalls are not on the same VLAN and able to communicate with each other. This is a pretty broken design. I think you should have a beer and rethink a lot of this. That's my plan at least (well, not the rethink part). Do you really even need BGP? Is anything else talking BGP? Just trying to understand the setup.

    Do you know all the routes on the other side of the L3 switches? Seems like OSPF would be a better bet there, but maybe that isn't an option since you don't have control of the switches.

    OK, one last question. The interfaces on the switches: are they on the same subnet?

  12. #12
    Join Date
    2017-08-03
    Posts
    8
    Rep Power
    0

    Re: BGP Failover Time

    Quote Originally Posted by jflemingeds
    You can't expect clustering to work right if the firewalls are not on the same VLAN and able to communicate with each other. This is a pretty broken design. I think you should have a beer and rethink a lot of this. That's my plan at least (well, not the rethink part). Do you really even need BGP? Is anything else talking BGP? Just trying to understand the setup.

    Do you know all the routes on the other side of the L3 switches? Seems like OSPF would be a better bet there, but maybe that isn't an option since you don't have control of the switches.

    OK, one last question. The interfaces on the switches: are they on the same subnet?
    I will try to explain it as best I can.
    I have a direct connection between the two FWs which is dedicated to synchronization, so sync traffic doesn't go through the switches. That's how the cluster works in our case.
    We use BGP because of a requirement to use a different routing protocol from the one the client runs in his internal network (OSPF).
    We redistribute our VLAN interfaces into BGP towards the switches; the switches redistribute them into OSPF, and vice versa.

  13. #13
    Join Date
    2017-08-03
    Posts
    8
    Rep Power
    0

    Re: BGP Failover Time

    I just want to point out that there's nothing wrong with the cluster itself.
    Failover works properly, but we see a difference in how long it takes to re-establish communication with the client network through BGP.
    A little reminder:
    Scenario 1 - We power off/reboot the active member while running a continuous ICMP ping from our PC to the client's PC. Failover takes place and communication comes back after 10 seconds.
    Scenario 2 - We power off/reboot the formerly passive member, which has become active, but this time it takes up to 30 seconds for communication to come back.
    The same pattern repeated in every test round we ran.
    So my question is: why? Is there a way to overcome this? I can't afford a 30-second outage.
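
One way to put comparable numbers on the two scenarios is to timestamp every ping probe and look for the longest gap in replies. The following is a minimal sketch, not something used in the thread: `longest_outage` and its input format are hypothetical, and it assumes you already have a time-ordered list of (timestamp in seconds, reply received) samples, for example from a ping sent once per second.

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest outage, in seconds, seen in a probe log.

    `samples` is a hypothetical, time-ordered sequence of
    (timestamp_seconds, reply_received) pairs. An outage is measured from
    the last successful probe before a run of losses to the first
    successful probe after it, so it is an upper bound on the real gap.
    Losses before the first success, or still ongoing at the end of the
    log, are not counted.
    """
    longest = 0.0
    last_ok = None      # timestamp of the most recent successful probe
    in_outage = False   # True while we are inside a run of lost probes
    for ts, ok in samples:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, ts - last_ok)
            last_ok = ts
            in_outage = False
        else:
            in_outage = True
    return longest
```

Running the same measurement for both failover directions, with the same probe interval, makes an asymmetry like the one described above easy to confirm or rule out.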

  14. #14
    Join Date
    2013-09-25
    Location
    Bucharest
    Posts
    597
    Rep Power
    4

    Re: BGP Failover Time

    Let's see :) What are the values of the keepalive and hold timers?

    Or simply put: post the BGP config of both the router and the other peer.

  15. #15
    Join Date
    2017-08-03
    Posts
    8
    Rep Power
    0

    Re: BGP Failover Time

    Quote Originally Posted by laf_c
    Let's see :) What are the values of the keepalive and hold timers?

    Or simply put: post the BGP config of both the router and the other peer.
    I can't access and post the configs right now, but I'll post the template I'm using:

    Cisco Configuration

    router bgp 1
     bgp router-id 100.100.100.1
     bgp log-neighbor-changes
     timers bgp 2 6
     neighbor BGP_1 peer-group
     neighbor BGP_1 remote-as 10
     neighbor BGP_1 update-source FastEthernet0/1
     neighbor 100.100.199.1 peer-group BGP_1
     !
     address-family ipv4
      neighbor BGP_1 soft-reconfiguration inbound
      neighbor 100.100.199.1 activate
      no auto-summary
      no synchronization
     exit-address-family

    CP Configuration

    set bgp external remote-as 1 peer 100.100.199.2 on
    local-address 100.100.199.1 on
    holdtime 6
    keepalive 2
    send-keepalives on
    accept-routes all
    log-warnings on
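
As a sanity check on the timers in this template, here is a small sketch of the arithmetic. It relies on two points from RFC 4271: each side uses the smaller of the two hold times exchanged in the OPEN messages, and a hold time must be either 0 or at least 3 seconds, with a keepalive interval of about one third of the hold time being the usual recommendation. The function names are hypothetical and exist only for illustration.

```python
def negotiated_hold_time(local_hold: int, peer_hold: int) -> int:
    """Hold time actually used on the session, per RFC 4271.

    Each speaker uses the smaller of its configured value and the value
    received in the peer's OPEN message. A result of 0 means that
    keepalives and the hold timer are disabled.
    """
    for value in (local_hold, peer_hold):
        if value != 0 and value < 3:
            raise ValueError("hold time must be 0 or at least 3 seconds")
    return min(local_hold, peer_hold)


def keepalive_fits(keepalive: int, hold: int) -> bool:
    """True if the keepalive interval is at most one third of the hold time."""
    return hold == 0 or keepalive * 3 <= hold


# Values from the template above: keepalive 2 and hold time 6 on both sides.
hold = negotiated_hold_time(6, 6)
print(hold, keepalive_fits(2, hold))
```

With these values a peer that goes silent should be declared down after roughly 6 seconds, so the hold timer alone would not seem to account for a 30-second outage.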

  16. #16
    Join Date
    2006-09-26
    Posts
    2,974
    Rep Power
    13

    Re: BGP Failover Time

    I've worked with both Check Point and Cisco and I have some experience with routing on both platforms, though I'm not an expert by any means.

    I am looking at the diagram you posted and I have the following questions.

    1- In the Check Point cluster in HA mode, which interface of FW1 is connected to switch SW1? Can you please provide the "ifconfig" output for that interface?
    2- In the Check Point cluster in HA mode, which interface of FW2 is connected to switch SW2? Can you please provide the "ifconfig" output for that interface?
    3- In the cluster topology, are those interfaces set as "Cluster" or as "Non-Monitored Private"?


    It is very unusual for SW1 and SW2 to share a common VLAN without a trunk between them.

  17. #17
    Join Date
    2013-09-25
    Location
    Bucharest
    Posts
    597
    Rep Power
    4

    Re: BGP Failover Time

    Here's what I would try: run a sniffer on both units for TCP/179 and paste what you see there.

  18. #18
    Join Date
    2006-03-08
    Location
    Lausanne
    Posts
    959
    Rep Power
    12

    Re: BGP Failover Time

    drouter syncs dynamic routes between the cluster members over port 2010. By default this should be allowed via implied rules. If you do not use those, the traffic can be blocked, which causes slow BGP convergence after a failover.
    -------------

    Valeri Loukine
    CCMA, CCSM, CCSI
    http://checkpoint-master-architect.blogspot.com/

  19. #19
    Join Date
    2017-08-03
    Posts
    8
    Rep Power
    0

    Re: BGP Failover Time

    Quote Originally Posted by varera
    drouter syncs dynamic routes between the cluster members over port 2010. By default this should be allowed via implied rules. If you do not use those, the traffic can be blocked, which causes slow BGP convergence after a failover.
    You were the closest to the real core of the problem.
    After contacting CP, I can finally say this issue is over! :)
    What we found out is that the passive member had not learned from the active member the BGP routes that the active member had in its routing table (apparently that's how CP works with BGP).
    The routing updates are sent over the synchronization interface(s) via port 2010, as varera mentioned.
    When we saw that nothing was being sent, we manually restarted the routing daemon on the passive member and rebooted the machine.
    After it powered on - voila!
    I could see every BGP route in the routing table marked as 'Kernel'.
    Now the BGP failover times are symmetric, and after some tuning and tweaking on both CP and Cisco I was able to bring them down to 10-12 seconds.
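
The check that exposed the problem, whether the standby member holds the same BGP routes as the active one, boils down to a set comparison. The sketch below assumes you have already collected the BGP prefixes from each member into plain lists of strings; how you export them is up to you, and `missing_on_standby` is a hypothetical helper, not a Check Point tool.

```python
import ipaddress
from typing import Iterable, Set


def missing_on_standby(active: Iterable[str], standby: Iterable[str]) -> Set[str]:
    """Return prefixes known to the active member but absent on the standby.

    Both arguments are hypothetical lists of IPv4/IPv6 prefixes in CIDR
    notation. Prefixes are normalised first so that "10.1.0.5/16" and
    "10.1.0.0/16" compare as the same network.
    """
    def normalise(prefixes: Iterable[str]) -> Set[str]:
        return {str(ipaddress.ip_network(p, strict=False)) for p in prefixes}

    return normalise(active) - normalise(standby)
```

An empty result before a failover test suggests the standby already has the routes it needs; a non-empty one points at the route synchronization between the members rather than at the BGP timers.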

  20. #20
    Join Date
    2013-09-25
    Location
    Bucharest
    Posts
    597
    Rep Power
    4

    Re: BGP Failover Time

    To sum up this story: was this a bug, or a missing feature that you enabled after the reboot?
