| CPUG | |
| The Check Point User Group | |
| A Resource For The Check Point Community. Fast. Useful. Independent. | |
| |||
| I have a pair of SecurePlatform NGx R65 gateways with HFA_02 and hf_49 running in Active/Active Unicast mode. The policy on the gateways is "Any Any Accept log". This is what I am seeing from gateway #1: [Expert@NGx-gw1]# cphaprob state Cluster Mode: Load Sharing (Unicast) Number Unique Address Assigned Load State 1 (local) 192.2.0.1 30% Active (pivot) 2 192.2.0.2 70% Active [Expert@NGx-gw1]# This is what I am seeing from gateway #2: [Expert@NGx-gw2]# cphaprob state Cluster Mode: Load Sharing (Unicast) Number Unique Address Assigned Load State 1 192.2.0.1 0% Active (pivot) 2 (local) 192.2.0.2 100% Active [Expert@NGx-gw2]# Gateway #1 shows the correct load but gateway #2 does not. Why? If I reboot gateway #2, it comes back with the right 30%/70% load. However, if I reboot gateway #1, gateway #2 shows 0%/100%, with gateway #2 taking 100% of the load. Has anyone seen this before? Thanks. |
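The two views above can be compared mechanically. What follows is a minimal Python sketch, not a Check Point tool: it assumes the `cphaprob state` rows look exactly like the ones pasted in this post, and the helper names `parse_state` and `load_mismatches` are made up for illustration.

```python
import re

# One member row, e.g. "1 (local) 192.2.0.1 30% Active (pivot)".
# The layout is assumed from the output pasted in the post above.
ROW = re.compile(
    r"^\s*(\d+)\s+(?:\(local\)\s+)?(\d+(?:\.\d+){3})\s+(\d+)%\s+(.+?)\s*$"
)


def parse_state(text):
    """Return {member address: assigned load percent} from `cphaprob state` text."""
    loads = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            loads[match.group(2)] = int(match.group(3))
    return loads


def load_mismatches(view_a, view_b):
    """Return the addresses whose assigned load differs between two members' views."""
    return sorted(
        addr
        for addr in view_a.keys() | view_b.keys()
        if view_a.get(addr) != view_b.get(addr)
    )


GW1 = """\
Cluster Mode: Load Sharing (Unicast)
Number Unique Address Assigned Load State
1 (local) 192.2.0.1 30% Active (pivot)
2 192.2.0.2 70% Active
"""

GW2 = """\
Cluster Mode: Load Sharing (Unicast)
Number Unique Address Assigned Load State
1 192.2.0.1 0% Active (pivot)
2 (local) 192.2.0.2 100% Active
"""

if __name__ == "__main__":
    print(load_mismatches(parse_state(GW1), parse_state(GW2)))
```

With the two outputs from this post, both member addresses are reported, because each gateway sees a different load split for both members.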
| |||
| Sounds like one of the gateways isn't getting the right CCP or sync information. Check the interface statuses (cphaprob -a if and cphaprob -i list) and look for any errors reported there. Check that both boxes are SPLAT and not SPLAT Pro (or vice versa). Also, change your Cluster Control Protocol from multicast to broadcast and see if that helps (particularly if you're using Cisco or cheap switches). Finally, are there any other NGX clusters on the same segment as this one? In rare cases, the cluster IDs for the broadcast traffic can interfere with each other. |
| |||
| Quote:
- Both gateways are running SecurePlatform, NOT SPLAT Pro - The gateways are connected to Cisco Catalyst 6513 switches with MFC-720, so I don't think these are cheap switches - I enabled "spanning-tree portfast" on access ports and "spanning-tree portfast trunk" on trunk ports - CCP is set to broadcast on both gateways when I have this issue: [Expert@NGx-gw1]# cphaprob -a if Required interfaces: 3 Required secured interfaces: 1 eth0 UP non sync(non secured), broadcast eth2 UP sync(secured), broadcast eth1 UP non sync(non secured), broadcast (eth1.140 ) Virtual cluster interfaces: 3 eth0 192.168.15.193 eth1.140 192.168.192.1 eth1.150 192.168.193.1 [Expert@NGx-gw1]# [Expert@NGx-gw2]# cphaprob -a if Required interfaces: 3 Required secured interfaces: 1 eth0 UP non sync(non secured), broadcast eth2 UP sync(secured), broadcast eth1 UP non sync(non secured), broadcast (eth1.140 ) Virtual cluster interfaces: 3 eth0 192.168.15.193 eth1.140 192.168.192.1 eth1.150 192.168.193.1 [Expert@NGx-gw2]# One other thing I noticed: 1- I reboot gw1, and gw2 gets all the traffic. That's normal. 2- After gw1 comes back online, gw2 shows 100% load for itself and 0% for gw1, while gw1 shows 30% for gw1 and 70% for gw2. But if I try to ssh to a host behind the firewall, gw1 gets all the traffic and nothing goes across gw2. 3- After rebooting gw2, everything comes back to normal. If I try to ssh to a host behind the firewall, I see traffic on both gw1 and gw2, as confirmed with tcpdump. Weird. Last edited by cciesec2006; 2008-08-02 at 19:16. |
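The `cphaprob -a if` output from each member can be checked the same way. This is a rough Python sketch under the assumption that the interface rows have the shape pasted above; `parse_interfaces` and `interface_problems` are hypothetical helper names, and the checks are only the obvious ones (every interface UP, one CCP mode throughout, at least one sync interface).

```python
import re

# One interface row, e.g. "eth2 UP sync(secured), broadcast".
# The layout is assumed from the output pasted in the post above.
IF_ROW = re.compile(
    r"^\s*(\S+)\s+(UP|DOWN)\s+(non sync|sync)\((?:non secured|secured)\),\s*(\w+)"
)


def parse_interfaces(text):
    """Return {interface: {"state", "sync", "ccp"}} from `cphaprob -a if` text."""
    rows = {}
    for line in text.splitlines():
        match = IF_ROW.match(line)
        if match:
            name, state, role, ccp = match.groups()
            rows[name] = {"state": state, "sync": role == "sync", "ccp": ccp}
    return rows


def interface_problems(rows):
    """Return a list of human-readable findings; an empty list means nothing obvious."""
    problems = [
        f"{name} is {row['state']}"
        for name, row in rows.items()
        if row["state"] != "UP"
    ]
    modes = {row["ccp"] for row in rows.values()}
    if len(modes) > 1:
        problems.append("mixed CCP modes: " + ", ".join(sorted(modes)))
    if not any(row["sync"] for row in rows.values()):
        problems.append("no sync interface found")
    return problems


SAMPLE = """\
Required interfaces: 3
Required secured interfaces: 1
eth0 UP non sync(non secured), broadcast
eth2 UP sync(secured), broadcast
eth1 UP non sync(non secured), broadcast (eth1.140 )
"""

if __name__ == "__main__":
    print(interface_problems(parse_interfaces(SAMPLE)))
```

For the output shown in this post the list comes back empty, which fits the observation that the interfaces look healthy on both members even while the load figures disagree.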
| |||
| Cisco 6513.... I think that's where the issue is. I recall some strange things about the way that 6500 switches and ClusterXL go together. Check the ARP caches, and I'd also start looking at the CCP broadcast packets themselves - I have some vague memory of seeing something like this where the 6513 either got confused because of the ARP entries or blocked the CCP packets on the VLANs. I guess the other thing to check is that the sync traffic is actually getting between the devices. Test a transparent failover with something that uses a data connection (ftp is good for this...). Test a failover and a failback. Use the fw ctl pstat and fw tab -t connections -s commands to see if connections are actually getting synced. Finally, is your sync interface a crossover cable or another switch/VLAN? I'd recommend a test where you plug the sync network into a hub/switch rather than a crossover, to see if there's an issue with confusion about sync interface failures. Good luck, from experience I know that CXL can be a real pain to troubleshoot. My final question would be to assess the need for an Active/Active setup. I always worry about this when there are only 2 nodes, because while it buys you performance, it potentially exposes you to a non-redundant solution. Assume both boxes get to 60% utilisation, and one dies. 120% of traffic doesn't go through one box well, and you now actually have a single point of failure again when the box trying to manage 120% of traffic falls over.... |
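The 60% + 60% = 120% point generalises to any cluster size. Here is a small Python sketch of that arithmetic, assuming load is spread evenly and simply redistributes across the surviving members; the function names are illustrative only.

```python
def load_after_failure(per_member_pct, members, failed=1):
    """Utilisation of each survivor if `failed` members die and load spreads evenly."""
    survivors = members - failed
    if survivors <= 0:
        raise ValueError("no surviving members")
    return per_member_pct * members / survivors


def max_safe_utilisation(members, failed=1):
    """Highest per-member utilisation that still fits on the survivors at 100%."""
    survivors = members - failed
    if survivors <= 0:
        raise ValueError("no surviving members")
    return 100 * survivors / members


if __name__ == "__main__":
    print(load_after_failure(60, 2))   # 120.0: the case described above
    print(max_safe_utilisation(2))     # 50.0: ceiling for a two-node cluster
```

In other words, a two-node load-sharing cluster only stays redundant while each member runs at or below half of its capacity.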
| |||
| Thank you Thorpuse. Just so you know, I have an identical setup with R55 gateways connected to another 6513 switch and I have no issue whatsoever. But I took your advice and connected the NGx R65 gateways to a Catalyst 2960. It did not resolve the issue. In most of the implementations I've worked with over the past eight years, we have always used a crossover cable for the sync interface. If you connect the sync interface to a dedicated hub/VLAN/switch, you introduce another point of failure, don't you think? Running Active/Active is not my idea. It was imposed upon me by my boss. For the past five years, the firewall has never exceeded 20% CPU utilization and the maximum connection count has never exceeded 11k. But my boss thinks "hey, we already purchased the ClusterXL load-sharing license, let's use it." While I don't like Active/Active myself, I do see a benefit in the sense that you know both firewalls are working, versus Active/Standby. When the active firewall goes down, you do not know how the standby firewall will react because it is "standby". There could be a hidden problem with the standby that we just may not know about. This is especially true in a high throughput/connection environment. I've seen it happen on the Nokia IP1220 myself. The active Nokia crashed and failed over to the standby Nokia via VRRP, and the standby firewall crashed as well. |
| |||
| Certainly sounds like the sync interface connection is the issue then, not the switch... I'd be interested to see if all the interfaces can see each other properly in the 100/0 position. It seems to me that a NIC isn't coming up correctly or isn't being detected at the right stage in the firewall start process. If you can create the fail condition again, I'd suggest checking connectivity on all interfaces using the members' own IPs - I suspect that you'll find one of them won't be able to ping/arp correctly for the other's IP, or even that the interface isn't coming up correctly during the failback. One other thought - are the members' IP addresses on the same subnets as the cluster IPs, or different ones? If different, I wonder if the routing needed for the ARP publications is being missed. HTH, this problem sounds really familiar to me, but I can't recall exactly where I've seen it before.... |