Dear Mary,
Dear Audience,
We discovered multiple issues:
- Interrupt while installing the Policy [solved]:
Check Point created a hotfix (Hotfix 603) which solved this problem. CP told us the hotfix would be integrated into HFA05 as soon as possible. HFA05 arrived on 22nd February, but according to the release notes the hotfix was not included. We're looking forward to HFA06 and hope they will integrate it into a later version.
- Failover when the Standby-Node is active [solved]:
This may be more of a ClusterXL-specific problem. We're using the "Maintain current Active Gateway" option to prevent the FW from switching back after a failover has occurred and the node with the higher priority becomes available again (see ClusterXL Userguide p. 65). In the end we set the parameter "fwha_freeze_state_machine_timeout" (read more below) to 60 seconds, which solved this problem.
- Failover causes a loss of connectivity between 5 and 60 seconds [unsolved].
First, there are "2 types" of failover. Earlier we triggered a failover by shutting down an interface on one node. With this method a loss of connectivity of 50-60 seconds occurred immediately. CP then told us to use the command "clusterXL_admin up/down" to trigger the failover. With this method the connection stays up for 40 seconds and then 5 pings are lost (voodoo?).
You can decide for yourself which method better reflects the nature of a real failover. (We're running VoIP through this FW, so 5 seconds is very much an upper limit.)
How can "High Availability" be interpreted in seconds??
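To put a number on each failover test, a timestamped probe log can be reduced to outage durations. The following is only a minimal sketch: the function name `outages` and the sample format (one `(timestamp, reply_received)` pair per probe) are my own assumptions, not anything CP provides, and the example data simply mimics the "40 seconds fine, then 5 pings lost" behaviour described above.

```python
from typing import List, Tuple

# Hypothetical sample format: (timestamp in seconds, True if a reply arrived).
Sample = Tuple[float, bool]


def outages(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Return (start_time, duration) for every run of lost probes.

    The duration runs from the first lost probe to the next answered one,
    so its resolution is limited by the probe interval.
    """
    found = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # first lost probe of a new outage
        elif ok and start is not None:
            found.append((start, ts - start))
            start = None                    # connectivity is back
    if start is not None:
        # Still down when the log ends: report what was observed so far.
        found.append((start, samples[-1][0] - start))
    return found


# One probe per second; probes 40..44 are lost, as in the clusterXL_admin test.
example = [(float(t), not 40 <= t < 45) for t in range(50)]
for began, lasted in outages(example):
    print(f"outage began at t={began:.0f}s and lasted {lasted:.0f}s")
```

With a one-second probe interval this cannot resolve anything finer than a second; for VoIP a shorter interval would be needed.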
- GateD does not send all routes to all OSPF neighbours [unsolved]:
Our GateD update is another tragic episode. First CP told us to update the CPadvr-R60**.rpm using rpm -Uhv.. but some part of the post-installation script crashed with a segfault. Then CP responded that we have to erase the old package first and THEN install the new rpm (-ihv). But this didn't work either (the same post-installation script segfaulted). So we extracted all files from both rpms to temporary directories and compared the md5sums of all files. Only the binary of the GateD daemon itself had changed, so we updated just this file.
cpstart started the new gated without problems and our tests showed that all OSPF neighbours now had all routes. A day later (my headache caused by the one-more-cp-problem-is-solved-champagne is almost gone) I discovered that the routes were gone again. I'm now investigating why the routes have disappeared.
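For anyone who wants to repeat the md5sum comparison from the GateD item, here is a rough sketch of the idea. It assumes both packages have already been unpacked into two directories; the function names and the directory layout are illustrative only and not part of any CP tooling.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_tree(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its MD5 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.md5(path.read_bytes()).hexdigest()
    return digests


def changed_files(old_root: Path, new_root: Path) -> List[str]:
    """List files whose content differs, or that exist in only one tree."""
    old, new = md5_tree(old_root), md5_tree(new_root)
    return sorted(
        name for name in old.keys() | new.keys()
        if old.get(name) != new.get(name)
    )
```

Calling `changed_files(Path("old_pkg"), Path("new_pkg"))` on two such directories (hypothetical names) returns the relative paths that differ; in our case that list would have contained only the gated binary.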
Sidenote:
We started with one problem, found 3 more and have now solved 2/4 problems (in a year!!). (I really don't want to think about extrapolating this to a lifecycle of 3 years...)
We should start selling our troubleshooting/bugtracking work to CP.
This product simply does not work as specified. (Have you ever bought a car whose engine died for a few seconds whenever you shifted gears? I mean, policy installation is essential for a firewall, as is failover behavior for a cluster. How the heck does CP test their products??)
Best wishes,
Manuel
fwha_freeze_state_machine_timeout:
State synchronization during policy installation may, in certain cases, cause a cluster member to initiate a failover. To prevent this situation, you can modify the security gateway global parameter fwha_freeze_state_machine_timeout. This parameter sets the number of seconds, during policy installation, in which no state synchronization will be performed. You should set this parameter to the shortest period needed to eliminate the issue; the recommended value is 30 seconds.
This parameter is not related to the synchronization mechanism in any way. It is related to what Check Point calls the "state machine". The "state machine" is responsible for determining the state of each machine, i.e. whether the machine is active, standby or down. When the state of a machine changes, a failover results. During policy installation there are cases in which the state changes, and consequently an unwanted failover may occur. Correctly setting fwha_freeze_state_machine_timeout should prevent the unwanted failover.
Correctly setting fwha_freeze_state_machine_timeout should also prevent unwanted failovers in 3rd party environments, especially in cases in which the 3rd party environment may bring the cluster down during policy installation. In 3rd party environments, the state of the cluster member is determined by the 3rd party environment. In ClusterXL, by contrast, the state of the cluster member is determined by the ClusterXL state machine code, which may cause unwanted failovers during policy installation.
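As far as I know, kernel parameters like this one are set at runtime with "fw ctl set int fwha_freeze_state_machine_timeout 60" and made persistent with a name=value line in $FWDIR/boot/modules/fwkern.conf; please verify both against the Check Point documentation for your version before relying on them. Under that assumption, the sketch below shows one way to add or update such a line without creating duplicates. The helper name `set_kernel_param` is hypothetical, and it only transforms text; writing the file back is left to the reader.

```python
def set_kernel_param(conf_text: str, name: str, value: int) -> str:
    """Return conf_text with exactly one 'name=value' line for the parameter.

    Assumes a simple one-setting-per-line 'name=value' format (an assumption
    about fwkern.conf, not confirmed by the text above).
    """
    wanted = f"{name}={value}"
    result = []
    replaced = False
    for line in conf_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key == name:
            if not replaced:
                result.append(wanted)   # update the first occurrence in place
                replaced = True
            # any further occurrences are dropped as duplicates
        else:
            result.append(line)         # keep unrelated settings untouched
    if not replaced:
        result.append(wanted)           # parameter was not present yet
    return "\n".join(result) + "\n"


# 60 seconds is the value that worked for us; the text above recommends 30.
print(set_kernel_param("", "fwha_freeze_state_machine_timeout", 60), end="")
```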