1. The default “dead” timer for ClusterXL is approximately 2.5 seconds. If the active member suffers a catastrophic failure (such as the power cord being pulled or a Gaia system crash/panic), the standby member must wait out the dead interval before concluding that the active member has failed and going active itself. During that wait, no traffic passes through the cluster. However, for administrative failovers using clusterXL_admin, or for partial failures (such as a single network interface being unplugged or a service-impacting command such as fw unloadlocal being run), failover to the standby should happen immediately with minimal packet loss.
2. If the active cluster member’s CPUs are running at 80% utilization or higher, the Cluster Under Load (CUL) mechanism is invoked by default on gateways running R77.30 and later, extending the ClusterXL dead timer from 2.5 seconds to 10 seconds. The purpose of CUL is to avoid spurious failovers caused by transient high CPU load on the active cluster member. The tradeoff is that if a catastrophic failure occurs on the active member while CUL is in effect, the standby member must wait roughly four times as long before taking over. To determine whether CUL is currently (or was previously) active on your cluster, run grep cul_load_freeze /var/log/messages*, as CUL logs all information about its operation to the gateway’s syslog. Keeping your cluster members properly tuned and optimized as outlined throughout this book can help them stay well below 80% CPU utilization and avoid invoking CUL.
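The grep check from item 2 can be wrapped in a small helper that reports how many CUL-related lines appear in each log file, which makes it easier to see whether CUL fired recently or only in older rotated logs. This is a minimal sketch, not a Check Point tool: the function name count_cul_events is hypothetical, and the only thing it relies on is the cul_load_freeze string mentioned above. It makes no assumption about the rest of the log line format.

```shell
# Hypothetical helper (not part of Gaia or ClusterXL): for each readable
# file passed as an argument, print "<file>: <number of matching lines>".
count_cul_events() {
    for f in "$@"; do
        # Skip anything that is missing or unreadable.
        [ -r "$f" ] || continue
        # grep -c prints the count of matching lines (0 if none).
        printf '%s: %s\n' "$f" "$(grep -c 'cul_load_freeze' "$f")"
    done
}

# Example invocation on a gateway, from expert mode:
# count_cul_events /var/log/messages*
```

A non-zero count only in older rotated files suggests CUL was triggered in the past but not recently. Non-zero counts in the current messages file are a hint that the active member may be crossing the 80% CPU threshold now.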