I ran into an odd ClusterXL issue that I have never seen before, and I am hoping to get a better understanding of the failover process and why this would happen. Any comments and answers are greatly appreciated.
Current setup:
2x open servers running Gaia R77.30 in Active/Standby (the sync interface is a crossover cable)
fw1 = active
fw2 = standby
monitored interfaces:
eth0
eth1
eth2
eth3
eth4
sync interface:
eth5 (via crossover)
Issue:
Everything started out normal:
Fw1 was active.
Fw2 was standby.
Then, out of the blue, we lost all access to and from our networks through this firewall. We were dead in the water.
So I had to console into the firewalls.
I first consoled into Fw1 and ran "cphaprob stat" which showed:
Code:
Number     Unique Address  Assigned Load   State
1 (local)  x.x.x.1         0%              Down
2          x.x.x.2         0%              Standby
I have never seen this before and it looks like Fw1 failed and Fw2 never took over as the Active member.
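To catch this condition without reading the output by eye, a small Python sketch along these lines might help. It only parses the text pasted above; `parse_members` and `has_active_member` are hypothetical helper names, and the row layout is assumed to match this sample rather than taken from any Check Point documentation.

```python
import re

# Sample text copied from the "cphaprob stat" output in this post.
# Assumption: real output uses the same row layout; extra header
# lines are simply skipped because they do not start with a number.
SAMPLE = """\
Number Unique Address Assigned Load State
1 (local) x.x.x.1 0% Down
2 x.x.x.2 0% Standby
"""

ROW = re.compile(
    r"^\s*(?P<num>\d+)\s+(?P<local>\(local\)\s+)?"
    r"(?P<addr>\S+)\s+(?P<load>\d+%)\s+(?P<state>.+?)\s*$"
)


def parse_members(text):
    """Return one dict per cluster member row found in the text."""
    members = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            members.append({
                "number": int(m.group("num")),
                "local": m.group("local") is not None,
                "address": m.group("addr"),
                "load": m.group("load"),
                "state": m.group("state"),
            })
    return members


def has_active_member(members):
    """True if at least one member reports a state starting with 'Active'."""
    return any(m["state"].lower().startswith("active") for m in members)


members = parse_members(SAMPLE)
if not has_active_member(members):
    print("WARNING: no cluster member reports Active")
```

Run against the output above, this prints the warning, because one member is Down and the other is Standby.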
I then consoled into Fw2, and all it showed was a blank black screen: no prompt, nothing. The power indicator on the server appeared to be showing a red light. (From what I have read, I would have to open the server and check the motherboard for diagnostic lights.)
I rebooted both boxes and everything came back up fine.
I then started looking through the logs, and on Fw1 I found these entries, which appear to show what triggered the failure:
Code:
fw1 kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to FAILURE due to pnote Interface Active Check (desc eth5 interface is down, member 2 (x.x.x.2) reports more interfaces up)
fw1 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(0) to FAILURE
fw1 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
fw1 kernel: [fw4_1];fwha_state_change_implied: Try to update state to ACTIVE because member is down and state might should be changed
fw1 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
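To make the sequence in these entries easier to follow, here is a rough Python sketch that pulls out each attempted state change and each actual state change. The regular expressions are based only on the five lines pasted above, and `summarize` and `failed_attempts` are hypothetical helper names, so treat this as an illustration rather than a general ClusterXL log parser.

```python
import re

# The five kernel log lines from this post, copied verbatim.
LOG = """\
fw1 kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to FAILURE due to pnote Interface Active Check (desc eth5 interface is down, member 2 (x.x.x.2) reports more interfaces up)
fw1 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(0) to FAILURE
fw1 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
fw1 kernel: [fw4_1];fwha_state_change_implied: Try to update state to ACTIVE because member is down and state might should be changed
fw1 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
"""

ATTEMPT = re.compile(r"Try to update state to (\w+)")
CHANGED = re.compile(r"Local machine state changed to (\w+)")
PNOTE = re.compile(r"due to pnote (.+?) \(desc (.+)\)\s*$")


def summarize(log_text):
    """Return (kind, state, reason) tuples in log order."""
    events = []
    for line in log_text.splitlines():
        attempt = ATTEMPT.search(line)
        if attempt:
            pnote = PNOTE.search(line)
            reason = pnote.group(2) if pnote else None
            events.append(("attempt", attempt.group(1), reason))
            continue
        changed = CHANGED.search(line)
        if changed:
            events.append(("changed", changed.group(1), None))
    return events


def failed_attempts(events):
    """List attempts whose next recorded state differs from the target."""
    failed = []
    for i, (kind, state, _) in enumerate(events):
        if kind != "attempt":
            continue
        nxt = next((e for e in events[i + 1:] if e[0] == "changed"), None)
        if nxt is not None and nxt[1] != state:
            failed.append((state, nxt[1]))
    return failed


events = summarize(LOG)
for wanted, actual in failed_attempts(events):
    print(f"Tried to move to {wanted} but the state was recorded as {actual}")
```

On the lines above, this reports a single mismatch: the attempt to move to ACTIVE is followed by a recorded state of FAILURE, which is exactly the part I do not understand.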
Did the sync interface go down on Fw1 because Fw2 had a (hardware-related) issue and the link is a crossover connection? Why didn't Fw1 take over the Active state when Fw2 wasn't responding?
Can someone help explain what happened here?
Can someone help explain how ClusterXL is supposed to work (as a flow)?
Thanks!