Update
After a lot of trouble with this, I finally got to a point where I can reproduce the problem.
Now that it has shown up again on the cluster that had it earlier, and on another one as well, we have a good view of what is happening and how you can recognize it.
When routeD is restarted on the VRRP Master:
- the routeD process crashes every 10 seconds; with pidof routed you can watch the routed process ID change (see the sketch after this list)
- the VRRP state goes back to its initial state and the coldstart timer starts counting down again, restarting every time routeD crashes
- the VRRP driver holds the VIPs on the failing box
- the VRRP driver keeps sending VRRP hello packets from the failing box
- the backup system remains in Backup state, as it keeps receiving VRRP hellos
- in the output of show route, part of the learned routes show up as kernel routes and part of them show no nexthop and a ? as the route type
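A quick way to watch the crash loop, using nothing Check Point specific, just standard shell from expert mode (the 2-second interval is my own choice):

watch -n 2 "pidof routed"

Every time routeD crashes and is respawned you see the PID change; if pidof prints nothing for a moment, routed is down and about to be restarted.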
Today we were finally able to get a routeD crash dump. By default the usermode crash dumps are not enabled; enable them with
um_core enable and reboot. None of the coredump settings in clish will enable them, and ulimit will not do the trick either.
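For reference, the sequence we are talking about, run from expert mode (the dump location below is what I believe is the Gaia default for usermode cores, so verify it on your own box):

um_core enable
reboot
# after the next routeD crash, look for the core file, e.g.:
ls -l /var/log/dump/usermode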
To get the VRRP cluster back to normal operation, just raise the priority on the Backup! Changing the priority on the Master will have no effect, as the change is not forwarded to the VRRP driver.
As soon as the Master receives a hello packet with a higher priority, however, its driver switches to backup mode, routed stops crashing, and the system returns to normal operation.
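On a Gaia cluster with Monitored-Circuit VRRP, raising the priority on the Backup would look roughly like this in clish (the interface name, VRID and priority value are placeholders for your own setup, and the exact syntax may differ per Gaia version and VRRP configuration mode, so treat this as a sketch only):

set vrrp interface eth1 monitored-circuit vrid 10 priority 110
save config

The only thing that matters is that the Backup ends up advertising a higher priority than the failing Master is still sending in its hellos.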
R&D is now investigating the crash dump.
PS the way to get routed into the crash loop: on the active member, issue:
tellpm process:routed
tellpm process:routed t
(as far as I know, the first command tells the process manager to stop routed, and the second one, with the t argument, starts it again)
Make sure that this is not a production cluster as it WILL stop routing....