Heartbeat
Each node in the cluster sends out a multicast heartbeat that tells the other
members of the cluster that it is alive and healthy. By default, a cluster node
will consider another node dead if it misses the heartbeat from that node for
10 seconds.
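The timeout rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Red Hat cluster software implements it: the class name, method names and the injectable clock are all hypothetical, and only the 10 second default comes from the text.

```python
import time

DEAD_AFTER_SECONDS = 10.0  # default from the text; all names here are hypothetical


class HeartbeatMonitor:
    """Tracks when each peer node was last heard from."""

    def __init__(self, timeout=DEAD_AFTER_SECONDS, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable so the logic can be tested without waiting
        self.last_seen = {}     # node name -> time of the last heartbeat

    def record_heartbeat(self, node):
        """Call this whenever a heartbeat from `node` arrives."""
        self.last_seen[node] = self.clock()

    def is_dead(self, node):
        """True once `node` has been silent for longer than the timeout.

        Raises KeyError for a node that has never sent a heartbeat.
        """
        return self.clock() - self.last_seen[node] > self.timeout
```

A node would call record_heartbeat() each time a heartbeat arrives and poll is_dead() for its partner. What happens once the partner is considered dead is covered in the sections below.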
The interface used for heartbeats is configured in the cluster.conf file (see
the configuration section for more details). In discussions about cluster
configuration, Red Hat support strongly recommended that a cross-connect
between the two nodes not be used, and that an interface connected to a switch
on a private VLAN be used for heartbeats instead. They also recommended that
this be the same interface used to initiate fencing (see below).
Quorum
One of the most dangerous situations that can happen in a cluster is that both
nodes become active at the same time. This is especially true for clusters that
share storage resources. In this case both cluster nodes could be writing to
the same data on shared storage, which will quickly cause data corruption.
When both nodes become active, it is called “split brain”. It can happen when a
cluster node stops receiving heartbeats from its partner node. Since the two
nodes are no longer communicating, neither knows whether the problem lies with
the other node or with itself.
For example, say that the passive node stops receiving heartbeats from the
active node because the heartbeat network has failed. The active node is still
running the cluster services, so if the passive node starts them as well you
have a split-brain situation.
Most clusters use a Quorum Disk
to prevent this from happening. The Quorum Disk is a small shared disk that
both nodes can access at the same time. Whichever node is currently the active
node writes to the disk periodically (usually every couple of seconds) and the
passive node checks the disk to make sure the active node is keeping it up to
date.
When a node stops receiving
heartbeats from its partner node it looks at the Quorum Disk to see if it has
been updated. If the other node is still updating the Quorum Disk then the
passive node knows that the active node is still alive and does not start the
cluster services.
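The decision the passive node makes can be written as a small function. This is a minimal sketch of the logic described above, not the actual quorum disk implementation; the function name, its parameters and the 6 second staleness threshold are assumptions made for illustration.

```python
def should_take_over(heartbeat_lost, quorum_updated_at, now, stale_after=6.0):
    """Decide whether the passive node may start the cluster services.

    heartbeat_lost    -- True if heartbeats from the active node have stopped
    quorum_updated_at -- time (in seconds) of the active node's last write
                         to the Quorum Disk
    now               -- current time, on the same clock
    stale_after       -- hypothetical threshold: how old the last write may be
                         before the active node is presumed dead
    """
    if not heartbeat_lost:
        return False  # heartbeats are fine, nothing to do

    disk_is_fresh = (now - quorum_updated_at) <= stale_after
    # A fresh Quorum Disk means the active node is still alive, so only the
    # heartbeat network has failed and the passive node must stay passive.
    return not disk_is_fresh
```

With these assumed values, a passive node that has lost heartbeats but sees a Quorum Disk write from 2 seconds ago stays passive, while one that sees no write for 30 seconds takes over.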
Red Hat clusters support Quorum Disks, but Red Hat support recommended against
using one, since they are difficult to configure and can become problematic.
Instead, they recommend relying on fencing to prevent split brain.
Fencing
One of the strategies used by Red Hat clusters to prevent split brain is a
concept called fencing.
While there are several different types of fencing, fencing via the HP iLO
devices (or similar) built into the servers is the recommended method. With
this type of fencing, when the passive node stops receiving heartbeats from the
active node, it connects to the iLO of the active node and reboots the active
node. Once the passive node has rebooted (i.e. fenced) the active node, it
starts the cluster services.
By rebooting the active node
the passive node can be sure that the active node is no longer running the
cluster services and it is safe to start them.
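The ordering is the important part: fence first, start services second. A minimal sketch follows, in which the two callbacks are hypothetical stand-ins for "reboot the partner through its iLO" and "start the cluster services"; neither is a real cluster API. Refusing to start services when the fence attempt fails is an assumption of this sketch, chosen because an unfenced partner might still be writing to shared storage.

```python
def take_over(fence_partner, start_services):
    """Fence the partner node, then start the cluster services.

    fence_partner  -- hypothetical callable; returns True if the partner
                      was rebooted successfully
    start_services -- hypothetical callable; starts the cluster services
    Returns True if this node took over, False otherwise.
    """
    if not fence_partner():
        # The partner could not be fenced, so it may still be running the
        # cluster services. Starting them here would risk split brain.
        return False

    # The partner has been rebooted, so it is safe to start the services.
    start_services()
    return True
```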
From a design point of view, the NIC used to connect to the iLO of a node’s
partner server should also be the NIC used for heartbeats. This ensures that a
node that has lost its connection to the heartbeat network cannot fence its
partner.
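A toy model makes the reasoning concrete. Assume, as the design above suggests, that each node has one NIC carrying both heartbeats and iLO traffic. The Node class and helper functions below are hypothetical and exist only to illustrate the argument.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cluster_nic_up: bool = True  # one NIC carries heartbeats and iLO traffic


def heartbeat_flows(a, b):
    """Heartbeats pass between two nodes only if both NICs are up."""
    return a.cluster_nic_up and b.cluster_nic_up


def can_fence(node):
    """Reaching the partner's iLO needs the same NIC that carries heartbeats."""
    return node.cluster_nic_up


def nodes_that_can_fence(nodes):
    """Names of the nodes that are still able to fence a partner."""
    return [n.name for n in nodes if can_fence(n)]
```

If node1 loses its NIC, both nodes stop seeing heartbeats, but only node2 can still reach an iLO. The node with the network fault is therefore the one that gets fenced, and the two nodes cannot fence each other.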