Tuesday, January 21, 2014

Cluster Terms



Heartbeat
Each node in the cluster sends out a multicast heartbeat that tells the other members of the cluster that it is alive and healthy. By default, a cluster node will consider another node dead if it misses the heartbeat from that node for 10 seconds.
The interface used for heartbeats is configured in the cluster.conf file (see the configuration section for more details). When discussing cluster configuration with Red Hat support, they strongly recommended against using a cross-connect between the two nodes. Instead, heartbeats should use an interface connected to a switch on a private VLAN. They also recommended that this be the same interface used to initiate fencing (see below).
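To make the timeout rule concrete, here is a minimal Python sketch of how a node might decide whether a peer is still alive. This is not how the Red Hat cluster software is implemented; the class name, method names, and the injectable clock are all hypothetical and only illustrate the "no heartbeat for 10 seconds means dead" rule described above.

```python
import time

# The 10 second default comes from the text above; everything else
# in this sketch (names, structure) is a made-up illustration.
DEAD_AFTER_SECONDS = 10.0


class HeartbeatMonitor:
    """Tracks when each peer node was last heard from."""

    def __init__(self, timeout=DEAD_AFTER_SECONDS, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable so the logic can be tested
        self.last_seen = {}     # node name -> time of last heartbeat

    def record_heartbeat(self, node):
        """Call this whenever a heartbeat arrives from `node`."""
        self.last_seen[node] = self.clock()

    def is_alive(self, node):
        """A node is alive only if it was heard from within the timeout."""
        seen = self.last_seen.get(node)
        if seen is None:
            return False        # never heard from this node at all
        return (self.clock() - seen) < self.timeout
```

A node that has never sent a heartbeat is treated as dead here, which is a design choice of this sketch rather than something the cluster documentation states.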

Quorum
One of the most dangerous situations that can happen in clusters is that both nodes become active at the same time. This is especially true for clusters that share storage resources. In this case both cluster nodes could be writing to the data on shared storage which will quickly cause data corruption.
When both nodes become active at once, it is called “split brain”. It can happen when a cluster node stops receiving heartbeats from its partner node. Since the two nodes are no longer communicating, neither one knows whether the problem lies with the other node or with itself.
For example, say that the passive node stops receiving heartbeats from the active node because the heartbeat network has failed, while the active node itself is still running. If the passive node starts the cluster services in this case, you have a split-brain situation.
Most clusters use a Quorum Disk to prevent this from happening. The Quorum Disk is a small shared disk that both nodes can access at the same time. Whichever node is currently the active node writes to the disk periodically (usually every couple of seconds) and the passive node checks the disk to make sure the active node is keeping it up to date.
When a node stops receiving heartbeats from its partner node it looks at the Quorum Disk to see if it has been updated. If the other node is still updating the Quorum Disk then the passive node knows that the active node is still alive and does not start the cluster services.
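The Quorum Disk check described above boils down to a simple decision, sketched here in Python. The function name, its parameters, and the 10 second staleness threshold are assumptions made for illustration; a real cluster reads the timestamp from the shared disk and uses its own configured intervals.

```python
def should_take_over(heartbeat_alive, last_disk_update, now, stale_after=10.0):
    """Decide whether the passive node may start the cluster services.

    heartbeat_alive  -- True if heartbeats from the active node still arrive
    last_disk_update -- time the active node last wrote to the Quorum Disk
    now              -- current time, in the same units as last_disk_update
    stale_after      -- hypothetical threshold; not a value from the source
    """
    if heartbeat_alive:
        return False  # partner is fine, nothing to do

    # Heartbeats are missing, so fall back to the Quorum Disk.
    disk_is_fresh = (now - last_disk_update) < stale_after

    # A fresh disk means the active node is still alive and only the
    # heartbeat network failed, so starting services would split the brain.
    return not disk_is_fresh
```

For example, with heartbeats lost but a disk write 2 seconds ago, the passive node stays passive; with heartbeats lost and no disk write for 30 seconds, it takes over.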
Red Hat clusters support Quorum Disks, but Red Hat support recommended not using one, since they are difficult to configure and can become problematic. Instead, they recommend relying on fencing to prevent split brain.

Fencing
One of the strategies used by Red Hat clusters to prevent split brain is a concept called fencing.
While there are several different types of fencing, fencing via the HP iLO devices (or similar) built into the servers is the recommended method. With this type of fencing, when the passive node stops receiving heartbeats from the active node, it connects to the iLO of the active node and reboots it. Once the passive node has rebooted (i.e. fenced) the active node, it starts the cluster services.
By rebooting the active node, the passive node can be sure that the active node is no longer running the cluster services and that it is safe to start them.
From a design point of view, the NIC used to connect to the iLO of a node’s partner server should also be the NIC used for heartbeats. This ensures that a node that has lost its connection to the heartbeat network cannot fence its partner, because it cannot reach the partner’s iLO either.
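The fencing sequence can be sketched in Python as follows. The function and its two callbacks are hypothetical stand-ins: `fence_partner` represents "reboot the partner through its iLO" and `start_services` represents "start the cluster services". Refusing to start services when fencing fails is an assumption of this sketch, chosen to match the design point above that a node cut off from the network cannot fence its partner.

```python
def fail_over(heartbeat_alive, fence_partner, start_services):
    """Failover logic for the passive node: fence first, then start services.

    fence_partner  -- callable that tries to reboot the partner via its
                      iLO and returns True on success (hypothetical)
    start_services -- callable that starts the cluster services (hypothetical)
    """
    if heartbeat_alive:
        return "standby"        # partner is healthy, stay passive

    # Heartbeats are gone. Try to reboot the partner before doing anything.
    if not fence_partner():
        # Could not reach the partner's iLO. If heartbeat and fencing share
        # a NIC, this node is probably the one that lost the network.
        return "fence-failed"

    # Partner was rebooted, so it cannot be running the services.
    start_services()
    return "active"
```

Because services start only after a successful fence, a node that has lost its own network connection ends up in the "fence-failed" state instead of causing a split brain.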
