RHEL7 Failover Cluster

In this tutorial we will see how to create a failover cluster between two Red Hat Enterprise Linux (RHEL) 7 virtual machines. I installed them on an ESXi platform, with 2 vCPU and 2 GB of RAM each, but these details are not essential. Three IP addresses are required: one for each RHEL server and one as the virtual IP of the cluster. The virtual machines must be able to communicate with each other.

In my sample architecture I have:

  • Server 1
    • Name: vm-rh74-svr1
    • IP: 192.168.0.124
  • Server 2
    • Name: vm-rh74-svr2
    • IP: 192.168.0.125
  • Cluster VIP
    • IP: 192.168.0.126

[Figure: rhel-cluster architecture diagram]

Installation

We install the RHEL 7.x operating system with a minimal configuration on both servers and give each one a static IP address. To avoid communication problems between the nodes, we disable SELinux and the firewall on both.
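
For a lab setup like this one, a minimal sketch of how both can be disabled on each node (in a production cluster you would instead open the required ports and keep SELinux enforcing):

	# setenforce 0
	# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
	# systemctl stop firewalld
	# systemctl disable firewalld

Note that setenforce only affects the running system; the change in /etc/selinux/config takes effect at the next boot.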

Once the installation is complete, we add an entry for each node to the /etc/hosts file on both virtual machines, so the nodes can resolve each other's names.

	127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
	::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

	192.168.0.124 vm-rh74-svr1 vm-rh74-svr1.localdomain
	192.168.0.125 vm-rh74-svr2 vm-rh74-svr2.localdomain
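
To verify that name resolution and connectivity between the nodes work, a quick check can be run from vm-rh74-svr1 (and the equivalent from vm-rh74-svr2):

	# ping -c 2 vm-rh74-svr2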

If you don’t have an active RHEL subscription, you can set up local yum repositories from the installation DVD: one for the standard RHEL packages and one for the HighAvailability add-on.
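
The baseurl entries below point at the installation DVD mounted on /mnt/cdrom; if it is not mounted yet, something along these lines should work (the device node may differ, for example /dev/sr0 instead of /dev/cdrom):

	# mkdir -p /mnt/cdrom
	# mount -o ro /dev/cdrom /mnt/cdrom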

	# echo "[rhel-dvd]" > /etc/yum.repos.d/rhel-dvd.repo
	# echo "name=rhel-dvd" >> /etc/yum.repos.d/rhel-dvd.repo
	# echo "baseurl=file:///mnt/cdrom" >> /etc/yum.repos.d/rhel-dvd.repo
	# echo "enabled=1" >> /etc/yum.repos.d/rhel-dvd.repo
	# echo "gpgcheck=0" >> /etc/yum.repos.d/rhel-dvd.repo

	# echo "[rhel-ha-dvd]" > /etc/yum.repos.d/rhel-ha-dvd.repo
	# echo "name=rhel-ha-dvd" >> /etc/yum.repos.d/rhel-ha-dvd.repo
	# echo "baseurl=file:///mnt/cdrom/addons/HighAvailability" >> /etc/yum.repos.d/rhel-ha-dvd.repo
	# echo "enabled=1" >> /etc/yum.repos.d/rhel-ha-dvd.repo
	# echo "gpgcheck=0" >> /etc/yum.repos.d/rhel-ha-dvd.repo

Next, we install pcs, pacemaker and fence-agents-all on both virtual machines.

	# yum --disablerepo=\* --enablerepo=rhel-dvd,rhel-ha-dvd install pcs pacemaker fence-agents-all

In order to use pcs for cluster configuration and to allow communication between the nodes, a password must be set on each node for the hacluster user, which is the pcs administration account. It is recommended to use the same password for hacluster on both nodes.

	# passwd hacluster
	Changing password for user hacluster.
	New password:
	Retype new password:
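
If you prefer to script this step, the password can also be set non-interactively with the --stdin option of passwd (the password below is only a placeholder):

	# echo 'MyHAclusterPwd' | passwd --stdin hacluster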

Now we can start and enable pcsd, corosync and pacemaker on both nodes.

	# systemctl start pcsd.service
	# systemctl enable pcsd.service
	# systemctl daemon-reload
	# systemctl start corosync.service
	# systemctl enable corosync.service
	# systemctl daemon-reload
	# systemctl start pacemaker.service
	# systemctl enable pacemaker.service
	# systemctl daemon-reload
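
Before authenticating the nodes it is worth confirming that pcsd is actually running on both of them, since pcs relies on it for node-to-node communication:

	# systemctl is-active pcsd.service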

At this point we authenticate the hacluster user with pcs against each node.

	# pcs cluster auth vm-rh74-svr1 vm-rh74-svr2
	Username: hacluster
	Password:
	vm-rh74-svr1: Authorized
	vm-rh74-svr2: Authorized

Cluster configuration

Once pcs is installed and the nodes are authenticated, we run the following command from vm-rh74-svr1 to create the cluster vm-rh74-cluster. This propagates the cluster configuration files to both nodes.

	# pcs cluster setup --start --name vm-rh74-cluster vm-rh74-svr1 vm-rh74-svr2
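
Optionally, pcs can also be told to start the cluster services automatically at boot on all nodes; this is equivalent to enabling corosync and pacemaker with systemctl, as done earlier:

	# pcs cluster enable --all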

Then let’s check the status of the cluster.

	# pcs cluster status
	Cluster Status:
	Stack: corosync
	Current DC: vm-rh74-svr2 (version 1.1.16-12.el7-94ff4df) - partition with quorum
	Last updated: Mon Jun 18 17:06:32 2018
	Last change: Mon Jun 18 17:06:07 2018 by hacluster via crmd on vm-rh74-svr2
	2 nodes configured
	0 resources configured

	PCSD Status:
	 vm-rh74-svr1: Online
	 vm-rh74-svr2: Online

We add the VirtualIP resource, an ocf:heartbeat:IPaddr2 resource that manages the cluster virtual IP 192.168.0.126, and place it in the resource group mygroup.

	# pcs resource create VirtualIP IPaddr2 ip=192.168.0.126 cidr_netmask=24 --group mygroup
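
To double-check how the resource was defined, its configuration can be displayed as follows (pcs 0.9 syntax, as shipped with RHEL 7):

	# pcs resource show VirtualIP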

Then let’s check the cluster status again.

	# pcs status
	Cluster name: vm-rh74-cluster
	WARNING: no stonith devices and stonith-enabled is not false
	Stack: corosync
	Current DC: vm-rh74-svr2 (version 1.1.16-12.el7-94ff4df) - partition with quorum
	Last updated: Mon Jun 18 17:23:59 2018
	Last change: Mon Jun 18 17:19:19 2018 by root via cibadmin on vm-rh74-svr1

	2 nodes configured
	1 resource configured

	Online: [ vm-rh74-svr1 vm-rh74-svr2 ]

	Full list of resources:

	Resource Group: mygroup
		 VirtualIP  (ocf::heartbeat:IPaddr2):       Stopped

	Daemon Status:
	  corosync: active/disabled
	  pacemaker: active/disabled
	  pcsd: active/enabled

The output above shows that the VirtualIP resource is Stopped, which means there is a problem with the cluster configuration. We can check the configuration for errors with the following command.

	# crm_verify -L -V
	   error: unpack_resources:     Resource start-up disabled since no STONITH resources have been defined
	   error: unpack_resources:     Either configure some or disable STONITH with the stonith-enabled option
	   error: unpack_resources:     NOTE: Clusters with shared data need STONITH to ensure data integrity
	Errors found during check: config not valid

The message above shows that there is still an error related to STONITH (Shoot The Other Node In The Head), the fencing mechanism that ensures you never end up with both nodes believing they are active and owning the virtual IP, a condition known as split brain. Since this is a basic two-node test cluster with no fencing devices, we will simply disable the stonith option.

	# pcs property set stonith-enabled=false

We also configure the quorum settings. Quorum is the minimum number of nodes that must be active for the cluster to be considered available: by default a cluster has quorum only when more than half of its nodes are online, which for a two-node cluster means both nodes must be available. That would defeat the purpose of our two-node failover setup, so we disable quorum enforcement.

	# pcs property set no-quorum-policy=ignore
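
To confirm that both properties are now set, the cluster properties that differ from their defaults can be listed:

	# pcs property list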

To control where resources run, we can configure a location constraint so that the resource group prefers one node. An INFINITY score means the resource will always run on that node while it is available, but it does not prevent the resource from running on another node when the preferred node is unavailable.

	# pcs constraint location mygroup prefers vm-rh74-svr1=INFINITY
	# pcs constraint --full
	Location Constraints:
	  Resource: mygroup
	   Enabled on: vm-rh74-svr1 (score:INFINITY) (id:location-mygroup-vm-rh74-svr1-INFINITY)
	Ordering Constraints:
	Colocation Constraints:
	Ticket Constraints:

Now check the status with both nodes active.

	# pcs status
	Cluster name: vm-rh74-cluster
	Stack: corosync
	Current DC: vm-rh74-svr2 (version 1.1.16-12.el7-94ff4df) - partition with quorum
	Last updated: Tue Jun 19 11:11:05 2018
	Last change: Tue Jun 19 10:45:09 2018 by root via cibadmin on vm-rh74-svr1

	2 nodes configured
	1 resource configured

	Online: [ vm-rh74-svr1 vm-rh74-svr2 ]

	Full list of resources:

	Resource Group: mygroup
		VirtualIP  (ocf::heartbeat:IPaddr2):       Started vm-rh74-svr1

	Daemon Status:
	 corosync: active/enabled
	 pacemaker: active/enabled
	 pcsd: active/enabled

From the output above, the cluster is online and the VirtualIP resource has started on node vm-rh74-svr1.
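
As an additional check, we can confirm on vm-rh74-svr1 that the virtual IP is actually bound to a network interface (the interface name depends on your setup):

	# ip addr show | grep 192.168.0.126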

Now let’s shut down the vm-rh74-svr1 node and check the status.

	# pcs status
	Cluster name: vm-rh74-cluster
	Stack: corosync
	Current DC: vm-rh74-svr2 (version 1.1.16-12.el7-94ff4df) - partition with quorum
	Last updated: Tue Jun 19 11:18:59 2018
	Last change: Tue Jun 19 10:45:09 2018 by root via cibadmin on vm-rh74-svr1

	2 nodes configured
	1 resource configured

	Online: [ vm-rh74-svr2 ]
	OFFLINE: [ vm-rh74-svr1 ]
	 
	Full list of resources:

	Resource Group: mygroup
		VirtualIP  (ocf::heartbeat:IPaddr2):       Started vm-rh74-svr2

	Daemon Status:
	 corosync: active/enabled
	 pacemaker: active/enabled
	 pcsd: active/enabled

As you can see, the cluster is still online and the VirtualIP resource is now running on the second node, vm-rh74-svr2.

We then restart the vm-rh74-svr1 node and, after booting, check the status again.

	# pcs status
	Cluster name: vm-rh74-cluster
	Stack: corosync
	Current DC: vm-rh74-svr2 (version 1.1.16-12.el7-94ff4df) - partition with quorum
	Last updated: Tue Jun 19 11:20:55 2018
	Last change: Tue Jun 19 10:45:09 2018 by root via cibadmin on vm-rh74-svr1

	2 nodes configured
	1 resource configured

	Online: [ vm-rh74-svr1 vm-rh74-svr2 ]

	Full list of resources:

	Resource Group: mygroup
		VirtualIP  (ocf::heartbeat:IPaddr2):       Started vm-rh74-svr1

	Daemon Status:
	 corosync: active/enabled
	 pacemaker: active/enabled
	 pcsd: active/enabled

As we can see, the cluster is online and the VirtualIP resource has moved back to the first node, vm-rh74-svr1. This happens because of the location constraint we configured, which makes the resource group prefer vm-rh74-svr1.

Failover tests can also be triggered with the following commands, which stop and then restart the cluster services on a single node:

	# pcs cluster stop vm-rh74-svr1
	vm-rh74-svr1: Stopping Cluster (pacemaker)...
	vm-rh74-svr1: Stopping Cluster (corosync)...

	# pcs cluster start vm-rh74-svr1
	vm-rh74-svr1: Starting Cluster...
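
Another way to test failover without stopping the whole cluster stack on a node is to put that node in standby mode and then bring it back (pcs 0.9 syntax):

	# pcs cluster standby vm-rh74-svr1
	# pcs cluster unstandby vm-rh74-svr1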