Managing OpenStack services on HA Controller Nodes

OpenStack services on HA controller nodes are managed by Pacemaker. If you need to manage the services, you can place an HA controller node in standby mode to move its services to another node. You can also place an entire set of HA controllers into standby mode, for example, when preparing for a site power outage. If necessary, you can instruct Pacemaker to stop managing a specific resource.

About this task

Under normal conditions, all services on an HA controller are stopped as a set by telling Pacemaker to put the HA controller node into standby mode. Services are started as a set by telling Pacemaker to take the HA controller node out of standby mode. While it runs, Pacemaker monitors individual services, restarts them after failures, and moves them between nodes as required. When Pacemaker reports a failure, it attempts to recover from and clear that failure. The recovery can take anywhere from a few seconds to an hour, depending on the failure. If the problem continues, the failure might require further investigation and manual intervention to resolve.

For HA controller nodes with internal DB2® installed, one HA controller node runs the virtual IP service, HAProxy service, and the master role for the IBM DB2 high availability disaster recovery (HADR) service. This HA controller node is the HADR primary, and is also referred to in this article as the primary HA controller. The node that is running these services changes in response to failures, or management actions like updating a node and shutting down nodes. This node has the most current database contents and also has the most current RabbitMQ message queue contents if it is the last node that is shut down.

To stop all the services on a single HA controller node, use the knife os manage services standby command:

$ knife os manage services standby --node HA_NODE_FQDN

Placing an HA controller node in standby mode, by using the standby action of the IBM Cloud Manager with OpenStack services command, instructs the HA components on that node to stop any running OpenStack services. Any OpenStack services that are found active are stopped, and if necessary, they are moved to another HA controller node. Using the standby action is the preferred way to move the DB2 HADR primary, the virtual IP, and any other Active/Passive resources off a specific node. Active/Passive means that a resource runs on only one node at a time. After the resource or resources that you want to move are off the node, use the knife os manage services unstandby command to allow Pacemaker to restart and resume managing services on this node. For more information about the unstandby action, see the details later in this document.

If the node is powered off while in standby mode, it remains in standby mode when powered on. If a node is not in standby, it will not be in standby when powered on and Pacemaker will immediately start services. A node can also be placed in standby while it is powered off to prevent Pacemaker from starting services when it is powered on. To place a node in standby while it is powered off, run the pcs cluster standby command from one of the other HA controller nodes:

$ pcs cluster standby HA_NODE_FQDN

The knife os manage services standby command can be used multiple times to place more than one node into standby. However, it is important to do this in an orderly way. To put the entire set of HA controller nodes into standby, see the information in Shutting down or restarting all HA controller nodes simultaneously (restarting the nodes in this procedure is optional). If you want to put some but not all controller nodes into standby mode, and one of the nodes that you plan to put in standby is the primary HA controller, ensure that you put the primary HA controller in standby mode last. This order minimizes the impact of moving services from node to node.
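
The ordering rule above can be sketched as a small shell helper. This is only an illustration: order_for_standby is a hypothetical function name, not part of IBM Cloud Manager with OpenStack, and the node names in the usage comment are placeholders. It prints the nodes that you pass in, with the primary HA controller moved to the end of the list.

```shell
# Hypothetical helper (not part of the product): print the given nodes,
# one per line, with the primary HA controller moved to the end.
# Usage: order_for_standby PRIMARY_FQDN NODE_FQDN...
order_for_standby() {
  primary=$1
  shift
  found=no
  for node in "$@"; do
    if [ "$node" = "$primary" ]; then
      found=yes            # hold the primary back until the end
    else
      printf '%s\n' "$node"
    fi
  done
  # Print the primary last, but only if it was in the list of nodes.
  if [ "$found" = yes ]; then
    printf '%s\n' "$primary"
  fi
}

# Example usage (placeholder node names; the loop runs the documented command):
# order_for_standby controller2.example.com \
#     controller1.example.com controller2.example.com |
#   while read -r node; do
#     knife os manage services standby --node "$node" < /dev/null
#   done
```

The helper only decides the order. You might still want to check the service status between nodes before continuing.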

Important: If the last active HA controller node with DB2 installed is put into standby, this node must be taken out of standby first. Otherwise, start all nodes at the same time by taking the entire cluster out of standby as described later in this document. This node contains the most current DB2 database updates and RabbitMQ message queue contents. Restarting it first ensures that it remains the primary HA controller. Starting another HA controller node first might damage the HADR configuration and require restoring the databases on the DB2 HADR standby nodes from the new primary controller. The new primary controller might not have a record of all of the database updates.

To start all of the services on a single HA controller node, use the knife os manage services unstandby command:

$ knife os manage services unstandby --node HA_NODE_FQDN

Bringing an HA controller node out of standby mode by using the unstandby action of the IBM Cloud Manager with OpenStack services command allows the node to run OpenStack services again if they are stopped. However, it does not automatically make the node take over as the active node for Active/Passive OpenStack services.

The entire set of HA controllers can be put into standby mode. This method might be used, for example, when you are preparing for a site power outage. Putting an entire set of HA controllers in standby mode is different from putting nodes into standby one by one. Pacemaker puts the entire cluster (all HA controllers) into standby at once, so services do not need to be moved from node to node.

To put all HA controllers into standby, use the knife os manage services standby command with the --topology-file option:

$ knife os manage services standby --topology-file your_topology_file

Note: Pacemaker might report resource failures while stopping the services on all HA controller nodes. This is normal and Pacemaker should automatically recover from and clear the failures.

The standby --topology-file command returns before all the nodes are put in standby. Before you take other actions, use the knife os manage services status --topology-file your_topology_file command that is described later to monitor the services, and wait until all services on the controller nodes are stopped.
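
One way to wait for the services to stop is to poll the status output and count the resources that are still active. The following sketch rests on assumptions: count_active is a hypothetical helper, and it assumes that the status output lists each resource as a line ending in a state and a node name (for example "Started controller2.example.com"), and that stopped resources no longer show a Started, Master, or Slave state. Check these assumptions against the output in your environment before relying on it.

```shell
# Hypothetical helper (not part of the product): read status output on
# stdin and print the number of resource lines whose second-to-last field
# is an active state. The "Slave" state is an assumption; the example
# output in this document shows only "Started" and "Master".
count_active() {
  awk 'NF >= 2 && ($(NF-1) == "Started" || $(NF-1) == "Master" || $(NF-1) == "Slave") { n++ }
       END { print n + 0 }'
}

# Example polling loop (the 30-second interval is an arbitrary choice):
# while [ "$(knife os manage services status --topology-file your_topology_file | count_active)" -gt 0 ]; do
#   sleep 30
# done
```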

To take all HA controllers out of standby mode, use the knife os manage services unstandby command with the --topology-file option:

$ knife os manage services unstandby --topology-file your_topology_file

Note: Pacemaker might report resource failures while starting the services on all HA controller nodes. This is normal and Pacemaker should automatically recover from and clear the failures.

To identify the current primary HA controller, run the knife os manage services status command for one of the HA controller nodes. Early in the output, it identifies which node is running the virtual IP service, the HAProxy service, and the master role for DB2 HADR. In the following example, controller2.example.com is the primary HA controller.

$ knife os manage services status --node HA_NODE_FQDN
controller1.example.com Full list of resources:
controller1.example.com
controller1.example.com  ibm-os-virtualip      (ocf::heartbeat:IPaddr2):       Started controller2.example.com
controller1.example.com  ibm-os-haproxy        (ocf::ibm-openstack:haproxy_agent):     Started controller2.example.com
controller1.example.com  Master/Slave Set: ibm-os-db2hadr-master [ibm-os-db2hadr]
controller1.example.com      ibm-os-db2hadr    (ocf::ibm-openstack:db2hadr):   Master controller2.example.com
controller1.example.com      ibm-os-db2hadr    (ocf::ibm-openstack:db2hadr):   Started controller1.example.com
controller1.example.com      ibm-os-db2hadr    (ocf::ibm-openstack:db2hadr):   Started controller3.example.com
controller1.example.com      Masters: [ controller2.example.com ]
controller1.example.com      Slaves: [ controller1.example.com ]
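
If you want to pick out the primary HA controller in a script, the "Masters:" line in the example output above can be parsed with awk. This is a minimal sketch: find_primary is a hypothetical helper name, and it assumes that the output keeps the format shown above, with each line prefixed by the queried node name and the master listed as "Masters: [ node ]".

```shell
# Hypothetical helper (not part of the product): read status output on
# stdin and print the node that is listed on the first "Masters:" line.
# Fields on that line: <queried-node> Masters: [ <primary-node> ]
find_primary() {
  awk '$2 == "Masters:" { print $4; exit }'
}

# Example usage:
# knife os manage services status --node HA_NODE_FQDN | find_primary
```

The helper prints nothing if no "Masters:" line is found, for example when no node currently holds the master role.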

To tell Pacemaker not to start or stop a service, use:

$ pcs resource unmanage resource-name

You need to run this command from only one of the controller nodes. Pacemaker stops managing the resource on all nodes. After you have Pacemaker unmanage the resource, the service can be started and stopped manually without Pacemaker interfering.

To have Pacemaker start managing the service again, use the following command:

$ pcs resource manage resource-name
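
Because an unmanaged resource is no longer monitored or restarted by Pacemaker, it is easy to forget to hand it back. The following sketch wraps a maintenance command between the two pcs commands shown above. The function name with_unmanaged and the PCS override variable are hypothetical and exist only for this illustration. Setting PCS=echo lets you preview the commands without touching the cluster.

```shell
# Hypothetical wrapper (not part of pcs): unmanage a resource, run a
# maintenance command, then return the resource to Pacemaker even if the
# maintenance command fails.
# Usage: with_unmanaged RESOURCE_NAME COMMAND [ARGS...]
with_unmanaged() {
  resource=$1
  shift
  # PCS defaults to the real pcs binary; override it to do a dry run.
  "${PCS:-pcs}" resource unmanage "$resource" || return 1
  status=0
  "$@" || status=$?        # remember the result of the maintenance command
  "${PCS:-pcs}" resource manage "$resource" || return 1
  return "$status"
}

# Example dry run (prints the pcs arguments instead of running pcs):
# PCS=echo with_unmanaged resource-name echo "maintenance work goes here"
```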