Application automation: Minimizing manual intervention

One key requirement for an application to function successfully under PowerHA® SystemMirror® is that the application be able to start and stop without any manual intervention.

Application start scripts

Create a start script that starts the application. The start script should perform any cleanup or preparation necessary to ensure proper startup of the application, and should also manage the number of instances of the application that need to be started. When the application controller is added to a resource group, PowerHA SystemMirror calls this script to bring the application online as part of processing the resource group. Because the cluster daemons call the start script, there is no option for interaction. Additionally, upon a PowerHA SystemMirror fallover, the recovery process calls this script to bring the application online on a standby node. This allows for a fully automated recovery, and is why any necessary cleanup or preparation should be included in this script.

PowerHA SystemMirror calls the start script as the root user. It might be necessary to change to a different user in order to start the application. The su command can accomplish this. Also, it might be necessary to run the nohup command on commands that are started in the background and have the potential to be ended upon exit of the shell.

For example, a PowerHA SystemMirror cluster node might be a client in a Network Information Service (NIS) environment. If this is the case and you need to use the su command to change the user ID, there must be a route to the NIS server at all times. If a route does not exist and the su command is attempted, the application script hangs. You can avoid this situation by enabling the PowerHA SystemMirror cluster node to be an NIS slave server. That way, a cluster node has the ability to access its own NIS map files to validate a user ID.
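The use of su and nohup described above can be sketched as follows. This is a minimal illustration, not a PowerHA SystemMirror interface: the user name, command path, log path, and function names are all hypothetical and must be replaced with values for your application.

```shell
#!/bin/sh
# Hypothetical values: replace with your application's user, command, and log.
APP_USER="${APP_USER:-appuser}"
APP_CMD="${APP_CMD:-/opt/myapp/bin/server}"
APP_LOG="${APP_LOG:-/var/log/myapp/start.log}"

# Build the command line that launches the application detached from the
# calling shell: nohup makes it ignore SIGHUP, all three standard streams
# are redirected so nothing depends on a terminal, and & backgrounds it.
#   $1: command to run   $2: log file
build_launch_cmd() {
    printf 'nohup %s >>%s 2>&1 </dev/null &' "$1" "$2"
}

# Start the application as APP_USER. The start script runs as root,
# so su does not prompt for a password.
start_as_app_user() {
    su - "$APP_USER" -c "$(build_launch_cmd "$APP_CMD" "$APP_LOG")"
}
```

A start script built on this sketch would call start_as_app_user after its cleanup and preparation steps. Redirecting standard input from /dev/null is a design choice that keeps the background command from waiting on input that can never arrive.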

The start script should also check for the presence of required resources or processes. This will ensure an application can start successfully. If the necessary resources are not available, a message can be sent to the administration team to correct this and restart the application.
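A resource check of this kind might look like the following sketch. The function names, the mail address, and the example paths are assumptions made for illustration; the mail command is used only as one possible way to reach the administration team.

```shell
#!/bin/sh
# Hypothetical address; replace with your administration team's mailbox.
ADMIN_MAIL="${ADMIN_MAIL:-admins@example.com}"

# Return 0 only if every path given as an argument exists.
# Each missing path is reported on standard error.
check_required() {
    missing=0
    for res in "$@"; do
        if [ ! -e "$res" ]; then
            echo "required resource not available: $res" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Send a short message to the administration team.
notify_admins() {
    echo "$1" | mail -s "Application start problem on $(hostname)" "$ADMIN_MAIL"
}

# Example use in a start script (paths are hypothetical):
#   check_required /appdata /appdata/config.ini ||
#       { notify_admins "Required resources missing"; exit 1; }
```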

Start scripts should be written so that they determine whether one instance of the application is already running and not start another instance unless multiple instances are desired. Keep in mind that the start script might be run after a primary node has failed. There might be recovery actions necessary on the backup node in order to restart an application. This is common in database applications. Again, the recovery must be able to run without any interaction from administrators.
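One common way to avoid starting a second instance is a PID file, sketched below. The PID file location and the function names are hypothetical. The sketch assumes that the launched command stays in the foreground of its own process (does not fork into a daemon), so that the recorded process ID really is the application.

```shell
#!/bin/sh
# Hypothetical location of the PID file.
PID_FILE="${PID_FILE:-/var/run/myapp.pid}"

# Return 0 if PID_FILE names a process that still exists.
app_is_running() {
    [ -f "$PID_FILE" ] || return 1
    pid=$(cat "$PID_FILE")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Start the application only if no instance is running.
# "$@" is the command that launches the application.
start_once() {
    if app_is_running; then
        echo "application already running, not starting another instance"
        return 0
    fi
    rm -f "$PID_FILE"               # clear a stale PID file left by a crash
    nohup "$@" >/dev/null 2>&1 &
    echo $! > "$PID_FILE"
}
```

If the PID file is kept on a shared disk, a process ID recorded on the failed node can match an unrelated process on the backup node, so a real script might also verify the process name before trusting the file.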

Application stop scripts

The most important aspect of an application stop script is that it completely stops an application. Failure to do so might prevent PowerHA SystemMirror from successfully completing a takeover of resources by the backup nodes. In stopping, the script might need to address some of the same concerns that the start script addresses, such as NIS and the su command.

The application stop script should use a phased approach. The first phase should be an attempt to stop the cluster services and bring resource groups offline. If processes refuse to end, the second phase should be used to forcefully ensure that all processing is stopped. Finally, a third phase can use a loop to repeat any steps necessary to ensure that the application has ended completely.

Be sure that your application stop script exits with the value 0 when the application has been successfully stopped. In particular, examine what happens if you run your stop script when the application is already stopped. Your script must exit with 0 in this case as well. If your stop script exits with a different value, this tells PowerHA SystemMirror that the application is still running, although possibly in a damaged state. The event_error event will be run and the cluster will enter an error state. This check alerts administrators that the cluster is not functioning properly.
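The phased approach and the exit value rule can be sketched together as follows. This is an illustration under stated assumptions, not a prescribed implementation: the application is assumed to be a single process recorded in a hypothetical PID file, and the first phase is interpreted here as a SIGTERM request for an orderly shutdown. An application with its own shutdown command would call that command in phase 1 instead.

```shell
#!/bin/sh
# Hypothetical location of the PID file written by the start script.
PID_FILE="${PID_FILE:-/var/run/myapp.pid}"

# still_running PID: true while the process exists.
still_running() { kill -0 "$1" 2>/dev/null; }

# stop_app [GRACE_SECONDS] [FORCE_RETRIES]
# Returns 0 when the application is stopped or was not running,
# 1 when it survives every attempt.
stop_app() {
    grace=${1:-30}; retries=${2:-5}

    # Already stopped: no PID recorded, or the recorded process is gone.
    [ -s "$PID_FILE" ] || return 0
    pid=$(cat "$PID_FILE")
    if ! still_running "$pid"; then rm -f "$PID_FILE"; return 0; fi

    # Phase 1: request an orderly shutdown and wait up to $grace seconds.
    kill -TERM "$pid" 2>/dev/null
    waited=0
    while still_running "$pid" && [ "$waited" -lt "$grace" ]; do
        sleep 1; waited=$((waited + 1))
    done

    # Phases 2 and 3: force termination, repeating until the process is gone.
    attempt=0
    while still_running "$pid"; do
        [ "$attempt" -lt "$retries" ] || return 1
        kill -KILL "$pid" 2>/dev/null
        sleep 1; attempt=$((attempt + 1))
    done
    rm -f "$PID_FILE"
    return 0
}
```

A stop script built on this sketch would end by calling stop_app and exiting with its return status, so that the script exits with 0 both after a successful stop and when the application was already stopped. Keep the total of the grace period and the retries well below the event time limit discussed next.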

Keep in mind that PowerHA SystemMirror allows 360 seconds by default for events to complete processing. After that time, a message indicating that the cluster has been in reconfiguration too long appears until the cluster completes its reconfiguration and returns to a stable state. This warning might be an indication that a script is hung and requires manual intervention. If this is a possibility, you might want to consider stopping an application manually before stopping PowerHA SystemMirror.

You can change the time period before the config_too_long event is called.

Application start and stop scripts and dependent resource groups

In PowerHA SystemMirror, support for dependent resource groups allows you to configure the following options:

  • Three levels of dependencies between resource groups, for example a configuration in which resource group A depends on resource group B, and resource group B depends on resource group C. PowerHA SystemMirror prevents you from configuring circular dependencies.
  • A type of dependency in which a parent resource group must be online on any node in the cluster before a child (dependent) resource group can be activated on a node.

If two applications must run on the same node, both applications must reside in the same resource group.

If a child resource group contains an application that depends on resources in the parent resource group, and the parent resource group falls over to another node, the child resource group is temporarily stopped and automatically restarted. Similarly, if the child resource group is concurrent, PowerHA SystemMirror takes it offline temporarily on all nodes, and brings it back online on all available nodes. If the fallover of the parent resource group is not successful, both the parent and the child resource groups go into an ERROR state.

Note that when the child resource group is temporarily stopped and restarted, the application that belongs to it is also stopped and restarted. Therefore, to minimize the chance of data loss during the application stop and restart process, customize your application controller scripts to ensure that any uncommitted data is stored to a shared disk temporarily during the application stop process and read back to the application during the application restart process. It is important to use a shared disk because the application might be restarted on a node other than the one on which it was stopped.
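For an application that keeps its pending data in a file, the hand-off through a shared disk might be sketched as follows. The directory, the file names, and the function names are hypothetical, and many applications (databases in particular) have their own recovery mechanisms that should be used instead.

```shell
#!/bin/sh
# Hypothetical locations: SHARED_DIR must be on a shared disk that belongs
# to the resource group; LOCAL_STATE is where the application keeps
# its pending data.
SHARED_DIR="${SHARED_DIR:-/sharedfs/myapp/handoff}"
LOCAL_STATE="${LOCAL_STATE:-/var/myapp/pending.dat}"

# Called from the stop script: park pending data on the shared disk.
# The copy goes to a temporary name first so that a partial file is
# never mistaken for a complete one.
save_pending() {
    [ -f "$LOCAL_STATE" ] || return 0           # nothing to save
    mkdir -p "$SHARED_DIR" &&
    cp -p "$LOCAL_STATE" "$SHARED_DIR/pending.dat.tmp" &&
    mv "$SHARED_DIR/pending.dat.tmp" "$SHARED_DIR/pending.dat"
}

# Called from the start script: bring parked data back, then remove it
# from the shared disk so that it is not replayed on a later start.
restore_pending() {
    [ -f "$SHARED_DIR/pending.dat" ] || return 0
    mkdir -p "$(dirname "$LOCAL_STATE")" &&
    cp -p "$SHARED_DIR/pending.dat" "$LOCAL_STATE" &&
    rm -f "$SHARED_DIR/pending.dat"
}
```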

Application tier issues

Often, applications have a multitiered architecture (for example, a database tier, an application tier, and a client tier). Consider all tiers of an architecture if one or more is made highly available through the use of PowerHA SystemMirror.

For example, if the database is made highly available and a fallover occurs, consider whether actions should be taken at the higher tiers in order to automatically return the application to service. If so, it might be necessary to stop and restart application or client tiers. This can be facilitated in one of two ways. One way is to run the cli_on_node command on the tiers, and the other is to use a remote execution command such as rsh, rexec, or ssh.

Note: Certain methods, such as the use of ~/.rhosts files, pose a security risk.

Using dependent resource groups

To configure complex clusters with multitiered applications, you can use parent-child dependent resource groups. You might also want to consider using location dependencies.

Using the Clinfo API

Clinfo is the cluster information daemon. You can write a program that uses the Clinfo API and runs on any tier to stop and restart an application after a fallover has completed successfully. In this sense, the tier, or application, becomes cluster aware, responding to events that take place in the cluster.

Using pre-event and post-event scripts

Another way to address the issue of multitiered architectures is to use pre-event and post-event scripts around a cluster event. These scripts would call a remote execution command, such as rsh, rexec, or ssh, to stop and restart the application.
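The remote restart performed by such a script might look like the following sketch, which uses ssh. The host name, the remote stop and start commands, and the function name are hypothetical. The sketch assumes that key-based ssh authentication from the cluster node to the application tier is already configured, because an event script cannot answer a password prompt.

```shell
#!/bin/sh
# Hypothetical names: the application tier host and the commands that
# stop and start the application there.
APP_TIER_HOST="${APP_TIER_HOST:-apptier1}"
REMOTE_STOP="${REMOTE_STOP:-/opt/myapp/bin/stop_app}"
REMOTE_START="${REMOTE_START:-/opt/myapp/bin/start_app}"

# BatchMode=yes makes ssh fail instead of prompting, and ConnectTimeout
# keeps an unreachable host from hanging the event script.
# REMOTE_SH can be overridden, for example with "echo" for a dry run.
REMOTE_SH="${REMOTE_SH:-ssh -o BatchMode=yes -o ConnectTimeout=10}"

# Stop, then start, the application on the remote tier.
# Returns nonzero if either step fails.
restart_remote_tier() {
    # REMOTE_SH is deliberately unquoted so that its options split into words.
    $REMOTE_SH "$APP_TIER_HOST" "$REMOTE_STOP" &&
    $REMOTE_SH "$APP_TIER_HOST" "$REMOTE_START"
}
```

Setting REMOTE_SH to echo prints the two remote commands instead of running them, which is a convenient way to check the script before it is attached to a cluster event.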