
Live Partition Mobility Performance

Troubleshooting


Problem

This document describes the most common causes that impact Live Partition Mobility (LPM) performance and things to consider to resolve the problem.

Symptom

LPM takes a long time to complete. At the end, the LPM process reports 100% completed, and both the AIX mobile partition and the Mover Service Partitions (MSPs) log successful completion.
On AIX:
# errpt |grep -i mig
08917DC6 CLIENT_PMIG_STARTED INFO S  Client Partition Migration Started
A5E6DB96 CLIENT_PMIG_DONE    INFO S  Client Partition Migration Completed


On MSPs:
$ errlog|grep -i mig
3EB09F5A MVR_MIG_COMPLETED   INFO S  Migration completed successfully
Environments where the source and target MSPs do not have adequate network bandwidth between them may experience long delays during the partition mobility operation. For example, if a mobile partition is migrated from an MSP with a 10 Gb network adapter to an MSP with a Logical Host Ethernet Adapter (LHEA), which is not recommended for optimal performance, the effective network throughput during LPM can be extremely low. In some cases, the Suspend and Resume stages take longer, and the mobile partition can become inaccessible during that time.
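
To confirm the link speed of the adapter that each MSP uses for the mobility network, the entstat command can be run from the VIOS restricted shell on the source and target MSPs. This is a minimal sketch; ent0 is a placeholder for the adapter that actually backs the mobility network in your environment, and the exact output wording varies by adapter type.

$ entstat -all ent0 | grep -i "media speed"    (reports the negotiated link speed, for example 10 Gbps Full Duplex)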

Cause

The most common causes of LPM performance problems include, but are not limited to:
1. Low VIOS resources (CPU or memory)
2. Low network bandwidth between the mover service partitions (MSPs)
3. High disk I/O (more commonly seen on NPIV SAN disks)
4. Concurrent migrations over a busy network
5. Disk-level validation enabled on the VIOS

Environment

LPM with VIOS 3.1 and a mobile partition running AIX.

Diagnosing The Problem

If the LPM environment includes a client partition running a time-sensitive cluster application (that is, one that depends on network or disk heartbeats), LPM performance may have an impact on the application. For example, if the mobile partition is part of a cluster, the client may experience a failover or node eviction due to a network or disk heartbeat timeout.

If you are experiencing application issues on the mobile partition during the LPM process, or immediately after LPM is reported as successfully completed, engage your application support to determine why the application failed (for example, disk I/O errors or network timeouts).
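
To correlate application or cluster errors with the LPM window, compare the timestamped error logs on the mobile partition with the migration entries shown in the Symptom section. A minimal sketch on AIX; the date string is a placeholder in mmddhhmmyy format for the approximate LPM start time.

# errpt -s 0519100023          (lists errors logged since the assumed LPM start time)
# errpt -a -j A5E6DB96         (detailed view of the Client Partition Migration Completed entry, including its timestamp)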

Resolving The Problem

Live Partition Mobility Performance Considerations

Active partition migration involves moving the state of a partition from one system to another while the partition is still running. The mover service partitions working with the hypervisor use partition virtual memory functions to track changes to partition memory state on the source system while it is transferring memory state to the destination system.

During the migration phase, an initial transfer of the mobile partition’s physical memory from the source to the destination occurs. Because the mobile partition is still active, a portion of the partition’s resident memory will almost certainly have changed during this pass. The hypervisor keeps track of these changed pages for retransmission to the destination system in a dirty page list. It makes additional passes through the changed pages until the mover service partition detects that a sufficient number of pages are clean or the timeout is reached.

The speed and load of the network that is used to transfer state between the source and destination systems influence the time required for both the transfer of the partition state and the performance of any remote paging operations.

The amount of changed resident memory after the first pass is controlled more by write activity of the hosted applications than by the total partition memory size. Nevertheless, a reasonable assumption is that partitions with a large memory requirement have higher numbers of changed resident pages than smaller ones.

To ensure that active partition migrations are truly nondisruptive, even for large partitions, the POWER Hypervisor resumes the partition on the destination system before all the dirty pages have been migrated over to the destination. If the mobile partition tries to access a dirty page that has not yet been migrated from the source system, the hypervisor on the destination sends a demand paging request to the hypervisor on the source to fetch the required page.

Providing a high-performance network between the source and destination mover partitions and reducing the partition’s memory update activity prior to migration will improve the latency of the state transfer phase of migration. We suggest using a dedicated network for state transfer, with a nominal bandwidth of at least 1 Gbps.
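
While the state transfer is running, the progress of an active migration can be observed from the HMC command line. A minimal sketch, assuming HMC CLI access; src_system is a placeholder for the source managed system name, and the attributes reported can vary by HMC level.

$ lslparmigr -r lpar -m src_system     (shows the migration state of each partition on the source system)
$ lslparmigr -r msp -m src_system      (shows the MSP pairs in use for migrations that are in progress)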



Recommendations for addressing the most common causes of LPM performance problems include, but are not limited to, the following:

1. Verify that the source and target VIO servers have enough CPU and memory resources by running the VIOS Performance Advisor tool on the source and target VIO servers, as shown in the example after this item. This is primarily a proactive approach. If VIOS resources are overutilized, the Performance Advisor tool generates an *.xml file with new suggested values. Examine the file and make the recommended change(s) before reattempting LPM. The recommended values are based on the VIOS workload at the time the performance data is collected.


For Partition Mobility, it is recommended to have a minimum of one full, dedicated CPU on each source and target VIOS.
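
A minimal sketch of collecting Performance Advisor data from the VIOS restricted shell; the 30-minute interval is an illustrative value, and the monitoring window should cover a period of representative VIOS workload.

$ part -i 30       (monitors the VIOS for 30 minutes and produces a report archive, including the *.xml file mentioned above, in the padmin home directory)

Run the collection on both the source and target VIOS and review the reports before reattempting LPM.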

2. If a time-sensitive (network-dependent) application is impacted, and application support determines that the issue was due to a network timeout around the time the LPM performance problem was observed, check the network adapter type used for the mobility operation on the source and target MSPs.
A dedicated 10-Gigabit connection for LPM operations is strongly recommended between the source and target MSPs. See Preparing the network configuration for partition mobility for more details.
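
To see which adapter types are available to each MSP (for example, whether a 10-Gigabit adapter or an LHEA backs the mobility network), the adapters can be listed from the VIOS restricted shell. A minimal sketch; ent0 is a placeholder, and the names and descriptions depend on the installed hardware.

$ lsdev -type adapter          (lists adapters with their descriptions, such as physical 10-Gigabit Ethernet ports or Logical Host Ethernet Adapter ports)
$ lsdev -dev ent0 -attr        (shows the attributes of a specific adapter)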

3. If high disk I/O is observed while the LPM performance issue is occurring, contact your storage support representative for investigation.
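
Before engaging storage support, it can help to capture basic disk I/O statistics on the mobile partition while the slow LPM is in progress. A minimal sketch on AIX; the interval and count are illustrative values.

# iostat -DlT 60 5             (extended per-disk statistics with timestamps: five samples at 60-second intervals)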

4. If concurrent LPM operations are being attempted over a busy network, try migrating one client partition at a time.
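
From the HMC command line, migrations can be serialized by starting the next migrlpar operation only after the previous one completes. A minimal sketch, assuming HMC CLI access; the managed system and partition names are placeholders.

$ migrlpar -o v -m src_system -t dest_system -p lpar1     (validate the first mobile partition)
$ migrlpar -o m -m src_system -t dest_system -p lpar1     (migrate it and wait for completion)
$ migrlpar -o m -m src_system -t dest_system -p lpar2     (only then start the next migration)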

5. Check whether NPIV disk-level validation for LPM is enabled on the VIOS.
NPIV LUN or disk level validation is an LPM enhancement introduced with VIOS 2.2.4.0. This functionality is disabled by default, but when enabled, disk validation can add a considerable amount of time to partition mobility validation for clients that are using NPIV disks. The amount of time that is required to validate NPIV devices up to the disk level depends on the number of disks that are mapped to a client. For larger configurations, the additional time that is spent in validation might have a noticeable impact on the overall time that is required to migrate the partition.

(In cases where the zoning is correct, but the storage is not properly provisioned, the validation may eventually fail after the long delay.)

To check whether disk-level validation is enabled, and to disable it:


On the source VIO server(s):
$ lsdev -dev vioslpm0 -attr | grep src_lun_val     (off = disabled, on = enabled)
$ chdev -dev vioslpm0 -attr src_lun_val=off        (disables it; the change is effective immediately)

On the destination VIO server(s):
$ lsdev -dev vioslpm0 -attr | grep dest_lun_val    (off = disabled, on = enabled)
$ chdev -dev vioslpm0 -attr dest_lun_val=off       (disables it; the change is effective immediately)
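
If disk-level validation was disabled only to speed up a particular migration, it can be re-enabled afterward by setting the same attributes back to on; as above, the change is effective immediately.

$ chdev -dev vioslpm0 -attr src_lun_val=on         (on the source VIO server)
$ chdev -dev vioslpm0 -attr dest_lun_val=on        (on the destination VIO server)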

[{"Type":"MASTER","Line of Business":{"code":"LOB57","label":"Power"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSPHKW","label":"PowerVM Virtual I\/O Server"},"ARM Category":[],"Platform":[{"code":"PF002","label":"AIX"}],"Version":"All Versions"}]

Document Information

Modified date:
19 May 2023

UID

isg3T1022733