IBM Support

Operating System Disk Balancing Support

Troubleshooting


Problem

This document describes the operating system disk balancing support, which moves data between disk units.

Resolving The Problem

In V4R4 and above, the STRASPBAL command is available to balance disk utilization across all of the disk arms in an ASP. STRASPBAL can be run during normal operations; however, it may affect performance and is therefore not recommended during critical production periods. (Refer to the Q & A section below for further details regarding performance.)

Operating system disk balancing support consists of the following commands:

o STRASPBAL Command

This command is used to start the ASP disk balancing function. Following are the types of disk balancing supported:

*CAPACITY: Balance DASD in an ASP by capacity. This option spreads the data on all units within an ASP so all units have equal percent used.

Note: This option typically has no effect if any drives in the ASP were marked with *ENDALC.

*USAGE: Balance DASD in an ASP by utilization. After the ASP has been traced for a period of time (see the TRCASPBAL command) to identify frequently used (hot) data and infrequently used (cold) data on each disk unit within the ASP, this option moves cold data onto hot disk units to prevent new allocations of hot data on the hot drives. Note: As of V7R1 TR7, usage balancing also moves hot data off of hot arms.

*HSM: Support the Hierarchical Storage Management (HSM) requirement to move infrequently referenced data to less expensive DASD (for example, compressed disk units). After tracing the ASP for a period of time (see the TRCASPBAL command) to identify frequently used (hot) data and infrequently used (cold) data on each disk unit within the ASP, this option can be used to move cold data to the compressed disk units and move hot data on compressed disk units to noncompressed disk units. Or, on a system with Solid State Drives (SSD), it will move hot data to the SSDs and cold data to the standard disk units. This option runs in two phases - first it moves cold data, and after that completes, it moves the hot data.

Note: When run from an interactive session, the user job will be input inhibited while the move of cold data takes place and will then free up (allow command entry) after submitting the background tasks to move hot data.

The following options were added in V5R2 to drain data from disk units. Refer to document N1019072, Moving Data Off DASD before Removing from ASP, for further details.

*ENDALC: Used to mark specific unit(s) for no new allocations.

Note: Existing data on the drive can still be read and updated. Also, new space can still be allocated on a drive marked for *ENDALC by journaling or if the ASP is full.

*MOVDTA: Moves data off of drives that were previously marked with the *ENDALC option (as long as it will not cause the ASP to exceed its storage threshold).

*RSMALC: Allows the system to resume allocation of new data on specific unit(s) that were previously marked with the *ENDALC option.

The following options were added in V7R1 to "sweep" data with a media preference on or off of Solid State Devices (SSD) and enhance the *HSM function:

*MP: Start the sweeper function to move data with a media preference according to the specified subtype. Refer to document N1011666, Customer Use of SSD (Solid State Drives).

SUBTYPE(*CALC): Used with *HSM balance to move high use (hot) data to SSD units and low use (cold) data to HDD units. Used with *MP balance to move data with *SSD media preference to SSD units and data without *SSD media preference to HDD units.

SUBTYPE(*HDD): Used with *HSM balance to move hot data to uncompressed units and cold data to compressed units (Note: Compression is no longer supported). Used with *MP balance to move data marked with *SSD media preference from HDD units to SSD units.

SUBTYPE(*SSD): Used with *HSM balance to move hot data to SSD units and cold data to HDD units (same as *CALC). Used with *MP balance to move data without *SSD media preference from SSD units to HDD units.

o ENDASPBAL Command

This command is used to end the ASP balancing function before the specified time limit expires or when *NOMAX is used for a time limit.

o TRCASPBAL Command

This is the command to trace ASP activity. It is used to collect the statistics that the STRASPBAL command *USAGE and *HSM balance types require to move data. TRCASPBAL should be run for at least 25 minutes during period(s) of high disk usage (% busy).

o CHKASPBAL Command

This command was added in V5R2 with the *MOVDTA, *ENDALC and *RSMALC options and is used to check which auxiliary storage pool (ASP) balance function is currently active and which units have been marked to not allow new allocations (*ENDALC).

Also in V4R4, "Add units to ASPs, and balance data" options were added to the following screens:

o Work with ASP Configuration screen, available from DST at limited paging.
o Work with Disk Configuration screen, available from DST or SST at full paging.
o Add All Disk Units to the System screen, available during an attended IPL of a system with nonconfigured disk units.

Note: The balance function that is started by adding a drive from service tools (DST or SST) using the Add and Balance option marks the new drives so that no new data can be allocated on them. Only old data can be moved to them from the other drives in the ASP by the balance function. This prevents the new (empty) drives from getting all new data allocations and becoming "hot" which could cause performance problems. The balance function started this way uses the same tasks as the STRASPBAL command and can be ended using the ENDASPBAL command. However, the ENDASPBAL command leaves the drives marked for no new allocations until STRASPBAL *CAPACITY is allowed to run to completion (indicating that there is no more work to do or the specified time limit has expired).

Following are some common or frequently asked questions:

Q: Where is the STRASPBAL command documented?
A: The Software Information Center ( http://publib.boulder.ibm.com/iseries/ ), the CL Reference manual (SC41-5722), Recovering your system (SC41-5304) Section Balancing an Auxiliary Storage Pool, and the online help for the command contain information on the normal use of the command. The Hierarchical Storage Management Guide (SC41-5351) contains specific information for the *HSM option.

Q: Where are the commands running?
A: They do not show up as jobs in WRKACTJOB as they run below the MI as tasks. The WRKSYSACT command or the DST/SST Display/Alter/Dump task list will show tasks that have names that start with:

o SMIOCOLLECT
o SMTRCTASK
o SMDASDUTILTASK
o SMBALUNIT
o SMEQ

Q: Does STRASPBAL move all types of data?
A: No. The following data will not be moved by STRASPBAL:

o Licensed Internal Code on the load source
o Storage Management directories
o Temporary objects
o Varied on NetWork Storage Descriptions (NWSD) and their linked Storage Spaces (NWSSTG)
o Objects that are currently being used or pinned in main storage

Q: Why did the STRASPBAL command finish early?
A: Check completion message CPI1475 for the ending code. The function may have finished because of an error, because an ASP balance was already running (use CHKASPBAL to check), or because the ASP is already balanced.

Q: After running STRASPBAL *CAPACITY, why aren't all drives perfectly balanced?
A: STRASPBAL may not have been run with a long enough time limit, there may be drives marked for no new allocations (*ENDALC), or there may be data that cannot be moved (see the list of data that STRASPBAL does not move, above).

Note: The load source is often different in the final percentages. The command does the best it can at making the percentages even; however, there is no guarantee it will make them all even.
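
As an illustration of what "equal percent used" means for *CAPACITY balancing, the following minimal Python sketch computes the ideal target percentage and how much data each unit would need to gain or lose to reach it. It is only arithmetic under simplifying assumptions, not the algorithm the system uses: it ignores data that cannot be moved, units marked with *ENDALC, and the load source. The function name, the input layout, and the sample sizes are hypothetical.

```python
def capacity_targets(units):
    """Return (target_pct, moves) for an idealized capacity balance.

    units maps a unit number to (capacity_gb, used_gb).
    moves maps a unit number to the GB it would gain (positive)
    or give up (negative) to reach the common percent used.
    Assumption: all data is movable and no unit is excluded.
    """
    total_capacity = sum(cap for cap, _ in units.values())
    total_used = sum(used for _, used in units.values())
    if total_capacity <= 0:
        raise ValueError("total capacity must be greater than zero")

    # Equal percent used on every unit equals the ASP-wide percent used.
    target_pct = 100.0 * total_used / total_capacity

    moves = {}
    for unit, (cap, used) in units.items():
        moves[unit] = cap * target_pct / 100.0 - used
    return target_pct, moves


# Hypothetical ASP: two full-ish 100 GB units plus one new, empty 200 GB unit.
sample_units = {1: (100, 80), 2: (100, 80), 3: (200, 0)}
target, deltas = capacity_targets(sample_units)
print(target, deltas)
```

In this hypothetical case the target is 40% used on every unit, so units 1 and 2 would each give up 40 GB and unit 3 would receive 80 GB. Actual results differ from this ideal for the reasons given in the answer above.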

Q: Why, after running TRCASPBAL and STRASPBAL *USAGE, does the system say the ASP is balanced?
A: *USAGE needs to see a lot of hot and cold data and units with statistically high utilization in order to begin figuring out what to do. If no units were overworked during the TRCASPBAL, then STRASPBAL would not begin to look at data to move.

Q: If STRASPBAL *CAPACITY is started and then an IPL is performed, will it continue after the IPL?
A: If STRASPBAL *CAPACITY is started from DST/SST as part of an add disk unit and balance menu option, it will restart automatically at the next IPL. Otherwise, it will need to be restarted manually after the IPL completes.

Q: What is the performance impact of TRCASPBAL or STRASPBAL?
A: Benchmarks done with a trace being collected did not show any noticeable degradation. *USAGE movement can/will cause a performance degradation, but because the system targets cold data to move, it should be minimal. *HSM movement is more noticeable because the system is moving hot and cold data; however, this depends on user ASP usage.

*CAPACITY movement results will vary. For instance, adding two units to a one-unit system will probably be noticeable; however, adding one unit to a ten-unit system probably will not. Most people will not do an add at prime time because, if no balancing is done, the system will make the new drive a hot spot, and that is generally noticed. The balance function can be ended; however, the system then picks on those drives, which might cause a performance problem that lasts longer than letting the system finish the balance. Doing the add and balance off-shift is better because it minimizes the impact to all.

Because STRASPBAL makes heavy use of memory in the machine pool, there is a possibility of significant system-wide performance degradation if the following conditions exist before running STRASPBAL:
1. WRKSYSSTS shows machine pool fault rates above 10 per second.
2. System value QPFRADJ is set to 0 (no adjustment).
Additional memory can be moved into the machine pool to correct the problem.

Q: Are there ways to speed up STRASPBAL?
A: The STRASPBAL function is memory and CPU intensive. If you can add more memory to the machine pool, by either memory pool adjustments or a DLPAR add of additional memory, you will likely see improved run time. Similarly, if you DLPAR add more processing resources, the throughput of the function is likely to improve. Setting the PRIORITY parameter to *HIGH will give the function more access to the partition resources. Be aware that this can degrade other functions and applications because of resource contention.

Q: Is there any performance difference using STRASPBAL as opposed to the 'Add and Balance' option in Service Tools?
A: The 'Add and Balance' option and STRASPBAL *CAPACITY run the exact same SLIC code; therefore, there is no difference in performance.

Q: Can a STRASPBAL be done while a TRCASPBAL is being done for the same ASP?
A: This is not allowed.

Q: What happens when STRASPBAL attempts to move data that is damaged or encounters disk problems?
A: STRASPBAL does not move objects; it moves extents. Therefore, when a bad sector is encountered, it is dealt with (by using parity or going to the mirrored pair) and a VLOG is cut if there was actual data loss (normal process). However, no object-specific handler is called. Therefore, no damage is set or changed from partial to full. The setting of damage occurs when the user touches that page of the object. The point to remember is that the sector was already bad and STRASPBAL did not make it go bad; the damage will be detected by the user at the same point whether STRASPBAL was used or not.

Q: Does the Add units to ASPs and balance data function do both operations at once or one after the other?
A: The add part is the same as before and then gives control back to the user with the disk balance running in the background. This allows for normal operations to continue. If the add is done at DST, the background tasks will not start until after the storage management IPL step. If the add is done at SST, the background function starts after control is given back to the user.

Q: Can a STRASPBAL *USAGE be restarted?
A: Another trace should be run prior to restarting STRASPBAL.

Q: Does the STRASPBAL command move journals and journal receivers?
A: The STRASPBAL command did not move journals and journal receivers in R520 or older releases. Starting at R530, journal objects will balance in user ASPs and IASPs, but not the system ASP. In R710 and above, STRASPBAL will move journal data in all ASPs, including the system ASP.

Q: How does STRASPBAL *USAGE determine AVERAGE, HOT, and COLD disk units?
A: The % busy (over the entire TRCASPBAL period) of each disk unit in the ASP is added together, and that sum is divided by the number of units in the ASP to give the average. If a disk unit is more than 90% busy, it is HOT. If a disk unit is less than 6% busy, it is COLD. If a disk unit is 5 percentage points above the average, it is HOT. If a disk unit is 5 percentage points below the average, it is COLD. Units not fitting the criteria above are considered AVERAGE.
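
The rules in this answer can be sketched in a few lines of Python. This is an illustration only, not the SLIC implementation. Two points are assumptions because the text does not state them: the rules are applied in the order listed (the absolute 90% and 6% limits are checked before the comparison with the average), and "5 percentage points" is read as "at least 5". The function name and the sample values are hypothetical.

```python
def classify_units(busy_pct):
    """Label each disk unit HOT, COLD, or AVERAGE.

    busy_pct maps a unit number to its % busy over the trace period.
    Assumptions: rules are applied in the order the text lists them,
    and a difference of exactly 5 points counts as above/below.
    """
    if not busy_pct:
        return {}

    # Average = sum of each unit's % busy divided by the number of units.
    average = sum(busy_pct.values()) / len(busy_pct)

    labels = {}
    for unit, busy in busy_pct.items():
        if busy > 90:
            labels[unit] = "HOT"       # more than 90% busy
        elif busy < 6:
            labels[unit] = "COLD"      # less than 6% busy
        elif busy - average >= 5:
            labels[unit] = "HOT"       # 5 points above the average
        elif average - busy >= 5:
            labels[unit] = "COLD"      # 5 points below the average
        else:
            labels[unit] = "AVERAGE"
    return labels


# Hypothetical trace results for a five-unit ASP (average is 46% busy).
sample = {1: 95.0, 2: 3.0, 3: 40.0, 4: 50.0, 5: 42.0}
print(classify_units(sample))
```

With these sample values, unit 1 is HOT, units 2 and 3 are COLD, and units 4 and 5 are AVERAGE because they are within 5 percentage points of the 46% average.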

Q: Can the system still allocate storage on a disk drive even if it's been marked for end allocation?
A: Even if allocation for a disk unit has been ended using STRASPBAL *ENDALC, journaling can still allocate storage (create receivers) on that drive. Also, the system can still allocate storage on that drive if the system is running out of space (ASP is full).

Q: Why does restarting STRASPBAL *MOVDTA seem to not do anything?
A: If STRASPBAL *ENDALC then *MOVDTA was run with a time limit (or ended with ENDASPBAL) and then re-started, it would be normal to not see data moving for quite some time. This is because it starts in the same place each time rather than from where it left off. In this case you will see SMEQ* tasks taking CPU, but not much IO. After it gets past the point where it previously ended, then it should find more data to move.

Another issue may be that there is not enough memory in the system machine pool. Before being able to move data, the entire permanent storage directory needs to be scanned to determine what data to move. If there is not much memory in the machine pool, it can take a while to page in the directory and gather the needed information about the disk units.

Q: Why can CHKASPBAL display messages CPI18A4 and CPI18A6 indicating ASP balancing is not active even though there are still SMEQ tasks running?
A: When ENDASPBAL has been issued or STRASPBAL ends due to the specified time limit being reached, a termination flag is set. If system resources (typically the machine memory pool) are constrained, the SMEQ tasks may be busy (thrashing - trying to page data in and out of memory, or waiting for processor time, and so on) and may not immediately recognize the termination flag. VLOG 1000 20C9 indicates when all the SLIC tasks associated with the command complete.

Q: When I add units to an ASP, which should I do? 1) Add the units without balancing, 2) Use the Add and Balance option, 3) Add the units and then use STRASPBAL *CAPACITY?
A: It depends on your situation. The options include:

Option 1: Adding the units and not balancing may be the best choice if:

o You are adding the units to an ASP that is used only for Journaling
o You are adding more drives to the ASP than are currently in the ASP


Option 2: The Add and Balance option is best for preventing performance problems caused by "hot" disk arms.

Option 3: Add the units, and then use the STRASPBAL *CAPACITY command when:

o Adding drives to immediately alleviate a serious storage condition in the ASP.
o The system is in restricted state or there is very little activity in the target ASP.
o You are adding more drives to the ASP than what are currently in the ASP.


Q: Does STRASPBAL need to run in a restricted state?
A: No. The Start ASP Balance (STRASPBAL) command can be run during normal operations. If performance is impacted during normal operations, the ENDASPBAL command can be used to end the balancing function. The STRASPBAL command can then be run at another time. Refer to the command help text for further information about these commands.

Q: I have run TRCASPBAL and STRASPBAL *USAGE; however, my busy disk drives are still busy. Why?
A: In V7R1 TR6 and earlier, STRASPBAL *USAGE only moves cold (unused) data. Hot (frequently used) data is not moved by this option. If you have enough available space in the ASP, you may wish to try draining the drive (STRASPBAL *ENDALC and STRASPBAL *MOVDTA) to see if it alleviates the problem. You may also want to use iDoctor to trace the disk access and determine which objects are being accessed, and by which jobs or programs, to help determine another course of action.

Q: I have run TRCASPBAL and STRASPBAL *HSM; however, my SSD units are much less full than my disk drives. Why?
A: This may be due to one of the issues discussed above, such as not running TRCASPBAL long enough, or not running it at the right time(s). However, there is also an advanced analysis command in SST, PercentToSSD, which can be used to try to move more data to the SSD units by adjusting the algorithm used by STRASPBAL *HSM. It has one parameter, called -PER, where the user specifies the percentage of the IO operations they want moved to the SSD units in the traced ASP.

Q: I ran TRCASPBAL. Where is the trace data kept and how can I view it?
A: The TRCASPBAL buffers are internal to SLIC and are not accessible by users. There are no commands or tools to view or query the data. The buffer storage is resident main storage carved out of the machine pool. Because the trace data is kept in memory, it is lost at IPL time.

Messages associated with disk balancing:

CPC18A1 Tracing started for ASP &1 with name &2.
CPC18A2 Tracing ended for ASP &1 with name &2.
CPC18A3 Trace data cleared on ASP &1 with name &2.
CPC18A4 ASP balance function &1 started.
CPC18A5 &3 started for ASP &1 with name &2.
CPC18A6 Disk balancing ended for ASP &1 with name &2.
CPC18A7 &1 request processed.
CPD18AA Disk balancing detected an error for ASP &2 with name &3.
CPI1474 ASP balancing successfully started for ASP &1.
CPI1475 ASP balancing for ASP &1 ended.
CPI1476 ASP tracing successfully started for ASP &1.
CPI1477 ASP tracing for ASP &1 ended.
CPI1478 ASP tracing data successfully cleared for ASP &1.
CPI18A1 Resume or end allocation selection has completed.
CPI18A2 Resume or end allocation selection failed.
CPI18A3 Unit &1 is selected for end allocation.
CPI18A4 No units are selected for end allocation.
CPI18A5 ASP balancing type &1 is active for ASP &2.
CPI18A6 ASP balancing is not active for ASP &1.
CPI18A7 ASP tracing is active for ASP &1 with name &2.
CPI18A8 ASP tracing is not active for ASP &1 with name &2.
CPF18A9 ASP tracing for ASP &1 already started.
CPF18AA ASP tracing not active for ASP &1.
CPF18AB ASP balancing for ASP &1 already started.
CPF18AC ASP balancing not active for ASP &1.
CPF18AD ASP &1 must contain more than a single unit.
CPF18AE ASP &1 does not contain trace data.
CPF18AF ASP &1 does not contain mixed unit types.
CPF18B1 Trace function currently running for ASP &1.
CPF18B2 Balance function running for ASP &1.
CPF18B3 Balance type not valid for ASP &1.


Licensed Internal Code log (also known as VLOG) entries associated with disk balancing:

Major Code  Minor Code  Description
1000        2090        Start media preference mover
1000        2091        Media preference mover error
1000        2092        Cancel media preference mover
1000        2093        Media preference mover timeout
1000        2094        Media preference move finished
1000        20C1        Indicates that a user issued a STRASPBAL *ENDALC/*RSMALC for a disk unit(s).
1000        20C5        Log time and info relating to Equalizer Utility
1000        20C6        An ASP free space error was detected.
1000        20C7        A DASD Balancer has been started for an ASP.
1000        20C8        A DASD Balancer unit task has completed.
1000        20C9        A DASD Balancer has completed for an ASP.
1000        20CB        A DASD Balancer unit task for hot data has completed
1000        20CC        No unit to balance


Historical Number

16803317

Document Information

Modified date:
18 April 2023

UID

nas8N1019618