Release Notes
Abstract
This document describes important known issues for the PureData System for Operational Analytics (PDOA) product. It consolidates the same or similar content from the fix-pack-specific known issues documents and is intended to cover all known issues from PDOA versions V1.0.0.6/V1.1.0.2 onward.
Content
- General: This issue occurs while the system is running on the impacted versions. The impacted version refers to the version that is running on the system.
- Fixpack: This issue occurs when applying a fix pack. The impacted version refers to the version that is being applied during the fix pack application.
- I_V#.#.#.#: In the search bar, use I_V and the version of PDOA to see the known issues related to a specific version. The search bar applies to all columns in the table.
- Use a combination of the column sort and search lookup to find known issues related to the symptoms experienced.
| Reference Number | Type | Impacted Versions | Symptom | Resolution |
|---|---|---|---|---|
|
KI002423
Unable to filter system console events using the time filter
|
General |
I_V1.0.0.0
|
Unable to filter system console events using the time filter
In the system console Events pane, when you select a time interval value of Last 24 Hours or Last Hour, no events are displayed.
|
Workaround:
-----------
1. Select the time interval value of All to display all events.
2. If the events are not sorted in descending order, click the Updated on field in the table to sort the events in descending order. The most recent events are displayed first in the table.
Fixed:
-----------
Addressed in the V1.0.0.2 - V1.0.0.5 fixpacks and all V1.1 systems.
V1.0.0.6/V1.1.0.2 Fixpacks remove the appliance console from the appliance.
|
|
KI003332
Mozilla Firefox 20.0 is an unsupported browser
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.0
I_V1.1.0.1
|
Mozilla Firefox 20.0 is an unsupported browser
The system console GUI fails to load the login page or provide any corrective instructions when viewed through the latest version of the Mozilla Firefox browser (20.0) released on April 2, 2013.
|
Workaround:
-----------
Install and enable the Firefox ESR release from the following page: http://www.mozilla.org/en-US/firefox/organizations/all.html
Alternatively, use another supported browser such as Internet Explorer V8 or V9.
Fixed:
-----------
V1.0.0.6/V1.1.0.2 Fixpacks remove the appliance console from the appliance.
|
|
KI002637
A resume option has changed for the miupdate command
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
A resume option has changed for the miupdate command
The -e alternative to the --resume option for the miupdate command has been changed to -resume.
|
Workaround:
-----------
Do not use the -e option to resume the update process. Use the -resume option instead. For example, to resume the update process after the system console suspends the update to reboot one or more nodes in the system, the following miupdate command options can be used:
miupdate --resume | -resume [management | prepare | apply | commit]
Fixed:
-----------
V1.0.0.6/V1.1.0.2 Fixpacks remove the appliance console from the appliance; miupdate is no longer used for fixpack application.
|
|
KI002702
During fix pack registration the console might hang while trying to SSH to localhost (127.0.0.1)
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
During fix pack registration the console might hang while trying to SSH to localhost (127.0.0.1)
During fix pack registration, there is at least one command to SSH to localhost (127.0.0.1) to perform some tasks. However, SSH does not recognize the localhost host key under the IP address 127.0.0.1, and waits for you to enter Yes to accept the host key.
|
Workaround:
-----------
Before you install the fix pack, enter the following command as root on the management host to accept the host key:
ssh 127.0.0.1 ls
When prompted to accept the new host key, enter Yes.
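As an alternative, the host key can be accepted without an interactive prompt by appending it to root's known hosts file. This sketch is not part of the original procedure; it assumes root's home directory is / (consistent with the /.profile path used elsewhere in this document):
# append the localhost host key so that ssh to 127.0.0.1 no longer prompts
ssh-keyscan 127.0.0.1 >> /.ssh/known_hosts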
Fixed:
-----------
V1.0.0.6/V1.1.0.2 Fixpacks remove the appliance console from the appliance which no longer uses this fixpack registration mechanism.
|
|
KI002696
The system console GUI becomes inaccessible during the management phase or after the management phase of a fix pack installation completes
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
The system console GUI becomes inaccessible during the management phase or after the management phase of a fix pack installation completes
During the management phase of a fix pack installation, if you attempt to log in to the system console GUI or to use it for tasks other than installing the fix pack, the system console GUI becomes inaccessible at the URL https://management_host_name, where management_host_name represents the host name or IP address of the management host.
The system console GUI might also become inaccessible after the management phase completes.
|
Workaround:
-----------
Restart the system console GUI:
1. Log in to the management host as the root user.
2. Run the following command:
miresolve -restart
Fixed:
-----------
V1.0.0.6/V1.1.0.2 Fixpacks remove the appliance console from the appliance.
|
|
KI003581
During the apply phase of a fix pack installation the progress bar incorrectly indicates the stage is 100% complete
|
Fixpack | I_V1.0.0.1 |
During the apply phase of a fix pack installation the progress bar incorrectly indicates the stage is 100% complete
During the apply phase of a fix pack installation, the progress bar in the system console GUI might indicate that the stage is 100% complete but the Apply Fix Pack window indicates that stage 4 (Apply to non-management hosts) is in a Running state.
|
Workaround:
-----------
1. Ignore the percentage complete value displayed in the progress bar.
2. Verify that the apply stage has completed successfully. The Apply Fix Pack window displays a Completed state for stage 4 (Apply to non-management hosts) when the apply stage has completed successfully.
Fixed:
-----------
V1.0.0.2 (fixed in V1.0 Fix Pack 2)
V1.0.0.6/V1.1.0.2 Fixpacks remove the appliance console from the appliance.
|
|
KI007645
After the commit phase of a fix pack installation the progress bar incorrectly indicates the stage is less than 100% complete
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
After the commit phase of a fix pack installation the progress bar incorrectly indicates the stage is less than 100% complete
After the commit phase of a fix pack installation completes successfully, the progress bar in the system console GUI might indicate that the stage is less than 100% complete but the Apply Fix Pack window indicates that stage 5 (Commit) is in a Completed state.
|
Workaround:
-----------
1. Ignore the percentage complete value displayed in the progress bar.
2. Verify that the commit stage has completed successfully. The Apply Fix Pack window displays a Completed state for stage 5 (Commit) to indicate the commit stage has completed successfully.
Fixed:
-----------
V1.0.0.6/V1.1.0.2 Fixpacks remove the appliance console from the appliance.
|
|
KI004047
Paging space is set incorrectly on the management node
|
General | I_V1.0 |
Paging space is set incorrectly on the management node
Paging space on the management node is set to the incorrect value of 48 GB. The paging space needs to be set to a value of 128 GB.
|
Workaround:
-----------
Use the mgmt_ps script to set the value of the paging space to 128 GB on the management host.
Note: You do not need to change the value of the paging space on the standby management host. The paging space is set to the correct value of 128 GB on the standby management host.
1. Obtain the mgmt_ps.zip file from IBM Support.
2. Log in to the management host as the root user.
3. Copy the mgmt_ps.zip file to a temporary directory on the management host.
4. Navigate to the temporary directory and extract the mgmt_ps script.
5. Grant execute permission on the script:
6. Issue the following command to run the script and increase the paging space to 128 GB:
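The commands for steps 5 and 6 are elided above; a plausible sketch, assuming the extracted script is named mgmt_ps and takes no arguments (both assumptions):
# step 5: grant execute permission on the script
chmod +x mgmt_ps
# step 6: run the script to increase the paging space to 128 GB
./mgmt_ps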
Fixed:
-----------
The only fix is to apply the workaround. There will be no automated fix as part of any appliance fixpack.
|
|
KI003962
The fix pack installation hangs when only a directory is specified as the location of the fix pack file
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
The fix pack installation hangs when only a directory is specified as the location of the fix pack file
When specifying the location of the fix pack file on the management host, if you specify only a directory and do not include the fix pack file name, the fix pack installation hangs. No warning message or error message is displayed.
|
Workaround:
-----------
Refresh the system console and specify the full path and the file name of the fix pack file in the Add Fix Pack window.
Fixed:
-----------
V1.0.0.6/V1.1.0.2 Fixpacks remove the appliance console from the appliance.
|
|
KI004076
The miinfo -d -c command incorrectly indicates that the versions of some components are higher than expected
|
General | I_V1.0.0.2 |
The miinfo -d -c command incorrectly indicates that the versions of some components are higher than expected
When you run the miinfo -d -c command, the command incorrectly indicates that the versions of the following software components are higher than expected:
|
Workaround:
-----------
Ignore the warning about the higher than expected versions of these software components. The version returned by the command is correct and is the expected version.
Fixed:
-----------
V1.0.0.3
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004088
The system console cannot access Systems Director after the miauth command is used to change the password for the restuser user
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.1.0.0
|
The system console cannot access Systems Director after the miauth command is used to change the password for the restuser user
If you do not stop the system console before you use the miauth command to change the password for the restuser user, as documented in the password change procedure Changing the passwords for system console component users, the system console cannot access Systems Director.
|
Workaround:
-----------
1. Log in to the management host as root.
2. Stop the system console:
3. Unlock Systems Director:
4. Start the system console:
IMPORTANT: For future password changes for the restuser user, use the password change procedure that is documented in Changing the passwords for system console component users.
Fixed:
-----------
V1.0.0.5/V1.1.0.1 has fundamental changes that impact this issue.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004070
The system console becomes unresponsive after the fix pack file is uploaded
|
Fixpack |
I_V1.0.0.2
|
The system console becomes unresponsive after the fix pack file is uploaded
After you upload the fix pack file to the system, after approximately one hour the Add Fix Pack window in the system console remains unresponsive and greyed out.
|
Workaround:
-----------
1. Verify that the system console is unresponsive by attempting to open the Welcome page in the system console.
2. Determine if there are any system console modules that are stopped. Issue the following command as root on the management host:
3. If there are modules that are stopped, issue the following command to restart the modules:
Wait approximately 10 minutes for the system console modules to restart, and then continue with the fix pack installation.
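The commands for steps 2 and 3 are not preserved in this copy. The workaround for KI002696 earlier in this table restarts the console with the following command, which may be the intended restart step here as well (an assumption, not confirmed by the source):
# restart the system console modules (command taken from the KI002696 workaround)
miresolve -restart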
Fixed:
-----------
This issue is fixed in V1.0.0.3 and higher and V1.1.0.1.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004082
The miupdate -resume current command hangs after it fails or after it finishes successfully
|
Fixpack |
I_V1.0.0.2
|
The miupdate -resume current command hangs after it fails or after it finishes successfully
The miupdate -resume current command hangs when resuming the fix pack installation after the M_Failed_Apply_Nonimpact substage.
|
Workaround:
-----------
1. Log in as root on the management host and issue the following command:
In the output returned by the command, identify the process IDs for the miupdate process and the MIUpdateCLI process. In the following sample output, the miupdate process ID is 11599914 and the MIUpdateCLI process ID is 12976198.
root 11599914 1 0 12:25:11 - 0:00 /bin/sh /opt/IBM/mi/bin/miupdate -resume current -n
root 12976198 11599914 0 12:25:11 - 0:02 /usr/java6_64/bin/java -classpath .:/opt/IBM/mi/lib/*:/opt/IBM/mi/lib/log4j/* -Dlog4j.configuration=file:/opt/IBM/mi/configuration/log4j.properties -Djavax.net.ssl.trustStore=/opt/ibm/director/lwi/security/keystore/ibmjsse2.jks -DUSERDIR=/usr/IBM/applmgmt/isas.server com.ibm.isas.cli.command.application.MIUpdateCLI -resume current -n
root 21037306 35848408 0 12:39:58 pts/8 0:00 grep resume
2. For each process ID, issue the following command to terminate the hanging process:
where <pid> represents the process ID of the miupdate process or the process ID of the MIUpdateCLI process.
3. Click the Resume button in the system console.
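The commands for steps 1 and 2 are elided above; a plausible sketch, inferred from the grep resume entry in the sample output and from the kill command shown in the KI004080 workaround later in this table:
# step 1: list the hanging update processes
ps -ef | grep resume
# step 2: terminate each hanging process by its process ID
kill -9 <pid>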
Fixed:
-----------
This issue is fixed in V1.0.0.3 and higher and V1.1.0.1.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004991
Error resuming the fix pack installation after the M_Applied_Nonimpact substage
|
Fixpack | I_V1.0.0.2 |
Error resuming the fix pack installation after the M_Applied_Nonimpact substage
The fix pack installation fails in the Apply to management hosts stage. After you address the issue and click Resume current phase in the system console, the system console does not respond and the fix pack installation does not continue.
|
Workaround:
-----------
1. Log in to the management host as the root user.
2. Verify that the fix pack installation is in the M_Applied_Nonimpact substage:
3. Resume the fix pack installation:
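The commands for steps 2 and 3 are elided here; the companion record for the M_Failed_Apply_Nonimpact substage (KI004991, below) uses the following commands for the same two steps, which presumably apply here as well:
# step 2: display the current fix pack installation substage
appl_ls_cat
# step 3: resume the fix pack installation on the management hosts
miupdate -u management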
Fixed:
-----------
This issue is fixed in V1.0.0.3 and higher and V1.1.0.1.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004991
Error resuming the fix pack installation after the M_Failed_Apply_Nonimpact substage
|
Fixpack | I_V1.0.0.2 |
Error resuming the fix pack installation after the M_Failed_Apply_Nonimpact substage
The fix pack installation fails in the Apply to management hosts stage. After you address the issue and click Resume current phase in the system console, the system console does not respond and the fix pack installation does not continue.
|
Workaround:
-----------
1. Log in to the management host as the root user.
2. Verify that the fix pack installation is in the M_Failed_Apply_Nonimpact substage:
appl_ls_cat
3. Resume the fix pack installation:
miupdate -u management
Fixed:
-----------
This issue is fixed in V1.0.0.3 and higher and V1.1.0.1.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004080
The system console displays incorrect status in the Apply Fix Pack window after the fix pack installation fails in the Apply to non-management hosts stage
|
Fixpack | I_V1.0.0.2 |
The system console displays incorrect status in the Apply Fix Pack window after the fix pack installation fails in the Apply to non-management hosts stage
If the Apply to non-management hosts stage of the fix pack installation fails, the fix pack installation stops but the system console incorrectly displays the status of the stage as Waiting in the Apply Fix Pack window. The status of the stage is Failed to apply without impact to non-management hosts and is displayed correctly in the Fix Pack Detail panel. The fix pack installation cannot be resumed by clicking the Start current stage button.
|
Workaround:
-----------
1. Log in to the management host as root.
2. Review the fix pack installation log /BCU_share/aixapply/pflayer/pl_update.log and correct the problem identified in the log.
3. Click the Resume button in the system console to resume the fix pack installation.
Fixed:
-----------
This issue is fixed in V1.0.0.3 and higher and V1.1.0.1.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004080
Error resuming the fix pack installation from the status Failed to apply without impact to non-management hosts
|
Fixpack | I_V1.0.0.2 |
Error resuming the fix pack installation from the status Failed to apply without impact to non-management hosts
In the Apply to non-management hosts stage of the fix pack installation, when you click the Resume button in the system console to resume the installation from the status Failed to apply without impact to non-management hosts, the fix pack installation stops before the end of the Apply to non-management hosts stage. The status of the fix pack installation in the Fix Pack Details panel is Applied without impact to non-management hosts.
|
Workaround:
-----------
1. Log in to the management host as root.
2. Verify that the fix pack installation is in the Applied without impact to non-management hosts status:
3. Issue the following command:
In the output returned by the command, identify the process IDs for the miupdate process and the MIUpdateCLI process. In the following sample output, the miupdate process ID is 11599914 and the MIUpdateCLI process ID is 12976198.
root 11599914 1 0 12:25:11 - 0:00 /bin/sh /opt/IBM/mi/bin/miupdate -resume current -n
root 12976198 11599914 0 12:25:11 - 0:02 /usr/java6_64/bin/java -classpath .:/opt/IBM/mi/lib/*:/opt/IBM/mi/lib/log4j/* -Dlog4j.configuration=file:/opt/IBM/mi/configuration/log4j.properties -Djavax.net.ssl.trustStore=/opt/ibm/director/lwi/security/keystore/ibmjsse2.jks -DUSERDIR=/usr/IBM/applmgmt/isas.server com.ibm.isas.cli.command.application.MIUpdateCLI -resume current -n
root 21037306 35848408 0 12:39:58 pts/8 0:00 grep resume
4. For each process ID, issue the following command to terminate the hanging process:
kill -9 <pid>
where <pid> represents the process ID of the miupdate process or the process ID of the MIUpdateCLI process.
5. Issue the following command to resume the fix pack installation:
Fixed:
-----------
This issue is fixed in V1.0.0.3 and higher and V1.1.0.1.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI003892
The system console GUI is not accessible after the fix pack is installed
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
The system console GUI is not accessible after the fix pack is installed
After the fix pack installation completes, you might not be able to log in to the system console GUI because the profile files on the management host were not correctly updated.
|
Workaround:
-----------
During a fix pack installation, profiles are automatically backed up on the management host.
You can restore the profiles by completing the following steps:
1. Log in to the management host as the root user.
2. Identify the most recent backups of the /.profile, /etc/profile, and /etc/security/profile files. The file names include the date the backup was created.
3. Issue the following commands:
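The restore commands for step 3 are elided above; a hypothetical sketch, assuming the backups sit beside the originals with a date suffix (the suffix shown is illustrative, not from the source):
# restore each profile file from its most recent dated backup
cp /.profile.20140318 /.profile
cp /etc/profile.20140318 /etc/profile
cp /etc/security/profile.20140318 /etc/security/profile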
Fixed:
-----------
The only fix is to apply the workaround if encountered.
|
|
KI004754
Fix pack installation stops with user validation failure at preview stage
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
Fix pack installation stops with user validation failure at preview stage
If the bcuaix core warehouse instance owner password was previously changed by using a method other than the miauth command method, a user validation failure occurs when you attempt to run the preview stage of the fix pack installation procedure. This failure stops the preview stage from running.
|
Workaround:
-----------
Change the password for the bcuaix core warehouse instance owner by using the following supported miauth command method:
1. Log in to the management host as the root user and issue the following command:
2. Run the preview stage of the fix pack installation procedure.
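The miauth command for step 1 is elided above; a hypothetical invocation, based on the miauth syntax shown in the KI004434 workaround later in this table (the user and device type values are assumptions):
# change the bcuaix core warehouse instance owner password through the supported method
miauth -u bcuaix -p os -pw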
Fixed:
-----------
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004971
Resuming a failed preview stage results in a fix pack installation state of Waiting instead of Resuming
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
|
Resuming a failed preview stage results in a fix pack installation state of Waiting instead of Resuming
If the preview stage fails, the system console Fix Pack panel shows an error message. After fixing the environment and resuming the preview stage of the fix pack installation process, the status of the preview stage in the system console is shown as Waiting instead of Running.
|
Workaround:
-----------
If the preview stage of the installation process is in the Waiting state, start the preview stage again by clicking the Resume button in the system console.
Fixed:
-----------
V1.0.0.4
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004907
Fix pack installation fails due to an HMC connection issue during the preview stage
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
|
Fix pack installation fails due to an HMC connection issue during the preview stage
The fix pack installation fails during the preview stage because the HMC cannot connect to an endpoint. The following error message is displayed:
ENV_VALIDATION::PREVIEW::FAILED::CEC connectivity validation with HMC failed
|
Workaround:
-----------
Identify and replace the invalid IP addresses stored in the HMC.
1. Log in to the HMC as the hscroot user and run the following command:
2. Identify the invalid IP addresses. An invalid IP address is in a Connecting state and shows a connection error code. In the following example output, the IP address 172.17.255.1 is invalid.
resource_type=sys,type_model_serial_num=8231-E2C*101F9ER,sp=unavailable,sp_phys_loc=unavailable,ipaddr=172.17.255.1,alt_ipaddr=unavailable,state=Connecting,connection_error_code=Connecting 0000-0000-00000000
resource_type=sys,type_model_serial_num=8231-E2C*101F9ER,sp=primary,sp_phys_loc=U78AB.001.WZSGRHY-P1,ipaddr=172.17.254.254,alt_ipaddr=unavailable,state=Connected
resource_type=sys,type_model_serial_num=8231-E2C*101F9FR,sp=primary,sp_phys_loc=U78AB.001.WZSGRJN-P1,ipaddr=172.17.254.255,alt_ipaddr=unavailable,state=Connected
3. For each invalid IP address, run the following command as the hscroot user on the HMC to remove it:
where invalid_IP_address is the invalid IP address that you identified in step 2.
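The HMC commands for steps 1 and 3 are elided above; a plausible sketch using standard HMC CLI commands (the exact options are assumptions, inferred from the lssysconn-style sample output):
# step 1: list the connection state of all managed systems
lssysconn -r all
# step 3: remove the stale connection entry for an invalid IP address
rmsysconn -o remove --ip invalid_IP_address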
Fixed:
-----------
V1.0.0.5/V1.1.0.1: The HMC firmware in this validated stack addresses this issue.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI005059
HMC fails to restart during management stage firmware upgrade
|
Fixpack
|
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
|
HMC fails to restart during management stage firmware upgrade
During the management stage firmware upgrade of the HMC, the HMC fails to restart and results in a failure to upgrade the HMC firmware.
A symptom of the issue can be seen in the /BCU_share/aixappl/pflayer/log/pl_update.log file, which shows the following example log entries:
[22 May 2014 07:56:03,249] <28311760 UPDT PREP TRACE host01> For message id::631
[22 May 2014 07:56:03,251] <28311760 UPDT PREP ERROR host01> Error on nodes (172.23.1.245).
[22 May 2014 07:56:03,268] <28311760 UPDT PREP INFO host01> STEP_END::3::HMC_UPD::FAILED
When the HMC fails to restart, the HMC local console displays the following error message after the HMC is manually restarted:
Critical Error 1901
A critical error has prevented normal HMC startup. Please reboot the HMC and try again. If the problem persists, contact your support personnel.
1901: HMC Startup aborted due to a malfunction of a required module.
|
Workaround:
-----------
After encountering the HMC restart failure, complete the following steps to resume the firmware upgrade process:
1. Verify that the HMC is offline by running the ping command. If the HMC is confirmed to be offline, manually start the HMC.
2. After the HMC comes up, wait 10 minutes to be certain that all of the HMC services are started.
3. Resume the management stage update process by running the following command on the management host as root:
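The resume command for step 3 is elided above; a plausible invocation, based on the miupdate syntax shown in the KI002637 workaround earlier in this table (an assumption, not confirmed by the source):
# resume the management stage of the update
miupdate -resume management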
Fixed:
-----------
No permanent fix.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI005062
V7000 storage drive firmware upgrade fails
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
|
V7000 storage drive firmware upgrade fails
The V7000 storage drive firmware upgrade process fails because the utilitydriveupgrade command of the drive upgrade utility is not installed successfully, even though the installation package itself installs successfully. Investigate the pl_update.trace log to determine whether the utilitydriveupgrade command has failed.
Sample information from the PL log in /BCU_share/aixappl/pflayer/log/pl_update.trace:
[20 May 2014 03:06:54,797] <40566876 CTRL TRACE stgkf301> command: ssh admin@172.23.3.201 LANG=en_US svcservicetask applysoftware -file IBM_INSTALL_driveUpgrade_130610
[20 May 2014 03:06:54,798] <40566876 CTRL TRACE stgkf301> CMMVC6227I The package installed successfully.
[20 May 2014 03:06:54,798] <40566876 CTRL TRACE stgkf301> Rc = 1
[20 May 2014 03:06:54,799] <40566876 CTRL DEBUG stgkf301> Error String: CMMVC6227I The package installed successfully.
[20 May 2014 03:06:54,799] <40566876 CTRL DEBUG stgkf301> command succeeded - drive upgrade utility installed successfully
[20 May 2014 03:06:54,800] <40566876 CTRL DEBUG stgkf301> verifying installation of utilitydriveupgrade command
[20 May 2014 03:06:55,014] <40566876 CTRL TRACE stgkf301> command: ssh admin@172.23.3.201 utilitydriveupgrade
[20 May 2014 03:06:55,015] <40566876 CTRL TRACE stgkf301> rbash: utilitydriveupgrade: command not found
[20 May 2014 03:06:55,015] <40566876 CTRL TRACE stgkf301> Rc = 127
[20 May 2014 03:06:55,016] <40566876 CTRL DEBUG stgkf301> utilitydriveupgrade did not get installed properly
[20 May 2014 03:06:55,016] <40566876 CTRL TRACE stgkf301> workaround: driveupgradeutility is having some issues, applying workaround by installing softwareupgrade test utility and retrying
[20 May 2014 03:06:55,017] <40566876 CTRL DEBUG stgkf301> apply: retrying install of drive upgrade utility
[20 May 2014 03:06:55,038] <40566876 CTRL DEBUG stgkf301> apply: Uploading update file /BCU_share/bwr2/firmware/storage/2076/image/imports/testupdate/IBM2076_INSTALL_upgradetest_11.15
[20 May 2014 03:06:55,235] <40566876 CTRL TRACE stgkf301> command: LANG=en_US scp /BCU_share/bwr2/firmware/storage/2076/image/imports/testupdate/IBM2076_INSTALL_upgradetest_11.15 admin@172.23.3.201:/home/admin/upgrade
[20 May 2014 03:06:55,236] <40566876 CTRL TRACE stgkf301> Rc = 0
[20 May 2014 03:06:55,237] <40566876 CTRL DEBUG stgkf301> apply: uploaded test update file IBM2076_INSTALL_upgradetest_11.15 to storwize
[20 May 2014 03:06:55,638] <40566876 CTRL TRACE stgkf301> command: ssh admin@172.23.3.201 LANG=en_US svctask applysoftware -file IBM2076_INSTALL_upgradetest_11.15
[20 May 2014 03:06:55,638] <40566876 CTRL TRACE stgkf301> CMMVC6227I The package installed successfully.
[20 May 2014 03:06:55,639] <40566876 CTRL TRACE stgkf301> Rc = 1
[20 May 2014 03:06:55,639] <40566876 CTRL DEBUG stgkf301> Error String: CMMVC6227I The package installed successfully.
[20 May 2014 03:06:55,640] <40566876 CTRL DEBUG stgkf301> apply: upgrade test utility installed successfully
[20 May 2014 03:06:55,641] <40566876 CTRL DEBUG stgkf301> Successfully applied workaround, retrying utilitydriveupgrade installation
[20 May 2014 03:06:55,702] <40566876 CTRL TRACE stgkf301> { Entering Ctrl::Updates::Storwize::get_imagefile_name (Called from /opt/ibm/aixappl/pflayer/lib/Ctrl/Updates/Storwize.pm line 1104)
[20 May 2014 03:06:55,703] <40566876 CTRL TRACE stgkf301> Args:["/BCU_share/bwr2/firmware/storage/2076/image/imports","driveutility"]
[20 May 2014 03:06:55,755] <40566876 CTRL TRACE stgkf301> command: ls /BCU_share/bwr2/firmware/storage/2076/image/imports/driveutility
[20 May 2014 03:06:55,755] <40566876 CTRL TRACE stgkf301> IBM_INSTALL_driveUpgrade_130610
[20 May 2014 03:06:55,756] <40566876 CTRL TRACE stgkf301> Rc = 0
[20 May 2014 03:06:55,756] <40566876 CTRL DEBUG stgkf301> prepare: type driveutility image file: IBM_INSTALL_driveUpgrade_130610
[20 May 2014 03:06:55,757] <40566876 CTRL TRACE stgkf301> Return: 0
[20 May 2014 03:06:55,757] <40566876 CTRL TRACE stgkf301> Exiting Ctrl::Updates::Storwize::get_imagefile_name }
[20 May 2014 03:06:55,758] <40566876 CTRL DEBUG stgkf301> prepare: 172.23.3.201: installing drive upgrade utility
[20 May 2014 03:06:56,060] <40566876 CTRL TRACE stgkf301> command: ssh admin@172.23.3.201 LANG=en_US svcservicetask applysoftware -file IBM_INSTALL_driveUpgrade_130610
[20 May 2014 03:06:56,061] <40566876 CTRL TRACE stgkf301> CMMVC5993E The specified upgrade package does not exist.
[20 May 2014 03:06:56,061] <40566876 CTRL TRACE stgkf301> Rc = 1
[20 May 2014 03:06:56,062] <40566876 CTRL DEBUG stgkf301> Error String: CMMVC5993E The specified upgrade package does not exist.
[20 May 2014 03:06:56,063] <40566876 CTRL DEBUG stgkf301> drive upgrade utility failed to install:172.23.3.201
[20 May 2014 03:06:56,063] <40566876 CTRL DEBUG stgkf301> utilitydriveupgrade installation failed even after re-installing software test utility
[20 May 2014 03:06:56,064] <40566876 CTRL DEBUG stgkf301> Unable to fix problems in utilitydriveupgrade, drive update failed
|
Workaround:
-----------
Resume the firmware upgrade procedure by running the following command from the management host as root:
Note: If the problem persists, contact IBM Support with the following logs at hand:
/BCU_share/aixappl/pflayer/log -> All of the files within this directory
/log/pfmgt.trace
ssh admin@<storwize_ip> svc_snap
Fixed:
-----------
No permanent fix.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI005064
V7000 storage drive goes offline after firmware upgrade succeeds
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
|
V7000 storage drive goes offline after firmware upgrade succeeds
The V7000 storage drive might go offline after a successful firmware upgrade. Investigate the pl_update.trace log to determine if a drive has gone offline.
Sample information from the PL log in /BCU_share/aixappl/pflayer/log/pl_update.trace:
[14 May 2014 07:34:36,828] <30867468 CTRL TRACE stgkf201> command: ssh admin@172.23.2.206 LANG=en_US utilitydriveupgrade -drivemodel ALL -filename IBM2076_DRIVE_20130314
[14 May 2014 07:34:36,829] <30867468 CTRL TRACE stgkf201> Upgrading drive id 47, ( 1 / 48 )
[14 May 2014 07:34:36,829] <30867468 CTRL TRACE stgkf201> Upgrading drive id 23, ( 2 / 48 )
[14 May 2014 07:34:36,829] <30867468 CTRL TRACE stgkf201> Upgrading drive id 35, ( 3 / 48 )
[14 May 2014 07:34:36,830] <30867468 CTRL TRACE stgkf201> Upgrading drive id 11, ( 4 / 48 )
[14 May 2014 07:34:36,830] <30867468 CTRL TRACE stgkf201> Upgrading drive id 4, ( 5 / 48 )
[14 May 2014 07:34:36,830] <30867468 CTRL TRACE stgkf201> ERROR: Drive 4 is no longer online after being upgraded.
[14 May 2014 07:34:36,831] <30867468 CTRL TRACE stgkf201> The current drive status is offline
[14 May 2014 07:34:36,831] <30867468 CTRL TRACE stgkf201> Rc = 1
[14 May 2014 07:34:36,832] <30867468 CTRL DEBUG stgkf201> Extracted msg from NLS: apply: 172.23.2.206 ssh admin@172.23.2.206 LANG=en_US utilitydriveupgrade -drivemodel ALL -filename IBM2076_DRIVE_20130314 command failed.
|
Workaround:
-----------
1. Bring the offline drives back online by running the following command on the management host as root:
2. Fix the error sequence number by running the following command on the management host as root:
3. Check the current status of the drive by running the following command on the management host as root:
4. Resume the firmware upgrade procedure by running the following command from the management host as root:
Note: If the problem persists, contact IBM Support with the following logs at hand:
/BCU_share/aixappl/pflayer/log -> All of the files within this directory
/log/pfmgt.trace
ssh admin@<storwize_ip> svc_snap
Fixed:
-----------
No permanent fix.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004969
Network switch port configuration causes loss of connection to corporate network during reboot after firmware upgrade
|
Fixpack | I_V1.0.0.3 |
Network switch port configuration causes loss of connection to corporate network during reboot after firmware upgrade
The 1 Gbps network switches might lose their connections to the corporate network because ports are configured with bpdu-guard, usually on ports 43-48. After the firmware upgrade and reboot as part of the Fix Pack 1.0.0.3 installation, the network switch ports might go into errdisable mode and be disabled if bridge protocol data units (BPDU) frames are detected.
|
Workaround:
-----------
The following steps should only be performed on V1.0 Appliances and not on V1.1 Appliances.
To prevent this issue from occurring, disable bpdu-guard on the uplink ports of both of the 1 Gbps network switches.
1. Log in to the 1 Gbps network switches as the admin user.
2. Verify that the 1 Gbps network switches have bpdu-guard enabled on their ports.
Fixed:
-----------
V1.0.0.4
V1.1.0.0 Fixed as part of appliance deployment and changes to networking setup.
|
|
KI005272
Paging devices are missing on all hosts except management host or paging space is not set to auto on management host
|
General |
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
|
Paging devices are missing on all hosts except management host or paging space is not set to auto on management host
These issues might occur as a result of the fix pack installation. The paging00 device is missing when a check is done by running the lsps -a command on a host. The result of this missing paging device is that the paging space is cut in half from the standard 128 GB.
An indication that your system is experiencing the missing paging device issue is shown in the output that results from running the following command on any host as root:
$ dsh -n $ALL "lsps -a" | dshbak -c
In the following example output, the paging00 device is missing on all of the hosts except the management host (host01):
HOSTS -------------------------------------------------------------------------
host01
-------------------------------------------------------------------------------
Page Space Physical Volume Volume Group Size %Used Active Auto Type Chksum
paging00 hdisk0 rootvg 65536MB 3 yes yes lv 0
hd6 hdisk0 rootvg 65536MB 3 yes yes lv 0
HOSTS -------------------------------------------------------------------------
host02, host03, host04, host05, host06
-------------------------------------------------------------------------------
Page Space Physical Volume Volume Group Size %Used Active Auto Type Chksum
hd6 hdisk0 rootvg 64512MB 1 yes yes lv 0
A related issue is that the paging space on the management host might not be set to auto.
|
Workaround:
-----------
To restore the missing paging device on a host, run the following commands as root on each host that is missing the paging00 device:
To set the paging space on the management host paging00 device to auto, run the following command on the management host as root:
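The commands for this workaround are elided above; a hypothetical sketch using standard AIX paging-space commands (the logical partition count is illustrative and depends on the volume group's partition size):
# re-create the missing 64 GB paging00 device in rootvg, activating it now and at every restart
mkps -a -n -s 256 rootvg
# set the existing paging00 device on the management host to activate automatically (auto)
chps -a y paging00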
Fixed:
-----------
The only fix is through the workaround.
|
|
KI005121
Ethernet network switch firmware upgrade fails during apply stage of fix pack installation
|
Fixpack | I_V1.0.0.3 |
Ethernet network switch firmware upgrade fails during apply stage of fix pack installation
During the apply stage of the fix pack installation process to upgrade the firmware of the Ethernet network switches, a comparison with the expected firmware image version might fail. If this happens, the firmware for the switch images has not been successfully uploaded.
You can confirm that this issue has occurred by checking the /var/aixappl/pflayer/log/pl_update.trace log. Run the following command to examine the log file:
tail -f /var/aixappl/pflayer/log/pl_update.trace
Look for the corresponding messages in the following example log output:
[05 Jun 2014 19:48:16,849] <26542124 UPDT APPI DEBUG host01mgmt> Command Status:BNT:net3:172.23.1.251:1:Compare of firmware failed for switch after copy.
[05 Jun 2014 19:48:16,850] <26542124 UPDT APPI DEBUG host01mgmt> BNT:net1:172.23.1.253:0:Compare of firmware success for switch after copy.
[05 Jun 2014 19:48:16,850] <26542124 UPDT APPI DEBUG host01mgmt> BNT:net0:172.23.1.254:0:Compare of firmware success for switch after copy.
[05 Jun 2014 19:48:16,850] <26542124 UPDT APPI DEBUG host01mgmt> BNT:net2:172.23.1.252:0:Compare of firmware success for switch after copy.
[05 Jun 2014 19:48:16,850] <26542124 UPDT APPI DEBUG host01mgmt> , Command Status->1
[05 Jun 2014 19:48:16,856] <26542124 UPDT ERROR host01mgmt> TASK_END::13::1 of 1::NetFWUPD::172.23.1.251::::RC=1::The NET FW update failed on the node 172.23.1.251.
[05 Jun 2014 19:48:16,858] <26542124 UPDT INFO host01mgmt> TASK_END::13::1 of 1::NetFWUPD::172.23.1.253::::RC=0::The NET FW update is successful on the node 172.23.1.253.
[05 Jun 2014 19:48:16,860] <26542124 UPDT INFO host01mgmt> TASK_END::13::1 of 1::NetFWUPD::172.23.1.254::::RC=0::The NET FW update is successful on the node 172.23.1.254.
[05 Jun 2014 19:48:16,862] <26542124 UPDT INFO host01mgmt> TASK_END::13::1 of 1::NetFWUPD::172.23.1.252::::RC=0::The NET FW update is successful on the node 172.23.1.252.
[05 Jun 2014 19:48:17,011] <26542124 UPDT APPI DEBUG host01mgmt> Executing query Logical_name=Infrastructure AND Solution_version=3.0.3.1, to update status of Product
[05 Jun 2014 19:48:17,108] <26542124 UPDT APPI DEBUG host01mgmt> Executing query Sub_module_type=Infrastructure AND Solution_version=3.0.3.1, to update status of sub module
[05 Jun 2014 19:48:17,537] <26542124 UPDT APPI DEBUG host01mgmt> Executing query Logical_name=Infrastructure AND Solution_version=3.0.3.1, to update status of Product
[05 Jun 2014 19:48:17,602] <26542124 UPDT APPI DEBUG host01mgmt> Executing query Sub_module_type=Infrastructure AND Solution_version=3.0.3.1, to update status of sub module
[05 Jun 2014 19:48:17,931] <26542124 UPDT APPI DEBUG host01mgmt> Executing query Logical_name=Infrastructure AND Solution_version=3.0.3.1, to update status of Product
[05 Jun 2014 19:48:17,998] <26542124 UPDT APPI DEBUG host01mgmt> Executing query Sub_module_type=Infrastructure AND Solution_version=3.0.3.1, to update status of sub module
[05 Jun 2014 19:48:18,385] <26542124 UPDT APPI DEBUG host01mgmt> Executing query Logical_name=Infrastructure AND Solution_version=3.0.3.1, to update status of Product
[05 Jun 2014 19:48:18,454] <26542124 UPDT APPI DEBUG host01mgmt> Executing query Sub_module_type=Infrastructure AND Solution_version=3.0.3.1, to update status of sub module
[05 Jun 2014 19:48:18,600] <26542124 UPDT APPI ERROR host01mgmt> Error on nodes (172.23.1.251).
[05 Jun 2014 19:48:18,632] <26542124 UPDT APPI INFO host01mgmt> STEP_END::13::NetFW_UPD::FAILED
[05 Jun 2014 19:48:18,642] <26542124 UPDT APPI DEBUG host01mgmt> Error occured in apply for product netfw1
[05 Jun 2014 19:48:18,719] <26542124 UPDT APPI DEBUG host01mgmt> Executing query Logical_name=netfw1 AND Solution_version=3.0.3.1, to update status of Product
[05 Jun 2014 19:48:18,864] <26542124 UPDT APPI ERROR host01mgmt> Apply (impact) phase for solution has failed.
[05 Jun 2014 19:48:18,935] <26542124 UPDT APPI DEBUG host01mgmt> Executing query Logical_name=bwr3 AND Solution_version=3.0.3.1, to update status of Solution
[05 Jun 2014 19:48:19,088] <26542124 UPDT APPI INFO host01mgmt> PHASE_END APPLY IMPACT
[05 Jun 2014 19:48:19,090] <26542124 UPDT APPI ERROR host01mgmt> The apply phase for the release 'bwr3' failed.
|
Workaround:
-----------
Resume the firmware upgrade procedure by running the following command on the management host as root:
Fixed:
-----------
V1.0.0.4
V1.1.0.1
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004393
Tuning database queries returns an error message that the database is not configured for tuning
|
General |
I_V1.0
I_V1.1
|
Tuning database queries returns an error message that the database is not configured for tuning
After you have tuned some queries successfully, you might see an error message that states the database is not configured for tuning when you attempt to tune additional queries.
This problem can occur when you tune DDL statements using the web console, for example by selecting "Tune All with This Web Console" from the Execution Summary tab in the SQL dashboard. When DDL statements are tuned, they are also executed under the user ID that deployed the database, and some of the tables that are created under the schema of that user ID can cause problems with query tuning.
|
Workaround:
-----------
If the error occurs, connect to the database and look for tables that use the schema of the user ID that was used to deploy the database.
For those objects, delete the following tables, which can cause the error:
ADVISE_*
EXPLAIN_*
OBJECT_METRICS
You can also remove the following tables, which are not needed:
QT_*
OPT_PROFILE
To prevent the error from happening again, ensure that you do not tune DDL statements. You can use dashboard filters in the database performance monitor web console to remove DDL statements from the grid so that you do not tune them as part of a workload.
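As a sketch of the cleanup, assuming a DB2 command line environment and a deployment schema named BCUAIX (both assumptions; substitute your database name and schema):
# list the candidate tables under the deployment user's schema
db2 connect to <database>
db2 "SELECT tabname FROM syscat.tables WHERE tabschema = 'BCUAIX' AND (tabname LIKE 'ADVISE_%' OR tabname LIKE 'EXPLAIN_%' OR tabname LIKE 'QT_%' OR tabname IN ('OBJECT_METRICS','OPT_PROFILE'))"
# drop each reported table, for example:
db2 "DROP TABLE BCUAIX.EXPLAIN_STATEMENT"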
Fixed:
-----------
Only fix is the workaround when encountered.
|
|
KI004721
The dirinst1 IBM Systems Director instance owner user password cannot include certain characters
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.1.0.0
|
The dirinst1 IBM Systems Director instance owner user password cannot include certain characters
When changing the password for the dirinst1 IBM Systems Director instance owner user, certain characters contained within the password cause the password change procedure to fail.
|
Workaround:
-----------
Do not include any of the following characters in the password for the dirinst1 IBM Systems Director instance owner user:
! % ^ & ( ) | " ' ? , < > * $ @ = + [ ] \ / ; : . { } ` ~ # - ´ ¨
In addition, do not begin the password for the dirinst1 IBM Systems Director instance owner user with any of the following characters:
_ 0 1 2 3 4 5 6 7 8 9
Fixed:
-----------
V1.0.0.5/V1.1.0.1 has fundamental changes that impact this issue.
|
|
KI005291
Logging out from storage web console results in error message
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.0
I_V1.1.0.1
|
Logging out from storage web console results in error message
This issue occurs when you access the storage web console from the system console by clicking System > Service Level Access, selecting one of the links in the IBM Storwize V7000 section, and logging out by clicking the Log out link on the storage web console. When you log out of the storage web console, a 404 File not found error is displayed.
|
Workaround:
-----------
When this issue occurs, click the Back button of your browser to return to the IBM Storwize V7000 login page.
Fixed:
-----------
Apply the workaround if encountered.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004678
All selected events are not deleted from Events page of system console
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.0
I_V1.1.0.1
|
All selected events are not deleted from Events page of system console
When you select a number of events to delete from the Events page of the system console by using the Delete selected events button (the red cross located in the upper-right corner of the events table, above the Actions column), the Delete the number selected events that are currently showing option deletes only 1 or 2 of the selected events, not all of them.
|
Workaround:
-----------
If you want to delete a small number of events, you can use the Delete button located in the Action column for each individual event.
If you want to delete a large number of events, you can filter events by choosing event text and attribute settings to list events and then delete the list of filtered events. It is important to note that this procedure deletes all of the events that match the current filter, even if the listed events are not individually selected for deletion.
To delete a large number of filtered events from the Events page of the system console, complete the following steps:
1. Navigate to the Events page by clicking System > Events, found either in the System menu or by expanding the Welcome page Working with the system section and clicking the link in Review system events.
2. Set the event filters located across the top of the events table to specifically list the events that you want to delete. Refresh the events list to see the list of filtered events.
3. To make the Delete selected events button visible above the Action column of the events table, select at least one of the filtered events.
4. Click the Delete selected events button.
5. To delete all of the events in the filtered list, select the Delete all events that match the current filter option. Click OK.
Fixed:
-----------
No fix; apply the workaround if encountered.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004952
Status of powered-off Ethernet network switches is not correctly shown in system console
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.0
I_V1.1.0.1
|
Status of powered-off Ethernet network switches is not correctly shown in system console
On the system console Hardware > Network Devices page, it is not possible to monitor the actual status of the Ethernet network switches because they are always shown as Available and Powered On.
|
Workaround:
-----------
To reliably verify the availability of Ethernet network switches by using the system console, do the following steps:
1. Navigate to the Events page by clicking System > Events, found either in the System menu or by expanding the Welcome page Working with the system section and clicking the link in Review system events.
2. Search for the keyword 'switch' on the Events page and filter the events with severity Critical or Informational. The following example output, showing only three relevant columns, is the result of a filtered search:
Event text Type Severity
Eth_network_switch_name is offline. Network switch Critical
Eth_network_switch_name is online. Network switch Informational
Fixed:
-----------
No fix; apply the workaround if encountered.
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004434
An error message about password synchronization is displayed when using the miauth -pw command
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
|
An error message about password synchronization is displayed when using the miauth -pw command
When attempting to use the miauth -pw command to reset a password for a node or hardware device, the following output is displayed in the system console:
Changing the password for user 'username' on resource 'device'.
The password change on the 'device' resource failed. The script failed with the following message:
-------------
spawn /usr/bin/passwd username
Changing password for "username"
username's Old password:
3004-604 Your entry does not match the old password.
where username is the name of the user whose password you are attempting to change, and device is the name of the node or hardware device.
The error message is displayed because a command other than miauth was previously used to change the node or hardware device password. Using a method other than the miauth command method results in password synchronization issues.
|
Workaround:
-----------
1. Run the following command:
miauth -u user_name -p device_type -pw -oldpw
where user_name is the user name of the node or device, and device_type is the type of node or hardware device. The device_type options are os, hmc, net, san, and storage.
2. When you are prompted for a new password, enter the password that you want to use for the node or hardware device you have specified.
3. When you are prompted to enter the old password, enter the password that was originally provided when it was changed with a command other than miauth.
After the password change is complete, a message is displayed indicating that the operation was successful.
Fixed:
-----------
V1.0.0.3
V1.0.0.6/V1.1.0.2 has fundamental changes that impact this issue.
|
|
KI004752
Warehouse tools administration console hangs after several HA failures
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
|
Warehouse tools administration console hangs after several HA failures
After several HA failures, IBM Tivoli System Automation for Multiplatforms (Tivoli SA MP) might not be able to restart the warehouse tools administration console because the warehouse tools application server (WASAPP) is stuck in an Online state.
|
Workaround:
-----------
Kill the warehouse tools application server profile process by completing the following steps:
1. Log in to the management node as the root user.
2. Delete the warehouse tools application server profile pid file by running the following command:
Deletion of the server pid file causes an inconsistency that eventually kills the pid process. The killed pid process removes the Stuck Online status on the application server, and the monitor returns an Offline status. Tivoli SA MP can then restart the warehouse tools administration console.
Fixed:
-----------
V1.0.0.4+, V1.1.0.1: Issue is fixed.
V1.0.0.6/V1.1.0.2: Warehouse Tools is removed from the appliance.
|
|
KI005210
Database partitions moved during manual fail over that times out can result in corrupted db2nodes.cfg file when resources are restarted on source node
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.1.0.0
|
Database partitions moved during manual fail over that times out can result in corrupted db2nodes.cfg file when resources are restarted on source node
This issue can occur when doing a manual fail over or during a fail over when the source is still able to run resources. If IBM Tivoli System Automation for Multiplatforms is not able to start all of the database partitions on the target node during a fail over, it marks the target as Failed offline and moves the resources back onto the source node. However, if two or more database partitions were successfully started during the fail over to the target node, but the start of the other database partitions ultimately timed out, then the db2nodes.cfg file can become corrupted during the move back to the source node.
The corruption of the db2nodes.cfg file is due to the fact that the database partitions are started serially on the target node during a fail over, but after an unsuccessful fail over of all of the database partitions, the database partitions are started in parallel when moved back to the source node, while DB2 expects serial starts. The symptoms of this issue are an unsuccessful fail over after running the hafailover command and a db2nodes.cfg file that is missing one or more database partition entries.
|
Workaround:
-----------
1. To restore the cluster, stop any DB2 resources by running the following command as root on the administration host:
2. Manually add the missing database partition entries to the db2nodes.cfg file.
3. Restart the DB2 resources by running the following command as root on the administration host:
Fixed:
-----------
Apply HA Tools 2.0.0.4 or higher. Updates to HA Tools reduce the chance of db2nodes.cfg corruption.
V1.0.0.5/V1.1.0.1 includes HA Tools 2.0.5.0.
|
|
KI004994
Running hafailover command results in message that resources failed to start
|
General |
I_V1.0
I_V1.1
|
Running hafailover command results in message that resources failed to start
On occasion, running the hafailover command can result in the display of the following message:
Failed to start all resources. Check state with hals
This message is erroneous because sometimes the failover process takes longer to complete than the configured timeout period. Not all of the resources have started because the failover process is still underway.
|
Workaround:
-----------
Periodically check the status of the failover by running the following command on the management host as root:
hals
If the resources are in a Pending online state, wait a few minutes before repeating the previous command. If all of the resources are Online and accessible after a reasonable period of time, no further action is required. If the resources are not all Online and accessible after a reasonable period of time, contact IBM Support for assistance.
Fixed:
-----------
No fix. See the workaround above.
|
|
KI004767
Upgrade of database performance monitor to version 5.3 results in warning that performance monitoring is not fully enabled
|
Fixpack | I_V1.0.0.3 |
Upgrade of database performance monitor to version 5.3 results in warning that performance monitoring is not fully enabled
Some, rather than all, of the tables with an IBMPDQ schema are dropped from the opmdb performance monitor database during the database performance monitor upgrade process from version 5.2 to version 5.3. As a result, the following warning message is displayed in the database performance monitor user interface:
Performance monitoring is not fully enabled
|
Workaround:
-----------
Drop the remaining tables with an IBMPDQ schema in the opmdb performance monitor database by completing the following steps:
1. Bring the database performance monitor user interface offline by running the following command on the management host as the root user:
2. Before proceeding to the next step, verify that the DPM COMPONENT has gone from a Pending to an Offline OPSTATE by running the following command:
3. Connect to the opmdb performance monitor database by running the following command on the management host as the db2opm user:
4. Drop the IBMPDQ schema and all of the objects contained within it by running the following command:
5. Stop and restart the database performance monitor by running the following commands on the management host as the root user:
After the IBMPDQ schema is dropped, stopping and restarting the database performance monitor in step 5 results in a re-creation of the IBMPDQ schema and all of its objects, and the monitored database configuration is synchronized with the repository server.
Note: A result of stopping the database performance monitor in step 5 is a short period of uncollected database performance monitoring data.
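The commands for steps 3 and 4 are elided above; a hypothetical sketch using standard DB2 commands (the error-logging table names passed to ADMIN_DROP_SCHEMA are illustrative):
# step 3: connect to the performance monitor database as the db2opm user
db2 connect to opmdb
# step 4: drop the IBMPDQ schema together with all of the objects it contains
db2 "CALL SYSPROC.ADMIN_DROP_SCHEMA('IBMPDQ', NULL, 'ERRSCHEMA', 'ERRTABLE')"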
Fixed:
-----------
Only the workaround is available.
|
|
KI005415
Node power up might result in inoperative syslogd process
|
General |
I_V1.0
I_V1.1
|
Node power up might result in inoperative syslogd process
Powering up a node or nodes after a power shut down might result in the system log daemon (syslogd) not automatically starting on the host or hosts of the system. An inoperative syslogd results in the system log not being updated and error messages shown after running some commands.
To verify that syslogd is inoperative on any of the hosts, run the following command on any host as root:
dsh -n ${ALL} 'lssrc -s syslogd | grep inoperative' | dshbak -c
The following example output is a result of running the previous command and shows on which hosts the syslogd is inoperative:
HOSTS -------------------------------------------------------------------------
host02, host03, host04, host05, host06, host08
-------------------------------------------------------------------------------
syslogd ras inoperative 3:23:38 PM
To verify that an active syslogd is not logging events in the /var/log/syslog.out log file on each of the hosts, check for a zero file size. Run the following command on any host as root:
dsh -n ${ALL} 'v=/var/log/syslog.out;([ -f ${v} ] && wc -l ${v}) || (touch ${v} && wc -l ${v})'
The following example output is a result of running the previous command and shows that syslogd is not logging events on host04 because the system log does not contain any entries, as indicated by its zero file size:
host01: 28735 /var/log/syslog.out
host03: 18750 /var/log/syslog.out
host08: 1059 /var/log/syslog.out
host06: 16870 /var/log/syslog.out
host02: 3971 /var/log/syslog.out
host05: 24312 /var/log/syslog.out
host07: 18656 /var/log/syslog.out
host04: 0 /var/log/syslog.out
|
Workaround:
-----------
If the system log daemon is inoperative or not logging events on any of the hosts, complete the following steps:
1. Verify that the /var/log/syslog.out log file exists, and if the log file does not exist, create it. Both of these tasks can be completed by running the following command on any host as root:
2. Start the syslogd process by running the following command on any host as root:
3. Verify that an active system log daemon is logging events in the /var/log/syslog.out log file on each of the hosts by checking for a non-zero file size. Run the following command on any host as root:
4. If zero file size /var/log/syslog.out log files or inoperative system log daemons persist after completing the previous steps, contact IBM Support.
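The individual command listings for steps 1 through 3 are not preserved in this extract. A minimal sketch, reusing the dsh fan-out from the symptom description together with the standard AIX SRC commands:
dsh -n ${ALL} 'v=/var/log/syslog.out;([ -f ${v} ] && wc -l ${v}) || (touch ${v} && wc -l ${v})'   # step 1: create the log file where it is missing
dsh -n ${ALL} 'lssrc -s syslogd | grep -q inoperative && startsrc -s syslogd'                     # step 2: start syslogd where it is inoperative
dsh -n ${ALL} 'wc -l /var/log/syslog.out' | dshbak -c                                             # step 3: confirm non-zero file sizes on every host
Fixed: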
-----------
None. See workaround.
|
|
KI006086
Hardware management console (HMC) updates will disable call-home functionality if the pre-fix-pack-3 HMC firmware level is older than V7.7.3
|
Fixpack | I_V1.0.0.3 |
Hardware management console (HMC) updates will disable call-home functionality if the pre-fix-pack-3 HMC firmware level is older than V7.7.3
After installing PDOA fix pack 3, you might find that the call-home settings on the primary and secondary HMC have been reset. This occurs if the HMC was at any firmware level older than V7.7.3 before PDOA fix pack 3 was installed.
|
Workaround:
-----------
After completing the fix pack installation, check that the HMC call-home settings have not been reset by using the following steps:
Note: If any of these checks fail, contact IBM Support.
Fixed:
-----------
See workaround.
|
|
KI004960
AIX reboot operation fails during fix pack installation process despite successful server reboot
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
|
AIX reboot operation fails during fix pack installation process despite successful server reboot
Older versions of CAS have an unstable Communication State with IBM Systems Director. This unstable Communication State results in the appl_ls_hw command sometimes showing the server with a Communication State that is Off even though the server is reachable at times.
During the fix pack installation when the server reboots after the AIX update, there is a chance that the Communication State of the server is Off, which causes the AIX reboot operation to fail. When a server fails to reboot the AIX OS, an error message is displayed in the system console. You can also check the pl_update.log file for a message similar to the following example message:
[18 Mar 2014 12:06:27,464] <29819052 UPDT APPI DEBUG host01mgmt> AIX update Reboot Error on node server5
[18 Mar 2014 12:06:27,514] <29819052 UPDT INFO host01mgmt> TASK_END::1::3 of 3::AIXUPD_REBOOT::172.23.1.17::::RC=1::The AIX reboot operation failed on 172.23.1.17.
|
Workaround:
-----------
1. Log in to the host and verify that the server was actually restarted by running the following command:
2. Identify the logical name of the servers where the AIX reboot operation failed by running the following command from the management node as root:
3. To bring online the server that failed to reboot the AIX OS, run the following command from the management node as root:
4. After the server shows an Online status, the fix pack installation process can be resumed by running the following command:
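The command listings for these steps are not preserved in this extract. A minimal sketch; the appl_start invocation in step 3 is an assumption, and the resume option mirrors the miupdate usage documented in other entries:
uptime                                   # step 1: a recent boot time confirms the server actually restarted
appl_ls_hw                               # step 2: find the logical names of servers whose status is Off
appl_start -l <logical_name_of_server>   # step 3: hypothetical form; bring the failed server online
miupdate -resume apply                   # step 4: resume the fix pack installation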
Fixed:
-----------
V1.0.0.5/V1.1.0.1 have fundamental changes that impact this issue.
|
|
KI004991
System console shows inconsistent status of management or apply stage of fix pack installation
|
Fixpack | I_V1.0.0.3 |
System console shows inconsistent status of management or apply stage of fix pack installation
After resuming from a failed stage, the fix pack installation hangs. In the Fix Pack panel of the system console, the level 3.0.3.1 status indicates that the current stage was completed by displaying the status as Applied to management hosts or Applied to non-management hosts. However, clicking the View Progress button in the Fix Pack Details panel of the system console shows that the management or apply stage is in the Running state.
|
Workaround:
-----------
1. Log in as the root user on the management host and run the following command to determine if any miupdate -resume or MIUpdateCLI -resume processes are still running:
2. Terminate any running miupdate -resume or MIUpdateCLI -resume processes by running the following command:
3. Click the View Progress button in the Fix Pack Details panel of the system console to check if the status of the stage has changed.
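The command listings for steps 1 and 2 are not preserved in this extract. A minimal sketch using standard AIX process tools:
ps -ef | grep -i miupdate | grep -v grep      # step 1: list lingering miupdate resume processes
ps -ef | grep -i MIUpdateCLI | grep -v grep   # step 1: list lingering MIUpdateCLI resume processes
kill <pid>                                    # step 2: terminate each listed process by its PID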
Fixed:
----------
|
|
KI005115
Fix pack installation cannot be resumed in the apply stage after a problem with the SSD drawer
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
|
Fix pack installation cannot be resumed in the apply stage after a problem with the SSD drawer
The fix pack installation fails in the apply stage due to a problem with the SSD drawer. You will see error messages in the /BCU_share/aixappl/pflayer/log/pl_update.log file that are similar to the following output:
After you address the problem with the SSD drawer, the device name of the SSD drawer might change when you reboot the node. If the device name is changed, you will not be able to resume the fix pack installation. |
Workaround:
-----------
Address the problem with the SSD drawer and, if necessary, change the device name of the SSD drawer back to the original name before you resume the fix pack installation.
1. Shut down the node connected to the SSD drawer:
2. Power off the BlueHawk SSD adapter and then power it back on.
3. Start the node. You can do this manually or through the HMC.
4. Run the configuration manager:
5. Verify if the name of the SSD drawer is reset to the original name. The original name is contained in the error message.
6. If the name of the SSD drawer was not reset to the original name, reset it.
7. Resume the fix pack installation:
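The command listings for these steps are not preserved in this extract. A minimal sketch, assuming standard AIX commands and the miupdate resume usage shown in other entries:
shutdown -F              # step 1: shut down the node connected to the SSD drawer
cfgmgr                   # step 4: rescan devices after the node is back up
lsdev | grep -i sas      # step 5: illustrative check of the SSD drawer device name against the error message
miupdate -resume apply   # step 7: resume the fix pack installation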
Fixed:
----------
See Workaround.
|
|
KI004080
Error resuming the fix pack installation after the M_Applied_Nonimpact substage
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
Error resuming the fix pack installation after the M_Applied_Nonimpact substage
The fix pack installation fails in the Apply to management hosts stage. After you address the issue and click Resume current phase in the system console, the system console does not respond and the fix pack installation does not continue.
|
Workaround:
-----------
1. Log in to the management host as the root user.
2. Verify that the fix pack installation is in the M_Applied_Nonimpact substage:
3. Resume the fix pack installation:
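The command listings for steps 2 and 3 are not preserved in this extract. A minimal sketch, assuming the appl_ls_cat and miupdate usage shown in other entries:
appl_ls_cat                   # step 2: the STATUS column shows the current substage
miupdate -resume management   # step 3: resume the stage from the command line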
Fixed:
----------
See Workaround.
|
|
KI004080
Error resuming the fix pack installation after the M_Failed_Apply_Nonimpact substage
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
Error resuming the fix pack installation after the M_Failed_Apply_Nonimpact substage
The fix pack installation fails in the Apply to management hosts stage. After you address the issue and click Resume current phase in the system console, the system console does not respond and the fix pack installation does not continue.
|
Workaround:
-----------
1. Log in to the management host as the root user.
2. Verify that the fix pack installation is in the M_Failed_Apply_Nonimpact substage:
3. Resume the fix pack installation:
Fixed:
----------
See Workaround.
|
|
KI004080
Fix pack installation stops at M_Prepared state after completion of management prepare substage
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
Fix pack installation stops at M_Prepared state after completion of management prepare substage
If a failure occurs during the early management substage that validates the environments of the management and standby management hosts, the status of the fix pack installation hangs at the Prepared management hosts state, as shown on the left side of the Fix Packs panel in the system console; you can also use the appl_ls_cat command to view states. You can verify that miupdate processes are not running by issuing the ps -ef | grep -i miupdate command. The Previewed state, which precedes the Prepared management hosts state, signals the successful completion of the Preview stage of the fix pack installation procedure.
After the management environments are fixed and the management stage of the fix pack installation procedure is resumed, the installation stops in the M_Prepared state, which means the fix pack installation procedure successfully completed the management prepare substage, but does not continue further. Attempts to normally restart the fix pack installation process do not work. |
Workaround:
-----------
To restart the fix pack installation procedure at the management stage, run the following commands in sequence from the management host as root:
miupdate -u management
miupdate -u prepare
miupdate -resume current
miupdate -u management
Fixed:
----------
See Workaround.
|
|
KI005346
Fix pack installation fails in the apply stage with GPFS start error
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
|
Fix pack installation fails in the apply stage with GPFS start error
Fix pack installation fails with a GPFS error after updating the V7000 storage during the apply stage. GPFS on the management host displays arbitrating state during multiple attempts to start and finally ends with a start error as shown in the following portion of example entries in the pl_update.log:
[11 Jun 2014 17:37:47,787] <9699536 GPFS DEBUG plsapd01> Executing command on 172.23.1.1-> ssh root@172.23.1.1 /usr/lpp/mmfs/bin/mmgetstate -Y
[11 Jun 2014 17:37:49,274] <9699536 GPFS DEBUG plsapd01> Return code-> 0
[11 Jun 2014 17:37:49,275] <9699536 GPFS DEBUG plsapd01> Output-> , mmgetstate::HEADER:version:reserved:reserved:nodeName:nodeNumber:state:quorum:nodesUp:totalNodes:remarks:cnfsState
[11 Jun 2014 17:37:49,275] <9699536 GPFS DEBUG plsapd01> mmgetstate::0:1:::plsapd01:1:arbitrating:1*:0:8::(undefined):
[11 Jun 2014 17:37:49,276] <9699536 GPFS DEBUG plsapd01> GPFS state on node 172.23.1.1 : arbitrating
[11 Jun 2014 17:37:49,277] <9699536 GPFS DEBUG plsapd01> GPFS is not yet started completly on 172.23.1.1. waiting for 30 sec more
[11 Jun 2014 17:38:16,007] <23199746 GPFS DEBUG plsapd01> Timeout:GPFS is not active on node 172.23.1.3
[11 Jun 2014 17:38:19,277] <9699536 GPFS DEBUG plsapd01> Timeout:GPFS is not active on node 172.23.1.1
[11 Jun 2014 17:38:19,300] <21954626 STRT ERROR plsapd01> Could not start the product 'GPFS' on 'server7 server2'
[11 Jun 2014 17:38:19,331] <20840644 UPDT APPI WARN plsapd01> Failed to start solution.
[11 Jun 2014 17:38:19,381] <20840644 UPDT APPI DEBUG plsapd01> Executing query Logical_name=bwr1 AND Solution_version=3.0.3.1, to update status of Solution
[11 Jun 2014 17:38:19,489] <20840644 UPDT APPI INFO plsapd01> PHASE_END APPLY IMPACT
[11 Jun 2014 17:38:19,490] <20840644 UPDT APPI DEBUG plsapd01> Apply phase done for release 'bwr1' successfully
[11 Jun 2014 17:38:19,491] <20840644 UPDT RESU INFO plsapd01> PHASE_END RESUME
[11 Jun 2014 17:38:19,493] <20840644 UPDT RESU ERROR plsapd01> The resume phase for the release 'bwr1' failed.
|
Workaround:
-----------
Fix the GPFS error and then continue the fix pack installation by completing the following steps.
1. Run the following commands on the management host as the root user:
2. Check the fix pack installation status by running the following command:
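The command listings for these steps are not preserved in this extract. A minimal sketch, assuming the standard GPFS administration commands and the miupdate and appl_ls_cat usage shown in other entries:
/usr/lpp/mmfs/bin/mmstartup -a    # step 1: start GPFS across the cluster
/usr/lpp/mmfs/bin/mmgetstate -a   # confirm that every node reports the active state
miupdate -resume apply            # resume the apply stage of the fix pack installation
appl_ls_cat                       # step 2: check the fix pack installation status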
Fixed:
----------
V1.0.0.4
V1.1.0.0
|
|
KI005022
Message display is not updated during management stage after fix pack installation is resumed
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
Message display is not updated during management stage after fix pack installation is resumed
The message display does not update when resuming the fix pack installation after an error during the management stage. The system console continues to show the following error message:
Error: Error on nodes (172.23.1.10).. Refer to the platform layer log file for details. To ...
For example, when the management host reboots, the warning message that the management host is rebooting is not shown. This warning message is normally shown when the management host reboots during the management stage of a successful fix pack installation process that does not need to be resumed after an error or a failure.
|
Workaround:
-----------
When the state of the Apply to management hosts stage changes to Suspended, it means that the management host is rebooting. The user must hover the mouse over the error message to see the reboot warning details. The user must then follow the instructions in the User response section of the reboot warning message to restart the system console.
Fixed:
----------
V1.0.0.6/V1.1.0.2 include design changes to the fixpack process such that this is not an issue.
|
|
KI004283
Ethernet switches might lose their configurations during firmware upgrade
|
Fixpack |
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.0.0.6
I_V1.1.0.1
I_V1.1.0.2
|
Ethernet switches might lose their configurations during firmware upgrade
During fix pack installation, the Ethernet switches might lose their configurations when the switch firmware is upgraded in two stages, first to an interim level, then to the target level.
|
Workaround:
-----------
Contact IBM Support and reference this technote. Do not proceed without working with a support engineer.
As a precautionary measure, automated backups of the switch configurations are taken before the preview phase and during the apply phase of the fix pack installation in V1.0.0.3 and above. The backup files are located on the management host in the /BCU_share/net_switch_backup folder.
Restore the Ethernet switch configurations from their configuration backup files by completing the following steps:
1. Copy the configuration files to a folder on a USB drive or on a local computer after the preview phase. As an example, kf3 is the USB drive folder name used here when the usbcopy command is run at a later step.
Explanation: Instruct the customer to make a copy of the switch configuration backup files to a USB drive or their local computer after the preview stage because the switch usually loses its configurations during the apply phase. It will be impossible to remotely access the management host to make a copy of the configuration backup files after the apply phase if the switch has lost its configurations.
2. Connect a serial cable to the serial port on the switch and start a hyperterminal session.
3. Enter the following settings for the hyperterminal:
4. Press Enter in the hyperterminal session.
5. Enter the following password:
6. At the prompt, run the following command:
7. Run the following command:
8. Install the USB drive with both the OS and boot image firmware loaded.
9. Run the following commands:
10. Run the following command:
11. Log in to switch and enter the following password:
12. Enter the following commands:
Now run the usbcopy cmd by using the following general format:
For top 1G switch
For bottom 1G switch
For top 10G switch
For bottom 10G switch
13. Enter the following command:
Look for the file that contains the loaded configuration information by checking the time stamp. The file with the latest time stamp is the one most recently loaded.
14. Reload the switch configurations by running the following command:
Fixed:
----------
NA
|
|
KI005412
SAN switch status and monitoring information does not get updated after fix pack installation
|
Fixpack | I_V1.0.0.3 |
SAN switch status and monitoring information does not get updated after fix pack installation
This issue can occur for either of the following reasons:
The fix pack updates the console code and introduces a bug that prevents the SAN switch status in the Hardware > Network Devices panel from getting updated monitoring information. For example, if the SAN switch temperature changes, this does not get registered in the console.
Another sign of this problem is seen in an add-node scenario. When adding a new node on which fix pack 3 was installed, all of the listed network devices, including the SAN switches, are shown as a Network Switch type. When you click on any switch listed in the Hardware > Network Devices panel to view the details, the Model field is not set.
|
Workaround:
-----------
To fix this issue, a console code patch must be applied after the installation of fix pack 3.
1. Download the patched query_switches.groovy file from IBM Support.
2. With the patched file, replace the broken query_switches.groovy file in the following directory:
3. After replacing the broken query_switches.groovy file with the patched file, verify that the patched file has the following permissions:
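The directory and permissions listings are not preserved in this extract. A minimal sketch, where <console_install_dir> is a placeholder for the unpreserved console path and /BCU_share is assumed as the download location:
cp <console_install_dir>/query_switches.groovy <console_install_dir>/query_switches.groovy.bak   # keep a copy of the broken file
cp /BCU_share/query_switches.groovy <console_install_dir>/query_switches.groovy                  # step 2: drop in the patched file
ls -l <console_install_dir>/query_switches.groovy                                                # step 3: compare owner and mode with the backup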
Fixed:
----------
V1.0.0.4
V1.0.0.6/V1.1.0.2 Fixpacks remove the appliance console from the appliance.
|
|
KI004996
V7000 storage controller firmware upgrade hangs in upgrading state
|
Fixpack |
I_V1.0.0.3
|
V7000 storage controller firmware upgrade hangs in upgrading state
While upgrading the 6.4.0.3 level firmware of the V7000 storage controller to 6.4.1.6 level, the upgrade process hangs in the upgrading state. The following example message is shown:
apply: 172.23.1.237: broke out of timed wait after 36 iterations of maximum 36. update status is <upgrading>
If the previous example message was not seen, you can verify the upgrade status by running the following command:
svcinfo lssoftwareupgradestatus -nohdr
The following example output is a result of running the previous command:
upgrading
|
Workaround:
-----------
Contact IBM Support before proceeding with the workaround. Only a qualified SSR or CE, engaged through IBM Support, should manipulate the hardware; this minimizes the risk of losing the storage configuration or, in the worst case, data.
1. Reseat the canister of the offline storage controller that is stuck in the upgrading state. Usually only one of the two storage canisters is stuck. The stuck canister is the one that does not respond to ping commands.
2. The upgrade status changes to the stalled state. Verify by running the following command as the superuser on the V7000 that is having the issue:
3. To abort the firmware upgrade process and revert the firmware back to its previous level, run the following command as the superuser on the V7000 that is having the issue:
4. To verify that the firmware upgrade process is downgrading the firmware back to its previous level, run the following command as the superuser on the V7000 that is having the issue:
5. The storage controller becomes inaccessible and does not respond to pings for a few minutes. This is expected. Wait until it becomes accessible before continuing to the next step. 6. After logging in to the cluster, the firmware is back at the 6.4.0.3 level. Verify that the firmware upgrade process is in an inactive state by running the following command as the superuser on the V7000 that is having the issue:
7. Copy the firmware upgrade image to the /home/admin/upgrade Storwize directory by running the following example command on the management host as root:
8. To resume the V7000 storage controller firmware upgrade, run the following command on the management host as root:
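The command listings for steps 2 through 8 are not preserved in this extract. A minimal sketch, reusing the Storwize CLI commands shown elsewhere in these notes; the image file name and the resume command are assumptions:
svcinfo lssoftwareupgradestatus   # steps 2, 4, and 6: expect stalled, then downgrading, then inactive
applysoftware -abort              # step 3: abort the upgrade and revert the firmware
scp /BCU_share/<fix_pack_dir>/<v7000_upgrade_image> superuser@<v7000_ip>:/home/admin/upgrade   # step 7: illustrative copy of the upgrade image
miupdate -resume apply            # step 8: assumption, mirroring the resume usage in other entries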
Fixed:
----------
V1.0.0.4 includes later V7000 firmware that reduces the risk of similar issues.
V1.0.0.6/V1.1.0.2 Fixpacks remove the appliance console from the appliance, so mi* commands and console GUI actions are no longer possible.
|
|
KI005345
V7000 storage drive firmware upgrade fails
|
Fixpack | I_V1.0.0.3 |
V7000 storage drive firmware upgrade fails
Firmware upgrades do not succeed for V7000 storage drives that have a product ID that is not part of the upgrade bundle or that have a firmware level higher than the upgrade target level.
|
Workaround:
-----------
V1.0.0.3 Fixpacks ONLY!
After the preview stage (stage 1) completes and before you start the management stage (stage 2), install a manual fix for the V7000s.
1. Download, in binary mode, the Storwize.pm_Ctrl file and the Storwize.pm_UPD file from IBM Support.
2. Copy the files to the /BCU_share directory on the management host.
3. Determine the logical name assigned to the fix pack by running the following command:
4. Copy files to the appropriate locations by running the following commands as root on the management host:
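The command listings for steps 3 and 4 are not preserved in this extract. A minimal sketch; the destination paths in step 4 are placeholders because the originals are not preserved:
appl_ls_cat                                         # step 3: the NAME column shows the logical name assigned to the fix pack
cp /BCU_share/Storwize.pm_Ctrl <target_ctrl_path>   # step 4: placeholder destination
cp /BCU_share/Storwize.pm_UPD <target_upd_path>     # step 4: placeholder destination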
Fixed:
----------
V1.0.0.4
|
|
KI005222
appl_stop command successfully stops and reboots HMC, but falsely reports return code 1
|
General | I_V1.0.0.3 |
appl_stop command successfully stops and reboots HMC, but falsely reports return code 1
The appl_stop command successfully stops and reboots the Hardware Management Console (HMC), but a failure return code of 1 is falsely reported. The appl_stop command calls the hmcshutdown command, which successfully calls the ForceShutdown script. The erroneous return code is caused by the script not completing before the hmcshutdown command exits.
The following list contains the commands that are affected by this false status reporting issue:
appl_stop -l logical_name_of_hmc
appl_stop -r resource_type_of_hmc
appl_stop -l logical_name_of_hmc -R
appl_stop -r resource_type_of_hmc -R
Example output of the appl_stop command to stop the HMC:
appl_stop -l hmc0
Checking the status for the resource 'hmc0'.
Stopping the resource, 'hmc0'. This may take long time.
The SSHD daemon started successfully on '172.23.4.241'.
The power off operation failed for the resource hmc0.
Error stopping the resource, hmc0.
The HMC status shows that it is Stopping and eventually does stop. Example shows only the HMC part of the output:
appl_ls_hw
NAME HOSTNAME IP MODULE STATUS DESCRIPTION
hmc0 9.3.2.16 Stopping IBM Hardware Management Console
hmc1 9.3.2.17 Online IBM Hardware Management Console
|
Workaround:
-----------
1. Ignore the false return code 1 that is reported after running the appl_stop command.
2. Run the appl_start command to restart the HMC after it was stopped or rebooted.
3. Ping the HMC after completion of the restart or reboot to verify that it is online.
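A minimal sketch of steps 2 and 3, reusing the logical name from the appl_stop example above; the HMC IP is a placeholder:
appl_start -l hmc0   # step 2: restart the stopped HMC
ping <hmc_ip>        # step 3: confirm that the HMC is back online
Fixed: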
----------
V1.0.0.4
|
|
KI004775
System console fails to refresh properly and prompts for new log in after prolonged idle time, a stopped isas.server module, and an isas.console.system in a stopped or unknown state
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.0
I_V1.1.0.1
|
System console fails to refresh properly and prompts for new log in after prolonged idle time, a stopped isas.server module, and an isas.console.system in a stopped or unknown state
After a prolonged idle time while logged in to the system console, no contents are displayed and you might see the error message Sorry, an error occurred when you refresh the system console display or click on the UI. You might also be prompted to log in to the system console again, but are unable to log in when you provide the correct user ID and password.
The system console fails to display the correct content because the isas.server module is stopped and the isas.console.system is stopped or in an unknown state. You might also see the login prompt due to the GUI timing out, but because the isas.server module is down, the user ID and password cannot be authenticated and the Invalid user name and password error message is displayed. |
Workaround:
-----------
1. Verify that the isas.server module is stopped and the isas.console.system is stopped or in an unknown state by running the following command:
2. Start the isas.server module by running the following command:
Fixed:
----------
V1.0.0.6/V1.1.0.2 Fixpacks remove the appliance console from the appliance.
|
|
KI004810
SAS RAID adapter in SSD enclosure is in a Degraded state without symptoms
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
|
SAS RAID adapter in SSD enclosure is in a Degraded state without symptoms
When the SAS RAID adapter in an SSD enclosure encounters an error that the adapter does not understand, the adapter assumes a Degraded state and disables the adapter cache as a precaution.
Note: There is no significant impact with the degraded adapter state and a disabled adapter cache in a RAID 10 array. However, customer perception is impacted when they see a Degraded array state without an explanation or a hint about the nature of the problem.
Diagnostic prerequisite: Verify the names of your SSD RAID adapters on the system. On the administration and standby administration hosts, the adapters are sissas1 and sissas2. On the data and standby hosts, the adapters are sissas2 and sissas3. Run the following command on the administration host and a data host:
lsdev|grep "RAID SAS"
The following output is a result of running the previous command on the administration host:
(130) root @ paradise02: 7.1.0.0: / $ lsdev|grep "RAID SAS"
sissas1 Available 06-00 PCIe2 3.1GB Cache RAID SAS Enclosure 6Gb x8
sissas2 Available 07-00 PCIe2 3.1GB Cache RAID SAS Enclosure 6Gb x8
The following output is a result of running the previous command on a data host:
(130) root @ paradise05: 7.1.0.0: / $ lsdev|grep "RAID SAS"
sissas2 Available 09-00 PCIe2 3.1GB Cache RAID SAS Enclosure 6Gb x8
sissas3 Available 0A-00 PCIe2 3.1GB Cache RAID SAS Enclosure 6Gb x8
Detection of the problem:
There is no indication in the errpt that there is a problem. However, you can obtain an indication that there is a problem by either of the following two methods: 1. Run the following command to check for a Degraded state of the RAID 10 Array on each of the SSD RAID adapters:
The following output is a result of running the previous command:
0) root @ paradise05: 7.1.0.0: /var/adm/ras $ sissasraidmgr -L -l sissas2
------------------------------------------------------------------------
Name Resource State Description Size
------------------------------------------------------------------------
sissas2 FEFFFFFF Primary PCIe2 3.1GB Cache RAID SAS Enclosure 6Gb x8
sissas3 FEFFFFFF HA Linked Remote adapter SN 002AN00C
hdisk2 FC0000FF Degraded RAID 10 Array (O/O) 1163GB
pdisk2 000401FF Active SSD Array Member 387.9GB
pdisk1 000400FF Active SSD Array Member 387.9GB
pdisk5 000404FF Active SSD Array Member 387.9GB
pdisk3 000402FF Active SSD Array Member 387.9GB
pdisk4 000403FF Active SSD Array Member 387.9GB
pdisk0 000007FF Active SSD Array Member 387.9GB
2. Run the diag command on the sissasX adapters, which returns an output and logs an entry in errpt:
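(The diag invocation itself is not preserved in this extract; a typical non-interactive form on AIX, offered here as an assumption, is the following. Repeat it for each sissasX adapter.)
diag -d sissas2 -c   # command-mode diagnostics against the adapter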
The following output is a result of running the previous command:
A PROBLEM WAS DETECTED ON Thu Jan 30 15:18:32 CST 2014 801014
The Service Request Number(s)/Probable Cause(s) (causes are listed in descending order of probability):
2D24-8150: Controller failure.
Error log information:
Date: Thu Jan 30 15:18:31 CST 2014
Sequence number: 364
Label: SAS_ERR2
sissas2 FRU: 00E7703 PCIe2 3.1GB Cache RAID SAS Enclosure 6Gb x8 U78AB.001.WZSGRJ9-P1-C1-T1-L1-T1
errpt:
---------------------------------------------------------------------------
LABEL: SAS_ERR2
IDENTIFIER: CCC89167
Date/Time: Thu Jan 30 15:20:28 CST 2014
Sequence Number: 366
Machine Id: 00F72AA74C00
Node Id: paradise05
Class: H
Type: TEMP
WPAR: Global
Resource Name: sissas3
Resource Class: adapter
Resource Type: 14105303
Location: U78AB.001.WZSGRJ9-P1-C8-T1-L1-T1
VPD:
PCIe2 3.1GB Cache RAID SAS Enclosure 6Gb x8 :
Part Number.................00E7705
FRU Number..................00E7703
Serial Number...............YP11BG2AN00C
Manufacture ID..............01BG
EC Level....................1
ROM Level.(alterable).......015000ab
Customer Card ID Number.....57C3
Product Specific.(Z1).......5
Product Specific.(Z2).......2D24
Feature Code/Marketing ID...EDR1-001
Machine/Cabinet Serial No...G2AL001
FRU Label...................P1-C2-T3
Description
ADAPTER ERROR
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
ADDITIONAL HEX DATA
0001 0800 1910 00F0 0444 8200 0101 0000
0150 00AB 0000 00FF 57C3 8150 0000 0001
FEFF FFFF FFFF FFFF 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 88F9
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
E210 00A0 FF00 0000 07A0 0001 1A1C EC5F
0000 0000 0000 0FE8 0444 8200 0150 00AB
FFFF FFFF 1521 2204 0000 0000 0000 0000
0000 0000 0000 0000 FEFF FFFF FFFF FFFF
0000 0000 0000 88F9 0000 0000 0000 0000
0000 0000 0000 0000 0000 0001 0000 0000
|
Workaround:
-----------
Contact IBM Support before proceeding with any workaround.
Fixed:
----------
V1.0.0.4 includes firmware updates that address many issues that lead to this scenario.
|
|
KI005061
IBM Systems Director storage control update fails during inventory collection
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
|
IBM Systems Director storage control update fails during inventory collection
The fix pack update fails during the management stage while updating the IBM Systems Director storage control component.
There is a known issue with IBM Systems Director where the ssh server drops the ssh connection during an update due to its high workload, many ciphers, or slow responses from the target, any of which can cause a timeout and result in collect inventory failure. Investigation of the PL trace file finds that the error is due to a connection issue. The following message is found in the PL trace file:
Inventory collection failed for system "sysNode". Verify the connection to the system and collect inventory.
The following example entries were found in the PL trace file:
[22 May 2014 12:48:21,886] <11403344 UPDT APPI TRACE stgkf301>
[22 May 2014 12:48:21,887] <11403344 UPDT APPI TRACE stgkf301> ATKUPD764I Update "com.ibm.director.storage.storagecontrol.mgr.AIX_4.2.4.fp1-build-00009" was installed on system "sysNode" successfully.
[22 May 2014 12:48:21,887] <11403344 UPDT APPI TRACE stgkf301>
[22 May 2014 12:48:21,887] <11403344 UPDT APPI TRACE stgkf301> ATKUPD795I You must manually restart the IBM Systems Director management server after this install completes for the updates to take effect.
[22 May 2014 12:48:21,887] <11403344 UPDT APPI TRACE stgkf301>
[22 May 2014 12:48:21,887] <11403344 UPDT APPI TRACE stgkf301> ATKUPD739I Collecting inventory on system "sysNode".
[22 May 2014 12:48:21,887] <11403344 UPDT APPI TRACE stgkf301>
[22 May 2014 12:48:21,888] <11403344 UPDT APPI TRACE stgkf301> ATKUPD711I Still collecting inventory for system "sysNode".
[22 May 2014 12:48:21,888] <11403344 UPDT APPI TRACE stgkf301>
[22 May 2014 12:48:21,888] <11403344 UPDT APPI TRACE stgkf301> ATKUPD706E Inventory collection failed for system "sysNode". Verify the connection to the system and collect inventory.
[22 May 2014 12:48:21,888] <11403344 UPDT APPI TRACE stgkf301>
[22 May 2014 12:48:21,888] <11403344 UPDT APPI TRACE stgkf301> ATKUPD572I Running compliance on system "sysNode".
[22 May 2014 12:48:21,889] <11403344 UPDT APPI TRACE stgkf301>
[22 May 2014 12:48:21,889] <11403344 UPDT APPI TRACE stgkf301> ATKUPD734E An error was encountered during the "Install Updates" task. Search above for previous related errors, fix each error, and then retry the operation.
|
Workaround:
-----------
If the cause of the update failure was a connection issue due to high workloads or delays in the connection, the problem can be resolved by resuming the fix pack installation. Run the following command on the management host as root:
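(The command itself is not preserved in this extract; based on the miupdate usage documented in other entries, a likely form is the following.)
miupdate -resume management   # assumption: resume the failed management stage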
If the previous command fails to resume the fix pack installation, contact IBM Support.
Fixed:
----------
V1.0.0.5/V1.1.0.1, where IBM Systems Director is disabled.
|
|
KI005116
IBM Systems Director managed end point goes offline after update and results in Storage Control update failure
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
|
IBM Systems Director managed end point goes offline after update and results in Storage Control update failure
After updating the IBM Systems Director (ISD), the processes might become locked and run short of resources. The ISD local CAS agent fails to start, and the server communication state is stuck at 3 (expected state value: 2), leading to a Storage Control (SC) update failure.
To confirm that this issue has occurred, check for the corresponding messages in the /var/aixappl/pflayer/log/pl_update.trace file by running the following command on the management host as root:
tail -f /var/aixappl/pflayer/log/pl_update.trace
The following example output is a result of running the previous command:
[03 Jun 2014 23:45:54,341] <13631536 UPDT APPI TRACE hostname>
[03 Jun 2014 23:45:54,341] <13631536 UPDT APPI TRACE hostname> ATKUPD275E The managed system "sysNode" is either offline or locked. Ensure that the system is available and attempt to perform the operation again.
[03 Jun 2014 23:45:54,341] <13631536 UPDT APPI TRACE hostname>
[03 Jun 2014 23:45:54,341] <13631536 UPDT APPI TRACE hostname> ATKUPD287E The install needed updates task has completed with errors. Read the above messages for details on the error.
[03 Jun 2014 23:45:54,341] <13631536 UPDT APPI TRACE hostname>
[03 Jun 2014 23:45:54,342] <13631536 UPDT APPI TRACE hostname> ATKUSC307E Command installneeded completed with errors. For more information, see the job log for job InstallNeededTask.
[03 Jun 2014 23:45:54,342] <13631536 UPDT APPI TRACE hostname>
[03 Jun 2014 23:45:54,342] <13631536 UPDT APPI TRACE hostname> command /opt/ibm/director/bin/smcli installneeded -v -F /BCU_share/bwr3/software/DirectorServer/imports/SCupdates4241 returned status = 32
|
Workaround:
-----------
1. Reboot the management host by running the following command on the management host as root:
2. After the management host comes up, verify that the ISD is Active, that the ISD version is 6.3.3.1, and that the local CAS agent is running by running the following commands on the management host as root:
3. Collect server inventory by running the following command on the management host as root:
4. Install the SC update by running the following command on the management host as root:
5. After the completion of the SC update installation, stop and restart the ISD by running the following commands on the management host as root:
6. Verify that the ISD version was updated to 6.3.3.1 by running the following command on the management host as root:
7. Verify that the SC version was updated to 4.2.4.1 by running the following command on the management host as root:
8. Resume the fix pack installation.
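The command listings for these steps are not preserved in this extract. A minimal sketch, assuming the standard ISD command-line tools in /opt/ibm/director/bin; the inventory target in step 3 and the resume option in step 8 are assumptions, and the Storage Control version check for step 7 is not reproduced:
shutdown -Fr                                                     # step 1: reboot the management host
/opt/ibm/director/bin/smstatus                                   # step 2: expect Active
/opt/ibm/director/bin/smcli lsver                                # step 2: expect 6.3.3.1
/opt/ibm/director/bin/smcli collectinv -i <management_host_ip>   # step 3: collect server inventory
/opt/ibm/director/bin/smcli installneeded -v -F /BCU_share/bwr3/software/DirectorServer/imports/SCupdates4241   # step 4: path taken from the trace excerpt above
/opt/ibm/director/bin/smstop                                     # step 5: stop ISD
/opt/ibm/director/bin/smstart                                    # step 5: restart ISD
/opt/ibm/director/bin/smcli lsver                                # step 6: confirm the updated ISD version
miupdate -resume management                                      # step 8: resume the fix pack installation
Fixed: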
----------
V1.0.0.5/V1.1.0.1, where IBM Systems Director is disabled.
|
|
KI005982
The "+" icon in system console Fix Packs panel is not displayed on systems with a language other than English system locale setting
|
General |
I_V1.0.0.3
|
The "+" icon in system console Fix Packs panel is not displayed on systems with a language other than English system locale setting
If your system locale is set to a language other than English, you will find that the "+" icon is missing when you attempt to add a new fix pack file to the management host on your 1.0.0.3 (Fix Pack 3) system by using the Fix Packs panel of the system console. Not being able to add and register the new fix pack prevents you from installing the new fix pack by using the system console.
|
Workaround:
-----------
To restore the missing "+" icon in the Fix Packs panel of the system console so that you can install the fix pack by using the system console, complete the following steps:
1. Set the system locale to English.
2. Restart the system console.
3. Install the fix pack by using the system console.
4. Commit the fix pack installation.
5. Set the system locale to the desired language.
Fixed:
----------
V1.0.0.4 Console is updated with a fix for the issue.
V1.0.0.6/V1.1.0.2 Console is removed.
|
|
KI006755
Single sign-on from the system console to the Database Performance Monitoring (DPM) Console and/or the Warehouse Admin Console fails with Error 500
|
General | I_V1.0.0.4 |
Single sign-on from the system console to the Database Performance Monitoring (DPM) Console and/or the Warehouse Admin Console fails with Error 500
After installing PDOA fix pack 4, you might find that single sign-on from the console to the DPM console, the Warehouse Admin console, or both fails with an Error 500 message.
|
Workaround:
-----------
You must download the Unrestricted JDK JCE policy files and patch them into the /usr/java6_64/jre/lib/security directory on the management host. The exact download locations and steps are listed below:
1. Download the Unrestricted JDK JCE policy files from https://www-01.ibm.com/marketing/iwm/iwm/web/preLogin.do?source=jcesdk . Select the option "Java 5.0 SR16, Java 6 SR13, Java 6 SR5 (J9 VM2.6), Java 7 SR4, Java 8 GA, and all later releases" for download.
2. Unzip unrestrictedpolicyfiles.zip, which contains two JAR files: local_policy.jar and US_export_policy.jar.
3. On the management node, as root, create a backup directory /usr/java6_64/jre/lib/security/backup and move the existing local_policy.jar and US_export_policy.jar files to this directory.
4. Place the new JARs of the same name in the management node's /usr/java6_64/jre/lib/security/ directory.
5. Run this command on the management node as root: miresolve -restart
These five steps fix the problem with single sign-on and restart the console. You can then log in normally to the DPM and Warehouse Admin consoles. A command sketch follows.
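A minimal sketch of steps 3 through 5, assuming the downloaded JARs were unzipped to /tmp:
mkdir /usr/java6_64/jre/lib/security/backup
mv /usr/java6_64/jre/lib/security/local_policy.jar /usr/java6_64/jre/lib/security/US_export_policy.jar /usr/java6_64/jre/lib/security/backup
cp /tmp/local_policy.jar /tmp/US_export_policy.jar /usr/java6_64/jre/lib/security/
miresolve -restart
Fixed: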
----------
V1.0.0.5/V1.1.0.1 Updates to the console remove its dependency on the java installed at the system level.
V1.0.0.6/V1.1.0.2 The console and Warehouse Tools are removed. With the console removed, the SSO feature to DPM is no longer supported.
|
|
KI004026
A limited number of SSH sessions can be connected to an Ethernet network switch at the same time
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.0
I_V1.1.0.1
|
A limited number of SSH sessions can be connected to an Ethernet network switch at the same time
The maximum number of simultaneous SSH sessions that can be connected to an Ethernet network switch is four.
The following message is displayed after attempting to create a fifth SSH session from the system console:
java.net.ConnectException: A remote host refused an attempted connect operation.
The following example message is displayed after attempting to create a fifth SSH session from a terminal application on a remote computer:
Server unexpectedly closed network connection
|
Workaround:
-----------
Do not attempt to connect more than the maximum number of four simultaneous SSH sessions to an Ethernet network switch.
Close one or more of the four existing SSH sessions to be able to connect one or more new SSH sessions, up to the maximum of four.
Fixed:
----------
N/A. This is a limitation imposed by the network switch firmware. V1.0.0.6/V1.1.0.2 significantly reduce the number of SSH sessions to the switches with the removal of the console.
|
|
KI004712
Storage web console links break after logging in as a superuser
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.0
I_V1.1.0.1
|
Storage web console links break after logging in as a superuser
The port number of any link that is selected from a storage web console page is lost and results in a 404 File not found error. This issue occurs when you access the storage web console from the system console by clicking System > Service Level Access, selecting one of the links in the IBM Storwize V7000 section, and after logging in as the superuser for the first time.
|
Workaround:
-----------
When this issue occurs, do the following steps:
1. Click the Back button of your browser to return to the IBM Storwize V7000 login page. Save the port number that you find in the URL in the address bar of your browser.
2. Click the Forward button of your browser to return to the error page. Insert the missing port number in the URL that you find in the address bar of your browser. Load this web page by pressing the Enter key or clicking the browser Load button.
Fixed:
----------
Apply workaround if encountered.
V1.0.0.6/V1.1.0.2 The console has been removed which removes the availability of this Service Level Access feature.
|
|
KI007646
Configuration of some workload management variables is not possible with Firefox browser and low screen resolution
|
General |
I_V1.0
I_V1.1
|
Configuration of some workload management variables is not possible with Firefox browser and low screen resolution
When the screen resolution is lower than the 1280x1024 minimum browser resolution requirement for the Firefox browser, a scroll bar is not present to access the bottom portion of the database performance monitor system console panel. This issue prevents being able to configure some workload management variables.
|
Workaround:
-----------
To resolve this issue, you can take one of the following actions:
Fixed:
----------
N/A.
|
|
KI006085
accesssys command might fail and cause a locked state after changing the network switch password
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.1.0.0
|
accesssys command might fail and cause a locked state after changing the network switch password
After changing the password for a network switch, the process involves discovering the network switch and then running the accesssys command, which requests secure access to the switch by using the new password. On occasion, the accesssys command fails to get access to the switch and the switch goes into a locked state.
To confirm that this issue occurred, look for return code 66 in the pflayer log file. The following pflayer log file example shows the accesssys command failed:
[09 Jan 2014 12:07:20,367] <24248418 CTRL DEBUG host01> Executing command->/opt/ibm/director/bin/smcli discover -i 172.23.1.251 -t "Switch"
[09 Jan 2014 12:07:31,595] <24248418 CTRL DEBUG host01> Ret Code-> 0
[09 Jan 2014 12:07:31,596] <24248418 CTRL DEBUG host01> Command output-> Discovery completion percentage 50%
[09 Jan 2014 12:07:31,596] <24248418 CTRL DEBUG host01> Discovery completion percentage 100%
[09 Jan 2014 12:07:31,596] <24248418 CTRL DEBUG host01> Discovery completed:
[09 Jan 2014 12:07:31,596] <24248418 CTRL DEBUG host01> 100%
[09 Jan 2014 12:07:31,597] <24248418 CTRL DEBUG host01> Waiting for 20 sec between discovery and accesssys
[09 Jan 2014 12:07:51,598] <24248418 CTRL DEBUG host01> Executing command-> /opt/ibm/director/bin/smcli lssys -i 172.23.1.251 -t "Switch"
[09 Jan 2014 12:07:52,011] <24248418 CTRL DEBUG host01> Ret Code-> 0
[09 Jan 2014 12:07:52,011] <24248418 CTRL DEBUG host01> Command output-> fcm_switch2
[09 Jan 2014 12:07:52,328] <24248418 CTRL DEBUG host01> Executing command ->/opt/ibm/director/bin/smcli accesssys -u admin -p ***** -i 172.23.1.251
[09 Jan 2014 12:07:52,606] <24248418 CTRL DEBUG host01> Ret Code-> 66
[09 Jan 2014 12:07:52,606] <24248418 CTRL DEBUG host01> Command output-> DNZCLI0727I : Waiting for request access to complete on... fcm_switch2
[09 Jan 2014 12:07:52,606] <24248418 CTRL DEBUG host01> Result Value: DNZCLI0736I : The system is not available. : fcm_switch2
[09 Jan 2014 12:07:52,649] <24248418 CTRL DEBUG host01> For net switch accessys waiting 20 sec more in case of failure
[09 Jan 2014 12:08:12,650] <24248418 CTRL DEBUG host01> For net switch executing accessys again in case of failure
[09 Jan 2014 12:08:12,970] <24248418 CTRL DEBUG host01> Executing command ->/opt/ibm/director/bin/smcli accesssys -u admin -p ***** -i 172.23.1.251
[09 Jan 2014 12:08:13,209] <24248418 CTRL DEBUG host01> Ret Code-> 66
[09 Jan 2014 12:08:13,209] <24248418 CTRL DEBUG host01> Command output-> DNZCLI0727I : Waiting for request access to complete on... fcm_switch2
[09 Jan 2014 12:08:13,209] <24248418 CTRL DEBUG host01> Result Value: DNZCLI0736I : The system is not available. : fcm_switch2
[09 Jan 2014 12:08:13,219] <24248418 CTRL ERROR host01> Discovery failed for net3.
|
Workaround:
-----------
To unlock and access the network switch, run the following command:
smcli accesssys -i <switch_ip> -u admin -p <password>
Fixed:
----------
V1.0.0.5/V1.1.0.1 IBM Systems Director is disabled as part of these fixpacks.
|
|
KI006330
Paths for fix pack file location on the system are restricted for security reasons
|
Fixpack |
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
Paths for fix pack file location on the system are restricted for security reasons
Before you can install a fix pack, the fix pack files must be accessible on the management host. For security reasons, the valid management host paths from which the fix pack files can be accessed during the fix pack installation procedure have been restricted on your 1.0.0.3 (Fix Pack 3) system. You cannot install the fix pack if the fix pack files are not located in one of the valid management host paths.
|
Workaround:
-----------
You must locate the fix pack files on the management host in an absolute path that starts with only the following file systems:
The following examples are valid absolute paths on the management host:
Fixed:
----------
V1.0.0.6/V1.1.0.2 The fixpack mechanism has changed and this statement is no longer applicable.
|
|
KI006280
SAN switch password change fails
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.1.0.0
|
SAN switch password change fails
Due to SAN switch listing limitations in IBM Systems Director (ISD), changing the SAN switch password fails with return code 20 displayed in the PL log file.
Example PL log file output:
[14 Apr 2015 06:49:34,225] <23003302 PASS DEBUG stgkf101> Executing command->/opt/ibm/director/bin/smcli lsver
[14 Apr 2015 06:49:34,575] <23003302 PASS DEBUG stgkf101> ISD version->6.3.5
[14 Apr 2015 06:49:34,577] <23003302 PASS DEBUG stgkf101> 6.3.5.0 is equal to 6.3.5.0.
[14 Apr 2015 06:49:34,578] <23003302 PASS DEBUG stgkf101> Executing command expect -f /opt/ibm/aixappl/pflayer/lib/create_snmp_profile.expect profile_172.23.1.31_2 172.23.1.31 Switch
[14 Apr 2015 06:49:35,467] <23003302 PASS DEBUG stgkf101> From prcoess 33423556 :STDOUT
[14 Apr 2015 06:49:35,467] <23003302 PASS DEBUG stgkf101> spawn smcli mkbasicdiscprofile -name profile_172.23.1.31_2 -i 172.23.1.31 -res Switch -snmpversion 1
[14 Apr 2015 06:49:35,467] <23003302 PASS DEBUG stgkf101>
[14 Apr 2015 06:49:35,467] <23003302 PASS DEBUG stgkf101> Enter community strings as comma separated value
[14 Apr 2015 06:49:35,467] <23003302 PASS DEBUG stgkf101> profile_172.23.1.31_2
[14 Apr 2015 06:49:35,469] <23003302 PASS DEBUG stgkf101> From process 33423556: STDERR:
[14 Apr 2015 06:49:35,469] <23003302 PASS DEBUG stgkf101>
[14 Apr 2015 06:49:35,470] <23003302 PASS DEBUG stgkf101> Exit code: 0
[14 Apr 2015 06:49:35,532] <23003302 PASS DEBUG stgkf101> Executing command->/opt/ibm/director/bin/smcli discover -p profile_172.23.1.31_2
[14 Apr 2015 06:49:41,820] <23003302 PASS DEBUG stgkf101> Ret Code-> 0
[14 Apr 2015 06:49:41,820] <23003302 PASS DEBUG stgkf101> Command output-> profile_172.23.1.31_2 Profile Based Discovery completion percentage 100%
[14 Apr 2015 06:49:41,820] <23003302 PASS DEBUG stgkf101> profile_172.23.1.31_2 Profile Based Discovery completed:
[14 Apr 2015 06:49:41,821] <23003302 PASS DEBUG stgkf101> 100%
[14 Apr 2015 06:49:41,821] <23003302 PASS DEBUG stgkf101>
[14 Apr 2015 06:49:41,824] <23003302 PASS DEBUG stgkf101> Waiting for 20 sec between discovery and accesssys
[14 Apr 2015 06:50:01,824] <23003302 PASS DEBUG stgkf101> Executing command-> /opt/ibm/director/bin/smcli lssys -i 172.23.1.31 -t "Switch"
[14 Apr 2015 06:50:02,174] <23003302 PASS DEBUG stgkf101> Ret Code-> 20
[14 Apr 2015 06:50:02,175] <23003302 PASS DEBUG stgkf101> Command output->
[14 Apr 2015 06:50:02,177] <23003302 PASS ERROR stgkf101> Discovery failed for 172.23.1.31
[14 Apr 2015 06:50:02,178] <23003302 PASS DEBUG stgkf101> Notify DirectorServer failed for node 172.23.1.31.
[14 Apr 2015 06:50:02,180] <23003302 PASS ERROR stgkf101> Password change notification to Systems Director failed for nodes 172.23.1.31.
[14 Apr 2015 06:50:02,182] <23003302 PASS INFO stgkf101> The script was executed with following status:
[14 Apr 2015 06:50:02,183] <23003302 PASS INFO stgkf101> ----------------------------------------------------------
[14 Apr 2015 06:50:02,184] <23003302 PASS INFO stgkf101> SCHEMA::CHPW::LOGICAL_NAME::STATUS(PASS/FAIL)::DESCRIPTION
[14 Apr 2015 06:50:02,185] <23003302 PASS ERROR stgkf101> CHPW::san0::FAIL::notify_director_server Password change failed for resource(s) 172.23.1.31.
Example command line output:
/opt/ibm/aixappl/pflayer/bin/appl_conf chpw -l san0 -u admin -p <new_password>
spawn time /opt/ibm/aixappl/pflayer/bin/appl_conf chpw -l san0 -u admin -p
Enter the new password
Changing the password for user 'admin' on resource 'san0'.
The password change was successful for the user 'admin' on resource 'san0'.
Updating the password in the database for user admin.
The password was successfully updated in the database.
Notifying Systems Director of a password change for user admin.
Discovery failed for 172.23.1.31
Password change notification to Systems Director failed for nodes 172.23.1.31.
The script was executed with following status:
----------------------------------------------------------
SCHEMA::CHPW::LOGICAL_NAME::STATUS(PASS/FAIL)::DESCRIPTION
CHPW::san0::FAIL::notify_director_server Password change failed for resource(s) 172.23.1.31.
----------------------------------------------------------
|
Workaround:
-----------
To resolve the issue, complete the following steps:
1. Rediscover the SAN switch by running the following command:
2. Retry changing the SAN switch password by running the following command:
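The command listings for these steps are not preserved in this extract. A minimal sketch, reusing the command forms shown in the log and example output above:
/opt/ibm/director/bin/smcli discover -i 172.23.1.31 -t "Switch"                  # step 1: rediscover the SAN switch
/opt/ibm/aixappl/pflayer/bin/appl_conf chpw -l san0 -u admin -p <new_password>   # step 2: retry the password change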
Fixed:
----------
V1.0.0.5/V1.1.0.1 IBM Systems Director is removed as part of these fixpacks.
|
|
KI006280
SAN switches are not listed by lssys command after password change
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.1.0.0
|
SAN switches are not listed by lssys command after password change
After the successful change of the SAN switch passwords, the lssys command does not return a list of the SAN switches.
Example command line output to change the password:
$ time appl_conf chpw -r san -u admin -p -o
Enter the new password
Enter the old password
Changing the password for user 'admin' on resource 'san0'.
The password change was successful for the user 'admin' on resource 'san0'.
Updating the password in the database for user admin.
The password was successfully updated in the database.
Notifying Systems Director of a password change for user admin.
Password change notification to Systems Director is successful.
Changing the password for user 'admin' on resource 'san1'.
The password change was successful for the user 'admin' on resource 'san1'.
Updating the password in the database for user admin.
The password was successfully updated in the database.
Notifying Systems Director of a password change for user admin.
Password change notification to Systems Director is successful.
Changing the password for user 'admin' on resource 'san2'.
The password change was successful for the user 'admin' on resource 'san2'.
Updating the password in the database for user admin.
The password was successfully updated in the database.
Notifying Systems Director of a password change for user admin.
Password change notification to Systems Director is successful.
Changing the password for user 'admin' on resource 'san3'.
The password change was successful for the user 'admin' on resource 'san3'.
Updating the password in the database for user admin.
The password was successfully updated in the database.
Notifying Systems Director of a password change for user admin.
Password change notification to Systems Director is successful.
The script was executed with following status:
----------------------------------------------------------
SCHEMA::CHPW::LOGICAL_NAME::STATUS(PASS/FAIL)::DESCRIPTION
CHPW::san0::PASS::The password change for 172.23.3.31 was successful.
CHPW::san1::PASS::The password change for 172.23.3.32 was successful.
CHPW::san2::PASS::The password change for 172.23.3.33 was successful.
CHPW::san3::PASS::The password change for 172.23.3.34 was successful.
----------------------------------------------------------
Example lssys command output:
$ smcli lssys -l -i 172.23.3.32
DNZCLI0241E : (Run-time error) The system with IP address or host name 172.23.3.32 was not found. Use the smcli lssys -A IPv4Address,HostName command to list IP addresses and host names for all systems.
|
Workaround:
-----------
To resolve the issue, complete the following steps:
1. Rediscover the SAN switch by running the following command:
2. Retry listing the SAN switch by running the following command:
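The command listings for these steps are not preserved in this extract. A minimal sketch, reusing the command forms from this and the related KI006280 entries:
/opt/ibm/director/bin/smcli discover -i 172.23.3.32 -t "Switch"   # step 1: rediscover the SAN switch
smcli lssys -l -i 172.23.3.32                                     # step 2: list it again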
Fixed:
----------
V1.0.0.5/V1.1.0.1 IBM Systems Director is removed as part of these fixpacks.
|
|
KI006280
Ethernet network switch goes into partially locked state during password change
|
General |
I_V1.0.0.0
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.1.0.0
|
Ethernet network switch goes into partially locked state during password change
After changing the password for the Ethernet network switches, the switch status is listed as PartiallyLocked.
Example lssys command output before the password change:
$ smcli lssys -A "CommunicationState,AccessState"
adminnode_1: Unlocked, 2
DATA-STANDBY-SN101F54R: Unlocked, 2
datanode_4: Unlocked, 2
datanode_5: Unlocked, 2
ETHERNET0-IBM*8205-E6D*101F52R: Unlocked, 2
ETHERNET0-IBM*8205-E6D*101F53R: Unlocked, 2
ETHERNET0-IBM*8231-E2D*101F0CR: Unlocked, 2
ETHERNET0-IBM*8231-E2D*101F0DR: Unlocked, 2
ETHERNET0-IBM*8231-E2D*101F54R: Unlocked, 2
ETHERNET0-IBM*8231-E2D*101F55R: Unlocked, 2
fcm_switch1: Unlocked, 2
fcm_switch2: Unlocked, 2
mgt_switch1: Unlocked, 2
mgt_switch2: Unlocked, 2
Example output during the password change:
/opt/ibm/aixappl/pflayer/bin/appl_conf chpw -r net -u admin -p <new_password>
spawn time /opt/ibm/aixappl/pflayer/bin/appl_conf chpw -r net -u admin -p
Enter the new password
Changing the password for user 'admin' on resource 'net0'.
The password change was successful for the user 'admin' on resource 'net0'.
Updating the password in the database for user admin.
The password was successfully updated in the database.
Notifying Systems Director of a password change for user admin.
Discovery failed for 172.23.1.11
Password change notification to Systems Director failed for nodes 172.23.1.11.
Changing the password for user 'admin' on resource 'net1'.
The password change was successful for the user 'admin' on resource 'net1'.
Updating the password in the database for user admin.
The password was successfully updated in the database.
Notifying Systems Director of a password change for user admin.
Discovery failed for 172.23.1.12
Password change notification to Systems Director failed for nodes 172.23.1.12.
Changing the password for user 'admin' on resource 'net2'.
The password change was successful for the user 'admin' on resource 'net2'.
Updating the password in the database for user admin.
The password was successfully updated in the database.
Notifying Systems Director of a password change for user admin.
Discovery failed for 172.23.1.21
Password change notification to Systems Director failed for nodes 172.23.1.21.
Changing the password for user 'admin' on resource 'net3'.
The password change was successful for the user 'admin' on resource 'net3'.
Updating the password in the database for user admin.
The password was successfully updated in the database.
Notifying Systems Director of a password change for user admin.
Example lssys command output after the password change:
$ smcli lssys -A "CommunicationState,AccessState"
adminnode_1: Unlocked, 2
DATA-STANDBY-SN101F54R: Unlocked, 2
datanode_4: Unlocked, 2
datanode_5: Unlocked, 2
ETHERNET0-IBM*8205-E6D*101F52R: Unlocked, 2
ETHERNET0-IBM*8205-E6D*101F53R: Unlocked, 2
ETHERNET0-IBM*8231-E2D*101F0CR: Unlocked, 2
ETHERNET0-IBM*8231-E2D*101F0DR: Unlocked, 2
ETHERNET0-IBM*8231-E2D*101F54R: Unlocked, 2
ETHERNET0-IBM*8231-E2D*101F55R: Unlocked, 2
fcm_switch1: PartiallyLocked, 2
fcm_switch2: PartiallyLocked, 2
mgt_switch1: PartiallyLocked, 2
mgt_switch2: PartiallyLocked, 2
|
Workaround:
-----------
To resolve the issue, complete the following steps:
1. Remove the Ethernet network switch from the system by running the following command:
2. Rediscover the Ethernet network switch on the system by running the following command:
3. Retry changing the Ethernet network switch password by running the following command:
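The command listings for these steps are not preserved in this extract. A minimal sketch; rmsys is a standard smcli command but its exact use here is an assumption, and the discover and chpw forms are taken from the related entries:
smcli rmsys -i <switch_ip>                                                      # step 1: assumption; remove the switch from ISD
smcli discover -i <switch_ip> -t "Switch"                                       # step 2: rediscover it
/opt/ibm/aixappl/pflayer/bin/appl_conf chpw -r net -u admin -p <new_password>   # step 3: retry the password change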
Fixed:
----------
V1.0.0.5/V1.1.0.1 IBM Systems Director is removed as part of these fixpacks.
|
|
KI006204
Fix pack installation phases show only highest version of Ethernet network switch firmware
|
General | I_V1.0.0.4 |
Fix pack installation phases show only highest version of Ethernet network switch firmware
During the preview, prepare, and apply phases of fix pack installation, both the log output and the appl_ls_cat command output show only the highest version of the firmware for the G8264 Ethernet network switch (7.9.12.0), even though the G8052 Ethernet network switch has a different firmware version (7.9.11.0).
Example log output after the preview phase:
Note: This example output can also be viewed in MI preview.
PRODUCT_UPDATES::SSDDrawerFW_PROD::SSDDrawerFW::Management,Mgmt_Standby,Admin,Admin_Standby,User,Data,Standby::67G5::67E5::SSDDrawerFW Update
PRODUCT_UPDATES::PFW_PROD::PFW::Management,Mgmt_Standby,Admin,Admin_Standby,User,Data,Standby::AL740_156::AL770_048::PFW Update
PRODUCT_UPDATES::PFW_PROD::PFW::Management,Mgmt_Standby,Admin,Admin_Standby,User,Data,Standby::AL770_098::AL770_048::PFW Update
PRODUCT_UPDATES::FCAdapterFW_PROD::FCAdapterFW::Management,Mgmt_Standby,Admin,Admin_Standby,User,Data,Standby::202307::0315050680,0315050680,0315050680,0315050680,::FCAdapterFW 5273 Update
PRODUCT_UPDATES::FCAdapterFW_PROD::FCAdapterFW::Management,Mgmt_Standby,Admin,Admin_Standby,User,Data,Standby::0320051000::0315050680,0315050680,0315050680,0315050680,::FCAdapterFW EN0Y Update
PRODUCT_UPDATES::NetAdapterFW_PROD::NetAdapterFW::Management,Mgmt_Standby,Admin,Admin_Standby,User,Data,Standby::0400401800007::0310303970033::NetAdapterFW 1648 Update
PRODUCT_UPDATES::GPFS_PROD::GPFS::Management,Mgmt_Standby,Admin,Admin_Standby,User,Data,Standby::4.1.0.6::3.4.0.14::GPFS
PRODUCT_UPDATES::CAS_PROD::CAS::Mgmt_Standby,Admin,Admin_Standby,User,Data,Standby::6.3.3.0::6.3.0.3::CAS Update
PRODUCT_UPDATES::StorageFW_PROD::StorageFW::Infrastructure::7.3.0.9::6.4.1.6::StorageFW Update
PRODUCT_UPDATES::SANFW_PROD::SANFW::Infrastructure::v7.2.1d::v7.0.2c::SANFW Update
PRODUCT_UPDATES::NetFW_PROD::NetFW::Infrastructure::7.9.12.0::7.7.3.0::NetFW Update
END::PRODUCT_UPDATES
Example appl_ls_cat command output after the preview phase:
NAME VERSION STATUS OPERATION DESCRIPTION
netfw1 7.9.12.0 Previewed manage NetFW Update
However, after the commit phase, both firmware versions are displayed. Example appl_ls_cat command output after the commit phase:
NAME VERSION STATUS OPERATION DESCRIPTION
netfw2 7.9.12.0 Committed manage NetFW Update
netfw3 7.9.11.0 Committed manage NetFW Update
|
Workaround:
-----------
A resolution to this restriction is not available at this time.
Fixed:
----------
V1.0.0.5
|
|
KI007486
Flash storage upgrade gets stalled during apply phase
|
Fixpack |
I_V1.1.0.1
I_V1.1.0.2
|
Flash storage upgrade gets stalled during apply phase
The flash storage upgrade stalls during the apply phase, which results in the apply failure.
Sample output of the failure from the PL log:

[06 Mar 2017 07:45:40,711] <4653948 CTRL DEBUG ibis01> apply: 172.23.1.182: waiting for upgrade to complete, iteration <1>: update_status=<upgrading 2>
[06 Mar 2017 07:50:41,284] <4653948 CTRL DEBUG ibis01> get_update_status: status is <upgrading 2
[06 Mar 2017 07:50:41,284] <4653948 CTRL DEBUG ibis01> >
[06 Mar 2017 07:50:41,286] <4653948 CTRL DEBUG ibis01> apply: 172.23.1.182: waiting for upgrade to complete, iteration <2>: update_status=<upgrading 2>
[06 Mar 2017 07:55:42,812] <4653948 CTRL DEBUG ibis01> get_update_status: status is <stalled 23
[06 Mar 2017 07:55:42,812] <4653948 CTRL DEBUG ibis01> >
[06 Mar 2017 07:55:42,813] <4653948 CTRL DEBUG ibis01> apply: 172.23.1.182: waiting for upgrade to complete, iteration <3>: update_status=<stalled 23>
[06 Mar 2017 08:00:42,814] <4653948 CTRL DEBUG ibis01> apply: 172.23.1.182: broke out of timed wait after 4 iterations of maximum 48. update status is <stalled 23>
[06 Mar 2017 08:00:42,815] <4653948 CTRL DEBUG ibis01> Extracted msg from NLS: apply: 172.23.1.182 Error: The update status of the end point is <stalled 23>.
[06 Mar 2017 08:00:42,815] <4653948 CTRL DEBUG ibis01> apply: 172.23.1.182: error: error state, update status is <stalled 23>
[06 Mar 2017 08:00:42,853] <7013026 CTRL DEBUG ibis01> apply: storage1: apply failed

Verify the status on the failed node by executing the lssoftwareupgradestatus command:

$ ssh admin@172.23.1.182
IBM_FlashSystem:ibisFlash_00:admin>lssoftwareupgradestatus
status percent_complete
stalled 23 |
Workaround:
-----------
1. Abort the upgrade by using the applysoftware -abort command. Wait until the status becomes inactive:

IBM_FlashSystem:ibisFlash_00:admin>applysoftware -abort
IBM_FlashSystem:ibisFlash_00:admin>lssoftwareupgradestatus
status percent_complete
downgrading 23
IBM_FlashSystem:ibisFlash_00:admin>lssoftwareupgradestatus
status percent_complete
downgrading 23
...
IBM_FlashSystem:ibisFlash_00:admin>lssoftwareupgradestatus
status percent_complete
downgrading 23
IBM_FlashSystem:ibisFlash_00:admin>lssoftwareupgradestatus
status percent_complete
inactive 0

2. Verify whether there are any events for internal errors ("Node warmstarted due to an internal error"). This is a known issue with flash storage; the event can be cleared by using cheventlog:

IBM_FlashSystem:ibisFlash_00:superuser>lseventlog
sequence_number last_timestamp object_type object_id object_name copy_id status fixed event_id error_code description secondary_object_type secondary_object_id
103 170209221539 node 2 node2 message no 980349 Node added
114 170301053034 drive 0 message no 988024 Flash module format complete
115 170301053034 drive 6 message no 988024 Flash module format complete
...
129 170306074033 cluster ibisFlash_00 message no 980506 Update prepared
130 170306075323 node 1 node1 alert no 074002 2030 Internal error canister 1
131 170306075401 enclosure 1 alert no 085048 2060 Reconditioning of batteries required
...
135 170306075411 cluster ibisFlash_00 message no 980509 Update stalled
136 170306075411 node 1 node1 alert no 009100 2010 Update process failed
IBM_FlashSystem:ibisFlash_00:superuser>

Here, event 130 reports an internal error on the node with error code 2030. Detailed listing of the event:

IBM_FlashSystem:ibisFlash_00:superuser>lseventlog 130
sequence_number 130
first_timestamp 170306075323
first_timestamp_epoch 1488815603
last_timestamp 170306075323
last_timestamp_epoch 1488815603
object_type node
object_id 1
object_name node1
copy_id
reporting_node_id
reporting_node_name
root_sequence_number
event_count 1
status alert
fixed no
auto_fixed no
notification_type error
event_id 074002
event_id_text Node warmstarted due to an internal error
error_code 2030
error_code_text Internal error
machine_type 9840AE2
serial_number 1351351
FRU None
fixed_timestamp
fixed_timestamp_epoch
callhome_type software
sense1 41 73 73 65 72 74 20 46 69 6C 65 20 2F 62 75 69
sense2 6C 64 2F 74 6D 73 2F 53 56 43 5F 4F 44 45 5F 52
sense3 32 2F 32 30 31 35 2D 30 33 2D 30 39 5F 31 32 2D
sense4 31 34 2D 30 34 2F 72 32 2F 73 72 63 2F 75 73 65
sense5 72 2F 64 72 76 2F 70 61 2F 70 6C 70 61 2E 63 20
sense6 4C 69 6E 65 20 31 35 39 33 00 00 00 00 00 00 00
sense7 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
sense8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
secondary_object_type canister
secondary_object_id 1
IBM_FlashSystem:ibisFlash_00:superuser>

The event_id_text shows "Node warmstarted due to an internal error". Clear the event on the node by using the cheventlog -fix command:

IBM_FlashSystem:ibisFlash_00:admin>cheventlog -fix 130
IBM_FlashSystem:ibisFlash_00:admin>lseventlog
sequence_number last_timestamp object_type object_id object_name copy_id status fixed event_id error_code description secondary_object_type secondary_object_id
103 170209221539 node 2 node2 message no 980349 Node added
114 170301053034 drive 0 message no 988024 Flash module format complete
115 170301053034 drive 6 message no 988024 Flash module format complete
117 170301053039 drive 2 message no 988024 Flash module format complete
118 170301053039 drive 7 message no 988024 Flash module format complete
119 170301053039 drive 8 message no 988024 Flash module format complete
120 170301053044 drive 1 message no 988024 Flash module format complete
121 170301053044 drive 3 message no 988024 Flash module format complete
122 170301053044 drive 4 message no 988024 Flash module format complete
123 170301053044 drive 5 message no 988024 Flash module format complete
124 170301053044 drive 9 message no 988024 Flash module format complete
129 170306074033 cluster ibisFlash_00 message no 980506 Update prepared
131 170306075401 enclosure 1 alert no 085048 2060 Reconditioning of batteries required
132 170306075401 enclosure 1 alert no 085048 2060 Reconditioning of batteries required
133 170306075406 enclosure 1 message no 988030 External data link degraded canister 2
134 170306075406 enclosure 1 message no 988030 External data link degraded canister 1
135 170306075411 cluster ibisFlash_00 message no 980509 Update stalled
137 170306223519 cluster ibisFlash_00 message no 980510 Update aborted
138 170306224949 node 2 node2 message no 980349 Node added
139 170306224949 cluster ibisFlash_00 message no 980508 Update Failed
IBM_FlashSystem:ibisFlash_00:admin>

3. Once the update is aborted and the event is cleared, resume the upgrade. A scripted version of the abort-and-poll sequence in step 1 follows.
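A minimal sketch, run from the management host, that automates the abort and status poll using only the commands shown above; the enclosure IP 172.23.1.182 is from this example and must be adjusted for your system.

ssh -n admin@172.23.1.182 'applysoftware -abort'
# Poll until the upgrade status returns to inactive (check every 60 seconds).
while :; do
  st=$(ssh -n admin@172.23.1.182 'lssoftwareupgradestatus' | tail -1)
  echo "$(date): ${st}"
  case "${st}" in inactive*) break ;; esac
  sleep 60
done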
Fixed:
----------
N/A
|
|
KI007488
Console does not allow resume after PFW update reboots management node
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
Console does not allow resume after PFW update reboots management node
When the non-management apply is running, the CEC reboots during the PFW update. This brings down the management node and stops the fix pack update.
|
Workaround:
-----------
To resume the update:
Verify that the console is started by using the mistatus command, run as root on the management host:

(0) root @ ibis01: 7.1.0.0: /
$ mistatus
CDTFS000063I The system console is started.
(0) root @ ibis01: 7.1.0.0: /
$

The command 'miupdate -resume' will not work. Instead:

1. Determine the 'bwr#' associated with the fixpack. Use the appl_ls_cat command, run as root on the management host:

$ appl_ls_cat
NAME VERSION STATUS    DESCRIPTION
bwr0 3.0.3.0 Committed Initial images for IBM PureData System for Operational Analytics
bwr1 4.0.5.0 Applied   Updates for IBM_PureData_System_for_Operational_Analytics

2. Substitute the 'bwr#' found above (in this example it is bwr1 at version 4.0.5.0) and run the appl_install_sw command as root on the management host:

echo "appl_install_sw -l bwr1 -resume > /tmp/appl_install_sw_$(date +"%Y%m%d_%H%M%S").out 2>&1" | at now

This runs the fixpack application outside of the session (these can be long-running commands susceptible to terminal loss). Tail the /tmp/appl_install_sw_<date>.out file to view the progress.

Fixed:
----------
V1.0.0.6/V1.1.0.2 The console has been removed in these fixpacks and the fixpack methodology has changed.
|
|
KI007489
Apply output shows failed even though there is no failure or error in the log
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
Apply output shows failed even though there is no failure or error in the log
The output excerpt from the mi command line or the console GUI:
=====================================================
Log file: Infrastructure Infrastructure SAN switch firmware: SANFW apply started
1 of 1 task completed
Log file: Infrastructure Infrastructure Network switch firmware: NetFW apply started
1 of 1 task completed
Log file: Infrastructure Infrastructure Storage firmware: StorageFW apply started
3 of 3 task completed
Log file: Infrastructure Infrastructure Storage firmware: StorageFW apply started
3 of 3 task completed
========================
"The operation failed during the apply stage. The resume phase for the release 'bwr1' failed.. Refer to the platform layer log file for details."
======================== |
Workaround:
-----------
Verify that the update completed by running the appl_ls_cat command.

In resume scenarios during the management update, the status should be M_Applied:

(0) root @ ibis01: 7.1.0.0: /BCU_share/aixappl/pflayer/log
$ appl_ls_cat
NAME VERSION STATUS    DESCRIPTION
bwr0 4.0.4.2 Committed Updates for IBM_PureData_System_for_Operational_Analytics
bwr1 4.0.5.0 M_Applied Updates for IBM_PureData_System_for_Operational_Analytics_DB2105
(0) root @ ibis01: 7.1.0.0: /BCU_share/aixappl/pflayer/log

In resume scenarios during the non-management/core update, the status should be Applied:

(0) root @ ibis01: 7.1.0.0: /BCU_share/aixappl/pflayer/log
$ appl_ls_cat
NAME VERSION STATUS    DESCRIPTION
bwr0 4.0.4.2 Committed Updates for IBM_PureData_System_for_Operational_Analytics
bwr1 4.0.5.0 Applied   Updates for IBM_PureData_System_for_Operational_Analytics_DB2105
(0) root @ ibis01: 7.1.0.0: /BCU_share/aixappl/pflayer/log

If the status shows either Applied or M_Applied, proceed to the next step and ignore the failure message. A scripted version of this check follows.
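As a quick scripted check, the status column for the release can be extracted directly; a minimal sketch assuming the release identifier is bwr1, as in the example output above.

# Print only the status column for release bwr1.
appl_ls_cat | awk '$1 == "bwr1" { print $3 }'
# Expected output: M_Applied (management update) or Applied (non-management/core update).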
Fixed:
----------
V1.0.0.6/V1.1.0.2 The console is removed as part of these fixpacks and the fixpack methodology has changed.
|
|
KI007423
ISW update failed during the MGMT update
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
ISW update failed during the MGMT update
During the PDOA V1.0 FP5 / V1.1 FP1 management apply phase, the fix pack may encounter an error. The pflayer log file may show the following error:
...
[10 Nov 2016 20:07:24,829] <8913318 UPDT DEBUG ibis01> Node: 172.23.1.1 Return: 256
[10 Nov 2016 20:07:24,844] <8913318 UPDT ERROR ibis01> TASK_END::10::5 of 6::ISW_APPLY::172.23.1.1:: ::RC=1::CDTFS000048E An error occurred while updating InfoSphere Data Warehouse.\n\nDetails:\nThe command \"/BCU_share/bwr1/software/ISW/isw/install.bin -DDS_HA_MODE=TRUE -i silent -f /BCU_share/update_105_tFFF.rsp -Dprofile=BCU_share/bwr1/software/ISW/PDS -Dlog=/tmp/isw_full.log\" failed with the error:\n\n\"\"\n\nUser Response:\nContact IBM Support for assistance. |
Workaround:
-----------
There is a known issue with the ISW installer returning a status of 256 to the caller of the install.bin command line even though the installation was a success. To verify:

1. Log in to the management node as root in an ssh session.
2. Look for one of the following directories: /usr/IBM/dwe/appserver_001/iswapp_10.5/logs or /usr/IBM/dwe/appserver_001/iswapp_10/logs
3. Look for a file called 'ISWinstall_summary_<date>.log' with a recent date.
4. Run the following:

grep -i status logs/ISWinstall_summary_1701121953.log

This should return a large number of lines with 'Status: SUCCESSFUL'. If this is the case, it is safe to resume the fix pack because the update was successful. A scripted version of this check follows.
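A minimal sketch of the verification in steps 2 through 4: it locates the newest summary log in whichever log directory exists and counts the 'Status: SUCCESSFUL' lines. The directory paths are the ones listed in step 2.

for dir in /usr/IBM/dwe/appserver_001/iswapp_10.5/logs /usr/IBM/dwe/appserver_001/iswapp_10/logs; do
  [ -d "$dir" ] || continue
  # Newest ISWinstall_summary log in this directory.
  log=$(ls -t "$dir"/ISWinstall_summary_*.log 2>/dev/null | head -1)
  [ -n "$log" ] && echo "$log: $(grep -ci 'Status: SUCCESSFUL' "$log") SUCCESSFUL lines"
done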
Fixed:
----------
V1.0.0.6/V1.1.0.2 The component that contains ISW is known as Warehouse Tools, and this component is removed in these fixpacks.
|
|
KI007457
OPM update failed due to a DBI connect issue in wait_for_start.pl
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
OPM update failed due to a DBI connect issue in wait_for_start.pl
During the management apply phase, the apply may fail with the following symptoms in the logs:
RC=1::Can't locate DBI.pm in @INC (@INC contains: /usr/opt/perl5/lib/5.10.1/aix-thread-multi /usr/opt/perl5/lib/5.10.1 /usr/opt/perl5/lib/site_perl/5.10.1/aix-thread-multi /usr/opt/perl5/lib/site_perl/5.10.1 /usr/opt/perl5/lib/site_perl .) at /BCU_share/bwr1/code/ISAS/Update/Common/OPM/scripts/wait_for_start.pl line 149.\n

or

DBI connect('OPMDB','',...) failed: [IBM][CLI Driver] SQL1031N The database directory cannot be found on the indicated file system. SQLSTATE=58031 at /BCU_share/bwr1/code/ISAS/Update/Common/OPM/scripts/wait_for_start.pl line 155

and hals shows that the DPM components are failed over to the standby management host. |
Workaround:
-----------
These messages indicate that during the management apply phase, the DB2 Performance Monitor (DPM) component failed over to the management standby node. There are some known issues with DPM on startup that can lead to failures. If the above symptoms are seen, the next steps are as follows; a command sketch follows this list:
1. Use hals to determine whether the DPM resources are indeed failed over.
2. Use lssam on the management host to determine whether there are any failed states.
3. Use 'resetrsrc' on any DPM resources that are in a failed state.
4. Verify with lssam that the resources are no longer in a failed state.
5. Use 'hafailover <managementstandby> DPM' to move the DPM resources back to the management host.
6. Verify that the DPM resources successfully moved to the management host.
7. Resume the fix pack.
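A minimal sketch of the sequence, run as root on the management host. The resource name in the resetrsrc selection string is hypothetical and must be replaced with the failed resource reported by lssam.

hals                                # check where the DPM resources are active
lssam | grep -i failed              # look for resources in a Failed state
# Reset a failed DPM resource (the Name value below is hypothetical).
resetrsrc -s 'Name == "dpm_server"' IBM.Application
lssam | grep -i failed              # confirm nothing is still failed
hafailover <managementstandby> DPM  # move DPM back to the management host
hals                                # confirm the move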
Fixed:
----------
V1.0.0.6/V1.1.0.2 The update mechanism has changed in these fixpacks.
|
|
KI007495
Storage update failed during apply phase because of drive update failure
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
Storage update failed during apply phase because of drive update failure
Storage update fails during the apply phase because of drive update failures; one or more storage drives might be in the offline state.
Console output during the failure:

========================
The operation failed during the apply stage. Storage update failed on 172.23.1.186. Refer to the platform layer log file for details.
========================

Sample output of the failure from the PL log:

[08 Mar 2017 04:13:43,315] <28704888 CTRL DEBUG finch01> Extracted msg from NLS: apply: 172.23.1.186 ssh admin@172.23.1.186 ssh admin@172.23.1.186 LANG=en_US svctask applydrivesoftware -file IBM2076_DRIVE_20160923 -type firmware -drive 0:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47 command failed.
[08 Mar 2017 04:13:43,315] <28704888 CTRL DEBUG finch01> apply: 172.23.1.186: error: ssh admin@172.23.1.186 ssh admin@172.23.1.186 LANG=en_US svctask applydrivesoftware -file IBM2076_DRIVE_20160923 -type firmware -drive 0:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47 command failed , rc=127

By running the lsdrive command, we can verify the drive statuses on the failed storage box. As an example:

$ ssh superuser@172.23.1.186 "lsdrive"
id status error_sequence_number use tech_type capacity mdisk_id mdisk_name member_id enclosure_id slot_id node_id node_name auto_manage
0 online member sas_hdd 837.9GB 0 ARRAY3 11 2 1 inactive
1 online member sas_hdd 837.9GB 0 ARRAY3 10 1 1 inactive
2 online member sas_hdd 837.9GB 0 ARRAY3 9 2 2 inactive
3 online member sas_hdd 837.9GB 0 ARRAY3 8 1 10 inactive
4 online member sas_hdd 837.9GB 0 ARRAY3 7 1 2 inactive
5 offline 273 failed sas_hdd 837.9GB 2 10 inactive
6 online member sas_hdd 837.9GB 0 ARRAY3 5 2 9 inactive
7 online member sas_hdd 837.9GB 0 ARRAY3 4 1 9 inactive
8 online member sas_hdd 837.9GB 0 ARRAY3 3 2 11 inactive
9 online member sas_hdd 837.9GB 0 ARRAY3 2 2 8 inactive
10 online member sas_hdd 837.9GB 0 ARRAY3 1 1 8 inactive
11 online member sas_hdd 837.9GB 0 ARRAY3 0 1 11 inactive
12 online member sas_hdd 837.9GB 1 ARRAY4 11 2 7 inactive
13 offline 268 spare sas_hdd 837.9GB 1 7 inactive
14 online member sas_hdd 837.9GB 1 ARRAY4 9 2 6 inactive
15 online member sas_hdd 837.9GB 1 ARRAY4 8 1 6 inactive
16 online member sas_hdd 837.9GB 1 ARRAY4 7 1 5 inactive
17 online member sas_hdd 837.9GB 1 ARRAY4 6 1 12 inactive

Here we see that two drives (drive 5 and drive 13) are in the offline state, which is the reason for the failure. |
Workaround:
-----------
Run the lsdrive command to see the list of failed drives, and then do the following:
1. Fix the drives.
2. Ensure the statuses of the drives are all online.
3. Resume the fix pack update.
A scripted check for drives that are not online follows.
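A minimal sketch that lists only the drives that are not online, based on the lsdrive output format shown above (column 2 is the drive status); the enclosure IP is from this example and must be adjusted.

ssh -n superuser@172.23.1.186 'lsdrive -nohdr' | awk '$2 != "online" { print }'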
Fixed:
----------
V1.0.0.6/V1.1.0.2 The fixpack mechanism has changed in these fixpacks; however, drive failures can lead to similar symptoms with similar workarounds.
|
|
KI007496
CECs are rebooted and /BCU_share is unmounted after a power firmware update
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
CECs are rebooted and /BCU_share is unmounted after a power firmware update
The upgrade of the CEC that hosts the management and admin nodes completes, the CEC is rebooted, and the run halts because of this reboot.
When the CEC comes back online and the run is resumed, /BCU_share is mounted on the management and admin nodes. Subsequently, the other CECs are upgraded and rebooted. But when the respective nodes come back up, /BCU_share is not mounted again, and the upgrade proceeds to the point where it tries to access /BCU_share and fails.

Symptoms:

1. The output excerpt from the log:

=====================================================
[16 Nov 2016 08:32:35,041] <2884066 ADPT ERROR ibis01> Failed to unpack the adapter firmware file /BCU_share/bwr1/firmware/fc_adapter/df1000f114100104/image/df1000f114100104. 203305.aix.rpm on 172.23.1.4.
=====================================================

2. The /BCU_share NFS mount shared from the management host is not mounted on all hosts. |
Workaround:
-----------
After the failure is identified, simply resume the fix pack. The resume code verifies that /BCU_share is mounted across the hosts.
Fixed:
----------
V1.0.0.6/V1.1.0.2 The fixpack mechanism has completely changed, rendering this KI moot.
|
|
KI007479
miinfo compliance command shows compliance issue for some of the products
|
General |
I_V1.0.0.5
I_V1.1.0.1
|
miinfo compliance command shows compliance issue for some of the products
Running the miinfo compliance command ('miinfo -d -c') shows that some levels are not correct.
You may see the following under some servers:

IBM Systems Director Common Agent 6.3.3.1 Higher
IBM InfoSphere Warehouse 10.5.0.20151104_10.5.0.8..0 Lower
IBM InfoSphere Optim Query Workload Tuner The version of the product cannot be determined or it is not installed. NA |
Workaround:
-----------
1. IBM Systems Director Common Agent 6.3.3.1 Higher
The common agent should no longer be tracked by the compliance software. This is a defect in the compliance check program and will not impact the operation of the appliance.

2. IBM InfoSphere Warehouse 10.5.0.20151104_10.5.0.8..0 Lower
The InfoSphere Warehouse software compliance check uses 20151117 instead of 20151104. This is a defect in the compliance check code and will not impact the operation of the appliance.

3. IBM InfoSphere Optim Query Workload Tuner The version of the product cannot be determined or it is not installed. NA
This is normal for nodes that are currently running as standby hosts. The compliance checker has a limitation in that it cannot check the level when a core host is currently a designated standby host.

Fixed:
----------
V1.0.0.6/V1.1.0.2
MI* commands are no longer used.
Warehouse Tools is removed as part of the fixpack.
|
|
KI007503
FP5 cannot be directly applied to FP3 without some additional fixes and modifications.
|
Fixpack | I_V1.0.0.5 |
FP5 cannot be directly applied to FP3 without some additional fixes and modifications.
The IBM PureData System for Operational Analytics V1.0 FP5 package cannot be directly applied to FP3 environments. There are three distinct issues with FP5 on FP3 environments.
1. It is possible to register the fixpack; however, after registration the console will no longer start. During the preview step, messages similar to the following appear in the log.
2. It is possible to apply the fixpack via the command line; however, the fixpack will fail validation because the firmware levels on the V7000 are too low for FP5 to update.
3. It is possible to apply the fixpack via the command line; however, the fixpack will fail validation because the firmware levels on the SAN switches are too low for FP5 to update. |
Workaround:
-----------
See the document 'How to apply the IBM PureData System for Operational Analytics V1.0 FP5 on a FP3 environment?' for more information about the FP3 to FP5 scenario.
Fixed:
----------
Only applies to V1.0.0.3 to V1.0.0.5 scenarios.
|
|
KI007523
Failed paths after the fixpack apply stage
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
Failed paths after the fixpack apply stage
The AIX hosts may have failed paths to the external storage. Run the following command as root on the management host:

dsh -n $ALL "lspath | grep hdisk | grep -v Enabled | wc -l" | dshbak -c

The command returns output if there are failed paths to the storage. |
Workaround:
-----------
There are two remedy options:

1. Reboot the host with the failed paths. This will effectively bounce the port. This may require an outage.

2. Follow the instructions below to bounce only the port. All access to the storage is fully redundant with multiple paths, which is why the system can start even with failed paths. This method avoids an outage and effectively bounces the port. The following should be done one port at a time, and should be performed either in an outage window or at a time when the system will have very low I/O activity.

a. For each host login, determine the ports with failed paths using the following command. It returns the unique set of ports connected to hdisk devices that are Missing or Failed:

lspath | grep hdisk | grep -v Enabled | while read stat disk dev rest;do echo "${dev}";done | sort | uniq

Example output: fscsi10

b. For each port, create a script and update the 'export id=1' to match the # in the fscsi# id of the failed path. This script will remove all paths to that port, set the port to the defined state, and then rediscover the paths. This effectively bounces the port.

c. Change the id to match the fscsi<id> number. Run each script to remove the paths and to put the device in the defined state, then use cfgmgr to reinitialize. This should create all of the new paths. Run these scripts one at a time and then verify that the path no longer appears in the command shown in step a:

export id=1
lspath -p fscsi${id} | while read st hd fs;do echo $hd;done | sort | uniq | while read disk;do rmpath -d -l ${disk} -p fscsi${id};done
rmdev -l sfwcomm${id};rmdev -l fscsi${id};rmdev -l fcs${id}
cfgmgr -s

d. After the commands run, verify that there are no more failed paths over the port and that the port has discovered the existing paths. Bouncing the port in this way preserves any settings stored in ODM for the fcs and fscsi devices.

A parameterized form of the step c commands follows.
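The commands in step c can be wrapped in a small function so that the port number is passed as a parameter; a minimal sketch using only the documented commands (the function name bounce_port is hypothetical):

# Hypothetical helper wrapping the documented port-bounce sequence.
# Usage: bounce_port 10   (bounces fscsi10/fcs10)
bounce_port () {
  id=$1
  # Remove all paths over this port.
  lspath -p fscsi${id} | while read st hd fs;do echo $hd;done | sort | uniq | while read disk;do rmpath -d -l ${disk} -p fscsi${id};done
  # Put the devices in the defined state.
  rmdev -l sfwcomm${id};rmdev -l fscsi${id};rmdev -l fcs${id}
  # Rediscover the paths.
  cfgmgr -s
}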
Fixed:
----------
V1.0.0.6/V1.1.0.2 These fixpacks use a different mechanism, which has not been shown to have this issue.
|
|
KI007537
DB2 and/or ISW preview failures due to incorrect fixpack or incomplete DB2 10.5 upgrade.
|
Fixpack |
I_V1.0.0.5
|
DB2 and/or ISW preview failures due to incorrect fixpack or incomplete DB2 10.5 upgrade.
I downloaded and registered the fixpack, however I'm receiving preview errors related to the DB2 and / or InfoSphere Warehouse (ISW) levels.
There are two different Fix Central downloads for IBM PureData System for Operational Analytics V1.0 fixpack 5:

- IBM PureData System for Operational Analytics Fix Pack 5 (for systems with DB2 Version 10.1)
- IBM PureData System for Operational Analytics Fix Pack 5 (for systems with DB2 Version 10.5)

There are a couple of scenarios where problems arise:

1. The customer has DB2 V10.1 and downloads and registers the fixpack with DB2 10.5.
2. The customer has uplifted or upgraded DB2 to 10.5 by following the technote Upgrading an IBM PureData System for Operational Analytics Version 1.0 environment to DB2 10.5 and downloads the fixpack with DB2 10.1.
3. The customer has only partially followed the instructions in that technote to uplift or upgrade DB2 to 10.5, downloads the fixpack with DB2 10.5, and encounters preview errors related to the ISW level being at the incorrect version level. |
Workaround:
-----------
1. The customer has DB2 V10.1 and downloads and registers the fixpack with DB2 10.5.
2. The customer has uplifted or upgraded DB2 to 10.5 by following the technote Upgrading an IBM PureData System for Operational Analytics Version 1.0 environment to DB2 10.5 and downloads the fixpack with DB2 10.1.

For scenarios 1 and 2: Contact IBM Support for help to de-register the incorrect fixpack. Download the fixpack with the correct DB2 levels. Follow the fixpack instructions as usual.

3. The customer has only partially followed the instructions in that technote to uplift or upgrade DB2 to 10.5, downloads the fixpack with DB2 10.5, and encounters preview errors related to the ISW level being at the incorrect version level.

This scenario is most likely due to the technote not being fully followed. This can happen because of confusion about the relationship between InfoSphere Warehouse and DB2. Most customers understand how to upgrade DB2, and it is easy to miss that it is important to update the InfoSphere Warehouse product as well. The fixpack catalog does not at present support mixing DB2 10.5 and InfoSphere Warehouse 10.1 together. The customer will need to revisit the Upgrading an IBM PureData System for Operational Analytics Version 1.0 environment to DB2 10.5 technote to verify that all of the update steps were followed, that the InfoSphere Warehouse levels are at 10.5, and that the WebSphere Application Server levels are at 8.5.5.x as required by the technote. Once the levels are updated per the technote, the fixpack can be resumed and the preview should no longer fail.

Fixed:
----------
V1.0.0.6/V1.1.0.2 This known issue is no longer applicable.
|
|
KI007538
The FixCentral download inadvertently includes XML files that are not part of the fixpack.
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
The FixCentral download inadvertently includes XML files that are not part of the fixpack.
I downloaded all of the files included in the Fix Central package for the fixpack and there are extra XML files included. What are they for?
The XML files match the following patterns:
*.fo.xml
*SG*.xml |
Workaround:
-----------
These files were inadvertently included in the fixpack packages; they do not need to be downloaded and can safely be deleted.
Fixed:
----------
V1.0.0.6/V1.1.0.2
|
|
KI007549
Storage update failed during the apply phase because of an 'update is already in progress' message. [ Added 2017-09-18 ]
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
Storage update failed during the apply phase because of an 'update is already in progress' message. [ Added 2017-09-18 ]
During the fixpack apply phase, the fixpack fails.
----------------------------------------------------------------------------------------
/BCU_share/applmgmt/pflayer/log/pl_update.log:
--> Not much in this log except for the failure message.
----------------------------------------------------------------------------------------
[16 Sep 2017 21:14:05,518] <8126944 UPDT APPI DEBUG mgmthost> STORAGE:storage0:172.23.1.181:1:Storage firmware update failed.
----------------------------------------------------------------------------------------
/BCU_share/applmgmt/pflayer/log/pl_update.trace:
--> This excerpt shows an attempt to update the drive firmware failing. The critical message is this one:
"CMMVC6055E The action failed as an update is in progress.\n"],<"PLLogger=HASH(0x2000f558)"
----------------------------------------------------------------------------------------
[16 Sep 2017 21:11:38,575] <5832914 CTRL TRACE mgmthost> sleep time is not configured, defaults will be applied
[16 Sep 2017 21:13:38,576] <5832914 CTRL DEBUG mgmthost> apply: 172.23.1.181: now installing drive updates...
[16 Sep 2017 21:13:38,577] <5832914 CTRL DEBUG mgmthost> drive_id: 0:1:2:3:4:6:7:8:9:10:11:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:29:30:31:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:51:52:53:54:55:56:58:59:60:61:62:64:65:66:67:68:69:70:71
[16 Sep 2017 21:13:38,577] <5832914 CTRL DEBUG mgmthost> Number of drive id's is less than 128
[16 Sep 2017 21:13:38,578] <5832914 CTRL DEBUG mgmthost> Drive update command execution cnt : 0.
[16 Sep 2017 21:13:43,150] <5832914 CTRL TRACE mgmthost> command: ssh admin@172.23.1.181 LANG=en_US svctask applydrivesoftware -file IBM2076_DRIVE_20160923 -type firmware -drive 0:1:2:3:4:6:7:8:9:10:11:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:29:30:31:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:51:52:53:54:55:56:58:59:60:61:62:64:65:66:67:68:69:70:71
[16 Sep 2017 21:13:43,151] <5832914 CTRL TRACE mgmthost> CMMVC6055E The action failed as an update is in progress.
[16 Sep 2017 21:13:43,151] <5832914 CTRL TRACE mgmthost> Rc = 1
[16 Sep 2017 21:13:43,152] <5832914 CTRL DEBUG mgmthost> Extracted msg from NLS: apply: 172.23.1.181 ssh admin@172.23.1.181 LANG=en_US svctask applydrivesoftware -file IBM2076_DRIVE_20160923 -type firmware -drive 0:1:2:3:4:6:7:8:9:10:11:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:29:30:31:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:51:52:53:54:55:56:58:59:60:61:62:64:65:66:67:68:69:70:71 command failed.
[16 Sep 2017 21:13:43,153] <5832914 CTRL DEBUG mgmthost> apply: 172.23.1.181: error: ssh admin@172.23.1.181 LANG=en_US svctask applydrivesoftware -file IBM2076_DRIVE_20160923 -type firmware -drive 0:1:2:3:4:6:7:8:9:10:11:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:29:30:31:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:51:52:53:54:55:56:58:59:60:61:62:64:65:66:67:68:69:70:71 command failed , rc=1
[16 Sep 2017 21:13:43,153] <5832914 CTRL TRACE mgmthost> { Entering Ctrl::Updates::Storage::search_token (Called from /opt/ibm/aixappl/pflayer/lib/Ctrl/Updates/Storage.pm line 1127)
[16 Sep 2017 21:13:43,154] <5832914 CTRL TRACE mgmthost> Args:[["CMMVC8325E","None of the specified drives needed to be upgraded or downgraded"],[],<"PLLogger=HASH(0x2000f558)">]
[16 Sep 2017 21:13:43,155] <5832914 CTRL DEBUG mgmthost> Not able to find CMMVC8325E None of the specified drives needed to be upgraded or downgraded in the output, an unexpected error occured
[16 Sep 2017 21:13:43,155] <5832914 CTRL TRACE mgmthost> Return: 0
[16 Sep 2017 21:13:43,156] <5832914 CTRL TRACE mgmthost> Exiting Ctrl::Updates::Storage::search_token }
[16 Sep 2017 21:13:43,156] <5832914 CTRL TRACE mgmthost> { Entering Ctrl::Updates::Storage::search_token (Called from /opt/ibm/aixappl/pflayer/lib/Ctrl/Updates/Storage.pm line 1128)
[16 Sep 2017 21:13:43,157] <5832914 CTRL TRACE mgmthost> Args:[["CMMVC8325E","None of the specified drives needed to be upgraded or downgraded"],["CMMVC6055E The action failed as an update is in progress.\n"],<"PLLogger=HASH(0x2000f558)">]
[16 Sep 2017 21:13:43,157] <5832914 CTRL DEBUG mgmthost> Not able to find CMMVC8325E None of the specified drives needed to be upgraded or downgraded in the output, an unexpected error occured
[16 Sep 2017 21:13:43,158] <5832914 CTRL TRACE mgmthost> Return: 0
[16 Sep 2017 21:13:43,158] <5832914 CTRL TRACE mgmthost> Exiting Ctrl::Updates::Storage::search_token }
[16 Sep 2017 21:13:43,158] <5832914 CTRL TRACE mgmthost> { Entering Ctrl::Updates::Storage::search_token (Called from /opt/ibm/aixappl/pflayer/lib/Ctrl/Updates/Storage.pm line 1138)
[16 Sep 2017 21:13:43,159] <5832914 CTRL TRACE mgmthost> Args:[["CMMVC6546E","The current drive status is degraded"],[],<"PLLogger=HASH(0x2000f558)">]
[16 Sep 2017 21:13:43,159] <5832914 CTRL DEBUG mgmthost> Not able to find CMMVC6546E The current drive status is degraded in the output, an unexpected error occured
[16 Sep 2017 21:13:43,160] <5832914 CTRL TRACE mgmthost> Return: 0
[16 Sep 2017 21:13:43,160] <5832914 CTRL TRACE mgmthost> Exiting Ctrl::Updates::Storage::search_token }
[16 Sep 2017 21:13:43,160] <5832914 CTRL TRACE mgmthost> { Entering Ctrl::Updates::Storage::search_token (Called from /opt/ibm/aixappl/pflayer/lib/Ctrl/Updates/Storage.pm line 1139)
[16 Sep 2017 21:13:43,161] <5832914 CTRL TRACE mgmthost> Args:[["CMMVC6546E","The current drive status is degraded"],["CMMVC6055E The action failed as an update is in progress.\n"],<"PLLogger=HASH(0x2000f558)">]
[16 Sep 2017 21:13:43,161] <5832914 CTRL DEBUG mgmthost> Not able to find CMMVC6546E The current drive status is degraded in the output, an unexpected error occured
[16 Sep 2017 21:13:43,162] <5832914 CTRL TRACE mgmthost> Return: 0
[16 Sep 2017 21:13:43,162] <5832914 CTRL TRACE mgmthost> Exiting Ctrl::Updates::Storage::search_token }
[16 Sep 2017 21:13:43,163] <5832914 CTRL DEBUG mgmthost> Function search_token failed, exiting the loop.
[16 Sep 2017 21:13:43,163] <5832914 CTRL DEBUG mgmthost> Drive update got failed on 172.23.1.181 storage.
[16 Sep 2017 21:13:43,203] <6029570 CTRL DEBUG mgmthost> apply: storage0: apply failed
[16 Sep 2017 21:13:43,204] <6029570 CTRL TRACE mgmthost> For message id::1021
[16 Sep 2017 21:13:58,212] <6029570 CTRL TRACE mgmthost> { Entering Ctrl::Query::Status::read_status_n_details (Called from /opt/ibm/aixappl/pflayer/lib/Ctrl/Updates/Storage.pm line 1210)
[16 Sep 2017 21:13:58,213] <6029570 CTRL TRACE mgmthost> Args:["172.23.1.181 storage0 storage 0 NA"]
[16 Sep 2017 21:13:58,213] <6029570 CTRL TRACE mgmthost> { Entering Ctrl::Util::util_details (Called from /opt/ibm/aixappl/pflayer/lib/Ctrl/Query/Status.pm line 45)
[16 Sep 2017 21:13:58,214] <6029570 CTRL TRACE mgmthost> Args:["172.23.1.181","storage",0]
....
[16 Sep 2017 21:14:05,520] <8126944 UPDT APPI DEBUG mgmthost> Command Status:Details:
[16 Sep 2017 21:14:05,520] <8126944 UPDT APPI DEBUG mgmthost> Status: Online
[16 Sep 2017 21:14:05,520] <8126944 UPDT APPI DEBUG mgmthost> AccessState: Unlocked
[16 Sep 2017 21:14:05,520] <8126944 UPDT APPI DEBUG mgmthost> Model: 124
[16 Sep 2017 21:14:05,520] <8126944 UPDT APPI DEBUG mgmthost> IPv4Address: ["172.23.1.181"]
[16 Sep 2017 21:14:05,520] <8126944 UPDT APPI DEBUG mgmthost> Manufacturer: IBM
...
[16 Sep 2017 21:14:05,520] <8126944 UPDT APPI DEBUG mgmthost> FWBuild: 115.54.1610251759000
[16 Sep 2017 21:14:05,520] <8126944 UPDT APPI DEBUG mgmthost> PLLogicalName: storage0
[16 Sep 2017 21:14:05,520] <8126944 UPDT APPI DEBUG mgmthost> MachineType: 2076
[16 Sep 2017 21:14:05,520] <8126944 UPDT APPI DEBUG mgmthost> HostName: V7_00_1
[16 Sep 2017 21:14:05,521] <8126944 UPDT APPI DEBUG mgmthost> FWLevel: 7.5.0.11
[16 Sep 2017 21:14:05,521] <8126944 UPDT APPI DEBUG mgmthost> Description: IBM Storwize V7000 Storage
[16 Sep 2017 21:14:05,521] <8126944 UPDT APPI DEBUG mgmthost> STORAGE:storage0:172.23.1.181:1:Storage firmware update failed.
[16 Sep 2017 21:14:05,521] <8126944 UPDT APPI DEBUG mgmthost> , Command Status->1
[16 Sep 2017 21:14:05,522] <8126944 UPDT ERROR mgmthost> TASK_END::13::1 of 1::StorageUPD::172.23.1.181::::RC=1::Storage update failed on 172.23.1.181
[16 Sep 2017 21:14:05,523] <8126944 UPDT APPI TRACE mgmthost> Return: 0
...
[16 Sep 2017 21:14:05,831] <8126944 UPDT APPI ERROR mgmthost> Error on nodes (172.23.1.181).
[16 Sep 2017 21:14:05,848] <8126944 UPDT APPI INFO mgmthost> STEP_END::13::StorageFW_UPD::FAILED
...
[16 Sep 2017 21:14:06,220] <8126944 UPDT APPI TRACE mgmthost> Exiting /opt/ibm/aixappl/pflayer/lib/ManageCatalogStatus.pm => ODM::change_record }
[16 Sep 2017 21:14:06,224] <8126944 UPDT APPI TRACE mgmthost> Exiting ManageCatalogStatus::update_status }
[16 Sep 2017 21:14:06,225] <8126944 UPDT APPI INFO mgmthost> PHASE_END APPLY IMPACT
[16 Sep 2017 21:14:06,225] <8126944 UPDT APPI TRACE mgmthost> For message id::640
[16 Sep 2017 21:14:06,227] <8126944 UPDT APPI ERROR mgmthost> The apply phase for the release 'bwr5' failed.
[16 Sep 2017 21:14:06,228] <8126944 UPDT RESU INFO mgmthost> PHASE_END RESUME
[16 Sep 2017 21:14:06,228] <8126944 UPDT RESU TRACE mgmthost> For message id::640
[16 Sep 2017 21:14:06,229] <8126944 UPDT RESU ERROR mgmthost> The resume phase for the release 'bwr5' failed.
----------------------------------------------------------------------------------------
lseventlog message
--> This message shows the last entry with the 2050 error code.
----------------------------------------------------------------------------------------
131 170916183340 cluster V7_00_1 alert no 009198 2050 System update completion required
----------------------------------------------------------------------------------------
V7000 lsupdate status
--> Indicates a status of system_completion_required
----------------------------------------------------------------------------------------
ssh -n superuser@172.23.1.181 'lsupdate'
status system_completion_required
event_sequence_number 131
progress
estimated_completion_time
suggested_action complete
system_new_code_level
system_forced no
system_next_node_status none
system_next_node_time
system_next_node_id
system_next_node_name

This is a documented issue when upgrading V7000 firmware from 7.3.x to 7.4.x, as indicated in the 7.4.0 release notes: https://public.dhe.ibm.com/storage/san/sanvc/release_notes/740_releasenotes.html |
Workaround:
-----------
Using the PureData System for Operational Analytics console, use the Service Level Access page to find the link to access the Management Interface for each of the V7000s.
Navigate to the Events page, which should show an alert. Select this alert and follow the fix procedures to initiate the second phase. Do this for each of the V7000s that has this issue. This step takes approximately 40 minutes per enclosure and can be run in parallel.

After the second phase is completed, you should see the message 'System update completion finished' from lseventlog:

lseventlog -fixed yes
130 170916183340 cluster V7_00_1 message no 980507 Update completed
131 170916183340 cluster V7_00_1 alert yes 009198 2050 System update completion required
132 170917022703 cluster V7_00_1 message no 980511 System update completion started
133 170917022713 node 3 node2 message no 980513 Node restarted for system update completion
134 170917022713 io_grp 0 io_grp0 message no 981102 SAS discovery occurred, configuration changes pending
135 170917022729 io_grp 0 io_grp0 message no 981103 SAS discovery occurred, configuration changes complete
136 170917022818 node 3 node2 message no 980349 Node added
137 170917022818 io_grp 0 io_grp0 message no 981102 SAS discovery occurred, configuration changes pending
138 170917022828 io_grp 0 io_grp0 message no 981103 SAS discovery occurred, configuration changes complete
139 170917025828 node 1 node1 message no 980513 Node restarted for system update completion
140 170917025828 io_grp 0 io_grp0 message no 981102 SAS discovery occurred, configuration changes pending
141 170917025828 io_grp 0 io_grp0 message no 981103 SAS discovery occurred, configuration changes complete
142 170917025939 node 1 node1 message no 980349 Node added
143 170917025941 io_grp 0 io_grp0 message no 981102 SAS discovery occurred, configuration changes pending
144 170917025941 cluster V7_00_1 message no 980512 System update completion finished
145 170917025946 io_grp 0 io_grp0 message no 981103 SAS discovery occurred, configuration changes complete

lsupdate on that host should show 'status' = success:

ssh -n superuser@172.23.1.181 'lsupdate'
status success
event_sequence_number
progress
estimated_completion_time
suggested_action start
system_new_code_level
system_forced no
system_next_node_status none
system_next_node_time
system_next_node_id
system_next_node_name

Once this is completed for all V7000s, resume the apply phase. A scripted poll of the lsupdate status follows.
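A minimal sketch that polls lsupdate on each enclosure until the status becomes success; the IP address is from the example above, and the enclosure list must be adjusted for your system.

# Poll each V7000 until its update status reports success (checks every 5 minutes).
for ip in 172.23.1.181; do
  while :; do
    st=$(ssh -n superuser@${ip} 'lsupdate' | awk '$1 == "status" { print $2 }')
    echo "${ip}: ${st}"
    [ "${st}" = "success" ] && break
    sleep 300
  done
done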
Fixed:
----------
V1.0.0.6/V1.1.0.2
|
|
KI007421
HMC fw update fails in getupgfiles step. [ Added 2017-10-07 ]
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
HMC fw update fails in getupgfiles step. [ Added 2017-10-07 ]
The fixpack fails with the following message in the pl_update.log file.
[07 Oct 2017 14:51:11,167] <3933062 CTRL DEBUG host01> iso file validation failed/not-applicable
[07 Oct 2017 14:51:11,168] <3933062 CTRL DEBUG host01> Updates failed
[07 Oct 2017 14:51:11,188] <3474144 UPDT ERROR host01> TASK_END::2::1 of 1::HMCUPD::172.23.1.246::::RC=1::Update failed for HMC
[07 Oct 2017 14:51:11,285] <3474144 UPDT APPI DEBUG host01> Executing query Logical_name=Management AND Solution_version=4.0.5.0, to update status of Product
[07 Oct 2017 14:51:11,340] <3474144 UPDT APPI DEBUG host01> Executing query Sub_module_type=Management AND Solution_version=4.0.5.0, to update status of sub module
[07 Oct 2017 14:51:11,570] <3474144 UPDT APPI DEBUG host01> Executing query Logical_name=Management AND Solution_version=4.0.5.0, to update status of Product
[07 Oct 2017 14:51:11,616] <3474144 UPDT APPI DEBUG host01> Executing query Sub_module_type=Management AND Solution_version=4.0.5.0, to update status of sub module
[07 Oct 2017 14:51:11,735] <3474144 UPDT APPI ERROR host01> Error on nodes (172.23.1.246 172.23.1.245).
[07 Oct 2017 14:51:11,752] <3474144 UPDT APPI INFO host01> STEP_END::2::HMC_UPD::FAILED
[07 Oct 2017 14:51:11,756] <3474144 UPDT APPI DEBUG host01> Error occured in apply for product hmc1
[07 Oct 2017 14:51:11,806] <3474144 UPDT APPI DEBUG host01> Executing query Logical_name=hmc1 AND Solution_version=4.0.5.0, to update status of Product
[07 Oct 2017 14:51:11,925] <3474144 UPDT APPI ERROR host01> Apply (impact) phase for management module has failed.
[07 Oct 2017 14:51:12,011] <3474144 UPDT APPI DEBUG host01> Executing query Logical_name=bwr1 AND Solution_version=4.0.5.0, to update status of Solution
[07 Oct 2017 14:51:12,129] <3474144 UPDT APPI INFO host01> PHASE_END APPLY IMPACT
[07 Oct 2017 14:51:12,131] <3474144 UPDT APPI ERROR host01> The apply phase for the release 'bwr1' failed.
[07 Oct 2017 14:51:12,132] <3474144 UPDT RESU INFO host01> PHASE_END RESUME
[07 Oct 2017 14:51:12,134] <3474144 UPDT RESU ERROR host01> The resume phase for the release 'bwr1' failed.

Looking earlier in the log we see the following message:

[07 Oct 2017 14:51:07,582] <3998084 CTRL DEBUG host01> Last login: Sat Oct 7 14:41:30 2017 from 172.23.1.1^M
[07 Oct 2017 14:51:07,582] <3998084 CTRL DEBUG host01> ^[[?1034hhscroot@pddrmd7hmc1:~> getupgfiles -h 172.23.1.1 -u root -d /BCU_share/bwr1/firmware/hmc/CR6/image/imports/HMC_Recovery_V8R830_5 -s
[07 Oct 2017 14:51:07,582] <3998084 CTRL DEBUG host01> Enter the current password for user root:
[07 Oct 2017 14:51:07,582] <3998084 CTRL DEBUG host01>
[07 Oct 2017 14:51:07,582] <3998084 CTRL DEBUG host01> The file transfer did not complete sucessfully.
[07 Oct 2017 14:51:07,582] <3998084 CTRL DEBUG host01> Verify the remote directory exists, all required files needed for upgrade are there,
[07 Oct 2017 14:51:07,583] <3998084 CTRL DEBUG host01> you have read access to both the directory and the files, and then try the operation again.
[07 Oct 2017 14:51:07,583] <3998084 CTRL DEBUG host01> hscroot@pddrmd7hmc1:~> echo $?
[07 Oct 2017 14:51:07,583] <3998084 CTRL DEBUG host01> 1
[07 Oct 2017 14:51:07,583] <3998084 CTRL DEBUG host01> hscroot@pddrmd7hmc1:~>
[07 Oct 2017 14:51:07,585] <3998084 CTRL DEBUG host01> From process 4522578: STDERR:
[07 Oct 2017 14:51:07,585] <3998084 CTRL DEBUG host01>
[07 Oct 2017 14:51:07,586] <3998084 CTRL DEBUG host01> Exit code: 1
[07 Oct 2017 14:51:07,587] <3998084 CTRL DEBUG host01> Command return code -> 1
[07 Oct 2017 14:51:07,588] <3998084 CTRL DEBUG host01> getupgfiles command failed
[07 Oct 2017 14:51:07,626] <3998084 CTRL DEBUG host01> Failed to upgrade release |
Workaround:
-----------
Check the log file to see whether the update completed on the second or other HMC in the environment. If the update completed successfully there, the most likely reason for the failure is that the known_hosts file for the root user has an incorrect ssh host key associated with the management host. This should be a rare occurrence, but it can happen if the ssh host keys on the management host change over time and, during troubleshooting or a deployment step, an ssh session was initiated from the root user on the HMC to the management host, causing an issue.
To resolve this issue, PDOA support must open a secondary case with the HMC support team. The HMC support team will lead the customer to obtain pesh access; this is described in the pesh documentation for Power 7 and Power 8. Access to pesh requires access to the hscpe user and the root user. In PDOA environments the hscpe user is removed before the system is turned over, but it may have been created during troubleshooting steps. Therefore, it may be necessary to create the hscpe user, or to change the password for the hscpe user if that user already exists from a previous troubleshooting step. The same is true for the root user: if the root password is not known, it will be necessary to modify the root password. Both the hscpe and root passwords can be modified through the hscroot user by using the chhmcuser command.

Once a pesh session is established and the customer is able to access the root account, it is possible to test that this is indeed the problem by running the following as the root user:

bash-4.1# ssh root@172.23.1.1
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is

To fix the issue, run the following as the root user on the HMC. Note that if your management host internal network IP address is different from 172.23.1.1, substitute that IP address in the ssh-keygen command:

ssh-keygen -R 172.23.1.1

This command removes the entry from the /root/.ssh/known_hosts file on the HMC. The entry must be removed as the root user on the HMC.

Fixed:
----------
N/A.
|
|
KI007553
"Could not start the product 'GPFS' on" during apply phase.[ Added 2017-11-22 ]
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2 I_V1.0.0.3 I_V1.0.0.4 I_V1.0.0.5 I_V1.1.0.1 |
"Could not start the product 'GPFS' on" during apply phase.[ Added 2017-11-22 ]
When the fixpack attempts to restart GPFS on a host, it may fail to start GPFS, causing the fixpack process to fail.
|
Workaround:
-----------
This happens due to a limitation in the pflayer code, which determines whether all of the GPFS filesystem mount points are indeed mounted before allowing the fixpack process to proceed to the next step. This code relies on a very specific naming convention for NSDs and associated GPFS filesystems, as well as a one-to-one mapping of NSDs to filesystems. If a filesystem and NSD do not follow either of these conventions, the GPFS startup code will not be able to determine when all filesystems are indeed mounted. Customers that have added GPFS filesystems that do not follow these two conventions will need to contact IBM for possible remediation options.
Here is the test. Run the following commands on the hosts identified in the pl_update.log file that could not start GPFS. These commands can be run prior to the fixpack process:

/usr/lpp/mmfs/bin/mmlsfs all -d 2> /dev/null | grep "\-d" | awk '{ sub(/nsd/, "", $2);print $2}'|sort
mount | grep " mmfs " | awk '{ sub(/\/dev\//,"",$1);print $1}' | sort

The expectation is that the output of the two commands is exactly the same. A scripted comparison follows.
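A minimal sketch that runs both documented commands and diffs their output; empty diff output means the convention check passes (the file names under /tmp are arbitrary).

# NSD names derived from mmlsfs versus the mounted GPFS filesystems.
/usr/lpp/mmfs/bin/mmlsfs all -d 2> /dev/null | grep "\-d" | awk '{ sub(/nsd/, "", $2);print $2}'|sort > /tmp/nsd_names.txt
mount | grep " mmfs " | awk '{ sub(/\/dev\//,"",$1);print $1}' | sort > /tmp/gpfs_mounts.txt
# No output from diff means the two lists match, as expected.
diff /tmp/nsd_names.txt /tmp/gpfs_mounts.txt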
Fixed:
----------
V1.0.0.6/V1.1.0.2 The fixpack mechanism has changed. However, this symptom could still occur when running appl_start commands, which rely on the rules above to work correctly. The impact is different because the fixpack no longer relies entirely on appl_start and appl_stop.
|
|
KI007499
Drive update required for product ID ST900MM0006 [ Added 2017-11-22 ]
|
Fixpack |
I_V1.0.0.5
|
Drive update required for product ID ST900MM0006 [ Added 2017-11-22 ]
Before starting the apply phases of the fixpack, it is necessary to apply an update to the V7000 drives. These steps can be applied while the system is online. See the linked V7000 tech note for more information.
Drives with product ID ST900MM0006 must be updated to firmware level B56S before running the "applydrivesoftware" command. See: Data Integrity Issue when Drive Detects Unreadable Data, http://www-01.ibm.com/support/docview.wss?rs=591&uid=ssg1S1005289 |
Workaround:
-----------
1. In an ssh session, log in as the root user on the management host.

2. Determine the IP addresses of all of the V7000 enclosures in the environment. The SAN_FRAME entries in the xcluster.cfg file are V7000 enclosures:

$ grep 'SAN_FRAME[0-9][0-9]*_IP' /pschome/config/xcluster.cfg
SAN_FRAME1_IP = 172.23.1.181
SAN_FRAME2_IP = 172.23.1.182
SAN_FRAME3_IP = 172.23.1.183
SAN_FRAME4_IP = 172.23.1.184
SAN_FRAME5_IP = 172.23.1.185
SAN_FRAME6_IP = 172.23.1.186
SAN_FRAME7_IP = 172.23.1.187

or use the following command to query the console for the storage enclosures:

$ appl_ls_hw -r storage -A M_IP_address,Description
"172.23.1.181","IBM Storwize V7000 Storage"
"172.23.1.182","IBM Storwize V7000 Storage"
"172.23.1.183","IBM Storwize V7000 Storage"
"172.23.1.184","IBM Storwize V7000 Storage"
"172.23.1.185","IBM Storwize V7000 Storage"
"172.23.1.186","IBM Storwize V7000 Storage"
"172.23.1.187","IBM Storwize V7000 Storage"

In the above examples there are several V7000 storage enclosures, and the IP addresses are 172.23.1.181 to 172.23.1.187.

3. Determine whether your system has the impacted drive. This command reports the number of drives that match the 900 GB drive with product ID ST900MM0006:

$ grep 'SAN_FRAME[0-9]*[0-9]_IP' /pschome/config/xcluster.cfg | while read a b c d;do echo "*** ${c} ***";ssh -n superuser@${c} 'lsdrive -nohdr| while read id rest;do lsdrive $id;done' | grep -c "product_id ST900MM0006";done
*** 172.23.1.181 ***
0
*** 172.23.1.182 ***
5
*** 172.23.1.183 ***
23
*** 172.23.1.184 ***
0
*** 172.23.1.185 ***
0
*** 172.23.1.186 ***
0
*** 172.23.1.187 ***
0

4. The PureData System for Operational Analytics V1.0 FP5 image includes the necessary files to perform the drive update. These files were unpacked as part of the fixpack registration. Determine the location of the fixpack on the management host:

$ appl_ls_cat
NAME VERSION STATUS    DESCRIPTION
bwr0 4.0.4.0 Committed Updates for IBM_PureData_System_for_Operational_Analytics
bwr1 4.0.5.0 Committed Updates for IBM_PureData_System_for_Operational_Analytics_DB2105

In the above command the fixpack files are part of the id 'bwr1'. This means the files were unpacked on the management host in /BCU_share/bwr1.

5. Determine the fix path by changing the <BWR> variable to the identifier determined in step 4 in the path /BCU_share/<BWR>/firmware/storage/2076/image/imports/drives. From the above example, the id was 'bwr1', so the path is "/BCU_share/bwr1/firmware/storage/2076/image/imports/drives".

6. Verify that the fix file exists, and check the cksum of the file:

$ ls -la /BCU_share/bwr1/firmware/storage/2076/image/imports/drives
total 162728
drwxr-xr-x 2 26976 19768 256 Jan 18 08:53 .
drwxr-xr-x 5 26976 19768 256 Jan 18 08:53 ..
-rw-r--r-- 1 26976 19768 83313381 Jan 18 08:53 IBM2076_DRIVE_20160923
$ cksum /BCU_share/bwr1/firmware/storage/2076/image/imports/drives/IBM2076_DRIVE_20160923
3281318949 83313381 /BCU_share/bwr1/firmware/storage/2076/image/imports/drives/IBM2076_DRIVE_20160923

7. Repeat the following for each IP address identified in step 2. The example below uses 172.23.1.183. All V7000s can be updated concurrently.

a. Copy the image to the Storwize location /home/admin/upgrade:

scp /BCU_share/bwr1/firmware/storage/2076/image/imports/drives/IBM2076_DRIVE_20160923 admin@172.23.1.183:/home/admin/upgrade
IBM2076_DRIVE_20160923 100% 79MB 39.7MB/s 00:02

b. Update the drives using the command below:

ssh admin@172.23.1.183 "applydrivesoftware -file IBM2076_DRIVE_20160923 -all"

8. Monitor the status of the drive upgrade using the lsdriveprogress command. The following command reports on the progress of all of the V7000s. Repeat this command until there is no longer any output, indicating the updates have finished:

$ grep 'SAN_FRAME[0-9]*[0-9]_IP' /pschome/config/xcluster.cfg | while read a b c d;do echo "*** ${c} ***";ssh -n superuser@${c} lsdriveprogress;done
*** 172.23.1.181 ***
*** 172.23.1.182 ***
*** 172.23.1.183 ***
*** 172.23.1.184 ***
*** 172.23.1.185 ***
*** 172.23.1.186 ***
*** 172.23.1.187 ***

Fixed:
----------
N/A. This KI is limited to V1.0.0.5.
|
|
KI007566
HA Tools Version 2.0.5.0 hareset fails with "syntax error at line 854 :`else' unexpected" error.[ Added 2018-01-23]
|
General |
I_V1.0.0.5
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.1
I_V1.1.0.2
I_V1.1.0.3
|
HA Tools Version 2.0.5.0 hareset fails with "syntax error at line 854 :`else' unexpected" error.[ Added 2018-01-23]
When attempting to back up or restore the core TSA domains, the hareset command fails with an error similar to the following:
/usr/IBM/analytics/ha_tools/hareset: syntax error at line 854 :`else' unexpected

This is due to an errant edit as part of changes that were incorporated into the hatools in the March fixpacks (V1.0.0.5/V1.1.0.1) as part of HA Tools version 2.0.5.0. |
Workaround:
-----------
To fix in the field (a scripted alternative follows these steps):

1. Log in to the management host as root.
2. Back up the file: cp /usr/IBM/analytics/ha_tools/hareset /usr/IBM/analytics/ha_tools/hareset.bak
3. Using the vi editor, modify the file /usr/IBM/analytics/ha_tools/hareset. Find line 850 in this file and change 'if' to 'fi'. Save the file.
4. diff the file:

$ diff hareset.bak hareset
850c850
< if
---
> fi

5. Copy this new hareset file to the rest of the hosts.
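As an alternative to the manual vi edit, a minimal sed sketch, assuming line 850 contains exactly 'if' as shown in the diff above:

# Back up the script, then replace the errant 'if' on line 850 with 'fi'.
cp /usr/IBM/analytics/ha_tools/hareset /usr/IBM/analytics/ha_tools/hareset.bak
sed '850s/if/fi/' /usr/IBM/analytics/ha_tools/hareset.bak > /usr/IBM/analytics/ha_tools/hareset
# Verify the change matches the expected diff shown in step 4.
diff /usr/IBM/analytics/ha_tools/hareset.bak /usr/IBM/analytics/ha_tools/hareset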
Fixed:
----------
V1.0: Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0 which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. |
|
KI007610
On FP3->FP5, the TSA upgrade does not include the appropriate TSA license. [ Added 2018-09-05 ]
|
Fixpack | I_V1.0.0.5 |
On FP3->FP5, the TSA upgrade does not include the appropriate TSA license. [ Added 2018-09-05 ]
This issue only affects customers who apply PDOA V1.0.0.5 (FP5) to V1.0.0.3 (FP3). Two symptoms have appeared in the field.

The first symptom occurs after trying to run a command to change RSCT / TSA policies:

(mkrsrc-api) 2621-309 Command not allowed as daemon does not have a valid license.

The second symptom can occur when trying to update TSA when there is no license. The following error can show up when running installSAM:

prereqSAM: All prerequisites for the ITSAMP installation are met on operating system: AIX 7100-05 |
Workaround:
-----------
1. Verify that the license is not applied by running the following command as root on the management host:

$ dsh -n ${ALL} 'samlicm -s' | dshbak -c

2. If planning to update to PDOA V1.0.0.6, note that V1.0.0.6 will include instructions on how to remedy this issue. When PDOA V1.0.0.6 is available, download the fixpack from Fix Central, follow the instructions to unpack the fixpack, and then follow the Appendix, which describes how to apply the license as part of the TSA update. If not planning to apply FP6, contact IBM Support to obtain the sam41.lic file and proceed to step 3.

3. Create the directory /stage/FP3_FP5/TSA:

mkdir -p /stage/FP3_FP5/TSA

4. Copy the sam41.lic file to the /stage/FP3_FP5/TSA directory.

5. Verify that /stage is mounted on all hosts in the domain.

6. Run the following command to apply the license to all domains. This does not require a restart:

dsh -n $ALL "samlicm -i /stage/FP3_FP5/TSA/sam41.lic "

7. Verify that the license was applied successfully. The output should be similar to the output below once the TSA copies are licensed:

$ dsh -n $ALL "samlicm -s " | dshbak -c

Fixed:
----------
NA. Only applies to V1.0.0.3->V1.0.0.5 scenarios.
|
|
KI007570
Multiple DB2 Copies installed on the core hosts can confuse the fixpack. [ Added 2018-10-03 ]
|
Fixpack |
I_V1.0.0.1
I_V1.0.0.2
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.1.0.1
|
Multiple DB2 Copies installed on the core hosts can confuse the fixpack. [ Added 2018-10-03 ]
The PDOA appliance is designed as follows:
- 1 DB2 9.7 copy on the management host to support IBM Systems Director.
- 1 DB2 10.1 or 10.5 copy on the management and management standby hosts supporting Warehouse Tools and DPM.
- 1 DB2 10.1, 10.5, or 11.1 copy on all core hosts supporting the core database.

This assumption is built into the PDOA console and can impact the following:
- compliance checks comparing what is on the system to the validated stack
- fixpack application (preview, prepare, apply, and commit phases)

The most likely scenario is that a customer who is very familiar with DB2 installs additional copies as part of a fixpack or special build installation. This is supported by DB2, but if the previous copy is left on the system it can cause various issues with the console, with the most severe issues occurring during fixpack application. This issue will minimally impact customers on V1.0.0.5 or V1.1.0.1, as the non-cumulative V1.0.0.6 (FP6) / V1.1.0.2 (FP2) fixpacks have significantly changed and no longer have this restriction, and the compliance check for DB2 in the platform layer is not a critical function. |
Workaround:
-----------
Remove any extra DB2 copies from the environment on all hosts before running the fixpack preview; this prevents fixpack failures due to multiple DB2 copies (see the sketch below for identifying the installed copies). If a problem related to multiple DB2 copies is encountered during the appliance fixpack, seek guidance from IBM Support, because the next steps depend on the failure and on where in the fixpack process it occurred.
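A minimal sketch, assuming root access via dsh to the host groups used elsewhere in this document (${ALL} here; substitute the appropriate group for your environment), for listing the installed DB2 copies on every host so unexpected extras can be identified before the preview:
# List every registered DB2 copy (install path, level, and install date) on all hosts.
$ dsh -n ${ALL} '/usr/local/bin/db2ls' | dshbak -c
Any copy beyond the single expected copy for the host's role is a candidate for removal; if in doubt, contact IBM Support before removing anything.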
Fixed:
----------
V1.0.0.6/V1.1.0.2: The fixpack application mechanism has been modified and no longer requires just one DB2 copy to be installed.
|
|
KI007499
Drive update required for product ID ST1200MM0007
|
Fixpack | I_V1.1.0.1 |
Drive update required for product ID ST1200MM0007
Before starting the apply phase of the fixpack, it is necessary to apply an update to the V7000 drives. These steps can be applied while the system is online. See the linked V7000 tech note for more information.
Drives with product ID ST1200MM0007 need to be updated to firmware level B57D before running the "applydrivesoftware" command. See: Data Integrity Issue when Drive Detects Unreadable Data, http://www-01.ibm.com/support/docview.wss?rs=591&uid=ssg1S1005289 |
Workaround:
-----------
1. In an SSH session, log in as the root user on the management host.
2. Determine the IP addresses of all of the V7000 enclosures in the environment. The odd-numbered SAN_FRAME entries in the xcluster.cfg file are V7000 enclosures.
$ grep 'SAN_FRAME[0-9]*[13579]_IP' /pschome/config/xcluster.cfg
SAN_FRAME1_IP = 172.23.1.181
SAN_FRAME3_IP = 172.23.1.183
or use the following command to query the console for the storage enclosures:
$ appl_ls_hw -r storage -A M_IP_address,Description
"172.23.1.181","IBM Storwize V7000 FAB-1 Storage"
"172.23.1.182","IBM FlashSystem 900 Storage"
"172.23.1.183","IBM Storwize V7000 FAB-1 Storage"
"172.23.1.184","IBM FlashSystem 900 Storage"
In the above examples there are two V7000 storage enclosures, with IP addresses 172.23.1.181 and 172.23.1.183.
3. Determine whether your system has the impacted drive. This command reports, for each enclosure, the number of drives that match the impacted 1.2 TB product ID.
$ grep 'SAN_FRAME[0-9]*[13579]_IP' /pschome/config/xcluster.cfg | while read a b c d;do echo "*** ${c} ***";ssh -n superuser@${c} 'lsdrive -nohdr| while read id rest;do lsdrive $id;done' | grep -c "product_id ST1200MM0007";done
*** 172.23.1.181 ***
27
*** 172.23.1.183 ***
35
4. The PureData System for Operational Analytics V1.1 FP1 image includes the necessary files to perform the drive update. These files were unpacked as part of the fixpack registration. Determine the location of the fixpack on the management host.
$ appl_ls_cat
NAME VERSION STATUS DESCRIPTION
bwr0 4.0.4.0 Committed Updates for IBM_PureData_System_for_Operational_Analytics
bwr1 4.0.5.0 Committed Updates for IBM_PureData_System_for_Operational_Analytics_DB2105
In the above command the fixpack files are part of the id 'bwr1'. This means the files were unpacked on the management host in /BCU_share/bwr1.
5. Determine the fix path by substituting the identifier determined in step 4 for the <BWR> variable in the path /BCU_share/<BWR>/firmware/storage/2076/image/imports/drives. From the above example the id was 'bwr1', so the path is /BCU_share/bwr1/firmware/storage/2076/image/imports/drives.
6. Verify that the fix file exists and check the cksum of the file.
$ ls -la /BCU_share/bwr1/firmware/storage/2076/image/imports/drives
total 162728
drwxr-xr-x 2 26976 19768 256 Jan 18 08:53 .
drwxr-xr-x 5 26976 19768 256 Jan 18 08:53 ..
-rw-r--r-- 1 26976 19768 83313381 Jan 18 08:53 IBM2076_DRIVE_20160923
$ cksum /BCU_share/bwr1/firmware/storage/2076/image/imports/drives/IBM2076_DRIVE_20160923
3281318949 83313381 /BCU_share/bwr1/firmware/storage/2076/image/imports/drives/IBM2076_DRIVE_20160923
7. For each IP address identified in step 2 (the example below uses 172.23.1.183; all V7000s can be updated concurrently):
a. Copy the image to the Storwize location /home/admin/upgrade.
scp /BCU_share/bwr1/firmware/storage/2076/image/imports/drives/IBM2076_DRIVE_20160923 admin@172.23.1.183:/home/admin/upgrade
IBM2076_DRIVE_20160923 100% 79MB 39.7MB/s 00:02
b. Update the drives using the command below:
ssh admin@172.23.1.183 "applydrivesoftware -file IBM2076_DRIVE_20160923 -all"
8. Monitor the status of the drive upgrade using the lsdriveupgradeprogress command. The following command reports on the progress of all of the V7000s. Repeat this command until there is no longer any output, which indicates the updates have finished.
$ grep 'SAN_FRAME[0-9]*[13579]_IP' /pschome/config/xcluster.cfg | while read a b c d;do echo "*** ${c} ***";ssh -n superuser@${c} lsdriveupgradeprogress;done
*** 172.23.1.181 ***
*** 172.23.1.183 ***
Fixed:
----------
NA. This known issue is only applicable to this fixpack level.
|
|
KI007470
Preview failed for flash storage
|
Fixpack | I_V1.1.0.1 |
Preview failed for flash storage
Symptoms:
1. Fix pack apply fails.
2. The miupdate log shows the following:
=====================================================
[20 Jan 2017 04:09:59,225] <6291746 CTRL TRACE reverseflash01> CMMVC5994E Error in verifying the signature of the update package.
and
[20 Jan 2017 04:09:59,230] <6291746 CTRL ERROR reverseflash01> STORAGE:storage1:172.23.1.182:1:Error: The update operation on the system cannot be performed.
[20 Jan 2017 04:09:59,230] <6291746 CTRL ERROR reverseflash01> STORAGE:storage3:172.23.1.184:1:Error: The update operation on the system cannot be performed.
=====================================================
This is a known issue with the Flash900 firmware as listed in the following URL: https://www-01.ibm.com/support/docview.wss?uid=ssg1S1009254 |
Workaround:
-----------
If the issue cannot be resolved, contact IBM Support.
Fixed:
----------
|
|
KI007523
Apply failed on one flash storage on reverseflash
|
Fixpack | I_V1.1.0.1 |
Apply failed on one flash storage on reverseflash
The upgrade cannot proceed because of hardware errors.
LOG excerpts:
============
[02 Feb 2017 13:43:34,301] <3801392 UPDT ERROR reverseflash01> TASK_END::14::1 of 1::StorageUPD::172.23.1.182::::RC=1::Storage update failed on 172.23.1.182
[02 Feb 2017 13:43:34,302] <3801392 UPDT INFO reverseflash01> TASK_END::14::1 of 1::StorageUPD::172.23.1.184::::RC=0::Storage update succeeded on 172.23.1.184
[02 Feb 2017 13:43:34,497] <3801392 UPDT APPI DEBUG reverseflash01> Executing query Logical_name=Infrastructure AND Solution_version=4.0.5.0, to update status of Product
[02 Feb 2017 13:43:34,561] <3801392 UPDT APPI DEBUG reverseflash01> Executing query Sub_module_type=Infrastructure AND Solution_version=4.0.5.0, to update status of sub module
[02 Feb 2017 13:43:34,771] <3801392 UPDT APPI DEBUG reverseflash01> Executing query Logical_name=Infrastructure AND Solution_version=4.0.5.0, to update status of Product
[02 Feb 2017 13:43:34,818] <3801392 UPDT APPI DEBUG reverseflash01> Executing query Sub_module_type=Infrastructure AND Solution_version=4.0.5.0, to update status of sub module
[02 Feb 2017 13:43:34,925] <3801392 UPDT APPI ERROR reverseflash01> Error on nodes (172.23.1.182).
[02 Feb 2017 13:43:34,944] <3801392 UPDT APPI INFO reverseflash01> STEP_END::14::StorageFW_UPD::FAILED
[02 Feb 2017 13:43:34,950] <3801392 UPDT APPI DEBUG reverseflash01> Error occured in apply for product storagefw3
[02 Feb 2017 13:43:35,013] <3801392 UPDT APPI DEBUG reverseflash01> Executing query Logical_name=storagefw3 AND Solution_version=4.0.5.0, to update status of Product
[02 Feb 2017 13:43:35,154] <3801392 UPDT APPI ERROR reverseflash01> Apply (impact) phase for solution has failed.
[02 Feb 2017 13:43:35,224] <3801392 UPDT APPI DEBUG reverseflash01> Executing query Logical_name=bwr1 AND Solution_version=4.0.5.0, to update status of Solution
[02 Feb 2017 13:43:35,349] <3801392 UPDT APPI INFO reverseflash01> PHASE_END APPLY IMPACT
[02 Feb 2017 13:43:35,351] <3801392 UPDT APPI ERROR reverseflash01> The apply phase for the release 'bwr1' failed.
[02 Feb 2017 13:43:35,352] <3801392 UPDT RESU INFO reverseflash01> PHASE_END RESUME
[02 Feb 2017 13:43:35,353] <3801392 UPDT RESU ERROR reverseflash01> The resume phase for the release 'bwr1' failed.
[02 Feb 2017 11:42:10,533] <5112002 CTRL DEBUG reverseflash01> Extracted msg from NLS: apply: 172.23.1.182 ssh admin@172.23.1.182 LANG=en_US svctask applysoftware -file IBM9840_INSTALL_1.4.5.0 command failed.
[02 Feb 2017 11:42:10,534] <5112002 CTRL DEBUG reverseflash01> apply: 172.23.1.182: error: svctask applysoftware failed, rc=1
[02 Feb 2017 11:42:10,568] <2097734 CTRL DEBUG reverseflash01> apply: storage1: apply failed
[02 Feb 2017 11:46:48,787] <4259928 CTRL DEBUG reverseflash01> get_update_status: status is <upgrading 2
[02 Feb 2017 11:46:48,787] <4259928 CTRL DEBUG reverseflash01> > |
Workaround:
-----------
If the issue cannot be resolved, contact IBM Support.
Fixed:
----------
NA
|
|
KI007539
The FixCentral download for V1.1 FP1 includes V1.0.0.5 filenames. This is confusing.
|
Fixpack | I_V1.1.0.1 |
The FixCentral download for V1.1 FP1 includes V1.0.0.5 filenames. This is confusing.
When downloading the files for V1.1 FP1, I noticed that the filenames say 1.0.0.5. Did I download the right files? Is there a mistake in FixCentral?
|
Workaround:
-----------
There is no mistake in FixCentral. The fixpack files for V1.0.0.5 (DB2 10.5) and V1.1.0.1 are exactly the same.
The file listing for V1.1.0.1 is:
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.fo.xml
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.readme
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_001
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_002
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_003
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_004
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_005
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_006
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_007
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_008
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_009
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_010
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_011
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_012
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_013
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_014
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_015
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_016
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_017
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_018
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_019
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_020
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_021
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005.tar_part_022
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005_SG_Single_1489238245139.xml
1.0.0.5-IM-PureData_System_for_OpAnalytics_DB2105-fp005_SG_Single_1489373378095.xml
1.1.0.1-IM-PureData_System_for_OpAnalytics_DB2105-fp005.fo.xml
1.1.0.1-IM-PureData_System_for_OpAnalytics_DB2105-fp005_SG_Single_1489273702786.xml
1.1.0.1-IM-PureData_System_for_OpAnalytics_DB2105-fp005_SG_Single_1489374195057.xml
As per the previous known issue, the XML files were inadvertently included in the fixpack and are not needed.
Fixed:
----------
NA
|
|
KI007560
A new drive type, AL14SEB120N, in the V7000 gen2 will cause a preview failure. [ Added 2017-12-22]
|
Fixpack | I_V1.1.0.1 |
A new drive type, AL14SEB120N, in the V7000 gen2 will cause a preview failure. [ Added 2017-12-22]
The above drive type was added as a possible drive for V7000 gen2 enclosures after the PDOA Fixpack was released and is not recognized by the V7000 test application shipped in that fixpack. This can lead to errors similar to the following during the preview stage.
[20 Dec 2017 09:34:43,469] <12386730 CTRL TRACE host01> This tool has detected that the cluster contains
[20 Dec 2017 09:34:43,470] <12386730 CTRL TRACE host01> one or more internal disks that are not known drive types.
[20 Dec 2017 09:34:43,470] <12386730 CTRL TRACE host01> Please retry with the latest version of svcupgradetest. If this
[20 Dec 2017 09:34:43,470] <12386730 CTRL TRACE host01> error is still being reported, please contact your support representative.
[20 Dec 2017 09:34:43,470] <12386730 CTRL TRACE host01> +----------------------+-------------+
[20 Dec 2017 09:34:43,470] <12386730 CTRL TRACE host01> | Reported model | Drive count |
[20 Dec 2017 09:34:43,471] <12386730 CTRL TRACE host01> +----------------------+-------------+
[20 Dec 2017 09:34:43,471] <12386730 CTRL TRACE host01> | AL14SEB120N | 32 |
[20 Dec 2017 09:34:43,471] <12386730 CTRL TRACE host01> +----------------------+-------------+
[20 Dec 2017 09:34:43,471] <12386730 CTRL TRACE host01> To see the list of entries in any of the above tables
[20 Dec 2017 09:34:43,471] <12386730 CTRL TRACE host01> re-run the tool with a -d parameter added on the end.
[20 Dec 2017 09:34:43,472] <12386730 CTRL TRACE host01> This will cause the preview step to fail.
No PDOA environments were shipped with this drive type prior to December 2017. However, it is possible that a drive replacement could introduce this drive type to an existing PDOA V1.1 environment, leading to this scenario. |
Workaround:
-----------
The only resolution to this issue is to update the firmware of all V7000 enclosures that contain this drive type to a level at or above the level shipped with the fixpack.
The V7000 storage enclosures support concurrent firmware updates; however, updates should only be performed when there is a light workload. The V7000 enclosures in a PDOA V1.1 environment provide management storage with generally minimal I/O requirements, as well as second-tier or backup storage for the database partitions. It is recommended to quiesce the system prior to applying the update. For planning purposes, the update takes approximately 3 hours.
Step 1: Identify all V7000 storage enclosures that contain this drive type. The number of enclosures will vary depending on the size of the environment. This is run as root on the management node.
=====================================
$ grep 'SAN_FRAME[0-9]*[13579]_IP' /pschome/config/xcluster.cfg | while read a b c d;do echo "*** ${c} ***";ssh -n superuser@${c} 'lsdrive -nohdr| while read id rest;do lsdrive $id;done' | grep -c "product_id AL14SEB120N";done
*** 172.23.1.181 ***
39
*** 172.23.1.183 ***
32
*** 172.23.1.185 ***
46
*** 172.23.1.187 ***
41
Step 2: Identify the pflayer storage# ids associated with the above enclosures.
=====================================
$ appl_ls_hw -r storage -A Logical_name,Description,M_IP_address | sort
"storage0","IBM Storwize V7000 FAB-1 Storage","172.23.1.181"
"storage1","IBM FlashSystem 900 Storage","172.23.1.182"
"storage2","IBM Storwize V7000 FAB-1 Storage","172.23.1.183"
"storage3","IBM FlashSystem 900 Storage","172.23.1.184"
"storage4","IBM Storwize V7000 FAB-1 Storage","172.23.1.185"
"storage5","IBM FlashSystem 900 Storage","172.23.1.186"
"storage6","IBM Storwize V7000 FAB-1 Storage","172.23.1.187"
"storage7","IBM FlashSystem 900 Storage","172.23.1.188"
Step 3: Run the pflayer validation check, replacing the 'storage#,storage#' list with the storage identified above.
=====================================
$PL_ROOT/bin/icmds/appl_ctrl_storage update -validate -t 7.5.0.11 -l "storage0,storage2,storage4,storage6" -f /BCU_share/bwr1/firmware/storage/2076/image
STORAGE:storage2:172.23.1.183:1:Error: The update operation on the system cannot be performed.
STORAGE:storage6:172.23.1.187:1:Error: The update operation on the system cannot be performed.
STORAGE:storage4:172.23.1.185:1:Error: The update operation on the system cannot be performed.
STORAGE:storage0:172.23.1.181:1:Error: The update operation on the system cannot be performed.
Step 4: Run the pflayer prepare step, replacing the 'storage#,storage#' list with the storage identified above.
=====================================
$PL_ROOT/bin/icmds/appl_ctrl_storage update -prepare -l "storage0,storage2,storage4,storage6" -f /BCU_share/bwr1/firmware/storage/2076/image
STORAGE:storage2:172.23.1.183:0:
STORAGE:storage6:172.23.1.187:0:
STORAGE:storage4:172.23.1.185:0:
STORAGE:storage0:172.23.1.181:0:
Step 5: Run the pflayer update step, replacing the 'storage#,storage#' list with the storage identified above.
=====================================
nohup $PL_ROOT/bin/icmds/appl_ctrl_storage update -install -t 7.5.0.11 -l "storage0,storage2,storage4,storage6" -f /BCU_share/bwr1/firmware/storage/2076/image
Step 6: Rerun the preview step as part of the fixpack application.
=====================================
Fixed:
----------
NA.
|
|
KI007583
Trial DB2 10.5 License Discovered after fixpack apply phase.
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
Trial DB2 10.5 License Discovered after fixpack apply phase.
Due to a change in the mechanism of updating DB2 10.5 in V1.0.0.5 and V1.1.0.1, if DB2 10.5 is updated by V1.0.0.5 or V1.1.0.1 the appropriate license file is not applied to the new DB2 copies on the core hosts. This only impacts the core hosts, not the management hosts. This does not impact DB2 11.1 or DB2 10.1.
The following command will show the current licenses installed for all Known DB2 copies on the core nodes:
$ dsh -n ${BCUDB2ALL} '/usr/local/bin/db2ls -c | grep -v "#" | cut -d: -f 1 | while read f;do ${f}/adm/db2licm -l;done' | dshbak -c
HOSTS
-------------------------------------------------------------------------
kf5hostname02, kf5hostname05, kf5hostname06, kf5hostname07
-------------------------------------------------------------------------------
Product name: "DB2 Advanced Enterprise Server Edition"
License type: "Trial"
Expiry date: "06/04/2018"
Product identifier: "db2aese"
Version information: "10.5"
Product name: "DB2 Enterprise Server Edition"
License type: "Trial"
Expiry date: "06/04/2018"
Product identifier: "db2ese"
Version information: "10.5"
HOSTS
-------------------------------------------------------------------------
kf5hostname04
-------------------------------------------------------------------------------
Product name: "DB2 Advanced Enterprise Server Edition"
License type: "Trial"
Expiry date: "06/09/2018"
Product identifier: "db2aese"
Version information: "10.5"
Product name: "DB2 Enterprise Server Edition"
License type: "Trial"
Expiry date: "06/09/2018"
Product identifier: "db2ese"
Version information: "10.5" |
Workaround:
-----------
Log in to the management host as the root user and ensure that /BCU_share is mounted on all core hosts.
$ dsh -n ${BCUDB2ALL} 'mount | grep /BCU_share'
### If not mounted, then mount /BCU_share.
### This command assumes that the management host's internal IP address is 172.23.1.1.
$ dsh -n ${BCUDB2ALL} 'mount 172.23.1.1:/BCU_share /BCU_share'
# Verify:
$ dsh -n ${BCUDB2ALL} 'mount | grep /BCU_share'
kf5hostname05: 172.23.1.1 /BCU_share /BCU_share nfs3 Mar 14 17:53
kf5hostname02: 172.23.1.1 /BCU_share /BCU_share nfs3 Mar 14 17:53
kf5hostname07: 172.23.1.1 /BCU_share /BCU_share nfs3 Mar 14 17:53
kf5hostname06: 172.23.1.1 /BCU_share /BCU_share nfs3 Mar 14 17:53
kf5hostname04: 172.23.1.1 /BCU_share /BCU_share nfs3 Mar 14 17:53
# Find the installation directory for the fixpack.
$ appl_ls_cat
NAME VERSION STATUS DESCRIPTION
bwr0 4.0.4.2 Committed Updates for IBM_PureData_System_for_Operational_Analytics
bwr1 4.0.5.0 Committed Updates for IBM_PureData_System_for_Operational_Analytics_DB2105
--> The DB2 license file used in PDOA environments can be found in the following location. Replace 'bwr1' with the appropriate name from the appl_ls_cat command.
/BCU_share/bwr1/software/ISW/PDS/warehouse/db2aese_c.lic
Verify that the license file is for the DB2 Copy in question, for example, DB2 10.5.
$ cat /BCU_share/bwr1/software/ISW/PDS/warehouse/db2aese_c.lic | grep ProductVersion
ProductVersion=10.5
# Verify that all hosts can see the file.
$ dsh -n ${BCUDB2ALL} 'cksum /BCU_share/bwr1/software/ISW/PDS/warehouse/db2aese_c.lic' | dshbak -c
HOSTS
-------------------------------------------------------------------------
kf5hostname02, kf5hostname04, kf5hostname05, kf5hostname06, kf5hostname07
-------------------------------------------------------------------------------
1513072379 915 /BCU_share/bwr1/software/ISW/PDS/warehouse/db2aese_c.lic
# Apply the license file. This is done as root for all hosts. Find the DB2 installation path:
$ dsh -n ${BCUDB2ALL} '/usr/local/bin/db2ls -c | grep -v "#" | cut -d: -f 1'
kf5hostname02: /usr/IBM/dwe/db2/V10.5.0.8..0
kf5hostname04: /usr/IBM/dwe/db2/V10.5.0.8..0
kf5hostname07: /usr/IBM/dwe/db2/V10.5.0.8..0
kf5hostname06: /usr/IBM/dwe/db2/V10.5.0.8..0
kf5hostname05: /usr/IBM/dwe/db2/V10.5.0.8..0
$ dsh -n ${BCUDB2ALL} '/usr/IBM/dwe/db2/V10.5.0.8..0/adm/db2licm -a /BCU_share/bwr1/software/ISW/PDS/warehouse/db2aese_c.lic'
kf5hostname02: LIC1402I License added successfully.
kf5hostname02: LIC1426I This product is now licensed for use as outlined in your License Agreement. USE OF THE PRODUCT CONSTITUTES ACCEPTANCE OF THE TERMS OF THE IBM LICENSE AGREEMENT, LOCATED IN THE FOLLOWING DIRECTORY: "/usr/IBM/dwe/db2/V10.5.0.8..0/license/en_US.iso88591"
kf5hostname04: LIC1402I License added successfully.
kf5hostname04: LIC1426I This product is now licensed for use as outlined in your License Agreement. USE OF THE PRODUCT CONSTITUTES ACCEPTANCE OF THE TERMS OF THE IBM LICENSE AGREEMENT, LOCATED IN THE FOLLOWING DIRECTORY: "/usr/IBM/dwe/db2/V10.5.0.8..0/license/en_US.iso88591"
kf5hostname07: LIC1402I License added successfully.
kf5hostname07: LIC1426I This product is now licensed for use as outlined in your License Agreement. USE OF THE PRODUCT CONSTITUTES ACCEPTANCE OF THE TERMS OF THE IBM LICENSE AGREEMENT, LOCATED IN THE FOLLOWING DIRECTORY: "/usr/IBM/dwe/db2/V10.5.0.8..0/license/en_US.iso88591"
kf5hostname06: LIC1402I License added successfully.
kf5hostname06: LIC1426I This product is now licensed for use as outlined in your License Agreement. USE OF THE PRODUCT CONSTITUTES ACCEPTANCE OF THE TERMS OF THE IBM LICENSE AGREEMENT, LOCATED IN THE FOLLOWING DIRECTORY: "/usr/IBM/dwe/db2/V10.5.0.8..0/license/en_US.iso88591"
kf5hostname05: LIC1402I License added successfully.
kf5hostname05: LIC1426I This product is now licensed for use as outlined in your License Agreement. USE OF THE PRODUCT CONSTITUTES ACCEPTANCE OF THE TERMS OF THE IBM LICENSE AGREEMENT, LOCATED IN THE FOLLOWING DIRECTORY: "/usr/IBM/dwe/db2/V10.5.0.8..0/license/en_US.iso88591"
# Verify the license is applied.
$ dsh -n ${BCUDB2ALL} '/usr/IBM/dwe/db2/V10.5.0.8..0/adm/db2licm -l' | dshbak -c
HOSTS
-------------------------------------------------------------------------
kf5hostname02, kf5hostname04, kf5hostname05, kf5hostname06, kf5hostname07
-------------------------------------------------------------------------------
Product name: "DB2 Advanced Enterprise Server Edition"
License type: "CPU Option"
Expiry date: "Permanent"
Product identifier: "db2aese"
Version information: "10.5"
Enforcement policy: "Soft Stop"
Fixed:
----------
V1.0.0.6/V1.1.0.2
|
|
KI007647
IBM PureData System for Operational Analytics environments may be vulnerable to Flash900 HIPER involving a crash or data corruption.
|
General | I_V1.1.0.0 |
IBM PureData System for Operational Analytics environments may be vulnerable to Flash900 HIPER involving a crash or data corruption.
IBM PureData System for Operational Analytics environments built and shipped in 2015 included Flash900 firmware levels that contain a serious HIPER. All customers should verify their Flash900 firmware levels and should plan on applying firmware level 1.4.5.0 as soon as possible.
See technote http://www.ibm.com/support/docview.wss?uid=swg22005436 for more information.
|
Workaround:
-----------
See technote http://www.ibm.com/support/docview.wss?uid=swg22005436 for more information.
Fixed:
----------
V1.1.0.1
|
|
KI007447
IBM PureData System for Operational Analytics environments contain an incorrect crontab entry for root.
|
General |
I_V1.0
I_V1.1
|
IBM PureData System for Operational Analytics environments contain an incorrect crontab entry for root.
IBM PureData System for Operational Analytics V1.0 and V1.1 environments contain the following crontab entry on all AIX hosts, which, while harmless, is unnecessary.
"* * * * /usr/bin/stcron -parmfile /etc/stcron >/dev/null 2>&1"
|
Workaround:
-----------
See technote https://www-01.ibm.com/support/docview.wss?uid=swg21996578 for more information.
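As a quick check (a sketch using the dsh conventions from elsewhere in this document; ${ALL} is assumed to cover all AIX hosts), the entry can be confirmed and then removed with the root crontab editor on each affected host:
# Show which hosts still carry the stcron entry.
$ dsh -n ${ALL} 'crontab -l | grep stcron' | dshbak -c
# On each affected host, remove the line using the root crontab editor.
$ crontab -e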
Fixed:
----------
NA
|
|
KI007366
Excessive events and alerts in an IBM PureData System for Operational Analytics environment.
|
General |
I_V1.0
I_V1.1
|
Excessive events and alerts in an IBM PureData System for Operational Analytics environment.
Too many events, SNMP traps, and active status entries appear in the IBM PureData System for Operational Analytics console as well as in the IBM Systems Director console. In addition, too many Active Status records are generated in IBM Systems Director, which leads to increased IBM Systems Director start times.
|
Workaround:
-----------
See technote http://www.ibm.com/support/docview.wss?uid=swg21994243 for more information.
Fixed:
----------
|
|
KI005724
Inconsistent or insufficient system dump device definitions on AIX hosts in the appliance. |
General |
I_V1.0
I_V1.1
|
Inconsistent or insufficient system dump device definitions on AIX hosts in the appliance
Following initial deployment, and after AIX updates during appliance fixpack applications, the following issues may be seen on the AIX hosts in the appliance. This symptom should only be remedied before a fixpack is applied, or after the fixpack is committed, when rootvg is mirrored.
1. Insufficient dump device size. The following errpt snippet may be encountered on one or more of the AIX hosts in the appliance.
LABEL: DMPCHK_TOOSMALL
IDENTIFIER: E87EF1BE
Date/Time: Wed Mar 6 15:00:00 CST 2019
Sequence Number: 1890
Machine Id: 00FAC8F34C00
Node Id: ap29pdimdb02
Class: O
Type: PEND
WPAR: Global
Resource Name: dumpcheck
Description
The largest dump device is too small.
Probable Causes
Neither dump device is large enough to accommodate a system dump at this time.
Recommended Actions
Increase the size of one or both dump devices.
2. Inconsistent dump device specifications.
a. hd7 is not defined. This occurs most often for V1.0 customers with Power 7 / IOC systems that have applied FP3.
# Returns blank on V1.0
$ dsh -n ${ALL} 'lsvg -l rootvg | grep "^hd7"' | sort
or
$ dsh -n ${ALL} 'lslv -l hd7' 2>&1 | dshbak -c
HOSTS
-------------------------------------------------------------------------
stgkf201, stgkf202, stgkf203, stgkf204, stgkf205, stgkf206, stgkf208
-------------------------------------------------------------------------------
0516-306 lslv: Unable to find hd7 in the Device Configuration Database.
# The following three symptoms are likely to happen at the same time. These symptoms occur after AIX is updated as part of V1.0.0.4, V1.0.0.5, V1.0.0.6, V1.1.0.1, or V1.1.0.2, or through a support-related activity.
|
Workaround:
-----------
The fix is manual and is performed by a user with root authority.
Here are the goals, assuming rootvg is composed of hdisk0 and hdisk1 (over time it is possible, though rare, that rootvg will use other hdisk numbers).
lg_dumplv:
hd7:
ISSUE 2A: hd7 is not created.
1. Determine the device which contains lg_dumplv. The following command shows that lg_dumplv resides on hdisk0. (See the sketch after the output for creating and registering hd7 once the target disk is known.)
$ lslv -l lg_dumplv
lg_dumplv:N/A
PV COPIES IN BAND DISTRIBUTION
hdisk0 007:000:000 85% 000:006:001:000:000
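As a minimal sketch only, assuming rootvg spans hdisk0 and hdisk1, that lg_dumplv occupies 7 physical partitions on hdisk0 as shown above, and that the goal is a matching hd7 dump device on the other rootvg disk (verify sizes and disk names for your environment, or contact IBM Support, before running these commands):
# Create a 7-PP sysdump logical volume named hd7 on hdisk1 (size assumed to match lg_dumplv).
mklv -y hd7 -t sysdump rootvg 7 hdisk1
# Persistently register lg_dumplv as the primary and hd7 as the secondary dump device.
sysdumpdev -P -p /dev/lg_dumplv
sysdumpdev -P -s /dev/hd7
# Verify the dump device configuration and the estimated dump size.
sysdumpdev -l
sysdumpdev -e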
|
|
KI007649
V7000 Canister 1 failed to reboot while V7000 upgrade is in progress
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
V7000 Canister 1 failed to reboot while a V7000 upgrade is in progress
This is similar to KI006996.
The fixpack stage will time out after 4 hours with a failure.
The command "svcupdate lsupdate" will show 50% complete.
|
Workaround:
-----------
Step 1: Check the upgrade status at a finer granularity than the fixpack status displays. To save time, start this step 90 minutes after the storage update has started; otherwise, wait until the apply phase has failed (approximately 4 hours). Repeat this step at 10 minute intervals and look closely at the status and progress fields. If after 30 minutes the progress has not moved, proceed to the next step. This command is run as the root user on the management host. The root user on the management host uses key-based authentication to access all storage enclosures without needing a password.
$ grep "SAN_FRAME[0-9]*[0-9]_IP" /pschome/config/xcluster.cfg | while read frame eq ip rest;do echo "*** ${ip} ***";ssh -n superuser@${ip} "lsupdate";done
*** 172.23.1.204 ***
status success
event_sequence_number
progress
estimated_completion_time
suggested_action start
system_new_code_level
system_forced no
system_next_node_status none
system_next_node_time
system_next_node_id
system_next_node_name
(The output for 172.23.1.205 through 172.23.1.209 is identical.)
This command shows the update status for all of the storage enclosures and works for V7000 and Flash900 enclosures. Note that in older V7000 firmware levels the command was lssoftwareupgradestatus; this command still works but provides less information in a different format.
Step 2: As root on the management host, run the following command. This is a preliminary check to look for this symptom when it is time to open a service ticket. All enclosures should show two Active nodes, as shown below. For this symptom, the lsservicenodes command will show just one node for the affected storage enclosure. This is a quick finding that can be provided initially to IBM Support.
$ grep "SAN_FRAME[0-9]*[0-9]_IP" /pschome/config/xcluster.cfg | while read frame eq ip rest;do echo "*** ${ip} ***";ssh -n superuser@${ip} "sainfo lsservicenodes";done *** 172.23.1.204 *** panel_name cluster_id cluster_name node_id node_name relation node_status error_data 02-2 0000010020A00EE0 V7_00 1 node1 local Active 02-1 0000010020A00EE0 V7_00 2 node2 partner Active *** 172.23.1.205 *** panel_name cluster_id cluster_name node_id node_name relation node_status error_data 01-1 0000020062AA266E Flash_00 1 node1 local Active 01-2 0000020062AA266E Flash_00 2 node2 partner Active *** 172.23.1.206 *** panel_name cluster_id cluster_name node_id node_name relation node_status error_data 02-1 0000010020A00EF8 V7_01 2 node2 local Active 02-2 0000010020A00EF8 V7_01 1 node1 partner Active *** 172.23.1.207 *** panel_name cluster_id cluster_name node_id node_name relation node_status error_data 01-2 0000020062EA268E Flash_01 2 node2 local Active 01-1 0000020062EA268E Flash_01 1 node1 partner Active *** 172.23.1.208 *** panel_name cluster_id cluster_name node_id node_name relation node_status error_data 02-2 0000010020800E4E V7_02 1 node1 local Active 02-1 0000010020800E4E V7_02 2 node2 partner Active *** 172.23.1.209 *** panel_name cluster_id cluster_name node_id node_name relation node_status error_data 01-2 0000020062AA26AA Flash_02 2 node2 local Active 01-1 0000020062AA26AA Flash_02 1 node1 partner Active
Step 3: Open a ticket with IBM Support. The most common high-level scenario for this symptom is as follows:
i. IBM Support will ask for a support package or svc_snap on the problematic enclosure. This is used to verify system details and look for a root cause for the failures.
ii. IBM Support will send a CE or SSR onsite with a replacement canister, based on the appropriate part as determined using the svc_snap.
iii. The CE or SSR will work with IBM Support to reseat the canister. This step can take 30 minutes for each reseat attempt, which is usually done twice before a replacement is attempted.
iv. If the reseat brings the canister up, IBM Support will work with the customer to complete or back out the update. If the reseat fails, the canister will be replaced.
Step 4: Once the storage is at a steady state (either completely updated or rolled back), the fixpack stage can be resumed.
Fixed:
----------
N/A. V7000 and Flash900 firmware updates improve with each update and can reduce the possibility of encountering these issues.
|
|
KI007650
DPM upgrade failed because OPMDB runs out of transaction log space.
|
Fixpack |
I_V1.0.0.5
I_V1.1.0.1
|
DPM upgrade failed because OPMDB runs out of transaction log space.
|
Workaround:
-----------
Update the DPM database setting LOGSECOND to 200. The database must be started to change this setting, and restarted for the change to take effect. While this does not impact all customers, this step can be done for all customers prior to running the management update phase, or after the management phase has failed.
Note: This procedure will stop DPM. If this is done prior to the management phase failure, verify that DPM can be stopped for a short time. Do not attempt to do this while the fixpack is actively running without guidance from IBM Support. The following steps are tedious and complicated; read through the steps first, and if there is any concern about following them, contact IBM Support. If any of the steps do not match the expected output, also contact IBM Support.
Step 1: Log in as root to the management host.
Step 2: Verify the status of DPM. DPM should be returned to the same state once the LOGSECOND update is complete. Follow the instructions next to the status that matches.
a) Domain is online and DPM is online. Follow the substeps when the status matches.
$ hals -mgmt
MANAGEMENT DOMAIN
+============+===============+===============+===============+=================+=================+=============+
| COMPONENT | PRIMARY | STANDBY | CURRENT | OPSTATE | HA STATUS | RG REQUESTS |
+============+===============+===============+===============+=================+=================+=============+
| WASAPP | kf5hostname01 | N/A | N/A | Offline | Offline | - |
| DB2APP | kf5hostname01 | N/A | N/A | Offline | Offline | - |
| DPM | kf5hostname01 | kf5hostname03 | kf5hostname01 | Online | Normal | - |
| DB2DPM | kf5hostname01 | kf5hostname03 | kf5hostname01 | Online | Normal | - |
+============+===============+===============+===============+=================+=================+=============+
$ ssh kf5hostname01 'su - db2opm -c "db2 get db cfg for opmdb | grep LOGSECOND"' Number of secondary log files (LOGSECOND) = 120
$ ssh kf5hostname01 'su - db2opm -c "db2 update db cfg for opmdb using LOGSECOND 200"' DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully. SQL1363W One or more of the parameters submitted for immediate modification were not changed dynamically. For these configuration parameters, the database must be shutdown and reactivated before the configuration parameter changes become effective. $ ssh kf5hostname01 'su - db2opm -c "db2 connect to opmdb;db2 get db cfg for opmdb show detail| grep LOGSECOND"' Database Connection Information Database server = DB2/AIX64 10.5.10 SQL authorization ID = DB2OPM Local database alias = OPMDB Number of secondary log files (LOGSECOND) = 120 200
$ hastopdpm;hastartdpm Stopping DPM and DB2 instance............................Resources offline MANAGEMENT DOMAIN +============+===============+===============+===============+=================+=================+=============+ | COMPONENT | PRIMARY | STANDBY | CURRENT | OPSTATE | HA STATUS | RG REQUESTS | +============+===============+===============+===============+=================+=================+=============+ | WASAPP | kf5hostname01 | N/A | N/A | Offline | Offline | - | | DB2APP | kf5hostname01 | N/A | N/A | Offline | Offline | - | | DPM | kf5hostname01 | N/A | N/A | Offline | Offline | - | | DB2DPM | kf5hostname01 | N/A | N/A | Offline | Offline | - | +============+===============+===============+===============+=================+=================+=============+ Starting DPM and DB2 instance..............................Resources online MANAGEMENT DOMAIN +============+===============+===============+===============+=================+=================+=============+ | COMPONENT | PRIMARY | STANDBY | CURRENT | OPSTATE | HA STATUS | RG REQUESTS | +============+===============+===============+===============+=================+=================+=============+ | WASAPP | kf5hostname01 | N/A | N/A | Offline | Offline | - | | DB2APP | kf5hostname01 | N/A | N/A | Offline | Offline | - | | DPM | kf5hostname01 | kf5hostname03 | kf5hostname01 | Online | Normal | - | | DB2DPM | kf5hostname01 | kf5hostname03 | kf5hostname01 | Online | Normal | - | +============+===============+===============+===============+=================+=================+=============+
$ ssh kf5hostname01 'su - db2opm -c "db2 connect to opmdb;db2 get db cfg for opmdb show detail| grep LOGSECOND"' Database Connection Information Database server = DB2/AIX64 10.5.10 SQL authorization ID = DB2OPM Local database alias = OPMDB Number of secondary log files (LOGSECOND) = 200 200 b) Domain is offline.
$ hals -mgmt
none are available... returning
$ cat ~db2opm/sqllib/db2nodes.cfg
0 kf5hostname01 0 kf5hostname01
Check whether the DB2 instance is active:
$ ssh kf5hostname01 'su - db2opm -c "db2pd -"'
If the instance is active, the output is similar to the following; proceed to step b.4.
Database Member 0 -- Active -- Up 0 days 00:06:37 -- Date 2019-03-14-19.32.27.159573
If the instance is not active, the output is similar to the following; proceed to step b.3.
Unable to attach to database manager on member 0. Please ensure the following are true:
- db2start has been run for the member.
- db2pd is being run on the same physical machine as the member.
- DB2NODE environment variable setting is correct for the member or db2pd -mem setting is correct for the member.
$ ssh kf5hostname01 'su - db2opm -c "db2start"' 03/14/2019 19:36:30 0 0 SQL1063N DB2START processing was successful. SQL1063N DB2START processing was successful.
$ ssh kf5hostname01 'su - db2opm -c "db2 get db cfg for opmdb | grep LOGSECOND"' Number of secondary log files (LOGSECOND) = 120
$ ssh kf5hostname01 'su - db2opm -c "db2 update db cfg for opmdb using LOGSECOND 200"' DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully. SQL1363W One or more of the parameters submitted for immediate modification were not changed dynamically. For these configuration parameters, the database must be shutdown and reactivated before the configuration parameter changes become effective. $ ssh kf5hostname01 'su - db2opm -c "db2 connect to opmdb;db2 get db cfg for opmdb show detail| grep LOGSECOND"' Database Connection Information Database server = DB2/AIX64 10.5.10 SQL authorization ID = DB2OPM Local database alias = OPMDB Number of secondary log files (LOGSECOND) = 120 200
$ ssh kf5hostname01 'su - db2opm -c "db2stop"' 03/14/2019 19:43:16 0 0 SQL1064N DB2STOP processing was successful. SQL1064N DB2STOP processing was successful. (0) root @ kf5hostname01: 7.1.0.0: / $ ssh kf5hostname01 'su - db2opm -c "db2start"' 03/14/2019 19:43:26 0 0 SQL1063N DB2START processing was successful. SQL1063N DB2START processing was successful.
$ ssh kf5hostname01 'su - db2opm -c "db2 connect to opmdb;db2 get db cfg for opmdb show detail| grep LOGSECOND"' Database Connection Information Database server = DB2/AIX64 10.5.10 SQL authorization ID = DB2OPM Local database alias = OPMDB Number of secondary log files (LOGSECOND) = 200 200
$ ssh kf5hostname01 'su - db2opm -c "db2stop"' 03/14/2019 19:47:28 0 0 SQL1064N DB2STOP processing was successful. SQL1064N DB2STOP processing was successful.
Fixed:
----------
N/A.
|
|
KI007655
SSH Bad protocol 2 host key algorithms '+ssh-dss'. [ Added 2019-09-12 ]
|
Fixpack |
I_V1.0.0.6
I_V1.1.0.2
|
SSH Bad protocol 2 host key algorithms '+ssh-dss'.
In the V1.0.0.6 and V1.1.0.2 Readme Version 337 document, the section "STAGE 1 - Prerequisites" has an element d. that discusses updates to the /etc/ssh/sshd_config settings.
At issue are the settings for HostKeyAlgorithms and PubkeyAcceptedKeyTypes, which for an AIX 7.1 TL5 PDOA environment should include +ssh-dss. When updating from PDOA V1.0.0.5 IF01 or PDOA V1.1.0.1 IF01, the edits do not hurt ssh or sshd. On the AIX 7.1 TL3 levels of PDOA, such as V1.0.0.5 and V1.1.0.1 and earlier, this setting is invalid and will result in failures for the ssh client and the sshd server.
The readme document incorrectly includes these modifications in the Prerequisites section; they should only be done after the AIX levels are updated.
The challenges related to updating SSH which happens as part of the AIX 7.1 TL3 to AIX 7.1 TL5 update are documented in the following technote:
https://www.ibm.com/support/pages/ibm-aix-various-ssh-problems-after-upgrading-openssh-7x
Customers who have applied V1.0.0.5 IF01 or V1.1.0.1 IF01 will have experienced these SSH issues.
|
Workaround:
-----------
PermitRootLogin must still be set to 'yes' explicitly as part of the prerequisite steps for the update. This is required for the pflayer to function correctly during the fixpack.
Except for the PermitRootLogin step, undo the HostKeyAlgorithms/PubkeyAcceptedKeyTypes edits, and wait until after STAGE 6 for the management hosts and after STAGE 7 for the core hosts to update the /etc/ssh/ssh_config and /etc/ssh/sshd_config files, as in the sketch below.
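As a minimal sketch only (the authoritative entries are in the fixpack readme; the lines below are illustrative assumptions for an AIX 7.1 TL5 host after the relevant stage completes), the post-update edits and restart would look similar to the following:
# /etc/ssh/sshd_config additions (only after AIX 7.1 TL5 is in place; +ssh-dss is invalid on AIX 7.1 TL3):
PermitRootLogin yes
HostKeyAlgorithms +ssh-dss
PubkeyAcceptedKeyTypes +ssh-dss
# Restart sshd on AIX so the changes take effect.
stopsrc -s sshd; startsrc -s sshd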
Fixed:
----------
FP7_FP3 will not be impacted.
If there are future updates to the FP6_FP2 documentation this will be fixed at that time.
|
|
KI007641
0516-1734 chvg: Warning, savebase failed. Please manually run 'savebase' before rebooting. [ Added 2019-09-12 ]
|
Fixpack |
I_V1.0.0.6
I_V1.1.0.2
|
0516-1734 chvg: Warning, savebase failed. Please manually run 'savebase' before rebooting.
This issue can occur during Stage 8 Phase 6, "Remirror the rootvg volume group after the update is successful", at step e.
$ mirrorvg rootvg hdisk1
0516-1734 mklvcopy: Warning, savebase failed. Please manually run 'savebase' before rebooting.
(The mklvcopy warning above repeats 13 times.)
0516-1804 chvg: The quorum change takes effect immediately.
0516-1734 chvg: Warning, savebase failed. Please manually run 'savebase' before rebooting.
0516-1126 mirrorvg: rootvg successfully mirrored, user should perform bosboot of system to initialize boot records. Then, user must modify bootlist to include: hdisk0 hdisk1.
0516-1734 mirrorvg: Warning, savebase failed. Please manually run 'savebase' before rebooting. |
Workaround:
-----------
Run the savebase command after running the bosboot -a and bootlist commands, as in the sketch below.
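A minimal sketch, assuming rootvg is mirrored across hdisk0 and hdisk1 as in the mirrorvg output above; confirm the disk names for your environment before running:
# Rebuild the boot image on the boot disk.
bosboot -a -d /dev/hdisk0
# Ensure both mirrored disks are in the normal boot list.
bootlist -m normal hdisk0 hdisk1
# Manually save the device configuration base, as the warning requests, and check the return code.
savebase
echo $?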
Fixed:
----------
This appears to be a very rare error and no root cause or fix has been identified.
|
|
KI007660
appl_ctrl_net_adapter commands do not return any output. [ Added 2019-09-12 ]
|
Fixpack |
I_V1.0.0.6
I_V1.1.0.2
|
appl_ctrl_net_adapter commands do not return any output.
From Appendix J: Network Adapter Firmware Update.
The following commands may not always return any output.
$ $PL_ROOT/bin/icmds/appl_ctrl_net_adapter update -validate -l net_adapter8,net_adapter9,net_adapter0,net_adapter1 -f /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image
$ $PL_ROOT/bin/icmds/appl_ctrl_net_adapter update -install -l net_adapter8,net_adapter9,net_adapter0,net_adapter1 -f /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image
During testing, this was caused when the appliance filesystem /BCU_share was not NFS mounted from the management node on the impacted nodes.
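Given the cause observed during testing, a quick pre-check (a sketch reusing the dsh/mount pattern from elsewhere in this document; ${ALL} is assumed to cover the impacted nodes and 172.23.1.1 to be the management host's internal IP) is to confirm the NFS mount before re-running the update:
# Confirm /BCU_share is NFS-mounted on every node; remount from the management node where it is missing.
$ dsh -n ${ALL} 'mount | grep /BCU_share' | dshbak -c
$ dsh -n ${ALL} 'mount 172.23.1.1:/BCU_share /BCU_share'   # only on nodes where it is not mounted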
|
Workaround:
-----------
Run the verify command in the step immediately after, in this case step 5:
appl_ctrl_net_adapter query -l <adapter>
And/Or:
Verify the update by reviewing the log in $VAR_PL_ROOT/log/platform_layer.log on the management host.
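For example, a quick scan (a sketch; the "Update status" message text is taken from the log excerpts below) is:
$ grep "Update status" $VAR_PL_ROOT/log/platform_layer.log | tail -20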
Failed examples: In this example, /BCU_share was purposely unmounted to illustrate the issue and demonstrate the failure messages.
[16 May 2019 06:41:17,243] <8716478 CTRL DEBUG stgkf201> FCAdapterFW-apply: net_adapter9 Output: Update status: Failed to unpack the adapter firmware file /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on 172.23.2.63.
[16 May 2019 06:41:17,245] <8716478 CTRL DEBUG stgkf201> FCAdapterFW-apply: writing to STATUSFILE: net_adapter9==1==Update status: Failed to unpack the adapter firmware file /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on 172.23.2.63.
[16 May 2019 06:41:17,279] <10289372 CTRL DEBUG stgkf201> FCAdapterFW-apply: Status of server: server4, pid: 8716478 is 1
[16 May 2019 06:41:17,281] <10289372 CTRL DEBUG stgkf201> FCAdapterFW-apply: Convert status info file to ret hash
[16 May 2019 06:41:17,282] <10289372 CTRL DEBUG stgkf201> FCAdapterFW-apply: Processing line: net_adapter0==1==Update status: Failed to unpack the adapter firmware file /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on 172.23.2.65.
[16 May 2019 06:41:17,282] <10289372 CTRL DEBUG stgkf201> of file /tmp/status_adapters.txt
[16 May 2019 06:41:17,283] <10289372 CTRL DEBUG stgkf201> FCAdapterFW-apply: Processing line: net_adapter8==1==Update status: Failed to unpack the adapter firmware file /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on 172.23.2.63.
[16 May 2019 06:41:17,283] <10289372 CTRL DEBUG stgkf201> of file /tmp/status_adapters.txt
[16 May 2019 06:41:17,284] <10289372 CTRL DEBUG stgkf201> FCAdapterFW-apply: Processing line: net_adapter1==1==Update status: Failed to unpack the adapter firmware file /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on 172.23.2.65.
[16 May 2019 06:41:17,284] <10289372 CTRL DEBUG stgkf201> of file /tmp/status_adapters.txt
[16 May 2019 06:41:17,284] <10289372 CTRL DEBUG stgkf201> FCAdapterFW-apply: Processing line: net_adapter9==1==Update status: Failed to unpack the adapter firmware file /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on 172.23.2.63.
[16 May 2019 06:41:17,284] <10289372 CTRL DEBUG stgkf201> of file /tmp/status_adapters.txt
[16 May 2019 06:41:17,285] <10289372 CTRL DEBUG stgkf201> FCAdapterFW-apply: Deleting statusfile /tmp/status_adapters.txt
[16 May 2019 06:41:17,303] <10289372 CTRL DEBUG stgkf201> FCAdapterFW-apply: Contents of ret hash$VAR1 = {
[16 May 2019 06:41:17,303] <10289372 CTRL DEBUG stgkf201> 'net_adapter9' => {
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> 'success' => '1',
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> 'message' => 'Update status: Failed to unpack the adapter firmware file /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on 172.23.2.63.
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> '
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> },
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> 'net_adapter0' => {
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> 'success' => '1',
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> 'message' => 'Update status: Failed to unpack the adapter firmware file /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on 172.23.2.65.
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> '
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> },
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> 'net_adapter1' => {
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> 'success' => '1',
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> 'message' => 'Update status: Failed to unpack the adapter firmware file /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on 172.23.2.65.
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> '
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> },
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> 'net_adapter8' => {
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> 'success' => '1',
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> 'message' => 'Update status: Failed to unpack the adapter firmware file /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on 172.23.2.63.
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> '
[16 May 2019 06:41:17,304] <10289372 CTRL DEBUG stgkf201> }
[16 May 2019 06:41:17,305] <10289372 CTRL DEBUG stgkf201> };
Successful Examples:
[16 May 2019 07:12:41,834] <11600048 CTRL DEBUG stgkf201> FCAdapterFW-apply: Convert status info file to ret hash
[16 May 2019 07:12:41,835] <11600048 CTRL DEBUG stgkf201> FCAdapterFW-apply: Processing line: net_adapter8==0==Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent2(172.23.2.63).Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent3(172.23.2.63).
[16 May 2019 07:12:41,835] <11600048 CTRL DEBUG stgkf201> of file /tmp/status_adapters.txt
[16 May 2019 07:12:41,835] <11600048 CTRL DEBUG stgkf201> FCAdapterFW-apply: Processing line: net_adapter0==0==Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent0(172.23.2.65).Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent1(172.23.2.65).
[16 May 2019 07:12:41,835] <11600048 CTRL DEBUG stgkf201> of file /tmp/status_adapters.txt
[16 May 2019 07:12:41,836] <11600048 CTRL DEBUG stgkf201> FCAdapterFW-apply: Processing line: net_adapter9==0==Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent0(172.23.2.63).Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent1(172.23.2.63).
[16 May 2019 07:12:41,836] <11600048 CTRL DEBUG stgkf201> of file /tmp/status_adapters.txt
[16 May 2019 07:12:41,837] <11600048 CTRL DEBUG stgkf201> FCAdapterFW-apply: Processing line: net_adapter1==0==Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent2(172.23.2.65).Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent3(172.23.2.65).
[16 May 2019 07:12:41,837] <11600048 CTRL DEBUG stgkf201> of file /tmp/status_adapters.txt
[16 May 2019 07:12:41,838] <11600048 CTRL DEBUG stgkf201> FCAdapterFW-apply: Deleting statusfile /tmp/status_adapters.txt
[16 May 2019 07:12:41,857] <11600048 CTRL DEBUG stgkf201> FCAdapterFW-apply: Contents of ret hash$VAR1 = {
[16 May 2019 07:12:41,857] <11600048 CTRL DEBUG stgkf201> 'net_adapter9' => {
[16 May 2019 07:12:41,857] <11600048 CTRL DEBUG stgkf201> 'success' => '0',
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> 'message' => 'Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent0(172.23.2.63).Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent1(172.23.2.63).
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> '
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> },
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> 'net_adapter0' => {
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> 'success' => '0',
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> 'message' => 'Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent0(172.23.2.65).Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent1(172.23.2.65).
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> '
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> },
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> 'net_adapter1' => {
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> 'success' => '0',
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> 'message' => 'Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent2(172.23.2.65).Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent3(172.23.2.65).
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> '
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> },
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> 'net_adapter8' => {
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> 'success' => '0',
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> 'message' => 'Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent2(172.23.2.63).Update status: The adapter update is successful for the adapter firmware image /BCU_share/FP6_FP2/firmware/net_adapter/a21910071410d103/image/a21910071410d103.0400401800009.aix.rpm on ent3(172.23.2.63).
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> '
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> }
[16 May 2019 07:12:41,858] <11600048 CTRL DEBUG stgkf201> };
Fixed:
----------
N/A.
|
|
KI007616
scp fails to complete on some files to and from PDOA AIX servers. [ Added 2019-09-12 ]
|
Fixpack |
I_V1.0.0.6
I_V1.1.0.2
|
scp fails to complete on some files to and from PDOA AIX servers.
When attempting to use scp to copy files into and out of PDOA AIX hosts after updating to IF01, V1.0.0.6, V1.1.0.2, the scp command will run for a short time and then fail on certain files and succeed on others.
This symptom matches this APAR found in AIX 7.2 https://www-01.ibm.com/support/entdocview.wss?uid=isg1IJ03680.
When debugging the case where AIX is the target server, running sshd in debug mode on port 9022 produces the following output when a failing scp command connects to that port.
# As root on any PDOA AIX server run the following. This requires port 9022 to be available. Then connect to that port with a client scp session that fails.
$(which sshd) -o Port=9022 -ddd
debug2: channel 0: rcvd adjust 98304
ssh_packet_send: invalid format
debug1: do_cleanup
debug1: audit event euid 0 user root event 12 (SSH_connabndn)
debug1: Return Val-1 for auditproc:0
|
Workaround:
-----------
If there are failures during scp activities to or from the AIX hosts, add the following to the ssh command options: '-o Compression=off'.
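For example (an illustrative invocation; the file path and host address are placeholders):
$ scp -o Compression=off /tmp/testfile root@172.23.1.1:/tmp/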
Fixed:
----------
Unknown at this time.
|
|
KI007598
apply: storage1: apply failed
|
Fixpack |
I_V1.1.0.2
|
apply: storage1: apply failed
During Stage 4, we have seen a Drive Module failure occur while attempting to apply Flash900 firmware 1.4.7.1 to a Flash900 enclosure. This is not a common error.
pflayer log entries may include lines such as the following.
$ grep "apply: " platform_layer.trace.6 [18 Apr 2018 07:28:14,786] <5701764 CTRL DEBUG kf5hostname01> apply: 172.23.1.209: waiting for upgrade to complete, iteration <4>: update_status=<upgrading 23> [18 Apr 2018 07:28:16,956] <4129530 CTRL DEBUG kf5hostname01> apply: 172.23.1.205: waiting for upgrade to complete, iteration <4>: update_status=<upgrading 23> [18 Apr 2018 07:33:16,374] <5701764 CTRL DEBUG kf5hostname01> apply: 172.23.1.209: waiting for upgrade to complete, iteration <5>: update_status=<upgrading 23> [18 Apr 2018 07:33:20,170] <4129530 CTRL DEBUG kf5hostname01> apply: 172.23.1.205: waiting for upgrade to complete, iteration <5>: update_status=<upgrading 23> [18 Apr 2018 07:38:18,915] <5701764 CTRL DEBUG kf5hostname01> apply: 172.23.1.209: waiting for upgrade to complete, iteration <6>: update_status=<committing 47> [18 Apr 2018 07:38:22,543] <4129530 CTRL DEBUG kf5hostname01> apply: 172.23.1.205: waiting for upgrade to complete, iteration <6>: update_status=<upgrading 23> [18 Apr 2018 07:43:22,314] <5701764 CTRL DEBUG kf5hostname01> apply: 172.23.1.209: waiting for upgrade to complete, iteration <7>: update_status=<updating_hardware 50> [18 Apr 2018 07:43:25,034] <4129530 CTRL DEBUG kf5hostname01> apply: 172.23.1.205: waiting for upgrade to complete, iteration <7>: update_status=<upgrading 23> [18 Apr 2018 07:48:24,577] <5701764 CTRL DEBUG kf5hostname01> apply: 172.23.1.209: waiting for upgrade to complete, iteration <8>: update_status=<updating_hardware 56> [18 Apr 2018 07:48:27,757] <4129530 CTRL DEBUG kf5hostname01> apply: 172.23.1.205: waiting for upgrade to complete, iteration <8>: update_status=<committing 47> [18 Apr 2018 07:53:26,636] <5701764 CTRL DEBUG kf5hostname01> apply: 172.23.1.209: waiting for upgrade to complete, iteration <9>: update_status=<updating_hardware 77> [18 Apr 2018 07:53:30,571] <4129530 CTRL DEBUG kf5hostname01> apply: 172.23.1.205: waiting for upgrade to complete, iteration <9>: update_status=<hardware_failed 50> [18 Apr 2018 07:58:29,505] <5701764 CTRL DEBUG kf5hostname01> apply: 172.23.1.209: waiting for upgrade to complete, iteration <10>: update_status=<updating_hardware 79> [18 Apr 2018 07:58:30,572] <4129530 CTRL DEBUG kf5hostname01> apply: 172.23.1.205: broke out of timed wait after 10 iterations of maximum 48. update status is <hardware_failed 50> [18 Apr 2018 07:58:30,573] <4129530 CTRL DEBUG kf5hostname01> Extracted msg from NLS: apply: 172.23.1.205 Error: The update status of the end point is <hardware_failed 50>. [18 Apr 2018 07:58:30,573] <4129530 CTRL DEBUG kf5hostname01> apply: 172.23.1.205: error: error state, update status is <hardware_failed 50> [18 Apr 2018 07:58:30,616] <3146728 CTRL DEBUG kf5hostname01> apply: storage1: apply failed [18 Apr 2018 08:03:32,417] <5701764 CTRL DEBUG kf5hostname01> apply: 172.23.1.209: waiting for upgrade to complete, iteration <11>: update_status=<updating_hardware 85> (0) root @ kf5hostname01: 7.1.0.0: /BCU_share/aixappl/pflayer/log [18 Apr 2018 07:58:29,504] <5701764 CTRL TRACE kf5hostname01> Exiting Ctrl::Updates::Storage::get_update_status } [18 Apr 2018 07:58:29,505] <5701764 CTRL DEBUG kf5hostname01> apply: 172.23.1.209: waiting for upgrade to complete, iteration <10>: update_status=<updating_hardware 79> [18 Apr 2018 07:58:30,572] <4129530 CTRL DEBUG kf5hostname01> apply: 172.23.1.205: broke out of timed wait after 10 iterations of maximum 48. 
update status is <hardware_failed 50> [18 Apr 2018 07:58:30,573] <4129530 CTRL DEBUG kf5hostname01> Extracted msg from NLS: apply: 172.23.1.205 Error: The update status of the end point is <hardware_failed 50>. [18 Apr 2018 07:58:30,573] <4129530 CTRL DEBUG kf5hostname01> apply: 172.23.1.205: error: error state, update status is <hardware_failed 50> [18 Apr 2018 07:58:30,616] <3146728 CTRL DEBUG kf5hostname01> apply: storage1: apply failed Accessing the Flash900 CLI via ssh may reveal the following. The event log shows 'Array rebuild complete' and 'Array mdisk is not protected by sufficient spares' right after the log shows the update failure.
BM_FlashSystem:Flash_00:admin>lseventlog sequence_number last_timestamp object_type object_id object_name copy_id status fixed event_id error_code description secondary_object_type secondary_object_id 102 150901084131 node 2 node2 message no 980349 Node added 103 150901090631 enclosure 0 message no 988113 Internal hardware update completed 114 150903035055 drive 0 message no 988024 Flash module format complete 115 150903035055 drive 4 message no 988024 Flash module format complete 116 150903035100 drive 1 message no 988024 Flash module format complete 117 150903035100 drive 9 message no 988024 Flash module format complete 118 150903035100 drive 2 message no 988024 Flash module format complete 119 150903035100 drive 5 message no 988024 Flash module format complete 120 150903035100 drive 6 message no 988024 Flash module format complete 121 150903035100 drive 7 message no 988024 Flash module format complete 122 150903035100 drive 8 message no 988024 Flash module format complete 123 150903035105 drive 3 message no 988024 Flash module format complete 124 160111081546 cluster Flash_00 message no 980506 Update prepared 125 160111082846 node 2 node2 message no 980349 Node added 126 160111084312 node 1 node1 message no 980349 Node added 9000005 161003035616 enclosure 2 alert no 045071 1037 Unsupported canister combination 9000006 161003035656 node 1 node1 alert no 077187 1092 Temperature critical threshold exceeded 147 170102161358 cluster Flash_00 message no 980506 Update prepared 148 170102162543 node 1 node1 message no 980349 Node added 151 170102163830 node 2 node2 message no 980349 Node added 156 170213164006 node 2 node2 message no 980349 Node added 159 180418070758 cluster Flash_00 message no 980506 Update prepared 160 180418072508 node 2 node2 message no 980349 Node added 164 180418074545 node 1 node1 message no 980349 Node added 165 180418075055 enclosure 1 alert no 085048 2060 Reconditioning of batteries required 166 180418075055 enclosure 1 alert no 085048 2060 Reconditioning of batteries required 167 180418075220 enclosure 0 alert no 085118 2010 Update process failed 168 180418083654 mdisk 0 array0 message no 988023 Array rebuild complete 169 180418083654 mdisk 0 array0 alert no 085031 1690 Array mdisk is not protected by sufficient spares IBM_FlashSystem:Flash_00:admin>lseventlog 167 sequence_number 167 first_timestamp 180418075220 first_timestamp_epoch 1524048740 last_timestamp 180418075220 last_timestamp_epoch 1524048740 object_type enclosure object_id 0 object_name copy_id reporting_node_id 1 reporting_node_name node1 root_sequence_number event_count 1 status alert fixed no auto_fixed no notification_type error event_id 085118 event_id_text System update halted error_code 2010 error_code_text Update process failed machine_type 9840AE2 serial_number 1351337 FRU None fixed_timestamp fixed_timestamp_epoch callhome_type hardware sense1 01 00 31 33 35 31 33 33 37 00 00 00 00 00 00 00 sense2 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 sense3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 sense4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 sense5 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 sense6 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 sense7 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 sense8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 secondary_object_type secondary_object_id IBM_FlashSystem:Flash_00:admin> IBM_FlashSystem:Flash_00:superuser>lsenclosureslot enclosure_id slot_id port_1_status port_2_status drive_present drive_id 1 1 offline offline no 1 2 online 
online yes 0 1 3 online online yes 1 1 4 online online yes 1 5 online online yes 3 1 6 online online yes 4 1 7 online online yes 5 1 8 online online yes 6 1 9 online online yes 7 1 10 online online yes 8 1 11 online online yes 9 1 12 offline offline no
|
Workaround:
-----------
Immediately contact IBM Support.
Consult the MustGather link at the end of this document for any relevant data required.
Fixed:
----------
N/A.
|
|
KI007658
Missing license gpfs.license.std
|
Fixpack |
I_V1.1.0.2
I_V1.0.0.6
|
Missing license gpfs.license.std
During the application of the PDOA fixpack, in the readme's Appendix F instructions for GPFS updates (part of Stage 6 and Stage 8), the following message will appear when running installp to update GPFS.
stgkf203: FAILURES
stgkf203: --------
stgkf203: Filesets listed in this section failed pre-installation verification
stgkf203: and will not be installed.
stgkf203:
stgkf203: Requisite Failures
stgkf203: ------------------
stgkf203: SELECTED FILESETS: The following is a list of filesets that you asked to
stgkf203: install. They cannot be installed until all of their requisite filesets
stgkf203: are also installed. See subsequent lists for details of requisites.
stgkf203:
stgkf203: gpfs.license.std 4.2.3.0 # IBM Spectrum Scale Standard ...
stgkf203:
stgkf203: MISSING REQUISITES: The following filesets are required by one or more
stgkf203: of the selected filesets listed above. They are not currently installed
stgkf203: and could not be found on the installation media.
stgkf203:
stgkf203: gpfs.license.std 4.2.0.0 # Base Level Fileset
stgkf203:
stgkf203: << End of Failure Section >>

This is consistent with what we experienced in the lab during testing. The rest of the installation log should show success for the remaining filesets, and the verification steps with lslpp -l should display output consistent with the documentation.
stgkf201: Name Level Part Event Result
stgkf201: -------------------------------------------------------------------------------
stgkf201: gpfs.msg.en_US 4.2.3.0 USR APPLY SUCCESS
stgkf201: gpfs.base 4.2.3.0 USR APPLY SUCCESS
stgkf201: gpfs.base 4.2.3.0 ROOT APPLY SUCCESS
stgkf201: gpfs.ext 4.2.3.0 USR APPLY SUCCESS
stgkf201: gpfs.docs.data 4.2.3.0 SHARE APPLY SUCCESS
stgkf201: gpfs.gskit 8.0.50.75 USR APPLY SUCCESS

$ dsh -n ${BCUMGMT},${BCUMGMTSTDBY} 'lslpp -l "*gpfs*"' | dshbak -c
HOSTS -------------------------------------------------------------------------
kf5hostname01, kf5hostname03
-------------------------------------------------------------------------------
Fileset Level State Description
----------------------------------------------------------------------------
Path: /usr/lib/objrepos
gpfs.base 4.2.3.7 APPLIED GPFS File Manager
gpfs.ext 4.2.3.7 APPLIED GPFS Extended Features
gpfs.gskit 8.0.50.75 APPLIED GPFS GSKit Cryptography Runtime
gpfs.msg.en_US 4.2.3.6 APPLIED GPFS Server Messages - U.S. English
Path: /etc/objrepos
gpfs.base 4.2.3.7 APPLIED GPFS File Manager
Path: /usr/share/lib/objrepos
gpfs.docs.data 4.2.3.6 APPLIED GPFS Server Manpages and Documentation

Proceed with the update as long as the verification steps are consistent.
|
Workaround:
-----------
N/A
Fixed:
----------
N/A
|
|
KI007677
ssh Connection reset by peer to BNT switches [ Added 2019-09-17 ]
|
General | I_V1.1 |
ssh Connection reset by peer to BNT switches [ Added 2019-09-17 ]
This can lead to miauth failures when attempting to change the password on the BNT switches or simply accessing the switches via ssh on the command line.
When trying to connect via ssh from the root user on the management host to one of the BNT network switches in a PDOA environment, ssh can fail with errors similar to below either on the ssh command line or in the platform layer log files that use ssh.
$ ssh -o Protocol=2,1 admin@172.23.1.254
Read from socket failed: Connection reset by peer

In PDOA V1.1 environments from GA to FP2, the /etc/ssh/ssh_config setting Protocol=2,1 leads to connection issues to the BNT switches.
On the same system the following works:
$ ssh -o Protocol=2 admin@172.23.1.254
Enter login password:
IBM Networking Operating System RackSwitch G8052.
|
Workaround:
-----------
Update /etc/ssh/ssh_config to set the value of Protocol to 2, or comment out the value and allow the ssh client to follow its default behavior.
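To illustrate the edit (a minimal sketch; the exact contents of /etc/ssh/ssh_config vary by system):
$ grep "^Protocol" /etc/ssh/ssh_config
Protocol 2,1
### Change the line to read "Protocol 2", or comment it out, then verify the connection:
$ ssh admin@172.23.1.254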
Fixed:
----------
PDOA was not shipped with this value uncommented, so as fixpacks are applied the default behavior of ssh clients in PDOA will change due to updates in the security defaults.
|
|
KI007679
FP6_FP2 Readme Appendix I is not clear about Installation Manager removal [ Added 2019-09-20 ]
|
Fixpack |
I_V1.0.0.6
I_V1.1.0.2
|
FP6_FP2 Readme Appendix I is not clear about Installation Manager removal [ Added 2019-09-20 ]
There are two issues with the Appendix that have been identified.
The first issue is related to formatting.
In step 'v.', the following instruction appears as a single run-on line in the document instead of as a command followed by its output. The documented standard is that commands appear in bold.
$ /opt/IBM/InstallationManager/eclipse/tools/imcl listInstalledPackages
com.ibm.cic.agent_1.8.5000.20160506_1125
The second issue is that the command to remove Installation Manager is missing.
|
Workaround:
-----------
This step can be performed at any time during or after the fixpack has completed.
There is no outage required and no impact on the appliance function.
Issue #1:
The command is:
$ /opt/IBM/InstallationManager/eclipse/tools/imcl listInstalledPackages
The output is:
com.ibm.cic.agent_1.8.5000.20160506_1125
Issue #2:
As the root user on the management host, run the following command (shown in bold):
$ time /opt/IBM/InstallationManager/eclipse/tools/imcl uninstall com.ibm.cic.agent_1.8.5000.20160506_1125
Uninstalled com.ibm.cic.agent_1.8.5000.20160506_1125 from the /opt/IBM/InstallationManager/eclipse directory.
real 0m29.70s
user 0m4.03s
sys 0m0.43s
### Verify that Installation Manager is uninstalled.
$ ls -la /opt/IBM/InstallationManager
ls: 0653-341 The file /opt/IBM/InstallationManager does not exist.
Fixed:
----------
N/A.
|
|
KI007473
Mail to root user with subject "Electronic Service Agent not" received on AIX hosts. [ Added 2019-09-23 ]
|
General
Fixpack
|
I_V1.0
I_V1.0.0.5_IF01
I_V1.0.0.6
I_V1.1.0.1_IF01
I_V1.1.0.2
|
Mail to root user with subject "Electronic Service Agent not" received on AIX hosts.
The root user on all AIX hosts may receive the following e-mail:
Message 13:
From esaadmin Sun Aug 26 03:01:01 2018
Date: Sun, 26 Aug 2018 03:01:01 -0400
From: esaadmin
To: root
Subject: Electronic Service Agent not activated

Electronic Service Agent has not been activated. To activate Electronic Service Agent, do the following: From the SMIT main menu, select Electronic Service Agent, then select Configure Electronic Service Agent.
For information about Electronic Service Agent, including the benefits of activating it, see the following:
http://publib.boulder.ibm.com/infocenter/eserver/v1r2/topic/eicbd/eicbdkickoff.htm

To discontinue this periodic reminder message, execute the command /usr/esa/sbin/rmESAReminder.
The e-mail originates from an AIX component called the Electronic Service Agent. This is documented in the following Knowledge Center for AIX 7.1 link. https://www.ibm.com/support/knowledgecenter/en/ssw_aix_71/electronicserviceagent/eicbdkickoff.html
PDOA includes the bos.esagent filesets.
In V1.0 systems the esaadmin user was also defined as was the cron job associated with that user that generated the above mail.
In V1.1 systems the esaadmin user was removed.
For customers who resolved this issue before applying a fixpack that includes an AIX 7.1 TL5 based update (IF01/FP6_FP2), it is believed that when these updates are applied the esaadmin user is recreated and the cron job that notifies root that the Electronic Service Agent is not configured is reset.
The following commands can be used to check the appliance state:
$ dsh -n ${ALL} 'lslpp -l bos.esagent' | dshbak -c
HOSTS -------------------------------------------------------------------------
reverseflash01, reverseflash02, reverseflash03, reverseflash04, reverseflash05, reverseflash06
-------------------------------------------------------------------------------
Fileset Level State Description
----------------------------------------------------------------------------
Path: /usr/lib/objrepos
bos.esagent 7.1.5.0 COMMITTED Electronic Service Agent
Path: /etc/objrepos
bos.esagent 7.1.5.0 COMMITTED Electronic Service Agent
$ dsh -n ${ALL} 'crontab -l esaadmin' | dshbak -c
HOSTS -------------------------------------------------------------------------
reverseflash01, reverseflash02, reverseflash03, reverseflash04, reverseflash05, reverseflash06
-------------------------------------------------------------------------------
0 3 * * 0 /usr/esa/sbin/esa_awareness

$ dsh -n ${ALL} 'mail -H | grep -c "Electronic Service Agent not"' | sort
reverseflash01: 7
reverseflash02: 20
reverseflash03: 20
reverseflash04: 20
reverseflash05: 20
reverseflash06: 20
|
Workaround:
-----------
The Electronic Service Agent is not configured or needed on PDOA environments, as PDOA uses the HMCs for alerting and call home. To remediate this issue, follow the instructions in the e-mail to disable the reminder.
$ dsh -n ${ALL} '/usr/esa/sbin/rmESAReminder'
reverseflash01: ...checking for user esaadmin reverseflash01: esaadmin id=12 pgrp=system groups=system,staff home=/var/esa shell=/usr/bin/ksh login=false su=true rlogin=false daemon=true admin=true sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=22 registry=files SYSTEM=compat logintimes= loginretries=0 pwdwarntime=0 account_locked=false minage=0 maxage=0 maxexpired=-1 minalpha=0 minloweralpha=0 minupperalpha=0 minother=0 mindigit=0 minspecialchar=0 mindiff=0 maxrepeats=8 minlen=0 histexpire=0 histsize=0 pwdchecks= dictionlist= default_roles=SysConfig efs_initialks_mode=admin efs_keystore_algo=RSA_2048 efs_keystore_access=file efs_adminks_access=file efs_allowksmodechangebyuser=true efs_file_algo=AES_128_CBC fsize=-1 cpu=-1 data=-1 stack=393216 core=-1 rss=-1 nofiles=-1 stack_hard=393216 roles=SysConfig reverseflash01: ...checking existence of crontab file reverseflash01: ...removing the crontab entry reverseflash06: ...checking for user esaadmin reverseflash06: esaadmin id=12 pgrp=system groups=system,staff home=/var/esa shell=/usr/bin/ksh login=false su=true rlogin=false daemon=true admin=true sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=22 registry=files SYSTEM=compat logintimes= loginretries=0 pwdwarntime=0 account_locked=false minage=0 maxage=0 maxexpired=-1 minalpha=0 minloweralpha=0 minupperalpha=0 minother=0 mindigit=0 minspecialchar=0 mindiff=0 maxrepeats=8 minlen=0 histexpire=0 histsize=0 pwdchecks= dictionlist= default_roles=SysConfig fsize=-1 cpu=-1 data=-1 stack=393216 core=-1 rss=-1 nofiles=-1 stack_hard=393216 roles=SysConfig reverseflash06: ...checking existence of crontab file reverseflash06: ...removing the crontab entry reverseflash02: ...checking for user esaadmin reverseflash02: esaadmin id=12 pgrp=system groups=system,staff home=/var/esa shell=/usr/bin/ksh login=false su=true rlogin=false daemon=true admin=true sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=22 registry=files SYSTEM=compat logintimes= loginretries=0 pwdwarntime=0 account_locked=false minage=0 maxage=0 maxexpired=-1 minalpha=0 minloweralpha=0 minupperalpha=0 minother=0 mindigit=0 minspecialchar=0 mindiff=0 maxrepeats=8 minlen=0 histexpire=0 histsize=0 pwdchecks= dictionlist= default_roles=SysConfig fsize=-1 cpu=-1 data=-1 stack=393216 core=-1 rss=-1 nofiles=-1 stack_hard=393216 roles=SysConfig reverseflash02: ...checking existence of crontab file reverseflash02: ...removing the crontab entry reverseflash03: ...checking for user esaadmin reverseflash03: esaadmin id=12 pgrp=system groups=system,staff home=/var/esa shell=/usr/bin/ksh login=false su=true rlogin=false daemon=true admin=true sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=22 registry=files SYSTEM=compat logintimes= loginretries=0 pwdwarntime=0 account_locked=false minage=0 maxage=0 maxexpired=-1 minalpha=0 minloweralpha=0 minupperalpha=0 minother=0 mindigit=0 minspecialchar=0 mindiff=0 maxrepeats=8 minlen=0 histexpire=0 histsize=0 pwdchecks= dictionlist= default_roles=SysConfig fsize=-1 cpu=-1 data=-1 stack=393216 core=-1 rss=-1 nofiles=-1 stack_hard=393216 roles=SysConfig reverseflash03: ...checking existence of crontab file reverseflash03: ...removing the crontab entry reverseflash04: ...checking for user esaadmin reverseflash04: esaadmin id=12 pgrp=system groups=system,staff home=/var/esa shell=/usr/bin/ksh login=false su=true rlogin=false daemon=true admin=true sugroups=ALL admgroups= 
tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=22 registry=files SYSTEM=compat logintimes= loginretries=0 pwdwarntime=0 account_locked=false minage=0 maxage=0 maxexpired=-1 minalpha=0 minloweralpha=0 minupperalpha=0 minother=0 mindigit=0 minspecialchar=0 mindiff=0 maxrepeats=8 minlen=0 histexpire=0 histsize=0 pwdchecks= dictionlist= default_roles=SysConfig fsize=-1 cpu=-1 data=-1 stack=393216 core=-1 rss=-1 nofiles=-1 stack_hard=393216 roles=SysConfig reverseflash04: ...checking existence of crontab file reverseflash04: ...removing the crontab entry reverseflash05: ...checking for user esaadmin reverseflash05: esaadmin id=12 pgrp=system groups=system,staff home=/var/esa shell=/usr/bin/ksh login=false su=true rlogin=false daemon=true admin=true sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=22 registry=files SYSTEM=compat logintimes= loginretries=0 pwdwarntime=0 account_locked=false minage=0 maxage=0 maxexpired=-1 minalpha=0 minloweralpha=0 minupperalpha=0 minother=0 mindigit=0 minspecialchar=0 mindiff=0 maxrepeats=8 minlen=0 histexpire=0 histsize=0 pwdchecks= dictionlist= default_roles=SysConfig fsize=-1 cpu=-1 data=-1 stack=393216 core=-1 rss=-1 nofiles=-1 stack_hard=393216 roles=SysConfig reverseflash05: ...checking existence of crontab file reverseflash05: ...removing the crontab entry $ dsh -n ${ALL} 'crontab -l esaadmin' 2>&1 | dshbak -c
HOSTS -------------------------------------------------------------------------
reverseflash01, reverseflash02, reverseflash03, reverseflash04, reverseflash05, reverseflash06
-------------------------------------------------------------------------------
0481-103 Cannot open a file in the /var/spool/cron/crontabs directory.
A file or directory in the path name does not exist.

Verify the esaadmin user details and that the esaadmin user is not allowed to log in.
$ dsh -n ${ALL} 'lsuser -a login rlogin esaadmin' | dshbak -c
HOSTS -------------------------------------------------------------------------
reverseflash01, reverseflash02, reverseflash03, reverseflash04, reverseflash05, reverseflash06
-------------------------------------------------------------------------------
esaadmin login=false rlogin=false

Questions:
Can the esaadmin user be removed? In V1.1 this user was removed and FP1/FP2 did not appear to be impacted, nor did runtimes. So it should be safe to remove this user.
Can the bos.esagent filesets be removed? The effect of removing these filesets has not been verified within our appliance. However, there is documentation in the AIX Knowledge Center describing these steps. PDOA has not tested whether the removal of these filesets will prevent the issue from coming back after an AIX update is performed. Since this is a documented AIX procedure and PDOA does not use the user nor the package, it should be safe to remove.
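If removal is chosen, the standard AIX commands would be as follows (a sketch; as noted above, PDOA has not tested the fileset removal, so validate with IBM Support first):
$ dsh -n ${ALL} 'rmuser -p esaadmin'
$ dsh -n ${ALL} 'installp -u bos.esagent'
Fixed: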
----------
N/A. Follow the workaround as needed.
|
|
KI007701
IBM PureData System for Operational Analytics V1.1 Appliances May experience an Internal Raid Card Failure During A Power Cycle [ Added 2019-10-18 ]
|
General
Fixpack
|
I_V1.1
I_V1.1.0.1
I_V1.1.0.2
|
IBM PureData System for Operational Analytics V1.1 Appliances May experience an Internal Raid Card Failure During A Power Cycle
The most common symptom is an LPAR fails to boot after a power cycle of the LPAR or CEC. The raid adapter will appear as missing.
See this technote for more information about this issue:
https://www.ibm.com/support/pages/node/1088866
|
Workaround:
-----------
The following technote illustrates how to reduce the risk of this issue prior to FP7_FP3. https://www.ibm.com/support/pages/node/1088866
Fixed:
----------
FP7_FP3
|
|
KIG00052
FP7_FP3 gen_update_script.sh does not compare version places with double digits correctly. [ Added 2020-02-14 ]
|
Fixpack |
I_V1.0.0.7
I_V1.1.0.3
|
FP7_FP3 gen_update_script.sh does not compare version places with double digits correctly.
From Stage 6 Phase 3 in the FP7_FP3 Readme, when directed to Appendix H - DB2 Update, the script gen_update_script.sh does not generate the mgmtfixpack.sh or corefixpack.sh files.
This is due to an issue with the version comparison algorithm: it does not correctly compare a double-digit value against a single-digit value in the same position of the DB2 version string.
For example, for DB2 10.5 updates from V1.1 FP1 to V1.1 FP3 (unpack_10.5.0.5..1 to unpack_10.5.0.10..6), the comparison will not recognize that 10.5.0.10 is higher than 10.5.0.5.
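The behavior can be reproduced with a plain string comparison in ksh, which evaluates the version text character by character (a minimal sketch for illustration, not the actual gen_update_script.sh logic):
$ a=10.5.0.5; b=10.5.0.10
$ [[ ${b} > ${a} ]] && echo "${b} sorts higher" || echo "${b} does NOT sort higher"
10.5.0.10 does NOT sort higher
$ ### Comparing the final place numerically gives the correct answer:
$ [[ ${b##*.} -gt ${a##*.} ]] && echo "${b} is higher"
10.5.0.10 is higher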
|
Workaround:
-----------
This issue should only impact DB2 10.5 customers.
Use the following command for 'mgmtfixpack.sh' for DB2 10.5.0.10..6 updates to the management host.
$ cat mgmtfixpack.sh
/BCU_share/FP7_FP3/software/DB2/unpack_10.5.0.10..6/universal/installFixPack -n -b /usr/IBM/dwe/mgmt_db2/V10.5 -c /BCU_share/FP7_FP3/software/DB2/unpack_nlpack_10.5.0.10..0/nlpack -f NOTSAMP -f update -t /tmp/$(hostname)_db2_10.5.0.5..1_$(date +%Y%m%d_%H%M%S).trc -n

The corefixpack.sh is a little more complicated as it varies depending on the source path and the DB2 version. Replace <current db2 10.5 copy> with the copy directory associated with your core instance name.
$ cat corefixpack.sh
/BCU_share/FP7_FP3/software/DB2/unpack_10.5.0.10..6/universal/installFixPack -n -b <current db2 10.5 copy> -p /usr/IBM/dwe/db2/V10.5.0.10..6 -c /BCU_share/FP7_FP3/software/DB2/unpack_nlpack_10.5.0.10..0/nlpack -f NOTSAMP -f update -t /tmp/$(hostname)_db2_10.5.0.10..0_$(date +%Y%m%d_%H%M%S).trc -n

In the example below, the copy would be /usr/IBM/dwe/db2/V10.5.0.10..0. This represents the copy that the installer will use to determine the components and licenses to carry over to the updated db2 copy as part of the -p option.
$ dsh -n ${ALL} -e $(pwd)/scripts/get_db2_data.sh | dshbak -c
HOSTS -------------------------------------------------------------------------
reverseflash01, reverseflash03
-------------------------------------------------------------------------------
/usr/IBM/dwe/mgmt_db2/V10.5|10.5|10.5.0.10|10.5.0.10..0|db2opm,dweadmin|db2opm,dweadmin||
HOSTS -------------------------------------------------------------------------
reverseflash02, reverseflash04, reverseflash05, reverseflash06
-------------------------------------------------------------------------------
/usr/IBM/dwe/db2/V10.5.0.10..0|10.5|10.5.0.10|10.5.0.10..0|bcuaix|bcuaix||

OR
Contact IBM Support to obtain an updated gen_update_script.sh.
Backup the gen_update_script.sh file in /BCU_share/FP7_FP3/software/DB2/scripts and replace it with the script obtained from IBM Support.
Update the permissions to read and execute for the user.
chmod 500 gen_update_script.sh
Fixed:
----------
FP8_FP4
|
|
KIG00063
After the FP7_FP3 update the HMC call home settings are lost and call home tickets for HMC or Power Server issues are not submitted to IBM. (Added: 2020-03-12)
|
Fixpack |
I_V1.0.0.7
I_V1.1.0.3
|
After the FP7_FP3 update the HMC call home settings are lost and call home tickets for HMC or Power Server issues are not submitted to IBM. (Added: 2020-03-12)
During Stage 3 of the PDOA FP7_FP3 fixpack the HMC FW level is updated to V9R1.930.0. Any time after this update, if there is a Call Home event, the information sent to IBM does not include the Solution MTM nor the Solution Serial number; instead, the device MTM and Serial number are sent. Without the Solution Serial Number and the Solution Machine Type information these tickets will fail entitlement and will not result in a HW ticket.
Other symptoms:
hscroot@dsshmc49:~> ls -l /opt/hsc/data/ISASconfig
ls: cannot access /opt/hsc/data/ISASconfig: No such file or directory
hscroot@dsshmc49:~> cat /opt/ccfw/data/ecc/hmc_ecc.properties
cat: /opt/ccfw/data/ecc/hmc_ecc.properties: No such file or directory
|
Workaround:
-----------
If this is part of a planning exercise before the fixpack, verify that you have a copy of the isas_config.xml file that was used to set up call home for the HMC.
This file may be found in one or more of the following locations:
Management host:
/pschome/config/isas_config.xml
HMC:
/hscroot/config/isas_config.xml
/opt/hsc/data/ISASconfig/isas_config.xml
/opt/sfp/data/service/ISASconfig/isas_config.xml
If you cannot find your isas_config.xml file then this can be generated with help from IBM Support.
Verify the contents of the isas_config.xml:
The file should include a solution entry for each Solution Serial.
Each solution entry should include the type (8279/8280), the model, and the solution serial number for all the components in that rack.
IBM Support can look up your solution serial numbers and rack assignments.
The following information will be needed if the isas support mode is disabled.
Work with IBM Support to verify that the hscpe user is configured and you know the password.
Work with IBM Support to verify that you know the root password.
Verify the HMC Serial Number and the current Date on both HMC servers.
Ensure the isas_config.xml is located in the hscroot user's home directory on both HMCs. This file can be copied from the management host as the root user to each HMC.
cd <path to xml file>
scp -p isas_config.xml hscroot@hmc:
### Replace hmc with ip addresses of the two HMCs.
Run this command to import the isas_config.xml file into each hmc.
ssh hscroot@hmc 'cpfile -t modelmap -l l -o import -f /home/hscroot/isas_config.xml'
### Replace hmc with the ip addresses of the two HMCs.
Verify that the file was imported. The directory where this file is imported has changed since PDOA FP6_FP2.
ssh hscroot@hmc1 'ls -la /opt/sfp/data/service/ISASconfig/isas_config.xml'
-rw-r--r-- 1 root sfp 948 Mar 12 15:00 /opt/sfp/data/service/ISASconfig/isas_config.xml

Verify that the isas support mode is enabled.
ssh hscroot@hmc 'ls -la /opt/sfp/data/sa/hmc_ecc.properties'
-rw-rw-rw- 1 sfp sfp 66 Feb 12 14:39 /opt/sfp/data/sa/hmc_ecc.properties
ssh hscroot@hmc 'cat /opt/sfp/data/sa/hmc_ecc.properties'
# This flag controls iSAS problem reporting mode
isasmode = False

If isasmode = False, then it is necessary to work with support to obtain the PESH password.
Once you have the PESH password do the following.
## As root on the management or through a putty session login as the hscpe user.
$ ssh hscpe@172.23.1.245
Password:
## Using the PESH password requested from support, enter PESH mode with pesh <HMC SERIAL>
hscpe@dsshmc49:~> pesh 840A7BD
Password:
## Login as root with 'su -'
[hscpe@dsshmc49 ~] $ su -
Password:
## Enable ISAS / PDOA reporting mode.
[root@dsshmc49 ~] # /opt/hsc/bin/chisascfg --mode enable
iSAS problem reporting mode enabled. Please reboot the HMC.
[root@dsshmc49 ~] # cat /opt/sfp/data/sa/hmc_ecc.properties
# This flag controls iSAS problem reporting mode
isasmode = True
## Reboot the HMC and work with support to submit a test ticket to verify tickets are working again.
Fixed:
----------
N/A.
|
|
KIG00064
PDOA rsct filesets are in APPLIED state. (Added: 2020-03-18) |
Fixpack |
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.2
I_V1.1.0.3
|
PDOA FP7_FP3 rsct filesets are in APPLIED state. (Added: 2020-03-18)
In Appendix F of the PDOA FP6_FP2 and FP7_FP3 readme documents there is a check to see what AIX filesets are in the APPLIED state. This check is part of the instructions to commit GPFS filesets. If this check is run after TSA is updated, it will reveal several RSCT filesets that are in the APPLIED state; for example, a V1.1 FP3 environment will show this after TSA has been upgraded and committed.
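A check of this kind can be run from the management host as follows (a sketch using the dsh conventions shown elsewhere in this document; the readme's exact command may differ):
$ dsh -n ${ALL} 'lslpp -l "rsct.*" | grep APPLIED' | dshbak -c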
|
Workaround:
-----------
There is no workaround as yet. As part of FP8_FP4 we will look at this state to determine if any actions need to be taken. PDOA has used the same TSA update model over many years, and we expect that this issue has been present through all of those fixpacks; it was only the additional checking introduced in FP6_FP2 that allowed this issue to be recognized. There is no functionality issue with APPLIED filesets, as the latest filesets are in use. This is evidenced by the checks performed during the TSA updates.
Fixed:
----------
N/A
|
|
KIG00058
PDOA FP7_FP3 has failures when running Power Firmware (PFW) updates in parallel on the same server (MTM) type. (Added: 2020-03-18)
|
Fixpack |
I_V1.0.0.7
I_V1.1.0.3
|
PDOA FP7_FP3 has failures when running Power Firmware (PFW) updates in parallel on the same server (MTM) type.
In Stage 7 of the FP7_FP3 Readme there are instructions to update the Power Firmware (PFW) for one or more servers in the environment. Many customers are choosing to take a full outage during this time and are planning to apply the PFW in parallel on all of the servers instead of following the model of only updating the power firmware for Standby servers.
While this issue impacts all customers, the cost is felt most by customers who take full outage windows and also have large environments.
During Stage 7 the customer may experience the following symptom:
fsp=server_fsp1,server_fsp2,server_fsp3,server_fsp4;s=$(date);echo "Starting at ${s}.";$PL_ROOT/bin/icmds/appl_ctrl_fsp update -install -l ${fsp} -f /BCU_share/FP7_FP3/firmware/server_fsp/22A_42A/image/imports;e=$(date);echo "Started: ${s} Ended: ${e}."
Starting at Thu Feb 20 23:47:55 IST 2020.
PFW:server_fsp1:0:Successfully updated
PFW:server_fsp4:0:Successfully updated
PFW:server_fsp2:1:update failed for server server_fsp2
PFW:server_fsp3:1:update failed for server server_fsp3
Updates failed for one or more CECs

The text above shows that two servers, server_fsp1 and server_fsp4, succeeded, while server_fsp2 and server_fsp3 failed. While it appears that fsp1 and fsp4 ran in parallel, the pflayer actually runs updates in sets based on the MTM of the server. All PDOA environments at 1.5 DNs or higher have two different Power servers in the environment: one 4U and one 2U, which have different MTMs. FSP1 and FSP4 have different MTMs, and a close examination of the log shows that they ran about 30 to 40 minutes apart. The log will also show that fsp2 and fsp3 failed almost immediately.
This is a confirmed limitation of the HMC firmware level shipped in FP7_FP3, which does not allow the updlic command to be run in parallel.
Examination of the platform_layer.log will show the following messages:
[21 Feb 2020 00:55:17,846] <2360134 CTRL DEBUG flashdancehostname01> Successfully changed CEC power start policy as autostart
[21 Feb 2020 00:55:17,883] <1508172 CTRL DEBUG flashdancehostname01> Starting update for -> server_fsp3
[21 Feb 2020 00:55:17,886] <1508172 CTRL DEBUG flashdancehostname01> Executing command on hmc 172.23.1.245 => LANG=en_US /usr/bin/ssh hscroot@172.23.1.245 updlic -m Data1-8284-22A-SN216B47V -o u -t sys -l 01SV860_205 -r mountpoint -d /home/hscroot/01SV860_205_165
[21 Feb 2020 00:55:18,263] <2360134 CTRL DEBUG flashdancehostname01> update failed for server_fsp3

This is not a failure in terms of applying the power firmware, but a failure from the updlic command, which will not allow this update to proceed while another update is running.
This scenario has not impacted any of the previous PDOA fixpacks.
|
Workaround:
-----------
This is not a functional issue, but rather a time issue.
There is no need for a workaround for 0.5 DN systems as PDOA updates the foundation hosts serially.
For smaller systems consider adding the time costs to the outage windows, about 1 hour per Server.
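For example, the command shown in the symptom above can be run serially, one server per pass, during the outage window (a sketch; the server names are placeholders):
for fsp in server_fsp1 server_fsp2 server_fsp3 server_fsp4; do
$PL_ROOT/bin/icmds/appl_ctrl_fsp update -install -l ${fsp} -f /BCU_share/FP7_FP3/firmware/server_fsp/22A_42A/image/imports
done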
One workaround is to follow the Stage 7 model for those updates and only update Quiesced Servers. This is more tedious but allows the updates to happen outside of outage windows.
Another workaround is to use the HMC GUI to run the updates. The HMC GUI should support running the PFW updates in parallel for servers of the same type.
PDOA development has not tested this method.
Fixed:
----------
N/A.
Outside of the GUI workaround to run in parallel there is no fix from the HMC team available to address this at the command line.
|
|
KIG00004
alt_disk_install fails with long lv names. (Added: 2020-03-18) |
Fixpack |
I_V1.0.0.5
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.1
I_V1.1.0.2
I_V1.1.0.3
|
alt_disk_install fails with long lv names. (Added: 2020-03-18)
PDOA fixpack processes have used AIX hdisk cloning as part of the fixpack procedures for a long time. This is a well-known model for updating AIX: in a two-disk rootvg mirror, the mirror is broken and a copy or clone of the running disk is created.
In the event of a failure or corruption during the AIX update (or any change to rootvg), it is then possible to boot using the cloned disk to get back to the original rootvg.
As part of this process, however, there is some potential for LVs that were created after PDOA was shipped to the customer to cause the cloning step to fail. In particular, LVs with names that are too long may cause a failure like the one below.
0505-129 alt_disk_install: The rootvg contains logical
volume name(s): 01234567890lv, which exceed the 11 character limitation.
To correct this problem, unmount the logical volume(s). Then, rename and
mount the logical volume(s) and retry the command.

For customers applying V1.0 FP5 or earlier or V1.1 FP1, this will prevent the fixpack during the management and core phases.
For customers applying V1.0 FP6+ or V1.1 FP2+, this will prevent updates on that LPAR in Stage 6 (management) or Stage 7 (core).
It is possible to check this in a PDOA environment using the following command, which flags LV names of 9 or more characters as counted by wc -c (max=8). With max=8 the command will return results on all PDOA servers, validating that it works; update to max=11 to find LVs that are actually problematic.
dsh -n ${ALL} 'max=8;lsvg -l rootvg | egrep -v "^rootvg:|^LV NAME" | while read lvn rest;do l=$(echo ${lvn} | wc -c);if [ ${l} -gt ${max} ];then echo "${lvn} exceeds ${max} chars at ${l} characters long.";fi;done' | dshbak -c
HOSTS -------------------------------------------------------------------------
reverseflash01
-------------------------------------------------------------------------------
hd11admin exceeds 8 chars at 10 characters long.
lg_dumplv exceeds 8 chars at 10 characters long.
livedump exceeds 8 chars at 9 characters long.
paging00 exceeds 8 chars at 9 characters long.
gsacache exceeds 8 chars at 9 characters long.
HOSTS -------------------------------------------------------------------------
reverseflash02, reverseflash03, reverseflash04, reverseflash05, reverseflash06
-------------------------------------------------------------------------------
paging00 exceeds 8 chars at 9 characters long.
hd11admin exceeds 8 chars at 10 characters long.
lg_dumplv exceeds 8 chars at 10 characters long.
livedump exceeds 8 chars at 9 characters long.
|
Workaround:
-----------
To proceed, rename any logical volumes that exceed the 11 character limit. For FP5_FP1 and earlier, resume the update. For FP6_FP2 and higher, rerun the alt_disk_install step to clone rootvg once the LVs have been renamed, as illustrated below.
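As an illustration of the rename (a sketch; the LV and filesystem names are placeholders, and the corresponding stanza in /etc/filesystems must be updated by hand to match the new LV name):
$ umount /example01
$ chlv -n ex01lv 01234567890lv
### Update the dev = /dev/01234567890lv entry for /example01 in /etc/filesystems to /dev/ex01lv, then:
$ mount /example01
Fixed: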
----------
N/A. There is no fix for this issue as it is a limitation of the cloning procedure used by PDOA.
|
|
KIG00057
PDOA Power Firmware Update process leaves LPAR with a BA218001 error code. (Added: 2020-03-18)
|
Fixpack | I_V1.1.0.3 |
PDOA Power Firmware Update process leaves LPAR with a BA218001 error code. (Added: 2020-03-18)
During the Power Firmware update stage on V1.1.0.3 a customer experienced an LPAR that did not start. In this case the SRC was BA218001.
This issue is rare but it matched another issue seen on a Power 9 system in a similar scenario. The root cause appears to be an incompatibility between the Fibre Channel Firmware and the Power Firmware, which results in a stack underflow during POST.
|
Workaround:
-----------
Once this issue is hit it is important to open a ticket with IBM as this remedy may not fix all issues matching this SRC.
Once the ticket is opened it is possible to attempt the workaround. This is the advice from a similar ticket.
The workaround that has been devised and found to work for this problem, until the root cause is established and fixes are released, is as follows:
1. Shut down the partition
2. Change the partition profile to remove any/all fibre channel adapters
3. Activate the partition being sure to specify the partition profile with the change (the one with no fibre adapters)
4. Let partition boot - it will likely stop with a CA00E175 or AA00E1A9. Make sure it does not hang with any BA21xxxx condition.
5. Again shut down the partition
6. Modify the partition profile to add the desired fibre channel adapters back
7. Activate the partition again being sure to specify the partition profile with the change
8. Make sure the partition does not hang with BA210001 or BA218001
9. Retry the VIOS install
For PDOA systems we followed something slightly different.
Pre-Requisites:
You will need the hscroot password for one of the HMCs on the PDOA environment.
You will need browser access to the HMC and can login as hscroot.
You will need to identify the Server and associated LPAR. For Foundation Servers there are two LPARs/Server.
Log in to the HMC as hscroot.
1. Find the Server hosting the LPAR with the SRC code.
2. Shut down the LPAR in the HMC.
3. Find the Managed Profiles window for the LPAR.
4. Create a copy of the current profile. For example, if the profile name is 'adm_node' create a copy called 'adm_node_nofcs'
5. Edit the profile copy to remove all HBA adapters from the LPAR assignment. In the I/O tab there will be slots listed as 'Required'. Select all Fibre Channel Adapters that are currently required and hit the Remove button.
6. Attempt to start or Activate the LPAR and choose the newly copied profile.
7. If the system boots, log in to the environment. If the LPAR is still not active, stop and work with support for further troubleshooting.
8. On boot, GPFS will start automatically; you may want to run '/usr/lpp/mmfs/bin/mmumount all' and then '/usr/lpp/mmfs/bin/mmshutdown' to cleanly stop it.
9. Shut down the server from the command line but do not reboot it: 'shutdown +0'.
10. In the HMC verify the LPAR is not active.
11. Activate the LPAR, this time choosing the original profile.
12. Verify that the LPAR has booted and all FC cards are available and not Defined.
13. Delete the copied profile from the Managed Profiles page for that LPAR.
Fixed:
----------
There is no fix available for PDOA V1.1.0.3's fixpack application.
|
|
KIG00066
XML Load In PDOA Systems Returns SQL1406N Shared sort memory cannot be allocated for this utility. (Added: 2020-03-20) |
General |
I_V1.0
I_V1.1
|
XML Load In PDOA Systems Returns SQL1406N Shared sort memory cannot be allocated for this utility.
When trying to execute a LOAD into an XML table, you may see the following errors.
Agent Type Node SQL Code Result
LOAD 001 -00001406 Error. RESTART required.
LOAD 002 -00001406 Error. RESTART required.
LOAD 003 -00001406 Error. RESTART required.
LOAD 004 -00001406 Error. RESTART required.
LOAD 005 -00001406 Error. RESTART required.
LOAD 006 -00001406 Error. RESTART required.
LOAD 007 -00001406 Error. RESTART required.
LOAD 008 -00001406 Error. RESTART required.
LOAD 009 -00001406 Error. RESTART required.
LOAD 010 -00001406 Error. RESTART required.
LOAD 011 -00001406 Error. RESTART required.
LOAD 012 -00001406 Error. RESTART required.
LOAD 013 -00001406 Error. RESTART required.
LOAD 014 -00001406 Error. RESTART required.
LOAD 015 -00001406 Error. RESTART required.
LOAD 016 -00001406 Error. RESTART required.
LOAD 017 -00001406 Error. RESTART required.
LOAD 018 -00001406 Error. RESTART required.
LOAD 019 -00001406 Error. RESTART required.
LOAD 020 -00001406 Error. RESTART required.
LOAD 021 -00001406 Error. RESTART required.
LOAD 022 -00001406 Error. RESTART required.
LOAD 023 -00001406 Error. RESTART required.
LOAD 024 -00001406 Error. RESTART required.
LOAD 025 -00001406 Error. RESTART required.
RESULTS: 0 of 25 LOADs completed successfully.

This appears to be a memory issue, but even with 0 rows to load the error will appear.
|
Workaround:
-----------
This type of load operation is not supported on PDOA environments in general, as it requires shared sort memory to be enabled. PDOA environments are primarily configured to use private sorts. Changing the memory parameters to use shared sorts will allow this type of operation to work; however, doing so should be considered a tuning exercise that ensures existing workloads and SLAs are not impacted. Primarily this is due to the following settings (examples are from V1.1 systems):
DBM CFG: SHEAPTHRES 2800000
DB CFG: INTRA_PARALLEL NO
See https://www.ibm.com/support/knowledgecenter/en/SSH2TE_1.1.0/com.ibm.7700.r2.common.doc/doc/c00000109.html for more information.
For the Db2 requirements for XML Load, refer to this link: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.xml.doc/doc/c0024119.html
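If shared sorts are enabled as part of such a tuning exercise, the change would be along these lines (a sketch only; the values are placeholders and must come from a proper tuning effort against your workloads):
$ db2 update dbm cfg using SHEAPTHRES 0
$ db2 update db cfg for <dbname> using SHEAPTHRES_SHR <value>
Fixed: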
----------
N/A
|
|
KIG00067
/var/log/syslog.out shows new messages for ssh. (Added: 2020-03-24) |
Fixpack |
I_V1.0.0.5_IF01
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.1_IF01
I_V1.1.0.2
I_V1.1.0.3
|
/var/log/syslog.out shows new messages for ssh.
After applying a fixpack or interim fix that updates AIX to 7.1 TL5, sshd reports several deprecated or unsupported options in the system log.
Oct 4 18:01:28 kf5hostname04 auth|security:info sshd[6554284]: rexec line 31: Deprecated option KeyRegenerationInterval
Oct 4 18:01:28 kf5hostname04 auth|security:info sshd[6554284]: rexec line 47: Deprecated option RSAAuthentication
Oct 4 18:01:28 kf5hostname04 auth|security:info sshd[6554284]: rexec line 52: Deprecated option RhostsRSAAuthentication
Oct 4 18:01:28 kf5hostname04 auth|security:info sshd[6554284]: rexec line 96: Unsupported option PrintLastLog
Oct 4 18:01:28 kf5hostname04 auth|security:info sshd[6554284]: rexec line 99: Deprecated option UsePrivilegeSeparation
Oct 4 18:01:28 kf5hostname04 auth|security:info sshd[6554284]: reprocess config line 47: Deprecated option RSAAuthentication
Oct 4 18:01:28 kf5hostname04 auth|security:info sshd[6554284]: reprocess config line 52: Deprecated option RhostsRSAAuthentication

The AIX update includes updates to OpenSSH where these options are either deprecated or unsupported.
These options were originally setup in the default /etc/sshd_config files as part of the PDOA deployment.
$ dsh -n ${ALL} 'egrep "KeyRegenerationInterval|RSAAuthentication|RhostsRSAAuthentication|PrintLastLog|UsePrivilegeSeparation" /etc/ssh/ssh*_config' | dshbak -c
HOSTS -------------------------------------------------------------------------
flashdancehostname01, flashdancehostname02, flashdancehostname03, flashdancehostname04, flashdancehostname05, flashdancehostname06, flashdancehostname07
-------------------------------------------------------------------------------
/etc/ssh/ssh_config:# RhostsRSAAuthentication no
/etc/ssh/ssh_config:# RSAAuthentication yes
/etc/ssh/sshd_config:KeyRegenerationInterval 1h
/etc/ssh/sshd_config:RSAAuthentication yes
/etc/ssh/sshd_config:RhostsRSAAuthentication no
/etc/ssh/sshd_config:# RhostsRSAAuthentication and HostbasedAuthentication
/etc/ssh/sshd_config:PrintLastLog yes
/etc/ssh/sshd_config:UsePrivilegeSeparation yes

These messages will appear for each ssh session initiated on the server, which can lead to unnecessary increased syslog traffic.
|
Workaround:
-----------
These options can be commented out in the /etc/ssh/sshd_config and /etc/ssh/ssh_config files.
Great care should be taken before updating /etc/ssh/sshd_config files and refreshing the sshd daemon as this could cause unexpected outages if there are errors in the config files. GPFS and Db2 are two appliance components dependent on ssh.
Always ensure that it is possible to login via the HMC to this LPAR using vtmenu on the command line or through Console connections through the HMC GUI.
To test sshd_config file edits:
As root create a sandbox location to test sshd_config updates.
$ mkdir /tmp/ssh_test
$ cp /etc/ssh/sshd_config /tmp/ssh_test/
$ cd /tmp/ssh_test/
Edit the file by commenting out KeyRegenerationInterval, RSAAuthentication, RhostsRSAAuthentication, PrintLastLog, and UsePrivilegeSeparation.
$ egrep "KeyRegenerationInterval|RSAAuthentication|RhostsRSAAuthentication|PrintLastLog|UsePrivilegeSeparation" /tmp/ssh_test/sshd_config
#KeyRegenerationInterval 1h
#RSAAuthentication yes
#RhostsRSAAuthentication no
# RhostsRSAAuthentication and HostbasedAuthentication
#PrintLastLog yes
#UsePrivilegeSeparation yes

Create a separate sshd session on port 10022 using the new configuration file.
This will not run in the background and will not fork any processes.
$ $(which sshd) -d -D -p 10022 -f /tmp/ssh_test/sshd_config
debug1: sshd version OpenSSH_7.5, OpenSSL 1.0.2o 27 Mar 2018
debug1: private host key #0: ssh-rsa SHA256:1TkrCRt7BFWLrPs+LIi51tmoChteTRNLKQ9LUszCBXk debug1: private host key #1: ssh-dss SHA256:CHyQxuRUB2No3hP5k+Bj8GPeYzFKGdiSlnKt5oU/SE8 debug1: rexec_argv[0]='/usr/sbin/sshd' debug1: rexec_argv[1]='-d' debug1: rexec_argv[2]='-D' debug1: rexec_argv[3]='-p' debug1: rexec_argv[4]='10022' debug1: rexec_argv[5]='-f' debug1: rexec_argv[6]='/tmp/ssh_test/sshd_config' debug1: Bind to port 10022 on 0.0.0.0. Server listening on 0.0.0.0 port 10022. debug1: Bind to port 10022 on ::. Server listening on :: port 10022. In a separate window, use ssh to login to this sshd deamon.
$ ssh -p 10022 flashdancehostname01
debug1: AIX/loginsuccess: msg Last unsuccessful login: Sun Mar 8 13:43:58 IST 2020 on rexec from motte2.canlab.ibm.com Last login: Tue Mar 24 19:17:57 IST 2020 on ssh from 172.23.1.1 debug1: audit session open euid 0 user root tty name /dev/pts/6
Last unsuccessful login: Sun Mar 8 13:43:58 IST 2020 on rexec from motte2.canlab.ibm.com Last login: Tue Mar 24 19:16:06 IST 2020 on /dev/pts/6 from 172.23.1.1 ******************************************************************************* * * * * * Welcome to AIX Version 7.1! * * * * * * Please see the README file in /usr/lpp/bos for information pertinent to * * this release of the AIX Operating System. * * * * * ******************************************************************************* debug1: ACCESS KEy before calling efslogin: debug1: permanently_set_uid: 0/0
Environment: USER=root LOGNAME=root LOGIN=root HOME=/ PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin:/usr/java5/jre/bin:/usr/java5/bin:/usr/lpp/htx/etc/scripts:/test/tools:/usr/lpp/htx/test/tools:/home/monitor/test/tools:/nim/build_net/tools:/u MAIL=/var/spool/mail/root SHELL=/usr/bin/ksh TZ=Asia/Calcutta SSH_CLIENT=172.23.1.1 65207 10022 SSH_CONNECTION=172.23.1.1 65207 172.23.1.1 10022 SSH_TTY=/dev/pts/6 TERM=xterm AUTHSTATE=compat LANG=en_US LOCPATH=/usr/lib/nls/loc NLSPATH=/usr/lib/nls/msg/%L/%N:/usr/lib/nls/msg/%L/%N.cat:/usr/lib/nls/msg/%l.%c/%N:/usr/lib/nls/msg/%l.%c/%N.cat LC__FASTMSG=true ODMDIR=/etc/objrepos CLCMD_PASSTHRU=1 MANPATH=/opt/ibm/director/man NUM_PARALLEL_LPS=2 (0) root @ flashdancehostname01: 7.1.0.0: /
$ exit Connection to flashdancehostname01 closed.
After logging in and exiting the following messages will appear on your sshd debug session and it will exit.
debug1: fd 5 clearing O_NONBLOCK
debug1: Server will not fork when running in debugging mode. debug1: rexec start in 5 out 5 newsock 5 pipe -1 sock 8 debug1: inetd sockets after dupping: 3, 3 debug1: audit connection from 172.23.1.1 port 65207 euid 0 Connection from 172.23.1.1 port 65207 on 172.23.1.1 port 10022 debug1: Client protocol version 2.0; client software version OpenSSH_7.5 debug1: match: OpenSSH_7.5 pat OpenSSH* compat 0x04000000 debug1: Local version string SSH-2.0-OpenSSH_7.5 debug1: Enabling compatibility mode for protocol 2.0 debug1: Failed dlopen: /usr/krb5/lib/libkrb5.a(libkrb5.a.so): 0509-022 Cannot load module /usr/krb5/lib/libkrb5.a(libkrb5.a.so). 0509-026 System error: A file or directory in the path name does not exist. debug1: Error loading Kerberos, disabling the Kerberos auth
debug1: permanently_set_uid: 202/201 [preauth]
debug1: list_hostkey_types: ssh-rsa,rsa-sha2-512,rsa-sha2-256 [preauth]
debug1: SSH2_MSG_KEXINIT sent [preauth]
debug1: SSH2_MSG_KEXINIT received [preauth]
debug1: kex: algorithm: curve25519-sha256 [preauth]
debug1: kex: host key algorithm: rsa-sha2-512 [preauth]
debug1: kex: client->server cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none [preauth]
debug1: kex: server->client cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none [preauth]
debug1: expecting SSH2_MSG_KEX_ECDH_INIT [preauth]
debug1: rekey after 4294967296 blocks [preauth]
debug1: SSH2_MSG_NEWKEYS sent [preauth]
debug1: expecting SSH2_MSG_NEWKEYS [preauth]
debug1: SSH2_MSG_NEWKEYS received [preauth]
debug1: rekey after 4294967296 blocks [preauth]
debug1: KEX done [preauth]
debug1: userauth-request for user root service ssh-connection method none [preauth]
debug1: attempt 0 failures 0 [preauth]
debug1: userauth-request for user root service ssh-connection method publickey [preauth]
debug1: attempt 1 failures 0 [preauth]
debug1: userauth_pubkey: test whether pkalg/pkblob are acceptable for RSA SHA256:1TkrCRt7BFWLrPs+LIi51tmoChteTRNLKQ9LUszCBXk [preauth]
debug1: temporarily_use_uid: 0/0 (e=0/0)
debug1: trying public key file //.ssh/authorized_keys
debug1: fd 5 clearing O_NONBLOCK
debug1: matching key found: file //.ssh/authorized_keys, line 1 RSA SHA256:1TkrCRt7BFWLrPs+LIi51tmoChteTRNLKQ9LUszCBXk
debug1: restore_uid: 0/0
debug1: Failed to collect Cookie from Keystore
debug1: Keystore Opening wil be failed after login
debug1: Cookie received : [preauth]
Postponed publickey for root from 172.23.1.1 port 65207 ssh2 [preauth]
debug1: userauth-request for user root service ssh-connection method publickey [preauth]
debug1: attempt 2 failures 0 [preauth]
debug1: temporarily_use_uid: 0/0 (e=0/0)
debug1: trying public key file //.ssh/authorized_keys
debug1: fd 8 clearing O_NONBLOCK
debug1: matching key found: file //.ssh/authorized_keys, line 1 RSA SHA256:1TkrCRt7BFWLrPs+LIi51tmoChteTRNLKQ9LUszCBXk
debug1: restore_uid: 0/0
debug1: Failed to collect Cookie from Keystore
debug1: Keystore Opening wil be failed after login
debug1: Cookie received : [preauth]
Accepted publickey for root from 172.23.1.1 port 65207 ssh2: RSA SHA256:1TkrCRt7BFWLrPs+LIi51tmoChteTRNLKQ9LUszCBXk
debug1: AIX/loginsuccess: msg Last unsuccessful login: Sun Mar 8 13:43:58 IST 2020 on rexec from motte2.canlab.ibm.com
Last login: Tue Mar 24 19:16:06 IST 2020 on /dev/pts/6 from 172.23.1.1
debug1: monitor_child_preauth: root has been authenticated by privileged process
debug1: Entering sshefs_option_check [preauth]
debug1: AllowPkcs12KeystoreAutoOpen option not set [preauth]
debug1: EFS ACESS KEY: [preauth]
debug1: monitor_read_log: child log fd closed
debug1: audit event euid 0 user root event 2 (SSH_authsuccess)
debug1: Return Val-1 for auditproc:0
debug1: rekey after 4294967296 blocks
debug1: rekey after 4294967296 blocks
debug1: ssh_packet_set_postauth: called
debug1: Entering interactive session for SSH2.
debug1: server_init_dispatch
debug1: server_input_channel_open: ctype session rchan 0 win 1048576 max 16384
debug1: input_session_request
debug1: channel 0: new [server-session]
debug1: session_new: session 0
debug1: session_open: channel 0
debug1: session_open: session 0: link with channel 0
debug1: server_input_channel_open: confirm session
debug1: server_input_global_request: rtype no-more-sessions@openssh.com want_reply 0
debug1: server_input_channel_req: channel 0 request pty-req reply 1
debug1: session_by_channel: session 0 channel 0
debug1: session_input_channel_req: session 0 req pty-req
debug1: Allocating pty.
debug1: session_pty_req: session 0 alloc /dev/pts/6
debug1: server_input_channel_req: channel 0 request shell reply 1
debug1: session_by_channel: session 0 channel 0
debug1: session_input_channel_req: session 0 req shell
debug1: Values: options.num_allow_users: 0
debug1: RLOGIN VALUE :1
debug1: AIX/loginsuccess: msg Last unsuccessful login: Sun Mar 8 13:43:58 IST 2020 on rexec from motte2.canlab.ibm.com
Last login: Tue Mar 24 19:17:57 IST 2020 on ssh from 172.23.1.1
Starting session: shell on pts/6 for root from 172.23.1.1 port 65207 id 0
setsid: Operation not permitted.
debug1: Received SIGCHLD.
debug1: session_by_pid: pid 5177780
debug1: session_exit_message: session 0 channel 0 pid 5177780
debug1: session_exit_message: release channel 0
debug1: session_pty_cleanup: session 0 release /dev/pts/6
debug1: audit session close euid 0 user root tty name /dev/pts/6
Received disconnect from 172.23.1.1 port 65207:11: disconnected by user
Disconnected from user root 172.23.1.1 port 65207
debug1: do_cleanup
debug1: audit event euid 0 user root event 12 (SSH_connabndn)
debug1: Return Val-1 for auditproc:0
(255) root @ flashdancehostname01: 7.1.0.0: /tmp/ssh_test
$

Once a file is known good, it can be copied to /etc/ssh/sshd_config and the sshd daemon can be restarted.
Fixed:
----------
N/A
|
KIG00072
FP7_FP3 Readme Appendix M: mksysb command does not capture /bosinst.data for host (Added: 2020-04-15) |
Fixpack |
I_V1.0.0.7
I_V1.1.0.3
|
FP7_FP3 Readme Appendix M: mksysb command does not capture /bosinst.data for host
The command documented in V101 of the FP7_FP3 Readme does not copy the /bosinst.data file along with the mksysb and image.data files. When registering a mksysb image in NIM, all three files are necessary.
The documented command is:
time dsh -n ${ALL} 'dir=/stage/backups/FP7_FP3/kf1/$(hostname);mkdir -p ${dir};mksysb -ip ${dir}/$(hostname).mksysb;cp /image.data ${dir}'
|
Workaround:
-----------
a) If a mksysb has already been taken, it is not possible to retrieve the /bosinst.data file from the host, and it is necessary to restore the host using mksysb:
The bosinst.data file is documented in the AIX Knowledge Center.
While the /bosinst.data file contains host-specific information, that information can be omitted.
In a PDOA environment you can see the differences per host by running the following as root on the management host.
dsh -n ${ALL} 'ssh 172.23.1.1 "cat /bosinst.data" | diff /bosinst.data -' | dshbak -c
Here is an example of the difference between a management host and the standby management host.
HOSTS -------------------------------------------------------------------------
flashdancehostname03
-------------------------------------------------------------------------------
4c4
< CONSOLE = /dev/vty0
---
> CONSOLE = Default
7c7
< PROMPT = no
---
> PROMPT = yes
22c22
< DESKTOP =
---
> DESKTOP = NONE
48c48
< BOSINST_LANG = en_US
---
> BOSINST_LANG = C
55,57c55,57
< PVID = 00f968c1cfa4c789
< PHYSICAL_LOCATION = U78C9.001.WZS02HM-P1-C14-T1-L205DA5D000-L0
< CONNECTION = sas0//205da5d000,0
---
> PVID = 00f968bfcf3f7fcc
> PHYSICAL_LOCATION = U78C9.001.WZS02F5-P1-C14-T1-L205DA55500-L0
> CONNECTION = sas0//205da55500,0

Management host:
-----------------------
(0) root @ flashdancehostname01: 7.1.0.0: /
$ cat /bosinst.data
# Basic bosinst_data file created by NIM
control_flow:
    CONSOLE = Default
    INSTALL_METHOD = overwrite
    INSTALL_EDITION = standard
    PROMPT = yes
    EXISTING_SYSTEM_OVERWRITE = yes
    INSTALL_X_IF_ADAPTER = yes
    RUN_STARTUP = yes
    RM_INST_ROOTS = no
    ERROR_EXIT =
    CUSTOMIZATION_FILE =
    TCB = no
    INSTALL_TYPE =
    BUNDLES =
    SWITCH_TO_PRODUCT_TAPE =
    RECOVER_DEVICES = Default
    BOSINST_DEBUG = no
    ACCEPT_LICENSES = yes
    ACCEPT_SWMA =
    DESKTOP = NONE
    INSTALL_DEVICES_AND_UPDATES = yes
    IMPORT_USER_VGS =
    CREATE_JFS2_FS = yes
    ALL_DEVICES_KERNELS = yes
    GRAPHICS_BUNDLE =
    SYSTEM_MGMT_CLIENT_BUNDLE =
    FIREFOX_BUNDLE =
    KERBEROS_5_BUNDLE =
    SERVER_BUNDLE =
    ALT_DISK_INSTALL_BUNDLE =
    REMOVE_JAVA_118 =
    HARDWARE_DUMP =
    ADD_CDE =
    ADD_GNOME =
    ADD_KDE =
    ERASE_ITERATIONS = 0
    ERASE_PATTERNS =
    MKSYSB_MIGRATION_DEVICE =
    TRUSTED_AIX =
    TRUSTED_AIX_LSPP =
    TRUSTED_AIX_SYSMGT =
    SECURE_BY_DEFAULT =
    ADAPTER_SEARCH_LIST =

locale:
    BOSINST_LANG = C
    CULTURAL_CONVENTION = en_US
    MESSAGES = en_US
    KEYBOARD = en_US

target_disk_data:
    PVID = 00f968bfcf3f7fcc
    PHYSICAL_LOCATION = U78C9.001.WZS02F5-P1-C14-T1-L205DA55500-L0
    CONNECTION = sas0//205da55500,0
    LOCATION = 03-00-00
    SIZE_MB = 544792
    HDISKNAME = hdisk0

Standby Management Host:
---------------------------------
$ ssh flashdancehostname03 cat /bosinst.data
# Basic bosinst_data file created by NIM
control_flow:
    CONSOLE = /dev/vty0
    INSTALL_METHOD = overwrite
    INSTALL_EDITION = standard
    PROMPT = no
    EXISTING_SYSTEM_OVERWRITE = yes
    INSTALL_X_IF_ADAPTER = yes
    RUN_STARTUP = yes
    RM_INST_ROOTS = no
    ERROR_EXIT =
    CUSTOMIZATION_FILE =
    TCB = no
    INSTALL_TYPE =
    BUNDLES =
    SWITCH_TO_PRODUCT_TAPE =
    RECOVER_DEVICES = Default
    BOSINST_DEBUG = no
    ACCEPT_LICENSES = yes
    ACCEPT_SWMA =
    DESKTOP =
    INSTALL_DEVICES_AND_UPDATES = yes
    IMPORT_USER_VGS =
    CREATE_JFS2_FS = yes
    ALL_DEVICES_KERNELS = yes
    GRAPHICS_BUNDLE =
    SYSTEM_MGMT_CLIENT_BUNDLE =
    FIREFOX_BUNDLE =
    KERBEROS_5_BUNDLE =
    SERVER_BUNDLE =
    ALT_DISK_INSTALL_BUNDLE =
    REMOVE_JAVA_118 =
    HARDWARE_DUMP =
    ADD_CDE =
    ADD_GNOME =
    ADD_KDE =
    ERASE_ITERATIONS = 0
    ERASE_PATTERNS =
    MKSYSB_MIGRATION_DEVICE =
    TRUSTED_AIX =
    TRUSTED_AIX_LSPP =
    TRUSTED_AIX_SYSMGT =
    SECURE_BY_DEFAULT =
    ADAPTER_SEARCH_LIST =

locale:
    BOSINST_LANG = en_US
    CULTURAL_CONVENTION = en_US
    MESSAGES = en_US
    KEYBOARD = en_US

target_disk_data:
    PVID = 00f968c1cfa4c789
    PHYSICAL_LOCATION = U78C9.001.WZS02HM-P1-C14-T1-L205DA5D000-L0
    CONNECTION = sas0//205da5d000,0
    LOCATION = 03-00-00
    SIZE_MB = 544792
    HDISKNAME = hdisk0

For a missing /bosinst.data file it is possible to use one from another host and edit the 'target_disk_data' stanza, blanking its values to remove the system-specific details:
target_disk_data:
    LOCATION =
    SIZE_MB =
    HDISKNAME =

b) If it is not necessary to restore the host via mksysb because the host is available:
i. If a mksysb has already been taken, copy the /bosinst.data file to the mksysb directory on /stage. On the host as root:
dir=/stage/backups/FP7_FP3/kf1/$(hostname)
cp /bosinst.data ${dir}
ii. If a mksysb has not been taken already, use the following command instead.
dsh -n ${ALL} 'dir=/stage/backups/FP7_FP3/kf1/$(hostname);mkdir -p ${dir};mksysb -ip ${dir}/$(hostname).mksysb;cp /image.data ${dir};cp /bosinst.data ${dir}'
Fixed:
----------
Targeted to be fixed in the FP8_FP4 Readme.
If there is an update to the FP7_FP3 Readme, it will be addressed in that update.
|
|
KIG00073
FP7_FP3: BNT update may fail on one or more switches with: BNT:net2:172.23.1.252:1:Compare of firmware failed for switch after copy. (Added: 2020-04-18)
|
Fixpack |
I_V1.0.0.7
I_V1.1.0.3
|
FP7_FP3: BNT update may fail on one or more switches with: BNT:net2:172.23.1.252:1:Compare of firmware failed for switch after copy.
As part of Stage 8 the BNT switches are updated. In some environments one or more BNT switches will report a failure message after attempting to update.
$ /opt/ibm/aixappl/pflayer/bin/icmds/appl_ctrl_net update -install -l "net0,net1,net2,net3" -f /BCU_share/FP7_FP3/firmware/net
BNT:net3:172.23.1.251:0:Compare of firmware success for switch after copy.
BNT:net1:172.23.1.253:0:Compare of firmware success for switch after copy.
BNT:net0:172.23.1.254:0:Compare of firmware success for switch after copy.
BNT:net2:172.23.1.252:1:Compare of firmware failed for switch after copy.

Log in to the switch that reported the failure to verify that the symptom matches. In the above case the switch is 172.23.1.252.
$ ssh admin@172.23.1.252
flashdance64c1>show boot
Current running image version:7.11.8
Currently set to boot software image1, active config block.
NetBoot: disabled, NetBoot tftp server: , NetBoot cfgfile:
Current boot Openflow protocol version: 1.0
USB Boot: disabled
Currently profile is default, set to boot with default profile next time.
Current FLASH software:
  image1: version 7.11.8, downloaded 23:55:36 Wed Feb 5, 2020 NormalPanel
  image2: version 7.11.15, downloaded 3:15:03 Thu Feb 20, 2020 NormalPanel
  boot kernel: version 7.11.8
Currently scheduled reboot time: none

The boot kernel version still shows 7.11.8 (if updating from FP5_FP1) or 7.11.11 (if updating from FP6_FP2).
During the attempt to copy the boot kernel update to the switch, the scp or the outer platform layer call failed and the image was not fully copied.
Confirm that the boot kernel version was not updated before applying the workaround.
|
Workaround:
-----------
1. Manually copy the boot kernel image to the affected switch, as root on the management host.
Replace 172.23.1.252 with the IP address of the switch with the issue. Note that scp will report 100% but will not complete right away.
The putboot file is a /proc filesystem link, so the switch post-processes the file during the scp session. The timing below shows about how long the scp session should last.
scp -l 500 /BCU_share/FP7_FP3/firmware/net/8264/boot_image/G8264-RS-7.11.15.0_Boot.img admin@172.23.1.252:putboot
Enter login password:
Switch: executing scp command - putboot.
G8264-RS-7.11.15.0_Boot.img                          100%   10MB  45.7KB/s   03:53

2. Log in to the switch to verify the boot kernel image is updated.
$ ssh admin@172.23.1.252
flashdance64c1>show boot
Current running image version:7.11.8
Currently set to boot software image1, active config block.
NetBoot: disabled, NetBoot tftp server: , NetBoot cfgfile:
Current boot Openflow protocol version: 1.0
USB Boot: disabled
Currently profile is default, set to boot with default profile next time.
Current FLASH software:
  image1: version 7.11.8, downloaded 23:55:36 Wed Feb 5, 2020 NormalPanel
  image2: version 7.11.15, downloaded 3:15:03 Thu Feb 20, 2020 NormalPanel
  boot kernel: version 7.11.15
Currently scheduled reboot time: none

3. Rerun the command to update the switches. It should proceed through and update the switches, which includes a switch reboot. Be sure to run the command in a screen session or a console session from the HMC.
Fixed:
----------
N/A.
|
|
KIG00093
The df command shows inconsistent or incorrect results on PDOA AIX hosts for GPFS or Spectrum Scale filesystems. (Added: 2020-06-23)
|
General |
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.2
I_V1.1.0.3
|
On PDOA systems at V1.0.0.6 or V1.1.0.2 and higher, customers who use the df command to check disk usage have opened tickets indicating one or more of the following symptoms:
a) The df command reports different disk and inode usage across multiple hosts for the same filesystem.
b) The df command incorrectly reports 100% disk usage for a filesystem; checking with the GPFS command /usr/lpp/mmfs/bin/mmdf shows the filesystem is not full.
c) The df command incorrectly reports 100% inode usage for a filesystem; checking with the GPFS command /usr/lpp/mmfs/bin/mmdf shows the filesystem is not out of inodes.
Because df is used as a health check mechanism, this discrepancy can lead to unnecessary alerts or actions taken in an attempt to resolve it.
|
Workaround:
-----------
To work around this issue it is possible to use the 'mmdf' command to synchronize the data provided to the df command for a particular filesystem on a particular host. By default this command is not in the path for any user in the PDOA environment.
/usr/lpp/mmfs/bin/mmdf <filesystem>
Where <filesystem> is replaced with the filesystem device name.
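For example, if the device name for /db2home were db2home (a hypothetical value; option a below shows how to retrieve the real device name with lsfs), the invocation would be:
$ /usr/lpp/mmfs/bin/mmdf db2home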
There are two options for using this command as described below.
a) For any filesystem that is reporting 100% disk or 100% inode usage, use the following command on the host.
Replace '/db2home' with the absolute filesystem path. Be sure to keep the "^" and the ":" at the beginning and end of the filesystem path so that only the filesystem with the discrepancy is selected. This command retrieves the device name, which is then passed to the mmdf command.
$ lsfs -c | grep "^/db2home:" | cut -d: -f 2 | xargs -n 1 /usr/lpp/mmfs/bin/mmdf
disk                disk size  failure holds    holds              free KB             free KB
name                    in KB    group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 2.4 TB)
nsddb2home          314572800       -1 yes      yes       175472384 ( 56%)         38008 ( 0%)
                -------------                         -------------------- -------------------
(pool total)        314572800                             175472384 ( 56%)         38008 ( 0%)
                =============                         ==================== ===================
(total)             314572800                             175472384 ( 56%)         38008 ( 0%)

Inode Information
-----------------
Number of used inodes:           10259
Number of free inodes:          297005
Number of allocated inodes:     307264
Maximum number of inodes:       307264

After running mmdf on the host with the discrepancy, the df command should show the correct values.
b) For a broader method, run the following command on a daily basis.
$ time dsh -n ${ALL} -f 1 'lsfs -c | grep mmfs | cut -d: -f2 | while read x;do echo "${x}"; /usr/lpp/mmfs/bin/mmlsfs $x > /dev/null 2>&1 && /usr/lpp/mmfs/bin/mmdf $x;done'
This command is typically run as root unless another user has been enabled for dsh. It takes about 35 minutes in a 2.5 DN V1.1 environment, so it is a long-running command. The mmdf command is governed by locking mechanisms in GPFS, which prevent multiple mmdf commands from running on the same filesystem at the same time across the clusters.
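One way to schedule the daily run is a root crontab entry on the management host. Because cron does not read root's login profile, a small wrapper script is needed so that the ${ALL} dsh host list is defined; the script name below is hypothetical and only sketches the idea.
# Illustrative root crontab entry: run the mmdf synchronization daily at 02:00.
# /usr/local/bin/mmdf_sync.sh is a hypothetical wrapper that sources root's
# profile (defining ${ALL}) and then runs the dsh/mmdf pipeline shown above.
0 2 * * * /usr/local/bin/mmdf_sync.sh > /tmp/mmdf_sync.log 2>&1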
Fixed:
----------
For IBM Spectrum Scale (GPFS) based filesystems, the /usr/lpp/mmfs/bin/mmdf command is the best way to check filesystem health and gives the most accurate results.
The IBM Spectrum Scale Knowledge Center describes how to query filesystem space and contains the documentation for the mmdf command.
In PDOA systems the following command line illustrates how to report filesystem disk and inode usage for a single host. For each filesystem, row one is the filesystem name, row two is the disk total and free statistics, and row three is the inode used and free statistics.
$ mount | grep -i " mmfs " | while read fd fs rest;do echo "$fs:";/usr/lpp/mmfs/bin/mmdf $fd -Y | egrep 'fsTotal:|inode:' | grep -v HEADER | cut -d: -f 7,8;done
/opmfs:
838860800:830921984
18212:481820
/db2home:
314572800:175472384
10259:297005
/dwhome:
10485760:10018304
4044:61748
/stage:
10482614272:2211388416
37465:462567
/usr/IBM/dwe/appserver_001:
209715200:203930880
5371:199493

Third party monitoring tools may rely on the more familiar 'df' command. For cases where 'df' must be used, refer to the workarounds to use the mmdf command on a periodic basis to synchronize the data for df on that host, and/or add the mmdf command as a response to verify any disk or inode threshold alerts.
|
|
KIG00107
The High Availability Toolkit may not execute a script defined in RESOURCE_MOVE_TARGET_SCRIPT or SUCCESSFUL_FAILOVER_SCRIPT on a successful failover.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.0
I_V1.1.0.1
I_V1.1.0.2
I_V1.1.0.3
|
The High Availability Toolkit may not execute a script defined in RESOURCE_MOVE_TARGET_SCRIPT or SUCCESSFUL_FAILOVER_SCRIPT on a successful failover. |
Workaround:
----------- N/A
Fixed:
---------- V1.0. Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0, which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. However, this only reduces the number of cases where failovers do not generate callout actions. Successful failover attempts that take more than an hour to resolve will not initiate a callout, and some support-assisted starts or failovers may not be recognized as failovers. This is a limitation of the implementation.
If using the following features:
EMAIL_ADDRESS
EMAIL_ON_MOVE=1
EMAIL_ON_SUCCESSFUL_FAILOVER=1
EMAIL_ON_UNSUCCESSFUL_FAILOVER=1
in the hatools.conf file, then the HA tools will send e-mail alerts to EMAIL_ADDRESS every 10 minutes warning that a partition set has not started, as well as a final warning and callout that the failover was unsuccessful. In these rare cases this should ensure administrators are notified and can take appropriate actions.
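For illustration, a minimal hatools.conf fragment that enables these alerts might look like the following sketch (the address is a placeholder; the variable names come from the list above):
# hatools.conf e-mail alerting (illustrative values only)
EMAIL_ADDRESS=dba-alerts@example.com
EMAIL_ON_MOVE=1
EMAIL_ON_SUCCESSFUL_FAILOVER=1
EMAIL_ON_UNSUCCESSFUL_FAILOVER=1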
|
|
KIG00109
The High Availability Toolkit may not attempt to stop all resources on a failed start attempt, leading to a never-ending transitional state.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.0
I_V1.1.0.1
I_V1.1.0.2
I_V1.1.0.3
|
Partitions may fail to start due to a timeout or may fail outright due to a db2sysc error. When partitions fail due to a timeout, the automation tool can attempt to fail over those partitions if another host is available. If the partitions fail outright, HA tools does not pass the failure notification to TSA, which prevents TSA from taking the appropriate actions to fail over. This leaves the partitions in a transitional state that can never be resolved.
|
Workaround:
----------- If this scenario is encountered, contact IBM Support to help diagnose why Db2 was not able to start.
Fixed:
---------- V1.0. Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0 which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. Two changes to HA Tools also help with this scenario.
1. The ability to specify two callout scripts was added to hatools.conf using two new variables: SUCCESSFUL_START_SCRIPT and FAILED_START_SCRIPT. For cases of outright success or outright failure, these callouts can be used to notify an administrator. The callouts will not fire when a start fails due to a timeout; however, if a timeout is hit while trying to start Db2, TSA can fail over if a standby or the original primary is available again. There are also changes to alerting behavior in 2.0.8.0 that are explained in item 2 below.
2. As part of KIG00107, if a partition set takes more than 10 minutes to start, a warning is issued through the EMAIL settings (if configured) as well as in the syslog on the primary and standby hosts for that resource. This recurs every 10 minutes until an hour has passed, at which point a final warning is issued.
Using these features can help ensure administrators are notified and can address these issues.
|
|
KIG00110
The High Availability Toolkit may start database partitions on a standby host before those partitions are fully stopped on the primary host.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.0
I_V1.1.0.1
I_V1.1.0.2
I_V1.1.0.3
|
This scenario occurs when Db2 cannot be stopped by the HA tools to facilitate a failover. However, hatools does not communicate this failure to TSA, which will initiate a failover if the standby is available.
The case where this occurred was a GPFS node expulsion scenario where the db2sysc process could not be killed by a kill -9 command. While GPFS was recovering, db2sysc was still detected, leading to multiple attempts by the automation to stop those processes, all of which failed. Once GPFS recovered, the db2sysc process could be killed, but this churn can lead to KIG00112, which will force down partitions that have already failed over.
|
Workaround:
----------- If this scenario is detected, the best approach is to reboot or shut down the hung host as soon as possible. This should allow TSA and GPFS to recover and reach a stable state.
Fixed:
---------- V1.0. Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0, which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. The monitoring algorithms have been updated to avoid an optimistic algorithm that assumes it is possible to kill Db2 processes as part of stop orders from TSAMP. Instead, the monitor may require another monitoring period (30 seconds) to ensure that all database partitions are stopped.
|
|
KIG00111
The High Availability policies may shut down partition resources on a healthy node if another node in the same domain is expelled from GPFS.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.0
I_V1.1.0.1
I_V1.1.0.2
I_V1.1.0.3
|
This is a tradeoff within the policy design. It occurs on systems that are 2.5 DN or larger and appears to only be a factor in domains with at least 2 active data nodes. The expected action is that the database partitions on the non-expelled node will be able to restart. This may prolong a failover, but there are changes to the behavior in HA Tools 2.0.8.0 that significantly improve the ability to resolve this scenario. This symptom can be easy to miss, as it is usually part of a larger failover-type event and will likely lengthen the failover time.
|
Workaround:
----------- N/A
Fixed:
---------- N/A
|
|
KIG00112
The High Availability Toolkit may stop database partitions on the wrong host if the host running the stop is not associated with the partitions in db2nodes.cfg being stopped.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.0
I_V1.1.0.1
I_V1.1.0.2
I_V1.1.0.3
|
This issue can occur if KIG00109 happens or if a false-positive db2sysc process appears on a standby host. This symptom can be easy to miss, as it usually happens in the larger context of a failover. |
Workaround:
----------- N/A
Fixed:
---------- V1.0. Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0 which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. |
|
KIG00116
The High Availability Toolkit command hastartdb2 may fail to start partitions correctly if they would start as a failover.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.0
I_V1.1.0.1 I_V1.1.0.2 I_V1.1.0.3 |
When running 'hastartdb2', the script verifies that the primary host definitions in the high availability policy match the partition assignments in db2nodes.cfg. If they do not match, it attempts to update the policy to match db2nodes.cfg. Prior to 2.0.8.0 this algorithm could fail and leave the domain in an inconsistent state.
The most common scenario is a failover as a result of a host rebooting in a two-node domain. For example, all PDOA environments have their admin and admin standby nodes in a single domain. A failover when one node is no longer in the domain does not allow the roving HA algorithm to update the primary and standby hosts, leading to the discrepancy. The goal of this update is to prevent a second failover (after stopping and starting the database).
|
Workaround:
----------- Contact IBM Support.
Fixed:
---------- V1.0. Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0, which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. In 2.0.8.0 this algorithm is improved and should not result in bad domain states; however, if there are many inconsistencies, hatools may not be able to rectify all cases to prevent failovers. This can lead to much longer start times and potential Failed Offline states due to timeouts. In those cases it may be necessary to use 'hareset' or 'hachkconfig' to resolve those inconsistencies.
|
|
KIG00123
The High Availability Toolkit may leave .failoverInProgress semaphore files that can prevent other failovers from completing.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.0
I_V1.1.0.1 I_V1.1.0.2 I_V1.1.0.3 |
Failover attempts are serialized so that no two partition sets attempt to fail over at the same time. This is controlled using .failoverInProgress files, which act as a semaphore. In some cases this file may not be removed, which prevents all other failovers from starting. This leads to TSA timeouts and Failed Offline states. |
Workaround:
----------- Once TSA has reached a steady state, remove the .failoverInProgress files in the instance owner's home directory, as in the sketch below. This will allow failovers to proceed. Note it may be necessary to run hareset to clear any Failed Offline states.
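A minimal sketch, assuming the default instance owner bcuaix with home directory /db2home/bcuaix (both are assumptions; adjust for the environment):
# List any leftover semaphore files, then remove them once TSA is stable.
ls -a /db2home/bcuaix | grep failoverInProgress
rm /db2home/bcuaix/.failoverInProgress*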
Fixed:
---------- V1.0. Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0 which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. |
|
KIG00124
The High Availability Toolkit hals command may show a message like this "halscore[74]: db2_bcuaix_0_1_2_3_4_5-rg: bad number" if a node is Pending Offline in the domain.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.0
I_V1.1.0.1 I_V1.1.0.2 I_V1.1.0.3 |
This is a transitory issue that appears while the node is pending offline, and it only occurs if hatools had confirmed the node was a connected node. Rerun hals until the node is offline. |
Workaround:
----------- Rerun hals until the pending state is resolved.
Fixed:
---------- NA
|
|
KIG00125
The High Availability Toolkit hals command will not show a standby node as not available if that node is excluded from its TSA domain.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.0
I_V1.1.0.1 I_V1.1.0.2 I_V1.1.0.3 |
TSA provides the ability to exclude a host from hosting managed resources in a domain. In PDOA this only happens as part of support scenarios as there are no hatools or ha scenarios that lead to this state. If a node is excluded all managed GPFS filesystems (such as /db2home) will be unmounted on that host. Since /stage and /dwhome are not managed these filesystems will continue to be available on that host. Any attempt to mount a managed filesystem will result in that filesystem being unmounted as soon as TSA detects it is up. Any attempts to failover to that host will result in a restart on the same host.
The hals utility does not detect this, so it may be hard for PDOA customers to diagnose.
|
Workaround:
----------- Use the following command as root on any host to determine if there are excluded nodes in the environment. Any nodes listed in the brackets next to 'ExcludedNodes' are excluded.
$ dsh -n ${ALL} 'lssamctrl ExcludedNodes 2> /dev/null' | dshbak -c
HOSTS -------------------------------------------------------------------------
b30i01, b30i02, b30i05, b30i06
-------------------------------------------------------------------------------
Displaying SAM Control information:

SAMControl:
        ExcludedNodes = {}

Contact IBM support to help diagnose why the nodes are excluded and what actions to take next.
Fixed:
---------- V1.1: Fixed in V1.1 FP4 with HA Tools 2.0.8.1; see the V1.1 FP4 Readme.
|
|
KIG00129
The High Availability Toolkit does not have callouts for successful partition set starts or partition set stops.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.0
I_V1.1.0.1 I_V1.1.0.2 I_V1.1.0.3 |
The callout mechanism described at https://www.ibm.com/support/pages/node/259139 does not provide notifications for successful or failed starts, as the focus is only on failovers.
There are several cases where failovers can go undetected.
|
Workaround:
----------- NA until HA Tools 2.0.8.0
Fixed:
---------- V1.0. Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0, which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. Two new variables were added to hatools.conf: SUCCESSFUL_START_SCRIPT and FAILED_START_SCRIPT. These callouts use a different mechanism than the failover callout scripts and provide an alternative to them. The scripts are called without arguments whenever a partition set is successfully started (regardless of failover or regular start) or explicitly fails to start on a host. If TSAMP kills the process due to a start timeout, no callouts are made. However, when used in conjunction with the EMAIL alerting features, the TSAMP timeouts should result in e-mail warnings every 10 minutes. A configuration sketch follows.
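As an illustration, the configuration and a trivial callout script might look like the following sketch (the script paths are hypothetical; per the text above, the scripts are invoked with no arguments):
# hatools.conf (illustrative paths)
SUCCESSFUL_START_SCRIPT=/usr/local/bin/start_ok.sh
FAILED_START_SCRIPT=/usr/local/bin/start_failed.sh

#!/bin/ksh
# /usr/local/bin/start_failed.sh -- minimal callout sketch: record the
# event in syslog so an administrator can follow up.
logger -p user.err "hatools: a partition set explicitly failed to start on $(hostname)"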
|
|
KIG00133
The High Availability Toolkit command hareset -restore will not work correctly when run on management nodes after FP6_FP2 (V1.0 FP6/V1.1 FP2) is applied.
(Added: 2020-11-05)
|
General | I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.2 I_V1.1.0.3 |
When attempting to run 'hareset -restore', an error is returned saying that no backups are found. This is a bug that appears after V1.0 FP6/V1.1 FP2, as that update creates management domain backup images that interfere with the ability of hareset to find core domain backups. |
Workaround:
----------- Try running hareset -restore from the admin node. This prevents hareset from seeing the management domain backups.
Fixed:
---------- V1.1: Fixed in V1.1 FP4 with HA Tools 2.0.8.1 see V1.1 FP4 Readme
|
|
KIG00135
The High Availability Toolkit command hastartdb2 may return SQL1035N when trying to activate the database when failovers are encountered.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.0
I_V1.1.0.1 I_V1.1.0.2 I_V1.1.0.3 |
This issue occurs when there is a failover in the admin nodes and the former primary node for the admin partition set is still running. This only impacts the explicit activation call as part of hastartdb2. |
Workaround:
----------- If detected early enough, run ACTIVATE DATABASE explicitly as the instance owner, as in the sketch below. Otherwise the database will be implicitly activated by connecting applications.
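A minimal sketch, assuming the default instance owner bcuaix and a placeholder database name BCUDB:
# As the instance owner, explicitly activate the database.
su - bcuaix
db2 activate database BCUDB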
Fixed:
---------- NA
|
|
KIG00136
The High Availability Toolkit command hareset -rebuild command does not complete the rebuild leaving the domain in an incomplete state.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.0
I_V1.1.0.1 I_V1.1.0.2 I_V1.1.0.3 |
This issue occurs due to a timing issue and seems to be more prevalent in earlier PDOA fixpack levels such as V1.1 GA. |
Workaround:
----------- Try 'hareset -restore' instead of rebuild.
Fixed:
---------- V1.0. Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0 which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. |
|
KIG00137
Upgrade to HMC V9R1.941.x from HMC V9R1.930.0 (MH01810) causes daily SRC E3325009 errors; the domain suffix needs to be populated. (Added: 2020-11-25) |
General |
I_V1.0.0.7
I_V1.1.0.3
I_V1.1.0.4
|
Some PDOA V1.1.0.3 customers who needed to update their HMC levels have experienced daily SRC E3325009 errors after updating their HMC levels.
In some cases the hostnames of the HMCs and the DNS settings may not have been set up so that the HMC hostname is resolvable through DNS.
This may also impact V1.1 FP4 customers.
On PDOA, a way to verify the host settings is to run the following command as root from the management host. It connects to both HMCs and runs the host command against each HMC hostname.
$ appl_ls_hw -r hmc -A M_IP_address | sed 's|"||g' | while read ip;do echo "*** ${ip} ***";ssh -n hscroot@${ip} 'lshmc -n -Fhostname | while read h;do host $h;done';done
*** 172.23.1.245 ***
dsshmc49.torolab.ibm.com has address 9.26.18.135
*** 172.23.1.246 ***
dsshmc50.torolab.ibm.com has address 9.26.18.136
(0) root @ flashdancehostname01: 7.1.0.0: /
Another test from the technote can be run to verify that both HMCs see the same primary HMC assignments for each server.
$ appl_ls_hw -r hmc -A M_IP_address | sed 's|"||g' | while read ip;do echo "*** ${ip} ***";ssh -n hscroot@${ip} 'lssyscfg -r sys -F name | sort| while read m;do printf "$m: ";lsprimhmc -m $m;done';done
*** 172.23.1.245 ***
Data1-8284-22A-SN216B47V: is_primary=1,primary_hmc_mtms=7042-CR8/840A7BD,"primary_hmc_ipaddr=172.23.1.245,172.16.0.1,9.26.18.135",primary_hmc_hostname=dsshmc49,primary_hmc_ipv6addr=
Data2-8284-22A-SN216B44V: is_primary=1,primary_hmc_mtms=7042-CR8/840A7BD,"primary_hmc_ipaddr=172.23.1.245,172.16.0.1,9.26.18.135",primary_hmc_hostname=dsshmc49,primary_hmc_ipv6addr=
FDNactive-8286-42A-SN2168BFV: is_primary=1,primary_hmc_mtms=7042-CR8/840A7BD,"primary_hmc_ipaddr=172.23.1.245,172.16.0.1,9.26.18.135",primary_hmc_hostname=dsshmc49,primary_hmc_ipv6addr=
FDNstby-8286-42A-SN2168C1V: is_primary=1,primary_hmc_mtms=7042-CR8/840A7BD,"primary_hmc_ipaddr=172.23.1.245,172.16.0.1,9.26.18.135",primary_hmc_hostname=dsshmc49,primary_hmc_ipv6addr=
Stby-Data-8284-22A-SN216B42V: is_primary=1,primary_hmc_mtms=7042-CR8/840A7BD,"primary_hmc_ipaddr=172.23.1.245,172.16.0.1,9.26.18.135",primary_hmc_hostname=dsshmc49,primary_hmc_ipv6addr=
*** 172.23.1.246 ***
Data1-8284-22A-SN216B47V: is_primary=0,primary_hmc_mtms=7042-CR8/840A7BD,"primary_hmc_ipaddr=172.23.1.245,172.16.0.1,9.26.18.135",primary_hmc_hostname=dsshmc49.torolab.ibm.com,primary_hmc_ipv6addr=
Data2-8284-22A-SN216B44V: is_primary=0,primary_hmc_mtms=7042-CR8/840A7BD,"primary_hmc_ipaddr=172.23.1.245,172.16.0.1,9.26.18.135",primary_hmc_hostname=dsshmc49.torolab.ibm.com,primary_hmc_ipv6addr=
FDNactive-8286-42A-SN2168BFV: is_primary=0,primary_hmc_mtms=7042-CR8/840A7BD,"primary_hmc_ipaddr=172.23.1.245,172.16.0.1,9.26.18.135",primary_hmc_hostname=dsshmc49.torolab.ibm.com,primary_hmc_ipv6addr=
FDNstby-8286-42A-SN2168C1V: is_primary=0,primary_hmc_mtms=7042-CR8/840A7BD,"primary_hmc_ipaddr=172.23.1.245,172.16.0.1,9.26.18.135",primary_hmc_hostname=dsshmc49.torolab.ibm.com,primary_hmc_ipv6addr=
Stby-Data-8284-22A-SN216B42V: is_primary=0,primary_hmc_mtms=7042-CR8/840A7BD,"primary_hmc_ipaddr=172.23.1.245,172.16.0.1,9.26.18.135",primary_hmc_hostname=dsshmc49.torolab.ibm.com,primary_hmc_ipv6addr=
|
Workaround:
--------------
Contact IBM support.
Fixed:
--------------
NA
|
|
KIG00138
The High Availability Toolkit command hareset -rebuild does not support restoring third tier storage management resources.
(Added: 2020-11-05)
|
General |
I_V1.0.0.4
I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.0
I_V1.1.0.1 I_V1.1.0.2 I_V1.1.0.3 |
After 'hareset -rebuild' completes, if third tier storage was placed under HA management control, the resource definitions will not be recreated. This requires manual intervention from support to restore.
|
Workaround:
----------- Use 'hareset -restore', which restores the domain rather than rebuilding it. As a practice, any changes to the domain should be backed up.
Fixed:
---------- V1.0. Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0, which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. A keyword called 'TIERSTORAGE' was added to hatools.conf, which can be used to specify additional storage tiers that should be managed as high availability resources.
For example, if each partition has additional cool and cold storage as specified by this pattern, where <part> is the 4-digit, zero-padded partition number:
/db2fscool/<instance>/NODE<part>
/db2fscold/<instance>/NODE<part>
Then specify the following in hatools.conf:
TIERSTORAGE="db2fscool/${INSTANCE} db2fscold/${INSTANCE}"
This allows hachkconfig -restore and hareset -rebuild to correct, add, or rebuild the domains to include those paths per partition.
Third tier filesystems must match the PDOA one-NSD-to-one-filesystem ratio as well as meet the PDOA filesystem and NSD naming conventions.
|
|
KIG00139
The High Availability Toolkit command hachkconfig -repair cannot repair corporate network relationships that use names like 'db2_bcuaix_0_1_2_3_4_5-rs_DependsOn_db2_VLAN501_network'
(Added: 2020-11-05)
|
General |
I_V1.1.0.0
I_V1.1.0.1
I_V1.1.0.2 I_V1.1.0.3 |
The High Availability Toolkit command hachkconfig -repair cannot repair corporate network relationships that use names like 'db2_bcuaix_0_1_2_3_4_5-rs_DependsOn_db2_VLAN501_network', which may have been created as part of the corporate network configuration process in early V1.1 systems. The hatools expect the name to be 'db2_bcuaix_0_1_2_3_4_5-rs_DependsOn_db2_VLAN501_network-rel'. This has no impact on the function of the policy, but it causes hachkconfig to fail because it cannot correct it.
|
Workaround:
----------- This workaround can be applied with resources online or offline during an outage window.
If the Opstates are Offline then it is not necessary to put the domain in manual mode.
Put the domain in manual mode.
(0) root @ b30i04: 7.1.0.0: /tmp/halog
$ hadomain -core manual
All core domains set to Manual mode.
(0) root @ b30i04: 7.1.0.0: /tmp/halog
$ hals
CORE DOMAIN
+============+=========+=========+=============+=================+=================+=============+
| PARTITIONS | CURRENT | STANDBY | DOMAIN      | OPSTATE         | HA STATUS       | RG REQUESTS |
+============+=========+=========+=============+=================+=================+=============+
| 0-5        | b30i04  | b30i02  | bcudomain01 | Online          | MANUAL MODE     | -           |
| 6-15       | b30i05  | b30i07  | bcudomain02 | Online          | MANUAL MODE     | -           |
| 16-25      | b30i06  | b30i07  | bcudomain02 | Online          | MANUAL MODE     | -           |
+============+=========+=========+=============+=================+=================+=============+

Find the relationships that are improperly named.
$ dsh -n ${BCUDB2ALL} "lsrel -D@ -s 'Name like \"%network\"' Name | grep network | cut -d@ -f1" 2> /dev/null | dshbak -c
HOSTS -------------------------------------------------------------------------
b30i02, b30i04
-------------------------------------------------------------------------------
db2_bcuaix_0_1_2_3_4_5-rs_DependsOn_db2_VLAN501_network

Rename those relationships.
$ dsh -n ${BCUDB2ALL} "lsrel -D@ -s 'Name like \"%network\"' Name 2> /dev/null | grep network | cut -d@ -f1 | while read f;do echo \$f;chrel -c \$f-rel \$f;done" | dshbak -c
HOSTS -------------------------------------------------------------------------
b30i04
-------------------------------------------------------------------------------
db2_bcuaix_0_1_2_3_4_5-rs_DependsOn_db2_VLAN501_network

Verify that the relationships are renamed. The following should return blank.
$ dsh -n ${BCUDB2ALL} "lsrel -D@ -s 'Name like \"%network\"' Name | grep network | cut -d@ -f1" 2> /dev/null | dshbak -c
(0) root @ b30i04: 7.1.0.0: /tmp/halog
$

Verify that the domains are still in Manual Mode with the OPSTATE column showing Online. If there are Pending Offline states, contact support.
$ hals
CORE DOMAIN
+============+=========+=========+=============+=================+=================+=============+
| PARTITIONS | CURRENT | STANDBY | DOMAIN      | OPSTATE         | HA STATUS       | RG REQUESTS |
+============+=========+=========+=============+=================+=================+=============+
| 0-5        | b30i04  | b30i02  | bcudomain01 | Online          | MANUAL MODE     | -           |
| 6-15       | b30i05  | b30i07  | bcudomain02 | Online          | MANUAL MODE     | -           |
| 16-25      | b30i06  | b30i07  | bcudomain02 | Online          | MANUAL MODE     | -           |
+============+=========+=========+=============+=================+=================+=============+

Restore the domains to automation mode.
$ hadomain -core auto
All core domains set to Auto mode.
(0) root @ b30i04: 7.1.0.0: /tmp/halog
$ hals
CORE DOMAIN
+============+=========+=========+=============+=================+=================+=============+
| PARTITIONS | CURRENT | STANDBY | DOMAIN      | OPSTATE         | HA STATUS       | RG REQUESTS |
+============+=========+=========+=============+=================+=================+=============+
| 0-5        | b30i04  | b30i02  | bcudomain01 | Online          | Normal          | -           |
| 6-15       | b30i05  | b30i07  | bcudomain02 | Online          | Normal          | -           |
| 16-25      | b30i06  | b30i07  | bcudomain02 | Online          | Normal          | -           |
+============+=========+=========+=============+=================+=================+=============+

Rerun 'hachkconfig' to verify there are no more errors related to these relationships.
Fixed:
---------- NA (The workaround is a permanent solution).
|
|
KIG00148
The High Availability Toolkit command hasetuptemp will fail if DB2OPTIONS include -v. This only occurs on V1.0 environments.
(Added: 2020-11-05)
|
General | I_V1.0.0.4 I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 |
This symptom only impacts V1.0 customers, as hasetuptemp is required when the temporary tablespace, which resides on local SSD (/db2ssd) filesystems, is recreated. This symptom was not seen in the field, so it is likely the '-v' option is not used, or not used by customers who have needed to recreate their system temporary tablespace. |
Workaround:
----------- Remove the '-v' option from DB2OPTIONS and rerun the command; a sketch follows.
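A quick sketch for checking and stripping the option in the current session (run as the user invoking hasetuptemp):
# Show the current Db2 CLP options, then remove -v before rerunning hasetuptemp.
echo $DB2OPTIONS
export DB2OPTIONS=$(echo "$DB2OPTIONS" | sed 's/-v//')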
Fixed:
---------- NA
|
|
KIG00150
The High Availability Toolkit commands such as hareset and hachkconfig may fail or corrupt the domain on V1.0 systems at 12.5 DNs or higher and V1.1 systems at 10.5 DNs or higher, due to TSA commands truncating long object names.
(Added: 2020-11-05)
|
General | I_V1.0.0.4 I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.0 I_V1.1.0.1 I_V1.1.0.2 I_V1.1.0.3 |
On larger PDOA environments with three-digit partition numbers, some resource names are truncated when TSA commands are used in table format. This leads hatools to get incorrect names for some resources, which can lead to failures that leave domains corrupted. |
Workaround:
----------- Run 'hareset -restore' to restore the domain from a good copy. Contact IBM Support to verify the health of the domain.
Fixed:
---------- V1.0. Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0, which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. TSA table-based commands were replaced with delimited output options to prevent truncation.
|
|
KIG00157
The High Availability Toolkit hals command does not recognize a standby node is not available if GPFS is not started on the node but the node is online in the domain.
(Added: 2020-11-05)
|
General | I_V1.0.0.4 I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.0 I_V1.1.0.1 I_V1.1.0.2 I_V1.1.0.3 |
The algorithm that determines whether a node is available as a standby does not check whether GPFS is online; it only checks that the domain is online and that there are no Failed, Stuck, or Unknown states. The filesystem monitors do check GPFS, but when GPFS is Offline they report that the filesystem is also Offline.
|
Workaround:
----------- If failovers are not working and hals does not indicate a failure, verify the GPFS state on each node in the system using this command.
$ dsh -n ${ALL} '/usr/lpp/mmfs/bin/mmgetstate -a' | dshbak -c
HOSTS -------------------------------------------------------------------------
b30i01, b30i03
-------------------------------------------------------------------------------
 Node number  Node name        GPFS state
------------------------------------------
       1      b30i01           active
       2      b30i03           active

HOSTS -------------------------------------------------------------------------
b30i02, b30i04, b30i05, b30i06, b30i07
-------------------------------------------------------------------------------
 Node number  Node name        GPFS state
------------------------------------------
       1      b30i02           active
       2      b30i04           active
       3      b30i05           active
       4      b30i06           active
       5      b30i07           active

Contact IBM Support to help determine why a host may not be active.
Fixed:
---------- NA
|
|
KIG00168
The High Availability Toolkit commands hachkconfig and hareset -rebuild will only check or create the first corporate VLAN set of resources as specified in hatools.conf.
(Added: 2020-11-05)
|
General | I_V1.0.0.4 I_V1.0.0.5 I_V1.0.0.6 I_V1.0.0.7 I_V1.1.0.0 I_V1.1.0.1 I_V1.1.0.2 I_V1.1.0.3 |
This can lead to inconsistencies in the ServiceIP settings when more than one corporate service IP is defined. While generally harmless, in some cases this can prevent Db2 from starting. |
Workaround:
----------- Contact IBM to help fix any inconsistency.
Fixed:
---------- V1.0. Contact IBM Support.
V1.1: Fixed in HA Tools 2.0.8.0, which is available by technote or as part of PDOA V1.1 FP4. See IBM PureData System for Operational Analytics High Availability toolkit component 2.0.8.0 update. HA tools now checks and can address multiple corporate service IPs.
|
|
KIG00222
Management host syslog has "Crypto library (CLiC) error: Wrong signature" and "Keystore doesn't contain ssh public key cookie" sshd errors. (Added: 2020-11-05)
|
General |
I_V1.0.0.3
I_V1.0.0.4
I_V1.0.0.5
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.0
I_V1.1.0.1
I_V1.1.0.2
I_V1.1.0.3
|
SSH logins on the management nodes will generate many messages like the ones below:
Nov 26 01:56:03 flashdancehostname01 auth|security:err|error sshd[4653944]: Keystore doesn't contain ssh public key cookie
Nov 26 01:56:04 flashdancehostname01 auth|security:err|error sshd[2950080]: Crypto library (CLiC) error: Wrong object type

This is related to the use of EFS on the management host, which is used by the platform layer.
PDOA does not allow EFS to be automatically opened through SSH connections.
|
Workaround:
-----------
N/A
This is a tradeoff in the way PDOA uses EFS on the management host.
Fixed:
----------
N/A
|
|
KIG00006
PDOA FP7_FP3/FP6_FP2 Readmes Appendix H incorrectly use the "-d" option with "db2fm" to disable the Db2 Fault Monitor on startup.
[ Added 2021-03-12]
|
Fixpack |
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.2
I_V1.1.0.3
|
When Db2 is updated in PDOA environments, the Db2 Fault Monitor may be enabled. The readmes provide instructions in Appendix H on how to stop and disable Db2's fault monitor.
The command listed in Appendix H to do this is as follows.
dsh -n ${BCUDB2ALL} "/usr/IBM/dwe/db2/V10.5.0.10..0/bin/db2fm -i bcuaix -d" | dshbak -c
The '-d' option will bring the instance down. The option that should be used is '-D'.
dsh -n ${BCUDB2ALL} "/usr/IBM/dwe/db2/V10.5.0.10..0/bin/db2fmcu -D" | dshbak -c
|
Workaround:
-----------
In Db2 10.5 and 11.1 systems, the fault monitor daemon does not appear to be enabled as part of updating Db2, so this command may not be necessary.
Use '-D' when attempting to stop the db2fm fault monitor daemon.
dsh -n ${BCUDB2ALL} "/usr/IBM/dwe/db2/V10.5.0.10..0/bin/db2fmcu -D" | dshbak -c
Fixed:
----------
N/A
|
|
KIG00026
When the starting point for a fixpack is V1.0.0.5 or V1.1.0.1, updating AIX to Version 7.1 TL5 before Stage 4 is completed may prevent Stage 4 from completing successfully due to SSH connectivity issues.
[ Added 2021-03-12]
|
Fixpack |
I_V1.0.0.6
I_V1.0.0.7
I_V1.1.0.2
I_V1.1.0.3
I_V1.1.0.4
|
Scenario:
V1.1 FP1->FP3 update.
Instead of updating AIX in Stage 6 on the management host, AIX was updated much earlier in the cycle before Stage 4 had completed.
During Stage 4 the V7000 enclosures are updated and in the update process one canister is updated and then recycled, after that canister boots the configuration role moves to the updated canister. At this point the platform layer becomes unable to monitor the enclosure and eventually times out and reports a failure. In the meantime, any enclosures that were in the process of updating will continue to update the canister firmware.
The exact reason why this fails is not known. Because the V7000 presents ECDSA keys after the updated canister becomes the configuration node, the speculation is that the tighter security settings in the newer AIX level no longer allow the connection, leading to SSH connection issues.
Between V1.0 FP5/V1.1 FP1 and V1.0 FP7/V1.1 FP3 there were SSH issues during each update which, once fixed, may prevent this issue from occurring for some customers.
|
Workaround:
-----------
If there is a Stage 4 failure and examination of the platform layer logs and trace files shows connectivity issues to the storage, then do the following.
These commands are run as root on the management host.
1. Check the current keys for the V7000 enclosures as listed in root's 'known_hosts' file. There are two examples below; it is more likely that the keys shown are RSA keys.
$ appl_ls_hw -r storage -A M_IP_address,Machine_type < /dev/null | grep "2076" | sed 's|"||g' | cut -d, -f1 | while read ip;do echo " *** ${ip} ***";ssh-keygen -F ${ip};done
*** 172.24.1.181 ***
# Host 172.24.1.181 found: line 15
172.24.1.181 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDhFYisNOOUPZKXOljT3OH/jkmCiRVS8hthriLJeb4E4P5XMRMCf5HjMr9bTSRXTT+TU7j+e0oIzFz1lpPtMC3KVhLBiwuGuT38PClvufMxEJCn9zdGLcxy9CLmwRwT/UkRPfxioG8z+TPx677BW34JZs+QVAqmeCU9wsDvrm7g5I/Osj4dqSHkEcwhzrO7A1jNZNoxLGsYSrtfhkPQxP+gvf9hiq3lN/MAYKqJD8w2Au1I0iz4xJmkpnokYpmikpiQTk1Mr7YrTEJpQOwBsaPwooYO1mUeN3N8GU4oeI3hqnFoZSQEzIzBjtIIwml3ZDM6Fz+FGO20kRWYM/zIiIGJ
*** 172.24.1.183 ***
# Host 172.24.1.183 found: line 13
172.24.1.183 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDFNIHVGXRfiEWAdGWjoQo3jfrBjTpymJo5bVUnNsUe9Ru2AU1VRvqyWaEpbeDSTfj6lZBjhYbh2d9CoA0ZRD2HbAj6zF9YSZkt0elbEbBKFldFp/AFef0IARC7/xghGCkZRlCjkAB7DtfKObtrT2xfmax2abZy0wzzrOnL7qppdZ9KLCJ3uwbyBxTo7IrtGorejddh6fMKuGiaflStoWez5vzzdtYdPsXmbJ4Fibz5Td6gJyFFjkfRDwWnNmpNbGxRyc9pyTv2fhGZOmoqjZoERYFaxLJs46kcEYBOnl1xqpFs1BQVXAM/dn74DGCX9bCXKhSZ90icsbLVQldJF4HT
*** 172.24.1.185 ***
# Host 172.24.1.185 found: line 14
172.24.1.185 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDxiX5ztNtWuuvqGljV/8STvl85bKee7UFg/+dgxvU6dSylRLAkhf5GRQt3XTc0HWqPxRE2Tiag0Y30B4ryezzdZ7DW7w2RYhDY0T+S6w17hmZg2uPqAggw48QgOUTAD62+aDKHaL7ngHWydVKFybB//TT873kJ9/kJuR5bwX959MY18Dk6ZB/9CU2J+C5r6E5gVtniPepCGyWATGnwUSpJhPaYhbJp6BH6S0++QEHZ4CbQMlAkbLS2G30ELcNhQ492hWih0FH07Cw+g3Sa/npU/jZFYAfb4/v2Y+kWy/bXLxV6zRcxf+Ovn/FnOUt87e9If5tHXT9LyavCK4DMW75H

$ appl_ls_hw -r storage -A M_IP_address,Machine_type < /dev/null | grep "2076" | sed 's|"||g' | cut -d, -f1 | while read ip;do echo " *** ${ip} ***";ssh-keygen -F ${ip};done
*** 172.23.1.204 ***
# Host 172.23.1.204 found: line 48
172.23.1.204 ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAAJ6D0nWA5A/z6XntN6JWxnf5EJ38GMdSDemVhlesLoIkLoSVdDXM0qfpsZ/fVhXLOAPmKQ68JxT9lk8oD/t3G3NAFUqEWzsUvpBxkPYeDHkGISe1My/Wnr4QN8L0ZtgsascB8V0QwYnugUCbx5nHVjvSdUTqx/96OVA5aFoPr/sAc6tg==
*** 172.23.1.206 ***
# Host 172.23.1.206 found: line 45
172.23.1.206 ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAFyZIHqBO9Ay2u9aRbw3OSFASv86YO6fmyO/Ol53FEwNkWLlZeqqacivo+crZxO8J8m3BRdebeZ37clXEVFb+DNugHSscB1E6NIj90LkWFkn9kdIKKjP/gMsvty4I3palHxldqnHOcjfgqCEK5q9nII8mX3MddM9x6ItYytZVufaPmBLw==
*** 172.23.1.208 ***
# Host 172.23.1.208 found: line 46
172.23.1.208 ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAAgUPvhy+zhJBMUA0RPsAS3+XDTlaz2ryQ6n1ZZER52/tiUZldBw6uNidBTWFBSgCs5XPlnpKsucS0lmi9ju0FASwHfjGBjO1XN2eVXphSN9e2jDjJEA4lotB126Hhbb1rTVNEKO2OG00YazvSX/1Ua7Mxml0cZ4l1kfXIpZKh+ByoDjA==

2. Remove the problematic ssh keys for the storage from root's 'known_hosts' file.
$ appl_ls_hw -r storage -A M_IP_address,Machine_type < /dev/null | grep "2076" | sed 's|"||g' | cut -d, -f1 | while read ip;do echo " *** ${ip} ***";ssh-keygen -R ${ip};done
*** 172.23.1.204 ***
# Host 172.23.1.204 found: line 48
/.ssh/known_hosts updated.
Original contents retained as /.ssh/known_hosts.old
*** 172.23.1.206 ***
# Host 172.23.1.206 found: line 45
/.ssh/known_hosts updated.
Original contents retained as /.ssh/known_hosts.old
*** 172.23.1.208 ***
# Host 172.23.1.208 found: line 45
/.ssh/known_hosts updated.
Original contents retained as /.ssh/known_hosts.old

3. Recheck the keys. All should be empty.
$ appl_ls_hw -r storage -A M_IP_address,Machine_type < /dev/null | grep "2076" | sed 's|"||g' | cut -d, -f1 | while read ip;do echo " *** ${ip} ***";ssh-keygen -F ${ip};done
*** 172.23.1.204 ***
*** 172.23.1.206 ***
*** 172.23.1.208 ***

4. Run the following to re-populate the known_hosts file. You will need to reply 'yes' to each of the prompts. This will add the ECDSA keys to known_hosts, replacing the RSA keys. It will also show the update status.
$ appl_ls_hw -r storage -A M_IP_address,Machine_type < /dev/null | grep "2076" | sed 's|"||g' | cut -d, -f1 | while read ip;do echo " *** ${ip} ***";ssh -n superuser@${ip} 'lsupdate';done
*** 172.23.1.204 ***
The authenticity of host '172.23.1.204 (172.23.1.204)' can't be established.
ECDSA key fingerprint is SHA256:dUaEPdNk5MnzHdIXSFip+mms61SEBt6ASBI2Y3ldPxU.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '172.23.1.204' (ECDSA) to the list of known hosts.
status success
event_sequence_number
progress
estimated_completion_time
suggested_action start
system_new_code_level
system_forced no
system_next_node_status none
system_next_node_time
system_next_node_id
system_next_node_name
system_next_pause_time
*** 172.23.1.206 ***
The authenticity of host '172.23.1.206 (172.23.1.206)' can't be established.
ECDSA key fingerprint is SHA256:TsKafP3voTG8kGkqhEIsGa0jhEJO9wwgI15t7jb0KiU.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '172.23.1.206' (ECDSA) to the list of known hosts.
status success
event_sequence_number
progress
estimated_completion_time
suggested_action start
system_new_code_level
system_forced no
system_next_node_status none
system_next_node_time
system_next_node_id
system_next_node_name
system_next_pause_time
*** 172.23.1.208 ***
The authenticity of host '172.23.1.208 (172.23.1.208)' can't be established.
ECDSA key fingerprint is SHA256:NSdzP0Idc5wer2dFhr9ESYKMjQgBU+ycYluIfwI/414.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '172.23.1.208' (ECDSA) to the list of known hosts.
status success
event_sequence_number
progress
estimated_completion_time
suggested_action start
system_new_code_level
system_forced no
system_next_node_status none
system_next_node_time
system_next_node_id
system_next_node_name
system_next_pause_time

5. Check the status of the updates on the V7000s again. The following command, run as root on the management host, collects the IP addresses of all of the V7000s and runs 'lsupdate' to show the current status. This time there should be no prompting.
$ appl_ls_hw -r storage -A M_IP_address,Machine_type < /dev/null | grep "2076" | sed 's|"||g' | cut -d, -f1 | while read ip;do echo " *** ${ip} ***";ssh -n superuser@${ip} 'lsupdate';done
*** 172.23.1.204 ***
status success
event_sequence_number
progress
estimated_completion_time
suggested_action start
system_new_code_level
system_forced no
system_next_node_status none
system_next_node_time
system_next_node_id
system_next_node_name
system_next_pause_time
*** 172.23.1.206 ***
status success
event_sequence_number
progress
estimated_completion_time
suggested_action start
system_new_code_level
system_forced no
system_next_node_status none
system_next_node_time
system_next_node_id
system_next_node_name
system_next_pause_time
*** 172.23.1.208 ***
status success
event_sequence_number
progress
estimated_completion_time
suggested_action start
system_new_code_level
system_forced no
system_next_node_status none
system_next_node_time
system_next_node_id
system_next_node_name
system_next_pause_time

6. After all updates are completed, rerun the update again to ensure the hard drive firmware update completes. The platform layer will detect that the canister firmware is updated and will then verify the hard disk firmware is updated if needed.
Fixed:
----------
See workaround.
|
|
KIG00091
When running db2_all or rah commands the following error is returned:
stty: tcgetattr: A specified file does not support the ioctl system call.
[ Added 2021-03-12]
|
General | All Versions |
When running db2_all or rah commands the following error is returned:
stty: tcgetattr: A specified file does not support the ioctl system call.
This error may appear when customers add 'stty'-based commands to shell profiles without first checking whether the session is interactive.
One common technique is to change the backspace character to the BACKSPACE key using 'stty erase ^?' in the .profile or .bashrc files. While this works in interactive sessions, it can cause messages like the one above in non-interactive sessions.
These messages can impact non-interactive sessions by adding unexpected output that must be parsed, or by overloading log files.
|
Workaround:
-----------
One way to avoid this is to use a check to make sure the terminal is interactive.
tty > /dev/null 2>&1 && <command>
The tty command returns a non-zero error code if run without a tty attached, so <command> will only run in an interactive session.
This could also be used with an 'if' block that checks the return code of the tty command if more complex commands are needed for interactive sessions, as shown in the sketch below.
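For instance, a .profile fragment using an 'if' block might look like this sketch (using the stty example from above):
# Run terminal-specific setup only when a tty is attached (interactive session).
if tty > /dev/null 2>&1
then
    stty erase ^?
fi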
Fixed:
----------
See workaround.
|
|
KIG00092
Db2 installation, db2_all, rah or db2iupdate on PDOA appears to hang.
[ Added 2021-03-12]
|
General | All Versions |
When attempting to install Db2 a fixpack, running db2_all, rah or db2iupdate commands appears to hang.
The default shell for the instance owner (bcuaix by default) is 'ksh'.
In the field we found that some customers prefer to use the 'bash' shell for their instance owner.
PDOA does not ship with bash but some customers may download it from the AIX Toolbox for Linux site.
The user that manages the instance owner may add bash at the end of the .profile file.
This has the effect of changing the shell to bash at login. The problem is that it causes non-interactive shells to appear to hang.
In fact, the non-interactive commands are waiting for the 'bash' shell to do something, but since there is no interaction, a hang occurs, preventing the .profile file from finishing.
While this was discovered performing a db2 update, this scenario can cause issues for any user attempting to run an interactive shell command in their .profile when running non-interactive shell sessions.
|
Workaround:
-----------
a) Remove 'bash' from the .profile script and have the user run bash manually after login.
b) Only run 'bash' when it is an interactive shell.
Similar to the solution for KIG00091, the answer is to only run bash when it is an interactive shell.
tty > /dev/null 2>&1 && bash
c) Change the default shell from ksh to bash for the instance owner.
A different option would be to change the shell for the instance owner from ksh to bash to avoid this check. This change may be simple but it is not recommended without understanding the implications and doing careful planning.
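For reference, a sketch of option (c) on AIX, assuming bash was installed from the AIX Toolbox at /opt/freeware/bin/bash (verify the actual path on your system):
# As root: display the current login shell, then change it for the instance owner.
lsuser -a shell bcuaix
chuser shell=/opt/freeware/bin/bash bcuaix
The new shell may also need to be added to the 'shells' stanza in /etc/security/login.cfg.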
Fixed:
----------
N/A
|
|
KIG00316
The message "Unable to find host in active machine list. Exiting." was encountered when running 'hafailover host'.
[Added 2021-03-25]
|
General | All Versions |
In the following PDOA V1.1 0.5 DN scenario, partition set 0-5 is active on 'host04' and has a standby of 'host02'. The 'hals' command shows:
$ hals
MANAGEMENT DOMAIN
+============+=========+=========+=========+=================+=================+=============+
| COMPONENT  | PRIMARY | STANDBY | CURRENT | OPSTATE         | HA STATUS       | RG REQUESTS |
+============+=========+=========+=========+=================+=================+=============+
| WASAPP     | host01  | host03  | host01  | Online          | Normal          | -           |
| DB2APP     | host01  | host03  | host01  | Online          | Normal          | -           |
| DPM        | host01  | host03  | host01  | Online          | Normal          | -           |
| DB2DPM     | host01  | host03  | host01  | Online          | Normal          | -           |
+============+=========+=========+=========+=================+=================+=============+
CORE DOMAIN
+============+=========+=========+=============+=================+=================+=============+
| PARTITIONS | CURRENT | STANDBY | DOMAIN      | OPSTATE         | HA STATUS       | RG REQUESTS |
+============+=========+=========+=============+=================+=================+=============+
| 0-5        | host04  | host02  | bcudomain01 | Online          | Normal          | -           |
+============+=========+=========+=============+=================+=================+=============+
An attempt to fail over to host02 is made using hafailover.
$ hafailover host02
Unable to find host02 in active machine list. Exiting
This fails because the hafailover command requires an already active host (as shown in the CURRENT column) to fail over FROM, rather than the host to fail over TO.
This can be confusing.
|
Workaround:
-----------
No workaround is needed; instead, ensure that you only pass hostnames for partition sets that appear in the CURRENT column.
For more information about hafailover, see the PDOA V1.1 Knowledge Center failover documentation.
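For example, in the scenario above, partitions 0-5 are active on host04, so the correct way to move them to the standby host02 is to pass the currently active host:
$ hafailover host04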
Fixed:
----------
N/A
|
|
KIG00384
The update_pfw.sh script fails when attempting to update more than one server during the power firmware Stage 8 update step in V1.1 FP4.
[ Added 2021-08-30]
[ Updated 2022-06-30 ]
|
Fixpack | I_V1.1.0.4 |
When attempting to update multiple servers of the same type (core servers with MT 8284), this script constructs an invalid command line for the platform layer.
Instead of looping through the servers, it runs a single platform layer update command with invalid parameters. The script should construct one platform layer command for each server.
This only impacts customers with 1.5 DN or more who attempt to run this update in a full outage scenario, or customers with 5.5 DN or more when using the failover scenario.
Here is an example of the output when failing with the incorrectly generated platform layer update command.
To work around the issue, run the platform layer commands separately.
|
Workaround:
-----------
Replace server_fsp5 with the correct platform layer name for the server to be updated.
To Validate:
To Update:
Fixed:
----------
Follow the instructions in the updated V1.1 FP8_FP4 Readme V229, which references this technote.
|
|
KIG00497
In V1.1 FP4 Stage 7 the command to quiesce nodes (quiesce_node.sh) may fail to detect active services, leading to an outage.
[ Added 2022-03-16 ]
[ Updated 2022-06-30 ]
|
Fixpack | I_V1.1.0.4 |
V1.1 FP4 Stage 7 allows updates to be applied to standby servers. The quiesce_nodes.sh command is
designed to run on all core hosts and to identify hosts that are currently standby servers. In one case the
command was run on all core servers but was cancelled using ctrl-c. After a few seconds the command was
issued again. While the first command correctly identified all active and standby hosts, the second command
failed to recognize some hosts as active, leading to those hosts being quiesced. This can lead to a prolonged
outage to troubleshoot and bring the system back online.
As long as the command is not killed this scenario should not happen in the field; however, it does illustrate
that there is risk when running the Stage 7 quiesce steps.
Update 2022-03-18: This was encountered in a V1.1 FP2 to V1.1 FP4 scenario.
After further investigation it appears that TSA's lsrg command may be an issue here after a node leaves the domain. The lsrg command is used by quiesce_nodes.sh and
hals. The output below was taken after completing a Stage 7 pass while the system was in manual mode. Notice that hals shows N/A and
the lsrg -m output shows no active host for some of the domains. The db2sysc process counts show that Db2 is still running on the active hosts. Individual lssam
output also shows the correct state of the resources.
$ dsh -n ${BCUDB2ALL} 'lsrg -m | grep "IBM.Application:db2_bcuaix" |sort' | dshbak -c
HOSTS -------------------------------------------------------------------------
host02, host04
-------------------------------------------------------------------------------
IBM.Application:db2_bcuaix_0_1_2_3_4_5-rs True db2_bcuaix_0_1_2_3_4_5-rg Online
HOSTS -------------------------------------------------------------------------
host15, host16, host17, host18, host19
-------------------------------------------------------------------------------
IBM.Application:db2_bcuaix_106_107_108_109_110_111_112_11... True db2_bcuaix_106_107_108_109_110_111_112_113_114_115-rg Online Nominal host15
IBM.Application:db2_bcuaix_116_117_118_119_120_121_122_12... True db2_bcuaix_116_117_118_119_120_121_122_123_124_125-rg Online Nominal host16
IBM.Application:db2_bcuaix_86_87_88_89_90_91_92_93_94_95-rs True db2_bcuaix_86_87_88_89_90_91_92_93_94_95-rg Online Nominal host19
IBM.Application:db2_bcuaix_96_97_98_99_100_101_102_103_10... True db2_bcuaix_96_97_98_99_100_101_102_103_104_105-rg Online Nominal host18
HOSTS -------------------------------------------------------------------------
host05, host06, host07, host08, host09
-------------------------------------------------------------------------------
IBM.Application:db2_bcuaix_16_17_18_19_20_21_22_23_24_25-rs True db2_bcuaix_16_17_18_19_20_21_22_23_24_25-rg Online
IBM.Application:db2_bcuaix_26_27_28_29_30_31_32_33_34_35-rs True db2_bcuaix_26_27_28_29_30_31_32_33_34_35-rg Online
IBM.Application:db2_bcuaix_36_37_38_39_40_41_42_43_44_45-rs True db2_bcuaix_36_37_38_39_40_41_42_43_44_45-rg Online
IBM.Application:db2_bcuaix_6_7_8_9_10_11_12_13_14_15-rs True db2_bcuaix_6_7_8_9_10_11_12_13_14_15-rg Online
HOSTS -------------------------------------------------------------------------
host10, host11, host12, host13, host14
-------------------------------------------------------------------------------
IBM.Application:db2_bcuaix_46_47_48_49_50_51_52_53_54_55-rs True db2_bcuaix_46_47_48_49_50_51_52_53_54_55-rg Online Nominal host12
IBM.Application:db2_bcuaix_56_57_58_59_60_61_62_63_64_65-rs True db2_bcuaix_56_57_58_59_60_61_62_63_64_65-rg Online Nominal host13
IBM.Application:db2_bcuaix_66_67_68_69_70_71_72_73_74_75-rs True db2_bcuaix_66_67_68_69_70_71_72_73_74_75-rg Online Nominal host11
IBM.Application:db2_bcuaix_76_77_78_79_80_81_82_83_84_85-rs True db2_bcuaix_76_77_78_79_80_81_82_83_84_85-rg Online Nominal host10

$ hals
none are available... returning
CORE DOMAIN
+============+=========+=========+=============+=================+=================+=============+
| PARTITIONS | CURRENT | STANDBY | DOMAIN      | OPSTATE         | HA STATUS       | RG REQUESTS |
+============+=========+=========+=============+=================+=================+=============+
| 0-5        | N/A     | host02  | bcudomain01 | Online          | MANUAL MODE     | -           |
| 6-15       | N/A     | host05  | bcudomain02 | Online          | MANUAL MODE     | -           |
| 16-25      | N/A     | host05  | bcudomain02 | Online          | MANUAL MODE     | -           |
| 26-35      | N/A     | host05  | bcudomain02 | Online          | MANUAL MODE     | -           |
| 36-45      | N/A     | host05  | bcudomain02 | Online          | MANUAL MODE     | -           |
| 46-55      | host12  | host14  | bcudomain03 | Online          | MANUAL MODE     | -           |
| 56-65      | host13  | host14  | bcudomain03 | Online          | MANUAL MODE     | -           |
| 66-75      | host11  | host14  | bcudomain03 | Online          | MANUAL MODE     | -           |
| 76-85      | host10  | host14  | bcudomain03 | Online          | MANUAL MODE     | -           |
| 86-95      | host19  | host17  | bcudomain04 | Online          | MANUAL MODE     | -           |
| 96-105     | host18  | host17  | bcudomain04 | Online          | MANUAL MODE     | -           |
| 106-115    | host15  | host17  | bcudomain04 | Online          | MANUAL MODE     | -           |
| 116-125    | host16  | host17  | bcudomain04 | Online          | MANUAL MODE     | -           |
+============+=========+=========+=============+=================+=================+=============+

$ dsh -n ${ALL} 'ps -ef | grep -v grep | grep -c db2sysc ' | sort
host01: 0
host02: 0
host03: 0
host04: 6
host05: 0
host06: 10
host07: 10
host08: 10
host09: 10
host10: 10
host11: 10
host12: 10
host13: 10
host14: 0
host15: 10
host16: 10
host17: 0
host18: 10
host19: 10

$ dsh -n ${BCUDB2ALL} 'lsrg -m -d | grep IBM.Application:db2_bcuaix | sort | cut -d: -f4 | xargs -n 1 lssam -g | grep db2_bcuaix' | dshbak -c
HOSTS -------------------------------------------------------------------------
host02, host04
-------------------------------------------------------------------------------
Online IBM.ResourceGroup:db2_bcuaix_0_1_2_3_4_5-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_0_1_2_3_4_5-rs
                |- Offline IBM.Application:db2_bcuaix_0_1_2_3_4_5-rs:host02
                '- Online IBM.Application:db2_bcuaix_0_1_2_3_4_5-rs:host04
HOSTS -------------------------------------------------------------------------
host15, host16, host17, host18, host19
-------------------------------------------------------------------------------
Online IBM.ResourceGroup:db2_bcuaix_106_107_108_109_110_111_112_113_114_115-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_106_107_108_109_110_111_112_113_114_115-rs
                |- Online IBM.Application:db2_bcuaix_106_107_108_109_110_111_112_113_114_115-rs:host15
                '- Offline IBM.Application:db2_bcuaix_106_107_108_109_110_111_112_113_114_115-rs:host17
Online IBM.ResourceGroup:db2_bcuaix_116_117_118_119_120_121_122_123_124_125-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_116_117_118_119_120_121_122_123_124_125-rs
                |- Online IBM.Application:db2_bcuaix_116_117_118_119_120_121_122_123_124_125-rs:host16
                '- Offline IBM.Application:db2_bcuaix_116_117_118_119_120_121_122_123_124_125-rs:host17
Online IBM.ResourceGroup:db2_bcuaix_86_87_88_89_90_91_92_93_94_95-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_86_87_88_89_90_91_92_93_94_95-rs
                |- Offline IBM.Application:db2_bcuaix_86_87_88_89_90_91_92_93_94_95-rs:host17
                '- Online IBM.Application:db2_bcuaix_86_87_88_89_90_91_92_93_94_95-rs:host19
Online IBM.ResourceGroup:db2_bcuaix_96_97_98_99_100_101_102_103_104_105-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_96_97_98_99_100_101_102_103_104_105-rs
                |- Offline IBM.Application:db2_bcuaix_96_97_98_99_100_101_102_103_104_105-rs:host17
                '- Online IBM.Application:db2_bcuaix_96_97_98_99_100_101_102_103_104_105-rs:host18
HOSTS -------------------------------------------------------------------------
host10, host11, host12, host13, host14
-------------------------------------------------------------------------------
Online IBM.ResourceGroup:db2_bcuaix_46_47_48_49_50_51_52_53_54_55-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_46_47_48_49_50_51_52_53_54_55-rs
                |- Online IBM.Application:db2_bcuaix_46_47_48_49_50_51_52_53_54_55-rs:host12
                '- Offline IBM.Application:db2_bcuaix_46_47_48_49_50_51_52_53_54_55-rs:host14
Online IBM.ResourceGroup:db2_bcuaix_56_57_58_59_60_61_62_63_64_65-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_56_57_58_59_60_61_62_63_64_65-rs
                |- Online IBM.Application:db2_bcuaix_56_57_58_59_60_61_62_63_64_65-rs:host13
                '- Offline IBM.Application:db2_bcuaix_56_57_58_59_60_61_62_63_64_65-rs:host14
Online IBM.ResourceGroup:db2_bcuaix_66_67_68_69_70_71_72_73_74_75-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_66_67_68_69_70_71_72_73_74_75-rs
                |- Online IBM.Application:db2_bcuaix_66_67_68_69_70_71_72_73_74_75-rs:host11
                '- Offline IBM.Application:db2_bcuaix_66_67_68_69_70_71_72_73_74_75-rs:host14
Online IBM.ResourceGroup:db2_bcuaix_76_77_78_79_80_81_82_83_84_85-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_76_77_78_79_80_81_82_83_84_85-rs
                |- Online IBM.Application:db2_bcuaix_76_77_78_79_80_81_82_83_84_85-rs:host10
                '- Offline IBM.Application:db2_bcuaix_76_77_78_79_80_81_82_83_84_85-rs:host14
HOSTS -------------------------------------------------------------------------
host05, host06, host07, host08, host09
-------------------------------------------------------------------------------
Online IBM.ResourceGroup:db2_bcuaix_16_17_18_19_20_21_22_23_24_25-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_16_17_18_19_20_21_22_23_24_25-rs
                |- Offline IBM.Application:db2_bcuaix_16_17_18_19_20_21_22_23_24_25-rs:host05
                '- Online IBM.Application:db2_bcuaix_16_17_18_19_20_21_22_23_24_25-rs:host06
Online IBM.ResourceGroup:db2_bcuaix_26_27_28_29_30_31_32_33_34_35-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_26_27_28_29_30_31_32_33_34_35-rs
                |- Offline IBM.Application:db2_bcuaix_26_27_28_29_30_31_32_33_34_35-rs:host05
                '- Online IBM.Application:db2_bcuaix_26_27_28_29_30_31_32_33_34_35-rs:host07
Online IBM.ResourceGroup:db2_bcuaix_36_37_38_39_40_41_42_43_44_45-rg Automation=Manual Nominal=Online
Online IBM.ResourceGroup:db2_bcuaix_6_7_8_9_10_11_12_13_14_15-rg Automation=Manual Nominal=Online
        |- Online IBM.Application:db2_bcuaix_6_7_8_9_10_11_12_13_14_15-rs
                |- Offline IBM.Application:db2_bcuaix_6_7_8_9_10_11_12_13_14_15-rs:host05
                '- Online IBM.Application:db2_bcuaix_6_7_8_9_10_11_12_13_14_15-rs:host08
|
Workaround:
-----------
To work around this problem there are a few ways to approach Stage 7.
1. Take a full outage to apply the updates. This can be time-consuming, as the power firmware updates must be applied serially since V1.1 FP3.
2. Do not issue a 'ctrl-c' when running the quiesce_node.sh step.
3. Prior to quiescing the node, there is a step that checks for eligible hosts. Verify that this list only includes standby hosts as eligible. This uses the same
helper script to determine whether a node is eligible or not.
4. Instead of using '${ALL}' in the quiesce_node.sh call, replace ${ALL} with the specific hosts, in a comma-separated format, that are currently standby nodes.
This list can be determined using the hals command (see the sketch after this list).
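As an illustrative sketch (not from the readme), the standby hosts can be pulled from the hals CORE DOMAIN output shown earlier, assuming the STANDBY column is the third data column:
# Collect the unique STANDBY column values from hals and join them with commas.
STANDBY_HOSTS=$(hals | awk -F'|' '/bcudomain/ { gsub(/ /, "", $4); print $4 }' | sort -u | paste -s -d, -)
echo ${STANDBY_HOSTS}    # for example: host02,host05,host14,host17
Verify the resulting list against the hals output by hand before passing it to quiesce_node.sh.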
Fixed:
----------
An update to the active node detection script will add a check for db2sysc processes in addition to the lsrg -m command. This is referenced in the updated V1.1 FP8_FP4 V229 readme along with the following technote.
|
|
KIG00599
In V1.1 FP2 and higher, an error occurs when running appl_conf to manage passwords. "The resource state should be on or online."
[Added: 2022-07-21]
|
General |
I_V1.1.0.2
|
In V1.1 FP2 and higher the PDOA Console GUI and mi* layers are removed, and the fixpack process moved from full automation to a tooling model. However, some commands like appl_stop were still used in the V1.1 FP2 steps. This changed the status of some servers from 'Online' to 'On', but the process did not ensure that they were changed back.
If customers have issues with password validation and attempt to modify the passwords with appl_conf, they may get this failure if the state of the server(s) is still 'On' and not 'Online'.
Symptom:
The following check, run as 'root' on the management host, shows that 'server1' is 'On' instead of 'Online'.
|
Workaround:
-----------
For each server run the following as root on the management host to update the server status to 'Online'. Be sure to replace server1 with the server to be updated.
Fixed:
----------
This is addressed in V1.1 FP3's platform layer and higher. For customers applying V1.1 FP4 on top of V1.1 FP2, it is advised to use the workaround, or you can wait and address the issue after Stage 2 is completed.
|
|
KIG00602
When attempting configupload operations on a SAN switch, encountered "configUpload not permitted (scp failed)." [ Added: 2022-07-27 ]
|
General |
I_V1.1.0.2
I_V1.1.0.3
I_V1.1.0.4
|
In Stage 2 of the V1.1 FP4 documentation, there is a step to back up the SAN configurations for all of the SAN switches in the environment. This requires logging in to the SAN and running a utility that uses scp to copy the configuration to the management node. In some cases the SAN switches may have stored an ssh public key for the host that is no longer valid. This may have been stored as part of the deployment, or the host key on the management node may have changed since deployment. If this key is incorrect it will cause all ssh-based operations to fail.
Here is an example of the failure on a V1.1 FP4 environment.
|
Workaround:
-----------
This solution has only been tested on V1.1 FP4; however, it may be applicable to previous versions if supported by the SAN switch's firmware level. The solution is to remove the old public host key from the known_hosts file on the SAN using the sshutil delknownhost function.
The following shows how to clean the known hosts file on the SAN. Log in as the admin user from the root account on the management host; this account on this host has ssh key-based access to the SANs. A successful configuration file backup session can also be seen.
The following shows an attempt to remove the management host using the delknownhost option without the '-all' option. This did not work, and it is not known why.
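As a rough outline of the working sequence (the switch name 'san01' is illustrative, and the exact prompts vary by firmware level):
# From the root account on the management host, log in to the switch as admin:
ssh admin@san01
# On the SAN switch, remove the stored known host entries, then retry the backup:
sshutil delknownhost -all
configupload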
Fixed:
----------
N/A
|
|
KIG00604
hachkconfig hangs. [ Added 8/1/2022 ]
|
Fixpack |
I_V1.1.0.4
|
In Stage 8 of V1.1 FP4 there is a step to update hatools that requires running 'hachkconfig' after unpacking and applying the updated hatools on all of the hosts. This command can hang when service IPs are defined on environments with 1.5 DN or higher and those service IPs are not defined for all database partitions.
Typically the following output would be seen when the hang occurs:
To identify whether the system is at risk, examine the file /usr/IBM/analytics/ha_tools/hatools.conf. This file should exist on all hosts and be synchronized whenever it is modified. The file is rarely modified, and the best practice is to update it only on the management host and then distribute the updated file to the rest of the hosts.
In this file look for AR_VLAN entries.
This shows a 1.5 DN scenario, with VLAN001 having a service IP on the admin node (0-5) and the first data node (6-15). Note that partitions 0-5 are in the TSA domain bcudomain01 and partitions 6-15 are in the TSA domain bcudomain02.
The scenario above works because each domain has at least one service IP defined on the VLAN.
In the same environment, removing one of the entries as shown below would lead to a hang.
The issue is related to the processVLAN function in hafunctions. This function has a while loop that checks all VLANs named VLAN00${x}, where $x should be incremented. However, if a domain does not have a VLAN entry for at least one partition set, the result is an endless loop because x is never incremented.
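As a minimal runnable sketch of the defect and its fix (illustrative shell only, not the actual hafunctions code):
# Without the increment marked below, a VLAN with no entry in hatools.conf
# would make this loop spin on the same value of x forever.
x=1
while [ ${x} -le 3 ]
do
    if ! grep "VLAN00${x}" /usr/IBM/analytics/ha_tools/hatools.conf > /dev/null 2>&1
    then
        x=$((x + 1))    # the fix: advance the counter before continuing
        continue
    fi
    echo "processing VLAN00${x}"
    x=$((x + 1))
done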
|
Workaround:
-----------
First clean up the failed hachkconfig process. If you ran 'hachkconfig -repair', stop and contact IBM support, as it may have made modifications to the domains that need to be verified before proceeding.
1. Kill the current hachkconfig process. Note that the domains will have resource group locks that will need to be removed.
2. Remove the resource group locks left over by the killed hachkconfig process. This command purposely uses the '-f 1' fanout to run the command on one node at a time.
3. Verify the build level. This workaround is only applicable to this version of hatools.
4. Find the file 'hafunctions' and make a backup.
5. Edit the file and find line 2029.
6. Modify the if block to increment ${x} before continuing the while loop, as shown below. Save the file.
7. Diff the new and old files to verify that was the only line changed.
8. Rerun the hachkconfig command to verify it doesn't hang.
Fixed:
----------
Fix is targeted for V1.1 FP5
|
|
KIG00706
DPM (aka OPM) fails to start on the management host after Stage 2 of V1.1 FP5 [ Added 11/14/2022 ]
|
Fixpack | I_V1.1.0.5 |
In V1.1 FP5 Stage 2 the management host is migrated from AIX 7.1 to AIX 7.2. After migration, if DPM is started with 'hastartdpm' it will not start successfully on the management host, nor will it fail over automatically.
The following error will be seen. This has been a common error with DPM in the past due to a tight threshold when waiting for DPM to start. In most cases prior to V1.1 FP5, DPM would eventually start.
When checking lssam, note the 'Sacrificed' and 'Failed Offline' status.
In some cases you may see this error from lssam and hals.
The above error is most likely due to the '/tmp' file system becoming full because of files related to the Java crash on the management node, which creates 'Snap*' and 'jitdump*' files and may also create 'core*' files in /tmp.
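As an illustrative check on the management host:
# Check /tmp usage and look for the Java crash artifacts described above.
df -g /tmp
ls -l /tmp/Snap* /tmp/jitdump* /tmp/core* 2> /dev/null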
|
Workaround:
-----------
None.
Fixed:
----------
None.
DPM was deprecated in V1.1 FP1, and instructions and guidance to remove DPM were provided in the V1.1 FP4 readme file. The associated IBM product, OPM, is not supported. While DPM can run on the management standby, that is only possible until Stage 6 of the fixpack.
After Stage 6 is applied, do not attempt to start the management domain or DPM. The instructions for DPM removal are provided in the V1.1 FP4 and V1.1 FP5 readme files as part of Stage 9.
The readme file for V1.1 FP5 will be updated in versions after version 101 to address this issue.
|
Related Information
Product Synonym
PureData System for Operational Analytics;PDOA
Document Information
Modified date:
14 November 2022
UID
ibm10872628