IBM Support

With Broadcom Emulex Fibre Channel adapters, CPU DLPAR add or remove operations might fail and cause a soft lockup on the system

Flashes (Alerts)


Abstract

Power Systems with Broadcom Emulex Fibre Channel adapters might fail to add or remove CPUs by using Dynamic Logical Partitioning (DLPAR). This failure can be seen in Red Hat Enterprise Linux 8, Red Hat Enterprise Linux 9.x, and SUSE Linux Enterprise Server 15.

Content

Linux Releases Affected

Red Hat Enterprise Linux 8.7
Red Hat Enterprise Linux 9.x
SUSE Linux Enterprise Server 15, Service Pack 3

IBM Systems Affected

Power Systems with Emulex FC adapters
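
As a rough way to check whether a partition is potentially exposed, the following sketch looks for the lpfc kernel module in the contents of /proc/modules. The lpfc module name comes from the trace in this document; the function names are hypothetical, and a loaded module is only an indication that an Emulex FC adapter is in use, not a confirmation that the partition is affected.

```python
def lpfc_loaded(modules_text):
    """Return True if the lpfc module appears in /proc/modules-style text.

    Each line of /proc/modules starts with the module name followed by
    whitespace-separated fields; only the first field is inspected here.
    """
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "lpfc":
            return True
    return False


def lpfc_loaded_on_this_system(path="/proc/modules"):
    """Hypothetical convenience wrapper that reads the live module list."""
    with open(path) as handle:
        return lpfc_loaded(handle.read())
```

On a running Linux partition, calling `lpfc_loaded_on_this_system()` returns True when the driver is currently loaded.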

Symptoms

When you perform DLPAR operations on CPUs while an Emulex FC adapter is installed, the adapter driver might not register the addition of new CPUs or the removal of active CPUs. This failure might cause a soft lockup on the system, with a trace similar to the following:

watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding ip_set nf_tables nfnetlink dm_service_time dm_multipath pseries_rng mlx5_ib xts vmx_crypto ib_uverbs ib_core binfmt_misc xfs libcrc32c sd_mod sg ibmvscsi scsi_transport_srp ibmveth mlx5_core lpfc nvmet_fc nvmet nvme_fc nvme_fabrics mlxfw nvme_core tls t10_pi scsi_transport_fc psample dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: nft_compat]
CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 4.18.0-408.el8.ppc64le #1
NIP:  c00800000f0fbbfc LR: c00800000f1135dc CTR: c00800000f0ff598
REGS: c0000000021832f0 TRAP: 0901   Not tainted  (4.18.0-408.el8.ppc64le)
MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28000288  XER: 00000003
CFAR: c00800000f0fb7cc IRQMASK: 0
GPR00: c00800000f1135dc c000000002183580 c00800000f23ae00 c00000362cfc0000
GPR04: c000003635119348 0000000000000000 0000000000000004 0000000000000001
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000002
GPR12: c00800000f0ff598 c000000002e10000
NIP [c00800000f0fbbfc] lpfc_sli4_process_eq+0x544/0x710 [lpfc]
LR [c00800000f1135dc] lpfc_sli4_poll_hbtimer+0xc4/0xe0 [lpfc]
Call Trace:
[c000000002183640] [c00800000f1135dc] lpfc_sli4_poll_hbtimer+0xc4/0xe0 [lpfc]
[c000000002183680] [c00000000025bb70] call_timer_fn+0x50/0x200
[c000000002183710] [c00000000025be68] expire_timers+0x148/0x230
[c000000002183780] [c00000000025c7f0] run_timer_softirq+0x3f0/0xe80
[c000000002183850] [c000000000f6370c] __do_softirq+0x16c/0x3e4
[c000000002183940] [c000000000179f94] irq_exit_rcu+0x1a4/0x1d0
[c000000002183970] [c000000000179fe0] irq_exit+0x20/0x40
[c000000002183990] [c000000000020958] timer_interrupt+0x128/0x2f0
[c0000000021839f0] [c0000000000091b0] decrementer_common+0x110/0x120
--- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
    LR = dedicated_cede_loop+0x168/0x1d0
[c000000002183cf0] [c0000000021c754c] cpu_idle_force_poll+0x0/0x4 (unreliable)
[c000000002183d70] [c000000000bb21ac] cpuidle_enter_state+0x33c/0x7e0
[c000000002183de0] [c000000000bb26f0] cpuidle_enter+0x50/0x70
[c000000002183e20] [c0000000001cdf38] do_idle+0x3d8/0x470
[c000000002183ea0] [c0000000001ce218] cpu_startup_entry+0x38/0x40
[c000000002183ed0] [c0000000000106c4] rest_init+0xe0/0xf8
[c000000002183f00] [c0000000016a44b0] start_kernel+0x690/0x6cc
[c000000002183f90] [c00000000000adcc] start_here_common+0x1c/0x550
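
To help recognize this signature in a saved kernel log, the following is a minimal sketch that scans log text for soft lockup reports whose trace includes frames from the lpfc module. The function name, the returned dictionary layout, and the 60-line search window are assumptions made for illustration; the patterns are based only on the trace shown above, and a match does not by itself confirm this specific issue.

```python
import re

# Patterns modelled on the trace shown in this document.
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!"
)
LPFC_FRAME = re.compile(
    r"\b(lpfc_\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[lpfc\]"
)


def find_lpfc_soft_lockups(log_text, window=60):
    """Return one entry per soft lockup report that has lpfc frames.

    For each soft lockup line, up to `window` following lines are
    searched for symbols that belong to the lpfc module. The search
    stops early if another soft lockup report begins.
    """
    lines = log_text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        match = SOFT_LOCKUP.search(line)
        if not match:
            continue
        functions = []
        for follow in lines[index + 1 : index + 1 + window]:
            if SOFT_LOCKUP.search(follow):
                break
            for name in LPFC_FRAME.findall(follow):
                if name not in functions:
                    functions.append(name)
        if functions:
            hits.append(
                {
                    "cpu": int(match.group(1)),
                    "seconds": int(match.group(2)),
                    "lpfc_functions": functions,
                }
            )
    return hits
```

Passing the text of a saved console or dmesg capture to `find_lpfc_soft_lockups` returns a list with one dictionary per matching report; an empty list means that no soft lockup with lpfc frames was found in that text.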

Workaround

There is currently no workaround for this issue. Instead of using a DLPAR operation, shut down the logical partition before adding or removing CPUs from its configuration.

For more information about DLPAR, see Dynamic logical partitioning.
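
After the partition is reactivated with its new CPU configuration, it can be useful to confirm which CPUs Linux reports as online. The sketch below parses the CPU list format used by /sys/devices/system/cpu/online (for example, 0-7,16-23). The function names are hypothetical, and this check only reports the operating system's view of online CPUs; it does not detect the driver problem described in this document.

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_online_cpus(path="/sys/devices/system/cpu/online"):
    """Read and parse the list of online CPUs from sysfs."""
    with open(path) as handle:
        return parse_cpu_list(handle.read())
```

Comparing `len(read_online_cpus())` before the shutdown and after the reactivation shows whether the logical CPU count changed as expected.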

Fix Outlook

The resolution to this issue is still under investigation and will be delivered in a future z-stream kernel version or maintweb release. After the fix is available, upgrading the kernel should resolve the issue. If the issue persists after the fix is deployed, contact IBM Support for further assistance.

Document Information

Modified date:
21 December 2022

UID

ibm16847539