Red Hat OpenShift Data Foundation Object Storage Device (OSD) failure
For any type of failed storage device on a cluster that is backed by local storage devices, you must replace the Red Hat® OpenShift® Data Foundation Object Storage Device (OSD).
If you encounter this problem, contact IBM Support.
- Before you begin
- Red Hat recommends that the replacement OSD device is configured with infrastructure and resources similar to the device that is being replaced. You can replace an OSD in Red Hat OpenShift Data Foundation that is deployed with local storage devices on the following infrastructures:
- Bare metal
- VMware with local deployment
- SystemZ
- Procedure
Complete the following steps to check whether a Red Hat OpenShift Data Foundation OSD failure occurred and to replace the OSD:
- Place the Red Hat OpenShift Data Foundation cluster in maintenance mode:
oc label odfclusters.odf.isf.ibm.com -n ibm-spectrum-fusion-ns odfcluster "odf.isf.ibm.com/maintenanceMode=true"
Example output: [root@fu40 ~]# oc label odfclusters.odf.isf.ibm.com -n ibm-spectrum-fusion-ns odfcluster "odf.isf.ibm.com/maintenanceMode=true" odfcluster.odf.isf.ibm.com/odfcluster labeled
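To confirm that the maintenance label was applied, you can optionally list the labels on the resource. This is a sketch only; the resource, namespace, and label are the same as in the command above.
# List the labels on the odfcluster resource; LABELS should include odf.isf.ibm.com/maintenanceMode=true.
oc get odfclusters.odf.isf.ibm.com -n ibm-spectrum-fusion-ns odfcluster --show-labels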
- Identify the failed OSD: Use any of the following methods to check whether an OSD failed:
- Log in to the Red Hat OpenShift Container Platform web console and go to the storage system details page.
- On that page, check the Status section for any warnings for the storage cluster.
- If the warnings indicate that an OSD is down or degraded, contact IBM Support to replace the failed Red Hat OpenShift Data Foundation OSD of the storage node in an internally attached environment. Example warning messages:
1 osds down
1 host (1 osds) down
Degraded data redundancy: 333/999 objects degraded (33.333%), 81 pgs degraded
- Log in to the IBM Storage Fusion user interface.
- Go to the Data Foundation page and check for warnings in the Performance section of the storage cluster.
Alternatively, you can use the oc command to identify the OSD: oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
Example output: [root@fu40 ~]# oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES rook-ceph-osd-0-6c99fc999b-2s9mr 1/2 CrashLoopBackOff 5 (17s ago) 17m 10.128.4.216 fu49 <none> <none> rook-ceph-osd-1-764f9cff48-6gkg9 2/2 Running 0 16m 10.131.2.18 fu47 <none> <none> rook-ceph-osd-2-5d9d5984dc-8gkrz 2/2 Running 0 16m 10.129.2.53 fu48 <none> <none>
In this example, rook-ceph-osd-0-6c99fc999b-2s9mr needs to be replaced, and fu49 is the Red Hat OpenShift Container Platform node on which the OSD is scheduled. The ID of the failed OSD is 0. You can also view the OSD details in the Ceph tools pod with ceph osd df; the failed OSD ID is the same as in the previous step.
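If you want to run ceph osd df yourself, the following is a minimal sketch. It assumes that a rook-ceph-tools (Ceph toolbox) pod with the label app=rook-ceph-tools is running in the openshift-storage namespace; adjust the selector if your toolbox deployment differs.
# Open a shell in the Ceph toolbox pod.
TOOLS_POD=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
oc rsh -n openshift-storage ${TOOLS_POD}
# Inside the toolbox, list OSD utilization and topology; the failed OSD shows as down.
ceph osd df
ceph osd tree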
- Scale down the OSD deployment
- Scale the OSD deployment replicas to 0
- Verify the OSD ID from the previous step; in this example the failed pod is rook-ceph-osd-0-6c99fc999b-2s9mr and the OSD ID is 0.
osd_id_to_remove=<replace-it-with-osd-id>
oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
Example output: [root@fu40 ~]# osd_id_to_remove=0 [root@fu40 ~]# oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0 deployment.apps/rook-ceph-osd-0 scaled
- Wait for the rook-ceph-osd pod to terminate
- Run the oc command to check whether the rook-ceph-osd pod is terminated. oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
Example output: [root@fu40 ~]# oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove} NAME READY STATUS RESTARTS AGE rook-ceph-osd-0-6c99fc999b-2s9mr 0/2 Terminating 6 20m
Note: If the rook-ceph-osd pod remains in the Terminating state for longer than expected, use the force option to delete the pod. oc delete -n openshift-storage pod rook-ceph-osd-0-6c99fc999b-2s9mr --grace-period=0 --force
Example output: [root@fu40 ~]# oc delete -n openshift-storage pod rook-ceph-osd-0-6c99fc999b-2s9mr --grace-period=0 --force warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely. pod "rook-ceph-osd-0-6c99fc999b-2s9mr" force deleted
Verify that the rook-ceph-osd pod is terminated. [root@fu40 ~]# oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove} No resources found in openshift-storage namespace.
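Instead of polling with oc get, you can optionally block until the pod is gone. This is a sketch only, using the same label selector and variable as above.
# Wait up to 5 minutes for the failed OSD pod to be fully deleted.
oc wait -n openshift-storage --for=delete pod -l ceph-osd-id=${osd_id_to_remove} --timeout=300s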
- Remove the old OSD from the cluster.
- Delete any old ocs-osd-removal jobs
- Run the oc command to delete the ocs-osd-removal job. oc delete -n openshift-storage job ocs-osd-removal-job
- Remove the old OSD from the cluster
Make sure that you set the correct osd_id_to_remove. In clusters that have only three OSDs, or in clusters that do not have enough space to restore all three replicas of the data after the OSD is removed, the FORCE_OSD_REMOVAL value must be changed to true.
- More than three OSDs
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
- Only three OSDs, or not enough space (forced removal)
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -
Example output: [root@fu40 ~]# echo $osd_id_to_remove 0
- Verify that the OSD is removed
- Wait for the ocs-osd-removal-job pod to complete. [root@fu40 ~]# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage NAME READY STATUS RESTARTS AGE ocs-osd-removal-job-s4vhc 0/1 Completed 0 24s
Check the job logs again to confirm that the removal completed. [root@fu40 ~]# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal' 2022-11-25 16:08:49.858109 I | cephosd: completed removal of OSD 0
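Optionally, you can wait for the removal job itself to report completion instead of checking the pod repeatedly. A minimal sketch, assuming the default job name ocs-osd-removal-job:
# Wait up to 10 minutes for the removal job to reach the Complete condition.
oc wait -n openshift-storage --for=condition=complete job/ocs-osd-removal-job --timeout=600s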
The PVC moves to Pending, and the PV becomes Released.
openshift-storage ocs-deviceset-ibm-spectrum-fusion-local-0-data-3nsk8j Pending ibm-spectrum-fusion-local 7m16s
local-pv-a2879220 600Gi RWO Delete Released openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b ibm-spectrum-fusion-local 41m
To find the worker node, use the oc command to describe the PV.
For example, the PV host name is fu49 (kubernetes.io/hostname=fu49).
[root@fu40 ~]# oc describe pv local-pv-a2879220 Name: local-pv-a2879220 Labels: kubernetes.io/hostname=fu49 storage.openshift.com/owner-kind=LocalVolumeSet storage.openshift.com/owner-name=ibm-spectrum-fusion-local storage.openshift.com/owner-namespace=openshift-local-storage Annotations: pv.kubernetes.io/bound-by-controller: yes pv.kubernetes.io/provisioned-by: local-volume-provisioner-fu49-96f64c0f-e5ed-4bb1-b4ff-cad610562f58 storage.openshift.com/device-id: scsi-36000c2913ba6a22c66120c73cb1edae6 storage.openshift.com/device-name: sdb Finalizers: [kubernetes.io/pv-protection] StorageClass: ibm-spectrum-fusion-local Status: Released Claim: openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b Reclaim Policy: Delete Access Modes: RWO VolumeMode: Block Capacity: 600Gi Node Affinity: Required Terms: Term 0: kubernetes.io/hostname in [fu49] Message: Source: Type: LocalVolume (a persistent volume backed by local storage on a node) Path: /mnt/local-storage/ibm-spectrum-fusion-local/scsi-36000c2913ba6a22c66120c73cb1edae6 Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning VolumeFailedDelete 6m2s (x26 over 12m) deleter Error cleaning PV "local-pv-a2879220": failed to get volume mode of path "/mnt/local-storage/ibm-spectrum-fusion-local/scsi-36000c2913ba6a22c66120c73cb1edae6": Directory check for "/mnt/local-storage/ibm-spectrum-fusion-local/scsi-36000c2913ba6a22c66120c73cb1edae6" failed: open /mnt/local-storage/ibm-spectrum-fusion-local/scsi-36000c2913ba6a22c66120c73cb1edae6: no such file or directory
Note: If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.
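For that debugging, a short sketch that collects the job status and the full pod logs is shown below; the job name and label are the defaults used earlier in this procedure.
# Show the job status and related events that can explain a failure.
oc describe job ocs-osd-removal-job -n openshift-storage
# Dump the complete logs of the removal pod.
oc logs -n openshift-storage -l job-name=ocs-osd-removal-job --tail=-1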
- Remove the encryption-related configuration
- If encryption was enabled during installation, remove the dm-crypt managed device mappings for the OSD devices from the respective Red Hat OpenShift Data Foundation nodes.
- For each of the previously identified nodes, do the following:
oc debug node/<node name>
chroot /host
dmsetup ls | grep <pvc name>
- Remove the mapped device. A verification sketch follows the example output below.
cryptsetup luksClose --debug --verbose ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt
Example output: [root@fu40 ~]# oc debug nodes/fu49 Starting pod/fu49-debug ... To use host binaries, run `chroot /host` If you don't see a command prompt, try pressing enter. sh-4.4# chroot /host sh-4.4# dmsetup ls ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt (253:0) sh-4.4# cryptsetup luksClose --debug --verbose ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt
# cryptsetup 2.3.3 processing "cryptsetup luksClose --debug --verbose ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt" # Running command close. # Locking memory. # Installing SIGINT/SIGTERM handler. # Unblocking interruption on signal. # Allocating crypt device context by device ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt. # Initialising device-mapper backend library. # dm version [ opencount flush ] [16384] (*1) # dm versions [ opencount flush ] [16384] (*1) # Detected dm-ioctl version 4.43.0. # Detected dm-crypt version 1.21.0. # Device-mapper backend running with UDEV support enabled. # dm status ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt [ opencount noflush ] [16384] (*1) # Releasing device-mapper backend. # Allocating context for crypt device (none). # Initialising device-mapper backend library. Underlying device for crypt device ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt disappeared. # dm versions [ opencount flush ] [16384] (*1) # dm table ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt [ opencount flush securedata ] [16384] (*1) # dm versions [ opencount flush ] [16384] (*1) # dm deps ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt [ opencount flush ] [16384] (*1) # LUKS device header not available. # Deactivating volume ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt. # dm versions [ opencount flush ] [16384] (*1) # dm status ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt [ opencount noflush ] [16384] (*1) # dm versions [ opencount flush ] [16384] (*1) # dm table ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt [ opencount flush securedata ] [16384] (*1) # dm versions [ opencount flush ] [16384] (*1) # dm deps ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt [ opencount flush ] [16384] (*1) # dm versions [ opencount flush ] [16384] (*1) # dm table ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt [ opencount flush securedata ] [16384] (*1) # dm versions [ opencount flush ] [16384] (*1) # Udev cookie 0xd4d9390 (semid 0) created # Udev cookie 0xd4d9390 (semid 0) incremented to 1 # Udev cookie 0xd4d9390 (semid 0) incremented to 2 # Udev cookie 0xd4d9390 (semid 0) assigned to REMOVE task(2) with flags DISABLE_LIBRARY_FALLBACK (0x20) # dm remove ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt [ opencount flush retryremove ] [16384] (*1) # Udev cookie 0xd4d9390 (semid 0) decremented to 1 # Udev cookie 0xd4d9390 (semid 0) waiting for zero
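To confirm that the mapping was removed, you can rerun dmsetup from the same debug shell. This is a sketch only; <pvc name> is the same placeholder that is used above.
# Still inside chroot /host on the debug pod: the dm-crypt mapping should no longer be listed.
dmsetup ls | grep <pvc name>
# No output means the mapping is gone; exit the chroot and the debug pod afterwards.
exit
exit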
- Find the persistent volume (PV) that needs to be deleted
- Run the oc command to find the failed PV.
oc get pv -l kubernetes.io/hostname=<failed-osds-worker-node-name>
Example output: [root@fu40 ~]# oc get pv -l kubernetes.io/hostname=fu49 NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE local-pv-a2879220 600Gi RWO Delete Released openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b ibm-spectrum-fusion-local 55m
- Delete the released persistent volume (PV)
- Run the oc command to delete the released PV.
oc delete pv <pv_name>
Example output: [root@fu40 ~]# oc delete pv local-pv-a2879220 persistentvolume "local-pv-a2879220" deleted
- Add a new OSD to the node.
Physically add the new device to the node.
- Track the provisioning of persistent volumes (PVs) for the devices that match the deviceInclusionSpec.
- Provisioning the PV might take a few minutes. After the PV is identified, it is automatically added to the cluster (see the watch sketch after the LocalVolumeSet spec output below).
- LocalVolumeSet spec
oc -n openshift-local-storage describe localvolumeset ibm-spectrum-fusion-local
Example output: ... Spec: Device Inclusion Spec: Device Types: disk part Max Size: 601Gi Min Size: 599Gi Node Selector: Node Selector Terms: Match Expressions: Key: cluster.ocs.openshift.io/openshift-storage Operator: In Values:
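To watch for the new PV, you can filter on the host name of the node that received the new device, as in the earlier oc get pv example. This is a sketch only; fu49 is the example node name used in this procedure.
# Watch for a new local-pv-* entry on the node; an Available or Bound PV indicates that provisioning completed.
oc get pv -l kubernetes.io/hostname=fu49 -w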
- Delete the ocs-osd-removal-job
- Run the oc command to delete the ocs-osd-removal-job.
oc delete -n openshift-storage job ocs-osd-removal-job
Example output: [root@fu40 ~]# oc delete -n openshift-storage job ocs-osd-removal-job job.batch "ocs-osd-removal-job" deleted
- Verify that a new OSD is running
- Verify that the new OSD pod is running
- Run the oc command to check whether the new OSD pod is running.
oc get -n openshift-storage pods -l app=rook-ceph-osd
Example output: [root@fu40 ~]# oc get -n openshift-storage pods -l app=rook-ceph-osd NAME READY STATUS RESTARTS AGE rook-ceph-osd-0-7f99b8ccd5-ssj5w 2/2 Running 0 7m31s <<-- This pod rook-ceph-osd-1-764f9cff48-6gkg9 2/2 Running 0 64m rook-ceph-osd-2-5d9d5984dc-8gkrz 2/2 Running 0 64m
Tip: If the new OSD does not show as Running after a few minutes, restart the rook-ceph-operator pod to force a reconciliation. oc delete pod -n openshift-storage -l app=rook-ceph-operator
- Verify that a new PVC is created
- Run the oc command to check that the new PVC is created.
oc get pvc -n openshift-storage
Example output: [root@fu40 ~]# oc get pvc -n openshift-storage NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE db-noobaa-db-pg-0 Bound pvc-783036b5-ec40-41a7-91e5-9e179fd24cc3 50Gi RWO ocs-storagecluster-ceph-rbd 65m <<--This one ocs-deviceset-ibm-spectrum-fusion-local-0-data-04vwvq Bound local-pv-b45b1d67 600Gi RWO ibm-spectrum-fusion-local 66m ocs-deviceset-ibm-spectrum-fusion-local-0-data-24nj5t Bound local-pv-c3de9110 600Gi RWO ibm-spectrum-fusion-local 66m ocs-deviceset-ibm-spectrum-fusion-local-0-data-3nsk8j Bound local-pv-1c9f3b11 600Gi RWO ibm-spectrum-fusion-local 34m [root@fu40 ~]# [root@fu40 ~]# oc get pv NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE local-pv-1c9f3b11 600Gi RWO Delete Bound openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-3nsk8j ibm-spectrum-fusion-local 10m <<--This one local-pv-b45b1d67 600Gi RWO Delete Bound openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-04vwvq ibm-spectrum-fusion-local 68m local-pv-c3de9110 600Gi RWO Delete Bound openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-24nj5t ibm-spectrum-fusion-local 68m pvc-783036b5-ec40-41a7-91e5-9e179fd24cc3 50Gi RWO Delete Bound openshift-storage/db-noobaa-db-pg-0 ocs-storagecluster-ceph-rbd 65m
- Verify the OSD encryption setting
- If cluster-wide encryption is enabled, verify that the crypt keyword appears next to the ocs-deviceset name.
oc debug node/<new-node-name> -- chroot /host lsblk -f
oc debug node/<new-node-name> -- chroot /host dmsetup ls
Example output: [root@fu40 ~]# oc debug node/fu49 -- chroot /host lsblk -f Starting pod/fu49-debug ... To use host binaries, run `chroot /host` NAME FSTYPE LABEL UUID MOUNTPOINT loop1 crypto_LUKS pvc_name=ocs-deviceset-ibm-spectrum-fusion-loca 6a8244eb-55d6-48cc-8e68-33436e512bc6 loop2 crypto_LUKS pvc_name=ocs-deviceset-ibm-spectrum-fusion-loca fa228ec1-0b1d-43ad-8707-9ecd38bfb1f8 sda |-sda1 |-sda2 vfat EFI-SYSTEM A084-4057 |-sda3 ext4 boot 7d757098-d548-4b7b-8c9a-3dd4f34ceca1 /boot `-sda4 xfs root 1cd39805-6936-458d-ae8c-39313bb71c95 /sysroot sdc crypto_LUKS pvc_name=ocs-deviceset-ibm-spectrum-fusion-loca fa228ec1-0b1d-43ad-8707-9ecd38bfb1f8 `-ocs-deviceset-ibm-spectrum-fusion-local-0-data-3nsk8j-block-dmcrypt sr0 Removing debug pod ... [root@fu40 ~]# oc debug node/fu49 -- chroot /host dmsetup ls Starting pod/fu49-debug ... To use host binaries, run `chroot /host` ocs-deviceset-ibm-spectrum-fusion-local-0-data-3nsk8j-block-dmcrypt (253:0) Removing debug pod ...
Note: If the verification step fails, contact Red Hat Support.
- Exit maintenance mode
- After you complete all the steps, run the oc command to exit maintenance mode.
oc label odfclusters.odf.isf.ibm.com -n ibm-spectrum-fusion-ns odfcluster "odf.isf.ibm.com/maintenanceMode-"
Example output: [root@fu40 ~]# oc label odfclusters.odf.isf.ibm.com -n ibm-spectrum-fusion-ns odfcluster "odf.isf.ibm.com/maintenanceMode-" odfcluster.odf.isf.ibm.com/odfcluster unlabeled
- Go to the Data Foundation page in the IBM Storage Fusion user interface and check the health of the storage cluster in the Performance section.
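Optionally, you can also confirm from the Ceph side that recovery finished. This is a sketch only, assuming the rook-ceph-tools pod is available as in the earlier toolbox example.
# Check overall Ceph health; HEALTH_OK with all OSDs up and in indicates a successful replacement.
oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
ceph status
ceph osd tree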