Red Hat OpenShift Data Foundation Object Storage Device (OSD) failure

For any type of failed storage device on clusters that are backed by local storage devices, you must replace the failed Red Hat® OpenShift® Data Foundation Object Storage Device (OSD).

If you encounter this problem, contact IBM Support.

Before you begin
Red Hat recommends provisioning the replacement OSD with infrastructure and resources similar to those of the device being replaced.
You can replace an OSD in Red Hat OpenShift Data Foundation that is deployed with local storage devices on the following infrastructures:
  • Bare metal
  • VMware with local deployment
  • SystemZ
Procedure

Complete the following steps to check whether a Red Hat OpenShift Data Foundation OSD failure occurred and to replace the OSD:

  1. Set the Red Hat OpenShift Data Foundation cluster to maintenance mode:
    oc label odfclusters.odf.isf.ibm.com -n ibm-spectrum-fusion-ns odfcluster "odf.isf.ibm.com/maintenanceMode=true"
    
    Example output:
    [root@fu40 ~]# oc label odfclusters.odf.isf.ibm.com -n ibm-spectrum-fusion-ns odfcluster "odf.isf.ibm.com/maintenanceMode=true"
    odfcluster.odf.isf.ibm.com/odfcluster labeled
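    Before moving on, you can confirm that the maintenance label took effect (a sketch; the resource and namespace names match the command above, and the snippet skips itself when no oc session is available):

```shell
# Show the labels on the odfcluster custom resource; expect
# odf.isf.ibm.com/maintenanceMode=true in the LABELS column.
if command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1; then
  oc get odfclusters.odf.isf.ibm.com -n ibm-spectrum-fusion-ns odfcluster --show-labels
else
  echo "skipping: no active oc session"   # safe no-op outside the cluster
fi
checked=yes
```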
    
  2. Identify the failed OSD:
    Use any of the following methods to check whether an OSD failed:
    • Log in to the Red Hat OpenShift Container Platform web console and go to the storage system details page.
    • In the Overview > Block and File tab, check the Status section for any warnings about the storage cluster.
    • If a warning indicates that an OSD is down or degraded, contact IBM Support to replace the failed Red Hat OpenShift Data Foundation OSD on a storage node in an internally attached environment.
      Example warning messages:
      1 osds down
      1 host (1 osds) down
      Degraded data redundancy: 333/999 objects degraded (33.333%), 81 pgs degraded
    • Log in to the IBM Storage Fusion user interface.
    • Go to the Data Foundation page and check the Health section of the storage cluster for warnings.
    Alternatively, you can identify the OSD with the oc command:
    oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
    Example output:
    [root@fu40 ~]# oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
    NAME                               READY   STATUS             RESTARTS      AGE   IP             NODE   NOMINATED NODE   READINESS GATES
    rook-ceph-osd-0-6c99fc999b-2s9mr   1/2     CrashLoopBackOff   5 (17s ago)   17m   10.128.4.216   fu49   <none>           <none>
    rook-ceph-osd-1-764f9cff48-6gkg9   2/2     Running            0             16m   10.131.2.18    fu47   <none>           <none>
    rook-ceph-osd-2-5d9d5984dc-8gkrz   2/2     Running            0             16m   10.129.2.53    fu48   <none>           <none>

    In this example, rook-ceph-osd-0-6c99fc999b-2s9mr must be replaced, and fu49 is the Red Hat OpenShift Container Platform node on which the OSD is scheduled. The ID of the failed OSD is 0.

    You can also view the OSD details with the ceph osd df command in the Ceph toolbox. The failed OSD ID is the same as in the previous step.
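    The ceph osd df view mentioned above can be reached through the Rook toolbox, assuming the rook-ceph-tools deployment is enabled in your cluster (a hedged sketch; the deployment name may differ in your environment):

```shell
# Query OSD utilization and topology from inside the Rook toolbox pod.
# A failed OSD typically shows as "down" in the ceph osd tree output.
if command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1; then
  oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph osd df
  oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph osd tree
else
  echo "skipping: no active oc session"   # safe no-op outside the cluster
fi
checked=yes
```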

  3. Scale down the OSD deployment.
    Scale the replicas of the OSD deployment to 0.
    Use the OSD ID from the previous step; for pod rook-ceph-osd-0-6c99fc999b-2s9mr, the OSD ID is 0.
    osd_id_to_remove=<replace-it-with-osd-id>
    oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
    Example output:
    [root@fu40 ~]# osd_id_to_remove=0
    [root@fu40 ~]#  oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
    deployment.apps/rook-ceph-osd-0 scaled
    Wait for the rook-ceph-osd pod to terminate.
    Run the oc command to monitor the terminating rook-ceph-osd pod:
    oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
    Example output:
    [root@fu40 ~]#  oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
    NAME                               READY   STATUS        RESTARTS   AGE
    rook-ceph-osd-0-6c99fc999b-2s9mr   0/2     Terminating   6          20m
    Note: If the rook-ceph-osd pod remains in the Terminating state for a long time, use the force option to delete the pod.
    oc delete -n openshift-storage pod rook-ceph-osd-0-6c99fc999b-2s9mr --grace-period=0 --force
    Example output:
    [root@fu40 ~]# oc delete -n openshift-storage pod rook-ceph-osd-0-6c99fc999b-2s9mr --grace-period=0 --force
    warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
    pod "rook-ceph-osd-0-6c99fc999b-2s9mr" force deleted
    Verify that the rook-ceph-osd pod is terminated.
    [root@fu40 ~]#  oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
    No resources found in openshift-storage namespace.
  4. Remove the old OSD from the cluster.
    Delete any old ocs-osd-removal jobs.
    Run the oc command to delete the ocs-osd-removal job:
    oc delete -n openshift-storage job ocs-osd-removal-job
    Remove the old OSD from the cluster.

    Make sure that the correct osd_id_to_remove is set.

    In clusters with only three OSDs, or in clusters with insufficient space to restore all three data replicas after the OSD is removed, the FORCE_OSD_REMOVAL value must be changed to true.

    • More than three OSDs:
      oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -n openshift-storage -f -
    • Only three OSDs, or insufficient space (force removal):
      oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -
      Example output:
      [root@fu40 ~]# echo $osd_id_to_remove
      0
    Verify that the OSD is removed.
    Wait for the ocs-osd-removal-job pod to complete.
    [root@fu40 ~]#  oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
    NAME                        READY   STATUS      RESTARTS   AGE
    ocs-osd-removal-job-s4vhc   0/1     Completed   0          24s
    Check the job logs to confirm the removal.
    [root@fu40 ~]#  oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
    2022-11-25 16:08:49.858109 I | cephosd: completed removal of OSD 0
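    The wait-and-grep above can also be scripted with oc wait, which blocks until the removal job reports Complete (a sketch; the 600s timeout is an arbitrary choice):

```shell
# Block until ocs-osd-removal-job reports Complete, then print the
# confirmation line from its log.
if command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1; then
  oc wait --for=condition=complete job/ocs-osd-removal-job \
    -n openshift-storage --timeout=600s
  oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 |
    grep -i 'completed removal'
else
  echo "skipping: no active oc session"   # safe no-op outside the cluster
fi
checked=yes
```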
    The PVC moves to Pending, and the PV becomes Released.
    openshift-storage   ocs-deviceset-ibm-spectrum-fusion-local-0-data-3nsk8j   Pending                                                                        ibm-spectrum-fusion-local     7m16s
    local-pv-a2879220                          600Gi      RWO            Delete           Released   openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b   ibm-spectrum-fusion-local              41m

    To find the worker node, use the oc command to describe the PV.

    For example, the hostname label of the PV is kubernetes.io/hostname=fu49, so the node is fu49.
    [root@fu40 ~]# oc describe pv local-pv-a2879220
    Name:              local-pv-a2879220
    Labels:            kubernetes.io/hostname=fu49
                      storage.openshift.com/owner-kind=LocalVolumeSet
                      storage.openshift.com/owner-name=ibm-spectrum-fusion-local
                      storage.openshift.com/owner-namespace=openshift-local-storage
    Annotations:       pv.kubernetes.io/bound-by-controller: yes
                      pv.kubernetes.io/provisioned-by: local-volume-provisioner-fu49-96f64c0f-e5ed-4bb1-b4ff-cad610562f58
                      storage.openshift.com/device-id: scsi-36000c2913ba6a22c66120c73cb1edae6
                      storage.openshift.com/device-name: sdb
    Finalizers:        [kubernetes.io/pv-protection]
    StorageClass:      ibm-spectrum-fusion-local
    Status:            Released
    Claim:             openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b
    Reclaim Policy:    Delete
    Access Modes:      RWO
    VolumeMode:        Block
    Capacity:          600Gi
    Node Affinity:
      Required Terms:
        Term 0:        kubernetes.io/hostname in [fu49]
    Message:
    Source:
        Type:  LocalVolume (a persistent volume backed by local storage on a node)
        Path:  /mnt/local-storage/ibm-spectrum-fusion-local/scsi-36000c2913ba6a22c66120c73cb1edae6
    Events:
      Type     Reason              Age                  From     Message
      ----     ------              ----                 ----     -------
      Warning  VolumeFailedDelete  6m2s (x26 over 12m)  deleter  Error cleaning PV "local-pv-a2879220": failed to get volume mode of path "/mnt/local-storage/ibm-spectrum-fusion-local/scsi-36000c2913ba6a22c66120c73cb1edae6": Directory check for "/mnt/local-storage/ibm-spectrum-fusion-local/scsi-36000c2913ba6a22c66120c73cb1edae6" failed: open /mnt/local-storage/ibm-spectrum-fusion-local/scsi-36000c2913ba6a22c66120c73cb1edae6: no such file or directory
    Note: If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.
    Remove the encryption-related configuration.
    If encryption was enabled during installation, remove the dm-crypt managed device mappings for the OSD device from the respective Red Hat OpenShift Data Foundation nodes.
    • For each node identified earlier, do the following:
      oc debug node/<node name>
      chroot /host
      dmsetup ls| grep <pvc name>
    • Remove the mapped device.
      cryptsetup luksClose --debug --verbose ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt
      Example output:
      [root@fu40 ~]# oc debug nodes/fu49
      Starting pod/fu49-debug ...
      To use host binaries, run `chroot /host`
      If you don't see a command prompt, try pressing enter.
      sh-4.4# chroot /host
      sh-4.4# dmsetup ls
      ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt	(253:0)
      sh-4.4# cryptsetup luksClose --debug --verbose ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt
      # cryptsetup 2.3.3 processing "cryptsetup luksClose --debug --verbose ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt"
      # Running command close.
      # Locking memory.
      # Installing SIGINT/SIGTERM handler.
      # Unblocking interruption on signal.
      # Allocating crypt device context by device ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt.
      # Initialising device-mapper backend library.
      # dm version   [ opencount flush ]   [16384] (*1)
      # dm versions   [ opencount flush ]   [16384] (*1)
      # Detected dm-ioctl version 4.43.0.
      # Detected dm-crypt version 1.21.0.
      # Device-mapper backend running with UDEV support enabled.
      # dm status ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt  [ opencount noflush ]   [16384] (*1)
      # Releasing device-mapper backend.
      # Allocating context for crypt device (none).
      # Initialising device-mapper backend library.
      Underlying device for crypt device ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt disappeared.
      # dm versions   [ opencount flush ]   [16384] (*1)
      # dm table ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt  [ opencount flush securedata ]   [16384] (*1)
      # dm versions   [ opencount flush ]   [16384] (*1)
      # dm deps ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt  [ opencount flush ]   [16384] (*1)
      # LUKS device header not available.
      # Deactivating volume ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt.
      # dm versions   [ opencount flush ]   [16384] (*1)
      # dm status ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt  [ opencount noflush ]   [16384] (*1)
      # dm versions   [ opencount flush ]   [16384] (*1)
      # dm table ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt  [ opencount flush securedata ]   [16384] (*1)
      # dm versions   [ opencount flush ]   [16384] (*1)
      # dm deps ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt  [ opencount flush ]   [16384] (*1)
      # dm versions   [ opencount flush ]   [16384] (*1)
      # dm table ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt  [ opencount flush securedata ]   [16384] (*1)
      # dm versions   [ opencount flush ]   [16384] (*1)
      # Udev cookie 0xd4d9390 (semid 0) created
      # Udev cookie 0xd4d9390 (semid 0) incremented to 1
      # Udev cookie 0xd4d9390 (semid 0) incremented to 2
      # Udev cookie 0xd4d9390 (semid 0) assigned to REMOVE task(2) with flags DISABLE_LIBRARY_FALLBACK         (0x20)
      # dm remove ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b-block-dmcrypt  [ opencount flush retryremove ]   [16384] (*1)
      # Udev cookie 0xd4d9390 (semid 0) decremented to 1
      # Udev cookie 0xd4d9390 (semid 0) waiting for zero
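    The lookup and luksClose shown above can be combined into one pass inside the chroot (a sketch; PVC_NAME is the example device-set PVC from this procedure, and the commands require root on the affected node):

```shell
# Run inside `chroot /host` of an oc debug session on the affected node.
PVC_NAME=ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b   # example PVC from this procedure
if command -v dmsetup >/dev/null 2>&1; then
  # Find any dm-crypt mapping whose name contains the PVC name.
  MAPPER=$(dmsetup ls 2>/dev/null | awk -v p="$PVC_NAME" '$1 ~ p {print $1}')
  if [ -n "$MAPPER" ]; then
    cryptsetup luksClose --debug --verbose "$MAPPER"
  else
    echo "no dm-crypt mapping found for $PVC_NAME"
  fi
else
  echo "skipping: dmsetup not available here"   # run this on the node itself
fi
checked=yes
```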
    Find the persistent volumes (PVs) that need to be deleted.
    Run the oc command to find the failed PVs:
    oc get pv -l kubernetes.io/hostname=<failed-osds-worker-node-name>
    Example output:
    [root@fu40 ~]# oc get pv -l kubernetes.io/hostname=fu49
    NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                                                     STORAGECLASS                REASON   AGE
    local-pv-a2879220   600Gi      RWO            Delete           Released   openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m227b   ibm-spectrum-fusion-local            55m
    Delete the released persistent volumes (PVs).
    Run the oc command to delete the released PVs:
    oc delete pv <pv_name>
    Example output:
    [root@fu40 ~]#  oc delete pv local-pv-a2879220
    persistentvolume "local-pv-a2879220" deleted
  5. Add a new OSD to the node.

    Physically add the new device to the node.

    Track the provisioning of persistent volumes (PVs) for devices that match the deviceInclusion spec.
    Provisioning the PV can take a few minutes. After the PV is identified, it is automatically added to the cluster.
    • LocalVolumeSet spec:
      oc -n openshift-local-storage describe localvolumeset ibm-spectrum-fusion-local
      Example output:
      ...
      Spec:
      Device Inclusion Spec:
          Device Types:
          disk
          part
          Max Size:  601Gi
          Min Size:  599Gi
      Node Selector:
          Node Selector Terms:
          Match Expressions:
              Key:       cluster.ocs.openshift.io/openshift-storage
              Operator:  In
              Values:
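    To follow the provisioning described above, you can list the local PVs on the node that received the replacement disk (a sketch; fu49 is the example node from this procedure, and adding --watch follows the list live):

```shell
# List local PVs on the node that received the replacement disk; a new
# Available PV should appear within a few minutes.
NODE=fu49   # example node from this procedure; use your worker node name
if command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1; then
  oc get pv -l kubernetes.io/hostname=${NODE}
else
  echo "skipping: no active oc session"   # safe no-op outside the cluster
fi
checked=yes
```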
    Delete the ocs-osd-removal-job.
    Run the oc command to delete the ocs-osd-removal-job:
    oc delete -n openshift-storage job ocs-osd-removal-job
    Example output:
    [root@fu40 ~]# oc delete -n openshift-storage job ocs-osd-removal-job
    job.batch "ocs-osd-removal-job" deleted
  6. Verify that a new OSD is running.
    Verify that the new OSD pod is running.
    Run the oc command to check whether the new OSD pod is running:
    oc get -n openshift-storage pods -l app=rook-ceph-osd
    Example output:
    [root@fu40 ~]# oc get -n openshift-storage pods -l app=rook-ceph-osd
    NAME                               READY   STATUS    RESTARTS   AGE
    rook-ceph-osd-0-7f99b8ccd5-ssj5w   2/2     Running   0          7m31s       <<-- This pod
    rook-ceph-osd-1-764f9cff48-6gkg9   2/2     Running   0          64m
    rook-ceph-osd-2-5d9d5984dc-8gkrz   2/2     Running   0          64m
    Tip: If the new OSD does not show as Running after a few minutes, restart the rook-ceph-operator pod to force a reconciliation.
    oc delete pod -n openshift-storage -l app=rook-ceph-operator
    Verify that a new PVC is created.
    Run the oc command to check the PVCs:
    oc get pvc -n openshift-storage
    Example output:
    [root@fu40 ~]# oc get pvc -n openshift-storage
    NAME                                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
    db-noobaa-db-pg-0                                       Bound    pvc-783036b5-ec40-41a7-91e5-9e179fd24cc3   50Gi       RWO            ocs-storagecluster-ceph-rbd   65m  <<--This one
    ocs-deviceset-ibm-spectrum-fusion-local-0-data-04vwvq   Bound    local-pv-b45b1d67                          600Gi      RWO            ibm-spectrum-fusion-local     66m
    ocs-deviceset-ibm-spectrum-fusion-local-0-data-24nj5t   Bound    local-pv-c3de9110                          600Gi      RWO            ibm-spectrum-fusion-local     66m
    ocs-deviceset-ibm-spectrum-fusion-local-0-data-3nsk8j   Bound    local-pv-1c9f3b11                          600Gi      RWO            ibm-spectrum-fusion-local     34m     
    [root@fu40 ~]#
    [root@fu40 ~]# oc get pv
    NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                                     STORAGECLASS                  REASON   AGE
    local-pv-1c9f3b11                          600Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-3nsk8j   ibm-spectrum-fusion-local              10m     <<--This one
    local-pv-b45b1d67                          600Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-04vwvq   ibm-spectrum-fusion-local              68m
    local-pv-c3de9110                          600Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-24nj5t   ibm-spectrum-fusion-local              68m
    pvc-783036b5-ec40-41a7-91e5-9e179fd24cc3   50Gi       RWO            Delete           Bound    openshift-storage/db-noobaa-db-pg-0                                       ocs-storagecluster-ceph-rbd            65m
    Verify the OSD encryption settings.
    If cluster-wide encryption is enabled, make sure that the crypt keyword appears next to the ocs-deviceset name.
    oc debug node/<new-node-name> -- chroot /host lsblk -f
    oc debug node/<new-node-name> -- chroot /host dmsetup ls
    Example output:
    [root@fu40 ~]# oc debug node/fu49 -- chroot /host lsblk -f
    Starting pod/fu49-debug ...
    To use host binaries, run `chroot /host`
    NAME                                                                  FSTYPE      LABEL                                           UUID                                 MOUNTPOINT
    loop1                                                                 crypto_LUKS pvc_name=ocs-deviceset-ibm-spectrum-fusion-loca 6a8244eb-55d6-48cc-8e68-33436e512bc6
    loop2                                                                 crypto_LUKS pvc_name=ocs-deviceset-ibm-spectrum-fusion-loca fa228ec1-0b1d-43ad-8707-9ecd38bfb1f8
    sda
    |-sda1
    |-sda2                                                                vfat        EFI-SYSTEM                                      A084-4057
    |-sda3                                                                ext4        boot                                            7d757098-d548-4b7b-8c9a-3dd4f34ceca1 /boot
    `-sda4                                                                xfs         root                                            1cd39805-6936-458d-ae8c-39313bb71c95 /sysroot
    sdc                                                                   crypto_LUKS pvc_name=ocs-deviceset-ibm-spectrum-fusion-loca fa228ec1-0b1d-43ad-8707-9ecd38bfb1f8
    `-ocs-deviceset-ibm-spectrum-fusion-local-0-data-3nsk8j-block-dmcrypt
    sr0
    
    Removing debug pod ...
    [root@fu40 ~]# oc debug node/fu49 -- chroot /host dmsetup ls
    Starting pod/fu49-debug ...
    To use host binaries, run `chroot /host`
    ocs-deviceset-ibm-spectrum-fusion-local-0-data-3nsk8j-block-dmcrypt	(253:0)
    
    Removing debug pod ...
    Note: If the verification steps fail, contact Red Hat Support.
    Exit maintenance mode.
    After all steps are complete, run the oc command to exit maintenance mode:
    oc label odfclusters.odf.isf.ibm.com -n ibm-spectrum-fusion-ns odfcluster "odf.isf.ibm.com/maintenanceMode-"
    Example output:
    [root@fu40 ~]# oc label odfclusters.odf.isf.ibm.com -n ibm-spectrum-fusion-ns odfcluster "odf.isf.ibm.com/maintenanceMode-"
    odfcluster.odf.isf.ibm.com/odfcluster unlabeled
  7. Go to the Data Foundation page in the IBM Storage Fusion user interface and check the health of the storage cluster in the Health section.