Install the ESS system

Before proceeding with the following steps, ensure that you have completed all the steps in Install the management server software. Follow these steps to perform a new installation of the ESS software on a management server node and I/O server nodes. The node host names ems1, gssio1, and gssio2 are examples; each environment can have its own naming conventions. For an xCAT command such as updatenode, use an xCAT host name. For IBM Spectrum Scale commands (those starting with mm), use an IBM Spectrum Scale host name. For example, ems1 is an xCAT host name (typically a host name associated with the management interface) and ems1-hs is the corresponding IBM Spectrum Scale host name (typically a host name associated with the high-speed interface).
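For illustration, a minimal /etc/hosts layout consistent with this naming convention might look like the following; all IP addresses are hypothetical:

  # xCAT (management interface) host names
  192.168.45.20  ems1
  192.168.45.21  gssio1
  192.168.45.22  gssio2
  # IBM Spectrum Scale (high-speed interface) host names
  172.31.250.3   ems1-hs
  172.31.250.1   gssio1-hs
  172.31.250.2   gssio2-hs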
  1. Make the gssdeploy script executable:
    chmod +x /opt/ibm/gss/install/rhel7/<ARCH>/samples/gssdeploy
  2. Clean the current xCAT installation and any associated preexisting configuration, and then address any errors before proceeding:
    /opt/ibm/gss/install/rhel7/<ARCH>/samples/gssdeploy -c
  3. Run one of the following commands depending on the architecture:
    For PPC64BE:
    cd /var/tmp ; ./gssinstall_ppc64 -u
    For PPC64LE:
    cd /var/tmp ; ./gssinstall_ppc64le -u
  4. Copy the gssdeploy.cfg.default file and customize the copy for your environment by editing it:
    
    cp /var/tmp/gssdeploy.cfg.default /var/tmp/gssdeploy.cfg
    
    Note: The directory from which you run the gssinstall script determines where gssdeploy.cfg.default is stored. It is recommended, but not mandatory, that you run the gssinstall script from /var/tmp.
    Do not copy the gssdeploy.cfg configuration file to the /tmp directory, because the gssdeploy script uses the /tmp/gssdeploy directory and the /tmp directory might be cleaned up on a system reboot.
  5. If deploying on the PPC64LE platform, gather information for the gssdeploy.cfg configuration file by using the following commands while you are in close proximity to the rack containing the nodes:
    1. Scan the nodes in the FSP subnet range:
      /var/tmp/gssdeploy -f FSP_Subnet_Range
      FSP_Subnet_Range is the FSP management node interface subnet range. For example, 10.0.0.0/24.
      Note:
      • It is recommended to use the IP address 10.0.0.1 for the management interface, if possible.
      • It is highly recommended that you use the /24 netmask because scanning the subnet can take a considerable amount of time if a wider network range is used.
      • The gssdeploy -f command first determines whether a DHCP server is running on the network. If a DHCP server is not running, it prompts you to start one so that the I/O server nodes can obtain addresses. Select Y to start the DHCP server when prompted.
      Note: This command scans the specified subnet range to ensure that only the nodes on which you want to deploy are available. These include the I/O server nodes and the management server node (EMS).

      This command also returns the following:
      • Serial numbers and FSP numbers of the nodes in the building block
      • Serial numbers and IP addresses of I/O server nodes in the building block
      Note: Do not proceed to the next step until FSP IP addresses and serial numbers of all known nodes are visible using the gssdeploy -f script.
    2. Physically identify the nodes in the rack:
      /var/tmp/gssdeploy -i
      With the -i option, Node_IP, Default_Password, and Duration need to be provided as input, where:
      • Node_IP is the returned FSP IPMI IP address of the node obtained by using the gssdeploy -f command.
      • Default_Password is the default password of the node, which is PASSW0RD.
      • Duration is the time duration in seconds for which the LED on the node should blink.

      After you issue this command, the LED blinks on the specified node for the specified duration. You can identify the node in the rack using the blinking LED.

      Entries are made in the gssdeploy.cfg file in the same order as the nodes' physical positions in the rack. For example, the entry for the bottommost node in the rack is placed first in gssdeploy.cfg.

  6. Update the gssdeploy.cfg file according to your requirements and the gathered information; a sample configuration sketch follows this list.
    The options that you can specify in the gssdeploy.cfg file include:
    • Whether to use a DVD for installation: RHEL_USE_DVD

      The default option is to use ISO.

    • If using a DVD, the device location: RHEL_DVD
    • Mount point to use for RHEL media: RHEL_MNT
    • ISO location: RHEL_ISODIR

      The default location is /opt/ibm/gss/iso.

    • ISO file name: RHEL_ISO
    • EMS host name: EMS_HOSTNAME
    • Network interface for xCAT management network: EMS_MGTNETINTERFACE
    • Network interface for FSP network: FSP_MGTNETINTERFACE [Not applicable for PPC64BE]
    • FSP default IPMI password: FSP_PASSWD [Not applicable for PPC64BE]
    • HMC host name: HMC_HOSTNAME [Not applicable for PPC64LE]
    • HMC default user ID: HMC_ROOTUID [Not applicable for PPC64LE]
    • HMC default password: HMC_PASSWD [Not applicable for PPC64LE]
    • I/O server user ID: IOSERVERS_UID
    • I/O server default password: IOSERVERS_PASSWD
    • I/O server serial numbers: IOSERVERS_SERIAL [Not applicable for PPC64BE]
    • I/O server node names: IOSERVERS_NODES

      For example, gssio1 gssio2

    • Deployment OS image: DEPLOY_OSIMAGE
    Note: For PPC64LE, there must be a one-to-one relationship between serial number and node in gssdeploy.cfg and for every node specified in gssdeploy.cfg, there must be a matching entry in /etc/hosts.
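    The following is a sketch of a customized gssdeploy.cfg for a PPC64LE system. All values are examples only and must be replaced with the values gathered for your environment; see your gssdeploy.cfg.default for the exact syntax and the complete set of options:

      RHEL_USE_DVD=no                    # use the ISO (the default) rather than a DVD
      RHEL_ISODIR=/opt/ibm/gss/iso       # default ISO location
      RHEL_ISO=RHEL-7.6-Server-ppc64le-dvd1.iso
      EMS_HOSTNAME=ems1
      EMS_MGTNETINTERFACE=enP3p9s0f0     # hypothetical interface name
      FSP_MGTNETINTERFACE=enP3p9s0f3     # hypothetical interface name
      FSP_PASSWD=PASSW0RD
      IOSERVERS_UID=root
      IOSERVERS_PASSWD=<default password>
      IOSERVERS_SERIAL="21A1234 21A1235" # hypothetical serial numbers
      IOSERVERS_NODES="gssio1 gssio2"
      DEPLOY_OSIMAGE=rhels7.6-ppc64le-install-gss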
  7. Copy the RHEL 7.6 ISO file to the directory specified in the gssdeploy.cfg file.
  8. Perform precheck to detect any errors and address them before proceeding further:
    /opt/ibm/gss/tools/samples/gssprecheck -N ems1 --pre --install --file /var/tmp/gssdeploy.cfg
    Note: gssprecheck gives hints on ways to fix any discovered issues. It is recommended that you review each reported issue carefully, though resolving every issue might not be mandatory.
    Attention: Power down the storage enclosures, or remove the SAS cables, before running gssdeploy -x.
  9. Verify that the ISO is placed in the location specified in the gssdeploy.cfg configuration file and then run the gssdeploy script:
    /var/tmp/gssdeploy -x  
    
    Note: To perform the I/O server discovery task, this step power cycles the I/O server nodes specified in the gssdeploy.cfg file.
  10. Log out and then log back in to acquire the environment updates.
  11. Back up the xCAT database and save it to a location not on the management server node:
    dumpxCATdb -p /var/tmp/db
    tar -zcvf xCATDB-backup.tar.gz /var/tmp/db
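    Then copy the backup off the management server node; for example, to a hypothetical remote host named backuphost:
    scp xCATDB-backup.tar.gz user@backuphost:/backups/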
  12. Set up the kernel, systemd, and Network Manager errata repositories. For example, use the following command on PPC64LE systems:
    /var/tmp/gssdeploy -k /home/deploy/kernel_ESS_5211_LE.tgz -p /home/deploy/systemd_ESS_5211_5361_LE.tgz,/home/deploy/netmanager-RHBA-2020-0381-LE.tar.gz,/home/deploy/opal-patch-le.tar.gz --silent 
    Note: This command extracts the supplied compressed tar files and builds the associated repositories.
    • -k option: Set up the kernel repository
    • -p option: Set up the patch repository (for example: systemd, network manager). One or more patch archives can be specified at the same time, separated by commas.
    • Directory structure:

      Kernel repository

      /install/gss/otherpkgs/rhels7/<arch>/kernel

      Patch repository

      /install/gss/otherpkgs/rhels7/<arch>/patch

    Important: Make sure that all RPMs in the /install directory, including the extracted files in the kernel directory (/install/gss/otherpkgs/rhels7/<arch>/kernel), the patch directory (/install/gss/otherpkgs/rhels7/<arch>/patch), and the xCAT RPMs, have the correct read permission for user, group, and others (chmod 644). For example:
    /install/gss/otherpkgs/rhels7/<arch>/kernel
    -rw-r--r-- 1 root root 45772640 2020-08-24 09:21 kernel-3.10.0-957.58.2.el7.ppc64le.rpm
    /install/gss/otherpkgs/rhels7/<arch>/patch
    -rw-r--r-- 1 root root 5447968 2020-08-24 21:57 systemd-219-67.el7_7.10.ppc64le.rpm
    -rw-r--r-- 1 root root 2038068 Feb 25 11:50 NetworkManager-1.18.0-5.el7_7.2.ppc64.rpm
    Wrong file permissions will lead to node deployment failure.
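    A minimal sketch for setting the required permissions on the repository RPMs (substitute your platform for <arch>):
    # give user, group, and others read permission on all RPMs in both repositories
    find /install/gss/otherpkgs/rhels7/<arch>/kernel /install/gss/otherpkgs/rhels7/<arch>/patch -name '*.rpm' -exec chmod 644 {} +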
  13. Update the management server node. Here ems1 is the xCAT host name. This step installs the kernel, uninstalls OFED, installs IBM Spectrum Scale, and applies the IBM Spectrum Scale profile.
     
    updatenode ems1 -P gss_updatenode
    
    Use systemctl reboot to reboot the management server node, and then run this step again as shown below. This additional run rebuilds OFED for the new kernel and builds the GPFS portability layer (GPL) for IBM Spectrum Scale.
    updatenode ems1 -P gss_updatenode
    Note: You can use the -V option with the updatenode command for a more verbose output on the screen for a better understanding of failures, if any.
  14. Update OFED on the management server node:
     
    updatenode ems1 -P gss_ofed 
    
  15. Update the IP RAID Adapter firmware on the management server node:
    updatenode ems1 -P gss_ipraid
  16. Use systemctl reboot to reboot the management server node.

Deploy the I/O server nodes

  1. Before initiating the deployment of the I/O server nodes, do the following:
    1. Verify that the running kernel level is 957.58.2 using the uname -a command (see the quick-check sketch after this list).
    2. Verify that there are no repository errors using the yum repolist command.
    3. Ensure that the attached storage enclosures are powered off.
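    The first two checks can be scripted as follows (a minimal sketch; the expected kernel level is the one installed in the previous section):
    uname -r | grep -q '957.58.2' && echo 'kernel level OK'
    yum repolist    # confirm that no repository errors are reported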
  2. Run the gssinstallcheck script:
    gssinstallcheck -N ems1

    This script verifies the IBM Spectrum Scale profile, OFED, kernel, and so on.

    1. Check for any errors with the following:

      1. Installed packages
      2. Linux® kernel release
      3. OFED level
      4. IPR SAS FW
      5. IPR SAS queue depth
      6. System firmware
      7. System profile setting
      8. Host adapter driver
    Ignore other errors that may be flagged by the gssinstallcheck script. They will go away after the remaining installation steps are completed.
  3. Run the gssprecheck script in full install mode and address any errors:
    /opt/ibm/gss/tools/samples/gssprecheck -N ems1 --install --file /var/tmp/gssdeploy.cfg 
    Note: gssprecheck gives hints on ways to fix any discovered issues. It is recommended that you review each reported issue carefully, though resolving every issue might not be mandatory.
  4. Deploy on the I/O server nodes using the customized deploy script:
      
    ./gssdeploy -d 
    
  5. After about five minutes, run the following command:
      
    nodestat gss_ppc64 
    
    The output displays the OS image name or the packages being installed. For example:
    PPC64LE installations:
    node: rhels7.6-ppc64le-install-gss
    node: rhels7.6-ppc64le-install-gss
    PPC64BE installations:
    node: rhels7.6-ppc64-install-gss
    node: rhels7.6-ppc64-install-gss
    After about 30 minutes, the following output displays:
    node: sshd
    node: sshd

    The installation is complete when nodestat displays sshd for all I/O server nodes. Here gss_ppc64 is the xCAT node group containing I/O server nodes. To follow the progress of a node installation, you can tail the console log by using the following command:

    tailf /var/log/consoles/NodeName

    where NodeName is the node name.

    Note: Make sure that the xCAT post-installation script has completed before rebooting the nodes. You can check for xCAT post-install processes still running on the I/O server nodes as follows:
    xdsh gss_ppc64 "ps -eaf | grep -v grep | grep xcatpost" 
    If there are any processes still running, wait for them to complete.
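    For example, a minimal polling sketch that waits until the xCAT post-install processing has finished on all I/O server nodes:
    while xdsh gss_ppc64 "ps -eaf | grep -v grep | grep xcatpost" | grep -q xcatpost
    do
      sleep 60    # poll every 60 seconds
    done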
  6. At the end of the deployment, wait for approximately five minutes and then reboot the I/O server nodes:
    xdsh gss_ppc64 systemctl reboot
  7. After the nodes are rebooted, verify the installation by running gssinstallcheck:
    gssinstallcheck -G ems1,gss_ppc64

    Check for any errors with the following:

    1. Installed packages
    2. Linux kernel release
    3. OFED level
    4. IPR SAS FW
    5. IPR SAS queue depth
    6. System firmware
    7. System profile setting
    8. Host adapter driver
    Ignore other errors that may be flagged by the gssinstallcheck script. They will go away after the remaining installation steps are completed.

Check the system hardware

After the I/O server nodes have been installed successfully, power on the storage enclosures and then wait at least 10 minutes from power-on for discovery to complete before moving on to the next step. Here is the list of key log files that should be reviewed for possible problem resolution during deployment:
  • By default, the /var/log/messages log from all I/O server nodes is directed to the messages log on the EMS node.
  • The gssdeploy log is located at /var/log/gss.
  • The xCAT log is located at /var/log/xcat.
  • Console outputs from the I/O server nodes during deployment are located at /var/log/consoles.
  1. Update the /etc/hosts file with high-speed hostname entries in the management server node and copy the modified /etc/hosts file to the I/O server nodes of the cluster as follows:
    xdcp gss_ppc64 /etc/hosts /etc/hosts
  2. Run gssstoragequickcheck:
      
    gssstoragequickcheck -G gss_ppc64 
    
  3. Run gssfindmissingdisks:
      
    gssfindmissingdisks -G gss_ppc64
    
    If gssfindmissingdisks displays an error, run mmgetpdisktopology and pipe its output to topsummary on each I/O server node to obtain more information about the error:
    mmgetpdisktopology > /var/tmp/<node>_top.out
    topsummary /var/tmp/<node>_top.out
  4. Run gsscheckdisks:
    
    GSSENV=INSTALL gsscheckdisks -G gss_ppc64 --encl all --iotest a --write-enable
    
    Attention: When run with --iotest w (write) or --iotest a (all), gsscheckdisks will perform write I/O to the disks attached through the JBOD. This will overwrite the disks and will result in the loss of any configuration or user data stored on the attached disks. gsscheckdisks should be run only during the installation of a building block to validate that read and write operations can be performed to the attached drives without any error. The GSSENV environment variable must be set to INSTALL to indicate that gsscheckdisks is being run during installation.
  5. Check for any hardware serviceable events and address them as needed. To view the serviceable events, issue the following command:
    
    gssinstallcheck -N ems1,gss_ppc64 --srv-events
    If any serviceable events are displayed, you can obtain more information by using the --platform-events EVENTLIST flag.
    Note: During the initial deployment of the nodes on the PPC64BE platform, SRC BA15D001 might be logged as a serviceable event by Partition Firmware. This is normal and should be cleared after the initial deployment. For more information, see Known issues.
Note: Configure the node to connect to the Red Hat network and apply the latest security patches, if needed.

Set up the high-speed network

Customer networking requirements are site-specific. The use of bonding to increase fault tolerance and performance is advised, but guidelines for doing so are not provided in this document. Consult with your local network administrator before proceeding further.

  • To set up a bond over InfiniBand, run the following command:
    gssgennetworks -G ems,gss_ppc64 --create-bond --ipoib --suffix=-hs --mtu 4092
    In this example, MTU is set to 4092. Consult your network administrator for the proper MTU setting.
  • To set up a bond over Ethernet, run the following command:
    gssgennetworks -N ems1,gss_ppc64 --suffix=-hs --create-bond
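  • After creating a bond, you can verify its state on each node. This sketch assumes the bond interface is named bond0; the actual name in your environment may differ:
    cat /proc/net/bonding/bond0    # check the bonding mode and slave link status
    ip addr show bond0             # confirm that the high-speed IP address is assigned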

Create the cluster, recovery groups, and file system

  1. Create the GPFS cluster:
      
    gssgencluster -C test01 -G gss_ppc64 --suffix=-hs --accept-license  
    
    In this example, test01 is used as the cluster name and -hs is used as the suffix of the host name.
  2. Verify healthy network connectivity:
    xdsh gss_ppc64 /usr/lpp/mmfs/bin/mmnetverify
  3. Create the recovery groups:
      
    gssgenclusterrgs -G gss_ppc64 --suffix=-hs
    
  4. Create the vdisks, NSDs, and file system:
    
    gssgenvdisks --create-vdisk --create-nsds --create-filesystem --contact-node gssio1
    
    Note: By default, gssgenvdisks creates data vdisks with an 8+2p RAID code and an 8 MB block size, and metadata vdisks with 3WayReplication and a 1 MB block size. These default values can be changed to values suitable for the customer environment.
  5. Add the management server node to the cluster:
      
    gssaddnode -N ems1 --cluster-node gssio1 --suffix=-hs --accept-license --no-fw-update
    
    In this example, the management server hostname is ems1 with a suffix of -hs (ems1-hs) in the high-speed network. The --no-fw-update option is used because the management server node does not contain a SAS adapter or attached drives.

Check the installed software and system health

  1. Run gssinstallcheck on the management server:
    
    gssinstallcheck -N ems1 
    
  2. Run gssinstallcheck on the I/O server nodes:
    
    gssinstallcheck -G gss_ppc64 
    
  3. Shut down GPFS on all nodes and reboot all nodes.
    1. Shut down GPFS on all nodes:
      mmshutdown -a
    2. Reboot all server nodes:
      xdsh gss_ppc64 "systemctl reboot"
    3. Reboot the management server node:
      systemctl reboot
  4. After reboots, run the following command (Not applicable for PPC64LE):
    gssinstallcheck -G gss_ppc64 --phy-mapping

    Ensure that the phy mapping check is OK.

  5. Restart GPFS on all nodes and wait for all nodes to become active:
    mmstartup -a
  6. Mount the filesystem and perform a stress test. For example, run:
      
    mmmount gpfs0 -a
    gssstress /gpfs/gpfs0 gssio1 gssio2
    
    In this example, gssstress is invoked on the management server node. It runs on I/O server nodes gssio1 and gssio2 with /gpfs/gpfs0 as the target path. By default, gssstress runs for 20 iterations, which can be adjusted using the -i option (type gssstress and press Enter to see the available options). During the I/O stress test, check for network errors by running the following from another console:
    
    gssinstallcheck -N ems1,gss_ppc64 --net-errors
  7. Perform a health check. Run:
      
    gnrhealthcheck
    /usr/lpp/mmfs/bin/mmhealth node show -N all --verbose
    Address any issues that are identified.
  8. Check for any open hardware serviceable events and address them as needed. The serviceable events can be viewed as follows:
    gssinstallcheck -N ems1,gss_ppc64 --srv-events
    If any serviceable events are displayed, you can obtain more information by using the --platform-events EVENTLIST flag.
    Note: During the initial deployment of the nodes, SRC BA15D001 might be logged as a serviceable event by Partition Firmware. This is normal and should be cleared after the initial deployment. For more information, see Known issues.
  9. Verify that NTP is set up and enabled.
    1. On the management server node, verify that /etc/ntp.conf is pointing to the management server node itself over the management interface.
    2. Restart the NTP daemon on each node:
      xdsh ems1,gss_ppc64 "systemctl restart ntpd"
    3. Verify that NTP is set up correctly by running the following checks:
      • Verify that offset is 0.
        xdsh ems1,gss_ppc64 "ntpq -p"
      • Verify that NTP is enabled and synchronized.
        xdsh ems1,gss_ppc64 "timedatectl status" | grep -i NTP
      • Verify that the timezone is set correctly on each node.
        xdsh ems1,gss_ppc64 "timedatectl status"  | grep -i zone

Install the ESS GUI

Important: Complete all of the following steps carefully including the steps for configuring mmperfmon and restricting certain sensors to the management server node (EMS) only.
  1. Generate the performance collector configuration on the management server node by running the following command. The management server node must be part of the ESS cluster, and the node name must be the node name used in the cluster (for example, ems1-hs).
      
    mmperfmon config generate --collectors ems1-hs
    
  2. Set up the nodes in the ems nodeclass and gss_ppc64 nodeclass for performance monitoring by running the following command.
    mmchnode --perfmon -N ems,gss_ppc64
  3. Start the performance monitoring sensors by running the following command.
    xdsh ems1,gss_ppc64 "systemctl start pmsensors"
  4. Capacity and fileset quota monitoring is not enabled in the GUI by default. You must update these values correctly and restrict collection to the management server node only.
    1. To modify the GPFS Disk Capacity collection interval, run the following command:
      mmperfmon config update GPFSDiskCap.restrict=EMSNodeName GPFSDiskCap.period=PeriodInSeconds

      The recommended period is 86400 so that the collection is done once per day.
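      For example, assuming the management server's cluster node name is ems1-hs and using the recommended period:
      mmperfmon config update GPFSDiskCap.restrict=ems1-hs GPFSDiskCap.period=86400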

    2. To restrict GPFS Fileset Quota to run on the management server node only, run the following command:
      mmperfmon config update GPFSFilesetQuota.period=600 GPFSFilesetQuota.restrict=EMSNodeName

      Here the EMSNodeName must be the name shown in the mmlscluster output.

      Note: To enable quota monitoring, quota checking must be enabled on the file system. Refer to the mmchfs -Q and mmcheckquota commands in IBM Spectrum Scale: Command and Programming Reference.
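      For example, quota checking could be enabled and reconciled on the gpfs0 file system created earlier (a sketch; adjust the file system name for your environment):
      /usr/lpp/mmfs/bin/mmchfs gpfs0 -Q yes    # enable quota enforcement
      /usr/lpp/mmfs/bin/mmcheckquota gpfs0     # reconcile quota usage counts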
  5. Verify that the values are set correctly in the performance monitoring configuration by running the mmperfmon config show command on the management server node. Make sure that GPFSDiskCap.period is properly set, and GPFSFilesetQuota and GPFSDiskCap are both restricted to the management server node only.
    Note: If you are moving from manual configuration to auto configuration then all sensors are set to default. Make the necessary changes using the mmperfmon command to customize your environment accordingly. For information on how to configure various sensors using mmperfmon, see Manually installing IBM Spectrum Scale GUI.
  6. Start the performance collector on the management server node:
    systemctl start pmcollector
  7. Enable and start the GUI service:
    systemctl enable gpfsgui.service
    systemctl start gpfsgui
  8. To launch the ESS GUI in a browser, go to https://EssGuiNode, where EssGuiNode is the host name or IP address of the management server node for GUI access. To log in, type admin in the User Name field and your password in the Password field on the login page. The default password for admin is admin001. Walk through each panel and complete the GUI Setup Wizard.

This completes the installation task of the ESS system. After completing the installation, apply security updates available from Red Hat.

For information on applying optimized configuration settings to a set of client nodes or a node class, see Adding IBM Spectrum Scale nodes to an ESS cluster.