Deploy FPO

In File Placement Optimizer (FPO) mode, data blocks are stored in chunks in IBM Spectrum® Scale, and replicated to protect against disk and node failure. DFS clients run on the storage node so that they can leverage the data locality for executing the tasks quickly.

For the local storage mode configuration, Short-circuit read (SSR) is recommended to improve the access efficiency.

Note: Ambari only supports creating an IBM Spectrum Scale FPO file system.

To create a new FPO cluster, follow the Installation procedure with the following deviations:

Skip ESS setup.
Follow Create HDP cluster.
Skip Establish an IBM Spectrum Scale cluster on the Hadoop cluster.
Skip Configure remote mount access.
Follow Install Mpack package.
Follow Deploy the IBM Spectrum Scale service with the following deviations:

Under Assign Masters:

All YARN NodeManager nodes should be FPO nodes, and each node should have the same number of disks specified in the NSD stanza.

Under Customize Services:

Configuration fields on both standard and advanced tabs are populated with values taken from the Hadoop performance tuning guide.
Verify that gpfs.storage.type is set to local.
If you do not plan to have a sub-directory under the IBM Spectrum Scale mount point, do not click the gpfs.data.dir field, so that the field remains empty.
Ensure that yarn.nodemanager.local-dirs and yarn.nodemanager.local-logs are initially set to a dummy local directory. When a new FPO file system is deployed, partitioned local directories dynamically replace the values in yarn.nodemanager.local-dirs after the file system is created. Manually check that the yarn.nodemanager.local-logs value is set correctly. For more information, see Disk-partitioning algorithm.
Create an NSD file, gpfs_nsd, and place it into the /var/lib/ambari-server/resources directory. Ensure that the permission on the file is at least 444. Add the NSD filename, gpfs_nsd, to the GPFS File system > GPFS NSD stanza file field in the Standard Config tab.
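The permission requirement can be checked before the file name is entered in Ambari. The following Python sketch is an illustration only: the helper name is_world_readable is hypothetical, and it interprets "at least 444" as read permission for owner, group, and others.

```python
import os
import stat

# Read bits for owner, group, and others (octal 444).
READ_ALL = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH


def is_world_readable(path):
    """Return True if the file at `path` has at least mode 444.

    Hypothetical helper, not part of Ambari or IBM Spectrum Scale.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Every read bit of 444 must be present; extra bits (such as 644) are fine.
    return (mode & READ_ALL) == READ_ALL
```

For example, is_world_readable("/var/lib/ambari-server/resources/gpfs_nsd") could be run on the Ambari server host after the file is copied into place.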

Two types of NSD files are supported for file system auto creation. One is the preferred simple format; the other is the standard IBM Spectrum Scale NSD file format, intended for IBM Spectrum Scale experts.

Simple NSD

If a simple NSD file is used, Ambari selects the proper metadata and data ratio for you. If possible, Ambari creates partitions on some disks for the Hadoop intermediate data, which improves the Hadoop performance. Simple NSD does not support existing partitioned disks in the cluster.

Disk partitioning under Ambari happens only if the following conditions are met:
1. All GPFS nodes are specified in the NSD stanza file.
2. Each of those nodes has the same number of disks specified in the stanza file.
3. The number of host entries (NSD servers) in the stanza file equals the number of NodeManagers. This requires every host running a GPFS node to be set up as a NodeManager too. If you do not want Hadoop jobs to run on a specific GPFS node host (for example, the Ambari server host), you can remove the NodeManager component from that host after deploying the IBM Spectrum Scale service.
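The three conditions can be expressed as a small pre-check. The following Python sketch is not Ambari's implementation; the function name and its inputs are hypothetical. It assumes that the NSD stanza file has already been parsed into a mapping from host name to the list of disks given for that host.

```python
def partitioning_conditions_met(stanza_disks, gpfs_nodes, nodemanager_hosts):
    """Check the three disk-partitioning conditions listed above.

    stanza_disks:      dict of host name -> list of disks in the NSD stanza file
    gpfs_nodes:        iterable of all GPFS node host names
    nodemanager_hosts: iterable of hosts that run a NodeManager
    (All three inputs are assumptions made for this illustration.)
    """
    stanza_hosts = set(stanza_disks)

    # Condition 1: every GPFS node appears in the stanza file.
    all_nodes_listed = set(gpfs_nodes) <= stanza_hosts

    # Condition 2: every host lists the same number of disks.
    disk_counts = {len(disks) for disks in stanza_disks.values()}
    same_disk_count = len(disk_counts) == 1

    # Condition 3: as many stanza hosts as NodeManagers.
    counts_match = len(stanza_hosts) == len(set(nodemanager_hosts))

    return all_nodes_listed and same_disk_count and counts_match
```

A check like this can help explain why partitioning did not take place, for example when one host lists fewer disks than the others.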

For more details on disk partitioning, see Disk-partitioning algorithm.

Standard NSD

If the cluster has a partitioned file system, only a Standard NSD file can be used.

If a standard IBM Spectrum Scale NSD file is used, administrators are responsible for the storage space arrangement.

Apply the partition algorithm.
Apply the algorithm for system pool and usage.
Apply the failure group selection rule.
Failure groups are created based on the rack location of the node.
Define the Rack mapping file.
Nodes can be defined to belong to racks.
Partition the function matrix.
One disk is divided into two partitions so that one partition can be used as an ext3 or ext4 file system to store the map or reduce intermediate data, while the other partition is used as a data disk in the IBM Storage Scale file system. Only data disks can be partitioned; metadata disks cannot.
A policy file is required when a standard IBM Storage Scale NSD file is used.
A policy file, gpfs_fs.pol, must be created and placed into the /var/lib/ambari-server/resources directory. Add the policy filename, gpfs_fs.pol, into the GPFS policy file field in the Standard Config tab.
For more information on creating policy files, see Policy File.

For more information on each of the set-up points for standard NSD file, see IBM Storage Scale-FPO deployment.
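The idea that failure groups follow the rack location of each node can be illustrated with a short Python sketch. It is not the product's selection rule: the function name is hypothetical, the input is assumed to be a node-to-rack mapping already read from a rack mapping file, and the plain integer group numbers are an assumption, so the failure group values used in an actual deployment may have a different format.

```python
def failure_groups_by_rack(node_to_rack):
    """Assign one failure group number per rack (illustrative only).

    node_to_rack: dict of node name -> rack name, assumed to come from
    a rack mapping file that has already been parsed.
    """
    # Number the racks in sorted order so the result is deterministic.
    racks = sorted(set(node_to_rack.values()))
    rack_to_group = {rack: index + 1 for index, rack in enumerate(racks)}

    # Nodes in the same rack share a failure group.
    return {node: rack_to_group[rack] for node, rack in node_to_rack.items()}
```

Nodes that share a rack end up in the same failure group, so replicas placed in different failure groups land on different racks.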

Note: Deploying HDP over an existing IBM Spectrum Scale FPO cluster through Ambari requires that you either store YARN’s intermediate data in the IBM Spectrum Scale file system or use idle disks formatted as a local file system. Using idle disks formatted as a local file system is recommended. For more information, see Deploy HDP or IBM Spectrum Scale service on pre-existing IBM Spectrum Scale file system.