Hadoop cluster planning
In a Hadoop cluster that runs the HDFS protocol, a node can be a DFS client, a NameNode, a DataNode, or any combination of these roles. All of the nodes in the Hadoop cluster might be part of an IBM Storage® Scale cluster, or only some of the nodes might belong to the IBM Storage Scale cluster.
NameNode
The NameNode is specified by setting the fs.defaultFS parameter to the hostname of the NameNode in the core-site.xml file.
DataNode
You can specify multiple DataNodes in a cluster. The DataNodes must be a part of an IBM Storage Scale cluster. The DataNodes are specified by listing their hostnames in the workers configuration file.
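As a sketch, the workers configuration file simply lists one DataNode hostname per line; the hostnames below are hypothetical placeholders:

```
dn1.example.com
dn2.example.com
dn3.example.com
```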
DFS client
The DFS client can be a part of an IBM Storage Scale cluster. When the DFS client is a part of an IBM Storage Scale cluster, it can read data from IBM Storage Scale through an RPC or use the short-circuit mode. Otherwise, the DFS client can access data from IBM Storage Scale only through an RPC. You can specify the NameNode address in the DFS client configuration so that the DFS client can communicate with the appropriate NameNode service.
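For illustration, the NameNode address is typically given by the fs.defaultFS property in core-site.xml; the hostname nn1.example.com and port 8020 below are placeholder values, not values prescribed by this document:

```xml
<configuration>
  <!-- Address of the NameNode service that DFS clients connect to -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn1.example.com:8020</value>
  </property>
</configuration>
```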
In a production cluster, it is recommended to configure NameNode high availability (HA): one active NameNode and one standby NameNode. The active NameNode and the standby NameNode must be located on two different nodes. For a small test or proof-of-concept (POC) cluster, such as a 2-node or 3-node cluster, you can configure one node as both NameNode and DataNode. However, in a production cluster, it is not recommended to configure the same node as both NameNode and DataNode.
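A minimal sketch of a NameNode HA layout, assuming the standard Hadoop HA properties in hdfs-site.xml; the nameservice name "mycluster" and the hostnames are hypothetical placeholders:

```xml
<configuration>
  <!-- Logical name for the HA nameservice (placeholder) -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- Two NameNode IDs: one becomes active, the other standby -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <!-- Each NameNode must run on a different node -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2.example.com:8020</value>
  </property>
</configuration>
```

With HA enabled, fs.defaultFS in core-site.xml would refer to the nameservice (for example, hdfs://mycluster) rather than to a single NameNode host.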
The purpose of cluster planning is to define the node roles: Hadoop node, HDFS transparency node, and GPFS™ node.