Hadoop cluster planning
In a Hadoop cluster that runs the HDFS protocol, a node can be a DFS client, a NameNode, a DataNode, or any combination of these roles. All of the nodes in the Hadoop cluster might be part of an IBM Storage® Scale cluster, or only some of the nodes might belong to the IBM Storage Scale cluster.
NameNode
The NameNode is specified by setting the fs.defaultFS parameter to the hostname of the NameNode in the core-site.xml file.
DataNode
You can specify multiple DataNodes in a cluster. The DataNodes must be a part of an IBM Storage Scale cluster. The DataNodes are specified by listing their hostnames in the workers configuration file.
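As a sketch, the workers configuration file simply lists one DataNode hostname per line; the hostnames below are hypothetical placeholders:

```
dn1.example.com
dn2.example.com
dn3.example.com
```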
DFS client
The DFS client can be a part of an IBM Storage Scale cluster. When the DFS client is a part of an IBM Storage Scale cluster, it can read data from IBM Storage Scale through an RPC or use the short-circuit mode. Otherwise, the DFS client can access data from IBM Storage Scale only through an RPC. You can specify the NameNode address in the DFS client configuration so that the DFS client can communicate with the appropriate NameNode service.
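For illustration, the NameNode address is typically given by the fs.defaultFS property in core-site.xml; the hostname nn1.example.com and port 8020 below are placeholder values, not values prescribed by this document:

```xml
<configuration>
  <!-- Address of the NameNode service that DFS clients connect to -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn1.example.com:8020</value>
  </property>
</configuration>
```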
In a production cluster, it is recommended to configure NameNode high availability (HA): one active NameNode and one standby NameNode. The active NameNode and the standby NameNode must be located on two different nodes. For a small test or proof-of-concept (POC) cluster, such as a 2-node or 3-node cluster, you can configure one node as both NameNode and DataNode. However, in a production cluster, it is not recommended to configure the same node as both NameNode and DataNode.
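A minimal sketch of a NameNode HA layout, assuming the standard Hadoop HA properties in hdfs-site.xml; the nameservice name "mycluster" and the hostnames are hypothetical placeholders:

```xml
<configuration>
  <!-- Logical name for the HA nameservice (placeholder) -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- Two NameNode IDs: one becomes active, the other standby -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <!-- Each NameNode must run on a different node -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2.example.com:8020</value>
  </property>
</configuration>
```

With HA enabled, fs.defaultFS in core-site.xml would refer to the nameservice (for example, hdfs://mycluster) rather than to a single NameNode host.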
The purpose of cluster planning is to define the node roles: Hadoop node, HDFS transparency node, and GPFS™ node.