Multiple HDFS Transparency clusters on the same set of physical nodes

If you have a limited number of physical nodes, or many Hadoop clusters running inside containers, you might have to set up multiple HDFS Transparency clusters on the same set of physical nodes.

For example, configure HDFS Transparency cluster1 on physical nodes 1, 2, and 3, and configure HDFS Transparency cluster2 on the same physical nodes 1, 2, and 3. This is supported starting with HDFS Transparency version 2.7.3-1.

Running multiple HDFS Transparency clusters on the same set of physical nodes requires configuration changes, especially the network port numbers assigned to the different HDFS Transparency clusters. This section explains the steps to configure two HDFS Transparency clusters.

The following example configures two HDFS Transparency clusters on the same physical nodes gpfstest1/2/6/7/9/10/11/12, with gpfstest2 as the NameNode. In this environment, Kerberos is not enabled and the configuration for the first HDFS Transparency cluster comes from the HortonWorks HDP cluster.
  1. Configure /usr/lpp/mmfs/hadoop/etc/hadoop to bring the first HDFS Transparency cluster up. The nodes gpfstest1/2/6/7/9/10/11/12 are configured as HDFS Transparency cluster1:
    [root@gpfstest2 ~]# mmhadoopctl connector getstate
    gpfstest2.cn.ibm.com: namenode running as process 6699.
    gpfstest2.cn.ibm.com: datanode running as process 8425.
    gpfstest9.cn.ibm.com: datanode running as process 13103.
    gpfstest7.cn.ibm.com: datanode running as process 9980.
    gpfstest10.cn.ibm.com: datanode running as process 6420.
    gpfstest11.cn.ibm.com: datanode running as process 83753.
    gpfstest1.cn.ibm.com: datanode running as process 22498.
    gpfstest12.cn.ibm.com: datanode running as process 52927.
    gpfstest6.cn.ibm.com: datanode running as process 48787.
    
    Note: This setup is configured by HortonWorks HDP through Ambari, and gpfstest2 is configured as the NameNode.
  2. Select any one of these nodes and change the configurations:

    In this example, the gpfstest1 node is selected.

    Steps 3 to 10 are performed on the gpfstest1 node, which is the node selected in step 2.

  3. Copy the following configuration files from /usr/lpp/mmfs/hadoop/etc/hadoop to /usr/lpp/mmfs/hadoop/etc/hadoop2 (see the copy command sketch after the file listing).
    Note: /usr/lpp/mmfs/hadoop/etc/hadoop is the configuration location for the first HDFS Transparency cluster and /usr/lpp/mmfs/hadoop/etc/hadoop2 is the configuration location for the second HDFS Transparency cluster.
    -rw-r--r-- 1 root root  2187 Oct 28 00:00 core-site.xml
    -rw------- 1 root root   393 Oct 28 00:00 gpfs-site.xml
    -rw------- 1 root root  6520 Oct 28 00:00 hadoop-env.sh
    -rw------- 1 root root  2295 Oct 28 00:00 hadoop-metrics2.properties
    -rw------- 1 root root  2490 Oct 28 00:00 hadoop-metrics.properties
    -rw------- 1 root root  1308 Oct 28 00:00 hadoop-policy.xml
    -rw------- 1 root root  6742 Oct 28 00:00 hdfs-site.xml
    -rw------- 1 root root 10449 Oct 28 00:00 log4j.properties
    -rw------- 1 root root   172 Oct 28 00:00 slaves
    -rw------- 1 root root   884 Oct 28 00:00 ssl-client.xml
    -rw------- 1 root root  1000 Oct 28 00:00 ssl-server.xml
    -rw-r--r-- 1 root root 17431 Oct 28 00:00 yarn-site.xml
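
    A minimal sketch of the copy step, assuming the hadoop2 directory does not yet exist on gpfstest1:
    # Copy the first cluster's configuration as the starting point for the second cluster
    cp -r /usr/lpp/mmfs/hadoop/etc/hadoop /usr/lpp/mmfs/hadoop/etc/hadoop2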
    
  4. Change the fs.defaultFS value in the core-site.xml file:

    In /usr/lpp/mmfs/hadoop/etc/hadoop/core-site.xml: fs.defaultFS=hdfs://gpfstest2.cn.ibm.com:8020

    In /usr/lpp/mmfs/hadoop/etc/hadoop2/core-site.xml: fs.defaultFS=hdfs://gpfstest2.cn.ibm.com:8021
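
    In XML form, the property in /usr/lpp/mmfs/hadoop/etc/hadoop2/core-site.xml would look like the following sketch (the host name and port match this example environment):
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://gpfstest2.cn.ibm.com:8021</value>
    </property>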

  5. Change values in the hdfs-site.xml file:
    In /usr/lpp/mmfs/hadoop/etc/hadoop/hdfs-site.xml:
    dfs.datanode.address=0.0.0.0:50010
    dfs.datanode.http.address=0.0.0.0:50075
    dfs.datanode.https.address=0.0.0.0:50475
    dfs.datanode.ipc.address=0.0.0.0:8010
    dfs.https.port=50470
    dfs.journalnode.http-address=0.0.0.0:8480
    dfs.journalnode.https-address=0.0.0.0:8481
    dfs.namenode.http-address=gpfstest2.cn.ibm.com:50070
    dfs.namenode.https-address=gpfstest2.cn.ibm.com:50470
    dfs.namenode.rpc-address=gpfstest2.cn.ibm.com:8020
    dfs.namenode.secondary.http-address=gpfstest10.cn.ibm.com:50090
    
    In /usr/lpp/mmfs/hadoop/etc/hadoop2/hdfs-site.xml:
    dfs.datanode.address=0.0.0.0:50011
    dfs.datanode.http.address=0.0.0.0:50076
    dfs.datanode.https.address=0.0.0.0:50476
    dfs.datanode.ipc.address=0.0.0.0:8011
    dfs.https.port=50471
    dfs.journalnode.http-address=0.0.0.0:8482
    dfs.journalnode.https-address=0.0.0.0:8483
    dfs.namenode.http-address=gpfstest2.cn.ibm.com:50071
    dfs.namenode.https-address=gpfstest2.cn.ibm.com:50471
    dfs.namenode.rpc-address=gpfstest2.cn.ibm.com:8021    <== match the port number in step 4
    dfs.namenode.secondary.http-address=gpfstest10.cn.ibm.com:50091
    
    Note: The network port numbers for the different HDFS Transparency clusters must be different. Otherwise, HDFS Transparency reports network port conflicts when it starts.
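    Optionally, before starting the second cluster, you can confirm on each node that the ports chosen for it are free. This is only a sanity check, not part of the procedure, and it assumes the ss utility is available:
    # Any output means one of the second cluster's ports is already in use on this node
    ss -lnt | grep -E ':(8011|8021|50011|50071|50076|50091|50471|50476) '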
  6. Change values in the hadoop-env.sh file.
    In /usr/lpp/mmfs/hadoop/etc/hadoop/hadoop-env.sh:
    HADOOP_PID_DIR=/var/run/hadoop/$USER
    HADOOP_LOG_DIR=/var/log/hadoop/$USER
    
    In /usr/lpp/mmfs/hadoop/etc/hadoop2/hadoop-env.sh:
    HADOOP_PID_DIR=/var/run/hadoop/hdfstransparency2   
    HADOOP_LOG_DIR=/var/log/hadoop/hdfstransparency2  
    
    Change $USER in HADOOP_JOBTRACKER_OPTS, SHARED_HADOOP_NAMENODE_OPTS, and HADOOP_DATANODE_OPTS
    to the value "hdfstransparency2" (a sed sketch follows the note below).
    
    Note: HDFS Transparency can only be started as the root user, so the first HDFS Transparency cluster resolves $USER to root. For the second HDFS Transparency cluster, change $USER to a different string value, such as hdfstransparency2, so that the second cluster writes its PID files and logs to its own directories.
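    A minimal sketch of one way to make this substitution in the second cluster's copy only, assuming $USER appears as a literal string in hadoop-env.sh (review the file afterwards):
    # Replace the literal $USER token with hdfstransparency2 in the second cluster's hadoop-env.sh
    sed -i 's/\$USER/hdfstransparency2/g' /usr/lpp/mmfs/hadoop/etc/hadoop2/hadoop-env.sh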
  7. Change hadoop-metrics2.properties:

    In /usr/lpp/mmfs/hadoop/etc/hadoop/hadoop-metrics2.properties:

    namenode.sink.timeline.metric.rpc.client.port=8020

    In /usr/lpp/mmfs/hadoop/etc/hadoop2/hadoop-metrics2.properties:

    namenode.sink.timeline.metric.rpc.client.port=8021 <== match the NameNode port number in step 4

  8. Update /usr/lpp/mmfs/hadoop/etc/hadoop2/gpfs-site.xml, especially the gpfs.data.dir field (see the sketch after this list).
    • Configure different gpfs.data.dir values for the different HDFS Transparency clusters.
    • Configure different gpfs.mnt.dir values if you have multiple file systems.
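    A sketch of the relevant properties in the second cluster's gpfs-site.xml; the values shown are hypothetical placeholders, so use values that match your own file system layout:
    <property>
      <name>gpfs.mnt.dir</name>
      <value>/gpfs/fs2</value>        <!-- hypothetical mount point for the second file system -->
    </property>
    <property>
      <name>gpfs.data.dir</name>
      <value>cluster2-data</value>    <!-- hypothetical data directory; must differ from cluster1 -->
    </property>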
  9. Sync the second HDFS Transparency cluster configuration from node gpfstest1 (selected in step 2):
    export HADOOP_GPFS_CONF_DIR=/usr/lpp/mmfs/hadoop/etc/hadoop2
    mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop2
  10. Start the second HDFS Transparency cluster:
    export HADOOP_GPFS_CONF_DIR=/usr/lpp/mmfs/hadoop/etc/hadoop2
    export HADOOP_CONF_DIR=/usr/lpp/mmfs/hadoop/etc/hadoop2/
    mmhadoopctl connector start
    mmhadoopctl connector getstate
    [root@gpfstest1 hadoop2]# mmhadoopctl connector getstate
    gpfstest2.cn.ibm.com: namenode running as process 18234.
    gpfstest10.cn.ibm.com: datanode running as process 29104.
    gpfstest11.cn.ibm.com: datanode running as process 72171.
    gpfstest9.cn.ibm.com: datanode running as process 94872.
    gpfstest7.cn.ibm.com: datanode running as process 28627.
    gpfstest2.cn.ibm.com: datanode running as process 25777.
    gpfstest6.cn.ibm.com: datanode running as process 30121.
    gpfstest12.cn.ibm.com: datanode running as process 36116.
    gpfstest1.cn.ibm.com: datanode running as process 21559.
    
  11. Check the second HDFS Transparency cluster:
    On any node, run the following commands:
    hdfs --config /usr/lpp/mmfs/hadoop/etc/hadoop2 dfs -put /etc/passwd /
    hdfs --config /usr/lpp/mmfs/hadoop/etc/hadoop2 dfs -ls /
  12. Configure hdfs://gpfstest2.cn.ibm.com:8020 and hdfs://gpfstest2.cn.ibm.com:8021 as the default file systems for the different Hadoop clusters running inside containers.
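    For example, a Hadoop client in the first containerized cluster would address the first HDFS Transparency cluster on port 8020, and a client in the second containerized cluster would address the second one on port 8021. A quick way to verify, assuming a Hadoop client is available in each container:
    # From the first containerized Hadoop cluster
    hdfs dfs -ls hdfs://gpfstest2.cn.ibm.com:8020/
    # From the second containerized Hadoop cluster
    hdfs dfs -ls hdfs://gpfstest2.cn.ibm.com:8021/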