Configuring an instance group to work with your CDH cluster

Once you have set up your CDH and IBM® Spectrum Conductor environment, you are ready to configure an existing instance group to work with the CDH cluster.

Before you begin

Ensure that you have completed all the prerequisites for integrating CDH with IBM Spectrum Conductor as described in Open source CDH (Cloudera Distributed Hadoop) integration.

Procedure

  1. Copy the HDFS (Hadoop Distributed File System) and Apache Hive configuration files to your Spark home directory:
    1. On your CDH cluster server host, go to the /var/run/cloudera-scm-agent/process/ directory, and locate the current version of these files:
      • hdfs-site.xml
      • core-site.xml
      • hive-site.xml
    2. Copy the files to the /conf directory under the Spark home directory on each of your IBM Spectrum Conductor hosts. For example, if the Spark home directory of your instance group is /opt/SIG243cdh/spark-2.4.3-hadoop-2.7/, copy the files to /opt/SIG243cdh/spark-2.4.3-hadoop-2.7/conf/.
  2. Add a Kerberos TGT (Ticket Granting Ticket) secured HDFS data connector to your instance group:
    1. From the cluster management console, click Workload > Instance Groups, select the instance group to update, and then click Configure.
    2. Click Manage > Configure > Data Connectors > Add.
    3. Provide the new data connector information, ensuring that you select Kerberos TGT secured HDFS from the Type list, and then click Save:
      Cluster management console flow of the New Data Connector page showing Kerberos TGT secured HDFS from the Type list
  3. Add a new environment variable and parameter to the instance group:
    1. Click Basic Settings, select the Spark version that the instance group must use, then click Configuration to open the configuration dialog.
    2. Click Add an Environment Variable (under the Driver Environment section) and set spark.ego.driverEnv.KRB5CCNAME to the path of your Kerberos credential cache (krb5cc) file.
    3. Click Add a Parameter (under the Additional Parameters section), set the spark.sql.catalogImplementation value to hive, and then click Save:
      Cluster management console flow of the Configure Spark page showing Add a Parameter button under the Additional Parameters section, and spark.sql.catalogImplementation value set to hive
  4. Use open source AdoptOpenJDK JRE instead of the default IBM Java JRE:
    To avoid potential exceptions in your Hive code, use the AdoptOpenJDK JRE. Cloudera Manager is compatible with OpenJDK, whereas using the IBM Java JRE can cause null pointer exceptions, such as the following:
    Message: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
    StackTrace: at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
    at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
    …
    …
    Caused by: java.lang.NullPointerException
    at org.apache.hadoop.util.StringUtils.stringifyException(StringUtils.java:91)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:466)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
    To switch to AdoptOpenJDK JRE, follow these steps:
    1. Search for the Java home environment variable by typing JAVA_HOME in the search field.
    2. Set the JAVA_HOME value (under the Environment Variable section) to the path for the AdoptOpenJDK JRE and click Save. In this example, it is set to /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.232.b09-0.el7_7.x86_64/jre/:
      Cluster management console flow of the Configure Spark page showing JAVA_HOME in the search field, and the JAVA_HOME value under the Environment Variable section set to the path for your AdoptOpenJDK JRE
  5. Click Modify Instance Group to redeploy the instance group with your configuration changes.
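For reference, the console entries from steps 3 and 4 amount to the following values. The krb5cc path shown is only a placeholder for the conventional per-user cache location; the JAVA_HOME path is the example used in this procedure:

```
# Step 3: driver environment variable (cache path varies by user)
spark.ego.driverEnv.KRB5CCNAME=/tmp/krb5cc_<uid>
# Step 3: additional parameter that enables the Hive catalog
spark.sql.catalogImplementation=hive
# Step 4: environment variable pointing at the AdoptOpenJDK JRE
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.232.b09-0.el7_7.x86_64/jre/
```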
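The file-copy part of step 1 can also be scripted. The sketch below is one way to pick the most recently written copy of each configuration file, since the Cloudera agent keeps one numbered subdirectory per process run under /var/run/cloudera-scm-agent/process/. It assumes GNU find and the example paths from this procedure; PROC_DIR, SPARK_CONF, and latest_copy are illustrative names, not part of either product:

```shell
#!/bin/sh
# Sketch: stage the newest hdfs-site.xml, core-site.xml, and hive-site.xml
# from the Cloudera agent's process directory into the Spark conf directory.
# Both paths below are assumptions taken from this procedure's examples.
PROC_DIR=${PROC_DIR:-/var/run/cloudera-scm-agent/process}
SPARK_CONF=${SPARK_CONF:-/opt/SIG243cdh/spark-2.4.3-hadoop-2.7/conf}

# Newest copy of a file by modification time (the agent usually keeps
# several copies, one per numbered process subdirectory).
latest_copy() {
    find "$PROC_DIR" -name "$1" -type f -printf '%T@ %p\n' 2>/dev/null |
        sort -rn | head -n 1 | cut -d' ' -f2-
}

for f in hdfs-site.xml core-site.xml hive-site.xml; do
    src=$(latest_copy "$f")
    [ -n "$src" ] && cp "$src" "$SPARK_CONF/"
done
```

You would still need to distribute the staged files to the /conf directory on every IBM Spectrum Conductor host, for example with scp or your configuration management tooling.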

What to do next

Once you have configured your instance group to work with the CDH cluster, verify this integration.