If you set up HA through Apache ZooKeeper for your
YARN resource
manager, when an active resource manager fails, Apache ZooKeeper selects a
standby resource manager to become active.
Before you begin
To use Apache ZooKeeper for HA, ensure that Apache ZooKeeper cluster is installed and configured for your YARN resource manager.
Procedure
-
Modify the service configuration file ($EGO_ESRVDIR/esc/conf/services/EGOYARN.xml) on the resource manager host:
-
Add the names of the hosts to be used as resource managers when the active resource manager fails. Do this by modifying the
<ego:ResourceRequirement>
property.
Ensure that all resource managers run on management hosts (or in a dedicated resource group that consists of management hosts, not compute hosts).
Surround each resource manager host name with single quotation marks ('), and separate each host name with ||. The following is an example:
<ego:ResourceRequirement>select('rm_host_1'||'rm_host_2')</ego:ResourceRequirement>
-
Set the minimum and maximum number of resource manager instances to be used as resource managers. Do this by modifying the
<sc:MinInstances>
and <sc:MinInstances>
properties. Ensure that the maximum value is greater than the minimum value. The following is an example:
<sc:MinInstances>1</sc:MinInstances>
<sc:MaxInstances>2</sc:MaxInstances>
-
Modify the YARN
configuration file ($HADOOP_CONF_DIR/yarn-site.xml) on all
resource manager hosts and the node manager hosts. Do this by adding Apache ZooKeeper HA
properties to the file. For example, the following is based on open source
YARN configuration:
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>zk1:port1,zk2:port2,..., zkN:portN</value>
</property>
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
Notes:
- If the
ego.yarn.resourcemanager.ha.virtualip
property exists in the file, ensure that it is set to
false as this property applies to virtual
IP-based HA.
- If your yarn-site.xml file included the
yarn.resourcemanager.store.class
and
yarn.resourcemanager.cluster-id
properties,
then the system uses these values. If your
yarn-site.xml did not contain these
properties, the system adds the following default values when the
resource manager and node manager
start:<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>defaultClusterID</value>
</property>
- When the resource manager and node manager start, the system
automatically generates the following values for the
yarn.resourcemanager.ha.rm-ids
,
yarn.resourcemanager.hostname.rmhost2
, and
yarn.resourcemanager.hostname.rmhost1
properties in
yarn-site.xml:<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm_host_1,rm_host_2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rmhost2</name>
<value>rm_host_2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rmhost1</name>
<value>rm_host_1</value>
</property>
-
Restart the EGO primary
and the EGOYARN service:
egosh ego restart primary_host_name
egosh service stop EGOYARN
egosh service start EGOYARN
- Verify that there are multiple resource manager instances
in the cluster:
egosh service list -l
Note that one resource manager is active; the
others are on standby for HA.
Results
Once you have configured the YARN resource
manager for HA, if the active resource manager is down or is no longer on the list, one
of the standby resource managers becomes active and resumes resource manager
responsibilities. This way, jobs continue to run and complete successfully.