Known issues

Review the list of known issues in IBM® Spectrum Conductor 2.5.1.

Found in version 2.5.1

As of the release of IBM Spectrum Conductor 2.5.1, there are no known issues. See the following section for issues found in earlier product versions.

Found in earlier versions

When installing as root with a non-root cluster administrator, entitlement cannot be set during a rolling upgrade on Ubuntu
When you install as root during a rolling upgrade on Ubuntu, a non-root cluster administrator cannot set entitlement. As a workaround, switch to the cluster administrator and run the egoupgrade host command manually after the installation to set entitlement.
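The workaround above can be sketched as follows; the administrator account name egoadmin and the profile path are assumptions, so substitute your own cluster administrator name and installation paths:

```shell
# Switch from root to the cluster administrator account
# ("egoadmin" is a placeholder; use your cluster administrator name).
su - egoadmin

# Source the EGO environment so that egoupgrade is on the PATH
# (the profile location is an assumption; adjust for your installation).
. $EGO_TOP/profile.platform

# Manually complete the host upgrade, which sets entitlement.
egoupgrade host
```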
In upgrade scenarios, you might need to manually start the ServiceDirector on a host that uses glibc version 2.14 or higher
In upgrade scenarios, the ServiceDirector and WebServiceGateway services continue to start automatically as in previous versions. However, if the ServiceDirector host does not use glibc version 2.14 or higher, both the ServiceDirector and the WebServiceGateway (which depends on the ServiceDirector) fail to start. You must then manually start the ServiceDirector on a host that uses glibc version 2.14 or higher.
Issues logging in to cluster management console with public DNS URL
If a management host joins the cluster under a private DNS host name that differs from the host name returned by public DNS, login issues occur when you use the public DNS host name to access the cluster management console. As a workaround, use the IP address (instead of the host name) to access the cluster management console.
When you configure security, host names that start with a number can cause issues
Out-of-box certificates do not generate successfully if your host has a domain name that starts with a number.
If hierarchical scheduling policy is set for an instance group, Jupyter notebook applications belonging to that instance group do not run
If the hierarchical scheduling policy is set for an instance group (SPARK_EGO_APP_SCHEDULE_POLICY=hierarchy), you must set the additional environment variable JUPYTER_SPARK_OPTS so that notebook applications run successfully. The JUPYTER_SPARK_OPTS environment variable must use Spark configuration options to provide policy tag information. Setting policy tags ensures that all notebook drivers that use these policy tags are started. For example:
--conf spark.executorEnv.SPARK_EGO_TAG_ENV_NAME_USED_INTERNAL=/root/small/t1 --conf spark.ego.driverEnv.SPARK_EGO_TAG_ENV_NAME_USED_INTERNAL=/root/small/t1
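For illustration, the options above can be exported in the notebook's environment before it starts; the policy tag path /root/small/t1 is taken from the example above, so substitute your own consumer tree path:

```shell
# Pass the hierarchy policy tag to both the executors and the driver
# (the tag value /root/small/t1 is an example; use your own policy tag).
export JUPYTER_SPARK_OPTS="--conf spark.executorEnv.SPARK_EGO_TAG_ENV_NAME_USED_INTERNAL=/root/small/t1 \
--conf spark.ego.driverEnv.SPARK_EGO_TAG_ENV_NAME_USED_INTERNAL=/root/small/t1"
```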
Updated conda environment does not auto-deploy following host startup
Updating a conda environment while a host is shut down does not automatically redeploy the updated environment on the host after it starts. This issue includes creating or removing the environment and installing or removing conda packages for a deployed Anaconda distribution instance. To resolve this issue, update the conda environment again after the host is started.
Cannot create instance group or deploy Anaconda with Kerberos authentication
If the egodeploy command or the repository service (RS) crashes with Kerberos authentication enabled, check if the Kerberos plug-in library sec_ego_gsskrb.so under $EGO_LIBDIR links to the right Kerberos libraries libkrb5.so.3, libgssapi_krb5.so.2, and libk5crypto.so.3, which are installed with standard OS packages. The sec_ego_gsskrb.so plug-in library and Kerberos libraries should not have dependencies on the libcrypto.so.10 library. This dependency causes egodeploy and RS to crash. To resolve this issue, copy the libkrb5.so.3, libgssapi_krb5.so.2, and libk5crypto.so.3 libraries to $EGO_LIBDIR.
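A quick way to check the plug-in's library linkage is with ldd; the /usr/lib64 source directory below is an assumption and varies by distribution:

```shell
# List the libraries that the Kerberos plug-in resolves at run time.
ldd $EGO_LIBDIR/sec_ego_gsskrb.so

# A match here indicates the problematic dependency on libcrypto.so.10.
ldd $EGO_LIBDIR/sec_ego_gsskrb.so | grep libcrypto.so.10

# Workaround: copy the OS-provided Kerberos libraries into $EGO_LIBDIR
# (the /usr/lib64 source directory is an example; adjust for your OS).
cp /usr/lib64/libkrb5.so.3 /usr/lib64/libgssapi_krb5.so.2 \
   /usr/lib64/libk5crypto.so.3 $EGO_LIBDIR/
```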
If using Dockerized services in a shared environment, an instance group must use a shared file system location for spark.local.dir
By default, when you create an instance group, the spark.local.dir parameter is set to /tmp. If you use Dockerized services in a shared environment, you must set this value to a location on the shared file system.
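For example, the instance group's Spark configuration would point spark.local.dir at a shared path; the /gpfs/shared/spark-local path below is a placeholder for your own shared file system location:

```
# Spark configuration for the instance group;
# /gpfs/shared/spark-local is an example shared file system path.
spark.local.dir  /gpfs/shared/spark-local
```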
When a host automatically joins a resource group, timing issues might occur that cause deployment errors for either an instance group or an application instance
If an instance group or an application instance is being deployed and a new host joins an associated resource group through automatic deployment, deployment issues can occur. To avoid this issue, either turn off automatic deployment (see Automatically deploying packages) or prevent new hosts from joining while the associated instance group or application instance is deploying. To recover from this issue, redeploy the instance group or application instance after the new host is added.
Running applications killed after migrating the service instance
Running applications are killed when the Spark master service instances are migrated to another host.
Number of failed tasks per executor incorrect when submitting applications in client mode
When a task is lost during application submission in client mode, the executor log does not record the lost task, resulting in an incorrect count of failed tasks in the executor log.
Balanced slot allocation does not work when an application instance (ASC) service or Conductor service is restarted
Balanced slot allocation does not take effect when either a Conductor service or an application instance (ASC) service is restarted. This issue occurs when these services are defined as stateful services and the resource requirement order(string) is added to the service profile.
To work around this issue, update the service profile to remove the order(string) under ServiceDefinition > AllocationSpecification > ResourceSpecification > ResourceRequirement for the service.
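For illustration, the change might look like the following; this fragment is hypothetical and simplified, with element nesting following the path named above and select(...) standing in for your existing selection expression:

```xml
<!-- Hypothetical service profile fragment; adjust to your actual schema. -->
<ServiceDefinition>
  <AllocationSpecification>
    <ResourceSpecification>
      <!-- Before: ...select(...) order(string)... breaks balanced slot
           allocation after a restart. After: remove only order(string),
           keeping the rest of the resource requirement expression. -->
      <ResourceRequirement>select(...)</ResourceRequirement>
    </ResourceSpecification>
  </AllocationSpecification>
</ServiceDefinition>
```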
Cannot start application instance services or system services that are configured for virtual IP
Application instance services or system services that are configured for virtual IP cannot start. This issue occurs when the name of a network interface, in combination with the internally generated alias name, exceeds the maximum length of 15 characters. To fix this issue, rename the network interface so that its name does not exceed 10 characters.
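On Linux, the rename can be done with the ip tool as sketched below; the interface names are examples, and making the rename persist across reboots (for example, through udev or systemd link files) depends on your distribution:

```shell
# Inspect current interface names and their lengths.
ip link show

# Rename an interface to a shorter name; run while the interface is down.
# Both "enp0s31f6" and "eno1" are example names.
ip link set enp0s31f6 down
ip link set enp0s31f6 name eno1
ip link set eno1 up
```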
The Apache Spark parameter spark.network.crypto.enabled and its related parameters are not supported
When you configure an instance group, you cannot set the Apache Spark spark.network.crypto.enabled parameter. For more information on the security settings that you can define for an instance group, see Configuring security settings for a Spark instance group.
The built-in Spark 1.6.1 package does not include SparkR
The built-in Spark 1.6.1 package does not include the Apache Spark open source SparkR 1.6.1 package. As a result, users cannot run a SparkR application in an instance group that is configured with Spark version 1.6.1. If you want to run a SparkR application, use a Spark 2.x package; see Supported Spark, Miniconda, Jupyter, and Dask versions.
The Spark master fails to start if services are started with the egosh service start all command
After you run egosh service start all, if the Spark master is in error state, run egosh service stop <service name> and then egosh service start <service name> to correct the problem.
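The recovery above can be sketched as follows; the service name sparkms-mycluster is a placeholder for your actual Spark master service name:

```shell
# Find the Spark master service that is in ERROR state.
egosh service list

# Restart only the affected service
# ("sparkms-mycluster" is an example service name).
egosh service stop sparkms-mycluster
egosh service start sparkms-mycluster
```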
Updating or modifying instance groups with GPU scheduling and shuffle service enabled
If you are updating or modifying instance groups that were created before IBM Spectrum Conductor version 2.2.1 and have GPU scheduling and shuffle service enabled, you must ensure that the CPU executor group for each instance group contains all GPU executor hosts before you upgrade or modify. If it does not, change the resource group for the CPU executor to meet this requirement.
Microsoft Azure cloud provider fails to provision cloud hosts
When you configure Microsoft Azure as the cloud provider for cloud bursting through host factory, hosts are not provisioned from the Azure cloud, returning runtime exceptions similar to the following error:
[2017-12-12 16:27:45.140]-[ERROR]-[com.ibm.spectrum.util.AzureUtil.createVM(AzureUtil.java:686)] Create instances error.
java.lang.RuntimeException: java.net.UnknownServiceException: Unable to find acceptable protocols. isFallback=false, 
modes=[ConnectionSpec(cipherSuites=[TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, ... ], tlsVersions=[TLS_1_2, TLS_1_1, TLS_1_0], 
supportsTlsExtensions=true), ConnectionSpec(cipherSuites=[TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, ...], 
tlsVersions=[TLS_1_0], supportsTlsExtensions=true), ConnectionSpec()], supported protocols=[TLSv1]
This issue occurs because Azure is currently not PCI compliant and uses the deprecated TLSv1.0 protocol to establish connections. The IBM JDK (which is bundled with IBM Spectrum Conductor) does not support TLSv1.0. To work around this issue, use OpenJDK 8 or Oracle JDK 8 as follows:
  1. Download and install either OpenJDK 8 or Oracle JDK 8.
  2. Stop the HostFactory service:
    egosh service stop HostFactory
  3. Modify the host factory scripts that interface with Azure to point to your JDK:
    1. Go to the $EGO_TOP/4.0/hostfactory/providers/azure/scripts/ directory.
    2. In the getAvailableTemplates.sh, getRequestStatus.sh, requestMachines.sh, and requestReturnMachines.sh scripts, change the javaDir=${EGO_TOP}/jre/${EGOJRE_VERSION}/${OS_TYPE}/ line to point to your JDK installation.
  4. Start the HostFactory service:
    egosh service start HostFactory
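For step 3, the edited line in each script might look like the following; the OpenJDK path is an example, so point it at your own JDK installation directory:

```
# Before (bundled IBM JRE):
#   javaDir=${EGO_TOP}/jre/${EGOJRE_VERSION}/${OS_TYPE}/
# After (locally installed OpenJDK 8; this path is an example):
javaDir=/usr/lib/jvm/java-1.8.0-openjdk/
```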
Elasticsearch encounters memory pressures under heavy load
The default Elasticsearch heap size might be too small. For heap considerations for the Elastic Stack, see Installation requirements and considerations. To increase the Elasticsearch client and data services heap sizes in IBM Spectrum Conductor, see Tuning the heap sizes for Elasticsearch and Logstash to accommodate heavy load.

To enable mlockall to allow the JVM to lock its memory and prevent it from being swapped by the OS, uncomment # bootstrap.memory_lock: true in elasticsearch.yml (found in $EGO_CONFDIR/../../integration/elk/conf/elasticsearch) on every management and primary host; and then restart the elk-elasticsearch service.
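After you uncomment the setting, the relevant line in elasticsearch.yml reads:

```yaml
# Lock the JVM heap in memory so that the OS cannot swap it out.
bootstrap.memory_lock: true
```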

Also, refer to Disable swapping for more details about Elasticsearch and troubleshooting memory issues.

Unable to tab between IBM Spectrum Conductor cluster management console pages in Mozilla Firefox
In Mozilla Firefox, the screen reader that you use for accessibility might not be able to focus on some IBM Spectrum Conductor cluster management console pages; that is, you can tab only between text fields, not between menus, buttons, and text fields. This is a known issue for Mozilla Firefox. To address it, change your Firefox accessibility tab focus settings:
  1. From your Firefox browser, in the address or location bar type about:config.
  2. Click I’ll be careful, if prompted with a warning.
  3. Enter accessibility.tabfocus in the Filter field.
  4. From the preferences list:
    • If you see an accessibility.tabfocus preference in the list, double-click it, and change its value to 7.
    • If you do not see an accessibility.tabfocus preference in the list:
      1. Right-click (or control-click) anywhere in the list, and select New > Integer.
      2. In the resulting dialog, enter accessibility.tabfocus, and click OK.
      3. Enter 7 in the Enter value dialog, and click OK.
  5. Load a new page in Firefox. Verify that you are able to tab between menus, buttons, and text fields.