Configuring jobs to run on Spark

You can configure IBM® DataStage® Flow Designer jobs to run on Spark.

Before you begin

Your environment must meet the following requirements:
  • A Red Hat Enterprise Linux® (RHEL) 7.x operating system for both InfoSphere® Information Server and Spark
  • Spark 2.2.1 or later built for Hadoop 2.7 or later

Note: Only Spark running on YARN is supported. Spark Standalone, Spark on Kubernetes, and Spark on Apache Mesos are not supported.

Before you run a Spark job, you need to compile the job. This generates Scala code, which you can view by clicking the View Scala link that appears after the job is successfully compiled.
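
The generated Scala is specific to your job design, but it follows the same general pattern as any Spark application that is written in Scala. The following minimal sketch illustrates that pattern only; the object name, file paths, and column name are hypothetical and are not produced by the tool.

  import org.apache.spark.sql.SparkSession

  object SampleSparkJob {
    def main(args: Array[String]): Unit = {
      // On YARN, the master and cluster settings come from the submit environment,
      // so they are not hard-coded here.
      val spark = SparkSession.builder()
        .appName("SampleSparkJob")
        .getOrCreate()

      // Read a CSV file, apply a simple filter, and write the result as Parquet.
      val input = spark.read.option("header", "true").csv("/tmp/input.csv")  // hypothetical path
      val active = input.filter(input("status") === "ACTIVE")                // hypothetical column
      active.write.mode("overwrite").parquet("/tmp/output.parquet")          // hypothetical path

      spark.stop()
    }
  }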

About this task

To run jobs on Spark, an administrator must first configure IBM DataStage Flow Designer to connect to your Spark engine. You can configure IBM DataStage Flow Designer to use only one Spark instance.

Procedure

  1. Configure IBM DataStage Flow Designer to connect to a Spark engine.
    1. Log in to IBM DataStage Flow Designer, select a project, and click the persona button at the top of the screen. From there, click Setup > Server.
    2. On the General tab, review the path to the directory where you want to store IBM DataStage Flow Designer Spark files.
      The default path is /opt/IBM/InformationServer/Spark/. The default path works correctly only if your services tier is configured on a single WebSphere® Application Server ND or Liberty node. If you are using a multi-node WebSphere Application Server ND deployment, provide a path that is visible at the same location to all nodes of the services tier. In a multi-node configuration, you need to provide this path only if you run jobs on Spark or plan to do so in the future.

      If you update the path to the repository for Spark files after running jobs, you might need to move the IBM DataStage Flow Designer files from the default path to the new path that you specify.

      Note: The paths refer to locations on the services tier.

    3. Specify the following parameters on the Spark tab:
      Spark Instance Name
      The name of your Spark instance. This value is automatically populated.
      Cluster manager
      YARN is the only supported resource manager and job scheduler.
      Authentication type
      Specify either None or Kerberos. Select None if you do not have authentication enabled for the Hadoop cluster and the cluster is not located behind a firewall. Select Kerberos to provide authentication to the services on the Hadoop cluster.
      File
      Upload the required core-site.xml and yarn-site.xml files. These configuration files are located on the cluster where the YARN and Spark services are running.

      If the authentication type is Kerberos, also upload the krb5.conf configuration file that is used to secure the cluster that you are connecting to. Typically, krb5.conf is found in the cluster's /etc directory. In addition, upload the keytab files of the principals who will submit jobs to the cluster that is being configured. For example, if you have three IBM DataStage Flow Designer users, each with an individual principal, you need three keytabs so that each user's job runs on Spark under that user's principal. A sketch at the end of step 1 shows how a client typically uses these uploads.

    4. In the Certificates section, click Add SSL Certificate if you want to upload an SSL certificate to the server. The Secure Sockets Layer (SSL) protocol uses encryption and authentication techniques to secure connections between clients and servers.
    5. Click OK.
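
    The exact way that IBM DataStage Flow Designer consumes these uploads is internal to the product. As an illustration only, the following sketch shows how a Hadoop or Spark client typically uses the same artifacts: the two site files describe the cluster, and a matching principal and keytab pair establishes the Kerberos identity. The paths, principal, and keytab name are hypothetical.

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path
      import org.apache.hadoop.security.UserGroupInformation

      object ClusterConnectionSketch {
        def main(args: Array[String]): Unit = {
          // core-site.xml and yarn-site.xml tell the client where HDFS and YARN are.
          val conf = new Configuration()
          conf.addResource(new Path("/path/to/core-site.xml"))  // hypothetical upload location
          conf.addResource(new Path("/path/to/yarn-site.xml"))  // hypothetical upload location

          // With Kerberos, the principal and its keytab must match. The realm and KDC
          // are read from krb5.conf (the java.security.krb5.conf system property).
          UserGroupInformation.setConfiguration(conf)
          UserGroupInformation.loginUserFromKeytab(
            "jane@EXAMPLE.COM",              // principal that submits the job
            "/path/to/jane.headless.keytab"  // keytab uploaded for that principal
          )

          println(s"Logged in as ${UserGroupInformation.getCurrentUser.getUserName}")
        }
      }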
  2. Log in to IBM DataStage Flow Designer to run a job on the Spark engine.
    1. Click Jobs and open an existing job or create a new one. The job type must be Spark.
    2. Click Run. Specify the following parameters on the Job Run Options pane:
      Spark Instance Name
      The name of your Spark instance. This field is automatically populated.
      Kerberos Keytab File
      Specify the name of the keytab file for the principal user.
      Kerberos Principal User Name
      Specify the principal that is used to authenticate against the Spark engine, for example, user@EXAMPLE.COM. The principal must correspond to the keytab that is selected in the run dialog. For example, if you use a mismatched principal and keytab, such as the principal JANE@IBM.COM and the keytab bob.headless.keytab, the job fails.
      Log Level
      Specify the logging level for the Spark job: ERROR, WARN, INFO, DEBUG, or TRACE. The job runs at the level that you specify.
        • ERROR provides the best performance and logs errors that do not cause a process to fail.
        • WARN logs messages about conditions that might cause errors or other issues.
        • INFO logs general information messages and affects performance more than WARN.
        • DEBUG provides detailed information and increases a job's run time.
        • TRACE logs debug and monitoring information and increases a job's run time the most.
      You can view logs in IBM DataStage Flow Designer by going to View > Logs. A sketch after this procedure shows how the log level maps onto a running Spark job.
    3. Click Run.
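
IBM DataStage Flow Designer applies the log level that you select to the Spark job for you. As an illustration of what that selection controls, the following sketch sets the same level through Spark's public SparkContext API; the object name and the work that the job does are hypothetical.

  import org.apache.spark.sql.SparkSession

  object LogLevelSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("LogLevelSketch").getOrCreate()

      // Valid values match the levels in the run dialog: ERROR, WARN, INFO, DEBUG, TRACE.
      // More verbose levels produce more log output and increase the job's run time.
      spark.sparkContext.setLogLevel("WARN")

      spark.range(0, 1000).count()  // work done after this point logs at WARN and above
      spark.stop()
    }
  }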