You can configure IBM®
DataStage® Flow Designer jobs to run
on Spark.
Before you begin
You must have the following Spark configurations:
- A Red Hat Enterprise Linux® (RHEL) 7.x operating system
for both InfoSphere® Information Server and Spark
- Spark 2.2.1 or later built for Hadoop 2.7 or later
Note: Only Spark running on YARN is supported. Spark Standalone, Spark on Kubernetes, and
Spark on Apache Mesos are not supported.
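To confirm these prerequisites on the cluster, one option is to run a quick check from spark-shell. The following lines are only an illustrative sketch, not part of the IBM DataStage Flow Designer configuration; the expected values in the comments are the minimums listed above:

// Run in spark-shell on the Hadoop cluster.
println(spark.version)                                   // expect 2.2.1 or later
println(spark.sparkContext.master)                       // expect a YARN master, such as "yarn"
println(org.apache.hadoop.util.VersionInfo.getVersion)   // Hadoop build, expect 2.7 or later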
Before you run a Spark job, you need to compile the job. This generates Scala code, which you can
view by clicking the View Scala link that appears after the job is
successfully compiled.
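The generated Scala code is specific to your job design. As a rough, hypothetical illustration only (the object name, paths, and filter column below are made up and are not produced by IBM DataStage Flow Designer), a simple Spark job in Scala has the following read-transform-write shape:

import org.apache.spark.sql.SparkSession

object SampleSparkJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SampleSparkJob").getOrCreate()

    // Read, transform, and write: the typical shape of a batch Spark job.
    val input    = spark.read.option("header", "true").csv("/data/input.csv")
    val filtered = input.filter("amount > 100")
    filtered.write.mode("overwrite").parquet("/data/output")

    spark.stop()
  }
}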
About this task
To run jobs on Spark, an administrator must first configure IBM
DataStage Flow Designer so that it
connects to your Spark
engine.
You can configure IBM
DataStage Flow Designer to use only one
Spark instance.
Procedure
- Configure IBM DataStage Flow Designer to connect to a Spark engine.
- Log in to IBM
DataStage Flow Designer, select a
project, and select the persona button on the top of the screen. From there, click
.
- On the General tab, review the path to the directory
where you want to store IBM
DataStage Flow Designer Spark
files.
The default path is
/opt/IBM/InformationServer/Spark/. The
default path works correctly only if your services tier is configured on a single WebSphere® Application Server ND or Liberty node. If you are
using a multi-node WebSphere Application Server ND
deployment, then you need to provide a path that is visible at the same path location to all nodes
of the services tier. If you have a multi-node WebSphere Application Server ND configuration, then you need to provide the path only if you
are running jobs on Spark or plan to do so in the future.
If you update the path to the
repository for Spark files after running jobs, you might need to move the IBM
DataStage Flow Designer files from
the default path to the new path that you specify.
Note: The paths refer to locations
on the services tier.
- Specify the following parameters on the Spark tab:
- Spark Instance Name
- The name of your Spark instance. This value is automatically populated.
- Cluster manager
- YARN is the only supported resource manager and job scheduler.
- Authentication type
- Specify either None or Kerberos. Select
None if you do not have authentication enabled for the Hadoop cluster and the
cluster is not located behind a firewall. Select Kerberos to provide
authentication to the services on the Hadoop
cluster.
- File
- Upload
the required core-site.xml and yarn-site.xml files. These
configuration files are located on the cluster where the YARN and Spark service are running.
If the authentication type is Kerberos, upload the krb5.conf configuration file
that is used to secure the cluster that you are connecting to. Typically, the krb5.conf file
is in the cluster's /etc directory. Also, upload the keytab files of the
principals who will submit jobs to the cluster that is being configured. For example, if you
have three IBM
DataStage Flow Designer
users who each have an individual principal, then you need three keytabs so that when each user runs a job,
the job runs on Spark under that user's principal. For one way to check the configuration files and
keytabs before you upload them, see the sketch after this procedure.
- In the Certificates section, click Add SSL
Certificate if you want to upload an SSL certificate to the server. The Secure Sockets Layer
(SSL) protocol uses encryption and authentication techniques to secure connections between clients
and servers.
- Click OK.
- Log in to IBM
DataStage Flow Designer to run a job
on the Spark engine.
- Click Jobs and open an existing job or create a new one. The
job type must be Spark.
- Click Run. Specify the following parameters on the
Job Run Options pane:
- Spark Instance Name
- The name of your Spark instance. This field is automatically
populated.
- Kerberos Keytab File
- Specify
the name of the keytab file for the principal user.
- Kerberos Principal User Name
- Specify the principal that is used to authenticate against the Spark engine, for example,
user@EXAMPLE.COM. The principal must correspond to the keytab that is selected in the run dialog.
For example, if you use a mismatched principal and keytab, such as the principal JANE@IBM.COM with the
keytab bob.headless.keytab, the job fails. (See the sketch after this procedure for one way to verify the pairing.)
- Log Level
- Specify one of the following logging levels to print: ERROR, INFO, WARN, DEBUG, TRACE. When you
specify a log level, the Spark job will run at that log level. ERROR provides the best performance,
and logs errors that do not cause a process to fail. WARN logs messages about conditions that might
cause errors or other issues. INFO impacts performance more and logs general information messages.
DEBUG impacts a job's run time, but provides a lot of information. TRACE logs debug and monitoring
information. It impacts a job's run time the most. You can view logs in IBM
DataStage Flow Designer by going to
.
- Click Run.
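If you want to sanity-check the cluster configuration files and a principal-keytab pairing before you upload them, the following Scala sketch shows one way to do it with the standard Hadoop client APIs. The paths, principal, and keytab name are examples only, and this check is an assumption about your environment rather than part of IBM DataStage Flow Designer; run it with the Hadoop client libraries that match your cluster on the classpath.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.security.UserGroupInformation

object SparkConfigPreflight {
  def main(args: Array[String]): Unit = {
    // 1. Confirm that the core-site.xml and yarn-site.xml files you plan to
    //    upload point at the intended cluster.
    val conf = new Configuration(false)
    conf.addResource(new Path("/tmp/upload/core-site.xml"))      // example path
    conf.addResource(new Path("/tmp/upload/yarn-site.xml"))      // example path
    println("fs.defaultFS = " + conf.get("fs.defaultFS"))
    println("yarn.resourcemanager.hostname = " + conf.get("yarn.resourcemanager.hostname"))

    // 2. Confirm that the principal and keytab match before uploading them.
    //    A mismatch (for example, JANE@IBM.COM with bob.headless.keytab)
    //    fails here, just as the job run would fail.
    conf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(conf)
    UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/tmp/upload/user.keytab")
    println("Logged in as: " + UserGroupInformation.getLoginUser)
  }
}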