
Configuring networking for Apache Spark

Complete this task to configure port access and other networking customizations that Apache Spark requires.

About this task

Apache Spark makes heavy use of the network for communication between various processes, as shown in Figure 1.

Figure 1. Network ports used in a typical Apache Spark environment

Table 1 and Table 2 further describe these ports, listing those that Spark uses on the cluster side and on the driver side.

Table 1. Network ports used by the Spark cluster

Port name             | Default port number | Configuration property*                         | Notes
----------------------|---------------------|-------------------------------------------------|------------------------------------------------------
Master web UI         | 8080                | spark.master.ui.port or SPARK_MASTER_WEBUI_PORT | The value set by the spark.master.ui.port property takes precedence.
Worker web UI         | 8081                | spark.worker.ui.port or SPARK_WORKER_WEBUI_PORT | The value set by the spark.worker.ui.port property takes precedence.
History server web UI | 18080               | spark.history.ui.port                           | Optional; applies only if you use the history server.
Master port           | 7077                | SPARK_MASTER_PORT                               |
Master REST port      | 6066                | spark.master.rest.port                          | Not needed if you disable the REST service.
Worker port           | (random)            | SPARK_WORKER_PORT                               |
Executor port         | (random)            | spark.executor.port                             | For Spark 1.5.2 only.
Block manager port    | (random)            | spark.blockManager.port                         |
Shuffle server        | 7337                | spark.shuffle.service.port                      | Optional; applies only if you use the external shuffle service.

Table 2. Network ports used by the Spark driver

Port name          | Default port number | Configuration property*    | Notes
-------------------|---------------------|----------------------------|------------------------------------------------------
Application web UI | 4040                | spark.ui.port              |
Driver port        | (random)            | spark.driver.port          |
Block manager port | (random)            | spark.blockManager.port    |
File server        | (random)            | spark.fileserver.port      | For Spark 1.5.2 only.
HTTP broadcast     | (random)            | spark.broadcast.port       | For Spark 1.5.2 only; not used if spark.broadcast.factory is set to TorrentBroadcastFactory (the default).
Class file server  | (random)            | spark.replClassServer.port | For Spark 1.5.2 only; used only in Spark shells.

*An entry in lower case in the Configuration property column is a Spark property, which you set in the spark-defaults.conf file; an entry in upper case is an environment variable, which you set in the spark-env.sh file.

Spark must be able to bind to all the required ports. If Spark cannot bind to a specific port, it tries again with the next higher port number. The maximum number of retries is controlled by the spark.port.maxRetries property in the spark-defaults.conf file; the default is 16.
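
For example, you can widen the retry window by setting the property in the spark-defaults.conf file; the value 32 here is illustrative only:

    # $SPARK_CONF_DIR/spark-defaults.conf
    # Allow up to 32 bind attempts per port (the default is 16).
    spark.port.maxRetries   32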

Procedure

  1. For your planned deployment and ecosystem, consider any port access and firewall implications for the ports listed in Table 1 and Table 2, and configure specific port settings, as needed. For instance, if your application developers need to access the Spark application web UI from outside the firewall, the application web UI port must be open on the firewall.

    Each time a Spark process starts, it creates a number of listening ports that are specific to the intended function of that process. Depending on your site networking policies, consider limiting access to all ports and permitting access only for specific users or applications.

    In z/OS®, you can use settings in z/OS Communications Server and RACF® to enforce these controls. For instance, you can specify PORT UNRSV DENY in your TCPIP.PROFILE to deny all applications access to unreserved ports for TCP or UDP. You can also specify PORT UNRSV SAF to grant access to specific users, such as the user ID that starts the Spark cluster and the user IDs of the Spark users. For more information about the PORT statement, see z/OS Communications Server: IP Configuration Reference. A sketch of such entries follows.
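
    The following TCPIP.PROFILE sketch is illustrative only: the job name SPARKID and the SAF resource name SPARKPRT are hypothetical placeholders, and you should verify the exact PORT statement syntax for your release in z/OS Communications Server: IP Configuration Reference.

        PORT
           UNRSV TCP * DENY                ; deny unreserved TCP ports to all jobs
           UNRSV TCP SPARKID SAF SPARKPRT  ; permit job SPARKID when the SAF
                                           ; resource SPARKPRT allows access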

  2. Consider disabling the REST server.

    The REST server interface, which listens on port 6066 by default, is currently not documented in the Apache Spark documentation. The REST server supports neither TLS nor client authentication, yet Spark applications can be submitted through this interface. The REST server is used when applications are submitted in cluster deploy mode (--deploy-mode cluster). Client deploy mode is the default behavior for Spark and is the way that notebooks, such as Jupyter Notebook, connect to a Spark cluster. Depending on your planned deployment and environment, access to the REST server might be restricted by other controls. However, if you want to disable it, set spark.master.rest.enabled to false in $SPARK_CONF_DIR/spark-defaults.conf, as shown in the following example.
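
    For example:

        # $SPARK_CONF_DIR/spark-defaults.conf
        # Disable the REST server so that port 6066 is not opened and
        # unauthenticated cluster-mode submissions are not accepted.
        spark.master.rest.enabled   false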

  3. Configure Spark environment variables for common enterprise networking configurations. You can set each of the following environment variables in the spark-env.sh file; a sample spark-env.sh file follows this list:
    SPARK_PUBLIC_DNS
    For environments that use network address translation (NAT), set SPARK_PUBLIC_DNS to the external host name to be used for the Spark web UIs. SPARK_PUBLIC_DNS sets the public DNS name of the Spark master and workers, which allows the Spark master to present in its logs a URL with a host name that is visible to the outside world.
    SPARK_LOCAL_IP
    Set the SPARK_LOCAL_IP environment variable to configure Spark processes to bind to a specific and consistent IP address when creating listening ports.
    SPARK_MASTER_HOST
    On systems with multiple network adapters, Spark might bind to an interface chosen by default and give up if that interface does not work. To avoid this, set SPARK_MASTER_HOST (known as SPARK_MASTER_IP prior to Spark 2.0) to the host name or IP address that the master should use.
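
    The following spark-env.sh sketch sets all three variables; the host names and the IP address are illustrative placeholders that you must replace with values for your site:

        # $SPARK_CONF_DIR/spark-env.sh
        # External host name that the Spark web UIs advertise (illustrative).
        export SPARK_PUBLIC_DNS=spark.example.com
        # Local IP address that Spark processes bind to when creating
        # listening ports (illustrative).
        export SPARK_LOCAL_IP=10.1.1.5
        # Host name or IP address for the master to listen on (illustrative);
        # use SPARK_MASTER_IP instead on Spark versions before 2.0.
        export SPARK_MASTER_HOST=master.internal.example.com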

What to do next

Continue with Configuring IBM Java.
