
Configuring networking for Apache Spark

Complete this task to configure port access and other networking customizations that Apache Spark requires.

About this task

Apache Spark makes heavy use of the network for communication between various processes, as shown in Figure 1.

Figure 1. Network ports used in a typical Apache Spark environment

Table 1 and Table 2 further describe these ports, listing those that Spark uses on the cluster side and on the driver side.

Table 1. Network ports used by the Spark cluster

Port name             | Default port number | Configuration property*                         | Notes
----------------------|---------------------|-------------------------------------------------|------------------------------------------------------
Master web UI         | 8080                | spark.master.ui.port or SPARK_MASTER_WEBUI_PORT | The value set by the spark.master.ui.port property takes precedence.
Worker web UI         | 8081                | spark.worker.ui.port or SPARK_WORKER_WEBUI_PORT | The value set by the spark.worker.ui.port property takes precedence.
History server web UI | 18080               | spark.history.ui.port                           | Optional; applies only if you use the history server.
Master port           | 7077                | SPARK_MASTER_PORT                               |
Master REST port      | 6066                | spark.master.rest.port                          | Not needed if you disable the REST service.
Worker port           | (random)            | SPARK_WORKER_PORT                               |
Executor port         | (random)            | spark.executor.port                             | For Spark 1.5.2 only.
Block manager port    | (random)            | spark.blockManager.port                         |
Shuffle server        | 7337                | spark.shuffle.service.port                      | Optional; applies only if you use the external shuffle service.

Table 2. Network ports used by the Spark driver

Port name          | Default port number | Configuration property*    | Notes
-------------------|---------------------|----------------------------|------------------------------------------------------
Application web UI | 4040                | spark.ui.port              |
Driver port        | (random)            | spark.driver.port          |
Block manager port | (random)            | spark.blockManager.port    |
File server        | (random)            | spark.fileserver.port      | For Spark 1.5.2 only.
HTTP broadcast     | (random)            | spark.broadcast.port       | For Spark 1.5.2 only; not used if spark.broadcast.factory is set to TorrentBroadcastFactory (the default).
Class file server  | (random)            | spark.replClassServer.port | For Spark 1.5.2 only; used only in Spark shells.

*An entry in lower case in the Configuration property column is a Spark property, which you set in the spark-defaults.conf file; an entry in upper case is an environment variable, which you set in the spark-env.sh file.

Spark must be able to bind to all the required ports. If Spark cannot bind to a specific port, it tries again with the next higher port number. The maximum number of retries is controlled by the spark.port.maxRetries property in the spark-defaults.conf file; the default is 16.
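
For example, you can widen the retry window by setting the property in the spark-defaults.conf file; the value 32 here is illustrative only:

    # $SPARK_CONF_DIR/spark-defaults.conf
    # Allow up to 32 bind attempts per port (the default is 16).
    spark.port.maxRetries   32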

Procedure

  1. For your planned deployment and ecosystem, consider any port access and firewall implications for the ports listed in Table 1 and Table 2, and configure specific port settings, as needed. For instance, if your application developers need to access the Spark application web UI from outside the firewall, the application web UI port must be open on the firewall.

    Each time a Spark process starts, it creates a number of listening ports that are specific to the intended function of that process. Depending on your site networking policies, consider limiting access to all ports and permitting access only for specific users or applications.

    In z/OS®, you can use settings in z/OS Communications Server and RACF® to enforce these controls. For instance, you can specify PORT UNRSV DENY in your TCPIP.PROFILE to deny all applications access to unreserved ports for TCP or UDP. You can also specify PORT UNRSV SAF to grant access to specific users, such as the user ID that starts the Spark cluster and the user IDs of the Spark users. For more information about the PORT statement, see z/OS Communications Server: IP Configuration Reference. A sketch of such entries follows.
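
    The following TCPIP.PROFILE sketch is illustrative only: the job name SPARKID and the SAF resource name SPARKPRT are hypothetical placeholders, and you should verify the exact PORT statement syntax for your release in z/OS Communications Server: IP Configuration Reference.

        PORT
           UNRSV TCP * DENY                ; deny unreserved TCP ports to all jobs
           UNRSV TCP SPARKID SAF SPARKPRT  ; permit job SPARKID when the SAF
                                           ; resource SPARKPRT allows access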

  2. Consider disabling the REST server.

    The REST server interface, which listens on port 6066 by default, is currently not documented in the Apache Spark documentation. The REST server supports neither TLS nor client authentication, yet Spark applications can be submitted through this interface. The REST server is used when applications are submitted in cluster deploy mode (--deploy-mode cluster). Client deploy mode is the default behavior for Spark and is the way that notebooks, such as Jupyter Notebook, connect to a Spark cluster. Depending on your planned deployment and environment, access to the REST server might be restricted by other controls. However, if you want to disable it, set spark.master.rest.enabled to false in $SPARK_CONF_DIR/spark-defaults.conf, as shown in the following example.
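
    For example:

        # $SPARK_CONF_DIR/spark-defaults.conf
        # Disable the REST server so that port 6066 is not opened and
        # unauthenticated cluster-mode submissions are not accepted.
        spark.master.rest.enabled   false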

  3. Configure Spark environment variables for common enterprise networking configurations. You can set each of the following environment variables in the spark-env.sh file; a sample spark-env.sh file follows this list:
    SPARK_PUBLIC_DNS
    For environments that use network address translation (NAT), set SPARK_PUBLIC_DNS to the external host name to be used for the Spark web UIs. SPARK_PUBLIC_DNS sets the public DNS name of the Spark master and workers, which allows the Spark master to present in its logs a URL with a host name that is visible to the outside world.
    SPARK_LOCAL_IP
    Set the SPARK_LOCAL_IP environment variable to configure Spark processes to bind to a specific and consistent IP address when creating listening ports.
    SPARK_MASTER_HOST
    On systems with multiple network adapters, Spark might bind to an interface chosen by default and give up if that interface does not work. To avoid this, set SPARK_MASTER_HOST (known as SPARK_MASTER_IP prior to Spark 2.0) to the host name or IP address that the master should use.
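
    The following spark-env.sh sketch sets all three variables; the host names and the IP address are illustrative placeholders that you must replace with values for your site:

        # $SPARK_CONF_DIR/spark-env.sh
        # External host name that the Spark web UIs advertise (illustrative).
        export SPARK_PUBLIC_DNS=spark.example.com
        # Local IP address that Spark processes bind to when creating
        # listening ports (illustrative).
        export SPARK_LOCAL_IP=10.1.1.5
        # Host name or IP address for the master to listen on (illustrative);
        # use SPARK_MASTER_IP instead on Spark versions before 2.0.
        export SPARK_MASTER_HOST=master.internal.example.com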

What to do next

Continue with Configuring IBM Java.
