Complete this task to create the Apache Spark working directories.
Procedure
- As you did in Creating the Apache Spark configuration directory, follow your file system conventions
and create new working directories for Apache Spark.
Note: Consider mounting the $SPARK_WORKER_DIR and $SPARK_LOCAL_DIRS directories on separate zFS file systems to avoid uncontrolled growth on the primary zFS file system where Spark is located.
The sizes of these zFS file systems depend on the activity level of your applications and whether auto-cleanup or rolling logs is enabled (see step 4). If you are unsure about the sizes, 500 MB is a good starting point. Then, monitor the growth of these file systems and adjust their sizes accordingly.
- Provide read/write access to the user ID that runs IBM® z/OS® Platform for Apache Spark.
- Update the $SPARK_CONF_DIR/spark-env.sh script with the new environment variables pointing to the newly created working directories. For example:
export SPARK_WORKER_DIR=/var/spark/work
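If you created separate directories for worker, local, and log data, a fuller sketch of the spark-env.sh additions might look like the following. The /var/spark paths are illustrative assumptions; substitute the directories that you created for your installation.

```shell
# Illustrative paths; substitute your own working directories.
export SPARK_WORKER_DIR=/var/spark/work    # worker scratch and application directories
export SPARK_LOCAL_DIRS=/var/spark/tmp     # shuffle and spill space
export SPARK_LOG_DIR=/var/spark/logs       # daemon log files
```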
- Configure these directories to be cleaned regularly.
- Configure Spark to perform cleanup.
By default, Spark does not regularly clean up worker directories, but you can configure it to do so. Change the following Spark properties in $SPARK_CONF_DIR/spark-defaults.conf to values that support your planned activity, and monitor these settings over time:
- spark.worker.cleanup.enabled
- Enables periodic cleanup of worker and application directories. This is disabled by default. Set
to true to enable it.
- spark.worker.cleanup.interval
- The frequency, in seconds, at which the worker cleans up old application work directories. The default is 30 minutes (1800 seconds). Modify the value as appropriate for your workload.
- spark.worker.cleanup.appDataTtl
- Controls how long, in seconds, to retain application work directories. The default is 7 days (604800 seconds), which may be too long if Spark jobs run frequently. Modify the value as appropriate for your workload.
For more information about these properties, see
Spark Standalone Mode.
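Taken together, a minimal spark-defaults.conf fragment that enables worker cleanup might look like the following. The interval and TTL values are illustrative only; tune them to your planned activity.

```
spark.worker.cleanup.enabled     true
spark.worker.cleanup.interval    1800
spark.worker.cleanup.appDataTtl  86400
```

With these values, the worker checks every 30 minutes (1800 seconds) and removes application work directories that are older than one day (86400 seconds).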
- Configure Spark to enable rolling log files.
By default, Spark retains all of the executor log files. You can change the following Spark properties in $SPARK_CONF_DIR/spark-defaults.conf to enable rolling of executor logs:
- spark.executor.logs.rolling.maxRetainedFiles
- Sets the number of the most recent rolling log files that the system retains; older log files are deleted. The default is to retain all log files.
- spark.executor.logs.rolling.strategy
- Sets the strategy for rolling of executor logs. By default, it is disabled. The valid values
are:
- time
- Time-based rolling. Use spark.executor.logs.rolling.time.interval to set the
rolling time interval.
- size
- Size-based rolling. Use spark.executor.logs.rolling.maxSize to set the maximum
file size for rolling.
- spark.executor.logs.rolling.time.interval
- Sets the time interval by which the executor logs will be rolled over. Valid values are:
- daily
- hourly
- minutely
- Any number of seconds
- spark.executor.logs.rolling.maxSize
- Sets the maximum file size, in bytes, by which the executor logs will be rolled over.
For more information about these properties, see Spark Configuration.
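For example, a spark-defaults.conf fragment that rolls executor logs daily and keeps the last seven might look like the following. The values are illustrative; choose an interval and retention count that match your log volume.

```
spark.executor.logs.rolling.strategy         time
spark.executor.logs.rolling.time.interval    daily
spark.executor.logs.rolling.maxRetainedFiles 7
```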
- Create jobs that clean up or archive the following directories listed in Table 1:
- $SPARK_LOG_DIR
- $SPARK_WORKER_DIR, if not configured to be cleaned by Spark properties
- $SPARK_LOCAL_DIRS
z/OS UNIX ships a sample script, skulker, that you can use as written or modify to suit your needs. The -R option can be useful, as Spark files are often nested in subdirectories. You can schedule skulker to run regularly from cron or other in-house automation tooling. You can find a sample skulker script in the /samples directory. For more information about skulker, see "skulker - Remove old files from a directory" in z/OS UNIX System Services Command Reference.
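As an illustration, a hypothetical crontab entry that runs skulker nightly against the worker directory might look like the following. The schedule, path, and age are assumptions; verify the skulker options against the command reference before using them.

```
# Run at 02:00 daily: recursively (-R) remove files older than 7 days
# from the (illustrative) Spark worker directory.
0 2 * * * /samples/skulker -R /var/spark/work 7
```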
- Optional: Periodically check all file systems involved in Spark (such as
$SPARK_HOME and any others mounted under it or elsewhere).
- You can specify the FSFULL parameter for a file system so that it generates operator messages as the file system reaches a user-specified utilization threshold.
- Look for the number of extents, which can impact I/O performance for the disks involved. Perform
these steps to reduce the number of extents:
- Create and mount a new zFS.
- Use copytree, tar, or similar utilities to copy the key
directories from the old file system to the new one.
- Unmount the old file system and re-mount the new file system in its place.
For more information, see "Managing File System Size" in z/OS DFSMSdfp Advanced Services.
Note: Update the
BPXPRMxx member of parmlib with the new file systems.
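The copy in step 2 of the extent-reduction procedure can be sketched with tar as follows. The temporary directories here stand in for the old and new zFS mount points, an assumption made so the sketch is self-contained; on your system, use the actual mount points of the old and new file systems.

```shell
# Illustrative stand-ins for the old and new zFS mount points.
OLD_FS=$(mktemp -d)   # stands in for the old, fragmented file system
NEW_FS=$(mktemp -d)   # stands in for the newly created and mounted zFS

# Sample content to carry over (for illustration only).
echo "sample log" > "$OLD_FS/app.log"

# Pipe a tar archive of the old tree into an extract in the new tree,
# preserving permissions (-p) and the directory structure.
tar -cf - -C "$OLD_FS" . | tar -xpf - -C "$NEW_FS"
```

After the copy completes and is verified, you would unmount the old file system and mount the new one in its place, as step 3 describes.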
Results
You have completed the customization of your Apache Spark directory structure.