Technical Blog Post
Abstract
Docker integration with IBM Spectrum Conductor with Spark
Body
IBM Spectrum Conductor with Spark v2.1.0 is now available, and includes full integration with the open-source project Docker! Here are some of the key points to know about Docker’s integration with IBM Spectrum Conductor with Spark:
What is Docker?
Docker is an open-source tool designed to make creating and deploying applications easier by running them inside Docker containers. Containers let you bundle an application together with its libraries and dependencies into a single package, so you can run the application on any host and in any host environment.
IBM Spectrum Conductor with Spark allows you to run applications through interactive notebooks such as the built-in Apache Zeppelin notebook or the IPython-Jupyter notebook. Furthermore, you can submit Spark batch applications directly to Spark. Depending on the degree of “Dockerization”, you can Dockerize an entire Spark instance group, notebooks, and even any submitted Spark workloads.
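For context, a Spark batch application is simply a program written against the Spark API. The following is a minimal, self-contained PySpark example of the kind of workload that could be submitted as a batch application; the file name, application name, and sample data are illustrative only and are not part of the product:

```python
# word_count.py - a minimal PySpark batch application (illustrative only; the
# file name and sample data are not part of IBM Spectrum Conductor with Spark).
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCountExample")

    # Use a small in-memory dataset so the example is self-contained.
    lines = sc.parallelize(["spark docker conductor", "docker spark"])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, count in counts.collect():
        print("%s %d" % (word, count))

    sc.stop()
```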
How does Dockerizing Spectrum Conductor with Spark work?
To Dockerize a Spark instance group, three components must be configured during Spark instance group creation: Spark drivers, Spark executors, and Spark instance group services. With resource management enabled, you can choose to run these components in Docker mode, in Cgroup mode, or normally without any special configuration (None). Selecting Docker mode runs each component in its own Docker container, isolated from the host. This means that each container executes its own tasks regardless of the host environment’s configuration.
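To make the idea of Docker-mode isolation concrete, here is a minimal sketch using the Docker SDK for Python; this is not the product's internal mechanism, and the image and environment variable are hypothetical. It simply shows that a container does not see the host's environment unless you explicitly pass it in:

```python
# Conceptual sketch only: IBM Spectrum Conductor with Spark manages Docker
# containers for you when a component runs in Docker mode. This example just
# illustrates host isolation using the Docker SDK for Python (pip install docker).
import os
import docker

os.environ["HOST_ONLY_SETTING"] = "set on the host"   # visible only on the host
client = docker.from_env()

# The container does not inherit the host's environment unless it is passed in.
output = client.containers.run(
    image="ubuntu:16.04",
    command=["sh", "-c", "echo inside container: ${HOST_ONLY_SETTING:-unset}"],
    remove=True,
)
print(output.decode())   # prints "inside container: unset"
```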
IBM Spectrum Conductor with Spark also contains the built-in Apache Zeppelin notebook, and supports other notebooks like the IPython-Jupyter notebook, which can be Dockerized when adding a new notebook. Notebooks are added on the Notebook Management page of the management console, and can be Dockerized by selecting the Run notebook in a Docker container option. Once a notebook package is provided and the required fields are filled in, click Add to add the Dockerized notebook.
After the Dockerized notebook is added, you will see the Dockerized notebook available for use when you attempt to create a new Spark instance group.
Why Docker?
Docker is becoming increasingly popular in the platform and cloud industries, and offers significant features that IBM Spectrum Conductor with Spark can use: more efficient job execution, user-isolated environments, container resource fencing, simplified dependency setup for running environments, and more.
In IBM Spectrum Conductor with Spark, Spark jobs are executed using a designated task execution user, which is the only user privileged to execute the submitted job. Docker integration allows you to import the specified execution user into the Docker container, securing the job running environment so that no other users can co-exist in the same environment unless specified. This kind of user isolation can prevent malicious users from taking advantage of the running container’s environment.
Docker also allows you to configure containers with host contents and restrictions such as memory fencing, which limits the amount of memory that applications inside the container can use. With a Dockerized notebook, you can configure a memory limit to prevent resource-hungry workloads from consuming all of the host’s resources.
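As an illustration of both points, the sketch below uses the Docker SDK for Python to start a container as a specific execution user and with a hard memory limit. This is not Conductor's internal mechanism; the image name, user ID, and memory limit are assumptions for the example:

```python
# Illustrative sketch only (not how IBM Spectrum Conductor with Spark does it
# internally): start a container as a designated execution user with memory fencing.
import docker

client = docker.from_env()

container = client.containers.run(
    image="my-notebook:latest",     # hypothetical Dockerized notebook image
    command="sleep 60",             # stand-in for the notebook process
    user="1001:1001",               # run only as the designated execution user
    mem_limit="2g",                 # fence memory so the workload cannot exhaust the host
    detach=True,
)

print(container.attrs["HostConfig"]["Memory"])   # memory limit in bytes
container.stop()
container.remove()
```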
Docker even allows you to build Docker images with library and module dependencies built in, which removes the tedious environment setup required when submitting jobs that depend heavily on specific libraries and modules. When you use your own Docker image to execute jobs, you do not need to reconfigure the running environment on every host.
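For example, an image with the Python modules a job depends on already installed can be built once and reused on every host. The following hedged sketch uses the Docker SDK for Python to build such an image from an in-memory Dockerfile; the base image, package list, and tag are assumptions for illustration:

```python
# Illustrative sketch: build an image with job dependencies baked in, so each
# host does not need to be configured by hand. Contents and tag are hypothetical.
import io
import docker

dockerfile = b"""
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y python python-pip && \\
    pip install numpy pandas
"""

client = docker.from_env()

# Build directly from the in-memory Dockerfile and tag the result.
image, build_logs = client.images.build(
    fileobj=io.BytesIO(dockerfile),
    tag="my-spark-deps:1.0",
    rm=True,
)

for entry in build_logs:
    print(entry.get("stream", "").strip())
```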
IBM Spectrum Conductor with Spark users can also create container definitions to manage and maintain all of the Docker image definitions for a Spark instance group, and set a default container definition for each of the Spark instance group components. Building dependencies into Docker images significantly decreases the amount of time required to set up the running environment.
Troubleshooting Tips
Being able to isolate and resolve problems with the Docker Controller or with IBM Spectrum Conductor with Spark is an integral part of the user experience. To troubleshoot, consult the appropriate logs:
Dockercontroller Log – The Dockercontroller logs are created on the hosts of the running Dockerized services or workloads, and the log resides in the $EGO_LIBDIR/../../../kernel/log directory. The naming convention for this log is dockercontroller.log.{hostname}. This log offers Docker container level logging, and reports the command used to execute the Docker container, as well as the reasons why a Docker container was not started. Consult the Dockercontroller log if you experience problems starting a Docker container.
PEM Log – The PEM log resides on the host of interest, in the $EGO_LIBDIR/../../../kernel/log directory, and offers activity-level logging to report activity and jobmonitor information. The naming convention for this log is pem.log.{hostname}. You can turn on debug mode by running the command egosh debug pemon and entering the hostname.
Both the Dockercontroller log and the PEM log can be accessed from the IBM Spectrum Conductor with Spark management console by navigating to Reports & Logs > System Logs.
Spark Services Log – The Spark services logs reside in a Spark instance group’s specific deploy location. The logs contain information about the life cycle of the Spark Batch Master, Spark Notebook Master, History Server, and Shuffle Service. These logs can be found in the Spark deployment directory (for example, {deploydir}/spark-1.6.1-hadoop-2.6/logs). The same logs can also be accessed from the Spectrum Conductor with Spark management console: navigate to the specific Spark instance group, click the Services tab, click the service of interest, open the Instances tab, and select the instance of interest. Finally, click View Logs to access the service logs.
Spark Driver and Executor Logs – The default location for the Spark driver and executor logs is ${ELK_HARVEST_LOCATION}. You can configure the Elasticsearch harvest location during Spectrum Conductor with Spark installation. These logs provide the Spark command used to execute the job, and log Spark executor, Spark driver, and Spark master communications such as host information and port selection to start each Spark executor and driver. Consult these logs if Dockerized or non-Dockerized Spark workloads are not running properly. Spark driver logs are accessed through the IBM Spectrum Conductor with Spark management console. From the My Spark Applications & Notebooks page, click on the logs hyperlink next to the application state as shown in the image below.
After clicking on the logs hyperlink, a browser window opens displaying hyperlinks to the detailed error and output logs for the driver.
The executor logs, on the other hand, can be retrieved from the Apache Spark Application UI, which can be accessed from the Spark Instance Groups page. Click on the Spark instance group, then click the 1 batch master link.
Once the Masters dialog opens, click Launch Apache Spark Master UI to launch the Apache Spark Application UI.
Finally, click on the application of interest under Completed Applications to access the detailed executor logs.
It is good practice to view all of these logs even if no issues or errors occur, as they can help you to better understand how each component works, and how the communication between components is established.
Start using Docker and Conductor with Spark today!
IBM Spectrum Conductor with Spark’s Docker integration has introduced many new capabilities to our product, and sparked many new ideas for making use of this incredible open-source tool.
For more information about Docker integration with IBM Spectrum Conductor with Spark, see Docker integration for Linux in the IBM Spectrum Conductor with Spark Knowledge Center.
You can also learn more about Dockerizing Spark instance groups and notebooks in the IBM Spectrum Conductor with Spark Knowledge Center. See Creating Spark instance groups and Adding Dockerized notebooks for more information.
Download an evaluation package of IBM Spectrum Conductor with Spark on our Service Management Connect page to use these Docker features today!
UID
ibm16163785



