Accessing data from storage

When you use the Spark jobs API, you can store the application job files and your data files in storage volumes that you manage by using the IBM Cloud Pak for Data volume API. Alternatively, you can provision an instance of IBM Cloud Object Storage.

Your files can be located in the file storage system on the IBM Cloud Pak for Data cluster or in IBM Cloud Object Storage. See Storage considerations.

Working with files in external volumes

In Spark applications that run on Analytics Engine powered by Apache Spark, a common way to reference the Spark job files, input data, or output data is through external storage volumes that you manage by using the IBM Cloud Pak for Data volume API.

You can work with the following external volumes:

Working with files in multiple storage volumes

You can use multiple storage volumes when creating the Spark job payload.

The following example shows a Spark application that is uploaded under the customApps directory inside the vol1 volume, which is mounted as /myapp on the Spark cluster. The user data is in the vol2 volume, which is mounted as /data on the Spark cluster.

{
    "engine": {
        "type": "spark",
        "conf": {
            "spark.executor.extraClassPath": "/myapp/*",
            "spark.driver.extraClassPath": "/myapp/*"
        },
        "volumes": [
            { "volume_name": "vol1", "source_path": "customApps", "mount_path": "/myapp" },
            { "volume_name": "vol2", "source_path": "", "mount_path": "/data" }
        ]
    },
    "application_arguments": ["12"],
    "application_jar": "/myapp/spark-examples_2.11-2.4.3.jar",
    "main_class": "org.apache.spark.examples.SparkPi"
}
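
If you generate such payloads from a script, it can help to check that the application JAR is reachable through one of the mounted volumes before you submit the job. The following Python sketch builds a payload with the same shape as the example above. The function name build_payload and its validation rules are hypothetical and are not part of the Spark jobs API. The sketch only assembles and prints the JSON and does not call any endpoint.

```python
import json


def build_payload(volumes, application_jar, main_class, application_arguments):
    """Assemble a job payload shaped like the example above (hypothetical helper)."""
    # Normalize mount paths so that "/myapp/" and "/myapp" compare equal.
    mount_paths = [v["mount_path"].rstrip("/") for v in volumes]

    # Assumption: two volumes should not share a mount path.
    if len(set(mount_paths)) != len(mount_paths):
        raise ValueError("each volume needs a distinct mount_path")

    # The JAR path must sit under one of the mount paths, otherwise the
    # file would not be visible on the Spark cluster.
    app_mounts = [m for m in mount_paths if application_jar.startswith(m + "/")]
    if not app_mounts:
        raise ValueError("application_jar is not under any mount_path")

    # Put every file in the application's mount on the class path,
    # as the example above does with "/myapp/*".
    classpath = app_mounts[0] + "/*"
    return {
        "engine": {
            "type": "spark",
            "conf": {
                "spark.executor.extraClassPath": classpath,
                "spark.driver.extraClassPath": classpath,
            },
            "volumes": volumes,
        },
        "application_arguments": application_arguments,
        "application_jar": application_jar,
        "main_class": main_class,
    }


payload = build_payload(
    volumes=[
        {"volume_name": "vol1", "source_path": "customApps", "mount_path": "/myapp"},
        {"volume_name": "vol2", "source_path": "", "mount_path": "/data"},
    ],
    application_jar="/myapp/spark-examples_2.11-2.4.3.jar",
    main_class="org.apache.spark.examples.SparkPi",
    application_arguments=["12"],
)
print(json.dumps(payload, indent=4))
```

Deriving the class path from the mount path that contains the JAR keeps the two values in sync when you rename a mount point.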

Working with files in Object Storage

You can store the job files and your data in an S3-compatible Object Storage bucket. The following steps describe how to do this for an IBM Cloud Object Storage bucket.