Accessing data from storage
When you use the Spark jobs API, you can store the application job files and your data files in storage volumes that you manage by using the IBM Cloud Pak for Data volume API. Alternatively, you can provision an instance of IBM Cloud Object Storage.
Your files can be located in the file storage system on the IBM Cloud Pak for Data cluster or in IBM Cloud Object Storage. See Storage considerations.
Working with files in external volumes
In Spark applications that run on Analytics Engine powered by Apache Spark, a common way to reference the Spark job files, input data, or output data is through external storage volumes that you manage by using the IBM Cloud Pak for Data volume API.
You can work with the following external volumes:
- External NFS storage volume
  - For the prerequisites for setting up an NFS volume, see Prerequisites for setting up an external NFS server.
  - To learn how to create a volume on an external NFS server, see Creating a volume on an external NFS server.
- Existing persistent volume claim
- New volume that you create
By using the volume API, you can create one or more volumes of the required sizes, upload your data and application files, and then pass the volume IDs as parameters in the Spark jobs API. For details, see Managing persistent volume instances with the Volumes REST API.
The following cURL code snippets show you how to create a file server, upload files, download them, and then stop the file server. In the snippets, 7a8e1ca7a6854e35b9c898e985075ed7/41bba1ac-d013-435e-b4df-d5ddc26a3259/1.json is the location in the volume and /nginx_data/1.json is the local file to be uploaded. Use %2F between the directory and file name in a path. Note that you should not use the instance home storage volume to upload jobs or data. For example, let's assume that you have created a volume named volume1 to upload data and job files.
Before you begin, you need to get the access token to the service instance. Enter the following cURL command, which returns a JSON response with the access token. Insert the URL to the IBM Cloud Pak for Data cluster, your user name, and your password.
curl -k -X GET https://<CloudPakforData_URL>/v1/preauth/validateAuth -H 'content-type: application/json' -H 'username: <YOUR_USERNAME>' -H 'password: <YOUR_PASSWORD>'
With the returned access token, you can issue the following cURL commands:
- To create a new volume with the name volume1:
curl -ik -X POST https://<CloudPakforData_URL>/zen-data/v3/service_instances -H "Authorization: Bearer <ACCESS_TOKEN>" -H 'Content-Type: application/json' -d '{"addon_type":"volumes","addon_version":"-","create_arguments":{"metadata":{"storageClass":"<storage_class>","storageSize":"<space-allocated-to-volume>"}},"namespace":"<project_name>","display_name":"<Volume_name>"}'
- To start the file server:
curl -k -i -X POST 'https://<CloudPakforData_URL>/zen-data/v1/volumes/volume_services/volume1' -H "Authorization: Bearer <ACCESS_TOKEN>" -d '{}' -H 'Content-Type: application/json' -H 'cache-control: no-cache'
- To upload a file:
curl -k -i -X PUT 'https://<CloudPakforData_URL>/zen-volumes/volume1/v1/volumes/files/7a8e1ca7a6854e35b9c898e985075ed7%2F41bba1ac-d013-435e-b4df-d5ddc26a3259%2F1.json' -H "Authorization: Bearer <ACCESS_TOKEN>" -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/nginx_data/1.json'
- To upload and extract a tar file:
curl -k -i -X PUT 'https://<CloudPakforData_URL>/zen-volumes/volume1/v1/volumes/files/<YOUR_DIRECTORY>%2F?extract=true' -H "Authorization: Bearer <ACCESS_TOKEN>" -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@</local/path/file.tar.gz>'
- To download a file:
curl -k -i -X GET 'https://<CloudPakforData_URL>/zen-volumes/volume1/v1/volumes/files/7a8e1ca7a6854e35b9c898e985075ed7%2F41bba1ac-d013-435e-b4df-d5ddc26a3259%2F1.json' -H "Authorization: Bearer <ACCESS_TOKEN>" -H 'cache-control: no-cache'
- To stop the file server:
curl -k -i -X DELETE 'https://<CloudPakforData_URL>/zen-data/v1/volumes/volume_services/volume1' -H "Authorization: Bearer <ACCESS_TOKEN>"
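If you prefer to script these calls, the following Python sketch chains the same REST requests by using the requests library. It is a minimal, unofficial sketch: the cluster URL, user name, password, volume name, and file paths are placeholders, verify=False mirrors the -k flag in the cURL commands, and the name of the access token field in the validateAuth response is assumed.
import requests
from urllib.parse import quote

CPD_URL = "https://<CloudPakforData_URL>"   # placeholder: your IBM Cloud Pak for Data URL
VOLUME = "volume1"                          # placeholder: the volume you created

# 1. Get the access token (mirrors the validateAuth cURL call above)
resp = requests.get(
    f"{CPD_URL}/v1/preauth/validateAuth",
    headers={"username": "<YOUR_USERNAME>", "password": "<YOUR_PASSWORD>"},
    verify=False,                           # equivalent of curl -k; use proper certificates in production
)
token = resp.json()["accessToken"]          # field name assumed; check the JSON returned by your cluster
auth = {"Authorization": f"Bearer {token}"}

# 2. Start the file server for the volume
requests.post(
    f"{CPD_URL}/zen-data/v1/volumes/volume_services/{VOLUME}",
    headers={**auth, "Content-Type": "application/json"},
    json={},
    verify=False,
)

# 3. Upload a local file; '/' in the target path must be encoded as %2F
target = quote("7a8e1ca7a6854e35b9c898e985075ed7/41bba1ac-d013-435e-b4df-d5ddc26a3259/1.json", safe="")
with open("/nginx_data/1.json", "rb") as f:
    requests.put(
        f"{CPD_URL}/zen-volumes/{VOLUME}/v1/volumes/files/{target}",
        headers=auth,
        files={"upFile": f},                # multipart/form-data upload, like curl -F 'upFile=@...'
        verify=False,
    )

# 4. Stop the file server when you are done
requests.delete(
    f"{CPD_URL}/zen-data/v1/volumes/volume_services/{VOLUME}",
    headers=auth,
    verify=False,
)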
Working with files in multiple storage volumes
You can use multiple storage volumes when creating the Spark job payload.
The following example shows a Spark application that is uploaded under the customApps directory inside the vol1 volume, which is mounted as /myapp on the Spark cluster. The user data is in the vol2 volume, which is mounted as /data on the Spark cluster.
{
"engine": {
"type": "spark",
"conf": {
"spark.executor.extraClassPath":"/myapp/*",
"spark.driver.extraClassPath":"/myapp/*"
},
"volumes": [{ "volume_name": "vol1", "source_path": "customApps", "mount_path": "/myapp" },{ "volume_name": "vol2", "source_path": "", "mount_path": "/data" }]
},
"application_arguments": ["12"],
"application_jar": "/myapp/spark-examples_2.11-2.4.3.jar",
"main_class": "org.apache.spark.examples.SparkPi"
}
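To use a payload like this one, you submit it to the Spark jobs API of your service instance. The following Python sketch shows one possible way to post it with the requests library. It is a hedged illustration, not the documented client: <SPARK_JOBS_API_ENDPOINT> stands in for the jobs endpoint of your Analytics Engine powered by Apache Spark instance, and <ACCESS_TOKEN> is the token obtained as shown earlier.
import requests

# Placeholder values: substitute the jobs endpoint of your instance and a valid access token.
JOBS_ENDPOINT = "https://<CloudPakforData_URL>/<SPARK_JOBS_API_ENDPOINT>"
ACCESS_TOKEN = "<ACCESS_TOKEN>"

# The same payload as above: the application is read from vol1 (mounted at /myapp)
# and the data from vol2 (mounted at /data).
payload = {
    "engine": {
        "type": "spark",
        "conf": {
            "spark.executor.extraClassPath": "/myapp/*",
            "spark.driver.extraClassPath": "/myapp/*",
        },
        "volumes": [
            {"volume_name": "vol1", "source_path": "customApps", "mount_path": "/myapp"},
            {"volume_name": "vol2", "source_path": "", "mount_path": "/data"},
        ],
    },
    "application_arguments": ["12"],
    "application_jar": "/myapp/spark-examples_2.11-2.4.3.jar",
    "main_class": "org.apache.spark.examples.SparkPi",
}

response = requests.post(
    JOBS_ENDPOINT,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}", "Content-Type": "application/json"},
    json=payload,
    verify=False,  # equivalent of curl -k; use proper certificates in production
)
print(response.status_code, response.text)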
Working with files in Object Storage
You can store the job files and your data in an S3-compatible Object Storage bucket. The following steps describe how to do this for an IBM Cloud Object Storage bucket.
- Create your application, for example a Python program file cosExample.py:
from __future__ import print_function

import sys
import calendar
import time

from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 5:
        print("Usage: cosExample <access-key> <secret-key> <endpoint> <bucket>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession.builder.appName("CosExample").getOrCreate()

    prefix = "fs.cos.llservice"
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set(prefix + ".endpoint", sys.argv[3])
    hconf.set(prefix + ".access.key", sys.argv[1])
    hconf.set(prefix + ".secret.key", sys.argv[2])

    data = [1, 2, 3, 4, 5, 6]
    distData = spark.sparkContext.parallelize(data)
    distData.count()

    path = "cos://{}.llservice/{}".format(sys.argv[4], calendar.timegm(time.gmtime()))
    distData.saveAsTextFile(path)

    rdd = spark.sparkContext.textFile(path)
    print("output rdd count: {}".format(rdd.count()))

    spark.stop()
- Load the job file. To load the job file from an external volume, upload cosExample.py under the customApps directory in the storage volume vol1, which is mounted as /myapp in the Spark cluster:
{
  "engine": {
    "type": "spark",
    "volumes": [{ "volume_name": "vol1", "source_path": "customApps", "mount_path": "/myapp" }]
  },
  "application_arguments": ["<ACCESS_KEY>", "<COS_SECRET_KEY>", "<COS_ENDPOINT>", "<BUCKET_NAME>"],
  "application_jar": "/myapp/cosExample.py",
  "main_class": "org.apache.spark.deploy.SparkSubmit"
}
- Alternatively, to load the job file from an IBM Cloud Object Storage bucket, reference the job file <OBJECT_NAME> in the bucket <BUCKET_NAME> of the IBM Cloud Object Storage service (<COS_SERVICE_NAME>). You can upload the file to the bucket with any S3-compatible client; see the sketch after these steps. Use a payload like the following:
{
  "engine": {
    "type": "spark",
    "template_id": "<template-id>",
    "conf": {
      "spark.app.name": "MyJob",
      "spark.hadoop.fs.cos.<REPLACE_WITH_COS_SERVICE_NAME>.endpoint": "<COS_ENDPOINT>",
      "spark.hadoop.fs.cos.<REPLACE_WITH_COS_SERVICE_NAME>.secret.key": "<COS_SECRET_KEY>",
      "spark.hadoop.fs.cos.<REPLACE_WITH_COS_SERVICE_NAME>.access.key": "<COS_ACCESS_KEY>"
    },
    "size": {
      "num_workers": 1,
      "worker_size": { "cpu": 1, "memory": "1g" },
      "driver_size": { "cpu": 1, "memory": "1g" }
    }
  },
  "application_arguments": ["cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<OBJECT_NAME>"],
  "application_jar": "cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<REPLACE_WITH_OBJECT_NAME>",
  "main_class": "org.apache.spark.deploy.SparkSubmit"
}
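The payload above assumes that the application file is already stored as <OBJECT_NAME> in the bucket <BUCKET_NAME>. As one possible way to get it there, the following Python sketch uploads cosExample.py with boto3, a generic S3-compatible client that is not part of Analytics Engine powered by Apache Spark; the endpoint, HMAC credentials, bucket name, and object name are placeholders.
import boto3

# Placeholder values: substitute your own IBM Cloud Object Storage endpoint,
# HMAC credentials, bucket name, and object name.
cos = boto3.client(
    "s3",
    endpoint_url="https://<COS_ENDPOINT>",
    aws_access_key_id="<COS_ACCESS_KEY>",
    aws_secret_access_key="<COS_SECRET_KEY>",
)

# Upload the application file so that the Spark job can reference it as
# cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<OBJECT_NAME>
cos.upload_file("cosExample.py", "<BUCKET_NAME>", "<OBJECT_NAME>")

# Optional check: list the objects in the bucket to confirm the upload
for obj in cos.list_objects_v2(Bucket="<BUCKET_NAME>").get("Contents", []):
    print(obj["Key"])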