Copy data using a data connection

This topic describes how to add data to Cloud Pak for Data for use with Watson Machine Learning Accelerator.

Create a data connection

You can use data that resides in Cloud Object Storage by creating a data connection to it.
  1. Log in to Cloud Pak for Data and navigate to your project area.
  2. In your project, click Add to project and choose one of the following asset types:
    • Data
    • Connection
    • Connected data
    To learn more about adding these asset types, see Adding data to a project.
    Note: Use the Data option to upload files directly to your project area. Use the Connection or Connected data options to use your Cloud Object Storage data.
  3. Click Connection to create a new connection.
  3. Select Cloud Object Storage and click Select. Enter your credentials and connection information; see the sketch after these steps for one way to check them. For more information, see Connecting to data sources at the platform level.
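
The following is a minimal sketch, assuming the IBM COS SDK for Python (ibm_boto3) is installed, of one way to check your Cloud Object Storage credentials before creating the connection. The endpoint URL, API key, and resource instance ID are placeholders, not values supplied by this documentation.

# Sketch: confirm Cloud Object Storage credentials before creating the connection.
# The endpoint, API key, and service instance ID are placeholders.
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="<api-key>",
    ibm_service_instance_id="<resource-instance-id>",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
)

# Listing buckets verifies that the credentials and endpoint are valid.
print([bucket["Name"] for bucket in cos.list_buckets()["Buckets"]])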

Using a data connection through the REST API

The POST request for the REST API /platform/rest/deeplearning/v1/execs accepts a data parameter as a string with the format described below:
{
    'args': 'Arguments to the task. It has the same format as the `args` in the request parameters, except that the `--cs-datastore-meta` options can be overridden by the `dataSource` configuration below.',
    'dataSource': [
    {
        'type': 'Type of the data source. It can be `fs`, `connection`, or `data_asset`.',
        'asset': {
            'asset_id': 'CP4D asset ID for a `connection` or `data_asset` asset.',
            'project_id': 'CP4D project ID where the asset is located.',
            'catalog_id': 'CP4D catalog ID where the asset is located.'
        },
        'location': {
            'paths': 'String, required for a cloud object storage `connection` data source and optional for an `fs` data source. It defines the paths of the data; multiple file paths separated by `,` are allowed for `connection`, and only one directory is allowed for `fs`.',
            'volume': 'String, optional. PVC name where the data is located, which is only valid for an `fs` data source.',
            'bucket': 'String, optional for a cloud object storage `connection` when bucket information is defined in the connection.'
        }
    }]
}
Example REST call with data parameter:
curl -k -H "Authorization: $TOKEN" -H "Accept: application/json" -F data='{"args": "--exec-start tensorflow --numWorker 1 --model-main train.py", "dataSource": [{"type": "connection", "asset": {"asset_id": "ea76a8f1-eab8-4a00-8bf3-ce31ade3fdc4", "project_id": "e008ff36-c41d-4f57-968f-639f9b5bb229"}, "location": {"paths": "data_batch_1,data_batch_5,data_batch_2,batches.meta,data_batch_3,test_batch,data_batch_4", "bucket": "data-bucket"}}]}' -F file=@tmpbdk2whso.modelDir.tar "https://wmla-console-abc.ibm.com/platform/rest/deeplearning/v1/execs"
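
For reference, the following is a minimal sketch of the same request using the Python requests library; the token, asset IDs, bucket, and model archive name are the placeholders from the curl example above, not real values.

# Sketch: submit the same training job with the Python `requests` library.
# The token, IDs, bucket, and archive name are placeholders from the example above.
import json
import requests

TOKEN = "<authentication token>"
URL = "https://wmla-console-abc.ibm.com/platform/rest/deeplearning/v1/execs"

payload = {
    "args": "--exec-start tensorflow --numWorker 1 --model-main train.py",
    "dataSource": [{
        "type": "connection",
        "asset": {
            "asset_id": "ea76a8f1-eab8-4a00-8bf3-ce31ade3fdc4",
            "project_id": "e008ff36-c41d-4f57-968f-639f9b5bb229"
        },
        "location": {
            "paths": "data_batch_1,data_batch_5,data_batch_2,batches.meta,data_batch_3,test_batch,data_batch_4",
            "bucket": "data-bucket"
        }
    }]
}

with open("tmpbdk2whso.modelDir.tar", "rb") as model_archive:
    response = requests.post(
        URL,
        headers={"Authorization": TOKEN, "Accept": "application/json"},
        data={"data": json.dumps(payload)},   # form field `data`, as in -F data=...
        files={"file": model_archive},        # form field `file`, as in -F file=@...
        verify=False,                         # equivalent of curl -k
    )
print(response.status_code, response.text)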

Using a data connection through the command line interface (CLI)

Use the Watson Machine Learning Accelerator CLI (dlicmd.py). You can download the CLI from the Watson Machine Learning Accelerator console; see the dlicmd.py reference.

When submitting a training job, you can specify your connection data source by using the --cs-datastore-meta or --data-source option. If both options are specified, --data-source takes precedence.
python3 dlicmd.py --help
python3 dlicmd.py --exec-start <framework-name> <connection-options> <datastore-meta> <submit-arguments>

…
Datastore-meta:
   --cs-datastore-meta   Optional. Comma-separated string of name-value pairs. Acceptable names and values are:
                            type: 'fs'
                            data_path: Only needed with the --exec-start option.
                            For the 'fs' type, this is the relative path to the data file system (DLI_DATA_FS)
   --data-source         Optional. JSON string that describes a list of data sources. Refer to the training API documentation for the format of a data source.
…

Example

python3 dlicmd.py --exec-start tensorflow --numWorker 1 --model-main train.py --model-dir /model/tmpbdk2whso --data-source '[{"type": "connection", "asset": {"asset_id": "ea76a8f1-eab8-4a00-8bf3-ce31ade3fdc4", "project_id": "e008ff36-c41d-4f57-968f-639f9b5bb229"}, "location": {"paths": "data_batch_1,data_batch_5,data_batch_2,batches.meta,data_batch_3,test_batch,data_batch_4", "bucket": "data-bucket"}}]' --rest-host wmla-console-abc.ibm.com
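
If you build the --data-source value programmatically, a minimal sketch such as the following (assuming Python 3 and the same placeholder IDs as in the example above) can help avoid shell-quoting mistakes:

# Sketch: build a shell-safe --data-source argument for dlicmd.py.
# The asset ID, project ID, bucket, and paths are the placeholders from the example above.
import json
import shlex

data_source = [{
    "type": "connection",
    "asset": {
        "asset_id": "ea76a8f1-eab8-4a00-8bf3-ce31ade3fdc4",
        "project_id": "e008ff36-c41d-4f57-968f-639f9b5bb229"
    },
    "location": {
        "paths": "data_batch_1,data_batch_5,data_batch_2,batches.meta,data_batch_3,test_batch,data_batch_4",
        "bucket": "data-bucket"
    }
}]

# Print the option with the JSON value quoted for pasting into a shell command.
print("--data-source", shlex.quote(json.dumps(data_source)))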