Managing data for deployments
This topic describes various ways of adding and promoting data assets to a space. It also contains important information about data types used in batch jobs.
Adding data sources to a space (Watson Machine Learning)
Add data sources to a deployment space to use with batch deployment jobs. Data can be:
- A data file such as a .csv file
- A connection to data that resides in a repository such as a database.
- Connected data that resides in a storage bucket, such as a data file in a Cloud Object Storage bucket or on a Storage volume (NFS).
Notes:
- Depending on your configuration and the type of data connection, large data sets (typically more than 2 GB) can time out when you promote them to a space or catalog.
- Although you can promote any kind of data connection to a space, where you can use the connection is governed by factors such as model and deployment type. For example, you can access any of the connected data using a script, but in batch deployments you are limited to particular types of data, as listed in Batch deployment details by framework.
- If you promoted or added a Connection or Connected data asset that uses Cloud Pak for Data credentials, make sure that the option “Use your Cloud Pak for Data credentials to authenticate to the data source” is selected after you add it to a space. This ensures that Connected data that relies on those credentials continues to work properly.
Data added to a space is managed in a similar way to data added to a Watson Studio project. For example:
- Adding data to a space creates a new copy of the asset and its attachments within the space, maintaining a reference back to the project asset. If an asset such as a data connection requires access credentials, they persist and are the same whether you are accessing the data from a project or from a space.
- As with a data connection in a project, you can edit data connection details from the space.
- Data assets are stored in a space in the same way they are stored in a project, using the same file structure for the space as the structure used for the project.
You can add data to a space in one of these ways:
- Promote a data source, such as a file or connection, from an associated project
- Add a data file, connection, or connected data directly to a space
- Save a data asset to a space programmatically
For details on how Watson Studio connects to data, see Accessing data.
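As an illustration of the programmatic option, the following is a minimal sketch rather than a definitive implementation. It assumes that `client` is an authenticated `APIClient` object from the ibm-watson-machine-learning Python package, which this topic does not name, and that the file already exists locally. The helper name `save_file_to_space` is hypothetical.

```python
from pathlib import Path
from typing import Any


def save_file_to_space(client: Any, space_id: str, file_path: str, name: str = "") -> Any:
    """Upload a local file to a deployment space as a data asset.

    Assumption: `client` is an authenticated APIClient from the
    ibm-watson-machine-learning package. Only two of its calls are used:
    client.set.default_space() and client.data_assets.create().
    """
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")

    # Make the deployment space the target for subsequent client calls.
    client.set.default_space(space_id)

    # Create the data asset; the asset name defaults to the file name.
    return client.data_assets.create(name=name or path.name, file_path=str(path))
```

In a notebook, a call might look like `save_file_to_space(client, space_id, "input.csv")`, where `space_id` and the file name are placeholders for your own values.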
Promoting data sources from a project
To promote data from a project:
- Save a data source, data connection, or connected data to a project.
- In the project Assets page, from the action item for the data asset, choose Promote.
The promoted data asset is displayed in the space and is available for use as an input data source in a deployment job.
Adding data to a space
To add data directly to a space:
- From the Assets page of the deployment space, click Add to space.
- Choose the type of data asset to add:
- Data to specify a file to upload
- Connection to specify a connection to a data repository such as Db2
- Connected data to connect to data in a storage object such as a Cloud Object Storage bucket
- Complete the steps to add the data.
The data asset is displayed in the space and is available for use as an input data source in a deployment job.
Using data from a Cloud Object Storage connection
- Create a connection to IBM Cloud Object Storage by adding a Connection to your project or space and selecting Cloud Object Storage (infrastructure) as the connection type. Provide the secret key, access key and login URL.
- Add input and output files to the deployment space as connected data using the COS connection you created.
Using data from a Storage volume (NFS) connection
For details on using data from a networked file system, see Storage volume connection.
Data sources for batch jobs
Input data can be supplied to a batch job as:
- Inline data - In this method, the input data for batch processing is specified in the batch deployment job’s payload. For example, you can pass a CSV file as the deployment input in the UI or as a value for the scoring.input_data parameter in a notebook. When the batch deployment job completes, its output is written to the scoring.predictions parameter in the corresponding job’s metadata.
- Data reference - In this method, the input and output data for batch processing can be stored in a remote data source, such as a Cloud Object Storage bucket or an SQL/no-SQL database, or as a local or managed data asset in a deployment space. Details for data references include:
  - input_data_references.type and output_data_reference.type must be data_asset.
  - The references to input data must be specified as a /v2/assets href in the input_data_references.location.href parameter in the deployment job’s payload. The data asset specified here can be a reference to a local or connected data asset.
  - If the batch deployment job’s output data has to be persisted in a remote data source, the references to output data must be specified as a /v2/assets href in the output_data_reference.location.href parameter in the deployment job’s payload.
  - If the batch deployment job’s output data has to be persisted in a deployment space as a local asset, output_data_reference.location.name must be specified. After the batch deployment job completes successfully, the asset with the specified name is created in the space.
  - If the output data reference points to a data asset in a remote database, you can specify whether the batch output is appended to the table or whether the table is truncated first and then filled with the output data. Use the output_data_references.location.write_mode parameter to specify the value truncate or append. Note the following:
    - Specifying truncate as the value truncates the table and inserts the batch output data.
    - Specifying append as the value appends the batch output data to the remote database table.
    - write_mode is applicable only for the output_data_references parameter.
    - write_mode is applicable only for data assets in a remote database. This parameter is not applicable for a local data asset or a COS-based data asset.
  - Any input and output data asset references must be in the same space id as the batch deployment.
  - If the connected data asset references a Cloud Object Storage instance as its source, for example, a file in a Cloud Object Storage bucket, you must supply the HMAC credentials for the COS bucket. Include an Access Key and a Secret Key in your IBM Cloud Object Storage connection to enable access to the stored files.
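The two input methods can be sketched as plain Python dictionaries for the scoring section of a job payload. This is an illustrative sketch under stated assumptions, not the definitive payload format: the fields/values layout of inline data and the `?space_id=` query string on the /v2/assets href are assumptions, and the helper names are hypothetical. The topic spells the output parameter both as output_data_reference and output_data_references; the sketch uses the singular form throughout.

```python
from typing import Any, Dict, List, Optional

VALID_WRITE_MODES = {"truncate", "append"}


def inline_scoring(fields: List[str], values: List[List[Any]]) -> Dict[str, Any]:
    """Build a scoring section that carries the input rows inline.

    Assumption: inline data is a list of {"fields", "values"} objects.
    """
    return {"input_data": [{"fields": fields, "values": values}]}


def asset_href(asset_id: str, space_id: str) -> str:
    # Assumed href shape; the topic only requires a /v2/assets href.
    return f"/v2/assets/{asset_id}?space_id={space_id}"


def reference_scoring(
    input_asset_id: str,
    space_id: str,
    output_name: Optional[str] = None,
    output_asset_id: Optional[str] = None,
    write_mode: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a scoring section that points at data assets in one space.

    Pass output_name to create a local asset in the space, or
    output_asset_id to write to an existing connected (remote) asset.
    """
    if (output_name is None) == (output_asset_id is None):
        raise ValueError("Give exactly one of output_name or output_asset_id")
    if write_mode is not None:
        if write_mode not in VALID_WRITE_MODES:
            raise ValueError("write_mode must be 'truncate' or 'append'")
        if output_asset_id is None:
            # write_mode applies only to remote database output, not local assets.
            raise ValueError("write_mode requires a remote output asset")

    if output_asset_id is not None:
        location: Dict[str, Any] = {"href": asset_href(output_asset_id, space_id)}
    else:
        location = {"name": output_name}
    if write_mode is not None:
        location["write_mode"] = write_mode

    # A single space_id is used for input and output, so both references
    # stay in the same space as the batch deployment.
    return {
        "input_data_references": [
            {"type": "data_asset",
             "location": {"href": asset_href(input_asset_id, space_id)}}
        ],
        "output_data_reference": {"type": "data_asset", "location": location},
    }
```

For example, `reference_scoring("<input asset id>", "<space id>", output_name="results.csv")` describes a job that reads a data asset and saves its output as a new local asset in the space. The function cannot tell a remote database asset from a COS-based one, so choosing write_mode only for database tables remains the caller's responsibility.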