Managing Data Refinery flows (Data Refinery)

A Data Refinery flow is an ordered set of steps to cleanse, shape, and enhance data. As you refine your data by applying operations to a data set, you dynamically build a customized Data Refinery flow that you can modify in real time and save for future use.

These are actions that you can do while you refine your data:

Working with the Data Refinery flow

Save a Data Refinery flow
Run or schedule a job for a Data Refinery flow
Rename a Data Refinery flow

Steps

Undo or redo a step
Edit or delete a step
View the Data Refinery flow steps in a "snapshot view"

Working with the data sets

Change the source of a Data Refinery flow
Change the target of a Data Refinery flow

Actions on the project page

Reopen a Data Refinery flow to continue working
Duplicate a Data Refinery flow
Delete a Data Refinery flow
Promote a Data Refinery flow to a space

Working with the Data Refinery flow

Save a Data Refinery flow

Save a Data Refinery flow by clicking the Save Data Refinery flow icon in the Data Refinery toolbar. Data Refinery flows are saved to the project that you're working in. Save a Data Refinery flow so that you can continue refining a data set later.

The default output of the Data Refinery flow is saved as a data asset source-file-name_shaped.csv. For example, if the source file is mydata.csv, the default name and output for the Data Refinery flow is mydata_csv_shaped. You can edit the name and add an extension by changing the target of a Data Refinery flow.

Run or schedule a job for a Data Refinery flow

Data Refinery supports large data sets, which can be time-consuming and unwieldy to refine. So that you can work quickly and efficiently, Data Refinery operates on a sample subset of rows in the data set. The sample size is 1 MB or 10,000 rows, whichever comes first. When you run a job for the Data Refinery flow, the entire data set is processed. When you run the job, you select the runtime and you can add a one-time or repeating schedule.
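
The "whichever comes first" sampling rule can be pictured with a short sketch. This is an illustration only, not how Data Refinery is implemented: the `sample_rows` function is hypothetical, "1 MB" is assumed to mean 1,000,000 bytes of UTF-8 encoded row text, and the sketch assumes sampling stops before the row that would exceed the size limit.

```python
from typing import Iterable, List

MAX_ROWS = 10_000       # row limit stated in the text
MAX_BYTES = 1_000_000   # assumption: "1 MB" taken as 1,000,000 bytes


def sample_rows(rows: Iterable[str],
                max_rows: int = MAX_ROWS,
                max_bytes: int = MAX_BYTES) -> List[str]:
    """Collect rows until either limit is reached, whichever comes first."""
    sample: List[str] = []
    used = 0
    for row in rows:
        size = len(row.encode("utf-8"))
        # Stop as soon as adding this row would break either limit.
        if len(sample) >= max_rows or used + size > max_bytes:
            break
        sample.append(row)
        used += size
    return sample


# Narrow rows hit the row limit first; wide rows hit the size limit first.
narrow = sample_rows(f"{i},value" for i in range(50_000))
wide = sample_rows("x" * 1000 for _ in range(5_000))
print(len(narrow))  # 10000
print(len(wide))    # 1000
```

A job run has no such cutoff, so an operation that looks correct on the sample can still meet unexpected values in rows outside it.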

In Data Refinery, click the Jobs icon in the Data Refinery toolbar, and then select Save and create a job or Save and view jobs.

After you save a Data Refinery flow, you can also create a job for it from the project page. Go to the Assets tab, select the Data Refinery flow, and choose Create job from the overflow menu.

With the Admin or Editor role, you can view the job details and edit or run the job. With the Viewer role for the project, you can only view the job details.

For more information about jobs, see Creating jobs in Data Refinery.

Rename a Data Refinery flow

In Data Refinery, open the information pane and click the Details tab.
Click the Edit icon next to the Data Refinery flow name, and enter the new name.
Click Save.

Steps

Undo or redo a step

Click the undo icon or the redo icon on the toolbar.

Edit or delete a step

To edit a step:

In the Steps pane, click the overflow menu on the step for the operation that you want to edit. Data Refinery goes into edit mode and displays the operation to be edited either on the command line or in the Operation pane.
Edit the operation or select a different operation to take its place.
Apply the edited operation. Data Refinery updates the relevant step to reflect the change and reruns all the operations that follow the edited one.
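
One way to picture the rerun behavior is to treat the flow as an ordered list of operations applied to the source rows. The sketch below is a conceptual model only: `Flow`, `edit_step`, and the sample operations are hypothetical names, not part of any Data Refinery API. It keeps the result of each step so that an edit recomputes only the edited step and the steps after it.

```python
from typing import Callable, List

Rows = List[dict]
Operation = Callable[[Rows], Rows]


class Flow:
    """Hypothetical model of a flow: a source plus an ordered list of operations."""

    def __init__(self, source: Rows, operations: List[Operation]) -> None:
        self.operations = list(operations)
        # results[0] is the source; results[i] is the data after step i.
        self.results: List[Rows] = [source]
        self._rerun_from(0)

    def _rerun_from(self, index: int) -> None:
        """Recompute the operation at `index` and every operation after it."""
        del self.results[index + 1:]
        for op in self.operations[index:]:
            self.results.append(op(self.results[-1]))

    def edit_step(self, index: int, new_operation: Operation) -> None:
        """Replace one operation, then rerun it and all later operations."""
        self.operations[index] = new_operation
        self._rerun_from(index)

    @property
    def output(self) -> Rows:
        return self.results[-1]


def strip_names(rows: Rows) -> Rows:
    return [{**r, "name": r["name"].strip()} for r in rows]


def adults_only(rows: Rows) -> Rows:
    return [r for r in rows if r["age"] >= 18]


def sixteen_and_up(rows: Rows) -> Rows:
    return [r for r in rows if r["age"] >= 16]


def upper_names(rows: Rows) -> Rows:
    return [{**r, "name": r["name"].upper()} for r in rows]


source = [{"name": " ada ", "age": 36}, {"name": "Bob", "age": 17}]
flow = Flow(source, [strip_names, adults_only, upper_names])
print(flow.output)  # [{'name': 'ADA', 'age': 36}]

# Edit the second step; the third step reruns on the new intermediate data.
flow.edit_step(1, sixteen_and_up)
print(flow.output)  # [{'name': 'ADA', 'age': 36}, {'name': 'BOB', 'age': 17}]
```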

View the Data Refinery flow steps in a "snapshot view"

To see what your data looked like at any point in time, click a previous step to put Data Refinery into snapshot view. For example, if you click Data source, you see what your data looked like before you started refining it. Click any operation step to see what your data looked like after that operation was applied. To leave snapshot view, click Viewing step x of y or click the same step that you selected to get into snapshot view.

Use the snapshot view to insert an operation between two steps:

Click the step before the position where you want to insert the new operation. Data Refinery shows you a snapshot view of the data set after that operation was applied.
Select and apply the new operation. Data Refinery inserts a new step between the existing steps, and it reruns all the operations that follow the new step.
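
Snapshot view and step insertion can be sketched with plain functions: the data at step k is the source with the first k operations applied, and an operation inserted after step k changes the input of every later step. The helper names `snapshot` and `insert_after` and the sample operations are hypothetical; they illustrate the idea and are not a Data Refinery interface.

```python
from typing import Callable, List

Rows = List[dict]
Operation = Callable[[Rows], Rows]


def snapshot(source: Rows, operations: List[Operation], step: int) -> Rows:
    """Return the data as it looked after `step` operations (0 = data source)."""
    data = source
    for op in operations[:step]:
        data = op(data)
    return data


def insert_after(operations: List[Operation], step: int,
                 new_operation: Operation) -> List[Operation]:
    """Return a new operation list with `new_operation` placed after `step`."""
    return operations[:step] + [new_operation] + operations[step:]


def drop_missing(rows: Rows) -> Rows:
    return [r for r in rows if r["price"] is not None]


def add_tax(rows: Rows) -> Rows:
    return [{**r, "price": round(r["price"] * 1.1, 2)} for r in rows]


def only_cheap(rows: Rows) -> Rows:
    return [r for r in rows if r["price"] < 10]


source = [{"item": "pen", "price": 9.5}, {"item": "ink", "price": None}]
steps = [drop_missing, only_cheap]

print(snapshot(source, steps, 0))  # the untouched data source
print(snapshot(source, steps, 1))  # [{'item': 'pen', 'price': 9.5}]

# Insert add_tax between step 1 and step 2; only_cheap now sees taxed prices.
steps = insert_after(steps, 1, add_tax)
print(snapshot(source, steps, len(steps)))  # [] because the taxed price is 10.45
```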

Working with the data sets

Change the source of a Data Refinery flow

You can run the same Data Refinery flow with a different source data set. In the Steps pane in Data Refinery, click the overflow menu next to Data source, select Edit, and choose a different source data set.

For best results, the new data set should have a schema that is compatible with the original data set (for example, the same column names, number of columns, and data types). If the new data set has a different schema, operations that don't work with that schema show errors. You can edit or delete those operations, or change the source to one that has a more compatible schema.
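
Checking schema compatibility before you switch sources can show which operations are likely to fail. The sketch below compares column names, column count, and data types using plain dictionaries. The `schema_differences` helper and the schema representation (column name mapped to a type name) are assumptions made for illustration; Data Refinery does not expose such a function.

```python
from typing import Dict, List

Schema = Dict[str, str]  # assumption: column name -> type name


def schema_differences(original: Schema, candidate: Schema) -> List[str]:
    """List the ways `candidate` differs from `original`; empty means compatible."""
    problems: List[str] = []
    if len(original) != len(candidate):
        problems.append(
            f"column count differs: {len(original)} vs {len(candidate)}")
    for name, dtype in original.items():
        if name not in candidate:
            problems.append(f"missing column: {name}")
        elif candidate[name] != dtype:
            problems.append(
                f"type mismatch for {name}: {dtype} vs {candidate[name]}")
    for name in candidate:
        if name not in original:
            problems.append(f"unexpected column: {name}")
    return problems


original = {"id": "integer", "name": "string", "signup": "date"}
candidate = {"id": "string", "name": "string", "country": "string"}

for problem in schema_differences(original, candidate):
    print(problem)
```

In this example the column counts match, yet three differences are reported: a type mismatch for `id`, a missing `signup` column, and an unexpected `country` column. Any operation that references `signup` or treats `id` as a number would be a candidate for editing or deletion.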

Change the target of a Data Refinery flow

In Data Refinery, open the Info pane and click the Details tab.
Click the Edit button.
In the DATA REFINERY FLOW OUTPUT pane, click the Edit icon to change any of the following properties:
- Target location. (The target data set must be a different data set than the source data set.)
- Data set name and description
- Relational database targets only: Choose whether to overwrite the data in the existing data set. (If the target data set is not in a relational database, the target data is always overwritten.)
- File format
- Column header information
- Encoding (UTF-8 or SJIS)
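
The Encoding property matters to whatever reads the output file afterward. As a rough illustration outside Data Refinery, the sketch below serializes the same rows as CSV in both encodings with Python's standard library. The `encode_csv` helper is hypothetical, and it is an assumption that Python's `shift_jis` codec corresponds to the SJIS option listed above.

```python
import csv
import io
from typing import List


def encode_csv(rows: List[List[str]], encoding: str = "utf-8") -> bytes:
    """Serialize rows as CSV text and encode it with the chosen encoding."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode(encoding)


rows = [["id", "city"], ["1", "東京"]]
utf8_bytes = encode_csv(rows, "utf-8")
sjis_bytes = encode_csv(rows, "shift_jis")

# Same text, different bytes: a reader must decode with the matching encoding.
print(utf8_bytes == sjis_bytes)                                      # False
print(sjis_bytes.decode("shift_jis") == utf8_bytes.decode("utf-8"))  # True
```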

Actions on the project page

Reopen a Data Refinery flow to continue working

To reopen a Data Refinery flow and continue refining your data, go to the project’s Assets tab. Click the Data Refinery flow name.

Duplicate a Data Refinery flow

To create a copy of a Data Refinery flow, go to the project. Click the Assets tab. Select the Data Refinery flow, and then select Duplicate from the overflow menu. The Data Refinery flow is added to the Data Refinery flows list as "original-name copy 1".

Delete a Data Refinery flow

To delete a Data Refinery flow, go to the project. Click the Assets tab. Select the Data Refinery flow, and then select Delete from the overflow menu.

Promote a Data Refinery flow to a space

Deployment spaces are used to manage a set of related assets in a separate environment from your projects. You use a space to prepare data for a deployment job for Watson Machine Learning. You can promote Data Refinery flows from multiple projects to a single space. Complete the steps in the Data Refinery flow before you promote it because the Data Refinery flow is not editable in a space.

To promote a Data Refinery flow to a space, go to the project's Assets tab, click the overflow menu for the Data Refinery flow, and then select Promote. The source file for the Data Refinery flow and any other dependent data are promoted as well.

To create or run a job for the Data Refinery flow in a space, go to the space’s Assets tab, scroll down to the Data Refinery flow, and select Create job from the overflow menu. If you've already created the job, go to the Jobs tab to edit the job or view the job run details. With the Admin or Editor role, you can view the job details and edit or run the job. With the Viewer role for the project, you can only view the job details. The shaped output of the Data Refinery flow job is available on the space’s Assets tab, and you can use it as input data for a job in Watson Machine Learning.

Restriction: When you promote a Data Refinery flow from a project to a space and the target of the Data Refinery flow is a connected data asset, you must manually promote the connected data asset as well. This action ensures that the connected data asset's data is updated when you run the Data Refinery flow job in the space. Otherwise, a successful run of the Data Refinery flow job creates a new data asset in the space.

For information about spaces, see Deployment spaces.
