Managing Data Refinery flows (Data Refinery)
A Data Refinery flow is an ordered set of steps to cleanse, shape, and enhance data. As you refine your data by applying operations to a data set, you dynamically build a customized Data Refinery flow that you can modify in real time and save for future use.
These are actions that you can do while you refine your data:
Working with the Data Refinery flow
Steps
Working with the data sets
Actions on the project page
- Reopen a Data Refinery flow to continue working
- Duplicate a Data Refinery flow
- Delete a Data Refinery flow
- Promote a Data Refinery flow to a space
Working with the Data Refinery flow
Save a Data Refinery flow
Save a Data Refinery flow by clicking the Save Data Refinery flow icon in the Data Refinery toolbar. Data Refinery flows are saved to the project that you're working in. Save a Data Refinery flow so that you can continue refining a data set later.
The default output of the Data Refinery flow is saved as a data asset source-file-name_shaped.csv. For example, if the source file is mydata.csv, the default name and output for the Data Refinery flow is mydata_csv_shaped.
You can edit the name and add an extension by changing the target of a Data Refinery flow.
Run or schedule a job for a Data Refinery flow
Data Refinery supports large data sets, which can be time-consuming and unwieldy to refine. So that you can work quickly and efficiently, Data Refinery operates on a sample subset of rows in the data set. The sample size is 1 MB or 10,000 rows, whichever comes first. When you run a job for the Data Refinery flow, the entire data set is processed. When you run the job, you select the runtime and you can add a one-time or repeating schedule.
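The "whichever comes first" sampling rule can be pictured with a short Python sketch. It is only an illustration, not Data Refinery's implementation: the helper name `take_sample`, the use of UTF-8 encoded row length as the size measure, and the reading of 1 MB as 1,000,000 bytes are all assumptions.

```python
from typing import Iterable, List

MAX_ROWS = 10_000       # row limit stated in the text
MAX_BYTES = 1_000_000   # assumption: 1 MB interpreted as 1,000,000 bytes


def take_sample(rows: Iterable[str],
                max_rows: int = MAX_ROWS,
                max_bytes: int = MAX_BYTES) -> List[str]:
    """Collect rows until either limit would be exceeded (hypothetical helper)."""
    sample: List[str] = []
    used = 0
    for row in rows:
        size = len(row.encode("utf-8"))
        # Stop as soon as the row limit is reached or the size limit would be passed.
        if len(sample) >= max_rows or used + size > max_bytes:
            break
        sample.append(row)
        used += size
    return sample


# Short rows reach the row limit first; long rows reach the size limit first.
short_rows = ["a,b,c"] * 50_000
long_rows = ["x" * 1_000] * 50_000
print(len(take_sample(short_rows)))  # 10000
print(len(take_sample(long_rows)))   # 1000
```

A job run does not apply any such limit, because the job processes the entire data set.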
In Data Refinery, click the Jobs icon on the Data Refinery toolbar, and then select Save and create a job or Save and view jobs.
After you save a Data Refinery flow, you can also create a job for it from the Project page. Go to the Assets tab, select the Data Refinery flow, and choose Create job from the overflow menu.
You must have the Admin or Editor role to view the job details or to edit or run the job. With the Viewer role for the project, you can view only the job details.
For more information about jobs, see Creating jobs in Data Refinery.
Rename a Data Refinery flow
- In Data Refinery, open the information pane and click the Details tab.
- Click the Edit icon next to the Data Refinery flow name.
- Click Save.
Steps
Undo or redo a step
Click the undo icon or the redo icon on the toolbar.
Edit or delete a step
To edit a step:
- In the Steps pane, click the overflow menu on the step for the operation that you want to edit. Data Refinery goes into edit mode and displays the operation to be edited either on the command line or in the Operation pane.
- Edit the operation or select a different operation to take its place.
- Apply the edited operation. Data Refinery updates the relevant step to reflect the change and reruns all the operations that follow the edited one.
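To see why editing a step causes the later operations to rerun, it can help to model a flow as an ordered list of functions, where each step's result depends on everything before it. The following Python sketch is a toy model under that assumption. The `Flow` class and the sample operations are hypothetical and are not part of Data Refinery.

```python
from typing import Callable, List

Rows = List[dict]
Operation = Callable[[Rows], Rows]


class Flow:
    """Toy model of a flow: a data source plus an ordered list of operations."""

    def __init__(self, source: Rows) -> None:
        self.steps: List[Operation] = []
        self.snapshots: List[Rows] = [source]  # snapshots[0] is the data source

    def apply(self, op: Operation) -> None:
        """Append a step and record the data as it looks after that step."""
        self.steps.append(op)
        self.snapshots.append(op(self.snapshots[-1]))

    def edit_step(self, index: int, new_op: Operation) -> None:
        """Replace step `index` (0-based), then rerun it and every later step."""
        self.steps[index] = new_op
        del self.snapshots[index + 1:]  # results from this step onward are stale
        for op in self.steps[index:]:
            self.snapshots.append(op(self.snapshots[-1]))


def keep_positive(rows: Rows) -> Rows:
    return [r for r in rows if r["amount"] > 0]


def keep_all(rows: Rows) -> Rows:
    return list(rows)


def double_amount(rows: Rows) -> Rows:
    return [{**r, "amount": r["amount"] * 2} for r in rows]


flow = Flow([{"amount": -1}, {"amount": 2}])
flow.apply(keep_positive)
flow.apply(double_amount)
print(flow.snapshots[-1])    # [{'amount': 4}]

flow.edit_step(0, keep_all)  # the doubling step is rerun on the new input
print(flow.snapshots[-1])    # [{'amount': -2}, {'amount': 4}]
```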
View the Data Refinery flow steps in a "snapshot view"
To see what your data looked like at any point in time, click a previous step to put Data Refinery into snapshot view. For example, if you click Data source, you see what your data looked like before you started refining it. Click any operation step to see what your data looked like after that operation was applied. To leave snapshot view, click Viewing step x of y or click the same step that you selected to get into snapshot view.
Use the snapshot view to insert an operation between two steps:
- Click the step before the position where you want to insert the new operation. Data Refinery shows you a snapshot view of the data set after that operation was applied.
- Select and apply the new operation. Data Refinery inserts a new step between the existing steps, and it reruns all the operations that follow the new step.
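Snapshot view and step insertion can be sketched in the same spirit: a snapshot is the result of applying only the first few steps, and an inserted step changes the input of every step after it. This Python sketch is illustrative only. The functions `snapshot` and `insert_after` and the sample operations are hypothetical names.

```python
from typing import Callable, List

Rows = List[int]
Operation = Callable[[Rows], Rows]


def snapshot(source: Rows, steps: List[Operation], upto: int) -> Rows:
    """Return the data after the first `upto` steps; 0 means the data source."""
    data = source
    for op in steps[:upto]:
        data = op(data)
    return data


def insert_after(steps: List[Operation], position: int,
                 op: Operation) -> List[Operation]:
    """Return a new step list with `op` placed after the first `position` steps."""
    return steps[:position] + [op] + steps[position:]


def drop_negatives(rows: Rows) -> Rows:
    return [x for x in rows if x >= 0]


def add_one(rows: Rows) -> Rows:
    return [x + 1 for x in rows]


def double(rows: Rows) -> Rows:
    return [x * 2 for x in rows]


source = [-3, 0, 4]
steps = [drop_negatives, double]

print(snapshot(source, steps, 0))           # [-3, 0, 4]  (the data source)
print(snapshot(source, steps, 1))           # [0, 4]      (after the first step)

# Insert a new operation between the two existing steps.
steps = insert_after(steps, 1, add_one)
print(snapshot(source, steps, len(steps)))  # [2, 10]     (later step reran)
```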
Working with the data sets
Change the source of a Data Refinery flow
You can run the same Data Refinery flow with a different source data set. In the Steps pane in Data Refinery, click the overflow menu next to Data source, select Edit, and choose a different source data set.
For best results, the new data set should have a schema that is compatible with that of the original data set (for example, the same column names, number of columns, and data types). If the new data set has a different schema, operations that don't work with the new schema show errors. You can edit or delete those operations, or change the source to one that has a more compatible schema.
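Before you switch sources, you might compare the two schemas yourself. The following Python sketch shows one way to do that, assuming each schema is available as a mapping from column name to a data type name. The function `schema_differences` and the type names are hypothetical, and Data Refinery's own compatibility checks may differ.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> data type name (hypothetical format)


def schema_differences(original: Schema, candidate: Schema) -> List[str]:
    """List the ways `candidate` differs from `original`; empty means compatible."""
    problems: List[str] = []
    for name, dtype in original.items():
        if name not in candidate:
            problems.append(f"missing column: {name}")
        elif candidate[name] != dtype:
            problems.append(
                f"type mismatch for {name}: expected {dtype}, found {candidate[name]}"
            )
    for name in candidate:
        if name not in original:
            problems.append(f"extra column: {name}")
    return problems


original = {"id": "integer", "name": "string", "price": "decimal"}
candidate = {"id": "integer", "name": "string", "price": "string", "region": "string"}

for problem in schema_differences(original, candidate):
    print(problem)
# type mismatch for price: expected decimal, found string
# extra column: region
```

An empty result suggests that the existing steps are likely to apply cleanly. Any reported difference points to operations that may need to be edited or deleted.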
Change the target of a Data Refinery flow
- In Data Refinery, open the Info pane and click the Details tab.
- Click the Edit button.
- In the DATA REFINERY FLOW OUTPUT pane, click the Edit icon to change any of the following properties:
  - Target location. (The target data set must be a different data set than the source data set.)
  - Data set name and description
  - Relational database targets only: Choose whether to overwrite the data in the existing data set. (If the target data set is not in a relational database, the target data is always overwritten.)
  - File format
  - Column header information
  - Encoding (UTF-8 or SJIS)
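The rules attached to these properties can be summarized in a small Python sketch. It is a hypothetical model for illustration, not a Data Refinery API: the `TargetSettings` class, its field names, and the comparison of locations to decide whether two data sets are the same are all assumptions.

```python
from dataclasses import dataclass

ALLOWED_ENCODINGS = {"UTF-8", "SJIS"}  # the two encodings listed in the text


@dataclass
class TargetSettings:
    """Hypothetical container for the editable target properties."""
    location: str
    name: str
    encoding: str = "UTF-8"
    is_relational: bool = False
    overwrite: bool = True  # only meaningful for relational database targets

    def validate(self, source_location: str) -> None:
        """Raise ValueError if a rule from the list of properties is violated."""
        if self.location == source_location:
            raise ValueError("target must be a different data set than the source")
        if self.encoding not in ALLOWED_ENCODINGS:
            raise ValueError(f"unsupported encoding: {self.encoding}")

    def effective_overwrite(self) -> bool:
        """Relational targets offer a choice; other targets are always overwritten."""
        return self.overwrite if self.is_relational else True


file_target = TargetSettings("project/mydata_shaped.csv", "mydata_shaped",
                             overwrite=False)
file_target.validate(source_location="project/mydata.csv")
print(file_target.effective_overwrite())  # True: not a relational target

table_target = TargetSettings("db/SALES_SHAPED", "SALES_SHAPED",
                              is_relational=True, overwrite=False)
print(table_target.effective_overwrite())  # False: the choice is honored
```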
Actions on the project page
Reopen a Data Refinery flow to continue working
To reopen a Data Refinery flow and continue refining your data, go to the project’s Assets tab. Click the Data Refinery flow name.
Duplicate a Data Refinery flow
To create a copy of a Data Refinery flow, go to the project. Click the Assets tab. Select the Data Refinery flow, and then select Duplicate from the overflow menu. The Data Refinery flow is added to the Data Refinery flows list as "original-name copy 1".
Delete a Data Refinery flow
To delete a Data Refinery flow, go to the project. Click the Assets tab. Select the Data Refinery flow, and then select Delete from the overflow menu.
Promote a Data Refinery flow to a space
Deployment spaces are used to manage a set of related assets in a separate environment from your projects. You use a space to prepare data for a deployment job for Watson Machine Learning. You can promote Data Refinery flows from multiple projects to a single space. Complete the steps in the Data Refinery flow before you promote it because the Data Refinery flow is not editable in a space.
To promote a Data Refinery flow to a space, go to the project's Assets tab, click the overflow menu for the Data Refinery flow, and then select Promote. The source file for the Data Refinery flow and any other dependent data will be promoted as well.
To create or run a job for the Data Refinery flow in a space, go to the space’s Assets tab, scroll down to the Data Refinery flow, and select Create job from the overflow menu. If you've already created the job, go to the Jobs tab to edit the job or view the job run details. The shaped output of the Data Refinery flow job will be available on the space’s Assets tab. You must have the Admin or Editor role to view the job details or to edit or run the job. With the Viewer role for the project, you can only view the job details. You can use the shaped output as input data for a job in Watson Machine Learning.
Restriction: Manually promote the target connected data asset: When you promote a Data Refinery flow from a project to a space and the target of the Data Refinery flow is a connected data asset, you must manually promote the connected data asset. This action ensures that the connected data asset's data is updated when you run the Data Refinery flow job in the space. Otherwise, a successful run of the Data Refinery flow job will create a new data asset in the space.
For information about spaces, see Deployment spaces.