Multicloud data integration tutorial: Integrate data

Take this tutorial to transform data stored in three external data sources with the Multicloud data integration use case of the data fabric trial. Your goal is to use DataStage to transform the data, and then deliver that transformed data to a single output file. If you completed the Virtualize external data tutorial, then you did many of the same tasks using Data Virtualization that this tutorial accomplishes using DataStage.

The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial where you use DataStage to join applicant and application data, filter by state, join applicant credit scores, calculate total debt, look up the mortgage interest rate to offer based on credit score ranges, and output the result to a CSV file. Right-click the image and open it in a new tab to view a larger image.

Screenshots of the tutorial

The story for the tutorial is that Golden Bank needs to adhere to a new regulation where it cannot lend to underqualified loan applicants. As a data engineer at Golden Bank, you currently use DataStage to aggregate your anonymized mortgage applications data with the mortgage applicants’ personally identifiable information. Your lenders use this information to help them decide whether they should approve or deny mortgage applications. Your leadership added some risk analysts who calculate daily what interest rate they recommend offering to borrowers in each credit score range. You need to integrate this information into the spreadsheet you share with the lenders. The spreadsheet includes credit score information for each applicant, the applicant’s total debt, and an interest-rate lookup table. Lastly, load your data into a target output CSV file.

In this tutorial, you will complete these tasks:

Run an existing DataStage flow.
Edit the DataStage flow to:
Edit the Sequential file node and run the DataStage flow.
Create a catalog to store the published data asset.
View the output and publish to a catalog.
Cleanup (Optional)

If you need help with this tutorial, ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Tip: For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Preview the tutorial

Watch Video Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method as an alternative to following the written steps in this documentation.

Prerequisites

The following prerequisites are required to complete this tutorial.

Access type	Description	Documentation
Services	- DataStage - Watson Knowledge Catalog (Optional)	- DataStage - Watson Knowledge Catalog
Role	Data Engineer	- Predefined roles and permissions - Manage roles
Additional access	Editor access to Default Catalog (Optional)	Add collaborators
Additional configuration	Disable Enforce the exclusive use of secrets	Require users to use secrets for credentials

Follow these steps to verify your roles and permissions. If your Cloud Pak for Data account does not meet all of the prerequisites, contact your administrator.

Click your profile image in the toolbar.
Click Profile and settings.
Select the Roles tab.

The permissions that are associated with your role (or roles) are listed in the Enabled permissions column. If you are a member of any user groups, you inherit the roles that are assigned to that group. These roles are also displayed on the Roles tab, and the group from which you inherit the role is specified in the User groups column. (If the User groups column shows a dash, that means the role is assigned directly to you.)
Roles and permissions

Create the sample project

If you already have the sample project for this tutorial, then skip to Task 1. Otherwise, follow these steps:

Download the Multicloud-Data-Integration.zip file.
From the Cloud Pak for Data navigation menu , choose Projects > All projects.
On the Projects page, click New project.
Select Create a project from a file.
Upload the previously downloaded ZIP file.
On the Create a project page, type the project name and optional description for the project.
```
Multicloud Data Integration
```
Click Create.
Click View new project to verify that the project and assets were created successfully.
Click the Assets tab, to view the project's assets.

Check your progress

The following image shows the Assets tab in the sample project. You are now ready to start the tutorial.

Tip: If you encounter a guided tour while completing this tutorial in the Cloud Pak for Data user interface, click Maybe later.

Task 1: Run an existing DataStage flow

Start with a basic DataStage flow that joins the mortgage applicants and mortgage applications data sets, and outputs that result to a CSV file in the project. Follow these steps to run the DataStage flow:

Start in the Multicloud data integration project. If you don't have the project open, follow these steps:
1. From the Cloud Pak for Data navigation menu , choose Projects > All projects.
2. Open the Multicloud data integration project.
Note: You might see a guided tour showing the tutorials that are included with this use case. The links in the guided tour will open these tutorial instructions. If you already have the tutorial instructions open in a new window, then close out this guided tour.
Click the Assets tab to see all of the assets in the project.
Click Flows > DataStage flows.
Click the Multicloud Data Integration flow in the list to open it. This flow joins the Mortgage Applicants and Mortgage Applications tables that are stored in Db2 Warehouse, filters the data to those records from the State of California, and creates a sequential file in CSV format as the output.
Click the zoom in and zoom out icons on the toolbar to set your preferred view of the canvas.
Double-click MORTGAGE_APPLICATIONS_1 node to view the settings.
1. Expand the Properties section.
2. Scroll down, and click Preview data. This data set includes information that is captured on a mortgage application.
3. Click Close.
Double-click MORTGAGE_APPLICANTS_1 node to view the settings.
1. Expand the Properties section.
2. Scroll down, and click Preview data. This data set includes information about mortgage applicants who applied for a loan.
3. Click Close.
Double-click Join_on_ID node to view the settings.
1. Expand the Properties section.
2. Note that the join key is the ID column.
3. Click Cancel to close the settings.
Click Compile, and then click Run. Alternatively, you can click Run which compiles and then runs the DataStage flow.
Click the Logs icon on the toolbar to watch the flow's progress. The run can take about one minute to complete.
When the run completes successfully, click Multicloud data integration in the navigation trail to return to the project.
On the Assets tab, click Data > Data asset.
Open the MORTGAGE_INTEREST_RATES.CSV file. You can see that this file contains the columns from both the mortgage applicants and mortgage applications data sets.

Check your progress

The following image shows resulting MORTGAGE_INTEREST_RATES.CSV file. The next task is to edit the DataStage flow.

Edit the DataStage flow

Now that you joined the mortgage applicant and application data, you are ready to edit the DataStage flow to:

Task 2: Specify a key column for the Join stage.
Task 3: Add credit score data from a PostgreSQL database.
Task 4: Add a Join stage to join the credit score data with the applicant and application data.
Task 5: Add a Transformer stage to calculate total debt.
Task 6: Add interest rate data from a MongoDB database.
Task 7: Add a Lookup stage to look up interest rates for applicants.

Task 2: Specify the key column for the Join stage

Identifying a key column indicates to DataStage that column contains unique values. The Join_on_ID node joins the mortgage applicants and mortgage applications data sets using the ID column for the join key. The next phase is to join the resulting data set with the credit score data. Later, you will join the resulting filtered data with the credit score data set. That second join will use the EMAIL_ADDRESS column as the join key. In this task, you edit the DataStage flow to specify the EMAIL_ADDRESS column as the key column for the resulting data set when it is joined with the credit score data.

The following animated image provides a visual representation as an alternative to the description of the two join nodes. Right-click the image and open it in a new tab to view a larger image.

Join nodes

Follow these steps to change the Join node settings:

Click Multicloud data integration in the navigation trail to return to the project.
On the Assets tab, click Flows > DataStage flows.
Open the Multicloud Data Integration flow.
Double-click the Join_on_ID node to edit the settings.
Click the Output tab, and expand the Columns section to see a list of the columns in the joined data set.
Click Edit.
For the EMAIL_ADDRESS column name, select Key.
Click Apply and return to return to the Join_on_ID node settings.
Click Save to save the Join_on_ID node settings.

Check your progress

The following image shows the DataStage flow with the edited Join_on_id stage. Now that you identified the EMAIL_ADDRESS column as the key column, you can add the PostgreSQL data containing the applicants credit scores.

Task 3: Add credit score data from a PostgreSQL database

Follow these steps to add the credit score data that is stored in a PostgreSQL database to the DataStage flow:

In the node palette, expand the Connectors section.
Drag the Asset browser connector to the canvas beside the MORTGAGE_APPLICANTS_1 node.
To locate the asset, select Connection > Data Fabric Trial - Databases for PostgreSQL > BANKING > CREDIT_SCORE. Note: To expand the connection and schema, click the connection or schema name instead of the checkbox.
Click the Preview icon to preview the credit score data for each applicant.
Click Add.

Check your progress

The following image shows the DataStage flow with the credit score asset added. Now that you added the credit score data to the canvas, you need to join the applicant, application, and credit score data.

Credit score data asset

Task 4: Add a Join stage to join the credit score data with the applicant and application data

Follow these steps to add another Join stage to join the filtered mortgage application and mortgage applicant joined data with the credit score data in the DataStage flow:

In the node palette, expand the Stages section.
Drag the Join stage on to the canvas, and drop the node on the link line between the Filter_State_Code and Sequential_file_1 nodes.
Hover over the CREDIT_SCORE_1 connector to see the arrow. Connect the arrow to the Join stage.
Double-click the CREDIT_SCORE_1 node to edit the settings.
1. Click the Output tab, and expand the Columns section to see a list of the columns in the joined data set.
2. Click Edit.
3. For the EMAIL_ADDRESS and CREDIT_SCORE column names, select Key.
4. Click Apply and return to return to the CREDIT_SCORE_1 node settings.
5. Click Save to save the CREDIT_SCORE_1 node settings.
Double-click the Join_1 node to edit the settings.
1. Expand the Properties section.
2. Click Add key.
  1. Click Add key again.
  2. Select EMAIL_ADDRESS from the drop-down list.
  3. Click Apply.
3. Click Apply and return to return to the Join_1 node settings.
4. Change the Join_1 node name to Join_on_email.
5. Click Save to save the Join_1 node settings.

Check your progress

The following image shows the DataStage flow with a second Join stage added. Now that you joined the application, applicant, and credit score data, you need to add a Transformer stage to calculate each applicant's total debt.

Join_on_email stage

Task 5: Add a Transformer stage to calculate total debt

Follow these steps to add a Transformer stage that creates a new column by summing the LOAN_AMOUNT and CREDITCARD_DEBT columns:

Drag the Transformer stage on to the canvas, and drop the node on the link line between the Join_on_email and Sequential_file_1 nodes.
Double-click the Transformer node to edit the settings.
Click the Output tab.
1. Click Add column.
2. Scroll down in the list of columns to see the new column.
3. Name the column TOTAL_DEBT.
4. Click the Edit icon in the row’s derivation column.
5. Click the Calculator icon to open the expression builder.
6. Search for LOAN_AMOUNT, and double-click the column name to add it to the expression.
7. Type a plus sign +.
8. Search for CREDITCARD_DEBT, and double-click the column name to add it to the expression.
9. Verify that the final expression is Link_5.LOAN_AMOUNT + Link_5.CREDITCARD_DEBT. Note: Your link number may be different.
10. Click Apply and return to return to the Transformer page.
Click Save and return to return to the canvas.

Check your progress

The following image shows the DataStage flow with the Transformer stage added. Now that you calcluated each applicant's total debt, you need to add the table of interest rates to offer based on credit score ranges.

Task 6: Add interest rate data from a MongoDB database

Follow these steps to include the interest rates in the flow by adding a data asset connector to a MongoDB database:

In the node palette, expand the Connectors section.
Drag the Asset browser connector on to the canvas beside the CREDIT_SCORE_1 node.
To locate the asset, select Connection > Data Fabric Trial - Mongo DB > DOCUMENT > DS_INTEREST_RATES.
Click the Preview icon to preview interest rates for each credit score range.

You can use the values in the STARTING_LIMIT and ENDING_LIMIT columns to look up the appropriate interest rate based on the applicant's credit score. The ID column is not needed, so you will delete that column in the next step.
Click Add.

Check your progress

The following image shows the DataStage flow with the interest rates data asset added from the MongoDB external source. Now that you added the interest rates table, you can lookup the appropriate interest rate for each applicant.

Task 7: Add a Lookup stage to look up interest rates for applicants

Based on each applicant's credit score, you want to lookup the appropriate interest rate. Follow these steps to add a Lookup stage and specify the range for starting and ending credit score limits for each interest rate:

Drag the Lookup stage on to the canvas, and drop the node on the link line between the Transformer_1 and Sequential_file_1 nodes.
Connect the DS_INTEREST_RATES_1 connector to the Lookup_1 stage.
Double-click the DS_INTEREST_RATES_1 node to edit the settings.
Click the Output tab.
1. Expand the Columns section, and click Edit.
2. Select the _ID column.
3. Click the Delete icon to delete this unnecessary column.
4. Click Apply and return to return to the DS_INTEREST_RATES_1 node settings.
5. Click Save to save the changes to the DS_INTEREST_RATES_1 node.
Double-click the Lookup_1 node to edit the settings.
Expand the Properties section.
1. For the Apply range to columns field, select CREDIT_SCORE. The Reference Links, Operator, and Range column fields display.
2. For the Reference Links, select Link_9. Note: Your link number may be different.
3. For the first Operator, select <.
4. For the first Range column, select ENDING_LIMIT.
5. For the second Operator, select >.
6. For the second Range column, select STARTING_LIMIT.
Click the Output tab.
1. Expand the Columns section, and click Edit.
2. Select the STARTING_LIMIT and ENDING_LIMIT columns.
3. Click the Delete icon to delete these unnecessary STARTING_LIMIT and ENDING_LIMIT columns.
4. Click Apply and return to return to the Lookup_1 node settings.
5. Click Save to save the changes to the Lookup_1 node.

Check your progress

The following image shows that the DataStage flow with the Lookup stage added. The {{site.data.keyword.datastage}} flow is now complete. The last task before running the flow is to specify the name for the output file.

Task 8: Edit the Sequential file node and run the DataStage flow

Follow these steps to edit the Sequential file node to create a final output file as a data asset in the project, and then compile and run the DataStage flow:

Double-click the Sequential_file_1 node to edit the settings.
Click the Input tab.
Expand the Properties section.
For the Target File, type MORTGAGE_APPLICANTS_INTEREST_RATES.CSV.
Select Create data asset.
For the First line is column names field, select True.
Click Save.
Click Run which compiles and then runs the DataStage flow.
Click the Logs on the toolbar to watch the flow's progress. It is normal to see warnings during the run, and then you see that the flow ran successfully.

Check your progress

The following image shows that the DataStage flow ran successfully. Now that the DataStage flow created the output file, you need to create the catalog where you will publish the output file.

DataStage run complete

Task 9: Create a catalog to store the published data asset

If you have the appropriate role and permissions to create a catalog, then follow these steps to create the Mortgage Approval Catalog. Otherwise, skip to Task 10: View the output and publish to a catalog to use the Default catalog.

From the Cloud Pak for Data navigation menu , choose Catalogs > All catalogs.
If you see a Mortgage Approval Catalog on the Catalogs page, then skip to Task 10: View the output and publish to a catalog. Otherwise, follow these steps to create a new catalog:
Click Create Catalog.
For the Name, copy and paste the catalog name exactly as shown with no leading or trailing spaces, Mortgage Approval Catalog.
Select Enforce data policies, confirm the selection, and accept the defaults for the other fields.
Click Create.

Check your progress

The following image shows your catalog. Now that you the Mortgage Approval Catalog exists, you can publish the output file to the catalog.

Task 10: View the output and publish to a catalog

Follow these steps to view the output file in the project, and then publish it to a catalog:

From the Cloud Pak for Data navigation menu , choose Projects > All projects.
Open the Multicloud data integration project.
On the Assets tab, click Data > Data asset.
Open the MORTGAGE_APPLICANTS_INTEREST_RATES.CSV file.
Scroll to the right to see your integrated data with interest rates at the end of each data entry.
Click Multicloud data integration in the navigation trail to return to the project.
On the Assets tab, click the Overflow menu at the end of the row for the MORTGAGE_APPLICANTS_INTEREST_RATES.CSV file, and choose Publish to catalog.
Select the Default Catalog (or your catalog name) from the list, and click Publish.
From the Cloud Pak for Data navigation menu , choose Catalogs > All catalogs.
Open the Default Catalog (or your catalog name).
Search for Mortgage.
Open the MORTGAGE_APPLICANTS_INTEREST_RATES.CSV file.
Click Asset tab to view the data.

Check your progress

The following image shows the MORTGAGE_APPLICANTS_INTEREST_RATES.CSV file in catalog. The data that lenders need to make mortgage decisions is now available.

As a data engineer at Golden Bank, you integrated the mortgage applicant, application, credit rating, and credit score information, and published that data in a catalog.

Cleanup (Optional)

If you would like to retake the tutorials in the Multicloud data integration use case, delete the following artifacts.

Artifact	How to delete
Mortgage Approval Catalog	Delete a catalog
Multicloud data integration sample project	Delete a project

Next steps

Sign up for another Data fabric use case.
Try other tutorials:
- Transform data with DataStage
- Refine data

Learn more

Parent topic: Data fabric tutorials