Data governance and privacy tutorial: Know your data

Take this tutorial to work with your trusted and protected data with the Data governance and privacy use case of the data fabric trial. Your goal is to evaluate, share, shape, and analyze data in the data fabric.

The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial: you will view catalog assets, manually enrich assets and create relationships, visualize data, and filter data to improve quality. Right-click the image and open it in a new tab to view a larger image.

Screenshots of the tutorial

The story for the tutorial is that Golden Bank has several departments that need access to high-quality customer mortgage data. As a Data Analyst, you will need to search for and find the right data, understand and trust its content, and then prepare it for other data analysts and data scientists to use.

In this tutorial, you will complete these tasks:

  1. Understand data assets.
  2. Enrich assets and create relationships.
  3. Add enriched data to a project.
  4. Visualize the data.
  5. Prepare the data for analytics and AI.
  6. Cleanup (Optional)

If you need help with this tutorial, ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Tip: For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Preview the tutorial

Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method as an alternative to following the written steps in this documentation.

Prerequisites

Complete the Trust your data and Protect your data tutorials before you start this tutorial.

Tip: If you encounter a guided tour while completing this tutorial in the Cloud Pak for Data user interface, click Maybe later.

Task 1: Understand data assets

Data assets in catalogs are much more than pointers to data. They contain information about the format and meaning of the data and statistics about the data values. Follow these steps to understand the value of data assets:

  1. From the Cloud Pak for Data navigation menu, choose Catalogs > All catalogs.

  2. Open the Mortgage Approval Catalog.

  3. Review the featured assets section. It shows Recently added assets, assets that Watson recommends (suggestions from AI and machine learning based on your past usage and on popularity), and Highly rated assets that catalog collaborators rated and reviewed.

  4. Click Hide featured assets to close that section.

  5. Search for mortgage.

  6. Click MORTGAGE_APPLICANTS_TRUST to view that catalog asset. The Overview tab and the side panel provide basic information about the asset such as the description, a rating, tags, where the asset is located, business terms, data classes, and related assets.

  7. Click the Profile tab. The profile information helps you understand the content, the quality, and usability of the data.

  8. Scroll to the right to locate the ZIP_CODE column.

  9. The data class that was automatically assigned to the ZIP_CODE column is Commercial and Government Entity, but the values are zip codes. You can easily reclassify this column. Click the drop-down list to see other possible data classes and their confidence levels. Select US Zip Code.

  10. Click the Asset tab to see a preview of the data.

  11. To view column metadata, click the View icon for the EMPLOYMENT_STATUS column, and review the assigned business terms. Click Close to close the column metadata window.
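One way to build intuition for the confidence levels shown next to candidate data classes in step 9 is to think of them as the share of column values that fit a class's expected pattern. The following Python sketch illustrates that idea only. It is an assumption made for illustration, not the algorithm the product uses, and the pattern, the `match_share` helper, and the sample values are all hypothetical.

```python
import re

# Hypothetical pattern for US ZIP codes: five digits, optionally followed by
# a hyphen and four more digits (ZIP+4).
US_ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def match_share(values, pattern):
    """Return the fraction of non-empty values that fully match the pattern."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if pattern.fullmatch(v))
    return hits / len(non_empty)


# Hypothetical sample values; the real ZIP_CODE column is not reproduced here.
sample_zip_codes = ["10001", "94105-1234", "60601", "N/A", ""]
share = match_share(sample_zip_codes, US_ZIP_PATTERN)
print(f"{share:.0%} of non-empty values look like US ZIP codes")
```

A high share for one pattern and a low share for another is the kind of evidence that would favor reclassifying the column, as you did when you selected US Zip Code.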

Check your progress

The following image shows the MORTGAGE_APPLICANTS_TRUST asset in the catalog. You explored the type of information that is automatically added to data assets during metadata enrichment. In the next task, you will manually enrich this data asset.

MORTGAGE_APPLICANTS_TRUST asset

Task 2: Enrich assets and create relationships

You can make assets more valuable by adding information to them. For example, you can add your opinion of the asset, update asset properties, and create relationships to link assets. Follow these steps to enrich assets and create relationships:

  1. For the MORTGAGE_APPLICANTS_TRUST catalog asset, click the Review tab. Rate and comment on this asset so that others can find the asset easily.

    1. Select 5 stars for the rating.

    2. For the review, type:

      This contains high quality customer data from the mortgage system.
      
    3. Click Submit.

  2. Click the Overview tab.

  3. To edit the asset name, click the Edit icon next to the asset name.

    1. Change the name to:

      MORTGAGE_APPLICANTS_TRUST_PROTECT
      
    2. Click Apply.

  4. In the Description section in the right side panel, click the Add icon. Note: If this asset has an existing description, you will see an Edit icon instead of an Add icon.

    1. Type the description:

      Mortgage applicants from the Mortgage System
      
    2. Click Add. Note: If you are editing an existing description, you will see a Save button instead of an Add button.

  5. Because this asset relates to mortgage loans, next to Business terms, click the Add icon.

    1. In the Search field, type loan. Note: It is not necessary to press Enter after typing the search term. You will see a list of results immediately after typing the search term.

    2. Select Loan.

    3. Click Add.

  6. Because this asset contains personal information, next to Classifications, click the Add icon.

    1. Select Personally Identifiable Information.

    2. Click Add.

  7. Because this asset is related to other mortgage assets, next to Related assets, click Add asset.

    1. Select Is related to, and click Next.

    2. Select the CREDIT_SCORE and MORTGAGE_APPLICATION assets, and click Add.

  8. Click MORTGAGE_APPLICATION to view that related asset.

Check your progress

The following image shows the Overview tab for the MORTGAGE_APPLICANTS_TRUST_PROTECT asset in the catalog. You made this asset more valuable by reviewing it, updating its properties, and adding relationships to other assets. In the next task, you will add the enriched asset to a project.

MORTGAGE_APPLICANTS_TRUST with related assets

Task 3: Add enriched data to a project

The data analysts team needs the mortgage applicants data in the mortgage analysis project to refine, visualize, analyze, and use as training data for models. Follow these steps to add the enriched data to a project:

  1. Click Mortgage Approval Catalog in the navigation trail.
    Navigation trail

  2. At the end of the MORTGAGE_APPLICANTS_TRUST_PROTECT catalog asset row, click the Overflow menu, and choose Add to project.

    1. In the Target drop-down list, select the Data governance and privacy project.

    2. Click Add.

  3. When the notification displays, click Go to project. If you miss the notification, then:

    1. Click the Cloud Pak for Data navigation menu, and choose Projects > All projects.

    2. Click the Data governance and privacy project.

  4. In the project, click the Assets tab to see the MORTGAGE_APPLICANTS_TRUST_PROTECT data asset.

Check your progress

The following image shows the MORTGAGE_APPLICANTS_TRUST_PROTECT asset in the project. Now you are ready to visualize the data.

MORTGAGE_APPLICANTS_TRUST_PROTECT asset in the project

Task 4: Visualize the data

You need to cleanse and refine the mortgage applicants data to get it ready for your analytical tools and models. A quick and easy way to determine how it needs to be shaped is to visualize the data in Data Refinery. The visualization is based on the first 5,000 rows of the data. Follow these steps to visualize the data:

Tip: If this is your first time accessing Data Refinery, you might see a prompt to take a guided tour. For now, click Maybe later.

  1. Click the MORTGAGE_APPLICANTS_TRUST_PROTECT data asset to preview the data.

  2. Click Prepare data to open the data asset in Data Refinery, and wait for the data to be read and processed.

  3. In the Information panel, click the X to close the panel.

  4. In the Steps panel, click the X to close the panel.

  5. Click the Visualizations tab.

  6. For the Column to visualize, select EMPLOYMENT_STATUS.

  7. Click Visualize data. The tool selects a pie chart as the best chart type for this column, which shows the distribution of applicants by employment status. Notice the suggested chart types that are indicated by a blue dot next to bar, word cloud, and sunburst.

  8. For the Chart type, select the Bubble chart type. The Bubble chart is one easy way to quickly visualize the distribution of values in a particular data set.

  9. From the Chart type drop-down, select the Relationship chart type.

  10. This chart type requires two columns. Select these columns:

    1. For the first column, select EMPLOYMENT_STATUS.

    2. Click Add another column.

    3. For the second column, select EDUCATION.

  11. With the Relationship chart, you can select endpoints to see the relationships. For example, you can see applicants' employment status by level of education.
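The charts in the steps above are, at heart, views of value counts. The following minimal Python sketch shows the counts behind a single-column chart (pie or bubble) and a two-column relationship chart, using only the standard library. The sample rows and category values are hypothetical, and the `value_distribution` and `pair_distribution` helpers are illustrative names, not part of Data Refinery. The 5,000-row cap mirrors the limit that this tutorial states for visualizations.

```python
import csv
import io
from collections import Counter
from itertools import islice

# Hypothetical sample rows; the real asset has more columns and rows,
# and its actual category values may differ.
SAMPLE_CSV = """EMPLOYMENT_STATUS,EDUCATION
Employed,Bachelor
Employed,Master
Self-employed,Bachelor
Retired,High School
Employed,Bachelor
"""

MAX_ROWS = 5000  # Visualizations are based on the first 5,000 rows.


def value_distribution(rows, column):
    """Count how often each value occurs in one column."""
    return Counter(row[column] for row in rows)


def pair_distribution(rows, first, second):
    """Count how often each pair of values occurs across two columns."""
    return Counter((row[first], row[second]) for row in rows)


rows = list(islice(csv.DictReader(io.StringIO(SAMPLE_CSV)), MAX_ROWS))
status_counts = value_distribution(rows, "EMPLOYMENT_STATUS")
pair_counts = pair_distribution(rows, "EMPLOYMENT_STATUS", "EDUCATION")

for status, count in status_counts.most_common():
    print(f"{status}: {count}")
for (status, education), count in pair_counts.most_common():
    print(f"{status} -> {education}: {count}")
```

Each pair count corresponds to one link between two endpoints in a relationship chart, which is why selecting an endpoint highlights only the values that co-occur with it.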

Check your progress

The following image shows the MORTGAGE_APPLICANTS_TRUST_PROTECT asset visualized in Data Refinery. You are now ready to cleanse the data.

Relationship visualization

Task 5: Prepare the data for analytics and AI

You can't process applicants without a social security number, so you need to review the data and remove any applicants without social security numbers. Follow these steps to prepare the MORTGAGE_APPLICANTS_TRUST_PROTECT data:

  1. In Data Refinery, click the Profile tab.

  2. Scroll to the right to locate the Social_Security_Number column. Notice several missing values.

  3. Click the Data tab to filter out these records. In the status bar at the bottom of the screen, Data Refinery indicates that the FULL DATA SET is 1101 rows.

  4. If the Steps panel is not visible, click Steps to open the panel.

  5. Click New step.

    1. In the Cleanse section, select Filter.

    2. In the Column field, select the Social_Security_Number column.

    3. In the Operator field, select Is not empty.

    4. Click Apply. In the status bar at the bottom of the screen, Data Refinery now indicates that the FULL DATA SET is 1000 rows because the rows with missing social security numbers are filtered out. A new step also displays in the Steps panel showing the Filter operation.

  6. Click the Profile tab.

  7. Scroll to the right to locate the Social_Security_Number column. Notice that the missing values are gone.

  8. From the toolbar, click the Save icon.

  9. From the toolbar, click the Export icon, and choose Export current data to CSV.
    Export as csv icon

    1. Save the MORTGAGE_APPLICANTS_TRUST_PROTECT_shaped.csv to a local folder.

    2. Navigate to that folder, and open the CSV file. It contains 1000 rows, and no applicant is missing a social security number.

  10. Return to Cloud Pak for Data, and click the Data governance and privacy project in the navigation trail.
    Navigation trail

  11. Click All assets, and locate the new Data Refinery flow asset with the name MORTGAGE_APPLICANTS_TRUST_PROTECT_flow.

Tip: You can save the refined data set to the project or to an external data source, such as the Db2 Warehouse instance where the original data sets are stored. For more information, refer to Creating jobs in Data Refinery.
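The Filter step with the Is not empty operator can be approximated offline as a row filter on one column. The following Python sketch is a minimal illustration under stated assumptions: the sample rows are hypothetical, the `is_not_empty` and `filter_not_empty` helpers are illustrative names, and treating whitespace-only values as empty is an assumption rather than documented Data Refinery behavior.

```python
import csv
import io

# Hypothetical sample rows; the real data set has 1101 rows before filtering.
SAMPLE_CSV = """ID,NAME,Social_Security_Number
1,Applicant A,123-45-6789
2,Applicant B,
3,Applicant C,987-65-4321
4,Applicant D,"  "
"""


def is_not_empty(value):
    """Return True when a value is present and not blank.

    Assumption: whitespace-only values count as empty.
    """
    return value is not None and value.strip() != ""


def filter_not_empty(rows, column):
    """Keep only the rows whose value in the given column is not empty."""
    return [row for row in rows if is_not_empty(row.get(column))]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept = filter_not_empty(rows, "Social_Security_Number")
print(f"before: {len(rows)} rows, after: {len(kept)} rows")
```

To check the file that you exported in step 9, you could read MORTGAGE_APPLICANTS_TRUST_PROTECT_shaped.csv with `csv.DictReader` in place of the sample string and confirm that the filtered and unfiltered row counts are equal.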

Check your progress

The following image shows the MORTGAGE_APPLICANTS_TRUST_PROTECT_shaped.csv file that you refined in Data Refinery. This data set contains the information about those mortgage applicants who provided a social security number.

Refined data asset

As a Data Analyst for Golden Bank, you learned how to search for and find the right data, understand and trust its content, and then prepare it for other data analysts and data scientists to use.

Cleanup (Optional)

If you would like to retake the tutorials in the Data governance and privacy use case, delete the following artifacts.

| Artifact | How to delete |
| --- | --- |
| Imported business terms | Delete governance artifacts |
| Banking category | Delete a category |
| Data protection rules: Confidential Information and Redact Social Security Number | Delete data protection rules |
| Mortgage Approval Catalog | Delete a catalog |
| Data governance and privacy sample project | Delete a project |
