Creating a virtualized table from files in Cloud Object Storage in Data Virtualization

You can use data that is stored as files on IBM® Cloud Object Storage, Amazon S3, or Ceph® data sources to create virtual tables. Segment or combine data from one or more files to create a virtual table.

Before you begin

To access data in cloud object storage, you must create a connection to the data source where the files are located, such as Amazon S3, Ceph, or IBM Cloud Object Storage. For more information about object storage data sources, see Data sources in object storage in Data Virtualization.

Procedure

  1. On the navigation menu, click Data > Data virtualization.
    The service menu opens to the Data sources page by default.
  2. On the service menu, click Virtualization > Virtualize and then click the Files tab.
    The list of available data sources appears. You can narrow down the displayed assets by using the available filters.

    If you specified a bucket name when you set up the data source connection, click The down chevron icon to expand the object storage connection details to see the Service type and Bucket information. If you didn't specify a bucket name when you set up the data source connection, you can use the Bucket input field to find a specific bucket in the endpoint.

    If the list of data sources does not appear, click RefreshThe refresh icon.

  3. Select the object storage endpoint where you want to browse files and file paths.
    For Cloud Object Storage, the endpoint is the URL for the object storage.
    A list of file paths or buckets in the endpoint appears. You can navigate through the file path structure or you can click The view icon to see details of the contents of the first file in the file path.

    You cannot add the bucket to your cart. To add a file path to your cart, you must select the file path to preview the files in the path, and then click Add to cart. You cannot select a file at the bucket level, you must add the file to a file path in the bucket.

  4. Select the file or file path that you want to virtualize and click Add to cart.
    Important:
    • You cannot virtualize a single file in a file path that contains multiple files. The URL resolves to the parent path where the file is located and the entire file path is virtualized. If you want to virtualize a single file, you can move it to a separate file path. The separate file path must not be a sub-file path of any other file path that is, or will be, virtualized.
    • Files that you want to virtualize must be within a file path and not at the same level as the bucket. For example, you cannot virtualize a file s3a://mynewbigsqlbucket/mydata.csv; you must put mydata.csv into a file path and virtualize s3a://mynewbigsqlbucket/fi1epath1/mydata.csv because the virtualization process cannot create an external table by using only a bucket name without a path.
  5. Click View cart to preview your file data selections as a virtual file.
    From this window, you can edit schema names, preview files that participate in a merged table, or remove a selection from your cart.

    If you have Watson™ Knowledge Catalog installed, you can publish your virtual table to a catalog. For more information, see Publishing virtual data to a catalog with Data Virtualization.

  6. Recommended: Update the type of partitioned columns from STRING to something more appropriate. Manually inspect and specify proper types for partitioning columns for best performance.
  7. Optional: Click The overflow menu icon and select Edit columns.
    You can edit any column name that is not tagged as a Partitioning column and change column types by using the drop-down menu. When you are happy with your edits, click Apply. Updated column names are shown after you virtualize the table.
  8. Select the appropriate option to assign the virtual table to be created from file data:
    Assign to When to use this option
    Data request Select Data request if you created the virtual table in response to a data request. Then, choose the appropriate request. The table also appears in Virtualized data.
    Project Select Project if you created the virtual table to use in a specific analytics project. Then, choose the appropriate project. The table also appears in Virtualized data.
    Virtualized data Select Virtualized data if the table was not created in response to a data request or to use in a specific project. This setting is the default if no data requests or projects exist.
  9. Select Publish to catalog if you also want to publish to a selected catalog.
    A list of available catalogs is shown in the drop-down menu. Each catalog is tagged as Governed or Not governed.
    Note: You must have at least one catalog in Watson Knowledge Catalog.

    You must have permission to publish to a catalog. An administrator can enable whether all virtual objects are published to a selected governed catalog, which prevents a user from publishing to a specified catalog.

  10. Specify a schema in the Schema field.
    You can also create a schema by following these steps.
    • If you have the Data Virtualization Engineer or User role, leave the Schema field as default to create a schema with your user ID.
    • If you have the Data Virtualization Admin role, leave the Schema field as default to create a schema with your user ID or enter the new schema name in the Schema field.
    For more information, see Creating schemas for virtual objects.
  11. Click Virtualize to complete the process.
    When the status window appears, you can select to view your virtualized data or virtualize more data. However, you must wait until virtualization is complete before you navigate away from the page.

Results

If Data Virtualization and Watson Knowledge Catalog are installed in the same OpenShift® project (namespace), your virtual object is published to the primary catalog.

What to do next

  • View the table structure and metadata.
  • Manage access to the table.
  • Edit column names and the types of your object storage assets so that you can prepare accurate data for virtualization.
  • Refresh partitions for your virtualized table so that your data is up to date and accurate. Click Refresh partitions in the overflow menu on the Virtualized data page to identify new partitions. For more information, see Data sources in object storage in Data Virtualization.
  • Collect statistics for your virtualized table to optimize query performance. For more information, see Collecting statistics in Data Virtualization.
  • Optionally, on the Virtualized data page, publish your virtual object to the catalog. For more information, see Publishing virtual data to the catalog.