Planning to access data

To make Cloud Pak for Data work for you, you need access to data. Planning involves confirming that the data source is supported in the services and tools that you want to use, and reviewing authentication requirements, security, and more.

Location

You can host your data on Cloud Pak for Data by using one of the multiple databases that are available with the data source services or you can create a connection to an external data source.

Use a data source service on Cloud Pak for Data
Deploy a database in the cluster with one of the data source services. Some of the data source services are included in Cloud Pak for Data; others are separately priced. For the list of data sources, see Data source services.
Connect to data outside of Cloud Pak for Data
Access your data where it already resides. You can create a connection to cloud or on-premises data.

Planning which data sources you can connect to

When you plan a connection to a data source, consider all these areas.

  1. Which services support the connection? Not all services support the same data sources. See Connection types.

    If the data source that you want to connect to is not listed in Connection types, check whether the data source has an available JDBC driver. If it does, a Cloud Pak for Data administrator can upload the JDBC JAR files so that users can create a Generic JDBC connection to the data source. Because of the variety of third-party JAR files that are available, be sure to test the connection thoroughly.

  2. Which tools from those services support the connection? Also, check for any data source connection restrictions or setup requirements that might apply to specific tools. Refer to the lists in the section for Supported connection types by tool at Connection types for analytics projects and catalogs.

  3. What restrictions apply to the connection? Review the information for the specific data source connection in Connection types for analytics projects and catalogs:
    • If you need to store the data results in the data source, confirm that the data source supports write, export, or target access. Some connections are for source data only.
    • Supported file formats, if applicable.
    • Prerequisites and restrictions.
    • Supported product versions.
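
The three planning questions above can be recorded in a small worksheet. The following Python sketch is illustrative only: the data source name, the tool names, and the write flag are hypothetical placeholders, not values from the product. Fill in real values from Connection types for analytics projects and catalogs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionSupport:
    """Planning record for one data source (all values are hypothetical)."""
    data_source: str
    tools: frozenset   # tools that list this connection type as supported
    can_write: bool    # False when the connection is for source data only


def plan_gaps(entry, required_tools, need_write):
    """Return human-readable gaps; an empty list means none were found."""
    gaps = [f"{tool} does not list {entry.data_source}"
            for tool in sorted(set(required_tools) - entry.tools)]
    if need_write and not entry.can_write:
        gaps.append(f"{entry.data_source} is source-only; results cannot be stored there")
    return gaps


# Hypothetical worksheet entry, for illustration.
example = ConnectionSupport("ExampleDB", frozenset({"ToolA", "ToolB"}), can_write=False)
for gap in plan_gaps(example, ["ToolA", "ToolC"], need_write=True):
    print(gap)
```

An empty result does not replace testing the connection in each tool; it only confirms that the documented support matches the plan.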

Sharing connections among Cloud Pak for Data users

You can share connections with other users by adding the connections to the Platform assets catalog (a platform-level connection) or by using a Watson Knowledge Catalog. Users can use the same connections with multiple services or instances or across multiple projects.

The catalogs have these characteristics.

Platform assets catalog
    • An administrator must install the Cloud Pak for Data common core services.
    • A platform connection that is shared with other services is a reference to the original platform connection. If you update a platform connection, it will automatically update everywhere that the connection was added (at the time of using the connection).
    • All users on the platform automatically have View access at the minimum. The administrator can manage collaborators on platform connections.
    • The Platform assets catalog is not governed.
    • Not all services can use connections in the Platform assets catalog. For more information, see Connecting to data sources at the platform level.

Watson Knowledge Catalog
    • An administrator must install the Watson Knowledge Catalog service and set up the default catalog.
    • Connections that are published to other catalogs or projects are copies.
    • The administrator must add collaborators to the catalog.
    • A catalog that you create with Watson Knowledge Catalog can be governed.
    • For more information, see Adding a connection asset to a catalog.

Authentication features for connecting to data sources

Decide whether your organization should follow a consistent approach for any of the authentication features. Each data source has its own authentication requirements, such as a username and password or a secret key and API key. By default, users enter the credentials manually. The following authentication features are available in Cloud Pak for Data:

Personal credentials

If you want users to access only the data sources for which they have credentials, disable shared credentials:

When users create a connection to a data source, they can select either personal credentials or shared credentials. By default, credentials are set to shared, and the credentials that are entered by the creator of the connection are available to all other collaborators within the project when they access the data with the connection. With personal credentials, each user enters their own credentials when they create a new connection or when they use the connection to access data. An administrator can disable shared credentials.
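
The difference between the two credential modes can be sketched as a lookup rule. This is a conceptual Python model, not Cloud Pak for Data code: the `Connection` class, the `credentials_for` function, and the sample names are hypothetical and exist only to illustrate the behavior that is described above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Credentials = Tuple[str, str]  # (username, password); hypothetical shape


@dataclass
class Connection:
    """Hypothetical model of a project connection, for illustration only."""
    name: str
    mode: str = "shared"                       # the default mode is shared
    shared: Optional[Credentials] = None       # entered by the connection creator
    personal: Dict[str, Credentials] = field(default_factory=dict)


def credentials_for(conn: Connection, user: str) -> Credentials:
    """Return the credentials that `user` would access the data with."""
    if conn.mode == "shared":
        if conn.shared is None:
            raise LookupError(f"{conn.name} has no shared credentials")
        return conn.shared                     # same credentials for every collaborator
    if user not in conn.personal:
        # With personal credentials, each user must supply their own.
        raise PermissionError(f"{user} must enter credentials for {conn.name}")
    return conn.personal[user]


shared_conn = Connection("sales_db", shared=("creator", "example-password"))
personal_conn = Connection("sales_db", mode="personal",
                           personal={"ana": ("ana", "example-password")})

print(credentials_for(shared_conn, "any_collaborator"))
print(credentials_for(personal_conn, "ana"))
```

In this model, a collaborator without an entry in `personal` gets a `PermissionError`, which corresponds to the user being asked to enter credentials before accessing the data.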

Vaults and secrets

If you need extra security for storing credentials, set up secrets and vaults to store credentials:

Cloud Pak for Data includes an internal vault that is accessible through the Credentials and Secrets API. An administrator can enable the vaults interface so that users can add secrets to the internal vault from the web client or use secrets from an external vault, such as CyberArk or HashiCorp. For the list of services that support connections that use secrets from vaults, see Managing secrets and vaults.

An administrator can disable the internal vault, ensuring that only external vaults and secrets can be used in Cloud Pak for Data. An administrator can also set a policy that requires all connections to use a vault and secrets for credentials and SSL certificates.
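
To reason about how these two administrator settings interact, it can help to model them as a check over planned connections. The following Python sketch is a planning aid under stated assumptions: the classes, the `"internal"` vault label, and the function name are hypothetical and do not reflect the Credentials and Secrets API.

```python
from dataclasses import dataclass
from typing import List, Optional

INTERNAL = "internal"  # hypothetical label for the internal vault


@dataclass(frozen=True)
class SecretRef:
    vault: str   # INTERNAL or the name of an external vault
    name: str


@dataclass(frozen=True)
class ConnectionConfig:
    name: str
    credentials_secret: Optional[SecretRef] = None   # None: credentials entered manually
    uses_ssl_certificate: bool = False
    certificate_secret: Optional[SecretRef] = None   # None: certificate entered manually


def policy_violations(conn: ConnectionConfig, require_vault: bool,
                      internal_vault_enabled: bool) -> List[str]:
    """List the ways a planned connection conflicts with the vault settings."""
    checks = [
        ("credentials", conn.credentials_secret, True),
        ("SSL certificate", conn.certificate_secret, conn.uses_ssl_certificate),
    ]
    problems = []
    for label, ref, needed in checks:
        if not needed:
            continue
        if ref is None:
            if require_vault:
                problems.append(f"{conn.name}: {label} must be stored as a vault secret")
        elif ref.vault == INTERNAL and not internal_vault_enabled:
            problems.append(f"{conn.name}: {label} secret is in the internal vault, which is disabled")
    return problems


warehouse = ConnectionConfig(
    "warehouse",
    credentials_secret=SecretRef(INTERNAL, "warehouse-login"),
    uses_ssl_certificate=True,
)
for problem in policy_violations(warehouse, require_vault=True, internal_vault_enabled=False):
    print(problem)
```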

Kerberos authentication

Kerberos is a network authentication protocol that uses strong cryptography for authentication and authorization for client/server applications. Cloud Pak for Data supports Kerberos in multiple connections to remote data sources. See Kerberos authentication in Cloud Pak for Data.

Cloud Pak for Data credentials

If you don't want users to use or view data source credentials, use Cloud Pak for Data credentials:

Some data sources allow users to use their Cloud Pak for Data credentials for authentication. Users log in to Cloud Pak for Data and never enter credentials for the data source connection. If they change their Cloud Pak for Data password, they don't need to change the password for each data source connection. The Use my platform login credentials option is available only if the respective service is deployed on the instance of Cloud Pak for Data where the users create the connection. For the list of data sources that support Cloud Pak for Data (JWT) credentials, see Authentication and authorization.

SSL in data source connections

Some data sources require you to use SSL for secure communication. Each data source has its own SSL certificate. Other data sources support SSL but do not require it. Ensure that you understand what information you need to provide to communicate securely with your data source.
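
As a general illustration of what "providing an SSL certificate" means on the client side, the following Python sketch builds a TLS context that verifies the server and optionally trusts a certificate supplied as PEM text. It uses only the Python standard library and is not how Cloud Pak for Data handles certificates internally; the function name and the assumption that the certificate is in PEM format are illustrative.

```python
import ssl
from typing import Optional


def build_tls_context(ca_certificate_pem: Optional[str] = None) -> ssl.SSLContext:
    """Create a client-side context that verifies the server certificate."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    if ca_certificate_pem is not None:
        # Additionally trust the certificate that the data source provides (PEM text).
        context.load_verify_locations(cadata=ca_certificate_pem)
    return context


context = build_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

When you plan a connection, the equivalent questions are whether the data source requires SSL, and whether its certificate is issued by a well-known authority or must be supplied with the connection.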

You can also set up secrets and vaults to store SSL certificates. (SSL certificates can have their own secrets so you can use secrets for credentials, SSL certificates, or both.) Cloud Pak for Data includes an internal vault that is accessible through the Credentials and Secrets API. An administrator can enable the vaults interface so that users can add secrets to the internal vault from the web client or use secrets from an external vault, such as CyberArk or HashiCorp. For setup information and the list of services that support connections that use secrets from vaults, see Managing secrets and vaults.

For more information about encryption and SSL, see Encryption.

FAQs

Q: My data is stored in several data sources. How can I find out whether Cloud Pak for Data supports the data sources?

A: Check whether the data sources are listed in Connection types.


Q: The data source that I use is not supported by Cloud Pak for Data. What can I do?

A: Check whether the data source has an available JDBC driver. If it does, a Cloud Pak for Data administrator can upload the JDBC JAR files so that users can create a Generic JDBC connection to the data source. Make sure that you test the connection and that it works in the services and tools that you plan to use.


Q: I want to transform my data with ETL and then analyze the data in a data-modeling tool. How do I know which tools support my data source?

A: You can use DataStage® for ETL tasks. One of the available modeling tools is the AutoAI graphical tool in Watson Studio. See the section for Supported connection types by tool at Connection types for analytics projects and catalogs and confirm that the data source is supported by both services and tools.


Q: What are "private" and "public" connections and how do they differ from connections with "personal" or "shared" credentials?

A: If you add a data source connection to Watson Knowledge Catalog, you can mark the connection as public or private. See Controlling access to an asset in a catalog.

When you create the connection to a data source, you can select either personal or shared credentials. By default, credentials are shared, and the credentials that are entered by the creator of the connection are available to all other collaborators when they access data with the connection. With personal credentials, each user enters their own credentials when they create a new connection or when they use the connection to access data. An administrator can disable shared credentials.


Q: How do I know whether my organization should create connections at the platform level or at the service level?

A: Platform-level connections are preferred because users can use the same connections with multiple services or across projects. For example, you can use the same connection with Data Refinery (included in the Watson Studio and Watson Knowledge Catalog services) to prepare the data, and then use the output in Cognos® Analytics to create a dashboard. For information about platform-level connections, see Connecting to data sources at the platform level.