June 24, 2021 By Torsten Steinbach 6 min read

IBM Cloud Pak for Data now comes with an integrated connector to data lakes on IBM Cloud Object Storage.

IBM Cloud provides a fully managed data lake platform for data stored on IBM Cloud Object Storage. The central service of this platform is IBM Cloud SQL Query, which provides data ingestion (streaming-based), data preparation, data analytics and an integrated relational table catalog.

IBM Cloud Pak® for Data as a Service — a managed platform for a broad range of data analytics and artificial intelligence (AI) — now comes with a built-in connector to the IBM Cloud SQL Query service. This allows you to set up connections to data lakes on IBM Cloud Object Storage and import the data lake's table assets into projects and catalogs in Cloud Pak for Data.

You can create and use these cloud data lake connections like any other data source connection in Cloud Pak for Data. You can create them in your own projects, in a shared catalog and in a deployment. This post provides a short description of creating a cloud data lake connection in a project.

Adding a cloud data lake connection

Navigate to one of your projects and click Add to project:

Select Connection as the asset type to add:

For connection type, select SQL Query and then click Select:

Fill out the connection properties for your SQL Query service instance. See below for details on how to identify the values for your account:

Finding the CRN of the IBM Cloud SQL Query service

Navigate to the IBM Cloud console for your services and select the SQL Query instance to which you want to connect. In the service dashboard, you’ll find the CRN of the instance, as seen in the following screenshot:
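The CRN you copy from the dashboard follows IBM Cloud's generic, colon-separated CRN format. As a rough sanity check, you can split it into its named segments; the instance values below are made-up placeholders, so substitute the CRN from your own service dashboard:

```python
# Sketch: split an IBM Cloud CRN into its named segments.
# The example CRN is made up; use the one from your SQL Query
# service dashboard instead.
CRN_FIELDS = [
    "scheme", "version", "cname", "ctype", "service-name",
    "location", "scope", "service-instance", "resource-type", "resource",
]

def parse_crn(crn: str) -> dict:
    parts = crn.split(":")
    if len(parts) != len(CRN_FIELDS) or parts[0] != "crn":
        raise ValueError("not a valid CRN")
    return dict(zip(CRN_FIELDS, parts))

example = ("crn:v1:bluemix:public:sql-query:us-south:"
           "a/0123456789abcdef0123456789abcdef:"
           "00000000-0000-0000-0000-000000000000::")
fields = parse_crn(example)
print(fields["service-name"])   # sql-query
print(fields["location"])       # us-south
```

If the split does not yield exactly ten segments starting with `crn`, you most likely copied a truncated value.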

Identifying the target Cloud Object Storage

IBM Cloud SQL Query persists query results to a defined IBM Cloud Object Storage (COS) location, which is why you must specify this location now. You can select any COS bucket to which you have write access. A recommended best practice is to specify the COS bucket that is created when you launch the SQL Query web console for the first time. Click Launch SQL Query UI. In the SQL Query UI, you can copy the value in Target location to your clipboard:
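The target location you copy from the SQL Query UI is a `cos://` URL that names a COS endpoint alias, a bucket, and an optional object prefix. As an illustration (the bucket name below is made up), such a URL is composed as follows:

```python
def cos_url(endpoint: str, bucket: str, prefix: str = "") -> str:
    """Compose a cos:// target location of the form used by SQL Query."""
    url = f"cos://{endpoint}/{bucket}"
    return f"{url}/{prefix}" if prefix else url

# Made-up bucket name; copy the real value from the SQL Query UI.
target = cos_url("us-geo", "sql-0123abcd", "result")
print(target)  # cos://us-geo/sql-0123abcd/result
```

In practice you paste the value copied from the UI rather than composing it by hand; the sketch only shows the parts the connection dialog expects.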

Getting an API key

In the IBM Cloud console, navigate to Manage > Access (IAM):

Select API keys and then click Create an IBM Cloud API key:

Specify a descriptive name for your new API key and click Create:

The API key is created and can immediately be copied to the clipboard or downloaded to your local disk. Note that you will not be able to retrieve the actual value of the API key after you close this dialog, so make sure to download or otherwise save it at this point:
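Behind the scenes, the connection authenticates with this API key by exchanging it for an IAM access token at the standard IAM token endpoint. Cloud Pak for Data handles this for you, but the payload shape is worth knowing if you ever script against the service yourself; the sketch below only builds the request (the actual POST is shown as a comment):

```python
# Sketch: build the standard IAM apikey-grant token request.
# Cloud Pak for Data performs this exchange automatically.
IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"

def token_request(api_key: str) -> dict:
    return {
        "url": IAM_TOKEN_URL,
        "headers": {"Content-Type": "application/x-www-form-urlencoded"},
        "data": {
            "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
            "apikey": api_key,
        },
    }

req = token_request("PASTE-YOUR-API-KEY")
# e.g., with the requests library:
#   resp = requests.post(req["url"], headers=req["headers"], data=req["data"])
#   access_token = resp.json()["access_token"]
print(req["url"])  # https://iam.cloud.ibm.com/identity/token
```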

Importing a single data asset from Cloud Data Lake

In your project, click Add to project and select asset type Connected data:

In the following dialog, click Select source, select your new Cloud Data Lake connection from the list of connections and select a table that you want to import. After a few seconds (once a sample query that runs in the background finishes), you can click Select.

Note that the table catalog of a fresh instance of SQL Query is empty. In this case, no tables are available to import. You first need to populate the catalog with the table definitions for the data that makes up your data lake. See the section Creating tables in Cloud Data Lake at the end of this blog post:

Provide a name for the new data asset and click Create:

After successful creation of the data asset, you can navigate to it and, for instance, do a preview of its content:

You can also generate a data profile in order to get a statistical overview of the content of a data asset:

Batch importing multiple data assets from a Cloud Data Lake

If you have a cloud data lake set up already with a larger data model of tables, you can import all of its content in one operation using the Metadata Import function.

In your project, click Add to project and select asset type Metadata Import:

Provide a name for the import job and select your connection to the Cloud Data Lake as source and then click Next:

Select all tables that you want to import. Note that there is a background query submitted for each table that you select. Then, click Choose:

Review the batch import details before clicking Create to run the import job:

After a short while, you will see all of the selected Cloud Data Lake tables as data assets in your project:

Creating tables in Cloud Data Lake

Each instance of IBM Cloud SQL Query provides its own table catalog. You can create your own tables with DDL SQL statements, much as you would in any relational database system. You can find more details about the metastore catalog in the SQL Query documentation. In addition, you will find a few ready-to-use DDL examples right inside the SQL Query UI in the Samples menu. You can create tables by explicitly specifying the list of columns and their data types. SQL Query also allows you to automatically infer the schema of the table from the data on IBM Cloud Object Storage. You then just have to specify the location of the data:
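As an illustration of schema inference, a table over data on COS can be registered with a CREATE TABLE statement that names only a format and a location. The bucket, prefix, and table names below are made-up placeholders, and the ibmcloudsql call at the end is a sketch of one way to submit DDL from Python (the SQL Query UI's editor works just as well):

```python
# Sketch: register a table in the SQL Query catalog by letting the
# service infer the schema from objects on COS. Bucket, prefix, and
# table names are made-up placeholders.
def create_table_ddl(table: str, location: str, fmt: str = "CSV") -> str:
    return (f"CREATE TABLE {table} "
            f"USING {fmt} "
            f"LOCATION {location}")

ddl = create_table_ddl("customers", "cos://us-geo/my-datalake/customers/")
print(ddl)

# To submit the DDL programmatically, one option is the ibmcloudsql
# Python SDK (shown here as an assumption, not the only route):
#   import ibmcloudsql
#   client = ibmcloudsql.SQLQuery(api_key, instance_crn, target_cos_url)
#   client.run_sql(ddl)
```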

