Db2 Big SQL tables in S3 Object Storage
Db2® Big SQL tables can be created in a location that is specified as an S3a Object Storage URI. Such tables store their data in the Object Storage service, which is typically remote from the Db2 Big SQL cluster and managed separately, for example by the IBM® Cloud Object Storage - S3 API service that is available through IBM Cloud.
About this task
Simple Storage Service (S3) is an object store interface protocol that was invented by Amazon. S3a is the name of a Hadoop component that understands the S3 protocol. S3a provides an interface for Hadoop services, such as Db2 Big SQL, to consume S3-hosted data.
Before you can use an S3a uniform resource identifier (URI) as the location of a Hive or Db2 Big SQL table, you must add the connection information for the S3a Object Storage service to the core-site.xml file. Obtain the authentication and connection property values from your Object Storage service, and add the properties in Table 1 to the S3a file system configuration. After you add these properties, restart the HDFS, MapReduce, YARN, Hive, and Db2 Big SQL services.
Table 1. S3a file system configuration properties
Property name | Property value |
---|---|
fs.s3a.impl | org.apache.hadoop.fs.s3a.S3AFileSystem |
fs.s3a.endpoint | The endpoint from which the S3 service is provided. IBM's Cloud Object Storage – S3 API is available from many regions. For example, the San Jose region's endpoint is s3-api.sjc-us-geo.objectstorage.service.networklayer.com. |
fs.s3a.secret.key | The required secret key from the S3 endpoint. |
fs.s3a.access.key | The required access key from the S3 endpoint. |
fs.s3a.bucket.bucket-name.endpoint | This property is used for connections to a specific S3 bucket. The bucket-name is the name of the bucket that is used in the URI. For example, in URI s3a://s3atables/country, the property name is fs.s3a.bucket.s3atables.endpoint. You can also use a longer version of this property, whose format is fs.s3a.bucket.bucket-name.fs.s3a.endpoint. |
fs.s3a.bucket.bucket-name.secret.key | This property is used for authentication of a specific S3 bucket. The bucket-name is the name of the bucket that is used in the URI. For example, in URI s3a://s3atables/country, the property name is fs.s3a.bucket.s3atables.secret.key. You can also use a longer version of this property, whose format is fs.s3a.bucket.bucket-name.fs.s3a.secret.key. |
fs.s3a.bucket.bucket-name.access.key | This property is used for authentication of a specific S3 bucket. The bucket-name is the name of the bucket that is used in the URI. For example, in URI s3a://s3atables/country, the property name is fs.s3a.bucket.s3atables.access.key. You can also use a longer version of this property, whose format is fs.s3a.bucket.bucket-name.fs.s3a.access.key. |
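For example, the global properties from Table 1 might appear in core-site.xml as follows; the endpoint matches the San Jose example above, and the key values are placeholders. A per-bucket override for the s3atables bucket is also shown.

```xml
<!-- S3a connection properties in core-site.xml; key values are placeholders. -->
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3-api.sjc-us-geo.objectstorage.service.networklayer.com</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
<!-- Per-bucket override, applied only to URIs such as s3a://s3atables/... -->
<property>
  <name>fs.s3a.bucket.s3atables.endpoint</name>
  <value>s3-api.sjc-us-geo.objectstorage.service.networklayer.com</value>
</property>
```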
When you use IBM Cloud Object Storage (COS), generate HMAC credentials, which contain the access_key_id and secret_access_key values that are needed for the S3 connection. You can create HMAC credentials through the IBM Cloud COS UI when you create a service credential; make sure that {"HMAC":true} is specified in the parameters field.
In the core-site.xml file, use the value of access_key_id for fs.s3a.access.key and the value of secret_access_key for fs.s3a.secret.key.
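As an illustration, the HMAC key pair can be pulled out of a downloaded service-credentials JSON document like this; the file name and sample values are placeholders, and the JSON shape shown is an assumption based on typical IBM Cloud service-credential downloads:

```shell
# Sketch: extract the HMAC key pair from a downloaded service-credentials
# JSON file. The file name and the key values below are placeholders.
cat > service-credentials.json <<'EOF'
{"cos_hmac_keys": {"access_key_id": "EXAMPLEACCESSKEY",
                   "secret_access_key": "EXAMPLESECRETKEY"}}
EOF
python3 - service-credentials.json <<'PY'
import json, sys

keys = json.load(open(sys.argv[1]))["cos_hmac_keys"]
# These two values go into fs.s3a.access.key and fs.s3a.secret.key.
print("fs.s3a.access.key =", keys["access_key_id"])
print("fs.s3a.secret.key =", keys["secret_access_key"])
PY
```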
Limitations:
Adding these fs.s3a.* properties to the core-site.xml file exposes the credentials in plain text to anyone with access to that file. If you prefer not to expose this information, see Enabling Db2 Big SQL to use a credential keystore file.
Having the credentials globally available through the core-site.xml file makes data that is stored on the object storage server accessible to any user of the Hadoop cluster. The data does not have an owner, group, or permissions set. Db2 Big SQL can use GRANT statements to restrict access to a table, but the data remains accessible to everyone through Hadoop shell commands. Note that Db2 Big SQL impersonation is not applicable to these tables.
Accessing S3a buckets and files through Hadoop
About this task
The following example uses the AWS command-line client to list the contents of the bucket s3atables, using the IBM Cloud San Jose S3 endpoint.
# aws s3 ls s3atables --recursive --endpoint-url=https://s3-api.sjc-us-geo.objectstorage.softlayer.net
2016-07-27 14:51:16 922 country/country.csv
2016-07-27 18:11:22 367 staff/job=Clerk/l1469234296227-6-r-00000
2016-07-27 18:03:29 286 staff/job=Mgr/l1469234296227-6-r-00003
2016-07-27 18:13:50 362 staff/job=Sales/l1469234296227-6-r-00003
# hdfs dfs -ls s3a://s3atables/
Found 1 items
drwxrwxrwx - 0 2016-07-27 14:39 s3a://s3atables/country
# hdfs dfs -ls s3a://s3atables/country
Found 1 items
-rw-rw-rw- 1 922 2016-07-27 14:51 s3a://s3atables/country/country.csv
By using an S3a client, you can create a bucket and add files in the S3 Object Storage service. The following command recursively lists the partitioned staff data in the same bucket:
# hdfs dfs -ls -R s3a://s3atables/staff
drwxrwxrwx - 0 2016-07-27 18:10 s3a://s3atables/staff/job=Clerk
-rw-rw-rw- 1 367 2016-07-27 18:11 s3a://s3atables/staff/job=Clerk/l1469234296227-6-r-00000
drwxrwxrwx - 0 2016-07-27 17:57 s3a://s3atables/staff/job=Mgr
-rw-rw-rw- 1 286 2016-07-27 18:03 s3a://s3atables/staff/job=Mgr/l1469234296227-6-r-00003
drwxrwxrwx - 0 2016-07-27 18:08 s3a://s3atables/staff/job=Sales
-rw-rw-rw- 1 362 2016-07-27 18:13 s3a://s3atables/staff/job=Sales/l1469234296227-6-r-00003
Creating a Db2 Big SQL table with S3a Object Storage location
About this task
The following statements create an external Db2 Big SQL table and a partitioned Db2 Big SQL table whose data is stored in the s3atables bucket, and then load data into the partitioned table from an S3a file URL.
CREATE EXTERNAL HADOOP TABLE country (
SALESCOUNTRYCODE INT,
COUNTRY VARCHAR(40),
ISOTHREELETTERCODE VARCHAR(3),
ISOTWOLETTERCODE VARCHAR(2),
ISOTHREEDIGITCODE VARCHAR(3),
CURRENCYNAME VARCHAR(50),
EUROINUSESINCE TIMESTAMP
) LOCATION 's3a://s3atables/country';
CREATE HADOOP TABLE staff (
ID SMALLINT,
NAME VARCHAR(9),
DEPT SMALLINT,
YEARS SMALLINT,
SALARY DECIMAL(7,2),
COMM DECIMAL(7,2)
) PARTITIONED BY (JOB VARCHAR(5))
LOCATION 's3a://s3atables/staff';
LOAD HADOOP USING FILE URL 's3a://data-files/staff.csv'
WITH SOURCE PROPERTIES (
'field.indexes'='1,2,3,5,6,7,4',
'replace.with.null'='NULL',
'replace.string.with.null'='NULL')
INTO TABLE staff;
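In the LOAD statement above, the field.indexes source property lists, for each target column in order, the 1-based index of the source field that supplies it; here the fourth source field feeds the partitioning column JOB. A rough shell sketch of the reordering, with an invented sample row (Db2 Big SQL does this internally):

```shell
# Hypothetical staff.csv row: the JOB value ("Mgr") is the 4th field, which
# 'field.indexes'='1,2,3,5,6,7,4' moves to the end so that it lines up with
# the table column order ID,NAME,DEPT,YEARS,SALARY,COMM,JOB.
echo '10,Sanders,20,Mgr,7,98357.50,NULL' |
  awk -F, '{ print $1","$2","$3","$5","$6","$7","$4 }'
# -> 10,Sanders,20,7,98357.50,NULL,Mgr
```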
Enabling Db2 Big SQL to use a credential keystore file
About this task
When the Object Storage URI is used as the location for a Db2 Big SQL table, the connection secret key and access key can be stored in an encrypted credential file instead of in the core-site.xml file. Two credential keystore files are required for Db2 Big SQL and the other Hadoop services to work: the first is created with the hadoop credential command, and the second with the Db2 Big SQL $BIGSQL_HOME/libexec/credential.sh script.
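For example, after the keys are stored in a keystore with the hadoop credential create command, core-site.xml can reference that keystore through the standard Hadoop credential-provider property instead of holding the keys in plain text. The jceks path below is an assumption for illustration:

```xml
<!-- Reference an encrypted credential keystore; the jceks path is an example. -->
<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://hdfs/user/bigsql/s3a.jceks</value>
</property>
```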