Using S3 Object Storage with Big SQL tables
Big SQL tables can be created in a location that is specified as an S3a Object Storage URI. Such tables store their data in the Object Storage service, which is typically remote to the Big SQL cluster and managed separately, for example by IBM®'s Cloud Object Storage - S3 API service, which is available through IBM Bluemix®.
About this task
Simple Storage Service (S3) is an object store interface protocol that was developed by Amazon. S3a is the Hadoop component that implements the S3 protocol and provides an interface for Hadoop services, such as Big SQL, to consume S3-hosted data.
Before you can use an S3a uniform resource identifier (URI) as the location of a Hive or Big SQL table, you must add the connection information for the S3a Object Storage service to the core-site.xml file. You can edit the HDFS service core-site.xml file from the Ambari web console to add the properties in Table 1 for the S3a file system configuration; a sample fragment follows the notes below. After you add these properties, restart the HDFS, MapReduce, YARN, Hive, and Big SQL services. Obtain the authentication and connection property values from your Object Storage service provider.
| Property name | Property value |
|---|---|
| fs.s3a.impl | org.apache.hadoop.fs.s3a.S3AFileSystem |
| fs.s3a.endpoint | The endpoint from which the S3 service is provided. IBM's Cloud Object Storage – S3 API is available from many regions; for example, the San Jose region's endpoint is s3-api.sjc-us-geo.objectstorage.service.networklayer.com. |
| fs.s3a.secret.key | The secret key that is required by the S3 endpoint. |
| fs.s3a.access.key | The access key that is required by the S3 endpoint. |
- The S3a component supports only one endpoint definition.
- Adding these fs.s3a.* properties to the core-site.xml file makes data that is stored in the Object Storage server accessible to anyone with access to the Hadoop cluster. The data does not have an owner, group, or permissions set. Big SQL can use GRANT statements to restrict access to a table, but the data remains accessible to everyone through Hadoop shell commands. Note that Big SQL impersonation is not applicable to these tables. If you prefer not to expose the credentials in core-site.xml, see Enabling Big SQL to use a credential keystore file.
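As an illustration only, the resulting core-site.xml entries might look like the following fragment. The endpoint is the San Jose example from Table 1, and the key values are placeholders that you must replace with the credentials from your own Object Storage service:
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3-api.sjc-us-geo.objectstorage.service.networklayer.com</value>
</property>
<!-- Placeholder credentials; obtain the real values from your Object Storage service -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>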
Accessing S3a buckets and files through Hadoop
About this task
By using an S3 client, such as the AWS CLI, you can create a bucket and add files in the S3 Object Storage service.
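As a minimal sketch (the local file path is illustrative), the following commands create the s3atables bucket and upload a country.csv file into it, using the Bluemix San Jose endpoint that appears in the listings below:
# aws s3 mb s3://s3atables --endpoint-url=https://s3-api.sjc-us-geo.objectstorage.softlayer.net
# aws s3 cp ./country.csv s3://s3atables/country/country.csv --endpoint-url=https://s3-api.sjc-us-geo.objectstorage.softlayer.net
The following listing shows the contents of the s3atables bucket (the staff files correspond to the partitioned staff table that is created and loaded later in this topic):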
# aws s3 ls s3atables --recursive --endpoint-url=https://s3-api.sjc-us-geo.objectstorage.softlayer.net
2016-07-27 14:51:16 922 country/country.csv
2016-07-27 18:11:22 367 staff/job=Clerk/l1469234296227-6-r-00000
2016-07-27 18:03:29 286 staff/job=Mgr/l1469234296227-6-r-00003
2016-07-27 18:13:50 362 staff/job=Sales/l1469234296227-6-r-00003
The same bucket and files are also visible to Hadoop through the S3a file system:
# hdfs dfs -ls s3a://s3atables/
Found 1 items
drwxrwxrwx - 0 2016-07-27 14:39 s3a://s3atables/country
# hdfs dfs -ls s3a://s3atables/country
Found 1 items
-rw-rw-rw- 1 922 2016-07-27 14:51 s3a://s3atables/country/country.csv
A recursive listing shows the partition directories and data files of the staff table:
# hdfs dfs -ls -R s3a://s3atables/staff
drwxrwxrwx - 0 2016-07-27 18:10 s3a://s3atables/staff/job=Clerk
-rw-rw-rw- 1 367 2016-07-27 18:11 s3a://s3atables/staff/job=Clerk/l1469234296227-6-r-00000
drwxrwxrwx - 0 2016-07-27 17:57 s3a://s3atables/staff/job=Mgr
-rw-rw-rw- 1 286 2016-07-27 18:03 s3a://s3atables/staff/job=Mgr/l1469234296227-6-r-00003
drwxrwxrwx - 0 2016-07-27 18:08 s3a://s3atables/staff/job=Sales
-rw-rw-rw- 1 362 2016-07-27 18:13 s3a://s3atables/staff/job=Sales/l1469234296227-6-r-00003
Creating a Big SQL table with S3a Object Storage location
About this task
You specify the S3a URI in the LOCATION clause of the CREATE HADOOP TABLE statement. The following example creates an external table over the country.csv file that is stored in the s3atables bucket:
CREATE EXTERNAL HADOOP TABLE country (
SALESCOUNTRYCODE INT,
COUNTRY VARCHAR(40),
ISOTHREELETTERCODE VARCHAR(3),
ISOTWOLETTERCODE VARCHAR(2),
ISOTHREEDIGITCODE VARCHAR(3),
CURRENCYNAME VARCHAR(50),
EUROINUSESINCE TIMESTAMP
) LOCATION 's3a://s3atables/country';
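After it is defined, the table can be queried like any other Big SQL table. The following query is an illustrative example (not part of the original listing) that reads the country data directly from Object Storage:
SELECT salescountrycode, country, currencyname
FROM country
FETCH FIRST 5 ROWS ONLY;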
The next example creates a partitioned table; each value of the JOB column maps to a job= subdirectory under s3a://s3atables/staff, as shown in the earlier Hadoop listing:
CREATE HADOOP TABLE staff (
ID SMALLINT,
NAME VARCHAR(9),
DEPT SMALLINT,
YEARS SMALLINT,
SALARY DECIMAL(7,2),
COMM DECIMAL(7,2)
) PARTITIONED BY (JOB VARCHAR(5))
LOCATION 's3a://s3atables/staff';
You can populate the table with the LOAD HADOOP statement. In this example, the source file is itself read from an S3a location, the data-files bucket:
LOAD HADOOP USING FILE URL 's3a://data-files/staff.csv'
WITH SOURCE PROPERTIES (
'field.indexes'='1,2,3,5,6,7,4',
'replace.with.null'='NULL',
'replace.string.with.null'='NULL')
INTO TABLE staff;
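As a follow-up sketch (this query is illustrative, not from the original example), a predicate on the partitioning column limits the scan to the matching job= directory in Object Storage:
SELECT name, salary
FROM staff
WHERE job = 'Mgr';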