Automatic Hive catalog syncing to the Big SQL catalog

Tables that are created, altered, or dropped by Hive clients can have the associated catalog changes automatically applied to Big SQL.

Enabling automatic syncing

Automatic syncing of the Big SQL catalog and the Hive metastore is enabled by default. If you have disabled automatic syncing (see Disabling automatic syncing), you can enable this feature again by selecting the Enable Metadata Sync service actions menu item.

When you enable this feature, you also create a scheduled event for the following synchronization call:

db2 "CALL SYSPROC.ADMIN_TASK_ADD ('Synchronise MetaData Changes from Hive',
  NULL, NULL, NULL, NULL, 'SYSHADOOP', 'HCAT_AUTOSYNC_OBJECTS', NULL, NULL, NULL)

You can invoke the HCAT_SYNC_OBJECTS stored procedure to add all current Hive tables from all schemas to the Big SQL catalog:

db2 "CALL SYSHADOOP.HCAT_SYNC_OBJECTS('.*', '.*', 'a', 'REPLACE', 'CONTINUE', 'TRANSFER OWNERSHIP TO HIVEOWNER')"

Important: Tables that are created under the Hive default schema are not automatically synced; you must synchronize these tables manually if you want them in Big SQL.

Running Metadata Sync

You can invoke the HCAT_SYNC_OBJECTS stored procedure for all current schemas and tables by selecting the Run Metadata Sync service actions menu item. With impersonation enabled, this action simply reports that impersonation is turned on. Users are responsible for synchronizing their own tables and schemas. The bigsql user does not necessarily have permission to query a table in the HDFS, and invoking this action as bigsql for all user schemas and tables might result in permissions errors.

Monitoring the synchronization

The bigsql.log file receives log information directly from the HCAT_AUTOSYNC_OBJECTS process.

Automatic syncing uses a scheduled DB2 administrative task scheduler (ATS) sync procedure. The description of the sync procedure is Synchronise MetaData Changes from Hive. You can monitor this procedure like any other ATS scheduled procedures. For information about ATS and how to monitor such procedures, see Administrative task scheduler.

By default, event files are placed in the HDFS under /user/bigsql/sync. The file pattern for those files is fixed, and only files that match this pattern are processed. The general file pattern is DDL-timestamp-UUID.json. For example: DDL-[0-9]{17}-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}.json. If files are left in this directory for some time, it is good practice to investigate. If you disable automatic syncing, files collect in this directory but are not processed. You must determine whether or not such files should be retained. Alternatively, you can manually remove the hive.metastore.event.listeners entry and restart Hive. In this case, files do not accumulate in the sync directory.

Disabling automatic syncing

You can disable automatic syncing from the Ambari interface. Click Big SQL > Service Actions and select Disable Metadata Sync. Disabling automatic syncing removes the scheduled process. Events are still gathered and placed in the sync directory, but are never processed. The following procedure call is the equivalent command line option:

db2 "CALL SYSPROC.ADMIN_TASK_REMOVE ('Synchronise MetaData Changes from Hive',NULL)"

Synchronization errors

In the event of a synchronization failure, the associated event-file is moved from the events-directory (by default: /user/bigsql/sync) to an "errors" sub-directory (by default: /user/bigsql/sync/errors). It is good practice to monitor this directory for any synchronization errors and address any issues where necessary.

When any issues have been addressed, and where synchronization for the associated object(s) is still required, the relevant event-files can then be moved back into the events-directory (by default: /user/bigsql/sync) where automatic synchronization will take place.