ibm-data2vec

Description

The ibm-data2vec function implements a self-supervised database embedding algorithm. Database embedding takes as input a text file created from a multi-modal relational table and uses the relational data model to build a relationship map between text tokens. The training document generated from the table consists of string tokens that represent the relational entities in the original table. The ibm-data2vec function treats the training document as a set of sentences, where each sentence represents one row of the table. Unlike traditional natural language processing approaches such as word embedding, database embedding treats each sentence as an unordered bag of tokens (words), in which each word is related equally to every other word. In addition, ibm-data2vec supports two special tokens: primary key tokens, which represent a row, and EMPTY tokens, which represent relational NULL values.
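
The bag-of-tokens view can be illustrated with a minimal Java sketch. It is not the ibm-data2vec implementation. It only shows what "each word is related equally to every other word" means for a single row: every pair of tokens in the row counts as related, regardless of position. The class name RowBagSketch, the method tokenPairs, and the token spellings are hypothetical and do not come from the library.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not part of ibm-data2vec.
final class RowBagSketch {

    /**
     * Returns every unordered pair of tokens in one row.
     * No distance or window is applied, so all tokens in the row
     * are paired with each other exactly once.
     */
    static List<String[]> tokenPairs(List<String> rowTokens) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < rowTokens.size(); i++) {
            for (int j = i + 1; j < rowTokens.size(); j++) {
                pairs.add(new String[] {rowTokens.get(i), rowTokens.get(j)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical row; the real token spelling used in training
        // documents is not shown in this topic.
        List<String> row = List.of("PK_1001", "CITY_Austin", "PLAN_Gold", "TENURE_Long");
        for (String[] pair : tokenPairs(row)) {
            System.out.println(pair[0] + " <-> " + pair[1]);
        }
    }
}
```

A row of n tokens yields n(n-1)/2 pairs, so the four-token row above produces six pairs. A word embedding approach would typically restrict pairs to a sliding window over an ordered sentence.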

After training completes, ibm-data2vec generates, for each token, a vector of predefined length (dimension) that encodes the meaning of that token. The inferred meaning captures the collective contributions of the neighboring tokens in all rows in which the token appears. The core numerical computations of the training process are parallelized across multiple threads and use hardware acceleration. The final trained model is stored as a binary file in the Db2® zload format.

Format

The ibm-data2vec component can be invoked only through the ZADE Data2Vec instance function:

⋮
ZADE zade = new ZADE();
zade.Data2Vec(args);
⋮

Parameters

int num_threads

The number of threads to use for parallelization.

String input_file

The name of the input file.

String output_file

The name of the output file.

String format

The format for the Db2 storage layout.

String vocab_file

An optional parameter that is the name of the vocab file.

If vocab_file is specified without the vocab_file_fmt parameter, format 1 of the vocab file is generated.

int vocab_file_fmt

An optional parameter that is the format number of the vocab file. The supported formats are:

1: The first vocab file format.
2: Adds DB2_GENERATED_COLUMNNAME columns.

If omitted, format 1 of the vocab file will be generated.

Note: The vocab_file and vocab_file_fmt parameters will be deprecated in a future release.
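
The parameter rules above can be summarized in a small Java sketch that collects the values and resolves the vocab_file_fmt default before a call is made. The class Data2VecParamsSketch is hypothetical and is not part of ZADE. This topic does not show how the values are packed into the args passed to Data2Vec, so the sketch does not attempt that step. The lower bound on num_threads and the treatment of input_file, output_file, and format as required are assumptions.

```java
// Illustrative sketch only; Data2VecParamsSketch is not a ZADE class.
final class Data2VecParamsSketch {
    final int numThreads;
    final String inputFile;
    final String outputFile;
    final String format;
    final String vocabFile;   // null when the optional vocab_file is omitted
    final int vocabFileFmt;   // resolved format number: 1 or 2

    Data2VecParamsSketch(int numThreads, String inputFile, String outputFile,
                         String format, String vocabFile, Integer vocabFileFmt) {
        // Assumption: at least one thread is needed; the topic states no range.
        if (numThreads < 1) {
            throw new IllegalArgumentException("num_threads must be at least 1");
        }
        // Assumption: these are required because the topic does not mark them optional.
        if (inputFile == null || outputFile == null || format == null) {
            throw new IllegalArgumentException("input_file, output_file, and format are required");
        }
        // From the topic: format 1 is generated when vocab_file_fmt is omitted.
        int fmt = (vocabFileFmt == null) ? 1 : vocabFileFmt;
        // From the topic: only formats 1 and 2 are listed as supported.
        if (fmt != 1 && fmt != 2) {
            throw new IllegalArgumentException("vocab_file_fmt must be 1 or 2");
        }
        this.numThreads = numThreads;
        this.inputFile = inputFile;
        this.outputFile = outputFile;
        this.format = format;
        this.vocabFile = vocabFile;
        this.vocabFileFmt = fmt;
    }

    public static void main(String[] args) {
        // File names and the format value are placeholders, not documented values.
        Data2VecParamsSketch p = new Data2VecParamsSketch(
                8, "training.txt", "model.bin", "FORMAT_PLACEHOLDER", "vocab.txt", null);
        System.out.println("vocab_file_fmt resolves to " + p.vocabFileFmt);
    }
}
```

Checking the values up front keeps an unsupported vocab_file_fmt from reaching the training run. Because vocab_file and vocab_file_fmt are planned for deprecation, keeping them nullable makes them simple to drop later.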

Output files

binary: A file, written to the location specified by the output_file parameter, that contains the trained model in the Db2 zload format.
log: A file, stored in the current working directory, that contains the execution log messages from the function.