CSV keyword definitions

Use this topic to obtain detailed information about the keywords that are used in a configuration file to mask data in a CSV file. This topic expands on the information in the CSV Quick reference guide topic, and it is primarily intended for new users and users who create configuration files infrequently.

The keywords that are described in this topic are shown in uppercase letters for emphasis, but you can enter them in uppercase or lowercase letters, or any combination of the two. An equal sign is used to separate each keyword from its user-assigned value, such as FILETYPE=CSV.

Tip: You can use the CSV configuration file template to create a configuration file by doing the following. Copy and paste the configuration file template into a text editor, and modify the template to meet your needs. Then, name the configuration file and save it for use with the data privacy application.

Note: Some of the keywords in this topic are used in the same manner to mask data in other file types, such as XML files. Those keywords are repeated in the quick reference guides for each file type for ease of referencing by users who mask files of only one type. Other keywords, such as the STRICTMETADATA keyword, are used for all supported file types, but their use varies slightly from one file type to another.

REPORT

Specify the file location and report file name for the directory in which you want the file masking report saved, along with the name under which you want it saved. This keyword applies to all work items in a configuration file, so specify it only once in each file.

You can specify a file path or URI for the directory and report file name.

Note: When masking with Windows, the location provided must be on the local file system and not on the Hadoop Distributed File System (HDFS).

Here is an example of a valid path and report file name in Windows:
REPORT=C:\IBM\InfoSphere\Optim\mod\MODApp\Reports\ccn.xml

In Windows, file paths are entered with backward slashes, as shown in the example, but URIs are entered with forward slashes.

Here is an example of a valid URI and report file name:
REPORT=file:///tmp/reports/report.xml

In non-Windows environments, forward slashes are used in both file paths and URIs.

Note: A file masking report is generated for each configuration file and saved in XML format and HTML format. The report lists relevant details about the masking process, including the number of elements that were masked and the number that could not be masked. The report also includes links to any exception files generated during the masking process. A separate exceptions file is created for each input file and work item for which any error or warning messages were issued. The exception files are saved as text files in a user-specified output directory.

REPLACE

Specify Yes or No to indicate whether you want the output and report files overwritten if they already exist, unless you are masking in Hadoop.

Specify Yes if you want your output files to have the same name as your input files, and you want your output files and report files overwritten if they already exist.

Specify No if you do not want your output files to have the same name as your input files, and you do not want your output files and report files overwritten if they already exist. If you specify No, a unique number is appended to the output file name to ensure uniqueness. For example, if the input file is named mask_ccn.csv, the output file is renamed mask_ccn_0001.csv. The number that is appended to the file name is incremented each time the masking process is run against the same input file. For example, if an input file named mask_ccn.csv is masked three times, the output files are saved as mask_ccn_0001.csv, mask_ccn_0002.csv, and mask_ccn_0003.csv.

The REPLACE keyword is not used when masking in Hadoop, so delete this line from the configuration file or comment it out by typing a pound symbol (#) at the beginning of the line. This keyword is not used when masking in Hadoop because the output directory cannot exist before the masking job is run. Since the output directory cannot exist before masking, the output directory cannot contain any files to be overwritten.

WORKITEM

Specify a character string, such as a name or number, to identify each work item in the configuration file. Each WORKITEM keyword identifies the start of a new work item in the configuration file. We recommend assigning a unique identifier to each work item because this entry is often used in error messages to identify the location in which an error occurred.

Here are a few examples of how you might identify your work items:

WORKITEM=1

WORKITEM=CCN_masking

WORKITEM=“ EMAIL MASKING ”

As the last example illustrates, if the value specified for this keyword includes significant leading or trailing spaces, enclose the string in single (') or double quotation marks (").

Every configuration file must contain at least one work item. The keywords that follow the first work item apply to the first work item. This is the case until another WORKITEM keyword is encountered in the file, in which case any subsequent keywords apply to the second work item. This process continues throughout the configuration file until all work items are processed.

Each work item can include the following keywords, beginning with the INPUT keyword.

INPUT

Identify the directory in which the files you want to mask are located:

In Hadoop, specify an HDFS location, as in this example: INPUT=hdfs:/user/iod/iod_in.

In all other cases, specify a path or URI for the input directory.
- Here is an example of a Windows path: INPUT=C:\IBM\InfoSphere\Optim\mod\MODApp\Input_Files.
- Here is an example of a URI: INPUT=file:///tmp/myinput.

Note: The files you want to mask are identified later in the configuration file with the FILEPATTERN keyword, except when you are masking in Hadoop. The FILEPATTERN keyword is ignored when masking files in Hadoop because every file in the input directory is read, and the output is a collection of files of masked data from the nodes of the Hadoop cluster. In Hadoop, the contents of an output file might be derived from a number of input files.

OUTPUT

Identify the directory in which you want the masked files saved:

In Hadoop, specify an HDFS location that is different from the input directory, as in this example: OUTPUT=hdfs:/user/iod/iod_out.

In all other cases, specify a path or URI for the output directory.
- Here is an example of a Windows path: OUTPUT=C:\IBM\InfoSphere\Optim\mod\MODApp\Output_Files.
- Here is an example of a URI: OUTPUT=file:///tmp/myoutput.

You can specify the same path or URI for the input and output directories, except in Hadoop. However, if you specify the same location for both directories and REPLACE=Yes, the masked output file will overwrite the input file, resulting in the loss of the unmasked data. But if REPLACE=No, a unique number is appended to the output file name to ensure that the input file is not overwritten.

HADOOPCACHE

Do not specify this keyword unless you are masking in Hadoop. When masking in Hadoop, identify the HDFS location for the cache directory.

Here is an example of an HDFS location for a cache directory: HADOOPCACHE=hdfs:/user/my/cache.

Hadoop provides a distributed cache into which large files are placed. When a Hadoop cache directory is specified, all the files in the directory are entered into the distributed cache.

Note: Before you mask CSV data in Hadoop, read the information in Prerequisites and restrictions for masking in Hadoop, and then do the steps that are described in Masking in Hadoop.

FILEPATTERN

Identify the files you want to mask in the input directory as follows, unless you are masking data in Hadoop.

To mask a single file, specify a specific file name, such as FILEPATTERN=customers.csv.

To mask multiple files with similar names, use the asterisk wildcard character to indicate you want to mask all input files that match a certain file pattern, such as FILEPATTERN=US_CUST*.

You also can use Java™ regular expression syntax or regex to identify the file pattern.

The FILEPATTERN keyword is not used when masking in Hadoop, so delete this line from the configuration file or comment it out by typing a pound symbol (#) at the beginning of the line. This keyword is not used when masking in Hadoop because every file in the input directory is read, and the output is a collection of files of masked data from the nodes of the Hadoop cluster. In Hadoop, the contents of an output file might be derived from a number of input files.

FILETYPE: Specify CSV as the file type for the files you want to mask.

FIELDDELIMITER: Specify the character that is used to separate fields in the CSV file, such as a comma (,). This delimiter cannot be the same as the DATADELIMITER.

DATADELIMITER: Specify the character that is used to identify significant leading or trailing spaces in the data, such as single or double quotation marks (‘ or “). This delimiter cannot be the same as the FIELDDELIMITER.
Field delimiter characters that are contained within a data delimiter character are treated as ordinary text. For example, if FIELDDELIMITER=, and DATADELIMITER=”, it would mean that the line one, two, “three, four” contains three fields, not four fields, with the final field being the data item three, four.

ESCAPECHARACTER: Specify an escape character to process delimiter characters as text. The rules for using the specified escape character in a configuration file conform to standard C and Java usage. So, to specify a backslash as an escape character, type \\, as in this example: ESCAPECHARACTER=\\.

PARALLEL

Specify None, Low, Medium, or High to identify the degree of parallel operations that are allowed when multiple files are being masked.

Setting this keyword to Low, Medium, or High might boost performance, especially on systems with multiple processors.

When parallelism is used, each thread works on an input file, so there are never more threads than input files. For example, if only one file is specified, only one thread is created, regardless of how this keyword is set.

STRICTMETADATA

Specify Yes or No to indicate whether you want a warning message displayed if a row contains more or less fields than the number identified in the configuration file.

Specify Yes to receive a warning message.

Specify No to avoid receiving a warning message. No is default value.

Here is an example of how this keyword works. If STRICTMETADATA=Yes and the field specification portion of the configuration file indicates that there are 10 fields in a row, a warning message is issued if either of the following conditions exist:

There are nine fields or fewer in the row.

There are 11 fields or more in the row.

BULKSIZE

Specify any integer from 0 to 2147483647 (2,147,483,647) to allow batch processing of a specified number of rows. A bulk size of 100, for example, would provide 100 rows of data to the data swapping provider, which would interchange the contents of those 100 rows.

A zero entry effectively means the same thing as an entry of 1. If a bulk size is not specified, each row is processed separately.

The largest feasible bulk size is a product of parallelism, the size of a data row, and the current Java heap size. With maximum parallelism, you can have 16 threads. If a data row is 1024 bytes on average and bulk size is set to 100, you would have 100 x 16 x 1024 bytes. A typical default Java heap size is 256 megabytes. With a bulk size of 16,000, a thread count of 16, and a 1024-byte record, you might easily exhaust the available heap, resulting in an out of memory error. So, exercise caution when you specify an integer for the BULKSIZE keyword.

CHARSETNAME: Identify the character set in which the input files are encoded, such as UTF-8. The default value is the default character set encoding of the Java virtual machine on which the data privacy application is running.
Valid values for this keyword are also determined by the Java implementation of the computer on which the data privacy application is running. The specified character set name is valid if a call to Java using the object invocation Charset.isSupported(charsetname) returns true on the computer running the application.

CONTINUEONERROR

Specify Yes or No to indicate whether you want processing to continue if a nonfatal error is encountered in a work item. A nonfatal error is an error that does not require immediate termination of all processing.

Specify Yes to continue processing any other work items in the configuration file if a nonfatal error is encountered.

Specify No to terminate processing, without processing any other work items in the configuration file, if a nonfatal error is encountered.

Note: Three keywords are provided in the configuration file to control how the application behaves when various error conditions are encountered during the masking process: CONTINUEONERROR, MAXERRORFILE, and MAXERRORRUN.

MAXERRORFILE

Identify the maximum number of masking errors that are allowed in a file before processing is terminated. Specify any integer from 0 to 999999999999 (12 nines).

Zero means that an infinite number of masking errors are allowed. Thus, the lowest tolerance level you can set is 1, which means processing terminates if one masking error is encountered in a file during the masking process.

If the number of masking errors that are encountered in a file exceeds this limit, processing terminates and an error message similar to the following is issued:


The masking failed because the number of errors encountered in an individual file exceeds 
the limit specified for the “maxerrorfile” keyword. The error limit specified for that 
keyword is x, and the number of errors currently encountered is y.

MAXERRORRUN

Identify the maximum number of masking errors that are allowed in a work item before processing is terminated. Specify any integer from 0 to 999999999999 (12 nines).

If the number of masking errors that are encountered in a work item exceeds this limit, processing terminates and an error message similar to the following is issued:


The masking of the work item failed because the number of errors encountered exceeds 
the limit specified for the “maxerrorrun” keyword. The error limit specified for that 
keyword is x, and the number of errors currently encountered is y.

ACTION: Identify the type of action to be done. Currently, the only valid setting is ACTION=Masking.
You must specify this keyword and value in each work item before the field specifications portion of the file.

Field Specifications for CSV files

This portion of the configuration file is used to specify the field specifications for the fields to be masked. It is also used to identify all of the fields in the CSV file in the order in which they appear in the file. This information is required for all fields in the input file so that the application can correctly identify the fields that you want to mask.

The primary field specification keywords, other than those used to specify the masking arguments, are FIELD, DATATYPE, KEY, and MASK.

FIELD

Identify the name of every field in the input file, including any fields that you do not want to mask and any user-defined key fields. (Key fields are described later in this topic.)

Fields are named entities, so the first value after the equals sign is the name of the field, as in this example: FIELD=CUST_ID.

You must list every field in the input file, as shown in the following example:


FIELD=CUST_ID,KEY=1
FIELD=CUSTNAME,KEY=2
FIELD=PHONE_NUMBER
FIELD=YTD_SALES
FIELD=SALESMAN_ID
FIELD=CREDITCARD_NUMBER,mask="PROVIDER=CCN,FLDDEF1=(NAME=CREDITCARD_NUMBER,-
-DT=WVARCHAR_SZ,LEN=80),MTD=repeatable,WHENINV=PRE"
FIELD=CREDITCARD_TYPE
FIELD=CREDITCARD_EXP
FIELD=CREDITCARD_CVV
FIELD=DRIVER_LICENSE
FIELD=CREDITCARD_HISTORY

In the preceding example, the user wants to mask only one field, the CREDITCARD_NUMBER field. Two fields are identified as KEY fields, CUST_ID and CUSTNAME. The remaining fields will not participate in the masking processes, but are included in the configuration file to show the order in which the fields appear in the input file.

If a field is referenced in the masking specification, that field's name must match the name that is specified in the masking string, as indicated by the two CREDITCARD_NUMBER entries in the following example:


FIELD=CREDITCARD_NUMBER,MASK="PROVIDER=CCN,FLDDEF1=(NAME=CREDITCARD_NUMBER,-
-DT=WVARCHAR_SZ,LEN=80),MTD=repeatable,WHENINV=PRE"

In the preceding example, notice that a hyphen (-) line-continuation character is entered at the end of the first line and the beginning of the second line. The user entered the hyphens to indicate that the masking information is specified over two lines.

DATATYPE

Use this optional keyword to specify the data type for each field you want to mask. If not specified, the data type entered in the field definition (FLDDEFn) masking string is used. If a data type is not entered in the masking string, WVARCHAR_SZ is used.

The data privacy application supports all of the data types available with the Optim data privacy providers. However, since CSV files are inherently text-based, WVARCHAR_SZ and DATETIME_SZ are the only values that are generally used with these files.

If you use this keyword, specify the appropriate data type for the field after the FIELD keyword and value, as in this example.


FIELD=MR1,DATATYPE=DATETIME_SZ,MASK="PROVIDER=AGE,FLDDEF1=(NAME=MR1,DT=DATETIME_SZ),YEAR=1,-
-SRCDF=\"%YYYY/%MM/%DD\"

In the example, the DATATYPE keyword is set to DATETIME_SZ, and that data type is repeated in the provider string (see the DT=DATETIME_SZ entry). Also, notice that the hyphen (-) line-continuation character is entered at the end of the first line and the beginning of the second line to indicate that the masking information is specified over two lines.

KEY

Identify any key fields in the input file that you want to appear in error messages to help identify the row in which an error occurred, as illustrated later in this description.

First, you need to know how to assign key numbers to your key fields. You can assign a key number to any field in the file, except the fields you want to mask. You cannot identify a masking-eligible field as part of a row key because they contain sensitive data that must not be included in error messages.

Key fields must be numbered sequentially from 1 through the maximum number of fields in a row. For example, if you want to assign keys to three fields in a row, those fields must be numbered sequentially, without any gaps in the numbers, as in the following sequence: KEY=1, KEY=2, and KEY=3. The fields, however, do not have to appear sequentially within a row. The first key field might be assigned to the third field in the row, and the second and third keys might be assigned (in any order) to the fifth and seventh fields in the row.

Keys also must be unsigned: -1 and +1, for example, are both invalid.

Row keys serve a similar purpose to the primary keys in a database table. The keys are used in reporting problems encountered masking a field. This is helpful when row numbers cannot be supplied to help identify where an error occurred. For example, the concept of row numbers does not exist for Hadoop masking, nor does it exist in some circumstances for file-system file masking. Thus, row keys can be used in those circumstances to identify the location of an error.

Here is an example of how row keys are used to help identify an error in a row. In this example, two fields (CUST_ID and CUSTNAME) are identified as KEY fields.


FIELD=CUST_ID,KEY=1
FIELD=CUSTNAME,KEY=2

Now, in the following example, the user is masking customer credit card numbers. During the masking process, an invalid card number is encountered in a row in which CUST_ID is 10237, and CUSTNAME is ABC Company. Given this scenario, an error message similar to the following message would be issued to identify the row in which the error occurred:


An error has occurred in row 151 with key '10237','ABC Company'.
Element CREDITCARD_NUMBER is invalid and cannot be masked.

This message not only identifies where the error occurred in the file, it also shows the name and ID of the customer who has an invalid credit card number. As this example illustrates, assigning key numbers to one or more fields in a row can help identify the precise row in which an error was encountered. This information, in turn, can be helpful in researching and analyzing the reason for an error, and correcting the problem. It is recommended, therefore, that you use this optional keyword to assign keys to one or more fields in a data row.

MASK

Specify the masking PROVIDER, field definition (FLDDEFn), and masking syntax for each masking field. The syntax that is used in the masking application is identical to that used in the Optim data privacy providers, UDFs, and Lua scripts. Therefore, if the data types and column names match, you can copy and paste strings from any of these sources into a configuration file.

The Optim data privacy provider library (ODPP) includes a set of out-of-the-box privacy algorithms, referred to as data privacy providers. The masking application supports the following providers:

CCN for masking credit card numbers
AFF for Affinity data masking of undifferentiated or dynamically formatted values
NID for masking National IDs
AGE for masking age by incrementing or decrementing a date value
EMAIL for masking email addresses
HASH to generate an integer hash code from a string

For details about specifying the masking syntax for each provider, see the appropriate provider topic. For example, to mask an email address, see Email privacy provider.

For information about specifying the field definition (FLDDEFn) parameter for each provider, see Field definition parameter.

For general information about the syntax used in the provider and field definition topics, see Syntax conventions.

When you specify a masking string with the MASK keyword, the masking string must be enclosed in single or double quotes, as in this example:


FIELD=CREDITCARD_NUMBER,MASK="PROVIDER=CCN,FLDDEF1=(NAME=CREDITCARD_NUMBER,DT=WVARCHAR_SZ,-
-LEN=80),MTD=repeatable,WHENINV=PRE"

In the example, notice that a hyphen (-) line-continuation character is entered at the end of the first line and the beginning of the second line. The user entered the hyphens to indicate that the masking information is specified over two lines.

If the masking string includes single quotes within an entry, use double quotes to enclose the masking string. Conversely, if the masking string includes double quotes within an entry, use single quotes to enclose the masking string, as illustrated in this example.


FIELD=DT,MASK='PROVIDER=AGE,FLDDEF1=(NAME="DT",DATATYPE=DATETIME_WSZ,LENGTH=128),-
-YEAR=1,SRCDF="%YYYY/%MM/%DD"'

In the preceding example, notice that the NAME and SRCDF entries include double quotes ("), so the masking string is enclosed in single quotes ('). Also, notice that a hyphen is entered at the end of the first line and the beginning of the second line to indicate the masking information is specified over two lines.