XML keyword definitions

Use this topic to obtain detailed information about the keywords that are used in a configuration file to mask data in an XML file. This topic expands on the information in the XML Quick reference guide topic, and it is primarily intended for new users and users who create configuration files infrequently.

The keywords that are described in this topic are shown in uppercase letters for emphasis, but you can enter them in uppercase or lowercase letters, or any combination of the two. An equal sign is used to separate each keyword from its user-assigned value, such as FILETYPE=XML.

Tip: To create a configuration file from the XML configuration file template, copy and paste the template into a text editor and modify it to meet your needs. Then name the configuration file and save it for use with the data privacy application.
Note: Some of the keywords in this topic are used in the same manner to mask data in other file types, such as CSV files. Those keywords are repeated in the quick reference guides for each file type for ease of referencing by users who mask files of only one type. Other keywords, such as the STRICTMETADATA keyword, are used for all supported file types, but their use varies slightly from one file type to another.
REPORT
Specify the directory in which you want the file masking report saved, along with the file name under which you want it saved. This keyword applies to all work items in a configuration file, so specify it once in each file.

You can specify a file path or URI for the directory and report file name.

  • Here is an example of a valid path and report file name in Windows:

    REPORT=C:\IBM\InfoSphere\Optim\mod\MODApp\Reports\ccn.xml

    In Windows, file paths are entered with backward slashes, as shown in the example, but URIs are entered with forward slashes.

  • Here is an example of a valid URI and report file name:

    REPORT=file:///tmp/reports/report.xml

    In non-Windows environments, forward slashes are used in both file paths and URIs.

Note: A file masking report is generated for each configuration file and saved in XML format and HTML format. The report lists relevant details about the masking process, including the number of elements that are masked and the number that could not be masked. The report also includes links to any exception files generated during the masking process. A separate exception file is created for each input file and work item for which any error or warning messages were issued. The exception files are saved as text files in a user-specified output directory.
REPLACE
Specify Yes or No to indicate whether you want the output and report files overwritten if they already exist. This keyword applies to all work items in a configuration file, so specify it once in each file.
  • Specify Yes if you want your output files to have the same name as your input files, and you want your output files and report files overwritten if they already exist.
  • Specify No if you do not want your output files to have the same name as your input files, and you do not want your output files and report files overwritten if they already exist. If you specify No, a unique number is appended to the output file name to ensure uniqueness. For example, if the input file is named mask_ccn.xml, the output file is renamed mask_ccn_0001.xml. The number that is appended to the file name is incremented each time the masking process is run against the same input file. For example, if an input file named mask_ccn.xml is masked three times, the output files are saved as mask_ccn_0001.xml, mask_ccn_0002.xml, and mask_ccn_0003.xml.
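The numbering scheme for REPLACE=No can be sketched as follows. This is a minimal Java illustration, not the application's actual code: the class and method names are hypothetical, and the four-digit, zero-padded suffix is inferred only from the mask_ccn_0001.xml example above.

```java
class OutputNameSketch {
    // Build the output name for a given run number by inserting
    // "_NNNN" before the file extension (hypothetical helper).
    static String numberedName(String inputName, int run) {
        int dot = inputName.lastIndexOf('.');
        String stem = dot < 0 ? inputName : inputName.substring(0, dot);
        String ext = dot < 0 ? "" : inputName.substring(dot);
        return String.format("%s_%04d%s", stem, run, ext);
    }

    public static void main(String[] args) {
        // Three masking runs against the same input file.
        for (int run = 1; run <= 3; run++) {
            System.out.println(numberedName("mask_ccn.xml", run));
        }
    }
}
```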
WORKITEM
Specify a character string, such as a name or number, to identify each work item in the configuration file. Each WORKITEM keyword identifies the start of a new work item in the configuration file. We recommend assigning a unique identifier to each work item because this entry is often used in error messages to identify the location in which an error occurred.

Here are a few examples of how you might identify your work items:

  • WORKITEM=1
  • WORKITEM=CCN_masking
  • WORKITEM="  EMAIL MASKING  "

As the last example illustrates, if the value specified for this keyword includes significant leading or trailing spaces, enclose the string in single (') or double quotation marks (").

Every configuration file must contain at least one work item. The keywords that follow the first work item apply to the first work item. This is the case until another WORKITEM keyword is encountered in the file, in which case any subsequent keywords apply to the second work item. This process continues throughout the configuration file until all work items are processed.
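The way keywords attach to the most recent WORKITEM entry can be pictured with a short Java sketch. It is an illustration under stated assumptions, not the application's parser: the class name, the "(global)" bucket for keywords that precede the first work item (such as REPORT and REPLACE), and the handling of repeated identifiers are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class WorkItemSketch {
    // Group KEYWORD=value lines under the WORKITEM line that precedes them.
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> items = new LinkedHashMap<>();
        String current = "(global)"; // keywords seen before any WORKITEM
        items.put(current, new ArrayList<>());
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // not a KEYWORD=value line
            }
            // Keywords are case-insensitive, so normalize them to uppercase.
            String keyword = line.substring(0, eq).trim().toUpperCase(Locale.ROOT);
            String value = line.substring(eq + 1).trim();
            if (keyword.equals("WORKITEM")) {
                current = value; // a new work item starts here
                items.putIfAbsent(current, new ArrayList<>());
            } else {
                items.get(current).add(keyword + "=" + value);
            }
        }
        return items;
    }
}
```

Because the sketch files repeated identifiers under the same entry, it also shows why unique work item identifiers make error messages easier to trace.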

Each work item can include the following keywords, beginning with the INPUT keyword.

INPUT
Identify the directory in which the files you want to mask are located. You can specify a file path or URI.
  • Here is an example of a Windows path: INPUT=C:\IBM\InfoSphere\Optim\mod\MODApp\Input_Files.
  • Here is an example of a URI: INPUT=file:///tmp/myinput.
Note: The files you want to mask are identified later in the configuration file with the FILEPATTERN keyword.
OUTPUT
Identify the directory in which you want the masked files saved. You can specify a file path or URI.
  • Here is an example of a Windows path: OUTPUT=C:\IBM\InfoSphere\Optim\mod\MODApp\Output_Files.
  • Here is an example of a URI: OUTPUT=file:///tmp/myoutput.

You can specify the same path or URI for the input and output directories. However, if you do that and REPLACE=Yes, the masked output file will overwrite the input file, resulting in the loss of the unmasked data. But if REPLACE=No, a unique number is appended to the output file name to ensure that the input file is not overwritten.

HADOOPCACHE
XML files currently cannot be masked in Hadoop, so this keyword is not used when masking XML files.
FILEPATTERN
Identify the files you want to mask in the input directory, as follows:
  • To mask a single file, specify a specific file name, such as FILEPATTERN=customers.xml.
  • To mask multiple files with similar names, use the asterisk wildcard character to indicate you want to mask all input files that match a certain file pattern, such as FILEPATTERN=US_CUST*.
  • You can also use Java™ regular expression (regex) syntax to identify the file pattern.
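To preview which file names a pattern would select, you can test it with the standard java.util.regex classes. The following sketch is illustrative only: the class and method names are hypothetical, and it assumes that a pattern must match the whole file name and that matching is case-sensitive, neither of which is confirmed by this topic.

```java
import java.util.regex.Pattern;

class FilePatternSketch {
    // Translate an asterisk wildcard into an equivalent Java regex:
    // literal text is quoted, and each '*' becomes ".*".
    static Pattern fromWildcard(String wildcard) {
        String[] parts = wildcard.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern wildcard = fromWildcard("US_CUST*");
        // A hand-written regex that selects only XML files with the same prefix.
        Pattern regex = Pattern.compile("US_CUST.*\\.xml");
        for (String name : new String[] {"US_CUST_east.xml", "US_CUST.bak", "EU_CUST.xml"}) {
            System.out.println(name + " wildcard=" + wildcard.matcher(name).matches()
                    + " regex=" + regex.matcher(name).matches());
        }
    }
}
```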
FILETYPE
Specify XML as the file type for the files you want to mask.
FIELDDELIMITER
This keyword is not used when masking an XML file.
DATADELIMITER
This keyword is not used when masking an XML file.
ESCAPECHARACTER
This keyword is not used when masking an XML file.
PARALLEL
Specify None, Low, Medium, or High to identify the degree of parallel operations that are allowed when multiple files are being masked.

Setting this keyword to Low, Medium, or High might boost performance, especially on systems with multiple processors.

When parallelism is used, each thread works on an input file, so there are never more threads than input files. For example, if only one file is specified, only one thread is created, regardless of how this keyword is set.

STRICTMETADATA
Specify Yes or No to indicate whether you want a warning message displayed if any of the fields that are referenced in the configuration file are not present in an XML record. (The repeating XML structure that makes up each record in an XML file is indicated by the NODE keyword in the configuration file.)
  • Specify Yes to receive a warning message.
  • Specify No to avoid receiving a warning message. No is the default value.
Note: This keyword is ignored if the DATAPRESENCE keyword is specified.
BULKSIZE
Specify any integer from 0 to 2147483647 (2,147,483,647) to allow batch processing of a user-specified number of XML records. A bulk size of 100, for example, would provide 100 records to the data swapping provider, which would interchange the contents of those 100 records.

An entry of 0 has the same effect as an entry of 1. If a bulk size is not specified, each XML record is processed separately.

The largest feasible bulk size depends on the degree of parallelism, the size of a data record, and the current Java heap size. With maximum parallelism, you can have 16 threads. If a data record is 1024 bytes on average and the bulk size is set to 100, the batches hold 100 x 16 x 1024 bytes (about 1.6 MB) of record data. A typical default Java heap size is 256 megabytes. With a bulk size of 16,000, a thread count of 16, and a 1024-byte record, the record data alone amounts to about 250 MB, so you might easily exhaust the available heap, resulting in an out-of-memory error. Therefore, exercise caution when you specify an integer for the BULKSIZE keyword.
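Before you choose a bulk size, you can run the same arithmetic against the heap that is actually available. The following Java sketch is a rough estimate under stated assumptions: the class and method names are hypothetical, and the calculation counts only raw record bytes, so the real memory footprint of the records as Java objects is likely to be higher.

```java
class BulkSizeSketch {
    // Raw record data held in memory: bulk size x threads x average record size.
    static long estimatedBytes(long bulkSize, long threads, long avgRecordBytes) {
        return bulkSize * threads * avgRecordBytes;
    }

    public static void main(String[] args) {
        long heap = Runtime.getRuntime().maxMemory(); // maximum heap of this JVM
        for (long bulk : new long[] {100, 16_000}) {
            long needed = estimatedBytes(bulk, 16, 1024);
            System.out.printf("BULKSIZE=%d needs about %d bytes of %d available%n",
                    bulk, needed, heap);
        }
    }
}
```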

CHARSETNAME
Identify the character set in which the input files are encoded, such as UTF-8. The default value is the default character set encoding of the Java virtual machine on which the data privacy application is running.

Valid values for this keyword are also determined by the Java implementation of the computer on which the data privacy application is running. The specified character set name is valid if a call to the static Java method Charset.isSupported(charsetName) returns true on the computer running the application.
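You can run this check yourself before you edit the configuration file. The following sketch uses the standard java.nio.charset.Charset class; the wrapper class and method names are hypothetical. Run it with the same Java installation that runs the data privacy application, because the result depends on that installation.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class CharsetCheck {
    // True if this JVM can use the name as a CHARSETNAME value.
    static boolean isUsable(String name) {
        try {
            return name != null && Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name is not even syntactically legal
        }
    }

    public static void main(String[] args) {
        // The charset that applies when CHARSETNAME is not specified.
        System.out.println("JVM default: " + Charset.defaultCharset());
        System.out.println("UTF-8 usable: " + isUsable("UTF-8"));
    }
}
```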

CONTINUEONERROR
Specify Yes or No to indicate whether you want processing to continue if a nonfatal error is encountered in a work item. A nonfatal error is an error that does not require immediate termination of all processing.
  • Specify Yes to continue processing any other work items in the configuration file if a nonfatal error is encountered.
  • Specify No to terminate processing, without processing any other work items in the configuration file, if a nonfatal error is encountered.
Note: Three keywords are provided in the configuration file to control how the application behaves when various error conditions are encountered during the masking process: CONTINUEONERROR, MAXERRORFILE, and MAXERRORRUN.
MAXERRORFILE
Identify the maximum number of masking errors that are allowed in a file before processing is terminated. Specify any integer from 0 to 999999999999 (12 nines).

Zero means that an unlimited number of masking errors are allowed. Thus, the lowest tolerance level you can set is 1, which means processing terminates if one masking error is encountered in a file during the masking process.

If the number of masking errors that are encountered in a file exceeds this limit, processing terminates and an error message similar to the following is issued:

The masking failed because the number of errors encountered in an individual file exceeds 
the limit specified for the “maxerrorfile” keyword. The error limit specified for that 
keyword is x, and the number of errors currently encountered is y.
MAXERRORRUN
Identify the maximum number of masking errors that are allowed in a work item before processing is terminated. Specify any integer from 0 to 999999999999 (12 nines).

Zero means that an unlimited number of masking errors are allowed. Thus, the lowest tolerance level you can set is 1, which means processing terminates if one masking error is encountered in a work item during the masking process.

If the number of masking errors that are encountered in a work item exceeds this limit, processing terminates and an error message similar to the following is issued:


The masking of the work item failed because the number of errors encountered exceeds 
the limit specified for the “maxerrorrun” keyword. The error limit specified for that 
keyword is x, and the number of errors currently encountered is y.
ACTION
Identify the type of action to be done. Currently, the only valid setting is ACTION=Masking.

You must specify this keyword and value in each work item before the field specifications portion of the file.

Field Specifications for XML files

This portion of the configuration file is used to specify the field specifications for the fields to be masked and any user-defined key fields. Unlike in a CSV file, you do not have to identify all of the fields in the XML file in the order in which they appear in the file. In an XML file, the location of a masking field is identified by a combination of two keywords, the NODE and PATH keywords.

The primary field specification keywords, other than those used to specify the masking arguments, are NODE, FIELD, PATH, DATATYPE, DATAPRESENCE, KEY, and MASK.

NODE
Specify the base path to the fields (or data elements) you want to mask and any user-specified key fields. (Key fields are described later in this topic.)

The location of the fields to be masked is specified in XPath syntax:

  • The first portion of the XPath, which is the base path, is specified with the NODE keyword. The specified value identifies the path to the repeating (record) structure in the XML file that is referenced during the masking process. This value is made up of a path of XML elements, relative to the root node, and a forward slash is used to separate each element in the path.
  • The second portion of the XPath, which is the partial path to the field, is identified with the PATH keyword. This entry identifies the location of each masking field or key field in the file, relative to the record path identified with the NODE keyword.

The complete path to each field or data element is determined by combining the base path for the NODE with the PATH for a specific field. Thus, NODE + PATH = the absolute field path.

In the following XML file example, the base path for the node is /customers/customer/.


<?xml version="1.0" encoding="utf-8"?>
<customers>
      <customer>
            <email_address>bobby@yahoo.com</email_address>
      </customer>
</customers>

Based on this example, you would set the NODE keyword to NODE=/customers/customer/, and you would specify the path to the email field (later in the configuration file) as PATH=email_address. Given these two paths, the absolute field path to the email field is /customers/customer/email_address.
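To confirm that a NODE and PATH combination resolves to the fields you expect, you can evaluate the combined path with the standard Java XPath API. This sketch is for checking paths only and does not reflect how the application reads XML files. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodePathSketch {
    // NODE + PATH = absolute field path (NODE may or may not end with '/').
    static String absolutePath(String node, String path) {
        return node.endsWith("/") ? node + path : node + "/" + path;
    }

    // Return the text of every element that the combined path selects.
    static List<String> fieldValues(String xml, String node, String path) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                    absolutePath(node, path),
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                values.add(hits.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid path or XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<customers><customer>"
                + "<email_address>bobby@yahoo.com</email_address>"
                + "</customer></customers>";
        System.out.println(absolutePath("/customers/customer/", "email_address"));
        System.out.println(fieldValues(xml, "/customers/customer/", "email_address"));
    }
}
```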

FIELD
Identify the name of each field or data element you want to mask, and identify each key field in the XML file.

Fields are named entities, so the first value after the equals sign is the name of the field, as in this example: FIELD=CUST_ID.

If a field is referenced in the masking specification, that field's name must match the name that is specified in the masking string, as indicated by the EMAIL entries in the following example:


FIELD=EMAIL,PATH=email_address,MASK="PROVIDER=EML,MTD=REP,FLDDEF1=(NAME=EMAIL,DT=WVARCHAR_SZ)"
PATH
Specify the partial path to each masking field or data element and any user-specified key fields. This entry identifies the location of each field in the file, relative to the record path identified with the NODE keyword.

The complete path to each field or data element is determined by combining the base path for the NODE with the PATH for a specific field. See the NODE keyword for an example of how this works.

Here is an example of how the partial path for an email field might be specified: PATH=email_address.

DATATYPE
Use this optional keyword to specify the data type for each field you want to mask. If not specified, the data type entered in the field definition (FLDDEFn) masking string is used. If a data type is not entered in the masking string, WVARCHAR_SZ is used.

The data privacy application supports all of the data types available with the Optim™ data privacy providers. However, since XML files are inherently text-based, the only values that are generally used with these files are WVARCHAR_SZ and DATETIME_SZ.

If you use this keyword, specify the appropriate data type for the field after the FIELD keyword and value, as in this example.


FIELD=MR1,PATH=MR1,DATATYPE=DATETIME_SZ,MASK="PROVIDER=AGE,FLDDEF1=(NAME=MR1,DT=DATETIME_SZ),-
-YEAR=1,SRCDF=\"%YYYY/%MM/%DD\""

In the example, the DATATYPE keyword is set to DATETIME_SZ, and that data type is repeated in the provider string (see the DT=DATETIME_SZ entry). Also, notice that a hyphen (-) line-continuation character is entered at the end of the first line and the beginning of the second line. The user entered the hyphens to indicate that the masking information is specified over two lines.

DATAPRESENCE
Indicate what type of message, if any, should be displayed if an XML field is not found in the input file in the location indicated by the NODE and PATH keywords.
  • To receive an error message, specify Mandatory.
  • To receive a warning message, specify Warn.
  • To receive no message, specify Optional.
Note: If you do not specify this keyword but STRICTMETADATA=Yes, a warning message is displayed.
KEY
Identify any key fields in the input file that should appear in error messages to help identify the location in which an error occurred, as illustrated later in this description.

First, you need to know how to assign key numbers to your key fields. You can assign a key number to any field in the file, except the fields you want to mask. You cannot identify a masking-eligible field as a key because such fields contain sensitive data that should not be included in error messages.

Key fields must be numbered sequentially from 1 through the maximum number of fields in a record. For example, if you want to assign keys to three fields in a record, those fields must be numbered sequentially, without any gaps in the numbers, as in the following sequence: KEY=1, KEY=2, and KEY=3. The fields, however, do not have to appear sequentially within the records.

Keys also must be unsigned: -1 and +1, for example, are both invalid.
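The numbering rules can be expressed as a short check. This Java sketch is an illustration, not the application's validation logic: the class and method names are hypothetical, and it assumes that key numbers cannot repeat, cannot have leading zeros, and can be listed in any order in the configuration file.

```java
import java.util.List;
import java.util.TreeSet;

class KeyNumberSketch {
    // True if the KEY values are unsigned integers that form 1..n
    // with no gaps and no repeats.
    static boolean isValid(List<String> keyValues) {
        TreeSet<Integer> seen = new TreeSet<>();
        for (String raw : keyValues) {
            if (!raw.matches("[1-9][0-9]{0,8}")) {
                return false; // rejects signed values such as -1 and +1
            }
            if (!seen.add(Integer.parseInt(raw))) {
                return false; // the same key number was used twice
            }
        }
        // With n distinct positive numbers, "no gaps" means the largest is n.
        return seen.isEmpty() || seen.last() == seen.size();
    }
}
```

For example, the values 1, 2, and 3 pass the check, whereas 1 and 3 fail because key number 2 is missing.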

Key fields serve a similar purpose to the primary keys in a database table. The keys are used in reporting problems that are encountered while masking a field. This is helpful when row numbers cannot be supplied to identify where an error occurred, as in the case of an XML file.

Here is an example of how keys are used to help identify where an error occurred. In this example, two fields (CUST_ID and CUSTNAME) are identified as KEY fields.


FIELD=CUST_ID,PATH=cid,KEY=1
FIELD=CUSTNAME,PATH=cname,KEY=2

Now, in the following example, the user is masking customer credit card numbers. During the masking process, an invalid card number is encountered in a record in which CUST_ID is 10237, and CUSTNAME is ABC Company. Given this scenario, an error message similar to the following message would be issued to identify the row in which the error occurred:


An error has occurred in row <unavailable> with key '10237','ABC Company'. 
Element CREDITCARD_NUMBER is invalid and cannot be masked.

This message not only identifies where the error occurred in the file, it also shows the name and ID of the customer who has an invalid credit card number. As this example illustrates, assigning key numbers to one or more fields in a record can help identify the record in which an error was encountered. This information, in turn, can be helpful in researching and analyzing the reason for an error, and correcting the problem. It is recommended, therefore, that you use this optional keyword to assign keys to one or more fields in a record.

MASK
Specify the masking PROVIDER, field definition (FLDDEFn), and masking syntax for each masking field. The syntax that is used in the masking application is identical to that used in the Optim data privacy providers, UDFs, and Lua scripts. Therefore, if the data types and column names match, you can copy and paste strings from any of these sources into a configuration file.

The Optim data privacy provider library (ODPP) includes a set of out-of-the-box privacy algorithms, referred to as data privacy providers. The data privacy application supports the following providers:

  • CCN for masking credit card numbers
  • AFF for Affinity data masking of undifferentiated or dynamically formatted values
  • NID for masking National IDs
  • AGE for masking age by incrementing or decrementing a date value
  • EMAIL for masking email addresses
  • HASH to generate an integer hash code from a string

For details about specifying the masking syntax for each provider, see the individual provider topics. For example, to mask an email address, see Email privacy provider.

For information about specifying the field definition (FLDDEFn) parameter for each provider, see Field definition parameter.

For general information about the syntax that is used in the provider and field definition topics, see Syntax conventions.

When you specify a masking string with the MASK keyword, the masking string must be enclosed in single or double quotation marks, as in this example:


FIELD=CREDITCARD_NUMBER,PATH=CCN,MASK="PROVIDER=CCN,FLDDEF1=(NAME=CREDITCARD_NUMBER,DT=WVARCHAR_SZ,-
-LEN=80),MTD=repeatable,WHENINV=PRE"

In the example, notice that a hyphen (-) line-continuation character is entered at the end of the first line and the beginning of the second line. The user entered the hyphens to indicate that the masking information is specified over two lines.
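The effect of the paired hyphens can be sketched as a small routine that joins the two physical lines into one logical line. This is a hypothetical Java illustration, not the application's parser. It assumes that any line ending in a hyphen is continued, so it would misread a value that legitimately ends with a hyphen.

```java
import java.util.ArrayList;
import java.util.List;

class ContinuationSketch {
    // Join physical lines into logical lines: a trailing '-' marks a
    // continued line, and the leading '-' of the next line is dropped.
    static List<String> join(List<String> physical) {
        List<String> logical = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean continuing = false;
        for (String line : physical) {
            String text = line;
            if (continuing && text.startsWith("-")) {
                text = text.substring(1);
            }
            boolean more = text.endsWith("-");
            if (more) {
                text = text.substring(0, text.length() - 1);
            }
            current.append(text);
            if (!more) {
                logical.add(current.toString());
                current.setLength(0);
            }
            continuing = more;
        }
        if (continuing) {
            logical.add(current.toString()); // dangling continuation at end of input
        }
        return logical;
    }
}
```

Applied to the two lines of the preceding example, the routine yields a single field specification in which DT=WVARCHAR_SZ is followed directly by LEN=80.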

If the masking string includes single quotes within an entry, use double quotes to enclose the masking string. Conversely, if the masking string includes double quotes within an entry, use single quotes to enclose the masking string, as illustrated in this example.


FIELD=DT,PATH=DATE,MASK='PROVIDER=AGE,FLDDEF1=(NAME="DT",DATATYPE=DATETIME_WSZ,LENGTH=128),-
-YEAR=1,SRCDF="%YYYY/%MM/%DD"'

In the preceding example, notice that the NAME and SRCDF entries include double quotes ("), so the masking string is enclosed in single quotes ('). Also, notice that a hyphen is entered at the end of the first line and the beginning of the second line to indicate that the masking information is specified over two lines.
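If you generate configuration files with a script, the quoting rule can be applied automatically. The following Java sketch is a hypothetical helper that covers only the rule described here: it chooses the quotation mark that does not occur in the masking string and rejects strings that contain both kinds.

```java
class MaskQuoteSketch {
    // Enclose a masking string in the quotation mark that it does not contain.
    static String enclose(String maskingString) {
        boolean hasDouble = maskingString.indexOf('"') >= 0;
        boolean hasSingle = maskingString.indexOf('\'') >= 0;
        if (hasDouble && hasSingle) {
            throw new IllegalArgumentException(
                    "Masking string contains both single and double quotes");
        }
        String quote = hasDouble ? "'" : "\""; // double quotes unless already used inside
        return quote + maskingString + quote;
    }

    public static void main(String[] args) {
        System.out.println("MASK=" + enclose("PROVIDER=AGE,YEAR=1,SRCDF=\"%YYYY/%MM/%DD\""));
        System.out.println("MASK=" + enclose("PROVIDER=CCN,MTD=repeatable,WHENINV=PRE"));
    }
}
```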