Analysis threshold settings

Analysis threshold settings specify how InfoSphere™ Information Analyzer performs analysis. You can modify the individual settings for column analysis, table analysis, and cross-table analysis. Each threshold is a percentage: the share of values that must satisfy a condition before the system makes the corresponding inference.

A threshold setting is a percentage that specifies the proportion of data values that must meet a condition before the system makes an inference. For example, if the threshold setting ConstantThreshold equals 99%, then at least 99% of the values in a column must be the same for that column to be inferred as a constant. If 100% of the values in a column are the same, the column is inferred as a constant. If 99.1% of the values in the column are the same, the column is also inferred as a constant. However, in a table of one million rows, 99.1% means that 9,000 values differ from the constant value, so the inference might be wrong. When you review the frequency distribution of the column, you can examine those values and determine whether the remaining 0.9% are inconsistent or invalid. You can then change the inference in the frequency distribution if you choose.
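
The arithmetic in this example can be checked with a short sketch. It is an illustration only, not the product's implementation: the function name infer_constant is hypothetical, and the comparison rule (most frequent value's percentage equal to or greater than the threshold) is taken from the description above.

```python
from collections import Counter

def infer_constant(values, constant_threshold=99.0):
    """Return (is_constant, top_value, top_pct, differing_count) for one column.

    Hypothetical helper: a column is treated as constant when its most
    frequent value covers at least constant_threshold percent of the rows.
    """
    if not values:
        return False, None, 0.0, 0
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    top_pct = 100.0 * top_count / len(values)
    differing = len(values) - top_count
    return top_pct >= constant_threshold, top_value, top_pct, differing

# One million rows in which 99.1% of the values are identical.
column = ["A"] * 991_000 + ["B"] * 9_000
is_constant, value, pct, differing = infer_constant(column)
print(is_constant, value, round(pct, 1), differing)
```

The column passes the 99% threshold, yet 9,000 rows still hold a different value, which is why reviewing the frequency distribution remains worthwhile.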

Column analysis settings

The following table shows the column analysis settings that you can modify. Note that the default values listed are the system defaults. If you modified the default system settings in the Analysis Settings workspace, then the modified settings are displayed. View column analysis settings by going to the Home navigator menu and selecting Configuration > Analysis Settings.

Table 1. Column analysis settings
Setting	Description
Nullability threshold	Infers whether a column allows null values. If a column has null values with a frequency percentage equal to or greater than the nullability threshold, the system determines that the column allows null values. If null values do not exist in the column or their frequency percentage is less than the threshold, the system determines that the column does not allow null values. The default is 1.0%.
Uniqueness threshold	Infers whether a column is considered unique. If a column has a percentage of unique values equal to or greater than the uniqueness threshold, the system determines that the column is unique. The default is 99.0%.
Constant threshold	Infers whether a column contains constant values. If a column has a single distinct value with a frequency percentage equal to or greater than the constant threshold, the system determines that the column is constant. The default is 99.0%.
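
The three rules in Table 1 can be sketched together as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the function name column_inferences is hypothetical, null values are represented by Python None, and "unique values" is assumed to mean values that occur exactly once in the column.

```python
from collections import Counter

def column_inferences(values, nullability=1.0, uniqueness=99.0, constant=99.0):
    """Apply the three column analysis thresholds (percentages) to one column."""
    total = len(values)
    if total == 0:
        return {"nullable": False, "unique": False, "constant": False}
    counts = Counter(values)
    null_count = counts.get(None, 0)
    null_pct = 100.0 * null_count / total
    # Assumption: a value is "unique" when it occurs exactly once.
    unique_pct = 100.0 * sum(1 for c in counts.values() if c == 1) / total
    # Share of rows covered by the single most frequent value.
    top_pct = 100.0 * max(counts.values()) / total
    return {
        "nullable": null_count > 0 and null_pct >= nullability,
        "unique": unique_pct >= uniqueness,
        "constant": top_pct >= constant,
    }

# 995 distinct identifiers plus 5 nulls: unique, and not inferred as nullable.
ids = list(range(995)) + [None] * 5
print(column_inferences(ids))

# 990 identical values plus 10 nulls: constant, and inferred as nullable.
flags = ["X"] * 990 + [None] * 10
print(column_inferences(flags))
```

With the default thresholds, 0.5% nulls falls below the 1.0% nullability threshold, while 1.0% nulls meets it exactly.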

Classification order of preference settings

The following table shows the classification order settings that you can enable based on your column analysis preferences. You can enable an additional level of data classification such as credit card number or name by clicking Enable. You can change the order of the classification settings that you have enabled by using the arrows to the right of the Classification Order of Preference section.

Table 2. Classification order of preference settings
Data class	Description
Account Number	Infers whether a column can be considered an account number.
Addresses	Infers whether a column can be considered an address.
AMEX	Infers whether a column can be considered an American Express credit card number.
Bank Account Number	Infers whether a column can be considered a bank account number.
Canadian SIN	Infers whether a column can be considered a Canadian social insurance number (SIN).
Code	A column that contains code values that represent a specific meaning. For example, a column with the class of Code might contain data about the area code in a telephone number.
Company Name	Infers whether a column can be considered a company name.
Computer Addresses	Infers whether a column can be considered a computer address.
Country Code	Infers whether a column can be considered a country or region code.
Credit Card Number	Infers whether a column can be considered a credit card number.
Date	Infers whether a column can be considered chronological data. For example, a column with the class of Date might contain data such as 10/10/07.
Date of Birth	Infers whether a column can be considered a date of birth.
Diners Club	Infers whether a column can be considered a Diners Club credit card number.
Discover	Infers whether a column can be considered a Discover credit card number.
Drivers License	Infers whether a column can be considered a driver's license number.
Email Address	Infers whether a column can be considered an e-mail address.
France INSEE	Infers whether a column can be considered a French National Institute for Statistics and Economic Studies (INSEE) number.
Gender	Infers whether the column is a gender code, such as Male/Female or M/F.
Host Name	Infers whether a column can be considered a host name.
Identification Number	Infers whether a column can be considered an identification number.
Identifier	A data value that is used to reference a unique entity. For example, a column with the class of Identifier might be a primary key or contain unique information such as a customer number.
Indicator	Infers whether a column contains binary values such as M/F, 0/1, True/False, Yes/No.
International Securities Identification Number	Infers whether a column can be considered an international securities identification number.
International Standard Book Number	Infers whether a column can be considered an international standard book number.
IP Address	Infers whether a column can be considered an IP address.
Italy Fiscal Code	Infers whether a column can be considered an Italian fiscal code.
Japan CB	Infers whether a column can be considered a Japanese CB number.
Large Object	A large object is a column whose length is greater than the length threshold. A column defined as a large object will not be explicitly profiled. Large Object is assigned to columns that have a BLOB data type.
MasterCard	Infers whether a column can be considered a MasterCard credit card number.
Names	Infers whether a column can be considered a name.
Notes Email Address	Infers whether a column can be considered an IBM Lotus Notes® e-mail address.
Passport Number	Infers whether a column can be considered a passport number.
Personal Addresses	Infers whether a column can be considered a personal address.
Personal Identification Information	Infers whether a column can be considered personal identification information.
Personal Identification Numbers	Infers whether a column can be considered a personal identification number.
Personal Name	Infers whether a column can be considered an individual's personal name.
Quantity	A column that contains data about a numerical value. For example, a column with the class of Quantity might contain data about the price of an object.
Spanish NIF	Infers whether a column can be considered a Spanish tax identification number (número de identificación fiscal, or NIF).
Text	A column that contains unformatted alphanumeric data and special character data.
UK NINO	Infers whether a column can be considered a UK national insurance number (NINO).
Universal Product Code	Infers whether a column can be considered a universal product code.
Unknown	Columns that cannot be classified by the system are defined as Unknown. The Unknown class is applied temporarily during analysis.
URL	Infers whether a column can be considered a Web address.
US Phone Number	Infers whether a column can be considered a U.S. telephone number.
US SSN	Infers whether a column can be considered a U.S. Social Security number (SSN).
US State Code	Infers whether a column can be considered a U.S. state code or abbreviation.
US Zip	Infers whether a column can be considered a U.S. postal code.
US Zip+4	Infers whether a column can be considered an extended U.S. ZIP Code.
Visa	Infers whether a column can be considered a Visa credit card number.
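
The effect of the order of preference can be pictured with a small sketch. Everything in it is an assumption made for illustration: this topic does not describe the product's detection rules, so the regular expressions, the classify function, and the min_match_pct parameter are all hypothetical, and the sketch simply assumes that the first enabled class that matches wins.

```python
import re

# Hypothetical matchers; the real detection rules are not documented here.
MATCHERS = {
    "US Zip+4": re.compile(r"\d{5}-\d{4}"),
    "US Zip": re.compile(r"\d{5}"),
    "Gender": re.compile(r"m|f|male|female", re.IGNORECASE),
    "Indicator": re.compile(r"m|f|0|1|true|false|yes|no", re.IGNORECASE),
}

def classify(values, order, min_match_pct=95.0):
    """Return the first class in `order` that enough values fully match.

    Assumption: classes are tried in the configured order of preference and
    the first one that reaches min_match_pct is chosen.
    """
    if not values:
        return "Unknown"
    for name in order:
        pattern = MATCHERS[name]
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if 100.0 * hits / len(values) >= min_match_pct:
            return name
    return "Unknown"

codes = ["M", "F", "F", "M"]
print(classify(codes, ["Gender", "Indicator"]))
print(classify(codes, ["Indicator", "Gender"]))
```

Because M/F values satisfy both the Gender and the Indicator descriptions, the same column receives a different class depending on which data class is listed first.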

Table analysis settings

The following table shows the table analysis settings that you can modify for primary key analysis and foreign key analysis. Note that the default values listed are the system defaults. If you modified the default system settings in the Analysis Settings workspace, then the revised settings are displayed.

Table 3. Table analysis settings
Setting	Description
Primary key threshold	Infers whether a column, either a single column or multiple column concatenation, can be considered a primary key candidate.
Data sample size	Controls the number of records that are included when a data sample of the table or file is created. The default is 2,000 records.
Data sample method	Determines which type of method is used to create a data sample: sequential, random, or every nth value.
Data sample parameter	Specifies the value of n for the every nth value data sampling method. The default is 10.
Composite key maximum	Determines the maximum number of columns that can be combined when you search for primary key candidates. The default is 2.
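
The settings in Table 3 interact roughly as in the following sketch. It is illustrative only: the function names sample_rows and key_candidates are hypothetical, the 99.0 default for the primary key threshold is a placeholder because this topic does not state one, the every nth value method is assumed to start at the first row, and a column combination is assumed to qualify when the percentage of rows with a key value that occurs exactly once meets the threshold.

```python
import random
from collections import Counter
from itertools import combinations

def sample_rows(rows, method="sequential", size=2000, n=10, seed=None):
    """Build a data sample with one of the three sampling methods."""
    if method == "sequential":
        return rows[:size]                      # first `size` rows
    if method == "random":
        rng = random.Random(seed)
        return rng.sample(rows, min(size, len(rows)))
    if method == "nth":
        return rows[::n][:size]                 # rows 0, n, 2n, ...
    raise ValueError(f"unknown sampling method: {method}")

def key_candidates(rows, columns, threshold=99.0, composite_max=2):
    """List column combinations whose values are unique often enough.

    rows is a list of dicts; threshold is a placeholder percentage.
    """
    found = []
    if not rows:
        return found
    for width in range(1, composite_max + 1):
        for combo in combinations(columns, width):
            counts = Counter(tuple(row[c] for c in combo) for row in rows)
            unique_rows = sum(1 for c in counts.values() if c == 1)
            if 100.0 * unique_rows / len(rows) >= threshold:
                found.append(combo)
    return found

table = [{"id": i, "region": i % 3, "seq": i // 3} for i in range(30)]
sample = sample_rows(table, method="nth", size=2000, n=10)
print(len(sample))
print(key_candidates(table, ["id", "region", "seq"]))
```

Raising the composite key maximum widens the search: the number of combinations to test grows quickly with each additional column, which is one reason to run the search on a data sample.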

Cross-table analysis settings

You can modify the default cross-table analysis settings. If you modified the default system settings in the Analysis Settings workspace, then the modified settings are displayed.

Common domain threshold setting: Determines the percentage of distinct values in the frequency distribution of one column that match distinct values in the frequency distribution of another column. If the percentage of matching distinct values is equal to or greater than the threshold, then the two columns are inferred to have a common domain. The default is 98%.
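
The common domain test can be sketched as a comparison of distinct value sets. This is a minimal illustration, not the product's implementation: the function name common_domain is hypothetical, and the percentage is assumed to be measured against the distinct values of the first column.

```python
def common_domain(col_a, col_b, threshold=98.0):
    """Return (has_common_domain, pct) for col_a measured against col_b.

    pct is the percentage of distinct values in col_a that also appear
    among the distinct values of col_b.
    """
    distinct_a = set(col_a)
    distinct_b = set(col_b)
    if not distinct_a:
        return False, 0.0
    pct = 100.0 * len(distinct_a & distinct_b) / len(distinct_a)
    return pct >= threshold, pct

orders = list(range(100))        # 100 distinct values
customers = list(range(2, 200))  # 198 distinct values, 98 shared with orders
print(common_domain(orders, customers))
print(common_domain(customers, orders))
```

The result depends on direction: 98 of the 100 distinct values in the first column appear in the second, which meets the 98% default, while fewer than half of the second column's distinct values appear in the first.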