IBM InfoSphere Information Analyzer, Version 8.5

Analysis threshold settings

Analysis threshold settings control how InfoSphere™ Information Analyzer draws inferences during analysis. You can modify the individual settings for column analysis, table analysis, and cross-table analysis. Each threshold is a percentage: the share of values that must satisfy a condition before the system makes the corresponding inference.

For example, if the threshold setting ConstantThreshold equals 99%, then at least 99% of the values in a column must be the same for that column to be inferred as a constant. If 100% of the values in a column are the same, the column is inferred as a constant. If 99.1% of the values are the same, the column is also inferred as a constant. However, in a table of one million rows, 99.1% leaves 9,000 values that differ from the constant, so the inference might be wrong. When you review the frequency distribution of a column, you can examine those values and determine whether the remaining 0.9% are inconsistent or invalid. You can then change the inference in the frequency distribution if you choose.
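The arithmetic in this example can be checked with a short sketch. The function name constant_inference and its return shape are illustrative assumptions, not part of the product; the sketch only applies the "equal to or greater than" comparison to the most frequent value in a column.

```python
from collections import Counter


def constant_inference(values, constant_threshold=99.0):
    """Illustrative only: report whether one value dominates the column.

    Returns (is_constant, top_value, top_percent, other_count).
    """
    if not values:
        raise ValueError("column has no values")
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    top_percent = 100.0 * top_count / len(values)
    other_count = len(values) - top_count  # values that differ from the constant
    return top_percent >= constant_threshold, top_value, top_percent, other_count


# One million rows in which 99.1% of the values are identical.
column = ["A"] * 991_000 + ["B"] * 9_000
is_constant, value, percent, others = constant_inference(column)
print(is_constant, value, round(percent, 1), others)  # True A 99.1 9000
```

The column passes the 99% threshold, yet 9,000 rows still hold a different value, which is why the inference is worth reviewing in the frequency distribution.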

Column analysis settings

The following table shows the column analysis settings that you can modify. Note that the default values listed are the system defaults. If you modified the default system settings in the Analysis Settings workspace, then the modified settings are displayed. View column analysis settings by going to the Home navigator menu and selecting Configuration > Analysis Settings.
Table 1. Column analysis settings
Setting Description
Nullability threshold Infers whether a column allows null values. If a column has null values with a frequency percentage equal to or greater than the nullability threshold, the system determines that the column allows null values. If null values do not exist in the column or their frequency percentage is less than the threshold, the system determines that the column does not allow null values. The default is 1.0%.
Uniqueness threshold Infers whether a column is considered unique. If a column has a percentage of unique values equal to or greater than the uniqueness threshold, the system determines that the column is unique. The default is 99.0%.
Constant threshold Infers whether a column contains constant values. If a column has a single distinct value with a frequency percentage equal to or greater than the constant threshold, the system determines that the column is constant. The default is 99.0%.
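The three column analysis inferences in Table 1 can be sketched as simple percentage comparisons. This is a minimal illustration, not the product's implementation: the function name infer_column_properties is hypothetical, nulls are represented as Python None, and "percentage of unique values" is assumed here to mean the share of rows whose value occurs exactly once.

```python
from collections import Counter


def infer_column_properties(values, nullability=1.0, uniqueness=99.0, constant=99.0):
    """Illustrative only: apply the Table 1 thresholds (in percent) to one column."""
    total = len(values)
    if total == 0:
        raise ValueError("column has no values")
    counts = Counter(values)

    null_pct = 100.0 * counts.get(None, 0) / total
    # Assumption: a value is "unique" when it occurs exactly once in the column.
    unique_pct = 100.0 * sum(1 for c in counts.values() if c == 1) / total
    # Frequency percentage of the single most common value.
    top_pct = 100.0 * max(counts.values()) / total

    return {
        # Nulls must exist AND meet the threshold for the column to allow nulls.
        "allows_nulls": null_pct > 0 and null_pct >= nullability,
        "is_unique": unique_pct >= uniqueness,
        "is_constant": top_pct >= constant,
    }


print(infer_column_properties(list(range(1000))))
print(infer_column_properties(["US"] * 995 + [None] * 5))
```

In the second call, nulls make up 0.5% of the column, which is below the 1.0% default, so the column is inferred as not allowing nulls even though a few nulls are present.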

Classification order of preference settings

The following table shows the classification order settings that you can enable based on your column analysis preferences. You can enable an additional level of data classification such as credit card number or name by clicking Enable. You can change the order of the classification settings that you have enabled by using the arrows to the right of the Classification Order of Preference section.
Table 2. Classification order of preference settings
Data class Description
Account Number Infers whether a column can be considered an account number.
Addresses Infers whether a column can be considered an address.
AMEX Infers whether a column can be considered an American Express credit card number.
Bank Account Number Infers whether a column can be considered a bank account.
Canadian SIN Infers whether a column can be considered a Canadian social insurance number (SIN).
Code A column that contains code values that represent a specific meaning. For example, a column with the class of Code might contain data about the area code in a telephone number.
Company Name Infers whether a column can be considered a company name.
Computer Addresses Infers whether a column can be considered a computer address.
Country Code Infers whether a column can be considered a country or region code.
Credit Card Number Infers whether a column can be considered a credit card number.
Date Infers whether a column can be considered chronological data. For example, a column with the class of Date might contain data such as 10/10/07.
Date of Birth Infers whether a column can be considered a date of birth.
Diners Club Infers whether a column can be considered a Diners Club credit card number.
Discover Infers whether a column can be considered a Discover credit card number.
Drivers License Infers whether a column can be considered a driver's license number.
Email Address Infers whether a column can be considered an e-mail address.
France INSEE Infers whether a column can be considered a French National Institute for Statistics and Economic Studies (INSEE) number.
Gender Infers whether the column is a gender code, such as Male/Female or M/F.
Host Name Infers whether a column can be considered a host name.
Identification Number Infers whether a column can be considered an identification number.
Identifier A data value that is used to reference a unique entity. For example, a column with the class of Identifier might be a primary key or contain unique information such as a customer number.
Indicator Infers whether a column contains binary values such as M/F, 0/1, True/False, Yes/No.
International Securities Identification Number Infers whether a column can be considered an international securities identification number.
International Standard Book Number Infers whether a column can be considered an international standard book number.
IP Address Infers whether a column can be considered an IP address.
Italy Fiscal Code Infers whether a column can be considered an Italian fiscal code.
Japan CB Infers whether a column can be considered a Japanese CB number.
Large Object A large object is a column whose length is greater than the length threshold. A column defined as a large object will not be explicitly profiled. Large Object is assigned to columns that have a BLOB data type.
MasterCard Infers whether a column can be considered a MasterCard credit card number.
Names Infers whether a column can be considered a name.
Notes Email Address Infers whether a column can be considered an IBM Lotus Notes® e-mail address.
Passport Number Infers whether a column can be considered a passport number.
Personal Addresses Infers whether a column can be considered a personal address.
Personal Identification Information Infers whether a column can be considered personal identification information.
Personal Identification Numbers Infers whether a column can be considered a personal identification number.
Personal Name Infers whether a column can be considered an individual's personal name.
Quantity A column that contains data about a numerical value. For example, a column with the class of Quantity might contain data about the price of an object.
Spanish NIF Infers whether a column can be considered a Spanish tax identification number (número de identificación fiscal, or NIF).
Text A column that contains unformatted alphanumeric data and special character data.
UK NINO Infers whether a column can be considered a UK national insurance number (NINO).
Universal Product Code Infers whether a column can be considered a universal product code.
Unknown Columns that cannot be classified by the system are defined as Unknown. The Unknown class is applied temporarily during analysis.
URL Infers whether a column can be considered a Web address.
US Phone Number Infers whether a column can be considered a U.S. telephone number.
US SSN Infers whether a column can be considered a U.S. Social Security number (SSN).
US State Code Infers whether a column can be considered a U.S. state code or abbreviation.
US Zip Infers whether a column can be considered a U.S. postal code.
US Zip+4 Infers whether a column can be considered an extended U.S. ZIP Code.
Visa Infers whether a column can be considered a Visa credit card number.
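One way to picture an order of preference is as a first-match rule: the enabled classes are tried in order, and the first one that fits the column wins. The sketch below rests on that assumption. The regular expressions, the min_match_percent parameter, and the classify function are all hypothetical; the product's actual classification rules are not described in this topic.

```python
import re

# Hypothetical detectors for three of the data classes in Table 2.
DETECTORS = {
    "US Zip": re.compile(r"\d{5}"),
    "Identification Number": re.compile(r"\d+"),
    "Email Address": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def classify(values, enabled_order, min_match_percent=95.0):
    """Illustrative only: return the first enabled class that most values match."""
    for data_class in enabled_order:
        pattern = DETECTORS[data_class]
        matches = sum(1 for v in values if pattern.fullmatch(v))
        if values and 100.0 * matches / len(values) >= min_match_percent:
            return data_class
    # Columns that match no enabled class are left as Unknown.
    return "Unknown"


column = ["10001", "94105", "60601"]
print(classify(column, ["US Zip", "Identification Number"]))
print(classify(column, ["Identification Number", "US Zip"]))
```

The same column fits both classes, so under this assumption the result depends only on which class is listed first.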

Table analysis settings

The following table shows the table analysis settings that you can modify for primary key analysis and foreign key analysis. Note that the default values listed are the system defaults. If you modified the default system settings in the Analysis Settings workspace, then the revised settings are displayed.
Table 3. Table analysis settings
Setting Description
Primary key threshold Infers whether a column, either a single column or a concatenation of multiple columns, can be considered a primary key candidate.
Data sample size Controls the number of records that are included when a data sample of the table or file is created. The default is 2,000 records.
Data sample method Determines which type of method is used to create a data sample: sequential, random, or every nth value.
Data sample parameter Specifies the value of n for the every nth value data sampling method. The default is 10.
Composite key maximum Determines the maximum number of columns that can be combined when you search for primary key candidates. The default is 2.
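The settings in Table 3 can be illustrated with a rough sketch of sampling followed by a key search. Everything named here is hypothetical: sample_rows and key_candidates are not product functions, the starting offset of the every nth value method is an assumption, and key uniqueness is assumed to be measured as distinct value combinations divided by row count. The primary key threshold has no default in the sketch because this topic does not state one.

```python
import random
from itertools import combinations


def sample_rows(rows, size=2000, method="sequential", n=10, seed=None):
    """Illustrative only: build a data sample with one of three methods."""
    if method == "sequential":
        return rows[:size]
    if method == "random":
        return random.Random(seed).sample(rows, min(size, len(rows)))
    if method == "nth":
        # Assumption: take rows n, 2n, 3n, ... until the sample is full.
        return rows[n - 1::n][:size]
    raise ValueError(f"unknown sampling method: {method}")


def key_candidates(rows, columns, threshold, composite_max=2):
    """Illustrative only: list column combinations unique enough to be keys."""
    candidates = []
    if not rows:
        return candidates
    for width in range(1, composite_max + 1):
        for combo in combinations(columns, width):
            distinct = {tuple(row[c] for c in combo) for row in rows}
            percent = 100.0 * len(distinct) / len(rows)
            if percent >= threshold:
                candidates.append((combo, percent))
    return candidates


order_lines = [
    {"order": 1, "line": 1, "sku": "A"},
    {"order": 1, "line": 2, "sku": "B"},
    {"order": 2, "line": 1, "sku": "A"},
    {"order": 2, "line": 2, "sku": "C"},
]
sample = sample_rows(order_lines, size=2000, method="sequential")
print(key_candidates(sample, ["order", "line", "sku"], threshold=100.0))
```

No single column is unique in this sample, so candidates appear only because the composite key maximum of 2 allows pairs of columns to be tested.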

Cross-table analysis settings

You can modify the default cross-table analysis settings. If you modified the default system settings in the Analysis Settings workspace, then the modified settings are displayed.

Common domain threshold setting
Determines the percentage of distinct values in the frequency distribution of one column that match distinct values in the frequency distribution of another column. If the percentage of matching distinct values is equal to or greater than the threshold, then the two columns are inferred to have a common domain. The default is 98%.
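The common domain comparison can be sketched with set operations on the distinct values of two columns. The function name common_domain and the column names in the example are illustrative assumptions; the sketch measures the share of distinct values in the first column that also appear in the second, so the result depends on which column is taken as the base.

```python
def common_domain(base_values, paired_values, threshold=98.0):
    """Illustrative only: compare distinct values of two columns.

    Returns (has_common_domain, percent_of_base_values_matched).
    """
    base = set(base_values)
    paired = set(paired_values)
    if not base:
        return False, 0.0
    percent = 100.0 * len(base & paired) / len(base)
    return percent >= threshold, percent


# Hypothetical columns: 100 distinct customer IDs on orders, 99 of them known.
orders_customer_id = list(range(1, 101))
customers_id = list(range(1, 100)) + list(range(500, 600))
print(common_domain(orders_customer_id, customers_id))  # (True, 99.0)
print(common_domain(customers_id, orders_customer_id))
```

Reversing the arguments drops the match percentage well below the threshold, because most distinct values in the second column have no counterpart in the first.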

This topic is also in the IBM InfoSphere Information Analyzer User's Guide.

Last updated: 2010-09-30