IBM InfoSphere Information Server suite-wide glossary

This glossary includes terms and definitions for IBM® InfoSphere® Information Server.

The following cross-references are used in this glossary:

See refers you from a term to a preferred synonym, or from an acronym or abbreviation to the defined full form.
See also refers you to a related or contrasting term.

To view glossaries for other IBM products, go to www.ibm.com/software/globalization/terminology.

A

abbreviation: A shortened form of a word or phrase that represents the full form of the term in a glossary.
accepted term: A term in a glossary that has been accepted as a new, valid term for general use within an organization by the glossary administrator. See also candidate term.
action: The part of a standardization rule that specifies how the rule processes a record. See also condition, standardization rule.
aggregate: 1. (n) In information analysis, a calculation that returns a single result value from several relational data rows or dimensional members. Typical examples of an aggregate are total and average.; 2. (v) To collect related information for processing and analysis.
analysis database: A database that InfoSphere Information Analyzer uses when it runs analysis jobs and where it stores the extended analysis information. The analysis database does not contain the InfoSphere Information Analyzer projects, analysis results, and design-time information; all of this information is stored in the metadata repository.
asset collection: A set of assets that have been grouped together to work on as a set instead of individually.
assign: To link between two assets in the metadata repository.

B

base column: In a cross-domain or cross-table information analysis, the column of data that is the driver for the analysis.
baseline analysis: A type of data analysis that compares a saved view of the results of a column analysis to another view of the same results that are captured later.
benchmark: A quantitative quality standard that defines the minimum level of acceptability for the data or the tolerance for some level of exceptions in a data analysis.
binding: In information analysis, a direct relationship between a logical element in a data rule and an actual column in a table in a data source.
blocking: A process that partitions records into subsets that share common characteristics, with the goal of limiting the number of record pairs that are examined during matching. By limiting matching to record pairs within a subset, successful matching becomes computationally feasible for large data sets.
blueprint: A collection of diagrams that include information technology elements, which represent the architecture of an information project, and method elements, which represent standard information technology practices.
bridge: A component that converts metadata from one format to another format by mapping the metadata elements to a standard model. This model translates the semantics of the source tool into the semantics of the target tool. For example, the source tool might be a business intelligence or data modeling tool, and the target tool might be the metadata repository. Or, the source tool might be the metadata repository, and the target tool might be a file that is used by a data modeling tool. See also connector, metadata repository.
business analyst: A specialist who analyzes business needs and problems. A business analyst consults with users and stakeholders to identify opportunities for improving business return through information technology and then transforms the requirements into a technical form.
business intelligence asset (BI asset): An information asset that is used by business intelligence (BI) tools to organize reports and models that provide a business view of data. These assets include BI reports, BI models, BI collections, and cubes.
business lineage: The lifecycle of a unit of data, such as a table or a column, as it moves between information assets (data sources). Unlike data lineage, no information about data transformations is included in business lineage information. Business lineage information is like a summary of data lineage information. See also data lineage.
business metadata: Metadata that provides a business context and a business name for assets that are created and managed by other applications. Business metadata includes terms, information governance rules, labels, and stewards.
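
The blocking entry above can be illustrated with a minimal Python sketch. It is not how InfoSphere QualityStage implements blocking; the sample records, the field names, and the choice of postal code as the blocking key are all hypothetical. The sketch only shows why comparing records within a block reduces the number of record pairs.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, block_key):
    """Group records into blocks that share the same blocking key value."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    return blocks


def candidate_pairs(blocks):
    """Yield record pairs only within each block, never across blocks."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "name": "Ann Lee", "zip": "10001"},
    {"id": 2, "name": "Anne Lee", "zip": "10001"},
    {"id": 3, "name": "Bob Ray", "zip": "94105"},
    {"id": 4, "name": "Robert Ray", "zip": "94105"},
    {"id": 5, "name": "Cy Moss", "zip": "60601"},
]

# Assumption: the postal code is used as the blocking key.
blocks = block_records(records, block_key=lambda r: r["zip"])
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
all_pairs = list(combinations(records, 2))

print(pairs)           # [(1, 2), (3, 4)]
print(len(all_pairs))  # 10 pairs would be compared without blocking
```

In this sample, blocking reduces the comparison work from 10 record pairs to 2, at the cost of never comparing records that fall into different blocks.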

C

candidate column: A column that is used as a placeholder in a mapping.
candidate term: A term in a glossary that is being considered but that has not yet become standard or accepted. See also accepted term, standard term.
cardinality: In information analysis, a measure of the number of unique values in a column.
catalog: An authoritative dictionary of the assets and the metadata about assets that are used throughout the enterprise. A catalog is the collection of glossary assets and the metadata about information assets that is stored in the metadata repository.
category: A word or phrase that classifies and organizes terms in the glossary. A category can contain other categories, and it can also contain terms. In addition, a category can reference terms that it does not contain.
class: The syntactic category for a group of related values. A value can be assigned to different classes in different contexts or scenarios. See also value.
classification: 1. The process of grouping values into specific classes. See also class.; 2. The system that defines classes and the relationships among those classes. See also class.
clerical record: A record for which the matching process cannot definitively determine whether the record is a duplicate record or a nonmatched record, or whether the record is a matched record or a nonmatched record. See also duplicate record, matched record, nonmatched record.
client tier: The client programs and consoles that are used for development, administration, and other tasks for the InfoSphere Information Server suite and product modules and the computers where they are installed.
column analysis: A data quality process that describes the condition of data at the field level.
common domain: In information analysis, the set of columns that share overlapping and potentially redundant values.
commonality: In information analysis, a measure of the number of matching values in a set of paired columns.
complex flat file: A file that has hierarchical structure, especially mainframe data structures and XML files.
compute node: A processing node in a parallel processing environment that handles elements of the job logic. Any processing node that is not a conductor node is a compute node. See also processing node.
condition: The part of a standardization rule that defines the requirements that the record must meet for the rule to apply to that record. A pattern is a type of condition. See also action, pattern, standardization rule.
conductor node: The processing node that initiates the job run. See also processing node.
connector: A component that provides data connectivity and metadata integration for external data sources, such as relational databases or messaging software. A connector typically includes a stage that is specific to the external data source. See also bridge, operator, plug-in.
constant: Data that has an unchanging, predefined value to be used in processing.
contained term: A term in a category in a glossary. A term must be contained by only one category.
context: The hierarchy of elements within which an element exists. For example, the context of a term in a glossary is the hierarchy of categories in which the term is contained.
cross-domain analysis: A type of data analysis that identifies the overlap of data values between two columns of data.
cross-table analysis: A type of data analysis that combines foreign key analysis and cross-domain analysis. Foreign keys reference primary keys that are already defined or identified during primary key analysis.
custom attribute: A user-defined property for an asset that further describes assets of that type. For example, a custom attribute for database tables might be "Expected maximum row count." The custom attribute would be available for every database table, and might contain different values for different database tables.
cutoff: A threshold that specifies how the scored record pairs are categorized as matched, nonmatched, or clerical records based on the weight generated by the matching process.
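
The cutoff entry above can be sketched in Python as follows. This is an illustration only: the function name, the use of two separate cutoff values, and the assumption that a higher weight means a stronger match are not stated in this glossary and are labeled as assumptions in the code.

```python
def classify_pair(weight, match_cutoff, clerical_cutoff):
    """Categorize a scored record pair by comparing its weight to cutoffs.

    Assumptions (not taken from the glossary): two cutoffs are used, a
    higher weight means a stronger match, and clerical_cutoff must not
    exceed match_cutoff.
    """
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical_cutoff must not exceed match_cutoff")
    if weight >= match_cutoff:
        return "matched"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatched"


# Hypothetical weights and cutoff values.
for weight in (12.5, 7.0, 1.5):
    print(weight, classify_pair(weight, match_cutoff=10.0, clerical_cutoff=5.0))
```

Pairs whose weight falls between the two cutoffs are the ones that the clerical record entry describes: the process cannot decide definitively, so they are set aside for review.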

D

data analyst: A specialist who consults with users, business analysts, and stakeholders and then creates and runs processes in order to review and analyze the content, structure, and quality of data.
database: A collection of interrelated or independent data items that are stored together to serve one or more applications.
database schema: A collection of database objects such as tables, views, indexes, or triggers that define a database. A schema provides a logical classification of database objects.
data class: In information analysis, a classification that designates the logical type of data in a data field. A data class categorizes a column according to how the data in the column is used. For example, the classification INDICATOR represents a binary value such as TRUE/FALSE or YES/NO.
data cleansing: The process of preparing, standardizing, deduplicating, and integrating data from one or more sources such that it conforms to organizational requirements.
data click activity: A simple set of steps that move and transform data.
data enrichment: The process of adding and correcting the values of records from records that have been identified as representing similar entities.
data field: A column or field that contains a specific set of data values that are common to all records in a file or table.
data file: 1. A file that stores a collection of fields in a native system file instead of in a database table.; 2. The information asset that represents a collection of fields that are stored in a single file, such as a flat file, a complex flat file, or a sequential file.
data file structure: A collection of fields.
data lineage: The lifecycle of a unit of data, such as a table or a column, that indicates where the data comes from or where it goes to and how the data changes as it moves between data stores of any type. Data lineage is often expressed as a graph of a detailed bi-directional data flow. See also business lineage.
data partitioning: The process of logically or physically partitioning data into segments that are more easily maintained or accessed.
data pipelining: The process of pulling records from the source system and moving them through a sequence of functions that are defined in the data flow.
data rule: An expression that is generated out of a data rule definition that evaluates and analyzes conditions found during data profiling and data quality assessment. Data rules define specific tests, validations, or constraints associated with the data.
data set: A set of parallel data files and the descriptor file that refers to them. Data sets optimize the writing of data to disk by preserving the degree of partitioning. See also data file.
data source: The source of data itself, such as a database or XML file, and the connection information necessary for accessing the data.
data store: A place (such as a database system, file, or directory) where data is stored.
deduplication: The process of creating representative records from a set of records that have been identified as representing the same entities. See also matching, survivorship, and data enrichment.
deprecated term: A term in a glossary that is no longer approved for use. Typically, deprecated terms are replaced with a new term or a synonym. See also replacement term.
design metadata: Metadata about the data flow that is included within a job design.
development glossary: A glossary that contains only the categories and terms that are being created or revised as part of a configured workflow and that have not been published yet. See also published glossary.
diagram: A graphical representation of the logical data model or a subject area.
domain: In information analysis, the set of data in a column.
domain analysis: A type of data analysis where the values of columns are identified and marked as invalid values.
DS engine: 1. See InfoSphere Information Server engine.; 2. See server engine.
duplicate record: A record that matches a master record. The duplicate record is likely to represent the same unique entity as the master record. See also master record.

E

ELT (extract, load, and transform): The process of extracting data from one or more sources, loading it directly into a relational database, and then running data transformations in the relational database.
engine: See InfoSphere Information Server engine.
engine tier: The logical group of engine components for the InfoSphere Information Server suite and product modules (the InfoSphere Information Server engine components, service agents, and so on) and the computer or computers where those components are installed.
ETL (extract, transform, and load): The process of collecting data from one or more sources, cleansing and transforming it, and then loading it into a database.
event: An occurrence of significance to a task or system. Events can include completion or failure of an operation, a user action, or the change in state of a process.
exception: A condition or event that might require additional information or investigation.
exception set: Groups of exception records that were generated by a particular event and the details about those exception records.
extended data source: A data structure that cannot be written to disk or that cannot be imported into the metadata repository.
extract, load, and transform (ELT): See ELT.
extract, transform, and load (ETL): See ETL.

F

flat file: A file that has no hierarchical structure.
format analysis: A type of data analysis that validates the pattern of characters that is used to store a data value in selective columns (for example, telephone numbers or Social Security numbers) that have a standard general format.
frequency distribution: In information analysis, the number of occurrences of each unique value in a column and the characteristics of that column. A frequency distribution is a foundation on which other analyses are run when profiling data.
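
As a small illustration of the frequency distribution entry above, and of the cardinality entry in section C, the following Python sketch counts the occurrences of each unique value in a column. The column values are hypothetical sample data, and the sketch does not represent how InfoSphere Information Analyzer computes or stores these results.

```python
from collections import Counter

# Hypothetical column values; any list of hashable values would work.
column = ["NY", "CA", "NY", "TX", "CA", "NY"]

# Frequency distribution: number of occurrences of each unique value.
frequency = Counter(column)

# Cardinality: number of unique values in the column.
cardinality = len(frequency)

for value, count in frequency.most_common():
    print(value, count)
print("cardinality:", cardinality)
```

Other measures can be derived from the same counts, which is one way to read the statement that a frequency distribution is a foundation for other analyses.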

G

general format: In information analysis, the use of a character symbol for each unique data value. For example, all alphabetic characters in a column are replaced with the letter A.
global logical variable: In information analysis, a value that you set to represent a specific piece of data. It is a shared construct that can be used in all data rule definitions. See also data rule.
glossary: The controlled vocabulary and associated information governance policies and rules that define business semantics. Business and IT professionals can use a glossary to manage enterprise-wide information according to defined regulatory requirements or operational needs of the business. See also category, term, information governance policy, information governance rule.
glossary assets: The following set of assets: categories, terms, information governance policies, and information governance rules.
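
The general format entry above can be sketched in Python. The replacement of alphabetic characters with the letter A comes from the entry itself; the use of 9 for digits and the choice to keep all other characters unchanged are assumptions made for this illustration.

```python
def general_format(value):
    """Return a symbol pattern that describes the characters in a value.

    Letters become "A" (as in the glossary example). Digits become "9"
    and other characters are kept as-is; both choices are assumptions.
    """
    symbols = []
    for ch in value:
        if ch.isalpha():
            symbols.append("A")
        elif ch.isdigit():
            symbols.append("9")
        else:
            symbols.append(ch)
    return "".join(symbols)


# Hypothetical sample values.
print(general_format("555-1234"))  # 999-9999
print(general_format("AB12 3CD"))  # AA99 9AA
```

Values that reduce to the same symbol pattern share a general format, which is the kind of comparison that the format analysis entry describes.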

I

impact analysis: The process of identifying where objects are used and what other objects they depend on.
implemented data resource: An information asset that represents a database and its contents (schemas, database tables, and stored procedures), a data file and its contents (data file structures and data file fields), or a data item definition.
inference: In information analysis, a statistical measure in which probabilities are interpreted as degrees of belief.
inferred data type: In information analysis, the optimum data type that is identified during data analysis that can be used for an individual data value.
information analysis: The data analysis processes that assess the quality of your data, profile the data for integration and migration, and verify any external data sources.
information asset: A piece of information that is of value to the organization and can have relationships, dependencies, or both, with other information assets. Information assets include those assets created or imported by InfoSphere Information Server products, such as business intelligence (BI) reports, jobs, or mapping specifications.
information governance: The procedures that an organization uses to maintain oversight and accountability of information assets.
information governance policy: A natural language description of an information governance subject area. An information governance policy is made up of information governance rules. See also information governance rule.
information governance rule: A natural language definition of a characteristic for making information assets compliant with corporate objectives.
InfoSphere Information Server engine: The software that runs tasks or jobs, such as discovery, analysis, cleansing, or transformation. The engine includes the server engine, the parallel engine, and the other components that make up the runtime environment for InfoSphere Information Server and its product modules.
input link: A link that connects a data source to a stage. See also link.
investigation: A process of profiling the data source to understand the source data in order to identify relevant values, structures, and patterns.

J

job: The design objects and compiled programmatic elements that can connect to data sources, extract and transform that data, and then load that data into a target system. Types of jobs include parallel jobs, sequence jobs, server jobs, and mainframe jobs. See also job design, job executable, job parameter.
job activity: In a sequence job, a type of stage that indicates the actions that occur when the sequence job runs.
job design: The metadata that defines the sources and targets that are used within a job and the logic that operates on the associated data. A job design is composed of stages and the links between those stages. The job design is stored in the metadata repository, separate from the job executable. See also job.
job executable: The set of binary objects, generated scripts, and associated files that are used when running a job. See also job and job design.
job parameter: A processing variable that can be used at various points in a job design and overridden when the job is executed in order to dynamically influence the processing of the job. Job parameters are most often used for paths, file names, database connection information, or job logic. See also job, job design, and parameter set.
job run: A specific run of a job. A job can run multiple times, producing multiple job runs.
job sequence: See sequence job.
job template: A job design that only uses job parameters to specify values during runtime. See also job design and job parameter.

K

key analysis: A type of data analysis that evaluates data tables to find primary, foreign, and natural key candidates.

L

label: A short descriptor or keyword that classifies or categorizes information assets in the metadata repository, including categories and terms in the glossary.
link: A representation of a data flow that joins the stages in a job. A link connects data sources to processing stages, connects processing stages to each other, and also connects those processing stages to target systems. The types of links are input link, output link, reference link, and reject link. See also input link, output link, reference link, reject link.
literal: A character string whose fixed value is defined by the characters themselves.
logical asset: A logical data model element.
logical data model: The data model that captures the business definition of information assets by using the entity-relationship modeling approach. The logical data model consists of a set of related entities and their business associations. The logical data model can be represented graphically in the logical data model diagram. The logical data model contains logical entities, logical relationships, entity generalization hierarchies, and logical domains.
long description: An extended description of a term in a glossary that fully defines the term. See also short description.
lookup table: 1. A database table used to map one or more input values to one or more output values.; 2. A data source that has a key value that jobs use to retrieve reference information.

M

mapping specification: A set of mappings that describe how data is extracted, transformed, or loaded from one data source to another.
master record: During one-source matching, the record that is considered to be the primary record of a set of related records. Each group of two or more matched records has one master record. See also one-source matching.
master schema definition: A physical model of the inferred properties that are generated out of the selected data. It reflects the inferences of the data instead of the original definitions of the metadata.
match comparison: An algorithm that analyzes the values in columns and then calculates a score that contributes to the composite weight, which is used to determine the strength of the match. See also score.
Match Designer database: A database that stores the results of match test passes that are generated by InfoSphere QualityStage.
matched record: A data record that is identified to be the same as a reference record by a two-source matching process. See also two-source matching.
matching: A probabilistic or deterministic record linkage process that automates either the identification of records that are likely to represent the same entity or the identification of a relationship among records.
metadata repository: A shared component that stores design-time, runtime, glossary, and other metadata for product modules in the InfoSphere Information Server suite.
metadata services: A shared set of components that provide common functions (such as import and export) to other product modules in the InfoSphere Information Server suite.
metric: 1. A measure to assess performance in a key area of a business.; 2. In information analysis, a mathematical calculation that is performed on statistical results from data rules, rule sets, and other metrics themselves. A metric consolidates measurements from various data analysis steps to reduce hundreds of detailed analytical results into a few meaningful measurements that effectively convey the overall quality of the data.

N

node: 1. A logical processing unit that is defined in a configuration file by a virtual name and a set of associated details about the physical resources, such as the server, the disks, its pools, and so on.; 2. Any computer system that has a parallel engine installed on it. See also parallel engine.
nonmatched record: A record that is not a matched record, clerical record, or duplicate record. See also matched record, clerical record, and duplicate record.

O

one-source matching: The process of matching records within one source. See also matching and deduplication.
operand: An entity on which an operation is performed.
operational metadata: Metadata that describes the events and processes that occur and the objects that are affected when a job is run. See also operations database.
operations database: A component of the metadata repository that stores both the operational metadata and the information about the system resources that were used when a job is run for the product modules in the InfoSphere Information Server suite. See also metadata repository, operational metadata.
operator: A runtime object library that is part of the parallel engine and that executes the logic as defined in its corresponding stage. See also connector, stage.
output link: A link that is connected to a stage and generally moves processed data from the stage. See also link, reject link.
override: An object that defines how to change the processing of data as specified in classifications or standardization rules.

P

pack: A collection of components that extends existing capabilities.
paired column: In a cross-domain analysis or cross-table analysis, the column of data that has been matched to the base column.
parallel engine: The component of the InfoSphere Information Server engine that runs parallel jobs.
parallel job: A job that is compiled and run on the parallel engine and that supports parallel processing system features, including data pipelining, partitioning, and distributed execution. See also job.
parameter set: A set of job parameters. See also job parameter and value set.
parsing: A process that analyzes a sentence or phrase by dividing the strings into tokens before trying to determine the meaning of the strings. See also token.
pattern: The sequence of class labels assigned to the values in a data record which can be used to identify a subset of records that might be standardized the same way. See also class and value.
pattern-action language: The language that defines standardization rules. See also standardization rule.
physical asset: A physical data model element or an implemented data resource.
physical data model: The data model that represents the design schema for the information assets by using the relational model approach. The physical data model is typically generated from the logical data model by using the same modeling tools, although it can be reverse engineered from an existing database. A physical data model can be implemented many times. The physical data model contains design tables, design stored procedures, and physical domains.
physical data resource (PDR): See implemented data resource.
plug-in: A type of stage that is used to connect to data sources but that does not support parallel processing capabilities. See also connector.
policy: 1. The set of characteristics that defines the behavior of a runtime artifact.; 2. See information governance policy.
precision: In information analysis, a measurement of the ability to distinguish between nearly equal values.
process asset: A mapping component, a mapping specification, or potentially other similar assets, such as a job, rule, or parameter.
processing node: The logical nodes in the system where jobs are run. The configuration file can define one processing node for each physical node in the system or multiple processing nodes for each physical node. See also compute node.
project: A container that organizes and provides security for objects that are supplied, created, or maintained for data integration, data profiling, quality monitoring, and so on.
publish: To make analysis results, rules, and other entities visible to a broader audience outside the scope of a project.
published glossary: The glossary that includes the set of categories and terms that have been approved and published as part of a configured workflow.
PX engine: See parallel engine.

R

redundancy: In information analysis, a measure of the number of columns that have the same values or common domains.
referenced term: A term in a glossary that is referred to by a category instead of being contained in that category. A term can be referred to by multiple categories. A term cannot be contained by and referenced by the same category.
reference link: An input link on a Transformer or Lookup stage that defines where the lookup tables exist. See also link.
reference match: See two-source matching.
reference table: A data table that you use in comparisons during data analysis.
referential integrity: An analysis that is run after foreign key analysis to ensure that foreign key candidates match the values of an associated primary key.
reject link: An output link that identifies errors when the stage is processing records and that routes those rejected records to a target stage. See also link, output link.
related term: A term in a glossary that is related to the term in question. This relationship can be used for "see also" relationships to terms that are similar but not identical. The relationship is symmetrical; that is, if you specify that term A has term B as a related term, then term B has term A as a related term. A term can have multiple related terms.
relationship: 1. A defined connection between the rows of a table or the rows of two tables. A relationship is the internal representation of a referential constraint.; 2. An association between glossary assets.
replaced by term: A term in a glossary that supersedes another term. Typically, deprecated terms specify replacement terms to identify which term replaces the deprecated term. See also deprecated term.
report: A set of data deliberately laid out to communicate business information.
repository: A persistent storage area for data and other application resources.
repository tier: The repository tier consists of the metadata repository and, if installed, other data stores to support other product modules. The metadata repository contains the shared metadata, data, and configuration information for InfoSphere Information Server product modules.
representative record: The record that is created during survivorship and populated with the best available data from a group of records. See also survivorship.
routine: A program or sequence of instructions called by a program. Typically, a routine has a general purpose and is frequently used.

S

score: In the matching process, the result of a match comparison. See also matching and match comparison.
separation character: A character that separates or delimits tokens. See also token.
separation list: The list of separation characters. See also separation character.
sequence job: A job whose job design is composed of job activities and the triggers between those job activities that are run in a specified order.
server engine: The component of the InfoSphere Information Server engine that runs server jobs and job sequences.
server job: A job that is compiled and run on the server engine.
services tier: The application server, common services, and product services for the InfoSphere Information Server suite and product modules and the computer or computers where those components are installed.
short description: A brief description that defines a term in a glossary. See also long description.
similarity threshold: In information analysis, a comparison threshold that defines the degree of variation that is allowed in the spelling or representation of another value.
source-to-target mapping: A row in a mapping specification that describes a transformation between one or more source columns and business terms to one or more target columns and business terms.
stage: The element of a job design that describes a data source, a data processing step, or a target system and that defines the processing logic that moves data from input links to output links. A stage is a configured instance of a stage type. See also job design, stage type.
stage type: An object that defines the capabilities of a stage, the parameters of the stage, and the libraries that the stage uses at run time. See also stage.
standard deviation: A measurement of how varied the values in a frequency distribution are from the average value of the distribution. A low standard deviation value means that the values are close to the average value, whereas a high standard deviation value means that the values are more widely dispersed over a large range of values.
standardization: A process that separates records into parts, changes them to implement enterprise data quality standards, and potentially enriches the data for later use.
standardization rule: One or more conditions, such as a pattern, and the associated set of actions that are used to standardize data. See also condition, pattern, action, and standardization.
standard term: A term in a glossary that has been thoroughly evaluated and approved by the team and that has been defined as definitively describing a characteristic of the enterprise or organization. See also candidate term.
standard value: The element of a classification definition that is a standardized spelling or representation of the value and that can be used to facilitate matching.
steward: The user or group of users that is responsible for the definition, purpose, and use of glossary assets or the information assets that are described in the metadata repository. The steward does not have to be a user of the glossary.
strip character: A character to be removed when parsing text into tokens. See also token.
strip list: The list of strip characters. See also strip character.
subscription: 1. The set of mappings between source replication objects and target replication objects.; 2. In the common event framework, a definition of how to process a certain type of event.
survivorship: The data cleansing process of evaluating a group of related records and creating one representative record. See also representative record.
synonym: A term in a glossary that has the same meaning as another term. A term can have multiple synonyms. The relationship is symmetrical and transitive; that is, if term A is a synonym of term B, and term B is a synonym of term C, each term is a synonym of the others.
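The symmetrical and transitive nature of the synonym relationship can be sketched by grouping terms from a list of declared synonym pairs. The function name `synonym_groups` and the sample terms are hypothetical and are used only for illustration; this is not how the product stores synonyms.

```python
def synonym_groups(pairs):
    """Group terms so that every term in a group is a synonym of the others."""
    parent = {}

    def find(term):
        # Return the representative term of the group that contains `term`.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # shorten the path as we walk
            term = parent[term]
        return term

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    return list(groups.values())


# Hypothetical pairs: A-B and B-C are declared, so A-C follows by transitivity.
groups = synonym_groups([
    ("client", "customer"),
    ("customer", "account holder"),
    ("vendor", "supplier"),
])
print(groups)
```

In this sketch, "client" and "account holder" end up in the same group even though that pair was never declared directly.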

T

table analysis: A data analysis process that consists of primary key analysis and the assessment of multicolumn primary keys and potential duplicate values.
technical metadata: Metadata that provides details about source and target systems, database table and field structures, and dependencies of information assets.
term: In a glossary, a word or phrase that describes a characteristic of the enterprise. By assigning assets to terms in the glossary, you can organize your information assets based on business meaning.
threshold: A customizable value for defining the acceptable tolerance limits (maximum, minimum, or reference limit) for an application resource or system resource. When the measured value of the resource is greater than the maximum value, less than the minimum value, or equal to the reference value, an exception or event is raised.
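The three tolerance limits in the definition above can be sketched as a single check. The function name `exceeds_threshold` and its parameters are hypothetical; the sketch only mirrors the wording of the definition and is not a product API.

```python
def exceeds_threshold(measured, maximum=None, minimum=None, reference=None):
    """Return True when the measured value should raise an exception or event."""
    if maximum is not None and measured > maximum:
        return True  # greater than the maximum value
    if minimum is not None and measured < minimum:
        return True  # less than the minimum value
    if reference is not None and measured == reference:
        return True  # equal to the reference value
    return False


# Hypothetical resource measurements checked against hypothetical limits.
print(exceeds_threshold(95, maximum=90))              # above the maximum
print(exceeds_threshold(50, maximum=90, minimum=10))  # within the limits
print(exceeds_threshold(0, reference=0))              # equal to the reference
```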
tier: The logical group of components and the computers on which those components are installed.
token: A syntactic element, such as a phrase, a word, or a set of one or more characters, that is used for analyzing and processing text.
tokenization: The process that segments data into tokens. See also token and parsing.
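A simplified sketch of tokenization with a strip list follows. It assumes that tokens are separated by white space and that the strip list is the hypothetical set of characters shown; the actual parsing rules of the product are not described in this glossary.

```python
# Hypothetical strip list: characters that are removed before tokens are formed.
STRIP_LIST = ",.()-"


def tokenize(text, strip_list=STRIP_LIST):
    """Remove strip characters, then split the remaining text into tokens."""
    # The third argument of str.maketrans lists characters to delete.
    cleaned = text.translate(str.maketrans("", "", strip_list))
    # Assumption: white space is the only separator between tokens.
    return cleaned.split()


tokens = tokenize("123 N. Main St., Apt. (4-B)")
print(tokens)
```

With these assumptions, the sample address is segmented into the tokens 123, N, Main, St, Apt, and 4B.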
tolerance: See benchmark.
trigger: A representation of dependencies between workflow tasks that joins job activities in a sequence job. Job activities typically have one input trigger but can have multiple output triggers.
two-source matching: The process of matching records between two sources. See also matching.

U

unduplicate: See deduplication.
unduplicate match: See one-source matching.
uniqueness: In information analysis, a measure of whether a value occurs exactly once in the table data.

V

validity: A data analysis process that evaluates columns for valid and invalid values.
value: 1. When standardizing data, a phrase, a word, or a set of one or more characters that is used for analyzing and processing text. See also token.; 2. The content of a variable, parameter, special register, or field.
value set: A named set of values that can be used to override the default values for the job parameters that are grouped in a parameter set. See also job parameter and parameter set.
view: A logical table that is based on data stored in an underlying set of tables. The data returned by a view is determined by a SELECT statement that is run on the underlying tables.
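The behavior of a view can be sketched with an in-memory SQLite database. The table name `customer`, the view name `us_customer`, and the sample rows are hypothetical; the point of the sketch is that the view stores no data of its own and reflects the underlying table each time that it is queried.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical underlying table.
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customer (name, country) VALUES (?, ?)",
    [("Ana", "BR"), ("Ben", "US"), ("Cy", "US")],
)

# The view is defined by a SELECT statement on the underlying table.
conn.execute(
    "CREATE VIEW us_customer AS "
    "SELECT id, name FROM customer WHERE country = 'US'"
)

rows = conn.execute("SELECT name FROM us_customer ORDER BY name").fetchall()

# A new row in the underlying table appears in the view without redefining it.
conn.execute("INSERT INTO customer (name, country) VALUES ('Dee', 'US')")
rows_after = conn.execute("SELECT name FROM us_customer ORDER BY name").fetchall()

print(rows)
print(rows_after)
```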
virtual column: In information analysis, a single column or a concatenation of two or more columns that can be analyzed as if it were an existing physical data column.

W

weight: In the matching process, a factor that indicates the relative importance of part of a record. See also score.
workflow: The glossary development process that adds an approval step and a publishing step to the creation or revision of glossary content. New or revised glossary assets are added to the development glossary and sent for review and approval before being published to the published glossary.