Standardizing your data

You determine which document types you want to include in your deployed project. For each document type, you choose the fields that you want to formalize as properties in the metadata for each document object that is stored.

About this task

When you build a document processing application that is based on this document processing project, the application processes documents and sends them to a content repository. Processing includes classifying the document type and extracting data.

How the data for each document is stored depends on the choices that you make when you standardize the data. Standardizing data means that you choose which document types to send to the object store as classes, and for each class, which fields to designate as properties by mapping them to a data definition.

During data standardization

For each document type that you choose to deploy, one class is added to the repository configuration at deployment time.

For each field that you map to a data definition, one property is added to the document-type class in the repository configuration.
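
The two rules above can be pictured as a small mapping. The following Python sketch is illustrative only: the names `DocumentType`, `Field`, and `build_repository_config` are hypothetical and are not part of the product, and the sketch assumes that a property takes the name of its field. It mirrors only the rules that one class is added per deployed document type and one property per mapped field.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Field:
    name: str
    # Title of the data definition the field is mapped to, or None if unmapped.
    data_definition: Optional[str] = None


@dataclass
class DocumentType:
    name: str
    deploy: bool = False
    fields: list[Field] = field(default_factory=list)


def build_repository_config(document_types: list[DocumentType]) -> dict[str, list[str]]:
    """Return {class name: [property names]} for deployed document types only."""
    config: dict[str, list[str]] = {}
    for doc_type in document_types:
        if not doc_type.deploy:
            continue  # Document types that are not deployed get no class.
        # Only fields mapped to a data definition become properties (assumed naming).
        config[doc_type.name] = [
            f.name for f in doc_type.fields if f.data_definition is not None
        ]
    return config


invoice = DocumentType(
    name="Invoice",
    deploy=True,
    fields=[
        Field("InvoiceNumber", data_definition="Invoice number"),
        Field("Notes"),  # Extracted, but not mapped to a data definition.
    ],
)
receipt = DocumentType(name="Receipt", deploy=False, fields=[Field("Total", "Total")])

print(build_repository_config([invoice, receipt]))  # {'Invoice': ['InvoiceNumber']}
```

In this example, Receipt produces no class because it is not deployed, and Notes produces no property because it is not mapped.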

The outcome at processing time

The values of all the fields that are associated with the document type are extracted into a JSON file that is stored with the document.

The values of the fields that are mapped to a data definition during data standardization are also stored as property values on the document-type class.
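
As a rough illustration of these two outcomes, the Python sketch below splits one set of extracted values into a JSON payload and a set of properties. The function `store_outcome`, the field names, and the shape of the JSON are assumptions made for this example; the actual file layout that the product writes is not described here.

```python
import json


def store_outcome(
    extracted: dict[str, str], mapped_fields: set[str]
) -> tuple[str, dict[str, str]]:
    """Split extracted values into the JSON payload and the class properties.

    extracted:     every field value extracted for the document.
    mapped_fields: names of the fields that are mapped to a data definition.
    """
    # Every extracted field goes into the JSON stored with the document.
    json_payload = json.dumps(extracted, indent=2, sort_keys=True)
    # Only mapped fields are also stored as property values.
    properties = {
        name: value for name, value in extracted.items() if name in mapped_fields
    }
    return json_payload, properties


extracted_values = {"InvoiceNumber": "INV-001", "Total": "125.00", "Notes": "Net 30"}
payload, props = store_outcome(extracted_values, mapped_fields={"InvoiceNumber", "Total"})

print(payload)  # All three fields appear in the JSON.
print(props)    # Only InvoiceNumber and Total appear as properties.
```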

Your document classification and data extraction models include information from the pre-trained document types that are present in every project. After you make decisions about the document types and set up your data extraction, you have a full set of document types and fields that are candidates for deploying to the repository. For deployment, choose only the document types that are relevant for your organization. For each document type that you deploy, choose only the fields that are likely to be used in downstream applications.

Important: Standardizing the data adds your document types to your project. If you create and deploy a project without standardizing your data, your document types won't be classified.

Understanding data definitions

A data definition is a way to describe and standardize a point of data so that other services and applications know what to expect from the data and can easily ingest and use it without any conversion. An example of a data definition is a US Social Security Number, which has a similar format and meaning in multiple contexts.
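
To make the idea concrete, a data definition can be thought of as a named description of the type and format of a value. The sketch below is a hypothetical Python model, not the product's data definition format: the `DataDefinition` class, its attributes, and the namespace value are assumptions, and the pattern checks only the common NNN-NN-NNNN layout of a US Social Security Number.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DataDefinition:
    namespace: str
    title: str
    value_type: str
    pattern: str  # Regular expression that a conforming value must match.
    description: str = ""

    def accepts(self, value: str) -> bool:
        """Return True if the whole value matches the definition's format."""
        return re.fullmatch(self.pattern, value) is not None


ssn = DataDefinition(
    namespace="Common",  # Assumed for illustration.
    title="US Social Security Number",
    value_type="string",
    pattern=r"\d{3}-\d{2}-\d{4}",
)

print(ssn.accepts("123-45-6789"))  # True
print(ssn.accepts("123456789"))    # False
```

Because the definition carries its own format, any consumer that knows the definition can check a value without custom conversion logic.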

When you select your document types for deployment, you determine whether your designated fields are going to map to existing data definitions, or whether you want to add new data definitions that describe the field and value pair. If you create new data definitions, they are added to the data definition collection. This makes the new data definitions available to other users and other automation or service builders.

When possible, use existing data definitions to keep future applications and integrations consistent.

Procedure

To standardize your data:

  1. From the Designer screen, on the Data standardization tile, click Start.
  2. Choose a tile for a document type that you want to deploy, and set the Deploy toggle control to Yes.
  3. On that tile, click Start.
    The fields for that document type are displayed.
  4. From the list of fields, choose the first field that you want to establish as a property, and click Define.
    Tip: If you see a field in this list that you do not want to extract data for, you can use the Delete control to remove it.
  5. In the Define for field panel, choose from the following options:
    Choice: Match field to an existing data definition
    Actions: Specify a namespace and an existing data definition to use for the field. The data definitions are from the default set, under the Common namespace, or from collections that other users created.
    This option is available only for simple fields. You cannot match a composite field to an existing data definition; for a composite field, you must create a new data definition.

    Choice: Create a new data definition for the field
    Actions: Provide a title, an optional description, whether the field is required, and information about the value, such as Type and additional properties.
    Titles must be unique. You cannot use a title that is already used for another data definition.

  6. Click Create to add the data definition.
  7. Repeat steps 4 through 6 for each field that you want to add as a property for the document-type class.
    Remember: All field values are saved in the JSON file for the document, but not every field needs to be a property on the class.
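
The two constraints in step 5 (only simple fields can be matched to an existing data definition, and the title of a new data definition must be unique) can be expressed as a small check. This Python sketch is hypothetical: `define_field`, its parameters, and the representation of the collection as a set of titles are illustrative assumptions, not a product API.

```python
def define_field(
    choice: str,
    title: str,
    is_composite: bool,
    existing_titles: set[str],
) -> str:
    """Validate a choice and return the data definition title to use.

    choice is "match" (use an existing definition) or "create" (add a new one).
    """
    if choice == "match":
        if is_composite:
            raise ValueError("Composite fields require a new data definition.")
        if title not in existing_titles:
            raise ValueError(f"No existing data definition titled {title!r}.")
        return title
    if choice == "create":
        if title in existing_titles:
            raise ValueError(f"Title {title!r} is already in use.")
        existing_titles.add(title)  # The new definition joins the collection.
        return title
    raise ValueError(f"Unknown choice: {choice!r}")


titles = {"Invoice number", "Total"}
print(define_field("match", "Total", is_composite=False, existing_titles=titles))
print(define_field("create", "Line items", is_composite=True, existing_titles=titles))
```

After the second call, "Line items" is part of `titles`, so a later attempt to create another definition with that title raises an error.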

What to do next

Return to the Designer screen to configure retention for your document types.