Standardizing your data
You determine which document types you want to include in your deployed project. For each document type, you choose the fields that you want to formalize as properties in the metadata for each document object that is stored.
About this task
When you build a document processing application that is based on this document processing project, the application processes documents and sends them to a content repository. Processing includes classifying the document type and extracting data.
- During data standardization
- For each document type that you choose to deploy, one class is added to the repository configuration at deployment time.
For each field that you map to a data definition, one property is added to the document-type class in the repository configuration.
- The outcome at processing time
- The values for all the fields that are associated with the document type are extracted into a JSON file for the document, which is stored with the document.
The values for all the fields that are mapped to a data definition during data standardization are also stored as property values on the document class.
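The relationship described above can be sketched in a few lines of Python. This is only an illustrative model, not the product's API: the names `FieldSpec`, `DocumentType`, `repository_classes`, and `process_document`, as well as the sample document types and fields, are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FieldSpec:
    name: str
    # Name of the data definition the field is mapped to (hypothetical);
    # None means the field was not mapped during data standardization.
    data_definition: Optional[str] = None


@dataclass
class DocumentType:
    name: str
    fields: List[FieldSpec]
    deploy: bool = True  # whether this document type was chosen for deployment


def repository_classes(doc_types: List[DocumentType]) -> Dict[str, List[str]]:
    """One class per deployed document type, one property per mapped field."""
    return {
        dt.name: [f.name for f in dt.fields if f.data_definition is not None]
        for dt in doc_types
        if dt.deploy
    }


def process_document(doc_type: DocumentType,
                     extracted: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return the JSON payload (all fields) and the property values (mapped fields)."""
    known = {f.name for f in doc_type.fields}
    mapped = {f.name for f in doc_type.fields if f.data_definition is not None}
    payload = {k: v for k, v in extracted.items() if k in known}
    properties = {k: v for k, v in payload.items() if k in mapped}
    return json.dumps(payload, sort_keys=True), properties


# Hypothetical sample data.
invoice = DocumentType("Invoice", [FieldSpec("InvoiceNumber", "Invoice number"),
                                   FieldSpec("Notes")])
receipt = DocumentType("Receipt", [FieldSpec("Total", "Currency amount")],
                       deploy=False)

classes = repository_classes([invoice, receipt])
payload, properties = process_document(
    invoice, {"InvoiceNumber": "INV-1", "Notes": "Paid in full"})
```

In this sketch, `Notes` appears in the JSON payload but not in the properties because it is not mapped to a data definition, and `Receipt` produces no class because it is not selected for deployment.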
Your document classification and data extraction models include information from the pre-trained document types that are present in every project. After you make decisions about the document types and set up your data extraction, you have a full set of document types and fields that are candidates for deploying to the repository. For deployment, choose only the document types that are relevant for your organization. For each document type that you deploy, choose only the fields that are likely to be used in downstream applications.
Important: Standardizing the data adds your document types to your project. If you create and deploy a project without standardizing your data, your document types won't be classified.
- Understanding data definitions
- A data definition is a way to describe and standardize a point of data so that other services and applications know what to expect from the data and can easily ingest and use it without any conversion. An example of a data definition is a US Social Security Number, which has a similar format and meaning in multiple contexts.
When you select your document types for deployment, you determine whether your designated fields are going to map to existing data definitions, or whether you want to add new data definitions that describe the field and value pair. If you create new data definitions, they are added to the data definition collection. This makes the new data definitions available to other users and other automation or service builders.
When possible, use existing data definitions rather than creating new ones. Reusing definitions keeps data consistent across future applications and integrations.
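As a rough illustration of why reuse matters, the following Python sketch models a data definition collection as a dictionary that is keyed by name. It is not the product's implementation: `DataDefinition`, `get_or_add`, and the collection structure are hypothetical, and the pattern is a simplified assumption that checks only the common written form of a US Social Security Number (three digits, two digits, four digits, separated by hyphens).

```python
import re
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DataDefinition:
    name: str
    pattern: str  # simplified format check; an assumption for illustration

    def matches(self, value: str) -> bool:
        """Return True if the whole value fits the definition's format."""
        return re.fullmatch(self.pattern, value) is not None


def get_or_add(collection: Dict[str, DataDefinition],
               name: str, pattern: str) -> DataDefinition:
    """Reuse an existing definition by name; add a new one only if none exists."""
    existing = collection.get(name)
    if existing is not None:
        return existing
    definition = DataDefinition(name, pattern)
    collection[name] = definition
    return definition


collection: Dict[str, DataDefinition] = {}
ssn = get_or_add(collection, "US Social Security Number", r"\d{3}-\d{2}-\d{4}")

# A second project that asks for the same definition gets the shared one back.
ssn_again = get_or_add(collection, "US Social Security Number", r"\d{9}")
```

Because the second call returns the definition that already exists, both consumers agree on one format, which is the consistency benefit of reusing definitions.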
Procedure
To standardize your data: