Understanding model accuracy

Automation Document Processing provides a pre-trained model for common document types. As you add sample documents and fields, the overall model is trained to match your specific requirements. You evaluate the resulting models to determine which model to use for your project.

Best practices for model building

  • Ensure that documents in a document type do not also exist in another document type.
  • Ensure that the document set provided for each document type adequately represents the type.
  • Test files against the models to determine which model works best for your document set. Include files that do not belong to any of your document types in your testing.
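The last practice above can be sketched as a small comparison loop: score every candidate model on the same held-out test files and keep the best performer. The callable-model interface and the "other" label for out-of-set files are illustrative assumptions, not the product's actual API.

```python
def best_model(models, test_files):
    """Pick the model with the highest accuracy on a shared test set.

    Each model is assumed to be a callable that maps a document to a
    predicted document type. Files that belong to none of your document
    types carry a label such as "other", so misclassifying them counts
    against a model too.
    """
    def score(model):
        correct = sum(1 for doc, label in test_files if model(doc) == label)
        return correct / len(test_files)

    return max(models, key=score)
```

For example, given two models and a test set containing one out-of-set file, the model that correctly rejects the unknown file wins the comparison.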

Model accuracy

A model's accuracy is based on the correct predictions that are made for your document classes. Training files are bundled into a group and verified against known labels to measure accuracy: 70% of the data set is used to train the classifier, and the remaining 30% is compared against the classifier's predictions to measure how well the classifier predicts results. Accuracy is the number of correct predictions divided by the total number of predictions made.
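The 70/30 split and the accuracy calculation described above can be sketched in a few lines. The helper names and the fixed shuffle seed are illustrative, not part of the product:

```python
import random

def train_test_split(samples, train_fraction=0.7, seed=42):
    """Shuffle labeled samples and split them 70/30, mirroring the
    split described above. `samples` is a list of (document, label)
    pairs; the seed keeps the split reproducible."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predictions, labels):
    """Accuracy = correct predictions / all predictions made."""
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)
```

With 10 labeled samples, the split yields 7 training and 3 evaluation samples; `accuracy` then compares the classifier's output on the evaluation samples against their known labels.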

Confidence scores

The score levels are based on the documents provided. Using the best practices for model building ensures better confidence scores.

Model

  1. Low: The model has a low confidence, which might lead to incorrect document types.
  2. Medium: The model is likely to get a mixture of correct and incorrect types.
  3. High: The model has a high confidence, which means the model is likely to get a high number of the correct types. Because of the conservative nature of the confidence scoring, it is possible that some documents cannot be classified successfully.
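The three bands above can be pictured as cutoffs on a raw confidence score. The cutoff values in this sketch are illustrative assumptions only; the product does not publish the thresholds it uses.

```python
def confidence_band(score, low_cutoff=0.5, high_cutoff=0.8):
    """Map a raw confidence score in [0, 1] to a Low/Medium/High band.

    The cutoffs here are hypothetical, chosen only to illustrate how a
    score translates into the bands described above.
    """
    if score < low_cutoff:
        return "Low"
    if score < high_cutoff:
        return "Medium"
    return "High"
```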

Possible reasons for low scores

  • Sufficient features might not be extracted from the documents that are provided for each document type.
  • Documents that are submitted under each document type can be similar to other document types.
  • The document set provided is insufficient to achieve a good confidence score. Consider adding more training files to help improve the confidence levels.

Document Type

  1. Low: Documents belonging to that document type are likely to be incorrectly classified.
  2. Medium: Documents belonging to that document type are likely to be correctly and incorrectly classified.
  3. High: Documents belonging to that document type are correctly classified, although you might get a Low accuracy for some documents.

Possible reasons for low scores

  • Insufficient features are available to correctly identify documents in the document class.
  • Documents that are submitted for a document type can be similar to other document types.

Evaluating the models

Classification

It might seem obvious that the model with the highest confidence score is the right model to choose for your project. However, the accuracy for your most important document types should be a consideration when you evaluate the models.

For example, if one of your document types is misclassified, but with a low confidence, that might mean that the model has the right information about the document type, but encountered an unusual sample.

A low confidence classification means that if your model encounters that document type in a processing application, the application will flag it for a user to fix the classification.
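That routing decision can be sketched as a simple threshold check. The threshold value and the result shape are assumptions for illustration, not the processing application's actual behavior.

```python
def route_document(doc_type, score, threshold=0.7):
    """Route a classified document: accept high-confidence results
    automatically, and flag low-confidence ones for a user to fix the
    classification. `threshold` is a hypothetical cutoff."""
    if score >= threshold:
        return {"document_type": doc_type, "status": "accepted"}
    return {"document_type": doc_type, "status": "needs_review"}
```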

On the other hand, if you have a model that classifies an invoice as a bill of lading with a high degree of confidence, you can see that the model is not going to be as useful when it's time to process documents. If the model is not accurate, you can check where your known categories don't match the document types that the model assigns.

Adding more samples can help refine the model.

Extraction

An extraction model tells you how many fields or values it found, with low, medium, or high confidence.

However, this model does not know whether the values that are extracted are what you expect them to be. Check the extraction results to ensure that you are getting the expected values.
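A quick way to perform that check is to compare the extracted fields against the values you expect for a known sample and review only the mismatches. The field names and the dictionary shapes here are hypothetical examples:

```python
def check_extraction(extracted, expected_fields):
    """Compare extracted field values against the values you expect.

    Returns only the fields that are missing or differ, so you can
    review them. Both arguments are plain dicts of field name -> value;
    the field names used in testing are illustrative.
    """
    mismatches = {}
    for field, expected in expected_fields.items():
        actual = extracted.get(field)
        if actual != expected:
            mismatches[field] = {"expected": expected, "actual": actual}
    return mismatches
```

An empty result means every expected value was found; otherwise each entry shows what the model returned next to what you expected.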