Best practices for OCR with Content Analyzer

These best practices for optical character recognition (OCR) with Content Analyzer help you understand the factors that affect OCR results and performance.

OCR relies on heuristic algorithms, so its results are not always accurate. The following points describe the factors that affect OCR results and performance, and suggest document formats that improve those results.

Quality

First, a document should have clear, machine-printed text. The following factors reduce OCR success and should be avoided where possible.

Skewing and distortion – OCR can handle a small amount of skewing, but recognition degrades as skewing and distortion increase.
Noise – Speckles, streaks, watermarks, stamps, and other marks that are not part of the text interfere with OCR. Noise also includes handwritten notes, circled text, and other notations that are sometimes added to a document before scanning. When noise touches text, it can interfere with character recognition. Even when it does not, it can interfere with line recognition and block identification.
Background – As part of text extraction, OCR must determine what is text and what is background. Content Analyzer supports color and grayscale images, but too many colors or certain color combinations can interfere with its ability to identify foreground and background colors. Inverse text (light text on a dark background) is supported by Content Analyzer.

Size

Content Analyzer supports processing of files up to 250 MB. However, file size directly affects processing time, so using file types and formats that generate smaller files improves performance. Images from digital cameras or mobile phones are intended to provide photograph-quality results and are therefore fairly large. Where possible, use documents that are generated by scanners or other sources where file size can be controlled.
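A pre-ingestion size check along the following lines can flag problem files before they are submitted. This is a minimal sketch, not part of Content Analyzer: the 250 MB limit comes from the text above, but the function names, the warning threshold, and the assumption that 1 MB means 1,048,576 bytes are illustrative only.

```python
import os

# Limit stated in this document; 1 MB is assumed to be 1024 * 1024 bytes.
MAX_FILE_BYTES = 250 * 1024 * 1024
# Hypothetical threshold for "large enough to slow processing"; tune by testing.
WARN_FILE_BYTES = 25 * 1024 * 1024


def classify_size(num_bytes: int) -> str:
    """Return 'too_large', 'large', or 'ok' for a file size in bytes."""
    if num_bytes > MAX_FILE_BYTES:
        return "too_large"
    if num_bytes > WARN_FILE_BYTES:
        return "large"
    return "ok"


def check_file(path: str) -> str:
    """Classify a file on disk by its size."""
    return classify_size(os.path.getsize(path))
```

Files reported as large are still accepted under this sketch; the label only marks them as candidates for rescanning or conversion to a more compact format.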

Format

Lossy compression, such as that used by JPG files, keeps the file size smaller but blurs character edges and reduces their clarity, which ultimately leads to lower OCR results. Use lossless compression formats where possible. Note that although PDF documents themselves are lossless, if the images used to generate them are of low quality, the PDF OCR results reflect that.

Fonts

The OCR engine can recognize text in many different fonts. However, standard fonts, such as Arial and Times New Roman, provide better recognition results than fonts that have more unusual character shapes. OCR can also handle different font sizes, although very small sizes might not have enough pixels to clearly identify the characters, and very large sizes might not be recognized as text at all. Testing is required to determine where those limits are for your documents.

DPI settings

Optimal DPI for recognition is generally 200 or 300 DPI for both the X and Y axes. Higher DPIs can be used for better recognition with small fonts or when recognizing languages that contain intricate characters. However, higher DPIs also increase file size and processing time. Recognition results for DPIs below 200 are lower because of the loss of character clarity. DPI for the X and Y axes should be the same. Note that fax images have low DPI and also have unequal DPI values for the X and Y axes. As such, fax recognition is lower than for other types of documents.

Note: The following combinations of font size and DPI indicate the level of confidence that you can expect when a file is processed:

If the font size is larger than 12 with 100 DPI, you can get a high level of confidence.
If the font size is below 8 with 200 DPI or lower, you can get a low level of confidence, or might not get the correct OCR.
With 300 DPI, you can get a high level of confidence for any font size.
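The arithmetic behind these guidelines can be sketched as follows. One typographic point is 1/72 inch, so text of a given point size has a nominal height of roughly point size × DPI / 72 pixels. The function names are hypothetical, and the confidence buckets simply restate the three rules in the note. Treating "100 DPI" as "100 DPI or higher" and "300 DPI" as "300 DPI or higher" is an assumption of this sketch, and combinations that the note does not cover are reported as untested rather than guessed.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def glyph_height_px(font_pt: float, dpi: float) -> float:
    """Nominal text height in pixels for a font size scanned at a given DPI."""
    return font_pt * dpi / POINTS_PER_INCH


def expected_confidence(font_pt: float, dpi: float) -> str:
    """Restate the note's rules; 'untested' means the note gives no guidance."""
    if dpi >= 300:                      # assumption: 300 DPI or higher
        return "high"
    if font_pt > 12 and dpi >= 100:     # assumption: 100 DPI or higher
        return "high"
    if font_pt < 8 and dpi <= 200:
        return "low"
    return "untested"
```

For example, 12-point text scanned at 300 DPI has a nominal height of 50 pixels, while 6-point text at 200 DPI is under 17 pixels, which leaves far fewer pixels to distinguish one character shape from another.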

Character substitution

Some characters are very difficult for an OCR engine to differentiate, for example, O (capital letter), o (lowercase letter), and 0 (number). Content Analyzer performs semantic normalization, so a search for ‘cost’ also finds a match with ‘C0st’. Text accuracy for key values can also be improved by selecting Mostly Alphabetic or Mostly Numeric when defining the keys in the ontology.
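To illustrate the general idea of matching text despite look-alike characters, the following sketch folds commonly confused characters to one representative before comparing. This is not Content Analyzer's implementation. Only the O, o, and 0 confusion comes from the text above; the other table entries and the function names are assumptions chosen for the example.

```python
# Hypothetical table of look-alike characters, applied after lowercasing.
# Only the 0/o pair comes from this document; the others are illustrative.
CONFUSABLE = str.maketrans({"0": "o", "1": "l", "5": "s"})


def fold(text: str) -> str:
    """Lowercase the text and map look-alike characters to one representative."""
    return text.lower().translate(CONFUSABLE)


def fuzzy_equal(expected: str, recognized: str) -> bool:
    """Compare two strings after folding both sides the same way."""
    return fold(expected) == fold(recognized)
```

With this table, fuzzy_equal("cost", "C0st") is true, while fuzzy_equal("cost", "cast") remains false. Folding both sides keeps the comparison symmetric, at the cost of also treating genuinely different values such as "100" and "loo" as equal, which is one reason why restricting a key to mostly alphabetic or mostly numeric content helps.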

Performance

When planning a system, one of the key metrics is throughput. How long does the average document take to process and, by extension, how many documents can you process per hour and per day? You need to test to determine what your performance ultimately is. However, the following are the primary factors, some of which have already been mentioned.

Hardware speed – This is somewhat of a given, but it is important to consider when determining performance. If you test on a Development system that does not match the Production system that you run on, the performance results differ.
System size – Content Analyzer is capable of processing documents in parallel, the exact number being based on the size of your system. So, if one document takes X seconds, it does not necessarily mean that 10 documents take 10 times X seconds. Testing with many documents is needed to determine total throughput.
File size – OCR analyzes images pixel by pixel. Therefore, large images have more pixels and take longer to process.
Image complexity – Images with a lot of text take longer to process than images with a small amount of text. For this reason, avoid using sample documents for performance testing, because they often have less text and process faster than actual documents.
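A rough planning estimate can be derived from values that you measure yourself. The sketch below assumes ideal parallelism, so its first function gives an upper bound rather than a prediction; the function and parameter names are hypothetical, and the inputs must come from your own testing.

```python
SECONDS_PER_HOUR = 3600.0


def ideal_docs_per_hour(avg_seconds_per_doc: float, parallel_docs: int) -> float:
    """Upper bound on throughput, assuming perfectly parallel processing."""
    if avg_seconds_per_doc <= 0 or parallel_docs < 1:
        raise ValueError("need a positive duration and at least one parallel slot")
    return SECONDS_PER_HOUR / avg_seconds_per_doc * parallel_docs


def measured_docs_per_hour(doc_count: int, elapsed_seconds: float) -> float:
    """Throughput observed in a batch test with representative documents."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return doc_count * SECONDS_PER_HOUR / elapsed_seconds
```

For example, if one document takes 12 seconds and the system processes 4 documents at a time, the ideal figure is 1,200 documents per hour. If a batch test of 500 documents finishes in 30 minutes, the measured figure is 1,000 documents per hour, and the gap between the two shows how far the system is from ideal parallelism.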

Summary

The information provided here should give you a better understanding of the recognition results and performance that you see for your documents. Ideally, it also suggests ways to modify your file ingestion methods to generate documents that provide the best results.