How IBM’s latest research is leading the optical character recognition (OCR) revolution and pushing the boundaries of capabilities.

Documents have always been (and continue to be) a significant data source for any business or corporation. It’s crucial to be able to scan and digitize physical documents to extract their information and represent them in a way that allows for further analysis (e.g., for a mortgage or loan process for a bank) no matter how the data is captured. Even for documents created digitally (e.g., PDF documents) the process of extracting information can be a challenge.

At IBM, we treat this as a multidisciplinary challenge spanning computer vision, natural language understanding, information representation and model optimization. With this approach, we are advancing the state of the art in document understanding: our models analyze the layout and reading order of complex documents and represent visual elements such as plots, charts and diagrams in a multimodal way.

This work led to IBM's new, enhanced optical character recognition (OCR) technology, which lets enterprises digitize important business documents more easily and accurately and extract their information for analysis.

Cleaner and more accurate extraction creates multiple benefits, including the following:

  • Accelerated workflows
  • Automated document routing and content processing
  • Reduced costs
  • Superior data security
  • Disaster recovery

A variety of use cases that rely on OCR will also benefit from the enhancements IBM is making. From data extraction to automating big data processing workflows, OCR powers many systems and services used every day.

Document understanding

Document understanding is the ability to read these business documents, either programmatically or by OCR, and interpret their content so it can take part in an automated business process. One example is automated insurance claims processing, where data is extracted from ID cards, claim forms and claim descriptions, among others.

Documents are digitized using OCR, which is composed of two stages:

  • Detection: Localize the various words in the document.
  • Recognition: Identify the characters that make up each detected word.
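
As a rough illustration of how the two stages fit together, here is a minimal Python sketch. It is not IBM's model: a grid of text characters stands in for the scanned image, and the names `WordBox`, `OcrWord`, `detect_words`, `recognize_word` and `run_ocr` are hypothetical, chosen only for this example. A real system would replace both functions with trained vision models operating on pixels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordBox:
    """Location of one detected word (hypothetical structure for this sketch)."""
    row: int
    col_start: int
    col_end: int  # exclusive


@dataclass
class OcrWord:
    """A detected location paired with the recognized text."""
    box: WordBox
    text: str


def detect_words(page: List[str]) -> List[WordBox]:
    """Stage 1, detection: localize words as runs of non-blank cells.

    Assumption: the page is a list of text rows standing in for an image.
    """
    boxes = []
    for row, line in enumerate(page):
        start = None
        # A trailing blank acts as a sentinel that closes the last run.
        for col, cell in enumerate(line + " "):
            if cell != " " and start is None:
                start = col
            elif cell == " " and start is not None:
                boxes.append(WordBox(row, start, col))
                start = None
    return boxes


def recognize_word(page: List[str], box: WordBox) -> str:
    """Stage 2, recognition: identify the characters inside one box."""
    return page[box.row][box.col_start:box.col_end]


def run_ocr(page: List[str]) -> List[OcrWord]:
    """Run detection, then recognition on every detected box."""
    return [OcrWord(box, recognize_word(page, box)) for box in detect_words(page)]


if __name__ == "__main__":
    for word in run_ocr(["Claim form", "  ID 42"]):
        print(word.box, word.text)
```

Keeping the stages separate means each one can be improved or retrained on its own, and the output always carries both the position and the content of every word.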

This means that with OCR, we know where the words are on the document and what those words are. However, when using OCR, challenges arise when documents are captured under any number of non-ideal conditions. This can include incorrect scanner settings, insufficient resolution, bad lighting (e.g., mobile capture), loss of focus, unaligned pages and added artifacts from badly printed documents.

Our team focused on these two challenging areas to address how the next generation of OCR technology can detect and extract data from low-quality and natural-scene image documents.

Better training and accuracy

Imagine for a moment that you are going to build a computer vision system for reading text in documents or extracting structure and visual elements. To train this system, you will undoubtedly need a lot of data that has to be correctly labeled and sanitized for human errors. Furthermore, you might realize that you require a different granularity of classes to train a better model—but acquiring new labeled data is costly. The cost will likely force you to make some compromises or use a narrower set of training regimens which may affect accuracy.

But what if you could quickly synthesize all of the data you need? How would that affect the way you approach the problem?

Synthetic data is at the core of our work in document understanding and our high-accuracy technology. As we developed our OCR model, we required significant amounts of data that is hard to acquire and annotate. As a result, we created new methods to synthesize data and, because synthetic data can be altered at will, applied optimization techniques to increase the accuracy of our architecture.

Now we are synthesizing data for object segmentation, text recognition, NLP-based grammatical correction models, entity grouping, semantic classification and entity linkage.

Another advantage of synthetic data generation is the ability to control the granularity and format of the labels, as well as rendering properties such as colors, fonts, font sizes and background noise. This enables us to design architectures that can recognize punctuation, layout, handwritten characters and form elements.
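
To make the idea of controllable labels concrete, here is a minimal Python sketch of a synthetic sample generator. It is an illustration under stated assumptions, not IBM's pipeline: the vocabulary, style lists, value ranges and the names `SyntheticSample` and `synthesize` are all hypothetical, and the sample is described by its parameters rather than rendered to an image. The key point is that labels are known by construction, so switching from word-level to character-level annotation costs nothing.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical style options and vocabulary, chosen only for illustration.
FONTS = ["serif", "sans", "mono"]
COLORS = ["black", "navy", "gray"]
WORDS = ["invoice", "total", "claim", "date", "loan"]


@dataclass
class SyntheticSample:
    text: str
    font: str
    font_size: int
    color: str
    noise: float  # background noise level in an assumed range of 0.0 to 0.3
    labels: List[Tuple[int, int, str]]  # (start, end, label) spans into text


def synthesize(rng: random.Random, n_words: int, granularity: str) -> SyntheticSample:
    """Create one sample whose labels are exact because we generated the text."""
    if granularity not in ("word", "char"):
        raise ValueError("granularity must be 'word' or 'char'")
    words = [rng.choice(WORDS) for _ in range(n_words)]
    text = " ".join(words)
    labels: List[Tuple[int, int, str]] = []
    pos = 0
    for word in words:
        if granularity == "word":
            labels.append((pos, pos + len(word), word))
        else:
            # One label per character, with the same offsets into the text.
            labels.extend((pos + i, pos + i + 1, ch) for i, ch in enumerate(word))
        pos += len(word) + 1  # skip the separating space
    return SyntheticSample(
        text=text,
        font=rng.choice(FONTS),
        font_size=rng.randint(8, 24),
        color=rng.choice(COLORS),
        noise=rng.uniform(0.0, 0.3),
        labels=labels,
    )


if __name__ == "__main__":
    print(synthesize(random.Random(0), 3, "word"))
```

Seeding the random generator makes a dataset reproducible, and a renderer (not shown) would turn each sample's parameters into an image for training.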

By training the models mentioned above on synthetic data, we have produced a major update to our core OCR model that provides a significant boost in accuracy and lower processing time.

Higher-level document understanding

Not all documents within an enterprise are of equal value. For example, business documents are central to the operation of business and are at the heart of digital transformation. Such documents include contracts, loan applications, invoices, purchase orders, financial statements and many more. The information in these business documents is presented in natural language and is unstructured. Understanding these documents poses a challenge due to complex document layouts and poor-quality scans.

Now with IBM’s latest OCR technology, these critical documents can be read and the key information contained within can be extracted.

Conclusion

As data continues to provide the key insights enterprises need to analyze their business, understand their customers and automate workflows, document-understanding technology like optical character recognition (OCR) is more important than ever.

IBM’s latest research is leading the OCR revolution by pushing the boundaries of OCR capabilities and raising the standard for OCR in the development community. We’re committed to improving our product and providing our customers with the highest level of performance and accuracy possible.

This new OCR technology is being rolled out across all IBM products that use OCR. It will allow enterprises to digitize important business documents more easily and accurately and to extract their information for analysis.

To learn more, check out the documentation and release notes.

The new OCR technology is already available in IBM Watson Discovery—try it out and get started today.
