Supplementing OCR Engine Table Identification

The OCR Engine identifies the lines on the page. In previous versions, some tables were not detected correctly because the OCR Engine failed to identify the line between the header and the rows. This release adds logic that improves how the OCR Engine detects table lines, which in turn improves overall table extraction.
The output from the OCR Engine is also used to identify the rows of the table. In previous versions, unrelated rows were sometimes grouped together, especially when no separator line was present between the rows (in gridless tables, for example). This release adds fuzzy logic that improves the identification of rows.
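
The row-identification logic itself is internal to the product and is not described here. Purely as an illustration of why gridless tables are hard, the following sketch groups OCR words into rows by comparing their vertical positions against a tolerance. The Word type, the group_into_rows function, and the tolerance value are all hypothetical and are not part of the product API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word box (hypothetical units)
    y: float  # vertical center of the word box


def group_into_rows(words, tolerance=5.0):
    """Group words into rows when no separator lines exist.

    A word joins the current row if its vertical center is within
    `tolerance` of the previous word's center; otherwise it starts a new row.
    """
    rows = []
    for word in sorted(words, key=lambda w: w.y):
        if rows and abs(word.y - rows[-1][-1].y) <= tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, order the words left to right.
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


words = [
    Word("Widget", 10, 100), Word("4.00", 200, 102),
    Word("Gadget", 10, 120), Word("9.50", 200, 119),
]
rows = group_into_rows(words)
```

With a tolerance that is too large, the two rows in this sample would merge into one, which is the kind of incorrect grouping that the improved logic is meant to reduce.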

Header/Table detection improvements

If table headers are not extracted properly, annotate the page that contains the table. Typical symptoms are multiple headers that are combined into one, a header that is split, table data that is truncated, or unrelated data that is appended to the end of the table. To annotate the page, complete the following steps:

Upload the single page that contains the table in Extraction model > Teach model.
Annotate the table headers as appropriate.
Retrain the extraction model.
Review and publish the model.
Re-process the document to get better table detection results.

Promotion of key-value pairs (KVPs) inside a table cell as a column of the table

If a column that is defined in the ontology for the table does not appear in the table header row on the document page, but appears as a KVP inside a cell, then the value of the KVP is promoted to that column, as defined in the ontology of the table.

Examples:

Example 1

In this example, we assume the table was defined in the ontology with the following columns: ItemID, Description, Amount, and PurchaseNumber.

The PurchaseNumber column has the alias PO. Here, the PO KVP that is found in the cell is promoted to a new column.

Example 2

In this example, the PO column already exists, so the "PO: ABC" values in the Description column are not used as KVPs for this column. Only the values in the actual PO column are used.

Example 3

In this example, the KVP is found first in the Description column and then in the Amount column. The PO column is populated only from the column where the KVP appears first, so for row 2 the PO number in the output is empty.
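
The three examples can be summarized in a minimal sketch. It is an interpretation of the behavior described above, not the product implementation: the promote_kvp function, the list-based table representation, and the sample cell values are hypothetical, and the sketch leaves the original cell text unchanged because this page does not say what happens to it.

```python
import re

# A key, a colon, at least one space, then a value (see the formatting note on this page).
KVP = re.compile(r"(?P<key>\w+)\s*:\s+(?P<value>\S+)")


def promote_kvp(header, rows, column, alias):
    """Add `column` to the table from in-cell "alias: value" pairs."""
    names = {column, alias}
    # Example 2: the column already exists in the header row,
    # so KVPs inside other cells are ignored.
    if names & set(header):
        return header, rows

    def find(cell):
        for match in KVP.finditer(cell):
            if match.group("key") in names:
                return match.group("value")
        return None

    # Example 3: the column where the KVP appears first is the only source.
    source = None
    for row in rows:
        for index, cell in enumerate(row):
            if find(cell) is not None:
                source = index
                break
        if source is not None:
            break
    if source is None:
        return header, rows

    # Example 1: promote the value; rows without the KVP in the source column get "".
    new_rows = [row + [find(row[source]) or ""] for row in rows]
    return header + [column], new_rows


header = ["ItemID", "Description", "Amount"]
rows = [
    ["1", "Widget PO: 111", "4.00"],
    ["2", "Gadget", "9.50 PO: 222"],
]
new_header, new_rows = promote_kvp(header, rows, "PurchaseNumber", "PO")
```

In this sample, row 2 carries its KVP in the Amount column, so its PurchaseNumber value stays empty, which mirrors Example 3.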

Note: For a KVP inside a cell to be promoted to a separate column, the field name and the value must be separated by a colon, and there must be a space between the colon and the value.

Valid examples:

PO: 12345
PO : 12345

Invalid example:

PO:12345
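
To check source text against this rule before processing, a regular expression such as the following can be used. It is a minimal sketch of the stated rule only (a colon followed by at least one space), and the parse_kvp helper is a hypothetical name, not a product function.

```python
import re

# Key, optional spaces, a colon, at least one space, then the value.
KVP_PATTERN = re.compile(r"^\s*(?P<key>[^:\s][^:]*?)\s*:\s+(?P<value>\S.*?)\s*$")


def parse_kvp(text):
    """Return (key, value) if `text` follows the required format, else None."""
    match = KVP_PATTERN.match(text)
    if match is None:
        return None
    return match.group("key"), match.group("value")


valid_1 = parse_kvp("PO: 12345")
valid_2 = parse_kvp("PO : 12345")
invalid = parse_kvp("PO:12345")  # no space after the colon
```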

Linking a field to a table summary field

If summary data is extracted as part of the table, and a field in the ontology is linked to that table summary data, then a new key-value pair is created from the summary data. For more information, see Adding fields in IBM Docs.
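
Conceptually, the linking works like a lookup from ontology fields to summary entries. The following sketch assumes a dictionary representation for both; the link_summary_fields function and the field and label names are hypothetical and only illustrate the idea.

```python
def link_summary_fields(table_summary, field_links):
    """Create key-value pairs for ontology fields linked to table summary data.

    table_summary: summary values extracted with the table, keyed by label.
    field_links: ontology field name mapped to the summary label it is linked to.
    """
    kvps = {}
    for field_name, summary_label in field_links.items():
        # Only linked fields whose summary data was actually extracted produce a KVP.
        if summary_label in table_summary:
            kvps[field_name] = table_summary[summary_label]
    return kvps


summary = {"Subtotal": "13.50", "Total": "14.85"}
links = {"InvoiceTotal": "Total", "Shipping": "Freight"}
kvps = link_summary_fields(summary, links)
```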

Annotating the first row of a table during table annotation

When you annotate tables in the extraction model, you can now annotate the first row of the table to help the model capture the table. For more information, see Annotating table fields in IBM Docs.


Table detection improvements in IBM Automation Document Processing 22.0.1
