Get PDF Region Text

Gets the text from a region of a PDF file, defined by the specified coordinates, using OCR.

Command availability: IBM RPA SaaS and IBM RPA on premises

Script syntax

IBM RPA's proprietary scripting language has a syntax similar to other programming languages. The script syntax defines the command's syntax in the script file. You can work with this syntax in IBM RPA Studio's Script mode.

pdfRegionText --region(Rectangle) --language(String) [--autodetect(Boolean)] [--useocr(Boolean)] --ocrprovider(Nullable<OpticalCharacterRecognitionProvider>) [--googlevisionclientsecret(String)] --page(Numeric) --dpix(Numeric) --dpiy(Numeric) --file(Pdf) (String)=value

Input parameter

The following table displays the list of input parameters available in this command. In the table, you can see the parameter name when working in IBM RPA Studio's Script mode and its Designer mode equivalent label.

| Designer mode label | Script mode name | Required | Accepted variable types | Description |
|---|---|---|---|---|
| OCR Provider | ocrprovider | Required | OpticalCharacterRecognitionProvider | Text recognition method to use. See the ocrprovider parameter options. |
| API Parameters | googlevisionclientsecret | Optional | Text | The absolute path to the JSON file containing the API parameters. Refer to the Google Cloud Vision™ documentation for details about the JSON format. |
| Region | region | Required | Rectangle | Region that delimits the location in the PDF file from which the text is obtained. The region consists of the X and Y coordinates plus the width and height. To obtain the region of the file, use the IBM RPA Studio Region Selector, located in the upper menu: "Tools > Pdf > Region Selector". |
| Auto Detect | autodetect | Optional | Boolean | When enabled, automatically selects the OCR provider used to obtain text from the PDF file. |
| Use OCR | useocr | Optional | Boolean | When enabled, lets you select the OCR provider used to retrieve the text. |
| Page | page | Required | Number | Number of the page from which the text should be retrieved. |
| DpiX | dpix | Required | Number | Number of pixels per inch on the horizontal axis of the PDF file. |
| DpiY | dpiy | Required | Number | Number of pixels per inch on the vertical axis of the PDF file. |
| File | file | Required | PDF | Full path to the PDF file from which the text should be retrieved. |
| Clear trailing line breaks | sanitize | Optional when the OCR Provider parameter is Abbyy | Boolean | Trims the result text, removing trailing Unicode line break characters. |
| Language | language | Required | Text, Culture | Language in which the text to be obtained is written. For supported languages, see Supported languages. |
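
The Region value is a string of four numbers (X, Y, width, height), and DpiX and DpiY give a resolution in pixels per inch. The following Python sketch, which runs outside IBM RPA, shows how such a region string can be parsed and rescaled. It assumes that the region coordinates are pixel values measured at the given DPI, which this page does not state explicitly, and that the target unit is the PDF point (1/72 inch). The names `Region`, `parse_region`, and `pixels_to_points` are hypothetical helpers, not part of IBM RPA.

```python
from dataclasses import dataclass

# A PDF point is 1/72 inch.
POINTS_PER_INCH = 72


@dataclass(frozen=True)
class Region:
    """Hypothetical container mirroring the X, Y, width, height of a region."""
    x: float
    y: float
    width: float
    height: float


def parse_region(text: str) -> Region:
    """Parse an "X,Y,width,height" string such as "158,258,136,86"."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {len(parts)}")
    x, y, width, height = (float(p) for p in parts)
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return Region(x, y, width, height)


def pixels_to_points(region: Region, dpi_x: float, dpi_y: float) -> Region:
    """Rescale a region from pixels to points.

    Assumption: the region was measured in pixels at dpi_x by dpi_y.
    """
    sx = POINTS_PER_INCH / dpi_x
    sy = POINTS_PER_INCH / dpi_y
    return Region(region.x * sx, region.y * sy,
                  region.width * sx, region.height * sy)


if __name__ == "__main__":
    region = parse_region("158,258,136,86")
    print(pixels_to_points(region, 110, 110))
```

A check like this can help confirm that a region captured at one resolution still covers the same area when DpiX and DpiY are changed.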

ocrprovider parameter options

The following table displays the options available for the ocrprovider input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.

| Designer mode label | Script mode name | Description |
|---|---|---|
| Abbyy | Abbyy | Abbyy OCR provider. |
| Google | Google | Google Tesseract OCR provider. |
| Google Cloud Vision | GoogleVision | Google Cloud Vision API. |

Output parameter

| Designer mode label | Script mode name | Accepted variable types | Description |
|---|---|---|---|
| Value | value | Text | Returns the text retrieved from the specified region of the PDF. |

Example

In this example, the Get PDF Region Text command obtains the text "Bill To, Lucas Lima" from a specific region of a PDF file.

defVar --name pdfFile --type Pdf
defVar --name obtainedText --type String
// Use a PDF file.
pdfOpen --file "fileForPDFCommands.pdf" pdfFile=value
// Enter the parameters according to the coordinates of your PDF file.
pdfRegionText --region "158,258,136,86" --language "eng" --page 1 --dpix 110 --dpiy 110 --file ${pdfFile} obtainedText=value
logMessage --message "Text obtained from the specified region: ${obtainedText}" --type "Info"
// The obtained text is displayed in the console.