Get PDF Region Text

Gets the text from a region of a PDF file, defined by the specified coordinates, using optical character recognition (OCR).

Command availability: IBM RPA SaaS and IBM RPA on premises

Script syntax

IBM RPA's proprietary scripting language has a syntax similar to other programming languages. The script syntax defines the command's syntax in the script file. You can work with this syntax in IBM RPA Studio's Script mode.

pdfRegionText --region(Rectangle) --language(String) [--autodetect(Boolean)] [--useocr(Boolean)] --ocrprovider(Nullable<OpticalCharacterRecognitionProvider>) [--googlevisionclientsecret(String)] --page(Numeric) --dpix(Numeric) --dpiy(Numeric) --file(Pdf) (String)=value

Input parameter

The following table displays the list of input parameters available in this command. In the table, you can see the parameter name when working in IBM RPA Studio's Script mode and its Designer mode equivalent label.

Designer mode label	Script mode name	Required	Accepted variable types	Description
OCR Provider	`ocrprovider`	`Required`	`OpticalCharacterRecognitionProvider`	Text recognition method to use. See the `ocrprovider` parameter options.
API Parameters	`googlevisionclientsecret`	`Optional`	`Text`	The absolute path to the JSON file containing the API parameters. Refer to the Google Cloud Vision™ documentation for details about the JSON format.
Region	`region`	`Required`	`Rectangle`	Region that delimits the area of the PDF file from which the text is obtained. The region consists of the X and Y coordinates plus the width and height. To obtain the region, use the IBM RPA Studio Region Selector, located in the upper menu: "Tools > Pdf > Region Selector".
Auto Detect	`autodetect`	`Optional`	`Boolean`	When enabled, automatically selects the OCR provider used to obtain the text from the PDF file.
Use OCR	`useocr`	`Optional`	`Boolean`	When enabled, allows you to select the OCR provider used to retrieve the text.
Page	`page`	`Required`	`Number`	Number of the page from which the text should be retrieved.
DpiX	`dpix`	`Required`	`Number`	Number of pixels per inch on the horizontal axis of the PDF file.
DpiY	`dpiy`	`Required`	`Number`	Number of pixels per inch on the vertical axis of the PDF file.
File	`file`	`Required`	`PDF`	Full path to the PDF file from which the text should be retrieved.
Clear trailing line breaks	`sanitize`	`Optional when the OCR Provider parameter is Abbyy`	`Boolean`	Trims the result text, removing trailing Unicode line break characters.
Language	`language`	`Required`	`Text, Culture`	Language in which the text to be obtained is written. For supported languages, see Supported languages.
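The relationship between the Region value and the DpiX and DpiY values can be illustrated with a short Python sketch. This is not IBM RPA code. It assumes that the region string follows the "X,Y,width,height" order described above and that the values are pixel measurements taken at the stated DPI, which this page does not confirm. The names `Region`, `parse_region`, and `region_size_in_inches` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangle given as X, Y, width, and height."""
    x: int
    y: int
    width: int
    height: int


def parse_region(value: str) -> Region:
    """Parse an "X,Y,width,height" string into a Region."""
    parts = [int(p.strip()) for p in value.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {len(parts)}")
    return Region(*parts)


def region_size_in_inches(region: Region, dpix: float, dpiy: float) -> tuple:
    """Convert the region size to inches.

    Assumption: the region values are pixels measured at the given DPI.
    """
    return region.width / dpix, region.height / dpiy


# Values taken from the example at the end of this page.
region = parse_region("158,258,136,86")
width_in, height_in = region_size_in_inches(region, dpix=110, dpiy=110)
print(region)
print(f"{width_in:.2f} in x {height_in:.2f} in")
```

Under that assumption, a region selected at one DPI would not cover the same physical area if the command were run with a different DPI, so it may be safest to reuse the DPI values that were in effect when the region was captured.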

`ocrprovider` parameter options

The following table displays the options available for the ocrprovider input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.

Designer mode label	Script mode name	Description
Abbyy	`Abbyy`	Abbyy OCR provider.
Google	`Google`	Google Tesseract OCR provider.
Google Cloud Vision	`GoogleVision`	Google Cloud Vision API.

Output parameter

Designer mode label	Script mode name	Accepted variable types	Description
Value	`value`	`Text`	Returns the text retrieved from the specified region of the PDF.
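The effect that the Clear trailing line breaks option is described as having on the returned text can be approximated with the following Python sketch. It is only an illustration and not the command's actual implementation: the set of characters treated as line breaks is an assumption, the function name `clear_trailing_line_breaks` is hypothetical, and the sample string is made up.

```python
# Characters treated as line breaks in this sketch (an assumed list):
# LF, CR, vertical tab, form feed, NEL, line separator, paragraph separator.
LINE_BREAKS = "\n\r\v\f\x85\u2028\u2029"


def clear_trailing_line_breaks(text: str) -> str:
    """Remove line break characters from the end of the text only."""
    return text.rstrip(LINE_BREAKS)


# Made-up OCR result with trailing line breaks.
raw = "Bill To\nLucas Lima\r\n\u2028"
cleaned = clear_trailing_line_breaks(raw)
print(repr(cleaned))
```

Line breaks inside the text are kept, and trailing spaces are not removed because only line break characters are stripped in this sketch.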

Example

In this example, the Get PDF Region Text command obtains the text "Bill To, Lucas Lima" from a specific region of a PDF file.

defVar --name pdfFile --type Pdf
defVar --name obtainedText --type String
// Open the PDF file.
pdfOpen --file "fileForPDFCommands.pdf" pdfFile=value
// Set the region, page, and DPI values according to your PDF file.
pdfRegionText --region "158,258,136,86" --language "eng" --page 1 --dpix 110 --dpiy 110 --file ${pdfFile} obtainedText=value
// Display the obtained text in the console.
logMessage --message "Text obtained from the specified region: ${obtainedText}" --type "Info"