Get PDF Text by OCR

Gets text from a PDF file.

Command availability: IBM RPA SaaS and IBM RPA on premises

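Description

Gets text from a PDF file by using OCR. The parsing algorithm uses an anchor text as a reference and returns the text from a region positioned relative to where the matching anchor text was found.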

Script syntax

extractPdfText --page(Numeric) --language(String) [--searchregion(Rectangle)] --anchor(String) --anchorprovider(Nullable<OpticalCharacterRecognitionProvider>) [--googlevisionclientsecret(String)] --comparison(OcrStringComparison) --fuzzyalgorithm(Nullable<FuzzyStringComparisonAlgorithms>) --tolerance(Nullable<FuzzyStringComparisonTolerance>) --manualTolerance(Numeric) --segmentation(StringSegmentation) [--anchorhighcontrast(Boolean)] --targetregion(Rectangle) --targetprovider(Nullable<OpticalCharacterRecognitionProvider>) [--targethighcontrast(Boolean)] --file(Pdf) (Image)=anchor (Boolean)=success (Image)=image (String)=text (Rectangle)=bounds

Dependencies

IBM RPA Studio has a helper tool called Extract PDF Text that you can use to set the parameters of this command. You can find it on the Tools tab.

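Input parameters

The following table lists the input parameters available for this command. For each parameter, the table shows its name in IBM RPA Studio's Script mode and its equivalent label in Designer mode.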

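Designer mode label	Script mode name	Required	Accepted variable types	Description
Page	`page`	`Required`	`Number`	Number of the page to parse.
Language	`language`	`Required`	`Text`, `Culture`	Language to consider during parsing. For the available languages, see Supported languages.
Search Region	`searchregion`	`Optional`	`Rectangle`	Region in which the parsing algorithm searches for the anchor text.
Anchor	`anchor`	`Required`	`Text`	Text to use as the anchor.
Anchor OCR Provider	`anchorprovider`	`Required`	`OpticalCharacterRecognitionProvider`	OCR provider to use. See the `anchorprovider` parameter options.
API Parameters	`googlevisionclientsecret`	`Optional`	`Text`	Absolute path to the JSON file that contains the API parameters. For details about the JSON format, see the Google Cloud Vision™ documentation.
Comparison	`comparison`	`Required`	`OcrStringComparison`	Type of comparison to apply to the anchor text. See the `comparison` parameter options.
Fuzzy Algorithm	`fuzzyalgorithm`	`Only when comparison is ApproximatelyEquals`	`FuzzyStringComparisonAlgorithms`	Fuzzy algorithm to use when comparing strings. For details, see the `fuzzyalgorithm` parameter options.
Tolerance	`tolerance`	`Only when comparison is ApproximatelyEquals`	`FuzzyStringComparisonTolerance`	Tolerance level to use in the fuzzy algorithm. See the `tolerance` parameter options.
Tolerance Value	`manualTolerance`	`Only when tolerance is Manual`	`Number`	Tolerance percentage between 0 and 100, where 100 means an exact match.
Segmentation	`segmentation`	`Required`	`StringSegmentation`	Type of text to search for. See the `segmentation` parameter options.
Enhance Contrast for Anchor OCR	`anchorhighcontrast`	`Optional`	`Boolean`	Enable to enhance the contrast of the PDF when searching for the anchor.
Text Extraction Region	`targetregion`	`Required`	`Rectangle`	Region from which the text is parsed by OCR. This region is relative to the position where the anchor text was found.
Extraction OCR Provider	`targetprovider`	`Required`	`OpticalCharacterRecognitionProvider`	OCR provider to use when getting the text in the target region. For details, see the `targetprovider` parameter options.
Enhance Contrast for Text Extraction OCR	`targethighcontrast`	`Optional`	`Boolean`	Enable to enhance the contrast for the OCR that parses the text to get.
PDF	`file`	`Required`	`PDF`	PDF file to get the text from.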
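
The relationship between the anchor position and the text extraction region can be illustrated with a small Python sketch. It is not IBM RPA code. It assumes that a Rectangle value such as "-23,-37,716,404" is ordered as x, y, width, height and that the offsets are measured from the top-left corner of the anchor; neither detail is confirmed on this page. The names `Rect`, `parse_rect`, and `resolve_target` are hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: int
    y: int
    width: int
    height: int


def parse_rect(text: str) -> Rect:
    """Parse an 'x,y,width,height' string (the field order is an assumption)."""
    parts = [int(part.strip()) for part in text.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated integers, got {text!r}")
    return Rect(*parts)


def resolve_target(anchor_bounds: Rect, target_region: Rect) -> Rect:
    """Shift the relative target region by the anchor's top-left corner (assumed origin)."""
    return Rect(
        anchor_bounds.x + target_region.x,
        anchor_bounds.y + target_region.y,
        target_region.width,
        target_region.height,
    )


if __name__ == "__main__":
    # Hypothetical anchor position; the target region string comes from the example script.
    anchor = Rect(200, 300, 40, 18)
    print(resolve_target(anchor, parse_rect("-23,-37,716,404")))
```

Under these assumptions, negative offsets place the extraction region above and to the left of the anchor, which is why the same relative region still works if the anchor text shifts on the page.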

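`anchorprovider` parameter options

The following table lists the options available for the anchorprovider input parameter. For each option, the table shows its name in Script mode and its equivalent label in Designer mode.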

Designer mode label	Script mode name	Description
Google	`Google`	Google Tesseract™ OCR provider.
Google Cloud Vision	`GoogleVision`	Google Cloud Vision™ OCR provider.
Abbyy	`Abbyy`	Abbyy™ OCR provider.

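`comparison` parameter options

The following table lists the options available for the comparison input parameter. For each option, the table shows its name in Script mode and its equivalent label in Designer mode.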

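Designer mode label	Script mode name	Description
Approximately Equals	`ApproximatelyEquals`	Checks whether two strings are similar, taking the tolerance level into account.
Begins With	`Begins_With`	Checks whether a string begins with a substring.
Contains	`Contains`	Checks whether a string contains a substring.
Ends With	`Ends_With`	Checks whether a string ends with a substring.
Equal To	`Equal_To`	Checks whether a string is equal to another string.
Matches	`Matches`	Checks whether a string matches another string. This option uses a regular expression.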
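
As a rough illustration of what the non-fuzzy options in the table above mean, the following Python sketch maps each Script mode name to an ordinary string predicate. It is not how IBM RPA implements the comparisons. Case sensitivity and the exact regular expression semantics of `Matches` are not stated on this page, so the sketch assumes case-sensitive checks and a search anywhere in the text. `ApproximatelyEquals` is left out because it depends on the fuzzy algorithm and tolerance settings. The name `matches_anchor` is hypothetical.

```python
import re
from typing import Callable, Dict

# Script mode option name -> predicate(text, anchor). Case-sensitive by assumption.
COMPARISONS: Dict[str, Callable[[str, str], bool]] = {
    "Begins_With": lambda text, anchor: text.startswith(anchor),
    "Contains": lambda text, anchor: anchor in text,
    "Ends_With": lambda text, anchor: text.endswith(anchor),
    "Equal_To": lambda text, anchor: text == anchor,
    # Assumption: the anchor is a regular expression searched anywhere in the text.
    "Matches": lambda text, anchor: re.search(anchor, text) is not None,
}


def matches_anchor(text: str, anchor: str, comparison: str) -> bool:
    """Return True if the recognized text satisfies the chosen comparison."""
    try:
        predicate = COMPARISONS[comparison]
    except KeyError:
        raise ValueError(f"unsupported comparison: {comparison!r}") from None
    return predicate(text, anchor)


if __name__ == "__main__":
    print(matches_anchor("Bill To", "Bill", "Begins_With"))
    print(matches_anchor("Bill To", "Bill", "Equal_To"))
```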

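`fuzzyalgorithm` parameter options

The following table lists the options available for the fuzzyalgorithm input parameter. For each option, the table shows its name in Script mode and its equivalent label in Designer mode.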

Designer mode label	Script mode name
Dice Coefficient	`DiceCoefficient`
Hamming Distance	`HammingDistance`
Jaccard Distance	`JaccardDistance`
Jaro Distance	`JaroDistance`
Jaro Winkler Distance	`JaroWinklerDistance`
Levenshtein Distance	`LevenshteinDistance`
Longest Common Subsequence	`LongestCommonSubsequence`
Longest Common Substring	`LongestCommonSubstring`
Overlap Coefficient	`OverlapCoefficient`
Ratcliff Obershelp Similarity	`RatcliffObershelpSimilarity`
Sorensen Dice Distance	`SorensenDiceDistance`
Tanimoto Coefficient	`TanimotoCoefficient`
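
To give a feel for how one of these algorithms can yield a similarity percentage, here is a minimal Python sketch of the Levenshtein distance. This page does not say how IBM RPA converts a distance into a percentage, so the normalization by the longer string's length is an assumption, and the function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Similarity from 0 to 100, normalized by the longer string (an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein_distance(a, b) / longest)


if __name__ == "__main__":
    # "Bi11" stands in for a typical OCR misread of "Bill".
    print(similarity_percent("Bill", "Bi11"))
```

A percentage like this is the kind of value that a tolerance level could then be compared against.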

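`tolerance` parameter options

The following table lists the options available for the tolerance input parameter. For each option, the table shows its name in Script mode and its equivalent label in Designer mode.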

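Designer mode label	Script mode name	Description
Manual	`Manual`	Two strings are evaluated as approximately equal if they are at least X% equal. X is the value, between 0 and 100, that the user defines in manualTolerance, where 100 means that the strings are exactly equal.
Normal	`Normal`	Two strings are evaluated as approximately equal if they are at least 50% equal.
Strong	`Strong`	Two strings are evaluated as approximately equal if they are at least 75% equal.
Weak	`Weak`	Two strings are evaluated as approximately equal if they are at least 25% equal.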
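
The thresholds in the table above can be summarized in a short Python sketch. The percentages come from this page; the function names and the idea of passing in a precomputed similarity percentage are assumptions made for illustration, not part of IBM RPA.

```python
from typing import Optional

# Minimum similarity, in percent, for each preset tolerance level.
THRESHOLDS = {"Weak": 25.0, "Normal": 50.0, "Strong": 75.0}


def required_similarity(tolerance: str, manual_tolerance: Optional[float] = None) -> float:
    """Return the minimum similarity percentage for a tolerance option."""
    if tolerance == "Manual":
        if manual_tolerance is None or not 0 <= manual_tolerance <= 100:
            raise ValueError("Manual tolerance needs a value between 0 and 100")
        return float(manual_tolerance)
    try:
        return THRESHOLDS[tolerance]
    except KeyError:
        raise ValueError(f"unknown tolerance: {tolerance!r}") from None


def is_approximately_equal(
    similarity: float, tolerance: str, manual_tolerance: Optional[float] = None
) -> bool:
    """Two strings count as approximately equal if they are at least X% equal."""
    return similarity >= required_similarity(tolerance, manual_tolerance)


if __name__ == "__main__":
    print(is_approximately_equal(50.0, "Normal"))
    print(is_approximately_equal(50.0, "Strong"))
```

With `Manual` and a value of 100, only identical strings pass, which matches the note that 100 means an exact match.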

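`segmentation` parameter options

The following table lists the options available for the segmentation input parameter. For each option, the table shows its name in Script mode and its equivalent label in Designer mode.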

Designer mode label	Script mode name	Description
Phrase	`Phrase`	Searches for a phrase.
Word	`Word`	Searches for a word.

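`targetprovider` parameter options

The following table lists the options available for the targetprovider input parameter. For each option, the table shows its name in Script mode and its equivalent label in Designer mode.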

Designer mode label	Script mode name	Description
Google	`Google`	Google Tesseract™ OCR provider.
Google Cloud Vision	`GoogleVision`	Google Cloud Vision™ OCR provider.
Abbyy	`Abbyy`	Abbyy™ OCR provider.

Output parameters

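Designer mode label	Script mode name	Accepted variable types	Description
Anchor	`anchor`	`Image`	Image of the area in the source file page where the anchor text was found.
Success	`success`	`Boolean`	`True` if the text was found; otherwise `False`.
Image	`image`	`Image`	Image that contains the target text.
Text	`text`	`Text`	Target text returned from the text extraction region.
Bounds	`bounds`	`Rectangle`	Region of the PDF in which the text was found.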

Example

The following example shows the extractPdfText command in action.

defVar --name pdfFile --type Pdf
defVar --name getText --type String
// Enter a pdf file on this command.
pdfOpen --file "file.pdf" pdfFile=value
// Use the helper selection tool to use this command.
extractPdfText --page 1 --language en-US --searchregion "171,273,134,89" --anchor Bill --anchorprovider "Google" --comparison "Equal_To" --segmentation "Word" --targetregion "-23,-37,716,404" --targetprovider "Google" --file ${pdfFile} getText=text
logMessage --message "${getText}" --type "Info"
//Get the text from a PDF file and display it in the console.

Limitations

ABBYY® works differently from Google Cloud Vision™ and Google Tesseract™. It segments the text into components differently and returns those components in a different order.
