OCR
Overview
OCR (Optical Character Recognition) is the process of converting images of typed, handwritten, or printed text into machine-encoded text.
OCR is commonly used for text recognition and extraction from the following types of documents:
- Non-editable scanned PDF files.
- Photographs of documents.
- Scene photos such as advertising layouts, signboards, etc.
- Identification cards, passports, vehicle license plates, and other official plates.
- Invoices, bills, receipts, and other financial documents.
The following features support OCR:
- PDF to Word
- PDF to Excel
- PDF to PowerPoint (PPT)
- PDF to HTML
- PDF to Rich Text Format (RTF)
- PDF to Text (TXT)
- PDF to CSV
- Text extraction from PDF
- Table extraction from PDF
OCR languages supported by the ComPDFKit Conversion SDK:
Script / Notes | Language (Native) | Language (In English) |
---|---|---|
Latn; American | English | English |
Latn; Canadian | Français canadien | French |
Hans | 中文简体 | Chinese (Simplified) |
Hant | 中文繁體 | Chinese (Traditional) |
Jpan | 日本語 | Japanese |
Kore | 한국어 | Korean |
Latn | Deutsch | German |
Latn | Српски (латиница) | Serbian (Latin) |
Latn | Occitan, lenga d'òc, provençal | Occitan |
Latn | Dansk | Danish |
Latn | Italiano | Italian |
Latn; European | Español | Spanish |
Latn; European | Português (Portugal) | Portuguese |
Latn | Te reo Māori | Maori |
Latn | Bahasa Melayu | Malay |
Latn | Malti | Maltese |
Latn | Nederlands | Dutch |
Latn; Bokmål | Norsk | Norwegian |
Latn | Polski | Polish |
Latn | Română | Romanian |
Latn | Slovenčina | Slovak |
Latn | Slovenščina | Slovenian |
Latn | shqip | Albanian |
Latn | Svenska | Swedish |
Latn | Swahili | Swahili |
Latn | Wikang Tagalog | Tagalog |
Latn | Türkçe | Turkish |
Latn | oʻzbekcha | Uzbek |
Latn | Tiếng Việt | Vietnamese |
Latn | Afrikaans | Afrikaans |
Latn | Azərbaycan | Azerbaijani |
Latn | Bosanski | Bosnian |
Latn | Čeština | Czech |
Latn | Cymraeg | Welsh |
Latn | Eesti keel | Estonian |
Latn | Gaeilge | Irish |
Latn | Hrvatski | Croatian |
Latn | Magyar | Hungarian |
Latn | Bahasa Indonesia | Indonesian |
Latn | Íslenska | Icelandic |
Latn | Kurdî | Kurdish |
Latn | Lietuvių | Lithuanian |
Latn | Latviešu | Latvian |
Whether to include OCR background image
When OCR is enabled and the target conversion format is Word, PPT, RTF, or HTML, decide whether to set the IsContainOCRBgImage option. If the IsContainOCRBgImage option is selected, a large image is written to the target document as a background image, and text and tables are displayed on top of it. If the IsContainOCRBgImage option is not selected, the images on the PDF page are extracted and written into the target document.
Convert images to other document formats
The OCR function can also convert input images into Word, Excel, PPT, HTML, CSV, RTF, TXT, JSON, and other formats. This sample demonstrates how to use the ComPDFKit OCR function to convert an image file to a DOCX file.
// Supported input image formats: JPG, JPEG, PNG, BMP.
string inputFilePath = "***.jpg";
string outputFolderPath = "***";
string outputFileName = "***";
// Create a Word converter for the input image.
CPDFConverterWord converter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeWord, inputFilePath) as CPDFConverterWord;
// Enable OCR and choose the language that matches the text in the image.
CPDFConvertWordOptions wordOptions = new CPDFConvertWordOptions();
wordOptions.IsAllowOCR = true;
wordOptions.OCRLanguage = ComDocumentAIOCR.Language.ENGLISH;
wordOptions.LayoutOpts = LayoutOptions.RetainPageLayout;
// Build the list of pages to convert (all pages, numbered from 1).
int pageCount = converter.GetPagesCount();
int[] pageArray = new int[pageCount];
for (int i = 0; i < pageArray.Length; i++)
{
    pageArray[i] = i + 1;
}
// getPorgress is a progress callback that must be defined elsewhere in your code.
ConvertError error = ConvertError.ERR_UNKNOWN;
converter.Convert(outputFolderPath, ref outputFileName, wordOptions, pageArray, ref error, getPorgress);
Notice
- The quality of OCR results depends on the quality of the input image: the lower the image resolution, the poorer the results. As a rule of thumb, the more pixels each glyph contains, the better. Once the glyph bounding box is smaller than 20x20 pixels, OCR quality begins to decline rapidly. The ideal input is a grayscale image with a resolution of around 300 DPI.
- When performing OCR, set the OCR language so that it matches the language of the PDF document. This gives the best OCR conversion quality.
- When the OCR option is enabled, the IsContainImages option no longer takes effect. Images in the PDF are instead controlled by the IsContainOCRBgImage option.
- When converting images to other document formats, make sure the input image is in a supported format: JPG, JPEG, PNG, or BMP.
- The OCR function currently does not support operating systems earlier than Windows 10.
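The image-related notes above can be turned into a small pre-flight check that runs before the converter is called. The sketch below is illustrative only: OcrInputCheck, IsSupportedImage, and GlyphHeightInPixels are hypothetical names and are not part of the ComPDFKit API. The glyph estimate assumes that the font's em height (1 point = 1/72 inch) is a reasonable stand-in for the glyph bounding box, which slightly overestimates the size of most lowercase letters.

```csharp
using System;
using System.IO;

// Hypothetical helper for checking OCR input; not part of the ComPDFKit API.
public static class OcrInputCheck
{
    // Image extensions listed in the notes above as supported inputs.
    private static readonly string[] SupportedExtensions = { ".jpg", ".jpeg", ".png", ".bmp" };

    // Returns true when the file extension is one of the supported image formats.
    // The comparison ignores case, so "scan.JPG" is accepted.
    public static bool IsSupportedImage(string path)
    {
        string extension = Path.GetExtension(path);
        return Array.Exists(SupportedExtensions,
            e => string.Equals(e, extension, StringComparison.OrdinalIgnoreCase));
    }

    // Approximates the glyph height in pixels from the font size in points
    // and the scan resolution in DPI (assumption: em height ~ glyph box height).
    public static double GlyphHeightInPixels(double fontSizePt, double dpi)
    {
        return fontSizePt / 72.0 * dpi;
    }
}
```

Under this approximation, 10-point text scanned at 300 DPI is roughly 42 pixels tall, whereas the same text at 96 DPI is roughly 13 pixels tall, which falls below the 20-pixel threshold mentioned above.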
Integrate the OCR Library
- Add "DocumentAI_Windows_NetFramework.dll" from the "lib" folder to the project and reference it.
- Include the files "DocumentAI.dll", "onnxruntime.dll", and "paddle2onnx.dll" from the "x64" folder in the project, and set the Copy to Output Directory property of these dynamic libraries to Copy if newer.
- Set the option parameter options.IsAllowOCR to true.
Sample
This sample demonstrates how to use the ComPDFKit OCR function to convert a PDF to a DOCX file.
string inputFilePath = "***";
string outputFolderPath = "***";
string outputFileName = "***";
// Create a Word converter for the input PDF.
CPDFConverterWord converter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeWord, inputFilePath) as CPDFConverterWord;
// Enable OCR and choose the language that matches the document.
CPDFConvertWordOptions wordOptions = new CPDFConvertWordOptions();
wordOptions.IsAllowOCR = true;
wordOptions.OCRLanguage = ComDocumentAIOCR.Language.ENGLISH;
wordOptions.IsContainAnnotations = true;
// Per the Notice section, IsContainImages has no effect while OCR is enabled.
wordOptions.IsContainImages = true;
wordOptions.LayoutOpts = LayoutOptions.RetainPageLayout;
// Build the list of pages to convert (all pages, numbered from 1).
int pageCount = converter.GetPagesCount();
int[] pageArray = new int[pageCount];
for (int i = 0; i < pageArray.Length; i++)
{
    pageArray[i] = i + 1;
}
// getPorgress is a progress callback that must be defined elsewhere in your code.
ConvertError error = ConvertError.ERR_UNKNOWN;
converter.Convert(outputFolderPath, ref outputFileName, wordOptions, pageArray, ref error, getPorgress);