OCR

Overview

OCR (Optical Character Recognition) is the process of converting images of typed, handwritten, or printed text into machine-encoded text.

OCR is commonly used for text recognition and extraction from the following types of documents:

Non-editable scanned PDF files
Photographs of documents.
Scene photos such as advertising layouts, signboards, etc.
Identification cards, passports, vehicle license plates, and other official plates.
Invoices, bills, receipts, and other financial documents.

The following features support OCR:

PDF to Word
PDF to Excel
PDF to PPT
PDF to HTML
PDF to RTF
PDF to TXT
PDF to CSV
PDF to searchable PDF
Extract PDF to JSON
Extract PDF to Markdown

OCR Language Support of ComPDFKit Conversion SDK:

Script / Notes	Language (Native)	Language (In English)
Latn; American	English	English
Latn; Canadian	Français canadien	French
Hans/Hant	中文简体	Chinese (Simplified)
Hans/Hant	中文繁体	Chinese (Traditional)
Jpan	日本語	Japanese
Kore	한국어	Korean
Latn	Deutsch	German
Latn	Српски (латиница)	Serbian (latin)
Latn	Occitan, lenga d'òc, provençal	Occitan
Latn	Dansk	Danish
Latn	Italiano	Italian
Latn; European	Español	Spanish
Latn; European	Português (Portugal)	Portuguese
Latn	Te reo Māori	Maori
Latn	Bahasa Melayu	Malay
Latn	Malti	Maltese
Latn	Nederlands	Dutch
Latn; Bokmål	Norsk	Norwegian
Latn	Polski	Polish
Latn	Română	Romanian
Latn	Slovenčina	Slovak
Latn	Slovenščina	Slovenian
Latn	shqip	Albanian
Latn	Svenska	Swedish
Latn	Swahili	Swahili
Latn	Wikang Tagalog	Tagalog
Latn	Türkçe	Turkish
Latn	oʻzbekcha	Uzbek
Latn	Tiếng Việt	Vietnamese
Latn	Afrikaans	Afrikaans
Latn	Azərbaycan	Azerbaijani
Latn	Bosanski	Bosnian
Latn	Čeština	Czech
Latn	Cymraeg	Welsh
Latn	Eesti keel	Estonian
Latn	Gaeilge	Irish
Latn	Hrvatski	Croatian
Latn	Magyar	Hungarian
Latn	Bahasa Indonesia	Indonesian
Latn	Íslenska	Icelandic
Latn	Kurdî	Kurdish
Latn	Lietuvių	Lithuanian
Latn	Latviešu	Latvian

Convert images to other document formats

The OCR function also supports converting input images into Word, Excel, PPT, HTML, CSV, RTF, TXT, Json and other formats. This sample demonstrates how to use the ComPDFKit OCR function to convert image files to DOCX file.

// Set the OCR model path and language.
string modelPath = "***";
LibraryManager.SetDocumentAIModel(modelPath, OCRLanguage.e_English);

// Supports jpg, jpeg, png, bmp, tiff formats
string inputFilePath = "***";
string password = "***";
string outputFileName = "***";

WordOptions wordOptions = new WordOptions();
wordOptions.ContainImage = true;
wordOptions.ContainAnnotation = true;
wordOptions.EnableAiLayout = true;
// Enable OCR option.
wordOptions.EnableOCR = true;

ErrorCode error = CPDFConversion.StartPDFToWord(inputFilePath, password, outputFileName, wordOptions);

Notice

The quality of OCR results is related to the quality of the input image. If the input image resolution is lower, then the quality of the OCR results will also be affected. A good way is that the more pixels in the glyph, the better. If the glyph bounding box is smaller than 20x20 pixels, the OCR quality will begin to decline exponentially. The ideal image is a grayscale image with a resolution of around 300 DPI.
When performing OCR recognition, you need to pay attention to setting the OCR language and ensure that the selected OCR language is consistent with the language of the PDF document to obtain the best OCR conversion quality.
The OCR function currently does not support operating systems lower than Windows 10.

Integrate the Library of OCR

Include the files "DocumentAI.dll", "fastdeploy.dll", "yaml-cpp.dll", "onnxruntime.dll", "onnxruntime_providers_shared.dll", and "paddle2onnx.dll" from the "x64" folder into your project, and set their Copy to Output Directory property to Copy if newer.

Set the option parameter options.EnableOCR to true.

Sample

This Sample demonstrates how to use the ComPDFKit OCR function to convert a PDF to DOCX file.

string modelPath = "***";
LibraryManager.SetDocumentAIModel(modelPath, OCRLanguage.e_English);

string inputFilePath = "***";
string password = "***";
string outputFileName = "***";

WordOptions wordOptions = new WordOptions();
wordOptions.ContainImage = true;
wordOptions.ContainAnnotation = true;
wordOptions.EnableAiLayout = true;
// Enable OCR option.
wordOptions.EnableOCR = true;

ErrorCode error = CPDFConversion.StartPDFToWord(inputFilePath, password, outputFileName, wordOptions);

OCR ​

Overview ​

Convert images to other document formats ​

Notice ​

Integrate the Library of OCR ​

Sample ​

OCR

Overview

Convert images to other document formats

Notice

Integrate the Library of OCR

Sample