OCR

Overview

OCR (Optical Character Recognition) is the process of converting images of typed, handwritten, or printed text into machine-encoded text.

OCR is commonly used for text recognition and extraction from the following types of documents:

Non-editable scanned PDF files.
Photographs of documents.
Scene photos, such as advertising layouts and signboards.
Identification cards, passports, vehicle license plates, and other official plates.
Invoices, bills, receipts, and other financial documents.

The following features support OCR:

PDF to Word
PDF to Excel
PDF to PowerPoint (PPT)
PDF to HTML
PDF to Rich Text Format (RTF)
PDF to Text (TXT)
Text extraction from PDF
Table extraction from PDF

The ComPDFKit Conversion SDK supports OCR for the following languages:

Script / Notes	Language (Native)	Language (In English)
Latn; American	English	English
Latn; Canadian	Français canadien	French
Hans/Hant	中文简体	Chinese (Simplified)
Hans/Hant	中文繁体	Chinese (Traditional)
Jpan	日本語	Japanese
Kore	한국어	Korean
Latn	Deutsch	German
Latn	Српски (латиница)	Serbian (Latin)
Latn	Occitan, lenga d'òc, provençal	Occitan
Latn	Dansk	Danish
Latn	Italiano	Italian
Latn; European	Español	Spanish
Latn; European	Português (Portugal)	Portuguese
Latn	Te reo Māori	Maori
Latn	Bahasa Melayu	Malay
Latn	Malti	Maltese
Latn	Nederlands	Dutch
Latn; Bokmål	Norsk	Norwegian
Latn	Polski	Polish
Latn	Română	Romanian
Latn	Slovenčina	Slovak
Latn	Slovenščina	Slovenian
Latn	shqip	Albanian
Latn	Svenska	Swedish
Latn	Swahili	Swahili
Latn	Wikang Tagalog	Tagalog
Latn	Türkçe	Turkish
Latn	oʻzbekcha	Uzbek
Latn	Tiếng Việt	Vietnamese
Latn	Afrikaans	Afrikaans
Latn	Azərbaycan	Azerbaijani
Latn	Bosanski	Bosnian
Latn	Čeština	Czech
Latn	Cymraeg	Welsh
Latn	Eesti keel	Estonian
Latn	Gaeilge	Irish
Latn	Hrvatski	Croatian
Latn	Magyar	Hungarian
Latn	Bahasa Indonesia	Indonesian
Latn	Íslenska	Icelandic
Latn	Kurdî	Kurdish
Latn	Lietuvių	Lithuanian
Latn	Latviešu	Latvian

Whether to include OCR background image

When OCR is enabled and the target conversion format is Word, PPT, RTF, or HTML, decide whether to set the IsContainOCRBgImage option. If IsContainOCRBgImage is set, a large image is written to the target document as a background image, and the text and tables are displayed on top of it. If IsContainOCRBgImage is not set, the images on the PDF page are extracted and written into the target document instead.
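The rule above can be sketched as a small decision helper. This is a hypothetical model for illustration only: the class, enum, and method names are assumptions and are not part of the ComPDFKit API, and the format names are plain strings chosen for the sketch.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical model of the rule described above; not part of the ComPDFKit API. */
final class OcrBackgroundRule {
    /** How images end up in the converted document. */
    enum ImageOutput { BACKGROUND_IMAGE, EXTRACTED_IMAGES, NOT_COVERED }

    // Target formats for which the text says the option matters.
    private static final Set<String> AFFECTED_FORMATS = Set.of("WORD", "PPT", "RTF", "HTML");

    static ImageOutput resolve(boolean ocrEnabled, String targetFormat, boolean containOcrBgImage) {
        boolean affected = AFFECTED_FORMATS.contains(targetFormat.toUpperCase(Locale.ROOT));
        if (!ocrEnabled || !affected) {
            // The rule only describes OCR conversions to the four formats above.
            return ImageOutput.NOT_COVERED;
        }
        return containOcrBgImage ? ImageOutput.BACKGROUND_IMAGE : ImageOutput.EXTRACTED_IMAGES;
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, "Word", true));
        System.out.println(resolve(true, "Word", false));
        System.out.println(resolve(false, "Word", true));
    }
}
```

The third outcome, NOT_COVERED, is there so the sketch does not guess at behavior the text does not describe, such as conversions without OCR or conversions to Excel or TXT.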

Note

OCR quality depends on the quality of the input image: the lower the input resolution, the worse the OCR results. As a rule of thumb, the more pixels per glyph, the better. Once the glyph bounding box is smaller than 20x20 pixels, OCR quality begins to decline exponentially. The ideal input is a grayscale image with a resolution of around 300 DPI.
When performing OCR, set the OCR language to match the language of the PDF document to obtain the best OCR conversion quality.
When OCR is enabled, the IsContainImages option has no effect. Images in the PDF are controlled by the IsContainOCRBgImage option instead.
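To make the 20x20 pixel guideline concrete, the following sketch estimates how tall text of a given font size is at a given scan resolution, using the fact that one typographic point is 1/72 inch. The class and method names are hypothetical and not part of the SDK. The font size is the full em height, so the result is a rough upper bound on the glyph bounding box; actual glyphs, especially lowercase letters, are smaller.

```java
/** Hypothetical helper for estimating glyph size in pixels; not part of the ComPDFKit API. */
final class GlyphSizeEstimate {
    static final double POINTS_PER_INCH = 72.0; // 1 pt = 1/72 inch
    static final int MIN_GLYPH_PX = 20;         // threshold taken from the note above

    /** Approximate height in pixels of text set at fontSizePt when scanned at dpi. */
    static double glyphHeightPx(double fontSizePt, int dpi) {
        return fontSizePt / POINTS_PER_INCH * dpi;
    }

    /** True if the estimated height reaches the 20-pixel guideline. */
    static boolean meetsGuideline(double fontSizePt, int dpi) {
        return glyphHeightPx(fontSizePt, dpi) >= MIN_GLYPH_PX;
    }

    public static void main(String[] args) {
        int[] resolutions = {96, 150, 300};
        for (int dpi : resolutions) {
            System.out.println("10 pt at " + dpi + " DPI: "
                    + glyphHeightPx(10, dpi) + " px, meets guideline: "
                    + meetsGuideline(10, dpi));
        }
    }
}
```

Under this estimate, 10 pt body text scanned at 300 DPI is about 42 pixels tall, while the same text at 96 DPI is about 13 pixels tall and falls below the guideline.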

Sample

This sample demonstrates how to use the OCR feature of the ComPDFKit Conversion SDK to convert a PDF to a Word file.

java

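        // rootDir, input_file, output_file, password, cpdfConvertWord, and getPageCounts
        // are assumed to be defined elsewhere in the sample project.
        CPDFConvertWordOptions cpdfConvertWordOptions = new CPDFConvertWordOptions();
        cpdfConvertWordOptions.setContainAnnot(true);  // keep annotations
        cpdfConvertWordOptions.setAllowOcr(true);      // enable OCR
        cpdfConvertWordOptions.setContainOcrBg(true);  // write the OCR background image
        String inputPath = rootDir + input_file + "word.pdf";
        // Build the list of pages to convert from the document's page count.
        List<Integer> pageCounts = getPageCounts(cpdfConvertWord.getPageCount(inputPath, password));
        ConvertResult convert = cpdfConvertWord.convert(inputPath, rootDir + output_file, "", cpdfConvertWordOptions, pageCounts, password, page -> {
            // Callback invoked during conversion; left empty in this sample.
        });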
