
How to Recognize & Edit Scanned PDFs with ComPDFKit in C#

By ComPDFKit | Sun. 28 Apr. 2024
Content Editor, OCR, C#

PDFs are widely used across industries because they preserve document layout and open on virtually any system. However, standard PDF editors cannot easily modify scanned or image-based PDFs, because their pages contain images rather than selectable text. To edit such a document, first run OCR to convert it into an editable PDF, then modify the content in a PDF editor much as you would in Word. Alternatively, convert the editable PDF to Word format and edit it in Microsoft Word.

 

Standard PDF editors usually lack support for recognizing and editing image-based and scanned PDFs. For applications or systems that require this capability, ComPDFKit Conversion SDK offers an easy and robust solution. This post walks through the steps to integrate ComPDFKit Conversion SDK on Windows using C#.

 

OCR - Key for Recognizing Scanned PDF Documents

OCR (Optical Character Recognition) is a technology that enables the conversion of different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data. 

 

ComPDFKit's AI-based OCR extracts text from scanned PDFs in several stages: it preprocesses the document to enhance image quality, segments pages into text regions, detects text within those regions, recognizes characters through pattern analysis, and refines the result using language models and postprocessing. The recognized text is then output in a machine-readable format, enabling tasks such as document indexing and text analysis.

 

ComPDFKit OCR can be integrated on multiple platforms, including Windows, Mac, and Linux (C++, Java, Python, PHP). It supports nearly 50 languages, including English, French, Japanese, Korean, German, Latin, Chinese, Italian, Spanish, and many others.

 

ComPDFKit Conversion SDK empowers businesses and developers with robust features and dedicated technical support while reducing development costs, enhancing efficiency, and accelerating time-to-market.

 

How to Integrate Conversion SDK to Recognize and Edit Scanned PDFs Using C# in Windows

 

Step 1. Download Conversion SDK

You can download ComPDFKit Conversion SDK from NuGet or contact our sales team to obtain the latest package. You can then apply for a 30-day free license key to test the integration of Conversion SDK.

 

Step 2. Create a New Windows Project

1. Open Visual Studio 2017, choose File -> New -> Project..., and then select Visual C# -> Windows Desktop -> Console App (.NET Framework).

 

2. Choose the options for your new project. Make sure to select .NET Framework 4.6.1 as the target framework.

 

3. Choose a location for the project, then click OK.

 

Step 3. Add ComPDFKit Conversion SDK Package

1. Copy all files in the "lib" folder to the project folder.

 

2. Add the ComPDFKit Conversion SDK dynamic library to References. You must add this reference before you can use the ComPDFKit Conversion SDK APIs in the project.

 

In Solution Explorer, right-click the project and click Add -> Reference…


 

In the Add Reference dialog, click the Browse tab, navigate to the project folder, select the "ComPDFKit_Conversion.dll" dynamic library, and then click OK.

 

3. Add the ComPDFKit Conversion SDK library to the project. Add the "x64" and "x86" folders to the project, and set the Copy to Output Directory property of "CPDFConverterNative.dll" and "opencv_world420.dll" to Copy if newer. Otherwise, you must manually copy these files to the same folder as the executable before running the project.

 

4. Copy the "resource" folder to the project folder, and set the Copy to Output Directory property of all files in the "resource" folder to Copy if newer. Otherwise, you must manually copy the folder to the same location as the executable before running the project.
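
A missing native file usually only surfaces as a load error at run time, so it can help to check for the files at startup. The following is a minimal sketch using only standard .NET APIs. It assumes the "x64" and "x86" folders end up next to the executable as described above; the class and method names are made up for this example and are not part of the SDK.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class NativeFileCheck
{
    // Returns the names from fileNames that do not exist in the given directory.
    public static List<string> FindMissing(string directory, IEnumerable<string> fileNames)
    {
        var missing = new List<string>();
        foreach (string name in fileNames)
        {
            if (!File.Exists(Path.Combine(directory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }

    // Assumption: the "x64" and "x86" folders sit next to the executable.
    public static string NativeDirectory()
    {
        string arch = Environment.Is64BitProcess ? "x64" : "x86";
        return Path.Combine(AppDomain.CurrentDomain.BaseDirectory, arch);
    }

    // Writes one line to standard error for each Conversion SDK file that is missing.
    public static void ReportMissingConversionFiles()
    {
        string directory = NativeDirectory();
        string[] required = { "CPDFConverterNative.dll", "opencv_world420.dll" };
        foreach (string name in FindMissing(directory, required))
        {
            Console.Error.WriteLine("Missing: " + Path.Combine(directory, name));
        }
    }
}
```

Calling NativeFileCheck.ReportMissingConversionFiles() at the top of Main, before any SDK call, turns a forgotten Copy to Output Directory setting into a readable message.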

 

Step 4. Apply Your License Key

You must initialize ComPDFKit Conversion SDK with a license before calling any other API. You can contact the ComPDFKit team to get a free trial license.

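// Replace the placeholders with your resource path, library path, and license key.
string resPath = "***";
string libPath = "***";
string license = "***";

// Initialize the library and its resources, then verify the license,
// before calling any other Conversion SDK API.
CPDFConverter.InitLibrary(libPath);
CPDFConverter.InitResource(resPath);
CPDFConverter.LicenseVerify(license);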
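
Hard-coding a license key in source makes it easy to commit by accident. As one possible alternative, the sketch below reads the key from an environment variable using only standard .NET APIs. The variable name COMPDFKIT_LICENSE and the LicenseSettings class are hypothetical names chosen for this example; they are not defined by the SDK.

```csharp
using System;

static class LicenseSettings
{
    // Hypothetical variable name chosen for this example; not defined by the SDK.
    public const string LicenseVariable = "COMPDFKIT_LICENSE";

    // Reads the key from the environment and fails fast when it is absent.
    public static string ReadLicense()
    {
        string license = Environment.GetEnvironmentVariable(LicenseVariable);
        if (string.IsNullOrWhiteSpace(license))
        {
            throw new InvalidOperationException(
                "Set the " + LicenseVariable + " environment variable to your license key.");
        }
        return license.Trim();
    }
}
```

The returned string can then be passed to CPDFConverter.LicenseVerify in place of the hard-coded placeholder.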

 

Step 5. Integrate the Library of OCR

1. Add "DocumentAI_Windows_NetFramework.dll" from the "lib" folder to the project, and add a reference to it.

 

2. Add the files "DocumentAI.dll", "onnxruntime.dll", and "paddle2onnx.dll" from the "x64" folder to the project, and set the Copy to Output Directory property of these dynamic libraries to Copy if newer.

 

3. Set the IsAllowOCR property of the conversion options to true, for example options.IsAllowOCR = true.

 

Step 6. Convert PDF to Searchable & Editable PDF

After integrating the OCR library, use the following code example to convert scanned PDFs into editable, searchable PDFs. The example sets the OCR language to English; change OCRLanguage to match the language of your documents.

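string inputFilePath = "***";
string outputFolderPath = "***";
string outputFileName = "***";

// Create a searchable-PDF converter for the input file.
// The "as" cast yields null if the converter could not be created as this type.
CPDFConverterSearchablePDF converter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypePDFSearchable, inputFilePath) as CPDFConverterSearchablePDF;
if (converter == null)
{
    throw new InvalidOperationException("Could not create a searchable PDF converter for " + inputFilePath);
}

// Choose the OCR language that matches the document.
CPDFConvertPDFSearchableOptions searchableOptions = new CPDFConvertPDFSearchableOptions();
searchableOptions.OCRLanguage = ComDocumentAIOCR.Language.ENGLISH;

// Convert every page; page numbers start at 1.
int pageCount = converter.GetPagesCount();
int[] pageArray = new int[pageCount];
for (int i = 0; i < pageArray.Length; i++)
{
    pageArray[i] = i + 1;
}

// getPorgress is a progress callback that you define elsewhere in your code.
// After Convert returns, check "error" to see whether the conversion succeeded.
ConvertError error = ConvertError.ERR_UNKNOWN;
converter.Convert(outputFolderPath, ref outputFileName, searchableOptions, pageArray, ref error, getPorgress);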
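
The loop above fills the page array with every page number starting at 1. If you build such arrays in several places, a small helper keeps the logic in one spot. The sketch below uses only standard .NET APIs; the PageSelection class is a hypothetical name, and it assumes that Convert accepts a subset of pages in the same 1-based numbering, which you should confirm against the SDK documentation.

```csharp
using System;
using System.Linq;

static class PageSelection
{
    // All pages, numbered from 1, matching the loop in the example above.
    public static int[] AllPages(int pageCount)
    {
        if (pageCount < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(pageCount));
        }
        return Enumerable.Range(1, pageCount).ToArray();
    }

    // An inclusive 1-based range, clamped to the document length.
    // Returns an empty array when the range lies outside the document.
    public static int[] PageRange(int firstPage, int lastPage, int pageCount)
    {
        int first = Math.Max(1, firstPage);
        int last = Math.Min(pageCount, lastPage);
        if (last < first)
        {
            return new int[0];
        }
        return Enumerable.Range(first, last - first + 1).ToArray();
    }
}
```

For example, PageSelection.AllPages(converter.GetPagesCount()) produces the same array as the loop, and PageSelection.PageRange(2, 5, pageCount) selects pages 2 through 5 of a longer document.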

 

Step 7. Integrate PDF SDK to Edit Searchable PDF

You have now used OCR to recognize an image-based PDF and convert it to an editable PDF. To edit the document content further, integrate our PDF SDK for Windows using C#. Our documentation explains in detail how to add Content Editor to your applications.

 

Final Words

The AI-powered OCR functionality provided by ComPDFKit is the key to recognizing and editing scanned PDF documents. With ComPDFKit Conversion SDK and PDF SDK, developers can integrate OCR and Content Editor into existing applications so that users can extract text from scanned PDFs and convert them to editable PDF files, making documents easier to search and edit. This streamlines document management and gives users a better experience.


If you are interested in using ComPDFKit to recognize and edit scanned PDFs, we recommend trying our free online tools to see how it performs and whether it meets your needs. You are also welcome to contact us to apply for a free trial license to test in your projects.