How to Extract Text from PDFs in C# (Code Example Tutorial)

By ComPDFKit | Wed. 27 Mar. 2024
Data Extraction | Conversion SDK | C#

PDFs are widely used across industries to store documents such as invoices, business reports, research papers, contracts, and e-books. To process the text in these documents automatically, the first step is to extract it from the PDF.

In this article, you'll learn how to extract text from an entire PDF, and from a specific page of a PDF, in C# with the ComPDFKit PDF library.

Steps to Extract Text from PDF using C#:

1. Download Text Extraction C# PDF Library

2. Create a New Project in Visual Studio

3. Install the ComPDFKit C# PDF Library 

4. Apply the License Key

5. Extract Text using ComPDFKit

1. Download Text Extraction C# PDF Library

To extract text from PDF files, we'll use the ComPDFKit C# PDF library. It is a toolkit for creating, viewing, annotating, editing, converting, and signing PDF documents, and it can also extract text from PDF files. You can get the SDK by contacting our sales team.

ComPDFKit is a feature-rich PDF library that developers can use to build applications and systems. It offers:

    • Multiple Platforms: ComPDFKit supports Web, Windows, Mac, Android, iOS, and Linux.

    • Various Frameworks: Support is not limited to web frameworks. Developers can use .NET frameworks and cross-platform frameworks such as React Native, Flutter, .NET Core, UWP, React, Vue, and more.

    • Web Integrations: ComPDFKit integrates with web systems such as SharePoint, Salesforce, Microsoft Teams, and Microsoft OneDrive, so it fits into existing collaboration workflows.

2. Create a New Project in Visual Studio

Open Visual Studio. In this article, we will build a console application (.NET Framework) for Windows. The steps are shown below:

Choose File -> New -> Project..., and then select Visual C# -> Windows Desktop -> Console App (.NET Framework).

[Figure: Create a new project in Visual Studio]

Configure your new project as follows:

    • Specify a name and location for your project.

    • Choose .NET Framework 4.6.1 as the target framework.

    • Click the "OK" button to create your console application project.

[Figure: Configure new project in Visual Studio]

Next, we can add the ComPDFKit library to test the code.

3. Install the ComPDFKit C# PDF Library 

Copy all files in the "lib" folder to the project folder, and then add the ComPDFKit Conversion SDK dynamic library to References. You must add this reference before you can use the ComPDFKit Conversion SDK APIs in the project.

In Solution Explorer, right-click the project and click Add -> Reference.

In the Add Reference dialog, click the Browse tab, navigate to the project folder, select the "ComPDFKit_Conversion.dll" dynamic library, and then click the OK button.

Add the "x64" and "x86" folders to the project. Set the Copy to Output Directory property of "CPDFConverterNative.dll" and "opencv_world420.dll" to Copy if newer. Otherwise, you must copy these files manually to the folder that contains the executable file before running the project.

Copy the "resource" folder to the project folder. Set the Copy to Output Directory property of all files in the "resource" folder to Copy if newer. Otherwise, you must copy them manually to the folder that contains the executable file before running the project.
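
If a native DLL or the "resource" folder does not reach the output directory, the failure may only show up at run time. As a rough sketch, the helper below reports which expected files or folders are missing next to the executable. The class name DeploymentCheck and the folder layout in the usage comment are assumptions made for illustration, not part of the ComPDFKit SDK; adjust the paths to match your project.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the ComPDFKit SDK.
public static class DeploymentCheck
{
    // Returns every relative path that exists neither as a file
    // nor as a folder under baseDir.
    public static List<string> FindMissing(string baseDir, IEnumerable<string> relativePaths)
    {
        var missing = new List<string>();
        foreach (string relativePath in relativePaths)
        {
            string fullPath = Path.Combine(baseDir, relativePath);
            if (!File.Exists(fullPath) && !Directory.Exists(fullPath))
            {
                missing.Add(relativePath);
            }
        }
        return missing;
    }
}

// Usage (assumed layout: the "x64"/"x86" folders and the "resource" folder
// sit beside the executable):
//
//   string arch = Environment.Is64BitProcess ? "x64" : "x86";
//   var missing = DeploymentCheck.FindMissing(
//       AppDomain.CurrentDomain.BaseDirectory,
//       new[]
//       {
//           Path.Combine(arch, "CPDFConverterNative.dll"),
//           Path.Combine(arch, "opencv_world420.dll"),
//           "resource"
//       });
//   foreach (string path in missing) Console.WriteLine("Missing: " + path);
```

Running a check like this at startup gives a clearer message than a failed native library load.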

 

4. Apply the License Key

Before calling the PDF extraction API, you must initialize the ComPDFKit library with a license. ComPDFKit is available under a commercial license, and a free trial license is also offered. You can obtain the free trial license by contacting our sales team.

Here's how to apply the license key in your code:

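// Replace each "***" with your own resource path, library path, and license key.
string resPath = "***";
string libPath = "***";
string license = "***";

// Initialize the library and its resources, then verify the license.
CPDFConverter.InitLibrary(libPath);
CPDFConverter.InitResource(resPath);
CPDFConverter.LicenseVerify(license);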

 

5. Extract Text using ComPDFKit C# PDF Library

ComPDFKit enables text extraction from PDF files and conversion of PDF pages into PDF objects. You can extract all text from an entire PDF or only the text from a specific page. The examples below show both.

 

Extract Text from an Entire PDF in C#

Set the input file path, the output folder path, and the output file name, then use CPDFConverterJsonText to extract the text. You can also set extraction options, such as whether OCR is allowed.

The following is the sample code snippet to extract text from an entire PDF document in C#:

// Replace each "***" with your own paths and file name.
string inputFilePath = "***";
string outputFolderPath = "***";
string outputFileName = "***";

// Create a JSON text converter for the input PDF.
CPDFConverterJsonText jsonTextConverter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonText, inputFilePath) as CPDFConverterJsonText;

// Extraction options: OCR is turned off here.
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false;

// Run the extraction; the result status is returned through "error".
ConvertError error = ConvertError.ERR_UNKNOWN;
jsonTextConverter.Convert(outputFolderPath, ref outputFileName, jsonOptions, ref error);

 

Note that disabling OCR (Optical Character Recognition) can make it impossible to extract text from tables within images.
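
The snippet above takes an output folder and an output file name as plain strings. One possible way to prepare them, sketched below, is to create the output folder if it is missing and derive the file name from the input PDF. The class OutputTarget is a hypothetical helper, not part of the ComPDFKit SDK, and it is an assumption here that a file name without an extension is acceptable; check the SDK documentation for what Convert expects.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the ComPDFKit SDK.
public static class OutputTarget
{
    // Creates outputFolderPath if it does not exist and returns the input
    // file's name without its extension, for use as the output file name.
    public static string Prepare(string inputFilePath, string outputFolderPath)
    {
        if (!File.Exists(inputFilePath))
        {
            throw new FileNotFoundException("Input PDF not found.", inputFilePath);
        }

        // CreateDirectory does nothing if the folder already exists.
        Directory.CreateDirectory(outputFolderPath);

        return Path.GetFileNameWithoutExtension(inputFilePath);
    }
}

// Usage:
//   string outputFileName = OutputTarget.Prepare(inputFilePath, outputFolderPath);
```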

 

When you use the CPDFConverterJsonText class to read the content streams of a PDF document, the text often comes back fragmented. For example, the sentence "This is a sample sentence." may be returned as separate text objects such as "This" and "is a sample sentence.". This happens because text objects in a PDF are not always organized cleanly into words, sentences, or paragraphs. When OCR is disabled, the CPDFConverterJsonText class returns text objects exactly as they are defined in the PDF page content streams.
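
Fragmented text objects can often be stitched back together after extraction. The sketch below groups fragments that share a baseline into lines and joins them left to right, inserting a space where there is a visible gap. The TextFragment type, its coordinate fields, and the tolerance values are all hypothetical: this article does not show the JSON output schema, so you would first need to map the extracted objects onto a shape like this. It also assumes Y grows from the top of the page downward.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical shape for one extracted text object; not a ComPDFKit type.
public sealed class TextFragment
{
    public string Text { get; }
    public double X { get; }      // left edge of the fragment
    public double Y { get; }      // baseline position (assumed to grow downward)
    public double Width { get; }  // horizontal extent of the fragment

    public TextFragment(string text, double x, double y, double width)
    {
        Text = text;
        X = x;
        Y = y;
        Width = width;
    }
}

public static class FragmentMerger
{
    // Groups fragments whose baselines lie within lineTolerance of each other,
    // then joins each group left to right. A space is inserted when the
    // horizontal gap between neighbors exceeds gapTolerance.
    public static List<string> MergeIntoLines(
        IEnumerable<TextFragment> fragments,
        double lineTolerance = 2.0,
        double gapTolerance = 1.0)
    {
        var lines = new List<List<TextFragment>>();
        foreach (TextFragment fragment in fragments.OrderBy(item => item.Y))
        {
            List<TextFragment> current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null && Math.Abs(fragment.Y - current[0].Y) <= lineTolerance)
            {
                current.Add(fragment);
            }
            else
            {
                lines.Add(new List<TextFragment> { fragment });
            }
        }

        var result = new List<string>();
        foreach (List<TextFragment> line in lines)
        {
            var builder = new StringBuilder();
            TextFragment previous = null;
            foreach (TextFragment fragment in line.OrderBy(item => item.X))
            {
                if (string.IsNullOrEmpty(fragment.Text))
                {
                    continue; // nothing to append
                }
                if (previous != null)
                {
                    double gap = fragment.X - (previous.X + previous.Width);
                    bool alreadySpaced = builder[builder.Length - 1] == ' '
                        || fragment.Text[0] == ' ';
                    if (gap > gapTolerance && !alreadySpaced)
                    {
                        builder.Append(' ');
                    }
                }
                builder.Append(fragment.Text);
                previous = fragment;
            }
            result.Add(builder.ToString());
        }
        return result;
    }
}
```

With the example from the text, a fragment "This" followed on the same baseline by "is a sample sentence." a few units to its right would be merged into one line. Real documents may need more care, for example with multi-column layouts or rotated text.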

 

Extract Text from a Specific Page in PDF in C#

We also support extracting text from specific pages or page ranges in a PDF file. The following is the sample code snippet to extract text from a page in a PDF file using C#:

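// Replace each "***" with your own paths and file name.
string inputFilePath = "***";
string outputFolderPath = "***";
string outputFileName = "***";

// Create a JSON text converter for the input PDF.
CPDFConverterJsonText jsonTextConverter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonText, inputFilePath) as CPDFConverterJsonText;

CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false;
int pageCount = jsonTextConverter.GetPagesCount();

// pageArray holds the page numbers to extract, counted from 1.
// This loop lists every page; to extract only certain pages,
// fill the array with just those page numbers instead.
int[] pageArray = new int[pageCount];
for (int i = 0; i < pageArray.Length; i++)
{
  pageArray[i] = i + 1;
}

// Run the extraction for the selected pages.
ConvertError error = ConvertError.ERR_UNKNOWN;
jsonTextConverter.Convert(outputFolderPath, ref outputFileName, jsonOptions, pageArray, ref error);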
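
The sample above fills the page array with every page number. To target a single page or a page range, you could build the array with a small helper like the one sketched below. PageSelection is a hypothetical class, not part of the ComPDFKit SDK; it produces page numbers counted from 1, matching the convention used in the sample.

```csharp
using System;

// Hypothetical helper, not part of the ComPDFKit SDK.
public static class PageSelection
{
    // Returns the page numbers first..last (inclusive), counted from 1.
    // Throws if the range does not fit inside a document of pageCount pages.
    public static int[] Range(int first, int last, int pageCount)
    {
        if (pageCount < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(pageCount), "The document has no pages.");
        }
        if (first < 1 || last > pageCount || first > last)
        {
            throw new ArgumentOutOfRangeException(
                nameof(first),
                "Expected 1 <= first <= last <= pageCount.");
        }

        var pages = new int[last - first + 1];
        for (int i = 0; i < pages.Length; i++)
        {
            pages[i] = first + i;
        }
        return pages;
    }
}

// Usage:
//   int[] pageArray = PageSelection.Range(3, 3, pageCount);  // page 3 only
//   int[] pageArray = PageSelection.Range(2, 5, pageCount);  // pages 2 to 5
```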

 

Conclusion

In this article, you've learned how to extract text from PDF files in C# with step-by-step instructions and code samples. Whether you need the text from an entire PDF document or from a specific page, you can get it with the ComPDFKit C# PDF library.

To explore further, refer to the documentation for the ComPDFKit C# PDF library, which covers how to extract data, tables, and images from PDF files. If you have any questions or need assistance, feel free to contact our free technical support team.

 
