PDFs are widely used across industries to store documents such as invoices, business reports, research papers, contracts, and e-books. To process the text in these documents automatically, the first step is to extract it from the PDF.
In this article, you'll learn how to extract text from an entire PDF, and from a specific page of a PDF, in C# with the ComPDFKit PDF library.
Steps to Extract Text from PDF using C#:
1. Download Text Extraction C# PDF Library
2. Create a New Project in Visual Studio
3. Install the ComPDFKit C# PDF Library
4. Apply the License Key
5. Extract Text using ComPDFKit C# PDF Library
1. Download Text Extraction C# PDF Library
To extract text from PDF files, we'll utilize the ComPDFKit C# PDF library. It serves as a versatile toolkit encompassing functionalities for creating, viewing, annotating, editing, converting, and signing PDF documents. Additionally, it offers the capability to extract text from PDF files. You can easily access the SDK by contacting our sales team.
ComPDFKit stands out as a powerful and feature-rich PDF library, providing comprehensive solutions for developers to build applications and systems.
• Multiple Platforms: ComPDFKit supports a wide range of platforms, including Web, Windows, Mac, Android, iOS, and Linux, ensuring flexibility and accessibility across different environments.
• Various Frameworks: It provides support for multiple frameworks, not limited to web frameworks. Developers can leverage .NET frameworks and cross-platform frameworks such as React Native, Flutter, .NET Core, UWP, React, Vue, and more, expanding possibilities for development.
• Web Integrations: ComPDFKit seamlessly integrates with renowned web systems like Sharepoint, Salesforce, Microsoft Teams, and Microsoft OneDrive, enhancing collaboration and workflow efficiency within existing ecosystems.
2. Create a New Project in Visual Studio
Open Visual Studio. In this article, we will build a console application (.NET Framework) for Windows, as follows:
Choose File -> New -> Project..., and then select Visual C# -> Windows Desktop -> Console App (.NET Framework).
Create a new project in Visual Studio
Configure your new project as below:
• Specify a name and location for your project.
• Choose .NET Framework 4.6.1 as the target framework.
• Click on the "OK" button to create your console application project.
Configure new project in Visual Studio
Next, we can add the ComPDFKit library to test the code.
3. Install the ComPDFKit C# PDF Library
Copy all files in the "lib" folder to the project folder, and then add the ComPDFKit Conversion SDK dynamic library to References. To use the ComPDFKit Conversion SDK APIs in the project, you must first add this reference.
In Solution Explorer, right-click the project and click Add -> Reference…
In the Add Reference dialog, click the Browse tab, navigate to the project folder, select the "ComPDFKit_Conversion.dll" dynamic library, and then click the OK button.
Add ComPDFKit Conversion SDK library to the project
Add the "x64" and "x86" folders to the project. Make sure to set the Copy to Output Directory property of "CPDFConverterNative.dll" and "opencv_world420.dll" to Copy if newer. Otherwise, you must manually copy these files to the same folder as the executable file before running the project.
Copy the "resource" folder to the project folder. Make sure to set the Copy to Output Directory property of all files in the "resource" folder to Copy if newer. Otherwise, you must manually copy them to the same folder as the executable file before running the project.
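If the native libraries are missing at run time, the cause is often that they were never copied next to the executable. The following is a minimal sketch, not part of the ComPDFKit SDK, that reports which of the two DLLs named above are absent from a folder. The class name NativeFileCheck and the method FindMissing are hypothetical, and which folder to check (the output directory itself or its "x64"/"x86" subfolder) is an assumption that depends on your project layout.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the ComPDFKit SDK.
public static class NativeFileCheck
{
    // File names taken from the setup step above.
    private static readonly string[] RequiredFiles =
    {
        "CPDFConverterNative.dll",
        "opencv_world420.dll"
    };

    // Returns the names of the required files that are not present in the given folder.
    public static List<string> FindMissing(string folder)
    {
        var missing = new List<string>();
        foreach (string name in RequiredFiles)
        {
            if (!File.Exists(Path.Combine(folder, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

For example, calling NativeFileCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory) at startup and printing the result gives an early hint about a missing file before any SDK call is made.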
4. Apply the License Key
Before initiating calls to the PDF extraction API, it's essential to initialize the ComPDFKit library with a license. ComPDFKit is available under a commercial license, but it also offers a free trial license. You can obtain the free trial license by reaching out to our sales team.
Here's how you can apply the license key in your code:
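// Placeholders: replace "***" with your resource path, library path, and license key.
string resPath = "***";
string libPath = "***";
string license = "***";
// Initialize the library and its resources, then verify the license.
CPDFConverter.InitLibrary(libPath);
CPDFConverter.InitResource(resPath);
CPDFConverter.LicenseVerify(license);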
5. Extract Text using ComPDFKit C# PDF Library
ComPDFKit enables text extraction from PDF files and conversion of PDF pages into PDF objects. This means you have the flexibility to extract all text from an entire PDF or selectively extract text from a specific page. Below are detailed examples of how to achieve this functionality.
Extract Text from an Entire PDF in C#
Set the input file path, the output folder path, and the output file name, then use the CPDFConverterJsonText class to extract text. You can also set extraction options, such as whether OCR is enabled.
The following is the sample code snippet to extract text from an entire PDF document in C#:
string inputFilePath = "***";
string outputFolderPath = "***";
string outputFileName = "***";
// Create a JSON text converter for the input PDF.
CPDFConverterJsonText jsonTextConverter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonText, inputFilePath) as CPDFConverterJsonText;
// Set the extraction options; OCR is disabled here.
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false;
// Run the extraction; the error code is passed back through "error".
ConvertError error = ConvertError.ERR_UNKNOWN;
jsonTextConverter.Convert(outputFolderPath, ref outputFileName, jsonOptions, ref error);
Note that disabling OCR (Optical Character Recognition) can make it impossible to extract text from tables within images.
When we use the CPDFConverterJsonText class to access the content streams of a PDF document, we often get fragmented data. For example, suppose we are extracting the sentence "This is a sample sentence." from a PDF document. We may retrieve it as separate pieces such as "This" and "is a sample sentence.". This occurs because text objects in a PDF are not always cleanly organized into words, sentences, or paragraphs. When OCR is disabled, the CPDFConverterJsonText class returns text objects exactly as they are defined in the PDF page content streams.
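One simple way to deal with such fragments is to rejoin them after extraction. The following is a minimal sketch that assumes you have already read the fragments, in reading order, into a list of strings. The class FragmentJoiner is hypothetical and not part of the ComPDFKit SDK, and how you obtain the fragments from the extracted output is left out because it depends on the output format.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the ComPDFKit SDK.
public static class FragmentJoiner
{
    // Joins text fragments in the order given, separated by a single space.
    // Empty or whitespace-only fragments are skipped.
    public static string Join(IEnumerable<string> fragments)
    {
        IEnumerable<string> parts = fragments
            .Where(f => !string.IsNullOrWhiteSpace(f))
            .Select(f => f.Trim());
        return string.Join(" ", parts);
    }
}
```

With the example above, FragmentJoiner.Join(new[] { "This", "is a sample sentence." }) returns "This is a sample sentence.". This approach inserts a space between every pair of fragments, so it would split a word that the PDF stores across two text objects; handling that case would need position information for each fragment.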
Extract Text from a Specific Page in PDF in C#
We also support extracting text from specific pages or page ranges in a PDF file. The following is the sample code snippet to extract text from a page in a PDF file using C#:
string inputFilePath = "***";
string outputFolderPath = "***";
string outputFileName = "***";
// Create a JSON text converter for the input PDF.
CPDFConverterJsonText jsonTextConverter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonText, inputFilePath) as CPDFConverterJsonText;
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false;
// List the pages to extract. This sample lists every page, numbered from 1;
// to extract specific pages, fill the array with only those page numbers.
int pageCount = jsonTextConverter.GetPagesCount();
int[] pageArray = new int[pageCount];
for (int i = 0; i < pageArray.Length; i++)
{
    pageArray[i] = i + 1;
}
// Run the extraction for the listed pages.
ConvertError error = ConvertError.ERR_UNKNOWN;
jsonTextConverter.Convert(outputFolderPath, ref outputFileName, jsonOptions, pageArray, ref error);
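The sample above fills pageArray with every page in the document. To target a page range instead, you could build the array with a small helper such as the following sketch. The class PageSelection is hypothetical and not part of the ComPDFKit SDK, and the 1-based page numbering is an assumption inferred from the sample, which numbers pages starting at 1.

```csharp
using System;

// Hypothetical helper, not part of the ComPDFKit SDK.
public static class PageSelection
{
    // Builds an array of page numbers from firstPage to lastPage, inclusive.
    // Page numbers are assumed to start at 1, as in the sample above.
    public static int[] Range(int firstPage, int lastPage, int pageCount)
    {
        if (firstPage < 1 || lastPage < firstPage || lastPage > pageCount)
        {
            throw new ArgumentException("Page range must satisfy 1 <= firstPage <= lastPage <= pageCount.");
        }

        int[] pages = new int[lastPage - firstPage + 1];
        for (int i = 0; i < pages.Length; i++)
        {
            pages[i] = firstPage + i;
        }
        return pages;
    }
}
```

For a single page, pass the same number twice: PageSelection.Range(3, 3, pageCount) yields an array containing only page 3, which can then be passed as pageArray.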
Conclusion
In this article, you've learned how to extract text from PDF files in C# with step-by-step instructions and code samples. Whether you need to extract text from an entire PDF document or from a specific page, you can do so using the ComPDFKit C# PDF library.
For further exploration, you can refer to the documentation to discover more about the ComPDFKit C# PDF library, including how to extract data, tables, and images from PDF files. If you have any questions or need assistance, feel free to contact our free technical support team.