In this era of information explosion, vast amounts of data are used to train large language models, populate databases, and power other systems that filter out valuable information. PDF documents, one of the primary data sources, hold a wealth of such information. For developers, extracting text from PDFs is the first step toward effective data extraction.
If you are wondering how to extract text from PDFs in C#, iTextSharp is a well-established option. In this guide, we'll walk through using iTextSharp for PDF text extraction in C#, from installation and project setup to code samples. We'll also introduce another robust C# library, ComPDFKit, and compare the two to help you make an informed decision.
What Is iTextSharp Library?
iTextSharp is a C# library for creating and manipulating PDF documents on the .NET Framework. It lets developers dynamically generate, manipulate, and extract content from PDF files. The iText product family also includes pdf2Data, a data extraction solution that recognizes and extracts data from PDF documents based on selection rules defined in a template. This article shows how iTextSharp can be used to extract text from PDFs in C#.
What Is ComPDFKit?
ComPDFKit, a comprehensive and professional PDF SDK, seamlessly operates on Web, Windows, Android, iOS, Mac, and Server, with support for cross-platform frameworks such as React Native, Flutter, etc. Regardless of whether you need functionalities like viewing, annotating, editing, data extraction, or conversion, ComPDFKit has you covered.
ComPDFKit Conversion SDK for .NET Framework allows developers to integrate PDF Extract into Windows apps in C#. Whether you're using a local SDK, an online API, or an on-premises Processor, it helps users accurately recognize the structure and content elements of PDFs, so that the raw data can be classified, merged, and saved in a variety of formats. In this guide, we will show how ComPDFKit extracts text from PDFs in C#.
How to Extract Text from PDF in C# Using ComPDFKit?
Download ComPDFKit C# Library for Text Extraction
First, download and install the ComPDFKit C# library from NuGet. Make sure that your environment meets the system requirements listed below.
Platform: Windows
System Requirements: Windows 7, 8, 10, and 11 (32-bit, 64-bit)
Integrated Development Environment: Visual Studio 2017 or higher
Framework Requirements: .NET Framework 4.6.1 or higher
Create a New Windows Project and Apply the License
Follow the instructions in our documentation to create a new project. After that, contact our sales team to get a free trial license for initializing the ComPDFKit Conversion SDK. Then apply the license as follows.
string resPath = "***";
string libPath = "***";
string license = "***";
CPDFConverter.InitLibrary(libPath);
CPDFConverter.InitResource(resPath);
CPDFConverter.LicenseVerify(license);
Extract Text from PDFs
To extract text from PDF documents in C# using ComPDFKit, follow this code sample.
string inputFilePath = "***";
string outputFolderPath = "***";
string outputFileName = "***";
CPDFConverterJsonText converter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonText, inputFilePath) as CPDFConverterJsonText;
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false;
ConvertError error = ConvertError.ERR_UNKNOWN;
converter.Convert(outputFolderPath, ref outputFileName, jsonOptions, ref error);
Notice
• Disabling OCR (Optical Character Recognition) means that text inside images, such as tables embedded as pictures, cannot be extracted.
• When we use the CPDFConverterJsonText class to access the content streams of a PDF document, we are often faced with fragmented data. For example, suppose we are attempting to extract the sentence "This is a sample sentence." from a PDF document. We may end up retrieving it as separate text objects such as "This" and "is a sample sentence.". This occurs because text objects in PDFs are not always cleanly organized into words, sentences, or paragraphs. When OCR is disabled, the CPDFConverterJsonText class returns text objects exactly as they are defined in the PDF page content streams.
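If fragmented text objects are a problem downstream, one common approach is to regroup them by position. The following is only a minimal sketch under stated assumptions: `TextFragment`, its fields, and `FragmentMerger` are hypothetical names introduced for illustration and are not part of the ComPDFKit API or its JSON output schema. It assumes each fragment carries its text, a left edge, a baseline, and a width in page coordinates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical shape of one extracted text object (not a ComPDFKit type).
public class TextFragment
{
    public string Text { get; set; }
    public double X { get; set; }      // left edge of the fragment
    public double Y { get; set; }      // baseline position on the page
    public double Width { get; set; }  // rendered width of the fragment
}

public static class FragmentMerger
{
    // Groups fragments into lines by baseline, then joins each line left to right.
    // A space is inserted when the horizontal gap between neighbours exceeds spaceGap.
    public static List<string> MergeIntoLines(
        IEnumerable<TextFragment> fragments,
        double lineTolerance = 2.0,
        double spaceGap = 1.0)
    {
        // Assumes PDF-style coordinates, where a larger Y is higher on the page.
        var sortedByY = fragments.OrderByDescending(f => f.Y).ToList();

        var groups = new List<List<TextFragment>>();
        foreach (var fragment in sortedByY)
        {
            var current = groups.Count > 0 ? groups[groups.Count - 1] : null;
            // Start a new line when the baseline is too far from the line's first fragment.
            if (current == null || Math.Abs(current[0].Y - fragment.Y) > lineTolerance)
            {
                current = new List<TextFragment>();
                groups.Add(current);
            }
            current.Add(fragment);
        }

        var lines = new List<string>();
        foreach (var group in groups)
        {
            var builder = new StringBuilder();
            TextFragment previous = null;
            foreach (var fragment in group.OrderBy(f => f.X))
            {
                if (previous != null &&
                    fragment.X - (previous.X + previous.Width) > spaceGap)
                {
                    builder.Append(' ');
                }
                builder.Append(fragment.Text);
                previous = fragment;
            }
            lines.Add(builder.ToString());
        }
        return lines;
    }
}
```

With the two fragments from the example above placed on the same baseline a few units apart, `FragmentMerger.MergeIntoLines` would return them as the single line "This is a sample sentence.". The tolerance values are arbitrary defaults and would need tuning for real font sizes and layouts.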
How to Extract Text from PDFs Using iTextSharp?
The steps for extracting text with iTextSharp are similar to those for ComPDFKit. Once you have installed the iTextSharp PDF library and created a project, follow the example below to extract text from PDF files using the iTextSharp C# library.
using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"C:\Users\buttw\OneDrive\Desktop\highlighted PDF.pdf";
            string outPath = @"C:\Users\buttw\OneDrive\Desktop\name.txt";
            int pagesToScan = 2;
            PdfReader reader = null;
            try
            {
                reader = new PdfReader(filePath);
                // Never read past the last page of the document.
                int lastPage = Math.Min(pagesToScan, reader.NumberOfPages);
                // Open the output file once and append the text of each page.
                using (StreamWriter file = new StreamWriter(outPath, true))
                {
                    for (int page = 1; page <= lastPage; page++)
                    {
                        // Use a fresh strategy per page, since it accumulates text chunks.
                        ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                        // The result is already a .NET string, so no encoding round trip is needed.
                        string strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                        foreach (string line in strText.Split('\n'))
                        {
                            file.WriteLine(line);
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
            finally
            {
                // Release the PDF file even if extraction failed.
                if (reader != null) reader.Close();
            }
        }
    }
}
Text Extraction C# Libraries Comparison
iTextSharp is undeniably a powerful PDF text extraction library for C#. As an open-source library, it is a go-to solution for developers who want to integrate common PDF features into their apps. Compared with ComPDFKit, however, iTextSharp requires longer and more complex code, while ComPDFKit takes fewer than 10 lines of code to extract text from PDFs.
With iTextSharp, developers are expected to write custom code or logic to achieve the desired functionality, which requires in-depth knowledge of the library and C#. The ComPDFKit C# library, by contrast, provides out-of-the-box features that make text extraction a convenient process for developers.
Therefore, on the premise that ComPDFKit and iTextSharp have similar accuracy in PDF text extraction, ComPDFKit comes out ahead in ease of use and code readability.
Conclusion
This article has introduced both iTextSharp and ComPDFKit and demonstrated how to extract text from PDFs in C# using each tool. The comparison shows that iTextSharp is a solid open-source option, whereas ComPDFKit offers simpler and shorter code for extracting text from PDF files. If you're interested in using ComPDFKit for PDF text extraction in C#, you can take advantage of a free trial license to integrate our PDF Extract into your applications or systems. Before diving in, you can also try our free online demo for PDF data extraction.