How to Extract Text from PDFs with ComPDFKit in Objective-C

By ComPDFKit | Thu. 14 Nov. 2024
Objective-C | Data Extraction

Text extraction is useful for text retrieval and analysis. Normal text in a PDF is easy to obtain, but some text cannot be obtained just by copying it, because it sits inside graphs, tables, images, or scanned pages.

Extracting that text is much more involved than extracting normal text. This article explains how to extract all kinds of text with the ComPDFKit PDF SDK and Conversion SDK in Objective-C.

 

 

Extract Normal Text from PDFs

 

For a large PDF document, extracting text programmatically is more convenient than extracting it by hand. Normal text in a PDF is stored in the content streams of its pages, so the ComPDFKit PDF SDK can read that text and return the parts you need.

 

The sections below show how to extract text from a whole PDF, from specific pages, and from a region of a page.

 

Text from the Whole PDF:

ComPDFKit PDF SDK provides a text extraction API. To extract the text of a whole PDF, loop over its pages and collect the text of each one, as in the code below.

 

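// Replace the empty string with the path to your PDF file.
NSURL *url = [NSURL fileURLWithPath:@""];
CPDFDocument *document = [[CPDFDocument alloc] initWithURL:url];
NSUInteger pageCount = [document pageCount];
// A mutable string avoids allocating a new string for every page.
NSMutableString *text = [NSMutableString string];
for (NSUInteger i = 0; i < pageCount; i++) {
    CPDFPage *page = [document pageAtIndex:i];
    NSString *string = [page string];
    // Guard against pages that yield no text; appending nil raises an exception.
    if (string.length > 0) {
        [text appendString:string];
    }
}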

 

Text from Specific Pages in a PDF:

To extract text from specific pages, open the document, get each page by its page index, and read the text of that page. Page indexes start at 0, so the code below reads the first page.

 

NSURL *url = [NSURL fileURLWithPath:@""];
CPDFDocument *document = [[CPDFDocument alloc] initWithURL:url];
CPDFPage *page = [document pageAtIndex:0];
NSString *text = [page string];
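
Readers usually describe pages with 1-based numbers such as "1-3,5", while pageAtIndex: takes zero-based indexes. The following is a minimal sketch of that conversion using Foundation only. PageIndexesFromString is a hypothetical helper name and is not part of ComPDFKit; the page-list syntax it accepts is also an assumption made for this example.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not a ComPDFKit API): converts a 1-based page list
// such as "1-3,5" into zero-based indexes and drops pages beyond pageCount.
static NSIndexSet *PageIndexesFromString(NSString *spec, NSUInteger pageCount) {
    NSMutableIndexSet *indexes = [NSMutableIndexSet indexSet];
    for (NSString *part in [spec componentsSeparatedByString:@","]) {
        NSArray<NSString *> *bounds = [part componentsSeparatedByString:@"-"];
        NSInteger first = [bounds.firstObject integerValue];
        NSInteger last = [bounds.lastObject integerValue];
        // integerValue returns 0 for non-numeric text, so such parts are skipped,
        // as are reversed ranges like "3-1".
        if (first < 1 || last < first) {
            continue;
        }
        for (NSInteger number = first; number <= last && number <= (NSInteger)pageCount; number++) {
            [indexes addIndex:(NSUInteger)(number - 1)];
        }
    }
    return indexes;
}
```

Each index in the returned set can then be passed to pageAtIndex: and the text of that page read with string, in the same way as the sample above.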

 

Text from a Page Region in a PDF:

When you only need part of a page, ComPDFKit can extract the text inside a rectangle that you specify. The following code sample shows how.

 

NSURL *url = [NSURL fileURLWithPath:@""];
CPDFDocument *document = [[CPDFDocument alloc] initWithURL:url];
CPDFPage *page = [document pageAtIndex:0];
CGRect rect = CGRectMake(0, 0, 200, 40);
NSString *text = [page stringForRect:rect];
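
The sample above does not say which corner of the page the rectangle is measured from. PDF page space conventionally places the origin at the bottom-left corner with y increasing upward, whereas UIKit views measure from the top-left corner. Assuming stringForRect: expects page-space coordinates (an assumption to verify against the ComPDFKit API reference), a rectangle taken from a top-left layout needs to be flipped first. The sketch below shows only that arithmetic; PageRectFromTopLeftRect and its pageHeight parameter are hypothetical names, and the sketch ignores page rotation, scaling, and crop-box offsets.

```objc
#import <CoreGraphics/CoreGraphics.h>

// Hypothetical helper (not a ComPDFKit API): converts a rectangle measured
// from the top-left corner into one measured from the bottom-left corner.
// pageHeight is the page height in the same units as rect.
static CGRect PageRectFromTopLeftRect(CGRect rect, CGFloat pageHeight) {
    // The bottom edge of the rectangle becomes its new y origin.
    CGFloat flippedY = pageHeight - CGRectGetMaxY(rect);
    return CGRectMake(CGRectGetMinX(rect), flippedY,
                      CGRectGetWidth(rect), CGRectGetHeight(rect));
}
```

For example, on a page 792 points high, a 200 x 40 rectangle at the top-left corner maps to an origin of (0, 752) with the same size.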

 

 

Extract Text from Images & Scanned PDFs

 

PDFs often contain text that you cannot select and copy, because the text is part of an image or a scanned page. To extract text from images or scanned PDF documents, you need the OCR function of the ComPDFKit Conversion SDK.

 

OCR stands for Optical Character Recognition, a technology that reads and extracts text from images or scanned documents. ComPDFKit Conversion SDK supports OCR on Windows, Android, and iOS.

 

As OCR continues to improve, it helps with tasks such as making handwritten notes editable, turning legal contracts into electronic files for storage, and preserving historical documents.

 

OCR on images and scanned PDFs works in three stages. First, the image or scanned page is converted to a black-and-white version. Next, the features of each black region are compared with and matched against learned characters. Finally, the recognized words are corrected against a word library.
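
The black-and-white conversion can be illustrated with a fixed-threshold binarization. This is a generic sketch of the idea and not the Conversion SDK's implementation; BinarizeGrayscale, the 8-bit grayscale input buffer, and the threshold value are all assumptions made for illustration.

```objc
#import <Foundation/Foundation.h>

// Illustrative only: fixed-threshold binarization of 8-bit grayscale pixels.
// Pixels at or above the threshold become white (255), the rest black (0).
static NSData *BinarizeGrayscale(NSData *pixels, uint8_t threshold) {
    NSMutableData *result = [NSMutableData dataWithLength:pixels.length];
    const uint8_t *source = pixels.bytes;
    uint8_t *target = result.mutableBytes;
    for (NSUInteger i = 0; i < pixels.length; i++) {
        target[i] = (source[i] >= threshold) ? 255 : 0;
    }
    return result;
}
```

A single global threshold is the simplest choice and works poorly on unevenly lit scans, which is one reason OCR engines often use adaptive thresholds instead.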

 

 

Extract Text from Graphs or Tables

 

Extracting the text from graphs or tables takes three steps: detection, extraction, and conversion. This section describes how the ComPDFKit Conversion SDK handles each one. The two figures show a document before and after extraction.

Figure: Before Extraction

Figure: After Extraction

 

Detection: This step locates the graphs and tables on the page. Because table layouts can be complicated, different cues may be used to find the text, such as ruling lines or the coordinates of the text.

 

Extraction: Both the structure and the content of the graphs or tables matter in this step. For the content, the fonts, styles, acronyms, and abbreviations are identified and matched against libraries. For the structure, tables that span multiple pages are especially hard to identify. The ComPDFKit Conversion SDK handles these cases by analyzing graphic ruling lines and content layout.

 

Conversion: After the text is extracted, the content is converted to an editable format such as Word or Excel that follows the original content.
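
The coordinate-based approach mentioned under Detection can be sketched as grouping text fragments into rows by their vertical position and ordering each row from left to right. This is a simplified illustration, not the Conversion SDK's algorithm. RowsFromFragments is a hypothetical helper, and the input format (dictionaries with "text", "x", and "y" keys, where a larger y is higher on the page) is an assumption made for this example.

```objc
#import <Foundation/Foundation.h>
#import <math.h>

// Hypothetical helper: groups fragments into table rows by y position.
// Each fragment is @{@"text": NSString, @"x": NSNumber, @"y": NSNumber}.
static NSArray<NSArray<NSString *> *> *RowsFromFragments(NSArray<NSDictionary *> *fragments,
                                                         double tolerance) {
    // Visit fragments from the top of the page down (larger y first).
    NSSortDescriptor *byY = [NSSortDescriptor sortDescriptorWithKey:@"y" ascending:NO];
    NSArray<NSDictionary *> *sorted = [fragments sortedArrayUsingDescriptors:@[byY]];

    NSMutableArray<NSMutableArray<NSDictionary *> *> *rows = [NSMutableArray array];
    double rowY = 0;
    for (NSDictionary *fragment in sorted) {
        double y = [fragment[@"y"] doubleValue];
        // Start a new row when the fragment is too far below the first
        // fragment of the current row.
        if (rows.count == 0 || fabs(y - rowY) > tolerance) {
            [rows addObject:[NSMutableArray array]];
            rowY = y;
        }
        [rows.lastObject addObject:fragment];
    }

    // Order each row left to right and keep only the text.
    NSSortDescriptor *byX = [NSSortDescriptor sortDescriptorWithKey:@"x" ascending:YES];
    NSMutableArray<NSArray<NSString *> *> *table = [NSMutableArray array];
    for (NSArray<NSDictionary *> *row in rows) {
        NSArray<NSDictionary *> *ordered = [row sortedArrayUsingDescriptors:@[byX]];
        [table addObject:[ordered valueForKey:@"text"]];
    }
    return table;
}
```

The tolerance absorbs small baseline differences within a row. Real tables with merged cells, wrapped text, or page breaks need the ruling-line and layout analysis described above.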

 

 

Related Functions

 

Ways of obtaining information are developing rapidly, which also raises the question of how to protect our work and personal information. For the security functions below, please refer to the related blog posts or documentation.

 

Redaction: It irreversibly removes information that you do not want to show in a PDF file, such as an identification number. Text, images, and vector graphics can all be redacted.

 

PDF Permission: It protects a PDF document from being reshared, copied, printed, and so on.

 

Watermark: Adding a non-removable watermark to documents can discourage viewers from sharing your content or taking screenshots.

 

 

Conclusion

 

Now we have a clear understanding of the kinds of text found in PDFs and of the methods for extracting them. If you are interested in our PDF SDK and Conversion SDK, please feel free to contact us.