Data Extraction Guides

The Data Extraction feature of the ComPDFKit Conversion SDK detects, recognizes, analyzes, and extracts PDF content such as text, text structure, and tables.

Note: Extracting PDF content to JSON files does not support page range selection; the extraction always covers all pages. Once processing starts, it cannot be canceled.

Extract Text from PDFs

Overview

This feature extracts the text from PDF documents.

Note

Disabling OCR (Optical Character Recognition) can make it impossible to extract text from tables inside images.
When the convert class reads the content streams of a PDF document, the data it returns is often fragmented. For example, when extracting the sentence "This is a sample sentence." from a PDF document, you may retrieve it as separate pieces such as "This" and "is a sample sentence.". This happens because text objects in PDFs are not always cleanly organized into words, sentences, or paragraphs. When OCR is disabled, the convert class returns text objects exactly as they are defined in the PDF page content streams.
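
As a rough illustration of the fragmentation described above, the sketch below stitches fragments back into one string. It is plain Java and not part of the ComPDFKit API: the class name `FragmentJoiner` and its `join` method are hypothetical, and the rule it applies (insert one space between neighbouring fragments unless whitespace is already present at the boundary) is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the ComPDFKit SDK.
final class FragmentJoiner {

    // Joins fragments in the given order, inserting a single space between
    // neighbours unless one side of the boundary is already whitespace.
    static String join(List<String> fragments) {
        StringBuilder out = new StringBuilder();
        for (String fragment : fragments) {
            if (fragment == null || fragment.isEmpty()) {
                continue; // nothing to append
            }
            boolean needsSpace = out.length() > 0
                    && !Character.isWhitespace(out.charAt(out.length() - 1))
                    && !Character.isWhitespace(fragment.charAt(0));
            if (needsSpace) {
                out.append(' ');
            }
            out.append(fragment);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The two fragments from the example in the note above.
        List<String> fragments = Arrays.asList("This", "is a sample sentence.");
        System.out.println(join(fragments)); // prints "This is a sample sentence."
    }
}
```

This heuristic is deliberately simple. It would insert an unwanted space if a fragment boundary fell in the middle of a word, so real fragments may need position information to be merged reliably.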

Sample

This sample shows how to extract text content from a PDF document.

java

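        // Create the JSON converter and its options.
        CPDFConvert cpdfConvertJson = new CPDFConvertJson();
        CPDFConvertJsonOptions cpdfConvertJsonOptions = new CPDFConvertJsonOptions();
        // Enable OCR so that text inside images can be recognized.
        cpdfConvertJsonOptions.setAllowOcr(true);
        cpdfConvertJsonOptions.setContainOcrBg(true);
        cpdfConvertJsonOptions.setOnlyAiTable(true);
        // Select text as the content type to extract.
        cpdfConvertJsonOptions.setPdtToJsonEnum(PDFToJsonEnum.TEXT);
        // file, num, time, dto, and convert are defined by the surrounding application code.
        convert = cpdfConvertJson.convert(file.getPath(), null, num + "" + time, cpdfConvertJsonOptions, null, dto.getPassword(), null);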
