Skip to content
Guides

Extract Tables from PDFs

Overview

It is to extract tables from PDF documents.

Standard table and non-standard table

Commonly, tables can be divided into two categories: standard tables and non-standard tables. The specific definitions are as follows:

  • Standard table: The table border and the inner lines of the table are complete and clear. There is no need to manually add table lines to divide the table content.

image-20231116145224545

  • Non-standard tables: Table borders or inner lines are missing, and table lines are unclear. Table lines need to be manually added to separate the table content.

image-20231116145517818

Note

  • Non-standard tables in the original PDF document cannot be extracted when the OCR option is not enabled.
  • It is recommended to enable OCR or AI layout analysis options for higher accuracy of table extraction and the support of non-standard table recognition.

Sample

This is a sample to extract the table and table content from a PDF document.

java
  CPDFConvert cpdfConvertJson = new CPDFConvertJson();
        CPDFConvertJsonOptions cpdfConvertJsonOptions = new CPDFConvertJsonOptions();
        cpdfConvertJsonOptions.setAllowOcr(true);
        cpdfConvertJsonOptions.setContainOcrBg(true);
        cpdfConvertJsonOptions.setOnlyAiTable(true);
        cpdfConvertJsonOptions.setPdtToJsonEnum(PDFToJsonEnum.TABLE);
        convert = cpdfConvertJson.convert(file.getPath(), null, num + "" + time, cpdfConvertJsonOptions, null, dto.getPassword(), null);