Extract Tables from PDF

Overview

This guide explains how to extract table content from a PDF document.

Standard table and non-standard table

Tables generally fall into two categories, standard and non-standard, defined as follows:

Standard table: The outer border and inner lines of the table are complete and clear, so no table lines need to be added manually to separate the content.

Non-standard table: The outer border or inner lines are missing or unclear, so table lines would have to be added manually to separate the content.

Notice

Non-standard tables in the original PDF document cannot be extracted unless the OCR option is enabled.
Enabling the OCR or AI layout analysis option is recommended, both for more accurate table extraction and for non-standard table recognition.
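
As a minimal sketch of this recommendation, the two option flags can be switched on before the conversion call. The class and property names (CPDFConvertJsonOptions, IsAllowOCR, IsAILayoutAnalysis) come from the sample later in this guide. Enabling both flags at once is an assumption made here for illustration. This guide does not state whether they are meant to be combined, or whether OCR needs additional resources or licensing.

```csharp
// Sketch only: assumes the same SDK namespaces as the full sample are already imported.
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();

// Enable OCR so that non-standard tables can be recognized.
jsonOptions.IsAllowOCR = true;

// Enable AI layout analysis as well (assumption: the two flags can be combined).
jsonOptions.IsAILayoutAnalysis = true;
```

The options object is then passed to the converter's Convert call in the same way as in the full sample.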

Sample

The full sample below illustrates the table extraction capabilities.

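// Replace the "***" placeholders with your own paths and file name.
string inputFilePath = "***";
string outputFolderPath = "***";
string outputFileName = "***";

// Create a converter that writes the tables found in the input PDF as JSON.
CPDFConverterJsonTable converter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonTable, inputFilePath) as CPDFConverterJsonTable;

// With both options disabled, only standard tables are extracted.
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false;
jsonOptions.IsAILayoutAnalysis = false;

// The error variable is passed by reference and is updated by Convert.
ConvertError error = ConvertError.ERR_UNKNOWN;
converter.Convert(outputFolderPath, ref outputFileName, jsonOptions, ref error);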
