Extract Tables from PDF
Overview
This guide shows how to extract table content from a PDF document.
Standard and non-standard tables
Tables generally fall into two categories, standard and non-standard, defined as follows:
- Standard table: The border and inner lines are complete and clear, so no lines need to be added manually to divide the table content.
- Non-standard table: The border or inner lines are missing or unclear, so lines must be added manually to separate the table content.
Notice
- Non-standard tables in the original PDF document cannot be extracted unless the OCR option is enabled.
- Enabling the OCR or AI layout analysis option is recommended: it improves table extraction accuracy and supports recognition of non-standard tables.
Sample
The following sample code illustrates the table extraction capabilities.
c#
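// Replace the placeholders with the source PDF path, the output folder, and the output file name.
string inputFilePath = "***";
string outputFolderPath = "***";
string outputFileName = "***";
// Create a converter that extracts the tables of the PDF as JSON.
CPDFConverterJsonTable converter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonTable, inputFilePath) as CPDFConverterJsonTable;
// With OCR and AI layout analysis both disabled, only standard tables are extracted.
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false;
jsonOptions.IsAILayoutAnalysis = false;
// Run the conversion; error is passed by reference and holds the status after the call.
ConvertError error = ConvertError.ERR_UNKNOWN;
converter.Convert(outputFolderPath, ref outputFileName, jsonOptions, ref error);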
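The sample above leaves both recognition options disabled, so by the notice it would handle only standard tables. As a minimal sketch, the options object could instead be configured as follows before it is passed to `Convert`. It uses only the two properties that appear in the sample. Whether OCR needs any further setup (for example, a language or model selection) is not covered here and is left as an open assumption.

```csharp
// Sketch only: same options type and properties as in the sample above.
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
// Per the notice, OCR must be enabled for non-standard tables to be extracted.
jsonOptions.IsAllowOCR = true;
// Per the notice, AI layout analysis is recommended for higher extraction accuracy.
jsonOptions.IsAILayoutAnalysis = true;
```

The rest of the sample (creating the converter and calling `Convert`) stays unchanged.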