How to Extract Words from PDF using PHP - PDF Parsing API

By ComPDFKit | Fri. 23 Aug. 2024
Data Extraction, PDF API, PHP

 


Parsing and extracting text from PDF documents helps teams work more efficiently, reduce error rates, and automate business processes, and PHP projects are no exception. This article shows how to call ComPDFKit's PDF API from PHP to extract text from PDF documents.

 

Text extraction is valuable in many domains because it replaces manual workflows and improves data accuracy and accessibility. Applications include, but are not limited to:

 

  • Automated handling and auditing of bank statements and financial reports

  • Automatic grading and correction of exam papers and student assignments

  • Extraction of medical records and diagnostic reports for archiving and quick retrieval

  • Automatic extraction of customer data and feedback forms, which are then stored in database systems for data mining and analysis

 

 


 

 

Step1: Get a License for the PHP PDF API

 

For ComPDFKit API users, we provide 1000 free PDF API requests. Follow the steps below to access the license and start your API requests.

 

  1. Register for ComPDFKit API to open the dashboard, which shows your API Keys, the progress of your API plan, and the status of your API requests.

 


 

  2. Create a project and get the Public Key and Secret Key.

After you create your account, a default project is created for you. You can create more projects to call ComPDFKit API. All supported PDF APIs are listed on the documentation pages.

 

Each project has its own unique Public Key and Secret Key. Remember to use the keys that belong to the project you are working with.
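Because each project has its own key pair, it can help to keep the keys out of the source code. The following is a minimal sketch, assuming the keys are stored in two environment variables; the names COMPDFKIT_PUBLIC_KEY and COMPDFKIT_SECRET_KEY and the helper loadApiKeys() are hypothetical and not part of ComPDFKit API.

```php
<?php
// Sketch: load the project keys from environment variables.
// COMPDFKIT_PUBLIC_KEY and COMPDFKIT_SECRET_KEY are hypothetical names
// chosen for this example; use whatever your deployment defines.
function loadApiKeys(): array
{
    $keys = [
        'publicKey' => getenv('COMPDFKIT_PUBLIC_KEY'),
        'secretKey' => getenv('COMPDFKIT_SECRET_KEY'),
    ];
    foreach ($keys as $name => $value) {
        // getenv() returns false when the variable is not set.
        if ($value === false || $value === '') {
            throw new RuntimeException("Missing environment variable for $name");
        }
    }
    return $keys;
}
```

The returned array has the same shape as the $params array used for authentication, so it could be passed to json_encode() directly.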

 


 

 

Step2: Authenticate the PDF API for PDF Text Extraction

 

Replace publicKey and secretKey with your real keys to get the accessToken. Then use the accessToken to create a task, upload files, extract words from the PDF, and get the extracted text as a JSON file.

PHP code example to authenticate with the ComPDFKit PDF text extraction API:

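// Placeholders: replace with the keys of your project.
$publicKey = 'your_public_key';
$secretKey = 'your_secret_key';
$params = [
    'publicKey' => $publicKey,
    'secretKey' => $secretKey
];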
$headers = ['Content-Type: application/json'];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/oauth/token',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'POST',
    CURLOPT_HTTPHEADER => $headers,
    CURLOPT_POSTFIELDS => json_encode($params)
));
$response = curl_exec($curl);
curl_close($curl);
$result = json_decode($response, true);
$accessToken = $result['data']['accessToken'];
$bearerToken = "Bearer $accessToken";
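The example above assumes the request succeeds. The following sketch checks the response before building the bearer token. The data.accessToken path comes from the example; how the API reports a failure is not covered in this article, so the hypothetical helper extractBearerToken() simply treats any other response shape as an error.

```php
<?php
// Sketch: validate the token response before using it.
// $response is the return value of curl_exec(): a string, or false on failure.
function extractBearerToken($response): string
{
    if ($response === false) {
        throw new RuntimeException('Token request failed (curl_exec returned false)');
    }
    $result = json_decode($response, true);
    // Any response without data.accessToken is treated as an error (assumption).
    if (!is_array($result) || empty($result['data']['accessToken'])) {
        throw new RuntimeException('No accessToken in response: ' . $response);
    }
    return 'Bearer ' . $result['data']['accessToken'];
}
```

With this helper, the last three lines of the authentication example could become $bearerToken = extractBearerToken($response);.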



Step3: Create Task - Extract PDF Text

 

Use the accessToken obtained in the previous step. Set the language in which error information is displayed (1: English, 2: Chinese). ComPDFKit PDF API parameters are described on the Quick Start --> Request Description page.

The response data contains the taskId. PHP code example to create a PDF text extraction task:

$language = 1; // Language of error information (1: English, 2: Chinese)
$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/task/pdf/json?language=' . $language,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);
$result = json_decode($response, true);
$taskId = $result['data']['taskId'];
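The GET requests in this tutorial differ only in their URL. As a sketch of one way to reduce the repetition, the helpers below build the URL with http_build_query(), which URL-encodes the query values, and send an authenticated GET request. The names COMPDF_BASE_URL, buildApiUrl(), and apiGet() are hypothetical, and the 30-second timeout is an assumed value rather than a ComPDFKit recommendation.

```php
<?php
// Base URL taken from the examples in this article.
const COMPDF_BASE_URL = 'https://api-server.compdf.com/server/v1/';

// Build an endpoint URL with URL-encoded query parameters.
function buildApiUrl(string $path, array $query = []): string
{
    $url = COMPDF_BASE_URL . ltrim($path, '/');
    return $query ? $url . '?' . http_build_query($query) : $url;
}

// Send an authenticated GET request and return the decoded JSON body.
function apiGet(string $url, string $bearerToken): array
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30, // assumed limit; tune for your workload
        CURLOPT_HTTPHEADER     => ['Authorization: ' . $bearerToken],
    ]);
    $response = curl_exec($curl);
    if ($response === false) {
        throw new RuntimeException('Request failed: ' . curl_error($curl));
    }
    $result = json_decode($response, true);
    if (!is_array($result)) {
        throw new RuntimeException('Response was not valid JSON');
    }
    return $result;
}
```

Creating the task would then read $result = apiGet(buildApiUrl('task/pdf/json', ['language' => $language]), $bearerToken);.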

 

 

Step4: Upload Files for PDF Parser

 

Replace the following values in the PHP code:

  • PDF file: The PDF you want to extract text from.

  • taskId: Obtained in the task creation step.

  • language: The language in which error information is displayed.

  • accessToken: Obtained in the authentication step.

 

ComPDFKit API also provides AI and OCR features. You can set the following parameters in this step:

  • type: What content to extract (0: text, 1: table). Default: 0.

  • isAllowOcr: Whether to enable OCR (1: yes, 0: no). Default: 0.

  • isOnlyAiTable: Whether to enable AI table recognition (1: yes, 0: no). Default: 0.
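The upload request takes these options as a JSON string in its parameter field. A minimal sketch of a builder for that string follows; the function name buildExtractParameter() is hypothetical, and the option names, values, and defaults are the ones listed above.

```php
<?php
// Sketch: build the JSON string for the upload request's 'parameter' field.
// Option names and defaults follow the list in this article.
function buildExtractParameter(int $type = 0, bool $allowOcr = false, bool $onlyAiTable = false): string
{
    if ($type !== 0 && $type !== 1) {
        throw new InvalidArgumentException('type must be 0 (text) or 1 (table)');
    }
    return json_encode([
        'type'          => $type,               // 0: text, 1: table
        'isAllowOcr'    => $allowOcr ? 1 : 0,    // 1 enables OCR
        'isOnlyAiTable' => $onlyAiTable ? 1 : 0, // 1 enables AI table recognition
    ]);
}
```

Calling buildExtractParameter() with no arguments yields the defaults for plain text extraction, while buildExtractParameter(0, true) keeps text extraction and enables OCR.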

 

PHP code example to upload a PDF for parsing:

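$pdfPath = 'path/to/your.pdf'; // Placeholder: the PDF to extract text from
$params = [
    'taskId' => $taskId, // ID returned when the task was created
    'file' => new CURLFile($pdfPath), // PDF file to process
    'language' => $language,
    'password' => '',
    // type 0 extracts text (1 would extract tables); isAllowOcr 1 enables OCR
    'parameter' => json_encode(['type' => 0, 'isAllowOcr' => 1, 'isContainOcrBg' => 0])
];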
$headers = [
    'Authorization: ' . $bearerToken
];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/file/upload',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'POST',
    CURLOPT_HTTPHEADER => $headers,
    CURLOPT_POSTFIELDS => $params
));
$response = curl_exec($curl);
curl_close($curl);
$result = json_decode($response, true);
$fileKey = $result['data']['fileKey'];
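CURLFile does not check the path when it is constructed, so a wrong path only surfaces when the request is sent. The sketch below, built around a hypothetical helper makePdfUpload(), verifies the file first and sets the MIME type and upload name explicitly. Whether the API requires the application/pdf MIME type is an assumption.

```php
<?php
// Sketch: verify the file before building the upload handle.
function makePdfUpload(string $pdfPath): CURLFile
{
    if (!is_file($pdfPath) || !is_readable($pdfPath)) {
        throw new InvalidArgumentException("Cannot read PDF file: $pdfPath");
    }
    // Pass the MIME type and the file name to send with the upload.
    return new CURLFile($pdfPath, 'application/pdf', basename($pdfPath));
}
```

In the upload example, 'file' => makePdfUpload($pdfPath) would replace 'file' => new CURLFile($pdfPath).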



Step5: Process and Extract Text From Uploaded PDF Files

 

Execute the task to extract words from the PDF you uploaded. Here is the PHP code example:

$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/execute/start?language=' . $language . '&taskId=' . $taskId,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);



Step6: Get Task Information of PDF Text Extraction

 

Follow the PHP code example below to obtain the task information. Replace the required values, such as taskId and accessToken. The parsed and extracted result is delivered as a JSON file, a structured data format that makes the extracted PDF text easy to reuse.

$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];

$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/task/taskInfo' . '?taskId=' . $taskId,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);
$result = json_decode($response, true);
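If extraction is still running when the task information is requested, the request may need to be repeated. The following is a generic polling sketch under that assumption. The helper pollUntil() is hypothetical; it takes one callable that performs the taskInfo request and another that decides whether the decoded response indicates completion, because the exact status field is not shown in this article.

```php
<?php
// Sketch: call $fetch repeatedly until $isDone accepts its result.
// $fetch would wrap the taskInfo request above and return the decoded response;
// $isDone inspects that response and returns true once the task has finished.
function pollUntil(callable $fetch, callable $isDone, int $maxAttempts = 30, int $delaySeconds = 2): array
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch();
        if ($isDone($result)) {
            return $result;
        }
        // Wait before the next attempt, but not after the last one.
        if ($attempt < $maxAttempts) {
            sleep($delaySeconds);
        }
    }
    throw new RuntimeException("Task not finished after $maxAttempts attempts");
}
```

The completion check depends on the response format, so consult the ComPDFKit API documentation for the status field and its values before writing $isDone.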



Conclusion

 

Beyond extracting text from PDFs, we also support the extraction of tables, images, and other elements. This makes our PDF API a useful tool for anyone dealing with large amounts of data stored in PDF files.

 

Through accurate data extraction, we help users quickly and efficiently put the information in their documents to use. Whether for research, data analysis, or simply improving productivity, ComPDFKit API provides a solid foundation for handling PDF data.

 

 
