
How to Extract Text Data From a PDF Using Java

By ComPDFKit | Fri. 23 Aug. 2024
Java | Data Extraction | PDF API

 


 

PDFs are a ubiquitous format for documents due to their versatility and consistency across platforms. However, extracting text data from PDFs can be a challenging task, especially when you need to automate this process in your Java projects.

 

This article provides a comprehensive guide on how to use the PDF Extract API to efficiently convert PDF data into text in Java.

 


 

 

Step1: Get a License for the Java PDF API

 

For ComPDFKit API users, we provide 1000 free PDF API requests. Follow the steps below to access the license and start your API requests.

 

  1. Register for a ComPDFKit API account to open the dashboard. The dashboard shows your API Keys, the progress of your API plan, and the status of your API requests.

 

[Image: Get and access the license of the Java PDF API]

 

  2. Create a project and get the Public Key and Secret Key.

After your account is created, a default project is created automatically. You can create more projects to call the ComPDFKit API. All supported PDF APIs are listed on the documentation pages.

 

Each project has its own Public Key and Secret Key. Use the key pair that belongs to the project you are calling.
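
Because each project has its own key pair, it can help to keep the keys out of the source code. The sketch below reads them from environment variables. The class name ApiKeys and the variable names COMPDFKIT_PUBLIC_KEY and COMPDFKIT_SECRET_KEY are assumptions made for this example and are not part of the ComPDFKit API.

```java
import java.util.Map;

// Hypothetical helper: holds one project's key pair, read from the environment.
// The environment variable names are assumptions made for this sketch.
final class ApiKeys {
  final String publicKey;
  final String secretKey;

  ApiKeys(String publicKey, String secretKey) {
    this.publicKey = publicKey;
    this.secretKey = secretKey;
  }

  // Looks up both keys in the given map and fails fast if one is missing.
  static ApiKeys fromEnv(Map<String, String> env) {
    return new ApiKeys(require(env, "COMPDFKIT_PUBLIC_KEY"),
                       require(env, "COMPDFKIT_SECRET_KEY"));
  }

  private static String require(Map<String, String> env, String name) {
    String value = env.get(name);
    if (value == null || value.trim().isEmpty()) {
      throw new IllegalStateException("Missing environment variable: " + name);
    }
    return value.trim();
  }

  public static void main(String[] args) {
    ApiKeys keys = ApiKeys.fromEnv(System.getenv());
    System.out.println("Loaded a public key of length " + keys.publicKey.length());
  }
}
```

With one pair of variables per deployment, switching projects means changing the environment rather than editing code.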

 

[Image: Get the Public Key and Secret Key of the ComPDFKit API]

 

 

Step2: Authenticate With the PDF API for PDF Text Data Extraction

 

Replace the {{public_key}} and {{secret_key}} placeholders with your real publicKey and secretKey to get the accessToken. Then use the accessToken to create a task, upload files, extract the PDF text, and get the extracted text as a JSON file.

 

Java code example to authenticate with the ComPDFKit PDF text extraction API:

import java.io.*;
import okhttp3.*;
public class Main {
  public static void main(String[] args) throws IOException {
    OkHttpClient client = new OkHttpClient().newBuilder()
      .build();
    // The request body is JSON, so declare it as application/json.
    MediaType mediaType = MediaType.parse("application/json");
    RequestBody body = RequestBody.create(mediaType, "{\n    \"publicKey\": \"{{public_key}}\",\n    \"secretKey\": \"{{secret_key}}\"\n}");
    Request request = new Request.Builder()
      .url("https://api-server.compdf.com/server/v1/oauth/token")
      .method("POST", body)
      .build();
    // Print the response body, which carries the accessToken, and close the response.
    try (Response response = client.newCall(request).execute()) {
      System.out.println(response.body().string());
    }
  }
}
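
The later steps need the accessToken as a plain string. The sketch below pulls a field named accessToken out of a JSON response body with a regular expression, using only the standard library. The class name TokenParser and the sample response are made up for illustration, and the exact shape of the real response is not shown in this article, so treat the field lookup as an assumption. A production project would normally use a JSON library instead.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the accessToken value from a JSON response body.
final class TokenParser {
  // Matches "accessToken": "<value>" where the value has no quotes or backslashes.
  private static final Pattern TOKEN =
      Pattern.compile("\"accessToken\"\\s*:\\s*\"([^\"\\\\]*)\"");

  // Returns the token if the field is present, otherwise an empty Optional.
  static Optional<String> accessToken(String json) {
    Matcher m = TOKEN.matcher(json);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  public static void main(String[] args) {
    // Made-up response body, used only to demonstrate the helper.
    String sample = "{\"code\":\"200\",\"data\":{\"accessToken\":\"abc123\"}}";
    System.out.println(accessToken(sample).orElse("no token found"));
  }
}
```

Returning an Optional makes the missing-token case explicit, for example when the keys are wrong and the response contains an error instead of a token.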

 

 

Step3: Create Task - Extract Text From a PDF

 

Replace {{accessToken}} with the accessToken obtained in the previous step, and set {{language}} to the language used for error messages (1: English, 2: Chinese). The ComPDFKit PDF API parameters are listed on the Quick Start --> Request Description page.

 

After you replace them, the response data contains the taskId. Java code example to create a PDF text extraction task:

import java.io.*;
import okhttp3.*;
public class Main {
  public static void main(String[] args) throws IOException {
    OkHttpClient client = new OkHttpClient().newBuilder()
      .build();
    // OkHttp rejects a GET request that carries a body, so use get() instead.
    Request request = new Request.Builder()
      .url("https://api-server.compdf.com/server/v1/task/pdf/json?language={{language}}")
      .get()
      .addHeader("Authorization", "Bearer {{accessToken}}")
      .build();
    // Print the response body, which contains the taskId, and close the response.
    try (Response response = client.newCall(request).execute()) {
      System.out.println(response.body().string());
    }
  }
}
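
The {{language}} placeholder in the URL takes one of the two codes listed earlier. A minimal sketch of filling it in is shown below. The class name CreateTaskUrl is hypothetical, and rejecting every value other than 1 and 2 is an assumption based only on the two codes this article mentions.

```java
// Hypothetical helper: builds the task-creation URL for a given language code.
final class CreateTaskUrl {
  private static final String ENDPOINT =
      "https://api-server.compdf.com/server/v1/task/pdf/json";

  // Language codes from this article: 1 = English, 2 = Chinese.
  static String forLanguage(int language) {
    if (language != 1 && language != 2) {
      throw new IllegalArgumentException("language must be 1 or 2, got " + language);
    }
    return ENDPOINT + "?language=" + language;
  }

  public static void main(String[] args) {
    System.out.println(forLanguage(1));
  }
}
```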

 

 

Step4: Upload Files for PDF Document Parsing

 

Replace the information in the Java code:

 

  • PDF Files: The PDF you want to extract text from.

  • taskId: Obtained in the task creation step.

  • Language: The language in which error information is displayed.

  • accessToken: Obtained in the Authentication step.

 

The ComPDFKit API also provides features such as AI table recognition and OCR. You can set the following parameters in this step:

 

  • type: The content to extract (0: text, 1: table). Default: 0.

  • isAllowOcr: Whether to enable OCR (1: yes, 0: no). Default: 0.

  • isOnlyAiTable: Whether to enable AI table recognition (1: yes, 0: no). Default: 0.
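
These options are sent as one JSON string in the "parameter" form field of the upload request. The sketch below assembles that string from typed arguments so the 0/1 flags cannot be mistyped. The class name ExtractOptions and the method toJson are hypothetical. The field names and values are the ones listed above.

```java
// Hypothetical helper: builds the JSON string for the "parameter" form field.
final class ExtractOptions {
  // type: 0 = text, 1 = table. The two flags are sent as 1 (yes) or 0 (no).
  static String toJson(int type, boolean allowOcr, boolean onlyAiTable) {
    if (type != 0 && type != 1) {
      throw new IllegalArgumentException("type must be 0 (text) or 1 (table)");
    }
    return "{\"type\": " + type
        + ", \"isAllowOcr\": " + (allowOcr ? 1 : 0)
        + ", \"isOnlyAiTable\": " + (onlyAiTable ? 1 : 0) + "}";
  }

  public static void main(String[] args) {
    // Text extraction with OCR enabled and AI table recognition off.
    System.out.println(toJson(0, true, false));
  }
}
```

The returned string can be passed as the second argument of addFormDataPart("parameter", ...) in the upload request.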

 

Java code example to upload PDFs for parsing:

import java.io.*;
import okhttp3.*;
public class Main {
  public static void main(String[] args) throws IOException {
    OkHttpClient client = new OkHttpClient().newBuilder()
      .build();
    // Multipart form: the PDF file plus the task and extraction settings.
    RequestBody body = new MultipartBody.Builder().setType(MultipartBody.FORM)
      .addFormDataPart("file","{{file}}",
                       RequestBody.create(MediaType.parse("application/octet-stream"),
                                          new File("<file>")))
      .addFormDataPart("taskId","{{taskId}}")
      .addFormDataPart("language","{{language}}")
      .addFormDataPart("password","")
      .addFormDataPart("parameter","{  \"type\": 1, \"isAllowOcr\":1, \"isContainOcrBg\":0}")
      .build();
    Request request = new Request.Builder()
      .url("https://api-server.compdf.com/server/v1/file/upload")
      .method("POST", body)
      .addHeader("Authorization", "Bearer {{accessToken}}")
      .build();
    // Print the upload response and close it when done.
    try (Response response = client.newCall(request).execute()) {
      System.out.println(response.body().string());
    }
  }
}
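
Before spending an API request on an upload, it can be worth checking locally that the file is readable and looks like a PDF. PDF files normally begin with the signature "%PDF-", which the sketch below tests for. The class name PdfCheck and the default file name input.pdf are made up for this example. The check is a local convenience and not something the ComPDFKit API requires.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical pre-upload check: does the file start with the PDF signature?
final class PdfCheck {
  private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

  static boolean looksLikePdf(Path file) throws IOException {
    if (!Files.isRegularFile(file) || !Files.isReadable(file)) {
      return false;
    }
    byte[] head = new byte[MAGIC.length];
    try (InputStream in = Files.newInputStream(file)) {
      int read = 0;
      while (read < head.length) {
        int n = in.read(head, read, head.length - read);
        if (n < 0) {
          return false; // the file is shorter than the signature
        }
        read += n;
      }
    }
    return Arrays.equals(head, MAGIC);
  }

  public static void main(String[] args) throws IOException {
    Path file = Paths.get(args.length > 0 ? args[0] : "input.pdf");
    System.out.println(file + " looks like a PDF: " + looksLikePdf(file));
  }
}
```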

 

 

Step5: Process and Extract PDF to Text

 

Execute the task to extract text from the PDF you uploaded. Here is the Java code example:

import java.io.*;
import okhttp3.*;
public class Main {
  public static void main(String[] args) throws IOException {
    OkHttpClient client = new OkHttpClient().newBuilder()
      .build();
    // OkHttp rejects a GET request that carries a body, so use get() instead.
    Request request = new Request.Builder()
      .url("https://api-server.compdf.com/server/v1/execute/start?taskId={{taskId}}&language={{language}}")
      .get()
      .addHeader("Authorization", "Bearer {{accessToken}}")
      .build();
    // Print the response body and close the response.
    try (Response response = client.newCall(request).execute()) {
      System.out.println(response.body().string());
    }
  }
}

 

 

Step6: Get Task Information of PDF Data Extraction

 

Follow the Java code example below to obtain the task information. Replace placeholders such as taskId and accessToken with your own values. The extraction result is delivered as a JSON file, a structured data format that makes the extracted PDF text easy to reuse.

import java.io.*;
import okhttp3.*;
public class Main {
  public static void main(String[] args) throws IOException {
    OkHttpClient client = new OkHttpClient().newBuilder()
      .build();
    // OkHttp rejects a GET request that carries a body, so use get() instead.
    Request request = new Request.Builder()
      .url("https://api-server.compdf.com/server/v1/task/taskInfo?taskId={{taskId}}")
      .get()
      .addHeader("Authorization", "Bearer {{accessToken}}")
      .build();
    // Print the task information and close the response.
    try (Response response = client.newCall(request).execute()) {
      System.out.println(response.body().string());
    }
  }
}
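
If the task has not finished when you first request the task information, you may need to repeat the request. The sketch below is a generic polling loop that assumes the task runs asynchronously, which this article does not state. The class name TaskPoller, the status strings in the demo, and the attempt limit are all made up for illustration. In real use, the fetch argument would send the taskInfo request shown above and return its response body.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper: repeats a request until its result passes a check.
final class TaskPoller {
  // Calls fetch until done accepts the result or maxAttempts is reached.
  static String pollUntil(Supplier<String> fetch, Predicate<String> done,
                          int maxAttempts, long delayMillis) throws InterruptedException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      String body = fetch.get();
      if (done.test(body)) {
        return body;
      }
      if (attempt < maxAttempts) {
        Thread.sleep(delayMillis); // wait before asking again
      }
    }
    throw new IllegalStateException("Task not finished after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) throws InterruptedException {
    // Stand-in responses with made-up status values, used instead of a real request.
    String[] fake = {"{\"status\":\"processing\"}", "{\"status\":\"done\"}"};
    int[] calls = {0};
    String result = pollUntil(
        () -> fake[Math.min(calls[0]++, fake.length - 1)],
        body -> body.contains("\"done\""),
        5, 10);
    System.out.println(result);
  }
}
```

Capping the number of attempts keeps a stuck task from looping forever and from using up the request quota.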

 

 

Conclusion

 

Extracting text data from a PDF in Java can be done efficiently with a dedicated API. This tutorial walked through each step of the process, from getting the license for the Java PDF API to uploading files and extracting the required text.

 

However, text extraction is just one facet of what you can achieve with PDF APIs. These tools also allow you to extract tables, images, and various other elements from PDFs, thereby providing a comprehensive approach to PDF data extraction. This versatility ensures that your applications can handle a wide variety of data formats found within PDF documents.

 

We also provide tutorials for extracting PDF text data with other languages and tools, such as PHP and cURL. Get 1000 free PDF API requests per month now.

 
