How to read PDF files in Java

We often need to read PDF files and extract the text from them. This post shows how to read a PDF file and extract its text in Java.

Assumptions

You have a Maven project set up. If not, quickly browse through How to create a simple Java project in Maven.

We shall be using Apache PDFBox to read PDF files.

Adding Dependency

Add the following snippet to the dependencies section of your pom.xml:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.8</version>
</dependency>

Reading the PDF File

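Let's look at a snippet that uses Apache PDFBox to read the file:

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // PDFBox 1.8.x package

public class PdfReader {

  public static void main(String[] args) throws IOException {
    PDFTextStripper pdfTextStripper = new PDFTextStripper();
    // page numbers are 1-based; the end page is inclusive
    pdfTextStripper.setStartPage(1);
    pdfTextStripper.setEndPage(10);

    PDDocument pdDocument = PDDocument.load(args[0]);
    try {
      // let's print it to console
      Writer writer = new OutputStreamWriter(System.out);
      pdfTextStripper.writeText(pdDocument, writer);
      writer.flush(); // the writer buffers its output, so flush before exiting
    } finally {
      pdDocument.close(); // release the resources held by the document
    }
  }

}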

We create an instance of PDDocument from the PDF file that we want to read. We set a start and end page for simplicity; you can change them as needed. The key here is the built-in PDFTextStripper class, which extracts the text from the PDF document.

The final step is to call the writeText() method. Here we write to the console, so we wrap System.out in an OutputStreamWriter, but any Writer instance will do. That's it: once you run this program, it reads the PDF and dumps the text content to the screen.
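
To illustrate the point that any Writer will do, here is a minimal sketch that uses only the JDK. The method writeSample is a hypothetical stand-in for the writeText() call and is not part of PDFBox; the class and method names are made up for this example. The same call sends its output to the console, to an in-memory String, or to a file, depending only on which Writer is passed in.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriterTargets {

  // Hypothetical stand-in for pdfTextStripper.writeText(pdDocument, writer):
  // it only needs a Writer and does not care where the text ends up.
  static void writeSample(Writer out) throws IOException {
    out.write("text extracted from the PDF");
  }

  // Collects the output in memory and returns it as a String.
  static String toStringTarget() throws IOException {
    StringWriter buffer = new StringWriter();
    writeSample(buffer);
    return buffer.toString();
  }

  // Writes the output to a file; try-with-resources flushes and closes the writer.
  static void toFileTarget(Path file) throws IOException {
    try (Writer writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
      writeSample(writer);
    }
  }

  public static void main(String[] args) throws IOException {
    // Console: OutputStreamWriter buffers, so flush it explicitly.
    Writer console = new OutputStreamWriter(System.out);
    writeSample(console);
    console.write(System.lineSeparator());
    console.flush();

    // In-memory String
    System.out.println(toStringTarget());

    // Temporary file
    Path file = Files.createTempFile("pdf-text", ".txt");
    toFileTarget(file);
    System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
  }
}
```

In the PDF reader, the equivalent change would be to pass a StringWriter or a file-backed Writer to writeText() instead of the OutputStreamWriter around System.out.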
