How to read PDF files in Java

We often need to read PDF files to extract their text. This post explores how to read a PDF file and extract text from it.

Assumptions

You have a Maven project set up. If not, quickly browse through How to create a simple Java project in Maven.

We shall be using Apache PDFBox to read PDF files.

Adding Dependency

Add the Apache PDFBox dependency to the pom.xml
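
A dependency entry along these lines should work. The groupId and artifactId are the ones PDFBox publishes on Maven Central; the version shown is only an assumption, so check for the latest 2.x release before copying it.

```xml
<!-- Goes inside the <dependencies> section of pom.xml -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <!-- Assumed version; use the current 2.x release -->
    <version>2.0.27</version>
</dependency>
```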

Reading the PDF File

Let’s use Apache PDFBox to read the file.

We create an instance of PDDocument from the PDF file that we want to read. We set the start and end page for simplicity; you can set them as needed. The key here is the built-in PDFTextStripper class, which extracts the text from the PDF document.

The final step is to call the writeText() method. Here we write to the console, so we create an OutputStreamWriter, but it can be any Writer instance. That is it: once you run this program, it reads the PDF and dumps the text content on the screen.
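
The steps described above can be sketched roughly as follows. This is a minimal sketch assuming PDFBox 2.x, where PDDocument.load(File) is available (PDFBox 3.x loads documents through Loader.loadPDF instead). The class name PdfTextReader, the helper extractText, the default file name sample.pdf and the page range 1 to 2 are illustrative choices, not part of the original post.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical class name, used only for illustration.
public class PdfTextReader {

    // Writes the text of pages startPage..endPage (1-based, inclusive) to 'out'.
    static void extractText(File pdfFile, int startPage, int endPage, Writer out)
            throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(startPage);
            stripper.setEndPage(endPage);
            // Any Writer works here: console, file, StringWriter, ...
            stripper.writeText(document, out);
            out.flush();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default path; pass a real PDF path as the first argument.
        File pdfFile = new File(args.length > 0 ? args[0] : "sample.pdf");
        Writer console = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
        extractText(pdfFile, 1, 2, console);
    }
}
```

Passing a StringWriter instead of the console writer lets you keep the extracted text in memory, which is handy if you want to process it further rather than print it.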
