How to read web pages using Java?



In earlier posts we saw how to read text files and binary files in Java. But how do we read a remote resource such as a web page? Let’s see how to do it.

Reading a web page

Reading a remote web page is quite similar to the way we read a text file or a binary file. The IO APIs show how powerful the Decorator pattern is: the same readers can wrap any input stream, whether it comes from a file or from a network connection.
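To make the Decorator idea concrete, here is a minimal sketch that reads lines from an in-memory stream using the same wrappers we use for files and web pages. The class name DecoratorDemo and the method readLines are made up for this illustration, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original listing.
public class DecoratorDemo {

  // Accepts any InputStream: a file, a network connection, or bytes in memory.
  public static List<String> readLines(InputStream source) throws IOException {
    List<String> lines = new ArrayList<>();
    // InputStreamReader decorates the byte stream with character decoding (UTF-8 assumed).
    // BufferedReader decorates that reader with buffering and readLine().
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(source, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    InputStream inMemory =
        new ByteArrayInputStream("first line\nsecond line".getBytes(StandardCharsets.UTF_8));
    for (String line : readLines(inMemory)) {
      System.out.println(line);
    }
  }
}
```

Because readLines only depends on InputStream, passing it connection.getInputStream() or a FileInputStream should work the same way.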

Steps:

  • Create an instance of URL
  • Open the connection
  • Get the connection’s input stream
  • Create a BufferedReader and read the content
  • Close the resources

This sounds quite simple. Let’s see the code:

package com.codezuzu.io;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

/**
 * http://www.codezuzu.com
 */
public class NetworkReader {

  public void readFromSite(String url) throws IOException {
    // validate url

    URL siteUrl = new URL(url);
    URLConnection connection = siteUrl.openConnection();

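    // try-with-resources closes the reader (and the underlying stream) even if reading fails.
    // Note: InputStreamReader uses the platform default charset here.
    try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
      String line;
      while ((line = bufferedReader.readLine()) != null) {
        System.out.println(line);
      }
    }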
  }

  public static void main(String[] args) throws IOException {
    NetworkReader networkReader = new NetworkReader();
    networkReader.readFromSite("http://www.codezuzu.com");
  }

}

The details are quite simple. The BufferedReader part is the same as in the other reading programs; only the source of the input stream changes.

One application of this simple code could be fetching pages as part of a larger crawler program.
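For a crawler, it is usually more convenient to get the page back as a String than to print it. Here is a rough sketch of how that might look. The class name PageFetcher, the fetch method, the 10 second timeout and the UTF-8 encoding are all assumptions made for this example; a real crawler would also need to handle redirects, error codes and the encoding declared by the server.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original listing.
public class PageFetcher {

  // Assumed timeout value; tune it for your own crawler.
  private static final int TIMEOUT_MILLIS = 10_000;

  // Returns the whole resource as one String, with lines joined by '\n'.
  public static String fetch(String url) throws IOException {
    URLConnection connection = new URL(url).openConnection();
    // Without timeouts a slow or unresponsive server could block the crawler indefinitely.
    connection.setConnectTimeout(TIMEOUT_MILLIS);
    connection.setReadTimeout(TIMEOUT_MILLIS);

    // UTF-8 is assumed here; the page may actually use another encoding.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
      return reader.lines().collect(Collectors.joining("\n"));
    }
  }

  public static void main(String[] args) throws IOException {
    String page = fetch("http://www.codezuzu.com");
    System.out.println("Fetched " + page.length() + " characters");
  }
}
```

The crawler can then pass the returned String to whatever link extraction or parsing step comes next.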
