Monday, May 16, 2011

Jsoup Java Library for HTML parsing, Implementation and Examples


What problem does it solve?
You have HTML in a Java String, or you need to fetch a web page, and you want to parse that HTML to get at its contents, to make sure it is well formed, or to modify it.
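
To make the "well formed" and "modify" cases concrete, here is a minimal sketch that assumes the jsoup jar is on the classpath. The class name TidyExample, the helper methods and the sample markup are invented for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TidyExample {

    // Parses markup with unclosed tags and returns the body HTML that jsoup rebuilt.
    static String tidy(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().html();
    }

    // Adds a CSS class to the first <p>, if there is one, and returns the body HTML.
    static String markFirstParagraph(String html, String cssClass) {
        Document doc = Jsoup.parse(html);
        Element p = doc.select("p").first();
        if (p != null) {
            p.addClass(cssClass);
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String broken = "<p>Hello <b>world"; // hypothetical input with missing end tags
        System.out.println(tidy(broken));
        System.out.println(markFirstParagraph(broken, "intro"));
    }
}
```

The parser closes the open tags on its own, so the printed markup is balanced even though the input was not.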

Fetching a page from the web:
Method: Jsoup.connect(String url)

This method returns a Connection (org.jsoup.Connection). Calling get() on that Connection fetches the page and returns a Document (org.jsoup.nodes.Document, http://jsoup.org/apidocs/org/jsoup/nodes/Document.html), which is the HTML document parsed into this class. The text, body, head, title and so on of the Document can then be accessed. Some common uses follow.
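
Strictly speaking, Jsoup.connect only prepares the request and hands back a Connection; nothing is downloaded until get() is called. The sketch below shows the two steps separately. The class name FetchExample, the user agent string and the five second timeout are assumptions chosen for illustration.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchExample {

    // Builds the request but does not send it yet.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // assumed value; some sites reject unknown agents
                .timeout(5000);           // milliseconds, an arbitrary choice
    }

    public static void main(String[] args) throws IOException {
        // get() sends the HTTP GET and parses the response into a Document.
        Document doc = prepare("http://example.com/").get();
        System.out.println(doc.title());
    }
}
```

Because get() performs network I/O, it can throw an IOException, which callers have to handle or declare.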

Getting the title of the document:
// get() performs the HTTP request and throws an IOException if it fails
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

Here title() returned a String, the title of the document. Most other lookups do not return a String: when we access content in the body or in the head, the methods return an object of type Element (org.jsoup.nodes.Element), or a collection of them (org.jsoup.select.Elements) in the case of select().
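
The difference between the return types can be seen in a small sketch that parses an in-memory page instead of fetching one. The class name ElementExample and the sample markup are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ElementExample {

    // Sample markup invented for this sketch.
    static final String HTML =
            "<html><head><title>Demo</title></head>"
          + "<body><p id=\"first\">One</p><p>Two</p>"
          + "<a href=\"/about\">About</a></body></html>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        String title = doc.title();             // a plain String
        Element body = doc.body();              // a single Element
        Elements paragraphs = doc.select("p");  // zero or more Elements
        Element link = doc.select("a").first(); // first match, or null if none

        System.out.println(title);              // Demo
        System.out.println(body.tagName());     // body
        System.out.println(paragraphs.size());  // 2
        System.out.println(link.attr("href"));  // /about
    }
}
```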

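Getting the text from an HTML String:

// Jsoup.parse(String) expects the markup itself, not a URL; html is a String holding the page source
Document doc = Jsoup.parse(html);
String title = doc.title();
String text = doc.text();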

Retrieving the metadata (description and keywords):

Document d = Jsoup.connect("http://in.yahoo.com/").get();
// first() returns null when nothing matches, so this assumes the page has both meta tags
Element meta = d.select("meta[name=description]").first();
System.out.println(meta.attr("content"));
meta = d.select("meta[name=keywords]").first();
System.out.println(meta.attr("content"));
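
Not every page declares both tags, and first() returns null when the selector matches nothing. One possible null-safe variant is sketched below. The class name MetaExample, the helper metaContent and the sample markup are hypothetical, and the page is parsed from a String so the example runs without network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetaExample {

    // Returns the content of <meta name="..."> or an empty string when the tag is missing.
    static String metaContent(Document doc, String name) {
        Element meta = doc.select("meta[name=" + name + "]").first();
        return meta == null ? "" : meta.attr("content");
    }

    public static void main(String[] args) {
        // Sample page with a description but no keywords tag.
        String html = "<html><head>"
                + "<meta name=\"description\" content=\"A sample page\">"
                + "</head><body></body></html>";
        Document doc = Jsoup.parse(html);

        System.out.println(metaContent(doc, "description"));        // A sample page
        System.out.println(metaContent(doc, "keywords").isEmpty()); // true
    }
}
```

Returning an empty string is only one choice; returning null or throwing would work equally well, depending on how the caller wants to treat a missing tag.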

Download the jsoup jar (version 1.5.2)
