
Thursday, May 26, 2011

Stemming of words: Porter Stemmer Algorithm

I was searching for a library that could transform words to their root forms, so that words like filling, filled, and fill are all treated as the same.


Definition of stemming: stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form.


A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".
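
To make the idea concrete, here is a minimal sketch of a naive suffix stripper. It is not the Porter algorithm, only a toy illustration of reducing related words to a common stem. The class name NaiveStemmer, the suffix list, and the three-letter minimum are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.List;

// Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
class NaiveStemmer {

    // Hypothetical suffix list, checked in this order.
    private static final List<String> SUFFIXES = Arrays.asList("ing", "ed", "er", "s");

    // Strips the first matching suffix, provided at least three letters remain.
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("filling", "filled", "fill", "fishing", "fished", "fisher", "cats");
        for (String word : words) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

A stripper this simple breaks down quickly: it turns "stemming" into "stemm", for instance, because it knows nothing about doubled consonants. Handling such cases is what a real stemming algorithm is for.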


About the Porter Stemmer Algorithm:
The Porter stemmer was written by Martin Porter and published in the July 1980 issue of the journal Program. It was very widely used and became the de facto standard algorithm for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.


Martin Porter released an official free-software implementation of the algorithm.





public class ASimpleStemmer {

    public static void main(String[] args) {
        // Stemmer is the class from Martin Porter's official Java implementation.
        Stemmer s = new Stemmer();

        // add(char[], int) takes the characters and how many of them to use.
        s.add("abominable".toCharArray(), 10);
        s.stem();
        System.out.println(s.toString()); // Output: abomin

        s.add("abominate".toCharArray(), 9);
        s.stem();
        System.out.println(s.toString()); // Output: abomin
    }
}









Wednesday, May 18, 2011

Example of ConcurrentModificationException


private static void refineMap(BayesianResult bayes, HashMap<String, Token> terms) {
    for (Map.Entry<String, Token> entry : terms.entrySet()) {
        String word = entry.getKey();
        Token t = entry.getValue();
        if ((t.getClean_count() + t.getSens_count()) < 10) {
            // Structural change to the map while its entry set is being iterated.
            terms.remove(word);
        }
    }
}

Here I am modifying the HashMap (terms.remove(word)) while iterating over its entry set. The next step of the iterator detects the structural change and throws: Exception in thread "main" java.util.ConcurrentModificationException
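
One common way around this is to remove entries through the iterator itself, which is allowed during iteration. The sketch below assumes a simplified Token class with only the two counters used above; it is a hypothetical stand-in, not the original class, and the unused BayesianResult parameter is left out.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class SafeRefine {

    // Removes rare terms through the iterator, which is permitted during iteration.
    static void refineMap(Map<String, Token> terms) {
        Iterator<Map.Entry<String, Token>> it = terms.entrySet().iterator();
        while (it.hasNext()) {
            Token t = it.next().getValue();
            if (t.getClean_count() + t.getSens_count() < 10) {
                it.remove(); // removes the entry last returned by next()
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Token> terms = new HashMap<>();
        terms.put("rare", new Token(2, 3));
        terms.put("common", new Token(40, 12));
        refineMap(terms);
        System.out.println(terms.keySet()); // prints [common]
    }
}

// Hypothetical stand-in for the Token class used in the post.
class Token {
    private final int cleanCount;
    private final int sensCount;

    Token(int cleanCount, int sensCount) {
        this.cleanCount = cleanCount;
        this.sensCount = sensCount;
    }

    int getClean_count() { return cleanCount; }

    int getSens_count() { return sensCount; }
}
```

On Java 8 or later, the same filtering can be written as terms.values().removeIf(t -> t.getClean_count() + t.getSens_count() < 10).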


Monday, May 16, 2011

Jsoup Java Library for HTML parsing, Implementation and Examples


What is the issue that it resolves? 
You have HTML in a Java String, or you need to fetch a web page, and you want to parse that HTML to get at its contents, to make sure it's well formed, or to modify it.

Fetching the URL from the web:
Method: Jsoup.connect(String url).get();

Jsoup.connect(String url) returns a Connection, and calling get() on it fetches the page and returns a Document (org.jsoup.nodes.Document) http://jsoup.org/apidocs/org/jsoup/nodes/Document.html , which is the parsed HTML document. The text, body, head, title, etc. of the Document can be accessed. Here are some common usages.

Getting the Title of the Document : 
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

Here the method returned a String, the title of the Document, but in most cases the return value is not a String. When we need to access elements in the body or in the head, the methods return an object of type Element (org.jsoup.nodes.Element), or an Elements collection in the case of select().

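Getting the text from HTML held in a String:

// html is a String of HTML markup; Jsoup.parse(String) does not fetch a URL.
// To load a page from the web, use Jsoup.connect(url).get() instead.
Document doc = Jsoup.parse(html);
String title = doc.title();
String text = doc.text();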

Retrieving the metadata (description and keywords):

Document d = Jsoup.connect("http://in.yahoo.com/").get();
Element meta = d.select("meta[name=description]").first();
if (meta != null) { // first() returns null when nothing matches
    System.out.println(meta.attr("content"));
}
meta = d.select("meta[name=keywords]").first();
if (meta != null) {
    System.out.println(meta.attr("content"));
}

Download the jsoup jar (version 1.5.2)