
Thursday, May 26, 2011

Stemming of words: Porter Stemmer Algorithm

I was searching for a library that could reduce words to their root forms, so that when I encounter words like "filling", "filled", and "fill", they are all treated as the same word.


Definition of stemming: stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, which is generally a written word form.


A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".
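To make the idea concrete, here is a minimal sketch of naive suffix stripping in Java. It is not the Porter algorithm: the class name `NaiveSuffixStripper`, the suffix list, and the minimum stem length are all assumptions chosen for illustration. It only shows how "fishing", "fished", "fish", and "fisher" can collapse to one stem.

```java
import java.util.List;
import java.util.Locale;

public class NaiveSuffixStripper {

    // Hypothetical suffix list for illustration only; this is not Porter's rule set.
    private static final List<String> SUFFIXES = List.of("ing", "ed", "er", "s");

    // Assumed minimum stem length, so that short words like "red" are left alone.
    private static final int MIN_STEM_LENGTH = 3;

    // Removes the first matching suffix if enough of the word remains.
    public static String strip(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            if (lower.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return lower.substring(0, stemLength);
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        for (String w : List.of("fishing", "fished", "fish", "fisher", "cats")) {
            System.out.println(w + " -> " + strip(w));
        }
    }
}
```

A sketch like this breaks down quickly: it turns "stemming" into "stemm" rather than "stem", because it knows nothing about doubled consonants. Handling cases like that is the reason to use a real stemming algorithm.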


About the Porter Stemmer algorithm:
The stemmer was written by Martin Porter and published in the July 1980 issue of the journal Program. It was very widely used and became the de facto standard algorithm for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.


Martin Porter released an official free-software implementation of the algorithm.





public class ASimpleStemmer {

    public static void main(String[] args) {
        // Stemmer is the class from the official implementation mentioned above.
        Stemmer s = new Stemmer();

        // Pass the array length instead of a hard-coded character count.
        char[] word = "abominable".toCharArray();
        s.add(word, word.length);
        s.stem();
        System.out.println(s.toString()); // Output: abomin

        word = "abominate".toCharArray();
        s.add(word, word.length);
        s.stem();
        System.out.println(s.toString()); // Output: abomin
    }
}








