Word Stemming - Martin Porter Algorithm - PHP Class

Word Stemming:

Word stemming is the process of removing suffixes from words to leave a stem or, put simply, 'a common or shared start to a word'.

Why is stemming words useful? 

If you go to a search engine and search for "recent achievements in linguistics", you will probably want the search engine to find all web sites about linguistics and what is achievable as well as what has been achieved.

What if a report or web site contains the words...

Achievable
or
Achieved
or
Achieving

...but not the word Achievements?

As humans, you and I can see straight away that Achievable, Achieved, Achieving and Achievements are related, but a search engine has to be able to work this out for itself.

When stemming is necessary

By applying stemming to the above words we get the pseudo-word 'Achiev'.
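To make the idea concrete, here is a minimal sketch of suffix stripping. It is not the Porter algorithm, which applies a much larger set of rules in several steps. The function name toy_stem and its suffix list are assumptions made up for this illustration, and they only cover the four example words.

```php
<?php
// Toy suffix stripper, for illustration only. This is NOT the Porter
// algorithm: it only knows the endings used in the example words.
function toy_stem($word)
{
    $word = strtolower($word);
    // Hypothetical suffix list, longest first.
    $suffixes = array('ements', 'able', 'ing', 'ed');
    foreach ($suffixes as $suffix) {
        $cut = strlen($word) - strlen($suffix);
        // Only strip when at least three letters would remain as the stem.
        if ($cut >= 3 && substr($word, $cut) === $suffix) {
            return substr($word, 0, $cut);
        }
    }
    return $word; // no known suffix, so the word is returned unchanged
}

foreach (array('Achievable', 'Achieved', 'Achieving', 'Achievements') as $word) {
    echo $word . ' -> ' . toy_stem($word) . "\n";
}
// Prints:
// Achievable -> achiev
// Achieved -> achiev
// Achieving -> achiev
// Achievements -> achiev
```

The sketch lowercases its input, so the stem comes back as 'achiev' rather than 'Achiev'.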

This means that the search engine can find all web sites containing words that stem to 'Achiev'. It can then rank those with the exact match 'achievements' higher than those that only contain 'achievable', 'achieved' or 'achieving'.
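The ranking idea can be sketched roughly as follows. Everything here is an assumption for illustration: rank_documents is a hypothetical helper, the scores 2 and 1 are arbitrary, and demo_stem is a stand-in stemmer that only handles the example endings. A real stemmer could be passed in its place.

```php
<?php
// Stand-in stemmer for the demo only (NOT the Porter algorithm).
function demo_stem($word)
{
    return preg_replace('/(ements|able|ing|ed)$/', '', $word);
}

// Hypothetical helper: returns the names of matching documents, best first.
// $stem is any callable that maps a lowercase word to its stem.
function rank_documents($query, $documents, $stem)
{
    $query = strtolower($query);
    $queryStem = call_user_func($stem, $query);
    $scores = array();
    foreach ($documents as $name => $text) {
        $score = 0;
        foreach (str_word_count(strtolower($text), 1) as $word) {
            if ($word === $query) {
                $score = max($score, 2); // exact match ranks highest
            } elseif (call_user_func($stem, $word) === $queryStem) {
                $score = max($score, 1); // shares the stem only
            }
        }
        if ($score > 0) {
            $scores[$name] = $score;     // documents with no match are dropped
        }
    }
    arsort($scores);                     // highest score first, keys preserved
    return array_keys($scores);
}

$documents = array(
    'report_b' => 'A survey of what is achievable today',
    'report_a' => 'Recent achievements in linguistics',
    'report_c' => 'Cooking recipes for beginners',
);
print_r(rank_documents('achievements', $documents, 'demo_stem'));
// report_a (exact match) comes first, then report_b (stem match only);
// report_c is not returned at all.
```

Passing the stemmer in as a callable keeps the ranking logic separate from whichever stemming implementation is used.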

While this example illustrates a very simplistic algorithm for finding and ranking search results, it does point you in the right direction to start researching word stemming.

Resources for Word Stemming

Martin Porter created the original Porter Stemming algorithm back in 1979. He provides source code in ANSI C, along with links to implementations of the algorithm in other languages.

Richard Heyes has implemented the Porter Stemming algorithm as a class that works with PHP 5 only.

Because I still primarily code in PHP 4, I have retrofitted Richard's class for PHP 4.

View my Porter Stemmer for PHP 4

Note that my PHP 4 implementation cannot be called statically, i.e. you have to create an object before calling the stem function (see source code).

Enjoy

Dean Layton-James
CEO Layton-James Associates Ltd