MTLookup v2 Improvements: Stemming

09/15/2005

A few days ago, a new version of MTLookup was released. Please read »Second version of MTLookup released« for the announcement. I will describe the new features in several posts, which will be published here in the Movable Type Weblog.

Today, I want to tell you about stemming and how MTLookup benefits from it.

The Problem

If you have used the first version of MTLookup, you may have run into this problem yourself: you had to write the search term exactly as the author wrote it.

If the author only used the word »template«, but you searched for »templates«, you would not hit that article. There was only one thing you could do: try different variations such as...

  • template style
  • templates style
  • template styles
  • templates styles

Of course, nobody did this. As a result, articles were sometimes not found, or they were ranked incorrectly because their score for a keyword was split across its variations.
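To make the split visible, here is a minimal sketch in Python. It is not MTLookup's code: the function `exact_term_counts` and the article text are invented for illustration, and I assume a very simple index that counts words exactly as they were written.

```python
from collections import Counter

def exact_term_counts(text):
    """Count lowercase, whitespace-separated words exactly as written."""
    return Counter(text.lower().split())

# Hypothetical article text, invented for illustration.
article = "Edit the template and then rebuild all templates using the template editor"

counts = exact_term_counts(article)

# The article is about one topic, but its weight is split over two keys.
print(counts["template"])
print(counts["templates"])

# A variation the author never wrote gets no hit at all.
print(counts["templating"])
```

A search for »templates« only sees a part of what the article says about templates, and a search for a variation the author never used sees nothing.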

The Solution

Fortunately, a solution exists. It is a technique called stemming: each word is reduced to a base form before it is put into the index, and only this base form is stored. With this mapping, several words (for example the singular and plural variations) are mapped to the same result.

There are several stemming algorithms. The best-known algorithm was developed by Martin Porter and is documented in The Porter Stemming Algorithm.

Let us look at an example. The following sentence...

MTLookup, the search engine for Movable Type related information, has been released

will be mapped to...

mtlookup the search engin for movabl type relat inform has been releas

You can see, for example, that the word »released« is mapped to »releas«. Other variations of this word (release, releases, releasing, ...) would be mapped to the same base form.
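The idea can be sketched in a few lines of Python. Please note that this is not the Porter algorithm and not the code that MTLookup uses: it is a deliberately naive suffix stripper with a handful of rules that I chose for this illustration. The names `naive_stem`, `SUFFIXES` and `MIN_STEM_LENGTH` are my own. A real stemmer has many more rules and conditions, so its output will differ for many words.

```python
# Assumption: a tiny, hand-picked rule set. The first matching suffix wins.
SUFFIXES = ("ing", "ed", "es", "s", "e")
MIN_STEM_LENGTH = 3  # do not strip if the remaining stem would be shorter

def naive_stem(word):
    """Strip at most one known suffix from the lowercased word."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word

variations = ["release", "releases", "released", "releasing"]
stems = {naive_stem(word) for word in variations}
print(stems)  # one shared stem for all four variations

# Singular and plural end up under the same key as well.
print(naive_stem("template") == naive_stem("templates"))
```

The minimum stem length keeps very short words such as »has« or »the« untouched in this sketch.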

MTLookup and Stemming

When MTLookup indexes a website, it reads each webpage. However, the webpage's text is not stored in its original form. Instead, it is modified by stemming.

In the very same way, a search phrase that has been entered by a user is also stemmed. Then the stemmed version of the search phrase is compared to the stemmed version of the webpage's text.

As a result, you do not have to consider variations of a word. MTLookup will find them automatically.
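The two steps, stemming at index time and stemming at query time, can be sketched as follows. This is a minimal sketch under my own assumptions and not MTLookup's implementation: the index is a plain in-memory dictionary, `naive_stem` is a simplified stand-in for a real stemmer such as Porter's, and the pages and all function names are hypothetical.

```python
import re
from collections import defaultdict

def naive_stem(word):
    """Simplified stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stemmed_terms(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [naive_stem(word) for word in re.findall(r"[a-z0-9]+", text.lower())]

def build_index(pages):
    """Map each stemmed term to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in stemmed_terms(text):
            index[term].add(url)
    return index

def search(index, phrase):
    """Return the pages that contain every stemmed term of the phrase."""
    terms = stemmed_terms(phrase)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))

# Hypothetical pages, invented for illustration.
pages = {
    "/a.html": "How to edit a template style",
    "/b.html": "Installing plugins",
}
index = build_index(pages)

# The page only says "template style", but the plural query still hits it.
print(search(index, "templates styles"))
```

The important design point is that the page text and the search phrase go through the same `stemmed_terms` function. If they were stemmed differently, the keys would no longer match.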

Related Articles

The new version of MTLookup is described on the Movable Type Weblog. These articles are...

If you want to try it out, use MTLookup.

mgs | 09/15/2005
