MTLookup v2 Improvements: Website Indexing
09/14/2005
A few days ago, a new version of MTLookup was released; see "Second version of MTLookup released" for the announcement. I will describe the new features in a series of posts here on the Movable Type Weblog.
Today, I want to explain website indexing and show how new websites can be included in MTLookup.
First Release
When the first version of MTLookup was released in June, just three websites were spidered. These were my own Movable Type Weblog, Elise's Learning Movable Type, and Arvind's Movalog. I talked to both Elise and Arvind before the first version was developed. Both looked at pre-release versions and gave important suggestions.
Initially, the MTLookup index contained about 250 articles.
The XML File
MTLookup is based on a central component, the so-called MTLookupBot. It is responsible for spidering a website, collecting all important information, and storing the data in a MySQL database. I wanted that component to behave in a specific manner, and this turned out to be quite difficult. There were two major problems...
- Not every article found on a website should be stored in the index. Some are unrelated to Movable Type and should be ignored, and those that are included should be categorized.
- Extracting a good excerpt is an important task. You probably know the problem yourself: when you search with one of the big search engines, the result lists are not always perfect. Sometimes they show just a few words joined by three dots. Sometimes they even show "noise words" that belong to the webpage's navigation or some other unimportant content.
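To make the excerpt problem concrete, here is a minimal sketch in Python. It is not the MTLookupBot code, which is not shown in this article. It assumes that noise can be recognized by the enclosing HTML tag, and the tag list and the function name extract_excerpt are invented for illustration.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is navigation or other noise.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "form"}


class TextCollector(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_excerpt(html, max_chars=200):
    """Return the leading visible text, cut at a word boundary."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    text = " ".join(" ".join(parser.chunks).split())
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars].rsplit(" ", 1)[0]
    return cut + "..."
```

A real spider would need more than this, for example recognizing sidebars that are marked up with plain div elements, but the sketch shows why navigation text has to be filtered before the excerpt is cut.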
I thought that some additional information from the article's author might help. With the help of a Movable Type Index Template, the author would create an XML file describing all interesting articles. Because the author can access the article before the final webpage is built, he can export a good excerpt, for example by using the Movable Type »MTEntryExcerpt« tag.
A specification, "MTLookup: How to get Indexed", was written and published.
However, as development of the MTLookupBot proceeded, this component became better and better. It is now able to spider an entire website, choose those articles that are interesting for the Movable Type community, extract a good excerpt, and categorize the article.
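How the MTLookupBot decides which articles are interesting, and how it categorizes them, is not described here. One simple way to approach the task is keyword matching, sketched below in Python. The keyword lists, the category names, the threshold, and the function name classify are all hypothetical.

```python
import re

# Hypothetical category keywords; not the criteria the MTLookupBot uses.
CATEGORY_KEYWORDS = {
    "Templates": {"template", "templates", "mtentries", "mtentryexcerpt"},
    "Plugins": {"plugin", "plugins"},
    "CSS": {"css", "stylesheet", "stylesheets"},
}

# Words that suggest an article is related to Movable Type at all.
RELEVANCE_KEYWORDS = {"movable", "mt", "typekey", "weblog"}.union(
    *CATEGORY_KEYWORDS.values()
)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def classify(text, min_hits=2):
    """Return a category name, or None if the article looks unrelated."""
    words = tokenize(text)
    hits = sum(1 for w in words if w in RELEVANCE_KEYWORDS)
    if hits < min_hits:
        return None  # too few relevant words: ignore the article
    scores = {
        category: sum(1 for w in words if w in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "General"
```

With these lists, an article about holiday photos returns None and is skipped, while an article that mentions "plugin" several times lands in the hypothetical "Plugins" category.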
Today, MTLookup no longer needs the XML file. Of course, an author will always be able to describe an article better than a program can. However, if the author does not want to take on that task, the MTLookupBot is able to deliver a good result.
Currently, 20 websites are spidered. Over the past weeks, I talked with most of their authors and showed them the generated output. None of those websites is indexed with the help of an XML file. The MTLookupBot does the job on its own by reading and analyzing the website's pages.
Websites
After the first version of MTLookup had been released, several suggestions for improvement were made. The one mentioned most often was "more websites, more articles".
With this version of MTLookup, there are now 20 websites with more than 2000 articles. There are three groups of websites...
- Tutorials: this group is made up of 10 websites with about 700 articles, written by users of Movable Type and giving technical information.
- Six Apart: there are 7 websites with about 1400 articles from the company that developed Movable Type. This list includes the Movable Type 3.2 documentation, the knowledgebase, the plugin directory and the ProNet weblog.
- CSS: users working with Movable Type also have questions regarding the use of CSS. So I included 3 websites with about 200 articles.
The websites will be spidered on a regular basis. New articles will be inserted into MTLookup, and modified articles will be updated.
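The insert-or-change step of a repeated spider run can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module in place of the MySQL database mentioned above, and the table layout, the content digest, and the function names are assumptions rather than the real MTLookup schema.

```python
import hashlib
import sqlite3


def open_index(path=":memory:"):
    """Open the index (SQLite stand-in) and create the table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, excerpt TEXT, digest TEXT)"
    )
    return conn


def store_article(conn, url, title, excerpt):
    """Insert a new article or change a modified one.

    Returns 'inserted', 'updated' or 'unchanged'.
    """
    # A digest of the content lets us detect modifications cheaply.
    digest = hashlib.sha256((title + "\n" + excerpt).encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT digest FROM articles WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO articles (url, title, excerpt, digest) "
            "VALUES (?, ?, ?, ?)",
            (url, title, excerpt, digest),
        )
        status = "inserted"
    elif row[0] != digest:
        conn.execute(
            "UPDATE articles SET title = ?, excerpt = ?, digest = ? "
            "WHERE url = ?",
            (title, excerpt, digest, url),
        )
        status = "updated"
    else:
        status = "unchanged"
    conn.commit()
    return status
```

Calling store_article once per spidered page on every run keeps one row per URL: pages seen for the first time are inserted, pages whose title or excerpt changed are rewritten, and everything else is left alone.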
Why isn't »your favourite website« included?
No website has been excluded intentionally. I am rather new to Movable Type, having bought the licence some months ago. So chances are high that I missed some interesting websites.
If you know of a website, or even maintain one yourself, that might be interesting for the Movable Type community, please let me know. Use the email address from Contact.
How to be Included?
If you want MTLookup to spider your website, there won't be any work on your side. As I no longer recommend using the XML file, you just have to tell me a URL, and the MTLookupBot will start its job.
Related Articles
The new version of MTLookup is described on the Movable Type Weblog. These articles are...
- Second version of MTLookup released
- MTLookup v2 Improvements: Website Indexing
- MTLookup v2 Improvements: Stemming
- MTLookup v2 Improvements: Persistent User Data
- MTLookup v2 Improvements: Query Feedback
- MTLookup v2 Improvements: This and That
If you want to try it, use MTLookup.
mgs | 09/14/2005

