FAQ: Special Characters appear as Garbage in the Entries
Question
When looking at my generated webpages, some of these pages contain characters that look really strange. When looking at the same text in Movable Type, everything looks fine.
Answer
The problem occurs rather often in varying situations. Most of the time it has something to do with the character set. Let us first have a look at the basics...
Character Set
Anything that is stored in electronic form is saved as a series of 0s and 1s. Usually, these are combined into groups of eight (bytes), each of which represents a number ranging from 0 to 255. Even a text, such as the one that you are currently reading, is stored as a sequence of numbers.
A charset defines which character belongs to which number. For example, in most popular charsets the number 65 stands for a capital A, so when the computer sees the number 65, it knows that a capital A has to be output to the screen.
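The mapping between numbers and characters can be tried out with a few lines of Python. This is only a small sketch for illustration; Python is not part of Movable Type, and the sample word is arbitrary.

```python
# The number 65 stands for a capital "A" in ASCII, ISO-8859-1 and UTF-8 alike.
raw = bytes([65])
for charset in ("ascii", "iso-8859-1", "utf-8"):
    print(charset, "->", raw.decode(charset))

# The reverse direction: a text is stored as a sequence of numbers.
numbers = list("Movable".encode("iso-8859-1"))
print(numbers)
```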
There are several charsets. Some well-known charsets, which play a role with Movable Type, are "iso-8859-1" and "utf-8".
Several charsets have a common base: there are many characters which are mapped identically in different charsets. This is true for most "normal characters". However, special characters (some punctuation marks, accented characters, umlauts, etc.) are mapped differently. In iso-8859-1, for instance, the German umlaut "ä" is represented as 228. In another charset, however, the number 228 may stand for a completely different character.
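The umlaut example can be checked with a short Python sketch. The second charset, cp437 (the old DOS code page), is chosen here only as an illustration of a charset that maps 228 differently; it plays no role in Movable Type.

```python
# Interpret the same number, 228, according to two different charsets.
raw = bytes([228])
decoded = {charset: raw.decode(charset) for charset in ("iso-8859-1", "cp437")}
for charset, char in decoded.items():
    print(charset, "->", char)

# In UTF-8 the umlaut is not stored as a single number at all,
# but as a sequence of two numbers.
utf8_numbers = list("ä".encode("utf-8"))
print(utf8_numbers)
```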
Now, if some text is published in one charset, but the client browser displays it as if it were from a different charset, garbage characters appear in your text.
PublishCharset
When publishing a website with the help of Movable Type, there is one important setting in mt.cfg (with Movable Type 3.18 or earlier) or mt-config.cgi (with Movable Type 3.2). It is the PublishCharset setting. By default, this is set to UTF-8 (or Shift_JIS for Japanese users).
This setting determines the encoding that Movable Type uses when rebuilding the entries.
Meta Content Type
Each HTML page should have a meta tag in the head section that tells the browser about the encoding of the webpage. For UTF-8 the following should be used...
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
It tells the browser that the content of the webpage has been encoded with the UTF-8 charset. So if the browser sees the number 65, it knows that a capital A has to be shown.
Problem 1: Different Settings
Suppose that you tell Movable Type with the help of the PublishCharset setting to publish in the UTF-8 charset. However, your templates do not contain the above meta tag. Instead they read...
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
Of course, this might result in problems: you encode your text in one charset, but you tell your reader's browser to display it in another charset.
With large portions of the text, no problems will be visible. This happens because many characters are encoded identically in ISO-8859-1 and in UTF-8. However, with special characters the problems can be seen: there will be garbage where readable text should be.
If you look at such a page with your browser, you can easily demonstrate the problem. Most browsers have a menu command for switching the charset. Effectively, you tell the browser "I know that the server told you to display the text as ISO. But I want you to show the text as if it were UTF". If you look at such a page and switch the charset, the garbage characters will suddenly toggle to the correct characters.
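The effect, and the fix of switching the charset in the browser, can be reproduced in Python. This is a minimal sketch of what happens to the bytes; the sample sentence is made up, and the browser's behaviour is only simulated by decoding the bytes with the respective charset.

```python
text = "Käse und Brötchen"

# Movable Type publishes the entry as UTF-8 ...
published = text.encode("utf-8")

# ... but the meta tag makes the browser read the bytes as ISO-8859-1.
displayed = published.decode("iso-8859-1")
print(displayed)  # the umlauts turn into two garbage characters each

# Switching the browser's charset to UTF-8 reads the same bytes correctly.
repaired = displayed.encode("iso-8859-1").decode("utf-8")
print(repaired)

# "Normal" characters are stored identically in both charsets,
# which is why most of the text looks fine.
plain = "Plain text"
same_bytes = plain.encode("utf-8") == plain.encode("iso-8859-1")
print(same_bytes)
```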
Always watch out that both settings refer to the same charset!
Problem 2: Server not configured correctly
Once in a while, both settings (PublishCharset and meta tag) are correct. Nevertheless, the page is not displayed correctly in the browser.
This might be caused by a problem with your web server, as it does not serve the pages correctly. A typical cause is that the server sends its own charset in the HTTP Content-Type header, which browsers generally give precedence over the meta tag in the page. You have to contact the server administrator for that. They might have to change some internal settings. The exact procedure depends on whether this is a UNIX or a Windows server, and which web server software is used.
You can demonstrate this kind of problem by proceeding as follows: download an HTML file which shows the problems to your local PC. Then open the HTML file in your browser (e.g. by double-clicking it). If the problems are gone, your server admin should now be convinced that there is an issue with the server not serving the HTML files correctly.
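One possible server-side cause is a charset in the HTTP Content-Type header that disagrees with the meta tag of the page. The following Python sketch compares the two. The function names and the sample header value are hypothetical and only serve the illustration; in practice you would take the header from your browser's developer tools or from the server response.

```python
import re


def charset_from_header(content_type):
    """Return the charset named in a Content-Type value, or None."""
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html):
    """Return the charset named in the page's meta Content-Type tag, or None."""
    match = re.search(
        r"<meta[^>]+http-equiv\s*=\s*[\"']Content-Type[\"'][^>]*>",
        html,
        re.IGNORECASE,
    )
    return charset_from_header(match.group(0)) if match else None


# Hypothetical example: the server announces ISO-8859-1, the page says UTF-8.
header = "text/html; charset=ISO-8859-1"
page = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>'

header_charset = charset_from_header(header)
meta_charset = charset_from_meta(page)
if header_charset and meta_charset and header_charset != meta_charset:
    print("Mismatch: server says", header_charset, "but the page says", meta_charset)
```

A locally opened file has no HTTP header, which is why the problem disappears when the downloaded page is opened by double-clicking it.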
Problem 3: Copy / Paste
Many users write their entries with a word processor (such as Microsoft Word), then copy the text, and paste it into the Movable Type edit textbox. This technique can create problems: the word processor also uses a certain encoding, and if it does not match your publishing encoding, you will see garbage again.
However, this problem is slightly different. Most of the time, you will already see the garbage as soon as the text has been pasted into the edit textbox; there is no need to rebuild first.
Sometimes the problems can be solved by not directly copying the text from the word processor into Movable Type. Use a simple editor such as Notepad as a temporary store: copy the text in your word processor, paste it into Notepad, select the text in Notepad, copy it again, and then paste it into the Movable Type edit textbox.
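Pasted text often causes trouble because word processors insert typographic characters (curly quotes, long dashes, ellipses) that do not exist in a charset such as iso-8859-1. The following Python sketch is an alternative to the Notepad detour, not a description of what Notepad does: it checks whether a text fits into a charset and replaces some typographic characters with plain ones. The replacement list is an assumption and certainly incomplete, and the helper names are hypothetical.

```python
# Typographic characters that word processors commonly insert
# (an assumed, incomplete list) and their plain replacements.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # long dashes
    "\u2026": "...",                # ellipsis
}


def fits_charset(text, charset):
    """Return True if every character of text exists in the charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def simplify(text):
    """Replace the typographic characters listed above with plain ones."""
    return text.translate(str.maketrans(REPLACEMENTS))


pasted = "\u201cHello\u201d \u2013 it\u2019s fine\u2026"
print(fits_charset(pasted, "iso-8859-1"))
cleaned = simplify(pasted)
print(cleaned, fits_charset(cleaned, "iso-8859-1"))
```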
mgs
Feedback is welcome!
What do you think about this entry? Was it interesting or boring? I would like to hear your comments. If the text was helpful, please consider setting a link to http://www.movable-type-weblog.com/.
Comment
Tom Keating | January 19, 2006 09:52 PM
great tips on character issues.
Have you seen this issue?
You post an entry which has two consecutive spaces which are converted into the weird A with a caret on top.
For example if I put this in the MT edit window:
"I went to the zoo.  Hello"
it would look like
"I went to the zoo. A Hello" (can't do the exact A with a caret here)
Look at all the weird A's here that occur after 2 spaces:
The really strange part is that if I go back and resave the entry it's fine (removes the A character and makes it 2 spaces). I'm wondering if it only happens with posts I set to post in the future which then requires my mt-rebuild script to run. Still, that shouldn't matter. Very odd.
Suggestions?
Comment
Michael G. Schneider | January 20, 2006 11:30 AM
You should find out whether this only happens with future posts. If it only happens with those posts, you should look at the mt-rebuild script.
Comment
iswimp | May 16, 2006 06:32 PM
I have a similar problem, but not really the same. When I type a Swedish character into MT and save, it turns up as a question mark (?) once it has been saved, and of course the same happens if I publish it. But I've checked through this FAQ and I have the iso charset enabled.

