This word intentionally left corrupted

Update 12-sep-11: an interesting article on internationalization appeared on Google’s Webmaster Central. It covers the same issues as this article and addresses some related matters from another perspective. It’s worth a read.

Background

So what is character encoding? Character encoding is about how characters are stored in memory using unique bit patterns – and that is far more complex than it seems to be at first sight…

encodings
There are numerous schemes that define the character ‘meaning’ of a string of bits. Examples of widely used encoding schemes are Windows-1252, ISO-8859-1 and UTF-8, but there are dozens of others. Each of these schemes defines which letters, digits, special characters and control characters it supports and which bit pattern to use for storing each of them. Some schemes support just a limited character set, whereas others support a very wide range of characters.

What we have, then, is a good number of schemes that each define some set of characters and how those are represented as bit patterns. Alas, the same bit pattern doesn’t always represent the same character, and a bit pattern or character that exists in one character set can be missing from another.
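To make this concrete, here is a minimal Python sketch that stores the same character under the three schemes named above. The helper name `encode_in_schemes` is made up for this illustration and is not part of WordPress or any library; it only uses Python's built-in codecs.

```python
def encode_in_schemes(char, schemes=("windows-1252", "iso-8859-1", "utf-8")):
    """Return {scheme: stored bytes as hex, or None if the scheme lacks the character}."""
    result = {}
    for scheme in schemes:
        try:
            result[scheme] = char.encode(scheme).hex()
        except UnicodeEncodeError:
            # The character simply does not exist in this scheme.
            result[scheme] = None
    return result


for char in ("A", "é", "€"):
    print(repr(char), encode_in_schemes(char))
```

A plain "A" gets the same byte in all three schemes, "é" is one byte in the two Western European schemes but two bytes in UTF-8, and "€" does not exist in ISO-8859-1 at all.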

Now it’s easy to see what happens if we mix up character sets…
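The mix-up can be reproduced in a few lines of Python. This is only a sketch: the helper `misread` is a hypothetical name, and the word "café" is just sample data. It writes text with one encoding and reads the bytes back with another.

```python
def misread(text, written_as, read_as):
    """Store text with one encoding, then read the bytes back with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


# UTF-8 data read as Windows-1252: each accented letter turns into two odd characters.
once = misread("café", "utf-8", "windows-1252")

# Saving that wrong result as UTF-8 and misreading it again makes it worse.
twice = misread(once, "utf-8", "windows-1252")

# The other way round: Windows-1252/ISO-8859-1 data read as UTF-8 loses the character.
lost = misread("café", "iso-8859-1", "utf-8")

print(once, twice, lost)
```

The first two cases still contain the original information in scrambled form; in the third case the accented letter has been replaced by a placeholder and cannot be recovered from the text alone.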

Causes

Character sets can be mixed up more easily than you might expect. This is partially caused by the fact that various tools (e.g. Notepad) are not character encoding aware. Texts are copied and modified using these tools, unaware of any underlying character set. As soon as these texts are entered into a system that is encoding aware, issues might start to become visible.

Another important cause of encoding issues is a faulty website installation. If you know in advance that your site will contain a lot of exotic characters (from a Western European perspective, that is), be sure to hire someone who knows about character encoding. Preventing corrupted characters from appearing is a lot easier than repairing them afterwards!

Character encoding often gets corrupted when a database is copied from one computer to another. So pay special attention in this sort of situation, e.g. when moving from one host to another.

 

Prevention

Prevention begins with a proper installation of your WordPress website, including WordPress’s own settings, the (MySQL) database settings and the server settings of your host. You need to pick the character encoding and collation (a collation can be regarded as the sort order for a character set) that support the characters you need; UTF-8 will almost certainly do just that. So go for UTF-8 with collation utf8_general_ci.

You can check these settings:

  • In your WP admin check under Settings | Reading that “Encoding for pages and feeds” is set to UTF-8.
  • In your WP configuration file wp-config.php you need this declaration:

[php]/** Database Charset to use in creating database tables. */
define('DB_CHARSET', 'utf8');

/** The Database Collate type. Don't change this if in doubt. */
define('DB_COLLATE', '');[/php]

  • Your database and its individual tables must have utf8_general_ci defined as collation.
  • Your theme must add this line to the <head> section of your pages:

[html]<meta charset="UTF-8" />[/html]
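
The last check in the list above can be roughly automated. The Python sketch below is an illustration under assumptions, not a WordPress tool: the function names `declared_charset` and `check_page` are made up, and you have to supply the raw bytes of a page yourself (for example from a saved copy). It reports which charset the page declares in a meta tag and whether the bytes really are valid UTF-8.

```python
import re

# Matches both <meta charset="UTF-8" /> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> form.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_-]+)", re.IGNORECASE
)


def declared_charset(page_bytes):
    """Return the charset named in a meta tag (lower case), or None if absent."""
    match = _META_CHARSET.search(page_bytes)
    return match.group(1).decode("ascii").lower() if match else None


def check_page(page_bytes):
    """Return (declared charset, whether the bytes are valid UTF-8)."""
    try:
        page_bytes.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return declared_charset(page_bytes), valid_utf8
```

A page that declares UTF-8 but fails the validity check was most likely saved or served in another encoding. Note that passing the check does not prove the text is free of earlier corruption; it only shows that declaration and bytes agree.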

Apart from a proper installation, the best safeguard against corrupted characters is to type texts into the WordPress editor yourself, that is, without using any copy / paste functions. If you need to enter special characters, use the appropriate function, accessible via the button with the Ω (omega) symbol.

If you do paste, I normally advise using one of the functions behind the clipboard buttons (T or W) to get the text into the editor. Apart from giving you a better chance that your character encoding won’t get corrupted, this extra step also stops you from accidentally pasting all sorts of markup into your text. That markup won’t be visible in the editor (unless you change to HTML view), but it will be all the more visible on the front end of your website!

Repair

The easiest way to correct corrupted data is to go through all the pages of your website and ‘just’ re-enter that data. If your website is too large to go through manually, you can try to have it corrected automatically. However, chances are high that fully automatic correction is impossible (since no program can decide what a character should have been once it is corrupted), and you will have to rely on a mixed approach.
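
As an illustration of what automatic correction can and cannot do, here is a minimal Python sketch. It assumes one specific and common kind of damage, namely UTF-8 text that was read as Windows-1252 exactly once; the function name `repair_mojibake` is made up for this example. Anything that does not fit that assumption is left untouched and flagged for manual correction.

```python
def repair_mojibake(text):
    """Try to undo 'UTF-8 read as Windows-1252'. Returns (text, was_repaired)."""
    try:
        # Reverse the wrong step: get the original bytes back, then read them as UTF-8.
        repaired = text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or no damage at all): leave it for a human.
        return text, False
    return repaired, repaired != text


for sample in ("cafÃ©", "café", "caf\ufffd", "plain text"):
    print(repr(sample), repair_mojibake(sample))
```

Text that is already correct passes through unchanged, and so does text in which a character has been replaced by a placeholder: that information is gone, which is why a mixed approach remains necessary. Try something like this on a copy of your data first, since a sequence that looks corrupted can occasionally be intended.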

Lastly: don’t forget to have your website checked and corrected to prevent any new issues from arising!

If you want to try repairing the data yourself, start by reading this article on converting WordPress database character sets.