Monday, 5 November 2012

Doing RSS right (3) - character encoding

OK, I promise I'll shut up about RSS after this posting (and my previous two).

This posting is about one final problem in including text from RSS feeds, or Atom feeds, or almost anything else, into web pages. The problem is that text is made up of characters, and 'characters' are an abstraction that computers don't understand. What computers ship around (across the Internet, in files on disk, etc.) while we are thinking about characters are really numbers. To convert between a sequence of numbers and a sequence of characters you need some sort of encoding, and the problem is that there are lots of these and they are all different. In theory, if you don't know the encoding you can't do anything with number-encoded text. However, most of the common encodings use the numbers from the ASCII encoding for common letters and other symbols. So in practice a lot of English and European text will come out right-ish even if it's being decoded based on the wrong encoding.

But once you move away from the characters in ASCII (A-Z, a-z, 0-9, and a selection of other common ones) to the slightly more 'esoteric' ones -- pound sign, curly open and close quotation marks, long typographic dashes, almost any common character with an accent, and any character from a non-European alphabet -- then all bets are off. We've all seen web pages with strange question mark characters (like this �) or boxes where quotation marks should be, or with funny sequences of characters (often starting Â) all over them. These are all classic symptoms of character encoding confusion. It turns out there's a word to describe this effect: 'Mojibake'.
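As a sketch of how that 'Â' debris comes about, here's what happens in Python if you take the UTF-8 bytes for a pound sign and decode them as if they were ISO-8859-1 (Latin-1):

```python
# The pound sign is a single character, but in UTF-8 it is two bytes.
utf8_bytes = "£".encode("utf-8")
print(utf8_bytes)  # b'\xc2\xa3'

# Decode those same bytes using the wrong encoding (Latin-1 maps
# each byte to one character) and mojibake appears.
wrong = utf8_bytes.decode("iso-8859-1")
print(wrong)  # Â£
```

The spurious 'Â' is just the 0xC2 lead byte of the UTF-8 sequence being shown as a character in its own right, which is why mis-decoded UTF-8 text so often sprouts that particular letter.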

Now I'm not going to go into detail here about what the various encodings look like, how you work with them, how you can convert from one to another, etc. That's a huge topic, and in any case the details will vary depending on which platform you are using. There's what I think is a good description of some of this at the start of chapter 4 of 'Dive Into Python 3' (and this applies even if you are not using Python). But if you don't like this there are lots of other similar resources out there. What I do want to get across is that if you take a sequence of numbers representing characters from one document and insert those numbers unchanged into another document then that's only going to work reliably if the encodings of the two documents are identical. There's a good chance that doing this wrong may appear to work as long as you restrict yourself to the ASCII characters, but sooner or later you will hit something that doesn't work.

What you need to do to get this right is to convert the numbers from the source document into characters according to the encoding of your source document, and then convert those characters back into numbers based on the encoding of your target. Actually doing this is left as an exercise for the reader.
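In Python the two conversions are just a decode followed by an encode. A minimal sketch, assuming a UTF-8 source document and an ISO-8859-1 target:

```python
# Numbers (bytes) as received from the source document, known to be UTF-8.
source_bytes = "price: £5".encode("utf-8")   # b'price: \xc2\xa35'

# Step 1: numbers -> characters, using the SOURCE document's encoding.
text = source_bytes.decode("utf-8")

# Step 2: characters -> numbers, using the TARGET document's encoding.
target_bytes = text.encode("iso-8859-1")     # b'price: \xa35'
print(target_bytes)
```

Note that the pound sign's number changes between the two byte sequences; copying the source bytes straight into the target document is exactly the mistake this two-step conversion avoids. The encode step will raise an error if the target encoding simply can't represent a character, which is at least an honest failure rather than silent corruption.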

If your target document is an HTML one then there's an alternative approach. In HTML (and XML come to that) you can represent almost any character using a numeric character entity based on the Universal Character Set from Unicode. If you always represent anything not in ASCII this way then the representation of your document will only contain ASCII characters, and these come out the same in most common encodings. So if someone ends up interpreting your text using the wrong encoding (and that someone could be you if, for example, you edit your document with an editor that gets character encoding wrong) there's a good chance it won't get corrupted. You should still clearly label such documents with a suitable character encoding. This is partly because (as explained above) it is, at least in theory, impossible to decode a text document without this information, but also because doing so helps to defend against some other problems that I might describe in a future posting.
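Python's codec machinery can produce these numeric character entities for you via the `xmlcharrefreplace` error handler, so a sketch of the ASCII-only approach looks like this:

```python
text = "\u201csmart quotes\u201d and a £ sign"

# Encode to ASCII, replacing anything that won't fit with an
# HTML/XML numeric character reference (&#NNNN;).
ascii_html = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_html)
# &#8220;smart quotes&#8221; and a &#163; sign
```

The resulting string contains only ASCII characters, so it survives being read under almost any common encoding; a browser turns the entities back into the intended characters when it renders the page.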

3 comments:

  1. Unfortunately this lovely world of character encoding also bleeds into identity management. We just encountered a problem when users used 8-bit ASCII characters (e.g. £) in their passwords with our login gateways. Turns out if the login page is utf-8 it doesn't work; setting it to charset=iso-8859-1 worked as this is the closest approximation of the Windows Active Directory charset we have (I believe the euro symbol will now fail in passwords... but since it isn't on our keyboards it's a better option). When you combine this with the impact of Kerberos enctypes and character sets you enter a lovely world where the client settings, webserver, HTML, Tomcat, Kerberos enctypes and password store can all fight about character sets.

    ReplyDelete
  2. Yup, that's a further example of much the same problem. Actually it's also an example of one of the 'other problems' that I mentioned near the end - in this case 'what encoding is used for data from forms?'. When I last looked (quite a while ago) this was a mess, but a likely outcome is that form data will be uploaded in the same encoding as used by the page. Hence why changing the encoding of the page helped.

    In your password case, what's happening is that the user is entering the _characters_ of his password and these are being encoded into numbers for transmission to your web server. Then, by the sound of it, you are passing those numbers into a password verification function which is making assumptions about the underlying character encoding. Providing they match, it works. If they don't match then it works much of the time but fails when presented with a less-than-common character like £. Sound familiar?

    ReplyDelete
  3. This comment has been removed by a blog administrator.

    ReplyDelete