I always knew that there were character encoding systems other than ASCII, but to be honest I never really understood any of the details until recently. I know that I should. But don't worry: I haven't shipped any applications seeping with replacement characters. Just as most programmers have never used a rotation matrix, I have quite simply yet to write any software that deals with storage of text. However, a couple of months ago I was confronted with my ignorance as my team were converting our first Word documents to AsciiDoc. I wrote about that too, but this post is about one of the small hiccups we had along the way.
Before even thinking about automating our conversion process, we wanted to simply convert one document manually - we needed to familiarize ourselves with the AsciiDoc syntax anyway. In the first crude attempt, we just opened the Word document and copied all content to a text editor. After adding the most essential markup we saved the file in ANSI encoding (which means the standard encoding for the system) and fed it to AsciiDoc. How could this go wrong?
Cannot decode byte '\x93': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream
From looking at the error message it would seem like AsciiDoc didn't support ASCII after all. In order to understand this we needed to revisit a couple of character encodings - specifically the message mentions UTF-8. UTF-8 is a variable-length character encoding with the property that its byte representation of any ASCII character coincides with the ASCII encoding. Similarly, the Latin-1 byte representation of any ASCII character coincides with the ASCII encoding. In other words, any ASCII-encoded text is at the same time valid Latin-1 and valid UTF-8. The opposite is not true. Nor is Latin-1 a byte-level subset of UTF-8: the Latin-1 characters above 0x7F keep their numbers as Unicode code points, but UTF-8 stores each of them as two bytes. Historically, the shared ASCII range has ensured backwards compatibility with existing encodings as more and more code points were needed.
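The relationship between the three encodings is easy to check by hand. The following is a minimal Python sketch, not part of our conversion setup; the helper name `encodings_agree` and the sample characters are made up for illustration. It shows that the byte-level agreement with UTF-8 holds for the ASCII range only: a Latin-1 character such as é is one byte in Latin-1 but two bytes in UTF-8.

```python
def encodings_agree(ch: str) -> bool:
    """Return True if ch is stored as the same bytes in Latin-1 and UTF-8."""
    return ch.encode("latin-1") == ch.encode("utf-8")


ascii_char = "A"        # U+0041, inside the 7-bit ASCII range
latin1_char = "\u00e9"  # é, U+00E9, part of Latin-1 but outside ASCII

# An ASCII character is the same single byte in all three encodings.
print(ascii_char.encode("ascii"), ascii_char.encode("latin-1"), ascii_char.encode("utf-8"))
# b'A' b'A' b'A'

# A non-ASCII Latin-1 character is one byte in Latin-1, two bytes in UTF-8.
print(latin1_char.encode("latin-1"), latin1_char.encode("utf-8"))
# b'\xe9' b'\xc3\xa9'

print(encodings_agree(ascii_char), encodings_agree(latin1_char))
# True False
```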
In our example it follows that if our input file is not valid UTF-8 then it cannot be pure ASCII either. Looking at the problematic byte we reach the same conclusion: 0x93 lies outside the 7-bit ASCII range. In Latin-1 the range 0x80 to 0x9F holds no printable characters (it is set aside for control codes), so 0x93 is not something we could have typed there either. In UTF-8, 0x93 has the bit pattern 10xxxxxx, which marks a continuation byte that may only appear after the lead byte of a multi-byte sequence. The error message is correct: a lone 0x93 is not valid UTF-8. So how did such a byte end up in the file to begin with?
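The failure can be reproduced in a few lines. This is a sketch in Python rather than the tool that produced the error above, and the sample bytes and the helper `is_continuation_byte` are invented for the example. It checks the bit pattern of 0x93 and shows that a strict UTF-8 decoder rejects the byte at the position where it occurs.

```python
def is_continuation_byte(b: int) -> bool:
    """True if b has the bit pattern 10xxxxxx, i.e. lies in 0x80..0xBF."""
    return (b & 0b1100_0000) == 0b1000_0000


print(f"{0x93:08b}")               # 10010011
print(is_continuation_byte(0x93))  # True: only valid after a lead byte

# Hypothetical file content: ASCII text with Windows-style quotes around "hi".
data = b"say \x93hi\x94"

try:
    data.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    bad_offset = err.start  # index of the first offending byte
    print(f"invalid byte 0x{data[bad_offset]:02x} at offset {bad_offset}")
```

The decoder stops at the first continuation byte that has no lead byte in front of it, which is the same complaint as in the error message we got.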
The reason is that the standard ANSI encoding on Windows, at least for English and Western European system locales, is neither ASCII, Latin-1 nor UTF-8, but a fourth encoding known as Windows 1252. Unlike Latin-1, the Windows 1252 encoding uses the range 0x80 to 0x9F for a number of "extra" characters. Among those characters are the left and right double quotation marks, characters 147 and 148 (0x93 and 0x94) respectively.
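To see what the two bytes mean, they can be decoded with the Windows 1252 code page, which Python exposes under the codec name `cp1252`. The sketch below is only an illustration; the helper `describe` is a made-up name.

```python
import unicodedata


def describe(byte_value: int) -> tuple[str, str]:
    """Decode a single byte as Windows 1252; return (code point, Unicode name)."""
    ch = bytes([byte_value]).decode("cp1252")
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


for byte_value in (147, 148):
    print(byte_value, hex(byte_value), *describe(byte_value))
# 147 0x93 U+201C LEFT DOUBLE QUOTATION MARK
# 148 0x94 U+201D RIGHT DOUBLE QUOTATION MARK

# The same byte read as Latin-1 is a control code, not a printable character.
print(unicodedata.category(b"\x93".decode("latin-1")))  # Cc
```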
The fix was just a matter of saving the file as UTF-8 rather than ANSI. When the file is saved as UTF-8, all characters are converted to their UTF-8 equivalents. Characters in the ASCII range keep their single-byte representation, so nothing changes for them. All other characters are re-encoded as multi-byte sequences. Specifically, the Windows 1252 codes 147 and 148 are converted to the Unicode code points U+201C and U+201D respectively and stored as two 3-byte sequences, 0xE2809C and 0xE2809D. Even though the fix was simple, this was a great opportunity to brush up on character encodings.
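For the curious, the same conversion can be done outside the text editor. The following Python sketch is an assumption about how one might script it, not what we actually ran; `cp1252_to_utf8` and the sample bytes are hypothetical.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Reinterpret Windows 1252 bytes as text and re-encode that text as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


original = b"\x93quoted\x94"  # hypothetical ANSI-saved content
converted = cp1252_to_utf8(original)

print(converted)
# b'\xe2\x80\x9cquoted\xe2\x80\x9d'

# ASCII bytes pass through unchanged; each quotation mark becomes three bytes.
print(len(original), len(converted))  # 8 12
```

For whole files, the same idea applies by opening the input with `encoding="cp1252"` and writing the output with `encoding="utf-8"`. Note that Python's `cp1252` codec raises an error on the few byte values that Windows 1252 leaves undefined, such as 0x81.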