Visual Studio 2008 can transparently handle Unicode, i.e. developers can edit files with it without being aware of how those files are encoded. Since programming is a collaborative effort and tastes (or requirements) vary, we have very little control over the encoding of the source files we process with
After recurring agony and grief over stray Unicode-encoded files (SVN, backups, etc.), we decided to embrace the 21st century instead of fighting it and put graceful Unicode handling into
unicoder.py reflects modest ambitions, namely:
- before reading in a file, find out how it is encoded
- remember how the file is encoded
- decode the read-in text
- when writing back the file, encode the embellished text in the way it was encoded before (i.e. how you remember it)
Understand what "BOM"s are:
UnicodeSmartRead receives a path, opens the file behind that path, and reads it in. The resulting text is checked for various BOMs to determine the encoding. That encoding is stored in a hash, with the absolute path of the read file as the key and a string denoting the detected encoding as the value (ISO-8859-1 if no BOM is found). The global hash is named
A BOM is not considered part of the text.
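The read step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in unicoder.py: the names `smart_read`, `detect_encoding`, `encoding_hash` and `BOMS` are hypothetical, the sketch targets Python 3, and it only checks the UTF-8 and UTF-16 BOMs (the actual list of BOMs checked may differ).

```python
import codecs
import os

# Hypothetical stand-in for the module's global hash:
# absolute path -> name of the encoding found when the file was read.
encoding_hash = {}

# BOMs to look for, paired with the codec that decodes the bytes after them.
# Covering only UTF-8 and UTF-16 is an assumption of this sketch.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_encoding(raw):
    """Return (encoding, bom) for raw bytes; bom is b"" if none is found."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding, bom
    # No BOM found: fall back to ISO-8859-1, as described in the text.
    return "iso-8859-1", b""


def smart_read(path):
    """Read a file, remember its encoding, and return the text without the BOM."""
    with open(path, "rb") as handle:
        raw = handle.read()
    encoding, bom = detect_encoding(raw)
    encoding_hash[os.path.abspath(path)] = encoding
    # The BOM is not part of the text, so it is cut off before decoding.
    return raw[len(bom):].decode(encoding)
```

Reading the file in binary mode keeps the BOM visible; opening it in text mode would force a choice of encoding before the BOM has been inspected.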
UnicodeSmartWrite receives a path and a text. The text is written to that path using the encoding remembered in the encodingHash (the BOM, if any, is prepended to the encoded text). After writing, the entry for the written file's absolute path is removed from the hash.
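A matching sketch of the write step, again only an illustration: `smart_write`, `encoding_hash` and `BOM_FOR` are hypothetical names, the code assumes Python 3, and it assumes the hash stores Python codec names such as `"utf-16-le"`. A path that is missing from the hash surfaces here as a `KeyError`; the exception type actually used by unicoder.py is not stated in the text.

```python
import codecs
import os

# Hypothetical stand-in for the module's global hash:
# absolute path -> name of the encoding remembered from the read step.
encoding_hash = {}

# BOM to prepend for each remembered encoding; ISO-8859-1 has none.
BOM_FOR = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "iso-8859-1": b"",
}


def smart_write(path, text):
    """Write text to path in the encoding remembered for that path."""
    key = os.path.abspath(path)
    # Raises KeyError if no encoding was remembered for this path.
    encoding = encoding_hash[key]
    with open(path, "wb") as handle:
        handle.write(BOM_FOR[encoding] + text.encode(encoding))
    # Remove the entry only after the write has succeeded.
    del encoding_hash[key]
```

Deleting the entry after the write, rather than before it, means a failed write leaves the remembered encoding in place for a retry.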
UnicodeSmartRead and UnicodeSmartWrite have to be balanced. If UnicodeSmartWrite can't find the (absolute) path of the file to write in the encodingHash, an exception is thrown. This is cheap and has the extra benefit of making it virtually impossible to confuse files (i.e. to write to a file different from the one we read from). In other words, unicoder.py assumes that only files that have been read by unicoder.py are written (back) by
The unit-tests check if the encoding is determined correctly for test files, located in the
- unicode.txt (UTF-16 little endian)
- unicode-big-endian.txt (UTF-16 big endian)
The base names of the files have been taken from the menu items in MS Windows' notepad. We used notepad to create these files.
The unit tests also check whether files have been written back correctly by comparing them against prepared samples. The original contents of the test files are kept in the -orig.txt files; the samples we compare the results from UnicodeSmartWrite against are named
The unit tests are cheap and could be refactored to squeeze out duplication, even more so if the test files were renamed so that each base name exactly matches the string denoting its encoding.