Whilst we whittled our filesystem problems down to a remaining few and sent our first Release Candidate out into the wild, we discovered we had another specter on the horizon to deal with: International Filename Support. Python generally handles this pretty well: it defaults to the web standard, UTF-8, so if you receive a UTF-8 string, Python will print the correct representation upon your call to “print”. No other work is necessary. This does not go so smoothly if the string you get is not encoded in UTF-8 (or ASCII, since it is a true subset of UTF-8). We learned this limitation, and how to overcome it, over the course of two frustrating days.
In our testing, we used another commercial SFTP Client to put some files with international characters in their names onto our test server (to wit: the files were called Québécois and Dvořàk). Unbeknownst to us, the client we used defaulted to Latin-1, aka ISO-8859-1 encoding. However, at this point, we also did not know about encoding in Python, so we just output the strings as we received them. What we saw was Qu?b?cois and Dvo??k from the Terminal, and even worse in Finder, Qu? and Dvo? (more on why this was so later).
Python does not auto-detect encodings, though some third-party modules will try to guess one for you.
We knew we had international characters, and we also knew that Mac OS X likes its characters to be encoded as UTF-8 (sort of).
So we tried this:
`output_string = input_string.encode('utf-8')`
`UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)`
It looks like Python is guessing the string is ASCII. We think it’s UTF-8, so let’s try it again:
`output_string = input_string.decode('utf-8').encode('utf-8')`
`UnicodeDecodeError: 'utf8' codec can't decode bytes in position 2-4: invalid data`
Oh dear. At this point, I insisted the client we were using was definitely not encoding filenames as UTF-8 data, but Jeff insisted that it had to be (it’s the standard, after all). Then we had an argument about the semantics of decoding vs. encoding. On a whim, I tried decoding the string using 'latin-1' as an argument. Ta da! No more Unicode exception! We came to the following conclusion about Python encoding/decoding: Python converts between encodings by way of an internal, canonical representation. If you call encode() on a raw byte string without decoding it first, the string is implicitly decoded from ASCII to this form, which is where the 'ascii' codec in our first error came from.
In short, unless you tell it otherwise, Python does this with every incoming string:
`canonical_string = decode(input_string, 'ascii')`
`output_string = encode(canonical_string, 'ascii')`
If the incoming strings are not ASCII-encoded, you must explicitly call decode() on them with the appropriate codec as an argument. Our codec in this case is Latin-1 (aka ISO-8859-1); so far so good.
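The same failure and fix can be reproduced in Python 3, where byte strings and text are separate types and nothing is decoded implicitly. This is only a minimal sketch, not our original code: the variable names are made up for illustration, and the bytes are written out by hand to stand in for what a Latin-1 client would send.

```python
# Hypothetical stand-in for the filename bytes sent by a Latin-1 client.
# "\u00e9" is the single code point for é; in Latin-1 it is the byte 0xe9.
raw_name = "Qu\u00e9b\u00e9cois".encode("latin-1")

# Decoding with the wrong codec fails, because 0xe9 followed by "b"
# is not a valid UTF-8 sequence.
try:
    raw_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the codec the client actually used succeeds.
text = raw_name.decode("latin-1")
```

Here `utf8_ok` ends up `False`, while `text` holds the nine-character name we wanted all along.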
Now that we have our string object, we must call encode() on it with 'utf-8' as an argument, since UTF-8 is almost what Mac OS X expects. I say “almost” because Unicode allows two ways of representing the same accented text: “Composed Form” (NFC) and “Decomposed Form” (NFD). The difference is in how characters with diacritics, like à or é, are represented. Mac OS X uses decomposed form, which simply means that à is stored as two code points, a followed by a combining grave accent, which are then combined for display. Decoding Latin-1 gives us the composed form, so before we re-encode the strings as UTF-8, we’ve got to make this switch.
decomposed_string = unicodedata.normalize('NFD', \
Now we can finish up the task.
`output_string = decomposed_string.encode('utf-8')`
Hooray! We’re done.
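For the curious, the three steps (decode, normalize, re-encode) can be sketched end to end as follows. This is written for Python 3 and uses illustrative names rather than our production code; the composed version is only there for comparison.

```python
import unicodedata

# Latin-1 bytes for "Québécois", written out by hand for illustration.
raw_name = b"Qu\xe9b\xe9cois"

# Step 1: bytes -> text. Latin-1 yields the single code point U+00E9 for é.
text = raw_name.decode("latin-1")

# Step 2: switch to decomposed form, so é becomes "e" plus U+0301
# (combining acute accent).
decomposed = unicodedata.normalize("NFD", text)

# Step 3: re-encode as UTF-8.
output_bytes = decomposed.encode("utf-8")

# For comparison: the same text in composed form.
composed_bytes = unicodedata.normalize("NFC", text).encode("utf-8")
```

The two results look identical on screen but differ on the wire: the decomposed name is `b"Que\xcc\x81be\xcc\x81cois"` (13 bytes) and the composed one is `b"Qu\xc3\xa9b\xc3\xa9cois"` (11 bytes).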
But wait… what happens if some other client uses a different encoding? Well, of course the characters will display incorrectly. We need some sort of default encoding that will work. We saw above that using UTF-8 as a default will not work, since there are byte sequences in Latin-1 (and probably other codecs) that are invalid in UTF-8. We settled on defaulting to ASCII. This is acceptable in all cases because of a basic truth about text encoding: every single character is transmitted as at least one byte of data, and every byte can be shown as something, even if only as a placeholder. So while the character à does not have an encoding in ASCII, its UTF-8 byte sequence, `\xc3\xa0`, can still be displayed, though it will usually just print as `??` since both those values are greater than `0x7F` and ASCII is not defined above `0x7F`.
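One caveat worth spelling out: a strict ASCII decode still raises on bytes above `0x7F`, so the `??` behavior depends on asking the codec to substitute rather than fail. A small Python 3 sketch of the difference, using the standard `errors="replace"` option (the variable names are ours, for illustration only):

```python
# UTF-8 bytes for à, as in the example above.
raw = b"\xc3\xa0"

# A strict ASCII decode rejects any byte above 0x7F.
try:
    raw.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="replace", each undecodable byte becomes U+FFFD,
# the Unicode replacement character.
replaced = raw.decode("ascii", errors="replace")

# Encoding back to ASCII with errors="replace" turns each U+FFFD into "?".
printable = replaced.encode("ascii", errors="replace")
```

After this runs, `strict_ok` is `False` and `printable` is `b"??"`, which matches the question marks we saw in the Terminal.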
Putting it all together, this is basically the function we use to handle these strings.
def re_encode(input_string, decoder='utf-8', encoder='utf-8'):
    output_string = unicodedata.normalize('NFD',\
    output_string = unicodedata.normalize('NFD',\
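A complete version of the same idea, written for Python 3, might look like the sketch below. It is an approximation rather than our shipping code: the name `re_encode_sketch`, the exception handling, and the use of `errors="replace"` are all assumptions made for the sake of a runnable example.

```python
import unicodedata


def re_encode_sketch(input_bytes, decoder="utf-8", encoder="utf-8"):
    """Decode bytes, normalize to NFD, and re-encode (hypothetical sketch)."""
    try:
        text = input_bytes.decode(decoder)
    except UnicodeDecodeError:
        # Assumed fallback: decode as ASCII and substitute U+FFFD for any
        # byte that ASCII cannot represent, instead of raising.
        text = input_bytes.decode("ascii", errors="replace")

    # Mac OS X wants decomposed form, so split accented characters
    # into base letter plus combining mark.
    decomposed = unicodedata.normalize("NFD", text)

    # errors="replace" emits "?" if the target codec cannot represent
    # a character, so this step does not raise either.
    return decomposed.encode(encoder, errors="replace")
```

Called with the right codec, `re_encode_sketch(b"Qu\xe9b\xe9cois", decoder="latin-1")` returns the decomposed UTF-8 name. Called with the wrong one and an ASCII target, `re_encode_sketch(b"Qu\xe9b\xe9cois", encoder="ascii")` returns `b"Qu?b?cois"`, which is wrong but harmless and easy to spot.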
And that’s really all there is to it. Python wins the game. By defaulting to ASCII encoding, you won’t get any unhandled exceptions, and you’ll also know pretty quickly that something is wrong (just look for the `???????`s). For a much lengthier discussion of what Unicode is and does, see Joel Spolsky’s verbose take on the matter.