Sunday, February 5, 2012

Python unicode string gotcha

Consider the following, executed in Python 2.7’s interactive interpreter:

>>> s = u'Ω'
>>> se = s.encode("mbcs")
>>> print s, se
Ω Ω
>>> s == se
True
>>> print s.lower(), se.lower()
ω Ω
>>> s.lower() == se.lower()
False

Bizarre?  Not if you consider that an ANSI string (str) has no way of knowing its encoding.  Of course it could try to use the default encoding, but clearly it doesn’t: Python’s str.lower() does not convert non-ASCII characters.  See also a related question at Stack Overflow.
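The same asymmetry can still be reproduced in Python 3, where bytes plays the role of the 2.x str.  The sketch below is only an illustration: it assumes the cp1253 (Greek) code page as a portable stand-in for "mbcs", which exists only on Windows, and the variable names are made up for this example.

```python
# Assumption: cp1253 (Greek) stands in for the Windows-only "mbcs" codec.
ENCODING = "cp1253"

text = "\u03a9"                   # GREEK CAPITAL LETTER OMEGA
encoded = text.encode(ENCODING)   # a single non-ASCII byte

# Lowercasing the text first applies Unicode case rules ...
lowered_then_encoded = text.lower().encode(ENCODING)
# ... while lowercasing the bytes only touches the ASCII letters A-Z.
encoded_then_lowered = encoded.lower()

print(encoded_then_lowered == encoded)               # the byte is left untouched
print(lowered_then_encoded == encoded_then_lowered)  # so the two results differ
```

The encoded string carries no record of which code page produced it, so its lower() has nothing to go on beyond the ASCII range.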

So what does all this have to do with PyScripter?  In my previous post, I mentioned that breakpoints did not work in Python 2.x when the filename contained non-ASCII characters.  For the curious, here is what was happening:

  • PyScripter passes unicode filenames to compile, which returns code objects with ANSI filenames.
  • PyScripter passes the same unicode filenames to the debugger’s Bdb.set_break.
  • The debugger stores breakpoints with filenames converted through os.path.normcase.  On Windows this function converts filenames to lowercase.  Since PyScripter passes unicode filenames, they are properly converted to lowercase.
  • When the debugger checks whether we hit a breakpoint, it uses the frame’s filename, which comes from the code object’s filename and is an ANSI string (str).  It converts the filename to lowercase again using os.path.normcase, but this time non-ASCII characters are not properly converted.
  • The debugger then compares the frame’s filename with the filenames of the stored breakpoints.
  • And as in the code snippet above the lowercase filenames do not match, since unicode.lower() and str.lower() behave differently on non-ascii characters.
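The sequence above can be imitated in a few lines of Python 3.  This is a rough sketch under several assumptions: normcase_like_py2 is a hypothetical helper that mimics what Python 2's Windows normcase did (swap slashes, then call .lower()), bytes stands in for the 2.x str, cp1253 stands in for "mbcs", and the path is invented.  The "decode first" variant at the end is just one possible way to make both sides agree, not necessarily the fix that went into PyScripter.

```python
ENCODING = "cp1253"  # assumed stand-in for the Windows-only "mbcs" codec


def normcase_like_py2(path):
    """Hypothetical imitation of Python 2's Windows normcase."""
    if isinstance(path, bytes):
        # bytes.lower() only lowercases ASCII letters, like the 2.x str.lower()
        return path.replace(b"/", b"\\").lower()
    return path.replace("/", "\\").lower()


unicode_name = "C:/Scripts/\u03a9mega.py"   # invented example path

# Breakpoint stored from the unicode filename: fully lowercased.
breakpoints = {normcase_like_py2(unicode_name)}

# The frame reports the encoded filename taken from the code object.
frame_name = unicode_name.encode(ENCODING)

# Lookup as described above: normalise the encoded name, then compare.
looked_up = normcase_like_py2(frame_name).decode(ENCODING)
hit = looked_up in breakpoints

# One possible remedy: decode to unicode before normalising.
hit_after_decoding = normcase_like_py2(frame_name.decode(ENCODING)) in breakpoints

print(hit, hit_after_decoding)
```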

As mentioned in the previous post, this has now been fixed.

The dreaded Unicode Encode/Decode Errors

About a third of all bug reports I get for PyScripter relate to using PyScripter with Python file paths containing non-ASCII characters.  In such cases a UnicodeEncodeError or UnicodeDecodeError may occur when you run or debug a script, or even while PyScripter is trying to provide Code Completion.  In this post I will try to describe the problem and the available solutions.
All strings inside PyScripter, filenames and source code included, are in unicode.  However, the Python compiler infrastructure internally uses encoded strings in various places.  You might have thought, as I did, that Python 3k sorted out all this mess.  Yet even in Python 3k the compile function, on Windows, converts unicode filenames into the default locale encoding (the “mbcs” encoding) and back into unicode.  This is unnecessary and a source of many problems.  For more information see my question at Stack Overflow and a related Python bug report.  In Python 2.x the compile function accepts unicode filenames, but the code object it generates contains an encoded filename.  So, my first piece of advice is:
Don’t use filenames that cannot be encoded in the default locale encoding, i.e. filenames for which filename.encode("mbcs") fails.
This is not a PyScripter issue but a Python one, and there isn’t much one can do about it.
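A quick way to test a filename against this rule is to attempt the encoding and catch the error.  The helper below is a hypothetical sketch (can_encode is not part of PyScripter or Python).  On Windows you would pass "mbcs"; the explicit code pages and example names used here are assumptions chosen so that the snippet runs on any platform.

```python
def can_encode(filename, encoding):
    """Return True if filename.encode(encoding) succeeds."""
    try:
        filename.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


greek_name = "\u03a9mega.py"        # starts with a Greek capital omega
chinese_name = "\u4e2d\u6587.py"    # two Chinese characters

print(can_encode(greek_name, "cp1253"))     # Greek code page
print(can_encode(chinese_name, "cp1253"))   # Greek code page, Chinese name
print(can_encode(chinese_name, "cp936"))    # Simplified Chinese code page
```

The same name can pass on one machine and fail on another, because the answer depends entirely on the code page behind the default locale.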
If you work with filenames containing non-ascii characters that can be encoded in the default locale encoding then the situation is as follows:
  1. With Python 3k you should not have any problems.  If you do, it is a bug and should be reported.
  2. When you work with python 2.x then you need to modify the file site.py in the Lib subdirectory of the python installation path.  In the function setencoding, change the following statement:
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    to
    if 1:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    Due to a bug in the current versions 2.4.3 and 2.4.6, breakpoints are not recognized even after modifying site.py.  This has now been fixed and the fix will be available with the next release.
    Note that filenames containing, for example, Chinese characters may work fine on a computer whose default locale supports Chinese characters, but cause problems when you move the files to a computer whose default locale does not.