| 81 | |
| 82 | === Unicode === |
| 83 | |
| 84 | ==== Background ==== |
| 85 | |
| 86 | As a universal requirement, Sahana must support any Unicode character in user-strings (i.e. strings from the database, forms or translations). |
| 87 | |
| 88 | Python 'unicode' objects are sequences of Unicode code points (each code point representing a character), which can be used to store strings containing any Unicode characters. Such 'unicode' objects are an in-memory representation, though, i.e. they are not generally understood outside of the Python VM. When writing to interfaces, unicode objects must be encoded as byte strings, which Python 2 represents as 'str' objects. The most common character encoding that covers all Unicode characters is UTF-8. |
| 89 | |
| 90 | In Python 2, the str() constructor encodes a unicode argument with the default codec (ASCII), and raises a UnicodeEncodeError for unicode objects that contain non-ASCII characters. To prevent that, we must implement safe ways of converting unicode into str, enforcing UTF-8 encoding. |
| 91 | |
| 92 | Additionally, indices in str objects count bytes, not characters, which can produce invalid characters when extracting substrings from UTF-8 encoded strings. Therefore, before any substring or character operation we must safely convert str into unicode, assuming UTF-8 encoding. |
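
The difference between byte-wise and character-wise indexing can be illustrated with a minimal sketch. It is written to run on both Python 2 and Python 3; the variable names are made up for this example.

```python
# A unicode object with three characters: a-umlaut, "b", "c"
text = u"\u00e4bc"

# The UTF-8 encoded form (a 'str' in Python 2, 'bytes' in Python 3)
data = text.encode("utf-8")

char_count = len(text)   # counts characters
byte_count = len(data)   # counts bytes; a-umlaut takes two bytes in UTF-8

# Slicing the encoded form after the first byte cuts the a-umlaut in half,
# leaving a byte sequence that is not valid UTF-8.
try:
    data[:1].decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False

# Slicing the unicode object always yields whole characters.
first_char = text[:1]
```

Here the same string has a length of 3 as unicode but 4 as UTF-8 encoded str, and only the slice taken from the unicode object is a complete character.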
| 93 | |
| 94 | ==== Unicode-Guideline ==== |
| 95 | |
| 96 | 1) All functions dealing with user-strings should be designed to accept both str and unicode, while safely handling strings with non-ASCII characters. For unicode-safe conversions, we use s3_unicode(s) and s3_str(s), instead of unicode(s) and str(s). |
| 97 | |
| 98 | 2) Where we receive str input, we assume utf-8 encoding. ASCII is a strict subset of utf-8, and utf-8 is the most widely used encoding for non-ASCII text, so this is the safest assumption we can make. |
| 99 | |
| 100 | 3) Before indexing, splitting, slicing or iterating over a user-string, we always convert it into a unicode object using: |
| 101 | |
| 102 | {{{ |
| 103 | s = s3_unicode(s) |
| 104 | }}} |
| 105 | |
| 106 | 4) We assume that any (external) function we call may attempt to convert input by calling str(), so we generally deliver all strings as utf-8 encoded str to prevent UnicodeEncodeErrors. This can be done by: |
| 107 | |
| 108 | {{{ |
| 109 | s = s3_str(s) |
| 110 | }}} |
| 111 | |
| 112 | or: |
| 113 | |
| 114 | {{{ |
| 115 | s = s3_unicode(s).encode("utf-8") |
| 116 | }}} |
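
To make the two conversions more concrete, the following is a rough sketch of how such helpers could behave. It is not the actual implementation of s3_unicode or s3_str: the names `to_unicode` and `to_utf8` are hypothetical, and replacing undecodable bytes instead of raising is an assumption of this sketch. It follows the Python 2 semantics described above (str means bytes), but also runs on Python 3.

```python
text_type = type(u"")    # 'unicode' in Python 2, 'str' in Python 3
bytes_type = type(b"")   # 'str' in Python 2, 'bytes' in Python 3

def to_unicode(s, encoding="utf-8"):
    """Hypothetical stand-in for s3_unicode: return s as a unicode object."""
    if isinstance(s, text_type):
        return s
    if isinstance(s, bytes_type):
        # Byte input is assumed to be UTF-8 (guideline 2). Undecodable
        # bytes are replaced rather than raising (assumption of this sketch).
        return s.decode(encoding, "replace")
    # Other objects (numbers etc.): use their text representation
    return text_type(s)

def to_utf8(s):
    """Hypothetical stand-in for s3_str: return s as UTF-8 encoded bytes."""
    if isinstance(s, bytes_type):
        return s
    return to_unicode(s).encode("utf-8")

# Usage: slice on characters, then hand on UTF-8 encoded bytes
name = b"Z\xc3\xbcrich"               # UTF-8 bytes for "Zurich" with u-umlaut
initials = to_unicode(name)[:2]       # two characters, not two bytes
encoded = to_utf8(initials)
```

The point of the pattern is that character operations happen only on unicode objects, while anything passed on to other code is a utf-8 encoded byte string.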
| 117 | |
| 118 | 5) System-strings (like table or field names, attribute names, etc.) should never contain non-ASCII characters, so that they safely pass through str(). |
| 119 | |
| 120 | 6) When reading XML, we follow the encoding specified in the XML declaration rather than making assumptions about the encoding. For all other sources, we |
| 121 | assume utf-8 (see 2). In exports, we always write utf-8. |
| 122 | |
| 123 | 7) All code is utf-8 encoded, so that all string constants are automatically utf-8 encoded str. We do not use u"..." for string constants. |
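
Guideline 6 can be illustrated with a small sketch. It uses the standard library's xml.etree.ElementTree purely for illustration (an assumption of this example, not necessarily the parser used in Sahana), and the sample document is made up.

```python
import xml.etree.ElementTree as ET

# The input declares ISO-8859-1; byte 0xE9 is e-acute in that encoding.
source = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Caf\xe9</name>'

# The parser follows the declared encoding instead of assuming utf-8
root = ET.fromstring(source)
value = root.text

# On export, we always write utf-8, whatever the input encoding was
exported = ET.tostring(root, encoding="utf-8")
```

After parsing, `value` holds the decoded text, and `exported` contains the same character as the two-byte utf-8 sequence.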