86 | | As a universal requirement, Sahana must support any unicode character in user-strings (i.e. from the database, forms or translations). |
87 | | |
88 | | Python 'unicode' objects are tuples of 4-byte codes from the unicode table (each code representing a character), which can be used to store strings containing any unicode characters. Such 'unicode' objects are not printable, though, i.e. they are not generally understood outside of the Python VM. When writing to interfaces, unicode-objects must be encoded as strings of printable characters, which Python represents as 'str' objects. The most common character encoding that covers all unicode characters is UTF-8. |
89 | | |
90 | | The str() constructor in Python 2 assumes that its argument is ASCII-encoded, and raises an exception for unicode-objects that contain non-ASCII characters. To prevent that, we must implement safe ways for converting unicode into str, enforcing UTF-8 encoding. |
91 | | |
92 | | Additionally, indices in str objects count byte-wise, not character-wise - which can lead to invalid characters when extracting substrings from UTF-8 encoded strings. Therefore, for any substring- or character-operations we must safely convert str into unicode, assuming UTF-8 encoding. |
| 86 | As a universal requirement, Sahana ''must'' support any '''unicode''' character in user-strings (i.e. from the database, forms or translations). |
| 87 | |
| 88 | Python 'unicode' objects are immutable sequences of code points from the Unicode table (each code point representing a character, stored internally as 2- or 4-byte units depending on the Python build), which can be used to store strings containing any Unicode characters. |
| 89 | |
| 90 | Such 'unicode' objects cannot be written out directly, though, because their internal representation is not generally understood outside of the Python VM. When writing to interfaces, unicode objects must be ''encoded'' as byte strings, which Python 2 represents as 'str' objects. The most common character encoding that covers all Unicode characters is UTF-8. |
| 91 | |
| 92 | In Python 2, the str() constructor encodes unicode objects with the default ASCII codec, and therefore raises a UnicodeEncodeError for unicode objects that contain non-ASCII characters. To prevent that, we must implement safe ways for converting unicode into str, ''enforcing'' UTF-8 encoding. |
| 93 | |
| 94 | Additionally, indices in str objects count byte-wise, not character-wise, so a slice of a UTF-8 encoded string can cut a multi-byte character in half and produce invalid output. Therefore, for any substring- or character-operations we must safely ''decode'' the str into a unicode object, ''assuming'' UTF-8 encoding. |
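
The two conversions described above can be sketched as a pair of small helpers. This is only an illustration under stated assumptions: the names `to_unicode`, `to_utf8` and `truncate` are hypothetical and are not taken from the Sahana code base, and the version check is there only so that the same sketch runs on both Python 2 and Python 3.

```python
import sys

# Hypothetical compatibility aliases (not part of Sahana):
# "text" is the unicode type, "bytes" is the encoded byte-string type.
if sys.version_info[0] >= 3:
    text_type, bytes_type = str, bytes
else:
    text_type, bytes_type = unicode, str  # noqa: F821 (Python 2 only)


def to_unicode(value, encoding="utf-8"):
    """Return value as a unicode object, assuming UTF-8 for byte strings."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes_type):
        # Decode explicitly instead of relying on the ASCII default
        return value.decode(encoding)
    # Other objects (numbers etc.): use their text representation
    return text_type(value)


def to_utf8(value, encoding="utf-8"):
    """Return value as an encoded byte string, enforcing UTF-8."""
    if isinstance(value, bytes_type):
        # Assumed to be UTF-8 encoded already
        return value
    if not isinstance(value, text_type):
        value = text_type(value)
    return value.encode(encoding)


def truncate(value, length):
    """Cut value to at most `length` characters (not bytes)."""
    # Slice the decoded text, then encode the result again
    return to_utf8(to_unicode(value)[:length])


if __name__ == "__main__":
    name = u"Z\u00fcrich"          # 6 characters
    encoded = to_utf8(name)        # 7 bytes, the umlaut takes two
    print(len(name), len(encoded))
    # Slicing the bytes directly would split the two-byte umlaut;
    # truncate() slices characters and keeps the result valid UTF-8.
    print(truncate(encoded, 2) == to_utf8(u"Z\u00fc"))
```

Slicing the byte string directly (`encoded[:2]`) would keep only the first byte of the two-byte character, which is exactly the invalid-substring case described above. Decoding first and encoding last avoids it, at the cost of assuming that every incoming byte string really is UTF-8.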