
=== Turkish letters İ and ı ===

In Turkish, the letters {{{I}}} and {{{i}}} are not an uppercase/lowercase pair. Instead, there are two pairs, {{{(İ, i)}}} and {{{(I, ı)}}}: one with the dot above and one without.

According to the Unicode spec, the lowercase counterpart of {{{İ}}} is a sequence of two Unicode characters: {{{i}}} (with the dot) followed by the code point U+0307, "combining dot above". The latter preserves the information about the dot for the conversion back to uppercase.
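
The two-character result can be inspected directly. The following is a minimal sketch, assuming a Python-3 interpreter; the variable names are illustrative only:

```python
import unicodedata

# U+0130 is the single code point for the dotted capital letter İ.
lowered = "\u0130".lower()

# Collect each resulting code point together with its official Unicode name.
code_points = ["U+%04X" % ord(ch) for ch in lowered]
names = [unicodedata.name(ch) for ch in lowered]

for cp, name in zip(code_points, names):
    print(cp, name)
```

Under that assumption, the loop prints one line for {{{i}}} and one for the combining dot above.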

Python-2 did not produce the U+0307 character when lowercasing, so it converted the letters like this:
{{{#!python
>>> u"İ".lower().upper()
u'I'
>>> u"ı".upper().lower()
u'i'

# NB: with a UTF-8-encoded str, Python-2 does not lowercase "İ" at all!
>>> print "İ".lower()
İ
}}}

Python-3 does produce the U+0307 character, so the behavior is different:
{{{#!python
>>> "İ".lower().upper()   # result is 'I' + U+0307, displayed as İ
'İ'
>>> "ı".upper().lower()
'i'
}}}

Critically, the U+0307 character changes the string length (it is an extra character!):
{{{#!python
# Python-2
>>> len(u"İ".lower())
1

# Python-3
>>> len("İ".lower())
2
}}}
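
The extra character also survives the trip back to uppercase, so a Python-3 round trip does not compare equal to the original single code point. The sketch below, assuming Python-3 and the standard {{{unicodedata}}} module, shows one possible way to compare such strings, namely NFC normalization. Whether normalization is appropriate depends on the use-case.

```python
import unicodedata

original = "\u0130"                    # İ as a single code point
round_trip = original.lower().upper()  # 'I' followed by U+0307

# A plain comparison fails because the round trip has two code points.
same_without_nfc = (round_trip == original)

# NFC composes 'I' + U+0307 back into the single code point U+0130.
same_with_nfc = (unicodedata.normalize("NFC", round_trip) == original)

print(len(round_trip), same_without_nfc, same_with_nfc)
```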

This is just something to keep in mind: an actual forward/backward compatibility pattern must be developed for the specific use-case. Neither the Python-2 nor the Python-3 behavior is particularly helpful for generalization; the Turkish I's always need special treatment.
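
As one illustration of such special treatment, the sketch below maps the four Turkish letters explicitly before applying the default case conversion. It assumes Python-3, text that is known to be Turkish, and input that uses the single code point U+0130 rather than {{{I}}} followed by U+0307. The names {{{turkish_lower}}} and {{{turkish_upper}}} are hypothetical helpers, not part of the standard library.

```python
# Hypothetical helpers for text known to be Turkish (not a stdlib API).
# Escapes are used so that the source file encoding cannot alter the letters:
# \u0130 is İ (dotted capital), \u0131 is ı (dotless small).
_TR_LOWER = str.maketrans({"\u0130": "i", "I": "\u0131"})
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})


def turkish_lower(text):
    """Lowercase text, pairing İ with i and I with ı."""
    # Map the Turkish letters first, then lowercase everything else.
    return text.translate(_TR_LOWER).lower()


def turkish_upper(text):
    """Uppercase text, pairing i with İ and ı with I."""
    # Map the Turkish letters first, then uppercase everything else.
    return text.translate(_TR_UPPER).upper()


lowered = turkish_lower("ISPARTA \u0130L\u0130")
raised = turkish_upper(lowered)
print(lowered, raised)
```

Because every mapping here is one-to-one, the string length stays the same in both directions. Applying these helpers to non-Turkish text would give wrong results, so the language of the input has to be known beforehand.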