Internationalization for Turkish:
Many software and web applications that are already internationalized and successfully support many languages often suffer catastrophic failure when they add support for the Turkish language.
This page explains the difficulty of supporting the Turkish language and typical solutions.
There are 3 sections:
A brief overview of Turkish characters and encodings.
Turkish language problem and solutions.
A brief history of the Turkish language is offered as background material.
The Ottoman Turkish language is commonly known today simply as Turkish. Modern Turkish is spoken by about 65 million people in the Republic of Turkey and 200,000 people in Northern Cyprus. Turkish is a Turkic language, traditionally placed in the proposed Ural-Altaic family of languages.
|q, w, x||Not used in Turkish|
|ç, ş||Cedilla added to c and s|
|ğ||Breve (aka hook) added to g, used to lengthen a preceding vowel|
|ö, ü||Dieresis added to o and u|
|ı, I||Undotted lowercase i added. Capitalizes as undotted uppercase I|
|i, İ||Dotted i retains the dot when capitalized|
|â, î, û||Circumflex used to elongate a, i, and u. This is no longer common practice|
|0-9||European digits replace Arabic-Hindi numerals|
Linguists and historians were assembled by the reformer Atatürk to create a new Turkish alphabet using the Latin script. The script was completed in August of 1928. The implementation required some modification to the base Latin letters.
Turkish, like other languages written in Latin script, is written left-to-right.
The Turkish alphabet is sorted as follows:
a, b, c, ç, d, e, f, g, ğ, h, ı, i, j, k, l, m, n, o, ö, p, r, s, ş, t, u, ü, v, y, z
Q, w, and x, the Latin letters that are not part of the Turkish alphabet, may collate either after the letter z or in their positions under English collation.
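The alphabet order above can be turned into a sort key. The following is a minimal Python sketch, not a full collation implementation: it handles lowercase letters only, and it assumes q, w, and x collate after z. The names `TURKISH_ORDER`, `RANK`, and `turkish_sort_key` are made up for this illustration.

```python
# Alphabet order as listed above, with q, w, x appended after z (an assumption;
# they may instead collate in their English positions).
TURKISH_ORDER = "abcçdefgğhıijklmnoöprsştuüvyz" + "qwx"
RANK = {ch: i for i, ch in enumerate(TURKISH_ORDER)}

def turkish_sort_key(word):
    """Map each lowercase letter to its rank; unknown characters sort last."""
    return [RANK.get(ch, len(RANK)) for ch in word]

words = ["izmir", "ılık", "zor", "çay", "cam", "şeker", "su", "ödev", "oda"]

# Sorting by code point puts ç, ö, ı, ş after z.
print(sorted(words))

# Sorting with the Turkish key keeps each letter next to its base letter.
print(sorted(words, key=turkish_sort_key))
```

A production application would more likely rely on a collation library than on a hand-built table like this one.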
|Unix||ISO8859-9 (aka Latin 5)|
|IBM||IBM cp905 (EBCDIC)|
Latin Extended-A blocks
The Turkish alphabet contains 29 uppercase and 29 lowercase letters. Punctuation is the same as that traditionally used with the Latin script. Because this character set is small, Turkish can be encoded in an 8-bit encoding scheme.
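As a quick check of the 8-bit claim, the sketch below encodes all 58 letters with Python's `iso8859_9` codec (ISO 8859-9, Latin 5) and confirms that each letter takes one byte. The variable names are illustrative only.

```python
# All 58 letters of the Turkish alphabet (29 lowercase + 29 uppercase).
lower = "abcçdefgğhıijklmnoöprsştuüvyz"
upper = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"
letters = lower + upper

# "iso8859_9" is Python's codec name for ISO 8859-9 (Latin 5).
encoded = letters.encode("iso8859_9")
print(len(letters), len(encoded))  # same count: one byte per letter

# ISO 8859-1 (Latin 1) has no ğ, ı, İ, or ş, so the same text does not fit there.
try:
    letters.encode("iso8859_1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
print(fits_latin1)
```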
To get details on these and other encodings, visit I18nGuy's Code Pages At The Push Of A Button.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i and an uppercase dotless I. Turkish has lowercase and uppercase forms of both dotted and dotless I.
Modifying or extending the Latin alphabet does not make Turkish unusual. Many languages have done so. However, usually when characters are added, both upper and lower case versions are added. As the characters are added in pairs, properties and mappings (other than collation) of the original English characters are unaffected. Therefore dependencies on internal keywords built on English letters are unaffected.
Turkish, instead, added letters that change the relationship between two of the English letters. Instead of the original case mapping of lower dotted i to upper dotless I, Turkish maps the lower dotted i to the new upper dotted İ, and the lower dotless ı to the upper dotless I.
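The two mappings can be compared side by side. The sketch below is one possible way to express the Turkish rules in Python, whose built-in `str.upper()` and `str.lower()` follow the default (English-style) mapping. The helper names `turkish_upper` and `turkish_lower` are hypothetical, not a standard API.

```python
# Code points of the four letters, for reference.
for ch in "iIıİ":
    print(f"U+{ord(ch):04X} {ch}")

# Turkish-specific pairs, applied before the default mapping handles the rest.
TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})
TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})

def turkish_upper(text):
    """Uppercase text using the Turkish mapping for the four i letters."""
    return text.translate(TR_UPPER).upper()

def turkish_lower(text):
    """Lowercase text using the Turkish mapping for the four i letters."""
    return text.translate(TR_LOWER).lower()

print("quit".upper(), turkish_upper("quit"))  # default mapping vs Turkish mapping
print("QUIT".lower(), turkish_lower("QUIT"))
```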
The change in the case rules for the letter i frequently breaks software logic. Applications that have been internationalized are conditioned to easily accept new characters, collations, case rules, encodings, and other character-based properties and relationships. However, their design often doesn't anticipate that properties or rules of English letters will change.
Many applications have an internal table of English keywords. For example, a product may have a command language and have a table to identify the commands and associate them with the procedures that implement the appropriate functions. When it is given a command, the product looks through the table for a match. For ease of use, usually a case-insensitive lookup is used.
When support for the Turkish language is added, the case rules are changed to use the Turkish mapping for the letter i. Within the application, programmers depend on case-insensitivity and may encode keywords using either case. To understand how Turkish case rules break the program logic, suppose the application needs to look up the keyword "quit" in all lowercase letters. If the keyword table has the keyword encoded as "QUIT", the lookup will not find a match. This is because the lowercase dotted i no longer has the uppercase dotless I as its uppercase equivalent. Note that this problem occurs with internal text and lookups. The user interface is generally translated and will work well with Turkish rules. But the program logic maps the translated terms to internal keywords, which may or may not match the casing of internal tables.
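The failure can be reproduced in a few lines. This is a sketch under stated assumptions: the command table `COMMANDS`, the `lookup` function, and the `turkish_upper` helper are all hypothetical stand-ins for an application's internals and for uppercasing under Turkish case rules.

```python
TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})

def turkish_upper(text):
    """Hypothetical stand-in for uppercasing under Turkish case rules."""
    return text.translate(TR_UPPER).upper()

# Hypothetical internal command table, keyed by uppercase English keywords.
COMMANDS = {"QUIT": "exit the program", "OPEN": "open a file"}

def lookup(keyword, to_upper):
    """Case-insensitive lookup: uppercase the keyword, then search the table."""
    return COMMANDS.get(to_upper(keyword))

print(lookup("quit", str.upper))      # found under English rules
print(lookup("quit", turkish_upper))  # None: the key becomes "QUİT", which is not in the table
print(lookup("open", turkish_upper))  # keywords without the letter i still match
```

Only keywords containing the letter i are affected, which is one reason the problem can go unnoticed in testing.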
Databases can also fail when the Turkish language is incorporated. Although the database software is designed to work with different collations and casings, it often assumes that metaschema names are in English and follow English case rules. When the rules are changed to Turkish, the database software may not be able to find schema objects with names such as "files".
Often companies adopt a quick workaround: they use the Turkish case rules, except that the case of the letter i remains as in English. This fixes the logic problems, but irritates Turkish end-users, since functions that depend on case now do the wrong thing with the four i letters.
Some of the internal table problems can be worked around by adding entries for keywords that use the letter i, to cover all cases. For example, quit might have two entries, one with a dotted and one with a dotless i: "quit" and "quıt". But this utilizes more memory and can hurt performance. And keywords like "mississippi" of course would need 16 entries to cover all the variations of the four i letters in the word. This approach is also error-prone. If the table is modified, programmers may forget to add additional "i" entries.
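The count of 16 follows from the four i letters in "mississippi", each of which can be dotted or dotless: 2 × 2 × 2 × 2 = 16. A small Python sketch that enumerates the variants (the function name `i_variants` is made up for this example):

```python
from itertools import product

def i_variants(keyword):
    """Return every spelling of keyword with each 'i' either dotted or dotless."""
    choices = [("i", "ı") if ch == "i" else (ch,) for ch in keyword]
    return ["".join(combo) for combo in product(*choices)]

print(i_variants("quit"))              # the two table entries for "quit"
print(len(i_variants("mississippi")))  # 2 ** 4 entries
```

The table size doubles with every additional i in a keyword, which is why this workaround scales poorly.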
The correct solution is to have internal program logic use separate collation and case rules that do not change when the user selects international settings. This ensures that lookups of internal tables or database schemas work consistently.
However, the solution is difficult to implement since it can require specifying the locale (which selects the collation and case rules) on most function calls, and knowing which locale to use in every instance.
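One way to picture the separation is to give internal lookups their own fixed case rule and to reserve the locale-sensitive rule for user-visible text. The sketch below assumes a hypothetical command table and hypothetical helper names (`upper_invariant`, `upper_for_user`). It relies on the fact that Python's `str.upper()` does not consult the user's locale, so it can serve as the invariant rule here.

```python
TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})

def upper_for_user(text, locale):
    """Uppercase user-visible text according to the user's locale."""
    if locale.startswith("tr"):
        return text.translate(TR_UPPER).upper()
    return text.upper()

def upper_invariant(text):
    """Uppercase internal identifiers with fixed rules, ignoring the user's locale."""
    return text.upper()

# Hypothetical internal command table.
COMMANDS = {"QUIT": "exit the program"}

def lookup(keyword):
    # Internal lookups always use the invariant rules, whatever the user selected.
    return COMMANDS.get(upper_invariant(keyword))

user_locale = "tr_TR"
print(lookup("quit"))                           # still found with Turkish selected
print(upper_for_user("istanbul", user_locale))  # user text gets the dotted capital İ
```

The difficulty described above shows up as the need to decide, at every call site, which of the two functions applies.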
|Old Anatolian Turkish||13th - 15th Centuries|
|Ottoman Turkish||16th - 19th Centuries|
|Modern Turkish||1928 and after|
During the Ottoman Empire (1299-1922), three languages were in common use: Ottoman Turkish, Persian and Arabic. The languages were all written using Arabic script. During this era, Ottoman Turkish borrowed both vocabulary and syntax from Persian and Arabic. Because the 3 languages come from different language families, the borrowings made Turkish difficult to use. Incompatibilities between the spoken language and the Arabic script made spelling and writing difficult. In the nineteenth century the need for reform was recognized.
|Ural-Altaic||used by Anatolian|
In 1923, cultural and political reform was led by Mustafa Kemal. Mustafa Kemal was later called Atatürk, "father of the Turks". He was also the first president of the Republic of Turkey. Turkey wanted to be free of its Ottoman past and Islamic influences and be more aligned with the West. Atatürk and his Nationalistic party instituted the adoption of the Latin script and new vocabulary in May of 1928.
Atatürk rejected a five-year plan, proposed by a planning commission, to migrate to the Latin script. Instead, the plan was revised to adopt the Latin script in three months. Schools began teaching Latin script in the fall of that year. Grammar lessons were cancelled until new grammar books were available. By November 1928, Latin script became law, and using Arabic script to represent Turkish became illegal.
Turkey made other changes besides the writing system, including replacing the Islamic calendar with the Gregorian calendar on January 1, 1927. The Islamic calendar is now used only for determining religious holidays. Sunday is the mandated secular Sabbath, rather than Friday, the traditional Muslim Sabbath.
Copyright © 2004-2010 Tex Texin. All rights reserved.