I18n Guy, your I18n advisor

Internationalization for Turkish:
Dotted and Dotless Letter "I"

Many software and web applications that are already internationalized and successfully support many languages suffer catastrophic failure when they add support for the Turkish language. This page explains why supporting the Turkish language is difficult and describes typical solutions.

There are 3 sections:
A brief overview of Turkish characters and encodings.
Turkish language problem and solutions.
A brief history of the Turkish language is offered as background material.

The Ottoman Turkish language is commonly known today simply as Turkish. Modern Turkish is spoken by about 65 million people in the Republic of Turkey and 200,000 people in Northern Cyprus. Turkish is in the Ural-Altaic family of languages.

The Turkish Alphabet
abcçdefgğ hıijklmn oöprsşt uüvyz

Characteristics of Turkish Language and Character Encoding

Modifications to Latin Script for Modern Turkish
Letters    Changes
q, w, x    Not used in Turkish
ç, ş       Cedilla added to c and s
ğ          Breve (aka hook) added to g, used to lengthen a preceding vowel
ö, ü       Dieresis added to o and u
ı, I       Undotted lowercase i added. Capitalizes as undotted uppercase I.
i, İ       Dotted i retains the dot when capitalized.
â, î, û    Circumflex used to elongate a, i, and u. This is no longer common practice.
0-9        European digits replace Arabic-Hindi numerals

Designing Turkish For Latin Script

Linguists and historians were assembled by the reformer Atatürk to create a new Turkish alphabet using the Latin script. The script was completed in August of 1928. The implementation required some modification to the base Latin letters.

Writing Direction

Turkish, like other languages written in Latin script, is written left-to-right.

Turkish Collation

The Turkish alphabet is sorted as follows:
a, b, c, ç, d, e, f, g, ğ, h, ı, i, j, k, l, m, n, o, ö, p, r, s, ş, t, u, ü, v, y, z

Q, w, and x, the Latin letters that are not part of the Turkish alphabet, may collate either after the letter z or in their usual positions in English collation.
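As an illustration of this ordering, here is a minimal Python sketch that sorts lowercase words by the alphabet listed above. The names TURKISH_ALPHABET, FOREIGN_LETTERS, and turkish_sort_key are made up for this example, and it assumes q, w, and x collate after z (one of the two options described). A production system would more likely rely on a collation library.

```python
# Turkish alphabet in collation order, as listed above.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"

# Assumption for this sketch: q, w, and x collate after z.
FOREIGN_LETTERS = "qwx"

_RANK = {ch: i for i, ch in enumerate(TURKISH_ALPHABET + FOREIGN_LETTERS)}


def turkish_sort_key(word):
    """Map each letter to its position in the Turkish alphabet.

    Characters outside the table sort after all known letters,
    ordered by code point. Only lowercase input is handled.
    """
    return [(_RANK.get(ch, len(_RANK)), ch) for ch in word]


words = ["izmir", "ılık", "şeker", "su", "çay", "cam", "zil", "hava"]
sorted_words = sorted(words, key=turkish_sort_key)
print(sorted_words)
# ['cam', 'çay', 'hava', 'ılık', 'izmir', 'su', 'şeker', 'zil']
```

Sorting the same list by code point instead would push çay, ılık, and şeker to the end, after zil.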

Encodings used for Turkish
System     Encoding
Unix       ISO8859-9 (aka Latin 5)
Windows    cp1254
DOS        IBM cp857
IBM        IBM cp905 (EBCDIC)
Unicode    Latin and Latin Extended-A blocks

Turkish Encoding Standards

The Turkish alphabet contains 29 uppercase and 29 lowercase letters. Punctuation is the same as that traditionally used with the Latin script. The small number of characters used in Turkish can therefore be encoded in an 8-bit encoding scheme.
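The 8-bit claim can be checked with a short Python sketch. It uses Python's own codec names for three of the encodings in the table above (iso8859_9, cp1254, cp857). The helper fits_in_one_byte is a name invented for this example.

```python
LOWER = "abcçdefgğhıijklmnoöprsştuüvyz"
UPPER = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"

# Python codec names for three of the encodings listed above.
CODECS = ["iso8859_9", "cp1254", "cp857"]


def fits_in_one_byte(text, codec):
    """Return True if every character of text encodes to exactly one byte."""
    try:
        return all(len(ch.encode(codec)) == 1 for ch in text)
    except UnicodeEncodeError:
        # At least one character does not exist in this encoding.
        return False


results = {codec: fits_in_one_byte(LOWER + UPPER, codec) for codec in CODECS}
print(results)

# For contrast, ISO 8859-1 (Latin 1) lacks ğ, ı, İ, and ş.
print(fits_in_one_byte(LOWER + UPPER, "latin_1"))
```

All three Turkish code pages should report True, while Latin 1 should report False.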

To get details on these and other encodings, visit I18nGuy's Code Pages At The Push Of A Button.

Why Applications Fail With The Turkish Language

Turkish Has Four Types Of Letter I
         Dotted        Dotless
Upper    İ (U+0130)    I (U+0049)
Lower    i (U+0069)    ı (U+0131)

Turkish Has An Important Difference

Turkish has 4 letter "I"s. English has only two, a lowercase dotted i and an uppercase dotless I. Turkish has lowercase and uppercase forms of both dotted and dotless I.

Modifying or extending the Latin alphabet does not make Turkish unusual. Many languages have done so. However, usually when characters are added, both upper and lower case versions are added. As the characters are added in pairs, properties and mappings (other than collation) of the original English characters are unaffected. Therefore dependencies on internal keywords built on English letters are unaffected.

English vs. Turkish Case Mappings
Language    Letter       Lowercase Map    Uppercase Map
English     i            i                I
Turkish     dotted i     i                İ
Turkish     dotless ı    ı                I

Turkish Case Mappings and Case-Insensitivity

Turkish, instead, added letters that change the relationship between two of the English letters. Instead of the original case mapping of lower dotted i to upper dotless I, Turkish maps the lower dotted i to the new upper dotted İ, and the lower dotless ı to the upper dotless I.
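The two mappings can be sketched in Python as follows. Python's built-in str.upper() and str.lower() do not take a locale and apply the default (English-style) mapping for i and I, so the functions turkish_upper and turkish_lower below are hypothetical helpers written for this illustration, not library calls.

```python
def turkish_upper(text):
    """Uppercase with Turkish rules: i -> İ (U+0130), ı -> I."""
    # Convert dotted i first; the default mapping already sends ı to I.
    return text.replace("i", "İ").upper()


def turkish_lower(text):
    """Lowercase with Turkish rules: İ -> i, I -> ı (U+0131)."""
    # Handle both uppercase I letters before the default mapping runs.
    return text.replace("İ", "i").replace("I", "ı").lower()


print("quit".upper())         # QUIT  (default mapping)
print(turkish_upper("quit"))  # QUİT
print(turkish_lower("QUIT"))  # quıt
print(turkish_lower("QUİT"))  # quit
```

Under the Turkish rules the round trip stays consistent: dotted letters stay dotted and dotless letters stay dotless.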

The change in the case rules for the letter i frequently breaks software logic. Applications that have been internationalized are conditioned to easily accept new characters, collations, case rules, encodings, and other character-based properties and relationships. However, their design often doesn't anticipate that properties or rules of English letters will change.

Turkish case rules break logic.
Expected matches fail, and new matches are created,
even though the search table is unchanged.
Case Rules    Search Term    Table Entry    Match?
English       quit           quit           yes
English       QUIT           quit           yes
English       quit           QUIT           yes
English       QUIT           QUIT           yes

Turkish       quit           quit           yes
Turkish       QUIT           quit           no
Turkish       quit           QUIT           no
Turkish       QUIT           QUIT           yes
Turkish       quıt           quit           no
Turkish       QUİT           quit           yes
Turkish       quıt           QUIT           yes
Turkish       QUİT           QUIT           no
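The match results above can be reproduced with a small Python sketch. The function matches_ignoring_case and the translation table TURKISH_LOWER_I are assumptions made for this example. The comparison simply lowercases both strings, using the Turkish mapping for the two uppercase I letters when Turkish rules are selected.

```python
# Lowercase mappings that differ under Turkish rules; every other
# letter keeps its default mapping.
TURKISH_LOWER_I = str.maketrans({"I": "ı", "İ": "i"})


def matches_ignoring_case(term, entry, rules):
    """Case-insensitive comparison under 'english' or 'turkish' rules."""
    if rules == "turkish":
        term = term.translate(TURKISH_LOWER_I)
        entry = entry.translate(TURKISH_LOWER_I)
    return term.lower() == entry.lower()


# (case rules, search term, table entry), in the order of the table above.
ROWS = [
    ("english", "quit", "quit"),
    ("english", "QUIT", "quit"),
    ("english", "quit", "QUIT"),
    ("english", "QUIT", "QUIT"),
    ("turkish", "quit", "quit"),
    ("turkish", "QUIT", "quit"),
    ("turkish", "quit", "QUIT"),
    ("turkish", "QUIT", "QUIT"),
    ("turkish", "quıt", "quit"),
    ("turkish", "QUİT", "quit"),
    ("turkish", "quıt", "QUIT"),
    ("turkish", "QUİT", "QUIT"),
]

results = [matches_ignoring_case(term, entry, rules) for rules, term, entry in ROWS]
for (rules, term, entry), matched in zip(ROWS, results):
    print(f"{rules:8} {term:5} {entry:5} {'yes' if matched else 'no'}")
```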

How Applications Fail With Turkish Language

Many applications have an internal table of English keywords. For example, a product may have a command language and have a table to identify the commands and associate them with the procedures that implement the appropriate functions. When it is given a command, the product looks through the table for a match. For ease of use, usually a case-insensitive lookup is used.

When support for the Turkish language is added, the case rules are changed to use the Turkish mapping for the letter i. Within the application, programmers depend on case-insensitivity and may encode keywords using either case. To understand how Turkish case rules break the program logic, suppose the application needs to look up the keyword "quit" in all lowercase letters. If the keyword table has the keyword encoded as "QUIT", it will not find a match. This is because the lowercase dotted i no longer has the uppercase dotless I as its uppercase equivalent. Note that this problem occurs with internal text and lookups. The user interface is generally translated and will work well with Turkish rules. But the program logic maps the translated terms to internal keywords, which may or may not match the casing of internal tables.

Databases can also fail when the Turkish language is incorporated. Although the database software is designed to work with different collations and casings, it often assumes that metaschema names are in English and follow English case rules. When the rules are changed to Turkish, the database software may not be able to find schema objects with names such as "files".

Solutions

Often companies adopt a quick workaround: They use the Turkish case rules except the case of the letter i remains as in English. This fixes the logic problems, but irritates Turkish end-users since functions that depend on case now do the wrong thing with the four i letters.

Some of the internal table problems can be worked around by adding entries for keywords that use the letter i, to cover all cases. For example, quit might have two entries, one with a dotted and one with a dotless i: "quit" and "quıt". But this utilizes more memory and can hurt performance. And keywords like "mississippi" of course would need 16 entries to cover all the variations of the four i letters in the word. This approach is also error-prone. If the table is modified, programmers may forget to add additional "i" entries.
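To see how quickly the extra entries multiply, here is a minimal Python sketch that generates every dotted and dotless spelling of a lowercase keyword. The function name dotted_dotless_variants is invented for this example.

```python
from itertools import product


def dotted_dotless_variants(keyword):
    """Return every spelling of a lowercase keyword with each i
    written as either dotted i or dotless ı."""
    choices = [("i", "ı") if ch == "i" else (ch,) for ch in keyword]
    return ["".join(combo) for combo in product(*choices)]


quit_variants = dotted_dotless_variants("quit")
print(quit_variants)  # ['quit', 'quıt']

mississippi_variants = dotted_dotless_variants("mississippi")
print(len(mississippi_variants))  # 16
```

Each additional i doubles the number of entries, which is why this workaround scales poorly.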

The correct solution is to have internal program logic use separate collation and case rules that do not change when the user selects international settings. This ensures that lookups of internal tables or database schemas work consistently.

However, the solution is difficult to implement since it can require specifying the locale (which selects the collation and case rules) on most function calls, and knowing which locale to use in every instance.
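A rough Python sketch of this separation follows, under the assumption that internal keywords are plain English identifiers. The names invariant_fold, localized_upper, COMMANDS, and find_command are hypothetical. Python's str.lower() does not depend on the user's locale, so it serves here as the fixed rule for internal lookups, while user-visible text goes through a function that takes the locale explicitly.

```python
def invariant_fold(text):
    """Locale-independent folding for internal identifiers.

    This never changes with the user's language settings.
    """
    return text.lower()


def localized_upper(text, locale):
    """Uppercase user-visible text; only 'tr' is special in this sketch."""
    if locale == "tr":
        text = text.replace("i", "İ")
    return text.upper()


# Internal command table, keyed by the invariant form of each keyword.
COMMANDS = {invariant_fold(name): name for name in ("QUIT", "Files", "print")}


def find_command(user_input):
    """Look up an internal keyword without consulting the user's locale."""
    return COMMANDS.get(invariant_fold(user_input))


print(find_command("quit"))           # QUIT
print(find_command("FILES"))          # Files
print(localized_upper("quit", "tr"))  # QUİT
print(localized_upper("quit", "en"))  # QUIT
```

The cost noted above shows up in the signatures: every call that touches user-visible text has to be told which locale applies.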


Brief Background on the Ottoman Turkish Language

The history of the Turkish language is divided into 3 periods.
Period Name              Era
Old Anatolian Turkish    13th - 15th Centuries
Ottoman Turkish          16th - 19th Centuries
Modern Turkish           1928 and after

During the Ottoman Empire (1299-1922), three languages were in common use: Ottoman Turkish, Persian and Arabic. The languages were all written using Arabic script. During this era, Ottoman Turkish borrowed both vocabulary and syntax from Persian and Arabic. Because the 3 languages come from different language families, the borrowings made Turkish difficult to use. Incompatibilities between the spoken language and the Arabic script made spelling and writing difficult. In the nineteenth century the need for reform was recognized.

Languages coexisting during Ottoman Empire
Language           Language Family    Usage
Persian            Indo-European      diplomacy, art, literature
Arabic             Semitic            liturgy
Ottoman Turkish    Ural-Altaic        used by Anatolian peasants

In 1923, cultural and political reform was led by Mustafa Kemal. Mustafa Kemal was later called Atatürk, "father of the Turks". He was also the first president of the Republic of Turkey. Turkey wanted to be free of its Ottoman past and Islamic influences and be more aligned with the West. Atatürk and his Nationalistic party instituted the adoption of the Latin script and new vocabulary in May of 1928.

Atatürk's Speedy Implementation

Atatürk rejected a five-year plan, proposed by a planning commission, to migrate to the Latin script. Instead, the plan was revised to adopt the Latin script in three months. Schools began teaching Latin script in the fall of that year. Grammar lessons were cancelled until new grammar books were available. By November 1928, Latin script became law, and using Arabic script to represent Turkish became illegal.

Turkey made other changes besides the writing system, including replacing the Islamic calendar with the Gregorian calendar on January 1, 1927. The Islamic calendar is now used only for determining religious holidays. Sunday is the mandated secular Sabbath, rather than Friday, the traditional Muslim Sabbath.