W3C GEO Task Force
Glossary of Internationalization Terminology for the World Wide Web

DRAFT Version

NOTE: Links to other websites will appear in a new browser window.

A list of terms and definitions related to internationalization and localization in the web environment. The list is created and maintained by the W3C I18N GEO Task Force.

The glossary supports HTML Techniques (Draft).

If you would like to link to any of the definitions on this page, you can find the appropriate fragment identifier (i.e. the link's anchor name) by placing your mouse over the term. The identifier will appear in a tooltip. Append a hash mark ("#") and the identifier to the URL for this page: www.i18nguy.com/markup/i18n-glossary.html.

For example, place your mouse over the term "i18n". The tooltip shows that its identifier is also "i18n". (Most terms use their name as their identifier.) To link to the "i18n" entry, therefore, specify: www.i18nguy.com/markup/i18n-glossary.html#i18n
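
The linking rule above could be sketched in Python as follows. This is a minimal illustration only: the helper name glossary_link is hypothetical, and the sketch does nothing beyond what the text describes (append "#" and the identifier to the page URL).

```python
# URL of this page, exactly as given in the text above.
GLOSSARY_URL = "www.i18nguy.com/markup/i18n-glossary.html"


def glossary_link(identifier: str) -> str:
    """Append "#" and the fragment identifier to the glossary URL."""
    # Tolerate an identifier that was copied with its leading hash mark.
    return GLOSSARY_URL + "#" + identifier.lstrip("#")


if __name__ == "__main__":
    print(glossary_link("i18n"))
    # prints: www.i18nguy.com/markup/i18n-glossary.html#i18n
```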

W3C I18N GEO Task Force Home Page

I18nGuy home page

Editor's notes:

Suggested additions: BDO, LRE, PDF, LRM, ZWNJ. Do we want to document Unicode control characters?

Glossary of Internationalization Terminology for the World Wide Web
Term	Notes
abjad	A type of writing system where only consonants are generally written.
abugida	A type of writing system whose basic characters denote consonants followed by a particular vowel, and in which diacritics denote the other vowels.
ANSI	American National Standards Institute. Microsoft's collective name for all or any Windows code pages. (As in "ANSI code page".) Sometimes used specifically for code page 1252, which is a superset of ISO/IEC 8859-1.
ASCII	American Standard Code for Information Interchange. The U.S. national variant of ISO/IEC 646.
bidi	Internationalization industry jargon. Abbreviation for bidirectional text.
Bidirectional text	Also abbreviated as "bidi". Text that is primarily written from right to left and is often mixed with left-to-right text. Examples include text written in the Hebrew and Arabic scripts.
Basic Multilingual Plane (BMP)	TBD
BMP	Basic Multilingual Plane
BOM	Byte Order Mark, U+FEFF. Also used as a Unicode character encoding signature.
byte order mark	U+FEFF, also known as BOM and ZWNSP. Also used as Character Encoding Signature for Unicode encodings (UTF-8, UTF-16, et al.)
character	A member of a set of elements used for the organization, control, or representation of data. For example, "LATIN CAPITAL LETTER A" names a character.
character encoding	TBD
character entity	TBD
character set	TBD
charset	TBD
character encoding signature	TBD
character escape	TBD
character repertoire	A set of characters (in the mathematical sense)
coded character set	TBD
code point	TBD
compatibility character	TBD
complex script	TBD
DBCS	Double-Byte Character Set. A specific type of MBCS: a character encoding in which each character is represented by either one or two bytes. Sometimes DBC is used for double-byte character.
diacritic	TBD
document character set	TBD
escape	see "character escape"
fragment	TBD
GEO	W3C abbreviation for Guidelines, Education, and Outreach. See www.w3.org/International/geo/
glyph	TBD
goober	A type of consideration for the internationalization of software or Web applications due to local legal, regulatory, or other governmental requirements. See Web Services Internationalization Usage Scenarios, Section 4.15 Legal and Regulatory Goobers
HTTP	HyperText Transfer Protocol
HTTP header	TBD
i18n	Abbreviation. See internationalization. Also see "Origin of the abbreviation i18n".
IANA	Internet Assigned Numbers Authority www.iana.org
IANA Charset Registry	Registry for character encodings used by MIME, Web standards, and others. www.iana.org/assignments/character-sets
internationalization	Designing software to be usable around the world.
IRI	W3C acronym for Internationalized Resource Identifier, an internationalized form of URI. See www.w3.org/International/O-URL-and-ident.
MBCS	Multi-Byte Character Set. A type of character encoding where characters are of varying byte length. In some encodings, for example, a character may be encoded as 1, 2, 3 or 4 bytes.
MIME type	TBD
mojibake (文字化け)	Japanese jargon for any of "garbage", "changed", "ghost" or "disguised" characters, or for what is shown when Japanese characters are not displayed correctly (various black boxes or other nonsense characters). Here are some examples that look like mojibake: █ █ (You should see some black boxes.) There can also be white boxes: █ █ or ǶǶǶ. In Japan, these are sometimes called "tofu".
NCR	Numeric Character Reference. (See HTML specification.)
NFC	Unicode acronym for Normalization Form C
NLS	Software industry abbreviation for National Language Support. General term referring to the features, libraries, and related data that support internationalization within an operating system or product. Example usage: "NLS Library".
normalization	Unicode term for converting equivalent character sequences to a single, consistent representation. See NFC.
quirks mode	TBD
PUA	Abbreviation for Unicode term: Private Use Area
SBCS	Single-Byte Character Set. Some vendors refer to this as a code page. A character encoding where each character is represented by one 8-bit value. Sometimes SBC is used for single-byte character.
standards mode	TBD
supplementary character	TBD
tofu (豆腐)	Japanese jargon for the white box character that is displayed by default for an unassigned or unknown character. For example: Ƕ. See mojibake
transcoding	TBD
UCS	Abbreviation for Unicode term: Universal Character Set which is specified by International Standard ISO/IEC 10646. Sometimes also used as Unicode Character Standard.
Unicode	Unicode Character Standard (UCS), Universal Character Set. See Unicode Consortium. Also see ISO 10646.
user agent (UA)	TBD
UTF	Abbreviation for the Unicode term Unicode Transformation Format. Also see UTF-8, UTF-16, UTF-16LE, UTF-16BE, UTF-32, UTF-32BE, UTF-32LE.
virama	TBD
W3C	Abbreviation for World Wide Web Consortium. See www.w3.org
WAI	W3C abbreviation for Web Accessibility Initiative. See www.w3.org/WAI/
XML	eXtensible Markup Language
XML declaration	TBD
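ZWNSP	Zero Width No-Break Space, U+FEFF. Its use as a zero-width no-break space is deprecated (U+2060 WORD JOINER is preferred for that purpose); the same code point serves as the Byte Order Mark.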
Å, Å	The symbol for Ångstrom (U+212B) and the letter A-ring (U+00C5, or U+0041 and U+030A - A and Combining Ring Above). Scandinavian alphabets sort the letter A-ring after the letter Z.
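
Several of the entries above (byte order mark, DBCS, MBCS, NCR, NFC, mojibake, and the Å entry) can be illustrated with a short Python sketch. It is a minimal illustration, assuming Python 3 and its standard codecs. The helper names sniff_signature, byte_length and to_ncr are hypothetical, chosen for this example, and do not come from any W3C or Unicode document. Shift_JIS stands in as an example of a DBCS and UTF-8 as an example of an encoding with 1 to 4 bytes per character.

```python
import codecs
import unicodedata

# Character encoding signatures: U+FEFF as serialized by each UTF.
# UTF-32 is checked before UTF-16 because the UTF-32LE signature
# (FF FE 00 00) begins with the UTF-16LE signature (FF FE).
SIGNATURES = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_signature(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name
    return None


def byte_length(ch: str, encoding: str) -> int:
    """Number of bytes a single character occupies in the given encoding."""
    return len(ch.encode(encoding))


def to_ncr(text: str) -> str:
    """Replace each non-ASCII character with a decimal NCR such as &#233;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # byte order mark used as an encoding signature
    print(sniff_signature(codecs.BOM_UTF8 + b"abc"))          # UTF-8
    print(sniff_signature(b"abc"))                            # None

    # DBCS: one or two bytes per character (Shift_JIS as the example)
    print(byte_length("A", "shift_jis"), byte_length("あ", "shift_jis"))  # 1 2

    # MBCS with 1 to 4 bytes per character (UTF-8 as the example)
    print([byte_length(c, "utf-8") for c in "Aé文\U0001d11e"])  # [1, 2, 3, 4]

    # NCR for a character outside ASCII
    print(to_ncr("café"))                                     # caf&#233;

    # NFC: A + combining ring above, and the Angstrom sign, both become U+00C5
    print(unicodedata.normalize("NFC", "A\u030a") == "\u00c5")  # True
    print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")   # True

    # One common cause of mojibake: bytes decoded with the wrong encoding
    print("é".encode("utf-8").decode("latin-1"))              # Ã©
```

The signature check is a heuristic: data without a BOM returns None, and a UTF-16LE stream whose first character after the BOM is U+0000 would be reported as UTF-32LE.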
