Filenames Must Be Made International, Portable and Interoperable

A call to action to make filenames international, Unicode-based, portable, and interoperable.

At the October 2009 Internationalization and Unicode Conference (IUC33), I18nGuy (Tex Texin) presented a paper calling for action on a growing problem: filenames that are neither portable nor interoperable.

The presentation is titled "Honey, My Unicode Disk Storage Went into the Circular File". Its abstract appears below. The link is to a downloadable Adobe Acrobat version of the presentation (1.7 MB PDF).

Proposal for International File Paths and Names

The industry defined Internationalized Resource Identifiers (IRI, RFC 3987) as a layer on top of Uniform Resource Identifiers (URI, RFC 3986, which obsoletes RFC 2396) to provide international identifiers on top of ASCII identifiers. It may be time to create a similar standard for International File Paths. Such paths would use Unicode characters to label physical storage, so that users can create and retrieve data in a uniform manner that is independent of media, operating system, programming language, locale, and so on. This model is similar to the W3C Reference Processing Model for abstracting character data as Unicode characters.

Requirements for File Paths and Names

Internationalized File Paths and Names should:

Abstract for "Honey, My Unicode Disk Storage Went into the Circular File"

This session will present the difficulties of providing a common international interface to file services on different operating systems. Although Unicode supports all the necessary characters, identifying the set of characters that are legitimate on any given OS can be difficult. Rules for case insensitivity, normalization, locales, directory search, wildcards, and many other factors vary. They may even vary by user and change from day to day. The presentation will describe the problem space and suggest some abstractions for solving these problems.
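As a small illustration of the normalization and case rules mentioned above, the following Python sketch shows two spellings of the same visible name that differ as code point sequences. The helper `names_match` is hypothetical, and the choice of NFC plus `casefold()` is only one possible comparison policy, not a description of what any particular operating system does.

```python
import unicodedata

def names_match(a: str, b: str, *, case_insensitive: bool = False) -> bool:
    """Compare two file names after normalizing both to NFC.

    Hypothetical helper: real file systems apply their own rules,
    which may differ from this policy.
    """
    a_norm = unicodedata.normalize("NFC", a)
    b_norm = unicodedata.normalize("NFC", b)
    if case_insensitive:
        # Case folding can denormalize some strings, so normalize again.
        a_norm = unicodedata.normalize("NFC", a_norm.casefold())
        b_norm = unicodedata.normalize("NFC", b_norm.casefold())
    return a_norm == b_norm

composed = "caf\u00e9.txt"      # "é" as a single code point
decomposed = "cafe\u0301.txt"   # "e" followed by a combining acute accent

print(composed == decomposed)              # raw comparison of code points
print(names_match(composed, decomposed))   # comparison after NFC
print(names_match("CAF\u00c9.TXT", decomposed, case_insensitive=True))
```

A program that stores one spelling and later looks up the other will succeed or fail depending on whether the underlying system normalizes names, which is the kind of variation the abstract describes.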

The problem is getting worse today. Programming languages now use Unicode strings. (Hooray!) Programmers treat filenames as strings, but the names are actually binary data. Languages like Python are creating (dangerous) proprietary solutions to address this difference. Other languages may begin designing their own (broken) solutions.
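The gap between string filenames and byte filenames can be sketched in Python. The abstract does not say which Python mechanism it has in mind; the example below assumes the `surrogateescape` error handler (PEP 383), which maps undecodable bytes to lone surrogate code points. The filename bytes are made up for illustration.

```python
# A file name as it might be stored on disk: Latin-1 bytes, not valid UTF-8.
raw_name = b"report-\xe9.txt"

# Strict decoding fails, because the name is binary data rather than text.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The surrogateescape handler smuggles the bad byte 0xE9 through as U+DCE9.
as_text = raw_name.decode("utf-8", "surrogateescape")

# The original bytes can be recovered with the same handler...
round_trip = as_text.encode("utf-8", "surrogateescape")

# ...but the string is not valid Unicode text for any other consumer.
try:
    as_text.encode("utf-8")
    portable = True
except UnicodeEncodeError:
    portable = False

print(strict_ok, repr(as_text), round_trip == raw_name, portable)
```

The resulting string round-trips inside Python, but it contains a lone surrogate that other languages, protocols, and file formats are not obliged to accept, which illustrates why a language-specific convention does not make the name interoperable.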

The industry needs to pay attention to this problem and define a way forward.