Wednesday, May 14, 2008

File System Internationalization sucks

Internationalization in file systems really sucks.  There are two perspectives in the world.  First, there are the POSIX proponents, who believe that names are simply nul-terminated octet sequences with no meaning except to the application that created them.  Second, there are those who believe that names should be portable between systems and therefore should all be encoded in a common character set.  Let's call this second group of folks the UNICODE camp.

I fall into the UNICODE camp.  This is most likely a side effect of having spent nearly fifteen years of my life working on Kermit, an application and file transfer protocol designed specifically to move files (by name) between computer systems using different architectures and locales.  I learned very early on that if you follow the POSIX approach, the end result when a file is copied from an EBCDIC system to an ASCII system, or from a Latin-1 system to a CP437 system, is gibberish, not only to human beings but to applications as well.

A globally accessible file system such as AFS is in many regards similar to Kermit, except that instead of copying files into a local file system from a remote system, the AFS client makes the entire remote file system accessible to the local machine.  The exact same character set conversion issues occur.  As long as all of the file names are in the same character set, all is dandy, and applications on one machine can access files created on another machine.

But what happens when the character sets are different?  In that circumstance, the names become gibberish to humans and applications.  In a worst case scenario, the file name as stored in the directory cannot even be represented on the local machine because the file name contains illegal code points according to the rules of the local environment.

This situation doesn't happen as frequently as it could because most of the world is still storing only US-ASCII or ISO Latin-1 in the file system.  However, even with those restrictions there are still problems.  For example, the following characters are illegal in Windows file names:

  " / \ * ? < > | :

It doesn't matter what the underlying file system is.  If any of those characters appear in a name, the name is illegal, and it will not be included in a directory listing.  This in turn means it is impossible to see the file, access the file, rename the file, delete the file, or delete the directory the file is located in.  File systems that contain objects with such names must perform name translation in order for Windows users or applications to be able to manipulate them.
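To make the translation idea concrete, here is a minimal sketch of the kind of reversible mapping such a file system client might perform.  The particular escape scheme, shifting each reserved character into the Unicode private-use area, is one illustrative choice (Services for Macintosh used a similar trick), not the scheme any particular client actually implements.

```python
# Reversible name translation sketch: map characters that are illegal
# in Windows file names into the Unicode private-use area (U+F000 + code)
# so they can be displayed and manipulated, then mapped back.
WINDOWS_RESERVED = set('"/\\*?<>|:')

def to_windows_name(name):
    """Replace reserved characters with private-use code points."""
    return ''.join(
        chr(0xF000 + ord(c)) if c in WINDOWS_RESERVED else c
        for c in name
    )

def from_windows_name(name):
    """Reverse the mapping to recover the original octet name.
    (A real client would have to avoid colliding with genuine
    private-use characters; this sketch ignores that.)"""
    return ''.join(
        chr(ord(c) - 0xF000) if 0xF000 <= ord(c) <= 0xF0FF else c
        for c in name
    )

mapped = to_windows_name('notes: draft?')
assert not (set(mapped) & WINDOWS_RESERVED)       # now legal on Windows
assert from_windows_name(mapped) == 'notes: draft?'  # round-trips
```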

With the introduction of Unicode, another set of complications is introduced.  Unicode provides multiple semantically equivalent encodings of the same string, depending on whether composed or decomposed sequences are used.  For historical reasons, Mac OS X stores its file names as the UTF-8 encoding of decomposed Unicode sequences, Microsoft Windows stores composed sequences, Linux also stores composed sequences, and the stored sequences for a given string can all be different.  That means that a user who types the same string on all three platforms can obtain a different octet sequence on each platform.  So much for interoperability.
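The composed/decomposed problem is easy to see with a few lines of Python: the same visible string yields different UTF-8 octet sequences depending on the normalization form used.

```python
# The same visible string, "résumé", as composed (NFC) and
# decomposed (NFD) Unicode, encoded to UTF-8 octets.
import unicodedata

name = 'r\u00e9sum\u00e9'                  # "résumé" with precomposed é
nfc = unicodedata.normalize('NFC', name)   # composed (Windows/Linux style)
nfd = unicodedata.normalize('NFD', name)   # decomposed (Mac OS X style)

print(nfc.encode('utf-8'))  # b'r\xc3\xa9sum\xc3\xa9'    (8 octets)
print(nfd.encode('utf-8'))  # b're\xcc\x81sume\xcc\x81'  (10 octets)

# Semantically the same string to a human, but unequal as octet
# strings, so a byte-for-byte directory lookup will fail:
assert nfc.encode('utf-8') != nfd.encode('utf-8')
assert unicodedata.normalize('NFC', nfd) == nfc
```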

The POSIX supporters claim that names must be treated as octet strings because the locales of two different processes on the same machine can differ.  All that tells me is that POSIX allows users to shoot themselves in the foot; it doesn't mean it is right.  Of course, the POSIX folks do have a point: if a UNIX system is incapable of communicating the character set in use to the file system, how is the file system supposed to do something sane with it to provide interoperability between heterogeneous environments?

Microsoft Windows has an advantage here in that there is a standard character set for the entire operating system and all file systems: Unicode.  As a result, a file system client on Windows can at least ensure that Unicode names are normalized on output, that directory entry names are normalized for display and lookup, that all illegal characters are mapped to something legal, and that all strings communicated to the file server are the original directory entry names rather than the normalized names used locally.  This is the approach that will be taken as Unicode support is added to the OpenAFS for Windows client.
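The lookup side of that approach can be sketched in a few lines: index directory entries by a normalized key, but keep the server's original octet names untouched for anything sent back over the wire.  The class and names below are illustrative only, not OpenAFS code.

```python
# Sketch: normalize names for lookup, preserve originals for the wire.
import unicodedata

class DirectoryIndex:
    """Maps normalized lookup keys to the original server-side names."""

    def __init__(self):
        self._by_key = {}  # normalized key -> original server name

    @staticmethod
    def _key(name):
        # NFC chosen arbitrarily; any single form works as long as
        # both sides of every comparison use the same one.
        return unicodedata.normalize('NFC', name)

    def add(self, server_name):
        self._by_key[self._key(server_name)] = server_name

    def lookup(self, typed_name):
        # Return the ORIGINAL name to send to the server, or None.
        return self._by_key.get(self._key(typed_name))

# A name as a Mac OS X client might have stored it (decomposed)...
decomposed = unicodedata.normalize('NFD', 'caf\u00e9')
idx = DirectoryIndex()
idx.add(decomposed)

# ...is found by a lookup typed in composed form on Windows, and the
# original octet sequence is what goes back to the file server.
composed = unicodedata.normalize('NFC', 'caf\u00e9')
assert idx.lookup(composed) == decomposed
```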