Saturday, August 2, 2008

OpenAFS for Windows with Unicode is Available

A couple of weeks ago OpenAFS for Windows with Unicode path name support was released.  I thought this was going to be a big deal.  The lack of Unicode support caused all sorts of problems for organizations that wanted to use roaming profiles and redirected folders.  Even more important, the vast majority of the world does not limit its writing to the characters represented in Windows OEM Code Pages 437 and 850.  For years those users could not save their data into AFS using the language of their choice.

Up to this point, 1.5.5x has had one of the slowest adoption rates of any OpenAFS for Windows release in the last five years.  Is this because it is summer?  Is it because most users are American and do not require Unicode?  Is it because everyone has given up on AFS?  I don't know.

What I do know is that the Unicode version has been downloaded (in small numbers) from a broad range of country-code top-level domains outside the United States, including Malaysia, Russia, Canada, Germany, Taiwan, Brazil, Hong Kong, Poland, Yugoslavia, Croatia, Japan, and Indonesia.  Hopefully, users from these countries will write in to describe how Unicode support has made their lives easier.


Wednesday, May 14, 2008

File System Internationalization sucks

Internationalization in file systems really sucks.  There are two perspectives in the world.  First, there are the POSIX proponents, who believe that names are simply nul-terminated octet sequences that have no meaning except to the application that created them.  Second, there are those who believe that names should be portable between systems and therefore should all be encoded in a common character set.  Let's call this second group the UNICODE camp.

I fall into the UNICODE camp.  This is most likely a side effect of having spent nearly fifteen years of my life working on Kermit, an application and file transfer protocol designed specifically to move files (by name) between computer systems using different architectures and locales.  I learned very early on that if you follow the POSIX approach, the end result of copying a file from an EBCDIC system to an ASCII system, or from a Latin-1 system to a CP437 system, is gibberish, not only for human beings but for applications as well.

A globally accessible file system such as AFS is in many regards similar to Kermit, except that instead of copying files into a local file system from a remote system, the AFS client makes the entire remote file system accessible to the local machine.  The exact same character set conversion issues occur.  As long as all of the file names are in the same character set, all is dandy, and applications on one machine can access files created on another.

But what happens when the character sets are different?  In that circumstance, the names become gibberish to humans and applications alike.  In the worst-case scenario, the file name as stored in the directory cannot even be represented on the local machine, because it contains code points that are illegal under the rules of the local environment.

This situation doesn't happen as often as it could, because most of the world is still storing only US-ASCII or ISO Latin-1 names in the file system.  Even with those restrictions, however, there are problems.  For example, the following characters are illegal in Windows file names:

  " / \ * ? < > | :

It doesn't matter what the underlying file system is.  If those characters appear in a name, the name is illegal, and Windows will not include it in a directory listing.  That in turn makes it impossible to see the file, access it, rename it, delete it, or delete the directory it is located in.  File systems that contain objects with such names must perform name translation before Windows users or applications can manipulate them.
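
To make "name translation" concrete, here is a minimal sketch in C of one possible mapping, percent-escaping each reserved character the way a URL encoder would.  This is illustrative only; it is not the mapping OpenAFS actually uses.

    #include <stdio.h>
    #include <string.h>

    /* Characters that may not appear in Windows file names. */
    static const char reserved[] = "\"/\\*?<>|:";

    /* Sketch: rewrite a server-side name into a legal Windows name
     * by escaping each reserved character as %XX.  For example,
     * map_name("a:b?c", buf, sizeof buf) yields "a%3Ab%3Fc". */
    static void map_name(const char *in, char *out, size_t outlen)
    {
        size_t o = 0;
        for (; *in != '\0' && o + 4 < outlen; in++) {
            if (strchr(reserved, *in))
                o += sprintf(out + o, "%%%02X", (unsigned char)*in);
            else
                out[o++] = *in;
        }
        out[o] = '\0';
    }

A real client must also apply the reverse mapping on the way back to the server, so that renames and deletes of translated names round-trip correctly.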

With the introduction of Unicode, another set of complications arises.  Unicode provides multiple semantically equivalent encodings of the same string, depending on whether composed or decomposed character sequences are used.  For historical reasons, Mac OS X stores its file names as the UTF-8 encoding of decomposed Unicode sequences, Microsoft Windows stores composed sequences, Linux also stores composed sequences, and the octet sequences for a given string can all be different.  A user who types the same string on all three platforms can therefore obtain a different octet sequence on each.  So much for interoperability.
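
To see the problem in octets, consider "é".  Composed (NFC) it is the single code point U+00E9; decomposed (NFD) it is U+0065 followed by the combining acute accent U+0301.  A tiny C program makes the difference visible (the byte values are standard UTF-8):

    #include <stdio.h>

    int main(void)
    {
        /* "é" composed (NFC): U+00E9 encodes in UTF-8 as C3 A9 */
        const unsigned char nfc[] = { 0xC3, 0xA9 };
        /* "é" decomposed (NFD): U+0065 U+0301 encodes as 65 CC 81 */
        const unsigned char nfd[] = { 0x65, 0xCC, 0x81 };
        unsigned int i;

        printf("NFC:");
        for (i = 0; i < sizeof nfc; i++) printf(" %02X", nfc[i]);
        printf("\nNFD:");
        for (i = 0; i < sizeof nfd; i++) printf(" %02X", nfd[i]);
        printf("\n");
        /* A byte-wise comparison, which is all a POSIX file system
         * performs, treats these canonically equivalent names as
         * two different files. */
        return 0;
    }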

The POSIX supporters claim that names must be treated as octet strings because the locale of two different processes on the same machine can differ.  All that tells me is that POSIX allows users to shoot themselves in the foot; it doesn't make the approach right.  Of course, the POSIX folks do have a point: if a UNIX system is incapable of communicating to the file system which character set is in use, how is the file system supposed to do something sane with the names to provide interoperability between heterogeneous environments?

Microsoft Windows has an advantage here: there is a single standard character set for the entire operating system and all of its file systems, Unicode.  As a result, a file system client on Windows can at least ensure that Unicode names are normalized for display and lookup, that all illegal characters are mapped to something legal, and that the strings communicated to the file server are the original directory entry names rather than the normalized names used locally.  This is the approach that will be taken as Unicode support is added to the OpenAFS for Windows client.
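
As a sketch of what normalized lookup can look like on Windows (illustrative, not the actual OpenAFS code): NormalizeString(), available on Windows Vista and on XP via the IDN Mitigation APIs, converts both the directory entry name and the name being looked up to the same form before comparison, so canonically equivalent names match.

    #include <windows.h>

    /* Sketch: normalize both names to composed form (NFC) before an
     * ordinal, case-insensitive comparison.  Error handling and
     * buffer-size negotiation are elided for brevity. */
    static BOOL NamesMatch(LPCWSTR a, LPCWSTR b)
    {
        WCHAR na[512], nb[512];

        NormalizeString(NormalizationC, a, -1, na, 512);
        NormalizeString(NormalizationC, b, -1, nb, 512);
        return CompareStringW(LOCALE_INVARIANT, NORM_IGNORECASE,
                              na, -1, nb, -1) == CSTR_EQUAL;
    }

Normalizing only for display and lookup, while sending the original octets over the wire, keeps the file server's view of the directory unchanged.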

Wednesday, March 12, 2008

OpenAFS joins Google Summer of Code 2008

Today OpenAFS submitted an application to take part in the 2008 Google Summer of Code.  OpenAFS project ideas are listed at http://www.openafs.org/gsoc.html.

Thanks to Asanka Herath, Matt Benjamin, Simon Wilkinson and Derrick Brashear for volunteering to be mentors to the next generation of OpenAFS developers.

Update (Monday, March 17, 2008): OpenAFS was accepted.

Tuesday, March 4, 2008

OpenAFS vs Norton Internet Security 2008

OpenAFS requires several rules to be set in order to work with Norton Internet Security 2008.

1. Under "Personal Firewall->Program Control" add a "Allow" rule for "C:\Program Files\OpenAFS\Client\Program\afsd_service.exe"
2. Do the same for "fs.exe", "aklog.exe", and other command line utilities if so desired.
3. Under "Personal->Firewall->Trust Control, Trusted tab", add a "Trusted" rule for "02-00-4C-4F-4F-50".
4. Under the "Personal Filewall->Advanced Settings" press the "Configure" button.
5. Add a new rule:
    "Allow", "Inbound", "Any computer", "Protocol: UDP", "Port 7001", and describe it as "AFS Callback Port".  Make it the first rule in the list.
6. Add a new rule:
    "Allow", "Outbound", "Any computer", "Protocol: UDP", "Port range: 7001-7008" and describe it as "AFS Server Ports".  Make it the second rule in the list.

Finally, double-check the configuration of the "Microsoft Loopback Adapter" labeled "AFS" in the Network Control Panel.   Make sure that "Internet Protocol (TCP/IP)" is checked, that "Client for Microsoft Networks" is checked, and that "File and Printer Sharing for Microsoft Networks" is not checked.

You should now be able to access "\\afs\all" in the Explorer Shell.

Sunday, March 2, 2008

I want my OpenAFS Windows client to be fast

There are a number of configuration knobs available for tuning the OpenAFS for Windows client.  The most important ones for throughput fall into two categories:

How much data can I cache?
CacheSize
Stats

How fast can I read and write?
BlockSize
ChunkSize
EnableSMBAsyncStore
SMBAsyncStoreSize
RxMaxMTU
SecurityLevel
TraceOption

All of these options are described in Appendix A of the Release Notes.  Here are the values I use:

CacheSize = 60 GB (64-bit), 1 GB (32-bit)
Stats = 120,000 (64-bit), 30,000 (32-bit)

BlockSize = 4
ChunkSize = 21 (2MB)
EnableSMBAsyncStore = 1
SMBAsyncStoreSize = 262144 (but would use 1MB if I didn't use cellular networks as often)
RxMaxMTU = 9000
SecurityLevel = 1 (when I need speed I use "fs setcrypt" to adjust on the fly)
TraceOption = 0 (no logging)
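
These options are registry values.  Here is a sketch of the 32-bit profile above expressed as a .reg file; the value names match Appendix A, but verify the location (assumed here to be HKLM\SOFTWARE\OpenAFS\Client) and the units (CacheSize in 1KB blocks, ChunkSize as a power of two) against the Release Notes before importing:

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SOFTWARE\OpenAFS\Client]
    ; 1 GB cache in 1KB units (0x100000 = 1,048,576)
    "CacheSize"=dword:00100000
    ; 30,000 status cache entries (0x7530)
    "Stats"=dword:00007530
    "BlockSize"=dword:00000004
    ; 2^21 bytes = 2 MB chunks (0x15 = 21)
    "ChunkSize"=dword:00000015
    "EnableSMBAsyncStore"=dword:00000001
    ; 262,144 bytes (0x40000)
    "SMBAsyncStoreSize"=dword:00040000
    ; 9000 bytes (0x2328)
    "RxMaxMTU"=dword:00002328
    "SecurityLevel"=dword:00000001
    "TraceOption"=dword:00000000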


Saturday, March 1, 2008

Problems Discovered when Profiling the OpenAFS Windows client

I have spent the last month analyzing the performance of the OpenAFS for Windows cache manager using the profiling events in Sysinternals Process Monitor. The results were quite eye-opening. What I had believed was a highly parallelized code base was instead filled with bottlenecks that seriously hampered its ability to process data at high rates. What follows are some of the most significant issues that were uncovered. Some are specific to AFS; others are likely to be problems found in many other applications.

Reference Counts
Each of the objects maintained by the cache manager (data buffers, vnodes, cells, SMB file handles, directory searches, users, etc.) is reference counted in order to determine when it can be garbage collected or recycled. Reference counts must be incremented and decremented in a thread-safe manner; otherwise, races between threads updating the count will leave it inconsistent, and objects will either be freed prematurely (undercount) or never become available for recycling (overcount). Reference counts were therefore protected by the same read/write locks that protect the hash tables used to find and enumerate the objects. The problem is that although a read lock can be used to safely traverse a hash table's linked list, a write lock is required to safely update the reference count of the desired object once it is located. As a result, only one thread at a time could search for objects or release them.

If it were possible to adjust the reference counts atomically, most of the hash table transactions that required write locks could use read locks instead. As it turns out, Windows supports Interlocked increment and decrement operations on aligned 32-bit and 64-bit values. By making use of the Interlocked operations, reference counts are adjusted safely and parallel access to the hash table contents is permitted.
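
A sketch of the pattern, with hypothetical names; the point is that holding and releasing an object no longer requires the hash table's write lock, only an atomic Interlocked operation:

    #include <windows.h>

    typedef struct object {
        struct object *nextp;       /* hash chain */
        volatile LONG  refCount;
    } object_t;

    static void obj_Hold(object_t *op)
    {
        InterlockedIncrement(&op->refCount);
    }

    static void obj_Release(object_t *op)
    {
        if (InterlockedDecrement(&op->refCount) == 0) {
            /* Count reached zero: the object may now be recycled. */
        }
    }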

Network Reads
The AFS servers support Rx hot threads: as soon as a message is received by the listener thread, another thread is woken to listen for the next incoming message while the current thread becomes a worker and processes the message it received. The AFS clients did not support Rx hot threads and therefore could only process a single incoming message at a time. Activating Rx hot threads in the AFS client significantly reduced the latency between received messages.
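
A sketch of the hot-thread handoff using hypothetical names and Windows primitives (the real Rx code differs):

    #include <windows.h>

    /* Hypothetical stubs standing in for the Rx packet functions. */
    extern void *ReadNextPacket(void);       /* blocking receive */
    extern void  ProcessPacket(void *pkt);   /* RPC dispatch */

    static CRITICAL_SECTION poolLock;   /* InitializeCriticalSection() at startup */
    static CONDITION_VARIABLE listenerCV = CONDITION_VARIABLE_INIT;
    static int listenerNeeded = 1;

    static DWORD WINAPI HotThread(LPVOID arg)
    {
        void *pkt;
        (void)arg;
        for (;;) {
            /* Wait for our turn to be the listener. */
            EnterCriticalSection(&poolLock);
            while (!listenerNeeded)
                SleepConditionVariableCS(&listenerCV, &poolLock, INFINITE);
            listenerNeeded = 0;
            LeaveCriticalSection(&poolLock);

            pkt = ReadNextPacket();

            /* Promote another thread to listener before doing the work. */
            EnterCriticalSection(&poolLock);
            listenerNeeded = 1;
            WakeConditionVariable(&listenerCV);
            LeaveCriticalSection(&poolLock);

            ProcessPacket(pkt);    /* this thread is now a worker */
        }
        return 0;
    }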

Lock Churn
Throughout many code paths, the same lock or mutex would be released and re-obtained, often several times. Each release increases the possibility that the current thread will be swapped out and an alternate thread activated. These context switches between threads are expensive and increase the overall clock time required to respond to a request. By refactoring the code to avoid many of these transitions, overall performance was improved.
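
A before-and-after fragment, with hypothetical helpers in the style of the client's osi lock package:

    /* Before: the mutex is dropped and immediately re-acquired,
     * inviting a context switch between the two operations.
     * (scp, UpdateFlags, and UpdateLength are illustrative.) */
    lock_ObtainMutex(&scp->mx);
    UpdateFlags(scp);
    lock_ReleaseMutex(&scp->mx);

    lock_ObtainMutex(&scp->mx);
    UpdateLength(scp);
    lock_ReleaseMutex(&scp->mx);

    /* After: one acquisition covers both updates. */
    lock_ObtainMutex(&scp->mx);
    UpdateFlags(scp);
    UpdateLength(scp);
    lock_ReleaseMutex(&scp->mx);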

Write-to-Read and Read-to-Write Lock Transitions
Similar to the previous case, there are many situations in which it is desirable to either downgrade or upgrade a read/write lock. Write-to-read transitions are always safe to perform and never force a context switch between threads. Read-to-write transitions can be done without a context switch whenever the requesting thread is the only reader; and however often that turns out to be the case, a read-to-write transition is still cheaper than dropping the read lock and requesting the write lock.
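
An illustrative fragment in the style of the client's osi lock package (the helper functions and the use of scp are hypothetical):

    /* Read first, upgrade only if a modification is needed. */
    lock_ObtainRead(&scp->rw);
    if (NeedsUpdate(scp)) {
        lock_ConvertRToW(&scp->rw);   /* upgrade without releasing */
        ApplyUpdate(scp);
        lock_ConvertWToR(&scp->rw);   /* downgrade: never blocks */
    }
    lock_ReleaseRead(&scp->rw);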

Equality comparisons must be cheap
The function used to determine whether two File Identifiers (FIDs) are the same is one of the most frequently called functions in the cache manager; it is used every time a vnode or buffer must be located, so it must be fast. Instead of comparing each of the elements of a FID, the structure was extended with a hash value that eliminates the vast majority of mismatches with a single comparison. In addition, the function was inlined to avoid the function call overhead.
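
A self-contained sketch of the technique; the layout is modeled on the cache manager's FID, but the details here are illustrative:

    typedef unsigned int afs_uint32;

    /* FID with a precomputed hash; the hash is calculated once,
     * when the FID is constructed. */
    typedef struct cm_fid {
        afs_uint32 cell, volume, vnode, unique;
        afs_uint32 hash;
    } cm_fid_t;

    /* Returns 0 when equal.  The hash comparison rejects almost all
     * mismatches before the field-by-field check, and inlining
     * removes the function call overhead. */
    static __inline int cm_FidCmp(const cm_fid_t *a, const cm_fid_t *b)
    {
        if (a->hash != b->hash)
            return 1;                  /* definitely different */
        return !(a->cell == b->cell && a->volume == b->volume &&
                 a->vnode == b->vnode && a->unique == b->unique);
    }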

Do Not Perform Unnecessary Work
The AFS client has an extensive logging infrastructure that is disabled by default. However, it turned out that although the actual logging was disabled, the majority of the work required to construct the log messages was still being performed. This unnecessary work was a significant drain on resources and increased clock time for all operations.
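
The fix is to guard message construction itself, not just the output. A sketch with hypothetical names (the real client has its own logging macros):

    /* A disabled log costs exactly one branch, because the
     * arguments are never evaluated or formatted. */
    extern int  logEnabled;
    extern void LogPrintf(const char *fmt, ...);

    #define LOG(...) \
        do { if (logEnabled) LogPrintf(__VA_ARGS__); } while (0)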

Do Not Perform Unnecessary Work - Part II
When copying a file on top of an existing file, the first operation performed is to truncate the file, which invalidates all of the cached data buffers associated with it. The actual truncation is not sent to the file server until the first write completes, and the first write is not attempted until a full chunk of data is ready to be sent. As a result, while the initial data buffers were being written into the cache, the cache manager believed it had to read their contents from the file server, and if the pre-fetch criteria were met, reads of additional data buffers were queued as well. These reads are useless work, given that the client will overwrite or discard the data once the truncation reaches the file server. The answer, of course, was to check for an outstanding truncation when obtaining data buffers.
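
A sketch of the check, with hypothetical names:

    /* Skip the fetch when a pending truncation makes the buffer's
     * current server contents irrelevant. */
    if ((scp->mask & TRUNC_PENDING) &&
        bufp->offset >= scp->truncPos) {
        memset(bufp->datap, 0, blockSize);     /* zero-fill locally */
        bufp->dataVersion = scp->dataVersion;  /* mark it current */
    } else {
        code = FetchBufferFromServer(scp, bufp);
    }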

Do Not Perform Unnecessary Work - Part III
Acquiring mutexes and locks is expensive because it often results in the active thread giving up the rest of its allocated time slice and being rescheduled at a later time. Therefore, locks that are not required to perform the current operation should not be acquired.

Do Not Sleep if it is Not Required
If the file server responds to an RPC with EAGAIN, the cache manager will under most circumstances put the current thread to sleep and try again a few seconds later, provided that the SMB redirector timeout limit has not been reached. There are several operations for which retries are not permitted, including background writes and lock acquisition. Due to an architectural design flaw, the cache manager was putting threads to sleep even when retries were not permitted.
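
A sketch of the corrected logic, with hypothetical names:

    /* Sleep and retry only when the request actually permits it. */
    if (code == ERROR_RETRY_LATER) {
        if (reqp->flags & REQ_NORETRY)
            return code;                /* retries forbidden: fail fast */
        if (ElapsedMs(reqp) > SMB_TIMEOUT_MS)
            return ERROR_TIMED_OUT;     /* the redirector has given up */
        Sleep(RETRY_DELAY_MS);
        goto retry;
    }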

Setting Max MTU Size hurts
Back in 2003 it was discovered that IPSec VPN products interacted very poorly with AFS because the IPSec headers reduce the data payload available in each UDP packet. Due to an ever increasing number of complaints to help desks and to OpenAFS stating that AFS didn't work, it was decided that the OpenAFS installation packages for Windows would ship with RxMaxMTU set to 1260. At the time the performance of the cache manager was so poor that the difference could not be noticed. Unfortunately, now that the cache manager performs better, setting RxMaxMTU to 1260 can reduce StoreData throughput by 50%.

Avoid Modifying Objects Whenever Possible
Every vnode and every data buffer object contains a data version number, and every time the vnode changes, the file server increments it. Doing so automatically invalidates the contents of all caches, forcing the clients to re-read the data from the file server. Reading from the file server is an expensive operation, so we try to avoid it when we know the current contents of the cache are already valid. We know that to be true when the cache manager itself performed the most recent change to the vnode and the version delta is one. Over the summer, code was added that bumped the version number on all of the data buffers in this circumstance; however, it had the side effect that writes became slower as the file grew larger. By maintaining a range of valid data versions instead of just the current data version, it is possible to retain the benefits of the existing cached data at a cost that is independent of the file size.
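
A sketch of the version-range idea, with hypothetical field names:

    typedef long long afs_int64;

    /* Track the oldest version that still describes valid cached
     * data rather than restamping every buffer on each local change. */
    typedef struct scache {
        afs_int64 dataVersion;        /* current version of the vnode */
        afs_int64 bufDataVersionLow;  /* oldest version known valid */
    } scache_t;

    /* O(1) regardless of file size: a buffer is valid if its version
     * falls within [bufDataVersionLow, dataVersion]. */
    static int BufferIsValid(const scache_t *scp, afs_int64 bufVersion)
    {
        return bufVersion >= scp->bufDataVersionLow &&
               bufVersion <= scp->dataVersion;
    }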

Hash Algorithms Matter
The lock package uses an array of critical section objects to protect the internals of its mutexes and read/write locks. Which critical section was used for which lock object was determined by hashing the memory address at which the lock was located. Unfortunately, the distribution of objects across the array was poor: some critical sections were used much more frequently than others, and worse, several heavily used global locks shared the same critical sections.
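
A sketch of the difference, with illustrative constants:

    #include <stdint.h>

    #define CS_COUNT 64u   /* size of the critical section array (illustrative) */

    /* Naive: heap addresses share their low bits because of allocator
     * alignment, so a plain modulus clusters onto a few sections. */
    static unsigned int NaiveHash(const void *p)
    {
        return (unsigned int)((uintptr_t)p % CS_COUNT);
    }

    /* Better: discard the alignment bits, then mix what remains with
     * Fibonacci (multiplicative) hashing before reducing. */
    static unsigned int BetterHash(const void *p)
    {
        uintptr_t v = (uintptr_t)p >> 4;        /* drop alignment bits */
        return (unsigned int)((v * 2654435761u) >> 16) % CS_COUNT;
    }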