Saturday, March 1, 2008

Problems Discovered when Profiling the OpenAFS Windows client

I have spent the last month analyzing the performance of the OpenAFS for Windows cache manager using the Sysinternal's Process Monitor's Profiling toolset. The results were quite eye opening. What I had believed was a highly parallelized code set instead was filled with bottlenecks that seriously hampered the ability to process data at high rates. What follows are some of the most significant issues that were uncovered. Some of the issues are specific to AFS, others are likely to be problems found in many other applications.

Reference Counts
Each of the objects maintained by the cache manager (data buffers, vnodes, cells, smb file handles, directory searches, users, etc) are reference counted in order to determine when they should be garbage collected or can be recycled. Reference counts must be incremented and decremented in a thread safe manner. Otherwise races between the threads when they update the reference count will result in the count becoming inconsistent. Objects will either be freed prematurely (undercounts) or never become available for recycling (overcount). Reference counts were therefore protected by the same read/write locks that protect the hash tables used to find enumerate the objects. The problem is that although a read lock can be used to safely traverse a hash table's link list, a write lock is required to safely update the reference count of the desired object once it is located. As a result, only one thread can be searching for objects or releasing them at a time.

If it were possible to adjust the reference count values in an atomic operation most of the hash table transactions that required write locks could use read locks instead. As it turns out, Windows supports Interlocked increment and decrement operations for aligned 32-bit and 64-bit values. By making use of the Interlocked operations reference counts are safely adjusted and parallel access hash table contents is permitted.

Network Reads
The AFS servers support Rx hot threads. As soon as a message is received by the listener thread, another thread is woken to listen for the next incoming message while the current thread becomes a worker to process the message. The AFS clients did not support Rx hot threads and therefore could only process a single incoming message at a time. By activating Rx hot threads in the AFS client the latency between received messages was significantly reduced.

Lock Churn
Throughout many code paths the same lock or mutex would often be released and re-obtained. Doing so increases the possibility that the current thread will be swapped out and an alternate thread activated. These context switches between threads are expensive and increase the overall clock time required to respond to a request. By refactoring the code it was possible to avoid many such transitions thereby improving overall performance.

Write-to-Read and Read-to-Write Lock Transitions
Similar to the previous case, there are many situations in when it is desirable to either downgrade or upgrade a read-write lock. Write-to-Read transitions are always safe to perform and can be done without forcing a context switch between threads in all cases. Read-to-Write transitions can be done without a context switch whenever the requesting thread is the only reader. Regardless of how often it is the case, a read-to-write transition will be cheaper than dropping the read lock and requesting a write-lock.

Equality comparisons must be cheap
The function used to determine if two File Identifiers are the same is one of the most frequently called functions. It is used every time a vnode or buffer must be located. As a result it must be fast. Instead of comparing each of the elements of a FID, the structure was extended with a hash value that can eliminate the vast majority of false matches with a single comparison. In addition, the function was inlined to avoid the function call overhead.

Do Not Perform Unnecessary Work
The AFS client has an extensive logging infrastructure which is disabled by default. However, it turns out that although the actual logging was disabled a majority of the work that is required to construct the log messages continued to be performed. This unnecessary work was a significant drain on resources and increased clock time for all operations.

Do Not Perform Unnecessary Work - Part II
When copying a file on top of an existing file, the first operation that is performed is to truncate the file. This results in the invalidation of all the cached data buffers associated with the file. The actual truncation is not sent to the file server until the first write completes which is not attempted until the first chunk size of data is ready to be sent. As a result, when the initial data buffers are being written to in the cache the cache manager believed that it must read their contents from the file server. If the pre-fetch criteria are met, additional data buffers would be queued as well. Performing these reads is useless work given the fact that the client will overwrite them or discard them once the truncation is sent to the file server. The answer of course was to check for the outstanding truncation when getting data buffers.

Do Not Perform Unnecessary Work - Part III
Acquiring mutexes and locks are expensive because they often result in the active thread giving up the rest of its allocated time slice and being forced to be rescheduled at a later time. Therefore, if there are locks that are not required to perform the current operation, they should not be acquired.

Do Not Sleep if it is Not Required
If the file server responds EAGAIN to an RPC, the cache manager will under most circumstances put the current thread to sleep and try again in a few seconds provided that the SMB redirector timeout limit has not been reached. There are several operations for which retries are not permitted which include background writes, lock acquisition, etc. Due to an architectural design flaw, the cache manager was putting threads to sleep even if retries were not permitted.

Setting Max MTU Size hurts
Back in 2003 it was discovered that the IPSec VPN products did a very poor job on interacting with AFS due to the reduction in the actual data payload in a UDP packet caused by the addition of the IPSec headers. Due to an ever increasing number of complaints to Help Desks and to OpenAFS stating that AFS didn't work it was decided that the OpenAFS installation packages on Windows would ship with the RxMaxMTU value set to 1260. At the time the performance of the cache manager was so bad that it was not possible to notice the difference. Unfortunately, now that the cache manager is better performing, setting RxMaxMTU to 1260 can result in a reduction in StoreData throughput of 50%

Avoid Modifying Objects Whenever Possible
Every vnode and every data buffer object contains a version number. Every time the vnode changes the file server increments the version number. Doing so automatically invalidates the contents of all caches forcing the clients to re-read the data from the file server. Reading from the file server is an expensive operation so we try to avoid it when we know that the current contents of the cache are already valid. We know that to be true when the cache manager performed the most recent change to the vnode and the version delta is one. Over the summer code was added that would bump the version number on all of the data buffers in this circumstance. However, this had the side effect that writes became slower as the file got larger. By maintaining a range of valid data versions instead of just the current data version, it is possible to maintain the benefits of the existing cached data at a cost that is independent of the file size.

Hash Algorithms Matter
The lock package uses an array of critical section objects to protect the internals of the mutex and read/write locks. Which critical section was used for which lock object was determined by hashing the memory address at which the lock was located. Unfortunately, the distribution of the objects was poor and some critical sections were used much more frequently than others. Worse was the fact that several highly used global locks shared the same critical sections.

No comments: