Monday, March 25, 2013

IOZone Performance Measurements of OpenAFS

The I/O processing pathways were rewritten for the OpenAFS 1.7.22 release. One industry standard method of measuring I/O performance in a file system independent manner is the iozone benchmark developed and maintained by Don Capps of NetApp. 

This blog post will compare the iozone results for OpenAFS 1.5.75 which uses the SMB to AFS gateway service and OpenAFS 1.7.23 which uses the new AFS redirector. 

The test environment includes a Lenovo Thinkpad W701ds workstation running Win7-64 as the client system. 8GB RAM, dual Core i7 x920 2.00GHz processors (8 cores total), Windows Experience ratings: 
  • Processor: 7.2 
  • Memory: 7.4 
  • Graphics: 5.8 
  • Gaming: 6.5 
  • Disk: 5.9 
The connection to the file server is a 1Gbit wired network through a 10Gbit switch.  The file server is OSX 10.6.8 Server running on a 2010 Mini Server using iSCSI attached storage sharing a single 1Gbit network interface.  The OpenAFS file server is version 1.6.2 using Demand Attach. The AFS cache manager configuration includes: 
  • BlockSize 1 (4KB) 
  • CacheSize 0x200000 (2GB) 
  • ChunkSize 21 (2MB) 
  • RxUdpBufSize 0xc00000 
 All iozone tests were performed using "-Rac output.wks -g 2G".  
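
The cache manager values listed above are easier to reason about once converted to bytes. The following is a minimal sketch, assuming the units implied by the parenthetical values in the list (CacheSize counted in 1KB units, ChunkSize stored as a power-of-two exponent, BlockSize counted in 4KB units); the function names are hypothetical and are not part of OpenAFS.

```c
#include <stdint.h>

/* CacheSize is assumed to be expressed in units of 1KB. */
uint64_t cache_size_bytes(uint32_t cache_size_kb)
{
    return (uint64_t)cache_size_kb * 1024u;
}

/* ChunkSize is assumed to be the base-2 logarithm of the chunk size. */
uint64_t chunk_size_bytes(unsigned chunk_size_log2)
{
    return (uint64_t)1 << chunk_size_log2;
}

/* BlockSize is assumed to be expressed in units of 4KB. */
uint64_t block_size_bytes(unsigned block_size_4k_units)
{
    return (uint64_t)block_size_4k_units * 4096u;
}

/* Number of whole chunks that fit in the cache. */
uint64_t chunks_in_cache(uint32_t cache_size_kb, unsigned chunk_size_log2)
{
    return cache_size_bytes(cache_size_kb) / chunk_size_bytes(chunk_size_log2);
}
```

Under these assumptions, a CacheSize of 0x200000 with a ChunkSize of 21 gives a 2GB cache holding 1024 chunks of 2MB each, which matches the largest file size requested with "-g 2G".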

Write Performance Comparisons

One of the big complaints with the OpenAFS SMB to AFS gateway is the poor write throughput.   The iozone output for 1.5.75 demonstrates the limitations.  Although the peak throughput for small files (about 1MB) reaches the 30,000 KBytes/second mark, the sustained throughput for larger files is below 16,000 KBytes/second.
OpenAFS 1.5.75 (SMB) Write Performance

The 1.7.23 AFS Redirector does a much better job.  The peak throughput increases with both the record size and the file size.   Depending on the record size the throughput ranges from 30,000 KBytes/second to 65,000 KBytes/second.  This is more than double the peak throughput of the SMB to AFS gateway.
OpenAFS 1.7.23 (RDR) Write Performance

Read Performance Comparisons

1.5.75 read performance is quite inconsistent.   Although there are peak throughput values above 200,000 KBytes/second the majority of record sizes are read at speeds in the 80,000 to 100,000 KBytes/second range.
OpenAFS 1.5.75 (SMB) Read Performance

The 1.7.23 AFS Redirector is faster by a factor of ten.   The majority of record sizes demonstrate read throughput in the 800,000 KBytes/second to 1,000,000 KBytes/second range.
OpenAFS 1.7.23 (RDR) Read Performance


One of the primary goals of converting OpenAFS from an SMB gateway to a legacy file system redirector was a significant improvement in I/O throughput.  The improvements on the read pathway have certainly been obtained.  The 2x improvement in the write path is good but there is certainly room for further improvement.
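
To put the KBytes/second figures in terms of the 1Gbit link, a small conversion helps. This is only a sketch: it assumes that one iozone KByte is 1024 bytes and it ignores protocol overhead, and both function names are hypothetical.

```c
/* Convert an iozone throughput figure (KBytes/second, with 1 KByte
 * assumed to be 1024 bytes) to megabits per second. */
double kbytes_per_sec_to_mbit(double kbytes_per_sec)
{
    return kbytes_per_sec * 1024.0 * 8.0 / 1e6;
}

/* Fraction of a link's raw bit rate that a throughput figure represents.
 * link_mbit is the nominal link speed in megabits per second. */
double link_utilization(double kbytes_per_sec, double link_mbit)
{
    return kbytes_per_sec_to_mbit(kbytes_per_sec) / link_mbit;
}
```

Under these assumptions 16,000 KBytes/second is about 131 Mbit/s and 65,000 KBytes/second is about 532 Mbit/s, roughly half of the link. The 800,000 KBytes/second read figure works out to well over 6,000 Mbit/s, more than the link can carry, which suggests that those reads are being served from the client cache rather than over the network.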

Sunday, March 24, 2013

Symbolic Links on Windows

 Over the last month I have learned more about symlinks on Windows than I ever wanted to know.  As many readers are aware, I am the lead developer of the OpenAFS client for Windows and the AFS name space supports two symbolic link type objects:
  • Mount Points: a directory entry that refers to the root directory of an AFS volume.
  • Symlinks: a directory entry that refers to any absolute or relative target path; traditionally in POSIX notation.
The original AFS client for Microsoft Windows was implemented as an SMB 1.2 to AFS gateway service and it pre-existed Windows 2000, the first version of Microsoft Windows to include NTFS 3.0 and support for reparse points.  Due to the lack of native OS support, AFS specific command-line tools "fs mkmount", "fs lsmount", "fs rmmount" and "symlink make", "symlink list", and "symlink remove" were provided.

In 2007, Peter Scott and I began work on a Windows Installable File System for AFS.  Technically, the new AFS client is a legacy file system redirector driver which has access to the same functionality and flexibility as NTFS.  In Windows Vista and beyond Microsoft added support for symbolic links to files and directories within NTFS.  They implemented this functionality by combining a directory object or a file object with Reparse Point Data.  The data consists of a Reparse Point Tag value (assigned by Microsoft) and a tag specific data structure.

Microsoft assigns reparse tag values and then includes them in future versions of the ntifs.h header file in the DDK.  If you are developing a file system driver for Windows and wish to have a reparse point tag allocated to your driver, follow the instructions at Microsoft's Reparse Point Tag Request page.  Microsoft is likely to assign only a single Reparse Point Tag value for your driver.  Therefore, I recommend that you request a tag value without the "high latency" or "name surrogate" bits set.  You can always combine those bits with your assigned tag value.   The DDK ntifs.h header includes macros, such as IsReparseTagMicrosoft() and IsReparseTagNameSurrogate(), to test various bits of a tag value.
Reparse Points are a generic mechanism for turning a directory or file object into a reference to something else.  The IsReparseTagMicrosoft() macro is important because it determines which data structure will be set on the file system object.  A Microsoft Tag will use the REPARSE_DATA_BUFFER structure whereas a non-Microsoft Tag will use the REPARSE_GUID_DATA_BUFFER structure.  The latter structure can be customized by the driver vendor.  I recommend defining a structure that contains a driver specific sub-tag value and a union of purpose specific values.  In fact, this is what we did for the AFS redirector.

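// Reparse tag AFS Specific information buffer

#define IO_REPARSE_TAG_OPENAFS_DFS 0x00000037L

#define IO_REPARSE_TAG_SURROGATE   0x20000000L

//  {EF21A155-5C92-4470-AB3B-370403D96369}
//  (the symbol name GUID_AFS_REPARSE_GUID is an assumption)
DEFINE_GUID (GUID_AFS_REPARSE_GUID,
        0xEF21A155, 0x5C92, 0x4470, 0xAB, 0x3B, 0x37, 0x04, 0x03, 0xD9, 0x63, 0x69);

// Sub-tag values; the names for 1 and 2 are assumptions
#define OPENAFS_SUBTAG_MOUNTPOINT 1
#define OPENAFS_SUBTAG_SYMLINK    2
#define OPENAFS_SUBTAG_UNC        3

typedef struct
{
    ULONG SubTag;           // selects which member of the union is valid

    union
    {
        struct              // AFS mount point: cell name and volume name
        {
            ULONG  Type;
            USHORT MountPointCellLength;
            USHORT MountPointVolumeLength;
            WCHAR  Buffer[1];
        } AFSMountPoint;

        struct              // AFS symlink: absolute or relative target path
        {
            BOOLEAN RelativeLink;
            USHORT  SymLinkTargetLength;
            WCHAR   Buffer[1];
        } AFSSymLink;

        struct              // referral to a UNC path
        {
            USHORT UNCTargetLength;
            WCHAR  Buffer[1];
        } UNCReferral;
    };
} AFSReparseTagInfo;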
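
The tag bit tests described above can be sketched in portable C. The bit masks and the two Microsoft tag values below are the ones published in the Windows headers; the macro and function names are hypothetical local stand-ins for the ntifs.h macros so that the example compiles without the DDK.

```c
#include <stdint.h>

/* Tag values: the OpenAFS value is from this post, the others are the
 * Windows SDK values for IO_REPARSE_TAG_SYMLINK and _MOUNT_POINT. */
#define TAG_OPENAFS_DFS        0x00000037u
#define TAG_SYMLINK            0xA000000Cu
#define TAG_MOUNT_POINT        0xA0000003u

/* Bit masks tested by the ntifs.h macros (local names are hypothetical). */
#define TAG_BIT_MICROSOFT      0x80000000u
#define TAG_BIT_HIGH_LATENCY   0x40000000u
#define TAG_BIT_NAME_SURROGATE 0x20000000u

/* Equivalent of IsReparseTagMicrosoft(). */
int tag_is_microsoft(uint32_t tag)
{
    return (tag & TAG_BIT_MICROSOFT) != 0;
}

/* Equivalent of IsReparseTagNameSurrogate(). */
int tag_is_name_surrogate(uint32_t tag)
{
    return (tag & TAG_BIT_NAME_SURROGATE) != 0;
}

/* Combine a vendor tag with the name surrogate bit. */
uint32_t tag_as_surrogate(uint32_t tag)
{
    return tag | TAG_BIT_NAME_SURROGATE;
}

/* Non-Microsoft tags carry their data in REPARSE_GUID_DATA_BUFFER. */
int tag_uses_guid_buffer(uint32_t tag)
{
    return !tag_is_microsoft(tag);
}
```

With these definitions the OpenAFS tag is reported as a non-Microsoft tag that uses the GUID buffer, while IO_REPARSE_TAG_SYMLINK is a Microsoft tag that already has the name surrogate bit set.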

The motivation for using reparse points with the AFS redirector comes from the limitations of the SMB to AFS gateway. The global AFS name space consists of millions of individual volumes scattered across hundreds or thousands of AFS cells maintained by different organizations. The entire name space can be thought of as being rooted at /afs with /afs/ referring to the volume "root.cell" in the cell whose volume location database servers can be found via a DNS SRV query that assumes a one-to-one mapping between the cell name and DNS domain name.  That is too much information but the point is that when the UNC path \\afs\\ is evaluated by an AFS client the subset of the AFS name space it refers to is unlikely to be a single volume.  This is really important because the Win32 GetVolumeInformationByHandleW and GetDiskFreeSpaceEx APIs permit an application to query properties of the volume such as the amount of free space, the volume name, serial number, and system flags.

An SMB share UNC path is assumed to refer to a single volume.  The SMB 1.2 server does not return different volume information for different paths.  It always returns the volume information associated with the root of the share.  For AFS this is a nightmare.  Each AFS volume will have a unique name and id.  They will also have an assigned quota, have a certain number of bytes free, and can be either read-only or read-write.  Since the AFS name space and the potential associated storage is infinite but a single volume has finite constraints what should the GetVolumeInformation and GetDiskFree API families return when given an AFS path?  In the SMB world, AFS claims there is only one volume "AFS", it is read-write, the size of the volume is 2TB and there is always 1TB free.

This lying by the SMB to AFS gateway results in some awkward behaviors.
  • Attempts to open a file for write, create a file, truncate a file, or create or remove a directory on a read-only volume return ERROR_WRITE_PROTECT even though the volume properties indicate that it is read-write.  This results in awkward error messages from applications such as the Explorer Shell, which checks the FILE_READ_ONLY_VOLUME flag to determine whether operations such as New ..., Rename, Delete, etc. should be removed from menus when the active directory is part of a read-only volume.
  • Since the volume size is hard coded to be 2TB with 1TB free, it is not possible for applications to create files that are larger than 2TB.
  • But worse, the Windows SMB client believes that there is 1TB free.  It can accept vast amounts of data from the application before it discovers that in fact there is no room on the file server to store it.  When the space suddenly disappears the application and the user will receive a "Delayed Write Error" which effectively means "I know I promised you that I would safely store your data for you but I misplaced it and you can't have it back."  In other words, a fatal data loss occurs which more often than not will result in application failure and perhaps a monetary loss.
  • Mount point and symlink objects are not exposed to Windows applications.  The applications believe that there are only directories and files.  This has some really negative consequences.  When an attempt is made to delete a directory object via the Explorer Shell, the shell will delete not only the directory entry but all of the contents of the directory tree below it.  If the directory entry was a reparse point, only the reparse point would be removed leaving the target intact.  Instead, the Explorer Shell attempts to delete everything.   When a symlink refers to a file, the symlink should be removed but the target should be left alone.   Finally, rename operations should be performed on the mount point or symlink and not on the target object.
When Peter and I designed the AFS redirector one of the goals was to address these shortcomings.   Implementing reparse points for AFS mount points and symlinks was key because reparse point attributes on directory objects are the indication to an application that the directory entry and its target may not be in the same volume; therefore, the volume and disk free information must be fetched.  Of course, not all applications properly pay attention to reparse point attributes.  Application authors frequently assume that a UNC path or a network drive letter mapping must be to an SMB 1.2 share and therefore can only refer to a single volume.  I am tempted to produce a wall of shame for applications that get it wrong.  However, the failure of application authors to implement the correct behavior in their applications is not a reason for a file system to fail to make the data available to them.

Up until the 1.7.21(00) release the AFS redirector exposed mount points and symlink data using the Microsoft assigned IO_REPARSE_TAG_OPENAFS_DFS tag value and the AFSReparseTagInfo structure wrapped by the REPARSE_GUID_DATA_BUFFER structure.  In principle this should have been fine.  Applications should not need to parse the reparse data in order to properly interpret a reparse point.  The file attributes of the reparse point object indicate whether it's a file or a directory.  The high latency bit of the reparse point tag indicates if the target object is located in a Hierarchical Storage Management system that might not be able to answer queries about the target object in a reasonable period of time.  Unfortunately, many applications decide to ignore the FILE_ATTRIBUTE_REPARSE_POINT flag when it is returned by a GetFileAttributes or GetFileAttributesEx call even though these APIs explicitly return information about a reparse point and not the target.   Some applications follow this behavior when the reparse point tag is not recognized which usually means when IsReparseTagMicrosoft() returns false.  Others do it always.

What happens when the FILE_ATTRIBUTE_REPARSE_POINT bit is discarded and the rest of the file attributes are assumed to apply to the target file?  In addition to the file attributes field the GetFileAttributesEx and FindFirstFile families of functions also return the file size.  Now the file size does not have much meaning when the object is a directory but when the target of the reparse point is a file using the wrong file size can be catastrophic.  File contents can be truncated when read or overwritten when written.  Applications will be mighty confused when they continue to append data to a file but believe the file size never changed.  They will be even more confused when they attempt to delete a file only to find that either the reparse point was deleted or the target file but not both.  Regardless, bad things happen and that leaves end users with a bad taste in their mouths.
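
The check that applications skip is small. The sketch below classifies a directory entry from its attribute bits and reports whether the size returned with it can be taken as the size of the data that will be read. The two attribute values are the Windows SDK values for FILE_ATTRIBUTE_DIRECTORY and FILE_ATTRIBUTE_REPARSE_POINT; the type and function names are hypothetical.

```c
#include <stdint.h>

#define ATTR_DIRECTORY     0x00000010u  /* FILE_ATTRIBUTE_DIRECTORY */
#define ATTR_REPARSE_POINT 0x00000400u  /* FILE_ATTRIBUTE_REPARSE_POINT */

typedef enum {
    ENTRY_FILE,
    ENTRY_DIRECTORY,
    ENTRY_LINK_TO_FILE,
    ENTRY_LINK_TO_DIRECTORY
} entry_kind;

/* Classify an entry without discarding the reparse point bit. */
entry_kind classify_entry(uint32_t attrs)
{
    int is_dir = (attrs & ATTR_DIRECTORY) != 0;

    if (attrs & ATTR_REPARSE_POINT)
        return is_dir ? ENTRY_LINK_TO_DIRECTORY : ENTRY_LINK_TO_FILE;
    return is_dir ? ENTRY_DIRECTORY : ENTRY_FILE;
}

/* The size reported for a reparse point describes the reparse point,
 * not its target, so it must not be used to size reads of the target. */
int size_describes_target(uint32_t attrs)
{
    return (attrs & ATTR_REPARSE_POINT) == 0;
}
```

An application that finds size_describes_target() false would need to open the target and query its size there before reading, rather than trusting the value from the directory listing.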

For the 1.7.22(00) release I decided to significantly flesh out the reparse point handling.  For starters, I had been working with Rex Conn on adding knowledge of AFS Reparse Points to Take Command.  Take Command (and its predecessor 4NT) have long had excellent support for AFS.  Take Command distinguishes in the directory list symlinks to files, symlinks to directories and junctions.  It does so for AFS as well as NTFS.  When Take Command 15 is combined with OpenAFS 1.7.22 users can not only view the target information for AFS mount points and symlinks but can also create them if the Take Command process has the SeCreateSymbolicLinkPrivilege which permits the CreateSymbolicLink API to create a symlink to a directory or a file.

CreateSymbolicLink encapsulates the following operations:
  1. Determine the type of the target object (file or directory)
  2. Create either a directory or a file object to match the target type
  3. Construct the REPARSE_DATA_BUFFER structure using the IO_REPARSE_TAG_SYMLINK tag
  4. Issue FSCTL_SET_REPARSE_POINT to assign the reparse data to the directory or file
  5. Close the handle to the file or directory
In other words, CreateSymbolicLink only creates Microsoft symlinks.  Since the tag type is in the data structure it is fairly easy for a file system driver to accept both the IO_REPARSE_TAG_SYMLINK data and the file system specific data.  Once implemented it became possible for the Take Command MKLINK command to be used to create symlinks within AFS volumes.
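
Step 3 above can be illustrated by laying out the bytes of a symlink REPARSE_DATA_BUFFER by hand. This is a sketch of the documented layout only (an 8 byte header followed by four USHORT name fields, a ULONG flags field, and the path buffer); it does not call any Windows API, the function name is hypothetical, and for simplicity it stores the same string as both the substitute name and the print name, which is only appropriate for relative targets.

```c
#include <stddef.h>
#include <stdint.h>

#define TAG_SYMLINK           0xA000000Cu /* IO_REPARSE_TAG_SYMLINK */
#define SYMLINK_RELATIVE_FLAG 0x00000001u /* SYMLINK_FLAG_RELATIVE */
#define HEADER_BYTES          8u  /* ReparseTag, ReparseDataLength, Reserved */
#define SYMLINK_FIXED_BYTES   12u /* four USHORT name fields plus ULONG Flags */

static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint8_t *p, uint32_t v)
{
    put_le16(p, (uint16_t)(v & 0xFFFF));
    put_le16(p + 2, (uint16_t)(v >> 16));
}

/* Serialize a symlink reparse buffer for a UTF-16 target of target_chars
 * code units. Returns the total size in bytes, or 0 if it does not fit. */
size_t build_symlink_reparse(uint8_t *out, size_t cap,
                             const uint16_t *target, size_t target_chars,
                             int relative)
{
    size_t name_bytes = target_chars * 2;
    size_t path_bytes = name_bytes * 2;  /* substitute name + print name */
    size_t data_bytes = SYMLINK_FIXED_BYTES + path_bytes;
    size_t total = HEADER_BYTES + data_bytes;
    size_t i;

    if (total > cap || data_bytes > 0xFFFFu)
        return 0;

    put_le32(out + 0, TAG_SYMLINK);             /* ReparseTag */
    put_le16(out + 4, (uint16_t)data_bytes);    /* ReparseDataLength */
    put_le16(out + 6, 0);                       /* Reserved */
    put_le16(out + 8, 0);                       /* SubstituteNameOffset */
    put_le16(out + 10, (uint16_t)name_bytes);   /* SubstituteNameLength */
    put_le16(out + 12, (uint16_t)name_bytes);   /* PrintNameOffset */
    put_le16(out + 14, (uint16_t)name_bytes);   /* PrintNameLength */
    put_le32(out + 16, relative ? SYMLINK_RELATIVE_FLAG : 0u);

    for (i = 0; i < target_chars; i++) {
        put_le16(out + 20 + 2 * i, target[i]);               /* substitute */
        put_le16(out + 20 + name_bytes + 2 * i, target[i]);  /* print */
    }
    return total;
}
```

A file system driver that receives such a buffer through FSCTL_SET_REPARSE_POINT can read the tag from the first four bytes to decide whether it is looking at Microsoft symlink data or at its own vendor specific structure.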

For the longest time I resisted squatting on Microsoft's tag and data structure but as long as FSCTL_GET_REPARSE_POINT returns the IO_REPARSE_TAG_OPENAFS_DFS data many applications do the wrong thing.  There simply wasn't any choice from the perspective of application compatibility.  As a result in the 1.7.23(00) release AFS Symlinks will be exposed using the IO_REPARSE_TAG_SYMLINK instead of the IO_REPARSE_TAG_OPENAFS_DFS tag.  Only AFS Mount Points will be exposed using the IO_REPARSE_TAG_OPENAFS_DFS tag.

With this change not only can Take Command understand AFS symlinks but so can the Explorer Shell, the Cygwin POSIX environment, the PowerShell Community Extensions, and anything else that can manipulate NTFS symlinks.  Even Hermann Schinagl's Link Shell Extension.

One might think that everyone would be happy at this point except that end users are still faced with applications that do not know how to properly interpret Microsoft Reparse Points.  One example is Microsoft's own .NET.  In Microsoft's How to: Iterate Through a Directory Tree (C# Programming Guide) the author explains:

  NTFS file systems can contain reparse points in the form of junction points, symbolic links, and hard links. The .NET Framework methods such as GetFiles and GetDirectories will not return any subdirectories under a reparse point. This behavior guards against the risk of entering into an infinite loop when two reparse points refer to each other. In general, you should use extreme caution when you deal with reparse points to ensure that you do not unintentionally modify or delete files. If you require precise control over reparse points, use platform invoke or native code to call the appropriate Win32 file system methods directly.

That is not the only thing that .NET does.  It also hides the FILE_ATTRIBUTE_REPARSE_POINT bit in the file attributes from applications and returns the file size of the reparse point data.  As a result parsing a file stream through a symlink to a file results in the data truncation bug.   If the .NET team truly wanted to hide reparse points from application developers, they should have substituted the file attribute information for the target files in all directory enumeration output.  Providing compatibility for broken applications such as this should not be the responsibility of a file system.  However, applications are more important to end users than file systems and if the applications do not work, the file system will be replaced (or never adopted in the first place.)   As a result a future version of the Windows AFS client will probably include a mechanism for requesting that Symlinks to Files be reported as Files and not IO_REPARSE_TAG_SYMLINK reparse points.

While on the subject of Symlinks and Windows I would also like to discuss other approaches to implementing symlinks on Windows that have been implemented over the years.  As I mentioned, Cygwin supports Microsoft IO_REPARSE_TAG_SYMLINK reparse points as Symlinks.

$ ls -l af*
lrwxrwxrwx 1 Administrators None 9 Sep 19  2012 afs -> //afs/all

However, "ln -s target link" cannot be used to create IO_REPARSE_TAG_SYMLINK reparse points.  This is because "ln -s" creates Cygwin specific symlink objects in the file system.  Instead of using reparse points, Cygwin writes a file that begins with a cookie "!<symlink>", followed by a Unicode BOM and the target path in Unicode.  The file has the FILE_ATTRIBUTE_SYSTEM attribute set as an indicator that the file might be a Cygwin symlink.

On Windows Server, Microsoft provides both a POSIX environment, Interix, and an NFSv3 implementation.  Interix implements symlinks similarly to Cygwin except that the cookie is "IntxLNK\1" and the format of the target path is different.  The NFS implementation, by contrast, identifies its symlinks by use of an extended attribute, "NfsSymlinkTargetName", which stores the target path.
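
Recognizing these file-based symlinks comes down to checking the system attribute and the leading bytes of the file. The sketch below makes several assumptions that should be verified against the Cygwin and Interix sources: that the Cygwin cookie is the ten bytes "!<symlink>", that the BOM is stored as the two bytes 0xFF 0xFE, and that Interix links also carry the system attribute. The type and function names are hypothetical, and parsing of the target path is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { LINK_NONE, LINK_CYGWIN, LINK_INTERIX } link_file_kind;

/* Inspect the first bytes of a file and report which kind of file-based
 * symlink it appears to be. has_system_attr is nonzero when the file has
 * FILE_ATTRIBUTE_SYSTEM set. */
link_file_kind detect_link_file(const uint8_t *data, size_t len,
                                int has_system_attr)
{
    static const char cygwin_cookie[] = "!<symlink>";  /* assumed cookie */
    static const char interix_cookie[] = "IntxLNK\1";
    const size_t cyg_len = sizeof cygwin_cookie - 1;
    const size_t intx_len = sizeof interix_cookie - 1;

    if (!has_system_attr)
        return LINK_NONE;

    /* Cygwin: cookie followed by a UTF-16LE byte order mark (assumed). */
    if (len >= cyg_len + 2 &&
        memcmp(data, cygwin_cookie, cyg_len) == 0 &&
        data[cyg_len] == 0xFF && data[cyg_len + 1] == 0xFE)
        return LINK_CYGWIN;

    /* Interix: cookie only; the target path format differs. */
    if (len >= intx_len && memcmp(data, interix_cookie, intx_len) == 0)
        return LINK_INTERIX;

    return LINK_NONE;
}
```

Note that every directory entry with the system attribute has to be opened and read to make this determination, whereas a reparse point is identified from the attributes alone.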

There is one more type of link object in Windows which is sometimes interpreted as a symlink.  That is the Windows Shortcut .LNK file which is interpreted by the Windows Shell.  One thing that is quite odd is that Cygwin at the present time is capable of writing .LNK files but is not capable of creating IO_REPARSE_TAG_SYMLINK reparse points.
[Update: Corinna Vinschen of Cygwin indicates the reason is that POSIX paths can be stored in .LNK files but IO_REPARSE_TAG_SYMLINK fields require the use of Windows file paths and foreknowledge of the target type.]

Microsoft Windows Reparse Points are an extremely powerful and flexible mechanism for implementing file system specific control points.  Much more powerful than the traditional POSIX symlink although much more complex.  An example of a tool that is more powerful because of its reparse point awareness is Microsoft's "Robust File Copy for Windows" tool better known as RoboCopy.   RoboCopy can be configured to exclude junction points (/XJ) by which they mean reparse points; exclude junction points for directories but not files (/XJD); exclude junction points for files (/XJF); and even copy the symlink instead of the target (/SL).   All of these switches work with the Windows AFS client.

My final comment for this post is that evaluating AFS directories which contain symlinks is an extremely expensive operation.  Unlike the POSIX equivalents, a Windows directory enumeration always returns the WIN32_FIND_DATA structure for each directory entry which contains the file attributes.  A reparse point to a directory must have the FILE_ATTRIBUTE_DIRECTORY bit set and a reparse point to a file must not.  All of the other fields of the WIN32_FIND_DATA structure can be determined from the reparse point itself but AFS does not have a method of hinting to the client what the type of the target object is.  As a result, the target path must be evaluated for each and every directory listing.  A directory such as /afs/ which contains more than 30,000 relative symlinks to directories will require nearly twice that number of RPCs to the file server to complete the directory enumeration.  Something to think about when planning your AFS name space.

Thursday, March 14, 2013

JPSoftware's Take Command and OpenAFS

I have been a user of Rex Conn's replacement command processors since the early days of 4DOS.  When I switched to OS/2 and began work on OS/2 C-Kermit, 4OS2 was there for me.  When I added REXX language support to OS/2 C-Kermit, 4OS2 added it as well.  When I moved to Windows NT, there was 4NT waiting for me.  In 2003 I began my work on OpenAFS for Windows (WinAFS) which at the time was implemented as a local SMB server proxy to the AFS name space.  Before I started work on the WinAFS client, the only method of accessing the AFS name space was by use of Windows drive letter mappings.  It wasn't possible to consistently access the AFS name space via a UNC path.  It wasn't until the OpenAFS 1.3.66 release in July 2004 that it became possible to live entirely in a UNC \\AFS\cellname\path\ world except that the Microsoft command processor (cmd.exe) does not permit UNC paths to be the current directory.  4NT on the other hand had supported UNC paths as the current directory for years and it was a natural fit.  Drive letter mappings suddenly became no longer necessary for my day to day activities.

For those readers that are not long time AFS users there are some important things to understand about the AFS name space.  On a Windows file share, the UNC path \\server\share\ refers to a single on-disk volume on the specified machine.  AFS is different.  The AFS UNC path \\afs\cell\ refers to the root directory of a volume named root.cell in the specified AFS cell.  AFS UNC paths are location independent and do not signify on which physical machines the data is stored.  In fact, root.cell is in most cases a geographically replicated volume. In addition to directories and files, AFS supports mount points and symlinks as first class file system types.  An AFS mount point is an object that refers to the root directory of another AFS volume and symlinks can refer to any absolute or relative file path.

The AFS name space can therefore be viewed as a directed graph of volumes joined to other volumes where each volume contains a directory tree.  Volumes can be either read/write or read-only snapshots of a read/write volume.  Volumes can be assigned quotas or can be permitted to grow to fill the entire partition on which they are stored.  AFS volumes can be migrated from server to server while in use and the amount of free space can change as a result of the volume being moved.  The AFS name space is therefore a challenge to use when it is accessed via the SMB protocol.

SMB file shares were designed prior to the existence of NTFS Junctions and NTFS Symlinks (added in Vista and Server 2008).  The assumption is that there is only one volume on one partition located at the other end of a UNC path.  Obtaining the free space is most often performed using GetDiskFreeSpace which can only refer to root directories and not GetDiskFreeSpaceEx which can refer to arbitrary paths.  Even the MSDN documentation for these APIs states that the reason to use the Ex version is to avoid unnecessary arithmetic whereas the most important reason for using the Ex version in my opinion is that it works with complex name spaces constructed by NTFS junctions and AFS mount points.
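
The "unnecessary arithmetic" that the documentation refers to is the multiplication of the counts that GetDiskFreeSpace reports. A sketch of that computation follows; the parameter names mirror the values the API reports, while the function names are hypothetical.

```c
#include <stdint.h>

/* Free space in bytes from sectors per cluster, bytes per sector, and the
 * number of free clusters. The multiplication is widened to 64 bits. */
uint64_t free_bytes_from_counts(uint32_t sectors_per_cluster,
                                uint32_t bytes_per_sector,
                                uint32_t free_clusters)
{
    return (uint64_t)sectors_per_cluster * bytes_per_sector * free_clusters;
}

/* The same product computed in 32-bit arithmetic, as careless callers do.
 * Unsigned overflow wraps, so large volumes yield nonsense. */
uint32_t free_bytes_truncated32(uint32_t sectors_per_cluster,
                                uint32_t bytes_per_sector,
                                uint32_t free_clusters)
{
    return sectors_per_cluster * bytes_per_sector * free_clusters;
}
```

With 8 sectors per cluster, 512 bytes per sector, and 268,435,456 free clusters the 64-bit result is 1TB, while the 32-bit product wraps to zero. Avoiding that mistake is a fair reason to prefer the Ex version, but it does nothing to address the single-volume assumption.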

Since the AFS name space is made up of a potentially infinite number of volumes joined together via mount points and volumes can sometimes be read/write and other times be read-only, how should the WinAFS SMB server respond when it is asked to report the total disk space and total free disk space?  It's impossible to provide an accurate value for either of these.  As a result the AFS SMB server would simply lie.  It would report an arbitrarily large number for the partition size and the free space.  Free space was always reported even when there was absolutely none.

Which brings us back to JPSoftware and 4NT.  While it wasn't possible for arbitrary volume information to be obtained via the Win32 API, the AFS fs command obtains this information via the afs path ioctl interface.  In September 2005 Rex Conn added OpenAFS specific knowledge and functionality to 4NT 7.0:
  1. The command parser understands UNIX style inputs /afs/ and automatically converts them to UNC notation \\afs\\user\jaltman when the first component matches the AFS "NetbiosName".
  2. The command language contains @AFSCELL, @AFSMOUNT, @AFSPATH, @AFSSYMLINK, @AFSVOLID, @AFSVOLNAME functions which operate on paths and return AFS specific data.
  3. Free space computations use AFS volume information so it is accurate even when the Win32 GetVolumeInformation() call executed over SMB would not be.
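
The path conversion in item 1 can be sketched as follows. This is not Take Command's implementation; the function names are hypothetical, and the cell name example.org used in the usage note is a placeholder. The sketch converts a path only when its first component matches the configured NetBIOS name, compared without regard to case.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n characters of a and b. */
static int same_name(const char *a, const char *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Convert "/name/rest" to "\\name\rest" when name matches netbios_name.
 * Returns 1 and fills out on success, 0 if the path is left untouched. */
int afs_unix_to_unc(const char *path, const char *netbios_name,
                    char *out, size_t cap)
{
    size_t name_len = strlen(netbios_name);
    size_t path_len = strlen(path);
    size_t i;

    if (path[0] != '/' || path_len < name_len + 1)
        return 0;
    if (!same_name(path + 1, netbios_name, name_len))
        return 0;
    /* The first component must end right after the matched name. */
    if (path[1 + name_len] != '/' && path[1 + name_len] != '\0')
        return 0;
    /* One extra leading backslash plus the terminating NUL. */
    if (path_len + 2 > cap)
        return 0;

    out[0] = '\\';
    for (i = 0; i < path_len; i++)
        out[i + 1] = (path[i] == '/') ? '\\' : path[i];
    out[path_len + 1] = '\0';
    return 1;
}
```

Called with the path /afs/example.org/user and the NetBIOS name "afs", the function produces \\afs\example.org\user. A path such as /afsx/foo is left alone because its first component does not match.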
Over the last five years as the AFS Redirector has been developed 4NT (now called Take Command) has been a constant companion.  One of my favorite features of Take Command directory listings is its awareness of Reparse Points.  For example:
As you can see, directory listings expand the target of NTFS Junctions and Symlinks, providing the target information.  I have for the longest time wanted this behavior for AFS.   Unfortunately, up until a late TC 14.03 build, Take Command did not understand how to parse the AFS Reparse Point data.  Now that it does we get the same useful output:

Although not shown, symlink to file targets are displayed as well.
With the release of Take Command 15.0 and OpenAFS 1.7.22 the circle has now been completed.  Not only can Take Command display AFS mount point and symlink targets, but Take Command's MKLINK command can be used to create symlinks to both files and directories, and the DEL and RMDIR commands can be used to remove them.

Take Command's GLOBAL command can either cross [/J] or not cross [/N] junctions as specified.

Finally, Take Command properly uses GetVolumeInformationByHandleW() to obtain volume information.  As a result the built-in AFS functions operate even when AFS is accessed via an NTFS directory symlink.

I recommend Take Command for any user of OpenAFS that relies upon the command shell.

For further information on Take Command visit the JP Software web site at