Wednesday, October 18, 2006

The need to avoid release labeling and choice for end users

Developers have a tendency to focus on source code management. We maintain source code repositories to help us manage the development process. Within the repository we construct release branches. Each branch allows a set of sources to be shaped for a specific purpose. Typical branching strategies include separate branches for the maintenance of a public release, for development of the next release, and experimental branches for risky development that might not work out or may have an adverse impact on other developers. Developers often give these branches somewhat arbitrary names ("stable", "unstable", "maintenance", "development", "project foo", etc.) that only have meaning to the developers.

As is often the case, the names assigned to the branches have no relationship with the quality of the code on a particular branch. This is especially true for a software project which supports large numbers of operating system platforms. Given the rate of development, one branch may often be a better choice than the others for a given platform.

OpenAFS has traditionally labeled its branches as "stable" and "unstable".   The even numbered branches are "stable" and the odd numbered branches are "unstable".  This has resulted in significant amounts of confusion and frustration for end users.  At any given time end users have been presented with up to three current releases:
  • the last final release off of the "stable" branch
  • the most recent test release off of the "stable" branch
  • the most recent release off of the "unstable" or "development" branch
What's an end user to do?  More importantly, what's an administrator responsible for choosing the release to distribute throughout their organization to do?

When presented with the choice of selecting among "stable", "beta", or "unstable", which do you think the majority of individuals will choose? End users don't want to install software that is going to cause them to lose data and they don't want to be guinea pigs, so more often than not they are going to choose the "stable" release, even if this release has a list of known bugs a mile long and is years old.

The distinction between the various source code branches is of meaning only to the developers. End users do not think of software as source code. They think of it as a product, and the labels associated with different versions of a product will significantly influence the end user's decisions, especially when faced with complex choices they are not qualified to distinguish between. It is unrealistic to assume that an end user is going to understand the importance of file locking or the meaning of a 64-bit file size or the terminology surrounding deadlocks and reference count leaks. When a typical end user is presented with a choice among two or three complex options without a strong recommendation specifying which should be used, simplistic labels such as "unstable", "stable", "final", "development", "test", "beta", "candidate", etc. are much more influential than they are intended to be.

The reputation of OpenAFS on the Microsoft Windows and MacOS X platforms is suffering in part because of the choices given to end users and the terminology used to describe them. End users want something that works. They want to visit a web site and see that version X.Y.Z is the best version available for their platform and this is what they should be using. When they experience a problem and see that they are not currently running the recommended version, then they will upgrade. If they experience a problem and are presented with choices that they can't make heads or tails of, they are going to take the path that appears to have the least risk. End users will choose the "stable" or "final" release over something labeled "test", "beta", "unstable", or "development" 9 times out of 10, even though the problem they are experiencing might very well be fixed in one of these apparently riskier releases.

For Windows users the availability of multiple releases has been a serious problem. The 1.4 series does not contain significant functionality that is meant to protect end users from data loss. This functionality is only available in the 1.5 series. Unfortunately, because end users are presented with new releases from both the 1.4 and 1.5 branches as they are released, it is truly impossible for them to know which to use without a very clear recommendation from the gatekeepers and perhaps the broader user community.

One of the other significant problems facing OpenAFS versioning is the length of time it takes to get through a test cycle. It is often the case that a small number of problems on specific operating system versions or hardware architectures can prevent a test cycle from being completed. In the meantime, the release that should be considered the best choice on all of the other operating system versions and hardware architectures is stuck with a label of "test", "beta", or "candidate", which discourages organizations and end users from installing it.

As a result I am recommending that OpenAFS (and all other cross-platform open source projects) avoid the "one version is best for all platforms" mentality. Instead of labeling releases as "stable-1-4-2", "stable-1-4-2-beta-1", "stable-1-4-2-rc3", or "unstable-1-5-9", just use numbers such as "1-4-41", "1-4-42", "1-4-43", "1-5-9". This removes the negative connotations associated with the labels. For each platform a recommended release number can be provided.
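As a rough sketch of what per-platform recommendations might look like in practice, the following Python compares an installed build number against the recommended one for a platform. The platform names, the table contents, and the helper names are all hypothetical and chosen only for illustration; this is not how OpenAFS publishes its recommendations.

```python
# Hypothetical table: which build is recommended on which platform.
# The platform names and build numbers are illustrative only.
RECOMMENDED = {
    "windows": "1-5-9",
    "macosx": "1-4-43",
    "linux": "1-4-42",
}


def parse_build(label):
    """Turn a label such as "1-4-42" into a tuple of ints for comparison."""
    return tuple(int(part) for part in label.split("-"))


def advice(platform, installed):
    """Return a one-line recommendation for the installed build."""
    recommended = RECOMMENDED[platform]
    if parse_build(installed) < parse_build(recommended):
        return f"upgrade to {recommended}"
    return "no action needed"


print(advice("windows", "1-4-41"))
print(advice("linux", "1-4-42"))
```

The point of the sketch is that the end user never has to interpret a word like "beta" or "unstable". The only question is whether the installed number is lower than the recommended one.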

This new approach provides a number of side benefits. No longer do the developers need to guess at what version numbers to assign to test builds. Under the current scheme, when preparing for a new release we want the final version number to be X.Y.Z.00. Therefore, the developers typically assign test builds numbers starting with X.Y.(Z-1).90, so that version numbers always increase while avoiding the confusion that might arise if end users thought a test release was in fact the final release.
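The ordering constraint behind the X.Y.(Z-1).90 convention can be illustrated with a few lines of Python. The version strings below are made-up examples and version_key is a hypothetical helper; the sketch only shows that test builds numbered from .90 sort after the previous final release and before the next one, provided the fields are compared as numbers and not as text.

```python
def version_key(version):
    """Split a dotted version into ints so fields compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Made-up example: 1.4.1.00 is the previous final release, 1.4.1.90 and
# 1.4.1.91 are test builds, and 1.4.2.00 is the next final release.
releases = ["1.4.2.00", "1.4.1.90", "1.4.1.00", "1.4.1.91"]
ordered = sorted(releases, key=version_key)
print(ordered)

# Comparing the raw strings is not safe: as text, "1.4.10" sorts before "1.4.9".
print("1.4.10" < "1.4.9")
print(version_key("1.4.10") > version_key("1.4.9"))
```

With plain incremental build numbers the .90 trick becomes unnecessary, since every build simply gets the next number.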

Another benefit is that it will be much easier for administrators to convince management to deploy fixes. Management is always reluctant to deploy a "beta" or "candidate" release because such a release must have bugs. The reality is that all software has bugs. Even if there are no known bugs in a given release at the time the release is announced, it is guaranteed that over time bugs will be discovered and they will be fixed in later releases. A "final" release is simply one that is believed to build and run on all supported platforms without known faults.

The requirement that a "final" release build and run on all supported platforms, including all new Linux kernels, often results in significant delays before important bug fixes can make it out to the user community. For example, at the AFS & Kerberos Best Practice Workshop a demonstration was given of a bug fix to a problem in the 1.4.1 file server that adversely affects client mobility. The bug fix was committed on June 3rd and yet it has taken until October 17th for a 1.4.2 final release to be issued. In the meantime, more than four months of end user frustration has accumulated and many sites have deployed 1.4.1 on their file servers instead of one of the "beta" or "candidate" releases that contained the fix.

In speaking with end users, I have been told that as long as the version label does not contain negative terminology they can push out any build that is recommended. However, once doubt is raised in the minds of management regarding the quality of the release, all bets are off.

It is my hope that OpenAFS and other open source projects will abandon the traditional release labeling and replace it with incremental build numbers and platform specific recommendations.

Friday, September 8, 2006

OpenAFS for Windows September 2006 Status Report is now available

The OpenAFS for Windows September 2006 Status Report is now available:

http://www.secure-endpoints.com/talks/OpenAFS-Windows-Sep-2006-Status-Report.pdf

For the complete list of changes since the 1.2 release see: http://www.openafs.org/dl/openafs/1.5.8/winnt/afs-changes-since-1.2.txt

and of course be sure to read the Release Notes:
 http://www.openafs.org/dl/openafs/1.5.8/winnt/relnotes-frames.htm

As always I encourage all organizations and individuals who wish to support the development of OpenAFS for Windows to contact me. Financial contributions as well as in kind assistance are seriously appreciated. Tax deductible donations may be made via the OpenAFS account operated by Usenix (a 501c3 not for profit corporation).

Tuesday, July 25, 2006

Kerberos for Windows 3.1 Beta update

KFW 3.1 Beta 1 has been tagged and installers have been built.

An official announcement is soon to come.

Thursday, July 20, 2006

OpenAFS for Windows Large File Support is now available

Several weeks ago, OpenAFS.org announced the release of OpenAFS for Windows 1.5.3.  For Windows this release contains three major changes:

  • First, there are significant changes in the CIFS server compatibility with the Windows CIFS client.   The result should be a major improvement in the interaction with the Explorer Shell.
  • Second, this is the first AFS client on Windows to support the Inline Bulk Status RPCs.  The reason that this is a big deal is that instead of performing one RPC per directory entry the Windows client now performs one RPC for every 50 directory entries.  In addition, directory entries that are unreadable due to access permissions are temporarily cached as expired callbacks.  This significantly reduces the time required to obtain a directory listing or create/open/delete files.
  • Third, for the first time the Windows AFS client is capable of supporting 64-bit Large Files.  You can now use AFS to store DVD images.
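To put a number on the bulk status change, here is a small back-of-the-envelope calculation in Python. The batch size of 50 is the figure quoted above; the function name and the directory size are hypothetical and used only for the arithmetic.

```python
BULK_BATCH = 50  # directory entries covered by one bulk status RPC (figure quoted above)


def rpc_count(entries, batch=1):
    """Number of status RPCs needed for a directory, rounding up."""
    return (entries + batch - 1) // batch


entries = 1000  # hypothetical directory size
print(rpc_count(entries))              # one RPC per entry
print(rpc_count(entries, BULK_BATCH))  # one RPC per 50 entries
```

For this hypothetical 1000 entry directory the round trips drop from 1000 to 20, which is where most of the time savings in a directory listing would be expected to come from.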

With all of these changes I bet you can't wait to get your hands on this release.

At long last, Kerberos for Windows 3.1 is nearing release

After more than seven months of waiting MIT Kerberos for Windows 3.1 is finally going to begin beta testing. This release will fix all of the bugs that plagued KFW 3.0 Network Identity Manager 1.0. The memory leaks, the principal name rejections, the non-en_US locale problems. They are all gone.

In addition, KFW 3.1 will not crash on Windows XP64 WOW64 simply because Microsoft failed to actually export tickets from the LSA even though the functions succeed. (This problem is fixed in Vista Beta 2.)

Along with KFW 3.1 will be a new version of the AFS plug-in installer. Now that the NetIDMgr APIs are truly stable the AFS plug-in can be integrated into an OpenAFS release. That said, the KFW 3.1 release will install on 64-bit Windows systems but there will not be 64-bit native support until KFW 3.2.

The most dramatic changes in KFW 3.1 are in the NetIDMgr.  The user interface is much more fleshed out.  In the main window tickets can be sorted in arbitrary ways by clicking the column headers.   The View->Columns menu allows the user to select which ticket properties should be displayed.   The properties include everything from principal names to lifetimes to enc-types to key version numbers.  Tool Tips are supported for the toolbar.   The font used to display tickets is user selectable.   Identities can be added without obtaining credentials via the Global Identity Settings.   The Kerberos 5 Realm configuration editor is complete.   Kerberos 5 credentials can now be obtained from FILE: ccaches.  In general, the overall look and feel of the tool is significantly improved.

I look forward to being able to announce on Tuesday July 25th that the installers are available for download.

Monday, June 12, 2006

Tuesday, March 21, 2006

MIT Kerberos for Windows 3.1 has been delayed

MIT Kerberos for Windows 3.1 has been delayed. The original anticipated delivery date was late January. Unfortunately things have taken a bit longer than expected. On the plus side the Network Identity Manager is feature complete. The next edition will include a wide variety of new tools for viewing the acquired credentials. In addition, all of the known bugs have been fixed. KFW 3.1 will be a significant release.

OpenAFS 1.4.1 release candidate 10 is available

OpenAFS 1.4.1 release candidate 10 is available. See http://lists.openafs.org/pipermail/openafs-announce/2006/000145.html for the announcement letter.

The dates and location for the AFS & Kerberos Best Practices Workshop have been announced. The workshop will be held at the University of Michigan during the week of June 12th to 16th. The program will be similar to last year. Monday there will be a full day tutorial on AFS installation and administration. On Tuesday there will be a full day tutorial on Kerberos installation and administration. Wednesday through Friday there will be a series of talks on AFS and Kerberos given by key developers and members of the user community. Registration fees are $100 per person for the talks and $100 per person for each of the tutorials.

Jeffrey Altman of Secure Endpoints Inc. will be giving a talk at this year's NIST PKI R&D Workshop which is being held April 4-6 in Washington D.C.

Tuesday, January 24, 2006

Thread Safety is hard to maintain

I am constantly amazed by how hard it is to write thread safe code and have it stay thread safe. This week I had the pleasure of debugging two very different thread safety problems with what looked like very different origins.

Actually, as I think about it, the origins of the problems are fairly similar. In both cases, the root cause was a failure to appropriately document the thread safe architecture. In the first case, the degradation was gradual. The bugs were introduced slowly over time as multiple developers who did not understand the architectural requirements added or modified functionality. The worst part about thread safety problems is that reference counting errors in which references are not released are nearly impossible to find. The same holds true for race conditions due to a failure to apply read/write locks or mutexes in a consistent manner.

These errors are extremely easy to introduce and unless there is very thorough auditing of the code it is quite likely that even a well written test suite is not going to trigger the errors. Deadlock conditions and under counting of references are much easier to identify because they typically result in a hard error.
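A minimal sketch of the difference, written in Python for brevity: the class and function names are hypothetical and the code is not taken from any of the systems discussed here. The leaky path fails silently, while an unmatched release raises immediately.

```python
import threading
from contextlib import contextmanager


class RefCounted:
    """Toy object with a lock-protected reference count."""

    def __init__(self):
        self._lock = threading.Lock()
        self.refs = 0

    def hold(self):
        with self._lock:
            self.refs += 1

    def release(self):
        with self._lock:
            if self.refs == 0:
                # Under counting shows up right away as a hard error.
                raise RuntimeError("release without matching hold")
            self.refs -= 1


def leaky_lookup(obj, ok):
    obj.hold()
    if not ok:
        return None  # bug: the early return skips release() and nothing fails
    obj.release()
    return "done"


@contextmanager
def held(obj):
    """Pair every hold() with a release(), even on early return or exception."""
    obj.hold()
    try:
        yield obj
    finally:
        obj.release()


def safe_lookup(obj, ok):
    with held(obj):
        if not ok:
            return None
        return "done"
```

After leaky_lookup(obj, False) the object still carries one reference that nobody owns, and no test will notice unless it inspects the count directly. Tying the release to scope exit, as held() does, is one way to keep later edits from reintroducing the leak.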

The second problem was even more hideous. The architect made an assumption about the order in which threads would be allocated in an application. It never occurred to her/him that the libraries that are linked into the application might one day be multi-threaded themselves. The end result was that a data structure that was meant to allow thread safe access to objects by allocating a single bit in a table to each thread ended up, due to math truncation, associating multiple threads with the same bits. So while not a single line in the program changed, the thread safety of the application went out the window.
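The exact data structure is not described here, so the following Python is only a guess at the general shape of the failure. It assumes a 32-bit table and a thread index that is truncated to fit it; both the width and the function names are assumptions made for illustration.

```python
WORD_BITS = 32  # assumed width of the per-object bit table


def thread_bit(thread_index):
    """Bit mask for a thread, truncating the index to fit the table."""
    return 1 << (thread_index % WORD_BITS)


def colliding_threads(thread_indices):
    """Group thread indices that end up sharing the same bit."""
    by_bit = {}
    for idx in thread_indices:
        by_bit.setdefault(thread_bit(idx), []).append(idx)
    return {bit: ids for bit, ids in by_bit.items() if len(ids) > 1}


# While the application owns every thread, indices stay below WORD_BITS.
print(colliding_threads(range(32)))

# Once linked libraries start their own threads, indices run past the table
# and distinct threads silently share a bit.
print(colliding_threads([0, 5, 32, 37]))
```

In this sketch threads 0 and 32 share one bit and threads 5 and 37 share another, even though the code that computes the bit never changed. Only the number of threads in the process did.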

I don't know if better documentation would have helped in this second instance but it is clear that heavily used production servers started to bounce because of an undocumented assumption.

My advice to anyone who is developing and maintaining multi-threaded systems is to be careful. What looks on the surface like a trivial change may very well disrupt a delicately balanced system. Whenever possible, find someone else who understands the architecture of the system to review each and every line, being very careful to pay attention to the expansion of macros and the side-effects that function calls might have.