Wednesday, September 26, 2007

Windows Error Reporting versus Open Source Development

Windows Error Reporting is one of the greatest services that Microsoft has ever provided to developers of applications and device drivers for Microsoft Windows operating systems.  It gives a registered and verified software developer access to crash report data for that developer's applications.

How does it work?
When an application terminates unexpectedly or a user terminates an application due to a lack of responsiveness, Windows will capture a mini-dump of the application, the version information of all loaded modules, and the version information for the Windows operating system on which it is being run.  The user is then presented with a dialog requesting permission to deliver this information to Microsoft.

Registered application developers provide Microsoft with a mapping file that describes each binary in a product release including version info, link times, and other traits that can be used to uniquely identify the module.  When crash reports are received by Microsoft, the WER servers compare each report against the mapped modules.  When a match occurs, a WER event is generated and the application developer is notified.
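The matching step can be pictured with a small sketch.  The field names, the ModuleId type, and the version and timestamp values below are assumptions made up for illustration; this is not the actual WER mapping file format or server logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleId:
    """Hypothetical identity of one binary: name, file version, link time."""
    name: str
    file_version: str
    link_time: int  # link timestamp in seconds since the epoch (illustrative)


def match_report(report_modules, mapped_modules):
    """Return the modules from a crash report that appear in the developer's mapping."""
    mapped = set(mapped_modules)
    return [m for m in report_modules if m in mapped]


# Illustrative data only; these values are invented.
mapped = [ModuleId("afsd_service.exe", "1.5.15.0", 1170000000)]
report = [
    ModuleId("afsd_service.exe", "1.5.15.0", 1170000000),
    ModuleId("kernel32.dll", "5.1.2600.0", 1090000000),
]

matches = match_report(report, mapped)
print([m.name for m in matches])
```

In this sketch a report produces an event for the developer only when at least one of its loaded modules matches an entry in the mapping.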

One of the really nice benefits of WER is that it can sort the events into buckets based upon the type of crash, hang, and process state at the time of the crash.  If the same type of crash occurs 50 times, all of the matching events will be placed into the same event bucket.  Application developers can easily compare the state of all of the crash reports to assist in tracking down the cause.
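As a rough sketch of the bucketing idea, the code below groups events by a key built from the application, the faulting module, and the fault offset.  The key fields and event layout are assumptions chosen for illustration; the criteria Microsoft actually uses to assign buckets are not described here.

```python
from collections import defaultdict


def bucket_key(event):
    """Build a hypothetical bucket key from the fields that identify a crash type."""
    return (
        event["app"],
        event["app_version"],
        event["module"],
        event["module_version"],
        event["offset"],
    )


def bucketize(events):
    """Group crash events so that identical crash signatures share one bucket."""
    buckets = defaultdict(list)
    for event in events:
        buckets[bucket_key(event)].append(event)
    return dict(buckets)


# Fifty identical crashes plus one different crash (invented values).
same = {"app": "app.exe", "app_version": "1.0", "module": "lib.dll",
        "module_version": "1.0", "offset": 0x1234}
other = dict(same, offset=0x5678)
events = [dict(same) for _ in range(50)] + [other]

buckets = bucketize(events)
print(sorted(len(v) for v in buckets.values()))
```

Here the fifty matching events land in a single bucket, so their dumps can be compared side by side.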

When a fix is available, the application developer can register a response which will be delivered to subsequent users that experience the same type of crash with the same version of the module or application.  These responses can indicate that the software is not supported on the OS version that it is installed on, or that a new version is available, or that a workaround can be found by reading a provided web page.

This mechanism benefits both the developers and the end users because as soon as a bug is found it can be fixed without requiring that the end users go through a long process of reporting a crash to the developers directly and being unable to provide enough technical detail for the developers to fix it.  Once the fix is available, end users are automatically notified.  Less frustration for end users and for developers.  Everyone wins.

Unless you are an open source developer or end user....

What is the problem with Open Source?

Secure Endpoints is an open source vendor.  We distribute pre-built installers for Kerberos for Windows and OpenAFS for Windows.  For each of these distributions we have binaries and matching symbol data.  When a crash report arrives from WER, the mini-dump is loaded into a debugger along with the matching binaries and symbol data.  Without the binaries or the symbols, the mini-dump information is much less useful because the stack addresses cannot be matched up with specific functions in the application modules.
As long as the version of the application that is installed is the one Secure Endpoints built, we can make use of the crash reports to identify problems, fix them and notify end users via the WER response mechanism. 
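To show why symbols matter, here is a minimal sketch of how a debugger might map a stack address to a function name.  The symbol table layout, the addresses, and the function names are hypothetical; real debuggers read this information from symbol files and do considerably more work.

```python
import bisect


def resolve(address, symbols):
    """Map an address to 'function+offset' using a sorted list of (start, name).

    Returns None when no symbol covers the address, which is the situation
    when the symbols do not match the binary that crashed.
    """
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None
    start, name = symbols[index]
    return f"{name}+0x{address - start:x}"


# Hypothetical symbol table for one module, sorted by start address.
symbols = [(0x1000, "init_cache"), (0x1400, "lookup_entry"), (0x2000, "write_buffer")]

print(resolve(0x1410, symbols))  # lookup_entry+0x10
print(resolve(0x1410, []))       # None: without symbols the address stays opaque
```

With a matching table the raw address becomes a named location in the source; with a missing or mismatched table the same address tells the developer very little.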

What happens when an organization decides to build the product from the published source code instead of using the pre-built binaries?  In that case, WER matches the module names and file version information and places an event into a crash bucket.  Secure Endpoints downloads the crash report, loads it into the debugger only to find that we have neither matching binaries nor matching symbols.  The end result is that the WER report is useless.  The best I can do is file a response to the end user recommending the use of the pre-built binaries.

I can certainly understand why organizations wish to build their own binaries.  In most cases it's because they want to be able to debug problems they experience in-house.  For that they need matching symbol files.  This is exactly the reason why both the Kerberos for Windows and OpenAFS for Windows distributions include the symbol files from the official build.  This way organizations have all the necessary pieces: binaries, symbol files and source code.  Organizations that identify problems internally should file bug reports to the open source maintainers so that fixes can be developed and incorporated into future releases.

Saturday, February 24, 2007

Squeaky wheels receive attention (both good and bad)

I spent the past few weeks traveling the country meeting with organizations that use OpenAFS and Kerberos for Windows.  I heard a number of really wonderful things:
  • "We haven't had a show stopper event in more than a year"
  • "The performance is so much better than it used to be.  We no longer receive complaints about how slow it is instead our users send us messages like this one, 'OH My gosh, afs is so fast now since i got my upgrade :)'"
At the same time the amount of funding spent on support and new development has been decreasing.  Budgets are always tight and management wants to spend its money on addressing the issues that cause on-going problems. 

Just a couple of years ago, the OpenAFS Windows client was so bad that not only were organizations sending money but individuals would send personal PayPal payments and bottles of tequila as a "thank you for improving my life".   These days expectations have changed.  The assumption is that the OpenAFS Windows client just works.

In the 1.5.15 release of OpenAFS for Windows, a serious data corruption bug was fixed.  As it turns out this bug had been reported to IBM within the last year by an organization that was still using the IBM AFS Windows client.  When the organization switched to OpenAFS it never occurred to them that OpenAFS would have the same problem given their common heritage.  OpenAFS is so much better in so many ways that they "just assumed it had already been fixed."

The truth is that all of the low hanging fruit has already been picked.  It's not that there is no more work to be done but that all of the remaining work is big.  So big in fact that it cannot be paid for out of support budgets.  Instead strategic planning funds must be used and those are much harder to come by, especially when the scope of the projects is in developer years and hundreds of thousands of U.S. dollars.  It's no longer possible for someone to ask "how much would it cost to fix xyz?" and receive a response indicating that the work could be done in a few hours or a day or two.

Instead, much of the longer term strategic work that was done to support the Windows Vista platform was unsupported.  Secure Endpoints contributed hundreds of hours of developer time to ensure that there would be an OpenAFS client for the new operating system.  This was done on the assumption that the costs would be recouped in the future through interest in support contracts.  What a surprise it was to hear this week that existing support contract customers are questioning the need for the support.  The long hours spent improving the product have taken OpenAFS off the radar of senior management and as a result the funding is disappearing.

One large user described how there have been so few reported issues with the 1.4.2 client that he can't justify upgrading to 1.5.15 even though he is aware of all of the significant improvements in performance and stability.  Performance improvements just aren't a reason to upgrade when there are thousands of clients involved.  Stability doesn't matter if the end users are not being adversely affected.   Sure there are bugs and annoyances but the help desk knows how to address them and the users move on with life.   Management simply is not going to spend money on something that is faster or prettier.  If there isn't a critical show stopper issue, it won't be detected by their radar.

Our philosophy is that software is built to address the needs of its users with the goal of making their lives happier and more productive.  Good software doesn't attract unwanted attention.  In the case of a file system or other infrastructure, the end user should be able to take it for granted.  If it receives attention from the user, that is a bad thing.

A good support contract vendor is one that addresses issues promptly when they occur, but more importantly works to ensure that you do not have issues in the first place.  The question is, if support dollars are used to fund development that proactively addresses issues before they are noticed by the customer, how does the customer know that the support dollars were well spent?  This is especially true when management does not believe that incremental improvements in performance and stability are worth paying for.

I am now beginning to understand the behaviors of large corporations providing support to Federal agencies.  I find them extremely frustrating to deal with because the apparent goal is to deploy software with just the right number of bugs such that there are never issues that bring the entire system to a halt but that ensure that there is a constant stream of small issues that will keep them on the phone with the agency's help desk.  Every week a report is sent to the customer detailing the number of issues categorized by severity and whether or not the user's problem could be addressed.  Large numbers of low severity issues are encouraged whereas even a single Priority One issue is to be avoided.

Fortunately for the clients of Secure Endpoints Inc, I believe that our role is to help prevent problems regardless of the severity.  Unfortunately, it is then harder to make the case for additional financial investment in products that are already deemed to be "good enough".

Monday, January 8, 2007

Happy New Year!

It has been many months since this blog has been updated and many wonderful things occurred during the final three months of 2006. 

On the Kerberos front:

On Nov 9th, MIT announced that they want to provide a full-time developer to support Windows development.  As a result, Secure Endpoints Inc. has become a development and support partner.  Secure Endpoints Inc. will continue to enhance Kerberos for Windows and Network Identity Manager as well as issue new releases in conjunction with MIT's Kerberos team.  The primary change is that MIT will no longer be funding Secure Endpoints' efforts.  As a result, Secure Endpoints is reaching out to the broader Kerberos for Windows user community to help support on-going development.

On Nov 30th, MIT Kerberos for Windows 3.1 including Network Identity Manager 1.1.8 was finally released.
Although Network Identity Manager has not changed much on the outside since the KFW 3.0 release, on the inside the changes were dramatic.  A large number of usability issues were addressed and the plug-in interface was improved to support a wider range of functionality.   KFW 3.1 can be downloaded from MIT:

Development on KFW 3.2 and NIM 1.2 is underway.  Secure Endpoints has posted a development road map including 64-bit Windows support, Vista support, and a wide range of enhancements to the Network Identity Manager user interface.  Financial support from the community is required to sustain the on-going improvements that KFW has received over the last several years.

For OpenAFS for Windows, 2006 was a banner year.  It started off with the 1.4.1 release candidates and ended with the release of 1.5.13.  Throughout those releases there were more than 150 improvements to the product.  The most important changes include:
* No more resource leaks within the SMB Server
* Locally managed byte range locks backed by full file locks on the file server
* Improved performance when disconnected from the network
* Improved performance for directory listing
* Improved performance when storing temporary files within AFS
* Improved power management event handling
* Support for file sizes greater than 2GB
* Over quota and disk full errors are now reported
* Significantly improved handling of dirty buffers results in decreased CPU utilization and faster writes
* A Network Identity Manager AFS credential plug-in is provided
* Support for 64-bit Windows
* Support for Microsoft Vista
A summary of the current state of OpenAFS for Windows is available, as is the most recent Status Report.

Secure Endpoints has published a development road map for OpenAFS for Windows which includes a number of performance improvements to the AFS Client Service as well as a complete set of re-writes of the Explorer Shell integration, the OpenAFS Control Panel, and the development of a Microsoft Management Console for configuring the AFS Client Service.

Finally, I have saved perhaps the best surprise for last.  Just before the end of the year the AFS Servers (file, protection, volume, volume database, bos) were made functional once again.  The install wizard has been removed because it made assumptions that no longer hold true, but by manually installing the servers as is done on UNIX, it is now possible to run a cell from a Windows Server.  See the road map for a summary of what still remains to be done.

In 2007, there is much to look forward to.  During the first quarter Secure Endpoints will release a new Network Identity Manager plug-in for obtaining KX509/KCA certificates; and with community support there will be significant releases of both KFW and OpenAFS. 

Mark on your calendar that the next AFS & Kerberos Best Practice Workshop will be held at Stanford during the week of May 7 to 11.  As always, full-day tutorials will be provided on AFS and Kerberos installation, administration, and maintenance.  This year Secure Endpoints will be providing the Kerberos tutorial.  New this year will be a discussion of Kerberos and GSS-API programming practices.

Here's a toast to the accomplishments of 2006 and those that are to come in 2007. 
Happy New Year!!!!