The goal of this work is to expand the accessibility of electronic mail and to improve the overall reliability and stability of this vital service, while maintaining the sanity of our SAs (yours truly included). For this to really work, it must have:
Accessing a local spool file via stat normally takes 180 msec without hlfsd. If hlfsd is running and has the user's entry cached, accessing the file takes an additional 60 msec. This overhead arises because the kernel must call out to a user-level NFS server, incurring several context switches.
If the entry is not cached, hlfsd forks a child to perform checks that could otherwise hang the main server. The fork and the other checks add another 70 msec (or 130 msec over a regular lookup not using hlfsd). However, this overhead is incurred at most once every 5 minutes, because the result of each check is cached for that long by default.
The above times are noticeable, but not unreasonably so for a user-level file server. (By comparison, in our environment it takes about 0.5 seconds to access a new filesystem using amd.) Given the benefits of hlfsd, we feel that a minimal access slowdown is a small price to pay. In practice, over 12 months of usage we have noticed no visible degradation of service.
The internal data structures (tables and caches) require 50 bytes per user on the system. In our environment, with 750 users, that translates to about 37KB -- rather insignificant given that workstations these days come installed with at least 16-32MB of RAM.
Initially we ran the script once a day, but found a wait of up to 24 hours for lost-mail redelivery too long. We then experimented with running lostaltmail once an hour, but found that frequency excessive. The most likely situation in which hlfsd will repoint its symbolic link to the alternate spool directory is when the user's filesystem is full. A full filesystem is a persistent condition that in most cases takes some time to fix, as it requires human intervention. While the condition that causes hlfsd to use the alternate spool directory persists, running lostaltmail consumes resources needlessly, only to redeliver the mail back to the alternate spool directory. We finally settled on running lostaltmail between 6 and 12 times a day; depending on the amount of lost mail expected, the script could be run more or less often.
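A schedule in that range can be expressed as an ordinary crontab entry; the installation path of lostaltmail below is an assumption for illustration.

```
# Run lostaltmail every three hours (8 times a day):
0 0,3,6,9,12,15,18,21 * * * /usr/local/sbin/lostaltmail
```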
Hlfsd does not introduce any new problems; that is, if every filesystem it can use is completely full, whatever behavior your current LDA provides is maintained. Since hlfsd uses both the user's filesystem and an alternate spool directory, it actually increases the availability of mail services, by ``virtually'' increasing the disk space available for mail spooling.
Once space was freed in the user's filesystem and the cached entry expired, hlfsd pointed its symbolic link back to the user's home directory. The next time the remailing script ran, all ``lost'' mail was resent to its owners.
Since the installation of hlfsd in our production environment, we have seen a few cases of lost mail being resent, mostly due to full filesystems. We know of no case where mail was completely lost.
The most significant work SAs need to do is to identify programs that must access other users' mailbox files, and ``setgid'' them to HLFS_GID. In our environment we had to do so for comsat, from, finger, and a few others. Our environment uses the rdist[5] automatic software-distribution program, so these changes were required in only one place -- the top of our rdist tree.