The goal of this work is to expand the accessibility of electronic mail and to improve the overall reliability and stability of this vital service, while maintaining the sanity of our SAs (yours truly included). For this to really work, it must have:
Accessing a local spool file via stat normally takes 180 msec without hlfsd. If hlfsd is running and has the user's entry cached, accessing the file takes an additional 60 msec. This overhead arises because the kernel must call out to a user-level NFS server, incurring several context switches.
If the entry is not cached, hlfsd forks a child to perform checks that could otherwise hang the main server. The fork and the other checks add another 70 msec (or 130 msec over a regular lookup not using hlfsd). However, this overhead is incurred at most once every 5 minutes, because the result of each check is cached for that long by default.
The above times are noticeable, but not unreasonably so for a user-level file server. (By comparison, in our environment it takes about 0.5 seconds to access a new filesystem using amd.) Given the benefits of hlfsd, we feel that a minimal access slowdown is a small price to pay. In practice, over 12 months of usage we have noticed no visible degradation of service.
The internal data structures (tables and caches) require 50 bytes per user on the system. In our environment, with 750 users, that translates to about 37KB -- rather insignificant given that workstations these days come installed with at least 16-32MB of RAM.
Initially we ran the script once a day, but found a wait of up to 24 hours for lost-mail redelivery too long. We then experimented with running lostaltmail once an hour, but found that frequency excessive. The most likely situation in which hlfsd will repoint its symbolic link to the alternate spool directory is when the user's filesystem is full. A full filesystem is a persistent condition that in most cases takes some time to fix, as it requires human intervention. While the condition that causes hlfsd to use the alternate spool directory persists, running lostaltmail consumes resources needlessly, only to redeliver the mail back to the alternate spool directory. We finally settled on running lostaltmail between 6 and 12 times a day; depending on the amount of lost mail expected, the script could be run more or less often.
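A schedule in that range can be expressed as an ordinary crontab entry; the installation path of lostaltmail below is an assumption for illustration.

```
# Run lostaltmail every three hours (8 times a day):
0 0,3,6,9,12,15,18,21 * * * /usr/local/sbin/lostaltmail
```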
Hlfsd does not introduce any new problems; that is, if every filesystem it can use is completely full, whatever behavior your current LDA provides is maintained. Since hlfsd uses both the user's filesystem and an alternate spool directory, it actually increases the availability of mail services, by ``virtually'' increasing the disk space available for mail spooling.
Once space was freed in the user's filesystem and the cached entry expired, hlfsd pointed its symbolic link back to the user's home directory. The next time the remailing script ran, all ``lost'' mail was resent to its owners.
Since the installation of hlfsd in our production environment, we have seen a few cases of lost mail being resent, mostly due to full filesystems. We know of no case where mail was completely lost.
The most significant work SAs need to do is to identify programs that must access other users' mailbox files, and ``setgid'' them to HLFS_GID. In our environment we had to do so for comsat, from, finger, and a few others. Our environment uses the rdist[5] automatic software-distribution program, so these changes were required in only one place -- the top of our rdist tree.