doc/doc-misc/Ext-mbx-locking

   1          UNIX Advisory File Locking Implications on c-client
   2                     Mark Crispin, 28 November 1995
   3
   4
   5         THIS DOCUMENT HAS BEEN UPDATED TO REFLECT THE CODE IN THE
   6         IMAP-4 TOOLKIT AS OF NOVEMBER 28, 1995.  SOME STATEMENTS
   7         IN THIS DOCUMENT DO NOT APPLY TO EARLIER VERSIONS OF THE
   8         IMAP TOOLKIT.
   9
  10 INTRODUCTION
  11
  12      Advisory locking is a mechanism by which cooperating processes
  13 can signal to each other their usage of a resource and whether or not
  14 that usage is critical.  It is not a mechanism to protect against
  15 processes which do not cooperate in the locking.
  16
  17      The most basic form of locking involves a counter.  This counter
  18 is -1 when the resource is available.  If a process wants the lock, it
  19 executes an atomic increment-and-test-if-zero.  If the value is zero,
  20 the process has the lock and can execute the critical code that needs
  21 exclusive usage of a resource.  When it is finished, it sets the lock
  22 back to -1.  In C terms:
  23
  24   while (++lock)                /* try to get lock */
  25     invoke_other_threads ();    /* failed, try again */
  26    .
  27    .    /* critical code  here */
  28    .
  29   lock = -1;                    /* release lock */
  30
  31      This particular form of locking appears most commonly in
  32 multi-threaded applications such as operating system kernels.  It
  33 makes several presumptions:
  34  (1) it is alright to keep testing the lock (no overflow)
  35  (2) the critical resource is single-access only
  36  (3) there is shared writeable memory between the two threads
  37  (4) the threads can be trusted to release the lock when finished
  38
  39      In applications programming on multi-user systems, most commonly
  40 the other threads are in an entirely different process, which may even
  41 be logged in as a different user.  Few operating systems offer shared
  42 writeable memory between such processes.
  43
  44      A means of communicating this is by use of a file with a mutually
  45 agreed upon name.  A binary semaphore can be passed by means of the
  46 existance or non-existance of that file, provided that there is an
  47 atomic means to create a file if and only if that file does not exist.
  48 In C terms:
  49
  50                                 /* try to get lock */
  51   while ((fd = open ("lockfile",O_WRONLY|O_CREAT|O_EXCL,0666)) < 0)
  52     sleep (1);                  /* failed, try again */
  53   close (fd);                   /* got the lock */
  54    .
  55    .    /* critical code  here */
  56    .
  57   unlink ("lockfile");          /* release lock */
  58
  59      This form of locking makes fewer presumptions, but it still is
  60 guilty of presumptions (2) and (4) above.  Presumption (2) limits the
  61 ability to have processes sharing a resource in a non-conflicting
  62 fashion (e.g. reading from a file).  Presumption (4) leads to
  63 deadlocks should the process crash while it has a resource locked.
  64
  65      Most modern operating systems provide a resource locking system
  66 call that has none of these presumptions.  In particular, a mechanism
  67 is provided for identifying shared locks as opposed to exclusive
  68 locks.  A shared lock permits other processes to obtain a shared lock,
  69 but denies exclusive locks.  In other words:
  70
  71         current state           want shared     want exclusive
  72         -------------           -----------     --------------
  73          unlocked                YES             YES
  74          locked shared           YES             NO
  75          locked exclusive        NO              NO
  76
  77      Furthermore, the operating system automatically relinquishes all
  78 locks held by that process when it terminates.
  79
  80      A useful operation is the ability to upgrade a shared lock to
  81 exclusive (provided there are no other shared users of the lock) and
  82 to downgrade an exclusive lock to shared.  It is important that at no
  83 time is the lock ever removed; a process upgrading to exclusive must
  84 not relenquish its shared lock.
  85
  86      Most commonly, the resources being locked are files.  Shared
  87 locks are particularly important with files; multiple simultaneous
  88 processes can read from a file, but only one can safely write at a
  89 time.  Some writes may be safer than others; an append to the end of
  90 the file is safer than changing existing file data.  In turn, changing
  91 a file record in place is safer than rewriting the file with an
  92 entirely different structure.
  93
  94
  95 FILE LOCKING ON UNIX
  96
  97      In the oldest versions of UNIX, the use of a semaphore lockfile
  98 was the only available form of locking.  Advisory locking system calls
  99 were not added to UNIX until after the BSD vs. System V split.  Both
 100 of these system calls deal with file resources only.
 101
 102      Most systems only have one or the other form of locking.  AIX
 103 emulates the BSD form of locking as a jacket into the System V form.
 104 Ultrix and OSF/1 implement both forms.
 105 \f
 106 BSD
 107
 108      BSD added the flock() system call.  It offers capabilities to
 109 acquire shared lock, acquire exclusive lock, and unlock.  Optionally,
 110 the process can request an immediate error return instead of blocking
 111 when the lock is unavailable.
 112
 113
 114 FLOCK() BUGS
 115
 116      flock() advertises that it permits upgrading of shared locks to
 117 exclusive and downgrading of exclusive locks to shared, but it does so
 118 by releasing the former lock and then trying to acquire the new lock.
 119 This creates a window of vulnerability in which another process can
 120 grab the exclusive lock.  Therefore, this capability is not useful,
 121 although many programmers have been deluded by incautious reading of
 122 the flock() man page to believe otherwise.  This problem can be
 123 programmed around, once the programmer is aware of it.
 124
 125      flock() always returns as if it succeeded on NFS files, when in
 126 fact it is a no-op.  There is no way around this.
 127
 128      Leaving aside these two problems, flock() works remarkably well,
 129 and has shown itself to be robust and trustworthy.
 130 \f
 131 SYSTEM V/POSIX
 132
 133      System V added new functions to the fnctl() system call, and a
 134 simple interface through the lockf() subroutine.  This was
 135 subsequently included in POSIX.  Both offer the facility to apply the
 136 lock to a particular region of the file instead of to the entire file.
 137 lockf() only supports exclusive locks, and calls fcntl() internally;
 138 hence it won't be discussed further.
 139
 140      Functionally, fcntl() locking is a superset of flock(); it is
 141 possible to implement a flock() emulator using fcntl(), with one minor
 142 exception: it is not possible to acquire an exclusive lock if the file
 143 is not open for write.
 144
 145      The fcntl() locking functions are: query lock station of a file
 146 region, lock/unlock a region, and lock/unlock a region and block until
 147 have the lock.  The locks may be shared or exclusive.  By means of the
 148 statd and lockd daemons, fcntl() locking is available on NFS files.
 149
 150      When statd is started at system boot, it reads its /etc/state
 151 file (which contains the number of times it has been invoked) and
 152 /etc/sm directory (which contains a list of all remote sites which are
 153 client or server locking with this site), and notifies the statd on
 154 each of these systems that it has been restarted.  Each statd then
 155 notifies the local lockd of the restart of that system.
 156
 157      lockd receives fcntl() requests for NFS files.  It communicates
 158 with the lockd at the server and requests it to apply the lock, and
 159 with the statd to request it for notification when the server goes
 160 down.  It blocks until all these requests are completed.
 161
 162      There is quite a mythos about fcntl() locking.
 163
 164      One religion holds that fcntl() locking is the best thing since
 165 sliced bread, and that programs which use flock() should be converted
 166 to fcntl() so that NFS locking will work.  However, as noted above,
 167 very few systems support both calls, so such an exercise is pointless
 168 except on Ultrix and OSF/1.
 169
 170      Another religion, which I adhere to, has the opposite viewpoint.
 171
 172
 173 FCNTL() BUGS
 174
 175      For all of the hairy code to do individual section locking of a
 176 file, it's clear that the designers of fcntl() locking never
 177 considered some very basic locking operations.  It's as if all they
 178 knew about locking they got out of some CS textbook with not
 179 investigation of real-world needs.
 180
 181      It is not possible to acquire an exclusive lock unless the file
 182 is open for write.  You could have append with shared read, and thus
 183 you could have a case in which a read-only access may need to go
 184 exclusive.  This problem can be programmed around once the programmer
 185 is aware of it.
 186
 187      If the file is opened on another file designator in the same
 188 process, the file is unlocked even if no attempt is made to do any
 189 form of locking on the second designator.  This is a very bad bug.  It
 190 means that an application must keep track of all the files that it has
 191 opened and locked.
 192
 193      If there is no statd/lockd on the NFS server, fcntl() will hang
 194 forever waiting for them to appear.  This is a bad bug.  It means that
 195 any attempt to lock on a server that doesn't run these daemons will
 196 hang.  There is no way for an application to request flock() style
 197 ``try to lock, but no-op if the mechanism ain't there''.
 198
 199      There is a rumor to the effect that fcntl() will hang forever on
 200 local files too if there is no local statd/lockd.  These daemons are
 201 running on mailer.u, although they appear not to have much CPU time.
 202 A useful experiment would be to kill them and see if imapd is affected
 203 in any way, but I decline to do so without an OK from UCS!  ;-) If
 204 killing statd/lockd can be done without breaking fcntl() on local
 205 files, this would become one of the primary means of dealing with this
 206 problem.
 207
 208      The statd and lockd daemons have quite a reputation for extreme
 209 fragility.  There have been numerous reports about the locking
 210 mechanism being wedged on a systemwide or even clusterwide basis,
 211 requiring a reboot to clear.  It is rumored that this wedge, once it
 212 happens, also blocks local locking.  Presumably killing and restarting
 213 statd would suffice to clear the wedge, but I haven't verified this.
 214
 215      There appears to be a limit to how many locks may be in use at a
 216 time on the system, although the documentation only mentions it in
 217 passing.  On some of their systems, UCS has increased lockd's ``size
 218 of the socket buffer'', whatever that means.
 219 \f
 220 C-CLIENT USAGE
 221
 222      c-client uses flock().  On System V systems, flock() is simulated
 223 by an emulator that calls fcntl().  This emulator is provided by some
 224 systems (e.g. AIX), or uses c-client's flock.c module.
 225
 226
 227 BEZERK AND MMDF
 228
 229      Locking in the traditional UNIX formats was largely dictated by
 230 the status quo in other applications; however, additional protection
 231 is added against inadvertantly running multiple instances of a
 232 c-client application on the same mail file.
 233
 234      (1) c-client attempts to create a .lock file (mail file name with
 235 ``.lock'' appended) whenever it reads from, or writes to, the mail
 236 file.  This is an exclusive lock, and is held only for short periods
 237 of time while c-client is actually doing the I/O.  There is a 5-minute
 238 timeout for this lock, after which it is broken on the presumption
 239 that it is a stale lock.  If it can not create the .lock file due to
 240 an EACCES (protection failure) error, it once silently proceeded
 241 without this lock; this was for systems which protect /usr/spool/mail
 242 from unprivileged processes creating files.  Today, c-client reports
 243 an error unless it is built otherwise.  The purpose of this lock is to
 244 prevent against unfavorable interactions with mail delivery.
 245
 246      (2) c-client applies a shared flock() to the mail file whenever
 247 it reads from the mail file, and an exclusive flock() whenever it
 248 writes to the mail file.  This lock is freed as soon as it finishes
 249 reading.  The purpose of this lock is to prevent against unfavorable
 250 interactions with mail delivery.
 251
 252      (3) c-client applies an exclusive flock() to a file on /tmp
 253 (whose name represents the device and inode number of the file) when
 254 it opens the mail file.  This lock is maintained throughout the
 255 session, although c-client has a feature (called ``kiss of death'')
 256 which permits c-client to forcibly and irreversibly seize the lock
 257 from a cooperating c-client application that surrenders the lock on
 258 demand.  The purpose of this lock is to prevent against unfavorable
 259 interactions with other instances of c-client (rewriting the mail
 260 file).
 261
 262      Mail delivery daemons use lock (1), (2), or both.  Lock (1) works
 263 over NFS; lock (2) is the only one that works on sites that protect
 264 /usr/spool/mail against unprivileged file creation.  Prudent mail
 265 delivery daemons use both forms of locking, and of course so does
 266 c-client.
 267
 268      If only lock (2) is used, then multiple processes can read from
 269 the mail file simultaneously, although in real life this doesn't
 270 really change things.  The normal state of locks (1) and (2) is
 271 unlocked except for very brief periods.
 272
 273
 274 TENEX AND MTX
 275
 276      The design of the locking mechanism of these formats was
 277 motivated by a design to enable multiple simultaneous read/write
 278 access.  It is almost the reverse of how locking works with
 279 bezerk/mmdf.
 280
 281      (1) c-client applies a shared flock() to the mail file when it
 282 opens the mail file.  It upgrades this lock to exclusive whenever it
 283 tries to expunge the mail file.  Because of the flock() bug that
 284 upgrading a lock actually releases it, it will not do so until it has
 285 acquired an exclusive lock (2) first.  The purpose of this lock is to
 286 prevent against expunge taking place while some other c-client has the
 287 mail file open (and thus knows where all the messages are).
 288
 289      (2) c-client applies a shared flock() to a file on /tmp (whose
 290 name represents the device and inode number of the file) when it
 291 parses the mail file.  It applies an exclusive flock() to this file
 292 when it appends new mail to the mail file, as well as before it
 293 attempts to upgrade lock (1) to exclusive.  The purpose of this lock
 294 is to prevent against data being appended while some other c-client is
 295 parsing mail in the file (to prevent reading of incomplete messages).
 296 It also protects against the lock-releasing timing race on lock (1).
 297 \f
 298 OBSERVATIONS
 299
 300      In a perfect world, locking works.  You are protected against
 301 unfavorable interactions with the mailer and against your own mistake
 302 by running more than one instance of your mail reader.  In tenex/mtx
 303 formats, you have the additional benefit that multiple simultaneous
 304 read/write access works, with the sole restriction being that you
 305 can't expunge if there are any sharers of the mail file.
 306
 307      If the mail file is NFS-mounted, then flock() locking is a silent
 308 no-op.  This is the way BSD implements flock(), and c-client's
 309 emulation of flock() through fcntl() tests for NFS files and
 310 duplicates this functionality.  There is no locking protection for
 311 tenex/mtx mail files at all, and only protection against the mailer
 312 for bezerk/mmdf mail files.  This has been the accepted state of
 313 affairs on UNIX for many sad years.
 314
 315      If you can not create .lock files, it should not affect locking,
 316 since the flock() locks suffice for all protection.  This is, however,
 317 not true if the mailer does not check for flock() locking, or if the
 318 the mail file is NFS-mounted.
 319
 320      What this means is that there is *no* locking protection at all
 321 in the case of a client using an NFS-mounted /usr/spool/mail that does
 322 not permit file creation by unprivileged programs.  It is impossible,
 323 under these circumstances, for an unprivileged program to do anything
 324 about it.  Worse, if EACCES errors on .lock file creation are no-op'ed
 325 , the user won't even know about it.  This is arguably a site
 326 configuration error.
 327
 328      The problem with not being able to create .lock files exists on
 329 System V as well, but the failure modes for flock() -- which is
 330 implemented via fcntl() -- are different.
 331
 332      On System V, if the mail file is NFS-mounted and either the
 333 client or the server lacks a functioning statd/lockd pair, then the
 334 lock attempt would have hung forever if it weren't for the fact that
 335 c-client tests for NFS and no-ops the flock() emulator in this case.
 336 Systemwide or clusterwide failures of statd/lockd have been known to
 337 occur which cause all locks in all processes to hang (including
 338 local?).  Without the special NFS test made by c-client, there would
 339 be no way to request BSD-style no-op behavior, nor is there any way to
 340 determine that this is happening other than the system being hung.
 341
 342      The additional locking introduced by c-client was shown to cause
 343 much more stress on the System V locking mechanism than has
 344 traditionally been placed upon it.  If it was stressed too far, all
 345 hell broke loose.  Fortunately, this is now past history.
 346 \f
 347 TRADEOFFS
 348
 349      c-client based applications have a reasonable chance of winning
 350 as long as you don't use NFS for remote access to mail files.  That's
 351 what IMAP is for, after all.  It is, however, very important to
 352 realize that you can *not* use the lock-upgrade feature by itself
 353 because it releases the lock as an interim step -- you need to have
 354 lock-upgrading guarded by another lock.
 355
 356      If you have the misfortune of using System V, you are likely to
 357 run into problems sooner or later having to do with statd/lockd.  You
 358 basically end up with one of three unsatisfactory choices:
 359         1) Grit your teeth and live with it.
 360         2) Try to make it work:
 361            a) avoid NFS access so as not to stress statd/lockd.
 362            b) try to understand the code in statd/lockd and hack it
 363               to be more robust.
 364            c) hunt out the system limit of locks, if there is one,
 365               and increase it.  Figure on at least two locks per
 366               simultaneous imapd process and four locks per Pine
 367               process.  Better yet, make the limit be 10 times the
 368               maximum number of processes.
 369            d) increase the socket buffer (-S switch to lockd) if
 370               it is offered.  I don't know what this actually does,
 371               but giving lockd more resources to do its work can't
 372               hurt.  Maybe.
 373         3) Decide that it can't possibly work, and turn off the
 374            fcntl() calls in your program.
 375         4) If nuking statd/lockd can be done without breaking local
 376            locking, then do so.  This would make SVR4 have the same
 377            limitations as BSD locking, with a couple of additional
 378            bugs.
 379         5) Check for NFS, and don't do the fcntl() in the NFS case.
 380            This is what c-client does.
 381
 382      Note that if you are going to use NFS to access files on a server
 383 which does not have statd/lockd running, your only choice is (3), (4),
 384 or (5).  Here again, IMAP can bail you out.
 385
 386      These problems aren't unique to c-client applications; they have
 387 also been reported with Elm, Mediamail, and other email tools.
 388
 389      Of the other two SVR4 locking bugs:
 390
 391      Programmer awareness is necessary to deal with the bug that you
 392 can not get an exclusive lock unless the file is open for write.  I
 393 believe that c-client has fixed all of these cases.
 394
 395      The problem about opening a second designator smashing any
 396 current locks on the file has not been addressed satisfactorily yet.
 397 This is not an easy problem to deal with, especially in c-client which
 398 really doesn't know what other files/streams may be open by Pine.
 399
 400      Aren't you so happy that you bought an System V system?