          UNIX Advisory File Locking Implications on c-client
                    Mark Crispin, 28 November 1995


        THIS DOCUMENT HAS BEEN UPDATED TO REFLECT THE CODE IN THE
        IMAP-4 TOOLKIT AS OF NOVEMBER 28, 1995.  SOME STATEMENTS
        IN THIS DOCUMENT DO NOT APPLY TO EARLIER VERSIONS OF THE
        IMAP TOOLKIT.

INTRODUCTION

     Advisory locking is a mechanism by which cooperating processes
can signal to each other their usage of a resource and whether or not
that usage is critical.  It is not a mechanism to protect against
processes which do not cooperate in the locking.

     The most basic form of locking involves a counter.  This counter
is -1 when the resource is available.  If a process wants the lock, it
executes an atomic increment-and-test-if-zero.  If the value is zero,
the process has the lock and can execute the critical code that needs
exclusive usage of a resource.  When it is finished, it sets the lock
back to -1.  In C terms:

        while (++lock)                  /* try to get lock */
          invoke_other_threads ();      /* failed, try again */
         .
         .                              /* critical code here */
         .
        lock = -1;                      /* release lock */

     This particular form of locking appears most commonly in
multi-threaded applications such as operating system kernels.  It
makes several presumptions:
 (1) it is alright to keep testing the lock (no overflow)
 (2) the critical resource is single-access only
 (3) there is shared writeable memory between the two threads
 (4) the threads can be trusted to release the lock when finished

     In applications programming on multi-user systems, most commonly
the other threads are in an entirely different process, which may even
be logged in as a different user.  Few operating systems offer shared
writeable memory between such processes.

     A means of communicating between such processes is a file with a
mutually agreed-upon name.  A binary semaphore can be passed by means
of the existence or non-existence of that file, provided that there is
an atomic means to create a file if and only if that file does not
exist.  In C terms:

                                        /* try to get lock */
        while ((fd = open ("lockfile",O_WRONLY|O_CREAT|O_EXCL,0666)) < 0)
          sleep (1);                    /* failed, try again */
        close (fd);                     /* got the lock */
         .
         .                              /* critical code here */
         .
        unlink ("lockfile");            /* release lock */

     This form of locking makes fewer presumptions, but it is still
guilty of presumptions (2) and (4) above.  Presumption (2) limits the
ability of processes to share a resource in a non-conflicting fashion
(e.g. reading from a file).  Presumption (4) leads to deadlocks should
the process crash while it has a resource locked.

     Most modern operating systems provide a resource locking system
call that has none of these presumptions.  In particular, a mechanism
is provided for identifying shared locks as opposed to exclusive
locks.  A shared lock permits other processes to obtain a shared lock,
but denies exclusive locks.  In other words:

        current state           want shared     want exclusive
        -------------           -----------     --------------
        unlocked                YES             YES
        locked shared           YES             NO
        locked exclusive        NO              NO

     Furthermore, the operating system automatically relinquishes all
locks held by a process when that process terminates.

     A useful operation is the ability to upgrade a shared lock to
exclusive (provided there are no other shared users of the lock) and
to downgrade an exclusive lock to shared.  It is important that at no
time is the lock ever removed; a process upgrading to exclusive must
not relinquish its shared lock.

     Most commonly, the resources being locked are files.  Shared
locks are particularly important with files; multiple simultaneous
processes can read from a file, but only one can safely write at a
time.  Some writes may be safer than others; an append to the end of
the file is safer than changing existing file data.  In turn, changing
a file record in place is safer than rewriting the file with an
entirely different structure.


FILE LOCKING ON UNIX

     In the oldest versions of UNIX, the use of a semaphore lockfile
was the only available form of locking.  Advisory locking system calls
were not added to UNIX until after the BSD vs. System V split.  Both
of these system calls deal with file resources only.

     Most systems have only one or the other form of locking.  AIX
emulates the BSD form of locking as a jacket around the System V form.
Ultrix and OSF/1 implement both forms.
\f
BSD

     BSD added the flock() system call.  It offers the capabilities to
acquire a shared lock, acquire an exclusive lock, and unlock.
Optionally, the process can request an immediate error return instead
of blocking when the lock is unavailable.
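
In C terms (a minimal sketch; the file name and the demo_flock helper
name are hypothetical, and error handling is abbreviated):

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

/* Returns 0 if the sequence of lock operations succeeded, -1 otherwise. */
int demo_flock (const char *path)
{
  int fd = open (path, O_RDWR|O_CREAT, 0600);
  if (fd < 0) return -1;
                                /* shared lock: other readers welcome */
  if (flock (fd, LOCK_SH) < 0) { close (fd); return -1; }
                                /* non-blocking try for exclusive; fails
                                 * immediately instead of blocking if another
                                 * process holds a conflicting lock */
  if (flock (fd, LOCK_EX|LOCK_NB) < 0)
    printf ("exclusive lock unavailable\n");
  flock (fd, LOCK_UN);          /* release the lock */
  close (fd);
  return 0;
}
```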


FLOCK() BUGS

     flock() advertises that it permits upgrading of shared locks to
exclusive and downgrading of exclusive locks to shared, but it does so
by releasing the former lock and then trying to acquire the new lock.
This creates a window of vulnerability in which another process can
grab the exclusive lock.  Therefore, this capability is not useful,
although many programmers have been deluded by incautious reading of
the flock() man page into believing otherwise.  This problem can be
programmed around, once the programmer is aware of it.

     flock() always returns as if it succeeded on NFS files, when in
fact it is a no-op.  There is no way around this.

     Leaving aside these two problems, flock() works remarkably well,
and has shown itself to be robust and trustworthy.
\f
SYSTEM V/POSIX

     System V added new functions to the fcntl() system call, and a
simple interface through the lockf() subroutine.  This was
subsequently included in POSIX.  Both offer the facility to apply the
lock to a particular region of the file instead of to the entire file.
lockf() only supports exclusive locks, and calls fcntl() internally;
hence it won't be discussed further.

     Functionally, fcntl() locking is a superset of flock(); it is
possible to implement a flock() emulator using fcntl(), with one minor
exception: it is not possible to acquire an exclusive lock if the file
is not open for write.

     The fcntl() locking functions are: query the lock status of a
file region, lock/unlock a region, and lock/unlock a region and block
until the lock is obtained.  The locks may be shared or exclusive.  By
means of the statd and lockd daemons, fcntl() locking is available on
NFS files.
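
In C terms, a region lock looks roughly like this (a sketch; the
region_lock helper name is not part of any API, and F_SETLKW is the
blocking variant, F_SETLK the immediate-error variant):

```c
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

/* Lock or unlock a region of fd; type is F_RDLCK (shared), F_WRLCK
 * (exclusive), or F_UNLCK.  Returns 0 on success, -1 on failure.
 * F_SETLKW blocks until the lock is granted. */
int region_lock (int fd, short type, off_t start, off_t len)
{
  struct flock fl;
  memset (&fl, 0, sizeof (fl));
  fl.l_type = type;             /* shared, exclusive, or unlock */
  fl.l_whence = SEEK_SET;       /* offsets relative to start of file */
  fl.l_start = start;
  fl.l_len = len;               /* 0 means "to end of file" */
  return fcntl (fd, F_SETLKW, &fl);
}
```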

     When statd is started at system boot, it reads its /etc/state
file (which contains the number of times it has been invoked) and its
/etc/sm directory (which contains a list of all remote sites which are
doing client or server locking with this site), and notifies the statd
on each of these systems that it has been restarted.  Each statd then
notifies the local lockd of the restart of that system.

     lockd receives fcntl() requests for NFS files.  It communicates
with the lockd at the server to request that it apply the lock, and
with the statd to request notification when the server goes down.  It
blocks until all these requests are completed.

     There is quite a mythos about fcntl() locking.

     One religion holds that fcntl() locking is the best thing since
sliced bread, and that programs which use flock() should be converted
to fcntl() so that NFS locking will work.  However, as noted above,
very few systems support both calls, so such an exercise is pointless
except on Ultrix and OSF/1.

     Another religion, which I adhere to, has the opposite viewpoint.


FCNTL() BUGS

     For all of the hairy code to do individual section locking of a
file, it's clear that the designers of fcntl() locking never
considered some very basic locking operations.  It's as if all they
knew about locking they got out of some CS textbook, with no
investigation of real-world needs.

     It is not possible to acquire an exclusive lock unless the file
is open for write.  Since appending can coexist with shared reading,
there are cases in which a read-only accessor may need to go
exclusive.  This problem can be programmed around once the programmer
is aware of it.

     If the file is opened on another file designator in the same
process, the file is unlocked even if no attempt is made to do any
form of locking on the second designator.  This is a very bad bug.  It
means that an application must keep track of all the files that it has
opened and locked.

     If there is no statd/lockd on the NFS server, fcntl() will hang
forever waiting for them to appear.  This is a bad bug.  It means that
any attempt to lock on a server that doesn't run these daemons will
hang.  There is no way for an application to request flock() style
``try to lock, but no-op if the mechanism ain't there''.

     There is a rumor to the effect that fcntl() will hang forever on
local files too if there is no local statd/lockd.  These daemons are
running on mailer.u, although they appear not to have used much CPU
time.  A useful experiment would be to kill them and see if imapd is
affected in any way, but I decline to do so without an OK from UCS!
;-) If killing statd/lockd can be done without breaking fcntl() on
local files, this would become one of the primary means of dealing
with this problem.

     The statd and lockd daemons have quite a reputation for extreme
fragility.  There have been numerous reports about the locking
mechanism being wedged on a systemwide or even clusterwide basis,
requiring a reboot to clear.  It is rumored that this wedge, once it
happens, also blocks local locking.  Presumably killing and restarting
statd would suffice to clear the wedge, but I haven't verified this.

     There appears to be a limit to how many locks may be in use at a
time on the system, although the documentation only mentions it in
passing.  On some of their systems, UCS has increased lockd's ``size
of the socket buffer'', whatever that means.
\f
C-CLIENT USAGE

     c-client uses flock().  On System V systems, flock() is simulated
by an emulator that calls fcntl().  This emulator is provided by some
systems (e.g. AIX), or by c-client's own flock.c module.
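
In rough C terms, such an emulator maps the BSD operations onto
whole-file fcntl() locks (an illustrative sketch in the spirit of
flock.c, not the actual c-client source; the emulate_flock name is
hypothetical):

```c
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#ifndef LOCK_SH                 /* mirror the <sys/file.h> values */
#define LOCK_SH 1               /* shared lock */
#define LOCK_EX 2               /* exclusive lock */
#define LOCK_NB 4               /* don't block when locking */
#define LOCK_UN 8               /* unlock */
#endif

int emulate_flock (int fd, int op)
{
  struct flock fl;
  memset (&fl, 0, sizeof (fl));
  fl.l_whence = SEEK_SET;       /* lock the entire file */
  fl.l_start = fl.l_len = 0;
  if (op & LOCK_UN) fl.l_type = F_UNLCK;
  else if (op & LOCK_EX) fl.l_type = F_WRLCK;
  else fl.l_type = F_RDLCK;     /* LOCK_SH */
                                /* F_SETLK = immediate error, F_SETLKW = block */
  return fcntl (fd, (op & LOCK_NB) ? F_SETLK : F_SETLKW, &fl);
}
```

Note the one minor exception mentioned above: F_WRLCK, and hence
LOCK_EX, fails if fd is not open for write.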


BEZERK AND MMDF

     Locking in the traditional UNIX formats was largely dictated by
the status quo in other applications; however, additional protection
is added against inadvertently running multiple instances of a
c-client application on the same mail file.

     (1) c-client attempts to create a .lock file (the mail file name
with ``.lock'' appended) whenever it reads from, or writes to, the
mail file.  This is an exclusive lock, and is held only for short
periods of time while c-client is actually doing the I/O.  There is a
5-minute timeout for this lock, after which it is broken on the
presumption that it is a stale lock.  If c-client can not create the
.lock file due to an EACCES (protection failure) error, it once
silently proceeded without this lock; this was for systems which
protect /usr/spool/mail from unprivileged processes creating files.
Today, c-client reports an error unless it is built otherwise.  The
purpose of this lock is to protect against unfavorable interactions
with mail delivery.
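
In C terms, the .lock protocol looks roughly like this (a sketch; the
dotlock helper name is hypothetical and the exact c-client retry logic
differs):

```c
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <errno.h>
#include <sys/stat.h>

/* Acquire "name.lock" by atomic O_EXCL creation.  Returns 0 when the
 * lock is held, -1 on a hard failure such as EACCES. */
int dotlock (const char *lock)
{
  int fd;
  struct stat sb;
  while ((fd = open (lock, O_WRONLY|O_CREAT|O_EXCL, 0666)) < 0) {
    if (errno != EEXIST) return -1;     /* e.g. EACCES on /usr/spool/mail */
                                /* break locks older than 5 minutes on the
                                 * presumption that they are stale */
    if (!stat (lock, &sb) && (time (0) - sb.st_mtime > 300)) unlink (lock);
    else sleep (1);             /* otherwise wait and retry */
  }
  close (fd);                   /* the file now exists; we hold the lock */
  return 0;                     /* caller unlink()s the file to release */
}
```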

     (2) c-client applies a shared flock() to the mail file whenever
it reads from the mail file, and an exclusive flock() whenever it
writes to the mail file.  This lock is freed as soon as the read or
write finishes.  The purpose of this lock is to protect against
unfavorable interactions with mail delivery.

     (3) c-client applies an exclusive flock() to a file on /tmp
(whose name represents the device and inode number of the mail file)
when it opens the mail file.  This lock is maintained throughout the
session, although c-client has a feature (called ``kiss of death'')
which permits c-client to forcibly and irreversibly seize the lock
from a cooperating c-client application that surrenders the lock on
demand.  The purpose of this lock is to protect against unfavorable
interactions with other instances of c-client (rewriting of the mail
file).

     Mail delivery daemons use lock (1), (2), or both.  Lock (1) works
over NFS; lock (2) is the only one that works on sites that protect
/usr/spool/mail against unprivileged file creation.  Prudent mail
delivery daemons use both forms of locking, and of course so does
c-client.

     If only lock (2) is used, then multiple processes can read from
the mail file simultaneously, although in real life this doesn't
really change things.  The normal state of locks (1) and (2) is
unlocked except for very brief periods.


TENEX AND MTX

     The design of the locking mechanism of these formats was
motivated by a desire to enable multiple simultaneous read/write
access.  It is almost the reverse of how locking works with
bezerk/mmdf.

     (1) c-client applies a shared flock() to the mail file when it
opens the mail file.  It upgrades this lock to exclusive whenever it
tries to expunge the mail file.  Because of the flock() bug that
upgrading a lock actually releases it, it will not do so until it has
acquired an exclusive lock (2) first.  The purpose of this lock is to
prevent an expunge from taking place while some other c-client has the
mail file open (and thus knows where all the messages are).

     (2) c-client applies a shared flock() to a file on /tmp (whose
name represents the device and inode number of the mail file) when it
parses the mail file.  It applies an exclusive flock() to this file
when it appends new mail to the mail file, as well as before it
attempts to upgrade lock (1) to exclusive.  The purpose of this lock
is to prevent data from being appended while some other c-client is
parsing mail in the file (to prevent reading of incomplete messages).
It also protects against the lock-releasing timing race on lock (1).
\f
OBSERVATIONS

     In a perfect world, locking works.  You are protected against
unfavorable interactions with the mailer, and against your own mistake
of running more than one instance of your mail reader.  In tenex/mtx
formats, you have the additional benefit that multiple simultaneous
read/write access works, with the sole restriction being that you
can't expunge if there are any other sharers of the mail file.

     If the mail file is NFS-mounted, then flock() locking is a silent
no-op.  This is the way BSD implements flock(), and c-client's
emulation of flock() through fcntl() tests for NFS files and
duplicates this functionality.  There is no locking protection for
tenex/mtx mail files at all, and only protection against the mailer
for bezerk/mmdf mail files.  This has been the accepted state of
affairs on UNIX for many sad years.

     If you can not create .lock files, it should not affect locking,
since the flock() locks suffice for all protection.  This is, however,
not true if the mailer does not check for flock() locking, or if the
mail file is NFS-mounted.

     What this means is that there is *no* locking protection at all
in the case of a client using an NFS-mounted /usr/spool/mail that does
not permit file creation by unprivileged programs.  It is impossible,
under these circumstances, for an unprivileged program to do anything
about it.  Worse, if EACCES errors on .lock file creation are
no-op'ed, the user won't even know about it.  This is arguably a site
configuration error.

     The problem with not being able to create .lock files exists on
System V as well, but the failure modes for flock() -- which is
implemented via fcntl() -- are different.

     On System V, if the mail file is NFS-mounted and either the
client or the server lacks a functioning statd/lockd pair, then the
lock attempt would hang forever if it weren't for the fact that
c-client tests for NFS and no-ops the flock() emulator in this case.
Systemwide or clusterwide failures of statd/lockd have been known to
occur which cause all locks in all processes to hang (including,
possibly, local ones).  Without the special NFS test made by c-client,
there would be no way to request BSD-style no-op behavior, nor is
there any way to determine that this is happening other than by the
system being hung.

     The additional locking introduced by c-client was shown to cause
much more stress on the System V locking mechanism than has
traditionally been placed upon it.  If it was stressed too far, all
hell broke loose.  Fortunately, this is now past history.
\f
TRADEOFFS

     c-client based applications have a reasonable chance of winning
as long as you don't use NFS for remote access to mail files.  That's
what IMAP is for, after all.  It is, however, very important to
realize that you can *not* use the lock-upgrade feature by itself
because it releases the lock as an interim step -- you need to have
lock-upgrading guarded by another lock.
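
In C terms, the guard pattern looks like this (a sketch with
hypothetical names; it assumes every process that might take the
exclusive lock on the mail file also honors the guard file, as the
tenex/mtx lock (2) protocol does):

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

/* Upgrade a shared flock() on mailfd to exclusive, serialized through
 * an exclusive flock() on guardfd.  Returns 0 on success. */
int guarded_upgrade (int mailfd, int guardfd)
{
                                /* serialize upgraders via the guard file */
  if (flock (guardfd, LOCK_EX) < 0) return -1;
                                /* the flock() upgrade still drops the shared
                                 * lock for a moment, but no competing
                                 * upgrader can slip in while we hold the
                                 * guard */
  if (flock (mailfd, LOCK_EX) < 0) {
    flock (guardfd, LOCK_UN);
    return -1;
  }
  return 0;                     /* caller releases both when done */
}
```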

     If you have the misfortune of using System V, you are likely to
run into problems sooner or later having to do with statd/lockd.  You
basically end up with one of five unsatisfactory choices:
 1) Grit your teeth and live with it.
 2) Try to make it work:
     a) avoid NFS access so as not to stress statd/lockd.
     b) try to understand the code in statd/lockd and hack it
        to be more robust.
     c) hunt out the system limit on locks, if there is one,
        and increase it.  Figure on at least two locks per
        simultaneous imapd process and four locks per Pine
        process.  Better yet, make the limit be 10 times the
        maximum number of processes.
     d) increase the socket buffer (-S switch to lockd) if
        it is offered.  I don't know what this actually does,
        but giving lockd more resources to do its work can't
        hurt.  Maybe.
 3) Decide that it can't possibly work, and turn off the
    fcntl() calls in your program.
 4) If nuking statd/lockd can be done without breaking local
    locking, then do so.  This would make SVR4 have the same
    limitations as BSD locking, with a couple of additional
    bugs.
 5) Check for NFS, and don't do the fcntl() in the NFS case.
    This is what c-client does.

     Note that if you are going to use NFS to access files on a server
which does not have statd/lockd running, your only choice is (3), (4),
or (5).  Here again, IMAP can bail you out.

     These problems aren't unique to c-client applications; they have
also been reported with Elm, Mediamail, and other email tools.

     Of the other two SVR4 locking bugs:

     Programmer awareness is necessary to deal with the bug that you
can not get an exclusive lock unless the file is open for write.  I
believe that c-client has fixed all of these cases.

     The problem of opening a second designator smashing any current
locks on the file has not been addressed satisfactorily yet.  This is
not an easy problem to deal with, especially in c-client, which really
doesn't know what other files/streams may be open by Pine.

     Aren't you so happy that you bought a System V system?