        UNIX Advisory File Locking Implications on c-client
                  Mark Crispin, 28 November 1995


        THIS DOCUMENT HAS BEEN UPDATED TO REFLECT THE CODE IN THE
        IMAP-4 TOOLKIT AS OF NOVEMBER 28, 1995. SOME STATEMENTS
        IN THIS DOCUMENT DO NOT APPLY TO EARLIER VERSIONS OF THE
        IMAP TOOLKIT.

INTRODUCTION

     Advisory locking is a mechanism by which cooperating processes
can signal to each other their usage of a resource and whether or not
that usage is critical. It is not a mechanism to protect against
processes which do not cooperate in the locking.

     The most basic form of locking involves a counter. This counter
is -1 when the resource is available. If a process wants the lock, it
executes an atomic increment-and-test-if-zero. If the value is zero,
the process has the lock and can execute the critical code that needs
exclusive usage of a resource. When it is finished, it sets the lock
back to -1. In C terms:

          while (++lock)                /* try to get lock */
            invoke_other_threads ();    /* failed, try again */
           .
           .                            /* critical code here */
           .
          lock = -1;                    /* release lock */

     This particular form of locking appears most commonly in
multi-threaded applications such as operating system kernels. It
makes several presumptions:
 (1) it is all right to keep testing the lock (no overflow)
 (2) the critical resource is single-access only
 (3) there is shared writeable memory between the two threads
 (4) the threads can be trusted to release the lock when finished
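
     A plain ++lock is not an atomic operation in C, so the fragment
above is best read as pseudocode. The following is a minimal sketch of
the same counter lock using C11 atomics. The names lock_counter,
try_lock, acquire_lock and release_lock are invented for illustration,
and sched_yield() merely stands in for invoke_other_threads().

```c
#include <stdatomic.h>
#include <sched.h>

/* -1 means "available", as in the description above. */
static atomic_int lock_counter = -1;

/* One attempt at the atomic increment-and-test-if-zero.
 * atomic_fetch_add() returns the OLD value, so an old value of -1
 * means our increment produced zero and we now own the lock. */
static int try_lock(void)
{
    return atomic_fetch_add(&lock_counter, 1) == -1;
}

/* Keep testing the lock, yielding the CPU between attempts.
 * This relies on presumption (1): the counter must not overflow. */
static void acquire_lock(void)
{
    while (!try_lock())
        sched_yield();
}

/* Release the lock by resetting the counter to "available". */
static void release_lock(void)
{
    atomic_store(&lock_counter, -1);
}
```

     A caller would bracket its critical code with acquire_lock() and
release_lock(). Presumptions (2) through (4) apply unchanged.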

     In applications programming on multi-user systems, most commonly
the other threads are in an entirely different process, which may even
be logged in as a different user. Few operating systems offer shared
writeable memory between such processes.

     A means of communicating this is by use of a file with a mutually
agreed upon name. A binary semaphore can be passed by means of the
existence or non-existence of that file, provided that there is an
atomic means to create a file if and only if that file does not exist.
In C terms:

                                        /* try to get lock */
          while ((fd = open ("lockfile",O_WRONLY|O_CREAT|O_EXCL,0666)) < 0)
            sleep (1);                  /* failed, try again */
          close (fd);                   /* got the lock */
           .
           .                            /* critical code here */
           .
          unlink ("lockfile");          /* release lock */

     This form of locking makes fewer presumptions, but it still is
guilty of presumptions (2) and (4) above. Presumption (2) limits the
ability to have processes sharing a resource in a non-conflicting
fashion (e.g. reading from a file). Presumption (4) leads to
deadlocks should the process crash while it has a resource locked.

     Most modern operating systems provide a resource locking system
call that has none of these presumptions. In particular, a mechanism
is provided for identifying shared locks as opposed to exclusive
locks. A shared lock permits other processes to obtain a shared lock,
but denies exclusive locks. In other words:

        current state           want shared     want exclusive
        -------------           -----------     --------------
        unlocked                YES             YES
        locked shared           YES             NO
        locked exclusive        NO              NO

     Furthermore, the operating system automatically relinquishes all
locks held by that process when it terminates.
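
     The table can be checked by experiment. The sketch below uses the
BSD flock() call described later in this document. The helper name
would_grant is invented for illustration, and the sketch assumes a
system where each open() of a file gets its own independent flock()
lock (true of BSD and Linux).

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Try to take a lock of type `want` (LOCK_SH or LOCK_EX) on `path`
 * through a fresh descriptor, without blocking.
 * Returns 1 if the lock would be granted, 0 if it is refused because
 * of a conflicting lock, and -1 on any other error. */
static int would_grant(const char *path, int want)
{
    int fd = open(path, O_RDONLY);
    int ok;

    if (fd < 0)
        return -1;
    if (flock(fd, want | LOCK_NB) == 0)
        ok = 1;
    else
        ok = (errno == EWOULDBLOCK) ? 0 : -1;
    close(fd);                  /* closing drops any lock just taken */
    return ok;
}
```

     With a shared lock held on another descriptor for the same file,
would_grant(path, LOCK_SH) should report 1 and would_grant(path,
LOCK_EX) should report 0, matching the "locked shared" row.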

     A useful operation is the ability to upgrade a shared lock to
exclusive (provided there are no other shared users of the lock) and
to downgrade an exclusive lock to shared. It is important that at no
time is the lock ever removed; a process upgrading to exclusive must
not relinquish its shared lock.

     Most commonly, the resources being locked are files. Shared
locks are particularly important with files; multiple simultaneous
processes can read from a file, but only one can safely write at a
time. Some writes may be safer than others; an append to the end of
the file is safer than changing existing file data. In turn, changing
a file record in place is safer than rewriting the file with an
entirely different structure.


FILE LOCKING ON UNIX

     In the oldest versions of UNIX, the use of a semaphore lockfile
was the only available form of locking. Advisory locking system calls
were not added to UNIX until after the BSD vs. System V split. Both
of these system calls deal with file resources only.

     Most systems only have one or the other form of locking. AIX
emulates the BSD form of locking as a jacket into the System V form.
Ultrix and OSF/1 implement both forms.


BSD

     BSD added the flock() system call. It offers capabilities to
acquire shared lock, acquire exclusive lock, and unlock. Optionally,
the process can request an immediate error return instead of blocking
when the lock is unavailable.


FLOCK() BUGS

     flock() advertises that it permits upgrading of shared locks to
exclusive and downgrading of exclusive locks to shared, but it does so
by releasing the former lock and then trying to acquire the new lock.
This creates a window of vulnerability in which another process can
grab the exclusive lock. Therefore, this capability is not useful,
although many programmers have been deluded by incautious reading of
the flock() man page to believe otherwise. This problem can be
programmed around, once the programmer is aware of it.

     flock() always returns as if it succeeded on NFS files, when in
fact it is a no-op. There is no way around this.

     Leaving aside these two problems, flock() works remarkably well,
and has shown itself to be robust and trustworthy.


SYSTEM V/POSIX

     System V added new functions to the fcntl() system call, and a
simple interface through the lockf() subroutine. This was
subsequently included in POSIX. Both offer the facility to apply the
lock to a particular region of the file instead of to the entire file.
lockf() only supports exclusive locks, and calls fcntl() internally;
hence it won't be discussed further.

     Functionally, fcntl() locking is a superset of flock(); it is
possible to implement a flock() emulator using fcntl(), with one minor
exception: it is not possible to acquire an exclusive lock if the file
is not open for write.

     The fcntl() locking functions are: query lock status of a file
region, lock/unlock a region, and lock/unlock a region and block until
the lock is obtained. The locks may be shared or exclusive. By means
of the statd and lockd daemons, fcntl() locking is available on NFS
files.
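
     As a rough illustration of these functions, the sketch below
locks a whole file with fcntl(), which is the core of what a flock()
emulator has to do. The helper name whole_file_lock is invented for
this example; it is not c-client's flock.c.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Lock or unlock the whole file with fcntl().
 * type is F_RDLCK (shared), F_WRLCK (exclusive) or F_UNLCK.
 * If block is nonzero, wait for the lock (F_SETLKW); otherwise fail
 * at once if it is unavailable (F_SETLK).
 * Returns 0 on success, -1 with errno set on failure. */
static int whole_file_lock(int fd, short type, int block)
{
    struct flock fl = {
        .l_type = type,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0              /* zero length means "to end of file" */
    };

    return fcntl(fd, block ? F_SETLKW : F_SETLK, &fl);
}
```

     The exception noted above shows up directly: calling
whole_file_lock(fd, F_WRLCK, 0) on a descriptor opened O_RDONLY fails
with EBADF, while F_RDLCK on the same descriptor succeeds.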

     When statd is started at system boot, it reads its /etc/state
file (which contains the number of times it has been invoked) and
/etc/sm directory (which contains a list of all remote sites which are
client or server locking with this site), and notifies the statd on
each of these systems that it has been restarted. Each statd then
notifies the local lockd of the restart of that system.

     lockd receives fcntl() requests for NFS files. It communicates
with the lockd at the server and requests it to apply the lock, and
with the statd to request it for notification when the server goes
down. It blocks until all these requests are completed.

     There is quite a mythos about fcntl() locking.

     One religion holds that fcntl() locking is the best thing since
sliced bread, and that programs which use flock() should be converted
to fcntl() so that NFS locking will work. However, as noted above,
very few systems support both calls, so such an exercise is pointless
except on Ultrix and OSF/1.

     Another religion, which I adhere to, has the opposite viewpoint.


FCNTL() BUGS

     For all of the hairy code to do individual section locking of a
file, it's clear that the designers of fcntl() locking never
considered some very basic locking operations. It's as if all they
knew about locking they got out of some CS textbook with no
investigation of real-world needs.

     It is not possible to acquire an exclusive lock unless the file
is open for write. You could have append with shared read, and thus
you could have a case in which a read-only access may need to go
exclusive. This problem can be programmed around once the programmer
is aware of it.

     If the file is opened on another file designator in the same
process, the file is unlocked even if no attempt is made to do any
form of locking on the second designator. This is a very bad bug. It
means that an application must keep track of all the files that it has
opened and locked.
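
     The second-designator problem can be demonstrated with a short
experiment. POSIX documents the lock as being released when any
descriptor for the file is closed by the process, so the sketch below
opens and then closes a second descriptor. The function names are
invented for illustration. A child process is used for the check
because F_GETLK does not report the caller's own locks.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ask, from a child process, whether another process (here the parent)
 * holds a lock on the file.  Returns 1 if locked, 0 if not, -1 on error. */
static int locked_by_parent(const char *path)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {             /* child: probe for a conflicting lock */
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open(path, O_RDWR);

        if (fd < 0 || fcntl(fd, F_GETLK, &fl) < 0)
            _exit(2);
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}

/* Returns 1 if a lock on `path` survived an unrelated open()/close()
 * of the same file, 0 if it was lost, -1 on error. */
static int lock_survives_second_open(const char *path)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int fd1 = open(path, O_RDWR | O_CREAT, 0600);
    int fd2, result;

    if (fd1 < 0)
        return -1;
    if (fcntl(fd1, F_SETLK, &fl) < 0 || locked_by_parent(path) != 1) {
        close(fd1);
        return -1;
    }
    fd2 = open(path, O_RDONLY); /* no locking call is made on fd2 */
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }
    close(fd2);                 /* yet this releases the lock on fd1 */

    result = locked_by_parent(path);
    close(fd1);
    return result;
}
```

     On a system with POSIX fcntl() semantics,
lock_survives_second_open() is expected to return 0.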

     If there is no statd/lockd on the NFS server, fcntl() will hang
forever waiting for them to appear. This is a bad bug. It means that
any attempt to lock on a server that doesn't run these daemons will
hang. There is no way for an application to request flock() style
``try to lock, but no-op if the mechanism ain't there''.

     There is a rumor to the effect that fcntl() will hang forever on
local files too if there is no local statd/lockd. These daemons are
running on mailer.u, although they appear not to have much CPU time.
A useful experiment would be to kill them and see if imapd is affected
in any way, but I decline to do so without an OK from UCS! ;-) If
killing statd/lockd can be done without breaking fcntl() on local
files, this would become one of the primary means of dealing with this
problem.

     The statd and lockd daemons have quite a reputation for extreme
fragility. There have been numerous reports about the locking
mechanism being wedged on a systemwide or even clusterwide basis,
requiring a reboot to clear. It is rumored that this wedge, once it
happens, also blocks local locking. Presumably killing and restarting
statd would suffice to clear the wedge, but I haven't verified this.

     There appears to be a limit to how many locks may be in use at a
time on the system, although the documentation only mentions it in
passing. On some of their systems, UCS has increased lockd's ``size
of the socket buffer'', whatever that means.


C-CLIENT USAGE

     c-client uses flock(). On System V systems, flock() is simulated
by an emulator that calls fcntl(). This emulator is either provided
by the system (e.g. AIX) or supplied by c-client's flock.c module.


BEZERK AND MMDF

     Locking in the traditional UNIX formats was largely dictated by
the status quo in other applications; however, additional protection
is added against inadvertently running multiple instances of a
c-client application on the same mail file.

     (1) c-client attempts to create a .lock file (mail file name with
``.lock'' appended) whenever it reads from, or writes to, the mail
file. This is an exclusive lock, and is held only for short periods
of time while c-client is actually doing the I/O. There is a 5-minute
timeout for this lock, after which it is broken on the presumption
that it is a stale lock. If it can not create the .lock file due to
an EACCES (protection failure) error, it once silently proceeded
without this lock; this was for systems which protect /usr/spool/mail
from unprivileged processes creating files. Today, c-client reports
an error unless it is built otherwise. The purpose of this lock is to
protect against unfavorable interactions with mail delivery.

     (2) c-client applies a shared flock() to the mail file whenever
it reads from the mail file, and an exclusive flock() whenever it
writes to the mail file. This lock is freed as soon as it finishes
reading. The purpose of this lock is to protect against unfavorable
interactions with mail delivery.

     (3) c-client applies an exclusive flock() to a file on /tmp
(whose name represents the device and inode number of the file) when
it opens the mail file. This lock is maintained throughout the
session, although c-client has a feature (called ``kiss of death'')
which permits c-client to forcibly and irreversibly seize the lock
from a cooperating c-client application that surrenders the lock on
demand. The purpose of this lock is to protect against unfavorable
interactions with other instances of c-client (rewriting the mail
file).

     Mail delivery daemons use lock (1), (2), or both. Lock (1) works
over NFS; lock (2) is the only one that works on sites that protect
/usr/spool/mail against unprivileged file creation. Prudent mail
delivery daemons use both forms of locking, and of course so does
c-client.

     If only lock (2) is used, then multiple processes can read from
the mail file simultaneously, although in real life this doesn't
really change things. The normal state of locks (1) and (2) is
unlocked except for very brief periods.


TENEX AND MTX

     The design of the locking mechanism of these formats was
motivated by a desire to enable multiple simultaneous read/write
access. It is almost the reverse of how locking works with
bezerk/mmdf.

     (1) c-client applies a shared flock() to the mail file when it
opens the mail file. It upgrades this lock to exclusive whenever it
tries to expunge the mail file. Because of the flock() bug that
upgrading a lock actually releases it, it will not do so until it has
acquired an exclusive lock (2) first. The purpose of this lock is to
protect against expunge taking place while some other c-client has the
mail file open (and thus knows where all the messages are).

     (2) c-client applies a shared flock() to a file on /tmp (whose
name represents the device and inode number of the file) when it
parses the mail file. It applies an exclusive flock() to this file
when it appends new mail to the mail file, as well as before it
attempts to upgrade lock (1) to exclusive. The purpose of this lock
is to protect against data being appended while some other c-client is
parsing mail in the file (to prevent reading of incomplete messages).
It also protects against the lock-releasing timing race on lock (1).
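
     The interplay of locks (1) and (2) during an upgrade might look
roughly like the sketch below. It is not c-client's code: the names
guard_name and guarded_upgrade are invented, and the "/tmp/.%lx.%lx"
file name pattern is only an assumed way of encoding the device and
inode numbers.

```c
#include <stdio.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

/* Build the /tmp guard file name from the mail file's device and
 * inode numbers.  The name pattern is illustrative only. */
static int guard_name(const char *mailfile, char *buf, size_t len)
{
    struct stat sb;

    if (stat(mailfile, &sb) < 0)
        return -1;
    snprintf(buf, len, "/tmp/.%lx.%lx",
             (unsigned long) sb.st_dev, (unsigned long) sb.st_ino);
    return 0;
}

/* Upgrade the shared flock() on mail_fd to exclusive, taking the guard
 * lock exclusively first so that no cooperating process can slip in
 * while flock() releases and reacquires the mail file lock.
 * On success returns the guard descriptor, still locked; the caller
 * closes it when the exclusive work is done.  On failure the shared
 * lock is reacquired and -1 is returned. */
static int guarded_upgrade(int mail_fd, const char *guard)
{
    int gfd = open(guard, O_RDWR | O_CREAT, 0666);

    if (gfd < 0)
        return -1;
    if (flock(gfd, LOCK_EX) < 0) {              /* lock (2) first */
        close(gfd);
        return -1;
    }
    if (flock(mail_fd, LOCK_EX | LOCK_NB) < 0) { /* then lock (1) */
        flock(mail_fd, LOCK_SH);    /* other sharers exist; stay shared */
        close(gfd);                 /* closing releases the guard lock */
        return -1;
    }
    return gfd;
}
```

     The upgrade attempt is non-blocking because an expunge is simply
refused while other sharers have the mail file open.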


OBSERVATIONS

     In a perfect world, locking works. You are protected against
unfavorable interactions with the mailer and against your own mistake
by running more than one instance of your mail reader. In tenex/mtx
formats, you have the additional benefit that multiple simultaneous
read/write access works, with the sole restriction being that you
can't expunge if there are any sharers of the mail file.

     If the mail file is NFS-mounted, then flock() locking is a silent
no-op. This is the way BSD implements flock(), and c-client's
emulation of flock() through fcntl() tests for NFS files and
duplicates this functionality. There is no locking protection for
tenex/mtx mail files at all, and only protection against the mailer
for bezerk/mmdf mail files. This has been the accepted state of
affairs on UNIX for many sad years.

     If you can not create .lock files, it should not affect locking,
since the flock() locks suffice for all protection. This is, however,
not true if the mailer does not check for flock() locking, or if the
mail file is NFS-mounted.

     What this means is that there is *no* locking protection at all
in the case of a client using an NFS-mounted /usr/spool/mail that does
not permit file creation by unprivileged programs. It is impossible,
under these circumstances, for an unprivileged program to do anything
about it. Worse, if EACCES errors on .lock file creation are
no-op'ed, the user won't even know about it. This is arguably a site
configuration error.

     The problem with not being able to create .lock files exists on
System V as well, but the failure modes for flock() -- which is
implemented via fcntl() -- are different.

     On System V, if the mail file is NFS-mounted and either the
client or the server lacks a functioning statd/lockd pair, then the
lock attempt would have hung forever if it weren't for the fact that
c-client tests for NFS and no-ops the flock() emulator in this case.
Systemwide or clusterwide failures of statd/lockd have been known to
occur which cause all locks in all processes to hang (including
local?). Without the special NFS test made by c-client, there would
be no way to request BSD-style no-op behavior, nor is there any way to
determine that this is happening other than the system being hung.

     The additional locking introduced by c-client was shown to cause
much more stress on the System V locking mechanism than has
traditionally been placed upon it. If it was stressed too far, all
hell broke loose. Fortunately, this is now past history.


TRADEOFFS

     c-client based applications have a reasonable chance of winning
as long as you don't use NFS for remote access to mail files. That's
what IMAP is for, after all. It is, however, very important to
realize that you can *not* use the lock-upgrade feature by itself
because it releases the lock as an interim step -- you need to have
lock-upgrading guarded by another lock.

     If you have the misfortune of using System V, you are likely to
run into problems sooner or later having to do with statd/lockd. You
basically end up with one of five unsatisfactory choices:
     1) Grit your teeth and live with it.
     2) Try to make it work:
         a) avoid NFS access so as not to stress statd/lockd.
         b) try to understand the code in statd/lockd and hack it
            to be more robust.
         c) hunt out the system limit of locks, if there is one,
            and increase it. Figure on at least two locks per
            simultaneous imapd process and four locks per Pine
            process. Better yet, make the limit be 10 times the
            maximum number of processes.
         d) increase the socket buffer (-S switch to lockd) if
            it is offered. I don't know what this actually does,
            but giving lockd more resources to do its work can't
            hurt. Maybe.
     3) Decide that it can't possibly work, and turn off the
        fcntl() calls in your program.
     4) If nuking statd/lockd can be done without breaking local
        locking, then do so. This would make SVR4 have the same
        limitations as BSD locking, with a couple of additional
        bugs.
     5) Check for NFS, and don't do the fcntl() in the NFS case.
        This is what c-client does.
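
     How to check for NFS is system-specific, and this document does
not say how c-client does it. As one hedged example of choice (5),
the sketch below uses the Linux statfs() call and compares the
filesystem type against NFS_SUPER_MAGIC. Both are Linux-only
assumptions, and the function name is_nfs is invented.

```c
#include <sys/vfs.h>            /* statfs(): Linux-specific */
#include <linux/magic.h>        /* NFS_SUPER_MAGIC: Linux-specific */

/* Returns 1 if path is on an NFS filesystem, 0 if it is not, and
 * -1 if statfs() fails (for example, if the path does not exist). */
static int is_nfs(const char *path)
{
    struct statfs sfs;

    if (statfs(path, &sfs) < 0)
        return -1;
    return sfs.f_type == NFS_SUPER_MAGIC;
}
```

     A flock() emulator built on fcntl() would call such a test first
and return success without locking when it reports 1, duplicating the
BSD no-op behavior.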

     Note that if you are going to use NFS to access files on a server
which does not have statd/lockd running, your only choices are (3),
(4), or (5). Here again, IMAP can bail you out.

     These problems aren't unique to c-client applications; they have
also been reported with Elm, Mediamail, and other email tools.

     Of the other two SVR4 locking bugs:

     Programmer awareness is necessary to deal with the bug that you
can not get an exclusive lock unless the file is open for write. I
believe that c-client has fixed all of these cases.

     The problem about opening a second designator smashing any
current locks on the file has not been addressed satisfactorily yet.
This is not an easy problem to deal with, especially in c-client which
really doesn't know what other files/streams may be open by Pine.

     Aren't you so happy that you bought a System V system?