	   UNIX Advisory File Locking Implications on c-client
		    Mark Crispin, 28 November 1995


	THIS DOCUMENT HAS BEEN UPDATED TO REFLECT THE CODE IN THE
	IMAP-4 TOOLKIT AS OF NOVEMBER 28, 1995.  SOME STATEMENTS
	IN THIS DOCUMENT DO NOT APPLY TO EARLIER VERSIONS OF THE
	IMAP TOOLKIT.

INTRODUCTION

     Advisory locking is a mechanism by which cooperating processes
can signal to each other their usage of a resource and whether or not
that usage is critical.  It is not a mechanism to protect against
processes which do not cooperate in the locking.

     The most basic form of locking involves a counter.  This counter
is -1 when the resource is available.  If a process wants the lock, it
executes an atomic increment-and-test-if-zero.  If the value is zero,
the process has the lock and can execute the critical code that needs
exclusive usage of a resource.  When it is finished, it sets the lock
back to -1.  In C terms:

	while (++lock)			/* try to get lock */
	  invoke_other_threads ();	/* failed, try again */
	 .
	 .				/* critical code here */
	 .
	lock = -1;			/* release lock */

     This particular form of locking appears most commonly in
multi-threaded applications such as operating system kernels.  It
makes several presumptions:
 (1) it is alright to keep testing the lock (no overflow)
 (2) the critical resource is single-access only
 (3) there is shared writeable memory between the two threads
 (4) the threads can be trusted to release the lock when finished

     In applications programming on multi-user systems, most commonly
the other threads are in an entirely different process, which may even
be logged in as a different user.  Few operating systems offer shared
writeable memory between such processes.

     A means of communicating this is by use of a file with a mutually
agreed upon name.  A binary semaphore can be passed by means of the
existence or non-existence of that file, provided that there is an
atomic means to create a file if and only if that file does not exist.
In C terms:

				/* try to get lock */
	while ((fd = open ("lockfile",O_WRONLY|O_CREAT|O_EXCL,0666)) < 0)
	  sleep (1);		/* failed, try again */
	close (fd);		/* got the lock */
	 .
	 .			/* critical code here */
	 .
	unlink ("lockfile");	/* release lock */

     This form of locking makes fewer presumptions, but it still is
guilty of presumptions (2) and (4) above.  Presumption (2) limits the
ability to have processes sharing a resource in a non-conflicting
fashion (e.g. reading from a file).  Presumption (4) leads to
deadlocks should the process crash while it has a resource locked.

     Most modern operating systems provide a resource locking system
call that has none of these presumptions.  In particular, a mechanism
is provided for identifying shared locks as opposed to exclusive
locks.  A shared lock permits other processes to obtain a shared lock,
but denies exclusive locks.  In other words:

	current state		want shared	want exclusive
	-------------		-----------	--------------
	unlocked		    YES		     YES
	locked shared		    YES		     NO
	locked exclusive	    NO		     NO

     Furthermore, the operating system automatically relinquishes all
locks held by that process when it terminates.

     A useful operation is the ability to upgrade a shared lock to
exclusive (provided there are no other shared users of the lock) and
to downgrade an exclusive lock to shared.  It is important that at no
time is the lock ever removed; a process upgrading to exclusive must
not relinquish its shared lock.

     Most commonly, the resources being locked are files.  Shared
locks are particularly important with files; multiple simultaneous
processes can read from a file, but only one can safely write at a
time.  Some writes may be safer than others; an append to the end of
the file is safer than changing existing file data.  In turn, changing
a file record in place is safer than rewriting the file with an
entirely different structure.


FILE LOCKING ON UNIX

     In the oldest versions of UNIX, the use of a semaphore lockfile
was the only available form of locking.  Advisory locking system calls
were not added to UNIX until after the BSD vs. System V split.  Both
of these system calls deal with file resources only.

     Most systems only have one or the other form of locking.  AIX
emulates the BSD form of locking as a jacket into the System V form.
Ultrix and OSF/1 implement both forms.
\f
BSD

     BSD added the flock() system call.  It offers capabilities to
acquire a shared lock, acquire an exclusive lock, and unlock.
Optionally, the process can request an immediate error return instead
of blocking when the lock is unavailable.
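
     As a minimal sketch of that interface (the helper name here is
invented for illustration, not part of c-client):

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper around flock(): acquire a shared or exclusive
 * lock on an open file, optionally without blocking.  Returns 0 on
 * success; with LOCK_NB, -1 means a conflicting lock is held. */
int acquire_lock (int fd, int exclusive, int nonblocking)
{
  int op = exclusive ? LOCK_EX : LOCK_SH;
  if (nonblocking) op |= LOCK_NB;
  return flock (fd, op);
}

/* release with: flock (fd, LOCK_UN); */
```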


FLOCK() BUGS

     flock() advertises that it permits upgrading of shared locks to
exclusive and downgrading of exclusive locks to shared, but it does so
by releasing the former lock and then trying to acquire the new lock.
This creates a window of vulnerability in which another process can
grab the exclusive lock.  Therefore, this capability is not useful,
although many programmers have been deluded by incautious reading of
the flock() man page to believe otherwise.  This problem can be
programmed around, once the programmer is aware of it.
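
     One way to program around it (a sketch, not c-client's actual
code) is to guard every upgrade with a second, mutually agreed-upon
lock, so that no competing writer can slip into the window:

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical guarded upgrade: every process that wants the main
 * lock exclusively agrees to take an exclusive lock on guard_fd
 * first.  The flock() upgrade still releases the shared lock for a
 * moment, but the guard keeps competitors out of that window. */
int upgrade_with_guard (int fd, int guard_fd)
{
  if (flock (guard_fd, LOCK_EX) < 0) return -1;	/* serialize upgraders */
  if (flock (fd, LOCK_EX) < 0) {		/* implicit release+reacquire */
    flock (guard_fd, LOCK_UN);
    return -1;
  }
  flock (guard_fd, LOCK_UN);			/* main lock now exclusive */
  return 0;
}
```

     This is the same pattern the tenex/mtx section below relies on:
the upgrade of the mail-file lock is protected by the /tmp lock.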

     flock() always returns as if it succeeded on NFS files, when in
fact it is a no-op.  There is no way around this.

     Leaving aside these two problems, flock() works remarkably well,
and has shown itself to be robust and trustworthy.
\f
SYSTEM V/POSIX

     System V added new functions to the fcntl() system call, and a
simple interface through the lockf() subroutine.  This was
subsequently included in POSIX.  Both offer the facility to apply the
lock to a particular region of the file instead of to the entire file.
lockf() only supports exclusive locks, and calls fcntl() internally;
hence it won't be discussed further.

     Functionally, fcntl() locking is a superset of flock(); it is
possible to implement a flock() emulator using fcntl(), with one minor
exception: it is not possible to acquire an exclusive lock if the file
is not open for write.
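
     Such an emulator might look like the following sketch (an
illustration in the spirit of c-client's flock.c module, not its
actual code); it maps whole-file flock() operations onto fcntl()
record locks:

```c
#include <fcntl.h>
#include <unistd.h>

#ifndef LOCK_SH			/* BSD flock() operation codes */
#define LOCK_SH 1
#define LOCK_EX 2
#define LOCK_NB 4
#define LOCK_UN 8
#endif

/* Hypothetical flock() emulation over fcntl(): lock the entire file
 * by using a zero-length region starting at offset 0.  Note the
 * caveat above: LOCK_EX needs the file open for write. */
int emulate_flock (int fd, int operation)
{
  struct flock fl;
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;
  fl.l_len = 0;				/* zero length = whole file */
  if (operation & LOCK_UN) fl.l_type = F_UNLCK;
  else if (operation & LOCK_EX) fl.l_type = F_WRLCK;
  else fl.l_type = F_RDLCK;
  return fcntl (fd, (operation & LOCK_NB) ? F_SETLK : F_SETLKW, &fl);
}
```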

     The fcntl() locking functions are: query the lock status of a
file region, lock/unlock a region, and lock/unlock a region and block
until the lock is granted.  The locks may be shared or exclusive.  By
means of the statd and lockd daemons, fcntl() locking is available on
NFS files.
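
     Region locking, which flock() cannot do, works roughly like this
sketch (the helper name is invented for illustration):

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical region lock: lock len bytes of fd starting at start,
 * blocking until the lock is granted (F_SETLKW).  type is F_RDLCK
 * for a shared lock, F_WRLCK for exclusive, F_UNLCK to release. */
int lock_region (int fd, short type, off_t start, off_t len)
{
  struct flock fl;
  fl.l_type = type;
  fl.l_whence = SEEK_SET;
  fl.l_start = start;
  fl.l_len = len;
  return fcntl (fd, F_SETLKW, &fl);	/* F_SETLK fails instead of blocking */
}
```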

     When statd is started at system boot, it reads its /etc/state
file (which contains the number of times it has been invoked) and the
/etc/sm directory (which contains a list of all remote sites which are
client or server locking with this site), and notifies the statd on
each of these systems that it has been restarted.  Each statd then
notifies the local lockd of the restart of that system.

     lockd receives fcntl() requests for NFS files.  It communicates
with the lockd at the server and requests it to apply the lock, and
with the statd to request notification when the server goes down.  It
blocks until all these requests are completed.

     There is quite a mythos about fcntl() locking.

     One religion holds that fcntl() locking is the best thing since
sliced bread, and that programs which use flock() should be converted
to fcntl() so that NFS locking will work.  However, as noted above,
very few systems support both calls, so such an exercise is pointless
except on Ultrix and OSF/1.

     Another religion, which I adhere to, has the opposite viewpoint.


FCNTL() BUGS

     For all of the hairy code to do individual section locking of a
file, it's clear that the designers of fcntl() locking never
considered some very basic locking operations.  It's as if everything
they knew about locking came out of some CS textbook, with no
investigation of real-world needs.

     It is not possible to acquire an exclusive lock unless the file
is open for write.  But appending can coexist with shared reading, so
a read-only access may well need to go exclusive.  This problem can be
programmed around once the programmer is aware of it.

     If the file is opened on another file designator in the same
process, the file is unlocked even if no attempt is made to do any
form of locking on the second designator.  This is a very bad bug.  It
means that an application must keep track of all the files that it has
opened and locked.

     If there is no statd/lockd on the NFS server, fcntl() will hang
forever waiting for them to appear.  This is a bad bug.  It means that
any attempt to lock on a server that doesn't run these daemons will
hang.  There is no way for an application to request flock() style
``try to lock, but no-op if the mechanism ain't there''.

     There is a rumor to the effect that fcntl() will hang forever on
local files too if there is no local statd/lockd.  These daemons are
running on mailer.u, although they appear not to have much CPU time.
A useful experiment would be to kill them and see if imapd is affected
in any way, but I decline to do so without an OK from UCS!  ;-)  If
killing statd/lockd can be done without breaking fcntl() on local
files, this would become one of the primary means of dealing with this
problem.

     The statd and lockd daemons have quite a reputation for extreme
fragility.  There have been numerous reports about the locking
mechanism being wedged on a systemwide or even clusterwide basis,
requiring a reboot to clear.  It is rumored that this wedge, once it
happens, also blocks local locking.  Presumably killing and restarting
statd would suffice to clear the wedge, but I haven't verified this.

     There appears to be a limit to how many locks may be in use at a
time on the system, although the documentation only mentions it in
passing.  On some of their systems, UCS has increased lockd's ``size
of the socket buffer'', whatever that means.
\f
C-CLIENT USAGE

     c-client uses flock().  On System V systems, flock() is simulated
by an emulator that calls fcntl().  This emulator is provided by some
systems (e.g. AIX), or uses c-client's flock.c module.


BEZERK AND MMDF

     Locking in the traditional UNIX formats was largely dictated by
the status quo in other applications; however, additional protection
is added against inadvertently running multiple instances of a
c-client application on the same mail file.

     (1) c-client attempts to create a .lock file (mail file name with
``.lock'' appended) whenever it reads from, or writes to, the mail
file.  This is an exclusive lock, and is held only for short periods
of time while c-client is actually doing the I/O.  There is a 5-minute
timeout for this lock, after which it is broken on the presumption
that it is a stale lock.  If it cannot create the .lock file due to an
EACCES (protection failure) error, it once silently proceeded without
this lock; this was for systems which protect /usr/spool/mail from
unprivileged processes creating files.  Today, c-client reports an
error unless it is built otherwise.  The purpose of this lock is to
protect against unfavorable interactions with mail delivery.
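
     The .lock protocol described above can be sketched as follows
(an illustration under the stated assumptions, not c-client's actual
code; the EACCES policy is omitted):

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical sketch of the ".lock" protocol: atomically create
 * the lock file with O_EXCL, and break any lock whose file is more
 * than 5 minutes old on the presumption that it is stale. */
int create_dot_lock (const char *lockname)
{
  int fd;
  struct stat sb;
  for (;;) {
    fd = open (lockname, O_WRONLY|O_CREAT|O_EXCL, 0666);
    if (fd >= 0) {			/* got the lock */
      close (fd);
      return 0;
    }
    if (errno != EEXIST) return -1;	/* e.g. EACCES on protected spool */
    if (!stat (lockname, &sb) && (time (NULL) - sb.st_mtime > 300))
      unlink (lockname);		/* stale lock: break it */
    else sleep (1);			/* legitimately held: wait, retry */
  }
}
```

     The lock is released by unlinking the file, as in the earlier
lockfile example.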

     (2) c-client applies a shared flock() to the mail file whenever
it reads from the mail file, and an exclusive flock() whenever it
writes to the mail file.  This lock is freed as soon as the read or
write completes.  The purpose of this lock is to protect against
unfavorable interactions with mail delivery.

     (3) c-client applies an exclusive flock() to a file on /tmp
(whose name represents the device and inode number of the file) when
it opens the mail file.  This lock is maintained throughout the
session, although c-client has a feature (called ``kiss of death'')
which permits c-client to forcibly and irreversibly seize the lock
from a cooperating c-client application that surrenders the lock on
demand.  The purpose of this lock is to protect against unfavorable
interactions with other instances of c-client (rewriting the mail
file).

     Mail delivery daemons use lock (1), (2), or both.  Lock (1) works
over NFS; lock (2) is the only one that works on sites that protect
/usr/spool/mail against unprivileged file creation.  Prudent mail
delivery daemons use both forms of locking, and of course so does
c-client.

     If only lock (2) is used, then multiple processes can read from
the mail file simultaneously, although in real life this doesn't
really change things.  The normal state of locks (1) and (2) is
unlocked except for very brief periods.


TENEX AND MTX

     The design of the locking mechanism of these formats was
motivated by the desire to enable multiple simultaneous read/write
access.  It is almost the reverse of how locking works with
bezerk/mmdf.

     (1) c-client applies a shared flock() to the mail file when it
opens the mail file.  It upgrades this lock to exclusive whenever it
tries to expunge the mail file.  Because of the flock() bug that
upgrading a lock actually releases it, it will not do so until it has
acquired an exclusive lock (2) first.  The purpose of this lock is to
protect against an expunge taking place while some other c-client has
the mail file open (and thus knows where all the messages are).

     (2) c-client applies a shared flock() to a file on /tmp (whose
name represents the device and inode number of the file) when it
parses the mail file.  It applies an exclusive flock() to this file
when it appends new mail to the mail file, as well as before it
attempts to upgrade lock (1) to exclusive.  The purpose of this lock
is to protect against data being appended while some other c-client is
parsing mail in the file (to prevent reading of incomplete messages).
It also protects against the lock-releasing timing race on lock (1).
\f
OBSERVATIONS

     In a perfect world, locking works.  You are protected against
unfavorable interactions with the mailer and against your own
mistakes, such as running more than one instance of your mail reader.
In tenex/mtx formats, you have the additional benefit that multiple
simultaneous read/write access works, with the sole restriction being
that you can't expunge if there are any sharers of the mail file.

     If the mail file is NFS-mounted, then flock() locking is a silent
no-op.  This is the way BSD implements flock(), and c-client's
emulation of flock() through fcntl() tests for NFS files and
duplicates this functionality.  There is no locking protection for
tenex/mtx mail files at all, and only protection against the mailer
for bezerk/mmdf mail files.  This has been the accepted state of
affairs on UNIX for many sad years.

     If you cannot create .lock files, it should not affect locking,
since the flock() locks suffice for all protection.  This is, however,
not true if the mailer does not check for flock() locking, or if the
mail file is NFS-mounted.

     What this means is that there is *no* locking protection at all
in the case of a client using an NFS-mounted /usr/spool/mail that does
not permit file creation by unprivileged programs.  It is impossible,
under these circumstances, for an unprivileged program to do anything
about it.  Worse, if EACCES errors on .lock file creation are
no-op'ed, the user won't even know about it.  This is arguably a site
configuration error.

     The problem with not being able to create .lock files exists on
System V as well, but the failure modes for flock() -- which is
implemented via fcntl() -- are different.

     On System V, if the mail file is NFS-mounted and either the
client or the server lacks a functioning statd/lockd pair, then the
lock attempt would have hung forever if it weren't for the fact that
c-client tests for NFS and no-ops the flock() emulator in this case.
Systemwide or clusterwide failures of statd/lockd have been known to
occur which cause all locks in all processes to hang (including
local?).  Without the special NFS test made by c-client, there would
be no way to request BSD-style no-op behavior, nor is there any way to
determine that this is happening other than the system being hung.

     The additional locking introduced by c-client was shown to cause
much more stress on the System V locking mechanism than has
traditionally been placed upon it.  If it was stressed too far, all
hell broke loose.  Fortunately, this is now past history.
\f
TRADEOFFS

     c-client based applications have a reasonable chance of winning
as long as you don't use NFS for remote access to mail files.  That's
what IMAP is for, after all.  It is, however, very important to
realize that you can *not* use the lock-upgrade feature by itself
because it releases the lock as an interim step -- you need to have
lock-upgrading guarded by another lock.

     If you have the misfortune of using System V, you are likely to
run into problems sooner or later having to do with statd/lockd.  You
basically end up with one of five unsatisfactory choices:
 1) Grit your teeth and live with it.
 2) Try to make it work:
    a) avoid NFS access so as not to stress statd/lockd.
    b) try to understand the code in statd/lockd and hack it
       to be more robust.
    c) hunt out the system limit of locks, if there is one,
       and increase it.  Figure on at least two locks per
       simultaneous imapd process and four locks per Pine
       process.  Better yet, make the limit be 10 times the
       maximum number of processes.
    d) increase the socket buffer (-S switch to lockd) if
       it is offered.  I don't know what this actually does,
       but giving lockd more resources to do its work can't
       hurt.  Maybe.
 3) Decide that it can't possibly work, and turn off the
    fcntl() calls in your program.
 4) If nuking statd/lockd can be done without breaking local
    locking, then do so.  This would make SVR4 have the same
    limitations as BSD locking, with a couple of additional
    bugs.
 5) Check for NFS, and don't do the fcntl() in the NFS case.
    This is what c-client does.

     Note that if you are going to use NFS to access files on a server
which does not have statd/lockd running, your only choice is (3), (4),
or (5).  Here again, IMAP can bail you out.

     These problems aren't unique to c-client applications; they have
also been reported with Elm, Mediamail, and other email tools.

     Of the other two SVR4 locking bugs:

     Programmer awareness is necessary to deal with the bug that you
cannot get an exclusive lock unless the file is open for write.  I
believe that c-client has fixed all of these cases.

     The problem of opening a second designator smashing any current
locks on the file has not been addressed satisfactorily yet.  This is
not an easy problem to deal with, especially in c-client, which really
doesn't know what other files/streams may be open by Pine.

     Aren't you so happy that you bought a System V system?