$Cambridge: exim/doc/doc-docbook/HowItWorks.txt,v 1.1 2005/06/16 10:32:31 ph10 Exp $

CREATING THE EXIM DOCUMENTATION

"You are lost in a maze of twisty little scripts."


This document describes how the various versions of the Exim documentation, in
different output formats, are created from DocBook XML, and also how the
DocBook XML is itself created.


BACKGROUND: THE OLD WAY

From the start of Exim, in 1995, the specification was written in a local text
formatting system known as SGCAL. This is capable of producing PostScript and
plain text output from the same source file. Later, when the "ps2pdf" command
became available with GhostScript, that was used to create a PDF version from
the PostScript. (A few earlier versions were created by a helpful user who had
bought the Adobe distiller software.)

A demand for a version in "info" format led me to write a Perl script that
converted the SGCAL input into a Texinfo file. Because of the somewhat
restrictive requirements of Texinfo, this script has always needed a lot of
maintenance, and has never been 100% satisfactory.

The HTML version of the documentation was originally produced from the Texinfo
version, but later I wrote another Perl script that produced it directly from
the SGCAL input, which made it possible to produce better HTML.

There were a small number of diagrams in the documentation. For the PostScript
and PDF versions, these were created using Aspic, a local text-driven drawing
program that interfaces directly to SGCAL. For the text and Texinfo versions,
alternative ascii-art diagrams were used. For the HTML version, screen shots of
the PostScript output were turned into gifs.


A MORE STANDARD APPROACH

Although in principle SGCAL and Aspic could be generally released, they would
be unlikely to receive much (if any) maintenance, especially after I retire.
Furthermore, the old production method was only semi-automatic; I still did a
certain amount of hand tweaking of spec.txt, for example. As the maintenance of
Exim itself was being opened up to a larger group of people, it seemed sensible
to move to a more standard way of producing the documentation, preferably fully
automated. However, we wanted to use only non-commercial software to do this.

At the time I was thinking about converting (early 2005), the "obvious"
standard format in which to keep the documentation was DocBook XML. The use of
XML in general, in many different applications, was increasing rapidly, and it
seemed likely to remain a standard for some time to come. DocBook offered a
particular form of XML suited to documents that were effectively "books".

Maintaining an XML document by hand editing is a tedious, verbose, and
error-prone process. A number of specialized XML text editors were available,
but all the free ones were at a very primitive stage. I therefore decided to
keep the master source in AsciiDoc format (described below), from which a
secondary XML master could be automatically generated.

All the output formats are generated from the XML file. If, in the future, a
better way of maintaining the XML source becomes available, this can be adopted
without changing any of the processing that produces the output documents.
Equally, if better ways of processing the XML become available, they can be
adopted without affecting the source maintenance.

A number of issues arose while setting this all up, which are best summed up by
the statement that a lot of the technology is (in 2005) still very immature. It
is probable that trying to do this conversion any earlier would not have been
anywhere near as successful. The main problems that still bother me are
described in the penultimate section of this document.

The following sections describe the processes by which the AsciiDoc files are
transformed into the final output documents. In practice, the details are coded
into a makefile that specifies the chain of commands for each output format.


REQUIRED SOFTWARE

Installing software to process XML puts lots and lots of stuff on your box. I
run Gentoo Linux, and a lot of things have been installed as dependencies that
I am not fully aware of. This is what I know about (version numbers are current
at the time of writing):

. AsciiDoc 6.0.3

  This converts the master source file into a DocBook XML file, using a
  customized AsciiDoc configuration file.

. xmlto 0.0.18

  This is a shell script that drives various XML processors. It is used to
  produce "formatted objects" for PostScript and PDF output, and to produce
  HTML output. It uses xsltproc, libxml, libxslt, libexslt, and possibly other
  things that I have not figured out, to apply the DocBook XSLT stylesheets.

. libxml 1.8.17
  libxml2 2.6.17
  libxslt 1.1.12

  These are all installed on my box; I do not know which of libxml or libxml2
  the various scripts are actually using.

. xsl-stylesheets-1.66.1

  These are the standard DocBook XSL stylesheets.

. fop 0.20.5

  FOP is a processor for "formatted objects". It is written in Java. The fop
  command is a shell script that drives it.

. w3m 0.5.1

  This is a text-oriented web browser. It is used to produce the Ascii form of
  the Exim documentation from a specially-created HTML format. It seems to do a
  better job than lynx.

. docbook2texi (part of docbook2X 0.8.5)

  This is a wrapper script for a two-stage conversion process from DocBook to a
  Texinfo file. It uses db2x_xsltproc and db2x_texixml. Unfortunately, there
  are two versions of this command; the old one is based on an earlier fork of
  docbook2X and does not work.

. db2x_xsltproc and db2x_texixml (part of docbook2X 0.8.5)

  More wrapping scripts (see previous item).

. makeinfo 4.8

  This is used to make a set of "info" files from a Texinfo file.

In addition, there are some locally written Perl scripts. These are described
below.


ASCIIDOC

AsciiDoc (http://www.methods.co.nz/asciidoc/) is a Python script that converts
an input document in a more-or-less human-readable format into DocBook XML.
For a document as complex as the Exim specification, the markup is quite
complex - probably no simpler than the original SGCAL markup - but it is
definitely easier to work with than XML itself.

AsciiDoc is highly configurable. It comes with a default configuration, but I
have extended this with an additional configuration file that must be used when
processing the Exim documents. There is a separate document called AdMarkup.txt
that describes the markup that is used in these documents. This includes the
default AsciiDoc markup and the local additions.

The author of AsciiDoc uses the extension .txt for input documents. I find
this confusing, especially as some of the output files have .txt extensions.
Therefore, I have used the extension .ascd for the sources.


THE MAKEFILE

The makefile supports a number of targets of the form x.y, where x is one of
"filter", "spec", or "test", and y is one of "xml", "fo", "ps", "pdf", "html",
"txt", or "info". The intermediate targets "x.xml" and "x.fo" are provided for
testing purposes. The other five targets are production targets. For example:

  make spec.pdf

This runs the necessary tools in order to create the file spec.pdf from the
original source spec.ascd. A number of intermediate files are created during
this process, including the master DocBook source, called spec.xml. Of course,
the usual features of "make" ensure that if this already exists and is
up-to-date, it is not needlessly rebuilt.
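
The x.y naming scheme can be spelled out by enumerating every combination. The
following Python fragment is an illustration only and is not part of the
makefile; the names DOCUMENTS, FORMATS, and INTERMEDIATE are invented for this
sketch.

```python
from itertools import product

# The three document names and seven format suffixes listed above.
DOCUMENTS = ("filter", "spec", "test")
FORMATS = ("xml", "fo", "ps", "pdf", "html", "txt", "info")

# Suffixes of the intermediate targets that exist for testing purposes.
INTERMEDIATE = {"xml", "fo"}


def all_targets():
    """Return every target name of the form x.y."""
    return [f"{doc}.{fmt}" for doc, fmt in product(DOCUMENTS, FORMATS)]


def production_targets():
    """Return only the targets that are not intermediate ones."""
    return [t for t in all_targets() if t.split(".")[1] not in INTERMEDIATE]
```

Each name produced in this way, such as spec.pdf or filter.html, is given to
"make" as its target.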

The "test" series of targets were created so that small tests could easily be
run fairly quickly, because processing even the shortish filter document takes
a bit of time, and processing the main specification takes ages.

Another target is "exim.8". This runs a locally written Perl script called
x2man, which extracts the list of command line options from the spec.xml file,
and creates a man page. There are some XML comments in the spec.xml file to
enable the script to find the start and end of the options list.

There is also a "clean" target that deletes all the generated files.


CREATING DOCBOOK XML FROM ASCIIDOC

There is a single local AsciiDoc configuration file called MyAsciidoc.conf.
Using this, one run of the asciidoc command creates a .xml file from a .ascd
file. When this succeeds, there is no output.


DOCBOOK PROCESSING

Processing a .xml file into the five different output formats is not entirely
straightforward. For a start, the same XML is not suitable for all the
different output styles. When the final output is in a text format (.txt,
.texinfo) for instance, all non-Ascii characters in the input must be converted
to Ascii transliterations because the current processing tools do not do this
correctly automatically.

In order to cope with these issues in a flexible way, a Perl script called
Pre-xml was written. This is used to preprocess the .xml files before they are
handed to the main processors. Adding one more tool onto the front of the
processing chain does at least seem to be in the spirit of XML processing.

The XML processors themselves make use of style files, which can be overridden
by local versions. There is one that applies to all styles, called MyStyle.xsl,
and others for the different output formats. I have included comments in these
style files to explain what changes I have made. Some of the changes are quite
significant.


THE PRE-XML SCRIPT

The Pre-xml script copies a .xml file, making certain changes according to the
options it is given. The currently available options are as follows:

-abstract

  This option causes the <abstract> element to be removed from the XML. The
  source abuses the <abstract> element by using it to contain the author's
  address so that it appears on the title page verso in the printed renditions.
  This just gets in the way for the non-PostScript/PDF renditions.

-ascii

  This option is used for Ascii output formats. It makes the following
  character replacements:

    &8230;    =>  ...      (sic, no #x)
    &#x2019;  =>  '        apostrophe
    &#x201C;  =>  "        opening double quote
    &#x201D;  =>  "        closing double quote
    &#x2013;  =>  -        en dash
    &#x2020;  =>  *        dagger
    &#x2021;  =>  **       double dagger
    &#x00a0;  =>  a space  hard space
    &#x00a9;  =>  (c)      copyright

  In addition, this option causes quotes to be put round <literal> text items,
  and <quote> and </quote> to be replaced by Ascii quote marks. You would think
  the stylesheet would cope with the latter, but it seems to generate non-Ascii
  characters that w3m then turns into question marks.

-bookinfo

  This option causes the <bookinfo> element to be removed from the XML. It is
  used for the PostScript/PDF forms of the filter document, in order to avoid
  the generation of a full title page.

-fi

  Replace any occurrence of "fi" by the ligature &#xFB01; except when it is
  inside an XML element, or inside a <literal> part of the text.

  The use of ligatures would be nice for the PostScript and PDF formats. Sadly,
  it turns out that fop cannot at present handle the FB01 character correctly.
  The only format that does so is the HTML format, but when I used this in the
  test version, people complained that it made searching for words difficult.
  So at the moment, this option is not used. :-(

-noindex

  Remove the XML to generate a Concept Index and an Options index.

-oneindex

  Remove the XML to generate a Concept and an Options Index, and add XML to
  generate a single index.
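
The character replacements made by the -ascii option can be pictured with a
short Python sketch. This is not the Pre-xml script itself, which is written in
Perl; the names ASCII_REPLACEMENTS and to_ascii are invented here, matching is
assumed to be exact and case-sensitive, and the quoting of <literal> items is
left out because the kind of quote used is not stated above.

```python
import re

# Character references and their Ascii replacements, as listed for -ascii.
ASCII_REPLACEMENTS = {
    "&8230;": "...",    # written without "#x", as noted in the list
    "&#x2019;": "'",    # apostrophe
    "&#x201C;": '"',    # opening double quote
    "&#x201D;": '"',    # closing double quote
    "&#x2013;": "-",    # en dash
    "&#x2020;": "*",    # dagger
    "&#x2021;": "**",   # double dagger
    "&#x00a0;": " ",    # hard space
    "&#x00a9;": "(c)",  # copyright
}


def to_ascii(xml_text):
    """Apply the replacements and turn <quote>...</quote> into "..."."""
    for reference, plain in ASCII_REPLACEMENTS.items():
        xml_text = xml_text.replace(reference, plain)
    # Both the opening and the closing tag become an Ascii double quote.
    return re.sub(r"</?quote>", '"', xml_text)
```

For example, the input "Exim&#x2019;s <quote>spec</quote>" comes out as
Exim's "spec" with plain Ascii quote characters.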

The source document has two types of index entry, for a concept and an options
index. However, no index is required for the .txt and .texinfo outputs.
Furthermore, the only output processor that supports multiple indexes is the
processor that produces "formatted objects" for PostScript and PDF output. The
HTML processor ignores the XML settings for multiple indexes and just makes one
unified index. Specifying two indexes gets you two copies of the same index, so
this has to be changed.
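
The rule described for the -fi option earlier in this section (replace "fi"
everywhere except inside markup and inside <literal> text) can be sketched as
follows. This is a Python illustration under stated assumptions, not the Perl
code in Pre-xml: the function name add_fi_ligatures is invented, and <literal>
elements are assumed to carry no attributes and not to be nested.

```python
import re

# Matches either a whole <literal>...</literal> span or any single tag.
_PROTECTED = re.compile(r"(<literal>.*?</literal>|<[^>]*>)", re.DOTALL)


def add_fi_ligatures(xml_text):
    """Replace "fi" by &#xFB01; outside tags and outside <literal> text."""
    parts = _PROTECTED.split(xml_text)
    # With one capturing group, re.split puts ordinary text at the even
    # positions and the protected matches at the odd positions.
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace("fi", "&#xFB01;")
    return "".join(parts)
```

Splitting on the protected spans first means that the replacement never has to
decide whether a particular "fi" is inside markup; it only ever sees ordinary
text.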


CREATING POSTSCRIPT AND PDF

These two output formats are created in three stages. First, the XML is
pre-processed. For the filter document, the <bookinfo> element is removed so
that no title page is generated, but for the main specification, no changes are
currently made.

Second, the xmlto command is used to produce a "formatted objects" (.fo) file.
This process uses the following stylesheets:

  (1) Either MyStyle-filter-fo.xsl or MyStyle-spec-fo.xsl
  (2) MyStyle-fo.xsl
  (3) MyStyle.xsl
  (4) MyTitleStyle.xsl

The last of these is not used for the filter document, which does not have a
title page. The first three stylesheets were created manually, either by typing
directly, or by copying from the standard style sheet and editing.

The final stylesheet has to be created from a template document, which is
called MyTitlepage.templates.xml. This was copied from the standard styles and
modified. The template is processed with xsltproc to produce the stylesheet.
All this apparatus is appallingly heavyweight. The processing is also very slow
in the case of the specification document. However, there should be no errors.

In the third and final part of the processing, the .fo file that is produced by
the xmlto command is processed by the fop command to generate either PostScript
or PDF. This is also very slow, and you get a whole slew of errors, of which
these are a sample:

  [ERROR] property - "background-position-horizontal" is not implemented yet.

  [ERROR] property - "background-position-vertical" is not implemented yet.

  [INFO] JAI support was not installed (read: not present at build time).
    Trying to use Jimi instead
    Error creating background image: Error creating FopImage object (Error
    creating FopImage object
    (http://docbook.sourceforge.net/release/images/draft.png) :
    org.apache.fop.image.JimiImage

  [WARNING] table-layout=auto is not supported, using fixed!

  [ERROR] Unknown enumerated value for property 'span': inherit

  [ERROR] Error in span property value 'inherit':
    org.apache.fop.fo.expr.PropertyException: No conversion defined

  [ERROR] Areas pending, text probably lost in lineinclude parts matched in the
    response by response_pattern by means of numeric variables such as

The last one is particularly meaningless gobbledegook. Some of the errors and
warnings are repeated many times. Nevertheless, it does eventually produce
usable output, though I have a number of issues with it (see a later section of
this document). Maybe one day there will be a new release of fop that does
better. Maybe there will be some other means of producing PostScript and PDF
from DocBook XML. Maybe porcine aeronautics will really happen.


CREATING HTML

Only two stages are needed to produce HTML, but the main specification is
subsequently postprocessed. The Pre-xml script is called with the -abstract and
-oneindex options to preprocess the XML. Then the xmlto command creates the
HTML output directly. For the specification document, a directory of files is
created, whereas the filter document is output as a single HTML page. The
following stylesheets are used:

  (1) Either MyStyle-chunk-html.xsl or MyStyle-nochunk-html.xsl
  (2) MyStyle-html.xsl
  (3) MyStyle.xsl

The first stylesheet references the chunking or non-chunking standard
stylesheet, as appropriate.

The original HTML that I produced from the SGCAL input had hyperlinks back from
chapter and section titles to the table of contents. These links are not
generated by xmlto. One of the testers pointed out that the lack of these
links, or simple self-referencing links for titles, makes it harder to copy a
link name into, for example, a mailing list response.

I could not find where to fiddle with the stylesheets to make such a change, if
indeed the stylesheets are capable of it. Instead, I wrote a Perl script called
TidyHTML-spec to do the job for the specification document. It updates the
index.html file (which contains the table of contents) setting up anchors,
and then updates all the chapter files to insert appropriate links.

The index.html file as built by xmlto contains the whole table of contents in a
single line, which makes it hard to debug by hand. Since I was postprocessing
it anyway, I arranged to insert newlines after every '>' character.
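
The effect of that line splitting is easy to show in isolation. The Python
function below is only an illustration of the idea; TidyHTML-spec is a Perl
script, and the name split_after_tags is invented here.

```python
def split_after_tags(html):
    """Insert a newline after every '>' so that each tag ends a line."""
    return html.replace(">", ">\n")
```

A one-line fragment such as "<ul><li>A</li></ul>" becomes four short lines,
each ending with a '>', which is much easier to inspect in a text editor.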

The TidyHTML-spec script also takes the opportunity to postprocess the
spec.html/ix01.html file, which contains the document index. Again, the index
is generated as one single line, so it splits it up. Then it creates a list of
letters at the top of the index and hyperlinks them both ways from the
different letter portions of the index.

People wanted similar postprocessing for the filter.html file, so that is now
done using a similar script called TidyHTML-filter. It was easier to use a
separate script because filter.html is a single file rather than a directory,
so the logic is somewhat different.


CREATING TEXT FILES

This happens in four stages. The Pre-xml script is called with the -abstract,
-ascii and -noindex options to remove the <abstract> element, convert the input
to Ascii characters, and to disable the production of an index. Then the xmlto
command converts the XML to a single HTML document, using these stylesheets:

  (1) MyStyle-txt-html.xsl
  (2) MyStyle-html.xsl
  (3) MyStyle.xsl

The MyStyle-txt-html.xsl stylesheet is the same as MyStyle-nochunk-html.xsl,
except that it contains an additional item to ensure that a generated
"copyright" symbol is output as "(c)" rather than the Unicode character. This
is necessary because the stylesheet itself generates a copyright symbol as part
of the document title; the character is not in the original input.

The w3m command is used with the -dump option to turn the HTML file into Ascii
text, but this contains multiple sequences of blank lines that make it look
awkward, so, finally, a local Perl script called Tidytxt is used to convert
sequences of blank lines into a single blank line.
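
The blank-line compaction can be expressed as a single substitution. The Python
sketch below is not the Tidytxt script, which is written in Perl; the function
name squeeze_blank_lines is invented, and treating lines that contain only
spaces or tabs as blank is an assumption made for this example.

```python
import re


def squeeze_blank_lines(text):
    """Collapse each run of blank lines into a single blank line."""
    # A newline followed by one or more lines that hold nothing but optional
    # spaces or tabs is reduced to exactly one empty line.
    return re.sub(r"\n(?:[ \t]*\n)+", "\n\n", text)
```

Single blank lines and ordinary line breaks pass through unchanged, so
paragraph breaks in the w3m output are kept while the long gaps disappear.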


CREATING INFO FILES

This process starts with the same Pre-xml call as for text files. The
<abstract> element is deleted, non-ascii characters in the source are
transliterated, and the <index> elements are removed. The docbook2texi script
is then called to convert the XML file into a Texinfo file. However, this is
not quite enough. The converted file ends up with "conceptindex" and
"optionindex" items, which are not recognized by the makeinfo command. An
in-line call to Perl in the Makefile changes these to "cindex" and "findex"
respectively in the final .texinfo file. Finally, a call of makeinfo creates a
set of .info files.
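
The renaming step amounts to a two-entry word substitution. The Python sketch
below shows the idea only; the real work is done by an in-line Perl call in the
Makefile, and the names INDEX_COMMANDS and fix_index_commands, as well as the
assumption that the items appear as whole words such as "@conceptindex", are
made up for this example.

```python
import re

# Index item names produced by the conversion and the names makeinfo accepts.
INDEX_COMMANDS = {"conceptindex": "cindex", "optionindex": "findex"}


def fix_index_commands(texinfo):
    """Rename the two unrecognized index items wherever they occur as words."""
    return re.sub(
        r"\b(conceptindex|optionindex)\b",
        lambda match: INDEX_COMMANDS[match.group(1)],
        texinfo,
    )
```

The word boundaries keep the substitution from touching longer words that
merely contain one of the two names.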

There is one apparently unconfigurable feature of docbook2texi: it does not
seem possible to give it a file name for its output. It chooses a name based on
the title of the document. Thus, the main specification ends up in a file
called the_exim_mta.texi and the filter document in exim_filtering.texi. These
files are removed after their contents have been copied and modified by the
inline Perl call, which makes a .texinfo file.


CREATING THE MAN PAGE

I wrote a Perl script called x2man to create the exim.8 man page from the
DocBook XML source. I deliberately did NOT start from the AsciiDoc source,
because it is the DocBook source that is the "standard". This comment line in
the DocBook source marks the start of the command line options:

  <!-- === Start of command line options === -->

A similar line marks the end. If at some time in the future a method other
than AsciiDoc is used to maintain the DocBook source, it needs to be capable of
maintaining these comments.
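
Locating the options list between the two comments can be sketched as below.
This Python fragment is not x2man, which is a Perl script. The start marker is
the one quoted above, but the exact wording of the end marker is not given in
this document, so the END value here is a guess made only for illustration, as
is the function name extract_options.

```python
START = "<!-- === Start of command line options === -->"
# Assumed wording: the document says only that "a similar line" marks the end.
END = "<!-- === End of command line options === -->"


def extract_options(xml_text):
    """Return the text that lies between the start and end marker comments."""
    start = xml_text.index(START) + len(START)
    end = xml_text.index(END, start)
    return xml_text[start:end]
```

If either comment is missing, str.index raises ValueError, which is a
reasonable outcome for this sketch: without the markers there is no reliable
way to find the options list.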


UNRESOLVED PROBLEMS

There are a number of unresolved problems with producing the Exim documentation
in the manner described above. I will describe them here in the hope that in
future some way round them can be found.

(1) Errors in the toolchain

    When a whole chain of tools is processing a file, an error somewhere in
    the middle is often very hard to debug. For instance, an error in the
    AsciiDoc might not show up until an XML processor throws a wobbly because
    the generated XML is bad. You have to be able to read XML and figure out
    what generated what. One of the reasons for creating the "test" series of
    targets was to help in checking out these kinds of problem.

(2) There is a mechanism in XML for marking parts of the document as
    "revised", and I have arranged for AsciiDoc markup to use it. However, at
    the moment, the only output format that pays attention to this is the HTML
    output, which sets a green background. There are therefore no revision
    marks (change bars) in the PostScript, PDF, or text output formats as
    there used to be. (There never were for Texinfo.)

(3) The index entries in the HTML format take you to the top of the section
    that is referenced, instead of to the point in the section where the index
    marker was set.

(4) The HTML output supports only a single index, so the concept and options
    index entries have to be merged.

(5) The index for the PostScript/PDF output does not merge identical page
    numbers, which makes some entries look ugly.

(6) None of the indexes (PostScript/PDF and HTML) make use of textual
    markup; the text is all roman, without any italic or boldface.

(7) I turned off hyphenation in the PostScript/PDF output, because it was
    being done so badly.

    (a) It seems to force hyphenation if it is at all possible, without
        regard to the "tightness" or "looseness" of the line. Decent
        formatting software should attempt hyphenation only if the line is
        over some "looseness" threshold; otherwise you get far too many
        hyphenations, often for several lines in succession.

    (b) It uses an algorithmic form of hyphenation that doesn't always produce
        acceptable word breaks. (I prefer to use a hyphenation dictionary.)

(8) The PostScript/PDF output is badly paginated:

    (a) There seems to be no attempt to avoid "widow" and "orphan" lines on
        pages. A "widow" is the last line of a paragraph at the top of a page,
        and an "orphan" is the first line of a paragraph at the bottom of a
        page.

    (b) There seems to be no attempt to prevent section headings being placed
        last on a page, with no following text on the page.

(9) The fop processor does not support "fi" ligatures, not even if you put the
    appropriate Unicode character into the source by hand.

(10) There are no diagrams in the new documentation. This is something I could
     work on. The previously-used Aspic command for creating line art from a
     textual description can output Encapsulated PostScript or Scalable Vector
     Graphics, which are two standard diagram representations. Aspic could be
     formally released and used to generate output that could be included in at
     least some of the output formats.

The consequence of (7), (8), and (9) is that the PostScript/PDF output looks as
if it comes from some of the very early attempts at text formatting of around
20 years ago. We can only hope that 20 years' progress is not going to get
lost, and that things will improve in this area.


LIST OF FILES

AdMarkup.txt               Describes the AsciiDoc markup that is used
HowItWorks.txt             This document
Makefile                   The makefile
MyAsciidoc.conf            Localized AsciiDoc configuration
MyStyle-chunk-html.xsl     Stylesheet for chunked HTML output
MyStyle-filter-fo.xsl      Stylesheet for filter fo output
MyStyle-fo.xsl             Stylesheet for any fo output
MyStyle-html.xsl           Stylesheet for any HTML output
MyStyle-nochunk-html.xsl   Stylesheet for non-chunked HTML output
MyStyle-spec-fo.xsl        Stylesheet for spec fo output
MyStyle-txt-html.xsl       Stylesheet for HTML=>text output
MyStyle.xsl                Stylesheet for all output
MyTitleStyle.xsl           Stylesheet for spec title page
MyTitlepage.templates.xml  Template for creating MyTitleStyle.xsl
Myhtml.css                 Experimental css stylesheet for HTML output
Pre-xml                    Script to preprocess XML
TidyHTML-filter            Script to tidy up the filter HTML output
TidyHTML-spec              Script to tidy up the spec HTML output
Tidytxt                    Script to compact multiple blank lines
filter.ascd                AsciiDoc source of the filter document
spec.ascd                  AsciiDoc source of the specification document
x2man                      Script to make the Exim man page from the XML

The file Myhtml.css was an experiment that was not followed through. It is
mentioned in a comment in MyStyle-html.xsl, but is not at present in use.


Philip Hazel
Last updated: 10 June 2005