doc/doc-docbook/HowItWorks.txt

   1 CREATING THE EXIM DOCUMENTATION
   2
   3 "You are lost in a maze of twisty little scripts."
   4
   5
   6 This document describes how the various versions of the Exim documentation, in
   7 different output formats, are created from DocBook XML, and also how the
   8 DocBook XML is itself created.
   9
  10
  11 BACKGROUND: THE OLD WAY
  12
  13 From the start of Exim, in 1995, the specification was written in a local text
  14 formatting system known as SGCAL. This is capable of producing PostScript and
  15 plain text output from the same source file. Later, when the "ps2pdf" command
  16 became available with GhostScript, that was used to create a PDF version from
  17 the PostScript. (A few earlier versions were created by a helpful user who had
  18 bought the Adobe distiller software.)
  19
  20 A demand for a version in "info" format led me to write a Perl script that
  21 converted the SGCAL input into a Texinfo file. Because of the somewhat
  22 restrictive requirements of Texinfo, this script always needed a lot of
  23 maintenance, and was never totally satisfactory.
  24
  25 The HTML version of the documentation was originally produced from the Texinfo
  26 version, but later I wrote another Perl script that produced it directly from
  27 the SGCAL input, which made it possible to produce better HTML.
  28
  29 There were a small number of diagrams in the documentation. For the PostScript
  30 and PDF versions, these were created using Aspic, a local text-driven drawing
  31 program that interfaces directly to SGCAL. For the text and Texinfo versions,
  32 alternative ASCII-art diagrams were used. For the HTML version, screen shots of
  33 the PostScript output were turned into gifs.
  34
  35
  36 A MORE STANDARD APPROACH
  37
  38 Although in principle SGCAL and Aspic could be generally released, they would
  39 be unlikely to receive much (if any) maintenance, especially after I retire.
  40 Furthermore, the old production method was only semi-automatic; I still did a
  41 certain amount of hand tweaking of spec.txt, for example. As the maintenance of
  42 Exim itself was being opened up to a larger group of people, it seemed sensible
  43 to move to a more standard way of producing the documentation, preferably fully
  44 automated. However, we wanted to use only non-commercial software to do this.
  45
  46 At the time I was thinking about converting (early 2005), the "obvious"
  47 standard format in which to keep the documentation was DocBook XML. The use of
  48 XML in general, in many different applications, was increasing rapidly, and it
  49 seemed likely to remain a standard for some time to come. DocBook offered a
  50 particular form of XML suited to documents that were effectively "books".
  51
  52 Maintaining an XML document by hand editing is a tedious, verbose, and
  53 error-prone process. A number of specialized XML text editors were available,
  54 but all the free ones were at a very primitive stage. I therefore decided to
  55 keep the master source in AsciiDoc format, from which a secondary XML master
  56 could be automatically generated.
  57
  58 The first "new" versions of the documents, for the 4.60 release, were generated
  59 this way. However, there were a number of problems with using AsciiDoc for a
  60 document as large and as complex as the Exim manual. As a result, I wrote a new
  61 application called xfpt ("XML From Plain Text") which creates XML from a
  62 relatively simple and consistent markup language. This application has been
  63 released for general use, and the master sources for the Exim documentation are
  64 now in xfpt format.
  65
  66 All the output formats are generated from the XML file. If, in the future, a
  67 better way of maintaining the XML source becomes available, this can be adopted
  68 without changing any of the processing that produces the output documents.
  69 Equally, if better ways of processing the XML become available, they can be
  70 adopted without affecting the source maintenance.
  71
  72 A number of issues arose while setting this all up, which are best summed up by
  73 the statement that a lot of the technology was (in 2006) still very immature.
  74 Trying to do this conversion any earlier would probably not have been anywhere
  75 near as successful. The main issues that bother me in the XML-generated
  76 documentation are described in the penultimate section of this document.
  77
  78 Initially, the major problems were in producing PostScript and PDF outputs. The
  79 available free software for doing this was and still is (we are now in 2007)
  80 cumbersome and slow, and does not support certain output features that I would
  81 like. My response to this was, over a period of two years, to write an XML
  82 processor called SDoP (Simple DocBook Processor). This program reads DocBook
  83 XML and writes PostScript, without using any of the heavyweight apparatus that
  84 is required for xmlto and fop (the previously used software).
  85
  86 An experimental first version of SDoP was used for the Exim 4.67
  87 documentation. Subsequently SDoP was released for general use. SDoP's output
  88 includes features that are missing when xmlto/fop is used, and it also runs
  89 about 60 times faster. The main manual can be formatted in 2.5 seconds instead
  90 of 2.5 minutes, which makes checking and fixing mistakes much easier.
  91
  92 The Makefile that is used to build the various forms of output will, for the
  93 moment, support both ways of producing PostScript and PDF output, though the
  94 default is now to use SDoP.
  95
  96 The following sections describe the processes by which the xfpt files are
  97 transformed into the final output documents. In practice, the details are coded
  98 into a Makefile that specifies the chain of commands for each output format.
  99
 100
 101 REQUIRED SOFTWARE
 102
 103 Installing software to process XML puts lots and lots of stuff on your box. I
 104 run Gentoo Linux, and a lot of things have been installed as dependencies that
 105 I am not fully aware of. This is what I know about (version numbers are current
 106 at the time of writing):
 107
 108 . xfpt 0.03
 109
 110   This converts the master source file into a DocBook XML file.
 111
 112 . sdop 0.03
 113
 114   This is my new DocBook-to-PostScript processor.
 115
 116 . ps2pdf
 117
 118   This is a wrapper script that is part of the GhostScript distribution. It
 119   converts a PostScript file into a PDF file. It is used to process the output
 120   from SDoP. It is not required when xmlto/fop is being used to generate PDF
 121   output.
 122
 123 . xmlto 0.0.18
 124
 125   This is a shell script that drives various XML processors. It is used to
 126   produce "formatted objects" when PostScript and PDF output is being generated
 127   using fop (the old way) rather than SDoP. It is always used to produce HTML
 128   output. It uses xsltproc, libxml, libxslt, libexslt, and possibly other
 129   things that I have not figured out, to apply the DocBook XSLT stylesheets.
 130
 131 . libxml 1.8.17
 132   libxml2 2.6.28
 133   libxslt 1.1.20
 134
 135   These are all installed on my box; I do not know which of libxml or libxml2
 136   the various scripts are actually using.
 137
 138 . xsl-stylesheets-<version>
 139
 140   These are the standard DocBook XSL stylesheets.
 141
 142   The documents use http://docbook.sourceforge.net/release/xsl/current/ which
 143   should be mapped to an appropriate local path via the system catalogs.
 144
 145 . fop 0.93
 146
 147   FOP is a processor for "formatted objects". It is written in Java. The fop
 148   command is a shell script that drives it. It is required only if you do not
 149   want to use SDoP and ps2pdf to generate PostScript and PDF output.
 150
 151 . w3m 0.5.2
 152
 153   This is a text-oriented web browser. It is used to produce the ASCII form of
 154   the Exim documentation (spec.txt) from a specially-created HTML format. It
 155   seems to do a better job than lynx.
 156
 157 . docbook2texi (part of docbook2X 0.8.5)
 158
 159   This is a wrapper script for a two-stage conversion process from DocBook to a
 160   Texinfo file. It uses db2x_xsltproc and db2x_texixml. Unfortunately, there
 161   are two versions of this command; the old one is based on an earlier fork of
 162   docbook2X and does not work.
 163
 164 . db2x_xsltproc and db2x_texixml (part of docbook2X 0.8.5)
 165
 166   More wrapping scripts (see previous item).
 167
 168 . makeinfo 4.8
 169
 170   This is used to make an "info" file from a Texinfo file.
 171
 172 In addition, there are a number of locally written Perl scripts. These are
 173 described below.
 174
 175
 176 THE MAKEFILE
 177
 178 The makefile supports a number of targets of the form x.y, where x is one of
 179 "filter", "spec", or "test", and y is one of "xml", "fo", "ps", "pdf", "html",
 180 "txt", or "info". The intermediate targets "x.xml" and "x.fo" are provided for
 181 testing purposes. The other five targets are production targets. For example:
 182
 183   make spec.pdf
 184
 185 This runs the necessary tools in order to create the file spec.pdf from the
 186 original source spec.xfpt. A number of intermediate files are created during
 187 this process, including the master DocBook source, called spec.xml. Of course,
 188 the usual features of "make" ensure that if this already exists and is
 189 up-to-date, it is not needlessly rebuilt.
 190
 191 Because there are now two ways of creating the PostScript and PDF outputs,
 192 there are two targets for each one. For example fop-spec.ps makes PostScript
 193 using fop, and sdop-spec.ps makes it using SDoP. The generic targets spec.ps
 194 and spec.pdf now point to the SDoP versions.
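
The target naming scheme described above can be enumerated with a short Python
sketch. It is illustrative only: the function name is hypothetical, the real
rules live in the Makefile, and it assumes that the fop- and sdop- variants
exist for the "filter" and "test" documents as well as for "spec" (the text
only gives the "spec" example).

```python
from itertools import product

DOCUMENTS = ("filter", "spec", "test")
INTERMEDIATE = ("xml", "fo")                       # provided for testing purposes
PRODUCTION = ("ps", "pdf", "html", "txt", "info")  # production targets


def target_names():
    """Return every x.y target, plus assumed fop-/sdop- variants for ps and pdf."""
    names = [
        f"{doc}.{fmt}"
        for doc, fmt in product(DOCUMENTS, INTERMEDIATE + PRODUCTION)
    ]
    # Assumption: each document has both a fop- and an sdop- PostScript/PDF target.
    for doc, fmt in product(DOCUMENTS, ("ps", "pdf")):
        names.append(f"fop-{doc}.{fmt}")
        names.append(f"sdop-{doc}.{fmt}")
    return names


if __name__ == "__main__":
    for name in target_names():
        print("make", name)
```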
 195
 196 The "test" series of targets were created so that small tests could easily be
 197 run fairly quickly, because processing even the shortish XML document takes
 198 a bit of time, and processing the main specification takes ages -- except when
 199 using SDoP for PostScript and PDF.
 200
 201 Another target is "exim.8". This runs a locally written Perl script called
 202 x2man, which extracts the list of command line options from the spec.xml file,
 203 and creates a man page. There are some XML comments in the spec.xml file to
 204 enable the script to find the start and end of the options list.
 205
 206 There is also a "clean" target that deletes all the generated files.
 207
 208
 209 CREATING DOCBOOK XML FROM XFPT INPUT
 210
 211 The small amount of local configuration for xfpt is included at the start of
 212 the two .xfpt files; there are no separate local xfpt configuration files.
 213 Running the xfpt command creates a .xml file from a .xfpt file. When this
 214 succeeds, there is no output.
 215
 216
 217 DOCBOOK PROCESSING
 218
 219 Processing a .xml file into the five different output formats is not entirely
 220 straightforward. For a start, the same XML is not suitable for all the
 221 different output styles. When the final output is in a text format (.txt,
 222 .texinfo) for instance, all non-ASCII characters in the input must be converted
 223 to ASCII transliterations because the current processing tools do not do this
 224 correctly automatically.
 225
 226 In order to cope with these issues in a flexible way, a Perl script called
 227 Pre-xml was written. This is used to preprocess the .xml files before they are
 228 handed to the main processors. Adding one more tool onto the front of the
 229 processing chain does at least seem to be in the spirit of XML processing.
 230
 231 The XML processors other than SDoP make use of style files, which can be
 232 overridden by local versions. There is one that applies to all styles, called
 233 MyStyle.xsl, and others for the different output formats. I have included
 234 comments in these style files to explain what changes I have made. Some of the
 235 changes are quite significant.
 236
 237
 238 XSL INCLUDES
 239
 240 References to XSL paths should use the public URLs, such as:
 241   http://docbook.sourceforge.net/release/xsl/current/xhtml/docbook.xsl
 242 If this fails to work for you, then there is a problem with your system
 243 catalogs.  As a work-around, you can adjust the OS-Fixups script and then:
 244 $ make os-fixup
 245
 246 As an example of how this should normally work, on a FreeBSD system the
 247 resolution goes to /usr/local/share/xml/catalog which contains a directive:
 248   <nextCatalog catalog="/usr/local/share/xml/catalog.ports" />
 249 to pull in the file automatically maintained by the Ports system.  That file
 250 will contain:
 251   <delegateSystem
 252    systemIdStartString="http://docbook.sourceforge.net/release/xsl/"
 253    catalog="file:///usr/local/share/xsl/docbook/catalog" />
 254   <delegateURI
 255    uriStartString="http://docbook.sourceforge.net/release/xsl/"
 256    catalog="file:///usr/local/share/xsl/docbook/catalog" />
 257 and that catalog file contains:
 258   <rewriteSystem
 259    systemIdStartString="http://docbook.sourceforge.net/release/xsl/current"
 260    rewritePrefix="file:///usr/local/share/xsl/docbook" />
 261   <rewriteURI
 262    uriStartString="http://docbook.sourceforge.net/release/xsl/current"
 263    rewritePrefix="file:///usr/local/share/xsl/docbook" />
 264 and the full path is thus eventually arrived at.
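
The final rewriting step can be illustrated with a small Python sketch. It
models only a single rewriteURI rule as a prefix substitution; real catalog
resolvers also handle delegation and choose between competing rules, which is
not attempted here. The function name is hypothetical.

```python
def rewrite_uri(uri, start_string, rewrite_prefix):
    """Apply one rewriteURI rule: swap a matching prefix, else return uri unchanged."""
    if uri.startswith(start_string):
        return rewrite_prefix + uri[len(start_string):]
    return uri


public_url = "http://docbook.sourceforge.net/release/xsl/current/xhtml/docbook.xsl"
local_path = rewrite_uri(
    public_url,
    "http://docbook.sourceforge.net/release/xsl/current",
    "file:///usr/local/share/xsl/docbook",
)
print(local_path)
```

With the FreeBSD catalog entries shown above, the public URL maps to
file:///usr/local/share/xsl/docbook/xhtml/docbook.xsl.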
 265
 266 See also the tools:
 267   xmlcatalog(1) from libxml2
 268   xmlcatmgr(1) for a lightweight tool written for the NetBSD Packages system.
 269
 270
 271 THE PRE-XML SCRIPT
 272
 273 The Pre-xml script copies a .xml file, making certain changes according to the
 274 options it is given. The currently available options are as follows:
 275
 276 -ascii
 277
 278   This option is used for ASCII output formats. It makes the following
 279   character replacements:
 280
 281     &#x2019;  =>  '         apostrophe
 282     &copy;    =>  (c)       copyright
 283     &dagger;  =>  *         dagger
 284     &Dagger;  =>  **        double dagger
 285     &nbsp;    =>  a space   hard space
 286     &ndash;   =>  -         en dash
 287
 288   The apostrophe is specified numerically because that is what xfpt generates
 289   from an ASCII single quote character. Non-ASCII characters that are not in
 290   this list should not be used without thinking about how they might be
 291   converted for the ASCII formats.
 292
 293   In addition to the character replacements, this option causes quotes to be
 294   put round <literal> text items, and <quote> and </quote> to be replaced by
 295   ASCII quote marks. You would think the stylesheet would cope with the latter,
 296   but it seems to generate non-ASCII characters that w3m then turns into
 297   question marks.
 298
 299 -bookinfo
 300
 301   This option causes the <bookinfo> element to be removed from the XML. It is
 302   used for the PostScript/PDF forms of the filter document, in order to avoid
 303   the generation of a full title page.
 304
 305 -fi
 306
 307   Replace any occurrence of "fi" by the ligature &#xFB01; except when it is
 308   inside an XML element, or inside a <literal> part of the text.
 309
 310   The use of ligatures would be nice for the PostScript and PDF formats. Sadly,
 311   it turns out that fop cannot at present handle the FB01 character correctly.
 312   Happily this problem is now avoided when SDoP is used to generate PostScript
 313   (and thence PDF) because SDoP automatically uses an "fi" ligature for
 314   non-fixed-width fonts.
 315
 316   The only xmlto format that handles FB01 is the HTML format, but when I used
 317   this in the test version, people complained that it made searching for words
 318   difficult. So this option is in practice not used at all.
 319
 320 -noindex
 321
 322   Remove the XML to generate a Concept Index and an Options index. The source
 323   document has three types of index entry, for variables, options, and concept
 324   indexes. However, no index is required for the .txt and .texinfo outputs.
 325
 326 -oneindex
 327
 328   Remove the XML to generate separate variables, options, and concept indexes,
 329   and add XML to generate a single index. The only output processors that
 330   support multiple indexes are SDoP and the processor that produces "formatted
 331   objects" for PostScript and PDF output for fop. The HTML processor ignores
 332   the XML settings for multiple indexes and just makes one unified index.
 333   Specifying three indexes gets you three copies of the same index, so this has
 334   to be changed.
 335
 336 -optbreak
 337
 338   Look for items of the form <option>...</option> and <varname>...</varname> in
 339   ordinary paragraphs, and insert &#x200B; after each underscore in the
 340   enclosed text. The same is done for any word containing four or more upper
 341   case letters (compile-time options in the Exim specification). The character
 342   &#x200B; is a zero-width space. This means that the line may be split after
 343   one of these underscores, but no hyphen is inserted.
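
Two of the options above, -ascii and -optbreak, are simple enough to sketch in
Python. This is not the Pre-xml script itself (which is written in Perl); the
function names are hypothetical, and the sketch deliberately leaves out the
quoting of <literal> and <quote> items, the rule for words with four or more
upper case letters, and the restriction to ordinary paragraphs.

```python
import re

# Replacement table taken from the -ascii option description.
ASCII_REPLACEMENTS = {
    "&#x2019;": "'",    # apostrophe
    "&copy;": "(c)",    # copyright
    "&dagger;": "*",    # dagger
    "&Dagger;": "**",   # double dagger
    "&nbsp;": " ",      # hard space
    "&ndash;": "-",     # en dash
}

ZWSP = "&#x200B;"  # zero-width space: permits a line break without a hyphen

OPTION_OR_VARNAME = re.compile(r"<(option|varname)>(.*?)</\1>")


def to_ascii(xml_text):
    """Replace the listed entities with their ASCII transliterations."""
    for entity, replacement in ASCII_REPLACEMENTS.items():
        xml_text = xml_text.replace(entity, replacement)
    return xml_text


def add_option_breaks(xml_text):
    """Insert a zero-width space after each underscore inside <option>/<varname>."""
    def mark(match):
        tag = match.group(1)
        body = match.group(2).replace("_", "_" + ZWSP)
        return "<" + tag + ">" + body + "</" + tag + ">"
    return OPTION_OR_VARNAME.sub(mark, xml_text)
```

Underscores outside <option> and <varname> elements are left alone, so only
option and variable names gain extra break points.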
 344
 345
 346 CREATING POSTSCRIPT AND PDF
 347
 348 These two output formats are created either by using my new SDoP program to
 349 produce PostScript which can then be run through ps2pdf to make a PDF, or by
 350 using xmlto and fop in the old way.
 351
 352
 353 USING SDOP TO CREATE POSTSCRIPT AND PDF
 354
 355 PostScript output is created in two stages. First, the XML is pre-processed by
 356 the Pre-xml script. For the filter document, the <bookinfo> element is removed
 357 so that no title page is generated. For the main specification, the only change
 358 is to insert line breakpoints via -optbreak.
 359
 360 The SDoP program is then used to create PostScript output directly from the XML
 361 input. Then the ps2pdf command is used to generate a PDF from the PostScript.
 362 There are no external stylesheets that are used by SDoP. Any variations to the
 363 default format are specified inline using "processing instructions".
 364
 365
 366 USING XMLTO AND FOP TO CREATE POSTSCRIPT AND PDF
 367
 368 This is the original way of creating PostScript and PDF output. The processing
 369 happens in three stages, with an additional fourth stage for PDF. First, the
 370 XML is pre-processed by the Pre-xml script. For the filter document, the
 371 <bookinfo> element is removed so that no title page is generated. For the main
 372 specification, the only change is to insert line breakpoints via -optbreak.
 373
 374 Second, the xmlto command is used to produce a "formatted objects" (.fo) file.
 375 This process uses the following stylesheets:
 376
 377   (1) Either MyStyle-filter-fo.xsl or MyStyle-spec-fo.xsl
 378   (2) MyStyle-fo.xsl
 379   (3) MyStyle.xsl
 380   (4) MyTitleStyle.xsl
 381
 382 The last of these is not used for the filter document, which does not have a
 383 title page. The first three stylesheets were created manually, either by typing
 384 directly, or by copying from the standard style sheet and editing.
 385
 386 The final stylesheet has to be created from a template document, which is
 387 called MyTitlepage.templates.xml. This was copied from the standard styles and
 388 modified. The template is processed with xsltproc to produce the stylesheet.
 389 All this apparatus is appallingly heavyweight. The processing is also very slow
 390 in the case of the specification document. However, there should be no errors.
 391
 392 The reference book that saved my life while I was trying to get all this to
 393 work is "DocBook XSL, The Complete Guide", third edition (2005), by Bob
 394 Stayton, published by Sagehill Enterprises.
 395
 396 In the third part of the processing, the .fo file that is produced by the xmlto
 397 command is processed by the fop command to generate either PostScript or PDF.
 398 This is also very slow, and you get a whole slew of errors, of which these are
 399 a sample:
 400
 401   [ERROR] property - "background-position-horizontal" is not implemented yet.
 402
 403   [ERROR] property - "background-position-vertical" is not implemented yet.
 404
 405   [INFO] JAI support was not installed (read: not present at build time).
 406     Trying to use Jimi instead
 407     Error creating background image: Error creating FopImage object (Error
 408     creating FopImage object
 409     (http://docbook.sourceforge.net/release/images/draft.png) :
 410     org.apache.fop.image.JimiImage
 411
 412   [WARNING] table-layout=auto is not supported, using fixed!
 413
 414   [ERROR] Unknown enumerated value for property 'span': inherit
 415
 416   [ERROR] Error in span property value 'inherit':
 417     org.apache.fop.fo.expr.PropertyException: No conversion defined
 418
 419   [ERROR] Areas pending, text probably lost in lineinclude parts matched in the
 420     response by response_pattern by means of numeric variables such as
 421
 422 The last one is particularly meaningless gobbledegook. Some of the errors and
 423 warnings are repeated many times. Nevertheless, it does eventually produce
 424 usable output, though I have a number of issues with it (see a later section of
 425 this document). Maybe one day there will be a new release of fop that does
 426 better. In the meantime, I have written my own program for making PostScript
 427 output -- see the previous section -- because the problems with xmlto/fop were
 428 sufficiently annoying.
 429
 430 The PDF file that is produced by this process has one problem: the pages, as
 431 shown by acroread in its thumbnail display, are numbered sequentially from one
 432 to the end. Those numbers do not correspond with the page numbers of the body
 433 of the document, which makes finding a page from the index awkward. There is a
 434 facility in the PDF format to give pages appropriate "labels", but I cannot
 435 find a way of persuading fop to generate these. Fortunately, it is possible to
 436 fix up the PDF to add page labels. I wrote a script called PageLabelPDF which
 437 does this. They are shown correctly by acroread and xpdf, but not by
 438 GhostScript (gv).
 439
 440
 441 THE PAGELABELPDF SCRIPT
 442
 443 This script reads the standard input and writes the standard output. It is used
 444 to "tidy up" the PDF output that is produced by fop. It is not needed when
 445 PDF output is generated from SDoP's output using ps2pdf.
 446
 447 The PageLabelPDF script searches for the PDF object that sets data in its
 448 "Catalog", and adds appropriate information about page labels. The number of
 449 front-matter pages (those before chapter 1) is hard-wired into this script as
 450 12 because I could not find a way of determining it automatically. As the
 451 current table of contents finishes near the top of the 11th page, there is
 452 plenty of room for expansion, so it is unlikely to be a problem.
 453
 454 Having added data to the PDF file, the script then finds the xref table at the
 455 end of the file, and adjusts its entries to allow for the added text. This
 456 simple processing seems to be enough to generate a new, valid, PDF file.
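
The two steps can be sketched in Python as follows. This is not the
PageLabelPDF script; the function names are hypothetical. The label styles are
an assumption (lower-case roman numerals for the front matter, then decimal
numbers restarting at 1), since the text above does not say which styles the
script uses. The sketch handles only a classic cross-reference table held in a
byte string.

```python
import re

FRONT_MATTER_PAGES = 12  # hard-wired value mentioned in the text above

# An in-use xref entry: 10-digit byte offset, 5-digit generation, then "n".
XREF_ENTRY = re.compile(rb"(\d{10}) (\d{5}) n")


def page_labels_entry(front_matter_pages):
    """Build a /PageLabels entry for the Catalog (label styles are assumed)."""
    return (
        "/PageLabels << /Nums [ 0 << /S /r >> "
        + str(front_matter_pages)
        + " << /S /D >> ] >>"
    )


def shift_xref_entries(xref_table, insert_at, added_len):
    """Add added_len to the offset of every in-use object at or after insert_at."""
    def shift(match):
        offset = int(match.group(1))
        if offset >= insert_at:
            offset += added_len
        return b"%010d %s n" % (offset, match.group(2))
    return XREF_ENTRY.sub(shift, xref_table)
```

A complete tool would also have to update the startxref value near the end of
the file, because the xref table itself moves when text is inserted before it.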
 457
 458
 459 CREATING HTML
 460
 461 Only two stages are needed to produce HTML, but the main specification is
 462 subsequently postprocessed. The Pre-xml script is called with the -optbreak and
 463 -oneindex options to preprocess the XML. Then the xmlto command creates the
 464 HTML output directly. For the specification document, a directory of files is
 465 created, whereas the filter document is output as a single HTML page. The
 466 following stylesheets are used:
 467
 468   (1) Either MyStyle-chunk-html.xsl or MyStyle-nochunk-html.xsl
 469   (2) MyStyle-html.xsl
 470   (3) MyStyle.xsl
 471
 472 The first stylesheet references the chunking or non-chunking standard DocBook
 473 stylesheet, as appropriate.
 474
 475 You may see a number of these errors when creating HTML: "Revisionflag on
 476 unexpected element: literallayout (Assuming block)". They seem to be harmless;
 477 the output appears to be what is intended.
 478
 479 The original HTML that I produced from the SGCAL input had hyperlinks back from
 480 chapter and section titles to the table of contents. These links are not
 481 generated by xmlto. One of the testers pointed out that the lack of these
 482 links, or simple self-referencing links for titles, makes it harder to copy a
 483 link name into, for example, a mailing list response.
 484
 485 I could not find where to fiddle with the stylesheets to make such a change, if
 486 indeed the stylesheets are capable of it. Instead, I wrote a Perl script called
 487 TidyHTML-spec to do the job for the specification document. It updates the
 488 index.html file (which contains the table of contents), setting up anchors,
 489 and then updates all the chapter files to insert appropriate links.
 490
 491 The index.html file as built by xmlto contains the whole table of contents in a
 492 single line, which makes it hard to debug by hand. Since I was postprocessing
 493 it anyway, I arranged to insert newlines after every '>' character.
 494
 495 The TidyHTML-spec script also processes every HTML file, to tidy up some of the
 496 untidy features therein. It turns <div class="literallayout"><p> into <div
 497 class="literallayout"> and a matching </p></div> into </div> to get rid of
 498 unwanted vertical white space in literallayout blocks. Before each occurrence
 499 of </td> it inserts &nbsp; so that the table's cell is a little bit wider than
 500 the text itself.
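
The clean-up rules in the last two paragraphs can be sketched in Python. This
is an illustration rather than the TidyHTML-spec script (which is written in
Perl); the function names are hypothetical, and the sketch assumes that
literallayout blocks do not contain nested <div> elements.

```python
import re

# A literallayout block whose content is wrapped in a redundant <p>...</p>.
LITERAL_BLOCK = re.compile(
    r'<div class="literallayout"><p>(.*?)</p></div>', re.DOTALL
)


def tidy_html(html):
    """Drop the <p> wrapper in literallayout blocks and pad table cells."""
    html = LITERAL_BLOCK.sub(r'<div class="literallayout">\1</div>', html)
    # A trailing hard space makes each cell slightly wider than its text.
    return html.replace("</td>", "&nbsp;</td>")


def split_after_tags(single_line):
    """Insert a newline after every '>' so the file can be debugged by hand."""
    return single_line.replace(">", ">\n")
```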
 501
 502 The TidyHTML-spec script also takes the opportunity to postprocess the
 503 spec_html/ix01.html file, which contains the document index. Again, the index
 504 is generated as one single line, so it splits it up. Then it creates a list of
 505 letters at the top of the index and hyperlinks them both ways from the
 506 different letter portions of the index.
 507
 508 People wanted similar postprocessing for the filter.html file, so that is now
 509 done using a similar script called TidyHTML-filter. It was easier to use a
 510 separate script because filter.html is a single file rather than a directory,
 511 so the logic is somewhat different.
 512
 513
 514 CREATING TEXT FILES
 515
 516 This happens in four stages. The Pre-xml script is called with the -ascii,
 517 -optbreak, and -noindex options to convert the input to ASCII characters,
 518 insert line break points, and disable the production of an index. Then the
 519 xmlto command converts the XML to a single HTML document, using these
 520 stylesheets:
 521
 522   (1) MyStyle-txt-html.xsl
 523   (2) MyStyle-html.xsl
 524   (3) MyStyle.xsl
 525
 526 The MyStyle-txt-html.xsl stylesheet is the same as MyStyle-nochunk-html.xsl,
 527 except that it contains an additional item to ensure that a generated "copyright"
 528 symbol is output as "(c)" rather than the Unicode character. This is necessary
 529 because the stylesheet itself generates a copyright symbol as part of the
 530 document title; the character is not in the original input.
 531
 532 The w3m command is used with the -dump option to turn the HTML file into ASCII
 533 text, but this contains multiple sequences of blank lines that make it look
 534 awkward. Furthermore, chapter and section titles do not stand out very well. A
 535 local Perl script called Tidytxt is used to post-process the output. First, it
 536 converts sequences of blank lines into a single blank line. Then it searches
 537 for chapter and section headings. Each chapter heading is uppercased, and
 538 preceded by an extra two blank lines and a line of equals characters. An extra
 539 newline is inserted before each section heading, and they are underlined with
 540 hyphens.
 541
 542 The output of xmlto also contains non-ASCII Unicode characters that w3m passes
 543 through. Fortunately, they are few, and Tidytxt cleans them up as well. Some
 544 headings use "box drawing" characters in the range U+2500 to U+253F which are
 545 translated into -+| as appropriate, and U+00A0 (hard space) and U+25CF (bullet)
 546 are translated into plain spaces and asterisks. (It might be possible to do all
 547 this in the same way as I dealt with copyright - see above - but adding a few
 548 lines of Perl to an existing script was a lot easier.)
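
The blank-line and character clean-up can be sketched in Python. This is not
the Tidytxt script; the function names are hypothetical, and the choice of
which box drawing characters become "-", "|", or "+" is an assumption based on
their shapes (horizontal lines, vertical lines, and corners or junctions). The
heading treatment is omitted because the text does not say how headings are
recognized.

```python
HORIZONTAL = set("\u2500\u2501\u2504\u2505\u2508\u2509")
VERTICAL = set("\u2502\u2503\u2506\u2507\u250a\u250b")


def ascii_for(ch):
    """Map one character to its ASCII stand-in; other characters pass through."""
    if ch in HORIZONTAL:
        return "-"
    if ch in VERTICAL:
        return "|"
    if "\u250c" <= ch <= "\u253f":   # corners and junctions
        return "+"
    if ch == "\u00a0":               # hard space
        return " "
    if ch == "\u25cf":               # bullet
        return "*"
    return ch


def collapse_blank_lines(text):
    """Reduce every run of blank lines to a single blank line."""
    out = []
    for line in text.splitlines():
        blank = line.strip() == ""
        if blank and out and out[-1] == "":
            continue
        out.append("" if blank else line)
    return "\n".join(out) + "\n"


def tidy_text(text):
    """Translate the non-ASCII characters, then collapse blank lines."""
    return collapse_blank_lines("".join(ascii_for(ch) for ch in text))
```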
 549
 550
 551 CREATING INFO FILES
 552
 553 This process starts with the same Pre-xml call as for text files. Non-ASCII
 554 characters in the source are transliterated, and the <index> elements are
 555 removed. The docbook2texi script is then called to convert the XML file into a
 556 Texinfo file. However, this is not quite enough. The converted file ends up
 557 with "conceptindex" and "optionindex" items, which are not recognized by the
 558 makeinfo command. These have to be changed to "cindex" and "findex"
 559 respectively in the final .texinfo file. Furthermore, the main menu lacks a
 560 pointer to the index, and indeed the index node itself is missing. These
 561 problems are fixed by running the file through a script called TidyInfo.
 562 Finally, a call of makeinfo creates a .info file.
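
The renaming step can be sketched in Python. This is not the TidyInfo script;
the function name is hypothetical, and the sketch assumes that the two names
appear as whole words (for example as "@conceptindex") in the generated file.
Adding the missing index node and menu entry is not covered.

```python
import re

RENAMES = {"conceptindex": "cindex", "optionindex": "findex"}
INDEX_WORD = re.compile(r"\b(conceptindex|optionindex)\b")


def fix_index_commands(texinfo):
    """Rename the index items that makeinfo does not recognize."""
    return INDEX_WORD.sub(lambda m: RENAMES[m.group(1)], texinfo)
```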
 563
 564 There is one apparently unconfigurable feature of docbook2texi: it does not
 565 seem possible to give it a file name for its output. It chooses a name based on
 566 the title of the document. Thus, the main specification ends up in a file
 567 called the_exim_mta.texi and the filter document in exim_filtering.texi. These
 568 files are removed after their contents have been copied and modified by the
 569 TidyInfo script, which writes to a .texinfo file.
 570
 571
 572 CREATING THE MAN PAGE
 573
 574 I wrote a Perl script called x2man to create the exim.8 man page from the
 575 DocBook XML source. I deliberately did NOT start from the xfpt source,
 576 because it is the DocBook source that is the "standard". This comment line in
 577 the DocBook source marks the start of the command line options:
 578
 579   <!-- === Start of command line options === -->
 580
 581 A similar line marks the end. If at some time in the future a method other
 582 than xfpt is used to maintain the DocBook source, it needs to be capable of
 583 maintaining these comments.
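
Locating the options list between the two comments can be sketched in Python.
This is not the x2man script; the function name is hypothetical, and the exact
wording of the end marker is an assumption, because only the start marker is
quoted above.

```python
START_MARK = "<!-- === Start of command line options === -->"
# Assumed wording: the text only says that "a similar line" marks the end.
END_MARK = "<!-- === End of command line options === -->"


def extract_options_block(xml_text):
    """Return the XML between the two marker comments.

    Raises ValueError if either marker is missing.
    """
    start = xml_text.index(START_MARK) + len(START_MARK)
    end = xml_text.index(END_MARK, start)
    return xml_text[start:end]
```

Failing loudly when a marker is missing makes it obvious if a future change to
the way the XML is maintained drops the comments.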
 584
 585
 586 UNRESOLVED PROBLEMS
 587
 588 There are a number of unresolved problems with producing the Exim documentation
 589 in the manner described above. I will describe them here in the hope that in
 590 future some way round them can be found. Some of the problems are solved by
 591 using SDoP instead of xmlto/fop to produce PostScript and PDF output.
 592
 593 (1)  When a whole chain of tools is processing a file, an error somewhere
 594      in the middle is often very hard to debug. For instance, an error in the
 595      xfpt file might not show up until an XML processor throws a wobbly because
 596      the generated XML is bad. You have to be able to read XML and figure out
 597      what generated what. One of the reasons for creating the "test" series of
 598      targets was to help in checking out these kinds of problem.
 599
 600 (2)  There is a mechanism in XML for marking parts of the document as
 601      "revised", and I have arranged for xfpt markup to use it. However, the
 602      only xmlto output format that pays attention to this is the HTML output,
 603      which sets a green background. If xmlto/fop is used to generate PostScript
 604      and PDF, there are no revision marks (change bars). This problem
 605      is not present when SDoP is used. However, the text and Texinfo output
 606      formats lack revision indications.
 607
 608 (3)  The index entries in the HTML format take you to the top of the section
 609      that is referenced, instead of to the point in the section where the index
 610      marker was set.
 611
 612 (4)  The HTML output supports only a single index, so the variable, options,
 613      and concept index entries have to be merged.
 614
 615 (5)  The index for the PostScript/PDF output created by xmlto/fop does not
 616      merge identical page numbers, which makes some entries look ugly. This is
 617      not a problem when SDoP is used.
 618
 619 (6)  The HTML index and the PostScript/PDF indexes, when made with xmlto/fop,
 620      make no use of textual markup; the text is all roman, without any italic
 621      or boldface. For PostScript/PDF, this is not a problem when SDoP is used.
 622
 623 (7)  I turned off hyphenation in the PostScript/PDF output produced by
 624      xmlto/fop, because it was being done so badly. Needless to say, I made
 625      SDoP do a better job. These comments apply to xmlto/fop:
 626
 627      (a) It seems to force hyphenation if it is at all possible, without
 628          regard to the "tightness" or "looseness" of the line. Decent
 629          formatting software should attempt hyphenation only if the line is
 630          over some "looseness" threshold; otherwise you get far too many
 631          hyphenations, often for several lines in succession.
 632
 633      (b) It uses an algorithmic form of hyphenation that doesn't always produce
 634          acceptable word breaks. (I prefer to use a hyphenation dictionary,
 635          which is what SDoP does.)
 636
 637 (8)  The PostScript/PDF output produced by xmlto/fop is badly paginated:
 638
 639      (a) There seems to be no attempt to avoid "widow" and "orphan" lines on
 640          pages. A "widow" is the last line of a paragraph at the top of a page,
 641          and an "orphan" is the first line of a paragraph at the bottom of a
 642          page.
 643
 644      (b) There seems to be no attempt to prevent section headings being placed
 645          last on a page, with no following text on the page.
 646
 647      Neither of these problems occurs when SDoP is used to produce the
 648      PostScript/PDF output.
 649
 650 (9)  The fop processor does not support "fi" ligatures, not even if you put the
 651      appropriate Unicode character into the source by hand. Again, this is not
 652      a problem if SDoP is used.
 653
 654 (10) There are no diagrams in the new documentation. This is something I hope
 655      to work on. The previously used Aspic command for creating line art from a
 656      textual description can output Encapsulated PostScript or Scalable
 657      Vector Graphics, which are two standard diagram representations. Aspic
 658      could be formally released and used to generate output that could be
 659      included in at least some of the output formats.
 660
 661 (11) The use of a "zero-width space" works well as a way of specifying that
 662      Exim option names can be split, without hyphens, over line breaks.
 663
 664      However, when xmlto/fop is being used and an option is not split, if the
 665      line is very "loose", the zero-width space is expanded, along with other
 666      spaces. This is a totally crazy thing to do, but unfortunately it is
 667      suggested by the Unicode definition of the zero-width space, which says
 668      "its presence between two characters does not prevent increased letter
 669      spacing in justification". It seems that the implementors of fop have
 670      understood "letter spacing" also to include "word spacing". Sigh.
 671
 672      This problem does not arise when SDoP is used.
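
As an illustration of the zero-width space technique described in (11), here is
a minimal Python sketch. It is not taken from xfpt, fop, or SDoP: the function
names, and the rule of allowing a break after each underscore, are assumptions
made purely for the example.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def add_break_points(option_name):
    """Mark each underscore as a permitted break point (assumed rule)."""
    return option_name.replace("_", "_" + ZWSP)


def break_line(marked, width):
    """Split a marked name at zero-width spaces, adding no hyphens.

    Pieces are packed greedily so that each output line is at most
    `width` characters, unless a single piece is itself longer.
    """
    lines = []
    current = ""
    for piece in marked.split(ZWSP):
        if current and len(current) + len(piece) > width:
            lines.append(current)
            current = piece
        else:
            current += piece
    if current:
        lines.append(current)
    return lines


marked = add_break_points("smtp_accept_max_per_host")
print(break_line(marked, 12))
```

With a width of 12, the name is split after an underscore into "smtp_accept_"
and "max_per_host". A formatter that follows this scheme should treat the
marker as a break opportunity only, and not as a space that can be stretched
when a line is justified.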
 673
 674 The consequence of (7), (8), and (9) is that the PostScript/PDF output as
 675 produced by xmlto/fop looks as if it comes from some of the very early attempts
 676 at text formatting of around 20 years ago. We can only hope that 20 years'
 677 progress is not going to get lost, and that things will improve in this area.
 678 My small contribution to this has been to write SDoP, which, though simple and
 679 "non-standard", does get some of these formatting issues right.
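
The "looseness" threshold mentioned in (7)(a) can be sketched in Python as
follows. This is only an illustration of the idea, not SDoP's actual
algorithm: the looseness measure (extra width per inter-word gap, counted in
characters) and the default threshold value are assumptions.

```python
def looseness(words, line_width):
    """Extra width each inter-word gap must absorb to justify the line."""
    gaps = len(words) - 1
    natural = sum(len(w) for w in words) + gaps  # one space per gap
    if gaps == 0:
        return float(line_width - natural)
    return (line_width - natural) / gaps


def should_try_hyphenation(words, line_width, threshold=1.0):
    """Attempt hyphenation only when the line is looser than the threshold."""
    return looseness(words, line_width) > threshold


line = ["the", "quick", "brown", "fox"]
print(should_try_hyphenation(line, 22))  # slightly loose line
print(should_try_hyphenation(line, 30))  # very loose line
```

Under these assumptions the 22-column line is left alone, while the 30-column
line is loose enough for the formatter to try pulling part of the next word
back by hyphenating it.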
 680
 681
 682 LIST OF FILES
 683
 684 Markup.txt                     Describes the xfpt markup that is used
 685 HowItWorks.txt                 This document
 686 Makefile                       The makefile
 687 MyStyle-chunk-html.xsl         Stylesheet for chunked HTML output
 688 MyStyle-filter-fo.xsl          Stylesheet for filter fo output
 689 MyStyle-fo.xsl                 Stylesheet for any fo output
 690 MyStyle-html.xsl               Stylesheet for any HTML output
 691 MyStyle-nochunk-html.xsl       Stylesheet for non-chunked HTML output
 692 MyStyle-spec-fo.xsl            Stylesheet for spec fo output
 693 MyStyle-txt-html.xsl           Stylesheet for HTML=>text output
 694 MyStyle.xsl                    Stylesheet for all output
 695 MyTitleStyle.xsl               Stylesheet for spec title page
 696 MyTitlepage.templates.xml      Template for creating MyTitleStyle.xsl
 697 Myhtml.css                     Experimental css stylesheet for HTML output
 698 PageLabelPDF                   Script to postprocess xmlto/fop PDF output
 699 Pre-xml                        Script to preprocess XML
 700 TidyHTML-filter                Script to tidy up the filter HTML output
 701 TidyHTML-spec                  Script to tidy up the spec HTML output
 702 TidyInfo                       Script to fix index problems in Texinfo output
 703 Tidytxt                        Script to compact multiple blank lines
 704 filter.xfpt                    xfpt source of the filter document
 705 spec.xfpt                      xfpt source of the specification document
 706 x2man                          Script to make the Exim man page from the XML
 707
 708
 709 (Originally, and for the most part: Philip Hazel)
 710 The Exim Maintainers
 711 Last updated: 5 July 2010