$Cambridge: exim/doc/doc-docbook/HowItWorks.txt,v 1.2 2005/11/10 12:30:13 ph10 Exp $

CREATING THE EXIM DOCUMENTATION

"You are lost in a maze of twisty little scripts."


This document describes how the various versions of the Exim documentation, in
different output formats, are created from DocBook XML, and also how the
DocBook XML is itself created.


BACKGROUND: THE OLD WAY

From the start of Exim, in 1995, the specification was written in a local text
formatting system known as SGCAL. This is capable of producing PostScript and
plain text output from the same source file. Later, when the "ps2pdf" command
became available with GhostScript, that was used to create a PDF version from
the PostScript. (A few earlier versions were created by a helpful user who had
bought the Adobe distiller software.)

A demand for a version in "info" format led me to write a Perl script that
converted the SGCAL input into a Texinfo file. Because of the somewhat
restrictive requirements of Texinfo, this script has always needed a lot of
maintenance, and has never been 100% satisfactory.

The HTML version of the documentation was originally produced from the Texinfo
version, but later I wrote another Perl script that produced it directly from
the SGCAL input, which made it possible to produce better HTML.

There were a small number of diagrams in the documentation. For the PostScript
and PDF versions, these were created using Aspic, a local text-driven drawing
program that interfaces directly to SGCAL. For the text and texinfo versions,
alternative ascii-art diagrams were used. For the HTML version, screen shots of
the PostScript output were turned into gifs.


A MORE STANDARD APPROACH

Although in principle SGCAL and Aspic could be generally released, they would
be unlikely to receive much (if any) maintenance, especially after I retire.
Furthermore, the old production method was only semi-automatic; I still did a
certain amount of hand tweaking of spec.txt, for example. As the maintenance of
Exim itself was being opened up to a larger group of people, it seemed sensible
to move to a more standard way of producing the documentation, preferably fully
automated. However, we wanted to use only non-commercial software to do this.

At the time I was thinking about converting (early 2005), the "obvious"
standard format in which to keep the documentation was DocBook XML. The use of
XML in general, in many different applications, was increasing rapidly, and it
seemed likely to remain a standard for some time to come. DocBook offered a
particular form of XML suited to documents that were effectively "books".

Maintaining an XML document by hand editing is a tedious, verbose, and
error-prone process. A number of specialized XML text editors were available,
but all the free ones were at a very primitive stage. I therefore decided to
keep the master source in AsciiDoc format (described below), from which a
secondary XML master could be automatically generated.

All the output formats are generated from the XML file. If, in the future, a
better way of maintaining the XML source becomes available, this can be adopted
without changing any of the processing that produces the output documents.
Equally, if better ways of processing the XML become available, they can be
adopted without affecting the source maintenance.

A number of issues arose while setting this all up, which are best summed up by
the statement that a lot of the technology is (in 2005) still very immature. It
is probable that trying to do this conversion any earlier would not have been
anywhere near as successful. The main problems that still bother me are
described in the penultimate section of this document.

The following sections describe the processes by which the AsciiDoc files are
transformed into the final output documents. In practice, the details are coded
into a makefile that specifies the chain of commands for each output format.


REQUIRED SOFTWARE

Installing software to process XML puts lots and lots of stuff on your box. I
run Gentoo Linux, and a lot of things have been installed as dependencies that
I am not fully aware of. This is what I know about (version numbers are current
at the time of writing):

. AsciiDoc 6.0.3

  This converts the master source file into a DocBook XML file, using a
  customized AsciiDoc configuration file.

. xmlto 0.0.18

  This is a shell script that drives various XML processors. It is used to
  produce "formatted objects" for PostScript and PDF output, and to produce
  HTML output. It uses xsltproc, libxml, libxslt, libexslt, and possibly other
  things that I have not figured out, to apply the DocBook XSLT stylesheets.

. libxml 1.8.17
  libxml2 2.6.17
  libxslt 1.1.12

  These are all installed on my box; I do not know which of libxml or libxml2
  the various scripts are actually using.

. xsl-stylesheets-1.66.1

  These are the standard DocBook XSL stylesheets.

. fop 0.20.5

  FOP is a processor for "formatted objects". It is written in Java. The fop
  command is a shell script that drives it.

. w3m 0.5.1

  This is a text-oriented web browser. It is used to produce the Ascii form of
  the Exim documentation from a specially-created HTML format. It seems to do a
  better job than lynx.

. docbook2texi (part of docbook2X 0.8.5)

  This is a wrapper script for a two-stage conversion process from DocBook to a
  Texinfo file. It uses db2x_xsltproc and db2x_texixml. Unfortunately, there
  are two versions of this command; the old one is based on an earlier fork of
  docbook2X and does not work.

. db2x_xsltproc and db2x_texixml (part of docbook2X 0.8.5)

  More wrapper scripts (see previous item).

. makeinfo 4.8

  This is used to make a set of "info" files from a Texinfo file.

In addition, there are some locally written Perl scripts. These are described
below.


ASCIIDOC

AsciiDoc (http://www.methods.co.nz/asciidoc/) is a Python script that converts
an input document in a more-or-less human-readable format into DocBook XML.
For a document as complex as the Exim specification, the markup is quite
complex - probably no simpler than the original SGCAL markup - but it is
definitely easier to work with than XML itself.

AsciiDoc is highly configurable. It comes with a default configuration, but I
have extended this with an additional configuration file that must be used when
processing the Exim documents. There is a separate document called AdMarkup.txt
that describes the markup that is used in these documents. This includes the
default AsciiDoc markup and the local additions.

The author of AsciiDoc uses the extension .txt for input documents. I find
this confusing, especially as some of the output files have .txt extensions.
Therefore, I have used the extension .ascd for the sources.


THE MAKEFILE

The makefile supports a number of targets of the form x.y, where x is one of
"filter", "spec", or "test", and y is one of "xml", "fo", "ps", "pdf", "html",
"txt", or "info". The intermediate targets "x.xml" and "x.fo" are provided for
testing purposes. The other five targets are production targets. For example:

  make spec.pdf

This runs the necessary tools in order to create the file spec.pdf from the
original source spec.ascd. A number of intermediate files are created during
this process, including the master DocBook source, called spec.xml. Of course,
the usual features of "make" ensure that if this already exists and is
up-to-date, it is not needlessly rebuilt.

The "test" series of targets was created so that small tests could easily be
run fairly quickly, because processing even the shortish filter document takes
a bit of time, and processing the main specification takes ages.

Another target is "exim.8". This runs a locally written Perl script called
x2man, which extracts the list of command line options from the spec.xml file,
and creates a man page. There are some XML comments in the spec.xml file to
enable the script to find the start and end of the options list.

There is also a "clean" target that deletes all the generated files.


CREATING DOCBOOK XML FROM ASCIIDOC

There is a single local AsciiDoc configuration file called MyAsciidoc.conf.
Using this, one run of the asciidoc command creates a .xml file from a .ascd
file. When this succeeds, there is no output.


DOCBOOK PROCESSING

Processing a .xml file into the five different output formats is not entirely
straightforward. For a start, the same XML is not suitable for all the
different output styles. When the final output is in a text format (.txt,
.texinfo), for instance, all non-Ascii characters in the input must be
converted to Ascii transliterations, because the current processing tools do
not do this correctly automatically.

In order to cope with these issues in a flexible way, a Perl script called
Pre-xml was written. This is used to preprocess the .xml files before they are
handed to the main processors. Adding one more tool onto the front of the
processing chain does at least seem to be in the spirit of XML processing.

The XML processors themselves make use of style files, which can be overridden
by local versions. There is one that applies to all styles, called MyStyle.xsl,
and others for the different output formats. I have included comments in these
style files to explain what changes I have made. Some of the changes are quite
significant.


THE PRE-XML SCRIPT

The Pre-xml script copies a .xml file, making certain changes according to the
options it is given. The currently available options are as follows:

-abstract

  This option causes the <abstract> element to be removed from the XML. The
  source abuses the <abstract> element by using it to contain the author's
  address so that it appears on the title page verso in the printed renditions.
  This just gets in the way for the non-PostScript/PDF renditions.

-ascii

  This option is used for Ascii output formats. It makes the following
  character replacements:

    &8230;     =>  ...        (sic, no #x)
    &#x2019;   =>  '          apostrophe
    &#x201C;   =>  "          opening double quote
    &#x201D;   =>  "          closing double quote
    &#x2013;   =>  -          en dash
    &#x2020;   =>  *          dagger
    &#x2021;   =>  **         double dagger
    &#x00a0;   =>  a space    hard space
    &#x00a9;   =>  (c)        copyright

  In addition, this option causes quotes to be put round <literal> text items,
  and <quote> and </quote> to be replaced by Ascii quote marks. You would think
  the stylesheet would cope with the latter, but it seems to generate non-Ascii
  characters that w3m then turns into question marks.
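
  The replacement table above amounts to a simple entity-for-text
  substitution. As an illustration only (the real Pre-xml is a Perl script;
  this Python sketch covers just the entities listed above):

```python
# Illustrative Python equivalent of Pre-xml's -ascii entity replacement.
# The mapping is just the subset of replacements listed above; the real
# script is Perl and does more (quoting of <literal> and <quote> items).
ASCII_MAP = {
    "&8230;":   "...",   # ellipsis (sic, no #x in the source)
    "&#x2019;": "'",     # apostrophe
    "&#x201C;": '"',     # opening double quote
    "&#x201D;": '"',     # closing double quote
    "&#x2013;": "-",     # en dash
    "&#x2020;": "*",     # dagger
    "&#x2021;": "**",    # double dagger
    "&#x00a0;": " ",     # hard space
    "&#x00a9;": "(c)",   # copyright
}

def to_ascii(xml_text):
    """Replace non-Ascii entity references by Ascii transliterations."""
    for entity, replacement in ASCII_MAP.items():
        xml_text = xml_text.replace(entity, replacement)
    return xml_text
```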

-bookinfo

  This option causes the <bookinfo> element to be removed from the XML. It is
  used for the PostScript/PDF forms of the filter document, in order to avoid
  the generation of a full title page.

-fi

  Replace any occurrence of "fi" by the ligature &#xFB01; except when it is
  inside an XML element, or inside a <literal> part of the text.

  The use of ligatures would be nice for the PostScript and PDF formats. Sadly,
  it turns out that fop cannot at present handle the FB01 character correctly.
  The only format that does so is the HTML format, but when I used this in the
  test version, people complained that it made searching for words difficult.
  So at the moment, this option is not used. :-(
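
  The exclusion rule can be expressed by tokenizing the XML before
  substituting. A Python sketch, for illustration only (the real script is
  Perl, and this simplification assumes <literal> elements are unnested and
  attribute-free):

```python
import re

# Illustrative sketch of Pre-xml's -fi option: replace "fi" by the
# ligature U+FB01 except inside an XML tag or a <literal> element.
# Simplifying assumption: <literal> has no attributes and no nesting.
_TOKEN = re.compile(r"(<literal>.*?</literal>|<[^>]+>)", re.S)

def ligate_fi(xml_text):
    parts = _TOKEN.split(xml_text)
    # split() with one capture group alternates ordinary text (even
    # indexes) with tags/<literal> spans (odd indexes); only ordinary
    # text gets the ligature substitution.
    return "".join(
        part if i % 2 else part.replace("fi", "\ufb01")
        for i, part in enumerate(parts)
    )
```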

-noindex

  Remove the XML to generate a Concept Index and an Options index.

-oneindex

  Remove the XML to generate a Concept and an Options Index, and add XML to
  generate a single index.

The source document has two types of index entry, for a concept and an options
index. However, no index is required for the .txt and .texinfo outputs.
Furthermore, the only output processor that supports multiple indexes is the
processor that produces "formatted objects" for PostScript and PDF output. The
HTML processor ignores the XML settings for multiple indexes and just makes one
unified index. Specifying two indexes gets you two copies of the same index, so
this has to be changed.


CREATING POSTSCRIPT AND PDF

These two output formats are created in three stages. First, the XML is
pre-processed. For the filter document, the <bookinfo> element is removed so
that no title page is generated, but for the main specification, no changes are
currently made.

Second, the xmlto command is used to produce a "formatted objects" (.fo) file.
This process uses the following stylesheets:

(1) Either MyStyle-filter-fo.xsl or MyStyle-spec-fo.xsl
(2) MyStyle-fo.xsl
(3) MyStyle.xsl
(4) MyTitleStyle.xsl

The last of these is not used for the filter document, which does not have a
title page. The first three stylesheets were created manually, either by typing
directly, or by copying from the standard style sheet and editing.

The final stylesheet has to be created from a template document, which is
called MyTitlepage.templates.xml. This was copied from the standard styles and
modified. The template is processed with xsltproc to produce the stylesheet.
All this apparatus is appallingly heavyweight. The processing is also very slow
in the case of the specification document. However, there should be no errors.

In the third and final part of the processing, the .fo file that is produced by
the xmlto command is processed by the fop command to generate either PostScript
or PDF. This is also very slow, and you get a whole slew of errors, of which
these are a sample:

  [ERROR] property - "background-position-horizontal" is not implemented yet.

  [ERROR] property - "background-position-vertical" is not implemented yet.

  [INFO] JAI support was not installed (read: not present at build time).
  Trying to use Jimi instead
  Error creating background image: Error creating FopImage object (Error
  creating FopImage object
  (http://docbook.sourceforge.net/release/images/draft.png) :
  org.apache.fop.image.JimiImage

  [WARNING] table-layout=auto is not supported, using fixed!

  [ERROR] Unknown enumerated value for property 'span': inherit

  [ERROR] Error in span property value 'inherit':
  org.apache.fop.fo.expr.PropertyException: No conversion defined

  [ERROR] Areas pending, text probably lost in lineinclude parts matched in the
  response by response_pattern by means of numeric variables such as

The last one is particularly meaningless gobbledegook. Some of the errors and
warnings are repeated many times. Nevertheless, it does eventually produce
usable output, though I have a number of issues with it (see a later section of
this document). Maybe one day there will be a new release of fop that does
better. Maybe there will be some other means of producing PostScript and PDF
from DocBook XML. Maybe porcine aeronautics will really happen.


CREATING HTML

Only two stages are needed to produce HTML, but the main specification is
subsequently postprocessed. The Pre-xml script is called with the -abstract and
-oneindex options to preprocess the XML. Then the xmlto command creates the
HTML output directly. For the specification document, a directory of files is
created, whereas the filter document is output as a single HTML page. The
following stylesheets are used:

(1) Either MyStyle-chunk-html.xsl or MyStyle-nochunk-html.xsl
(2) MyStyle-html.xsl
(3) MyStyle.xsl

The first stylesheet references the chunking or non-chunking standard
stylesheet, as appropriate.

The original HTML that I produced from the SGCAL input had hyperlinks back from
chapter and section titles to the table of contents. These links are not
generated by xmlto. One of the testers pointed out that the lack of these
links, or simple self-referencing links for titles, makes it harder to copy a
link name into, for example, a mailing list response.

I could not find where to fiddle with the stylesheets to make such a change, if
indeed the stylesheets are capable of it. Instead, I wrote a Perl script called
TidyHTML-spec to do the job for the specification document. It updates the
index.html file (which contains the table of contents), setting up anchors,
and then updates all the chapter files to insert appropriate links.

The index.html file as built by xmlto contains the whole table of contents in a
single line, which makes it hard to debug by hand. Since I was postprocessing
it anyway, I arranged to insert newlines after every '>' character.

The TidyHTML-spec script also processes every HTML file, to tidy up some of the
untidy features therein. It turns <div class="literallayout"><p> into <div
class="literallayout"> and a matching </p></div> into </div> to get rid of
unwanted vertical white space in literallayout blocks. Before each occurrence
of </td> it inserts &nbsp; so that the table's cell is a little bit wider than
the text itself.
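
These two textual substitutions are simple enough to sketch. In Python, for
illustration (TidyHTML-spec itself is Perl, and this ignores the
anchor-insertion work described earlier):

```python
# Illustrative Python versions of two TidyHTML-spec clean-ups (the real
# script is Perl). Assumes the exact attribute spelling shown, as
# produced by xmlto.

def tidy_literallayout(html):
    """Drop the <p> wrapper inside literallayout divs, which causes
    unwanted vertical white space."""
    html = html.replace('<div class="literallayout"><p>',
                        '<div class="literallayout">')
    # Simplification: treat every </p></div> as closing a literallayout
    # block; the real script matches them up properly.
    return html.replace('</p></div>', '</div>')

def widen_cells(html):
    """Insert &nbsp; before each </td> so table cells are slightly
    wider than their text."""
    return html.replace('</td>', '&nbsp;</td>')
```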

The TidyHTML-spec script also takes the opportunity to postprocess the
spec.html/ix01.html file, which contains the document index. Again, the index
is generated as one single line, so it splits it up. Then it creates a list of
letters at the top of the index and hyperlinks them both ways from the
different letter portions of the index.

People wanted similar postprocessing for the filter.html file, so that is now
done using a similar script called TidyHTML-filter. It was easier to use a
separate script because filter.html is a single file rather than a directory,
so the logic is somewhat different.


CREATING TEXT FILES

This happens in four stages. The Pre-xml script is called with the -abstract,
-ascii and -noindex options to remove the <abstract> element, convert the input
to Ascii characters, and to disable the production of an index. Then the xmlto
command converts the XML to a single HTML document, using these stylesheets:

(1) MyStyle-txt-html.xsl
(2) MyStyle-html.xsl
(3) MyStyle.xsl

The MyStyle-txt-html.xsl stylesheet is the same as MyStyle-nochunk-html.xsl,
except that it contains an additional item to ensure that a generated
"copyright" symbol is output as "(c)" rather than the Unicode character. This
is necessary because the stylesheet itself generates a copyright symbol as part
of the document title; the character is not in the original input.

The w3m command is used with the -dump option to turn the HTML file into Ascii
text, but this contains multiple sequences of blank lines that make it look
awkward, so, finally, a local Perl script called Tidytxt is used to convert
sequences of blank lines into a single blank line.
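
The final step is a one-line transformation. A Python equivalent of what
Tidytxt does (the real script is Perl, and this sketch assumes blank lines are
truly empty, with no stray spaces):

```python
import re

# Collapse every run of two or more consecutive blank lines in the
# w3m -dump output into a single blank line. Assumes blank lines are
# empty (no trailing spaces); the real Tidytxt is a Perl script.

def tidy_txt(text):
    return re.sub(r"\n{3,}", "\n\n", text)
```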


CREATING INFO FILES

This process starts with the same Pre-xml call as for text files. The
<abstract> element is deleted, non-ascii characters in the source are
transliterated, and the <index> elements are removed. The docbook2texi script
is then called to convert the XML file into a Texinfo file. However, this is
not quite enough. The converted file ends up with "conceptindex" and
"optionindex" items, which are not recognized by the makeinfo command. An
in-line call to Perl in the Makefile changes these to "cindex" and "findex"
respectively in the final .texinfo file. Finally, a call of makeinfo creates a
set of .info files.
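
The in-line Perl call amounts to a pair of global substitutions, which can be
expressed in Python for illustration (the Makefile actually does this with
Perl):

```python
# The converted Texinfo uses "conceptindex" and "optionindex", which
# makeinfo does not recognize; rewrite them as the standard "cindex"
# and "findex" index commands. (Python shown for illustration only.)

def fix_texinfo_indexes(texinfo):
    return (texinfo.replace("conceptindex", "cindex")
                   .replace("optionindex", "findex"))
```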

There is one apparently unconfigurable feature of docbook2texi: it does not
seem possible to give it a file name for its output. It chooses a name based on
the title of the document. Thus, the main specification ends up in a file
called the_exim_mta.texi and the filter document in exim_filtering.texi. These
files are removed after their contents have been copied and modified by the
inline Perl call, which makes a .texinfo file.


CREATING THE MAN PAGE

I wrote a Perl script called x2man to create the exim.8 man page from the
DocBook XML source. I deliberately did NOT start from the AsciiDoc source,
because it is the DocBook source that is the "standard". This comment line in
the DocBook source marks the start of the command line options:

  <!-- === Start of command line options === -->

A similar line marks the end. If at some time in the future some way other
than AsciiDoc is used to maintain the DocBook source, it needs to be capable of
maintaining these comments.
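
The marker-based extraction can be sketched as follows. Note that the
end-marker text here is an assumption (only the start marker is shown above),
and x2man itself is a Perl script that goes on to build the man page from what
it finds:

```python
START_MARK = "<!-- === Start of command line options === -->"
# Assumed form of the end marker; the text above only says that "a
# similar line marks the end".
END_MARK = "<!-- === End of command line options === -->"

def extract_options(xml_text):
    """Return the XML between the start and end marker comments."""
    begin = xml_text.index(START_MARK) + len(START_MARK)
    end = xml_text.index(END_MARK, begin)
    return xml_text[begin:end]
```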


UNRESOLVED PROBLEMS

There are a number of unresolved problems with producing the Exim documentation
in the manner described above. I will describe them here in the hope that in
future some way round them can be found.

(1) Errors in the toolchain

    When a whole chain of tools is processing a file, an error somewhere in
    the middle is often very hard to debug. For instance, an error in the
    AsciiDoc might not show up until an XML processor throws a wobbly because
    the generated XML is bad. You have to be able to read XML and figure out
    what generated what. One of the reasons for creating the "test" series of
    targets was to help in checking out these kinds of problem.

(2) There is a mechanism in XML for marking parts of the document as
    "revised", and I have arranged for AsciiDoc markup to use it. However, at
    the moment, the only output format that pays attention to this is the HTML
    output, which sets a green background. There are therefore no revision
    marks (change bars) in the PostScript, PDF, or text output formats as
    there used to be. (There never were for Texinfo.)

(3) The index entries in the HTML format take you to the top of the section
    that is referenced, instead of to the point in the section where the index
    marker was set.

(4) The HTML output supports only a single index, so the concept and options
    index entries have to be merged.

(5) The index for the PostScript/PDF output does not merge identical page
    numbers, which makes some entries look ugly.

(6) None of the indexes (PostScript/PDF and HTML) make use of textual
    markup; the text is all roman, without any italic or boldface.

(7) I turned off hyphenation in the PostScript/PDF output, because it was
    being done so badly.

    (a) It seems to force hyphenation if it is at all possible, without
        regard to the "tightness" or "looseness" of the line. Decent
        formatting software should attempt hyphenation only if the line is
        over some "looseness" threshold; otherwise you get far too many
        hyphenations, often for several lines in succession.

    (b) It uses an algorithmic form of hyphenation that doesn't always produce
        acceptable word breaks. (I prefer to use a hyphenation dictionary.)

(8) The PostScript/PDF output is badly paginated:

    (a) There seems to be no attempt to avoid "widow" and "orphan" lines on
        pages. A "widow" is the last line of a paragraph at the top of a page,
        and an "orphan" is the first line of a paragraph at the bottom of a
        page.

    (b) There seems to be no attempt to prevent section headings being placed
        last on a page, with no following text on the page.

(9) The fop processor does not support "fi" ligatures, not even if you put the
    appropriate Unicode character into the source by hand.

(10) There are no diagrams in the new documentation. This is something I could
     work on. The previously-used Aspic command for creating line art from a
     textual description can output Encapsulated PostScript or Scalable Vector
     Graphics, which are two standard diagram representations. Aspic could be
     formally released and used to generate output that could be included in at
     least some of the output formats.

The consequence of (7), (8), and (9) is that the PostScript/PDF output looks as
if it comes from some of the very early attempts at text formatting of around
20 years ago. We can only hope that 20 years' progress is not going to get
lost, and that things will improve in this area.


LIST OF FILES

AdMarkup.txt               Describes the AsciiDoc markup that is used
HowItWorks.txt             This document
Makefile                   The makefile
MyAsciidoc.conf            Localized AsciiDoc configuration
MyStyle-chunk-html.xsl     Stylesheet for chunked HTML output
MyStyle-filter-fo.xsl      Stylesheet for filter fo output
MyStyle-fo.xsl             Stylesheet for any fo output
MyStyle-html.xsl           Stylesheet for any HTML output
MyStyle-nochunk-html.xsl   Stylesheet for non-chunked HTML output
MyStyle-spec-fo.xsl        Stylesheet for spec fo output
MyStyle-txt-html.xsl       Stylesheet for HTML=>text output
MyStyle.xsl                Stylesheet for all output
MyTitleStyle.xsl           Stylesheet for spec title page
MyTitlepage.templates.xml  Template for creating MyTitleStyle.xsl
Myhtml.css                 Experimental css stylesheet for HTML output
Pre-xml                    Script to preprocess XML
TidyHTML-filter            Script to tidy up the filter HTML output
TidyHTML-spec              Script to tidy up the spec HTML output
Tidytxt                    Script to compact multiple blank lines
filter.ascd                AsciiDoc source of the filter document
spec.ascd                  AsciiDoc source of the specification document
x2man                      Script to make the Exim man page from the XML

The file Myhtml.css was an experiment that was not followed through. It is
mentioned in a comment in MyStyle-html.xsl, but is not at present in use.


Philip Hazel
Last updated: 10 June 2005