[exim.git] / doc / doc-docbook / HowItWorks.txt

$Cambridge: exim/doc/doc-docbook/HowItWorks.txt,v 1.9 2008/02/04 17:28:44 fanf2 Exp $

CREATING THE EXIM DOCUMENTATION

"You are lost in a maze of twisty little scripts."


This document describes how the various versions of the Exim documentation, in
different output formats, are created from DocBook XML, and also how the
DocBook XML is itself created.


BACKGROUND: THE OLD WAY

From the start of Exim, in 1995, the specification was written in a local text
formatting system known as SGCAL. This is capable of producing PostScript and
plain text output from the same source file. Later, when the "ps2pdf" command
became available with GhostScript, that was used to create a PDF version from
the PostScript. (A few earlier versions were created by a helpful user who had
bought the Adobe distiller software.)

A demand for a version in "info" format led me to write a Perl script that
converted the SGCAL input into a Texinfo file. Because of the somewhat
restrictive requirements of Texinfo, this script always needed a lot of
maintenance, and was never totally satisfactory.

The HTML version of the documentation was originally produced from the Texinfo
version, but later I wrote another Perl script that produced it directly from
the SGCAL input, which made it possible to produce better HTML.

There were a small number of diagrams in the documentation. For the PostScript
and PDF versions, these were created using Aspic, a local text-driven drawing
program that interfaces directly to SGCAL. For the text and texinfo versions,
alternative ascii-art diagrams were used. For the HTML version, screen shots of
the PostScript output were turned into gifs.


A MORE STANDARD APPROACH

Although in principle SGCAL and Aspic could be generally released, they would
be unlikely to receive much (if any) maintenance, especially after I retire.
Furthermore, the old production method was only semi-automatic; I still did a
certain amount of hand tweaking of spec.txt, for example. As the maintenance of
Exim itself was being opened up to a larger group of people, it seemed sensible
to move to a more standard way of producing the documentation, preferable fully
automated. However, we wanted to use only non-commercial software to do this.

At the time I was thinking about converting (early 2005), the "obvious"
standard format in which to keep the documentation was DocBook XML. The use of
XML in general, in many different applications, was increasing rapidly, and it
seemed likely to remain a standard for some time to come. DocBook offered a
particular form of XML suited to documents that were effectively "books".

Maintaining an XML document by hand editing is a tedious, verbose, and
error-prone process. A number of specialized XML text editors were available,
but all the free ones were at a very primitive stage. I therefore decided to
keep the master source in AsciiDoc format, from which a secondary XML master
could be automatically generated.

The first "new" versions of the documents, for the 4.60 release, were generated
this way. However, there were a number of problems with using AsciiDoc for a
document as large and as complex as the Exim manual. As a result, I wrote a new
application called xfpt ("XML From Plain Text") which creates XML from a
relatively simple and consistent markup language. This application has been
released for general use, and the master sources for the Exim documentation are
now in xfpt format.

All the output formats are generated from the XML file. If, in the future, a
better way of maintaining the XML source becomes available, this can be adopted
without changing any of the processing that produces the output documents.
Equally, if better ways of processing the XML become available, they can be
adopted without affecting the source maintenance.

A number of issues arose while setting this all up, which are best summed up by
the statement that a lot of the technology was (in 2006) still very immature.
Trying to do this conversion any earlier would probably not have been anywhere
near as successful. The main issues that bother me in the XML-generated
documentation are described in the penultimate section of this document.

Initially, the major problems were in producing PostScript and PDF outputs. The
available free software for doing this was and still is (we are now in 2007)
cumbersome and slow, and does not support certain output features that I would
like. My response to this was, over a period of two years, to write an XML
processor called SDoP (Simple DocBook Processor). This program reads DocBook
XML and writes PostScript, without using any of the heavyweight apparatus that
is required for xmlto and fop (the previously used software).

An experimental first version of SDoP was used for the Exim 4.67
documentation. Subsequently SDoP was released for general use. SDoP's output
includes features that are missing when xmlto/fop is used, and it also runs
about 60 times faster. The main manual can be formatted in 2.5 seconds instead
of 2.5 minutes, which makes checking and fixing mistakes much easier.

The Makefile that is used to build the various forms of output will, for the
moment, support both ways of producing PostScript and PDF output, though the
default is now to use SDoP.

The following sections describe the processes by which the xfpt files are
transformed into the final output documents. In practice, the details are coded
into a Makefile that specifies the chain of commands for each output format.


REQUIRED SOFTWARE

Installing software to process XML puts lots and lots of stuff on your box. I
run Gentoo Linux, and a lot of things have been installed as dependencies that
I am not fully aware of. This is what I know about (version numbers are current
at the time of writing):

. xfpt 0.03

  This converts the master source file into a DocBook XML file.

. sdop 0.03

  This is my new DocBook-to-PostScript processor.

. ps2pdf

  This is a wrapper script that is part of the GhostScript distribution. It
  converts a PostScript file into a PDF file. It is used to process the output
  from SDoP. It is not required when xmlto/fop is being used to generate PDF
  output.

. xmlto 0.0.18

  This is a shell script that drives various XML processors. It is used to
  produce "formatted objects" when PostScript and PDF output is being generated
  using fop (the old way) rather than SDoP. It is always used to produce HTML
  output. It uses xsltproc, libxml, libxslt, libexslt, and possibly other
  things that I have not figured out, to apply the DocBook XSLT stylesheets.

. libxml 1.8.17
  libxml2 2.6.28
  libxslt 1.1.20

  These are all installed on my box; I do not know which of libxml or libxml2
  the various scripts are actually using.

. xsl-stylesheets-<version>

  These are the standard DocBook XSL stylesheets.

  The documents use http://docbook.sourceforge.net/release/xsl/current/ which
  should be mapped to an appropriate local path via the system catalogs.

. fop 0.93

  FOP is a processor for "formatted objects". It is written in Java. The fop
  command is a shell script that drives it. It required only if you do not
  want to use SDoP and ps2pdf to generate PostScript and PDF output.

. w3m 0.5.2

  This is a text-oriented web brower. It is used to produce the ASCII form of
  the Exim documentation (spec.txt) from a specially-created HTML format. It
  seems to do a better job than lynx.

. docbook2texi (part of docbook2X 0.8.5)

  This is a wrapper script for a two-stage conversion process from DocBook to a
  Texinfo file. It uses db2x_xsltproc and db2x_texixml. Unfortunately, there
  are two versions of this command; the old one is based on an earlier fork of
  docbook2X and does not work.

. db2x_xsltproc and db2x_texixml (part of docbook2X 0.8.5)

  More wrapping scripts (see previous item).

. makeinfo 4.8

  This is used to make an "info" file from a Texinfo file.

In addition, there are a number of locally written Perl scripts. These are
described below.


THE MAKEFILE

The makefile supports a number of targets of the form x.y, where x is one of
"filter", "spec", or "test", and y is one of "xml", "fo", "ps", "pdf", "html",
"txt", or "info". The intermediate targets "x.xml" and "x.fo" are provided for
testing purposes. The other five targets are production targets. For example:

  make spec.pdf

This runs the necessary tools in order to create the file spec.pdf from the
original source spec.xfpt. A number of intermediate files are created during
this process, including the master DocBook source, called spec.xml. Of course,
the usual features of "make" ensure that if this already exists and is
up-to-date, it is not needlessly rebuilt.

Because there are now two ways of creating the PostScript and PDF outputs,
there are two targets for each one. For example fop-spec.ps makes PostScript
using fop, and sdop-spec.ps makes it using SDoP. The generic targets spec.ps
and spec.pdf now point to the SDoP versions.

The "test" series of targets were created so that small tests could easily be
run fairly quickly, because processing even the shortish XML document takes
a bit of time, and processing the main specification takes ages -- except when
using SDoP for PostScript and PDF.

Another target is "exim.8". This runs a locally written Perl script called
x2man, which extracts the list of command line options from the spec.xml file,
and creates a man page. There are some XML comments in the spec.xml file to
enable the script to find the start and end of the options list.

There is also a "clean" target that deletes all the generated files.


CREATING DOCBOOK XML FROM XFPT INPUT

The small amount of local configuration for xfpt is included at the start of
the two .xfpt files; there are no separate local xfpt configuration files.
Running the xfpt command creates a .xml file from a .xfpt file. When this
succeeds, there is no output.


DOCBOOK PROCESSING

Processing a .xml file into the five different output formats is not entirely
straightforward. For a start, the same XML is not suitable for all the
different output styles. When the final output is in a text format (.txt,
.texinfo) for instance, all non-ASCII characters in the input must be converted
to ASCII transliterations because the current processing tools do not do this
correctly automatically.

In order to cope with these issues in a flexible way, a Perl script called
Pre-xml was written. This is used to preprocess the .xml files before they are
handed to the main processors. Adding one more tool onto the front of the
processing chain does at least seem to be in the spirit of XML processing.

The XML processors other than SDoP make use of style files, which can be
overridden by local versions. There is one that applies to all styles, called
MyStyle.xsl, and others for the different output formats. I have included
comments in these style files to explain what changes I have made. Some of the
changes are quite significant.


XSL INCLUDES

References to XSL paths should use the public URLs, such as:
  http://docbook.sourceforge.net/release/xsl/current/xhtml/docbook.xsl
If this fails to work for you, then there is a problem with your system
catalogs.  As a work-around, you can adjust the OS-Fixups script and then:
$ make os-fixup

As an example of how this should normally work, on a FreeBSD system the
resolution goes to /usr/local/share/xml/catalog which contains a directive:
  <nextCatalog catalog="/usr/local/share/xml/catalog.ports" />
to pull in the file automatically maintained by the Ports system.  That file
will contain:
  <delegateSystem
   systemIdStartString="http://docbook.sourceforge.net/release/xsl/"
   catalog="file:///usr/local/share/xsl/docbook/catalog" />
  <delegateURI
   uriStartString="http://docbook.sourceforge.net/release/xsl/"
   catalog="file:///usr/local/share/xsl/docbook/catalog" />
and that catalog file contains:
  <rewriteSystem
   systemIdStartString="http://docbook.sourceforge.net/release/xsl/current"
   rewritePrefix="file:///usr/local/share/xsl/docbook" />
  <rewriteURI
   uriStartString="http://docbook.sourceforge.net/release/xsl/current"
    rewritePrefix="file:///usr/local/share/xsl/docbook" />
and the full path is thus eventually arrived at.

See also the tools:
  xmlcatalog(1) from libxml2
  xmlcatmgr(1) for a lightweight tool written for the NetBSD Packages system.


THE PRE-XML SCRIPT

The Pre-xml script copies a .xml file, making certain changes according to the
options it is given. The currently available options are as follows:

-ascii

  This option is used for ASCII output formats. It makes the following
  character replacements:

    &#x2019;  =>  '         apostrophe
    &copy;    =>  (c)       copyright
    &dagger;  =>  *         dagger
    &Dagger;  =>  **        double dagger
    &nbsp;    =>  a space   hard space
    &ndash;   =>  -         en dash

  The apostrophe is specified numerically because that is what xfpt generates
  from an ASCII single quote character. Non-ASCII characters that are not in
  this list should not be used without thinking about how they might be
  converted for the ASCII formats.

  In addition to the character replacements, this option causes quotes to be
  put round <literal> text items, and <quote> and </quote> to be replaced by
  ASCII quote marks. You would think the stylesheet would cope with the latter,
  but it seems to generate non-ASCII characters that w3m then turns into
  question marks.

-bookinfo

  This option causes the <bookinfo> element to be removed from the XML. It is
  used for the PostScript/PDF forms of the filter document, in order to avoid
  the generation of a full title page.

-fi

  Replace any occurrence of "fi" by the ligature &#xFB01; except when it is
  inside an XML element, or inside a <literal> part of the text.

  The use of ligatures would be nice for the PostScript and PDF formats. Sadly,
  it turns out that fop cannot at present handle the FB01 character correctly.
  Happily this problem is now avoided when SDoP is used to generate PostScript
  (and thence PDF) because SDoP automatically uses an "fi" ligature for
  non-fixed-width fonts.

  The only xmlto format that handles FB01 is the HTML format, but when I used
  this in the test version, people complained that it made searching for words
  difficult. So this option is in practice not used at all.

-noindex

  Remove the XML to generate a Concept Index and an Options index. The source
  document has three types of index entry, for variables, options, and concept
  indexes. However, no index is required for the .txt and .texinfo outputs.

-oneindex

  Remove the XML to generate separate variables, options, and concept indexes,
  and add XML to generate a single index. The only output processors that
  support multiple indexes are SDoP and the processor that produces "formatted
  objects" for PostScript and PDF output for fop. The HTML processor ignores
  the XML settings for multiple indexes and just makes one unified index.
  Specifying three indexes gets you three copies of the same index, so this has
  to be changed.

-optbreak

  Look for items of the form <option>...</option> and <varname>...</varname> in
  ordinary paragraphs, and insert &#x200B; after each underscore in the
  enclosed text. The same is done for any word containing four or more upper
  case letters (compile-time options in the Exim specification). The character
  &#x200B; is a zero-width space. This means that the line may be split after
  one of these underscores, but no hyphen is inserted.


CREATING POSTSCRIPT AND PDF

These two output formats are created either by using my new SDoP program to
produce PostScript which can then be run through ps2pdf to make a PDF, or by
using xmlto and fop in the old way.


USING SDOP TO CREATE POSTSCRIPT AND PDF

PostScript output is created in two stages. First, the XML is pre-processed by
the Pre-xml script. For the filter document, the <bookinfo> element is removed
so that no title page is generated. For the main specification, the only change
is to insert line breakpoints via -optbreak.

The SDoP program is then used to create PostScript output directly from the XML
input. Then the ps2pdf command is used to generated a PDF from the PostScript.
There are no external stylesheets that are used by SDoP. Any variations to the
default format are specified inline using "processing instructions".


USING XMLTO AND FOP TO CREATE POSTSCRIPT AND PDF

This is the original way of creating PostScript and PDF output. The processing
happens in three stages, with an additional fourth stage for PDF. First, the
XML is pre-processed by the Pre-xml script. For the filter document, the
<bookinfo> element is removed so that no title page is generated. For the main
specification, the only change is to insert line breakpoints via -optbreak.

Second, the xmlto command is used to produce a "formatted objects" (.fo) file.
This process uses the following stylesheets:

  (1) Either MyStyle-filter-fo.xsl or MyStyle-spec-fo.xsl
  (2) MyStyle-fo.xsl
  (3) MyStyle.xsl
  (4) MyTitleStyle.xsl

The last of these is not used for the filter document, which does not have a
title page. The first three stylesheets were created manually, either by typing
directly, or by coping from the standard style sheet and editing.

The final stylesheet has to be created from a template document, which is
called MyTitlepage.templates.xml. This was copied from the standard styles and
modified. The template is processed with xsltproc to produce the stylesheet.
All this apparatus is appallingly heavyweight. The processing is also very slow
in the case of the specification document. However, there should be no errors.

The reference book that saved my life while I was trying to get all this to
work is "DocBook XSL, The Complete Guide", third edition (2005), by Bob
Stayton, published by Sagehill Enterprises.

In the third part of the processing, the .fo file that is produced by the xmlto
command is processed by the fop command to generate either PostScript or PDF.
This is also very slow, and you get a whole slew of errors, of which these are
a sample:

  [ERROR] property - "background-position-horizontal" is not implemented yet.

  [ERROR] property - "background-position-vertical" is not implemented yet.

  [INFO] JAI support was not installed (read: not present at build time).
    Trying to use Jimi instead
    Error creating background image: Error creating FopImage object (Error
    creating FopImage object
    (http://docbook.sourceforge.net/release/images/draft.png) :
    org.apache.fop.image.JimiImage

  [WARNING] table-layout=auto is not supported, using fixed!

  [ERROR] Unknown enumerated value for property 'span': inherit

  [ERROR] Error in span property value 'inherit':
    org.apache.fop.fo.expr.PropertyException: No conversion defined

  [ERROR] Areas pending, text probably lost in lineinclude parts matched in the
    response by response_pattern by means of numeric variables such as

The last one is particularly meaningless gobbledegook. Some of the errors and
warnings are repeated many times. Nevertheless, it does eventually produce
usable output, though I have a number of issues with it (see a later section of
this document). Maybe one day there will be a new release of fop that does
better. In the meantime, I have written my own program for making PostScript
output -- see the previous section -- because the problems with xmlto/fop were
sufficiently annoying.

The PDF file that is produced by this process has one problem: the pages, as
shown by acroread in its thumbnail display, are numbered sequentially from one
to the end. Those numbers do not correspond with the page numbers of the body
of the document, which makes finding a page from the index awkward. There is a
facility in the PDF format to give pages appropriate "labels", but I cannot
find a way of persuading fop to generate these. Fortunately, it is possibly to
fix up the PDF to add page labels. I wrote a script called PageLabelPDF which
does this. They are shown correctly by acroread and xpdf, but not by
GhostScript (gv).


THE PAGELABELPDF SCRIPT

This script reads the standard input and writes the standard output. It is used
to "tidy up" the PDF output that is produced by fop. It is not needed when
PDF output is generated from SDoP's output using ps2pdf.

The PageLabelPDF script searches for the PDF object that sets data in its
"Catalog", and adds appropriate information about page labels. The number of
front-matter pages (those before chapter 1) is hard-wired into this script as
12 because I could not find a way of determining it automatically. As the
current table of contents finishes near the top of the 11th page, there is
plenty of room for expansion, so it is unlikely to be a problem.

Having added data to the PDF file, the script then finds the xref table at the
end of the file, and adjusts its entries to allow for the added text. This
simple processing seems to be enough to generate a new, valid, PDF file.


CREATING HTML

Only two stages are needed to produce HTML, but the main specification is
subsequently postprocessed. The Pre-xml script is called with the -optbreak and
-oneindex options to preprocess the XML. Then the xmlto command creates the
HTML output directly. For the specification document, a directory of files is
created, whereas the filter document is output as a single HTML page. The
following stylesheets are used:

  (1) Either MyStyle-chunk-html.xsl or MyStyle-nochunk-html.xsl
  (2) MyStyle-html.xsl
  (3) MyStyle.xsl

The first stylesheet references the chunking or non-chunking standard DocBook
stylesheet, as appropriate.

You may see a number of these errors when creating HTML: "Revisionflag on
unexpected element: literallayout (Assuming block)". They seem to be harmless;
the output appears to be what is intended.

The original HTML that I produced from the SGCAL input had hyperlinks back from
chapter and section titles to the table of contents. These links are not
generated by xmlto. One of the testers pointed out that the lack of these
links, or simple self-referencing links for titles, makes it harder to copy a
link name into, for example, a mailing list response.

I could not find where to fiddle with the stylesheets to make such a change, if
indeed the stylesheets are capable of it. Instead, I wrote a Perl script called
TidyHTML-spec to do the job for the specification document. It updates the
index.html file (which contains the the table of contents) setting up anchors,
and then updates all the chapter files to insert appropriate links.

The index.html file as built by xmlto contains the whole table of contents in a
single line, which makes is hard to debug by hand. Since I was postprocessing
it anyway, I arranged to insert newlines after every '>' character.

The TidyHTML-spec script also processes every HTML file, to tidy up some of the
untidy features therein. It turns <div class="literallayout"><p> into <div
class="literallayout"> and a matching </p></div> into </div> to get rid of
unwanted vertical white space in literallayout blocks. Before each occurrence
of </td> it inserts &nbsp; so that the table's cell is a little bit wider than
the text itself.

The TidyHTML-spec script also takes the opportunity to postprocess the
spec_html/ix01.html file, which contains the document index. Again, the index
is generated as one single line, so it splits it up. Then it creates a list of
letters at the top of the index and hyperlinks them both ways from the
different letter portions of the index.

People wanted similar postprocessing for the filter.html file, so that is now
done using a similar script called TidyHTML-filter. It was easier to use a
separate script because filter.html is a single file rather than a directory,
so the logic is somewhat different.


CREATING TEXT FILES

This happens in four stages. The Pre-xml script is called with the -ascii,
-optbreak, and -noindex options to convert the input to ASCII characters,
insert line break points, and disable the production of an index. Then the
xmlto command converts the XML to a single HTML document, using these
stylesheets:

  (1) MyStyle-txt-html.xsl
  (2) MyStyle-html.xsl
  (3) MyStyle.xsl

The MyStyle-txt-html.xsl stylesheet is the same as MyStyle-nochunk-html.xsl,
except that it contains an addition item to ensure that a generated "copyright"
symbol is output as "(c)" rather than the Unicode character. This is necessary
because the stylesheet itself generates a copyright symbol as part of the
document title; the character is not in the original input.

The w3m command is used with the -dump option to turn the HTML file into ASCII
text, but this contains multiple sequences of blank lines that make it look
awkward. Furthermore, chapter and section titles do not stand out very well. A
local Perl script called Tidytxt is used to post-process the output. First, it
converts sequences of blank lines into a single blank lines. Then it searches
for chapter and section headings. Each chapter heading is uppercased, and
preceded by an extra two blank lines and a line of equals characters. An extra
newline is inserted before each section heading, and they are underlined with
hyphens.

The output of xmlto also contains non-ASCII Unicode characters that w3m passes
through. Fortunately, they are few, and Tidytxt cleans them up as well. Some
headings use "box drawing" characters in the range U+2500 to U+253F which are
translated into -+| as appropriate, and U+00A0 (hard space) and U+25CF (bullet)
are translated into plain spaces and asterisks. (It might be possible to do all
this in the same way as I dealt with copyright - see above - but adding a few
lines of Perl to an existing script was a lot easier.)


CREATING INFO FILES

This process starts with the same Pre-xml call as for text files. Non-ascii
characters in the source are transliterated, and the <index> elements are
removed. The docbook2texi script is then called to convert the XML file into a
Texinfo file. However, this is not quite enough. The converted file ends up
with "conceptindex" and "optionindex" items, which are not recognized by the
makeinfo command. These have to be changed to "cindex" and "findex"
respectively in the final .texinfo file. Furthermore, the main menu lacks a
pointer to the index, and indeed the index node itself is missing. These
problems are fixed by running the file through a script called TidyInfo.
Finally, a call of makeinfo creates a .info file.

There is one apparently unconfigurable feature of docbook2texi: it does not
seem possible to give it a file name for its output. It chooses a name based on
the title of the document. Thus, the main specification ends up in a file
called the_exim_mta.texi and the filter document in exim_filtering.texi. These
files are removed after their contents have been copied and modified by the
TidyInfo script, which writes to a .texinfo file.


CREATING THE MAN PAGE

I wrote a Perl script called x2man to create the exim.8 man page from the
DocBook XML source. I deliberately did NOT start from the xfpt source,
because it is the DocBook source that is the "standard". This comment line in
the DocBook source marks the start of the command line options:

  <!-- === Start of command line options === -->

A similar line marks the end. If at some time in the future another way other
than xfpt is used to maintain the DocBook source, it needs to be capable of
maintaining these comments.


UNRESOLVED PROBLEMS

There are a number of unresolved problems with producing the Exim documentation
in the manner described above. I will describe them here in the hope that in
future some way round them can be found. Some of the problems are solved by
using SDoP instead of xmlto/fop to produce PostScript and PDF output.

(1)  When a whole chain of tools is processing a file, an error somewhere
     in the middle is often very hard to debug. For instance, an error in the
     xfpt file might not show up until an XML processor throws a wobbly because
     the generated XML is bad. You have to be able to read XML and figure out
     what generated what. One of the reasons for creating the "test" series of
     targets was to help in checking out these kinds of problem.

(2)  There is a mechanism in XML for marking parts of the document as
     "revised", and I have arranged for xfpt markup to use it. However, the
     only xmlto output format that pays attention to this is the HTML output,
     which sets a green background. If xmlto/fop is used to generate PostScript
     and PDF, there are no revision marks (change bars). This problem
     is not present when SDoP is used. However, the text and Texinfo output
     format lack revision indications.

(3)  The index entries in the HTML format take you to the top of the section
     that is referenced, instead of to the point in the section where the index
     marker was set.

(4)  The HTML output supports only a single index, so the variable, options,
     and concept index entries have to be merged.

(5)  The index for the PostScript/PDF output created by xmlto/fop does not
     merge identical page numbers, which makes some entries look ugly. This is
     not a problem when SDoP is used.

(6)  The HTML index and the PostScript/PDF indexes, when made with xmlto/fop,
     make no use of textual markup; the text is all roman, without any italic
     or boldface. For PostScript/PDF, this is not a problem when SDoP is used.

(7)  I turned off hyphenation in the PostScript/PDF output produced by
     xmlto/fop, because it was being done so badly. Needless to say, I made
     SDoP do a better job. These comments apply to xmlto/fop:

     (a) It seems to force hyphenation if it is at all possible, without
         regard to the "tightness" or "looseness" of the line. Decent
         formatting software should attempt hyphenation only if the line is
         over some "looseness" threshold; otherwise you get far too many
         hyphenations, often for several lines in succession.

     (b) It uses an algorithmic form of hyphenation that doesn't always produce
         acceptable word breaks. (I prefer to use a hyphenation dictionary,
         which is what SDoP does.)

(8)  The PostScript/PDF output produced by xmlto/fop is badly paginated:

     (a) There seems to be no attempt to avoid "widow" and "orphan" lines on
         pages. A "widow" is the last line of a paragraph at the top of a page,
         and an "orphan" is the first line of a paragraph at the bottom of a
         page.

     (b) There seems to be no attempt to prevent section headings being placed
         last on a page, with no following text on the page.

     Neither of these problems occurs when SDoP is used to produce the
     PostScript/PDF output.

(9)  The fop processor does not support "fi" ligatures, not even if you put the
     appropriate Unicode character into the source by hand. Again, this is not
     a problem if SDoP is used.

(10) There are no diagrams in the new documentation. This is something I hope
     to work on. The previously used Aspic command for creating line art from a
     textual description can output Encapsulated PostScript or Scalar Vector
     Graphics, which are two standard diagram representations. Aspic could be
     formally released and used to generate output that could be included in at
     least some of the output formats.

(11) The use of a "zero-width space" works well as a way of specifying that
     Exim option names can be split, without hyphens, over line breaks.

     However, when xmlto/fop is being used and an option is not split, if the
     line is very "loose", the zero-width space is expanded, along with other
     spaces. This is a totally crazy thing to, but unfortunately it is
     suggested by the Unicode definition of the zero-width space, which says
     "its presence between two characters does not prevent increased letter
     spacing in justification". It seems that the implementors of fop have
     understood "letter spacing" also to include "word spacing". Sigh.

     This problem does not arise when SDoP is used.

The consequence of (7), (8), and (9) is that the PostScript/PDF output as
produced by xmlto/fop looks as if it comes from some of the very early attempts
at text formatting of around 20 years ago. We can only hope that 20 years'
progress is not going to get lost, and that things will improve in this area.
My small contribution to this has been to write SDoP, which, though simple and
"non-standard", does get some of these formatting issues right.


LIST OF FILES

Markup.txt                     Describes the xfpt markup that is used
HowItWorks.txt                 This document
Makefile                       The makefile
MyStyle-chunk-html.xsl         Stylesheet for chunked HTML output
MyStyle-filter-fo.xsl          Stylesheet for filter fo output
MyStyle-fo.xsl                 Stylesheet for any fo output
MyStyle-html.xsl               Stylesheet for any HTML output
MyStyle-nochunk-html.xsl       Stylesheet for non-chunked HTML output
MyStyle-spec-fo.xsl            Stylesheet for spec fo output
MyStyle-txt-html.xsl           Stylesheet for HTML=>text output
MyStyle.xsl                    Stylesheet for all output
MyTitleStyle.xsl               Stylesheet for spec title page
MyTitlepage.templates.xml      Template for creating MyTitleStyle.xsl
Myhtml.css                     Experimental css stylesheet for HTML output
PageLabelPDF                   Script to postprocess xmlto/fop PDF output
Pre-xml                        Script to preprocess XML
TidyHTML-filter                Script to tidy up the filter HTML output
TidyHTML-spec                  Script to tidy up the spec HTML output
TidyInfo                       Script to sort index problems in Texinfo output
Tidytxt                        Script to compact multiple blank lines
filter.xfpt                    xfpt source of the filter document
spec.xfpt                      xfpt source of the specification document
x2man                          Script to make the Exim man page from the XML


(Originally, and for the most part: Philip Hazel)
The Exim Maintainers
Last updated: 5 July 2010