
[ Article on SGMLtools 2.0 by Cees de Groot
  Scheduled for: October 1998
  Deadline: July 15, 1998
  Revision: $Revision: 1.2 $
]

More Flexible Formatting with SGMLtools

In LJ issue #18, Christian Schwarz presented a short overview of
Linuxdoc-SGML as it stood then: a complete, out-of-the-box package that
gave, and still gives, authors a chance to write once and present anywhere.
From flat ASCII to typeset PostScript and hypertext HTML, it all rolls out
from a single SGML source file. Since then, many changes, small and large,
led to a rename to SGML-Tools (and later to SGMLtools, because the hyphen
caused confusion) to indicate that it wasn't just for Linux anymore. Still,
we, the SGMLtools project authors, weren't satisfied, so we set out to build
the even better package presented here: SGMLtools 2. This article gives a
brief overview of what changed since SGML-Tools 1 to justify the new major
version number; more extensive information can be found on the SGMLtools
website.

From Linuxdoc to DocBook

A big issue that came up again and again was that the Linuxdoc
document type definition was showing shortcomings. Document type
definition (DTD) is the SGML term for the set of rules that fixes what an
SGML document complying with that DTD must look like. It outlines
the structure of the document: from titles and subtitles to tables,
everything is in there.

Maintaining a document type definition, as we found out, is very hard. There
is constant discussion over which features should be allowed in, how to
make existing features better, and whether to stick with purely descriptive
markup or be a little bit pragmatic about things. These discussions kept
coming back and started to interfere with progress. The Linuxdoc
DTD was clearly too limited, but we didn't want to redesign it without
first checking whether alternatives already existed.

We quickly came to the conclusion that the DocBook DTD, as developed by
the Davenport Group, would be a good successor to the Linuxdoc DTD. DocBook,
being developed by professionals for professionals with an emphasis on
technical documentation, fits the target audience for SGMLtools very
well and solves a number of problems of Linuxdoc. Furthermore, almost every
SGML vendor supports DocBook, so this would make users less dependent on
us and give them more ways to write, format, and access SGML documentation.
Recently, responsibility for maintaining DocBook has been transferred to
the Organization for the Advancement of Structured Information
Standards (www.oasis-open.org), ensuring that DocBook will continue
to be widely supported.


From mapping files to DSSSL

The acronym DSSSL may not mean much to the average reader, but it marks
another significant change in SGMLtools. DSSSL (Document Style Semantics
and Specification Language) is a language that you can
use to specify what SGML documents will look like. It helps in translating
descriptive markup like "section" to a certain formatting style like "Helvetica
Bold, 18 points", building up tables of contents, etcetera. It is much more
powerful than the mapping files used previously, because it can act on
context, lets you define functions, and so on. As DSSSL is based on Scheme,
there is not much you cannot do one way or another.

We chose to go with DSSSL not only because of its power: it is also an
industry standard (contrary to the old method and to the alternatives we
evaluated), and it helped us jumpstart the project because there is a
complete set of DSSSL styles for the DocBook DTD.

So, how does SGMLtools work?

SGMLtools 2 is a collection of tools around three core elements:
[list]
[dot] The DocBook DTD;
[dot] The standard DocBook DSSSL files;
[dot] Jade, the SGML/DSSSL parser.
[endlist]

When you hand your SGML source to SGMLtools (with the command "sgmltools"),
it basically does nothing more than call Jade with
a) the name of the SGML file, b) the name of the DSSSL file to
apply to it, and c) the requested output format. The following sections
go into some detail to make the process clear. It is not
very hard to understand, and some basic knowledge of what happens during
a run of SGMLtools helps a great deal when you want to make modifications
here and there.
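
To make the three inputs concrete, here is a minimal Python sketch of how a
wrapper could assemble such a Jade call. It is an illustration only, not the
actual sgmltools code: the function name and the file names are made up, and
it assumes Jade's "-t" (output type) and "-d" (DSSSL specification) options.

```python
def build_jade_command(sgml_file, dsssl_file, backend):
    """Assemble an argument list for one Jade run (hypothetical helper).

    -t selects the Jade backend, for example "tex", "rtf" or "sgml";
    -d names the DSSSL specification to apply to the document.
    """
    return ["jade", "-t", backend, "-d", dsssl_file, sgml_file]


if __name__ == "__main__":
    # The file names below are placeholders for illustration only.
    command = build_jade_command("guide.sgml", "ldp.dsl", "tex")
    print(" ".join(command))
```

The list form can be passed straight to something like subprocess.run; the
real wrapper may well add further options, such as catalog locations.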

Jade first reads the SGML file and tries to find the document type
definition from the SGML file's declaration:
[cw]
<!DOCTYPE article PUBLIC "-//Davenport//DTD DocBook V3.0//EN">
[ecw]
is what should appear at the top of DocBook-compliant documents. Note
that [cw]article[ecw] may be replaced by any element of the DocBook DTD:
you can even write [cw]para[ecw] to designate a single-paragraph
document. Via the [cw]PUBLIC[ecw] identifier Jade gets to the filename
of the DTD (see the box on public identifiers), and if this
all succeeds, the SGML source is checked for compliance.

After the document has been found OK ("validated"), Jade reads the DSSSL
file indicated and executes it against the parsed SGML file. The DSSSL
"program" reads the SGML document from objects in memory and outputs
another memory structure called a Flow Object Tree (FOT). The FOT
closely resembles the SGML document in structure, but it also carries
information on fonts, sizes, etcetera. Finally, Jade hands the FOT to one of its
backends, which converts the generic style information into the backend's
specific file format.

A short example to illustrate this process: you start with an SGML
document that has a part that says:
[cw]
<sect1><title>Introduction</title>
...
</sect1>
[ecw]
This is a top-level section with "Introduction" as the title. Once Jade
has determined that the document is valid DocBook, it reads the DSSSL file,
say one called [cw]ldp.dsl[ecw] that gives instructions for Linux Documentation
Project style formatting. Somewhere in the DSSSL file, there could be
a piece that says:
[cw]
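; Rule for TITLE elements whose parent is SECT1
(element (SECT1 TITLE)
  (make paragraph
    font-family-name: "Times New Roman"
    font-weight: 'bold
    font-size: 20pt
    (process-children)))   ; the title text becomes the paragraph content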
[ecw]
This expression says "for TITLE elements within SECT1 elements, output
a paragraph set in 20pt bold Times New Roman". Taking some shortcuts
we can say that this expression results in a flow object with the
given properties and the text "Introduction" for content (the concept
of making a paragraph out of everything, even headings, will be familiar
to people who have worked with DTP software). When everything is done,
Jade hands all the flow objects to the backend, for example the TeX
backend. This backend, upon encountering the flow object for our
introductory section title, will output something like:
[cw]
{\setfontfam{Times-Roman-Bold}\setfontsize{20pt}Introduction}
[ecw]
which can then be processed by TeX and likely a special TeX package to
generate DVI, PostScript, etcetera.

Note that the beauty of DSSSL is that you only talk about style, not
about specific instructions for specific formats: whether TeX, RTF, or
Groff, you'll always get at least a close equivalent of a "20pt Times
New Roman Bold" section header. If you need to tune this, you can easily
override pieces of DSSSL specifications for specific backends (often,
you'll have at least different DSSSL files for hardcopy and HTML output).
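
The separation between style and output format can be mimicked in a few
lines of Python. This is a toy model only, not how Jade is implemented: the
Paragraph class and the two render functions are invented for illustration,
the TeX macros are the illustrative ones from the example above, and the
font family is simplified to "Times-Roman".

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Paragraph:
    """A toy stand-in for a paragraph flow object: text plus style only."""
    text: str
    font_family: str
    font_weight: str
    font_size_pt: int


def to_tex(p):
    # Mirrors the illustrative TeX fragment shown earlier, not real Jade output.
    face = p.font_family + ("-Bold" if p.font_weight == "bold" else "")
    return "{\\setfontfam{%s}\\setfontsize{%dpt}%s}" % (face, p.font_size_pt, p.text)


def to_html(p):
    # The same style information, expressed in a different output format.
    style = "font-family: %s; font-weight: %s; font-size: %dpt" % (
        p.font_family, p.font_weight, p.font_size_pt)
    return '<p style="%s">%s</p>' % (style, escape(p.text))


if __name__ == "__main__":
    title = Paragraph("Introduction", "Times-Roman", "bold", 20)
    print(to_tex(title))
    print(to_html(title))
```

The style is described once, in the Paragraph object; each backend function
decides on its own how to express it.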

Customization

One of the biggest advantages of the new version is that it is very easy
to customize, once you get the hang of DSSSL. As the previous part
showed, you don't even need to know a lot about the backend: in DSSSL you
deal with fairly high-level things like font names without worrying about
how these font names are handled in PostScript or Groff documents.

The original DocBook DSSSL style sheets as supplied by SGMLtools are meant
to be customized. All you need to do is write your own style sheet that
includes the original one and overrides what you want to customize. Often
this can be just a couple of lines that tune parameters, and in SGMLtools
you'll find a couple of examples of these customizations. After you set up
your own DSSSL style sheet, you'll need to make sure that SGMLtools
uses it. You can do this by giving the "-d" or "--dsssl-spec" option,
pointing it at your DSSSL style sheet.

Migrating from Linuxdoc

The first question from many Linuxdoc users will be: what about my
current documents? The short answer is: don't worry, we've thought about
that. The longer answer: you'll have to migrate from Linuxdoc to DocBook
within six months of the release date of SGMLtools 2, and the package
provides a tool to help you in the conversion process.

The first step in the migration process is to make sure that your documents
are compliant with the last SGML-Tools 1 version, which will be 1.0.7
or newer. Install this software and run your documents through it to make
sure that they're up-to-date.

The second step is to convert your documents with the command
[cw]sgmltools --backend=ld2db[ecw], which spits out DocBook documents. If this
run succeeds, you can finalize the migration by reading up on DocBook and
seeing whether you are satisfied with the result of the conversion. From
this point on, you can continue to write in DocBook.

To give you some room for planning your conversion, we'll continue
to support SGML-Tools 1 for six months after the release date of SGMLtools
2 (which is unknown at the time of this writing, but should lie fairly
close to the publication date of this article; check the website for
details). After six months, SGML-Tools 1 will be removed from the websites and,
as far as we are concerned, the Linuxdoc DTD will be history from
then on. We'll remind you of this event in comp.os.linux.announce well
in advance, and of course you're free to keep using SGML-Tools 1 for
as long as you wish, but we recommend you take the trouble to learn DocBook
and start using SGMLtools 2. It will give you even more flexible formatting
power.

[Public and System Identifiers box]

SGML was designed to be free of system dependencies, and therefore
even a way around using filenames was found. SGML talks about
"external entities", which can
be identified in two ways: by a public identifier or by a system
identifier, where the first is generally preferred because it is
system-independent. Public identifiers
are known to everyone who has edited HTML:
[cw]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Draft//EN">
[ecw]
says: "this is an HTML document and you'll be able to find the specs
via the public identifier -//W3C//DTD HTML 3.2 Draft//EN". The public
identifier can be resolved into SGML material in any number of ways: through
databases, filesystems, networks, whatever the SGML system at hand implements.

A standard way to map public identifiers to system identifiers is by means
of SGML Open catalogs. These are files that contain entries like:
[cw]
PUBLIC "-//W3C//DTD HTML 3.2 Draft//EN" "/usr/local/sgml/html3-2.dtd"
[ecw]
where the third field is the system identifier, in this case (and
indeed in most cases) a filename. SGML software knows how to find these
catalogs and uses them to translate public identifiers without the
user having to worry about file locations (often, a catalog name is hardcoded
but may be overridden by a set of names in the environment variable
SGML_CATALOG_FILES).
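
The lookup itself is simple enough to sketch in Python. The code below is a
simplified illustration, not the parser used by Jade or any other SGML
system: it only understands PUBLIC entries, ignores every other catalog
keyword, does not handle catalog comments, and assumes that the first
matching entry wins. The function names are invented for this example.

```python
import shlex


def parse_catalog(text):
    """Return a dict mapping public identifiers to system identifiers.

    Only 'PUBLIC "public id" "system id"' entries are recognized.
    """
    tokens = shlex.split(text)
    mapping = {}
    i = 0
    while i < len(tokens):
        if tokens[i].upper() == "PUBLIC" and i + 2 < len(tokens):
            # Keep the first entry seen for a given public identifier.
            mapping.setdefault(tokens[i + 1], tokens[i + 2])
            i += 3
        else:
            i += 1
    return mapping


def resolve(public_id, catalog_texts):
    """Search the catalogs in order; return the system identifier or None."""
    for text in catalog_texts:
        system_id = parse_catalog(text).get(public_id)
        if system_id is not None:
            return system_id
    return None


if __name__ == "__main__":
    catalog = 'PUBLIC "-//W3C//DTD HTML 3.2 Draft//EN" "/usr/local/sgml/html3-2.dtd"\n'
    print(resolve("-//W3C//DTD HTML 3.2 Draft//EN", [catalog]))
```

In a real setup the catalog texts would be read from the files named in
SGML_CATALOG_FILES, searched in the order in which they are listed.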

SGMLtools builds and uses a shared catalog in a well-known location
([cw]/var/lib/sgml/catalog[ecw]) that contains all these mappings
so that hard-coded system identifiers are avoided as much as possible,
making documents more portable.

[Getting SGMLtools box]

The central homepage of the SGMLtools project is
http://pobox.com/~cg/sgmltools. From this page, links will get you to
the most recent version of the software and to mirror sites around the
globe. Note that SGMLtools follows "Linux kernel versioning": if the
middle version number is even (as in 2.0.1), you have a version that is
considered fit for general consumption; if the middle version number is
odd (as in 2.1.2), you are dealing with a developer-only version.
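
Expressed as a small Python check, the rule looks like this. The function
name is made up for illustration, and the sketch assumes version strings
with exactly three numeric parts.

```python
def is_stable_release(version):
    """Return True when the middle number of 'major.minor.patch' is even."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError("expected a version like 2.0.1, got %r" % version)
    return int(parts[1]) % 2 == 0


if __name__ == "__main__":
    for version in ("2.0.1", "2.1.2"):
        kind = "general release" if is_stable_release(version) else "developer-only"
        print(version, "is a", kind, "version")
```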

[Author blurb]

Cees de Groot has been an avid Linux user since the very early days, and he
has tried to pay back the great favour Linus did him by contributing small
bits and pieces to various parts of the system. Since fall 1996 he has been
maintaining SGMLtools. By day, he is a Java consultant working for
a small company specializing in intranets. You can reach him at cg@pobox.com.


 
Copyright (C)2000-2011 Cees de Groot -- All rights reserved.