DTDs (Document Type Definitions)

One DTD per Content Type.  According to the XML Specification, each XML application, or instance of a markup language, may optionally have a Document Type Definition (DTD). That is, it may have a set of rules (the 'Definition') that describe what may and may not appear in the particular 'Type' of 'Document' that the markup language application is concerned with. A document instance adhering to these rules is said to be "valid."
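To make the notion of validity concrete, here is a minimal sketch that is not taken from this system: a tiny document that carries its DTD in the internal subset. The element names (memo, title, para) are invented purely for illustration.

```xml
<?xml version="1.0"?>
<!-- Hypothetical example: the DTD rules live in the internal subset -->
<!DOCTYPE memo [
  <!-- A memo is exactly one title followed by one or more paragraphs -->
  <!ELEMENT memo  (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para  (#PCDATA)>
]>
<memo>
  <title>Quarterly update</title>
  <para>This instance follows the rules above, so it is valid.</para>
</memo>
```

Removing the title element, or placing it after the paragraph, would leave the document well-formed but no longer valid against these rules.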

As this website build system uses three XML markup languages (BusinessML, StructureML, and StructureLayoutML), one might expect there to be three DTDs. But only one of these markup languages needs to be checked for "validity," and so only one has a DTD: the source data XML, BusinessML. (Please note that the system per se does not explicitly perform a validity check. However, when BusinessML is authored in an XML-aware text editor (e.g. XML Spy), the DTD is available for immediate validity testing while editing.)

Tip

One other validation tool of interest is run from the command line: java Validate filename.xml (see also: java Validate -help). 'Validate' uses the Xerces XML parser, which must be in your CLASSPATH.

The other two XML markup languages (StructureML, StructureLayoutML) are temporary, interim, derived files that exist only while the source BusinessML is processed, in a series of transformations, en route to the final SiteHTML.

Tip

The fourth "Markup Language" in the system is the final rendering into HTML. This "Markup Language" gets the special name "SiteHTML" not because it differs from what you can do with W3C HTML, but because it effectively means "the particular subset of W3C HTML that actually gets used on this site."

(Note that the DTD for SiteHTML therefore is none other than the W3C HTML DTD, though the processing system does not concern itself with a "validity" check. Again, the important data file to check for being "valid" is only the source XML: BusinessML.)

Interestingly, even "BusinessML" is not quite a precise enough term for the XML application (markup language) that serves throughout this website build system as the 'Type' of 'Document' for which a 'Definition' needs to be recorded. Across the website there are in fact several "Document Types," and each of these corresponds to what we have called a Content Type. We therefore need (and have) a DTD for each Content Type in the system.

It is further worth noting that the Content Types do share some (limited) commonality, and that the DTD design is accordingly a modular one of "hub and spoke" inheritance. A central, or "core," DTD (/mlnmdev/dtd/ct00_core.dtd) records in one place all aspects of the Document Type Definition that are shared across all Content Types. The "spokes" are the individual DTDs (each of which "imports" the core ct00_core.dtd), named to match their Content Types: ct01_general.dtd, ct10_press_release.dtd, ct70_highlight.dtd, etc.
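The hub-and-spoke arrangement can be sketched as follows. The file names come from the text above, but every declaration shown is an assumption made for illustration: the actual contents of ct00_core.dtd and of the spoke DTDs are not reproduced here, and the parameter entity name content.type.body and the elements dateline, headline, and summary are hypothetical. The sketch relies on one rule from the XML Specification: when a parameter entity is declared more than once, the first declaration is binding.

```xml
<!-- ct00_core.dtd (hub), illustrative excerpt only -->

<!-- Default content model; ignored when a spoke has already
     declared content.type.body before importing this file -->
<!ENTITY % content.type.body "summary">

<!-- Root element shared by every Content Type -->
<!ELEMENT content (%content.type.body;)>
<!ELEMENT summary (#PCDATA)>
```

```xml
<!-- ct10_press_release.dtd (spoke), illustrative excerpt only -->

<!-- Declared first, so this definition wins over the core default -->
<!ENTITY % content.type.body "dateline, headline, summary">

<!-- Hypothetical elements specific to press releases -->
<!ELEMENT dateline (#PCDATA)>
<!ELEMENT headline (#PCDATA)>

<!-- Import the shared core declarations (assumed to sit in the same directory) -->
<!ENTITY % core SYSTEM "ct00_core.dtd">
%core;
```

Under this sketch, a document validated against the spoke gets the press-release content model, while the shared declarations are maintained in one file only.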

Now, looked at this way, it must be admitted that you could arguably make the technical case that there really is but one ML for all these Content Types: ct00_core.dtd (the root element for which is <content>). It's then a question of your point of view, as to whether it is useful or helpful to think of what is found in the "core" (a very limited set of tags actually) as being a substantive "type" for what is called a "business markup language."

Note

What About HTML? While it is true that the core (ct00_core.dtd) in turn imports the whole of the W3C HTML DTD, the inclusion of that critical shared DTD resource could also be easily achieved without a "hub & spoke" arrangement. Each individual Content Type DTD, with a line or two of code, could import it. Therefore it cannot be said that one of the "substantive" reasons for having a ct00_core.dtd is to bring the HTML DTD to each spoke.

This is why it is suggested here to regard the far more "business-like" elements situated out in each "spoke" DTD as constituting their own instance of an ML, rather than regarding the "core" file as the instantiation of "BusinessML," which just happens to have all its interesting parts out in a series of extension modules. In other words, turn that view on its head: look at each "interesting part" (Content Type) as an ML in its own right, one that just happens to include a few commonly shared items from a small core. The core holds the Content Types together as a set of cooperating DTDs and recognizes that they do have some relation to one another.

As stated above, the question permits more than one point of view, but this documentation takes the view that there is one DTD for each variant of BusinessML, and that these DTDs are designed to capture the semantic business information in each identified Content Type.
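Seen from the authoring side, this viewpoint means that each source file names the DTD of its own Content Type in its document type declaration. The following is only a sketch: the root element content and the file name ct10_press_release.dtd come from the text above, whereas the location of the DTD relative to the document and the child elements shown (dateline, headline, summary) are hypothetical.

```xml
<?xml version="1.0"?>
<!-- Hypothetical press-release source file; the DTD location is assumed -->
<!DOCTYPE content SYSTEM "ct10_press_release.dtd">
<content>
  <!-- The child elements below are illustrative, not the real vocabulary -->
  <dateline>Springfield, 1 March</dateline>
  <headline>Example Corp opens a new office</headline>
  <summary>Example Corp today announced a new regional office.</summary>
</content>
```

An XML-aware editor that opens such a file can load the named DTD and flag validity errors during editing.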

Adding another entire DTD for a new Content Type is a comparatively straightforward endeavor, and as with many artifacts in this entire website build system, the best "documentation" on how to do it is to find an existing example and "Save As..." the skeleton for the new piece, on which you then perform the necessary "Search & Replace..." operations.

Adding elements or attributes to the existing set of DTDs is an advanced topic. Some details are provided elsewhere in this documentation and in extensive comments in the DTD files themselves; here we simply note that the system is designed to support this customization.

The key concept behind all this "document type defining" is that at bottom the system is based upon the W3C XHTML-1.0-Transitional DTD, to which useful proprietary "business" elements are added. Note that these new tags are not merely added "on top," as it were, of the set of HTML tags; they are worked right into the fabric of the HTML DTD, so that the proprietary elements can be used, where this is desired, right "inline" with all other HTML tags.

Tip

The "pattern" in XML design represented by this is referred to as the "Extensible Content Model," as documented at http://www.xmlpatterns.com/ExtensibleContentModelMain.shtml
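One way such an inline extension can be achieved is sketched below. This is an assumption about the mechanism, not a copy of ct00_core.dtd: the XHTML 1.0 Transitional DTD builds its inline content from parameter entities such as special.extra, and because the first declaration of a parameter entity is binding, redeclaring one before the XHTML DTD is imported adds a new element everywhere that entity is used. The element product and its sku attribute are hypothetical.

```xml
<!-- Illustrative excerpt, as it might appear in a core DTD -->

<!-- Redeclare an XHTML inline class BEFORE importing the XHTML DTD.
     The original members are repeated and one business element is added. -->
<!ENTITY % special.extra
   "object | applet | img | map | iframe | product">

<!-- Hypothetical business element and attribute -->
<!ELEMENT product (#PCDATA)>
<!ATTLIST product
   sku CDATA #IMPLIED>

<!-- Import XHTML 1.0 Transitional; its own declaration of
     special.extra is now ignored in favor of the one above -->
<!ENTITY % xhtml PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
%xhtml;
```

With declarations along these lines, the new element would be valid wherever XHTML inline content is allowed, for example:

```xml
<p>Our <product sku="A-100">Widget</product> ships <em>today</em>.</p>
```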

The rationale for choosing the W3C's XHTML as the basis for the XML markup to be applied to a company's content is admittedly grounded in a bias (albeit a practical one) toward web page production. But even where other channels or formats (print; wireless; RSS; PDF; JavaDoc; DocBook (!); etc.) find that their particular needs cannot be met entirely in HTML, they also always make provision (where practicable) to transform their results into HTML, or to accept (X)HTML as an input. To ensure continued processability of your data, HTML (or more precisely, XHTML) is not a bad place to be.

The other aspects favoring the HTML tagset, at least for the intended audience of website work staff, include "ease of learning" and "ease of use," concepts covered extensively in David Megginson's book, "Structuring XML Documents." The sufficiently comprehensive (but not too large) size of the DTD vocabulary, the familiarity with it that can be presumed, and the successful trade-off of brief element names (e.g. 'p' instead of 'para') for quicker data entry all contributed to the selection of HTML.

Still, alternatives to HTML were considered. The limitations encountered in an early initiative to create a small set of proprietary-only tags led to a search for existing examples that could be adopted instead. The DocBook XML DTD, and Norman Walsh's other DTD, "Website," were briefly investigated. But it was not long before the advantages of simply using HTML (XHTML), together with a mechanism ("parameter entities") to safely extend and customize it, became clear.

Finally, it must be noted that in practice, oftentimes the content selected by many corporations as destined for publication to their corporate website does not demand a great deal of semantic markup, beyond a limited number of key descriptor fields (e.g., headers, metadata, perhaps a list of particular key informative items). Therefore the simple capture ("markup") of the inherent structure of the document (its paragraphs, lists, links, tabular data, etc.) can be well attended to by the tagset available in (X)HTML.