Chapter 44 Should You Upgrade to SGML?

by Steven J. DeRose

CONTENTS

How HTML and SGML Relate
What Data Is Already in SGML?
Why Is This Data in SGML, Not HTML?
Five Questions to Ask About Your Data
Challenges of Upgrading
Benefits of Upgrading
Summary

The World Wide Web has brought more attention to SGML than anything else. Most WWW documents (other than bitmapped graphics) are SGML documents that use the HTML DTD. If you're using HTML, you're using SGML, although there's much more to SGML. On the other hand, most Web browsers don't support any other DTDs besides HTML. This means that all the other SGML data in the world can't be browsed easily on the Web. (But take heart! Several solutions are presented in this chapter.)

This chapter begins by telling you how SGML relates to HTML and what's happening with SGML on the Web already. Then you learn about the practical issues: how to decide whether to go with HTML or SGML for your Web data, and how you can take advantage of each one's strengths and avoid their weaknesses.

How HTML and SGML Relate

People often say that HTML is a subset of SGML. This is nearly right, but it's a bit more complicated. Technically, HTML is an application of SGML. This means that it's really a DTD, a set of tags and rules for where the tags can go. SGML is a language for composing DTDs that fit various kinds of documents. There are many applications, and therefore many DTDs. (HTML, the DTD for the World Wide Web, is probably the best-known one.)

You already know that a DTD is always designed for some particular type of document: business letters, aircraft manuals, poetry, and so on. An important question to ask when deciding whether to put some data in HTML or another SGML DTD is, "What kind of documents is the HTML DTD meant for?"

Here is a sample of the kinds of tags that exist in HTML. First, HTML has a lot of tags for marking up common kinds of structures. Here's a partial list:

Headings: <H1>, <H2>, …
Divisions (the actual big containers like chapters and sections, that contain headings and other data): <DIV>
Basic document blocks (paragraphs, block quotations, footnotes, various kinds of lists): <P>, <BQ>, <FN>, <OL>, <UL>, <DL>
Tables and equations (only in newer browsers): involve many different element types
Text emphasis: <EM>, <STRONG>
Hypermedia links: <A>, <IMG>
Interactive forms: <INPUT>, <TEXTAREA>

HTML also includes several element types that express formatting rather than structure. These pose some portability problems, but they can be useful in cases where you simply must have a certain layout:

Font changing, such as for getting bold and italic type: <B>, <I>
Various extensions that work only with certain browsers: <BLINK>, <FONT>, and so on
Forced line breaks (most used in code samples, "preformatted text," and similar examples): <BR>, <PRE>
Drawing rules, boxes, and so on: <HR>

From the selection of element types, you can easily see the kinds of documents HTML is best for: fairly simple documents with sections, paragraphs, lists, and the like. In fact, most of the HTML element types are pretty generic; nearly every DTD has paragraphs and lists in it. One place where HTML excels, however, is in linking. Although it only has a couple of element types for links, those element types can use URLs to point to any data anywhere in the world. For more details on HTML, you may want to read Special Edition: Using HTML, from Que Publishing.

So, why use other SGML DTDs? The main reason is that not all documents consist of only these basic kinds of elements. Whenever you run across some other kind of element, you have to "cheat" to express it in HTML. A very common example is the Level 6 heading element in HTML (H6). Because the first browsers formatted H6 headings in small caps and there was no text emphasis tag that would give the same effect, people got in the habit of using H6 to mean "small caps." Of course, some people also use H6 as a heading, and many people use it both ways.

This works fine until something changes. Suppose that a browser comes along that enables users to adjust the text styles for different tags, for example. Someone changes H6 to look like something besides small caps, and everyone who was counting on small caps is surprised. Sometimes this won't matter, but it might; what if the user wants all the headings big and all the text emphasis small? Or what if the user is blind? When his browser runs across an H6 element, it wouldn't do any good for his browser to put it in large type, so instead maybe its computer-generated voice says "section" and reads the heading loudly; in the same way, maybe such a browser is not supposed to do anything special for small caps.

The most important problem, though, is that you might later want to use the tags for something other than formatting. What if a browser is really friendly and makes automatic outlines by grabbing all the headings? Or what if you want to do a search, but only for text in headings? (You might want to do that because if a word occurs in a heading, it's probably more important than if it just occurs in the main text.)
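To make the outline problem concrete, here is a minimal sketch in Python that builds an outline by grabbing every heading. The class name OutlineBuilder, the function build_outline, and the sample text are all invented for illustration; the only library used is Python's standard html.parser. Notice how the H6 that was typed just to get small caps turns up in the outline as if it were a real heading.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineBuilder(HTMLParser):
    """Collect a (level, text) pair for every heading element."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over tag names in lower case.
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def build_outline(html):
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.outline


# Invented sample: the H6 here is tag abuse, used only to get small caps.
SAMPLE = """
<H1>Annual Report</H1>
<P>Sales rose again, <H6>as predicted</H6>, in every region.</P>
<H2>Europe</H2>
<P>Details follow.</P>
"""

for level, text in build_outline(SAMPLE):
    print("  " * (level - 1) + text)
```

The program has no way to tell the abused H6 from a genuine one, so "as predicted" lands in the outline along with the two real headings.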

Using a tag because it gets the right formatting effect is always a problem, usually a delayed one; it works fine when you do it, but the "gotcha" comes later. People working with the distant ancestors of SGML made up a name for this: "tag abuse syndrome."

The only thing to do about tag abuse syndrome is to make sure you have the right types of tags available. Few people would use <H6> for small caps if there were a more appropriate emphasis element available. That is exactly why SGML is important for the Web; a lot of documents contain elements that don't fit into the HTML set. Here are some kinds of elements for which tags aren't available in HTML:

Poetry and drama: STANZA, VERSE, SPEECH, ROLES.
Computer manual-speak: COMMAND, RESPONSE, MENUNAME.
Bibliographies, card catalogs, and the like: AUTHOR, TITLE, PUBLISHER, EDITION, SUBJECT-CODE, DATE.
Back-of-book indexes: ENTRY, SUBENTRY, PAGEREF.
Dictionaries: ENTRY (of many levels), PRONUNCIATION, ETYMOLOGY, DEFINITION, SAMPLE-QUOTATION.

This problem will continue to exist even though later versions of HTML will add many useful new tags, because no one can predict all the kinds of documents that people will invent. SGML provides the solution, because when you need a new kind of element, you can create it. You can avoid problems by trying not to force every kind of document into a single mold (just as you don't try to make a single vehicle do the work of a bicycle, car, and Mack truck).

From time to time, as you tag a document, you might feel as if the right tag just isn't available. How often this happens is a good way to tell how well the DTD you're using fits the document you're working with. If the fit is too poor, the time may come to extend the DTD or switch to an entirely different one, though this shouldn't happen very often. It's better to use the right DTD for each job than to force-fit; to be able to do this, users must have software that handles SGML generically rather than forcing data into any one mold.

Tip

Moving data from one DTD to another can sometimes be easy. It helps to have at least a little skill with some programming tool like Perl, as well as SGML. Even so, the job is not always easy. If the two DTDs use similar structures and differ primarily in tag names, it may be as easy as running some global changes to rename tags. If you aren't using much SGML minimization, non-SGML tools like Perl or even a word processor's Search and Replace command may be enough, because all the tags are right there: you can search for a string like <P> and change it to <PARA> (but remember to allow for tags with attributes!). On the other hand, if you're using a lot of omitted tags or changing to a very different DTD where you have to add or subtract containers, re-order things, and so on, it can be a lot more work.

There are also special tools available to help transform SGML documents in this way. Among them are OmniMark from Software Exoterica, the SGML Hammer from SoftQuad, and Balise from AIS.
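For the easy case described in this tip, a global rename might look something like the following Python sketch. The function name rename_tag is made up for this example, and the approach assumes fully tagged text with no SGML minimization and no ">" characters inside attribute values; anything fancier calls for a real SGML tool.

```python
import re


def rename_tag(text, old, new):
    """Rename start and end tags of one element type, keeping attributes."""
    pattern = re.compile(
        # "<" or "</", the old name, then whitespace or ">" so that
        # renaming P does not also catch PRE or PARA.
        r"<(/?)" + re.escape(old) + r"(?=[\s>])([^>]*)>",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: "<" + m.group(1) + new + m.group(2) + ">", text
    )


sample = "<P>One</P><P ALIGN=center>Two</P><PRE>code</PRE>"
print(rename_tag(sample, "P", "PARA"))
# prints: <PARA>One</PARA><PARA ALIGN=center>Two</PARA><PRE>code</PRE>
```

The lookahead is the part that "allows for tags with attributes": it accepts either a closing ">" or white space after the name, and carries whatever follows into the renamed tag.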

What Data Is Already in SGML?

A lot of data is already available in SGML, and a lot of that has already gone onto the Web. Because SGML was adopted first by large organizations (after all, they had the biggest document problems to solve), those organizations have been able to make a lot of data available.

From Commercial Publishers

Many publishers are moving to SGML for all their documents. Some want to preserve their investment so they can reproduce books even after the latest whiz-bang word processor is history. Some want to simplify the data-conversion they do when authors send in their drafts. Some want to support new forms of multimedia delivery, information retrieval, and so on.

One of the earliest success stories for SGML in publishing is the many-volume Oxford English Dictionary (OED). For many decades, the entire OED used rooms full of 3×5 cards. But in the early 1980s, the publishers decided to go electronic. They worked with the University of Waterloo and developed sophisticated conversion programs to get the whole dictionary into SGML. One of the hardest tasks was teasing apart 25 or so different uses for italics in the scanned text: book titles, foreign words, emphasis, word origins, and so on. This is just a severe case of tag-abuse syndrome (one they couldn't avoid, since they had to work from scanned text, and scanners can't tell you much about distinctions other than font choice). Success in this conversion made it much easier to keep the dictionary up-to-date; it's also resulted in a great electronic edition that can be searched in very sophisticated ways. Because of the up-front tagging work, if you ask for all the words with Latin origins, you don't also get all the places where "Latin" happens to show up as an emphatic word or in a book title.

Another major SGML publishing project is the Chadwyck-Healey English Poetry Database. This project is collecting all English poetry from the earliest stages of English up to 1900 and publishing it on a series of CD-ROMs with sophisticated search software. Some of it everyone has read, some of it only an English professor could love, but it's all going to be there, in SGML.

Journal publishers have recently started using SGML to speed up the review and publishing cycle (see Figure 44.1). Platform and format independence make it easier to ship files to the many people involved. The fact that all kinds of software, from authoring to online and paper delivery systems, can now deal with SGML also makes it a good common format for them.

Figure 44.1 : SGML is being used for a variety of sophisticated documents, including technical and scientific journals. Screen shot courtesy of Lightbinders, San Francisco (http://ibin.com).

From Computer Vendors

When computer companies started using SGML, SGML won the battle. Now that the publications and documentation departments right inside computer companies are demanding good SGML tools, the need is obvious to those companies. When software companies notice a problem, there's a nice side benefit: They not only notice, but can do something about it, and so new tools are beginning to appear.

Silicon Graphics, Inc. was one of the first companies to move its documentation to SGML, calling its system "IRIS InSight" (see Figure 44.2). SGI makes the high-end graphics workstations that bring us a lot of special effects. Novell moved too, and reportedly saves millions of dollars (and trees) per year by shipping NetWare documentation on CD-ROM rather than paper. Novell used SGML to its advantage in moving to the Web; in only a few days, a single person set up over 110,000 pages of NetWare documentation for Web delivery, using a Web server that can convert SGML portions to HTML on demand. The data is still stored and maintained in generic SGML using its original DTDs, and so is always up-to-date without a complicated conversion and update process.

Figure 44.2 : SGI customers access documentation using the IRIS InSight system.

Sun Microsystems, AutoDesk, Phoenix (of BIOS fame), and many others also use SGML heavily, and there are reports that Microsoft does the same in-house. As one SGML Web publisher put it, a lot of the information you have to have is going onto the Web in SGML. IBM started using a predecessor of SGML, called GML, long ago, and may have more data in SGML-like forms than anyone.

From Libraries and Universities

SGML is being used for finding aids, which are the equivalent of catalogs for unique items like special collections of archives, personal papers, and manuscripts. The University of California at Berkeley's library is spearheading this work, quietly converting huge numbers of finding aids into SGML and working with many other libraries to refine a DTD (see Figure 44.3). They can (and do) deliver this information easily on any medium, from CD-ROMs to the Web.

Figure 44.3 : Berkeley and many other libraries have cooperated to develop the "Encoded Archival Description" DTD to help give easy access to a wide variety of manuscripts and other collections.

Scholars and teachers also have put a lot of information into SGML and are starting to move it to the Web. The Brown University Women Writers' Project is collecting and coding as many English documents as possible from female authors prior to 1950. Several theological tools, such as CDWord, provide access to sacred texts, commentaries, and the like. And the complete works of philosophers as varied as Nietzsche, Wittgenstein, Peirce, and Augustine are in various stages of conversion to SGML.

The Oxford Text Archive and the Rutgers/Princeton Center for Electronic Texts in the Humanities are developing large literary collections in SGML; some parts are already available on the Web. Many individuals also encode and contribute their favorite literature, as part of research or teaching.

From Industry

High-tech industries moved to SGML very early because of its power for managing large documents. Aircraft and similar industries use many subcontractors; assembling complete manuals using parts from a variety of sources is hard unless you set up some standards. So the aircraft manufacturers and the airlines got together and set up a DTD. The companies that make central-office telephone equipment have done the same.

Not long after these industries went to SGML, the automobile and truck industries did also; companies like Ryder and FreightLiner have improved their speed of repairs and overall reliability using SGML. Other success stories abound in power companies, copier and other office machine companies, and many others.

From Government and the International Community

They say the U.S. Government is the world's biggest publisher, and it's probably true. The Patent Office puts out about 109 megabytes of new patent text per week (not counting figures); the Congressional Record adds a lot, too. Both of these are moving to SGML, though it's a challenge because they must be very careful not to disrupt current practices or delay delivery during the transition.

Internationally, there is much interest in SGML in Europe, and increasing interest in Asia. The International Organization for Standardization (ISO, despite the English word-order), which put SGML together in the first place, uses it for publishing some of its standards.

Why Is This Data in SGML, Not HTML?

Because of all these users, there is a lot of SGML data out there. Why did all these companies choose SGML instead of HTML? Mostly because it's a generic solution; it lets them use tags appropriate to the kinds of documents each one cares about. This means describing the document parts themselves rather than how they should appear on today's output device. This generic approach is why SGML data outlasts the programs that process it, and that can mean huge long-term savings. HTML can do this for a limited number of cases, but not in general. There are other reasons for using SGML:

Scalability. SGML has features, such as entity management, that make it easier to work with large documents. A printed airplane manual often outweighs the plane itself, and the documentation system better not choke.
Validation. SGML's ability to check whether documents really conform to the publisher's rules is important in industry, especially in the current world of liability lawsuits. However, validating a document doesn't ensure it makes sense, any more than spelling correctly ensures it makes sense.
Information retrieval. Big documents are hard to work with, and SGML tagging puts in the "hooks" you need to make search and retrieval software work much better. True containers for big organizing units are especially helpful here, like CHAPTER and SECTION instead of just H1 and H2.
Version management. High-tech manuals and ancient literature share a common problem because they come in many versions; it can make a big difference which one you get. Although not a true version-management system by itself, SGML has features that form a good foundation for one (such as marked sections, attributes, modularity, boilerplating, and so on).
Customizable presentations. This relates to version management, too. Because SGML doesn't predefine formatting and layout, delivery tools can customize the display for each user as needed: show extra hints for novices, hide secret information, and so on. This is what Ted Nelson (he invented the term hypertext) calls stretchtext: The document should smoothly expand and contract to match the user's interests.
Access for print disabled. Again because SGML gets away from formatting details, it is easy to convert SGML documents for delivery in Braille, via text-to-speech converters, and so on. Several books have been converted this way in record time.
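The validation point in the list above can be illustrated with a toy checker. This Python sketch is only an analogy: the element names, the ALLOWED_CHILDREN table, and the function find_violations are all invented here, and real DTD content models also control order and repetition, which this sketch ignores. It shows the kind of rule-checking a validating parser does, and also why a valid document can still be nonsense, since nothing here looks at what the text says.

```python
# Hypothetical rules: which element types may appear directly inside which.
ALLOWED_CHILDREN = {
    "MANUAL": {"CHAPTER"},
    "CHAPTER": {"TITLE", "SECTION"},
    "SECTION": {"TITLE", "PARA", "WARNING"},
    "WARNING": {"PARA"},
    "TITLE": set(),
    "PARA": set(),
}


def find_violations(element, path=""):
    """List the places where a child element breaks the rules.

    An element is a (name, children) pair; children is a list of elements.
    """
    name, children = element
    here = path + "/" + name
    if name not in ALLOWED_CHILDREN:
        return [here + ": unknown element type"]
    problems = []
    for child in children:
        if child[0] not in ALLOWED_CHILDREN[name]:
            problems.append(here + ": " + child[0] + " not allowed here")
        else:
            problems.extend(find_violations(child, here))
    return problems


# A PARA placed directly inside a CHAPTER, outside any SECTION.
document = ("MANUAL", [
    ("CHAPTER", [
        ("TITLE", []),
        ("SECTION", [("TITLE", []), ("PARA", [])]),
        ("PARA", []),
    ]),
])

for problem in find_violations(document):
    print(problem)
# prints: /MANUAL/CHAPTER: PARA not allowed here
```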

All these advantages apply to paper production, online delivery, and information retrieval. But once you lay out pages for print, most of these advantages disappear; once all the lines and page breaks are set, the page representation takes over, and getting back to the structure is very difficult.

Five Questions to Ask About Your Data

Given all the advantages of generic SGML for big projects, yet all the simplicity of HTML for simple ones, how do you decide which way to go? There are five questions you can ask that will help you choose.

What Functionality Do I Need?

If your documents fit the HTML model and consist mostly of the kinds of elements HTML provides, HTML is probably a good choice. This is especially true if the documents are also small (tens of pages, not thousands). But if you have big documents or documents with special structures or elements, SGML will take you a lot farther.

If you need to do information retrieval, SGML is also better. You can search HTML, but you can't easily pin down just where hits are. This is because the HTML tags don't divide data up as finely as you can with full SGML, and HTML doesn't typically tag large units such as sections. (Tags for such units, such as <DIV>, have been added only in the latest revision, and they're still optional.)

Finally, if you need to deliver in more forms than just the Web, you should consider SGML. Tools are available to turn SGML not only into Web pages, but into paper pages, most kinds of word processor files, CD-ROM publications, Braille, and many other forms. This can all be done with HTML in theory, but it's harder in practice.

Do I Need Flexible Data Interchange?

SGML eases data interchange in several ways. Because it helps you avoid using tags for things they don't quite fit, your data is easier to move to other systems, especially if the tags can take advantage of finer distinctions. For example, if you tag book titles, emphasized words, and foreign words as <I> in HTML, you have a problem when you move to something that can distinguish book titles and emphasis, such as a program to extract and index bibliographies. If you make the finer distinctions, you have a choice later whether to treat the items the same or differently.

Computers are pretty bad at sorting things into meaningful categories when they look the same. You almost need artificial intelligence to decide which italic text is a book title and which is something else. The good news is that computers are really good at the opposite task; if you've already marked up book titles and emphasized words as different things (say, <TI> and <EMPH>), it's no problem at all for a computer to show them both as italic.

Because of this, interchange is much easier down the road if you break things up early and make as many distinctions as practical. On the other hand, each distinction may be a little extra work, so you need to balance long-term flexibility versus how much time and effort you can put in up front. To figure out this balance, be sure to consider just how long you think your data will last (you're safest to at least double your first guess) and how important your data is.

Importance and lifespan don't always go together. Stock quotes are pretty important when they're current, but after a year only a few specialists ever look at them. At the other extreme, some literature that started out on stone tablets thousands of years ago is still important. Where does your data fit?

How Complex Are Your References and Links?

HTML has great strength as a linking system. This is mostly because URLs can point to any data in any format, and browsers provide a very convenient way to get any of that data. URLs (the most commonly available way of identifying information on the Net, though more advanced ways are coming) can get data via all these protocols (Web-speak for "methods") and others.

Protocol	Description
ftp	The data is copied down to your local machine.
http	The data is formatted and shown in the browser itself (or by a helper application for graphics, sound, video, and so on).
e-mail	Communication works like electronic mail.
news	Postings from network newsgroups are retrieved and presented.

HTML does all of this with only a few tags, mainly <A> and <IMG>. This means that the linking itself is not very complex or sophisticated, even though the data that the links point to is. For example, both <A> and <IMG> are one-way links; they live somewhere in document A and point to document B, as shown in Figure 44.4. But if you're in document B, you don't know that document A exists or that it points to you.

Figure 44.4 : The HTML <A> tag makes one-way links.

If you click a link and travel from document A to document B, most browsers will remember where you were and provide you with a Back button to return to the same document (though perhaps not to the same place in that document). That's an important feature, but not at all the same as also being able to get from document B to document A in the first place; with true two-way links, you know while in document B that there's a link from document A.

Note

It's also hard with HTML links to go from document A to a specific place inside document B because URLs normally point to whole files. HTML does give rudimentary support for getting a whole file and then scrolling it to some element with a given "name" (like an SGML ID). This is useful, but doesn't help much with larger SGML documents. With large documents, the problem of having to wait for the whole thing to download (even though you only need a small portion of it) becomes very important.

Link precision will probably improve in the future with conventions for a URL to give not only a file, but an ID or other location within a file, and to use this information to optimize downloading, not just scrolling. In fact, some servers already let you add a suffix to a URL to pick out a certain portion. For example, a server could let you put an SGML ID on as if it were a query, and then just serve up the element with that ID (including all its subelements, of course):

<a href="http://xyz.org/docs/book.sgm?id=chap4">
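Here is a rough Python sketch of what such a server might do with that URL. Everything in it is an assumption made for illustration: the DOCS table stands in for the server's files, element_for_url and ElementById are invented names, and the sample document is made up. It also leans on Python's html.parser, which is an HTML tokenizer rather than an SGML parser, so it only copes with fully tagged text that has explicit end tags.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs


class ElementById(HTMLParser):
    """Capture the markup of the element whose ID attribute matches."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.depth = 0  # greater than 0 while inside the wanted element
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.wanted_id:
            self.depth = 1
        if self.depth:
            self.pieces.append(self.get_starttag_text())

    def handle_data(self, data):
        if self.depth:
            self.pieces.append(data)

    def handle_endtag(self, tag):
        if self.depth:
            # The parser lowercases names; re-emit them in upper case
            # to match the sample document.
            self.pieces.append("</" + tag.upper() + ">")
            self.depth -= 1


def element_for_url(url, documents):
    """Return the whole document, or just the element named by ?id=..."""
    parts = urlsplit(url)
    wanted = parse_qs(parts.query).get("id", [None])[0]
    source = documents[parts.path]
    if wanted is None:
        return source
    finder = ElementById(wanted)
    finder.feed(source)
    finder.close()
    return "".join(finder.pieces)


# Hypothetical stand-in for the files the server holds.
DOCS = {
    "/docs/book.sgm": (
        "<BOOK>"
        "<CHAPTER ID=chap3><HEADING>Links</HEADING></CHAPTER>"
        "<CHAPTER ID=chap4><HEADING>Upgrading</HEADING>"
        "<P>Some text.</P></CHAPTER>"
        "</BOOK>"
    ),
}

print(element_for_url("http://xyz.org/docs/book.sgm?id=chap4", DOCS))
```

Only the chapter whose ID is chap4 comes back, subelements included, so the reader never has to download the rest of the book.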

Though you can simulate a bi-directional (or two-way) link in HTML, you have to do it by creating two links (one in document A and one in document B). This poses a couple of problems; the most important one is that you have to actually go in and change both document A and document B, so you can't just do this between any two documents you choose. Even if you can get at both documents to insert the links in the first place, it's easy to forget to update one "half" of the link when you update the other. Such links gradually tend to break.

What do other hypermedia systems do about this? The best ones, SGML-based or not, provide a way to create links that live completely outside of documents, in a special area called a web. (That name may change now that it's popular as a shorthand for the World Wide Web.) In that case, the picture looks more like Figure 44.5. Many systems provide both methods, not just one or the other.

Figure 44.5 : An external web lets you create two-way links.

This is a much more powerful system, and you can do it with a number of SGML linking methods, such as HyTime and the TEI guidelines, and some recent systems like Hyper-G. It seems to have originated with the Brown University InterMedia system. Doing links this way has these benefits:

Because links live outside the documents, anyone can create them without needing permission to change the documents themselves. You can even link in and out of documents on CD-ROM or other unchangeable media. This is especially important for big data like video, because it's still much more effective to keep local copies on CD-ROM or similar media than to download huge files every time they need to be viewed.
Because documents aren't touched every time a link is attached, they can't be accidentally trashed. Most HTML links have this advantage at one end since the destination document needn't be touched. But the only way for HTML to point to a particular place inside a destination document is via an ID; so to do that you may have to add one, and in that case HTML loses even this one-ended advantage.
Because a set of links is a separate thing, you can collect links into useful groups and ship them, turn them off or on, and so on. Siskel's and Ebert's links to movie-makers' home pages can be in two separate webs, so you can choose to see either or both.
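The idea of links that live outside the documents can be sketched as a small data structure. This is not HyTime or TEI syntax, just a Python illustration under made-up names: Link, LinkWeb, visible_links, and the example.org addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str  # identifier (for example, a URL) of one end
    target: str  # identifier of the other end
    label: str = ""


class LinkWeb:
    """A named set of links stored apart from the documents they connect."""

    def __init__(self, name):
        self.name = name
        self.links = []

    def add(self, source, target, label=""):
        self.links.append(Link(source, target, label))

    def links_from(self, doc):
        return [link for link in self.links if link.source == doc]

    def links_to(self, doc):
        return [link for link in self.links if link.target == doc]


def visible_links(webs, enabled, doc):
    """Links touching doc, taken only from the webs the reader turned on."""
    found = []
    for web in webs:
        if web.name in enabled:
            found.extend(web.links_from(doc) + web.links_to(doc))
    return found


STUDIO = "http://example.org/studios/home.htm"

siskel = LinkWeb("siskel")
siskel.add("http://example.org/reviews/siskel.htm", STUDIO, "review")
ebert = LinkWeb("ebert")
ebert.add("http://example.org/reviews/ebert.htm", STUDIO, "review")

# From the studio's page, see who points here, using only one web.
for link in visible_links([siskel, ebert], {"siskel"}, STUDIO):
    print(link.source, "->", link.target)
```

Neither the review pages nor the studio page is edited to make these links, and because each web is a separate object, a reader can switch either one on or off.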

If you don't need this more sophisticated linking, HTML's links may be just fine. Otherwise, you need to go beyond HTML and beyond what current HTML browsers can do. The good news is that such a web can still use URLs and related methods to do the actual references, so you can keep the power HTML gets from them. You can add URL support (or even the <A> and <IMG> tags themselves) to another DTD that packages them up to provide greater capabilities.

Note

TEI and HyTime links provide a very good way to express this kind of linking.

What Kind of Maintenance Is Needed?

There are two areas where HTML files run into maintenance problems that SGML can help with:

Links tend to break over time.
HTML itself changes through improvements such as new tags.

While the URLs and other identifiers that HTML uses for links are very powerful, the most common kind right now, the URL itself, is also fragile. A URL names a specific machine on the Internet, and a specific directory and filename on that machine (technically, this doesn't have to be true, but in practice it almost always is). This method has an obvious maintenance problem: What if the file moves? A URL-based link can break in all these ways:

The owner moves or renames the file, or any of its containing directories (say, to install a bigger disk with a different name).
The owner creates a new version of the file in the same place and moves the old one elsewhere. (There's an interesting question about which version old links should take you to, but you needn't get into that here.)
The owner's machine gets a new domain name on the Internet (for example, if someone else trademarks the name the owner had).
The owner moves to a new company or school and takes all of his data with him.

The Internet Engineering Task Force (IETF) is working hard on Uniform Resource Names or URNs, which let links specify names instead of specific locations. This is like specifying a paper book by author and title, as opposed to "the fifth book on the third shelf in the living room at 153 Main Street." URNs will make links a lot safer against simple changes like the ones just mentioned.

SGML provides a similar solution for part of the problem already, through names called Formal Public Identifiers or FPIs for entire documents or other data objects. SGML IDs for particular places within documents can be used both in general SGML and in HTML. By using FPIs or URNs to identify documents, you can ignore where documents live. When a document is really needed (such as when the reader clicks a link to it), the name is sent off to a "name server" that looks it up and tells where the nearest copy is. This works a lot like library catalogs and like the Internet routing system used for e-mail and other communications.
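The name-server idea boils down to a lookup from a name to the places where copies currently live. The Python sketch below is a bare-bones illustration, not a real resolution protocol: the REGISTRY table, the public identifier, the example.org addresses, and the resolve function are all invented, and "nearest" is reduced to "first one that answers."

```python
# Hypothetical registry: one public name, several known locations.
REGISTRY = {
    "-//Example Press//DOCUMENT Widget Manual//EN": [
        "http://mirror1.example.org/manuals/widget.sgm",
        "http://mirror2.example.org/docs/widget.sgm",
    ],
}


def resolve(name, registry, is_reachable):
    """Return the first reachable location registered for a name, or None."""
    for location in registry.get(name, []):
        if is_reachable(location):
            return location
    return None


# Pretend the first mirror has gone away; links by name still work.
def reachable(location):
    return "mirror1" not in location


print(resolve("-//Example Press//DOCUMENT Widget Manual//EN",
              REGISTRY, reachable))
# prints: http://mirror2.example.org/docs/widget.sgm
```

When a file moves, only the registry entry needs to change; every link that uses the name keeps working untouched.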

Note

You can make HTML links a little safer against change by using the BASE feature. Very often, a document will have many links that go to nearly the same place as the document itself, such as to several different files living in the same directory on the same network server, or in neighboring directories. When this happens, the beginnings of the URLs on those links are all the same, such as:

http://www.abc.com/u/xyz/docs/aug95/review.htm
http://www.abc.com/u/xyz/docs/aug95/recipe.htm

Instead of putting the full URL on every link, you can "factor out" the common part and put it on the BASE element in the header. The links all get much shorter, but the bigger plus is that you can update them all in one step if the server or a directory moves.

<BASE ID=b1 HREF="http://www.abc.com/u/xyz/docs/aug95/">
...
<A BASE=b1 HREF="review.htm">
...
<A BASE=b1 HREF="recipe.htm">
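If you want to see how a short link and a base combine, Python's standard urljoin function does the same kind of resolution a browser performs for relative URLs. This is only a small demonstration; the "archive" address used for the moved directory is made up.

```python
from urllib.parse import urljoin

base = "http://www.abc.com/u/xyz/docs/aug95/"
for href in ["review.htm", "recipe.htm"]:
    print(urljoin(base, href))
# prints: http://www.abc.com/u/xyz/docs/aug95/review.htm
# prints: http://www.abc.com/u/xyz/docs/aug95/recipe.htm

# Hypothetical move: only the base changes; the short links stay as typed.
new_base = "http://www.abc.com/archive/aug95/"
print(urljoin(new_base, "review.htm"))
# prints: http://www.abc.com/archive/aug95/review.htm
```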

HTML is constantly being improved. While this is a good thing, it also poses compatibility problems. In HTML 1.0, <P> was not so much the start of an SGML element as a substitute for the Return key. It was an EMPTY element, so the content of the paragraph was never actually part of the P element, and there was normally no <P> tag before the first paragraph in any section. This has been fixed in HTML 2.0, but funny things can happen if you view an old document in a new browser or vice-versa; for example, you might not get a new line for the first paragraph after a heading.

A newer issue is tables. HTML 2.1 adds a way to mark up tables and get good formatting for them; they can even adjust automatically when the reader changes the window-width. But what about tables in earlier documents? Authors often deleted their tables entirely, but when they couldn't, they had to type tables up e-mail style, using HTML's preformatted-text tag (<PRE>) and putting in lots of spaces:

<PRE>
....China....1400.million
....India.....800.million
....USA.......250.million
....France.....50.million
....Canada.....25.million
</PRE>

These will still work in a new browser (because the <PRE> tag is still around), but they don't get the advantages or capabilities that the new tables support. They won't rewrap to different window widths, you can't wrap text within a single cell, and so on. So you can end up with awful effects like this:

    China    1400 
million
    India     800 
million
    USA       250 
million
    France     50 
million
    Canada     25 
million

To get the new capabilities, you have to go in and actually change the documents. This is one reason it's considered bad form in SGML to use spaces for formatting. SGML helps you avoid this painful updating because you can represent your documents in whatever form makes sense for the documents themselves. That form is much less likely to change than the way you have to express it in one fixed DTD or system.

With SGML, if you need to accommodate software that doesn't handle your markup structures, you can use a "down-translation": a process that throws away anything that a certain HTML version can't handle. For tables, you can mark them up in any table DTD you want (CALS is the most popular) and use a program as needed to translate them to a simpler form, even the HTML 1.0-formatted kind. Then when table support is common in browsers, you just throw the down-translation program away and deliver the same data without conversion.

This works where "up-translation" won't, because computers are so much better at throwing information away than creating it. Tables are a lot like the earlier example with italics. If your DTD distinguishes book titles and a few (or a thousand!) other kinds of italics, it's easy to write a program to turn all of them into just <I> for HTML-only browsers. The reverse is much harder.
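As a rough illustration of how little a down-translation has to do, here is a minimal Python sketch that collapses several italic-like elements into plain <I>. The element names in ITALIC_LIKE are assumptions made up for the example (they don't come from any particular DTD), and the regular expressions assume attribute values never contain a ">" character.

```python
import re

# Hypothetical element names from a richer DTD; HTML can only show them as italics.
ITALIC_LIKE = ("BOOK-TITLE", "EMPH", "FOREIGN")


def down_translate(sgml: str) -> str:
    """Replace the start- and end-tags of italic-like elements with <I> and </I>."""
    names = "|".join(re.escape(name) for name in ITALIC_LIKE)
    # Start-tags may carry attributes; they are dropped because this sketch emits a bare <I>.
    sgml = re.sub(rf"<(?:{names})(?:\s[^>]*)?>", "<I>", sgml, flags=re.IGNORECASE)
    # End-tags carry no attributes, only optional trailing white space.
    return re.sub(rf"</(?:{names})\s*>", "</I>", sgml, flags=re.IGNORECASE)


print(down_translate("<BOOK-TITLE>Moby Dick</BOOK-TITLE> is <EMPH>long</EMPH>."))
```

Notice that the function only throws distinctions away. Nothing in its output tells you which <I> used to be a book title, which is exactly why the reverse trip is so much harder.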

Can I Make Do with HTML?

Given all these trade-offs, here are the main things to think about when making the HTML versus full SGML decision for Web delivery:

The form the data is already in. If your data is already in SGML (or in something conceptually similar, like LaTeX), it's much easier to stick with full SGML and have tags that fit your data naturally. This way you don't have to design a complicated set of correspondences, and whatever data conversion you do will be simpler.

The document size and number of authors. If your documents are small, don't have a lot of internal structure, and don't need to be shared among multiple authors or editors, HTML may be all you need. But a little Web-browsing easily shows the bad things that can happen when people try to break big documents into little pieces: the forest can be lost by dividing it into separate trees.

The structures needed for searching. If you need to do searches that target specific data in your documents, you'll probably need SGML to label that data. Doing without it is like doing a personnel database without having names for the fields; if you searched for people with salaries less than $30,000, you'd get not only that, but all the people who are less than 30,000 years old!

The frequency of changes. If your data is going to change frequently, you're better off in SGML, where you can modularize your documents using marked sections, entities, and other features.

All these things relate to each other, so you often can't answer one question without thinking about the others. One example is that frequent changes to a document matter a lot less if the document is really small and you have complete unshared control over it. But if a document is big and several authors have to cooperate to maintain it, frequency of changes matters a lot.
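The searching point above is easy to see in a small sketch. The Python below assumes a made-up personnel record with hypothetical PERSON, NAME, AGE, and SALARY elements; it is only meant to show the difference between searching labeled data and searching bare text, not how any real search engine works.

```python
import re

# Hypothetical tagged records; the element names are invented for this example.
RECORDS = (
    "<PERSON><NAME>Ann</NAME><AGE>29</AGE><SALARY>45,000</SALARY></PERSON>"
    "<PERSON><NAME>Bob</NAME><AGE>52</AGE><SALARY>28,000</SALARY></PERSON>"
)


def values_of(element: str, sgml: str) -> list:
    """Return the numeric content of every <element>...</element> in the text."""
    name = re.escape(element)
    pattern = rf"<{name}>\s*([\d,]+)\s*</{name}>"
    return [int(v.replace(",", "")) for v in re.findall(pattern, sgml, flags=re.IGNORECASE)]


def untagged_numbers(sgml: str) -> list:
    """Return every number in the text, with no idea what each one means."""
    return [int(v.replace(",", "")) for v in re.findall(r"\d[\d,]*", sgml)]


low_salaries = [n for n in values_of("SALARY", RECORDS) if n < 30000]
low_anything = [n for n in untagged_numbers(RECORDS) if n < 30000]
```

Here low_salaries holds only Bob's salary, while low_anything also picks up both ages, because without the tags a number is just a number.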

How to Use HTML Safely

If you choose to put your data in HTML rather than another SGML DTD, there are several things you can do to make a later transition easier. These things are also helpful in the short term because they make your HTML more consistent, portable, and reliable.

Make sure your HTML is really valid. Run it through an SGML parser (such as sgmls, yasp, or sp) or use one of the HTML "lint" programs. (They're called that because they go looking around for unwanted dirt that accumulates in dark pockets of HTML documents.) Weblint is one such program; you can find it at http://www.unipress.com/weblint.

Be very careful about quoting attributes. Any attribute value that contains any characters other than letters, digits, periods, and hyphens needs to be quoted (either single or double quotes are fine, but not distinct open/close curly quotes).

Tip

There are a couple of very common HTML errors that you can get away with in some browsers, but that will break others, and will prevent you from using generic SGML tools. The biggest one is failing to quote attributes, as just described. Probably the next biggest is getting comments wrong. These are right:

<!-- some text of a comment -->
<!-- another comment, with two text parts -- -- of which this is the second -->

But these are wrong (that is, they're not comments):

<!-- this comment never ends --!>
<! This is an SGML syntax error !>
<-- This is just data to SGML -->
<!-- This one -- really -- is not a comment -->

Avoid any part of HTML that is labeled "deprecated" in the HTML DTD or its documentation. Deprecated is a polite term standards use to say, "Don't use this; it's dangerous, not recommended for the future, and not even universally supported at present."

Be sure to use the HTML "DIV" containers, not just free-standing headings, especially in larger documents. This makes the structure of your document easier for programs to find and process, and it can also help you find tagging errors.

Avoid colliding with SGML constructs, even if some HTML parsers ignore them. For example, don't depend on an HTML parser failing to know that the string <![ starts a marked section, that <? starts a processing instruction, or that <!-- starts a comment; always escape such strings, for example, by changing the < to &lt;.
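Three of the rules above are mechanical enough to check with a few lines of code. The following Python is a minimal sketch, not a replacement for a real SGML parser or lint program: the function names are invented for the example, and the comment pattern assumes the standard SGML form in which a comment declaration is "<!" followed by zero or more "-- text --" comments and then ">".

```python
import re

# Unquoted attribute values may contain only letters, digits, periods, and hyphens.
UNQUOTED_OK = re.compile(r"[A-Za-z0-9.-]+")

# A comment declaration: "<!", zero or more "-- text --" comments, then ">".
# The text of each comment may not itself contain "--".
COMMENT_DECL = re.compile(r"<!(?:--(?:(?!--).)*--\s*)*>", re.DOTALL)

# A "<" that opens a marked section, a processing instruction, or a comment.
SGML_OPENER = re.compile(r"<(?=!\[|\?|!--)")


def needs_quotes(value: str) -> bool:
    """True if an attribute value has to be quoted (empty values count too)."""
    return UNQUOTED_OK.fullmatch(value) is None


def is_comment(text: str) -> bool:
    """True if the whole string is one well-formed comment declaration."""
    return COMMENT_DECL.fullmatch(text) is not None


def escape_openers(text: str) -> str:
    """Make literal "<![", "<?", and "<!--" safe as data by escaping the "<"."""
    return SGML_OPENER.sub("&lt;", text)
```

For instance, needs_quotes("review.htm") is false but needs_quotes("http://www.abc.com/") is true because of the colon and slashes, and escape_openers leaves ordinary tags such as <P> alone.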

Challenges of Upgrading

If you decide to put your data in an SGML DTD other than HTML, there are a few "gotchas" to watch out for. None are fatal, but you'll want to start out knowing the rules of the game. The issues are briefly summarized here.

Fewer Browsers to Choose From

At this time, only a few networked information browsers can receive and format SGML regardless of the DTD. Most Web browsers have the HTML tag names built right into the program and require a new release to add new ones. This is true even if the new ones don't require any new formatting capabilities; adding a BOOK-TITLE element type won't work, even though you may only want it to mean "show in italics."

The main exception that is already released is a viewer called Panorama, developed by Synex and marketed by SoftQuad. Panorama is an add-on "helper" to existing browsers, like various graphics viewers. This means it does not talk to the network by itself; instead, when a Web browser follows a link and notices that the data coming back claims to be "SGML," it can forward the data to Panorama for display.

If there are Internet-based links in the SGML, Panorama calls the browser back to retrieve them. If the destination is HTML or gif, it shows up directly in the Web browser. If it's SGML, the browser calls Panorama again.

Another SGML-capable Web browser is a new version of the DynaText SGML delivery system that can view SGML or HTML off a hard disk or CD-ROM, across the Net, out of a database, or from a compiled/indexed form used for big documents. It provides a unified environment for viewing all these data types, as well as graphic and multimedia formats.

Although there aren't many SGML-capable Web browsers, these two are very flexible and give you a lot of control over formatting, style, and other capabilities. Hopefully, more browsers will start to support generic SGML over time.

In the meantime, there are several server-end options available, too. You can always create and maintain documents using full SGML, and then run a conversion program to create HTML from it and put that on the Web. This is especially useful if you have an SGML-based authoring system in use for general publishing or other applications.

There are also Web servers available that can store SGML directly and then translate it to HTML on demand (for example, DynaWeb from Electronic Book Technologies; you can try it out at http://www.ebt.com). This method has the advantage that you can adjust the translation rules any time without rerunning a big conversion process over all your data. It also means the translation can be customized as needed, for example, to adjust to whichever browser is calling in, or even to modify the document by inserting real-time information during translation.

A DTD to Choose or Design

Even if you have all the software you need, with full SGML you'll need to answer a question that never arises with HTML: What DTD should I use? Very good DTDs are already available for a wide range of document types, and you can probably put off DTD-building for as long as you want by using them. This makes the task a lot easier. But even so, you have to think about your documents and then learn at least enough about a few DTDs to make a choice. You may also want to tweak an existing DTD; this is easier than starting from scratch, but still takes skills beyond those needed for tagging.

More Syntax to Learn

If you want to make up your own DTDs, you need to deal with all kinds of declarations, parameter entities, content models, and so on; there's a lot of syntax to learn (tools like Near & Far help a lot). If you use an existing DTD, there is less syntax to worry about, but there's still a little more than with HTML.

SGML provides many ways of saving keystrokes in markup, and many special-purpose constructs you never see used in HTML. Using these constructs in an HTML document will result in errors of one kind or another. For example, if you try to "comment out" a block of HTML with a marked section, its content is still there because typical HTML parsers don't recognize marked sections. In fact, for those parsers, the characters <![ IGNORE [ and ]]> all count as text content!


<P>

<![ IGNORE [ This text is not part of the document, really.

   In fact, it's <EMPH>really </EMPH> not there. ]]>

   And the paragraph goes on right here.

</P>

In an HTML application that isn't quite following the rules, this might be taken as just a paragraph that starts with some funny punctuation marks (a really bad HTML implementation might instead complain that you used a tag named ![). If you got used to this, you might be surprised when you go to a more generic SGML system and discover that the <![ in your document causes some very different effects. This is something you just have to memorize and know. In this case, the first two lines within the paragraph are not part of the content at all, and a browser shouldn't show them to you.
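To make the difference concrete, here is a small Python sketch of what a generic SGML system effectively does with the example above: it drops the IGNORE marked section before anything is displayed. This is a simplification under stated assumptions; the function name is invented, nested marked sections are not handled, and the status keyword is assumed to be written out literally as IGNORE rather than supplied through a parameter entity.

```python
import re

# "<![", the keyword IGNORE, "[", then everything up to the first "]]>".
IGNORE_SECTION = re.compile(r"<!\[\s*IGNORE\s*\[.*?\]\]>", re.DOTALL)


def strip_ignored(sgml: str) -> str:
    """Remove IGNORE marked sections, as a generic SGML parser would skip them."""
    return IGNORE_SECTION.sub("", sgml)


paragraph = (
    "<P><![ IGNORE [ This text is not part of the document, really. "
    "In fact, it's <EMPH>really </EMPH> not there. ]]>"
    "And the paragraph goes on right here.</P>"
)
print(strip_ignored(paragraph))
```

An HTML parser that doesn't know about marked sections skips this step entirely, which is why the same file can look so different in the two kinds of software.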

Using a WYSIWYG SGML editor helps a lot, for the same reasons that using MS Word is a lot easier than typing Microsoft's RTF interchange format directly. But even with the best tools, you can be surprised if you're not aware of such restrictions. For example, you might get a "beep" whenever you try to type <![ in a paragraph, and not know why.

Benefits of Upgrading

If there's less delivery software to choose from and more to learn, why bother? The reasons are mostly the same ones that influenced big publishers to go with SGML, although which reasons are most important varies from project to project.

Platform Independence

Other SGML DTDs are even better than HTML at abstracting away from formatting. SGML can be retargeted to anything from a top-line photocomposition system down to text-only browsers like Lynx, Braille composers, and anything in between. SGML itself contributes greatly to this flexibility. HTML accomplishes this to some extent, but less so because a small and fixed tag set can force authors to think more about display effects and less about describing structure.

Browser Independence

Because generic SGML software (by definition) handles many DTDs, using a new or modified DTD won't faze it. If it works for CALS and TEI, it'll almost certainly work for whatever DTD you choose.

SGML vendors spend a lot of time testing interoperability. A standard demo at trade shows used to be to pass a tape or disk of SGML files from booth to booth throughout the show. Each product had to read the data, do whatever it did with the data (like let you edit or format it), and then write it out to pass on, all without trashing it.

The "SGML Open" vendor group gets together regularly online, at shows, and at special meetings to work out agreements on details and make sure SGML documents can move around easily. For example, a popular DTD for tables has a "rotate" attribute to let you lay out tables in either portrait or landscape mode, but doesn't say whether rotation is clockwise or counter-clockwise. The vendors sat down and decided, so now they all do it the same way. Simple agreements like this can save a lot of pain for end users.

Note

The central point for finding out about SGML Open activities is http://www.sgmlopen.org. Most companies that support SGML are involved in SGML Open, and you can find links to their home pages from the SGML Open Web site, along with links to other useful SGML information.

If you use an SGML-aware server, you can benefit from greater browser independence, even on the Web. Each Web browser has its own strengths and weaknesses. If you can ship slightly different HTML to each one, you can capitalize on the strengths and avoid the weaknesses. This is easier if your data uses a more precise DTD; clients tell servers who they are, so a server that has enough information can down-translate appropriately for each one.
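A server-side translation of this kind might look something like the Python sketch below. It is only an illustration: the browser names in TABLE_CAPABLE are invented, the function is not taken from any real server product, and a production system would also need to escape cell text and handle far more than tables.

```python
# Hypothetical identification prefixes for browsers assumed to understand table markup.
TABLE_CAPABLE = ("ExampleBrowser/2", "ExampleBrowser/3")


def render_rows(rows, user_agent):
    """Emit table markup for capable clients and preformatted text for the rest."""
    if user_agent.startswith(TABLE_CAPABLE):
        body = "".join(
            "<TR>" + "".join(f"<TD>{cell}</TD>" for cell in row) + "</TR>"
            for row in rows
        )
        return f"<TABLE>{body}</TABLE>"
    # Fallback: pad every column to its widest cell so the text lines up in <PRE>.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]
    return "<PRE>\n" + "\n".join(lines) + "\n</PRE>"


rows = [["China", "1400 million"], ["USA", "250 million"]]
```

Calling render_rows(rows, "ExampleBrowser/2.0") yields table markup, while any other identification string gets the aligned <PRE> version. The source data stays the same in both cases; only the filter's output changes.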

HTML Revision Independence

Keeping your data in SGML also lets you avoid recoding it each time a new HTML feature arrives. You learned earlier about tables: how you'd have to completely rework them if you started by assuming the browser can't support table markup, and then had to change your data when browsers caught up. The same problem came up when Netscape introduced its FRAME element and a lot of reauthoring had to happen. The same problems can happen with any kind of markup. By keeping your documents in DTDs designed to fit, you can leave them untouched and merely adjust a conversion filter.

Appropriate Tag Usage

The biggest fundamental benefit of going to SGML is that your markup can tell the truth about what components are in your document, even if the document doesn't fit into any pre-existing scheme. If the tags you need are there (or, at worst, you can add them yourself), you avoid having to "pun" and use a single tag for a bunch of purposes it may not have been meant for.

Note

The question of having the right tag available for the job is very important, so here are a few examples. We've already talked about how sixth-level HTML headings (<H6>) get used to mean small caps, and how italics (<I>) get used to mark many things like emphasis, foreign words, book titles, and so on.

Sometimes preformatted text (<PRE>) gets used for quick-and-dirty tables. Line-break (<BR>) gets used heavily for forcing particular browsers to lay things out a certain way (and usually that way only works well for certain browsers, certain window widths, and so on).

Another big example is equations; since there are not yet HTML elements for doing math, journal publishers and others are stuck turning equations into graphics for Web delivery. This sort of works, but the fine print tends to disappear, and zooming in doesn't help. This is a case where there's dire need for a more adequate set of tags. And there are already some very good equation DTDs in wide use outside the Web.

Large Document Management

SGML helps you manage the conflict between big documents and slow modems. You can't very well ship a whole manual or a lengthy paper of any kind every time a user wants to see the nth paragraph (even if browsers could handle documents that big, which many can't); no user would wait for the download to finish. Novell certainly couldn't ship tens of thousands of pages of NetWare manuals every time a user wanted a summary of some installation detail.

The only viable option with documents bigger than several tens to hundreds of pages is to break them up; you can make many smaller documents, say one for each subsection, and a bunch of overview documents that give you access similar to the table of contents in a paper book. This is usually done manually for HTML because HTML documents don't usually contain explicit markup for their larger components. (Some do now that HTML has added the DIV element.) This method works except for these problems:

If you are also publishing a paper document, you have to maintain two quite different forms.
The document ends up in many pieces that aren't visibly related; only a person can tell whether some link between HTML files A and B means they're part of the same document, or two somehow-related documents. This makes it hard to maintain consistency between all the parts of your original document.
If users want to download the whole document for some reason, it's very hard to do. First, they have to find all the pieces, distinguishing "is-part-of" links from "is-related-to" links; then they have to assemble all the parts in the right order and put the larger containing structures in. It's not enough to just pack them end-to-end because some of the connections between lower sections appear only in "header" or "table of contents" documents.
Users can't scroll smoothly through the complete document; at best, you can carefully provide Next Portion and Previous Portion buttons on every piece.

Internationalization

A final benefit of other SGML DTDs over HTML is that they have more provisions for international and multilingual documents. HTML prescribes the "Latin 1" character set. Latin 1 includes the characters for most Western European languages, but not Eastern European, Asian, or many other languages. Future revisions will probably support "Unicode," a new standard that includes characters for nearly all modern languages. SGML itself lets each document specify a character set and doesn't particularly care whether characters are one, two, or more bytes wide.

Many DTDs also provide a way to mark that individual elements are in different languages. This can have a big effect on display and searching. For example, it helps a lot if you're searching for the English word "die," to not get the German word "die," which means roughly "the," and is very common.

DTDs that specifically mark language are also very helpful when you want to create multilingual documents or documents that can customize to the reader's language. You can create documents where every paragraph has a subelement for each language, and then set up your software to show only the type the user wants; this automatically customizes the document for the reader's own language:


<P>

   <ENGLISH>...</>

   <FRENCH>...</>

   <ITALIAN>...</>

   <GERMAN>...</>

   <SPANISH>...</>

   ...

</P>
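The "show only the type the user wants" step could be sketched in Python roughly as follows. This is an assumption-laden illustration rather than how any particular SGML product does it: the function name is invented, it relies on the empty end-tag form (</>) used in the example above, and it assumes the language subelements contain plain text with no nested markup.

```python
import re

# A language subelement: <NAME>, text with no nested tags, then the empty end-tag </>.
SUBELEMENT = re.compile(r"<([A-Za-z]+)>([^<]*)</>")


def pick_language(paragraph: str, language: str):
    """Return the text of the subelement for `language`, or None if it is missing."""
    for name, content in SUBELEMENT.findall(paragraph):
        if name.upper() == language.upper():
            return content.strip()
    return None


paragraph = "<P><ENGLISH>the</><GERMAN>die</><SPANISH>el</></P>"
print(pick_language(paragraph, "German"))
```

Because the language is part of the markup, the same lookup also tells a search program that this "die" is German, so it can be left out of a search for the English word.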

Summary

SGML is especially strong for large or structured documents, documents for which several authors share writing and editing, and documents that have components HTML doesn't provide. A single DTD such as HTML may not provide the types of elements your documents need, in which case you end up using some other type because it gets the desired appearance in the authoring software. This leads to problems down the line. HTML also has only limited support for expressing larger units such as sections, and that makes document management a bit harder.