Version 1.9 Build 000

HTML Language and Resource Guide

Alan H. Bridle
National Radio Astronomy Observatory
520 Edgemont Road
Charlottesville, VA 22903-2475

HTML 3.2 last updated 21 June 1996, 14:44 EDT
Master URL: http://aips2.nrao.edu/aips++/docs/html/htm4aips.html

Purpose

This document summarizes information about, and provides links to, HTML language standards, manuals, style guides, and other tools available via the World-Wide-Web. It is oriented to the needs of scientific documentation at the NRAO, but may also interest anyone exploring the use of HTML for other purposes.

Why hypertext?
What is HTML?
- HTML Levels and Standards
- HTML Manuals, Tutorials and HTML-oriented Web Sites
Generating HTML
Converting HTML to other formats
- To ASCII text
- To PostScript

1. Why hypertext?

Hypertext is attractive for documentation systems that have a wide variety of users, whether at multiple institutes (via the Internet) or within one organization (via an Intranet). Hypertext lets users explore documentation along individualized paths, navigating it in ways that match their interests or their level of understanding. Using HTML and the World-Wide-Web, hypertext also provides rapid publication and updating of information to users around the world.

Hypertext browsers can also serve information on-screen in ways that can be coupled to software, allowing interaction with user inputs. Tutorial documents can be integrated with on-line "applets", or with form-based interfaces to larger software packages. This allows text-based tutorials to be integrated with short "multimedia" demonstrations, access to databases, even fully-developed user interfaces to complex software packages.

2. What is HTML?

HTML is the Hypertext Markup Language, a method originated at CERN for formatting and linking documents, images and other information using tags enclosed in <angle brackets>. When allied with graphical browsers for displaying it on a wide range of computers, HTML became the basis for the rapid growth of the World-Wide-Web. This growth also saw the proliferation of browser-specific dialects of HTML; many of these provided "extensions" that are attractive to commercial Web sites, but are not standardized. The existence of these HTML dialects has encouraged some documenters to develop for particular WWW browsers, reducing inter-operability (but sometimes gaining commercial advantage).

While the Balkanization of HTML has been an attractive strategy for some commercial ventures, inter-operability based on global standards is usually more important in the long run for technical documentation. Developing for the browser du jour is rarely a good strategy relative to maximizing the portability of technical documentation at a multi-user facility like an observatory. What, then, are the long-term HTML Standards?

HTML Levels and Standards

HTML Level 1

http://www.w3.org/pub/WWW/MarkUp/HTML.html documents the Level 1 specification. This initial level did not support forms (data entry), scalable tables, formulae or scientific symbols. It was written by Tim Berners-Lee while at CERN and Dan Connolly while at Convex Computer Corp.

HTML Level 2

This level includes forms for user input, but does not support scalable tables, formulae, or scientific symbols.

http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec_toc.html is the Level 2 draft specification. It is now under final review by the HTML Working Group of the Internet Engineering Task Force (IETF).

Many browsers, including NCSA Mosaic and Netscape Navigator, use features of HTML Level 2. Some also use proprietary features that are not part of this standard. Netscape Navigator supports ad hoc extensions to HTML 2.0 which are known informally as NHTML. Although their authors "expect" some of these to be part of future standards, other browsers may ignore them. To maximize portability, avoid using such browser-specific HTML features that are not part of the Level 2 draft standard.

Unfortunately, HTML Level 2 cannot directly display characters outside the ISO 8859-1 (Latin-1) character set, nor can it format resizable tables. Its character set contains a few non-Latin entities that are used by scientists and engineers, but not all browsers interpret them all correctly. Until its math standard is agreed on and implemented (at some future stage of development of Level 3), HTML is therefore ill-adapted for some scientific writing.

HTML Level 3

The Level 3.0 draft included such norms of technical documentation as resizable tables, captioned figures and mathematical equations. It also allowed more flexible layout control (e.g. text flowing around figures), and supported links to common multimedia formats such as sound sequences and MPEG movies. Level 3.0, sometimes called HTML+, was never widely deployed in browsers, however.

Some parts of the proposed standard were supported by Netscape Navigator (1.1 and higher) and by NCSA Mosaic, but Netscape in particular adopted its own extensions to the Level 3.0 specification.

http://webreference.com/html3andns/ has a particularly clear discussion of the differences between HTML 3.0 and NHTML: "HTML 3.0 and Netscape 3.0: How to tame the wild Mozilla".

In May 1996, the World Wide Web Consortium (W3C) at MIT, in consultation with vendors including IBM, Microsoft, Netscape Communications, Novell, SoftQuad, Spyglass and Sun Microsystems, announced a new HTML 3.2 specification. This specification, code-named Wilbur, adds such already-deployed features as tables, Java applets and text flow around images while providing backward compatibility with Level 2. It will also provide extensions for multimedia objects, scripting, style sheets, improved layout, higher quality printing and math. There is some hope that the Level 3.2 specification, having been developed with more input from the commercial browser-writers, has a better chance of deployment than the ill-fated Level 3.0.

Microsoft's Internet Explorer Version 3.0 uses a variant of the HTML 3.2 specification.

W3C's demonstration browser for Level 3 is called Arena. It is available for Linux, Solaris, SunOS, DEC and SGI systems. Although still somewhat buggy, even using its own demonstration files(!), Arena illustrates future possibilities for documentation using HTML 3. A demonstration of Arena's capabilities (as screen dumps viewable on other browsers) is available at http://www.csd.uwo.ca/~tzoq/HTML3/.

There is an archive of discussion on the IETF HTML Working Group's mailing list at http://www.acl.lanl.gov/HTML_WG/archives.html.

HTML Manuals, Tutorials and HTML-oriented Web Sites

The WWW has many useful resources relevant to HTML. Good language manuals and tutorials are available at:

"Introduction to HTML" by Ian Graham, University of Toronto (http://www.utirc.utoronto.ca/HTMLdocs/NewHTML/htmlindex.html)
"A Beginner's Guide to HTML" from NCSA (http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html)
"The Art of HTML" a comprehensive guide to HTML and Web development resources by Krish Menon at http://www.taoh.com/index.htm.
"The WWW and HTML Developer's Jump Station" by Barry Raveendran Greene at Johns Hopkins University (http://oneworld.wa.com/htmldev/devpage/dev-page.html).
"HTML Guide" from Web Communications at http://www.webcom.com/html/tutor/
"HTML Reference Manual" from Sandia Laboratories at http://www.sandia.gov/sci_compute/html_ref.html.
HTML Writers Guild, a useful source of tutorials and articles about HTML, with guides to specifications, tools, HTML journals and publications, conferences, etc. at http://www.hwg.org/
HTML Station, a compilation of HTML reference information by John December.

For HTML Style Guides, you might consult:

"Index of Guides to Writing HTML Documents" maintained at the University of Illinois (http://union.ncsa.uiuc.edu:80/HyperNews/get/www/html/guides.html)
"Elements of HTML Style" (brief, but good advice) at http://www.book.uci.edu/Staff/StyleGuide.html
"Composing Good HTML" (longer) by Eric Tilton, Carnegie-Mellon U. (http://www.cs.cmu.edu/~tilt/cgh/)
"Hints for Web Authors" (long, but good advice) by Warren Steel at the University of Mississippi (http://www.mcsr.olemiss.edu/~mudws/webhints.html)
"WWW Style Manual" (comprehensive) by Patrick J. Lynch at the Yale Center for Advanced Instructional Media (http://info.med.yale.edu/caim/StyleManual_Top.HTML)
"Guide to Web Style" (comprehensive) cookbook at Sun Microsystems for creating Web pages (http://www.sun.com/styleguide/)
The AIPS++ HTML Style Guide (specifically written by me for the AIPS++ Project, and therefore still under development!)

3. Generating HTML

HTML from TeX or LaTeX

Most existing documents on concepts, algorithms, instruments and mathematical methods that are relevant to astronomers are written in TeX or LaTeX. Because LaTeX-based packages are used to prepare and submit scientific articles to journals, most scientists who will contribute to astronomical documentation systems are also more familiar with LaTeX than with HTML. Finally, TeX and LaTeX are well suited to the mastering and printing of large documents such as manuals, lecture notes and conference proceedings.

We can therefore expect to ingest much TeX/LaTeX material when constructing astronomical documentation in HTML. This process must be automated at least to the point where only minor hand-work is needed to bring scientific text and graphics into the documentation system.

We will probably need several approaches for this for most of the 1990s.

One is to convert whole TeX and LaTeX documents directly to PostScript files that can be displayed by WWW-compatible browsers. They can be referenced (jumped to) from HTML documents in a documentation system but cannot directly refer back (re-link) to the HTML-based system. This route is trivial to implement. It is suitable for "one-way" access to entire symbol-rich documents. The original document's format is preserved, whatever its merits.
Another is to convert as much as possible of the incoming TeX or LaTeX document directly to HTML. Links between the new document and the rest of the documentation system can then be two-way. The new document can then participate fully in a hypertext structure. New linkages can also be created within a long incoming document, making it easier to browse on-line.

The first approach is straightforward but does not integrate incoming documents fully into a hypertext system. The second lacks a robust implementation, and may continue to do so until the Level 3.0 standard has been around for a while. Its most plausible vehicle at the moment is LaTeX2HTML (see http://cbl.leeds.ac.uk/nikos/tex2html/doc/latex2html/latex2html.html) by Nikos Drakos of Leeds University. Tests of LaTeX2HTML on LaTeX files from the NRAO astronomy documentation have revealed detailed problems with this converter, however. It is unable to parse some perfectly correct TeX constructs and can produce incorrect results, including symbol substitutions and garbled equations. Even if the translator were bug-free, some problems of principle would remain from this method's use of "transparent" images (GIF89a format) to represent symbols and equations in the original. This approach has several disadvantages:

Some Web browsers open a separate http session for every image that is embedded in the document. This can make a complicated document slow to load.
Using many GIF images in a document source file also makes it clumsy to edit as an HTML document. It is more attractive for the LaTeX version to remain as the master. The conversion, which can be time-consuming for a large document, must be repeated as the document evolves.
The images cannot be resized at the browser. The output document looks reasonable only with a limited range of font sizes and styles selected in the browser - which may not include the reader's usual default. Aligning characters between images and text is an ongoing problem and solutions tend to be browser-specific (non-portable).
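The first disadvantage above can be gauged before a converted document is published. The following is a minimal sketch in Python, using only the standard library's html.parser module, that counts the in-line images in an HTML file. The class name, function name and sample markup are illustrative assumptions, not part of LaTeX2HTML or of any tool mentioned here.

```python
from html.parser import HTMLParser


class ImageCounter(HTMLParser):
    """Collect the SRC attribute of every <IMG> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling us.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def count_inline_images(html_text):
    """Return (number of <IMG> tags, number of distinct image files)."""
    counter = ImageCounter()
    counter.feed(html_text)
    counter.close()
    return len(counter.sources), len(set(counter.sources))


# Hypothetical fragment of converter output: each symbol is a small GIF.
sample = (
    '<P>The flux density <IMG SRC="img1.gif" ALT="S"> scales as '
    '<IMG SRC="img2.gif" ALT="nu^alpha">, where '
    '<IMG SRC="img1.gif" ALT="S"> is measured in Jy.</P>'
)

total, distinct = count_inline_images(sample)
print(total, "in-line images,", distinct, "distinct files")
```

A browser that caches images may fetch only the distinct files, so the second number is the more optimistic estimate of the extra requests a page will cost.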

Eventually, the HTML standard will allow mathematical and Greek symbols to be incorporated directly as HTML elements. Our technical documentation should use the standard as soon as it is settled and is available in a competent, low-cost browser. Until this happy state of affairs is realized, LaTeX-to-HTML conversion of scientific documents should probably be limited to those that demand two-way links (to and from) the rest of a documentation system.

From scratch

HTML files can be generated using any editor that emits ASCII files. There is no reason in principle not to write them in emacs, Word Perfect or Microsoft Word, once the author either:

is familiar enough with the HTML standard to type directly in conformity with it, or
has a macro package or output filter that translates into standard HTML from the editing session. Examples are the www-mode or HTML-helper mode for emacs, various wp2html filters for Word Perfect, and the Internet Assistant for Microsoft Word.

A list of filters for converting word-processor formats to HTML is kept at http://www.w3.org/pub/WWW/Tools/Word_proc_filters.html.

Several specialized tools that can simplify writing HTML from scratch are worth attention, however:

HoTMetaL by SoftQuad Inc. is an SGML-compliant editor for HTML with many tools for authoring valid HTML structures. Both the freeware and Pro (registered) versions run under MS-Windows or X-Windows. Its advantages are:
- rigorous rule-checking: it is almost impossible to generate invalid HTML so it is a good, if stern, tutorial on the language as well as an editor
- it has both a WYSIWYG mode and one that displays HTML tags in easily-recognized boxes, showing the structure of the file clearly
- structure-based editing tools speed many operations
- it can be used to standardize non-compliant HTML
Its disadvantages are:
- it can be hard to read highly non-compliant HTML into the freeware version. (Some users complain that "HTML files that display o.k. under NCSA Mosaic or Netscape Navigator cannot be read in". Such files contain invalid elements that the browsers ignore or replace "on the fly", so they are subject to the whims of browser-writers.) The Pro version has several filters for standardizing non-compliant files and is easier to use than the freeware version.
- needs a lot of memory (4 MB barely supports the freeware; the Pro version must have 6 MB)
- can be slow on simple cut-and-paste operations
Writing standardized HTML is a good way to optimize the output across the space of all competent browsers. This is a better strategy than optimizing for a favored browser that may be unsupported, or merely unpopular, a few months ahead. There are other rules-checkers for HTML (see below), but HoTMetaL may contain the best that is integrated with an editor. The Pro version of HoTMetaL includes a spelling checker, thesaurus and macro creator and has special filters for use with Netscape Navigator's HTML extensions.
HTML Assistant by Howard Harawitz is a free HTML editor for MS-Windows. It provides a convenient button bar for inserting HTML elements into a document, and a URL cataloger based on browser bookmark files. It does not check rules for incoming or outgoing documents, so it is superficially more benign than HoTMetaL. It may have a role for people who want to speed up generating HTML without worrying about the niceties of standardization. The freeware version is limited to files about 62k in size; a Pro version removes this limitation.
HTML Writer by Kris Nosack at Brigham Young University is a free MS-Windows HTML editor. It makes it particularly easy to write the more complicated tags and is unusually user-friendly. It does no rules-checking, but has a simple "test" interface for viewing the current file with the browser(s) of your choice. It is a good, fast tool for small machines once you know the HTML rules well.
Sausage Software offers a plethora of Web-related tools for MS-Windows, including an HTML Editor called Hot Dog, which I have seen favorably reviewed but have not used myself.
asWedit by AdvaSoft Ltd. is an HTML 2 and HTML 3 editor for the X Window System and Motif, available free of charge to students and staff in education and non-profit organizations. Version 2.5 was released in April 1996. Binaries are available for IBM AIX, DEC Alpha, HP, SGI, Linux, Solaris 2.4 and 2.5, SunOS 4.1.3, and Ultrix systems.
Phoenix is a freeware stand-alone HTML editor for SunOS 4.1.3 and Solaris 2.3 by Lee Newberg at the University of Chicago. I believe it does some rules-checking but I have not tried it myself. The current version is an alpha release.
tkHTML is a freeware HTML editor from Liem Bahneman, based on the Tcl script language and the Tk toolkit for X11. It supports WYSIWYG previewing and short-cut keys but does not do rule-checking.

An extensive List of HTML Editors for all platforms can be found at http://union.ncsa.uiuc.edu/HyperNews/get/www/html/editors.html.

Checking HTML

As noted above, HTML files are plain text files that can be generated by any editor. HTML checkers make sure that all tags in HTML files are placed and nested meaningfully, and that the files contain all the required information. Simply reading an HTML file into someone's favorite browser is not a good way to check it for validity, and certainly does not imply portability! The fluid nature of the standards beyond Level 2, and the enthusiasm of browser-writers for extending them or ignoring them, mean that what looks valid or beautiful to one browser need not be conforming HTML that will display sensibly on another.

The HTML standard also contains features that are not required and are disregarded by many of the currently popular browsers but which will be turned on by more advanced browsers and indexing systems as the worldwide use of HTML matures. For example, the <HEAD> and <BODY> tags are not required by most browsers but are worth including as they can be used to speed up document-indexing systems. The use of an <HTML> container tag for the entire document is optional but may be used by future browsers to positively identify the file as one that is to be interpreted using the HTML Document Type Definition (DTD).
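As an illustration of the optional structure described above, here is a minimal Python sketch that reports which of the document-level markers (the <!DOCTYPE ...> prologue and the <HTML>, <HEAD> and <BODY> tags) a file lacks. It uses the standard library's html.parser module. The names StructureReport and missing_structure and the two sample documents are assumptions made for this example; the public identifier shown is the one for the HTML 2.0 DTD.

```python
from html.parser import HTMLParser


class StructureReport(HTMLParser):
    """Record which document-level markers a file actually contains."""

    def __init__(self):
        super().__init__()
        self.found = {"doctype": False, "html": False, "head": False, "body": False}

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text between "<!" and ">".
        if decl.lower().startswith("doctype"):
            self.found["doctype"] = True

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <HTML> and <html> are equivalent.
        if tag in ("html", "head", "body"):
            self.found[tag] = True


def missing_structure(html_text):
    """Return the names of the optional markers that the text omits."""
    report = StructureReport()
    report.feed(html_text)
    report.close()
    return [name for name, seen in report.found.items() if not seen]


full_document = (
    '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n'
    "<HTML>\n"
    "<HEAD><TITLE>Test page</TITLE></HEAD>\n"
    "<BODY><P>Hello.</P></BODY>\n"
    "</HTML>\n"
)
bare_document = "<TITLE>Test page</TITLE>\n<P>Hello."

print(missing_structure(full_document))
print(missing_structure(bare_document))
```

Most browsers will display both sample documents, which is why a check of this kind has to be made on the source text and not by eye.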

HoTMetaL Pro, an X/MS-Windows editor, interprets the HTML standard more strictly than most browsers, and can be set to ignore speculative browser-specific extensions. A severe, but ultimately satisfactory, way to check HTML for cross-platform validity is to read it into HoTMetaL and correct anything that HoTMetaL complains about. HoTMetaL also has the attractive feature that it will fully standardize almost-compliant files that are read into it, for example adding the currently optional </P> tags where appropriate to close paragraph-container elements. Documents written out by HoTMetaL will also contain the SGML <!DOCTYPE ...> prologue that is recommended in the HTML standard to name the parsing DTD explicitly. Consult http://www.sq.com/hmpro.html for details.
HTML Validation, a service provided by Mark Gaither at http://www.webtechs.com/html-val-svc/, screens an HTML document for validity against different levels of the HTML standards. This site also explains how to install their validation tools locally, and their paper "Why Validate Your HTML?" explains why validation is a good idea.
weblint is a Perl script to check HTML, available at http://www.w3.org/pub/WWW/Tools/weblint.html from the Khoros group at the University of New Mexico.
htmlchek is an awk script to check HTML, from Henry Churchyard at the University of Texas (http://www.w3.org/pub/WWW/Tools/htmlchek.html).
Arena, W3C's demonstration HTML 3.0 browser (http://www.w3.org/pub/WWW/Arena/), flags and comments on bad HTML.
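To make concrete what "placed and nested meaningfully" involves, the following Python sketch checks only that container tags are closed in the reverse of the order in which they were opened. It is far cruder than any of the tools listed above, since it does not consult a DTD at all. The set of tags treated as needing no end tag is a simplifying assumption, and all names in the code are invented for this example.

```python
from html.parser import HTMLParser

# Elements that are empty, or whose end tag is optional, are not tracked.
# This short list is a simplification, not a reading of any DTD.
NO_END_TAG_NEEDED = {
    "br", "hr", "img", "input", "isindex", "base", "link", "meta",
    "p", "li", "dt", "dd", "option",
}


class NestingChecker(HTMLParser):
    """Keep a stack of open container tags and note any mismatches."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag name, line where it was opened)
        self.problems = []   # human-readable messages

    def handle_starttag(self, tag, attrs):
        if tag not in NO_END_TAG_NEEDED:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in NO_END_TAG_NEEDED:
            return
        line = self.getpos()[0]
        if tag not in [name for name, _ in self.stack]:
            self.problems.append(
                f"line {line}: </{tag.upper()}> has no matching start tag")
            return
        # Pop back to the matching start tag, reporting anything left open.
        while self.stack:
            name, opened = self.stack.pop()
            if name == tag:
                break
            self.problems.append(
                f"line {line}: <{name.upper()}> opened on line {opened} "
                f"is not closed before </{tag.upper()}>")


def check_nesting(html_text):
    """Return a list of nesting problems found in html_text."""
    checker = NestingChecker()
    checker.feed(html_text)
    checker.close()
    for name, opened in checker.stack:
        checker.problems.append(
            f"line {opened}: <{name.upper()}> is never closed")
    return checker.problems


well_nested = "<HTML><BODY><H1>Title</H1><P>Some <B>bold</B> text<BR></BODY></HTML>"
overlapping = "<P>Some <B>bold <I>text</B></I>"

for message in check_nesting(overlapping):
    print(message)
```

Many browsers render the overlapping example without complaint, which is exactly why viewing a page is no substitute for checking it.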

4. Converting HTML to other formats

The advantages of hypertext and browser-based documentation do not eliminate the need for printed documents. Not all readers prefer navigating the multi-linear structures of hypertext to the logical sequencing implicit in a printed manual. Not all browsing of documentation is done at a computer workstation. Documenters must still consider how to produce traditional printed manuals from any hypertext systems. Originating the documents as TeX or LaTeX or in a word-processor such as Microsoft Word or Word Perfect alleviates this problem, but there are also some options for documents that originate as HTML, or whose hypertext versions may have evolved away from a printable original.

To ASCII text

A simple but somewhat crude way to convert the occasional HTML file to ASCII under Unix is to use the lynx text-only HTML browser from the University of Kansas, e.g. by running it with its -dump option and redirecting the output to a file.
htmlcon, a small MS-DOS utility to convert HTML files to ASCII text with some choice of the conversion tactics, is available from ftp://ftp.crl.com/ftp/users/ro/mikekell/ftp.
NCSA Mosaic and Netscape Navigator also offer options to save HTML files as plain or slightly-formatted ASCII text.
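Where none of these tools is at hand, the basic idea of such a conversion can be sketched in a few lines of Python with the standard library's html.parser and textwrap modules. This is only an illustration of the tactic (drop the tags, break paragraphs at block-level elements, re-wrap the text). It is not how lynx or htmlcon work internally, and the list of block-level tags and all names below are assumptions made for the example.

```python
import textwrap
from html.parser import HTMLParser

# Tags that start a new paragraph in the text output (a partial list).
BLOCK_TAGS = {
    "p", "br", "hr", "h1", "h2", "h3", "h4", "h5", "h6", "title",
    "li", "dt", "dd", "ul", "ol", "dl", "pre", "address", "blockquote",
}


class TextExtractor(HTMLParser):
    """Gather character data into blocks separated at block-level tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.blocks = [[]]

    def _break(self):
        if self.blocks[-1]:
            self.blocks.append([])

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_data(self, data):
        self.blocks[-1].append(data)


def html_to_text(html_text, width=72):
    """Return the text of html_text as paragraphs wrapped to width columns."""
    extractor = TextExtractor()
    extractor.feed(html_text)
    extractor.close()
    paragraphs = []
    for block in extractor.blocks:
        words = " ".join("".join(block).split())  # collapse white space
        if words:
            paragraphs.append(textwrap.fill(words, width=width))
    return "\n\n".join(paragraphs)


sample = (
    "<H1>Why hypertext?</H1>\n"
    "<P>Hypertext lets users explore\n"
    "documentation along <EM>individualized</EM> paths &amp; links.</P>"
)
print(html_to_text(sample))
```

Links, images and tables are simply lost in this approach, which is the sense in which any such conversion is crude.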

To PostScript

There are several choices for dumping HTML files to PostScript for printing as pages or chapters of manuals:

Netscape Navigator's print option produces reasonable-looking PostScript from HTML sources (including in-line images) but has no facility to paginate it (i.e. to number the pages or to force page breaks at sensible places). The most recent versions of NCSA Mosaic (2.7b4) offer pagination, timestamping, and the ability to record URLs from the document in the printout as footnotes.
Jan Karrman at Uppsala University has written a stand-alone HTML-to-PostScript converter called html2ps. This is a Perl script that offers considerable control over the final output, including margins, font selection and sizes. It lets you insert page numbers, an important option for long documents. It has rudimentary features for concatenating multiple HTML files into one PostScript file. It does not yet support in-line images, the <ISINDEX> tag, or forms. It defaults to A4 page size but has switches that can be set appropriately for U.S. paper sizes. Reasonable PostScript can be produced on U.S. letter paper directly from HTML files by the command
```
html2ps -n -u -PS 27.9 -l 21.5 filename.htm > filename.ps 
```
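The two numbers in the command above are close to the dimensions of U.S. letter paper expressed in centimetres. Reading them as the page height and width is an assumption made here, since the meaning of each switch is not spelled out above. The arithmetic itself is easy to check with a short Python sketch:

```python
CM_PER_INCH = 2.54


def inches_to_cm(inches):
    """Convert a length in inches to centimetres."""
    return inches * CM_PER_INCH


# U.S. letter paper is 8.5 inches wide by 11 inches high.
letter_height_cm = inches_to_cm(11.0)
letter_width_cm = inches_to_cm(8.5)

# A4, the default page size mentioned above, is 21.0 cm by 29.7 cm.
a4_height_cm, a4_width_cm = 29.7, 21.0

print(f"U.S. letter: {letter_height_cm:.2f} cm high, {letter_width_cm:.2f} cm wide")
print(f"A4:          {a4_height_cm:.2f} cm high, {a4_width_cm:.2f} cm wide")
```

Letter paper is therefore shorter and wider than A4, so output formatted for A4 can lose its last lines when printed on letter stock.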