6.0018 Correction on SGML (1/63)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Mon, 18 May 1992 19:52:33 EDT

Humanist Discussion Group, Vol. 6, No. 0018. Monday, 18 May 1992.

Date: Mon, 18 May 92 11:25:05 CDT
From: "C. M. Sperberg-McQueen" <U35395@UICVM>
Subject: Re: 6.0009 Text Retrieval

On Wed, 13 May 1992 22:02:40 EDT Brian Whittaker said:
>I have read interesting things about SGML, a powerful data base programme
>which actually does operate by providing an extremely flexible search
>tool that operates on data stored in text files. I believe SGML is
>native to the IBM mainframe world. I do not know if a MAC version is
>available. I suspect this may be the ideal programme... probably
>because I don't have it :>)

Actually, SGML is not a program but a language (the Standard Generalized
Markup Language), by means of which one can define markup languages.
But that doesn't mean it's irrelevant to Brian Whittaker's discussion of
information retrieval.
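
To make "a language for defining markup languages" concrete, here is a
minimal sketch in Python. It is an illustration under stated
assumptions, not how SGML itself works: a real SGML application declares
its elements in a document type definition, whose content models also
constrain order and repetition. The vocabulary (poem, title, stanza,
line), the POEM_LANGUAGE table, and the violations function are all
hypothetical, and an XML parser from the Python standard library stands
in for an SGML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup language: each element name maps to the set of
# child elements it may contain (a much-simplified content model).
POEM_LANGUAGE = {
    "poem": {"title", "stanza"},
    "title": set(),
    "stanza": {"line"},
    "line": set(),
}

def violations(element, language):
    """Return a list of messages for markup the language does not allow."""
    allowed = language.get(element.tag)
    if allowed is None:
        return [f"unknown element <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in allowed:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(violations(child, language))
    return problems

good = ET.fromstring(
    "<poem><title>T</title><stanza><line>a</line></stanza></poem>"
)
bad = ET.fromstring("<poem><line>a</line></poem>")

good_problems = violations(good, POEM_LANGUAGE)
bad_problems = violations(bad, POEM_LANGUAGE)
print(good_problems)
print(bad_problems)
```

The point of the sketch is only that the rules live in a declaration
separate from any one document, so the same checker serves every
document written in that markup language.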

In general, SGML-based markup languages are designed to exhibit the
logical structure of texts (rather than the details of their
presentation on an 8 1/2 x 11 sheet, or an A4 sheet, or a 5 1/2 x 8 1/2
sheet, or ...), which means a search engine working on well marked up
SGML documents can understand the structure of the text and in general
beat the pants off search software limited to imagining the text as an
unstructured stream of characters.
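
The difference can be sketched in a few lines of Python. This is only
an illustration under assumptions: the sample text and its element names
(poem, title, stanza, line, note) are invented for the example, the
sample is written so that an XML parser from the Python standard library
can stand in for an SGML parser, and flat_search and structured_search
are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Invented sample document; the element names are assumptions.
SAMPLE = """<poem>
  <title>The Tyger</title>
  <stanza>
    <line>Tyger Tyger, burning bright,</line>
    <line>In the forests of the night;</line>
  </stanza>
  <note>Compare the spelling Tyger with the modern tiger.</note>
</poem>"""

def flat_search(text, word):
    """Treat the text as an unstructured stream: return every raw line
    that contains the word, markup and all."""
    return [ln.strip() for ln in text.splitlines() if word in ln]

def structured_search(text, word, element):
    """Use the markup: return the content of each <element> whose text
    contains the word, ignoring matches elsewhere in the document."""
    root = ET.fromstring(text)
    return [el.text for el in root.iter(element)
            if el.text and word in el.text]

flat_hits = flat_search(SAMPLE, "Tyger")
line_hits = structured_search(SAMPLE, "Tyger", "line")
print(len(flat_hits))   # matches in title, verse line, and editorial note
print(line_hits)        # only the verse line
```

The flat search cannot tell the poem's own words from its title or an
editor's note; the structured search can restrict itself to verse lines
because the markup says which is which.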

Equally important is that SGML is not a proprietary language, but a
language with a publicly accessible definition, which means SGML
documents are much less likely than documents in other forms to be
stranded in electronic islands: everyone is writing to the same file
format. Brian Whittaker points out how useful it is to avoid
proprietary formats, and the point should be taken very seriously
indeed.

SGML parsers exist for almost all the computing platforms I know of:
for Macs, and IBM PCs running DOS, and IBM PCs running Windows, and Unix
workstations with slick graphic interfaces, and Unix systems with glass
teletype interfaces, and, yes, IBM mainframes. Public domain code for an
SGML parser is available from the SGML Users Group, and several
application programs have been built on it. This software is available
on a number of servers around the world, including these:

ftp.ifi.uio.no (128.240.88.1) in directory /SIGhyper/SGMLUG/distrib

mailer.cc.fsu.edu (128.186.6.103) in directory /pub/sgml

ftp.uu.net (137.39.1.9) in directory pub/text-processing/sgml

(For information on using ftp, consult a local guru or the archives of
this list.)

Fully developed SGML browsing tools and search engines are somewhat less
widely available, and the ones I know of are all commercial products,
not free. As time goes on, I believe more such programs will be
released; some at least will become cheap, and some may be released
as public domain programs.

And of course SGML forms the syntactic basis for the Text Encoding
Initiative's Guidelines for Text Encoding and Interchange (second draft
in progress even as I write, or as soon as I finish this note). The
main reason for this is SGML's ability to express all sorts of
information about a text that other markup languages simply provide no
mechanism to express. For really serious work with texts, SGML is the
only serious contender for one's markup language.

-C. M. Sperberg-McQueen
ACH / ACL / ALLC Text Encoding Initiative
University of Illinois at Chicago