7.0267 R: Tag Terms Definitions (1/106)

Sat, 23 Oct 1993 07:43:09 EDT

Humanist Discussion Group, Vol. 7, No. 0267. Saturday, 23 Oct 1993.

Date: Fri, 22 Oct 93 10:25:42 CDT
From: "C. M. Sperberg-McQueen" <U35395@UICVM>
Subject: Re: 7.0260 Qs: Tag Terms

On Thu, 21 Oct 1993 14:17:29 EDT, in Humanist Discussion Group, Vol. 7,
No. 0260. Thursday, 21 Oct 1993, Arjan Loeffen asked:

>Does anyone have any strong ideas about terminology for:
>Mark Markup 'Standard generalized markup language'
>Tag Tagging 'Starter set of tags'
>Code Encoding 'Text encoding initiative'
>What's the difference? What's the same?

Of course, any descriptive linguist will do well to be skeptical of
native-speaker introspection, but for what it's worth, here is my
mix of observation and introspection.

In most usage I am familiar with, these terms refer to much the same
thing, but there are, I think, differences of scope.

An 'encoding scheme' is (at least as I understand and endeavor to use
the term) any method of representing information of any kind in a
given medium; the most common use is to refer to methods of
representing text in electronic form. Encoding schemes for electronic
text must determine at least:
- how to represent each character of the text
- how to represent (if at all) characteristic features of the source
text such as quotation, language shifts, font shifts, font size, ...
- whether and how to represent structural divisions of the text
(e.g. book, chapter, verse, ...)
- how to reduce the source to a single linear stream of bytes. This
is fairly easy for conventional running text; it is harder for
running titles (transcribe where?), catch-words, footnotes,
end-notes, parallel texts printed in parallel columns, text-critical
apparatus, and other common disturbances of linearity.
- whether and how to represent analytic or interpretive information
not present, or not explicitly present, in the source (e.g.
morphological or syntactic analysis)
- whether and how to represent ancillary information relevant to
the use of the encoding (identity of the transcriber(s), source
edition used, nature of the encoding, ...)

The TEI, for example, answers these questions more or less as follows:
- represent characters with their analogues in the local character
set, or with SGML entity references
- represent quotation, etc., with SGML tags; represent language
shifts by changes in the LANG attribute.
- represent structural divisions of the text with <div> or with
<div1>, <div2>, ... <div7>, or with special-purpose elements
for the appropriate divisions (<entry>, <termEntry>, <list>, ...)
- transcribe running titles, catchwords, etc., in <skel> elements,
or omit them; transcribe footnotes and endnotes either at the
point of attachment or at their point of appearance, and link
them to their target appropriately.

And so on.

Word Perfect also must answer all these questions, though the answers
are more often 'omit the information' or 'make it look like it looks on
the source page'. So must every piece of software that works with
electronic text.

Most encoding schemes choose to represent at least some information by
means of 'markup', which is usually held to apply to everything not part
of the character stream of the running text (a definition not without
its problems, given the difficulty of defining the 'running text'!).
Among other things, markup is what you see when you press the Reveal
Codes key in Word Perfect.

Often, the units of markup in an encoding scheme are referred to as
'tags', and the process of inserting them as 'tagging'. SGML elements
are marked with 'tags' at their beginning and end; the Brown and LOB
corpora use 'tags' to encode part-of-speech and morphological
information; COCOA, OCP, and TACT use 'COCOA tags' to indicate where
certain textual features change value, or begin or end.
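
To make the 'change of value' idea concrete, here is a minimal sketch in
Python of how a program might read COCOA-style tags. It assumes the
common form <LETTER value>; the category letters A, T, and S, the sample
passage, and the function name annotate are all hypothetical, chosen for
illustration, and the sketch makes no claim to match what OCP or TACT
actually do.

```python
import re

# Assumed form of a COCOA-style tag: <LETTER value>, e.g. <A Shakespeare>.
COCOA_TAG = re.compile(r"<([A-Za-z])\s+([^>]*)>")

def annotate(source):
    """Return (state, text) pairs for each stretch of running text.

    state is a snapshot of the category values in force for that stretch.
    """
    state = {}
    result = []
    pos = 0
    for match in COCOA_TAG.finditer(source):
        text = source[pos:match.start()].strip()
        if text:
            result.append((dict(state), text))
        # A tag does not enclose anything; it only changes one value,
        # which stays in force until the next tag of the same category.
        state[match.group(1)] = match.group(2).strip()
        pos = match.end()
    text = source[pos:].strip()
    if text:
        result.append((dict(state), text))
    return result

# Hypothetical categories: A = author, T = title, S = speaker.
passage = ("<A Shakespeare><T Hamlet><S Barnardo>Who's there?"
           "<S Francisco>Nay, answer me.")

for state, text in annotate(passage):
    print(state, text)
```

Note that the second stretch of text still carries the author and title
set at the start: only the speaker has changed value.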

Thus, the extensions of the terms nest rather neatly, at least in some
usages. A 'tag' is a unit of 'markup', and 'markup' is a tool of
'encoding'.

N.B. tags are not necessarily the only kind of markup: no one refers to
Word Perfect's proprietary markup as 'tags', and SGML defines other
types of markup beyond tags: notably entity references and markup
declarations. Similarly, not all encoding schemes use markup explicitly
identifiable as markup: Project Gutenberg, for example, prides itself
on having no markup in its texts. Since some authorities give 'markup'
a broad sense which includes punctuation and the like, Project
Gutenberg's claim to have markup-free texts is at best problematic. But
no one disputes that their texts have a peculiarly impoverished markup,
if any.
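
For what it's worth, the distinction among kinds of SGML markup can be
sketched in a few lines of Python. This is a rough illustration only:
the regular expressions are simplified assumptions of my own (a real
SGML parser depends on the DTD and the SGML declaration), and the
function name classify and the sample string are hypothetical.

```python
import re

# Simplified, assumed patterns for three kinds of SGML markup.
TOKEN = re.compile(
    r"(?P<declaration><![^>]*>)"           # markup declaration
    r"|(?P<tag></?[A-Za-z][^>]*>)"         # start-tag or end-tag
    r"|(?P<entity>&[A-Za-z][A-Za-z0-9.-]*;)"  # entity reference
)

def classify(source):
    """Split source into (kind, string) pieces, in document order."""
    pieces = []
    pos = 0
    for match in TOKEN.finditer(source):
        if match.start() > pos:
            # Whatever lies between two pieces of markup is running text.
            pieces.append(("text", source[pos:match.start()]))
        pieces.append((match.lastgroup, match.group()))
        pos = match.end()
    if pos < len(source):
        pieces.append(("text", source[pos:]))
    return pieces

sample = ('<!ENTITY eacute SDATA "[eacute]">'
          '<p>A caf&eacute; in <name>Chicago</name></p>')

for kind, piece in classify(sample):
    print(kind, repr(piece))
```

On this view the entity reference is markup but not a tag, and the
'running text' is simply whatever is left over, which is exactly where
the definitional difficulty mentioned above comes in.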

So the three terms in the middle column are not mutually exclusive, but
nested in their extensions. The left-hand terms, however, simply don't
match: 'code' is not a general term for the units of encoding; although
Word Perfect does use the term, any similarity between Word Perfect's
'codes' and the work of the Text Encoding Initiative is purely
superficial. Nor is 'mark' used at all to denote the units of a markup
language: for this denotation, 'tag' and 'code' are typically used
instead, 'tag' almost universally, and 'code' somewhat less frequently.

All this to be taken, as usual in cases of usage discussion, with
a grain or two of salt.

-C. M. Sperberg-McQueen
ACH / ACL / ALLC Text Encoding Initiative
University of Illinois at Chicago
u35395@uicvm.uic.edu / u35395@uicvm