9.429 text encoding and the TEI

Humanist (mccarty@phoenix.Princeton.EDU)
Thu, 4 Jan 1996 19:26:01 -0500 (EST)

Next message: Humanist: "9.434 Call for NEH Summer Seminar applications"
Previous message: Humanist: "9.430 inappropriateness"

Humanist Discussion Group, Vol. 9, No. 429.
Center for Electronic Texts in the Humanities (Princeton/Rutgers)
http://www.princeton.edu/~mccarty/humanist/

[1] From: Michael Sperberg-McQueen
<U35395%UICVM.bitnet@interbit.cren.net> (275)
Subject: text encoding

The occupations of the holidays have prevented me, until now, from
following up on the discussion of text encoding which began late last
year, but I want to comment on some outstanding issues, lest those not
familiar with the TEI be left with false impressions about the TEI and
SGML. I'll try to focus on the main points, though the discussion has
elicited a depressing number of slips, errors, and misapprehensions
which the pedant in me burns to correct line by line. By line. By
line.

I believe the discussion revolves around several important questions, to
which the answers cannot be in serious doubt.

* Is SGML usable for scholarship? Yes, of course.
* Does the TEI allow the use of presentational markup? Yes, it does.
* Does the TEI provide tags for analytical bibliography? Some, to mark
  the most obvious things (catch words, running titles, etc.).
  P3 does not have tags to mark everything imaginable.
* Are there phenomena of interest to analytical bibliography which
  cannot be tagged in the TEI without extensions? I don't know;
  Patrick Durusau asked over a month ago for some examples, but I
  haven't seen any yet.

Is SGML usable for scholarship?

The main thread of the discussion, as I understand it, is Prof.
Lancashire's insistent suggestion, on a variety of grounds, that the TEI
-- or more generally, SGML -- is not suitable for scholarly work.
Actually, in his earliest postings to this thread, he phrased it
somewhat more strongly. In Humanist 9.343, on 2 Dec 95, he said:

> ... scholarly editors often cannot encode a text in the way SGML
> requires.

> Series such as the Malone Society editions -- any conservative or
> diplomatic editorial convention -- cannot use SGML as it is defined
> by its authors ...

N.B. the claim here is not that SGML is inconvenient or that it must be
used carefully, but that it *cannot* be used. This is a stronger claim
than is made in Prof. Lancashire's Calgary paper, where he limits
himself to claiming that SGML "encodes medieval and Renaissance
manuscripts and printed books with difficulty." I believe both claims
are false, but more to the point, they are incompatible with each other:
if SGML encodes something with difficulty, then it can be used (with
difficulty) to encode that thing, and it is false to claim that it
cannot.

But of course the main issue is not whether SGML can be used, since
Prof. Lancashire's own work with the Renaissance English Text project
reflects a conviction that it can, nor whether SGML is interpretive,
since in the ways that count it is neither more nor less interpretive
than ASCII-only text encoding or COCOA tagging. The main issue, I
think, is whether the TEI encoding scheme embodies an understanding of
the nature of text which is incompatible with sound scholarly practice
or not. Prof. Lancashire seems to believe that it does; I believe that
it does not. He has the right to try to persuade the readers of
Humanist of his view, and I would like to claim a similar right for
myself.

Descriptive and Presentational Markup in the TEI

Prof. Lancashire's early postings objected to the TEI (and to SGML in
general) on the grounds that they were inherently interpretive. He
meant by this, I think, not merely that the act of encoding a text
unavoidably reflects what one thinks that text *is* -- this is true for
ASCII-only encoding and COCOA, too -- but also that the TEI specifically
recommends descriptive tagging, where appropriate, in preference to
presentational tagging. If I understand him rightly, Prof. Lancashire
believes that descriptive tagging is interpretive in a way that
presentational tagging is not. (At least, so I read his remark in
Humanist 9.358 about the difference between reading letters of the text,
and tagging its formal structure.)

He objects to descriptive tagging, I believe, because he would rather
not be forced into that particular type of interpretation. He objects
to the TEI because he believes it forces him into it. And, he points
out, anyone creating a conservative edition may well feel the same way.

Prof. Lancashire and I disagree, I think, on whether presentational
markup is less interpretive (requires less intelligence or
understanding) than descriptive markup. I think anyone who has read the
literature on the manuscript of *Beowulf* will find it hard to agree
that identifying the letters of a source text is not interpretive, or
not controversial, or requires less expert knowledge than (say)
recognizing the boundaries between verse lines in that poem. I don't
believe the Beowulf manuscript is unusual in that regard, and while
printed works pose this problem in less chronic form, individual acute
instances can still occur. (Peter Shillingsburg tells a wonderful
story of discovering that a suspect semicolon in a Thackeray edition was
actually a colon with a flyspeck.) Lou Burnard has already cited
Randall McLeod on the importance of inference and intelligence in
analytical bibliography; I don't think there is any need to belabor the
point.

But though I think Prof. Lancashire is misguided to believe that
presentational markup is inherently uninterpretive, nevertheless I think
we agree on the critical point -- at least, I believe it's the critical
point -- namely, that it may be necessary or desirable to tag a text in
purely presentational terms, rather than in descriptive or analytical
terms. Concretely, it may be necessary to tag italics as italics, and
not necessarily as emphasis, foreign text, technical term, etc.

This is not a concession on my part. As various postings have already
made explicit, the TEI in no way prevents the user from marking italics
as italics. That decision rests not with the encoding scheme, but with
the encoder, and the TEI Guidelines work hard to ensure that encoders of
all persuasions have the tools and freedom they need to tag what they
are interested in. When the research community encounters ways in which
the current version of the TEI does not succeed in providing tools
adequate to the job, we intend to fix the Guidelines if we can. That is
why the TEI needs ongoing support and maintenance.

If the critical point is the right of the encoder to choose whether to
use presentational markup or 'descriptive' markup, then, we are all in
agreement on the critical point. The encoder should have that right,
and in the TEI scheme the encoder does have that right.
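
The choice can be made concrete with a minimal sketch. It assumes
XML-style syntax (so that Python's standard parser can read it) rather
than P3's SGML, and the sentence being encoded is invented for
illustration. The element names hi and foreign and the global rend
attribute are the TEI's; the value "italic" is only a conventional
choice, since rend values are left to the encoder.

```python
import xml.etree.ElementTree as ET

# Two encodings of the same invented sentence.
# PRESENTATIONAL records only that the phrase is in italics.
# DESCRIPTIVE says why (foreign words) and still records the italics.
PRESENTATIONAL = '<p>He called it a <hi rend="italic">faux pas</hi>.</p>'
DESCRIPTIVE = '<p>He called it a <foreign rend="italic">faux pas</foreign>.</p>'

def plain_text(fragment):
    """Return the character data of a fragment with all tags removed."""
    return "".join(ET.fromstring(fragment).itertext())

def rendered_as(fragment, value):
    """Return the text of every element whose rend attribute equals value."""
    root = ET.fromstring(fragment)
    return ["".join(el.itertext()) for el in root.iter() if el.get("rend") == value]

print(plain_text(PRESENTATIONAL) == plain_text(DESCRIPTIVE))  # True
print(rendered_as(PRESENTATIONAL, "italic"))                  # ['faux pas']
print(rendered_as(DESCRIPTIVE, "italic"))                     # ['faux pas']
```

Both encodings carry the same text and the same record of italics; the
second merely adds a claim about why the italics are there, and the
encoder is free to decline to make it.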

There can be a point of controversy here only if it turns out that Prof.
Lancashire, like some other critics of the TEI, wants not merely the
right to prefer presentational markup, but the right to outlaw
descriptive markup even in cases where the encoder believes it right and
appropriate to use descriptive markup.

Analytical Bibliography and the TEI

Prof. Lancashire has quoted, several times, a few sentences in chapter
18 (Primary Sources) of the Guidelines which point out that the chapter
does not provide tags for the detailed physical description of textual
witnesses. Prof. Lancashire (Humanist 9.358) infers from these
sentences that the Guidelines *cannot* tag "these things".

Quite apart from the logical leap involved (if chapter 18 does not
address a problem, can one really infer that the TEI cannot handle
that problem without extension?), Prof. Lancashire seems to overlook
the specifics of the list of topics not covered:

* the materials of the carrier
* the medium of the inscribing implement
* the layout of the inscription upon the material
* the organisation of the carrier materials themselves (as quiring,
  collation, etc.)
* authorial instructions or scribal markup

That is, there are no tags for paper, parchment, papyrus, stone, nor for
quill or brush or other types of pen, or for non-pens -- but as Peter
Robinson and Dug Duggan have each demonstrated, this does not mean you
cannot say a given manuscript is parchment or paper, only that there is
not a prescribed tag for the phenomenon.

The description of layout for arbitrary pre-existing materials is a
general problem, and if anyone knows a general solution to it, they
haven't shared it with me. It's not clear exactly what is required for
various kinds of scholarly work. Must we be able to describe the layout
in words ('block of text is shaped like wings of angels', 'block of text
is shaped like an apple with a bite taken out of it on right', ...), so
we can search on keywords ('wings', 'apple', ...)? Must we be able to
reproduce the page? Is an image of the page sufficient, or must we be
able to construct it anew? The problem is complicated by the fact that
analytical bibliographers are not the only people who care about these
problems, and it's not obvious whether everyone who does care will be
able to use the same solution.

This is an area requiring further work, and I share with Patrick Durusau
the hope that the recently completed DSSSL standard will provide us with
some useful tools for understanding and solving it. In the meantime,
the TEI's global REND attribute is the most serious and general
mechanism I have seen in ANY markup scheme for dealing with layout
specification. Unlike other methods, it does not restrict the encoder
to a particular level of detail, or to a particular set of predefined
layout patterns. It allows the presentation of the text to be described
in as much or as little detail as is desired. Prof. Lancashire, it is
clear, does not like the REND attribute, but it's not clear why.
Certainly I have seen nothing in the Renaissance English Texts tag sets
which cannot be clearly and simply expressed with REND.

The organization of quires in a codex ought, in straightforward cases,
to be taggable with milestone tags. I don't know about confusing cases;
presumably it will vary with the type of confusion. Certainly I would
not hesitate to use milestone tags to mark all the kinds of things I
have seen in the examples in Gaskell, or in the RET documentation.
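
For a straightforward case, a sketch of what this might look like,
in XML-style syntax so that a stock parser can read it. The milestone
and pb elements and their unit and n attributes are the TEI's; the unit
value "gathering", the signatures, and the text are assumptions made up
for the example, not values prescribed by the Guidelines.

```python
import xml.etree.ElementTree as ET

# A made-up text in which each gathering begins at a milestone tag
# and each page begins at a pb (page break) tag.
SAMPLE = """<div>
<milestone unit="gathering" n="A"/><pb n="A1r"/>
<p>First page of the first gathering.</p>
<pb n="A1v"/>
<p>Its verso.</p>
<milestone unit="gathering" n="B"/><pb n="B1r"/>
<p>First page of the second gathering.</p>
</div>"""

def pages_by_gathering(xml_text):
    """Map each gathering's n value to the page breaks that follow it."""
    gatherings = {}
    current = None
    # iter() walks the elements in document order, which is what lets a
    # milestone govern everything up to the next milestone.
    for el in ET.fromstring(xml_text).iter():
        if el.tag == "milestone" and el.get("unit") == "gathering":
            current = el.get("n")
            gatherings[current] = []
        elif el.tag == "pb" and current is not None:
            gatherings[current].append(el.get("n"))
    return gatherings

print(pages_by_gathering(SAMPLE))
# {'A': ['A1r', 'A1v'], 'B': ['B1r']}
```

Because milestones are empty elements, the gatherings need not nest
inside, or align with, the paragraphs and divisions of the text.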

(Exercise for the interested student of encoding: given the difficulty
of representing a single forme as a single SGML element, how many
methods does the TEI offer for describing an individual forme and
connecting its description to the pages printed from that forme, and
what are those methods? Hint: there's more than one.)

All of these topics are interesting, and I would like to see work
continue on all of them. Prof. Lancashire's claim that the TEI regards
them as unimportant is quite simply false: the TEI regards them not as
unimportant, but as hard. Solving them will not be the work of an
afternoon.

Does the absence of solutions to these problems, or the absence of
explicit discussions of how to use general TEI mechanisms to deal with
them, really impair the work of analytical bibliographers today? I
don't know. I believe not, since the tag set Prof. Lancashire has
defined for the Renaissance English Texts also lacks tags, as far as I
can tell, for carrier materials, inscribing implement, layout, and the
treatment of authorial instructions like "insert this on page 54."

Between them, Prof. Lancashire and Lou Burnard have given a rather full
picture of the history of the TEI's attempts to deal with analytical
bibliography. Their picture differs from my recollections in a number
of ways, but there is no need to rehash who did what when. Let it
suffice to say that we did charter a work group for analytical
bibliography and appoint members, but it never met and produced no
draft, and that the treatment of the topic in P3 is not intended as a
full substitute for the work we had hoped that work group would perform.

We still hope to organize a work group to address this topic, though
the lack of prior art will make it difficult to solve the problems of
analytical bibliography to everyone's satisfaction. Analytical
bibliography, codicology, paleography, textual criticism, description of
archival materials, documentary editing, and description of page layout
all overlap in complex ways, while being quite distinct in other ways.
Addressing their common concerns appears likely to require an intimate
familiarity with some very technical issues related to document styles,
processing specifications, and concurrent document structures.

I don't know how it's all going to fit together. But I do hope we can
make some progress. Doing so will require clarity both on what is
required for analytical bibliography (not an obvious question) and on
what is actually in the TEI Guidelines now.

Minor Point: the TEI and Metrical Feet

I've tried to stick to the main points at issue and avoid trivia, but
one passing remark of Prof. Lancashire's (in Humanist 9.358) is so
grossly misleading that it must be corrected publicly. He says:

> Yet there is a sizable difference between asserting that a given
> ink-blot is a b instead of a p, and including -- in TEI's core
> tagset -- tags defining the structure of a poem without
> producing tags for a rhythmical unit or for a metrical foot! It
> is certainly NOT obvious that the fundamental unit of verse is
> the line. ...

The suggestion that the TEI does not have tags for rhythmical units or
metrical feet is simply erroneous. The TEI chapter on the base tag set
for verse shows explicitly (in section 9.3) how feet and other units
within the verse line may be tagged.

Prof. Lancashire may, of course, object that there is no tag called
'foot'. This is true. One reason not to have one is the large number
of verse traditions in which metrical feet do not occur, or are not
useful units of analysis. Rather than have distinct tags for foot,
metron, hemistich, measure, stave, and so on, the TEI applies Occam's
Razor and defines a generic element (seg) for the analysis of verse
lines into their feet, metra, halflines, or whatever units apply.
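
A minimal sketch of such tagging, in XML-style syntax rather than P3's
SGML. The l and seg elements and the type attribute are the TEI's; the
type value "foot" and the scansion of the sample line are assumptions
chosen for illustration, not taken from the Guidelines.

```python
import xml.etree.ElementTree as ET

# One iambic pentameter line divided into five feet with the generic
# seg element. Foot boundaries may fall inside words, so some segs
# abut with no space between them.
LINE = (
    '<l><seg type="foot">The cur</seg><seg type="foot">few tolls</seg> '
    '<seg type="foot">the knell</seg> <seg type="foot">of par</seg>'
    '<seg type="foot">ting day,</seg></l>'
)

def units(line_xml, unit_type):
    """Return the text of each seg of the given type, in order."""
    root = ET.fromstring(line_xml)
    return [seg.text for seg in root.iter("seg") if seg.get("type") == unit_type]

def line_text(line_xml):
    """Return the verse line as plain text, ignoring the segmentation."""
    return "".join(ET.fromstring(line_xml).itertext())

print(units(LINE, "foot"))
# ['The cur', 'few tolls', 'the knell', 'of par', 'ting day,']
print(line_text(LINE))
# The curfew tolls the knell of parting day,
```

The same element with a different type value would serve for metra or
halflines, which is the point of using one generic element.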

As to the line as the fundamental unit of verse, I must say I am
surprised to find it a controversial claim. 'Fundamental' should not
be taken to mean too much, of course. In the TEI, lines are fundamental
units of verse in the sense that if you want to tag something as verse,
without modifying the DTD, you must use the L element and need not use
any other element. (Of course, if you wish, you can modify the DTD
and tag verse however you wish. The resulting text can still be TEI
conformant in the strict sense of the term.) This banal observation
does not mean, and should not be misread as claiming, that lines are
logically prior to stanzas, feet, or other phenomena of verse, nor that
the latter are unimportant.

It does remain true that some notion of line appears to occur in every
tradition I have heard described as verse, while the same cannot be said
for feet, metra, halflines, measures, stanzas, or verse paragraphs.
Prof. Lancashire may know of other traditions, but the closest he comes
is the observation (in his Calgary paper) that in Anglo-Saxon
manuscripts the manuscript lineation does not agree with the boundaries
of the metrical lines. This is not a surprise; it is one reason the TEI
distinguishes systematically between typographic lines (marked lb) and
metrical lines (marked l). Even more important, it does not establish
that Anglo-Saxon poetry lacks metrical lines. (If Prof. Lancashire is
in any doubt on the matter, I suggest he read Robert Creed's work on the
lineation of Beowulf. Out of 3000 lines of poetry, I believe he found
fewer than one in a hundred which admitted of any debate as to the
proper line boundaries. It's been ten or fifteen years since I read the
work, so I may be remembering it wrong.)
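
The distinction between the two kinds of line can be sketched as
follows, in XML-style syntax. The lg, l, and lb elements are the
TEI's; the placement of the lb tags below is illustrative only and has
not been checked against the manuscript.

```python
import xml.etree.ElementTree as ET

# Opening of Beowulf. Each l element is one metrical line; each empty
# lb element marks the start of a new manuscript line, wherever it
# falls (even in the middle of a word).
SAMPLE = """<lg>
<l><lb/>Hwæt we Garde<lb/>na in geardagum</l>
<l>þeodcyninga <lb/>þrym gefrunon</l>
<l>hu ða æþelingas ellen <lb/>fremedon</l>
</lg>"""

root = ET.fromstring(SAMPLE)

def metrical_lines(tree):
    """Text of each l element, ignoring manuscript line breaks inside it."""
    return [" ".join("".join(l.itertext()).split()) for l in tree.iter("l")]

def count(tree, tag):
    """Number of elements with the given name."""
    return sum(1 for _ in tree.iter(tag))

print(count(root, "l"), count(root, "lb"))  # 3 4
for line in metrical_lines(root):
    print(line)
```

The two lineations are recorded independently: the count of metrical
lines and the count of manuscript line breaks need not agree, and
neither constrains the other.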

I hope it is not necessary to expound in detail on some of the topics
raised by Prof. Lancashire in passing. Yes, we in the TEI did study the
history of text encoding in humanities computing, and ask ourselves why
it had not evolved into SGML. (Answer: in some important ways, it did
evolve into SGML.) Yes, a number of people in the relevant work groups
were indeed familiar with the work of W. W. Greg, Fredson Bowers, and
Thomas Tanselle. The TEI is indeed the product of humanists, not
computer scientists, though the latter did participate, and it
represents the effort of the humanities computing community to formulate
its own consensus, not an attempt by the technical writing community to
annex humanities computing. And yes, it is possible to produce
conservative documentary and critical editions in TEI, as is evidenced
by the work of SEENET, the Cambridge University Press Canterbury Tales,
and the Model Editions Partnership.

-----

I apologize to Humanist readers for this assault on their patience; if
you have read this far, though, I assume you were interested after all.

I hope to have persuaded you that the TEI is indeed suitable for serious
scholarly use, even for serious work with medieval and early modern
books. Of course, others have the right to try to persuade you of the
opposite, but before you believe them when they say the TEI cannot
encode early printed books, I hope you will ask to see an example of
something it cannot encode. Patrick Durusau asked for this at the very
outset of this discussion, and so far not a single example has been seen
on Humanist. It is one thing to question the assumptions and identify
the limits of the TEI tag set. It is quite another to call into question
assumptions which the TEI does not make, and to deplore limits which
the TEI does not in fact impose.

-C. M. Sperberg-McQueen
ACH / ACL / ALLC Text Encoding Initiative
University of Illinois at Chicago

All opinions expressed in this note (except those I have quoted with a
view to refuting them) are mine. They are not necessarily those of the
Text Encoding Initiative, its executive committee or other participants,
its sponsors, or its funders.
