9.435 encoding and the TEI

Humanist (mccarty@phoenix.Princeton.EDU)
Fri, 5 Jan 1996 18:42:25 -0500 (EST)

Humanist Discussion Group, Vol. 9, No. 435.
Center for Electronic Texts in the Humanities (Princeton/Rutgers)
http://www.princeton.edu/~mccarty/humanist/

[1] From: Arjan Loeffen <Arjan.Loeffen@let.ruu.nl> (145)
Subject: Re: 9.402 encoding & the TEI

[2] From: chhenry@vaxsar.vassar.edu (20)
Subject: Re: 9.430 inappropriateness

--[1]------------------------------------------------------------------
Date: Fri, 5 Jan 1996 13:49:43 GMT+0100
From: Arjan Loeffen <Arjan.Loeffen@let.ruu.nl>
Subject: Re: 9.402 encoding & the TEI

In my view, Patrick is right in his response to Ian Lancashire. The TEI gives
guidance in encoding, and cannot avoid being somewhat restrictive in its
approach. This has been explained and defended well in several places in the
CHum issue on the TEI, as reprinted in N. Ide, J. Veronis: 'Text Encoding
Initiative' ('95), which I have eaten for breakfast, lunch and dinner, and
which will be placed with pride on the bookshelf. I recommend it to all TEI
adepts. It nicely shows what problems were encountered in using SGML as the
meta-language of choice. In a way, it doesn't treat the deficiencies of SGML
for humanities research in a serious way, though it is argued that if the TEI
had been made a more integral part of ISO standardization, or had been started
earlier for that matter, things such as CONCUR would have been high
on the priority list (e.g. pages 59 and 125 of IDE95).

Similarly, the role of TEI processing (a long list of processing
requirements can be compiled for any SGML system to adapt to TEI
requirements; I am currently working on that list) as related to a text
model is not well explained either -- the contributions focus on encoding
structures and the underlying decisions, not on processing issues. If a text
model, expressed in terms of SGML constructs (cf. HyTime and DSSSL for
hyperlinking and publishing respectively), had been the focus, the class
system might have turned out differently.

Anyway, the 'guiding' nature of the TEI is clear from the way content models
are specified (many alternative constructs are allowed for the same
element's content). This means that different approaches ('preferred uses')
of the guidelines may emerge, and a discussion in this respect will parallel
the discussion on the guidelines themselves (with some cross-overs). The
'guiding' nature of the TEI is explained well in 'The design of the TEI
encoding scheme', page 17 ff.
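
By way of illustration, a simplified, hypothetical content model (not the
actual P3 declaration, which is built up from parameter entities) shows how
much latitude a single element can leave to the encoder:

<!element div1 - o (head?, (p | lg | list | quote | note)*) >

One project may fill its div1 elements exclusively with prose paragraphs,
another with verse line groups; both are equally conformant, and each choice
becomes a 'preferred use'.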

Ian:
>> Instead of first discussing what a humanities encoding
>> scheme should provide -- for example, by looking at 20 years of
>> real encoding practice (by groups like TLG, TLF, CETEDOC, the OCP community,
>> etc., and I include WordCruncher and TACT users in the "etc.") -- TEI
>> seized on SGML as the final solution. TEI P3 does not discuss
>> the previous history of textual encoding either in the humanities or
>> in computational linguistics. Why not? TEI P3 does not discuss why
>> SGML was chosen as the format.

Ian is right here. There is a lack of reference to older practices, as far
as I can see. Though the 'encoding' strategy of OCP seems too odd to compare
with SGML, other encoding conventions could have been treated in more depth.
This would not only make interesting reading, but could also help
people decide to conform to the TEI in preference to any other (proprietary)
scheme. If you look at SGML, you can imagine a better language coming up in
the future (SGML may still be subject to revision, it is not a stone wall),
but you cannot imagine an existing language beating it. Not least
because SGML, if you discard the unnecessary garments, is
surprisingly simple. And, of course, because there is a lot of software
available out there that covers many aspects of its use.

Patrick:
>I think a history of textual encoding in the humanities and computational
>linguistics would be an excellent resource for scholars. From the
>standpoint of writing suggested ways of encoding texts, it would increase
>the size of the Guidelines greatly to discuss and genuflect to every
>prior encoding effort and not really be relevant to the task at hand. I
>am sure the Guidelines reflect in the suggestions the experience of the
>numerous humanists who participated in most if not all of the significant
>prior encoding efforts.

A call for papers? Who takes up the glove? If you take it up, put the older
schemes in the fingers and the history of SGML in the palm.

Ian:
>> What troubles me most about SGML the syntax is that it is
>> interpretative itself. Its rigid assumption that only one structure
>> can be recognized (by SGML browsers, editors, etc.) at a time, and
>> that no more than two structures can be encoded in any document, .......
....
Patrick:
>The claims that SGML can encode only two structures in any document is
>simply false. In the original SGML standard, the CONCUR feature was
>unfortunately left as optional, but that does not limited ones ability to
>encode multiple structures for a single text.

Right. 'No more than two structures' is also ambiguous. SGML allows you to
encode an infinity of structures. It allows you to do this in parallel (the
same data viewed in n ways, and you do not need CONCUR for this) or serially
(pieces of data structured differently). The number '2' does not occur in
SGML in this sense. There is no structure SGML cannot handle, simply because
SGML encodes data; it does not state in what way the data are related (not
even for data that seem to be encoded hierarchically, cf. the tree
transformation process (STTF) of DSSSL, which modifies the document's
'structure' into acceptable input for formatting the document).

For example, you can create the document type definition:

<!doctype anystruct [
<!element anystruct - - (entry*)>
<!element entry     - - (#pcdata)>
<!attlist entry     TYPE   CDATA  #REQUIRED
                    ID     ID     #IMPLIED
                    IDREFS IDREFS #IMPLIED>
]>

which allows any structure to be encoded, and hardly shows any structure
itself (SGML systems can only check whether the symbolic references are
possible). Nobody would want to start creating such a document instance, as
nobody would be able to understand the structure. However, when you process
the document with a _dedicated_ SGML system it could be acceptable. So what is
done in the SGML definition is to state that the most prominent structure in
documents is hierarchical, and that this is a good foundation to work on when
defining any other structure imposed on it (as the example above, and TEI's
'Linking, segmentation and alignment' chapter, show, I hope).
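
To make this concrete, a hypothetical instance fragment for the 'anystruct'
document type above could carry two overlapping views -- a sentence view and
a line view -- over the same word entries, linked by ID/IDREF rather than by
CONCUR:

<anystruct>
<entry TYPE="word"     ID=w1>Dear</entry>
<entry TYPE="word"     ID=w2>Willard,</entry>
<entry TYPE="word"     ID=w3>greetings.</entry>
<entry TYPE="sentence" ID=s1 IDREFS="w1 w2 w3"></entry>
<entry TYPE="line"     ID=l1 IDREFS="w1 w2"></entry>
<entry TYPE="line"     ID=l2 IDREFS="w3"></entry>
</anystruct>

The sentence and the lines group the same words in two different ways; only
a dedicated application, not the parser, knows what the grouping means.
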
A nice illustration may be the 'grove' definition that is essential to the
HyTime and DSSSL revisions, which interprets an SGML document as a bag of
fragments, each fragment having properties that may be of the kind 'parent',
'child', 'attribute', and so on (admittedly, a 'grove' is still viewed as a
'tree'-like structure anyway).

The CONCUR part of the reply is not undisputed. The nice thing about minimal
SGML (not including FEATURES) is that -- disregarding some very specialized
debates -- it is well defined. Once FEATURES are introduced, the discussions
really start (minimization, SHORTREF, CONCUR, LINK), though these
discussions play out at several levels. CONCUR is in fact not well defined. It
is _practically_ limited by the simple fact that you cannot encode a
document of type 'letter' in two versions of the same type (i.e. two
'interpretations'). It is unclear why CONCUR has been emulated rather than
used by the TEI guidelines, i.e. whether this is the result of the deficient
design of the feature (e.g. NET-enabling start-tags and end-tag minimization
introduce ambiguities, and cross-references between concurrent views are not
possible), or just because most available parsers do not support it. In the
first case, an amendment could be compiled to this end; in the second case the
TEI community should provide such a parser, SGML system, or TEI system.
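
For readers who have not seen CONCUR in the wild, a rough sketch (unvalidated,
and reconstructed from memory of the standard's tag-qualification syntax) of a
fragment carrying both a verse view and a page view could look like this:

<!doctype verse [
<!element verse - - (l+)>
<!element l     - - (#pcdata)>
]>
<!doctype pages [
<!element pages - - (page+)>
<!element page  - - (#pcdata)>
]>
<(verse)verse><(pages)pages><(pages)page>
<(verse)l>A line of verse that starts on one page,</(verse)l>
<(verse)l>and a second line that runs</(pages)page><(pages)page>
across the page break.</(verse)l>
</(pages)page></(pages)pages></(verse)verse>

The practical limitation mentioned above is visible here: the two views must
belong to two different document types, so two 'verse' readings of the same
data could not coexist in this way.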

Patrick:
>In my first post I called for examples of text structures that could not
>be encoded using the TEI Guidelines. Several weeks later, I am still
>waiting for an example of such a text structure. Rather than continue a
>debate in the abstract, Lancashire should post a reference to say a
>portion of some manuscript of his choosing and supply photocopies to
>anyone who wished to encode said material using the TEI Guidelines. One
>or two manuscript pages should be long enough for a fair test without
>being an undue burden on those wishing to participate.

Patrick, is this a slip of the tongue? It is of no use whatsoever to send
you only a photocopy: I can give you the DTD for *ANY* photocopy you might
send, right now:

<!doctype this [
<!element this o o (#pcdata)>
]>

A similar DTD can be compiled within TEI P3 constraints.
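For instance (a hypothetical sketch, minimally structured and not checked
against the P3 DTD), an entire photocopy could be transcribed as a single
paragraph and still be conformant, while carrying next to no interpretation:

<TEI.2>
<teiHeader><fileDesc>
<titleStmt><title>Any photocopy whatsoever</title></titleStmt>
<publicationStmt><p>Unpublished transcription.</p></publicationStmt>
<sourceDesc><p>One photocopied manuscript page.</p></sourceDesc>
</fileDesc></teiHeader>
<text><body>
<p>... the full transcribed text of the page ...</p>
</body></text>
</TEI.2>
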
If you think 'this' is not sufficient, you are asking for an interpretation.
And if an interpretation must be sent along, your request misses its goal.
What you are trying to say here, I guess, is that any interpretation or set of
interpretations of a text can be encoded in SGML. This I think is true. Not
that the solutions always look elegant (cf. 'Hierarchical encoding of text:
technical problems and SGML solutions', the (bad) title of the paper in
IDE95, pg. 211, which is a revision of MLW18, originally published
in '88 -- the paper has a long history of revisions as I understand, and
rightfully so), but they are within the framework of SGML and only need an
'intelligent' system to make sense of them.

Sorry for the long reply.
Arjan.
Arjan Loeffen (loeffen@let.ruu.nl), Humanities computing, Faculty of Arts,
Utrecht University, The Netherlands. +302536417 (work), +206656463 (home)
http://www.let.ruu.nl/departments/C+L/loeffen/home.htm

--[2]------------------------------------------------------------------
Date: Fri, 05 Jan 1996 09:11:01 -0400 (EDT)
From: chhenry@vaxsar.vassar.edu
Subject: Re: 9.430 inappropriateness

Willard's request for a reply to his posting concerning Tzvi
Freeman's article "The Case for Inappropriateness" is a convenient plank on
which to frame a response to the recent, and dismaying, exchange on some
issues relating to the Text Encoding Initiative.

As one may infer from some of the points Prof. Lancashire has made, the TEI,
like any encoding system, is an act of interpretation, whether categorized as
presentational, analytical, or descriptive. To appropriate Barbara
Herrnstein Smith's term, an encoding system always tends to
privilege an epistemological bias. As I understand it, this is one aspect
of the TEI that Prof. Lancashire finds troublesome.

This and some other points Prof. Lancashire makes are difficult and not
susceptible to quick resolution. They require a dialogue involving not only
the developers of the TEI but also a broad range of humanists, who need to
ponder not so much the technical virtuosity of the TEI as the realization
that any project so large, so expensive to undertake, and so interpretive
requires a social contract to succeed. To read characterizations of Prof.
Lancashire's inquiries--for that is what they are--as vile smears or as the
impetus that raises pedantic hackles is evidence that the communal
construction of the TEI as an integral technological system appropriate to
the humanities is a distant prospect.

Chuck Henry