9.375 encoding: weights & measures

Humanist (mccarty@phoenix.Princeton.EDU)
Mon, 11 Dec 1995 20:42:46 -0500 (EST)

Humanist Discussion Group, Vol. 9, No. 375.
Center for Electronic Texts in the Humanities (Princeton/Rutgers)

[1] From: FLANNAGAN@ouvaxa.cats.ohiou.edu (39)
Subject: RE: 9.373 encoding: weights & measures

[2] From: Robin Cover <robin@utafll.uta.edu> (116)
Subject: weightings (certainty tags)

Date: Sun, 10 Dec 1995 11:02:50 EST
From: FLANNAGAN@ouvaxa.cats.ohiou.edu
Subject: RE: 9.373 encoding: weights & measures

On weighing evidence, and tagging the process:

It's a good idea, but it is also something else that complicates
life and makes us work too hard.

I could label a simile in {Paradise Lost} as an epic simile (at least,
say, six lines, with a clear tenor and a clear vehicle, if that's what
I said a Miltonic epic simile consisted of). But if my lower limit for
number of lines is six, what am I to do with a five-line simile,
especially if it does begin with "like" or "as"? Should I say "almost
an epic simile," or "a half-simile," or "a half-baked simile"? Should
I, pace Willard, say "60% simile"?

The problem of interpretation enters again, and I would have to make
the delimiters for "epic simile" absolutely clear, perhaps deriving
them only from Milton's practice in {Paradise Lost}. Then there is the
problem of comparing Milton's practice to c17 practice in general--or
Homer's practice, for that matter.

The problem of discovering, or of coming clean about, what is evidence or
meaningful evidence is also difficult. I was just reading a paper
submitted to {Milton Quarterly} in which the author said something like
"sixty percent of the words associated with the sense of taste are
given to Eve, whereas forty percent are given to Adam" (I am fudging
the figures). The author did not tell the reader what the words were,
although he did say what concordance he got them from. His reader
really needs to know what the words *he* thinks are associated with
taste are, and the reader needs some hard statistics about who uses
them when.

So, Willard, we are back to the problem of creating endless encoding,
and entering infinitely-varied SGML codes, all of which have to be
located and punched in scrupulously and meticulously, or we will have
failed as scholars.

Roy Flannagan

Date: Sun, 10 Dec 95 18:10:00 CST
From: Robin Cover <robin@utafll.uta.edu>
Subject: weightings (certainty tags)


> Willard McCarty <mccarty@epas.utoronto.ca> (36)
> Subject: encoding
> Here I would like to raise a matter that is quite independent of the
> encoding language (or meta-language) one uses -- at least I think it is. The
> question concerns the use of "weights" or degrees of certainty in a tag. I
> would like to understand more fully the rationale for using weights.

I'm not certain what all lies behind this question, and I can't speak for
anyone else, but I'm willing to defend the usefulness of "degrees of
certainty in a tag" insofar as I am familiar with the convention in my own
field, and against the backdrop of discussions which yielded chapter 17
(pages 521-528, 891-895) in the TEI P3 Guidelines. The TEI Guidelines, as
some may know, provide for optional use of a <certainty> tag, as well as a
"cert=" SGML attribute for the eight or so text-transcription tags. See
further below.
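The convention can be illustrated with a small sketch. The fragment below is invented, and it uses XML-style syntax only so a stock parser can read it (TEI P3 is defined over SGML); the element name and the "cert=" attribute are modeled on, not quoted from, the Guidelines:

```python
# Invented transcription fragment recording the encoder's confidence
# in a doubtful reading via a "cert=" attribute. XML-style syntax is
# used for convenience; TEI P3 itself is SGML.
import xml.etree.ElementTree as ET

fragment = '<l>and the <unclear cert="60%">Fruit</unclear> of that Forbidden Tree</l>'

line = ET.fromstring(fragment)
doubtful = line.find("unclear")
# the software can now retrieve both the reading and its certainty
print(doubtful.text, doubtful.get("cert"))
```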

> Certainly the idea that phenomena occur in data to varying degrees of
> certainty is hardly new and, I'd suppose, not really debatable. It seems
> readily apparent to me that without the use of weights encoding for such
> phenomena radically falsifies the data, at least in imaginative textual and
> visual material, and I'd guess also in music. One is always in the position
> of making the binary choice that phenomenon X either occurs in location Y or
> it does not. At first it would seem that weights would provide a solution,
> but however fine the scale one adopts, attaching a degree of certainty,
> presence, or identity rationalizes, therefore falsifies the data. So the
> problem remains. I would argue that in any case unless the weight can be
> rigorously computed, always and forever to the satisfaction of anyone who
> would treat the data, this problem remains and must be faced.
> One answer to this might be that weighting has value for someone who wishes
> to record his or her confidence in a judgment -- e.g. "I am 70% certain that
> a metaphor of death occurs here". No argument from me, except I wonder what
> value doing this might have. I grant that the person who encodes in this way
> might well gain from an exhaustive accounting of his or her judgments, but
> will anyone else? We get something like this, for example in a conventional
> edition of a literary text, where the editor will decide this and that, then
> sometimes record the reasons why, indicate the strength of other readings,
> simply their existence, and so forth. But would all the trouble of encoding
> a text magisterially be worth the effort?
> As I think I've commented before on Humanist, I find enormous value in
> facing the question at the simplest level, of the first-order binary choice
> -- "is phenomenon X here or not?" -- then making my choices and seeing in
> the mass of them what reasonable approximations can be made to the textual
> realities as I see them. What, then, is the essential difference between
> what I do and the use of weights?
> Willard McCarty

What if your "phenomenon X" is (a) "the occurrence of the character FOO
in the manuscript," or (b) "the primitive reading, vis-a-vis a couple other
probable scribal hypercorrections as alternative readings"?

Transcribers and text critics in some fields feel that the "first-order
binary choice" (as you call it) is simply not sufficient for the reading
audience. When encodings are more fine-grained, and as the corpus becomes
very large, and as the corpus comes online, more detailed "weights" become
highly useful because you can specify them in a query.

A couple examples for paper editions, and then further explanation of the
usefulness of "certainty" encodings in electronic text:

(1) Volumes in the series Discoveries in the Judean Desert (Oxford, 1956-;
the official publication for the [Qumran] Dead Sea Scrolls) currently
use a four-value system of encoding to signify "certainty" of a reading,
as understood by the editorial team publishing the editio princeps
for a given text:

a) certain character
b) probable character
c) possible character
d) conjectured character

While these levels are not assigned mathematical scores in the DJD volumes
(e.g., "95%, 75%, 40%, 20%" certainty), they provide some labeled measure
of confidence in the transcribed reading. For certain purposes, a
researcher may wish to promote as examples (of a certain textual feature)
"only readings that have 'a)' or 'b)' certainty levels, and no passages
which contain 'c)' or 'd)' certainty levels."
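That selection can be sketched mechanically. The readings below are invented; only the four-level scheme comes from the DJD convention described above:

```python
# A hedged sketch of the selection described above: transcribed
# characters tagged with DJD-style certainty levels (data invented
# for illustration), filtered to the "certain" and "probable" levels.
readings = [
    {"char": "aleph", "level": "a"},   # certain character
    {"char": "waw",   "level": "c"},   # possible character
    {"char": "yod",   "level": "b"},   # probable character
    {"char": "he",    "level": "d"},   # conjectured character
]

# keep only readings at levels a) or b)
accepted = [r for r in readings if r["level"] in ("a", "b")]
print([r["char"] for r in accepted])
```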

(2) The UBS (United Bible Societies) has ranked text-critical decisions in
their critical edition of the Greek New Testament in terms of a four-way scale:

a) "virtually certain" - committee vote virtually unanimous
b) "some degree of doubt" - committee vote not unanimous
c) "considerable degree of doubt" - committee very divided
d) "high degree of doubt" - (e.g., one of several equally
[im]probable readings)

The critical apparatus labels each lemma with an "A - D" coding, as does
the more detailed textual commentary volume. A "D" reading might mean,
for example, that there are 3 or more equally good readings, and the
committee could find no principled argument from stemmatics or internal
evidence for strongly preferring the reading placed in the eclectic text.

Now, in terms of your specific question -- "but will anyone else [gain from
such an exhaustive accounting of text-critical judgments]?" -- the answer is
clearly "yes" -- at least in the experience of the Oxford Press and the
UBS. A wise teacher who possesses minimal skills in the Greek language
(but who lacks the ability to do independent text-critical research) would
avoid basing a critical argument upon a single passage with a "d)" reading.
Or: a Greek scholar (who *can* work through thorny text-critical problems)
hastily preparing a talk for a popular audience would avoid placing weight
upon a passage marked with a "c)" or "d)" reading, simply because of
personal time constraints in the setting: s/he'd know the text is a thin basis for argument.

The above benefits that are enjoyed by users of the "paper" text are
multiplied many times when the encodings for "certainty" are registered
within an electronic text. In e-text, the certainty factor (on whatever
feature: paleographical, morphological, syntactic, thematic) can be
included in the researcher's query, so that matches in a query having
greater or lesser certainty than a specified value may be excluded or
included, depending upon the researcher's goals. I daresay that even
today's generation of SGML software (e.g., EBT's DynaText, using TEI-
encoded text with <certainty> tags or "cert=" attributes), can address
these weightings for "certainty" encoded in transcription and analysis.
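Such a certainty-aware query might be sketched as follows. Everything here is an assumption for illustration: the passage references, their grades, and the mapping of the "A"-"D" letters onto numeric scores so that a threshold can be applied:

```python
# Sketch of a query that includes or excludes matches by certainty.
# The passages and their letter grades are invented; the numeric
# mapping is one assumed way to make the grades comparable.
SCORE = {"A": 4, "B": 3, "C": 2, "D": 1}

passages = [
    ("Mark 16:9", "D"),
    ("John 1:18", "B"),
    ("Rom 5:1",   "A"),
]

def query(passages, min_grade="B"):
    """Return references whose certainty meets or exceeds min_grade."""
    cutoff = SCORE[min_grade]
    return [ref for ref, grade in passages if SCORE[grade] >= cutoff]

# a researcher pressed for time might exclude anything below "B"
print(query(passages))
```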


Robin Cover Email: robin@utafll.uta.edu ("uta-ef-el-el")
6634 Sarah Drive
Dallas, TX 75236 USA In case of link failure, use:
Tel: (1 214) 296-1783 (h) robin@acadcomp.sil.org
Tel: (1 214) 709-3346 (w)
FAX: (1 214) 709-3380 SGML Page: http://www.sil.org/sgml/sgml.html