4.0112 Errors in print and e-texts (1/68)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Wed, 23 May 90 19:47:52 EDT

Humanist Discussion Group, Vol. 4, No. 0112. Wednesday, 23 May 1990.

Date: Tue, 22 May 90 23:17:53 -0400
From: Mark Rooks <rooks@cs.unc.edu>
Subject: P-text errors; Clean e-text algorithm

Ken Steele writes that "even conventional printed publications" sometimes
have a large number of errors. We looked at a 250K work by J. S. Mill,
published in the late 1980s by one of the two leading British university
presses; we found over 50 word errors in the work. We looked at a
different 280K work by Mill published in 1978 by a U.S. commercial press,
and found over 60 word and 1200 punctuation errors. Both editions
purported to be the last edition published in Mill's lifetime; hence the
errors were deviations from the last edition (which we had in hand).
(None of the errors were corrections in Mill; these numbers ignore
spelling discrepancies.) (We handed the U.S. press their errors; they
promised corrections in a later edition.)

We are now working with 5 different editions posing as "reprints" of
Bentham's 1823 edition of Intro. to the Princ. of Morals and Legislation.
Unfortunately all five occasionally disagree with one another, though
3-2 splits are more common than other patterns. (Interlibrary loan has
yet to turn up a copy of the actual 1823. Interlibrary loan: "But we
have reprints of the 1823.")

Not all print is bad (though all is suspect): we found 5 word
discrepancies between a ~1900 Oxford reprint of the 1651 Leviathan and
the Macpherson Penguin 1651 reprint, none terribly important (e.g.
'these' vs. 'those'). The Macpherson differed by 1 word from the actual
1651, the Oxford by 4. By contrast, the 1843 Molesworth Leviathan
differed by over 350 words and 2000 punctuation marks from the 1651,
ignoring orthography and spelling modernization.

Assuming the availability of multiple editions (not based on one
another) with multiple typefaces, scanning should produce nearly
flawless text. Double or triple scan and file compare. On different
typefaces the scanner will rarely make the same errors. We usually
follow this procedure: scan 1 edition, proofread it onscreen,
electronically proofread it; scan a 2nd edition; print out a file
comparison and arbitrate with a 3rd edition. For the 3rd edition we
usually use the last edition published in the author's lifetime, unless
it itself is scannable (in which case we scan it directly). (We thereby
generate the last edition in the author's
lifetime (though there are complications of course).) Cleanup of the
first scan is the bottleneck, assuming decent scanning equipment. We
are more likely to introduce an error by an inadvertent keystroke during
our markup phase than to miss one with the above procedure. Too many
discrepancies warrant a third scan (of a 3rd edition). The second scan
sometimes requires cleanup, but no guessing is permitted (since the same
wrong guess might be made in the cleanup of the first scan).
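
To make the comparison step concrete, here is a minimal sketch in Python
of the kind of discrepancy report the arbitration works from. It assumes
the two scans already sit in plain-text files; the file names scan_a.txt
and scan_b.txt and the helpers load_words and discrepancies are
illustrative only, not any tool of ours. The word-level alignment uses
the standard difflib module, and each flagged discrepancy is then
checked by hand against the third edition.

from difflib import SequenceMatcher


def load_words(path):
    """Read a plain-text file and split it into a list of word tokens."""
    with open(path, encoding="utf-8") as f:
        return f.read().split()


def discrepancies(words_a, words_b):
    """Yield (words from scan A, words from scan B) wherever the scans differ."""
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            yield words_a[a1:a2], words_b[b1:b2]


if __name__ == "__main__":
    a = load_words("scan_a.txt")   # first scanned edition, after cleanup
    b = load_words("scan_b.txt")   # second scanned edition, different typeface
    for from_a, from_b in discrepancies(a, b):
        # Each reported disagreement is resolved by hand against the
        # third (arbitration) edition -- no guessing.
        print("scan A:", " ".join(from_a) or "(nothing)",
              "| scan B:", " ".join(from_b) or "(nothing)")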

At 1 error per megabyte (Ken Steele's figure), the notion of error itself
becomes problematic. I'm sure these issues have been discussed in this
forum before, but: does one duplicate missing periods in the author's last
edition? Misspelled (?) words? (Critical editions sometimes address
these concerns.) How does one discover (assuming realistic economic
constraints) that a database contains only 1 error per megabyte? Or
fewer than 1 per megabyte? (If I observe the error, it no longer is (a
variant on Heisenberg?) (the Textual Uncertainty Principle?).)
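
One rough way to see the economics of that last question (a
back-of-the-envelope sketch only, assuming errors strike characters
independently and the spot-check sample is random): by the
statistician's "rule of three," proofreading N characters against a
trusted source and finding nothing wrong supports only a ~95% upper
bound of about 3/N on the error rate. Certifying "fewer than 1 error
per megabyte" therefore means flawlessly re-proofreading on the order
of 3 megabytes. The arithmetic:

MEGABYTE = 1_000_000  # characters, give or take


def sample_needed(target_errors_per_mb, rule_of_three=3.0):
    """Characters that must be proofread, error-free, to support the
    target rate at roughly 95% confidence (upper bound ~ 3 / sample size)."""
    return rule_of_three * MEGABYTE / target_errors_per_mb


if __name__ == "__main__":
    for rate in (10, 1, 0.1):  # claimed errors per megabyte
        mb = sample_needed(rate) / MEGABYTE
        print(f"<= {rate} error(s) per MB at ~95% confidence: "
              f"proofread about {mb:.1f} MB without finding one")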

Mark Rooks