3.1235 scanning and e-texts, cont. (251)

Willard McCarty (MCCARTY@vm.epas.utoronto.ca)
Thu, 29 Mar 90 22:11:37 EST

Humanist Discussion Group, Vol. 3, No. 1235. Thursday, 29 Mar 1990.

(1) Date: Thu, 29 Mar 90 12:45:42 EST (42 lines)
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Re: 3.1231 scanning and e-texts (86)

(2) Date: Wed, 28 Mar 90 13:00:30 CST (78 lines)
From: "Robin C. Cover" <ZRCC1001@SMUVM1>

(3) Date: Wed, 28 Mar 90 15:32:45 EST (15 lines)
From: Grace Logan - ACO <grace@watserv1.uwaterloo.ca>
Subject: noed

(4) Date: Wed, 28 Mar 90 16:41 PST (43 lines)
Subject: Alpha Centauri

(5) Date: Wed, 28 Mar 90 22:14:00 EST (14 lines)
Subject: RE: 3.1231 scanning and e-texts (86)

(6) Date: Wed, 28 Mar 90 22:32 EDT (19 lines)
From: The Man with the Plan <KEHANDLEY@AMHERST>
Subject: converting typesetting data into usable stuff

(1) --------------------------------------------------------------------
Date: Thu, 29 Mar 90 12:45:42 EST
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Re: 3.1231 scanning and e-texts (86)

The comments about lost satellite data miss the mark, for the following reasons:

1. They were stored on mainframes, which were few in number and
remarkably arbitrary in both operating systems and hardware,
with formats not made to be portable to other machines. Still,
it was easy(?) to update a tape format simply by sending the
tape to another machine, which then stored it in its own tape
or disk format. The same trick is extremely useful for the
modern-day user, who can upload an e-text from one system and
then download it to a new one; tape and disk formats are thus
taken care of automatically.

The satellite records were stored in arcane formats that no
modern micro operator would consider. Moreover, the problem of
an entire type of computer and its media becoming obsolete in
the span of a few years is no longer a reality, short of WWIV:
the proliferation of DOS alone is enough to assure that a text
prepared on DOS disks can be read years from now. This is also
one of the reasons Project Gutenberg insists ALL texts be made
available as straight ASCII files with no encoding - while also
encouraging all methods of encoding, with their various
(in)compatibilities with the search strategies in use by
various people and institutions.
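The "straight ASCII, no encoding" requirement is easy to check mechanically. A minimal modern sketch (the function name and the byte whitelist are my own choices, not Project Gutenberg's): a file passes if every byte is printable 7-bit ASCII, a tab, or a line ending.

```python
def is_plain_ascii(path):
    """Return True if every byte in the file is printable 7-bit ASCII,
    a tab, or a CR/LF line ending -- i.e. 'straight ASCII, no encoding'."""
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    with open(path, "rb") as f:
        return all(b in allowed for b in f.read())
```

Any file that passes this test can be read on essentially any system, which is the portability point being made above.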

2. Modern e-texts are in the hands of the public, not in the
right or left hand of the government, which knoweth not what
the hand on the other arm doeth. Whenever a private individual
copies a file, that person knows how to get it back - did I
say that?

Thank you for your interest,

Michael S. Hart, Director, Project Gutenberg
National Clearinghouse for Machine Readable Texts

(2) --------------------------------------------------------------83----
Date: Wed, 28 Mar 90 13:00:30 CST
From: "Robin C. Cover" <ZRCC1001@SMUVM1>


I do not think the cost estimate given by Bob Amsler for scanning (or
otherwise digitizing) a book is high, given the very UN-intelligent
scanning software available today. Based upon our costs of data
preparation for several volumes used in a CDROM hypertext project
(using Kurzweil 4000 scanner and student proofreading teams), I would
put the average closer to $16,000 per volume. These were reference
books, which may be more demanding than novels. The actual cost of
data preparation -- at least in our project -- depends upon many
factors, but the most important are (a) the amount of markup you want
and (b) the level of accuracy required. Of course, the feasibility of
scanning itself, as opposed to keypunching, varies greatly as a
function of the document features (e.g., the languages, graphics,
print quality). A few further details are given below for anyone who
is interested.

Digitizing (scanning, proof-reading) and tagging a 900-page
Greek-English lexicon cost us about $30,000. (The specific book is
the English 2nd edition of Bauer-Arndt-Gingrich-Danker's
<cit>Greek-English Lexicon</>). Part of this high cost was due to
strict quality-assurance demanded by the University of Chicago Press:
they wanted to see "no errors" in the text, so we had to proof-read it
an extra time. Most of the generic data-correction facilities,
including spell-checkers, were useless for this multi-lingual text
(it contains Hebrew and Aramaic in addition to Greek). But markup was
also a large part of the cost: if we want to do indexing and
retrieval, having a text marked with typographic codes (necessary for
display, in traditional applications environments) is of little
benefit. So we had to do structural markup, lemma reconciliation with
two other Greek lexicons and two Greek morphology databases, etc.
This is why I say that markup is potentially the most significant
variable in the price equation when we talk about producing e-texts.
Just as one should not casually accept an estimate for an SGML DTD,
one should not accept an estimate for "digitizing" a book without
specifying all details of markup.

Data preparation for other books such as bible commentaries and bible
dictionaries cost somewhat less ($12,000 - $18,000), depending upon
various factors of typography and linking. In the few cases where we
were lucky enough to get typesetting tapes, the data preparation was
significantly less -- but still not trivial. Typographic codes are
typically underspecified for text-retrieval purposes.
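A toy illustration of that underspecification (the typesetting codes and the lexicon line are invented for the example): a typesetter's italic toggle marks both headwords and cross-referenced forms with the same code, so a purely mechanical typographic-to-structural conversion cannot tell them apart.

```python
import re

# Hypothetical typeset line: the same italic code <i>...</i> marks the
# headword AND a cross-referenced verb form.
typeset = "<i>logos</i> word, speech; cf. <i>legein</i> to say"

# A naive converter turns every italic span into a structural <lemma> tag...
structural = re.sub(r"<i>(.*?)</i>", r"<lemma>\1</lemma>", typeset)

# ...wrongly tagging the cross-reference as a headword too: the typographic
# code carries less information than retrieval needs.
```

Distinguishing the two requires exactly the kind of knowledge about document structure and language that the human markup work described above supplies.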

The moral of this story, as already noted by a previous contributor to
this discussion, is easily grasped: so long as we submit our written
intellectual creations to publishing processes which corrupt the data
for information retrieval, we will pay a high price to "get the
information back" in an intelligent format. SGML editors and
text-processing systems are now becoming more widely available. But
for regaining the intellectual creations of our written/printed past,
scanning software (like operating systems, structured-document
editors, and authoring systems) needs to develop a concept of what
"language" is to a text, and the fundamental (heuristic) notion that
texts are not composed of senseless strings of characters.
Intelligent scanning needs to be more than "character-recognition"
software, even if characters are important atomic units in many
languages: it should be possible to tell the data preparation software
details about the document structure, languages, literary genre (etc.)
so that these hints are used in the creation of an e-text. This may
never happen so long as "Stealth Bombers" (apud D. Durand) are more
important to us than the knowledge in the books of the Library of
Congress.
Robin Cover

3909 Swiss Avenue
Dallas, TX 75204
AT&T: (214) 296-1783/841-3657
FAX: 214-841-3540
BITNET: zrcc1001@smuvm1
INTERNET: robin@txsil.lonestar.org
INTERNET: robin@utafll.lonestar.org
UUCP: texbell!txsil!robin
(3) --------------------------------------------------------------31----
Date: Wed, 28 Mar 90 15:32:45 EST
From: Grace Logan - ACO <grace@watserv1.uwaterloo.ca>
Subject: noed

Did I misunderstand M. Hart's posting, or did he say that the NOED
was put in electronic form starting in 1975? It really did start
much later (it only began to be talked about in 1984). If I didn't
get it right, I'm sorry. But it should be clear that it's a
more recent project (and hence the technology used wouldn't have
been all that outdated). Scanning was considered, by the way, and
found not suitable for the purpose for a variety of reasons.


(4) --------------------------------------------------------------49----
Date: Wed, 28 Mar 90 16:41 PST
Subject: Alpha Centauri

HUMANISTS drooling at the prospect of having the Library of
Congress online should know of the work of the Commission on
Preservation and Access.

The Commission estimates that there are some 305 million volumes
in the nation's research libraries, of which some 25%, or 78
million, are brittle or turning to dust, thanks to acid paper.
Of these, some 68 million are thought to be duplicates. The
Commission's charge is to save in some form or other as many as
possible of these books before they are gone in the next 20 years
or so. The Commission believes that in the best case only some 3
million of 10 million non-duplicates can be saved, and to achieve
even this, 20 institutions will need to microfilm (there are
reasons for this choice) 7500 volumes a year for 20 years. Costs are
estimated at $60 per book or $180M for the whole project; there's
reason to think the federal government will come up with that
much. The Commission has set up task forces in various
disciplines to decide what gets saved and what goes.
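The Commission's figures as reported above hang together arithmetically, which a back-of-the-envelope check makes plain:

```python
# Figures as given in the post above.
total_volumes  = 305_000_000
brittle        = 78_000_000        # "some 25%" of the total
duplicates     = 68_000_000
non_duplicates = brittle - duplicates      # 10 million unique brittle books
savable        = 3_000_000                 # best-case target

institutions   = 20
per_year_each  = 7_500
years_needed   = savable / (institutions * per_year_each)

cost_per_book  = 60
total_cost     = savable * cost_per_book   # dollars
```

At 20 institutions filming 7,500 volumes a year, saving 3 million books takes 20 years, and at $60 per book the bill comes to the quoted $180M.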

The point is that if we're not going to come up with the half
billion or so it would take to save what we've got, we're not
going to come up with the billions and billions it would take to
put it all online. It's triage, folks, not Alpha Centauri.

Best wishes,

Charles M. Young

Department of Philosophy
Claremont Graduate School

Member, American Philosophical Association Committee on Computer
Use in Philosophy and its Subcommitee on Electronic Texts
Member, Philosophy Task Force, Commission on Preservation and
Access
(5) --------------------------------------------------------------25----
Date: Wed, 28 Mar 90 22:14:00 EST
Subject: RE: 3.1231 scanning and e-texts (86)

I think the idea that an e-text has to preserve everything of the original
text is a bit of an exaggeration. No one (I think?) is suggesting that the
originals should be destroyed (we are not, after all, in a Fahrenheit 451
world), just that an alternative medium be available. Access to the
alternative medium, say on CD-ROM, would allow researchers in far-off
corners of the world (e.g. Brazil, where I am writing from) access to
arcane bibliography. If one needs to look at the originals, obtain a grant
and do it in the great repositories of such things (LOC and Co.).
Hope my two bits' worth from the third world proves useful.
(6) --------------------------------------------------------------25----
Date: Wed, 28 Mar 90 22:32 EDT
From: The Man with the Plan <KEHANDLEY@AMHERST>
Subject: converting typesetting data into usable stuff

In response to someone who had typesetter's magnetic data, but no
inkling of how to make that data useful, I suggest checking out the companies
listed in BYTE Magazine (I'm looking at April, 1990) in the Buyer's Mart
section (pp.620-21). Under the heading of "Data Conversion" there are two
firms listed (although judging by the ads, I think only one might fit the
purpose), and under "Data/Disk Conversion" eight firms are listed. I have
no data-conversion experience with any of these companies; all my
knowledge comes from their ads, you see. Good luck, and whatever you end
up doing,
tell us about it.

Keith Handley
User Services Associate
Amherst College Academic Computer Center