3.1208 super-scanning and its costs (137)

Willard McCarty (MCCARTY@vm.epas.utoronto.ca)
Fri, 23 Mar 90 23:04:45 EST

Humanist Discussion Group, Vol. 3, No. 1208. Friday, 23 Mar 1990.

(1) Date: 22 March 1990, 19:28:47 EDT (25 lines)
Subject: scanners and cooperative reading and revising

(2) Date: Thu, 22 Mar 90 12:33:37 EST (46 lines)
From: amsler@flash.bellcore.com (Robert A Amsler)
Subject: Digitizing the Library of Congress

(3) Date: Thu, 22 Mar 90 11:19:47 CST (43 lines)
From: Mark Olsen <mark@gide.uchicago.edu>
Subject: Costs and Scanning

(1) --------------------------------------------------------------------
Date: 22 March 1990, 19:28:47 EDT
Subject: scanners and cooperative reading and revising

Even given the error rate of current mid-priced OCR hardware and
software, and the need for accurate transcriptions of primary texts, it
would be comparatively easy to arrive at accurate texts if we humanist
scholars, or even the Feds in our respective countries, could agree on
a procedure for reporting and correcting errors in texts. Companies
selling or brokering texts, such companies as Electronic Text
Corporation or Shakespeare on Disk, might offer small royalties for
lists of errors on texts they have already marketed--a kind of bounty on
errors reported. Then bounty hunters and those people who like to read
for misprints (and most editors get into the habit of doing that whether
they like to or not) could report errors for a pittance. Any person
controlling a large-scale scanning effort, as with the Milton Database
and the Renaissance Textbase projects at Oxford, Toronto, the University
College of North Wales and Ohio University, would be happy to receive
corrections and to incorporate them into the frequent revisions that
electronic texts can support. Now we need suggestions about how
for-profit and not-for-profit organizations can best organize and
supervise the error correction. I think it would be good to have
on-line corrections even with lower-level, secondary publications, as
with scholarly journals, for the sake of accuracy and to avoid
misrepresenting the author or journal.

Roy Flannagan
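The reporting-and-correcting procedure Flannagan proposes can be sketched in a few lines of modern Python. This is only an illustration, not anything the posting specifies: the sample text, the (wrong, right) pair format for reported errors, and the simple string substitution are all assumptions.

```python
# A minimal sketch of the correction workflow described above: a list of
# reported errors (e.g. from a proofreader claiming the bounty) is applied
# to scanner output to produce a revised electronic text. The error-list
# format and the sample text are hypothetical.

scanned = "To be or not to he, that is tlie question."

# Each entry pairs the erroneous reading with its correction.
error_list = [
    ("not to he", "not to be"),
    ("tlie", "the"),
]

revised = scanned
for wrong, right in error_list:
    revised = revised.replace(wrong, right)

print(revised)  # "To be or not to be, that is the question."
```

In practice a real project would key each correction to a page and line reference rather than a bare string, so that a substitution cannot accidentally fire elsewhere in the text.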
(2) --------------------------------------------------------------------
Date: Thu, 22 Mar 90 12:33:37 EST
From: amsler@flash.bellcore.com (Robert A Amsler)
Subject: Digitizing the Library of Congress

I think the cost is far, far more than we can afford. To understand what
is involved one should note that this task would be more difficult
than either making a photocopy or a microform copy of everything in
the Library of Congress.

Consider a single work, such as the Oxford English Dictionary. The
effort to create a usable version of the OED took years and hundreds
of thousands if not millions of dollars. It required the work of
dozens of people.

Now we can create digitized bitmaps of printed pages--in black and
white--but these would be as useless as microform copies are today,
and much more expensive because of the need to store gigabytes of
data on magnetic media that don't meet archival standards. The
digitizations are inadequate for photographs and are entirely
oriented toward high-contrast (black/white) storage. To record
gray-scale or colors would require much experimentation. Storing
data on optical media 'might' solve the archival issue, but nobody
knows the resolution needed right now nor even whether the
digitization would be any more useful in the future than the
microforms are today. The lifespan of an electronic recording medium
is less than the time it would take to finish the project.
Optical Character Recognition of any but contemporary texts
in any but a few fonts is highly experimental.

If we accept an estimate of 20 Terabytes as the size of the Library
of Congress, then at .5-1.0 megabyte per volume we'd get
something like 20-40 million books. At $10,000 per book, that
gives me something like $200 billion to represent the Library of
Congress in a format as useful as the Oxford English Dictionary.
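Amsler's estimate can be checked with a quick sketch. The inputs (20 terabytes total, 0.5-1.0 megabyte per volume, $10,000 per book) are the posting's own assumptions; the quoted $200 billion is the low end of the resulting range.

```python
# Back-of-the-envelope check of the digitization estimate above.
# All inputs are the posting's assumptions, not measured values.

LIBRARY_BYTES = 20 * 10**12               # 20 terabytes for the whole LOC
BYTES_PER_VOLUME_LOW = int(0.5 * 10**6)   # 0.5 megabyte per volume
BYTES_PER_VOLUME_HIGH = 1 * 10**6         # 1.0 megabyte per volume
COST_PER_BOOK = 10_000                    # dollars, for an OED-quality conversion

books_low = LIBRARY_BYTES // BYTES_PER_VOLUME_HIGH   # 20 million books
books_high = LIBRARY_BYTES // BYTES_PER_VOLUME_LOW   # 40 million books

cost_low = books_low * COST_PER_BOOK    # $200 billion -- the figure quoted
cost_high = books_high * COST_PER_BOOK  # $400 billion at the high end

print(f"{books_low:,} to {books_high:,} books")
print(f"${cost_low:,} to ${cost_high:,}")
```

Note that the dominant term is the $10,000-per-book conversion cost, not storage: even a tenfold error in bytes-per-volume barely moves the total compared with a tenfold error in labour cost per book.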

The sum is SO large that I think it will never happen. The real
goal ought to be to start collecting machine-readable copies of published
works rather than letting them disappear and saving only the printed
versions. Then, in 10-20 years probably half the works will be
available in machine-readable form and due to the reprinting of
older works in new editions, almost all the information will be
machine-readable. However, there is virtually NO digital imagery in
use today in publishing. The photographs will remain outside of
digital access a while longer. Probably we can start collecting
digital imagery from the publishing of works in a decade.

(3) --------------------------------------------------------------------
Date: Thu, 22 Mar 90 11:19:47 CST
From: Mark Olsen <mark@gide.uchicago.edu>
Subject: Costs and Scanning

Steve DeRose is correct in claiming that the cost of rendering the entire
LOC is a measurable number -- tho' his guess is probably low by a couple
of orders of magnitude. The real question is cost/benefit. There are
many people willing to pay substantial amounts of money -- recovery of
initial costs plus profits -- to have online access to legal records and
other information. This is *NOT* the case for much of the material in
the LOC. The back issues of _Language_ are of interest to a relatively
small and, by comparison to the legal or medical professions, very poor
community. And when we begin to consider the vast holdings of the LOC
in 19th century novels or other ephemeral literature that will interest
only tiny portions of the population, one has to wonder whether there
will ever be enough people willing to pay to access that material in order
to recover even a tiny fraction of the costs of input.

Steve's other point, that very high accuracy is not required, is incorrect on
several grounds. The first is that OCR technology is VERY poor at dealing with
texts published in the 19th century, for example. We have tried scanning
considerable amounts of this material, and it has proven close to impossible.
Raw scanner output from this is unreadable a very high percentage of the time.
I can't imagine applying for large amounts of money from any funding
agency promising huge amounts of inaccurate, frequently unusable, data
in return. Finally, my experience with users of large amounts of
e-text is that they EXPECT printed edition quality, particularly in terms
of simple accuracy.

The rescue of material disintegrating in libraries will probably be aided
by storage of digitized images of documents, much like microfilm. The
advantages of image storage are huge in terms of preserving much of the
non-textual information in a document, and images can, if OCR ever gets
to the point of being accurate enough to function without considerable
human intervention, become the basis for Steve's electronic LOC. This
technology is already in place and can be used for many of the applications
that Steve really wants -- easy access to archived documents -- at much
lower cost.