scanning, e-texts, the NeXT, and more (108)

Tue, 18 Apr 89 20:21:55 EDT

Humanist Mailing List, Vol. 2, No. 856. Tuesday, 18 Apr 1989.

(1) Date: Tue, 18 Apr 89 13:42 EST (29 lines)
From: Terrence Erdt <ERDT@VUVAXCOM>
Subject: Scanning (25 lines)

(2) Date: Tuesday, 18 April 1989 0934-EST (59 lines)
Subject: KDEM Scanning; e-texts, etc.

(1) --------------------------------------------------------------------
Date: Tue, 18 Apr 89 13:42 EST
From: Terrence Erdt <ERDT@VUVAXCOM>
Subject: Scanning (25 lines)



Sources close to Kurzweil report that within 15 to
30 days, the company will announce a new scanner, the
Model 5100. The 5100 reportedly will be trainable and
four times faster than the Model 5000 with "clean" text
and twice as fast with less than ideal type. It will
resemble the 4000 in "trainability" and be particularly
powerful in managing dot matrix output. The sources
indicated, too, that the new model will make the work of
transcription easier.

The Model 5100 apparently will assign special
characters, such as Greek letters, to inscriptions that it
cannot recognize but which it encounters with some
frequency. Users will then be able to assign an ASCII code
to each of the special characters.
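
The assignment step described above can be pictured as a simple
substitution table. What follows is only a sketch under stated
assumptions: it is not the Model 5100's interface, which has not been
announced, and the placeholder characters, the ASCII values chosen for
them, and the function name are all hypothetical.

```python
# Hypothetical table: each placeholder the scanner emitted for an
# unrecognized but recurring shape, and the ASCII code the user chose.
PLACEHOLDER_TO_ASCII = {
    "α": "a",
    "β": "b",
    "γ": "g",
}


def apply_assignments(text, assignments):
    """Replace every placeholder character with its assigned ASCII code.

    Characters that have no assignment are left untouched, so they
    can be found and dealt with in a later pass.
    """
    table = str.maketrans(assignments)
    return text.translate(table)


if __name__ == "__main__":
    print(apply_assignments("lαβel", PLACEHOLDER_TO_ASCII))
```

Keeping the assignments in one table means a wrong choice can be
corrected in a single place and the text converted again.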

from the desk of

Terry Erdt

(2) --------------------------------------------------------------62----
Date: Tuesday, 18 April 1989 0934-EST
Subject: KDEM Scanning; e-texts, etc.

As you know, CCAT has lots of experience reading Greek
and other "strange" character sets on its KDEM 3 (as do
other places like Oxford, Cornell, the old Duke center,
etc.). There are some tricks that would probably work on
the KDEM 4000 too, and I'm happy to share them if there is
a subgroup of HUMANISTs who are interested.
The recent communications from Jonathan Altman and Bob
Hollander regarding the Dante Commentary project experience
deserve further attention as well -- e.g. in what circumstances
is unattended scanning more efficient than the edit-while-you-scan
approach? What tools exist for semi-automatic verification
(e.g. lists of impossible or unlikely letter combinations in
the target language that can be safely changed automatically
in most instances -- e.g. in English, "cd" is usually a KDEM
misreading of "ed" [but could also be "od" or even "nd"], etc.)?
Indeed, such tools would help in any editing, I would think,
but the KDEM makes many typical errors that can be caught
this way (e.g. "j" at the end of a word is usually "]").
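
A minimal sketch of such a verification aid, assuming plain English
text as input. Only the "cd" entry and the word-final "j" rule come
from the examples just given; the table layout and the function names
are hypothetical, and a working list would have to be built up from
experience with a particular scanner and language.

```python
import re

# Suspect letter pairs for English, each with candidate readings,
# likeliest first. Only "cd" is taken from the note above.
SUSPECT_PAIRS = {"cd": ["ed", "od", "nd"]}

WORD = re.compile(r"[A-Za-z]+")


def flag_suspects(text):
    """Return (word, suggestions) for each word containing a suspect pair.

    Nothing is changed here; the list is meant for an editor to review.
    """
    findings = []
    for match in WORD.finditer(text):
        word = match.group()
        lower = word.lower()
        for pair, fixes in SUSPECT_PAIRS.items():
            if pair in lower:
                suggestions = [lower.replace(pair, fix) for fix in fixes]
                findings.append((word, suggestions))
    return findings


def fix_trailing_j(text):
    """Change a word-final "j" to "]".

    This is the automatic kind of change; it would also alter the rare
    real word ending in "j", so the output still deserves a glance.
    """
    return re.sub(r"(?<=\w)j\b", "]", text)


if __name__ == "__main__":
    print(flag_suspects("It was scanncd twice"))
    print(fix_trailing_j("see [sicj here"))
```

The sketch keeps the two cases apart on purpose: an ambiguous pair is
only flagged with its candidates, while the change that is safe "in
most instances" is applied directly.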

Well, Willard, originally I meant to be dashing off a note to
you alone, but maybe it's really for the HUMANIST gang. And
really, what is all this about a different style in BITNET
communications? Seems to me that it's the same mixture of
styles that are typical of the "memo" hardcopy circuit
that we all experience daily! At least, speaking for myself,
I'm not conscious of writing any differently here, or of seeing
others write characteristically differently, from the world of
ordinary mail and memos. But then, I've been using the computer
for my writing pad for nearly a decade.

Apropos the useful contribution from Geoff Rockwell of the NeXT
staff, if they really hope to get massive amounts of significant
books ready to use on such machines, we really need to pay
serious attention to the issues of encoding and archiving.
I've tried my hand at this on a very small scale, with the
CCAT portion of the PHI/CCAT CD-ROM, with its variety of
languages and formats represented. It's a lot of work, and takes
more coordination and forethought than most individuals working
in only one or two languages or formats would realize. One of the
reasons for issuing those materials on the PHI/CCAT CD-ROM was to
provide programmers and users with texts that would help expose
the problems. I hope that NeXT and others are taking advantage of
the opportunity offered there! As I have said before, and will
doubtless say again: encoding and formatting data is in many
ways an extremely dull task, but it is also extremely crucial and
basic to all the other exciting things that students of texts
(in the widest definition of "text") may want to do with the
computer. The examples of such pioneering projects as TLG,
the French Literature material (ARTFL and its French counterpart),
the Global Jewish Database/bank, etc., stand tall, but we need
to be vigilant to support such efforts, and to pay attention
to the needs of coordination and compatibility (at least in
a general sense). The Librarians are a key element here!

(And it was going to be a short note!) Bob