optical scanning, cont. (150)

Mon, 24 Apr 89 21:21:07 EDT

Humanist Mailing List, Vol. 2, No. 882. Monday, 24 Apr 1989.

(1) Date: Mon, 24 APR 89 18:02:01 BST (86 lines)
Subject: Kurzweil scanning

(2) Date: Mon, 24 Apr 89 15:35:14 -0800 (44 lines)
From: Malcolm Brown <mbb@jessica.Stanford.EDU>
Subject: panel discussion on scanning at the Toronto conference

(1) --------------------------------------------------------------------
Date: Mon, 24 APR 89 18:02:01 BST
Subject: Kurzweil scanning

I have been forwarding the recent discussion of optical scanners to
our Kurzweil staff, two of whom have been operating the machines for
almost 9 years and can thus claim to have had longer experience of
the machine for academic use than anybody else - Kurzweil assured us
that we were their first academic customers.

Here are their comments.

Susan Hockey
Oxford University

We, the KDEM Service in Oxford, have been following with great interest
the Humanist discussion on OCR (Optical Character Readers), primarily
the Kurzweil 4000. We have been running a Kurzweil service for 9 years now
and have just recently switched from the MAX to the 4000. We were rather
disappointed to discover that the 4000 is almost twice as slow as its
MAX predecessor.

The most important aspect of the work, we believe, is well-trained
and reliable staff, as well as the provision of a comprehensive service for
the users. We are very strict about what we accept for scanning.
As long as the quality of the print is reasonably good, regardless
of whether it is English, Russian, Hebrew, Greek or any other language
with all its idiosyncrasies, we will take it on. In order to produce the best
results we devote a great deal of time to coding, accents and diacritics,
as well as testing and experimenting with black and white threshold levels,
point sizes, font changes etc. which often involves photocopying, enlarging
or reducing the print.

Once the scanning is complete, we do a general tidy up for the user, using
wherever possible global edits. When this is done the user proofreads
the text, which comes back to us for a final edit which is free of charge.

Here are some of our general comments:

1. The machine should not be run unattended. Correction of interventions,
at the very least, during scanning time saves a great deal of subsequent
editing. We find that with the 4000 we can keep up with the scanner
quite comfortably. In any case the machine can read ahead if necessary.
2. We found the 4000 better on typescript than printed books.
It is not as good on ligatures and quotes and has problems with m, rn, i,
l, f, S etc. The list is endless.
3. We recommend making lists during scanning of typical misreads, and then
running global edits on the mainframe.
4. We frequently photocopy material and guillotine (though very reluctantly)
5. The tolerance level of different point sizes within the same text is quite
6. We have been quite successful with scanning unusual scripts like Sanskrit,
Old Church Slavonic, Hebrew and Greek. If Greek is printed in italic, then
it's almost impossible to pick up diacritics. Some of our users were quite
happy without them though.
7. It is not essential for the operator to be familiar with the language
that is being scanned, as long as he or she is familiar with the alphabet.
We have successfully scanned a great number of texts in foreign languages.
Only Grazyna Cooper of the KDEM staff here in Oxford has knowledge of
other languages.
8. We disagree on the whole with the statement that minimum intervention
is better. Giving a great deal of attention at the scanning stage
is the most efficient and economical way of doing the job properly.
9. We haven't experimented with allowing the machine to make up its own
mind as to the coding (for example of Russian characters). It is an
interesting idea and we will try it when the opportunity arises.
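
As an illustration of point 3 above (listing typical misreads during
scanning and then running global edits), here is a minimal sketch in
Python. The misread pairs and the function name are hypothetical and
chosen only for illustration; they are not taken from the KDEM
Service's own lists or procedures, and a real list would be built up
per text by the operator.

```python
import re

# Hypothetical misread list compiled during scanning: (misread, correction).
# These pairs are illustrative only, not the KDEM Service's actual list.
MISREADS = [
    ("rnodern", "modern"),
    ("tbe", "the"),
    ("wbich", "which"),
]

def apply_global_edits(text, misreads):
    """Replace each listed misread with its correction, whole words only."""
    for wrong, right in misreads:
        # \b keeps the edit from touching longer words that merely
        # contain the misread string.
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, right, text)
    return text

sample = "In tbe rnodern edition, wbich follows tbe manuscript."
print(apply_global_edits(sample, MISREADS))
```

Matching on whole words is a deliberately cautious choice: a global edit
that is too broad can introduce new errors faster than it removes old
ones, so each pair is best checked against the text before it is run.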

We are always amused by the lack of understanding from a great number of users
of what the machine is capable of reading. It will not perform miracles.
It will only read as well as the quality of the print it is presented with,
combined with the skill and expertise of the operator.


We did use the 4000 to do a test on the Pali text. We were pleasantly surprised
by your comments about it. We do not edit test samples on the mainframe.
The user needs to be familiar with all the problems before any decision
is made on whether the text is worth scanning. Your text mainly requires
good photocopies!!!

Grazyna Cooper, Andreana Holl, Anita Sabin.

(2) --------------------------------------------------------------46----
Date: Mon, 24 Apr 89 15:35:14 -0800
From: Malcolm Brown <mbb@jessica.Stanford.EDU>
Subject: panel discussion on scanning at the Toronto conference

Willard suggested I offer a few words concerning the
discussion panel I've organized for the Toronto conference.

I'll start with a negative definition: the panel discussion
is not "on OCR", in the sense of being merely a report
on the current state of OCR technology. Instead,
I've asked the panelists to debate the overall
value of OCR as a means of text entry (as opposed to
contracting a typist or a keyboarding firm).

Obviously the panelists will need to address the capabilities
of current OCR technology, but the goal of the discussion
is to evaluate the technology over against the alternatives.

I organized the discussion entirely for selfish reasons. Having
encountered the limitations of the Kurzweil 4000 all too often, I'd
like to know how the alternatives look.

Indeed, Ted Brunner of the TLG project reports that their keyboarding
firm guarantees an error rate of not greater than one error
per 25000 characters. That would save a tremendous amount of
post-scan cleanup work, allowing us to devote more resources
to the research at hand.
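
To make the guaranteed rate concrete, here is a small arithmetic sketch
in Python. Only the figure of one error per 25000 characters comes from
the report above; the text length of one million characters is a
hypothetical value chosen for illustration, and the function name is
likewise an assumption.

```python
# Guaranteed rate reported above: at most one error per 25000 characters.
CHARS_PER_ERROR = 25000

def max_guaranteed_errors(text_length_chars):
    """Upper bound on errors implied by the guarantee for a given length."""
    return text_length_chars / CHARS_PER_ERROR

# Hypothetical text of one million characters (an assumed figure).
hypothetical_length = 1_000_000
print(max_guaranteed_errors(hypothetical_length))
```

The same function can be applied to any text length to estimate how much
proofreading a keyboarded text would still need under that guarantee.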

The discussion panel I've organized brings together
those among us who have been dealing with the
"challenges" of text entry for some time: Lou Burnard (Oxford),
Mark Olsen (ARTFL), Terry Erdt (CHum/Villanova),
Mel Smith (BYU), Bill Holmes (US National Archives).

In addition, several OCR vendors will be participating in
the software fair. These include Kurzweil, Calera Recognition
Systems (TrueScan), CTA (textpert), and Makrolog (Optopus).
This last system is the one mentioned by Dr. Ott in his recent
submission. Fair attendees should bring samples of their most
difficult texts to challenge these systems!

Malcolm Brown