optical scanners, cont. (94)

Thu, 6 Apr 89 19:30:25 EDT

Humanist Mailing List, Vol. 2, No. 810. Thursday, 6 Apr 1989.

Date: Wed, 5 Apr 89 22:36:32 -0400
From: jonathan@eleazar.Dartmouth.EDU (Jonathan Altman)
Subject: Observations on OCR

This discussion on OCR, I probably should note, was the one which
finally caused me to decide to join this mailing list.

My experiences with OCR lead me to believe that I might have something
to add about the proposal set forth by Malcolm Brown for benchmarks of
OCR packages, as well as the general discussion of OCR. I should
probably note that the Dante Project has so far scanned, or had
scanned for us, over 150 MB of text, and that I have worked on this
scanning in every role from operator to machine supervisor.

I had the responsibility in the summer of 1987 of finding a new
text-scanning alternative to Dartmouth College's old, ailing Kurzweil
model 2, and the main alternative was the Kurzweil 4000. As an aside,
I should note that these alternatives did NOT include pc-based
scanners/software. The heart of this, however, is that in coming up
with an evaluative measure of how much better the Model 4000 was than
the Model 2, I did exactly what Malcolm Brown suggested: came up with a
set of benchmarks.

I chose packets of text that our Project had or planned to scan, and
chose them to test individual criteria. I got lucky on my hunches
about criteria. My samples were:

1. I chose a sample of our highest quality material. Highest
quality for us was a xeroxed copy of a brand new book published in
1982, whose binding we had sawed off to eliminate text warping
from the binding. This sample was designed to test the absolute
ceiling speed of the Kurzweil's scanning: how fast would it go with
the best set of conditions available? I hate to say, the results
were not overwhelming. The 4000 offered almost no speedup over the
model 2 given best quality text with me scanning, nor much
improvement in accuracy. I include this info because my second test
turned out quite differently.

2. I chose a sample of material we had started scanning on our model
2 and which we pulled off because the 2 was too slow at scanning it.
This sample on the 4000 scanned at a significantly faster rate. Had
we had the 4000 earlier, the material in question would have been
quite acceptable to scan.

These first two benchmarks have led me to the following opinion
about the merits of optical text scanning. Advances in ICR
(Intelligent Character Recognition) show up mainly in the ability
to push the margin of readable material farther towards
unintelligible characters, not in the speed with which clean text
can be read.

3. I brought an example of text with illustrations. I wanted to
test the 4000's ability to handle complex formatting (for example,
columns) using its tablet. I blew it on this one. The tablet
seemed much more useful in the test than it has turned out to be.

That's my benchmark. Not very complex, but perhaps a good starting
point for discussion of one. It would be more useful to include
more of a continuum of quality. I actually had a wider range of
quality available to me, but saw no need to try middle-quality
material once my worst text was read more than adequately.
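
To make the idea of a benchmark a little more concrete, here is a
rough sketch in Python of how one sample might be scored. It assumes
something I did not have in 1987: a hand-corrected reference
transcription of each sample to compare the scanner's output against.
The function names (edit_distance, score_sample) and the choice of
Levenshtein distance as the error count are illustrative assumptions,
not a description of how the Kurzweil tests were actually measured.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def score_sample(reference, scanned, seconds):
    """Score one benchmark sample (hypothetical helper).

    reference: hand-corrected text of the sample (assumed to exist)
    scanned:   text as produced by the scanner
    seconds:   wall-clock time the scan took
    """
    if not reference or seconds <= 0:
        raise ValueError("need a non-empty reference and a positive time")
    errors = edit_distance(reference, scanned)
    return {
        "chars_per_minute": len(scanned) * 60.0 / seconds,
        "error_rate": errors / len(reference),
    }
```

Running score_sample on the same packet for each machine gives two
numbers per sample (throughput and error rate), which is enough to
compare machines across a range of text quality.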

Last, in this very long message, are my experiences with
trainability and handling of non-English character sets. I know that
Bob Kraft at the Center for Computer Analysis of Texts has
successfully gotten his Kurzweil model III to read Arabic and Russian.
He is on this list, and so might better explain how he's done his
work. His office and my project have both stumbled upon ways to
handle various non-English character sets. For accents, the Dante
Project has developed a simple set of escape codes which we can teach
the Kurzweil to put in correctly. These include, for example,
representing an "a" with a grave accent as "@a". A good scanner, when it
sees the accented "a", should be able to substitute the "@a". I believe
Bob Kraft read Arabic by letting the Kurzweil decide what characters it
saw, and then made sure that the scanner was consistent in its choices,
but again, check with him.
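
For anyone who wants to turn such escape codes back into accented
characters after scanning, a minimal sketch in Python follows. Only the
"@a" code comes from our actual scheme as described above; the other
table entries and the function names are assumptions made up for
illustration.

```python
# Hypothetical escape table. Only "@a" is taken from the message;
# the remaining entries are assumed by analogy.
ESCAPES = {
    "@a": "\u00e0",  # a with grave accent
    "@e": "\u00e8",  # e with grave accent (assumed)
    "@i": "\u00ec",  # i with grave accent (assumed)
    "@o": "\u00f2",  # o with grave accent (assumed)
    "@u": "\u00f9",  # u with grave accent (assumed)
}


def decode_escapes(text, table=ESCAPES):
    """Replace each two-character escape code with its accented letter.
    Sequences not in the table are passed through unchanged."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def encode_escapes(text, table=ESCAPES):
    """Inverse mapping: turn accented letters back into escape codes."""
    reverse = {letter: code for code, letter in table.items()}
    return "".join(reverse.get(ch, ch) for ch in text)
```

With this table, decode_escapes("citt@a") gives "città", and
encode_escapes reverses it. A real table would need a code for every
accent in the material, and an escape character that never occurs in
the text itself.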

Finally, I should probably note that I do not like the concept of
unattended scanning. It might be acceptable for scanning a very
recently printed original, but is not very useful otherwise. A
good scanner operator can correct a scanner at the rate at which the
machine reads, i.e. the computer cannot scan ahead of the operator.

Jonathan Altman