6.0125 Rs: OCRs (3/72)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Fri, 10 Jul 1992 16:26:58 EDT

Humanist Discussion Group, Vol. 6, No. 0125. Friday, 10 Jul 1992.


(1) Date: Thu, 9 Jul 92 18:13:51 CST (28 lines)
From: (James Marchand) <marchand@ux1.cso.uiuc.edu>
Subject: ocr

(2) Date: Fri, 10 Jul 1992 08:55:06 -0400 (EDT) (22 lines)
From: Steve Taylor <ussjt@emoryu1.cc.emory.edu>
Subject: Re: 6.0121 Qs: Indexes; OCRs

(1) --------------------------------------------------------------------
Date: Thu, 9 Jul 92 18:13:51 CST
From: (James Marchand) <marchand@ux1.cso.uiuc.edu>
Subject: ocr

Jean Anderson's question on OCR probably has no easy answer. I think I have
tested every OCR product on the market. For software, you cannot beat
OmniPage Pro (or OmniPro); it is expensive, but it will do the trick. It
runs only under Windows. I have used it to scan in Old English and Old
Icelandic with great results. It is trainable, and you have to train it
to read thorn and edh and the like. It is probably best to train it to
read a particular font, e.g. for Old Icelandic Gudhni Jonsson, SUGL,
Islenzk Fornrit. Happily, many of our texts are available in series. NB:
NO OCR will read manuscripts. The only Cyrillic I have scanned in was on
an old Kurzweil 4000, and it did a marvelous job. It might be hard to
train OmniPage to read Cyrillic. You need a good, but not outstanding,
scanner. The question here is whether to use a page feeder (expensive) or
a flatbed. If you are on a tight budget, the Canon IX-12 and its various
other forms will do well (look in Computer Shopper), but you will have to
photocopy your book (another source of error) to feed it in. The best of the
scanners is the Hewlett Packard ScanJet +. There is even a new one out with
greater speed. It is unfortunate that Kurzweil no longer has a trainable
machine on the market. The only Kurzweil-like machine I know of is the
German-made Optopus; it operates like the old Kurzweil 4000. Back to
OmniPage Pro: make sure to get the Pro version, the only one that is
trainable. I have trained it to read Fraktur, so Cyrillic might not be so
hard. The manual is atrocious, so make sure that you can figure things
out on your own, or you will be lost.

Jim Marchand
(2) --------------------------------------------------------------34----
Date: Fri, 10 Jul 1992 08:55:06 -0400 (EDT)
From: Steve Taylor <ussjt@emoryu1.cc.emory.edu>
Subject: Re: 6.0121 Qs: Indexes; OCRs

Jean Anderson asks about optical character recognition of Cyrillic,
Icelandic, etc. I've been successful at training Olduvai's Read-it!
program to recognize a pristine sample of Cyrillic text, but failed
miserably when trying to scan more typical samples, such as a manuscript
typed on an old typewriter and an article in a poorly printed journal.

The problem with trainable programs is that they learn to associate
characters with bit-map images of characters. If there's a lot of
inconsistency among examples of a character, the results will suffer. And
devoting a couple of hours to training the program on one font contributes
nothing to its ability to recognize another font in that alphabet, or even
another size or weight of the same typeface.
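
The bit-map matching described above can be illustrated with a toy sketch
in Python. It is not Read-it!'s actual algorithm, which is not described
here. The 5x5 glyphs, the TemplateOCR class, and the pixel-difference
(Hamming distance) scoring are all hypothetical choices made for
illustration. The sketch stores one bit-map per trained character and
labels a new image with the closest stored bit-map.

```python
# Toy template-matching recognizer. Everything here is an illustrative
# assumption, not the method of any real OCR product.

def to_bits(rows):
    """Flatten a glyph drawn with '#' (ink) and '.' (paper) into booleans."""
    return tuple(ch == "#" for row in rows for ch in row)


def hamming(a, b):
    """Count the pixels at which two equal-sized bit-maps differ."""
    return sum(x != y for x, y in zip(a, b))


class TemplateOCR:
    def __init__(self):
        self.templates = {}  # label -> trained bit-map

    def train(self, label, rows):
        """Remember one bit-map as the template for this character."""
        self.templates[label] = to_bits(rows)

    def recognize(self, rows):
        """Return (label, distance) for the closest trained template."""
        bits = to_bits(rows)
        scored = ((label, hamming(bits, t)) for label, t in self.templates.items())
        return min(scored, key=lambda pair: pair[1])


# Pristine training samples.
O = [".###.", "#...#", "#...#", "#...#", ".###."]
C = [".###.", "#....", "#....", "#....", ".###."]

# An "o" whose right-hand stroke printed faintly (two pixels missing).
FADED_O = [".###.", "#....", "#....", "#...#", ".###."]

# A "c" in a heavier weight of the same design.
BOLD_C = [".####", "##...", "##...", "##...", ".####"]

ocr = TemplateOCR()
ocr.train("o", O)
ocr.train("c", C)

print(ocr.recognize(O))        # ('o', 0): the pristine sample matches exactly
print(ocr.recognize(FADED_O))  # ('c', 1): the poorly printed "o" is misread
print(ocr.recognize(BOLD_C))   # ('c', 5): right label, but far from the template
```

In this toy, two missing pixels are enough to turn an "o" into a "c",
and the heavier "c" sits five pixels away from the template it was
trained on. A recognizer that rejects poor matches would refuse it, so
the heavier weight would need its own round of training.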

So I'd say that Read-it! can do the job if the manuscript is clean &
consistent and if it's long enough to warrant devoting 1-2 hours to training.

Steve Taylor
Emory University