16.016 OCRing handwriting

From: Humanist Discussion Group (by way of Willard McCarty (w.mccarty@btinternet.com))
Date: Fri May 10 2002 - 01:38:10 EDT

                    Humanist Discussion Group, Vol. 16, No. 16.
           Centre for Computing in the Humanities, King's College London
                   <http://www.princeton.edu/~mccarty/humanist/>
                  <http://www.kcl.ac.uk/humanities/cch/humanist/>

             Date: Fri, 10 May 2002 06:20:16 +0100
             From: "Jim Marchand" <marchand@ux1.cso.uiuc.edu>
             Subject: OCRing Handwriting

    Very few problems are really insurmountable, but OCR of handwriting
    comes close. It is not likely that we will see it any time soon,
    particularly for medieval manuscripts. You need to look upon each
    hand as a font-type, containing several fonts. For example, I
    trained my OCR software to read the font of SUGNL, which contains
    the largest collection of Old Norse literature, because I realized
    that I need Old Norse, but its collection is larger than that of
    any medieval scribe. If you remember the problems we had, still
    not all solved, with the various fonts in which a modern book is
    printed, you can see the problem more clearly. Although
    schoolmasters have tried hard in the past to get all their students
    to write the same way, they have only rarely been even close to
    getting it done (one might cite the Carolingian minuscule).
    Remember all those people who claim not to be able to read their
    own notes (J. W. Marchand, for example). Of course, we have to
    distinguish between `print' and `cursive' (we learn to print
    up to about mid-fourth grade, then cursive). This points out the
    difficulty of the major move in OCR, pattern recognition. We can
    all remember (and still suffer from) the advent of transitional
    probabilities and guesses into OCR, and how much it helped out.
    Who has not had to remember to turn off `recognition' (of English)
    when scanning German? Transitional probabilities and lexicon check
    are mainly there for English, though other languages use them, too.
    For a Carolingian manuscript, to look at Bill Schipper's problem,
    pattern recognition is difficult if not impossible; think how many
    scholarly arguments we have over the reading of a letter or two.
    Transitional probabilities are not available for Latin, although
    God only knows why not. We have only very few Latin lexica
    available in electronic form.
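
    To make the two aids concrete, here is a minimal sketch of how
    transitional probabilities and a lexicon check might be combined
    to choose among competing readings of one word. It is not how
    any particular OCR product works. The tiny Latin word list, the
    function names, the add-one smoothing and the alphabet size of
    28 are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigrams(words):
    """Count letter-to-letter transitions; ^ and $ mark word boundaries."""
    pair_counts = Counter()
    first_counts = Counter()
    for word in words:
        padded = "^" + word.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts


def log_prob(word, pair_counts, first_counts, alphabet_size=28):
    """Add-one smoothed log probability of a reading under the bigram model.

    alphabet_size is an assumed constant (26 letters plus two boundary marks).
    """
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log(
            (pair_counts[(a, b)] + 1) / (first_counts[a] + alphabet_size)
        )
    return total


def best_reading(candidates, lexicon, pair_counts, first_counts):
    """Prefer readings found in the lexicon; otherwise use the transition score."""
    return max(
        candidates,
        key=lambda c: (c.lower() in lexicon,
                       log_prob(c, pair_counts, first_counts)),
    )


# Hypothetical toy lexicon; a real one would need thousands of forms.
LEXICON = {"dominus", "domine", "deus", "dei", "dicit", "dixit",
           "nomen", "omnia", "homines", "minus"}
PAIRS, FIRSTS = train_bigrams(LEXICON)

# Three ways the recognizer might resolve an ambiguous run of minims.
CANDIDATES = ["dommus", "doniinus", "dominus"]
CHOICE = best_reading(CANDIDATES, LEXICON, PAIRS, FIRSTS)
```

    With this toy lexicon the reading found in the word list is
    chosen. The sketch also shows why the lack of Latin lexica
    matters: with no word list and no counts to train on, every
    candidate reading scores the same.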

    We might be able to train an OCR program like the old Kurzweil to
    read the hand of a single scribe (though, as Wilhelm Braun pointed
    out, "wer schreibt an allen Tagen gleich?" [who writes the same
    way every day?]), but a quoi bon? Some
    hands are very uniform; Ihre found the Codex Argenteus's Gothic
    to be so uniform that he thought Wulfila had invented (4th C. AD)
    movable type, but even there it is easy to see places where there
    is little uniformity, and modern authorities have seen two `hands'.

    Of course, there is always the possibility of teaching us to write
    more uniformly and with recognizable distinctive features, as in
    the case of a hand-held, but that does not help those of us who
    crave an OCR program for those medieval (ancient, foreign, etc.)
    manuscripts. Unfortunately, such a program does not seem likely.


