4.0192 Text: Lemmatizing; Browsing; TACT (3/68)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Fri, 15 Jun 90 17:25:26 EDT

Humanist Discussion Group, Vol. 4, No. 0192. Friday, 15 Jun 1990.


(1) Date: Fri, 15 Jun 90 12:51:57 PDT (28 lines)
From: cbf@faulhaber.Berkeley.Edu (Charles Faulhaber)
Subject: Lemmatization

(2) Date: Thursday, 14 June 1990 2319-EST (26 lines)
From: KRAFT@PENNDRLS
Subject: Search/Browse Program for HP 110 Laptop

(3) Date: Friday, 15 June 1990 7:48am CST (9 lines)
From: EIEB360@UTXVM.BITNET
Subject: 4.0187 Text Analysis (1/30)

(1) --------------------------------------------------------------------
Date: Fri, 15 Jun 90 12:51:57 PDT
From: cbf@faulhaber.Berkeley.Edu (Charles Faulhaber)
Subject: Lemmatization

Subject: Re: 4.0171 Lemmatization (2/99)

My thanks to Robin Cover and Timothy Reuter. Both of you may be right.
The specific problem I'm trying to solve is doing normalized searches on
unnormalized text.

In most medieval languages spelling variations are so great, especially
if more than one dialect is concerned, that trying to recover all tokens
of a given type through Soundex procedures, Unix-style wildcards, or
other algorithms is unlikely to be successful (e.g., in Spanish 'to talk' can
be fablar, hablar, ablar, favlar, avlar, havlar; the past participle of
'to do' can be echo, fecho, hecho, feito, feyto, fejto). In the latter
instance, the echo, hecho forms can also belong to the verb echar 'to
throw' (and these are trivial examples off the top of my head).

Thus the need to lemmatize the corpus. Reuter's approach sounds useful,
but there would need to be a good deal of manual intervention in any
case; and, again, it would require a morphological analyzer for each
language used (I think; I don't know enough about computational
linguistics to have an informed opinion).
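
A minimal sketch, in Python, of what a normalized search over
unnormalized text might look like if the lemmatization is stored as a
hand-built lookup table. The spellings are the ones cited above; the
table layout, the function names, and the choice of modern "hablar",
"hacer" and "echar" as lemma labels are assumptions made for
illustration, not a description of any existing program.

```python
from collections import defaultdict

# Hypothetical hand-built table: attested spelling -> candidate lemmas.
# Ambiguous spellings (echo, hecho) list every lemma they may belong to.
VARIANTS = {
    "fablar": ["hablar"], "hablar": ["hablar"], "ablar": ["hablar"],
    "favlar": ["hablar"], "avlar": ["hablar"], "havlar": ["hablar"],
    "fecho": ["hacer"], "feito": ["hacer"], "feyto": ["hacer"],
    "fejto": ["hacer"],
    "echo": ["hacer", "echar"], "hecho": ["hacer", "echar"],
}


def build_index(tokens):
    """Map each lemma to the positions of the tokens that may realize it."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        form = tok.lower()
        # Spellings missing from the table are indexed under themselves.
        for lemma in VARIANTS.get(form, [form]):
            index[lemma].append(pos)
    return index


def search(index, lemma):
    """Return the token positions recorded for a lemma (empty if none)."""
    return index.get(lemma, [])


# A bare list of forms, not a real passage.
tokens = ["fablar", "hecho", "avlar", "feyto", "echo"]
index = build_index(tokens)
hablar_hits = search(index, "hablar")
hacer_hits = search(index, "hacer")
echar_hits = search(index, "echar")
```

Ambiguous forms are returned under both lemmas, so a search for
"echar" also retrieves tokens that actually belong to 'to do'; sorting
those out is where the manual intervention would come in.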

Charles Faulhaber
UC Berkeley

(2) --------------------------------------------------------------38----
Date: Thursday, 14 June 1990 2319-EST
From: KRAFT@PENNDRLS
Subject: Search/Browse Program for HP 110 Laptop

I have resuscitated a few of the old sturdy HP 110 Laptops that were
given us to play with some years back (getting and installing new
battery packs is cheap and easy), and want to use them occasionally to
show off flat file data (e.g. genealogical). There is no search/browse
feature in the built-in software, so I tried to load a recent version of
LIST, but was told there was insufficient memory. Programmers here have
suggested that they could write a quick solution, but before I agree to
that, is there any light from the HUMANIST world?

CHKDSK tells me that the machine has 71K bytes free of internal memory
(and 176K bytes of available disk space). I want to be able to search
for at least simple matches (multiples, etc., would be nice but not
essential) and have them display in context on the 16-line, 80-column
screen. The LIST model is adequate for my purposes -- LIST allows a
simple search, and puts the hit line about 5 lines below the top of the
screen which is filled with the complete context of the hit. Note that
the limited screen depth (16 lines) precludes direct use of any program
that assumes a 24-line screen and needs to write to that entire space.
Possibly an early version of LIST would work, but are there any other
suggestions?
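
The behaviour described above (a simple match, with the hit line shown
about 5 lines below the top of a 16-line screen) can be sketched as
follows. This is Python and so illustrates only the logic; it is not
something that would run in 71K on the HP 110, and the function names
and the case-insensitive matching are assumptions.

```python
SCREEN_ROWS = 16  # screen depth given in the message
SCREEN_COLS = 80  # screen width given in the message
HIT_OFFSET = 5    # the hit line sits about 5 lines below the top


def find_hits(lines, needle):
    """Return the indices of lines containing needle (case-insensitive)."""
    needle = needle.lower()
    return [i for i, line in enumerate(lines) if needle in line.lower()]


def page_for_hit(lines, hit, rows=SCREEN_ROWS, offset=HIT_OFFSET,
                 cols=SCREEN_COLS):
    """Return the screenful of lines that shows a hit in context."""
    # Near the start of the file there are fewer than `offset` lines of
    # leading context, so the window is clamped to the first line.
    top = max(0, hit - offset)
    # Truncate rather than wrap, so one record never takes two rows.
    return [line[:cols] for line in lines[top:top + rows]]


# Made-up flat-file data for the example.
lines = ["record %d" % i for i in range(40)]
hits = find_hits(lines, "record 20")
page = page_for_hit(lines, hits[0])
```

Since the window never exceeds 16 rows, nothing in this scheme assumes
a 24-line display.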

Thanks for any help offered! Bob Kraft (KRAFT@PENNDRLS.bitnet)

(3) --------------------------------------------------------------15----
Date: Friday, 15 June 1990 7:48am CST
From: EIEB360@UTXVM.BITNET
Subject: 4.0187 Text Analysis (1/30)

Thanks to David Miall for his useful account of TACT and the uses of the
Z-score. I know it's come up here a number of times, but would someone
be kind enough to repeat information on how to obtain a copy of the
program? Thanks. John Slatin, University of Texas at Austin,
EIEB360@UTXVM.