3.432 morphological analysis for Hebrew (112)

Tue, 5 Sep 89 21:17:35 EDT

Humanist Discussion Group, Vol. 3, No. 432. Tuesday, 5 Sep 1989.

Date: Thu, 31 Aug 89 10:03:15 -0400
From: choueka@thunder.bellcore.com (Yaacov Choueka)
Subject: Morphological analysis for Hebrew

Following is an abstract of a talk to be given soon at a meeting
on Computational Linguistics in Haifa, Israel.
I would be grateful for any information on the questions
raised at the end of this abstract.
Yaacov Choueka, Bar-Ilan University, Ramat-Gan, Israel.
Now visiting Bellcore, NJ, till 09/14, choueka@thunder.bellcore.com

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


A complete and accurate morphological
analyzer for modern Hebrew for a PC environment

Yaacov Choueka (1,2) Yoni Neeman (2)

1) Bar-Ilan University, Ramat-Gan, Israel.
2) Center for Educational Technology, Ramat Aviv, Tel-Aviv.

As a typical Semitic language, Hebrew has a rather complex
morphology. A verb can be conjugated in several modes, tenses,
persons and genders; accusative pronouns can be suffixed and
combinations of prepositions can be prefixed to the conjugated form,
bringing the total number of morphological variants of one verb to a
few thousand. Similar considerations apply also to nominal
forms. No adequate natural language processing system (such as a
spelling checker, a full-text retrieval system, mechanical
translation software, etc.) can therefore be developed for
Hebrew without a morphological analyzer operating in the background.

"MILIM" is a portable morphological analyzer for modern
Hebrew developed for the PC environment. It accepts as input any
string of characters and produces as output a complete and
linguistically accurate analysis of that string, giving the
lemma (=basic form, standard dictionary entry), the root and all
relevant morphological attributes, such as (for verbs): mode, tense,
person and gender, attached pronouns and prepositions, etc. If the
given word has several possible analyses, it will list them all.
Based on a carefully coded dictionary and a computerized version of
the Hebrew morphology, MILIM will correctly recognize and
analyze any linguistically legitimate entity, including "exceptions"
and "irregular" cases.
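
As an illustration only, the input/output behavior described above can be
sketched in Python. The `Analysis` record, the `TOY_LEXICON` table and the
transliterated input "spr" are assumptions made for this sketch; they are
not MILIM's actual data structures or output format, and a real analyzer
would derive its analyses from a coded dictionary and morphological rules
rather than from a hand-written table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One possible reading of an unpointed input string (hypothetical record)."""
    lemma: str               # basic form / standard dictionary entry
    root: str                # consonantal root
    pos: str                 # part of speech
    gloss: str               # English gloss, for the reader only
    attributes: tuple = ()   # mode, tense, person, gender, ...


# Toy table keyed by a transliterated, unpointed string (an assumption of
# this sketch). The unpointed string s-p-r is ambiguous between several words.
TOY_LEXICON = {
    "spr": [
        Analysis("sefer", "s-p-r", "noun", "book", ("masculine", "singular")),
        Analysis("sapar", "s-p-r", "noun", "barber", ("masculine", "singular")),
        Analysis("safar", "s-p-r", "verb", "counted",
                 ("past", "3rd person", "masculine", "singular")),
    ],
}


def analyze(word):
    """Return all analyses listed for `word`; an empty list if none are known."""
    return list(TOY_LEXICON.get(word, []))


if __name__ == "__main__":
    for n, a in enumerate(analyze("spr"), start=1):
        print(n, a.lemma, a.root, a.pos, a.gloss, ", ".join(a.attributes))
```

The point of the sketch is only that one input string maps to a list of
complete analyses, not to a single stem.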

MILIM processes non-pointed Hebrew, and can recognize both
the grammatical spelling ("ktiv hasser") and the "plene" one
("ktiv male"). It also recognizes common non-linguistic textual
entities such as abbreviations, acronyms, proper names of places
and people, etc. Its response time is immediate, and it requires
less than 2 MB of internal and disk memory.

A VAX/VMS version is also available.

Is there such a package for English that can be attached
to any natural language processing system running on a
PC or a VAX?
I am not interested in suffix-stripping routines, stemming
algorithms, approximate solutions, and the like. I am asking
about the availability of a package that can be called
from some specified operating system environment (much as "spell" is
used in Unix), and given a string, will output its
linguistically correct analyses, and especially a pointer to
its dictionary entry (so that all of the information attached
to this entry in any computerized dictionary, including
word senses, quotations, collocations, etc., can then be made
available), as in the following examples:
    saw---  1. past of (to) see, transitive verb,...
            2. noun, singular, ...
            3. tr. verb ...
            4. in. verb...
    saws--  1. plural of saw, noun,...

Obviously such a tool will be closely tied to a given
dictionary and will be no more comprehensive or "correct"
than its dictionary base, but that's OK.
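
To make the request concrete, here is a minimal Python sketch of the kind
of interface meant: a form table whose analyses each carry a pointer into a
dictionary. The entry ids (`see.v`, `saw.n`, ...), the `DICTIONARY` and
`FORMS` tables and the function names are all hypothetical; the tables
simply mirror the "saw"/"saws" examples given above and are not exhaustive.

```python
# Hypothetical dictionary: entry id -> information attached to that entry.
DICTIONARY = {
    "see.v":  {"headword": "see", "pos": "verb"},
    "saw.n":  {"headword": "saw", "pos": "noun"},
    "saw.vt": {"headword": "saw", "pos": "transitive verb"},
    "saw.vi": {"headword": "saw", "pos": "intransitive verb"},
}

# Hypothetical form table: surface string -> (entry id, morphological features).
FORMS = {
    "saw": [
        ("see.v", "past"),
        ("saw.n", "singular"),
        ("saw.vt", "base form"),
        ("saw.vi", "base form"),
    ],
    "saws": [
        ("saw.n", "plural"),
    ],
}


def analyze(word):
    """Return every listed analysis of `word`, each pointing at its dictionary entry."""
    return [
        {"entry_id": entry_id, "features": features, "entry": DICTIONARY[entry_id]}
        for entry_id, features in FORMS.get(word, [])
    ]


def report(word):
    """Format the analyses of `word` as numbered lines of text."""
    lines = []
    for n, a in enumerate(analyze(word), start=1):
        entry = a["entry"]
        lines.append(
            f"{word} {n}. {a['features']} of {entry['headword']}, "
            f"{entry['pos']} [{a['entry_id']}]"
        )
    return lines


if __name__ == "__main__":
    for w in ("saw", "saws"):
        print("\n".join(report(w)))
```

Because each analysis returns an entry id rather than a stripped stem,
whatever the dictionary stores under that id can be retrieved directly,
and the analyzer is exactly as complete as its tables.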

A welcome bonus would be some marking of the dictionary
entries that enables grouping them into
morphologically and semantically related "families" or
"roots". For example, the following distinct dictionary entries
would be labeled as belonging to the same "family":
computer, computation, computational, (to) compute,
(to) computerize, etc. Note that this notion is
not related in any way to synonymy: "calculation" would not
be in the family just mentioned.
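
One simple way such a marking could be represented, sketched in Python
under the assumption that each dictionary entry carries a single family
label. The labels "COMPUTE" and "CALCULATE" and the `FAMILY_OF` table are
invented for this illustration.

```python
# Hypothetical family labels attached to dictionary entries.
FAMILY_OF = {
    "computer": "COMPUTE",
    "computation": "COMPUTE",
    "computational": "COMPUTE",
    "compute": "COMPUTE",
    "computerize": "COMPUTE",
    "calculation": "CALCULATE",  # near in meaning, but a different family
}


def family_members(entry):
    """Return, in alphabetical order, all entries sharing `entry`'s family label."""
    label = FAMILY_OF.get(entry)
    if label is None:
        return []
    return sorted(e for e, f in FAMILY_OF.items() if f == label)


if __name__ == "__main__":
    print(family_members("computer"))
```

The grouping follows the stored label only, so a word related in meaning
but not in form, such as "calculation", stays outside the family.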

I am also not interested in such products if they
are proprietary or not available for purchase at
a "reasonable" fee, or if they are strongly attached to
one specific application.

Is such a tool available now (or will it be very soon)?