7.0030 Lemmatization (X-Posted from Linguist) (1/343)

Wed, 26 May 1993 15:01:23 EDT

Humanist Discussion Group, Vol. 7, No. 0030. Wednesday, 26 May 1993.

Date: Wed, 26 May 1993 09:43:29
From: koontz@alpha.bldr.nist.gov (John E. Koontz)
Subject: Lemmatization

The following, from the Linguist list, seems to me to be on a subject that
has recurred on Humanist in the past. Note that I have nothing to do with
assembling this summary. JEK

LINGUIST List: Vol-4-401. Mon 24 May 1993. ISSN: 1068-4875. Lines: 339

Subject: 4.401 Sum: Language Stemmers

Anthony Rodrigues Aristar: Texas A&M U. <aristar@tamuts.tamu.edu>
Helen Dry: Eastern Michigan U. <hdry@emunix.emich.edu>
Asst. Editor: Ron Reck <rreck@emunix.emich.edu>

Date: Fri, 21 May 93 12:21:52 -0400
From: Tom Donaldson <tomd@pls.com>
Subject: German language "stemmers"

Back in April I submitted requests for information on language
stemmers (sub-area of "morphological analysis" involving generation of
roots and inflected forms) to two mailing lists:


Thanks to Patty Schmidt, a linguist at Logos USA, for directing
me to the LINGUIST mailing list, and to Lifen Chen and Arlene
Puryear (more Georgetown linguists) for directing me to Patty.

I believe that Patty's employer, Logos USA, develops machine
translation software.


Mailing list primarily for those concerned with
internationalization of software.

Thanks to all who responded!

Below is a summary of responses and other results.

>From Thomas Everth of Circle Noetic Services

Internet address: EVERTH@AppleLink.Apple.COM

I am directing the marketing of CNS (Circle Noetic Services). We are a
linguistic software company in New Hampshire (USA).

CNS has developed a product called "WordFan" that is exactly aimed at the
market that you describe. WordFan will produce all conjugated forms of a
word, or derive the base form from any of its conjugated forms.
The first version of WordFan will be released in mid-'93 for English, with
other languages to follow. German is high up on the list indeed. Japanese is
currently not under development, but: Russian, Arabic and most Western

CNS was founded in 1987 and has since then provided hyphenation algorithms
(now in 29 languages) and spelling checking (now in 13 languages, incl. Arabic)
to the computing industry. Our products have been licensed by major national
and international vendors for typesetting, word processing and DTP applications.

CNS also offers or has under development:

IOW (In Other Words), a large lexical database with over a million concepts
relations based on over 100,000 English words.

Linguistic tools for OCR and hand writing recognition.

Wordlists in many languages: conjugated, with rule markings and conjugation
rules, morphological breakdown, morphological cross references,.... and many

The WordFan product will in the future also include many other relations
besides "conjugation" like: synonym, antonym, homolog ... etc.
Hammer -> (is a) tool -> [other tools] for example.
WordFan will also come (optionally) with algorithms to split Germanic
words like: "Bundestagsverwaltungshauptapparat" or other such tongue
monsters. (I am German by the way.) Splitting Germanic compound words should
be a must for text retrieval software in these languages.

Our technology is being developed by former linguists and programmers from

Please call us at (603) 672-6151 or fax: (603) 672-8025 for further
information. Via internet use: D1634@AppleLink.Apple.COM or my personal id:

>From Krister Linden of Lingsoft Inc.

Internet address: klinden@ling.Helsinki.FI

Lingsoft is a small software company in Finland. We specialize in
morphology and morphosyntactic analysis. Our methods are based on the
Kimmo Koskenniemi two-level model. We also sell products based on the
FiniteState-syntactic model presented last week at the EACL. That
paper received the Don Walker Award for best paper. We have:

0. spell-checking and hyphenation
1. morphological analysis and generation
2. stemming for information retrieval
3. part-of-speech tagging
( >99% correct, <5% ambiguity)
4. NP extraction for text indexing and retrieval
( >98% recall, >95% precision)
5. surface syntactic analysis
6. grammar checker

English 1,2,3,4,5
German 1 (end of May), 0,2 (end of summer)
Swedish 0,1,2,3
Russian 0,1,2
Finnish 0,1,2,3,6
Danish 1, 0,2,3 (end of year)
Swahili 1,2

All the lexicons have between 40.000 and 80.000 roots. The programs
are programmed in C and have been ported to various platforms. The
speed of all the tools is between 600 and 1000 words/s on a Sparcstation 2.

In the near future we will have tools for French, Estonian, Italian and
Norwegian as well.

Krister Linden
Lingsoft Inc.

tomd: From sales literature sent via hardcopy mail, I learned that
Prof Kimmo Koskenniemi is one of the "founders and principal owners of
Lingsoft."  He developed a "two-level model" of morphological analysis
that seems to be popular as the basis of software for morphological

>From Richard Sproat of AT&T's Linguistics Research Department

Internet address: rws@research.att.com

Probably the best and most general available commercial software for
doing this kind of thing is PC-KIMMO, which you can actually get for
free by anonymous FTP. I enclose some info (dated January 92 -- I
assume it still holds) on that below. There is also a book to go with
that by Evan Antworth, which you can get from the Summer Institute of
Linguistics (address below).

For more general discussion of various methods for doing computational
morphology, you can also consult two recent MIT Press Books:

1. Computational Morphology: Practical mechanisms for the English
lexicon. By Graeme D. Ritchie, Graham J. Russell, Alan W. Black and
Stephen G. Pulman. ACL-MIT Press Series in Natural Language
Processing. Cambridge, Massachusetts: MIT Press, 1992

2. And my own 1992 book in the same series, Morphology and

Mine covers a wider variety of stuff than does the Ritchie et al. book.

Richard Sproat
Linguistics Research Department
AT&T Bell Laboratories                  | tel (908) 582-5296
600 Mountain Avenue, Room 2d-451        | fax (908) 582-7308
Murray Hill, NJ 07974, USA              | rws@research.att.com

TomD: Richard also enclosed a lengthy "news" item on PC-KIMMO from
Evan Antworth.  It seemed a bit too long to include here, but see the
next item *from* Evan Antworth.

>From Evan Antworth of Academic Computing Department, (institution???)

Internet address: evan.antworth@sil.org

Here is some information on PC-KIMMO, a program for morphological parsing.
It has been reviewed in _Computational Linguistics_ 17:2, June 1991 and
also in _Computers and the Humanities_ 26:2, April 1992. We provide the
C source code with the intention that it be used in programs developed
by the user. Of course, I cannot say whether or not it could successfully
be used in your application. Let me know if I can help you further.

Evan Antworth

PC-KIMMO: A Two-level Processor for Morphological Analysis

PC-KIMMO is a new implementation for microcomputers of a program
dubbed KIMMO after its inventor Kimmo Koskenniemi (see
Koskenniemi 1983). It is of interest to computational linguists,
descriptive linguists, and those developing natural language
processing systems. The program is designed to generate (produce)
and/or recognize (parse) words using a two-level model of word
structure in which a word is represented as a correspondence
between its lexical level form and its surface level form.

Work on PC-KIMMO began in 1985, following the specifications of
the LISP implementation of Koskenniemi's model described in
Karttunen 1983. The coding has been done in Microsoft C by David
Smith and Stephen McConnel under the direction of Gary Simons and
under the auspices of the Summer Institute of Linguistics. The
aim was to develop a version of the two-level processor that
would run on an IBM PC compatible computer and that would include
an environment for testing and debugging a linguistic
description. The PC-KIMMO program is actually a shell program
that serves as an interactive user interface to the primitive
PC-KIMMO functions. These functions are available as a C-language
source code library that can be included in a program written by
the user.
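
As a rough illustration of the two-level idea described above (a word as a
correspondence between a lexical form and a surface form), the sketch below
stores a word as a list of lexical:surface symbol pairs and reads each level
back off. It is not PC-KIMMO's library interface. The names, the use of "+"
for a morpheme boundary and "0" for a null symbol, and the example words are
all assumptions made for this sketch.

```python
# Minimal sketch of a two-level correspondence; illustrative only.
NULL = "0"  # assumed symbol for "nothing realized on this level"

# Each pair is (lexical symbol, surface symbol).
# Lexical "spy+s" corresponds to surface "spies": y is realized as i,
# and the morpheme boundary "+" is realized as e.
SPIES = [("s", "s"), ("p", "p"), ("y", "i"), ("+", "e"), ("s", "s")]

# Lexical "walk+ing" corresponds to surface "walking": the boundary
# is realized as nothing.
WALKING = [("w", "w"), ("a", "a"), ("l", "l"), ("k", "k"),
           ("+", NULL), ("i", "i"), ("n", "n"), ("g", "g")]


def lexical_form(pairs):
    """Read the lexical level off a list of (lexical, surface) pairs."""
    return "".join(lex for lex, _ in pairs if lex != NULL)


def surface_form(pairs):
    """Read the surface level off a list of (lexical, surface) pairs."""
    return "".join(surf for _, surf in pairs if surf != NULL)


if __name__ == "__main__":
    for pairs in (SPIES, WALKING):
        print(lexical_form(pairs), "<->", surface_form(pairs))
```

A real two-level processor would add rules that decide which pairs are
allowed in which contexts, plus a lexicon; the sketch only shows the data
being related.
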
[tomd: much text deleted]

PC-KIMMO is a research project in progress, not a finished
commercial product. In this spirit, we invite your response to
the software and the book. Please direct your comments to:
    Academic Computing Department
    PC-KIMMO project
    7500 W. Camp Wisdom Road
    Dallas, TX 75236
    phone: 214/709-3346, -2418
    email: evan.antworth@sil.org (Evan Antworth)

Antworth, Evan L. 1990. PC-KIMMO: a two-level processor for
    morphological analysis. Occasional Publications in Academic
    Computing No. 16. Dallas, TX: Summer Institute of Linguistics.
    ISBN 0-88312-639-7, 273 pages, paperbound.
Karttunen, Lauri. 1983. KIMMO: a general morphological processor.
    Texas Linguistic Forum 22:163-186.
Koskenniemi, Kimmo. 1983. Two-level morphology: a general
    computational model for word-form recognition and production.
    Publication No. 11. University of Helsinki: Department of
    General Linguistics.

>From Ian Hersey of IBM

Internet Address: hersey@vnet.IBM.COM

We do have a system that both lemmatizes ("stems") and generates all
inflected forms, and it is available for about 19 European languages.  We
also do lemmatization for Japanese.  The code is language-independent:
you just plug in the dictionary you need and go from there.  This same
service also performs hyphenation (not for Japanese -- it isn't ever
hyphenated) and spell-checking.

This system is available for Windows, OS/2, AIX, VM and MVS.

I should mention that our morphological processing only handles
inflectional morphology:  "compute" can generate "computes", "computed"
and "computing" (all forms of the verb "to compute"), but it will not
generate "computer".  The "-er" and other affixes that change the part
of speech are known as derivational morphology, and our service doesn't
handle that area (yet).
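
The inflectional/derivational distinction above can be sketched in a few
lines. This is a toy illustration, not the IBM service: the function names
and the two-verb lexicon are hypothetical, and the single spelling rule
(drop a final "e" before "-ed" and "-ing") covers only regular verbs like
the "compute" example.

```python
# Toy sketch of inflection-only generation and lemmatization.
# All names and the lexicon are assumptions made for illustration.

def inflect_verb(lemma):
    """Generate the regular inflected forms of an English verb."""
    # Drop a final "e" before vowel-initial endings: compute -> comput-ing.
    stem = lemma[:-1] if lemma.endswith("e") else lemma
    return {lemma, lemma + "s", stem + "ed", stem + "ing"}


def build_index(lemmas):
    """Map every generated form back to the lemma(s) that produce it."""
    index = {}
    for lemma in lemmas:
        for form in inflect_verb(lemma):
            index.setdefault(form, set()).add(lemma)
    return index


INDEX = build_index(["compute", "walk"])  # hypothetical two-verb lexicon


def lemmatize(form):
    """Return the lemmas for an inflected form, or [] if it is unknown."""
    return sorted(INDEX.get(form, ()))


if __name__ == "__main__":
    print(sorted(inflect_verb("compute")))
    print(lemmatize("computing"))  # found: an inflected form of "compute"
    print(lemmatize("computer"))   # not found: "-er" is derivational
```

Because only inflected forms are generated, "computer" is never produced and
never maps back to "compute", which matches the limitation described above.
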
I'm not the one to give pricing information.  Please contact Brian Gessel
at 301-803-2943 for that; he's our business person.  He can also provide
you with an OEM fact sheet that lists all of the languages and sizes.

>From Daniel Stieger of Institut fuer Informationssysteme

Internet Address: stieger@inf.ethz.ch

[tomd: Dani Stieger is responding to a query regarding a German
language stemmer based on the "Porter algorithm."]

As I mentioned to your colleague there is no serious report about our
experiments. I am in possession of a "Semester Work" (a short report
performed by a student) about this subject. It is NOT available in
machine readable form
[tomd: text deleted]
AND ... it is written in GERMAN. The report also contains a listing of
the German Porter algorithm (written in MODULA-2 !!). Furthermore,
you need the decomposition of German words so that you are really
stemming the right (ending) part of the word (as you know, German
words may be composed of several words). For the decomposition I used
an automatically generated dictionary (215'000 German words).
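
The point about decomposing a compound first, so that only its final part is
stemmed, might look roughly like the sketch below. It is not the Modula-2
program from the report, and the suffix list is not Porter's rule set or its
German adaptation. The tiny lexicon, the linking element "s", the suffixes,
and all function names are assumptions chosen for illustration.

```python
# Toy sketch: split a German compound with a word list, then strip a
# suffix from the last component only. Illustrative assumptions throughout.

# Hypothetical stand-in for a large dictionary of full word forms.
LEXICON = {"haus", "tür", "türen", "bundestag", "verwaltung", "verwaltungen"}
LINKERS = ("", "s")               # optional linking element between parts
SUFFIXES = ("en", "e", "n", "s")  # illustrative only, not Porter's rules


def split_compound(word, lexicon=LEXICON):
    """Return a list of lexicon words that spell `word`, or None."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest head first.
    for i in range(len(word) - 1, 0, -1):
        head, rest = word[:i], word[i:]
        if head not in lexicon:
            continue
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = split_compound(rest[len(linker):], lexicon)
            if tail is not None:
                return [head] + tail
    return None


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, keeping at least `min_stem` letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem_compound(word):
    """Split a compound and stem only its final (ending) component."""
    parts = split_compound(word) or [word.lower()]
    return parts[:-1] + [strip_suffix(parts[-1])]


if __name__ == "__main__":
    print(stem_compound("Haustüren"))
    print(stem_compound("Bundestagsverwaltungen"))
```

Stemming the unsplit string would only ever see the end of the whole word;
splitting first makes each component, and the stem of the last one,
available for indexing.
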
[tomd: text deleted]

>You mention "Porter (1983)."  Can you send me the full citation?  Is
>there some way we can get the source of your experiments with the

M.F. Porter: An Algorithm for Suffix Stripping. Program, Vol. 14, No. 3,
1980, pp. 130-137.

[tomd: text deleted]

Daniel Stieger                                       stieger@inf.ethz.ch
Institut fuer Informationssysteme
ETH Zentrum, IFW E43.2                               Tel: +41-1-254-7226
CH - 8092 Zuerich                                    Fax: +41-1-262-3973

Thanks again for all your help,

  # Tom Donaldson                2400 Research Blvd., Suite 350      #
  # Senior Software Developer    Rockville, MD 20850                 #
  # Personal Library Software    (301) 990-1155, FAX: (301) 963-9738 #
  #                 e-mail: tomd@pls.com                             #