6.0467 Lexical Text Analysis (1/80)

Mon, 1 Feb 1993 12:49:00 EST

Humanist Discussion Group, Vol. 6, No. 0467. Monday, 1 Feb 1993.

Date: Sat, 30 Jan 93 01:03:07 -0800
From: edwards@cogsci.Berkeley.EDU (Jane Edwards)
Subject: Lexical Texts Analysis

Date: Sat, 23 Jan 93 00:40:54 -0500
From: irl@glas.apc.org
Subject: Lexical Texts Analysis

The following is being posted at the request of Mikhael Maron, a
linguist associated with the Institute of Russian Language in Moscow.
He is interested in contacting others who are interested in his work or
are doing similar work. He has limited email access and shares his
account with many others. Please address any queries about this
research to him at:


with attention to his name in the subject line: "ATTN: M. Maron - IRL"

Forwarded message:
Subject: Lexical Texts Analysis

Large text (ASCII) files, such as whole books (fiction, humanities,
science) in electronic form, are considered.

- A *total list* of all word occurrences in the text is generated.
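The post does not show the routines used; a minimal sketch of the total-list step in Python (assuming a simple regex tokenization, which is my choice, not the author's) might look like:

```python
import re
from collections import Counter

def total_word_list(text):
    """Build a frequency table of all word occurrences in a text."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return Counter(words)

# Example on a small fragment:
counts = total_word_list("To have, or to be? To have is the question.")
# counts["to"] == 3, counts["have"] == 2
```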

- Having some general idea of what key words we are interested in, we
select them from this list, making *partial word lists* (PWLs): names,
words from specific problem areas, etc.; anywhere from several up to
about 200-250 words can be included in one PWL.

- For a given PWL we are to build: (1) a set of all contexts
(paragraphs/lines) in which the PWL member words occurred;

(2) a word index, telling on which book pages these occurrences took place.

The problem is to perform these activities efficiently: the concordance
word-crunchers I know of need service indexes of up to 10-30 times the
volume of the analysed text, which makes the search process impractical
for texts of this size.

- The solution is to introduce markup into the searched text: the words
of interest are supplied with special markers, which is handled by some
context-replacement routines I have developed.
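The author's context-replacement routines are not shown; a minimal sketch of the marking step in Python, with "@@" as an arbitrary marker string of my own choosing, could be:

```python
import re

def mark_up(text, pwl, marker="@@"):
    """Prefix every occurrence of a PWL word with a marker string.
    The marker "@@" is an arbitrary choice here; any string absent
    from the text would do."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, pwl)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: marker + m.group(1), text)

line = "They possess nothing, yet he wants to possess everything."
print(mark_up(line, ["possess", "have"]))
# each "possess" now carries the @@ marker
```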

- Given the marked-up files, we can extract only the marker-containing
lines, which form the needed set of contexts. This extraction can be
done with a routine such as grep, for instance - and with turbo
efficiency, too.
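In Python terms, this extraction step amounts to the following one-liner (the shell equivalent would simply be `grep '@@' book.txt`, assuming the "@@" marker):

```python
def extract_contexts(lines, marker="@@"):
    """Keep only lines containing the marker - what grep would do
    on a marked-up file."""
    return [line for line in lines if marker in line]

lines = ["He owns it.", "To @@have or to be", "Nothing here."]
# only the marked line survives extract_contexts(lines)
```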

- The index is generated from the marked-up files/set of contexts with
the help of a routine I produced.
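The indexing routine itself is not given. A sketch of one plausible approach: since a plain ASCII file carries no page breaks, pages can be approximated as fixed-size blocks of lines (an assumption made here, not stated in the post):

```python
import re
from collections import defaultdict

def build_index(lines, pwl, lines_per_page=40):
    """Map each PWL word to the sorted page numbers on which it occurs.
    Pages are approximated as fixed-size blocks of lines - an assumption
    made because a plain ASCII file carries no page breaks."""
    pwl_lower = {w.lower() for w in pwl}
    index = defaultdict(set)
    for i, line in enumerate(lines):
        page = i // lines_per_page + 1
        for word in re.findall(r"[a-zA-Z']+", line.lower()):
            if word in pwl_lower:
                index[word].add(page)
    return {w: sorted(pages) for w, pages in index.items()}
```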

This technique was used to analyse the text of Dostoyevsky's novel "The
Possessed" with respect to the lexicon of possession: to have, to
possess, to acquire, etc. (imet', vladet', priobretat'...)

The idea was inspired by the concept of possession in Fromm's "To Have
or To Be?". According to Fromm's semantics, to possess means (a bit
roughly) to possess property/goods, NOT abstract properties as in
logic/computer science, for instance.

The "possession" PWL for the novel was built, as well as a set of
contexts and an index for about 200 occurrences of these words in the
text.

For each word occurrence its usage model was built. For example, "to
have" in the context "I had a terrible headache after a discussion with
Ivanov" is modelled as "to have headache".
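The post implies these usage models were built by human judgement. A purely mechanical stand-in - pairing the keyword with the next non-stopword - shows why: on the example sentence it picks up the adjective rather than the head noun:

```python
def crude_usage_model(tokens, keyword,
                      stopwords=frozenset({"a", "an", "the", "i"})):
    """Pair a keyword with the next non-stopword token - a very crude
    stand-in for the hand-built usage models described above."""
    models = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            for nxt in tokens[i + 1:]:
                if nxt not in stopwords:
                    models.append((keyword, nxt))
                    break
    return models

sentence = "i had a terrible headache after discussion with ivanov".split()
# crude_usage_model(sentence, "had") pairs "had" with "terrible",
# not "headache" - the real models require a human reader
```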

The complete set of such usage models for the words of a given PWL in a
given text gives us an understanding of the semantics of these words
with respect to that text.

As for the novel, the semantics of "possession" in it turned out to be
very interesting, and it seems to give considerable insight into
Dostoyevsky's train of thought. It appears to be compatible with the
logic/computer-science approach - and quite incompatible with Fromm's!