11.0293 T-score and MI

Humanist Discussion Group (humanist@kcl.ac.uk)
Thu, 25 Sep 1997 23:34:53 +0100 (BST)


Humanist Discussion Group, Vol. 11, No. 293.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

Date: Wed, 24 Sep 1997 16:10:54 -0400 (EDT)
From: "David L. Gants" <dgants@parallel.park.uga.edu>
Subject: Re: Corpora: T-Score and MI: query

your question is not quite specific enough to give you a real answer.
it sounds like you are talking about the approximate t-score that ken
church used in his papers on finding collocates.

if that is what you mean, then the meanings for t-score and mutual
information are rather idiosyncratic to ken's work. the measure that
he called mutual information is perhaps better called a
log-association ratio. it attempts to measure the degree to which two
words occur together more than would be expected based on observation
of their overall frequency of occurrence.
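
as a rough illustration (a sketch, not code from ken's papers), the
log-association ratio of a word pair can be computed from four counts.
the function name and the argument names c_xy, c_x, c_y and n are
assumptions made for this example.

```python
import math

def association_ratio(c_xy, c_x, c_y, n):
    """log2 of the observed co-occurrence rate over the rate expected
    if the two words were independent.

    c_xy: times the two words occur together (hypothetical count)
    c_x, c_y: overall counts of each word
    n: total number of observations in the corpus
    """
    if c_xy == 0:
        return float("-inf")  # never seen together
    # P(x,y) / (P(x) * P(y)) simplifies to c_xy * n / (c_x * c_y)
    return math.log2((c_xy * n) / (c_x * c_y))

# co-occurrence exactly at the independence rate gives 0
print(association_ratio(1, 32, 32, 1024))
# a pair seen together far more often than expected
print(association_ratio(8, 16, 32, 1024))
# two words seen once each, together: a very high ratio from one event
print(association_ratio(1, 1, 1, 1024))
```

note that the last pair, built from a single observation, scores higher
than the well-attested pair before it.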

unfortunately, these association ratios cannot be directly compared
for words which have very different underlying frequencies. thus the
t-score. in general, a t-score is a statistic which assumes that the
values being analyzed are distributed in the standard bell-shaped
curve (what is called the normal distribution). the t-score allows
measurements which conform to some normal distribution to be reduced
to the standard unit normal distribution, which has a mean of zero and
a variance (average squared deviation) of one. once this is done, the
significance of a measurement can be assessed by referring to standard
tables.
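
a minimal sketch of an approximate t-score for a word pair follows. it
assumes the commonly cited form (observed - expected) / sqrt(observed),
which may differ in detail from what a given paper uses. the function
and argument names are made up for this example.

```python
import math

def approx_t_score(c_xy, c_x, c_y, n):
    """approximate t-score for a word pair (assumed form, see above).

    the observed joint count is compared with the count expected under
    independence, scaled by a rough estimate of the standard deviation.
    """
    if c_xy == 0:
        return 0.0  # assumption: no evidence at all is scored as zero
    expected = c_x * c_y / n
    return (c_xy - expected) / math.sqrt(c_xy)

# a pair seen 16 times where 1 co-occurrence would be expected
print(approx_t_score(16, 32, 32, 1024))
# a pair seen once where about 0.001 co-occurrences would be expected
print(approx_t_score(1, 1, 1, 1024))
```

under this form the score can never exceed sqrt(c_xy), so a pair seen
only once cannot score above 1 no matter how large its association
ratio is.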

in fact, though, the assumption of normal distribution is not very
good for association ratios. i argued in my 1993 CL paper that it was
better to use statistics which were not based on this assumption. i
proposed that a statistic called G^2 or log-likelihood ratio was more
appropriate for this sort of work. log-likelihood ratios can be used
in many other areas of statistical natural language processing with
good results. G^2 is closely related to the statistic which is normally
called mutual information (which is *not* what ken church used).
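
the following sketch computes G^2 for a 2x2 table of counts and checks
its relation to mutual information in the usual sense (the average over
all four cells, in nats): G^2 = 2 * N * MI. the helper names and the
table layout are assumptions for this example, not code from the paper.

```python
import math

def contingency(c_xy, c_x, c_y, n):
    """2x2 table: rows are x present / absent, columns are y present / absent."""
    return [[c_xy, c_x - c_xy],
            [c_y - c_xy, n - c_x - c_y + c_xy]]

def g2(table):
    """log-likelihood ratio statistic: 2 * sum(obs * ln(obs / expected))."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:  # an empty cell contributes nothing
                expected = row_tot[i] * col_tot[j] / n
                total += obs * math.log(obs / expected)
    return 2.0 * total

def mutual_information(table):
    """average mutual information of the row and column variables, in nats."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:
                mi += (obs / n) * math.log(obs * n / (row_tot[i] * col_tot[j]))
    return mi

table = contingency(16, 32, 32, 1024)
print(g2(table), 2 * 1024 * mutual_information(table))
```

for a 2x2 table, G^2 is usually compared against a chi-squared
distribution with one degree of freedom.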

in later work, ted pedersen and rebecca bruce have extended this work
and found that, for many applications with a very large number of
degrees of freedom, G^2 becomes less useful and another statistic
based on fisher's exact test becomes more useful. in unpublished
work, i have been able to compensate the G^2 test so that, at least in
some situations, it becomes very useful again.
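
for orientation only, here is a sketch of fisher's exact test in its
simplest form: a one-sided test on a 2x2 table, using the python
standard library. it is not necessarily the statistic used in the work
mentioned above, and the function and argument names are assumptions.

```python
from math import comb

def fisher_exact_greater(c_xy, c_x, c_y, n):
    """one-sided p-value: probability of at least c_xy co-occurrences
    if the margins c_x and c_y are fixed and the words are independent
    (hypergeometric distribution)."""
    denom = comb(n, c_y)
    upper = min(c_x, c_y)  # the joint count cannot exceed either margin
    tail = sum(comb(c_x, k) * comb(n - c_x, c_y - k)
               for k in range(c_xy, upper + 1))
    return tail / denom

# 4 of 4 co-occurrences in 8 observations: probability 1/70 by chance
print(fisher_exact_greater(4, 4, 4, 8))
```

because the probabilities are computed exactly from the counts, no
appeal to a normal or chi-squared approximation is needed, which is
what makes the test attractive for small counts.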

this quick explanation is necessarily very inadequate if you intend to
do serious work with any of these statistics. each of these tests has
its own virtues and defects. if you plan to do more than read the
work of others, then it would be a very good idea to find a
sympathetic statistician who is familiar with the issues which arise
in analysing data involving small counts.

>>>>> "BLdlC" == Belen Labrador de la Cruz <dfmblc@unileon.es> writes:

BLdlC> Could anybody tell me the difference between T-score and
BLdlC> Mutual Information, please? (in easy words, I am new to
BLdlC> this)

