21.066 how to use Chi-square correctly

From: Humanist Discussion Group (by way of Willard McCarty <willard.mccarty_at_kcl.ac.uk>)
Date: Fri, 1 Jun 2007 06:58:39 +0100

                Humanist Discussion Group, Vol. 21, No. 66.
       Centre for Computing in the Humanities, King's College London
                     Submit to: humanist_at_princeton.edu

   [1] From: Carl Vogel <vogel_at_cs.tcd.ie> (48)
         Subject: Re: 21.065 how to use Chi-square correctly

   [2] From: Michael Hart <hart_at_pglaf.org> (4)
         Subject: Re: 21.065 how to use Chi-square correctly

   [3] From: Ryan Deschamps <Ryan.Deschamps_at_Dal.Ca> (36)
         Subject: Re: 21.065 how to use Chi-square correctly

   [4] From: "Juliana Tambovtseva" <jultamb_at_yandex.ru> (8)
         Subject: Classification Accuracy Within Author Discrimination

         Date: Fri, 01 Jun 2007 06:53:31 +0100
         From: Carl Vogel <vogel_at_cs.tcd.ie>
         Subject: Re: 21.065 how to use Chi-square correctly

> At 09.35 30/05/2007, Norman Gray (Humanist Discussion Group) wrote:
> >I don't believe anything has changed in the way that chi-squared is
> >defined or used. Nor has much changed (unfortunately) in the way and
> >extent that it is abused. [...]
> Am I the only one interested to know more
> (possibly with examples) about how the chi-square test is abused?
> If you Norman could take us by hand and write
> down a small description about how the test is
> used correctly and how to avoid the abuse...
> With thanks in advance
> maurizio
> Maurizio Lana - ricercatore
> Dipartimento di Studi Umanistici - Universita del Piemonte Orientale a
> Vercelli
> via Manzoni 8, I-13100 Vercelli
> +39 347 7370925


Of course, without knowing the full details of the article, one can
only speculate on what the reviewer had in mind. The digested version
of what was done in the paper in question was checking the relative
distribution of one linguistic feature between two sources. It is
possible that the reviewer was objecting to the articulation of the
null hypothesis (that any differences in the distributions are random)
or the alternative hypothesis (that the two texts could not have been
drawn from the same population). One or other of those might have
been given a different interpretation. Further, the fact that the
null hypothesis was rejected might have been used to argue support
for a hypothesis far more general than the test was actually focused on.
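
As a rough illustration of the kind of test described above (one
feature compared between two sources), here is a minimal Python
sketch. The counts are invented for the example and the function name
chi_square_2x2 is hypothetical; nothing here is taken from the paper
under discussion.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for a 2x2 table of counts, without continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if both texts share one underlying proportion.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Invented counts: occurrences of the feature vs. all other tokens.
counts = [
    [30, 970],  # text A
    [60, 940],  # text B
]
stat, p = chi_square_2x2(counts)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

A small p-value here only says that these counts would be unlikely if
the feature occurred at the same rate in both texts. It does not by
itself support any broader claim about the two sources.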

There's a very interesting article about the use of Chi-square testing
in natural language research by Adam Kilgarriff:

    author = {Kilgarriff, Adam},
    title = {Language is never, ever, ever random},
    journal = {Corpus Linguistics and Linguistic Theory},
    year = {2005},
    volume = {1-2},
    pages = {263-275}

All the best,


         Date: Fri, 01 Jun 2007 06:54:06 +0100
         From: Michael Hart <hart_at_pglaf.org>
         Subject: Re: 21.065 how to use Chi-square correctly

I am interested as well.

Michael S. Hart
Founder of Project Gutenberg
Who Minored In Statistics
Some 33 Years Ago. . . .

         Date: Fri, 01 Jun 2007 06:54:53 +0100
         From: Ryan Deschamps <Ryan.Deschamps_at_Dal.Ca>
         Subject: Re: 21.065 how to use Chi-square correctly

I can think of only three cases where Chi-Squared could be said to be abused:

1. Using Chi-squared in cases where it isn't appropriate (non-exclusive
categories, too few observations, time-influenced data etc.)
2. Using Chi-squared after failing to follow appropriate observational &
descriptive analysis.
3. Using Chi-squared when a parametric test (t-test, ANOVA) would do just as
well.
The first two just involve proper scientific procedure and analysis, and are
not unique to Chi-squared. Of course, just popping data into SPSS and running
a test is not a good way to go about any statistical analysis.
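
To make the "too few observations" case concrete, here is a minimal
Python sketch that computes the expected counts for a table and flags
the small ones. The threshold of 5 is the common rule of thumb, not a
hard limit; the function names and the counts are invented for
illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def small_expected_cells(table, minimum=5.0):
    """Return (row, column, expected) for each cell whose expected count
    falls below the chosen minimum."""
    return [
        (i, j, e)
        for i, row in enumerate(expected_counts(table))
        for j, e in enumerate(row)
        if e < minimum
    ]

# Invented counts: a rare feature observed in two short samples.
sparse = [
    [3, 47],
    [1, 49],
]
for i, j, e in small_expected_cells(sparse):
    print(f"cell ({i}, {j}) has expected count {e:.1f}")
```

When cells are flagged like this, the chi-squared approximation is
doubtful, and collecting more data or using an exact test is usually
the safer route.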

The last one is a bit of statistical snobbery, but it is valid enough to
warrant a mention. If the parametric test can be run, you should run it
instead of the Chi-squared (despite any preference for Greek letters). See
this tutorial for an explanation:
http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html

The whole thing about the "new" way to run a Chi-squared confounds me, although
I'd need a context for the quote. Mathematical distributions are not very
useful if they change over time. If a more proficient or flexible
distribution is discovered, I am sure they'd give it a new name to make note of
that fact. Frankly, I can only see this happening for specific kinds of
Chi-tests, and the Chi test would still be relevant because it is so easy to

Sometimes economists will give the use of a series of procedures a name and
call it a "test" -- for instance, the Granger test uses least-squares
regression and ANOVA on time-series data -- but now I'm really grasping.

Sometimes another test is required to make sure the data is appropriate for a
particular statistical test. But Chi-squared is so basic that I cannot really
think of a case where that is necessary. (Still grasping).

Hope this is helpful.

For transparency's sake, I am a dilettante on statistical matters (i.e. geeky
enough to write about it on a wiki). Judging by my not-so-many years of
statistical memory, I encourage further comments to straighten out anything
I've said.

Ryan. . .

Ryan Deschamps
MLIS/MPA Expected 2005

         Date: Fri, 01 Jun 2007 06:55:30 +0100
         From: "Juliana Tambovtseva" <jultamb_at_yandex.ru>
         Subject: Classification Accuracy Within Author Discrimination

Dear Humanist colleagues, what are your impressions of the article
"Employing Thematic Variables for Enhancing Classification Accuracy
Within Author Discrimination Experiments" by George Tambouratzis and
Marina Vassiliou, published in Literary and Linguistic Computing,
Vol. 22, No. 2 (June 2007), pages 207-224? The idea of compiling a
simple and understandable handbook on how to use Chi-square is
appreciated, especially with lots of exact examples. Looking forward
to hearing from you at yutamb_at_mail.ru
Received on Fri Jun 01 2007 - 02:06:42 EDT
