17.415 gender-testing fame: LLC in the Times

From: Humanist Discussion Group (by way of Willard McCarty willard.mccarty@kcl.ac.uk)
Date: Mon Dec 01 2003 - 03:25:08 EST

                   Humanist Discussion Group, Vol. 17, No. 415.
           Centre for Computing in the Humanities, King's College London
                       www.kcl.ac.uk/humanities/cch/humanist/
                            www.princeton.edu/humanist/
                         Submit to: humanist@princeton.edu

             Date: Mon, 01 Dec 2003 08:06:03 +0000
             From: Willard McCarty <willard.mccarty@kcl.ac.uk>
             Subject: gender-testing fame: LLC in the Times

    Results from work reported in a recent issue of LLC have attracted the
    attention of the Times (London) for 22 November, in a review entitled, "A
    question of gender: Murder she wrote, or was it he?" The LLC article in
    question is, Moshe Koppel, Shlomo Argamon and Anat Rachel Shimoni,
    "Automatically Categorizing Written Texts by Author Gender", Literary and
    Linguistic Computing 17.4 (2002): 401-12
    (http://www3.oup.co.uk/litlin/hdb/Volume_17/Issue_04/170401.sgm.abs.html).
    Unfortunately the Times article is no longer online. Some extracts, quoted
    under fair use, follow.

    The article begins by quoting from the film "As Good as It Gets", in which
    the main character Melvin Udall explains his ability to write in a woman's
    voice convincingly: "I think of a man and take away reason and
    accountability." That surely gets our attention.... The author then points
    to "...the single most mysterious, enduring and vexed question of dramatic
    writing: how do men write women convincingly? In this case a truly awful
    man writing a well-adjusted woman. The same question declares itself no
    less energetically the other way round: how do women write men? However,
    once we’ve started down the muddy path into this particular valley of
    inquiry, we soon discover ourselves mired in deeper and still more menacing
    questions . . . Are we, in fact, kidding ourselves about the whole gender
    thing? Are writers really capable of genuine sex-change in their fiction,
    or has the history of English literature merely been one long exercise in
    furtive cross-dressing?"

    Are we all now wondering how we can manage to get an equivalent
    introduction to the research we do?

    Koppel, Argamon and Shimoni 2002 is summarized as follows: "It turns out
    that the truth -- the scientific truth -- is that men are capable only of
    writing like men; and women only like women." The Times author goes on to
    explain that Professor Koppel, along with his colleagues, "has designed a
    computer program that is capable of reading any text of more than a
    thousand words written in English and telling you the author’s gender.
    His results, which
    have just been published by the Oxford University Press in the academic
    journal Literary and Linguistic Computing, are going profoundly to affect
    the study of literature around the world. In short, he has used a computer
    to prove once and for all that there is a fundamental and recognisable
    gender difference in the way we write.

    "Such nuances may not be visible to the reader’s eye but his program sees
    them sure enough (the accuracy rate is about 83 per cent). Literary
    scholars have spent hundreds of years manually sifting texts without
    finding the elusive formula that guarantees such consistent success...."
    Using texts from the British National Corpus, the author explains, "Bit by
    bit, Koppel’s team stripped out all the subject-specific words until the
    remaining copy could be fed back into the computer. Then, as they told the
    program which text was male and which female, so the team was able to
    construct a mathematical set of rules to ascribe to either gender. This
    ascribing slowly became describing.

    "Koppel again: 'We would look for unique elements for the women’s stuff and
    the same for the guys. To give you a simple example, if the computer found
    that women used ‘you’ a lot more than the men, then we’d give it a female
    weighting. After a while we got down to about 50 reliable distinguishing
    features. We did some more programming. We refined the model. Then we tried
    it out on anonymous texts. By the end we were hitting 83 per cent accuracy.

    “'And it’s the function words that give the game away. Not the clever or
    the topic-specific stuff but the ‘ands’ and the ‘ifs’ and the ‘buts’, the
    least significant parts of sentences. Mainly, we used individual words but
    also pairs and triples of consecutive parts of speech. But this program is
    not about grammar. Actually, the single biggest difference is that women
    are far more likely than men to use personal pronouns -- ‘I’, ‘you’, ‘she’,
    ‘myself’, or ‘yourself’. Men, on the other hand, are more likely to use
    determiners -- ‘a’, ‘the’, ‘that’, and ‘these’ -- as well as numbers and
    quantifiers like ‘more’ or ‘some’.

    “'And though it might feel a little eerie that people give away their
    gender like this, it’s actually kind of obvious -- when we write we pay
    attention to the big words, not the little ones.'”
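
    [The weighting Koppel describes can be illustrated with a few lines of
    Python. This is a minimal sketch under stated assumptions, not the authors'
    program: the word lists hold only the examples quoted above, the weights
    are hypothetical and set by hand, and the names tokenize, feature_rates and
    score are invented for illustration. Part-of-speech pairs and triples,
    which the quote also mentions, are left out.]

```python
import re
from collections import Counter

# Word lists taken only from the examples quoted in the post; the actual
# feature set (about 50 features, per the quote) is not given there.
PRONOUNS = {"i", "you", "she", "myself", "yourself"}
DETERMINERS = {"a", "the", "that", "these"}
QUANTIFIERS = {"more", "some"}

# Hypothetical hand-set weights: positive leans towards the pattern the
# quote attributes to women, negative towards the one attributed to men.
WEIGHTS = {"pronoun": 1.0, "determiner": -1.0, "quantifier": -1.0}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def feature_rates(text):
    """Return the share of tokens that fall in each function-word class."""
    tokens = tokenize(text)
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter(tokens)
    return {
        "pronoun": sum(counts[w] for w in PRONOUNS) / total,
        "determiner": sum(counts[w] for w in DETERMINERS) / total,
        "quantifier": sum(counts[w] for w in QUANTIFIERS) / total,
    }


def score(text):
    """Weighted sum of the class rates; the sign gives the leaning."""
    rates = feature_rates(text)
    return sum(WEIGHTS[name] * rate for name, rate in rates.items())
```

    [A positive score leans towards the pronoun-heavy pattern and a negative
    one towards the determiner-heavy pattern. In a real system the weights
    would be learned from labelled texts rather than fixed in advance.]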

    "But who, I wondered, was among the mistaken 17 per cent? Koppel had kept a
    list. P. D. James was the first name that caught my eye. After checking her
    novel Devices and Desires, the computer concluded that Baroness James was a
    man. Likewise wrongly sexed was David Lodge’s early novel, The Picture
    Goers. But the one that really caught my eye was Dick Francis."

    The reporter interviews Baroness James, then David Lodge, who is quoted as
    saying, "Novels are very problematic texts because they are written in a
    medley of styles. And more often than not the author is trying to imitate
    some kind of imagined consciousness -- male or female. Indeed, writers have
    always tried to imitate the distinctive characteristics of male and female
    discourse and we are in the habit of thinking that they have often
    succeeded. But perhaps these scientists believe they can prove this is an
    illusion. Still, I’m very surprised that this program is able to discern
    the gender of the real author. If you were to take ordinary first-person
    texts -- letters or diaries -- then you might, of course, expect a fairly
    high degree of accuracy. But that it can be done on literary novels
    intrigues me. This will have fascinating literary, critical and general
    sociological implications. That said, I’d like to see them apply it to a
    novelist’s attempt to imitate the opposite sex in a particular passage."

    The reporter goes back to Koppel -- to discover "the program’s Achilles’
    heel. When writers move into direct speech they can imitate the voice of
    the opposite gender much more successfully, it seemed. Not always, but
    often. This was something: a sort of 50 per cent rescue for novelists. But
    what of the long passages of prose where the novelist is pretending to be
    inside the mind of a character? What about, for example, the last chapter
    of Ulysses wherein James Joyce spends 20,000 words pretending to be the
    untidy consciousness of Molly Bloom -- changing direction, interrupting,
    digressing.

    "On this, Koppel demurred. He hadn’t, he pointed out, tested the program in
    such a literary-specific way. In fact, he went on to explain, the whole
    text-recognition business is ultimately about the internet -- that’s the
    impetus behind the work and the most obvious commercial application:
    refining search-engine accuracy, recognising disguised chatroom entrants
    and so on. But, he agreed, it remained rather important for English
    literature that such specifics were tested."

    Dr Willard McCarty | Senior Lecturer | Centre for Computing in the
    Humanities | King's College London | Strand | London WC2R 2LS || +44 (0)20
    7848-2784 fax: -2980 || willard.mccarty@kcl.ac.uk
    www.kcl.ac.uk/humanities/cch/wlm/


