20.196 processing Spanish texts

From: Humanist Discussion Group (by way of Willard McCarty <willard.mccarty_at_kcl.ac.uk>)
Date: Thu, 14 Sep 2006 06:40:11 +0100

               Humanist Discussion Group, Vol. 20, No. 196.
       Centre for Computing in the Humanities, King's College London
  www.kcl.ac.uk/schools/humanities/cch/research/publications/humanist.html
                        www.princeton.edu/humanist/
                     Submit to: humanist_at_princeton.edu

   [1] From: Erik Hatcher <esh6h_at_virginia.edu> (38)
         Subject: Re: 20.182 queries: ALLC/ACH paper? processing Spanish
                 texts?

   [2] From: Duane Gran <dmg2n_at_virginia.edu> (107)
         Subject: Re: 20.188 ALLC/ACH paper

--[1]------------------------------------------------------------------
         Date: Thu, 14 Sep 2006 06:30:45 +0100
         From: Erik Hatcher <esh6h_at_virginia.edu>
         Subject: Re: 20.182 queries: ALLC/ACH paper? processing Spanish texts?

On Sep 12, 2006, at 2:07 AM, Humanist Discussion Group (by way of
Willard McCarty <willard.mccarty_at_kcl.ac.uk>) wrote:
>--
>[2]------------------------------------------------------------------
> Date: Tue, 12 Sep 2006 07:00:56 +0100
> From: Maria Esteva <mesteva_at_mail.utexas.edu>
> >
>Hi,
>
>For my dissertation I will mine a corpus of corporate electronic
>texts. The corpus contains some texts in English and Portuguese and I
>need to focus on the Spanish section.
>
>I am wondering if anybody knows where can I find some or all of the
>next tools to process the texts:

Maria,

The Lucene family of projects has all the pieces you're after. Here
are the pointers....

>Language identification software (to sort texts based on language)

Nutch has a language identifier plugin. It has not yet been split out
of Nutch as a standalone component, but there is an effort to do so.
You can find its source code here:

          <http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/languageidentifier/>
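
To illustrate what sorting texts by language involves, here is a
minimal sketch in Python. It does not reproduce the Nutch plugin; it
swaps in a much cruder heuristic that counts hits against small stop
word samples. The word lists and the name guess_language are
assumptions made up for this example.

```python
import re

# Tiny illustrative stop word samples (assumed for this sketch);
# real lists would be much longer.
STOP_WORDS = {
    "es": {"el", "la", "los", "las", "de", "que", "y", "en", "un", "una",
           "por", "con", "para", "del"},
    "pt": {"o", "a", "os", "as", "de", "que", "e", "em", "um", "uma",
           "por", "com", "para", "do", "da", "não"},
    "en": {"the", "of", "and", "to", "in", "a", "is", "that", "for",
           "with", "on", "it"},
}


def guess_language(text):
    """Return the language code whose stop words occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(1 for w in words if w in stops)
        for lang, stops in STOP_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # No stop word hit at all means there is nothing to base a guess on.
    return best if scores[best] > 0 else None
```

A heuristic like this is only plausible for documents of reasonable
length; short or mixed-language texts would need something more
robust.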

>Spanish stemmer
>Spanish tokenizer
>Spanish stop words list

All of these are rolled into one here:

          <http://mail-archives.apache.org/mod_mbox/lucene-java-user/200408.mbox/%3C010a01c48950$1d8745d0$1db8e818_at_ernesto%3E>

The SnowballAnalyzer does the stemming, and the words on the hardcoded
stop word list are removed.
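
The order of operations (lowercase, drop stop words, then stem) can be
sketched in a few lines of Python. This is not Lucene code: the stop
word list is a small made-up sample, and toy_stem is a crude
placeholder that strips a handful of suffixes. It is NOT the Snowball
Spanish algorithm, which has many more rules.

```python
# Small sample list assumed for this sketch; not the list from the link.
SPANISH_STOP_WORDS = {
    "el", "la", "los", "las", "de", "del", "que", "y", "en",
    "un", "una", "por", "con", "para",
}

# Placeholder suffixes, longest first. NOT the Snowball Spanish rules.
TOY_SUFFIXES = ("mente", "ciones", "ción", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(tokens, stop_words=SPANISH_STOP_WORDS, stem=toy_stem):
    """Lowercase the tokens, drop stop words, then stem what is left."""
    lowered = (t.lower() for t in tokens)
    return [stem(t) for t in lowered if t not in stop_words]
```

Passing the stemmer in as a parameter means a real implementation can
replace the placeholder without touching the rest of the chain.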

Tokenizing Spanish is fairly trivial, and any of the basic Lucene
analyzers would do a reasonable job with that, such as the
StandardAnalyzer found in the core Lucene API. There is an accent
removal filter available in Lucene's core (called
ISOLatin1AccentFilter), which will change characters like ñ into n.
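
For readers who want to see the idea outside Lucene, here is a rough
Python sketch of tokenizing and accent folding using only the standard
library. It is not how StandardAnalyzer or ISOLatin1AccentFilter are
implemented; the function names tokenize and fold_accents are made up
for this example, and the folding relies on Unicode decomposition
rather than a Latin-1 lookup table.

```python
import re
import unicodedata


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    # Compose first so accented letters are single characters for \w.
    composed = unicodedata.normalize("NFC", text)
    return re.findall(r"\w+", composed.lower())


def fold_accents(token):
    """Remove diacritics, so that 'ñ' becomes 'n' and 'ó' becomes 'o'."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Note that folding ñ merges words that Spanish keeps distinct (año and
ano, for example), so whether to apply it depends on what the analysis
needs.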

          Erik

--[2]------------------------------------------------------------------
         Date: Thu, 14 Sep 2006 06:31:53 +0100
         From: Duane Gran <dmg2n_at_virginia.edu>
         Subject: Re: 20.188 ALLC/ACH paper

Maria,

I see your original inquiry below about stemming analysis of
Spanish. You may be interested in the following:

    http://snowball.tartarus.org/algorithms/spanish/stemmer.html

This algorithm is usable by the Lucene search engine (which includes
tokenization):

    http://lucene.apache.org/java/docs/lucene-sandbox/

Duane Gran

On Sep 13, 2006, at 2:10 AM, Humanist Discussion Group (by way of
Willard McCarty <willard.mccarty_at_kcl.ac.uk>) wrote:

> Humanist Discussion Group, Vol. 20, No. 188.
> Centre for Computing in the Humanities, King's College London
> www.kcl.ac.uk/schools/humanities/cch/research/publications/humanist.html
> www.princeton.edu/humanist/
> Submit to: humanist_at_princeton.edu
>
>
>
> Date: Wed, 13 Sep 2006 06:57:05 +0100
> From: "nyhan, julianne" <julianne.nyhan_at_ucc.ie>
> >Spanish texts?
>
>Dear maurizio,
>
> >I am searching for the Proceedings of ALLC-ACH
> >conference of Goteborg 2004, and specifically for these papers:
> >1) Juola, P. Ad-hoc Authorship Attribution
> >2) Koppel, M. and Schler, J. (2004). Ad-hoc
>Authorship Attribution Competition Approach
>Outline.
>
>I do not know if the proceedings are available
>online (they were published by Goteborg
>University), but I can send you on a photocopy of
>the papers you are looking for - if that helps?
>
>Regards,
>Julianne Nyhan
>
>-----------------------------------
>Dr Julianne Nyhan
>Research associate,
>CELT project,
>History Department,
>University College Cork
>e-mail:julianne.nyhan_at_ucc.ie
>phone: 00353 214903142
>Web: http://www.ucc.ie/celt/digineen.html
>Web: http://epu.ucc.ie/lexicon/complex_example
>Mailing list: https://www.ucc.ie/digiquest/
>---------------------------------------------
>
>
>Dear humanists,
>I am searching for the Proceedings of ALLC-ACH
>conference of Goteborg 2004, and specifically for these papers:
>1) Juola, P. Ad-hoc Authorship Attribution
>Competition. In: Proceedings 2004 Joint
>International Conference of the Association for Literary and
>Linguistic
>Computing and the Association for Computers and
>the Humanities (ALLC/ACH 2004), Göteborg, Sweden.
>2) Koppel, M. and Schler, J. (2004). Ad-hoc
>Authorship Attribution Competition Approach
>Outline. In Juola, P. (ed.), Ad-hoc Authorship Attribution Contest,
>ACH/ALLC.
>
>Does anyone know if they are available through an
>online service like Ingenta, or the site of the
>publisher (which I am not able to recognise)?
>
>with many thanks for your help
>maurizio
>
>Maurizio Lana - ricercatore
>Dipartimento di Studi Umanistici - Università del Piemonte Orientale a Vercelli
>via Manzoni 8, I-13100 Vercelli
>+39 347 7370925
>
>
>--
>[2]------------------------------------------------------------------
> Date: Tue, 12 Sep 2006 07:00:56 +0100
> From: Maria Esteva <mesteva_at_mail.utexas.edu>
>
>Hi,
>
>For my dissertation I will mine a corpus of corporate electronic
>texts. The corpus contains some texts in English and Portuguese and I
>need to focus on the Spanish section.
>
>I am wondering if anybody knows where can I find some or all of the
>next tools to process the texts:
>
>Language identification software (to sort texts based on language)
>Spanish stemmer
>Spanish tokenizer
>Spanish stop words list
>
>Thanks,
>
>Maria Esteva
>Doctoral Candidate
>School of Information
>University of Texas at Austin
Received on Thu Sep 14 2006 - 03:00:43 EDT
