11.389 ELRA News

Humanist Discussion Group (humanist@kcl.ac.uk)
Sat, 8 Nov 1997 17:32:53 +0000 (GMT)

Humanist Discussion Group, Vol. 11, No. 389.
Centre for Computing in the Humanities, King's College London

Date: Sat, 8 Nov 1997 11:41:15 -0500 (EST)
From: "David L. Gants" <dgants@parallel.park.uga.edu>
Subject: ELRA new resources



ELRA is happy to announce the update of its catalogue
of Language resources for Language Engineering and Research.
It currently consists of:

1) Spoken resources: 64 databases in several languages (recordings
from microphone, telephone, continuous speech, isolated words, phonetic
dictionaries, etc.).

2) Written resources:
* 15 monolingual and multilingual corpora
* 40 monolingual lexica
* Around 60 multilingual lexica
* A linguistic software platform and grammars development platform

3) Terminological resources: over 360 databases with a wide range of
domains and several languages (Catalan, Danish, English, French, German,
Italian, Latin, Polish, Portuguese, Spanish, Turkish).


The POLYCOST speech database was recorded during January-March 1996
as a common initiative entitled "Speaker Recognition in Telephony"
within the COST 250 action. The main purpose of the database is to
compare and validate speaker recognition algorithms. The data was
collected via international telephone lines, with more than five
sessions per speaker, and with English spoken by foreigners.

The database contains around 10 sessions recorded by 134 subjects from
14 countries. Each session contains 14 items. All items, except the last
two, are expressed in English. The speakers come from the European countries
taking part in the action. Approximately 10 speakers per country were
provided by each partner.

Each session comprises 15 prompts, including one prompt for DTMF detection,
10 prompts with connected digits uttered in English, 2 prompts with sentences
uttered in English and 2 prompts in the speaker's mother tongue. One of the
prompts in the speaker's mother tongue consists of free speech.

* English:
- 4 prompts distributed throughout the session in which the speaker
pronounces his or her 7-digit client code;
- 5 prompts distributed throughout the session in which the speaker
pronounces a sequence of 10 digits (the same from session to session
and from speaker to speaker);
- 2 prompts in which the speaker pronounces the sentences "Joe took
father's green shoe bench out" and "He eats several light tacos",
as fixed password phrases which are common to all speakers;
- 1 prompt in which the speaker is supposed to give his or her
international phone number.

* Mother tongue:
- 1 prompt in which the speaker gives his or her first name, family name,
gender (female/male), town and country;
- 1 prompt with free speech.
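
As a quick cross-check of the session layout described above, the prompt
counts can be tallied in a few lines of Python. This is only a sketch of one
reading of the announcement: the grouping labels and the names SESSION_PROMPTS
and count_by_kind are illustrative, and counting the phone-number prompt among
the connected-digit prompts is an assumption made so that the figures agree.

```python
# Prompt inventory for one POLYCOST session, as listed in the announcement.
# The "kind" labels are illustrative, not ELRA terminology.
SESSION_PROMPTS = [
    ("dtmf", "DTMF detection", 1),
    ("english", "7-digit client code", 4),
    ("english", "fixed 10-digit sequence", 5),
    ("english", "international phone number", 1),
    ("english", "fixed password sentences", 2),
    ("mother_tongue", "name, gender, town and country", 1),
    ("mother_tongue", "free speech", 1),
]


def count_by_kind(prompts):
    """Return a dict mapping each prompt kind to its number of prompts."""
    totals = {}
    for kind, _description, count in prompts:
        totals[kind] = totals.get(kind, 0) + count
    return totals


if __name__ == "__main__":
    totals = count_by_kind(SESSION_PROMPTS)
    spoken = totals["english"] + totals["mother_tongue"]
    print("prompts per session:", sum(totals.values()))
    print("spoken items per session:", spoken)
```

Under this reading, the 15 prompts are the 14 spoken items plus the single
DTMF prompt.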

The database was collected through the European telephone network and was recorded
through an ISDN card on an XTL SUN platform with an 8 kHz sampling rate. Most of the
calls were automatically classified by DTMF detection. Manual classification has
been used in the case of no DTMF or wrong DTMF PIN code (circa 10% of the database).

The English prompts are segmented and labelled at the word level (orthographic
transcription and word stretches). The prompts in mother tongue are simply labelled
(an orthographic transcription will be given). The conventions used for the
annotation are those defined within the SpeechDat project.

Character set: ISO-8859-1
Medium: CD-ROMs. The first CD contains speech data from speakers M001-M069,
and the second CD contains data from speakers F001-F060 plus M070-M074.
Total size CD1: 636 MB
Total size CD2: 610 MB
File format: A-law, 8 kHz sampling rate, 8 bits/sample, with no file header.
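
Because the files carry no header, a reader has to supply the format
parameters given above (A-law, 8 kHz, 8 bits/sample). The following Python
sketch shows one way to expand such data to 16-bit linear samples using the
standard ITU-T G.711 A-law expansion. The function names are illustrative
and not part of the distribution, and the sketch assumes one channel per file.

```python
import struct
import wave

SAMPLE_RATE_HZ = 8000  # from the announcement: 8 kHz, 8 bits/sample


def alaw_to_linear(code):
    """Expand one 8-bit A-law code to a signed 16-bit linear sample (G.711)."""
    code ^= 0x55                     # undo the even-bit inversion used by A-law
    mantissa = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4     # 3-bit segment number
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data):
    """Decode headerless A-law bytes into a list of 16-bit linear samples."""
    return [alaw_to_linear(byte) for byte in data]


def duration_seconds(n_bytes, rate=SAMPLE_RATE_HZ):
    """Duration of a headerless file: one byte per sample at the given rate."""
    return n_bytes / rate


def save_as_wav(samples, path, rate=SAMPLE_RATE_HZ):
    """Write samples as a mono, 16-bit little-endian PCM WAV file."""
    with wave.open(path, "wb") as out:
        out.setnchannels(1)          # assumption: one channel per file
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

A raw file would be read in binary mode, passed to decode_alaw, and the
result handed to save_as_wav. At one byte per sample, each 8000 bytes
corresponds to one second of speech.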

Price for ELRA members:
o price for research use: 500 ECU
o price for commercial use: 1000 ECU
Price for non members:
o price for research use: 600 ECU
o price for commercial use: 1200 ECU
Price for COST 250 partners: 100 ECU


The ONOMASTICA project was a European-wide research initiative within the scope
of the Linguistic Research and Engineering Programme, the aim of which was the
construction of a multi-language pronunciation lexicon of proper names. That
project covered eleven European languages: Danish, Dutch, English, French,
German, Greek, Italian, Norwegian, Portuguese, Spanish and Swedish.

Although the ONOMASTICA project ended in June 1995, the work continued with the
introduction of new partners, addressing names in Eastern and Central European
languages: Czech, Estonian, Latvian, Polish, Romanian, Slovakian, Slovenian and
Ukrainian, in a new project funded by the European Commission's Copernicus Programme.

Although the results of the original ONOMASTICA project for the Western European
languages are not available (except for German), the results of this new project are.
It consists of a collection of 1,783,390 transcriptions of 1,705,653 names, broken
down as follows:

* Czech: 257,700 entries consisting of 244,025 names prepared by Dr. Pavel Kolar of
the Language Institute, Silesian University, Opava, Czech Republic.
* Estonian: 209,515 entries consisting of 208,380 names prepared by Dr. Peeter Päll
of the Institute for the Estonian Language, Estonian Academy of Sciences, Tallinn, Estonia.
* Latvian: 258,214 entries consisting of 245,331 names prepared by Dr. Andrejs Spektors
of the Institute of Mathematics and Computer Science, University of Latvia, Riga, Latvia.
* Polish: 285,412 entries consisting of 244,632 names prepared by Prof. Wiktor Jassem
of the Institute of Fundamental Technological Research, Polish Academy of Sciences,
Poznan, Poland.
* Slovak: 228,257 entries consisting of 228,257 names prepared by Dr. Peter Durco of
the Department of Foreign Languages, Police Academy of the Slovak Republic, Bratislava,
Slovak Republic.
* Slovenian: 285,862 entries consisting of 283,449 names prepared by Dr. Zdravko Kacic
of the Faculty of Technical Sciences, University of Maribor, Maribor, Slovenia.
* Ukrainian: 258,430 entries consisting of 251,579 names prepared by Dr. Yevgeniy Ludovik
of the Institute of Cybernetics, Ukraine Academy of Sciences, Kiev, Ukraine.

The databases are presented in Microsoft Access format and in ASCII text format,
together with database browser software prepared by Keith Edwards of the Centre for
Communication Interface Research, The University of Edinburgh.

Price for ELRA members:
o price for research use: 400 ECU
o price for commercial use: 3000 ECU
Price for non members:
o price for research use: 800 ECU
o price for commercial use: 6000 ECU

For more information, please contact:
87, Avenue d'Italie
75013 PARIS
Tel: +33 1 45 86 53 00
Fax: +33 1 45 86 44 88
E-mail: info-elra@calva.net

Humanist Discussion Group
Information at <http://www.kcl.ac.uk/humanities/cch/humanist/>