11.0243 Linguistic Data Consortium release

Humanist Discussion Group (humanist@kcl.ac.uk)
Tue, 26 Aug 1997 10:34:59 +0100 (BST)

Humanist Discussion Group, Vol. 11, No. 243.
Centre for Computing in the Humanities, King's College London

Date: Wed, 20 Aug 1997 20:50:21 -0400 (EDT)
From: "David L. Gants" <dgants@parallel.park.uga.edu>
Subject: New Collection from the Linguistic Data Consortium

>> From: LDC Office <ldc@unagi.cis.upenn.edu>

Announcing a NEW RELEASE from the

CALLHOME Collection in Six Languages

The objective of the CALLHOME project is the creation of a
multi-lingual speech corpus that will support the development of Large
Vocabulary Conversational Speech Recognition (LVCSR) technology. The
collection covers six languages: American English, Egyptian
Arabic, German, Japanese, Mandarin Chinese, and Spanish.

Each CALLHOME language includes telephone speech, transcripts and
tables, and a lexicon. Each language can be distributed as a complete
set of speech, transcripts, and lexicon (lexicons to be released in
the near future) or the components can be ordered separately.

The telephone speech consists of either 100 or 120 unscripted
telephone conversations between native speakers of the specific
language. All calls, which lasted up to 30 minutes, originated in
North America. Participants typically called family members or close
friends. Most calls were placed to various locations overseas, but
some participants placed calls within North America.

Each transcript covers a contiguous 5- or 10-minute segment taken from a
recorded conversation. The transcripts are timestamped by speaker
turn for alignment with the speech signal, and are provided in
standard orthography.

The lexicons, which are not yet available, contain tab-separated
information fields with orthographic, morphological, phonological,
stress, source, and frequency information for each word. The lexicons
will be covered by a special license agreement.
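
Because the lexicons are not yet available, their exact layout cannot be
confirmed here. As a minimal sketch only, the following Python assumes
one entry per line, with exactly six tab-separated fields in the order
the paragraph above lists them. The field order, the LexiconEntry name,
and the parse_lexicon_line helper are illustrative assumptions, not part
of any LDC release.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line. The field order is an ASSUMPTION taken from the
    order in which the announcement lists the information types."""
    orthography: str
    morphology: str
    phonology: str
    stress: str
    source: str
    frequency: str


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Split one tab-separated line into its six assumed fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(LexiconEntry._fields):
        raise ValueError(
            f"expected {len(LexiconEntry._fields)} fields, got {len(fields)}"
        )
    return LexiconEntry(*fields)


# Placeholder values only; this is not real lexicon data.
entry = parse_lexicon_line("word\tmorph\tphon\tstress\tsrc\t3\n")
```

Keeping every field as a string avoids guessing at encodings (for
example, how stress or frequency is written) before the released files
can be inspected.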

Institutions that have membership in the LDC during the 1997
Membership Year will be able to receive the CALLHOME materials at no
additional charge, in the same manner as all other text and speech
corpora published by the LDC. Due to a delayed release, 1996 members
are entitled to CALLHOME Japanese, Mandarin Chinese, and Spanish.

Nonmembers can purchase CALLHOME materials for research purposes only.
The cost of the CALLHOME collection is $3000 per language. The
various components of this collection can be purchased separately:
speech databases are $1000, transcripts are $500, and lexicons are
$1500 each. If you would like to order a copy of this corpus, please
email your request to ldc@unagi.cis.upenn.edu. If you need additional
information before placing your order, or would like to inquire about
membership in the LDC, please send email or call (215) 898-0464.
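
The nonmember prices above are consistent with one another: the three
components add up to the $3000 per-language price. A short Python sketch
of that arithmetic follows; the COMPONENT_PRICES_USD table and the
order_cost helper are hypothetical names used for illustration, and the
sketch covers a single language only.

```python
# Nonmember prices per language, in US dollars, as quoted in the announcement.
COMPONENT_PRICES_USD = {
    "speech": 1000,
    "transcripts": 500,
    "lexicon": 1500,
}


def order_cost(components):
    """Return the total cost of the named components for ONE language.

    Raises KeyError if a component name is not in the price table.
    """
    return sum(COMPONENT_PRICES_USD[name] for name in components)


full_set = order_cost(["speech", "transcripts", "lexicon"])
speech_and_transcripts = order_cost(["speech", "transcripts"])
```

Here full_set equals the quoted $3000 per-language price, and ordering
speech and transcripts alone comes to $1500.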

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.ldc.upenn.edu/. Information is also available via ftp at
ftp.cis.upenn.edu under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when
asked for a password.

Language    Speech      Transcripts   Lexicon      Membership
            $1000       $500          $1500        year
American    LDC97S42    LDC97T14      LDC97L20     97
English                               (PRONLEX)
Egyptian    LDC97S45    LDC97T19      LDC97L19     97
German      LDC97S43    LDC97T15      LDC97L18     97
Japanese    LDC96S37    LDC96T18      LDC96L17     96/97
Mandarin    LDC96S34    LDC96T16      LDC96L15     96/97
Spanish     LDC96S35    LDC96T17      LDC96L16     96/97

Humanist Discussion Group
Information at <http://www.kcl.ac.uk/humanities/cch/humanist/>