7.0274 CELEX lexical data on CD-ROM (1/200)

Mon, 1 Nov 1993 08:16:18 EST

Humanist Discussion Group, Vol. 7, No. 0274. Monday, 1 Nov 1993.

Date: Mon, 25 Oct 1993 17:28 +0100 (MET)
From: The Centre for Lexical Information <CELEX@MPI.NL>
Subject: CELEX lexical data on CD-ROM

Dear Madam or Sir,

This message is posted to announce the release of a CD-ROM with
lexical data by the Dutch Centre for Lexical Information in
collaboration with the Linguistic Data Consortium in the USA.

This CD-ROM, which contains the CELEX lexical databases of English
(version 2.5), Dutch (version 3.1) and German (version 2.0), is now
available for research purposes from the Linguistic Data Consortium
for $150. For each language, the CD-ROM contains detailed information
on the orthography (variations in spelling, hyphenation), the
phonology (phonetic transcriptions, variations in pronunciation,
syllable structure, primary stress), the morphology (derivational and
compositional structure, inflectional paradigms), the syntax (word
class, word-class specific subcategorisations, argument structures),
and word frequency (summed word and lemma counts, based on recent and
representative text corpora) of both wordforms and lemmas (English:
52446 lemmas, 160594 wordforms; German: 50708 lemmas, 359611
wordforms; Dutch: 124136 lemmas, 381292 wordforms). PostScript files
describe the available lexical information in detail.

The original CELEX databases can be consulted interactively either by
using the SQL*PLUS query language within an ORACLE RDBMS environment,
or by means of the specially designed user interface FLEX. The
databases on this CD-ROM have not been tailored to fit any particular
database management program. Instead, the information is presented in
a series of plain ASCII files in a UNIX directory tree that can be
queried with tools such as AWK or ICON. Unique identity numbers allow
the linking of information from different files. As in the original
databases, some kinds of information have to be computed on-line.
Wherever necessary, AWK functions have been provided to recover this
information. README files specify the details of their use.
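
As an illustration of how identity numbers can link information from
different files, here is a minimal Python sketch. The file layout is an
assumption and is not taken from the README files: one record per line,
fields separated by a delimiter (a backslash here), with the identity
number as the first field. The function names and the sample records
are hypothetical.

```python
def read_records(lines, sep="\\"):
    """Map identity number -> remaining fields (assumed record layout)."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, *fields = line.split(sep)
        table[int(ident)] = fields
    return table


def link(left, right):
    """Join two tables on the identity numbers they have in common."""
    return {i: left[i] + right[i] for i in left.keys() & right.keys()}


# Hypothetical sample records; the real files may order fields differently.
orthography = ["1\\cell\\1210", "2\\cellar\\228"]
phonology = ['1\\"sEl', '2\\"sE-l@r*']

linked = link(read_records(orthography), read_records(phonology))
for ident in sorted(linked):
    print(ident, linked[ident])
```

The same join can of course be written in AWK or ICON; the point is only
that a shared identity number is enough to combine columns held in
separate files.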

The CD-ROM is mastered using the ISO 9660 data format, with the Rock
Ridge extensions, allowing it to be used in VMS, MS-DOS, Macintosh (*)
and UNIX environments.

Anyone who would like to purchase the CD-ROM should send a check or
purchase order made payable to the "Trustees of the University of
Pennsylvania" to

Judith Storniolo
Administrative Assistant, LDC
Linguistic Data Consortium
441 Williams Hall
University of Pennsylvania
Philadelphia, PA 19104-6305
Tel: +1/215/898-0464 Fax: +1/215/573-2175

(*) A Macintosh whose CD-ROM drive was obtained before 12/92, and on
which no system upgrades have been installed since that date, will not
be able to read the CELEX CD-ROM. In that case, all that is needed is to
obtain the upgraded driver software (a very small amount of code) and
copy it onto the system in place of the existing driver. The upgrade can
be obtained as follows:

Connect to ftp server: ftp.apple.com
Go to directory: dts/mac/sys.soft/cdrom
Get file: cd-rom-setup

A brief overview of the English data on this CD is given below:


When starting to use the English database, the user first has to
choose between two so-called `lexicon types':

- a lemma lexicon
- a wordform lexicon

Each lexicon type uses a specific kind of entry. The CELEX lemma lexicon
is the one most similar to an ordinary dictionary since every entry in
this lexicon represents a set of related inflected words. In a lexicon, a
lemma can be represented by using a headword (cf. traditional dictionary
entries) such as, for example, `call' or `cat'. The wordform lexicon
yields all possible inflected words: every entry in the lexicon is an
inflectional variant of the related headword or stem. So, a wordform
lexicon contains words like `call', `calls', `calling', `called', `cat',
`cats' and so on.

For both types of lexicons, the user may subsequently select any number
of columns -- from approximately 150 database columns -- combining
information on the orthography, phonology, morphology, syntax and
frequency of the entries. The information sheet `Lexical Data, English'
summarizes the types of information available. An exhaustive overview of
the columns available is given in the CELEX User Guide.


The lexical data that can be selected for each entry in the different
English lexicon types can be divided into five categories: orthography,
phonology, morphology, syntax and frequency. In a separate section,
example data are given for each of these categories.

Orthography        - with or without diacritics
(spelling)         - with or without word division positions
                   - alternative spellings
                   - number of letters/syllables

Phonology          - phonetic transcriptions (using SAMPA notation or
(pronunciation)      Computer Phonetic Alphabet (CPA) notation) with:
                       - syllable boundaries
                       - primary and secondary stress markers
                   - consonant-vowel patterns
                   - number of phonemes/syllables
                   - alternative pronunciations

Morphology         - Derivational/compositional:
(word structure)       - division into stems and affixes
                       - flat or hierarchical representations
                   - Inflectional:
                       - stems and their inflections

Syntax             - word class
(grammar)          - subcategorisations per word class

Frequency          - COBUILD frequency(*)
(*) These frequency data are based on the COBUILD corpus (18 million
words) built up by the University of Birmingham, Great Britain.
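
To make the phonology columns more concrete, the following Python
sketch derives a syllable count and a stress pattern from a
transcription string. It assumes the conventions visible in
transcriptions such as %sE-lI-"breI-Sn, in this message: a hyphen marks
a syllable boundary, a double quote precedes the syllable carrying
primary stress, and a percent sign precedes a syllable carrying
secondary stress. The function name is hypothetical, and the database
itself already offers such counts as columns.

```python
def stress_pattern(transcription):
    """Return one digit per syllable: 1 primary, 2 secondary, 0 unstressed."""
    marks = {'"': "1", "%": "2"}
    syllables = transcription.split("-")
    return "".join(marks.get(syllable[:1], "0") for syllable in syllables)


pattern = stress_pattern('%sE-lI-"breI-Sn,')
print(pattern)       # 2010
print(len(pattern))  # 4 syllables
```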


An arbitrary query using a small English lemma lexicon (that is, one with
very few columns) might yield the following result:

Headword     Pronunciation     Morphology:          M:  Cl  Freq
                               Structure            Cl
-----------  ----------------  -------------------  --  --  ----
celebrant    "sE-lI-br@nt      ((celebrate),(ant))  Vx  N      6
celebration  %sE-lI-"breI-Sn,  ((celebrate),(ion))  Vx  N    201
cell         "sEl              (cell)               N   N   1210
cellar       "sE-l@r*          (cellar)             N   N    228
cellarage    "sE-l@-rIdZ       ((cellar),(age))     Nx  N      0
cellist      "tSE-lIst         ((cello),(ist))      Nx  N      5
cello        "tSE-l@U          (cello)              N   N     25
cellular     "sEl-jU-l@r*      ((cell),(ular))      Nx  A     21
celluloid    "sEl-jU-lOId      ((cellulose),(oid))  Nx  N     29
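
The bracketed morphological structures in this table can be read
mechanically. The Python sketch below parses a structure string into a
nested list (a hierarchical representation) and then flattens it into a
plain list of stems and affixes. It assumes only the notation visible in
the table: each element is enclosed in parentheses and sibling elements
are separated by commas. The function names are hypothetical, and the
doubly nested test string is invented for illustration and is not CELEX
data.

```python
def parse_structure(text):
    """Parse '((celebrate),(ant))' into ['celebrate', 'ant']."""

    def parse_node(pos):
        if text[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1
        if text[pos] != "(":
            # Leaf: plain characters up to the closing parenthesis.
            end = text.index(")", pos)
            return text[pos:end], end + 1
        children = []
        while True:
            child, pos = parse_node(pos)
            children.append(child)
            if text[pos] == ",":
                pos += 1
            elif text[pos] == ")":
                return children, pos + 1
            else:
                raise ValueError(f"unexpected {text[pos]!r} at position {pos}")

    tree, end = parse_node(0)
    if end != len(text):
        raise ValueError("trailing characters after structure")
    return tree


def flatten(tree):
    """Reduce a nested structure to a flat list of its leaves."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in flatten(child)]


print(parse_structure("(cell)"))               # cell
print(parse_structure("((celebrate),(ant))"))  # ['celebrate', 'ant']

# Invented nested example, only to show the hierarchical/flat difference.
nested = parse_structure("(((cell),(ular)),(ity))")
print(nested)           # [['cell', 'ular'], 'ity']
print(flatten(nested))  # ['cell', 'ular', 'ity']
```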

An example selection from a small English wordform lexicon, showing the
inflectional variants of the headwords given in the previous example, is
presented in the next table:

Word          Word division    Pronunciation      Cl  Type  Freq
------------  ---------------  -----------------  --  ----  ----
celebrant     cel-e-brant      "sE-lI-br@nt       N   sing     2
celebrants    cel-e-brants     "sE-lI-br@nts      N   plu      4
celebration   cel-e-bra-tion   %sE-lI-"breI-Sn,   N   sing   144
celebrations  cel-e-bra-tions  %sE-lI-"breI-Sn,z  N   plu     57
cell          cell             "sEl               N   sing   655
cells         cells            "sElz              N   plu    555
cellar        cel-lar          "sE-l@r*           N   sing   187
cellars       cel-lars         "sE-l@z            N   plu     41
cellarage     cel-lar-age      "sE-l@-rIdZ        N   sing     0
cellarages    cel-lar-ag-es    "sE-l@-rI-dZIz     N   plu      0
cellist       cel-list         "tSE-lIst          N   sing     5
cellists      cel-lists        "tSE-lIsts         N   plu      0
cello         cel-lo           "tSE-l@U           N   sing    24
cellos        cel-los          "tSE-l@Uz          N   plu      1
cellular      cel-lu-lar       "sEl-jU-l@r*       A   pos     21
celluloid     cel-lu-loid      "sEl-jU-lOId       N   sing    29
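
In these two samples, each lemma frequency equals the sum of the
frequencies of its wordforms. The Python sketch below checks this for
the rows shown. The frequencies are copied from the two tables; the
pairing of each wordform with its lemma is added by hand for
illustration. The check covers this sample only and is not a statement
about how the counts were compiled for the database as a whole.

```python
from collections import defaultdict

# (wordform, lemma, frequency) as listed in the wordform table.
wordforms = [
    ("celebrant", "celebrant", 2), ("celebrants", "celebrant", 4),
    ("celebration", "celebration", 144), ("celebrations", "celebration", 57),
    ("cell", "cell", 655), ("cells", "cell", 555),
    ("cellar", "cellar", 187), ("cellars", "cellar", 41),
    ("cellarage", "cellarage", 0), ("cellarages", "cellarage", 0),
    ("cellist", "cellist", 5), ("cellists", "cellist", 0),
    ("cello", "cello", 24), ("cellos", "cello", 1),
    ("cellular", "cellular", 21),
    ("celluloid", "celluloid", 29),
]

# Lemma frequencies as listed in the lemma table.
lemma_freq = {
    "celebrant": 6, "celebration": 201, "cell": 1210, "cellar": 228,
    "cellarage": 0, "cellist": 5, "cello": 25, "cellular": 21,
    "celluloid": 29,
}

# Sum the wordform frequencies per lemma.
summed = defaultdict(int)
for _, lemma, freq in wordforms:
    summed[lemma] += freq

# Collect any lemma whose listed count differs from the summed count.
mismatches = {
    lemma: (summed[lemma], freq)
    for lemma, freq in lemma_freq.items()
    if summed[lemma] != freq
}
print(mismatches)  # {}
```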

With best regards,

Richard Piepenbrock
CELEX Project Manager

-- C E L E X --
Max-Planck-Institut fuer Psycholinguistik
The Netherlands
Tel: (+31) (0)80 - 615797
Fax: (+31) (0)80 - 521213
EARN/BITNET: celex@hnympi51
Internet: celex@mpi.nl
SURFNET: celex::celexmail
JANET: celex%hnympi51@uk.ac.earn-relay