4.1113 Linguistic Diacritics; Unicode v. 10646 (2/94)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Sat, 2 Mar 91 22:04:25 EST

Humanist Discussion Group, Vol. 4, No. 1113. Saturday, 2 Mar 1991.

(1) Date: Sat, 02 Mar 91 11:09:22 EST (60 lines)
From: Henry Rogers <ROGERS@vm.epas.utoronto.ca>
Subject: [Symbols and Diacritics for Linguists]

(2) Date: Fri, 1 Mar 91 19:03:50 PST (34 lines)
From: tut@Eng.Sun.COM (Bill "Bill" Tuthill)
Subject: Unicode and DIS 10646

(1) --------------------------------------------------------------------
Date: Sat, 02 Mar 91 11:09:22 EST
From: Henry Rogers <ROGERS@vm.epas.utoronto.ca>
Subject: [Symbols and Diacritics for Linguists]

Much of the recent discussion on symbols and diacritics as a single
unit versus two units has focussed on languages with written
traditions. Users of computers in those languages naturally want the
coding to correspond as closely as possible to the way people
conceptualise writing in those languages.

In linguistics and phonetics, we also have a written tradition
(Pullum and Ladusaw, 1986), and we also would like computers to
allow us to write symbols easily in the way we conceptualise them.
Our specific problem is that we have more symbols and more
diacritics and more combinations than most other users. Linguists
invariably think of phonetic symbols as consisting of the basic
symbol plus any added diacritics.

Non-floating diacritics present no problem. These symbols, such as
a colon for length, or a small superscript w for lip rounding, are
written to the right of a symbol; they occupy space and are treated
just as any basic symbol.

Floating diacritics, however, are more troublesome. The human
vocal apparatus has several aspects which can be independently
controlled: the place of articulation, the manner of articulation, the
activity of the vocal cords, and the passage of air through the nasal
passage. The value of any of these is at times transcribed with a
diacritic. As a result, linguists find it completely reasonable to have
several floating diacritics 'stacked' on the same basic symbol.

For example, a voiceless dental nasal trill is quite easily produced
and would be transcribed as an r with three diacritics: a superscript
tilde, a subscript bottomless box, and a subscript ring below the box.
Or, many linguists (writing in the North American orthographic
dialect) would transcribe a high front rounded nasal vowel with
falling tone as a u with three superscript diacritics -- in ascending
order, an umlaut, a tilde, and a circumflex. The IPA dialect
substitutes y for the u umlaut.
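
As a rough present-day illustration of the "basic symbol plus
diacritics" model, the sketch below builds both transcriptions as a
base letter followed by combining marks. The code points come from
current Unicode, not from the 1991 drafts under discussion; reading the
"bottomless box" as COMBINING BRIDGE BELOW is an assumption, and the
helper name `stack` is hypothetical.

```python
import unicodedata

# Assumed modern Unicode combining marks (not part of the 1991 drafts).
TILDE_ABOVE = "\u0303"   # COMBINING TILDE (nasal)
BRIDGE_BELOW = "\u032A"  # COMBINING BRIDGE BELOW (dental; the "bottomless box")
RING_BELOW = "\u0325"    # COMBINING RING BELOW (voiceless)
DIAERESIS = "\u0308"     # COMBINING DIAERESIS (umlaut)
CIRCUMFLEX = "\u0302"    # COMBINING CIRCUMFLEX ACCENT (falling tone)


def stack(base, *marks):
    """Return the base symbol followed by its floating diacritics, in order."""
    return base + "".join(marks)


# Voiceless dental nasal trill: r + tilde above, box below, ring below the box.
trill = stack("r", TILDE_ABOVE, BRIDGE_BELOW, RING_BELOW)

# High front rounded nasal vowel with falling tone (North American style):
# u + umlaut, tilde, circumflex, in ascending order.
vowel = stack("u", DIAERESIS, TILDE_ABOVE, CIRCUMFLEX)

for symbol in (trill, vowel):
    # Each transcription is one base letter plus three separate marks.
    names = [unicodedata.name(ch) for ch in symbol]
    print(len(symbol), names)
```

Each transcription is stored as four units, one base and three marks,
which matches the way a linguist would describe the symbol.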

The ordering of diacritics cannot be uniquely fixed. A retroflex
sound which allophonically has creaky voice would place a subscript
tilde (for creaky) below a subscript dot (for retroflex). On the other
hand, a sound with creaky voice which is allophonically retroflexed
would put the dot below the tilde. I have never seen the principle
underlying this convention set out in print, but it is clearly followed
by most linguists.
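
The point about ordering can be sketched with present-day Unicode
combining marks, on the assumption that the subscript dot and subscript
tilde correspond to COMBINING DOT BELOW and COMBINING TILDE BELOW. The
base letter "t" is only a placeholder. Because both marks share a
canonical combining class, canonical decomposition keeps the order in
which they were entered, so the two transcriptions stay distinct.

```python
import unicodedata

DOT_BELOW = "\u0323"    # COMBINING DOT BELOW (retroflex)
TILDE_BELOW = "\u0330"  # COMBINING TILDE BELOW (creaky voice)

# Marks are listed from the base outward.
# Retroflex sound, allophonically creaky: tilde written below the dot.
retroflex_first = "t" + DOT_BELOW + TILDE_BELOW
# Creaky sound, allophonically retroflexed: dot written below the tilde.
creaky_first = "t" + TILDE_BELOW + DOT_BELOW

# Both marks are in the same canonical combining class, so canonical
# decomposition (NFD) leaves their relative order alone.
same_class = (unicodedata.combining(DOT_BELOW)
              == unicodedata.combining(TILDE_BELOW))
a = unicodedata.normalize("NFD", retroflex_first)
b = unicodedata.normalize("NFD", creaky_first)
print(same_class, a == b)
```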

The total number of symbols and diacritics varies from source to
source, but the 1989 IPA chart gives just over 100 basic symbols
and about 30 floating diacritics. The number of combinations ever
needed could probably be reduced by eliminating theoretically and
anatomically impossible combinations. The result is still a very,
very large number of combinations that would seem meaningful to a
linguist. Surely the sensible answer for phonetic symbols is to treat
them the way a linguist does -- as a structured combination of
meaningful units.
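
A back-of-the-envelope count shows the scale. The sketch below rounds
the figures above to 100 basic symbols and 30 floating diacritics, and
assumes (purely for illustration) that no symbol carries more than
three diacritics and that order is ignored. It gives an upper bound
only, since it does not remove impossible combinations.

```python
from math import comb

BASIC_SYMBOLS = 100        # "just over 100" in the text, rounded down
FLOATING_DIACRITICS = 30   # "about 30" in the text
MAX_STACK = 3              # assumption: at most three diacritics per symbol


def upper_bound(symbols, diacritics, max_stack):
    """Count base symbols with any unordered set of 0..max_stack diacritics."""
    sets_per_symbol = sum(comb(diacritics, k) for k in range(max_stack + 1))
    return symbols * sets_per_symbol


# Precomposed coding: one code per combination.
print(upper_bound(BASIC_SYMBOLS, FLOATING_DIACRITICS, MAX_STACK))  # 452600
# Structured coding: one code per basic symbol or diacritic.
print(BASIC_SYMBOLS + FLOATING_DIACRITICS)  # 130
```

Under these assumptions a precomposed scheme needs codes for up to
452,600 combinations, where a structured scheme needs about 130.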

Having just finished a textbook on phonetics, I am very much
aware of the difficulty of moving codes for phonetic symbols from
my computer to the publisher's computer and from there to paper. A
standard coding would probably have saved me 3-5 weeks in making

As a footnote, I was surprised that Unicode did not include the
symbols introduced in the 1989 IPA revision, nor the PRDS symbols
for disordered speech, nor Bliss symbols.

ref. -- Geoffrey K. Pullum and William A. Ladusaw, Phonetic Symbol
Guide, University of Chicago Press, 1986. [An enormously useful

Henry Rogers
Department of Linguistics
University of Toronto

(2) --------------------------------------------------------------------
Date: Fri, 1 Mar 91 19:03:50 PST
From: tut@Eng.Sun.COM (Bill "Bill" Tuthill)
Subject: Unicode and DIS 10646

> Michael Sperberg-McQueen <U35395@UICVM> writes:
> ... the issue raised by Pierre Mackay: whether Unicode and 10646
> provide guaranteed fixed-width-character data streams.

Upon closer inspection, one is forced to admit that 10646 already contains,
and indeed requires, floating diacritics for Arabic and Greek. Furthermore,
some precomposed accented characters have been omitted (four are missing
for Vietnamese).

The real issue is whether all characters are 16 bits long (as in Unicode), or
whether characters can be 8 bits, 16 bits, 24 bits, or 32 bits (as in DIS 10646).
Users don't care about this, but programmers and hardware vendors do.
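
Why programmers care can be sketched as follows. The two encodings used
here are stand-ins and are assumptions of this example: UTF-16 restricted
to 16-bit characters stands in for a fixed 16-bit stream, and UTF-8
stands in for a variable-width stream. Neither is the 1991 Unicode draft
or one of the DIS 10646 compaction forms. The function names are
hypothetical.

```python
text = "r\u0303\u032A\u0325 na\u00EFve"  # sample text, 16-bit characters only

fixed = text.encode("utf-16-be")   # two bytes per character for this text
variable = text.encode("utf-8")    # one or two bytes per character here


def nth_fixed(data, n):
    """Fetch character n from a fixed 16-bit stream by offset arithmetic."""
    return data[2 * n:2 * n + 2].decode("utf-16-be")


def nth_variable(data, n):
    """Fetch character n from a variable-width stream by decoding from the start."""
    return data.decode("utf-8")[n]


print(nth_fixed(fixed, 5), nth_variable(variable, 5))
print(len(fixed), len(variable))
```

With fixed-width characters the position of character n is known in
advance; with variable widths the stream has to be scanned to find it.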

> [Tuthill's] comments on implementation difficulties are to the point,
> though opaque.

The designers of 10646 seem to imagine a world where each country has its
own 8-bit or 16-bit national system, and interchanges data using 10646's
32-bit canonical form. This means your text (unless ASCII-only) probably
won't display or print in other countries. It also means that computer
systems must contain complex rules for figuring out what character sizes and
what compaction forms they're receiving.

Unicode, by contrast, was designed to permit the global interchange of
unambiguous data. It's much easier to implement all of Unicode than all
of 10646, since 10646 is variable-bit and contains duplicate characters
and incompatible subsets all over the place.

Hope that's clear.