4.1075 Greek and Unicode (1/104)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Sun, 24 Feb 91 21:04:32 EST

Next message: Elaine Brennan & Allen Renear: "4.1076 Rs: Outlining; Amstrad; ... (4/58)"
Previous message: Elaine Brennan & Allen Renear: "4.1074 Rs: List of Anonymous FTP Sites (2/107)"

Humanist Discussion Group, Vol. 4, No. 1075. Sunday, 24 Feb 1991.

Date: Sun, 24 Feb 91 16:20:47 -0800
From: mackay@cs.washington.edu (Pierre MacKay)
Subject: UNICODE, THE TEI, and Classically accented GREEK

One important point in favor of Unicode seems not to be quite clear
in the communication from galiard@let.rug.nl.

That communication groups Unicode with ISO 10646 as though both were
multibyte systems in the same sense. They are not. Unicode is a
16-bit system, plain and simple. It provides some compatibility with
8-bit coding, but assumes the ultimate adoption of 16-bit characters.
The "byte" in Unicode just happens to be 16 bits long, so Unicode is
not, properly speaking, a multibyte coding system at all. (It is
entirely legitimate to speak of a 16-bit byte.)

ISO 10646 is a descendant of ISO 2022. It is based on "octets" which
come in groups of variable length. That is no great problem for a
communications standard, but is a real pain for a computer coding
standard. As M. A. Padlipsky has pointed out in criticism of other
aspects of ISO negotiation, the national PTT organizations (which
understandably think in terms primarily of sequential-by-character
streams of information) seem to have an overwhelming voice in drowning
out the protests of computer types who would like to be able to use
arbitrary array addresses to access any part of a text file without
having to read it through from the beginning every time to make sure
that the array index doesn't just happen to land in the middle of a
multiple octet sequence.
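
The contrast between the two addressing models can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
UTF-16 (big-endian) stands in for the fixed 16-bit code described above,
which holds only while every character fits in 16 bits, and UTF-8 stands
in for a variable-length octet code. Neither is the 1991 draft of either
standard, and the function names are made up for this example.

```python
# Assumption: UTF-16-BE models a fixed 16-bit code (true only while every
# character fits in 16 bits); UTF-8 models a variable-length octet code.
text = "λογος and λεξις"

fixed = text.encode("utf-16-be")   # 2 octets per character in this sample
variable = text.encode("utf-8")    # Greek letters: 2 octets, ASCII: 1 octet


def char_at_fixed(buf: bytes, i: int) -> str:
    """Fetch character i by plain array addressing: offset = 2 * i."""
    return buf[2 * i:2 * i + 2].decode("utf-16-be")


def seq_len(lead: int) -> int:
    """Octet count of the UTF-8 sequence that begins with this lead octet."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    return 4


def char_at_variable(buf: bytes, i: int) -> str:
    """Fetch character i by reading forward from the start of the buffer."""
    pos = 0
    for _ in range(i):
        pos += seq_len(buf[pos])
    return buf[pos:pos + seq_len(buf[pos])].decode("utf-8")


def lands_mid_sequence(buf: bytes, offset: int) -> bool:
    """True if an arbitrary offset falls inside a multi-octet sequence."""
    return buf[offset] & 0xC0 == 0x80
```

With the fixed-width buffer, `char_at_fixed(fixed, 6)` computes its offset
directly. With the variable-width buffer, `char_at_variable(variable, 6)`
has to walk over the six preceding characters first, and an arbitrary
offset such as `variable[1]` falls in the middle of the first letter.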

What composites do or do not get included in the code table is of
diminished importance beside the basic question of whether you can
safely use simple array addressing or must read every file from the
beginning every time you wish to move around in it. Nevertheless it is
worth pointing out the following problem in the Greek set.

########################################

Note on table Greek U+0370-03FF

The only thing that is wrong with a sequence such as

alpha + IOTA SUBSCRIPT + ROUGH BREATHING MARK + GRAVE ACCENT

is that it is in the wrong order.

If Unicode is going to recognize the historic forms of Greek (the
Greek government frowns upon anything but the simple stress accent for
the modern language), then it should do so in recognition of the
fact that the ROUGH BREATHING (and likewise the SMOOTH BREATHING) is
not a diacritical in the same sense as the accents, but is the vestige
of a distinct letter H (a shape which was later borrowed for other
purposes). The proper sequence is

ROUGH BREATHING MARK + alpha + IOTA SUBSCRIPT + GRAVE ACCENT

because the sequence HAI (with *maybe* a GRAVE ACCENT over the I) is a
possible alternative, and could probably be found on some inscription
or other (actually, it would be extraordinary to find the accent, but
in a Hadrianic archaizing inscription it might just be possible).

Since the Greek government PTT is basically uninterested in
historically accented Greek, we might as well consult the interest of
the scholarly community. Here the most significant concern ought to
be the existence of a single database (the Thesaurus Linguae Graecae)
which already contains 2/3 of the entire body of Greek writing from
800 B.C. to 600 A.D., and is planned to contain the lot. The actual
code of the TLG is not at issue. It is a historical artifact from the
age of the punched card, and the less said about it the better, but it
is unambiguous, and can easily be translated into something like
Unicode. The translation will be a good deal more complicated, however,
if the ROUGH and SMOOTH breathing marks are not properly treated.
The TLG codes breathings before letters, as it ought to.
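
To make the ordering question concrete, here is a minimal sketch of the
kind of translation step at stake when the source codes the breathing
before the letter and the target expects marks after the letter. Every
part of it is an assumption for illustration: the token names are simply
the descriptive names used in this note (not actual TLG codes), the code
points are those of present-day Unicode rather than the draft under
discussion, and `translate` is a hypothetical helper.

```python
import unicodedata

# Hypothetical token names taken from the wording of this note; the code
# points are present-day Unicode combining marks (an assumption).
BASE = {"alpha": "\u03b1"}
BREATHING = {
    "ROUGH BREATHING MARK": "\u0314",
    "SMOOTH BREATHING MARK": "\u0313",
}
MARK = {"GRAVE ACCENT": "\u0300", "IOTA SUBSCRIPT": "\u0345"}


def translate(tokens):
    """Turn a breathing-first token stream into a letter-first string."""
    out = []
    pending = ""  # a breathing that is waiting for its letter
    for tok in tokens:
        if tok in BREATHING:
            pending = BREATHING[tok]
        elif tok in BASE:
            # Emit the letter first, then the breathing that preceded it.
            out.append(BASE[tok] + pending)
            pending = ""
        elif tok in MARK:
            out.append(MARK[tok])
        else:
            raise ValueError(f"unknown token: {tok!r}")
    if pending:
        raise ValueError("breathing with no following letter")
    # Normalization settles the relative order of the remaining marks.
    return unicodedata.normalize("NFC", "".join(out))


result = translate(
    ["ROUGH BREATHING MARK", "alpha", "IOTA SUBSCRIPT", "GRAVE ACCENT"]
)
```

The one piece of state, `pending`, is the extra bookkeeping that a
breathing-first source needs when the target puts every mark after its
letter. If both sides agreed on the position of the breathing, the
translation would be a plain token-by-token table lookup.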

The historical evolution is

H   H         H             H
H   H         H             H
H   H         H             H
HHHHH  --->   HHHH   --->   HHHH    (ROUGH BREATHING MARK)
H   H         H
H   H         H
H   H         H

   H             H
   H             H
   H             H
HHHH   --->   HHHH    (SMOOTH BREATHING MARK)
   H
   H
   H

All the forms of rough breathing are well attested in inscriptions.
The third form is regularly in use in texts of inscriptions, and
always precedes the affected letter, never rides above it. The smooth
breathing is an unnecessary but characteristic appeal to symmetry, but
it also shows a reader (and only a reader) who no longer pronounces
the H sound that "HO" is different from "O".

Email concerned with UnixTeX distribution software should be sent primarily
to:           elisabet@max.u.washington.edu   Elizabeth Tachikawa
otherwise to: mackay@cs.washington.edu        Pierre A. MacKay
Smail: Northwest Computing Support Center     TUG Site Coordinator for
       Thomson Hall, Mail Stop DR-10          Unix-flavored TeX
       University of Washington
       Seattle, WA 98195
       (206) 543-6259
