4.1069 Some Humanists on Unicode (4/193)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Thu, 21 Feb 91 17:15:01 EST

Humanist Discussion Group, Vol. 4, No. 1069. Thursday, 21 Feb 1991.

(1) Date: Thu, 21 Feb 91 16:14:26 EST (10 lines)
From: Allen Renear <EDITORS@BROWNVM>
Subject: This Digest

(2) Date: Tue, 19 Feb 91 11:22:51 +0100 (72 lines)
From: Timothy.Reuter@MGH.BADW-MUENCHEN.DBP.DE
Subject: Unicode

(3) Date: Tue, 19 Feb 91 08:38:29 EST (38 lines)
From: "Robert A. Amsler" <amsler@STARBASE.MITRE.ORG>
Subject: Presentation vs. Descriptive CHARACTER Markup

(4) Date: Tue, 19 Feb 91 16:15:56 CST (73 lines)
From: "Robin C. Cover" <ZRCC1001@SMUVM1.BITNET>
Subject: CHAR ENCODING AND TEXT PROCESSING

(1) --------------------------------------------------------------13----
Date: Thu, 21 Feb 91 16:14:26 EST
From: Allen Renear <EDITORS@BROWNVM>
Subject: This Digest

The following postings are from a discussion of Unicode on TEI-L (UICVM),
a list for the discussion of the Text Encoding Initiative. The three
authors are long-time Humanist members.

I believe the issues being raised in these character set discussions
are very important for humanities computing. A response to these criticisms
of Unicode will follow in the next digest.

-- Allen

(2) --------------------------------------------------------------------
Date: Tue, 19 Feb 91 11:22:51 +0100
From: Timothy.Reuter@MGH.BADW-MUENCHEN.DBP.DE
Subject: Unicode

I think it's important to think about the general aspects as well as about
whether Unicode does or does not have the variant form for this or that
letter in this or that writing system. Some general points occur to me:

a. Pace Harry Gaylord, Unicode seems to me to be biased towards display
rather than other forms of data processing. Some semantic distinctions are
observed, but roughly speaking, if things look substantially different,
even though semantically substantially the same, they get different codes
(e.g. medial and final sigma in Greek - or alphabets of Roman numerals or
letters in circles - or the various forms of cedilla from U+0300 upwards). If on
the other hand they look substantially the same, even though semantically
different, they may well get the same code (e.g. hacek is considered to be
identical with superscript v, and the overlaps are very acute in the
mathematical symbol area). Digraphs only get in if they are in existing
standards (German ss, Dutch ij, Slav Dz), i.e. since you can display, say,
Spanish "ch" as "c" followed by "h" there is no provision for a code to
mean "ch", though this might well be helpful in non-display contexts.

b. "Unicode makes no pretense to correlate character encoding with
collation or case" and indeed it doesn't. The basic setup (for those who
haven't seen the draft) is that the high byte is used to indicate a kind of
code page, which may contain one or more alphabets/syllabaries/symbol sets,
etc. There's no attempt to use bit fields of non-byte width within the 16
bits, except in so far as sequences within existing eight-bit standards
have done this. The difference between lc and uc can be 1, 32 or 48
(possibly others as well), while runs of letters can be interrupted by
numerals and non-letters. Previous standards play a role here, but there
seems to me to be no compelling reason if you're drawing up a 16-bit code
to say that you will take over all existing standards on the basis of
eight-bit code + fixed offset! It's an opportunity to eliminate rather than
perpetuate things which in any case only originated because of restrictions
which no longer apply.
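
To make the point concrete, here is a small sketch (the offsets are taken
from the published tables as they stand today, so the details may differ
from the 1991 draft): because the lowercase/uppercase distance is not
constant, case conversion has to be table-driven rather than a single
arithmetic offset.

  # Illustrative sketch: upper/lower pairs and their offsets differ from
  # block to block, so a fixed "+32" rule cannot work across the code.
  pairs = [
      ("\u0041", "\u0061"),   # LATIN CAPITAL/SMALL A                -> offset 32
      ("\u0100", "\u0101"),   # LATIN CAPITAL/SMALL A WITH MACRON    -> offset 1
      ("\u0391", "\u03B1"),   # GREEK CAPITAL/SMALL ALPHA            -> offset 32
      ("\u0401", "\u0451"),   # CYRILLIC CAPITAL/SMALL IO            -> offset 80
  ]

  for upper, lower in pairs:
      print(f"U+{ord(upper):04X} -> U+{ord(lower):04X}  offset {ord(lower) - ord(upper)}")

  # A portable lowercasing routine is therefore a lookup, not arithmetic.
  LOWER = {ord(u): ord(l) for u, l in pairs}

  def to_lower(ch):
      return chr(LOWER.get(ord(ch), ord(ch)))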

c. Diacritics are trailing "non-spacing" separate characters (actually
they're backspacing). Diacritics modifying two letters follow the second
one. The point has already been made that you can't really do it any other
way (though in a 32-bit code you could probably do it with bit-fields).
However, trailing diacritics seem to me undesirable, because you have to
"maintain state" (something the Unicode people claim to eliminate) in any
programming you do. If you're reading a file or a string sequentially you
can't even send a character to the printer or the screen until you have
checked the one after it to make sure it's not a trailing diacritic! For
the user, the order of storage is irrelevant; for the programmer, preceding
diacritics are much easier to handle in most contexts. The slavish take-
over of existing eight-bit standards means that many diacritics are also
codable as "static" single characters - as has been pointed out, this leads
to potential ambiguities.
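
A minimal sketch of the state being complained about (the COMBINING set and
the emit routine below are placeholders, not anything from the standard): a
sequential reader has to hold each base character back until it has seen
whether a trailing mark follows.

  # Illustrative sketch: diacritics *follow* their base character, so the
  # reader must hold the base back (i.e. maintain state) until it knows
  # no further combining mark is coming.
  COMBINING = {0x0300, 0x0301, 0x0327}     # placeholder set: grave, acute, cedilla

  def emit(cluster):
      """Stand-in for sending one base-plus-marks group to screen or printer."""
      print("render:", [hex(cp) for cp in cluster])

  def render_stream(codepoints):
      pending = []                         # the state that has to be maintained
      for cp in codepoints:
          if cp in COMBINING and pending:
              pending.append(cp)           # attach the mark to the held-back base
          else:
              if pending:
                  emit(pending)            # only now is the previous group complete
              pending = [cp]
      if pending:
          emit(pending)

  # "e" + combining acute, "c" + combining cedilla, then a bare "a"
  render_stream([0x0065, 0x0301, 0x0063, 0x0327, 0x0061])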

Diacritics apart, there seem to be conflicts of interest between different
applications, which *necessarily* lead to ambiguities or
difficulties for someone. Take the "s" problem. Harry Gaylord says he needs
long s as a code of its own; Unicode itself distinguishes between Greek
medial and final sigma, and between "s" + "s" and German "eszett", on the
basis of existing standards. Any text containing these coding distinctions
can be displayed more easily and more faithfully to its original than it
can without them (though I would have thought there was no serious problem
about identifying final sigma and acting accordingly). But other kinds of
analysis become *more* difficult if such coding is used:
regular expressions involving "s" are much more difficult to construct, as
are collating and comparison sequences. This is an area where SGML-style
entities are positively advantageous, simply because they announce their
presence: if long s is always coded as &slong; in a base text, different
applications can be fed with different translations. Precisely because
Unicode puts so much emphasis on how things look rather than what they
mean, it won't eliminate the need for such "kludges", as someone on
HUMANIST thought it would.
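
For instance (a sketch only: the entity name &slong; is the one used above,
the rest is invented for illustration), a base text can be expanded one way
for searching and collation and another way for display:

  # Illustrative sketch: one base text, two application-specific expansions
  # of the SGML-style entity &slong; (long s).
  import re

  BASE_TEXT = "Congre&slong;s of the United States"

  FOR_SEARCHING = {"slong": "s"}           # fold to plain s for regexes and collation
  FOR_DISPLAY   = {"slong": "\u017F"}      # keep the distinct long-s form (U+017F)

  def expand(text, table):
      """Replace &name; entities according to an application-specific table."""
      return re.sub(r"&(\w+);", lambda m: table.get(m.group(1), m.group(0)), text)

  print(expand(BASE_TEXT, FOR_SEARCHING))  # Congress of the United States
  print(expand(BASE_TEXT, FOR_DISPLAY))    # Congreſs of the United States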

Timothy Reuter, Monumenta Germaniae Historica, Munich

(3) --------------------------------------------------------------40----
Date: Tue, 19 Feb 91 08:38:29 EST
From: "Robert A. Amsler" <amsler@STARBASE.MITRE.ORG>
Subject: Presentation vs. Descriptive CHARACTER Markup

Timothy Reuter's note on UNICODE suggests that we ought to be careful
that the same guidelines that have led the TEI to select descriptive
markup for text not be abandoned when we get to characters.

The TEI's concern should first and foremost be whether a character
representation represents the meaning of the characters to the authors,
and not their presentation format. This also means that how the
representation is achieved is rather irrelevant to whether or not the
markup captures the meaning of the character.

It seems worth noting that there may be a need for two standards for
characters: one to represent their meaning, the other to represent their
print images. The print image representation has a LOT of things to take
into account, and may in fact only be possible in some form such as the
famous "Hershey fonts" released long ago by the US National Bureau of
Standards. That is, the print image of characters and symbols may have to
be accompanied by
representations as bit maps or equations as to how to draw the
characters within a specified rectangular block of space.
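
A toy sketch of what such a print-image record might look like (the
coordinates and the box are invented, and this is only in the spirit of
the Hershey data, not a copy of it):

  # Illustrative sketch: a glyph described as pen strokes inside a fixed
  # rectangular box, roughly in the spirit of the Hershey fonts. The
  # coordinates below are invented for the example.
  GLYPH_BOX = (0, 0, 8, 12)       # left, bottom, right, top of the design box

  # A crude capital "A": two diagonals and a crossbar.
  GLYPH_A = [
      [(0, 0), (4, 12)],          # left diagonal
      [(4, 12), (8, 0)],          # right diagonal
      [(2, 5), (6, 5)],           # crossbar
  ]

  def to_moves(strokes):
      """Flatten strokes into pen-up / pen-down moves for a plotter-like device."""
      moves = []
      for stroke in strokes:
          moves.append(("up", stroke[0]))              # travel to start of stroke
          moves.extend(("down", point) for point in stroke[1:])
      return moves

  for move in to_moves(GLYPH_A):
      print(move)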

Within the descriptive markup, there clearly are enough problems
to solve without adding the burden of achieving consistent print
representations on all display devices. For example, one descriptive
issue is that of whether the representation is adequate for
spoken or only written forms of the language.

While the TEI has addressed the concerns of researchers in linguistics
dealing with speech, there is also a need to address the concerns of
ordinary text users who must represent spoken-language information in
printed form.
Some of this is a bit arcane, such as how to represent text dialogues
to be spoken with a foreign accent, but representing EMPHASIS is a continual
issue and emphasis can descend to the characteristics of individual
letters.

(4) --------------------------------------------------------------75----
Date: Tue, 19 Feb 91 16:15:56 CST
From: "Robin C. Cover" <ZRCC1001@SMUVM1.BITNET>
Subject: CHAR ENCODING AND TEXT PROCESSING

Apropos of recent comments by Timothy Reuter and Robert A. Amsler on the
relationship between character encodings and (optimized) text processing,
two notes:

(1) Timothy writes that "Unicode seems to me to be biased towards display
rather than other forms of data processing." We note that UNICODE indeed
does contain algorithms for formatting right-to-left text and bi-directional
text, but (as far as I know) it has no general support for indicating the
language in which a text occurs.
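
A small sketch of what that missing support amounts to (the Run record and
the language tags here are hypothetical, not part of Unicode): language
identity has to travel alongside the character data, since the code points
alone do not carry it.

  # Illustrative sketch: the same letters can belong to different languages,
  # so the language tag must be supplied out of band.
  from typing import NamedTuple

  class Run(NamedTuple):
      lang: str                   # hypothetical language tag carried with the text
      text: str

  document = [
      Run("en", "The chat about the "),
      Run("fr", "chat"),          # same code points, different language
      Run("en", " continued."),
  ]

  def runs_in(document, lang):
      """Select the runs in one language, e.g. to sort or hyphenate them correctly."""
      return [run.text for run in document if run.lang == lang]

  print(runs_in(document, "fr"))  # ['chat']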

(2) On the matter of separating "form and function" (various two-level
distinctions germane to character encoding and writing systems: character
and graph; graph and image; language and script; writing system and script),
the following article by Gary Simons may be of interest. (I do not know if
it represents his current thinking in every detail.)

Gary F. Simons, "The Computational Complexity of Writing Systems."
Pp. 538-553 in _The Fifteenth LACUS Forum 1988_ (edited by Ruth M. Brend and
David G. Lockwood). Lake Bluff, IL: Linguistic Association of Canada and the
United States, 1989.

<abstract>In this article the author argues that computer systems, like
their users, need to be multilingual. ``We need computers, operating
systems, and programs that can potentially work in any language and can
simultaneously work with many languages at the same time." The article
proposes a conceptual framework for achieving this goal.

Section 1, ``Establishing the baseline," focuses on the problem of graphic
rendering and illustrates the range of phenomena which an adequate solution
to computational rendering of writing systems must account for. These
include phenomena like nonsequential rendering, movable diacritics,
positional variants, ligatures, conjuncts, and kerning.

Section 2, ``A general solution to the complexities of character rendering,"
proposes a general solution to the rendering of scripts that can be printed
typographically. (The computational rendering of calligraphic scripts adds
further complexities which are not addressed.) The author first argues that
the proper modeling of writing systems requires a two-level system in which
a functional level is distinguished from a formal level. The functional
level is the domain of characters (which represent the underlying
information units of the writing system). The formal level is the domain of
graphs (which represent the distinct graphic signs which appear on the
surface). The claim is then made that all the phenomena described in section
1 can be handled by mapping from characters to graphs via finite-state
transducers -- simple machines guaranteed to produce results in linear time.
A brief example using the Greek writing system is given.

Section 3, ``Toward a conceptual model for multilingual computing," goes
beyond graphic rendering to consider the requirements of a system that would
adequately deal with other language-specific issues like keyboarding,
sorting, transliteration, hyphenation, and the like. The author observes
that every piece of textual data stored in a computer is expressed in a
particular language, and it is the identity of that language which
determines how the data should be rendered, keyboarded, sorted, and so on.
He thus argues that a rendering-centered approach which simply develops a
universal character set for all languages will not solve the problem of
multilingual computing. Using examples from the world's languages, he goes
on to define language, script, and writing system as distinct concepts and
argues that a complete system for multilingual computing must model all
three.</abstract>

<note>Availability: Offprints of this article are available from the author
at the following Internet address: gary@txsil.lonestar.org. The volume is
available from LACUS, P.O. Box 101, Lake Bluff, IL 60044.</note>
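
As a footnote to section 2 of the abstract, here is a minimal sketch (mine,
not Simons's) of the character-to-graph idea for one Greek phenomenon: the
functional character sigma is rendered as a medial or a final graph
depending on what follows, in a single left-to-right pass with one unit of
delayed state.

  # Illustrative sketch (not Gary Simons's code): a tiny character-to-graph
  # transducer. The functional character SIGMA becomes a medial or a final
  # graph depending on the following symbol; one unit of state suffices and
  # the pass is linear in the length of the input.
  SIGMA = "\u03C3"                         # functional-level sigma
  MEDIAL, FINAL = "\u03C3", "\u03C2"       # surface graphs: σ and ς

  def characters_to_graphs(chars):
      graphs = []
      sigma_pending = False                # the transducer's only state
      for ch in chars:
          if sigma_pending:
              graphs.append(MEDIAL if ch.isalpha() else FINAL)
              sigma_pending = False
          if ch == SIGMA:
              sigma_pending = True         # defer the choice of graph by one symbol
          else:
              graphs.append(ch)
      if sigma_pending:
          graphs.append(FINAL)             # sigma at end of text is word-final
      return "".join(graphs)

  print(characters_to_graphs("\u03BB\u03BF\u03B3\u03BF\u03C3"))  # λογος (final form chosen)
  print(characters_to_graphs("\u03C3\u03BF\u03C6\u03B9\u03B1"))  # σοφια (medial form kept)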

Robin Cover
BITNET: zrcc1001@smuvm1
INTERNET: robin@ling.uta.edu
INTERNET: robin@txsil.lonestar.org