4.1050 Unicode for Hebrew (1/204)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Mon, 18 Feb 91 17:22:30 EST

Humanist Discussion Group, Vol. 4, No. 1050. Monday, 18 Feb 1991.

Date: Sun, 10 Feb 91 22:44:59 EST
From: Robin Cover <robin@TXSIL.LONESTAR.ORG>
Subject: UNICODE for Hebrew

[This is a much longer version of Robin's Humanist posting. --ahr]

A recent bit of expert testimony from "Jony Rosenne" of Tel Aviv
<rosenne@telvm1.iinus1.ibm.com> to the UNICODE Technical Committee on matters
of Hebrew encoding prompts me to revise a posting I submitted to the HUMANIST
forum, and to ask for your consideration. I should not over-dramatize what is
or may be at stake, but I feel the issue of central concern (e.g., unique
coding for Hebrew/Aramaic SIN and SHIN) is of direct theoretical relevance to
biblical studies computing. I apologize for the length of this note, but the
matter is not easily condensed.

UNICODE is one of two new competing standards for the multi-byte encoding of
multilingual texts. UNICODE is finalizing the draft specification for its
16-bit fixed-width character encoding scheme, and will close the comment
period on February 15th. This emerging standard is backed by IBM, Microsoft,
Apple, Metaphor, NeXT, Sun Microsystems, Xerox, The Research Libraries Group,
Claris and other powerful commercial groups -- so the consequences cannot be
taken lightly. ECMA (European Computer Manufacturers Association) has thrown
its weight solidly behind the ISO 10646 group (the competing multi-byte
standard [variable-width multi-octet encoding]) in opposing UNICODE. I do not
know what the Unicode consortium will do about the ECMA/ISO opposition, but it
seems prudent that humanities scholars address both groups with their
concerns.

The UNICODE draft currently contains 84 16-bit characters for Hebrew (a
"code point" being one 16-bit fixed-width character):

* 31 code points for "cantillation marks and accents"
* 20 code points for "points and punctuation"
* 27 code points for consonants (based upon ISO 8859/8) - similar to
standard Hebrew keyboards = 22 chars + 5 final forms
* 3 code points for Yiddish digraphs (double-vav; vav-yod; double-yod)
* 2 code points for "additional punctuation" (geresh; gershayim)
* 1 code point for Ladino/Judezmo (point VARIKA)

The draft thus does encompass possibilities for standard encoding of fully
pointed texts at a low level. As I have reviewed the proposed encoding, I see
five areas of interest and hence as many possible candidates for comment by
biblical scholars:

(1) the absence of unique codes for Hebrew/Aramaic SIN and SHIN
(2) lack of a distinct code for furtive patah. Jony R. says: "The Standards
Institute of Israel had decided it is not a character... Most often Patah
and Patah Furtive are indistinguishable. Some printers shift the Patah
Furtive slightly to the right. Since the rules to distinguish the Furtive
are simple and straightforward, i.e. this is a straightforward case of
rendering, it was decided that a special character is not needed."
(3) less problematic in my view, but not entirely felicitous, is that the
"dot" for daghesh (05BC) is also used for mappiq and for the "dot" in
shureq
(4) the asymmetrical encodings for certain "accentus communes" and "accentus
poetici" (distinct sinnorit is lacking); Jony Rosenne now also recommends
omitting the UNICODES currently assigned for tarha, azla, galgal,
and yored on the grounds that for the five pairs (Tarha - Tippeha;
Zinorit - Zarqa; Azla - Qadma; Galgal - Yerah ben yomo; Yored - Merkha)
only the latter are necessary
(5) the absence of a full set of UNICODEs for transliteration of Semitic
(viz, the original 29 consonants); I think UNICODE currently has provision
for 25 of 29 characters, using the most common transliteration schemes

Items (2) - (5) are of relatively lesser importance in my judgment, so I will
not comment further on these latter four issues. Others may wish to consult
the UNICODE manual on these points.

DISTINCT SIN/SHIN (absence of unique UNICODEs)

The UNICODE draft standard follows the ISO 8859/8 standard in assigning 27
code points for the Hebrew "consonants" (let's ignore the problem of matres
and mixed orthographic systems). The character "SHIN" (code 05E9) thus serves
to represent both SIN and SHIN. Two code points are assigned to the "dots"
for SHIN (05C1) and SIN (05C2), so that one could compose distinct SIN and
SHIN as double-width characters, viz, <05E9>+<05C2> and <05E9>+<05C1>, on the
interpretation that the "dots" are diacritics.
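As a minimal sketch of what this composition looks like to a program, the
following Python fragment builds the two sequences from the code points named
above. It assumes that the character database shipped with present-day Python
assigns the same values as the draft (05E9, 05C1, 05C2); the function name
compose() is invented here for illustration only.

```python
import unicodedata

SHIN = "\u05E9"      # undifferentiated base letter (code 05E9 in the draft)
SHIN_DOT = "\u05C1"  # the "dot" for SHIN (upper right)
SIN_DOT = "\u05C2"   # the "dot" for SIN (upper left)


def compose(base: str, dot: str) -> str:
    """Return the two-code sequence <base>+<dot> (hypothetical helper)."""
    # A nonzero combining class means the database treats the dot as a
    # diacritic that attaches to the preceding base letter.
    if unicodedata.combining(dot) == 0:
        raise ValueError("second argument is not a combining mark")
    return base + dot


dotted_shin = compose(SHIN, SHIN_DOT)  # <05E9>+<05C1>
dotted_sin = compose(SHIN, SIN_DOT)    # <05E9>+<05C2>

for label, seq in (("SHIN", dotted_shin), ("SIN", dotted_sin)):
    codes = " ".join(f"{ord(ch):04X}" for ch in seq)
    print(label, "uses", len(seq), "codes:", codes)
```

Each dotted letter therefore occupies two 16-bit codes, which is the
"double-width" cost at issue in what follows.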

We could argue about the precise orthographic stratum to which "dotted" SHIN
and SIN belong (relative to vowel points, accents, cantillation and
punctuation marks), but the results would probably not be determinative for
this discussion. On the surface, the ISO 8859 position appears to have strong
credibility, at least in handbook tradition available to the UNICODE
consortium and its standards-makers: (a) "the earliest Hebrew writing did not
distinguish SIN and SHIN;" (b) "modern and medieval Hebrew do not distinguish
SIN and SHIN;" (c) "the dots are clearly 'diacritics' like European accented
(acute, grave, umlaut) characters;" (d) "only a few obsolete writing
traditions used such diacritics."

This is substantially what Jony Rosenne recommended to the UNICODE committee
(on her apparent authority as a "standards committee" member):

> 3. Sin/Shin Issue - what is this?
>
> Quote from the Israel Comments to 2nd DP 10646:
>
> 1. Shin and Sin. These two letters are identical in
> common use, i.e. in unpointed texts ( ). The common
> letter is called Shin. When pointed, they are nearly
> always distinguished by a diacritical mark, a dot, on
> the upper right for Shin and on the upper left for
> Sin.
>
> In the past lexicographers had considered them
> separate letters, but modern dictionaries treat them
> as one. In the existing computer standards, which
> relate to unpointed texts, they are one letter.
> Conclusions:
>
> In DP 10646, Shin and Sin should be considered a
> single graphic character ( ).
>
> The dots should be included in the set of points.
>
> Shin with Shin Dot and Sin with Sin Dot, i.e. the
> combinations of the letter Shin ( ) and the two dots
> -- Shin Dot and Sin Dot, are ligatures, as are the
> combinations of Shin with Dagesh, Shin with Dagesh and
> Shin Dot and Sin with Dagesh and Sin Dot.


Jony's point of view is adequate for modern Israeli Hebrew, presumably, and
for text corpora (ancient epigraphic, unpointed rabbinic) in which SIN/SHIN
are not (to be) distinguished. Her viewpoint is also adequate for screen and
paper-print rendering -- both are indifferent to the linguistic issues of
concern to humanities scholars.

My evaluation is that awarding code points to the *dots* for SIN and SHIN is a
waste of code space (why not promote these two "dots" to full characters,
since the dots are not useful elsewhere in the repertoire, as defined?). What
humanities scholars want for text-processing is distinct SIN and SHIN as
single-character UNICODEs. Arguments: (1) SIN and SHIN *are* historically
distinct consonants, in Hebrew and in proto-Semitic; (2) the Phoenician writing
system IS underspecified (degenerate) for Hebrew, because there had (probably)
been phonemic collapse of SIN/SHIN in Phoenician before the time the Hebrews
adopted this alphabet; (3) a significant body of Hebrew/Aramaic literature does
distinguish SIN and SHIN, as do transliteration schemes; (4) an unnecessary
and unfortunate performance penalty will be paid for the artificial
requirement that two "odd" characters in the consonant class be encoded as
double-width characters.
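To make argument (4) concrete, here is a rough Python sketch of the extra
work a text-processing program must do under the draft scheme: to decide
whether a given 05E9 is SIN or SHIN it has to look ahead across whatever
points follow the letter, since a daghesh (05BC) or a vowel point may stand
between the letter and its dot. The function name classify_shins() and the
labels it returns are invented for illustration, and the test for "is this a
point?" relies on the character database in present-day Python, not on the
draft itself.

```python
import unicodedata

SHIN = "\u05E9"      # undifferentiated base letter
SHIN_DOT = "\u05C1"  # marks SHIN
SIN_DOT = "\u05C2"   # marks SIN


def classify_shins(text):
    """Return a list of (index, kind) pairs, one per 05E9 in text.

    kind is "SHIN", "SIN", or "UNDOTTED" (hypothetical labels).
    """
    found = []
    for i, ch in enumerate(text):
        if ch != SHIN:
            continue
        kind = "UNDOTTED"
        j = i + 1
        # Scan every combining mark attached to this base letter.
        while j < len(text) and unicodedata.combining(text[j]) != 0:
            if text[j] == SHIN_DOT:
                kind = "SHIN"
            elif text[j] == SIN_DOT:
                kind = "SIN"
            j += 1
        found.append((i, kind))
    return found


# yod, hiriq, shin + sin dot, sheva, resh, qamats, alef, tsere, lamed
word = "\u05D9\u05B4\u05E9\u05C2\u05B0\u05E8\u05B8\u05D0\u05B5\u05DC"
print(classify_shins(word))
```

With distinct single codes for SIN and SHIN, the same question would be
answered by comparing one code, with no look-ahead.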

The issue is one of efficiency and optimization for writing programs and
running them on microcomputers -- though we could argue that historical
linguistics (not the degenerate writing system) favors the "masoretic"
situation. One major UNICODE proponent (Joe Becker) has assured us that the
goal of the UNICODE consortium is to have good generalized solutions for
dealing with multi-code glyphs -- efficient software such that implementors
and users don't care how glyphs are defined internally. But there is
reasonable doubt about how well these goals will be met, especially for
hardware affordable to humanities scholars. I am not a programmer, so perhaps
I overestimate the negative consequences of current UNICODE for implementors
in having to work around this problem, and the performance penalty. Opinions
from experts? Programmers already have to deal with five cases of
overspecification at applications level (the five final forms), so in
principle having three characters for SHIN/SIN, SHIN and SIN would be
consistent and economical.
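The following Python sketch illustrates the kind of applications-level
handling the five final forms already require: folding each final form onto
its ordinary letter before searching or sorting. The code values are those of
present-day Unicode, which I assume match the ISO 8859/8-based layout of the
draft; the names FINAL_TO_BASE and fold_finals() are invented here.

```python
# Map each final form to its ordinary (non-final) letter.
FINAL_TO_BASE = {
    "\u05DA": "\u05DB",  # final kaf   -> kaf
    "\u05DD": "\u05DE",  # final mem   -> mem
    "\u05DF": "\u05E0",  # final nun   -> nun
    "\u05E3": "\u05E4",  # final pe    -> pe
    "\u05E5": "\u05E6",  # final tsadi -> tsadi
}

_FOLD_TABLE = str.maketrans(FINAL_TO_BASE)


def fold_finals(text: str) -> str:
    """Replace every final form with its ordinary letter (hypothetical helper)."""
    return text.translate(_FOLD_TABLE)


# mem, lamed, final kaf becomes mem, lamed, kaf
folded = fold_finals("\u05DE\u05DC\u05DA")
print(" ".join(f"{ord(ch):04X}" for ch in folded))
```

A program that already carries a table like this would need only one more
small table to relate distinct SIN and SHIN codes to undifferentiated 05E9.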

Beyond this - do you trust IBM, Apple and Microsoft? They have an abysmally
poor track record when it comes to supporting humanities computing: the
linguistic basis of writing (and computing) has never had strong
representation in formulating computing standards, because academia does not
represent a significant commercial interest. Typically, neither operating
systems nor commercial applications have supported any conception of the fact
that "text" is "in" a given language, with all this implies for computing
(direction of writing, hyphenation rules, keyboarding conventions, screen
rendering, sort/collation sequences, bibliographic case-conversion rules,
embedded-quotation (-mark) rules, spell checking, online thesaurus, fonts,
scripts, line-wrap rules, etc).

Nothing seems lost or compromised in the UNICODE scheme if undifferentiated
(viz, "undotted") "HEBREW LETTER SHIN" is left at code point 05E9, for modern
and epigraphic Hebrew, and two of the unassigned slots (05EB - 05EF) are used
for linguistically and/or orthographically distinct SHIN and SIN. I assume
that the SIN/SHIN "dots" could then be surrendered as code positions and made
available for something else. This would seem a small concession to
humanities scholars considering that five slots are given to final Hebrew
forms -- an overspecification that could be handled in applications programs.
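A sketch of how such a scheme might relate to the draft's letter-plus-dot
sequences is given below in Python. The choice of 05EB for SHIN and 05EC for
SIN is purely hypothetical (two of the slots mentioned above, picked for
illustration; they are not assigned to these letters in any standard), as are
the function names. Only the case of an optional daghesh between letter and
dot is handled.

```python
import re

SHIN_BASE = "\u05E9"  # undifferentiated letter, kept at 05E9
DAGHESH = "\u05BC"
SHIN_DOT = "\u05C1"
SIN_DOT = "\u05C2"

# HYPOTHETICAL single-code letters, for illustration only.
HYP_SHIN = "\u05EB"
HYP_SIN = "\u05EC"

_DOTTED = re.compile(
    SHIN_BASE + "(" + DAGHESH + "?)([" + SHIN_DOT + SIN_DOT + "])"
)


def precompose(text: str) -> str:
    """Replace <05E9>(+daghesh)+dot with one hypothetical letter code."""
    def repl(match):
        letter = HYP_SHIN if match.group(2) == SHIN_DOT else HYP_SIN
        return letter + match.group(1)  # keep the daghesh, if present
    return _DOTTED.sub(repl, text)


def decompose(text: str) -> str:
    """Expand the hypothetical letters back to letter-plus-dot sequences."""
    return (text.replace(HYP_SHIN, SHIN_BASE + SHIN_DOT)
                .replace(HYP_SIN, SHIN_BASE + SIN_DOT))


sample = SHIN_BASE + SIN_DOT
print(len(sample), "codes before,", len(precompose(sample)), "after")
```

Undotted 05E9 passes through precompose() unchanged, so unpointed modern and
epigraphic texts would be unaffected.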

It is not clear to me that biblical scholarship would have the sympathy of
Yaacov Choueka (and others) who already have large corpora of encoded Hebrew
texts. For example, I have not seen the encoding used in the GJD (Global
Jewish Database - 350 megabyte Responsa corpus), but I assume that almost all
texts are unpointed, and that the parsing engines disambiguate SIN/SHIN as one
of the "simplest" parsing tasks. We should clarify that the code point for
ambiguous/undifferentiated SIN/SHIN (05E9) taken over from ISO 8859 would be
retained in UNICODE for any who wish to use it - and parse it in applications.

Time is short, but those interested in the UNICODE draft should contact Asmus
Freytag at Microsoft. Ask for the UNICODE "Draft Standard - Final Review
Document." The comment period is to officially close Feb. 15th, and a
technical committee meeting for review of comments is scheduled for February
28th (as I recall). If biblical scholars say nothing, we will have only
ourselves to blame for the results.

Tel: (1 206) 882-8080
FAX: (1 206) 883-8101
Email: microsoft!asmusf@uunet.uu.net
Postal: Unicode Final Review; c/o Asmus Freytag; Bldg 2/Floor 2; Microsoft
Corporation; One Microsoft Way; Redmond, WA 98052-6399 USA.

Matters pertaining to both UNICODE and ISO 10646 are discussed on the forum
ISO10646@jhuvm.BITNET, and there is a dedicated UNICODE forum on
unicode@Sun.COM.

Robin Cover
BITNET: zrcc1001@smuvm1
Internet: robin@ling.uta.edu
Internet: robin@txsil.lonestar.org