4.1078 Unicode v. 10646 (2/176)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Sun, 24 Feb 91 21:22:02 EST

Humanist Discussion Group, Vol. 4, No. 1078. Sunday, 24 Feb 1991.


(1) Date: Mon, 11 Feb 91 18:08:56 PST (111 lines)
From: "Masami Hasegawa,
Subject: Interlocking Unicode and 10646
Forwarded from: Multi-byte Code Issues <ISO10646@JHUVM.BITNET>

(2) Date: Thu, 14 Feb 91 19:05:11 PST (65 lines)
From: "Masami Hasegawa,
Subject: Interworking with Unicode
Forwarded from: Multi-byte Code Issues <ISO10646@JHUVM.BITNET>

(1) --------------------------------------------------------------------
Date: Mon, 11 Feb 91 18:08:56 PST
From: "Masami Hasegawa,
Forwarded from: Multi-byte Code Issues <ISO10646@JHUVM.BITNET>
Subject: Interlocking Unicode and 10646

A proposal to bring Unicode and ISO 10646 closer
------------------------------------------------
Masami Hasegawa (Digital)

1. Goal

This proposal is designed to bring Unicode and ISO 10646 closer.

2. Basic ideas

The basic idea of this proposal is to make Unicode an extension of the ISO
10646 Basic Multilingual Plane (BMP). Thus non-ideographic characters in zones
A-00, A-01, A-10 and A-11 will be identical in Unicode and in the 2-octet
compaction form of ISO 10646. Zone I-00 will be the "compatibility zone" for
Unicode. Zones I-01, I-10 and I-11 can be used to code Unicode Han (UniHan)
characters. Any additional characters required by Unicode (such as floating
accents) can be coded in the C0/C1 control code areas.

+-----------------------------+-----------------------------+
|                    Unicode extensions                     |
|                                                           |
|     +-----------------------+     +-----------------------+
|     |    A-00 alphabetic    |     |    A-01 alphabetic    |
|     +-----------------------+     +-----------------------+
|     |                       |     |                       |
|     |         I-00          |     |         I-01          |
|     |                       |     |                       |
|     |     compatibility     |     |        UniHan         |
|     |                       |     |                       |
|     |                       |     |                       |
|     |                       |     |                       |
|     |                       |     |                       |
+     +-----------------------+     +-----------------------+
|                    Unicode extensions                     |
|                                                           |
|     +-----------------------+     +-----------------------+
|     |    A-10 alphabetic    |     |    A-11 alphabetic    |
|     +-----------------------+     +-----------------------+
|     |                       |     |                       |
|     |         I-10          |     |         I-11          |
|     |                       |     |                       |
|     |        UniHan         |     |        UniHan         |
|     |                       |     |                       |
|     |                       |     |                       |
|     |                       |     |                       |
|     |                       |     |                       |
+-----+-----------------------+-----+-----------------------+
                  Fig 1. New Unicode layout
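
As a rough illustration of the layout in Fig 1, the sketch below classifies a
2-octet code by zone. The C0/C1 octet ranges (00-1F and 80-9F) follow the usual
ISO 2022 convention. The reading of the zone suffix as (high bit of the row
octet, high bit of the cell octet) and the number of alphabetic rows per
quadrant (A_ROWS) are assumptions made for illustration, not values taken from
DIS 10646.

```python
# Hypothetical sketch: classify a 2-octet code (row, cell) against Fig 1.
# A_ROWS is an assumed count of alphabetic rows per quadrant, not a DIS value.
A_ROWS = 16

def is_control(octet):
    """True if the octet falls in the C0 (00-1F) or C1 (80-9F) range."""
    return (octet & 0x60) == 0

def zone(row, cell):
    """Return 'ext' for the Unicode-extension area, else a name like 'A-01'."""
    if is_control(row) or is_control(cell):
        return "ext"
    # Assumed suffix: first bit = high bit of row, second = high bit of cell.
    suffix = f"{row >> 7}{cell >> 7}"
    # Count rows from the first graphic row of their half (0x20 or 0xA0).
    row_in_half = (row & 0x7F) - 0x20
    kind = "A" if row_in_half < A_ROWS else "I"
    return f"{kind}-{suffix}"
```

Under these assumptions a code is in the common "core" subset exactly when
zone() returns a name starting with "A", and anything reported as "ext" is a
Unicode-only extension.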


3. Advantages

a. Common "Core" subset - This approach creates a common "core" subset between
Unicode and ISO 10646. This is beneficial for many users.

b. Code conversion - Code conversion between Unicode and ISO 10646 becomes
trivial and cheap for most of the non-ideographic characters.

c. Superset of BMP - With this layout, Unicode can claim to be a superset of
the ISO 10646 BMP.

d. Conformance requirement - It becomes possible to define a conformance level
in Unicode that meets ISO 10646 conformance requirements. This will encourage
standards-conscious people to implement Unicode.

4. Disadvantage

a. Non-contiguous coding of characters in Unicode. This is a problem if
software assumes contiguous coding of some characters.

5. Justifications

ANSI X3L2 tried to influence ISO-IEC/JTC1/SC2 with Unicode-based ideas (the
major ones being C0/C1, non-spacing accents and Han unification), but the
effort has not been successful at the ISO level. One of the major reasons for
the negative reaction is that major structural/architectural changes are
likely to cause problems for existing users of ISO character sets (including
other standards such as programming languages and OSI). Thus it is unlikely
that ISO will accept a proposal for major structural/architectural change to
ISO 10646.

On the other hand, having two totally different multilingual character codes
is costly, and many end users will suffer.

This proposal requires rearranging many characters in Unicode, but it does not
require any structural/architectural change to Unicode. For example, since the
C0 and C1 control code areas will NOT be used by ISO 10646, these areas can be
used for any Unicode extensions. Thus this proposal will not prevent Unicode
from meeting any of its special requirements (including controversial
characters such as floating accents).

By creating a common "core" subset of Unicode and ISO 10646, most users of
non-ideographic scripts will benefit from the close relationship between the
basic characters. (Just consider how much benefit we get today from having the
Macintosh and PC code pages be supersets of ASCII.)

6. Open issue

ISO 8859-1 characters are now in Row 032 in DIS 10646. These characters could
be moved to Row 000 to meet programming language C's built-in assumptions for
wchar_t processing code.
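
The C assumption at stake is presumably that widening a one-byte character to
wchar_t by zero-extension yields the same character. A minimal sketch of why the
row matters, assuming the 2-octet code is formed as row * 256 + cell and that
the cell octet of an ISO 8859-1 character equals its 8859-1 byte value (both are
assumptions for illustration, not taken from the DIS):

```python
# Hypothetical sketch: where ISO 8859-1 'A' (0x41) lands in a 2-octet code.

def two_octet_code(row, cell):
    """Combine a row octet and a cell octet into one 16-bit value."""
    return (row << 8) | cell

def widen(byte):
    """Zero-extension, as C code commonly does when widening a char."""
    return byte & 0xFF

latin1_a = 0x41
in_row_032 = two_octet_code(32, latin1_a)  # placement in Row 032
in_row_000 = two_octet_code(0, latin1_a)   # placement in Row 000

# Only the Row 000 placement equals the naively widened byte.
print(hex(in_row_032), hex(in_row_000), widen(latin1_a) == in_row_000)
```

With Row 032 the widened byte and the 2-octet code differ, so every such
widening would need an explicit conversion step.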

(2) --------------------------------------------------------------66----
Date: Thu, 14 Feb 91 19:05:11 PST
From: "Masami Hasegawa,
Forwarded from: Multi-byte Code Issues <ISO10646@JHUVM.BITNET>
Subject: Interworking with Unicode

Through the discussion of "unbounded repertoire", we have learned that the
concept of character identity is different in Unicode. The difference might
first appear only academic and philosophical, but practical implications are
very serious. My conclusion is that code conversion between Unicode and
existing character sets may be impossible without good knowledge of the scripts
and the writing systems in general. Here's why:

- There are underlying agreements on what "code elements" are in existing
popular character sets (including ISO, ASCII, JIS, EBCDIC, and PC codes). Thus
code conversion between different character sets can be achieved by simple
algorithms (such as table lookup) since the concept of "code elements" is
common even though their "coded representations" may be quite different.

- Unicode adopts a concept of "code elements" different from that of those
existing character sets. Thus code conversion from Unicode to ISO 10646
requires 3 levels of transformation:

Unicode code element(s)
| (1)
v
abstraction as Unicode text element(s)
| (2)
v
abstraction as 10646 text element(s)
| (3)
v
10646 code element(s)

These transformations require knowledge of how code elements and text elements
are related in each of the two models (1 and 3 above), as well as of how
Unicode and 10646 text elements relate to each other (2)!
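
The contrast can be sketched as follows. Between traditional character sets,
conversion is one table lookup per code; under the model described above, it
passes through text elements on both sides. Every table entry and code value
below is a made-up placeholder, and the choice of a base letter followed by a
floating accent as the multi-code text element is an assumption for
illustration.

```python
# Hypothetical sketch; every code value and table entry is a placeholder.

# Traditional case: same notion of code element, so one lookup per code.
SET_A_TO_SET_B = {0x41: 0xC1, 0x42: 0xC2}

def convert_by_lookup(codes):
    return [SET_A_TO_SET_B[c] for c in codes]

# Three-level case. Text elements are represented as plain names.
ACCENT = 0x0301  # placeholder for a floating accent code element

def unicode_codes_to_text_elements(codes):
    """(1) Group code elements: a floating accent joins the preceding base."""
    elements = []
    for c in codes:
        if c == ACCENT and elements:
            elements[-1] = elements[-1] + "+ACUTE"
        else:
            elements.append(chr(c))
    return elements

# (2) Relate Unicode text elements to 10646 text elements.
TEXT_ELEMENT_MAP = {"e+ACUTE": "E_ACUTE", "e": "E", "a": "A"}

# (3) Relate 10646 text elements to 10646 code elements.
ISO_CODES = {"E_ACUTE": [0x20E9], "E": [0x2065], "A": [0x2061]}

def unicode_to_10646(codes):
    out = []
    for element in unicode_codes_to_text_elements(codes):
        out.extend(ISO_CODES[TEXT_ELEMENT_MAP[element]])
    return out
```

Note that step (1) cannot be done one code at a time: three input codes here
yield two output codes, which is what a plain lookup table cannot express.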

This problem is not limited to rare scripts. As a real-life example, ISO
character sets (including the 1-byte ISO 8859 Latin/Hebrew set) have LEFT and
RIGHT PARENTHESIS. Unicode has OPEN and CLOSE PARENTHESIS. The two pairs are
identical only when the text presentation direction is left-to-right. This
means that you really need directionality information to convert ASCII
characters!
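
A minimal sketch of the parenthesis case. The character names are from the
text; the right-to-left mapping (the right-hand glyph opens) is an assumption
consistent with it, and the function name and direction labels are
hypothetical.

```python
# Hypothetical sketch: LEFT/RIGHT PARENTHESIS to OPEN/CLOSE PARENTHESIS.
# The result depends on presentation direction, which the code value
# alone does not carry.

def to_unicode_paren(name, direction):
    """Map 'LEFT PARENTHESIS' or 'RIGHT PARENTHESIS' given 'ltr' or 'rtl'."""
    if name not in ("LEFT PARENTHESIS", "RIGHT PARENTHESIS"):
        raise ValueError(f"not a parenthesis: {name}")
    if direction not in ("ltr", "rtl"):
        raise ValueError(f"unknown direction: {direction}")
    is_left = name == "LEFT PARENTHESIS"
    # Left-to-right: the left-hand glyph opens. Right-to-left (assumed): it closes.
    opens = is_left if direction == "ltr" else not is_left
    return "OPEN PARENTHESIS" if opens else "CLOSE PARENTHESIS"
```

The same input code thus converts to two different Unicode characters
depending on context, so no fixed table can do the conversion.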

My proposal
-----------

This situation is very bad when considering interworking of Unicode with
existing character sets. Thus, I propose the following points:

1) Minimum requirement: Unicode should clearly indicate where its concept of
a character differs from that of 10646, which is a collection of characters
from existing character sets. Where the character concepts are the same, code
conversion is easy.

2) Suggestion 1: Minimize the number of "new" characters with a different
concept. (For example, the ISO 8859 Latin/Hebrew and Latin/Arabic sets have
LEFT and RIGHT PARENTHESIS for dealing with bi-directional text. Unicode really
does not need to be different.)

3) Suggestion 2: Group "traditional"-concept characters in some logical areas.
This makes checking much easier. My "interlocking Unicode and 10646" approach
provides a possible solution. [i.e., characters in the A-00, A-01, A-10 and
A-11 zones are common "traditional" characters with very easy code conversion.
Characters in C0/C1 are "new" characters.]