4.0663 Sanskrit Character Sets -- Standards (1/219)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Wed, 31 Oct 90 22:42:35 EST

Humanist Discussion Group, Vol. 4, No. 0663. Wednesday, 31 Oct 1990.

Date: Tue, 30 Oct 90 11:55:38 +0000
From: Dominik Wujastyk <ucgadkw@ucl.ac.uk>
Subject: Sanskrit character sets

At the 8th World Sanskrit Conference, in Vienna in August, there was
a great deal of discussion of computer matters of various kinds.

One subject -- dear to the hearts of several HUMANISTS -- was to
decide on an 8-bit encoding scheme for the roman transliteration
of Sanskrit. In other words, a "code page", or ISO 8859 type
character set, planned according to ISO 2022. Other issues too
were discussed, notably the problem of document transfer. The
TEI was presented as the best general solution to this problem.

A committee settled on two code pages, which were generally approved by
many of those present at the conference. Nobody objected or proposed
an alternative scheme, although many alternative schemes are in use. One
code page is for Classical Sanskrit (CS) and the other for Classical
Sanskrit Extended (CSX). The first will do for normal stuff; the second
adds accented long vowels for Vedic, and special characters for
MIA, Tamil, and some other bits and pieces. Neither CS nor CSX is
anybody's particular set; both were devised afresh at the
conference.

I append a statement on all this, with full details of the character
codes as decided. (This is to be published in the Newsletter of the
International Association of Sanskrit Studies.) This document is tagged
for processing with LaTeX. It includes a call to the multicol.sty
style of Mittelbach, which you can omit without materially affecting
the document content. If you don't have LaTeX (surely everyone has by
now :-) you can still probably make out what codes have been
assigned. Any codes not mentioned are assumed to be as in IBM's
code page 437, the normal character set, often called extended ASCII,
that is built into all IBM PCs and clones.
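
For readers who want to try the fallback rule just described, here is a
minimal sketch in Python. It is not part of the conference proposal: the
names `CS_TABLE` and `decode_cs` are made up for illustration, the code
values are copied from the CS table in the appended document, and Unicode
is assumed purely as a convenient target for showing the result. Codes 225
and 251, which the CS table marks "not used", are assumed here to keep
their code page 437 meaning.

```python
import unicodedata

# Combining marks for the diacritics named in the CS table.
MACRON, ACUTE, TILDE = "\u0304", "\u0301", "\u0303"
OVERDOT, UNDERDOT = "\u0307", "\u0323"

# CS code -> (base letter, combining marks), copied from the CS table.
# 225 and 251 are "not used" in CS, so they are deliberately absent and
# fall through to code page 437 (an assumption of this sketch).
CS_TABLE = {
    166: ("l", TILDE), 167: ("m", OVERDOT),
    224: ("a", MACRON), 226: ("A", MACRON),
    227: ("i", MACRON), 228: ("I", MACRON),
    229: ("u", MACRON), 230: ("U", MACRON),
    231: ("r", UNDERDOT), 232: ("R", UNDERDOT),
    233: ("r", UNDERDOT + MACRON), 234: ("R", UNDERDOT + MACRON),
    235: ("l", UNDERDOT), 236: ("L", UNDERDOT),
    237: ("l", UNDERDOT + MACRON), 238: ("L", UNDERDOT + MACRON),
    239: ("n", OVERDOT), 240: ("N", OVERDOT),
    241: ("t", UNDERDOT), 242: ("T", UNDERDOT),
    243: ("d", UNDERDOT), 244: ("D", UNDERDOT),
    245: ("n", UNDERDOT), 246: ("N", UNDERDOT),
    247: ("s", ACUTE), 248: ("S", ACUTE),
    249: ("s", UNDERDOT), 250: ("S", UNDERDOT),
    252: ("m", UNDERDOT), 253: ("M", UNDERDOT),
    254: ("h", UNDERDOT), 255: ("H", UNDERDOT),
}


def decode_cs(data: bytes) -> str:
    """Decode CS-encoded bytes; codes not in the table are read as CP437."""
    pieces = []
    for byte in data:
        if byte in CS_TABLE:
            base, marks = CS_TABLE[byte]
            pieces.append(base + marks)
        else:
            pieces.append(bytes([byte]).decode("cp437"))
    # NFC merges base letter and marks into one character where one exists.
    return unicodedata.normalize("NFC", "".join(pieces))
```

For instance, `decode_cs(bytes([107, 231, 249, 245, 97]))` returns the
five-character string "kṛṣṇa": codes 231, 249 and 245 come from the CS
table, while the plain letters k and a fall through to code page 437.
The CSX additions could be handled by extending the same table.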

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% cut here %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentstyle[12pt,multicol]{article}
\def\diatop[#1|#2]{{\setbox1=\hbox{{#1{}}}\setbox2=\hbox{{#2{}}}%
\dimen0=\ifdim\wd1>\wd2\wd1\else\wd2\fi%
\dimen1=\ht2\advance\dimen1by-1ex%
\setbox1=\hbox to1\dimen0{\hss#1\hss}%
\rlap{\raise1\dimen1\box1}%
\hbox to1\dimen0{\hss#2\hss}}}%
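% \diatop[#1|#2] centres accent #1 over character #2, raising the
% accent by the height of #2 less 1ex.
% e.g. of use: \diatop[\'|{\=o}] gives o macron acute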

\title{Standardization of Sanskrit for Electronic Data Transfer
and Screen Representation}

\author{Dominik Wujastyk}
\date{9 September 1990}

\begin{document}
\maketitle

\section*{Text Encoding Guidelines}
During the 8th World Sanskrit Conference, Vienna 1990, a panel
was held to discuss the standardization of Sanskrit for
electronic data transfer. Participants were encouraged to
acquire and study the {\em ACH-ACL-ALLC Guidelines for the
Encoding and Interchange of Machine-readable Texts}, edited by
Lou BURNARD and C.~M.~SPERBERG-MCQUEEN (Chicago and Oxford,
1990). These {\em Guidelines\/} are available free of charge in
Europe from L.~Burnard, Oxford University Computing Service, 13
Banbury Road, Oxford OX2 6NN, England, or in the USA from C. M.
Sperberg-McQueen, Computer Center (MIC 135), University of
Illinois at Chicago, Box 6998, Chicago, IL 60680, USA.

\section*{7-bit coding for file transfer}
Professor H. Falk presented a program called {\tt CONVERT} that
conveniently converts any coding scheme used in a data file to
any other coding scheme. This program was generously made
available at no cost, together with Turbo Pascal source code.
Prof.\ Falk also presented a very useful 7-bit, multi-byte
``mediation code'' which will be of general use for file
exchange.

\section*{8-bit character set for text display}
Finally, although the above two provisions cover all essential
needs, the panel still felt that a standard assignment of graphic
codes for the display of Sanskrit transliteration would be
helpful. An ad hoc committee of interested parties was formed,
and two 8-bit ``code pages'' were designed: one, {\em Classical
Sanskrit\/} (CS), for standard use, and another, {\em Classical
Sanskrit Extended\/} (CSX), which includes the former but also
provides for Vedic, MIA, Tamil and some special usages.

The codes assigned were as follows:
\begin{multicols}{2}[\subsection*{Classical Sanskrit (CS)}]
\begin{small}
\begin{tabbing}
000 \= x underdot macron acute \= (normally German eszett, xx) \kill
166 \> l tilde \> \~ l \\
167 \> m overdot \> \.m \\
224 \> a macron \> \a=a \\
225 \> not used (normally German {\em eszett}, \ss) \\
226 \> A macron \> \a=A \\
227 \> i macron \> \a=\i \\
228 \> I macron \> \a=I \\
229 \> u macron \> \a=u \\
230 \> U macron \> \a=U \\
231 \> r underdot \> \d r \\
232 \> R underdot \> \d R \\
233 \> r underdot macron\> \diatop[\a=|\d r]\\
234 \> R underdot macron\> \diatop[\a=|\d R]\\
235 \> l underdot \> \d l \\
236 \> L underdot \> \d L \\
237 \> l underdot macron\> \diatop[\a=|\d l]\\
238 \> L underdot macron\> \diatop[\a=|\d L]\\
239 \> n overdot \> \.n \\
240 \> N overdot \> \.N \\
241 \> t underdot \> \d t \\
242 \> T underdot \> \d T \\
243 \> d underdot \> \d d \\
244 \> D underdot \> \d D \\
245 \> n underdot \> \d n \\
246 \> N underdot \> \d N \\
247 \> s acute \> \a's \\
248 \> S acute \> \a'S \\
249 \> s underdot \> \d s \\
250 \> S underdot \> \d S \\
251 \> not used (normally the root sign $\surd$) \\
252 \> m underdot \> \d m \\
253 \> M underdot \> \d M \\
254 \> h underdot \> \d h \\
255 \> H underdot \> \d H \\
\end{tabbing}
\end{small}
\end{multicols}
\newpage

\begin{multicols}{2}[\subsection*{Classical Sanskrit Extended (CSX) additions}
The following definitions are added to the above Classical
Sanskrit character set.]
\begin{small}
\begin{tabbing}
000 \= x underdot macron acute \= (normally German eszett, xx) \kill
159 \> r underbar \> \b r \\
168 \> a macron breve \> \diatop[\u|\a=a]\\
169 \> i macron breve \> \diatop[\u|\a=\i]\\
170 \> u macron breve \> \diatop[\u|\a=u]\\
173 \> n underbar \> \b n \\
181 \> a macron acute \> \diatop[\a'|\a=a]\\
182 \> a macron grave \> \diatop[\a`|\a=a] \\
183 \> i macron acute \> \diatop[\a'|\a=\i] \\
184 \> i macron grave \> \diatop[\a`|\a=\i] \\
189 \> u macron acute \> \diatop[\a'|\a=u] \\
190 \> u macron grave \> \diatop[\a`|\a=u] \\
198 \> r underdot acute\> \diatop[\a'|\d r] \\
199 \> r underdot grave\> \diatop[\a`|\d r] \\
207 \> r underdot macron acute\>
\raisebox{.25ex}{\rlap{\a'{ }}}\diatop[\a=|\d r] \\
208 \> a tilde \> \~ a \\
209 \> i tilde \> \~ \i \\
210 \> u tilde \> \~ u \\
211 \> e tilde \> \~ e \\
212 \> o tilde \> \~ o \\
213 \> e breve \> \u e \\
214 \> o breve \> \u o \\
215 \> l underbar \> \b l \\
\end{tabbing}
\end{small}
\end{multicols}
\bigskip
These codes were chosen to have minimal impact on the standard
IBM PC extended ASCII character set, but they are intended for
general use in displaying Indological texts on any machine with
an 8-bit (or greater) character set.

Dr.\ D. Wujastyk will be making available small programs that load
the above character sets into the EGA or VGA display adaptors,
for IBM PC users.

The above character codings have been approved by R. E. Emmerick,
H. Falk, R. Lariviere, G. J. Meulenbeld, H. Nakatani, M.
Tokunaga, D. Wujastyk, and M. Yano.

These character codings are primarily intended for use in
situations where the screen display of these characters is
required, such as in word processing. They may, of course, be
used for data transfer, although a 7-bit code (perhaps with
multi-byte character codes) is still preferable there. One such 7-bit
scheme is provided by H. Falk (see the section on 7-bit coding above).

\newpage

These character codings are currently open for discussion, and
comments may be directed to Dr.\ D. Wujastyk at

Wellcome Institute,

183 Euston Road,

London NW1 2BN, England,\\
or by email at

Bitnet/Earn: {\tt dow@harvunxw} or

Janet: {\tt D.Wujastyk@uk.ac.ucl}.

After a suitable lapse of time, the character sets will be sent
to ECMA and ISO for registration. They will also be sent to the
Text Encoding Initiative for registration, probably with H.
Falk's 7-bit coding scheme.

Such registration in no way enforces these schemes; it merely
makes them available centrally for reference. Other schemes may
also be registered in the future.

\end{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% cut here %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

------------------------------------------------------------------------------
Dominik Wujastyk, | Janet: D.Wujastyk@uk.ac.ucl
Wellcome Institute for | Bitnet/Earn/Ean/Uucp: D.Wujastyk@ucl.ac.uk
the History of Medicine, | Internet/Arpa/Csnet: dow@wjh12.harvard.edu
183 Euston Road, | or: D.Wujastyk%ucl@nsfnet-relay.ac.uk
London NW1 2BN, England. | Phone no.: +44 71 383-4252 ext.24
-------------------------------------------------------------------------------
