2.927: coding Sanskrit, cont. (68)

Willard McCarty (MCCARTY@VM.EPAS.UTORONTO.CA)
Wed, 3 May 89 20:06:46 EDT


Humanist Mailing List, Vol. 2, No. 927. Wednesday, 3 May 1989.

Date: 3 May 1989
From: Wilhelm Ott <ZRSZOT1@DTUZDV2>
Subject: Transcribing Sanskrit

Though I am not a Sanskritist, I cannot resist any longer
the temptation to add a comment to this discussion.

What about an encoding scheme relying on "floating diacritics"
which may be added to any character?

At Tuebingen, we began scholarly text data processing early enough
to have had difficulties with character sets from the very beginning.
It was the time of 6-bit BCD characters, allowing 64 different
characters, of which only 48 were supported on some key punches.
Nevertheless we coded everything we needed (including Greek
characters with breathings and accents and Hebrew characters with
vowels) on these machines, using printable characters only.

The situation has not changed in principle since then, though 8-bit
character codes allow only somewhat more than 200 different characters.
Therefore the encoding scheme we use has not changed in principle
either: we rely on the common subset of printable characters
available in all of the different national versions of ASCII and EBCDIC
for transcribing and encoding everything we need.
Perhaps you will find that you follow a similar procedure at your
keyboard: apart from two or three keys like CTRL and ALT, you
press only the keys for printable ASCII characters.

If, while transcribing, you also replace the ESC, CTRL, ALT etc. keys
by printable characters, all the necessary codes end up in your
file. Escaping to another font may be coded e.g.
as #g+ for Greek, #h+ for Hebrew, #r+ for Cyrillic (r = "russisch"),
#p+ for phonetics, #s+ for Syriac, #/+ for slanted, #f+ for bold ("fett")
etc.; the shift back to Latin is done by terminating the respective
font (#g- or #h- ...).
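
The font shifts just described can be sketched in a few lines of
Python. This is only an illustration, not TUSTEP code: the function
name, the English font labels, the treatment of nested shifts and the
handling of unmatched terminators are all assumptions made here; only
the shift letters themselves come from the list above.

```python
import re

# Shift letters as listed in the message; the English labels are chosen here.
FONTS = {"g": "greek", "h": "hebrew", "r": "cyrillic", "p": "phonetic",
         "s": "syriac", "/": "slanted", "f": "bold"}

SHIFT = re.compile(r"#(.)([+-])")

def split_font_runs(text, default="latin"):
    """Split text into (font, run) pairs at #x+ / #x- shifts (hypothetical helper)."""
    runs = []
    stack = [default]          # innermost active font is stack[-1]
    pos = 0
    for m in SHIFT.finditer(text):
        code, sign = m.group(1), m.group(2)
        if code not in FONTS:
            continue           # unknown shift letter: leave it in the text
        if m.start() > pos:
            runs.append((stack[-1], text[pos:m.start()]))
        pos = m.end()
        name = FONTS[code]
        if sign == "+":
            stack.append(name)
        elif len(stack) > 1 and stack[-1] == name:
            stack.pop()
        # Assumption: a terminator that does not match the open font is dropped.
    if pos < len(text):
        runs.append((stack[-1], text[pos:]))
    return runs

print(split_font_runs("see #g+logos#g- here"))
```

Each run is labelled with the innermost open font only; whether shifts
may be nested at all is not stated above, so that part is a guess.
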
Diacritics are coded by a different escape character, %, followed by
a character which looks similar to the diacritic to be used:
%.a for "dot over a", %..a for "dot under a", %-a for "dash over a",
%--a for "dash under a", %?a for "tilde over a", %??a for "tilde
under a", %/%-%..a for "acute over dash over a and dot under it", etc.
This avoids some of the problems:
- there is (almost) no limit to combining diacritics and letters,
- there are no problems with data transfer and exchange,
- the text remains readable even on terminals with a limited character
set or without graphics capabilities.
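
As a rough illustration of how such floating diacritics can be
resolved mechanically, here is a minimal Python sketch that turns the
%-codes listed above into Unicode combining marks. It is not the
TUSTEP implementation. The choice of Unicode as target, the names
MARKS and decode, the rule that a doubled sign is read before a single
one, and the reading that the first code in a stack is the outermost
mark are all assumptions made for this example.

```python
import re
import unicodedata

# Codes from the message mapped to Unicode combining marks (mapping assumed here).
MARKS = {
    "..": "\u0323",  # dot below
    ".":  "\u0307",  # dot above
    "--": "\u0331",  # macron below
    "-":  "\u0304",  # macron above
    "??": "\u0330",  # tilde below
    "?":  "\u0303",  # tilde above
    "/":  "\u0301",  # acute
}

# Doubled signs come first so that "%..a" is read as dot below, not dot above.
ALT = r"\.\.|--|\?\?|\.|-|\?|/"
CODE = re.compile(r"%(" + ALT + r")")
GROUP = re.compile(r"((?:%(?:" + ALT + r"))+)(\w)")

def decode(text):
    """Replace each run of %-codes plus its base letter by a composed character."""
    def repl(m):
        codes = CODE.findall(m.group(1))
        # "%/%-a" is "acute over dash over a": the last code sits closest to
        # the letter, so the marks are attached in reverse order.
        marks = "".join(MARKS[c] for c in reversed(codes))
        return unicodedata.normalize("NFC", m.group(2) + marks)
    return GROUP.sub(repl, text)

print(decode("Brahmapur%-a%..na"))
```

Text without any %-codes passes through unchanged, so a file stays
plain printable ASCII until the moment it is displayed or printed.
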

For Sanskrit, Peter Schreiner and Renate Soehnen have used this approach
for producing their "Sanskrit Indexes and Text of Brahmapur-a.na"
(published 1987 by Harrassowitz, Wiesbaden), transcribing the text
as shown in the last word of the title just quoted, on plain ASCII
terminals, and transforming these codes to the required %-sequences
by simple search-and-replace just before printing.
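
The search-and-replace step can be pictured roughly as follows. The
actual replacement table used for the Brahmapur-a.na project is not
given here, so the table below is a hypothetical one inferred only
from the example "-a" and ".n"; the names TABLE and to_percent_codes
are likewise invented for this sketch.

```python
import re

# Hypothetical table: "-" before a vowel means macron, "." before a
# consonant means dot below. Only "-a" and ".n" are attested in the message.
TABLE = {
    "-a": "%-a", "-i": "%-i", "-u": "%-u",
    ".r": "%..r", ".t": "%..t", ".d": "%..d", ".n": "%..n",
    ".s": "%..s", ".h": "%..h", ".m": "%..m",
}

# One pass over the text, so replaced output is never scanned again.
PATTERN = re.compile("|".join(re.escape(key) for key in TABLE))

def to_percent_codes(text):
    """Rewrite the working transcription into %-sequences (assumed table)."""
    return PATTERN.sub(lambda m: TABLE[m.group(0)], text)

print(to_percent_codes("Brahmapur-a.na"))  # Brahmapur%-a%..na
```

A table like this would misread a genuine hyphen or full stop that
stands directly before one of these letters, so it assumes that the
transcription reserves those two signs for diacritics in such
positions.
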

Those interested in this approach may find it worthwhile to have a
look at the TUSTEP demonstration at the Toronto fair in June.


Wilhelm Ott, Tuebingen (ZRSZOT1 at DTUZDV2)