Pundits take note: more on Sanskrit coding (277)

Willard McCarty (MCCARTY@VM.EPAS.UTORONTO.CA)
Tue, 18 Apr 89 20:16:32 EDT


Humanist Mailing List, Vol. 2, No. 855. Tuesday, 18 Apr 1989.


(1) Date: Tue, 18 Apr 89 15:28:07 EDT (12 lines)
From: Mathieu Boisvert <BOISVERT@vm.epas.utoronto.ca>
Subject: Sanskrit

(2) Date: Tue, 18 Apr 89 16:06 (246 lines)
From: Wujastyk (on GEC 4190 Rim-C at UCL) <UCGADKW@EUCLID.UCL.AC.UK>
Subject: Sanskrit, character codes

(1) --------------------------------------------------------------------
Date: Tue, 18 Apr 89 15:28:07 EDT
From: Mathieu Boisvert <BOISVERT@vm.epas.utoronto.ca>
Subject: Sanskrit

Here is a query in regard to Sanskrit ASCII coding: I've been told that
some Indologists are using a "keyboard layout based on **TIME INDIAN**",
which was supposedly developed at Berkeley. Does anyone have more info
about TIME INDIAN? I would appreciate it if you could forward it to my
personal E-Mail address.
Thank you,
Mathieu
BOISVERT@VM.EPAS.UTORONTO.CA
(2) --------------------------------------------------------------250---
Date: Tue, 18 Apr 89 16:06
From: Wujastyk (on GEC 4190 Rim-C at UCL) <UCGADKW@EUCLID.UCL.AC.UK>
Subject: Sanskrit, character codes


For the record, here is the Sanskrit coding scheme used by Prof.
R. E. Emmerick (Hamburg), extracted from his "A Guide to
operating BHELA.EXE" (part of a suite of programs for inputting,
sorting, searching and editing Sanskrit verses, and their variant
readings):

195 long a
197 long i
198 long u
173 vocalic r
204 long vocalic r
202 vocalic l
203 long vocalic l
199 guttural nasal
164 palatal nasal
194 retroflex t
172 retroflex d
239 retroflex n
192 nasal l: allowed only by BHELA.EXE
211 palatal sibilant
171 retroflex sibilant
230 anusvara
247 visarga.
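
As an illustration only, the table above can be written as a lookup that
decodes 8-bit text into Unicode transliteration. The Unicode targets, the
names EMMERICK_TO_UNICODE and decode_emmerick, and the rendering of code 192
are assumptions made for this sketch; the guide itself defines only the
numeric codes.

```python
# Illustrative sketch: the high-bit codes listed above mapped to Unicode
# letters with diacritics. The Unicode targets are an assumption, not part
# of BHELA.EXE, which defines only the code numbers.
EMMERICK_TO_UNICODE = {
    195: "\u0101",   # long a
    197: "\u012b",   # long i
    198: "\u016b",   # long u
    173: "\u1e5b",   # vocalic r
    204: "\u1e5d",   # long vocalic r
    202: "\u1e37",   # vocalic l
    203: "\u1e39",   # long vocalic l
    199: "\u1e45",   # guttural nasal
    164: "\u00f1",   # palatal nasal
    194: "\u1e6d",   # retroflex t
    172: "\u1e0d",   # retroflex d
    239: "\u1e47",   # retroflex n
    192: "l\u0310",  # nasal l (assumed rendering: l + combining candrabindu)
    211: "\u015b",   # palatal sibilant
    171: "\u1e63",   # retroflex sibilant
    230: "\u1e43",   # anusvara
    247: "\u1e25",   # visarga
}


def decode_emmerick(data: bytes) -> str:
    """Decode 8-bit text: codes 0-127 as ASCII, listed high codes via the table."""
    out = []
    for byte in data:
        if byte < 128:
            out.append(chr(byte))
        elif byte in EMMERICK_TO_UNICODE:
            out.append(EMMERICK_TO_UNICODE[byte])
        else:
            # Codes not in the published list are rejected rather than guessed.
            raise ValueError(f"unassigned code {byte}")
    return "".join(out)
```

For example, decode_emmerick(bytes([211, 195, 115, 116, 114, 97])) gives the
word "śāstra".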


******************************************************************************

And here are my replies to some of the recent contributions on the subject of
character sets (all of which have been most stimulating and helpful):

>Humanist Mailing List, Vol. 2, No. 824. Wednesday, 12 Apr 1989.
>
>Date: 12 April 1989, 15:31:15 EDT
>From: Brad Inwood (416) 978-3178 INWOOD at UTOREPAS
>Subject: Coding for Sanskrit, Greek, etc.
>
>... There have
>been a few idiosyncratic formats in use: users of Lettrix have been
>transliterating in their own private ways; Academicfont coding looked
>as though it might approach being *a* not *the* standard, but then it
>died the death of all niche products;

I heartily agree that it would be shortsighted to link a coding scheme with any
particular piece of software, especially commercial (with no sources).

>... what are the prospects for a standard
>representation of exotic alphabets? Probably very poor, unless some
>central scholarly body throws its weight around, and even then ...

Very tricky question, I agree, but I think Sanskritists can sometimes be a
surprisingly cooperative lot. For example, we have all been using a standard
transliteration scheme since it was recommended by the 1894 Geneva Oriental
Congress! This is in stark contrast, for example, to our Arabist colleagues,
who are riven by the issue of transliteration, often along ugly nationalist
lines. Tibetologists, Sinologists, and others too are all burdened with many
different systems of transliteration. Not so Sanskritists. I really do
believe that the World Sanskrit Conference next year could lay down a guideline
that would gradually be adopted.

>... What
>is the effect on standardization of the competing approaches represented
>by the PC world and the Mac?

I have never used a Mac (no prejudice, just no money), and I am very curious
about what the Mac does by way of a character set. Does it have an 8-bit set
(256 positions)? If so, I assume that 0--127 are ASCII; what about the rest?
Are they easy to reassign? Is that how multilingual Mac s/w in fact works?

> And what is the relevance of word-processing
>vs text-base, text-retrieval, or database applications?

Yes indeed. And sorting Sanskrit is not trivial (though not overwhelmingly
difficult either). Some of the issues one has to address are: where does
anusvara go and what to do about the diphthong vowels "ai" and "au"?
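
A minimal sketch of one way to handle both issues, assuming text in Unicode
transliteration. The alphabet order is the traditional one, but placing
anusvara and visarga between the vowels and "k" is only one of several
possible choices, and the greedy two-letter match treats every "ai", "au",
"kh", and so on as a single unit, which is wrong for the rare cases of
hiatus. The names ALPHABET, RANK and sanskrit_key are hypothetical.

```python
import unicodedata

# One possible collation order for romanised Sanskrit (an assumption, not a
# standard). "ai" and "au" are single units; anusvara and visarga are placed
# after the vowels and before the consonants.
ALPHABET = [
    "a", "ā", "i", "ī", "u", "ū", "ṛ", "ṝ", "ḷ", "ḹ",
    "e", "ai", "o", "au",
    "ṃ", "ḥ",
    "k", "kh", "g", "gh", "ṅ",
    "c", "ch", "j", "jh", "ñ",
    "ṭ", "ṭh", "ḍ", "ḍh", "ṇ",
    "t", "th", "d", "dh", "n",
    "p", "ph", "b", "bh", "m",
    "y", "r", "l", "v",
    "ś", "ṣ", "s", "h",
]
# Normalise to NFC so precomposed and decomposed input compare alike.
RANK = {unicodedata.normalize("NFC", u): i for i, u in enumerate(ALPHABET)}


def sanskrit_key(word: str) -> list:
    """Return a list of ranks usable as a sort key."""
    word = unicodedata.normalize("NFC", word.lower())
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            # Greedy match: diphthongs and aspirates count as one unit.
            key.append(RANK[pair])
            i += 2
        elif word[i] in RANK:
            key.append(RANK[word[i]])
            i += 1
        else:
            # Unknown characters sort after the whole alphabet.
            key.append(len(ALPHABET) + ord(word[i]))
            i += 1
    return key


words = ["kha", "ka", "kau", "kā", "ke"]
print(sorted(words, key=sanskrit_key))
```

With this key the five words come out as ka, kā, ke, kau, kha, whereas a plain
code-point sort would put "kau" before "ke" and "kā" last.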

*******************************************************************************

>Humanist Mailing List, Vol. 2, No. 825. Wednesday, 12 Apr 1989.
>
>Date: Wed, 12 Apr 89 08:42:00 EDT
>From: DEL2@phoenix.cambridge.ac.uk
>Subject: Sanskrit coding
>
>There is a WP program called ChiWriter which specialises in scientific,
>mathematical and `foreign' text; Nagari is available on it ...

Yes, I'm afraid the wheel has been invented *yet again*. Madhav Deshpande (Ann
Arbor) created a very nice set of Devanagari fonts (including bold, italic,
etc.) for ChiWriter some years ago. I think Madhav's fonts look nicer that the
new Swabian ones, which have some residual problems with spacing. Madhav's
package also included all the diacritical marks for transliterated Indic
languages.

But TeX users (like Unix users) tend to be a bit sniffy about other systems. I
use TeX, with Velthuis's superb Devanagari, which was recently tested on a
Linotronic 300 with stunning results. Several years ago I decided that TeX
would be my terminal system [ final, not VDU :-) ], not only because of its
excellence, but because of Knuth's beliefs about the importance of software
stability; I shall not change to anything else in the foreseeable future.

Incidentally, I have not used ChiWriter beyond a 5 minute fiddle, but I gather
that the underlying format in which the text is stored is pretty hard to deal
with (in the sense of converting to other programs): superscripts, for example,
are stored as characters on a line of their own above the main line of the
text. Is that so?

>
>The second one (standard character codes for Sanskrit and Pali):
>
>(a) there is surely no reason to assume that this or any other
>scheme can or will become standard:

See my remarks above about the cooperativeness of Sanskritists. Of course I
may be wrong. Talk to me again after Vienna, 1990.

>... anyway translation between schemes can easily be automated;

Of course. This is not an argument against establishing a set of guidelines.

> (b) if the
>scheme really is intended for universal use it ought to contain lots more:
>Vedic accented vowels, vowel+macron+breve, nasalised vowels written with
>tilde, etc. And it is not safe to assume that upper case versions of some
>characters will not be needed.

YES! Please send in a precise and comprehensive list of your suggestions for
such a character set coding scheme. This would be most helpful and interesting.

******************************************************************************

>Humanist Mailing List, Vol. 2, No. 833. Thursday, 13 Apr 1989.
>(1) --------------------------------------------------------------------
>Date: Wed, 12 Apr 89 21:45:18 -0400
>From: jonathan@eleazar.Dartmouth.EDU (Jonathan Altman)
>Subject: coding for sanskrit, greek, etc.
>
>There seems to be much discussion about the inability to agree on
>standards and the problems with incompatible data formats that
>results. I have a question which may sound stupid or oversimplistic,
>but I believe is not. Who cares about formats, exactly? I do not, for
>one. Since the hope of standardizing on one format seems to be very
>low, I would rather discuss the issue of how best to convert between
>formats.
> ...
>Jonathan Altman

Yes, I agree that conversion can be a relatively simple matter, especially if
you have a nice OS like Unix.

But there is a specific situation obtaining in the Sanskrit world. About a
dozen or so Sanskritists (probably a lot more) are pretty computer literate,
and use the PC. We have all got hold of the Duke University Language toolkit,
or some equivalent, and created dandy downloadable Sanskrit screen fonts for
our EGAs and VGAs. We then input masses of text using our particular character
coding scheme, and write software to manipulate our text bases. The only
trouble is, we all use different systems. For example, in the 87/88 academic
year I was a visiting scholar at Harvard. I lived *next door* to my friend and
colleague, Gary Tubb. He and I had done identical work in inventing screen
fonts, with different coding schemes. It was a real nuisance exchanging texts,
to such an extent that we didn't bother much. And in a whole year as neighbors
neither of us adopted the other's scheme, because we had quite an investment of
programs that supported our own system, and it was not clear that either of our
schemes was better. Now if an outside body made a recommendation on this very
simple matter, it would be *worth* changing.

>(2) --------------------------------------------------------------21----
>Date: Thu, 13 Apr 89 01:48:45 EDT
>From: cbf%faulhaber.Berkeley.EDU@jade.berkeley.edu (Charles Faulhaber)
>Subject: Re: coding "strange" languages (37)
>
>I would hope that the Text Encoding Initiative would be looking
>at funny character set as they set about their endeavors. And what
>is the ISO doing? This looks like a good candidate for a session
>in Toronto.
>
>Charles Faulhaber

Yes to everything. I am sorry that I shall not be able to attend the Toronto
meeting.

But I would add this: Sanskritists are not going to take any notice of what a
bunch of computer scientists say, even Humanists. They have to hear it from
fellow Sanskritists, preferably with a good track record in Sanskrit. So any
such scheme for character coding has to be thrashed out at a Sanskrit
conference, to stand any chance of being adopted.

>(4) --------------------------------------------------------------28----
>Date: Thu, 13 Apr 89 12:39:57 EDT
>From: elli@harvunxw.BITNET (Elli Mylonas)
>Subject: coding for strange languages (21)
>
>The goal in developing a standard is not so much in creating something
>that will work on any machine at any time; so that a character set
>can be read by all word processors. The goal is to create a standard
>that all (ha!) software can translate into and out of. If you ask the
...
>So what we need is systems that can read in , and when moving to
>some other system, write out beta code (in the case of Greek).

I agree wholeheartedly with your basic point, of course. But for the reasons
given above, I still think that some guidelines for the assignment of
character codes 128--255 would be a good idea, and would actually be *used* by
Sanskritists.

I hasten to add that I am not thinking of such a scheme as *the* coding scheme
for Sanskrit, by any means. I firmly believe that *the* coding scheme should
be a 7-bit one, and this will mean digraphs and trigraphs. This scheme should
fall within the SGML scheme, and be used for any big projects, or text
archives, etc., just as you suggest.
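
To make the relation between the two levels concrete, here is a hedged sketch
that rewrites 8-bit text using the codes listed at the start of this message
into a 7-bit form built from digraphs and trigraphs. The digraph spellings are
modelled on the input convention of Velthuis's Devanagari package as far as I
understand it, and should be treated as an assumption rather than a
recommendation. Code 192 (nasal l) is left out because no equivalent is known
to me. The names EMMERICK_TO_7BIT and to_seven_bit are hypothetical.

```python
# Illustrative sketch: high-bit codes rewritten as 7-bit digraphs.
# The digraph spellings are an assumed, Velthuis-style convention.
EMMERICK_TO_7BIT = {
    195: "aa",   # long a
    197: "ii",   # long i
    198: "uu",   # long u
    173: ".r",   # vocalic r
    204: ".R",   # long vocalic r
    202: ".l",   # vocalic l
    203: ".L",   # long vocalic l
    199: '"n',   # guttural nasal
    164: "~n",   # palatal nasal
    194: ".t",   # retroflex t
    172: ".d",   # retroflex d
    239: ".n",   # retroflex n
    211: '"s',   # palatal sibilant
    171: ".s",   # retroflex sibilant
    230: ".m",   # anusvara
    247: ".h",   # visarga
}


def to_seven_bit(data: bytes) -> str:
    """Rewrite 8-bit coded text as pure 7-bit ASCII."""
    out = []
    for byte in data:
        if byte < 128:
            out.append(chr(byte))
        elif byte in EMMERICK_TO_7BIT:
            out.append(EMMERICK_TO_7BIT[byte])
        else:
            # Includes code 192, which this sketch deliberately does not map.
            raise ValueError(f"no 7-bit equivalent defined for code {byte}")
    return "".join(out)
```

An aspirate such as retroflex "th" then becomes a trigraph (".th") simply
because the plain "h" follows the digraph. Unlike the 8-bit grid, a scheme of
this kind can be extended by adding new prefixes without running out of
positions.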

All that I have in mind in the present discussion about coding schemes is
really an agreement about the use of the high bit codes for some characters
with diacritical marks, to be used for data entry and by some of the simpler
sorting and editing programs, since so many of us have already developed our
home-grown versions of such a scheme. The advantage of such a scheme is that
bog-standard software can then be used to deal with Sanskrit data (e.g., I use
ProCite, TeX and XyWrite all the time for Sanskrit, with complete ease). I
think that a very important feature of any scheme that is to be adopted as a
serious standard for large scale work is that it should be extendible, and
clearly the one-to-one assignment of character codes in an 8-bit grid is not
extendible. And *any* such scheme should be implemented only after taking on
board the recommendations of the relevant Text Encoding Initiative committee.


Dominik


PS I don't like the above format of replying to mail in a bitty, pseudo-
conversational manner. I won't do it again.


-----------------------------------------------------------------------------
Dominik Wujastyk, | Janet: wujastyk@uk.ac.ucl.euclid
Wellcome Institute for | Bitnet/Earn/Ean/Uucp: wujastyk@euclid.ucl.ac.uk
the History of Medicine, | Internet/Arpa/Csnet: dow@wjh12.harvard.edu
183 Euston Road, |
London NW1 2BP, England. | Phone: London 387-4477 ext.3013
-----------------------------------------------------------------------------