9.358 encoding, accents, interpretation

Humanist (mccarty@phoenix.Princeton.EDU)
Tue, 5 Dec 1995 19:13:43 -0500 (EST)

Humanist Discussion Group, Vol. 9, No. 358.
Center for Electronic Texts in the Humanities (Princeton/Rutgers)

[1] From: Jean Véronis <veronis@univ-aix.fr> (170)
Subject: Re: 9.356 Lettres accentue'es

[2] From: Harry Gaylord <galiard@let.rug.nl> (8)
Subject: Re: 9.356 Lettres accentu&eacute;es

[3] From: Ian Lancashire <ian@epas.utoronto.ca> (81)
Subject: encoding and interpretation

Date: Tue, 05 Dec 1995 12:22:31 +0200
From: Jean Véronis <veronis@univ-aix.fr>
Subject: Re: 9.356 Lettres accentue'es

We had quite a number of (occasionally hot) discussions on the LN list
(ln@cnusc.fr) about accents in French.

What seems to be the majority feeling is that:

(1) it is NOT acceptable to lose the accents in any language (think of writing
English without the letter group "th", and you will understand the problem)

(2) systems like SGML entities are too heavy to type and read without appropriate
software, which most people don't have (even if you have it, chances are good that
it is not connected to your mailer).

In addition, it seems that in most Internet communication nowadays, accents are
transmitted correctly. Only a few environments (including, unfortunately, Li[...])
are accent-hostile, but I am sure that the situation will change.

For the time being, there is a broadly accepted e-mail convention for
accent-hostile environments: the "Easy French" convention developed by
Franc,ois Pinard (mailto:pinard@iro.umontreal.ca), which I quote below. It is easy
to type, easy to read, and there is a free program (`recode' from the GNU
distribution) that can translate from about any character set into it (and back,
with a few ambiguities).

Jean Ve'ronis :-)


Quoted from the GNU `recode' on-line information:

> ASCII with easy French conventions
> ==================================
> This charset is available in `recode' under the name `Texte' and has
> `txte' for an alias.
> This charset is a seven-bit code, identical to `ASCII-BS', save for
> French diacritics which are noted using a slightly different convention.
> At text entry time, these conventions provide a little speed up. At
> read time, they slightly improve the readability over a few alternate
> ways of coding diacritics. Of course, it would be better to have a
> specialized keyboard to make direct eight-bit entries and fonts for
> immediately displaying eight-bit ISO Latin-1 characters. But not
> everybody is so fortunate. In several mailing environments, the eighth
> bit is often willfully destroyed.
> Easy French has been in use in France for a while. I only slightly
> adapted it (the diaeresis option) to make it more comfortable to several
> usages in Qu'ebec originating from Universit'e de Montr'eal. In fact,
> the main problem for me was not necessarily to invent Easy French,
> but to recognize the "best" convention to use ("best" not being
> defined here), and to try to solve the main pitfalls associated with
> the selected convention.

> Diacritics
> ----------
> French quotes (sometimes called "angle quotes") are noted the same
> way English quotes are noted in TeX, *id est* by ```' and `'''.
> No effort has been put to preserve Latin ligatures (`ae', `oe')
> which are representable in several other charsets. So, these ligatures
> may be lost through Easy French conventions.
> This is almost the French convention for simplified diacritics entry:
>     `e''    Acute accent
>     `e`'    Grave accent
>     `e^'    Circumflex accent
>     `e"'    Diaeresis
>     `c,'    Cedilla
> In some countries, `:' is used instead of `"' to mark diaeresis.
> `recode' supports only one convention per call, depending on the `-c'
> option of the `recode' command.
> The convention is prone to losing information, because the diacritic
> meaning overloads some characters that already have other uses. To
> alleviate this, some knowledge of the French language is built into
> the recognition routines. So, the following subtleties are
> systematically obeyed by the various recognizers.
> * A single quote which follows an `e' does not necessarily mean an
>   acute accent if it is followed by a single other one. For example:
>       `e''
>   will give an `e' with an acute accent.
>       `e'''
>   will give a simple `e', with a closing quotation mark.
>       `e''''
>   will give an `e' with an acute accent, followed by a closing
>   quotation mark.
>   There is a problem induced by this convention if there are English
>   quotations within a French text. In sentences like:
>       There's a meeting at Archie's restaurant.
>   the single quotes will be mistaken twice for acute accents. So
>   English contractions and suffix possessives could be mangled.
> * A double quote or colon, depending on the `-c' option, which follows a
>   vowel is interpreted as a diaeresis only if it is followed by another
>   letter. But there are in French several words that *end* with a
>   diaeresis, and the program also recognizes them. *Note Ending
>   diaeresis::, for a study of all the problematic cases.
> * A comma which follows a `c' is interpreted as a cedilla only if it
>   is followed by one of the vowels `a', `o' and `u'.
> File: recode.info, Node: Ending diaeresis, Prev: Diacritics, Up: Texte
> List of words ending with diaeresis
> -----------------------------------
> Here is a classification of all cases of a diaeresis at the end of a
> French word:
> * Words ending in "igue"
>   - Feminine words without a relative masculine: `besaigue"' and
>     `cigue"'.
>   - Feminine words with a relative masculine (1): `aigue"',
>     `ambigue"', `contigue"', `exigue"', `subaigue"' and
>     `suraigue"'.
> * Words not ending in "igue"
>   - Ended by "i" (2): `ai"', `congai"', `goi"', `hai"kai"',
>     `inoui"', `sai"', `samurai"', `thai"' and `tokai"'.
>   - Ended by "e": `canoe"'.
>   - Ended by "u" (3): `Esau"'.
> Notes:
> 1. There are supposed to be seven words in this case. So, one is
> missing.
> 2. Look at one of the following sentences (the second has to be
> interpreted with the `-c' option):
> "Ai"e! Voici le proble`me que j'ai"
> Ai:e! Voici le proble`me que j'ai:
> There is an ambiguity between an `ai"', the small animal, and the
> indicative present of *avoir* (first person singular), when followed
> by what could be a diaeresis mark. Hopefully, the case is solved
> by the fact that an apostrophe always precedes the verb and almost
> never the animal.
> 3. I did not pay attention to proper nouns, but this one showed up as
> being fairly evident.
> Just to complete this topic, note that it would be wrong to make a
> rule for all words ending in "igue" as needing a diaeresis. Here are
> counter-examples: `becfigue', `be`sigue', `bigue', `bordigue',
> `bourdigue', `brigue', `contre-digue', `digue', `d'intrigue',
> `fatigue', `figue', `garrigue', `gigue', `igue', `intrigue', `ligue',
> `prodigue', `sarigue' and `zigue'.

Date: Tue, 5 Dec 1995 12:47:55 +0100 (MET)
From: Harry Gaylord <galiard@let.rug.nl>
Subject: Re: 9.356 Lettres accentu&eacute;es

The point seems to be missed. SGML text with entities can always be
transferred safely and arrives intact anywhere. It can be
used with any SGML-conformant software now. Some clever SGML
editors have easy ways of entering it. It works now and will
do so in the future.

I don't have to remember whether a piece of software expects the
slash before or after the vowel. And I get the French directly
on screen. I even get &oelig;, which you can't using ISO 8859-1.

HG, Character Representation TC, TEI
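
The remark about oelig can be checked with a short Python sketch. It assumes
that Python's HTML entity table (html.unescape) is an acceptable stand-in
for the ISO entity sets used with SGML, since the names eacute and oelig
appear in both; the helper fits_latin1 is a hypothetical name introduced
only for this illustration.

```python
import html


def fits_latin1(text):
    """Return True if every character of text exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


# Expand the entity references into the characters they name.
subject = html.unescape("Lettres accentu&eacute;es")
word = html.unescape("c&oelig;ur")

print(subject, fits_latin1(subject))  # Lettres accentuées True
print(word, fits_latin1(word))        # cœur False
```

The e with acute accent survives a round trip through ISO 8859-1, while the
oe ligature has no code point there, so only the entity form (or a larger
character set) can carry it.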

Date: Tue, 5 Dec 1995 12:38:43 -0500 (EST)
From: Ian Lancashire <ian@epas.utoronto.ca>
Subject: encoding and interpretation

My thanks to Sperberg-McQueen, Burnard, and Durusau for their detailed
critiques of my comments. We need to discuss encoding issues
openly. No traditional humanities journal, I'm informed, has
reviewed TEI P3. Awareness of SGML in the humanities is weak.

It's good to have some consensus that SGML is interpretative. That's a start.
If so, do we not have to say that no editor can adopt anyone
else's tagset as markup without thoroughly understanding the meaning
and implications of the tags and of the syntax of SGML? Does this
concession also not mean that Fortier was right in saying
that tagging must be the responsibility of the individual researcher?

If so, no one can accept TEI or SGML without questioning their assumptions.

Those of you who are passionately committed to SGML and TEI should take
my comments at face value. My criticisms come from my personal
encoding of 3 volumes of English poetry 1500-1900, 11 Renaissance
dictionaries, and 2 17th-century books -- in SGML. I devised parallel encoding
guidelines for both SGML and COCOA-style encoding. Yes, I reported on
this at MLA in San Diego in 1994.

I am of course questioning both TEI and SGML. Isn't questioning what
scholarship is all about?

While serving on TEI committees, I objected strenuously -- to no
effect -- to the failure of TEI to address presentational (typographic) and
analytic bibliographic encoding problems. Here I use Sperberg-McQueen's
terms gladly to escape a confusion that some of you believe I share
with Goldfarb, the developer of SGML! He contrasts generalized
markup, i.e., SGML, with typographic markup, in his The SGML Handbook
(1990; see pp. 16-17). He argues there that you do not need to do typographic
markup, although I of course know that SGML as a syntax can be used
for that purpose -- after all, I cited HTML as a good example!

For some editors like myself, presentational markup is where you
begin. My objections to TEI were not welcomed. It was as if W. W. Greg,
Fredson Bowers, and Thomas Tanselle had never existed, as if they had
never discovered anything about the elements and structures of texts.

Despite what the three critiques say, the TEI DTD does *not* handle these
things and I have difficulty understanding how even SGML can be
*assumed* to handle either typographic encoding or the structures of
analytic bibliography when no one has published formally or informally
a successful DTD that does so.

Let me quote TEI P3 (p. 557):

    "These guidelines particularly do not address the encoding of
    physical description of textual witnesses: the materials of the
    carrier, the medium of the inscribing implement, the layout
    of the inscription upon the material, the organisation of the
    carrier materials themselves (as quiring, collation, etc.),
    authorial instructions of scribal markup, etc."

What could be plainer? The HI and SPACE elements and the REND
attribute, as well as the MILESTONE and FORMEWORK elements, do
not satisfy the needs of declarative markup for manuscripts or books,
despite the valuable work of Peter Robinson and the editors.

Now, am I also saying that you all should *not* use SGML, as Patrick
Durusau implies? Certainly not. I say this: the SGML community has
not demonstrated that it can handle the most basic textual structures
of use in the humanities, not after all the TEI discussions and
debates that raged from 1987 to 1994.

For example, how does one create a hierarchy of book, gathering, inner
and outer forms, and page? Hint: the problem is with the forms. Or,
how does one encode the acrostic at the beginning of Ben Jonson's Volpone,
or George Herbert's Easter Wings?
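
One possible reading of the hint, offered as an assumption rather than as the
argument of this message, is that the pages printed from one form (forme) are
not adjacent in reading order. The Python sketch below uses the standard
quarto imposition, in which a sheet folded twice carries pages 1, 4, 5 and 8
on the outer form and pages 2, 3, 6 and 7 on the inner form; the function
names are hypothetical.

```python
# Standard quarto imposition: one sheet, folded twice, eight pages.
OUTER_FORM = {1, 4, 5, 8}
INNER_FORM = {2, 3, 6, 7}


def form_of(page):
    """Name the form from which a page of the gathering was printed."""
    return "outer" if page in OUTER_FORM else "inner"


def contiguous_runs(pages, key):
    """Group pages, taken in reading order, into runs sharing one key."""
    runs = []
    for page in pages:
        label = key(page)
        if runs and runs[-1][0] == label:
            runs[-1][1].append(page)
        else:
            runs.append((label, [page]))
    return runs


runs = contiguous_runs(range(1, 9), form_of)
for label, pages in runs:
    print(label, pages)
```

In reading order the eight pages fall into five runs that alternate between
the two forms, so neither form can be a single contiguous container for its
pages. A hierarchy whose elements must each enclose an unbroken stretch of
the text therefore cannot place "form" between "gathering" and "page" without
splitting each form into fragments.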

Please let us all know if someone has done these things. TEI P3, at
1289 pages, is longer than Goldfarb's Handbook at 664 pages, but I
have not found any reference to forms or acrostics.

I did not say that TEI or SGML could not encode italics as italics but
rather that TEI and SGML were uninterested in what the TEI community
began to call "rendition" in those days.

Burnard cites McLeod that even reading a printed letter -- the smudge --
is an act of interpretation. That's true enough. Determining
presentational features involves judgment. (Note, however, that the
ink-blot isn't what you have to interpret; if you're at that level,
you have to examine the "impression" of the slug or the pen on the page.
Ink can spread, as McLeod might have known.)

Yet there is a sizable difference between asserting that a given ink-blot
is a b instead of a p, and including -- in TEI's core tagset -- tags
defining the structure of a poem without producing tags for a
rhythmical unit or for a metrical foot! It is certainly NOT obvious
that the fundamental unit of verse is the line. It is this kind of
bland interpretation in TEI that can arouse some skepticism.

If my fellow users of SGML want to develop a tagset for physical
description of books and manuscripts and for analytical bibliography,
they should contact Murray McGillivray at the Department of English at
the University of Calgary. He took the initiative of organizing a
successful physical and online conference on encoding medieval
manuscripts. As far as I know, we haven't restricted admission to
those who have experimented with fire in Toronto.