Humanist Discussion Group

Humanist Archives: Feb. 14, 2019, 6:04 a.m. Humanist 32.452 - the McGann-Renear debate

Humanist Discussion Group, Vol. 32, No. 452.
Department of Digital Humanities, King's College London
Hosted by King's Digital Lab
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

[1] From: C. M. Sperberg-McQueen
Subject: some notes on the origins of SGML and XML (128)

[2] From: Peter Robinson
Subject: Re: [Humanist] 32.451: the McGann-Renear debate (69)

[3] From: Gabriel Egan
Subject: Re: [Humanist] 32.451: the McGann-Renear debate (71)

[4] From: Desmond Schmidt
Subject: Re: [Humanist] 32.451: the McGann-Renear debate (44)

[5] From: Michael Falk
Subject: Re: [Humanist] 32.451: the McGann-Renear debate (80)

--[1]------------------------------------------------------------------------
Date: 2019-02-13 21:50:06+00:00
From: C. M. Sperberg-McQueen
Subject: some notes on the origins of SGML and XML

In Humanist 32.435, Martin Mueller writes:

The TEI rules were first expressed in SGML, a technology developed
by IBM.

This is at least half true.

The TEI was indeed first written as an SGML application; a team at IBM
did develop a system called GML (generalized markup language); and
SGML was in some sense a standardized form of GML.

But what GML contributed to SGML is probably better described as an
approach or a philosophy of text representation than as 'technology'.
The ideas of generic markup were also being developed by others outside
IBM: the book designer Stanley Rice; Bill Tunnicliffe of the Graphic
Communications Association (a printing-industry trade group) and what
became the GCA 'GenCode' (generic coding) committee; Brian Reid (whose
1980 dissertation described a document processing system called
Scribe, which used generic markup and later became a commercial
product and an inspiration for Leslie Lamport's LaTeX).

Some of the most important ideas in SGML were not present in GML, and
GML usage looks relatively little like the technological ecosystem
that eventually developed around SGML and XML.

In Humanist 32.436, Desmond Schmidt writes:

XML was invented by IBM and Microsoft, through the organ of the
W3C, to serve the needs of web services. Document processing was
very much a sideline.

This is nowhere close to half true.

All the members of the Working Group and 'Editorial Review Board'
responsible for the initial development of XML were document people:
Jon Bosak, Dave Hollander, Eliot Kimber, and Eve Maler had backgrounds
in technical documentation; Tim Bray, James Clark, Steve DeRose, Tom
Magliery, Jean Paoli, and Peter Sharpe had all developed major
document editing, document processing, and document publication
applications, and I had spent eight years as editor in chief of the
Text Encoding Initiative.

Some of us were convinced that SGML, being a system for the
representation of structured information, with user control over what
should be represented and how, had great potential for information of
all kinds, including the kinds of client-server applications that were
later developed under the rubric "web services", but the charter of
the WG was "SGML on the Web", and the members of the WG and the ERB
all used SGML first and foremost (or exclusively) as a language for
documents. Some ERB members like Tim Bray and Tom Magliery were
perhaps interested in the Web first and in SGML as a way to develop
and improve the Web; the rest of us were by all appearances more
interested in our documents. Our goal for the Web was only to make it
capable of delivering our documents without unwanted loss of
information.

After the initial draft was published, during the further development
of the spec, the WG continued to grow. It is possible that some of the
new members were interested primarily in what later became known as
web services, but I don't recall any discussion of issues relevant to
that interest.

After the XML spec was finished, database vendors and those interested
in what became web services took an interest because the notation
could in fact be used (as some of us had thought) for non-document
information as well, and was superior to the alternatives then on
hand. It is possible that Desmond Schmidt and others have mistaken
the promotional literature and hype of the late 1990s and early 2000s
for serious documentation on the origins of XML.

Database vendors, web-services enthusiasts, and programming-language
type theorists were all involved in the development of some of the
later XML-related technologies like XSD and XQuery, and the tensions
between 'data heads' and 'document heads' were palpable in some
working groups.

Desmond Schmidt also writes:

Humanists must follow business.

I'm not convinced this is true. It's certainly convenient when we can
use off-the-shelf hardware and software, but many of the milestones of
computer usage in the humanities involve humanists who developed their
own software (and sometimes hardware) when they judged commercial
offerings unsuitable. I will mention only the names Susan
Hockey, Claus Huitfeldt, Wilhelm Ott, David Packard, and Manfred
Thaller.

As Jon Bosak pointed out in a talk at the TEI@10 conference in 1998,
one reason for digital humanists to care about XML is that even if
commercial support and development were to disappear, the format
is simple enough that we can write our own software to process it
and need not rely on commercial vendors.

Nor is the opposition between the needs of humanists and the needs of
commercial applications nearly so clear-cut as the quoted sentence
suggests: there is no problem in humanistic work with texts that does
not have an analogue in commercial or bureaucratic applications, and
vice versa. And some technologies (e.g. Unicode and XML) were
developed by representatives of commercial interests and humanists
working together, attempting to provide results useful to multiple
user communities.

Desmond Schmidt is quite right to say that XML is no longer as
fashionable in non-document quarters as it once was. Those interested
in web services have discovered that it has many features like
validation and mixed content which are of interest to people
interested in texts and which they regard as not relevant for
themselves. It has these features precisely because it was developed
for documents by people who worked with documents, and not (pace DS)
for web services.
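
To make "mixed content" concrete, here is a minimal sketch using Python's
standard-library xml.etree.ElementTree. The sample sentence and its element
names are invented for illustration; the point is only that character data
and child elements are interleaved inside one parent, which is the normal
case for documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a paragraph mixing character data with child elements.
SAMPLE = (
    "<p>The <name>TEI</name> was first written as an "
    "<term>SGML</term> application.</p>"
)

def flatten(elem):
    """Return the pieces of a mixed-content element in document order."""
    pieces = []
    if elem.text:
        pieces.append(("text", elem.text))
    for child in elem:
        # The child's own content, then the character data that follows it.
        pieces.append((child.tag, "".join(child.itertext())))
        if child.tail:
            pieces.append(("text", child.tail))
    return pieces

pieces = flatten(ET.fromstring(SAMPLE))
for kind, content in pieces:
    print(f"{kind:5} {content!r}")
```

Joining the pieces back together reproduces the running sentence, which is
what makes this shape of data natural for text and awkward for record-style
data.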

Those who choose their computing tools based on their current vogue
will naturally also choose to migrate away from XML. Those who care
about user control of data and suitability of tools for tasks should
make choices based on careful examination of the relevant tools and
not based on the winds of current fashion.

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************

--[2]------------------------------------------------------------------------
Date: 2019-02-13 16:50:14+00:00
From: Peter Robinson
Subject: Re: [Humanist] 32.451: the McGann-Renear debate

Donald McKenzie tells us, the history of reading is the history of misreadings
(or something like that).

By a “text-complete theory” I don’t mean that every text is completely described
by this theory. Quite the reverse. Just that every text can be represented
according to this theory. That is: that every text may be represented as a
collection of leaves all present on two unrelated trees (or OHCOs). This is
minimal only, if you like: and any kind of work is likely to add far more to
that basic representation.

So, let’s push this a little further. The basic propositions again, shared by
every text which has existed, does exist, can exist (hence, “text-complete”):

1. All texts are real, in that each and every text is an act of communication
present in a physical document
2. Therefore, every text has at least two aspects: it is an act of
communication; it has physical properties in terms of the document in which it
is present
3. Each aspect may be represented as a OHCO: an ordered hierarchy of content
objects, a tree
4. The two trees are entirely independent of each other, and of any other tree
hypothesized as present in the text

Your first challenge. If this is not correct, you should be able to put up lots
of examples (one will do!) of texts of which these propositions are not true. Go
ahead. Knock yourself out. (A side note: axiomatically, one document may
contain, and usually does contain, multiple acts of communication: hence,
multiple texts. Equally, an act of communication may appear in many documents,
or many times in one document; acts of communication may be related to each
other, as versions or revisions, and may again appear in one document, many
documents, many times in one document.)

A few more notes. As both act of communication and document may be represented
by tree structures, we can use all the well-known tools of tree structures for
each aspect of the text. For every text there is, we can create tables of
contents for both the document, page by page etc., and the act of
communication, by act, scene, speaker/line. It means we can navigate the trees
for each aspect, from line to line, page to page, act to act, act to scene to
line, up, down and across. Useful. Every text, two tables of contents.

Further: as each leaf of text exists on both trees, it makes sense — for every
text that ever was, is, shall be — to ask these questions:
What documents does this act of communication exist in? (what manuscripts
contain the Gospel of John?)
What acts of communication does the document contain? (what parts of the Bible
are present in Codex Sinaiticus?)
What parts of what acts of communication are on this page, in this column, in
this writing space?
What pages of what documents contain this part of this act of communication
(what pages of what mss have John 1.1?)

Again, all you have to do is find a text for which these are not meaningful
questions, and the proposition is disproved.
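
The four questions above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the ontology or API
described at the wiki link below: the Leaf record, the sigla MS-A and MS-B,
the page and line labels, and the function names are all hypothetical.

```python
from collections import namedtuple

# Each leaf of text sits on two independent hierarchies: one for the
# document (manuscript, page, line) and one for the act of communication
# (work, chapter, verse). All sigla and locations below are made up.
Leaf = namedtuple("Leaf", ["text", "doc_path", "comm_path"])

LEAVES = [
    Leaf("In the beginning was the Word", ("MS-A", "p1", "l1"), ("John", "1", "1")),
    Leaf("The same was in the beginning", ("MS-A", "p1", "l2"), ("John", "1", "2")),
    Leaf("All things were made by him", ("MS-A", "p2", "l1"), ("John", "1", "3")),
    Leaf("In the beginning was the Word", ("MS-B", "p7", "l4"), ("John", "1", "1")),
]

def under(path, prefix):
    """True if `path` lies at or below `prefix` on its tree."""
    return path[:len(prefix)] == tuple(prefix)

def documents_containing(comm_prefix):
    """What documents does this act of communication exist in?"""
    return sorted({leaf.doc_path[0] for leaf in LEAVES
                   if under(leaf.comm_path, comm_prefix)})

def communication_in(doc_prefix):
    """What parts of what acts of communication are in this document or page?"""
    return sorted({leaf.comm_path for leaf in LEAVES
                   if under(leaf.doc_path, doc_prefix)})

def pages_containing(comm_prefix):
    """What pages of what documents contain this part of the communication?"""
    return sorted({leaf.doc_path[:2] for leaf in LEAVES
                   if under(leaf.comm_path, comm_prefix)})

print(documents_containing(("John", "1", "1")))
print(communication_in(("MS-A", "p1")))
print(pages_containing(("John", "1", "3")))
```

Because every leaf carries both paths, each question is a filter on one tree
followed by a read-off from the other; neither hierarchy has to contain the
other.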

I don’t say, again, that this is all there is to texts. Not at all. Or that
there are not other phenomena in texts which are not OHCO trees, etc etc. Or that
one might indeed dismember and re-member a text in entirely different ways.

By the way: this also answers the problem posed by Michael Falk: you encode
instances of the act of communication in each document you encounter (e.g. );
you locate each instance in every document and compare them. Add more
documents, with the same encoding for the act of communication, and so on. And
use CollateX for the comparison if you want really useful results.

More reading again..
https://wiki.usask.ca/pages/viewpage.action?pageId=1306492976 shows how we
implement this via an ontology into an API. We call every part of an act of
communication an “entity”, with the act of communication itself being a single
entity.

--[3]------------------------------------------------------------------------
Date: 2019-02-13 13:56:16+00:00
From: Gabriel Egan
Subject: Re: [Humanist] 32.451: the McGann-Renear debate

Dear HUMANISTs

Michael Falk gives an eloquent account of the difficulty
of recording in a single XML document the changes made
to a poem during its revision by its author. To Falk,
the alteration seems to sprawl across XML units (in this
case, lines) so that to record it the editor must do
one of the following: i) break the XML principle of nestedness,
or ii) preserve nestedness by treating whole lines as the
subjects of revision when in fact only parts of lines were
revised, or iii) preserve nestedness by recording a single
revision as multiple revisions, each occurring wholly within
its own line.

It seems to me that in Falk's example the editor might be
trying to record more than he knows. He writes that in
one revision "the word 'shores' has been moved yet again to
another element". What does it mean, exactly, for a writer
to move a word? One possibility is that the writer has
crossed out "shores" in one place and written it again
in another. A second possibility is that the writer has
inserted matter before "shores" that closes off the
element (say, a line) that "shores" used to be within,
making the same inscription of "shores" now appear inside
a different line element.

In either case, it is not indisputable that the word
"shores" before the revision is the same as the word
"shores" after the revision. Falk writes that the
poet has "moved" the word "shores" but it is equally
reasonable to say that the poet has deleted an
occurrence of "shores" and made a new one, especially
if a new inscription of the letters "s-h-o-r-e-s" has
occurred. Looking at the same phenomenon this way,
the revision did not necessarily break the hierarchical
nestedness. That is, I'm not clear what would count in
this instance as evidence that the author made a "single
act of revision" rather than two acts.

Indeed, what makes "shores" the unit that was "moved"?
We might say that poets think in terms of words, so words
are the natural units by which to describe revisions.
But it is at least equally arguable that poets think
in terms of lines or that they think in terms of
phonemes or even individual letters. If a poet deletes
"the" in line one and turns "me" in line two into
"theme", is this two changes or a single one in which
the letters "t-h-e" were moved from line one to line
two? It seems to me that the answer to that question
is non-obvious and requires us to state more explicitly
what count as the units of revision. There really is
more than one way to describe the difference between
two versions of something.
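
The point that there is more than one way to describe the difference between
two versions can be illustrated with Python's standard difflib module. The
two-line "poem" below is invented to match the the/me/theme example, with
" / " standing in for the line break; the helper name `changes` is likewise
hypothetical. Comparing the versions word by word and then letter by letter
gives two different accounts of the same revision.

```python
from difflib import SequenceMatcher

def changes(before, after):
    """Return the non-matching spans between two sequences."""
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    return [
        (tag, before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Invented example; " / " marks the line break.
BEFORE = "sing the song / for me"
AFTER = "sing song / for theme"

# Unit of revision = the word.
word_changes = changes(BEFORE.split(), AFTER.split())

# Unit of revision = the letter.
letter_changes = changes(BEFORE, AFTER)

print("by word:  ", word_changes)
print("by letter:", letter_changes)
```

With words as the unit, "the" is deleted and "me" is replaced by "theme";
with letters as the unit, the letters "the" disappear in one place and appear
in another, which reads more like a move. The choice of unit is made by the
person describing the revision, not by the texts themselves.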

Is not the problem Falk is describing a conflict between
a hierarchy he has implicitly (unconsciously?) imposed
on the text in order to describe its revision and the
hierarchy he inherits from the XML encoding scheme
he has selected? If so, that is not in itself a criticism
of XML's principle of a hierarchical nestedness. It
seems to me that here again XML's insistence on
hierarchical nestedness forces us to think more
clearly about what we are doing in editing texts and
diagnosing revision.

Regards

Gabriel Egan

--[4]------------------------------------------------------------------------
Date: 2019-02-13 10:26:47+00:00
From: Desmond Schmidt
Subject: Re: [Humanist] 32.451: the McGann-Renear debate

On 2/13/19 Hugh Cayless wrote:

> Michael Falk's contribution seems to me to exemplify many of the kinds of
> error we see in these sorts of discussions

> >
> > May I just reiterate the point Bill Pascoe made a few emails ago. It is not
> > the case that: "3. Each aspect may be represented as a OHCO: an ordered
> > hierarchy of content objects, a tree." This is the central weakness of XML
> > as a universal markup language. It insists on an impossibly strict nesting
> > of elements.
>

> I'm afraid if you're going to make assertions about impossibility, you will
> have a very hard time proving them.

It is quite possible for a TEI encoding of holograph manuscripts to be
so complex that it is practically, although not literally, impossible
to edit. That is, it is just as likely to be damaged as improved by
any attempt to edit it. If it is shared by a group of editors this
level of complexity is reached much sooner. The problem then becomes:
how do you communicate your understanding of the "howling wind-storm"
of tags that results to your colleagues so they may share your
interpretation of the textual phenomena being described?

Here is a moderately difficult example. A succession of hired
transcribers simply refused to encode this for us. I wonder how
hierarchies help us here?

http://charles-harpur.org/corpix/english/harpur/A87-2/00000131a.jpg

Breaking it down into separate layers as we have done is close to the
method Michael describes, and renders the editorial task perfectly
manageable.

http://charles-harpur.org/View/Twinview/?docid=english/harpur/poems/h509&version1=/h509b/layer-final

Desmond Schmidt
eResearch
Queensland University of Technology

--[5]------------------------------------------------------------------------
Date: 2019-02-13 07:30:03+00:00
From: Michael Falk
Subject: Re: [Humanist] 32.451: the McGann-Renear debate

I appreciate Peter Cayless's response. I overstated my case. I'm not
against TEI XML in general. It's a very useful way of encoding the
structure of a document, and has some useful metadata standards that I
think anyone who's into distant reading appreciates.

Nonetheless, a few points:

(1) My aim was simply to show that there are common cases in textual
editing where the strict nesting of XML elements causes inconvenience. The
poet I drew my example from is an extreme case, where virtually every poem
exists in multiple manuscripts, and virtually every manuscript has multiple
layers of revision. In other cases XML works very well. Some of the
problems with my particular example could be fixed if the standard were
changed. For instance, pagination creates just the kind of frustrating
overlapping of elements that lineation does in my example. I can't access
the TEI documentation at the minute, but as I recall, this problem has been
overcome by representing page breaks as self-closing elements. This would,
however, be quite difficult to do for text elements like lines.
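
The milestone approach recalled in point (1) can be sketched as follows,
using Python's standard-library xml.etree.ElementTree. The fragment and the
function name are invented for illustration, and the sketch assumes that page
breaks occur only inside lines: the page break is an empty element, so a line
can run across a page boundary without any element overlapping another.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the second line straddles a page boundary.
SAMPLE = """<lg>
  <l n="1">first line, wholly on page one</l>
  <l n="2">second line begins on page one <pb n="2"/>and ends on page two</l>
  <l n="3">third line, wholly on page two</l>
</lg>"""

def lines_by_page(root, first_page="1"):
    """Map each page number to the numbers of the lines with text on it."""
    pages = {}

    def record(page, line_no):
        seen = pages.setdefault(page, [])
        if line_no not in seen:
            seen.append(line_no)

    current = first_page
    for line in root.iter("l"):
        line_no = line.get("n")
        if line.text and line.text.strip():
            record(current, line_no)
        for child in line:
            if child.tag == "pb":
                # A milestone: everything after it belongs to the new page.
                current = child.get("n")
            if child.tail and child.tail.strip():
                record(current, line_no)
    return pages

pages = lines_by_page(ET.fromstring(SAMPLE))
print(pages)
```

The cost is visible in the code: the page is no longer a container that can
be selected directly, so its contents have to be reconstructed by walking the
text in order. Doing the same for lines would move that cost onto the verse
structure itself.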

(2) Peter makes the same suggestion that different "works" should be
treated as different texts and be encoded in different XML documents. This
is more or less exactly what I was suggesting. I must say I do not agree
that my nonce encoding involved "mashing" three different works together,
though he would be perfectly entitled to make that decision in his own
edition. It entirely depends what you mean by "work," a word whose
definition will always be contested.

(3) As for my faith in algorithms, I must declare an interest. The system
my colleague Desmond Schmidt has designed for the Charles Harpur Critical
Archive appears to work quite well. But I might reframe my point in the
light of James Rovira's question and Peter's critique. I do not want to
claim that algorithms are the solution where XML is not. It was simply an
example of an alternative approach. A database of versions and revisions
would be another. These approaches all have their advantages and
disadvantages, of course. A database requires a whole software ecosystem to
run, where a bunch of XML files are pretty resilient to changing software
standards and virtually self-describing. Graph representations like
Desmond's allow rapid collation and efficient storage, but are not human
readable. As Peter quite rightly says, it all depends on your purpose.

(4) So far I agree with Peter and simply wish to refine my argument. But
there is one point where we disagree. I don't think it is true that
representations are equivalent if they can be transformed into each other
without loss of information. Perhaps I misunderstand Peter's point, but
this seems to overlook information entropy. It must be obvious that some
representations are more efficient than others, and encode the same
information in fewer bits. Otherwise it would be "paragraph" not "p",
and "line-group", not "lg." But there is also the more important point in
practice, that some representations are more laborious to make than others.
In many common cases of textual editing, XML is both more laborious than
the alternatives, and it would not surprise me if it also required many
more bits for the same information (though I could well be wrong there). In
other cases, like preparing a well-structured reading text to be rendered
on a variety of devices in different ways, it is surely the ideal
technology. There is also the simple matter of elegance, which matters
because it goes to the interpretability of a representation by a human.
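
A small sketch of the distinction drawn in point (4), with invented sample
text and a hypothetical helper `rename_tags`: two encodings that convert into
each other without loss can still differ in size. It measures raw length
only, not entropy in the strict sense, and it assumes simple markup with no
"<" characters in the text itself.

```python
import re

RENAMES = {"line-group": "lg", "line": "l"}

def rename_tags(xml, mapping):
    """Rename elements in start and end tags, leaving the content untouched."""
    def swap(match):
        slash, name = match.group(1), match.group(2)
        return "<" + slash + mapping.get(name, name)
    return re.sub(r"<(/?)([A-Za-z][\w.-]*)", swap, xml)

# Invented sample in a deliberately verbose vocabulary.
verbose = "<line-group><line>Of shores</line><line>and seas</line></line-group>"

terse = rename_tags(verbose, RENAMES)
restored = rename_tags(terse, {short: long for long, short in RENAMES.items()})

print(len(verbose), verbose)
print(len(terse), terse)
print("round trip is lossless:", restored == verbose)
```

The round trip shows the two forms carry the same information, while the
lengths show they do not cost the same to store or to type. Whether that
difference matters depends, again, on the purpose.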

I think these points are important, because I have seen projects (I will
not name them) that have opted for XML early on, and involved themselves in
excruciating labours down the line that possibly could have been avoided.
To me it seems that the tree-structure of XML is usually the issue, so that
is what I criticised. I don't mean to rant against it. I use TEI XML all the
time in my own work. Having the metadata in a standard format, good
abstractions of common text types, and a range of supportive technologies
like XPath and XSLT make it a joy to use in many applications. I intend
only those types of criticism that Peter welcomes.

Michael Falk

--
Michael Falk
Developer and Research Project Manager
Digital Humanities Research Group
Western Sydney University

Living and Writing in the Blue Mountains
https://www.michaelfalk.com.au

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php

Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)

This site is maintained under a service level agreement by King's Digital Lab.