Humanist Discussion Group, Vol. 32, No. 487.
Department of Digital Humanities, King's College London
Hosted by King's Digital Lab
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

[1] From: C. M. Sperberg-McQueen
    Subject: What's it all about, really? Yet another coda to the McGann/Renear thread (425)

[2] From: C. M. Sperberg-McQueen
    Subject: SGML and the legacy of print (and other topics) (386)

[3] From: Desmond Schmidt
    Subject: Re: [Humanist] 32.486: editions, in print or in bytes (45)

[4] From: Katherine Harris
    Subject: Re: [Humanist] 32.486: editions, in print or in bytes (66)


--[1]------------------------------------------------------------------------
Date: 2019-02-24 07:11:50+00:00
From: C. M. Sperberg-McQueen
Subject: What's it all about, really? Yet another coda to the McGann/Renear thread

In Humanist 32.486, Dana Paramskas asks:

    I admit to being fascinated by the discussion, although slogging through an alphabet soup of acronyms for various codings makes it a wee bit hard to understand.

Guilty as charged. I see, moreover, glancing through the long history of the thread, that I am among the chief offenders (i.e. users of unexplained acronyms). As penance, I will append to this mail a glossary of the acronyms found in Humanist 32.452 (since DP's subject line refers specifically to that item) and subsequent contributions to the thread, as well as some other terms referred to without explanation that I hope may be helpfully explained. The details of the argument do occasionally hinge on very specific technical issues, but anyone who will admit to being interested should be made as welcome as can be managed. Please be patient if the explanations given below don't hit the right level for you.

DP asks:

    The core question is (please correct a non-coder if I have misunderstood): what is the "real" text?

Probably. When the wind is in the southeast and I am a nominalist, I am not convinced that there is any single identifiable core question in the discussion, merely a long sequence of remarks provoking other remarks which provoke yet others, just a chain reaction of readers turning into writers and back into readers. Historically, though (and when the moon is nearly full and I admit to my inner Platonist again), yes, there is a core question, and yes "what is the 'real' text" is a good first approximation to it. The question, though, has several senses.

In connection with editions, Desmond Schmidt has referred to the long tradition of attempts to understand the transmission of a text well enough to trace the history of errors or other changes made in the course of reproduction, organize the extant textual witnesses into a stemma or tree of descent, and use systematic methods to reconstruct the text of the document at the root of the tree (the archetype), or (with the aid of conjectural emendation) the text of the work as it left the hands of the author. This method is best known in the context of ancient and medieval works, where it is usually associated with the name Karl Lachmann (an important figure both in the study of Latin literature and in that of medieval German literature), and there is a long and sometimes interesting history of debates over whether the presuppositions of Lachmann's method hold in the real world, or in our understanding of text: is the "real" text really the one that left the hands of the author, or is the "real" text the text(s) actually read by readers of the work? Can we in any case assume that what left the hands of the author was "a" text in the singular?
What if it was a mass of notes not yet put into final order (think Vergil, Chaucer, Büchner, ...)? It can also be argued that the transmission pattern of manuscript works is often not really a tree in which each new manuscript is copied from exactly one exemplar: scribes sometimes switch back and forth between exemplars, some manuscripts seem to have been created by copying from two exemplars at once (now following the one, now the other), and lyric was clearly often transmitted by memory, which will typically defy reduction to a stemma. A second sense of the question is given by a paper by Allen Renear and others, widely cited and almost as widely read, which posed in its title the question "What is text, really?" (For non-philosophers, it may not be obvious at first glance that the last word of the title conceals a wicked Platonic sting; be forewarned.) Renear and his co-authors inquire into the "real" nature of text, partly as a question with its own intrinsic interest and partly as a way of discussing different approaches to the electronic representation of text. The implicit assumption (widely shared, I think) is that an electronic representation which comes closer to capturing the "real" nature of text will be superior to others not just on theoretical grounds but for practical reasons as well. Renear et al. argue that text consists of -- or rather just *is* -- an 'ordered hierarchy of content objects' (OHCO), and that SGML is a good representation for text precisely because SGML elements form an ordered tree. (SGML elements are 'ordered' because the sequence in which sibling elements occur in the document is considered essential information; the first paragraph of the chapter and the second paragraph of the chapter cannot be interchanged without changing the document. This distinguishes SGML from typical database management systems in which constituent objects are explicitly not ordered. They form a 'hierarchy' in the sense that they nest, so each element is directly contained by exactly one 'parent' element, except the outermost element of all, which has no parent element.) They conclude their paper by pointing to a number of textual phenomena like hyperlinks and overlapping structures, which go beyond the OHCO model, and suggest that a really good representation of text will have to provide both for hyperlinks and for multiple ordered hierarchies. (This concluding section is customarily ignored by those who object to the OHCO model on the grounds that texts may have multiple hierarchical structures; 'the OHCO thesis' is universally taken to be the claim that texts have only one hierarchical structure, despite the fact that Renear et al. explicitly deny that proposition. It is also customary to ignore the fact that Renear et al. do not define "content object" and to pretend that it's clear to everyone what the phrase denotes. That is, no one else defines it, either. After struggling with it for a while, I concluded that in practice the term means 'thingy; doohickey; whatchamacallit'.) 
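To make the OHCO terminology concrete, here is a minimal sketch (a toy example of my own, not taken from Renear et al., and using Python's standard xml.etree.ElementTree purely for illustration) of the two properties just described: sibling order is significant, and every element except the outermost has exactly one parent.

    # Toy document: a chapter containing a title and two paragraphs.
    import xml.etree.ElementTree as ET

    doc = """<chapter>
      <title>I</title>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </chapter>"""

    chapter = ET.fromstring(doc)

    # 'Ordered': the sequence of children is part of the information;
    # swapping the two <p> elements would change the document.
    print([child.tag for child in chapter])   # ['title', 'p', 'p']

    # 'Hierarchy': each element is directly contained by exactly one
    # parent element; only the outermost element has none.
    parents = {child: parent for parent in chapter.iter() for child in parent}
    print(parents[chapter[1]].tag)            # 'chapter'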
The first sense of the question ("real" text as the text sought or constructed by an editor) ties into the second sense of the question (the Platonic nature of text and the best way to represent text electronically) by way of Desmond Schmidt's arguments that (a) the search for a unitary archetypal text is not a task given to us by reality but an artefact of printed distribution, best abandoned in the digital age, and (b) that XML and SGML (see glossary below) are ineluctably tied to the worldview of print and also best abandoned in the digital age. A third sense of the question relates to the relative importance of particular physical realizations of a text (inscription in a particular document, or vocal performance at a particular place and time) and those properties of a text common to multiple physical realizations. Jerome McGann is often cited as an example of a modern critic who argues that the meaning of literary texts is inextricably bound up with their bibliographic realization and that the abstract view of text he believes inherent in SGML and XML (or possibly only in TEI, since his own projects often also use XML) is necessarily a falsification of the "real" text, because it does not attend closely enough to the source text's physical realization. (In this context, it is slightly disappointing that in the debate referred to in the "McGann-Renear debate" thread, McGann's interpretation of the literary text at issue was a perfectly conventional close reading of the text focused on style and language, with no specifically bibliographic argument at all.) It is, as I understand it, this argument over whether the "real" text is an abstract object (realized in possibly different ways in different documents) or a concrete physical object, which Peter Robinson is attempting to dissolve with his postulate that every text has two aspects, each representable as an ordered hierarchy. (Q. Abstract or physical? A. Both, please!) A fourth sense of the question is related to the third but involves a disagreement about how complete we should make our representations of text. To take a concrete example: in an electronic representation of a play by Shakespeare, should the metrical hierarchy (acts, scenes, verse lines, possibly verse groups, and maybe even feet?) be represented "explicitly" in the electronic form? What about the dramaturgical structure (acts, scenes, speeches, stage directions)? And the typographic structure (quires, leaves, pages, columns, typographic lines, ...)? Some participants in the discussion feel (or so I gather) that these structures should be represented explicitly in the electronic document format only if they are "real", which for some discussants apparently means only if they were demonstrably present in the mental acts of the author. Some participants have suggested that some of the structures can safely be omitted, because software can (or will soon be able to, just as soon as artificial intelligence progresses a bit further) recognize them even if they are not represented explicitly in the electronic document. For each of these propositions, there are those who will deny them. 
("Artificial intelligence is not now any match for human intelligence in annotating texts"; "textual structures can be real even if they were not part of the author's conscious world"; "textual structures can be real even if they aren't part of the author's world at all [since texts are a collaboration between author and reader, all this talk of the author without talk of the reader is in any case misguided]"; and "the electronic format should contain the information we need for processing, whether that is a 'real' part of the text or a 'not real' part of the text or information which is not part of the text at all but may yet be of use [like a library shelf number for the exemplar from which an electronic document was transcribed]".) As someone rightly pointed out, the question even arises for a single, author-vetted version, which as always must rely on or conform to the reader's interpretation, short of an extended dialogue with the author (or even then). If an author chooses to make available multiple versions, with additions or corrections and without a final, definitive version, is she or he not saying simply - up to you all to figure it out, and have fun? Quite. Those who believe (as I think Desmond Schmidt does) that SGML and XML are intrinsically tied to the print medium, and that the print medium is inescapably tied to a single-text notion of textual reality, see such texts as illustrating the fundamental inadequacy of SGML, XML, and TEI. Those who doubt either of those premises will see such texts as interesting examples which illustrate the power and utility of SGML, XML, and TEI. Then there is a jump in logic which I have trouble understanding because, of course, I am not a coder. Each code -- or whatever it's called -- has its own structure or hierarchy. While this adds a measure of complexity, does it really help to identify an "original text"? Could someone in the know explain to a naive reader just how the various codings do or do not aid in reaching the "original text" aside from the arguments about the ease or validity of the different codings? I believe that the multiple senses of the "real text" are here crossing their wires. As far as I know, no one is arguing that any particular representation of text will make it materially easier or harder to reconstruct an archetypal or original text; as I understand the discussion so far, then, the answer to your "how do the codings aid ..." question is "they don't". There *is*, however, an argument in the other direction. At least one or two participants in the discussion have argued that any attempt to represent the text of multiple textual witnesses in a single electronic document will necessarily cause painful difficulties in the electronic document, and further that the hierarchical structure of SGML and XML documents makes the difficulties even worse than they would otherwise be. So although I don't think anyone has argued that particular choices in text formats can help establish an original text, there *have* been arguments that particular choices in text formats can make it harder. There is, needless to say, no more unanimity on this point than on any other. And always and everywhere, there is the sense that making particular textual features explicit in an electronic representtion makes them easier to identify, not for humans, but for machines. 
As Elli Mylonas sometimes says “in an XML application, the software is stupid and the data are smart.” Those of us who expect software to remain stupid for a long while want to make explicit our understanding of a text, in order that software which is not intelligent can at least be well informed. “Markup”, as Lou Burnard used to say (and no doubt still does) “is a hermeneutic activity.” If we think of the features we mark up as (part of) the "real" text, then yes, the codes (tags) we put into the text are to help readers (software or human) identify what we believe is the real text.

It is possible that I have misunderstood your questions, or not answered them satisfactorily; if so, I ask for your patience. Some of the points at issue are best illustrated with concrete examples, but clear short examples are in my experience very hard to come by, and in a discussion about things becoming unwieldy, concrete examples are likely to require the explanation of so many technical details that I feared they would make things even harder to follow than you have already found them.

I hope this helps, a little.

-C. M. Sperberg-McQueen

Glossary of acronyms, unexplained references, etc. (Further information on any of these may be found by web search.)

ACE = 'Automatic Computing Engine'. The name of a computer designed by Alan Turing in 1945; a scaled-down version was completed by the National Physical Laboratory in 1950 and a second version by 1952/53.

ADHO = 'Association of Digital Humanities Organizations'. An umbrella organization for the cooperation of regional professional societies focused on digital humanities.

ASCII = 'American Standard Code for Information Interchange'. An extremely influential coded character set which specifies electronic representations for 95 graphic characters (the letters 'A' to 'Z' and 'a' to 'z', the digits '0' to '9', and numerous punctuation marks). In discussions of textual representation, terms like 'ASCII text' are often used to denote textual representations which do not use explicit markup in the style of SGML, XML, or COCOA.

COCOA = 'Word COunt and COncordance Program on Atlas'. A batch concordance system using an influential form of markup, developed at the Atlas Computing Laboratory in the 1960s (? or 1970s?).

CSS = 'Cascading Style Sheets'. A language specified by W3C for describing the rendering of HTML (and, more generally, XML) documents in browsers (and, eventually, in other media).

DTD = 'document type definition'. A formal specification of the types of elements used in marking up documents of a given type. A document which conforms to the rules specified in the DTD is said to be 'valid'.

EDVAC = 'electronic discrete variable automatic computer'. A binary stored-program computer built by J. Presper Eckert and John Mauchly at the University of Pennsylvania on contract to the U.S. military in 1946-49; the successor to ENIAC. Widely known because of the publication of a design document attributed to John von Neumann.

ENIAC = 'electronic numerical integrator and calculator'. A vacuum-tube computer built by J. Presper Eckert and John Mauchly at the University of Pennsylvania during World War II.

ERB = 'editorial review board'. A small body within the W3C’s “SGML on the Web” working group responsible for supervising the work of the editors, and in particular making design decisions about what the spec should say.

GML = 'Generalized Markup Language'.
An IBM system for descriptive markup devised by Charles Goldfarb, Ed Mosher, and Ray Lorie; GML eventually shipped as part of IBM's Document Composition Facility product, and was re-implemented at the University of Waterloo as an add-on to Waterloo Script.

Hockey, Susan. The longtime head of the Humanities Computing Unit at Oxford University Computing Services; longtime chair of the Association for Literary and Linguistic Computing (ALLC), now the European Association for Digital Humanities (EADH); author of A Guide to Computer Applications in the Humanities; participated in the development of the COCOA system; led the development of the Oxford Concordance Program.

Huitfeldt, Claus. Professor of Philosophy at the University of Bergen, founding director of the Wittgenstein Archive there, and editor of the Bergen Electronic Edition of Wittgenstein's posthumous papers (the 'Nachlass'); declined to use SGML and TEI for transcribing Wittgenstein and instead invented MECS and MECS-WIT (q.v.) for the purpose.

LaTeX. A set of macros for TeX (q.v.) which provide a system of high-level descriptive markup for documents (so documents can be described in terms of sections, subsections, and the like, and not solely in terms of font family and size, etc.).

MECS = 'Multi-element coding system'. An alternative to SGML developed by and for the Wittgenstein Archive at the University of Bergen.

MECS-WIT. A specific MECS-based markup language developed by the Wittgenstein Archive at the University of Bergen for the transcription of Wittgenstein's notebooks.

OCP = 'Oxford Concordance Program'. A batch concordance system using (an extension of) COCOA markup, developed in the late 1970s.

OHCO = 'ordered hierarchy of content objects'. A model of text outlined by Allen Renear et al. in their paper "What is text, really?" Often asserted to be the model of text underlying SGML and the TEI.

Ott, Wilhelm. Founder and longtime head of the Department for Literary and Documentary Data Processing at the University of Tuebingen computer center; creator and co-programmer of Tustep, an extensive and tightly integrated set of programs for encoding text, processing text, and producing editions (both scholarly and commercial).

Packard, David. Classicist, philanthropist (and son of the electrical engineer David Packard) and creator of the Ibycus minicomputer system (a customized version of a Hewlett-Packard minicomputer) and (in the 1980s) the Ibycus Scholarly Personal Computer. Both Ibycus systems had hardware assistance for quick search in the Thesaurus Linguae Graecae and other texts using the same markup as TLG.

SGML = 'Standard Generalized Markup Language', a set of rules for electronic representation of documents. Formally SGML is defined by International Standard ISO 8879, published in 1986. Strictly speaking what SGML defines are rules for defining rules for document representation; SGML is a "meta-language" (a language for defining and talking about languages). Its rules are specific enough that all markup languages defined using SGML have certain family resemblances, so often in discussions like this one people speak as if all SGML-based document languages were alike, or as if there were only one. Notable properties of SGML-based languages are that documents consist of a mixture of 'character data' and 'markup', that the most prominent form of 'markup' are 'tags' which mark the beginning and ending of portions of the document called 'elements', and that elements nest within each other like Russian dolls.
The last property has the consequence that two elements can nest one within the other, or be entirely separate, but cannot overlap each other. An optional feature of SGML called 'concurrent markup' and signaled using the keyword CONCUR allows multiple sets of tags in a document; the elements marked by each set must nest, but elements in different sets may overlap. Each set of elements is defined formally by a 'document type definition' or DTD (q.v.).

SOAP = 'Simple Object Access Protocol'. A set of rules for representing programming-language objects in XML for transmission over the network and reconstitution as objects at the other end.

TEI = 'Text Encoding Initiative', an international project begun in 1987 which produced a set of "Guidelines for Text Encoding and Interchange" in 1994, which took the form of an SGML document type definition (or rather: a set of SGML DTDs) and documentation of their meaning. Since 2000 (or so), the Guidelines have been maintained by the TEI Consortium; among the changes made is that the Guidelines now use XML rather than SGML as their basis.

TeX (tau epsilon chi, conventionally written 'TeX' in Latin contexts), a batch formatter developed by the Stanford computer scientist Donald E. Knuth for producing typeset output, with particular emphasis on the typesetting of mathematical expressions.

Thaller, Manfred. German historian and digital humanist; creator first of CLIO (an adaptation of a commercial database management system for managing data sets of interest to historians) and then of Kleio (a PC-based database system for historians without any reliance on commercial database management systems) and other systems; longtime advocate for the creation of a 'historical informatics' as a discipline distinct from and independent of computer science.

Tustep = 'Tuebingen System of Text Processing Programs'. ...

W3C = World Wide Web Consortium. The standards development body responsible for Web technical standards like HTML, CSS, XML, and many others. W3C created an "SGML on the Web" working group in 1996, reorganized later as the "XML" working group. For historical reasons, W3C standards are called "Recommendations".

WG = 'working group'. The organizational unit for technical work in ISO, W3C, and many other standards development organizations.

XML = 'Extensible Markup Language', a subset of SGML specified in 1998 by the World Wide Web Consortium (W3C). XML restricts many of the choices left open by SGML, in the interest of making it easier to write software to support XML. Most of the restrictions affect only the syntax of the format, and not its model of text, but there is one exception: the CONCUR feature is not part of XML.

XQuery. A query language for XML databases, developed by a W3C working group between 1998 and 2011.

XSL-FO = 'Extensible Stylesheet Language - Formatting Objects'. An XML vocabulary for high-level abstract descriptions of pages.

XSLT = 'Extensible Stylesheet Language: Transformations'. A programming language for specifying processes which transform one XML document (the input) into another (the output); one of the most widely used tools for processing XML data.

XSD = 'XML Schema Definition Language'. An XML-based language developed by W3C for defining schemas for XML-based markup languages; XSD schemas are one of several alternatives to DTDs.

******************************************** C. M.
Sperberg-McQueen Black Mesa Technologies LLC cmsmcq@blackmesatech.com http://www.blackmesatech.com ********************************************


--[2]------------------------------------------------------------------------
Date: 2019-02-23 21:37:23+00:00
From: C. M. Sperberg-McQueen
Subject: SGML and the legacy of print (and other topics)

[The following is a long point by point discussion of some topics raised by Desmond Schmidt’s contribution to Humanist 32.486. Those who like chasing intellectual hares, or watching them be chased, may enjoy at least parts of it. But those for whom these particular hares have no special importance or meaning may find their attention flagging a bit and should feel no obligation to read all the way to the end. It’s just one hare after another. -CMSMcQ]

In Humanist 32.486, Desmond Schmidt writes:

    I think you are right to point out that the context matters. GML was only a print technology. SGML was potentially both print and digital, but mostly print. XML was primarily digital. In 1980 there was not much that was true digital text. Digital documents including those in SGML were mostly used to make printed books, or as an adjunct to print inserted as CDs in the covers.

Since you are distinguishing here between SGML and XML, I infer that you are talking about uptake and usage, and not about the intrinsic natures of the two. I'm not certain I have a good overview of the entire SGML marketplace -- or for that matter of the XML marketplace -- so I am not in a position to agree or disagree. The SGML products and tools I remember most vividly were not print-centered (e.g. Author/Editor, Panorama, DynaBook, DynaWeb, not to mention all the work on interactive electronic technical manuals within the defense community), but it's true enough that some major SGML products with huge dollar values (e.g. DataLogic) were aimed at typesetting, and it's also true that the print-centered products were much slower moving to XML than other products. (However, neither XML nor SGML is relevant for 1980, since neither existed then.)

    But what was in their consciousness when they designed it? Print.

It's not clear what this means. If it means that the designers of SGML paid lip service to other uses of data but actually thought only about print, I think it's clearly false. If it means that the designers understood print in ways they did not understand linguistic annotation, then it's probably true. But as an argument about what SGML and XML are good for, it is utterly beside the point: it's an argument for guilt by association. The designers were interested in print, and therefore nothing they did can be relevant for online work? (Interestingly, wasn't it only a couple of weeks ago that documents were purely an afterthought? And now, voila, they are the anchor that drags the entire enterprise to the bottom of the sea.)

    The decision to separate out metadata into elements and attributes was only part of the legacy of print that was included in SGML.

I do not see any relation here, let alone causality. In what sense is a distinction between elements and attributes a 'legacy of print'? Do all systems for generation of print have such a distinction? I don't see it in troff, or Runoff, or Script, or TeX, or LaTeX, or Scribe; am I missing it? The assumption that "elements and attributes" constitute "metadata" is also not one I think can be taken for granted.
The idea that "markup" is always and only "metadata" is not hard to find, and is often useful when teaching beginners the rudiments of markup, but it's hard to take seriously as a philosophical statement and -- like the concept of "metadata" itself -- does not (in my limited experience) withstand sustained scrutiny. (It's also often mixed up with a model of text as a essentially a sequence of characters, which can be a handy idea in some contexts but should not be taken seriously for very long.) No one that I know of has ever proposed any plausible distinction between "data" and "metadata" that does not make the distinction ultimately depend on one's point of view at a particular moment or for a particular application. Calling some information "metadata" is inviting a particular way of thinking about its relation to other information; it is not a classification stable across time, persons, organizations, or intentions. While the distinction between markup and character data content is also fluid across time (applications may transform each into the other), it tends in my experience to be a little more stable than the distinction between 'data' and 'metadata'. The deliberate decision to introduce explicit hierarchies was another, Like the preceding sentence, this seems to assume a line of argument with which I am not familiar. Why should a tree-structured organization of the input, or the ability to describe a document format with a context-free grammar, be a legacy of print? If hierarchies are intrinsic to print orientation, why did I spend so much time when I was talking about the TEI in public answering objections from people claiming that pages (being two-dimensional) could not possibly be described usefully using hierarchical data structures? Perhaps the experience of having SGML and XML attacked first as being too distant from print and incompatible with interest in page layout and presentation, and then having them attacked as being too laden with legacy print-oriented assumptions, has made me insufficiently sympathetic to either argument. But if anyone wants to take a shot at explaining why the inclusion of document type definitions in SGML, or the rule that each element instance in a document, other than the outermost element instance, has exactly one parent, are best explained as legacies either from print or from pre-SGML, pre-GML computer-based typesetting or layout systems, I would be eager to hear the argument. as was the use of processing instructions meant originally to control the printer. Three questions arise here. (1) What leads DS to believe that processing instructions were originally intended specifically for printers and not for other processors, such as editors, stylesheet processors, plotters, or formatting engines (just to stay within a paper-oriented work flow)? Do other kinds of processors never need ad-hoc instructions? When I teach SGML or XML to beginners, I normally describe processing instructions as a bit like the rainbow in the story of Noah: the processing instruction is the sign of a promise. It signifies that SGML and XML were developed not by ivory tower theorists but by people who knew what it's like to need to get a document ready in time to make a 4 pm Fedex pickup. Yes, the formatting macros ought to get this page correct in a purely declarative way, but if they don't, then at 3:15 I'm going to inject a processing instruction that means "I don't care what the algorithm says, force a paragraph break RIGHT HERE." 
(I use document formatting in the example not because it's the primary use of processing instructions, but because beginners can often relate to it better than to other applications.)

(2) If the original intention was for PIs to control printers only, why does the spec not limit them to that use? Is it just a case of bad spec drafting, in which a WG fails to say what they want to say because they can't express themselves cleanly? Is it a case of such a limitation being impossible to write in spec prose? Surely no one in their right mind can believe that the SGML WG would have been dissuaded from specifying that PIs were only to be used to control printers by the likelihood that doing so would result in awkward prose or convoluted logic.

(3) If we were to suppose that the entailment here were correct and that processing instructions were indeed originally meant "to control the printer", and that the failure of the spec to limit them to printers was just a case of botched drafting, what relevance does that have to questions of what SGML can be used for? I think DS has succumbed here to the intentional fallacy.

    JSON has done away with attributes and even though it is not a document format, it shows that they were superfluous.

Of course attributes are superfluous, in the sense that a version of SGML or XML which lacked them would lose no expressive power (modulo some questions of datatyping which I will address only by waving my hands to make them go away). This has been known since ... gosh, I don't know when. I first heard the observation from the computer scientist Sandra Mamrak, the first chair of the TEI's metalanguage committee, in 1989 or thereabouts. The TEI retained them not because we thought the element/attribute distinction was essential for expressive power but because we thought the distinction made it easier to build more usable vocabularies. Thirty years of listening off and on to people complaining that they don't know when to use elements and when to use attributes have persuaded me that design sense and technical tact are not universal gifts, but they haven't changed my view that the element/attribute distinction is a useful tool for vocabulary designers, just as the logical operators 'or', 'and', and 'not' are useful for those who wish to apply formal logic to real-world problems (even though like attributes they are superfluous).

Similarly a version of SGML or XML which had attributes but lacked content models would also lose no expressive power; I remember Jon Bosak pointing this out in the late '90s, when discussing the claim that XML would be faster to parse if it lacked attributes. He argued, plausibly, that it would probably be even faster if it had attributes but lacked character data content. If anyone was waiting around for JSON to establish the truth of this not terribly demanding proposition, they cannot have been paying any attention. (I notice in passing that JSON does not seem to demonstrate that tree structures are superfluous. Why is that, I wonder?)

    No doubt 15th century bookmakers would be aghast at the accusation that their creations resembled manuscripts because they generalised books and made them reproducible.

Say what? How can anyone look at any of the productions of Gutenberg or Fust and think they would be aghast at a resemblance between their books and manuscript books? I am reminded of the disclaimer I once saw in a novel: The characters in this novel are fictional.
Any resemblance between them and real people is the result of one hell of a lot of hard work.

    When we look back on XML from the perspective of the future we will see these seeming innocent design decisions as the traces of print technology that they are. And they are not innocent. They influence powerfully what we can do with digital editions.

I'll leave future perspectives for the future to worry about. From the perspective of the present, it seems to me that the claim that any of (a) the element/attribute distinction, (b) the constraints that have as a consequence that the elements of an SGML or XML document form a tree (or, in a document using CONCUR, multiple trees), or (c) processing instructions has any visible connection to print is one that lacks all plausibility (as well as, so far, any evidence or argument).

Yes, the design decisions of SGML powerfully influence what we can do with digital editions. Quite true. This is -- as has already been pointed out in this discussion -- one reason some of us think SGML and XML provide a better basis for digital editions than the alternatives we have seen so far. The design decisions of those alternatives also powerfully influence what can be done with them -- that's pretty much one of the basic properties of design decisions -- and based on considered reflection (and some practical experience) we don't think the alternatives measure up. That can change, of course; I have already mentioned the work of Ronald Haentjens Dekker and David Birnbaum and their collaborators, and other people are also interested in the problem of document representation.

    The problems with attributes being used to link elements across the native hierarchical structure of marked-up texts is symptomatic of the fact that the SGML/XML markup language was designed primarily for print.

How so?

    How do encoders of a text grasp mentally what is going on with links? For all but the simplest cases it requires a serious mental effort to follow the structure, and this greatly increases the chance that someone will "stuff it up".

Good question, well and thoroughly discussed in the hypertext literature (back before the Web killed most hypertext literature). The problem doesn't seem to me particular to SGML or XML. The same problem arises if one represents arbitrary graph structures in TeX or in Word (only, in Word, the problem seems to me to be much more difficult -- I do know users of SGML and XML who use complex structures and understand them, but I have never met anyone who could understand complex structures in Word or any other word processor) or in COCOA or in every other machine-readable notation I have ever seen anyone propose for arbitrary graph structures. It seems to me to be true that arbitrary graph structures are in the general case harder to think about than tree structures; that is one of the reasons trees have historically been such an important tool of thought (e.g. in syntactic analysis, where both dependency grammars and phrase-structure grammars use trees as a formalism). Trees themselves take some learning; I recall spending several years training myself to think about SGML documents not merely as sequences of characters interspersed with tags (the only model I know for batch formatters like troff, Runoff, and Script) but as representations of tree and graph structures. So the question does arise: how do we manage to keep some kind of intellectual grip on what we are doing?
Often, when we have trouble grasping something, we give ourselves tools to help: visualizations, automatic checks for well formedness, validity, or soundness, and so on. In general, I believe users interacting with complex documents can benefit from special-purpose user interfaces. All of this is true regardless of the underlying document representation format.

    Although you can verify the element and attribute structure how do you verify these links? Can you even check that each idref has its id somewhere in the document?

Checking that IDREF values match an ID in the document has been implemented as part of DTD-based validation for SGML and XML for thirty-odd years now, so I think it's probably safe to say that yes, it is possible to check that each IDREF matches an ID somewhere in the document. (I find it hard to believe that anyone with any interest in document representation and processing can be unaware of this. I had assumed in the past that DS's unhappiness with SGML and XML was based on some at least rudimentary knowledge of those formats; the question above seems to suggest that I was wrong.)

    Or that they do not form a directed cycle?

Checks for cycles do not form part of DTD-based validation in either SGML or XML; they do form part of the 'Service Modeling Language' defined by a W3C Recommendation of 2009, based on a submission of 2007. (SML defines a number of cross-document validation constraints, but it does not separate cleanly between validation constraints and domain-specific issues of interest to its designers, so it unfortunately cannot be recommended as a general tool for stronger document validation. It does, however, provide an example of the definition of validity constraints that go beyond those defined by conventional schema languages.)

    Or that the structures thus created are even computable?

I'm not sure I know how to define any set of constraints on links so as to make the decision on document validity be non-computable; if anyone knows a way, I am pretty sure at least one computer science journal would like to talk to them about a paper on the subject. But there was some discussion a few years ago (probably some time between 2000 and 2010) about using additional constraints on ID/IDREF links (or, equivalently, XSD referential integrity constraints) to model 3SAT as a document validity problem, which means validation involving constraints of that kind may be slow. As the continued popularity of Schematron shows, however, many potential users are not particularly fussed about proving that their schema language has any particular guaranteed complexity ceiling. So it's not surprising that after a few weeks, the discussion died down (and in retrospect I find myself thinking I really ought to check, before sending this mail, to make sure the whole thing wasn't an April Fool's prank -- but I'm not going to check).

The problem of link validation seems at first glance to become somewhat more difficult in arbitrary graph representations of documents; in the directed graph of an XML document, the element structure of the document provides an easily identified spanning tree, but in a pure directed graph, the identification of a spanning tree is one of those problems for which I have to break out the data structures and algorithms books on my shelf.

    Although this looks on the surface to be a useful way to break out of hierarchies, in practical terms it is not very useful.

Well, utility is probably one of those things in the eye of the beholder.
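(To make the ID/IDREF point above concrete: the following is a toy sketch of my own, using Python's standard xml.etree.ElementTree and generic 'id' and 'target' attributes rather than any real DTD machinery, not a description of how any particular validator works. It shows how mechanical both checks are: the reference check that DTD-based validation has always performed, and a cycle check of the kind that goes beyond it.)

    # Toy check: every link target must name a declared id, and following
    # the links from any element must never revisit an id (no directed cycle).
    import xml.etree.ElementTree as ET

    doc = """<text>
      <note id="n1" target="n2">See the next note.</note>
      <note id="n2">Final note.</note>
    </text>"""

    root = ET.fromstring(doc)
    ids = {e.get("id") for e in root.iter() if e.get("id")}
    links = {e.get("id"): e.get("target")
             for e in root.iter() if e.get("id") and e.get("target")}

    # Reference check (the part covered by DTD-based ID/IDREF validation).
    dangling = [t for t in links.values() if t not in ids]

    # Cycle check (not part of DTD validation): walk the chain of links.
    def has_cycle(start):
        seen, node = set(), start
        while node in links:
            if node in seen:
                return True
            seen.add(node)
            node = links[node]
        return False

    print("dangling references:", dangling)
    print("cyclic references:", [i for i in links if has_cycle(i)])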
That some people do find it useful is an easily verified empirical fact. Your mileage may vary.

    The same goes for CONCUR. It is not even available in XML and does not seem to have been much used in SGML.

It is quite true that most conforming SGML parsers did not implement CONCUR, and that CONCUR appears to have been seldom used in SGML applications. I believe that Jean-Pierre Gaspart wrote a number of applications for the European Commission which did use CONCUR, which he used to simplify both DTD construction and processing dispatch. But those applications were probably never available outside the Commission.

Some people have argued (I am thinking particularly of Elli Mylonas, Allen Renear, et al. in the discussion after their paper at the 1992 ALLC/ACH conference) that we need to devise a new model of documents to handle overlapping structures, and that CONCUR is not a good candidate because CONCUR was not widely implemented. This seemed to me then and seems to me now to conflate theoretical issues and practical issues. The lack of implementations is a practical issue -- it means that if we want to use CONCUR we need to write software to support it. But that practical issue says nothing at all about whether CONCUR does or does not provide a useful model of text. Nor does it offer a decisive practical argument: if we decide not to use CONCUR as our model and devise a brand new model instead, we will still have to devise software to support the model. DS appears to argue not that CONCUR has no theoretical interest but only that it's not an off-the-shelf solution one can use today. True enough, but I don't know of any off-the-shelf solutions one can use today.

Neither DS nor EM and her co-authors seem to have contemplated (or at least: they have not as far as I know addressed publicly) the possibility that the low user demand for CONCUR might suggest that overlapping structures might not be so pressing an issue in the practice of document analysis and processing as some of us had thought.

In November 1987 at the Poughkeepsie meeting which launched the Text Encoding Initiative, when the issue of overlap came up, David Barnard mentioned that he had done some work on CONCUR in collaboration with a couple of people in industry. The feature, he said, had some problems. So it might be tempting to devise our own solution. But the problem with that approach, he pointed out, was that objectively speaking no alternative solution one came up with was likely to be much better than CONCUR, or any better at all. The degree of critical self-reflection shown in David's remark is one I have often wished were more frequently present in discussions of alternatives to SGML and XML (and indeed other technologies). But like good design sense and technical tact, self-critical intellectual honesty does not look to be as widely distributed a gift as one might wish it were.

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************


--[3]------------------------------------------------------------------------
Date: 2019-02-23 19:29:47+00:00
From: Desmond Schmidt
Subject: Re: [Humanist] 32.486: editions, in print or in bytes

On 22 February 2019 Hugh Cayless wrote:

> The new must overthrow the old; the
> producers of digital media need pay no mind to the requirements of print.
> Indeed, any association with print is grounds for dismissal.
Perhaps you should walk into a library some time and simply observe people studying. Some libraries still have books on their shelves. The more modern ones dependent on teaching texts and journals have thrown them out or redeployed them as innovative decoration of the ceiling. In the State library here it is a major event if someone picks up a book at all. Reading one is something I have not seen in the past few years. Even in university libraries the same behaviour manifests itself: students studying with laptops, surfing the web, librarians who actively pursue a policy of "digital first" - that is buying nothing physical if they can possibly help it. At UQ humanities library students still consult physical books because the humanities basically study the past, but their use of that resource has definitely decreased. If this trend continues it looks as if libraries will eventually contain nothing but rare books and manuscripts, if they had any in the first place. In this changing environment what can humanists do but move with the times?

> This sort of thing leads to the case where, while scholars may *use*
> digital editions in their research (e.g. in searching for evidence to
> support an argument), they only *cite* the print versions, because they are
> "better". I just want a world where that divide has been better-bridged.

Let's say we had digital scholarly editions that could be cited, and be regarded as the best ones to cite - what need would we have of print editions? I don't see how the ability to cite digital editions bridges the worlds of print and digital. It just says goodbye to print. Why would anyone pay $50 for a print edition if they can have a "better" one that costs nothing? Why would a library do that either? Why would a publisher risk producing something for a niche market that is unlikely to sell enough copies to recoup the investment?

You are right that humanists don't currently cite digital editions much. They rely on print. But this is simply evidence of our failure to make good enough digital editions. I just want them to be better also.

Desmond Schmidt
eResearch
Queensland University of Technology


--[4]------------------------------------------------------------------------
Date: 2019-02-23 15:33:01+00:00
From: Katherine Harris
Subject: Re: [Humanist] 32.486: editions, in print or in bytes

Hi,

Might I add to this discussion by pointing to the Modern Language Association's Committee on Scholarly Editions [https://www.mla.org/About-Us/Governance/Committees/Committee-Listings/Publications/Committee-on-Scholarly-Editions], which has been working on integrating more scholarly editions into its process of vetting for approval of a seal. By no means the authority, but the rotating membership of the committee has included venerable scholarly editors and digital scholarly editors. The Committee created Guidelines for Scholarly Editors [https://www.mla.org/Resources/Research/Surveys-Reports-and-Other-Documents/Publishing-and-Scholarship/Reports-from-the-MLA-Committee-on-Scholarly-Editions/Guidelines-for-Editors-of-Scholarly-Editions], which encompass the core issues surrounding the debate here on Humanist.
Editions are submitted for rigorous review using Guiding Questions for Vetters of Scholarly Editions [https://www.mla.org/Resources/Research/Surveys-Reports-and-Other-Documents/Publishing-and-Scholarship/Reports-from-the-MLA-Committee-on-Scholarly-Editions/Guidelines-for-Editors-of-Scholarly-Editions#questions] - and recently updated that review questionnaire to include issues unique to scholarly editions such as sustainability and markup. Editions that have received the MLA Seal [https://www.mla.org/Resources/Research/Surveys-Reports-and-Other-Documents/Publishing-and-Scholarship/Reports-from-the-MLA-Committee-on-Scholarly-Editions/CSE-Approved-Editions] include some venerable digital scholarly editions, including The Blake Archive. The Committee does have an issue with only large, well-funded scholarly editions being submitted, usually based on a single author.

In May 2016, the Committee published a White Paper on the "Scholarly Edition in the Digital Age [https://www.mla.org/content/download/52050/1810116/rptCSE16.pdf]" (pdf - online version [https://scholarlyeditions.mla.hcommons.org/cse-white-paper/]) questioning the move from print to digital for the scholarly edition and asking for further feedback from the MLA community. The Committee has a series of blog posts [https://scholarlyeditions.mla.hcommons.org/category/blog/] on the issues as well.

What a digital edition affords is not only the study of a single author's work, but also a particular form and genre across time periods. This digital version also affords a deep dive into authors that would not otherwise be deemed "worthy" or canonical by funding agencies for support. The question is, and I just saw a CFP [http://www.archives.gov/nhprc/announcement/depc] for building out digital scholarly editions infrastructure to be used widely, are we still discussing what is a scholarly edition just moved to digital or are we discussing what is an *authoritative* digital scholarly edition, in which case that means we're spending a lot of time defining the areas and creating boundaries to keep out *unauthoritative* digital scholarly editions. I'm less interested in the who's-in-and-who's-out discussion and more interested in how to expand the digital scholarly edition beyond the limitations of the codex without having to spend $1 million+ to get it done.

~Kathy

********************
Dr. Katherine D. Harris
Professor, Department of English & Comparative Literature
San Jose State University
Research Blog: http://triproftri.wordpress.com/
Co-Editor, *Digital Pedagogy in the Humanities*
[https://github.com/curateteaching/digitalpedagogy/blob/master/description.md]
Author, *Forget Me Not: The Rise of the British Literary Annual, 1823-1835*
[http://www.ohioswallow.com/book/Forget+Me+Not]

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php
Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)
This site is maintained under a service level agreement by King's Digital Lab.