Humanist Discussion Group, Vol. 32, No. 487.
Department of Digital Humanities, King's College London
Hosted by King's Digital Lab
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

[1] From: C. M. Sperberg-McQueen
    Subject: What's it all about, really? Yet another coda to the McGann/Renear thread (425)

[2] From: C. M. Sperberg-McQueen
    Subject: SGML and the legacy of print (and other topics) (386)

[3] From: Desmond Schmidt
    Subject: Re: [Humanist] 32.486: editions, in print or in bytes (45)

[4] From: Katherine Harris
    Subject: Re: [Humanist] 32.486: editions, in print or in bytes (66)


--[1]------------------------------------------------------------------------
Date: 2019-02-24 07:11:50+00:00
From: C. M. Sperberg-McQueen
Subject: What's it all about, really? Yet another coda to the McGann/Renear thread

In Humanist 32.486, Dana Paramskas asks:

    I admit to being fascinated by the discussion, although slogging through an alphabet soup of acronyms for various codings makes it a wee bit hard to understand.

Guilty as charged. I see, moreover, glancing through the long history of the thread, that I am among the chief offenders (i.e. users of unexplained acronyms). As penance, I will append to this mail a glossary of the acronyms found in Humanist 32.452 (since DP's subject line refers specifically to that item) and subsequent contributions to the thread, as well as some other terms referred to without explanation that I hope may be helpfully explained. The details of the argument do occasionally hinge on very specific technical issues, but anyone who will admit to being interested should be made as welcome as can be managed. Please be patient if the explanations given below don't hit the right level for you.

DP asks:

    The core question is (please correct a non-coder if I have misunderstood): what is the "real" text?

Probably. When the wind is in the southeast and I am a nominalist, I am not convinced that there is any single identifiable core question in the discussion, merely a long sequence of remarks provoking other remarks which provoke yet others, just a chain reaction of readers turning into writers and back into readers. Historically, though (and when the moon is nearly full and I admit to my inner Platonist again), yes, there is a core question, and yes "what is the 'real' text" is a good first approximation to it. The question, though, has several senses.

In connection with editions, Desmond Schmidt has referred to the long tradition of attempts to understand the transmission of a text well enough to trace the history of errors or other changes made in the course of reproduction, organize the extant textual witnesses into a stemma or tree of descent, and use systematic methods to reconstruct the text of the document at the root of the tree (the archetype), or (with the aid of conjectural emendation) the text of the work as it left the hands of the author. This method is best known in the context of ancient and medieval works, where it is usually associated with the name Karl Lachmann (an important figure both in the study of Latin literature and in that of medieval German literature), and there is a long and sometimes interesting history of debates over whether the presuppositions of Lachmann's method hold in the real world, or in our understanding of text: is the "real" text really the one that left the hands of the author, or is the "real" text the text(s) actually read by readers of the work? Can we in any case assume that what left the hands of the author was "a" text in the singular?
What if it was a mass of notes not yet put into final order (think Vergil, Chaucer, Büchner, ...)? It can also be argued that the transmission pattern of manuscript works is often not really a tree in which each new manuscript is copied from exactly one exemplar: scribes sometimes switch back and forth between exemplars, some manuscripts seem to have been created by copying from two exemplars at once (now following the one, now the other), and lyric was clearly often transmitted by memory, which will typically defy reduction to a stemma. A second sense of the question is given by a paper by Allen Renear and others, widely cited and almost as widely read, which posed in its title the question "What is text, really?" (For non-philosophers, it may not be obvious at first glance that the last word of the title conceals a wicked Platonic sting; be forewarned.) Renear and his co-authors inquire into the "real" nature of text, partly as a question with its own intrinsic interest and partly as a way of discussing different approaches to the electronic representation of text. The implicit assumption (widely shared, I think) is that an electronic representation which comes closer to capturing the "real" nature of text will be superior to others not just on theoretical grounds but for practical reasons as well. Renear et al. argue that text consists of -- or rather just *is* -- an 'ordered hierarchy of content objects' (OHCO), and that SGML is a good representation for text precisely because SGML elements form an ordered tree. (SGML elements are 'ordered' because the sequence in which sibling elements occur in the document is considered essential information; the first paragraph of the chapter and the second paragraph of the chapter cannot be interchanged without changing the document. This distinguishes SGML from typical database management systems in which constituent objects are explicitly not ordered. They form a 'hierarchy' in the sense that they nest, so each element is directly contained by exactly one 'parent' element, except the outermost element of all, which has no parent element.) They conclude their paper by pointing to a number of textual phenomena like hyperlinks and overlapping structures, which go beyond the OHCO model, and suggest that a really good representation of text will have to provide both for hyperlinks and for multiple ordered hierarchies. (This concluding section is customarily ignored by those who object to the OHCO model on the grounds that texts may have multiple hierarchical structures; 'the OHCO thesis' is universally taken to be the claim that texts have only one hierarchical structure, despite the fact that Renear et al. explicitly deny that proposition. It is also customary to ignore the fact that Renear et al. do not define "content object" and to pretend that it's clear to everyone what the phrase denotes. That is, no one else defines it, either. After struggling with it for a while, I concluded that in practice the term means 'thingy; doohickey; whatchamacallit'.) 
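To make the OHCO terminology concrete, here is a minimal sketch (a toy example of my own, not taken from Renear et al., and using Python's standard xml.etree.ElementTree purely for illustration) of the two properties just described: sibling order is significant, and every element except the outermost has exactly one parent.

    # Toy document: a chapter containing a title and two paragraphs.
    import xml.etree.ElementTree as ET

    doc = """<chapter>
      <title>I</title>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </chapter>"""

    chapter = ET.fromstring(doc)

    # 'Ordered': the sequence of children is part of the information;
    # swapping the two <p> elements would change the document.
    print([child.tag for child in chapter])   # ['title', 'p', 'p']

    # 'Hierarchy': each element is directly contained by exactly one
    # parent element; only the outermost element has none.
    parents = {child: parent for parent in chapter.iter() for child in parent}
    print(parents[chapter[1]].tag)            # 'chapter'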
The first sense of the question ("real" text as the text sought or constructed by an editor) ties into the second sense of the question (the Platonic nature of text and the best way to represent text electronically) by way of Desmond Schmidt's arguments that (a) the search for a unitary archetypal text is not a task given to us by reality but an artefact of printed distribution, best abandoned in the digital age, and (b) that XML and SGML (see glossary below) are ineluctably tied to the worldview of print and also best abandoned in the digital age. A third sense of the question relates to the relative importance of particular physical realizations of a text (inscription in a particular document, or vocal performance at a particular place and time) and those properties of a text common to multiple physical realizations. Jerome McGann is often cited as an example of a modern critic who argues that the meaning of literary texts is inextricably bound up with their bibliographic realization and that the abstract view of text he believes inherent in SGML and XML (or possibly only in TEI, since his own projects often also use XML) is necessarily a falsification of the "real" text, because it does not attend closely enough to the source text's physical realization. (In this context, it is slightly disappointing that in the debate referred to in the "McGann-Renear debate" thread, McGann's interpretation of the literary text at issue was a perfectly conventional close reading of the text focused on style and language, with no specifically bibliographic argument at all.) It is, as I understand it, this argument over whether the "real" text is an abstract object (realized in possibly different ways in different documents) or a concrete physical object, which Peter Robinson is attempting to dissolve with his postulate that every text has two aspects, each representable as an ordered hierarchy. (Q. Abstract or physical? A. Both, please!) A fourth sense of the question is related to the third but involves a disagreement about how complete we should make our representations of text. To take a concrete example: in an electronic representation of a play by Shakespeare, should the metrical hierarchy (acts, scenes, verse lines, possibly verse groups, and maybe even feet?) be represented "explicitly" in the electronic form? What about the dramaturgical structure (acts, scenes, speeches, stage directions)? And the typographic structure (quires, leaves, pages, columns, typographic lines, ...)? Some participants in the discussion feel (or so I gather) that these structures should be represented explicitly in the electronic document format only if they are "real", which for some discussants apparently means only if they were demonstrably present in the mental acts of the author. Some participants have suggested that some of the structures can safely be omitted, because software can (or will soon be able to, just as soon as artificial intelligence progresses a bit further) recognize them even if they are not represented explicitly in the electronic document. For each of these propositions, there are those who will deny them. 
("Artificial intelligence is not now any match for human intelligence in annotating texts"; "textual structures can be real even if they were not part of the author's conscious world"; "textual structures can be real even if they aren't part of the author's world at all [since texts are a collaboration between author and reader, all this talk of the author without talk of the reader is in any case misguided]"; and "the electronic format should contain the information we need for processing, whether that is a 'real' part of the text or a 'not real' part of the text or information which is not part of the text at all but may yet be of use [like a library shelf number for the exemplar from which an electronic document was transcribed]".) As someone rightly pointed out, the question even arises for a single, author-vetted version, which as always must rely on or conform to the reader's interpretation, short of an extended dialogue with the author (or even then). If an author chooses to make available multiple versions, with additions or corrections and without a final, definitive version, is she or he not saying simply - up to you all to figure it out, and have fun? Quite. Those who believe (as I think Desmond Schmidt does) that SGML and XML are intrinsically tied to the print medium, and that the print medium is inescapably tied to a single-text notion of textual reality, see such texts as illustrating the fundamental inadequacy of SGML, XML, and TEI. Those who doubt either of those premises will see such texts as interesting examples which illustrate the power and utility of SGML, XML, and TEI. Then there is a jump in logic which I have trouble understanding because, of course, I am not a coder. Each code -- or whatever it's called -- has its own structure or hierarchy. While this adds a measure of complexity, does it really help to identify an "original text"? Could someone in the know explain to a naive reader just how the various codings do or do not aid in reaching the "original text" aside from the arguments about the ease or validity of the different codings? I believe that the multiple senses of the "real text" are here crossing their wires. As far as I know, no one is arguing that any particular representation of text will make it materially easier or harder to reconstruct an archetypal or original text; as I understand the discussion so far, then, the answer to your "how do the codings aid ..." question is "they don't". There *is*, however, an argument in the other direction. At least one or two participants in the discussion have argued that any attempt to represent the text of multiple textual witnesses in a single electronic document will necessarily cause painful difficulties in the electronic document, and further that the hierarchical structure of SGML and XML documents makes the difficulties even worse than they would otherwise be. So although I don't think anyone has argued that particular choices in text formats can help establish an original text, there *have* been arguments that particular choices in text formats can make it harder. There is, needless to say, no more unanimity on this point than on any other. And always and everywhere, there is the sense that making particular textual features explicit in an electronic representtion makes them easier to identify, not for humans, but for machines. 
As Elli Mylonas sometimes says “in an XML application, the software is stupid and the data are smart.” Those of us who expect software to remain stupid for a long while want to make explicit our understanding of a text, in order that software which is not intelligent can at least be well informed. “Markup”, as Lou Burnard used to say (and no doubt still does) “is a hermeneutic activity.” If we think of the features we mark up as (part of) the "real" text, then yes, the codes (tags) we put into the text are to help readers (software or human) identify what we believe is the real text.

It is possible that I have misunderstood your questions, or not answered them satisfactorily; if so, I ask for your patience. Some of the points at issue are best illustrated with concrete examples, but clear short examples are in my experience very hard to come by, and in a discussion about things becoming unwieldy, concrete examples are likely to require the explanation of so many technical details that I feared they would make things even harder to follow than you have already found them.

I hope this helps, a little.

-C. M. Sperberg-McQueen

Glossary of acronyms, unexplained references, etc. (Further information on any of these may be found by web search.)

ACE = 'Automatic Computing Engine'. The name of a computer designed by Alan Turing in 1945; a scaled-down version was completed by the National Physical Laboratory in 1950 and a second version by 1952/53.

ADHO = 'Association of Digital Humanities Organizations'. An umbrella organization for the cooperation of regional professional societies focused on digital humanities.

ASCII = 'American Standard Code for Information Interchange'. An extremely influential coded character set which specifies electronic representations for 95 graphic characters (the letters 'A' to 'Z' and 'a' to 'z', the digits '0' to '9', and numerous punctuation marks). In discussions of textual representation, terms like 'ASCII text' are often used to denote textual representations which do not use explicit markup in the style of SGML, XML, or COCOA.

COCOA = 'Word COunt and COncordance Program on Atlas'. A batch concordance system using an influential form of markup, developed at the Atlas Computing Laboratory in the 1960s (? or 1970s?).

CSS = 'Cascading Style Sheets'. A language specified by W3C for describing the rendering of HTML (and, more generally, XML) documents in browsers (and, eventually, in other media).

DTD = 'document type definition'. A formal specification of the types of elements used in marking up documents of a given type. A document which conforms to the rules specified in the DTD is said to be 'valid'.

EDVAC = 'electronic discrete variable automatic computer'. A binary stored-program computer built by J. Presper Eckert and John Mauchly at the University of Pennsylvania on contract to the U.S. military in 1946-49; the successor to ENIAC. Widely known because of the publication of a design document attributed to John von Neumann.

ENIAC = 'electronic numerical integrator and calculator'. A vacuum-tube computer built by J. Presper Eckert and John Mauchly at the University of Pennsylvania during World War II.

ERB = 'editorial review board'. A small body within the W3C’s “SGML on the Web” working group responsible for supervising the work of the editors, and in particular making design decisions about what the spec should say.

GML = 'Generalized Markup Language'.
An IBM system for descriptive markup devised by Charles Goldfarb, Ed Mosher, and Ray Lorie; GML eventually shipped as part of IBM's Document Composition Facility product, and was re-implemented at the University of Waterloo as an add-on to Waterloo Script.

Hockey, Susan. The longtime head of the Humanities Computing Unit at Oxford University Computing Services; longtime chair of the Association for Literary and Linguistic Computing (ALLC), now the European Association for Digital Humanities (EADH); author of A Guide to Computer Applications in the Humanities; participated in the development of the COCOA system; led the development of the Oxford Concordance Program.

Huitfeldt, Claus. Professor of Philosophy at the University of Bergen, founding director of the Wittgenstein Archive there, and editor of the Bergen Electronic Edition of Wittgenstein's posthumous papers (the 'Nachlass'); declined to use SGML and TEI for transcribing Wittgenstein and instead invented MECS and MECS-WIT (q.v.) for the purpose.

LaTeX. A set of macros for TeX (q.v.) which provide a system of high-level descriptive markup for documents (so documents can be described in terms of sections, subsections, and the like, and not solely in terms of font family and size, etc.).

MECS = 'Multi-element coding system'. An alternative to SGML developed by and for the Wittgenstein Archive at the University of Bergen.

MECS-WIT. A specific MECS-based markup language developed by the Wittgenstein Archive at the University of Bergen for the transcription of Wittgenstein's notebooks.

OCP = 'Oxford Concordance Program'. A batch concordance system using (an extension of) COCOA markup, developed in the late 1970s.

OHCO = 'ordered hierarchy of content objects'. A model of text outlined by Allen Renear et al. in their paper "What is text, really?" Often asserted to be the model of text underlying SGML and the TEI.

Ott, Wilhelm. Founder and longtime head of the Department for Literary and Documentary Data Processing at the University of Tuebingen computer center; creator and co-programmer of Tustep, an extensive and tightly integrated set of programs for encoding text, processing text, and producing editions (both scholarly and commercial).

Packard, David. Classicist, philanthropist (and son of the electrical engineer David Packard) and creator of the Ibycus minicomputer system (a customized version of a Hewlett-Packard minicomputer) and (in the 1980s) the Ibycus Scholarly Personal Computer. Both Ibycus systems had hardware assistance for quick search in the Thesaurus Linguae Graecae and other texts using the same markup as TLG.

SGML = 'Standard Generalized Markup Language', a set of rules for electronic representation of documents. Formally SGML is defined by International Standard ISO 8879, published in 1986. Strictly speaking what SGML defines are rules for defining rules for document representation; SGML is a "meta-language" (a language for defining and talking about languages). Its rules are specific enough that all markup languages defined using SGML have certain family resemblances, so often in discussions like this one people speak as if all SGML-based document languages were alike, or as if there were only one. Notable properties of SGML-based languages are that documents consist of a mixture of 'character data' and 'markup', that the most prominent form of 'markup' are 'tags' which mark the beginning and ending of portions of the document called 'elements', and that elements nest within each other like Russian dolls.
The last property has the consequence that two elements can nest one within the other, or be entirely separate, but cannot overlap each other. An optional feature of SGML called 'concurrent markup' and signaled using the keyword CONCUR allows multiple sets of tags in a document; the elements marked by each set must nest, but elements in different sets may overlap. Each set of elements is defined formally by a 'document type definition' or DTD (q.v.).

SOAP = 'Simple Object Access Protocol'. A set of rules for representing programming-language objects in XML for transmission over the network and reconstitution as objects at the other end.

TEI = 'Text Encoding Initiative', an international project begun in 1987 which produced a set of "Guidelines for Text Encoding and Interchange" in 1994, which took the form of an SGML document type definition (or rather: a set of SGML DTDs) and documentation of their meaning. Since 2000 (or so), the Guidelines have been maintained by the TEI Consortium; among the changes made is that the Guidelines now use XML rather than SGML as their basis.

TeX (tau epsilon chi, conventionally written 'TeX' in Latin contexts), a batch formatter developed by the Stanford computer scientist Donald E. Knuth for producing typeset output, with particular emphasis on the typesetting of mathematical expressions.

Thaller, Manfred. German historian and digital humanist; creator first of CLIO (an adaptation of a commercial database management system for managing data sets of interest to historians) and then of Kleio (a PC-based database system for historians without any reliance on commercial database management systems) and other systems; longtime advocate for the creation of a 'historical informatics' as a discipline distinct from and independent of computer science.

Tustep = 'Tuebingen System of Text Processing Programs'. ...

W3C = World Wide Web Consortium. The standards development body responsible for Web technical standards like HTML, CSS, XML, and many others. W3C created an "SGML on the Web" working group in 1996, reorganized later as the "XML" working group. For historical reasons, W3C standards are called "Recommendations".

WG = 'working group'. The organizational unit for technical work in ISO, W3C, and many other standards development organizations.

XML = 'Extensible Markup Language', a subset of SGML specified in 1998 by the World Wide Web Consortium (W3C). XML restricts many of the choices left open by SGML, in the interest of making it easier to write software to support XML. Most of the restrictions affect only the syntax of the format, and not its model of text, but there is one exception: the CONCUR feature is not part of XML.

XQuery. A query language for XML databases, developed by a W3C working group between 1998 and 2011.

XSL-FO = 'Extensible Stylesheet Language - Formatting Objects'. An XML vocabulary for high-level abstract descriptions of pages.

XSLT = 'Extensible Stylesheet Language: Transformations'. A programming language for specifying processes which transform one XML document (the input) into another (the output); one of the most widely used tools for processing XML data.

XSD = 'XML Schema Definition Language'. An XML-based language developed by W3C for defining schemas for XML-based markup languages; XSD schemas are one of several alternatives to DTDs.

******************************************** C. M.
Sperberg-McQueen Black Mesa Technologies LLC cmsmcq@blackmesatech.com http://www.blackmesatech.com ********************************************


--[2]------------------------------------------------------------------------
Date: 2019-02-23 21:37:23+00:00
From: C. M. Sperberg-McQueen
Subject: SGML and the legacy of print (and other topics)

[The following is a long point by point discussion of some topics raised by Desmond Schmidt’s contribution to Humanist 32.486. Those who like chasing intellectual hares, or watching them be chased, may enjoy at least parts of it. But those for whom these particular hares have no special importance or meaning may find their attention flagging a bit and should feel no obligation to read all the way to the end. It’s just one hare after another. -CMSMcQ]

In Humanist 32.486, Desmond Schmidt writes:

    I think you are right to point out that the context matters. GML was only a print technology. SGML was potentially both print and digital, but mostly print. XML was primarily digital. In 1980 there was not much that was true digital text. Digital documents including those in SGML were mostly used to make printed books, or as an adjunct to print inserted as CDs in the covers.

Since you are distinguishing here between SGML and XML, I infer that you are talking about uptake and usage, and not about the intrinsic natures of the two. I'm not certain I have a good overview of the entire SGML marketplace -- or for that matter of the XML marketplace -- so I am not in a position to agree or disagree. The SGML products and tools I remember most vividly were not print-centered (e.g. Author/Editor, Panorama, DynaBook, DynaWeb, not to mention all the work on interactive electronic technical manuals within the defense community), but it's true enough that some major SGML products with huge dollar values (e.g. DataLogic) were aimed at typesetting, and it's also true that the print-centered products were much slower moving to XML than other products. (However, neither XML nor SGML is relevant for 1980, since neither existed then.)

    But what was in their consciousness when they designed it? Print.

It's not clear what this means. If it means that the designers of SGML paid lip service to other uses of data but actually thought only about print, I think it's clearly false. If it means that the designers understood print in ways they did not understand linguistic annotation, then it's probably true. But as an argument about what SGML and XML are good for, it is utterly beside the point: it's an argument for guilt by association. The designers were interested in print, and therefore nothing they did can be relevant for online work? (Interestingly, wasn't it only a couple of weeks ago that documents were purely an afterthought? And now, voila, they are the anchor that drags the entire enterprise to the bottom of the sea.)

    The decision to separate out metadata into elements and attributes was only part of the legacy of print that was included in SGML.

I do not see any relation here, let alone causality. In what sense is a distinction between elements and attributes a 'legacy of print'? Do all systems for generation of print have such a distinction? I don't see it in troff, or Runoff, or Script, or TeX, or LaTeX, or Scribe; am I missing it? The assumption that "elements and attributes" constitute "metadata" is also not one I think can be taken for granted.
The idea that "markup" is always and only "metadata" is not hard to find, and is often useful when teaching beginners the rudiments of markup, but it's hard to take seriously as a philosophical statement and -- like the concept of "metadata" itself -- does not (in my limited experience) withstand sustained scrutiny. (It's also often mixed up with a model of text as a essentially a sequence of characters, which can be a handy idea in some contexts but should not be taken seriously for very long.) No one that I know of has ever proposed any plausible distinction between "data" and "metadata" that does not make the distinction ultimately depend on one's point of view at a particular moment or for a particular application. Calling some information "metadata" is inviting a particular way of thinking about its relation to other information; it is not a classification stable across time, persons, organizations, or intentions. While the distinction between markup and character data content is also fluid across time (applications may transform each into the other), it tends in my experience to be a little more stable than the distinction between 'data' and 'metadata'. The deliberate decision to introduce explicit hierarchies was another, Like the preceding sentence, this seems to assume a line of argument with which I am not familiar. Why should a tree-structured organization of the input, or the ability to describe a document format with a context-free grammar, be a legacy of print? If hierarchies are intrinsic to print orientation, why did I spend so much time when I was talking about the TEI in public answering objections from people claiming that pages (being two-dimensional) could not possibly be described usefully using hierarchical data structures? Perhaps the experience of having SGML and XML attacked first as being too distant from print and incompatible with interest in page layout and presentation, and then having them attacked as being too laden with legacy print-oriented assumptions, has made me insufficiently sympathetic to either argument. But if anyone wants to take a shot at explaining why the inclusion of document type definitions in SGML, or the rule that each element instance in a document, other than the outermost element instance, has exactly one parent, are best explained as legacies either from print or from pre-SGML, pre-GML computer-based typesetting or layout systems, I would be eager to hear the argument. as was the use of processing instructions meant originally to control the printer. Three questions arise here. (1) What leads DS to believe that processing instructions were originally intended specifically for printers and not for other processors, such as editors, stylesheet processors, plotters, or formatting engines (just to stay within a paper-oriented work flow)? Do other kinds of processors never need ad-hoc instructions? When I teach SGML or XML to beginners, I normally describe processing instructions as a bit like the rainbow in the story of Noah: the processing instruction is the sign of a promise. It signifies that SGML and XML were developed not by ivory tower theorists but by people who knew what it's like to need to get a document ready in time to make a 4 pm Fedex pickup. Yes, the formatting macros ought to get this page correct in a purely declarative way, but if they don't, then at 3:15 I'm going to inject a processing instruction that means "I don't care what the algorithm says, force a paragraph break RIGHT HERE." 
(I use document formatting in the example not because it's the primary use of processing instructions, but because beginners can often relate to it better than to other applications.)

(2) If the original intention was for PIs to control printers only, why does the spec not limit them to that use? Is it just a case of bad spec drafting, in which a WG fails to say what they want to say because they can't express themselves cleanly? Is it a case of such a limitation being impossible to write in spec prose? Surely no one in their right mind can believe that the SGML WG would have been dissuaded from specifying that PIs were only to be used to control printers by the likelihood that doing so would result in awkward prose or convoluted logic.

(3) If we were to suppose that the entailment here were correct and that processing instructions were indeed originally meant "to control the printer", and that the failure of the spec to limit them to printers was just a case of botched drafting, what relevance does that have to questions of what SGML can be used for? I think DS has succumbed here to the intentional fallacy.

    JSON has done away with attributes and even though it is not a document format, it shows that they were superfluous.

Of course attributes are superfluous, in the sense that a version of SGML or XML which lacked them would lose no expressive power (modulo some questions of datatyping which I will address only by waving my hands to make them go away). This has been known since ... gosh, I don't know when. I first heard the observation from the computer scientist Sandra Mamrak, the first chair of the TEI's metalanguage committee, in 1989 or thereabouts. The TEI retained them not because we thought the element/attribute distinction was essential for expressive power but because we thought the distinction made it easier to build more usable vocabularies. Thirty years of listening off and on to people complaining that they don't know when to use elements and when to use attributes have persuaded me that design sense and technical tact are not universal gifts, but they haven't changed my view that the element/attribute distinction is a useful tool for vocabulary designers, just as the logical operators 'or', 'and', and 'not' are useful for those who wish to apply formal logic to real-world problems (even though like attributes they are superfluous).

Similarly a version of SGML or XML which had attributes but lacked content models would also lose no expressive power; I remember Jon Bosak pointing this out in the late '90s, when discussing the claim that XML would be faster to parse if it lacked attributes. He argued, plausibly, that it would probably be even faster if it had attributes but lacked character data content. If anyone was waiting around for JSON to establish the truth of this not terribly demanding proposition, they cannot have been paying any attention. (I notice in passing that JSON does not seem to demonstrate that tree structures are superfluous. Why is that, I wonder?)

    No doubt 15th century bookmakers would be aghast at the accusation that their creations resembled manuscripts because they generalised books and made them reproducible.

Say what? How can anyone look at any of the productions of Gutenberg or Fust and think they would be aghast at a resemblance between their books and manuscript books? I am reminded of the disclaimer I once saw in a novel: The characters in this novel are fictional.
Any resemblance between them and real people is the result of one hell of a lot of hard work.

    When we look back on XML from the perspective of the future we will see these seeming innocent design decisions as the traces of print technology that they are. And they are not innocent. They influence powerfully what we can do with digital editions.

I'll leave future perspectives for the future to worry about. From the perspective of the present, it seems to me that the claim that any of (a) the element/attribute distinction, (b) the constraints that have as a consequence that the elements of an SGML or XML document form a tree (or, in a document using CONCUR, multiple trees), or (c) processing instructions has any visible connection to print is one that lacks all plausibility (as well as, so far, any evidence or argument).

Yes, the design decisions of SGML powerfully influence what we can do with digital editions. Quite true. This is -- as has already been pointed out in this discussion -- one reason some of us think SGML and XML provide a better basis for digital editions than the alternatives we have seen so far. The design decisions of those alternatives also powerfully influence what can be done with them -- that's pretty much one of the basic properties of design decisions -- and based on considered reflection (and some practical experience) we don't think the alternatives measure up. That can change, of course; I have already mentioned the work of Ronald Haentjens Dekker and David Birnbaum and their collaborators, and other people are also interested in the problem of document representation.

    The problems with attributes being used to link elements across the native hierarchical structure of marked-up texts is symptomatic of the fact that the SGML/XML markup language was designed primarily for print.

How so?

    How do encoders of a text grasp mentally what is going on with links? For all but the simplest cases it requires a serious mental effort to follow the structure, and this greatly increases the chance that someone will "stuff it up".

Good question, well and thoroughly discussed in the hypertext literature (back before the Web killed most hypertext literature). The problem doesn't seem to me particular to SGML or XML. The same problem arises if one represents arbitrary graph structures in TeX or in Word (only, in Word, the problem seems to me to be much more difficult -- I do know users of SGML and XML who use complex structures and understand them, but I have never met anyone who could understand complex structures in Word or any other word processor) or in COCOA or in every other machine-readable notation I have ever seen anyone propose for arbitrary graph structures. It seems to me to be true that arbitrary graph structures are in the general case harder to think about than tree structures; that is one of the reasons trees have historically been such an important tool of thought (e.g. in syntactic analysis, where both dependency grammars and phrase-structure grammars use trees as a formalism). Trees themselves take some learning; I recall spending several years training myself to think about SGML documents not merely as sequences of characters interspersed with tags (the only model I know for batch formatters like troff, Runoff, and Script) but as representations of tree and graph structures. So the question does arise: how do we manage to keep some kind of intellectual grip on what we are doing?
Often, when we have trouble grasping something, we give ourselves tools to help: visualizations, automatic checks for well formedness, validity, or soundness, and so on. In general, I believe users interacting with complex documents can benefit from special-purpose user interfaces. All of this is true regardless of the underlying document representation format.

    Although you can verify the element and attribute structure how do you verify these links? Can you even check that each idref has its id somewhere in the document?

Checking that IDREF values match an ID in the document has been implemented as part of DTD-based validation for SGML and XML for thirty-odd years now, so I think it's probably safe to say that yes, it is possible to check that each IDREF matches an ID somewhere in the document. (I find it hard to believe that anyone with any interest in document representation and processing can be unaware of this. I had assumed in the past that DS's unhappiness with SGML and XML was based on some at least rudimentary knowledge of those formats; the question above seems to suggest that I was wrong.)

    Or that they do not form a directed cycle?

Checks for cycles do not form part of DTD-based validation in either SGML or XML; they do form part of the 'Service Modeling Language' defined by a W3C Recommendation of 2009, based on a submission of 2007. (SML defines a number of cross-document validation constraints, but it does not separate cleanly between validation constraints and domain-specific issues of interest to its designers, so it unfortunately cannot be recommended as a general tool for stronger document validation. It does, however, provide an example of the definition of validity constraints that go beyond those defined by conventional schema languages.)

    Or that the structures thus created are even computable?

I'm not sure I know how to define any set of constraints on links so as to make the decision on document validity be non-computable; if anyone knows a way, I am pretty sure at least one computer science journal would like to talk to them about a paper on the subject. But there was some discussion a few years ago (probably some time between 2000 and 2010) about using additional constraints on ID/IDREF links (or, equivalently, XSD referential integrity constraints) to model 3SAT as a document validity problem, which means validation involving constraints of that kind may be slow. As the continued popularity of Schematron shows, however, many potential users are not particularly fussed about proving that their schema language has any particular guaranteed complexity ceiling. So it's not surprising that after a few weeks, the discussion died down (and in retrospect I find myself thinking I really ought to check, before sending this mail, to make sure the whole thing wasn't an April Fool's prank -- but I'm not going to check).

The problem of link validation seems at first glance to become somewhat more difficult in arbitrary graph representations of documents; in the directed graph of an XML document, the element structure of the document provides an easily identified spanning tree, but in a pure directed graph, the identification of a spanning tree is one of those problems for which I have to break out the data structures and algorithms books on my shelf.

    Although this looks on the surface to be a useful way to break out of hierarchies, in practical terms it is not very useful.

Well, utility is probably one of those things in the eye of the beholder.
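(To make the ID/IDREF point above concrete: the following is a toy sketch of my own, using Python's standard xml.etree.ElementTree and generic 'id' and 'target' attributes rather than any real DTD machinery, not a description of how any particular validator works. It shows how mechanical both checks are: the reference check that DTD-based validation has always performed, and a cycle check of the kind that goes beyond it.)

    # Toy check: every link target must name a declared id, and following
    # the links from any element must never revisit an id (no directed cycle).
    import xml.etree.ElementTree as ET

    doc = """<text>
      <note id="n1" target="n2">See the next note.</note>
      <note id="n2">Final note.</note>
    </text>"""

    root = ET.fromstring(doc)
    ids = {e.get("id") for e in root.iter() if e.get("id")}
    links = {e.get("id"): e.get("target")
             for e in root.iter() if e.get("id") and e.get("target")}

    # Reference check (the part covered by DTD-based ID/IDREF validation).
    dangling = [t for t in links.values() if t not in ids]

    # Cycle check (not part of DTD validation): walk the chain of links.
    def has_cycle(start):
        seen, node = set(), start
        while node in links:
            if node in seen:
                return True
            seen.add(node)
            node = links[node]
        return False

    print("dangling references:", dangling)
    print("cyclic references:", [i for i in links if has_cycle(i)])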
That some people do find it useful is an easily verified empirical fact. Your mileage may vary.

    The same goes for CONCUR. It is not even available in XML and does not seem to have been much used in SGML.

It is quite true that most conforming SGML parsers did not implement CONCUR, and that CONCUR appears to have been seldom used in SGML applications. I believe that Jean-Pierre Gaspart wrote a number of applications for the European Commission which did use CONCUR, which he used to simplify both DTD construction and processing dispatch. But those applications were probably never available outside the Commission.

Some people have argued (I am thinking particularly of Elli Mylonas, Allen Renear, et al. in the discussion after their paper at the 1992 ALLC/ACH conference) that we need to devise a new model of documents to handle overlapping structures, and that CONCUR is not a good candidate because CONCUR was not widely implemented. This seemed to me then and seems to me now to conflate theoretical issues and practical issues. The lack of implementations is a practical issue -- it means that if we want to use CONCUR we need to write software to support it. But that practical issue says nothing at all about whether CONCUR does or does not provide a useful model of text. Nor does it offer a decisive practical argument: if we decide not to use CONCUR as our model and devise a brand new model instead, we will still have to devise software to support the model. DS appears to argue not that CONCUR has no theoretical interest but only that it's not an off-the-shelf solution one can use today. True enough, but I don't know of any off-the-shelf solutions one can use today.

Neither DS nor EM and her co-authors seem to have contemplated (or at least: they have not as far as I know addressed publicly) the possibility that the low user demand for CONCUR might suggest that overlapping structures might not be so pressing an issue in the practice of document analysis and processing as some of us had thought.

In November 1987 at the Poughkeepsie meeting which launched the Text Encoding Initiative, when the issue of overlap came up, David Barnard mentioned that he had done some work on CONCUR in collaboration with a couple of people in industry. The feature, he said, had some problems. So it might be tempting to devise our own solution. But the problem with that approach, he pointed out, was that objectively speaking no alternative solution one came up with was likely to be much better than CONCUR, or any better at all. The degree of critical self-reflection shown in David's remark is one I have often wished were more frequently present in discussions of alternatives to SGML and XML (and indeed other technologies). But like good design sense and technical tact, self-critical intellectual honesty does not look to be as widely distributed a gift as one might wish it were.

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************


--[3]------------------------------------------------------------------------
Date: 2019-02-23 19:29:47+00:00
From: Desmond Schmidt
Subject: Re: [Humanist] 32.486: editions, in print or in bytes

On 22 February 2019 Hugh Cayless wrote:

> The new must overthrow the old; the
> producers of digital media need pay no mind to the requirements of print.
> Indeed, any association with print is grounds for dismissal.
Perhaps you should walk into a library some time and simply observe people studying. Some libraries still have books on their shelves. The more modern ones dependent on teaching texts and journals have thrown them out or redeployed them as innovative decoration of the ceiling. In the State library here it is a major event if someone picks up a book at all. Reading one is something I have not seen in the past few years. Even in university libraries the same behaviour manifests itself: students studying with laptops, surfing the web, librarians who actively pursue a policy of "digital first" - that is buying nothing physical if they can possibly help it. At UQ humanities library students still consult physical books because the humanities basically study the past, but their use of that resource has definitely decreased. If this trend continues it looks as if libraries will eventually contain nothing but rare books and manuscripts, if they had any in the first place. In this changing environment what can humanists do but move with the times?

> This sort of thing leads to the case where, while scholars may *use*
> digital editions in their research (e.g. in searching for evidence to
> support an argument), they only *cite* the print versions, because they are
> "better". I just want a world where that divide has been better-bridged.

Let's say we had digital scholarly editions that could be cited, and be regarded as the best ones to cite - what need would we have of print editions? I don't see how the ability to cite digital editions bridges the worlds of print and digital. It just says goodbye to print. Why would anyone pay $50 for a print edition if they can have a "better" one that costs nothing? Why would a library do that either? Why would a publisher risk producing something for a niche market that is unlikely to sell enough copies to recoup the investment?

You are right that humanists don't currently cite digital editions much. They rely on print. But this is simply evidence of our failure to make good enough digital editions. I just want them to be better also.

Desmond Schmidt
eResearch
Queensland University of Technology


--[4]------------------------------------------------------------------------
Date: 2019-02-23 15:33:01+00:00
From: Katherine Harris
Subject: Re: [Humanist] 32.486: editions, in print or in bytes

Hi,

Might I add to this discussion by pointing to the Modern Language Association's Committee on Scholarly Editions [https://www.mla.org/About-Us/Governance/Committees/Committee-Listings/Publications/Committee-on-Scholarly-Editions], which has been working on integrating more scholarly editions into its process of vetting for approval of a seal. By no means the authority, but the rotating membership of the committee has included venerable scholarly editors and digital scholarly editors. The Committee created Guidelines for Scholarly Editors [https://www.mla.org/Resources/Research/Surveys-Reports-and-Other-Documents/Publishing-and-Scholarship/Reports-from-the-MLA-Committee-on-Scholarly-Editions/Guidelines-for-Editors-of-Scholarly-Editions], which encompass the core issues surrounding the debate here on Humanist.
Editions are submitted for rigorous review using Guiding Questions for Vetters of Scholarly Editions [https://www.mla.org/Resources/Research/Surveys-Reports-and-Other-Documents/Publishing-and-Scholarship/Reports-from-the-MLA-Committee-on-Scholarly-Editions/Guidelines-for-Editors-of-Scholarly-Editions#questions] - and recently updated that review questionnaire to include issues unique to scholarly editions such as sustainability and markup. Editions that have received the MLA Seal [https://www.mla.org/Resources/Research/Surveys-Reports-and-Other-Documents/Publishing-and-Scholarship/Reports-from-the-MLA-Committee-on-Scholarly-Editions/CSE-Approved-Editions] include some venerable digital scholarly editions, including The Blake Archive. The Committee does have an issue with only large, well-funded scholarly editions being submitted, usually based on a single author.

In May 2016, the Committee published a White Paper on the "Scholarly Edition in the Digital Age [https://www.mla.org/content/download/52050/1810116/rptCSE16.pdf]" (pdf - online version [https://scholarlyeditions.mla.hcommons.org/cse-white-paper/]) questioning the move from print to digital for the scholarly edition and asking for further feedback from the MLA community. The Committee has a series of blog posts [https://scholarlyeditions.mla.hcommons.org/category/blog/] on the issues as well.

What a digital edition affords is not only the study of a single author's work, but also a particular form and genre across time periods. This digital version also affords a deep dive into authors that would not otherwise be deemed "worthy" or canonical by funding agencies for support. The question is, and I just saw a CFP [http://www.archives.gov/nhprc/announcement/depc] for building out digital scholarly editions infrastructure to be used widely, are we still discussing what is a scholarly edition just moved to digital or are we discussing what is an *authoritative* digital scholarly edition, in which case that means we're spending a lot of time defining the areas and creating boundaries to keep out *unauthoritative* digital scholarly editions. I'm less interested in the who's-in-and-who's-out discussion and more interested in how to expand the digital scholarly edition beyond the limitations of the codex without having to spend $1 million+ to get it done.

~Kathy

********************
Dr. Katherine D. Harris
Professor, Department of English & Comparative Literature
San Jose State University
Research Blog: http://triproftri.wordpress.com/
Co-Editor, *Digital Pedagogy in the Humanities*
[https://github.com/curateteaching/digitalpedagogy/blob/master/description.md]
Author, *Forget Me Not: The Rise of the British Literary Annual, 1823-1835*
[http://www.ohioswallow.com/book/Forget+Me+Not]

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php
Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)
This site is maintained under a service level agreement by King's Digital Lab.