Humanist Discussion Group, Vol. 32, No. 423.
Department of Digital Humanities, King's College London
Hosted by King's Digital Lab
www.dhhumanist.org
Submit to: firstname.lastname@example.org

From: Henry Schaffer
Subject: Re: [Humanist] 32.417: the McGann-Renear debate (154)

From: Gabriel Egan
Subject: Re: [Humanist] 32.417: the McGann-Renear debate (52)

From: Desmond Schmidt
Subject: Re: [Humanist] 32.416: the McGann-Renear debate (189)

--------------------------------------------------------------------------
Date: 2019-02-04 14:31:14+00:00
From: Henry Schaffer
Subject: Re: [Humanist] 32.417: the McGann-Renear debate

On Mon, Feb 4, 2019 at 2:55 AM Humanist wrote:

> Humanist Discussion Group, Vol. 32, No. 417.
> Department of Digital Humanities, King's College London
> Hosted by King's Digital Lab
> www.dhhumanist.org
> Submit to: email@example.com
>
> From: William Pascoe
> Subject: Re: [Humanist] 32.416: the McGann-Renear debate (71)
>
> From: Dr. Herbert Wender
> Subject: Re: [Humanist] 32.416: the McGann-Renear debate (11)
>
> --------------------------------------------------------------------------
> Date: 2019-02-04 00:47:04+00:00
> From: William Pascoe
> Subject: Re: [Humanist] 32.416: the McGann-Renear debate
>
> Hi,
>
> Just thinking about how a practical alternative technology to XML for marking up texts based on this discussion might work, where text = a linear series of discrete characters (since there are many such things and it's useful to find general ways to work with them).
>
> One issue is that there are many hierarchies applicable to any text depending on what someone is interested in. It's not realistic to apply all this markup to a single text file, so how would you overlay all these different markups?
>
> Another is that start and end points of features of interest sometimes overlap, which is a problem for the strictly nested hierarchy required in XML.

This is IMHO a very important point. If our markup definitions don't reflect our texts, which has to give way? Here's a string of characters *bbbbbb**iiiiii* where bold (*b*) and italic (*i*) don't overlap and so there's no problem with overlapping tags.
But what about *bbbbbb**iiiiii* where there is overlap (if the fonts don't come through, 3 bold *b*s, 3 bold italic *b*s, 3 bold italic *i*s and 3 italic *i*s), which is well represented by the overlapping tags: <b>bbb<i>bbbiii</b>iii</i>

> Why not just leave the text file alone, not put the tags in the text file itself, and specify markup in a different file, that has pointers to the start and end character.

From my vantage point, there really isn't a difference. The two formats are "isomorphic" in the mathematical sense. We accept this without noticing it with respect to all computer storage of text. The letter "A" isn't an "A" in computer storage, it's a group of bits (0 or 1) in a cluster (perhaps 01000001) which we all agree stands for an "A" and when we tell the computer to display it on the screen it shows up as "A". Similarly, both <b>bbb<i>bbbiii</b>iii</i> and something like bbbbbbiiiiii in one file and bold[0:8] italic[3:11] in another file represent exactly the same text and are presented with identical appearance if the screen allows the three fonts bold-non-italic, bold-italic, and non-bold-italic.

> The strictness of the hierarchy would be optional, so for those purposes that things like DTDs are useful for, you could still have that (such as act, scene, speaking character or POS tagging), but you could also have complete looseness, such as annotations on overlapping text segments.

Yes, it can be done in two files. But it can be done in one file - unless we have a "rule" that tags are not allowed to overlap. But if we are to honor that rule, then surely we need to honor it in the two-file storage method. So the problem resolves to the rule of non-overlapping, not to the method of storage and representation. So we come down to the question of whether we continue to use a method if it doesn't comport well with reality. However, non-overlapping may not be a problem with respect to representation, but more one of awkwardness.
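As a rough illustration of the two-file idea, here is a minimal Python sketch. It assumes that bold[0:8] and italic[3:11] use inclusive end offsets, which is what the twelve-character example implies; the names TEXT, RANGES, styles_per_char and runs are hypothetical. It groups the characters into runs that share the same styles, so an overlapping set of ranges and a split, non-overlapping set can be compared directly.

```python
# Hypothetical stand-off sketch: the text lives in one string and the markup
# in a separate list of (style, start, end) ranges with INCLUSIVE end offsets.
from itertools import groupby

TEXT = "bbbbbbiiiiii"
RANGES = [("bold", 0, 8), ("italic", 3, 11)]  # overlapping ranges


def styles_per_char(text, ranges):
    """Return, for each character, the frozenset of styles covering it."""
    per_char = [set() for _ in text]
    for style, start, end in ranges:
        for pos in range(start, end + 1):  # +1 because end is inclusive
            per_char[pos].add(style)
    return [frozenset(s) for s in per_char]


def runs(text, ranges):
    """Group consecutive characters that carry the same set of styles."""
    styled = zip(text, styles_per_char(text, ranges))
    result = []
    for styleset, group in groupby(styled, key=lambda pair: pair[1]):
        chunk = "".join(ch for ch, _ in group)
        result.append((chunk, sorted(styleset)))
    return result


if __name__ == "__main__":
    # The same styling expressed with ranges that never overlap partially.
    split = [("bold", 0, 2), ("bold", 3, 8), ("italic", 3, 8), ("italic", 9, 11)]
    for chunk, styleset in runs(TEXT, RANGES):
        print(chunk, styleset)
    print(runs(TEXT, RANGES) == runs(TEXT, split))
```

Under these assumptions both range sets reduce to the same three runs, so the choice between them is one of convenience and not of what can be represented.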
Here are two tagged representations

<b>bbb<i>bbbiii</b>iii</i> (overlap)
<b>bbb</b><b><i>bbbiii</i></b><i>iii</i> (non-overlap)

which produce/represent identical texts. So, should we even care which is used? And similarly each would produce a different second-file representation (i.e. tags with pointers), but again, they would produce/represent identical texts.

> This external markup file could still even be XML itself, with a lot of tags and values that point to start and end points, and its own DTDs for problem domains, as normal. Instead of text in it, there are just the pointers. It would be easy enough to automatically convert a text marked up with XML into one of these external ones, just by stripping the text and putting a 'start' and 'end' attribute on every tag.

Bill, I think I've said above what you are saying, but taken a lot more space to spell it out with examples!

> That way an interface could slurp in as many of these and highlight/colour code etc a text with as many different markup files as you wanted to import, if you wanted to visualise it, or also, to process it, with these different ways of looking at it from different people's work.
>
> I'd be surprised if someone hasn't already come up with this approach.
>
> A problem with this approach is that the text of interest would need to be static, because you make one change and all the pointers point to the wrong address, but in many useful cases there are static texts for which this would be useful. Though some version control system could be brought to bear.

Here's where the isomorphism becomes useful. If you make one change in the visual one-file representation, it is automatically changed in the storage two-file representation.
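The pointer bookkeeping this implies can be sketched in Python under one assumed policy: offsets are inclusive, and text inserted between two characters of a range inherits that range's style. The function name insert_text is hypothetical, and a real editor would also need to handle deletions.

```python
# Hypothetical sketch: keep stand-off ranges valid when text is inserted.
# Ranges are (style, start, end) tuples with INCLUSIVE end offsets.


def insert_text(text, ranges, pos, new):
    """Insert `new` before the character at offset `pos` and shift ranges."""
    shift = len(new)
    updated = []
    for style, start, end in ranges:
        if start >= pos:
            start += shift  # range begins at or after the insertion point
        if end >= pos:
            end += shift  # range ends at or after the insertion point
        updated.append((style, start, end))
    return text[:pos] + new + text[pos:], updated


if __name__ == "__main__":
    text = "bbbbbbiiiiii"
    ranges = [("bold", 0, 8), ("italic", 3, 11)]
    new_text, new_ranges = insert_text(text, ranges, 6, "X")
    print(new_text)
    print(new_ranges)
```

With this policy an insertion at offset 6 lengthens both ranges by one, while an insertion at offset 0 moves both ranges one place to the right without lengthening them.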
That's what happens today in editing - one inserts/deletes one character in a file, and we don't even think about it, but the computer has to shift all the succeeding characters in the file and does it so successfully in the background that we're oblivious to the work needed.

> So for example, this would be useful for the set of Shakespeare plays. Canonical, unchanging plain text versions are readily established, and there is much interest in people marking it up in different ways for different purposes. In your app you could view the text and import/overlay any number of scholars' markup and annotations. Or you could process different exo-markup files to datamine for correlations etc.

I'll go off in another direction. I often work with text files made up of ASCII characters. There are 128 ASCII characters (or perhaps 256 http://www.asciitable.com/) but not all of them are "printable", i.e. they show up on the computer screen as " ", i.e. a blank space. But they aren't a blank/space character in storage. To distinguish between these non-printing characters and a true space character one needs to peer inside the computer storage (the Unix/Linux app "od" will do that). (The most common reason for doing this is that Microsoft Word loves to use these characters outside the standard 128 ones - and this can lead to unanticipated results when putting material on the web or doing computer processing.)

--henry schaffer

--------------------------------------------------------------------------
Date: 2019-02-04 12:55:41+00:00
From: Gabriel Egan
Subject: Re: [Humanist] 32.417: the McGann-Renear debate

Dear HUMANISTs

In response to my querying ("show us an example") of Desmond Schmidt's claim that in early modern drama it's possible not only for a dialogue line to be inside a speech but also for a speech to be inside a line, Herbert Wender offers an example from Goethe's 'Faust'.
The example is of a manuscript in which the name "ROSENKNOSPEN" appears anomalously in the middle of a spoken line. In an article cited by Wender, the emendation of this line is discussed, and two options considered. One is to move the name to the beginning of the line so it forms a speech prefix, and the other is to move it to another line altogether.

I'm not seeing how this example illustrates what Schmidt claimed, which was that it's permissible in early modern drama for a speech to be inside a line. Rather, it seems to illustrate that it's possible for a textual witness to contain error. Far from treating this moment in the play as an example of a speech being inside a line, the editions of 'Faust' under discussion in the essay Wender cites treat this as an error to be corrected.

There can of course be a speech prefix (marking the end of one speech and the start of another) occurring within a manuscript or printed line. Where these occur, no early modern dramatist thought that the speech was inside the line. We know they didn't because when they came to make the actors' 'parts', each of which contained all the lines to be spoken by a single character, they would divide such a manuscript or printed line between two different 'parts' (different physical documents), one for each of the two characters. That is, they saw such a manuscript or printed line of words as really containing two lines: first the last line of one person's speech and secondly the first line of someone else's speech. They treated such a shared manuscript or type line just as we do today, as being really two lines crammed together.

The context for all this is that I was defending the claim that texts such as early modern plays really are an Ordered Hierarchy of Content Objects.
The tree-ness is not merely in the eye of the beholder as Schmidt claimed.

Regards

Gabriel Egan

--------------------------------------------------------------------------
Date: 2019-02-04 08:11:14+00:00
From: Desmond Schmidt
Subject: Re: [Humanist] 32.416: the McGann-Renear debate

Gabriel,

I haven't counted them but there are probably thousands of such cases in Shakespeare. An example from Hamlet Act 1, Scene 1:

Horatio: Friends to this ground.
Marcellus: And liegemen to the Dane.

(the pentameter line being split over two speeches)

This example was given by Barnard et al. way back in 1988 (LLC 3.1, 26–31). In addition any speech that ends in a half-line followed by one starting in another half-line provides an example of overlap.

Sure, Shakespeare had in his mind that his plays were composed of 5 acts, each consisting of a number of scenes etc, but when they were printed they were rendered using lead characters arranged on a rectangular grid using fonts of different sizes and types without the need for any explicit hierarchies. And when he wrote them he was not constrained as we are to create elements that strictly nest. He did whatever he could with his pen, not his computer.

I chose the Shakespeare example deliberately because you CAN analyse it with a strong hierarchical structure that almost works. But all you are doing by insisting on it is setting up your hierarchy to be broken by someone else who wants a different analysis. And what do the hierarchies actually achieve? They don't tell you anything you didn't already know, and so can be dispensed with. They are just a requirement of the markup language.

Hugh,

I did not say that semantic markup should be external to the text. I said that semantic information can be derived from text without using any kind of markup. Semantic markup in XML is too often focused on the narrow needs of the people who encoded it, or merely records things that are self-evident and hence not useful for general search and retrieval.
I was advocating instead the use of concept-mining tools like Leximancer that can extract meaning from plain text, HTML and the like. Also, if modern machine learning techniques can translate from Chinese to English fluently they can also extract meaning from text. So marking up small amounts of meaning internally or externally to a text doesn't seem worth the effort to me. I am advocating a much simpler format for text close to plain text that can be easily mined for information, that contains only rendering or abstract rendering information. Deeply structured texts as once provided by XML don't fit the bill because they mix up the rendering with the semantics and use too rigid a document structure that invites overlapping hierarchies on reuse.

On 2/3/19, Humanist wrote:

> Humanist Discussion Group, Vol. 32, No. 416.
> Department of Digital Humanities, King's College London
> Hosted by King's Digital Lab
> www.dhhumanist.org
> Submit to: firstname.lastname@example.org
>
> From: Hugh Cayless
> Subject: Re: [Humanist] 32.412: the McGann-Renear debate (56)
>
> From: Gabriel Egan
> Subject: Re: [Humanist] 32.412: the McGann-Renear debate (34)
>
> --------------------------------------------------------------------------
> Date: 2019-02-02 17:46:09+00:00
> From: Hugh Cayless
> Subject: Re: [Humanist] 32.412: the McGann-Renear debate
>
> This has been one of those threads where I'm torn between responding and unsubscribing.
>
> Desmond argues (if I am understanding correctly) that since semantic markup cannot perfectly describe what's going on in a text, it's better not to do it in the text, and instead to focus on a minimalist production, with only the necessary features, which can then have different layers of annotation wrapped around it. I suspect Alex might agree with this approach.
> In this view, TEI/XML is fundamentally flawed because it imposes structures on the text that aren't really there, or are only there in certain interpretations or readings of the text.
>
> His interlocutors are arguing that this argument confuses format with function and that nothing stops you from doing a minimal TEI with annotations, or deriving a minimal text from a maximally marked up text. TEI is not fundamentally flawed because though it can't do everything, it can be a foundation for doing practically anything. It just gives you a language for making text models; what you say in that language is up to you. I suspect Desmond might counter that language influences cognition, and an imperfect language may steer you to think in ways that are actually pernicious.
>
> Alex's point about the "quantum" nature of text is well taken, though I think perhaps it points more at the character encoding level than the markup level. In the former, in order to represent his example, I have to decide whether the thing in question is a circle, or a Latin o, or perhaps an omicron or Cyrillic o, or something else entirely. In fact, at the markup level (in TEI at any rate), there are ways to represent this kind of uncertainty.
>
> But this leads us towards a problem: I've often heard the argument from folks who do things like machine text analysis that TEI is too messy a format for them. And indeed it often is. You can't just derive a token stream from many TEI documents without first making informed decisions about how to get at that stream -- normalized or original text? Base text or particular readings? But here is, I think, where the crux lies: TEI says, if you will, "this whole digital artifact is the edition, my (the editor/encoder)'s argument about the text." The annotators say, "here is the text, and there are my arguments about it. You can easily have one without the other."
>
> But can you separate the argument about the text from the text itself? My own answer to this question is a resounding NO. But maybe that comes from my perspective as someone who works a lot with messy and difficult edge case texts. Likely in many cases it doesn't really matter. In my view the splitting of text from argument (or even the idea that you should) pushes you towards error in the same sorts of ways Desmond believes hierarchy pushes you towards error. Who's right? I dunno.
>
> Perhaps we just have to be aware that our tools and formats all have their benefits and risks and we have to make decisions in the light of that awareness. My plea would be for more open collaboration and constructive criticism and less "You're doing it wrong!"
>
> All the best,
> Hugh
>
> PS I really like Herbert's suggestion of a TEI Guidelines "Dirty Tricks" chapter.
>
> --------------------------------------------------------------------------
> Date: 2019-02-02 10:12:06+00:00
> From: Gabriel Egan
> Subject: Re: [Humanist] 32.412: the McGann-Renear debate
>
> Dear HUMANISTs
>
> I asserted that in early modern drama, "All the dialogue lines occur inside speeches, all the speeches occur inside scenes, and all the scenes occur inside acts, and there are exactly five acts". Desmond Schmidt responded:
>
> > Well, that's your analysis. Another way to analyse it is to say that the headings for scenes and acts are simply in italics or a big font.
>
> But as I pointed out, the creators of early modern drama repeatedly described their work the way I've described it, as a tree, and never once (in the materials I'm aware of) described it in terms of the typographical representations of the units' headings. Are you saying that it doesn't matter how the creators thought of their work?
>
> > In any case your example is not perfect: sometimes speeches are inside lines and sometimes lines are inside speeches.
> > How do you explain that?
>
> I believe the claim that "sometimes speeches are inside lines" to be untrue. Can you give us an example?
>
> Regards
>
> Gabriel Egan

--
Dr Desmond Schmidt
Mobile: 0481915868
Work: +61-7-31384036

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: email@example.com
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php
Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)
This site is maintained under a service level agreement by King's Digital Lab.