Humanist Discussion Group, Vol. 34, No. 117.
Department of Digital Humanities, King's College London
Hosted by King's Digital Lab
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

[1] From: Peter Robinson
    Subject: Humanist 34.89 to 34.116, on texts, hierarchies, XML, JSON (143)

[2] From: Desmond Schmidt
    Subject: Re: [Humanist] 34.113: annotating notation (40)

--[1]------------------------------------------------------------------------
Date: 2020-06-16 14:55:08+00:00
From: Peter Robinson
Subject: Humanist 34.89 to 34.116, on texts, hierarchies, XML, JSON

This is in response to the discussion about models of text, JSON, XML, hierarchies, and so on, sparked by Michael Sperberg-McQueen's comments on the Oxford Concordance Program in 34.89 and my reply in 34.93, and since carried on in multiple further contributions: 34.100 and 34.106 (M. S-M), 34.105 (Desmond Schmidt), 34.108 (Herbert Wender on OHCO), 34.110 (Hugh Cayless, Desmond Schmidt, William Pascoe), 34.113 (Hugh Cayless and John Keating), and 34.116 (Willard McCarty, expressing some weariness).

First, a disclaimer of sorts. I am not especially interested in discussions about the merits of XML versus JSON and the like, and I sympathize with Willard's distaste for that cup of tea. My motivation in my original reply to Michael's comments was not to promote JSON, but to incite a discussion about hierarchies and texts. In particular, I wanted to draw people's attention to the model I have proposed for text in various publications. Here are the essential parts of this model:

1. The three key terms are 'document', 'text' and 'act of communication'. I argue, most accessibly in https://www.academia.edu/43355876/Creating_and_implementing_an_ontology_of_documents_and_texts, that 'a text is an act of communication inscribed in a document'. Accordingly, all texts have a double aspect: they are acts of communication, and they are present in documents. Each aspect may be represented as a tree, with each tree independent of the other.
A text may therefore be conceived as a collection of leaves, with each leaf present on both the document tree and the act of communication tree. A typical document tree might be: document --> quires --> pages --> columns --> line-like writing spaces; the act of communication tree might be: play --> acts --> scenes --> lines. We use the term 'entity' as shorthand for 'act of communication'.

2. We may conceive an architecture for these three collections as follows.

a. The document and act of communication ('entity') trees can each be represented as a collection of nodes linked in an OHCO tree, with each node holding:

   Name: the name of the node (div, page, etc.)
   Attributes: a list of property/value pairs associated with the node
   Ancestor: the ancestor node of this node. For the root node (a TEI, etc.) there is no ancestor
   Children: an ordered list of the children of this node. These children may be nodes in the same tree (quires as children of the root document, etc.) or they may be leaves from the third collection.

b. The leaves may be seen as a string pool, with each leaf as follows:

   Text: the text of this leaf, essentially a plain UTF-8 string
   Entity: the immediate ancestor node of this leaf in the act of communication tree
   Document: the immediate ancestor node of this leaf in the document tree

A few notes. The document and entity trees are completely independent of each other. They may have their leaves in different orders: a leaf of text may be an immediate sibling of another leaf in one tree (for example, two words on the same page) but far apart from it in the other tree (they might belong to different poems). The entity tree, in particular, could be translated directly to and from a TEI/XML document without any loss of information. No 'pcdata' is held in the document and entity trees; all pcdata is held in the 'string pool', with all leaves referenced as children of nodes within the document and entity trees.
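To make the two-trees-plus-string-pool idea concrete, here is a minimal sketch in Python dictionaries (not the JSON/MongoDB storage of Textual Communities itself; all node and leaf identifiers, and the sample text, are invented for illustration):

```python
# String pool: each leaf holds its text plus its immediate parent in
# each of the two trees (the Entity and Document fields above).
leaves = {
    "w1": {"text": "Two households,", "entity": "prologue", "document": "page1"},
    "w2": {"text": "both alike",      "entity": "prologue", "document": "page1"},
    "w3": {"text": "in dignity,",     "entity": "prologue", "document": "page2"},
}

# Document tree: the physical hierarchy (document --> pages --> leaves).
document_tree = {
    "doc":   {"name": "document", "attributes": {},         "ancestor": None,  "children": ["page1", "page2"]},
    "page1": {"name": "page",     "attributes": {"n": "1"}, "ancestor": "doc", "children": ["w1", "w2"]},
    "page2": {"name": "page",     "attributes": {"n": "2"}, "ancestor": "doc", "children": ["w3"]},
}

# Entity tree: the act-of-communication hierarchy (play --> prologue --> leaves).
entity_tree = {
    "play":     {"name": "play",     "attributes": {}, "ancestor": None,   "children": ["prologue"]},
    "prologue": {"name": "prologue", "attributes": {}, "ancestor": "play", "children": ["w1", "w2", "w3"]},
}

def extract(tree, node_id, pool):
    """Depth-first walk: gather the text of every leaf under node_id,
    in this tree's own order."""
    out = []
    for child in tree[node_id]["children"]:
        if child in pool:                 # child is a leaf in the string pool
            out.append(pool[child]["text"])
        else:                             # child is another node of this tree
            out.extend(extract(tree, child, pool))
    return out

# The text of one physical page, via the document tree:
print(" ".join(extract(document_tree, "page1", leaves)))
# → Two households, both alike

# The whole prologue, spanning the page break, via the entity tree:
print(" ".join(extract(entity_tree, "prologue", leaves)))
# → Two households, both alike in dignity,
```

Note how leaf w3 is an immediate sibling of w1 and w2 in the entity tree but sits on a different page in the document tree; a further hierarchy would simply be one more tree of this shape, plus an extra parent field on the leaves it contains.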
Additional hierarchies could be added simply by creating further tree-like node collections, with the same structure as the document and entity trees, and with leaves as children. You would also need to add an extra field for your new hierarchy to the leaves present in that hierarchy, and likely restructure and add some leaves. There are some straightforward examples of this architecture in https://www.academia.edu/43355876/Creating_and_implementing_an_ontology_of_documents_and_texts.

***********

The advantage of this structure is that it makes easy some aspects of text representation and handling. Many document implementations now allow representation of the text present in a line within a manuscript page, and of the whole page; this can be extracted very easily via the document tree. Alternatively, many systems (for example, collation tools) require the text of a speech or chapter to be extracted, independent of how that text spans line and page boundaries; this can be extracted very easily via the entity tree. Operations currently very difficult in TEI/XML -- e.g. representing an authorial document where the author rewrites a whole section of text, composed of multiple paragraphs, where both the original version AND the rewritten version span multiple (and distinct) sets of pages -- can be handled easily in this scheme.

Concerning implementation: famously, it seems, we chose to use JSON for all data within the system. This choice was not based on any inherent conviction as to the superiority of JSON. We first tried to make this work in a ground-up closed system, the Anastasia publishing system, itself modelled on the long-extinct DynaText (Anastasia is still working, in fact, at https://github.com/peterrobinson/Anastasia2). Then we tried to use an XML database (XML DB, now part of Oracle). For a long time we thought relational databases the way to go, as they explicitly support the joins etc. that our model requires.
It took us a long time to discover that SQL etc. just could not do what we wanted. We experimented with RDF and SPARQL before finally, around 2014, settling on JSON. The issue here is that we had to have a system that would run live: that is, one that would permit us to update all three collections simultaneously and in real-time as our editors changed the text, added and deleted documents, entities, etc. Readers can imagine that this is challenging. I am not ascribing any particular virtue to JSON here. It just happens to be exceptionally efficient and transparent (particularly in the MongoDB implementation) and very widely supported, and it allows you to write every part of the system -- back-end server, middleware, browser interface -- in a single language (Javascript), using a single presentation encoding (HTML with CSS etc.), with JSON holding all the data as it moves through the many parts of the system. We still use XML tools at many points of the process: for ingest of documents, for page-by-page transcription, and for output of apparatus and other files.

This description differs in many respects from the way Textual Communities (TC) is currently structured. This is mostly because I only figured out the model of 'a collection of leaves, with each leaf present in two or more independent trees' a long way along in the process. And there are many, many other ways in which TC is far from perfect. One might look more closely at eXist now: the documents and entities collections would be well handled by eXist.

Peter

**********

Acknowledgements

Among many: I have learnt much from Xiaohan Zhang, Federico Meschini, Zeth Green, David Parker, Troy Griffits, Steve DeRose, Peter Shillingsburg, Paul Eggert, Hans-Walter Gabler, Jerome McGann, Barbara Bordalejo, Prue Shaw, Klaus Wachtel, Susan Hockey, Ian Lancashire and Lou Burnard on the way to whatever I think I now know. A special thanks to Michael Sperberg-McQueen and Desmond Schmidt for their correspondence over the past few days.
It could be said that I have spent thirty years ignoring what they say (and why should I change now?). But anyone familiar with the TEI will see how much it has infused everything I do, and I owe Michael and Lou Burnard more than I could ever explain.

Reading:

https://www.academia.edu/3233227/Towards_a_Theory_of_Digital_Editions (2012) explains the double aspect of text;
https://www.academia.edu/9575974/The_Concept_of_the_Work_in_the_Digital_Age_published_version_ (2013) argues for a concept of 'work' aligned with 'act of communication';
http://www.digitalhumanities.org/dhq/vol/11/2/000293/000293.html (2017) sets out the text/entity/document ontology outlined here.

See too https://textualcommunities.org/app/ and https://wiki.usask.ca/display/TC/Textual+Communities.

--[2]------------------------------------------------------------------------
Date: 2020-06-16 09:29:56+00:00
From: Desmond Schmidt
Subject: Re: [Humanist] 34.113: annotating notation

Hugh,

I don't necessarily disagree with what you say below. However, for web applications at least, I think schemas are not needed, because invalid inputs can be dealt with successfully within the application. The only thing I feel uneasy about is when you say: "I would argue that schemas allow you to sustain a higher level of complexity than you would be able to otherwise." This suggests to me that schemas may be used as a justification for introducing greater complexity than is really needed. I would suggest instead that a divide-and-conquer approach -- for example, separating metadata, annotation, or variation from the text being encoded -- may result in a simpler overall data model. Since you have been working with external annotation recently (Recogito), perhaps you can see the advantage of this type of approach. There is only so much complexity that people can take, or upon which a successful user interface can be built. If we believe William of Occam: "Entities are not to be multiplied beyond necessity."
This sounds a bit like an admonition to the TEI.

Desmond

On 6/16/20, Humanist wrote:
> I can think of few things more tiresome than reprising the more-or-less
> annual arguments over the goodness or evil of particular file formats. But
> maybe there's productive discussion to be had around the ideas of
> simplicity and correctness. Desmond Schmidt and (iirc) Peter Robinson both
> expressed a desire for simpler formats, by which I think they mean in part
> "schema-less" formats. A schema gives you things like (broad) consistency
> checking, some datatype checking, and it can be used as an aid to the user
> encoding a document by giving them suggestions. I would argue that schemas
> allow you to sustain a higher level of complexity than you would be able to
> otherwise. But are these benefits worth the additional upfront cost?

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php
Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)