Humanist Discussion Group, Vol. 34, No. 125.
Department of Digital Humanities, King's College London
Hosted by King's Digital Lab
www.dhhumanist.org
Submit to: firstname.lastname@example.org

From: Peter Robinson
Subject: Re: [Humanist] 34.124: text structures and architecture and notation (114)

From: Desmond Schmidt
Subject: Re: [Humanist] 34.123: annotating notation (53)

--------------------------------------------------------------------------
Date: 2020-06-19 17:30:27+00:00
From: Peter Robinson
Subject: Re: [Humanist] 34.124: text structures and architecture and notation

This has indeed turned into a most productive discussion. A few comments:

1. Desmond's spare summary of the architecture I propose is quite correct, with the reservation noted by Michael (there is no 'spine' to the pool of text-fragment strings). He is also right about the dangers of millions of JSON fragments becoming unglued inside MongoDB, with frightening consequences. However, the same could be said of any database system (imagine what might happen to the world if the Visa/Mastercard database became unhinged). We do build in redundancy by supplying every text fragment with an identifier; see the next comment.

2. In my compressed overview, I did not describe a key part of the system. It is this: every node of every document, every act of communication ('entity' hereafter), and every text fragment has a unique identifier. This identifier is composed along XPath lines as follows:

A. For documents and acts of communication: this is a straightforward designation of nodes as follows:

For a document -
The Hengwrt manuscript - document=Hengwrt
Folio 1r of the Hengwrt manuscript - document=Hengwrt:folio=1r
The first line space of Folio 1r of the Hengwrt manuscript - document=Hengwrt:folio=1r:linespace=1

For an entity -
The Canterbury Tales - entity=Canterbury Tales
The General Prologue of the Canterbury Tales - entity=Canterbury Tales:part=GP
Line one of the General Prologue of the Canterbury Tales - entity=Canterbury Tales:part=GP:line=1

B. For text fragments. Recall that in this architecture every text fragment is attached to BOTH a document tree and an entity tree.
Accordingly, the text fragment containing the first line of the General Prologue in the Hengwrt manuscript has the following identifier:

document=Hengwrt:folio=1r:linespace=1:entity=Canterbury Tales:part=GP:line=1

Suppose that in this manuscript the first line of the Tales happened to be written across two pages. Then it would be split into two text fragments as follows:

(In linespace 30 of the first page)
document=Hengwrt:folio=1r:linespace=30:entity=Canterbury Tales:part=GP:line=1
(In the first linespace of the second page)
document=Hengwrt:folio=1v:linespace=1:entity=Canterbury Tales:part=GP:line=1

You can see that this makes it very easy to carry out operations such as: identify all manuscripts containing the first line of the Tales, and all pages containing it; retrieve all instances of the first line of the Tales; retrieve all the text on a page or span of pages; show the transcription of text written in a particular line space; and so on.

This architecture should also answer Michael's question: say an author has revised his text five times within the one document. Each of these five texts would be a separate text fragment, on different nodes of the document tree but on the same node of the entity tree. Thus for the same sentence revised on five pages:

document=myDoc:page=1:entity=myWork:sentence=1
document=myDoc:page=2:entity=myWork:sentence=1
document=myDoc:page=3:entity=myWork:sentence=1
document=myDoc:page=4:entity=myWork:sentence=1
document=myDoc:page=5:entity=myWork:sentence=1

In the Textual Communities architecture, entities are constructed very simply: just supply an n attribute to a content element, thus: etc.

This identifier scheme also acts as a further (and more transparent) way of linking all the nodes together, so we are not solely reliant on BSON ids. It also means that data could be exported and put into any other system you fancy in a straightforward way.
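The operations listed above (all manuscripts or pages containing a line, all text on a page) reduce to prefix matching on the two halves of each identifier. What follows is only a minimal sketch in Python under stated assumptions, not the Textual Communities implementation: the function names, the in-memory FRAGMENTS pool, the second manuscript "OtherMs" and the placeholder strings are all hypothetical and introduced purely for illustration.

```python
def parse_identifier(identifier):
    """Split 'key=value:key=value' into an ordered list of (key, value) pairs."""
    pairs = []
    for part in identifier.split(":"):
        key, _, value = part.partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs


def split_trees(identifier):
    """Return (document_path, entity_path); the entity path starts at 'entity='."""
    pairs = parse_identifier(identifier)
    keys = [key for key, _ in pairs]
    cut = keys.index("entity") if "entity" in keys else len(pairs)
    return tuple(pairs[:cut]), tuple(pairs[cut:])


def find(fragments, document="", entity=""):
    """Return identifiers whose document and entity paths start with the given prefixes."""
    doc_prefix = tuple(parse_identifier(document)) if document else ()
    ent_prefix = tuple(parse_identifier(entity)) if entity else ()
    hits = []
    for identifier in fragments:
        doc_path, ent_path = split_trees(identifier)
        if (doc_path[:len(doc_prefix)] == doc_prefix
                and ent_path[:len(ent_prefix)] == ent_prefix):
            hits.append(identifier)
    return hits


# Hypothetical pool of text fragments keyed by identifier.
# The values are placeholders, not real transcriptions.
FRAGMENTS = {
    "document=Hengwrt:folio=1r:linespace=30:entity=Canterbury Tales:part=GP:line=1": "[fragment a]",
    "document=Hengwrt:folio=1v:linespace=1:entity=Canterbury Tales:part=GP:line=1": "[fragment b]",
    "document=Hengwrt:folio=1v:linespace=2:entity=Canterbury Tales:part=GP:line=2": "[fragment c]",
    "document=OtherMs:folio=1r:linespace=1:entity=Canterbury Tales:part=GP:line=1": "[fragment d]",
}

# Every instance of line 1 of the General Prologue, in any document.
line_one = find(FRAGMENTS, entity="entity=Canterbury Tales:part=GP:line=1")

# The manuscripts, and the pages, that contain that line.
manuscripts = {split_trees(i)[0][0][1] for i in line_one}
pages = {split_trees(i)[0][:2] for i in line_one}

# All the text fragments on one page.
folio_1v = find(FRAGMENTS, document="document=Hengwrt:folio=1v")
```

A real system would presumably index the two paths in the database rather than scan every identifier, but the sketch shows why attaching each fragment to both trees makes these queries symmetrical: the same prefix test serves the document side and the entity side.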
In the TC implementation, we express all this in a urn notation, as follows, declared as the 'det' scheme (for documents, entities and texts):

urn:det:usask/document=Hengwrt:folio=1r:linespace=1:entity=Canterbury Tales:part=GP:line=1

Here, we designate usask as a naming authority, following the Kahn-Wilensky scheme (http://www.cnri.reston.va.us/k-w.html).

3. Michael says: "That such similar models have been developed independently by different groups may suggest a certain generality and appeal." Absolutely. We are all responding, in different ways, to our sense of the multiple valencies of text.

4. I am not holding a particular brief for JSON as the way to do this. And the reason I did not give a full example of the JSON implementation of this architecture is that, frankly, I am not proud of it. We (originally, from 1998 on, myself, John Clark, Andrew West, Zeth Green, Federico Meschini, and later Xiaohan Zhang) started out developing this before we were quite clear how all this worked. So there are a lot of legacy fumbles. You can see some examples of the TC implementation at https://www.academia.edu/43355876/Creating_and_implementing_an_ontology_of_documents_and_texts. Because documents and entities are trees, it would make excellent sense to use an XML database to hold those structures. But I could not, and still can't, figure out how to make an XML database point into a pool of text fragments to populate the branches of each tree with data. Please someone.

5. The biggest flaw with TC at present is document/entity ingestion. We badly need an editor which fully understands this architecture. We currently swallow TEI/XML documents, and we assign nodes for the entity tree for each document using simple XPath-like syntax, as above. However, we have to infer the document structure as did COCOA, just by incrementing counters at milestone elements (pb cb lb for pages, columns and line-spaces) and assuming a basic document/page/column/line hierarchy.
This is really not adequate. Please someone.

6. I am aware of quite a few pieces of the puzzle which are not in place. Or rather: TC made various assumptions about the text it is processing which need to be turned into explicit and deniable rules. That is another discussion, if anyone is interested.

Peter Robinson
University of Saskatchewan

--------------------------------------------------------------------------
Date: 2020-06-19 12:02:15+00:00
From: Desmond Schmidt
Subject: Re: [Humanist] 34.123: annotating notation

Hi Hugh,

You can validate your inputs to any program with simple application logic, which, if properly written, can be mathematically perfect. You do not have to do it in a schema. More than half the web already runs on JSON, which does not normally use schema-based validation. Schema validation can be a serious security weakness in complex web applications. When I was working at the Information Security Institute at QUT, we developed some highly effective denial of service attacks by sending a server only a tiny amount of deeply nested XML, which invoked recursive schema validation and took down a powerful web server. Check out our IEEE conference paper on this topic here: https://eprints.qut.edu.au/35760/1/c35760.pdf

Desmond

Date: 2020-06-18 17:53:13+00:00
From: Hugh Cayless
Subject: Re: [Humanist] 34.117: annotating notation

> However, for web applications at least I think schemas are not needed
> because invalid inputs can be dealt with successfully within the
> application.

This means to me that you prefer that constraints be discovered later on, presumably in the form of bugs, and that those constraints be added (fixed) at processing time, rather than at creation time. Your preference is to pay later.

But simplicity is its own constraint, surely. It seems obvious that one can achieve a complex representation using only simple techniques, as long as the simple techniques are composable.
But if you work without controls on your inputs, I'd have thought that invites accidental complexity. I suspect you mean to handle that by insisting on very limited formats as inputs, putting the complexity (or intelligence) in the application layer rather than the data layer. Applications are brittle things, however.

My own preference is to put "intelligence" in the data rather than in the code, where possible, in part because I assume the data are going to last longer than the application, but also because (as a programmer) I expect code (my own not excluded) to be sloppy, broken, and riddled with errors. The structure of data seems to me easier to (dare I say) validate. On the other hand, "data is code" :-).

--
Dr Desmond Schmidt
Mobile: 0480147690

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: email@example.com
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php
Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)
This site is maintained under a service level agreement by King's Digital Lab.