
Humanist Discussion Group



Humanist Archives: June 20, 2020, 8:56 a.m. Humanist 34.125 - annotating notation

                  Humanist Discussion Group, Vol. 34, No. 125.
            Department of Digital Humanities, King's College London
                   Hosted by King's Digital Lab
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org


    [1]    From: Peter Robinson 
           Subject: Re: [Humanist] 34.124: text structures and architecture and notation (114)

    [2]    From: Desmond Schmidt
           Subject: Re: [Humanist] 34.123: annotating notation (53)


--[1]------------------------------------------------------------------------
        Date: 2020-06-19 17:30:27+00:00
        From: Peter Robinson 
        Subject: Re: [Humanist] 34.124: text structures and architecture and notation

This has indeed turned into a most productive discussion. A few comments:

1. Desmond's spare summary of the architecture I propose is quite correct,
with the reservation noted by Michael (there is no 'spine' to the pool of
text-fragment strings). He is also right about the dangers of millions of JSON
fragments becoming unglued inside MongoDB, with frightening consequences.
However, the same could be said of any database system (imagine what might
happen to the world if the Visa/Mastercard database became unhinged). We do
build in redundancy by supplying every text fragment with an identifier; see
the next comment.

2. In my compressed overview, I did not describe a key part of the system. It is
this: every node of every document, every act of communication ('entity'
hereafter), and every text fragment has a unique identifier. The identifier is
composed along XPath lines, as follows:

A. For documents and acts of communication: this is a straightforward
designation of nodes as follows:
For a document - The Hengwrt manuscript - document=Hengwrt
Folio 1r of the Hengwrt manuscript - document=Hengwrt:folio=1r
The first line space of Folio 1r of the Hengwrt manuscript -
document=Hengwrt:folio=1r:linespace=1

For an entity - the Canterbury Tales - entity=Canterbury Tales
The General Prologue of the Canterbury Tales - entity=Canterbury Tales:part=GP
Line one of the General Prologue of the Canterbury Tales -
entity=Canterbury Tales:part=GP:line=1
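
(As an illustrative aside, not TC code: identifiers of this kind can be composed
by joining ordered name/value pairs with colons. The function below is a sketch,
and its name and calling convention are made up for the example.)

    # Illustrative only: compose an identifier from ordered (name, value) pairs.
    def make_identifier(*pairs):
        return ":".join(f"{name}={value}" for name, value in pairs)

    make_identifier(("document", "Hengwrt"), ("folio", "1r"), ("linespace", 1))
    # -> 'document=Hengwrt:folio=1r:linespace=1'
    make_identifier(("entity", "Canterbury Tales"), ("part", "GP"), ("line", 1))
    # -> 'entity=Canterbury Tales:part=GP:line=1'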

B. For text fragments. Recall that in this architecture every text fragment is
attached to BOTH a document tree and an entity tree. Accordingly, the text
fragment containing the first line of the General Prologue in the Hengwrt
manuscript has the following identifier:

document=Hengwrt:folio=1r:linespace=1:entity=Canterbury Tales:part=GP:line=1
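
(Purely as a sketch of what such a fragment might look like as a stored record -
the field names below are invented for illustration, not TC's actual JSON:)

    # Hypothetical fragment record, attached to both trees at once.
    fragment = {
        "document": {"document": "Hengwrt", "folio": "1r", "linespace": 1},
        "entity":   {"entity": "Canterbury Tales", "part": "GP", "line": 1},
        "text":     "Whan that Aprill ...",   # illustrative transcription only
    }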

Suppose that in this manuscript the first line of the Tales happened to be
written across two pages. Then it would be split into two text fragments as
follows:

(In linespace 30 of the first page)
document=Hengwrt:folio=1r:linespace=30:entity=Canterbury Tales:part=GP:line=1
(In the first linespace of the second page)
document=Hengwrt:folio=1v:linespace=1:entity=Canterbury Tales:part=GP:line=1
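
(To read the whole line back, one gathers every fragment sharing the entity path
and concatenates them in document order. A rough sketch, reusing the invented
record shape above; the naive folio sort is a deliberate simplification:)

    # Illustrative: rebuild an entity node's text from its (possibly split) fragments.
    def entity_text(fragments, entity_path):
        hits = [f for f in fragments if f["entity"] == entity_path]
        # crude document-order sort; real folio ordering would need more care
        hits.sort(key=lambda f: (f["document"]["folio"], f["document"]["linespace"]))
        return " ".join(f["text"] for f in hits)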

You can see that this makes it very easy to carry out operations such as:
identify all manuscripts containing the first line of the Tales, and all pages
containing it; retrieve all instances of the first line of the Tales; retrieve
all the text on a page or span of pages; show the transcription of text written
in a particular line space; and so on.
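
(By way of illustration only, such operations reduce to simple filters over the
pool of fragment records; the field names again follow the invented shape above:)

    # All manuscripts containing the first line of the General Prologue:
    gp_line1 = {"entity": "Canterbury Tales", "part": "GP", "line": 1}
    manuscripts = {f["document"]["document"] for f in fragments
                   if f["entity"] == gp_line1}

    # All the text written on folio 1r of Hengwrt, in line-space order:
    page = [f for f in fragments
            if f["document"]["document"] == "Hengwrt"
            and f["document"]["folio"] == "1r"]
    page.sort(key=lambda f: f["document"]["linespace"])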

This architecture should also answer Michael's question: say an author has
revised his text five times within the one document. Each of these five texts
would be a separate text fragment, on different nodes of the document tree but
on the same node of the entity tree. Thus for the same sentence revised on five
pages:

document=myDoc:page=1:entity=myWork:sentence=1
document=myDoc:page=2:entity=myWork:sentence=1
document=myDoc:page=3:entity=myWork:sentence=1
document=myDoc:page=4:entity=myWork:sentence=1
document=myDoc:page=5:entity=myWork:sentence=1
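
(If the fragment pool lives in MongoDB, as in TC, a query along these lines would
pull back all five revisions in page order. The collection and field names are
hypothetical, not TC's actual schema:)

    from pymongo import MongoClient

    # Hypothetical collection of fragment records as sketched earlier.
    fragments = MongoClient()["tc"]["fragments"]

    # Same entity node, different document nodes: the five revisions, in page order.
    revisions = fragments.find(
        {"entity.entity": "myWork", "entity.sentence": 1}
    ).sort("document.page", 1)

    for frag in revisions:
        print(frag["document"]["page"], frag["text"])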

In the Textual Communities architecture, entities are constructed very simply:
just supply an n attribute to a content element, thus: ... etc. This identifier
scheme also acts as a further (and more transparent) way of linking all the
nodes together, so we are not solely reliant on BSON ids. It also means that
data can be exported and put into any other system you fancy in a
straightforward way. In the TC implementation, we express all this in a URN
notation, declared as a 'det' scheme (for documents, entities and texts):

urn:det:usask/document=Hengwrt:folio=1r:linespace=1:entity=Canterbury Tales:part=GP:line=1

Here, we designate usask as a naming authority, following the Kahn-Wilensky
scheme (http://www.cnri.reston.va.us/k-w.html).

3. Michael says: "That such similar models have been developed independently by
different groups may suggest a certain generality and appeal." Absolutely. We
are all responding, in different ways, to our sense of the multiple valencies
of text.

4. I am not holding a particular brief for JSON as the way to do this. And the
reason I did not give a full example of the JSON implementation of this
architecture is that, frankly, I am not proud of it. We (originally, from 1998
on, myself, John Clark, Andrew West, Zeth Green, Federico Meschini, and later
Xioahan Zhang) started out developing this before we were quite clear how all
this worked, so there are a lot of legacy fumbles. You can see some examples of
the TC implementation at
https://www.academia.edu/43355876/Creating_and_implementing_an_ontology_of_documents_and_texts.
Because documents and entities are trees, it would make excellent sense to use
an XML database to hold those structures. But I could not, and still can't,
figure out how to make an XML database point into a pool of text fragments to
populate the branches of each tree with data. Please, someone.

5. The biggest flaw with TC at present is document/entity ingestion. We badly
need an editor which fully understands this architecture. We currently swallow
TEI/XML documents, and we assign nodes for the entity tree for each document
using simple XPath-like syntax, as above. However, we have to infer the
document structure as COCOA did, just by incrementing counters at milestone
elements (pb, cb and lb for pages, columns and line-spaces) and assuming a
basic document/page/column/line hierarchy. This is really not adequate. Please,
someone.
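
(Roughly what that inference amounts to, as a sketch only - standard-library
Python rather than TC's actual ingestion code, with the element handling made
up for the example:)

    import xml.etree.ElementTree as ET

    # Illustrative only: infer a basic page/column/line-space hierarchy from a
    # TEI transcription by incrementing counters at pb/cb/lb milestone elements.
    def document_nodes(tei_path):
        page = column = line = 0
        for el in ET.parse(tei_path).getroot().iter():
            tag = el.tag.rsplit("}", 1)[-1]   # strip any namespace prefix
            if tag == "pb":
                page, column, line = page + 1, 0, 0
            elif tag == "cb":
                column, line = column + 1, 0
            elif tag == "lb":
                line += 1
                yield {"page": page, "column": column, "linespace": line}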

6. I am aware of quite a few pieces of the puzzle which are not in place. Or
rather: TC made various assumptions about the text it is processing which need
to be turned into explicit and deniable rules. That is another discussion, if
anyone is interested.

Peter Robinson
University of Saskatchewan


--[2]------------------------------------------------------------------------
        Date: 2020-06-19 12:02:15+00:00
        From: Desmond Schmidt
        Subject: Re: [Humanist] 34.123: annotating notation

Hi Hugh,

You can validate your inputs to any program with simple application logic,
which, if properly written, can be mathematically perfect. You do not have to
do it in a schema. More than half the web already runs on JSON, which does not
normally use schema-based validation.
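
(For instance, a handful of explicit checks in the application can reject
malformed input without any schema machinery. A minimal sketch, with made-up
field names and limits:)

    import json

    # Minimal application-level validation of a JSON request body - no schema involved.
    def parse_annotation(raw):
        data = json.loads(raw)                      # rejects non-JSON outright
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        target = data.get("target")
        body = data.get("body")
        if not isinstance(target, str) or not target:
            raise ValueError("'target' must be a non-empty string")
        if not isinstance(body, str) or len(body) > 10_000:
            raise ValueError("'body' must be a string of reasonable length")
        return {"target": target, "body": body}     # keep only the fields we expect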

Schema validation can be a serious security weakness in complex web
applications. When I was working at the Information Security Institute at QUT,
we developed some highly effective denial-of-service attacks by sending a
server only a tiny amount of deeply nested XML, which invoked recursive schema
validation and took down a powerful web server. Check out our IEEE conference
paper on this topic here: https://eprints.qut.edu.au/35760/1/c35760.pdf

Desmond

        Date: 2020-06-18 17:53:13+00:00
        From: Hugh Cayless
        Subject: Re: [Humanist] 34.117: annotating notation

> However, for web applications at least I think schemas are not needed
> because invalid inputs can be dealt with successfully within the
> application.

This means to me that you prefer that constraints be discovered later on,
presumably in the form of bugs, and that those constraints be added (fixed) at
processing time rather than at creation time. Your preference is to pay later.

But simplicity is its own constraint, surely. It seems obvious that one can
achieve a complex representation using only simple techniques, as long as the
simple techniques are composable. But if you work without controls on your
inputs, I'd have thought that invites accidental complexity. I suspect you mean
to handle that by insisting on very limited formats as inputs, putting the
complexity (or intelligence) in the application layer rather than the data
layer. Applications are brittle things, however.

My own preference is to put "intelligence" in the data rather than in the code,
where possible, in part because I assume the data are going to last longer than
the application, but also because (as a programmer) I expect code (my own not
excluded) to be sloppy, broken, and riddled with errors. The structure of data
seems to me easier to (dare I say) validate. On the other hand, "data is code"
:-).

--
Dr Desmond Schmidt
Mobile: 0480147690

