Humanist Discussion Group, Vol. 34, No. 117.
Department of Digital Humanities, King's College London
Hosted by King's Digital Lab
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

[1] From: Peter Robinson
    Subject: Humanist 34.89 to 34.116, on texts, hierarchies, XML, JSON (143)

[2] From: Desmond Schmidt
    Subject: Re: [Humanist] 34.113: annotating notation (40)

--[1]------------------------------------------------------------------------
Date: 2020-06-16 14:55:08+00:00
From: Peter Robinson
Subject: Humanist 34.89 to 34.116, on texts, hierarchies, XML, JSON

This is in response to the discussion about models of text, JSON, XML, hierarchies, and so on, sparked by Michael Sperberg-McQueen's comments on the Oxford Concordance Program in 34.89 and my reply in 34.93, and since carried on in multiple further contributions: 34.100 and 34.106 (M. S-M), 34.105 (Desmond Schmidt), 34.108 (Herbert Wender on OHCO), 34.110 (Hugh Cayless, Desmond Schmidt, William Pascoe), 34.113 (Hugh Cayless and John Keating), and 34.116 (Willard McCarty, expressing some weariness).

First, a disclaimer of sorts. I am not especially interested in discussions about the merits of XML versus JSON and the like, and I sympathize with Willard's distaste for that cup of tea. My motivation in my original reply to Michael's comments was not to promote JSON, but to incite a discussion about hierarchies and texts. In particular, I wanted to draw people's attention to the model I have proposed for text in various publications. Here are the essential parts of this model:

1. The three key terms are 'document', 'text' and 'act of communication'. I argue, most accessibly in https://www.academia.edu/43355876/Creating_and_implementing_an_ontology_of_documents_and_texts, that 'a text is an act of communication inscribed in a document'. Accordingly, all texts have a double aspect: they are acts of communication, and they are present in documents. Each aspect may be represented as a tree, with each tree independent of the other.
A text may therefore be conceived as a collection of leaves, with each leaf present on both the document tree and the act of communication tree. A typical document tree might be: document --> quires --> pages --> columns --> line-like writing spaces; the act of communication tree might be: play --> acts --> scenes --> lines. We use the term 'entity' as shorthand for 'act of communication'.

2. We may conceive an architecture for these three collections as follows.

a. The document and act of communication ('entity') trees can each be represented as a collection of nodes linked in an OHCO tree, with each node holding:

   Name: the name of the node (div, page, etc.)
   Attributes: a list of property/value pairs associated with the node
   Ancestor: the ancestor node of this node. For the root node (a TEI, etc.) there is no ancestor
   Children: an ordered list of the children of this node. These children may be nodes in the same tree (quires as children of the root document, etc.) or they may be leaves from the third collection.

b. The leaves may be seen as a string pool, with each leaf as follows:

   Text: the text of this leaf, essentially a plain UTF-8 string
   Entity: the immediate ancestor node of this leaf in the act of communication tree
   Document: the immediate ancestor node of this leaf in the document tree

A few notes. The document and entity trees are completely independent of each other. They may have their leaves in different orders: a leaf of text may be an immediate sibling of another leaf in one tree (for example, two words on the same page) but far apart from it in the other tree (they might belong to different poems). The entity tree, in particular, could be translated directly to and from a TEI/XML document without any loss of information. No 'pcdata' is held in the document and entity trees; all pcdata is held in the 'string pool', with all leaves referenced as children of nodes within the document and entity trees.
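To make the two-trees-plus-string-pool idea concrete, here is a minimal sketch in Python dictionaries (not the JSON/MongoDB storage of Textual Communities itself; all node and leaf identifiers, and the sample text, are invented for illustration):

```python
# String pool: each leaf holds its text plus its immediate parent in
# each of the two trees (the Entity and Document fields above).
leaves = {
    "w1": {"text": "Two households,", "entity": "prologue", "document": "page1"},
    "w2": {"text": "both alike",      "entity": "prologue", "document": "page1"},
    "w3": {"text": "in dignity,",     "entity": "prologue", "document": "page2"},
}

# Document tree: the physical hierarchy (document --> pages --> leaves).
document_tree = {
    "doc":   {"name": "document", "attributes": {},         "ancestor": None,  "children": ["page1", "page2"]},
    "page1": {"name": "page",     "attributes": {"n": "1"}, "ancestor": "doc", "children": ["w1", "w2"]},
    "page2": {"name": "page",     "attributes": {"n": "2"}, "ancestor": "doc", "children": ["w3"]},
}

# Entity tree: the act-of-communication hierarchy (play --> prologue --> leaves).
entity_tree = {
    "play":     {"name": "play",     "attributes": {}, "ancestor": None,   "children": ["prologue"]},
    "prologue": {"name": "prologue", "attributes": {}, "ancestor": "play", "children": ["w1", "w2", "w3"]},
}

def extract(tree, node_id, pool):
    """Depth-first walk: gather the text of every leaf under node_id,
    in this tree's own order."""
    out = []
    for child in tree[node_id]["children"]:
        if child in pool:                 # child is a leaf in the string pool
            out.append(pool[child]["text"])
        else:                             # child is another node of this tree
            out.extend(extract(tree, child, pool))
    return out

# The text of one physical page, via the document tree:
print(" ".join(extract(document_tree, "page1", leaves)))
# → Two households, both alike

# The whole prologue, spanning the page break, via the entity tree:
print(" ".join(extract(entity_tree, "prologue", leaves)))
# → Two households, both alike in dignity,
```

Note how leaf w3 is an immediate sibling of w1 and w2 in the entity tree but sits on a different page in the document tree; a further hierarchy would simply be one more tree of this shape, plus an extra parent field on the leaves it contains.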
Additional hierarchies could be added simply by creating further tree-like node collections, with the same structure as the document and entity trees, and with leaves as children. You would also need to add an extra field for your new hierarchy to the leaves present in that hierarchy, and likely restructure and add some leaves. There are some straightforward examples of this architecture in https://www.academia.edu/43355876/Creating_and_implementing_an_ontology_of_documents_and_texts.

***********

The advantage of this structure is that it makes easy some aspects of text representation and handling. Many document implementations now allow representation of the text present in a line within a manuscript page, and of the whole page; this can be extracted very easily via the document tree. Alternatively, many systems (for example, collation tools) require the text of a speech or chapter to be extracted, independent of how that text spans line and page boundaries; this can be extracted very easily via the entity tree. Operations currently very difficult in TEI/XML -- e.g. representing an authorial document where the author rewrites a whole section of text, composed of multiple paragraphs, where both the original version AND the rewritten version span multiple (and distinct) sets of pages -- can be handled easily in this scheme.

Concerning implementation: famously, it seems, we chose to use JSON for all data within the system. This choice was not based on any inherent conviction as to the superiority of JSON. We first tried to make this work in a ground-up closed system, the Anastasia publishing system, itself modelled on the long-extinct DynaText (Anastasia is still working, in fact, at https://github.com/peterrobinson/Anastasia2). Then we tried to use an XML database (XML DB, now part of Oracle). For a long time we thought relational databases the way to go, as they explicitly support the joins etc. that our model requires.
It took us a long time to discover that SQL etc. just could not do what we wanted. We experimented with RDF and SPARQL before finally, around 2014, settling on JSON. The issue here is that we had to have a system that would run live: that is, one that would permit us to update all three collections simultaneously and in real-time as our editors changed the text, added and deleted documents, entities, etc. Readers can imagine that this is challenging. I am not ascribing any particular virtue to JSON here. It just happens to be exceptionally efficient and transparent (particularly in the MongoDB implementation) and very widely supported, and it allows you to write every part of the system -- back-end server, middleware, browser interface -- in a single language (Javascript), using a single presentation encoding (HTML with CSS etc.), with JSON holding all the data as it moves through the many parts of the system. We still use XML tools at many points of the process: for ingest of documents, for page-by-page transcription, and for output of apparatus and other files.

This description differs in many respects from the way Textual Communities (TC) is currently structured. This is mostly because I only figured out the model of 'a collection of leaves, with each leaf present in two or more independent trees' a long way along in the process. And there are many, many other ways in which TC is far from perfect. One might look more closely at eXist now: the documents and entities collections would be well handled by eXist.

Peter

**********

Acknowledgements

Among many: I have learnt much from Xiaohan Zhang, Federico Meschini, Zeth Green, David Parker, Troy Griffits, Steve DeRose, Peter Shillingsburg, Paul Eggert, Hans-Walter Gabler, Jerome McGann, Barbara Bordalejo, Prue Shaw, Klaus Wachtel, Susan Hockey, Ian Lancashire and Lou Burnard on the way to whatever I think I now know. A special thanks to Michael Sperberg-McQueen and Desmond Schmidt for their correspondence over the past few days.
It could be said that I have spent thirty years ignoring what they say (and why should I change now?). But anyone familiar with the TEI will see how much it has infused everything I do, and I owe Michael and Lou Burnard more than I could ever explain.

Reading:

https://www.academia.edu/3233227/Towards_a_Theory_of_Digital_Editions (2012) explains the double aspect of text;
https://www.academia.edu/9575974/The_Concept_of_the_Work_in_the_Digital_Age_published_version_ (2013) argues for a concept of 'work' aligned with 'act of communication';
http://www.digitalhumanities.org/dhq/vol/11/2/000293/000293.html (2017) sets out the text/entity/document ontology outlined here.

See too https://textualcommunities.org/app/ and https://wiki.usask.ca/display/TC/Textual+Communities.

--[2]------------------------------------------------------------------------
Date: 2020-06-16 09:29:56+00:00
From: Desmond Schmidt
Subject: Re: [Humanist] 34.113: annotating notation

Hugh,

I don't necessarily disagree with what you say below. However, for web applications at least, I think schemas are not needed, because invalid inputs can be dealt with successfully within the application. The only thing I feel uneasy about is when you say: "I would argue that schemas allow you to sustain a higher level of complexity than you would be able to otherwise." This suggests to me that schemas may be used as a justification for introducing greater complexity than is really needed. I would suggest instead that a divide-and-conquer approach -- for example, separating metadata, annotation, or variation from the text being encoded -- may result in a simpler overall data model. Since you have been working with external annotation recently (Recogito), perhaps you can see the advantage of this type of approach. There is only so much complexity that people can take, or upon which a successful user interface can be built. If we believe William of Occam: "Entities are not to be multiplied beyond necessity."
This sounds a bit like an admonition to the TEI.

Desmond

On 6/16/20, Humanist wrote:
> I can think of few things more tiresome than reprising the more-or-less
> annual arguments over the goodness or evil of particular file formats. But
> maybe there's productive discussion to be had around the ideas of
> simplicity and correctness. Desmond Schmidt and (iirc) Peter Robinson both
> expressed a desire for simpler formats, by which I think they mean in part
> "schema-less" formats. A schema gives you things like (broad) consistency
> checking, some datatype checking, and it can be used as an aid to the user
> encoding a document by giving them suggestions. I would argue that schemas
> allow you to sustain a higher level of complexity than you would be able to
> otherwise. But are these benefits worth the additional upfront cost?

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php
Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)