
Humanist Discussion Group


Humanist Archives: June 18, 2020, 6:27 a.m. Humanist 34.121 - annotating notation

                  Humanist Discussion Group, Vol. 34, No. 121.
            Department of Digital Humanities, King's College London
                   Hosted by King's Digital Lab
                Submit to: humanist@dhhumanist.org

    [1]    From: C. M. Sperberg-McQueen 
           Subject: on JSON and documents and generalizations and concrete examples (26)

    [2]    From: Desmond Schmidt 
           Subject: Re: [Humanist] 34.117: annotating notation (67)

        Date: 2020-06-17 14:34:14+00:00
        From: C. M. Sperberg-McQueen 
        Subject: on JSON and documents and generalizations and concrete examples

I thank Peter Robinson for his informative description in Humanist
34.117 of his proposal to model texts as collections of leaves shared
among multiple trees.  (Although it must be said that from a botanical
viewpoint that is one spectacularly bewildering metaphor.)

Like PR and others, I would gladly avoid an XML vs JSON flame war.
But my request to see a JSON representation of a real document with
multiple structures was seriously meant, and I am disappointed that PR
has shown us no concrete examples of the JSON used by his Textual
Communities system to read or write such documents. Concrete examples
would allow his general statements about efficiency and transparency
to be usefully understood and discussed; without examples, they are
about as informative as MongoDB's advertising copy.

I hope that Humanist has not become too dignified and high-minded to
be dirtied with practicalities and technical details.

C. M. Sperberg-McQueen
Black Mesa Technologies LLC

        Date: 2020-06-17 09:00:00+00:00
        From: Desmond Schmidt 
        Subject: Re: [Humanist] 34.117: annotating notation


I think it might help people with this long posting if I ventured to
provide an executive summary below. Tell me if I'm wrong in any
essential details.
TC (Textual Communities) consists of three collections of things:
1. text fragments
2. document tree nodes
3. act-of-communication tree nodes

The nodes in the trees correspond roughly to the elements of an XML
tree. The things in the text collection are just fragments of Unicode
text. Each tree has internal nodes and leaf-nodes which point to the
text fragments. Not all the text fragments need be in each tree.
Further trees representing other hierarchies may also be added.

With this arrangement you can easily find out where in a
quire/page/column or line a piece of text resides, and also in which
part of the act-of-communication tree it belongs: say to a particular
speech/scene/act in a play. The key is the collection of text
fragments which in my mind form a kind of spine to which the two trees
attach themselves.
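The model just summarized can be sketched in a few lines of code. To be clear, this is only an illustration under my own assumptions: the collection layout, the IDs, and the `locate` helper are all hypothetical, since we have not seen Textual Communities' actual schema.

```python
# Hypothetical sketch of the three collections described above.
# All names and structures are illustrative, not TC's actual schema.

# 1. Text fragments: the shared "spine" of plain Unicode strings.
fragments = {
    "f1": "To be, ",
    "f2": "or not to be",
}

# 2. Document tree: physical structure (e.g. page/line).
#    Leaf nodes reference fragments by ID.
doc_tree = {
    "page1": {
        "line1": ["f1", "f2"],
    },
}

# 3. Act-of-communication tree: logical structure (e.g. act/scene/speech),
#    pointing at the very same fragments.
comm_tree = {
    "act3": {
        "scene1": {
            "speech_hamlet": ["f1", "f2"],
        },
    },
}

def locate(frag_id, tree, path=()):
    """Return every path in `tree` whose leaf list contains frag_id."""
    hits = []
    for key, value in tree.items():
        if isinstance(value, dict):
            hits.extend(locate(frag_id, value, path + (key,)))
        elif frag_id in value:          # value is a list of fragment IDs
            hits.append(path + (key,))
    return hits

print(locate("f2", doc_tree))   # [('page1', 'line1')]
print(locate("f2", comm_tree))  # [('act3', 'scene1', 'speech_hamlet')]
```

The same fragment can thus be located in either hierarchy without duplicating the text, which is the point of the spine.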

The argument about XML vs. JSON here seems quite irrelevant to me: the
user will be entirely unaware that MongoDB stores the 'things' as JSON
(actually BSON) documents internally: one document per node. You could
equally store the same structure in an XML database like MarkLogic,
eXist-db or BaseX with no discernible difference in performance.

One concern with this design is that there is no distinction in the
text collection between fragments potentially belonging to different
versions. Their identity is provided solely by the two trees. You
argue that you could represent a heavily revised text in this system,
but I wonder if you have actually tried. How would you get a text
revised, say, 9 times into this structure, in any practical way, even
if you could theoretically represent it? There is a good short example
in my 'Tough Cases' No. 2 (http://charles-harpur.org/tough-cases/).
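To make the worry concrete, here is a toy sketch of how revisions might be layered onto the shared spine: each revision becomes a further tree over the fragment collection, so the identity of a reading lives entirely in the trees, never in the fragments themselves. The layout is invented for illustration.

```python
# Hypothetical sketch: each revision is its own tree over the shared
# fragment spine. A text revised 9 times would need 9 or 10 such
# overlapping trees. Names are illustrative only.

fragments = {
    "f1": "The quick ",
    "f2": "brown ",    # original reading
    "f2b": "red ",     # revised reading
    "f3": "fox",
}

version_trees = {
    "v1": {"line1": ["f1", "f2", "f3"]},   # original
    "v2": {"line1": ["f1", "f2b", "f3"]},  # revision swaps one fragment
}

def reading(version):
    """Reconstruct the text of one version from the shared fragments."""
    return "".join(fragments[fid] for fid in version_trees[version]["line1"])

print(reading("v1"))  # The quick brown fox
print(reading("v2"))  # The quick red fox
```

Notice that nothing in the fragment collection itself says which fragments belong to which version; that is exactly the concern raised above.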

Another concern is the load this puts on the MongoDB database. You
need one BSON document per node. If you stored a few thousand
documents in this system you would end up with several million nodes.
Not impossible to handle, of course, but each document is connected to
the others only by its ID. If your program failed at some point or
contained an error you could easily end up with thousands of orphaned
nodes floating around inside the database, which you would have to
purge periodically, assuming you could do so safely.
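An orphan purge of the kind just described might look something like the following. The field names and the node-per-document layout are my own invention for illustration; the real store would presumably differ.

```python
# Illustrative orphan check for a node-per-document store.
# Each "document" is a dict with an _id and a parent pointer;
# the schema is hypothetical, not MongoDB's or TC's actual layout.

nodes = [
    {"_id": "root", "parent": None},
    {"_id": "n1", "parent": "root"},
    {"_id": "n2", "parent": "n1"},
    {"_id": "n3", "parent": "missing"},  # parent was deleted: orphan
]

def find_orphans(nodes):
    """IDs whose chain of parent pointers never reaches a root (parent None)."""
    by_id = {n["_id"]: n for n in nodes}

    def reaches_root(nid, seen=()):
        if nid in seen or nid not in by_id:
            return False  # broken link or cycle
        parent = by_id[nid]["parent"]
        return True if parent is None else reaches_root(parent, seen + (nid,))

    return [n["_id"] for n in nodes if not reaches_root(n["_id"])]

print(find_orphans(nodes))  # ['n3']
```

The catch, of course, is deciding whether an unreachable node is garbage or a tree still being written, which is why purging safely is not trivial.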

Another worry is that you are storing documents in a fragmented and
unreadable state. How do you export the data for archiving in a
coherent form? As an example of potential data loss when using a
database of this kind: we imported some images of Harpur's poems in
rare newspapers and then lost or deleted the originals. Later, when we
changed our data design, we had to get them out again and discovered
that images with spaces in their file names could not be extracted, due
to a bug in MongoDB. So we had to remake them all. :-(
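On the export question specifically, one can imagine walking a tree depth-first and concatenating the fragments its leaves point to, yielding a readable linear text. The sketch below is self-contained and purely hypothetical; whether TC offers anything like it is exactly what we do not know.

```python
# Sketch of one possible export path: flatten a tree of nested dicts
# and leaf lists back into plain text. Structure and names are
# illustrative, not TC's actual export format.

fragments = {"f1": "To be, ", "f2": "or not to be"}

doc_tree = {"page1": {"line1": ["f1", "f2"]}}

def serialize(tree):
    """Depth-first walk emitting fragments in document order."""
    out = []
    for value in tree.values():
        if isinstance(value, dict):
            out.append(serialize(value))
        else:  # a list of fragment IDs
            out.append("".join(fragments[fid] for fid in value))
    return "".join(out)

print(serialize(doc_tree))  # To be, or not to be
```

Even granting something like this, the archival question remains: the export is only coherent with respect to one chosen tree at a time.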


Dr Desmond Schmidt
Mobile: 0480147690

Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php

Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)

This site is maintained under a service level agreement by King's Digital Lab.