Humanist Discussion Group

Humanist Archives: March 2, 2019, 7:46 a.m. Humanist 32.498 - standoff markup

                  Humanist Discussion Group, Vol. 32, No. 498.
            Department of Digital Humanities, King's College London
                   Hosted by King's Digital Lab
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org




        Date: 2019-03-01 07:33:09+00:00
        From: Iian Neill 
        Subject: Re: [Humanist] 32.423: the McGann-Renear debate

Hi Desmond,

> I am advocating a much
> simpler format for text close to plain text that can be easily mined
> for information, that contains only rendering or abstract rendering
> information. Deeply structured texts as once provided by XML don't fit
> the bill because they mix up the rendering with the semantics and use
> too rigid a document structure that invites overlapping hierarchies on
> reuse.


A lot of the recent discussion on annotation storage here has centred
around files. An alternative is to store the text and its standoff
properties (to take one example) in a database. If the standoff properties
are stored as discrete records, they can be queried against one another,
across texts, and across any entities connected to those property records.
For example, in our exploratory project 'Codex', editions are represented
by plain text and standoff property nodes stored in the Neo4j graph
database. A web-based standoff property 'text editor' allows the user to
freely add or remove text without breaking any property indexes. Through
various modal search windows, users can also search for entities in the
database, or create them on an ad hoc basis within the text editing session.
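
As a rough illustration of editing text without breaking property indexes,
here is a minimal Python sketch. It is not the Codex code, and it keeps
everything in memory rather than in Neo4j; the names StandoffProperty and
insert_text, the half-open character ranges, and the sample sentence are
all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class StandoffProperty:
    """Hypothetical annotation record: a type, a value, and a character range."""
    type: str
    value: str
    start: int  # offset of the first annotated character
    end: int    # offset one past the last annotated character (exclusive)


def insert_text(text, props, at, new):
    """Insert `new` into `text` at offset `at` and shift the property ranges."""
    n = len(new)
    for p in props:
        if p.start >= at:
            # Property begins at or after the insertion point: move it along.
            p.start += n
            p.end += n
        elif p.end > at:
            # Insertion falls inside the property: the property grows.
            p.end += n
    return text[:at] + new + text[at:]


# Illustrative data only.
text = "Michelangelo wrote to Vasari."
props = [
    StandoffProperty("person", "Michelangelo Buonarroti", 0, 12),
    StandoffProperty("person", "Giorgio Vasari", 22, 28),
]

text = insert_text(text, props, 13, "often ")
for p in props:
    print(p.value, "->", text[p.start:p.end])
```

Because the text itself carries no markup, an edit only has to adjust
offsets on the property records; deletion would be handled the same way in
reverse.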

My point is that if you represent texts and standoff properties in a graph
database, you are free to carry out higher-level analysis of texts,
leveraging database relationships and indexes. For example, in our digital
edition of the letters of Michelangelo (English translations), persons,
places, events, and concepts are annotated, and NLP analysis is also
performed on all texts. The part-of-speech tokens (and morphologies) it
generates are themselves represented as annotations, and can therefore be
queried in conjunction with both the plain text and the other annotations.
In this way, one can conceive of a PoS layer or stratum sitting beneath the
text and other annotations. Using the database, one can quickly interrogate
all texts in the corpus for whatever annotational overlap one desires.
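
To make 'annotational overlap' concrete, here is a minimal Python sketch of
such a query over annotation records from several texts. It is only an
in-memory stand-in for what the database does with relationships and
indexes; the Annotation fields, the layer names, the function names, and
the sample records are assumptions made for this example, not the Codex
schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record: one annotation on one text, in a named layer."""
    text_id: str
    layer: str   # e.g. "entity" or "pos"
    value: str
    start: int
    end: int     # exclusive


def overlaps(a, b):
    """True if two annotations share at least one character of the same text."""
    return a.text_id == b.text_id and a.start < b.end and b.start < a.end


def find_overlaps(annotations, layer_a, value_a, layer_b, value_b):
    """Return every (a, b) pair from the two selections whose ranges overlap."""
    left = [a for a in annotations if a.layer == layer_a and a.value == value_a]
    right = [b for b in annotations if b.layer == layer_b and b.value == value_b]
    return [(a, b) for a in left for b in right if overlaps(a, b)]


# Illustrative records only; offsets and labels are invented.
corpus = [
    Annotation("letter-1", "entity", "person", 0, 12),
    Annotation("letter-1", "pos", "PROPN", 0, 12),
    Annotation("letter-1", "pos", "VERB", 13, 18),
    Annotation("letter-2", "entity", "person", 5, 11),
    Annotation("letter-2", "pos", "NOUN", 5, 11),
]

# Which person annotations coincide with a proper-noun token?
hits = find_overlaps(corpus, "entity", "person", "pos", "PROPN")
for entity, token in hits:
    print(entity.text_id, entity.start, entity.end)
```

This sketch compares every pair of selected records, which is fine for a
handful of annotations; the point of keeping the records in a database is
that indexes make the same question cheap across a whole corpus.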

If the decision had been taken to represent these 'standoff property texts'
as files, it would have been impractical to run queries quickly over
hundreds of files looking for overlaps.

Regards,
Iian




Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)

This site is maintained under a service level agreement by King's Digital Lab.