
Humanist Discussion Group



Humanist Archives: March 7, 2019, 7:23 a.m. Humanist 32.520 - standoff markup & the illusion of 'plain text'

                  Humanist Discussion Group, Vol. 32, No. 520.
            Department of Digital Humanities, King's College London
                   Hosted by King's Digital Lab
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org


    [1]    From: Dr. Herbert Wender 
           Subject: Re: [Humanist] 32.516: standoff markup & the illusion of 'plain text' (71)

    [2]    From: Jan Christoph Meister 
           Subject: Re: [Humanist] 32.505: standoff markup & the illusion of 'plain text' (137)


--[1]------------------------------------------------------------------------
        Date: 2019-03-06 22:01:52+00:00
        From: Dr. Herbert Wender 
        Subject: Re: [Humanist] 32.516: standoff markup & the illusion of 'plain text'

[NB In the following, angle-brackets have been replaced by brace-brackets 
to circumvent a current problem in Humanist's software. Apologies to 
all. --WM]


Raffaele,

it's one of my hobbies to look at the source code of digitally distributed
editions and to ask what information can be derived from them beyond the
extractions foreseen by the distributors. In the past I've worked with a
WordCruncher-based edition of Robert Musil's manuscripts, with the
SGML-conformant encoding of the Weimarer Ausgabe of Goethe's works
(Chadwyck-Healey, with the famous question at the end of each session:
"Wollen Sie wirklich Goethes Werke beenden?", that is, "Do you really want to
quit Goethe's works?"), and with the facsimile edition of Kafka's "Process"
distributed as PDF files by Stroemfeld, facsimile and transcript side by side,
as in one of the views of comparable XML/TEI manuscript editions. No wonder
that your posting motivated me to look at a page of Notebook A in the
"Frankenstein" archive. To be precise: it was enough to look at folio "1r" to
think that it would often be better to say nothing about modifications and
simply to show them instead (as, for example, the Kafka editors do in their
transcriptions) than to tell only half of the truth.

Perhaps I don't understand what's going on there; you are the expert. But I
think it could be of interest what a non-expert takes away from the comparison
between text and encoding. (As a reminder: I feel like a guest among
professionals here, because I am visually impaired and therefore cannot read
handwritten texts without many mistakes; for that reason I can't refer to the
facsimile page.)

If MOD means 'modification', the first one appears after "often" (see the
snippet below) and embraces two deletions and one addition in sequence.
Perhaps somewhere else in your representation system there are linearized
representations of the two states: the status quo ante,
"Those events ... are often caused by slight or trivial occurences.",
and the final state, "Those events ... often derive their origin from a
trivial occurrence."

I would think that a more 'logical' description would treat the modification
as a substitution, from "are ... caused by" to "derive their origin from"; and
since another hand is recorded in the ADD element, I would suppose that the
same hand also enacted the deletions, wouldn't you?

Why does the deletion of "are" stand outside the MOD, if not as a consequence
of the monohierarchical model?

Why is the revising hand not attributed to the MOD element?

How do you get from this physically oriented encoding (which splits a coherent
deletion because there are two pen strokes) to a logical encoding that could
yield, for example, a Zeller-like matrix representation of the textual
development?

All the best,
Herbert


[snip]

{line}Those events which materially influence our fu{/line}
    {line}ture destinies {del rend="strikethrough"}are{/del} often {mod}
        {del rend="strikethrough"}caused{/del}
        {del rend="strikethrough"}by slight or{/del}
        {add hand="#pbs" place="superlinear"}derive thier origin from a{/add}
      {/mod} tri{/line}
    {line}vial occurence{del rend="strikethrough"}s{/del}.
...

    {/line}
[/snip]
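
Purely as an illustration of the two linearized states described above, here
is a minimal Python sketch; it assumes a flattened token list for the encoded
line and is not part of the Frankenstein archive's own encoding or tooling:

# Each fragment carries a status: "base" text survives in both states,
# "del" belongs only to the state before revision, "add" only to the revised
# one. Spelling is kept as transcribed ("thier", "occurence").
tokens = [
    ("Those events which materially influence our future destinies ", "base"),
    ("are ", "del"),
    ("often ", "base"),
    ("caused ", "del"),
    ("by slight or ", "del"),
    ("derive thier origin from a ", "add"),
    ("trivial occurence", "base"),
    ("s", "del"),
    (".", "base"),
]

def linearize(tokens, state):
    """Return the reading before revision ('ante') or after it ('post')."""
    keep = {"base", "del"} if state == "ante" else {"base", "add"}
    return "".join(text for text, status in tokens if status in keep)

print(linearize(tokens, "ante"))  # ... destinies are often caused by slight or trivial occurences.
print(linearize(tokens, "post"))  # ... destinies often derive thier origin from a trivial occurence.

# The 'logical' reading Herbert asks about: the three physical operations
# (two deletions, one addition) collapsed into a single hypothetical
# substitution record, attributed to the revising hand named on the ADD element.
substitution = {"from": "are ... caused by slight or",
                "to": "derive thier origin from a",
                "hand": "#pbs"}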





--[2]------------------------------------------------------------------------
        Date: 2019-03-06 10:43:23+00:00
        From: Jan Christoph Meister 
        Subject: Re: [Humanist] 32.505: standoff markup & the illusion of 'plain text'

I must say I feel as undead as Wendell does: in CATMA we've been
combining "raw text" and external stand-off markup in a (graph) database
(originally relational, now Neo4j) since 2012... It's a web application,
and the user need not read code or XML, or be a DB expert, to annotate
and analyze their texts and corpora, either individually (1 text/corpus,
1 user) or collaboratively (n texts, n users).

And there's more to this approach than merely being able to handle
nested / overlapping / discontinuous structures (which, imho, is really a
problem of the past, as is the Renear-McGann debate, which, as far as I'm
concerned, Buzzetti's 2002 "Digital Representation and the Text Model", New
Literary History, Vol. 33, No. 1, had already pretty much superseded).

The conceptual gain of abstracting from the text = file model in this
way, I believe, is that

   * markup can be n-dimensional: any 'source' string or fragment thereof
     can be attributed n properties (p) by n annotators (a) during n
     mark-up sessions (s)
   * there is no inherent restriction in terms of choice of property
     _values_: p2 assigned by a77 in s45 can contradict, duplicate,
     expand etc. p1 assigned by u14 in s46. In other words, you can of
     course enforce inter-annotator agreement by stipulating procedures
     and conventions, but our markup concept per se is not only oblivious
     to this but expressly based on the principle that every annotation
     is a unique instance in its own semantic and functional right (a
     sketch follows after this list)
   * source text and markup are no longer conceptualized as two distinct
     types of entities, i.e. "text" and "meta-text = annotation", but are
     modeled as a discursive continuum in which the roles of 'source' and
     'annotation' are functionally defined and can change at any time.
     This continuum can then
   * be queried at any level of complexity and in any combination (e.g.,
     "Show me all instances of discontinuous strings across the works of
     Dante where more than 3 out of 5 annotators assigned conflicting
     property values within the same property category AND where the
     string was not automatically POS-tagged as verb or auxiliary AND
     where the median Z-score for the preceding 2 nouns was greater than
     0.00034 AND where at least one annotator expressed a positive
     sentiment in a free-text commentary = meta-annotation" [don't ask me
     for the use case though...])
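
To make the idea concrete, here is a minimal sketch in Python; it is not
CATMA's actual data model, API, or query language, and all identifiers and
values are invented. Standoff annotations are independent records over an
immutable source string, so any span can accumulate any number of property
assignments by any number of annotators across any number of sessions,
contradictions included.

from dataclasses import dataclass

@dataclass
class Annotation:
    start: int       # character offset into the source string
    end: int
    prop: str        # property category
    value: str       # assigned value; nothing enforces agreement
    annotator: str   # e.g. "a77"
    session: str     # e.g. "s45"

source = "Those events which materially influence our future destinies ..."

annotations = [
    Annotation(0, 12, "narration", "diegetic",      "a77", "s45"),
    Annotation(0, 12, "narration", "extradiegetic", "u14", "s46"),  # contradicts a77
    Annotation(6, 12, "pos",       "noun",          "tagger01", "s01"),
]

def assignments(prop):
    """All assignments for one property category, conflicts included."""
    return [a for a in annotations if a.prop == prop]

for a in assignments("narration"):
    print(source[a.start:a.end], a.value, a.annotator, a.session)
# Those events diegetic a77 s45
# Those events extradiegetic u14 s46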

And that's still not all. The true beauty is versioning: in CATMA 6.0 we
use the Web Annotation Data Model's JSON-LD format in a Git/GitLab
environment. This means that every user generates annotations in their
own Git repository (and versions thereof), and GitLab then manages the
data exchange (fetch, merge, push) between users. Query operations are
executed against an in-memory graph representation of the data.
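
For concreteness, a single annotation in that format might look roughly like
the following sketch, written here as a Python dict. The vocabulary (@context,
TextualBody, TextPositionSelector) is the W3C Web Annotation one; the
identifiers and values are invented, and CATMA 6.0's own serialization may
differ in detail.

import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "id": "urn:example:annotation/1",        # invented identifier
    "type": "Annotation",
    "body": {
        "type": "TextualBody",
        "purpose": "tagging",
        "value": "revision",                 # invented tag value
    },
    "target": {
        "source": "urn:example:source-text", # invented source identifier
        "selector": {
            "type": "TextPositionSelector",  # character offsets into the source
            "start": 0,
            "end": 61,
        },
    },
}

print(json.dumps(annotation, indent=2))

Stored as files in each annotator's own Git repository, records like this are
what fetch, merge and push then move between users.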

All the above is just my brief summary of our developer Marco's much
more detailed comment on the technical aspects. These apart, the crux of
the matter seems to be encapsulated in Desmond's observation

Why should we seek to make a rough approximation of a manuscript page
that can be precisely photographed (not without loss of information of 
course) but still vastly inferior to the page-facsimile image that 
already captures the spatial relationships between fragments of text?

Agreed. There's a fundamental conceptual barrier between an
analogue/spatial model of text as a two-dimensional physical continuum
that extends across pages, but is at the same time an n-dimensional
historic and semantic phenomenon, and the discrete/digital
representation of text as a computable character string. The two offer
different epistemological advantages, so it's really a philosophical
choice, not a technological one.

Chris

------------------------
Dr. Jan Christoph Meister
Universitätsprofessor für Digital Humanities
- Schwerpunkt Deutsche Literatur und Textanalyse -
Universität Hamburg, Institut für Germanistik
Überseering 35 / Raum 08064
22 297 Hamburg
+49  40 42838 2972
+49 172 40865 41
http://jcmeister.de
http://catma.de


Am 04.03.2019 um 09:58 schrieb Humanist:
>                    Humanist Discussion Group, Vol. 32, No. 505.
>              Department of Digital Humanities, King's College London
>                     Hosted by King's Digital Lab
>                         www.dhhumanist.org
>                  Submit to: humanist@dhhumanist.org
>
>
>          Date: 2019-03-03 18:12:22+00:00
>          From: Wendell Piez
>          Subject: Re: [Humanist] 32.499: standoff markup & the illusion of 'plain text'
>
> Dear Willard,
>
> Goodness, now we are recommending text bases instantiated as a graph
> model: both of these sound a lot like Luminescent's internal model, or
> for that matter the experimental system CMSMcQ mentioned way back in
> the dawn of this thread:
>
> Haentjens Dekker, Ronald, and David J. Birnbaum. “It's more than just
> overlap: Text As Graph.” Presented at Balisage: The Markup Conference
> 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage:
> The Markup Conference 2017. Balisage Series on Markup Technologies,
> vol. 19 (2017). https://doi.org/10.4242/BalisageVol19.Dekker01.
>
> Or for old timers: https://github.com/wendellpiez/Luminescent (I'm
> not dead yet, I think I'll go for a walk!)
>
> Yet I fail to see why any of this once and future promising work
> invalidates XML in any way, whether XML is viewed as some sort of
> arguable abstraction, or a practical technology that none of us (even
> experts) can see in its entirety?
>
> Regards, Wendell
>
>
>
>
> --
> Wendell Piez | wendellpiez.com | wendell -at- nist -dot- gov
> pellucidliterature.org | github.com/wendellpiez |
> gitlab.coko.foundation/wendell  - pausepress.org






_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php

