Humanist Discussion Group


Humanist Archives: March 3, 2019, 7:47 a.m. Humanist 32.499 - standoff markup & the illusion of 'plain text'

                  Humanist Discussion Group, Vol. 32, No. 499.
            Department of Digital Humanities, King's College London
                   Hosted by King's Digital Lab
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org


    [1]    From: Desmond Schmidt
           Subject: Re: [Humanist] 32.498: standoff markup (89)

    [2]    From: patrick@durusau.net
           Subject: "Plain text" as illusion, was Re: [Humanist] 32.498: standoff markup (79)


--[1]------------------------------------------------------------------------
        Date: 2019-03-03 01:56:51+00:00
        From: Desmond Schmidt
        Subject: Re: [Humanist] 32.498: standoff markup

I have seen Iian Neill's editor in action. It works well and is quite
impressive. It shows what can be done directly in the browser using
standoff markup. A frequent complaint against standoff is that editing
the text invalidates the pointers to the text from the markup held
outside. However, there is no intrinsic reason why that should be so,
although it may be difficult to achieve in XML. He gets around it by
using simplified HTML as the editing medium.
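
As a minimal sketch of why editing need not invalidate the pointers: if
every insertion or deletion also shifts the offsets held in the external
markup, the markup stays attached to the same words. The names below
(Property, StandoffText, insert, delete, covered) are hypothetical and
are not taken from either editor mentioned here. Offsets are assumed to
be half-open character ranges.

```python
class Property:
    """A standoff annotation: a half-open range [start, end) plus a label."""

    def __init__(self, start, end, label):
        self.start = start
        self.end = end
        self.label = label


class StandoffText:
    """Plain text plus externally held properties that follow every edit."""

    def __init__(self, text, properties):
        self.text = text
        self.properties = list(properties)

    def insert(self, pos, s):
        """Insert s at pos and shift every pointer that lies after it."""
        self.text = self.text[:pos] + s + self.text[pos:]
        n = len(s)
        for p in self.properties:
            if p.start >= pos:
                p.start += n
            if p.end > pos:
                p.end += n

    def delete(self, pos, n):
        """Delete n characters from pos; pointers inside the cut collapse to pos."""
        cut_end = pos + n
        self.text = self.text[:pos] + self.text[cut_end:]
        for p in self.properties:
            p.start = self._shift(p.start, pos, cut_end)
            p.end = self._shift(p.end, pos, cut_end)
        # Drop properties whose whole range was deleted.
        self.properties = [p for p in self.properties if p.end > p.start]

    @staticmethod
    def _shift(offset, pos, cut_end):
        if offset <= pos:
            return offset
        if offset >= cut_end:
            return offset - (cut_end - pos)
        return pos

    def covered(self, label):
        """Return the text currently covered by each property with this label."""
        return [self.text[p.start:p.end]
                for p in self.properties if p.label == label]


doc = StandoffText("Dear Michelangelo, greetings.", [Property(5, 17, "person")])
doc.insert(0, "My ")
doc.delete(3, 5)              # remove "Dear "
print(doc.text)               # My Michelangelo, greetings.
print(doc.covered("person"))  # ['Michelangelo']
```

The policy for edits that touch a property boundary (here, text inserted
exactly at the start of a range is placed before it) is one choice among
several; a real editor would have to decide this per property type.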

This is similar to the approach we use in our editor for Harpur. When
representing the text in an HTML editor, all you need are two elements:
p and span. Add a "class" attribute and you can style them any way
you like. If you follow the proposition I made earlier in the
McGann-Renear debate, semantic markup and document metadata can be
held externally, variants go into versions and layers, and what's left
can easily be stored in files as highly interoperable HTML. There is
practically no document we cannot represent this way. Within the
computer's memory, as Iian suggests, separating markup from the
underlying text has numerous advantages. I think it is time we started
sharing texts and tools for real and this approach is one step toward
making that possible.
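
To illustrate the two-element idea, here is a rough sketch that turns
plain text plus external ranges into HTML using only p and span with a
"class" attribute. The function name render_html and the (start, end,
class) tuple layout are assumptions made for this example, not the
format of any existing tool. It also assumes that ranges do not overlap
and do not cross a paragraph break, and that paragraphs are separated by
a blank line.

```python
from html import escape


def render_html(text, spans):
    """Render text plus (start, end, css_class) ranges as <p>/<span> HTML."""
    out = []
    offset = 0
    for para in text.split("\n\n"):
        para_start, para_end = offset, offset + len(para)
        pieces = []
        cursor = para_start
        for start, end, css_class in sorted(spans):
            if start < para_start or end > para_end:
                continue  # this range belongs to another paragraph
            pieces.append(escape(text[cursor:start]))
            pieces.append('<span class="%s">%s</span>'
                          % (escape(css_class), escape(text[start:end])))
            cursor = end
        pieces.append(escape(text[cursor:para_end]))
        out.append("<p>" + "".join(pieces) + "</p>")
        offset = para_end + 2  # skip the blank-line separator
    return "\n".join(out)


sample = "A first line.\n\nA second & last line."
print(render_html(sample, [(2, 7, "emph"), (17, 23, "name")]))
# <p>A <span class="emph">first</span> line.</p>
# <p>A <span class="name">second</span> &amp; last line.</p>
```

What a class such as "emph" or "name" means is left to a stylesheet or
to the externally held semantic markup, which is the point of keeping
the HTML this small.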

Desmond Schmidt
eResearch
Queensland University of Technology

On 3/2/19, Humanist wrote:
>                   Humanist Discussion Group, Vol. 32, No. 498.
>
>         Date: 2019-03-01 07:33:09+00:00
>         From: Iian Neill 
>         Subject: Re: [Humanist] 32.423: the McGann-Renear debate
>
> Hi Desmond,
>
>> I am advocating a much
>> simpler format for text close to plain text that can be easily mined
>> for information, that contains only rendering or abstract rendering
>> information. Deeply structured texts as once provided by XML don't fit
>> the bill because they mix up the rendering with the semantics and use
>> too rigid a document structure that invites overlapping hierarchies on
>> reuse.
>
>
> A lot of the recent discussion on annotation storage here has centred
> around files. An alternative method is to store the text and its standoff
> properties (to take one example) in a database. If the standoff properties
> are stored as discrete records they can be queried within each other,
> across texts, and across any entities connected to those property records.
> For example, in our exploratory project 'Codex', editions are represented
> by plain text and standoff property nodes stored in the Neo4j graph
> database. A web-based standoff property 'text editor' allows the user to
> freely add or remove text without breaking any property indexes. Through
> various modal search windows, users can also search for entities in the
> database, or create them on an ad hoc basis within the text editing
> session.
>
> My point is that if you represent texts and standoff properties in a graph
> database, you are freed up to carry out higher level analysis of texts,
> leveraging database relationships and indexes. For example, in our digital
> edition of the letters of Michelangelo (English translations), alongside
> persons, places, events, and concepts annotated, NLP analysis is also
> performed on all texts, and the Parts of Speech tokens (and morphologies)
> generated are themselves represented as annotations, and can therefore be
> queried in conjunction with both the plain text and other annotations. In
> this way, one can conceive of a PoS layer or stratum sitting beneath the
> text and other annotations. Using the database, one can quickly interrogate
> all texts in the corpus for whatever annotational overlap you desire.
>
> If the decision had been taken to represent these 'standoff property texts'
> as files, it would have been impractical to run queries quickly over
> hundreds of files looking for overlaps.
>
> Regards,
> Iian



--
Dr Desmond Schmidt
Mobile: 0481915868 Work: +61-7-31384036



--[2]------------------------------------------------------------------------
        Date: 2019-03-02 15:41:25+00:00
        From: patrick@durusau.net
        Subject: "Plain text" as illusion, was Re: [Humanist] 32.498: standoff markup

Iian,

On 3/2/19 2:46 AM, Humanist wrote:



>         Date: 2019-03-01 07:33:09+00:00
>         From: Iian Neill 
>         Subject: Re: [Humanist] 32.423: the McGann-Renear debate
>
> Hi Desmond,
>
>> I am advocating a much
>> simpler format for text close to plain text that can be easily mined
>> for information, that contains only rendering or abstract rendering
>> information. Deeply structured texts as once provided by XML don't fit
>> the bill because they mix up the rendering with the semantics and use
>> too rigid a document structure that invites overlapping hierarchies on
>> reuse.
>
> A lot of the recent discussion on annotation storage here has centred
> around files. An alternative method is to store the text and its standoff
> properties (to take one example) in a database. If the standoff properties
> are stored as discrete records they can be queried within each other,
> across texts, and across any entities connected to those property records.
> For example, in our exploratory project 'Codex', editions are represented
> by plain text and standoff property nodes stored in the Neo4j graph
> database. A web-based standoff property 'text editor' allows the user to
> freely add or remove text without breaking any property indexes. Through
> various modal search windows, users can also search for entities in the
> database, or create them on an ad hoc basis within the text editing session.



But "plain text" in an electronic system is an illusion. Why not abandon
the distinction between text, markup and annotations, and capture all of
them in a database, over which queries then search and/or render a
particular "view" of a "text"?

If you want XML for further processing, that is one rendering of a
text. A rendering in SVG is another, so that readers can, for example,
choose dynamic renditions of variant versions, with or without a base
version being displayed.

The same goes for any annotation of a text, as well as annotations of
annotations.

If, as you say, we should stop clinging to the file metaphor for
annotations, let's free ourselves of it with regard to texts.

Granted, for many purposes I would prefer a rendering that mimics a
handwritten manuscript, but that is only one possibility out of many.

Gaps, spaces, margins, etc., can all have unique records in a database,
or even records based on unique x - y coordinates on a physical witness.

With that change, we can speak of renderings of texts, even renderings
that we claim match physical witnesses. Some renderings carry
annotations, some don't.
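
The idea of one store of records from which each "view" is produced by a
query can be sketched roughly as follows, using an in-memory SQLite
database. The schema (tables "unit" and "annotation"), the sample
content, and the function names are all invented for this illustration.
It assumes at most one annotation per unit and labels that are valid XML
element names.

```python
import sqlite3
from xml.sax.saxutils import escape

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE unit (id INTEGER PRIMARY KEY, seq INTEGER, content TEXT);
    -- An annotation targets either a unit or another annotation.
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY,
        target_kind TEXT CHECK (target_kind IN ('unit', 'annotation')),
        target_id INTEGER,
        label TEXT
    );
""")
conn.executemany(
    "INSERT INTO unit (id, seq, content) VALUES (?, ?, ?)",
    [(1, 1, "Written"), (2, 2, "in"), (3, 3, "Florence")])
conn.executemany(
    "INSERT INTO annotation (id, target_kind, target_id, label) VALUES (?, ?, ?, ?)",
    [(1, "unit", 3, "place"), (2, "annotation", 1, "uncertain")])


def plain_view(conn):
    """One rendering: the units in order, with no annotations shown."""
    rows = conn.execute("SELECT content FROM unit ORDER BY seq")
    return " ".join(content for (content,) in rows)


def xml_view(conn):
    """Another rendering: the same units, with annotated ones wrapped in tags."""
    rows = conn.execute("""
        SELECT u.content, a.label
        FROM unit AS u
        LEFT JOIN annotation AS a
          ON a.target_kind = 'unit' AND a.target_id = u.id
        ORDER BY u.seq
    """)
    parts = []
    for content, label in rows:
        word = escape(content)
        parts.append(f"<{label}>{word}</{label}>" if label else word)
    return "<text>" + " ".join(parts) + "</text>"


def notes_on(conn, annotation_id):
    """Labels of annotations whose target is itself an annotation."""
    rows = conn.execute(
        "SELECT label FROM annotation "
        "WHERE target_kind = 'annotation' AND target_id = ? ORDER BY id",
        (annotation_id,))
    return [label for (label,) in rows]


print(plain_view(conn))   # Written in Florence
print(xml_view(conn))     # <text>Written <place>Florence</place></text> with "in" between
print(notes_on(conn, 1))  # ['uncertain']
```

Neither view is privileged here: the "plain" one is just the query that
happens to leave the annotations out. Records for gaps, margins, or
coordinates on a witness would be further rows of the same kind.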

Hope you are having a great weekend!

Patrick



--

Patrick Durusau
patrick@durusau.net
Technical Advisory Board, OASIS (TAB)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau








Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)

This site is maintained under a service level agreement by King's Digital Lab.