Humanist Discussion Group, Vol. 32, No. 499.
Department of Digital Humanities, King's College London
Hosted by King's Digital Lab
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

  [1] From: Desmond Schmidt
      Subject: Re: [Humanist] 32.498: standoff markup (89)

  [2] From: patrick@durusau.net
      Subject: "Plain text" as illusion, was Re: [Humanist] 32.498: standoff markup (79)

--[1]------------------------------------------------------------------------
Date: 2019-03-03 01:56:51+00:00
From: Desmond Schmidt
Subject: Re: [Humanist] 32.498: standoff markup

I have seen Iian Neill's editor in action. It works well and is quite
impressive. It shows what can be done directly in the browser using
standoff markup. A frequent complaint against standoff is that editing
the text invalidates the pointers to the text from the markup held
outside. However, there is no intrinsic reason why that should be so,
although it may be difficult to achieve in XML. He gets around it by
using simplified HTML as the editing medium. This is similar to the
approach we use in our editor for Harpur.

When representing the text in an HTML editor all you need are two
elements: p and span. Add an attribute "class" and you can style that
any way you like. If you follow the proposition I made earlier in the
McGann-Renear debate, semantic markup and document metadata can be held
externally, variants go into versions and layers, and what's left can
easily be stored in files as highly interoperable HTML. There is
practically no document we cannot represent this way.

Within the computer's memory, as Iian suggests, separating markup from
the underlying text has numerous advantages. I think it is time we
started sharing texts and tools for real and this approach is one step
toward making that possible.

Desmond Schmidt
eResearch
Queensland University of Technology

On 3/2/19, Humanist wrote:
> Humanist Discussion Group, Vol. 32, No. 498.
> Department of Digital Humanities, King's College London
> Hosted by King's Digital Lab
> www.dhhumanist.org
> Submit to: humanist@dhhumanist.org
>
>
> Date: 2019-03-01 07:33:09+00:00
> From: Iian Neill
> Subject: Re: [Humanist] 32.423: the McGann-Renear debate
>
> Hi Desmond,
>
>> I am advocating a much
>> simpler format for text close to plain text that can be easily mined
>> for information, that contains only rendering or abstract rendering
>> information. Deeply structured texts as once provided by XML don't fit
>> the bill because they mix up the rendering with the semantics and use
>> too rigid a document structure that invites overlapping hierarchies on
>> reuse.
>
>
> A lot of the recent discussion on annotation storage here has centred
> around files. An alternative method is to store the text and its standoff
> properties (to take one example) in a database. If the standoff properties
> are stored as discrete records they can be queried within each other,
> across texts, and across any entities connected to those property records.
> For example, in our exploratory project 'Codex', editions are represented
> by plain text and standoff property nodes stored in the Neo4j graph
> database. A web-based standoff property 'text editor' allows the user to
> freely add or remove text without breaking any property indexes. Through
> various modal search windows, users can also search for entities in the
> database, or create them on an ad hoc basis within the text editing
> session.
>
> My point is that if you represent texts and standoff properties in a graph
> database, you are freed up to carry out higher level analysis of texts,
> leveraging database relationships and indexes.
> For example, in our digital
> edition of the letters of Michelangelo (English translations), alongside
> persons, places, events, and concepts annotated, NLP analysis is also
> performed on all texts, and the Parts of Speech tokens (and morphologies)
> generated are themselves represented as annotations, and can therefore be
> queried in conjunction with both the plain text and other annotations. In
> this way, one can conceive of a PoS layer or stratum sitting beneath the
> text and other annotations. Using the database, one can quickly interrogate
> all texts in the corpus for whatever annotational overlap you desire.
>
> If the decision had been taken to represent these 'standoff property texts'
> as files, it would have been impractical to run queries quickly over
> hundreds of files looking for overlaps.
>
> Regards,
> Iian

--
Dr Desmond Schmidt
Mobile: 0481915868
Work: +61-7-31384036

--[2]------------------------------------------------------------------------
Date: 2019-03-02 15:41:25+00:00
From: patrick@durusau.net
Subject: "Plain text" as illusion, was Re: [Humanist] 32.498: standoff markup

Iian,

On 3/2/19 2:46 AM, Humanist wrote:
> Date: 2019-03-01 07:33:09+00:00
> From: Iian Neill
> Subject: Re: [Humanist] 32.423: the McGann-Renear debate
>
> Hi Desmond,
>
>> I am advocating a much
>> simpler format for text close to plain text that can be easily mined
>> for information, that contains only rendering or abstract rendering
>> information. Deeply structured texts as once provided by XML don't fit
>> the bill because they mix up the rendering with the semantics and use
>> too rigid a document structure that invites overlapping hierarchies on
>> reuse.
>
> A lot of the recent discussion on annotation storage here has centred
> around files. An alternative method is to store the text and its standoff
> properties (to take one example) in a database.
> If the standoff properties
> are stored as discrete records they can be queried within each other,
> across texts, and across any entities connected to those property records.
> For example, in our exploratory project 'Codex', editions are represented
> by plain text and standoff property nodes stored in the Neo4j graph
> database. A web-based standoff property 'text editor' allows the user to
> freely add or remove text without breaking any property indexes. Through
> various modal search windows, users can also search for entities in the
> database, or create them on an ad hoc basis within the text editing session.

But "plain text" in an electronic system is an illusion. Why not abandon
the distinction between text, markup and annotations, capturing all of
them in a database, upon which queries then search and/or render a
particular "view" of a "text" for your viewing?

If you desire XML, for further processing, that is one rendering of a
text, as is rendering in SVG, for example, such that readers can choose
dynamic renditions of variant versions, with or without a base version
being displayed. Or any annotation of a text, as well as annotations of
annotations.

If, as you say, we should stop clinging to the file metaphor for
annotations, let's free ourselves of it with regard to texts.

Granting that for many purposes I would prefer a rendering that mimics a
hand written mss, but that is only one possibility out of many. Gaps,
spaces, margins, etc., can all have unique records in a database, or even
records based on unique x - y coordinates on a physical witness.

With that change, we can speak of renderings of texts, even renderings
that we claim match physical witnesses. Some renderings carry
annotations, some don't.

Hope you are having a great weekend!
Patrick

--
Patrick Durusau
patrick@durusau.net
Technical Advisory Board, OASIS (TAB)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau

Attachments:
  signature.asc: https://dhhumanist.org/att/46737/att00/

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php
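
Both messages in this digest turn on the claim that text can be freely added or removed without invalidating the standoff properties that point into it, and that annotation layers (persons, Parts of Speech) can then be queried for overlap. A minimal Python sketch of one way that could work is given here. It is not the code of Codex or of the Harpur editor, and it uses an in-memory list where Codex is described as using a graph database. All names (Prop, StandoffText, insert, delete, overlaps) and the sample sentence are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prop:
    """Hypothetical standoff property: a half-open range [start, end) over the text."""
    kind: str    # annotation layer, e.g. "person" or "pos"
    value: str   # what the range is annotated with
    start: int
    end: int


class StandoffText:
    """Plain text plus externally held properties (illustrative, not a real API)."""

    def __init__(self, text, props=None):
        self.text = text
        self.props = list(props or [])

    def insert(self, pos, s):
        """Insert s at pos, shifting or stretching the properties it affects."""
        n = len(s)
        self.text = self.text[:pos] + s + self.text[pos:]
        for p in self.props:
            if pos <= p.start:      # insertion at or before the range: shift it
                p.start += n
                p.end += n
            elif pos < p.end:       # insertion strictly inside the range: grow it
                p.end += n

    def delete(self, pos, n):
        """Delete n characters from pos, clipping ranges and dropping emptied ones."""
        end = pos + n
        self.text = self.text[:pos] + self.text[end:]

        def remap(i):
            if i <= pos:
                return i            # offset lies before the deleted span
            if i >= end:
                return i - n        # offset lies after the deleted span
            return pos              # offset fell inside the deleted span

        for p in self.props:
            p.start, p.end = remap(p.start), remap(p.end)
        self.props = [p for p in self.props if p.start < p.end]

    def span(self, p):
        """Return the text currently covered by property p."""
        return self.text[p.start:p.end]

    def overlaps(self, kind_a, kind_b):
        """Return all pairs of properties from two layers whose ranges intersect."""
        return [(a, b)
                for a in self.props if a.kind == kind_a
                for b in self.props if b.kind == kind_b
                if a.start < b.end and b.start < a.end]


# Invented sample sentence with two annotation layers.
doc = StandoffText("Michelangelo wrote to Vasari.", [
    Prop("person", "Michelangelo Buonarroti", 0, 12),
    Prop("person", "Giorgio Vasari", 22, 28),
    Prop("pos", "PROPN", 0, 12),
    Prop("pos", "VERB", 13, 18),
    Prop("pos", "PROPN", 22, 28),
])
doc.insert(0, "Note: ")
# Every property still covers the same words after the edit:
print([doc.span(p) for p in doc.props])
# → ['Michelangelo', 'Vasari', 'Michelangelo', 'wrote', 'Vasari']
```

The design choice in this sketch is that the properties never live inside the text; an edit only has to remap integer offsets. The overlaps method is a nested loop over one text, whereas the quoted message describes running such queries across a whole corpus with database indexes.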
Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)
This site is maintained under a service level agreement by King's Digital Lab.