Humanist Discussion Group, Vol. 32, No. 498.
Department of Digital Humanities, King's College London
Hosted by King's Digital Lab
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

Date: 2019-03-01 07:33:09+00:00
From: Iian Neill
Subject: Re: [Humanist] 32.423: the McGann-Renear debate

Hi Desmond,

> I am advocating a much
> simpler format for text close to plain text that can be easily mined
> for information, that contains only rendering or abstract rendering
> information. Deeply structured texts as once provided by XML don't fit
> the bill because they mix up the rendering with the semantics and use
> too rigid a document structure that invites overlapping hierarchies on
> reuse.

A lot of the recent discussion on annotation storage here has centred around files. An alternative method is to store the text and its standoff properties (to take one example) in a database. If the standoff properties are stored as discrete records they can be queried within each other, across texts, and across any entities connected to those property records.

For example, in our exploratory project 'Codex', editions are represented by plain text and standoff property nodes stored in the Neo4j graph database. A web-based standoff property 'text editor' allows the user to freely add or remove text without breaking any property indexes. Through various modal search windows, users can also search for entities in the database, or create them on an ad hoc basis within the text editing session.

My point is that if you represent texts and standoff properties in a graph database, you are freed up to carry out higher level analysis of texts, leveraging database relationships and indexes.

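To make the idea concrete, here is a minimal sketch in Python rather than Neo4j. It is not the Codex implementation: the record fields, the half-open [start, end) offset convention, the type names, and the helper functions are all assumptions made for illustration. It shows standoff properties held as discrete records, an insertion that shifts offsets so the records stay aligned with the text, and a simple overlap query between two kinds of property.

```python
from dataclasses import dataclass


@dataclass
class StandoffProperty:
    """One annotation record (hypothetical shape, not the Codex schema).

    Offsets are half-open: the property covers text[start:end].
    """
    type: str
    start: int
    end: int
    value: str = ""


def insert_text(text, props, at, new):
    """Insert `new` before offset `at` and shift property offsets to match."""
    n = len(new)
    for p in props:
        if p.start >= at:   # property begins at or after the insertion point
            p.start += n
        if p.end > at:      # property ends after the insertion point
            p.end += n
    return text[:at] + new + text[at:]


def overlaps(a, b):
    """True when two half-open ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def overlapping(props, type_a, type_b):
    """All (a, b) pairs where a `type_a` property overlaps a `type_b` property."""
    return [
        (a, b)
        for a in props if a.type == type_a
        for b in props if b.type == type_b and overlaps(a, b)
    ]


text = "Michelangelo wrote to Vasari."
props = [
    StandoffProperty("person", 0, 12, "Michelangelo Buonarroti"),
    StandoffProperty("pos", 0, 12, "PROPN"),
    StandoffProperty("pos", 13, 18, "VERB"),
    StandoffProperty("person", 22, 28, "Giorgio Vasari"),
    StandoffProperty("pos", 22, 28, "PROPN"),
]

# Editing the text does not invalidate the annotations: offsets are shifted.
text = insert_text(text, props, 0, "Yesterday ")

for person, token in overlapping(props, "person", "pos"):
    print(person.value, token.value, text[person.start:person.end])
```

In a database the same overlap test would be expressed as a query over indexed property records rather than a nested loop over an in-memory list, which is what makes it practical across a whole corpus.
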
For example, in our digital edition of the letters of Michelangelo (English translations), persons, places, events, and concepts are annotated, and NLP analysis is also performed on all texts. The part-of-speech (PoS) tokens and morphologies it generates are themselves represented as annotations, and can therefore be queried in conjunction with both the plain text and the other annotations. In this way, one can conceive of a PoS layer or stratum sitting beneath the text and other annotations. Using the database, one can quickly interrogate all texts in the corpus for whatever annotational overlap one desires. If the decision had been taken to represent these 'standoff property texts' as files, it would have been impractical to run queries quickly over hundreds of files looking for overlaps.

Regards,
Iian
Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)
This site is maintained under a service level agreement by King's Digital Lab.