16.377 encoded text vs database

From: Humanist Discussion Group (by way of Willard McCarty willard.mccarty@kcl.ac.uk)
Date: Wed Dec 11 2002 - 02:22:44 EST

                   Humanist Discussion Group, Vol. 16, No. 377.
           Centre for Computing in the Humanities, King's College London
                       www.kcl.ac.uk/humanities/cch/humanist/
                         Submit to: humanist@princeton.edu

       [1] From: Patrick Durusau <pdurusau@emory.edu> (42)
             Subject: Re: 16.372 encoded text vs database?

       [2] From: "Fotis Jannidis" <fotis.jannidis@lrz.uni- (24)
                     muenchen.de>
             Subject: Re: 16.375 encoded text vs database

       [3] From: Julia Flanders <Julia_Flanders@brown.edu> (39)
             Subject: Re: 16.375 encoded text vs database

       [4] From: Chris Powell <sooty@umich.edu> (29)
             Subject: Re: 16.375 encoded text vs database

       [5] From: Michael Hart <hart@beryl.ils.unc.edu> (41)
             Subject: Re: 16.375 encoded text vs database

       [6] From: "George H. Williams" <WilliamsGH@umkc.edu> (27)
             Subject: Re: 16.375 encoded text vs database

       [7] From: Øyvind Eide <oyvind.eide@muspro.uio.no> (53)
             Subject: Re: 16.375 encoded text vs database

    --[1]------------------------------------------------------------------
             Date: Tue, 10 Dec 2002 10:05:46 +0000
             From: Patrick Durusau <pdurusau@emory.edu>
             Subject: Re: 16.372 encoded text vs database?

    Willard,

    Willard McCarty wrote:

    >A technical question: what might be reliable criteria for determining when
    >a given research problem involving textual data is approached with
    >relational database technology, when with text encoding? A related
    >non-technical question: how does one cultivate the ability to keep such
    >questions always in mind? As has been said, we tend to view each thing
    >as something to be hammered with the hammer in hand.

    With the advent of native XML databases as well as relational databases
    that store XML, is the question of "relational database technology vs. text
    encoding" really relevant? The XSTAR project at the Oriental Institute,
    University of Chicago, is building an XML database for archaeological data
    that includes textual data. Much of the data that will be used by that
    database once resided in relational databases but it will also have a
    substantial amount of textual data that uses text encoding.

    At least in the early stages, I think projects should be formulated without
    regard to available technology (either locally or read about) so that
    researchers can state fully what they would like to do, without regard to
    whether that is actually possible with current technology. A very precise
    formulation of the research problem and goals of the project would provide
    a basis for evaluating available technologies for the one that most closely
    meets the needs of the project.

    That I can approach some problem using MySQL, which is free and familiar,
    is not the same as formulating the problem separate and apart from the
    known limitations of MySQL. After precise formulation of the research
    problem or goals of the project, I may find that MySQL is sufficient but on
    the other hand, its limitations may provoke a search for other technology
    for the project.

    Technology is changing rapidly enough that I think that computing humanists
    should ignore it in assisting in the formulation of research projects. This
    avoids overlooking new technologies that have not become widely known or
    simply using the tools already at hand. Perhaps the ACH should provide web
    space for posting of project proposals so researchers could draw upon
    diverse expertise and awareness of new technologies in the computing
    humanities community? (I started to suggest the Humanist but adequate
    project proposals would be too long for a discussion list.)

    Patrick

    --
    Patrick Durusau
    Director of Research and Development
    Society of Biblical Literature
    pdurusau@emory.edu
    

    --[2]------------------------------------------------------------------
             Date: Tue, 10 Dec 2002 11:00:14 +0000
             From: "Fotis Jannidis" <fotis.jannidis@lrz.uni-muenchen.de>
             Subject: Re: 16.375 encoded text vs database

    > From: Willard McCarty <willard.mccarty@kcl.ac.uk>
    >
    > A technical question: what might be reliable criteria for determining
    > when a given research problem involving textual data is approached
    > with relational database technology, when with text encoding?

    Isn't this more a question of query interfaces? As far as I know, many XML databases use relational databases to store the data, but the user has an extended XPath or an XQuery interface. In other words, we don't really care which technology actually stores the data, but which interfaces to the data are available. If you only have an SQL interface, your data must be highly and consistently structured, but if you have an XQuery interface you can formulate all queries which are possible in SQL and a lot which are not possible there. If you only have extended XPath (extended by some free-text search abilities, as in eXist), you are more or less confined to context-aware queries. If your research is mainly limited to retrieving texts by metadata, it may be more effective to store the metadata separately and run the query over the metadata alone, but this would probably only be interesting for very large corpora. I don't know enough about statistical approaches to text analysis to answer your question in this respect.
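
    A minimal sketch of the interface point, using only the Python standard library. Everything in it is an assumption for illustration: the three-letter corpus, the element and column names, and both helper functions are hypothetical, and ElementTree offers only a small XPath subset, not XQuery. The first query needs consistently filled fields; the second depends on markup inside the running text, which the flat table does not carry.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample corpus, held once as table rows and once as encoded text.
ROWS = [
    ("L1", "Author A", 1787, "Rome is beyond description."),
    ("L2", "Author B", 1794, "The letter arrived in Jena."),
    ("L3", "Author A", 1797, "Work has resumed."),
]

ENCODED = """
<corpus>
  <letter id="L1" author="Author A" year="1787">
    <p><place>Rome</place> is beyond description.</p>
  </letter>
  <letter id="L2" author="Author B" year="1794">
    <p>The letter arrived in <place>Jena</place>.</p>
  </letter>
  <letter id="L3" author="Author A" year="1797">
    <p>Work has resumed.</p>
  </letter>
</corpus>
"""


def sql_ids_by_author(author):
    """SQL interface: filter and sort on consistently structured fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE letter (id TEXT, author TEXT, year INTEGER, body TEXT)"
    )
    conn.executemany("INSERT INTO letter VALUES (?, ?, ?, ?)", ROWS)
    rows = conn.execute(
        "SELECT id FROM letter WHERE author = ? ORDER BY year", (author,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


def ids_mentioning_place(place):
    """Path interface: find letters whose text contains a tagged place name."""
    root = ET.fromstring(ENCODED)
    return [
        letter.get("id")
        for letter in root.findall("letter")
        if any(p.text == place for p in letter.findall(".//place"))
    ]
```

    Calling sql_ids_by_author("Author A") returns the matching identifiers in date order, while ids_mentioning_place("Jena") relies on the <place> tagging and has no counterpart in the table as defined here.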

    Fotis Jannidis

    --[3]------------------------------------------------------------------
             Date: Tue, 10 Dec 2002 11:01:28 +0000
             From: Julia Flanders <Julia_Flanders@brown.edu>
             Subject: Re: 16.375 encoded text vs database

    >Willard asks:
    >
    > > A technical question: what might be reliable criteria for determining when
    > > a given research problem involving textual data is approached with
    > > relational database technology, when with text encoding? A related
    > > non-technical question: how does one cultivate the ability to keep such
    > > questions always in mind? As has been said, we tend to view each thing
    > > as something to be hammered with the hammer in hand.
    >
    >This is of more than theoretical interest for librarians and cataloguers.
    >Having used both database technology and text encoding for medieval
    >manuscript cataloguing. There is no question that databases allow much
    >more sophisticated query and sorting capabilities. Right now the text
    >encoded projects with which I am familiar only allow string searches,
    >sometimes within a given tag, but with no way to sort the output or do the
    >"find me all the MSS written in Florence between 1400 and 1450 on
    >parchment and with illustrations and sort them by author and date."

    I suspect this really depends more on your delivery software than on the nature of text encoding. The publication framework that the Women Writers Project currently uses (which is a customization of DynaText/DynaWeb) certainly supports the kind of complex searching and sorting that Charles describes, and we expect to duplicate that functionality in any future delivery mechanism we use.
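
    As a small illustration that the quoted "Florence" query is a matter of software rather than of encoding, the sketch below runs it over encoded records with the Python standard library. The element names, the six sample records, and the function find_mss are all hypothetical; they are not TEI, and they say nothing about how DynaText/DynaWeb or the Women Writers Project implement searching.

```python
import xml.etree.ElementTree as ET

# Hypothetical manuscript descriptions; element names are illustrative only.
CATALOGUE = """
<catalogue>
  <ms id="MS1"><author>Author B</author><origPlace>Florence</origPlace>
    <date>1440</date><support>parchment</support><illustrated>yes</illustrated></ms>
  <ms id="MS2"><author>Author A</author><origPlace>Florence</origPlace>
    <date>1410</date><support>parchment</support><illustrated>yes</illustrated></ms>
  <ms id="MS3"><author>Author A</author><origPlace>Florence</origPlace>
    <date>1430</date><support>paper</support><illustrated>yes</illustrated></ms>
  <ms id="MS4"><author>Author A</author><origPlace>Venice</origPlace>
    <date>1420</date><support>parchment</support><illustrated>yes</illustrated></ms>
  <ms id="MS5"><author>Author A</author><origPlace>Florence</origPlace>
    <date>1405</date><support>parchment</support><illustrated>no</illustrated></ms>
  <ms id="MS6"><author>Author A</author><origPlace>Florence</origPlace>
    <date>1445</date><support>parchment</support><illustrated>yes</illustrated></ms>
</catalogue>
"""


def find_mss(xml_text, place, earliest, latest, support):
    """Return ids of illustrated MSS matching the criteria, by author then date."""
    hits = []
    for ms in ET.fromstring(xml_text).iter("ms"):
        year = int(ms.findtext("date"))
        if (
            ms.findtext("origPlace") == place
            and earliest <= year <= latest
            and ms.findtext("support") == support
            and ms.findtext("illustrated") == "yes"
        ):
            hits.append((ms.findtext("author"), year, ms.get("id")))
    # Tuples sort by author first, then by year.
    return [ms_id for _, _, ms_id in sorted(hits)]
```

    The call find_mss(CATALOGUE, "Florence", 1400, 1450, "parchment") filters and sorts in one pass, but only because every record here carries all five fields.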

    It might help here to distinguish between the data structures being described, and the publication/interface systems that are currently available to publish those data structures--so we could parse out Willard's question into two separate questions:

    "how can one determine whether a given research problem involving textual data is best served by a database-like structure for the data, and when it is best served by a looser 'text-encoded' structure?"

    "how can one determine...best served by database software or by XML publication software?"

    As Charles suggests, there's now software emerging that really is both. But the fact that until recently there has been a dearth of good, affordable SGML/XML publication software and a whole lot of very powerful database software shouldn't obscure whatever differences there are between the two ways of representing the data itself, which is in itself a very interesting question.

    Best, Julia

    --[4]------------------------------------------------------------------
             Date: Tue, 10 Dec 2002 11:02:02 +0000
             From: Chris Powell <sooty@umich.edu>
             Subject: Re: 16.375 encoded text vs database

    I'm not certain that this is necessarily a case of encoded text systems vs. databases. We have an encoded text system for transcription of speech online at http://www.hti.umich.edu/m/micase/ that combines the features of text searching with those of restricting/sorting based on metadata features for each transcription (about the speaker and the speech event). I can search for the phrase "you know" and restrict it to use by male native speakers of English in advising sessions, and then sort my results by context words and other variables I preset prior to searching. The system is built on transcriptions encoded in a TEI-based DTD.
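
    The combination described above, a phrase search restricted by metadata and sorted by a context word, can be sketched in a few lines of Python. This is not the MICASE implementation or its DTD: the <u> element, its attributes, the sample utterances, and the function search are hypothetical stand-ins.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical transcript fragment; attribute names are illustrative only.
TRANSCRIPTS = """
<transcripts>
  <u speaker="S1" sex="male" native="yes" event="advising">well you know the deadline is friday</u>
  <u speaker="S2" sex="female" native="yes" event="advising">you know that course is full</u>
  <u speaker="S3" sex="male" native="yes" event="lecture">as you know entropy increases</u>
  <u speaker="S4" sex="male" native="yes" event="advising">you know credits transfer</u>
</transcripts>
"""


def search(xml_text, phrase, **restrictions):
    """Find phrase in utterances whose attributes match the restrictions.

    Returns (speaker, following_word) pairs sorted by the following word.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\s+(\w+)")
    hits = []
    for u in ET.fromstring(xml_text).iter("u"):
        if all(u.get(key) == value for key, value in restrictions.items()):
            for match in pattern.finditer(u.text or ""):
                hits.append((u.get("speaker"), match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])
```

    For example, search(TRANSCRIPTS, "you know", sex="male", native="yes", event="advising") keeps only the two advising-session utterances by male native speakers and orders them by the word that follows the phrase.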

    We're also enhancing our more widely known encoded text systems, like the Making of America, to permit sorting of results by frequency of the search term, author, title, and date. This should be available early next year.

    Both of these systems rely, however, on the consistent application of standard metadata across all the texts in a collection. This can be difficult to ensure, and it's also difficult to specify system behavior when inconsistent metadata is encountered (if you are going to sort by date and there is no date for a text in the result set, what should happen?). This may be why such sorting and specifying features are not commonly seen outside of database applications like library catalogs and related indexes, where a great deal of effort is put into the metadata.
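
    The missing-date question has no single right answer, but the choice can at least be made explicit. The sketch below, in Python, shows one possible approach: the caller states what happens to undated texts. The function sort_by_date, its undated parameter, and the record layout are hypothetical and do not describe the behaviour of any system mentioned here.

```python
def sort_by_date(records, undated="last"):
    """Sort records by their 'date' value with an explicit policy for gaps.

    undated: "last" puts records without a date at the end, "first" puts
    them at the start, and "drop" leaves them out of the result.
    """
    if undated not in ("last", "first", "drop"):
        raise ValueError("undated must be 'last', 'first' or 'drop'")
    dated = sorted(
        (r for r in records if r.get("date") is not None),
        key=lambda r: r["date"],
    )
    missing = [r for r in records if r.get("date") is None]
    if undated == "drop":
        return dated
    return dated + missing if undated == "last" else missing + dated
```

    Keeping undated texts visible at the end of the list is often the least surprising default, since dropping them silently changes the size of the result set.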

    --[5]------------------------------------------------------------------
             Date: Tue, 10 Dec 2002 11:02:41 +0000
             From: Michael Hart <hart@beryl.ils.unc.edu>
             Subject: Re: 16.375 encoded text vs database

    Believe it or not, the first CDROM of full text eBooks allowed for these kinds of searches a dozen years ago: I saw the Library of the Future, from World Library, at the ALA Midwinter conference, Chicago, Jan. 6, 1990.

    > On the other hand, text encoding captures much better the messiness of the
    > data. I don't know positively, but I do not think it is possible to OCR a
    > printed text and then put it into a relational database without losing a
    > lot of the stuff that doesn't fit in any particular field. Text encoding
    > is quite good at this.

    It is somewhat dependent on your proposed target audience.

    There *used* to be various database structures that would allow for the reconstruction of the original full text document.
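
    One simple way a relational structure can allow the full text to be rebuilt is to store every line together with its position. The Python sketch below illustrates that idea only; it is an assumption about what such a structure might look like, not a description of any historical product, and the table and function names are hypothetical.

```python
import sqlite3


def store(conn, doc_id, text):
    """Store a document as one row per line, keeping each line's position."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS line (doc TEXT, n INTEGER, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO line VALUES (?, ?, ?)",
        [(doc_id, n, line) for n, line in enumerate(text.split("\n"))],
    )


def reconstruct(conn, doc_id):
    """Rebuild the original text by joining its lines in stored order."""
    rows = conn.execute(
        "SELECT content FROM line WHERE doc = ? ORDER BY n", (doc_id,)
    )
    return "\n".join(row[0] for row in rows)
```

    Because nothing is discarded and the order is recorded, reconstruct returns exactly the string that was passed to store, including blank lines and material that fits no particular field.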

    Thanks!!!

    Happy Holidays!!!

    Michael S. Hart <hart@pobox.com>
    Project Gutenberg Principal Instigator
    "*Internet User ~#100*"

    --[6]------------------------------------------------------------------
             Date: Tue, 10 Dec 2002 11:03:25 +0000
             From: "George H. Williams" <WilliamsGH@umkc.edu>
             Subject: Re: 16.375 encoded text vs database

    I think this example would fulfill your criteria: http://www.irith.org

    For general technical details, see http://www.irith.org/about.jsp "[An] XML database [that] runs on Tamino, with a JSP interface that runs off a Tomcat Server."

    --GHW

    George H. Williams
    Department of English
    University of Missouri-Kansas City

    > Date: Mon, 09 Dec 2002 07:52:15 +0000
    > From: "Borovsky, Zoe" <zoe@humnet.ucla.edu>
    >
    > a follow-up question:
    > does anyone know of a project/product that uses XML to mark up
    > bibliographies suitable for Web delivery/searching, etc?
    >
    > i have found:
    > http://refdb.sourceforge.net/features.html
    >
    > and this example of using BibTeX for markup and Glimpse as the search
    > function:
    > http://liinwww.ira.uka.de/bibliography/
    >
    > i would be interested in finding other examples. --zoe
    > ............................
    > Zoe Borovsky, PhD
    > Academic Services Manager
    > UCLA, Center for Digital Humanities

    --[7]------------------------------------------------------------------
             Date: Tue, 10 Dec 2002 11:04:21 +0000
             From: Øyvind Eide <oyvind.eide@muspro.uio.no>
             Subject: Re: 16.375 encoded text vs database

    In the Museum project, we digitize catalogues of archaeological artefacts and tag them using SGML. In addition to making the texts searchable, we also import the content of the elements into a relational database, thus creating the historical part of the Norwegian university museums' archaeological find database.

    More information can be found in Jon Holmen and Espen Uleberg: "Getting the most out of it - SGML-encoding of archaeological texts." at http://www.dokpro.uio.no/engelsk/text/getting_most_out_of_it.html

    I think this is a good way to have the best of two worlds, but of course, the task is simplified by the static nature of our data.
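
    The general pattern of tagging the catalogue text and then importing element content into a relational table can be sketched as follows in Python. This is not the Museum project's code: well-formed XML stands in for SGML because the standard library has no SGML parser, and the element names, sample entries, and the function load are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical catalogue entries; XML is used here as a stand-in for SGML.
CATALOGUE = """
<catalogue>
  <entry n="C100"><object>axe</object><material>iron</material><findplace>Farm A</findplace></entry>
  <entry n="C101"><object>bead</object><material>glass</material></entry>
</catalogue>
"""

FIELDS = ("object", "material", "findplace")


def load(xml_text):
    """Copy the content of selected elements into a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE find "
        "(n TEXT PRIMARY KEY, object TEXT, material TEXT, findplace TEXT)"
    )
    for entry in ET.fromstring(xml_text).iter("entry"):
        # findtext returns None for an absent element, stored as NULL.
        values = [entry.findtext(field) for field in FIELDS]
        conn.execute(
            "INSERT INTO find VALUES (?, ?, ?, ?)", [entry.get("n")] + values
        )
    return conn
```

    The encoded text stays the source of record, and the table is a derived view that can be regenerated at any time, which is easiest when the underlying data no longer changes.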

    --

    / Kind regards,
    / Øyvind Eide, Unit for Digital Documentation, University of Oslo
    | Postal adr.: P.O. Box 1123 Blindern, N-0317 OSLO, Norway
    \ Phone: + 47 22 85 49 82 Fax: + 47 22 85 49 83
    \ http://www.dokpro.uio.no/



    This archive was generated by hypermail 2b30 : Wed Dec 11 2002 - 02:23:48 EST