3.1095 electronic texts (172)

Willard McCarty (MCCARTY@vm.epas.utoronto.ca)
Mon, 26 Feb 90 08:10:55 EST

Humanist Discussion Group, Vol. 3, No. 1095. Monday, 26 Feb 1990.

(1) Date: Thu, 22 Feb 90 17:11 EST (59 lines)
Subject: Georgetown Catalog of Projects in Electronic Text

(2) Date: Fri, 23 Feb 90 11:56:21 CST (23 lines)
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Books on disk

(3) Date: Fri, 23 Feb 90 10:00 CST (65 lines)
From: John Baima <D024JKB@UTARLG>
Subject: RE: Annotated e-texts, retrieval

(1) --------------------------------------------------------------------
Date: Thu, 22 Feb 90 17:11 EST
Subject: Georgetown Catalog of Projects in Electronic Text

In a recent posting on HUMANIST, Bob Kraft generously
mentioned Georgetown's project of maintaining a catalogue
of archives and projects in machine-readable text.
Because he suggested that a progress report would be
welcome, I've compiled the following brief sketch.

Since April of 1989 we have gathered information (in
varying degrees of completeness) on 274 projects in
twenty-five different countries. Of these projects, 82
emphasize linguistics and language study, while 192 focus
on other disciplines in the humanities.

Arranged in geographical order, the entries contain ten
categories of information:
1. Identifying Acronym
2. Name and Affiliation of Operation
3. Contact Person
4. Disciplinary Interests
5. Focus (period, location, individual, or genre)
6. Language(s) Encoded
7. Intended Use
8. Format
9. Forms of Access
10. Source(s) of Archival Holdings
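
As a rough sketch, an entry with these ten categories could be held in
a simple record like the one below. The field names, the extra
"country" key used for geographical ordering, and the sort rule are
all assumptions made for illustration; they do not describe the
Center's actual database.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CatalogEntry:
    """One catalogue entry; all field names here are hypothetical."""
    country: str            # assumed key for the geographical ordering
    acronym: str            # 1. Identifying Acronym
    operation: str          # 2. Name and Affiliation of Operation
    contact: str            # 3. Contact Person
    disciplines: List[str]  # 4. Disciplinary Interests
    focus: str              # 5. Focus (period, location, individual, or genre)
    languages: List[str]    # 6. Language(s) Encoded
    intended_use: str       # 7. Intended Use
    text_format: str        # 8. Format
    access: List[str]       # 9. Forms of Access
    sources: List[str]      # 10. Source(s) of Archival Holdings


def sort_geographically(entries: List[CatalogEntry]) -> List[CatalogEntry]:
    """Order entries by country, then by acronym within a country."""
    return sorted(entries, key=lambda e: (e.country, e.acronym))
```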

Because of the flow of correspondence and the lag time
in updating entries, the information is always in a state
of flux; therefore, we have been reluctant to distribute
obsolescent drafts of the catalogue. Nevertheless, Jean
Feerick, our Project Coordinator, responds directly to
inquiries about archives or disciplines on which we have
information, and we are constructing a database that will
support dial-in, on-line access.

We're grateful for the on-going support we've received
from Bob Kraft (who provided the initial vision and the
original data for the project), Marianne Gaunt and Bob
Hollander of the Rutgers-Princeton Project (a major
source of information about specific texts in electronic
form), Lou Burnard of the Oxford Text Archive (the
primary repository of etexts in the humanities), Ian
Lancashire and Willard McCarty for the valuable
information in the Humanities Computing Yearbook, and the
many project directors who have responded to our surveys
and follow-up letters. A complete account of our
indebtedness would require a separate file on the

Michael Neuman
Georgetown Center for Text and Technology
Georgetown University
Washington, DC 20057
(202) 687-6096

(2) --------------------------------------------------------------31----
Date: Fri, 23 Feb 90 11:56:21 CST
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Books on disk

I note most of the discussion involves books with accompanying disks.
It does not seem clear whether the disks are an extension of the books
or an exact copy.

At the libraries I work with, the books are available on disk, and all
the students and staff have to do is bring a floppy and copy the files
to take back to their own machines for research. The licenses include
use by all members of the college and the price breaks down to between
a penny and a dime per student for the complete works of Shakespeare.

Thank you for your interest,

Michael S. Hart, Director, Project Gutenberg
National Clearinghouse for Machine Readable Texts

(3) --------------------------------------------------------------69----
Date: Fri, 23 Feb 90 10:00 CST
From: John Baima <D024JKB@UTARLG>
Subject: RE: Annotated e-texts, retrieval

In response to Pieter C. Masereeuw and Steven DeRose:

Steven DeRose states: "The features you described are basically the
extensions of everyday search tools to ***hierarchical*** documents.
For example, in most texts sentences and words are demarcated, but not
discourse units above the sentence, nor elements smaller than words,
such as morphemes. Any scheme which represents these levels should
allow annotations at all levels."

This is precisely what Lbase allows as a search retrieval engine.
Lbase supports hierarchical, recursive, multilingual tagged texts.
Recursion is a necessary feature for a retrieval engine because
recursion is a common feature. Tags can range from a single character
to about 4,000 characters. Lbase allows regular-expression-like
searches on the tags, including specifying agreement between tags on
different elements (e.g., give me all instances of an infinitive
followed by an indicative, but they have to have the same dictionary form).
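
To make the idea of agreement between tags concrete, here is a minimal
sketch in Python. It is not Lbase's search language or implementation:
the Word record, the tag names, and the reading of "followed by" as
"immediately followed by" are all assumptions for illustration.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    """A tagged word; the three tags are hypothetical examples."""
    surface: str  # the form as it appears in the text
    lemma: str    # the dictionary form
    mood: str     # e.g. "infinitive" or "indicative"


def find_agreeing_pairs(words: List[Word], first_mood: str,
                        second_mood: str) -> List[Tuple[int, int]]:
    """Find adjacent word pairs with the given moods and the same lemma."""
    hits = []
    for i in range(len(words) - 1):
        a, b = words[i], words[i + 1]
        if (a.mood == first_mood and b.mood == second_mood
                and a.lemma == b.lemma):
            hits.append((i, i + 1))
    return hits
```

The lemma comparison is the "agreement" constraint: the two moods are
fixed by the query, while the dictionary form only has to match
between the two elements, whatever it happens to be.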

So far, only Greek, Hebrew and Roman alphabets have been supported,
although others could be added and probably will be for the next
release. Besides searches, Lbase can also make word concordances
based on tags at the word or morpheme level. For example, if one of
the tags at the word or morpheme level is the dictionary form, Lbase
can make a word concordance based on that dictionary form.
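
A concordance keyed on the dictionary form can be sketched in a few
lines. This is an illustration only, assuming the text is already
available as (surface form, dictionary form) pairs; it says nothing
about how Lbase stores or indexes its texts.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def concordance_by_lemma(
        tagged_words: List[Tuple[str, str]]) -> Dict[str, List[Tuple[int, str]]]:
    """Group word positions and surface forms under their dictionary form."""
    index = defaultdict(list)
    for position, (surface, lemma) in enumerate(tagged_words):
        index[lemma].append((position, surface))
    return dict(index)
```

Every inflected form of a word then appears under a single heading,
which a concordance built on surface forms alone cannot do.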

While the search engine of Lbase supports all this, there are a couple
of problems with making this practical today. One problem is that
Lbase runs under MS-DOS and I am limited by 64k segments. Since a
search must often backtrack, this size limitation makes it impractical
to search an element that is larger than 64k. Thus it is not practical
at this time to allow paragraph or larger elements because they could
exceed that limit, although there is no built-in limitation with the
search engine.
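
The constraint can be pictured as a simple size check before a text is
divided into searchable elements. The sketch below assumes that the
limit is exactly 64 * 1024 bytes and that elements are held as byte
strings; the real usable size of a segment under MS-DOS may be
smaller, and none of this is taken from Lbase itself.

```python
from typing import List

# 65,536 bytes; assumed here to be the usable size of one segment
SEGMENT_LIMIT = 64 * 1024


def oversized_elements(elements: List[bytes],
                       limit: int = SEGMENT_LIMIT) -> List[int]:
    """Return the indices of elements whose byte length exceeds the limit."""
    return [i for i, element in enumerate(elements) if len(element) > limit]
```

Sentences will almost never trip such a check, while paragraphs or
larger units occasionally could, which is why they are the risky
choice of element.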

The second and main problem is that there has never been a standard
for encoding such texts. Thus I have several different "drivers" for
the different texts that Lbase knows about, but even the format of
these texts changes from time to time without warning. Since I am not
on any of the TEI committees, I am eagerly waiting to see what they
recommend. Hopefully, they will provide us all with a usable standard.
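
The "driver" arrangement can be sketched as a set of small parsers
that all produce the same internal form. Both input formats below
("word/TAG" tokens and tab-separated "word TAG" lines) are invented
for illustration, as are all the names; they are not the formats or
the code that Lbase actually uses.

```python
from typing import Callable, Dict, Iterable, List, Tuple

TaggedWord = Tuple[str, str]  # (surface form, tag)


def parse_slash_format(line: str) -> List[TaggedWord]:
    """Hypothetical format: space-separated tokens written as word/TAG."""
    pairs = []
    for token in line.split():
        surface, _, tag = token.rpartition("/")
        pairs.append((surface, tag))
    return pairs


def parse_column_format(line: str) -> List[TaggedWord]:
    """Hypothetical format: one word per line, then a tab, then its tag."""
    surface, _, tag = line.partition("\t")
    return [(surface, tag)]


# One driver per known format; a new format means one new entry here.
DRIVERS: Dict[str, Callable[[str], List[TaggedWord]]] = {
    "slash": parse_slash_format,
    "column": parse_column_format,
}


def load(lines: Iterable[str], fmt: str) -> List[TaggedWord]:
    """Convert lines in the named format into the common tagged-word form."""
    driver = DRIVERS[fmt]
    words: List[TaggedWord] = []
    for line in lines:
        words.extend(driver(line.rstrip("\n")))
    return words
```

The search code only ever sees the common form, so the cost of an
unannounced format change is confined to one driver, but each driver
still has to be written and maintained by hand.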

I have had many requests to support brand X file format, but it is
simply not economically feasible to support a new format to make one
sale. (Lbase has never received any outside funding.)

One other note. While the Summer Institute of Linguistics' "IT"
program helps in creating a text that is tagged at the word or
morpheme level, it lacks a search engine. Lbase can search these files

Lbase also supports the TLG and PHI/CCAT CD-ROMs. If anyone wants more
information, please write and I will try to answer.

John Baima
Silver Mountain Software
7246 Cloverglen Dr.
Dallas, TX 75249
(214) 709-6364