13.0002 TEI & the Gadfly's buzz

Humanist Discussion Group (humanist@kcl.ac.uk)
Fri, 7 May 1999 07:30:03 +0100 (BST)

Humanist Discussion Group, Vol. 13, No. 2.
Centre for Computing in the Humanities, King's College London

Date: Fri, 07 May 1999 07:25:02 +0100
From: Mark Olsen <mark@barkov.uchicago.edu>
Subject: 12.0610 TEI and...: The Gadfly Notes

I suspect that I am one of the guilty parties of whom Michael writes
in his post [12.0610], "TEI and the individual scholar":

>> One thing I did
>> not foresee was that so many of the people who develop software for
>> humanities research would receive the TEI so coolly, and would believe
>> that it is simpler to invent their own scheme for text encoding than
>> to use an existing one.

I considered very carefully the potential costs and benefits of
using TEI-Lite, and even full TEI, as the mark-up scheme that we
would use in our new generation search engine, known as PhiloLogic,
before deciding on a much simpler encoding, which we are calling ATE
(ARTFL Text Encoding). ATE is essentially Dublin Core Metadata,
basic HTML, and some optional extensions.
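The ATE specification itself is not reproduced here, but the general shape
described above, Dublin Core metadata carried in ordinary HTML, might look
something like the following. This is purely illustrative (the titles,
attribute names, and body markup are my guesses, not ARTFL's actual scheme),
using the standard convention for embedding Dublin Core in HTML meta tags:

```html
<!-- Illustrative only: Dublin Core metadata embedded in basic HTML,
     in the spirit of ATE; not the actual ARTFL specification. -->
<html>
<head>
  <title>Candide</title>
  <meta name="DC.Title"   content="Candide, ou l'Optimisme">
  <meta name="DC.Creator" content="Voltaire">
  <meta name="DC.Date"    content="1759">
</head>
<body>
  <h1>Chapitre premier</h1>
  <p>Il y avait en Westphalie ...</p>
</body>
</html>
```

The appeal of such a scheme is exactly the point made above: a search engine
need only recognize a handful of elements, and any HTML-aware tool can
already display the text.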

My primary consideration was, and still is, where I might make the
best investment of development effort. I decided that any attempt
to build recognizers that would handle all of the possibilities of
TEI (or other SGML) DTDs would be a very significant development
effort all by itself. So, rather than build what would effectively
be a full SGML parser into the system, I decided that we would use
existing SGML parsers (such as Jim Clark's) to reduce all of the
variations to a small subset. SGML parsers are complex systems,
and we have found that one must handle a considerable amount of
variation from database to database, and even from text to text,
when dealing with TEI-encoded documents. Trying to build that
capability into a large-scale text search and navigation engine
would, I fear, be far beyond ARTFL's means.
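The normalization strategy described above, let an existing parser do the
hard work, then collapse the tag vocabulary to a small fixed subset, can be
sketched roughly as follows. This is a minimal illustration, not ARTFL's
actual code; the tag subset and the sample input are invented for the
example:

```python
# Sketch of reducing a richly tagged (TEI-like) document to a small
# tag subset: an existing parser handles the syntax, and tags outside
# the subset are unwrapped while their text content is preserved.
import xml.etree.ElementTree as ET

KEEP = {"div", "p", "head"}  # hypothetical target subset

def flatten(elem):
    """Serialize elem's content, retaining only tags in KEEP;
    all other tags are unwrapped, their text kept in place."""
    parts = []
    for child in elem:
        inner = flatten(child)
        if child.tag in KEEP:
            parts.append(f"<{child.tag}>{inner}</{child.tag}>")
        else:
            parts.append(inner)  # drop the tag, keep the content
        if child.tail:
            parts.append(child.tail)
    return (elem.text or "") + "".join(parts)

tei_like = ("<text><body><p>A <hi rend='italic'>richly</hi> tagged "
            "<foreign lang='fr'>texte</foreign>.</p></body></text>")
root = ET.fromstring(tei_like)
print(flatten(root))  # <p>A richly tagged texte.</p>
```

The point of the sketch is that the search engine downstream only ever
sees the small subset, so the full variability of the source encoding
never has to be handled inside the engine itself.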

It is thus not particularly surprising to me that the fully SGML-aware
systems that have appeared are very expensive, when they are sold to
the academic market at all, and tend not to be all that effective for
the kinds of functions I think are useful, because they are designed
more as corporate document management systems. I am
also not surprised that Michael has found a generally cool reception
amongst other developers in humanities computing. This stuff is really
hard and expensive to do and, so far, I have not seen the possibility of
radically extended functionality that would warrant the investment.

With all of the discussion of text *tagging*, little thought has been
given to developing systems that do much beyond rendering individual
documents in a browser. At a certain point, when writing a system,
every tag, every attribute, every variation has to be handled
or ignored. It is easy to develop very extensive tagsets that can
be demonstrated to parse using an SGML/XML parser/verifier. It is
MUCH, MUCH harder to develop systems that know what to do with each
and every tag/value/attribute/whatever. The burden and cost of doing
*SOMETHING* with all of the possibilities has been passed to the
developer. It's hard enough to do this as is, particularly within
the limited resources of most humanities computing outfits, so a cool
reaction to a specification that entails a lot more effort is to be
expected.

Now, the next point in the discussion is that if nobody in humanities
computing can afford to develop software to handle all of the
potential variations in richly encoded documents, then you gotta ask:
why encode that heavily? Michael suggests,

>> Of course, it's also possible that humanists simply care a lot less
>> about reusability, sharing of resources, and the electronic
>> preservation of the cultural heritage than was thought when the TEI
>> was created.
>> A cynic might observe that if the TEI's goal was to make it possible
>> to create reusable data with markup that allowed researchers to do the
>> work they were most interested in, then the TEI has already succeeded:
>> it is now in fact possible to do that. The fact that so few humanists
>> take advantage of the TEI should (the cynic might continue) be taken
>> not as an indictment of the TEI but of the humanists who work with
>> computers.
>> It's late, and I'm tired and worn down with problems in other
>> projects, but I resist the cynic's interpretation.

Tired as he might be, I hope that he does **NOT** take the cynic's
view. It is not nearly as stark a contrast as he might be inclined
to believe. The lofty and laudable goals of Michael's vision of the
TEI have to be balanced against the fact that we do not live in the
best of all possible worlds. Humanists who work with computers face
a very significant set of limitations and restrictions, ranging
from very tight budgets [if you are lucky enough to have a budget
at all] to the need to produce, in relatively short time frames,
either systems for users (my problem) or published research results
(the problem for many other scholars).

It is in trying to achieve this balance that I have declared myself to
be a very strong supporter of the intellectual goals of the TEI and,
at the same time, presented significant criticisms from a more practical
perspective. I would not, to be frank, waste my time criticizing the
TEI if I were not completely convinced of the very high value of the
goals of the endeavor and the impressive achievements already attained
by the TEI! I suspect that the goals of the TEI are probably going to
be achieved in much more limited and hard-fought steps than Michael
might think, but we certainly agree that

>> Perhaps the recent discussions about the need for a new generation
>> of software for text analysis will lead to an improvement of this
>> situation. Let us all hope so.

No, Michael. Let us all hack... ;-)


Mark Olsen
ARTFL Project
University of Chicago

Humanist Discussion Group
Information at <http://www.kcl.ac.uk/humanities/cch/humanist/>