13.0015 TEI, gadfly and meta-gadfly

Humanist Discussion Group (humanist@kcl.ac.uk)
Fri, 14 May 1999 22:54:19 +0100 (BST)

Humanist Discussion Group, Vol. 13, No. 15.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

[1] From: John Bradley <john.bradley@kcl.ac.uk> (113)
Subject: Re: TEI & more from the gadfly

[2] From: Mark Olsen <mark@barkov.uchicago.edu> (112)
Subject: Re: 13.0004 TEI, gadflies, commentators

--[1]------------------------------------------------------------------
Date: Fri, 14 May 1999 22:52:55 +0100
From: John Bradley <john.bradley@kcl.ac.uk>
Subject: Re: TEI & more from the gadfly

As another developer in humanities computing (or, perhaps, more
accurately these days, a "developer want-to-be") I have resisted
joining into this delightful, witty, but rather agonistic discussion
-- believing for myself at least that such a discussion can produce
even more useful results when the agonistic component is not so
prominent, and not being sure that I can resist this temptation
either.

However, it seems to me that both Michael and Mark, as well as they
put their points, don't seem to be representing either the average
humanist or even the average developer (if there is such a thing!).
Instead, they seem to have not taken into account several of the
issues that apply to these groups in their discussion.

I believe that Mark's position is a bit unusual when compared to most
humanities computing practitioners. Like "server-type" developers in
his and other fields, he has the responsibility for deciding what
ARTFL will and will not do, and designing the software to do this
particular set of tasks. He has at least some (probably not enough,
knowing how things are) technical resources to make this happen, and
has developed one way or another a sophisticated technical
understanding to accomplish what he needs to do. It doesn't seem to
me that most humanists are in that position, and I don't think that
it is reasonable to expect members of a humanities computing
community to develop their computing skills and resources in the way
that Mark has done to meet their own, perhaps differing, research
agenda. The model of the end user also being the developer seems to
not apply to most of the potential community.

As a developer who, although very interested in humanities computing,
is not himself a humanist with a particular research task to work on,
I have found the problems of TEI even more perplexing than Mark does.
I have not been in the position to decide for my users, as Mark is (and
apparently Michael assumes), what elements of TEI I will choose to
process and what I will choose to ignore. I want my eventual users
to do that, and hopefully be therefore able to use the tools I create
for a range of different tasks and with text marked up in a range of
different ways. In order to do so I need to recognise and deal with,
and allow them to recognise and deal with too, the many different
abstract structures (some introduced by using SGML, but some specific
and apparently unique to TEI) that a TEI document can represent. I
cannot write software that decides to simply ignore one TEI thing or
another because it doesn't meet my own needs, or my view of what my
users need. Perhaps more precisely, I want my software to ignore as
little TEI markup as possible so that my users can make use of all
the potential richness of TEI to meet the needs of their problems and
of their texts.

I have built software (sgml2tdb) that takes SGML/TEI texts and
prepares a TACT database from them -- ignoring some aspects of an
SGML view of markup that simply didn't fit TACT's sometimes rather
limited view of the textual world. For years I resisted the pressure
from users to develop TACT tools to process SGML texts, because merely
recognising the materials in an SGML text, in all the generality that
the SGML standard required, was too expensive to develop. Then a
robust SGML parser (nsgmls) became available, which I adapted and
which provided an important boost to my own development efforts. The
task became feasible, at last. As Michael says, I found that
incorporating in your software a component that properly processes an
SGML sequence and announces to the rest of your software what is there
is no longer the hard part.
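
As a rough illustration of what "announces to the rest of your
software what is there" can look like, here is a minimal sketch in
Python. It is not nsgmls and not sgml2tdb: it assumes well-formed XML
as a stand-in for SGML, uses the expat parser from Python's standard
library, and the function name parse_events and the sample markup are
invented for the example.

```python
import xml.parsers.expat


def parse_events(xml_text):
    """Collect the events an event-based parser reports to the application.

    Assumption: well-formed XML stands in for SGML here; a real SGML
    parser would also have to handle DTD-driven features such as
    omitted tags.
    """
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # Ask expat to deliver contiguous character data in one piece.
    parser.buffer_text = True

    def on_start(name, attrs):
        events.append(("start", name, dict(attrs)))

    def on_end(name):
        events.append(("end", name))

    def on_text(data):
        # Skip whitespace-only runs so the event list stays readable.
        if data.strip():
            events.append(("text", data.strip()))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)
    return events


# Hypothetical sample markup, not taken from any real TEI text.
SAMPLE = '<p rend="italic">Hello <hi>world</hi></p>'

for event in parse_events(SAMPLE):
    print(event)
```

The application sees only a flat sequence of events. Deciding what
each element or attribute should mean for the task at hand remains
the application's job, which is where the difficulty discussed next
comes in.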

However, in order to satisfactorily process the TEI variations that I
found in the text from many different sources I was forced to develop
a relatively sophisticated language that the sgml2tdb user needed to
also understand. Why? Well, I needed a way of allowing the user to
say "this attribute, when appearing in this context should be
translated in this way into this TACT thing". My users had to
translate between the SGML way of saying things, and the objects that
my software worked with. If I were working on this program today I
would probably not develop my own language, but would try to apply
one of the many different emerging standards to do this task -- but I
think my point here is the same. Expecting the end user, often with
relatively few technical resources at her/his disposal, to master
something as complex as my relatively limited specification language,
let alone something as subtle as XSL, XQL, XQuery, etc., and apply
it to markup as potentially complex as a TEI document, is expecting a
lot. Expecting a 3rd party developer who is not an end user or a text
producer, to decide what constructions in an SGML or TEI document are
relevant to his goal and how they should be processed -- particularly
when the apparently same kind of "markup idea" can be expressed in
many different ways -- is also demanding.
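
To make concrete the kind of statement quoted above ("this attribute,
when appearing in this context should be translated in this way into
this TACT thing"), here is a minimal sketch in Python. It is not the
actual sgml2tdb specification language. The rule format, the category
names, the function apply_rules, and the element and attribute names
in the sample are all hypothetical, and well-formed XML again stands
in for SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: an attribute on a given element, when that element
# appears somewhere inside a given ancestor, is mapped to a named category.
# None of these names come from sgml2tdb or TACT; they only show the idea.
RULES = [
    {"element": "div", "attribute": "n", "inside": "body", "category": "chapter"},
    {"element": "sp", "attribute": "who", "inside": "div", "category": "speaker"},
]


def apply_rules(xml_text, rules=RULES):
    """Return (category, value) pairs for every attribute matched by a rule."""
    root = ET.fromstring(xml_text)
    results = []

    def walk(node, ancestors):
        for rule in rules:
            if (
                node.tag == rule["element"]
                and rule["attribute"] in node.attrib
                and rule["inside"] in ancestors
            ):
                results.append((rule["category"], node.attrib[rule["attribute"]]))
        # Descend in document order, recording the path taken so far.
        for child in node:
            walk(child, ancestors + [node.tag])

    walk(root, [])
    return results


# Hypothetical sample markup, loosely TEI-like.
SAMPLE = """<text><body>
<div n="1"><sp who="Hamlet"><l>To be</l></sp></div>
<div n="2"><sp who="Ophelia"><l>My lord</l></sp></div>
</body></text>"""

print(apply_rules(SAMPLE))
```

Even this toy version asks the user to know the element names, the
attribute names, and the contexts in which they occur in a given
text. A second text that encodes the same "markup idea" differently
needs a different set of rules.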

In short, it is the third party developer, being neither the encoder
nor the end user, who is presented with the most difficult task here:
trying to give the end user as much access as possible to the features
in a range of encoders' materials. Perhaps the numbers needed for 3rd
party developers to support even standard tools such as XSL, and to
apply them to particular application tasks, are what Tim Bray meant by
his "battalions of programmers".

The talk in the ELTA discussion list has, as yet, been rather
limited, but it seems there that users are looking not only for tools
to parse (in the SGML sense) TEI documents, or even to lay them out
for presentation, as Jade or Panorama are capable of doing. They
seem to be interested in combining these facilities with tools to
allow for sophisticated searching of both the text and the markup.
They are asking for tools not only to display the text in various
ways, like Panorama, but also to display the results of their searches
in ways other than just KWIC displays. They want (as one of them said
in the discussions) "a lot", and they want it for free, or very close to
free. I think that developments in both the XML world, and in the
software development world in general, are starting to come together
to make the development of truly open and flexible systems possible
and capable of being developed by a large number of relatively
independent developers. However, there is quite a bit of complex and
perhaps rather abstract groundwork that needs to be done first.

Now, in the interest of reducing the agonistic tone of this
discussion (including probably my part of it), I want to say that at
the end of it all I believe I understand and indeed sympathise with
the reasons why the TEI markup is as complex and rich as it is. It
is perhaps even a part of its "glory" (and this reflects on Michael
and Lou's work on it) that it is so, and so well covers the
expressive needs of so many tasks. However, it is now more useful to
also recognise that this DOES make it difficult to process --
particularly for the 3rd party developer -- and take more seriously
the problems that this richness introduces.

Best wishes to you all. ... john b
----------------------
John Bradley
john.bradley@kcl.ac.uk

--[2]------------------------------------------------------------------
Date: Fri, 14 May 1999 22:52:39 +0100
From: Mark Olsen <mark@barkov.uchicago.edu>
Subject: Re: 13.0004 TEI, gadflies, commentators

>> From: Stephen Ramsay <sjr3a@etext.lib.virginia.edu>
>>
>> I think it's safe to say that humanists (on this list, as well as more
>> generally) care very deeply about the preservation of the cultural
>> heritage. Still, there's no denying that humanities computing folk are
>> generally less enthusiastic about standards and open source efforts than
>> our less humanistically motivated colleagues.

Standards and open source development are two different things.
So, let's disengage them for a moment. I would deny that humanities
computing folk are not concerned about standards. I think the
critical assessment of one particular standard, the TEI, does not entail
that the critics don't want standards. It is absolutely clear that
standards are necessary for any serious progress. Standards
do, however, have to fit BOTH where we are, with all of the considerations
that entails, AND where we want to go.

>> Creating tools to deal with the entire TEI, to choose one example,
>> certainly involves an *enormous* amount of time and programming effort,
>> but I dare say that it is no more herculean a task than any of the open
>> source efforts currently under the GNU license banner. Would any TEI tool
>> demand more effort than what is required for creating and maintaining the
>> GNU C compiler? Or the Perl distribution? Or GNU Emacs? These projects
>> are monuments to collaborative effort and the "sharing of resources," and
>> they are largely undertaken by a volunteer army of people with the same
>> time and budgetary restrictions that we have. Indeed, it's these very
>> restrictions which make "Going it alone" unthinkable.

I could not agree more. We have been considering precisely this
model when thinking about what we should do with PhiloLogic:

Release Terms, Organization, Collaboration. Once
we have a system, I would like to begin releasing this
without charge to selected collaborating institutions.
XXXXXXXXXXXX has expressed interest, as have several
others. It is my opinion that ARTFL might play a
significant role in e-text scholarship [...] and might
facilitate such an activity by adopting the Netscape
approach of encouraging open development and
research with the Mozilla organization. We are a
UNIX shop, and depend heavily on academic/industry
collaboration that built UNIX, such as GNU. Mozilla is
another model to use. Comments? [ARTFL Internal Documentation]

As a heavy consumer of GNU goodies, I would like to suggest that
GNU is a good model of joint development. This model assumes that the
tools are relatively small and knitted together in a general infrastructure
like UNIX. Some years ago, there was a project called TSI (Text Software
Initiative) which was based on that model. It did not get very far, but
could be revived. I believe that this was also some of the thought
behind the ELTA initiative, but that too has not progressed too far, last
I checked. In general, this model is based on individuals or small
teams writing discrete programs within a general infrastructure.
The other model is Mozilla.org, where a single large system is released
open source and contributors add components directly to the system.
This is also a good model, which is not dependent on a particular
infrastructure, but carries with it a heavier cost of organization
and management. Netscape, I believe, has 10 fulltime programmers
working on Mozilla.org and is very well organized from its original
development environment. Both are proven and effective models with
very different operating principles and results.

All this to say that open source is desirable and should be considered
by developers in humanities computing. ARTFL is proceeding slowly in
this area because it requires very careful organization and workable
objectives. In practical terms, the move from the typical very small
humanities computing development team, which tends to work informally
together, to an open source environment requiring a much higher
degree of organization and coordination, such as more formal coding
and documentation practices, can be difficult, but certainly not
impossible.

It strikes me that while both models have limitations and costs,
development in humanities computing could benefit from visiting,
once again, some form of coordinated development. We might want to
"round up the usual suspects" -- you know who you are :-) -- and try
again at ACH-ALLC at UVa. Past failures to make something like this
work should not prevent us from at least trying again, since as
Steve notes, the alternative of going it alone is becoming increasingly
intractable.
>>
>> There are many important exceptions to this rule, but I still see a lot of
>> humanities software people operating under a fading model of intellectual
>> property: proprietary formats, hidden code, and restrictive licenses. I
>> know of at least one program developed by one of our number that attempts
>> to ensure, in its license, that the product and its author are properly
>> cited in scholarly work because of the "original algorithms" included in
>> the code. I understand that Linux also contains some original
>> algorithms--all of which are visible to anyone who wants to see them. Or
>> better, improve upon them.

Intellectual property is hardly fading away, in any field. Publishers,
software developers, and the entertainment industry are all working
hard to enforce and extend intellectual property rights. Intellectual
property is most commonly considered in terms of monetary reward, but
this is not the only way to think of intellectual property.

In the humanities, we write articles and books of which we surrender
copyright to publishers for little or no money, because we are paid
in other ways (salaries, academic credit, respect of peers). I would
hate to publicly announce my total royalties on a recent scholarly book,
but suffice it to say that I better keep my day job. :-) Open
source development is similarly dependent on surrender of intellectual
property rights for similar credits. Peer respect, if you already have
a salary or way to make money, is, for most humanities scholars and open
source hackers, a primary motivation. When we publish books and
articles, we hope others will read and cite them. The software that
is mentioned above is an attempt to get the "academic" credit that motivates
much research and development in the humanities. Oddly enough, I see
strong points of contact between humanities scholarship and open
source hacking, since both are motivated less by monetary reward
(assuming, of course, that the scholar and/or hacker has a means of
support) than love of the work itself and recognition of peers.
This is intellectual property, but not directly related to monetary
return.

Developers in humanities computing, given the general culture of the
humanities and the relatively limited resources at our disposal, may be
very well suited and receptive to some form of open source development.

Mark

Mark Olsen
ARTFL Project
University of Chicago
