9.646 on TUSTEP

Humanist (mccarty@phoenix.Princeton.EDU)
Fri, 22 Mar 1996 19:26:04 -0500 (EST)

Humanist Discussion Group, Vol. 9, No. 646.
Center for Electronic Texts in the Humanities (Princeton/Rutgers)
Information at http://www.princeton.edu/~mccarty/humanist/

[1] From: "C. M. Sperberg-McQueen" (55)
<U35395%UICVM.BitNet@pucc.Princeton.EDU>
Subject: Re: 9.641 on TUSTEP

[2] From: Wilhelm Ott <Wilhelm.Ott@zdv.uni-tuebingen.de> (97)
Subject: Re: 9.641 on TUSTEP

--[1]------------------------------------------------------------------
Date: Thu, 21 Mar 96 13:55:01 CST
From: "C. M. Sperberg-McQueen" <U35395%UICVM.BitNet@pucc.Princeton.EDU>
Subject: Re: 9.641 on TUSTEP

On Wed, 20 Mar 1996 in Humanist 9.641, Chaim Milikowsky said:
>First of all, I'm not sure that TUSTEP is a practical answer to
>the first of Bob Kraft's "requests". ... I am sure
>that something like this could be done in TUSTEP -- my sense is
>that anything short of sending a man to the moon can be done in
>TUSTEP -- but not interactively. One would have to markup the
>texts beforehand and then run the program which would have to be
>written (TUSTEP is partially an interpreted higher-level computer
>language) to create the new output text. This of course defeats
>the entire sense of what Bob wants.

Does it? Maybe I misunderstood Bob's note, but I certainly thought,
when reading it, "What he wants is Tustep."

I am not myself anything like a Tustep expert (my university has just
acquired the package and I am just beginning to learn it), but I have
gotten, from experienced users of the program, the idea that what I
thought Bob wanted works quite well in Tustep, without special
preparatory text markup and without programming by the end user: just
compare two versions of the same text, and choose, at your own speed,
which version to take, for each variation. Tustep does this by
producing, as output from the comparison, a file of editor commands
which (if run) would transform one version of the text to be identical
to the other. It seems to me that editing this file, interactively, and
choosing, for each variation, which variant to accept (by deleting or
retaining the editor command which changes one variant into the other),
is identical to what Bob said he wants to do: "I've been looking for
"comparison" type software to ... permit me to choose interactively
what I want in the updated version ..." If it's not, then I am curious
to know what *is*.
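
To make the idea concrete, here is a minimal sketch in Python (not in
Tustep, whose actual command syntax is not reproduced here). It assumes
word-level comparison and uses the standard difflib module; the function
names edit_script and apply_script are hypothetical. The comparison
yields a list of edit commands, the user keeps or deletes individual
commands, and only the kept ones are applied to version A.

```python
import difflib

def edit_script(a_words, b_words):
    """Return the commands (tag, i1, i2, replacement) that turn A into B."""
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    return [(tag, i1, i2, b_words[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def apply_script(a_words, commands):
    """Apply the kept commands to a copy of A; A itself is left untouched."""
    out = list(a_words)
    # Work from the end of the text so that earlier positions stay valid.
    for tag, i1, i2, replacement in sorted(commands, key=lambda c: c[1], reverse=True):
        out[i1:i2] = replacement
    return out

version_a = "the quick brown fox jumps over the dog".split()
version_b = "the quick red fox leaps over the lazy dog".split()

script = edit_script(version_a, version_b)
# The "interactive" step: keep the first and third command, drop the second.
kept = [script[0], script[2]]
result = " ".join(apply_script(version_a, kept))
print(result)
```

Applying the full, unedited script reproduces version B; deleting a
command keeps the reading of version A at that place.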

Of course, if Bob meant he wanted software to show, after each decision,
what the output text looks like in full (e.g. so one can read the
currently chosen final text line by line), then I don't yet know how to
give him what he wants in Tustep, or any other production software.
(But in that case, the adverb "interactively" would seem to be modifying
the wrong verb. I think too much of Bob's English to accuse him of
that.)

I did write an interactive collation program once, as an exercise, but
it was regrettably limited to line-by-line comparison and was in any
case appallingly stupid. Most important, though, using it to collate
two scanned versions of the same text persuaded me beyond all argument
that text collation is better done in batch, not interactive, work. An
interactive interface for viewing the results might be very nice, and a
useful exercise if any software developers are lurking here looking for
good deeds to do. (But before you do that, I have some other ideas,
about SGML support and query languages ...) In the meantime, I think we
as a community would benefit a lot from knowing and using the tools that
do already exist for performing the kind of textual work that is a
humanist's bread and butter. Tustep was designed for humanistic work,
and deserves to be much more widely known and used.

-C. M. Sperberg-McQueen
ACH / ACL / ALLC Text Encoding Initiative
University of Illinois at Chicago
u35395@uicvm.uic.edu / u35395@uicvm

All opinions expressed in this note (except those I have quoted with a
view to refuting them) are mine. They are not necessarily those of the
Text Encoding Initiative, its executive committee or other participants,
its sponsors, or its funders. Anyone who says otherwise is wrong.

--[2]------------------------------------------------------------------
Date: Fri, 22 Mar 1996 11:19:29 +0100 (MEZ)
From: Wilhelm Ott <Wilhelm.Ott@zdv.uni-tuebingen.de>
Subject: Re: 9.641 on TUSTEP

Dear Prof. Milikowsky,

thank you for your commentary on TUSTEP. I will try to answer in this forum
as briefly as possible; the discussion of details should perhaps be shifted
to tustep-liste (to subscribe, send a mail saying "subscribe tustep-liste"
to majordomo@germanistik.uni-wuerzburg.de).

You are right when calling TUSTEP "partially an interpreted higher-level
computer language" with "basic algorithms ... deeply geared to a batch
mode of operating" and when doubting "that any such set of programs
could become popular in this day and age".

TUSTEP's goal is not to replace "popular" word processing or DTP programs,
but to provide tools for algorithmic solutions to scholarly problems
which cannot easily be solved with tools made for other purposes.

The "philosophy" governing our "batch" approach (dating from the mainframe
era) has some advantages which we do not plan to abandon only because inter-
action and GUIs are popular. One of these advantages has to do with safety:
most TUSTEP programs read data from a source file, modify them according to
rules defined by user-provided parameters or according to instructions
contained in a third file, and write the resulting text to a destination
file. So, the source data remain untouched, and the rules or instructions
are automatically and completely documented and may be refined as often
as required to obtain the desired result from the intact input data.
A further aspect has to do with speed and with handling large quantities
of textual data: a typical TUSTEP user works not on a few pages of text,
but on whole books (the largest single text file which TUSTEP can handle
is 7 GByte = 7000 MByte).
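
The source / rules / destination pattern described above can be sketched
as follows. This is an illustration in Python under stated assumptions,
not TUSTEP itself: the function name batch_correct and the "old=new"
rule format are invented for the example.

```python
from pathlib import Path
import tempfile

def batch_correct(source: Path, rules: Path, destination: Path) -> None:
    """Read the source, apply 'old=new' rules from a third file, write the result."""
    text = source.read_text(encoding="utf-8")
    for line in rules.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines in the rules file
        old, new = line.split("=", 1)
        text = text.replace(old, new)
    # Only the destination file is written; the source is never modified.
    destination.write_text(text, encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    src, rul, dst = (Path(tmp) / name for name in ("source.txt", "rules.txt", "dest.txt"))
    src.write_text("teh cat sat on teh mat\n", encoding="utf-8")
    rul.write_text("teh=the\n", encoding="utf-8")
    batch_correct(src, rul, dst)
    source_after = src.read_text(encoding="utf-8")
    dest_text = dst.read_text(encoding="utf-8")
```

Because the rules live in their own file, they document what was done
and can be refined and re-run against the intact source.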

It is however not true that "there is absolutely no interactivity in
TUSTEP at all". TUSTEP contains a powerful editor which (except for special
cases) can be run only in interactive mode. Beyond instructions for
entering, correcting and searching texts, this editor allows one to define
and use powerful macros (and to save them for subsequent sessions) and
thus to perform complex interactive operations, which may also involve more
than one text file. The editor itself may be embedded in a TUSTEP script
using command-level macros, which also allow for interaction with the user.

When you speak of lacking interactivity in TUSTEP, you probably have in
mind its formatting and typesetting routines (a "partitur edition" is also
the result of a formatting process). There, TUSTEP relies on markup (which
is necessary for text analysis anyway, and is therefore normally generic
markup that is transformed into - or interpreted as - typographic markup
only when the formatting/typesetting process starts).

But let us get back to the two points you made regarding collation.

I am not sure if my answer really "defeats the entire sense of what Bob
wants". What if the required markup is done automatically, on the basis
of an interactive selection between the variants to "remain" or to be
"lost" in the final output version?

As you know from your work with Gottfried Reeg, TUSTEP records the
"differences" (variants) in a file, indicating as a minimum for each
variant the exact location of the "lemma" in Version A, the type of
variation (omission, replacement, addition), and the wording of the
variant reading, e.g.:

1.2,3-4[a phrase]=replacement for two words

This means: on page 1, line 2, for words 3-4, the text of version B has
the phrase "replacement for two words". This "description", when submitted
to the batch correction program, will be interpreted as an instruction to
replace words 3-4 in line 2 by the phrase "replacement for two words".
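
As an illustration only, the example record can be parsed and applied
like this in Python. The sketch assumes the layout
page.line,first-last[lemma]=reading seen in the one example above and
handles replacements only (the notation for omissions and additions is
not shown here, so it is not guessed at). The names parse_record and
apply_replacement, and the dictionary used to hold the text, are
hypothetical.

```python
import re

# page.line,first[-last][lemma]=reading, as in the single example record
RECORD = re.compile(r"^(\d+)\.(\d+),(\d+)(?:-(\d+))?\[(.*?)\]=(.*)$")

def parse_record(record):
    """Split a replacement record into its location, lemma and reading."""
    m = RECORD.match(record)
    if m is None:
        raise ValueError(f"unrecognised record: {record!r}")
    page, line, first, last, lemma, reading = m.groups()
    first = int(first)
    last = int(last) if last else first
    return {"page": int(page), "line": int(line),
            "first": first, "last": last,
            "lemma": lemma, "reading": reading}

def apply_replacement(text, rec):
    """Return a new text with words first..last (1-based) replaced.

    `text` maps (page, line) to the words of that line as one string.
    """
    key = (rec["page"], rec["line"])
    words = text[key].split()
    words[rec["first"] - 1:rec["last"]] = rec["reading"].split()
    updated = dict(text)  # leave the base text untouched
    updated[key] = " ".join(words)
    return updated

version_a = {(1, 1): "first line of text",
             (1, 2): "here is a phrase to change"}
rec = parse_record("1.2,3-4[a phrase]=replacement for two words")
version_new = apply_replacement(version_a, rec)
print(version_new[(1, 2)])
```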

Therefore, it is easy to solve Bob Kraft's task as you describe it: you
call the editor and split the screen so that the upper window shows the
text and the lower window the "variants" file, where you interactively
decide (by adding a mark to the respective entries) which variants are not
to be taken over into the final output (if you prefer, you simply delete
the respective entries; it is however safer to mark them and to copy only
the unmarked ones to the "definitive" variant file). Then you
add, by a short TUSTEP script, the markup necessary (if any) for identifying
these variants as variants from version B and run the batch correction
program which inserts them - together with the markup - into the "base
text" of version A. (If you have more than two versions, you could repeat
the procedure for every further version, or you could cumulate the variants
from different files into a single one and sort them before inspection).
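
The marking, copying and cumulating steps can be sketched in Python as
follows. The leading "*" as rejection mark, the sigla "B" and "C", the
sample records and the function names are all assumptions made for the
illustration; they are not TUSTEP conventions.

```python
import re

LOCATION = re.compile(r"^(\d+)\.(\d+),(\d+)")

def location(record):
    """Return (page, line, first word) of a variant record as integers."""
    m = LOCATION.match(record)
    if m is None:
        raise ValueError(f"no location in record: {record!r}")
    return tuple(int(g) for g in m.groups())

def definitive(records, mark="*"):
    """Copy only the entries that were NOT marked as rejected."""
    return [r for r in records if not r.startswith(mark)]

def cumulate(variant_files):
    """Merge several variant lists and sort them by location in version A.

    `variant_files` maps a version siglum to its list of records; each
    entry of the result carries the siglum so its origin stays visible.
    """
    merged = [(location(r), siglum, r)
              for siglum, records in variant_files.items()
              for r in records]
    return sorted(merged)

variants_b = ["1.2,3-4[a phrase]=replacement for two words",
              "*1.5,1[teh]=the",          # marked: not to be taken over
              "2.1,7[colour]=color"]
variants_c = ["1.3,2[was]=is"]

kept_b = definitive(variants_b)
merged = cumulate({"B": kept_b, "C": definitive(variants_c)})
for loc, siglum, record in merged:
    print(siglum, record)
```

Marking rather than deleting keeps the original variants file complete,
so a decision can be revisited later.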

Now the "more important ... matter", the (automatic) alignment of "partitur
texts" in places where the "base text" is lacking: It is obvious that a
program written specially for the purpose of aligning "any given (up to 15)
input versions of that text" can more easily provide satisfying results than
running the TUSTEP programs COMPARE (which collates every version to a
common base text) and COLLATE (which aligns the variants contained in the
other versions under the "lemmata" of the base text, but fails to mutually
align the words of the other versions where the main text is lacking).

The solution we chose when designing TUSTEP was certainly influenced
by the fact that the first text-critical problem I was confronted with
as a programmer was a text where "any number" meant 5000 (five thousand)
manuscripts (the Greek New Testament; cf. my report in "The Computer and
Literary Studies", ed. A.J. Aitken et al., Edinburgh: University Press 1973).
It would not be possible to compare "all input texts with all other input
texts" in cases like this. A "partitur edition" would not make much
sense either. We therefore chose a solution which certainly has some drawbacks,
but which is open in principle: cumulation of the differences found by
collating all versions to a common basis.
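
A quick count shows why. The Python sketch below (function names are
hypothetical) compares the number of collation runs needed when every
version is collated against one common base with the number needed when
every version is compared with every other.

```python
from math import comb

def collations_against_base(n_versions):
    """One run per version other than the base text."""
    return n_versions - 1

def collations_all_pairs(n_versions):
    """One run per unordered pair of versions."""
    return comb(n_versions, 2)

for n in (15, 5000):
    print(n, collations_against_base(n), collations_all_pairs(n))
```

For 15 versions this is 14 runs against 105; for 5000 manuscripts it is
4999 runs against 12,497,500.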

But even with the existing version of TUSTEP, the solution to this problem
is not impossible: If none of the extant versions has the complete text, why
not use an "artificial" text as the collation basis, in which the "lacunae"
are filled with the missing words which may be taken over (perhaps semi-
automatically, after a first collation run) from a more complete version and
individually marked as such? When a "partitur text" is essential, then
this additional step could be worth the relatively small effort.
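
A minimal sketch of such an "artificial" base text, in Python and under
strong simplifying assumptions: both versions are already aligned word
by word, a lacuna is represented by None, and filled-in words are marked
with angle brackets. The function name fill_lacunae and the marking
convention are hypothetical.

```python
def fill_lacunae(base, fuller, mark="<{}>"):
    """Fill gaps (None) in `base` with marked words from `fuller`.

    Both lists must already be aligned position by position.
    """
    if len(base) != len(fuller):
        raise ValueError("versions must be aligned to the same length")
    filled = []
    for own, other in zip(base, fuller):
        if own is not None:
            filled.append(own)                 # base text has the word
        elif other is not None:
            filled.append(mark.format(other))  # borrowed and marked as such
        else:
            filled.append(None)                # lacking in both versions
    return filled

base = ["in", "the", None, None, "was", "the", "word"]
fuller = ["in", "the", "beginning", "there", "was", "a", "word"]
artificial = fill_lacunae(base, fuller)
print(artificial)
```

The marks make it possible to tell the borrowed words apart from the
genuine base text in any later collation run.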

----------------------------------------------------------------------
Prof. Dr. Wilhelm Ott phone: +49-7071-292933
Universitaet Tuebingen fax: +49-7071-295912
Zentrum fuer Datenverarbeitung e-mail: ott@zdv.uni-tuebingen.de
Brunnenstrasse 27
D-72074 Tuebingen