8.0186 The Lays of Ancient ROM (1/162)

Wed, 14 Sep 1994 14:10:28 EDT

Humanist Discussion Group, Vol. 8, No. 0186. Wednesday, 14 Sep 1994.

Date: Wed, 14 Sep 1994 18:47:53 +0100
From: Lou Burnard <lou@vax.ox.ac.uk>
Subject: Text of Article published in Economist, aug 27th 94

[ There follows the text of an article published in the British Weekly,
The Economist, issue dated 27 August 1994, which I think many on this
list will find interesting and encouraging LB ]

The Lays of Ancient ROM

Databases are transforming scholarship in the most conservative corners
of the academy, forcing technological choices even on to the humanities

IN 1987 a budding classicist from the University of Lausanne finished
four years of labour. She had spent them scouring ancient Greek tomes
searching for the classical sources of 2,000 anonymous fragments of
medieval text. Then, just when she was getting down to writing her
doctoral dissertation, all that effort was eclipsed. In a few dozen
hours working with a new database she found every one of her 600
hard-won sources again -- and 300 more that had passed her by.

That database, the Thesaurus Linguae Graecae, was the first of the tools
that are transforming the staid world of what used to be bookish
learning. When computers were mere calculating machines, only the
sciences had need of them. Now that they can be easily used to scan vast
memories at inhuman speeds, the humanities have every reason to catch
up. Whole libraries are vanishing into the digital domain, where their
contents can be analysed exhaustively. The changes in the practice of
scholarship may be greater than any since Gutenberg.

The process seems now to have been inevitable, but even the inevitable
has to start somewhere. In the 1970s a group of classicists at the
University of California, Irvine, thought up a then extraordinary goal:
having every extant word of ancient Greek literature in a single
database; 3,000 authors, 66m words, all searchable, accessible and
printable. With the help of nearby computer companies, this idea became
the Thesaurus Linguae Graecae. There are now 1,400 places around the
world where a classicist can use it to do a lifetime's worth of scanning
for allusions or collecting references just for a single essay. On
compact disc, the whole thesaurus costs about $300.
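The kind of search described here -- scanning a whole corpus for every occurrence of a word or phrase, with surrounding context -- is essentially a concordance. A minimal sketch in Python follows; the corpus is a hypothetical in-memory sample, standing in for the thousands of text files a real archive would hold.

```python
# A minimal concordance search of the sort the Thesaurus Linguae Graecae
# made routine: scan each work for a phrase and collect every hit with a
# little surrounding context. The two-work "corpus" below is illustrative.

def concordance(corpus, phrase, context=30):
    """Return (work, snippet) pairs for every occurrence of phrase."""
    hits = []
    for work, text in corpus.items():
        start = 0
        while True:
            i = text.find(phrase, start)
            if i == -1:
                break
            lo = max(0, i - context)
            hi = i + len(phrase) + context
            hits.append((work, text[lo:hi]))
            start = i + 1  # continue scanning after this hit
    return hits

corpus = {
    "Iliad (opening)": "Sing, goddess, the wrath of Achilles",
    "Odyssey (opening)": "Tell me, Muse, of the man of many ways",
}
for work, snippet in concordance(corpus, "of"):
    print(work, "->", snippet)
```

A lifetime's worth of page-turning reduces to a single pass over the text, which is why a few dozen hours at a terminal could outdo four years in the library.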

Scholars using the growing electronic-text archives at such places as
Oxford University, the University of Virginia and Rutgers University,
New Jersey, have more than the classics to play with. There are at least
five different competing software bibles, some with parallel texts that
provide Greek and Hebrew, and several with full concordances and
indexing. Shakespeare's words have been done as well as God's; indeed,
the entire canon of British poetry prior to this century is now
digitised. So is Aquinas. Wittgenstein's unpublished fragments -- 20,000
pages of them -- are expected soon; so are 221 volumes of Migne's
Patrologia Latina, a collection of theological writings that date from
between 200 AD and the Council of Florence in 1439.

Some of this is the work of governments and charities: half the $7m
needed for the Thesaurus Linguae Graecae came from America's National
Endowment for the Humanities, the other half from foundations and
private patrons. Some of it is done for profit, just like traditional
publishing. The English Poetry Full-Text Database (EPFTD), released in
June on four compact discs by Chadwyck Healey, a company in Cambridge,
England, costs £30,000 ($46,500). It took four years to assemble from
roughly 4,500 volumes of verse; it is easy to use, and is poised to
become an indispensable research tool. Chadwyck Healey says it has
sold more than 100 copies. The company is now working on an index to the
entire runs of more than 2,000 scholarly journals.

Typing in every word of a nation's literary heritage is a time-consuming
and expensive task, even when the work is exported to take advantage of
cheap labour in Asia, as it almost always is. Another approach is to
record the appearance of books and other writings, rather than their
contents, by scanning in images of them. The Archive of the Indies in
Seville has used IBM scanners and software to put near-perfect
facsimiles of the letters of Columbus, Cortes and their contemporaries on
to the screen. Years of scanning by a full-time staff of 30 people have
put more than 10m handwritten pages -- one-seventh of the total -- into
the archive's memory banks.

The computers store the pages as images, not text, so they cannot be
searched and compared in the way that the EPFTD can. They offer scholars
other compensations, though. Scanners like those originally designed for
medical imaging provide extremely detailed and subtle digitisation. This
can then be fed through image-enhancement software, so ancient smudges
and ink-spills can be filtered out. And since the users cannot damage
the copies as they might the originals, humble students can have access
to documents previously available only to a handful of elite scholars.
The Seville project is proving so successful that IBM and El Corte
Ingles, a big Spanish retailer, have founded a company to market the
techniques used. Half-a-dozen ventures are already under way,
including a proposal to digitise the gargantuan (and recently-opened)
Comintern archives in Moscow.

Logos and log-ons

It is possible to combine the image of a page with searchable electronic
text, simply by having both stored in the same system with
cross-references. A Chaucer archive that offers multiple manuscripts
and searchable texts is being released one Canterbury tale at a time
("The Wife of Bath" comes first). Of course, putting both together costs
even more than a straight text database, and those can be pretty
expensive. The EPFTD works out at a quite reasonable $10-or-so per
volume -- but that still makes it a pricey proposition when bought, as
it must be, all at once. The cost of such commercially compiled
databases worries some scholars, not to mention librarians. It is not
their only worry.

The difference between a printed page and the text it contains is not
just one of aesthetics; there can be meaning in the way typefaces are
chosen, in how pages are laid out, in the indentations before lines and
the gaps between them. There are data on the title page that apply to
the whole text. A good database has to encode all this information
somehow, and has to offer ways in which it can be used in searches.

That is why databases have "mark-up languages", which allow the text and
the spaces within it to be tagged with particular meanings. Mark-up
languages tell the computer, for example, that a title is a title and a
footnote is a footnote; the computer can then display them as such, with
typefaces to taste, and the interested user can search the text for
titles and footnotes. The more complex the search, the more extensive
the mark-up required. The mark-up for the EPFTD allows the computer to
identify things like stanzas, verses, dates and names.
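What such tagging looks like in practice can be shown with a short SGML-style fragment; the element names below are illustrative only, not the EPFTD's actual tag set.

```sgml
<poem>
  <title>Ozymandias</title>
  <author>Percy Bysshe Shelley</author>
  <date>1818</date>
  <stanza>
    <line>I met a traveller from an antique land</line>
    <line>Who said: Two vast and trunkless legs of stone</line>
  </stanza>
  <note place="foot">First published in The Examiner.</note>
</poem>
```

Because the title, date and footnote are explicitly labelled rather than merely typeset, a search can be restricted to titles, or to poems from a given decade, without ever confusing them with the verse itself.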

In a perfect world mark-ups would be neutral and descriptive, capable of
applying equally well to almost any texts. In practice individual
mark-up languages have sprung up like mushrooms. There is now a move to
concentrate on using the Standard Generalised Mark-up Language (SGML) to
define the codes that tag text. It was developed at IBM for lawyers, and
adopted by the Pentagon for its mountains of manuals. At present SGML is
probably the most widely used mark-up language; officially, it is an
international standard. But it is not necessarily ideal for academics,
who are aware that the way a text is marked up will have far-reaching
implications for the kind of research that is possible. Marking up is an
invisible act of interpretation; the scholars want the interpreting left
to them.

That is why so much effort has gone into a specific way of using SGML
for prose, verse, drama and other forms of text that are pored over by
scholars. The Text Encoding Initiative is the sort of huge multinational
research effort that nuclear physicists are used to but that scholars in
the humanities can still be shocked by. After six years of work,
supported financially by the American government, the European Union and
others, the TEI published its guidelines in May -- all 1,300 pages of
them. More than 100 TEI scholars have had to decide everything from how
poetry should be distinguished from prose to whether footnotes to
footnotes are admissible in conforming texts. Their peers seem happy
with the work.

Standardised formats will enable electronic texts to move on-line. That
will make them available from any computer hooked up to a telephone
line, not just from a dedicated terminal devoted to a single database
and nothing else. That is good for the far-flung; the University of
Dubrovnik, its library destroyed, has just been given a networked
computer terminal that puts it on-line to a host of foreign databases.
It is also good for the independent researcher. Texts will be freed from
academia's grip, just as books before them were freed from the church
and the wealthy by printing.

More research; different research, too. Speculative hypotheses about
influence or style will be rigorously testable by textual comparisons as
cheap and plentiful as the numerical calculations in a computer model of
the weather. Critics still raise the spectre of great literature passing
under the die-stamp of conformity, but some degree of conformity may be
a price of new forms of access. The first die-stamp for literature was a
printing press. The passing of the illuminated manuscript made the world
a slightly poorer place; the coming of print made it a far, far richer one.