9.534 Patrologia Latina implementation report

Humanist (mccarty@phoenix.Princeton.EDU)
Tue, 13 Feb 1996 19:16:42 -0500 (EST)

Humanist Discussion Group, Vol. 9, No. 534.
Center for Electronic Texts in the Humanities (Princeton/Rutgers)
Information at http://www.princeton.edu/~mccarty/humanist/

[1] From: John Price-Wilkin <jpwilkin@dns.hti.umich.edu> (61)
Subject: HTI implementation of PLD

In late 1995, the UM Humanities Text Initiative mounted the most
recent and now complete version of the Patrologia Latina Database.
The DTD and markup are heavily influenced by the people and processes
that went into making the TEI, and we thought a brief report on the
problems and successes would be useful, esp. for others implementing
the PLD.

o The entire collection parsed without problems. That may seem
trivial, but we get lots of SGML from publishers and others, and
it's a rare collection -- esp. at over 1Gb -- that parses without
a single problem.

o Several things might be useful for those who implement the PLD with PAT.
The normalized file is 1.08Gb. Indexes, at the "word" level (supporting
stem searching), are 650Mb. Using the DTD to build the regions
(with sgmlrgn50), the region index is 250Mb. In this most recent release
Chadwyck-Healey has added a new attribute (com) at various levels, an
attribute that carries the page number for any of the elements in which
it's found. So, for example, one might have <p com=0984> for a paragraph.
This was apparently introduced by C-H to make it easier to build a
multi-faceted reference mechanism on result displays. For system implementors
inclined to use the "sortorder occurhead" display control, this can be a
useful way to show a page number at the same time that you show volume
and title information. Performance on our system, under full load from
campus-wide use for most collections and Internet-wide use for others,
is good, with nearly all searches returning results in 1-5 seconds.
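
As a rough illustration of how the com attribute can feed a result display,
the sketch below finds the page number in force at a given hit. It is not
PAT syntax and not our production code: the function name page_for_hit, the
sample text, and the presence of com on the <div1> are assumptions made for
the example. Only the attribute name and the <p com=0984> form come from the
data described above.

```python
import re

# Matches a start tag carrying a com attribute and captures the element name
# and the attribute value. SGML allows unquoted values such as com=0984, so
# the quotes are optional.
COM_TAG = re.compile(r'<(\w+)\b[^>]*?\bcom\s*=\s*"?(\w+)"?[^>]*>', re.IGNORECASE)


def page_for_hit(text, hit_offset):
    """Return the com value of the last tag that starts at or before hit_offset.

    Returns None when no tag with a com attribute precedes the hit.
    """
    page = None
    for match in COM_TAG.finditer(text):
        if match.start() > hit_offset:
            break
        page = match.group(2)
    return page


# Made-up sample; only the shape of the markup follows the PLD.
sample = ('<div1 name="Epistola LVIII" com=0983>'
          '<p com=0984>Gratia vobis et pax.</p></div1>')

hit = sample.index("pax")
print(page_for_hit(sample, hit))
```

The page value found this way can then be shown next to the volume and title
information for the same hit.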

o Needless to say, the PLD presents several problems for both
markup and delivery. One of those problems is the size of the works
and the way that works can nest at varying depths within each other.
C-H's consistent use of groups, numbered divs, and author/title attributes
on all of these allows iterative searching, walking down a tree, so to speak.
For example,
<DOC NAME="Notitia historica" AUTHOR="Sammarthanus, Dionysius" [...]>
<DIV1 NAME="Epistola LVIII" AUTHOR="Ludovicus VII Franciae" [...]>
A user asks for X in works with Y in the title, and you
check the contents of each <doc>, <div1>, then <div2>, etc. with Y in
the title attribute, with fairly little overhead and great effect.
Additionally, results can be displayed in a context that not only shows
surrounding words, but also the work (and levels of work) containing
the result. Consequently, the user can pull a specific section
(e.g., "Caput VII") or the entire work. Similarly, for large works,
it becomes possible to pick one's way through the work, part by part
(see illustration below).
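
The tree walk described above can be sketched in a few lines. This is an
illustration only, not the PAT region queries we actually run: the Node
class, the search function, the second letter ("Epistola LIX"), and the
sample text are all invented for the example, and the nesting of the <DIV1>
inside the <DOC> is assumed. The NAME and AUTHOR values of the first two
elements are the ones quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    level: str                 # "doc", "div1", "div2", ...
    name: str                  # value of the NAME attribute
    author: str = ""           # value of the AUTHOR attribute
    text: str = ""             # text held directly by this element
    children: list = field(default_factory=list)


def full_text(node):
    """Text of an element together with the text of everything nested in it."""
    return " ".join([node.text] + [full_text(child) for child in node.children])


def search(node, title_term, word, path=()):
    """Yield the chain of containing levels for each matching element.

    An element matches when its NAME contains title_term and its content,
    including nested elements, contains word. Matching is case-insensitive.
    """
    here = path + (f"{node.level}: {node.name}",)
    if (title_term.lower() in node.name.lower()
            and word.lower() in full_text(node).lower()):
        yield here
    for child in node.children:
        yield from search(child, title_term, word, here)


doc = Node("doc", "Notitia historica", "Sammarthanus, Dionysius", children=[
    Node("div1", "Epistola LVIII", "Ludovicus VII Franciae",
         text="rex Francorum salutem"),
    Node("div1", "Epistola LIX", "Ludovicus VII Franciae",
         text="gratia et pax"),
])

for hit in search(doc, "Epistola", "pax"):
    print(" > ".join(hit))
```

Because each hit carries its chain of containing levels, the same structure
can drive both the context display and a request for one section or the
whole work.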

o We took an admittedly approach to the Greek, but one that has
been well-received by users. Because of the cross-section of PLD users,
we're much less likely to find a capable Greek font (or the same one)
on the desktops of all the users. Consequently, we used James Tauber's
GreekGIFs, a set of elegantly constructed Greek characters as GIFs with
transparent backgrounds. Tauber based his character set on the Unicode
Greek characters, and it represents a nearly complete set of characters.
We are working with him to identify the characters in the PLD but
not in his distribution. The package is available at:
http://www.entmp.org/GreekGIF/
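
For other implementors, the substitution of GIFs for Greek characters might
look roughly like the sketch below. Several things here are assumptions and
not part of Tauber's package or our code: the Greek is taken to be available
as Unicode characters, the directory /greekgif/ is hypothetical, and so is
the file naming by hexadecimal code point. The names is_greek and
greek_to_img are ours for the example.

```python
import html

GIF_BASE = "/greekgif/"   # hypothetical location of the GIF files


def is_greek(ch):
    """True for characters in the Greek or Greek Extended Unicode blocks."""
    return "\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"


def greek_to_img(text):
    """Replace each Greek character with an <img> tag and escape the rest.

    The file name pattern (code point in hex plus .gif) is an assumption
    made for this sketch.
    """
    out = []
    for ch in text:
        if is_greek(ch):
            out.append(f'<img src="{GIF_BASE}{ord(ch):04x}.gif" alt="{ch}">')
        else:
            out.append(html.escape(ch))
    return "".join(out)


print(greek_to_img("id est \u03bb\u03cc\u03b3\u03bf\u03c2"))
```

Since the Latin text passes through unchanged, the user needs no Greek font
on the desktop.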

While access to the PLD is restricted to UM faculty, staff, and students,
the web-based support resources, such as the list of authors by volume
and (for other implementors) the editorial policy, are unrestricted.
These resources and search screens can be found at
http://www.hti.umich.edu/latin/pld/
A sample GIF of a result screen is at:
http://www.hti.umich.edu/latin/pld/pld-samp.gif
For more information on the HTI and access to publicly available
collections, please use
http://www.hti.umich.edu/

John Price-Wilkin