12.0342 SGML awareness

Humanist Discussion Group (humanist@kcl.ac.uk)
Fri, 15 Jan 1999 18:54:06 +0000 (BST)

Humanist Discussion Group, Vol. 12, No. 342.
Centre for Computing in the Humanities, King's College London

Date: Mon, 11 Jan 1999 13:20:27 +0000
From: Wendell Piez <wapiez@mulberrytech.com>
Subject: Re: 12.0328 SGML-aware? WWW manipulations? Kazar?


Since I have shifted my efforts from an academic venue into the
commercial world (I now work for a tiny, but effective,
vendor-independent SGML/XML consultancy), I may be in a position to
address Francois' question. (Warning: it's a bit long, four screens or

> There are larger issues here about ownership, stewardship and fair
> remuneration for the building of infrastructures. There is also an
> historical question: what has happened since the promising crest of
> 1996? Or was that crest mere illusion? Has the advent of XML slowed
> development and deployment of sgml-aware software?

Well, not really. In many respects, the Panorama browser (that Francois
mentions) was an anomaly, a piece of "bait-ware" released by a company
then led by a visionary, Yuri Rubinsky, who had his eye on the big
picture and the long term and was willing to take business risks to
achieve greater goals than business (though business too, no doubt).
Unfortunately, Rubinsky died in 1996, and took much (thankfully not all)
of his vision with him. (I never got the chance to meet him.)

But really opening the web to generic encoding, such as that represented
by the TEI, proved to require quite a bit of tightening of the SGML
specification before it could be feasible. Even in its successes,
Panorama projects demonstrated this. (Semi-proprietary style
specification language; associated limits to presentation semantics; no
capability to link from a web page _into_ an SGML file or perform
queries on SGML files via CGI-style URLs and thus embed queries in
documents; a few others -- these are walls we bumped up against.)
Without a trim, tight set of standards, which SGML and its family
(HyTime/DSSSL) are not, any efforts the programmers or implementors made
could never achieve web-like ubiquity. One analogy I've heard from
people then involved was that trying to pass rich SGML across the web
turned out to be like trying to get a snake through a garden hose -- it
could sometimes be done, but it wasn't pretty.

Far from slowing things down, XML was conceived explicitly to address
these issues. And it was motivated, in part, through the direct
involvement of a handful of key TEI people -- blessings eternally upon
you! you know who you are -- as well as other friends of the Humanities.
If it weren't for XML, all of us fans of descriptive encoding would be
largely dead in the water, playing with proprietary implementations, or
doing up fancy hacks to HTML/CSS/CGI (with which you can do a lot -- but
it doesn't scale or port, and it's hell to maintain), in order to build
generic markup systems. So the short answer to Francois is: thank
goodness for XML: if they didn't have it, SGML users would have to
invent it to make possible the software we humanists want. And they did.

But XML isn't there yet, and it remains to be seen whether this effort
will succeed in altering the landscape so much that vendors will think
it worthwhile to try and reach the poor cousin academic humanist. Much
has still to be decided, including in particular the XML-related
specifications for a style-sheet language (XSL) and a hypertext linking
specification (XLink/XPointer), both of which are crucially important,
and still in draft. Yet there's already a great deal more interest and
competition in generic markup at the low end than has ever been the case
with SGML, and many, many more small and independent developers directly
engaged (whose budgets look more like academics' than they do like those
of corporate shops).

So I'd suggest to Francois and others in his situation that they look
to XML not as a usurper, but as an opportunity. It is surprisingly close
to SGML, close enough so the "lay user" looking at the data would
probably never know the difference. Getting your TEI-conformant scholarly
publishing project into XML will mean a few tweaks, and probably
backwards-engineering an XML DTD for it. (Expressing the Whole TEI in
XML is no walk in the park, for technical reasons stemming from TEI's
requirement for inclusiveness. But your data set does not include
everything in the TEI universe.) But it can be done: XML is an SGML
subset. Then you will be able to use XML tools on your data as well as
what you've already got.
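
To make the "few tweaks" concrete, here is a minimal sketch in Python,
using only the standard library's xml.etree.ElementTree. The fragment
is a hypothetical, TEI-like sample invented for illustration (it is not
validated against any TEI DTD), and the function name summarize is
likewise an assumption of mine. What it shows is the XML side of the
bargain: every element explicitly closed, attribute values quoted, and
empty elements written in the <pb/> form, after which an off-the-shelf
XML parser can read the data.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment, invented for illustration only.
# Unlike SGML with tag minimization, XML requires explicit end tags,
# quoted attribute values, and the <pb/> form for empty elements.
SAMPLE = """<text>
  <body>
    <div type="chapter" n="1">
      <head>Chapter One</head>
      <pb n="3"/>
      <p>First paragraph.</p>
      <p>Second paragraph, with a <name>Rubinsky</name> in it.</p>
    </div>
  </body>
</text>"""


def summarize(xml_source):
    """Return (headings, paragraph texts, page-break numbers) from the source."""
    root = ET.fromstring(xml_source)
    # Text content of every <head> element, in document order.
    heads = [h.text for h in root.iter("head")]
    # Full text of each <p>, including text inside nested elements.
    paragraphs = ["".join(p.itertext()) for p in root.iter("p")]
    # The n attribute of each empty <pb/> element.
    pages = [pb.get("n") for pb in root.iter("pb")]
    return heads, paragraphs, pages


if __name__ == "__main__":
    heads, paragraphs, pages = summarize(SAMPLE)
    print(heads)
    print(paragraphs)
    print(pages)
```

Markup that leans on SGML minimization, such as a <p> with its end tag
omitted, is rejected by the same parser with an ET.ParseError, which is
one quick way to find the spots in your data that still need tweaking.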

I'd also suggest we be more radical in imagining how we and our
audiences could get software to do what we need. On the functional side,
Java has taken off as a favorite platform for instruction in computer
science departments; there is no reason the same could not happen with
XML on the data-encoding side. Writing parsers and processors that
conform to the XML spec -- and making them building blocks that can be
readily repurposed for other applications -- is easier by an order of
magnitude than supporting its granddaddy SGML. Demonstration code is
already being made available -- by academics, independents and companies
-- often including source code, always on the net -- with which projects
can get started. Much of it is in Java, but not all: pick your platform
and language, there's already someone doing XML on it somewhere, and XML
was designed, in part, with the "DPH" (desperate Perl hacker) in mind,
with precisely the intention that college students, or anyone on a
shoestring, should be able to code to it. And the Open Source movement
will be a friend to generic encoding, precisely because it means the
user (not the software vendor) can own, control and support the data

But Rome wasn't built in a day -- and not by an outside contractor,
either (at least not till they started expanding :-). Why not free
academic projects altogether from exclusive dependency on commercial
software? Write your letter to the computer science department -- or
better, to your favorite free-thinking CS professor who wants to herd
cats and is looking for cool ideas. "I have a large, rich XML dataset
and a bunch of things I want to do -- and it's not database integration"
might just sound like catnip to her. Or find a rogue grad student who
wants to stay up till morning being passionate with a workstation....

> Francois
> "cohorts become a matter of ecology"

-- Wendell
"Everybody thinks of their ideas as if they were private property: but
they're not. They're currency." (Loosely translated from Heraclitus,
with a bow to Willard)

Wendell Piez mailto:wapiez@mulberrytech.com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
Mulberry Technologies: A Consultancy Specializing in SGML and XML

Humanist Discussion Group
Information at <http://www.kcl.ac.uk/humanities/cch/humanist/>