3.753 notes on standards (283)

Willard McCarty (MCCARTY@vm.epas.utoronto.ca)
Wed, 15 Nov 89 21:12:57 EST

Humanist Discussion Group, Vol. 3, No. 753. Wednesday, 15 Nov 1989.

Date: Tue, 14 Nov 89 18:01:56 CST
From: "Michael S. Hart" <HART@UIUCVME>
Subject: A few notes on standards

Anders Thulin : Standards for text and graphics
Mark Veatch : Database Standards
Tim Bray (OED Project) : Suggested Database Standards
Robert Amsler : 'pure' text

------------------------------------------------------------------------

From: uunet!prosys!ath (Anders Thulin)
Subject: Re: Standards for text and graphics

ISO 8613:1989 (Office Document Architecture) contains some
'substandards' which might be of interest:

ISO 8613-5 : ODIF - Document Interchange Format
ISO 8613-7 : Raster graphics representation
ISO 8613-8 : Geometrical graphics representation

Of course, these are designed to be used in ODA documents (-7 and -8)
or to transfer ODA documents from one computer to another (-5). ODA
may or may not be useful to OBI, but I think that they could serve as
useful input on the problem of graphics standardization.

Anders Thulin, Programsystem AB, Teknikringen 2A, S-583 30 Linkoping, Sweden
ath@prosys.se {uunet,mcsun}!sunic!prosys!ath

--------------------

From: Mark Veatch <uunet!hplsla.hp.com!markv>
Subject: Database Standards

>From: uunet!mica.berkeley.edu!richardt
>Date: Fri, 10 Nov 89 16:29:12 PST
>
>Hmmm... I would suggest that text be the primary standard. However, in
>this case, standard means 'all articles can be output as text'. It sounds
>like many people are concentrating on the creation of a single consistent
>database. I suspect we will get farther faster if we instead try to build
>multiple databases (since there are quite a number in existence already).
>Thus, we can have databases that are heavily indexed and those that are
>lightly indexed... as long as they all have reasonable search tools,
>and can all put out text, we've made a *major* step forward. The next
[deleted...]
>
>Storage: Distributed tools are good; we need to collect similar information
>together in the same place to make searching easier, and we need to have
>tools which can deal with requesting info from multiple databases.
>
>Searching: This issue is completely up in the air. I suggest that it is
>up to the individual databases to incorporate their own search code so
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This makes a great deal of sense, IMHO, both for the reasons that
Richard mentions below in the paragraph on server cycles vs. client cycles,
and for the one important reason that such an approach effectively
'encapsulates' a database as a collection of data along with methods of
accessing that data. (The use of OOP buzzwords is because I'm thinking of
just such a model for distributed DBs.) This encapsulation makes the
accessing of remote, distributed, *different* databases generic. This
makes access very easy from the client's point of view. It also makes
moving data between different DBs, or "bundling" data from different DBs
much easier to accomplish.
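
A minimal sketch, in Python, of the encapsulation described above. The
interface name TextDatabase, the two database classes and search_all
are all hypothetical; nothing in this thread fixes an API. The point is
only that each database carries its own search code while the client
makes the same call to every one of them.

```python
from typing import Protocol


class TextDatabase(Protocol):
    """Hypothetical common interface: data plus its own access method."""

    def search(self, query: str) -> list[str]: ...


class LightlyIndexedDB:
    """No index: scans every article for the query as a substring."""

    def __init__(self, articles: list[str]) -> None:
        self.articles = articles

    def search(self, query: str) -> list[str]:
        q = query.lower()
        return [a for a in self.articles if q in a.lower()]


class HeavilyIndexedDB:
    """Builds a word index once, then answers single-word queries from it."""

    def __init__(self, articles: list[str]) -> None:
        self.articles = articles
        self.index: dict[str, set[int]] = {}
        for i, text in enumerate(articles):
            for word in text.lower().split():
                self.index.setdefault(word.strip(".,;:!?"), set()).add(i)

    def search(self, query: str) -> list[str]:
        hits = self.index.get(query.lower(), set())
        return [self.articles[i] for i in sorted(hits)]


def search_all(databases: list[TextDatabase], query: str) -> list[str]:
    """Client side: one generic call, whatever each database does inside."""
    results: list[str] = []
    for db in databases:
        results.extend(db.search(query))
    return results
```

The two classes deliberately differ inside (substring scan versus word
index); the client never needs to know which one it is talking to.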

>that they can be accessed remotely. (or had everyone been intending to
>get their own set of CD-Roms?).
>
[deleted...]
>As you can see, I'm placing a heavy emphasis on remote use and maximum
>use by people with relatively low computing power (I.E., server cycles
>are cheap, client cycles are expensive, and Network bandwidth is like
>doritos - 'Go ahead, use it, we'll make more'). This is in part because
>there are already some good tools for remote access (telnet, mail, ftp,
>automatic mail servers) and people are working on more.
>
>In other news, Holographic laser memory is now a reality, and may soon
>be commercially available. Soon of course == ~5 years.
>
>RichardT
>
>
--
-MarkV.
 
--------------------
From: Tim Bray <uunet!watsol.waterloo.edu!tbray>
Subject: Suggested Database Standards
 
 
Tim Bray at the New OED Project here; have enjoyed the discussions about
format standardization.  Some feedback:
 
>Hmmm... I would suggest that text be the primary standard.
 
There is no other sane choice, for reasons of portability.
 
>The next
>question of course, is what to use for graphic information?  I hesitate
>to suggest any of the present standards, although I suspect that
>at present PostScript may be the best choice, on the grounds that there
>are a very large number of installed PostScript devices *and conversion
>utilities*.
 
Yes, and another great virtue of PostScript is that any text
incorporated in a graphic can be stored *as text* and then is available
for searching in the normal course of affairs.  Furthermore, the conversion
utilities include excellent programs such as Adobe Illustrator for going
from scanned graphics to PostScript as well as the other way around.
 
The fellow who was arguing for a standard format like GIF or TIFF also
had some good points.  But those formats in general handle *text*
poorly.
 
>1. Storage of information.
>2. Searching of information.
>3. Display of information.
 
The first is (relatively) easy; put everything in a flat byte-stream
style with descriptive markup.  The second and third are both very hard
to do well, and the problem is that everybody who has any kind of a
solution wants to make lots of money selling it.  For example us: we
have a pretty good toolkit and are commercializing it with great
success.  Having each database include its own search capability isn't
very helpful because nobody can tolerate having to deal with N different
search systems.  I'd say just concentrate on getting #1 right, and punt
the other two.  That way, if people have nothing, they can at least use
grep.  If they have something like a NeXT, they can accomplish quite a
bit with the Digital Librarian software.  If they have some big-league
high-powered search/display software like we and some other people sell,
they can really make good use of the database.
 
I think OBI has to concentrate on making the stuff available in as
generic and flexible a fashion as possible, and leave the how-to-use
problem to the consumer.
 
>Tying ourselves to a mark-up language sounds like a bad idea;
>better to have the individual databases decide
 
I suggest that while markup *syntax* is unimportant, text which is
stored with descriptive as opposed to typographical markup will be much
much more useful for our purposes.  Descriptive markup doesn't mean you
have to go all the way to full SGML.  For example, if you're typing in
some text, it's much more useful in general to put in explicit markers
for the *structural* components than waste your time trying to mimic the
exact typography of the document.
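
To illustrate the difference, here is a small Python sketch. The tag
names (heading, para, footnote) and the angle-bracket syntax are
assumptions made for the example; as said above, the syntax itself is
unimportant. With descriptive markers, pulling out a structural
component is one mechanical step; with typography alone there is
nothing for a program to hold on to.

```python
import re

# Descriptive markup: hypothetical tags name what each piece *is*.
DESCRIPTIVE = """<heading>Notes on Standards</heading>
<para>Text should be the primary standard.</para>
<footnote>See the earlier discussion.</footnote>
"""

# Typographical mimicry: the heading is only centred and capitalised.
TYPOGRAPHICAL = """            NOTES ON STANDARDS

    Text should be the primary standard.
"""


def extract(tag: str, text: str) -> list[str]:
    """Return the contents of every <tag>...</tag> element in text."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    return re.findall(pattern, text, flags=re.DOTALL)
```

Calling extract("heading", DESCRIPTIVE) finds the heading directly,
while the same call on TYPOGRAPHICAL finds nothing, because the heading
there is marked only by its layout.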
 
*BUT*, most of the stuff that's going to be available is probably going
to be only available with typographical markup of some sort, so we'll
just have to live with that to a certain extent.
 
>(I.E., server cycles
>are cheap, client cycles are expensive, and Network bandwidth is like
>doritos - 'Go ahead, use it, we'll make more').
 
And memory/disk is real, real cheap.  Buying flexibility with somewhat
less compact markup is always a win.  No binary magic cookies of any
type in the text!
 
>Are there databases for nontextual records?
 
There aren't even good databases for *textual* material available
off-the-shelf at this time!
 
>It would be nice to have something which has cross-referencing (in the
>hypertext manner, for instance) inherent.  This capability could
>obviously be used for query capability as well (pretty easy to collect
>and index links).
 
Cross referencing is a Good Thing, but unfortunately I don't believe
there is a single problem here.  It seems that the cross-reference
problem is heavily tied up with the semantics and knowledge base of each
different document.  This is one of the reasons the hypertext people
have trouble dealing with large existing textbases.  On the other hand,
a good fast full-text search system does most of what you need.
 
 
>	I was trying to sell people (for a while) on the idea of
>developing a standard marked-up-text protocol (like SGML but without
>the COBOL goo) and using that to extend into a communications
>protocol, query language, and display driver protocol. I'm talking
>low-level, though. In other words, someone might make a query of a
>database server:
>
><QUERY>RELEVANCE TO "FOO" .GT. 90%</QUERY>
><QUERY>RELEVANCE TO "FOO" IN <H1> .GT. 90%</QUERY>
>
>(or something like that)
>And the server might send back an answer coded up in some simple
>mark-up that could then be displayed in a device-dependent manner
>on the user's display.
 
Absolutely.  One of the fundamental principles we've been applying with
great success here on the OED project is:
 
 1. All information, including software control files etc, must be
    stored as tagged text unless you can prove it's impossible, and
 2. All software modules must communicate with each other using streams
    of tagged text (exactly as described here) unless you can prove it's
    impossible.
 
The benefits, in terms of network independence and max flexibility, are
huge.
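
A rough Python sketch of rule 2, two modules talking only in tagged
text. Everything here is an assumption for illustration: the QUERY,
RESULT, HIT and ERROR tags, the function names, and the plain substring
match that stands in for a real relevance measure. It is not the OED
project's software.

```python
import re
from html import escape, unescape

# Stand-in document collection for the example.
DOCUMENTS = ["Standards for text", "Raster graphics", "Pure text & markup"]


def make_query(term: str) -> str:
    """Client module: wrap a search term in a tagged-text request."""
    return f"<QUERY>{escape(term)}</QUERY>"


def serve(request: str) -> str:
    """Server module: read a tagged-text request, reply in tagged text."""
    match = re.fullmatch(r"<QUERY>(.*)</QUERY>", request.strip(), flags=re.DOTALL)
    if match is None:
        return "<ERROR>malformed request</ERROR>"
    term = unescape(match.group(1)).lower()
    # Substring match is a placeholder for whatever search the server has.
    hits = [d for d in DOCUMENTS if term in d.lower()]
    body = "".join(f"<HIT>{escape(d)}</HIT>" for d in hits)
    return f"<RESULT>{body}</RESULT>"


def read_hits(reply: str) -> list[str]:
    """Display module: recover the hit list from the tagged-text reply."""
    found = re.findall(r"<HIT>(.*?)</HIT>", reply, flags=re.DOTALL)
    return [unescape(h) for h in found]
```

Because the request and the reply are both ordinary text, either side
can be replaced, logged, or moved across a network without the other
noticing.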
 
But then the same person goes on to say:
 
>...it is NOT going to work very well
>if the base form of everything is ASCII text. Some kind of higher-level
>language for representing structure/comments/format will be needed.
 
Wrong.  Yes, you want to represent the structure and so on, but the
right way to do it is with embedded descriptive markup in the text.
You're right that 7-bit ASCII ain't gonna do the job though; there are
interesting non-English languages.
 
Cheers, Tim Bray
New Oxford English Dictionary Project, U of Waterloo
 and
Open Text Systems, Inc.
 
--------------------
From: uunet!flash.bellcore.com!amsler (Robert A Amsler)
Subject: Re: `pure' text
 
And what is `pure text' format? Having spent many days now trying to
convert a key-punched text from the 1960s into some semblance of
contemporary keyboarding practice, I am curious where the guide to
keyboarding `pure text' exists.
 
For example, the text in question contained footnotes. The
keyboarders didn't know what to do with them, so they put them in the
text at exactly the point on the physical page where the footnote
started, far removed from the point of citation, in mid-sentence at
the bottom of the appropriate column. There was no mark in the text
nor on the footnote to note that these were footnotes at all.
 
Headings are likewise just typed in as text. No conventions on line
breaks, blank lines, etc.  were followed.  Determining that these
were headings is thus not at all mechanical.
 
Then there is the punctuation. " stood for opening and closing
quotes; but to help in subsequent analysis all punctuation was typed
in separated by blanks from the surrounding words.  For commas,
periods, etc., this is not a problem--but for quotes it is impossible
to tell whether they are attached to the preceding or following text.
This `pure' text also deleted all --'s; a small loss, but without
any indication of them, parenthetical comments lose their
distinguishability.
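
The loss with quotes can be shown in a few lines of Python. This is
only a sketch: keypunch_style is a hypothetical stand-in for the
keyboarders' convention as described above, not a reconstruction of
their actual procedure. Two different originals come out identical, so
no program can recover which side each quote was attached to.

```python
import re


def keypunch_style(text: str) -> str:
    """Separate every punctuation mark from its neighbours by blanks."""
    spaced = re.sub(r'([",.;:!?])', r" \1 ", text)
    # Collapse the runs of blanks the substitution leaves behind.
    return " ".join(spaced.split())


attached_inside = 'He said "yes" and left.'
attached_outside = 'He said" yes "and left.'
```

Both strings become `He said " yes " and left .` once the blanks are
inserted, which is exactly why the attachment cannot be restored
afterwards.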
 
Then there are symbols and foreign letters: C cedillas, acute and
grave accents on foreign words.  What is the `pure text' version of
these?  Does one translate u umlaut into u", ue, {u"}, @Ovp{"}u or
just u?  How should one encode the degree symbol as in 32 degrees?
Perhaps everything should be spelled out rather than special symbols
used? They did this with %, but not with degrees, nor fractions.
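
As a sketch of what one stated convention buys, here is a small Python
decoder for the {u"} style mentioned above. The mark table and the
choice of comma for the cedilla are assumptions for the example, not an
existing standard. A bracketed form can be decoded mechanically,
whereas a bare `ue' cannot be told apart from the ue in an ordinary
word such as `blue'.

```python
import re
import unicodedata

# Hypothetical convention: {<letter><mark>} stands for an accented letter.
MARKS = {
    '"': "\u0308",  # diaeresis / umlaut
    "'": "\u0301",  # acute accent
    "`": "\u0300",  # grave accent
    ",": "\u0327",  # cedilla (assumed mark character)
}

PATTERN = re.compile(r"\{([A-Za-z])(.)\}")


def decode(text: str) -> str:
    """Replace each {letter mark} group by the composed accented letter."""

    def repl(m: re.Match) -> str:
        combining = MARKS.get(m.group(2))
        if combining is None:
            return m.group(0)  # unknown mark: leave the group untouched
        return unicodedata.normalize("NFC", m.group(1) + combining)

    return PATTERN.sub(repl, text)
```

Text without bracketed groups passes through unchanged, so the
convention costs nothing where no special letters occur.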
 
So... where is the keyboarder's guide to `pure text'?
 
Perhaps, `pure text' is what an OCR system would produce from a
document.... That is of course just another big problem.  How does
the OCR system scan columns? What about thin lines between different
stories on a newspaper page? Captions for photos?
 
I guess I just don't know what `pure text' looks like.  Does `pure
text' mean we translate % into `per cent'?
 
I think I'd prefer anything BUT `pure' text.  I'd prefer some type of
well-documented format, with all the conventions noted for anything
that was outside ASCII.  With stated conventions for super/sub
scripts, fractions, formulae, headings in differing point sizes (esp.
where the point size indicated the level of heading), listing of
special symbols, notes on footnotes, and the dozens of other things
that I haven't specified.  Please no more `pure' text.  It is too
non-standard.