4.0572 Texts: Queries; SGML; SED; etc. (6/154)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Wed, 10 Oct 90 00:47:23 EDT

Humanist Discussion Group, Vol. 4, No. 0572. Wednesday, 10 Oct 1990.


(1) Date: 05 Oct 90 16:36:20 EST (14 lines)
From: James O'Donnell <JODONNEL@PENNSAS>
Subject: CETEDOC

(2) Date: Fri, 5 Oct 90 14:36:50 EDT (68 lines)
From: lang@PRC.Unisys.COM
Subject: Re: 4.0570 SGML

(3) Date: Fri, 5 Oct 90 22:04 EDT (17 lines)
From: GORDON DOHLE <DOHLE@Vax2.Concordia.CA>
Subject: Re: 4.0569 Macs/Characters; Foot Mice; Computer vs. Computer (4/80)

(4) Date: Sat, 6 Oct 90 00:34 EDT (21 lines)
From: FZINN@OBERLIN.BITNET
Subject: Re: 4.0570 SGML, Markup, Empiricism (5/120)

(5) Date: Mon, 08 Oct 90 10:38:15 BST (34 lines)
From: Donald A Spaeth 041 339-8855 x6336 <GKHA13@CMS.GLASGOW.AC.UK>
Subject: 4.0570 SGML

(1) --------------------------------------------------------------------
Date: 05 Oct 90 16:36:20 EST
From: James O'Donnell <JODONNEL@PENNSAS>
Subject: CETEDOC

It was interesting to see a message originating from CETEDOC in Louvain on
HUMANIST this evening, seeking e-texts of Gregory of Tours. This would seem to
be an appropriate forum in which to ask CETEDOC to describe their own
resources of e-texts and to describe their policies and plans for making those
texts available to interested scholars, just as they are asking others to
supply them. It has been my experience that CETEDOC is not receptive to
inquiries of this sort, and to the best of my knowledge the only way in which
they have been willing to provide access to some large bodies of texts they
control (esp. late antique and medieval Latin) has been by selective
publication of large and expensive microfiche concordances.
(2) --------------------------------------------------------------29----
Date: Fri, 5 Oct 90 14:36:50 EDT
From: lang@PRC.Unisys.COM
Subject: Re: 4.0570 SGML


Dominik Wujastyk writes:

> If you have a file like this:
>
> This is a test <file just to see> if the SGML <strings
> do indeed get stripped> or not.
>
> And you run the cited SED commands on it (sed 's/<.*>//g' in >out)
> you will get this:
>
> This is a test if the SGML <strings
> do indeed get stripped> or not.

I'm afraid I missed the beginning of this discussion,
but the SED script Dominik Wujastyk cited will indeed
*not* do what I understand to be the job.

I am assuming that a script is needed to do the following:
(1) remove all instances of <...> on one line
(where the ... does not contain a `<' char)
(2) after (1) is done, remove all chars between (and including)
any remaining `<' and the end of the line, and
(3) after (1) is done, remove all chars between (and including)
the beginning of the line and any remaining `>'.

The following will work:

sed -e 's/<[^<]*>//g' -e 's/<.*$//g' -e 's/^.*>//g' in > out
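
For readers who want to try this, here is a minimal sketch that wraps the
command above in a shell function and feeds it the two-line sample quoted
earlier. The function name strip_tags_linewise is an illustrative
assumption and not part of the original post; the sed expressions are
copied unchanged.

```shell
#!/bin/sh
# strip_tags_linewise: hypothetical wrapper name (an assumption).
# Applies the three substitutions from the post to each line of stdin:
#   1. delete every <...> that opens and closes on the same line
#   2. delete from any remaining '<' to the end of the line
#   3. delete from the start of the line through any remaining '>'
strip_tags_linewise() {
  sed -e 's/<[^<]*>//g' -e 's/<.*$//g' -e 's/^.*>//g'
}

# Feed the sample text from the quoted message through the filter.
printf '%s\n' \
  'This is a test <file just to see> if the SGML <strings' \
  'do indeed get stripped> or not.' | strip_tags_linewise
```

Note that each expression still sees only one line at a time. A line lying
wholly inside a tag that spans three or more lines contains neither `<' nor
`>' and would pass through unchanged, so the sketch covers only tags that
open on one line and close on the next.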

It is true that SED and AWK et al. are basically line-oriented,
but that doesn't limit their usefulness to tasks whose scope
is but a single line of text. If there are other SED or AWK
scripts that anyone would like some help with, I'd be happy
to help.


To: editors@brownvm.brown.edu
Subject: apologies for contamination
Date: Fri, 05 Oct 90 22:36:11 +0100
From: Dominik Wujastyk <ucgadkw@ucl.ac.uk>


Sorry to all HUMANISTS: I recently posted a complaint about a SED
script that assumed the line-oriented nature of text files. But
actually, the discussion about this is taking place on the TEI public
discussion list, not here on HUMANIST. A nice case of textual
contamination!

Apologies.

Dominik

(3) --------------------------------------------------------------26----
Date: Fri, 5 Oct 90 22:04 EDT
From: GORDON DOHLE <DOHLE@Vax2.Concordia.CA>
Subject: Re: 4.0569 Macs/Characters; Foot Mice; Computer vs. Computer (4/80)

There has been a lot of discussion lately on this and other lists about
SGML, TEI, and other formatting problems. I have followed some of it out of
curiosity, but perhaps don't understand the issue. I use a Mac to download
a lot of material, including programs and texts, pictures and sound. Most of
them are Binhexed and Stuffed, which means they come off the mainframe to me
in what I guess is some kind of machine language. When I unbinhex and unstuff
them, they can be opened and edited with whatever creator program that was
used to make them, such as Word, Wordperfect, etc. It seems this is a network
standard for transmitting large or small files as quickly and efficiently as
possible and there is never any difficulty with digging around for ASCII
equivalents.
Is there no such standard encryption format in the DOS world?
Gordon
(4) --------------------------------------------------------------28----
Date: Sat, 6 Oct 90 00:34 EDT
From: FZINN@OBERLIN.BITNET
Subject: Re: 4.0570 SGML, Markup, Empiricism (5/120)

I would vigorously second Willard's request for a straightforward guide
to 'minimal' SGML markup for texts.

As for the comments of Bill Ball concerning use of more advanced hardware,
if the procedures only function on high-end machines, only a few people
will be able to use them. SGML _seems_ to be something that works on the
greatest variety of systems (IF you can find/program/etc an editor---and
that seems to me, at least at the present, to be a pretty large IF---I would
be delighted to discover that I am mistaken).

A kind HUMANIST once forwarded to me a formatted copy of the marked-up
Oxford list of texts. How was that done (i.e., the formatting)? I could
search my files and perhaps come up with the name of the person, but perhaps
this question will bring forth an answer.

Grover Zinn
FZINN@OBERLIN
(5) --------------------------------------------------------------46----
Date: Mon, 08 Oct 90 10:38:15 BST
From: Donald A Spaeth 041 339-8855 x6336 <GKHA13@CMS.GLASGOW.AC.UK>
Subject: 4.0570 SGML

I suspect that Dominik is right and a lot of editors WILL have
problems stripping out SGML. But no programming language
should have this problem, nor should word-processing packages
with macro facilities. WP macros and text-oriented languages (like
SNOBOL) are more character-oriented than line-oriented. To a word-processor,
carriage return is simply another character.

I've written macros in MS Word which strip out markup between angle
brackets, although my purpose was to replace this markup with underlining,
italics, and paragraph styles to create a fully-formatted Word document.

The technique in a Word macro is: (1) search
for an open angle bracket, (2) open a defined block,
(3) search for the next close angle bracket, (4) delete the block,
(5) restart the loop. In a SNOBOL routine, the procedure
would be similar: (1) read in the next line from INPUT,
(2) pass everything up to the next open angle bracket to OUTPUT,
(3) if an open angle bracket is found, set markup=1 and
skip to the next close bracket, (4) if none is found read the
next line into INPUT, etc., (5) when a close angle bracket is
found start passing the contents of INPUT to OUTPUT again,
(6) restart the loop. (i.e., (5) is really (2) again)
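
The SNOBOL-style procedure above can be sketched as a small state machine.
What follows is a minimal illustration in POSIX shell and awk, not SNOBOL
and not a Word macro. The function name strip_markup is an assumption
introduced here; the flag name markup follows the "markup=1" step in the
description.

```shell
#!/bin/sh
# strip_markup: hypothetical helper name, not from the original post.
# Reads text on stdin and copies it to stdout, omitting everything from
# an open angle bracket through the next close angle bracket, even when
# the two brackets are on different lines.
strip_markup() {
  awk '
  {
    out = ""
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (markup) {
        # Inside markup: skip characters until the close bracket.
        if (c == ">") markup = 0
      } else if (c == "<") {
        # Open bracket found: start skipping.
        markup = 1
      } else {
        # Ordinary text: pass it to the output.
        out = out c
      }
    }
    # The markup flag is deliberately not reset here, so it carries
    # over to the next input line.
    print out
  }'
}

# Try it on the two-line sample from earlier in this digest.
printf '%s\n' \
  'This is a test <file just to see> if the SGML <strings' \
  'do indeed get stripped> or not.' | strip_markup
```

Because the flag survives from one line to the next, a tag may span any
number of lines. One side effect of this sketch is that a line consisting
entirely of markup is emitted as an empty line rather than dropped.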

Any programmer will find these obvious. But others may like to
see how one might construct a stripping macro, rather than being
told that "it's easy; any editor that's any good can do it"!

Cheers,
Don Spaeth