Humanist Discussion Group

Humanist Archives: April 29, 2020, 8:17 a.m. Humanist 33.808 - on proprietary formats

                  Humanist Discussion Group, Vol. 33, No. 808.
            Department of Digital Humanities, King's College London
                   Hosted by King's Digital Lab
                Submit to: humanist@dhhumanist.org

    [1]    From: C. M. Sperberg-McQueen 
           Subject:  (Humanist 33.803) (131)

    [2]    From: Henry Schaffer 
           Subject: Re: [Humanist] 33.803: proprietary formats (11)

    [3]    From: Henry Schaffer 
           Subject: Re: [Humanist] 33.803: proprietary formats (9)

        Date: 2020-04-28 20:14:27+00:00
        From: C. M. Sperberg-McQueen 
        Subject:  (Humanist 33.803)

Jonothan Halfin observes quite correctly that many common proprietary
formats (and formerly proprietary formats) can be read by open source
software, so that "it makes little difference in any research I have
done to date, whether the source material I am seeking is archived in
.pdf, in .doc, in .xls, in .ppt, in .jpg, in .tif, in .png, in .gif,
in .wpd, in .ps, or in .txt or .rtf. (or for that matter, any other
commonly used file format.)"  A similar argument has been made within
the library and archival community to the effect that one needn’t
really worry about commonly used formats, because the market will
ensure that there will always be decoders for them.

There are two reasons some people are unwilling to join JH in his
conclusion that "whether older archived digital media is still in a
usable format [is] rather a moot concern".

First, the phenomenon is largely restricted to very widely used
formats which can be decoded without violation of relevant patents or
other intellectual property.  That many programs can handle JPEG is
good, and unsurprising given that it is defined by an open, publicly
available standard and has never been proprietary.  One might have
less luck with any of the scores of proprietary formats once heavily
marketed by vendors and now abandoned.  Yes, Open Office can read
current Microsoft Office formats, but how well does it do on
Multiplan? Or for that matter on early versions of Microsoft Word?  I
knew some people who were devoted users of Edix/Wordix, but the last
time I pulled down an Import menu in Open Office, Edix/Wordix was not
on the list of supported formats.  (And how is Gimp on Kodak Photo
CDs?  ouch!)

Yes, of course many readers of this list have not heard of
Edix/Wordix, or of many of the other proprietary formats used by
academics a few decades ago.  That is the point.  They were current
then, and largely forgotten today.  Some software and formats that are
current today will still be familiar in a few decades; some will be
forgotten.  When it comes to making sure that data you care about is
still readable in twenty or forty years, all you have to do is guess
right about which is which.  Who am I to say you won’t be lucky?

At the DH conference in Utrecht last summer there was a session on
virtual-reality projects, with a talk I found rather poignant about
some older VR projects into which their developers had sunk a great
deal of time, which are now effectively inaccessible because they can
be run only on older hardware and software, or in some cases on
emulators which not everyone in the potential audience is likely to
have lying around.  All of them had been done in formats that were
known to be proprietary, but also known to be quite commonly used.  But
despite being commonly used (by some standards, at any rate), those
formats are not now easily readable.

If you use proprietary formats for work that you would like to see
remain available for a while (say, while you are alive), then all will
be well, as long as the format you choose is so commonly used and so
commercially significant that someone (else) will write an open source
decoder for it.  If it turns out otherwise, well, you’ll be like a
recording artist who discovers that all of their master tapes and all
of their songs are owned by someone else.  So you'll have plenty of

The second reason is that existing decoders are so often faulty.
It’s not too hard to reverse engineer some formats *in part*; a
program to read WordStar files and translate them into a form that other
software then available could read was one of the first programs I
ever wrote for someone else to use.  It lost all of the formatting
information (though I think it managed to detect and preserve
paragraph breaks), but the student had been so panicked by the fear
that they were going to have to retype their entire thesis (or rewrite
it, in the case of the chapters for which they did not have printouts)
that they were grateful just to get the character stream back out.  A
later program I wrote to decipher Word Perfect’s binary format and
produce SGML was, I think, more successful (but then, I did have
access to a running copy of Word Perfect and could run experiments to
try to understand the format, which was not the case for the WordStar
project), but my program was only ever intended to handle the one set
of files I was interested in.
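
For readers who have never looked inside such a file, partial recovery
of the kind described above can be sketched in a few lines.  This is an
illustration only, not either of the programs mentioned.  It assumes
the commonly described conventions of WordStar document files: the high
bit is set on the last character of each word, byte 0x8D marks a soft
(line-wrap) carriage return, other control bytes carry formatting, and
0x1A pads the end of the file.  The function name `wordstar_to_text` is
hypothetical.

```python
def wordstar_to_text(data: bytes) -> str:
    """Recover a plain character stream from WordStar-style bytes.

    Assumed conventions (not checked against any specification):
    high bit flags word endings, 0x8D is a soft carriage return,
    control bytes are formatting codes, 0x1A is end-of-file padding.
    All formatting is discarded; hard returns are kept as newlines.
    """
    out = []
    for byte in data:
        if byte == 0x1A:            # end-of-file padding: stop reading
            break
        if byte == 0x8D:            # soft return: the line merely wrapped
            if out and not out[-1].isspace():
                out.append(" ")
            continue
        ch = byte & 0x7F            # clear the high bit
        if ch == 0x0D:              # hard return: a real paragraph break
            out.append("\n")
        elif ch == 0x09 or 0x20 <= ch < 0x7F:
            out.append(chr(ch))     # keep tabs and printable characters
        # everything else (line feeds, formatting codes) is dropped
    return "".join(out)


if __name__ == "__main__":
    sample = b"Thi\xf3 i\xf3 \x02bol\xe4\x02 tex\xf4\x8d\nwrappe\xe4 here\xae\r\n\x1a"
    print(wordstar_to_text(sample))
```

Even this toy version shows the trade-off: the words come back, but the
bold markers around one of them are silently thrown away.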

That better programmers with more time can often do better is clear.

But it’s also clear that it is rare for any decoder to handle
everything correctly.  (Is there any evidence that it has ever
happened?  that it has ever happened in a non-trivial case?  that it
has ever happened for the particular proprietary format you would like
to rely on?)

I don’t use word-processor formats much, but every year or two I write
a paper which a book editor or a journal cannot handle unless I
convert it into Word.  So I generate HTML from my TEI-encoded source,
and import it into Open Office, try to clean up some of the worst
excrescences of Open Office’s import facility, and save it as a Word
document.  After copy editing and perhaps some reformatting by the
publisher, the editor will send it back to me in Word and I will open
it in Open Office.  This has happened ten or twenty times in the last
twenty years, and I have yet to get a document back in which the Open
Office / Word interconversions have gotten everything right.  A list
has been changed from a bulleted list to a numbered list, or vice
versa, or screwed up in some other way.  Two of the footnotes have
mysteriously been changed to gibberish, or one paragraph has been
truncated in the middle.  If the conversion routines botch things this
badly for simply formatted expository prose, I don't like to think
what they do with complicated documents.

So when I hear people say that for commonly used formats there will
always be satisfactory conversion programs, I always want to ask "have
you ever actually *looked* at the output of those conversion
programs?"  (Actually what I want to ask is slightly different, but
it's rather rude, so I don't want to say it in front of Willard.)

Those who are content for the work they do to be preserved for the
future only in mutilated form are welcome to use proprietary formats.
But do please note that when you ask for sympathy later, after the
years of work you poured into that proprietary format are gone because
the format is no longer supported, you may only get a shrug: Your gun,
your bullet, your foot.

As for me, I have come to think that putting anything you care about
into any format not defined by an openly available specification is
like leaving the only copy of your newly completed manuscript lying on
the desk in your study and then turning around and throwing a lit
Molotov cocktail into the room before closing the door and going into
the front room to watch Netflix.  Enjoy the film!

C. M. Sperberg-McQueen
Black Mesa Technologies LLC

        Date: 2020-04-28 12:47:10+00:00
        From: Henry Schaffer 
        Subject: Re: [Humanist] 33.803: proprietary formats

  This discussion of proprietary formats nudged my memory and I found that
I still have a C program (with documentation) which I wrote back in the mid
1980s to translate an ASCII file into WordStar 1.4 format. So it gives some
information about that format.

  Might this be of interest to anyone? I could easily distribute it
(program + C source = 262 lines 10kB) via email or whatever (e.g. should I
put it in GitHub?)

--henry schaffer

        Date: 2020-04-28 13:21:05+00:00
        From: Henry Schaffer 
        Subject: Re: [Humanist] 33.803: proprietary formats

Oh, one more item from the past which makes me grin - at the end of the doc
I give my contact information - my postal mail address and also "

The last one was known as the "bang address".

Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)