14.0266 letter frequency in Latin

From: by way of Willard McCarty (willard@lists.village.Virginia.EDU)
Date: 09/25/00


                   Humanist Discussion Group, Vol. 14, No. 266.
           Centre for Computing in the Humanities, King's College London
       [1]   From:    "Jim Marchand" <marchand@ux1.cso.uiuc.edu>          (90)
             Subject: Re: 14.0263 letter frequency in Latin?
       [2]   From:    Anne Mahoney <mahoa@bu.edu>                         (48)
             Subject: letter frequency in Latin
             Date: Mon, 25 Sep 2000 06:52:31 +0100
             From: "Jim Marchand" <marchand@ux1.cso.uiuc.edu>
             Subject: Re: 14.0263 letter frequency in Latin?
    The question as to the frequency of letters in Latin is interesting
    and confronts us with a number of basic problems.  These may seem trivial,
    but I can assure you they are not.  First: What is a
    language and how can we delimit it?  Language is one of those words like
    _is_ which we glibly use, but scarcely ever define.  Secondly, what is
    Latin?  Just looking at Olmsted's Index to Language 26-30 (LSA 1955): Latin,
    Latin, Archaic; Latin, British; Latin, Church; Latin, Classical; Latin,
    Colloquial; Latin, Early; Latin,
    Hisperic; Latin, Imperial; Latin, Late; Latin, Low; Latin,
    Medieval; Latin, Neo; Latin, Old; Latin, Patristic; Latin,
    Pauline; Latin, Renaissance; Latin, Republican; Latin, Vulgar,
    etc., and I have not been careful to list them all.  Letters
    themselves offer numerous problems.  How about diphthongs, often
    spelled, e.g. ae, as ligatures?  The standard lists are in what we
    nowadays would call ASCII (restricted), so that German contains no
    umlauts, French no accents, etc.  And what is the purpose of the
    list?  There was at one time a great movement to discover the
    frequency of sounds in various languages, and George Zipf collected these in
    search of support for his law of least effort, etc.  In
    fact, a glib answer to the question might be: Look at G. K. Zipf,
    he must list them somewhere (for example: G. K. Zipf and F. M.
    Rogers, "Phonemes and variphones in four present-day Romance
    languages and classical Latin from the viewpoint of Dynamic
    Philology," Archives Néerlandaises de Phonétique Expérimentale 15
    (1939), 111-147).
        One might, for example, take any large corpus and count the
    letters (many `concordance' programs [e.g. TACT, available for ca.
    $50 from the Modern Language Association] will do this for you).
    Or, one might take one of the concordances (or several of the
    concordances available), some of which list as lagniappe the letter
    frequencies of the corpus they are working with.  This is not very
    `scientific', but will work well for sloppy work; after all, we all
    know that the sequence of the frequency of English letters is
    etaoinshrdlump, as Pogo assures us and Vanna White demonstrates each weekday.
    My own count of Latin, made by running a text (the Five Books of
    Moses, j and i, v and u distinguished; ligatures expanded) of the
    Vulgate through TACT, looks like this: e a i o t n l r s c m d p u
    v b g h f q z j x.  I have, naturally, left out y and k.
    The question may not have an answer.
    In the Humanist archives is a thread on etaoin shrdlu, which you
    could retrieve by searching shrdlu.
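
    The counting procedure described above (run a corpus through a program,
    expand ligatures, decide whether j/i and v/u are distinguished, then rank
    the letters) can be sketched in a few lines of Python.  This is only an
    illustrative sketch, not what TACT does internally: the function names,
    the ligature table, and the sample sentence are assumptions introduced
    here, and ties in the ranking are broken alphabetically for lack of a
    better rule.

```python
from collections import Counter

# Ligatures to expand before counting (an assumed, minimal table).
LIGATURES = {"æ": "ae", "œ": "oe", "Æ": "AE", "Œ": "OE"}


def count_letters(text, fold_uv=False, fold_ij=False):
    """Count the letters a-z in text after expanding ligatures.

    fold_uv / fold_ij control whether v is counted as u and j as i;
    by default the pairs are kept distinct.
    """
    for ligature, expansion in LIGATURES.items():
        text = text.replace(ligature, expansion)
    text = text.lower()
    if fold_uv:
        text = text.replace("v", "u")
    if fold_ij:
        text = text.replace("j", "i")
    # Digits, punctuation, whitespace and accented letters are ignored.
    return Counter(ch for ch in text if "a" <= ch <= "z")


def ranked(counts):
    """Return letters from most to least frequent (ties alphabetical)."""
    return [letter for letter, _ in
            sorted(counts.items(), key=lambda item: (-item[1], item[0]))]


if __name__ == "__main__":
    sample = "In principio creavit Deus cælum et terram"
    print(" ".join(ranked(count_letters(sample))))
```

    On a real corpus one would read the text from a file and pass the whole
    string to count_letters; the choice of folding options changes the
    ranking, which is one reason published lists disagree.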
    -----Original Message-----
    From: Humanist Discussion Group
    <willard.mccarty@kcl.ac.uk>) <willard@lists.village.virginia.edu>
    To: Humanist Discussion Group <humanist@lists.Princeton.EDU>
    Date: Friday, September 22, 2000 4:01 AM
      >               Humanist Discussion Group, Vol. 14, No. 263.
      >       Centre for Computing in the Humanities, King's College London
      >               <http://www.princeton.edu/~mccarty/humanist/>
      >              <http://www.kcl.ac.uk/humanities/cch/humanist/>
      >         Date: Fri, 22 Sep 2000 09:45:24 +0100
      >         From: Melissa Terras <melslists@yahoo.com>
      >         Subject: letter frequency in latin
      >Hello All.
      >A Question - I am looking for some (any) articles on
      >statistical analysis of letter frequency in Latin.  I
      >know that there has been a lot of work done on letter
      >frequency and versatility in the English Language, but
      >does anyone know of any resources that deal with
      >letter frequency and probable letter sequences in
      >Latin, from whatever period?
      >Melissa M Terras MA MSc
      >Engineering Science / Centre for the Study of Ancient
      >Christ Church
      >University of Oxford
      >Oxford OX1 1DP
             Date: Mon, 25 Sep 2000 06:53:23 +0100
             From: Anne Mahoney <mahoa@bu.edu>
             Subject: letter frequency in Latin
    In a note to be published this year in Classical Outlook, my colleague Jeff
    Rydberg-Cox and I address this question.  We counted the letters in the Perseus
    Latin corpus and found that the relative ranking of letters is not too
    different from that in English, except that 'i' and 'u' rank significantly
    higher than 'o' -- not surprising, given that they do double duty as
    consonants.
    The figures are as follows:
    letter  percent (rounded)
    e       9.3   (727,785 occurrences)
    i       8.9
    u       8.7
    a       6.8
    t       6.5
    s       6.0
    r       4.9
    n       4.9
    m       4.5
    o       4.4
    c       3.2
    l       2.5
    d       2.4
    p       2.2
    q       1.4
    b       1.1
    g       0.8
    f       0.8
    h       0.7
    x       0.3
    y       0.1
    k       0      (434 occurrences)
    w       0      (322)
    z       0      (307)
    At the time there was no 'j' in the Perseus texts (though 'j' does occur in
    some of our schoolboy commentaries).  The corpus is not consistent about
    'u' and 'v', since we've retained whatever was in the original print
    editions, so we simply counted all 'v' as 'u'.  We also did not attempt to
    weed out Roman
    The corpus we counted was about 7.8 million characters (letters, digits, and
    punctuation), from Plautus, Caesar (BG), Catullus, Cicero (orations and
    letters), Virgil, Horace (Odes), Livy (books 1-10), Ovid (Metamorphoses),
    Suetonius (Caesars), the Vulgate, and Servius's commentary on Virgil.  Because
    this corpus is so heterogeneous, a lot more work could be done on refining the
    We did not look at letter sequences at all, and I don't think I've ever seen
    anything on that subject for Latin.
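
    The percentages in the table appear to be taken against the whole
    character count rather than against letters only: 727,785 occurrences
    of 'e' out of roughly 7.8 million characters comes to about 9.3 percent.
    A minimal Python sketch of that calculation, with 'v' counted as 'u' as
    described above, follows.  It also counts adjacent letter pairs within
    words, as one possible starting point for the letter-sequence question.
    The function names are hypothetical, and excluding whitespace from the
    denominator is an assumption; the note does not say how spaces were
    treated.

```python
import re
from collections import Counter


def letter_percentages(text):
    """Percent share of each letter, with v counted as u.

    The denominator is every non-whitespace character (letters, digits,
    punctuation); leaving whitespace out is an assumption.
    """
    folded = text.lower().replace("v", "u")
    denominator = sum(1 for ch in folded if not ch.isspace())
    counts = Counter(ch for ch in folded if "a" <= ch <= "z")
    return {letter: round(100 * n / denominator, 1)
            for letter, n in counts.items()}


def bigram_counts(text):
    """Count adjacent letter pairs inside words, with v counted as u."""
    folded = text.lower().replace("v", "u")
    counts = Counter()
    for word in re.findall(r"[a-z]+", folded):
        for first, second in zip(word, word[1:]):
            counts[first + second] += 1
    return counts


if __name__ == "__main__":
    line = "Arma virumque cano, Troiae qui primus ab oris"
    print(letter_percentages(line))
    print(bigram_counts(line).most_common(5))
```

    Pairs are not counted across word boundaries here; whether they should
    be depends on what the sequence statistics are meant to model.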
    --Anne Mahoney
    Perseus Project
