Humanist Discussion Group

Humanist Archives: May 16, 2020, 8:42 a.m. Humanist 34.24 - online: the Coronavirus Corpus

                  Humanist Discussion Group, Vol. 34, No. 24.
            Department of Digital Humanities, King's College London
                   Hosted by King's Digital Lab
                Submit to: humanist@dhhumanist.org

        Date: 2020-05-15 13:38:43+00:00
        From: Mark Davies 
        Subject: Coronavirus Corpus

We are pleased to announce the release of the Coronavirus Corpus:


The Coronavirus Corpus is designed to be the definitive record of the social,
cultural, and economic impact of the coronavirus (COVID-19) in 2020 and beyond,
and it is part of the English-Corpora.org suite of corpora, which offer
unparalleled insight into genre-based, historical, and dialectal variation in
English.

The corpus is currently about 270 million words in size, and it continues to
grow by 3-4 million words each day. (For example, there are already 4 million
words of text for yesterday, May 14.) At this rate, the corpus may be 500-600
million words in size by August 2020.
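As a rough check on that estimate, the projection can be worked through as simple linear growth. This is only a sketch: the start date is taken from the date of this post, reading "by August" as August 1 is an assumption, and the function name `project_size` is made up for illustration.

```python
from datetime import date

def project_size(start_words, words_per_day, start, end):
    """Linear projection: current size plus daily growth times elapsed days."""
    days = (end - start).days
    return start_words + words_per_day * days

start = date(2020, 5, 15)   # date of this announcement
target = date(2020, 8, 1)   # assumption: "by August" read as August 1

# Figures from the announcement: about 270 million words, growing 3-4 million a day.
low = project_size(270_000_000, 3_000_000, start, target)
high = project_size(270_000_000, 4_000_000, start, target)
print(f"{low:,} to {high:,} words")
```

Under these assumptions the result falls inside the 500-600 million range quoted above.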

The Coronavirus Corpus allows you to see the frequency of words and phrases in
10-day increments (and even day by day, if desired) since Jan 2020, such as
social distancing, flatten the curve, WORK * home, Zoom, Wuhan, hoard*, toilet
paper, curbside, pandemic, reopen, defy.
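For readers who want to try the same idea on their own dated texts, here is a minimal sketch of counting a phrase in 10-day increments. It is not how the corpus interface itself works; the names `period_index` and `frequency_by_period` and the sample documents are invented for illustration, and the counts are raw hits rather than frequencies normalized by period size.

```python
from collections import Counter
from datetime import date

def period_index(day, start=date(2020, 1, 1), span=10):
    """Number of the 10-day period that `day` falls in, counting from `start`."""
    return (day - start).days // span

def frequency_by_period(docs, phrase):
    """Count case-insensitive occurrences of `phrase` per 10-day period."""
    hits = Counter()
    for day, text in docs:
        hits[period_index(day)] += text.lower().count(phrase.lower())
    return dict(sorted(hits.items()))

# Hypothetical sample documents as (publication date, text) pairs.
docs = [
    (date(2020, 1, 5), "No mention here."),
    (date(2020, 3, 15), "Social distancing begins. Social distancing works."),
    (date(2020, 3, 18), "Keep up social distancing."),
]
result = frequency_by_period(docs, "social distancing")
print(result)
```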

You can also look at collocates, to see what is being said about a certain
topic, such as (verbs near) virus, or any word near ban (v), stockpile,
disinfect*, or remotely. And you can even see and compare the collocates of a
word in 10-day periods since Jan 2020.
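The basic idea behind collocates, words that occur near a given word, can be sketched as a window count. This is a simplified illustration under stated assumptions: the tokenizer is a plain regular expression, there is no part-of-speech tagging or statistical scoring, and the function name `collocates` and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def collocates(text, node, window=4):
    """Count words appearing within `window` tokens of each occurrence of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = i + window + 1
            # Add every token in the window except the node occurrence itself.
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts

sample = "Officials say the virus spreads quickly. The virus spreads indoors."
near_virus = collocates(sample, "virus", window=2)
print(near_virus.most_common(3))
```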

As is common with most online corpora, the Coronavirus Corpus allows you to see
re-sortable, PoS-colored Keyword in Context (KWIC) / concordance views, for any
word or phrase.
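A Keyword in Context view lines up each hit with a few words on either side, and re-sorting orders the lines by their left or right context. The sketch below shows that idea only; it omits part-of-speech coloring, and the names `kwic` and `sort_by_right` and the sample sentence are assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, hit, right context) for each match of `keyword`."""
    tokens = re.findall(r"\w+", text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

def sort_by_right(rows):
    """Re-sort concordance lines alphabetically by their right-hand context."""
    return sorted(rows, key=lambda row: row[2].lower())

sample = "People began to reopen shops while others refused to reopen at all"
rows = kwic(sample, "reopen", width=2)
for left, hit, right in sort_by_right(rows):
    print(f"{left:>12} | {hit} | {right}")
```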

You can also compare between different time periods, to see how our view of
things has changed over time. A few examples might be: phrases with social * or
economic * that were more common in Jan-Feb than in Apr-May, words near BAN or
OBEY that were more common in Apr-May than in Jan-Feb, or all nouns that were
much more common in late April 2020 than in March 2020.
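One simple way to make this kind of comparison on your own counts is to normalize each period to frequency per million words and keep the words whose rate rose by some factor. This is a sketch under assumptions, not the method used by the corpus interface: the threshold `min_ratio`, the function names, and all of the counts below are invented for illustration.

```python
def per_million(count, total_words):
    """Normalize a raw count to occurrences per million words."""
    return count * 1_000_000 / total_words

def more_common_later(counts_a, total_a, counts_b, total_b, min_ratio=2.0):
    """Words whose per-million rate in period B is at least `min_ratio` times period A's."""
    result = {}
    for word, count_b in counts_b.items():
        count_a = counts_a.get(word, 0)
        if count_a == 0:
            continue  # ratio is undefined when the word is absent from period A
        ratio = per_million(count_b, total_b) / per_million(count_a, total_a)
        if ratio >= min_ratio:
            result[word] = ratio
    return result

# Hypothetical counts for an earlier period (A) and a later period (B).
early = {"reopen": 10, "virus": 500}
late = {"reopen": 200, "virus": 600}
rising = more_common_later(early, 1_000_000, late, 2_000_000)
print(rising)
```

Normalizing matters here because the later period contains more text, so raw counts alone would overstate the change.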

The corpus allows you to compare across the 20 countries in the corpus (US, UK,
Australia, India, etc.), to see what is being said about the coronavirus in each
of these countries. You can also quickly and easily create "Virtual Corpora" for
particular topics, based on keywords in the text, country, date, publication
source, and more.

Finally, full-text data from the corpus will soon be available on a
"subscription" basis, where you can download nearly all of the new data every
day, week, or month -- just as with the other corpora from English-Corpora.org
(see https://www.corpusdata.org).

We hope that the corpus will be of use to you in your research and teaching.

Mark Davies
Professor of Linguistics / Brigham Young University

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **

Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php

Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)

This site is maintained under a service level agreement by King's Digital Lab.