Humanist Discussion Group

Humanist Archives: June 5, 2020, 8:03 a.m. Humanist 34.84 - HathiTrust extracted features

                  Humanist Discussion Group, Vol. 34, No. 84.
            Department of Digital Humanities, King's College London
                   Hosted by King's Digital Lab
                Submit to: humanist@dhhumanist.org

        Date: 2020-06-04 18:52:10+00:00
        From: Ryan Dubnicek 
        Subject: Announcing HTRC Extracted Features v.2.0!

HathiTrust Research Center (HTRC) is excited to announce the release of the
Extracted Features 2.0 dataset! This new version of Extracted Features offers
volume- and page-level data for 17+ million volumes in the HathiTrust Digital
Library. The data include:

  *   Bibliographic metadata
  *   Computationally inferred metadata about the page, such as language and
      line counts
  *   Tokens (words), parts of speech, and their per-page counts

Overall, the dataset represents more than 6 billion pages of text from the
digital library and includes nearly 3 trillion tokens from the corpus.
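
As a rough illustration of how the page-level token and part-of-speech counts
described above might be rolled up to the volume level, here is a minimal
Python sketch. The record layout, the field names (`seq`, `language`,
`line_count`, `tokens`) and the sample values are assumptions made up for this
example; they are not taken from the Extracted Features 2.0 schema, which
should be checked before adapting the code.

```python
from collections import Counter

# Hypothetical, simplified page records. The real Extracted Features files
# may name and nest these fields differently.
pages = [
    {"seq": "00000001", "language": "en", "line_count": 2,
     "tokens": {"whale": {"NN": 2}, "the": {"DT": 3}}},
    {"seq": "00000002", "language": "en", "line_count": 1,
     "tokens": {"whale": {"NN": 1, "VB": 1}, "sea": {"NN": 1}}},
]


def volume_token_counts(pages):
    """Sum per-page token counts into volume-level totals, ignoring POS tags."""
    totals = Counter()
    for page in pages:
        for token, pos_counts in page["tokens"].items():
            totals[token] += sum(pos_counts.values())
    return totals


def volume_pos_counts(pages):
    """Sum per-page counts by part-of-speech tag across the whole volume."""
    totals = Counter()
    for page in pages:
        for pos_counts in page["tokens"].values():
            totals.update(pos_counts)
    return totals


if __name__ == "__main__":
    print(volume_token_counts(pages).most_common(2))  # [('whale', 4), ('the', 3)]
    print(volume_pos_counts(pages)["NN"])             # 4
```

Because each page carries its own counts, the same loop could be restricted to
a page range or to pages in a given language before summing.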

Not only does this release extend the number of volumes in HathiTrust available
as Extracted Features, it also incorporates linked data such that names in the
files are linked to external authorities when possible.

Learn more about the release and data schema:
Download Extracted Features 2.0 files: https://wiki.htrc.illinois.edu/x/_QGGAQ

Contact htrc-help@hathitrust.org with any questions.

Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php

Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)

This site is maintained under a service level agreement by King's Digital Lab.