Content Mining: Free resources for mining

This guide provides information about available text mining resources and tools, and about whether the Libraries' subscription databases support content mining.

Free resources and data for mining

The following resources allow free content mining. Restrictions on volume or frequency are noted where they apply. Vendors and platforms regularly change their permissions, so let us know if you have trouble accessing these resources for your mining research.

Updated: March 18, 2021

Vendor Details
Adam Matthew

Primary sources in humanities and social sciences (via API)

arXiv Bulk Data Access

Supported by Cornell University

Association for Computing Machinery (ACM)

ACM Digital Library; ACM Transactions.

Contact USC Libraries to initiate the process.

Astrophysics Data System (ADS)

Digital library portal for researchers in astronomy and physics, operated by the Smithsonian Astrophysical Observatory (SAO).

BioMed Central

Corpus of peer-reviewed research, covered by an open access license agreement that allows free distribution and reuse of the full-text articles, including the highly structured XML versions.

BYU Google Books Viewer

Search longer strings of words from the Google Books corpus. Offers the same corpora as the Google Books Ngram Viewer.

Chronicle (visualizing language in NY Times)

Visualizes language usage in the New York Times. See also the Archives page and the Research and Development page.

Chronicling America API

Optical character recognition (OCR) bulk downloads of the Library of Congress's digitized historical newspapers.

Culturomics

Computational lexicology that studies human behavior and cultural trends through the quantitative analysis of digitized texts.

Digital Public Library of America (DPLA) API

Provides access to a wide range of content from America's museums, libraries, and archives. Offers metadata on items and collections. All data in the DPLA repository is available for download as zipped JSON files.

English Corpora

Widely used corpora of English-language data for text and data mining (TDM).

Early English Books Online (EEBO)

University of Michigan Text Creation Partnership

Europeana API

Openly licensed thematic datasets and 4 API options.

Google Books Ngram Viewer

Charts the frequencies of any word or short phrase using yearly counts of n-grams found in sources printed from 1500 to the present. Texts in American English, British English, French, German, Spanish, Russian, Hebrew, and Chinese.

HathiTrust data + API

Free data sets on Google-digitized and non-Google-digitized volumes. USC Libraries is a member of HathiTrust.

Humanities Data

Data sets for digital research through Michigan State University.

Internet Archive

Full text of over 8 million eBooks and texts in the public domain.

LexisNexis

Contact USC Libraries to initiate the process.

Does not have a large TDM program, but because articles can be downloaded, TDM is possible. Users can batch download up to 500 articles in a single text file (as of November 2019).

New York Times article search API

Search and mine articles, 1851-present.

NextBio

Contact USC Libraries to initiate the process.

Project Gutenberg Online catalog

Catalog of over 60,000 free books.

Public Library of Science (PLOS)

Over 200,000 articles in a variety of science subjects.

PubMed Central text mining tools

Variety of tools supported by the National Library of Medicine (NLM) and the National Center for Biotechnology Information (NCBI).

Top Ten LDC Corpora

Linguistic Data Consortium, supported by the University of Pennsylvania.

Oxford Text Archive

Literary and linguistic data supported by Oxford University's Bodleian Library.

Online Books page

Over 3 million books available. Supported by the University of Pennsylvania.
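
Example: counting n-grams by year

Several of the resources above, such as the Google Books Ngram Viewer, chart how often a word or phrase appears per year. The short Python sketch below illustrates that general idea on downloaded text. It is not the method used by any vendor listed here. The function names (ngrams, yearly_ngram_counts, relative_frequency) and the two-document sample corpus are hypothetical and chosen only for illustration. In practice the (year, text) pairs would come from files obtained from one of the resources above, subject to that resource's terms.

```python
import re
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return the word n-grams in `text` as space-joined lowercase strings."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def yearly_ngram_counts(docs, n):
    """Map each year to a Counter of n-gram occurrences.

    `docs` is an iterable of (year, text) pairs.
    """
    counts = defaultdict(Counter)
    for year, text in docs:
        counts[year].update(ngrams(text, n))
    return counts


def relative_frequency(counts, phrase):
    """For each year, return the share of all n-grams that equal `phrase`."""
    result = {}
    for year, counter in sorted(counts.items()):
        total = sum(counter.values())
        result[year] = counter[phrase] / total if total else 0.0
    return result


# Hypothetical toy corpus; real input would be text you have downloaded.
sample_docs = [
    (1900, "The steam engine changed the city. The steam engine was loud."),
    (1950, "The jet engine changed travel."),
]

counts = yearly_ngram_counts(sample_docs, 2)
freq = relative_frequency(counts, "steam engine")
for year, share in freq.items():
    print(year, round(share, 3))
```

Dividing by the total number of n-grams in each year, rather than reporting raw counts, keeps years with more digitized text from dominating the chart.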