Research Guides: Content Mining: Free resources for mining

Free resources and data for mining

The following resources allow free content mining. Restrictions on volume or frequency are noted where they apply. Vendors and platforms regularly change their permissions, so let us know if you have trouble accessing the resources for your mining research.

Updated: March 18, 2021

Vendor	Details
Adam Matthew	Primary sources in humanities and social sciences (via API)
arXiv Bulk Data Access	Supported by Cornell University
Association for Computing Machinery (ACM)	ACM Digital Library; ACM Transactions. Contact USC Libraries to initiate the process.
Astrophysics Data System (ADS)	Digital library portal for researchers in astronomy and physics, operated by the Smithsonian Astrophysical Observatory (SAO).
BioMed Central	Corpus of peer-reviewed research, covered by an open access license agreement allowing free distribution and re-use of the full-text articles, including the highly structured XML versions.
BYU Google Books Viewer	Search longer strings of words from the Google Books corpus. Offers same corpora as N-Grams.
Chronicle (visualizing language in NY Times)	Visualizes language usage in the New York Times. See also the Archives page and the Research and Development page.
Chronicling America API	Bulk downloads of Optical Character Recognition (OCR) text from the Library of Congress's digitized historical newspapers.
Culturomics	Computational lexicology that studies human behavior and cultural trends through the quantitative analysis of digitized texts.
Digital Public Library of America (DPLA) API	Provides access to a wide range of content from America's museums, libraries, and archives. Offers metadata on items and collections. All DPLA data in the DPLA repository is available for download as zipped JSON files.
English Corpora	Widely used corpora of English-language data for text and data mining (TDM)
Early English Books Online (EEBO)	University of Michigan Text Creation Partnership
Europeana API	Openly licensed thematic datasets and 4 API options.
Google Books Ngram Viewer	Charts the frequency of any word or short phrase using yearly counts of n-grams found in sources printed from 1500 to the present. Texts in American English, British English, French, German, Spanish, Russian, Hebrew, and Chinese.
HathiTrust data + API	Free data sets on Google-digitized and non-digitized volumes. USC Libraries is a member of HathiTrust.
Humanities Data	Data sets for digital research through Michigan State University.
Internet Archive	Full text of over 8 million eBooks and texts in the public domain.
LexisNexis	Contact USC Libraries to initiate the process. Does not have a large TDM program, but because articles can be downloaded, TDM is possible. Users can batch download up to 500 articles in a single text file (as of November 2019).
New York Times article search API	Search and mine articles, 1851-present
NextBio	Contact USC Libraries to initiate the process.
Project Gutenberg Online catalog	Catalog of over 60,000 free books
Public Library of Science (PLOS)	Over 200,000 articles in a variety of science subjects
PubMed Central text mining tools	Variety of tools supported by National Library of Medicine (NLM) and National Center for Biotechnology Information (NCBI)
Top Ten LDC Corpora	Linguistic Data Consortium supported by the University of Pennsylvania
Oxford Text Archive	Literary and linguistic data supported by Oxford University's Bodleian Library
Online Books page	Over 3 million books available. Supported by the University of Pennsylvania
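
Several of the resources above, such as the Google Books Ngram Viewer and Culturomics, rest on counting how often a phrase appears in texts from each year. The sketch below illustrates that idea on a tiny invented corpus. It is not the method used by any listed vendor: the function names, the tokenization rule, and the sample texts are all assumptions made for illustration.

```python
from collections import Counter
import re

# Assumed tokenization rule: lowercase runs of letters, digits, and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def ngrams(tokens, n):
    """Return every n-gram in a token list as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_relative_frequency(docs, phrase):
    """Map each year to (occurrences of phrase) / (all n-grams of that length).

    docs is an iterable of (year, text) pairs. Matching is case-insensitive.
    Years whose texts are too short to contain any n-gram are left out.
    """
    target_tokens = TOKEN_PATTERN.findall(phrase.lower())
    n = len(target_tokens)
    if n == 0:
        raise ValueError("phrase must contain at least one word")
    target = " ".join(target_tokens)

    hits = Counter()
    totals = Counter()
    for year, text in docs:
        grams = ngrams(TOKEN_PATTERN.findall(text.lower()), n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)

    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}


# Toy corpus, invented for illustration only.
sample_docs = [
    (1900, "The steam engine changed the city. The steam engine was loud."),
    (1900, "A quiet street."),
    (1950, "The jet engine replaced the steam engine."),
]

frequencies = yearly_relative_frequency(sample_docs, "steam engine")
for year, share in frequencies.items():
    print(year, round(share, 4))
```

Dividing by the total number of n-grams for the year, rather than reporting raw counts, keeps years with more digitized text from dominating the chart. On a real corpus, the (year, text) pairs would come from whichever resource you are mining, subject to that vendor's volume and frequency limits.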