The following resources allow free content mining. Restrictions on volume or frequency are noted where they apply. Vendors and platforms regularly change their permissions, so let us know if you have trouble accessing the resources for your mining research.
Primary sources in humanities and social sciences (via API)
|arXiv Bulk Data Access||Supported by Cornell University|
|Association for Computing Machinery (ACM)||ACM Digital Library; ACM Transactions. Contact USC Libraries to initiate the process.|
|Astrophysics Data System (ADS)||Digital library portal for researchers in astronomy and physics, operated by the Smithsonian Astrophysical Observatory (SAO).|
Corpus of peer-reviewed research, covered by open access license agreement allowing free distribution and re-use of the full-text article, including the highly structured XML version.
|BYU Google Books Viewer||Search longer strings of words from the Google Books corpus. Offers the same corpora as N-Grams.|
|Chronicle (visualizing language in NY Times)||Visualizes language usage in the New York Times. See also the Archives page and the Research and Development page.|
|Chronicling America API||Optical Character Recognition (OCR) bulk downloads of Library of Congress' digitized historical newspapers.|
Computational lexicology that studies human behavior and cultural trends through the quantitative analysis of digitized texts.
|Digital Public Library of America (DPLA) API||Provides access to a wide range of content from America's museums, libraries, and archives. Offers metadata on items and collections. All DPLA data in the DPLA repository is available for download as zipped JSON files.|
|English Corpora||Highly popular corpora of English language data sets for TDM|
|Early English Books Online (EEBO)||University of Michigan Text Creation Partnership|
|Europeana API||Openly licensed thematic datasets and 4 API options.|
|Factiva||Contact USC Libraries to initiate the process. Users can batch download up to 100 articles per session. Any larger project incurs a $20,000 fee per project (as of May 2020).|
|Google Books Ngram Viewer||Charts the frequency of any word or short phrase using yearly counts of n-grams found in sources printed from 1500 to the present. Texts in American English, British English, French, German, Spanish, Russian, Hebrew, and Chinese.|
|HathiTrust data + API||Free data sets on Google-digitized and non-Google-digitized volumes. USC Libraries is a member of HathiTrust.|
|Humanities Data||Data sets for digital research through Michigan State University.|
|Internet Archive||Full-text of over 8 million eBooks and texts in the public domain.|
|Lexis Nexis||Contact USC Libraries to initiate the process. Does not have a large TDM program, but since articles can be downloaded, TDM is possible. Users can batch download up to 500 articles in a single text file (as of November 2019).|
|New York Times article search API||Search and mine articles, 1851-present|
|NextBio||Contact USC Libraries to initiate the process.|
|Project Gutenberg Online catalog||Catalog of over 60,000 free books|
|Public Library of Science (PLOS)||Over 200,000 articles in a variety of science subjects|
|PubMed Central text mining tools||Variety of tools supported by National Library of Medicine (NLM) and National Center for Biotechnology Information (NCBI)|
|Top Ten LDC Corpora||Linguistic Data Consortium supported by the University of Pennsylvania|
|Oxford Text Archive||Literary and linguistic data supported by Oxford University's Bodleian Library|
|Online Books page||Over 3 million books available. Supported by the University of Pennsylvania|
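
For readers new to text mining, the yearly n-gram counts that the Google Books Ngram Viewer charts can be approximated on a small scale in a few lines of Python. This is only a minimal sketch under stated assumptions: the function names, the simple regular-expression tokenizer, and the three-sentence sample corpus are all made up for illustration, and none of it reflects how any resource listed above is actually implemented.

```python
import re
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_ngram_counts(docs, n=2):
    """Count n-grams per publication year.

    docs is an iterable of (year, text) pairs. This in-memory layout is
    an assumption for the example, not the format of any real data set.
    """
    counts = defaultdict(Counter)
    for year, text in docs:
        # Naive tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[year].update(ngrams(tokens, n))
    return counts


def relative_frequency(counts, phrase, year):
    """Share of all n-grams in the given year that match the phrase."""
    year_counts = counts.get(year, Counter())
    total = sum(year_counts.values())
    return year_counts[phrase] / total if total else 0.0


# Hypothetical sample corpus, invented for this sketch.
sample = [
    (1900, "The steam engine changed the steam age."),
    (1900, "A steam engine is loud."),
    (2000, "The search engine changed the web."),
]

counts = yearly_ngram_counts(sample, n=2)
for year in sorted(counts):
    print(year, relative_frequency(counts, "steam engine", year))
```

Plotting the relative frequency per year, rather than the raw count, keeps years with more digitized text from dominating the chart. A real project would replace the sample list with text retrieved from one of the resources above, within that provider's download limits.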