Content Mining: Databases that support content mining

This guide provides information about freely available text mining resources and tools, and indicates whether the Libraries' subscription databases support content mining.

USC Libraries' Databases that support content mining

Most of the Libraries' databases do not allow text or data mining (TDM) because of license agreements. We will continue to work with database vendors to include TDM rights in future license agreements. The resources listed here are the current exceptions. If you do not see a resource listed here, please contact us and we can investigate further.

Vendor Fee? Details Help/Guides Examples 
Adam Matthew FREE

Contact USC Libraries to initiate the process

All databases from Adam Matthew (which digitize unique primary source collections) are available for mining. 
From license: "the Licensee and the Authorised Users may use the Licensed Materials to perform and engage in text mining /data mining activities in relation to the Licensed Materials for legitimate academic research and other non-commercial educational purposes, without obtaining the Licensor’s prior written consent"

Association for Computing Machinery

(ACM Digital Library, ACM Transactions)

FREE The individual researcher negotiates and signs directly with the provider.
Early English Books Online (EEBO)  FREE

Early English Books Online - Text Creation Partnership: The Text Creation Partnership creates standardized, accurate XML/SGML-encoded electronic text editions of early print books.

Phase I content (25,000 titles) is freely available and searchable. USC does not have full-text access to Phase II content (we are not a partner library for this phase of transcription).

Gale 
(Primary source collections only)
Some features are free. Downloading large datasets costs $500-$3,500 per collection.

Gale Artemis: Primary Sources, which searches across 23 of our Gale primary source databases covering 1500-2012, has a Term Frequency search option and Term Clusters viewer (available from the articles results list).

  • View results over time by entering a word or phrase.
  • Compare multiple terms.
  • Graph either the frequency of your search term (the number of documents per year) or its popularity (the % of total documents each year).
  • Click on a point on the graph to retrieve search results for that year.
  • Click and drag to select a time period to zoom in on.
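
The difference between the two graph options can be shown with a small calculation. This is only an illustrative sketch: the function name and the yearly counts are made up, and it is not how Gale computes its graphs internally. It treats frequency as the number of matching documents in a year and popularity as that number expressed as a percentage of all documents for the year.

```python
def term_frequency_and_popularity(matches_per_year, totals_per_year):
    """Return {year: (frequency, popularity_percent)} for each year.

    frequency  = number of documents that year containing the term
    popularity = frequency as a percentage of all documents that year
    """
    result = {}
    for year, total in totals_per_year.items():
        matches = matches_per_year.get(year, 0)
        popularity = 100.0 * matches / total if total else 0.0
        result[year] = (matches, popularity)
    return result


# Made-up counts, for illustration only.
matches = {1850: 12, 1851: 30}
totals = {1850: 400, 1851: 600}

stats = term_frequency_and_popularity(matches, totals)
for year in sorted(stats):
    frequency, popularity = stats[year]
    print(year, frequency, f"{popularity:.1f}%")
```

A term can rise in frequency simply because more documents survive from a given year, so popularity is usually the safer measure when comparing years with very different collection sizes.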

To download large datasets, USC Libraries will have to request the data on your behalf from our Gale sales representative. It can take up to 3 weeks to process requests. Gale will send a hard drive with the requested data to the Libraries for you to use.

video: Using Term Frequency & Term Clusters [2:59]

Data Mining the Gale Digital Collections Frequently Asked Questions (2014)

Hathi Trust FREE Individuals can access public domain materials for content mining projects through the Hathi Trust Research Center. Hathi Trust Research Center Analysis
JSTOR FREE

Data for Research (DfR) provides a self-service system for text mining. By creating a free DfR account, you can download the metadata, word frequencies, citations, key terms, and N-grams of up to 1,000 documents. To get larger datasets (more than 1,000 documents) or a type of data not available through the main site, contact JSTOR directly: support@ithaka.org

Introduction to using DfR from DH @ Washington and Lee University

Gender composition of scholarly publications (1665 - 2011)
IEEE Cost negotiated per request The library facilitates requests on a case-by-case basis by negotiating the vendor license.
LexisNexis FREE Does not "officially" support or provide data/text mining options. However, since text files can be downloaded, TDM is possible. You can batch download up to 500 articles at a time in one text file. 
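
Because of the 500-article limit, a larger result set has to be downloaded in several passes. The sketch below shows one way to plan those passes. The article identifiers are hypothetical placeholders, and the code only splits a list into groups; it does not contact LexisNexis.

```python
def batches(items, size=500):
    """Yield consecutive slices of `items` holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical identifiers standing in for 1,200 search results.
article_ids = [f"article-{n}" for n in range(1, 1201)]

plan = list(batches(article_ids))
print([len(batch) for batch in plan])
```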
NextBio FREE (via subscription)
Oxford English Dictionary (OED) FREE Oxford University Press grants research access to the Corpus for academic projects that can demonstrate a strong practical need for this data. To apply for research access to the Corpus, fill out and email this application form. The Oxford English Corpus Sketch Engine Documentation
Oxford University Press  FREE

Researchers are not required to request permission for non-commercial text mining of OUP content. However, OUP offers a consultation service with a technical project manager to assist in planning your TDM project, including how to avoid triggering the technical safeguards OUP has in place to protect the stability and security of its websites.

To request a consultant for your TDM project, please e-mail Data.Mining@oup.com

ProQuest Cost negotiated per request

You can contact ProQuest directly to negotiate arrangements for text mining their content. 

ProQuest does allow free text mining for the newspapers to which USC Libraries have purchased perpetual access licenses. Those newspapers are: Los Angeles Times (1881-1931), Los Angeles Sentinel (1934-2005), and New York Times (1851-1934). USC Libraries will have to request this data on your behalf. 

ProQuest offers two methods of data delivery:

  1. Online cloud-based delivery - a link to a sitemap is sent from which the documents can be downloaded.
  2. Hard drive - the data is mailed on a hard drive.
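
For the cloud-based option, the document links can be pulled out of the sitemap with a few lines of code. The sketch below assumes the sitemap follows the standard sitemaps.org XML format (a urlset of url/loc elements). ProQuest's actual file may differ, and the sample URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol (an assumption about
# the file ProQuest sends; check the real file before relying on it).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def document_urls(sitemap_xml):
    """Return every <loc> URL found in a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Placeholder sitemap, for illustration only.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/docs/0001.xml</loc></url>
  <url><loc>https://example.org/docs/0002.xml</loc></url>
</urlset>"""

for url in document_urls(sample):
    print(url)
```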
Robots Reading Vogue

ScienceDirect 

(Elsevier)

FREE (with subscription) You can text mine all subscribed content so long as it is for non-commercial purposes. You do this via Elsevier's ScienceDirect APIs. You must register first to use these APIs. For access to data not available through the APIs, researchers must contact Elsevier directly to negotiate: integrationsupport@elsevier.com
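
As a rough idea of what a request to these APIs can look like once you have registered, the sketch below builds (but does not send) a full-text request for one article. The endpoint path, the X-ELS-APIKey header name, and the Accept value are assumptions that should be checked against Elsevier's current developer documentation. The DOI and the key are placeholders.

```python
import urllib.parse
import urllib.request

# Assumed base URL for retrieving an article by DOI; verify before use.
API_BASE = "https://api.elsevier.com/content/article/doi/"


def build_article_request(doi, api_key):
    """Build an HTTP GET request for one article's full text as XML."""
    url = API_BASE + urllib.parse.quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # assumed header name for the registered key
        "Accept": "text/xml",     # assumed: ask for the XML full text
    }
    return urllib.request.Request(url, headers=headers)


# Placeholder DOI and key, for illustration only.
request = build_article_request("10.1016/j.example.2020.01.001", "YOUR-API-KEY")
print(request.full_url)

# To send the request from a subscribed network:
#     with urllib.request.urlopen(request) as response:
#         xml_text = response.read().decode("utf-8")
```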

FAQs

How To Guide for text mining

Elsevier's Text & Data Mining Glossary

SpringerLink FREE (with subscription)

TDM rights for non-commercial research are now included in new and renewed subscription agreements.

You can download subscribed and open access content for TDM purposes directly from the SpringerLink platform.

Full-text content can also be accessed via friendly URLs:

PDF: http://link.springer.com/[DOI].pdf

HTML (when available): http://link.springer.com/[DOI].html

Content can be downloaded via a web browser or with an HTTP GET request using a scripting tool such as curl, wget and Python’s urllib, among others. Note that the tool should be enabled to follow HTTP 301, 302 and 303 redirects. See example below.

No API key or other authentication is required. TDM researchers are requested to be considerate and limit their downloading speed to a reasonable rate.
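
A minimal Python sketch of this approach, using only the standard library, is shown here. The function names, the delay value, and the DOI in the usage comment are illustrative choices, not part of Springer's instructions. urllib follows HTTP 301, 302 and 303 redirects by default, and the pause between requests keeps the download rate modest.

```python
import time
import urllib.request


def springer_url(doi, fmt="pdf"):
    """Build the friendly full-text URL for a DOI ("pdf" or "html")."""
    if fmt not in ("pdf", "html"):
        raise ValueError("fmt must be 'pdf' or 'html'")
    return f"http://link.springer.com/{doi}.{fmt}"


def download(dois, fmt="pdf", delay_seconds=2.0):
    """Fetch each DOI with an HTTP GET and save it in the current folder."""
    for doi in dois:
        url = springer_url(doi, fmt)
        # urlopen follows 301, 302 and 303 redirects automatically.
        with urllib.request.urlopen(url) as response:
            data = response.read()
        filename = doi.replace("/", "_") + "." + fmt
        with open(filename, "wb") as handle:
            handle.write(data)
        # Pause between requests to limit the download rate.
        time.sleep(delay_seconds)


# Usage (placeholder DOI; run from a network with subscription access):
#     download(["10.1007/s00000-000-0000-0"])
```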