Content Mining: Databases that support content mining

This guide provides information about available text mining resources and tools, and about whether the Libraries' subscription databases support content mining.

USC Libraries' Databases that support content mining

Most of the Libraries' databases do not allow text and data mining (TDM) because of their license agreements. We will continue to work with database vendors to include TDM in future license agreements. The resources listed here are the current exceptions. If you do not see a resource listed here, please contact us and we can investigate further.

Vendor | Fee? | Details | Help/Guides | Examples
Adam Matthew FREE

Contact USC Libraries to initiate the process.

All primary source databases from Adam Matthew are available for mining. 
From license: "the Licensee and the Authorized Users may use the Licensed Materials to perform and engage in text mining /data mining activities in relation to the Licensed Materials for legitimate academic research and other non-commercial educational purposes, without obtaining the Licensor’s prior written consent"

Association for Computing Machinery

(ACM Digital Library, ACM Transactions)

FREE

Contact USC Libraries to initiate the process.

The individual researcher negotiates and signs directly with the provider.

Cambridge University Press Cost negotiated per request

Contact USC Libraries to initiate the process.
Early English Books Online (EEBO)  FREE

Contact USC Libraries to initiate the process.

Early English Books Online - Text Creation Partnership: The Text Creation Partnership creates standardized, accurate XML/SGML-encoded electronic text editions of early print books.

Phase I content (25,000 titles) is freely available/searchable. 

Gale 
(Primary source collections only)
Some free. Downloading large datasets incurs costs.

Contact USC Libraries to initiate the process.

Gale Artemis: Primary Sources, which searches across 23 of our Gale primary source databases covering 1500-2012, has a Term Frequency search option and Term Clusters viewer (available from the articles results list).

To download large datasets, USC Libraries will have to request the data on your behalf from our Gale sales representative. Requests can take up to three weeks to process.

Video: Using Term Frequency & Term Clusters [2:59]

Data Mining the Gale Digital Collections Frequently Asked Questions (2014)

HathiTrust FREE

USC is a member of the HathiTrust Research Center.

HathiTrust Research Center Analysis
JSTOR  FREE

Contact USC Libraries to initiate the process.

Data for Research (DfR) - Provides a self-service system for text mining. By creating a free DfR account, you can download the metadata, word frequencies, citations, key terms, and N-grams of up to 1,000 documents. To get larger datasets (more than 1,000 documents) or a type of data not available through the main site, talk to a USC Librarian.
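To make the terms "word frequencies" and "N-grams" concrete, the following is a minimal sketch that computes both from a short sample sentence. It does not use the DfR service or its file formats; DfR supplies these counts already computed. The function names, the tokenizing rule, and the sample text are assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative sample text, not taken from any database.
sample = "the library supports text mining and the library supports data mining"

tokens = tokenize(sample)
word_freq = Counter(tokens)               # word frequencies (1-grams)
bigram_freq = Counter(ngrams(tokens, 2))  # 2-gram frequencies

print(word_freq.most_common(3))
print(bigram_freq.most_common(2))
```

A real project would likely need a more careful tokenizer (hyphenation, punctuation, non-English text) than the simple pattern used here.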

Introduction to using DfR from DH @ Washington and Lee University

Gender composition of scholarly publications (1665 - 2011)
IEEE Cost negotiated per request

Contact USC Libraries to initiate the process.

The library facilitates access on a case-by-case basis by negotiating the vendor license.

LexisNexis FREE

Contact USC Libraries to initiate the process.

Does not "officially" support or provide data/text mining options. However, since text files can be downloaded, TDM is possible. You can batch download up to 500 articles at a time in one text file. 
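When a result set is larger than the 500-article limit, the downloads have to be split into several batches. The sketch below only does the arithmetic for planning those batches; it does not talk to LexisNexis, and the function name and the 1-based numbering of results are assumptions for illustration.

```python
def batch_ranges(total, batch_size=500):
    """Yield 1-based inclusive (start, end) ranges covering `total` results.

    The default batch_size of 500 matches the per-download limit noted above.
    """
    if total < 0 or batch_size < 1:
        raise ValueError("total must be >= 0 and batch_size must be >= 1")
    for start in range(1, total + 1, batch_size):
        yield start, min(start + batch_size - 1, total)


# Example: a search that returned 1,200 articles needs three downloads.
for start, end in batch_ranges(1200):
    print(f"Download results {start}-{end}")
```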

NextBio FREE (via subscription)

Contact USC Libraries to initiate the process.

Oxford English Dictionary (OED) FREE

Contact USC Libraries to initiate the process.

Oxford University Press grants research access to the Corpus for academic projects that can demonstrate a strong practical need for this data. 

The Oxford English Corpus

Sketch Engine Documentation
Oxford University Press  FREE

Contact USC Libraries to initiate the process.

Researchers are not required to request permission for non-commercial text mining of OUP content. However, OUP offers a consultation service with a technical project manager to help plan your TDM project, including how to avoid triggering the technical safeguards OUP has in place to protect the stability and security of its websites.

ProQuest Cost negotiated per request

Contact USC Libraries to initiate the process.

ProQuest does allow free text mining for the newspapers to which USC Libraries have purchased perpetual access licenses. USC Libraries will have to request this data on your behalf. 

Robots Reading Vogue

ScienceDirect 

(Elsevier)

FREE (with subscription)

Contact USC Libraries to initiate the process.

You can text mine all subscribed content, provided it is for non-commercial purposes. You do this via Elsevier's ScienceDirect APIs.

FAQs

How To Guide for text mining

Elsevier's Text & Data Mining Glossary

SpringerLink FREE (with subscription)

Contact USC Libraries to initiate the process.

TDM rights for non-commercial research are now included in new and renewed subscription agreements.

You can download subscribed and open access content for TDM purposes directly from the SpringerLink platform.

Full-text content can also be accessed via friendly URLs:

PDF: http://link.springer.com/[DOI].pdf

HTML (when available): http://link.springer.com/[DOI].html

Content can be downloaded via a web browser or with an HTTP GET request using a scripting tool such as curl, wget and Python’s urllib, among others. Note that the tool should be enabled to follow HTTP 301, 302 and 303 redirects. See example below.

No API key or other authentication is required. TDM researchers are requested to be considerate and limit their downloading speed to a reasonable rate.
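As one possible illustration of the download steps above, here is a minimal sketch using Python's urllib. It builds the friendly URL from a DOI and pauses between requests. The function names, the five-second delay, the output file naming, and the sample DOI in the usage note are assumptions for illustration, not Springer requirements.

```python
import time
import urllib.request
from urllib.parse import quote

# URL pattern taken from the friendly URLs listed above.
BASE_URL = "http://link.springer.com/"


def springer_url(doi, fmt="pdf"):
    """Build the friendly full-text URL for a DOI ('pdf' or 'html')."""
    if fmt not in ("pdf", "html"):
        raise ValueError("fmt must be 'pdf' or 'html'")
    # Keep the '/' between the DOI prefix and suffix; escape unusual characters.
    return f"{BASE_URL}{quote(doi, safe='/')}.{fmt}"


def download(dois, fmt="pdf", delay_seconds=5.0):
    """Fetch each DOI in turn and save it to the current directory."""
    for i, doi in enumerate(dois):
        if i:
            # Assumed delay; choose a rate that stays considerate.
            time.sleep(delay_seconds)
        # urlopen follows HTTP 301, 302 and 303 redirects by default.
        with urllib.request.urlopen(springer_url(doi, fmt), timeout=60) as resp:
            data = resp.read()
        filename = doi.replace("/", "_") + "." + fmt
        with open(filename, "wb") as out:
            out.write(data)
```

Calling `download(["10.1007/s00000-000-0000-0"])` with real DOIs in place of this placeholder would save one file per DOI. HTML is not available for every item, so a request for `fmt="html"` may fail for some DOIs.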