Corpora and Text/Data Mining For Digital Humanities Projects

A guide to the process of creating a large collection of text or speech data and computationally analyzing it to extract meaningful insights.

INDEX THOMISTICUS - ROBERTO BUSA

The concept of text mining was first successfully brought to fruition with the Index Thomisticus, an enormous index verborum, or concordance, of 179 texts centered on the works of Thomas Aquinas. This pioneering 30-year initiative was conceived in 1946 by Roberto Busa (November 28, 1913 – August 9, 2011), an Italian Jesuit priest from Gallarate, Italy. In 1949 Busa, in partnership with IBM, began encoding the writings of Thomas Aquinas, as well as Latin works from the ninth to the sixteenth centuries, on IBM punch cards.

The first volume of this pioneering large-scale humanities computing effort (an inaugural digital humanities project) was published in 1974, and the work was completed in 56 printed volumes in 1980.

In 2005 a web-based version of the Index Thomisticus, designed and programmed by E. Alarcón and E. Bernot in collaboration with Busa, appeared online. In 2006 the Index Thomisticus Treebank (directed by Marco Passarotti) began the syntactic annotation of the entire corpus.

JOURNALS

Journal of Data Mining and Digital Humanities - Peer-reviewed, open access; began publication in 2014. The journal focuses on the intersection of computing and the disciplines of the humanities, drawing on tools provided by computing such as data visualization, information retrieval, statistics, and text mining. It publishes scholarly work that reaches beyond the traditional humanities.

Revista de Humanidades Digitales (RHD) - Peer-reviewed, open access since 2017. Devoted to research and scholarship in digital editions of texts; digital libraries, archives, and memory; examination and analysis of multimedia resources; text mining and data mining, stylometry, topic modeling, and sentiment analysis; georeferencing, maps, and visualization tools; corpus linguistics and Natural Language Processing (NLP); and digital media, digitization, curatorship, and preservation of digital objects. Abstracts are in Spanish and English. Articles are chiefly in Spanish, with some in English, Portuguese, Italian, and French. Published by Spain's Universidad Nacional de Educación a Distancia (UNED).

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery - Vol. 1: 2011. An interdisciplinary journal that includes "Text Mining" among its topics for publication.

SELECTED READINGS

Much has been written about Busa's project and his contribution to the beginnings of the Digital Humanities. This list of readings constitutes a very brief overview of the topic.

Biffi, I. (1974). "L'ordinateur au service de la compréhension de S. Thomas: L'Index Thomisticus," Bulletin de Philosophie Médiévale (Louvain), 16: 152.

Busa, R. (1980). “The Annals of Humanities Computing: The Index Thomisticus,” Computers and the Humanities, 14 (2), (Oct., 1980): 83-90.

Friedrich Frommann Verlag [Publisher] (1973). "Index Thomisticus," Computers and Medieval Data Processing - Informatique et Études Médiévales, 3(2): 60-62.

Hisette, Roland (1977). "Etat de l'Index Thomisticus," Bulletin de Philosophie Médiévale, 19: 68.

Jockers, Matthew L. & Ted Underwood (2016). "Text Mining in the Humanities," in S. Schreibman, R. Siemens, & J. Unsworth (Eds.), A New Companion to Digital Humanities (1st ed., pp. 305-320). John Wiley and Sons.

Joo, S., Hootman, J., & Katsurai, M. (2022). "Exploring the Digital Humanities Research Agenda: A Text Mining Approach," Journal of Documentation, 78(4): 853-870.

Judy, Albert G. (1974). "The Index Thomisticus: St. Thomas and IBM," Listening, 9(1): 105-118.

Norman, Jeremy. "Publication of Roberto Busa's Index Thomisticus: Forty Years of Data Processing in the Humanities - 1974 to 1980" (last updated January 3, 2024; accessed March 4, 2024).

Schmidt, Robert W. (1976). "An Historic Research Instrument: The Index Thomisticus," The New Scholasticism, 50(2): 237-249.

Sprokel, Nico (1978). "The Index Thomisticus," Gregorianum, 59(4): 739-750.

Sula, Chris Alen and Heather V Hill (2019). “The Early History of Digital Humanities: An Analysis of Computers and the Humanities (1966–2004) and Literary and Linguistic Computing (1986–2004),” Digital Scholarship in the Humanities, Volume 34, Issue Supplement 1, December 2019, pp. 190–206. 

Truyens, M., & Van Eecke, P. (2014). "Legal Aspects of Text Mining," Computer Law & Security Review, 30(2): 153-170.

TEXT PREPARATION TOOLS

OpenRefine: An open source desktop application, formerly called Google Refine, for cleaning up data, transforming it between file formats, and extending it with web services and external data.
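OpenRefine itself is operated through a point-and-click interface, but the kind of cleanup it performs (trimming whitespace, normalizing inconsistent values, converting between formats) can be sketched in a few lines of Python. The sketch below uses the pandas library; the input file messy.csv and its title column are hypothetical examples, not part of OpenRefine.

    # A minimal data-cleanup sketch in Python/pandas, analogous to
    # common OpenRefine operations ("messy.csv" and its "title"
    # column are hypothetical).
    import pandas as pd

    df = pd.read_csv("messy.csv")

    # Trim surrounding whitespace and collapse internal runs of spaces.
    df["title"] = df["title"].str.strip().str.replace(r"\s+", " ", regex=True)

    # Normalize case so near-duplicate values line up.
    df["title"] = df["title"].str.lower()

    # Drop exact duplicates, then export to a different file format.
    df = df.drop_duplicates()
    df.to_json("clean.json", orient="records")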

Trifacta: A software company headquartered in San Francisco with offices in Boston, Berlin, and London. Founded in October 2012, Trifacta develops data wrangling software for data exploration and self-service data preparation. (Data wrangling, similar to ETL (Extract, Transform, and Load), is the process of combining data from multiple sources into a large, central repository called a data warehouse.) It is a user-friendly, AI-driven platform that provides a fast way to clean, structure, and enrich raw data into a more usable form.

Trifacta is designed for analysts to explore, transform, and enrich raw data into clean, structured formats. It draws on techniques from machine learning, data visualization, human-computer interaction, and parallel processing so that non-technical users can prepare data for a variety of business processes. Trifacta works with both cloud and on-premises data platforms.

TEXT MINING TOOLS

There are many ready-to-use digital tools for conducting text mining research. The listing below includes a few of the most popular ones:

AntWord Profiler: A freeware tool for profiling the vocabulary level and complexity of texts. It is available as a free download for Windows, Mac OS X, and Linux.
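As a rough illustration of what vocabulary profiling involves, the hedged sketch below computes a type-token ratio and the share of words falling outside a small list of common words. The text and word list are invented examples, not AntWord Profiler's own reference lists.

    # A rough vocabulary-profiling sketch: type-token ratio plus the
    # proportion of words outside a toy "common words" list.
    common = {"the", "of", "and", "a", "in", "was", "to"}
    text = "the index of the works of aquinas was a monumental concordance"

    tokens = text.lower().split()
    types = set(tokens)

    ttr = len(types) / len(tokens)  # lexical variety
    rare = sum(1 for t in tokens if t not in common) / len(tokens)

    print(f"type-token ratio: {ttr:.2f}")
    print(f"share of less common words: {rare:.2f}")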

ConText: Developed at the School of Information Sciences at Illinois, ConText is a free, open-source application for performing a variety of text analysis techniques, including network graphs and topic models, on textual data.

Gephi: An open graph visualization platform that supports exploration of all kinds of networks and complex systems. Gephi can be downloaded for free onto any Linux, Windows, or Mac OS X device.
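Gephi opens several standard graph formats, including GEXF. As an illustration of how a text-derived network might reach Gephi, the sketch below builds a small word co-occurrence graph in Python with the networkx library and writes it to a GEXF file; the sample sentences are invented.

    # Build a toy word co-occurrence network and export it for Gephi.
    # Requires the networkx package; the sentences are invented.
    import itertools
    import networkx as nx

    sentences = [
        "the index thomisticus was a pioneering project",
        "the project encoded texts on punch cards",
    ]

    G = nx.Graph()
    for sentence in sentences:
        words = set(sentence.split())
        # Connect every pair of words that co-occur in a sentence.
        for a, b in itertools.combinations(sorted(words), 2):
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1
            else:
                G.add_edge(a, b, weight=1)

    nx.write_gexf(G, "cooccurrence.gexf")  # open this file in Gephi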

Mallet (MAchine Learning for LanguagE Toolkit):  A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
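Mallet is driven from the command line rather than from a script, but the core technique it is best known for, LDA topic modeling, can be sketched in Python with scikit-learn standing in for Mallet. The toy documents below are invented, and a real corpus would need far more text.

    # A minimal LDA topic-modeling sketch using scikit-learn as a
    # stand-in for Mallet's topic-training step. Toy documents.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "aquinas wrote extensively on theology and philosophy",
        "punch cards stored the encoded latin texts",
        "philosophy and theology dominate the medieval corpus",
        "the corpus was encoded and stored for analysis",
    ]

    # Convert the documents into a word-count matrix.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)

    # Fit a two-topic model (real corpora need many more documents).
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(counts)

    # Print the five highest-weighted words in each topic.
    words = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [words[j] for j in topic.argsort()[-5:]]
        print(f"topic {i}: {top}")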

PhiloLogic: A full-text search, retrieval, and analysis tool developed by the ARTFL Project and the Digital Library Development Center (DLDC) at the University of Chicago. It is free software that can be downloaded for a wide range of systems.

Scrapy: An open source and collaborative framework for extracting the data you need from websites. It is available as a free download for Linux, Windows, and Mac OS X.
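The heart of a Scrapy project is a small spider class. The hedged sketch below shows the general shape of one; the start URL and the CSS selector are placeholders rather than a real target site. Saved as spider.py, it could be run with: scrapy runspider spider.py -o texts.json

    # A minimal Scrapy spider sketch. The URL and selector are
    # hypothetical placeholders.
    import scrapy

    class TextSpider(scrapy.Spider):
        name = "texts"
        start_urls = ["https://example.com/texts"]

        def parse(self, response):
            # Yield one item per paragraph of text found on the page.
            for paragraph in response.css("p::text").getall():
                yield {"text": paragraph}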

Textal: A free smartphone app that allows you to analyze websites, tweet streams, and documents to explore the relationship between words in the text via an intuitive word cloud interface. The app allows you to generate graphs and statistics, as well as share the data and visualizations in any way you like. Textal is available as a free download from the App Store on your Apple iOS device.

TXM: A free, open-source, cross-platform Unicode- and XML-based text/corpus analysis environment and graphical client. It is available as a free download for Windows, Linux, and Mac OS X, and it offers a comprehensive range of analysis tools, such as concordances, collocate search, and frequency lists.
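Collocate search, one of the analyses TXM (and AntConc, below) provides, amounts to counting the words that appear within a fixed window around a node word. A plain-Python sketch of the idea, on an invented sample text:

    # Count collocates: words occurring within +/- 3 tokens of a node
    # word. The text and node word are toy examples.
    from collections import Counter

    text = ("the index was printed and the index was searched "
            "and the index grew")
    node, window = "index", 3

    tokens = text.split()
    collocates = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            collocates.update(left + right)

    print(collocates.most_common(5))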

TEXT ANALYSIS TOOLS

PORTAL: TAPoR 3.0 - Text Analysis Portal for Research - TAPoR is a gateway to the tools used in sophisticated text analysis and retrieval. It is a directory of digital humanities tools organized by type. The former DiRT (Digital Research Tools) Directory has been merged into TAPoR.

With TAPoR 3.0 you can: 

  • Discover text manipulation, analysis, and visualization tools
  • Discover historic tools
  • Read tool reviews and recommendations
  • Learn about papers, articles and other sources about specific tools
  • Tag, comment, rate, and review collaboratively
  • Browse lists of related tools in order to discover tools more easily

Broad domains of tools and services currently included in TAPoR: browser extensions; communication tools; development tools; GIS tools; photo/video/audio editors and related tools; publishing tools; repositories and archiving tools. For a large number of curated tool lists in TAPoR, see: Curated Tool Lists.

AntConc: A free concordance program available for Windows, Mac OS X, and Linux operating systems. AntConc has evolved from a simple concordance program into a powerful tool for textual analysis. It can perform the following types of linguistic analysis: concordance, concordance plot, clusters, n-grams, collocates, word frequency, and keyword lists.
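AntConc is a desktop program, but its core display, the keyword-in-context (KWIC) concordance, is easy to sketch in plain Python. The sample text and keyword below are invented.

    # A minimal keyword-in-context (KWIC) concordance sketch, the kind
    # of display a concordancer produces. Toy text and keyword.
    text = ("in 1949 busa began encoding the works of aquinas and "
            "the works were printed in many volumes")
    keyword, context = "works", 3

    tokens = text.split()
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            print(f"{left:>25} | {token} | {right}")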

Lexos: A web-based platform for visualizing large text sets. The site has capabilities to upload multiple files and to prepare, visualize, and analyze your data. Its visualization tools include word clouds, MultiCloud, BubbleViz, and rolling-window graphs. Its analysis tools include statistical analysis, clustering, similarity queries, and TopWord.
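A similarity query of the kind Lexos offers is commonly implemented as cosine similarity over TF-IDF vectors. The sketch below shows that general technique with scikit-learn on invented documents; it illustrates the idea, not Lexos's internals.

    # Document similarity via TF-IDF vectors and cosine similarity,
    # the usual technique behind "similarity query" features.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the concordance indexes every word in the corpus",
        "every word of the corpus appears in the index",
        "punch cards were used to encode the texts",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)
    sims = cosine_similarity(tfidf[0], tfidf)  # compare doc 0 to all
    print(sims.round(2))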

Textanalyser: A free online text analysis tool that generates statistics about your text and lets you find the most frequent phrases and word frequencies. It provides instant results for word groups, sentences, syllables, keyword density, the prominence of a word or expression, and word count. Non-English texts are supported.

Voyant: A web-based reading and analysis tool for digital texts. The tool allows you to type in multiple URLs, paste in full text, or upload your own files for analysis. The site is a collaborative project by Stefan Sinclair and Geoffrey Rockwell specifically built for digital humanities projects. The site also provides helpful instruction guides for getting started and additional information about other Voyant tools.

What you can do with Voyant:

  • Use it to learn how computer-assisted analysis works. Check out the examples that show you how to do real academic tasks with Voyant.
  • Use it to study texts that you find on the web or texts that you have carefully edited and have on your computer.
  • Use it to add functionality to your online collections, journals, blogs or web sites so others can see through your texts with analytical tools.
  • Use it to add interactive evidence to your essays that you publish online. Add interactive panels right into your research essays (if they can be published online) so your readers can recapitulate your results.
  • Use it to develop your own tools using Voyant's functionality and code.

See: "Introduction to Text Analysis using Voyant," presented by David Sye, University Libraries, Murray State University.

Word and Phrase: An online text analysis tool with a variety of capabilities for analyzing text. To access it, click on "Login". Text can be copied and pasted into a text box, or you can draw on the data from the Corpus of Contemporary American English (COCA) (Brigham Young University). The tool first highlights all of the medium- and lower-frequency words in the text and creates lists of those words. Second, any of the words can be clicked to create a "word sketch" showing its definition and detailed information from the COCA. Finally, the tool can conduct powerful searches on selected phrases and show related phrases in the COCA. The COCA corpus contains more than one billion words of text (25+ million words each year from 1990 to 2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, and (with the March 2020 update) TV and movie subtitles, blogs, and other web pages.

TEXT ANALYSIS TUTORIALS

Programming Historian: Lessons - Tutorials on a range of text analysis tools and techniques, including mapping, web scraping, cleaning and transforming data, sentiment analysis, stylometry, basic text processing in R, text mining in Python, and topic modeling with Mallet.

Text Analysis with Topic Models for the Humanities & Social Sciences:  A series of tutorials covering basic procedures in quantitative text analysis. The tutorials cover the preparation of a text corpus for analysis and the exploration of a collection of texts using topic models and machine learning. Includes sample datasets of British and French literary texts.