Corpora and Text/Data Mining For Digital Humanities Projects

A guide to creating a large collection of text or speech data and analyzing it computationally to extract meaningful insights.

WHAT IS A DIGITAL HUMANITIES CORPUS?

A "digital humanities corpus" is a large, electronically stored collection of texts, images, or other digital data curated and compiled for research in the digital humanities (DH). It allows scholars to analyze and interpret cultural artifacts using computational methods, often by applying techniques from corpus linguistics to explore patterns and trends across a wide range of sources.

A digital humanities corpus is: 

Large scale: It typically consists of a significant number of texts or digital objects, enabling analysis of broader trends and patterns beyond individual case studies. 

A representative selection of data: The data within the corpus is chosen to represent a specific time period, genre, or cultural context, ensuring the analysis is relevant to the research question. 

Digital in format: The corpus is stored electronically, allowing for efficient searching, analysis, and manipulation using specialized software tools. 

Examples of digital humanities corpora: 

Literary corpus: A collection of digitized texts from a specific era or genre of literature, used to study stylistic changes or authorial influences. 

Historical document corpus: A compilation of digitized historical documents like letters, newspapers, and government records, enabling analysis of social trends across time. 

Visual corpus: A collection of digitized images (paintings, photographs) categorized by subject matter or style, allowing for comparative analysis of visual representations. 

A few examples of uses of digital humanities corpora:

Text mining: Identifying recurring themes, keywords, and patterns within large datasets to understand broader cultural narratives. 

Network analysis: Visualizing relationships between entities (people, places, concepts) within a corpus to reveal connections and influence. 

Stylometry: Analyzing the writing style of an author to attribute authorship or identify stylistic shifts. 
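As a small illustration of the text mining use listed above, the following Python sketch counts recurring keywords across a tiny corpus. It is a minimal example under stated assumptions, not a recommended workflow: the sample texts, the short stopword list, and the function names are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stopword list (an assumption for this
# sketch); a real project would use a fuller, language-specific list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_counts(documents, stopwords=STOPWORDS):
    """Count every non-stopword token across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(t for t in tokenize(text) if t not in stopwords)
    return counts


# Invented sample texts standing in for a real corpus.
corpus = [
    "The whale was seen in the sea.",
    "A storm rose, and the sea swallowed the ship.",
]

counts = keyword_counts(corpus)
print(counts.most_common(3))
```

Raw counts like these are only a starting point. Dedicated corpus software such as the tools mentioned in this guide offers concordances, collocations, and keyword statistics on top of simple frequency.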

See the University of Birmingham's online information pack about corpus investigation techniques for the humanities:

     Unit 1: Introduction

     Unit 2: Compiling a corpus

     Unit 3: Available corpora and software 

     Unit 5: Case Studies - Compiling Literary Corpora

DATA COLLECTION - WHAT TO CONSIDER

Many texts that you find on the internet will have to be converted from HTML, Word, or PDF format to standard text files before you can explore them with software such as WordSmith or AntConc.
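For the HTML case, a conversion step might look like the Python sketch below, which uses only the standard library. It is a simplified illustration: the class and function names and the sample markup are assumptions made for this example, and real web pages usually need extra cleanup (menus, footers, adverts). Word and PDF files need other tools that are not shown here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Invented sample markup standing in for a downloaded page.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Letter</h1><p>Dear &amp; honoured friend,</p></body></html>"
)
plain = html_to_text(sample)
print(plain)
```

The resulting string can then be saved as a UTF-8 encoded .txt file, a format that plain-text corpus tools can generally open.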

If you look for texts on the internet, certain tools can help you search efficiently for texts on specific topics. WebBootCat and Sketch Engine are two examples of such tools.

Three videos about Sketch Engine can be found here:  

     Build a corpus from your own texts/data

     Build a corpus from the web

     Building your own corpus using Sketch Engine

The texts for the corpus have to be collected systematically, under controlled conditions, and in such a way that the corpus adequately represents the text type or genre to be studied. Keeping good records of the sources of your texts is essential.
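One possible way to keep such records is a simple manifest with one row per text. The Python sketch below writes one as CSV. The column names, file names, and example.org URLs are hypothetical placeholders chosen for illustration. Adapt the fields to whatever your research question requires.

```python
import csv
import io

# Hypothetical metadata fields (an assumption for this sketch).
FIELDS = ["filename", "source", "date_collected", "genre"]


def write_manifest(records, out):
    """Write one CSV row per corpus text to the open file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# Invented example records; the URLs are placeholders, not real sources.
records = [
    {
        "filename": "letter_001.txt",
        "source": "https://example.org/letters/1",
        "date_collected": "2024-01-15",
        "genre": "letter",
    },
    {
        "filename": "news_001.txt",
        "source": "https://example.org/news/1",
        "date_collected": "2024-01-16",
        "genre": "newspaper",
    },
]

# An in-memory buffer keeps the example self-contained; in practice you
# would pass a file opened with open("manifest.csv", "w", newline="").
buffer = io.StringIO()
write_manifest(records, buffer)
manifest_csv = buffer.getvalue()
print(manifest_csv)
```

Recording the source and collection date at the moment a text is added makes it much easier to document later how the corpus was sampled.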