Corpora and Text/Data Mining For Digital Humanities Projects

A guide to creating a large collection of text or speech data and analyzing it computationally to extract meaningful insights.

IMPORTANT CONSIDERATIONS WHEN COMPILING A CORPUS

 

Creating a corpus is time-consuming, but it may be necessary if the available text corpora contain little or no data that is useful for your research project.  Before compiling a corpus:

  • First and foremost, your research objectives and/or research questions should be clear because these determine what material you have to collect.
  • Then carefully select texts that represent a specific language variety or domain, and gather them in a consistent, structured format. Depending on your needs, annotate the data with relevant metadata, then use specialized software to clean and normalize it so that it is ready for linguistic analysis.

Important considerations when compiling a corpus: 

Representativeness: Ensure your corpus adequately reflects the language variety you are studying. 

Balance:  Distribute text types and genres proportionally to avoid overrepresentation of certain categories. 

Size:  Determine the appropriate corpus size depending on your research question and available resources. 

Accessibility:  Consider making your corpus available to other researchers if appropriate. 
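Balance and size can be checked with a few lines of code before any analysis begins. The sketch below is a minimal illustration, not a prescribed method: it assumes each text is stored as a dictionary with hypothetical "genre" and "text" keys, and it measures size in whitespace-separated words, which is only a rough proxy for a proper token count.

```python
from collections import Counter


def genre_shares(documents):
    """Return each genre's share of the corpus, measured in whitespace-separated words."""
    words_per_genre = Counter()
    for doc in documents:
        words_per_genre[doc["genre"]] += len(doc["text"].split())
    total = sum(words_per_genre.values())
    if total == 0:
        return {}
    return {genre: count / total for genre, count in words_per_genre.items()}


# Hypothetical toy corpus, for illustration only.
documents = [
    {"genre": "news", "text": "The council met on Monday to vote"},
    {"genre": "fiction", "text": "She opened the door"},
    {"genre": "news", "text": "Prices rose again"},
]

shares = genre_shares(documents)
for genre, share in sorted(shares.items()):
    print(f"{genre}: {share:.0%} of corpus words")
```

If one genre dominates the output, you can collect more material for the underrepresented categories or sample down the overrepresented one, depending on your research question.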

        See below: Key steps in compiling a corpus.

KEY STEPS IN COMPILING A CORPUS

  • Define your research goals:  Clearly identify what linguistic features or patterns you want to study to guide your corpus design. 
  • Select text sources: Choose diverse and representative texts from different genres, registers, and domains relevant to your research. 
  • Data collection:  A. Gather texts: Draw on existing sources such as published texts, scrape web pages, or collect original data through surveys or interviews. B. Consider copyright: Ensure you have the necessary permissions to use the texts. 
  • Data pre-processing:  A. Text cleaning: Remove irrelevant elements such as headers and footers and, where your analysis does not need them, punctuation and special characters. B. Normalization: Standardize the text (e.g., capitalization, punctuation, lemmatization). C. Tokenization: Split text into individual words or other units of analysis.
  • Metadata creation: Add contextual details such as genre, author, date, source, and register to each text. 
  • Corpus software: Use specialized software such as Sketch Engine, AntConc, or WordSmith Tools to store, manage, and analyze your corpus. 
  • Quality control: Review your data for errors, inconsistencies, and potential biases.
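
The pre-processing, metadata, and quality-control steps above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not a definitive pipeline: the header/footer pattern ("Page N" or a bare number on its own line), the function names, the metadata fields, and the sample text are all hypothetical, and lemmatization is left out because it requires a language-specific tool.

```python
import re
import unicodedata

# Assumed metadata schema, for illustration only.
REQUIRED_FIELDS = ("genre", "author", "date", "source")


def clean(text):
    """Drop lines that look like page headers or footers, then collapse whitespace."""
    kept = [
        line for line in text.splitlines()
        if not re.fullmatch(r"\s*(Page \d+|\d+)\s*", line)
    ]
    return re.sub(r"\s+", " ", " ".join(kept)).strip()


def normalize(text):
    """Unify Unicode representations and fold case."""
    return unicodedata.normalize("NFKC", text).casefold()


def tokenize(text):
    """Split into word tokens, keeping internal apostrophes and hyphens."""
    return re.findall(r"\w+(?:['’-]\w+)*", text)


def build_record(raw_text, metadata):
    """Attach the processed tokens to a copy of the text's metadata."""
    tokens = tokenize(normalize(clean(raw_text)))
    return {**metadata, "tokens": tokens, "n_tokens": len(tokens)}


def missing_fields(record):
    """Quality control: list required metadata fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Hypothetical sample text and metadata.
raw = "Page 1\nThe Corpus is well-known.\nIt isn't   small!\n12\n"
record = build_record(raw, {"genre": "academic", "author": "unknown", "source": "example"})

print(record["tokens"])
print("Missing metadata:", missing_fields(record))
```

Here the sample record has no "date" field, so the quality-control check reports it. Dedicated corpus tools handle most of these steps for you, so a script like this is mainly useful for preparing files before importing them.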