
Corpora and Text/Data Mining For Digital Humanities Projects

A guide to the process of creating a large collection of text or speech data and computationally analyzing it to extract meaningful insights.

BOOKS AND CONFERENCE PROCEEDINGS


Ramírez, A. G. (2023). Digital Humanities, Corpus and Language Technology = Humanidades Digitales, Corpus y Tecnología del Lenguaje. University of Groningen Press.

This title is an outstanding collection of research contributions that explores the intersection of technology and the humanities. The authors provide a comprehensive overview of how these technologies can enhance research across various disciplines, from literature to history to anthropology. (...) New technologies have opened up new perspectives for research, allowing scientists to analyze data in previously impossible ways. The interdisciplinary approach and practical applications make it an invaluable resource for researchers, students, and anyone interested in the intersection of technology and the humanities.

McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-Based Language Studies: An Advanced Resource Book. Taylor & Francis. (Available as a Google Book.)

BOOK CHAPTERS

Percillier, M. (2017). Creating and Analyzing Literary Corpora. In S. Hai-Jew (Ed.), Data Analytics in Digital Humanities (pp. 91–118). Springer International Publishing.

Using a study of non-standardized linguistic features in literary texts as a working example, the chapter describes the creation of a digital corpus from printed source texts, as well as its subsequent annotation and analysis. The sections detailing the process of corpus creation take readers through the steps of document scanning, Optical Character Recognition, proofreading, and conversion of plain text to XML, while offering advice on best practices and overviews of existing tools. The presented corpus annotation method introduces the programming language Python as a tool for automated basic annotation, and showcases methods for facilitating thorough manual annotation. The data analysis covers both qualitative analysis, facilitated by CSS styling of XML data, and quantitative analysis, performed with the statistical software package R and showcasing a number of sample analyses.
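The conversion and annotation workflow the chapter describes (proofread plain text converted to XML, then automated basic annotation in Python) can be sketched as follows. This is a minimal illustration rather than the chapter's actual code: the lexicon of non-standard forms, the element names, and the tokenization are hypothetical choices, and Python's standard library (`re`, `xml.etree.ElementTree`) stands in for the tooling the chapter surveys.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lexicon mapping non-standard spellings to standard forms;
# in practice this automated pass would be followed by manual annotation.
LEXICON = {"gonna": "going to", "dat": "that"}

def corpus_to_xml(raw_text, title="Untitled"):
    """Convert proofread plain text into a minimal XML document:
    one <p> element per paragraph (paragraphs separated by blank lines),
    one <w> element per token. Tokens found in the lexicon of
    non-standard forms are annotated with their standard equivalent."""
    root = ET.Element("text", {"title": title})
    for para in re.split(r"\n\s*\n", raw_text.strip()):
        p = ET.SubElement(root, "p")
        # Split into word tokens and punctuation marks.
        for token in re.findall(r"\w+|[^\w\s]", para):
            w = ET.SubElement(p, "w")
            w.text = token
            std = LEXICON.get(token.lower())
            if std is not None:
                w.set("type", "nonstandard")
                w.set("standard", std)
    return root

xml_doc = corpus_to_xml("He said dat he was gonna leave.", title="Sample")
print(ET.tostring(xml_doc, encoding="unicode"))
```

Styling the resulting XML with CSS (e.g. highlighting `w[type="nonstandard"]`) then supports the qualitative analysis the chapter mentions, while the token counts feed quantitative analysis in R.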

ARTICLES

Lukin, S., et al. (2024). "SCOUT: A Situated and Multi-Modal Human-Robot Dialogue Corpus." arXiv preprint, Nov 19, 2024.

Abstract: We introduce the Situated Corpus Of Understanding Transactions (SCOUT), a multi-modal collection of human-robot dialogue in the task domain of collaborative exploration. The corpus was constructed from multiple Wizard-of-Oz experiments where human participants gave verbal instructions to a remotely-located robot to move and gather information about its surroundings. SCOUT contains 89,056 utterances and 310,095 words from 278 dialogues averaging 320 utterances per dialogue. The dialogues are aligned with the multi-modal data streams available during the experiments: 5,785 images and 30 maps. The corpus has been annotated with Abstract Meaning Representation and Dialogue-AMR to identify the speaker's intent and meaning within an utterance, and with Transactional Units and Relations to track relationships between utterances to reveal patterns of the Dialogue Structure. We describe how the corpus and its annotations have been used to develop autonomous human-robot systems and enable research in open questions of how humans speak to robots. We release this corpus to accelerate progress in autonomous, situated, human-robot dialogue, especially in the context of navigation tasks where details about the environment need to be discovered.

Mair, C. (2024). "Digital Corpora in Language Study: Reviewing a Success Story in the Recent History of Linguistics Research." Research in English Language Pedagogy (RELP), 12(3), 469–477 (Summer 2024).

Abstract: The contribution provides an overview of the fifty-year success story of corpus linguistics. It acknowledges the enormous technical advancements and the excellent corpus resources available today – at least for some of the major languages in the world. A direct consequence of technological development is the rise of usage-based theoretical models in contemporary linguistic research. At the same time, the contribution points out the still deficient treatment of spoken spontaneous language in corpus linguistics and recommends giving more attention to the development of multilingual corpus resources in the future than has been done so far. Finally, it raises the question of the future function of corpus linguistics within the framework of Digital Humanities.

Sakai, Y., et al. (2024). "Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair." arXiv preprint, Apr 18, 2024.

Abstract: In Simultaneous Machine Translation (SiMT) systems, training with a simultaneous interpretation (SI) corpus is an effective method for achieving high-quality yet low-latency systems. However, it is very challenging to curate such a corpus due to limitations in the abilities of annotators, and hence, existing SI corpora are limited. Therefore, we propose a method to convert existing speech translation corpora into interpretation-style data, maintaining the original word order and preserving the entire source content using Large Language Models (LLM-SI-Corpus). We demonstrate that fine-tuning SiMT models in text-to-text and speech-to-text settings with the LLM-SI-Corpus reduces latencies while maintaining the same level of quality as the models trained with offline datasets.