What is a “Corpus”?
The term “corpus” refers to a body, or collection, of writings. In Latin, corpus (plural: corpora) literally means “body”. The sense “body of works” came into use in English around the 18th century. The term was most commonly used in legal or religious contexts, referring to the body of laws or the body of religious texts, often in phrases such as “the corpus juris” (the body of law) or “the corpus of scripture” (the body of biblical texts).
“Corpus” is the NLP equivalent of “dataset” in a general machine learning task. A corpus is a collection of texts (the data), typically accompanied by annotations or labels, in which case it is called a labeled corpus.
Corpora serve as a foundational resource for training machine learning models in natural language processing (NLP).
Building a corpus involves collecting texts, ensuring they are representative of the language or domain of interest, and possibly cleaning the data (removing irrelevant information, normalizing text, etc.).
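The collecting and cleaning steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a recommended pipeline: the function names `clean_text` and `build_corpus`, the sample documents, and the specific cleaning choices (lowercasing, stripping HTML-like tags, dropping exact duplicates) are assumptions made for the example. Which steps are appropriate depends on the corpus and the task.

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Normalize one document: Unicode form, markup, whitespace, case."""
    text = unicodedata.normalize("NFKC", raw)   # unify equivalent Unicode forms
    text = re.sub(r"<[^>]+>", " ", text)        # drop leftover HTML-like tags
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text.lower()                         # case-fold (an assumption; not always wanted)

def build_corpus(raw_documents):
    """Clean every document, then drop empty ones and exact duplicates."""
    corpus, seen = [], set()
    for raw in raw_documents:
        cleaned = clean_text(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            corpus.append(cleaned)
    return corpus

# Hypothetical raw input, e.g. text scraped from web pages.
raw_documents = [
    "<p>The  Corpus Juris is a body of law.</p>",
    "The corpus juris is a body of law.",
    "   ",
    "Corpora are collections\nof texts.",
]

corpus = build_corpus(raw_documents)
print(corpus)
```

Lowercasing and duplicate removal are shown here only because they are common defaults. For tasks such as named-entity recognition, preserving case may be preferable.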
The term “corpus” has different meanings depending on who uses it. For literary scholars, philosophers, and philologists, a corpus is essentially the body of text they are working on, and the emphasis is on working from the “right” corpus (the right version of the Bible, the right version of Aristotelian manuscripts, etc.).
In text mining, a “corpus” refers to a large and structured set of texts used for linguistic analysis: the study of language patterns, word frequencies, syntax, semantics, and other linguistic features.
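As a small illustration of the kind of analysis meant here, the sketch below counts word frequencies across a toy corpus using only the Python standard library. The two sample sentences and the very simple `tokenize` helper (lowercase, letters only) are assumptions for the example. Libraries such as NLTK or spaCy provide more careful tokenization.

```python
import re
from collections import Counter

# A toy corpus: each string stands for one document.
corpus = [
    "The corpus is a body of texts.",
    "A corpus of legal texts is a specialized corpus.",
]

def tokenize(text):
    """Split a text into lowercase alphabetic tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

# Count every token across all documents in the corpus.
word_counts = Counter(token for doc in corpus for token in tokenize(doc))
total_tokens = sum(word_counts.values())

# Report the most frequent words with their relative frequency.
for word, count in word_counts.most_common(5):
    print(f"{word}\t{count}\t{count / total_tokens:.3f}")
```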
Types of Corpora:
- General Corpora: Collections of texts that represent a wide range of language use (e.g., the British National Corpus).
- Specialized Corpora: Collections focused on specific domains or genres (e.g., medical texts, legal documents).
- Annotated Corpora: Texts that have been tagged with additional information, such as part-of-speech tags or semantic roles, to facilitate more detailed analysis.
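To make the idea of an annotated corpus concrete, the sketch below represents one as a list of sentences, where each token is paired with a part-of-speech tag. The sentences and tags are hand-written for illustration (loosely following the Universal POS tag names) and are not taken from any published corpus. Real annotated corpora use their own file formats and tag sets.

```python
from collections import Counter

# Each sentence is a list of (token, part-of-speech tag) pairs.
# Hand-annotated toy data, for illustration only.
annotated_corpus = [
    [("The", "DET"), ("court", "NOUN"), ("cites", "VERB"),
     ("the", "DET"), ("law", "NOUN"), (".", "PUNCT")],
    [("Scholars", "NOUN"), ("study", "VERB"), ("old", "ADJ"),
     ("texts", "NOUN"), (".", "PUNCT")],
]

# The annotations allow queries that raw text cannot answer directly,
# such as how often each tag occurs ...
tag_counts = Counter(tag for sentence in annotated_corpus for _, tag in sentence)

# ... or which tokens were tagged as nouns.
nouns = [token.lower()
         for sentence in annotated_corpus
         for token, tag in sentence
         if tag == "NOUN"]

print(tag_counts)
print(nouns)
```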
Corpora are essential for tasks such as:
- Text classification: Categorizing texts into predefined classes.
- Sentiment analysis: Determining the emotional tone behind a body of text.
- Topic modeling: Identifying themes within a collection of documents.
- Machine translation: Training models to translate between languages.
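To show how a labeled corpus feeds one of these tasks, here is a minimal sketch of text classification with a hand-rolled multinomial naive Bayes classifier and add-one smoothing. Naive Bayes is chosen only because it fits in a few lines. The four labeled sentences, the class names "legal" and "medical", and the helper names `train` and `classify` are all assumptions for the example, and a corpus this small says nothing about real-world accuracy.

```python
import math
import re
from collections import Counter, defaultdict

# A tiny labeled corpus: (text, class label) pairs. Illustrative data only.
labeled_corpus = [
    ("the court ruled on the contract dispute", "legal"),
    ("the judge cited the statute", "legal"),
    ("the patient received a new treatment", "medical"),
    ("the doctor examined the patient", "medical"),
]

def tokenize(text):
    """Split a text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(corpus):
    """Collect document counts per class, word counts per class, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in corpus:
        class_docs[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_docs, word_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log-probability under naive Bayes."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)        # class prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue                                        # skip unseen words
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(labeled_corpus)
print(classify("the judge reviewed the contract", model))
print(classify("the doctor treated the patient", model))
```

In practice, a library implementation trained on a much larger corpus would replace this sketch. The point is only that the labels in the corpus are what the model learns from.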
Tools and Libraries: Various tools and libraries exist for working with corpora in text mining, such as the Python libraries NLTK, spaCy, and Gensim. See this guide's Glossary page.
Overall, corpora are fundamental to advancing our understanding of language and developing effective text mining and NLP applications.