The vocabulary relating to Text Mining and Text Analysis is extensive. This Glossary provides information about the most frequently used terminology relating to both Text Mining and Text Analysis.
Two useful resources relating to this topic are:
Folger Shakespeare Library (2017) - Includes: Glossary of Digital Humanities Terms; Digital Tools for Textual Analysis; Bibliography of Textual Analysis Readings; Digital Humanities Readings and Resources (a comprehensive listing which includes Linguistic and Textual Analysis and Corpora).
Mandl, T. (2014). "Text mining." In M. Khosrow-Pour, Encyclopedia of information science and technology (3rd ed.). Hershey, PA: IGI Global.
API (Application Programming Interface): An interface that allows applications to talk to one another and can be used to facilitate downloading large amounts of data from a website.
Association: Associations measure how often a word co-occurs with other words. The more often words occur close to each other when compared to their general frequency, the higher their association will be (see Collocation).
Classification: Text classification is a machine learning technique that assigns a set of predefined categories to open-ended text. Text classifiers can be used to organize, structure, and categorize any kind of text, including documents, studies, files, and web-based information.
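One common way to quantify association is pointwise mutual information (PMI), which compares how often two words occur together with how often they would be expected to co-occur given their general frequencies. The sketch below is a minimal illustration only: it assumes that "close to each other" means directly adjacent, and the function name pmi and the sample sentence are made up for this example.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent word pair (w1, w2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    # Observed probability of the pair among all adjacent pairs.
    p_pair = bigrams[(w1, w2)] / (n - 1)
    if p_pair == 0:
        return float("-inf")  # the pair never occurs together
    # General frequency of each word on its own.
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    return math.log2(p_pair / (p_w1 * p_w2))

# Hypothetical sample text, already split into words.
tokens = "strong tea and strong coffee but weak tea".split()
print(pmi(tokens, "weak", "tea"))
print(pmi(tokens, "strong", "tea"))
```

In this toy sample, "weak" appears only next to "tea", so that pair scores higher than "strong tea", where "strong" also appears elsewhere.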
Clustering: In the realm of Natural Language Processing (NLP), text clustering is the process of grouping similar documents or pieces of text into clusters or categories. This technique enables us to discover hidden patterns, extract valuable insights, and streamline large volumes of unstructured text data.
Collocation: A term used to describe words that are associated with one another, meaning that they often appear together. In corpus linguistics, text mining, and digital text analysis, collocations are a statistical overview of words that have a relatively high co-occurrence with a particular keyword.
Corpus: A collection of written texts, particularly the entire body of work on a subject or by a specific creator; a collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, etc.
Concept: A concept is a semantic entity which can be expressed by several words or by a group of words.
Geographical Text Analysis (GTA): A relatively recent development in the approach to studying, analysing, and extracting the content of textual sources. Combining techniques from Natural Language Processing (NLP), Corpus Linguistics, and Geographic Information Systems (GIS), GTA offers a new methodology for employing these computational tools in Humanities research. GTA focuses on the identification, manipulation, and analysis of spatial information on a large scale.
Information retrieval (IR): A process that facilitates the effective and efficient retrieval of relevant information from large collections of unstructured or semi-structured data. IR systems assist in searching for, locating, and presenting information that matches a user's search query or information need. IR systems enable search access to a vast array of sources or items within documents, metadata, and databases of texts, images, videos, and sounds. As the dominant form of information access, IR is relied upon by people who use search engines.
Internet Archive: A not-for-profit, open-access digital library which provides free access to researchers, historians, scholars, people with print disabilities, and the general public. It contains over 3 million books that are in the public domain, as well as music, moving images, audio files, cultural artifacts in digital form, software, and archived Web pages. Digital material can be downloaded and uploaded by users. Internet Archive oversees one of the largest book-digitization projects in the world. More than 28 years of web history are accessible through the Wayback Machine, and the Archive works with 1,200+ library and other partners through its Archive-It program to identify important web pages.
Keyword-in-Context (KWIC) Analysis: provides a list of a specific word or phrase in context (up to 7 words in each direction is common). Best for pattern identification and close reading.
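A KWIC listing can be sketched in a few lines. The example below is illustrative only: the function name kwic, the window parameter, and the sample sentence are assumptions made for this sketch, and the text is split on whitespace rather than properly tokenized.

```python
def kwic(tokens, keyword, window=7):
    """Return (left context, keyword, right context) for every match."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]      # up to `window` words before
            right = tokens[i + 1:i + 1 + window]     # up to `window` words after
            rows.append((" ".join(left), tok, " ".join(right)))
    return rows

# Hypothetical sample text.
text = "the cat sat on the mat and the dog sat on the rug"
rows = kwic(text.split(), "sat", window=2)
for left, key, right in rows:
    print(f"{left:>10} [{key}] {right}")
```

Aligning the keyword in a central column, as the print statement does, makes recurring patterns on either side easier to spot.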
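Lemmatization - Identifying the base (dictionary) form of a word, known as its lemma, such as "respect" for respects, respected, and respecting. Unlike stemming, lemmatization returns an actual word and does not strip derivational affixes, so "respectful" and "disrespectful" remain separate lemmas.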
Lexical co-occurrence or collocation: Observes clusters of terms which are likely to appear together in a given population, based on statistical relationships. This is good for getting a sense of 'aboutness' for a specific term or population or detecting specific word associations. Topic modeling (See below) is based on this principle.
Machine Learning: A way of programming computers that allows for the evolution of computational behavior based on empirical data or past experience. Machine learning focuses particularly on the ability of computers to learn to recognize complex patterns and make intelligent decisions based on those patterns, an ability that is especially valuable in computational textual analysis.
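Metadata: Data describing other data. Metadata provide information about one or more aspects of data, such as type, date, creator, location, and so on. For instance, the metadata within a digital image may consist of information such as its size, resolution, time of creation, and color depth. Metadata help in the classification, organization, labeling, sorting, and searching of data. Most often encountered in library and archival contexts, metadata facilitate the organization, discovery, and use of a wide range of resources. For further information, consult the National Information Standards Organization's publication Understanding Metadata, by Jenn Riley (2017).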
Natural language processing (NLP) - Combines computational linguistics (rule-based modeling of human language) with statistical and machine learning models to enable computers and digital devices to recognize, understand, and generate text and speech. Essentially: the ability of a machine or program to understand human text or speech.
n-gram: An n-gram (n=natural number + suffix “gram”) is a collection of n successive items in a text document that may include words, phonemes, syllables, numbers, symbols, and punctuation. For example, a 2-gram (a “bigram”) is a sequence of two words (e.g., “please turn”) and a 3-gram (a “trigram”) is a sequence of three words (e.g., “please turn your”). N-grams are a tool that is useful for turning written language into data, and breaking down larger portions of search data into more meaningful segments that help to identify the root cause behind trends. This method is very good for identifying common phrases in a particular genre and stylistic features which are unique to a specific author.
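Word n-grams can be extracted with a sliding window over the list of words. The sketch below is a minimal illustration: the function name ngrams is an assumption, and the sample phrase extends the "please turn your" example with made-up words.

```python
def ngrams(tokens, n):
    """Return every run of n successive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample phrase, split on whitespace.
tokens = "please turn your homework in".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Counting how often each tuple occurs across a corpus (for example with collections.Counter) is the usual next step for finding common phrases.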
Named entity recognition (NER): A natural language processing (NLP) method that identifies, categorizes, and extracts the most important pieces of information from unstructured text without requiring time-consuming human analysis. NER detects and categorizes important information in text known as "named entities." Named entities refer to the key subjects of a piece of text, such as names, locations, companies, events and products, as well as themes, topics, times, monetary values and percentages. NER is particularly useful for quickly extracting key information from large amounts of data because it automates the extraction process.
Part-of-Speech Tagging - Identifying the grammatical category (noun, verb, adjective, etc.) and syntactic role of a word.
OCR (optical character recognition): Use of computer technologies and techniques to identify and extract text from unstructured documents like images, screenshots, and physical paper documents and convert them into machine-readable text. This conversion allows for the computerization of material texts into formats for digital storage, search, and display. Adobe Acrobat Professional supports OCR processes, as does Microsoft Office for Windows. OCR accuracy depends on the font and style of the original document.
Relation Extraction - Identifying the relationships between entities such as "daughter of" or "town in ? state"
Representational State Transfer (REST): REST is a software architectural style for distributed hypermedia systems, used in the development of Web services. Web services using REST are termed RESTful APIs or REST APIs and they provide interoperability between computer systems on the internet.
Sentiment analysis: Using software to identify attitudinal information from a text. It is used to determine whether a given text contains negative, positive, or neutral emotions. It is a form of text analytics (the process of extracting meaning from text) that uses natural language processing (NLP) and machine learning. Sentiment analysis is also known as “opinion mining” or “emotion artificial intelligence”. A key aspect of sentiment analysis is polarity classification. Polarity refers to the overall sentiment conveyed by a particular text, phrase or word. This polarity can be expressed as a numerical rating known as a “sentiment score”. For example, this score can be a number between -100 and 100 with 0 representing neutral sentiment. This score could be calculated for an entire text or just for an individual phrase.
SGML (Standard Generalized Markup Language): A markup language designed to format, store, and access large corpora of documents. The language is declarative, meaning that it describes source documents instead of specifying the particulars of their future display. These descriptive tags can then be processed in a variety of ways. SGML is the parent language of HTML, XHTML (the XML version of HTML), and XML.
Source code: A group of instructions that a programmer writes in a programming language, usually in the form of text. Source code is typically compiled into machine code that a computer can execute, or run directly by an interpreter. Most applications are distributed as executable files, not as source code. Source code is the form of a program that is meant to be read and written by human beings.
Stemming: The process of reducing a set of related words to a shared “stem” (root form), typically by stripping affixes. Unlike a lemma (see Lemmatization), a stem does not have to be a dictionary word. For example, given a set of words such as respectful, respected, respective, respecting, disrespectful, and disrespected, the root or stem word is “respect”. Identifying that stem word and using it to encode the input is termed “stemming”. It is a common preprocessing step in Natural Language Processing (NLP). It reduces the vocabulary size, which in turn helps decrease the model capacity required to capture the features in the information put into a machine or system.
Text Encoding: Broadly considered, the process of putting text in a special format for preservation or dissemination. In the digital humanities, textual encoding nearly always refers to the practice of transforming plain text content into XML. The TEI Guidelines are often followed when encoding textual materials in the arts, humanities, and social sciences. See TEI.
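A very simple form of polarity scoring counts positive and negative words from a lexicon and scales the difference to the -100 to 100 range described above. The sketch below is illustrative only: the word lists POSITIVE and NEGATIVE and the function name sentiment_score are invented for this example, and real tools use far larger lexicons or trained machine learning models.

```python
# Tiny hypothetical lexicons, for illustration only.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def sentiment_score(text):
    """Return a polarity score from -100 (negative) to 100 (positive)."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words found: neutral
    return 100.0 * (pos - neg) / (pos + neg)

print(sentiment_score("I love this great book!"))
print(sentiment_score("The plot was good but the ending was terrible."))
```

A word-counting approach like this ignores negation ("not good") and sarcasm, which is one reason machine learning methods are often preferred.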
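The idea can be illustrated with a naive suffix stripper. This is a sketch under stated assumptions, not an established algorithm such as the Porter stemmer: the function name naive_stem and its short suffix list are invented for this example, and only suffixes are removed, so a prefix such as "dis" is left in place.

```python
def naive_stem(word, suffixes=("ful", "ing", "ive", "ed")):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no suffix matched: return the word unchanged

for w in ["respectful", "respected", "respective", "respecting", "disrespectful"]:
    print(w, "->", naive_stem(w))
```

The minimum-length check keeps short words such as "red" from being cut down to a single letter.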
The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. In addition to the Guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software developed for or adapted to the TEI.
Text mining: The process of automatically deriving previously unknown information from written texts using computational techniques. Text mining tools facilitate researchers' discovery of patterns within large bodies of unstructured text.
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents. It has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).
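The multiplication described above can be written as tf(t, d) × log(N / df(t)), where tf is the share of a document's words that are the term t, N is the number of documents, and df is the number of documents containing t. The sketch below assumes this basic, unsmoothed variant; other weightings exist, and the function name tf_idf and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dictionary per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical three-document collection.
docs = [d.split() for d in ["the whale swims", "the ship sails", "the whale sings"]]
scores = tf_idf(docs)
print(scores[0])
```

A word that appears in every document, such as "the" here, scores zero, while a word unique to one document scores highest in that document.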
Tokenization - Process of separating a string of characters into tokens, which may be words, phrases, or sentences. In the process, punctuation is often removed.
Topic modeling - Automatically grouping texts into meaningful categories (topics) based on the words that tend to occur together in them.
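A minimal word tokenizer can be sketched with a regular expression. The pattern below is one simple choice among many and is an assumption of this example, as is the function name tokenize: it lowercases the text, keeps runs of unaccented letters and digits (with an optional apostrophe part such as "n't"), and drops punctuation.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, discarding punctuation."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

tokens = tokenize("Please, turn your page; don't stop!")
print(tokens)
```

Libraries built for text analysis handle many more cases (accented characters, hyphenation, abbreviations, non-English scripts) than this pattern does.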
Web Scraping (crawling, spidering) - Copying website information in order to extract large amounts of data and saving to a local file.
Word Cloud: A visualization of word frequencies. Usually, the more frequently a word appears in a given text, the larger its size in the resulting visualization. Programs designed to create word clouds are easily accessible; two of the most used are Wordle and the Many Eyes tag cloud.
Word Vector modeling: Like lexical co-occurrence, this looks for terms which are likely to appear together in a given population, and projects terms into multi-dimensional space to model semantic relationships between words at scale. This is especially good for discovering how texts' use of words can relate to each other.
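Once words are projected into a multi-dimensional space, the semantic relationship between two words is commonly measured by the cosine of the angle between their vectors. The sketch below is illustrative only: the three-dimensional vectors are invented for this example (real models learn hundreds of dimensions from a corpus), and the function name cosine_similarity is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not taken from any trained model.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}
print(cosine_similarity(vectors["king"], vectors["queen"]))
print(cosine_similarity(vectors["king"], vectors["apple"]))
```

Values near 1 indicate that two words are used in similar contexts; values near 0 indicate little relationship.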