Digital Humanities - Research, Teaching, and Learning: TEXT MINING, TEXT ANALYSIS, TEXT ANALYTICS

WHAT IS TEXT MINING?

Text mining is an interdisciplinary field at the intersection of related areas such as information retrieval, machine learning, statistics, computational linguistics, and data mining. The term has been used synonymously with "Text Analysis" and "Text Analytics". Text mining refers generally to the process of extracting patterns or knowledge from unstructured text documents. It can be viewed as an extension of data mining or knowledge discovery from (structured) databases.

This page of the Guide explains each of the three terms to highlight the differences among them.

For information about text mining's historical development and a short list of readings, see this Guide's Page, "Text Mining Resources."

See also this Guide's Pages, "Text Mining Glossary" and "Text Mining Tools."

For Related USC Libraries Research Guides go to:

 Content Mining (Original author: Caroline Muglia (2023); current author, 2023 to present: Danielle Mihram).

 Inclusive and Responsible Dataset Usage. Information about getting started on working with data in an inclusive and responsible manner. (Author: Mike Jones).

TEXT MINING

Text mining is the process of using automation to obtain meaningful information from large collections of unstructured data. It is used to analyze vast collections of textual materials to capture key concepts, trends and hidden relationships.  It is a subset of data mining.

Web search engines (such as Google) merely retrieve information, displaying lists of documents that contain certain keywords; they do not suggest connections or generate new knowledge. Text-mining programs go further: they categorize information, make links between otherwise unconnected documents, and provide visual maps.

With text mining, the distilled, structured information can be used to address questions such as:

  • Which concepts occur together?
  • What else are they linked to?
  • What higher level categories can be made from extracted information?
  • What do the concepts or categories predict?
  • How do the concepts or categories predict behavior?

(Source: "About Text Mining," IBM Knowledge Center.)
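
The first question in the list, which concepts occur together, can be illustrated with a very small sketch. The example below is not taken from the IBM source; it assumes a hypothetical three-sentence corpus and a hand-picked stopword list, and it simply counts how many documents contain each pair of terms. Real text-mining tools use far richer notions of "concept" than single words.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical toy corpus, used only for illustration.
documents = [
    "Text mining extracts patterns from unstructured text.",
    "Data mining finds patterns in structured databases.",
    "Text analysis began in the humanities.",
]

# Assumed, hand-picked list of words to ignore.
STOPWORDS = {"from", "in", "the", "a", "of", "and"}

def terms(doc):
    """Return the set of lowercase, non-stopword terms in one document."""
    return {w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOPWORDS}

def cooccurrence_counts(docs):
    """Count, for each pair of terms, the number of documents containing both."""
    counts = Counter()
    for doc in docs:
        # Sorting makes each pair a consistent (alphabetical) tuple.
        for pair in combinations(sorted(terms(doc)), 2):
            counts[pair] += 1
    return counts

pair_counts = cooccurrence_counts(documents)
print(pair_counts[("mining", "patterns")])
```

Pairs that appear together in many documents are candidates for the "linked" concepts that the later questions build on.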

Text mining uses Natural Language Processing (NLP) to enable machines to break down and understand human language. NLP processes text automatically, transforming unstructured text into a structured format in order to identify meaningful patterns and new insights. By combining artificial intelligence, computational linguistics, and computer science, NLP is at the core of tools we use every day: translation software, chatbots, spam filters, search engines, grammar correction software, voice assistants, and social media monitoring tools.

For more information, see: Mandl, T. (2014). "Text mining." In M. Khosrow-Pour, Encyclopedia of Information Science and Technology (3rd ed.). Hershey, PA: IGI Global.
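
One simple way to picture the move from unstructured text to a structured format is a document-term matrix: a table with one row per document and one column per word. The sketch below is only an illustration under stated assumptions (two hypothetical documents, a naive letters-only tokenizer); production NLP pipelines add steps such as stemming, part-of-speech tagging, and entity recognition.

```python
from collections import Counter
import re

# Hypothetical documents, used only for illustration.
docs = {
    "doc1": "Spam filters classify email.",
    "doc2": "Search engines retrieve documents; filters rank documents.",
}

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs):
    """Return a sorted vocabulary and one row of word counts per document."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    # A Counter returns 0 for words a document does not contain.
    rows = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, rows

vocabulary, rows = document_term_matrix(docs)
print(vocabulary)
for name, row in rows.items():
    print(name, row)
```

Once text is in this tabular form, standard statistical and data-mining methods can be applied to it.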

TEXT ANALYSIS

Text Analysis refers to a process of conducting analysis on a body, or corpus, of natural language text, in order to detect patterns (such as word frequency or associative links), create visualizations from the text, categorize or annotate the text, or otherwise "mine" it for relevant, novel, or interesting information.

In the text analysis of a literary text, information can be gleaned at many levels, from the literal meaning to the subtext, symbolism, assumptions, and values the text reveals. A common text analysis method is to have a computer program identify how frequently certain words appear in a body of text. Text Analysis is a form of qualitative analysis derived from the extraction of useful information from text so that the key ideas or concepts contained within this text can be grouped into an appropriate number of categories. See: John Burrows, "Textual Analysis," Chap. 23 in A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, John Unsworth. Oxford: Blackwell, 2004.

Text mining began in the computational and information management fields (e.g., database searching and information retrieval), whereas text analysis began in the humanities with the manual analysis of text (e.g., Bible concordances and newspaper indexes). More recently, the two terms have become synonymous, and now generally refer to the use of computational methods to search, retrieve, and analyze text data.
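
The word-frequency method mentioned above can be sketched in a few lines. This is a minimal illustration, assuming a made-up sample sentence and a naive tokenizer that treats any run of letters as a word; it is not the method of any particular text analysis tool.

```python
from collections import Counter
import re

def word_frequencies(text):
    """Count how often each lowercase word appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

# Hypothetical sample text, used only for illustration.
sample = "The whale, the sea, and the ship: the whale returns."
freq = word_frequencies(sample)

# Show the most frequent words first.
for word, count in freq.most_common(3):
    print(word, count)
```

In practice, very common words such as "the" are usually filtered out so that the counts highlight the vocabulary that distinguishes one text from another.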

TEXT ANALYTICS

Text mining and text analytics are often used interchangeably. When a distinction is drawn, text mining is generally understood to derive qualitative insights from unstructured text, while text analytics provides quantitative results.

"Text mining or text analytics is an umbrella term describing a range of techniques that seek to extract useful information from document collections through the identification and exploration of interesting patterns in the unstructured textual data of various types of documents – such as books, web pages, emails, reports or product descriptions." (Truyens & Van Eecke, 2014).