
Digital Humanities - Research, Teaching, and Learning: TEXT MINING TOOLS


DIRECTORY

As noted on this Guide's page, DH TOOLS AN OVERVIEW, any exhaustive listing of the tools used in the Digital Humanities (DH) needs frequent updating as DH evolves, and that is no small challenge. This Section therefore lists only tools that are frequently used in DH text analysis and mining.

DIRECTORY OF TOOLS FOR TEXT ANALYSIS AND MINING

TAPoR 3.0 (Text Analysis Portal for Research): TAPoR is a gateway to the tools used in sophisticated text analysis and retrieval. It is a directory of digital humanities tools organized by type. The former DiRT (Digital Research Tools) Directory has been merged into TAPoR.

With TAPoR 3.0 you can:
  • Discover text manipulation, analysis, and visualization tools
  • Discover historic tools
  • Read tool reviews and recommendations
  • Learn about papers, articles and other sources about specific tools
  • Tag, comment, rate, and review collaboratively
  • Browse lists of related tools in order to discover tools more easily

Broad domains of tools and services currently included in TAPoR: Browser extensions; Communication tools; Development tools; GIS tools; Photo/Video/Audio Editor and Related Tools; Publishing tools; Repositories and Archiving tools. For a large number of curated tool lists in TAPoR, see: Curated Tool lists.

TEXT MINING TOOLS

There are many ready-to-use digital tools for conducting text mining research. The listing below includes a few of the most popular ones:

AntWord Profiler: A freeware tool for profiling the vocabulary level and complexity of texts. AntWord Profiler is a free download available for Windows, Mac OS X, or Linux.

ConText: Developed at the School of Information Sciences at Illinois, ConText is a free, open-source application for performing a variety of text analysis techniques on textual data, including network graphs and topic models.

Gephi: An open graph visualization platform that supports exploration of all kinds of networks and complex systems. Gephi can be downloaded for free onto any Linux, Windows, or Mac OS X device.

Mallet (MAchine Learning for LanguagE Toolkit):  A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

PhiloLogic: A full-text search, retrieval, and analysis tool developed by the ARTFL Project and the Digital Library Development Center (DLDC) at the University of Chicago. It is free software that can be downloaded for a wide range of systems.

Scrapy: An open source and collaborative framework for extracting the data you need from websites. It is available as a free download for Linux, Windows, and Mac OS X.

Textal: A free smartphone app that allows you to analyze websites, tweet streams, and documents to explore the relationship between words in the text via an intuitive word cloud interface. The app allows you to generate graphs and statistics, as well as share the data and visualizations in any way you like. Textal is available as a free download from the App Store on your Apple iOS device.

TXM: A free, open source cross-platform Unicode & XML based text/corpus analysis environment and graphical client. It is available as a free download for Windows, Linux, and Mac OS X. It has a comprehensive range of analysis tools, such as concordances, collocate search, frequency list, etc.

TEXT PREPARATION TOOLS

OpenRefine: An open source desktop application, formerly called Google Refine, for cleaning up data, transforming it from one file format to another, and extending it with web services and external data.

Trifacta: A software company headquartered in San Francisco with offices in Boston, Berlin, and London. The company was founded in October 2012. Trifacta develops data wrangling software for data exploration and self-service data preparation. Data wrangling is similar to ETL (Extract, Transform, and Load), the process of combining data from multiple sources into a large, central repository called a data warehouse. The platform is user-friendly and AI-driven, and offers a fast way to clean, structure, and enrich raw data into a more usable form.

Trifacta is designed for analysts to explore, transform, and enrich raw data into clean and structured formats. Trifacta utilizes techniques in machine learning, data visualization, human-computer interaction, and parallel processing for non-technical users to prepare data for a variety of business processes. Trifacta works with cloud and on-premises data platforms.

TEXT ANALYSIS TOOLS

PORTAL: TAPoR (Text Analysis Portal for Research): TAPoR is a gateway to the tools used in sophisticated text analysis and retrieval. It is a directory of digital humanities tools organized by type. The former DiRT (Digital Research Tools) Directory has been merged into TAPoR.

TOOLS:

AntConc: A free concordance program available for Windows, Mac OS X, and Linux operating systems. AntConc has evolved from a simple concordance program into a powerful tool for textual analysis. It is able to perform the following types of linguistic analyses: concordance, concordance plot, clusters, n-grams, collocates, word frequency, keyword list.
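Two of the analyses named above, the concordance (keyword in context) and n-gram counts, can be sketched in a few lines of Python. This is only an illustration of the general techniques, not AntConc's own code; the function names, the simple tokenizer, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed, naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens on each side of every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def ngrams(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Hypothetical sample text, used only to exercise the functions.
sample = "The whale rose. The whale dived, and the ship followed the whale."
tokens = tokenize(sample)
kwic = concordance(tokens, "whale", width=2)
bigrams = ngrams(tokens, 2)

for line in kwic:
    print(line)
print(bigrams.most_common(1))
```

A dedicated tool adds sorting, plotting, and corpus management on top of this, but the underlying idea is the same: slide over the tokens and record what surrounds each hit.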

Lexos: A resource for visualizing large text sets through a web-based platform. The site has capabilities to upload multiple files, prepare, visualize, and analyze your data. The visualization tools encompassed in this tool include word clouds, multicloud, bubbleviz, and rollingwindow graph. The analysis tools included are statistical analysis, clustering, similarity query, and topword.

Textanalyser: A free online text analysis tool that generates statistics about your text and allows you to find the most frequent phrases and word frequencies. It provides instant results on word groups, sentences, syllables, keyword density, the prominence of a word or expression, and word count. Non-English language texts are supported.
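As a rough sketch of the kind of statistics described above, the Python below computes a word count, a sentence count, and keyword density (each word's share of all words). It is not Textanalyser's implementation; the function name, the sentence-splitting rule, and the sample text are assumptions chosen for brevity.

```python
import re
from collections import Counter

def text_stats(text):
    """Compute basic statistics for a text (illustrative only)."""
    # \w+ matches Unicode letters in Python 3, so non-English words are counted too.
    words = re.findall(r"\w+", text.lower())
    # Assumed rule: a sentence ends at one or more of . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    total = len(words)
    # Keyword density: occurrences of a word divided by the total word count.
    density = {w: c / total for w, c in counts.items()} if total else {}
    return {
        "word_count": total,
        "sentence_count": len(sentences),
        "top_words": counts.most_common(3),
        "density": density,
    }

# Hypothetical sample text.
stats = text_stats("Texts reward rereading. Short texts reward it twice!")
print(stats["word_count"], stats["sentence_count"], stats["density"]["texts"])
```

Syllable counts and phrase prominence need language-specific rules and are left out of this sketch.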

Voyant: A web-based reading and analysis tool for digital texts. The tool allows you to type in multiple URLs, paste in full text, or upload your own files for analysis. The site is a collaborative project by Stéfan Sinclair and Geoffrey Rockwell, built specifically for digital humanities projects. The site also provides helpful instruction guides for getting started and additional information about other Voyant tools.

What you can do with Voyant:

  • Use it to learn how computer-assisted analysis works. Check out our examples that show you how to do real academic tasks with Voyant.
  • Use it to study texts that you find on the web or texts that you have carefully edited and have on your computer.
  • Use it to add functionality to your online collections, journals, blogs or web sites so others can see through your texts with analytical tools.
  • Use it to add interactive evidence to your essays that you publish online. Add interactive panels right into your research essays (if they can be published online) so your readers can recapitulate your results.
  • Use it to develop your own tools using our functionality and code.

See: "Introduction to Text Analysis using Voyant," presented by David Sye, University Libraries, Murray State University.

Word and Phrase: An online text analysis tool that has a variety of capabilities for analyzing text. To access it, click on "Login". Text can be copied and pasted into a text box, or you can take advantage of the data from the Corpus of Contemporary American English (COCA) (Brigham Young University). The tool will first highlight all the medium and lower-frequency words in the text and create lists of the words. Secondly, any of the words can be clicked to create a "word sketch"; this will show its definition and detailed information from the COCA. Finally, the tool has the capability to conduct powerful searches on select phrases and show related phrases in the COCA. The COCA contains more than one billion words of text (25+ million words each year, 1990-2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, and (with the update in March 2020) TV and movie subtitles, blogs, and other web pages.

TEXT ANALYSIS - TUTORIALS

Programming Historian Lessons: Tutorials on a range of text analysis tools and techniques including mapping, web scraping, cleaning and transforming data, sentiment analysis, stylometry, basic text processing in R, text mining in Python, and topic modeling with Mallet.

Text Analysis with Topic Models for the Humanities & Social Sciences: A series of tutorials covering basic procedures in quantitative text analysis. The tutorials cover the preparation of a text corpus for analysis and the exploration of a collection of texts using topic models and machine learning. Includes sample datasets of British and French literary texts.