What is Text Data Mining (briefly): Text Data Mining (TDM) is the structuring of unstructured text content, followed by analysis of the resulting structured data. Structuring imposes an order on the text that can then be analyzed; examples include counting n-grams, part-of-speech tagging, and named entity extraction. Analysis falls largely into two categories: supervised and unsupervised learning. Supervised learning either classifies data into pre-defined labels (as in sentiment analysis) or performs regression on continuous data, such as measuring the relationship between web page views and the calculated reading level of the page. Unsupervised learning uses no pre-defined labels; it relies instead on techniques such as similarity clustering and keyword density to obtain a clearer picture of the text content.
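To make the "structuring" step concrete, here is a minimal sketch of n-gram counting using only the Python standard library. The sample sentence, the regular-expression tokenizer, and the function names `tokenize` and `ngram_counts` are illustrative assumptions, not part of any tool listed below; real projects usually need more careful tokenization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (a rough assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens, returned as a Counter of tuples."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Hypothetical sample text, used only for illustration.
sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)

unigrams = ngram_counts(tokens, n=1)
bigrams = ngram_counts(tokens, n=2)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

The resulting counts are a structured view of the text: a table of n-grams and frequencies that can feed either a supervised or an unsupervised analysis.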

 

Top 5 Software Tools we use at Baylor Libraries for TDM (listed from least to most complex):

Voyant

(1) Voyant – Perfect for the beginner. Entirely online with no software to download and no registration required.

 

AntConc

(2) AntConc – Free downloadable software. A good bridge between point-and-click tools and tools that require programming/scripting. Be sure to check out all of Laurence Anthony’s TDM software.

 

Knime

(3) Knime – Free and open source software for drag-and-drop programming/scripting. (The Scratch of machine learning.) Here is a direct link to their Text Processing package.

 

Mallet

(4) Mallet – MAchine Learning for LanguagE Toolkit – Free tool providing TDM machine learning functions.

 

Python

(5) Python – Full interpreted programming language offering a large variety of TDM functions and packages. Of special note are the NLTK and spaCy packages.
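As a taste of what a short Python script for TDM can look like, the sketch below compares documents by the cosine similarity of their word counts, a common building block for similarity clustering. It deliberately uses only the standard library so it runs without installing anything; NLTK and spaCy provide far more robust tokenization and linguistic features. The three example documents and the function names `word_counts` and `cosine_similarity` are made up for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Turn a document into a bag-of-words Counter (naive tokenization, an assumption)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors: 0.0 means no shared words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, used only for illustration.
docs = {
    "doc_a": "Libraries support text mining research",
    "doc_b": "Text mining research needs library support",
    "doc_c": "The football team won on Saturday",
}
vectors = {name: word_counts(text) for name, text in docs.items()}

sim_ab = cosine_similarity(vectors["doc_a"], vectors["doc_b"])
sim_ac = cosine_similarity(vectors["doc_a"], vectors["doc_c"])
print(f"doc_a vs doc_b: {sim_ab:.2f}")
print(f"doc_a vs doc_c: {sim_ac:.2f}")
```

Note that "libraries" and "library" count as different words here; stemming or lemmatization, which NLTK and spaCy support, would typically be applied first.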

 

Software for Text Data Mining
