Topic Modeling produces a topic representation of any corpus’ textual field using the popular LDA model. Each topic is defined by a probability distribution over words. Conversely, each document is defined as a probability distribution over topics.
In CorText, a topic model is inferred for a total number of topics that users have to define. The composition of topics can be explored with the pyLDAvis library (https://pyldavis.readthedocs.io/en/latest/readme.html#usage), which visualizes the most relevant words for each topic. Topics are positioned in 2D according to their pairwise distances using a multidimensional scaling algorithm. The script also produces a new table storing the assignment of topics to documents. The distribution is flattened such that each document is assigned only to the topics it is linked to with a probability greater than the inverse of the number of topics.
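To make the flattening rule concrete, here is a minimal sketch using scikit-learn in place of CorText’s internal pipeline (the corpus, number of topics, and variable names are illustrative, not CorText’s actual code): each document keeps only the topics whose probability exceeds 1/K, where K is the total number of topics.

```python
# Illustrative sketch (not CorText's internals): fit an LDA model and
# flatten the document-topic distribution by keeping, for each document,
# only the topics with probability greater than 1 / n_topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "I dreamed I was flying over the ocean",
    "A dream about falling from a tall building",
    "Flying dreams feel free and weightless",
    "I fell and woke up before hitting the ground",
]

n_topics = 2
X = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topic = lda.fit_transform(X)  # each row sums to 1: P(topic | document)

# Flatten: assign a document to topic t only if P(t | doc) > 1 / n_topics
threshold = 1.0 / n_topics
assignments = [
    [t for t, p in enumerate(row) if p > threshold] for row in doc_topic
]
for i, topics in enumerate(assignments):
    print(f"document {i} -> topics {topics}")
```

The resulting `assignments` list plays the same role as the topic-to-document table the script produces.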
The linguistic processing can be customized in the second panel; it is quite straightforward. Be aware that the stop-word removal and lemmatization options currently work for English content only.
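The sketch below illustrates what this kind of English-only preprocessing does in principle; it uses scikit-learn’s English stop-word list, and the tiny suffix-stripping rules are a crude stand-in for a real lemmatizer (CorText’s actual implementation may differ).

```python
# Illustration of stop-word removal plus a naive lemmatization stand-in.
# This is NOT CorText's preprocessing code, just a conceptual sketch.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def naive_lemmatize(token):
    # Very rough English suffix stripping, for illustration only
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return [naive_lemmatize(t) for t in tokens]

print(preprocess("The dreamer was walking and the dogs barked"))
```

Collapsing inflected forms this way means that “dog” and “dogs”, or “walk” and “walking”, are counted as the same word when topics are estimated.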
Additionally, one can modify the number of iterations, which can be helpful if your corpus is small. Finally, a graph is also produced showing the evolution of perplexity and log likelihood over the iterations.
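As a sketch of what that diagnostic graph tracks, the snippet below fits scikit-learn’s LDA with increasing iteration budgets and reports perplexity and the approximate log likelihood (the toy corpus and parameter values are assumptions for illustration):

```python
# Sketch: how perplexity and log likelihood evolve as max_iter grows.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["flying over water", "falling from the sky",
        "flying dreams", "falling dreams", "water and sky"]
X = CountVectorizer().fit_transform(docs)

for max_iter in (1, 5, 20):
    lda = LatentDirichletAllocation(n_components=2, max_iter=max_iter,
                                    random_state=0).fit(X)
    # Lower perplexity / higher log likelihood indicate a better fit
    print(max_iter, lda.perplexity(X), lda.score(X))
```

Watching these curves flatten out is a common way to judge whether the chosen number of iterations is sufficient.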
An example of the outcome is shown using a dream dataset (data courtesy of dreamscloud):
Access the dynamic version of the above image by clicking this URL:
Field used for analysis
Number of Topics:
The number of topics you define, which will determine the makeup of the analysis
Name the model
Lower-casing:
Will keep all label names in lower case
Stop-words Removal (english):
Will remove stop words such as “the”, “and” or “be”. Only works with English content
Lemmatization:
This option will make the script return the base form of each word. This helps avoid treating orthographically different forms of the same word as distinct words.
Minimum frequency of words:
Occurrence threshold a word must reach to be kept in the analysis.