Topic Modeling

Topic Modeling produces a topic representation of any corpus’ textual fields using the popular LDA (Latent Dirichlet Allocation) model. Each topic is defined by a probability distribution over words. Conversely, each document is defined as a probability distribution over topics.
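As a purely illustrative sketch (the numbers below are made up, not output from any real model), the two kinds of distributions look like this:

```python
# Toy illustration of the two distributions an LDA model produces.
# All numbers here are invented for demonstration purposes only.

# Each topic is a probability distribution over the vocabulary.
topics = {
    "topic_0": {"dream": 0.40, "night": 0.35, "fly": 0.25},
    "topic_1": {"house": 0.50, "door": 0.30, "room": 0.20},
}

# Conversely, each document is a probability distribution over topics.
doc_topics = {
    "doc_A": {"topic_0": 0.8, "topic_1": 0.2},
    "doc_B": {"topic_0": 0.1, "topic_1": 0.9},
}

# Both kinds of distributions sum to 1.
for dist in list(topics.values()) + list(doc_topics.values()):
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```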

In CorText, a topic model is inferred for a total number of topics that users have to define. The composition of topics can be explored with the pyLDAvis library, which visualizes the most relevant words in each topic. Topics are positioned in 2D according to their distances using a multidimensional scaling algorithm. The script also produces a new table storing the assignment of topics to documents. The distribution is flattened such that each document is assigned to every topic it is linked to with a probability greater than the inverse of the number of topics.
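The flattening rule described above can be sketched as follows (function and variable names are illustrative, not CorText internals):

```python
# A document is assigned to every topic whose probability exceeds
# 1 / (number of topics), i.e. what a uniform distribution would give.

def assign_topics(doc_topic_probs, num_topics):
    threshold = 1.0 / num_topics
    return [t for t, p in doc_topic_probs.items() if p > threshold]

# With 4 topics the threshold is 0.25: only topics above it are kept.
probs = {"topic_0": 0.55, "topic_1": 0.30, "topic_2": 0.10, "topic_3": 0.05}
print(assign_topics(probs, 4))  # ['topic_0', 'topic_1']
```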

The linguistic processing can be customized in the second panel; it is quite straightforward. Be aware that the stop-word removal and lemmatization options currently only work for English content.

Additionally, one can modify the number of iterations, which can be helpful if your corpus is small. Finally, a graph is also produced showing the evolution of perplexity and log likelihood over iterations.
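Perplexity is commonly computed from the log likelihood as exp(-log likelihood / number of tokens), so as the model fits the corpus better, perplexity decreases. A minimal sketch (with invented numbers, not real model output):

```python
import math

# Perplexity derived from the log likelihood; lower values indicate a
# better fit. The figures below are made up to show the shape of the curve.

def perplexity(log_likelihood, token_count):
    return math.exp(-log_likelihood / token_count)

token_count = 10_000
log_likelihoods = [-90_000.0, -85_000.0, -83_000.0]  # improving over iterations

for i, ll in enumerate(log_likelihoods, start=1):
    print(f"iteration {i}: perplexity = {perplexity(ll, token_count):.1f}")
```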

An example of the outcome is shown below using a dream dataset (data courtesy of DreamsCloud):


Field used for analysis:
The textual field of the corpus on which the topic model is trained.

Number of Topics:
The number of topics to infer, which determines the granularity of the analysis.

Custom name:
A name for the model.

Lower Case:
Converts all text to lower case.

Stop-words Removal (english):
Removes stop words such as “the”, “and”, or “be”. Only works with English content.

Remove punctuation:
Removes punctuation marks from the text.
Lemmatize (english):
Reduces each word to its base form, which helps avoid treating orthographically different forms of the same word as distinct. Only works with English content.
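A toy sketch of what lemmatization does (the lookup table here is hand-made for illustration; real lemmatizers such as those in NLTK or spaCy use dictionaries and part-of-speech information instead):

```python
# Hand-made, illustrative mapping from inflected forms to base forms.
LEMMAS = {"dreams": "dream", "dreamed": "dream", "flying": "fly", "was": "be"}

def lemmatize(tokens):
    # Unknown tokens are passed through unchanged.
    return [LEMMAS.get(t, t) for t in tokens]

print(lemmatize(["she", "dreamed", "of", "flying"]))
# ['she', 'dream', 'of', 'fly']
```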

Minimum frequency of words:
The minimum number of occurrences a word must have in the corpus to be included in the model.
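A minimal sketch of such a frequency filter (names and the threshold value are illustrative; whether CorText uses a strict or inclusive comparison is an assumption here):

```python
from collections import Counter

# Words occurring fewer times than the threshold are dropped from the
# vocabulary before modeling.

def filter_vocabulary(tokens, min_freq):
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n >= min_freq}

tokens = ["dream", "dream", "dream", "night", "night", "door"]
print(sorted(filter_vocabulary(tokens, 2)))  # ['dream', 'night']
```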

