This script shows how two sub-corpora (defined by the user within a dataset) differ in the words used in their textual content, or in the entities found in any categorical field. It relies on the excellent scattertext library by Jason Kessler.
The interactive example below shows which words Obama and Bush Jr. each used relatively more often in their State of the Union addresses.
Contrast Analysis is designed as an exploratory script. As such, it is only suited to comparing small to medium-sized sub-corpora. Beyond a few thousand texts, the browser may be unable to display the resulting visualization, because it also includes original snippets from the corpus.
When a textual field is selected as the data category, a tokenizing strategy must be chosen. The naive tokenizer (which splits words on spaces and punctuation) is the default. More advanced options are also offered: one is powered by spaCy and gives more accurate results but is slower, and another is a Twitter-specific tokenizer that also extracts hashtags. Entities can be bigrams (selected using PMI, pointwise mutual information) or monograms (single words).
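To make the naive strategy and the PMI criterion more concrete, here is a minimal sketch in plain Python. It is an illustration only, not the script's actual implementation: the function names `naive_tokenize` and `bigram_pmi`, the regular expression, and the `min_count` threshold are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def naive_tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes.

    Hypothetical stand-in for a 'split on spaces and punctuation' tokenizer.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def bigram_pmi(token_lists, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (base 2).

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ).
    Pairs seen fewer than `min_count` times are skipped (assumed threshold).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy documents, invented for the example.
docs = ["Health care reform", "Health care costs", "Reform of the tax code"]
scores = bigram_pmi([naive_tokenize(d) for d in docs], min_count=2)
```

A high positive PMI means the two words occur together more often than their individual frequencies would suggest, which is why such pairs are reasonable candidates for being treated as a single entity.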
You must also indicate the categorical variable that defines the sub-corpora to contrast. Then type each target value, making sure to use the same casing as in the original database (running list builder can help with this).
For instance, in the illustration, paragraph was selected as the text variable. Barack_Obama and George_W._Bush are the two values (modalities) of the variable president that are being compared.
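The sketch below illustrates why casing matters when typing the target values. It assumes the dataset is a list of dictionaries and uses a hypothetical helper, `split_subcorpora`; the script's real data model is not described here, and the paragraph texts are placeholders rather than actual speech content.

```python
def split_subcorpora(records, category_field, value_a, value_b, text_field):
    """Return the texts of the two sub-corpora selected by exact value match.

    Raises ValueError when a target value is absent, and suggests a value
    that differs only by casing when there is one.
    """
    available = {rec[category_field] for rec in records}
    for value in (value_a, value_b):
        if value not in available:
            close = sorted(v for v in available if v.lower() == value.lower())
            hint = f" Did you mean {close[0]!r}?" if close else ""
            raise ValueError(
                f"{value!r} is not a value of {category_field!r}.{hint}"
            )
    corpus_a = [r[text_field] for r in records if r[category_field] == value_a]
    corpus_b = [r[text_field] for r in records if r[category_field] == value_b]
    return corpus_a, corpus_b


# Placeholder records mirroring the variables of the illustration.
records = [
    {"president": "Barack_Obama", "paragraph": "placeholder paragraph 1"},
    {"president": "George_W._Bush", "paragraph": "placeholder paragraph 2"},
    {"president": "Barack_Obama", "paragraph": "placeholder paragraph 3"},
]

obama, bush = split_subcorpora(
    records, "president", "Barack_Obama", "George_W._Bush", "paragraph"
)
```

With these records, asking for `barack_obama` instead of `Barack_Obama` raises an error rather than silently producing an empty sub-corpus.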
The last set of options lets you customize how the text snippets corresponding to a given term are contextualized with other metadata in the lower part of the visualization. Adjusting the minimum term frequency can also be useful for larger or smaller corpora. You can also limit the total number of terms shown. By default, the terms with the largest discriminatory scores are shown.
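The interplay between the minimum frequency and the cap on the number of terms can be sketched as follows. The discriminatory score used here is a smoothed log-odds ratio, chosen only for illustration; it is an assumption and not necessarily the score the script or scattertext computes. The names `top_contrastive_terms`, `min_frequency` and `max_terms` are likewise hypothetical.

```python
import math
from collections import Counter


def top_contrastive_terms(tokens_a, tokens_b, min_frequency=2, max_terms=10):
    """Rank terms by how strongly they lean towards one sub-corpus.

    Score: log of the ratio of add-one smoothed relative frequencies
    (positive leans towards A, negative towards B). Illustrative only.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    vocab = set(counts_a) | set(counts_b)

    scores = {}
    for term in vocab:
        # Drop terms that are too rare overall to be meaningful.
        if counts_a[term] + counts_b[term] < min_frequency:
            continue
        rate_a = (counts_a[term] + 1) / (total_a + len(vocab))
        rate_b = (counts_b[term] + 1) / (total_b + len(vocab))
        scores[term] = math.log(rate_a / rate_b)

    # Most discriminating first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-abs(scores[t]), t))
    return [(t, scores[t]) for t in ranked[:max_terms]]


# Toy token lists, invented for the example.
tokens_a = "jobs jobs jobs economy health".split()
tokens_b = "war war war economy terror".split()
top = top_contrastive_terms(tokens_a, tokens_b, min_frequency=2, max_terms=2)
```

Raising the minimum frequency removes rare terms before ranking, while the cap only truncates the ranked list, so the two settings act at different stages.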