This script shows how much two sub-corpora (defined by the user within a dataset) differ in the words used in their textual content, or in the entities found in any categorical field. It relies on the excellent scattertext library by Jason Kessler.
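To give an intuition of what "relatively more often" means when comparing two sub-corpora, here is a minimal sketch in plain Python. It is not the scoring used by scattertext or by this script: it assumes a simple smoothed log-ratio of relative word frequencies, and the function name `contrast_scores`, the `smoothing` parameter, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def contrast_scores(texts_a, texts_b, smoothing=1.0):
    """Smoothed log-ratio of relative word frequencies (illustrative only).

    Positive scores mean a word is relatively more frequent in sub-corpus A,
    negative scores mean it is relatively more frequent in sub-corpus B.
    """
    # Naive whitespace tokenization, lowercased.
    counts_a = Counter(w for t in texts_a for w in t.lower().split())
    counts_b = Counter(w for t in texts_b for w in t.lower().split())
    vocab = set(counts_a) | set(counts_b)

    # Add-one style smoothing so that words absent from one side
    # do not produce a division by zero.
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)

    return {
        w: math.log((counts_a[w] + smoothing) / total_a)
        - math.log((counts_b[w] + smoothing) / total_b)
        for w in vocab
    }


# Toy sub-corpora, invented for the example.
corpus_a = ["health care reform", "jobs and health care"]
corpus_b = ["terror and security", "security and freedom"]

scores = contrast_scores(corpus_a, corpus_b)
# Words sorted from "most typical of A" to "most typical of B".
ranked = sorted(scores, key=scores.get, reverse=True)
```

Words at the top of `ranked` are the most characteristic of the first sub-corpus, words at the bottom the most characteristic of the second. The actual visualization places each term according to its frequency in both sub-corpora rather than on a single ranked list.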
See below an interactive example showing which words were used relatively more often by Obama and Bush Jr. during their State of the Union addresses.
Contrast Analysis is intended as an exploratory script. As such, it is only suited to comparing sub-corpora of small to medium size. Beyond a few thousand texts, the browser may not be able to display the resulting visualization, since original snippets from the corpus are included in it as well.
When selecting a textual field, one can choose between a naive tokenizer, a more advanced one (powered by spaCy), or a dedicated tokenizer for Twitter that also extracts hashtags. Entities can be bigrams (selected using pointwise mutual information, PMI) or monograms (single words).
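The difference between these options can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the script's implementation: the two regex tokenizers stand in for the naive and Twitter tokenizers, the function names and the `min_count` threshold are hypothetical, and PMI is computed here in its textbook form, log2 of p(x, y) / (p(x) p(y)).

```python
import math
import re
from collections import Counter


def naive_tokenize(text):
    """Lowercase the text and keep runs of word characters only."""
    return re.findall(r"\w+", text.lower())


def hashtag_tokenize(text):
    """Same as above, but keep a leading '#' so hashtags stay distinct tokens."""
    return re.findall(r"#?\w+", text.lower())


def pmi_bigrams(token_lists, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    Pairs seen fewer than `min_count` times are skipped, since PMI is
    unreliable for very rare pairs.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_unigrams
        p_y = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy documents, invented for the example.
docs = [
    "climate change is real",
    "climate change policy now",
    "the policy is new",
]
bigram_scores = pmi_bigrams([naive_tokenize(d) for d in docs])
tweet_tokens = hashtag_tokenize("Vote now #SOTU")
```

A high PMI means the two words appear together far more often than their individual frequencies would suggest, which is why it is a common way to pick out meaningful bigrams such as "climate change".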