Contrast Analysis

This script shows how strongly two sub-corpora (defined by the user within a dataset) differ in the words used in their textual content, or in the entities found in any categorical field. It relies on the excellent scattertext library by Jason Kessler.
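To make the idea of "contrast" concrete, here is a minimal sketch in plain Python. It assumes a very simple score, the log-ratio of smoothed relative word frequencies between two sub-corpora; this is an illustrative stand-in and not necessarily the scoring that scattertext or the script applies. The function names `tokenize` and `contrast_scores` are hypothetical.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(text):
    # Naive tokenizer: split on whitespace, strip punctuation, lowercase.
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]


def contrast_scores(corpus_a, corpus_b):
    """Log-ratio of add-one-smoothed relative frequencies (assumed score).

    Positive values mean the word is relatively more frequent in corpus_a,
    negative values mean it is relatively more frequent in corpus_b.
    """
    counts_a = Counter(t for doc in corpus_a for t in tokenize(doc))
    counts_b = Counter(t for doc in corpus_b for t in tokenize(doc))
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)  # add-one smoothing
    total_b = sum(counts_b.values()) + len(vocab)
    scores = {}
    for word in vocab:
        p_a = (counts_a[word] + 1) / total_a
        p_b = (counts_b[word] + 1) / total_b
        scores[word] = math.log(p_a / p_b)
    return scores


if __name__ == "__main__":
    # Toy sub-corpora, invented for illustration only.
    a = ["we must reform health care", "health care for every family"]
    b = ["we will defend freedom", "freedom and security for every nation"]
    scores = contrast_scores(a, b)
    for word in sorted(scores, key=scores.get, reverse=True)[:3]:
        print(word, round(scores[word], 2))
```

Words shared equally by both sub-corpora score close to zero, while words concentrated in one sub-corpus receive a large positive or negative score.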

The interactive example below shows which words were used relatively more often by Obama and by Bush Jr. during their State of the Union addresses.

Contrast Analysis is intended as an exploratory script. As such, it is only suited to comparing sub-corpora of small to medium size. Beyond a few thousand texts, the browser may not be able to display the resulting visualization, because original snippets from the corpus are also embedded in it.

When selecting a textual field, one can choose between a naive tokenizer, a more advanced one (powered by spaCy), or a Twitter-specific tokenizer that also extracts hashtags. Entities can be bigrams (selected on the basis of pointwise mutual information, PMI) or unigrams (single words).
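As a rough illustration of the two ideas above, the following sketch combines a hashtag-aware tokenizer with PMI scoring of bigrams. It is written under assumptions: the regular expression, the `min_count` threshold, and the function names are hypothetical, and the way the script itself selects bigrams may differ.

```python
import math
import re
from collections import Counter

# Keeps hashtags such as "#climate" as single tokens (assumed pattern).
TOKEN_RE = re.compile(r"#\w+|\w+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def bigram_pmi(docs, min_count=2):
    """PMI (base 2) of adjacent word pairs seen at least min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # pairs never cross documents
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    # Toy documents, invented for illustration only.
    docs = [
        "Climate change is real #climate",
        "climate change policy",
        "the policy is new",
    ]
    print(tokenize(docs[0]))
    print(bigram_pmi(docs))
```

A high PMI means the two words co-occur far more often than their individual frequencies would predict, which is why such pairs are good candidates to be treated as a single entity.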
