Contrast Analysis

Contrast Analysis script proposes to show how much two sub-corpus (defined by the user within a dataset) feature a different set of words in its textual content or entities in any categorical field. It uses the excellent library scattertext (Kessler J.S., 2017).

See below an interactive example showing which words were used relatively more often by Obama and Bush Jr during their state of the union adresses.

Contrast Analysis is thought as an exploratory script. As such it is only adapted to compare sub-corpora of small to medium scaled size. Beyond a few thousands of text, the browser may not be able to show the resulting visualization as original snippets from the corpus are also visible.

Choose the textual variable

Text variable

Choose the variable to use, which could contains full text or categories.

How to work with it

What is the nature of the data, textual or categorical ?

When selecting a textual field as data category, one has to select a strategy for tokenizing.  

text

Choose text if the text variable is full text.

category

 Choose category if the text variable contains categories.

Advanced options for textual field  (bigrams and tokenizer)

Exclude bigrams

Entities can be bigrams or monograms. The identification of the bigrams is based on the PMI score, for Pointwise Mutual Information, which mesures the likelihood of co-occurrence of two words.

naive

Naive is the default option for the tokenizer. It simply divides the words based on spaces and punctuation.

advanced

For more accurate result, advanced options are also offered to tokenize the sentences (powered by spacy). This option is slower and more resources consuming (avoide it for medium sized corpus).

Twitter

It’s also possible to use a specific tokenizer for Twitter that also extracts hashtags.

Build the contrast strategy

Categorical variable

It is also mandatory to indicate the category of the variable you want to contrast the composition of. Then type the target value making sure to use the same casing than in the original database (running list builder can be helpful in that respect). For instance in the case of the illustration, paragraph  was  selected  as  the text  variable. Barack_Obama  and  George_W._Bush are the two possible modalities  of  the variable president  that  are  being  compared.

Contrast the target category with another variable ?

By default, the target value will be contrasted against the rest of the corpus, but you can also contrast its occurrences with another variable value (e.g. George_W._Bush).

Add metadata when showing snippets

The last set of options will allow you to customize how snippets of texts corresponding to a given term are contextualized with other metadata in the lower part of the visualization.

Enter a word mimimal frequency

Modifying the minimum frequency of terms can also be useful for larger/smaller corpora.

Maximal number of terms in the visualization

One can also limit the total number of terms shown. By default, terms with the largest discriminatory scores are shown.

Reference

Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017. Link to preprint: arxiv.org/abs/1703.00565