Distant Reading

Vocabulary evolution in a corpus of reports

Inspired by Franco Moretti's work on literary corpora, this script provides a complete interface (files suffixed by distant.html in the resulting dataset directory) for comparing the dynamics of a series of items in a corpus. It is mainly designed to compare words from a given textual field, but it could be used for other purposes.

You can choose whether or not to use a pre-defined list of terms. If you do not, multi-terms will be extracted with no prior hypothesis about their grammatical category (meaning that verbs, adverbs, noun phrases, etc. will be identified). The size of the n-grams can be defined manually. If you provide a term list (formatted as the usual three-column spreadsheet; see the corpus_terms_indexer page), only the words and stems present in the file will be indexed.

In both cases, you can define the total number of items: only the N most frequent ones will be kept.
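As a rough illustration of the "keep only the N most frequent items" step, here is a minimal Python sketch. It is not the script's actual extraction method (which relies on linguistic processing): it simply counts raw n-grams over pre-tokenized documents. The function names `ngrams` and `top_items`, and the parameters `max_n` and `limit`, are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_items(documents, max_n=2, limit=3):
    """Count all n-grams up to size max_n and keep the `limit` most frequent.

    `documents` is a list of token lists. This is a naive frequency count,
    used here only to illustrate the cut-off on the number of items.
    """
    counts = Counter()
    for tokens in documents:
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts.most_common(limit)


# Toy data, invented for the example.
docs = [
    ["financial", "assistance", "mission"],
    ["financial", "assistance"],
]
print(top_items(docs, max_n=2, limit=3))
```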

It is now also possible to compare the dynamics of a word in a given corpus with its dynamics in another one.

The metrics and visualizations available in the final interface are listed below (thanks to Pablo Ruiz for writing this full documentation):

Slope: Determines the amount of frequency change that the term needs to have undergone to be kept in the filtered set, as well as the direction of this change.

A positive slope means that the term occurs more frequently in later periods of the corpus than in its earlier periods (using relative frequencies, normalized by the size of the corpus for each period).

A negative slope means that the frequency of occurrences in earlier periods is higher than in later periods of the corpus (these frequencies are again normalized taking into account the size of the corpus slice contained in each period).
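The sign of the slope can be illustrated with a short Python sketch. It assumes that the slope is an ordinary least-squares fit of the relative frequency against the period index; the documentation does not state the exact formula used by the script, so this is only one plausible reading. The function names and the toy counts are hypothetical.

```python
def relative_frequencies(term_counts, period_sizes):
    """Normalize raw counts by the size of the corpus slice of each period."""
    return [c / n if n else 0.0 for c, n in zip(term_counts, period_sizes)]


def slope(values):
    """Least-squares slope of `values` against the period index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two periods are needed to compute a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Toy data: raw counts rise, but the corpus grows faster than the term does,
# so the relative frequency (and therefore the slope) goes down.
counts = [10, 20, 30]
sizes = [100, 400, 900]
print(slope(relative_frequencies(counts, sizes)) < 0)  # True
```

The example shows why normalization matters: a term whose raw count triples can still have a negative slope when the later periods contain much more text.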

RS, std_value and noise: Measures of “burstiness”, indicating to what extent the term has had spikes of occurrence at given moments in the corpus.

Using these filters, you can restrict the terms displayed to those whose temporal profile interests you: terms whose use increases with time, terms that become less and less used in the corpus, or terms that show “spikes” or “bursts” in their usage.
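The idea of a “spike” can be made concrete with a small Python sketch. The formulas behind RS, std_value and noise are not given in this documentation, so the code below does not reproduce any of them: it uses the coefficient of variation (standard deviation divided by the mean) of the relative frequencies as a generic stand-in. The name `spike_score` and the toy profiles are hypothetical.

```python
from statistics import mean, pstdev


def spike_score(rel_freqs):
    """Coefficient of variation of a term's relative frequencies.

    A generic burstiness indicator, NOT the script's RS, std_value or noise:
    0 for a perfectly flat profile, larger when usage is concentrated
    in a few periods.
    """
    m = mean(rel_freqs)
    if m == 0:
        return 0.0
    return pstdev(rel_freqs) / m


# Toy profiles with the same average frequency over four periods.
steady = [0.02, 0.02, 0.02, 0.02]
bursty = [0.0, 0.0, 0.08, 0.0]
print(spike_score(steady) < spike_score(bursty))  # True
```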

Besides the temporal profiles for individual terms in the corpus, the Distant Reading module also calculates an evolution profile type or evolution profile cluster based on the evolution profiles for individual terms. This is displayed in the last column of the UI.

For example, all terms that are often used early on in the corpus and whose usage then decreases may be clustered together as a “falling evolution” cluster. You can then select all terms in the cluster using the cluster control at the right end of the filter bar.
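To give an intuition of what an evolution profile type is, the following Python sketch groups terms by a very simple rule: it compares the average relative frequency in the first half of the periods with that in the second half. This is an assumption made for illustration only; the clustering method actually used by the module is not described here. The function names, the `tolerance` parameter and the toy profiles are hypothetical.

```python
from collections import defaultdict


def profile_label(rel_freqs, tolerance=0.1):
    """Label a profile as 'rising', 'falling' or 'stable'.

    Compares the mean of the first half of the periods with the mean of the
    second half (the middle period is ignored when the count is odd).
    """
    half = len(rel_freqs) // 2
    if half == 0:
        raise ValueError("at least two periods are needed")
    early = sum(rel_freqs[:half]) / half
    late = sum(rel_freqs[-half:]) / half
    if late > early * (1 + tolerance):
        return "rising"
    if late < early * (1 - tolerance):
        return "falling"
    return "stable"


def group_by_profile(profiles):
    """Map each label to the sorted list of terms sharing that label."""
    groups = defaultdict(list)
    for term, rel_freqs in profiles.items():
        groups[profile_label(rel_freqs)].append(term)
    return {label: sorted(terms) for label, terms in groups.items()}


# Toy relative frequencies over four periods, invented for the example.
profiles = {
    "assistance": [0.04, 0.03, 0.02, 0.01],
    "accounting": [0.01, 0.02, 0.03, 0.04],
    "bank": [0.05, 0.05, 0.05, 0.05],
}
print(group_by_profile(profiles))
```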

The Distant Reading interface gives access to several other modules for corpus exploration.

Word Tree

You can access the Word Tree by clicking on the (c) next to each term in the list. This displays a Word Tree for the corpus focused on that term: a representation of the corpus in which series of two branches start from the term and its neighbours, as in the screenshot below for the word financial in the Bankspeak corpus.

Word Tree

Bump Charts

Bump charts help to study the evolution of a term’s context. The corpus is divided into several periods, and the bump chart displays the 10 most frequent neighbours of the term at each period, within a configurable context window. See the screenshot below for the evolution of the neighbours of the term financial in the Bankspeak corpus.

In early periods, notions like assistance or mission were more frequent as neighbours of financial than in later periods, where the most frequent neighbours belong to the accounting domain (e.g. financial statement or accounting). The word purpose may have replaced mission.
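The data behind such a chart can be sketched in Python as a ranking of neighbours per period. The sketch assumes pre-tokenized sentences and a context restricted to the sentence containing the term; the function name `top_neighbours` and the toy sentences are hypothetical, and the real module may count neighbours differently.

```python
from collections import Counter


def top_neighbours(sentences, term, k=10):
    """Return the k words that most often share a sentence with `term`."""
    counts = Counter()
    for tokens in sentences:
        if term in tokens:
            counts.update(t for t in tokens if t != term)
    return [word for word, _ in counts.most_common(k)]


# Toy corpus split into two periods, invented for the example.
periods = {
    0: [
        ["financial", "assistance", "mission"],
        ["financial", "assistance"],
        ["mission", "report"],
    ],
    4: [
        ["financial", "statement", "accounting"],
        ["financial", "statement"],
    ],
}

# One ranked list per period: the "bumps" are the rank changes between lists.
for period, sentences in periods.items():
    print(period, top_neighbours(sentences, "financial", k=2))
```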

Egonetworks

Whereas the bump charts show the evolution of a term’s most frequent neighbours, the egonetworks show how the neighbours of a given term are connected to each other; they provide an indication of the structure of a term’s context.

In the two screenshots below, we see egonetworks for neighbours of financial at two different periods in the corpus, where the different structure among those neighbours at each period can be observed.

period 0, “financial” ego-network
period 4, “financial” ego-network

In the earlier period (period 0), there is a connection between assistance and assistance financial, whereas in the corpus’ last analyzed period (period 4), assistance is disconnected from any other neighbour of financial.
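A minimal Python sketch of the egonetwork idea follows. It assumes that two neighbours are connected when they appear together in at least one sentence; the linking criterion actually used by the module is not specified here. The function name `ego_network` and the toy sentences are hypothetical.

```python
from itertools import combinations


def ego_network(sentences, term):
    """Return (neighbours, edges) for `term`.

    Neighbours are the words sharing a sentence with `term`. An edge links
    two neighbours that co-occur in any sentence of the period.
    """
    neighbours = set()
    for tokens in sentences:
        if term in tokens:
            neighbours.update(t for t in tokens if t != term)

    edges = set()
    for tokens in sentences:
        present = sorted(neighbours.intersection(tokens))
        edges.update(combinations(present, 2))
    return neighbours, edges


# Toy period, invented for the example.
sentences = [
    ["financial", "assistance", "technical"],
    ["assistance", "technical", "mission"],
    ["financial", "statement"],
]
neighbours, edges = ego_network(sentences, "financial")
connected = {word for edge in edges for word in edge}
print(sorted(neighbours - connected))  # neighbours with no link to another neighbour
```

Running the function once per period and comparing the resulting edge sets shows which neighbours become disconnected over time.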

Context window configuration

For Bump Charts and Egonetworks, the context windows that will be analyzed can be defined when processing the corpus with the Distant Reading Dialogue (see the Distant Reading Tutorial video).

The different windows configured can then be selected from the Distant Reading UI itself, using the link that indicates the context (see the screenshot below):

[0,0]: context is restricted to the same sentence

[5,0]: context is restricted to a window of 5 sentences around the term

[100,0]: context is restricted to a window of 100 sentences around the term
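The effect of the window size can be sketched in Python as follows. The sketch assumes that the first number is a count of sentences taken on each side of the sentence containing the term; the documentation only says "around the term", and the role of the second number is not described, so it is not modelled. The function name `context_sentences` and its `radius` parameter are hypothetical.

```python
def context_sentences(sentences, term, radius=0):
    """Return the sentences within `radius` sentences of any occurrence of `term`.

    radius=0 keeps only the sentences that contain the term itself.
    Overlapping windows are merged and the original order is preserved.
    """
    selected = set()
    for i, tokens in enumerate(sentences):
        if term in tokens:
            start = max(0, i - radius)
            stop = min(len(sentences), i + radius + 1)
            selected.update(range(start, stop))
    return [sentences[i] for i in sorted(selected)]


# Toy document of five one- or two-word sentences, invented for the example.
doc = [["a"], ["b"], ["financial", "c"], ["d"], ["e"]]
print(context_sentences(doc, "financial", radius=0))  # [['financial', 'c']]
print(context_sentences(doc, "financial", radius=1))  # [['b'], ['financial', 'c'], ['d']]
```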

Please also see a video demonstrating the capacities of Distant Reading:
