SASHIMI

SASHIMI lets you study a corpus by simultaneously clustering documents and terms, revealing relationships between these clusters, and optionally extending them by clustering any categorical metadata dimension.

It does so in a parsimonious and nested (hierarchical) fashion, based on robust, state-of-the-art statistical models that avoid displaying statistically insignificant structures and provide information at different scales of your data.

SASHIMI will cluster the entirety of your data, without the need to discard or filter elements. This opens the door to study both central and marginal issues, together and in as much detail as the data affords.

Navigating this multi-scale, multi-dimensional and potentially large landscape calls for a new approach to corpus data visualization. That is why SASHIMI also provides a new style of interactive maps, conceived to let you seamlessly navigate your corpus and understand the detailed composition and role of clusters, as well as their place in the bigger picture and their relatedness to clusters of other dimensions: documents, terms or metadata.

Static frame of a domain-topic map with no domain or topic selected, displaying the proportion of documents in each domain (left) and the total proportional usage of each topic (right).

Here are the main concepts you need to grasp before consuming SASHIMI:

| Concept | Definition | Possible interpretation (STS context) | Cortext method |
| --- | --- | --- | --- |
| Domain (document block) | A group of related documents, one whose documents tend to employ the same topics | Each domain can be seen as the outputs of a specific epistemic community | Domain-topic model |
| Topic (term block) | A group of related terms (words), one whose terms tend to get employed in the same domains | Topics can be related to the discursive resources mobilized by, and thus connecting, distinct epistemic communities | Domain-topic model |
| Metadata block (for a chosen dimension) | A group of related metadata elements, e.g. dates, institutions, journals or people | Typifies that dimension’s insertion in the domains; for dates it corresponds to periods; for people it corresponds to profiles of participation in epistemic communities; and so forth | Chain dimension |

Examples

Here are some example outputs for the methods to be discussed:

Domain-topic map for a corpus of academic publications about “chloroquine”, produced during a Cortext workshop.

Domain-journals chained map for the same corpus, showing how journal clusters relate to the domains featured in the previous map.

Usage

SASHIMI follows a different workflow than most scripts in Cortext. Instead of a single “script”, analysis is carried out by a group of methods that perform sequential steps of the analysis.

Method 0: Prepare Corpus

This is the obligatory first step. It will take your corpus database (produced by “Data Parser”) and generate a “tokenization” of the data, while detecting n-grams.

Parameters

Text Source

Pick the field you want to get each document’s terms from. This is usually the one that contains your document’s text, but could also be titles, keywords or indexes produced by other methods.

Build n-grams when tokenizing

Whether you want composite terms that appear often in the corpus to be detected and treated as such, and to what length. For example, if you want “São Paulo” to be treated as a single term, you’d have to choose at least a value of 2, whereas if you want the same for “Rio de Janeiro”, you’d have to pick 3. In that case, the text “Hagbard moved from Rio de Janeiro to São Paulo” would likely be tokenized as [“Hagbard”, “moved”, “from”, “Rio de Janeiro”, “to”, “São Paulo”] if the names of both cities appear often in your corpus.
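The effect of the n-gram length can be pictured with a minimal Python sketch. It is not the tokenizer SASHIMI actually uses: the toy corpus, the `min_count` threshold and the function names `frequent_ngrams` and `tokenize` are assumptions made for illustration only.

```python
from collections import Counter

# Toy corpus (made up): each document is already split into words.
corpus = [
    "Hagbard moved from Rio de Janeiro to São Paulo".split(),
    "Rio de Janeiro is hot".split(),
    "São Paulo is large".split(),
]

def frequent_ngrams(docs, max_n, min_count=2):
    """Return the word runs of length 2..max_n seen at least min_count times."""
    counts = Counter()
    for words in docs:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {gram for gram, count in counts.items() if count >= min_count}

def tokenize(words, frequent, max_n):
    """Greedily merge the longest frequent n-gram starting at each position."""
    tokens, i = [], 0
    while i < len(words):
        for n in range(max_n, 1, -1):
            gram = tuple(words[i:i + n])
            if len(gram) == n and gram in frequent:
                tokens.append(" ".join(gram))
                i += n
                break
        else:
            # No frequent n-gram starts here: keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens

sentence = corpus[0]
tokens_3 = tokenize(sentence, frequent_ngrams(corpus, 3), 3)
tokens_2 = tokenize(sentence, frequent_ngrams(corpus, 2), 2)
print(tokens_3)  # ['Hagbard', 'moved', 'from', 'Rio de Janeiro', 'to', 'São Paulo']
print(tokens_2)  # ['Hagbard', 'moved', 'from', 'Rio de', 'Janeiro', 'to', 'São Paulo']
```

With a length of 2, “Rio de Janeiro” cannot be kept whole, which is why a value of 3 is needed for three-word names.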

Method 1.0: Domain-Topic Model

This is the second and crucial stage of SASHIMI. It will fit the underlying generative model, finding clusters that correspond well to the data: documents get clustered in domains, and terms get clustered in topics.

Fitting a model

Because fitting a complex statistical model is a computationally difficult task, this stage may take a long time to complete. For a corpus of a few thousand documents it may take up to an hour, but at a size of hundreds of thousands of documents it may take over a day, so be patient.

Moreover, since the results are stochastic in nature, it might be worth running the method twice, or more, if you’re not satisfied with them. Yet, as we’ll discuss, we provide a clear criterion for preferring one run over another. A rerun may be warranted, for example, if the structure of your data is subtle and a first run finds no clusters. Still, a single run should yield a good clustering in most cases.

Because SASHIMI relies on well-grounded statistics, the quality of different clusterings is quantitatively comparable. With each run, the method will output a file named like “entropy: 999.999.txt”, where the number tells you the value of the model entropy with the inferred clusters. The comparison between runs is very simple: in general, the model fit with the lowest entropy should be preferred. Also, that file’s contents show the numbers of clusters at each level of the hierarchy.
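If you keep the entropy files of several runs, picking the preferred one can be sketched in a few lines of Python. The run labels and entropy values below are made up, and the sketch assumes the number in the file name uses a decimal point; the helper `entropy_from_filename` is hypothetical, not part of Cortext.

```python
import re

# Hypothetical entropy files from three runs of the same method (values made up).
run_files = {
    "run_a": "entropy: 1523.481.txt",
    "run_b": "entropy: 1498.020.txt",
    "run_c": "entropy: 1510.775.txt",
}

def entropy_from_filename(name):
    """Extract the entropy value from a name like 'entropy: 999.999.txt'."""
    match = re.fullmatch(r"entropy: (-?\d+(?:\.\d+)?)\.txt", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return float(match.group(1))

# The run whose model has the lowest entropy is the one to prefer.
best_run = min(run_files, key=lambda run: entropy_from_filename(run_files[run]))
print(best_run)  # run_b
```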

Parameters

Prepared corpus

Which tokenization resulting from the previous step you wish to build upon, in case you ran that more than once.

Transformation to apply to document’s tokenized text source

Choose (none) to count multiple uses of a term in the same document, or (set) to only consider whether a term gets used or not in a document. This choice depends on your data, but choosing the latter may be useful to attenuate the influence of stylistic variations.
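The difference between the two options can be illustrated with a short Python sketch. It only mimics the counting behaviour described above on a made-up document; the function `term_usage` is a hypothetical name, not the method’s actual code.

```python
from collections import Counter

# Tokens of one made-up document, with a term repeated three times.
tokens = ["virus", "protein", "virus", "spike_protein", "virus"]

def term_usage(tokens, transformation="none"):
    """Count term usage in a document under the chosen transformation."""
    if transformation == "none":
        # Every occurrence counts.
        return Counter(tokens)
    if transformation == "set":
        # Only presence or absence counts: each distinct term counts once.
        return Counter(set(tokens))
    raise ValueError(f"unknown transformation: {transformation!r}")

print(term_usage(tokens, "none")["virus"])  # 3
print(term_usage(tokens, "set")["virus"])   # 1
```

Under (set), a document that repeats “virus” many times weighs the same as one that mentions it once, which is how stylistic variation gets attenuated.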

Method 1.1: Domain-Topic Map

This step produces the interactive map used to navigate the clusters found in the model. Instructions for working with the map are contained in the resulting file, under the tab called “Help”.

Static frame of the domain-topic map from the example provided, showing the level-2 domain (selected on the left) where the “spike_protein” topic (selected on the right) has its strongest relevance, and the strong co-occurrence there with the “coronavirus” topic (hover popup).

Parameters

Domain-topic model

Which Domain-topic model resulting from the previous step you wish to map, in case you ran it more than once.

Title column

Which field to display as the title of documents on the map.

Time column

Which field to attempt to treat as a temporal dimension in order to display the evolution of document clusters. May also work with non-temporal dimensions, giving a histogram.

URI column

If your corpus contains URIs for documents, a link can be shown with their titles, providing easy access to the original. Many datasets have a URL or URI field, or some other field that can be modified to produce a URL for it.

URI template

If there is no explicit URI field, you may build URIs using values from the field set as URI column, by providing a template where {} will get replaced by the values. For example, if you have a DOI field, you can get a URI with the following template: https://doi.org/{}
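The substitution can be pictured with a one-function Python sketch. The function `build_uri` and the DOI value are hypothetical, used only to show what a template produces; the map’s own substitution code may differ.

```python
def build_uri(template, value):
    """Replace the {} placeholder in the template with a field value."""
    return template.replace("{}", str(value))

# Hypothetical DOI taken from a document's DOI field.
doi = "10.1000/example"
print(build_uri("https://doi.org/{}", doi))  # https://doi.org/10.1000/example
```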

You’re good to go!

If you got here, you’re ready to use the fundamental steps of SASHIMI! We suggest you take a break to go ahead and try them.

When you come back, two further steps will allow you to extend the document-term clusters to metadata dimensions.

Method 2.0: Chain dimension

Once you have inferred domains for your corpus’ documents, and if they contain metadata such as time, institutions, authors or geospatial data, you may be asking yourself: how do these other variables get distributed between domains? Do some institutions or years tend to get associated with similar domains, while others get associated with different ones?

That question can be answered by chaining these dimensions in a new model, called a Domain-Chained Model. In this new model, metadata values get clustered according to their association to domains.

Since this method will fit a new model, what has been said of model fitting for the Domain-topic model, in the Fitting a model section, applies as well.
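To get an intuition for what “association to domains” means, the following Python sketch tabulates how often each metadata value appears in each domain. It is only an illustration on made-up data: the actual method fits a statistical model rather than comparing profiles by hand, and the names `association_table` and `domain_profile` are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up documents: (domain assigned by a fitted model, journal of the document).
documents = [
    ("domain_1", "Journal A"), ("domain_1", "Journal A"), ("domain_2", "Journal A"),
    ("domain_1", "Journal B"), ("domain_1", "Journal B"), ("domain_2", "Journal B"),
    ("domain_3", "Journal C"), ("domain_3", "Journal C"),
]

def association_table(documents):
    """Count, for each metadata value, the documents it has in each domain."""
    table = defaultdict(Counter)
    for domain, value in documents:
        table[value][domain] += 1
    return table

def domain_profile(counts):
    """Turn raw counts into the share of a value's documents in each domain."""
    total = sum(counts.values())
    return {domain: count / total for domain, count in counts.items()}

table = association_table(documents)
for journal in sorted(table):
    print(journal, domain_profile(table[journal]))
```

In this toy data, Journal A and Journal B spread over domains in the same way, while Journal C sits in a different domain altogether. Values with similar profiles are the kind a chained model would tend to place in the same metadata block.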

Parameters

Domain-topic model

Which Domain-topic model you wish to chain the metadata to, in case you ran it more than once.

Chained dimension

The dimension of the metadata you want to cluster: dates, years, journals, authors, keywords, categories etc.

Method 2.1: Domain-chained Map

After chaining a metadata dimension to your domains, you will very likely want to navigate the inferred relationships between document and metadata clusters. This method produces the map for that, to be used in conjunction with the respective domain-topic map.

Static frame of the domain-journal chained map from the example provided, selecting the same level-2 domain as in the previous image. Journal clusters relevant to the selected domain are shown on the right. The strongest of those is selected, so on the left we see the domains publishing in it.

Parameters

This method takes parameters much like the Domain-topic map method, with the difference that, instead of taking as first input a Domain-topic model, it takes a domain-chained model obtained with the Chain dimension method.

Chopsticks

In order to properly consume SASHIMI, one is called to master the art of handling hashi. Fortunately, inside each and every map generated you will find a “Help” tab, where all the details of the visualization are explained.

If you’re curious or want a more thorough discussion on how to interpret these maps, you can read a pre-print of the paper introducing the methodology: Domain-topic models with chained dimensions: charting an emergent domain of a major oncology conference.