SASHIMI

Typical topic models work well to describe the textual components of individual documents, but don’t help you understand groups of documents and how their components relate. Usual document clustering techniques, on the other hand, don’t provide a description of document groups as a summary of document components, or do so only as an afterthought not intrinsic to the clustering model. Equally important, variations of such instruments rarely make an effort to provide adequate data visualization and exploration interfaces that let one grasp and manipulate such an abstraction of documents and terms as groups, along with the complex network of relationships between them from which meaning and interpretability can emerge.

Sashimi lets you study a corpus by simultaneously clustering documents and terms, revealing relationships between these clusters, and optionally extending them by clustering metadata of any categorical dimension. It does so in a parsimonious and nested (hierarchical) fashion, based on robust, state-of-the-art statistical models that avoid presenting statistically insignificant structures, and that provide clusters at different scales. Moreover, its interactive maps afford seamless multi-scale and multi-sector navigation through the corpus, seen through the model’s lens. There’s thus no need to discard or filter elements prior to processing, which opens the door to studying both central and marginal issues together and in all the detail available. One may systematically interrogate the corpus from individual documents up to the entire collection, and from the individual word up to the full vocabulary, in order to uncover the subsets and levels relevant to one’s research question.

Frame of a domain-topic map with no domain and no topic selected, displaying the proportion of documents in each domain (left) and the total proportional usage of each topic (right) at different scales (left to right on each side).

Here are the main concepts you need in order to enjoy Sashimi:

Concept: Domain (document block)
Definition: A group of related documents, one whose documents tend to employ the same topics.
Possible interpretation (STS context): Each domain can be seen as the outputs of a specific epistemic community.

Concept: Topic (term block)
Definition: A group of related terms (words), one whose terms tend to get employed in the same domains.
Possible interpretation (STS context): Topics can be related to the discursive resources mobilized by, and thus connecting and interconnecting, distinct epistemic communities.

Concept: Metadata block (for a chosen dimension)
Definition: A group of related metadata elements, e.g. dates, institutions, journals or people.
Possible interpretation (STS context): Typifies that dimension’s insertion in the domains; for dates it corresponds to periods; for people it corresponds to profiles of participation in epistemic communities; and so forth.

Examples

Here are some example outputs of the methods presented below (because Sashimi’s interactive maps are constantly being improved, the maps in these examples are likely outdated):

Domain-topic map for a corpus of academic publications about “chloroquine”, produced during a Cortext workshop.

Domain-journals chained map for the same corpus, showing how journal clusters relate to the domains featured in the previous map.

Usage

Sashimi follows a slightly different workflow from most scripts in Cortext. Instead of a single “script”, the analysis is carried out by a group of methods that perform sequential steps. The main sequences are:

Prepare corpus → Domain model (domain-topic)

Domain model (domain-topic) → Domain map

Domain model (domain-topic) → Domain model (domain-chained)

Domain model (domain-chained) → Domain map
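
The sequences above can be read as a small dependency chain: each method needs the output of exactly one earlier method. The sketch below only illustrates that ordering. The step labels are shorthand chosen for this example (they are not identifiers used by Cortext), and the helper `steps_for` is hypothetical.

```python
# Each step mapped to the step whose output it requires (None = starting point).
# Labels are illustrative shorthand, not Cortext identifiers.
REQUIRES = {
    "Prepare corpus": None,
    "Domain model (domain-topic)": "Prepare corpus",
    "Domain map (domain-topic)": "Domain model (domain-topic)",
    "Domain model (domain-chained)": "Domain model (domain-topic)",
    "Domain map (domain-chained)": "Domain model (domain-chained)",
}


def steps_for(target):
    """Return the ordered list of steps to run in order to obtain `target`."""
    chain = []
    step = target
    while step is not None:
        chain.append(step)
        step = REQUIRES[step]
    return chain[::-1]  # earliest step first


if __name__ == "__main__":
    for name in steps_for("Domain map (domain-chained)"):
        print(name)
```

Read this way, a chained map always depends on a domain-topic model having been fitted first.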

1st step: Prepare Corpus

This is the obligatory first step. It takes your corpus database (produced by “Data Parser”) and generates a “tokenization” of the data.

Parameters

Text or Token source

Pick the fields you want to treat as the document’s terms. This is usually the one that contains your document’s text, but could also be titles, keywords or indexes produced by other methods. Select the field as a “text source” if it contains text that is not yet split into terms, such as abstracts, full text, messages and post contents. Select it as a “token source” if it already corresponds to lists of individual terms, such as keywords, categories or term indexes.
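
To make the distinction concrete, here is a minimal sketch of how the two kinds of source could be turned into a list of terms. It is not Sashimi’s tokenizer: the function `document_terms` and the simple word-splitting rule are assumptions made for illustration only.

```python
import re


def document_terms(value, source_kind):
    """Turn a field value into a list of terms (illustrative rule only)."""
    if source_kind == "text source":
        # Raw text: split it into word-like tokens.
        return re.findall(r"\w+", value)
    if source_kind == "token source":
        # Already a list of terms: keep each entry whole, even if it has spaces.
        return list(value)
    raise ValueError(f"unknown source kind: {source_kind!r}")


if __name__ == "__main__":
    print(document_terms("Chloroquine inhibits viral entry.", "text source"))
    print(document_terms(["malaria", "antiviral therapy"], "token source"))
```

Note how a multi-word keyword such as “antiviral therapy” stays a single term when the field is treated as a token source, whereas it would be split if the same content were treated as a text source.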

Build n-grams when tokenizing

When tokenizing, this indicates whether to seek composite terms that appear often in the corpus, up to a certain length. For example, if you want “São Paulo” to be treated as a single term, you’d have to choose at least a value of 2, whereas if you want the same for “Rio de Janeiro”, you’d have to pick 3. A text such as “Hagbard moved from Rio de Janeiro to São Paulo”, if the names of both cities appear often in your corpus, might then get tokenized as [“Hagbard”, “moved”, “from”, “Rio_de_Janeiro”, “to”, “São_Paulo”].
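
The effect of the maximum n-gram length can be sketched as follows. This is a simplified stand-in, not the procedure Sashimi actually uses: it counts contiguous n-grams over the corpus and greedily merges those that reach a frequency threshold, longest first. The function names and the `min_count` threshold are assumptions made for this example.

```python
from collections import Counter


def count_ngrams(docs, max_n):
    """Count every contiguous n-gram (2 <= n <= max_n) across all documents."""
    counts = Counter()
    for tokens in docs:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def merge_ngrams(tokens, counts, max_n, min_count):
    """Greedily join frequent n-grams into single tokens, trying longest first."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):
            gram = tuple(tokens[i:i + n])
            if len(gram) == n and counts[gram] >= min_count:
                out.append("_".join(gram))
                i += n
                break
        else:
            # No frequent n-gram starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    docs = [
        "Hagbard moved from Rio de Janeiro to São Paulo".split(),
        "Rio de Janeiro is humid".split(),
        "São Paulo is large".split(),
        "trains link Rio de Janeiro and São Paulo".split(),
    ]
    counts = count_ngrams(docs, max_n=3)
    print(merge_ngrams(docs[0], counts, max_n=3, min_count=3))
```

With `max_n=2` in this toy corpus, “São_Paulo” is still produced but “Rio_de_Janeiro” is not, which mirrors the choice of values described above.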

Transformation to apply to document’s tokenized text source

Choose “none” to account for multiple uses of a term in the same document. Choose “discard frequency” to only consider whether a term is used or not. This choice depends on your data, but choosing the latter may be useful to attenuate the influence of stylistic variations.
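
The difference between the two choices can be illustrated with a small sketch. The function `term_counts` is hypothetical and only mimics the described behaviour on a single document’s token list.

```python
from collections import Counter


def term_counts(tokens, transformation="none"):
    """Return per-term counts for one document under the chosen transformation."""
    counts = Counter(tokens)
    if transformation == "none":
        # Keep how many times each term is used in the document.
        return dict(counts)
    if transformation == "discard frequency":
        # Only record whether each term is used at all.
        return {term: 1 for term in counts}
    raise ValueError(f"unknown transformation: {transformation!r}")


if __name__ == "__main__":
    tokens = ["virus", "virus", "virus", "drug"]
    print(term_counts(tokens, "none"))
    print(term_counts(tokens, "discard frequency"))
```

Under “discard frequency”, a document that repeats a term many times weighs the same on that term as one that uses it once.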

2nd step: Domain-topic Model

This is the second and crucial stage of Sashimi. It fits the underlying generative model, finding clusters that correspond well to the data: documents get clustered in domains, and terms get clustered in topics. The method is Domain Model with the option domain-topic.

Fitting a model

Because fitting a complex statistical model is a computationally difficult task, this stage may take a long time to complete. For a corpus of a few thousand documents it may take up to an hour, but at a size of hundreds of thousands of documents it may take over a day, so be patient.

Moreover, since the results are stochastic in nature, it might be worth running the method twice or more if you’re not satisfied with them. As discussed below, we provide a clear criterion for preferring one run over another. Running again may be warranted, for example, if the structure of your data is subtle and a first run finds no clusters. Still, a single run should yield a good clustering in most cases.

Because Sashimi relies on well-grounded statistics, the quality of different clusterings is quantitatively comparable. With each run, the method outputs a file whose name states “entropy: 999.999”, where the number is the value of the model entropy with the inferred clusters. The comparison between runs is very simple: in general, the model fit with the lowest entropy should be preferred. That file’s contents also show the number of clusters at each level of the hierarchy.
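
If you have several runs, the comparison can be done by eye, or with a few lines of code such as the sketch below. It assumes only that each output file name contains the text “entropy: ” followed by a number, as described above. The run labels in the example are invented, and the helper names are hypothetical.

```python
import re

# Matches "entropy: <number>" anywhere in a file name.
ENTROPY_PATTERN = re.compile(r"entropy:\s*(-?\d+(?:\.\d+)?)")


def entropy_from_name(name):
    """Extract the entropy value stated in an output file name."""
    match = ENTROPY_PATTERN.search(name)
    if match is None:
        raise ValueError(f"no entropy value found in {name!r}")
    return float(match.group(1))


def best_run(names):
    """Return the file name whose stated entropy is lowest."""
    return min(names, key=entropy_from_name)


if __name__ == "__main__":
    # Invented names for three runs on the same prepared corpus.
    runs = [
        "run A entropy: 1520.75",
        "run B entropy: 1498.20",
        "run C entropy: 1503.00",
    ]
    print(best_run(runs))
```

Only compare entropies between runs on the same prepared corpus; values from different corpora or different preparations are not comparable in this way.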

Parameters

Prepared corpus

The prepared corpus to be modeled. Choices are a list of IDs of executions of the Prepare Corpus method on the chosen corpus.

3rd step: Domain-topic Maps

This step produces the interactive map used to navigate the clusters found in the model. Instructions for working with the map are contained in the resulting file, under the tab called “Help”. The method is Domain Maps with the option domain-topic.

Static frame of the domain-topic map from the example provided, showing the level-2 domain (selected on the left) where the “spike_protein” topic (selected on the right) is most relevant, and its strong co-occurrence there with the “coronavirus” topic (hover popup).

Parameters

Domain-topic model

The domain-topic model to be mapped. Choices are a list of IDs of executions of the Domain Model method, with the option domain-topic, on the chosen corpus.

Title column

Which field to display as the title of documents on the map.

Time column

Which field to attempt to treat as a temporal dimension in order to display the evolution of document clusters. May also work with non-temporal dimensions, giving a histogram.

URI column

If your corpus contains URIs for documents, a link can be shown with their titles, providing easy access to the original. Many datasets have a URL or URI field, or some other field that can be modified to produce a URL for it.

URI template

If there is no explicit URI field, you may build URIs using values from the field set as URI, by providing a template where {} will get replaced by the values. For example, if you have a DOI field, you can get a URI with the following template: https://doi.org/{}
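
The substitution itself is straightforward. The sketch below shows the idea using the DOI cited in the Bibliography; the function `build_uri` is hypothetical and stands in for whatever the map does internally.

```python
def build_uri(template, value):
    """Replace the {} placeholder in the template with the field value."""
    return template.replace("{}", str(value))


if __name__ == "__main__":
    print(build_uri("https://doi.org/{}", "10.1002/asi.24606"))
```

For a document whose DOI field holds 10.1002/asi.24606, the resulting link is https://doi.org/10.1002/asi.24606.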

You’re good to go!

If you got here, you’re ready to use the fundamental steps of Sashimi! We suggest you take a break to go ahead and try them.

When you come back, two further steps will allow you to extend the document-term clusters to metadata dimensions.

4th step: Chain dimension

Once you have inferred domains for your corpus’ documents, and if they contain metadata such as time, institutions, authors or geospatial data, you may be asking yourself: how do these other variables get distributed between domains? Do some institutions or years tend to get associated with similar domains, while others with different ones?

That question can be answered by chaining these dimensions in a new model, called a Domain-chained Model. In this new model, metadata values get clustered according to their association to domains.

The method is Domain Model with the option domain-chained. Since this method will fit a new model, what has been said of model fitting for the Domain-topic model, in the Fitting a model section, applies as well.
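
To get an intuition for what “association to domains” means, consider the table of how often each metadata value appears in each domain. The sketch below builds such a table. It is not the Domain-chained Model itself, which fits a statistical model rather than merely counting; the record layout and the field names `domain` and `journal` are assumptions made for this example.

```python
from collections import Counter, defaultdict


def domain_profiles(documents, dimension):
    """For each value of a metadata dimension, count its documents per domain."""
    profiles = defaultdict(Counter)
    for doc in documents:
        for value in doc[dimension]:
            profiles[value][doc["domain"]] += 1
    return dict(profiles)


if __name__ == "__main__":
    # Invented records: each document has an inferred domain and a list of journals.
    documents = [
        {"domain": "D1", "journal": ["Journal A"]},
        {"domain": "D1", "journal": ["Journal A"]},
        {"domain": "D2", "journal": ["Journal B"]},
        {"domain": "D1", "journal": ["Journal B"]},
    ]
    for journal, profile in domain_profiles(documents, "journal").items():
        print(journal, dict(profile))
```

Metadata values with similar profiles across domains are the ones one would expect to end up in the same metadata block.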

Parameters

Domain-topic model

The domain-topic model for which to model metadata clusters. Choices are a list of IDs of executions of the Domain Model method, with option domain-topic, on the chosen corpus.

Chained dimension

The dimension of the metadata you want to cluster: dates, years, journals, authors, keywords, categories etc.

5th step: Domain-chained Map

After chaining a metadata dimension to your domains, you will very likely want to navigate the inferred relationships between document and metadata clusters. You can produce a map for that, to be used in conjunction with the respective domain-topic map. The method is Domain Maps with the option domain-chained.

Static frame of the domain-journal chained map from the example provided, selecting the same level-2 domain as in the previous image. Journal clusters relevant to the selected domain are shown on the right. The strongest of those is selected, so on the left we see the domains publishing in it.

Parameters

As described in the Domain-topic map step, except the choices of model are a list of IDs of executions of the Domain Model method with option domain-chained.

Chopsticks

Inside each map you will find a “Help” tab explaining how to use it.

Bibliography

If you’re curious about the details or want a more thorough discussion on how to interpret the maps, you’re welcome to read the paper introducing the present methodology. It’s also the reference to be cited if you publish using this method:

Alexandre Hannud Abdo, Jean-Philippe Cointet, Pascale Bourret, Alberto Cambrosio (2021). Domain-topic models with chained dimensions: Charting an emergent domain of a major oncology conference. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24606