SASHIMI

Intro

SASHIMI lets you analyze a corpus by simultaneously clustering documents and terms, and optionally extending this structure to any categorical metadata dimension.

It does so in a parsimonious and hierarchical fashion, based on robust, state-of-the-art statistical models that avoid displaying statistically insignificant structures and provide information at different scales of your data.

In addition to modeling documents, terms and metadata together, SASHIMI will cluster the entirety of the data, without the need to discard or filter elements based on preconceived notions of importance. This opens the door to study both central and marginal issues, together and in as much detail as the data allows.

Such large amounts of multi-dimensional information call for an entirely new approach to corpus data visualization. SASHIMI provides a new style of interactive maps that lets you seamlessly navigate your corpus’ cluster hierarchy and understand the detailed composition of each cluster, as well as its role in the bigger picture and its relatedness to other clusters.

Usage

SASHIMI follows a different workflow than most scripts in Cortext. Instead of a single “script”, the analysis is carried out by a group of methods that perform its sequential steps.

Prepare Corpus

This is the obligatory first step. It takes your corpus database (produced by “Data Parser”) and generates a “tokenization” of the data, optionally detecting n-grams.

The only parameters here are Text Source, where you pick the column that contains your document text, and Build n-grams when tokenizing, which determines whether composite terms that appear often in the corpus should be detected and treated as single terms, and up to what length. For example, if you want “São Paulo” to be treated as a single term, you must choose a value of at least 2, whereas for “Rio de Janeiro” you must pick at least 3. In that case, the text “Hagbard moved from Rio de Janeiro to São Paulo” would likely be tokenized as [“Hagbard”, “moved”, “from”, “Rio de Janeiro”, “to”, “São Paulo”], provided the names of both cities appear often in your corpus.
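To make the idea concrete, here is a minimal sketch, in plain Python, of the general technique behind n-gram detection: merging adjacent tokens that co-occur frequently into composite terms. This is an illustration, not SASHIMI’s actual tokenizer; the function name and the min_count threshold are hypothetical.

```python
from collections import Counter

def merge_frequent_bigrams(docs, min_count=5):
    """Merge adjacent token pairs occurring at least `min_count` times
    across the corpus into a single composite term. Applying the
    function again merges longer composites: a first pass may yield
    "Rio de", and a second pass "Rio de Janeiro"."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + " " + doc[i + 1])
                i += 2
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs
```

Each additional pass allows longer composites to form, which is what the length parameter controls.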

Domain Topic Model

This is the second and crucial stage of SASHIMI: it fits the underlying generative model, finding clusters that correspond well to the data.

Because fitting a complex statistical model is a computationally difficult task, this stage may take a long time to complete. For a corpus of a few thousand documents it may take up to an hour, while for hundreds of thousands of documents it may take over a day, so be patient.

The results are also stochastic in nature, so it may be worth running the method two or more times if you’re not satisfied with the results. For example, if the structure of your data is subtle, it may happen that in some runs no clusters are found, because the method rejects statistically insignificant structures.

However, a single run should yield a good enough clustering for most cases, and because SASHIMI relies on well-grounded statistics, the quality of different fits is quantitatively comparable: the method outputs a file named like “entropy: 999.999.txt”, which tells you the model’s entropy given the inferred clusters. In general, the fit with the lowest entropy should be preferred. That file’s contents also display the number of clusters at each level of the hierarchy.
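If you do run the model several times, picking the best fit can be automated. The sketch below is hypothetical (it assumes each run’s results sit in their own folder containing the “entropy: 999.999.txt”-style file described above), but it shows the gist: parse the entropy from the file name and keep the run that minimizes it.

```python
import re
from pathlib import Path

def best_run(run_dirs):
    """Return the run directory whose “entropy: <value>.txt” marker
    file reports the lowest entropy (assumes one such file per run)."""
    def entropy_of(run_dir):
        (marker,) = Path(run_dir).glob("entropy:*.txt")
        return float(re.search(r"entropy:\s*(\d+\.\d+)", marker.name).group(1))
    return min(run_dirs, key=entropy_of)

# Example (hypothetical folder names):
# print(best_run(["model_run_1", "model_run_2", "model_run_3"]))
```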

This method takes only two parameters. The first, Prepared corpus, lets you choose which tokenization from the previous step to build upon, in case you ran it more than once. The second, Transformation to apply to document’s tokenized text source, lets you choose whether to analyze a document’s tokens as they are, or to treat them as a set, thus ignoring how often terms appear. This is useful when you’re more concerned with whether a term appears in a document than with how often it is employed. It may, for example, attenuate stylistic variations.
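As a toy illustration (not SASHIMI code), the set transformation amounts to the following, with a made-up token list:

```python
tokens = ["cell", "cell", "tumor", "cell", "growth", "tumor"]

# As they are: term frequencies are kept ("cell" counts three times).
bag = tokens

# As a set: each term counts once, however often it appears.
unique = sorted(set(tokens))
print(unique)  # ['cell', 'growth', 'tumor']
```

Two documents that use the same vocabulary at different rates become identical under the set transformation, which is why it can attenuate stylistic variation.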

Domain Topic Map

This step produces the interactive map used to navigate the clusters found in the model. It takes three parameters:

Domain-topic model: which Domain Topic Model from the previous step to build upon, in case you ran it more than once.

Title column: a field from your corpus to display as document title in the visualization.

Time column: a field from your corpus that SASHIMI will attempt to treat as a temporal dimension, in order to display the evolution of document clusters.

Good to go!

If you got here, you’re ready to use the three fundamental steps of SASHIMI!

Two further steps allow you to extend the document-term clusters to metadata dimensions such as time, institutions, authors or geospatial data. These are currently under maintenance due to recent improvements in the underlying network analysis library, to which our methods must be adapted. This page will be updated as soon as they’re back on-line. (:

Chopsticks

In order to properly consume SASHIMI, one is called to master the art of handling hashi. Fortunately, inside each and every map generated you will find a “Help” tab, where all the details of the visualization are explained. Still, if you’re curious or want a more thorough discussion of how to interpret these maps, you can read a pre-print of the paper introducing the methodology: Domain-topic models with chained dimensions: charting the evolution of a major oncology conference (1995-2017).