SASHIMI lets you analyze a corpus by simultaneously clustering documents and terms, and optionally extending this structure to any categorical metadata dimension.
It does so in a parsimonious and hierarchical fashion, based on robust, state-of-the-art statistical models that avoid displaying statistically insignificant structures and provide information at different scales of your data.
In addition to modeling documents, terms and metadata together, SASHIMI will cluster the entirety of the data, without the need to discard or filter elements based on preconceived notions of importance. This opens the door to studying both central and marginal issues, together and in as much detail as the data allows.
Such large amounts of multi-dimensional information call for an entirely new approach to corpus data visualization. SASHIMI provides a new style of interactive maps that lets you seamlessly navigate your corpus’ cluster hierarchy and understand the detailed composition of each cluster, as well as its role in the bigger picture and its relatedness to others.
SASHIMI follows a different workflow than most scripts in Cortext. Instead of a single “script”, the analysis is carried out by a group of methods that perform its sequential steps.
This is the obligatory first step. It will take your corpus database (produced by “Data Parser”) and generate a “tokenization” of the data, while detecting n-grams.
The only parameters here are Text Source, where you pick the column that contains your document text, and Build n-grams when tokenizing, which determines whether composite terms that appear often in the corpus should be detected and treated as single terms, and up to what length. For example, if you want “São Paulo” to be treated as a single term, you’d have to choose a value of at least 2, whereas if you want the same for “Rio de Janeiro”, you’d have to pick 3. In that case, the text “Hagbard moved from Rio de Janeiro to São Paulo” would likely be tokenized as [“Hagbard”, “moved”, “from”, “Rio de Janeiro”, “to”, “São Paulo”], provided the names of both cities appear often in your corpus.
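To make the idea concrete, here is a toy sketch of frequency-based n-gram detection and greedy tokenization. It is only an illustration of the principle, not SASHIMI’s actual detection criterion; the function names and the minimum-count threshold are invented for this example.

```python
from collections import Counter

def detect_ngrams(docs, max_n=3, min_count=2):
    """Count all n-grams up to max_n across a tokenized corpus and keep
    the frequent ones. A toy stand-in for a real detection statistic."""
    counts = Counter()
    for tokens in docs:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {ng for ng, c in counts.items() if c >= min_count}

def tokenize(text, ngrams, max_n=3):
    """Greedily merge detected n-grams, longest first."""
    tokens = text.split()
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):
            if tuple(tokens[i:i + n]) in ngrams:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

corpus = [
    "Hagbard moved from Rio de Janeiro to São Paulo".split(),
    "flights between Rio de Janeiro and São Paulo".split(),
]
ngrams = detect_ngrams(corpus, max_n=3)
print(tokenize("Hagbard moved from Rio de Janeiro to São Paulo", ngrams))
# → ['Hagbard', 'moved', 'from', 'Rio de Janeiro', 'to', 'São Paulo']
```

With max_n=2, “São Paulo” would still be merged but “Rio de Janeiro” would only be merged pairwise, which is why the choice of n-gram length matters.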
This is the second and crucial stage of SASHIMI. It will fit the underlying generative model, finding clusters that correspond well to the data: documents get clustered in domains, and terms get clustered in topics.
Because fitting a complex statistical model is a computationally difficult task, this stage may take a long time to complete. For a corpus of a few thousand documents it may take up to an hour, but at a size of hundreds of thousands of documents it may take over a day, so be patient.
The results are also stochastic in nature, so it may be worth running the method two or more times if you’re not satisfied with the results. For example, if the structure of your data is subtle, it may happen that in some runs no clusters are found, because the method will reject statistically insignificant structures.
However, a single run should yield a good enough clustering for most cases. And because SASHIMI relies on well-grounded statistics, the quality of clusters is quantitatively comparable: with each run the method will output a file named like “entropy: 999.999.txt”, where the number in the name is the model entropy of the inferred clusters, and the file’s contents display the number of clusters at each level of the hierarchy. In general, the model fit with the lowest entropy should be preferred.
This method takes only two parameters. The first, Prepared corpus, lets you choose which tokenization from the previous step to build upon, in case you ran it more than once. The second, Transformation to apply to document’s tokenized text source, lets you choose whether to analyze a document’s tokens as they are, or to treat them as a set, thus ignoring how frequently terms appear. This is useful when you’re more concerned with whether a term appears in a document than with how often it is employed. It may, for example, attenuate stylistic variations.
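The difference between the two transformations can be seen on a toy document:

```python
# A toy tokenized document with a repeated term.
tokens = ["cancer", "cell", "cell", "growth", "cell"]

# "As they are": term frequencies are kept, so "cell" counts three times.
as_is = tokens

# "As a set": only presence or absence matters, attenuating repetition.
as_set = sorted(set(tokens))
print(as_set)  # → ['cancer', 'cell', 'growth']
```

Under the set transformation, a document that mentions “cell” once and one that mentions it thirty times contribute identically to the model.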
This step produces the interactive map used to navigate the clusters found in the model. It takes the following parameters:
Domain-topic model: which Domain-topic model from the previous step to build upon, in case you ran it more than once.
Title column: a field from your corpus to display as document title in the visualization.
Time column: a field from your corpus to treat as a temporal dimension, in order to display the evolution of document clusters.
URI column: if your corpus contains URIs for documents, a link can be shown alongside their titles, providing easy access to the originals.
URI template: you may also build URIs using values from the column set as URI column, by providing a template to be filled with them.
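The idea of template filling can be illustrated as follows; the placeholder syntax and the DOI-based template here are hypothetical, and the exact template format expected by the method may differ:

```python
# Hypothetical template: build a resolvable link from a DOI stored
# in the column chosen as URI column.
template = "https://doi.org/{}"
uri_column_value = "10.1000/xyz123"

uri = template.format(uri_column_value)
print(uri)  # → https://doi.org/10.1000/xyz123
```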
Good to go!
If you got here, you’re ready to use the three fundamental steps of SASHIMI!
Two further steps allow you to extend the document-term clusters to metadata dimensions.
Once you have inferred domains for your corpus’ documents, and if they contain metadata such as time, institutions, authors or geospatial data, you may be asking yourself: how do these other variables get distributed between domains? Do some institutions or years tend to get associated with similar domains, while others with different ones?
That question can be answered by chaining these dimensions in a new model, called a Domain-chained model. In this new model, metadata values get clustered according to their association with domains.
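The raw signal such a model works from can be pictured as a contingency table between domains and metadata values. The toy counting below only illustrates that association; the actual clustering of metadata values is performed by the statistical model, not shown here:

```python
from collections import Counter

# Toy data: each document carries an inferred domain and a metadata
# value (here, an institution). Institutions whose documents are spread
# similarly across domains are candidates to be clustered together.
docs = [
    ("domain_1", "Inst A"), ("domain_1", "Inst A"), ("domain_2", "Inst A"),
    ("domain_2", "Inst B"), ("domain_2", "Inst B"),
    ("domain_1", "Inst C"),
]
table = Counter(docs)
for inst in ["Inst A", "Inst B", "Inst C"]:
    row = [table[(domain, inst)] for domain in ["domain_1", "domain_2"]]
    print(inst, row)
```

Here Inst A and Inst C both lean toward domain_1, while Inst B is associated only with domain_2; a Domain-chained model formalizes and tests such groupings instead of eyeballing them.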
This method takes two parameters: the Domain-topic model to chain the metadata to, and the Chained dimension, that is, the metadata dimension to cluster.
Since this will produce a new model fit, what was said about model fits for the Domain-topic model method applies equally here.
After chaining a metadata dimension to your domains, you will very likely want to navigate the inferred relationships between document and metadata clusters. This method produces the map for that, to be used in conjunction with the respective domain-topic map. It takes much the same parameters as the Domain-topic map method, except that its first input is a Chained model rather than a Domain-topic model.
In order to properly consume SASHIMI, one is called to master the art of handling hashi. Fortunately, inside each and every map generated you will find a “Help” tab, where all the details of the visualization are explained. Still, if you’re curious or want a more thorough discussion on how to interpret these maps, you can read a pre-print of the paper introducing the methodology: Domain-topic models with chained dimensions: charting the evolution of a major oncology conference (1995-2017).