Sashimi text preprocessing not working properly

Category: SASHIMI
Vincent Delbar asked 2 years ago

Hello,

I'm currently trying to use the SASHIMI module and I'm getting undesired words in the model after the text-processing step.
Stop words, verbs and other small tokens appear at the top of the domain-topic map.
See for yourself here.
I can't see any setting I may have missed. Any ideas?
I was ready to open an issue on GitHub or something, but I can't find any repository related to this.

1 Answer
aleabdo Staff answered 2 years ago

Ni! Hello Vincent,
As you've noticed, Sashimi doesn't include language-specific preprocessing steps. That's because stopwords get clustered together thanks to their similar distribution across documents. There's thus no need to remove them; instead, take care to interpret the maps accordingly. What you get is actually different patterns of stopword-like behavior in a set of stopword-like topics, composed of stopwords but also of other terms that behave like stopwords in this corpus.
In your map, most stopword-like terms get clustered in L1T25. This topic contains words that are frequent in every domain of the corpus. Compare that with L1T0, which also contains very generic words, but tends towards very rare ones. L1T0 is still quite evenly distributed, but we can see it has a slightly stronger preference for some domains: those that employ quite specific and rare vocabulary.
Stopwords may also appear in topics that don't seem stopword-like, which means that in this corpus they're not behaving so much as stopwords: they're statistically differentiating factors for some domains. For example, L1T18 has the a priori stopwords "was" and "were". If you look at the domains where this topic is most present, you'll see they are domains characterized by "dynamics and temporal patterns" that likely call for their use. You can also note that the distinction is very weak, with a maximum factor of "1.7", so still very stopword-like (L1T25 has a max factor of "1.3"), while more substantial topics will have higher factors.
Sashimi is thus telling you that many more words than just the a priori stopwords are not very informative, while also telling you in which different ways they are informative, even if only weakly.
Of course, these stopword-like topics are, as a consequence, less meaningful and not the ones that will be critical to your analysis; in a way, they are things you should ignore unless you see good reason not to. In the maps this becomes clearer once you pick a domain: stopword-like topics will tend to be less significant for a given domain, even if they're frequent in the corpus.
Finding a more ergonomic way to convey this is on the agenda, but it's not so evident 😉 because we actually want this information to be available, since it's all statistically meaningful (in the sense of the underlying Bayesian network modeling framework). But I agree that on the topic map it is confusing that, at the highest level, only stopwords show up among the frequent terms. That is, of course, a consequence of showing frequent terms; maybe something else should be shown. The same goes for the topic color intensity when no domain is selected.
Now, of course, you could perform preprocessing before sending data to Sashimi. For example, topic L1T28 is clearly an artifact of articles from Taylor and Francis, which Sashimi detected and which is influencing the construction of domains. Ideally it should be possible to tell Sashimi to ignore a selection of topics (artifacts, stopwords) and replay the modeling to produce new domains and topics without those terms. That is part of a larger program we have in mind, awaiting time and/or resources.
I hope my answer helps you interpret the maps and continue working with your data. In practical terms: the one thing you do need to worry about is removing the Taylor and Francis signature from your documents before modeling, as it's clearly influencing the results. You can do that outside of CorText with a number of tools; inside CorText, maybe the Data Curation script can do it for you, I'm not entirely sure.
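If you go the route of cleaning the corpus yourself, here is a minimal sketch of what that could look like. This is not a CorText feature; the file names, the column name "Abstract" and the signature pattern are assumptions you'd need to adapt to your actual export.

# Minimal sketch (not part of CorText): strip a trailing publisher
# signature from the abstract field of a CSV corpus before re-uploading it.
# The column name "Abstract" and the regex below are assumptions; adapt
# both to whatever your export actually contains.
import csv
import re

# Hypothetical pattern for a copyright/signature line appended by the publisher.
SIGNATURE = re.compile(r"\s*©\s*\d{4}.*Taylor\s*&\s*Francis.*$", re.IGNORECASE)

with open("corpus.csv", newline="", encoding="utf-8") as src, \
     open("corpus_clean.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Drop the signature only when it appears at the end of the abstract.
        row["Abstract"] = SIGNATURE.sub("", row["Abstract"]).strip()
        writer.writerow(row)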
I remain at your disposal for further clarifications,

Vincent Delbar replied 2 years ago

Wow, what a quick and exhaustive reply!
Okay, maybe I'll consider processing the text myself with Python and re-uploading the corpus DB into CorText.
At least to remove the signatures at the end of abstracts, and maybe remap some terms…
But I'm not sure we really need those SASHIMI results. They are quite hard to interpret for the rest of the team, and even for me.

Thank you very much for the explanation!

aleabdo Staff replied 2 years ago

Ni! No trouble. At least Sashimi told you that the publisher's signature was influencing the analysis (something it likely also does to other analyses, which won't tell you that). 😉

Sashimi is definitely a different kind of corpus analysis instrument. The first way in which it can be useful is, of course, that it not only gives you topics, which documents can be seen as composed of, but also tells you which groups of documents combine topics similarly (one could say a kind of "epistemic communities"), and how they do so. And it strives to provide ways to explore the material evidence for that (documents, topics, terms, temporal evolution) at different levels of detail, from the whole corpus down to the individual document.

One thing that you may find interesting: it seems you got your corpus from Scopus, so it likely has a DOI field. Try passing this to the URL parameter in the domain-topic map:

https://doi.org/{}

This should link the document titles that appear in the “Info” tabs directly to their web presence.
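For instance, a record whose DOI field read something like 10.1080/0000000.2021.0000000 (a made-up value, just to illustrate the substitution of the {} placeholder) would then be linked as https://doi.org/10.1080/0000000.2021.0000000.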

Best!