Hello,
Is it possible to perform topic modeling on a corpus that is not in English? That would require uploading a stopword list in another language.
Thanks!
Dear Dana,
Thanks for your question! The answer is: yes!
We have updated our documentation to describe what CorText Manager can do regarding stop-words and stemming for different languages with the topic modeling script.
I hope it helps,
Lionel
Hello,
This is fantastic news, thank you for this!
I’ve just tried to run a Topic Modeling script on my corpus and I get a repeated error message. I am not sure if I did something wrong.
2021-10-15 15:36:46 INFO : Script Topic Modelling Started
2021-10-15 15:36:46 INFO :
Data Description:
Fields:
– Message
Number of Topics – (0 for automatic search): '10'
Custom name for storing topics: ''
Maximum number of topics per document: '3'
Text Cleaning Parameters:
Lower Case: true
language: romanian
Stop-words Removal: true
Remove punctuation: true
Stemming: true
Minimum frequency of words: '5'
Maximum frequency of words (in percentage of the total corpus): '50'
LDA algorithm parameters:
Alpha: symmetric
Number of iterations for learning the model: '20'
2021-10-15 15:36:46 INFO :
Data Description:
Fields:
– Message
Number of Topics – (0 for automatic search): '10'
Custom name for storing topics: ''
Maximum number of topics per document: '3'
Text Cleaning Parameters:
Lower Case: true
language: romanian
Stop-words Removal: true
Remove punctuation: true
Stemming: true
Minimum frequency of words: '5'
Maximum frequency of words (in percentage of the total corpus): '50'
LDA algorithm parameters:
Alpha: symmetric
Number of iterations for learning the model: '20'
2021-10-15 15:36:46 INFO : Compiling data
Thanks for the feedback!
Could you try again?
L
Yes, it works now! But the stopword removal doesn’t seem to be working – is there any way I could dynamically remove words from the corpus after generating the first results? There are stopwords that are specific to each corpus and removing them helps with the topic generation, but this would mean uploading a modified stopword list at some point in the process.
Also, how can I perform further iterations on the same corpus, after the initial training?
Sorry for all the questions, but this tool has so much potential for my research!
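For anyone wanting to try the corpus-specific stopword idea outside CorText Manager, here is a minimal sketch in plain Python. It is not how CorText Manager works internally. The function name `remove_stopwords`, the toy documents, and the extra word "rt" are all hypothetical, chosen only to illustrate a two-pass workflow: filter with a base list, inspect the results, then filter again with additional corpus-specific words.

```python
def remove_stopwords(docs, base_stopwords, extra_stopwords=()):
    """Drop tokens found in the base list or in a corpus-specific extra list.

    docs is a list of tokenized documents (lists of strings).
    Matching is case-insensitive.
    """
    stop = {w.lower() for w in base_stopwords} | {w.lower() for w in extra_stopwords}
    return [[tok for tok in doc if tok.lower() not in stop] for doc in docs]


# Toy data (hypothetical): two already-tokenized messages.
docs = [
    ["este", "guvern", "vaccin", "rt"],
    ["rt", "vaccin", "care", "spital"],
]
base = ["este", "care"]  # stand-in for a generic language stopword list

# First pass: generic stopwords only.
first_pass = remove_stopwords(docs, base)

# Second pass: after looking at the first results, add corpus-specific words.
second_pass = remove_stopwords(docs, base, extra_stopwords=["rt"])
```

The cleaned token lists from the second pass could then be fed to whatever topic modeling tool is used for the next run.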
Have you checked the exact list of stop-words for your language? https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip
It contains only the most frequent words. Could you double check that they do not appear in the topic modeling results?
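A minimal sketch of that double check, assuming the stop-word list and the top terms of each topic have been copied into Python by hand. The function name `stopwords_in_topics` and the sample data are hypothetical. With NLTK installed and its stopwords corpus downloaded, the list could instead be loaded with `nltk.corpus.stopwords.words("romanian")`.

```python
def stopwords_in_topics(topics, stopwords):
    """Return, per topic, the terms that also occur in the stop-word list.

    topics maps a topic id to its list of top terms.
    Matching is case-insensitive; topics without hits are omitted.
    """
    stop = {w.lower() for w in stopwords}
    found = {}
    for topic_id, terms in topics.items():
        hits = [t for t in terms if t.lower() in stop]
        if hits:
            found[topic_id] = hits
    return found


# Hypothetical sample: a short stop-word list and two topics.
stopword_list = ["este", "care", "pentru"]
topics = {
    0: ["vaccin", "este", "spital"],
    1: ["guvern", "lege"],
}

leftovers = stopwords_in_topics(topics, stopword_list)
print(leftovers)
```

If stemming is enabled, the terms in the results may be stems rather than the surface forms found in the list, so an exact-match comparison like this one may miss or mislabel some words.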
I have checked, yes. There are words on the stopword list that still appear in my results.
Dear Dana,
Could you please add me to your project, using the address lionel dot villard at esiee dot fr?
I will check and then leave the project afterwards.
L
Done! Thank you for all your help!