Topic modeling in languages other than English

CorText Manager Q&A forum, category: Topic modeling
Dana Sultanescu asked 10 months ago

Hello,
Is it possible to perform topic modeling on a corpus that is not in English? That would require uploading a stopword list in another language.
Thanks!

4 Answers
Lionel Staff answered 10 months ago

Dear Dana,
Thanks for your question! The answer is: yes!
We have updated our documentation to describe what CorText Manager can do with stop-words and stemming for different languages in the topic modeling script.
I hope it helps,
Lionel

Dana Sultanescu replied 10 months ago

Hello,
This is fantastic news, thank you for this!
I’ve just tried to run the Topic Modeling script on my corpus and I get a repeated error message; I’m not sure if I did something wrong.

2021-10-15 15:36:46 INFO : Script Topic Modelling Started
2021-10-15 15:36:46 INFO :
Data Description:
Fields:
– Message
Number of Topics – (0 for automatic search): '10'
Custom name for storing topics: ''
Maximum number of topics per document: '3'
Text Cleaning Parameters:
Lower Case: true
language: romanian
Stop-words Removal: true
Remove punctuation: true
Stemming: true
Minimum frequency of words: '5'
Maximum frequency of words (in percentage of the total corpus): '50'
LDA algorithm parameters:
Alpha: symmetric
Number of iterations for learning the model: '20'

2021-10-15 15:36:46 INFO :
Data Description:
Fields:
– Message
Number of Topics – (0 for automatic search): '10'
Custom name for storing topics: ''
Maximum number of topics per document: '3'
Text Cleaning Parameters:
Lower Case: true
language: romanian
Stop-words Removal: true
Remove punctuation: true
Stemming: true
Minimum frequency of words: '5'
Maximum frequency of words (in percentage of the total corpus): '50'
LDA algorithm parameters:
Alpha: symmetric
Number of iterations for learning the model: '20'

2021-10-15 15:36:46 INFO : Compiling data
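
For readers who want to see what the text cleaning parameters in the log above amount to, here is a minimal sketch in plain Python. It is not CorText Manager's code. The function name `clean_corpus` is hypothetical, stemming is left out, and two readings of the log are assumptions: "minimum frequency" is taken as a total count over the corpus, and "maximum frequency" as the share of documents that contain the word.

```python
import re
from collections import Counter


def clean_corpus(docs, stop_words, min_freq=5, max_doc_pct=50):
    """Lower-case, strip punctuation, drop stop-words, then filter by frequency.

    Assumptions (not taken from CorText): min_freq is a total count over the
    corpus, max_doc_pct is the percentage of documents containing the word.
    """
    stop = {w.lower() for w in stop_words}
    # \w+ is Unicode-aware in Python 3, so letters with diacritics are kept
    # while punctuation is dropped.
    tokenized = [
        [t for t in re.findall(r"\w+", doc.lower()) if t not in stop]
        for doc in docs
    ]
    total = Counter(t for toks in tokenized for t in toks)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    max_docs = len(docs) * max_doc_pct / 100
    keep = {t for t in total if total[t] >= min_freq and doc_freq[t] <= max_docs}
    return [[t for t in toks if t in keep] for toks in tokenized]
```

Calling `clean_corpus(messages, stop_words, min_freq=5, max_doc_pct=50)` mirrors the values shown in the log, where `messages` would be the list of texts from the Message field.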

Lionel Staff answered 10 months ago

Thanks for the feedback!
Could you try again?
L

Dana Sultanescu replied 10 months ago

Yes, it works now! But the stopword removal doesn’t seem to be working – is there any way I could dynamically remove words from the corpus after generating the first results? There are stopwords that are specific to each corpus and removing them helps with the topic generation, but this would mean uploading a modified stopword list at some point in the process.
Also, how can I perform further iterations on the same corpus, after the initial training?
Sorry for all the questions, but this tool has so much potential for my research!
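
A corpus-specific stopword list like the one described above can be prepared outside the tool by merging a base list with the extra words noticed in a first run. The sketch below only builds the merged list. The name `merge_stop_words` is hypothetical, and whether and in what format CorText Manager accepts such a list is a matter for its documentation, not something this thread states.

```python
def merge_stop_words(base_words, corpus_specific):
    """Return a sorted, de-duplicated, lower-cased stop-word list."""
    merged = {
        w.strip().lower()
        for w in list(base_words) + list(corpus_specific)
        if w.strip()  # skip empty lines
    }
    return sorted(merged)


def to_file_text(words):
    """One word per line, a common plain-text layout for stop-word lists."""
    return "\n".join(words) + "\n"
```

After each run, the words that clutter the topics are appended to `corpus_specific` and the list is regenerated before the next run.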

Lionel Staff answered 10 months ago

Have you checked the exact list of stop-words for your language? https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip
It contains only the most frequent words. Could you double-check that they do not appear in the topic modeling results?
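
The check suggested above can be done by hand, or with a few lines of Python once the stop-word list and the top words of each topic are available as plain lists. This is a minimal sketch: `normalize` and `stopwords_in_topics` are hypothetical helper names, and the `topics` dictionary (topic id mapped to its top words) is an assumed layout, not an export format of CorText Manager.

```python
import unicodedata


def normalize(word):
    """NFC-normalize and case-fold so visually identical words compare equal."""
    return unicodedata.normalize("NFC", word).casefold().strip()


def stopwords_in_topics(stop_words, topics):
    """Map each topic id to the sorted stop-words found among its words."""
    stop = {normalize(w) for w in stop_words}
    return {
        topic_id: sorted(w for w in map(normalize, words) if w in stop)
        for topic_id, words in topics.items()
    }
```

An empty list for every topic means none of the listed stop-words survived the cleaning step; any other result names the words to investigate.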

Dana Sultanescu replied 10 months ago

I have checked, yes. There are words on the stopword list that still appear in my results.

Lionel Staff replied 10 months ago

Dear Dana,

Could you please add me to your project, using the address lionel dot villard at esiee dot fr?
I will check and leave the project afterwards.
L

Dana Sultanescu replied 10 months ago

Done! Thank you for all your help!

Lionel Staff answered 10 months ago

Dear Dana,
Apparently, the problem was more about how the data are formatted. Please check in your project!
I hope it helps
L

Dana Sultanescu replied 10 months ago

Do you mean UTF-8 instead of UTF-16?
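
The thread does not confirm that encoding was the cause. If a corpus file does turn out to be stored as UTF-16, re-encoding it to UTF-8 before upload takes only a few lines of Python. The function names below are hypothetical, and the sketch assumes the source file really is UTF-16.

```python
from pathlib import Path


def utf16_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 bytes (a byte-order mark is honored) and re-encode as UTF-8."""
    return data.decode("utf-16").encode("utf-8")


def convert_file(src, dst):
    """Write a UTF-8 copy of the UTF-16 file at src to dst."""
    Path(dst).write_bytes(utf16_to_utf8(Path(src).read_bytes()))
```

If the file is not valid UTF-16, `decode` will usually raise a `UnicodeDecodeError`, which is a useful signal that the guess about the encoding was wrong.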

Lionel Staff replied 10 months ago

Everything is in your project 🙂

Dana Sultanescu replied 10 months ago

OK, thank you. Could you help me read/interpret the convergence likelihood graph?