Problems with parsing a .txt corpus

JonD asked 4 years ago

Hello,
I can't get topic modeling to run on .txt files.
I always get the same error message, even with different analysis settings.
What is strange is that I tried converting data from Excel, which worked perfectly for topic modeling, but once it is in .txt the analysis fails.
On the other hand, the other analyses run without any problem on my .txt files.

Here is the error message. If anyone has an idea, I would welcome it 😉
Thank you very much.
Jonathan
2021-02-16 16:30:14 INFO : Script Topic Modelling Started
2021-02-16 16:30:14 INFO : Data Description:
    Fields: - text
    Number of Topics - (0 for automatic search): '0'
    Minimum number of topics: '10'
    Maximum number of topics: '40'
    Steps: '10'
    Custom name for storing topics: ''
    Maximum number of topics per document: '3'
    Text Cleaning Parameters:
    Lower Case: true
    language: french
    Stop-words Removal: true
    Remove punctuation: true
    Stemming: true
    Minimum frequency of words: '2'
    Maximum frequency of words (in percentage of the total corpus): '50'
    LDA algorithm parameters:
    Alpha: symmetric
    Number of iterations for learning the model: '20'
2021-02-16 16:30:14 INFO : Data Description:
    Fields: - text
    Number of Topics - (0 for automatic search): '0'
    Minimum number of topics: '10'
    Maximum number of topics: '40'
    Steps: '10'
    Custom name for storing topics: ''
    Maximum number of topics per document: '3'
    Text Cleaning Parameters:
    Lower Case: true
    language: french
    Stop-words Removal: true
    Remove punctuation: true
    Stemming: true
    Minimum frequency of words: '2'
    Maximum frequency of words (in percentage of the total corpus): '50'
    LDA algorithm parameters:
    Alpha: symmetric
    Number of iterations for learning the model: '20'
2021-02-16 16:30:14 INFO : Compiling data
2021-02-16 16:30:15 INFO : Applying linguistic filters
2021-02-16 16:30:16 INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-02-16 16:30:16 INFO : built Dictionary(2133 unique tokens: [u'repondr', u'l\xe9gitim', u'hommefemm', u'd\xe9j\xe0', u'asiat']…) from 1 documents (total 8514 corpus positions)
2021-02-16 16:30:16 DEBUG : rebuilding dictionary, shrinking gaps
2021-02-16 16:30:16 INFO : discarding 0 tokens: []…
2021-02-16 16:30:16 INFO : keeping 0 tokens which were in no less than 5 and no more than 0 (=50.0%) documents
2021-02-16 16:30:16 DEBUG : rebuilding dictionary, shrinking gaps
2021-02-16 16:30:16 INFO : resulting dictionary: Dictionary(0 unique tokens: [])
2021-02-16 16:30:16 DEBUG : rebuilding dictionary, shrinking gaps
2021-02-16 16:30:16 INFO : Computing topics using gensim
2021-02-16 16:30:16 INFO : Testing with 10 topics.

Lionel Staff replied 4 years ago

Dear Jonathan,

Could you please add me to your project? lionel dot villard at esiee dot fr
I will leave your project just after.

Best regards
Lionel

2 Answers

Lionel Staff answered 4 years ago

See my comment!

JonD replied 4 years ago

Hi Lionel, I've just added you to my project.
Thank you for your help.

Jon

Lionel Staff answered 4 years ago

Dear Jon,
Thanks for the invitation. At first glance, your dataset (8 documents) seems a little small for the default parameters, so the list of selected tokens ends up empty. To avoid that, you may want to adjust:

minimum frequency of words: in number of occurrences
maximum frequency of words: in % of the total (e.g. increase it to 80 or 90, for 80% or 90% of the documents)

But, still, 8 documents is small.
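
To illustrate why the token list can end up empty, here is a minimal sketch in plain Python. It assumes the filtering behaves like a document-frequency filter with a lower and an upper threshold, as the log line "keeping 0 tokens which were in no less than 5 and no more than 0 (=50.0%) documents" suggests. The function name `filter_tokens`, the parameter names `no_below` and `no_above`, and the toy corpus are all hypothetical; how Cortext maps its form fields onto such thresholds is not shown in this thread.

```python
from collections import Counter


def filter_tokens(docs, no_below, no_above):
    """Return tokens whose document frequency lies within both thresholds.

    Hypothetical helper for illustration only, not Cortext code.
    """
    num_docs = len(docs)
    # The percentage is converted to an absolute document count and truncated,
    # so 50% of 1 document becomes 0 documents.
    max_docs = int(no_above * num_docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)


# Toy corpus of 8 documents (made-up tokens, not the data from this thread):
# "topic" is in 8 documents, "model" in 6, "corpus" in 3.
docs = [["topic", "model", "corpus"]] * 3 + [["topic", "model"]] * 3 + [["topic"]] * 2

strict = filter_tokens(docs, no_below=5, no_above=0.5)   # at least 5 docs, at most 4
relaxed = filter_tokens(docs, no_below=2, no_above=0.9)  # at least 2 docs, at most 7
single = filter_tokens(docs[:1], no_below=5, no_above=0.5)  # one document only
print(strict, relaxed, single)
```

With 8 documents, a minimum of 5 and a maximum of 50% (4 documents) cannot both be satisfied, so nothing survives. Lowering the minimum and raising the maximum leaves a usable vocabulary.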
I hope it helps.
Lionel
