Link terms extraction with corpus

CorText Manager Q&A forum | Category: Text processing
tom.jo asked 5 years ago

Hello everyone!
I would like to link the list produced by the terms extraction script to the initial corpus, and to match this list against the same column of the “old” database. Basically, I need to clean up the corpus and get rid of all kinds of useless terms like “and”, “they” and so on, but I have no idea how to do so.

In the end I would like to run a network analysis script on the database, but since it is composed of continuous text I first need to get rid of conjunctions and the like.
Thank you in advance! 

Jean-Philippe Cointet Staff replied 5 years ago

I’m not sure I understood your problem correctly.
Term extraction automatically produces a new variable containing the index of the extracted words. Just choose the field “Terms” in the network mapping parameter form.
Is that what you were wondering about?
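
For readers who want to see what this kind of clean-up amounts to outside CorText Manager, here is a minimal Python sketch of stop-word filtering. It is not how the term extraction script works internally; the function name `remove_stop_words` and the tiny `STOP_WORDS` list are hypothetical and only serve as an illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration);
# a real list would contain many more entries.
STOP_WORDS = {"and", "they", "the", "a", "an", "of", "to", "in", "or", "but"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercase word tokens of `text` that are not stop words."""
    # Lowercase the text, then keep runs of letters (with an optional
    # apostrophe part, e.g. "don't") as tokens.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [token for token in tokens if token not in stop_words]


cleaned = remove_stop_words("They study networks and the evolution of science")
print(cleaned)
```

Within CorText Manager itself, the extracted “Terms” field mentioned above already plays this role, so a step like this would only matter when preparing texts by hand.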

matias.milia replied 5 years ago

I’m sorry, I will add a comment in English. It will help me be clearer.
I think I am in the same spot here. I have run the script on a big database (265k entries; yes, that’s a lot!), only asking it to show four variables. I had done this in the past and it worked. Now I have made various attempts to do it again, but it won’t work. Could it be related to the variables I am asking for (document title, abstract, country and thematic cluster)? Is there something else I can or should take into account? The script runs OK, but when I want to visualize the results, it stays at ‘processing’ forever.

Jean-Philippe Cointet Staff replied 5 years ago

Which script are you talking about, corpus explorer or term extraction?

matias.milia replied 5 years ago

Sorry, I was referring to ‘Corpus explorer’. I can’t visualize the results; I managed to download them, but they are more difficult to navigate from the ‘deep.txt’.

matias.milia replied 5 years ago

Old corpus explorations seem to work fine; the problem is with the new ones, meaning results from the last week or maybe less.

Jean-Philippe Cointet Staff replied 5 years ago

Corpus explorer is designed for small to medium-sized corpora.
The code is not optimal and requires the full information to be loaded in memory. Therefore your browser may not cope with showing hundreds of thousands of texts. I added an option to sample the corpus when it gets too big.

matias.milia replied 5 years ago

Wonderful, I’ll try that option. The thing is that I have been having trouble querying the corpus (and extracting a new corpus) to reduce its size, so this seems to be a good alternative. Thanks.

matias.milia replied 5 years ago

Sorry, but I could not find how to sample corpora. Does it do that automatically? I still can’t visualize the corpus.

Jean-Philippe Cointet Staff replied 5 years ago

I double-checked, and you should be able to set a certain number of entries to show when launching the corpus explorer script. If set to zero (default behaviour), every record is shown.
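
To make the "zero means everything" convention concrete, here is a rough Python sketch of that kind of sampling. It is an assumption about the general idea, not the actual corpus explorer code; `sample_records`, `max_entries` and `seed` are hypothetical names.

```python
import random


def sample_records(records, max_entries=0, seed=None):
    """Return at most `max_entries` randomly chosen records, in original order.

    A value of 0 (the default) disables sampling and returns every record.
    """
    if max_entries <= 0 or max_entries >= len(records):
        return list(records)
    rng = random.Random(seed)
    # Sample positions rather than records so the original order can be kept.
    chosen = sorted(rng.sample(range(len(records)), max_entries))
    return [records[i] for i in chosen]


corpus = [f"document {i}" for i in range(265_000)]
subset = sample_records(corpus, max_entries=1000, seed=42)
print(len(subset))
```

Showing a few thousand sampled entries instead of all 265k keeps the amount of text the browser has to load manageable.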

matias.milia replied 5 years ago

Got it!