I would like to link the list produced by the terms extraction script to the initial corpus and match it against the same column of the “old” database. Basically, I need to clean up the corpus and get rid of all kinds of useless terms such as “and”, “they” and so on, but I have no idea how to do so.
In the end I would like to run a network analysis script on the database, but since it is composed of continuous text I first need to get rid of conjunctions and the like.
Thank you in advance!
I’m not sure I correctly understood your problem.
Term extraction automatically produces a new variable containing the index of extracted words. Just choose the “Terms” field in the network mapping parameter form.
Is that what you were wondering about?
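If you ever need to do this kind of cleanup outside the platform, here is a minimal sketch in Python of what removing stop words looks like. It is not what the term extraction script does internally; the function name `remove_stop_words` and the short `STOP_WORDS` list are hypothetical and only there for illustration.

```python
import re

# Tiny illustrative stop-word list (an assumption; extend it for real use).
STOP_WORDS = {"a", "an", "and", "the", "they", "of", "to", "in", "on", "or", "but", "is", "are"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words("They study networks and the evolution of science"))
# prints ['study', 'networks', 'evolution', 'science']
```

The remaining tokens are what you would feed to a network analysis, instead of the raw continuous text.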
Sorry, I will add a comment in English. That will help me be clearer.
I think I am stuck at the same spot. I have run the script on a big database (265k entries; yes, that’s a lot!), asking it to show only four variables. I had done this in the past and it worked. Now I have made several attempts to do it again, but it won’t work. Could it be related to the variables I chose (document title, abstract, country and thematic cluster)? Is there something else I can or should take into account? The script runs fine, but when I try to visualize the results, it stays at ‘processing’ forever.
Which script are you talking about, corpus explorer or term extraction?
Sorry, I was referring to ‘Corpus explorer’. I can’t visualize the results; I managed to download them, but they are more difficult to navigate in ‘deep.txt’.
Old corpus explorations seem to work fine; the problem is with the new ones, I mean results from the last week or maybe less.
Corpus explorer is designed for small to medium-sized corpora.
The code is not optimal and requires the full information to be loaded in memory, so your browser may struggle to display hundreds of thousands of texts. I added an option to sample the corpus when it gets too big.
Wonderful, I’ll try that option. The thing is that I have been having trouble querying the corpus (and extracting a new corpus) to reduce its size, so this seems to be a good alternative. Thanks.
Sorry, but I could not find how to sample the corpus. Does it do that automatically? I still can’t visualize the corpus.
I double-checked, and you should be able to set the number of entries to show when launching the corpus explorer script. If it is set to zero (the default behaviour), every record is shown.
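To make the behaviour of that parameter concrete, here is a rough sketch in Python of how such a sampling option could work. It is only an illustration under assumptions: the name `sample_records`, the `max_entries` and `seed` parameters, and the choice of uniform random sampling are all hypothetical, not the actual corpus explorer code.

```python
import random


def sample_records(records, max_entries=0, seed=None):
    """Return at most max_entries records; 0 (the default) keeps every record."""
    if max_entries < 0:
        raise ValueError("max_entries must be zero or positive")
    if max_entries == 0 or max_entries >= len(records):
        # Nothing to sample: show the whole corpus.
        return list(records)
    # Uniform random sample without replacement (an assumed strategy).
    rng = random.Random(seed)
    return rng.sample(records, max_entries)


corpus = [f"document {i}" for i in range(265_000)]
subset = sample_records(corpus, max_entries=5_000, seed=42)
print(len(subset))  # prints 5000
```

With 265k entries, showing a few thousand sampled records keeps the amount of text the browser has to load manageable.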