Hello,
I would like some information about working with monograms in the term extraction script. The objectives of this work are:
- Extract the 200 most frequent keywords in my corpus
- Index these keywords at the document level
- Make co-occurrence maps of these keywords
For this, I have done the following steps:
- Term extraction, with the following parameters: list length = 200; monograms are forbidden = no; frequency computation level = document level; ranking principle = frequency. The problem is that some keywords (for example “animal intention”) are split into “animal” and “intention”, but this is not the case for all the keywords. How can I specify that the keywords should be kept as written? (i.e. if it’s a monogram, it appears in the list as a monogram; if not, the combination of words appears as is in the list)
- I then index my corpus with the Corpus Terms Indexer, then with the Corpus List Indexer (with the parameter: count only one occurrence per document during indexation = yes, to work at the document level)
- I then run the network mapping script
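In plain Python terms, the document-level counting and co-occurrence I have in mind look like this (a toy sketch with invented data, not CorText’s actual code):

```python
from collections import Counter
from itertools import combinations

# Toy corpus: one keyword list per document (invented example data).
docs = [
    ["animal intention", "welfare", "cognition"],
    ["welfare", "cognition"],
    ["animal intention", "cognition", "cognition"],  # keyword repeated in a doc
]

# Document-level frequency: count each keyword at most once per document.
doc_freq = Counter()
for kws in docs:
    doc_freq.update(set(kws))

# Document-level co-occurrence: pairs of distinct keywords sharing a document.
cooc = Counter()
for kws in docs:
    for pair in combinations(sorted(set(kws)), 2):
        cooc[pair] += 1

print(doc_freq["cognition"])                    # 3: present in all 3 documents
print(cooc[("animal intention", "cognition")])  # 2: together in docs 1 and 3
```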
Does this seem correct to you?
Thank you in advance for your help,
Sincerely,
Anne-Lise
Dear Anne-Lise,
There is no simple answer to your question.
The problem with your example is that “animal intention” is, at the same time, two monograms (“animal” and “intention”) and one bigram (“animal intention”).
My guess is that, as you are sorting your results by frequency, “animal” alone and “intention” alone are much more frequent than “animal intention” together. But “animal intention” is frequent enough to appear in the top 200 main terms sorted by frequency.
So, for the rest of the terms, which appear only as monograms, it is simply a matter of selection into the top 200 main terms: their monogram versions are frequent enough to be included in the list, but their bigram versions are not.
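As a toy illustration (plain Python with invented counts, not the actual extraction code): ranking by raw frequency treats a bigram and its component monograms as independent entries, so all three can make the cut while some other term is squeezed out of the list:

```python
from collections import Counter

# Invented occurrence counts across a toy corpus.
texts = (
    ["animal"] * 50 + ["intention"] * 40 + ["animal intention"] * 10 +
    ["welfare"] * 8 + ["cognition"] * 3
)

freq = Counter(texts)
top4 = [term for term, _ in freq.most_common(4)]
print(top4)
# "animal", "intention" AND "animal intention" all reach the top of the
# list, while "cognition" falls below the cutoff.
```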
You have three ways to “solve” this:
- Work only with monograms (which is not satisfying, I guess) or only with n-grams: https://docs.cortext.net/lexical-extraction/#monograms
- Increase the size of the list and clean it manually (by removing the unwanted versions): https://docs.cortext.net/training-materials/french-visualisation-de-corpus-avec-cortext-manager/
- Do not use frequency as the ranking principle: https://docs.cortext.net/lexical-extraction/#ranking-principle
I hope it helps
L
Dear Lionel,
Thank you very much for your answer! It seems to me that the second solution is the most suitable for the project.
Just to be sure of my reasoning: I decided to work with frequency rather than chi2 as the ranking principle because, if I understand correctly, since I am only working on the keywords of scientific articles, I don’t need to exclude irrelevant terms (and therefore don’t need the chi2 test to identify words with an independent distribution in the corpus)?
Thanks again for your help,
Anne-Lise
Dear Anne-Lise,
I did not catch that you were working with the keywords of scientific articles (and not keywords extracted from full texts or abstracts…).
If you just want to extract the top N keywords of scientific articles, you may want to use the List Builder script: it does not apply any grammatical processing, so your keywords won’t be modified.
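To picture the difference (a hypothetical sketch in Python, not CorText’s implementation): such a list builder essentially counts the field values verbatim and keeps the top N, leaving multi-word keywords untouched:

```python
from collections import Counter

# Invented author-keyword lists, one per article.
articles = [
    ["animal intention", "welfare"],
    ["animal intention", "cognition"],
    ["welfare"],
]

def build_list(docs, n):
    """Top-n keywords, counted as-is, with no grammatical processing."""
    counts = Counter()
    for kws in docs:
        counts.update(kws)
    return counts.most_common(n)

print(build_list(articles, 2))
# [('animal intention', 2), ('welfare', 2)] -- "animal intention" stays intact
```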
I hope it helps
L
Dear Lionel,
Thank you very much for this answer, which fits the project perfectly. My last question: if I understand the “List Builder” script correctly, I should keep “no” for the two parameters “Compute Log Likelihood specificity” and “Search for Duplicate entries” in order to keep the keywords exactly as written?
Thank you very much,
Anne-Lise
Yes!
Dear Lionel,
I’m still working on the keywords of scientific articles, and I have a new question. I followed your previous advice, and if I understand correctly, since the List Builder script does not modify the form of the keywords, I should not see, for an extracted term, any difference between the frequency and the number of distinct documents (as a keyword cannot be proposed twice for the same article). But in fact, for some keywords, the frequency is higher than the number of documents in the corpus, and I never found the same value for the frequency and the number of distinct documents. Could you explain why? And which value should I look at to work on the frequency of appearance of my keywords in my corpus?
Thank you in advance for your help,
Sincerely,
Anne-Lise
Dear Anne-Lise,
Do the keywords come from the author keywords field? Are the variations between the two values large or marginal?
I have checked one of our corpuses, extracted from the Web of Science on chloroquine and hydroxychloroquine, and this happens marginally for the Author Keywords field. That is, a small number of keywords are repeated within the bibliographical records provided by the Web of Science in the DE attribute. For those keywords, the frequency and the number of distinct documents differ.
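The effect can be reproduced with a small sketch (invented data, plain Python): when a keyword is repeated within a single document’s keyword field, its total frequency exceeds its distinct-document count:

```python
from collections import Counter

# Invented DE-style keyword fields; note the duplicate in the first record.
docs = [
    ["chloroquine", "chloroquine", "malaria"],  # keyword repeated by the source
    ["chloroquine", "covid-19"],
]

frequency = Counter()
distinct_docs = Counter()
for kws in docs:
    frequency.update(kws)           # counts every occurrence
    distinct_docs.update(set(kws))  # counts each keyword once per document

print(frequency["chloroquine"], distinct_docs["chloroquine"])  # 3 2
```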
I hope it helps
L
Dear Lionel,
In my first corpus, the variations were large, but that was my fault: the problem came from my corpus (I had misconfigured Scopus during the download).
With the right corpus, the variations are indeed marginal.
I apologise for this unnecessary question!
Thanks again for your help,
Sincerely,
Anne-Lise