Terms extraction: Monograms

CorText Manager Q&A forum, Category: Text processing
Anne-Lise Dauphiné-Morer asked 5 days ago

Hello,
I would like some information about working with monograms in the terms extraction script. The objective of this work is to:

  1. Extract the 200 most frequent keywords in my corpus
  2. Index these keywords at the document level
  3. Make co-occurrence maps of these keywords

For this, I have done the following steps:

  1. Terms extraction, with the following parameters: list length = 200; Monograms are forbidden = no; frequency computation level = document level; ranking principle = frequency. The problem is that some keywords (for example “animal intention”) are split into “animal” and “intention”, but this is not the case for all keywords. How can I specify that keywords should be kept as written (i.e. a monogram appears in the list as a monogram, and a multi-word term appears in the list as is)?
  2. I then index my corpus with Corpus terms indexer, then with Corpus list indexer (with the parameter count only one occurrence per document during indexation = yes, to work at the document level)
  3. I then run the network mapping script
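To make the intended pipeline concrete, here is a minimal Python sketch of steps 2 and 3 outside CorText: a fixed term list is indexed at the document level (each term counted at most once per document), then pairs of terms appearing in the same document are counted. The toy corpus, the term list, and the helper names `index_document` and `cooccurrences` are all invented for illustration, and the naive whitespace matching is an assumption, not what the CorText indexer actually does.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (invented): one string per document.
docs = [
    "animal intention in animal behaviour",
    "the animal shows intention",
    "animal intention and animal welfare",
    "intention of the animal",
]

# Hypothetical term list, mixing monograms and one bigram.
terms = ["animal", "intention", "animal intention", "welfare"]


def index_document(text, terms):
    """Return the set of listed terms found in the text.

    Using a set means a term is kept once per document, however often
    it occurs (the "count only one occurrence per document" idea).
    """
    padded = " " + " ".join(text.lower().split()) + " "
    return {t for t in terms if " " + t + " " in padded}


def cooccurrences(docs, terms):
    """Count, for each pair of terms, the documents containing both."""
    pairs = Counter()
    for text in docs:
        present = sorted(index_document(text, terms))
        pairs.update(combinations(present, 2))
    return pairs


pairs = cooccurrences(docs, terms)
for pair, count in sorted(pairs.items(), key=lambda kv: (-kv[1], kv[0])):
    print(pair, count)
```

In this naive sketch, a bigram such as "animal intention" always co-occurs with its own components "animal" and "intention", which is worth keeping in mind when reading a co-occurrence map built from a list that mixes monograms and multi-word terms.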

Does this seem correct to you?

Thank you in advance for your help,
Sincerely,
Anne-Lise

1 Answer
Lionel Staff answered 2 days ago

Dear Anne-Lise,
There is no simple answer to your question.
The problem with your example is that “animal intention” is at the same time two monograms (“animal” and “intention”) and one bigram (“animal intention”).
My guess is that, as you are sorting your results by frequency, “animal” alone and “intention” alone are much more frequent than “animal intention” together. But “animal intention” is still frequent enough to appear in the top 200 terms sorted by frequency.
So, for the other terms that appear only as monograms, it is simply a matter of what gets selected in the top 200: their monogram versions are frequent enough to be included in the list, but their bigram versions are not.
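The effect described above can be reproduced with a small Python sketch. It counts document-level frequencies for monograms and bigrams on an invented toy corpus and then cuts the ranked list at a given length. The corpus and the helper names (`ngrams`, `document_frequency`, `top_terms`) are assumptions made for illustration; the real terms extraction script relies on linguistic processing that is not modelled here.

```python
from collections import Counter

# Toy corpus (invented): one string per document.
docs = [
    "animal intention in animal behaviour",
    "the animal shows intention",
    "animal intention and animal welfare",
    "intention of the animal",
]


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_frequency(docs, max_n=2):
    """Count, for every 1..max_n-gram, the number of documents containing it."""
    df = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        df.update(seen)  # each term is counted once per document
    return df


def top_terms(df, list_length):
    """Rank terms by decreasing frequency (ties broken alphabetically)."""
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:list_length]]


df = document_frequency(docs)
for term in ("animal", "intention", "animal intention"):
    print(term, df[term])

print(top_terms(df, 2))  # short list: only the two monograms fit
print(top_terms(df, 3))  # longer list: the bigram now makes the cut
```

Here “animal” and “intention” each occur in all four documents while “animal intention” occurs in two, so whether the bigram shows up depends only on where the list is cut.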
You have three ways to “solve that” :

I hope this helps.
L