Hi, I am trying to better understand the differences between the options in the advanced settings for term extraction (namely the Frequency (c-value) computation level and the different options for the Specificity score). I have read the references listed on the page; could you recommend other papers where the different options are explained?
Thank you very much.
Actually, the documentation may be slightly outdated with respect to the current state of the script. We will try to update it as soon as possible.
Broadly, there are four different measures for scoring term importance when sorting the final list in the CorText term extraction script.
The first, straightforward option is to sort terms by frequency and keep only the most frequent ones.
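To make the frequency option concrete, here is a minimal Python sketch. It is not the CorText code: the function name `top_terms_by_frequency` and the toy documents are made up for illustration, and it assumes the documents have already been split into candidate terms.

```python
from collections import Counter

def top_terms_by_frequency(documents, k):
    """Return the k most frequent terms over a list of tokenized documents.

    Hypothetical helper for illustration only, not the CorText implementation.
    """
    # Count every occurrence of every term across the whole corpus.
    counts = Counter(term for doc in documents for term in doc)
    # most_common sorts by decreasing count.
    return [term for term, _ in counts.most_common(k)]

# Toy corpus: each document is a list of already extracted terms.
docs = [
    ["gene", "expression", "gene"],
    ["gene", "network"],
    ["network", "analysis", "gene"],
]
print(top_terms_by_frequency(docs, 2))
```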
The second available option is the so-called pigeon measure, which scores terms according to their multiplicity within each document. Put differently, this score measures the average frequency of a term within the documents it appears in (corrected by a factor that accounts for the fact that very frequent terms are more likely to appear several times in the same piece of content). The hypothesis here is that important words tend to be repeated within the same document. Naturally, the measure only works for texts that are long enough.
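As a rough illustration of the idea behind this measure, the Python sketch below computes only the uncorrected part: the average number of occurrences of a term within the documents that contain it. The correction factor for very frequent terms is deliberately left out, since its exact form is not given here, so this is an assumption-laden approximation rather than the score the script actually computes. The function name and the toy data are hypothetical.

```python
from collections import Counter

def mean_within_doc_frequency(documents):
    """Average occurrences of each term in the documents containing it.

    Illustration only: the correction factor applied by the real
    measure is NOT included here.
    """
    total_occurrences = Counter()  # occurrences summed over the corpus
    containing_docs = Counter()    # number of documents containing the term
    for doc in documents:
        for term, n in Counter(doc).items():
            total_occurrences[term] += n
            containing_docs[term] += 1
    return {
        term: total_occurrences[term] / containing_docs[term]
        for term in total_occurrences
    }

# "gene" is repeated inside documents; "the" appears everywhere but only once.
toy_docs = [
    ["gene", "gene", "gene", "the"],
    ["the", "network"],
    ["the", "gene", "gene"],
]
scores = mean_within_doc_frequency(toy_docs)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

In this toy corpus "the" is as widespread as a term can be, but it is never repeated within a document, so it scores lower than "gene".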
The third and fourth options are specificity scores built using either a chi-square or a G-square measure of the specificity of a term compared with the other terms in the vocabulary. The rationale is that non-informative terms feature a random distribution across the documents, hence a random distribution with respect to the other words.
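For readers unfamiliar with the two statistics, here is a minimal Python sketch of the standard chi-square and G-square (log-likelihood ratio) computations on a 2x2 contingency table, for example documents containing or not containing term A, crossed with documents containing or not containing term B. How the script builds these tables and aggregates the values into a single score per term is not described here, so the table layout and the function name `chi2_and_g2` are assumptions made for illustration.

```python
import math

def chi2_and_g2(table):
    """Chi-square and G-square statistics for a 2x2 table of observed counts.

    table = [[a, b], [c, d]]; textbook formulas, shown for illustration
    and not taken from the CorText source code.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                continue  # empty row or column: the cell contributes nothing
            observed = table[i][j]
            chi2 += (observed - expected) ** 2 / expected
            if observed > 0:
                g2 += 2 * observed * math.log(observed / expected)
    return chi2, g2

# Two terms distributed independently of each other: both statistics are zero.
print(chi2_and_g2([[10, 10], [10, 10]]))
# Two terms that always appear together: both statistics are large.
print(chi2_and_g2([[20, 0], [0, 20]]))
```

A term whose tables all look like the first example behaves randomly with respect to the rest of the vocabulary and would get a low specificity; tables like the second one signal a strongly non-random distribution.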