terms extraction: very different results depending on max string length

CorText Manager Q&A forum, Category: Text processing
roberto cantoni asked 4 years ago

Dear all, 
I’m trying different terms extractions on my corpus. I set the maximum string length to 3, and then to 5, and the results differ considerably: with a maximum of 5, some high-frequency strings that I obtained with a maximum of 3 no longer appear. Is that expected? 
Thank you. 

Jean-Philippe Cointet Staff replied 4 years ago

Yes, by default, extracted terms are not ranked by frequency but by specificity (chi2).
When longer n-grams are allowed into the list, the specificity score of each individual term may change, which results in a different ranking and a new selection of terms.
Of course, you can set the ranking to be based on frequency in the advanced options!
Hope this helps
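
To illustrate why a frequency ranking and a chi2-style specificity ranking can disagree, here is a minimal Python sketch. It is not CorText's actual scoring formula: the function `chi2_specificity`, the toy corpus, and the choice of expected counts (each term spread across documents in proportion to document length) are all assumptions made for illustration only.

```python
from collections import Counter

def chi2_specificity(docs):
    """Return {term: (frequency, chi2)} for a list of tokenized documents.

    Hypothetical scoring, for illustration: chi2 measures how far a term's
    per-document counts deviate from an even spread proportional to
    document length. This is not CorText's formula.
    """
    total_tokens = sum(len(d) for d in docs)
    shares = [len(d) / total_tokens for d in docs]  # expected share per document
    per_doc = [Counter(d) for d in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    scores = {}
    for term, freq in totals.items():
        chi2 = 0.0
        for counts, share in zip(per_doc, shares):
            expected = freq * share
            chi2 += (counts[term] - expected) ** 2 / expected
        scores[term] = (freq, chi2)
    return scores

# Toy corpus (made up for this example).
docs = [
    "the solar cell and the solar panel".split(),
    "the wind turbine and the grid".split(),
    "the grid and the market".split(),
]
scores = chi2_specificity(docs)
by_frequency = sorted(scores, key=lambda t: -scores[t][0])
by_specificity = sorted(scores, key=lambda t: -scores[t][1])
print("top by frequency:  ", by_frequency[:3])
print("top by specificity:", by_specificity[:3])
```

In this toy corpus the most frequent term ("the") is spread evenly across documents and therefore gets a very low chi2 score, while a less frequent but concentrated term ("solar") ranks first by specificity. A high-frequency string can drop out of a specificity-ranked list in the same way.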

roberto cantoni replied 4 years ago

Thank you for your reply, Jean-Philippe. Just one more question: I know what a chi2 test is used for in physics, but what exactly does it mean when applied to terms extraction? What is specificity in this sense? I looked up the documentation here (https://docs.cortext.net/lexical-extraction/) but could not find a fully satisfying answer.