Not getting a "Chi2 Specificity" measure when extracting terms using Chi2

Kobi Hackenburg asked 4 months ago

Hello! 
I am trying to use CorText to extract a list of terms based on their chi2 specificity; however, my text is in a slightly unusual format: each document in my corpus represents a political speech, containing words extracted from a previously completed dictionary analysis. The text in each document therefore consists of a word followed by a period, then another word followed by another period (e.g., “president. help. president. leader. cruel. protect. lawyers. president. cruel. country. president. communities. nation. help.”), making each document essentially a string of one-word sentences. 
I have tried extracting terms based on sentence-level co-occurrences, but the resulting “extracted terms” list does not have the “Specificity Chi2”, “Gf-idf”, “Occurrences”, and “Co-occurrences” columns I am used to seeing when I have used CorText in the past (instead there is just a new column called “ngram log-likelihood”). Is the extraction working properly? My goal is to extract a list of terms with the highest chi2 specificity based on the number of times they occur in each document in my corpus (*not* the number of distinct documents they occur in, which I believe the document-level analysis would provide). In other words, I want a ranking of these words that gives a lower weight to words used frequently in all documents in my corpus, and a higher weight to words used frequently in only a few documents. 
Let me know if there is a way I can achieve this! 
Thanks so much! 
Kobi Hackenburg 
 

1 Answer
Lionel Staff answered 4 months ago

Dear Kobi,
If I understand correctly, you should run a term extraction with these parameters:

  1. Allow monograms (yes)
  2. Maximal length of words = 1 (as, at least in your examples, all the terms are monograms)
  3. Check the box for advanced settings
  4. Disable the grammatical analysis of your sentences

And you're right, this will remove some calculations from the result files.
Otherwise, you may want to use only:

  1. Allow monograms (yes)
  2. Maximal length of words = 1 (as, at least in your examples, all the terms are monograms)

But in that case the extraction will propose some grammatical aggregations: mainly plurals or other grammatical variations (e.g. oxygen|&|oxygenation). If the dictionary is built so that all keywords are grammatically distinct, this should not be a problem.
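
For a quick sanity check outside CorText, the kind of ranking described in the question can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CorText's own computation: the function names `tokenize` and `chi2_specificity` are hypothetical, and the score used here is each term's contribution to the Pearson chi-square statistic of the term-by-document table of occurrence counts, which may differ from the formula behind the "Specificity Chi2" column.

```python
from collections import Counter


def tokenize(doc):
    # Assumption: every "sentence" is a single word ending with a period.
    return [w.strip().lower() for w in doc.split(".") if w.strip()]


def chi2_specificity(docs):
    """Score each term by its contribution to the Pearson chi-square
    statistic of the term-by-document occurrence table (illustrative only)."""
    counts = [Counter(tokenize(d)) for d in docs]
    doc_totals = [sum(c.values()) for c in counts]
    grand_total = sum(doc_totals)

    # Total occurrences of each term across the whole corpus.
    term_totals = Counter()
    for c in counts:
        term_totals.update(c)

    scores = {}
    for term, term_total in term_totals.items():
        chi2 = 0.0
        for c, doc_total in zip(counts, doc_totals):
            if doc_total == 0:
                continue  # skip empty documents
            # Count expected if the term were spread evenly over the corpus.
            expected = term_total * doc_total / grand_total
            chi2 += (c[term] - expected) ** 2 / expected
        scores[term] = chi2
    return scores


if __name__ == "__main__":
    corpus = [
        "president. help. president. leader.",
        "help. cruel. protect. cruel.",
    ]
    ranking = sorted(chi2_specificity(corpus).items(), key=lambda kv: -kv[1])
    for term, score in ranking:
        print(f"{term}\t{score:.3f}")
```

A term used at the same rate in every document scores close to zero, while a term concentrated in a few documents scores higher, which matches the weighting described in the question.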

Let me know if this helps.
L