Hello!
I am trying to use CorText to extract a list of terms based on their chi2 specificity; however, my text is in a slightly unusual format: each document in my corpus represents a political speech, containing words extracted from a previously completed dictionary analysis. The text in each document therefore consists of a word followed by a period, then another word followed by another period (e.g., “president. help. president. leader. cruel. protect. lawyers. president. cruel. country. president. communities. nation. help.”), making each document essentially a string of one-word sentences.
I have tried extracting terms based on sentence-level co-occurrences, but the resulting “extracted terms” list does not have the “Specificity Chi2”, “Gf-idf”, “Occurrences”, and “Co-occurrences” columns I am used to seeing when I have used CorText in the past (instead there is just a new column called “ngram log-likelihood”). Is the extraction working properly? My goal is to extract a list of terms with the highest chi2 specificity based on the number of times they occur in each document in my corpus (*not* the number of distinct documents they occur in, which I believe the document-level analysis would provide). In other words, I want a ranking of these words which gives a lower weight to words used frequently in all documents in my corpus, and a higher weight to words used frequently in only a few documents.
Let me know if there is a way I can achieve this!
Thanks so much!
Kobi Hackenburg
Dear Kobi,
If I understand correctly, you should run a term extraction using these parameters:
- Allow monograms (yes)
- Maximal length of words = 1 (as, at least in your examples, all the terms are monograms)
- Check the box for advanced settings
- Disable the grammatical analysis of your sentences
And you're right, this will remove some of the calculations from the results files provided.
Otherwise, you may want to:
- Allow monograms (yes)
- Maximal length of words = 1 (as, at least in your examples, all the terms are monograms)
But it will then propose some grammatical aggregations: mainly plurals or other grammatical variations (e.g. oxygen|&|oxygenation). If the dictionary is built so that all keywords are grammatically distinct, this should not be a problem.
Let me know if it helps.
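If you want to sanity-check the ranking outside CorText, here is a minimal Python sketch of a chi2-style specificity computed from per-document occurrence counts on "word. word. word." documents. To be clear, this is not CorText's own computation: it is a plain Pearson chi2 summed over documents, and the function names `tokenize` and `chi2_specificity` are made up for the illustration.

```python
from collections import Counter


def tokenize(doc):
    """Split a 'word. word. word.' document into lowercase one-word tokens."""
    return [t.strip().lower() for t in doc.split(".") if t.strip()]


def chi2_specificity(docs):
    """Return {term: chi2 score} from a term-by-document count table.

    Assumption: a plain Pearson chi2 (observed vs. expected counts under
    independence), summed over documents. Not CorText's implementation.
    """
    counts = [Counter(tokenize(d)) for d in docs]   # occurrences per document
    doc_totals = [sum(c.values()) for c in counts]  # tokens per document
    grand_total = sum(doc_totals)

    term_totals = Counter()
    for c in counts:
        term_totals.update(c)

    scores = {}
    for term, term_total in term_totals.items():
        score = 0.0
        for c, doc_total in zip(counts, doc_totals):
            # Count expected if the term were spread evenly across documents
            expected = term_total * doc_total / grand_total
            if expected > 0:
                score += (c[term] - expected) ** 2 / expected
        scores[term] = score
    return scores


if __name__ == "__main__":
    corpus = ["president. help. president.", "help. cruel. cruel."]
    ranked = sorted(chi2_specificity(corpus).items(), key=lambda kv: -kv[1])
    for term, score in ranked:
        print(term, round(score, 3))
```

A term used at the same rate in every document scores close to zero, while a term concentrated in a few documents scores higher, which is the kind of weighting described in the question.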
L