Terms extraction and indexation at the "Document Level"

Aurélien Féron asked 5 months ago

I ran a Terms Extraction with “document level” selected as the “Frequency computation level” in the advanced settings.
I have made some changes to the resulting list of terms, and I would now like to run a Corpus Terms Indexer in which the frequencies are also calculated at the “document level”.
Is this possible, and if so, how? (I don’t see this option in the advanced settings of the Corpus Terms Indexer script.)
Thank you!

2 Answers
Lionel Staff answered 2 weeks ago

Dear Aurélien,
The “Frequency computation level” setting lets you measure how redundant a form is across your documents. Term frequency can be computed at the document level (meaning that term frequencies are based on the number of distinct documents a term appears in) or at the sentence level (the default choice, meaning that repetitions of a term across sentences are taken into account in the occurrence count).
As the frequencies of a term (main form) are calculated during the indexation process (that is, after the extraction and selection of the relevant keywords), the results for “Frequency computation level” are stored in the indexed_list folder, inside the multiterms_statistics_expanded.csv file.
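To make the difference between the two levels concrete, here is a minimal Python sketch. It is not CorText's actual implementation: the function name term_frequencies, the naive sentence splitting on full stops and the plain substring matching are all assumptions made for illustration.

```python
from collections import Counter

def term_frequencies(documents, terms, level="sentence"):
    """Count terms per sentence or per document (illustrative only).

    Assumptions: sentences are split naively on '.', and a term "occurs"
    when it appears as a lowercase substring. CorText's own tokenisation
    and matching are certainly more elaborate.
    """
    counts = Counter()
    for doc in documents:
        sentences = [s.strip().lower() for s in doc.split(".") if s.strip()]
        for term in terms:
            hits = sum(1 for s in sentences if term in s)
            if level == "document":
                # A document contributes at most 1, however often the term repeats.
                counts[term] += 1 if hits else 0
            else:
                # Every sentence containing the term contributes 1.
                counts[term] += hits
    return counts

docs = [
    "Solar energy is growing. Solar energy is cheap.",
    "Wind power is growing.",
]
terms = ["solar energy", "wind power"]

by_sentence = term_frequencies(docs, terms, level="sentence")
by_document = term_frequencies(docs, terms, level="document")
```

With this toy corpus, "solar energy" is counted twice at the sentence level but only once at the document level, because both of its occurrences are in the same document.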
I hope it helps

Lionel Staff answered 5 days ago

Dear Aurélien,
To answer the second part of your question: the Corpus Terms Indexer script always works at the sentence level. To get the behaviour you are looking for, however, you can add an extra step by running a Corpus List Indexer.
Here are the steps you would have to follow:

  • Run the Terms Extraction script, at the document level or not. Download and tweak your list of main forms and forms;
  • Upload and index your corpus with the Corpus Terms Indexer script. You can use the advanced options to customize the indexation according to your needs;
  • Run the Corpus List Indexer script. Choose the field you want to group at the document level (the one produced by the previous step), leave “Define a custom list of entities” and “Add a dictionary of equivalent strings” set to No, choose Yes for “List indexation advanced settings”, and choose Yes for “Count only one occurrence per document during indexation”.

That’s it: you have added to your analysis a new custom field with keywords grouped at the document level!
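Conceptually, the extra Corpus List Indexer step collapses the sentence-level indexation so that each term is kept at most once per document. The Python sketch below only illustrates that idea: the (doc_id, sentence_id, term) row layout and the function name count_once_per_document are hypothetical and do not reflect how CorText stores or processes its indexed fields.

```python
from collections import Counter

def count_once_per_document(indexed_rows):
    """Collapse sentence-level index rows to one occurrence per document.

    indexed_rows: iterable of (doc_id, sentence_id, term) tuples.
    This layout is a hypothetical stand-in for a sentence-level index.
    """
    # Keep each (document, term) pair once, ignoring the sentence id.
    unique_pairs = {(doc_id, term) for doc_id, _sentence_id, term in indexed_rows}
    # Count in how many distinct documents each term appears.
    return Counter(term for _doc_id, term in unique_pairs)

rows = [
    (1, 0, "solar energy"),
    (1, 1, "solar energy"),  # same document, different sentence
    (2, 0, "solar energy"),
    (2, 0, "wind power"),
]
doc_level = count_once_per_document(rows)
```

Here "solar energy" is indexed in three sentences but ends up with a count of 2, one per document it appears in.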
I hope it helps
