Terms extraction and indexation at the "Document Level"


Hi,
I ran a Terms Extraction with “Frequency computation level” set to “document level” in the advanced settings.
I made some changes to the resulting list of terms, and I would now like to run a Corpus Terms Indexer in which the frequencies are also calculated at the “document level”.
Is this possible, and if so, how? (I don’t see this option in the advanced settings of the Corpus Terms Indexer script.)
Thank you!

Question Tags: Corpus Terms Indexer

3 Answers


Lionel Staff answered 6 years ago

Dear Aurélien,
“Frequency computation level” lets you measure how often a form recurs in your documents. Term frequency can be computed at the document level (a term's frequency is the number of distinct documents it appears in) or at the sentence level (the default choice, meaning that repetitions of a term across sentences are taken into account in the occurrence count).
As the frequencies of a term (main form) are calculated during the indexation process (that is, after the extraction and selection of the relevant keywords), the results for “Frequency computation level” are stored in the indexed_list folder, in the multiterms_statistics_expanded.csv file.
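To make the difference between the two levels concrete, here is a minimal Python sketch. It is not Cortext's implementation: the function name `term_frequencies`, the naive sentence splitter, the substring matching and the toy corpus are all assumptions made for illustration only.

```python
import re
from collections import Counter


def term_frequencies(documents, terms):
    """Count each term per sentence and per document.

    Hypothetical helper, not Cortext code. Assumes lowercase terms and
    plain substring matching, with sentences split on '.', '!' or '?'.
    """
    sentence_level = Counter()
    document_level = Counter()
    for doc in documents:
        sentences = [s for s in re.split(r"[.!?]+", doc.lower()) if s.strip()]
        seen_in_doc = set()
        for sentence in sentences:
            for term in terms:
                if term in sentence:
                    # Sentence level: every sentence containing the term counts.
                    sentence_level[term] += 1
                    seen_in_doc.add(term)
        # Document level: a term counts at most once per document.
        document_level.update(seen_in_doc)
    return sentence_level, document_level


# Toy corpus (invented for this example).
docs = [
    "Gene editing is new. Gene editing raises questions. Regulation lags.",
    "Regulation of gene editing varies.",
]
sentence_level, document_level = term_frequencies(docs, ["gene editing", "regulation"])
print(sentence_level["gene editing"], document_level["gene editing"])
```

In this toy corpus, "gene editing" appears in three sentences but only two documents, so the two levels give different frequencies for the same term.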
I hope it helps
Lionel

Aurélien Féron replied 6 years ago

Hello Lionel,
Thank you for your answer!
It seems to confirm what I thought when I opened [terms]_statistics_expanded.csv files: the “frequency” column gives the “raw” frequency of the term (or more precisely of the main form) in the whole corpus, knowing that each occurrence of the term counts, while the “distinct number of documents” column gives a frequency for which only one occurrence per document counts even if the term appears several times; is that right?
If that’s right, this leads me to another question: let’s say, for example, that I want to do a network mapping or an epic-epoch using indexed terms. Which statistics are then used: the ones in the “frequency” column (the raw frequencies), or the ones in the “distinct number of documents” column (only one occurrence counted per document)? I don’t see, on the parameter definition tabs of these scripts, any way to choose one or the other.
Thank you in advance for your answer and have a nice day!
Aurélien


Lionel Staff answered 6 years ago

Dear Aurélien,
To answer the second part of your question: the Corpus Terms Indexer script will always work at the sentence level. But to get the behaviour you are looking for, you can add an extra step by running a Corpus List Indexer.
Here are the steps you would have to follow:

1. Run the Terms Extraction script, at the document level or not. Download and tweak your list of main forms and forms.
2. Upload the edited list and index your corpus with the Corpus Terms Indexer script. You can use the advanced options to customize the indexation according to your needs.
3. Run the Corpus List Indexer script. Choose the field you want to group at the document level (the one produced by the previous step), leave “Define a custom list of entities” and “Add a dictionary of equivalent strings” set to No, choose Yes for “List indexation advanced settings”, and choose Yes for “Count only one occurrence per document during indexation”.

That's it: you have added to your analysis a new custom field with keywords grouped at the document level!
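Conceptually, counting only one occurrence per document amounts to deduplicating the indexed terms within each document before counting. The Python sketch below illustrates that idea only; it is not how Corpus List Indexer is implemented, and the function name `count_once_per_document`, the dictionary layout and the sample data are hypothetical.

```python
from collections import Counter


def count_once_per_document(indexed_terms_by_doc):
    """Keep each term once per document, preserving first-seen order.

    Hypothetical helper: the input maps a document id to the list of
    terms indexed in that document, repetitions included.
    """
    return {
        doc_id: list(dict.fromkeys(terms))
        for doc_id, terms in indexed_terms_by_doc.items()
    }


# Invented sample: terms as they might look after a sentence-level indexation.
indexed = {
    "doc1": ["gmo", "risk", "gmo", "gmo"],
    "doc2": ["risk", "risk"],
}
grouped = count_once_per_document(indexed)

# Raw counts: every occurrence counts.
raw_counts = Counter(t for terms in indexed.values() for t in terms)
# Document-level counts: at most one occurrence per document.
doc_counts = Counter(t for terms in grouped.values() for t in terms)
print(raw_counts, doc_counts)
```

Any analysis run on the deduplicated field then sees each keyword at most once per document, which is the behaviour asked about in the original question.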
I hope it helps
Lionel


Aurélien Féron answered 6 years ago

Thank you, Lionel, for your answer!
(Yes it helps!)
Aurélien
