Hi,
I would like to ask a question about heterogeneous network mapping and how documents are indexed according to a list of terms.
I have a small corpus of 10 texts of very variable length, each associated with a different city: the shortest text is 2,000 words and the longest 16,000. In the analysis that follows, I would like to make sure that I am not simply measuring these differences in length.
I first used the term extraction script. So far no problem, because you can choose to count at the sentence level (so several possible occurrences for a term in a document) or at the document level (presence or absence of the term in the document).
Afterwards, I manually selected and grouped the terms from the original list.
But when I import this new list back, the Corpus Terms Indexer script does not let me choose the type of counting. From what I understand, indexing counts the number of occurrences of each term in each text of the corpus.
As a result, when building a heterogeneous network (in my case a network between the selected terms and the cities), these counts are taken into account, and the cities with the longest texts seem to me structurally more central. This is especially visible in the projections (which are what interest me) built with metrics such as the heterogeneous cosine, even though it includes a normalization.
So my question is the following: is there a possibility of constructing a heterogeneous network (or a projection of such a network) by taking into account only the presence or absence of the terms (without the number of occurrences)?
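To make the question concrete, here is a minimal Python sketch of the difference I mean. The city names, terms and counts are made up, and the plain cosine below is only a stand-in for illustration, not the heterogeneous cosine that the platform computes.

```python
import math

# Hypothetical term counts per city (made-up numbers, for illustration only).
counts = {
    "CityA": {"transport": 12, "housing": 3, "energy": 0},
    "CityB": {"transport": 1, "housing": 1, "energy": 1},
}


def binarize(doc_counts):
    """Turn occurrence counts into presence (1) / absence (0)."""
    return {
        city: {term: int(n > 0) for term, n in terms.items()}
        for city, terms in doc_counts.items()
    }


def cosine(u, v):
    """Plain cosine similarity between two term vectors stored as dicts."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    norm_u = math.sqrt(sum(u.get(k, 0) ** 2 for k in keys))
    norm_v = math.sqrt(sum(v.get(k, 0) ** 2 for k in keys))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


presence = binarize(counts)

# Similarity computed on raw counts versus on presence/absence only.
print(cosine(counts["CityA"], counts["CityB"]))
print(cosine(presence["CityA"], presence["CityB"]))
```

What I am looking for is the second kind of input: each term counted at most once per city.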
Any help would be appreciated.
Thank you for this incredible tool,
PG
Dear Paul,
Excellent question.
To achieve what you want to do, you can add a few extra steps.
First:
- (step1.1) Run a lexical extraction with the Terms Extraction script
- (step1.2) Modify the list according to your needs
- (step1.3) Upload it and run a Corpus Terms Indexer
- (step1.4) You have now updated your lexical extraction, and the documents have been tagged with the list from step1.3
At that step (step1.4), as you already know, Main forms are added to each document according to the list of Forms you extracted and modified, together with their occurrences. These occurrences are counted at the sentence level, and separately for each textual variable indexed (which is not your case, but some users may have indexed two or more textual fields).
So, from there, you can:
- (step2.1) Run a List Builder, selecting all your Main Forms tagged by the Corpus Terms Indexer, on the variable/field which corresponds to step1.4
- (step2.2) Modify this list by adding a new column which contains exactly the same Main Forms as the ones in the first column (and removing the extra column). The file should have two identical columns.
- (step2.3) Upload it back, and perform a Corpus List Indexer using this second list, with the option Count only one occurrence per document during indexation activated, using the variable/field which corresponds to step1.4
The result (step2.4) will be that your documents receive only one occurrence of each Main form, even if the Main form and its corresponding Forms occur more than once.
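If you prefer not to edit the file by hand in step2.2, the column duplication can be scripted. Here is a minimal Python sketch, assuming the list exported by the List Builder is a tab-separated text file with the Main Forms in the first column. That file layout, the sample rows and the function names are assumptions for illustration, so check them against your own file before relying on this.

```python
import csv
import io


def duplicate_first_column(rows):
    """Keep only the first cell of each row and repeat it in a second column."""
    out = []
    for row in rows:
        # Skip empty lines and rows whose first cell is blank.
        if not row or not row[0].strip():
            continue
        form = row[0].strip()
        out.append([form, form])
    return out


def convert(src, dst, delimiter="\t"):
    """Read a delimited list from src and write the two-column version to dst."""
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    writer.writerows(duplicate_first_column(reader))


# Made-up sample: Main Forms followed by an extra column that gets dropped.
sample = io.StringIO("public transport\t42\nhousing\t7\n")
result = io.StringIO()
convert(sample, result)
print(result.getvalue())
```

With real files, you would pass handles opened with `open(path, newline="", encoding="utf-8")` instead of the `StringIO` objects. Note that a header row, if your file has one, is treated like any other row.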
I hope it helps.
L
Thank you very much!
It seems to work well.
Just a clarification:
In the last step (i.e. “Upload it back, and perform a Corpus List Indexer using this second list, with the option Count only one occurrence per document during indexation activated”), the field to be selected for the Corpus List Indexer is the result of the first Corpus Terms Indexer script, right?
Exactly! I have updated my answer to make it clearer.