I am working on a Factiva dataset that includes newspaper articles from 3 different sources. I ran into two questions when using the term extraction and network mapping functions:
- I extracted 50 terms from the full articles, with a minimum frequency of 3 and a maximum length of 5, and left the other parameters at their defaults. I want to use the table I found in index-list, which includes the frequency and the number of distinct documents for each term. But for one term, both the frequency and the number of distinct documents are 0. Why was this term extracted?
- Then I want to build a network map to see the interaction between sources and extracted terms. The first field chosen is the source name and the second is ISterm50 (the extracted terms). But the network map shows only 40 terms, not 50. Have I done something wrong? How can I analyse the relation between sources and extracted terms?
Thanks a lot !!!! Have a nice day!!
Could you invite me to your project, using this email address: lionel dot villard at esiee dot fr
I will leave the project after…
It looks like a problem with cleaning the extracted list of terms in French, especially for the covid 19 pandemic notion, which needs to be traced properly across all the textual information in the articles. In addition, you may want to enrich the list of extracted terms for this notion with some extra lexical variations.
After this cleaning / enriching step, change the name of the file, save it, and run a term list indexer to apply it to your corpus.
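To make the indexing step more concrete, here is a minimal Python sketch of what applying a cleaned term list to a corpus amounts to: each main form groups several lexical variations, and for each main form we count the total occurrences and the number of distinct documents. This is only an illustration under my own assumptions, not the actual Cortext indexer; the function name `index_terms`, the sample sentences and the sample variations are all hypothetical.

```python
import re

def index_terms(documents, term_forms):
    """Return {main_form: (frequency, distinct_documents)}.

    documents  : list of raw text strings (one per article)
    term_forms : dict mapping a main form to its list of lexical variations
    """
    results = {}
    for main_form, forms in term_forms.items():
        # Longest variations first, so "covid 19" wins over a shorter overlapping form.
        ordered = sorted((f.lower() for f in forms), key=len, reverse=True)
        pattern = re.compile(
            r"(?<!\w)(?:" + "|".join(re.escape(f) for f in ordered) + r")(?!\w)"
        )
        frequency = 0
        distinct_docs = 0
        for text in documents:
            hits = len(pattern.findall(text.lower()))
            frequency += hits
            if hits:
                distinct_docs += 1
        results[main_form] = (frequency, distinct_docs)
    return results

# Hypothetical sample data, for illustration only.
documents = [
    "Le covid-19 progresse. La pandémie de covid 19 inquiète.",
    "Le coronavirus touche l'Europe.",
    "Les marchés financiers reculent.",
]
term_forms = {
    "covid 19": ["covid 19", "covid-19", "coronavirus"],
    "marchés financiers": ["marchés financiers"],
    "crise sanitaire": ["crise sanitaire"],  # no variation matches the corpus
}
index = index_terms(documents, term_forms)
for term, (freq, ndocs) in index.items():
    print(term, freq, ndocs)
```

In this sketch, a term whose listed forms never match the text exactly (here "crise sanitaire") ends up with a frequency and a distinct document count of 0, which is the kind of row that cleaning and enriching the list is meant to fix.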
For your question on Heterogeneous Network Mapping, you will need to set the number of nodes separately for each of the two node types. See the Field2 number of nodes parameter on the documentation page.
I hope it helps!
Thanks, Lionel!! You helped me a lot!
Just one question concerning the term extraction. I enlarged my list to 100 terms and saw that the most frequent terms differ from those I got with 50 terms. I wonder why this happens, since I use the same corpus and the same conditions (minimum frequency 3 and maximum length 5). When I changed to 500 terms, the result changed again. Take one term, En Chine: in the 50-term version it does not appear as a main form but only among the forms; in the 100-term version its frequency is 128; and in the 500-term version its frequency is 115. What causes this change?
Thanks again !!
I would say it is because the extracted list is not purely made of the most frequent terms. See more here:
- And in French: https://docs.cortext.net/question/terms-extraction/
I hope it helps
Thank you very much! That helps me a lot, and I wish you a nice Sunday!!
I tried the heterogeneous nodes setting in network mapping. After cleaning my terms, I have 41 terms left. Then I ran a network mapping with field 1 (editedterm) and field 2 (source name). On the resulting map, only around 20 terms appear, linked to the three source names. I know Cortext uses chi2 to define proximity. Is that because the other 20 terms are not close enough to the source names?
Thanks and have a nice day!
You have to fine-tune:
1/ the proximity measure (by default, the proximity measure is distributional): https://docs.cortext.net/analysis-mapping-heterogeneous-networks/mapping-edges-definition/
2/ the number of nodes for each variable (Number of nodes and Field2 number of nodes, in the advanced options)
Remember, only the linked nodes will be shown.
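To illustrate why some terms disappear from the map, here is a minimal Python sketch. It uses a simple chi2-style specificity score between sources and terms, as mentioned in the question, and keeps only the edges above a threshold. This is an assumption made for illustration and not the formula Cortext actually applies (see the edges definition page above for the real measures); the function name `specific_edges`, the threshold value and the sample records are hypothetical.

```python
from collections import Counter

def specific_edges(records, threshold=1.0):
    """records: iterable of (source, set_of_terms), one entry per document.

    Returns {(source, term): score} for pairs that co-occur more often than
    expected under independence, with a chi2-style score >= threshold.
    """
    pair_count = Counter()
    source_total = Counter()
    term_total = Counter()
    n = 0
    for source, terms in records:
        for term in terms:
            pair_count[(source, term)] += 1
            source_total[source] += 1
            term_total[term] += 1
            n += 1
    edges = {}
    for (source, term), observed in pair_count.items():
        expected = source_total[source] * term_total[term] / n
        score = (observed - expected) ** 2 / expected
        # Keep only over-represented pairs that pass the threshold.
        if observed > expected and score >= threshold:
            edges[(source, term)] = score
    return edges

# Hypothetical sample: "covid 19" is spread evenly over both sources,
# while "chine" and "vaccin" are each specific to one source.
records = [
    ("Source A", {"covid 19", "chine"}),
    ("Source A", {"covid 19", "chine"}),
    ("Source B", {"covid 19", "vaccin"}),
    ("Source B", {"covid 19", "vaccin"}),
]
edges = specific_edges(records)
linked_terms = {term for (_, term) in edges}
all_terms = {term for _, terms in records for term in terms}
print("linked:", sorted(linked_terms))
print("dropped:", sorted(all_terms - linked_terms))
```

In this toy example the term that is evenly distributed across the sources gets no edge at all, so it would not be drawn, even though it is the most frequent one. Lowering the threshold or changing the proximity measure changes which terms stay linked.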