Bobbele1 asked 4 years ago


After minimally editing my list of 700 extracted terms and re-uploading it with the corpus terms indexer, the frequencies of almost half the terms are set to 0. The same applies to the “distinct number of documents”, of course.

Conversely, the frequencies of many other words have increased sharply.

Don't know what's going on there…

Thanks in advance!

Jean-Philippe Cointet Staff answered 4 years ago

When measuring term frequency, please refer to the results delivered by the corpus term indexer, which are stored in the csv file suffixed by “expanded” that you will find in the indexed_list folder.
I understand the logic can seem a bit puzzling, but I suppose you are comparing frequencies with the extracted_terms_xxx file that is produced by the term extraction script. That file actually contains inexact statistics, computed before the final indexation performs all the steps necessary for an accurate counting of frequencies.
For instance, nested terms can end up with a very different frequency count: if “United States” and “united” are both present in your term list, the corpus term indexer will only increment “united” when it does not precede “States”.
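
To make the nested-term case concrete, here is a minimal Python sketch of longest-match counting. It is only an illustration of the principle described above, not the actual code of the corpus term indexer: the function name count_terms, the simple word tokenizer and the lowercasing are all assumptions made for the example.

```python
import re

def count_terms(text, terms):
    """Count term occurrences, letting the longest term win at each position.

    Hypothetical helper for illustration only; it is not the indexer's code.
    """
    # Assumed tokenization: lowercase words, punctuation dropped.
    tokens = re.findall(r"\w+", text.lower())
    # Map each term's token sequence back to the term as listed.
    by_tokens = {tuple(t.lower().split()): t for t in terms}
    max_len = max((len(k) for k in by_tokens), default=0)
    counts = {t: 0 for t in terms}

    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "united states" beats "united".
        for n in range(max_len, 0, -1):
            key = tuple(tokens[i:i + n])
            if len(key) == n and key in by_tokens:
                counts[by_tokens[key]] += 1
                i += n  # skip the tokens consumed by the match
                break
        else:
            i += 1  # no term starts at this token
    return counts

text = "The United States and a united Europe. United States policy."
print(count_terms(text, ["United States", "united"]))
# The word "united" appears three times, but only once outside "United States".
```

A plain word count would report “united” three times in this sentence, whereas the longest-match count credits two of those occurrences to “United States”. This is the kind of gap you can expect between the extracted_terms_xxx file and the “expanded” csv.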