cluster tagging

orianabras asked 8 years ago

Dear colleagues, the option “tag cluster experimental” allows me to tag the clusters in my maps, but I am not sure what the tags describe. Are they simply the most frequent terms in each cluster, or is there (or can there be) another computation involved? Thank you very much for your collaboration!

2 Answers
Jean-Philippe Cointet Staff answered 8 years ago

Dear Oriana,

It all depends on the “Tagging Measure” you selected. First you choose a dimension for tagging your clusters. Then, for each cluster, the entities with the highest tagging measure (among the top N, N being the number of nodes you chose in the first panel options) will show up on the map.

For instance, if you select “raw” or “tf” as the tagging measure, the most frequent items will show. With “tf.idf”, the frequency of occurrence of each item in the documents projected onto each cluster (that is, the tf measure) is weighted by the inverse document frequency of the entity, allowing rarer entities to appear higher in the ranking. Chi2, Cramer or Mutual Information will assign higher scores to entities that are specifically linked to a given cluster.

Overall, tf is convenient for showing the generic distribution in volume, while the last three measures (and, to a lesser extent, tf.idf) are informative for showing bias in the distribution of a given variable against a given partition of the corpus. However, one should be careful when using specificity measures: if N is high, they tend to exhibit very large deviations from the null model for very rare entities, and these very high specificity scores should rather be interpreted as statistical noise. Put differently: if an entity only occurs once, it will necessarily rank among the most specific entities in a given cluster.

A solution to avoid this pitfall is to first prepare a table with a limited number of entities whose frequency is large enough to avoid this kind of effect.
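
To make the difference between the measures concrete, here is a minimal Python sketch on a made-up corpus. It is not the platform's actual implementation: the data, the function names (`tag_scores`, `top`) and the `min_df` threshold are all hypothetical, the idf uses a plain `log(n_docs / df)`, and the "specificity" score is a simple observed/expected ratio (the quantity whose logarithm is pointwise mutual information) standing in for Chi2, Cramer or Mutual Information.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (cluster label, set of entities found in the document).
DOCS = [
    ("A", {"network", "graph"}),
    ("A", {"network", "graph"}),
    ("A", {"network", "community"}),
    ("A", {"network", "hapax"}),   # "hapax" occurs once in the whole corpus
    ("B", {"network", "protein"}),
    ("B", {"network", "protein"}),
    ("B", {"protein", "graph"}),
    ("B", {"network", "protein"}),
]


def tag_scores(docs, cluster, min_df=1):
    """Score every entity of `cluster` with three illustrative measures.

    min_df is an assumed pre-filter: entities present in fewer than
    min_df documents of the corpus are dropped before scoring.
    """
    n_docs = len(docs)
    n_cluster = sum(1 for label, _ in docs if label == cluster)
    # Document frequency over the whole corpus.
    df = Counter(e for _, entities in docs for e in entities)
    # Frequency inside the cluster (the "raw" / "tf" measure).
    tf = Counter(e for label, entities in docs if label == cluster for e in entities)

    scores = {}
    for entity, count in tf.items():
        if df[entity] < min_df:
            continue
        idf = math.log(n_docs / df[entity])
        # Count expected in the cluster if the entity were spread evenly.
        expected = df[entity] * n_cluster / n_docs
        scores[entity] = {
            "tf": count,
            "tfidf": count * idf,
            "specificity": count / expected,
        }
    return scores


def top(scores, measure, k=2):
    """Return the k best entities for a measure (ties broken alphabetically)."""
    return sorted(scores, key=lambda e: (-scores[e][measure], e))[:k]


if __name__ == "__main__":
    all_entities = tag_scores(DOCS, "A")
    frequent_only = tag_scores(DOCS, "A", min_df=2)
    for measure in ("tf", "tfidf", "specificity"):
        print(measure, top(all_entities, measure), top(frequent_only, measure))
```

On this toy data, tf puts the generic term "network" first for cluster A, while the two entities that occur only once in the corpus take the top specificity ranks. Filtering with `min_df=2` before scoring removes them and lets "graph" come first, which is the kind of pre-filtered table suggested above.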

orianabras answered 8 years ago

Thank you very much!