I am contacting you because I have some questions about the formulas in the “edges definition” section of the Cortext documentation.
For the chi2 formula, the one presented is not the classic chi2 (everywhere else I find a squared numerator). What I would like to know is whether the displayed formula is indeed the one that Cortext computes.
For the “distributional” formula, the documentation specifies that links with n(i, k) > 0 are retained. Does the condition apply only to this link, or also to n(j, k) > 0?
Thanking you for your help,
Johannes van der Pol
Thank you for your question.
You’re right, the equation of the so-called chi2 measure used in CorText is not perfectly conventional. As indicated in the documentation, we take the difference between the observed and expected numbers of co-occurrences of two entities i and j, and divide it by the square root of the expected number of co-occurrences.
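For reference, the two formulas can be written side by side. This is only a sketch of the notation: n_{ij} is the observed number of co-occurrences, and e_{ij} is the expected number, which I assume here to be the usual independence estimate (n_i and n_j are the document counts of i and j, and N is the total number of documents; these symbols are not taken from the documentation).

```latex
% Simplified, signed measure described above
\chi(i,j) = \frac{n_{ij} - e_{ij}}{\sqrt{e_{ij}}},
\qquad e_{ij} = \frac{n_i \, n_j}{N} \quad \text{(assumed independence estimate)}

% Classic Pearson chi-squared over the four cells c of the 2x2 contingency table
\chi^2 = \sum_{c=1}^{4} \frac{(O_c - E_c)^2}{E_c}
```

Squaring the simplified measure gives exactly the first term (the "i and j" cell) of the classic sum.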
If we were fully rigorous, we would measure the contribution of each cell of the contingency matrix, which summarizes every possible event. A document may include:
- i and j
- i without j
- j without i
- neither i nor j
Rigorously, one should test the “degree of correlation” between i and j using the full contingency table and applying the classic formula. In practice, if i and j are strongly correlated, most of the contribution to the exact chi2 test comes from the first cell of the contingency matrix (i and j). This is the reason why we use this simplified equation.
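To make this concrete, here is a minimal Python sketch comparing the squared simplified measure with the classic chi2 over the four cells. It is not the CorText implementation: the function names, the independence estimate of the expected counts, and the corpus figures are all assumptions chosen for illustration.

```python
import math


def chi_simplified(n_ij, n_i, n_j, n_docs):
    """Signed simplified measure: (observed - expected) / sqrt(expected)."""
    expected = n_i * n_j / n_docs  # assumed independence estimate
    return (n_ij - expected) / math.sqrt(expected)


def chi2_full(n_ij, n_i, n_j, n_docs):
    """Classic Pearson chi2 summed over the four cells of the 2x2 table."""
    observed = [
        n_ij,                       # i and j
        n_i - n_ij,                 # i without j
        n_j - n_ij,                 # j without i
        n_docs - n_i - n_j + n_ij,  # neither i nor j
    ]
    expected = [
        n_i * n_j / n_docs,
        n_i * (n_docs - n_j) / n_docs,
        (n_docs - n_i) * n_j / n_docs,
        (n_docs - n_i) * (n_docs - n_j) / n_docs,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical corpus: 1000 documents, i in 20, j in 30, both together in 15.
first_cell = chi_simplified(15, 20, 30, 1000) ** 2
total = chi2_full(15, 20, 30, 1000)
share = first_cell / total  # fraction of the full chi2 carried by the first cell
print(f"first cell: {first_cell:.1f}, full chi2: {total:.1f}, share: {share:.2f}")
```

With these made-up counts, the first cell carries roughly 95% of the full statistic, which is the situation described above for strongly correlated pairs.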
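Using the squared version would be inconsequential, since it does not change the ranking of edges (at least among edges whose observed count exceeds the expected one, where the measure is positive).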
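A quick way to check the ranking argument is the sketch below. The edge labels and counts are hypothetical, and all three scores are positive, which is the case where squaring preserves the order.

```python
import math


def chi_simplified(observed, expected):
    """Signed simplified measure: (observed - expected) / sqrt(expected)."""
    return (observed - expected) / math.sqrt(expected)


# Hypothetical edges mapped to (observed, expected) co-occurrence counts.
edges = {"a-b": (15, 0.6), "a-c": (8, 2.0), "b-c": (5, 4.0)}
scores = {edge: chi_simplified(o, e) for edge, (o, e) in edges.items()}

# Rank edges by the plain score and by its square, strongest first.
rank_plain = sorted(scores, key=scores.get, reverse=True)
rank_squared = sorted(scores, key=lambda edge: scores[edge] ** 2, reverse=True)
print(rank_plain, rank_squared)
```

Both orderings come out identical here, because squaring is monotone on positive values.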
I’m not sure I understood your second question. Are you referring to the indication on this page?