contingency matrix

This script visualizes the joint distribution of two fields, denoted A and B below, over the documents in your corpus. Its parameters are set in four panels that are largely similar to the options of the network mapping script. However, no network is plotted: the contingency matrix shows the degree of correlation between any pair of items A(i) and B(j) drawn from the chosen fields A and B. Red cells are the most correlated (many documents mentioning item A(i) also mention B(j)). Blue cells are anti-correlated (few documents mentioning A(i) also mention B(j)). White cells show no correlation (joint mentions of A(i) and B(j) are neither more nor less numerous than expected).


Two measures are available to highlight the discrepancy between the observed co-occurrences and those expected if the two fields were independent. First, a matrix of expected co-occurrence counts is computed under the null hypothesis that the two distributions are independent. Then either the classic oriented Chi2 measure or the deviation measure is applied. The Chi2 measure indexes the color of each cell directly to its chi2 score, that is, the square of the difference between the observed number of co-occurrences of A(i) and B(j) and the number expected under the null hypothesis, divided by that same expected number. The deviation measure maps the relative increase of observed co-occurrences of A(i) and B(j) over the expected value. A cell value of 6, for instance, means that the number of joint mentions is 600% higher than expected. A negative value, -4 for instance, means that the expected number of co-occurrences is 400% higher than the observed number.
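The two measures can be illustrated with a minimal Python sketch. It is not the script's actual code. It assumes that the expected count of a cell is the product of its row and column totals divided by the grand total (the usual independence model), and that "oriented" means the chi2 score carries the sign of the difference between observed and expected counts. The function names and the toy matrix are hypothetical.

```python
def expected_counts(observed):
    """Expected co-occurrences if fields A and B were independent (assumed model)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def oriented_chi2(obs, exp):
    """Chi2 score of one cell, signed by the direction of the deviation (assumption)."""
    sign = 1 if obs >= exp else -1
    return sign * (obs - exp) ** 2 / exp


def deviation(obs, exp):
    """Relative excess of observed over expected, or of expected over observed if negative."""
    if obs >= exp:
        return (obs - exp) / exp
    if obs == 0:
        return float("-inf")  # no joint mention at all: unbounded deficit
    return -(exp - obs) / obs


# Toy matrix: rows are items A(i), columns are items B(j).
observed = [[30, 10], [10, 30]]
expected = expected_counts(observed)

chi2 = [[oriented_chi2(o, e) for o, e in zip(o_row, e_row)]
        for o_row, e_row in zip(observed, expected)]
dev = [[deviation(o, e) for o, e in zip(o_row, e_row)]
       for o_row, e_row in zip(observed, expected)]

print(expected)  # [[20.0, 20.0], [20.0, 20.0]]
print(chi2)      # [[5.0, -5.0], [-5.0, 5.0]]
print(dev)       # [[0.5, -1.0], [-1.0, 0.5]]
```

In this toy example the diagonal cells hold 50% more joint mentions than expected (deviation 0.5), while for the off-diagonal cells the expected count is 100% higher than the observed one (deviation -1).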

Compared to the network mapping script, three additional options are offered: “Automatic block re-ordering”, “logscale” and “Evaluate whether deviations are statistically significant”. “Automatic block re-ordering” reorders the entries of each field so that adjacent rows and columns are similar, which results in a matrix that is easier to read. One can also activate the colormap logscale option (recommended) so that smaller deviations are not washed out by larger ones. Finally, “Evaluate whether deviations are statistically significant” performs a Fisher exact test on each cell to detect whether the measured deviation has a p-value above 0.05, in which case a mark (“x”) is added in the middle of the cell to indicate that the signal is spurious.
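The significance check can be sketched as follows. This is only an illustration under stated assumptions, not the script's implementation: each cell is collapsed into a 2x2 table (A(i) versus all other A items, B(j) versus all other B items), which assumes that the matrix totals are meaningful margins, and the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one. The function names are hypothetical. `scipy.stats.fisher_exact` provides the same test if SciPy is available.

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lowest = max(0, col1 - (n - row1))
    highest = min(row1, col1)
    p_obs = prob(a)
    # Sum every table at most as likely as the observed one (small tolerance for ties).
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def spurious_cells(observed, alpha=0.05):
    """True where a cell would receive an "x" (p-value above alpha)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    marks = []
    for i, row in enumerate(observed):
        marks.append([])
        for j, a in enumerate(row):
            b = row_totals[i] - a      # A(i) with other B items
            c = col_totals[j] - a      # B(j) with other A items
            d = total - a - b - c      # neither A(i) nor B(j)
            marks[-1].append(fisher_exact_p(a, b, c, d) > alpha)
    return marks


strong = spurious_cells([[30, 10], [10, 30]])  # clear association: no cell marked
flat = spurious_cells([[20, 20], [20, 20]])    # no association: every cell marked
```

With the strongly structured toy matrix no cell is flagged, whereas the uniform matrix, where observed counts equal expected counts, gets an “x” in every cell.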