Network Analysis & Layout

This page describes the options available under the Network Analysis and Layout panel of the mapping script.

Clusters Detection Method:


The heterogeneous mapping script automatically identifies locally dense groups of nodes in the network. Several definitions of these “communities of nodes”, and several algorithms to detect them, are possible. Users can choose between three popular ways to compute these meso-level structures: Louvain (Blondel et al. 2008), Infomap (Rosvall & Bergstrom 2008) and clique percolation (Palla et al. 2005). Louvain is the most popular algorithm, while Infomap may succeed in detecting finer-grained communities. The main advantage of clique percolation is that its communities can be interpreted as an algebraic property of the network, even though it tends to exclude poorly connected nodes.
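To illustrate why clique percolation tends to exclude poorly connected nodes, here is a minimal sketch of the method for cliques of size 3, in plain Python. It is not the implementation used by the mapping script: the function name, the toy edge list and the choice k = 3 are assumptions made for illustration. A community is taken to be the union of triangles that can be reached from one another through triangles sharing two nodes.

```python
from itertools import combinations

def three_clique_communities(edges):
    """Clique percolation sketch for k = 3 (illustrative, not the CorText code)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Enumerate every triangle (3-clique) exactly once.
    triangles = [
        frozenset((u, v, w))
        for u, v, w in combinations(sorted(neighbors), 3)
        if v in neighbors[u] and w in neighbors[u] and w in neighbors[v]
    ]

    # Union-find: merge triangles that share k - 1 = 2 nodes.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) == 2:
            parent[find(i)] = find(j)

    # A community is the union of the nodes of all merged triangles.
    groups = {}
    for i, triangle in enumerate(triangles):
        groups.setdefault(find(i), set()).update(triangle)
    return sorted(groups.values(), key=sorted)

# Toy network (hypothetical): two dense groups joined by the bridge d-e,
# plus a node "h" attached by a single link.
edges = [
    ("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"),
    ("d", "e"),
    ("e", "f"), ("e", "g"), ("f", "g"),
    ("g", "h"),
]
communities = three_clique_communities(edges)
print(communities)
```

In this toy network, node "h" belongs to no triangle and therefore ends up in no community, which is the behaviour described above for poorly connected nodes.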

Historical layout:

Screenshot from 2016-08-16 15:37:31

Screenshot from 2016-08-16 15:37:59

When mapping networks, two options are available to define the abscissa (x coordinate) of nodes. By default, nodes are spatialized in 2D and take positions that minimize the stress produced by the network topology (typically, two nodes attract each other when connected by a link, the force being proportional to the edge weight).

But one can also choose to fix the x-position of nodes according to their “date”. For example, cited references will be positioned according to their publication dates; the network layout is then optimized along the y-axis only. This option produces historical maps such as the one illustrated above, obtained by analyzing a corpus of publications about synthetic biology.

When no “natural” date is provided, historical maps are still possible: the “time” at which a node is positioned then corresponds to the date when its cumulative number of occurrences reaches 20% of its total frequency over the whole dataset.
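The 20% rule can be sketched as follows. This is only an illustration under stated assumptions: the function name `onset_date`, the yearly granularity and the sample data are hypothetical, and the script's actual handling of ties or time steps may differ.

```python
from collections import Counter

def onset_date(dates, share=0.2):
    """Earliest date at which the cumulative number of occurrences
    reaches `share` of the total number of occurrences."""
    counts = Counter(dates)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the node has no occurrence")
    cumulative = 0
    for date in sorted(counts):
        cumulative += counts[date]
        if cumulative >= share * total:
            return date

# Hypothetical node occurring 10 times between 2001 and 2005.
occurrences = [2001, 2003, 2003, 2004, 2004, 2004, 2005, 2005, 2005, 2005]
print(onset_date(occurrences))  # 2003: 3 of the 10 occurrences by then
```

With this reading, a node is placed where it starts to matter in the corpus, not at its very first (possibly anecdotal) appearance.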

Others and Advanced


Project records onto clusters – By default, once the cluster structure of the map has been determined, every article is matched against the composition of each cluster to assess how close its content is to that cluster. A document may then be assigned to zero, one or several clusters at once. Additionally, a new table (whose name starts with “projection_cluster” followed by the field name) is created in the database. This new table can be used as a new field for further analysis.

Assign a unique cluster to each record (best match) – This option can be activated to limit the number of clusters assigned to each document to at most 1. The cluster most similar to a given document is then selected, as long as this similarity is higher than a predetermined threshold (meaning that some documents may still remain unassigned).
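The difference between the default projection and the best-match option can be sketched as below. The similarity measure and the threshold actually used by the script are not documented here, so both are assumptions: `overlap` is a simple illustrative similarity (share of the document's terms found in the cluster), and `threshold=0.3` is an arbitrary value.

```python
def overlap(doc_terms, cluster_terms):
    """Illustrative similarity: share of the document's terms found in the cluster."""
    doc_terms = set(doc_terms)
    if not doc_terms:
        return 0.0
    return len(doc_terms & set(cluster_terms)) / len(doc_terms)

def project(doc_terms, clusters, threshold=0.3, best_match=False):
    """Return the clusters a document is assigned to (possibly none)."""
    scores = {name: overlap(doc_terms, terms) for name, terms in clusters.items()}
    matches = [name for name, score in scores.items() if score >= threshold]
    if best_match and matches:
        # Keep only the most similar cluster among those above the threshold.
        matches = [max(matches, key=lambda name: scores[name])]
    return matches

# Hypothetical clusters and documents.
clusters = {
    "gene circuits": {"promoter", "circuit", "plasmid", "toggle"},
    "metabolic engineering": {"pathway", "yeast", "enzyme", "plasmid"},
}
doc_a = ["circuit", "plasmid", "yeast", "enzyme"]
doc_b = ["ethics", "policy"]

print(project(doc_a, clusters))                   # both clusters
print(project(doc_a, clusters, best_match=True))  # only the closest one
print(project(doc_b, clusters, best_match=True))  # unassigned: []
```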

Add information from a 3rd variable – This option produces tags associated with each cluster according to a new dimension of the dataset (to be chosen). A tagging metric should also be chosen. Only the top N closest tags will appear on the final map (N being an option to be defined in the form). The tagging option is equivalent to computing a new network on top of the clusters that have been identified. Put differently, a heterogeneous network is computed between the cluster field and a second chosen field. For instance, one can compute a journal co-citation network and then tag its clusters with countries (see illustration below). Articles are first projected onto these clusters, which become a new kind of variable (field 1). A proximity network between this cluster field and the second field can then be computed, for example between semantic clusters and an institution field. Available metrics are tf (the institutions that are proportionally the most present in the cluster), raw (which is equivalent), chi2 (a chi2 proximity measure is computed between semantic clusters and institutions), cramer and mutual information (see the description of metrics for more information). The other metrics proposed are rather designed for heatmaps (see below).

Co-citation map of top 150 cited references in a synthetic biology corpus. Each Cluster has been tagged with the 3 most publishing countries.

Heatmap – Heatmaps make it possible to overlay on a given network (let’s say a co-citation network on Facebook: domains are linked when they are often shared by the same users) the distribution of an entity taken from a different field (for example the gender of the Facebook user). Both the new field and the modality of interest have to be indicated (in our case, male). The algorithm computes, for every node on the map, its specificity with respect to this modality: are men more likely to share links about 9gag.tv? Possible metrics are the same as the ones available for tagging clusters, plus chi2_dir, cramer_dir and cooc_deviation, which are more useful in this setting. Chi2_dir and cramer_dir correspond to the classic chi2 and cramer measures, except that they also allow the user to observe negative correlations, which generate a blue area in the final visualization. Cooc_deviation measures how far the number of citations of 9gag.tv by men is from what it would have been if this domain were uniformly distributed among men and women (relative to their respective numbers). If the value is positive, let’s say 2, the number of citations by men is twice what should be expected. If it is negative, let’s say -2, twice the number of citations of 9gag.tv by men would have been needed to reach the expected theoretical number. The final visualisation averages the different specificity scores measured at each point of the network to produce a heatmap.
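One reading of the deviation measure that is consistent with the two examples given above (+2 and -2) is a signed ratio between observed and expected counts. The sketch below implements that reading only; the exact formula used by the script is not given on this page, so the function names, the handling of zero counts and the numbers in the example are assumptions.

```python
def expected_count(total_citations, group_size, population_size):
    """Citations expected from a group if the node were shared uniformly,
    i.e. in proportion to the group's share of the population."""
    return total_citations * group_size / population_size

def cooc_deviation(observed, expected):
    """Signed ratio (assumed reading): +r if observed is r times the expected
    count, -r if r times more citations would have been needed to reach it."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    if observed >= expected:
        return observed / expected
    if observed == 0:
        return float("-inf")  # assumption: no citation at all
    return -expected / observed

# Hypothetical figures: a domain shared 300 times, men being 400 of 1000 users.
expected = expected_count(300, 400, 1000)  # 120.0
print(cooc_deviation(240, expected))       # 2.0: twice the expected number
print(cooc_deviation(60, expected))        # -2.0: half the expected number
```

Under this reading the score never falls strictly between -1 and 1; a value of 1 means that the observed count matches the expected one.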

Co-citation map between domains shared by the same users on Facebook.
Heatmap shows how male users concentrate on certain areas of the network (note that cluster labels have been manually added)

 

Note finally that heatmaps can be computed over time. The background map will not change and will still depend on the dynamical settings set in panel 3, but the distribution of the variable plotted on the heatmap will depend on the time range chosen (one needs to define the time periods over which successive heatmaps will be computed).


Replace circles with alpha-shapes – This option changes the final layout of the communities around nodes. Instead of circles whose sizes are proportional to the number of articles assigned to a given cluster (if this option was activated), alpha-shapes are drawn for a more “organic” outcome.

Automatic Intertemporal Threshold – This refers to the threshold value used to create inter-temporal links when constructing river networks (tubes). This threshold is computed such that the total number of bifurcations in the final river network scales with the square root of the overall number of clusters (in all time ranges). Nevertheless, one can manually tweak the parameter to only consider stronger or weaker links connecting temporally successive clusters.

Small cluster Embedding in the river network – A special procedure absorbs into the river network the smaller clusters that would otherwise tend to stay isolated. This is not part of the original algorithm described in (Rule et al. 2015), but it still gives good empirical results.

Hide orphan clusters in the phylogeny – By default, every cluster is shown, even if it is isolated in the river network. If one changes this option to yes, only dynamically connected clusters will appear in the tube layout, and disconnected clusters will be colored grey in every map.

I want my map fast – Deactivate this option if you experience memory errors or work with a very large corpus.


Minimize number of crossings in tube layout – If your river network is very large, minimizing the number of crossings may take a very long time. Turning the minimization off speeds up the computation (at the cost of a possibly less compelling visual outcome for the river network).

Size community Threshold – Only clusters whose size is above N will be considered (useful to get rid of “noisy” clusters made up of only two or three nodes).

Principal connected component only  – Nodes that do not belong to the principal connected component will not be shown (they will be considered as isolated nodes)
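As a reminder of what the principal connected component is, the sketch below finds the connected components of a small network with a breadth-first search and keeps the largest one. It is a generic illustration in plain Python, not the script's own code; the node and edge lists are hypothetical.

```python
from collections import deque

def connected_components(nodes, edges):
    """Connected components of an undirected network, found by breadth-first search."""
    neighbors = {node: set() for node in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for other in neighbors[node] - seen:
                seen.add(other)
                queue.append(other)
        components.append(component)
    return components

def principal_component(nodes, edges):
    """Largest connected component (empty set for an empty network)."""
    return max(connected_components(nodes, edges), key=len, default=set())

# Hypothetical network: a chain a-b-c-d and a separate pair e-f.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]

principal = principal_component(nodes, edges)
hidden = set(nodes) - principal  # nodes that would not be shown
print(principal, hidden)
```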

Avoid label overlap – The original network spatialization is slightly modified to make the labels more readable.

Robustness analysis – Uncertain edges will be represented by dotted lines. If activated, the final map is compared with a map computed with the same parameters but on a random sample made of half of the original database. If an edge is found in both maps, the line connecting the two nodes stays solid; otherwise it is dotted.
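The comparison step can be sketched as follows, assuming both maps are available as lists of undirected edges. The helper names `half_sample` and `edge_styles` are hypothetical, and the construction of the maps themselves (the step that turns records into edges) is deliberately left out because it depends on the chosen parameters.

```python
import random

def half_sample(records, seed=0):
    """Random sample made of half of the records (the seed is fixed for reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(records, len(records) // 2)

def edge_styles(full_edges, sample_edges):
    """'solid' if an edge of the full map is also found in the map built
    from the half sample, 'dotted' otherwise. Edges are undirected."""
    confirmed = {frozenset(edge) for edge in sample_edges}
    return {
        tuple(edge): "solid" if frozenset(edge) in confirmed else "dotted"
        for edge in full_edges
    }

# Hypothetical edge lists of the full map and of the half-sample map.
full_map = [("a", "b"), ("b", "c"), ("c", "d")]
sample_map = [("b", "a"), ("c", "d")]

styles = edge_styles(full_map, sample_map)
print(styles)
```

Edges are compared as unordered pairs, so ("a", "b") in one map matches ("b", "a") in the other.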

Select if… – If your text is in Japanese, Chinese or Korean, select this option

References

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.

Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4), 1118–1123.

Palla, G., Derényi, I., Farkas, I., & Vicsek, T. (2005). Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435, 814.

Rule, A., Cointet, J.-P., & Bearman, P. S. (2015). Lexical shifts, substantive changes, and continuity in State of the Union discourse, 1790–2014. Proceedings of the National Academy of Sciences, 112(35), 10837–10844.
