Demography

Demography processes each selected field of the corpus and counts how the raw number of occurrences of its top items evolves over time. You only need to specify the number of top items to evaluate. If you previously defined custom periods, you can optionally use them instead of the original time stamps.

The script creates two directories called “global distributions” and “temporal evolution”.

Chloroquine and hydroxychloroquine scientific publications – top 20 countries since 2001
  1. The first directory, “global distribution”, lists for each field the distribution of items per document and the distribution of documents per item. These files help you understand, for instance, the distribution of the number of authors per article or of the number of papers written per author in a scientific database (by selecting the Authors field). Note that these distributions are computed over all entries in the database, regardless of the number of top items chosen.
  2. In the “temporal evolution” directory, each field of the corpus is described over time in a csv file that compiles the occurrences of the field's top items at each time step. Counts averaged over windows of 3 or 5 time steps are also available if the raw statistics are too noisy. A dedicated web interface (see illustration) lets you visualize and customize the chart of each chosen field; open it by clicking the eye icon next to the html files.
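The two kinds of output described above can be illustrated with a minimal Python sketch. It is not the script's actual implementation: the toy corpus, the function names, the choice to count each item once per document, and the way the smoothing window is truncated at the edges of the series are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: one time stamp and one list of items per document.
docs = [
    {"year": 2001, "authors": ["A", "B"]},
    {"year": 2001, "authors": ["A"]},
    {"year": 2002, "authors": ["A", "B", "C"]},
    {"year": 2003, "authors": ["B"]},
]


def global_distributions(docs, field):
    """Return the items-per-document and documents-per-item distributions."""
    # How many documents contain exactly k distinct items.
    items_per_doc = Counter(len(set(d[field])) for d in docs)
    # Number of documents in which each item appears.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d[field]))
    # How many items appear in exactly k documents.
    docs_per_item = Counter(doc_freq.values())
    return items_per_doc, docs_per_item


def temporal_evolution(docs, field, time_key, top_items):
    """Count, per time step, the documents mentioning each top item."""
    steps = sorted({d[time_key] for d in docs})
    series = {item: [0] * len(steps) for item in top_items}
    for d in docs:
        idx = steps.index(d[time_key])
        for item in set(d[field]):
            if item in series:
                series[item][idx] += 1
    return steps, series


def moving_average(values, window):
    """Centered moving average; the window shrinks at the edges (assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


items_per_doc, docs_per_item = global_distributions(docs, "authors")
steps, series = temporal_evolution(docs, "authors", "year", ["A", "B"])
smoothed_a = moving_average(series["A"], window=3)
```

Note that `global_distributions` looks at every item in the field, whereas `temporal_evolution` only tracks the items passed as `top_items`, which mirrors the distinction made in the two list entries.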

Demography main parameters

Which variable(s)?

Choose which variable(s) to use.

Number of items to consider

Choose how many top items to consider for each selected variable. Items are ranked by their total frequency.

Demography Parameters Advanced Settings

Two advanced options are also available.

Include the cumulated count of all the other less frequent entities in the final visualization

The first option includes in the result the cumulated count of all the other entities (those less frequent than the top N). It is useful for seeing what proportion of the data the top N entities cover.
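As a rough illustration of this option, the sketch below keeps the top N entries of a frequency table, adds one extra entry holding the cumulated count of everything else, and reports the share of the data covered by the top N. The function name, the sample counts and the use of “Other Items” as the label of the extra entry are assumptions, not the tool's actual output format.

```python
from collections import Counter


def top_n_with_other(counts, n, other_label="Other Items"):
    """Keep the n most frequent entities and cumulate the rest in one entry."""
    total = sum(counts.values())
    top = counts.most_common(n)  # ranked by total frequency
    top_total = sum(count for _, count in top)
    result = dict(top)
    result[other_label] = total - top_total
    # Share of all occurrences covered by the top n entities.
    coverage = top_total / total if total else 0.0
    return result, coverage


# Hypothetical total frequencies for one field.
counts = Counter({"loans": 50, "bank": 30, "credit": 15, "rate": 5})
result, coverage = top_n_with_other(counts, n=2)
print(result)    # {'loans': 50, 'bank': 30, 'Other Items': 20}
print(coverage)  # 0.8
```

The same computation can be repeated at each time step to obtain the evolution of the cumulated entry alongside the top N series.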

Normalize raw frequency count by the length of original textual field

The second option is only useful when analyzing textual entities that were indexed from an original raw textual content. Raw counts of occurrences of entities are replaced by a score that corresponds to the raw frequency of entities divided by the length, in number of words, of the texts from which the entities were extracted (at a given time step). More precisely, the number of words is estimated from the number of characters: the score is the ratio between the term frequency and the total length of the textual fields in characters, multiplied by 6, which corresponds to the average size of words in English.
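A minimal sketch of this normalization follows. It assumes one reading of the description above, namely that the character count is divided by 6 to estimate a word count, which amounts to multiplying the frequency-to-characters ratio by 6. The function name and the sample figures are hypothetical.

```python
AVG_WORD_LENGTH = 6  # average size of English words in characters, per the text


def normalized_score(term_count, total_chars):
    """Term frequency divided by the estimated number of words at a time step."""
    if total_chars == 0:
        return 0.0  # no text at this time step (assumed convention)
    estimated_words = total_chars / AVG_WORD_LENGTH
    return term_count / estimated_words


# Hypothetical time step: 12 occurrences of a term in 6,000 characters of text,
# that is roughly 1,000 words.
score = normalized_score(12, 6000)
print(score)  # 0.012
```

With this score, a term that keeps the same raw count while the volume of text doubles sees its value halved, which makes time steps of very different sizes comparable.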

See this simple result for the most pertinent words extracted from World Bank reports as an example of the “Other Items” and “normalization” options:

Dynamical profile of the term “loans” in World Bank reports from 1946 to 2012 (data borrowed from Stanford Literary Lab)