Demography

Demography processes each field of the corpus and counts the raw occurrences of its top items over time. You only need to specify the number of top items you wish to evaluate. If you previously defined custom periods, you can optionally use them instead of the original time stamps.

The script creates two directories called “global distributions” and “temporal evolution”.


Synthetic Biology scientific publications – top 20 countries since 2001

  1. The first directory, “global distribution”, lists for each field the distribution of items per document and the distribution of documents per item. These files help you understand, for instance, the distribution of the number of authors per article or of the number of papers written per author in a scientific database (by selecting the Authors field). Note that these distributions are computed over all entries in the database, regardless of the number of top items you chose.
  2. In the “temporal evolution” directory, each field of the corpus is described by a csv file compiling the occurrences of its top items at each time step (counts averaged over windows of 3 or 5 time steps are also available if the raw statistics are too noisy). A dedicated web interface (see illustration) lets you visualize and customize the chart of each chosen field; open it by clicking the eye icon next to the html files.
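To make the two kinds of output more concrete, here is a minimal Python sketch of how such statistics could be computed. It is not the CorText implementation: the toy corpus, the function names and the way the averaging window is handled at the edges of the time range (a centered window that simply shrinks) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: one (time step, items of a field) pair per document.
docs = [
    (2001, ["France", "USA"]),
    (2001, ["USA"]),
    (2002, ["USA", "UK", "France"]),
    (2003, ["UK"]),
    (2003, ["USA", "UK"]),
]


def global_distributions(docs):
    """Return two histograms computed over ALL items (no top-N cut):
    number of documents having k items, and number of items found in k documents."""
    items_per_doc = Counter(len(set(items)) for _, items in docs)
    doc_freq = Counter(item for _, items in docs for item in set(items))
    docs_per_item = Counter(doc_freq.values())
    return items_per_doc, docs_per_item


def temporal_evolution(docs, top_n):
    """Count raw occurrences per time step for the top_n most frequent items."""
    totals = Counter(item for _, items in docs for item in items)
    top = [item for item, _ in totals.most_common(top_n)]
    steps = sorted({step for step, _ in docs})
    position = {step: i for i, step in enumerate(steps)}
    counts = {item: [0] * len(steps) for item in top}
    for step, items in docs:
        for item in items:
            if item in counts:
                counts[item][position[step]] += 1
    return steps, counts


def moving_average(series, window):
    """Centered moving average; the window shrinks at both ends (an assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


steps, counts = temporal_evolution(docs, top_n=2)
smoothed_usa = moving_average(counts["USA"], window=3)
```

In this sketch the global distributions deliberately ignore `top_n`, mirroring the note above, while the temporal counts are restricted to the top items.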

Two advanced options are also available.

The first option includes in the result the cumulated count of all the other entities (those less frequent than the top N). This option is useful to see what proportion of the data is covered by the top N entities.
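The idea behind this option can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `top_n_with_others`, the sample words and the `coverage` value it returns are hypothetical and not part of CorText.

```python
from collections import Counter


def top_n_with_others(occurrences, top_n, other_label="Other Items"):
    """Keep the top_n most frequent entities and add one extra entry
    holding the cumulated count of every remaining entity."""
    counts = Counter(occurrences)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    covered = sum(count for _, count in top)
    result = dict(top)
    result[other_label] = total - covered
    # Share of all occurrences accounted for by the top_n entities.
    coverage = covered / total if total else 0.0
    return result, coverage


# Hypothetical occurrences of indexed terms.
words = ["loans"] * 5 + ["credit"] * 3 + ["bank"] + ["debt"]
result, coverage = top_n_with_others(words, top_n=2)
```

Here the two top terms account for 8 of the 10 occurrences, and the remaining 2 are grouped under the extra entry.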

The second option is only useful when analyzing textual entities that were indexed from raw textual content. Raw counts of occurrences are replaced by a score corresponding to the raw frequency of an entity divided by the length (in number of words) of the texts from which entities were extracted at a given time step. More precisely, the length of all the textual fields is measured in number of characters, and the ratio of term frequency to that length is multiplied by 6, which corresponds to the average size of words in English.
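One possible reading of this normalization is sketched below in Python. The function name `normalized_score` and the sample texts are hypothetical, and the sketch assumes that the text length in characters is divided by 6 to estimate a number of words, which is equivalent to multiplying the frequency-to-characters ratio by 6.

```python
AVG_WORD_LENGTH = 6  # average size of English words in characters, as stated above


def normalized_score(term_count, texts):
    """Term frequency divided by the estimated number of words
    of the texts belonging to one time step."""
    total_chars = sum(len(text) for text in texts)
    if total_chars == 0:
        return 0.0  # no text at this time step: nothing to normalize against
    estimated_words = total_chars / AVG_WORD_LENGTH
    return term_count / estimated_words


# Hypothetical time step: two texts of 60 characters each, about 20 words in total.
texts_at_step = ["a" * 60, "b" * 60]
score = normalized_score(5, texts_at_step)
```

With 5 occurrences over an estimated 20 words, the score is 0.25, so time steps with very different text volumes become comparable.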

As an example of the “Other Items” and “normalization” options, see this simple result for the most pertinent words extracted from World Bank reports:

Dynamical profile of the term “loans” in World Bank reports from 1946 to 2012 (data borrowed from Stanford Literary Lab)

 
