The script creates two directories called “global distributions” and “temporal evolution”.
- The first directory, “global distributions”, lists for each field the distribution of items per document and the distribution of documents per item. These files help to understand, for instance, the distribution of the number of authors per article or of the number of papers written per author in a scientific database (by selecting the Authors field). Note that distributions are computed over all entries in the database, regardless of the number of top items selected.
- In the “temporal evolution” directory, each field of the corpus is tracked over time in a csv file that compiles the occurrences of the field's top items at each time step (counts averaged over windows of 3 or 5 time steps are also available if the raw statistics are too noisy). A dedicated web interface (see illustration) is also provided: click the eye icon next to the html files to visualize and customize the chart of each chosen field.
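To make the two kinds of output more concrete, here is a minimal Python sketch of how such statistics could be computed. It is an illustration only, not the script's actual implementation: the toy corpus, the function names, and the way the moving average handles the first and last time steps are all assumptions.

```python
from collections import Counter

# Hypothetical toy corpus: each document has a time step and a list of authors.
documents = [
    {"year": 2000, "authors": ["A", "B"]},
    {"year": 2000, "authors": ["A"]},
    {"year": 2001, "authors": ["A", "C", "D"]},
    {"year": 2002, "authors": ["B"]},
]


def global_distributions(docs, field):
    """Return the items-per-document and documents-per-item distributions.

    Both are computed over every entry of the field, not only the top items.
    """
    # How many documents contain 1, 2, 3, ... distinct items.
    items_per_doc = Counter(len(set(d[field])) for d in docs)
    # In how many documents each item appears...
    docs_per_item = Counter(item for d in docs for item in set(d[field]))
    # ...then how many items appear in 1, 2, 3, ... documents.
    return items_per_doc, Counter(docs_per_item.values())


def temporal_evolution(docs, field, time_key, top_n):
    """Count occurrences of the top_n items of a field at each time step."""
    totals = Counter(item for d in docs for item in d[field])
    top_items = [item for item, _ in totals.most_common(top_n)]
    steps = sorted({d[time_key] for d in docs})
    table = {item: [0] * len(steps) for item in top_items}
    for d in docs:
        column = steps.index(d[time_key])
        for item in d[field]:
            if item in table:
                table[item][column] += 1
    return steps, table


def moving_average(series, window):
    """Centered moving average (assumption: edges use the available neighbours)."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


items_per_doc, docs_per_item = global_distributions(documents, "authors")
steps, table = temporal_evolution(documents, "authors", "year", top_n=2)
smoothed_a = moving_average(table["A"], window=3)
```

In this toy corpus, two documents have one author, one has two and one has three. The top two authors are A and B, and the yearly counts for A can be smoothed over a 3-step window when the raw series is too noisy.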
Demography main parameters
Choose which variable(s) to use.
Number of items to consider
Choose how many top items to consider for each selected variable. Items are ranked by their total frequency.
Demography Parameters Advanced Settings
Two advanced options are also available.
Include the cumulated count of all the other less frequent entities in the final visualization
The first option includes in the result the cumulated count of all the other entities (those less frequent than the top N). This option is useful for seeing what proportion of the data is covered by the top N entities.
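As a rough illustration of this option, the sketch below keeps the top N entities and adds one extra entry holding the cumulated count of everything else. The function name, the toy counts and the use of "Other Items" as the label of the extra entry are assumptions made for the example, not the tool's actual code.

```python
from collections import Counter


def top_items_with_other(counts, top_n, include_other=True):
    """Keep the top_n most frequent entities, optionally adding an 'Other Items' total."""
    result = dict(Counter(counts).most_common(top_n))
    if include_other:
        # Cumulated count of all entities that did not make the top N.
        result["Other Items"] = sum(counts.values()) - sum(result.values())
    return result


# Hypothetical total frequencies for four entities.
counts = {"energy": 50, "water": 30, "health": 15, "trade": 5}
top = top_items_with_other(counts, top_n=2)

# Share of the data covered by the top N entities.
coverage = (top["energy"] + top["water"]) / sum(counts.values())
```

Here the top two entities account for 80 of the 100 occurrences, so the "Other Items" entry holds the remaining 20 and the coverage is 0.8.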
Normalize raw frequency count by the length of original textual field
The second option is only useful when analyzing textual entities that were indexed from original raw textual content. Raw counts of occurrences are replaced by a score: the raw frequency of an entity divided by the length, in number of words, of the texts from which entities were extracted at a given time step. More precisely, the number of words is estimated from the number of characters: the score is the ratio between the term frequency and the total length of the textual fields measured in characters, multiplied by 6, which corresponds to the average size of words in English.
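The normalization can be sketched as follows. This is one reading of the description above, in which the word count is approximated as the character count divided by 6. The function name and the example figures are hypothetical.

```python
AVG_WORD_LENGTH = 6  # average size of English words in characters, as stated in the text


def normalized_score(term_count, total_chars):
    """Term frequency divided by the estimated number of words at one time step."""
    if total_chars == 0:
        return 0.0  # assumption: no text at this time step gives a zero score
    estimated_words = total_chars / AVG_WORD_LENGTH
    return term_count / estimated_words


# Hypothetical time step: a term occurs 12 times in 6000 characters of text,
# i.e. roughly 1000 words.
score = normalized_score(term_count=12, total_chars=6000)
```

With these figures the score is 0.012, about 12 occurrences per 1000 words. This makes time steps with very different volumes of text comparable.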
See this simple result for the most pertinent words extracted from World Bank reports as an example of the “Other Items” and “normalization” options: