corpus list indexer

This script is closely tied to the list builder script. It gives users full control over a set of items that may later be mapped or analyzed. Technically, one or several new fields are created from a key defined by the user and previously uploaded csv files.

How to use the script

corpus list indexer parameters
  • Field – Select the field you wish to work on
  • Define a custom list of entities – If checked, the user can provide a csv file containing the list of items to index in the target field (concretely, only the first column of a tab-delimited csv file is considered). By default (unchecked), every entity present in the chosen field is indexed.
  • Add a dictionary of equivalent strings – If yes, provide a csv file made of pairs of equivalent strings. Entities from the first column of the csv file are automatically transformed into the entities of the second, third, fourth (etc.) columns. Remember that the default csv format is tab-delimited; use Open Office or Google Sheets if you want to edit the file in spreadsheet software. An existing entity that is not listed in the first column remains unchanged. Pay attention to the names of your columns, as the newly indexed variables are named after them.
  • Add a null label to every article with no matching tag – This labels as “null” any article whose field contains none of the entities defined in the custom list of entities file.
  • Count only one occurrence per article during indexation – This option is useful when the same entry should not be counted several times for a given document. For instance, to compute the distribution of articles published by the USA in a scientific database, it may be useful to reindex the Country field first with this option, so that articles written by at least one American author are counted only once. By default, if several scientists with different US affiliations publish a paper, the article is indexed with several occurrences of USA in the raw database.
  • Finally, one can give a custom name to the newly generated field.
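
To make the interplay of these options concrete, here is a minimal Python sketch of the logic they describe. It is not the script's actual implementation: the function names (read_tsv_rows, index_field), the sample data, and the order in which the options are applied (equivalences first, then the custom list, then de-duplication, then the null label) are all assumptions made for illustration. For simplicity, only the second column of the dictionary file is used.

```python
import csv
import io


def read_tsv_rows(text):
    """Parse tab-delimited text into a list of non-empty rows."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def index_field(documents, custom_list=None, equivalents=None,
                add_null=False, count_once=False):
    """Return the re-indexed entity list for each article (hypothetical helper).

    documents   : list of lists of raw entity strings, one list per article
    custom_list : optional set of entities to keep; None keeps every entity
    equivalents : optional dict mapping a raw string to its replacement
    add_null    : label articles with no matching entity as "null"
    count_once  : keep at most one occurrence of each entity per article
    """
    equivalents = equivalents or {}
    indexed = []
    for entities in documents:
        # Strings absent from the dictionary remain unchanged.
        mapped = [equivalents.get(e, e) for e in entities]
        if custom_list is not None:
            mapped = [e for e in mapped if e in custom_list]
        if count_once:
            mapped = list(dict.fromkeys(mapped))  # de-duplicate, keep order
        if add_null and not mapped:
            mapped = ["null"]
        indexed.append(mapped)
    return indexed


# Made-up dictionary file: first row holds the column names.
dictionary_tsv = "Country\tCountry_clean\nU.S.A.\tUSA\nUnited States\tUSA\n"
rows = read_tsv_rows(dictionary_tsv)
header, pairs = rows[0], rows[1:]
equivalents = {row[0]: row[1] for row in pairs}

custom_list = {"USA", "France"}
documents = [["U.S.A.", "United States", "France"], ["Germany"], ["USA", "USA"]]

result = index_field(documents, custom_list, equivalents,
                     add_null=True, count_once=True)
print(result)  # [['USA', 'France'], ['null'], ['USA']]
```

In this sketch the first article ends up with a single USA entry despite two US affiliations, and the second article, which matches nothing in the custom list, receives the null label.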

If you are replacing a temporal field, the table (or, equivalently, the csv column name) must be called “ISIpubdate”.

See the video below for a demo of how to add time step information to a dataset imported from raw text files: