Data Parsing

What does the parsing step do?

“Data parser” is a generic parsing script that handles a wide range of data formats: isi files (as downloaded from the Web of Science), Factiva datasets, Pubmed, RIS files, batches of simple text files, or any file in csv format (please use the “robust csv” parser in that case, see below). It is also possible to parse xls files from Excel or LibreOffice (.xls, not .xlsx). A Europress parser is also available, along with other database-specific parsers.

Read this page for more details about the correct formatting of data coming from different sources.

Upload the corpus and choose the corpus format

Whatever the original source of the corpus, it should first be archived and uploaded as a zip file (simply zip the set of original files into a single zip file). Click on the “Upload file” button and drag and drop the zipped file. See this section for more details about the upload process.
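The archiving step can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `zip_corpus` and the choice to pack only the files at the top level of the folder are assumptions made for illustration, not something CorText Manager prescribes.

```python
import zipfile
from pathlib import Path


def zip_corpus(source_dir, archive_path):
    """Pack every file at the top level of source_dir into one zip archive.

    Returns the list of file names stored in the archive.
    Sub-folders are ignored (an assumption of this sketch).
    """
    source = Path(source_dir)
    archive = Path(archive_path).resolve()
    names = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.iterdir()):
            # Skip directories, and the archive itself if it sits in source_dir.
            if path.is_file() and path.resolve() != archive:
                zf.write(path, arcname=path.name)
                names.append(path.name)
    return names
```

For instance, `zip_corpus("my_corpus", "my_corpus.zip")` (both names hypothetical) would produce a single archive ready for the “Upload file” button.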

When the upload progress bar completes and the upload succeeds (“File uploaded” will appear), click the green box and you will be redirected to the parsing options page.

At any time, you can also access the parsing script the usual way: start script > Corpus > Data parsing

dataset

For zipped files from the sources listed in the upload corpus page.

cortext db

You may refine the data of the sqlite database outside CorText Manager. If the updated sqlite database still follows the CorText Manager data structure, you can upload it back (again, after zipping it) by simply choosing the cortext db option in the parsing script.

Building the variables and the database

As output, the data parser produces a sqlite database (with a .db suffix). All the datasets from your parsed corpora are listed on the web page of your project.

You can have several datasets in one project. When producing results, most CorText Manager scripts add variables or update information directly in the database.

At any time, the sqlite database can be downloaded as a single file, which can easily be read outside CorText Manager if needed (with the firefox/chrome plugin sqlite manager, sqlite studio, or directly in R and python…). To upload your corpus back into CorText Manager, use the cortext db option.
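As an illustration of reading the downloaded file directly in python, here is a minimal sketch based on the standard `sqlite3` module. The function name `list_tables` and the file name used below are hypothetical; the sketch makes no assumption about which tables a given corpus contains, it simply lists whatever is in the file.

```python
import sqlite3


def list_tables(db_path):
    """Return {table name: row count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Table names come from the file itself; quote them defensively.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()
```

Calling `list_tables("my_corpus.db")` on a downloaded database gives a quick overview of the variables it holds before any further processing.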

CorText Manager data storage philosophy

During the parsing of the corpus, CorText Manager automatically assigns an ID to each document (e.g. scientific articles, newspaper articles, tweets).

If a piece of parsed information has more than one value within a document (e.g. author addresses for a scientific paper), the RANK increases with each value.

For some variables (mainly from the Ready-made sources), CorText Manager pre-processes the data. This is the case for author addresses or Cited References, where strings are split on commas. In those cases, the PARSERANK increases with each piece.

In the last step, the DATA column stores the pieces of extracted information. This field will play a core role in the analyses run with the scripts.

Example of three parsed author addresses in two documents

All variables extracted from your corpus, and most of the results produced by scripts, are built this way. This enables you to run nearly all scripts on all variables (limited only by the data type) and to build heterogeneous analyses in a very flexible manner.
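The ID / RANK / PARSERANK / DATA layout described above can be sketched in a few lines of python. This is only an illustration under stated assumptions: the addresses are invented sample values, the counters are assumed to start at 0, and the function name `to_rows` is hypothetical. It is not CorText Manager's actual parsing code.

```python
def to_rows(documents):
    """Flatten a list of documents (each a list of address strings)
    into (id, rank, parserank, data) tuples."""
    rows = []
    for doc_id, addresses in enumerate(documents):
        # RANK: one step per value of the field inside a document.
        for rank, address in enumerate(addresses):
            # PARSERANK: one step per piece after splitting on commas.
            parts = [part.strip() for part in address.split(",")]
            for parserank, part in enumerate(parts):
                rows.append((doc_id, rank, parserank, part))
    return rows


# Invented sample: two documents holding three addresses in total.
sample = [
    ["Univ A, Paris, France", "Univ B, Lyon, France"],
    ["Univ C, Berlin, Germany"],
]

for row in to_rows(sample):
    print(row)
```

The first document yields rows with RANK 0 and 1 (two addresses), the second only RANK 0, and within each address the PARSERANK counts the comma-separated pieces.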

So, four types of variables are accessible in CorText Manager:

  • raw variables: directly parsed from the data source, without any change;
  • pre-processed variables: information extracted, divided and cleaned, from the raw variables;
  • calculated variables: built from results after running CorText Manager scripts;
  • information added by users: variables added by users (e.g. edited lists of terms, external dictionaries, added metadata for csv or txt).

As mentioned above, some field names are managed automatically, and some data from those fields are pre-processed by CorText Manager for Ready-made data sources. Simplified names and useful variables will be accessible in your analytical scripts. This strategy is pushed further for the ISI source: for scripts like heterogeneous mapping or lexical extraction, a simplified and comprehensible ISI field list is provided in a drop-down menu or as checkboxes. If you are using pubmed, factiva or your own custom csv files, you should find the original field names for most of the variables in the drop-down menus.

Variable names to avoid

Whatever entry format you are using, please avoid giving tables the following names, which are protected: Terms, ISITerms.

Using csv: robust csv

Lastly, if your original corpus is made of csv files (remember to first archive them as a zip file), you can indicate which field corresponds to the record timestamps (this timestamp may correspond to years, months, weeks, or any time unit of your choice; the important point is that timestamps are coded as integer values in the source file). Please provide csv file(s) with column names on the first line (with no special characters). These names will later be used as table names in the sqlite database and to launch analyses on the desired fields. Different options regarding csv file formatting are also offered. To stay on the safe side, it is nevertheless better to comply with these default formatting options:

  • Columns should be tab-separated (TSV)
  • Don’t quote textual fields
  • Use utf-8 as final encoding.

LibreOffice offers a convenient interface to make sure your dataset is in the desired format. Google Sheets is also convenient for that purpose (simply choose the .tsv format when exporting).
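If the data is produced by a script rather than a spreadsheet, the same default options (tab-separated columns, no quoting, utf-8) can be met with plain python. The sketch below is one possible approach; the helper names `clean_cell` and `write_tsv` are hypothetical, and collapsing whitespace inside cells is a choice made here so that unquoted fields cannot break the column layout.

```python
def clean_cell(value):
    """Make a value safe for an unquoted, tab-separated cell."""
    text = "" if value is None else str(value)
    # Collapse every run of whitespace (tabs and line breaks included)
    # into a single space, since fields are not quoted.
    return " ".join(text.split())


def write_tsv(path, header, rows):
    """Write a header line and data rows as utf-8, tab-separated text."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("\t".join(header) + "\n")
        for row in rows:
            handle.write("\t".join(clean_cell(value) for value in row) + "\n")
```

For example, `write_tsv("corpus.tsv", ["title", "year"], [("A first title", 2020)])` writes a two-line file (file and column names are illustrative) that can then be zipped and uploaded.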

Each field of your original files is recorded in its own table.
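Because column names become table names, it can help to check them before uploading. The following sketch flags names that contain special characters or that collide with the protected names Terms and ISITerms. Two points are assumptions of this sketch rather than documented rules: “special characters” is read as anything other than ASCII letters, digits and underscores, and names are compared case-insensitively. The function name `check_column_names` is hypothetical.

```python
import re

# Protected table names listed in this page, lowercased for comparison.
PROTECTED = {"terms", "isiterms"}

# Assumed definition of a "safe" name: ASCII letters, digits, underscores,
# not starting with a digit.
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def check_column_names(header):
    """Return a list of human-readable problems found in the header."""
    problems = []
    seen = set()
    for name in header:
        key = name.lower()
        if not SAFE_NAME.fullmatch(name):
            problems.append(f"{name!r}: contains special characters")
        elif key in PROTECTED:
            problems.append(f"{name!r}: protected name")
        elif key in seen:
            problems.append(f"{name!r}: duplicate column name")
        seen.add(key)
    return problems
```

An empty list means the header passed these checks; any message points to a column worth renaming in the source file.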

Time Field

Name of the column which contains the date. After the parsing step, this column will be accessible in CorText Manager under the name ISIpudate (or year).

Leave empty if you do not have any time variable.

Date Format: by default, the time entry should be formatted as an integer; activate this option if you have dates

Dates should be formatted as “Year-Month-Day”, for instance “2020-03-24”
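When the source dates use another layout, they can be converted before upload. This is a minimal sketch with the standard `datetime` module; the day/month/year input layout and the function names `to_cortext_date` and `to_year` are assumptions chosen for illustration.

```python
from datetime import datetime


def to_cortext_date(raw, source_format="%d/%m/%Y"):
    """Convert a date string to the Year-Month-Day layout."""
    parsed = datetime.strptime(raw.strip(), source_format)
    return parsed.strftime("%Y-%m-%d")


def to_year(raw, source_format="%d/%m/%Y"):
    """Alternative: keep only the year, as an integer time unit."""
    return datetime.strptime(raw.strip(), source_format).year


print(to_cortext_date("24/03/2020"))  # prints 2020-03-24
print(to_year("24/03/2020"))          # prints 2020
```

The first helper suits the Date Format option, the second the default integer timestamps.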

Step-by-step tutorial on how to prepare a csv file

The following video illustrates step-by-step how to prepare a readable csv file for CorText: