Data Parsing


“Data parser” is a generic parsing script that handles a wide range of data formats:

  • ISI files (as downloaded from the Web of Science)
  • Factiva datasets (in Factiva, select “Format for saving”, then right-click and save as HTML; select Factiva during parsing)
  • PubMed (in the XML export format)
  • RIS files, as provided by scientific platforms such as Google Scholar or Scirus
  • batches of simple text files
  • any file in csv format (please use the newly released “robust csv” parser in that case)
  • xls files from Excel or Open Office (.xls, not .xlsx!)

If you have several xls files, their data will be aggregated into the same database as long as they share the same first row. A Europress parser is also available, along with other database-specific parsers. Please read this page for more details about the correct formatting of data coming from different sources.

Whatever the original source of the corpus, it must first be archived and uploaded as a zip file (simply zip the set of original files into one single zip file).
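As an illustration of the archiving step, here is a minimal Python sketch that packs a folder of source files into one zip file. The function name `zip_corpus` and the folder layout are hypothetical, and keeping sub-folder paths inside the archive is an assumption; zipping the files with any desktop tool works just as well.

```python
import zipfile
from pathlib import Path


def zip_corpus(source_dir, archive_path):
    """Pack every regular file under source_dir into a single zip archive.

    Returns the list of names stored in the archive.
    """
    source = Path(source_dir)
    archive = Path(archive_path).resolve()
    names = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            # Skip directories, and skip the archive itself in case it is
            # being written inside the folder that is being zipped.
            if path.is_file() and path.resolve() != archive:
                arcname = path.relative_to(source).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names
```

For example, `zip_corpus("my_corpus", "my_corpus.zip")` would produce a single archive ready for upload, where both names are placeholders.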

As output, the data parser produces a sqlite database (a file with the .db extension). The database can be downloaded as a single file, which can easily be read with the Firefox plugin SQLite Manager if needed.

If your original corpus is made of csv files (remember to first archive them as a zip file), you can indicate which field corresponds to the record timestamps. This timestamp may correspond to years, months, weeks, or any time unit of your choice; the important point is that timestamps are coded as integer values in the source file. Please provide csv file(s) with column names as the first line (with no special characters). These names will later be used as table names in the sqlite database and will be used to launch analyses on the desired fields.

Different options regarding csv file formatting are also proposed. To stay on the safe side, it is nevertheless better to comply with these default formatting options:

  • Columns should be tab-separated (TSV).
  • Don’t quote textual fields.
  • Use utf-8 as the final encoding.

Open Office offers a convenient interface to make sure your dataset is in the desired format. Google Spreadsheet is also convenient for that purpose (simply choose the .tsv format when exporting).
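If you prefer to script the conversion, the default options above can be sketched in Python as follows. This is only an illustration, not part of CorText: the function names are hypothetical, the input is assumed to be a comma-separated file, and "no special characters" is interpreted here as ASCII letters, digits and underscores only.

```python
import csv
import re


def clean_cell(value):
    # Collapse tabs, line breaks and repeated spaces so that a cell stays
    # on one line and cannot be mistaken for a column separator.
    return " ".join(value.split())


def csv_to_tsv(src_path, dst_path, time_column=None, src_encoding="utf-8"):
    """Convert a comma-separated file to an unquoted, tab-separated UTF-8 file.

    Returns the number of data rows written. Raises ValueError when a column
    name contains special characters or a timestamp is not an integer.
    """
    with open(src_path, newline="", encoding=src_encoding) as src:
        rows = [row for row in csv.reader(src) if row]  # drop blank lines
    if not rows:
        raise ValueError("empty input file")

    header = [clean_cell(name) for name in rows[0]]
    for name in header:
        # Assumption: only letters, digits and underscores are "safe".
        if not re.fullmatch(r"[A-Za-z0-9_]+", name):
            raise ValueError(f"column name with special characters: {name!r}")
    time_index = header.index(time_column) if time_column else None

    written = 0
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        dst.write("\t".join(header) + "\n")
        for row in rows[1:]:
            cells = [clean_cell(cell) for cell in row]
            if time_index is not None:
                int(cells[time_index])  # raises ValueError if not an integer
            dst.write("\t".join(cells) + "\n")
            written += 1
    return written
```

A call such as `csv_to_tsv("records.csv", "records.tsv", time_column="year")` (placeholder file and column names) writes the header line first, then one unquoted, tab-separated line per record.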

Each field of your original files is recorded in its own table. Whatever input format you are using, please avoid giving tables the following names, which are protected: Terms, ISITerms. In each table, the index field indicates the id of the article and the data field holds the parsed content. Field names are automatically managed for ISI files in the remaining analytical scripts: for scripts like heterogeneous mapping or lexical extraction, a simplified and comprehensible ISI field list is provided in a drop-down menu or as checkboxes. If you are using PubMed, Factiva or your own custom csv files, you should find the original field names in the drop-down menus.
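To check how the fields of a corpus ended up in the downloaded database, a short Python sketch using the standard sqlite3 module can list each table with its columns. The helper names are hypothetical, and the sketch makes no assumption about the exact columns the parser creates: it simply reports whatever is in the file. The second helper flags column names that collide with the protected table names; comparing them case-insensitively is a cautious assumption.

```python
import sqlite3

# Protected table names listed in the documentation, lower-cased.
PROTECTED = {"terms", "isiterms"}


def describe_database(db_path):
    """Return a dict mapping each table name to its list of column names."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        layout = {}
        for table in tables:
            # Quote the identifier so unusual table names are handled safely.
            quoted = '"' + table.replace('"', '""') + '"'
            columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
            layout[table] = [col[1] for col in columns]  # col[1] is the name
        return layout
    finally:
        conn.close()


def protected_names(column_names):
    """Return the column names that clash with a protected table name."""
    return [name for name in column_names if name.strip().lower() in PROTECTED]
```

Running `describe_database("my_corpus.db")` on a downloaded file (placeholder name) gives a quick overview of the available fields before launching an analysis, and `protected_names` can be applied to the header line of a csv file before uploading it.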

The following video illustrates step-by-step how to prepare a readable csv file for CorText:
