Upload Corpus

To use the data parser, you first need to upload a corpus as a zipped file containing all the raw files that make up your corpus (e.g. a set of ISI output files, csv files, etc.). Once you have chosen your original dataset, you must select its type from the list of available formats (csv, factiva, pubmed, isi). You will also be asked whether you wish to index your dataset. Indexing makes processing take longer, but it allows you to perform queries and build sub-corpora (using query corpus).
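As an illustration of the zipping step, here is a minimal sketch using Python's standard zipfile module. The function name zip_corpus and the directory layout are assumptions made for this example; any zip tool produces an equivalent archive.

```python
import zipfile
from pathlib import Path


def zip_corpus(source_dir, archive_path):
    """Pack every regular file found directly in source_dir into one zip archive.

    Hypothetical helper: the name and arguments are illustrative only.
    Returns the sorted list of file names that were stored.
    """
    source = Path(source_dir)
    # The file list is built before the archive is created, so a new archive
    # written into the same folder is not packed into itself.
    files = sorted(p for p in source.iterdir() if p.is_file())
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # arcname keeps the archive flat: only the file name is stored.
            archive.write(path, arcname=path.name)
    return [p.name for p in files]
```

For instance, zip_corpus("my_downloads", "corpus.zip") would pack the raw export files of a (hypothetical) my_downloads folder into corpus.zip, ready to upload.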

If your data does not come from an existing platform, please avoid using “Terms” and “ISITerms” as variable names.

Ready-made sources

Web of Science

For files downloaded from ISI Web of Knowledge, first select the Web of Science database and type your query. Then select 500 records for each download (step 1), choose Full Records plus Cited References as the output format (step 2), and download the file in Plain Text Format (step 3).

When you have finished downloading all the articles, simply zip the downloaded files and upload.

Scopus

It is also possible to import data from the Scopus platform. Simply zip batches of data downloaded in RIS format, and choose RIS (scopus) as the Corpus Format in the parsing step. Compared to the basic RIS format, RIS (scopus) enables CorText Manager to run more complex pre-processing steps that build useful variables (publication dates, author addresses…) ready to be used in your analyses.

Factiva

Factiva exports work the same way: select the first 100 articles (please restrict your selection to publications only!), click on the save icon (represented as a floppy disk, for those old enough to remember what that looks like), and then save the source file using your browser (File menu, then Save). When prompted, select the “Full Article format” option. Repeat and save as many files as necessary until you have gathered every article. Zip your files before uploading them to the Manager. The following fields will be available to analyse textual data: HeadLine (title of the article), LeadParagraph and Text (the rest of the content). Finally, article_fulltext simply concatenates the three parts into a single variable.

Please verify that the file contains tags as in this example. If not, you should change the display format to “Full article and Indexation” using the option panel on the right side of the screen (see illustration below).

Choose English or French as the default language of the Factiva web interface (option panel, top right section).

Europress

Use the classic version to export batches of 1000 articles as an HTML file.

In “advanced search”, choose your sources.

Build your query and restrict your selection to press only, for example national or regional press. This means that you must uncheck TV, radio, social media…

Then scroll down the article list to select all the articles that belong to one batch. Press save, and choose the HTML format.

Twitter json files

A specific parser is provided for data collected through the Twitter search API (using tweet_mode=’extended’). You can define a time granularity ranging from the year to the second. Only the most pertinent information from the API is kept. Besides the original fields (like lang, user_name, or entities_hashtags), some additional fields are built: htgs, urls, usmesymbol. They simply correspond to the hashtags, urls, user mentions, and symbols present in a tweet. If a tweet is a retweet, information about the original version is also added.
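To make the extra fields more concrete, here is a rough sketch of how hashtags, urls, user mentions and symbols can be pulled out of a single tweet. It is not the CorText parser: the function name extract_entities and the output keys are hypothetical, and the input keys (entities, hashtags, expanded_url, screen_name, retweeted_status) are assumed to follow the standard tweet JSON object returned by the Twitter search API.

```python
def extract_entities(tweet):
    """Collect hashtags, URLs, user mentions and symbols from one tweet dict.

    Illustrative only: key names assume the standard tweet JSON object.
    """
    # Assumption: for a retweet, the embedded original tweet carries the
    # complete entities, while the retweet's own copy may be truncated.
    source = tweet.get("retweeted_status", tweet)
    entities = source.get("entities", {})
    return {
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "user_mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        "is_retweet": "retweeted_status" in tweet,
    }
```

Reading from the embedded original when one is present mirrors the idea that information about the original version of a retweet is added to the record.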

Lexis Nexis files

Lexis Nexis data should be downloaded as simple text files. Simply zip the batches of 500 records you need to start your analysis. For the moment, time information is automatically transformed into months since January 1990.

Generic formats

csv files

You can also create your own dataset from simple csv files. The csv files should have the following structure: the first line gives the field name of each column, and each subsequent line corresponds to one document. If some of your fields hold multiple items (e.g. authors), simply separate them with three successive stars: ***. Tab-delimited output with no quotes is the advised formatting option when exporting your csv. See this example as a typical csv file (source VDN). Finally, don’t forget to zip your csv file(s) for upload and to use the robust csv option when selecting the corpus type.
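The structure described above can be produced with a few lines of Python. The following is a sketch under stated assumptions: the helper name write_corpus_csv and the field names used in the usage note are hypothetical, and collapsing whitespace inside values is a choice made here so that a quote-free, tab-delimited row cannot be broken by a stray tab or newline.

```python
import csv

MULTI_SEP = "***"  # separator for fields holding several items


def write_corpus_csv(path, fieldnames, documents):
    """Write one header line, then one tab-delimited, quote-free line per document.

    Hypothetical helper: documents is a list of dicts keyed by field name.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(
            handle,
            delimiter="\t",
            quoting=csv.QUOTE_NONE,  # no quotes around values
            quotechar=None,          # treat quote characters as ordinary text
            escapechar="\\",
        )
        writer.writerow(fieldnames)
        for doc in documents:
            row = []
            for name in fieldnames:
                value = doc.get(name, "")
                if isinstance(value, (list, tuple)):
                    # Multiple items (e.g. authors) are joined with ***.
                    value = MULTI_SEP.join(str(item) for item in value)
                # Collapse tabs and newlines so each document stays on one line.
                row.append(" ".join(str(value).split()))
            writer.writerow(row)
```

Calling write_corpus_csv("corpus.csv", ["title", "authors"], [{"title": "A study", "authors": ["Ada", "Bob"]}]) would write a header line followed by one line whose authors field reads Ada***Bob.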

RIS

RIS (Research Information Systems) is a standardized tag format designed to store bibliographic data. If you are using citation management applications such as RefWorks, Zotero, Papers, Mendeley or EndNote, you can easily export in RIS format. Google Scholar (RefMan for .ris), Scirus, IEEE Xplore, the ACM Portal, ScienceDirect, SpringerLink and others also offer this option to export selected articles. If your corpus comes from Scopus, please use the option made specifically for this source: RIS (Scopus).

xls

Directly upload xls files (which you should first archive in a single zip file), structuring your data such that each line corresponds to one document. Column titles will be used as variable names. Optionally, you can type the name of the column that contains the time information (which should be formatted as integers). If you have several xls files, data will be aggregated in the same database as long as they share the same first row. When exporting a spreadsheet from Microsoft Excel or Google Sheets, use the .xls format (the CorText Manager xls parser won’t work with .xlsx).

text files (.txt, .docx, .pdf)

Text files should be uploaded to the manager un-merged, meaning that each document should correspond to a unique file. Two fields will be created during parsing: text and filename. It is then straightforward to use corpus_list_indexer to add more metadata (like author, date, etc.) with the filename as a pivot key.
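As an illustration of the filename acting as a pivot key, the sketch below derives one metadata record per file from the file name itself. Everything in it is an assumption made for the example: the naming convention author_year_title.ext, the function name filename_metadata, and the output keys. The input layout actually expected by corpus_list_indexer is not described here, so this only prepares records that keep filename as the joining key.

```python
from pathlib import Path


def filename_metadata(filenames):
    """Derive metadata records from names shaped like author_year_title.ext.

    Hypothetical convention, for illustration only. Names that do not
    match the convention are skipped.
    """
    rows = []
    for name in filenames:
        stem = Path(name).stem          # drop the extension
        parts = stem.split("_", 2)      # at most three pieces
        if len(parts) != 3:
            continue
        author, year, title = parts
        rows.append({"filename": name, "author": author, "date": year, "title": title})
    return rows
```

Each record keeps the full file name untouched, so it can be matched against the filename field created during parsing.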

json files

It is also possible to import simply formatted json files (with no embedded dictionary). You will be asked whether there is a time entry, which must either be formatted as integers or follow the classic time format Y-m-d (for instance 2001-12-23 for December 23rd, 2001). If the time information is followed by T-h:m:s, that part is automatically ignored. You can then choose to transform the time information into a number of years, months or days since January 1st of any starting year you define.
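The time transformation can be pictured with the following sketch. It is not the CorText implementation: the function names are hypothetical, and counting from zero (so that January of the starting year is month 0) is an assumption of this example.

```python
from datetime import date


def parse_day(stamp):
    """Parse a 'Y-m-d' string, ignoring any trailing 'T...' time-of-day part."""
    return date.fromisoformat(stamp.split("T", 1)[0])


def elapsed_since(stamp, start_year, unit="months"):
    """Number of years, months or days between January 1st of start_year and stamp.

    Illustrative only; zero-based counting is an assumption.
    """
    day = parse_day(stamp)
    origin = date(start_year, 1, 1)
    if unit == "years":
        return day.year - origin.year
    if unit == "months":
        return (day.year - origin.year) * 12 + (day.month - 1)
    if unit == "days":
        return (day - origin).days
    raise ValueError("unit must be 'years', 'months' or 'days'")
```

With these definitions, elapsed_since("2001-12-23", 1990, "months") gives 143, and a stamp such as "2001-12-23T10:15:00" yields the same result because the part after T is dropped.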

Upload process

Once you have zipped your files, you can drag and drop the zip file.

Wait until the upload progress bar is complete, and click the section when it turns green.

It will automatically take you to the parsing script, where you will be asked to specify the nature of the data:

A file or collection of files as explained above (dataset), or a previously downloaded CorText dataset (cortext db).

If your data is a dataset, remember to also choose its source/format.

Data format options when parsing datasets

Once you have produced and uploaded your dataset, the next natural step is to parse it.

Deprecated source

Pubmed

Until a new release of the CorTexT Manager parsing script makes it possible to work with .nbib files, you can use RIS for Pubmed (a few variables will be inaccessible).

The process is even simpler with Pubmed Medline, as you can download every article at once. Simply select send to file and choose xml format as output to download your corpus. Zip your file before upload.
