Upload Corpus

To use the data parser, you first need to “upload a corpus” as a zipped file containing all the raw files forming your corpus (e.g. a set of ISI output files, csv files, etc.). Once you have chosen your original dataset, you must select its type from the list of available formats (csv, factiva, pubmed, isi). You will also be asked whether you wish to index your dataset. If so, the processing time will be longer, but it will allow you to perform queries and build sub-corpora (using query corpus).

If your data does not come from an existing platform, please avoid using “Terms” and “ISITerms” as variable names.

Web of Science

Concerning ISI files downloaded from ISI Web of Knowledge, first select the Web of Science database and type your query. You can then select 500 records for each download (step 1), select Full Record plus Cited References as the output format (step 2), and download the file in Plain Text format (step 3).

[Screenshot: Web of Science download settings]

When you have finished downloading all the articles, simply zip the downloaded files and upload the archive.

PubMed

The process is even simpler with PubMed Medline, as you can download every article at once. Simply select “Send to file” and choose the XML format as output to download your corpus. Zip your file before uploading it.

[Screenshot: PubMed “Send to file” export in XML format]

Scopus:

It is also possible to import data coming from the Scopus platform. Simply zip the batches of data downloaded in RIS format.

Factiva:

Factiva exports work the same way: select the first 100 articles, click on the save icon (represented as a floppy disk, for those old enough to remember what that looks like) and then save the source file using your browser (File menu, then Save). Iterate and save as many files as necessary until you have gathered every article. Zip your files before uploading them to the Manager. The following fields will be available to analyse textual data: HeadLine (the title of the article), LeadParagraph, and Text (the rest of the content). Finally, article_fulltext simply concatenates the three parts into a single variable.

[Screenshot: Factiva save/export options]

Please verify that the file is in a viable HTML format (some academic subscriptions only give you access to a limited HTML version), and beware that parsing does not work with an RTF export.

csv files:

You are also given the possibility to create your own dataset with simple csv files. To build your own dataset, csv files should have the following structure: the first line should contain the field names (one per column), and each subsequent line corresponds to one document. If some of your fields host multiple items (e.g. authors), simply separate them with 3 successive stars: ***. Tabulations with no quotes are the advised formatting options to export your csv. See this example of a typical csv file (source VDN). Finally, don’t forget to zip your csv file(s) for upload and to use the robust csv option when selecting the corpus type.
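For instance, here is a minimal sketch of how such a file could be produced with Python (the field names author, year, title and text are hypothetical; only the tab separator, the absence of quotes and the *** separator come from the description above):

```python
import csv

# Hypothetical documents; multi-valued fields (here: author) are joined with "***"
documents = [
    {"author": "Doe, J.***Smith, A.", "year": "2015",
     "title": "First document", "text": "Full text of the first document."},
    {"author": "Martin, P.", "year": "2016",
     "title": "Second document", "text": "Full text of the second document."},
]

with open("my_corpus.csv", "w", newline="", encoding="utf-8") as f:
    # Tab-separated, no quoting, as advised above; the first line holds the field names
    writer = csv.DictWriter(f, fieldnames=["author", "year", "title", "text"],
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    writer.writeheader()
    writer.writerows(documents)
```

The resulting my_corpus.csv then only needs to be zipped before upload.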

xls:

Directly upload xls files (which you should first archive in a single zip file), structuring your data so that each line corresponds to one document. Column titles will be used as variable names. Optionally, you can type the name of the column which includes the time information (which should be formatted as integers).
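As a sketch of the expected layout (assuming pandas with an Excel writer such as openpyxl is installed; the column names are hypothetical, and the integer year column stands for the optional time information):

```python
import pandas as pd  # assumes pandas + an Excel writer (e.g. openpyxl) are available

# One row per document; the column titles become the variable names
df = pd.DataFrame([
    {"title": "First document", "text": "Full text of the first document.", "year": 2015},
    {"title": "Second document", "text": "Full text of the second document.", "year": 2016},
])

# "year" holds the time information as integers, as mentioned above
df.to_excel("my_corpus.xlsx", index=False)
```

Remember that the resulting file still has to be archived in a zip file before upload.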

text files (.txt, .docx, .pdf):

Text files should be uploaded to the Manager un-merged, meaning that each document should correspond to a unique file. Two fields will be created during parsing: text and filename. It is then straightforward to use the corpus_list_indexer capacities to add more metadata (like author, date, etc.) with the filename as a pivot key.
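For example, a minimal sketch (the folder and archive names are placeholders) that puts one .txt file per document into a zip archive ready for upload:

```python
import zipfile
from pathlib import Path

# Each document sits in its own .txt file (un-merged), one file per document
with zipfile.ZipFile("my_text_corpus.zip", "w", zipfile.ZIP_DEFLATED) as archive:
    for path in sorted(Path("documents").glob("*.txt")):
        # this file name will end up in the "filename" field created during parsing
        archive.write(path, arcname=path.name)
```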

json files:

It is also possible to import simply formatted json files (with no embedded dictionary). You will be asked whether there is a time entry, formatted either as integers or following the classic time format Y-m-d (2001-12-23 being December 23rd, 2001, for instance). If the time information is completed with T-h:m:s, it will be automatically ignored. You can then choose to transform the time information into a number of years, months or days since January 1st of any starting year you define.
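For example, a simply formatted json file with a Y-m-d time entry could be produced as in this sketch (the field names title, text and date are hypothetical):

```python
import json

# One flat record per document, no embedded dictionary; "date" follows the Y-m-d format
documents = [
    {"title": "First document", "text": "Full text of the first document.", "date": "2001-12-23"},
    {"title": "Second document", "text": "Full text of the second document.", "date": "2003-05-02"},
]

with open("my_corpus.json", "w", encoding="utf-8") as f:
    json.dump(documents, f, ensure_ascii=False, indent=2)
```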

Twitter json files:

A specific parser is provided for data collected through the Twitter search API (using tweet_mode=’extended’). You can then define a time granularity that goes from the year down to the second. Only the most pertinent information from the API is kept. Besides the original fields (like lang, user_name, or entities_hashtags), some additional fields are built: htgs, urls, usme, symbol. They simply correspond to the hashtags, urls, user mentions, and symbols present in a tweet. If a tweet is a retweet, information about the original version is also added.
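As a hedged sketch of how such data might be collected (assuming a bearer token for the standard v1.1 search endpoint; the query, token, and output file name are placeholders, and the exact json layout expected by the parser should be checked against your own collection):

```python
import json
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder

# Standard v1.1 search endpoint; tweet_mode=extended keeps the full tweet text
resp = requests.get(
    "https://api.twitter.com/1.1/search/tweets.json",
    params={"q": "cortext", "count": 100, "tweet_mode": "extended"},
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
)
resp.raise_for_status()

# Save the returned statuses as a json file to be zipped and uploaded
with open("tweets.json", "w", encoding="utf-8") as f:
    json.dump(resp.json()["statuses"], f, ensure_ascii=False)
```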

Lexis Nexis files:

Lexis Nexis data should be downloaded as simple text files. Simply zip the batches of 500 records you need to start your analysis. For the moment, time information is automatically transformed into months since January 1990.
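As an illustration, one way to compute such a value (whether the parser counts January 1990 as month 0 or 1 is an assumption here):

```python
from datetime import date

def months_since_jan_1990(d: date) -> int:
    """Number of months elapsed since January 1990 (January 1990 -> 0)."""
    return (d.year - 1990) * 12 + (d.month - 1)

print(months_since_jan_1990(date(2001, 12, 23)))  # 143
```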

Upload process

During the upload process, you will be asked to specify the nature of the data:

[Screenshot: choosing the nature of the data during upload]

a database as explained above, a list of terms without other defining fields attached, or a previously downloaded CorText dataset.

If your data is a dataset, remember to also choose its source/format.

Data format options when parsing datasets

Once you have produced and uploaded your dataset, the next natural step is to parse it; the parsing script is automatically launched after the upload.

learn about CorText scripts and share your experience