How to clean and update research institution data after parsing

Aindu85 asked 9 years ago

How can I clean institution data?
I have a WOS database of publications in the computer science category. I uploaded and parsed it, and the resulting Research Institutions data is not precise at all: apparently it is just the first string before the first comma in the Address field.
For example, the CNRS lab Machin could be parsed, depending on how the address is spelled, as
2) Machin
3) CNRS Machin 
There are too many ways to write the address of an institution, and it seems that the parsing algorithm cannot handle them.
It is possible to clean the data outside Cortext, based on the full address (ISIC1_1 in the db); I have done it. But I haven’t found a way to update the “Research Institutions” field using the cleaned institution data.
The problem is that uploading a dictionary (importing a two-column CSV file and parsing it as a term list) and using Corpus List Indexer with the “Add a dictionary of equivalent strings” option does not work: the parsed “Research Institution” field does not contain enough information to replace a wrong value with the correct one.
The correct institution name can be matched only against the full address contained in ISIC1_1, not against the “Research Institution” field (or ISIC1Inst).

Any advice?
Thanks in advance,

c24b replied 9 years ago

I'm not sure I understand your question exactly: is it how to clean and update the data **inside** Cortext Manager so that it exposes a cleaned Research Institution field, or how to do it outside Cortext Manager? I will try my best to answer either way.

- First tip: if your goal is to find a specific lab, you can filter for it and get cleaned entries by using the WOS Advanced Search with this specific field, which gives you a controlled index:
OG=(Centre National de la Recherche Scientifique (CNRS))
You should then normally get a normalized version of the Address field PA= with a unique format for the Research Institution.
- Second tip (external cleaning): if your goal is to fetch the Research Institution that is actually stored inside the address field, you should clean your initial dataset, zip it, and re-upload it. A simple regex should do the trick inside your original ISI text file: a search and replace in the original ISI file, then a zip and an upload into Cortext Manager, should work.
If you want a hand with cleaning your dataset, I will be happy to look at the problem.

I will let my colleagues answer on how to do it inside Cortext Manager 🙂
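
The external cleaning in the second tip could look roughly like the Python sketch below. It is only an illustration under several assumptions: that addresses sit in the `C1` field of the WOS plain-text export, one address per line, with continuation lines indented by three spaces; that the parser keeps the first comma-separated segment, as observed in the question; and that the rule list, `normalize_address` and `clean_isi_text` are hypothetical names introduced here, not part of Cortext. The idea is to put the canonical institution name in front of every address that matches a known spelling variant.

```python
import re

# Hypothetical rules: pattern matching spelling variants -> canonical name.
RULES = [
    (re.compile(r"\bMachin\b", re.IGNORECASE), "CNRS Machin"),
]

# Optional "[Author, A; Author, B] " prefix that may precede an address.
AUTHOR_PREFIX = re.compile(r"^(\[[^\]]*\]\s*)?(.*)$")


def normalize_address(address, rules=RULES):
    """Prepend the canonical name so it becomes the first comma-separated segment."""
    prefix, body = AUTHOR_PREFIX.match(address).groups()
    prefix = prefix or ""
    for pattern, canonical in rules:
        if pattern.search(body):
            if body.startswith(canonical + ","):
                return address  # already normalized, leave untouched
            return f"{prefix}{canonical}, {body}"
    return address


def clean_isi_text(text, rules=RULES):
    """Rewrite only the address lines of the C1 field; copy everything else as-is."""
    out = []
    in_c1 = False
    for line in text.splitlines():
        tag = line[:2]
        if tag.strip():  # a non-blank tag starts a new field
            in_c1 = (tag == "C1")
        # blank tag: continuation line, still in the same field as before
        if in_c1 and len(line) > 3:
            line = line[:3] + normalize_address(line[3:], rules)
        out.append(line)
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")
```

To use it, read the exported file, pass its content through `clean_isi_text`, write the result back out, zip it (for instance with Python's standard `zipfile` module) and upload the archive again. It is worth checking a few records by hand first, since a pattern that is too broad will relabel unrelated addresses.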

Aindu85 replied 9 years ago

Thanks for your response. My question was either how to clean the data inside Cortext or how to upload cleaned data into Cortext. The second tip you proposed is a good solution; I think it will help. I hadn't thought of cleaning the ISI files before parsing.
Thanks a lot.

etancoigne replied 9 years ago

Since you want to work on the textual field “Address” instead of the categorized field “Research institutions”, you would need to use the “corpus_terms_indexer” script instead of the “corpus_list_indexer” script. However, at the moment this script doesn’t let you work on the “Address” field. Maybe we should ask a Cortext developer to add this field to the script’s parameters.