How do I enrich existing tables in an already parsed corpus?

matias.milia asked 7 years ago

Hi, I am trying to enrich an existing table in my corpus. I have a list of entities that I detected with the entity recognizer. Since some tuning was needed, I downloaded the database as a .csv and edited it in OpenOffice, correcting the inaccurate names and normalizing them. Now I want to integrate it into an existing database (composed of scientific papers). I am kind of lost here; is there a way to do so?

1 Answer
Best Answer
Jean-Philippe Cointet Staff answered 7 years ago

Corpus list indexer is the tool you need to perform any enrichment or cleaning operation starting from an existing list of categories.
What you need to do:

  • Create a csv file with all the changes you want to index. The format is straightforward: the table should have two columns, and any item in the first column will be “converted” into the string found in the second column. This lets you clean wrongly labeled entities, merge several of them, etc.
  • Save the file with the usual formatting (no double quotes, tab-separated), then upload it in the manager, specifying that it is a “term list”.
  • Finally, corpus list indexer will allow you to clean the problematic table (here NER_XXX). You can either restrict the table to a selection of possible entities (“define a custom list of entities”) or homogenize the existing list using the second option, entitled “add a dictionary of equivalent strings”. Of course, you can also use both options at the same time. In both cases, you will be asked to indicate which csv file should be used to (first option) list the “authorized” entities (only the first column of the csv file is read), or (second option) build a dictionary of equivalence.
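
If you prefer to generate the term list with a script rather than a spreadsheet, the file described above can be sketched in Python as follows. This is only a minimal sketch: the function names, the example entities and the UTF-8 encoding are assumptions made for illustration, and the preview function merely imitates locally what a dictionary of equivalent strings is expected to do; it is not CorText's own code.

```python
def format_term_list(equivalences):
    """Return the term list as text: one 'original<TAB>replacement' pair per line."""
    lines = []
    for original, replacement in equivalences:
        original, replacement = original.strip(), replacement.strip()
        for field in (original, replacement):
            # The expected format has no double quotes; a tab or a newline
            # inside a value would break the two-column layout.
            if '"' in field or "\t" in field or "\n" in field:
                raise ValueError(f"unsupported character in {field!r}")
        lines.append(f"{original}\t{replacement}")
    return "\n".join(lines) + "\n"


def write_term_list(path, equivalences):
    """Save the term list to disk (UTF-8 is an assumption, not a documented requirement)."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(format_term_list(equivalences))


def preview_equivalences(entities, equivalences):
    """Local preview only: replace listed entities, leave all others unchanged."""
    mapping = {old.strip(): new.strip() for old, new in equivalences}
    return [mapping.get(entity, entity) for entity in entities]


# Hypothetical corrections, for illustration only.
equivalences = [
    ("Univ. of Buenos Aires", "University of Buenos Aires"),
    ("UBA", "University of Buenos Aires"),
]
```

Calling write_term_list("ner_equivalences.csv", equivalences) (the file name is hypothetical) produces a file that can then be uploaded in the manager as a “term list”. Running preview_equivalences on a few entity names first is a quick way to check that the merges do what you intend.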

The following page should guide you on how to use the script: https://docs.cortext.net/corpus-list-indexer/
 
Good luck!