Indexing a new category in my corpus

CorText Manager Q&A forum. Category: Time processing
Yuri Saldarriaga asked 8 months ago

Dear Lionel,
I have tried many times to index a new category in my corpus, but I have not managed to do it. I followed these steps:

  1. I used the list builder to download a list (using the filename option). Then I opened it in Excel and found three columns: “entity”, “frequency” and “number of distinct documents”. I added a new column with the year, next to “entity”, and saved the document as CSV. At this step I have also tried deleting the “frequency” and “number of distinct documents” columns, saving the list in other CSV formats, working on both Mac and Windows, etc.
  2. I uploaded the CSV file.
  3. I started a new script using the “corpus list indexer” function. In “field” I put “filename”, I answered no to “Define a custom list of entities” and yes to “Add a dictionary of equivalent strings”, where I selected the list I had uploaded (in some attempts, I answered yes to both options and selected the CSV I had uploaded). Here, I also changed the name of the new process.

This is where I run into the first problem. When I look at the “corpus list indexer” file, it has three columns with the same information, and the year I added to the CSV list is missing. That is, I have the same data in “entity”, “entity label” and “forms”. So when I try to run the next analysis, an error appears.
I do not know what else I can do. I would appreciate your help!
Thank you very much,
Yuri

1 Answer
Lionel Staff answered 8 months ago

Dear Yuri,
You are describing the typical steps that are required to achieve what you want. Two comments:

  • Try to avoid MS Excel, which adds some extra information to the original list file. You may want to use Google Sheets or LibreOffice Calc, which are more respectful of the original data format. Typically, the files produced by List Builder are tab-separated and encoded in UTF-8;
  • We recently deployed an update so that everything works in TSV format, from List Builder to List Indexer (and the other scripts that work with textual fields). I recommend that you keep this format (even if it is just a matter of the file’s extension).
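
For anyone who would rather script the edit than use a spreadsheet, here is a minimal Python sketch of adding a column next to “entity” while keeping the file tab-separated and UTF-8. It is not part of CorText Manager. The function names, the “year” column name, and the file names are only illustrative assumptions, and the header names are the ones Yuri reported seeing, so check them against your own export.

```python
def add_column(tsv_text, column_name, values_by_entity, default=""):
    """Return TSV text with a new column inserted right after the first column.

    values_by_entity maps the first-column value (the entity) to the new cell.
    Entities missing from the mapping get `default`.
    """
    lines = tsv_text.replace("\r\n", "\n").split("\n")
    header = lines[0].split("\t")
    out = ["\t".join([header[0], column_name] + header[1:])]
    for line in lines[1:]:
        if not line.strip():
            continue  # skip blank lines, e.g. the trailing one
        cells = line.split("\t")
        value = values_by_entity.get(cells[0], default)
        out.append("\t".join([cells[0], str(value)] + cells[1:]))
    return "\n".join(out) + "\n"


def add_column_to_file(src_path, dst_path, column_name, values_by_entity):
    """Read a UTF-8 TSV file, add the column, and write a UTF-8 TSV file."""
    with open(src_path, encoding="utf-8", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(add_column(text, column_name, values_by_entity))
```

A call such as `add_column_to_file("list.tsv", "list_with_year.tsv", "year", {"report_a.txt": 2019})` (hypothetical file and entity names) would write a new list with the year next to each entity. Plain string splitting is used on purpose: it leaves every existing cell untouched, so nothing is quoted or re-encoded along the way.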

You may also want to watch the following video tutorial, starting at 1:46.
I hope it helps!

Best
L

Yuri Saldarriaga replied 8 months ago

Hi Lionel! I used Google Sheets and it worked perfectly. Thank you so much!