Truncation for Corpus Terms Indexer

CorText Manager Q&A forum
Category: Text processing
Déborah Abhervé asked 2 years ago

Hello,
I would like to index my corpus using a list of terms that I have already identified. I will use the Corpus Terms Indexer script.
I would like to know whether it is possible (and how) to include truncations in this list. For example, if I enter “chenalis*”, I would like to retrieve the terms “chenalisation”, “chenalisations”, “chenaliser”…
Thanks for your help!
Déborah

1 Answer
Lionel Staff answered 2 years ago

Dear Déborah,
I would recommend doing it in two steps:

  • perform a (large) lexical extraction to identify the different forms of each noun phrase in your corpus, based on grammatical variations;
  • edit the resulting list (you can even add forms that were not detected by appending “|&|newform1|&|newform2”), and keep only the forms that appear in the “list of terms previously identified by” you.
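
The selection in the second step could be prepared outside CorText with a short script. The following is only a minimal sketch in Python: the function names and the sample list of forms are hypothetical, the only element taken from the answer above is the “|&|” separator, and the exact layout of a CorText terms list is assumed rather than confirmed. It expands a truncation such as “chenalis*” against the forms found by the lexical extraction and joins the matches into one entry.

```python
from fnmatch import fnmatchcase


def expand_truncation(pattern, extracted_forms):
    """Return the extracted forms that match a truncation such as 'chenalis*'.

    Hypothetical helper: matching is case-insensitive and uses shell-style
    wildcards, which is an assumption, not a documented CorText feature.
    """
    pattern = pattern.lower()
    return sorted({f for f in extracted_forms if fnmatchcase(f.lower(), pattern)})


def forms_entry(main_form, variants):
    """Join a main form and its variants with the '|&|' separator."""
    return "|&|".join([main_form] + [v for v in variants if v != main_form])


# Sample forms, standing in for the output of a lexical extraction.
extracted_forms = [
    "chenalisation",
    "chenalisations",
    "chenaliser",
    "chenal",
    "canalisation",
]

matches = expand_truncation("chenalis*", extracted_forms)
entry = forms_entry(matches[0], matches)
print(entry)
```

Here the first match is used as the main form purely for illustration; in practice you would choose the main form yourself when editing the list.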

During the re-indexation step you may want to have a look at the “Use the shared dictionary” option, but only to add a few other word variations (additions that are not based on grammar).
I hope this helps.
L