How are hyphens handled in the Corpus Terms Indexer Script?

Cortext Manager Q&A forum, Category: Text processing
Aurélien Féron asked 3 days ago

Hello,
I would like to know how hyphens are handled in the Corpus Terms Indexer Script.
Let’s take an example: I want to provide a list of terms to be indexed, which includes “ARG”, for “antibiotic resistance gene(s)”, often abbreviated to ARG or ARGs. Thus, in the ‘forms’ column of this list, there could be: args|&|arg|&|antibiotic resistance genes|&|antibiotic resistance gene.
What happens in the Terms Indexer Script for occurrences with a hyphen before, after, or on both sides, i.e. arg-, -arg, -arg-? For example, what happens with “phe-arg-beta-naphthylamide”, where there is a hyphen on both sides (and which is a type of occurrence that I would prefer not to be indexed)?
It seemed to me that the script sticks very closely to what is in the forms column of the table (so in my case no hyphens), but I would prefer to have confirmation (or clarification) ;-).
Thanks in advance!
Aurélien

2 Answers
Lionel Staff answered 2 days ago

Hello Aurélien,
Thank you for your question!

In the Corpus Terms Indexer, when the Word boundaries option is enabled, word boundaries (https://docs.cortext.net/corpus-terms-indexer/#word-boundaries) are defined by default based on spaces and punctuation marks.
It works in a “negative” way: that is, everything in the textual field to index that is not a letter (accented characters such as é, à, ô, ç, etc. are considered letters) or a digit is treated as a potential word boundary.
This therefore includes:

  • spaces, tabs, and line breaks
  • all common punctuation marks: . , ; : ! ? ' " ( ) [ ] { } - + * / \ & | @ # %, etc.
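
To make the rule above concrete, here is a minimal Python sketch of boundary-based matching. It is not the script's actual code: the function name find_whole_word and the regular expression are assumptions made for illustration, based only on the rule that anything other than a letter or a digit counts as a potential boundary.

```python
import re

def find_whole_word(form, text):
    """Return the start offsets where `form` appears with a boundary
    (any character that is not a letter or digit, or the edge of the
    text) on both sides. Hypothetical helper, not Cortext code."""
    # [^\W_] matches any Unicode letter or digit (accented letters included),
    # so the lookarounds reject a match glued to a letter or digit.
    pattern = r"(?<![^\W_])" + re.escape(form) + r"(?![^\W_])"
    # Case-insensitive, mirroring the default described in this thread.
    return [m.start() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

print(find_whole_word("arg", "phe-arg-beta-naphthylamide"))  # hyphens act as boundaries
print(find_whole_word("arg", "targets"))                     # letters on both sides
print(find_whole_word("état", "état-société"))
```

Under this reading of the rule, the form arg is found inside phe-arg-beta-naphthylamide, because the hyphens on both sides count as boundaries, while it is not found inside targets.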

When Word boundaries is unchecked:

  • gene will also match oxygenes, degeneration, etc.
  • état will match état, état-société, but also étatique, etc.

Hyphens then no longer have any special status: they are simply treated as characters like any others in the text.
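
As a rough illustration of the unchecked case, the Python sketch below does a plain substring search. It only mimics the behavior described here and is not the indexer's implementation; the name find_anywhere is hypothetical.

```python
import re

def find_anywhere(form, text):
    """Return the start offsets of every non-overlapping occurrence of
    `form` in `text`, whatever surrounds it. Hypothetical helper."""
    # re.escape keeps characters such as "-" or "+" literal; no boundary
    # check is applied, so hyphens get no special treatment.
    return [m.start() for m in re.finditer(re.escape(form), text, flags=re.IGNORECASE)]

print(find_anywhere("gene", "degeneration"))              # matches inside a longer word
print(find_anywhere("arg", "targets"))                    # same for arg
print(find_anywhere("arg", "phe-arg-beta-naphthylamide"))
```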

By default, the search is case-insensitive. Case-sensitive matching can be turned on with the case-sensitive-search option (https://docs.cortext.net/corpus-terms-indexer/#case-sensitive-search).
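
The effect of case sensitivity can be sketched in Python as follows. This is an illustration under assumptions, not the script's code: the function has_match and its case_sensitive parameter are hypothetical names, and the boundary rule is the "not a letter or digit" rule described above.

```python
import re

def has_match(form, text, case_sensitive=False):
    """Tell whether `form` occurs in `text` as a whole word.
    Hypothetical helper; `case_sensitive` stands in for the option."""
    # Boundary = anything that is not a letter or digit, or the text edge.
    pattern = r"(?<![^\W_])" + re.escape(form) + r"(?![^\W_])"
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(pattern, text, flags) is not None

print(has_match("ARG", "phe-arg-beta-naphthylamide"))                       # case ignored
print(has_match("ARG", "phe-arg-beta-naphthylamide", case_sensitive=True))  # case respected
print(has_match("ARG", "ARG abundance in soil", case_sensitive=True))
```

In this sketch, an uppercase form ARG no longer matches the lowercase arg in phe-arg-beta-naphthylamide once case is taken into account, which may help when the abbreviation is consistently written in capitals in the corpus.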

In the TSV file of terms to be indexed (main form and forms), it is always possible to refine this behavior by adding hyphenated forms in order to capture only specific variations.

I hope this helps,
Lionel

Aurélien Féron answered 1 day ago

Thank you Lionel for your answer!
Best regards,
Aurélien