Corpus Terms Indexer

This script works hand in hand with lexical extraction and, by default, is launched automatically every time a lexical extraction is executed. Given a series of textual fields chosen by the user, it indexes every term found in a term list CSV file, also specified by the user.

It adds flexibility to the indexation process, because users can edit the term list themselves, either by editing their own CSV file in a spreadsheet editor such as OpenOffice (recommended) or Google Spreadsheet, or by using the online CSV editor provided by CorText.

Only the second and third columns matter when launching an indexation. Concretely, you should provide a tab-separated, UTF-8 encoded file with no text delimiter (which is already the format generated by CorText lexical extraction).

The indexer proceeds as follows. The third column (classically entitled “forms”) provides a list of strings, separated by |&|, that will be indexed under the label given in the second column (entitled “main form”). Each time one of those strings is found, the database stores this information. The first column is secondary. Just be sure to use a different value in each row, because rows with the same value in the first column will be merged under the same entity.

Optionally, if your CSV file contains further columns, rows whose last column ends with a “w” will be ignored in the indexation. Use the yellow contextual help boxes to tune optional parameters.
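The column conventions described above can be illustrated with a minimal Python sketch. This is not CorText's actual implementation: the function names `load_term_list` and `index_text`, the presence of a header row, the choice to keep the first label when rows are merged, and the whole-word matching rule are all assumptions made for illustration.

```python
import csv
import io
import re

FORM_SEPARATOR = "|&|"  # separator used in the "forms" column


def load_term_list(tsv_text):
    """Parse a tab-separated term list into {entity_id: (main_form, forms)}.

    Assumptions (not confirmed by the CorText documentation): the first
    row is a header, and merged rows keep the label of the first row seen.
    """
    entities = {}
    # QUOTE_NONE mirrors the "no text delimiter" requirement.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the assumed header row
    for row in reader:
        if len(row) < 3:
            continue  # not enough columns to index anything
        # With extra columns, a last column ending in "w" disables the row.
        if len(row) > 3 and row[-1].strip().endswith("w"):
            continue
        entity_id, main_form = row[0], row[1]
        forms = [f.strip() for f in row[2].split(FORM_SEPARATOR) if f.strip()]
        if entity_id in entities:
            # Same first-column value: merge under the same entity.
            entities[entity_id][1].extend(forms)
        else:
            entities[entity_id] = (main_form, forms)
    return entities


def index_text(text, entities):
    """Count occurrences of each entity's forms, keyed by main form."""
    counts = {}
    for main_form, forms in entities.values():
        total = 0
        for form in forms:
            # Whole-word, case-sensitive match (an illustrative choice).
            pattern = r"(?<!\w)" + re.escape(form) + r"(?!\w)"
            total += len(re.findall(pattern, text))
        if total:
            counts[main_form] = counts.get(main_form, 0) + total
    return counts


SAMPLE = (
    "id\tmain form\tforms\tstatus\n"
    "1\tclimate change\tclimate change|&|climatic change\t\n"
    "2\tcarbon tax\tcarbon tax|&|carbon taxes\t\n"
    "3\tpaper\tpaper|&|papers\tw\n"
)

entities = load_term_list(SAMPLE)
text = "climatic change and climate change policy often rely on carbon taxes in this paper."
result = index_text(text, entities)
```

In this sample, both variants of the first row are counted under the label "climate change", and the third row is skipped because its last column ends with "w".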

 

Other available options should be straightforward. They include the possibility to take case into account when indexing, and to index only one occurrence of a term per sentence.
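To make these two options concrete, here is a small hedged sketch in Python. The function `count_term`, its parameter names and defaults, and the naive sentence splitter are hypothetical and chosen for illustration only; they do not reflect how CorText implements or names these options.

```python
import re


def count_term(text, form, match_case=False, once_per_sentence=False):
    """Count occurrences of `form` in `text` under the two options.

    Hypothetical helper: parameter names and defaults are assumptions.
    """
    flags = 0 if match_case else re.IGNORECASE
    # Whole-word match of the literal form.
    pattern = re.compile(r"(?<!\w)" + re.escape(form) + r"(?!\w)", flags)
    if not once_per_sentence:
        return len(pattern.findall(text))
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Each sentence contributes at most one occurrence.
    return sum(1 for sentence in sentences if pattern.search(sentence))


text = "Carbon tax works. A carbon tax, or any carbon tax, helps."

all_hits = count_term(text, "carbon tax")
case_hits = count_term(text, "carbon tax", match_case=True)
sentence_hits = count_term(text, "carbon tax", once_per_sentence=True)
```

With the sample text, ignoring case finds three occurrences, matching case finds two (the capitalised "Carbon tax" is skipped), and counting once per sentence also gives two, since the second sentence contributes a single hit.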
