I saw that my analysis contains duplicates (e.g. United States, United States of America, and USA). How may I clean the raw data? Is there any way we can upload something like a thesaurus file so that these labels can be merged?
Yes, the corpus list indexer script is designed precisely to help you clean and homogenize categorical entities in your dataset.
You mostly need to prepare a two-column TSV file in which each row contains a pair of equivalent strings, upload it in the manager, and finally choose “add a dictionary of equivalent strings” when prompted by the corpus list indexer script. Hope it is clear enough.
column 1: original entities, column 2: recoded entities (separated by a tab)
United States	USA
United States of America	USA
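
If you have many variants to merge, it may be easier to generate the dictionary file with a short script than to type it by hand. Here is a minimal Python sketch that writes the two rows above as a tab-separated file. The file name `equivalent_strings.tsv`, the helper `write_equivalence_tsv`, the UTF-8 encoding, and the absence of a header row are my assumptions for illustration, not documented requirements of the corpus list indexer, so adjust them if the script complains.

```python
import csv
import io

# Hypothetical mapping for illustration: original entity -> recoded entity
equivalents = {
    "United States": "USA",
    "United States of America": "USA",
}

def write_equivalence_tsv(mapping, handle):
    """Write one 'original<TAB>recoded' row per entry of the mapping."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for original, recoded in mapping.items():
        writer.writerow([original, recoded])

# Build the content in memory first so it can be inspected before saving
buffer = io.StringIO()
write_equivalence_tsv(equivalents, buffer)
tsv_text = buffer.getvalue()

# Save it to disk (assumed file name and encoding), ready to upload in the manager
with open("equivalent_strings.tsv", "w", encoding="utf-8", newline="") as f:
    f.write(tsv_text)
```

You would then upload the resulting file in the manager and point the corpus list indexer at it as described above.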