Cleaning raw data

jk asked 2 years ago

I saw that my analysis contains duplicates (e.g. United States, United States of America, and USA). How can I clean the raw data? Is there a way to upload something like a thesaurus file so that these labels can be merged?

2 Answers
Jean-Philippe Cointet Staff answered 2 years ago

Yes, the corpus list indexer script is designed precisely to help you clean and homogenize categorical entities in your dataset.
Prepare a two-column TSV file in which each row contains a pair of equivalent strings, upload it in the manager, and then choose “add a dictionary of equivalent strings” when prompted by the corpus list indexer script. I hope that is clear enough.
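Column 1: original entities; column 2: recoded entities (shown here with commas for readability; in the TSV file the two columns are separated by a tab):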
United States, USA
United States of America, USA
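
If you would rather generate the file than type it by hand, here is a minimal Python sketch of one way to do it. It is not part of the platform. The names EQUIVALENTS, write_equivalence_tsv and recode are hypothetical, and the sketch assumes the file needs no header row, which the answer above does not specify. The recode helper is only there to preview the merge locally; it is not the indexer's own logic.

```python
import csv
import io

# Hypothetical mapping: original label -> recoded label (assumed example data).
EQUIVALENTS = {
    "United States": "USA",
    "United States of America": "USA",
}


def write_equivalence_tsv(mapping, handle):
    """Write one 'original<TAB>recoded' row per pair (no header row, by assumption)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for original, recoded in mapping.items():
        writer.writerow([original, recoded])


def recode(labels, mapping):
    """Local preview of the merge: labels not found in the mapping are kept unchanged."""
    return [mapping.get(label, label) for label in labels]


# Build the TSV content in memory so it can be inspected before saving.
buffer = io.StringIO()
write_equivalence_tsv(EQUIVALENTS, buffer)
tsv_text = buffer.getvalue()

preview = recode(["USA", "United States", "France"], EQUIVALENTS)
```

To save an actual file for upload, pass a file handle opened with open("equivalents.tsv", "w", newline="", encoding="utf-8") instead of the in-memory buffer; the file name is only an example.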