Cleaning raw data

jk asked 5 years ago

Hello,
I saw that my analysis contains duplicates (e.g. United States, United States of America, and USA). How may I clean the raw data? Is there any way we can upload something like a thesaurus file so that these labels can be merged?

2 Answers

Jean-Philippe Cointet Staff answered 5 years ago

Yes, the corpus list indexer script is designed precisely to help you clean and homogenize categorical entities in your dataset.
You need to prepare a two-column TSV file in which each row contains a pair of equivalent strings, upload it to the manager, and then choose "add a dictionary of equivalent strings" when prompted by the corpus list indexer script. I hope this is clear enough.
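For example (shown here with commas for readability; in the TSV file the two columns are separated by a tab):
column 1: original entities, column 2: recoded entities
United States, USA
United States of America, USA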
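
If you would rather generate that file than type it by hand, here is a minimal Python sketch. It only assumes what is described above: two tab-separated columns, original string first, recoded string second. The file name `equivalents.tsv`, the `EQUIVALENTS` mapping and the helper names are made up for illustration. No header row is written, because the answer above does not say whether the script expects one.

```python
import csv
import io

# Hypothetical mapping: original label -> label it should be recoded to.
EQUIVALENTS = {
    "United States": "USA",
    "United States of America": "USA",
}


def write_equivalents(pairs, handle):
    """Write one 'original<TAB>recoded' row per pair to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for original, recoded in pairs.items():
        writer.writerow([original, recoded])


def build_tsv(pairs):
    """Return the TSV content as a string (handy for checking before upload)."""
    buffer = io.StringIO()
    write_equivalents(pairs, buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Assumed output name; upload the resulting file to the manager.
    with open("equivalents.tsv", "w", newline="", encoding="utf-8") as f:
        write_equivalents(EQUIVALENTS, f)
```

Open the resulting file in a text editor before uploading to check that each row has exactly one tab between the two strings.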

Lionel Staff answered 5 years ago

See: https://docs.cortext.net/question/replace-name-duplicats-in-a-list-of-authors/
for a similar question.
