Geocoding an address or a toponym is a complex process. Location names across the world are extremely ambiguous, and addresses may contain information that is not related to space. But when the results feed research work, quality is crucial!
To help you increase the accuracy of the geocoded information, we have developed the GeoEdit tool to refine the geographical coordinates and associated metadata (city names, region names and country names) added to your dataset by the CorTexT geocoding service.
In the geocoding results section of a project's dashboard, click on the eye next to result.geoedit to open the tool.
Confidence score / geo_score
The confidence score (shown as geo_score in the fields list of the spreadsheet) is a core element of the refinement process. It combines two aspects: 1/ a similarity measure of how close the original location name is to the label returned by the geocoding process (not shown in the columns of the spreadsheet), and 2/ a qualitative evaluation of the type of geographical object found.
String comparison score
The string comparison score is computed in two steps.
- The first one splits the label returned by the geocoding process into sections, using commas (or spaces if there is no comma);
- The second one calculates, for each section, its best Levenshtein distance (V. Levenshtein, 1966) against a word n-gram of similar length taken from the original information in the corpus, without taking into account case, word order and punctuation.
Finally, the average of these distances is the string similarity score.
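The two steps above can be sketched in Python as follows. This is only an illustration, not the CorTexT implementation: the function names are hypothetical, and several details are assumptions, in particular turning each distance into a similarity between 0 and 1 (one minus the distance divided by the longer string length), ignoring word order by sorting the words, and using n-grams with exactly as many words as the section.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), row by row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def words(text: str) -> list:
    """Lowercase the text, replace punctuation by spaces, split into words."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def section_similarity(section_words: list, ngram_words: list) -> float:
    """Assumption: similarity = 1 - distance / longest length, words sorted."""
    a = " ".join(sorted(section_words))
    b = " ".join(sorted(ngram_words))
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def string_similarity(original: str, label: str) -> float:
    """Average, over the label sections, of the best n-gram similarity."""
    sections = label.split(",") if "," in label else label.split()
    original_words = words(original)
    scores = []
    for section in sections:
        section_words = words(section)
        if not section_words:
            continue
        n = len(section_words)
        # Word n-grams of the original text with the same number of words.
        starts = range(max(len(original_words) - n + 1, 1))
        ngrams = [original_words[i:i + n] for i in starts]
        scores.append(max(section_similarity(section_words, g) for g in ngrams))
    return sum(scores) / len(scores) if scores else 0.0


print(string_similarity("Australian Def Force Campbell", "Campbell, Australia"))
```

With this sketch, a label whose sections all appear verbatim in the original information scores 1, and noisy words in the original that match no section of the label do not lower the score.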
Geographical object score
The geographical object score is based on the main type of geographical object identified to locate the original information and returned after the geocoding process (found or as a fallback). The score is built on a qualitative evaluation of these objects, as shown below.
Priority is given to the least ambiguous entities (postal codes) and to meso-scale entities (from neighbourhood to county). Macro entities (from region to country) and entities which may carry a high level of ambiguity (e.g. building and venue names) receive lower scores.
Final calculation of the confidence score
The confidence score, used for the filter bar and shown in the geo_score column, combines the results of the two previous steps:
- if the string similarity score is lower than 0.5, the confidence score corresponds to the similarity score;
- if the string similarity score is higher than 0.5, the confidence score is the sum of the geographical object score weighted by 0.3 and the string comparison score weighted by 0.7.
The confidence score is then scaled to go from 0 to 100.
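As a rough illustration of this rule, the calculation could look like the Python sketch below. The function name is hypothetical. It assumes that both input scores lie between 0 and 1 and that the scaling is a simple multiplication by 100; the text does not say what happens when the similarity is exactly 0.5, so the weighted branch is used for that case here as an assumption.

```python
def confidence_score(similarity: float, object_score: float) -> float:
    """Combine the two scores as described above (inputs assumed in [0, 1])."""
    if similarity < 0.5:
        # Poor string match: the object type is ignored.
        combined = similarity
    else:
        # Assumption: a similarity of exactly 0.5 also takes this branch.
        combined = 0.3 * object_score + 0.7 * similarity
    # Assumption: scaling to 0-100 is a multiplication by 100.
    return combined * 100


print(confidence_score(0.4, 1.0))
print(confidence_score(0.8, 0.5))
```

Under these assumptions, a poorly matching label can never reach a score of 50, whatever the type of geographical object found.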
The edit spreadsheet is the main tab of the two accessible in the GeoEdit tool. It is where you refine the results of the geocoding process.
Search for a specific string in one of the columns.
Addresses marked as ‘Checked’
Show or hide addresses or place names that have already been checked and saved. They appear as CHECKED in the geo_score column and are colored in green.
When unchecked (by default), all lines which have already been saved are hidden, so that lines you have already verified no longer clutter the spreadsheet.
Confidence score threshold
Filter the spreadsheet lines according to the confidence score.
Filter the rows of the spreadsheet by the number of times the addresses or place names (in gray, in the first column) appear in the data set.
Hide a line of the spreadsheet.
The ability to delete a line is an important feature, especially after using the GPE Named Entity Recognizer, as the extracted entities may not be geographical. Click on the red Delete button to remove all information from the dataset: coordinates, metadata (city, region and country names) and also the original information (first column, in gray). The deleted lines won’t appear again in the GeoEdit tool (but do not worry: you first have to review the changes before the deletions are applied).
Find this place in a map
In the spreadsheet, you can right-click on any line to search for the corresponding location on a map.
Trying to find
The top right box shows the selected line to refine. In the example below, the address has been geocoded in Campbell, Australia, due to the presence of the Australian Defense Force Academy there, as Australian Def Force is part of the address. Remove the noisy section from the search string, and select the most appropriate candidate in the search results list.
Select right location
The gray pin refers to the coordinates found and already added to the dataset (if any); the pink pin refers to the one just selected from the candidates list that will be added as a replacement.
The two blue buttons, Edit Results and Review Changes, allow you to navigate between two tabs. In the Review Changes tab, any updates already made to the spreadsheet are listed and must be reviewed before being saved.
The type of action that will be applied to the dataset after saving the changes is shown in the first column. It can be UPDATE or DELETE.
Mark as checked
Hide the CHECKED rows in the Edit Results spreadsheet after saving, for ease of future work. If set to Yes (the default), a CHECKED value is added to the geo_score column, and the rows can be hidden or displayed by turning the Addresses marked as ‘Checked’ button on or off.
If it is checked (by default), the changes will be applied when you click on the green Save button. You can uncheck it for a given row or for a set of rows if you have decided not to apply these changes after all.
Save the changes
On the Edit Results tab, click the green Save button to apply the changes you made (edits to results or deletions). It will automatically update the related variables of your dataset: geo_city, geo_region, geo_country and geo_longitude_latitude (and, in case of a deletion, the variable of your dataset selected for the geocoding process).
Note that this does not update the CSV file provided in the geocoding results section (geocoded.csv)!
The GeoEdit tool is intended to refine the city, region and country names and the geographical coordinates obtained from the geocoding process, and to store them in the dataset. All stored refinements are thus directly available in the analyses you perform afterwards.
Improvements and issues
- Some will probably come soon!
- For now, the GeoEdit tool can’t update or delete results for a place name which corresponds to more than 300 documents.
Section to expand with your comments.
Levenshtein, Vladimir I. (February 1966). “Binary codes capable of correcting deletions, insertions, and reversals”. Soviet Physics Doklady. 10 (8): 707–710.