In digitalized documents, especially textual ones, more than 70% contain geographical information. Nevertheless, a large proportion of this information is not formalized in a way that allows it to be projected and manipulated on maps, and it is stored in various forms (Hill, 2006): pictures of places, toponyms in full-text documents, addresses, structured metadata or, finally, geographical coordinates. Adding a layer of geographical information in a form that computational treatments can understand is a vast research field (active since the 1980s), which has produced different sets of methods: Geotagging, Geolocation, Georeferencing, Toponym detection in full text, and Geocoding.
Here, we focus on geocoding. But in CorText Manager, you can also extract toponyms with the Named Entity Recognition script (identification of toponyms tagged GPE), a Natural Language Processing method based on language rules that uses contextual information to disambiguate toponyms. The two scripts can be chained, running one (NER with toponym extraction) after the other (geocoding of addresses).
What is Geocoding?
Modern geocoding engines “tackled the problems of assigning valid geographic codes to far more types of locational descriptions [than older methods] such as street intersections, enumeration districts (census delineations), postal codes (zip codes), named geographic features, and even freeform textual descriptions of locations.” (Goldberg et al., 2007)
The CorText geocoding engine has been built to handle semi-structured addresses written by humans, so it is able to solve complex situations such as:
- Different formats that depend on national postal services (or data providers) and vary widely across countries;
- Non-geographic information (building names, lab names, person names…), which is ambiguous and may be multi-located;
- Ambiguous toponyms (e.g. is “Paris” one of the Parises in Canada or the capital city of France? Does “Osaka” in Japan refer to the region or the city?);
- Alternative and vernacular toponym names.
The typical steps needed to move from raw addresses related to documents/objects to coordinates (longitude and latitude) are:
- Normalization: cleaning, parsing (some components, such as city names or postal codes, are more essential than others) and normalisation (making sure the same geographical object always carries the same name, e.g. resolving abbreviations and ambiguities);
- Matching: comparing the normalized address with one or several reference databases.
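The two steps above can be sketched in a few lines of Python. Everything here is a deliberately naive illustration: the mini reference database, the parsing rule and the function names are hypothetical, not the actual engine.

```python
import re

# Hypothetical mini reference database: (city, country) -> (latitude, longitude).
REFERENCE_DB = {
    ("paris", "france"): (48.8566, 2.3522),
    ("paris", "canada"): (43.2000, -80.3833),
}

def normalize(raw_address):
    """Step 1: cleaning, parsing and normalisation of a raw address."""
    cleaned = re.sub(r"[^\w\s,]", "", raw_address.lower())  # strip stray punctuation
    parts = [p.strip() for p in cleaned.split(",") if p.strip()]
    return {"city": parts[0], "country": parts[-1]}  # naive "city, country" parsing

def match(components):
    """Step 2: compare the normalised components with the reference database."""
    return REFERENCE_DB.get((components["city"], components["country"]))

coords = match(normalize("Paris, France!"))  # -> (48.8566, 2.3522)
```

A real engine replaces both steps with far more robust components (a statistical parser and a full geo-database), but the division of labour is the same.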
Classifying elements contained in an address
First, the tool performs a pre-processing step to clean and normalise the addresses (simplifying the punctuation, resolving aliases, normalising the country names…).
Secondly, to classify the elements of an address, CorText Manager uses LibPostal: a multilingual, open-source, Natural Language Processing based address parser and normalizer that classifies geographical elements in worldwide street addresses. LibPostal has been trained on OpenStreetMap. LibPostal is able to classify objects such as: house, building name, near (e.g. “near New York”), level in a building, street number, postal code, suburb, city, island, state and state district, region, country…
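To make the classification step concrete, here is a toy, rule-based classifier that mimics the kind of labels LibPostal assigns to address components. The rules and the country list are illustrative assumptions only; the real LibPostal is a statistical model trained on OpenStreetMap, not a handful of regexes.

```python
import re

# Hypothetical country list for the toy rules below.
COUNTRIES = {"france", "japan", "usa"}

def classify(address):
    """Label each comma-separated component with a LibPostal-style class."""
    labels = []
    for token in (t.strip() for t in address.lower().split(",")):
        if re.fullmatch(r"\d{4,5}", token):          # 4-5 digits: postal code
            labels.append((token, "postcode"))
        elif token in COUNTRIES:
            labels.append((token, "country"))
        elif re.match(r"\d+\s", token):              # leading number: street address
            labels.append((token, "house_number_and_road"))
        else:                                        # fallback: assume a city name
            labels.append((token, "city"))
    return labels

print(classify("12 rue de Rivoli, 75004, Paris, France"))
# [('12 rue de rivoli', 'house_number_and_road'), ('75004', 'postcode'),
#  ('paris', 'city'), ('france', 'country')]
```

With the real library, the equivalent call is `parse_address` from LibPostal's Python bindings (the `postal` package), which returns the same kind of (component, label) pairs.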
Matching the classified elements with the reference geo-databases
Pelias compares the classified elements of an address with a set of large open-source geo-databases (OpenStreetMap, GeoNames, OpenAddresses, Who's On First) and manipulates different types of geographic objects (mainly vectors for streets, points for locations, and shapes for administrative boundaries).
To guess which location (pair of coordinates) is the most probable for a given address, Pelias articulates toponym classes both vertically (administrative boundaries, mainly hierarchical nesting) and horizontally for context-dependent toponyms (within a country, or for a given language: vernacular names, campus names, boroughs in a city, directions… when the same original geographical object has different names). This requires a large graph of connected classes. These classes are detailed in Who's On First's ontology, which has been designed to understand “Where things are (and what they mean)”.
For the remaining ambiguities (e.g. when “Paris” appears without any other information), the geocoding engine uses external variables (a popularity criterion, such as the number of inhabitants) to decide which candidate is the best (e.g. to choose “Paris, France” instead of “Paris, Texas, USA”).
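The popularity tie-break can be sketched as follows. The candidate table, population figures and function name are made-up illustrations of the principle, not the engine's actual data or API.

```python
# Hypothetical candidate table: ambiguous toponym -> possible resolutions,
# each with an approximate population used as a popularity criterion.
CANDIDATES = {
    "paris": [
        {"name": "Paris, France", "population": 2_100_000},
        {"name": "Paris, Texas, USA", "population": 25_000},
        {"name": "Paris, Ontario, Canada", "population": 12_000},
    ],
}

def disambiguate(toponym):
    """Pick the most populous candidate for an otherwise ambiguous toponym."""
    candidates = CANDIDATES.get(toponym.lower(), [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["population"])["name"]

best = disambiguate("Paris")  # -> "Paris, France"
```

In practice this criterion is only applied once the vertical and horizontal context described above has failed to settle the ambiguity.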
Methods of the CorText Geocoding service
We promote this two-step approach (a classification step and a comparison step), with four options to fit the needs of the research conducted by our users: from meso scales (e.g. the regional level) to smaller geographical spaces (e.g. building names, streets or neighbourhoods).
- Filtering organisation names: organisation names, street names, person names and postal boxes are removed in order to reduce ambiguity (e.g. for multi-located companies or laboratories) and to retrieve more aggregated geographical information. Addresses can be located from the postal-code scale (sub-city scale) to urban areas and metropoles (sub-regional scale, as for counties). This method is opportunistic: it takes the less ambiguous information first (postal codes) and ends with the more ambiguous (city names, and finally building and street names). Depending on the elements tagged in the address, the geocoding engine decides which candidate and scale are the best. This method is designed to cover a large variety of situations with good-quality results.
- Priority on city scale: tagged addresses are searched in the specific sub-area (a meso area such as a region or county) and coordinates are retrieved as often as possible at the city scale. This tends to reduce the variety of geographical objects retrieved and to narrow the results to the centroid coordinates of cities. This method is useful for analyses at the city or regional scale.
These two approaches produce aggregated coordinates (shapes located by a centroid), with a less spread-out spatial distribution.
Full addresses geocoding:
- Priority on street scale: tagged addresses are searched in the specific sub-area and retrieved with very fine-grained detection of toponyms and building names (and POIs). Street names and building names are prioritized. It requires good and uniform coverage of full addresses (with street names and/or building names) and fits when following precise intra-urban local spatial dynamics. It tends to produce a spread-out spatial distribution of the coordinates.
- No customisation: full addresses are sent to the geocoding engine without any customisation (no pre-processing step, no toponym filter, no prioritisation of the scale), letting it decide which geographical object is the best.
- Select the field: select the field which contains the list of addresses. Each address should contain at least a city name and a country name (e.g. “Paris, France”). Ideally, country names should follow the ISO standard. As country boundaries are important information for locating addresses, the CorText Geocoding service is able to deal with aliases for country names (e.g. USA, US, United States of America). When needed, intermediary names (state, region, county…) are also useful to help reduce ambiguities. Postal codes (but not postal boxes) are powerful, non-ambiguous information and are supported for 11 countries.
- Top scale filter: remove from the results all addresses that have been geocoded above this scale.
- Geocoding methods: choose which geocoding method to use according to your needs. See the definitions above.
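Two of the options above lend themselves to a short sketch: normalising country-name aliases to one standard form, and the top-scale filter that drops results geocoded above a chosen scale. The alias table and the ordering of scales are simplified assumptions, not the service's actual tables.

```python
# Hypothetical alias table mapping country-name variants to one standard form.
COUNTRY_ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
}

# Assumed scale order, finer scales first; a result is kept only if its
# scale is at least as fine as the chosen top scale.
SCALE_ORDER = ["street", "postalcode", "city", "region", "country"]

def normalize_country(name):
    """Resolve a country-name alias to its standard form."""
    key = name.strip().lower()
    return COUNTRY_ALIASES.get(key, name.strip())

def top_scale_filter(results, top_scale):
    """Drop results geocoded at a coarser scale than top_scale."""
    limit = SCALE_ORDER.index(top_scale)
    return [r for r in results if SCALE_ORDER.index(r["scale"]) <= limit]

results = [
    {"address": "Paris, France", "scale": "city"},
    {"address": "France", "scale": "country"},
]
kept = top_scale_filter(results, "city")  # drops the country-level result
```

The filter is useful when country-level fallbacks (an address where only the country could be resolved) would pollute a city-scale analysis.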
New layers added
Depending on the method chosen, users will get different results. The CorText Geocoding engine not only geocodes your addresses (longitude and latitude coordinates) but also enriches your corpus with three new variables that can be used in other CorText scripts:
- geo_city: name of the city, which combines the locality name, localadmin name and neighbourhood (when needed);
- geo_region: region name identified through the hierarchy, or identified directly in the address if it was the only information associated with the country name. The regional layer is fed with different geographic elements that rely on the administrative divisions of the country (e.g. State in the US, Department in France, Prefecture in Japan);
- geo_country: standardised country name (ISO standard).
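The enrichment can be pictured as adding fields to each corpus record. The record layout and the coordinate values below are an illustrative assumption (the well-known centroid of Paris), not actual CorText output; only the three field names come from the text above.

```python
# A corpus record before geocoding (hypothetical layout).
record = {"address": "Paris, France"}

# The same record after geocoding: coordinates plus the three new layers.
geocoded = {
    **record,
    "longitude": 2.3522,        # example coordinates for Paris
    "latitude": 48.8566,
    "geo_city": "Paris",        # locality / localadmin / neighbourhood
    "geo_region": "Île-de-France",   # administrative division of the country
    "geo_country": "France",    # ISO-standardised country name
}
```

Because these are ordinary fields, they can be fed into other CorText scripts (e.g. network mapping) exactly like any other variable of the corpus.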
Example of application
- Geocoding of all addresses from documents in the 3D printing field published between 2001 and 2015, with the “Filtering organisation names” method;
- Lexical extraction of textual information (top 150 keywords, without monograms);
- Clustering of extracted terms with distributional algorithm (Terms – Terms, with Top 100 keywords);
- Clustering of geocoded cities (geo_city – geo_city, top 100, raw, Top 3 neighbours), with a 3rd variable to tag clusters (the name of the semantic clusters: PC_Terms_Terms and chi2 measure to tag clusters) with the top 2 labels for the tagging variable (Number of labels to show for each cluster).
On the map, “purdue university” is a location in the US where the university sits next to the city of Lafayette, in the state of Indiana (around 2 kilometres from the city). In the CorText Geocoding Service, this location is considered an independent geographical object (with its own postal code, where the administrative boundary of the university does not overlap the boundary of the city of Lafayette).
The other locations shown on the map are city names, even if the coordinates found for them (longitude and latitude) are below the scale of these cities.
With the top 2 semantic cluster labels written next to the clusters, it is possible to estimate the thematic specialisation of geographic communities built on top of author collaborations in the 3D printing scientific field.