Terms Extraction

Terms extraction automatically identifies the terms that characterize a given corpus. The Natural Language Processing tools we use (available for English, French, Spanish and German text only) identify not only simple terms but also multi-terms (called n-grams).

How to use the script

  • Textual fields definition – Select the textual fields you wish to analyze and index:

Screenshot from 2016-08-16 15:18:27

   If one chooses to analyse a Web Of Science dataset, the available textual fields are Title, Abstract, Keywords (provided by authors), ISIID (keywords provided by WOS) and Addresses.

Screenshot from 2016-08-16 15:18:51

  • Terms list filtering – The lexical extraction script aims at identifying the most salient multi-terms according to statistical criteria (see the technical description below). You can also exclude any term below a given minimum frequency. Another important parameter is the size of the list you wish to extract: it can have a strong impact on script speed, so it is advised to keep it below 1000.
  • Language – Don’t forget to specify the language of your dataset; only French, German, Spanish and English are taken into account. German and Spanish have been less extensively tested than French and English (any feedback is welcome!). If you select “Other”, no grammatical processing will be applied to the data, meaning only statistical criteria based on word collocations will be used to derive phrases.

Screenshot from 2016-08-16 15:19:24

  • Monograms – You are also given the possibility to exclude monograms (that is, terms composed of only one word). It is advised to exclude them, as monograms tend to be less informative.

Screenshot from 2016-08-16 15:19:44

  • Maximal length (max number of words) – It is also possible to limit the number of words a multi-term may contain. Three is a reasonable value, but feel free to try to identify longer multi-terms.

Screenshot from 2016-08-16 15:20:11

  • Advanced settings – These advanced options are described below.

Screenshot from 2016-08-16 15:20:42

  • Sampling – For large datasets, you are advised to sample your original dataset: terms extraction will then only be based on a sub-corpus with the given number of documents, randomly drawn from the original dataset. Nevertheless, detected terms will be indexed in the whole corpus whatever the sampling strategy.

Screenshot from 2016-08-16 15:21:24

  • Optionally, you can name the new indexation that will be generated – By default, a new table named Terms will be created. It is possible to choose a different name for this table and hence manage several indexes at the same time. The indexed variable will be named ISITerms followed by the string you entered. Please use simple characters (no accents, etc.) and avoid spaces.

Advanced settings: 

  • Frequency (c-value) computation level: Term frequencies can be computed at the document level (a term’s frequency is the number of distinct documents it appears in) or at the sentence level (the default, where repetitions of a term across sentences inside a document are taken into account). Since the frequencies of a term (main form) are calculated during the indexation process (that is, after the extraction and selection of the relevant keywords), the results for this setting are stored in the multiterms_statistics_expanded.csv file inside the indexed_list folder.
  • Specificity score: The selection of the most pertinent terms results from a trade-off between their specificity and their frequency (see the technical explanations below). By default, specificity is computed as a chi2 score (summed over all words). It is also possible to use a simpler tf.idf score. Another option is the “pigeonhole pertinence measure”, which gives more weight to words that tend to cooccur several times per document; as a consequence, this measure is only valid for large enough documents (typically at least several sentences). One can also deactivate the role of specificity in the final ranking of extracted terms, so that only the top N most frequent terms are retrieved.
  • Linguistic pre-processing: One can also completely deactivate the linguistic pre-processing (POS-tagging, chunking, stemming). This is useful when treating texts in languages other than English, French, German or Spanish, but also when one does not want to restrict the extraction to a single grammatical class.
  • Grammatical criterion: By default, noun phrases are identified and extracted, but you can also choose to try to identify adjectives or verbs.
  • Automatically index the corpus: The script first extracts a list of terms and then indexes the corpus. It is possible to skip the indexing step by unchecking this box.
  • Pivot Words: Only multi-terms containing this string will be extracted.
  • Starting Character: Only terms starting with this character will be extracted. This comes in handy if you wish to index hashtags, for instance (#); simply make sure to also deactivate linguistic pre-processing in that case.
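The difference between the document-level and sentence-level frequency settings above can be sketched as follows. This is a hypothetical helper, not CorText code, and the naive split on “.” stands in for real sentence segmentation:

```python
def term_frequency(documents, term, level="sentence"):
    """Document level: number of distinct documents containing the term.
    Sentence level: repetitions across sentences within a document count."""
    term = term.lower()
    if level == "document":
        return sum(1 for doc in documents if term in doc.lower())
    # naive split on '.'; a real pipeline uses a proper sentence segmenter
    return sum(
        sum(1 for sent in doc.lower().split(".") if term in sent)
        for doc in documents
    )

docs = [
    "Spherical fullerenes are stable. Fullerenes conduct electricity.",
    "Graphene differs from fullerenes.",
]
print(term_frequency(docs, "fullerene", level="document"))  # 2
print(term_frequency(docs, "fullerene", level="sentence"))  # 3
```

With sentence-level counting, the repetition inside the first document raises the count from 2 to 3.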

In the Dynamics panel, one can temporally slice the corpus to apply different lexical extractions.

  • Time periods – Different lexical extraction processes will be applied to the different time periods defined (either from the original time range or from a customized time range, if one was computed before). Time slices are either regular (uniform distribution of time steps per period) or homogeneous (uniform distribution of documents per period).

Screenshot from 2016-08-16 15:22:02
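The two slicing strategies can be sketched as follows; these helper names are illustrative, not CorText’s internals. Regular slices cut the time range into equal-width intervals, while homogeneous slices put an equal number of documents in each period:

```python
def regular_slices(years, n_periods):
    """Equal-width time intervals over the observed range."""
    lo, hi = min(years), max(years)
    width = (hi - lo + 1) / n_periods
    buckets = [[] for _ in range(n_periods)]
    for y in sorted(years):
        buckets[min(int((y - lo) / width), n_periods - 1)].append(y)
    return buckets

def homogeneous_slices(years, n_periods):
    """Equal number of documents per period (remainder in the last one)."""
    ys = sorted(years)
    size = len(ys) // n_periods
    return [ys[i * size:(i + 1) * size] if i < n_periods - 1 else ys[i * size:]
            for i in range(n_periods)]

years = [2000, 2000, 2001, 2005, 2006, 2007, 2007, 2007]
print(regular_slices(years, 2))      # [[2000, 2000, 2001], [2005, 2006, 2007, 2007, 2007]]
print(homogeneous_slices(years, 2))  # [[2000, 2000, 2001, 2005], [2006, 2007, 2007, 2007]]
```

Note how the same corpus splits differently: the regular strategy yields unbalanced document counts, while the homogeneous one yields unbalanced time spans.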

The final result is compiled in a csv file which can either be downloaded and edited offline with a spreadsheet editor like OpenOffice (recommended), or edited online using the online csv editor provided by CorText (simply click the csveditor.php file).
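If you prefer scripting over a spreadsheet, the exported file can be parsed with any csv library. The column names below are only illustrative; the actual header depends on the CorText version:

```python
import csv
import io

# a toy export; actual column names may differ
sample = """term,frequency,specificity
carbon nanotube,154,8.2
spheric fullerene,97,6.5
"""
rows = list(csv.DictReader(io.StringIO(sample)))
best = max(rows, key=lambda r: float(r["specificity"]))
print(best["term"])  # carbon nanotube
```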

Methodological background

Automatic multi-term extraction is a typical NLP task, yet existing tools are not always well suited when one wishes to extract only the most salient terms. As computing specificity is time- and resource-expensive, we have developed an automatic method to extract lists of terms that we suspect to be the best candidates for lexical extension. We are thus interested in groups of relevant terms featuring both high unithood and high termhood, as defined in (Kageura, K., & Umino, B., 1996).

The whole processing of textual data can be described as follows: it first relies on classic linguistic processes that end up defining sets of candidate noun phrases.

POS-tagging: A Part-of-Speech tagging tool first tags every term according to its grammatical type: noun, adjective, verb, adverb, etc.

Chunking: Tags are then used to identify noun phrases in the corpus; a noun phrase can be minimally defined as a list of successive nouns and adjectives. This step builds the set of our possible multi-terms.
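On pre-tagged tokens, this minimal definition can be sketched as follows. The rule used here (maximal runs of adjectives and nouns, trimmed so the phrase ends in a noun) is a simplification of real chunk grammars:

```python
def chunk_noun_phrases(tagged):
    """Extract maximal runs of adjectives/nouns that end in a noun."""
    phrases, current = [], []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag in ("NOUN", "ADJ"):
            current.append((word, tag))
        else:
            # trim trailing adjectives so the phrase ends in a noun
            while current and current[-1][1] != "NOUN":
                current.pop()
            if current:
                phrases.append(" ".join(w for w, _ in current))
            current = []
    return phrases

tagged = [("spherical", "ADJ"), ("fullerenes", "NOUN"), ("are", "VERB"),
          ("very", "ADV"), ("stable", "ADJ"), ("molecules", "NOUN")]
print(chunk_noun_phrases(tagged))  # ['spherical fullerenes', 'stable molecules']
```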

Normalizing: We correct small orthographic differences between multi-terms regarding the presence or absence of hyphens. For example, we consider that the multi-terms “single-strand polymer” and “single strand polymer” belong to the same class.

Stemming: Multi-terms are gathered together if they share the same stem. For example, singulars and plurals are automatically grouped into the same class (e.g. “fullerene” and “fullerenes” are two possible forms of the stem “fullerene”).
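The normalizing and stemming steps can be sketched together as follows; the plural-stripping rule is a naive stand-in for a real stemmer, and the helper names are illustrative:

```python
import re

def normalize(term):
    """Map hyphen/spacing variants to one class: 'single-strand polymer'
    and 'single strand polymer' become identical."""
    return re.sub(r"[-\s]+", " ", term.lower()).strip()

def naive_stem(word):
    # stand-in for a real stemmer: strip a final plural 's' (but not 'ss')
    return word[:-1] if word.endswith("s") and not word.endswith("ss") else word

def stem_class(term):
    return " ".join(naive_stem(w) for w in normalize(term).split())

print(stem_class("Single-strand polymers"))  # single strand polymer
print(stem_class("single strand polymer"))   # single strand polymer
print(stem_class("fullerenes"))              # fullerene
```

The two hyphen variants and the plural all map to the same stemmed class, which is then used as the counting unit below.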

The grammatical constraints provide an exhaustive list of possible multi-terms grouped into stemmed classes, but we still need to select the N most pertinent of them. Two assumptions are typically made by linguists when trying to identify the most significant multi-terms in a corpus: pertinent terms tend to appear more frequently and longer phrases are more likely to be relevant. 

To sort the list of candidate terms, we then apply a simple statistical criterion which entails the following steps:

Counting: We enumerate every multi-term belonging to a given stemmed class in the whole corpus to obtain its total number of occurrences (frequency). In this step, if two candidate multi-terms are nested, we only increment the frequency of the longer chain. For example, if “spherical fullerenes” is found in an abstract, we only increment the multi-stem “spheric fullerene”, not the smaller stem “fullerene”.

Items are then sorted according to their unithood, and the list is pruned to the 4N multi-stems with the highest frequency. This step removes less frequent multi-stems; more importantly, it makes it possible to run the second-order analysis that follows on the remaining list.
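The counting and pruning steps can be sketched as follows; substring containment stands in for the actual nesting test, and the helper is illustrative:

```python
from collections import Counter

def count_multistems(doc_stems, candidates):
    """Count candidate multi-stems per document; when a candidate is
    nested inside a longer detected one, only the longer chain counts."""
    freq = Counter()
    for stems in doc_stems:
        maximal = [s for s in stems
                   if not any(s != t and s in t for t in stems)]
        freq.update(s for s in maximal if s in candidates)
    return freq

docs = [["spheric fullerene", "fullerene"],  # nested: longer chain only
        ["fullerene"],
        ["carbon nanotube"]]
candidates = {"spheric fullerene", "fullerene", "carbon nanotube"}
freq = count_multistems(docs, candidates)
print(freq["spheric fullerene"], freq["fullerene"])  # 1 1

N = 1
pruned = [term for term, _ in freq.most_common(4 * N)]  # keep the 4N most frequent
```

In the first document, “fullerene” is nested inside “spheric fullerene”, so only the longer multi-stem is credited there.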

Last, we adopt an approach similar to van Eck et al. (2011) to get rid of irrelevant multi-terms that may still be very frequent, like “review of literature” or “past articles”. The rationale is that irrelevant terms should have an unbiased distribution compared to other terms in the list; that is to say, neutral terms may appear in any document in the corpus, whatever the precise thematics it addresses. We first compute the co-occurrence matrix M between each item in the list. We then define the termhood θ of a multi-stem as the sum of the chi-square values it takes with every other class in the list. We rank the list according to θ, and only the N most specific multi-stems are retained. Alternatively, one can also rank terms using G2 values (a semantic specificity measure like chi2), the pigeonhole measure (similar to Gf-idf scores: terms with a higher local frequency are ranked higher), or simply by frequency.
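The termhood step can be sketched as follows. The chi-square computation here, comparing each term’s cooccurrence profile with the profile expected from overall term weights, is a simplified stand-in for the actual implementation:

```python
def termhood(M):
    """theta(i): sum over j != i of (O_ij - E_ij)^2 / E_ij, where E_ij is
    the count expected if i cooccurred with partners in proportion to
    their overall weight (a simplified chi-square)."""
    terms = list(M)
    col = {j: sum(M[i].get(j, 0) for i in terms) for j in terms}
    grand = sum(col.values())
    theta = {}
    for i in terms:
        row = sum(M[i].values())
        theta[i] = 0.0
        for j in terms:
            if j == i or col[j] == 0:
                continue
            expected = row * col[j] / grand
            theta[i] += (M[i].get(j, 0) - expected) ** 2 / expected
    return theta

# toy symmetric cooccurrence counts; "N" plays the neutral term, cooccurring
# with everything roughly in proportion to overall weights
M = {
    "A": {"B": 10, "C": 0, "N": 5},
    "B": {"A": 10, "C": 0, "N": 5},
    "C": {"A": 0, "B": 0, "N": 2},
    "N": {"A": 5, "B": 5, "C": 2},
}
theta = termhood(M)
ranked = sorted(theta, key=theta.get, reverse=True)
print(ranked)
```

The neutral term’s unbiased cooccurrence profile gives it the lowest θ, so it is the first to be dropped when the list is cut to the N most specific multi-stems.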

  • van Eck, N. J., & Waltman, L. (2011). Text mining and visualization using VOSviewer. arXiv preprint arXiv:1109.2058.