Sdgs And Kets Tagger

The SDGs and KETs Tagger script automatically classifies textual content, such as publication abstracts, based on predefined categories from:

  • Sustainable Development Goals (SDGs): A set of 17 global goals established by the United Nations in 2015 as part of the 2030 Agenda for Sustainable Development. These goals provide a shared framework for achieving peace and prosperity for people and the planet, with each goal linked to specific targets and indicators.

  • Key Enabling Technologies (KETs): A group of six strategic technologies identified by the European Commission as vital for fostering innovation, competitiveness, and sustainable growth in Europe. These include micro- and nanoelectronics, nanotechnology, industrial biotechnology, advanced materials, photonics, and advanced manufacturing technologies.

The tagger can operate using either SDG or KET classification schemes, or a combined mode (ket_sdg) that integrates both frameworks for broader thematic coverage.

This tagging is based on a pre-trained classifier developed by the University of Sheffield through the GATE RISIS-KNOWMAK service ().

The RISIS-KNOWMAK Ontology is a structured classification system developed to support the analysis of research and innovation dynamics. It is used to categorize entities such as scientific publications, projects, patents, and organizations based on standardized vocabularies. The ontology integrates key domains including scientific fields, societal challenges (e.g. SDGs), and technological areas (e.g. KETs), enabling interoperability and comparative analysis across datasets.

Parameters


Textual Fields

Select the text fields from your corpus that the tagger should process (e.g., abstracts, titles, etc.).

Classification Type

Choose the tagging category:

  • ket: Key Enabling Technologies

  • sdg: Sustainable Development Goals

  • ket_sdg: Combines both KET and SDG tagging

Dataset Type

Specify the nature of the documents in your corpus:

  • publication

  • patent

  • project

Advanced Settings

 

Filters

Choose whether to activate filters:

  • yes : Filters are applied based on PMI (Pointwise Mutual Information) scores and thresholds.

  • no: No filtering; all results from GATE will be included.

Download the Dashboard

Choose whether to export the dashboard file with scores:

  • yes: Will generate and download a dashboard file with tagging scores.

  • no : No dashboard file will be downloaded.

Output

The result is a tagged version of your corpus, where textual elements (e.g., abstracts) are annotated with SDG and/or KET labels depending on your settings.

If “Download the dashboard” is set to “yes”, a .tsv file containing classification scores and metadata will be available for download.

This dataset provides detailed metrics about the presence and significance of keywords extracted from a set of documents, linked to thematic classes and topics. The scores combine statistical association measures (like PMI) with boosting logic to identify meaningful and distinctive terms across documents and topics.

Column Descriptions

Column Name Description
identifier Unique identifier for the document or entry analyzed (e.g., an abstract or publication ID).
class Thematic class or category assigned to the document or keyword, often representing a conceptual domain.
topicID The identifier of the topic (from clustering or topic modeling) linked to the keyword or document.
unboosted Base score of the keyword without any class-specific boosting. Often derived from frequency or statistical association.
unboosted_pmi Pointwise Mutual Information (PMI) score for the keyword without boosting, indicates informativeness.
Score Final score of the keyword, including boosting effects (e.g., class preference or weights).
Score_pmi Final PMI score after boosting, capturing both statistical association and class enhancement.
boostedBy Indicates the class or topic that boosted the keyword. This field remains blank if no boosting occurred.
diffunboostpmi Difference between boosted and unboosted PMI scores, measures the impact of boosting.
diffsuperclasspmi Measures the difference between the superclass-level PMI and the unboosted PMI, if a superclass exists.
alldiffpmi Sum of all PMI differences (boosted vs. unboosted and/or vs. superclass), indicating global significance shift.
total_keywords Total number of significant keywords identified in this class-topic pairing.
nbclass Number of different classes in which the keyword appears, used to assess specificity or dispersion.
keyword The keyword or term extracted and evaluated in context.
score_keyword Statistical score (e.g., TF-IDF or frequency weight) specifically for this keyword in the document/class context.
textlength Length of the document text (in characters) used for normalization or filtering.
selectorno Selection flag or number (e.g., "selected" or "discarded") based on filtering criteria.
ABSTRACT_TITLE Selected textual field (or concatenation of the selected textual fields) of the document used as the source text for keyword extraction.
avgboostedpmi Average PMI score of all boosted keywords in this context, measures global informativeness post-boosting.
avgalldiffpmi Average of all PMI differences for keywords, used to evaluate thematic distinctiveness or semantic shift.

 

Measures Used in the SDGs and KETs Tagger

The classification mechanism relies on several scoring techniques to associate text segments (e.g., abstracts, titles) with SDG or KET categories. Below are the main metrics involved:

Pointwise Mutual Information (PMI)


PMI is a statistical measure that evaluates how much more often two terms co-occur than would be expected by chance (Church, K. W., & Hanks, P. (1990)). In this context, it measures the association between a term in the text and a target category (e.g., SDG 7 or KET “Photonics”).

Formula:

​

Where:

  • P(x,y) is the probability of co-occurrence between term x and category y

  • P(x) and P(y) are their respective marginal probabilities

Use in Tagging:

  • Terms with higher PMI scores are more strongly associated with a category.

  • The algorithm computes PMI using pre-compiled co-occurrence tables extracted from reference corpora labeled with SDGs or KETs.

Boosted Scores


Boosted scores adjust raw PMI values by integrating contextual, positional, or hierarchical information. This prevents overrepresentation of terms that are frequent but not necessarily discriminative.

Boosting Techniques Include:

  • Keyword Boosting: If a term appears in a manually curated list of “seed keywords” per category, its score is increased.

  • Hierarchy Boosting: A term associated with a subcategory may inherit part of the score from its parent category (e.g., in SDG ontology).

  • Positional Weighting: Terms in the title or the beginning of the abstract may be given more weight.


PMI alone may not reflect importance or contextual relevance. Boosting allows the tagger to emphasize semantically or positionally significant terms.

Filter Logic

When filters is enabled, the SDGs and KETs Tagger applies a multi-step SQL filtering pipeline to retain only meaningful and unambiguous keyword–class–identifier associations.

Ensure Sufficient Keyword Support

Create a temporary table to keep only (identifier, class) pairs supported by at least two distinct keywords.

Filter Out Short Documents

Compute total character length of title + abstract and retain documents with ≥ 300 characters.

Compute Specificity Metrics

Keyword-level:

  • Number of associated classes and identifiers.

  • Min/max/avg values of boosted and unboosted PMI scores.

Class-level:

  • Average boosted PMI, unboosted PMI, and superclass PMI for each class (on sufficiently long documents with ≥2 keywords).

Keyword ambiguity:

  • Max difference in PMI scores across classes for each keyword (used to resolve ambiguity).

Apply Final Filtering Rules

In gate_dashboard_selection, we retain a classification if at least one of the following holds:

    • The (identifier, class) has ≥ 2 keywords.

    • The keyword is in the manual whitelist (e.g. "bioinformatics approach", "carbon farming").

    • The keyword has ≥ 3 words.

    • The keyword’s PMI scores (boosted or unboosted) exceed class averages.

    • The document is long enough and keyword_score ≥ 2.

We exclude any classification where:

    • The keyword is on the blacklist (e.g. "green", "technology").

    • None of the inclusion rules are satisfied.

Results are tagged as:

    • selectorno = 'selected' if retained.

    • selectorno = 'no' if filtered out.

🔕 When filters = false

The pipeline skips all filtering steps and all associations are retained.

Use Case: Visualizing Thematic Clusters and SDGs with Network Mapping

After processing your corpus using the SDGs and KETs Tagger script, you can generate a semantic network visualization to explore the distribution and relationships of tagged keywords across thematic areas and Sustainable Development Goals (SDGs).

This network graph reveals clusters of keywords that were annotated according to SDG classifications. Each color-coded cluster represents a distinct thematic domain, such as:

  • Sustainable agriculture

  • Pharmaceutical innovation

  • Waste management

  • Nutrition and disease prevention

  • Advanced polymers and materials

  • Air quality and pollutants

In addition, key concepts like “plant cells,” “nucleic acids,” or “enzymes” appear as central nodes. These terms frequently link multiple SDG-relevant themes, thereby highlighting cross-domain relevance.

Thanks to the SDGs and KETs Tagger script, this type of network mapping transforms textual data into a strategic landscape. It guides analysis, fosters insight, and supports decision-making in research planning, funding allocation, or innovation monitoring.

References

United Nations SDG overview: SDGs

European Commission – Accelerating technological change and hyperconnectivity : KETs

Maynard, D., Petrak, J., Song, X., & Funk, A. (2019). Report on Ontologies and Tagging.

GATE Classification Tool – Technical Documentation : RISIS-KNOWMAK GATE Classification

Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics

This work was partially supported by European Union under grant agreement No. 825091 Horizon2020 Research and Innovation Programme RISIS².