DivinWD: Exploring the Diversity of Scientific Publications in Wikidata

Exploring diversity indicators for scientific publications using Wikidata and external bibliographic sources

Corpus: 2010–2024 focus
Main dimensions: language · field · gender · geography · affiliation

What the paper is about

The paper investigates whether Wikidata can serve as a reliable basis for bibliometric analyses and for computing diversity indicators over the scientific record. It introduces DivinWD (DIVersity IN WikiData), a set of metrics and a reproducible pipeline that integrates Wikidata with major bibliographic infrastructures (Crossref, Dimensions, OpenAlex, Scopus, Semantic Scholar) plus organizational data from ROR.

Core idea: use Wikidata’s open, community-curated graph as a hub; validate and enrich it using external sources; then compute diversity metrics that can support monitoring inclusivity in science.

At a glance

  • 1.2M+ articles integrated (2010–2024 mentioned as the main analysis window)
  • 5 bibliographic sources compared to Wikidata
  • 54 distinct languages observed (for language diversity index)
  • 23 fields of study (Semantic Scholar S2FOS categories)

Main contributions

Pipeline overview (from the methodology figure)

  1. Data gathering: select Wikidata scholarly articles; extract DOIs, dates, languages, and author QIDs; collect author attributes (gender, citizenship, employer) and institutions (ROR IDs).
  2. Data processing: match records by DOI (publications) and ROR ID (organizations); enrich missing metadata (language, abstract, field of study, institution types).
  3. Data analysis: compute diversity metrics and visualize trends over time.
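
The DOI matching and enrichment in step 2 can be sketched as follows. This is a minimal illustration, not the paper's code: the record layout, the field names (`language`, `abstract`, `field_of_study`), and the helper names `normalize_doi` and `enrich_by_doi` are all hypothetical, and the sketch assumes that external values only fill fields that are empty on the Wikidata side.

```python
def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def enrich_by_doi(wikidata_records, external_records,
                  fields=("language", "abstract", "field_of_study")):
    """Fill missing fields in Wikidata records from DOI-matched external records.

    Both inputs are lists of dicts with a 'doi' key (hypothetical layout).
    Values already present on the Wikidata side are never overwritten.
    """
    # Index the external source by normalized DOI for constant-time lookup.
    index = {normalize_doi(r["doi"]): r for r in external_records if r.get("doi")}
    enriched = []
    for rec in wikidata_records:
        merged = dict(rec)  # copy so the input is left untouched
        match = index.get(normalize_doi(rec["doi"]))
        if match:
            for field in fields:
                if not merged.get(field) and match.get(field):
                    merged[field] = match[field]
        enriched.append(merged)
    return enriched
```

Normalizing before matching matters because DOIs are case-insensitive and sources store them in different forms (bare, `doi:` prefixed, or as a resolver URL). The same pattern would apply to organizations, with ROR IDs as the join key.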

Note: Field of study uses Semantic Scholar labels when available and is otherwise inferred from titles and abstracts via S2FOS. When gender or nationality is missing in Wikidata, it can be inferred from author names using Genderize.

Data selection rules (Wikidata subset)

  • Items must be instance of (P31) scholarly article.
  • Authors must be linked via author (P50) (not string-only P2093) and be instance of human (Q5).
  • Publication year must be consistent when multiple publication date (P577) statements exist.
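
The three rules above could be applied to already-downloaded items along these lines. The sketch assumes a simplified, hypothetical item layout (a dict mapping property IDs to lists of values, with P577 dates as ISO strings such as `2015-03-01`) and a hypothetical `author_types` lookup from author QID to its P31 values. It also assumes that an item needs at least one qualifying P50 author and exactly one distinct publication year to be kept; the summary does not spell out these edge cases. `Q13442814` is the Wikidata item for "scholarly article".

```python
SCHOLARLY_ARTICLE = "Q13442814"  # Wikidata item for "scholarly article"
HUMAN = "Q5"                     # Wikidata item for "human"


def select_item(item, author_types):
    """Return a cleaned record if the item passes all three rules, else None.

    item: dict with 'qid' plus property IDs mapped to lists of values
          (hypothetical simplified layout, not the raw Wikidata JSON).
    author_types: dict mapping author QID -> list of P31 values.
    """
    # Rule 1: the item must be an instance of scholarly article.
    if SCHOLARLY_ARTICLE not in item.get("P31", []):
        return None

    # Rule 3: all P577 statements must agree on the publication year.
    years = {date[:4] for date in item.get("P577", [])}
    if len(years) != 1:
        return None

    # Rule 2: keep only authors linked via P50 that are instances of human.
    # String-only authors (P2093) are ignored entirely.
    authors = [q for q in item.get("P50", []) if HUMAN in author_types.get(q, [])]
    if not authors:
        return None

    return {"qid": item["qid"], "year": int(next(iter(years))), "authors": authors}
```

For example, an item with P577 values `2015-03-01` and `2015-06-01` is kept with year 2015, while one with dates in both 2015 and 2016 is dropped as inconsistent.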

Key findings (high level)

Diversity metrics

The paper uses the Shannon diversity index to capture both richness (how many categories are present) and evenness (how balanced they are). Because some author attributes can belong to multiple categories at once (e.g., multiple citizenships), it adapts the formulation by splitting each author’s contribution across their categories.
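
One way to write this adaptation down is given here. The notation is ours, not necessarily the paper's, and the equal split across an author's categories is an assumption. Let $N$ be the number of authors, $C_a$ the set of categories of author $a$, and $K$ the number of observed categories.

```latex
% Assumed notation: w_{a,c} is the share of author a credited to category c.
% Requires amsmath for the cases environment and \text.
\[
w_{a,c} =
\begin{cases}
  \dfrac{1}{|C_a|} & \text{if } c \in C_a, \\[6pt]
  0 & \text{otherwise,}
\end{cases}
\qquad
p_c = \frac{1}{N} \sum_{a=1}^{N} w_{a,c},
\qquad
H = -\sum_{c=1}^{K} p_c \ln p_c .
\]
```

Because each author's weights sum to 1, the proportions $p_c$ sum to 1 as well, so $H$ remains a valid Shannon index. An author with two citizenships contributes one half to each, rather than counting twice.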

Indices are normalized to the range [0, 1] using the maximum possible value based on the number of observed categories.
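
A minimal sketch of the normalized index follows, assuming the maximum possible value is $\ln K$ for $K$ observed categories and that multi-category authors are split equally. The function name `normalized_shannon` and the input layout (one list of categories per author) are hypothetical. Returning 0 when only one category is observed is a convention chosen here, since $\ln 1 = 0$ would otherwise cause a division by zero.

```python
import math
from collections import defaultdict


def normalized_shannon(author_categories):
    """Shannon diversity normalized to [0, 1], with fractional author weights.

    author_categories: iterable of category lists, one per author, e.g.
        [["FR"], ["DE", "US"]]. An author with several categories
        contributes 1/len(categories) to each (assumed equal split).
    Authors with no category are skipped.
    """
    weights = defaultdict(float)
    n_authors = 0
    for categories in author_categories:
        unique = set(categories)
        if not unique:
            continue
        n_authors += 1
        for category in unique:
            weights[category] += 1.0 / len(unique)

    k = len(weights)
    if k <= 1:
        # Zero or one observed category: no diversity by convention.
        return 0.0

    # Shannon entropy over the category proportions p_c = weight / n_authors.
    h = -sum((w / n_authors) * math.log(w / n_authors) for w in weights.values())
    # Divide by the maximum entropy for k categories, ln(k).
    return h / math.log(k)
```

Two authors with different single citizenships give the maximum value of 1, two authors with the same citizenship give 0, and unbalanced mixes fall in between.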

Limitations and cautions highlighted

Suggested “what to do with this”