What the paper is about
The paper investigates whether Wikidata can serve as a reliable basis for bibliometric analyses and for computing diversity indicators over the scientific record. It introduces DivinWD (DIVersity IN WikiData), a set of metrics and a reproducible pipeline that integrates Wikidata with major bibliographic infrastructures (Crossref, Dimensions, OpenAlex, Scopus, Semantic Scholar) plus organizational data from ROR.
Core idea: use Wikidata’s open, community-curated graph as a hub; validate and enrich it using external sources; then compute diversity metrics that can support monitoring inclusivity in science.
At a glance
Main contributions
- A reproducible pipeline to extract and analyze scholarly publication metadata from Wikidata at scale.
- A coverage/accuracy comparison between Wikidata and Crossref, Dimensions, OpenAlex, Scopus, and Semantic Scholar.
- Diversity indicators computed across language, discipline, gender, geography (citizenship), and affiliation type, with practical guidelines.
Pipeline overview (from the methodology figure)
- Data gathering: select Wikidata scholarly articles; extract DOIs, dates, languages, and author QIDs; collect author attributes (gender, citizenship, employer) and institutions (ROR IDs).
- Data processing: match records by DOI (publications) and ROR ID (organizations); enrich missing metadata (language, abstract, field of study, institution types).
- Data analysis: compute diversity metrics and visualize trends over time.
Note: Field of study is assigned using Semantic Scholar labels when available; otherwise inferred from titles/abstracts via S2FOS. Gender and nationality can be inferred from names using Genderize when missing in Wikidata.
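The DOI matching step in the data-processing stage can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record layout (dicts with a "doi" key), the source names, and the helper names normalize_doi and match_by_doi are all assumptions made for the example. It relies only on the fact that DOIs are case-insensitive and are often stored with a resolver prefix.

```python
from typing import Dict, Iterable, List, Optional

# Common ways a DOI is written with a resolver or scheme prefix.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Lowercase a DOI and strip a known prefix; return None if it is unusable."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # All DOIs start with the directory indicator "10."
    return doi if doi.startswith("10.") else None


def match_by_doi(
    wikidata_items: Iterable[dict],
    external_sources: Dict[str, Iterable[dict]],
) -> Dict[str, List[str]]:
    """Map each valid Wikidata DOI to the (hypothetical) sources that contain it."""
    # Build one set of normalized DOIs per external source.
    indexes = {
        name: {d for d in (normalize_doi(r.get("doi")) for r in records) if d}
        for name, records in external_sources.items()
    }
    matches: Dict[str, List[str]] = {}
    for item in wikidata_items:
        doi = normalize_doi(item.get("doi"))
        if doi is None:
            continue  # items without a usable DOI cannot be matched
        matches[doi] = sorted(n for n, index in indexes.items() if doi in index)
    return matches
```

A DOI mapped to an empty list has no counterpart in any of the supplied sources, which is one simple way to quantify the matchability reported in the findings.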
Data selection rules (Wikidata subset)
- Items must be instance of (P31) scholarly article.
- Authors must be linked via author (P50) (not string-only P2093) and be instance of human (Q5).
- Publication year must be consistent when multiple publication date (P577) statements exist.
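As a rough sketch, the three rules above can be expressed as a filter over already-downloaded items. The dict layout used here (property IDs as keys, dates as ISO strings such as "2015-03-01") and the function names are assumptions for illustration. The sketch also takes one possible reading of the author rule: an item qualifies if it has at least one P50 author and every P50 author is a human, while P2093 name strings are ignored. Q13442814 is the Wikidata item for "scholarly article".

```python
from typing import Any, Dict, List, Optional

SCHOLARLY_ARTICLE = "Q13442814"  # Wikidata item "scholarly article"
HUMAN = "Q5"                     # Wikidata item "human"


def publication_year(dates: List[str]) -> Optional[int]:
    """Return the year if all P577 values share one year, otherwise None."""
    years = {d[:4] for d in dates}  # assumes ISO-like strings, e.g. "2015-03-01"
    return int(next(iter(years))) if len(years) == 1 else None


def is_selected(item: Dict[str, Any]) -> bool:
    """Apply the three selection rules to one (hypothetical) item record."""
    # Rule 1: the item is an instance of scholarly article.
    if SCHOLARLY_ARTICLE not in item.get("P31", []):
        return False
    # Rule 2: authors are linked entities (P50) typed as human;
    # string-only authors (P2093) do not count.
    authors = item.get("P50", [])
    if not authors or any(HUMAN not in a.get("P31", []) for a in authors):
        return False
    # Rule 3: all publication dates agree on the year.
    return publication_year(item.get("P577", [])) is not None
```

Whether an item with a mix of P50 and P2093 authors should be kept is a design choice that the summary does not settle; the filter above keeps it and simply disregards the string-only names.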
Key findings (high level)
- Strong coverage biases in Wikidata scholarly items: overrepresentation of English-language publications and of medicine/hard sciences.
- Demographic and geographic skew: authors are disproportionately from Western countries, and the male category dominates the gender distribution.
- High matchability by DOI: most selected Wikidata articles have at least one matching record in the integrated bibliographic sources, and their author counts can mostly be confirmed by at least one external source.
- Affiliation data is sparse: only a minority of authors have employer data that is both aligned in time with the publication and typed by institution category, as the affiliation diversity dimension requires.
Diversity metrics
The paper uses the Shannon diversity index to capture both richness (how many categories are present) and evenness (how balanced they are). Because some author attributes can belong to multiple categories at once (e.g., multiple citizenships), it adapts the formulation by splitting each author’s contribution across their categories.
- Articles: language, field of study
- Authors: gender, citizenship (geography), affiliation type (ROR taxonomy)
Indices are normalized to the range [0, 1] using the maximum possible value based on the number of observed categories.
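To make the computation concrete, here is a minimal sketch of a normalized Shannon index with fractional contributions. It uses the standard definition H = -sum(p_i * ln p_i) and divides by ln K, where K is the number of observed categories. The equal split of each entity across its categories is an assumption about how the adaptation works, and the function name and input layout (one list of category labels per entity) are illustrative only.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable


def shannon_diversity(
    entities: Iterable[Iterable[Hashable]], normalize: bool = True
) -> float:
    """Shannon index over entities that may belong to several categories.

    Each entity contributes a total weight of 1, split equally across its
    categories (assumed scheme). Entities with no category are skipped.
    """
    weights: Dict[Hashable, float] = defaultdict(float)
    for categories in entities:
        cats = set(categories)
        if not cats:
            continue  # missing attribute: excluded from the index
        share = 1.0 / len(cats)
        for c in cats:
            weights[c] += share

    total = sum(weights.values())
    if total == 0:
        return 0.0

    # H = -sum(p * ln p) over observed categories.
    h = -sum((w / total) * math.log(w / total) for w in weights.values())
    if not normalize:
        return h

    # Maximum H for K observed categories is ln K; a single category gives 0.
    k = len(weights)
    return h / math.log(k) if k > 1 else 0.0
```

For example, shannon_diversity([["US"], ["FR"]]) and shannon_diversity([["US", "FR"]]) both describe a perfectly even two-category distribution, while a population in which every author has the same citizenship yields 0. Because the normalization uses only observed categories, values are comparable across years only if that caveat is kept in mind.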
Limitations and cautions highlighted
- Community-curated data quality: Wikidata constraints are flexible, which supports varied modeling choices but also permits inconsistencies.
- Inference risks: name-based demographic inference (e.g., Genderize) has known limitations and can misclassify some groups.
- Privacy/ethics: diversity monitoring must balance transparency and reproducibility with minimizing exposure of personal data.
Suggested “what to do with this”
- Use Wikidata as an open hub for bibliometrics, but validate coverage and triangulate with external sources.
- Report diversity indicators with explicit notes on missingness, inference, and selection criteria.
- Use the resulting indicators to support auditing and evaluation of DEI initiatives in science (at aggregate levels).