Clustering of Scientific Citations in Wikipedia

Machine learning algorithms can automatically identify topics in Wikipedia's science articles.
Finn Årup Nielsen

Abstract

The instances of templates in Wikipedia form an interesting data set of structured information. The so-called cite journal template is primarily used for citations to articles in scientific journals. Citations made with the template can be extracted and analyzed: non-negative matrix factorization is performed on an (article × journal) matrix, resulting in a soft clustering of Wikipedia articles and scientific journals, with each cluster more or less representing a scientific topic.

The study

With a machine learning algorithm it is possible to get an overview of how Wikipedia cites off-site scientific journals. As in a previous study, the citations that Wikipedia authors make were extracted and counted. The algorithm then analyzes the counts and will typically put the Wikipedia articles and scientific journals into “meaningful” clusters that represent scientific topics.
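The extraction and counting step can be illustrated with a minimal Python sketch. It is not the code used in the study: the regular expressions, the function name `count_journal_citations` and the toy `pages` dictionary are assumptions made for illustration, and the pattern ignores complications such as nested templates or piped wiki links inside the journal field.

```python
import re
from collections import Counter

# Body of a {{cite journal ...}} template, starting at its first "|".
# Assumption: templates are not nested inside the citation.
CITE_JOURNAL = re.compile(r"\{\{\s*cite journal\s*(\|.*?)\}\}",
                          re.IGNORECASE | re.DOTALL)
# The "journal = ..." parameter inside a template body.
JOURNAL_FIELD = re.compile(r"\|\s*journal\s*=\s*([^|}]*)", re.IGNORECASE)


def count_journal_citations(pages):
    """Count citations per (article title, journal name) pair.

    `pages` maps an article title to its raw wikitext.
    """
    counts = Counter()
    for title, wikitext in pages.items():
        for body in CITE_JOURNAL.findall(wikitext):
            match = JOURNAL_FIELD.search(body)
            if match:
                # Drop surrounding whitespace and simple [[...]] link brackets.
                journal = match.group(1).strip(" \t\n[]")
                if journal:
                    counts[(title, journal)] += 1
    return counts


# Hypothetical wikitext, made up for this example.
pages = {
    "Example article A": (
        "{{cite journal |author=Doe J |title=T |journal=Nature |year=2001}} "
        "text {{Cite journal | journal = Science | title = X}} "
        "{{cite journal|journal=Nature|title=Y}}"
    ),
    "Example article B": "No journal citations. {{cite book |title=Z}}",
}
counts = count_journal_citations(pages)
```

The resulting counts, keyed by (article, journal) pairs, are the kind of data that can be arranged into an (article × journal) matrix for the clustering.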

Below is an image of the clusters from an analysis of one 2007 dump of Wikipedia. In this particular analysis the dominating clusters in the model are about astronomy, Einstein, medicine, intelligence, bacteria and human leukocyte antigen.

Cluster bush visualization of clusters in outbound scientific citations in Wikipedia. On each cluster is shown part of the title of representative Wikipedia articles for the cluster.

The DTU Informatics Cite journal miner is a set of web pages presenting the results of the newest analyses of this data. Highly cited journals are listed, as well as the clusters.

Technical details

Non-negative matrix factorization (NMF) is used to decompose a matrix whose elements count the number of times a scientific journal is cited from a Wikipedia article. The Brede Toolbox implements the algorithm in Matlab. This algorithm has also been used in a previous neuroinformatics text mining study.
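To make the decomposition concrete, here is a minimal NumPy sketch of NMF on a small count matrix. It uses the Lee and Seung multiplicative updates for the squared-error cost, which is an assumption: the text does not say which NMF variant or settings the Brede Toolbox uses. The function name `nmf`, the toy matrix and the number of components are likewise made up for illustration.

```python
import numpy as np


def nmf(X, n_components, n_iter=500, eps=1e-9, seed=0):
    """Factorize non-negative X (articles x journals) as W @ H.

    Multiplicative updates for the squared-error cost; W and H are
    initialized with positive random values and stay non-negative.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    W = rng.random((n_rows, n_components)) + eps
    H = rng.random((n_components, n_cols)) + eps
    for _ in range(n_iter):
        # eps in the denominators guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy counts: rows are hypothetical articles, columns hypothetical journals.
X = np.array([
    [5, 3, 0, 0],
    [4, 6, 0, 0],
    [0, 0, 7, 2],
    [0, 0, 3, 8],
], dtype=float)

W, H = nmf(X, n_components=2)

# The clustering is soft: every article (row of W) and every journal
# (column of H) has a loading on each cluster. Taking the largest
# loading gives a hard assignment for inspection.
article_cluster = W.argmax(axis=1)
journal_cluster = H.argmax(axis=0)
```

In this sketch each column of W together with the matching row of H plays the role of one cluster, with articles and journals that load strongly on it acting as its representatives.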

Not all of the more than two million Wikipedia articles are analyzed with NMF, only those that include the Cite journal template, which amount to only tens of thousands of articles.

References and Downloads

The study was accepted for presentation on the third day of the 2008 Wikimania conference. I made a comment, One level deeper: Polymorphism wiki, on the PLoS Biology web-site that pertains to the work.

Part of the question session of the presentation at Wikimania was filmed by the Bibliotheca Alexandrina. An MPEG movie (188MB) is available.

Author

Finn Årup Nielsen is a senior researcher at the Department of Informatics and Mathematical Modelling at the Technical University of Denmark on a grant from the Lundbeckfonden to CIMBI. He is also attached to the Neurobiology Research Unit at the Copenhagen University Hospital Rigshospitalet. He contributes from time to time to the Danish and English language Wikipedias as the “fnielsen” user.

Another study by the author: Scientific Citations in Wikipedia

$Id: Nielsen2008Clustering.html,v 1.14 2008/09/01 17:19:31 fn Exp $