JHU Math Department arXiv Collaboration Analysis

To familiarize myself with the software, I decided to create a data set and load it into Centrifuge.

I needed a data set that would play to Centrifuge’s strengths: visualizing big data to discover patterns and relationships.

I decided to use the arXiv, a repository where researchers can submit papers instead of, or in addition to, publishing in traditional but expensive academic journals.  Some, but not all, of the papers are preprints of articles due to appear in academic journals.

The arXiv hosts papers on a variety of topics, but its core focus is mathematics, physics and other related fields.  Some of the papers contain groundbreaking research.  Others are sharp rebukes of perceived nonsense.

I decided to focus my analysis on my alma mater, the Johns Hopkins University Department of Mathematics.  To obtain the data, I used the arXiv’s article metadata API.  First, I retrieved the metadata for all of the math department’s faculty members.  I also retrieved the metadata for anyone who coauthored a paper with a member of the faculty, and for anyone who coauthored a paper with one of those first-order coauthors.  In graph theory terminology, I selected the nodes representing authors at a distance of at most 2 from a faculty member.
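The "distance of at most 2" selection amounts to a breadth-first search that stops expanding after two hops.  Here is a minimal sketch of that idea, not the script I actually ran.  The function name `authors_within` and the `coauthors_of` callback are hypothetical; in practice the callback would wrap a lookup against the arXiv metadata, but here it reads from a small made-up dictionary.

```python
from collections import deque


def authors_within(seeds, coauthors_of, max_distance=2):
    """Return {author: distance} for every author within max_distance of a seed.

    coauthors_of is a hypothetical callback that returns the coauthors of one
    author; a real version would query the article metadata.
    """
    distance = {name: 0 for name in seeds}
    queue = deque(seeds)
    while queue:
        author = queue.popleft()
        # Authors already at the limit are kept but not expanded further.
        if distance[author] == max_distance:
            continue
        for other in coauthors_of(author):
            if other not in distance:
                distance[other] = distance[author] + 1
                queue.append(other)
    return distance


# Made-up toy data: a chain faculty_1 - coauthor_1 - coauthor_2 - coauthor_3.
toy_coauthors = {
    "faculty_1": ["coauthor_1"],
    "coauthor_1": ["faculty_1", "coauthor_2"],
    "coauthor_2": ["coauthor_1", "coauthor_3"],
    "coauthor_3": ["coauthor_2"],
}
selected = authors_within(["faculty_1"], lambda name: toy_coauthors.get(name, []))
```

In the toy data, `coauthor_3` is three hops from the faculty member, so it is left out of `selected`.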

Before I share my results, I should explain what a relationship graph is.  Simply put, a relationship graph is a way to visualize data that highlights the interconnections between data elements.

A simple relationship graph

In the relationship graph above, each circle represents an author.  If there is a line between the circles, the authors collaborated on a paper.  For example, the relationship graph above shows that W Wilson collaborated with Nitu Kitchloo.  In turn, Nitu Kitchloo collaborated with Jack Morava.
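To make the circles-and-lines idea concrete, here is a rough sketch of how the lines (edges) could be derived from per-paper author lists.  The function name `collaboration_edges` and the two-paper input are illustrative assumptions; only the three author names come from the graph above.

```python
from itertools import combinations


def collaboration_edges(papers):
    """Turn a list of per-paper author lists into a set of undirected edges.

    Each edge is a (name, name) tuple with the names in sorted order, so the
    same pair of authors always produces the same tuple.
    """
    edges = set()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges


# Illustrative input: one hypothetical paper per collaboration shown above.
papers = [
    ["W Wilson", "Nitu Kitchloo"],
    ["Nitu Kitchloo", "Jack Morava"],
]
edges = collaboration_edges(papers)
```

A paper with three or more authors would contribute one edge for every pair of its authors.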

If I were only interested in a pair of authors, it would be easy to use a SQL database to determine if they had collaborated.  However, if I needed to identify interconnected groups of authors, a SQL database would be more difficult to use.  I would need complex nested joins to identify groups of collaborators.  Even if I did manage to extract the data, I would still be unable to visualize the relationships.
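For comparison, the easy case (did two particular authors collaborate?) needs only a single self-join.  The sketch below uses Python's built-in `sqlite3` module with a hypothetical `authorship` table of (paper_id, author) rows; the table layout and the paper ids are my own assumptions for illustration.

```python
import sqlite3


def have_collaborated(conn, author_a, author_b):
    """Return True if the two authors share at least one paper."""
    row = conn.execute(
        """
        SELECT 1
        FROM authorship AS x
        JOIN authorship AS y ON x.paper_id = y.paper_id
        WHERE x.author = ? AND y.author = ?
        LIMIT 1
        """,
        (author_a, author_b),
    ).fetchone()
    return row is not None


# Hypothetical in-memory table with made-up paper ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authorship (paper_id INTEGER, author TEXT)")
conn.executemany(
    "INSERT INTO authorship VALUES (?, ?)",
    [
        (1, "W Wilson"),
        (1, "Nitu Kitchloo"),
        (2, "Nitu Kitchloo"),
        (2, "Jack Morava"),
    ],
)
```

Asking whether W Wilson and Jack Morava are linked through a chain of collaborators is a different matter: each extra hop in the chain means another self-join, which is where the plain SQL approach becomes awkward.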


The JHU Math Department

The above graph shows the entire math department, plus any coauthors up to 2 degrees of separation away.  Authors currently on the JHU faculty are shown in yellow.  Those who are not currently on the JHU faculty are shown in blue.  The circles for the JHU authors are all the same size.  For the non-JHU authors, the size of the circle represents the number of articles.  Observe that the graph has two large connected components.  That is, there are two large groups of authors who are interconnected through collaboration relationships.
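Centrifuge found the components for me visually, but the same grouping can be computed directly.  Here is a minimal sketch, assuming the graph is stored as a dictionary that maps each author to a set of coauthors; the function name `connected_components` and the single-letter authors are made up for illustration.

```python
def connected_components(adjacency):
    """Return the connected components of an undirected graph, largest first.

    adjacency maps each node to an iterable of its neighbors.
    """
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first walk that collects everything reachable from start.
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Made-up toy graph: a group of three, a group of two, and one isolated author.
toy_graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b"},
    "d": {"e"},
    "e": {"d"},
    "f": set(),
}
groups = connected_components(toy_graph)
```

An author with no collaborators in the data ends up as a component of size one.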


The largest component.


The second largest component.


The remaining author.

I should note that there were a few minor issues with getting the article metadata from the arXiv.  Not every author participates in the arXiv’s authority control system.  As a result, some authors may have been confused with others who have similar names.  In addition, it’s important to remember that the arXiv isn’t the only venue for publication.  Any collaboration outside of the arXiv isn’t shown in the graphs above.

