Demonstrating ML on graphs
Analyzing arXiv articles with neo4j

Introduction
Graph databases promise to offer new capabilities to data scientists. In this article, we demonstrate some of them using arXiv metadata. In particular, we use neo4j and Python to analyze a sample of 600 arXiv articles. All Python code, as well as the raw data used for the example, can be found on GitHub.
For anyone not familiar with it, arXiv is “a free distribution service and an open-access archive for 2,036,494 scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics” (source: arxiv.org).
Our sample consists of 600 mathematics articles from arXiv. For each article we use only its id, its title, its list of authors, and the area of mathematics it belongs to (math subject classification). Note that an article can have a primary classification as well as secondary classifications. The code used for extracting the data and importing it into neo4j can also be found on GitHub.
Our graph has three kinds of nodes:
- articles (colored blue)
- authors (colored orange)
- classifications (colored pink)
It has two kinds of directed edges:
- edges from authors to articles they have written
- edges from articles to classifications
An example can be seen in the next image.
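To make the graph model concrete, here is a minimal in-memory sketch in plain Python, before any database is involved. The record layout (`Article`), the relationship names `WROTE` and `IN_CLASSIFICATION`, and the placeholder sample rows are assumptions made for illustration; they are not taken from the article's repository.

```python
from collections import namedtuple

# Hypothetical record layout: one row per article (names are assumptions).
Article = namedtuple("Article", ["id", "title", "authors", "classifications"])


def build_graph(articles):
    """Return (nodes, edges) with three node kinds and two directed edge kinds."""
    nodes = {}     # (kind, key) -> dict of properties
    edges = set()  # (source node, relation, target node)
    for art in articles:
        article_node = ("Article", art.id)
        nodes[article_node] = {"title": art.title}
        for name in art.authors:
            author_node = ("Author", name)
            nodes.setdefault(author_node, {})
            # author -> article they have written
            edges.add((author_node, "WROTE", article_node))
        for code in art.classifications:
            class_node = ("Classification", code)
            nodes.setdefault(class_node, {})
            # article -> classification (primary or secondary)
            edges.add((article_node, "IN_CLASSIFICATION", class_node))
    return nodes, edges


# Placeholder rows, not real arXiv records.
sample = [
    Article("0001", "Title A", ["Author X", "Author Y"], ["math.AG", "math.NT"]),
    Article("0002", "Title B", ["Author Y"], ["math.NT"]),
]
nodes, edges = build_graph(sample)
print(len(nodes), "nodes,", len(edges), "edges")
```

Because nodes are keyed by kind and identifier, an author or classification shared by several articles appears only once, which is what lets the graph connect articles to each other.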

Importing data to neo4j
The first step is to insert the data into neo4j. The next image shows an example of the available data (authors’ names are blurred). We are going to use the first four columns.
We are going to use code similar to that in Tomaz Bratanic’s “Construct a biomedical knowledge graph with NLP”. The code below for connecting to neo4j, executing neo4j queries, and importing data is taken or adapted from that article…
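As a rough illustration of what such an import can look like (this is a sketch, not the code from that article or from the repository), the example below turns the four columns into query parameters and sends them to neo4j in one batched Cypher statement. The node labels, relationship types, property names, and the assumption that authors are comma-separated and classifications are space-separated are all hypothetical. It uses the official `neo4j` Python driver, which is imported only when `import_rows` is called.

```python
# Hypothetical Cypher: labels, relationship types, and property names
# are illustrative assumptions, not the article's actual schema.
IMPORT_QUERY = """
UNWIND $rows AS row
MERGE (a:Article {id: row.id})
SET a.title = row.title
FOREACH (name IN row.authors |
    MERGE (p:Author {name: name})
    MERGE (p)-[:WROTE]->(a))
FOREACH (code IN row.classifications |
    MERGE (c:Classification {code: code})
    MERGE (a)-[:IN_CLASSIFICATION]->(c))
"""


def prepare_rows(records):
    """Convert (id, title, authors, classifications) tuples to query parameters.

    Assumes authors are comma-separated and classifications are
    whitespace-separated strings; adjust to the real column format.
    """
    rows = []
    for article_id, title, authors, classifications in records:
        rows.append({
            "id": article_id,
            "title": title.strip(),
            "authors": [n.strip() for n in authors.split(",") if n.strip()],
            "classifications": classifications.split(),
        })
    return rows


def import_rows(uri, user, password, rows):
    """Send all rows to neo4j in a single batched query."""
    from neo4j import GraphDatabase  # requires the neo4j package

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            session.run(IMPORT_QUERY, rows=rows).consume()
    finally:
        driver.close()


# Placeholder record, not real arXiv data.
rows = prepare_rows([("0001", " Title A ", "Author X, Author Y", "math.AG math.NT")])
print(rows[0])
# import_rows("bolt://localhost:7687", "neo4j", "password", rows)  # needs a running database
```

Using `MERGE` rather than `CREATE` keeps the import idempotent: rerunning it does not duplicate authors or classifications that already exist.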