Demonstrating ML on graphs
Analyzing arXiv articles with neo4j
Introduction
Graph databases promise to offer new capabilities to data scientists. In this article, we demonstrate some of them using arXiv metadata. In particular, we use neo4j and Python to analyze a sample of 600 arXiv articles. All the Python code, as well as the raw data used for the example, can be found on GitHub.
For anyone not familiar with it, arXiv is
"a free distribution service and an open-access archive for 2,036,494 scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics" (source: arxiv.org)
Our sample consists of 600 mathematics articles from arXiv. We use only each article's id, its title, its list of authors, and the area of mathematics it belongs to (its math subject classification). Note that an article can have a primary classification as well as secondary ones. The code used for extracting the data and importing it into neo4j can also be found on GitHub.
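As a rough illustration of the extraction step, the sketch below reduces one metadata record to the four fields we keep. It is not the code from the repository. It assumes a hypothetical JSON layout with the keys id, title, authors (a list of names) and categories (a space-separated string whose first entry is the primary classification), and the sample record is a made-up placeholder rather than a real article.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Article:
    """The only fields of an arXiv record that the analysis uses."""
    arxiv_id: str
    title: str
    authors: List[str]
    primary_category: str
    secondary_categories: List[str]


def parse_record(line: str) -> Article:
    """Turn one JSON-encoded metadata record into an Article.

    Assumed (hypothetical) layout: keys "id", "title", "authors" and
    "categories", with the primary classification listed first.
    """
    raw = json.loads(line)
    categories = raw["categories"].split()
    return Article(
        arxiv_id=raw["id"],
        # Collapse line breaks and repeated spaces inside the title.
        title=" ".join(raw["title"].split()),
        authors=[name.strip() for name in raw["authors"]],
        primary_category=categories[0],
        secondary_categories=categories[1:],
    )


# Placeholder record for illustration only; not taken from the sample.
sample = json.dumps({
    "id": "0000.00000",
    "title": "An example\n  title",
    "authors": ["A. Author", "B. Author "],
    "categories": "math.CO math.PR",
})

article = parse_record(sample)
print(article)
```

Keeping the primary classification separate from the secondary ones at this stage makes it easy to treat the two differently later on.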
The nodes in our graph will be of three different kinds:
- articles (colored blue)
- authors (colored orange)