Demonstrating ML on graphs

Analyzing arXiv articles with neo4j

Graph of arXiv articles — Image by author

Introduction

Graph databases promise to offer new capabilities to data scientists. In this article, we demonstrate some of them using arXiv metadata. In particular, we are going to use neo4j and Python to analyze a sample of 600 arXiv articles. All the Python code, as well as the raw data used for the example, can be found on GitHub.

For anyone not familiar with it, arXiv is

a free distribution service and an open-access archive for 2,036,494 scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics (source: arxiv.org)

Our sample consists of 600 Mathematics articles from arXiv. We are going to use only the id of each article, its title, its list of authors, and the area of mathematics it belongs to (math subject classification). Note that an article can have a primary classification as well as secondary ones. The code used for extracting the data and importing it into neo4j can also be found on GitHub.

The nodes in our graph are of three different kinds:

  1. articles (colored blue)
  2. authors (colored orange)
  3. classifications (colored pink)

The (directed) edges in our graph are of two different kinds:

  1. edges from authors to articles they have written
  2. edges from articles to classifications

An example can be seen in the next image.

Example of nodes and edges in our arXiv metadata graph
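
The node and edge kinds described above can be sketched in plain Python before any database is involved. This is only an illustration under assumptions: the record keys (`id`, `title`, `authors`, `categories`), the `Node` and `Edge` classes, and the sample records are all made up for the example and are not taken from the article's data or code. The sketch also does not distinguish primary from secondary classifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str  # "article", "author", or "classification"
    key: str   # arXiv id, author name, or classification code


@dataclass(frozen=True)
class Edge:
    source: Node  # edges are directed: source -> target
    target: Node


def build_graph(records):
    """Turn article records into a set of nodes and a set of directed edges."""
    nodes, edges = set(), set()
    for rec in records:
        # The title would be a property of the article node; it is ignored here.
        article = Node("article", rec["id"])
        nodes.add(article)
        for name in rec["authors"]:
            author = Node("author", name)
            nodes.add(author)
            edges.add(Edge(author, article))          # author -> article
        for code in rec["categories"]:
            classification = Node("classification", code)
            nodes.add(classification)
            edges.add(Edge(article, classification))  # article -> classification
    return nodes, edges


# Hypothetical sample records (not real arXiv entries).
sample = [
    {"id": "0000.00001", "title": "A made-up title",
     "authors": ["A. Author", "B. Author"], "categories": ["math.CO", "math.PR"]},
    {"id": "0000.00002", "title": "Another made-up title",
     "authors": ["B. Author"], "categories": ["math.CO"]},
]

nodes, edges = build_graph(sample)
print(len(nodes), len(edges))
```

Because nodes are keyed by kind and value, an author or classification shared by several articles appears only once, which is what lets the graph connect articles through their common authors and subject areas.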

Importing data to neo4j

The first step is to load the data into neo4j. The next image shows an example of the available data (authors’ names are blurred). We are going to use the first four columns.

We are going to use code similar to that in Tomaz Bratanic’s “Construct a biomedical knowledge graph with NLP”. The code below for connecting to neo4j, executing neo4j queries, and importing data is taken or adapted from that article…
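
For orientation, here is a minimal sketch of what connecting to neo4j and importing such data could look like with the official `neo4j` Python driver. It is not the code from the referenced article or from this article's repository. The node labels (`Article`, `Author`, `Classification`), the relationship types (`WROTE`, `IN_CATEGORY`), the property names, the record keys, and the helper names `to_rows` and `import_articles` are all assumptions made for illustration.

```python
# Cypher that creates (or reuses) one node per article, author and
# classification, plus the two kinds of directed edges.
# All labels and relationship types here are hypothetical.
IMPORT_QUERY = """
UNWIND $rows AS row
MERGE (a:Article {id: row.id})
SET a.title = row.title
FOREACH (name IN row.authors |
    MERGE (p:Author {name: name})
    MERGE (p)-[:WROTE]->(a))
FOREACH (code IN row.categories |
    MERGE (c:Classification {code: code})
    MERGE (a)-[:IN_CATEGORY]->(c))
"""


def to_rows(records):
    """Keep only the four fields used in the graph, with whitespace trimmed."""
    return [
        {
            "id": rec["id"].strip(),
            "title": rec["title"].strip(),
            "authors": [name.strip() for name in rec["authors"]],
            "categories": [code.strip() for code in rec["categories"]],
        }
        for rec in records
    ]


def import_articles(uri, user, password, rows):
    """Send all rows to neo4j in a single parameterized query."""
    # Imported here so the rest of the sketch works without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            session.run(IMPORT_QUERY, rows=rows).consume()


# Hypothetical record, not a real arXiv entry.
rows = to_rows([
    {"id": " 0000.00001", "title": "A made-up title ",
     "authors": ["A. Author", " B. Author"], "categories": ["math.CO"]},
])

# Requires a running neo4j instance; URI and credentials are placeholders.
# import_articles("bolt://localhost:7687", "neo4j", "password", rows)
```

Using `MERGE` rather than `CREATE` keeps the import idempotent: running it twice does not duplicate authors or classifications that appear in several articles. Passing the data as a `$rows` parameter with `UNWIND` avoids building query strings by hand.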
