Natural Language Processing has made huge advances in recent years. Currently, various implementations of neural networks are cutting edge, and it seems that everybody is talking about them. But sometimes a simpler solution is preferable. After all, one should try to walk before running. In this short article, I am going to demonstrate a simple method for clustering documents with Python. All code is available on GitHub (please note that it might be better to view the code in nbviewer).

We are going to cluster Wikipedia articles using the k-means algorithm. The steps are the following:

  1. fetch…
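As a rough illustration of clustering documents with k-means, here is a minimal sketch that uses only the Python standard library. It is not the code from the article's repository: the toy corpus stands in for the Wikipedia articles, and the helper names (`tfidf_vectors`, `sq_dist`, `kmeans`) as well as the deterministic farthest-first initialisation are assumptions made to keep the example small and reproducible.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn raw strings into unit-length TF-IDF vectors (dicts: term -> weight)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed inverse document frequency, so rarer terms weigh more.
        vec = {t: n * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, n in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def kmeans(vectors, k, max_iter=50):
    """Plain k-means (Lloyd's algorithm); returns one cluster label per vector."""
    # Assumption: deterministic farthest-first initialisation, chosen for
    # reproducibility. Random or k-means++ starts are more common in practice.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        if new_labels == labels:
            break  # converged: no document changed cluster
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                total = Counter()
                for v in members:
                    total.update(v)
                centroids[c] = {t: w / len(members) for t, w in total.items()}
    return labels


# Hypothetical toy corpus: three documents about pets, three about programming.
docs = [
    "cats and dogs are popular pets",
    "dogs often chase cats",
    "many pets are cats",
    "python code needs an interpreter",
    "a compiler translates code",
    "python scripts contain code",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

On real articles one would normally add proper tokenisation and stop-word removal, and would likely reach for a library implementation instead of hand-written loops.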


Two methods using TensorFlow and Keras

Photo by Clay Banks on Unsplash

In this article we will demonstrate two different methods of using autoencoders. In particular, we will try to classify credit card transactions as fraudulent or non-fraudulent by using autoencoders. The dataset we are going to use is the “Credit Card Fraud Detection” dataset, which can be found on Kaggle. The full code is available on GitHub. The repository includes a link for opening and executing the code in Colab, so feel free to experiment. The code is written in Python and uses TensorFlow and Keras.

The dataset contains 284,807 credit card transactions from European cardholders. For security reasons the…


Segmentation of online customers by RFM-country and combination with part I

Photo by Hal Gatewood on Unsplash

Customer segmentation is one of the most common uses of data analysis/data science. This is the second part of a two-post series in which we work through an example of customer segmentation. The dataset we use is the Online Retail II data set, which contains the transactions of a UK-based online retailer between 01/12/2009 and 09/12/2011. The dataset contains 1,067,371 rows about the purchases of 5,943 customers.


Segmentation of online customers by item description

Introduction

Photo by Charisse Kenion on Unsplash

Customer segmentation is one of the most common uses of data analysis/data science. In this two-post series, we will see an example of customer segmentation. We are going to use the Online Retail II data set, which contains the transactions of a UK-based online retailer between 01/12/2009 and 09/12/2011. The dataset contains 1,067,371 rows about the purchases of 5,943 customers.


Generated text based on Tolstoy’s text (image by the author)

Currently, neural networks represent the state of the art in text generation. Deep neural nets like GPT-3, with billions of parameters and trained on terabytes of data, are truly impressive. But using neural nets is not the only way to generate text. In this post, we demonstrate how transition matrices can help with text generation.

What is a transition matrix

Imagine a system with a finite set of states. The system can move from one state to another with a certain probability. A transition matrix encodes all of this: the entry in row i and column j is the probability of moving from state i to state j, so each row sums to 1.

Classic examples are:

  • Monopoly, where the states are…
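To make the idea concrete for text generation, here is a minimal sketch in which the states are words and the transition probabilities are estimated from consecutive word pairs. It is an illustration under stated assumptions, not the article's implementation: the function names `build_transition_matrix` and `generate` and the sample sentence are hypothetical, and the matrix is stored sparsely as a dictionary of rows.

```python
import random
from collections import Counter, defaultdict


def build_transition_matrix(words):
    """Estimate P[current][next] from consecutive word pairs.

    The matrix is stored sparsely: one dict per row, zero entries omitted.
    """
    counts = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }


def generate(matrix, start, length, seed=0):
    """Walk the chain from `start`, sampling each next word from the current row."""
    rng = random.Random(seed)
    word, output = start, [start]
    for _ in range(length - 1):
        row = matrix.get(word)
        if not row:  # dead end: this word was never followed by anything
            break
        word = rng.choices(list(row), weights=list(row.values()))[0]
        output.append(word)
    return " ".join(output)


# Hypothetical sample text; any tokenised corpus would work the same way.
text = "the cat sat on the mat and the dog sat on the rug"
matrix = build_transition_matrix(text.split())
print(matrix["the"])
print(generate(matrix, start="the", length=8))
```

Each row of `matrix` sums to 1, as a row of a transition matrix should. Using pairs or triples of words as states instead of single words is a common way to get more coherent output, at the cost of a larger and sparser matrix.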


Like much of the world, I am confined at home because of COVID-19 and curious about the evolution of the pandemic. And like many other data scientists, I am trying to use data science to get a glimpse of the pandemic's future. In this post I describe one such attempt. More specifically, I describe my attempt to predict the number of COVID-related deaths from the number of patients on ventilators. It is a (mostly) failed attempt. Now, you might be wondering why waste time on a failure. I can offer two reasons. First, in the process…


A short web scraping and model building project

Photo by Marcelo Leal on Unsplash

With a second, more severe wave of COVID-19 hitting my country (Greece), I was left wondering how the pandemic would unfold. Our National Public Health Organisation (EODY) publishes the number of confirmed COVID-19 cases daily. This number increased dramatically over the last month, along with the number of COVID-19 patients on ventilators and the number of deaths.

The first idea one might have is to use the time series of confirmed coronavirus cases to predict the future. The problem with this approach is that this number depends on the number of COVID tests that…


Power BI is Microsoft’s attempt to create a data visualization/business intelligence tool. According to Wikipedia, it was designed in 2010 and became generally available in July 2015. Currently, there is a free version of Power BI called Power BI Desktop that can be downloaded from Microsoft.

In this article, I am going to demonstrate Power BI by creating a COVID-19 dashboard.

You can read on or watch the video below. (It is my first such video, so go easy on me. One of the reasons for creating it was for testing some tools. …


Opinion

Choosing a language for Data Science — A personal view

Photo by Markus Spiske on Unsplash

Having been an enthusiastic reader of Towards Data Science for some time, I couldn’t help but notice the debate on the best language for Data Science. Examples are:

  1. A compact comparison: Julia, R and Python — Data science in 2020
  2. Introducing Julia: An Alternative to Python and R for Data Science
  3. 5 Ways Julia Is Better Than Python

This is just a short post with some of my thoughts on the subject. Unfortunately, I am not familiar with Julia, so it is mostly about using R and/or Python.

Use the right tool for the right job

In my opinion, one should use the language that best fits one…


Photo by Fusion Medical Animation on Unsplash

This article contains information in Greek about the coronavirus (COVID-19). The source of the information is the European Centre for Disease Prevention and Control. Specifically, the article contains translations of excerpts from the two sources below.

The source of the information below is:

  • the 8th update of the Rapid Risk Assessment
  • the 7th update of the rapid…

Dimitris Panagopoulos

Research mathematician turned Data Scientist https://www.linkedin.com/in/dpanagopoulos/
