Natural Language Processing has made huge advances in recent years. Currently, various implementations of neural networks are cutting edge, and it seems that everybody talks about them. But sometimes a simpler solution might be preferable. After all, one should try to walk before running. In this short article, I am going to demonstrate a simple method for clustering documents with Python. All code is available on GitHub (please note that it might be better to view the code in nbviewer).
We are going to cluster Wikipedia articles using the k-means algorithm. The steps for doing that are the following:
In this article we will demonstrate two different methods of using autoencoders. In particular, we will try to classify credit card transactions as fraudulent or non-fraudulent by using autoencoders. The dataset we are going to use is the “Credit Card Fraud Detection” dataset, which can be found on Kaggle. The full code is available on GitHub. It includes a link for opening and executing the code in Colab, so feel free to experiment. The code is written in Python and uses TensorFlow and Keras.
The dataset contains 284,807 credit card transactions from European cardholders. For security reasons the…
Customer segmentation is one of the most common uses of data analysis/data science. This is the second part of a two-post series in which we see an example of customer segmentation. The dataset we use is the Online Retail II data set, which contains transactions of a UK-based online retailer between 1/12/2009 and 09/12/2011. The dataset contains 1,067,371 rows about purchases of 5,943 customers.
Customer segmentation is one of the most common uses of data analysis/data science. In this two-post series, we will see an example of customer segmentation. We are going to use the Online Retail II data set, which contains transactions of a UK-based online retailer between 1/12/2009 and 09/12/2011. The dataset contains 1,067,371 rows about purchases of 5,943 customers.
Currently, neural networks represent the state of the art in the field of text generation. Deep neural nets like GPT-3, with billions of parameters and trained on terabytes of data, are truly impressive. But using neural nets is not the only way to generate text. In this post, we demonstrate how transition matrices can help with text generation.
Imagine having a system with a finite set of different states. The system can move from one state to another with a certain probability. A transition matrix encodes all of this in a matrix of the form:
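As a rough illustration of the idea (a minimal sketch, not the code from the post), the snippet below treats each word of a toy sentence as a state, estimates the transition probabilities from how often one word follows another, and then walks the matrix to generate text. The corpus and the helper names `build_transition_matrix` and `generate` are hypothetical and chosen only for this example. Each row of the matrix holds the probabilities of moving from one state to the next, so every row sums to 1.

```python
import random
from collections import Counter, defaultdict


def build_transition_matrix(words):
    """Map each word to a row {next_word: probability} estimated from counts."""
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1

    matrix = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        # Normalise the counts so that each row sums to 1.
        matrix[current] = {w: c / total for w, c in followers.items()}
    return matrix


def generate(matrix, start, length, seed=None):
    """Walk the matrix from `start`, sampling each next word from its row."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        row = matrix.get(output[-1])
        if not row:  # the last word has no observed successor: stop early
            break
        next_words = list(row)
        weights = [row[w] for w in next_words]
        output.append(rng.choices(next_words, weights=weights, k=1)[0])
    return output


# Toy corpus, assumed purely for illustration.
corpus = "the cat sat on the mat and the cat ate the fish".split()
matrix = build_transition_matrix(corpus)
sample = generate(matrix, start="the", length=8, seed=0)

print(matrix["the"])     # the row of probabilities for the state "the"
print(" ".join(sample))  # one generated word sequence
```

In this toy corpus "the" is followed by "cat" twice, and by "mat" and "fish" once each, so its row assigns probability 0.5 to "cat" and 0.25 to each of the other two. A word that only appears at the very end of the corpus ("fish") has no row, which is why the walk may stop before reaching the requested length.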
Classic examples are:
Like most of the world, stuck at home because of COVID-19, I am curious about the evolution of the pandemic. And like many other data scientists, I am trying to use data science to get a glimpse of the future of the pandemic. In this post I describe such an attempt. More specifically, I describe my attempt to predict the number of COVID-related deaths from the number of patients on ventilators. It is a (mostly) failed attempt. Now, you might be wondering why waste time on a failure. I can offer two reasons. First, in the process…
A short web scraping and model building project
With a second, more severe wave of COVID-19 hitting my country (Greece), I was left wondering how the pandemic would unfold. Our National Public Health Organisation (EODY) publishes the number of confirmed COVID-19 cases daily. This number has increased dramatically over the last month, along with the number of COVID-19 patients on ventilators and the number of deaths.
The first idea one might have is to use the time series of confirmed coronavirus cases to predict the future. The problem with this approach is that this number is dependent on the number of COVID tests that…
Power BI is Microsoft’s attempt to create a data visualization/business intelligence tool. According to Wikipedia, it was designed in 2010 and became generally available in July 2015. Currently, there is a free version of Power BI called Power BI Desktop that can be downloaded from Microsoft.
In this article, I am going to demonstrate Power BI by creating a COVID-19 dashboard.
You can read on or watch the video below. (It is my first such video, so go easy on me. One of the reasons for creating it was to test some tools. …
Having been an enthusiastic reader of Towards Data Science for some time, I couldn’t help but notice the debate on the best language for Data Science. Examples are:
This is just a short post with some of my thoughts on the subject. Unfortunately, I am not familiar with Julia, hence it is mostly on using R and/or Python.
In my opinion, one should use the language that best fits one…
This article has some information about the coronavirus (COVID-19). The source of this information is the European Centre for Disease Prevention and Control. More specifically, it contains a translation of parts of the two texts below.
This article contains information in Greek about COVID-19. The source of the information is the European Centre for Disease Prevention and Control. In particular, the article contains translations of passages from the two sources below.
The source of the information below is: