Guide to grouping similar sentences in Python

Which data science articles gain more traction with readers (Part 2)

Photo by Hanson Lu on Unsplash

In this series of articles, we analyse historical archives of data science publications to understand which topics are more popular with readers. Previously, we covered how to get the data that will be used for further analysis.

In this article, we will cover how to clean the text data we collected earlier, group similar topics using network graphs and establish patterns within these clusters.

Data summary

Let’s remind ourselves what the data looks like. It is a combination of articles obtained from three data sources (field: ‘source’): Analytics Vidhya (‘avd’), TDS (‘tds’) and Towards AI (‘tai’).

We collected titles, subtitles, claps and responses for individual articles from the archives of these publications.

import pandas as pd

# Reading the data obtained using code here.
avd = pd.read_csv('analytics_vidhya_data.csv')
tds = pd.read_csv('medium_articles.csv')
tai = pd.read_csv('towards_ai_data.csv')

# Tag every row with the publication it came from
avd['source'] = 'avd'
tds['source'] = 'tds'
tai['source'] = 'tai'

# Create single data set, join title and subtitle
single_matrix = pd.concat([avd, tds, tai])
single_matrix['title_subtitle'] = [' '.join([str(i), str(j)]) for i, j in zip(single_matrix['Title'].fillna(''), single_matrix['Subtitle'].fillna(''))]

Articles data set

We added an extra column to the data set called ‘title_subtitle’, which joins the columns ‘Title’ and ‘Subtitle’. We will mainly use this column, because it gives a better view of the topic an article belongs to. Interestingly, 39% of articles don’t have subtitles and a very small proportion (0.13%) don’t have titles.
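
The shares of missing titles and subtitles quoted above can be checked with a short calculation. The sketch below runs on a small made-up data frame (the `toy` name and its values are illustrative, not the article's data) and assumes that missing entries are stored as NaN rather than as empty strings.

```python
import pandas as pd

# Toy stand-in for single_matrix; values are invented for illustration.
toy = pd.DataFrame({
    "Title": ["Intro to NLP", None, "Graph clustering", "Pandas tips"],
    "Subtitle": ["A gentle start", "Orphan subtitle", None, None],
})

# isna() flags missing cells; the mean of the flags is the missing fraction.
missing_share = toy[["Title", "Subtitle"]].isna().mean() * 100
print(missing_share.round(2))
```

Running the same two lines on `single_matrix` should give the percentages for the real data, provided the column names match.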

Let’s quickly look at the distributions of claps and responses for every data source. We start with box plots, using the seaborn library in Python to create them.

# We will use seaborn to create all plots
import seaborn as sns
import matplotlib.pyplot as plt

# One row with two panels: claps on the left, responses on the right
fig, axes = plt.subplots(1, 2, figsize=(8, 5))
# Claps
sns.boxplot(ax=axes[0], x="source", y="Claps", data=single_matrix)
# Responses
sns.boxplot(ax=axes[1], x="source", y="Responses", data=single_matrix)

We can see that Towards Data Science not only has more activity, but also quite a few outliers, with individual articles gaining a lot of traction from readers. Of course, the activity for each source depends on the size of the publication: larger publications have more writers and more readers.

When it comes to responses, we observe far less activity than for claps across all sources, which is not unexpected.

Box plots for claps and responses split by source
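
To put numbers behind the box plots, a per-source summary table can help. This is a minimal sketch on invented values (the `toy` frame is hypothetical), assuming the same ‘source’, ‘Claps’ and ‘Responses’ columns as in `single_matrix`.

```python
import pandas as pd

# Toy stand-in for single_matrix; values are invented for illustration.
toy = pd.DataFrame({
    "source": ["avd", "avd", "tds", "tds", "tds", "tai"],
    "Claps": [10, 30, 5, 100, 2000, 40],
    "Responses": [0, 1, 0, 2, 15, 0],
})

# Article count, median and maximum per publication for both fields.
summary = toy.groupby("source")[["Claps", "Responses"]].agg(["count", "median", "max"])
print(summary)
```

The count column shows how much of the difference in activity comes down to publication size, while a large gap between median and maximum points to the outliers visible in the plots.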

Next, we remove outliers and visualise distributions of the fields to have a clearer picture.

# Code to create distribution subplots
fig, axes = plt.subplots(2, 1, figsize=(8, 8))
# Claps
sns.distplot(avd['Claps'][avd['Claps']
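
The cut-off used to remove outliers is not shown here, so the following is only a sketch of one common approach: dropping rows above an upper quantile. The helper name `drop_upper_outliers` and the 0.9 quantile are assumptions for illustration, not the author's settings.

```python
import pandas as pd

def drop_upper_outliers(df, column, upper_quantile=0.95):
    """Keep rows whose value in `column` is at or below the given quantile."""
    cutoff = df[column].quantile(upper_quantile)
    return df[df[column] <= cutoff]

# Toy data with one extreme value; invented for illustration.
toy = pd.DataFrame({"Claps": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]})
trimmed = drop_upper_outliers(toy, "Claps", upper_quantile=0.9)
print(len(trimmed), trimmed["Claps"].max())
```

The trimmed frame can then be passed to a seaborn distribution plot. Note that `sns.distplot` is deprecated in recent seaborn releases, where `sns.histplot` or `sns.displot` are the suggested replacements.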
