EDA Python cheat sheet guide

Introduction

The secret to building powerful predictive models is understanding the data really well. It is therefore worth working through the essential steps of data exploration before building a model.

Here is a cheat sheet to help you with the code and steps involved in exploratory data analysis in Python. We have also released a PDF version of the sheet this time so that you can easily copy and paste the code.

You can copy and paste this code and keep it handy by downloading the PDF version of this infographic: Data Exploration in Python.pdf

What is the Python Cheat Sheet?

The Python Cheat Sheet is a comprehensive summary of the most important foundational knowledge for beginners teaching themselves Python. Compiled by Arianne Colton and Sean Chen, this cheat sheet walks you through all the basic concepts of the Python programming language, saving you time and effort.

Python self-study material for beginners

Download the PDF version of the cheat sheet here.

(See how to start learning Python for data analysis here)

(See the article on 8 skills you need to become a Data Analyst)

(See the article on analytical thinking and using data to answer questions)

Read more articles sharing data analysis knowledge: https://datapot.vn/blog/

Video series with hands-on lab guidance and Microsoft resources: https://www.youtube.com/c/Datapotvn/videos

Updates on Microsoft resources, DA-100 exam questions, and exam topics on Datapot's fan page: https://www.facebook.com/DatapotAnalytics/

Everyday pandas functions for data scientists

Pandas is a Python library used for data manipulation (creating, deleting, and updating data).

It is one of the most commonly used libraries for data analysis in Python. Pandas offers data structures and operations for manipulating numerical and time-series data.

Pandas First Steps

Install and import

Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using either of the following commands:

conda install pandas

OR

pip install pandas

Alternatively, if you’re currently viewing this article in a Jupyter notebook you can run this cell:

!pip install pandas

How to install pandas

The ! at the beginning runs the command as if it were typed in a terminal.

To import pandas we usually import it with a shorter name since it’s used so much:

import pandas as pd

This imports pandas into Jupyter; pd is the alias used for pandas.

This exercise uses the Loan Prediction dataset, which can be downloaded from: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/#ProblemStatement

1. Import necessary Libraries

2. Load dataset (Test & Train)

# Read the train and test datasets (file names as downloaded from the contest page)
train = pd.read_csv("train_ctrUa4K.csv")
test = pd.read_csv("test_lAUu6dG.csv")
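
If the CSV files are not at hand, the loading step can still be tried out. The sketch below is only an illustration: the CSV text and its column names are made-up stand-ins for train_ctrUa4K.csv, read through io.StringIO instead of a file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for train_ctrUa4K.csv (assumed columns)
csv_text = """Loan_ID,Education,ApplicantIncome,LoanAmount,Loan_Status
LP001,Graduate,5849,,Y
LP002,Graduate,4583,128,N
LP003,Not Graduate,3000,66,Y
"""

# read_csv accepts a file path or any file-like object
train = pd.read_csv(io.StringIO(csv_text))

# The empty LoanAmount field is read as a missing value (NaN)
print(train)
```

With the real files, passing the file name to pd.read_csv works the same way.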

3. head()

Viewing your data:

a) The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with .head():

b) .head() outputs the first five rows of your DataFrame by default, including column header and the content of each row.

First 5 rows of Test & Train datasets

c) We can also pass a number: .head(3) outputs the first 3 rows.
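
A minimal sketch of both calls, assuming a small made-up DataFrame in place of the real train set (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the train dataset
train = pd.DataFrame({
    "Loan_ID": [f"LP{i:03d}" for i in range(1, 9)],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036],
})

first_five = train.head()    # default: first 5 rows
first_three = train.head(3)  # first 3 rows

print(first_five)
print(first_three)
```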

First 3 rows of Test & Train datasets

4. tail()

.tail() outputs the last five rows of your DataFrame by default, including column header and the content of each row.

Last 5 rows of Test & Train datasets

We can also pass a number: .tail(2) outputs the last 2 rows.
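
The same idea as a short sketch, again on a made-up DataFrame (hypothetical column names and values):

```python
import pandas as pd

# Hypothetical stand-in for the train dataset
train = pd.DataFrame({
    "Loan_ID": [f"LP{i:03d}" for i in range(1, 9)],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036],
})

last_five = train.tail()   # default: last 5 rows
last_two = train.tail(2)   # last 2 rows

print(last_five)
print(last_two)
```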

Last 2 rows of Test & Train datasets

5. shape

.shape is an attribute (written without parentheses) that gives the size of the DataFrame as a tuple in the format (rows, columns).
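
As a small illustration, assuming a made-up DataFrame with 4 rows and 3 columns (names and values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the train dataset
train = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003", "LP004"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# shape is a tuple, so it can be unpacked directly
n_rows, n_cols = train.shape
print(f"{n_rows} rows, {n_cols} columns")
```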

Displays shape of Train & Test Dataset

6. info()

.info() prints the column headers and the data type stored in each column. It also gives the number of non-null values per column and the memory the data takes.
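
A minimal sketch on a made-up DataFrame (hypothetical columns). By default info() prints to the screen; here the buf parameter is also used to capture the same report as text, which is purely an illustration choice.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the train dataset
train = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003"],
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [np.nan, 128.0, 66.0],
})

# Print the summary directly
train.info()

# Capture the same summary as a string
buffer = io.StringIO()
train.info(buf=buffer)
report = buffer.getvalue()
```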

Displays data types & missing values of each feature

7. dtypes

Use the pandas DataFrame.dtypes attribute (no parentheses) to find the data type (dtype) of each column in the given DataFrame.
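
A short sketch, assuming a made-up DataFrame with hypothetical columns. The helper functions from pandas.api.types are used here only to check the types without depending on platform-specific names such as int64.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical stand-in for the train dataset
train = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003"],
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [np.nan, 128.0, 66.0],
})

# dtypes is a Series indexed by column name
column_types = train.dtypes
print(column_types)

income_is_integer = is_integer_dtype(train["ApplicantIncome"])
amount_is_float = is_float_dtype(train["LoanAmount"])
```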

Displays only data type of each feature

8. count()

Pandas DataFrame.count() counts the number of non-NA/null observations along the given axis. It works with non-floating-point data as well.
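
A minimal sketch of counting along both axes, assuming a made-up DataFrame with a few missing values (hypothetical columns):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the train dataset, with some missing values
train = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [np.nan, 128.0, 66.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

per_column = train.count()      # non-null values in each column
per_row = train.count(axis=1)   # non-null values in each row

print(per_column)
print(per_row)
```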

Number of non-null data points in each feature

9. value_counts()

The pandas value_counts() function returns a Series containing counts of unique values. The result is in descending order, so the first element is the most frequently occurring one. NA values are excluded by default.
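
A short sketch on a made-up Loan_Status column (the values are hypothetical, not the real 614 rows). It also shows two optional parameters, dropna and normalize.

```python
import pandas as pd

# Hypothetical stand-in for the target column, with one missing value
train = pd.DataFrame({"Loan_Status": ["Y", "N", "Y", "Y", None, "N", "Y"]})

counts = train["Loan_Status"].value_counts()                      # NA excluded
counts_with_na = train["Loan_Status"].value_counts(dropna=False)  # NA included
shares = train["Loan_Status"].value_counts(normalize=True)        # proportions

print(counts)
print(counts_with_na)
print(shares)
```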

Out of 614 data points, Y occurs 422 times and N occurs 192 times

10. unique()

a) unique() shows all the distinct values of a particular column.

b) The pandas unique() function returns the unique values in the feature. They are returned in order of appearance; the result is NOT sorted.
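
A minimal sketch, assuming a made-up Education column. "Not Graduate" is deliberately placed first to show that the order of appearance is kept. nunique() is included as a related call that returns only the number of distinct values.

```python
import pandas as pd

# Hypothetical stand-in for the Education feature
train = pd.DataFrame({
    "Education": ["Not Graduate", "Graduate", "Graduate", "Not Graduate", "Graduate"],
})

distinct_values = train["Education"].unique()   # order of appearance, not sorted
n_distinct = train["Education"].nunique()       # number of distinct values

print(distinct_values)
print(n_distinct)
```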

Distinct values of the "Education" feature

11. Printing Column Names

To get all the column headers of a pandas DataFrame, use the df.columns attribute. df.columns.values returns them as an array rather than a list; df.columns.tolist() returns them as a Python list.
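
A short sketch of the two forms, assuming a made-up DataFrame (hypothetical column names):

```python
import pandas as pd

# Hypothetical stand-in for the train dataset
train = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002"],
    "Education": ["Graduate", "Not Graduate"],
    "Loan_Status": ["Y", "N"],
})

column_array = train.columns.values   # array of column headers
column_list = train.columns.tolist()  # plain Python list

print(column_array)
print(column_list)
```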

Shows all column names

12. describe()

Pandas describe() shows basic statistics such as count, mean, standard deviation, minimum, maximum, and percentiles (the 50% row is the median) for the numeric columns in your dataset.
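
A minimal sketch, assuming a made-up DataFrame with one text column and two numeric columns (hypothetical names and values). By default the text column is left out of the summary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the train dataset
train = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate", "Not Graduate"],
    "ApplicantIncome": [1000, 2000, 3000, 4000, 5000],
    "LoanAmount": [100.0, 120.0, np.nan, 140.0, 160.0],
})

summary = train.describe()
print(summary)

# The 50% row matches the median of the column
median_income = train["ApplicantIncome"].median()
```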

13. Missing values

In pandas, missing data is represented by two values:

  • None: None is a Python singleton object that is often used for missing data in Python code.
  • NaN: NaN (short for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
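
To make the two representations concrete, here is a small sketch with made-up values. It assumes nothing beyond pandas and NumPy: in a float Series None is converted to NaN, and isna() flags both kinds of missing value.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN
amounts = pd.Series([128.0, None, np.nan])
print(amounts)

# In an object Series, None and NaN are both treated as missing
labels = pd.Series(["Graduate", None, np.nan], dtype=object)

amounts_missing = amounts.isna().tolist()
labels_missing = labels.isna().tolist()
print(amounts_missing, labels_missing)
```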

To check for missing values in a pandas DataFrame, we use the functions isnull() and notnull(). isnull() returns True where a value is missing (None or NaN), and notnull() returns the opposite. Both functions can also be used on a pandas Series to find null values.
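
A minimal sketch of counting missing values per column, assuming a made-up DataFrame (hypothetical columns). Summing the boolean mask works because True counts as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the train dataset, with some missing values
train = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [np.nan, 128.0, 66.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

missing_mask = train.isnull()                # True where a value is missing
missing_per_column = missing_mask.sum()      # number of missing values per column
present_per_column = train.notnull().sum()   # number of non-missing values per column

print(missing_per_column)
print(present_per_column)
```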

Number of missing values in each feature

Number of missing values in each feature and the corresponding percentage. The missing-value percentage helps decide whether to drop a feature.
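
One possible way to build such a table is sketched below. The DataFrame is made up, and the 50% cut-off (drop_threshold) is a hypothetical value chosen only for illustration, not a recommendation from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the train dataset, with some missing values
train = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Credit_History": [np.nan, np.nan, 1.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

missing_count = train.isnull().sum()
# The mean of a boolean mask is the fraction of missing values
missing_percent = (train.isnull().mean() * 100).round(2)

report = pd.DataFrame({"missing": missing_count, "percent": missing_percent})
print(report)

# Hypothetical rule: drop features with more than 50% missing values
drop_threshold = 50.0
columns_to_drop = report.index[report["percent"] > drop_threshold].tolist()
reduced = train.drop(columns=columns_to_drop)
```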

That’s It!

Thanks for reading!

Found this article useful? Follow me (Anuganti Suresh) on Medium and check out my most popular articles! Please 👏 this article to share it!
