A Guide to Python Data Cleaning Tools
Watch Now This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Data Cleaning With pandas and NumPy
Data scientists spend a large amount of their time cleaning datasets and getting them down to a form with which they can work. In fact, a lot of data scientists argue that the initial steps of obtaining and cleaning data constitute 80% of the job. Therefore, if you are just stepping into this field or planning to step into this field, it is important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers. In this tutorial, we’ll leverage Python’s Pandas and NumPy libraries to clean data. We’ll cover the following:

- Dropping unnecessary columns in a DataFrame
- Changing the index of a DataFrame
- Using .str() methods to clean columns
- Using the applymap() function to clean the entire dataset, element-wise
- Renaming columns to a more recognizable set of labels
- Skipping unnecessary rows in a CSV file
Here are the datasets that we will be using:

- BL-Flickr-Images-Book.csv – a CSV file containing information about books from the British Library
- university_towns.txt – a text file containing names of college towns in every US state
- olympics.csv – a CSV file summarizing the participation of all countries in the Summer and Winter Olympics
You can download the datasets from Real Python’s GitHub repository in order to follow the examples here. This tutorial assumes a basic understanding of the Pandas and NumPy libraries, including Pandas’ workhorse Series and DataFrame objects. Let’s import the required modules and get started!
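The setup can be sketched as follows. Because the CSV file may not be on your machine, the snippet below substitutes a tiny made-up DataFrame with the same kinds of columns so it runs on its own:

```python
import pandas as pd
import numpy as np

# In the tutorial the data comes from a CSV file, e.g.:
#     df = pd.read_csv("Datasets/BL-Flickr-Images-Book.csv")
# Here we build a small stand-in DataFrame (the rows are made up)
# so the snippet is self-contained.
df = pd.DataFrame({
    "Identifier": [206, 216],
    "Place of Publication": ["London", "London; Virtue & Yorston"],
    "Date of Publication": ["1879 [1878]", "1868"],
    "Publisher": ["S. Tinsley & Co.", "Virtue & Co."],
    "Title": ["Walter Forbes. [A novel.]", "All for Greed."],
})
print(df.shape)  # (2, 5)
```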
Dropping Columns in a DataFrame

Often, you’ll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address) but want to focus on analyzing student grades. In this case, the address or parents’ names categories are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime. Pandas provides a handy way of removing unwanted columns or rows from a DataFrame with the drop() function. First, let’s create a DataFrame out of the CSV file ‘BL-Flickr-Images-Book.csv’:
When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves. We can drop these columns in the following way:
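A sketch of that step, using a small stand-in DataFrame (the column names match the British Library dataset, but the rows here are made up):

```python
import pandas as pd

# Stand-in for the book DataFrame: descriptive columns plus some
# ancillary ones we don't need.
df = pd.DataFrame({
    "Identifier": [206, 216],
    "Edition Statement": [None, None],
    "Corporate Author": [None, None],
    "Title": ["Walter Forbes. [A novel.]", "All for Greed."],
})

# Columns we want to remove.
to_drop = ["Edition Statement", "Corporate Author"]

# axis=1 tells drop() to look for the labels among the columns;
# inplace=True applies the change to df directly.
df.drop(to_drop, inplace=True, axis=1)
print(list(df.columns))  # ['Identifier', 'Title']
```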
Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes made directly in our object and that it should look for the values to be dropped in the columns of the object. When we inspect the DataFrame again, we see that the unwanted columns have been removed:
Alternatively, we could also remove the columns by passing them to the columns parameter directly instead of separately specifying the labels to be removed and the axis where Pandas should look for the labels:
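The same drop, sketched with the columns keyword (stand-in data as before):

```python
import pandas as pd

df = pd.DataFrame({
    "Identifier": [206, 216],
    "Edition Statement": [None, None],
    "Title": ["Walter Forbes. [A novel.]", "All for Greed."],
})

# Equivalent to drop([...], axis=1): the columns keyword states the
# intent directly.
df.drop(columns=["Edition Statement"], inplace=True)
print(list(df.columns))  # ['Identifier', 'Title']
```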
This syntax is more intuitive and readable. What we’re trying to do here is directly apparent.

Changing the Index of a DataFrame

A Pandas Index extends the functionality of NumPy arrays to allow for more versatile slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of the data as its index. For example, in the dataset used in the previous section, it can be expected that when a librarian searches for a record, they may input the unique identifier for a book (values in the Identifier column):
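Before promoting that column to the index, it is worth checking that it really does identify each record uniquely; a sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Title": ["Walter Forbes. [A novel.]", "All for Greed.",
              "Love the Avenger."],
})

# A column only makes a good index if every value is distinct.
print(df["Identifier"].is_unique)  # True
```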
Let’s replace the existing index with this column using set_index:
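A minimal sketch of the step:

```python
import pandas as pd

df = pd.DataFrame({
    "Identifier": [206, 216],
    "Title": ["Walter Forbes. [A novel.]", "All for Greed."],
})

# set_index returns a modified copy, so we reassign df to keep it.
df = df.set_index("Identifier")
print(df.index.name, df.index.tolist())
```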
We can access each record in a straightforward way with loc[]. Although loc[] may not have all that intuitive of a name, it allows us to do label-based indexing, which is the labeling of a row or record without regard to its position:
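Label-based versus position-based lookup, sketched on the same stand-in data:

```python
import pandas as pd

df = pd.DataFrame(
    {"Title": ["Walter Forbes. [A novel.]", "All for Greed."]},
    index=pd.Index([206, 216], name="Identifier"),
)

# loc[] looks a row up by its label ...
print(df.loc[206, "Title"])
# ... while iloc[] looks it up by its integer position.
print(df.iloc[0]["Title"])
```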
In other words, 206 is the first label of the index. To access it by position, we could use df.iloc[0], which does position-based indexing. Previously, our index was a RangeIndex: integers starting from 0, analogous to Python’s built-in range. By passing a column name to set_index, we have changed the index to the values in Identifier. You may have noticed that we reassigned the variable to the object returned by the method with df = df.set_index(...). This is because, by default, the method returns a modified copy of our object and does not make the changes directly. We could avoid this by setting the inplace parameter to True.
Tidying up Fields in the Data

So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to gain a better understanding of the dataset and enforce consistency. Upon inspection, all of the data types are currently the object dtype, which is roughly analogous to str in native Python. It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:
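A quick way to see this, sketched with stand-in string columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Place of Publication": ["London", "Oxford"],
    "Date of Publication": ["1879 [1878]", "1868"],
})

# Raw strings read from a CSV all arrive as the generic object dtype.
print(df.dtypes.value_counts())
```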
One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:
A particular book can have only one date of publication. Therefore, we need to do the following:

- Remove the extra dates in square brackets, wherever present: 1879 [1878]
- Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
- Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
- Convert the string nan to NumPy’s NaN value
Synthesizing these patterns, we can actually take advantage of a single regular expression to extract the publication year: regex = r'^(\d{4})'. The regular expression above is meant to find any four digits at the beginning of a string, which suffices for our case. The above is a raw string (meaning that a backslash is no longer an escape character), which is standard practice with regular expressions. The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start of a string, and the parentheses denote a capturing group, which signals to Pandas that we want to extract that part of the string. Let’s see what happens when we run this regex across our dataset:
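Sketched on a handful of the kinds of values described above:

```python
import pandas as pd

dates = pd.Series(["1879 [1878]", "1868", "[1897?]", "1860-63"],
                  name="Date of Publication")

# Extract the first four digits at the start of each string;
# expand=False returns a Series instead of a one-column DataFrame.
# Strings with no leading digits (like "[1897?]") become NaN.
extr = dates.str.extract(r"^(\d{4})", expand=False)
print(extr.tolist())
```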
Technically, this column still has object dtype, but we can easily get its numerical version with pd.to_numeric:
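A sketch of the conversion:

```python
import pandas as pd

extr = pd.Series(["1879", "1868", None, "1860"])

# Missing entries become NaN, so the resulting dtype is float64
# rather than int64.
years = pd.to_numeric(extr)
print(years.dtype)  # float64
```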
This results in about one in every ten values being missing, which is a small price to pay for now being able to do computations on the remaining valid values:
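The missing fraction can be computed like this (the toy Series below has one missing value in four, so the figure here is 0.25 rather than the dataset’s roughly one in ten):

```python
import pandas as pd

years = pd.Series([1879.0, 1868.0, None, 1860.0],
                  name="Date of Publication")

# Share of entries the extraction could not recover.
missing_frac = years.isnull().sum() / len(years)
print(missing_frac)  # 0.25
```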
Great! That’s done!

Combining str Methods with NumPy to Clean Columns

Above, you may have noticed the use of df['Date of Publication'].str. This attribute is a way to access speedy string operations in Pandas that largely mimic operations on native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize(). To clean the Place of Publication field, we can combine Pandas str methods with NumPy’s np.where function, which is basically a vectorized form of Excel’s IF() macro.
Here, condition is either an array-like object or a Boolean mask, then is the value to be used if the condition evaluates to True, and else is the value to be used otherwise. Essentially, np.where takes the form np.where(condition, then, else). It can be nested into a compound if-then statement, allowing us to compute values based on multiple conditions:
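A minimal sketch of the nesting:

```python
import numpy as np
import pandas as pd

s = pd.Series(["London", "Newcastle-upon-Tyne", "Oxford"])

# The "else" branch of np.where can itself be another np.where,
# which chains the conditions like a compound if/elif/else.
result = np.where(s.str.contains("London"), "London",
                  np.where(s.str.contains("Oxford"), "Oxford", s))
print(list(result))
```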
We’ll be making use of these two functions to clean the Place of Publication column, since this column contains string objects. Here are the contents of the column:
We see that for some rows, the place of publication is surrounded by other unnecessary information. If we were to look at more values, we would see that this is the case for only some rows that have their place of publication as ‘London’ or ‘Oxford’. Let’s take a look at two specific entries:
These two books were published in the same place, but one has hyphens in the name of the place while the other does not. To clean this column in one sweep, we can use str.contains() to get a Boolean mask. We clean the column as follows:
We combine them with np.where:
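Sketched end to end on a few representative values:

```python
import numpy as np
import pandas as pd

pub = pd.Series([
    "London",
    "London; Virtue & Yorston",
    "Newcastle upon Tyne",
    "Newcastle-upon-Tyne",
])

# Boolean masks for the two messy cases.
london = pub.str.contains("London")
oxford = pub.str.contains("Oxford")

# Collapse the variants to a canonical name, and normalize hyphens
# to spaces everywhere else.
cleaned = np.where(london, "London",
                   np.where(oxford, "Oxford",
                            pub.str.replace("-", " ")))
print(list(cleaned))
```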
Here, the np.where function is called in a nested structure, with its condition being a Series of Booleans obtained with str.contains(). The contains() method works similarly to the built-in in keyword used to find the occurrence of an entity in an iterable (or a substring in a string). The replacement to be used is a string representing our desired place of publication. We also replace hyphens with a space using str.replace() and reassign the result to the column in our DataFrame. Although there is more dirty data in this dataset, we will discuss only these two columns for now. Let’s have a look at the first five entries, which look a lot crisper than when we started out:
Cleaning the Entire Dataset Using the applymap Function

In certain situations, you will see that the “dirt” is not localized to one column but is more spread out. There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame. Pandas’ applymap() method is similar to the built-in map() function and simply applies a function to each element in a DataFrame. Let’s look at an example. We will create a DataFrame out of the “university_towns.txt” file:
We see that we have periodic state names followed by the university towns in that state. If we look at the way the state names are written in the file, we’ll see that all of them have the “[edit]” substring in them. We can take advantage of this pattern by creating a list of (state, city) tuples:
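A sketch of the parsing loop, fed with a short stand-in list of lines instead of the real file:

```python
# The file alternates between a state name (tagged "[edit]") and the
# towns in that state. These sample lines stand in for the file.
lines = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
]

university_towns = []
state = None
for line in lines:
    if "[edit]" in line:
        # Remember the current state; its towns follow on later lines.
        state = line
    else:
        university_towns.append((state, line))

print(university_towns[0])
```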
We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. Pandas will take each element in the list and set State to the left value and RegionName to the right value. The resulting DataFrame looks like this:
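Sketched with the tuples from the parsing step:

```python
import pandas as pd

university_towns = [
    ("Alabama[edit]", "Auburn (Auburn University)[1]"),
    ("Alabama[edit]", "Florence (University of North Alabama)"),
]

# Each (state, town) tuple becomes one row: left value -> State,
# right value -> RegionName.
towns_df = pd.DataFrame(university_towns,
                        columns=["State", "RegionName"])
print(towns_df.shape)  # (2, 2)
```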
While we could have cleaned these strings in the for loop above, Pandas makes it easy. We only need the state name and the town name and can remove everything else. While we could use Pandas’ .str() methods again here, we could also use applymap() to map a Python callable to each element of the DataFrame. We have been using the term element, but what exactly do we mean by it? Consider the following “toy” DataFrame:
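A toy DataFrame along those lines (values made up for illustration):

```python
import pandas as pd

# Every individual cell of this DataFrame is an "element".
toy = pd.DataFrame({0: ["Mock", "Python", "Real"],
                    1: ["Dataset", "Pandas", "Python"]})
print(toy)
```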
In this example, each cell (‘Mock’, ‘Dataset’, ‘Python’, ‘Pandas’, etc.) is an element. Therefore, applymap() will apply a function to each of these independently. Let’s define that function:
Pandas’ applymap() only takes one parameter, which is the function (callable) that should be applied to each element:
First, we define a Python function that takes an element from the DataFrame as its parameter. Inside the function, checks are performed to determine whether there’s a ( or [ in the element or not. Depending on the check, values are returned accordingly by the function. Finally, the applymap() function is called on our object. Now the DataFrame is much neater:
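The whole step, sketched on the stand-in town data (the helper name get_citystate follows the tutorial’s naming):

```python
import pandas as pd

towns_df = pd.DataFrame({
    "State": ["Alabama[edit]", "Alabama[edit]"],
    "RegionName": ["Auburn (Auburn University)[1]",
                   "Florence (University of North Alabama)"],
})

def get_citystate(item):
    # Keep only the text before the " (" or "[" marker, if present.
    if " (" in item:
        return item[: item.find(" (")]
    elif "[" in item:
        return item[: item.find("[")]
    else:
        return item

# applymap applies get_citystate to every element independently.
towns_df = towns_df.applymap(get_citystate)
print(towns_df.iloc[0].tolist())  # ['Alabama', 'Auburn']
```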
Renaming Columns and Skipping Rows

Often, the datasets you’ll work with will have either column names that are not easy to understand, or unimportant information in the first few and/or last rows, such as definitions of the terms in the dataset, or footnotes. In that case, we’d want to rename columns and skip certain rows so that we can drill down to the necessary information with correct and sensible labels. To demonstrate how we can go about doing this, let’s first take a glance at the initial five rows of the “olympics.csv” dataset:
Now, we’ll read it into a Pandas DataFrame:
This is messy indeed! The columns are the string form of integers indexed at 0. The row which should have been our header (i.e. the one to be used to set the column names) is at olympics_df.iloc[0]. Also, if we were to go to the source of this dataset, we’d see that NaN above really should be something like Country, ? Summer is supposed to be Summer Games, 01 ! should be Gold, and so on. Therefore, we need to do two things: skip one row and set the header as the first (0-indexed) row, and rename the columns.
We can skip rows and set the header while reading the CSV file by passing some parameters to the read_csv() function. This function takes a lot of optional parameters, but in this case we only need one (header) to remove the 0th row:
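Sketched on a small in-memory stand-in for olympics.csv (three lines are enough to show the effect):

```python
import io
import pandas as pd

# A trimmed stand-in for olympics.csv: the first line is junk and the
# real column names live on the second line.
raw = io.StringIO(
    "0,1,2,3\n"
    ",? Summer,01 !,02 !\n"
    "Afghanistan (AFG),13,0,0\n"
)

# header=1 uses the second row (index 1) as the header and discards
# everything above it.
olympics_df = pd.read_csv(raw, header=1)
print(list(olympics_df.columns))
```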
We now have the correct row set as the header and all unnecessary rows removed. Take note of how Pandas has changed the name of the column containing the name of the countries from NaN to Unnamed: 0. To rename the columns, we will make use of a DataFrame’s rename() method, which allows you to relabel an axis based on a mapping (in this case, a dict). Let’s start by defining a dictionary that maps current column names (as keys) to more usable ones (the dictionary’s values):
We call the rename() function on our object:
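A sketch of the renaming (the mapping below covers just a few columns; the tutorial’s full dict is longer):

```python
import pandas as pd

olympics_df = pd.DataFrame({
    "Unnamed: 0": ["Afghanistan (AFG)"],
    "? Summer": [13],
    "01 !": [0],
})

# Keys must match the current column names exactly; values are the
# labels we actually want.
new_names = {"Unnamed: 0": "Country",
             "? Summer": "Summer Olympics",
             "01 !": "Gold"}

olympics_df.rename(columns=new_names, inplace=True)
print(list(olympics_df.columns))
```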
Setting inplace to True specifies that our changes be made directly to the object. Let’s see if this checks out:
Python Data Cleaning: Recap and Resources

In this tutorial, you learned how you can drop unnecessary information from a dataset using the drop() function, as well as how to set an index for your dataset so that items in it can be referenced easily. Moreover, you learned how to clean object fields with the .str accessor and how to clean the entire dataset using the applymap() method. Knowing about data cleaning is very important, because it is a big part of data science. You now have a basic understanding of how Pandas and NumPy can be leveraged to clean datasets! Check out the links below to find additional resources that will help you on your Python data science journey: