Target Encoding Vs. One-hot Encoding with Simple Examples

Svideloc
Jan 16, 2020 · 5 min read

For machine learning algorithms, categorical data can be extremely useful. However, in its original form, it is unrecognizable to most models. In order to solve this problem, we can use different encoding techniques to make our categorical data legible.

In this post, we will cover two common encoding techniques used in data science models: Target Encoding and One-hot Encoding.

Target Encoding

In the documentation for the category_encoders library (used in the code below), Target Encoding is defined as the process in which

features are replaced with a blend of posterior probability of the target given particular categorical value and the prior probability of the target over all the training data.

Example Target Encoding

Table 1: Dataframe With Target Encoded Animal Values

To better understand what this means, let's look at an example. In Table 1, we have categorical data in the Animal column, and we have our binary target in the Target column. In the final column, we have the encoded animal values. So how did we get there?

  1. Group the data by each category and count the number of occurrences of each target (Table 2).
  2. Next, calculate the probability of Target 1 occurring given each specific Animal group. Doing this gives the values shown in Table 2:

Table 2: Simplified Table Showing How Target Encoding Calculates the Probability

  3. Finally, add the new column back in, giving the probability value for each Animal group. This is shown in the first dataframe (Table 1). Now you have a numerical value representing the Animal feature that machine learning algorithms can recognize.
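
To make the arithmetic concrete, here is a minimal pandas sketch of these three steps. The rows are hypothetical stand-ins shaped like Table 1, not the article's actual values:

import pandas as pd

# hypothetical data in the shape of Table 1
df = pd.DataFrame({
    'Animal': ['cat', 'cat', 'dog', 'dog', 'hamster', 'hamster'],
    'Target': [1, 0, 1, 1, 0, 0],
})

# steps 1-2: the mean of a binary target within each group is
# P(Target = 1 | Animal), the posterior probability
posterior = df.groupby('Animal')['Target'].mean()

# step 3: map each row's category back to its group probability
df['Animal Encoded'] = df['Animal'].map(posterior)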

Note that when you target encode with the category_encoders library (rather than by hand), your values may differ slightly from those produced by the methodology above. This is because we have only taken the posterior probability into account so far. The library also looks at the prior probability, which in this case is the probability of the target being 1: here 0.5, since the target equals 1 half of the time. It uses this prior to smooth the encoded values, so that a feature derived directly from the target is not given too much weight.
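
One way to express such a blend (a sketch of the idea, not the library's exact formula) weights each group's posterior by its row count n against a pseudo-count m for the prior:

# smoothed encoding: small groups shrink toward the prior
# (m is a tunable pseudo-count; category_encoders' exact weighting differs)
def smoothed_encoding(n, posterior, prior, m=1.0):
    return (n * posterior + m * prior) / (n + m)

# a category seen twice with posterior 1.0 shrinks toward the prior 0.5
smoothed_encoding(n=2, posterior=1.0, prior=0.5)  # ~0.83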

What This Looks Like in Code

In the code below, the category_encoders library is used to do the target encoding the fast way (rather than manually, as explained above).

  1. Import Libraries
import pandas as pd
from category_encoders import TargetEncoder

2. Target Encode & Clean DataFrame

# assumes df is a dataframe with 'Animal' and 'Target' columns
encoder = TargetEncoder()
df['Animal Encoded'] = encoder.fit_transform(df['Animal'], df['Target'])

This will output the dataframe found in Table 1. The same encoder can also be applied across multiple categorical features, as sketched below.
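
For instance, loop over the categorical columns; 'Color' here is a hypothetical second feature used only for illustration:

from category_encoders import TargetEncoder

# encode each categorical column against the same binary target
for col in ['Animal', 'Color']:  # 'Color' is a made-up second column
    encoder = TargetEncoder()
    df[col + ' Encoded'] = encoder.fit_transform(df[col], df['Target'])

category_encoders can also encode several columns in a single pass via TargetEncoder(cols=['Animal', 'Color']) fitted on the full dataframe.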

Benefits of Target Encoding

Target encoding is a simple and quick encoding method that doesn't add to the dimensionality of the dataset, so it works well as a first-pass encoding method.

Limitations of Target Encoding

Target encoding depends on the distribution of the target, which means it requires careful validation, as it is prone to overfitting. The method is also dataset-specific and will show significant improvement only some of the time.

One-hot Encoding

One-hot encoding is conceptually easier to understand. This type of encoding simply produces one binary feature per category: for the example above, a new feature each for cat, dog, and hamster. In the cat column, for example, a 1 shows that the animal is a cat, and a 0 shows that it is not. Let's look at the same example to make more sense of this:

Example One-hot Encoding

Using the same data as above, after one-hot encoding our data will look like this:

Table 3: One-hot Encoded Dataframe

Notice that we now have three new columns: isCat, isDog, and isHamster. A 1 signifies that the row contains the animal named in the column title; a 0 means it does not. Once again, we now have new features that a machine learning algorithm can interpret.

Let's look at the steps to get here.

  1. Import Packages
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
  2. Label Encode (give a number value to each category, e.g. cat = 0), shown in the Animal Encoded column in Table 3.
le = LabelEncoder()
df['Animal Encoded'] = le.fit_transform(df.Animal)
  3. One-hot Encode & Clean Dataframe
encoder = OneHotEncoder(categories='auto')
X = encoder.fit_transform(df['Animal Encoded'].values.reshape(-1, 1)).toarray()
dfonehot = pd.DataFrame(X)
df = pd.concat([df, dfonehot], axis=1)
# column order follows the label-encoded values (cat=0, dog=1, hamster=2)
df.columns = ['Animal', 'Target', 'Animal Encoded',
              'isCat', 'isDog', 'isHamster']

This will output the full dataframe shown in Table 3. Now you have 3 new features that can be understood by a machine learning algorithm.
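
As an aside, pandas can produce equivalent dummy columns in one step with pd.get_dummies, skipping the LabelEncoder step entirely; note that the auto-generated names (Animal_cat, Animal_dog, Animal_hamster) differ from the isCat style used in Table 3:

import pandas as pd

# one-line alternative to the LabelEncoder + OneHotEncoder pipeline
onehot = pd.get_dummies(df['Animal'], prefix='Animal')
df = pd.concat([df, onehot], axis=1)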

Benefits of One-hot Encoding

One-hot encoding works well with nominal data and eliminates any risk of the model reading order or magnitude into arbitrary numeric codes, since each new column is binary (1 or 0).

Limitations of One-hot Encoding

One-hot encoding can create very high dimensionality, depending on how many categorical features you have and how many categories each contains. This can become problematic not only in smaller datasets but potentially in larger ones as well. Combining PCA with one-hot encoding can help reduce that dimensionality when running models, as sketched below. One-hot encoding can also be problematic for tree-based models.
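
A rough sketch of that PCA step, assuming the one-hot columns built above (n_components=2 is an arbitrary choice and should be tuned per dataset):

from sklearn.decomposition import PCA

# project the three one-hot columns onto fewer components
pca = PCA(n_components=2)
components = pca.fit_transform(df[['isCat', 'isDog', 'isHamster']])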

Conclusion

There are many different types of encoding. I recommend exploring other options as you dive into specific datasets. With that said, these two are common approaches that can help you make use of the categorical features that exist within datasets.

You can access the code I used for this blog on my GitHub.
