Guide to masking sensitive data in Python

Open Your Private Data To The World

We live in a world full of data, and much of that data is personal and sensitive. We face various kinds of data on a daily basis, and usually our clients do not want to share their private data with third parties. But what if they want to hire a freelancer or an outsourcing company? How can they pass the data along freely without worrying about potential data leaks? That’s where data anonymization, also called data masking, comes in.

Simply put, data anonymization is the process of transforming data in such a way that we can no longer tell who the data refers to. Depending on the end goal, we might also want anonymization not to affect possible insights within the data or its statistical properties. To achieve that, we must thoroughly understand our data, the techniques we are applying, and the properties we wish to preserve.
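
To make the idea of preserving statistical properties concrete, here is a minimal sketch in plain Python. It is not how anonympy is implemented; the ages, the seed, and the helper name `add_noise` are assumptions made up for illustration. Each value is shifted by a small random integer, so individual records may change while the column mean can drift by at most the noise bound.

```python
import random
from statistics import mean


def add_noise(values, max_shift=3, seed=0):
    """Shift every value by a random integer in [-max_shift, max_shift]."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [v + rng.randint(-max_shift, max_shift) for v in values]


ages = [23, 35, 41, 29, 52, 38, 47, 31]  # made-up example data
noisy_ages = add_noise(ages)

# Individual values may change, but the mean drifts by at most max_shift.
mean_drift = abs(mean(noisy_ages) - mean(ages))
print(noisy_ages, round(mean_drift, 2))
```

The trade-off is visible in `max_shift`: a larger bound hides individual values better but distorts aggregate statistics more.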

We have developed a simple but feature-rich Python library for data anonymization: anonympy. Anonympy is a general toolkit for data anonymization and masking; for now, it provides numerous functions for tabular and image anonymization. It builds on the efficiency of pandas and wraps existing libraries such as Faker. Our goal was to make data anonymization and masking as intuitive as possible. Let’s see how it works.

Anonymizing a Dataset

As an example, let us anonymize a dataset, uk-500.csv.

Let’s load the dataset and see what kind of data we have.

import pandas as pd

url = r'https://raw.githubusercontent.com/ArtLabss/open-data-anonimizer/0287f675a535101f145cb975baf361a96ff71ed3/examples/files/new.csv'
df = pd.read_csv(url, parse_dates=['birthdate'])
df.head()

By looking at the columns, we see that all columns contain some sensitive information. Good, it means we will have to anonymize all the columns and will be able to show all the power of the anonympy package.

Let’s start by installing the anonympy library. This takes two commands:

pip install anonympy
pip install cape-privacy --no-deps

It’s important to know the type of the column before applying any transformations. Let’s check the data types and see what methods are available to us.

from anonympy.pandas import dfAnonymizer
from anonympy.pandas.utils import available_methods
anonym = dfAnonymizer(df)

print('Numeric columns', anonym.numeric_columns)
print('Categorical columns', anonym.categorical_columns)
print('Date columns', anonym.datetime_columns)

Let’s now check what methods are available for table anonymization.

available_methods()

Using the list of available methods, we can now start applying the transformations. Let’s add some random noise to the age column, round the values in the salary column, and partially mask the email column.

anonym.numeric_noise('age')  
anonym.numeric_rounding('salary')
anonym.categorical_email_masking('email')
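
To give a feel for what rounding and partial masking do to a single value, here is a small plain-Python sketch. The helpers `mask_email` and `round_to` are hypothetical and written only for this illustration; anonympy's own methods may mask or round differently.

```python
def mask_email(email):
    """Keep the first and last character of the local part, mask the rest."""
    local, _, domain = email.partition("@")
    if len(local) <= 2:
        masked = "*" * len(local)  # too short to keep any character
    else:
        masked = local[0] + "*" * (len(local) - 2) + local[-1]
    return f"{masked}@{domain}"


def round_to(value, base=10):
    """Round a number to the nearest multiple of `base`."""
    return base * round(value / base)


print(mask_email("john.doe@example.com"))  # j******e@example.com
print(round_to(48730, base=1000))          # 49000
```

Both transformations keep the value recognizable as an email or a salary while removing the detail that would identify a person.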

To see the changes, we can call the to_df() method; for a short summary, we can call the info() method.

anonym.info()

A good choice would be to substitute the names in the first_name column with fake ones. To do that, we must first check whether Faker has a corresponding method.

from anonympy.pandas.utils import fake_methods
fake_methods('f') # args: None / 'all' / any letter

Faker has a method called first_name, so let’s replace the values in that column.

# if we wanted a different method use a dictionary {column_name:method_name}
# anonym.categorical_fake({'first_name': 'first_name_female'})
anonym.categorical_fake('first_name')

Checking fake_methods for the other column names, it turns out that Faker also has methods named address and city. The web column can be substituted using the url method, and the phone column using phone_number.

# this will change `address` and `city` because column names correspond to the method names
anonym.categorical_fake_auto()
anonym.categorical_fake({'web': 'url', 'phone': 'phone_number'})
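
The name-matching idea used above (columns whose names equal a generator name are handled automatically, the rest need an explicit mapping) can be sketched in plain Python as follows. This only mirrors the behaviour described in the text and is not anonympy's or Faker's code; `GENERATORS`, `resolve_methods`, and the sample row are hypothetical stand-ins.

```python
import random

rng = random.Random(42)  # fixed seed only to make the sketch repeatable

# Hypothetical stand-ins for fake-data generators, keyed by method name.
GENERATORS = {
    "city": lambda: rng.choice(["Springfield", "Riverton", "Lakeside"]),
    "url": lambda: f"https://example.com/{rng.randint(1000, 9999)}",
}


def resolve_methods(columns, explicit=None):
    """Map columns to generator names: explicit mapping first, then same-name match."""
    explicit = explicit or {}
    resolved = {}
    for col in columns:
        method = explicit.get(col, col)
        if method in GENERATORS:
            resolved[col] = method
    return resolved


rows = [{"city": "London", "web": "http://acme.co.uk", "age": 33}]
mapping = resolve_methods(rows[0].keys(), explicit={"web": "url"})

anonymized = []
for row in rows:
    new_row = {}
    for col, val in row.items():
        # Replace mapped columns with generated values, keep the rest untouched.
        new_row[col] = GENERATORS[mapping[col]]() if col in mapping else val
    anonymized.append(new_row)

print(mapping)  # {'city': 'city', 'web': 'url'}
```

Columns with no matching generator, such as age here, are left as they are and need a different technique.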

The last column left to anonymize is birthdate. Since the age column contains the same information, we could drop birthdate using the column_suppression method. However, for the sake of clarity, let’s add some noise to it instead.

anonym.datetime_noise('birthdate')
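
As a rough picture of what adding noise to a date column means, the sketch below shifts each date by a random number of days using only the standard library. The helper `shift_dates`, the 30-day bound, and the sample dates are assumptions for illustration; the library's own noise distribution and defaults may differ.

```python
import random
from datetime import date, timedelta


def shift_dates(dates, max_days=30, seed=1):
    """Move each date by a random number of days in [-max_days, max_days]."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [d + timedelta(days=rng.randint(-max_days, max_days)) for d in dates]


birthdates = [date(1985, 3, 14), date(1992, 11, 2), date(1978, 7, 30)]  # made-up data
shifted = shift_dates(birthdates)

for original, new in zip(birthdates, shifted):
    print(original, "->", new)
```

Note that a shifted birthdate can still be inconsistent with a separately perturbed age column, which is one reason dropping the redundant column is often the simpler choice.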

That’s it. Calling to_df() now gives us our new, anonymized dataset!

Summary

Data privacy and protection are an important part of data handling and should be given proper attention. Therefore, at ArtLabs we decided to create a convenient tool for this use case. You might also want to check our image anonymization guidelines here and our Google Colab for tabular data anonymization here. We also welcome any contributions to our anonympy open-source repository!

Special thanks to Shakhansho Sabzaliev, a student at the University of Central Asia majoring in Computer Science and ArtLabs Machine Learning Intern, for his contribution to this post and to the anonympy open-source repository!