Guide to using SQL dedupe in Python

Deduplicating: SQL vs. Python

Both SQL and Python offer powerful functions to help data engineers clean data and eliminate dreaded ‘dupes’ in datasets.

One of the most important processes a data engineer can master is deduplicating values to provide clean data for data consumers. Since raw data can vary in format and cleanliness, it is vital that data…

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Return DataFrame with duplicate rows removed.

Considering certain columns is optional. Indexes, including time indexes, are ignored.

Parameters

subset : column label or sequence of labels, optional
    Only consider certain columns for identifying duplicates; by default, use all of the columns.

keep : {‘first’, ‘last’, False}, default ‘first’
    Determines which duplicates (if any) to keep.
    - first : Drop duplicates except for the first occurrence.
    - last : Drop duplicates except for the last occurrence.
    - False : Drop all duplicates.

inplace : bool, default False
    Whether to drop duplicates in place or to return a copy.

ignore_index : bool, default False
    If True, the resulting axis will be labeled 0, 1, …, n - 1.
    New in version 1.0.0.

Returns

DataFrame or None
    DataFrame with duplicates removed or None if inplace=True.

Examples

Consider a dataset containing ramen ratings.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns.

>>> df.drop_duplicates()
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use subset.

>>> df.drop_duplicates(subset=['brand'])
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep last occurrences, use keep.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
    brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
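
The examples above do not exercise keep=False or ignore_index. The following is a minimal sketch of both, reusing the same ramen data; the variable names unique_only and relabeled are illustrative and not part of the pandas documentation.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# keep=False drops every row that has a duplicate, so only rows that
# were unique to begin with survive (rows 0 and 1 both disappear).
unique_only = df.drop_duplicates(keep=False)

# ignore_index=True relabels the result 0, 1, ..., n - 1 instead of
# keeping the original row labels.
relabeled = df.drop_duplicates(subset=['brand'], ignore_index=True)

print(unique_only)
print(relabeled)
```

With keep='first' the result would keep labels 0 and 2; passing ignore_index=True is one way to avoid a separate reset_index call afterwards.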

I'm looking to remove the names that are repeated.

My code connects to SQL Server and retrieves the data into Python.

def get_all_artist():
    # execute_read_query and conn are assumed to be defined elsewhere.
    query = "SELECT artist_name FROM Sheet1"
    all_artist = execute_read_query(conn, query)
    for artist_record in all_artist:
        print(str(artist_record[0]))
    return all_artist

The artist_name values that I am retrieving from SQL are:

BTS
BTS
BTS
TWICE
TWICE
TWICE
TWICE
TWICE
HEIZE
HEIZE
KHALID
KHALID
KHALID
ERIC CHOU
ERIC CHOU
ERIC CHOU
SAM SMITH
SAM SMITH
SAM SMITH
AGUST D

However, I'd like to remove the duplicates only in Python, without deleting any rows from my SQL table, so that the output is:

BTS
TWICE
HEIZE
KHALID
ERIC CHOU
SAM SMITH
AGUST D
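
One possible way to get that output is sketched below. It makes several assumptions: an in-memory sqlite3 database stands in for the SQL Server connection, the table is filled with a shortened sample of the names, and unique_in_order is a hypothetical helper name. The idea is to deduplicate the fetched rows with dict.fromkeys, which keeps the first occurrence of each name in the order it was read. The table itself is never modified.

```python
import sqlite3


def unique_in_order(rows):
    """Return the first column of each row with duplicates removed,
    keeping the order in which each value was first seen."""
    return list(dict.fromkeys(row[0] for row in rows))


# Hypothetical in-memory stand-in for the Sheet1 table in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sheet1 (artist_name TEXT)")
names = ["BTS", "BTS", "TWICE", "TWICE", "HEIZE", "KHALID", "KHALID", "AGUST D"]
conn.executemany(
    "INSERT INTO Sheet1 (artist_name) VALUES (?)", [(n,) for n in names]
)

# Read every row as before, then deduplicate on the Python side.
rows = conn.execute("SELECT artist_name FROM Sheet1").fetchall()
artists = unique_in_order(rows)
for name in artists:
    print(name)

# Alternative: let the database deduplicate the result set.
# SELECT DISTINCT only affects what is returned; no rows are deleted.
distinct_rows = conn.execute(
    "SELECT DISTINCT artist_name FROM Sheet1"
).fetchall()
```

Note that SELECT DISTINCT does not promise any particular row order unless the query adds an ORDER BY, whereas the dict.fromkeys approach preserves whatever order the rows arrived in.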