I think iterrows is not necessary here, because it is the slowest way to iterate in pandas (and each row comes back as a Series, whereas dicts are needed here).
First add the scalar columns to the DataFrame and rename the columns:
df1 = df.rename(columns={'Start Date':'start_date'}).rename(columns=str.lower)
df1.insert(3, 'scoring', 'total')
df1['end_date'] = '2020-03-31'
df1['start_date'] = df1['start_date'].dt.strftime('%Y-%m-%d')
print(df1)
region sector brand id scoring start_date end_date
7188 US 41 40000 total 2006-03-06 2020-03-31
7189 US 41 40345 total 2017-11-06 2020-03-31
7190 US 41 40123 total 2019-01-12 2020-03-31
7191 US 42 40145 total 2001-02-06 2020-03-31
7192 US 42 40185 total 2013-03-16 2020-03-31
Then convert it to a list of dicts with DataFrame.to_dict and loop over the result:
for d in df1.to_dict('records'):
    print(d)
{'region': 'US', 'sector': 41, 'brand id': 40000, 'scoring': 'total', 'start_date': '2006-03-06', 'end_date': '2020-03-31'}
{'region': 'US', 'sector': 41, 'brand id': 40345, 'scoring': 'total', 'start_date': '2017-11-06', 'end_date': '2020-03-31'}
{'region': 'US', 'sector': 41, 'brand id': 40123, 'scoring': 'total', 'start_date': '2019-01-12', 'end_date': '2020-03-31'}
{'region': 'US', 'sector': 42, 'brand id': 40145, 'scoring': 'total', 'start_date': '2001-02-06', 'end_date': '2020-03-31'}
{'region': 'US', 'sector': 42, 'brand id': 40185, 'scoring': 'total', 'start_date': '2013-03-16', 'end_date': '2020-03-31'}
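For reference, the steps above can be tried end to end with a small stand-in frame. This is only a sketch: the input `df` here is hypothetical sample data shaped like the printed output, and the name `records` is introduced for illustration.

```python
import pandas as pd

# Hypothetical sample data (assumption): two rows shaped like the output above
df = pd.DataFrame(
    {
        'region': ['US', 'US'],
        'sector': [41, 42],
        'brand id': [40000, 40145],
        'Start Date': pd.to_datetime(['2006-03-06', '2001-02-06']),
    },
    index=[7188, 7191],
)

# Rename, add the scalar columns, and format the dates as strings
df1 = df.rename(columns={'Start Date': 'start_date'}).rename(columns=str.lower)
df1.insert(3, 'scoring', 'total')
df1['end_date'] = '2020-03-31'
df1['start_date'] = df1['start_date'].dt.strftime('%Y-%m-%d')

# One dict per row, without iterrows()
records = df1.to_dict('records')
for d in records:
    print(d)
```

Each element of `records` is a plain dict keyed by column name, so it can be passed directly to code that expects dicts.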
If pandas.DataFrame is iterated by a for loop as it is, column names are returned. You can iterate over columns and rows of pandas.DataFrame with the iteritems(), iterrows(), and itertuples() methods.
This article describes the following contents.
- Iterate pandas.DataFrame in for loop as it is
- Iterate columns of pandas.DataFrame
  - DataFrame.iteritems()
- Iterate rows of pandas.DataFrame
  - DataFrame.iterrows()
  - DataFrame.itertuples()
- Iterate only specific columns
- Update values in for loop
- Speed comparison
For more information on the for statement in Python, see the following article.
- for loop in Python (with range, enumerate, zip, etc.)
Use the following pandas.DataFrame as an example.
import pandas as pd
import numpy as np
df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])
print(df)
# age state point
# Alice 24 NY 64
# Bob 42 CA 92
Iterate pandas.DataFrame in for loop as it is
If you iterate pandas.DataFrame in a for loop as is, the column names are returned in order.
for column_name in df:
    print(column_name)
# age
# state
# point
Iterate columns of pandas.DataFrame
DataFrame.iteritems()
The iteritems() method iterates over columns and returns (column name, Series), a tuple with the column name and the content as pandas.Series. Note that iteritems() was removed in pandas 2.0; the items() method returns the same tuples.
- pandas.DataFrame.iteritems — pandas 1.4.2 documentation
for column_name, item in df.iteritems():
    print(column_name)
    print('------')
    print(type(item))
    print(item)
    print('------')
    print(item[0], item['Alice'], item.Alice)
    print(item[1], item['Bob'], item.Bob)
    print('======\n')
# age
# ------
# <class 'pandas.core.series.Series'>
# Alice 24
# Bob 42
# Name: age, dtype: int64
# ------
# 24 24 24
# 42 42 42
# ======
#
# state
# ------
# <class 'pandas.core.series.Series'>
# Alice NY
# Bob CA
# Name: state, dtype: object
# ------
# NY NY NY
# CA CA CA
# ======
#
# point
# ------
# <class 'pandas.core.series.Series'>
# Alice 64
# Bob 92
# Name: point, dtype: int64
# ------
# 64 64 64
# 92 92 92
# ======
#
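As a rough sketch, column iteration can also be written with `DataFrame.items()`, which yields (column name, Series) pairs in the same way. The dict name `first_values` is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])

# items() yields (column name, Series) pairs, one per column
first_values = {}
for column_name, item in df.items():
    # iloc[0] selects the first element of the column by position
    first_values[column_name] = item.iloc[0]

print(first_values)
```

Using `iloc` for positional access avoids any ambiguity between labels and positions when the index holds strings such as 'Alice' and 'Bob'.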
Iterate rows of pandas.DataFrame
The iterrows() and itertuples() methods iterate over rows. The itertuples() method is faster.
If you only need the values for a particular column, it is even faster to iterate over the elements of a given column individually, as explained next. The results of the speed comparison are shown at the end.
DataFrame.iterrows()
The iterrows() method iterates over rows and returns (index, Series), a tuple with the index and the content as pandas.Series.
- pandas.DataFrame.iterrows — pandas 1.4.2 documentation
for index, row in df.iterrows():
    print(index)
    print('------')
    print(type(row))
    print(row)
    print('------')
    print(row[0], row['age'], row.age)
    print(row[1], row['state'], row.state)
    print(row[2], row['point'], row.point)
    print('======\n')
# Alice
# ------
# <class 'pandas.core.series.Series'>
# age 24
# state NY
# point 64
# Name: Alice, dtype: object
# ------
# 24 24 24
# NY NY NY
# 64 64 64
# ======
#
# Bob
# ------
# <class 'pandas.core.series.Series'>
# age 42
# state CA
# point 92
# Name: Bob, dtype: object
# ------
# 42 42 42
# CA CA CA
# 92 92 92
# ======
#
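The `dtype: object` in the output above hints at a side effect of packing each row into a single Series: the column types are not necessarily preserved. The following sketch illustrates this with a hypothetical frame, `df_mixed`, that is not part of the example data.

```python
import pandas as pd

# Hypothetical frame (assumption): one integer column and one float column
df_mixed = pd.DataFrame({'count': [1, 2], 'ratio': [0.5, 1.5]})

# iterrows() puts each row into one Series, so the integer is upcast to float
_, first_row = next(df_mixed.iterrows())
print(df_mixed['count'].dtype)
print(first_row.dtype)
print(type(first_row['count']))

# itertuples() takes each value from its own column, so it stays an integer
first_tuple = next(df_mixed.itertuples(index=False))
print(type(first_tuple[0]))
```

If the per-column types matter inside the loop, this is one more reason to consider itertuples() or column-wise iteration over iterrows().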
DataFrame.itertuples()
The itertuples() method iterates over rows and returns a tuple of the index and the content. The first element of the tuple is the index.
- pandas.DataFrame.itertuples — pandas 1.4.2 documentation
By default, it returns a namedtuple named Pandas. Because it is a namedtuple, you can access the value of each element with dot notation (.) as well as with [].
for row in df.itertuples():
    print(type(row))
    print(row)
    print('------')
    print(row[0], row.Index)
    print(row[1], row.age)
    print(row[2], row.state)
    print(row[3], row.point)
    print('======\n')
# <class 'pandas.core.frame.Pandas'>
# Pandas(Index='Alice', age=24, state='NY', point=64)
# ------
# Alice Alice
# 24 24
# NY NY
# 64 64
# ======
#
# <class 'pandas.core.frame.Pandas'>
# Pandas(Index='Bob', age=42, state='CA', point=92)
# ------
# Bob Bob
# 42 42
# CA CA
# 92 92
# ======
#
A normal tuple is returned if the name parameter is set to None.
for row in df.itertuples(name=None):
    print(type(row))
    print(row)
    print(row[0], row[1], row[2], row[3])
    print('======\n')
# <class 'tuple'>
# ('Alice', 24, 'NY', 64)
# Alice 24 NY 64
# ======
#
# <class 'tuple'>
# ('Bob', 42, 'CA', 92)
# Bob 42 CA 92
# ======
#
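Besides `name`, itertuples() also accepts an `index` parameter. The sketch below shows one possible use: dropping the index with `index=False` and giving the namedtuple a custom type name. The name 'Person' is an arbitrary choice made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])

# index=False omits the index; name sets the namedtuple's type name
rows = list(df.itertuples(index=False, name='Person'))
for row in rows:
    print(row.age, row.state, row.point)
# 24 NY 64
# 42 CA 92
```

With `index=False`, the first element of each tuple is the first column rather than the index, so positional access shifts by one compared with the default.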
Iterate only specific columns
If you only need the elements of a particular column, you can also write as follows.
A column of pandas.DataFrame is a pandas.Series.
print(df['age'])
# Alice 24
# Bob 42
# Name: age, dtype: int64
print(type(df['age']))
# <class 'pandas.core.series.Series'>
If you use a pandas.Series in a for loop, its values are returned in order. Selecting a column of pandas.DataFrame and looping over it therefore gives the values of that column in order.
for age in df['age']:
    print(age)
# 24
# 42
You can also get the values of multiple columns with the built-in zip() function.
- zip() in Python: Get elements from multiple lists
for age, point in zip(df['age'], df['point']):
    print(age, point)
# 24 64
# 42 92
Use the index attribute if you want to get the index. As in the example above, you can get it together with other columns by zip().
print(df.index)
# Index(['Alice', 'Bob'], dtype='object')
print(type(df.index))
# <class 'pandas.core.indexes.base.Index'>
for index in df.index:
    print(index)
# Alice
# Bob
for index, state in zip(df.index, df['state']):
    print(index, state)
# Alice NY
# Bob CA
Update values in for loop
The pandas.Series returned by the iterrows() method is a copy, not a view, so changing it will not update the original data.
for index, row in df.iterrows():
    row['point'] += row['age']
print(df)
# age state point
# Alice 24 NY 64
# Bob 42 CA 92
You can update it by selecting elements of the original DataFrame with at[].
for index, row in df.iterrows():
    df.at[index, 'point'] += row['age']
print(df)
# age state point
# Alice 24 NY 88
# Bob 42 CA 134
See the following article on at[].
- pandas: Get/Set element values with at, iat, loc, iloc
However, in many cases, it is not necessary to use a for loop to update an element or to add a new column based on an existing column. It is simpler and faster to write without a for loop.
Same process without a for loop:
df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])
df['point'] += df['age']
print(df)
# age state point
# Alice 24 NY 88
# Bob 42 CA 134
You can add a new column.
df['new'] = df['point'] + df['age'] * 2
print(df)
# age state point new
# Alice 24 NY 88 136
# Bob 42 CA 134 218
You can also apply NumPy functions to each element of a column.
df['age_sqrt'] = np.sqrt(df['age'])
print(df)
# age state point new age_sqrt
# Alice 24 NY 88 136 4.898979
# Bob 42 CA 134 218 6.480741
For strings, various methods are provided to process the columns directly. The following is an example of converting to lower case and selecting the first character.
- pandas: Handle strings (replace, strip, case conversion, etc.)
- pandas: Slice substrings from each element in columns
df['state_0'] = df['state'].str.lower().str[0]
print(df)
# age state point new age_sqrt state_0
# Alice 24 NY 88 136 4.898979 n
# Bob 42 CA 134 218 6.480741 c
Speed comparison
Compare the speed of iterrows(), itertuples(), and the method of specifying columns.
Use a pandas.DataFrame with 100 rows and 10 columns as an example. It is a simple example with only numeric elements; the row names (index) and column names (columns) are the default sequential numbers.
- numpy.arange(), linspace(): Generate ndarray with evenly spaced values
- pandas: Get first/last n rows of DataFrame with head(), tail(), slice
import numpy as np
import pandas as pd
# pd.np is not available in pandas 2.0 and later, so NumPy is imported directly
df = pd.DataFrame(np.arange(1000).reshape(100, 10))
print(df.shape)
# (100, 10)
print(df.head())
# 0 1 2 3 4 5 6 7 8 9
# 0 0 1 2 3 4 5 6 7 8 9
# 1 10 11 12 13 14 15 16 17 18 19
# 2 20 21 22 23 24 25 26 27 28 29
# 3 30 31 32 33 34 35 36 37 38 39
# 4 40 41 42 43 44 45 46 47 48 49
print(df.tail())
# 0 1 2 3 4 5 6 7 8 9
# 95 950 951 952 953 954 955 956 957 958 959
# 96 960 961 962 963 964 965 966 967 968 969
# 97 970 971 972 973 974 975 976 977 978 979
# 98 980 981 982 983 984 985 986 987 988 989
# 99 990 991 992 993 994 995 996 997 998 999
Note that the code below uses the Jupyter Notebook magic command %%timeit and does not work when run as a Python script.
- Measure execution time with timeit in Python
%%timeit
for i, row in df.iterrows():
    pass
# 4.53 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
for t in df.itertuples():
    pass
# 981 µs ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
for t in df.itertuples(name=None):
    pass
# 718 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
for i in df[0]:
    pass
# 15.6 µs ± 446 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
for i, j, k in zip(df[0], df[4], df[9]):
    pass
# 46.1 µs ± 588 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
for t in zip(df[0], df[1], df[2], df[3], df[4], df[5], df[6], df[7], df[8], df[9]):
    pass
# 147 µs ± 3.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
iterrows() is slow because it converts each row to pandas.Series.
itertuples() is faster than iterrows(), but the method of specifying columns is the fastest. In the example environment, it is faster than itertuples() even if all columns are specified.
As the number of rows increases, iterrows() becomes even slower. You should try using itertuples() or column specification in such a case.
Of course, as mentioned above, it is best not to use the for loop if it is not necessary.
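Since %%timeit only works in Jupyter Notebook, a comparable measurement in a plain Python script could be sketched with the standard library's timeit module. The helper function names and the `results` dict are introduced here for illustration, and the absolute timings will differ by environment.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1000).reshape(100, 10))

def loop_iterrows():
    for _, row in df.iterrows():
        pass

def loop_itertuples():
    for t in df.itertuples(name=None):
        pass

def loop_column():
    for value in df[0]:
        pass

results = {}
for func in (loop_iterrows, loop_itertuples, loop_column):
    # Total seconds for 100 calls of each loop; smaller is faster
    results[func.__name__] = timeit.timeit(func, number=100)
    print(f'{func.__name__}: {results[func.__name__]:.4f} s')
```

Unlike %%timeit, this reports the total time for the given number of calls rather than a per-loop mean, so divide by `number` to compare with the figures above.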