I want to partition a pandas DataFrame into ten disjoint, equally-sized, randomly composed subsets.
I know I can randomly sample one tenth of the original DataFrame using:
partition_1 = pandas.DataFrame.sample(frac=(1/10))
However, how can I obtain the other nine partitions? If I call pandas.DataFrame.sample(frac=(1/10)) again, the resulting subsets might not be disjoint.
Thanks for the help!
asked Jul 25, 2016 at 14:19
Starting with this:
dfm = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'] * 2,
                    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'] * 2})
A B
0 foo one
1 bar one
2 foo two
3 bar three
4 foo two
5 bar two
6 foo one
7 foo three
8 foo one
9 bar one
10 foo two
11 bar three
12 foo two
13 bar two
14 foo one
15 foo three
Usage:
Change "4" to "10" and use [i] to get the i-th slice.
np.random.seed(32)  # for reproducible results
np.array_split(dfm.reindex(np.random.permutation(dfm.index)), 4)[1]
A B
2 foo two
5 bar two
10 foo two
12 foo two
np.array_split(dfm.reindex(np.random.permutation(dfm.index)), 4)[3]
A B
13 bar two
11 bar three
0 foo one
7 foo three
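Note that each call above draws a fresh permutation, so slices taken from separate calls are not guaranteed to be disjoint. A minimal sketch of shuffling once and then cutting that single shuffle into pieces, assuming the dfm defined above (the names n_parts, shuffled_labels and partitions are illustrative and not part of the original answer):

```python
import numpy as np
import pandas as pd

dfm = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'] * 2,
                    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'] * 2})

n_parts = 4  # change to 10 for ten partitions
rng = np.random.RandomState(32)  # seeded only so the shuffle is repeatable

# Shuffle the index labels once, then cut that one shuffle into n_parts pieces.
shuffled_labels = rng.permutation(dfm.index.values)
partitions = [dfm.loc[labels] for labels in np.array_split(shuffled_labels, n_parts)]

# Every row label appears in exactly one partition.
all_labels = np.concatenate([p.index.values for p in partitions])
print(sorted(all_labels.tolist()) == list(range(len(dfm))))  # True
print([len(p) for p in partitions])  # [4, 4, 4, 4]
```

Splitting the label array rather than the DataFrame itself keeps the sketch independent of how np.array_split handles DataFrame objects.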
answered Jul 25, 2016 at 15:05
Merlin
Use np.random.permutation:
df.loc[np.random.permutation(df.index)]
This shuffles the rows of the DataFrame while keeping the column names; afterwards you can split the shuffled DataFrame into 10 parts.
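One possible way to carry out the final split, sketched under the assumption that the index labels are unique. The toy DataFrame and the names n_parts, bounds and parts are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': range(25)})  # toy data standing in for the real DataFrame

np.random.seed(0)  # only to make the shuffle repeatable
shuffled = df.loc[np.random.permutation(df.index)]

n_parts = 10
# Boundaries of n_parts nearly equal position ranges over the shuffled rows.
bounds = np.linspace(0, len(shuffled), n_parts + 1).astype(int)
parts = [shuffled.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

print(sum(len(p) for p in parts))  # 25
```

Because the parts are consecutive, non-overlapping position ranges of one shuffled DataFrame, they are disjoint by construction.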
answered Jul 25, 2016 at 14:23
SerialDev
Say df is your dataframe, and you want N_PARTITIONS partitions of roughly equal size (they will be of exactly equal size if len(df) is divisible by N_PARTITIONS).
Use np.random.permutation to permute the array np.arange(len(df)). Then take slices of that array with step N_PARTITIONS, and extract the corresponding rows of your dataframe with .iloc[].
import numpy as np
permuted_indices = np.random.permutation(len(df))
dfs = []
for i in range(N_PARTITIONS):
    dfs.append(df.iloc[permuted_indices[i::N_PARTITIONS]])
Since you are on Python 2.7, it might be better to replace range(N_PARTITIONS) with xrange(N_PARTITIONS) to get an iterator instead of a list.
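As a rough check of the "roughly equal size" claim, the loop can be wrapped in a helper and run on a length that is not divisible by the number of partitions. The function name random_partitions, its seed parameter and the toy data are assumptions made for this sketch:

```python
import numpy as np
import pandas as pd

def random_partitions(df, n_partitions, seed=None):
    """Split df into n_partitions disjoint random subsets via strided slices."""
    rng = np.random.RandomState(seed)
    permuted_indices = rng.permutation(len(df))
    return [df.iloc[permuted_indices[i::n_partitions]] for i in range(n_partitions)]

df = pd.DataFrame({'x': np.arange(23)})  # 23 rows, not divisible by 10
dfs = random_partitions(df, 10, seed=1)

sizes = [len(part) for part in dfs]
print(sizes)  # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
```

The sizes differ by at most one row, and each row ends up in exactly one partition because every permuted position falls in exactly one strided slice.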
answered Jul 25, 2016 at 14:42
There are several ways to split a DataFrame, depending on the desired result. Let’s use the diamonds dataset as an example.
Python3
import seaborn as sns
import pandas as pd
import numpy as np

df = sns.load_dataset('diamonds')
df.head()
Output:
Method 1: Splitting Pandas Dataframe by row index
In the code below, the DataFrame is divided into two parts: the first 1000 rows and the remaining rows. The output shows the shape of each newly formed DataFrame.
Python3
df_1 = df.iloc[:1000, :]
df_2 = df.iloc[1000:, :]
print("Shape of new dataframes - {} , {}".format(df_1.shape, df_2.shape))
Output:
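The same row-index slicing can be repeated to cut a DataFrame into consecutive chunks of a fixed size. The sketch below uses a small synthetic DataFrame instead of the diamonds data, and chunk_size and chunks are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'price': range(2500)})  # synthetic stand-in for the diamonds data

chunk_size = 1000
# Step through the row positions in strides of chunk_size; the last chunk may be shorter.
chunks = [df.iloc[start:start + chunk_size] for start in range(0, len(df), chunk_size)]

print([chunk.shape for chunk in chunks])  # [(1000, 1), (1000, 1), (500, 1)]
```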
Method 2: Splitting Pandas Dataframe by groups formed from unique column values
Here, we first group the data by the values of the column “color”. The newly formed DataFrame then consists of the rows with color = “E”.
Python3
grouped = df.groupby(df.color)
df_new = grouped.get_group("E")
df_new
Output:
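To obtain one DataFrame per unique value rather than a single group, the groupby object can be iterated. A minimal sketch on made-up data (the colors and prices below are invented for illustration and are not taken from the diamonds dataset):

```python
import pandas as pd

df = pd.DataFrame({'color': ['E', 'I', 'E', 'J', 'I', 'E'],
                   'price': [326, 334, 327, 335, 336, 337]})

# Iterating a groupby yields (key, sub-DataFrame) pairs, one per unique value.
groups = {color: sub_df for color, sub_df in df.groupby('color')}

print(sorted(groups))      # ['E', 'I', 'J']
print(len(groups['E']))    # 3
```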
Method 3: Splitting Pandas Dataframe in predetermined sized chunks
In the code below, we form a new dataset containing a fraction of 0.6, i.e. 60%, of the total rows (the length of the dataset), which gives 32364 rows. These rows are selected randomly.
Python3
df_split = df.sample(frac=0.6, random_state=200)
df_split.reset_index()
Output:
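The rows that were not sampled can be recovered by dropping the sampled index labels, which gives a disjoint second part. This sketch assumes the index labels are unique and uses synthetic data; df_rest is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'carat': [0.1 * i for i in range(100)]})  # synthetic data

df_split = df.sample(frac=0.6, random_state=200)
# Everything not drawn into df_split; relies on unique index labels.
df_rest = df.drop(df_split.index)

print(len(df_split), len(df_rest))  # 60 40
```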