How do I partition a pandas DataFrame in Python?

I want to partition a pandas DataFrame into ten disjoint, equally-sized, randomly composed subsets.

I know I can randomly sample one tenth of the original pandas DataFrame using:

partition_1 = pandas.DataFrame.sample(frac=(1/10))

However, how can I obtain the other nine partitions? If I call pandas.DataFrame.sample(frac=(1/10)) again, the subsets may not be disjoint.

Thanks for the help!

asked Jul 25, 2016 at 14:19

1

Starting with this.

 import numpy as np
 import pandas as pd

 dfm = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',  'foo', 'bar', 'foo', 'foo']*2,
                     'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']*2})

     A      B
0   foo    one
1   bar    one
2   foo    two
3   bar  three
4   foo    two
5   bar    two
6   foo    one
7   foo  three
8   foo    one
9   bar    one
10  foo    two
11  bar  three
12  foo    two
13  bar    two
14  foo    one
15  foo  three

Usage:
Change 4 to 10 to get ten partitions, and index the result with [i] to get the i-th slice.

np.random.seed(32) # for reproducible results.
np.array_split(dfm.reindex(np.random.permutation(dfm.index)), 4)[1]
      A    B
2   foo  two
5   bar  two
10  foo  two
12  foo  two

np.array_split(dfm.reindex(np.random.permutation(dfm.index)), 4)[3]

     A      B
13  foo    two
11  bar  three
0   foo    one
7   foo  three
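A variation on the same idea, sketched under a few assumptions: instead of passing the DataFrame itself to np.array_split, the shuffled index labels are split and each chunk is looked up with .loc. The seeded generator from np.random.default_rng and the names shuffled_labels, label_chunks and partitions are choices made for this sketch and are not part of the answer above.

```python
import numpy as np
import pandas as pd

dfm = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"] * 2,
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"] * 2,
})

# Seeded generator so the shuffle is reproducible (an assumption of this sketch).
rng = np.random.default_rng(32)
shuffled_labels = rng.permutation(dfm.index.to_numpy())

# Split the shuffled labels (a plain NumPy array) into 4 chunks,
# then fetch the rows for each chunk of labels with .loc.
label_chunks = np.array_split(shuffled_labels, 4)
partitions = [dfm.loc[chunk] for chunk in label_chunks]
```

Because every label appears in exactly one chunk, the resulting partitions are disjoint and together cover all 16 rows. This relies on the index labels being unique.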

answered Jul 25, 2016 at 15:05

Merlin


0

Use np.random.permutation:

df.loc[np.random.permutation(df.index)]

This shuffles the rows of the DataFrame and keeps the column names. Afterwards you can split the shuffled DataFrame into 10 parts.
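The "split into 10" step is not spelled out in this answer. One possible way to do it, as a minimal sketch, is to cut the shuffled frame into contiguous blocks of row positions. The example data and the names n_parts, bounds and parts are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(100)})  # hypothetical example data

# Shuffle the rows as in the answer above.
shuffled = df.loc[np.random.permutation(df.index)]

n_parts = 10
# Integer boundaries of ten contiguous, near-equal blocks of row positions.
bounds = [i * len(shuffled) // n_parts for i in range(n_parts + 1)]
parts = [shuffled.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
```

Each row position falls into exactly one block, so the ten parts are disjoint. When the number of rows is not divisible by n_parts, the block sizes differ by at most one.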

answered Jul 25, 2016 at 14:23

SerialDev


Say df is your dataframe, and you want N_PARTITIONS partitions of roughly equal size (they will be of exactly equal size if len(df) is divisible by N_PARTITIONS).

Use np.random.permutation to permute the array np.arange(len(df)). Then take slices of that array with step N_PARTITIONS, and extract the corresponding rows of your dataframe with .iloc[].

import numpy as np

permuted_indices = np.random.permutation(len(df))

dfs = []
for i in range(N_PARTITIONS):
    dfs.append(df.iloc[permuted_indices[i::N_PARTITIONS]])

Since you are on Python 2.7, it might be better to replace range(N_PARTITIONS) with xrange(N_PARTITIONS) to get an iterator instead of a list.
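The approach above can be wrapped in a small helper and checked on a frame whose length is not divisible by the number of partitions. This is only a sketch written for Python 3: the function name random_partitions, the seed parameter and the 23-row example frame are assumptions, not part of the answer.

```python
import numpy as np
import pandas as pd

def random_partitions(df, n_partitions, seed=None):
    """Split df into n_partitions disjoint, randomly composed subsets."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(len(df))  # shuffled row positions 0..len(df)-1
    # Partition i takes every n_partitions-th shuffled position, starting at i.
    return [df.iloc[permuted[i::n_partitions]] for i in range(n_partitions)]

df = pd.DataFrame({"x": range(23)})  # hypothetical example data
parts = random_partitions(df, 10, seed=0)
sizes = [len(p) for p in parts]
print(sizes)  # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
```

With 23 rows and 10 partitions, the first three partitions receive three rows and the rest receive two, which illustrates the "roughly equal size" caveat.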

answered Jul 25, 2016 at 14:42

    There are several ways to split a DataFrame, depending on the result you need. Let's use the diamonds dataset as an example.
     

    Python3

    import seaborn as sns
    import pandas as pd
    import numpy as np

    # Load the example diamonds dataset that ships with seaborn.
    df = sns.load_dataset('diamonds')
    df.head()


    Method 1: Splitting Pandas Dataframe by row index
    In the code below, the DataFrame is divided into two parts: the first 1000 rows and the remaining rows. The code prints the shapes of the two new DataFrames.
     

    Python3

    df_1 = df.iloc[:1000, :]
    df_2 = df.iloc[1000:, :]

    print("Shape of new dataframes - {} , {}".format(df_1.shape, df_2.shape))
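The same positional split can be tried without downloading the diamonds dataset. The sketch below uses a small made-up frame as a stand-in, and the helper name split_at is hypothetical.

```python
import pandas as pd

def split_at(df, n):
    """Return (first n rows, remaining rows), selected by row position."""
    return df.iloc[:n], df.iloc[n:]

# Small made-up frame standing in for the diamonds data.
df = pd.DataFrame({"carat": [0.23, 0.21, 0.23, 0.29, 0.31],
                   "price": [326, 326, 327, 334, 335]})

head_part, tail_part = split_at(df, 2)
print(head_part.shape, tail_part.shape)  # (2, 2) (3, 2)
```

Because .iloc works on positions rather than labels, this split behaves the same way even if the index has been shuffled or is not numeric.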


    Method 2: Splitting Pandas Dataframe by groups formed from unique column values
    Here, we first group the data by the values in the "color" column. The new DataFrame then consists of the rows in the group with color = "E".
     

    Python3

    grouped = df.groupby(df.color)
    df_new = grouped.get_group("E")
    df_new
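If every group is needed rather than just one, iterating over the groupby object yields one sub-frame per unique value. A minimal sketch on made-up data follows; the frame contents and the name groups are assumptions.

```python
import pandas as pd

# Small made-up frame standing in for the diamonds data.
df = pd.DataFrame({"color": ["E", "I", "E", "J", "I"],
                   "price": [326, 334, 327, 335, 336]})

# Iterating over a groupby object yields (key, sub-DataFrame) pairs.
groups = {color: group for color, group in df.groupby("color")}

print(sorted(groups))                  # ['E', 'I', 'J']
print(groups["E"]["price"].tolist())   # [326, 327]
```

The sub-frames are disjoint by construction, since each row belongs to exactly one value of the grouping column.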


    Method 3: Splitting Pandas Dataframe in predetermined sized chunks
    In the code below, we form a new dataset with frac=0.6, i.e. 60% of the total rows (the length of the dataset), which gives 32364 rows. These rows are selected randomly.
     

    Python3

    df_split = df.sample(frac=0.6, random_state=200)
    df_split.reset_index()
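The sample above only yields the 60% part. One way to obtain the remaining rows as well, sketched here on a small made-up frame, is to drop the sampled index labels from the original. The name df_rest is hypothetical, and the approach assumes unique index labels.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})  # made-up stand-in for the diamonds data

# Randomly pick 60% of the rows; random_state makes the pick reproducible.
df_split = df.sample(frac=0.6, random_state=200)

# The complement: every row whose index label was not sampled.
df_rest = df.drop(df_split.index)

print(len(df_split), len(df_rest))  # 6 4
```

Note that reset_index() returns a new DataFrame rather than modifying df_split in place, so it should be called only after the complement has been taken, and its result assigned to a name if it is needed later.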



    How do I partition strings in pandas?

    pandas str.partition() works much like str.split(), but instead of splitting the string at every occurrence of the separator/delimiter, it splits the string only at the first occurrence. With str.split(), the separator is not stored anywhere; only the text around it is stored in the new list/DataFrame.
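A short sketch of the difference, using a made-up Series (the variable names are assumptions):

```python
import pandas as pd

s = pd.Series(["a-b-c", "d-e-f"])

# str.partition splits only at the first separator and keeps the separator
# as the middle column of the resulting DataFrame.
parts = s.str.partition("-")
print(parts.iloc[0].tolist())  # ['a', '-', 'b-c']

# str.split splits at every separator and discards it.
splits = s.str.split("-")
print(splits.iloc[0])          # ['a', 'b', 'c']
```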

    How do you split a data frame in half?

    One approach is to split the data frame 'df' into two parts, 'df1' and 'df2', on the basis of the values in a column such as 'Weight'. Another is DataFrame.groupby(), which splits the data into groups based on some criteria.

    How do you split a column into a Dataframe in Python?

    pandas provides str.split() to split a string around a passed separator/delimiter. The result can be stored as a list in a Series, or it can be used to create multiple DataFrame columns from a single separated string.
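Both outcomes can be sketched on a small made-up Series. The data and the column names "first" and "last" are assumptions for illustration.

```python
import pandas as pd

names = pd.Series(["Ada Lovelace", "Alan Turing"])

# Default: each element becomes a list of pieces, stored in a Series.
as_lists = names.str.split(" ")
print(as_lists.iloc[1])  # ['Alan', 'Turing']

# expand=True: the pieces are spread across DataFrame columns instead.
as_columns = names.str.split(" ", expand=True)
as_columns.columns = ["first", "last"]
print(as_columns["last"].tolist())  # ['Lovelace', 'Turing']
```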

