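Use numpy.split, passing the row positions at which to cut (here 20% and 50% of the rows, which gives parts of 20%, 30% and 50%):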

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])

Sample:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
#print (df)

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print (a)
          A         B         C         D         E
0  0.543405  0.278369  0.424518  0.844776  0.004719
1  0.121569  0.670749  0.825853  0.136707  0.575093
2  0.891322  0.209202  0.185328  0.108377  0.219697
3  0.978624  0.811683  0.171941  0.816225  0.274074

print (b)
          A         B         C         D         E
4  0.431704  0.940030  0.817649  0.336112  0.175410
5  0.372832  0.005689  0.252426  0.795663  0.015255
6  0.598843  0.603805  0.105148  0.381943  0.036476
7  0.890412  0.980921  0.059942  0.890546  0.576901
8  0.742480  0.630184  0.581842  0.020439  0.210027
9  0.544685  0.769115  0.250695  0.285896  0.852395

print (c)
           A         B         C         D         E
10  0.975006  0.884853  0.359508  0.598859  0.354796
11  0.340190  0.178081  0.237694  0.044862  0.505431
12  0.376252  0.592805  0.629942  0.142600  0.933841
13  0.946380  0.602297  0.387766  0.363188  0.204345
14  0.276765  0.246536  0.173608  0.966610  0.957013
15  0.597974  0.731301  0.340385  0.092056  0.463498
16  0.508699  0.088460  0.528035  0.992158  0.395036
17  0.335596  0.805451  0.754349  0.313066  0.634037
18  0.540405  0.296794  0.110788  0.312640  0.456979
19  0.658940  0.254258  0.641101  0.200124  0.657625
Answer from jezrael on Stack Overflow
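The same cut can also be sketched with plain positional slicing, which avoids passing a DataFrame through NumPy. This is a minimal sketch and not part of the answer above: the helper name `split_at` is hypothetical, and the motivation is an assumption worth checking against your installed versions, namely that `np.split` on a DataFrame may emit a FutureWarning on recent pandas releases.

```python
import numpy as np
import pandas as pd


def split_at(df, cut_points):
    """Split df row-wise at the given positions (hypothetical helper)."""
    bounds = [0, *cut_points, len(df)]
    # Pair each boundary with the next one and slice by position.
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list('ABCDE'))

# Same cut points as in the answer: 20% and 50% of the rows.
a, b, c = split_at(df, [int(.2 * len(df)), int(.5 * len(df))])
print(len(a), len(b), len(c))
```

The three parts hold rows 0-3, 4-9 and 10-19, matching the output shown in the answer.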
w3resource
w3resource.com › python-exercises › pandas › python-pandas-data-frame-exercise-38.php
Pandas: Divide a DataFrame in a given ratio - w3resource
Sample data:

Original DataFrame:
          0         1
0  0.316147 -0.767359
1 -0.813410 -2.522672
2  0.869615  1.194704
3 -0.892915 -0.055133
4 -0.341126  0.518266
5  1.857342  1.361229
6 -0.044353 -1.205002
7 -0.726346 -0.535147
8 -1.350726  0.563117
9  1.051666 -0.441533

70% of the said DataFrame:
          0         1
8 -1.350726  0.563117
2  0.869615  1.194704
5  1.857342  1.361229
6 -0.044353 -1.205002
3 -0.892915 -0.055133
1 -0.813410 -2.522672
0  0.316147 -0.767359

30% of the said DataFrame:
          0         1
4 -0.341126  0.518266
7 -0.726346 -0.535147
9  1.051666 -0.441533

...

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 2))
print("Original DataFrame:")
print(df)
part_70 = df.sample(frac=0.7,random_state=10)
part_30 = df.drop(part_70.index)
print("\n70% of the said DataFrame:")
print(part_70)
print("\n30% of the said DataFrame:")
print(part_30)
TutorialsPoint
tutorialspoint.com › divide-a-dataframe-in-a-ratio
Divide a DataFrame in a ratio
November 2, 2023 - Another way to divide a DataFrame in a given ratio is the DataFrame's sample() method. Two of its parameters matter here: frac, the fraction of rows to draw, and random_state, the seed for the random number generator.
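A minimal sketch of that call, for illustration only: the toy DataFrame and the names `part_70` and `part_30` are assumptions, and the complement is taken with `drop()`, which relies on the index being unique.

```python
import numpy as np
import pandas as pd

# Toy data with a default (unique) integer index.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['x', 'y'])

# frac is the share of rows to draw; random_state fixes the seed.
part_70 = df.sample(frac=0.7, random_state=10)
# Every row that was not drawn forms the remaining 30%.
part_30 = df.drop(part_70.index)

print(len(part_70), len(part_30))
```

Because the second part is defined as "everything not sampled", the two parts never overlap and together cover the whole DataFrame.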
GeeksforGeeks
geeksforgeeks.org › divide-a-pandas-dataframe-randomly-in-a-given-ratio
Divide a Pandas DataFrame randomly in a given ratio | GeeksforGeeks
October 25, 2021 - Pandas is one of those packages and makes importing and analyzing data much easier. Pandas dataframe.div() is used to find the floating division of the dataframe and other ... When part of a column in a DataFrame is important and needs to be handled separately, we can split the column as required.
GeeksforGeeks
geeksforgeeks.org › divide-a-dataframe-in-a-ratio
Divide a DataFrame in a ratio | GeeksforGeeks
August 17, 2020 - Dividing a Pandas DataFrame is very useful when splitting a given dataset into train and test data for training and testing purposes in fields such as Machine Learning and Artificial Intelligence. Let's see how to divide a pandas DataFrame randomly into given ratios.
Stack Overflow
stackoverflow.com › questions › 75158585 › split-data-in-ratio-by-using-column-value-in-python
pandas - Split data in ratio by using column value in Python - Stack Overflow
df2 = [[],[],[]]
for index,row in df1.iterrows():
    lst = spliter(df1['Column B'][index])
    for i in range(1,lst[0]+1):
        df2[0].append(df1['Column A'][index])
        df2[1].append(i)
        df2[2].append('Train')
    for i in range(1,lst[1]+1):
        df2[0].append(df1['Column A'][index])
        df2[1].append(i)
        df2[2].append('Test')
    for i in range(1,lst[2]+1):
        df2[0].append(df1['Column A'][index])
        df2[1].append(i)
        df2[2].append('Val')
df3 = pd.DataFrame(columns = ['Column A','Column B','Column C'])
df3['Column A'] = df2[0]
df3['Column B'] = df2[1]
df3['Column C'] = df2[2]
print(df3)
Untitled Publication
elisa.hashnode.dev › split-a-dataframe
Split a Pandas Dataframe in Python - Elisa's Blog - Hashnode
September 28, 2021 - split the dataframe into two parts: its first N rows (or a percentage of the number of rows) and the rest; split it randomly into two parts, by a number of rows or percentage. As example data, I will use my Spotify streaming history during some ...
Spark By {Examples}
sparkbyexamples.com › home › pandas › how to split pandas dataframe?
How to Split Pandas DataFrame? - Spark By {Examples}
December 6, 2024 - Pandas loc[] selects rows and columns by label, while iloc[] selects them by integer position. Either can pick out the required rows from a DataFrame. Here, I will use the iloc[] property to split the given DataFrame into two smaller DataFrames. Let’s split the DataFrame:

# Split the DataFrame
# Using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]
print(df1)
print("---------------------------")
print(df2)
Delft Stack
delftstack.com › home › howto › python pandas › split pandas dataframe
How to Split Pandas DataFrame | Delft Stack
February 2, 2024 - We can specify the rows to be included in each split with the iloc property. [:2,:] selects the rows up to, but not including, the row at position 2, together with all columns of the DataFrame. Hence, apprix_df.iloc[:2,:] selects the first two rows of apprix_df, those at positions 0 and 1.

import pandas as pd

apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MS", "PhD", "MS"],
    }
)
print("Apprix Team DataFrame:")
print(apprix_df, "\
w3resource
w3resource.com › python-exercises › pandas › python-pandas-data-frame-exercise-67.php
Pandas: Split a given DataFrame into two random subsets - w3resource
Write a Pandas program to randomly split a DataFrame into two subsets using a specified ratio and then verify the split sizes.
Write a Pandas program to partition a DataFrame into training and testing sets randomly and then reset their indices.
Write a Pandas program to randomly divide a DataFrame into two parts and then export each subset to separate CSV files.
Write a Pandas program to split a DataFrame into two random groups and then compute the mean of a numeric column in each group.
Top answer
1 of 11
359

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation.  The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.
In [1]: import pandas as pd
   ...: from numpy.random import randn

In [2]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : randn(8), 'D' : randn(8)})

In [3]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [4]: import numpy as np
In [5]: np.array_split(df, 3)
Out[5]: 
[     A    B         C         D
0  foo  one -0.174067 -0.608579
1  bar  one -0.860386 -1.210518
2  foo  two  0.614102  1.689837,
      A      B         C         D
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861,
      A      B         C         D
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468]
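The same near-equal chunking can be sketched without NumPy by slicing with `iloc`. This is an illustrative sketch only: the helper name `split_into_n` and the toy DataFrame are hypothetical, and the sizing rule (the first `len(df) % n` chunks get one extra row) is written to mirror the 3/3/2 result shown above.

```python
import pandas as pd


def split_into_n(df, n):
    """Split df into n row-wise chunks of near-equal size (hypothetical helper)."""
    base, extra = divmod(len(df), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks receive one additional row each.
        stop = start + base + (1 if i < extra else 0)
        chunks.append(df.iloc[start:stop])
        start = stop
    return chunks


df = pd.DataFrame({'A': list('abcdefgh'), 'B': range(8)})
parts = split_into_n(df, 3)
print([len(p) for p in parts])
```

Concatenating the chunks with `pd.concat(parts)` gives back the original rows in their original order.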
2 of 11
90

I wanted to do the same, and I had first problems with the split function, then problems with installing pandas 0.15.2, so I went back to my old version, and wrote a little function that works very well. I hope this can help!

# input - df: a DataFrame, chunk_size: the number of rows per chunk
# output - a list of DataFrames
# purpose - splits the DataFrame into smaller chunks
def split_dataframe(df, chunk_size=10000):
    chunks = []
    # Ceiling division, so an exact multiple of chunk_size yields no empty trailing chunk
    num_chunks = -(-len(df) // chunk_size)
    for i in range(num_chunks):
        # Slice rows by position
        chunks.append(df.iloc[i * chunk_size:(i + 1) * chunk_size])
    return chunks
GitHub
github.com › pandas-dev › pandas › issues › 57934
ENH: Add split method to DataFrame for flexible row-based partitioning · Issue #57934 · pandas-dev/pandas
March 20, 2024 - One approach could involve splitting a DataFrame into smaller chunks using numpy.array_split. This function can divide an array or DataFrame into a specified number of parts and handle any remainders by evenly distributing them across splits.
Author   gclopton
Top answer
1 of 4
23

np.array_split

If you want to generalise to n splits, np.array_split is your friend (it works with DataFrames well).

fractions = np.array([0.6, 0.2, 0.2])
# shuffle your input
df = df.sample(frac=1) 
# split into 3 parts
train, val, test = np.array_split(
    df, (fractions[:-1].cumsum() * len(df)).astype(int))

train_test_split

A more long-winded solution using scikit-learn's train_test_split for stratified splitting.

from sklearn.model_selection import train_test_split

y = df.pop('diagnosis').to_frame()
X = df

X_train, X_test, y_train, y_test = train_test_split(
        X, y,stratify=y, test_size=0.4)

X_test, X_val, y_test, y_val = train_test_split(
        X_test, y_test, stratify=y_test, test_size=0.5)

Where X is a DataFrame of your features, and y is a single-columned DataFrame of your labels.
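For comparison, a stratified 60/20/20 split can also be sketched with pandas alone. This is an assumption-laden illustration rather than the answer's method: the toy DataFrame and its `label` column are hypothetical, it relies on `DataFrameGroupBy.sample` (available in pandas 1.1 and later), and `drop()` assumes a unique index.

```python
import pandas as pd

# Hypothetical data: 100 rows with a 75/15/10 label distribution.
df = pd.DataFrame({
    'A': list(range(100)),
    'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10,
})

# Draw 60% of each label group for training.
train = df.groupby('label').sample(frac=0.6, random_state=0)
rest = df.drop(train.index)
# Split the remaining 40% in half within each label group.
val = rest.groupby('label').sample(frac=0.5, random_state=0)
test = rest.drop(val.index)

print(len(train), len(val), len(test))
```

Sampling inside each group is what keeps the label proportions the same in all three parts; group sizes that do not divide evenly will be rounded, so the proportions are then only approximate.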

2 of 4
10

Here is a Python function that splits a Pandas dataframe into train, validation, and test dataframes with stratified sampling. It performs this split by calling scikit-learn's function train_test_split() twice.

import pandas as pd
from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                         frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                         random_state=None):
    '''
    Splits a Pandas dataframe into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.

    Parameters
    ----------
    df_input : Pandas dataframe
        Input dataframe to be split.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomState instance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    '''

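    # Compare with a tolerance: exact float equality can fail for valid inputs such as 0.7, 0.2, 0.1.
    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' % \
                         (frac_train, frac_val, frac_test))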

    if stratify_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_colname))

    X = df_input # Contains all columns.
    y = df_input[[stratify_colname]] # Dataframe of just the column on which to stratify.

    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)

    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test

Below is a complete working example.

Consider a dataset that has a label upon which you want to perform the stratification. This label has its own distribution in the original dataset, say 75% foo, 15% bar and 10% baz. Now let's split the dataset into train, validation, and test subsets using a 60/20/20 ratio, where each split retains the same distribution of the labels.

Here is the example dataset:

df = pd.DataFrame( { 'A': list(range(0, 100)),
                     'B': list(range(100, 0, -1)),
                     'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10 } )

df.head()
#    A    B label
# 0  0  100   foo
# 1  1   99   foo
# 2  2   98   foo
# 3  3   97   foo
# 4  4   96   foo

df.shape
# (100, 3)

df.label.value_counts()
# foo    75
# bar    15
# baz    10
# Name: label, dtype: int64

Now, let's call the split_stratified_into_train_val_test() function from above to get train, validation, and test dataframes following a 60/20/20 ratio.

df_train, df_val, df_test = \
    split_stratified_into_train_val_test(df, stratify_colname='label', frac_train=0.60, frac_val=0.20, frac_test=0.20)

The three dataframes df_train, df_val, and df_test contain all the original rows but their sizes will follow the above ratio.

df_train.shape
#(60, 3)

df_val.shape
#(20, 3)

df_test.shape
#(20, 3)

Further, each of the three splits will have the same distribution of the label, namely 75% foo, 15% bar and 10% baz.

df_train.label.value_counts()
# foo    45
# bar     9
# baz     6
# Name: label, dtype: int64

df_val.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64

df_test.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64
CopyProgramming
copyprogramming.com › howto › how-to-split-dataframe-randomly-into-given-ratio-according-id
Python: A guide on randomly dividing a dataframe into desired ratios based on unique identifiers
August 8, 2023 - Is there a way to randomly split the dataframe based on the "id" column, ensuring a 70/30 ratio? Note that even though the value 7 occurs 3 times in the "id" column, it should count only once (as 1 of the 10 unique ids) when determining the ratio. Is it possible to split data into three sets (train, validation and test)? Unfortunately, it is not beneficial in this scenario.

import pandas as pd
d = {'id': [1,2,3,3,4,5,6,7,7,7,8,9,10,10], 'col2': [3,4,5,7,8,9,1,5,9,10,11,4,1,7]}
df = pd.DataFrame(data=d)
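One way to approach the question is to sample the unique ids rather than the rows, then assign every row to the side its id landed on. The following is a sketch under that assumption, not the page's own solution; the names `unique_ids`, `train_ids` and `in_train` are illustrative.

```python
import pandas as pd

d = {'id': [1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 9, 10, 10],
     'col2': [3, 4, 5, 7, 8, 9, 1, 5, 9, 10, 11, 4, 1, 7]}
df = pd.DataFrame(data=d)

# Each id counts once, regardless of how many rows carry it.
unique_ids = df['id'].drop_duplicates()
train_ids = unique_ids.sample(frac=0.7, random_state=0)

# Rows follow their id, so no id is shared between the two parts.
in_train = df['id'].isin(train_ids)
train, test = df[in_train], df[~in_train]

print(train['id'].nunique(), test['id'].nunique())
```

The 70/30 ratio then holds for the ids, not necessarily for the row counts, since ids with several rows move together.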
Stack Overflow
stackoverflow.com › questions › 71127140 › is-there-a-faster-or-better-way-to-segregate-dataset-into-80-20-ratio-in-python
pandas - Is there a faster or better way to segregate dataset into 80 20 ratio in python? - Stack Overflow
X.shape #output is => (2555904, 1024, 2) X[0] #Output is => array([[ 0.0420274 , 0.23476323], [-0.2728826 , 0.40513492], [-0.26707262, 0.22749889], ..., [-0.7055947 , -0.28693035], [-0.41157...
Readthedocs
librecommender.readthedocs.io › en › v1.2.1 › api › data › split.html
Split - Lib 1.2.1 Recommender - LibRecommender
Split the data randomly. ... multi_ratios (list of float, tuple of (float,) or None, default: None) – Ratios for splitting data in multiple parts.