pandas dataframe split by ratio

How to split a DataFrame in pandas in predefined percentages?

stackoverflow.com › questions › 43777243 › how-to-split-a-dataframe-in-pandas-in-predefined-percentages

Use numpy.split:

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])

Sample:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
#print (df)

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print (a)
          A         B         C         D         E
0  0.543405  0.278369  0.424518  0.844776  0.004719
1  0.121569  0.670749  0.825853  0.136707  0.575093
2  0.891322  0.209202  0.185328  0.108377  0.219697
3  0.978624  0.811683  0.171941  0.816225  0.274074

print (b)
          A         B         C         D         E
4  0.431704  0.940030  0.817649  0.336112  0.175410
5  0.372832  0.005689  0.252426  0.795663  0.015255
6  0.598843  0.603805  0.105148  0.381943  0.036476
7  0.890412  0.980921  0.059942  0.890546  0.576901
8  0.742480  0.630184  0.581842  0.020439  0.210027
9  0.544685  0.769115  0.250695  0.285896  0.852395

print (c)
           A         B         C         D         E
10  0.975006  0.884853  0.359508  0.598859  0.354796
11  0.340190  0.178081  0.237694  0.044862  0.505431
12  0.376252  0.592805  0.629942  0.142600  0.933841
13  0.946380  0.602297  0.387766  0.363188  0.204345
14  0.276765  0.246536  0.173608  0.966610  0.957013
15  0.597974  0.731301  0.340385  0.092056  0.463498
16  0.508699  0.088460  0.528035  0.992158  0.395036
17  0.335596  0.805451  0.754349  0.313066  0.634037
18  0.540405  0.296794  0.110788  0.312640  0.456979
19  0.658940  0.254258  0.641101  0.200124  0.657625

Answer from jezrael on Stack Overflow
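
In newer pandas releases, calling np.split directly on a DataFrame may emit a FutureWarning (it goes through DataFrame.swapaxes, which pandas has deprecated), so it is worth checking against your installed version. The sketch below is an alternative that is not part of the answer above: it computes the same cut points and slices with iloc. The helper name split_by_fractions is hypothetical.

```python
import numpy as np
import pandas as pd

def split_by_fractions(df, fractions):
    """Split df row-wise, in order, into len(fractions) + 1 parts.

    `fractions` are cumulative cut points in (0, 1): [0.2, 0.5] yields
    the first 20%, the next 30%, and the remaining 50% of the rows.
    """
    n = len(df)
    cuts = [0] + [int(f * n) for f in fractions] + [n]
    return [df.iloc[start:stop] for start, stop in zip(cuts[:-1], cuts[1:])]

rng = np.random.default_rng(100)
df = pd.DataFrame(rng.random((20, 5)), columns=list("ABCDE"))

a, b, c = split_by_fractions(df, [0.2, 0.5])
print(len(a), len(b), len(c))  # 4 6 10
```

The rows keep their original order, so shuffle first (for example with df.sample(frac=1)) if a random split is needed.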

w3resource

w3resource.com › python-exercises › pandas › python-pandas-data-frame-exercise-38.php

Pandas: Divide a DataFrame in a given ratio - w3resource

Sample data:

Original DataFrame:
          0         1
0  0.316147 -0.767359
1 -0.813410 -2.522672
2  0.869615  1.194704
3 -0.892915 -0.055133
4 -0.341126  0.518266
5  1.857342  1.361229
6 -0.044353 -1.205002
7 -0.726346 -0.535147
8 -1.350726  0.563117
9  1.051666 -0.441533

70% of the said DataFrame:
          0         1
8 -1.350726  0.563117
2  0.869615  1.194704
5  1.857342  1.361229
6 -0.044353 -1.205002
3 -0.892915 -0.055133
1 -0.813410 -2.522672
0  0.316147 -0.767359

30% of the said DataFrame:
          0         1
4 -0.341126  0.518266
7 -0.726346 -0.535147
9  1.051666 -0.441533

...

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 2))
print("Original DataFrame:")
print(df)
part_70 = df.sample(frac=0.7,random_state=10)
part_30 = df.drop(part_70.index)
print("\n70% of the said DataFrame:")
print(part_70)
print("\n30% of the said DataFrame:")
print(part_30)

TutorialsPoint

tutorialspoint.com › divide-a-dataframe-in-a-ratio

Divide a DataFrame in a ratio

November 2, 2023 - Another way to divide a DataFrame in a given ratio is to call the sample() method on the DataFrame. It takes two parameters here: frac, which defines the fraction of rows to draw, and random_state, which takes the seed value for the random number generator. The below is the syntax.
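
A minimal sketch of the sample() approach, assuming a small hypothetical DataFrame with columns x and y. The complement is taken with drop(), which assumes the index labels are unique.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["x", "y"])

# frac: fraction of rows to draw; random_state: seed for reproducibility.
part_70 = df.sample(frac=0.7, random_state=10)
# Everything not drawn above; relies on a unique index.
part_30 = df.drop(part_70.index)

print(len(part_70), len(part_30))  # 7 3
```

Calling sample() again with the same random_state returns the same rows, which makes the split repeatable across runs.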

Stack Overflow

stackoverflow.com › questions › 43777243 › how-to-split-a-dataframe-in-pandas-in-predefined-percentages

python 3.x - How to split a DataFrame in pandas in predefined percentages? - Stack Overflow

Top answer

2 of 3

# Create a DataFrame with 70% of the rows of the original DataFrame
part_1 = df.sample(frac=0.7)
# Create a DataFrame with the remaining 30% of the rows
part_2 = df.drop(part_1.index)

GeeksforGeeks

geeksforgeeks.org › divide-a-pandas-dataframe-randomly-in-a-given-ratio

Divide a Pandas DataFrame randomly in a given ratio | GeeksforGeeks

October 25, 2021 - Pandas is one of those packages and makes importing and analyzing data much easier. Pandas dataframe.div() is used to find the floating division of the dataframe and other ... When part of a column in a DataFrame is important and needs to be separated out, we can split the column as required.

GeeksforGeeks

geeksforgeeks.org › divide-a-dataframe-in-a-ratio

Divide a DataFrame in a ratio | GeeksforGeeks

August 17, 2020 - Dividing a pandas DataFrame is very useful when splitting a given dataset into train and test data for training and testing purposes in machine learning, artificial intelligence, etc. Let's see how to divide a pandas DataFrame randomly into given ratios.

Stack Overflow

stackoverflow.com › questions › 68279596 › split-pandas-dataframe-by-label-with-ratio

python - split pandas dataframe by label with ratio - Stack Overflow

Top answer

1 of 2

Try this:

train = df.groupby('label').sample(frac=.8)
test = df.loc[df.index.difference(train.index)]
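
A small sketch of the same groupby-then-sample idea on hypothetical data, to show that each label contributes the same fraction to the training set. The column names and values are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "value": range(10),
    "label": ["spam"] * 5 + ["not spam"] * 5,
})

# Draw 80% of the rows within each label group.
train = df.groupby("label").sample(frac=0.8, random_state=0)
# The remaining rows form the test set (assumes a unique index).
test = df.loc[df.index.difference(train.index)]

print(train["label"].value_counts().to_dict())
print(test["label"].value_counts().to_dict())
```

With five rows per label, each group contributes four rows to train and one to test.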

2 of 2

You can use DataFrame.sample() for this:

training_data_ratio = 0.8
train_spam = spam.sample(frac=training_data_ratio, random_state=0)
test_spam = spam.drop(train_spam.index)

And, similarly for the non spam data.

In addition, if you need to check how many entries are spam and how many are not, you can use value_counts:

>>> df.label.value_counts()
spam        3
not spam    2
Name: label, dtype: int64

Stack Overflow

stackoverflow.com › questions › 75158585 › split-data-in-ratio-by-using-column-value-in-python

pandas - Split data in ratio by using column value in Python - Stack Overflow

df2 = [[],[],[]]
for index,row in df1.iterrows():
    lst = spliter(df1['Column B'][index])
    for i in range(1,lst[0]+1):
        df2[0].append(df1['Column A'][index])
        df2[1].append(i)
        df2[2].append('Train')
    for i in range(1,lst[1]+1):
        df2[0].append(df1['Column A'][index])
        df2[1].append(i)
        df2[2].append('Test')
    for i in range(1,lst[2]+1):
        df2[0].append(df1['Column A'][index])
        df2[1].append(i)
        df2[2].append('Val')
df3 = pd.DataFrame(columns = ['Column A','Column B','Column C'])
df3['Column A'] = df2[0]
df3['Column B'] = df2[1]
df3['Column C'] = df2[2]
print(df3)

Untitled Publication

elisa.hashnode.dev › split-a-dataframe

Split a Pandas Dataframe in Python - Elisa's Blog - Hashnode

September 28, 2021 - split the dataframe into two parts: its first N rows (or a percentage of the number of rows) and the rest; split it randomly into two parts, by a number of rows or percentage. As example data, I will use my Spotify streaming history during some ...

Spark By {Examples}

sparkbyexamples.com › home › pandas › how to split pandas dataframe?

How to Split Pandas DataFrame? - Spark By {Examples}

December 6, 2024 - Pandas loc[] is another property that is used to operate on the column and row labels. Using this property we can select the required portion based on rows from the DataFrame. Here, I will use the iloc[] property to split the given DataFrame into two smaller DataFrames. Let’s split the DataFrame:

# Split the DataFrame
# Using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]
print(df1)
print("---------------------------")
print(df2)

Stack Overflow

stackoverflow.com › questions › 70520218 › how-to-split-dataframe-randomly-into-given-ratio-according-id

python - How to split dataframe randomly into given ratio according id - Stack Overflow

Top answer

1 of 1

Hopefully the code below will help. Here len_per is 30 percent of the total number of unique ids:

import pandas as pd
import random
d = {'id': [1,2,3,3,4,5,6,7,7,7,8,9,10,10], 'col2': [3,4,5,7,8,9,1,5,9,10,11,4,1,7]}

df = pd.DataFrame(data=d)
unique_ids = sorted(set(df['id']))
len_per = int(len(unique_ids) / 100 * 30)
# random.sample needs a sequence; passing a set raises TypeError on Python 3.11+
ids = random.sample(unique_ids, len_per)

df1_30 = df[df["id"].isin(ids)]
df1_70 = df[~df["id"].isin(ids)]

Delft Stack

delftstack.com › home › howto › python pandas › split pandas dataframe

How to Split Pandas DataFrame | Delft Stack

February 2, 2024 - We can specify the rows to be included in each split with the iloc property. [:2,:] selects all columns and the rows up to, but not including, the row at position 2. Hence, apprix_df.iloc[:2,:] selects the first two rows of the DataFrame apprix_df, those with index 0 and 1.

import pandas as pd

apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MS", "PhD", "MS"],
    }
)
print("Apprix Team DataFrame:")
print(apprix_df, "\

w3resource

w3resource.com › python-exercises › pandas › python-pandas-data-frame-exercise-67.php

Pandas: Split a given DataFrame into two random subsets - w3resource

Write a Pandas program to randomly split a DataFrame into two subsets using a specified ratio and then verify the split sizes.
Write a Pandas program to partition a DataFrame into training and testing sets randomly and then reset their indices.
Write a Pandas program to randomly divide a DataFrame into two parts and then export each subset to separate CSV files.
Write a Pandas program to split a DataFrame into two random groups and then compute the mean of a numeric column in each group.

Stack Overflow

stackoverflow.com › questions › 17315737 › split-a-large-pandas-dataframe

python - Split a large pandas dataframe - Stack Overflow

Top answer

1 of 11

359

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation.  The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.

In [1]: import pandas as pd
   ...: from numpy.random import randn

In [2]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : randn(8), 'D' : randn(8)})

In [3]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [4]: import numpy as np
In [5]: np.array_split(df, 3)
Out[5]: 
[     A    B         C         D
0  foo  one -0.174067 -0.608579
1  bar  one -0.860386 -1.210518
2  foo  two  0.614102  1.689837,
      A      B         C         D
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861,
      A      B         C         D
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468]
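
np.array_split also works on a plain array of row positions, which avoids passing the DataFrame itself through NumPy. The sketch below is an assumption-labeled variant of the answer above, using a hypothetical eight-row DataFrame; the resulting positions are then applied with iloc.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list("abcdefgh"), "B": range(8)})

# Split the row positions rather than the DataFrame itself.
position_chunks = np.array_split(np.arange(len(df)), 3)
chunks = [df.iloc[pos] for pos in position_chunks]

print([len(chunk) for chunk in chunks])  # [3, 3, 2]
```

Eight rows do not divide evenly by three, so the earlier chunks receive the extra rows.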

2 of 11

I wanted to do the same, and I had first problems with the split function, then problems with installing pandas 0.15.2, so I went back to my old version, and wrote a little function that works very well. I hope this can help!

# input - df: a DataFrame, chunk_size: the number of rows per chunk
# output - a list of DataFrames
# purpose - splits the DataFrame into smaller chunks
def split_dataframe(df, chunk_size=10000):
    chunks = list()
    # Ceiling division, so no empty trailing chunk is produced when
    # len(df) is an exact multiple of chunk_size
    num_chunks = (len(df) + chunk_size - 1) // chunk_size
    for i in range(num_chunks):
        chunks.append(df.iloc[i*chunk_size:(i+1)*chunk_size])
    return chunks

GitHub

github.com › pandas-dev › pandas › issues › 57934

ENH: Add split method to DataFrame for flexible row-based partitioning · Issue #57934 · pandas-dev/pandas

March 20, 2024 - One approach could involve splitting a DataFrame into smaller chunks using numpy.array_split. This function can divide an array or DataFrame into a specified number of parts and handle any remainders by evenly distributing them across splits.

Author gclopton

Stack Overflow

stackoverflow.com › questions › 24147278 › how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas › 24151789

python - How do I create test and train samples from one dataframe with pandas? - Stack Overflow

Top answer

1 of 16

991

Scikit Learn's train_test_split is a good one. It will split both numpy arrays and dataframes.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

2 of 16

488

I would just use numpy's random.rand to build a boolean mask:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to see this has worked:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
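
The mask draws each row independently, so the split is only approximately 80/20 (79/21 in the session above). When exact sizes matter, one option is to permute the row positions and cut at a fixed point. This is a sketch under that assumption, not part of the answer above; the seeds are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(0).standard_normal((100, 2)))

rng = np.random.default_rng(42)
order = rng.permutation(len(df))   # shuffled row positions
n_train = int(0.8 * len(df))       # exactly 80 of 100 rows

train = df.iloc[order[:n_train]]
test = df.iloc[order[n_train:]]

print(len(train), len(test))  # 80 20
```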

Stack Overflow

stackoverflow.com › questions › 50781562 › stratified-splitting-of-pandas-dataframe-into-training-validation-and-test-set

python - Stratified splitting of pandas dataframe into training, validation and test set - Stack Overflow

Top answer

1 of 4

`np.array_split`

If you want to generalise to n splits, np.array_split is your friend (it works with DataFrames well).

fractions = np.array([0.6, 0.2, 0.2])
# shuffle your input
df = df.sample(frac=1) 
# split into 3 parts
train, val, test = np.array_split(
    df, (fractions[:-1].cumsum() * len(df)).astype(int))

`train_test_split`

A windy solution using train_test_split for stratified splitting.

from sklearn.model_selection import train_test_split

y = df.pop('diagnosis').to_frame()
X = df

X_train, X_test, y_train, y_test = train_test_split(
        X, y,stratify=y, test_size=0.4)

X_test, X_val, y_test, y_val = train_test_split(
        X_test, y_test, stratify=y_test, test_size=0.5)

Where X is a DataFrame of your features, and y is a single-columned DataFrame of your labels.

2 of 4

Here is a Python function that splits a Pandas dataframe into train, validation, and test dataframes with stratified sampling. It performs this split by calling scikit-learn's function train_test_split() twice.

import pandas as pd
from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                         frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                         random_state=None):
    '''
    Splits a Pandas dataframe into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.

    Parameters
    ----------
    df_input : Pandas dataframe
        Input dataframe to be split.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomState instance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    '''

    # Compare with a tolerance: sums such as 0.7 + 0.2 + 0.1 are not exactly 1.0
    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))

    if stratify_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_colname))

    X = df_input # Contains all columns.
    y = df_input[[stratify_colname]] # Dataframe of just the column on which to stratify.

    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)

    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test

Below is a complete working example.

Consider a dataset that has a label upon which you want to perform the stratification. This label has its own distribution in the original dataset, say 75% foo, 15% bar and 10% baz. Now let's split the dataset into train, validation, and test subsets using a 60/20/20 ratio, where each split retains the same distribution of the labels.

Here is the example dataset:

df = pd.DataFrame( { 'A': list(range(0, 100)),
                     'B': list(range(100, 0, -1)),
                     'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10 } )

df.head()
#    A    B label
# 0  0  100   foo
# 1  1   99   foo
# 2  2   98   foo
# 3  3   97   foo
# 4  4   96   foo

df.shape
# (100, 3)

df.label.value_counts()
# foo    75
# bar    15
# baz    10
# Name: label, dtype: int64

Now, let's call the split_stratified_into_train_val_test() function from above to get train, validation, and test dataframes following a 60/20/20 ratio.

df_train, df_val, df_test = \
    split_stratified_into_train_val_test(df, stratify_colname='label', frac_train=0.60, frac_val=0.20, frac_test=0.20)

The three dataframes df_train, df_val, and df_test contain all the original rows but their sizes will follow the above ratio.

df_train.shape
#(60, 3)

df_val.shape
#(20, 3)

df_test.shape
#(20, 3)

Further, each of the three splits will have the same distribution of the label, namely 75% foo, 15% bar and 10% baz.

df_train.label.value_counts()
# foo    45
# bar     9
# baz     6
# Name: label, dtype: int64

df_val.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64

df_test.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64
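
For comparison, a pandas-only sketch of a stratified three-way split, for cases where scikit-learn is not available. This is not the answer's method: the helper name stratified_three_way is hypothetical, and it simply shuffles each label group and cuts it by position, sending any rounding remainder to the test set.

```python
import pandas as pd

def stratified_three_way(df, label_col, frac_train, frac_val, seed=0):
    """Hypothetical helper: shuffle each label group, then cut it by position."""
    train, val, test = [], [], []
    for _, group in df.groupby(label_col):
        shuffled = group.sample(frac=1, random_state=seed)
        n_train = round(frac_train * len(shuffled))
        n_val = round(frac_val * len(shuffled))
        train.append(shuffled.iloc[:n_train])
        val.append(shuffled.iloc[n_train:n_train + n_val])
        test.append(shuffled.iloc[n_train + n_val:])  # remainder goes to test
    return pd.concat(train), pd.concat(val), pd.concat(test)

df = pd.DataFrame({
    "A": range(100),
    "label": ["foo"] * 75 + ["bar"] * 15 + ["baz"] * 10,
})

df_train, df_val, df_test = stratified_three_way(df, "label", 0.6, 0.2)
print(len(df_train), len(df_val), len(df_test))  # 60 20 20
```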

CopyProgramming

copyprogramming.com › howto › how-to-split-dataframe-randomly-into-given-ratio-according-id

Python: A guide on randomly dividing a dataframe into desired ratios based on unique identifiers

August 8, 2023 - Is there a way to randomly split the dataframe based on the "id" column, ensuring a 70/30 ratio? Note that even though the value 7 occurs 3 times in the "id" column, it should count only as 1/10 when determining the ratio. Is it possible to split the data into three sets (train, validation and test)? Unfortunately, it is not beneficial in this scenario.

import pandas as pd
d = {'id': [1,2,3,3,4,5,6,7,7,7,8,9,10,10], 'col2': [3,4,5,7,8,9,1,5,9,10,11,4,1,7]}
df = pd.DataFrame(data=d)

Stack Overflow

stackoverflow.com › questions › 71127140 › is-there-a-faster-or-better-way-to-segregate-dataset-into-80-20-ratio-in-python

pandas - Is there a faster or better way to segregate dataset into 80 20 ratio in python? - Stack Overflow

X.shape #output is => (2555904, 1024, 2) X[0] #Output is => array([[ 0.0420274 , 0.23476323], [-0.2728826 , 0.40513492], [-0.26707262, 0.22749889], ..., [-0.7055947 , -0.28693035], [-0.41157...

Pandas

pandas.pydata.org › pandas-docs › stable › reference › api › pandas.DataFrame.divide.html

pandas.DataFrame.divide — pandas 2.1.4 documentation

Get Floating division of dataframe and other, element-wise (binary operator truediv).

Readthedocs

librecommender.readthedocs.io › en › v1.2.1 › api › data › split.html

Split - Lib 1.2.1 Recommender - LibRecommender

Split the data randomly. ... multi_ratios (list of float, tuple of (float,) or None, default: None) – Ratios for splitting data in multiple parts.