The error is that randint can return repeated indices. You can test this by printing len(set(ind)): you will see that it is smaller than 5000.
To use the same idea, simply replace the first line with
ind = np.random.choice(range(input_matrix.shape[0]), size=(5000,), replace=False)
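A minimal sketch of the difference (assuming the matrix has 46928 rows, as in the question):
import numpy as np
# randint samples with replacement, so indices repeat;
# choice with replace=False guarantees unique indices
ind = np.random.randint(46928, size=5000)
print(len(set(ind)))  # typically < 5000 because of repeats
ind = np.random.choice(range(46928), size=(5000,), replace=False)
print(len(set(ind)))  # always 5000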
That being said, the second line of your code is pretty slow because of the iteration over the list. It would be much faster to define the indices you want with a vector of booleans, which would allow you to use the negation operator ~.
choice = np.random.choice(range(matrix.shape[0]), size=(5000,), replace=False)
ind = np.zeros(matrix.shape[0], dtype=bool)
ind[choice] = True
rest = ~ind
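Both masks can then be used to index the matrix directly, for example:
first_part = matrix[ind]    # the 5000 randomly selected rows
second_part = matrix[rest]  # the remaining rows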
On my machine, this method is about as fast as scikit-learn's train_test_split, which makes me think that the two are doing essentially the same thing.
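For reference, a rough way to reproduce the comparison with timeit (the matrix shape here is an assumption):
import timeit
import numpy as np
from sklearn.model_selection import train_test_split

matrix = np.random.rand(46928, 784)

def mask_split():
    choice = np.random.choice(matrix.shape[0], size=(5000,), replace=False)
    ind = np.zeros(matrix.shape[0], dtype=bool)
    ind[choice] = True
    return matrix[ind], matrix[~ind]

def sklearn_split():
    return train_test_split(matrix, test_size=5000)

print(timeit.timeit(mask_split, number=10))
print(timeit.timeit(sklearn_split, number=10))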
One way is to use train_test_split from sklearn:
import numpy as np
from sklearn.model_selection import train_test_split
# creating matrix
input_matrix = np.arange(46928*28*28).reshape((46928,28,28))
print('Input shape: ', input_matrix.shape)
# split off a second matrix containing 5000 of the 46928 rows
second_size = 5000 / 46928
X1, X2 = train_test_split(input_matrix, test_size=second_size)
print('X1 shape: ', X1.shape)
print('X2 shape: ', X2.shape)
Result:
Input shape: (46928, 28, 28)
X1 shape: (41928, 28, 28)
X2 shape: (5000, 28, 28)
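Note that test_size also accepts an absolute number of samples, which avoids computing the ratio:
X1, X2 = train_test_split(input_matrix, test_size=5000)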
One way to ensure that every element of a is contained in exactly one chunk would be to create a random permutation of a first and then split it with np.split.
In order to get an array of splitting indices for np.split from chunk_size you can use np.cumsum.
Example
>>> import numpy as np
>>> np.random.seed(13)
>>> a = np.arange(20)
>>> b = np.random.permutation(a)
>>> b
array([11, 12, 0, 1, 8, 5, 7, 15, 14, 13,
3, 17, 9, 4, 2, 6, 19, 10, 16, 18])
>>> chunk_size = [10, 5, 3, 2]
>>> np.cumsum(chunk_size)
array([10, 15, 18, 20])
>>> np.split(b, np.cumsum(chunk_size))
[array([11, 12, 0, 1, 8, 5, 7, 15, 14, 13]),
array([ 3, 17, 9, 4, 2]), array([ 6, 19, 10]), array([16, 18]),
array([], dtype=int64)]
You could avoid the trailing empty array by omitting the last value in chunk_size, as it is implied by the size of a and the sum of the previous values:
>>> np.split(b, np.cumsum(chunk_size[:-1])) # [10, 5, 3] -- 2 is implied
[array([11, 12, 0, 1, 8, 5, 7, 15, 14, 13]),
array([ 3, 17, 9, 4, 2]), array([ 6, 19, 10]), array([16, 18])]
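The same idea applied to the matrix from the question would look like this (a sketch, assuming the (46928, 28, 28) input_matrix defined earlier):
perm = np.random.permutation(input_matrix.shape[0])
first, second = np.split(input_matrix[perm], [5000])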
Thanks to Divakar. Note that each chunk must be removed from the pool before drawing the next one; otherwise np.random.choice can place the same element in several chunks:
import numpy as np
np.random.seed(13)
pool = np.arange(0, 3286, 1)
chunk_size = [975, 708, 515, 343, 269, 228, 77, 57, 42, 33, 11, 9, 7, 4, 3, 1, 1, 1, 1, 1]
dist = []
for size in chunk_size:  # sample from the shrinking pool so chunks stay disjoint
    chunk = np.random.choice(pool, size, replace=False)
    dist.append(chunk)
    pool = np.setdiff1d(pool, chunk)
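A quick check that the chunks are disjoint and cover every element:
assert np.array_equal(np.sort(np.concatenate(dist)), np.arange(3286))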
If you want to split the data set once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices (remember to fix the random seed to make everything reproducible):
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]
or
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
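Keeping the indices is what lets you split a parallel array of labels the same way; a short sketch with hypothetical labels:
labels = numpy.random.randint(0, 2, size=x.shape[0])  # hypothetical labels
labels_training, labels_test = labels[training_idx], labels[test_idx]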
There are many other ways to repeatedly partition the same data set for cross validation. Many of these are available in the sklearn library (k-fold, leave-n-out, ...). sklearn also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that the training and test set contain the same proportion of positive and negative examples.
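For example, a minimal k-fold sketch with sklearn (the fold count here is arbitrary):
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for training_idx, test_idx in kf.split(x):
    training, test = x[training_idx, :], x[test_idx, :]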
There is another option that just entails using scikit-learn. As the scikit-learn documentation describes, you can just use the following instructions:
import numpy as np
from sklearn.model_selection import train_test_split
data, labels = np.arange(10).reshape((5, 2)), range(5)
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)
This way the labels stay in sync with the data you're splitting into training and test sets.
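If the splits should also preserve class proportions, train_test_split accepts a stratify argument; a sketch with balanced toy labels:
data = np.arange(20).reshape((10, 2))
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=0.20, random_state=42, stratify=labels)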