The problem is that randint samples with replacement, so some indices are repeated. You can check this by printing len(set(ind)): it will be smaller than 5000.

To use the same idea, simply replace the first line with

ind = np.random.choice(range(input_matrix.shape[0]), size=(5000,), replace=False)
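To see the difference between the two calls, here is a minimal sketch. The array size (20000 rows) and the seed are assumptions chosen only for illustration; they are not taken from the question.

```python
import numpy as np

np.random.seed(0)               # arbitrary seed, only for repeatability
n_rows, n_pick = 20000, 5000    # hypothetical sizes

# randint draws with replacement, so duplicates are expected
with_repeats = np.random.randint(0, n_rows, size=n_pick)
# choice with replace=False never returns the same index twice
without_repeats = np.random.choice(n_rows, size=n_pick, replace=False)

n_unique_with = len(set(with_repeats.tolist()))
n_unique_without = len(set(without_repeats.tolist()))
print(n_unique_with, n_unique_without)
```

The first count should come out below 5000, while the second is always exactly 5000.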

That being said, the second line of your code is pretty slow because of the iteration over the list. It would be much faster to define the indices you want with a vector of booleans, which would allow you to use the negation operator ~.

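# pick 5000 distinct row indices
choice = np.random.choice(range(input_matrix.shape[0]), size=(5000,), replace=False)
# boolean mask: True for the selected rows
ind = np.zeros(input_matrix.shape[0], dtype=bool)
ind[choice] = True
# the complement selects all remaining rows
rest = ~ind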
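As a rough illustration of how the two masks are then used, the sketch below applies them to a small stand-in array. The name `input_matrix`, its shape, and the sample size of 5 are assumptions made so the example runs quickly.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability
input_matrix = np.arange(40).reshape(20, 2)  # hypothetical stand-in data
n_pick = 5

choice = np.random.choice(input_matrix.shape[0], size=n_pick, replace=False)
ind = np.zeros(input_matrix.shape[0], dtype=bool)
ind[choice] = True
rest = ~ind

# boolean indexing returns the matching rows in their original order
selected = input_matrix[ind]
remaining = input_matrix[rest]
print(selected.shape, remaining.shape)
```

Note that indexing with a boolean mask keeps the rows in their original order, so `selected` is not ordered like `choice`. If the shuffled order matters, index with `choice` directly.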

On my machine, this method runs as fast as scikit-learn's train_test_split, which makes me think that the two are doing essentially the same thing.

Answer from Gianluca Micchi on Stack Overflow
🌐
NumPy
numpy.org › doc › 2.3 › reference › generated › numpy.partition.html
numpy.partition — NumPy v2.3 Manual
Consequently, partitioning along the last axis is faster and uses less space than partitioning along any other axis. The sort order for complex numbers is lexicographic. If both the real and imaginary parts are non-nan then the order is determined by the real parts except when they are equal, in which case the order is determined by the imaginary parts. The sort order of np.nan is bigger than np.inf. ...

>>> import numpy as np
>>> a = np.array([7, 1, 7, 7, 1, 5, 7, 2, 3, 2, 6, 2, 3, 0])
>>> p = np.partition(a, 4)
>>> p
array([0, 1, 2, 1, 2, 5, 2, 3, 3, 6, 7, 7, 7, 7]) # may vary
🌐
w3resource
w3resource.com › python-exercises › numpy › python-numpy-sorting-and-searching-exercise-7.php
NumPy: Partition an array in a specified position and move all the smaller elements to the left - w3resource
NumPy Sorting and Searching Exercises, Practice and Solution: Write a NumPy program to partition a given array in a specified position and move all the smaller elements values to the left of the partition, and the remaining values to the right, in arbitrary order (based on random choice).
Top answer
1 of 14
172

If you want to split the data set once in two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices (remember to fix the random seed to make everything reproducible):

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

or

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
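The reminder about fixing the random seed can be sketched as follows. This version uses NumPy's `default_rng` generator rather than the global `numpy.random` state used above; the helper name `split_once` and the seed value are assumptions for illustration.

```python
import numpy as np

def split_once(x, n_train, seed):
    """Return (training, test) rows of x using a seeded permutation."""
    rng = np.random.default_rng(seed)       # local generator, no global state
    indices = rng.permutation(x.shape[0])   # shuffled row indices
    return x[indices[:n_train]], x[indices[n_train:]]

x = np.arange(500).reshape(100, 5)          # stand-in dataset
train_a, test_a = split_once(x, 80, seed=0)
train_b, test_b = split_once(x, 80, seed=0)

# the same seed gives the same split on every call
print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))
```

Because the permutation only reorders indices, `x` itself is left untouched, unlike with `numpy.random.shuffle`, which shuffles in place.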

There are many other ways to repeatedly partition the same data set for cross validation. Many of those are available in the sklearn library (k-fold, leave-n-out, ...). sklearn also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test sets.
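To make the idea of repeated partitioning concrete, here is a hand-rolled k-fold sketch in plain NumPy. It is not sklearn's implementation; the function name `k_fold_indices` and its parameters are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled folds."""
    rng = np.random.default_rng(seed)
    # shuffle once, then cut the shuffled indices into k nearly equal folds
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each sample lands in exactly one test fold, so every row is used for testing once and for training in the remaining k-1 rounds.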

2 of 14
72

Another option is to use scikit-learn. As its documentation describes, you can use the following instructions:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way the labels stay in sync with the data you are splitting into training and test sets.
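The same pairing of data and labels can be reproduced in plain NumPy by indexing both arrays with one shared permutation. This is only a sketch of the idea, not what train_test_split does internally; the seed and the test size of one row are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)           # arbitrary seed
data = np.arange(10).reshape((5, 2))      # row i is [2*i, 2*i + 1]
labels = np.arange(5)                     # label i belongs to row i

perm = rng.permutation(len(data))         # one permutation shared by both arrays
n_test = 1                                # 20% of 5 rows
test_idx, train_idx = perm[:n_test], perm[n_test:]

data_train, data_test = data[train_idx], data[test_idx]
labels_train, labels_test = labels[train_idx], labels[test_idx]
print(data_test, labels_test)
```

Because both arrays are indexed with the same `train_idx` and `test_idx`, each row keeps its original label after the split.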

🌐
NumPy
numpy.org › devdocs › reference › generated › numpy.split.html
numpy.split — NumPy v2.5.dev0 Manual
Split an array into multiple sub-arrays as views into ary · Array to be divided into sub-arrays
🌐
Rlucas7
rlucas7.github.io › posts › 2025 › 09 › subset-sampling
Sampling A Random Partition -
September 5, 2025 - A small-ish example of using Stirling numbers to generate a random partition uniformly across all partitions with k non-empty subsets.

import numpy as np
from scipy.special import stirling2

# also note that the default is `exact=False`
# so instead let's
# make it exact=True
n = 10
N = np.array([[i] for i in range(1, n+1)])
# build out Stirling triangle
triangle = stirling2(N, np.array(list(range(1, n+1))), exact=True)
# looks good!
🌐
NumPy
numpy.org › doc › stable › reference › generated › numpy.argpartition.html
numpy.argpartition — NumPy v2.4 Manual
If provided with a sequence of k-th it will partition all of them into their sorted position at once.
🌐
TECH CHAMPION
tech-champion.com › home › posts › randomly partitioning a pandas dataframe: 5 efficient methods
Randomly Partitioning a Pandas DataFrame: 5 Efficient Methods
February 14, 2025 - Using a random seed ensures that the same partitioning is obtained each time the code is executed, facilitating consistent experimental results. This is particularly important for collaborative projects and reproducibility in research.
🌐
NumPy
numpy.org › doc › stable › reference › generated › numpy.ndarray.partition.html
numpy.ndarray.partition — NumPy v2.4 Manual
If provided with a sequence of kth it will partition all elements indexed by kth of them into their sorted position at once.
🌐
Medium
medium.com › @amit25173 › understanding-numpy-partition-with-examples-121dd10226c7
Understanding numpy.partition with Examples | by Amit Yadav | Medium
February 8, 2025 - No need to sort the whole thing — just partition and grab them. Works on different data types — It handles integers, floats, and even multi-dimensional arrays (which we’ll explore next). That’s your first step into numpy.partition—a powerful tool that saves computation time while keeping things organized just enough to be useful.
🌐
W3Schools
w3schools.com › python › numpy › numpy_random.asp
Introduction to Random Numbers in NumPy
In this tutorial we will be using pseudo random numbers. NumPy offers the random module to work with random numbers.
🌐
Plain English
python.plainenglish.io › shuffle-split-and-stack-numpy-arrays-83f82033bf17
Shuffle, Split, and Stack NumPy Arrays in Python | Python in Plain English
December 26, 2020 - How to randomly select, shuffle, split, and stack NumPy arrays for machine learning tasks without libraries such as sci-kit learn or Pandas.
🌐
NumPy
numpy.org › doc › stable › reference › generated › numpy.array_split.html
numpy.array_split — NumPy v2.4 Manual
Split an array into multiple sub-arrays · Please refer to the split documentation. The only difference between these functions is that array_split allows indices_or_sections to be an integer that does not equally divide the axis. For an array of length l that should be split into n sections, ...
🌐
GeeksforGeeks
geeksforgeeks.org › python › numpy-partition-in-python
numpy.partition() in Python - GeeksforGeeks
December 28, 2018 -

# Python program explaining
# partition() function
import numpy as geek

# input array
in_arr = geek.array([2, 0, 1, 5, 4, 9])
print("Input array : ", in_arr)

out_arr = geek.partition(in_arr, 3)
print("Output partitioned array : ", out_arr)
🌐
NumPy
numpy.org › doc › 2.0 › reference › generated › numpy.argpartition.html
numpy.argpartition — NumPy v2.0 Manual
Array of indices that partition a along the specified axis. If a is one-dimensional, a[index_array] yields a partitioned a.