The error is that randint can return repeated indices. You can test this by printing len(set(ind)): you will see that it is smaller than 5000.
To use the same idea, simply replace the first line with
ind = np.random.choice(range(input_matrix.shape[0]), size=(5000,), replace=False)
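A minimal sketch of the difference (assuming the matrix has 46928 rows, as in the question):
import numpy as np
# randint samples with replacement, so indices repeat;
# choice with replace=False guarantees unique indices
ind = np.random.randint(46928, size=5000)
print(len(set(ind)))  # typically < 5000 because of repeats
ind = np.random.choice(range(46928), size=(5000,), replace=False)
print(len(set(ind)))  # always 5000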
That being said, the second line of your code is pretty slow because of the iteration over the list. It would be much faster to define the indices you want with a vector of booleans, which would allow you to use the negation operator ~.
choice = np.random.choice(range(matrix.shape[0]), size=(5000,), replace=False)
ind = np.zeros(matrix.shape[0], dtype=bool)
ind[choice] = True
rest = ~ind
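Both masks can then be used to index the matrix directly, for example:
first_part = matrix[ind]    # the 5000 randomly selected rows
second_part = matrix[rest]  # the remaining rows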
On my machine, this method is about as fast as scikit-learn's train_test_split, which makes me think that the two are doing essentially the same thing.
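For reference, a rough way to reproduce the comparison with timeit (the matrix shape here is an assumption):
import timeit
import numpy as np
from sklearn.model_selection import train_test_split

matrix = np.random.rand(46928, 784)

def mask_split():
    choice = np.random.choice(matrix.shape[0], size=(5000,), replace=False)
    ind = np.zeros(matrix.shape[0], dtype=bool)
    ind[choice] = True
    return matrix[ind], matrix[~ind]

def sklearn_split():
    return train_test_split(matrix, test_size=5000)

print(timeit.timeit(mask_split, number=10))
print(timeit.timeit(sklearn_split, number=10))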
One way is to use train_test_split from sklearn:
import numpy as np
from sklearn.model_selection import train_test_split
# creating matrix
input_matrix = np.arange(46928*28*28).reshape((46928,28,28))
print('Input shape: ', input_matrix.shape)
# split off a second matrix containing 5000 of the 46928 rows
second_size = 5000 / 46928
X1, X2 = train_test_split(input_matrix, test_size=second_size)
print('X1 shape: ', X1.shape)
print('X2 shape: ', X2.shape)
Result:
Input shape: (46928, 28, 28)
X1 shape: (41928, 28, 28)
X2 shape: (5000, 28, 28)
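Note that test_size also accepts an absolute number of samples, which avoids computing the ratio:
X1, X2 = train_test_split(input_matrix, test_size=5000)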
One way to ensure that every element of a is contained in exactly one chunk would be to create a random permutation of a first and then split it with np.split.
In order to get an array of splitting indices for np.split from chunk_size you can use np.cumsum.
Example
>>> import numpy as np
>>> np.random.seed(13)
>>> a = np.arange(20)
>>> b = np.random.permutation(a)
>>> b
array([11, 12, 0, 1, 8, 5, 7, 15, 14, 13,
3, 17, 9, 4, 2, 6, 19, 10, 16, 18])
>>> chunk_size = [10, 5, 3, 2]
>>> np.cumsum(chunk_size)
array([10, 15, 18, 20])
>>> np.split(b, np.cumsum(chunk_size))
[array([11, 12, 0, 1, 8, 5, 7, 15, 14, 13]),
array([ 3, 17, 9, 4, 2]), array([ 6, 19, 10]), array([16, 18]),
array([], dtype=int64)]
You could avoid the trailing empty array by omitting the last value in chunk_size, as it is implied by the size of a and the sum of the previous values:
>>> np.split(b, np.cumsum(chunk_size[:-1])) # [10, 5, 3] -- 2 is implied
[array([11, 12, 0, 1, 8, 5, 7, 15, 14, 13]),
array([ 3, 17, 9, 4, 2]), array([ 6, 19, 10]), array([16, 18])]
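The same idea applied to the matrix from the question would look like this (a sketch, assuming the (46928, 28, 28) input_matrix defined earlier):
perm = np.random.permutation(input_matrix.shape[0])
first, second = np.split(input_matrix[perm], [5000])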
Thanks to Divakar. Note that each chunk must be removed from the pool before drawing the next one; otherwise np.random.choice can place the same element in several chunks:
import numpy as np
np.random.seed(13)
pool = np.arange(0, 3286, 1)
chunk_size = [975, 708, 515, 343, 269, 228, 77, 57, 42, 33, 11, 9, 7, 4, 3, 1, 1, 1, 1, 1]
dist = []
for size in chunk_size:  # sample from the shrinking pool so chunks stay disjoint
    chunk = np.random.choice(pool, size, replace=False)
    dist.append(chunk)
    pool = np.setdiff1d(pool, chunk)
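A quick check that the chunks are disjoint and cover every element:
assert np.array_equal(np.sort(np.concatenate(dist)), np.arange(3286))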
If you want to split the data set once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices (remember to fix the random seed to make everything reproducible):
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]
or
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
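Keeping the indices is what lets you split a parallel array of labels the same way; a short sketch with hypothetical labels:
labels = numpy.random.randint(0, 2, size=x.shape[0])  # hypothetical labels
labels_training, labels_test = labels[training_idx], labels[test_idx]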
There are many other ways to repeatedly partition the same data set for cross validation. Many of these are available in the sklearn library (k-fold, leave-n-out, ...). sklearn also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that the training and test set contain the same proportion of positive and negative examples.
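For example, a minimal k-fold sketch with sklearn (the fold count here is arbitrary):
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for training_idx, test_idx in kf.split(x):
    training, test = x[training_idx, :], x[test_idx, :]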
There is another option that just entails using scikit-learn. As the scikit-learn documentation describes, you can just use the following instructions:
import numpy as np
from sklearn.model_selection import train_test_split
data, labels = np.arange(10).reshape((5, 2)), range(5)
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)
This way the labels stay in sync with the data you're splitting into training and test sets.
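If the splits should also preserve class proportions, train_test_split accepts a stratify argument; a sketch with balanced toy labels:
data = np.arange(20).reshape((10, 2))
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=0.20, random_state=42, stratify=labels)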