You can replace this just for that column using replace:
df['workclass'].replace('?', np.nan)
or for the whole df:
df.replace('?', np.nan)
UPDATE
OK, I figured out your problem: by default, if you don't pass a separator character, read_csv uses a comma ',' as the separator.
Your data, and in particular one problematic line:
54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K
in fact uses a comma followed by a space as the separator, so when you passed na_values=["?"] it didn't match, because every value carries a leading space that is easy to miss.
If you change your line to this:
rawfile = pd.read_csv(filename, header=None, names=DataLabels, sep=r',\s', na_values=["?"])
then you should find that it all works:
27 54 NaN 180211 Some-college 10
(The answer above is from EdChum on Stack Overflow.)
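As an alternative to a regex separator, pandas' skipinitialspace parameter strips the space that follows each comma, so na_values can match the bare '?' token. A minimal sketch; the inline sample and column names below are hypothetical stand-ins for the real file and DataLabels list:

```python
import io
import pandas as pd

# Toy sample mimicking the adult dataset's ", "-separated layout.
csv_text = "54, ?, 180211, Some-college\n39, Private, 77516, Bachelors\n"
cols = ["age", "workclass", "fnlwgt", "education"]

# skipinitialspace=True removes the space after each comma, so the
# na_values match works without switching to the python regex engine.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=cols,
                 skipinitialspace=True, na_values=["?"])
print(df["workclass"].isna().sum())
```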
Use numpy.nan, as described in Numpy - Replace a number with NaN:
import numpy as np
df.applymap(lambda x: np.nan if x == '?' else x)  # in pandas >= 2.1, use df.map instead of df.applymap
DataFrame.fillna() or Series.fillna() will do this for you.
Example:
In [7]: df
Out[7]:
0 1
0 NaN NaN
1 -0.494375 0.570994
2 NaN NaN
3 1.876360 -0.229738
4 NaN NaN
In [8]: df.fillna(0)
Out[8]:
0 1
0 0.000000 0.000000
1 -0.494375 0.570994
2 0.000000 0.000000
3 1.876360 -0.229738
4 0.000000 0.000000
To fill the NaNs in only one column, select just that column.
In [12]: df[1] = df[1].fillna(0)
In [13]: df
Out[13]:
0 1
0 NaN 0.000000
1 -0.494375 0.570994
2 NaN 0.000000
3 1.876360 -0.229738
4 NaN 0.000000
Or you can use the built in column-specific functionality:
df = df.fillna({1: 0})
It is not guaranteed that the slicing returns a view or a copy. You can do
df['column'] = df['column'].fillna(value)
I've been working on learning Python and for something to code, I picked some VBA that I had.
In VBA:
If Cells(I, "C").Value <> "" And Cells(I, "B").Value = "" Then
    Cells(I, "B").Value = Cells(I, "C").Value
End If
It simply checks if colC is not Null and colB is Null, then replaces colB with the value from colC.
I can read in the csv file, and I was able to select and delete some rows I didn't want, but I can't seem to get the syntax right for this...
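One pandas equivalent of the VBA loop above is a boolean mask with .loc (or, more idiomatically, fillna with another column). A small sketch with an invented toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the spreadsheet: column B may be empty
# where column C holds a value.
df = pd.DataFrame({"B": [np.nan, "x", np.nan],
                   "C": ["c1", "c2", np.nan]})

# Equivalent of the VBA check: where B is null and C is not, copy C into B.
mask = df["B"].isna() & df["C"].notna()
df.loc[mask, "B"] = df.loc[mask, "C"]

# Or, in one line: df["B"] = df["B"].fillna(df["C"])
print(df["B"].tolist())
```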
Actually in later versions of pandas this will give a TypeError:
df.replace('-', None)
TypeError: If "to_replace" and "value" are both None then regex must be a mapping
You can do it by passing either a list or a dictionary:
In [11]: df.replace(['-'], [None])  # or .replace('-', {0: None})
Out[11]:
0
0 None
1 3
2 2
3 5
4 1
5 -5
6 -1
7 None
8 9
But I recommend using NaNs rather than None:
In [12]: df.replace('-', np.nan)
Out[12]:
0
0 NaN
1 3
2 2
3 5
4 1
5 -5
6 -1
7 NaN
8 9
I prefer the solution using replace with a dict because of its simplicity and elegance:
df.replace({'-': None})
You can also have more replacements:
df.replace({'-': None, 'None': None})
And even for larger replacements, it is always obvious and clear what is replaced by what - which is way harder for long lists, in my opinion.
dataframe
You can use pd.DataFrame.mask:
df.mask((df >= -200) & (df <= -100), inplace=True)
This method replaces elements identified by True values in a Boolean array with a specified value, defaulting to NaN if a value is not specified.
Equivalently, use pd.DataFrame.where with the reverse condition:
df.where((df < -200) | (df > -100), inplace=True)
series
As with many methods, Pandas helpfully includes versions which work with series rather than an entire dataframe. So, for a column df['A'], you can use pd.Series.mask with pd.Series.between:
df['A'].mask(df['A'].between(-200, -100), inplace=True)
For chaining, note that inplace defaults to False, so you can also use:
df['A'] = df['A'].mask(df['A'].between(-200, -100))
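To make the mask/between pattern above concrete, here is a small sketch on an invented column; note that between is inclusive of both endpoints by default:

```python
import numpy as np
import pandas as pd

# Toy data: only values inside the closed interval [-200, -100] are masked.
df = pd.DataFrame({"A": [-250, -150, -100, 0, 50]})
df["A"] = df["A"].mask(df["A"].between(-200, -100))
print(df["A"].isna().sum())  # -150 and -100 become NaN
```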
You can do it this way:
In [145]: df = pd.DataFrame(np.random.randint(-250, 50, (10, 3)), columns=list('abc'))
In [146]: df
Out[146]:
a b c
0 -188 -63 -228
1 -59 -70 -66
2 -110 39 -146
3 -67 -228 -232
4 -22 -180 -140
5 -191 -136 -188
6 -59 -30 -128
7 -201 -244 -195
8 -248 -30 -25
9 11 1 20
In [148]: df[(df>=-200) & (df<=-100)] = np.nan
In [149]: df
Out[149]:
a b c
0 NaN -63.0 -228.0
1 -59.0 -70.0 -66.0
2 NaN 39.0 NaN
3 -67.0 -228.0 -232.0
4 -22.0 NaN NaN
5 NaN NaN NaN
6 -59.0 -30.0 NaN
7 -201.0 -244.0 NaN
8 -248.0 -30.0 -25.0
9 11.0 1.0 20.0
Randomly replace values in a numpy array
import random
import numpy as np
import pandas as pd

# The dataset
data = pd.read_csv('iris.data')
mat = data.iloc[:, :4].to_numpy()  # .as_matrix() was removed in newer pandas
Set the number of values to replace. For example 20%:
# Edit: changed len(mat) for mat.size
prop = int(mat.size * 0.2)
Randomly choose indices of the numpy array:
i = [random.choice(range(mat.shape[0])) for _ in range(prop)]
j = [random.choice(range(mat.shape[1])) for _ in range(prop)]
Change values with NaN
mat[i, j] = np.nan
Dropout for any array dimension
Another way to do that with an array of more than 2 dimensions would be to use the numpy.put() function:
import numpy as np
import random
from sklearn import datasets
data = datasets.load_iris()['data']
def dropout(a, percent):
    # create a copy
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # indices to mask
    mask = random.sample(range(mat.size), prop)
    # replace with NaN
    np.put(mat, mask, [np.nan] * len(mask))
    return mat
This function returns a modified array:
modified = dropout(data, 0.2)
We can verify that the correct number of values have been modified:
np.sum(np.isnan(modified))/float(data.size)
[out]:
0.2
Depending on the data structure you are keeping the values there might be different solutions.
If you are using Numpy arrays, you can employ np.insert method which is referred here:
import numpy as np
a = np.array([(122.0, 1.0, -47.0), (123.0, 1.0, -47.0), (125.0, 1.0, -44.0)])
np.insert(a, 2, np.nan, axis=0)
array([[ 122., 1., -47.],
[ 123., 1., -47.],
[ nan, nan, nan],
[ 125., 1., -44.]])
If you are using Pandas you can use instance method replace on the objects of the DataFrames as referred here:
In [106]:
df.replace('N/A', np.nan)
Out[106]:
x y
0 10 12
1 50 11
2 18 NaN
3 32 13
4 47 15
5 20 NaN
In the code above, the first argument can be your arbitrary input which you want to change.
Hello, I'm currently attempting the Kaggle housing prices challenge: https://www.kaggle.com/c/house-prices-advanced-regression-techniques.
I have a concatenated table which combines the training and testing tables into one in order to handle all missing values at once.
combine_df = pd.concat([train, test], axis=0, sort=False)
combine_df.drop(['Id', 'SalePrice'], axis=1, inplace=True)
I then attempt to fill all NaN categorical values with the line below, where null_columns is a list of the columns whose NaN values I want to replace.
combine_df[null_columns] = combine_df[null_columns].fillna('0', inplace=True)
However, this line changes every value in those columns into NaN instead of replacing the NaN values with '0', as seen in the output below, which shows the number of NaN values in each column.
BsmtQual 2919 BsmtCond 2919 BsmtExposure 2919 BsmtFinType1 2919 BsmtFinType2 2919 GarageType 2919 GarageFinish 2919 GarageQual 2919 GarageCond 2919
I've tried using .replace, a lambda function, and also using .loc and all of them end up doing the same thing as the code above. What is going on with my code that causes this? I've also been unable to find anything regarding this on stack overflow. Any help would be greatly appreciated.
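For anyone hitting the same symptom: fillna(..., inplace=True) returns None, so assigning that return value back overwrites the columns. A minimal sketch with an invented toy frame and hypothetical column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, "x"], "b": ["y", np.nan]})

# fillna with inplace=True mutates its target and returns None, so
# assigning the result back would fill the columns with None/NaN.
result = df[["a", "b"]].fillna("0", inplace=True)
print(result)  # None

# Drop inplace and assign the returned frame instead:
df[["a", "b"]] = df[["a", "b"]].fillna("0")
print(df.isna().sum().sum())  # no NaNs remain
```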
There is no one size fits all. So you cannot assume that one technique will work the best for all the datasets.
That being said, the goal of imputing missing values is to ensure that the distribution of the column does not change after imputation. So if you have a feature that follows a left-skewed distribution, the distribution should not change much after imputation.
Following this logic, try multiple imputation techniques and see which one best retains the original distribution of the feature you are imputing.
Mean is suitable when you have a Gaussian distribution of continuous data. Mode is suitable when your column has categorical data and one category is clearly more likely to occur than the others. Median is better when your data has outliers, which can skew the mean. You can opt to remove rows with missing values if their number is very small compared to the total number of rows. There are other techniques that can be useful depending on the situation, such as training a model to predict missing values, MICE (for missing-at-random data), KNNImputer, and LOCF.
Alternatively, if you have a significant number of missing values, you can see how the results are different when you impute missing values and when you ignore rows with missing values.
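To make the mean/median/mode trade-off concrete, here is a small sketch on a toy series with an outlier (the values are invented for illustration): the mean fill is pulled far from the typical value, while the median and mode fills are not.

```python
import numpy as np
import pandas as pd

# Toy skewed data: one outlier (100.0) and one missing value.
s = pd.Series([1.0, 2.0, 2.0, 3.0, 100.0, np.nan])

mean_filled = s.fillna(s.mean())      # fills with 21.6, distorted by the outlier
median_filled = s.fillna(s.median())  # fills with 2.0
mode_filled = s.fillna(s.mode()[0])   # fills with 2.0

print(round(s.mean(), 1), s.median(), s.mode()[0])
```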