Some of your lines evidently don't contain valid float data; specifically, some lines have a text id which can't be converted to float.
When you try it in the interactive prompt you are only trying the first line, so the best way is to print the line where you get this error, and you will know which line is wrong, e.g.
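    #!/usr/bin/python
    from scipy import stats

    f = open('data2.txt', 'r').readlines()
    N = len(f) - 1  # as in the original script, the last line is not processed
    for i in range(0, N):
        w = f[i].split()
        l1 = w[1:8]
        l2 = w[8:15]
        try:
            list1 = [float(x) for x in l1]
            list2 = [float(x) for x in l2]
        except ValueError as e:
            # Report the offending line and skip it, so the t-test
            # is not run on values left over from the previous line
            print("error", e, "on line", i)
            continue
        result = stats.ttest_ind(list1, list2)
        print(result[1])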
Answer from Anurag Uniyal on Stack Overflow
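A variant of the same idea, offered only as a sketch and not as part of the original answer: instead of stopping at the first failure, collect every token that float() rejects together with its line index. The function name find_non_float_tokens, its start/stop defaults (chosen to mirror the w[1:15] slices above) and the sample lines are all assumptions made for illustration.

```python
def find_non_float_tokens(lines, start=1, stop=15):
    """Return (line_index, token) pairs for tokens that float() rejects."""
    bad = []
    for index, line in enumerate(lines):
        # Skip the leading id column, as the script above does with w[1:...]
        for token in line.split()[start:stop]:
            try:
                float(token)
            except ValueError:
                bad.append((index, token))
    return bad


# Hypothetical sample data; the second line hides a text id among the numbers
sample = [
    "id1 1.0 2.5 3.1",
    "id2 4.2 gene7 0.3",
    "id3 0.1 0.2 0.3",
]
print(find_non_float_tokens(sample))
```

With a real file you could pass open('data2.txt').readlines() instead of the sample list.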
My error was very simple: the text file containing the data had a whitespace character (so not visible) on the last line.
In the output of grep, I had "45 " (with a trailing space) instead of just "45".
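One way to spot this kind of invisible character, sketched here under the assumption that the data is read line by line: compare each line with its stripped form and print it with repr(), which puts quotes around the text so trailing spaces become visible. The helper name and the sample lines are made up for illustration.

```python
def lines_with_hidden_whitespace(lines):
    """Return (line_number, repr(text)) for lines with leading/trailing whitespace."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")  # the newline itself is expected, so ignore it
        if text != text.strip():
            suspects.append((number, repr(text)))
    return suspects


# Hypothetical file content; line 2 ends with a space
sample_lines = ["12\n", "45 \n", "7"]
print(lines_with_hidden_whitespace(sample_lines))
```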
Could not convert string to float
python - ValueError: count not convert string to float - Bioinformatics Stack Exchange
python - Error: could not convert string 'X' to float64 - Stack Overflow
python - Error could not convert string to float: '' - Stack Overflow
It would be useful to see the output of the following:
print (heterozygosity_df.columns)
This looks like a dtype issue within pandas. I suspect heterozygosity_df['chrI'] is a column in the dataframe that contains a mix of strings and floats. pandas has therefore set it as a "string" column, but you want to perform numerical operations on it. Thus the solution is simply:
print(heterozygosity_df.dtypes) # this should state 'chrI' is a "category" or "string"
heterozygosity_df['chrI'] = heterozygosity_df['chrI'].astype(float)
print(heterozygosity_df.dtypes)
To change several columns at once, the syntax is:
heterozygosity_df = heterozygosity_df.astype({'chrI':'float', 'egColumn':'category'})
I suspect there will be other errors, e.g. the header has been read as the first row of data. Again, for pandas to automatically assign a column a "string" type, there must be at least one string value within that column.
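To check that suspicion before calling astype(float), one option is to count the Python types actually stored in the column. This is only a sketch: the small dataframe below is invented, and a column read by pd.read_csv may come back entirely as str, in which case the count simply confirms that the numbers were read as text too.

```python
import pandas as pd

# Hypothetical column with a stray header-like string among the numbers
df = pd.DataFrame({"chrI": pd.Series([0.12, "0.30", "chrI", 0.45], dtype=object)})

# Count how many values of each Python type the column holds
type_counts = (
    df["chrI"].map(lambda value: type(value).__name__).value_counts().to_dict()
)
print(type_counts)
```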
From the comments: I see what's happening. The easy solution is ...
replacement = {
"chrI": 1,
"chrII": 2,
"chrIII": 3,
"chrIV": 4,
# ... continue for all chromosomes
}
heterozygosity_df['chr'] = heterozygosity_df['chr'].str.replace(replacement, regex=True)
From the comments: good, it works. This is what I would personally have done:
replacement = {
"chrI": 1,
"chrII": 2,
"chrIII": 3,
"chrIV": 4, # continue for all chromosomes
}
heterozygosity_df = pd.read_csv("file.tsv", sep="\t", header=None).set_axis(['chr', 'pos', 'het'], axis=1, copy=False)
heterozygosity_df['chr'] = heterozygosity_df['chr'].str.replace(replacement, regex=True).astype('int')
Normally the .str accessor should be in place, because 'chr' is an object column when it is imported. However, if it works, that's the only thing that counts.
Ah! I finally got it to work! I used your command but I had to change it up a little bit to work with my data. The final command I used is this:
replacement = {
"chrI": 1,
"chrII": 2,
"chrIII": 3,
"chrIV": 4,
"chrV": 5,
"chrVI": 6,
"chrVII": 7,
"chrVIII": 8,
"chrIX": 9,
"chrX": 10,
"chrXI": 11,
"chrXII": 12,
"chrXIII": 13,
"chrXIV": 14,
"chrXV": 15,
"chrXVI": 16,
"chrmt": 17
}
heterozygosity_df['chr'] = heterozygosity_df['chr'].replace(replacement, regex=False)
And with that I was able to generate plots showing all of the chromosomes! (I have to figure out how to change the axis labels from 1,2,3,4...10 back to the chromosome but that's a future problem). Thank you so much for all your help!!
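For the axis-label question, a possible approach (a sketch, not something tested on the original data) is to keep the mapping and invert it. The example uses Series.map, which looks up whole values in the dictionary, so "chrI" cannot match inside "chrII". The dataframe contents and the chr_num column name are assumptions made for illustration.

```python
import pandas as pd

# Subset of the mapping from the thread; extend it to all chromosomes
replacement = {"chrI": 1, "chrII": 2, "chrX": 10}

# Hypothetical data standing in for heterozygosity_df
heterozygosity_df = pd.DataFrame({"chr": ["chrI", "chrII", "chrX", "chrI"]})

# Whole-value lookup of each chromosome name in the dictionary
heterozygosity_df["chr_num"] = heterozygosity_df["chr"].map(replacement)

# Inverse mapping, e.g. for turning numeric axis ticks back into names
labels = {number: name for name, number in replacement.items()}
tick_labels = [labels[n] for n in sorted(labels)]

print(heterozygosity_df["chr_num"].tolist())
print(tick_labels)
```

With matplotlib, sorted(labels) and tick_labels could then be passed to ax.set_xticks and ax.set_xticklabels respectively.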
You have empty strings in your pd.Series, which cannot be readily converted to a float data type. What you can do is check for them and remove them. An example script is:
import pandas as pd
a=pd.DataFrame([['a','b','c'],['2.42','','3.285']]).T
a.columns=['names', 'nums']
a['nums']=a['nums'][a['nums']!=''].astype(float)
Note: if you try to run a['nums']=a['nums'].astype(float) before selecting non-empty strings the same error that you've mentioned will be thrown.
First use this line to obtain the current dtypes:
col_dtypes = dict([(k, v.name) for k, v in dict(df.dtypes).items()])
Like so:
xls3 = pd.read_csv('path/to/file')
col_dtypes = dict([(k, v.name) for k, v in dict(xls3.dtypes).items()])
print(col_dtypes)
Copy the value that is printed. It should be like this:
{'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64', ' Vesturland': 'object', ...}
Then, for any column whose datatype you know isn't object, change it to the required type ('int32', 'int64', 'float32' or 'float64'). For example, the datatypes might be detected as:
{'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64', ' Vesturland': 'object', ...}
If we know Vesturland is supposed to be float, then we can edit this to be:
col_dtypes = {
'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64',
' Vesturland': 'float64', ...
}
Now, with this snippet you can find the non-numeric values:
    def clean_non_numeric_values(series, col_type):
        """Return the series and the positions of values that cannot be cast to col_type."""
        illegal_value_pos = []
        for i in range(len(series)):
            try:
                # .iloc is positional, so this also works with a non-default index
                if col_type in ('int32', 'int64'):
                    int(series.iloc[i])
                elif col_type in ('float32', 'float64'):
                    float(series.iloc[i])
            except (TypeError, ValueError):
                illegal_value_pos.append(i)
                # series.iloc[i] = None  # We can set the illegal values to None
                # to remove them later using xls3.dropna()
        return series, illegal_value_pos

    # Now we will manually replace the dtype of the column Vesturland like so:
    col_dtypes = {
        'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64',
        ' Vesturland': 'float64'
    }
    for col in list(xls3.columns):
        # .get() skips columns that are not listed in col_dtypes
        if col_dtypes.get(col) in ['int32', 'int64', 'float32', 'float64']:
            series, illegal_value_pos = (
                clean_non_numeric_values(series=xls3[col], col_type=col_dtypes[col])
            )
            xls3[col] = series
            print(illegal_value_pos)
            if illegal_value_pos:
                illegal_rows = xls3.iloc[illegal_value_pos]
                # This will print all the illegal values.
                print(illegal_rows[col])
Now you can use this information to remove the non-numeric values from the dataframe.
Warning: Since this uses a for loop, it is slow but it will help you to remove the values you don't want.
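If the loop is too slow, a vectorized alternative can be sketched with pd.to_numeric(..., errors="coerce"), which turns unparseable values into NaN. This is a different technique from the loop above, not a drop-in replacement: it does not distinguish int from float targets. The function name find_non_numeric and the sample dataframe are assumptions for illustration.

```python
import pandas as pd


def find_non_numeric(df, columns):
    """Map each column name to the rows whose values pd.to_numeric cannot parse."""
    report = {}
    for col in columns:
        converted = pd.to_numeric(df[col], errors="coerce")
        # NaN after coercion but not missing before it: the value failed to parse
        bad = converted.isna() & df[col].notna()
        if bad.any():
            report[col] = df.loc[bad, col]
    return report


# Hypothetical data reusing the column names from the example above
xls3 = pd.DataFrame({
    "Karlar": [1.0, 2.0, 3.0],
    " Vesturland": ["4.5", "n/a", "6"],
})
report = find_non_numeric(xls3, ["Karlar", " Vesturland"])
print(report)
```

Each entry in report keeps the original index, so the offending rows can be inspected or dropped with xls3.drop(index=...).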