Use skiprows=[1] in read_csv to skip row 1; the conversion to float should then be automatic:
df = pd.read_csv('test.csv', sep=';', decimal=',', skiprows=[1])
output:
print(df)
   Speed        A
0    700 -72.5560
1    800 -58.9103
2    900 -73.1678
3   1000 -78.2272
print(df.dtypes)
Speed      int64
A        float64
dtype: object
Why your code did not work
When reading the "[rpm];[N.m]" line, the CSV parser determines that the columns are strings, not floats. The decimal specifier is therefore simply ignored, and values like -72,556 remain strings, comma included.
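This is easy to reproduce with an in-memory sketch (the file contents below are a hypothetical two-row sample in the same shape as the question's data):

```python
import io
import pandas as pd

# A units row right under the header forces both columns to strings,
# so decimal=',' never gets a chance to apply.
raw = "Speed;A\n[rpm];[N.m]\n700;-72,5560\n800;-58,9103\n"

bad = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')
print(bad.dtypes)   # both columns come back as object (strings)

# Skipping the units row lets the numeric parse (and decimal=',') succeed.
good = pd.read_csv(io.StringIO(raw), sep=';', decimal=',', skiprows=[1])
print(good.dtypes)  # Speed int64, A float64
```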
I have a dollar value from a statement, -52.23, and I'm simply trying to convert it to a float so I can use it in some calculations, but I keep getting this error, and after some research I still don't understand why.
I have the following code going through a statement pulling out each value. The values are in a csv file and the format looks like this:
07/06/2022, 07/07/2022, AMZN, Shopping, Sale, -52.23
Loop to pull the values:
with open(file, mode='r') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        date = row[0]
        name = row[2]
        amt = float(row[5])
        category = 'misc'
        transaction = (date, name, category, amt)
        print(transaction)
Current printed result (when doing amt = row[5]):
('07/06/22', 'AMZN', 'misc', '-52.23')
Update: the header of the CSV file was the issue; once I skipped that row, it worked.
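For the record, skipping the header with csv.reader only takes one next() call. A minimal sketch (the header names and the io.StringIO stand-in for the file are assumptions, not the asker's actual statement):

```python
import csv
import io

# Hypothetical statement contents, including the header row that broke float():
raw = ("Date,Posted,Name,Category,Type,Amount\n"
       "07/06/2022,07/07/2022,AMZN,Shopping,Sale,-52.23\n")

f = io.StringIO(raw)
csv_reader = csv.reader(f)
next(csv_reader)  # skip the header, so row[5] is '-52.23', not 'Amount'
transactions = []
for row in csv_reader:
    transactions.append((row[0], row[2], 'misc', float(row[5])))
print(transactions)  # [('07/06/2022', 'AMZN', 'misc', -52.23)]
```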
filepathtocsv = r"me2l.txt"  # this is the me2l file
data = pd.read_csv(filepathtocsv, sep='\t', encoding='utf-8-sig', error_bad_lines=False, skiprows=7, header=None, dtype=str, skipinitialspace=True).iloc[:, 2:22]  # dtype prevents auto casting
# UTF-8 with BOM (byte order mark) reading; skipping the first rows as they contain merged columns
# skipping the column that contains asterisks and slicing down to the columns of interest
data.insert(9, 'Vendor ID', '')  # insert at index 9
data[['Vendor ID', 'Name of Vendor']] = data[11].str.split(" ", n=1, expand=True)
# data['PO Value'] = data[19]*data[21]
data = data.replace(',', '', regex=True)
cols_kept = [2, 3, 4, 7, 10, 'Vendor ID', 'Name of Vendor', 12, 15, 21, 19]
data = data[cols_kept]
data.columns = ["Purch.Doc.", "Item", "PO doc type", "Req.Name", "Doc/Date", "Vendor ID", "Name of Vendor", "Short Text", "MATERIAL_GROUP", "Outstanding on PO", "Quantity"]
data = data[data['Outstanding on PO'].str.contains("VAL") == False]
data['Outstanding on PO'] = data['Outstanding on PO'].str.replace(r'[a-zA-Z]', '', regex=True)
data['Quantity'] = data['Quantity'].str.replace(r'[a-zA-Z\s]', '', regex=True)
# data.replace({'Outstanding on PO': {'GBP': ''}}, regex=True, inplace=True)
# data['Quantity'] = data['Quantity'].str.strip()
# data['Outstanding on PO'] = data['Outstanding on PO'].str.strip()
data = data.astype({"Outstanding on PO": "float", "Quantity": "float", "Purch.Doc.": str, "Item": str})
I want to convert Quantity and Outstanding on PO to float. I'm using the regex [a-zA-Z\s] to remove whitespace and letters from the numbers, but I keep getting this annoying error that '' couldn't be converted to a float.
A side question, how do I select and apply str.replace to multiple columns instead of:
data['Outstanding on PO'] = data['Outstanding on PO'].str.replace(r'[a-zA-Z\s]', '', regex=True)
data['Quantity'] = data['Quantity'].str.replace(r'[a-zA-Z\s]', '', regex=True)
And '' is not even whitespace? What is it exactly?
The following would be useful,
print (heterozygosity_df.columns)
This looks like a static-typing issue within pandas. I suspect heterozygosity_df['chrI'] is a column in the dataframe, and that there's a mix of strings and floats within it. pandas has therefore set the column to "string", but you want to perform numerical operations on it. Thus the solution is simply:
print(heterozygosity_df.dtypes) # this should state 'chrI' is a "category" or "string"
heterozygosity_df['chrI'] = heterozygosity_df['chrI'].astype(float)
print(heterozygosity_df.dtypes)
If you have multiple changes the syntax is
heterozygosity_df = heterozygosity_df.astype({'chrI':'float', 'egColumn':'category'})
I suspect there will be other errors, e.g. the header being read as the first row of the data. Again, for pandas to automatically assign a column the "string" type, there must be at least one string value within that column.
From the comments, I see what's happening. The easy solution is:
replacement = {
"chrI": 1,
"chrII": 2,
"chrIII": 3,
"chrIV": 4,
....
}
heterozygosity_df['chr'] = heterozygosity_df['chr'].replace(replacement)
From the comments: good, it works. This is what I would have personally done:
replacement = {
"chrI": 1,
"chrII": 2,
"chrIII": 3,
"chrIV": 4, # continue for all chromosomes
}
heterozygosity_df = pd.read_csv("file.tsv", sep="\t", header=None).set_axis(['chr', 'pos', 'het'], axis=1, copy=False)
heterozygosity_df['chr'] = heterozygosity_df['chr'].replace(replacement).astype('int')
Note that Series.str.replace expects a string pattern, not a dict, so a dict mapping has to go through Series.replace; when 'chr' is imported it is an object (string) column either way. However, if it works, that's the only thing that counts.
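A quick sketch of why the dict form belongs with Series.replace: with regex=False (the default) it matches whole cell values exactly, so the 'chrI' key cannot clobber the prefix of 'chrII':

```python
import pandas as pd

s = pd.Series(['chrI', 'chrII', 'chrmt'])

# Series.replace with a dict does whole-value, exact matching.
mapped = s.replace({'chrI': 1, 'chrII': 2, 'chrmt': 17})
print(mapped.tolist())  # [1, 2, 17]
```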
Ah! I finally got it to work! I used your command but I had to change it up a little bit to work with my data. The final command I used is this:
replacement = {
"chrI": 1,
"chrII": 2,
"chrIII": 3,
"chrIV": 4,
"chrV": 5,
"chrVI": 6,
"chrVII": 7,
"chrVIII": 8,
"chrIX": 9,
"chrX": 10,
"chrXI": 11,
"chrXII": 12,
"chrXIII": 13,
"chrXIV": 14,
"chrXV": 15,
"chrXVI": 16,
"chrmt": 17
}
heterozygosity_df['chr'] = heterozygosity_df['chr'].replace(replacement, regex=False)
And with that I was able to generate plots showing all of the chromosomes! (I have to figure out how to change the axis labels from 1,2,3,4...10 back to the chromosome but that's a future problem). Thank you so much for all your help!!
The source data file is not clean. You should read in the file first and then parse to float.
import pandas as pd
df = pd.read_csv('kidney_disease.csv')
cols = ['pcv','wc','rc']
df = df[cols]
for col in cols:
    df[col] = pd.to_numeric(df[col], downcast='float', errors='coerce')
print(df.dtypes)
Output
pcv float32
wc float32
rc float32
dtype: object
This will result in nan values where strings could not be converted. You should examine your dataset to see what other cleaning may be required.
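To see what that cleaning should be, you can list the raw strings that errors='coerce' would silently turn into NaN. A sketch with toy data ('?' placeholders are an assumption about what the dirty values look like, not the verified contents of kidney_disease.csv):

```python
import pandas as pd

# Toy stand-in for the dataset.
df = pd.DataFrame({'pcv': ['44', '43', '?'],
                   'wc': ['7800', '?', '9800'],
                   'rc': ['5.2', '4.9', '3.1']})

# Collect, per column, the non-null raw values that fail numeric conversion.
bad_values = {}
for col in ['pcv', 'wc', 'rc']:
    as_num = pd.to_numeric(df[col], errors='coerce')
    bad = df.loc[as_num.isna() & df[col].notna(), col].tolist()
    if bad:
        bad_values[col] = bad
print(bad_values)  # {'pcv': ['?'], 'wc': ['?']}
```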
You can try a custom conversion function:
def str_to_float(x):
    return float(x.strip())

data = pd.read_csv(file, usecols=["pcv", "wc", "rc"],
                   converters={"pcv": str_to_float, "wc": str_to_float, "rc": str_to_float})
Note that specifying both dtype= and converters= for the same column makes pandas warn and use only the converter, so the dtype argument can be dropped.
These strings have commas as thousands separators so you will have to remove them before the call to float:
df[column] = (df[column].str.split()).apply(lambda x: float(x[0].replace(',', '')))
This can be simplified a bit by moving split inside the lambda:
df[column] = df[column].apply(lambda x: float(x.split()[0].replace(',', '')))
Another solution with a list comprehension, if you need to apply string functions that work only on a Series (a column of the DataFrame), like str.split and str.replace:
df = pd.concat([df[col].str.split()
                       .str[0]
                       .str.replace(',', '')
                       .astype(float) for col in df], axis=1)
# if you need to convert column 'Purchase count' to int
df['Purchase count'] = df['Purchase count'].astype(int)
print (df)
        Total Revenue  Average Revenue  Purchase count  Rate
Date
Monday         1304.4            20.07            2345  1.54