Use astype with replace:
df = pd.DataFrame({'ID':[805096730.0,805096730.0]})
df['ID'] = df['ID'].astype(str).replace(r'\.0$', '', regex=True)
print(df)
          ID
0  805096730
1  805096730
Or add parameter dtype:
df = pd.read_excel(file, dtype={'ID':str})
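A self-contained sketch of both approaches; since read_excel needs a real file on disk, read_csv with an in-memory buffer stands in for it here (the dtype parameter behaves the same way for both readers):

```python
import io
import pandas as pd

# Simulated file contents; dtype=str prevents the ID column from ever
# becoming a float, so no ".0" suffix appears in the first place.
raw = "ID\n805096730\n805096730\n"
df = pd.read_csv(io.StringIO(raw), dtype={'ID': str})
print(df['ID'].tolist())  # ['805096730', '805096730']

# The after-the-fact fix: cast to str, then strip the trailing ".0".
df2 = pd.DataFrame({'ID': [805096730.0, 805096730.0]})
df2['ID'] = df2['ID'].astype(str).replace(r'\.0$', '', regex=True)
print(df2['ID'].tolist())  # ['805096730', '805096730']
```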
Check the type of your numbers before converting them to strings. It seems they are floats rather than integers. If that is the case, convert them to integers first:
df = pd.DataFrame([123.0, 456.0])
df[0] = df[0].astype(int)
     0
0  123
1  456
Then convert them to strings:
df[0] = df[0].astype(str)
print(df.iloc[1, 0])
456
You have a few options...
1) Convert everything to integers:
df.astype(int)
          <=35  >35
Cut-off
Calcium      0    1
Copper       1    0
Helium       0    8
Hydrogen     0    1
2) Use round:
>>> df.round()
          <=35  >35
Cut-off
Calcium      0    1
Copper       1    0
Helium       0    8
Hydrogen     0    1
but not always great...
>>> (df - .2).round()
          <=35  >35
Cut-off
Calcium     -0    1
Copper       1   -0
Helium      -0    8
Hydrogen    -0    1
3) Change the display precision option in pandas (recent versions require the full key 'display.precision'):
pd.set_option('display.precision', 0)
>>> df
          <=35  >35
Cut-off
Calcium      0    1
Copper       1    0
Helium       0    8
Hydrogen     0    1
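On pandas 1.0+ the bare 'precision' key is no longer accepted, so option 3 needs the fully qualified option name. A runnable sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'<=35': [0.0, 1.0], '>35': [1.0, 8.0]},
                  index=['Calcium', 'Helium'])

# 'display.precision' only changes how floats are printed;
# the underlying values remain float64.
pd.set_option('display.precision', 0)
print(df)
print(df.dtypes)
```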
Since pandas 0.17.1 you can set the displayed numerical precision by modifying the style of the particular data frame, rather than setting the global option:
import pandas as pd
import numpy as np
np.random.seed(24)
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))
df.style.set_precision(2)  # on pandas >= 1.3, use df.style.format(precision=2)
It is also possible to apply column-specific styles:
df.style.format({
    'A': '{:,.1f}'.format,
    'B': '{:,.3f}'.format,
})
I think the field is automatically parsed as float when reading the Excel file. I would correct it afterwards:
df['column_name'] = df['column_name'].astype(int)
If your column contains nulls you can't convert to integer, so you will need to fill the nulls first:
df['column_name'] = df['column_name'].fillna(0).astype(int)
Then you can concatenate and store it the way you were doing.
Your question has nothing to do with Spark or PySpark; it's related to Pandas.
This is because Pandas interprets and infers columns' data types automatically. Since all the values of your column are numeric, Pandas will consider it a float data type.
To avoid this, the pandas.ExcelFile.parse method accepts an argument called converters; you can use it to tell Pandas the specific column data type:
# if you want one specific column as string
df = pd.concat([filepath_pd.parse(name, converters={'column_name': str}) for name in names])
OR
# if you want all columns as string
# and you have multi sheets and they do not have same columns
# this merge all sheets into one dataframe
def get_converters(excel_file, sheet_name, dt_cols):
    cols = excel_file.parse(sheet_name).columns
    converters = {col: str for col in cols if col not in dt_cols}
    for col in dt_cols:
        converters[col] = pd.to_datetime
    return converters

df = pd.concat([filepath_pd.parse(name, converters=get_converters(filepath_pd, name, ['date_column'])) for name in names]).reset_index(drop=True)
OR
# if you want all columns as string
# and all your sheets have same columns
cols = filepath_pd.parse().columns
dt_cols = ['date_column']
converters = {col: str for col in cols if col not in dt_cols}
for col in dt_cols:
    converters[col] = pd.to_datetime
df = pd.concat([filepath_pd.parse(name, converters=converters) for name in names]).reset_index(drop=True)
You need to re-assign the dataframe (which, I suppose, is your error):
>>> import pandas as pd
>>> df = pd.DataFrame(data={"col": [24.00, 2.00, 3.00]})
>>> df.dtypes
col    float64
dtype: object
>>> df
    col
0  24.0
1   2.0
2   3.0
>>> df = df.astype(int)
>>> df
   col
0   24
1    2
2    3
>>> df.dtypes
col    int32
dtype: object
You can solve this by setting the pandas option 'display.precision' to 0 (older versions accepted the shorthand 'precision', which has since been removed):
import pandas as pd
df = pd.DataFrame(data={"col": [24.00, 2.00, 3.00]})
print(df)
    col
0  24.0
1   2.0
2   3.0
pd.set_option('display.precision', 0)
print(df)
   col
0   24
1    2
2    3
My boyfriend has moved his Excel table to Python, but it has added .0 to his values (e.g. 160 becomes 160.0). Is there any way to fix this and remove the decimals?
Use a function and apply it to the whole column:
In [94]:
df = pd.DataFrame({'Movies': ['Save the last dance', '2012.0']})
df
Out[94]:
                Movies
0  Save the last dance
1               2012.0
[2 rows x 1 columns]
In [95]:
def trim_fraction(text):
    # endswith avoids clipping strings that merely contain '.0' in the middle
    if text.endswith('.0'):
        return text[:-2]
    return text

df.Movies = df.Movies.apply(trim_fraction)
In [96]:
df
Out[96]:
                Movies
0  Save the last dance
1                 2012
[2 rows x 1 columns]
Here is a hint for you.
In case of a valid number:
a = "2012.0"
try:
    print(int(float(a)))
except ValueError:
    print(a)
Output:
2012
In case of a string like "Dance with Me":
a = "Dance with Me"
try:
    print(int(float(a)))
except ValueError:
    print(a)
Output:
Dance with Me
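Applied to a pandas column, the same try/except idea can be wrapped in a helper (the function name here is mine) and used with apply:

```python
import pandas as pd

def to_int_if_number(value):
    # Numeric strings like '2012.0' become '2012'; everything else passes through.
    try:
        return str(int(float(value)))
    except ValueError:
        return value

df = pd.DataFrame({'Movies': ['Save the last dance', '2012.0']})
df['Movies'] = df['Movies'].apply(to_int_if_number)
print(df['Movies'].tolist())  # ['Save the last dance', '2012']
```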
Use astype(np.int64):
import numpy as np
import pandas as pd

s = pd.Series(['', 8.00735e+09, 4.35789e+09, 6.10644e+09])
mask = pd.to_numeric(s).notnull()
s.loc[mask] = s.loc[mask].astype(np.int64)
s
0
1    8007350000
2    4357890000
3    6106440000
dtype: object
In Pandas/NumPy, integers are not allowed to take NaN values, and arrays/series (including dataframe columns) are homogeneous in their datatype --- so having a column of integers where some entries are None/np.nan is downright impossible.
EDIT: data.phone.astype('object') should do the trick; in this case, Pandas treats your column as a series of generic Python objects rather than a specific datatype (e.g. str/float/int), at the cost of performance if you intend to run any heavy computations on this data (probably not in your case).
Assuming you want to keep those NaN entries, your approach of converting to strings is a valid possibility:
data.phone.astype(str).str.split('.', expand = True)[0]
should give you what you're looking for (there are alternative string methods you can use, such as .replace or .extract, but .split seems the most straightforward in this case).
Alternatively, if you are only interested in the display of floats (unlikely I'd suppose), you can do pd.set_option('display.float_format','{:.0f}'.format), which doesn't actually affect your data.
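As a side note (not part of the original answer): newer pandas versions offer the nullable 'Int64' extension dtype, which sidesteps the integer-plus-NaN limitation described above. A minimal sketch:

```python
import numpy as np
import pandas as pd

phone = pd.Series([805096730.0, np.nan, 123456789.0])

# Plain int64 would raise here because of the NaN; the capital-I 'Int64'
# extension dtype stores integers alongside a dedicated missing-value marker.
phone_int = phone.astype('Int64')
print(phone_int.astype(str).tolist())  # no '.0' suffix on the numbers
```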
If values are strings first convert to floats and then to integers:
df['Net Sales'] = df['Net Sales'].astype(float).astype(int)
If values are floats use:
df['Net Sales'] = df['Net Sales'].astype(int)
Your solution should be changed to \.\d+ to match the digits after the dot:
df['Net Sales'] = df['Net Sales'].astype(str).replace(r'\.\d+', '', regex=True).astype(int)
print(df)
   Net Sales
0        123
1         34
2         65
Or you can split by the dot and select the first list element by indexing:
df['Net Sales'] = df['Net Sales'].astype(str).str.split('.').str[0].astype(int)
You can coerce the datatype to int. Just a note: in case you have NaNs in your data, the conversion to int doesn't work (NaNs force a float data type), so the regex solution might be better.
df['Net Sales'] = df['Net Sales'].astype('int')
or, in case of regex:
df['Net Sales'] = df['Net Sales'].astype('str').replace(r'\.\d+$', '', regex=True).astype('int')
Example:
import pandas as pd
df = pd.DataFrame({"Net Sales" : [1.5, 2.5]})
df['Net Sales'] = df['Net Sales'].astype('int')
df['Net Sales'] = df['Net Sales'].astype('str').replace(r'\.\d+$', '', regex=True).astype('int')
Output:
#    Net Sales
# 0          1
# 1          2
You can try df['col'] = (df['col']*100).astype(int), as below:
df = pd.DataFrame({'col': [1.10, 2.20, 3.30, 4.40]})
df['col'] = (df['col']*100).astype(int)
print(df)
Output:
   col
0  110
1  220
2  330
3  440
If - as your comment suggests - the data just all needs to be multiplied by 100...
df['columnName'] = df['columnName'].apply(lambda x: x*100)
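For a purely numeric column, the multiplication can also be done vectorised rather than through apply, which is both shorter and faster (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'columnName': [1.10, 2.20, 3.30]})

# Vectorised arithmetic operates on the whole column at once; rounding
# before the int cast guards against float artefacts like 109.99999....
df['columnName'] = (df['columnName'] * 100).round().astype(int)
print(df['columnName'].tolist())  # [110, 220, 330]
```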
You can apply str.replace to the Name column in combination with regular expressions:
import pandas as pd
# Example DataFrame
df = pd.DataFrame.from_dict({'Name': ['May21', 'James', 'Adi22', 'Hello', 'Girl90'],
                             'Volume': [23, 12, 11, 34, 56],
                             'Value': [21321, 12311, 4435, 32454, 654654]})
df['Name'] = df['Name'].str.replace(r'\d+', '', regex=True)
print(df)
Output:
    Name   Value  Volume
0    May   21321      23
1  James   12311      12
2    Adi    4435      11
3  Hello   32454      34
4   Girl  654654      56
In the regular expression, \d stands for "any digit" and + stands for "one or more".
Thus, str.replace(r'\d+', '', regex=True) means: "replace all runs of digits in the strings with nothing" (on pandas 2.0+ the regex=True flag is required; older versions treated multi-character patterns as regexes by default).
You can do it like so:
df.Name = df.Name.str.replace(r'\d+', '', regex=True)
To play and explore, check the online Regular expression demo here: https://regex101.com/r/Y6gJny/2
Whatever is matched by the pattern \d+, i.e. one or more digits, will be replaced by an empty string.
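The same pattern works in plain Python with the re module, which is a handy way to check a regex before applying it to a whole DataFrame column:

```python
import re

# r'\d+' matches one or more consecutive digits; re.sub replaces
# every match with the empty string.
print(re.sub(r'\d+', '', 'May21'))   # May
print(re.sub(r'\d+', '', 'Girl90'))  # Girl
print(re.sub(r'\d+', '', 'James'))   # unchanged: James
```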