This has been answered in the comments where it was noted that the following works:
df.astype({'date': 'datetime64[ns]'})
In addition, you can set the dtype when reading in the data:
pd.read_csv('path/to/file.csv', parse_dates=['date'])
Pandas astype('str') does not change the column to string
The object dtype is how pandas stores strings in dataframes. If for some particular reason you absolutely can't have that data as an object dtype, you're going to have to save it to something other than a dataframe.
https://stackoverflow.com/questions/21018654/strings-in-a-dataframe-but-dtype-is-object
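A minimal sketch of what this means in practice, assuming a small made-up Series (the values and variable names are illustrative only): after astype(str) every element is a Python str, even though the reported dtype may still read object depending on the pandas version.

```python
import pandas as pd

# Hypothetical mixed-type column, similar to an "id" column read from a file
s = pd.Series([123, 'zhub1', 12354.3])

converted = s.astype(str)

# The dtype label depends on the pandas version (object on older releases)
print(converted.dtype)

# Regardless of the label, each element is now a Python str
print(all(isinstance(v, str) for v in converted))
```

So an unchanged object label in info() does not by itself mean the conversion failed.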
datetime
Since you can't pass datetime format to astype(), it's a little primitive and it's better to use pd.to_datetime() instead.
df['date'] = pd.to_datetime(df['date'])
For example, if the dates in the data are of the format %d/%m/%Y such as 01/04/2020, astype() would incorrectly parse it as 2020-01-04 whereas with pd.to_datetime(), you can pass the correct format.
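As a small illustration of the point above (the sample values are made up), passing an explicit format to pd.to_datetime() makes day-first strings parse as intended:

```python
import pandas as pd

# Day-first date strings: 1 April 2020 and 2 April 2020
raw = pd.Series(['01/04/2020', '02/04/2020'])

# An explicit format removes any ambiguity between day and month
parsed = pd.to_datetime(raw, format='%d/%m/%Y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
```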
If you need to convert multiple columns into datetime64 (which is often the reason astype() is used), then you can apply pd.to_datetime().
df = pd.DataFrame({'date1': ['01/04/2020'], 'date2': ['02/04/2020']})
df = df.apply(pd.to_datetime, format='%d/%m/%Y')
Even with read_csv, you have some control over the format, e.g.
df = pd.read_csv('file.csv', parse_dates=['date'], dayfirst=True)
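A self-contained sketch of the same idea, using an in-memory CSV instead of a real file (the CSV content and the value column are assumptions made for illustration):

```python
import io

import pandas as pd

# Stand-in for 'file.csv'; the dates are written day-first
csv_text = "date,value\n01/04/2020,10\n02/04/2020,20\n"

df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], dayfirst=True)

print(df.dtypes)
print(df['date'].dt.strftime('%Y-%m-%d').tolist())
```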
date
If you want to cast into date, then you can first cast to datetime64[ns] and then use dt.date to get a column of datetime.date objects:
df['date'] = pd.to_datetime(df['date']).dt.date
The column dtype will become object though (you can still perform vectorized operations on it, such as adding days or comparing dates), so if you plan to work on it a lot in pandas, datetime64 performs better. For example, adding a day is extremely fast on datetime64 columns, not so much on date columns:
s_dt = pd.Series(pd.date_range('1700', None, 10000, 'D'))
s_d = s_dt.dt.date
%timeit x = s_dt + pd.Timedelta(days=1)
# 344 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit y = s_d + pd.Timedelta(days=1)
# 56.1 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
With that being said, if you dump it into a database (such as sqlite), an object dtype column of datetime.date objects is stored as a DATE type (whereas datetime64[ns] will be stored as TIMESTAMP).
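To check the dtype difference between the two representations yourself, a quick sketch along these lines may help (the sample frame is made up for illustration):

```python
import datetime

import pandas as pd

df = pd.DataFrame({'date': ['2020-04-01', '2020-04-02']})

as_dt = pd.to_datetime(df['date'])   # datetime64 column
as_date = as_dt.dt.date              # column of datetime.date objects

print(as_dt.dtype)
print(as_date.dtype)
print(type(as_date.iloc[0]))
```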
The pandas datetime dtype comes from NumPy's datetime64, and there is no separate date dtype (although you can perform vectorized operations on a column that holds datetime.date values). With pandas<2.0 you can therefore use the following as well; since pandas 2.0, unitless datetime64 is no longer supported.
df = df.astype({'date': np.datetime64})
# or (on a little endian system)
df = df.astype({'date': '<M8'})
# (on a big endian system)
df = df.astype({'date': '>M8'})
A new answer to reflect the most current practices: as of now (v1.2.4), neither astype('str') nor astype(str) converts a column to the string dtype; the result is still reported as object.
As per the documentation, a Series can be converted to the string datatype in the following ways:
df['id'] = df['id'].astype("string")
df['id'] = pandas.Series(df['id'], dtype="string")
df['id'] = pandas.Series(df['id'], dtype=pandas.StringDtype())
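A short sketch of the first form, assuming a made-up id column that contains a missing value; with the "string" dtype the missing entry stays missing instead of being turned into text:

```python
import pandas as pd

# Hypothetical id column stored as object, with one missing value
ids = pd.Series(['123', None, 'zhub1'], dtype=object)

as_string = ids.astype('string')

print(as_string.dtype)
print(as_string.isna().tolist())
```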
End to end example:
import pandas as pd
# Create a sample DataFrame
data = {
    'Name': ['John', 'Alice', 'Bob', 'John', 'Alice'],
    'Age': [25, 30, 35, 25, 30],
    'City': ['New York', 'London', 'Paris', 'New York', 'London'],
    'Salary': [50000, 60000, 70000, 50000, 60000],
    'Category': ['A', 'B', 'C', 'A', 'B']
}
df = pd.DataFrame(data)
# Print the DataFrame
print("Original DataFrame:")
print(df)
print("\nData types:")
print(df.dtypes)
cat_cols_ = None
# Apply the code to change data types
if not cat_cols_:
    # Get the columns with object data type
    object_columns = df.select_dtypes(include=['object']).columns.tolist()
    if len(object_columns) > 0:
        print(f"\nObject columns found, converting to string: {object_columns}")
        # Convert object columns to string type
        df[object_columns] = df[object_columns].astype('string')
    # Get the categorical columns (including string and category data types)
    cat_cols_ = df.select_dtypes(include=['category', 'string']).columns.tolist()
# Print the updated DataFrame and data types
print("\nUpdated DataFrame:")
print(df)
print("\nUpdated data types:")
print(df.dtypes)
print(f"\nCategorical columns (cat_cols_): {cat_cols_}")
Original DataFrame:
    Name  Age      City  Salary Category
0   John   25  New York   50000        A
1  Alice   30    London   60000        B
2    Bob   35     Paris   70000        C
3   John   25  New York   50000        A
4  Alice   30    London   60000        B

Data types:
Name        object
Age          int64
City        object
Salary       int64
Category    object
dtype: object

Object columns found, converting to string: ['Name', 'City', 'Category']

Updated DataFrame:
    Name  Age      City  Salary Category
0   John   25  New York   50000        A
1  Alice   30    London   60000        B
2    Bob   35     Paris   70000        C
3   John   25  New York   50000        A
4  Alice   30    London   60000        B

Updated data types:
Name        string[python]
Age                  int64
City        string[python]
Salary               int64
Category    string[python]
dtype: object

Categorical columns (cat_cols_): ['Name', 'City', 'Category']
You can convert all elements of id to str using apply:
df.id.apply(str)
0 123
1 512
2 zhub1
3 12354.3
4 129
5 753
6 295
7 610
Edit by OP:
I think the issue was related to the Python version (2.7); this worked:
df['id'].astype(basestring)
0 123
1 512
2 zhub1
3 12354.3
4 129
5 753
6 295
7 610
Name: id, dtype: object
No need for apply, just use DataFrame.astype directly.
df.astype(np.float64)
apply-ing is also going to give you a pretty bad performance hit.
Example
df = pd.DataFrame(np.arange(10**7).reshape(10**4, 10**3))
%timeit df.astype(np.float64)
1 loop, best of 3: 288 ms per loop
%timeit df.apply(lambda x: x.astype(np.float64), axis=0)
1 loop, best of 3: 748 ms per loop
%timeit df.apply(lambda x: x.astype(np.float64), axis=1)
1 loop, best of 3: 2.95 s per loop
One efficient way would be to work with array data and cast it back to a dataframe, like so -
pd.DataFrame(df.values.astype(np.float64))
Runtime test -
In [144]: df = pd.DataFrame(np.random.randint(11,99,(5000,5000)))
In [145]: %timeit df.astype(np.float64) # @Mitch's soln
10 loops, best of 3: 121 ms per loop
In [146]: %timeit pd.DataFrame(df.values.astype(np.float64))
10 loops, best of 3: 42.5 ms per loop
The casting back to dataframe wasn't that costly -
In [147]: %timeit df.values.astype(np.float64)
10 loops, best of 3: 42.3 ms per loop # Casting to dataframe cost 0.2 ms
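One caveat with the array round trip is that pd.DataFrame(df.values...) on its own drops the original index and column labels. A minimal sketch, assuming a small labelled frame made up for illustration, that passes them back in explicitly:

```python
import numpy as np
import pandas as pd

# Small frame with non-default labels (illustrative only)
df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=['r0', 'r1'], columns=['a', 'b', 'c'])

# Cast on the underlying array, then restore the original labels
fast = pd.DataFrame(df.values.astype(np.float64),
                    index=df.index, columns=df.columns)

# Same result as casting through pandas directly
direct = df.astype(np.float64)
print(fast.equals(direct))
```

This only makes sense when all columns share one dtype; on a mixed-dtype frame, df.values falls back to a common type first.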
Hey,
I have tried a lot of options for changing a pandas dataframe column values from object type to string type. However, the datatype does not change. Does any one of you have any ideas?
Before
ltctest.info()
Gives:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 144334 to 144434
Data columns (total 6 columns):
author         100 non-null object
body           100 non-null object
created_utc    100 non-null int64
id             100 non-null object
score          100 non-null int64
datetime       100 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 5.5+ KB
ltctest['body'] = ltctest['body'].astype(str)
Results in the same info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 144334 to 144434
Data columns (total 6 columns):
author         100 non-null object
body           100 non-null object
created_utc    100 non-null int64
id             100 non-null object
score          100 non-null int64
datetime       100 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 5.5+ KB
Thanks!