Since string data types have variable length, pandas stores them as object dtype by default. If you want to store them as a fixed-width string type, you can do something like this:
df['column'] = df['column'].astype('|S80')  # where the max length is set at 80 bytes
or alternatively
df['column'] = df['column'].astype('|S')  # which will by default set the length to the max it encounters
Answer from Siraj S. on Stack Overflow
Did you try assigning it back to the column?
df['column'] = df['column'].astype('str')
Referring to this question, the pandas DataFrame stores pointers to the strings, hence the column is of type 'object'. As per the docs, you could try:
df['column_new'] = df['column'].str.split(',')
Trying to use the YouTube API to pull through some videos for data analysis and am currently using just two videos in a dataframe to play around with the functionality as I'm new to all of this.
I'm using another API to get the transcripts for each video but I need to input the video_id into that API to get transcripts for each video.
The only problem is that everything is stored as an object, and whenever I try .astype(str) or something similar, it still says the data is an object, which means I can't pass the data to the other API where a string is a required argument.
This is what I get when calling .info() on my dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   video_id              2 non-null      object
 1   publishedAt           2 non-null      object
 2   channelId             2 non-null      object
 3   title                 2 non-null      object
 4   description           2 non-null      object
 5   channelTitle          2 non-null      object
 6   tags                  2 non-null      object
 7   categoryId            2 non-null      object
 8   liveBroadcastContent  2 non-null      object
 9   defaultAudioLanguage  2 non-null      object
dtypes: object(10)
memory usage: 288.0+ bytes
Any help would be really appreciated, or an explanation of how these issues are usually handled.
Use astype('string') instead of astype(str):
df['column'] = df['column'].astype('string')
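For instance, here is a minimal sketch (with a made-up video_id column standing in for the questioner's data) showing that after the cast the column reports the nullable string dtype, and each element is still an ordinary Python str that can be passed to another API:

```python
import pandas as pd

# Made-up stand-in for the questioner's video_id column
df = pd.DataFrame({'video_id': ['dQw4w9WgXcQ', '9bZkp7q19f0']})

df['video_id'] = df['video_id'].astype('string')

print(df['video_id'].dtype)           # the nullable string dtype, not object
print(type(df['video_id'].iloc[0]))   # <class 'str'>
```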
You could read the Excel file specifying the dtype as str:
df = pd.read_excel("Excelfile.xlsx", dtype=str)
then use string replace on the particulars column and assign the result back:
df['particulars'] = df['particulars'].str.replace('/', '')
Note that .str.replace() returns a new Series rather than modifying the column in place, so without the assignment back to df['particulars'] the change is lost.
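As a quick sketch with made-up data (the original Excel file isn't available), the assign-back pattern looks like this:

```python
import pandas as pd

# Made-up data in place of the Excel file
df = pd.DataFrame({'particulars': ['A/B/C', 'X/Y']})

# str.replace returns a new Series; assign it back so the change persists
df['particulars'] = df['particulars'].str.replace('/', '')

print(df['particulars'].tolist())  # ['ABC', 'XY']
```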
Numpy's string dtypes aren't python strings.
Therefore, pandas deliberately uses native python strings, which require an object dtype.
First off, let me demonstrate a bit of what I mean by numpy's strings being different:
In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)
Now, x is a numpy string dtype (a fixed-width, C-like string) and y is an array of native python strings.
If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:
In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array(['Testing', 'a reall', 'string'],
dtype='|S7')
While the object dtype versions can be arbitrary length:
In [6]: y[1] = 'a really really really long'
In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)
Next, the |S dtype strings can't hold unicode properly, though there is a unicode fixed-length string dtype, as well. I'll skip an example, for the moment.
Finally, numpy's strings are actually mutable, while Python strings are not. For example:
In [8]: z = x.view(np.uint8)
In [9]: z += 1
In [10]: x
Out[10]:
array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
dtype='|S7')
For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a python string into a fixed-width numpy string won't work in pandas. Instead, it always uses native python strings, which behave in a more intuitive way for most users.
The difference between 'string' and object dtypes in pandas
As of pandas 1.5.3, there are three main differences between the two dtypes.
1. Null handling
object dtype can store not only strings but also mixed data types, so if you want to cast the values into strings, astype(str) is the prescribed method. This, however, casts all values into strings: even NaNs become literal 'nan' strings. string is a nullable dtype, so casting with astype('string') preserves NaNs as null values.
x = pd.Series(['a', float('nan'), 1], dtype=object)
x.astype(str).tolist() # ['a', 'nan', '1']
x.astype('string').tolist() # ['a', <NA>, '1']
A consequence of this is that string operations (e.g. counting characters, comparison) that are performed on object dtype columns return numpy.int or numpy.bool etc. whereas the same operations performed on 'string' dtype return nullable pd.Int64 or pd.Boolean dtypes. In particular, NaN comparisons return False (because NaN is not equal to any value) for comparisons performed on object dtypes, while pd.NA remains pd.NA for comparisons performed on 'string' dtype.
x = pd.Series(['a', float('nan'), 'b'], dtype=object)
x == 'a'
0 True
1 False
2 False
dtype: bool
y = pd.Series(['a', float('nan'), 'b'], dtype='string')
y == 'a'
0 True
1 <NA>
2 False
dtype: boolean
So with 'string' dtype, null handling is more flexible because you can call fillna() etc. to handle null values however you want to. [1]
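For example (a small sketch), fillna() can be called directly on a nullable string column:

```python
import pandas as pd

y = pd.Series(['a', None, 'b'], dtype='string')

print(y.isna().tolist())             # [False, True, False]
print(y.fillna('missing').tolist())  # ['a', 'missing', 'b']
```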
2. string dtype is clearer
If a pandas column is object dtype, values in it can be replaced with anything. For example, a string can be replaced by an integer and that's OK (e.g. x below), which might have unwanted consequences later if you expect every value to be a string. string dtype does not have that problem, because a string can only be replaced by another string (e.g. y below).
x = pd.Series(['a', 'b'], dtype=str)
y = pd.Series(['a', 'b'], dtype='string')
x[1] = 3 # OK
y[1] = 3 # ValueError
y[1] = '3' # OK
This has the advantage that you can use select_dtypes() to select only string columns. In other words, with object dtypes there is no way to identify string columns, but with 'string' dtype there is.
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [[1], [2,3], [4,5]]}).astype({'A': 'string'})
df.select_dtypes('string') # only selects the string column
A
0 a
1 b
2 c
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [[1], [2,3], [4,5]]})
df.select_dtypes('object') # selects the mixed dtype column as well
A B
0 a [1]
1 b [2, 3]
2 c [4, 5]
3. Memory efficiency
The 'string' dtype has storage options (python and pyarrow), and if the strings are short, pyarrow is very memory-efficient. Look at the following example:
lst = np.random.default_rng().integers(1000000, size=1000).astype(str).tolist()
x = pd.Series(lst, dtype=object)
y = pd.Series(lst, dtype='string[pyarrow]')
x.memory_usage(deep=True) # 63041
y.memory_usage(deep=True) # 10041
As you can see, if the strings are short (at most 6 characters in the example above), pyarrow consumes over six times less memory. However, as the following example shows, if the strings are long, there is barely any difference.
z = x * 1000
w = (y.astype(str) * 1000).astype('string[pyarrow]')
z.memory_usage(deep=True) # 5970128
w.memory_usage(deep=True) # 5917128
[1] Similar intuition already exists for str.contains and str.match, for example.
x = pd.Series(['a', float('nan'), 'b'], dtype=object)
x.str.match('a', na=np.nan)
0 True
1 NaN
2 False
dtype: object