It means:
'O' (Python) objects
Source.
The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are:
'b' boolean
'i' (signed) integer
'u' unsigned integer
'f' floating-point
'c' complex-floating point
'O' (Python) objects
'S', 'a' (byte-)string
'U' Unicode
'V' raw data (void)
Another answer helps if you need to check types.
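As a quick check of the kind codes listed above, here is a minimal sketch that builds one dtype per kind and reads back its kind character. The dictionary keys are just labels chosen for this example, not NumPy names.

```python
import numpy as np

# One dtype per kind; the string codes follow the "kind + size" convention.
examples = {
    "bool": np.dtype(bool),
    "int32": np.dtype("i4"),
    "uint8": np.dtype("u1"),
    "float64": np.dtype("f8"),
    "complex128": np.dtype("c16"),
    "object": np.dtype("O"),
    "bytes": np.dtype("S5"),    # 5 bytes per item
    "unicode": np.dtype("U5"),  # 5 characters per item, 4 bytes each
    "void": np.dtype("V8"),
}

kinds = {name: dt.kind for name, dt in examples.items()}
for name, dt in examples.items():
    print(name, dt.kind, dt.itemsize)
```

Note that for 'U' the number is a character count, so the item size in bytes is four times larger.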
When you see dtype('O') inside a dataframe, it usually means the column holds pandas strings (Python str objects).
What is dtype?
Something that belongs to pandas or numpy, or both, or something else? If we examine pandas code:
import pandas as pd

df = pd.DataFrame({'float': [1.0],
                   'int': [1],
                   'datetime': [pd.Timestamp('20180310')],
                   'string': ['foo']})
print(df)
print(df['float'].dtype, df['int'].dtype, df['datetime'].dtype, df['string'].dtype)
df['string'].dtype
It will output like this:
   float  int   datetime string
0    1.0    1 2018-03-10    foo
---
float64 int64 datetime64[ns] object
---
dtype('O')
You can read the last line as the pandas dtype('O'), i.e. the pandas object dtype. In this example it holds Python str values, which correspond to the NumPy string_ or unicode_ types.
Pandas dtype    Python type    NumPy type           Usage
object          str            string_, unicode_    Text
Just as Don Quixote rides his donkey, pandas rides on NumPy; NumPy understands the underlying architecture of your system and uses the class numpy.dtype for that.
A data type object is an instance of the numpy.dtype class that describes the data type more precisely, including:
- Type of the data (integer, float, Python object, etc.)
- Size of the data (how many bytes is in e.g. the integer)
- Byte order of the data (little-endian or big-endian)
- If the data type is structured, an aggregate of other data types, (e.g., describing an array item consisting of an integer and a float)
- What are the names of the "fields" of the structure
- What is the data-type of each field
- Which part of the memory block each field takes
- If the data type is a sub-array, what is its shape and data type
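The properties in the list above can be read directly off a numpy.dtype instance. This is a minimal sketch; the field names "count" and "weight" are made up for the example.

```python
import numpy as np

# A big-endian 4-byte signed integer: kind and size in bytes.
be_int = np.dtype(">i4")
print(be_int.kind, be_int.itemsize)

# A structured dtype (hypothetical fields "count" and "weight").
record = np.dtype([("count", np.int32), ("weight", np.float64)])
print(record.names)             # names of the fields
print(record.fields["weight"])  # (dtype of the field, byte offset in the record)
print(record.itemsize)          # total bytes per record

# A sub-array dtype: each item is a 2x3 block of int32.
block = np.dtype((np.int32, (2, 3)))
print(block.shape, block.base)
```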
In the context of this question, dtype belongs to both pandas and NumPy, and in particular dtype('O') means we expect a string.
Here is some code for testing, with explanation. Suppose we have the dataset as a dictionary:
import pandas as pd
import numpy as np
from pandas import Timestamp
data = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
        'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'),
                 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'),
                 4: Timestamp('2018-12-12 00:00:00')},
        'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'},
        'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},
        'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
df = pd.DataFrame.from_dict(data)  # now we have a dataframe
print(df)
print(df.dtypes)
The last two lines print the dataframe and its dtypes. Note the output:
   id       date                  role  num   fnum
0   1 2018-12-12               Support  123   3.14
1   2 2018-12-12             Marketing  234   2.14
2   3 2018-12-12  Business Development  345  -0.14
3   4 2018-12-12                 Sales  456  41.30
4   5 2018-12-12           Engineering  567   3.14
id               int64
date    datetime64[ns]
role            object
num              int64
fnum           float64
dtype: object
All kinds of different dtypes.
df.iloc[1, :] = np.nan
df.iloc[2, :] = None
If we set a row to np.nan or None, the datetime64[ns] and object columns keep their dtype, while the int64 columns are upcast to float64 because NaN is a floating-point value. The output will look like this:
print(df)
print(df.dtypes)
    id       date         role    num   fnum
0  1.0 2018-12-12      Support  123.0   3.14
1  NaN        NaT          NaN    NaN    NaN
2  NaN        NaT         None    NaN    NaN
3  4.0 2018-12-12        Sales  456.0  41.30
4  5.0 2018-12-12  Engineering  567.0   3.14
id             float64
date    datetime64[ns]
role            object
num            float64
fnum           float64
dtype: object
So np.nan or None does not change the dtype of the datetime64[ns], float64 or object columns, while the int64 columns become float64. If we set every row of a column to np.nan or None, the column will become float64 or object respectively.
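To see the missing-value behaviour in isolation, here is a small sketch that builds each Series directly instead of assigning into an existing dataframe, since assignment behaviour can differ between pandas versions. The text column is created with dtype=object explicitly, which is an assumption made to keep the example independent of how a given pandas version infers strings.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])                                  # integers only
with_nan = pd.Series([1, np.nan, 3])                         # NaN forces a float dtype
dates = pd.Series([pd.Timestamp("2018-12-12"), pd.NaT])      # NaT keeps a datetime dtype
text = pd.Series(["Support", None], dtype=object)            # None fits in an object column

# dtype.kind is the one-letter NumPy kind code of each column.
for name, s in [("ints", ints), ("with_nan", with_nan), ("dates", dates), ("text", text)]:
    print(name, s.dtype, s.dtype.kind)
```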
You can also try setting single rows:
df.iloc[3, :] = 0   # will convert datetime to object only
df.iloc[4, :] = ''  # will convert all columns to object
Note that if we set a string inside a non-string column, the column will become string or object dtype.
The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that is 8 bytes. But for strings, the length of the string is not fixed. So instead of saving the bytes of the strings in the ndarray directly, Pandas uses an object ndarray, which saves pointers to objects; because of this, the dtype of this kind of ndarray is object.
Here is an example:
- the int64 array contains 4 int64 values.
- the object array contains 4 pointers to 3 string objects.
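The same situation can be reproduced in a few lines. This is only a sketch; the strings "apple", "kiwi" and "fig" are arbitrary values picked for the example.

```python
import numpy as np

ints = np.array([1, 2, 3, 4], dtype=np.int64)

a, b, c = "apple", "kiwi", "fig"
words = np.array([a, b, a, c], dtype=object)  # 4 slots, 3 distinct string objects

print(ints.dtype, ints.itemsize)    # each integer is stored inline in the array
print(words.dtype, words.itemsize)  # each slot holds a pointer-sized reference
print(words[0] is words[2])         # both slots refer to the same str object

# Without dtype=object NumPy stores fixed-width Unicode, padded to the longest string.
fixed = np.array([a, b, a, c])
print(fixed.dtype.kind, fixed.itemsize)
```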

@HYRY's answer is great. I just want to provide a little more context.
Arrays store data as contiguous, fixed-size memory blocks. The combination of these properties together is what makes arrays lightning fast for data access. For example, consider how your computer might store an array of 32-bit integers, [3,0,1].

If you ask your computer to fetch the 3rd element in the array, it'll start at the beginning and then jump across 64 bits to get to the 3rd element. Knowing exactly how many bits to jump across is what makes arrays fast.
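The jump described here can be worked out from the array's item size. A minimal sketch, reading the 3rd element back from the raw bytes at the computed offset:

```python
import numpy as np

arr = np.array([3, 0, 1], dtype=np.int32)

index = 2                             # the 3rd element
offset_bytes = index * arr.itemsize   # 2 items * 4 bytes each
offset_bits = offset_bytes * 8
print(arr.strides, offset_bits)

# Read one int32 directly from the raw buffer at that offset.
third = np.frombuffer(arr.tobytes(), dtype=np.int32, count=1, offset=offset_bytes)[0]
print(third)
```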
Now consider the sequence of strings ['hello', 'i', 'am', 'a', 'banana']. Strings are objects that vary in size, so if you tried to store them in contiguous memory blocks, it'd end up looking like this.

Now your computer doesn't have a fast way to access a randomly requested element. The key to overcoming this is to use pointers. Basically, store each string in some random memory location, and fill the array with the memory address of each string. (Memory addresses are just integers.) So now, things look like this

Now, if you ask your computer to fetch the 3rd element, just as before, it can jump across 64 bits (assuming the memory addresses are 32-bit integers) and then make one extra step to go fetch the string.
The challenge for NumPy is that there's no guarantee the pointers are actually pointing to strings. That's why it reports the dtype as 'object'.
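A short sketch of why 'object' is the only honest label: nothing stops an object array from holding a mix of Python types. The values below are arbitrary.

```python
from decimal import Decimal

import numpy as np

# dtype=object stores references, so any Python object is accepted.
mixed = np.array(["hello", 42, Decimal("1.5"), None], dtype=object)

types = [type(x).__name__ for x in mixed]
print(mixed.dtype, types)
```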
Shamelessly gonna plug my own course on NumPy where I originally discussed this.
I'm following a Kaggle tutorial where the data set is the Melbourne housing data.
I keep seeing this:
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                    X_train_full[cname].dtype == "object"]
I understand that we're concerned about columns that have data with low cardinality. I'm confused why we care that the dtype == 'object'. Why does this matter? How does the dtype improve our ability to predict pricing?
For some reason, some of the columns are being loaded as a Decimal rather than as a float - not my team, apparently can't be changed.
Is there a way to identify which columns are Decimal? df[col].dtype just returns "O" which makes it impossible to distinguish from objects using this method.
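One possible approach, sketched here under the assumption that a column counts as "Decimal" when all of its non-null values are decimal.Decimal instances, is to look at the Python type of the elements rather than at the column dtype. The helper name decimal_columns and the sample frame are hypothetical.

```python
from decimal import Decimal

import pandas as pd


def decimal_columns(df):
    """Return names of object-dtype columns whose non-null values are all Decimal."""
    found = []
    for col in df.columns:
        if df[col].dtype != object:
            continue  # only object columns can hold Decimal instances
        values = df[col].dropna()
        if len(values) > 0 and values.map(lambda v: isinstance(v, Decimal)).all():
            found.append(col)
    return found


# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "price": [Decimal("1.10"), Decimal("2.25")],
    "name": ["a", "b"],
    "qty": [1, 2],
})
result = decimal_columns(df)
print(result)
```

This checks every value, so on very large frames it may be worth testing only a sample of each column first.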