Strings in a DataFrame, but dtype is object

stackoverflow.com › questions › 21018654 › strings-in-a-dataframe-but-dtype-is-object

The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that is 8 bytes. Strings, however, have no fixed length. So instead of storing the bytes of the strings in the ndarray directly, pandas uses an object ndarray, which stores pointers to objects; this is why the dtype of this kind of ndarray is object.

Here is an example:

The int64 array contains 4 int64 values.
The object array contains 4 pointers to 3 string objects.
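
The two layouts described above can be sketched with NumPy directly. This is only an illustration: the array contents and variable names are assumptions, not taken from the original answer. The string `word` is reused on purpose so that four slots refer to three distinct string objects.

```python
import numpy as np

# Fixed-size elements: each int64 takes exactly 8 bytes inside the array.
ints = np.array([1, 2, 3, 4], dtype=np.int64)

# Variable-size elements: the array stores references to Python objects.
word = "foo"
strs = np.array([word, "bar", word, "baz"], dtype=object)

print(ints.dtype, ints.itemsize)   # int64 8
print(strs.dtype)                  # object
print(strs[0] is strs[2])          # True: two slots refer to the same str
print(len({id(x) for x in strs}))  # 3 distinct objects behind 4 slots
```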

Answer from HYRY on Stack Overflow

Practical Business Python

pbpython.com › pandas_dtypes.html

Overview of Pandas Data Types - Practical Business Python

We would like to get totals added ... is the line that says dtype: object. An object is a string in pandas so it performs a string operation instead of a mathematical one....
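
A small sketch of the behaviour this snippet describes, with made-up values. `dtype=object` is passed explicitly as an assumption about the setup, because newer pandas versions may infer a dedicated string dtype for all-string data.

```python
import pandas as pd

# Numbers stored as text in object columns (illustrative values).
a = pd.Series(["10", "20"], dtype=object)
b = pd.Series(["5", "5"], dtype=object)

concatenated = a + b                          # element-wise str + str
summed = pd.to_numeric(a) + pd.to_numeric(b)  # element-wise int + int

print(concatenated.tolist())  # ['105', '205']
print(summed.tolist())        # [15, 25]
```

Converting with `pd.to_numeric` (or `astype`) before doing arithmetic is one way to get totals instead of concatenated text.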

Pandas

pandas.pydata.org › docs › reference › api › pandas.DataFrame.dtypes.html

pandas.DataFrame.dtypes — pandas 3.0.2 documentation

Return the dtypes in the DataFrame. This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype.
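
As a hedged illustration of the last sentence, the following sketch builds a tiny DataFrame (column names and values are invented for this example) with one purely numeric column and one column of mixed Python types.

```python
import pandas as pd

df = pd.DataFrame({
    "num": [1, 2, 3],        # all integers
    "mixed": [1, "a", 2.5],  # int, str and float in one column
})

# dtypes is a Series indexed by column name.
print(df.dtypes)
```

Under these assumptions `num` is reported as int64 and `mixed` as object.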



python - Strings in a DataFrame, but dtype is object - Stack Overflow

Top answer

1 of 4

208

Here is an example:

The int64 array contains 4 int64 values.
The object array contains 4 pointers to 3 string objects.

2 of 4

@HYRY's answer is great. I just want to provide a little more context.

Arrays store data as contiguous, fixed-size memory blocks. The combination of these properties together is what makes arrays lightning fast for data access. For example, consider how your computer might store an array of 32-bit integers, [3,0,1].

If you ask your computer to fetch the 3rd element in the array, it'll start at the beginning and then jump across 64 bits to get to the 3rd element. Knowing exactly how many bits to jump across is what makes arrays fast.

Now consider the sequence of strings ['hello', 'i', 'am', 'a', 'banana']. Strings are objects that vary in size, so if you tried to store them in contiguous memory blocks, it'd end up looking like this.

Now your computer doesn't have a fast way to access a randomly requested element. The key to overcoming this is to use pointers. Basically, store each string in some random memory location, and fill the array with the memory address of each string. (Memory addresses are just integers.) So now, things look like this

Now, if you ask your computer to fetch the 3rd element, just as before, it can jump across 64 bits (assuming the memory addresses are 32-bit integers) and then make one extra step to go fetch the string.

The challenge for NumPy is that there's no guarantee the pointers are actually pointing to strings. That's why it reports the dtype as 'object'.
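
A minimal sketch of that last point: an object array can hold references to anything, so NumPy cannot promise the elements are strings. The values below are arbitrary examples chosen for this illustration.

```python
import numpy as np

# One array, five unrelated Python types behind the pointers.
mixed = np.array(["hello", 42, 3.14, None, {"k": 1}], dtype=object)

element_types = [type(x).__name__ for x in mixed]
print(mixed.dtype)     # object
print(element_types)   # ['str', 'int', 'float', 'NoneType', 'dict']
```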

Shamelessly gonna plug my own course on NumPy where I originally discussed this.

Stack Overflow

stackoverflow.com › questions › 37561991 › what-is-dtypeo-in-pandas

python - What is dtype('O'), in pandas? - Stack Overflow

Top answer

1 of 5

211

It means:

'O'     (Python) objects

Source.

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are:

'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)
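
To see how these codes map to dtype objects, here is a short sketch that inspects the `kind` and `itemsize` of a few of them. The particular codes chosen are just examples.

```python
import numpy as np

# Map each type string to (kind character, bytes per item).
info = {}
for code in ["i8", "f8", "U5", "S3", "O"]:
    dt = np.dtype(code)
    info[code] = (dt.kind, dt.itemsize)
    print(code, dt, dt.kind, dt.itemsize)

# 'U5' means 5 characters; NumPy stores 4 bytes per character, so 20 bytes.
# 'O' holds one pointer per item, so its itemsize depends on the platform.
print(np.dtype("O") == np.dtype(object))  # True
```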

Another answer helps if you need to check types.

2 of 5

When you see `dtype('O')` inside a DataFrame, it means a pandas string column.

What is dtype?

Something that belongs to pandas or numpy, or both, or something else? If we examine pandas code:

import pandas as pd

# One column per common dtype
df = pd.DataFrame({'float': [1.0],
                   'int': [1],
                   'datetime': [pd.Timestamp('20180310')],
                   'string': ['foo']})
print(df)
print(df['float'].dtype, df['int'].dtype, df['datetime'].dtype, df['string'].dtype)
df['string'].dtype

It will output like this:

   float  int   datetime string
0    1.0    1 2018-03-10    foo
---
float64 int64 datetime64[ns] object
---
dtype('O')

You can interpret the last as pandas dtype('O'), that is, the pandas object dtype, which here holds Python str values and corresponds to the NumPy string_ or unicode_ types.

Pandas dtype    Python type     NumPy type          Usage
object          str             string_, unicode_   Text

Like Don Quixote on his ass, pandas sits on NumPy; NumPy understands the underlying architecture of your system and uses the class numpy.dtype for that.

A data type object is an instance of the numpy.dtype class that describes the data type more precisely, including:

Type of the data (integer, float, Python object, etc.)
Size of the data (how many bytes is in e.g. the integer)
Byte order of the data (little-endian or big-endian)
If the data type is structured, an aggregate of other data types, (e.g., describing an array item consisting of an integer and a float)
What are the names of the "fields" of the structure
What is the data-type of each field
Which part of the memory block each field takes
If the data type is a sub-array, what is its shape and data type

In the context of this question, dtype belongs to both pandas and NumPy, and in particular dtype('O') means we expect a string.

Here is some code for testing, with explanation. Suppose we have the dataset as a dictionary:

import pandas as pd
import numpy as np
from pandas import Timestamp

data={'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'), 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'), 4: Timestamp('2018-12-12 00:00:00')}, 'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'}, 'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567}, 'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
df = pd.DataFrame.from_dict(data) #now we have a dataframe

print(df)
print(df.dtypes)

The last lines inspect the DataFrame; note the output:

   id       date                  role  num   fnum
0   1 2018-12-12               Support  123   3.14
1   2 2018-12-12             Marketing  234   2.14
2   3 2018-12-12  Business Development  345  -0.14
3   4 2018-12-12                 Sales  456  41.30
4   5 2018-12-12           Engineering  567   3.14
id               int64
date    datetime64[ns]
role            object
num              int64
fnum           float64
dtype: object

All kinds of different dtypes.

df.iloc[1,:] = np.nan
df.iloc[2,:] = None

But if we set np.nan or None, most columns keep their original dtype; only the integer columns are upcast to float64 so that they can hold NaN. The output will be like this:

print(df)
print(df.dtypes)

    id       date         role    num   fnum
0  1.0 2018-12-12      Support  123.0   3.14
1  NaN        NaT          NaN    NaN    NaN
2  NaN        NaT         None    NaN    NaN
3  4.0 2018-12-12        Sales  456.0  41.30
4  5.0 2018-12-12  Engineering  567.0   3.14
id             float64
date    datetime64[ns]
role            object
num            float64
fnum           float64
dtype: object

So np.nan or None does not change a column's dtype (apart from upcasting integer columns to float64), unless we set all rows of the column to np.nan or None. In that case the column becomes float64 or object, respectively.
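
The effect of missing values on integer data can be sketched in isolation. The Series below are illustrative; the nullable "Int64" dtype is not used in the answer above and is shown here only as an assumption about one possible alternative that keeps integers intact.

```python
import numpy as np
import pandas as pd

plain = pd.Series([1, 2, 3])                         # no missing values
with_nan = pd.Series([1, np.nan, 3])                 # NaN forces a float dtype
nullable = pd.Series([1, None, 3], dtype="Int64")    # pandas nullable integer

print(plain.dtype)     # int64
print(with_nan.dtype)  # float64
print(nullable.dtype)  # Int64
print(nullable.isna().tolist())  # [False, True, False]
```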

You may try also setting single rows:

df.iloc[3,:] = 0 # will convert datetime to object only
df.iloc[4,:] = '' # will convert all columns to object

Note that if we set a string inside a non-string column, the column becomes string or object dtype.

Pandas

pandas.pydata.org › docs › reference › api › pandas.api.types.is_object_dtype.html

pandas.api.types.is_object_dtype — pandas 3.0.2 documentation

Object dtype is a generic data type that can hold any Python objects, including strings, lists, and custom objects. ... The array-like or dtype to check. ... Whether or not the array-like or dtype is of the object dtype. ... Check whether the provided array or dtype is of a numeric dtype.
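
A short usage sketch for `pandas.api.types.is_object_dtype`, which accepts either a dtype or an array-like. The inputs are arbitrary examples picked for this illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_object_dtype

checks = {
    "object type": is_object_dtype(object),
    "int ndarray": is_object_dtype(np.array([1, 2])),
    "object Series": is_object_dtype(pd.Series(["a", 1], dtype=object)),
    "float Series": is_object_dtype(pd.Series([1.0, 2.0])),
}
for name, result in checks.items():
    print(name, result)
```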

Kaggle

kaggle.com › questions-and-answers › 215448

What exactly an "object" dtype refers to?


Note.nkmk.me

note.nkmk.me › home › python › pandas

pandas: How to use astype() to cast dtype of DataFrame | note.nkmk.me

August 9, 2023 - pandas.Series has a single data type (dtype), while pandas.DataFrame can have a different data type for each column. You can specify dtype in various contexts, such as when creating a new object using a constructor or when reading from a CSV file.


Pandas

pandas.pydata.org › docs › reference › api › pandas.DataFrame.astype.html

pandas.DataFrame.astype — pandas 3.0.2 documentation

It supports casting entire objects to a single data type or applying different data types to individual columns using a mapping. ... Use a str, numpy.dtype, pandas.ExtensionDtype or Python type to cast entire pandas object to the same type. Alternatively, use a mapping, e.g. {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
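
Both calling styles described above can be sketched as follows. The column names and values are hypothetical, and the columns start as object dtype holding numeric text so that the cast has a visible effect.

```python
import pandas as pd

df = pd.DataFrame({
    "qty": pd.Series(["1", "2"], dtype=object),
    "price": pd.Series(["1.5", "2.5"], dtype=object),
})

# Column-specific casts via a mapping {column label: dtype}.
converted = df.astype({"qty": "int64", "price": "float64"})

# One dtype applied to the entire DataFrame.
all_float = converted.astype("float64")

print(converted.dtypes)
print(all_float.dtypes)
```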

reddit.com › r/kaggle › [question] what is the significance of dtype == 'object'?

r/kaggle on Reddit: [Question] What is the significance of dtype == 'object'?

August 11, 2022 -

Following a Kaggle tutorial where the data set is the melbourne housing data.

I keep seeing this:

categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

I understand that we're concerned about columns that have data with low cardinality. I'm confused why we care that the dtype == 'object'. Why does this matter? How does the dtype improve our ability to predict pricing?

Top answer

1 of 3

You can use that to filter only string columns

2 of 3

Have you tried removing it and seeing if that changes your results?
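
To make the first answer concrete, here is a hedged sketch of the same filter on a tiny invented frame. The column names, values, and the threshold `max_unique` are assumptions for this illustration, not the Melbourne housing data. It shows that a numeric column can also have low cardinality, and the dtype check is what keeps it out of the categorical list.

```python
import pandas as pd

X = pd.DataFrame({
    "Type": pd.Series(["h", "u", "h", "h"], dtype=object),        # 2 unique values
    "Address": pd.Series(["1 A St", "2 B St", "3 C St", "4 D St"],
                         dtype=object),                            # 4 unique values
    "Rooms": [2, 3, 3, 4],                                         # 3 unique, numeric
})

max_unique = 4  # hypothetical cardinality threshold (the tutorial uses 10)
categorical_cols = [c for c in X.columns
                    if X[c].nunique() < max_unique and X[c].dtype == "object"]
low_cardinality = [c for c in X.columns if X[c].nunique() < max_unique]

print(categorical_cols)  # ['Type']
print(low_cardinality)   # ['Type', 'Rooms']
```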

Kaggle

kaggle.com › general › 188478

What is the difference between Pandas Object & String dtype


Statology

statology.org › home › the complete guide to pandas dtypes

The Complete Guide to Pandas dtypes

April 11, 2024 - From the output we can see that the assists column is an integer data type while the minutes column is a floating point data type. This should make sense considering the minutes column has decimal values to represent the fraction of minutes that particular athletes can play in a game. Lastly, we can use the following syntax to display the data type of each column in the pandas DataFrame:

#display data type of each column in DataFrame
df.dtypes

team         object
points        int64
assists       int64
minutes     float64
all_star       bool
dtype: object

Pandas

pandas.pydata.org › docs › reference › arrays.html

pandas arrays, scalars, and data types — pandas 3.0.1 documentation

The .dtype of a arrays.ArrowExtensionArray is an ArrowDtype. Pyarrow provides similar array and data type support as NumPy including first-class nullability support for all data types, immutability and more. The table below shows the equivalent pyarrow-backed (pa), pandas extension, and numpy (np) types that are recognized by pandas.

APXML

apxml.com › courses › intro-eda-course › chapter-2-data-loading-inspection-cleaning › understanding-data-types

Pandas Data Types (dtypes) Explained

Pandas provides straightforward ways to check the dtypes of your columns. Using the .dtypes attribute: This attribute returns a Series where the index is the column name and the value is the data type of that column. # Assuming 'df' is your DataFrame print(df.dtypes) ...

Stack Overflow

stackoverflow.com › questions › 29245848 › what-are-all-the-dtypes-that-pandas-recognizes

python - what are all the dtypes that pandas recognizes? - Stack Overflow

Top answer

1 of 3

pandas borrows its dtypes from numpy. For demonstration of this see the following:

import pandas as pd

df = pd.DataFrame({'A': [1,'C',2.]})
df['A'].dtype

>>> dtype('O')

type(df['A'].dtype)

>>> numpy.dtype

You can find the list of valid numpy.dtypes in the documentation:

'?' boolean

'b' (signed) byte

'B' unsigned byte

'i' (signed) integer

'u' unsigned integer

'f' floating-point

'c' complex-floating point

'm' timedelta

'M' datetime

'O' (Python) objects

'S', 'a' zero-terminated bytes (not recommended)

'U' Unicode string

'V' raw data (void)

pandas should support these types. Using the astype method of a pandas.Series object with any of the above options as the input argument will result in pandas trying to convert the Series to that type (or at the very least falling back to object type); 'u' is the only one that I see pandas not understanding at all:

df['A'].astype('u')

>>> TypeError: data type "u" not understood

This is a numpy error that results because the 'u' needs to be followed by a number specifying the number of bytes per item (which needs to be valid):

import numpy as np

np.dtype('u')

>>> TypeError: data type "u" not understood

np.dtype('u1')

>>> dtype('uint8')

np.dtype('u2')

>>> dtype('uint16')

np.dtype('u4')

>>> dtype('uint32')

np.dtype('u8')

>>> dtype('uint64')

# testing another invalid argument
np.dtype('u3')

>>> TypeError: data type "u3" not understood

To summarise, the astype methods of pandas objects will try and do something sensible with any argument that is valid for numpy.dtype. Note that numpy.dtype('f') is the same as numpy.dtype('float32') and numpy.dtype('f8') is the same as numpy.dtype('float64') etc. Same goes for passing the arguments to pandas astype methods.
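
A brief sketch of that summary, passing NumPy type strings straight to `Series.astype`. The Series contents are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

as_f = s.astype("f")     # same as 'float32'
as_f8 = s.astype("f8")   # same as 'float64'
as_u1 = s.astype("u1")   # same as 'uint8'

print(as_f.dtype, as_f8.dtype, as_u1.dtype)   # float32 float64 uint8
print(np.dtype("f") == np.dtype("float32"))   # True
print(np.dtype("f8") == np.dtype("float64"))  # True
```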

To locate the respective data type classes in NumPy, the Pandas docs recommend this:

def subdtypes(dtype):
    subs = dtype.__subclasses__()
    if not subs:
        return dtype
    return [dtype, [subdtypes(dt) for dt in subs]]

subdtypes(np.generic)

Output:

[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.int32,
        numpy.int64,
        numpy.int64,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8,
        numpy.uint16,
        numpy.uint32,
        numpy.uint64,
        numpy.uint64]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]

Pandas accepts these classes as valid types. For example, dtype={'A': np.float64} (the np.float alias used in older examples was removed in NumPy 1.24).

NumPy docs contain more details and a chart:

2 of 3

EDIT Feb 2020 following pandas 1.0.0 release

Pandas mostly uses NumPy arrays and dtypes for each Series (a dataframe is a collection of Series, each which can have its own dtype). NumPy's documentation further explains dtype, data types, and data type objects. In addition, the answer provided by @lcameron05 provides an excellent description of the numpy dtypes. Furthermore, the pandas docs on dtypes have a lot of additional information.

The main types stored in pandas objects are float, int, bool, datetime64[ns], timedelta64[ns], and object. In addition, these dtypes have item sizes, e.g. int64 and int32.

By default integer types are int64 and float types are float64, REGARDLESS of platform (32-bit or 64-bit). The following will all result in int64 dtypes.
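
The examples this paragraph refers to did not survive in the text, so the following is only a sketch of the kind of constructor calls that behave this way; the exact calls are an assumption, not a quotation from the answer.

```python
import pandas as pd

# Three ways of building a frame from Python integers.
frames = [
    pd.DataFrame([1, 2], columns=["a"]),
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame({"a": 1}, index=list(range(2))),
]

for f in frames:
    print(f.dtypes["a"])  # int64 each time
```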

NumPy, however, will choose platform-dependent types when creating arrays. The following WILL result in int32 on a 32-bit platform.

One of the major changes in version 1.0.0 of pandas is the introduction of pd.NA to represent scalar missing values (rather than the previous values of np.nan, pd.NaT or None, depending on usage).

Pandas extends NumPy's type system and also allows users to write their own extension types. The following lists all of pandas' extension types.

1) Time zone handling

Kind of data: tz-aware datetime (note that NumPy does not support timezone-aware datetimes).

Data type: DatetimeTZDtype

Scalar: Timestamp

Array: arrays.DatetimeArray

String Aliases: 'datetime64[ns, <tz>]'

2) Categorical data

Kind of data: Categorical

Data type: CategoricalDtype

Scalar: (none)

Array: Categorical

String Aliases: 'category'

3) Time span representation

Kind of data: period (time spans)

Data type: PeriodDtype

Scalar: Period

Array: arrays.PeriodArray

String Aliases: 'period[<freq>]', 'Period[<freq>]'

4) Sparse data structures

Kind of data: sparse

Data type: SparseDtype

Scalar: (none)

Array: arrays.SparseArray

String Aliases: 'Sparse', 'Sparse[int]', 'Sparse[float]'

5) IntervalIndex

Kind of data: intervals

Data type: IntervalDtype

Scalar: Interval

Array: arrays.IntervalArray

String Aliases: 'interval', 'Interval', 'Interval[<numpy_dtype>]', 'Interval[datetime64[ns, <tz>]]', 'Interval[timedelta64[<freq>]]'

6) Nullable integer data type

Kind of data: nullable integer

Data type: Int64Dtype, ...

Scalar: (none)

Array: arrays.IntegerArray

String Aliases: 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64'

7) Working with text data

Kind of data: Strings

Data type: StringDtype

Scalar: str

Array: arrays.StringArray

String Aliases: 'string'

8) Boolean data with missing values

Kind of data: Boolean (with NA)

Data type: BooleanDtype

Scalar: bool

Array: arrays.BooleanArray

String Aliases: 'boolean'
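
A few of the string aliases listed above can be tried out as follows. This is a sketch with invented values; it covers only four of the eight extension types and simply reports which dtype class each alias resolves to.

```python
import pandas as pd

examples = {
    "category": pd.Series(["a", "b", "a"], dtype="category"),
    "Int64": pd.Series([1, None, 3], dtype="Int64"),
    "string": pd.Series(["x", None], dtype="string"),
    "boolean": pd.Series([True, None], dtype="boolean"),
}

for alias, s in examples.items():
    # Class name of the dtype object and the number of missing entries.
    print(alias, type(s.dtype).__name__, int(s.isna().sum()))
```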

reddit.com › r/dfpandas › dtype differs between pandas series and element therein

r/dfpandas on Reddit: dtype differs between pandas Series and element therein

May 2, 2024 -

I am following this guide on working with text data types. From there, I cobbled the following:

import numpy as np
import pandas as pd

# "Int64" dtype for both series and element therein
#--------------------------------------------------
s1 = pd.Series([1, 2, np.nan], dtype="Int64")
s1

   0       1
   1       2
   2    <NA>
   dtype: Int64

type(s1[0])

   numpy.int64

# "string" dtype for series vs. "str" dtype for element therein
#--------------------------------------------------------------
s2 = s1.astype("string")
s2

   Out[13]:
   0       1
   1       2
   2    <NA>
   dtype: string

type(s2[0])

   str

For Int64 series s1, the series type matches the type of the element therein (other than inconsistent case).

For string series s2, the elements therein are of a completely different type, str. From web browsing, I know that str is the native Python string type while string is the pandas string type. My web browsing further indicates that the pandas string type is the native Python string type (as opposed to the fixed-length mutable string type of NumPy).

In that case, why is there a different name (string vs. str) and why do the names differ in the last two lines of output above? My (possibly wrong) understanding is that the dtype shown for a series reflects the type of the elements therein.

Top answer

1 of 4

Did your "further web browsing" take you to the pandas documentation? https://pandas.pydata.org/docs/user_guide/text.html#string-methods

2 of 4

My bad you literally linked to it, so what exactly don't you understand?
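
The distinction the question circles around can be sketched like this: a Series dtype describes how the column as a whole is stored, while indexing returns an individual scalar whose Python type has its own name. The code mirrors the question's setup, with `.iloc` used for positional access; it is an illustration, not an answer taken from the thread.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, None], dtype="Int64")
s2 = s1.astype("string")

first_int = s1.iloc[0]
first_str = s2.iloc[0]

print(type(s1.dtype).__name__, type(first_int).__name__)  # Int64Dtype int64
print(type(s2.dtype).__name__, type(first_str).__name__)  # StringDtype str
print(s1.iloc[2] is pd.NA)  # True: the missing entry is pd.NA, not an int
```

In both cases the dtype (Int64Dtype, StringDtype) is a pandas extension type, and the scalars that come out (numpy.int64, str) are different classes; the capitalised "Int64" and numpy's int64 only look alike.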

TutorialsPoint

tutorialspoint.com › how-do-stringdtype-objects-differ-from-object-dtype-in-python-pandas

How do StringDtype objects differ from object dtype in Python Pandas?

So it concludes, the dtype object doesn’t store only text data, it is a mixture of all data. Here define pd.StringDtype() explicitly to the dtype parameter of the pandas series method.

w3resource

w3resource.com › pandas › dataframe › dataframe-dtypes.php

Pandas DataFrame property: dtypes - w3resource

August 19, 2022 - This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype. ... Returns: pandas.Series The data type of each column.

Pandas

pandas.pydata.org › docs › reference › api › pandas.Series.dtype.html

pandas.Series.dtype — pandas 3.0.2 documentation - PyData |

Return the dtype object of the underlying data. ... Cast a pandas object to a specified dtype dtype.

Codegive

codegive.com › blog › pandas_object_dtype.php

Pandas Object Dtype (2024): Master Its Secrets & Unlock Peak Data Performance

While you might be familiar with ... The object dtype is a generic, flexible container that pandas uses when it cannot infer a more specific data type, or when a column contains heterogeneous data that doesn't fit a single numerical or boolean type....

Quora

quora.com › What-is-the-object-data-type-in-Pandas

What is the object data type in Pandas? - Quora

Answer: Whenever Pandas does not recognize the data type as one of the small handful of datatypes it can deal with (int, float, string, boolean, …), it just sets the datatype to “object” — that’s a safe bet, since pretty much everything is an object, in Python.
