You could use a function like this:
def nan_ints(df, convert_strings=False, subset=None):
    # dtypes that are candidates for conversion to the nullable Int64 type
    types = ["int64", "float64"]
    if subset is None:
        subset = list(df)
    if convert_strings:
        types.append("object")
    for col in subset:
        if df[col].dtype in types:
            # errors="ignore" leaves the column unchanged if it cannot be
            # converted (e.g. floats with a real fractional part)
            df[col] = (
                df[col].astype(float, errors="ignore").astype("Int64", errors="ignore")
            )
    return df
It iterates through each column and converts it to Int64 if it is an int. If it's a float, it converts to Int64 only if all of the values in the column, other than the NaNs, can be converted to ints. The convert_strings argument gives you the option to convert string columns to Int64 as well.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.1, 2, 3, 1],
                    'b': [1, 2, 3, np.nan],
                    'c': ['1', '2', '3', np.nan],
                    'd': [3, 2, 1, np.nan]})
nan_ints(df1, convert_strings=True, subset=['b', 'c'])
df1.info()
This prints the following:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
a 4 non-null float64
b 3 non-null Int64
c 3 non-null Int64
d 3 non-null float64
dtypes: Int64(2), float64(2)
memory usage: 216.0 bytes
If you are going to use this on every DataFrame, you could add the function to a module and import it every time you want to use pandas.
from my_module import nan_ints
Then just use it with something like:
nan_ints(pd.read_csv(path))
Note: the nullable integer data type is new in pandas version 0.24.0; see the pandas documentation on nullable integer types for details.
Answer from braintho on Stack Overflow to the question "Making Int64 the default integer dtype instead of standard int64 in pandas".
A few related questions come up around the same topic:
What's up with uint64 in numpy?
What is the difference between native int type and the numpy.int types?
Keep getting TypeError: data type 'Int64' not understood
When should I use float32 instead of float64?
What happens when you mix dtypes in an operation?
For making Int64 the default outright, I would put my money on monkey patching. The easiest way would be to monkey patch the DataFrame constructor. That should go something like this:
import pandas

# keep a reference to the original constructor
pandas.DataFrame.__old__init__ = pandas.DataFrame.__init__

def new_init(self, data=None, index=None, columns=None, dtype=pandas.Int64Dtype(), copy=False):
    # same as the original, but the default dtype is now nullable Int64
    self.__old__init__(data=data, index=index, columns=columns, dtype=dtype, copy=copy)

pandas.DataFrame.__init__ = new_init
Of course, you run the risk of breaking the world. Good luck!
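As a rough sanity check (my own sketch, not part of the original answer), a DataFrame built after the patch should come out with the nullable Int64 dtype, at least for integer-like input:

import pandas

df = pandas.DataFrame({"a": [1, 2, None]})
print(df.dtypes)   # a    Int64 -- the patched default, with None stored as <NA>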
The other day I was writing some code using Numpy uint64 arrays and -- this isn't a joke -- I encountered two bugs with uint64 arrays, one of which has been known for TEN YEARS. Consider the following scenario:
import numpy as np

a = np.ones(10, dtype=np.uint64)
a[1] <<= 1
What do you think happens? If you guessed "the second element of a is left shifted from 1 to 2," you'd be wrong! What actually happens is that your program crashes because << isn't implemented for the pair np.uint64, int.
So that's a critical bug that has not been fixed in 10 years, which is utterly buck wild. But at least it's kind of understandable how such a bug could come to be: <<= is probably implemented on top of << in this case, and << probably isn't constrained by its output type, so it wouldn't be unreasonable for np.uint64 << int to return an int. Of course, this is being far too charitable for a number of reasons: np.uint64 << int does not in fact return an int since it crashes, we unequivocally know what the output type should be because we're using a compound assignment operator, and this bug has been known for ten years.
The next bug might be more recent, but I suspect both of these bugs have been present from the get go. This bug is also much more shocking. Consider the following:
np.uint64(1) + 1
What do you think the answer is? Remember, numpy wraps the C type system and in C, integers are promoted to the integer type with higher precision in arithmetic contexts. So we might expect this to yield 2 if numpy copies the C rules, or we might also reasonably expect it to yield np.uint64(2) because even though np.uint64 + int may overflow sometimes, this is usually what users will want.
Dear reader, numpy does neither of these things. It instead does something that is objectively wrong and no sane person would ever defend: it returns np.float64. This is completely bananas. 64 bit floating point numbers cannot represent all 64 bit integers exactly because they only have 53 bits of precision, with the rest used for sign and exponent. For this reason, languages tend to not randumbly convert ints into floats because it destroys information and forces the use of a slower type. The main exception to this is division, but numpy does this for addition and frankly as a C wrapper it should have the C behavior for dividing two integers.
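A quick way to see that precision limit (my own one-liner, plain Python):

print(float(2**53) == float(2**53 + 1))   # True: the +1 is lost to float64 rounding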
It's just mind-boggling to me that numpy, a project with 29619 commits, 1305 contributors, and millions of users, and which is a de facto part of the python standard library, could not only have fundamental bugs with how it handles wrapping C integer types, but that these bugs could be known for over a decade and not just remain unfixed but implicitly be something that will never be fixed. These bugs pretty much make np.uint64 unusable. While you can work around them by wrapping the other operand in a constructor call, np.uint64(b), this is pretty brittle, and it's easy to envision a scenario in which a custom function that expects a python int is given a np.uint64 and produces a floating-point number or crashes.
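A sketch of that workaround: the exact promotion rules depend on your NumPy version, but keeping both operands uint64 sidesteps the issue entirely.

import numpy as np

a = np.ones(10, dtype=np.uint64)
a[1] = a[1] << np.uint64(1)           # stays uint64; a[1] is now 2
total = np.uint64(1) + np.uint64(1)   # np.uint64(2), not np.float64(2.0)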
There are several major differences. The first is that python integers are flexible-sized (at least in python 3.x). This means they can grow to accommodate any number of any size (within memory constraints, of course). The numpy integers, on the other hand, are fixed-sized. This means there is a maximum value they can hold. This is defined by the number of bytes in the integer (int32 vs. int64), with more bytes holding larger numbers, as well as whether the number is signed or unsigned (int32 vs. uint32), with unsigned being able to hold larger numbers but not able to hold negative numbers.
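To make those size limits concrete (my own illustration, not from the original answer):

import numpy as np

print(np.iinfo(np.int32).max)    # 2147483647: a fixed-size int has a hard ceiling
print(np.iinfo(np.uint32).max)   # 4294967295: unsigned trades negatives for extra range
print(2 ** 200)                  # a Python 3 int just keeps growing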
So, you might ask, why use the fixed-sized integers? The reason is that modern processors have built-in instructions for doing math on fixed-size integers, so calculations on those are much, much, much faster. In fact, CPython keeps small integers in a compact single-"digit" form with cheap fast paths, and only pays the full cost of arbitrary-precision arithmetic once numbers get large.
Another advantage of fixed-sized values is that they can be placed into consistently-sized, adjacent memory blocks of the same type. This is the format that numpy arrays use to store data. The libraries that numpy relies on are able to do extremely fast computations on data in this format; in fact, modern CPUs have built-in features for accelerating this sort of computation. With the variable-sized python integers, this sort of computation is impossible because there is no way to say how big the blocks should be and no consistency in the data format.
That being said, numpy is actually able to make arrays of python integers. But rather than containing the values themselves, these are arrays of references to other pieces of memory holding the actual python integers. This cannot be accelerated in the same way, so even if all the python integers fit within the fixed integer size, it still won't be accelerated.
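A quick illustration of the difference (mine, not the answer's):

import numpy as np

a = np.arange(1_000_000, dtype=np.int64)   # one contiguous buffer of 8-byte values
b = a.astype(object)                       # an array of references to Python ints

print(a.dtype, a.nbytes)   # int64 8000000 -- the values themselves live in the buffer
print(b.dtype)             # object -- the buffer only holds pointers
# a + 1 runs as a single vectorized loop in C; b + 1 has to call Python-level
# arithmetic on every element, so it is far slower.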
None of this is the case with Python 2. In Python 2, Python integers were fixed-size and thus could be directly translated into numpy integers; for variable-length integers, Python 2 had the separate long type. But this was confusing, and it was decided the confusion wasn't worth the performance gains, especially when people who need performance would be using numpy or something like it anyway.
Another way to look at the differences is to ask what methods the two kinds of objects have.
In IPython I can use tab completion to look at methods:
In [1277]: x=123; y=np.int32(123)
int methods and attributes:
In [1278]: x.<tab>
x.bit_length x.denominator x.imag x.numerator x.to_bytes
x.conjugate x.from_bytes x.real
int 'operators'
In [1278]: x.__<tab>
x.__abs__ x.__init__ x.__rlshift__
x.__add__ x.__int__ x.__rmod__
x.__and__ x.__invert__ x.__rmul__
x.__bool__ x.__le__ x.__ror__
...
x.__gt__ x.__reduce_ex__ x.__xor__
x.__hash__ x.__repr__
x.__index__ x.__rfloordiv__
np.int32 methods and attributes (or properties). Some of the same, but a lot more, basically all the ndarray ones:
In [1278]: y.<tab>
y.T y.denominator y.ndim y.size
y.all y.diagonal y.newbyteorder y.sort
y.any y.dtype y.nonzero y.squeeze
...
y.cumsum y.min y.setflags
y.data y.nbytes y.shape
The y.__ methods look a lot like the int ones. They can do the same math.
In [1278]: y.__<tab>
y.__abs__ y.__getitem__ y.__reduce_ex__
y.__add__ y.__gt__ y.__repr__
...
y.__format__ y.__rand__ y.__subclasshook__
y.__ge__ y.__rdivmod__ y.__truediv__
y.__getattribute__ y.__reduce__ y.__xor__
y is in many ways the same as a 0d array. Not identical, but close.
In [1281]: z=np.array(123,dtype=np.int32)
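For instance (my own check, not from the original transcript), the scalar and the 0d array report the same shape, ndim, and dtype:

import numpy as np

y = np.int32(123)
z = np.array(123, dtype=np.int32)

print(y.shape, y.ndim, y.dtype)   # () 0 int32
print(z.shape, z.ndim, z.dtype)   # () 0 int32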
np.int32 is what I get when I index an array of that type:
In [1300]: A=np.array([0,123,3])
In [1301]: A[1]
Out[1301]: 123
In [1302]: type(A[1])
Out[1302]: numpy.int32
I have to use item() to remove all of the numpy wrapping.
In [1303]: type(A[1].item())
Out[1303]: int
To a numpy user, an np.int32 is an int with a numpy wrapper, or conversely a single element of an ndarray. Usually I don't pay attention to whether A[0] is giving me the 'native' int or the numpy equivalent. In contrast to some new users, I rarely use np.int32(123); I would use np.array(123) instead.
A = np.array([1,123,0], np.int32)
does not contain 3 np.int32 objects. Rather, its data buffer is 3*4=12 bytes long. It's the array overhead that interprets it as 3 ints in a 1d array. And view shows me the same data buffer with different interpretations:
In [1307]: A.view(np.int16)
Out[1307]: array([ 1, 0, 123, 0, 0, 0], dtype=int16)
In [1310]: A.view('S4')
Out[1310]: array([b'\x01', b'{', b''], dtype='|S4')
It's only when I index a single element that I get an np.int32 object.
The list L=[1, 123, 0] is different; it's a list of pointers - pointers to int objects elsewhere in memory. Similarly for a dtype=object array.
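A small sketch of that difference (my own, not from the answer): indexing an object-dtype array hands back the original Python int, while indexing a fixed-dtype array hands back a numpy scalar:

import numpy as np

L = [1, 123, 0]
obj = np.array(L, dtype=object)       # buffer of pointers to the Python ints
fixed = np.array(L, dtype=np.int32)   # buffer of the 4-byte values themselves

print(type(obj[1]), type(fixed[1]))   # <class 'int'> <class 'numpy.int32'>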
I have a rather simple dataframe I want to pass to PyCaret.
Datetime as index, then TOTAL, which is a float64 according to df.dtypes.
from pycaret.time_series import *

# init setup
s = setup(df, target="TOTAL", fh=12, session_id=123)
The traceback points to this part:
def _is_nullable_numeric(dtype):
----> 9 return dtype in ["Int64", "Float64", "boolean"]
TypeError: data type 'Int64' not understood
I assume it's related to Numpy. I have tried to set the dtype with .astype() etc., but I am still getting the same error. I have tried setting TOTAL as a float and as int32; still the same issue.
What gives?
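For what it's worth, the line PyCaret points at compares a column's dtype against pandas extension dtype names ("Int64", "Float64", "boolean"). If the dtype being tested is a plain numpy dtype, that comparison by itself can raise exactly this error on some NumPy versions, because "Int64" is not a dtype name numpy understands. A minimal reproduction under that assumption (my reading of the traceback, not a confirmed diagnosis):

import numpy as np

dt = np.dtype("float64")                  # e.g. the dtype reported for TOTAL
dt in ["Int64", "Float64", "boolean"]     # may raise: data type 'Int64' not understood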