You could use a function like this:
def nan_ints(df, convert_strings=False, subset=None):
    # dtypes that are candidates for conversion to the nullable Int64 type
    types = ["int64", "float64"]
    if subset is None:
        subset = list(df)
    if convert_strings:
        types.append("object")
    for col in subset:
        if df[col].dtype in types:
            # errors="ignore" leaves the column unchanged if it cannot be
            # converted (e.g. floats with a real fractional part)
            df[col] = (
                df[col].astype(float, errors="ignore").astype("Int64", errors="ignore")
            )
    return df
It iterates through each column and converts it to Int64 if it is an int. If it's a float, it converts to Int64 only if all of the values in the column, other than the NaNs, can be converted to ints. The convert_strings argument gives you the option to convert string columns to Int64 as well.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.1, 2, 3, 1],
                    'b': [1, 2, 3, np.nan],
                    'c': ['1', '2', '3', np.nan],
                    'd': [3, 2, 1, np.nan]})
nan_ints(df1, convert_strings=True, subset=['b', 'c'])
df1.info()
This prints the following:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
a 4 non-null float64
b 3 non-null Int64
c 3 non-null Int64
d 3 non-null float64
dtypes: Int64(2), float64(2)
memory usage: 216.0 bytes
If you are going to use this on every DataFrame, you could add the function to a module and import it every time you want to use pandas.
from my_module import nan_ints
Then just use it with something like:
nan_ints(pd.read_csv(path))
Note: the nullable integer data type is new in pandas version 0.24.0; see the pandas documentation on nullable integer types for details.
Answer from braintho on Stack Overflow to the question "Making Int64 the default integer dtype instead of standard int64 in pandas".
A few related questions come up around the same topic:
What's up with uint64 in numpy?
What is the difference between native int type and the numpy.int types?
Keep getting TypeError: data type 'Int64' not understood
When should I use float32 instead of float64?
What happens when you mix dtypes in an operation?
For making Int64 the default outright, I would put my money on monkey patching. The easiest way would be to monkey patch the DataFrame constructor. That should go something like this:
import pandas

# keep a reference to the original constructor
pandas.DataFrame.__old__init__ = pandas.DataFrame.__init__

def new_init(self, data=None, index=None, columns=None, dtype=pandas.Int64Dtype(), copy=False):
    # same as the original, but the default dtype is now nullable Int64
    self.__old__init__(data=data, index=index, columns=columns, dtype=dtype, copy=copy)

pandas.DataFrame.__init__ = new_init
Of course, you run the risk of breaking the world. Good luck!
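As a rough sanity check (my own sketch, not part of the original answer), a DataFrame built after the patch should come out with the nullable Int64 dtype, at least for integer-like input:

import pandas

df = pandas.DataFrame({"a": [1, 2, None]})
print(df.dtypes)   # a    Int64 -- the patched default, with None stored as <NA>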
The other day I was writing some code using Numpy uint64 arrays and -- this isn't a joke -- I encountered two bugs with uint64 arrays, one of which has been known for TEN YEARS. Consider the following scenario:
import numpy as np

a = np.ones(10, dtype=np.uint64)
a[1] <<= 1
What do you think happens? If you guessed "the second element of a is left shifted from 1 to 2," you'd be wrong! What actually happens is that your program crashes because << isn't implemented for the pair np.uint64, int.
So that's a critical bug that has not been fixed in 10 years, which is utterly buck wild. But at least it's kind of understandable how such a bug could come to be: <<= is probably implemented on top of << in this case, and << probably isn't constrained by its output type, so it wouldn't be unreasonable for np.uint64 << int to return an int. Of course, this is being far too charitable for a number of reasons: np.uint64 << int does not in fact return an int since it crashes, we unequivocally know what the output type should be because we're using a compound assignment operator, and this bug has been known for ten years.
The next bug might be more recent, but I suspect both of these bugs have been present from the get go. This bug is also much more shocking. Consider the following:
np.uint64(1) + 1
What do you think the answer is? Remember, numpy wraps the C type system and in C, integers are promoted to the integer type with higher precision in arithmetic contexts. So we might expect this to yield 2 if numpy copies the C rules, or we might also reasonably expect it to yield np.uint64(2) because even though np.uint64 + int may overflow sometimes, this is usually what users will want.
Dear reader, numpy does neither of these things. It instead does something that is objectively wrong and no sane person would ever defend: it returns np.float64. This is completely bananas. 64 bit floating point numbers cannot represent all 64 bit integers exactly because they only have 53 bits of precision, with the rest used for sign and exponent. For this reason, languages tend to not randumbly convert ints into floats because it destroys information and forces the use of a slower type. The main exception to this is division, but numpy does this for addition and frankly as a C wrapper it should have the C behavior for dividing two integers.
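A quick way to see that precision limit (my own one-liner, plain Python):

print(float(2**53) == float(2**53 + 1))   # True: the +1 is lost to float64 rounding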
It's just mind-boggling to me that numpy, a project with 29619 commits, 1305 contributors, and millions of users, and which is a de facto part of the python standard library, could not only have fundamental bugs with how it handles wrapping C integer types, but that these bugs could be known for over a decade and not just remain unfixed but implicitly be something that will never be fixed. These bugs pretty much make np.uint64 unusable. While you can work around them by wrapping the other operand in a constructor call, np.uint64(b), this is pretty brittle, and it's easy to envision a scenario in which a custom function that expects a python int is given a np.uint64 and produces a floating-point number or crashes.
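A sketch of that workaround: the exact promotion rules depend on your NumPy version, but keeping both operands uint64 sidesteps the issue entirely.

import numpy as np

a = np.ones(10, dtype=np.uint64)
a[1] = a[1] << np.uint64(1)           # stays uint64; a[1] is now 2
total = np.uint64(1) + np.uint64(1)   # np.uint64(2), not np.float64(2.0)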
There are several major differences. The first is that python integers are flexible-sized (at least in python 3.x). This means they can grow to accommodate any number of any size (within memory constraints, of course). The numpy integers, on the other hand, are fixed-sized. This means there is a maximum value they can hold. This is defined by the number of bytes in the integer (int32 vs. int64), with more bytes holding larger numbers, as well as whether the number is signed or unsigned (int32 vs. uint32), with unsigned being able to hold larger numbers but not able to hold negative numbers.
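To make those size limits concrete (my own illustration, not from the original answer):

import numpy as np

print(np.iinfo(np.int32).max)    # 2147483647: a fixed-size int has a hard ceiling
print(np.iinfo(np.uint32).max)   # 4294967295: unsigned trades negatives for extra range
print(2 ** 200)                  # a Python 3 int just keeps growing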
So, you might ask, why use the fixed-sized integers? The reason is that modern processors have built-in instructions for doing math on fixed-size integers, so calculations on those are much, much, much faster. In fact, CPython keeps small integers in a compact single-"digit" form with cheap fast paths, and only pays the full cost of arbitrary-precision arithmetic once numbers get large.
Another advantage of fixed-sized values is that they can be placed into consistently-sized, adjacent memory blocks of the same type. This is the format that numpy arrays use to store data. The libraries that numpy relies on are able to do extremely fast computations on data in this format; in fact, modern CPUs have built-in features for accelerating this sort of computation. With the variable-sized python integers, this sort of computation is impossible because there is no way to say how big the blocks should be and no consistency in the data format.
That being said, numpy is actually able to make arrays of python integers. But rather than containing the values themselves, these are arrays of references to other pieces of memory holding the actual python integers. This cannot be accelerated in the same way, so even if all the python integers fit within the fixed integer size, it still won't be accelerated.
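A quick illustration of the difference (mine, not the answer's):

import numpy as np

a = np.arange(1_000_000, dtype=np.int64)   # one contiguous buffer of 8-byte values
b = a.astype(object)                       # an array of references to Python ints

print(a.dtype, a.nbytes)   # int64 8000000 -- the values themselves live in the buffer
print(b.dtype)             # object -- the buffer only holds pointers
# a + 1 runs as a single vectorized loop in C; b + 1 has to call Python-level
# arithmetic on every element, so it is far slower.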
None of this is the case with Python 2. In Python 2, Python integers were fixed-size and thus could be directly translated into numpy integers; for variable-length integers, Python 2 had the separate long type. But this was confusing, and it was decided the confusion wasn't worth the performance gains, especially when people who need performance would be using numpy or something like it anyway.
Another way to look at the differences is to ask what methods the two kinds of objects have.
In IPython I can use tab completion to look at methods:
In [1277]: x=123; y=np.int32(123)
int methods and attributes:
In [1278]: x.<tab>
x.bit_length x.denominator x.imag x.numerator x.to_bytes
x.conjugate x.from_bytes x.real
int 'operators'
In [1278]: x.__<tab>
x.__abs__ x.__init__ x.__rlshift__
x.__add__ x.__int__ x.__rmod__
x.__and__ x.__invert__ x.__rmul__
x.__bool__ x.__le__ x.__ror__
...
x.__gt__ x.__reduce_ex__ x.__xor__
x.__hash__ x.__repr__
x.__index__ x.__rfloordiv__
np.int32 methods and attributes (or properties). Some of the same, but a lot more, basically all the ndarray ones:
In [1278]: y.<tab>
y.T y.denominator y.ndim y.size
y.all y.diagonal y.newbyteorder y.sort
y.any y.dtype y.nonzero y.squeeze
...
y.cumsum y.min y.setflags
y.data y.nbytes y.shape
The y.__ methods look a lot like the int ones. They can do the same math.
In [1278]: y.__<tab>
y.__abs__ y.__getitem__ y.__reduce_ex__
y.__add__ y.__gt__ y.__repr__
...
y.__format__ y.__rand__ y.__subclasshook__
y.__ge__ y.__rdivmod__ y.__truediv__
y.__getattribute__ y.__reduce__ y.__xor__
y is in many ways the same as a 0d array. Not identical, but close.
In [1281]: z=np.array(123,dtype=np.int32)
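For instance (my own check, not from the original transcript), the scalar and the 0d array report the same shape, ndim, and dtype:

import numpy as np

y = np.int32(123)
z = np.array(123, dtype=np.int32)

print(y.shape, y.ndim, y.dtype)   # () 0 int32
print(z.shape, z.ndim, z.dtype)   # () 0 int32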
np.int32 is what I get when I index an array of that type:
In [1300]: A=np.array([0,123,3])
In [1301]: A[1]
Out[1301]: 123
In [1302]: type(A[1])
Out[1302]: numpy.int32
I have to use item() to remove all of the numpy wrapping.
In [1303]: type(A[1].item())
Out[1303]: int
To a numpy user, an np.int32 is an int with a numpy wrapper, or conversely a single element of an ndarray. Usually I don't pay attention to whether A[0] is giving me the 'native' int or the numpy equivalent. In contrast to some new users, I rarely use np.int32(123); I would use np.array(123) instead.
A = np.array([1,123,0], np.int32)
does not contain 3 np.int32 objects. Rather, its data buffer is 3*4=12 bytes long. It's the array overhead that interprets it as 3 ints in a 1d array. And view shows me the same data buffer with different interpretations:
In [1307]: A.view(np.int16)
Out[1307]: array([ 1, 0, 123, 0, 0, 0], dtype=int16)
In [1310]: A.view('S4')
Out[1310]: array([b'\x01', b'{', b''], dtype='|S4')
It's only when I index a single element that I get an np.int32 object.
The list L=[1, 123, 0] is different; it's a list of pointers - pointers to int objects elsewhere in memory. Similarly for a dtype=object array.
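A small sketch of that difference (my own, not from the answer): indexing an object-dtype array hands back the original Python int, while indexing a fixed-dtype array hands back a numpy scalar:

import numpy as np

L = [1, 123, 0]
obj = np.array(L, dtype=object)       # buffer of pointers to the Python ints
fixed = np.array(L, dtype=np.int32)   # buffer of the 4-byte values themselves

print(type(obj[1]), type(fixed[1]))   # <class 'int'> <class 'numpy.int32'>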
I have a rather simple dataframe I want to pass to PyCaret.
Datetime as index, then TOTAL, which is a float64 according to df.dtypes.
from pycaret.time_series import *

# init setup
s = setup(df, target="TOTAL", fh=12, session_id=123)
The traceback points to this part:
def _is_nullable_numeric(dtype):
----> 9 return dtype in ["Int64", "Float64", "boolean"]
TypeError: data type 'Int64' not understood
I assume it's related to Numpy. I have tried to set the dtype with .astype() etc., but I am still getting the same error. I have tried setting TOTAL as a float and as int32; still the same issue.
What gives?
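For what it's worth, the line PyCaret points at compares a column's dtype against pandas extension dtype names ("Int64", "Float64", "boolean"). If the dtype being tested is a plain numpy dtype, that comparison by itself can raise exactly this error on some NumPy versions, because "Int64" is not a dtype name numpy understands. A minimal reproduction under that assumption (my reading of the traceback, not a confirmed diagnosis):

import numpy as np

dt = np.dtype("float64")                  # e.g. the dtype reported for TOTAL
dt in ["Int64", "Float64", "boolean"]     # may raise: data type 'Int64' not understood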