import numpy as np

a = np.array([0.123456789121212,2,3], dtype=np.float16)
print("16bit: ", a[0])
a = np.array([0.123456789121212,2,3], dtype=np.float32)
print("32bit: ", a[0])
b = np.array([0.123456789121212121212,2,3], dtype=np.float64)
print("64bit: ", b[0])
- 16bit: 0.1235
- 32bit: 0.12345679
- 64bit: 0.12345678912121212
float32 is a 32-bit floating-point number; float64 uses 64 bits.
That means float64 values take up twice as much memory, and operations on them may be considerably slower on some machine architectures.
However, float64 can represent numbers much more accurately than float32: roughly 15 to 16 significant decimal digits versus about 7.
It also allows much larger numbers to be stored (up to about 1.8e308, versus about 3.4e38 for float32).
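The memory, precision, and range differences described above can be read off NumPy's `np.finfo`. This is only a small sketch; the array length of 1000 and the `summary` dictionary are arbitrary choices for illustration.

```python
import numpy as np

summary = {}
for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    summary[dtype.__name__] = {
        # Bytes needed to store 1000 values of this type.
        "bytes_per_1000": np.zeros(1000, dtype=dtype).nbytes,
        # Gap between 1.0 and the next representable number.
        "eps": float(info.eps),
        # Largest finite value.
        "max": float(info.max),
        # Approximate number of reliable decimal digits.
        "decimal_digits": info.precision,
    }
    print(dtype.__name__, summary[dtype.__name__])
```

The `eps` entry is the most useful one when deciding between the two types: it bounds the relative rounding error of a single stored value.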
For your Python/NumPy project, I'm sure you know the input variables and their nature.
To make a decision, we as programmers need to ask ourselves:
- What kind of precision does my output need?
- Is speed not an issue at all?
- What precision is needed, in parts per million?
A naive example: suppose I store weather data for my city as [12.3, 14.5, 11.1, 9.9, 12.2, 8.2].
The predicted output for the next day could be reported as 11.5 or as 11.5164374.
Do you think float64 would be necessary here, or would float32 do?
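To put a number on the weather example, the sketch below stores the same readings in both types and measures how far the float32 copy drifts from the float64 one. The variable names are made up for illustration, and one decimal place is assumed to be the resolution of the measurements.

```python
import numpy as np

temps = [12.3, 14.5, 11.1, 9.9, 12.2, 8.2]
t64 = np.array(temps, dtype=np.float64)
t32 = np.array(temps, dtype=np.float32)

# Largest difference introduced by storing the readings as float32.
max_err = float(np.max(np.abs(t32.astype(np.float64) - t64)))
print("max float32 storage error:", max_err)

# Rounded back to one decimal place, the float32 copy should match the readings.
recovered = np.round(t32.astype(np.float64), 1)
print(recovered)
```

The storage error is far smaller than the 0.1 degree resolution of the data, so for readings like these float32 loses nothing that was measured.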
Hi. I derived a filter for A-weighting from the information on Wikipedia, made a simple implementation using 32-bit floats (in Rust), and did the z-transformation, but found that the filter was unstable: after 1000 samples or so, the filter output exploded. To see what was happening, I did something similar in an old math program (MathPad, a really nice and simple program for OS X), and didn't find a problem there. Then I looked at the differences and found a small, increasing difference between the two implementations. Then I switched to 64-bit floats (aka doubles), and the problem disappeared.
I'm a bit of a newbie in this field (had some formal training, but very little practice), so I wondered whether this is normal, or whether it is a weird edge case, e.g. due to the way the transfer function was constructed.
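A "small, increasing difference" between a 32-bit and a 64-bit run is what rounding in a feedback loop tends to look like. The sketch below is not the poster's A-weighting filter; it assumes the simplest recursive filter, an integrator y[n] = y[n-1] + x[n], fed a constant input of 0.1, and runs it in both precisions.

```python
import numpy as np

n = 1_000_000
x32 = np.full(n, 0.1, dtype=np.float32)
x64 = np.full(n, 0.1, dtype=np.float64)

# np.cumsum applies y[n] = y[n-1] + x[n] in the dtype of its input,
# so every partial result is rounded to that precision before being fed back.
y32 = float(np.cumsum(x32)[-1])
y64 = float(np.cumsum(x64)[-1])

print("float32 final output:", y32)
print("float64 final output:", y64)
print("float32 drift from 100000:", y32 - 100000.0)
```

Each output is rounded and then reused as the next input, so the float32 error grows with the number of samples instead of averaging out. Whether this alone explains the exploding A-weighting filter is not something this sketch can settle.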
Using math.Float32bits and math.Float64bits, you can see how Go represents the different decimal values as an IEEE 754 binary value:
Playground: https://play.golang.org/p/ZqzdCZLfvC
Result:
float32(0.1): 00111101110011001100110011001101
float32(0.2): 00111110010011001100110011001101
float32(0.3): 00111110100110011001100110011010
float64(0.1): 0011111110111001100110011001100110011001100110011001100110011010
float64(0.2): 0011111111001001100110011001100110011001100110011001100110011010
float64(0.3): 0011111111010011001100110011001100110011001100110011001100110011
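The same bit patterns can be reproduced outside Go. Here is a rough Python equivalent using the standard `struct` module; the helper names `float32_bits` and `float64_bits` are made up for this sketch.

```python
import struct

def float32_bits(x):
    """Bit pattern of x after rounding to IEEE 754 single precision."""
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    return format(n, "032b")

def float64_bits(x):
    """Bit pattern of x as an IEEE 754 double."""
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(n, "064b")

for v in (0.1, 0.2, 0.3):
    print(f"float32({v}): {float32_bits(v)}")
for v in (0.1, 0.2, 0.3):
    print(f"float64({v}): {float64_bits(v)}")
```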
If you convert these binary representations to decimal values and run your loop, you can see that for float32, the initial value of a will be:
0.20000000298023224
+ 0.10000000149011612
- 0.30000001192092896
= -7.4505806e-9
a negative value that can never sum up to 1.
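The arithmetic above can be checked with a short sketch. It assumes a hypothetical helper `f32` that rounds a Python float (a double) to the nearest float32, and evaluates the expression two ways: without rounding the intermediate sum, which reproduces the figure quoted above, and with rounding after every operation, as a strict float32 evaluation would do. It does not establish which of the two a particular Go build performs.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

a02, a01, a03 = f32(0.2), f32(0.1), f32(0.3)

# Intermediates kept at double precision.
unrounded = a02 + a01 - a03

# Every intermediate rounded back to float32.
stepwise = f32(f32(a02 + a01) - a03)

print("unrounded:", unrounded)
print("stepwise: ", stepwise)
```

In both cases the starting value is not positive, so repeatedly doubling it never reaches 1 and the loop does not terminate.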
So, why does C behave differently?
If you look at the binary pattern (and know a little about how binary values are represented), you can see that Go rounds the last bit, while I assume C just truncates it instead.
So, in a sense, while neither Go nor C can represent 0.1 exactly in a float, Go uses the value closest to 0.1:
Go: 00111101110011001100110011001101 => 0.10000000149011612
C(?): 00111101110011001100110011001100 => 0.09999999403953552
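To see which of the two candidate patterns is closer to 0.1, they can be decoded and compared exactly. This is a sketch using only the standard library; `from_bits` is a made-up helper name, and the second pattern is the answer's guess about C, not a confirmed result.

```python
import struct
from fractions import Fraction

def from_bits(pattern):
    """Decode a 32-character bit string as an IEEE 754 float32."""
    return struct.unpack(">f", struct.pack(">I", int(pattern, 2)))[0]

rounded = from_bits("00111101110011001100110011001101")    # last bit rounded up
truncated = from_bits("00111101110011001100110011001100")  # last bit cropped

# Fractions give the exact distance to the decimal value 1/10.
target = Fraction(1, 10)
err_rounded = abs(Fraction(rounded) - target)
err_truncated = abs(Fraction(truncated) - target)

print(rounded, float(err_rounded))
print(truncated, float(err_truncated))
```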
Edit:
I posted a question about how C handles float constants, and from the answer it seems that any implementation of the C standard is allowed to do either. The implementation you tried it with just did it differently than Go.
I agree with ANisus that Go is doing the right thing. Concerning C, I'm not convinced by his guess.
The C standard does not dictate, but most implementations of libc will convert the decimal representation to nearest float (at least to comply with IEEE-754 2008 or ISO 10967), so I don't think this is the most probable explanation.
There are several reasons why the C program behavior might differ... Especially, some intermediate computations might be performed with excess precision (double or long double).
The most probable explanation I can think of is that you wrote 0.1 instead of 0.1f in C.
In that case, you might have caused excess precision in the initialization
(you sum float a + double 0.1 => the float is converted to double, then the result is converted back to float).
If I emulate these operations:
float32(float32(float32(0.2) + float64(0.1)) - float64(0.3))
Then I find something near 1.1920929e-8f
After 27 iterations, this sums to 1.6f
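The emulation above can be run end to end in Python. In this sketch, `f32` is a hypothetical helper that rounds a double to the nearest float32, standing in for storing a result back into a C `float`; the constants 0.1 and 0.3 are left as doubles, which is the assumption being tested.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Mimics: float a = 0.2f; a += 0.1; a -= 0.3;  (0.1 and 0.3 are doubles)
a = f32(0.2)
a = f32(a + 0.1)
a = f32(a - 0.3)
start = a

# Double the value until it reaches 1, counting the iterations.
i = 0
while a < 1.0:
    a = f32(a + a)
    i += 1

print("start:", start)
print("iterations:", i, "final:", a)
```

Under this assumption the starting value is small but positive, so the loop terminates, which is consistent with the 27 iterations and 1.6 reported above.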
Why does PyTorch appear to prefer float32 over float64 as the default element datatype for tensors?
For example, the default element datatype for torch.tensor() is float32. This is the opposite of NumPy, where the default element datatype for numpy.array() is float64. Why doesn't PyTorch make this consistent with NumPy arrays and use float64 as the default element datatype?
(P.S. I know I can change the element datatype of a tensor, but it would be more convenient if the default were float64.)