import numpy as np

a = np.array([0.123456789121212, 2, 3], dtype=np.float16)
print("16bit: ", a[0])

a = np.array([0.123456789121212, 2, 3], dtype=np.float32)
print("32bit: ", a[0])

b = np.array([0.123456789121212121212, 2, 3], dtype=np.float64)
print("64bit: ", b[0])
Output:
16bit: 0.1235
32bit: 0.12345679
64bit: 0.12345678912121212
Answer from Furkan Gulsen on Stack Overflow
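The rounding in the output above follows directly from each dtype's machine epsilon. A quick way to probe the precision limits yourself (a sketch using NumPy's `np.finfo`, which reports epsilon and the approximate number of reliable decimal digits per dtype):

```python
import numpy as np

# Machine epsilon and approximate reliable decimal digits for each float width.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(dtype.__name__, "eps:", info.eps, "decimal digits:", info.precision)
```

This reports roughly 3 digits for float16, 6-7 for float32, and 15-17 for float64, which is why only the float64 case preserves the long literal above.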
Quora
What are the differences between float32 and float64? - Quora
However, float64’s can represent numbers much more accurately than 32 bit floats. They also allow much larger numbers to be stored…although this is very rarely an issue. Generally, you should just use “float” in situations where insanely high precision or very small memory footprint isn’t too important - and let the compiler decide what is most efficient on your particular hardware (which will generally be ‘float32’).
Discussions

Consequence of using single (float32) or double (float64) precision for saving interpolated data
When saving interpolated data (after linear or non-linear warps, but also as internal representation) we often face the decision of whether single (float32) or double (float64) precision should be used. I was wondering if there are any comparisons using MRI (fMRI especially) data between the ... More on neurostars.org
neurostars.org · February 22, 2017
float32 vs float64 precision lost when casting to int
The float-to-int conversion behaves differently at 32-bit and 64-bit precision. More on github.com
github.com · December 14, 2016
Big difference between float 32 and float 64 operations in IIR?
It depends on how extreme the filter is, the filter structure, and what frequencies you're looking at. Very low frequencies (DC filters) can have trouble with 32 bit floating point. It can also depend on the order of operations how the filter is actually implemented in assembly, as rounding errors can accumulate and cause overflow or underflow issues. More on reddit.com
r/DSP · February 28, 2016
go - Golang floating point precision float32 vs float64 - Stack Overflow
I wrote a program to demonstrate floating point error in Go:
func main() {
    a := float64(0.2)
    a += 0.1
    a -= 0.3
    var i int
    for i = 0; a ... More on stackoverflow.com
stackoverflow.com

Python⇒Speed
The problem with float32: you only get 16 million values
February 1, 2023 - In contrast, 64-bit floats give you 2⁵³ = ~9,000,000,000,000,000 values. This is a much larger number than 16 million. So how do you fit float64s into float32s without losing precision?
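The "only 16 million values" limit comes from float32's 24-bit significand: integers above 2²⁴ can no longer all be represented exactly, and the same breakdown hits float64 at 2⁵³. A minimal check in NumPy:

```python
import numpy as np

# float32 has a 24-bit significand, so 2**24 + 1 is the first
# integer it cannot represent: it rounds back down to 2**24.
print(np.float32(2**24) == 2**24)       # True: 16777216 is exact
print(np.float32(2**24 + 1) == 2**24)   # True: 16777217 rounds to 16777216

# float64 has a 53-bit significand, so the analogous gap opens at 2**53 + 1.
print(np.float64(2**53 + 1) == 2**53)   # True
```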
Medium
Understanding numpy.float64. If you think you need to spend $2,000… | by Amit Yadav | Medium
February 8, 2025 - The main difference lies in precision. float64 uses 64 bits, offering more precision (about 15–17 decimal places) than float32, which uses only 32 bits and is limited to around 7 decimal places.
ClickHouse
Float32 | Float64 | BFloat16 Types | ClickHouse Docs
Float32 — FLOAT, REAL, SINGLE. Float64 — DOUBLE, DOUBLE PRECISION.
GitHub
float32 vs float64 precision lost when casting to int · Issue #18310 · golang/go
December 14, 2016 - The float-to-int conversion behaves differently at 32-bit and 64-bit precision.
Author   trajber
Medium
Inaccurate float32 and float64: how to avoid the trap in Go (golang) | by Wacław The Developer | Medium
December 14, 2021 -
var n float64 = 0
for i := 0; i < 10; i++ { n += 0.1 }
fmt.Println(n)
println(n == 1)
You will be surprised, but the output will be:
0.9999999999999999
false
So, what about float32?
var n float32 = 0
for i := 0; i < 10; i++ { n += 0.1 }
fmt.Println(n)
println(n == 1)
Maybe you will be confused now, but the output will be ·
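The same accumulation trap is easy to reproduce outside Go; a Python equivalent of the float64 loop above (Python floats are IEEE 754 doubles):

```python
# Summing 0.1 ten times does not give exactly 1, because 0.1
# has no exact binary representation; each addition rounds.
n = 0.0
for _ in range(10):
    n += 0.1
print(n)       # 0.9999999999999999
print(n == 1)  # False
```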
Reddit
r/DSP on Reddit: Big difference between float 32 and float 64 operations in IIR?
February 28, 2016 -

Hi. I had derived a filter for A-weighting from Wikipedia's information, made a simple implementation using 32-bit floats (in Rust), and did the z-transformation, but found that the filter was unstable: after 1000 samples or so, the filter output exploded. To see what was happening, I did something similar in an old math program (MathPad, a really nice and simple program for OSX), and didn't find a problem there. Then I looked at the differences and found a small, increasing difference between the two implementations. Then I switched to 64-bit floats (aka doubles), and the problem disappeared.

I'm a bit of a newbie in this field (I've had some formal training, but very little practice), so I wondered whether this is normal, or whether it is a weird edge case, e.g. due to the way the transfer function was constructed.
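The kind of drift described above can be sketched with a toy one-pole recursive filter. This is purely illustrative (not the poster's actual A-weighting filter): the same recursion is run in float32 and float64, and the two results slowly diverge as rounding errors accumulate in the feedback path.

```python
import numpy as np

# Toy one-pole IIR: y[n] = a*y[n-1] + (1-a)*x[n], pole close to 1,
# driven by a constant input of 1.0. Both precisions converge toward 1,
# but the float32 path accumulates rounding error at every step.
a64, y64 = np.float64(0.999), np.float64(0.0)
a32, y32 = np.float32(0.999), np.float32(0.0)
for _ in range(10000):
    y64 = a64 * y64 + (np.float64(1) - a64) * np.float64(1.0)
    y32 = a32 * y32 + (np.float32(1) - a32) * np.float32(1.0)

print(float(y64), float(y32), abs(float(y64) - float(y32)))
```

With a more extreme pole (or a higher-order filter), the float32 error can grow instead of staying bounded, which is consistent with the instability the poster observed.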

NumPy
Data types — NumPy v2.4 Manual
Which is more efficient depends on hardware and development environment; typically on 32-bit systems they are padded to 96 bits, while on 64-bit systems they are typically padded to 128 bits. np.longdouble is padded to the system default; np.float96 and np.float128 are provided for users who ...
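Whether `np.longdouble` actually buys extra precision over float64 is platform dependent, as the snippet above notes; a quick probe (the longdouble values vary by platform, so no output is shown for them):

```python
import numpy as np

# np.float64 is always 8 bytes; np.longdouble may be 8, 12, or 16 bytes
# depending on platform and padding, and may or may not add precision.
print(np.dtype(np.float64).itemsize)       # 8 everywhere
print(np.dtype(np.longdouble).itemsize)    # platform dependent
print(np.finfo(np.longdouble).precision)   # decimal digits, platform dependent
```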
Top answer (1 of 2) · 36 votes

Using math.Float32bits and math.Float64bits, you can see how Go represents the different decimal values as an IEEE 754 binary value:

Playground: https://play.golang.org/p/ZqzdCZLfvC

Result:

float32(0.1): 00111101110011001100110011001101
float32(0.2): 00111110010011001100110011001101
float32(0.3): 00111110100110011001100110011010
float64(0.1): 0011111110111001100110011001100110011001100110011001100110011010
float64(0.2): 0011111111001001100110011001100110011001100110011001100110011010
float64(0.3): 0011111111010011001100110011001100110011001100110011001100110011
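The same bit patterns can be reproduced in Python with the `struct` module (a sketch: `>f` and `>d` pack big-endian IEEE 754 single and double precision):

```python
import struct

def bits32(x):
    # IEEE 754 single-precision bit pattern as a 32-character string.
    return format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")

def bits64(x):
    # IEEE 754 double-precision bit pattern as a 64-character string.
    return format(struct.unpack(">Q", struct.pack(">d", x))[0], "064b")

print(bits32(0.1))  # 00111101110011001100110011001101
print(bits64(0.1))  # 0011111110111001100110011001100110011001100110011001100110011010
```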

If you convert these binary representations to decimal values and do your loop, you can see that for float32, the initial value of a will be:

0.20000000298023224
+ 0.10000000149011612
- 0.30000001192092896
= -7.4505806e-9

a negative value that can never sum up to 1.

So, why does C behave differently?

If you look at the binary pattern (and know a little about how binary values are represented), you can see that Go rounds the last bit while I assume C just crops it instead.

So, in a sense, while neither Go nor C can represent 0.1 exactly in a float, Go uses the value closest to 0.1:

Go:   00111101110011001100110011001101 => 0.10000000149011612
C(?): 00111101110011001100110011001100 => 0.09999999403953552

Edit:

I posted a question about how C handles float constants, and from the answer it seems that any implementation of the C standard is allowed to do either. The implementation you tried it with just did it differently than Go.

Answer 2 of 2 · 18 votes

Agree with ANisus, Go is doing the right thing. Concerning C, I'm not convinced by his guess.

The C standard does not dictate, but most implementations of libc will convert the decimal representation to nearest float (at least to comply with IEEE-754 2008 or ISO 10967), so I don't think this is the most probable explanation.

There are several reasons why the C program behavior might differ... Especially, some intermediate computations might be performed with excess precision (double or long double).

The most probable thing I can think of is if you ever wrote 0.1 instead of 0.1f in C.
In that case, you might have caused excess precision in the initialization
(you sum float a + double 0.1 => the float is converted to double, then the result is converted back to float).

If I emulate these operations

float32(float32(float32(0.2) + float64(0.1)) - float64(0.3))

Then I find something near 1.1920929e-8f

After 27 iterations, this sums to 1.6f
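That mixed-precision emulation can be checked in Python by round-tripping through a 32-bit pack (a sketch: the hypothetical helper `f32()` rounds a double to the nearest float32, while the 0.1 and 0.3 literals stay double, mimicking a missing `f` suffix in C):

```python
import struct

def f32(x):
    # Round a Python float (IEEE 754 double) to the nearest single.
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Emulates: float32(float32(float32(0.2) + float64(0.1)) - float64(0.3))
a = f32(f32(0.2) + 0.1)  # float + double, rounded back to float
a = f32(a - 0.3)
print(a)  # ≈ 1.1920929e-08, matching the value in the answer above
```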

Massed Compute
What are the trade-offs between using float16, float32, and float64 precision for a machine learning task on a GPU? - Massed Compute
July 31, 2025 - Optimize machine learning on GPU: Weigh float16, float32, and float64 trade-offs for precision and performance.
Google Groups
64 bit system: Force Float default to Float32
Note also that while Float64 * Float32 (and other arithmetic ops) produce Float64, for other "less precise" numeric types, Float32 wins out, making it fairly straightforward to write type-stable generic code:
Medium
Why You Can’t Trust float64 for Perfect Precision (and It’s Not a Bug) | by Moksh S | Medium
November 29, 2025 - float32 (Single Precision): Uses 32 bits. float64 (Double Precision): Uses 64 bits. This is the default “float” or “double” in most languages. The standard was created to bring consistency and portability to numerical computing.
Mozilla
Efficient float32 arithmetic in JavaScript - The Mozilla Blog
November 7, 2013 - Float32 operations often require fewer CPU cycles as they need less precision. There is additional CPU overhead if the operands to a float64 operation are float32, since they must first be converted to float64.
Wikipedia
Single-precision floating-point format - Wikipedia
Single-precision floating-point format (sometimes called FP32, float32, or float) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide range of numeric values by using a floating radix point. A floating-point variable can represent a wider range of ...
GitHub
float32 vs float64 · Issue #20 · erikbern/ann-benchmarks
June 8, 2016 - Below are my justifications for using float32 in the context of approximate K-NN search: The search process is usually approximate, and the feature data are usually noisy by nature, so typically the precision of float64 is not necessary.
Author   aaalgo