import numpy as np

a = np.array([0.123456789121212, 2, 3], dtype=np.float16)
print("16bit: ", a[0])

a = np.array([0.123456789121212, 2, 3], dtype=np.float32)
print("32bit: ", a[0])

b = np.array([0.123456789121212121212, 2, 3], dtype=np.float64)
print("64bit: ", b[0])
Output:
16bit: 0.1235
32bit: 0.12345679
64bit: 0.12345678912121212
Answer from Furkan Gulsen on Stack Overflow
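The rounding in the output above follows directly from each dtype's machine epsilon. A quick way to probe the precision limits yourself (a sketch using NumPy's `np.finfo`, which reports epsilon and the approximate number of reliable decimal digits per dtype):

```python
import numpy as np

# Machine epsilon and approximate reliable decimal digits for each float width.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(dtype.__name__, "eps:", info.eps, "decimal digits:", info.precision)
```

This reports roughly 3 digits for float16, 6-7 for float32, and 15-17 for float64, which is why only the float64 case preserves the long literal above.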
Quora
What are the differences between float32 and float64? - Quora
However, float64’s can represent numbers much more accurately than 32 bit floats. They also allow much larger numbers to be stored…although this is very rarely an issue. Generally, you should just use “float” in situations where insanely high precision or very small memory footprint isn’t too important - and let the compiler decide what is most efficient on your particular hardware (which will generally be ‘float32’).
Discussions

Consequence of using single (float32) or double (float64) precision for saving interpolated data
When saving interpolated data (after linear or non-linear warps, but also as internal representation) we often face the decision of whether single (float32) or double (float64) precision should be used. I was wondering if there are any comparisons using MRI (fMRI especially) data between the ... More on neurostars.org
neurostars.org · February 22, 2017
float32 vs float64 precision lost when casting to int
The float-to-int conversion behaves differently at 32-bit and 64-bit precision. More on github.com
github.com · December 14, 2016
Big difference between float 32 and float 64 operations in IIR?
It depends on how extreme the filter is, the filter structure, and what frequencies you're looking at. Very low frequencies (DC filters) can have trouble with 32 bit floating point. It can also depend on the order of operations how the filter is actually implemented in assembly, as rounding errors can accumulate and cause overflow or underflow issues. More on reddit.com
r/DSP · February 28, 2016
go - Golang floating point precision float32 vs float64 - Stack Overflow
I wrote a program to demonstrate floating point error in Go:
func main() {
    a := float64(0.2)
    a += 0.1
    a -= 0.3
    var i int
    for i = 0; a ... More on stackoverflow.com
stackoverflow.com

Python⇒Speed
The problem with float32: you only get 16 million values
February 1, 2023 - In contrast, 64-bit floats give you 2⁵³ = ~9,000,000,000,000,000 values. This is a much larger number than 16 million. So how do you fit float64s into float32s without losing precision?
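The "only 16 million values" limit comes from float32's 24-bit significand: integers above 2²⁴ can no longer all be represented exactly, and the same breakdown hits float64 at 2⁵³. A minimal check in NumPy:

```python
import numpy as np

# float32 has a 24-bit significand, so 2**24 + 1 is the first
# integer it cannot represent: it rounds back down to 2**24.
print(np.float32(2**24) == 2**24)       # True: 16777216 is exact
print(np.float32(2**24 + 1) == 2**24)   # True: 16777217 rounds to 16777216

# float64 has a 53-bit significand, so the analogous gap opens at 2**53 + 1.
print(np.float64(2**53 + 1) == 2**53)   # True
```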
Medium
Understanding numpy.float64. If you think you need to spend $2,000… | by Amit Yadav | Medium
February 8, 2025 - The main difference lies in precision. float64 uses 64 bits, offering more precision (about 15–17 decimal places) than float32, which uses only 32 bits and is limited to around 7 decimal places.
ClickHouse
Float32 | Float64 | BFloat16 Types | ClickHouse Docs
Float32 — FLOAT, REAL, SINGLE. Float64 — DOUBLE, DOUBLE PRECISION.
GitHub
float32 vs float64 precision lost when casting to int · Issue #18310 · golang/go
December 14, 2016 - The float-to-int conversion behaves differently at 32-bit and 64-bit precision.
Author   trajber
Medium
Inaccurate float32 and float64: how to avoid the trap in Go (golang) | by Wacław The Developer | Medium
December 14, 2021 -
var n float64 = 0
for i := 0; i < 10; i++ { n += 0.1 }
fmt.Println(n)
println(n == 1)
You will be surprised, but the output will be:
0.9999999999999999
false
So, what about float32?
var n float32 = 0
for i := 0; i < 10; i++ { n += 0.1 }
fmt.Println(n)
println(n == 1)
Maybe you will be confused now, but the output will be ·
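The same accumulation trap is easy to reproduce outside Go; a Python equivalent of the float64 loop above (Python floats are IEEE 754 doubles):

```python
# Summing 0.1 ten times does not give exactly 1, because 0.1
# has no exact binary representation; each addition rounds.
n = 0.0
for _ in range(10):
    n += 0.1
print(n)       # 0.9999999999999999
print(n == 1)  # False
```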
Reddit
r/DSP on Reddit: Big difference between float 32 and float 64 operations in IIR?
February 28, 2016 -

Hi. I had derived a filter for A-weighting from Wikipedia's information, made a simple implementation using 32-bit floats (in Rust), and did the z-transformation, but found that the filter was unstable: after 1000 samples or so, the filter output exploded. To see what was happening, I did something similar in an old math program (MathPad, a really nice and simple program for OSX), and didn't find a problem there. Then I looked at the differences and found a small, increasing difference between the two implementations. Then I switched to 64-bit floats (aka doubles), and the problem disappeared.

I'm a bit of a newbie in this field (I've had some formal training, but very little practice), so I wondered whether this is normal, or whether it is a weird edge case, e.g. due to the way the transfer function was constructed.
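The kind of drift described above can be sketched with a toy one-pole recursive filter. This is purely illustrative (not the poster's actual A-weighting filter): the same recursion is run in float32 and float64, and the two results slowly diverge as rounding errors accumulate in the feedback path.

```python
import numpy as np

# Toy one-pole IIR: y[n] = a*y[n-1] + (1-a)*x[n], pole close to 1,
# driven by a constant input of 1.0. Both precisions converge toward 1,
# but the float32 path accumulates rounding error at every step.
a64, y64 = np.float64(0.999), np.float64(0.0)
a32, y32 = np.float32(0.999), np.float32(0.0)
for _ in range(10000):
    y64 = a64 * y64 + (np.float64(1) - a64) * np.float64(1.0)
    y32 = a32 * y32 + (np.float32(1) - a32) * np.float32(1.0)

print(float(y64), float(y32), abs(float(y64) - float(y32)))
```

With a more extreme pole (or a higher-order filter), the float32 error can grow instead of staying bounded, which is consistent with the instability the poster observed.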

NumPy
Data types — NumPy v2.4 Manual
Which is more efficient depends on hardware and development environment; typically on 32-bit systems they are padded to 96 bits, while on 64-bit systems they are typically padded to 128 bits. np.longdouble is padded to the system default; np.float96 and np.float128 are provided for users who ...
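Whether `np.longdouble` actually buys extra precision over float64 is platform dependent, as the snippet above notes; a quick probe (the longdouble values vary by platform, so no output is shown for them):

```python
import numpy as np

# np.float64 is always 8 bytes; np.longdouble may be 8, 12, or 16 bytes
# depending on platform and padding, and may or may not add precision.
print(np.dtype(np.float64).itemsize)       # 8 everywhere
print(np.dtype(np.longdouble).itemsize)    # platform dependent
print(np.finfo(np.longdouble).precision)   # decimal digits, platform dependent
```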
Top answer (1 of 2) · 36 votes

Using math.Float32bits and math.Float64bits, you can see how Go represents the different decimal values as an IEEE 754 binary value:

Playground: https://play.golang.org/p/ZqzdCZLfvC

Result:

float32(0.1): 00111101110011001100110011001101
float32(0.2): 00111110010011001100110011001101
float32(0.3): 00111110100110011001100110011010
float64(0.1): 0011111110111001100110011001100110011001100110011001100110011010
float64(0.2): 0011111111001001100110011001100110011001100110011001100110011010
float64(0.3): 0011111111010011001100110011001100110011001100110011001100110011
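The same bit patterns can be reproduced in Python with the `struct` module (a sketch: `>f` and `>d` pack big-endian IEEE 754 single and double precision):

```python
import struct

def bits32(x):
    # IEEE 754 single-precision bit pattern as a 32-character string.
    return format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")

def bits64(x):
    # IEEE 754 double-precision bit pattern as a 64-character string.
    return format(struct.unpack(">Q", struct.pack(">d", x))[0], "064b")

print(bits32(0.1))  # 00111101110011001100110011001101
print(bits64(0.1))  # 0011111110111001100110011001100110011001100110011001100110011010
```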

If you convert these binary representations to decimal values and do your loop, you can see that for float32, the initial value of a will be:

0.20000000298023224
+ 0.10000000149011612
- 0.30000001192092896
= -7.4505806e-9

a negative value that can never sum up to 1.

So, why does C behave differently?

If you look at the binary pattern (and know a little about how binary values are represented), you can see that Go rounds the last bit while I assume C just crops it instead.

So, in a sense, while neither Go nor C can represent 0.1 exactly in a float, Go uses the value closest to 0.1:

Go:   00111101110011001100110011001101 => 0.10000000149011612
C(?): 00111101110011001100110011001100 => 0.09999999403953552

Edit:

I posted a question about how C handles float constants, and from the answer it seems that any implementation of the C standard is allowed to do either. The implementation you tried it with just did it differently than Go.

Answer 2 of 2 · 18 votes

Agree with ANisus, Go is doing the right thing. Concerning C, I'm not convinced by his guess.

The C standard does not dictate, but most implementations of libc will convert the decimal representation to nearest float (at least to comply with IEEE-754 2008 or ISO 10967), so I don't think this is the most probable explanation.

There are several reasons why the C program behavior might differ... Especially, some intermediate computations might be performed with excess precision (double or long double).

The most probable thing I can think of is if you ever wrote 0.1 instead of 0.1f in C.
In that case, you might have caused excess precision in the initialization
(you sum float a + double 0.1 => the float is converted to double, then the result is converted back to float).

If I emulate these operations

float32(float32(float32(0.2) + float64(0.1)) - float64(0.3))

Then I find something near 1.1920929e-8f

After 27 iterations, this sums to 1.6f
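That mixed-precision emulation can be checked in Python by round-tripping through a 32-bit pack (a sketch: the hypothetical helper `f32()` rounds a double to the nearest float32, while the 0.1 and 0.3 literals stay double, mimicking a missing `f` suffix in C):

```python
import struct

def f32(x):
    # Round a Python float (IEEE 754 double) to the nearest single.
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Emulates: float32(float32(float32(0.2) + float64(0.1)) - float64(0.3))
a = f32(f32(0.2) + 0.1)  # float + double, rounded back to float
a = f32(a - 0.3)
print(a)  # ≈ 1.1920929e-08, matching the value in the answer above
```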

Massed Compute
What are the trade-offs between using float16, float32, and float64 precision for a machine learning task on a GPU? - Massed Compute
July 31, 2025 - Optimize machine learning on GPU: Weigh float16, float32, and float64 trade-offs for precision and performance.
Google Groups
64 bit system: Force Float default to Float32
Note also that while Float64 * Float32 (and other arithmetic ops) produce Float64, for other "less precise" numeric types, Float32 wins out, making it fairly straightforward to write type-stable generic code:
Medium
Why You Can’t Trust float64 for Perfect Precision (and It’s Not a Bug) | by Moksh S | Medium
November 29, 2025 - float32 (Single Precision): Uses 32 bits. float64 (Double Precision): Uses 64 bits. This is the default “float” or “double” in most languages. The standard was created to bring consistency and portability to numerical computing.
Mozilla
Efficient float32 arithmetic in JavaScript - The Mozilla Blog
November 7, 2013 - Float32 operations often require fewer CPU cycles as they need less precision. There is additional CPU overhead if the operands to a float64 operation are float32, since they must first be converted to float64.
Wikipedia
Single-precision floating-point format - Wikipedia
Single-precision floating-point format (sometimes called FP32, float32, or float) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide range of numeric values by using a floating radix point. A floating-point variable can represent a wider range of ...
GitHub
float32 vs float64 · Issue #20 · erikbern/ann-benchmarks
June 8, 2016 - Below are my justifications for using float32 in the context of approximate K-NN search: The search process is usually approximate, and the feature data are usually noisy by nature, so typically the precision of float64 is not necessary.
Author   aaalgo