Surprisingly, it's perfectly fine to use 16 bits, not just for fun but in production as well. For example, in this video Jeff Dean talks about 16-bit calculations at Google, around 52:00. A quote from the slides:

Neural net training very tolerant of reduced precision

Since GPU memory is the main bottleneck in ML computation, there has been a lot of research on precision reduction. For example:

  • Gupta et al.'s paper "Deep Learning with Limited Numerical Precision", about fixed-point (not floating-point) 16-bit training, but with stochastic rounding.

  • Courbariaux et al.'s "Training Deep Neural Networks with Low Precision Multiplications", about 10-bit activations and 12-bit parameter updates.

  • And this is not the limit. Courbariaux et al., "BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1", discusses 1-bit activations and weights (though with higher precision for the gradients), which makes the forward pass very fast.

Of course, I can imagine that some networks may require high precision for training, but I would recommend at least trying 16 bits when training a big network, and switching to 32 bits if it proves to work worse.

Answer from Maxim on Stack Overflow
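A quick NumPy sketch (not part of the original answer) of the trade-off being described: float16 halves the memory of float32 but gives up both precision and range.

```python
import numpy as np

# float16 has a 10-bit significand: increments below ~0.0005 of a
# value near 1.0 are rounded away entirely.
print(np.float16(1.0) + np.float16(0.0001) == np.float16(1.0))  # True

# Its range is narrow too: the largest finite float16 is 65504.
print(np.float16(70000))  # inf

# The payoff is memory: half the bytes of float32.
weights32 = np.ones(1_000_000, dtype=np.float32)
weights16 = weights32.astype(np.float16)
print(weights32.nbytes, weights16.nbytes)  # 4000000 2000000
```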
🌐
Reddit
reddit.com › r/programminglanguages › float16 vs float32
r/ProgrammingLanguages on Reddit: Float16 vs Float32
August 6, 2019

Hi

In my toy language, I want to have a single float type, "decimal". I am not sure if I should go with f16 or f32 internally.

I assume f32 will take more memory, but in today's world, is that even relevant?

I also read somewhere that GPUs don't support f32 and I need to have f16 anyway if I want to use any UI libraries.

At this point, I really am not sure which I should go for. I really want to keep a single floating-point type. My language is not targeted at IoT devices, and performance is one of the goals.


float16 training is tricky: your model might not converge when using standard float16, but float16 does save memory, and it is also faster if you are using the latest Volta GPUs. NVIDIA recommends "Mixed Precision Training" in its latest docs and paper.

To make good use of float16, you need to choose the loss_scale manually and carefully. If loss_scale is too large, you may get NaNs and Infs; if it is too small, the model might not converge. Unfortunately, there is no single loss_scale that works for all models, so you have to choose it carefully for your specific model.
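To illustrate what loss_scale buys you, here is a minimal NumPy sketch (the function and names are illustrative, not TensorFlow API): gradients too small for float16 are flushed to zero, but survive if the loss (and therefore every gradient) is multiplied by the scale before the backward pass and divided out in float32 afterwards.

```python
import numpy as np

def scaled_backward_step(grad_fn, loss_scale):
    """Illustrative loss-scaling step: get float16 gradients of the
    scaled loss, then unscale them in float32."""
    scaled_grads = np.asarray(grad_fn(loss_scale), dtype=np.float16)
    if not np.all(np.isfinite(scaled_grads)):
        return None  # overflow (Infs/NaNs): skip the step, lower the scale
    return scaled_grads.astype(np.float32) / loss_scale

# A "true" gradient of 1e-8 underflows float16 (smallest subnormal ~6e-8) ...
print(np.float16(1e-8))  # 0.0
# ... but survives when the loss is scaled by 1024 before differentiation.
grads = scaled_backward_step(lambda s: [1e-8 * s], loss_scale=1024.0)
print(grads)  # ~[1e-08], recovered in float32
```

If the scale is pushed too high, the float16 gradients overflow to Inf and the step has to be skipped, which is exactly the NaN/Inf failure mode described above.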

If you just want to reduce memory usage, you could also try tf.to_bfloat16, which might converge better.

Discussions

No performance difference between Float16 and Float32 optimized TensorRT models
I am currently using the Python API for TensorRT (ver. 7.1.0) to convert from ONNX (ver. 1.9) to TensorRT. I have two models, one with weights, parameters and inputs in Float16, and another one with Float32. The model … More on forums.developer.nvidia.com
🌐 forums.developer.nvidia.com
July 29, 2021
Why to keep parameters in float32, why not in (b)float16?
I wonder if I should keep my model parameters in float16 or bfloat16? This is probably an orthogonal aspect to automatic mixed precision / autocast, or maybe mixed precision does not make sense anymore then? But leaving that aside, why would you not do this? Is there any downside? More on discuss.pytorch.org
🌐 discuss.pytorch.org
May 15, 2023
Massive performance penalty for Float16 compared to Float32
So there is a small neural net, and it takes time, and I change the numerical type from Float64 to Float32 and get like 30% speed up. Great, I think, and try Float16. That didn’t go so well. Time went up by a factor of 100 compared to Float64. Time. Not speed. More on discourse.julialang.org
🌐 discourse.julialang.org
November 3, 2017
python - The real difference between float32 and float64 - Stack Overflow
I want to understand the actual difference between float16 and float32 in terms of the result precision. For instance, NumPy allows you to choose the range of the datatype you want (np.float16, np. More on stackoverflow.com
🌐 stackoverflow.com
🌐
Theaiedge
newsletter.theaiedge.io › p › float32-vs-float16-vs-bfloat16
Float32 vs Float16 vs BFloat16? - by Damien Benveniste
July 19, 2024 - Those are just different levels of precision. Float32 is a way to represent a floating point number with 32 bits (1 or 0), and Float16 / BFloat16 is a way to represent the same number with just 16 bits.
🌐
Medium
medium.com › @manyi.yim › bfloat16-vs-float32-vs-float16-back-to-the-basics-80d4aec49ca8
dtypes of tensors: bfloat16 vs float32 vs float16 | by Manyi | Medium
July 27, 2024 - bfloat16 is a shortened version of the 32-bit IEEE 754 single-precision floating-point format (float32). It preserves the dynamic range of float32 numbers by retaining 8 exponent bits and allows for fast conversion to and from a float32 number.
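The "fast conversion" the snippet mentions can be seen in pure Python: a bfloat16 is just the top 16 bits of a float32 bit pattern, so converting is a shift (this sketch truncates; real hardware usually rounds to nearest even).

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    # Keep the top 16 bits of the float32 pattern:
    # 1 sign + 8 exponent + 7 mantissa bits survive.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bfloat16_bits(b: int) -> float:
    # Expanding back to float32 is just the opposite shift.
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# All 8 exponent bits survive, so float32's dynamic range is preserved ...
print(from_bfloat16_bits(to_bfloat16_bits(3.0e38)))  # still finite, ~3e38
# ... while only 7 mantissa bits remain (~2-3 decimal digits).
print(from_bfloat16_bits(to_bfloat16_bits(1.2345)))  # 1.234375
```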
🌐
LinkedIn
linkedin.com › pulse › float32-vs-float16-bfloat16-damien-benveniste-av3oc
Float32 vs Float16 vs BFloat16?
July 19, 2024 - Those are just different levels of precision. Float32 is a way to represent a floating point number with 32 bits (1 or 0), and Float16 / BFloat16 is a way to represent the same number with just 16 bits.
🌐
YouTube
youtube.com › watch
What are Float32, Float16 and BFloat16 Data Types? - YouTube
Float32, Float16 or BFloat16! Why does that matter for Deep Learning? Those are just different levels of precision. Float32 is a way to represent a floating ...
Published July 19, 2024
🌐
Python⇒Speed
pythonspeed.com › articles › float64-float32-precision
The problem with float32: you only get 16 million values
February 1, 2023 - But it does so at a cost: float32 can only store a much smaller range of numbers, with less precision.
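One concrete consequence of float32's 24-bit significand, sketched in NumPy: consecutive integers stop being representable at 2**24 = 16,777,216, roughly the "16 million values" in the headline.

```python
import numpy as np

# float32 carries 24 significant bits, so integers are exact
# only up to 2**24 = 16,777,216.
limit = np.float32(2**24)
print(np.float32(2**24 - 1) + np.float32(1) == limit)  # True: exact below the limit
print(limit + np.float32(1) == limit)  # True: 16,777,217 is not representable
```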
🌐
ResearchGate
researchgate.net › figure › Comparison-of-the-float32-bfloat16-and-float16-numerical-formats-The-bfloat16-format_fig4_366410363
Comparison of the float32, bfloat16, and float16 numerical formats. The... | Download Scientific Diagram
Download scientific diagram | Comparison of the float32, bfloat16, and float16 numerical formats. The bfloat16 format implements the same range as the float32 format but with lower precision.
🌐
Massed Compute
massedcompute.com › home › faq answers
What are the key differences between float16 and float32 data types in matrix operations? - Massed Compute
July 31, 2025 - Explore the key differences between float16 and float32 in matrix operations, including precision and performance implications.
🌐
TensorFlow
tensorflow.org › tensorflow core › mixed precision
Mixed precision | TensorFlow Core
March 23, 2024 - NVIDIA GPUs can run operations in float16 faster than in float32, and TPUs and supporting Intel CPUs can run operations in bfloat16 faster than float32. Therefore, these lower-precision dtypes should be used whenever possible on those devices.
🌐
GitHub
github.com › xbeat › Machine-Learning › blob › main › Exploring Float32, Float16, and BFloat16 for Deep Learning in Python.md
Machine-Learning/Exploring Float32, Float16, and BFloat16 for Deep Learning in Python.md at main · xbeat/Machine-Learning
To balance accuracy and performance, mixed precision training uses a combination of float types. Typically, Float16 or BFloat16 is used for forward and backward passes, while Float32 is used for weight updates and accumulations.
Author xbeat
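The float32-for-weight-updates part is easy to demonstrate in NumPy (a toy sketch, not the repo's code): a typical learning-rate-sized step vanishes when applied to a float16 weight, but accumulates in a float32 master copy.

```python
import numpy as np

# A small update is lost entirely when applied in float16 ...
w16 = np.float16(1.0)
print(w16 - np.float16(1e-4) == w16)  # True: the step rounds away

# ... but accumulates when the master weight is float32,
# even though the gradient itself arrives in float16.
master = np.float32(1.0)
grad16 = np.float16(1e-4)
for _ in range(100):
    master -= np.float32(grad16)  # update applied in float32
print(master)  # ~0.99 after 100 steps
```

This is why mixed-precision schemes keep a float32 copy for accumulation even when all the compute happens in half precision.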
🌐
ClickHouse
clickhouse.com › introduction
Float32 | Float64 | BFloat16 Types | ClickHouse Docs
ClickHouse supports conversions between Float32 and BFloat16 which can be done using the toFloat32() or toBFloat16 functions.
🌐
PyTorch Forums
discuss.pytorch.org › vision
Why is there such a huge performance gap between bfloat16, float16, and float32? - vision - PyTorch Forums
April 28, 2025 - I am trying to reduce the hard disk and memory usage of my model through quantization. The original type of the model is bfloat16. I am trying to perform a forced conversion test on the model using this code to test its performance after conversion: def convert_bf16_fp16_to_fp32(model): for param in model.parameters(): if param.dtype == torch.bfloat16 or param.dtype == torch.float16: param.data = param.data.to(dtype=torch.float16) for buffer in model.buffers(): if buffer.dtype ==...
🌐
Hugging Face
discuss.huggingface.co › 🤗transformers
Loading in Float32 vs Float16 has very different speed - 🤗Transformers - Hugging Face Forums
February 20, 2025 - I am facing huge issues when trying to load a model in float16/bfloat16. Essentially, if I load the model in float16 it gets stuck. If I try loading it in float32 it is very quick and works. This is the code that I am using, and the only thing changing is the dtype passed.