Surprisingly, it's perfectly fine to use 16 bits, not just for fun but in production as well. For example, in this video Jeff Dean talks about 16-bit calculations at Google, around 52:00. A quote from the slides:

Neural net training very tolerant of reduced precision

Since GPU memory is the main bottleneck in ML computation, there has been a lot of research on precision reduction. For example:

  • Gupta et al.'s paper "Deep Learning with Limited Numerical Precision", about fixed-point (not floating-point) 16-bit training, but with stochastic rounding.

  • Courbariaux et al.'s "Training Deep Neural Networks with Low Precision Multiplications", about 10-bit activations and 12-bit parameter updates.

  • And this is not the limit. Courbariaux et al., "BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1", discusses 1-bit activations and weights (though with higher precision for the gradients), which makes the forward pass very fast.

Of course, I can imagine that some networks may require high precision for training, but I would recommend at least trying 16 bits when training a big network, and switching to 32 bits if it proves to work worse.

Answer from Maxim on Stack Overflow
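A quick NumPy sketch (not part of the original answer) of the trade-off being described: float16 halves the memory of float32 but gives up both precision and range.

```python
import numpy as np

# float16 has a 10-bit significand: increments below ~0.0005 of a
# value near 1.0 are rounded away entirely.
print(np.float16(1.0) + np.float16(0.0001) == np.float16(1.0))  # True

# Its range is narrow too: the largest finite float16 is 65504.
print(np.float16(70000))  # inf

# The payoff is memory: half the bytes of float32.
weights32 = np.ones(1_000_000, dtype=np.float32)
weights16 = weights32.astype(np.float16)
print(weights32.nbytes, weights16.nbytes)  # 4000000 2000000
```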
🌐
Reddit
reddit.com › r/programminglanguages › float16 vs float32
r/ProgrammingLanguages on Reddit: Float16 vs Float32
August 6, 2019

Hi

In my toy language, I want to have a single float type, "decimal". I am not sure if I should go with f16 or f32 internally.

I assume f32 will take more memory, but in today's world, is that even relevant?

I also read somewhere that GPUs don't support f32 and I need to have f16 anyway if I want to use any UI libraries.

At this point, I really am not sure which I should go for. I really want to keep a single floating-point type. My language is not targeted at IoT devices, and performance is one of the goals.


float16 training is tricky: your model might not converge when using standard float16, but float16 does save memory, and it is also faster if you are using the latest Volta GPUs. NVIDIA recommends "Mixed Precision Training" in its latest docs and paper.

To make good use of float16, you need to choose the loss_scale manually and carefully. If loss_scale is too large, you may get NaNs and Infs; if it is too small, the model might not converge. Unfortunately, there is no single loss_scale that works for all models, so you have to choose it carefully for your specific model.
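To illustrate what loss_scale buys you, here is a minimal NumPy sketch (the function and names are illustrative, not TensorFlow API): gradients too small for float16 are flushed to zero, but survive if the loss (and therefore every gradient) is multiplied by the scale before the backward pass and divided out in float32 afterwards.

```python
import numpy as np

def scaled_backward_step(grad_fn, loss_scale):
    """Illustrative loss-scaling step: get float16 gradients of the
    scaled loss, then unscale them in float32."""
    scaled_grads = np.asarray(grad_fn(loss_scale), dtype=np.float16)
    if not np.all(np.isfinite(scaled_grads)):
        return None  # overflow (Infs/NaNs): skip the step, lower the scale
    return scaled_grads.astype(np.float32) / loss_scale

# A "true" gradient of 1e-8 underflows float16 (smallest subnormal ~6e-8) ...
print(np.float16(1e-8))  # 0.0
# ... but survives when the loss is scaled by 1024 before differentiation.
grads = scaled_backward_step(lambda s: [1e-8 * s], loss_scale=1024.0)
print(grads)  # ~[1e-08], recovered in float32
```

If the scale is pushed too high, the float16 gradients overflow to Inf and the step has to be skipped, which is exactly the NaN/Inf failure mode described above.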

If you just want to reduce memory usage, you could also try tf.to_bfloat16, which might converge better.

Discussions

No performance difference between Float16 and Float32 optimized TensorRT models
I am currently using the Python API for TensorRT (ver. 7.1.0) to convert from ONNX (ver. 1.9) to TensorRT. I have two models, one with weights, parameters and inputs in Float16, and another one with Float32. The model … More on forums.developer.nvidia.com
🌐 forums.developer.nvidia.com
July 29, 2021
Why to keep parameters in float32, why not in (b)float16?
I wonder if I should keep my model parameters in float16 or bfloat16? This is probably an orthogonal aspect to automatic mixed precision / autocast, or maybe mixed precision does not make sense anymore then? But leaving that aside, why would you not do this? Is there any downside? More on discuss.pytorch.org
🌐 discuss.pytorch.org
May 15, 2023
Massive performance penalty for Float16 compared to Float32
So there is a small neural net, and it takes time, and I change the numerical type from Float64 to Float32 and get like 30% speed up. Great, I think, and try Float16. That didn’t go so well. Time went up by a factor of 100 compared to Float64. Time. Not speed. More on discourse.julialang.org
🌐 discourse.julialang.org
November 3, 2017
python - The real difference between float32 and float64 - Stack Overflow
I want to understand the actual difference between float16 and float32 in terms of the result precision. For instance, NumPy allows you to choose the range of the datatype you want (np.float16, np. More on stackoverflow.com
🌐 stackoverflow.com
🌐
Theaiedge
newsletter.theaiedge.io › p › float32-vs-float16-vs-bfloat16
Float32 vs Float16 vs BFloat16? - by Damien Benveniste
July 19, 2024 - Those are just different levels of precision. Float32 is a way to represent a floating point number with 32 bits (1 or 0), and Float16 / BFloat16 is a way to represent the same number with just 16 bits.
🌐
Medium
medium.com › @manyi.yim › bfloat16-vs-float32-vs-float16-back-to-the-basics-80d4aec49ca8
dtypes of tensors: bfloat16 vs float32 vs float16 | by Manyi | Medium
July 27, 2024 - bfloat16 is a shortened version of the 32-bit IEEE 754 single-precision floating-point format (float32). It preserves the dynamic range of float32 numbers by retaining 8 exponent bits and allows for fast conversion to and from a float32 number.
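The "fast conversion" the snippet mentions can be seen in pure Python: a bfloat16 is just the top 16 bits of a float32 bit pattern, so converting is a shift (this sketch truncates; real hardware usually rounds to nearest even).

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    # Keep the top 16 bits of the float32 pattern:
    # 1 sign + 8 exponent + 7 mantissa bits survive.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bfloat16_bits(b: int) -> float:
    # Expanding back to float32 is just the opposite shift.
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# All 8 exponent bits survive, so float32's dynamic range is preserved ...
print(from_bfloat16_bits(to_bfloat16_bits(3.0e38)))  # still finite, ~3e38
# ... while only 7 mantissa bits remain (~2-3 decimal digits).
print(from_bfloat16_bits(to_bfloat16_bits(1.2345)))  # 1.234375
```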
🌐
LinkedIn
linkedin.com › pulse › float32-vs-float16-bfloat16-damien-benveniste-av3oc
Float32 vs Float16 vs BFloat16?
July 19, 2024 - Those are just different levels of precision. Float32 is a way to represent a floating point number with 32 bits (1 or 0), and Float16 / BFloat16 is a way to represent the same number with just 16 bits.
🌐
YouTube
youtube.com › watch
What are Float32, Float16 and BFloat16 Data Types? - YouTube
Float32, Float16 or BFloat16! Why does that matter for Deep Learning? Those are just different levels of precision. Float32 is a way to represent a floating ...
Published July 19, 2024
🌐
Python⇒Speed
pythonspeed.com › articles › float64-float32-precision
The problem with float32: you only get 16 million values
February 1, 2023 - But it does so at a cost: float32 can only store a much smaller range of numbers, with less precision.
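One concrete consequence of float32's 24-bit significand, sketched in NumPy: consecutive integers stop being representable at 2**24 = 16,777,216, roughly the "16 million values" in the headline.

```python
import numpy as np

# float32 carries 24 significant bits, so integers are exact
# only up to 2**24 = 16,777,216.
limit = np.float32(2**24)
print(np.float32(2**24 - 1) + np.float32(1) == limit)  # True: exact below the limit
print(limit + np.float32(1) == limit)  # True: 16,777,217 is not representable
```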
🌐
ResearchGate
researchgate.net › figure › Comparison-of-the-float32-bfloat16-and-float16-numerical-formats-The-bfloat16-format_fig4_366410363
Comparison of the float32, bfloat16, and float16 numerical formats. The... | Download Scientific Diagram
Download scientific diagram | Comparison of the float32, bfloat16, and float16 numerical formats. The bfloat16 format implements the same range as the float32 format but with lower precision.
🌐
Massed Compute
massedcompute.com › home › faq answers
What are the key differences between float16 and float32 data types in matrix operations? - Massed Compute
July 31, 2025 - Explore the key differences between float16 and float32 in matrix operations, including precision and performance implications.
🌐
TensorFlow
tensorflow.org › tensorflow core › mixed precision
Mixed precision | TensorFlow Core
March 23, 2024 - NVIDIA GPUs can run operations in float16 faster than in float32, and TPUs and supporting Intel CPUs can run operations in bfloat16 faster than float32. Therefore, these lower-precision dtypes should be used whenever possible on those devices.
🌐
GitHub
github.com › xbeat › Machine-Learning › blob › main › Exploring Float32, Float16, and BFloat16 for Deep Learning in Python.md
Machine-Learning/Exploring Float32, Float16, and BFloat16 for Deep Learning in Python.md at main · xbeat/Machine-Learning
To balance accuracy and performance, mixed precision training uses a combination of float types. Typically, Float16 or BFloat16 is used for forward and backward passes, while Float32 is used for weight updates and accumulations.
Author xbeat
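The float32-for-weight-updates part is easy to demonstrate in NumPy (a toy sketch, not the repo's code): a typical learning-rate-sized step vanishes when applied to a float16 weight, but accumulates in a float32 master copy.

```python
import numpy as np

# A small update is lost entirely when applied in float16 ...
w16 = np.float16(1.0)
print(w16 - np.float16(1e-4) == w16)  # True: the step rounds away

# ... but accumulates when the master weight is float32,
# even though the gradient itself arrives in float16.
master = np.float32(1.0)
grad16 = np.float16(1e-4)
for _ in range(100):
    master -= np.float32(grad16)  # update applied in float32
print(master)  # ~0.99 after 100 steps
```

This is why mixed-precision schemes keep a float32 copy for accumulation even when all the compute happens in half precision.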
🌐
ClickHouse
clickhouse.com › introduction
Float32 | Float64 | BFloat16 Types | ClickHouse Docs
ClickHouse supports conversions between Float32 and BFloat16 which can be done using the toFloat32() or toBFloat16 functions.
🌐
PyTorch Forums
discuss.pytorch.org › vision
Why is there such a huge performance gap between bfloat16, float16, and float32? - vision - PyTorch Forums
April 28, 2025 - I am trying to reduce the hard disk and memory usage of my model through quantization. The original type of the model is bfloat16. I am trying to perform a forced conversion test on the model using this code to test its performance after conversion: def convert_bf16_fp16_to_fp32(model): for param in model.parameters(): if param.dtype == torch.bfloat16 or param.dtype == torch.float16: param.data = param.data.to(dtype=torch.float16) for buffer in model.buffers(): if buffer.dtype ==...
🌐
Hugging Face
discuss.huggingface.co › 🤗transformers
Loading in Float32 vs Float16 has very different speed - 🤗Transformers - Hugging Face Forums
February 20, 2025 - I am facing huge issues when trying to load a model in float16/bfloat16. Essentially, if I load the model in float16 it gets stuck. If I try loading it in float32 it is very quick and works. This is the code that I am using, and the only thing changing is the dtype passed.