bfloat16 is generally easier to use, because it works as a drop-in replacement for float32. If your code doesn't create nan/inf numbers or turn a non-0 into a 0 with float32, then it shouldn't do it with bfloat16 either, roughly speaking. So, if your hardware supports it, I'd pick that.

Check out AMP if you choose float16.

Answer from MWB on Stack Overflow
🌐
Medium
medium.com › @manyi.yim › bfloat16-vs-float32-vs-float16-back-to-the-basics-80d4aec49ca8
dtypes of tensors: bfloat16 vs float32 vs float16 | by Manyi | Medium
July 27, 2024 - It preserves the dynamic range of float32 numbers by retaining 8 exponent bits and allows for fast conversion to and from a float32 number. While bfloat16 uses the same number of bits as float16, it has a wider dynamic range but lower precision.

🌐
LinkedIn
linkedin.com › pulse › float32-vs-float16-bfloat16-damien-benveniste-av3oc
Float32 vs Float16 vs BFloat16?
July 19, 2024 - Those are just different levels of precision. Float32 represents a floating-point number with 32 bits (each a 1 or a 0), while Float16 / BFloat16 represent the same number with just 16 bits.
🌐
Wikipedia
en.wikipedia.org › wiki › Bfloat16_floating-point_format
bfloat16 floating-point format - Wikipedia
1 week ago - The bfloat16 format, being a shortened IEEE 754 single-precision 32-bit float, allows for fast conversion to and from an IEEE 754 single-precision 32-bit float; in conversion to the bfloat16 format, the exponent bits are preserved while the significand field can be reduced by truncation (thus ...
🌐
PyTorch Forums
discuss.pytorch.org › vision
Why is there such a huge performance gap between bfloat16, float16, and float32? - vision - PyTorch Forums
April 28, 2025 - I am trying to reduce the hard disk and memory usage of my model through quantization. The original type of the model is bfloat16. I am trying to perform a forced conversion test on the model using this code to test its performance after conversion ...
🌐
GitHub
github.com › stas00 › ml-ways › blob › master › numbers › bfloat16-vs-float16-study.ipynb
ml-ways/numbers/bfloat16-vs-float16-study.ipynb at master · stas00/ml-ways
"This is the main function, that tries to do very simply increments in `bfloat16` and then converting the result to `float16` and showing the discrepancies."
Author   stas00
🌐
Nick Higham
nhigham.com › 2018 › 12 › 03 › half-precision-arithmetic-fp16-versus-bfloat16
Half Precision Arithmetic: fp16 Versus bfloat16 – Nick Higham
April 23, 2020 - With a small modification, I can make the Julia code type stable. In performance testing with 1000 iterations, BFloat16 is about 5x slower than Float64, and Float16 is significantly slower still.
🌐
Theaiedge
newsletter.theaiedge.io › p › float32-vs-float16-vs-bfloat16
Float32 vs Float16 vs BFloat16? - by Damien Benveniste
July 19, 2024 - Those are just different levels of precision. Float32 represents a floating-point number with 32 bits (each a 1 or a 0), while Float16 / BFloat16 represent the same number with just 16 bits.
🌐
NVIDIA Developer Forums
forums.developer.nvidia.com › accelerated computing › cuda › cuda programming and performance
Difference in SM performance of float16 and bfloat16 - CUDA Programming and Performance - NVIDIA Developer Forums
August 7, 2024 - CUDA C++ Programming Guide (nvidia.com) states that for Compute Capability 8.0 and 8.6, the throughput of "16-bit floating-point add, multiply, multiply-add" arithmetic instructions differs between fp16 (256 results per clock cycle per SM) and bfloat16 (128 results).
🌐
Cerebras
cerebras.ai › blog › to-bfloat-or-not-to-bfloat-that-is-the-question
To Bfloat or not to Bfloat? That is the Question! - Cerebras
March 11, 2025 - Our experiments demonstrated that choosing bfloat16 is beneficial over pure float32 or a mixed version with float16. It improves training efficiency, uses less memory during training, and saves space while maintaining the same accuracy.
🌐
Beam
beam.cloud › blog › bf16-vs-fp16
BF16 vs FP16: A Comparison of Performance and Efficiency
April 14, 2025
🌐
PyTorch Forums
discuss.pytorch.org › mixed-precision
Why to keep parameters in float32, why not in (b)float16? - mixed-precision - PyTorch Forums
May 15, 2023 - I wonder if I should keep my model parameters in float16 or bfloat16? This is probably an orthogonal aspect to automatic mixed precision / autocast, or maybe mixed precision does not make sense anymore then? But leaving that aside, why would you not do this? Is there any downside?
🌐
Reddit
reddit.com › r/localllama › reflection and the never-ending confusion between fp16 and bf16
r/LocalLLaMA on Reddit: Reflection and the Never-Ending Confusion Between FP16 and BF16
September 9, 2024

Let’s set aside the API drama for a moment. This topic deserves careful consideration, as I keep seeing the same mistake made repeatedly.

The author of Reflection is facing issues with the model uploaded to Hugging Face. After three different uploads, the model on Hugging Face still performs much worse than what the author claims it is capable of. People have tested it, and it is underperforming even compared to the baseline LLaMA 3.1 70B.

I’m not sure if Reflection is a scam or not, but there’s a significant issue with the weights.

  • LLaMA 3.1 70B was trained using BF16, and the weights are uploaded in BF16: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct

  • Reflection 70B was converted into FP16: https://huggingface.co/mattshumer/ref_70_e3

Does this make a difference? Yes, it makes a massive difference. BF16 and FP16 are very different formats, and they are not compatible. You cannot convert a BF16 model to FP16 without losing a lot of information.

FP16 has a 5-bit exponent and a 10-bit mantissa, while BF16 has an 8-bit exponent and a 7-bit mantissa. There is no way to convert a BF16 model to FP16, or vice versa, without significant loss of information, and the BF16 to FP16 conversion is especially damaging. FP16 is not suitable for neural networks unless you use a complex mixed-precision training approach (https://arxiv.org/abs/1710.03740). On the other hand, BF16 (short for Brain Float 16), developed by Google Brain, works out of the box for training neural networks.

FP16 was used in the early days for encoder-only models like BERT and RoBERTa, which were typically run in FP16. However, T5 was released in BF16, and since then, no other major model has used FP16 because it simply doesn’t work well. The only reason FP16 was used in the past is that Nvidia didn’t support BF16 until the A100 GPU came out. Google TPUs, however, had BF16 support, which is why T5 was trained in BF16.

I’m bringing this up because, despite FP16 being a dead format and BF16 being the format used for every big model, many people still confuse them. This seems to have happened to the author of Reflection. Please do not use FP16, and above all, do not attempt to convert BF16 weights into FP16; it will ruin your model.

Top answer (1 of 5, score 48)
While everything in your post is technically accurate, it does feel like you are wildly exaggerating just how destructive the BF16 to FP16 conversion is. According to tests performed by llama.cpp developers, the perplexity difference between BF16 and FP16 is literally 10x smaller than even that between FP16 and Q8. And while perplexity is not a perfect measurement by any means, it certainly points toward the conversion not being remotely as catastrophic as you make it out to be.

And honestly, it makes sense that it wouldn't really make that much of a difference in practice. BF16's main advantage is that it can represent some extremely high values that FP16 cannot, which matter during training, but trained checkpoints usually do not end up with a lot of those values in the first place. And since FP16 actually has higher precision in terms of decimal places, you don't lose anything in that regard during the conversion.

Also, it's worth pointing out that llama.cpp still converts models to FP16 by default before they get quantized to other formats. You have to go out of your way to keep the model in BF16. So most GGUFs found on HF are likely based on an FP16 conversion. If that actually led to major downgrades in performance, that default would have been changed ages ago, but it hasn't, precisely because no evidence has been produced that it actually does.
Reply (2 of 5, score 24)
While you make a valid and important point about floating-point formats in general, let's not set aside the API drama in this specific case. Apply some Bayesian reasoning: if Shumer has been conclusively shown to be profoundly misrepresenting his work on several vital points (like which base model is used, its size, and whether it is open source), that is highly informative for whether we should look to innocent format mixups as the explanation for the lack of replication of the claimed results for the uploaded model.
🌐
GitHub
github.com › bitsandbytes-foundation › bitsandbytes › issues › 1030
Float16 vs Bfloat16 when doing 4-bit and 8-bit quantization or doing half precision - I am asking torch_type · Issue #1030 · bitsandbytes-foundation/bitsandbytes
February 4, 2024 - Which one should we prefer when doing quantization? When doing 4-bit: float16 or bfloat16? When doing 8-bit: float16 or bfloat16? When doing half-precision 16-bit: float16 or bfloat16? torch_type = torch.float16 vs torch_type = torch.b...
Author   FurkanGozukara
🌐
Hugging Face
huggingface.co › PygmalionAI › pygmalion-7b › discussions › 3
PygmalionAI/pygmalion-7b · Reasoning behind bfloat instead of float?
April 30, 2023 - Hi there! So, the reason we were using bfloat16 during our training is due to our usage of FSDP for our model parallelism scheme. Using FP16 with FSDP results in overflows/underflows showing up during training, which obviously leads to problems - this is why we had to use bfloat16.
🌐
Substack
kaitchup.substack.com › p › mind-your-data-type-bf16-vs-fp16
BF16 vs. FP16 vs. FP32 for Gemma 3 Inference — Mind Your Data Type
March 17, 2025 - It matches float32 while halving memory requirements. Most modern GPUs now support bfloat16, but many deployed older GPUs lack native support. Additionally, some CUDA kernels are optimized specifically for float16 inference, leading to better ...