bfloat16 is generally easier to use, because it works as a drop-in replacement for float32. If your code doesn't create nan/inf numbers or turn a non-0 into a 0 with float32, then it shouldn't do it with bfloat16 either, roughly speaking. So, if your hardware supports it, I'd pick that.

Check out AMP if you choose float16.

Answer from MWB on Stack Overflow
🌐
Medium
medium.com › @manyi.yim › bfloat16-vs-float32-vs-float16-back-to-the-basics-80d4aec49ca8
dtypes of tensors: bfloat16 vs float32 vs float16 | by Manyi | Medium
July 27, 2024 - It preserves the dynamic range of float32 numbers by retaining 8 exponent bits and allows for fast conversion to and from a float32 number. While bfloat16 uses the same number of bits as float16, it has a wider dynamic range but lower precision.

🌐
LinkedIn
linkedin.com › pulse › float32-vs-float16-bfloat16-damien-benveniste-av3oc
Float32 vs Float16 vs BFloat16?
July 19, 2024 - Those are just different levels of precision. Float32 represents a floating-point number with 32 bits (each a 1 or a 0), while Float16 / BFloat16 represent the same number with just 16 bits.
🌐
Wikipedia
en.wikipedia.org › wiki › Bfloat16_floating-point_format
bfloat16 floating-point format - Wikipedia
1 week ago - The bfloat16 format, being a shortened IEEE 754 single-precision 32-bit float, allows for fast conversion to and from an IEEE 754 single-precision 32-bit float; in conversion to the bfloat16 format, the exponent bits are preserved while the significand field can be reduced by truncation (thus ...
🌐
PyTorch Forums
discuss.pytorch.org › vision
Why is there such a huge performance gap between bfloat16, float16, and float32? - vision - PyTorch Forums
April 28, 2025 - I am trying to reduce the hard disk and memory usage of my model through quantization. The original type of the model is bfloat16. I am trying to perform a forced conversion test on the model using this code to test its performance after conversion ...
🌐
GitHub
github.com › stas00 › ml-ways › blob › master › numbers › bfloat16-vs-float16-study.ipynb
ml-ways/numbers/bfloat16-vs-float16-study.ipynb at master · stas00/ml-ways
"This is the main function, that tries to do very simply increments in `bfloat16` and then converting the result to `float16` and showing the discrepancies."
Author   stas00
🌐
Nick Higham
nhigham.com › 2018 › 12 › 03 › half-precision-arithmetic-fp16-versus-bfloat16
Half Precision Arithmetic: fp16 Versus bfloat16 – Nick Higham
April 23, 2020 - With a small modification, I can make the Julia code type stable. In performance testing with 1000 iterations, BFloat16 is about 5x slower than Float64, and Float16 is significantly slower still.
🌐
Theaiedge
newsletter.theaiedge.io › p › float32-vs-float16-vs-bfloat16
Float32 vs Float16 vs BFloat16? - by Damien Benveniste
July 19, 2024 - Those are just different levels of precision. Float32 represents a floating-point number with 32 bits (each a 1 or a 0), while Float16 / BFloat16 represent the same number with just 16 bits.
🌐
NVIDIA Developer Forums
forums.developer.nvidia.com › accelerated computing › cuda › cuda programming and performance
Difference in SM performance of float16 and bfloat16 - CUDA Programming and Performance - NVIDIA Developer Forums
August 7, 2024 - CUDA C++ Programming Guide (nvidia.com) states that for Compute Capability 8.0 and 8.6, the throughput of "16-bit floating-point add, multiply, multiply-add" arithmetic instructions differs between fp16 (256 results per clock cycle per SM) and bfloat16 (128 results).
🌐
Cerebras
cerebras.ai › blog › to-bfloat-or-not-to-bfloat-that-is-the-question
To Bfloat or not to Bfloat? That is the Question! - Cerebras
March 11, 2025 - Our experiments demonstrated that choosing bfloat16 is beneficial over pure float32 or a mixed version with float16. It improves training efficiency, uses less memory during training, and saves space while maintaining the same accuracy.
🌐
Beam
beam.cloud › blog › bf16-vs-fp16
BF16 vs FP16: A Comparison of Performance and Efficiency
April 14, 2025
🌐
PyTorch Forums
discuss.pytorch.org › mixed-precision
Why to keep parameters in float32, why not in (b)float16? - mixed-precision - PyTorch Forums
May 15, 2023 - I wonder if I should keep my model parameters in float16 or bfloat16? This is probably an orthogonal aspect to automatic mixed precision / autocast, or maybe mixed precision does not make sense anymore then? But leaving that aside, why would you not do this? Is there any downside?
🌐
Reddit
reddit.com › r/localllama › reflection and the never-ending confusion between fp16 and bf16
r/LocalLLaMA on Reddit: Reflection and the Never-Ending Confusion Between FP16 and BF16
September 9, 2024

Let’s set aside the API drama for a moment. This topic deserves careful consideration, as I keep seeing the same mistake made repeatedly.

The author of Reflection is facing issues with the model uploaded to Hugging Face. After three different uploads, the model on Hugging Face still performs much worse than what the author claims it is capable of. People have tested it, and it is underperforming even compared to the baseline LLaMA 3.1 70B.

I’m not sure if Reflection is a scam or not, but there’s a significant issue with the weights.

  • LLaMA 3.1 70B was trained using BF16, and the weights are uploaded in BF16: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct

  • Reflection 70B was converted into FP16: https://huggingface.co/mattshumer/ref_70_e3

Does this make a difference? Yes, it makes a massive difference. BF16 and FP16 are very different formats, and they are not compatible. You cannot convert a BF16 model to FP16 without losing a lot of information.

FP16 has a 5-bit exponent and a 10-bit mantissa, while BF16 has an 8-bit exponent and a 7-bit mantissa. There is no way to convert a BF16 model to FP16, or vice versa, without significant loss of information, and the BF16 to FP16 conversion is especially damaging. FP16 is not suitable for neural networks unless you use a complex mixed-precision training approach (https://arxiv.org/abs/1710.03740). On the other hand, BF16 (short for Brain Float 16), developed by Google Brain, works out of the box for training neural networks.

FP16 was used in the early days for encoder-only models like BERT and RoBERTa, which were typically run in FP16. However, T5 was released in BF16, and since then, no other major model has used FP16 because it simply doesn’t work well. The only reason FP16 was used in the past is that Nvidia didn’t support BF16 until the A100 GPU came out. Google TPUs, however, had BF16 support, which is why T5 was trained in BF16.

I’m bringing this up because, despite FP16 being a dead format and BF16 being the format used for every big model, many people still confuse them. This seems to have happened to the author of Reflection. Please do not use FP16, and above all, do not attempt to convert BF16 weights into FP16; it will ruin your model.

Top answer (1 of 5, score 48)
While everything in your post is technically accurate, it does feel like you are wildly exaggerating just how destructive the BF16 to FP16 conversion is. According to tests performed by llama.cpp developers, the perplexity difference between BF16 and FP16 is literally 10x smaller than even that between FP16 and Q8. And while perplexity is not a perfect measurement by any means, it certainly points toward the conversion not being remotely as catastrophic as you make it out to be.

And honestly, it makes sense that it wouldn't really make that much of a difference in practice. BF16's main advantage is that it can represent some extremely high values that FP16 cannot, which matter during training, but trained checkpoints usually do not end up with a lot of those values in the first place. And since FP16 actually has higher precision in terms of decimal places, you don't lose anything in that regard during the conversion.

Also, it's worth pointing out that llama.cpp still converts models to FP16 by default before they get quantized to other formats. You have to go out of your way to keep the model in BF16. So most GGUFs found on HF are likely based on an FP16 conversion. If that actually led to major downgrades in performance, that default would have been changed ages ago, but it hasn't, precisely because no evidence has been produced that it actually does.
Reply (2 of 5, score 24)
While you make a valid and important point about floating-point formats in general, let's not set aside the API drama in this specific case. Apply some Bayesian reasoning: if Shumer has been conclusively shown to be profoundly misrepresenting his work on several vital points (like which base model is used, its size, and whether it is open source), that is highly informative for whether we should look to innocent format mixups as the explanation for the lack of replication of the claimed results for the uploaded model.
🌐
GitHub
github.com › bitsandbytes-foundation › bitsandbytes › issues › 1030
Float16 vs Bfloat16 when doing 4-bit and 8-bit quantization or doing half precision - I am asking torch_type · Issue #1030 · bitsandbytes-foundation/bitsandbytes
February 4, 2024 - Which one should we prefer when doing quantization? When doing 4-bit: float16 or bfloat16? When doing 8-bit: float16 or bfloat16? When doing half-precision 16-bit: float16 or bfloat16? torch_type = torch.float16 vs torch_type = torch.b...
Author   FurkanGozukara
🌐
Hugging Face
huggingface.co › PygmalionAI › pygmalion-7b › discussions › 3
PygmalionAI/pygmalion-7b · Reasoning behind bfloat instead of float?
April 30, 2023 - Hi there! So, the reason we were using bfloat16 during our training is due to our usage of FSDP for our model parallelism scheme. Using FP16 with FSDP results in overflows/underflows showing up during training, which obviously leads to problems - this is why we had to use bfloat16.
🌐
Substack
kaitchup.substack.com › p › mind-your-data-type-bf16-vs-fp16
BF16 vs. FP16 vs. FP32 for Gemma 3 Inference — Mind Your Data Type
March 17, 2025 - It matches float32 while halving memory requirements. Most modern GPUs now support bfloat16, but many deployed older GPUs lack native support. Additionally, some CUDA kernels are optimized specifically for float16 inference, leading to better ...