Why is there no 2-byte float and does an implementation already exist?

stackoverflow.com › questions › 5766882 › why-is-there-no-2-byte-float-and-does-an-implementation-already-exist

TL;DR: 16-bit floats do exist and there are various software as well as hardware implementations

There are currently 2 common standard 16-bit float formats: IEEE-754 binary16 and Google's bfloat16. Since they're standardized, obviously anyone who knows the spec can write an implementation. Some examples:

https://github.com/ramenhut/half
https://github.com/minhhn2910/cuda-half2
https://github.com/tianshilei1992/half_precision
https://github.com/acgessler/half_float

Or if you don't want to use them, you can also design a different 16-bit float format and implement it

2-byte floats are generally not used because even float's precision is often insufficient for ordinary computation, and double should be the default unless you're limited by bandwidth or cache size. Floating-point literals are also double when used without a suffix in C and C-like languages. See

Why are double preferred over float?
Should I use double or float?
When do you use float and when do you use double

However, floats narrower than 32 bits do exist. They're mainly used for storage, as in graphics, where 96 bits per pixel (32 bits per channel * 3 channels) would be far too wasteful; the values are converted to a normal 32-bit float for calculations (except on some special hardware). Various 10-, 11- and 14-bit float types exist in OpenGL. Many HDR formats use a 16-bit float for each channel, and Direct3D 9.0 as well as some GPUs like the Radeon R300 and R420 have a 24-bit float format. A 24-bit float is also supported by compilers for some 8-bit microcontrollers like PIC, where 32-bit float support is too costly. Float types of 8 bits or fewer are less useful, but because of their simplicity they're often taught in computer science curricula. A small float is also used in ARM's instruction encoding for small floating-point immediates.

The IEEE 754-2008 revision officially added a 16-bit float format, a.k.a. binary16 or half precision, with a 5-bit exponent and an 11-bit significand (10 bits stored explicitly plus an implicit leading bit).
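To make the binary16 layout concrete, here is a minimal sketch of decoding raw half-precision bits into a float. It assumes the platform's float is IEEE-754 binary32; the function name half_bits_to_float is made up for illustration and does not come from any of the libraries listed above.

```c
#include <stdint.h>
#include <string.h>

/* Convert IEEE-754 binary16 bits to a binary32 float (assumes float is binary32).
   Layout: 1 sign bit | 5 exponent bits (bias 15) | 10 stored fraction bits. */
float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t frac = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {
        /* Infinity or NaN: all-ones exponent, fraction bits carried over. */
        bits = sign | 0x7F800000u | (frac << 13);
    } else if (exp != 0) {
        /* Normal number: rebias the exponent from 15 to 127 (add 112). */
        bits = sign | ((exp + 112u) << 23) | (frac << 13);
    } else if (frac != 0) {
        /* Subnormal half: shift until the implicit bit appears;
           binary32 has enough range to store it as a normal number. */
        int shift = 0;
        while ((frac & 0x400u) == 0) {
            frac <<= 1;
            shift++;
        }
        frac &= 0x3FFu; /* drop the now-explicit leading bit */
        bits = sign | ((uint32_t)(113 - shift) << 23) | (frac << 13);
    } else {
        bits = sign; /* signed zero */
    }

    float f;
    memcpy(&f, &bits, sizeof f); /* bit copy avoids aliasing problems */
    return f;
}
```

For example, 0x3C00 decodes to 1.0 and 0x7BFF to 65504, the largest finite binary16 value. Going the other way (float to half) needs rounding and overflow handling, which is where a tested library is the safer choice.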

Some compilers have supported IEEE-754 binary16 for a while, but mainly for conversion or vectorized operations rather than computation (because it's not precise enough). For example, ARM's toolchain has __fp16, which comes in 2 variants, IEEE and alternative, depending on whether you want NaN/inf representations or more range. GCC and Clang also support __fp16 along with the standardized name _Float16. See How to enable __fp16 type on gcc for x86_64

Recently, due to the rise of AI, another format called bfloat16 (brain floating-point format) has become common. It is simply the top 16 bits of an IEEE-754 binary32 value.
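Because bfloat16 is just the upper half of a binary32, conversion can be sketched in a few lines. This is only an illustration, assuming float is IEEE-754 binary32; the function names are hypothetical, and real implementations usually round to nearest even instead of truncating as done here.

```c
#include <stdint.h>
#include <string.h>

/* Keep only the top 16 bits of a binary32 value.
   This truncates (rounds toward zero); it does not round to nearest. */
uint16_t float_to_bf16_truncate(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Widen bfloat16 bits back to binary32 by zero-filling the low 16 bits. */
float bf16_to_float(uint16_t b)
{
    uint32_t bits = (uint32_t)b << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Round-tripping 3.14159f through these two functions gives 3.140625: the sign and the full 8-bit exponent survive, but only 7 stored fraction bits remain.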

The motivation behind the reduced mantissa is derived from Google's experiments that showed that it is fine to reduce the mantissa so long as it's still possible to represent tiny values closer to zero as part of the summation of small differences during training. A smaller mantissa brings a number of other advantages, such as reducing the multiplier power and physical silicon area.

float32: 24²=576 (100%)

float16: 11²=121 (21%)

bfloat16: 8²=64 (11%)
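The percentages above follow from assuming that multiplier area scales with the square of the significand width (including the implicit bit). A small sketch of that arithmetic, with made-up function names:

```c
/* Assumption from the figures above: multiplier area grows with the
   square of the significand width in bits. */
unsigned multiplier_area(unsigned significand_bits)
{
    return significand_bits * significand_bits;
}

/* Area relative to binary32 (24-bit significand), rounded to a whole percent. */
unsigned area_percent_of_float32(unsigned significand_bits)
{
    unsigned ref = multiplier_area(24); /* 576 */
    return (100u * multiplier_area(significand_bits) + ref / 2) / ref;
}
```

With these, area_percent_of_float32(11) gives 21 and area_percent_of_float32(8) gives 11, matching the list.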

Many compilers, like GCC and ICC, have now also gained support for bfloat16

More information about bfloat16:

bfloat16 - Hardware Numerics Definition
Using bfloat16 with TensorFlow models
What is tf.bfloat16 "truncated 16-bit floating point"?

In cases where bfloat16 is not enough, there is also a newer 19-bit type called TensorFloat (NVIDIA's TF32)

Answer from phuclv on Stack Overflow

Re: Implementations: Someone has apparently written half for C, which would (of course) work in C++: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/cellperformance-snippets/half.c

Re: Why is float four bytes: Probably because below that, their precision is so limited. In IEEE-754, a "half" only has 11 bits of significand precision, yielding about 3.311 decimal digits of precision (vs. 24 bits in a single yielding between 6 and 9 decimal digits of precision, or 53 bits in a double yielding between 15 and 17 decimal digits of precision).
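The 3.311 figure is just the significand width times log10(2). A quick sketch of that calculation (the function name is only for illustration):

```c
/* Approximate decimal digits carried by a p-bit binary significand:
   p * log10(2). The constant is hard-coded to avoid linking libm. */
double decimal_digits(int significand_bits)
{
    const double LOG10_2 = 0.30102999566398120; /* log10(2) */
    return significand_bits * LOG10_2;
}
```

This gives about 3.31 for half (11 bits), 7.22 for single (24 bits) and 15.95 for double (53 bits), which sit inside the 6 to 9 and 15 to 17 digit ranges quoted above.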

GNU

gcc.gnu.org › onlinedocs › gcc › Half-Precision.html

Half-Precision (Using the GNU Compiler Collection (GCC))

For C++, x86 provides a builtin type named _Float16 which contains same data format as C.

Wikipedia

en.wikipedia.org › wiki › Half-precision_floating-point_format

Half-precision floating-point format - Wikipedia

Half precision (sometimes called FP16 or float16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular ...

GitHub

github.com › artyom-beilis › float16

GitHub - artyom-beilis/float16: half float library for C and for z80

Quora

quora.com › Can-half-precision-floats-be-used-in-C

Can half precision floats be used in C++? - Quora

C++23 has std::float16_t for half precision floats but it is implementation dependent. The language has support for them but they are an optional part of the language.

reddit.com › r/c_programming › half float c library?

r/C_Programming on Reddit: Half Float C library?

January 24, 2022 -

I’m looking for a half precision float / float16 (https://en.wikipedia.org/wiki/Half-precision_floating-point_format?wprov=sfti1) library for C (C99). I could only find C++ ones. Any recommendations?