TL;DR: 16-bit floats do exist and there are various software as well as hardware implementations

There are currently 2 common standard 16-bit float formats: IEEE-754 binary16 and Google's bfloat16. Since they're standardized, obviously anyone who knows the spec can write an implementation. Some examples:

  • https://github.com/ramenhut/half
  • https://github.com/minhhn2910/cuda-half2
  • https://github.com/tianshilei1992/half_precision
  • https://github.com/acgessler/half_float

Or if you don't want to use them, you can also design a different 16-bit float format and implement it


2-byte floats are generally not used, because even float's precision is often not enough for ordinary computations, and double should be the default unless you're limited by bandwidth or cache size. Floating-point literals are also double when written without a suffix in C and C-like languages. See

  • Why are double preferred over float?
  • Should I use double or float?
  • When do you use float and when do you use double

However, less-than-32-bit floats do exist. They're mainly used for storage, as in graphics where 96 bits per pixel (32 bits per channel × 3 channels) is far too wasteful, and are converted to a normal 32-bit float for calculations (except on some special hardware). Various 10-, 11-, and 14-bit float types exist in OpenGL. Many HDR formats use a 16-bit float for each channel, and Direct3D 9.0 as well as some GPUs like the Radeon R300 and R420 have a 24-bit float format. A 24-bit float is also supported by compilers for some 8-bit microcontrollers like PIC, where 32-bit float support is too costly. 8-bit or narrower float types are less useful, but due to their simplicity they're often taught in computer science curricula. Besides, a small float format is also used in ARM's instruction encoding for small floating-point immediates.

The IEEE 754-2008 revision officially added a 16-bit float format, a.k.a. binary16 or half-precision, with a 5-bit exponent and a 10-bit stored mantissa (11 bits of significand precision counting the implicit leading bit)

Some compilers have support for IEEE-754 binary16, but mainly for conversion or vectorized operations and not for computation (because it's not precise enough). For example, ARM's toolchain has __fp16, which comes in two variants: IEEE and alternative, depending on whether you want NaN/Inf representations or more range. GCC and Clang also support __fp16 along with the standardized name _Float16. See How to enable __fp16 type on gcc for x86_64

More recently, due to the rise of AI, another format called bfloat16 (brain floating-point format), which is simply the binary32 format truncated to its top 16 bits, has become common

The motivation behind the reduced mantissa comes from Google's experiments showing that it is fine to shrink the mantissa as long as it's still possible to represent tiny values close to zero during the summation of small differences in training. A smaller mantissa also brings other advantages such as reduced multiplier power and silicon area, which scale roughly with the square of the significand width:

  • float32: 24² = 576 (100%)
  • float16: 11² = 121 (21%)
  • bfloat16: 8² = 64 (11%)

Many compilers like GCC and ICC have now also gained support for bfloat16

More information about bfloat16:

  • bfloat16 - Hardware Numerics Definition
  • Using bfloat16 with TensorFlow models
  • What is tf.bfloat16 "truncated 16-bit floating point"?

In cases where bfloat16 is not enough there's also NVIDIA's newer TensorFloat-32 (TF32), which uses 19 significant bits (an 8-bit exponent with a 10-bit mantissa)

Answer from phuclv on Stack Overflow

2 of 10
19

Re: Implementations: Someone has apparently written half for C, which would (of course) work in C++: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/cellperformance-snippets/half.c

Re: Why is float four bytes: Probably because below that, their precision is so limited. In IEEE-754, a "half" only has 11 bits of significand precision, yielding about 3.311 decimal digits of precision (vs. 24 bits in a single yielding between 6 and 9 decimal digits of precision, or 53 bits in a double yielding between 15 and 17 decimal digits of precision).

Top answer
1 of 3
12

The exponent needs to be unbiased, clamped and rebiased. This is the fast code I use:

/* Fast float32 -> float16 bit conversion: fltInt32 holds the raw
   binary32 bit pattern. No rounding; see limitations below. */
unsigned short float32_to_float16(unsigned int fltInt32)
{
    /* sign bit, parked in bit 5 until the final shift */
    unsigned short fltInt16 = (fltInt32 >> 31) << 5;
    /* rebias the exponent from 127 to 15; the mask trick zeroes
       exponents that would underflow and keeps only 5 bits otherwise */
    unsigned short tmp = (fltInt32 >> 23) & 0xff;
    tmp = (tmp - 0x70) & ((unsigned int)((int)(0x70 - tmp) >> 4) >> 27);
    fltInt16 = (fltInt16 | tmp) << 10;
    /* truncate the mantissa to its top 10 bits */
    fltInt16 |= (fltInt32 >> 13) & 0x3ff;
    return fltInt16;
}

This code will be even faster with a lookup table for the exponent, but I use this one because it is easily adapted to a SIMD workflow.

Limitations of the implementation:

  • Overflowing values that cannot be represented in float16 will give undefined values.
  • Underflowing values will return an undefined value between 2^-15 and 2^-14 instead of zero.
  • Denormals will give undefined values.

Be careful with denormals. If your architecture uses them, they may slow down your program tremendously.

2 of 3
4

The exponents in your float32 and float16 representations are probably biased, and biased differently. You need to unbias the exponent you got from the float32 representation to get the actual exponent, and then to bias it for the float16 representation.

Apart from this detail, I do think it's as simple as that, but I still get surprised by floating-point representations from time to time.

EDIT:

  1. Check for overflow when doing the thing with the exponents while you're at it.

  2. Your algorithm truncates the last bits of the mantissa a little abruptly; that may be acceptable, but you may want to implement, say, round-to-nearest by looking at the bits that are about to be discarded: "0..." -> round down, "100..001..." -> round up, "100..00" -> round to even.

🌐
LLVM
reviews.llvm.org › D33719
⚙ D33719 Add _Float16 as a C/C++ source language type
May 31, 2017 - This adds _Float16 as a source language type, which is a 16-bit floating point type defined in C11 extension ISO/IEC TS 18661-3 · This enumerator is the same as CXType_Float128 above, is that intended
🌐
IBM
ibm.com › docs › en › zos › 2.4.0
C/C++ data type definitions
We cannot provide a description for this page right now
🌐
Cppreference
en.cppreference.com › w › cpp › types › floating-point.html
Fixed width floating-point types (since C++23) - cppreference.com
February 13, 2025 - If the implementation supports any of the following ISO 60559 types as an extended floating-point type, then: · The type std::bfloat16_t is known as Brain Floating-Point
🌐
LLVM Discussion Forums
discourse.llvm.org › clang frontend
[RFC] implementation of _Float16 - Clang Frontend - LLVM Discussion Forums
May 10, 2017 - Hi, ARMv8.2-A introduces as an optional extension half-precision data-processing instructions for Advanced SIMD and floating-point in both AArch64 and AArch32 states [1], and we are looking into implementing C/C+±langua…