16-bit computer number format
Half precision (sometimes called FP16 or float16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of …
Wikipedia
en.wikipedia.org › wiki › Half-precision_floating-point_format
Half-precision floating-point format - Wikipedia
2 days ago - Almost all modern uses follow the IEEE 754-2008 standard, where the 16-bit base-2 format is referred to as binary16, and the exponent uses 5 bits. This can express values in the range ±65,504, with the minimum value above 1 being 1 + 1/1024.
Wikipedia
en.wikipedia.org › wiki › Bfloat16_floating-point_format
bfloat16 floating-point format - Wikipedia
6 days ago - Bfloat16 is designed to maintain the number range from the 32-bit IEEE 754 single-precision floating-point format (binary32), while reducing the precision from 24 bits to 8 bits.
Top answer
1 of 7
125

For a given IEEE-754 floating point number X, if

2^E <= abs(X) < 2^(E+1)

then the distance from X to the next largest representable floating point number (epsilon) is:

epsilon = 2^(E-52)    % For a 64-bit float (double precision)
epsilon = 2^(E-23)    % For a 32-bit float (single precision)
epsilon = 2^(E-10)    % For a 16-bit float (half precision)

The above equations allow us to compute the following:

  • For half precision...

    If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^10. Any larger than this and the distance between floating point numbers is greater than 0.5.

    If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 1. Any larger than this and the distance between floating point numbers is greater than 0.0005.

  • For single precision...

    If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^23. Any larger than this and the distance between floating point numbers is greater than 0.5.

    If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 2^13. Any larger than this and the distance between floating point numbers is greater than 0.0005.

  • For double precision...

    If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^52. Any larger than this and the distance between floating point numbers is greater than 0.5.

    If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 2^42. Any larger than this and the distance between floating point numbers is greater than 0.0005.
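These spacings can be checked directly in Python: math.ulp (Python 3.9+) returns the double-precision epsilon, and the struct module's 'e' format is IEEE binary16, so it exposes half-precision rounding. A minimal sketch:

```python
import math
import struct

def half_round(x):
    """Round a Python float to the nearest IEEE 754 binary16 value
    (struct's 'e' format packs and unpacks half precision)."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Double precision: for X = 2^52, E = 52, so epsilon = 2^(52-52) = 1.0
print(math.ulp(2.0 ** 52))   # 1.0

# Half precision: for X in [2^11, 2^12), epsilon = 2^(11-10) = 2,
# so odd integers like 2049 are not representable
print(half_round(2048.0))    # 2048.0 (exactly representable)
print(half_round(2049.0))    # 2048.0 (ties round to the even neighbor)
```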

2 of 7
24

For floating-point integers (I'll give my answer in terms of IEEE double-precision), every integer between 1 and 2^53 is exactly representable. Beyond 2^53, integers that are exactly representable are spaced apart by increasing powers of two. For example:

  • Every 2nd integer between 2^53 + 2 and 2^54 can be represented exactly.
  • Every 4th integer between 2^54 + 4 and 2^55 can be represented exactly.
  • Every 8th integer between 2^55 + 8 and 2^56 can be represented exactly.
  • Every 16th integer between 2^56 + 16 and 2^57 can be represented exactly.
  • Every 32nd integer between 2^57 + 32 and 2^58 can be represented exactly.
  • Every 64th integer between 2^58 + 64 and 2^59 can be represented exactly.
  • Every 128th integer between 2^59 + 128 and 2^60 can be represented exactly.
  • Every 256th integer between 2^60 + 256 and 2^61 can be represented exactly.
  • Every 512th integer between 2^61 + 512 and 2^62 can be represented exactly, and so on.

Integers that are not exactly representable are rounded to the nearest representable integer, so the worst case rounding is 1/2 the spacing between representable integers.
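Python's float is an IEEE 754 double, so the 2^53 threshold and the every-2nd-integer spacing above can be verified directly (a quick sketch):

```python
# Beyond 2^53 only every 2nd integer survives conversion to double.
N = 2 ** 53
print(float(N - 1) == N - 1)  # True: still on the exact-integer grid
print(float(N + 1) == N)      # True: 2^53 + 1 rounds back down to 2^53
print(float(N + 2) == N + 2)  # True: every 2nd integer is representable
print(float(N + 3) == N + 4)  # True: ties round to the even neighbor
```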

Apache MXNet
mxnet.apache.org › versions › 1.9.1 › api › faq › float16
Float16 | Apache MXNet
The float16 data type is a 16 bit floating point representation according to the IEEE 754 standard. It has a dynamic range where the precision can go from 0.0000000596046 (highest, for values closest to 0) to 32 (lowest, for values in the range 32768-65536). Despite the inherent reduced precision ...
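The two endpoints quoted here (about 6e-8 near zero, a spacing of 32 in the top binade) can be reproduced with Python's stdlib alone, since struct's 'e' format is IEEE binary16 — a sketch independent of MXNet:

```python
import struct

def half_round(x):
    # Round-trip a float through IEEE 754 binary16 ('e' format)
    return struct.unpack('e', struct.pack('e', x))[0]

# In [32768, 65536) the spacing is 2^(15-10) = 32
print(half_round(32769.0))      # 32768.0 (step size is 32 here)
print(half_round(32800.0))      # 32800.0 (the next representable value)

# Near zero the smallest positive subnormal is 2^-24 ≈ 5.96e-8
print(half_round(2.0 ** -24) == 2.0 ** -24)   # True
```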
LinkedIn
linkedin.com › pulse › float32-vs-float16-bfloat16-damien-benveniste-av3oc
Float32 vs Float16 vs BFloat16?
July 19, 2024 - Float 32 can range between -3.4e^38 and 3.4e^38, the range of Float16 is between -6.55e^4 and 6.55e^4 (so a much smaller range!), and BFloat has the same range as Float32.
Cornell University
people.ece.cornell.edu › land › courses › ece4760 › RP2040 › C_SDK_floating_point › index_floating_point.html
16-bit floating point
The system implemented is 16-bit ... in limited precision, and therefore faster, floating point than in fixed point. The 16-bit floats have a dynamic range of 1e5 and resolution of 1e-4....
Medium
medium.com › @tushar.sharma0214 › demystifying-floating-point-numbers-in-swift-the-float16-deep-dive-for-beginners-f33a4b92db85
Demystifying Floating-Point Numbers in Swift — The Float16 Deep Dive for Beginners
July 8, 2025 - Smallest subnormal in Float16: Exponent bits = 00000 Mantissa = 0000000001 · This represents a number very close to 0: ≈ 0.0000000596 · 00001 to 11110 → Normal numbers (actual exponent = stored - 15 → range: -14 to +15) 00000 → Subnormals ...
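The bit patterns described in this snippet can be decoded with Python's struct module rather than Swift (a small sketch; '<H' is a 16-bit unsigned integer, '<e' an IEEE binary16):

```python
import struct

def bits_to_half(bits):
    # Reinterpret a raw 16-bit pattern as an IEEE 754 binary16 value
    return struct.unpack('<e', struct.pack('<H', bits))[0]

# Exponent 00000, mantissa 0000000001: the smallest positive subnormal
print(bits_to_half(0x0001) == 2.0 ** -24)   # True (≈ 0.0000000596)

# Exponent 00001 (actual exponent = 1 - 15 = -14): the smallest normal number
print(bits_to_half(0x0400) == 2.0 ** -14)   # True
```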
Densitylabs
densitylabs.io › blog › understanding-numerical-precision-in-deep-learning-float-32-float-16-and-b-float-16
Float 32, Float 16, and B Float 16 - Density Labs
July 19, 2024 - Float 32 can represent values from -3.4 x 10^38 to 3.4 x 10^38, while Float 16 ranges from -6.55 x 10^4 to 6.55 x 10^4. This difference makes converting from Float 32 to Float 16 challenging due to potential overflow errors.

APXML
apxml.com › courses › how-to-build-a-large-language-model › chapter-20-mixed-precision-training-techniques › introduction-floating-point-formats
Introduction to Floating-Point Formats (FP32, FP16, BF16)
FP16 range.

    import torch
    # Create an FP16 tensor
    fp16_tensor = torch.tensor([1.0, 2.0, 3.0]).half()  # or .to(torch.float16)
    print(f"Data type: {fp16_tensor.dtype}")
    print(f"Memory per element (bytes): {fp16_tensor.element_size()}")
    # Demonstrate range issue (underflow)
    small_val_fp32 = ...
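The underflow demo is cut off in the snippet; the same effect can be shown without torch, using struct's binary16 'e' format (a sketch, not the course's exact code):

```python
import struct

def to_half(x):
    # Cast a float to IEEE 754 binary16 and back
    return struct.unpack('e', struct.pack('e', x))[0]

# 1e-8 is below half of the smallest binary16 subnormal (2^-24 ≈ 5.96e-8),
# so it underflows to zero in FP16
small_val = 1e-8
print(to_half(small_val))   # 0.0
print(to_half(6e-8))        # rounds to the nearest subnormal, 2^-24
```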
Milania
milania.de › blog › Fundamentals_of_floating-point_numbers_and_mixed_precision_training
Fundamentals of floating-point numbers and mixed precision training - Milania's Blog
October 3, 2024 - Subnormal numbers follow a special definition to represent small values close to 0.0 which are outside the range of normal numbers. Floating-point numbers are defined according to the IEEE 754 standard. With the help of the sign bit S, we can make our number either positive or negative: (S)_2 = 0 represents positive and (S)_2 = 1 represents negative values. Using float16 as an example, the bits b_14, b_13, ..., b_10 of the exponent E represent a positive integer binary number (E)_2 = sum_{i=0}^{e-1} b_{10+i} * 2^i
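The field layout this snippet describes (sign bit b_15, exponent bits b_14..b_10, mantissa bits b_9..b_0) can be extracted in Python — a sketch using struct's 'e' and 'H' formats:

```python
import struct

def decode_half(x):
    """Split an IEEE 754 binary16 value into its sign, exponent,
    and mantissa bit fields."""
    (bits,) = struct.unpack('<H', struct.pack('<e', x))
    S = bits >> 15              # sign bit b_15
    E = (bits >> 10) & 0x1F     # exponent bits b_14..b_10
    M = bits & 0x3FF            # mantissa bits b_9..b_0
    return S, E, M

# 1.5 = +1.1_2 * 2^0: stored exponent 0 + 15 = 15, mantissa 1000000000_2
print(decode_half(1.5))    # (0, 15, 512)
# -2.0 = -1.0_2 * 2^1: sign 1, stored exponent 1 + 15 = 16, mantissa 0
print(decode_half(-2.0))   # (1, 16, 0)
```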
Medium
moocaholic.medium.com › fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407
FP64, FP32, FP16, BFLOAT16, TF32, and other members of the ZOO | by Grigory Sapunov | Medium
May 17, 2020 - TF32 uses the same 10-bit mantissa as the half-precision (FP16) math, shown to have more than sufficient margin for the precision requirements of AI workloads. And TF32 adopts the same 8-bit exponent as FP32 so it can support the same numeric range.
Medium
medium.com › @akp83540 › half-precision-floating-point-number-251e4b3cd1fe
Half Precision Floating Point Number | by Abhishek Kumar Pandey | Medium
June 7, 2024 - Imagine you have a small box to write down numbers, but you want to write really big or really small numbers. With half precision, you can do that using only 16 bits, which are like tiny pieces of information.
ResearchGate
researchgate.net › figure › Decimal-precision-of-Float16-and-Float32-over-the-range-of-representable-numbers-The_fig1_358204910
Decimal precision of Float16 and Float32 over the range of... | Download Scientific Diagram
Download scientific diagram | Decimal precision of Float16 and Float32 over the range of representable numbers. The decimal precision is worst‐case, that is, given in terms of decimal places that are at least correct after rounding (see Appendix A). The smallest representable number minpos, ...